Author: Dong Hao (if you repost this, please keep my name and the blog link http://oldblog.donghao.org/uii/, thanks!)
We ran into a problem on a project. Two machines had a TCP connection between them over sockets, with heavy traffic flowing in both directions. We then cut the network by configuring a 100% packet-loss rate on the router; naturally the sockets could neither send nor receive, and retransmissions piled up. After that we removed the router setting and restored the network. The result: traffic from client to server went back to normal, but the server-to-client direction stayed dead. No matter how hard you called send, the return value was 0 and errno was EAGAIN.
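For context, the sending side was set up roughly like this. This is only a minimal sketch of the scenario (TCP_NODELAY plus a non-blocking socket and a tight send loop); the helper names are made up for illustration and are not from our real code:

#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Set the two socket options that matter for this story. */
static int setup_conn(int fd)
{
        int one = 1;
        int fl;

        if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) < 0)
                return -1;                      /* disable Nagle */

        fl = fcntl(fd, F_GETFL, 0);
        if (fl < 0)
                return -1;
        return fcntl(fd, F_SETFL, fl | O_NONBLOCK);
}

/* Keep pushing data; after the outage the EAGAIN branch was hit forever. */
static void send_loop(int fd, const char *buf, size_t len)
{
        for (;;) {
                ssize_t n = send(fd, buf, len, 0);

                if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
                        usleep(1000);           /* sndbuf full, try again */
                        continue;
                }
                if (n < 0) {
                        perror("send");
                        break;
                }
                /* normally: advance to the next chunk of data here */
        }
}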
I looked at the packets with tcpdump at that point (tc2 is the server, tc1 is the client):
12:08:21.020291 IP tc1.corp.com.42171 > tc2.corp.com.3003: S 4009389430:4009389430(0) win 5840
12:08:21.020571 IP tc2.corp.com.3003 > tc1.corp.com.42171: R 0:0(0) ack 4009389431 win 0
12:08:38.934329 IP tc2.corp.com.3903 > tc1.corp.com.3904: P 2398055392:2398056153(761) ack 2538876742 win 724
12:08:38.934519 IP tc1.corp.com.3904 > tc2.corp.com.3903: . ack 2165 win 13756
12:08:39.958457 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 1:763(762) ack 2165 win 13756
12:08:39.958485 IP tc2.corp.com.3903 > tc1.corp.com.3904: . ack 763 win 1448
12:08:39.958653 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 763:881(118) ack 2165 win 13756
12:08:39.958660 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 881:997(116) ack 2165 win 13756
12:08:39.958719 IP tc2.corp.com.3903 > tc1.corp.com.3904: . ack 997 win 1448
12:08:39.958890 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 997:1114(117) ack 2165 win 13756
12:08:39.958898 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 1114:1232(118) ack 2165 win 13756
12:08:39.958903 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 1232:1349(117) ack 2165 win 13756
12:08:39.958971 IP tc2.corp.com.3903 > tc1.corp.com.3904: . ack 1349 win 1448
12:08:39.959141 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 1349:1466(117) ack 2165 win 13756
12:08:39.959149 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 1466:1583(117) ack 2165 win 13756
12:08:39.959154 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 1583:1700(117) ack 2165 win 13756
12:08:39.959222 IP tc2.corp.com.3903 > tc1.corp.com.3904: . ack 1700 win 1448
tc2 never sends its own data; it just keeps ACKing whatever tc1 sends. Wait half an hour and it is still the same. Why won't it send?
It finally turned out to be because we had set TCP_NODELAY on the socket. Remove that option, restart the programs, and after the same disconnect-and-recover exercise TCP works normally in both directions again. Watching with tcpdump this time:
16:05:38.782427 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: P 0:887(887) ack 1 win 26064
16:05:38.782619 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 3783 win 25352
16:05:38.782634 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 3783:5231(1448) ack 1 win 26064
16:05:38.782637 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 5231:6679(1448) ack 1 win 26064
16:05:38.782890 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 5231 win 25352
16:05:38.782896 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 6679:8127(1448) ack 1 win 26064
16:05:38.782898 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 8127:9575(1448) ack 1 win 26064
16:05:38.782901 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 6679 win 25352
16:05:38.782904 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 9575:11023(1448) ack 1 win 26064
16:05:38.783183 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 8127 win 25352
16:05:38.783188 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 11023:12471(1448) ack 1 win 26064
16:05:38.783191 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 9575 win 25352
16:05:38.783193 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 12471:13919(1448) ack 1 win 26064
16:05:38.783196 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 11023 win 25352
16:05:38.783199 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 13919:15367(1448) ack 1 win 26064
16:05:38.783201 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 15367:16815(1448) ack 1 win 26064
16:05:38.783502 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 12471 win 25352
16:05:38.783506 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 16815:18263(1448) ack 1 win 26064
16:05:38.783509 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 13919 win 25352
16:05:38.783512 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 18263:19711(1448) ack 1 win 26064
16:05:38.783514 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 15367 win 25352
16:05:38.783517 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 19711:21159(1448) ack 1 win 26064
16:05:38.783519 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 16815 win 25352
This time tc2 does send its own data and tc1 ACKs it; after a while tc1 starts sending as well, and in the end both directions are back to normal.
So why can't a socket with TCP_NODELAY recover once the network comes back?
Let's look at how the send path is implemented (2.6.9 kernel), tracing the send system call all the way down to tcp_sendmsg:
[net/ipv4/tcp.c --> tcp_sendmsg]
813 while (--iovlen >= 0) {
814 int seglen = iov->iov_len;
815 unsigned char __user *from = iov->iov_base;
816
817 iov++;
818
819 while (seglen > 0) {
820 int copy;
821
822 skb = sk->sk_write_queue.prev;
823
824 if (!sk->sk_send_head ||
825 (copy = mss_now - skb->len) <= 0) {
826
827 new_segment:
828 /* Allocate new segment. If the interface is SG,
829 * allocate skb fitting to single page.
830 */
831 if (!sk_stream_memory_free(sk))
832 goto wait_for_sndbuf;
833
834 skb = sk_stream_alloc_pskb(sk, select_size(sk, tp),
835 0, sk->sk_allocation);
836 if (!skb)
837 goto wait_for_memory;
Line 831 checks whether there is still free space in sndbuf; if there is none, we jump to wait_for_sndbuf.
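For reference, that check is a one-line comparison against the socket's send-buffer limit. Quoted from memory of the 2.6.x series, so treat it as a sketch:

[include/net/sock.h --> sk_stream_memory_free]
static inline int sk_stream_memory_free(struct sock *sk)
{
        /* room left in sndbuf? */
        return sk->sk_wmem_queued < sk->sk_sndbuf;
}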
[net/ipv4/tcp.c --> tcp_sendmsg]
958 wait_for_sndbuf:
959 set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
960 wait_for_memory:
961 if (copied)
962 tcp_push(sk, tp, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
963
964 if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
965 goto do_error;
966
967 mss_now = tcp_current_mss(sk, !(flags&MSG_OOB));
968 }
969 }
970
971 out:
972 if (copied)
973 tcp_push(sk, tp, flags, mss_now, tp->nonagle);
974 TCP_CHECK_TIMER(sk);
975 release_sock(sk);
976 return copied;
977
978 do_fault:
979 if (!skb->len) {
980 if (sk->sk_send_head == skb)
981 sk->sk_send_head = NULL;
982 __skb_unlink(skb, skb->list);
983 sk_stream_free_skb(sk, skb);
984 }
985
986 do_error:
987 if (copied)
988 goto out;
989 out_err:
990 err = sk_stream_error(sk, flags, err);
991 TCP_CHECK_TIMER(sk);
992 release_sock(sk);
993 return err;
sndbuf is full, so a flag bit (SOCK_NOSPACE) gets set. The check at line 961 fails, because nothing has been copied yet and copied is 0. Execution continues into sk_stream_wait_memory, which, as the name suggests, waits for sndbuf space to become available. But our socket is non-blocking, so sk_stream_wait_memory returns almost immediately with -EAGAIN. That sends us to do_error; the check at line 987 fails for the same reason, so we fall through to out_err and finally leave tcp_sendmsg carrying -EAGAIN.
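Why does it return so quickly? Our socket is non-blocking, so O_NONBLOCK is turned into MSG_DONTWAIT on the way down and the timeout tcp_sendmsg computes via sock_sndtimeo() is 0; with a zero timeout there is nothing to wait for. A simplified paraphrase of what the wait boils down to (this is not the real kernel code, just the shape of it):

/* paraphrase of sk_stream_wait_memory() for our case */
static int wait_for_sndbuf_space(struct sock *sk, long *timeo_p)
{
        if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
                return -EPIPE;          /* connection is already dead */
        if (!*timeo_p)
                return -EAGAIN;         /* non-blocking: do not sleep at all */
        /* blocking case: sleep until ACKs free sndbuf space or the timeout expires */
        return 0;
}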
That is why our repeated send calls kept returning 0 with errno set to EAGAIN.
If everything were normal, the socket would keep pushing data out and sooner or later sndbuf would have free space again. But what about the abnormal case? Say TCP_NODELAY is set and the network suddenly goes down: a burst of packets gets sent out in an instant, and the peer never ACKs any of them.
Now let's see what tcp_sendmsg does in the normal case. The jump at line 832 does not happen, so execution continues (part of the skb manipulation code is omitted):
[net/ipv4/tcp.c --> tcp_sendmsg]
936 if (!copied)
937 TCP_SKB_CB(skb)->flags &= ~TCPCB_FLAG_PSH;
938
939 tp->write_seq += copy;
940 TCP_SKB_CB(skb)->end_seq += copy;
941 skb_shinfo(skb)->tso_segs = 0;
942
943 from += copy;
944 copied += copy;
945 if ((seglen -= copy) == 0 && iovlen == 0)
946 goto out;
If the whole message fits into the skb in one go and the iovec has been consumed, the check at line 945 holds and we jump straight to out, which runs tcp_push. tcp_push in turn calls __tcp_push_pending_frames:
[include/net/tcp.h --> __tcp_push_pending_frames]
1508 static __inline__ void __tcp_push_pending_frames(struct sock *sk,
1509 struct tcp_opt *tp,
1510 unsigned cur_mss,
1511 int nonagle)
1512 {
1513 struct sk_buff *skb = sk->sk_send_head;
1514
1515 if (skb) {
1516 if (!tcp_skb_is_last(sk, skb))
1517 nonagle = TCP_NAGLE_PUSH;
1518 if (!tcp_snd_test(tp, skb, cur_mss, nonagle) ||
1519 tcp_write_xmit(sk, nonagle))
1520 tcp_check_probe_timer(sk, tp);
1521 }
1522 tcp_cwnd_validate(sk, tp);
1523 }
The "||" at line 1518 is subtle: thanks to short-circuit evaluation, tcp_write_xmit only runs when tcp_snd_test returns 1, so tcp_snd_test is the gatekeeper we need to look at first (an equivalent restatement of the branch is sketched below).
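Rewritten without the negation, lines 1518-1520 are equivalent to this; it is just a restatement for readability, not kernel code:

        if (tcp_snd_test(tp, skb, cur_mss, nonagle)) {
                /* allowed to send: try to flush the write queue */
                if (tcp_write_xmit(sk, nonagle))
                        tcp_check_probe_timer(sk, tp);  /* xmit made no progress */
        } else {
                /* not allowed to send anything: fall back to the probe timer */
                tcp_check_probe_timer(sk, tp);
        }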
[include/net/tcp.h --> tcp_snd_test]
1452 static __inline__ int tcp_snd_test(struct tcp_opt *tp, struct sk_buff *skb,
1453 unsigned cur_mss, int nonagle)
1454 {
1455 int pkts = tcp_skb_pcount(skb);
1456
1457 if (!pkts) {
1458 tcp_set_skb_tso_segs(skb, tp->mss_cache_std);
1459 pkts = tcp_skb_pcount(skb);
1460 }
1461
1462 /* RFC 1122 - section 4.2.3.4
1463 *
1464 * We must queue if
1465 *
1466 * a) The right edge of this frame exceeds the window
1467 * b) There are packets in flight and we have a small segment
1468 * [SWS avoidance and Nagle algorithm]
1469 * (part of SWS is done on packetization)
1470 * Minshall version sounds: there are no _small_
1471 * segments in flight. (tcp_nagle_check)
1472 * c) We have too many packets 'in flight'
1473 *
1474 * Don't use the nagle rule for urgent data (or
1475 * for the final FIN -DaveM).
1476 *
1477 * Also, Nagle rule does not apply to frames, which
1478 * sit in the middle of queue (they have no chances
1479 * to get new data) and if room at tail of skb is
1480 * not enough to save something seriously (<32 for now).
1481 */
1482
1483 /* Don't be strict about the congestion window for the
1484 * final FIN frame. -DaveM
1485 */
1486 return (((nonagle&TCP_NAGLE_PUSH) || tp->urg_mode
1487 || !tcp_nagle_check(tp, skb, cur_mss, nonagle)) &&
1488 (((tcp_packets_in_flight(tp) + (pkts-1)) < tp->snd_cwnd) ||
1489 (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN)) &&
1490 !after(TCP_SKB_CB(skb)->end_seq, tp->snd_una + tp->snd_wnd));
1491 }
This function has more comment than code. The complicated condition after return can be taken apart: it is really the AND of three conditions. The one we care about is the second:
(((tcp_packets_in_flight(tp) + (pkts-1)) < tp->snd_cwnd) || (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN))
Here tcp_packets_in_flight is the number of packets currently "in flight", i.e. still out on the network. It is computed as:
packets sent once + packets retransmitted - packets considered to have already left the network (SACKed or lost)
TCPCB_FLAG_FIN marks the final FIN, i.e. one side has finished sending. That never applies to our project; our data stream never ends.
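For completeness, the in-flight count in that kernel series is computed more or less like this (quoted from memory, so treat it as a sketch; the real code uses a tcp_left_out() helper for the middle term):

[include/net/tcp.h --> tcp_packets_in_flight]
static __inline__ unsigned int tcp_packets_in_flight(struct tcp_opt *tp)
{
        /* sent once - (SACKed + lost, i.e. no longer on the wire) + retransmitted */
        return tp->packets_out - (tp->sacked_out + tp->lost_out) + tp->retrans_out;
}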
Now the picture is clear:
With NODELAY set, the Nagle algorithm is disabled and lots of small packets get sent out (look at the first tcpdump trace above). When the network suddenly drops, there are many packets in flight, so many that they exceed snd_cwnd, the congestion window. tcp_snd_test therefore returns 0 and no real transmission happens. Because the queued data is never sent, the space in sndbuf never gets freed, and tcp_sendmsg keeps failing with -EAGAIN. A vicious cycle.
Someone will ask: if sndbuf has no space left, how do the ACKs get out? The answer is that sending an ACK needs no sndbuf space at all; it is simply thrown onto the wire. When the socket receives data, tcp_recvmsg is called; after copying the data it cleans up the receive buffer via cleanup_rbuf, and cleanup_rbuf is where the ACK gets sent.
What tcp_write_xmit does is loop over the send queue, pulling skbs off it and handing them to tcp_transmit_skb to put on the wire. An ACK, by contrast, is handed to tcp_transmit_skb directly: it never goes through the send queue, so it is not limited by sndbuf space.
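To make the contrast concrete, the ACK path in that kernel series looks roughly like this. It is heavily trimmed and quoted from memory, so treat it as a sketch rather than a verbatim excerpt:

[net/ipv4/tcp_output.c --> tcp_send_ack] (trimmed sketch)
void tcp_send_ack(struct sock *sk)
{
        /* the ACK lives on its own freshly allocated skb: it is never put on
         * sk_write_queue and is never charged against sk_sndbuf */
        struct sk_buff *buff = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);

        if (buff == NULL) {
                tcp_send_delayed_ack(sk);       /* no memory: let the timer retry */
                return;
        }
        skb_reserve(buff, MAX_TCP_HEADER);
        TCP_SKB_CB(buff)->flags = TCPCB_FLAG_ACK;
        tcp_transmit_skb(sk, buff);             /* straight onto the wire */
}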
Some will also ask: isn't this a bug in the Linux TCP stack? I think it may well be, because in the 2.6.32 kernel the tcp_snd_test function is gone (in fact it disappeared as early as 2.6.13, so RHEL 5 users can breathe easy), and the awkward "||" in __tcp_push_pending_frames has been removed as well: it now calls tcp_write_xmit directly, and tcp_write_xmit itself checks the windows and the Nagle rule to decide whether each packet should go out. The logic is cleaner and the bug is avoided.