[linux-kernel] How well a TCP connection recovers after a network outage

Author: Dong Hao (if you repost this, please keep the name and the blog link http://oldblog.donghao.org/uii/ -- thanks!)

I ran into a problem on a project. Two machines hold a TCP connection over a socket and communicate in both directions under heavy traffic. We then cut the network by configuring a 100% drop rate on the router; the socket, of course, can neither send nor receive, and massive retransmission sets in. Then we remove the router setting and restore the network. The result: traffic from client to server returns to normal, but server to client stays dead -- no matter how hard you call send(), nothing goes out and errno is set to EAGAIN.
Here is what the packet flow looked like in tcpdump at that point (tc2 is the server, tc1 the client):

  12:08:21.020291 IP tc1.corp.com.42171 > tc2.corp.com.3003: S 4009389430:4009389430(0) win 5840
  12:08:21.020571 IP tc2.corp.com.3003 > tc1.corp.com.42171: R 0:0(0) ack 4009389431 win 0
  12:08:38.934329 IP tc2.corp.com.3903 > tc1.corp.com.3904: P 2398055392:2398056153(761) ack 2538876742 win 724
  12:08:38.934519 IP tc1.corp.com.3904 > tc2.corp.com.3903: . ack 2165 win 13756
  12:08:39.958457 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 1:763(762) ack 2165 win 13756
  12:08:39.958485 IP tc2.corp.com.3903 > tc1.corp.com.3904: . ack 763 win 1448
  12:08:39.958653 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 763:881(118) ack 2165 win 13756
  12:08:39.958660 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 881:997(116) ack 2165 win 13756
  12:08:39.958719 IP tc2.corp.com.3903 > tc1.corp.com.3904: . ack 997 win 1448
  12:08:39.958890 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 997:1114(117) ack 2165 win 13756
  12:08:39.958898 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 1114:1232(118) ack 2165 win 13756
  12:08:39.958903 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 1232:1349(117) ack 2165 win 13756
  12:08:39.958971 IP tc2.corp.com.3903 > tc1.corp.com.3904: . ack 1349 win 1448
  12:08:39.959141 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 1349:1466(117) ack 2165 win 13756
  12:08:39.959149 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 1466:1583(117) ack 2165 win 13756
  12:08:39.959154 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 1583:1700(117) ack 2165 win 13756
  12:08:39.959222 IP tc2.corp.com.3903 > tc1.corp.com.3904: . ack 1700 win 1448
 

tc2 never sends its own data; it only keeps ACKing what arrives from tc1. We waited half an hour and nothing changed. Why won't it send?

It finally turned out to be because we had set TCP_NODELAY on the socket. Remove that option, restart the programs, and after the same outage-and-recovery the connection works in both directions again. Again in tcpdump:

  16:05:38.782427 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: P 0:887(887) ack 1 win 26064
  16:05:38.782619 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 3783 win 25352
  16:05:38.782634 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 3783:5231(1448) ack 1 win 26064
  16:05:38.782637 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 5231:6679(1448) ack 1 win 26064
  16:05:38.782890 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 5231 win 25352 
  16:05:38.782896 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 6679:8127(1448) ack 1 win 26064
  16:05:38.782898 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 8127:9575(1448) ack 1 win 26064
  16:05:38.782901 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 6679 win 25352 
  16:05:38.782904 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 9575:11023(1448) ack 1 win 26064
  16:05:38.783183 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 8127 win 25352
  16:05:38.783188 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 11023:12471(1448) ack 1 win 26064
  16:05:38.783191 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 9575 win 25352
  16:05:38.783193 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 12471:13919(1448) ack 1 win 26064
  16:05:38.783196 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 11023 win 25352
  16:05:38.783199 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 13919:15367(1448) ack 1 win 26064
  16:05:38.783201 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 15367:16815(1448) ack 1 win 26064
  16:05:38.783502 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 12471 win 25352
  16:05:38.783506 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 16815:18263(1448) ack 1 win 26064
  16:05:38.783509 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 13919 win 25352
  16:05:38.783512 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 18263:19711(1448) ack 1 win 26064
  16:05:38.783514 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 15367 win 25352
  16:05:38.783517 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 19711:21159(1448) ack 1 win 26064
  16:05:38.783519 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 16815 win 25352

This time tc2 pushes out its own data stream and tc1 ACKs it; after a while tc1 starts sending as well, and eventually both directions are back to normal.

So why can't a socket with TCP_NODELAY recover once the network comes back?
Let's look at the implementation of the send system call (2.6.9 kernel), tracing down into tcp_sendmsg:

[net/ipv4/tcp.c --> tcp_sendmsg]
   813     while (--iovlen >= 0) {
   814         int seglen = iov->iov_len;
   815         unsigned char __user *from = iov->iov_base;
   816
   817         iov++;
   818
   819         while (seglen > 0) {
   820             int copy;
   821
   822             skb = sk->sk_write_queue.prev;
   823
   824             if (!sk->sk_send_head ||
   825                 (copy = mss_now - skb->len) <= 0) {
   826
   827 new_segment:
   828                 /* Allocate new segment. If the interface is SG,
   829                  * allocate skb fitting to single page.
   830                  */
   831                 if (!sk_stream_memory_free(sk))
   832                     goto wait_for_sndbuf;
   833
   834                 skb = sk_stream_alloc_pskb(sk, select_size(sk, tp),
   835                                0, sk->sk_allocation);
   836                 if (!skb)
   837                     goto wait_for_memory;

Line 831 checks whether there is still room in the send buffer (sndbuf); if not, we jump to wait_for_sndbuf:

[net/ipv4/tcp.c --> tcp_sendmsg]
   958 wait_for_sndbuf:
   959             set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
   960 wait_for_memory:
   961             if (copied)
   962                 tcp_push(sk, tp, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
   963
   964             if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
   965                 goto do_error;
   966
   967             mss_now = tcp_current_mss(sk, !(flags&MSG_OOB));
   968         }
   969     }
   970
   971 out:
   972     if (copied)
   973         tcp_push(sk, tp, flags, mss_now, tp->nonagle);
   974     TCP_CHECK_TIMER(sk);
   975     release_sock(sk);
   976     return copied;
   977
   978 do_fault:
   979     if (!skb->len) {
   980         if (sk->sk_send_head == skb)
   981             sk->sk_send_head = NULL;
   982         __skb_unlink(skb, skb->list);
   983         sk_stream_free_skb(sk, skb);
   984     }
   985
   986 do_error:
   987     if (copied)
   988         goto out;
   989 out_err:
   990     err = sk_stream_error(sk, flags, err);
   991     TCP_CHECK_TIMER(sk);
   992     release_sock(sk);
   993     return err;

With no room in sndbuf, a flag bit is set at line 959. The check at line 961 does not fire, because nothing has been sent yet and copied is 0. Execution continues into sk_stream_wait_memory which, as the name suggests, waits for free space in sndbuf -- but our socket is non-blocking, so sk_stream_wait_memory returns almost immediately with a return value of -EAGAIN. We then jump to do_error; the check at line 987 does not fire either, we fall through to out_err, and tcp_sendmsg finally returns with -EAGAIN.
That is why our repeated send() calls keep failing with errno EAGAIN.
If everything is normal, the socket keeps pushing data out, so sooner or later sndbuf frees up space. But what about the abnormal case? Say TCP_NODELAY is set and the network suddenly dies: an instant burst of small packets goes out, and the peer ACKs none of them.
Now let's look at what tcp_sendmsg does in the normal case: the jump at line 832 never happens, and execution continues (some skb-manipulation code omitted):

[net/ipv4/tcp.c --> tcp_sendmsg]
   936             if (!copied)
   937                 TCP_SKB_CB(skb)->flags &= ~TCPCB_FLAG_PSH;
   938
   939             tp->write_seq += copy;
   940             TCP_SKB_CB(skb)->end_seq += copy;
   941             skb_shinfo(skb)->tso_segs = 0;
   942
   943             from += copy;
   944             copied += copy;
   945             if ((seglen -= copy) == 0 && iovlen == 0)
   946                 goto out;

If the whole message fits into the skb in one pass and the iovec is also exhausted, the check at line 945 fires and we jump straight to out, which runs tcp_push. tcp_push in turn calls __tcp_push_pending_frames:

[net/ipv4/tcp.h --> __tcp_push_pending_frames]
  1508 static __inline__ void __tcp_push_pending_frames(struct sock *sk,
  1509                          struct tcp_opt *tp,
  1510                          unsigned cur_mss,
  1511                          int nonagle)
  1512 {          
  1513     struct sk_buff *skb = sk->sk_send_head;
  1514    
  1515     if (skb) {
  1516         if (!tcp_skb_is_last(sk, skb))
  1517             nonagle = TCP_NAGLE_PUSH;
  1518         if (!tcp_snd_test(tp, skb, cur_mss, nonagle) ||
  1519             tcp_write_xmit(sk, nonagle))
  1520             tcp_check_probe_timer(sk, tp);
  1521     }
  1522     tcp_cwnd_validate(sk, tp);
  1523 }

The "||" at line 1518 is subtle: thanks to short-circuit evaluation, tcp_write_xmit runs only when tcp_snd_test returns nonzero; otherwise only the probe timer is armed. So the first question is what tcp_snd_test returns.

[net/ipv4/tcp.h --> tcp_snd_test]
  1452 static __inline__ int tcp_snd_test(struct tcp_opt *tp, struct sk_buff *skb,
  1453                    unsigned cur_mss, int nonagle)
  1454 {
  1455     int pkts = tcp_skb_pcount(skb);
  1456
  1457     if (!pkts) {
  1458         tcp_set_skb_tso_segs(skb, tp->mss_cache_std);
  1459         pkts = tcp_skb_pcount(skb);
  1460     }
  1461
  1462     /*  RFC 1122 - section 4.2.3.4
  1463      *
  1464      *  We must queue if
  1465      *
  1466      *  a) The right edge of this frame exceeds the window
  1467      *  b) There are packets in flight and we have a small segment
  1468      *     [SWS avoidance and Nagle algorithm]
  1469      *     (part of SWS is done on packetization)
  1470      *     Minshall version sounds: there are no _small_
  1471      *     segments in flight. (tcp_nagle_check)
  1472      *  c) We have too many packets 'in flight'
  1473      *
  1474      *  Don't use the nagle rule for urgent data (or
  1475      *  for the final FIN -DaveM).
  1476      *
  1477      *  Also, Nagle rule does not apply to frames, which
  1478      *  sit in the middle of queue (they have no chances
  1479      *  to get new data) and if room at tail of skb is
  1480      *  not enough to save something seriously (<32 for now).
  1481      */
  1482
  1483     /* Don't be strict about the congestion window for the
  1484      * final FIN frame.  -DaveM
  1485      */
  1486     return (((nonagle&TCP_NAGLE_PUSH) || tp->urg_mode
  1487          || !tcp_nagle_check(tp, skb, cur_mss, nonagle)) &&
  1488         (((tcp_packets_in_flight(tp) + (pkts-1)) < tp->snd_cwnd) ||
  1489          (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN)) &&
  1490         !after(TCP_SKB_CB(skb)->end_seq, tp->snd_una + tp->snd_wnd));
  1491 }

This function carries more comment than code. The big condition after return can be unpacked: it is really an AND of three conditions. Look at the second one:
(((tcp_packets_in_flight(tp) + (pkts-1)) < tp->snd_cwnd) || (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN))
Here tcp_packets_in_flight is the number of packets "in flight", i.e. currently out on the network, computed as:

        packets sent out + packets retransmitted − packets known to have left the network

TCPCB_FLAG_FIN marks the segment that ends one side's data; that never applies in our project -- our stream never ends.
Now it's clear:
With TCP_NODELAY set, the Nagle algorithm is off and a stream of small packets goes out (look at the first tcpdump trace above). When the network is suddenly cut, a lot of packets are in flight -- so many that in_flight exceeds snd_cwnd, the congestion window. tcp_snd_test therefore returns 0, and no actual transmission happens. If the queued data is never sent, no space in sndbuf ever frees up, so tcp_sendmsg keeps failing with -EAGAIN. A vicious circle.

Someone will ask: if sndbuf has no space left, how do the ACKs get out? The answer: sending an ACK needs no sndbuf space at all; it is simply fired off. When the socket receives data, tcp_recvmsg is called; after the data has been copied out, it cleans the read buffer via cleanup_rbuf, and cleanup_rbuf is where the ACK is sent.
What tcp_write_xmit does is loop over the send queue, pulling skbs and handing them to tcp_transmit_skb to put on the wire. An ACK, by contrast, calls tcp_transmit_skb directly: it never touches the send queue, and so is unaffected by sndbuf space.

Others may ask: isn't this a bug in the Linux TCP stack? I think it may well be, because in the 2.6.32 kernel the tcp_snd_test function is gone (in fact it disappeared as of 2.6.13, so RHEL 5 users can relax), and the awkward "||" in __tcp_push_pending_frames was removed as well: it now calls tcp_write_xmit directly, and the window and Nagle checks are done inside tcp_write_xmit to decide whether a packet should be sent. The logic is cleaner, and the bug is avoided.

2 Comments

hoterran said:

I should go through Volume 2 carefully too~~~

DongHao (author) said:

I haven't read Volume 2 either; I need to study it properly too....

About this entry

This page contains a single entry by DongHao published on 01 14, 2010 3:51 PM.
