Operating Systems: January 2010 Archive

Author: Dong Hao (if you repost, please keep the name and the blog link http://oldblog.donghao.org/uii/ with it; thanks!)

I ran into a problem on a project. Two machines establish a TCP connection over sockets and communicate in both directions with heavy traffic. We then cut the network by configuring a 100% packet-loss rate on the router; naturally the socket can neither send nor receive, and a flood of retransmissions follows. Then we remove the router setting and restore the network. The result: client-to-server traffic on the TCP connection returns to normal, but server-to-client does not. No matter how hard you send(), the return value is 0 and errno is EAGAIN.
I looked at the packet trace with tcpdump at this point (tc2 is the server, tc1 is the client):

  12:08:21.020291 IP tc1.corp.com.42171 > tc2.corp.com.3003: S 4009389430:4009389430(0) win 5840
  12:08:21.020571 IP tc2.corp.com.3003 > tc1.corp.com.42171: R 0:0(0) ack 4009389431 win 0
  12:08:38.934329 IP tc2.corp.com.3903 > tc1.corp.com.3904: P 2398055392:2398056153(761) ack 2538876742 win 724
  12:08:38.934519 IP tc1.corp.com.3904 > tc2.corp.com.3903: . ack 2165 win 13756
  12:08:39.958457 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 1:763(762) ack 2165 win 13756
  12:08:39.958485 IP tc2.corp.com.3903 > tc1.corp.com.3904: . ack 763 win 1448
  12:08:39.958653 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 763:881(118) ack 2165 win 13756
  12:08:39.958660 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 881:997(116) ack 2165 win 13756
  12:08:39.958719 IP tc2.corp.com.3903 > tc1.corp.com.3904: . ack 997 win 1448
  12:08:39.958890 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 997:1114(117) ack 2165 win 13756
  12:08:39.958898 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 1114:1232(118) ack 2165 win 13756
  12:08:39.958903 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 1232:1349(117) ack 2165 win 13756
  12:08:39.958971 IP tc2.corp.com.3903 > tc1.corp.com.3904: . ack 1349 win 1448
  12:08:39.959141 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 1349:1466(117) ack 2165 win 13756
  12:08:39.959149 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 1466:1583(117) ack 2165 win 13756
  12:08:39.959154 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 1583:1700(117) ack 2165 win 13756
  12:08:39.959222 IP tc2.corp.com.3903 > tc1.corp.com.3904: . ack 1700 win 1448
 

tc2 sends none of its own data; it only ACKs the data arriving from tc1. Wait half an hour and it is still the same. Why won't it send?

It finally turned out to be because we had set TCP_NODELAY on the socket. Remove that setting, restart the programs, and after the network outage and recovery TCP works normally in both directions. Again observed with tcpdump:

  16:05:38.782427 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: P 0:887(887) ack 1 win 26064
  16:05:38.782619 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 3783 win 25352
  16:05:38.782634 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 3783:5231(1448) ack 1 win 26064
  16:05:38.782637 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 5231:6679(1448) ack 1 win 26064
  16:05:38.782890 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 5231 win 25352 
  16:05:38.782896 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 6679:8127(1448) ack 1 win 26064
  16:05:38.782898 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 8127:9575(1448) ack 1 win 26064
  16:05:38.782901 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 6679 win 25352 
  16:05:38.782904 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 9575:11023(1448) ack 1 win 26064
  16:05:38.783183 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 8127 win 25352
  16:05:38.783188 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 11023:12471(1448) ack 1 win 26064
  16:05:38.783191 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 9575 win 25352
  16:05:38.783193 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 12471:13919(1448) ack 1 win 26064
  16:05:38.783196 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 11023 win 25352
  16:05:38.783199 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 13919:15367(1448) ack 1 win 26064
  16:05:38.783201 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 15367:16815(1448) ack 1 win 26064
  16:05:38.783502 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 12471 win 25352
  16:05:38.783506 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 16815:18263(1448) ack 1 win 26064
  16:05:38.783509 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 13919 win 25352
  16:05:38.783512 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 18263:19711(1448) ack 1 win 26064
  16:05:38.783514 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 15367 win 25352
  16:05:38.783517 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 19711:21159(1448) ack 1 win 26064
  16:05:38.783519 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 16815 win 25352

This time tc2 sends its own data stream and tc1 ACKs it; after a while tc1 starts sending too, and in the end both directions are normal.

Why can't a socket with TCP_NODELAY recover once the network is healthy again?
Let's look at the implementation of the send system call (2.6.9 kernel), tracing all the way down to the tcp_sendmsg function:

[net/ipv4/tcp.c --> tcp_sendmsg]
   813     while (--iovlen >= 0) {
   814         int seglen = iov->iov_len;
   815         unsigned char __user *from = iov->iov_base;
   816
   817         iov++;
   818
   819         while (seglen > 0) {
   820             int copy;
   821
   822             skb = sk->sk_write_queue.prev;
   823
   824             if (!sk->sk_send_head ||
   825                 (copy = mss_now - skb->len) <= 0) {
   826
   827 new_segment:
   828                 /* Allocate new segment. If the interface is SG,
   829                  * allocate skb fitting to single page.
   830                  */
   831                 if (!sk_stream_memory_free(sk))
   832                     goto wait_for_sndbuf;
   833
   834                 skb = sk_stream_alloc_pskb(sk, select_size(sk, tp),
   835                                0, sk->sk_allocation);
   836                 if (!skb)
   837                     goto wait_for_memory;

Line 831 checks whether the sndbuf still has free space; if not, we jump to wait_for_sndbuf.

[net/ipv4/tcp.c --> tcp_sendmsg]
   958 wait_for_sndbuf:
   959             set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
   960 wait_for_memory:
   961             if (copied)
   962                 tcp_push(sk, tp, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
   963
   964             if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
   965                 goto do_error;
   966
   967             mss_now = tcp_current_mss(sk, !(flags&MSG_OOB));
   968         }
   969     }
   970
   971 out:
   972     if (copied)
   973         tcp_push(sk, tp, flags, mss_now, tp->nonagle);
   974     TCP_CHECK_TIMER(sk);
   975     release_sock(sk);
   976     return copied;
   977
   978 do_fault:
   979     if (!skb->len) {
   980         if (sk->sk_send_head == skb)
   981             sk->sk_send_head = NULL;
   982         __skb_unlink(skb, skb->list);
   983         sk_stream_free_skb(sk, skb);
   984     }
   985
   986 do_error:
   987     if (copied)
   988         goto out;
   989 out_err:
   990     err = sk_stream_error(sk, flags, err);
   991     TCP_CHECK_TIMER(sk);
   992     release_sock(sk);
   993     return err;

The sndbuf has no room, so a bit is set. The check on line 961 fails because nothing has been sent yet: copied is 0. Execution continues to sk_stream_wait_memory, which, as the name suggests, waits for free space to appear in the sndbuf. But our socket is non-blocking, so sk_stream_wait_memory returns quickly, with -EAGAIN as its return value. We therefore jump to do_error, where the check on line 987 fails as well, fall through to out_err, and finally leave tcp_sendmsg carrying -EAGAIN.
This is why our repeated sends came back with a return value of 0 and errno set to EAGAIN.
If everything is normal, the socket keeps pushing data out, and sooner or later the sndbuf will have free space again. But what about the abnormal case? Say TCP_NODELAY is set and the network goes down: a burst of packets is sent in an instant, and the peer ACKs none of them.
Now let's see what tcp_sendmsg does in the normal case. The jump on line 832 does not happen, and execution continues (some skb-handling code omitted):

[net/ipv4/tcp.c --> tcp_sendmsg]
   936             if (!copied)
   937                 TCP_SKB_CB(skb)->flags &= ~TCPCB_FLAG_PSH;
   938
   939             tp->write_seq += copy;
   940             TCP_SKB_CB(skb)->end_seq += copy;
   941             skb_shinfo(skb)->tso_segs = 0;
   942
   943             from += copy;
   944             copied += copy;
   945             if ((seglen -= copy) == 0 && iovlen == 0)
   946                 goto out;

If this pass put the whole message into the skb and the iovec is used up, the check on line 945 fires and we jump straight to out, which calls tcp_push. tcp_push in turn calls __tcp_push_pending_frames:

[net/ipv4/tcp.h --> __tcp_push_pending_frames]
  1508 static __inline__ void __tcp_push_pending_frames(struct sock *sk,
  1509                          struct tcp_opt *tp,
  1510                          unsigned cur_mss,
  1511                          int nonagle)
  1512 {          
  1513     struct sk_buff *skb = sk->sk_send_head;
  1514    
  1515     if (skb) {
  1516         if (!tcp_skb_is_last(sk, skb))
  1517             nonagle = TCP_NAGLE_PUSH;
  1518         if (!tcp_snd_test(tp, skb, cur_mss, nonagle) ||
  1519             tcp_write_xmit(sk, nonagle))
  1520             tcp_check_probe_timer(sk, tp);
  1521     }
  1522     tcp_cwnd_validate(sk, tp);
  1523 }

The "||" on line 1518 is subtle: tcp_write_xmit runs only if tcp_snd_test returns 1. So let's look at tcp_snd_test first:

[net/ipv4/tcp.h --> tcp_snd_test]
  1452 static __inline__ int tcp_snd_test(struct tcp_opt *tp, struct sk_buff *skb,
  1453                    unsigned cur_mss, int nonagle)
  1454 {
  1455     int pkts = tcp_skb_pcount(skb);
  1456
  1457     if (!pkts) {
  1458         tcp_set_skb_tso_segs(skb, tp->mss_cache_std);
  1459         pkts = tcp_skb_pcount(skb);
  1460     }
  1461
  1462     /*  RFC 1122 - section 4.2.3.4
  1463      *
  1464      *  We must queue if
  1465      *
  1466      *  a) The right edge of this frame exceeds the window
  1467      *  b) There are packets in flight and we have a small segment
  1468      *     [SWS avoidance and Nagle algorithm]
  1469      *     (part of SWS is done on packetization)
  1470      *     Minshall version sounds: there are no _small_
  1471      *     segments in flight. (tcp_nagle_check)
  1472      *  c) We have too many packets 'in flight'
  1473      *
  1474      *  Don't use the nagle rule for urgent data (or
  1475      *  for the final FIN -DaveM).
  1476      *
  1477      *  Also, Nagle rule does not apply to frames, which
  1478      *  sit in the middle of queue (they have no chances
  1479      *  to get new data) and if room at tail of skb is
  1480      *  not enough to save something seriously (<32 for now).
  1481      */
  1482
  1483     /* Don't be strict about the congestion window for the
  1484      * final FIN frame.  -DaveM
  1485      */
  1486     return (((nonagle&TCP_NAGLE_PUSH) || tp->urg_mode
  1487          || !tcp_nagle_check(tp, skb, cur_mss, nonagle)) &&
  1488         (((tcp_packets_in_flight(tp) + (pkts-1)) < tp->snd_cwnd) ||
  1489          (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN)) &&
  1490         !after(TCP_SKB_CB(skb)->end_seq, tp->snd_una + tp->snd_wnd));
  1491 }

This function has more comment than code. The complex condition after return can be unpacked into an AND of three conditions. Look at the second one:
(((tcp_packets_in_flight(tp) + (pkts-1)) < tp->snd_cwnd) || (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN))
Here tcp_packets_in_flight is the number of packets "in flight", that is, packets currently out on the network. It is computed as:

        packets sent at least once + packets retransmitted - packets that have left the network (SACKed or lost)

And TCPCB_FLAG_FIN marks whether one end has finished sending, which never happens in our project: our data stream never ends.
Now it's clear:
With TCP_NODELAY set, the Nagle algorithm is disabled and large numbers of small packets go out (look at the first tcpdump trace above). When the network suddenly drops, there are so many packets in flight that they exceed snd_cwnd, the congestion window. tcp_snd_test therefore returns 0 and no actual transmission takes place. Since the queued data is never sent, no space in the sndbuf is ever freed, and tcp_sendmsg keeps failing with EAGAIN. A vicious circle.

Some will ask: if the sndbuf is out of space, how do the ACKs get sent? The answer is that sending an ACK does not need sndbuf space; it is thrown straight onto the wire. When the socket receives data it calls tcp_recvmsg, and after the read it cleans the receive buffer in cleanup_rbuf, which is where the ACK is sent, as shown in the figure:

[figure: the ACK send path through cleanup_rbuf (image not preserved)]

What tcp_write_xmit does is loop over the send queue, pulling skbs and handing each to tcp_transmit_skb to put on the network. An ACK, by contrast, calls tcp_transmit_skb directly; it never goes through the send queue and is therefore unaffected by sndbuf space.

Some may also ask: isn't this a bug in the Linux TCP stack? I think it may be. In the 2.6.32 kernel the tcp_snd_test function no longer exists (in fact it was gone as of 2.6.13, so RHEL 5 users can breathe easy), and the awkward "||" in __tcp_push_pending_frames was removed in favor of calling tcp_write_xmit directly; tcp_write_xmit itself now checks the window and the Nagle algorithm to decide whether a packet should be sent. The logic is cleaner, and the bug is avoided.