[linux-kernel] How well a TCP connection recovers after a network outage
Author: Dong Hao (if you repost this, please keep the name and the blog link http://oldblog.donghao.org/uii/, thanks!)
I ran into a problem on a project. Two machines hold a TCP connection between sockets, with heavy traffic in both directions. We then cut the network by configuring a 100% drop rate on the router; naturally the sockets can neither send nor receive, and retransmissions pile up. Then we remove the router setting and restore the network. The result: traffic from client to server returns to normal, but server to client stays dead. No matter how hard you call send, it fails with -1 and errno set to EAGAIN.
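For context, the sockets were configured roughly like this; a minimal sketch where the two flags are the point and everything else, names included, is illustrative:

#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* hypothetical helper: configure a connected TCP socket the way our
 * project did: non-blocking, with Nagle disabled */
int setup_socket(int fd)
{
        int one = 1;

        /* non-blocking: send() fails with EAGAIN instead of sleeping */
        if (fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK) < 0)
                return -1;

        /* TCP_NODELAY: disable Nagle so every small write goes out
         * immediately (the setting that turns out to matter below) */
        if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) < 0)
                return -1;

        return 0;
}

/* the symptom: on the stuck server this kept failing indefinitely */
ssize_t try_send(int fd, const void *buf, size_t len)
{
        ssize_t n = send(fd, buf, len, 0);

        if (n < 0 && errno == EAGAIN) {
                /* kernel send buffer is full; normally transient,
                 * but in our case it never cleared */
        }
        return n;
}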
I captured the traffic at that point with tcpdump (tc2 is the server, tc1 the client):
12:08:21.020291 IP tc1.corp.com.42171 > tc2.corp.com.3003: S 4009389430:4009389430(0) win 5840
12:08:21.020571 IP tc2.corp.com.3003 > tc1.corp.com.42171: R 0:0(0) ack 4009389431 win 0
12:08:38.934329 IP tc2.corp.com.3903 > tc1.corp.com.3904: P 2398055392:2398056153(761) ack 2538876742 win 724
12:08:38.934519 IP tc1.corp.com.3904 > tc2.corp.com.3903: . ack 2165 win 13756
12:08:39.958457 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 1:763(762) ack 2165 win 13756
12:08:39.958485 IP tc2.corp.com.3903 > tc1.corp.com.3904: . ack 763 win 1448
12:08:39.958653 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 763:881(118) ack 2165 win 13756
12:08:39.958660 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 881:997(116) ack 2165 win 13756
12:08:39.958719 IP tc2.corp.com.3903 > tc1.corp.com.3904: . ack 997 win 1448
12:08:39.958890 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 997:1114(117) ack 2165 win 13756
12:08:39.958898 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 1114:1232(118) ack 2165 win 13756
12:08:39.958903 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 1232:1349(117) ack 2165 win 13756
12:08:39.958971 IP tc2.corp.com.3903 > tc1.corp.com.3904: . ack 1349 win 1448
12:08:39.959141 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 1349:1466(117) ack 2165 win 13756
12:08:39.959149 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 1466:1583(117) ack 2165 win 13756
12:08:39.959154 IP tc1.corp.com.3904 > tc2.corp.com.3903: P 1583:1700(117) ack 2165 win 13756
12:08:39.959222 IP tc2.corp.com.3903 > tc1.corp.com.3904: . ack 1700 win 1448
tc2 never sends its own data; it just keeps ACKing whatever arrives from tc1, and even after half an hour nothing changes. Why won't it send?
It finally turned out to be because we had set TCP_NODELAY on the socket. Remove that setting, restart the programs, and after the network recovers TCP works normally in both directions again. The same view in tcpdump:
16:05:38.782427 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: P 0:887(887) ack 1 win 26064
16:05:38.782619 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 3783 win 25352
16:05:38.782634 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 3783:5231(1448) ack 1 win 26064
16:05:38.782637 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 5231:6679(1448) ack 1 win 26064
16:05:38.782890 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 5231 win 25352
16:05:38.782896 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 6679:8127(1448) ack 1 win 26064
16:05:38.782898 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 8127:9575(1448) ack 1 win 26064
16:05:38.782901 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 6679 win 25352
16:05:38.782904 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 9575:11023(1448) ack 1 win 26064
16:05:38.783183 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 8127 win 25352
16:05:38.783188 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 11023:12471(1448) ack 1 win 26064
16:05:38.783191 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 9575 win 25352
16:05:38.783193 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 12471:13919(1448) ack 1 win 26064
16:05:38.783196 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 11023 win 25352
16:05:38.783199 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 13919:15367(1448) ack 1 win 26064
16:05:38.783201 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 15367:16815(1448) ack 1 win 26064
16:05:38.783502 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 12471 win 25352
16:05:38.783506 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 16815:18263(1448) ack 1 win 26064
16:05:38.783509 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 13919 win 25352
16:05:38.783512 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 18263:19711(1448) ack 1 win 26064
16:05:38.783514 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 15367 win 25352
16:05:38.783517 IP tc2.corp.alimama.com.3903 > tc1.corp.alimama.com.3904: . 19711:21159(1448) ack 1 win 26064
16:05:38.783519 IP tc1.corp.alimama.com.3904 > tc2.corp.alimama.com.3903: . ack 16815 win 25352
This time tc2 sends its own data stream and tc1 ACKs it; after a while tc1 starts sending as well, and eventually both directions are back to normal.
So why can't a socket with TCP_NODELAY recover once the network is healthy again?
Let's look at how the send system call is implemented (2.6.9 kernel), tracing down to the tcp_sendmsg function:
[net/ipv4/tcp.c --> tcp_sendmsg]
813 while (--iovlen >= 0) {
814 int seglen = iov->iov_len;
815 unsigned char __user *from = iov->iov_base;
816
817 iov++;
818
819 while (seglen > 0) {
820 int copy;
821
822 skb = sk->sk_write_queue.prev;
823
824 if (!sk->sk_send_head ||
825 (copy = mss_now - skb->len) <= 0) {
826
827 new_segment:
828 /* Allocate new segment. If the interface is SG,
829 * allocate skb fitting to single page.
830 */
831 if (!sk_stream_memory_free(sk))
832 goto wait_for_sndbuf;
833
834 skb = sk_stream_alloc_pskb(sk, select_size(sk, tp),
835 0, sk->sk_allocation);
836 if (!skb)
837 goto wait_for_memory;
Line 831 checks whether the send buffer still has free space; if not, we jump to wait_for_sndbuf.
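The sk_stream_memory_free check itself is just a one-line comparison against the socket's configured send-buffer size; quoted here from the 2.6-era include/net/sock.h for reference (treat details as approximate):

static inline int sk_stream_memory_free(struct sock *sk)
{
        /* room left iff bytes queued for sending are below sndbuf */
        return sk->sk_wmem_queued < sk->sk_sndbuf;
}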
[net/ipv4/tcp.c --> tcp_sendmsg]
958 wait_for_sndbuf:
959 set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
960 wait_for_memory:
961 if (copied)
962 tcp_push(sk, tp, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
963
964 if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
965 goto do_error;
966
967 mss_now = tcp_current_mss(sk, !(flags&MSG_OOB));
968 }
969 }
970
971 out:
972 if (copied)
973 tcp_push(sk, tp, flags, mss_now, tp->nonagle);
974 TCP_CHECK_TIMER(sk);
975 release_sock(sk);
976 return copied;
977
978 do_fault:
979 if (!skb->len) {
980 if (sk->sk_send_head == skb)
981 sk->sk_send_head = NULL;
982 __skb_unlink(skb, skb->list);
983 sk_stream_free_skb(sk, skb);
984 }
985
986 do_error:
987 if (copied)
988 goto out;
989 out_err:
990 err = sk_stream_error(sk, flags, err);
991 TCP_CHECK_TIMER(sk);
992 release_sock(sk);
993 return err;
The send buffer is full, so a bit gets set; the test at line 961 fails because nothing has been copied yet in this call and copied is 0. Next comes sk_stream_wait_memory, which, as the name suggests, waits for the send buffer to have free space. But our socket is non-blocking, so sk_stream_wait_memory returns almost immediately with -EAGAIN. That takes us to do_error; the test at line 987 fails as well, so we fall through to out_err and finally leave tcp_sendmsg carrying -EAGAIN.
That is why our send calls keep failing with -1 and errno set to EAGAIN.
If everything were normal, the socket would keep pushing data out and sooner or later the send buffer would free up. But what if something is wrong? Say TCP_NODELAY is set and the network goes down: a burst of packets goes out in an instant, and the peer ACKs none of them.
Now let's see what tcp_sendmsg does in the normal case. The jump at line 832 never happens, and execution continues (part of the skb-handling code omitted):
[net/ipv4/tcp.c --> tcp_sendmsg]
936 if (!copied)
937 TCP_SKB_CB(skb)->flags &= ~TCPCB_FLAG_PSH;
938
939 tp->write_seq += copy;
940 TCP_SKB_CB(skb)->end_seq += copy;
941 skb_shinfo(skb)->tso_segs = 0;
942
943 from += copy;
944 copied += copy;
945 if ((seglen -= copy) == 0 && iovlen == 0)
946 goto out;
If the whole message fits into the skb in one go and the iovec is exhausted, the test at line 945 succeeds and we jump straight to out, which runs tcp_push. tcp_push calls __tcp_push_pending_frames:
[net/ipv4/tcp.h --> __tcp_push_pending_frames]
1508 static __inline__ void __tcp_push_pending_frames(struct sock *sk,
1509 struct tcp_opt *tp,
1510 unsigned cur_mss,
1511 int nonagle)
1512 {
1513 struct sk_buff *skb = sk->sk_send_head;
1514
1515 if (skb) {
1516 if (!tcp_skb_is_last(sk, skb))
1517 nonagle = TCP_NAGLE_PUSH;
1518 if (!tcp_snd_test(tp, skb, cur_mss, nonagle) ||
1519 tcp_write_xmit(sk, nonagle))
1520 tcp_check_probe_timer(sk, tp);
1521 }
1522 tcp_cwnd_validate(sk, tp);
1523 }
The "||" at line 1518 is carefully arranged: thanks to short-circuit evaluation, tcp_write_xmit only runs when tcp_snd_test returns 1. Unrolled into explicit control flow, lines 1518-1520 are equivalent to the following (a readability paraphrase, not kernel code):
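if (!tcp_snd_test(tp, skb, cur_mss, nonagle)) {
        /* not clear to send right now: fall back to the probe timer */
        tcp_check_probe_timer(sk, tp);
} else if (tcp_write_xmit(sk, nonagle)) {
        /* clear to send, but transmission failed: same fallback */
        tcp_check_probe_timer(sk, tp);
}

Everything therefore hinges on tcp_snd_test, so let's look at that first: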
[net/ipv4/tcp.h --> tcp_snd_test]
1452 static __inline__ int tcp_snd_test(struct tcp_opt *tp, struct sk_buff *skb,
1453 unsigned cur_mss, int nonagle)
1454 {
1455 int pkts = tcp_skb_pcount(skb);
1456
1457 if (!pkts) {
1458 tcp_set_skb_tso_segs(skb, tp->mss_cache_std);
1459 pkts = tcp_skb_pcount(skb);
1460 }
1461
1462 /* RFC 1122 - section 4.2.3.4
1463 *
1464 * We must queue if
1465 *
1466 * a) The right edge of this frame exceeds the window
1467 * b) There are packets in flight and we have a small segment
1468 * [SWS avoidance and Nagle algorithm]
1469 * (part of SWS is done on packetization)
1470 * Minshall version sounds: there are no _small_
1471 * segments in flight. (tcp_nagle_check)
1472 * c) We have too many packets 'in flight'
1473 *
1474 * Don't use the nagle rule for urgent data (or
1475 * for the final FIN -DaveM).
1476 *
1477 * Also, Nagle rule does not apply to frames, which
1478 * sit in the middle of queue (they have no chances
1479 * to get new data) and if room at tail of skb is
1480 * not enough to save something seriously (<32 for now).
1481 */
1482
1483 /* Don't be strict about the congestion window for the
1484 * final FIN frame. -DaveM
1485 */
1486 return (((nonagle&TCP_NAGLE_PUSH) || tp->urg_mode
1487 || !tcp_nagle_check(tp, skb, cur_mss, nonagle)) &&
1488 (((tcp_packets_in_flight(tp) + (pkts-1)) < tp->snd_cwnd) ||
1489 (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN)) &&
1490 !after(TCP_SKB_CB(skb)->end_seq, tp->snd_una + tp->snd_wnd));
1491 }
This function has more comment than code. The complicated expression after return can be unpacked: it is really three conditions joined by "&&" (named one by one in the sketch below). The second condition is the one that matters here:
(((tcp_packets_in_flight(tp) + (pkts-1)) < tp->snd_cwnd) || (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN))
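For readability, here is the whole return expression rewritten with each condition named (a paraphrase, not kernel code):

int nagle_ok = (nonagle & TCP_NAGLE_PUSH) || tp->urg_mode ||
               !tcp_nagle_check(tp, skb, cur_mss, nonagle);  /* Nagle permits sending */
int cwnd_ok  = ((tcp_packets_in_flight(tp) + (pkts - 1)) < tp->snd_cwnd) ||
               (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN);    /* congestion window has room */
int rwnd_ok  = !after(TCP_SKB_CB(skb)->end_seq,
                      tp->snd_una + tp->snd_wnd);            /* fits the peer's advertised window */

return nagle_ok && cwnd_ok && rwnd_ok;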
Here tcp_packets_in_flight is the number of packets currently "in flight", that is, out on the network. It is computed as:
packets sent once + packets retransmitted - packets considered to have already left the network (SACKed or lost)
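That is literally what the 2.6-era helper computes; quoted from include/net/tcp.h of the same kernel generation for reference:

static __inline__ unsigned int tcp_packets_in_flight(struct tcp_opt *tp)
{
        /* packets_out - left_out (sacked + lost) + retrans_out */
        return tp->packets_out - tp->left_out + tp->retrans_out;
}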
TCPCB_FLAG_FIN marks that one end has finished sending. That never applies in our project; our data stream has no end.
Now it is clear:
With NODELAY set, the Nagle algorithm is off and a lot of small packets go out (look at the first tcpdump trace above). When the network suddenly drops, the number of in-flight packets is huge, so huge that it exceeds snd_cwnd, the congestion window. So tcp_snd_test returns 0 and no actual transmission happens. Since the queued data is never sent, no space in the send buffer ever frees up, and tcp_sendmsg keeps returning -EAGAIN. A vicious cycle.
Someone is bound to ask: if the send buffer has no space, how do the ACKs get out? The answer is that sending an ACK needs no send-buffer space; it is simply thrown onto the wire. When the socket receives data it goes through tcp_recvmsg, and after the data is copied out, cleanup_rbuf cleans the read buffer, and that is where the ACK is sent.
What tcp_write_xmit does is loop over the send queue, pick up skbs, and hand them to tcp_transmit_skb to put on the network. An ACK, by contrast, calls tcp_transmit_skb directly, so it never touches the send queue and is unaffected by send-buffer space.
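A condensed sketch of that ACK path, based on tcp_send_ack in the 2.6-era net/ipv4/tcp_output.c (heavily abridged, so treat it as illustrative):

void tcp_send_ack(struct sock *sk)
{
        /* if we have been reset, we may not send again */
        if (sk->sk_state != TCP_CLOSE) {
                struct sk_buff *buff;

                /* the ACK gets its own freshly allocated skb... */
                buff = alloc_skb(MAX_TCP_HEADER, GFP_ATOMIC);
                if (buff == NULL)
                        return;  /* the real code arms a timer to retry */

                /* ...header setup omitted... */

                /* ...and goes straight out: no send queue, no sndbuf
                 * accounting involved */
                tcp_transmit_skb(sk, buff);
        }
}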
Others may ask: isn't this a bug in the Linux TCP stack? I think it may well be, because in the 2.6.32 kernel the tcp_snd_test function is gone (in fact it disappeared as early as 2.6.13, so RHEL5 users can breathe easy), and the awkward "||" in __tcp_push_pending_frames has been removed too: it now calls tcp_write_xmit directly, and tcp_write_xmit itself checks the windows and the Nagle algorithm to decide whether a packet should be sent. The logic is cleaner, and the bug is avoided.
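For comparison, the 2.6.32 pusher looks roughly like this (paraphrased from net/ipv4/tcp_output.c, details approximate):

void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
                               int nonagle)
{
        struct sk_buff *skb = tcp_send_head(sk);

        if (!skb)
                return;

        /* all window and Nagle decisions now live inside tcp_write_xmit */
        if (tcp_write_xmit(sk, cur_mss, nonagle, 0, GFP_ATOMIC))
                tcp_check_probe_timer(sk);
}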