操作系统: 06 2011存档
今天下午我的讲座,名字是《Why we need ext4》,其实主要是我上半年学到东西的一个小结
why we need ext4
View more presentations from donghao.
前一阵子,我在一个SATA硬盘上模拟测试一个应用:
1. 创建十万个文件
2. 启动100个线程,每个线程都是对这十万个文件不停的调用 fgetxattr 和 fsetxattr
在跑了大概40分钟后,dmesg里出现了一些warning:
May 25 18:50:48 core097002 kernel: [ 4680.778040] INFO: task kjournald:172 blocked for more than 120 seconds.
May 25 18:51:47 core097002 snmpd[2666]: Connection from UDP: [172.23.14.66]:33299
May 25 18:57:45 core097002 kernel: [ 4680.778181] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 25 18:57:45 core097002 kernel: [ 4680.778297] kjournald D 0000000000000000 0 172 2 0x00000000
May 25 18:57:45 core097002 kernel: [ 4680.778304] ffff880224efdd40 0000000000000046 0000000000000000 0000000000016980
May 25 18:57:45 core097002 kernel: [ 4680.778454] ffff880224e84738 ffff880224e84240 ffff880224e84180 ffff880226de8240
May 25 18:57:45 core097002 snmpd[2666]: Received SNMP packet(s) from UDP: [172.23.14.66]:33299
May 25 18:57:45 core097002 kernel: [ 4680.778603] ffff880224e84738 00000001003fda2c 00000000249100e0 0000000000000000
May 25 18:57:45 core097002 kernel: [ 4680.778753] Call Trace:
May 25 18:57:45 core097002 kernel: [ 4680.778818] [<ffffffffa000a49c>] journal_commit_transaction+0x17c/0x1070 [jbd]
May 25 18:57:45 core097002 kernel: [ 4680.778938] [<ffffffff81059742>] ? finish_task_switch+0x42/0xe0
May 25 18:57:46 core097002 kernel: [ 4680.779037] [<ffffffff8108d520>] ? autoremove_wake_function+0x0/0x40
May 25 18:57:46 core097002 kernel: [ 4680.779145] [<ffffffff810788b9>] ? try_to_del_timer_sync+0x49/0xe0
May 25 18:57:46 core097002 kernel: [ 4680.779255] [<ffffffffa000e0b2>] kjournald+0xe2/0x250 [jbd]
May 25 18:57:46 core097002 kernel: [ 4680.779353] [<ffffffff8108d520>] ? autoremove_wake_function+0x0/0x40
May 25 18:57:47 core097002 kernel: [ 4680.779451] [<ffffffffa000dfd0>] ? kjournald+0x0/0x250 [jbd]
May 25 18:57:49 core097002 kernel: [ 4680.779535] [<ffffffff8108d3a6>] kthread+0x96/0xa0
May 25 18:57:49 core097002 kernel: [ 4680.779621] [<ffffffff8101408a>] child_rip+0xa/0x20
May 25 18:57:49 core097002 kernel: [ 4680.779710] [<ffffffff8108d310>] ? kthread+0x0/0xa0
May 25 18:57:49 core097002 kernel: [ 4680.779797] [<ffffffff81014080>] ? child_rip+0x0/0x20
May 25 18:57:49 core097002 kernel: [ 4680.779881] INFO: task syslogd:2430 blocked for more than 120 seconds.
May 25 18:57:49 core097002 kernel: [ 4680.779980] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 25 18:57:49 core097002 kernel: [ 4680.780105] syslogd D 0000000000000000 0 2430 1 0x00000004
May 25 18:57:49 core097002 kernel: [ 4680.780110] ffff880221ae98c8 0000000000000082 0000000000000000 0000000000016980
May 25 18:57:50 core097002 kernel: [ 4680.780260] ffff88022505ec38 ffff88022505e740 ffff88022505e680 ffff880226e1e340
May 25 18:57:50 core097002 kernel: [ 4680.780408] ffff88022505ec38 00000001003fd7ce 0000000023d91340 0000000000000000
May 25 18:57:50 core097002 kernel: [ 4680.780557] Call Trace:
May 25 18:57:50 core097002 kernel: [ 4680.780600] [<ffffffff814ae453>] io_schedule+0x73/0xc0
May 25 18:57:50 core097002 kernel: [ 4680.780685] [<ffffffff8118f925>] sync_buffer+0x45/0x50
May 25 18:57:50 core097002 kernel: [ 4680.780769] [<ffffffff814aeb3a>] __wait_on_bit_lock+0x5a/0xb0
May 25 18:57:50 core097002 kernel: [ 4680.780872] [<ffffffff8118f8e0>] ? sync_buffer+0x0/0x50
May 25 18:57:50 core097002 kernel: [ 4680.780960] [<ffffffff8118f8e0>] ? sync_buffer+0x0/0x50
May 25 18:57:50 core097002 kernel: [ 4680.781046] [<ffffffff814aec0d>] out_of_line_wait_on_bit_lock+0x7d/0x90
May 25 18:57:51 core097002 kernel: [ 4680.781161] [<ffffffff8108d560>] ? wake_bit_function+0x0/0x40
May 25 18:57:51 core097002 kernel: [ 4680.781252] [<ffffffff8118fb1c>] __lock_buffer+0x2c/0x30
May 25 18:57:51 core097002 kernel: [ 4680.781345] [<ffffffffa0009c2f>] do_get_write_access+0x3bf/0x480 [jbd]
May 25 18:57:51 core097002 kernel: [ 4680.781449] [<ffffffff811904ce>] ? __getblk+0x2e/0x240
May 25 18:57:51 core097002 kernel: [ 4680.781535] [<ffffffffa0009e81>] journal_get_write_access+0x31/0x50 [jbd]
May 25 18:57:51 core097002 kernel: [ 4680.781656] [<ffffffffa0031cd4>] __ext3_journal_get_write_access+0x34/0x70 [ext3]
May 25 18:57:51 core097002 kernel: [ 4680.781784] [<ffffffffa0023ab3>] ext3_reserve_inode_write+0x83/0xb0 [ext3]
May 25 18:57:51 core097002 kernel: [ 4680.781904] [<ffffffffa0023b16>] ext3_mark_inode_dirty+0x36/0x60 [ext3]
May 25 18:51:47 core097002 snmpd[2666]: Connection from UDP: [172.23.14.66]:33299
May 25 18:57:45 core097002 kernel: [ 4680.778181] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 25 18:57:45 core097002 kernel: [ 4680.778297] kjournald D 0000000000000000 0 172 2 0x00000000
May 25 18:57:45 core097002 kernel: [ 4680.778304] ffff880224efdd40 0000000000000046 0000000000000000 0000000000016980
May 25 18:57:45 core097002 kernel: [ 4680.778454] ffff880224e84738 ffff880224e84240 ffff880224e84180 ffff880226de8240
May 25 18:57:45 core097002 snmpd[2666]: Received SNMP packet(s) from UDP: [172.23.14.66]:33299
May 25 18:57:45 core097002 kernel: [ 4680.778603] ffff880224e84738 00000001003fda2c 00000000249100e0 0000000000000000
May 25 18:57:45 core097002 kernel: [ 4680.778753] Call Trace:
May 25 18:57:45 core097002 kernel: [ 4680.778818] [<ffffffffa000a49c>] journal_commit_transaction+0x17c/0x1070 [jbd]
May 25 18:57:45 core097002 kernel: [ 4680.778938] [<ffffffff81059742>] ? finish_task_switch+0x42/0xe0
May 25 18:57:46 core097002 kernel: [ 4680.779037] [<ffffffff8108d520>] ? autoremove_wake_function+0x0/0x40
May 25 18:57:46 core097002 kernel: [ 4680.779145] [<ffffffff810788b9>] ? try_to_del_timer_sync+0x49/0xe0
May 25 18:57:46 core097002 kernel: [ 4680.779255] [<ffffffffa000e0b2>] kjournald+0xe2/0x250 [jbd]
May 25 18:57:46 core097002 kernel: [ 4680.779353] [<ffffffff8108d520>] ? autoremove_wake_function+0x0/0x40
May 25 18:57:47 core097002 kernel: [ 4680.779451] [<ffffffffa000dfd0>] ? kjournald+0x0/0x250 [jbd]
May 25 18:57:49 core097002 kernel: [ 4680.779535] [<ffffffff8108d3a6>] kthread+0x96/0xa0
May 25 18:57:49 core097002 kernel: [ 4680.779621] [<ffffffff8101408a>] child_rip+0xa/0x20
May 25 18:57:49 core097002 kernel: [ 4680.779710] [<ffffffff8108d310>] ? kthread+0x0/0xa0
May 25 18:57:49 core097002 kernel: [ 4680.779797] [<ffffffff81014080>] ? child_rip+0x0/0x20
May 25 18:57:49 core097002 kernel: [ 4680.779881] INFO: task syslogd:2430 blocked for more than 120 seconds.
May 25 18:57:49 core097002 kernel: [ 4680.779980] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 25 18:57:49 core097002 kernel: [ 4680.780105] syslogd D 0000000000000000 0 2430 1 0x00000004
May 25 18:57:49 core097002 kernel: [ 4680.780110] ffff880221ae98c8 0000000000000082 0000000000000000 0000000000016980
May 25 18:57:50 core097002 kernel: [ 4680.780260] ffff88022505ec38 ffff88022505e740 ffff88022505e680 ffff880226e1e340
May 25 18:57:50 core097002 kernel: [ 4680.780408] ffff88022505ec38 00000001003fd7ce 0000000023d91340 0000000000000000
May 25 18:57:50 core097002 kernel: [ 4680.780557] Call Trace:
May 25 18:57:50 core097002 kernel: [ 4680.780600] [<ffffffff814ae453>] io_schedule+0x73/0xc0
May 25 18:57:50 core097002 kernel: [ 4680.780685] [<ffffffff8118f925>] sync_buffer+0x45/0x50
May 25 18:57:50 core097002 kernel: [ 4680.780769] [<ffffffff814aeb3a>] __wait_on_bit_lock+0x5a/0xb0
May 25 18:57:50 core097002 kernel: [ 4680.780872] [<ffffffff8118f8e0>] ? sync_buffer+0x0/0x50
May 25 18:57:50 core097002 kernel: [ 4680.780960] [<ffffffff8118f8e0>] ? sync_buffer+0x0/0x50
May 25 18:57:50 core097002 kernel: [ 4680.781046] [<ffffffff814aec0d>] out_of_line_wait_on_bit_lock+0x7d/0x90
May 25 18:57:51 core097002 kernel: [ 4680.781161] [<ffffffff8108d560>] ? wake_bit_function+0x0/0x40
May 25 18:57:51 core097002 kernel: [ 4680.781252] [<ffffffff8118fb1c>] __lock_buffer+0x2c/0x30
May 25 18:57:51 core097002 kernel: [ 4680.781345] [<ffffffffa0009c2f>] do_get_write_access+0x3bf/0x480 [jbd]
May 25 18:57:51 core097002 kernel: [ 4680.781449] [<ffffffff811904ce>] ? __getblk+0x2e/0x240
May 25 18:57:51 core097002 kernel: [ 4680.781535] [<ffffffffa0009e81>] journal_get_write_access+0x31/0x50 [jbd]
May 25 18:57:51 core097002 kernel: [ 4680.781656] [<ffffffffa0031cd4>] __ext3_journal_get_write_access+0x34/0x70 [ext3]
May 25 18:57:51 core097002 kernel: [ 4680.781784] [<ffffffffa0023ab3>] ext3_reserve_inode_write+0x83/0xb0 [ext3]
May 25 18:57:51 core097002 kernel: [ 4680.781904] [<ffffffffa0023b16>] ext3_mark_inode_dirty+0x36/0x60 [ext3]
直观看起来,好像是jbd在提交事务时迟迟没有返回,那就是jbd的bug喽?于是我们在jbd里插入了很多trace_printk,企图找出问题,最后发现是jbd很久都无法lock_buffer,故而顿在那里。那是谁已经lock住buffer了呢?除了jbd写硬盘,那就只剩write-back了,于是接着trace write_back,发现是submit_bio后长时间没有返回,于是我问马涛“如果硬盘设备真的很慢,是不是会出现120秒IO都无法完成的情况”——马涛说可能性很低,与其怀疑硬盘,不如怀疑io-scheduler,于是我们把io-scheduler从cfq换成deadline,再跑测试,24小时都没有再出现问题了。
于是我得出结论:cfq太讲求公平,所以忽视了IO可能造成的超时,因此对于IO超时要求高的应用,最好用deadline。
但是马涛同学没有善罢甘休,他继续看cfq的实现代码,发现cfq给ASYNC型的IO分配了太低的权限,而write-back的IO都是ASYNC型的,所以被严重耽误了。于是马涛发了个邮件 http://marc.info/?l=linux-kernel&m=130761660610470&w=2 提出了这个问题,并把测试用例也附上了 http://code.taobao.org/trac/dirbench/browser/trunk/meta_test/press/set_vs_get ,vivek回了一个改进cfq的patch,我们打上这一patch后情况有所好转,但是没有根除,接着把cfq的参数slice_idle置为0,测试ok了。
感谢马涛同学穷根究底的精神,很值得我学习。
看来我们这一高频率操作文件系统meta-data的测试用例很特别,也许实际应用中很少有这样的所以cfq才漏掉了这一情况。
关于存档
This page is a archive of entries in the 操作系统 category from 06 % Y.
操作系统: 05 2011 is the previous archive.
操作系统: 07 2011 is the next archive.
Find recent content on the main index or look in the 存档 to find all content.
操作系统: 06 2011每月存档
- 02 2013 (2)
- 01 2013 (2)
- 12 2012 (4)
- 11 2012 (1)
- 10 2012 (4)
- 09 2012 (1)
- 08 2012 (2)
- 06 2012 (5)
- 05 2012 (1)
- 02 2012 (3)
- 01 2012 (1)
- 12 2011 (1)
- 11 2011 (1)
- 10 2011 (1)
- 08 2011 (1)
- 07 2011 (1)
- 06 2011 (2)
- 05 2011 (2)
- 04 2011 (2)
- 03 2011 (2)
- 02 2011 (1)
- 01 2011 (1)
- 10 2010 (1)
- 08 2010 (2)
- 07 2010 (1)
- 06 2010 (2)
- 05 2010 (1)
- 03 2010 (4)
- 02 2010 (1)
- 01 2010 (1)
- 12 2009 (2)
- 11 2009 (4)
- 10 2009 (2)
- 08 2009 (2)
- 05 2009 (1)