BUG实例分析三:kernel panic in run timer softirq

来自个人维基
2015年5月6日 (三) 20:11Hovercool讨论 | 贡献的版本

(差异) ←上一版本 | 最后版本 (差异) | 下一版本→ (差异)
跳转至: 导航搜索

一、系统垂死前的log

<7>[ 8888.334134] (1)[560:er.ServerThread]misc timerirq: stop_timerirq current->pid d9832380
<7>[ 8888.334349] (0)[3602:mobile_log_d]misc timerirq: timerirq_handler, 5901467128368305820, -645717120
<7>[ 8888.335111] (1)[560:er.ServerThread]misc timerirq: timerirq_ioctl current->pid 560, 1074037090, 40
<7>[ 8888.335125] (1)[560:er.ServerThread]misc timerirq: start_timerirq current->pid 560
<4>[ 8888.336110]-(1)[556:SensorService]timer:0xd862db80, entry:0xd862db80
<4>[ 8888.337072]-(1)[556:SensorService]timer:0xd862db80, entry:0xd862db80
<4>[ 8888.338712]-(1)[556:SensorService]timer:0xd862db80, entry:0xd862db80
<4>[ 8888.340193]-(1)[556:SensorService]timer:0xd862db80, entry:0xd862db80
<4>[ 8888.340487]-(0)[67:disp_config_upd]timer:0xdcf6fd90, entry:0xdcf6fd90
<4>[ 8888.340565]-(0)[67:disp_config_upd]timer:0xdcf6fd90, entry:0xdcf6fd90
<4>[ 8888.341013]-(0)[67:disp_config_upd]timer:0xdcf6fdc0, entry:0xdcf6fdc0
<4>[ 8888.343425]-(1)[402:VSyncThreadHWC]timer:0xd8651e28, entry:0xd8651e28
<4>[ 8888.344930]-(1)[556:SensorService]timer:0xd862db38, entry:0xd862db38
<4>[ 8888.346111]-(1)[556:SensorService]timer:0xd862db38, entry:0xd862db38
<7>[ 8888.348029] (1)[560:er.ServerThread]misc timerirq: timerirq_ioctl current->pid 560, 33123, 0
<7>[ 8888.348045] (1)[560:er.ServerThread]misc timerirq: stop_timerirq current->pid d9832380
<4>[ 8888.348059]-(1)[560:er.ServerThread]timer:0xd9832414, entry:0xd9832414
<4>[ 8888.353563]-(1)[859:m.bbk.launcher2]timer:0xd9832414, entry:0xd9832414
<4>[ 8888.353592]-(0)[808:Binder_4]timer:0xdcdc7058, entry:0xdcdc7058
<4>[ 8888.353648]-(0)[808:Binder_4]timer:0xd9832414, entry:0xd9832414
<7>[ 8888.353664] (0)[808:Binder_4]misc timerirq: timerirq_handler, 5901467128368325134, -645717120
<4>[ 8888.353692]-(0)[808:Binder_4]timer:0xdd3e6c08, entry:0xdd3e6c08
<7>[ 8888.353712]-(0)[808:Binder_4]queue work=dd3e6be4 function=thermal_zone_device_check workqueue=dd835300 req_cpu=0 cpu=0
<7>[ 8888.353724]-(0)[808:Binder_4]activate work=dd3e6be4
<4>[ 8888.353735]-(0)[808:Binder_4]timer:0xdd3e6748, entry:0xdd3e6748
<7>[ 8888.353750]-(0)[808:Binder_4]queue work=dd3e6724 function=thermal_zone_device_check workqueue=dd835300 req_cpu=0 cpu=0
<7>[ 8888.353761]-(0)[808:Binder_4]activate work=dd3e6724
<4>[ 8888.353772]-(0)[808:Binder_4]timer:0xc1244bb4, entry:0xc1244bb4
<4>[ 8888.353792]-(0)[808:Binder_4]timer:0xc1b51828, entry:0xc1b51828
<6>[ 8888.354201] (0)[2829:kworker/0:0][Power/Battery] [EM] FG_Battery_CurrentConsumption : 3791/10 mA
<4>[ 8888.355076]-(0)[4024:kworker/0:1]timer:0xc735fbc8, entry:0xc735fbc8
<4>[ 8888.357054]-(0)[4024:kworker/0:1]timer:0xc735fbc8, entry:0xc735fbc8
<4>[ 8888.357351]-(0)[4024:kworker/0:1]timer:0xc735fbc8, entry:0xc735fbc8
<4>[ 8888.357729]-(0)[4024:kworker/0:1]timer:0xc735fbc8, entry:0xc735fbc8
<1>[ 8888.361935]-(1)[859:m.bbk.launcher2]Unable to handle kernel NULL pointer dereference at virtual address 00000004
<1>[ 8888.363198]-(1)[859:m.bbk.launcher2]pgd = d9a5c000
<1>[ 8888.363794]-(1)[859:m.bbk.launcher2][00000004] *pgd=00000000
<4>[ 8888.364500]-(1)[859:m.bbk.launcher2][KERN Warning] Some Kernel ERROR or WARN occur and Force debug_lock off!
<4>[ 8888.365724]-(1)[859:m.bbk.launcher2][KERN Warning] check below backtrace first:
<4>[ 8888.366635]-(1)[859:m.bbk.launcher2]Backtrace: 
<4>[ 8888.367213]-(1)[859:m.bbk.launcher2][<c0011e5c>] (dump_backtrace+0x0/0x10c) from [<c06cb1f0>] (dump_stack+0x18/0x1c)
<4>[ 8888.368519]-(1)[859:m.bbk.launcher2] r6:d9454000 r5:c084c2ac r4:d94557c0 r3:271ae919
<4>[ 8888.369491]-(1)[859:m.bbk.launcher2][<c06cb1d8>] (dump_stack+0x0/0x1c) from [<c02b6d6c>] (debug_locks_off+0x48/0x6c)
<4>[ 8888.370802]-(1)[859:m.bbk.launcher2][<c02b6d24>] (debug_locks_off+0x0/0x6c) from [<c005e8c8>] (oops_enter+0x14/0x30)
<4>[ 8888.372114]-(1)[859:m.bbk.launcher2][<c005e8b4>] (oops_enter+0x0/0x30) from [<c0012058>] (die+0x38/0x2e0)
<4>[ 8888.373308]-(1)[859:m.bbk.launcher2][<c0012020>] (die+0x0/0x2e0) from [<c06cb250>] (__do_kernel_fault.part.9+0x5c/0x7c)
<4>[ 8888.374653]-(1)[859:m.bbk.launcher2][<c06cb1f4>] (__do_kernel_fault.part.9+0x0/0x7c) from [<c06dbb94>] (do_page_fault+0x3b8/0x3fc)
<4>[ 8888.376111]-(1)[859:m.bbk.launcher2] r7:dafbe900 r3:d94557c0

...

<4>[ 8888.455561]-(1)[859:m.bbk.launcher2]CPU: 1    Tainted: G        W     (3.4.5 #1)
<4>[ 8888.456483]-(1)[859:m.bbk.launcher2]PC is at run_timer_softirq+0x1b0/0x46c
<4>[ 8888.457339]-(1)[859:m.bbk.launcher2]LR is at trace_hardirqs_off+0x14/0x18
<4>[ 8888.458186]-(1)[859:m.bbk.launcher2]pc : [<c006d9e8>]    lr : [<c00d4fac>]    psr: 60000193
<4>[ 8888.458192]-(1)[859:m.bbk.launcher2]sp : d9455808  ip : c09b74c8  fp : d945587c
<4>[ 8888.460136]-(1)[859:m.bbk.launcher2]r10: c0a6e220  r9 : c0a6dfb1  r8 : d9454000
<4>[ 8888.461047]-(1)[859:m.bbk.launcher2]r7 : c09a10c0  r6 : d9832414  r5 : dd85e000  r4 : 00000000
<4>[ 8888.462121]-(1)[859:m.bbk.launcher2]r3 : 00200200  r2 : 00200200  r1 : 00000000  r0 : dd85e000
<4>[ 8888.463196]-(1)[859:m.bbk.launcher2]Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
<4>[ 8888.464356]-(1)[859:m.bbk.launcher2]Control: 10c5387d  Table: 99a5c06a  DAC: 00000015
<4>[ 8888.465331]-(1)[859:m.bbk.launcher2]
<4>[ 8888.465335]-(1)[859:m.bbk.launcher2]PC: 0xc006d968:
<4>[ 8888.466382]-(1)[859:m.bbk.launcher2]d968  ebffff89 e595202c eaffffc8 e3a03000 e1a00005 e5853028 eb19aff9 ebffcc0a
<4>[ 8888.467662]-(1)[859:m.bbk.launcher2]d988  ebffcbdf e24bd028 e89daff0 e50b5054 e50b4060 e596c010 e1a01006 e1a02006
<4>[ 8888.468942]-(1)[859:m.bbk.launcher2]d9a8  e59f02d8 e50bc050 e596c014 e50bc05c e51bc054 e58c6028 eb19786b e59a3004
<4>[ 8888.470221]-(1)[859:m.bbk.launcher2]d9c8  e3530000 1a00007e e5962004 e3a04000 e5961000 e3a03c02 e51b0054 e3403020
<4>[ 8888.471501]-(1)[859:m.bbk.launcher2]d9e8  e5812004 e5821000 e5863004 e5864000 eb19afdb e5982004 e286c01c e24be040
<4>[ 8888.472781]-(1)[859:m.bbk.launcher2]da08  e50b2058 e8bc000f e59cc000 e8ae000f e1a02004 e58ec000 e3a0c002 e58dc000
<4>[ 8888.474061]-(1)[859:m.bbk.launcher2]da28  e1a03004 e59fc258 e1a01004 e24b0040 e58d4004 e58dc008 eb01b2f5 e59f2244
<4>[ 8888.475341]-(1)[859:m.bbk.launcher2]da48  e5923004 e1530004 1a00003e e51b0050 eb0e47da e51b005c e51bc050 e12fff3c
<4>[ 8888.476621]-(1)[859:m.bbk.launcher2]
<4>[ 8888.476625]-(1)[859:m.bbk.launcher2]LR: 0xc00d4f2c:
<4>[ 8888.477673]-(1)[859:m.bbk.launcher2]4f2c  e5d5300c e3530001 0afffff3 e59f004c e3001a68 ebfe2694 e3a03001 e5c5300c
<4>[ 8888.478952]-(1)[859:m.bbk.launcher2]4f4c  eaffffed e59f503c e5d5300b e3530001 0affffd0 e59f0024 e3001a66 ebfe268a

...

<4>[ 8888.544370]-(1)[859:m.bbk.launcher2]R6: 0xd9832394:
<4>[ 8888.545418]-(1)[859:m.bbk.launcher2]2394  00000000 00006f0f 00000000 0009e20e 51e63916 00000001 00000000 00000001
<4>[ 8888.546697]-(1)[859:m.bbk.launcher2]23b4  00000000 00000000 dead4ead ffffffff ffffffff c104c768 c0c9ea24 00000000
<4>[ 8888.547977]-(1)[859:m.bbk.launcher2]23d4  c08625d4 00000001 d8197e78 d8197e78 00000000 00000000 dead4ead ffffffff
<4>[ 8888.549257]-(1)[859:m.bbk.launcher2]23f4  ffffffff c104c780 c0c98204 00000000 c089de3c 00000001 d862dd44 d862dd44
<4>[ 8888.550537]-(1)[859:m.bbk.launcher2]2414  00000000 00200200 000d1388 dd85e000 c0303270 d9832380 ffffffff c104c770
<4>[ 8888.551817]-(1)[859:m.bbk.launcher2]2434  00000000 00000000 c089de10 00000000 dcc96d00 00000000 00000000 00000000
<4>[ 8888.553097]-(1)[859:m.bbk.launcher2]2454  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<4>[ 8888.554377]-(1)[859:m.bbk.launcher2]2474  00000000 00000000 00000000 cccccccc d9832c40 c03034e4 c06ce624 c06cf3f0
<4>[ 8888.555657]-(1)[859:m.bbk.launcher2]
<4>[ 8888.555661]-(1)[859:m.bbk.launcher2]R7: 0xc09a1040:
<4>[ 8888.556709]-(1)[859:m.bbk.launcher2]1040  00000000 00000000 deaf1eed ffffffff ffffffff c09a1054 c0c5ba14 00000000
<4>[ 8888.557988]-(1)[859:m.bbk.launcher2]1060  c086267c 00000001 00000000 00000000 00000000 00000000 00000000 00000000

二、分析

可以看出系统是在 run_timer_softirq函数中出现了 dataabort,即遭遇非法指针。

查看代码:

static void run_timer_softirq(struct softirq_action *h)
{
	struct tvec_base *base = __this_cpu_read(tvec_bases);

	hrtimer_run_pending();

	if (time_after_eq(jiffies, base->timer_jiffies))
		__run_timers(base);
}

static inline void __run_timers(struct tvec_base *base)
{
	struct timer_list *timer;

	spin_lock_irq(&base->lock);
	while (time_after_eq(jiffies, base->timer_jiffies)) {
	...
}

run_timer_softirq里面的均为inline函数,所以stack中表明的问题出现在 run_timer_softirq,其实也可能是出现在其中的inline调用函数中。

从上面可以看出,出现问题的进程是 859:m.bbk.launcher2,另外说明一下,由于明确了是softirq中的问题,我们在 detach_timer中加了相关log来用于定位,现在我们找到问题进程timer的信息:

<4>[ 8888.353563]-(1)[859:m.bbk.launcher2]timer:0xd9832414, entry:0xd9832414

这应该就是出问题时所处理的timer,其地址是 0xd9832414,我们找一下这个地址所存放的内容数据,84行便是:

<4>[ 8888.544370]-(1)[859:m.bbk.launcher2]R6: 0xd9832394:
<4>[ 8888.545418]-(1)[859:m.bbk.launcher2]2394  00000000 00006f0f 00000000 0009e20e 51e63916 00000001 00000000 00000001
<4>[ 8888.546697]-(1)[859:m.bbk.launcher2]23b4  00000000 00000000 dead4ead ffffffff ffffffff c104c768 c0c9ea24 00000000
<4>[ 8888.547977]-(1)[859:m.bbk.launcher2]23d4  c08625d4 00000001 d8197e78 d8197e78 00000000 00000000 dead4ead ffffffff
<4>[ 8888.549257]-(1)[859:m.bbk.launcher2]23f4  ffffffff c104c780 c0c98204 00000000 c089de3c 00000001 d862dd44 d862dd44
<4>[ 8888.550537]-(1)[859:m.bbk.launcher2]2414  00000000 00200200 000d1388 dd85e000 c0303270 d9832380 ffffffff c104c770
<4>[ 8888.551817]-(1)[859:m.bbk.launcher2]2434  00000000 00000000 c089de10 00000000 dcc96d00 00000000 00000000 00000000
<4>[ 8888.553097]-(1)[859:m.bbk.launcher2]2454  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<4>[ 8888.554377]-(1)[859:m.bbk.launcher2]2474  00000000 00000000 00000000 cccccccc d9832c40 c03034e4 c06ce624 c06cf3f0

0xd9832414所指向的内容是 timer_list结构体:

struct timer_list {
	/*
	 * All fields that change during normal runtime grouped to the
	 * same cacheline
	 */
	struct list_head entry; //8 Bytes
	unsigned long expires; ///4 Bytes
	struct tvec_base *base; ///4 Bytes

	void (*function)(unsigned long); ///4 Bytes
	unsigned long data;

	int slack;

#ifdef CONFIG_TIMER_STATS
	int start_pid;
	void *start_site;
	char start_comm[16];
#endif
#ifdef CONFIG_LOCKDEP
	struct lockdep_map lockdep_map;
#endif
};

data的偏离地址是 8+4+4+4=20 Bytes,而此32位CPU,即4Bytes,故对应偏移 20/4=5,所以 0xd9832414+5=0xd9832419所存即为data内容,为d9832380,此值也正好与我们怀疑的模块(一个功能驱动,用到了时间中断)打印出的data地址一致:

<7>[ 8888.334134] (1)[560:er.ServerThread]misc timerirq: stop_timerirq current->pid d9832380