Analyzing the wiki hang problem
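Starting in September 2024, the wiki began to hang and go down intermittently. At the time I set up monitoring and alerting, and the stopgap was to run reboot whenever the problem occurred.
Later, however, I found that once the problem appeared it was hard even to connect over ssh, and reboot itself responded very slowly, so this page analyzes the problem thoroughly.
After an incident, the server monitoring clearly showed abnormal system CPU and memory usage: both were essentially exhausted, and swap was completely full.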
root@iZ23diqq85dZ:/var/log/apache2# free -m
             total       used       free     shared    buffers     cached
Mem:          2012       1958         54          0          2         14
-/+ buffers/cache:       1941         71
Swap:         1023       1023          0
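The swap exhaustion shown above could also be checked from a script, for example as an input to the monitoring alert. The following is a minimal sketch, not the monitoring actually in use: the function name swap_used_pct is hypothetical, and it assumes the standard SwapTotal and SwapFree fields of /proc/meminfo on Linux.

```shell
#!/bin/sh
# Print the percentage of swap in use (truncated to an integer),
# given /proc/meminfo-formatted text on stdin.
# swap_used_pct is a hypothetical helper name, not part of the original setup.
swap_used_pct() {
    awk '
        /^SwapTotal:/ { total = $2 }   # value in kB
        /^SwapFree:/  { free  = $2 }   # value in kB
        END {
            if (total > 0) { printf "%d\n", (total - free) * 100 / total }
            else           { print 0 }  # no swap configured
        }'
}

# Typical use on a live Linux system:
#   swap_used_pct < /proc/meminfo
```

Fed the swap figures from the top output on this page (1048572 kB total, 580 kB free), it reports 99.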
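top shows a very high load average with the CPU spending most of its time in I/O wait (80.4 wa). Almost all of the listed processes are apache, and many of them are in the D state: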
root@iZ23diqq85dZ:/var/log/apache2# top
top - 12:45:01 up  1:27,  2 users,  load average: 86.24, 80.57, 80.64
Tasks: 234 total,   2 running, 232 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.7 us, 17.9 sy,  0.0 ni,  0.0 id, 80.4 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:   2061064 total,  2008040 used,    53024 free,     2628 buffers
KiB Swap:  1048572 total,  1047992 used,      580 free,    14392 cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+ COMMAND
 5586 mysql     20   0 1341m  58m  916 S   5.6  2.9  0:05.22 mysqld
 5250 www-data  20   0  177m  15m 3836 D   5.0  0.8  0:02.09 apache2
 5091 www-data  20   0  183m 9444 3240 D   3.7  0.5  0:02.43 apache2
 5088 www-data  20   0  188m  22m 2860 D   3.3  1.1  0:03.46 apache2
 5261 www-data  20   0  183m 9380 3236 D   2.7  0.5  0:01.79 apache2
 5242 www-data  20   0  184m  16m 4304 D   2.3  0.8  0:02.20 apache2
 5282 www-data  20   0  188m  21m 2940 D   2.3  1.1  0:01.98 apache2
 5310 www-data  20   0  187m  21m 2708 D   2.3  1.1  0:02.11 apache2
 5590 www-data  20   0  181m  19m 3908 S   2.3  1.0  0:00.50 apache2
 5113 www-data  20   0  183m  14m 3528 D   1.0  0.7  0:01.77 apache2
 5240 www-data  20   0  178m 9768 3200 D   1.0  0.5  0:01.62 apache2
 5278 www-data  20   0  181m  19m 3916 D   1.0  1.0  0:00.77 apache2
 5080 www-data  20   0  188m  23m 2988 D   0.7  1.2  0:03.09 apache2
 2008 root      20   0  426m 7780  864 S   0.3  0.4  1:04.18 exe
 4956 www-data  20   0  183m  13m 3456 D   0.3  0.7  0:07.05 apache2
 4969 www-data  20   0  188m  22m 2912 D   0.3  1.1  0:06.81 apache2
 5085 www-data  20   0  187m  25m 4256 D   0.3  1.3  0:01.32 apache2
 5234 www-data  20   0  188m  25m 4224 D   0.3  1.3  0:02.33 apache2
 5241 www-data  20   0  175m 8604 3224 D   0.3  0.4  0:01.04 apache2
 5254 www-data  20   0  188m  24m 2928 D   0.3  1.2  0:01.50 apache2
 5256 www-data  20   0  245m  13m 3524 D   0.3  0.7  0:02.12 apache2
 5266 www-data  20   0  183m  22m 4268 D   0.3  1.1  0:01.28 apache2
 5267 www-data  20   0  188m  23m 2964 D   0.3  1.2  0:01.53 apache2
 5284 www-data  20   0  188m  24m 2936 D   0.3  1.2  0:01.07 apache2
 5286 www-data  20   0  188m  23m 2928 D   0.3  1.2  0:01.99 apache2
 5597 www-data  20   0  181m  19m 3872 S   0.3  0.9  0:00.27 apache2
Pick a few of them and look at the process details:
root@iZ23diqq85dZ:/var/log/apache2# cat /proc/5242/status /proc/5242/stack
Name:   apache2
State:  D (disk sleep)
Tgid:   5242
Pid:    5242
PPid:   2069
TracerPid:      0
Uid:    33      33      33      33
Gid:    33      33      33      33
FDSize: 64
Groups: 33
VmPeak:   193612 kB
VmSize:   188756 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:     26828 kB
VmRSS:     17804 kB
VmData:    30780 kB
VmStk:       136 kB
VmExe:       456 kB
VmLib:     23760 kB
VmPTE:       352 kB
VmSwap:     9296 kB
Threads:        1
SigQ:   0/16008
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000001000
SigCgt: 000000018c0046eb
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: ffffffffffffffff
Cpus_allowed:   3
Cpus_allowed_list:      0-1
Mems_allowed:   00000000,00000001
Mems_allowed_list:      0
voluntary_ctxt_switches:        4406
nonvoluntary_ctxt_switches:     558
[<ffffffff8119a2fc>] get_request_wait+0x105/0x18f
[<ffffffff8105fd53>] autoremove_wake_function+0x0/0x2a
[<ffffffff8119b25b>] blk_queue_bio+0x17f/0x28c
[<ffffffff81199d88>] generic_make_request+0x90/0xcf
[<ffffffff81199e9a>] submit_bio+0xd3/0xf1
[<ffffffff810bc1cf>] test_set_page_writeback+0xdc/0xeb
[<ffffffff810de46d>] swap_writepage+0x8b/0x95
[<ffffffff810c3433>] shrink_page_list+0x40d/0x73f
[<ffffffff810ca636>] zone_page_state_add+0x14/0x23
[<ffffffff810c3b89>] shrink_inactive_list+0x256/0x3f0
[<ffffffff8107116d>] arch_local_irq_save+0x11/0x17
[<ffffffff810c43c5>] shrink_zone+0x3c0/0x4e6
[<ffffffff810c48e3>] do_try_to_free_pages+0x1cc/0x41c
[<ffffffff810c4d9e>] try_to_free_pages+0xa9/0xe9
[<ffffffff810bbc75>] __alloc_pages_nodemask+0x4ed/0x7aa
[<ffffffff810380dd>] set_next_entity+0x32/0x55
[<ffffffff810e6969>] alloc_pages_vma+0x12d/0x136
[<ffffffff810de82f>] read_swap_cache_async+0x67/0x142
[<ffffffff810de961>] swapin_readahead+0x57/0x9a
[<ffffffff810d165a>] handle_pte_fault+0x347/0x79f
[<ffffffff810ceb49>] pte_offset_kernel+0x16/0x35
[<ffffffff813533ee>] do_page_fault+0x320/0x345
[<ffffffff810d6a04>] mmap_region+0x353/0x44a
[<ffffffff81350a25>] async_page_fault+0x25/0x30
[<ffffffffffffffff>] 0xffffffffffffffff
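The apache processes in the D state are almost all stuck swapping memory. Reading the stack from the bottom up: a page fault tries to swap a page back in (swapin_readahead), the allocation for it falls into direct reclaim (try_to_free_pages), and reclaim then blocks on disk I/O while writing another page out to swap (swap_writepage, get_request_wait). In other words, the machine is out of memory.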
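To see at a glance how many processes are stuck in the D state, and which commands they belong to, something like the following could be used. It is a rough sketch: count_d_state is a hypothetical helper name, and it assumes input in the two-column form produced by `ps -eo stat=,comm=` (procps).

```shell
#!/bin/sh
# Count processes in uninterruptible sleep (state D), grouped by command name.
# Input on stdin: one "STAT COMMAND" pair per line.
# count_d_state is a hypothetical helper name used only for this illustration.
count_d_state() {
    # A STAT value starting with "D" (e.g. D, D+, Ds) means uninterruptible sleep.
    awk '$1 ~ /^D/ { n[$2]++ } END { for (c in n) print n[c], c }' | sort -rn
}

# Typical use on a live system:
#   ps -eo stat=,comm= | count_d_state
```

During an incident like the one above, apache2 would be expected at the top of the list.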
Next, sort the top output by memory usage:
top - 12:56:38 up  1:39,  2 users,  load average: 101.88, 108.95, 97.38
Tasks: 233 total,   1 running, 232 sleeping,   0 stopped,   0 zombie
%Cpu(s): 14.2 us,  2.2 sy,  0.0 ni,  0.0 id, 83.5 wa,  0.0 hi,  0.2 si,  0.0 st

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+ COMMAND
 5847 mysql     20   0  351m  36m 2876 D   0.7  1.8  0:00.13 mysqld
 5023 www-data  20   0  187m  25m 3592 S   0.0  1.2  0:02.37 apache2
 5270 www-data  20   0  186m  23m 3552 S   1.0  1.2  0:02.11 apache2
 5592 www-data  20   0  186m  21m 2324 S   0.0  1.1  0:01.28 apache2
 5289 www-data  20   0  184m  21m 3408 D   0.0  1.1  0:02.44 apache2
 5063 www-data  20   0  183m  21m 3780 S   0.3  1.1  0:03.26 apache2
 5299 www-data  20   0  184m  20m 3316 D   0.0  1.0  0:01.88 apache2
 5291 www-data  20   0  247m  20m 3800 S   0.3  1.0  0:02.23 apache2
 5242 www-data  20   0  184m  20m 3316 D   0.0  1.0  0:03.11 apache2
 5269 www-data  20   0  183m  20m 3860 S   0.0  1.0  0:02.45 apache2
 5266 www-data  20   0  183m  20m 3420 D   1.0  1.0  0:01.88 apache2
 5286 www-data  20   0  183m  20m 3420 S   1.3  1.0  0:02.87 apache2
 5232 www-data  20   0  244m  19m 3800 S   0.0  1.0  0:01.92 apache2
 5618 www-data  20   0  186m  19m 3636 S   0.0  1.0  0:00.99 apache2
 5015 www-data  20   0  183m  19m 3800 S   0.0  1.0  0:04.99 apache2
 5301 www-data  20   0  181m  19m 3604 S   0.3  1.0  0:01.89 apache2
 5617 www-data  20   0  183m  19m 3220 D   0.0  1.0  0:00.94 apache2
 5595 www-data  20   0  186m  19m 2028 S   0.0  1.0  0:01.18 apache2
 5254 www-data  20   0  183m  19m 2508 S   0.0  1.0  0:02.11 apache2
 5264 www-data  20   0  181m  18m 3876 S   0.0  0.9  0:03.15 apache2
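Each apache process uses about 1% of memory, roughly 20 MB resident. And how many such processes are there? 150! At about 20 MB each, that is on the order of 3000 MB, against 2012 MB of RAM.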
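The per-process figure above was read off top by eye. A sketch of measuring it instead is shown below; rss_summary is a hypothetical helper name, and it assumes a list of RSS values in KiB such as `ps -C apache2 -o rss=` prints. Note that RSS counts pages shared between processes once per process, so the total overstates real usage somewhat and should be treated as a conservative estimate.

```shell
#!/bin/sh
# Summarise resident memory from a list of RSS values (KiB, one per line) on stdin.
# Prints: procs=<count> avg_mb=<average> total_mb=<sum>, with MB values truncated.
# rss_summary is a hypothetical helper name used only for this illustration.
rss_summary() {
    awk '
        $1 ~ /^[0-9]+$/ { n++; sum += $1 }   # skip blank or non-numeric lines
        END {
            if (n == 0) { print "procs=0 avg_mb=0 total_mb=0"; exit }
            printf "procs=%d avg_mb=%d total_mb=%d\n", n, sum / n / 1024, sum / 1024
        }'
}

# Typical use on a live system:
#   ps -C apache2 -o rss= | rss_summary
```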
With that many processes the memory is naturally not enough, so a concurrency limit is needed. The apache configuration file contains:
<IfModule mpm_prefork_module>
    StartServers          5
    MinSpareServers       5
    MaxSpareServers      10
    MaxClients          150
    MaxRequestsPerChild   0
</IfModule>
Change MaxClients to 50 and restart.
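One rough way to sanity-check such a limit is to divide the memory left for apache by the per-process footprint. The sketch below is only an illustration under stated assumptions, not the method used to arrive at 50: estimate_max_workers is a hypothetical helper, the 20 MB per worker is the approximate RES seen in top, and the 512 MB reserved for MySQL and the rest of the system is an assumed figure.

```shell
#!/bin/sh
# Estimate a worker cap: (total RAM - reserve) / per-worker memory, all in MB.
# estimate_max_workers is a hypothetical helper; the reserve is an assumed value.
estimate_max_workers() {
    total_mb=$1        # physical RAM
    reserve_mb=$2      # memory kept for MySQL, the OS and caches (assumption)
    per_worker_mb=$3   # approximate resident size of one apache process
    if [ "$per_worker_mb" -le 0 ] || [ "$reserve_mb" -ge "$total_mb" ]; then
        echo "invalid input" >&2
        return 1
    fi
    # Shell arithmetic is integer-only, so the result is rounded down.
    echo $(( (total_mb - reserve_mb) / per_worker_mb ))
}

# 2012 MB of RAM, 512 MB reserved (assumed), about 20 MB per worker
estimate_max_workers 2012 512 20   # prints 75
```

Under these assumptions the estimate comes out at 75, so a limit of 50 leaves additional headroom. On Apache 2.4 the same directive is named MaxRequestWorkers, with MaxClients still accepted as the old name.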