Analyzing the wiki hang problem

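Since September '24 the wiki has occasionally hung and gone down. Monitoring and alerting were set up at that time, and the stopgap was to run reboot whenever the problem appeared.
Later it turned out that once the problem occurs, even ssh can barely connect and reboot itself responds very slowly, so a thorough analysis is in order.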

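When the problem occurs, the server monitoring clearly shows abnormal system CPU and memory usage: both are almost completely used up.
[Figure: Cpu Mem信息.jpg (CPU and memory monitoring graph)]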

root@iZ23diqq85dZ:/var/log/apache2# free -m
             total       used       free     shared    buffers     cached
Mem:          2012       1958         54          0          2         14
-/+ buffers/cache:       1941         71
Swap:         1023       1023          0
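The free output above can be read more precisely with a little arithmetic: what matters is how much memory applications hold once reclaimable buffers and cache are subtracted, and whether swap is exhausted too. The following is a minimal sketch, assuming the old free layout shown above (the one with a "-/+ buffers/cache" row); parse_free is a hypothetical helper name, not part of any tool.

```python
def parse_free(output: str) -> dict:
    """Parse old-style `free -m` output (layout with a '-/+ buffers/cache' row)."""
    stats = {}
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Mem:":
            total, used, _free, _shared, buffers, cached = map(int, fields[1:7])
            stats["mem_total"] = total
            # Memory held by applications: used minus reclaimable buffers and cache.
            # free itself computes this from kB values, so its rounded MB figure
            # can differ by one from this subtraction.
            stats["app_used"] = used - buffers - cached
        elif fields[0] == "Swap:":
            stats["swap_total"] = int(fields[1])
            stats["swap_used"] = int(fields[2])
    return stats


# The values below are copied from the transcript above.
sample = """\
             total       used       free     shared    buffers     cached
Mem:          2012       1958         54          0          2         14
-/+ buffers/cache:       1941         71
Swap:         1023       1023          0
"""

stats = parse_free(sample)
print(stats)
```

With only 2 MB of buffers and 14 MB of cache, almost nothing is left for the kernel to reclaim cheaply, and swap is completely full as well.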

The top output shows a very high load average. Almost all of the listed processes are apache, and many of them are in the D state (uninterruptible sleep):

root@iZ23diqq85dZ:/var/log/apache2# top
top - 12:45:01 up  1:27,  2 users,  load average: 86.24, 80.57, 80.64
Tasks: 234 total,   2 running, 232 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.7 us, 17.9 sy,  0.0 ni,  0.0 id, 80.4 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:   2061064 total,  2008040 used,    53024 free,     2628 buffers
KiB Swap:  1048572 total,  1047992 used,      580 free,    14392 cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
 5586 mysql     20   0 1341m  58m  916 S   5.6  2.9   0:05.22 mysqld
 5250 www-data  20   0  177m  15m 3836 D   5.0  0.8   0:02.09 apache2
 5091 www-data  20   0  183m 9444 3240 D   3.7  0.5   0:02.43 apache2
 5088 www-data  20   0  188m  22m 2860 D   3.3  1.1   0:03.46 apache2
 5261 www-data  20   0  183m 9380 3236 D   2.7  0.5   0:01.79 apache2
 5242 www-data  20   0  184m  16m 4304 D   2.3  0.8   0:02.20 apache2
 5282 www-data  20   0  188m  21m 2940 D   2.3  1.1   0:01.98 apache2
 5310 www-data  20   0  187m  21m 2708 D   2.3  1.1   0:02.11 apache2
 5590 www-data  20   0  181m  19m 3908 S   2.3  1.0   0:00.50 apache2
 5113 www-data  20   0  183m  14m 3528 D   1.0  0.7   0:01.77 apache2
 5240 www-data  20   0  178m 9768 3200 D   1.0  0.5   0:01.62 apache2
 5278 www-data  20   0  181m  19m 3916 D   1.0  1.0   0:00.77 apache2
 5080 www-data  20   0  188m  23m 2988 D   0.7  1.2   0:03.09 apache2
 2008 root      20   0  426m 7780  864 S   0.3  0.4   1:04.18 exe
 4956 www-data  20   0  183m  13m 3456 D   0.3  0.7   0:07.05 apache2
 4969 www-data  20   0  188m  22m 2912 D   0.3  1.1   0:06.81 apache2
 5085 www-data  20   0  187m  25m 4256 D   0.3  1.3   0:01.32 apache2
 5234 www-data  20   0  188m  25m 4224 D   0.3  1.3   0:02.33 apache2
 5241 www-data  20   0  175m 8604 3224 D   0.3  0.4   0:01.04 apache2
 5254 www-data  20   0  188m  24m 2928 D   0.3  1.2   0:01.50 apache2
 5256 www-data  20   0  245m  13m 3524 D   0.3  0.7   0:02.12 apache2
 5266 www-data  20   0  183m  22m 4268 D   0.3  1.1   0:01.28 apache2
 5267 www-data  20   0  188m  23m 2964 D   0.3  1.2   0:01.53 apache2
 5284 www-data  20   0  188m  24m 2936 D   0.3  1.2   0:01.07 apache2
 5286 www-data  20   0  188m  23m 2928 D   0.3  1.2   0:01.99 apache2
 5597 www-data  20   0  181m  19m 3872 S   0.3  0.9   0:00.27 apache2
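On Linux the load average counts tasks in uninterruptible sleep (D) as well as runnable ones, which is why the load can exceed 80 while the CPUs spend about 80% of their time in iowait. A quick way to quantify this is to tally process states from /proc/<pid>/stat. The sketch below is only an illustration: the function names are hypothetical, and the sample lines are made up and truncated after the parent PID field.

```python
import glob
from collections import Counter


def parse_stat_line(line: str) -> tuple:
    """Return (comm, state) from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    start = line.index("(")
    end = line.rindex(")")
    comm = line[start + 1:end]
    state = line[end + 1:].split()[0]
    return comm, state


def count_states(stat_lines, comm_filter=None) -> Counter:
    """Count process states, optionally only for one command name."""
    counts = Counter()
    for line in stat_lines:
        comm, state = parse_stat_line(line)
        if comm_filter is None or comm == comm_filter:
            counts[state] += 1
    return counts


def live_stat_lines():
    """Yield the stat line of every process currently visible in /proc."""
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as f:
                yield f.read()
        except OSError:
            continue  # the process exited between listing and reading


# Hypothetical sample lines (truncated), so the example runs anywhere.
sample = [
    "5242 (apache2) D 2069",
    "5250 (apache2) D 2069",
    "5590 (apache2) S 2069",
    "5586 (mysqld) S 1",
]
print(count_states(sample, "apache2"))
```

On the affected server, count_states(live_stat_lines(), "apache2") would give the number of apache workers per state without having to read through the top listing by eye.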

Picking a few of them and looking at the process details:

root@iZ23diqq85dZ:/var/log/apache2# cat /proc/5242/status  /proc/5242/stack
Name:   apache2
State:  D (disk sleep)
Tgid:   5242
Pid:    5242
PPid:   2069
TracerPid:      0
Uid:    33      33      33      33
Gid:    33      33      33      33
FDSize: 64
Groups: 33
VmPeak:   193612 kB
VmSize:   188756 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:     26828 kB
VmRSS:     17804 kB
VmData:    30780 kB
VmStk:       136 kB
VmExe:       456 kB
VmLib:     23760 kB
VmPTE:       352 kB
VmSwap:     9296 kB
Threads:        1
SigQ:   0/16008
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000001000
SigCgt: 000000018c0046eb
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: ffffffffffffffff
Cpus_allowed:   3
Cpus_allowed_list:      0-1
Mems_allowed:   00000000,00000001
Mems_allowed_list:      0
voluntary_ctxt_switches:        4406
nonvoluntary_ctxt_switches:     558
[<ffffffff8119a2fc>] get_request_wait+0x105/0x18f
[<ffffffff8105fd53>] autoremove_wake_function+0x0/0x2a
[<ffffffff8119b25b>] blk_queue_bio+0x17f/0x28c
[<ffffffff81199d88>] generic_make_request+0x90/0xcf
[<ffffffff81199e9a>] submit_bio+0xd3/0xf1
[<ffffffff810bc1cf>] test_set_page_writeback+0xdc/0xeb
[<ffffffff810de46d>] swap_writepage+0x8b/0x95
[<ffffffff810c3433>] shrink_page_list+0x40d/0x73f
[<ffffffff810ca636>] zone_page_state_add+0x14/0x23
[<ffffffff810c3b89>] shrink_inactive_list+0x256/0x3f0
[<ffffffff8107116d>] arch_local_irq_save+0x11/0x17
[<ffffffff810c43c5>] shrink_zone+0x3c0/0x4e6
[<ffffffff810c48e3>] do_try_to_free_pages+0x1cc/0x41c
[<ffffffff810c4d9e>] try_to_free_pages+0xa9/0xe9
[<ffffffff810bbc75>] __alloc_pages_nodemask+0x4ed/0x7aa
[<ffffffff810380dd>] set_next_entity+0x32/0x55
[<ffffffff810e6969>] alloc_pages_vma+0x12d/0x136
[<ffffffff810de82f>] read_swap_cache_async+0x67/0x142
[<ffffffff810de961>] swapin_readahead+0x57/0x9a
[<ffffffff810d165a>] handle_pte_fault+0x347/0x79f
[<ffffffff810ceb49>] pte_offset_kernel+0x16/0x35
[<ffffffff813533ee>] do_page_fault+0x320/0x345
[<ffffffff810d6a04>] mmap_region+0x353/0x44a
[<ffffffff81350a25>] async_page_fault+0x25/0x30
[<ffffffffffffffff>] 0xffffffffffffffff

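Reading the stack from the bottom up: a page fault tries to bring a page back in from swap (swapin_readahead), the page allocation falls into direct reclaim (try_to_free_pages), reclaim has to write other pages out to swap (swap_writepage), and the process then blocks waiting for a free block-layer request (get_request_wait). In other words, these D-state apache processes are almost all stuck swapping, which means the machine is out of memory.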
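To check the same thing across many workers without reading each dump by hand, the State and VmSwap fields of /proc/<pid>/status are enough. This is a minimal sketch with hypothetical helper names; the sample text is an excerpt of the status dump for PID 5242 shown above.

```python
def parse_status(text: str) -> dict:
    """Turn the 'Key:   value' lines of /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def kb(value: str) -> int:
    """Convert a value such as '9296 kB' to an integer number of kB."""
    return int(value.split()[0])


# Excerpt of the status dump for PID 5242 above.
sample = """\
Name:   apache2
State:  D (disk sleep)
VmHWM:     26828 kB
VmRSS:     17804 kB
VmSwap:     9296 kB
"""

info = parse_status(sample)
swapped = kb(info["VmSwap"])
resident = kb(info["VmRSS"])
print(info["Name"], info["State"], "swapped kB:", swapped, "resident kB:", resident)
```

A worker in state D with a non-zero VmSwap, as here, is consistent with the stack trace: part of the process has already been pushed out to swap, and it is waiting on disk I/O to get pages back.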

Next, sort the top output by memory:

top - 12:56:38 up  1:39,  2 users,  load average: 101.88, 108.95, 97.38
Tasks: 233 total,   1 running, 232 sleeping,   0 stopped,   0 zombie
%Cpu(s): 14.2 us,  2.2 sy,  0.0 ni,  0.0 id, 83.5 wa,  0.0 hi,  0.2 si,  0.0 st

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
 5847 mysql     20   0  351m  36m 2876 D   0.7  1.8   0:00.13 mysqld
 5023 www-data  20   0  187m  25m 3592 S   0.0  1.2   0:02.37 apache2
 5270 www-data  20   0  186m  23m 3552 S   1.0  1.2   0:02.11 apache2
 5592 www-data  20   0  186m  21m 2324 S   0.0  1.1   0:01.28 apache2
 5289 www-data  20   0  184m  21m 3408 D   0.0  1.1   0:02.44 apache2
 5063 www-data  20   0  183m  21m 3780 S   0.3  1.1   0:03.26 apache2
 5299 www-data  20   0  184m  20m 3316 D   0.0  1.0   0:01.88 apache2
 5291 www-data  20   0  247m  20m 3800 S   0.3  1.0   0:02.23 apache2
 5242 www-data  20   0  184m  20m 3316 D   0.0  1.0   0:03.11 apache2
 5269 www-data  20   0  183m  20m 3860 S   0.0  1.0   0:02.45 apache2
 5266 www-data  20   0  183m  20m 3420 D   1.0  1.0   0:01.88 apache2
 5286 www-data  20   0  183m  20m 3420 S   1.3  1.0   0:02.87 apache2
 5232 www-data  20   0  244m  19m 3800 S   0.0  1.0   0:01.92 apache2
 5618 www-data  20   0  186m  19m 3636 S   0.0  1.0   0:00.99 apache2
 5015 www-data  20   0  183m  19m 3800 S   0.0  1.0   0:04.99 apache2
 5301 www-data  20   0  181m  19m 3604 S   0.3  1.0   0:01.89 apache2
 5617 www-data  20   0  183m  19m 3220 D   0.0  1.0   0:00.94 apache2
 5595 www-data  20   0  186m  19m 2028 S   0.0  1.0   0:01.18 apache2
 5254 www-data  20   0  183m  19m 2508 S   0.0  1.0   0:02.11 apache2
 5264 www-data  20   0  181m  18m 3876 S   0.0  0.9   0:03.15 apache2

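Each apache process uses roughly 1% of memory (about 20 MB resident). And how many of these processes are there? 150!
Memory is clearly not enough for that, so the concurrency limit has to come down. Checking the apache configuration file: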

<IfModule mpm_prefork_module>
    StartServers          5
    MinSpareServers       5
    MaxSpareServers      10
    MaxClients          150
    MaxRequestsPerChild   0
</IfModule>

Change MaxClients to 50 and restart.
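The reasoning behind lowering MaxClients can be checked with back-of-the-envelope arithmetic. The sketch below uses the 2012 MB total from free -m and roughly 20 MB resident per worker from the top output. RESERVED_MB is purely an assumption (headroom for mysqld, the OS and page cache), and the function names are hypothetical. RES also includes pages shared between workers, so the per-worker figure errs on the high side.

```python
def total_apache_mb(workers: int, per_worker_mb: float) -> float:
    """Approximate resident memory needed if all workers are busy."""
    return workers * per_worker_mb


def max_workers(ram_mb: int, reserved_mb: int, per_worker_mb: float) -> int:
    """Largest worker count that fits in RAM after reserving headroom."""
    return int((ram_mb - reserved_mb) // per_worker_mb)


RAM_MB = 2012        # total memory reported by free -m
PER_WORKER_MB = 20   # typical RES of an apache2 worker in the top output
RESERVED_MB = 512    # assumption: headroom for mysqld, the OS and page cache

needed_at_150 = total_apache_mb(150, PER_WORKER_MB)
needed_at_50 = total_apache_mb(50, PER_WORKER_MB)
ceiling = max_workers(RAM_MB, RESERVED_MB, PER_WORKER_MB)

print("MaxClients 150 needs about", needed_at_150, "MB of", RAM_MB, "MB")
print("MaxClients 50 needs about", needed_at_50, "MB")
print("Estimated ceiling under these assumptions:", ceiling)
```

Under these assumptions 150 workers would need far more than the physical RAM, while 50 workers stay comfortably below the estimated ceiling, which leaves a margin for traffic spikes and for workers that grow over time.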