Notes on a problem caused by lsof hanging

Background

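A new machine running Rocky Linux 9 was added to our k8s cluster. When cdc was scheduled onto it, cdc ran lsof -i:8080, and that command always hung. Yet lsof on the host itself worked fine.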

Investigation

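I started a pod with k debug node/10.2.12.132 -it --image=hub.pingcap.net/jenkins/centos7_golang-1.20 and ran lsof -i:8080 inside it. It hung there too.

Our k8s version does not support attaching a debug container to a pod, so I went to the host directly. Container processes are visible from the host, so I located the process from the pod with ps -ef|grep lsof and debugged it with gdb: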

gdb
attach <pid>
bt

Output:

Program received signal SIGTSTP, Stopped (user).
0x00007f7a7967e0e0 in __close_nocancel () from target:/lib64/libc.so.6
(gdb) bt
#0  0x00007f7a7967e0e0 in __close_nocancel () from target:/lib64/libc.so.6
#1  0x000000000040272a in main ()
(gdb) disassemble
Dump of assembler code for function __close_nocancel:
   0x00007f7a7967e0d9 <+0>:        mov    $0x3,%eax
   0x00007f7a7967e0de <+5>:        syscall
=> 0x00007f7a7967e0e0 <+7>:        cmp    $0xfffffffffffff001,%rax
   0x00007f7a7967e0e6 <+13>:        jae    0x7f7a7967e119 <close+73>
   0x00007f7a7967e0e8 <+15>:        ret
End of assembler dump.

The process keeps making system calls. On Linux (x86-64), syscall number 3 is close. I used catch syscall to trace the system calls:

Catchpoint 1 (call to syscall close), 0x00007f7a7967e0e0 in __close_nocancel () from target:/lib64/libc.so.6
(gdb) c
Continuing.

Catchpoint 1 (returned from syscall close), 0x00007f7a7967e0e0 in __close_nocancel () from target:/lib64/libc.so.6
(gdb)
Continuing.

Catchpoint 1 (call to syscall close), 0x00007f7a7967e0e0 in __close_nocancel () from target:/lib64/libc.so.6
(gdb)
Continuing.

Catchpoint 1 (returned from syscall close), 0x00007f7a7967e0e0 in __close_nocancel () from target:/lib64/libc.so.6
(gdb)
Continuing.

Catchpoint 1 (call to syscall close), 0x00007f7a7967e0e0 in __close_nocancel () from target:/lib64/libc.so.6
(gdb)
Continuing.

Catchpoint 1 (returned from syscall close), 0x00007f7a7967e0e0 in __close_nocancel () from target:/lib64/libc.so.6
(gdb)
Continuing.

Catchpoint 1 (call to syscall close), 0x00007f7a7967e0e0 in __close_nocancel () from target:/lib64/libc.so.6
(gdb)
Continuing.

Catchpoint 1 (returned from syscall close), 0x00007f7a7967e0e0 in __close_nocancel () from target:/lib64/libc.so.6
(gdb)
Continuing.

Catchpoint 1 (call to syscall close), 0x00007f7a7967e0e0 in __close_nocancel () from target:/lib64/libc.so.6
(gdb)
Continuing.

Catchpoint 1 (returned from syscall close), 0x00007f7a7967e0e0 in __close_nocancel () from target:/lib64/libc.so.6
(gdb)
Continuing.

Catchpoint 1 (call to syscall close), 0x00007f7a7967e0e0 in __close_nocancel () from target:/lib64/libc.so.6

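The process calls close over and over, so I suspected the program had run away.

lsof on the host works fine, so the kernel is probably not at fault.

That pointed to a bug in the container's lsof.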

Running the container's lsof on the host:

cd /proc/<pid>/root
sbin/lsof -i:8080

That worked fine as well.

Next I tried copying the host's lsof into the container and running it there:

# cp `which lsof` usr/sbin/lsof
cp: overwrite 'usr/sbin/lsof'? y
#

But the container lacks the required shared libraries, so it would not run:

$ lsof
lsof: error while loading shared libraries: libtirpc.so.3: cannot open shared object file: No such file or directory

So I decided to analyze the binary directly:

(gdb) bt
#0  0x00007ffff7d3f017 in close () from /lib64/libc.so.6
#1  0x0000555555557e4c in main ()
(gdb) up
#1  0x0000555555557e4c in main ()
(gdb) disassemble
Dump of assembler code for function main:
---
   0x0000555555557e0b <+91>:        test   %rax,%rax
   0x0000555555557e0e <+94>:        je     0x5555555594ea <main+5946>
   0x0000555555557e14 <+100>:        add    $0x1,%rax
   0x0000555555557e18 <+104>:        mov    %rax,0x279b1(%rip)        # 0x55555557f7d0
   0x0000555555557e1f <+111>:        mov    $0x3,%ebx
   0x0000555555557e24 <+116>:        call   0x555555557cc0 <getdtablesize@plt>
   0x0000555555557e29 <+121>:        movdqa 0x21caf(%rip),%xmm1        # 0x555555579ae0
   0x0000555555557e31 <+129>:        movd   %eax,%xmm0
   0x0000555555557e35 <+133>:        pmaxsd %xmm1,%xmm0
   0x0000555555557e3a <+138>:        movd   %xmm0,0x280d2(%rip)        # 0x55555557ff14
   0x0000555555557e42 <+146>:        mov    %ebx,%edi
   0x0000555555557e44 <+148>:        add    $0x1,%ebx
   0x0000555555557e47 <+151>:        call   0x555555557d90 <close@plt>
=> 0x0000555555557e4c <+156>:        cmp    %ebx,0x280c2(%rip)        # 0x55555557ff14
   0x0000555555557e52 <+162>:        jg     0x555555557e42 <main+146>
   0x0000555555557e54 <+164>:        lea    0x203ba(%rip),%rbx        # 0x555555578215
   0x0000555555557e5b <+171>:        jmp    0x555555557e62 <main+178>
   0x0000555555557e5d <+173>:        cmp    $0x1,%eax

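There is a loop calling close, and it looks intentional: %ebx starts at 3 and is incremented on every pass, and the jg at main+162 jumps back to main+146 as long as %ebx is below the value stored right after the getdtablesize call.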

I went looking for the source at https://github.com/lsof-org/lsof, but main.c there contains no such loop, which was odd. So I switched to an older version:

git switch v4.94.0

There I found:

    if ((MaxFd = (int) GET_MAX_FD()) < 53)
        MaxFd = 53;

#if defined(HAS_CLOSEFROM)
    (void) closefrom(3);
#else   /* !defined(HAS_CLOSEFROM) */
    for (i = 3; i < MaxFd; i++)
        (void) close(i);
#endif  /* !defined(HAS_CLOSEFROM) */

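GET_MAX_FD turns out to be defined as getdtablesize. This is a libc function; in glibc on Linux it is implemented on top of getrlimit rather than being a system call of its own.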

./proto.h:#define        GET_MAX_FD        getdtablesize

man getdtablesize

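According to the man page, getdtablesize is tied to ulimit: it returns the limit on open files (RLIMIT_NOFILE). So I checked ulimit in each environment.

The problem container: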

$ ulimit -n
1073741816

The host:

$ ulimit -n
1024

Other machines:

$ ulimit -n
1048576
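
To double-check the link between getdtablesize and ulimit -n, here is a minimal Python sketch. It is not part of the original investigation, and the function names are my own. It assumes Linux with a C library that exports getdtablesize (glibc does). It reads the soft and hard RLIMIT_NOFILE values and compares the soft one with what getdtablesize returns.

```python
import ctypes
import resource


def nofile_limits():
    """Return (soft, hard) RLIMIT_NOFILE for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def libc_getdtablesize():
    """Call the C library's getdtablesize(), the function behind GET_MAX_FD."""
    libc = ctypes.CDLL(None)  # symbols already loaded into this process
    libc.getdtablesize.restype = ctypes.c_int
    return libc.getdtablesize()


if __name__ == "__main__":
    soft, hard = nofile_limits()
    # The soft limit is the number that `ulimit -n` prints.
    print("soft:", soft, "hard:", hard)
    print("getdtablesize():", libc_getdtablesize())
```

Run inside the problem container and on the host, the two printed values should differ in the same way as the ulimit -n outputs above.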

Mystery solved

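lsof loops over every file descriptor from 3 up to MaxFd and closes each one, whether or not it is open. This looks like a bug that hurts performance: when max open files is very large, the loop is so slow that lsof appears to hang. In the problem container (limit 1073741816) that is roughly a billion close calls, compared with about a thousand on the host (limit 1024). The latest master seems to mitigate this (maybe).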
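
To get a feel for the cost, the sketch below times a close loop over descriptors that are not open and extrapolates to the container's limit. It is a rough illustration, not a measurement from the original incident. The starting descriptor 2_000_000 and the sample size are arbitrary choices of mine, and Python's per-call overhead is higher than that of the C loop in lsof, so treat the result as an order-of-magnitude figure only.

```python
import os
import time

CONTAINER_MAX_FD = 1073741816  # `ulimit -n` seen in the problem container


def time_close_loop(start_fd, count):
    """Close `count` descriptors starting at `start_fd`, ignoring errors
    for descriptors that are not open, and return the elapsed seconds."""
    began = time.perf_counter()
    for fd in range(start_fd, start_fd + count):
        try:
            os.close(fd)
        except OSError:
            pass  # typically EBADF: nothing was open at this number
    return time.perf_counter() - began


def extrapolate(elapsed, sampled, total):
    """Scale a sampled loop time linearly to `total` iterations."""
    return elapsed / sampled * total


if __name__ == "__main__":
    sample = 100_000
    # Start far above the descriptors a small script normally has open
    # (an assumption), so nothing in use gets closed by accident.
    elapsed = time_close_loop(2_000_000, sample)
    print(f"{sample} closes took {elapsed:.3f}s")
    estimate = extrapolate(elapsed, sample, CONTAINER_MAX_FD)
    print(f"estimated for {CONTAINER_MAX_FD} closes: {estimate:.0f}s")
```

With a debugger or tracer attached, each call becomes far more expensive, which would make the loop look even more stuck.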

Solution

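The fix is to reduce max open files. The host's ulimit is small while the container's is huge, so I suspected the container runtime was setting a default; docker itself has an option for this. The value differs between hosts, which suggests it depends on the host. I went through the containerd configuration and found no relevant setting.

Some experiments showed that max open files can only be lowered, never raised, which suggests the value is inherited. So, starting from the lsof process, I walked up the chain of parent processes and checked the limits of each one: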

[root@10 ~]# cat /proc/87625/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             unlimited            unlimited            processes
Max open files            1073741816           1073741816           files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       1540070              1540070              signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
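
The walk up the parent chain can be scripted. The following is a minimal sketch, assuming a Linux /proc filesystem and enough privileges to read other processes' limits. The helper names are hypothetical and not from the original investigation. It yields Max open files for a PID and each of its ancestors, which makes the point where the value changes easy to spot.

```python
def parse_max_open_files(limits_text):
    """Extract (soft, hard) from the text of /proc/<pid>/limits.
    Values are returned as strings because they may be 'unlimited'."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            fields = line[len("Max open files"):].split()
            return fields[0], fields[1]
    raise ValueError("no 'Max open files' row found")


def parse_ppid(status_text):
    """Extract the parent PID from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("PPid:"):
            return int(line.split(":", 1)[1])
    raise ValueError("no 'PPid' row found")


def read_text(path):
    """Read a whole file as text."""
    with open(path) as handle:
        return handle.read()


def walk_ancestors(pid):
    """Yield (pid, soft, hard) for `pid` and every ancestor.
    Stops once the parent PID is 0, i.e. after PID 1."""
    while pid > 0:
        soft, hard = parse_max_open_files(read_text(f"/proc/{pid}/limits"))
        yield pid, soft, hard
        pid = parse_ppid(read_text(f"/proc/{pid}/status"))
```

Calling walk_ancestors with the PID of the hung lsof on the host and printing each tuple gives one line per ancestor, ending at PID 1.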

I also compared different hosts, and finally found that systemd's Max open files differs from host to host.

There is no need to change systemd's own configuration. Setting LimitNOFILE in the [Service] section of the containerd unit is enough:

vim /etc/systemd/system/containerd.service

---
LimitNOFILE=1048576

Then reload and restart:

[root@10 ~]# systemctl daemon-reload
[root@10 ~]# systemctl restart containerd.service

Problem solved.