I recently spent a weekend maintaining an open source project called air. It is a live-reloading tool for Go: it watches local files for changes and automatically reloads the program.

Problem encountered

I ran into a particularly interesting problem: killing a process with kill -9 pid does not take its whole process tree with it. In my case the child process went away as well, but the grandchild process survived.

Background

In short, our hot-reload component runs a command and watches for file changes. When a file changes, it kills the previous process, recompiles the code, and runs the command again.
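To make the cycle concrete, here is a minimal sketch of a naive restart loop. It is not air's actual implementation: the runner type, the polling approach, the watched file name main.go, and the sleep command standing in for the real program are all assumptions made for illustration. Note that stop signals a single PID only, which is exactly where the problem described in this post comes from.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"time"
)

// runner owns the currently running command (a hypothetical helper type).
type runner struct {
	cmd *exec.Cmd
}

// stop kills the current process, if any. Only that single PID is signalled.
func (r *runner) stop() {
	if r.cmd == nil || r.cmd.Process == nil {
		return
	}
	r.cmd.Process.Kill()
	r.cmd.Wait() // reap it; the error from a killed process is expected
	r.cmd = nil
}

// restart stops the previous process and starts the command again.
func (r *runner) restart(name string, args ...string) error {
	r.stop()
	r.cmd = exec.Command(name, args...)
	return r.cmd.Start()
}

// modifiedSince reports whether path was modified after t.
func modifiedSince(path string, t time.Time) (bool, error) {
	info, err := os.Stat(path)
	if err != nil {
		return false, err
	}
	return info.ModTime().After(t), nil
}

func main() {
	const watched = "main.go" // hypothetical file to watch
	r := &runner{}
	defer r.stop()

	last := time.Now()
	if err := r.restart("sleep", "60"); err != nil {
		fmt.Println("start failed:", err)
		return
	}
	for i := 0; i < 5; i++ { // bounded so the sketch terminates
		time.Sleep(500 * time.Millisecond)
		changed, err := modifiedSince(watched, last)
		if err != nil || !changed {
			continue
		}
		last = time.Now()
		// A real tool would rebuild here before restarting.
		if err := r.restart("sleep", "60"); err != nil {
			fmt.Println("restart failed:", err)
			return
		}
		fmt.Println("restarted, pid", r.cmd.Process.Pid)
	}
}
```

As long as the command does not spawn further processes, killing one PID like this is enough.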

Then a user reported the following problem: https://github.com/cosmtrek/air/issues/216#issuecomment-982348931 When the run command is dlv exec --accept-multiclient --log --headless --continue --listen :2345 --api-version 2 ./tmp/main, which runs the code under the debugger, our component does not kill everything it started, and part of the process tree survives. As a result, the port is still occupied the next time the command starts.

Troubleshooting

Running ps -efj | grep "tmp/main" shows clearly that this command actually starts three processes:

1594910868 75277 74711   0 10:09PM ttys005    0:00.14 dlv exec --accep xt       75277      0    1 S    s005
1594910868 75280 75277   0 10:09PM ttys005    0:00.02 /Library/Develop xt       75280      0    1 S+   s005
1594910868 75281 75280   0 10:09PM ttys005    0:00.01 ./tmp/main       xt       75280      0    1 SX+  s005

The parent, child, and grandchild relationship between the processes is easy to see:

75277 is the parent process

75280 is the child process

75281 is the grandchild process

If you kill only the parent with kill -9 pid, the child process also goes away, but the grandchild process survives.

> kill -9 75277
> ps -ef | grep "tmp/main"
1594910868 75281     1   0 10:09PM ttys005    0:00.01 ./tmp/main

Only process 75281 is left, and its parent PID is now 1: it has become an orphan process, quite literally.

If this process keeps holding the port, the command cannot start normally the next time.
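The effect can be reproduced without dlv. The following is a minimal sketch, assuming a Unix-like system where sh and sleep are available: a shell starts a background sleep, only the shell is killed with SIGKILL, and the sleep is then checked for existence. The function name orphanSurvives is made up for this example, and the sleep plays the role of ./tmp/main.

```go
package main

import (
	"bufio"
	"fmt"
	"os/exec"
	"strconv"
	"strings"
	"syscall"
)

// orphanSurvives starts a shell that spawns a background sleep, kills only the
// shell with SIGKILL, and reports whether the sleep process is still alive.
func orphanSurvives() (bool, error) {
	// The shell prints the PID of its background child, then waits for it.
	cmd := exec.Command("sh", "-c", "sleep 30 >/dev/null & echo $!; wait")
	out, err := cmd.StdoutPipe()
	if err != nil {
		return false, err
	}
	if err := cmd.Start(); err != nil {
		return false, err
	}
	line, err := bufio.NewReader(out).ReadString('\n')
	if err != nil {
		return false, err
	}
	childPid, err := strconv.Atoi(strings.TrimSpace(line))
	if err != nil {
		return false, err
	}

	// Kill only the shell, the way `kill -9 pid` would.
	if err := cmd.Process.Kill(); err != nil {
		return false, err
	}
	cmd.Wait() // reap the shell; an error is expected because it was killed

	// Signal 0 checks for existence without delivering a signal.
	survived := syscall.Kill(childPid, 0) == nil

	// Clean up the orphan so the demo leaves nothing behind.
	syscall.Kill(childPid, syscall.SIGKILL)
	return survived, nil
}

func main() {
	survived, err := orphanSurvives()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("orphan survived:", survived)
}
```

SIGKILL is delivered only to the PID it is sent to, so any descendant that does not exit on its own keeps running under a new parent.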

Solution

After consulting various materials, I found a good solution: put the processes into one process group, so that they share a process group ID (PGID), and then signal the whole group.

  PID  PPID  PGID   UID   C STIME   TTY             TIME CMD              
75837 74711 75837 1594910868   0 10:22PM ttys005    0:00.23 dlv exec --accep 
75840 75837 75840 1594910868   0 10:22PM ttys005    0:00.02 /Library/Develop 
75841 75840 75840 1594910868   0 10:22PM ttys005    0:00.01 ./tmp/main       

The third column is the PGID. By default the processes started by the command do not all share one PGID, but in Go we can set the process group when we start the command, so that the processes it spawns share a process group ID.
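The PGID can also be inspected from Go. The sketch below, assuming a Unix-like system with a sleep binary, starts a child with and without Setpgid and reads its PGID back with syscall.Getpgid. The helper name childPgid is made up for this example.

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

// childPgid starts "sleep 5" with or without Setpgid, reads the child's PID
// and PGID, and then kills and reaps the child again.
func childPgid(setpgid bool) (pid, pgid int, err error) {
	cmd := exec.Command("sleep", "5")
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: setpgid}
	if err = cmd.Start(); err != nil {
		return 0, 0, err
	}
	defer func() {
		cmd.Process.Kill()
		cmd.Wait()
	}()
	pid = cmd.Process.Pid
	pgid, err = syscall.Getpgid(pid)
	return pid, pgid, err
}

func main() {
	for _, own := range []bool{false, true} {
		pid, pgid, err := childPgid(own)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("Setpgid=%v pid=%d pgid=%d parent pgid=%d\n",
			own, pid, pgid, syscall.Getpgrp())
	}
}
```

Without Setpgid the child inherits the PGID of the Go program; with Setpgid it becomes the leader of a new group whose PGID equals its own PID.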

When killing, you also need to use this PGID, so that the whole process group is killed. From man kill:

Negative PID values may be used to choose whole process groups; see the PGID column in ps command output.

That is, a negative pid is interpreted as a PGID and stands for the entire process group, so the signal is delivered to every process in that group.

The processes in the output above do not share a group, but for this bug we can set Setpgid when starting the command, so that the started process gets its own process group, which its descendants inherit by default. Then syscall.Kill(-pgid, 15) kills the whole process group (15 is SIGTERM).

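// Needs imports "log", "os/exec" and "syscall"; Unix-like systems only.
cmd := exec.Command("some_command") // placeholder command
// Start the command in a new process group (PGID equals the command's PID).
cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
if err := cmd.Start(); err != nil {
	log.Fatal(err)
}

pgid, err := syscall.Getpgid(cmd.Process.Pid)
if err == nil {
	// Note the minus sign: a negative PID addresses the whole process group.
	syscall.Kill(-pgid, syscall.SIGTERM) // SIGTERM is signal 15
}

cmd.Wait() // reap the command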

Conclusion

Add a unit test for this behavior, so that killing all child processes does not regress in later iterations; see https://github.com/cosmtrek/air/commit/1c27effe33a180f3fbbcee8f2d9ea7122d89a50b#diff-6266cec6be43e607de84d431f656ea78fac62405058d84312d9c12f3f52c7462R146
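A check along these lines could look as follows. This is a sketch and not the test from the commit above: the helper names killGroup and groupKillLeavesNoSurvivors are made up, and it assumes a Unix-like system with sh and sleep. It is written as a plain program so that it runs on its own; the same check could be moved into a test function. Instead of polling PIDs, it relies on the stdout pipe reaching EOF only once no process holds its write end any more.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os/exec"
	"syscall"
	"time"
)

// killGroup sends sig to every process in the group led by cmd.
func killGroup(cmd *exec.Cmd, sig syscall.Signal) error {
	pgid, err := syscall.Getpgid(cmd.Process.Pid)
	if err != nil {
		return err
	}
	return syscall.Kill(-pgid, sig) // negative PID targets the whole group
}

// groupKillLeavesNoSurvivors starts a shell with a background child in a new
// process group, kills the group, and reports whether every process that
// inherited the stdout pipe has exited.
func groupKillLeavesNoSurvivors() (bool, error) {
	cmd := exec.Command("sh", "-c", "sleep 30 & echo started; wait")
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	out, err := cmd.StdoutPipe()
	if err != nil {
		return false, err
	}
	if err := cmd.Start(); err != nil {
		return false, err
	}
	rd := bufio.NewReader(out)
	// Wait until the shell has forked the background child.
	if _, err := rd.ReadString('\n'); err != nil {
		return false, err
	}
	if err := killGroup(cmd, syscall.SIGTERM); err != nil {
		return false, err
	}
	// The pipe reaches EOF only when no process holds its write end any more.
	done := make(chan struct{})
	go func() {
		io.Copy(io.Discard, rd)
		close(done)
	}()
	select {
	case <-done:
		cmd.Wait() // reap the shell
		return true, nil
	case <-time.After(5 * time.Second):
		return false, nil
	}
}

func main() {
	ok, err := groupKillLeavesNoSurvivors()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("all processes in the group exited:", ok)
}
```

Setpgid matters here for a second reason: without it the child would share the process group of the Go program, and signalling that group would also hit the program itself.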

Reference Materials

Finally, I would like to share with you some good articles that I have been reading recently. I thought about sending them in a weekly newsletter, but because I read them sporadically, I decided to put them at the end of each blog post. I hope you find them useful!