While reading the explanations of these system calls, I realized that I first needed to understand how signals work in a Unix-like system. Signals are a core mechanism that large systems cannot avoid, because they allow communication between the operating system and processes as well as between one process and another. However, the design of signals also brings certain problems, and the key issues are described below.

Most of these problems come from the asynchronous nature of signal delivery. When a signal arrives while the process is blocked in a system call such as I/O, the system call can be interrupted and fail with the error EINTR. Setting SA_RESTART on the handler makes some system calls restart automatically, but not all of them, so a program that does not check for EINTR may end up with partial data or a lost operation. Signals can also cause race conditions: if a signal handler and the main program access shared resources without synchronization, the results can be inconsistent. Finally, if a signal arrives before the signal handler is installed, the handler never sees it; the signal's default action is taken instead, which may ignore the signal or even terminate the process.
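The EINTR behaviour described above is usually handled in one of two ways, sketched below: retrying the call by hand, or asking the kernel to restart it with SA_RESTART. The helper names read_retry() and install_restarting_handler() are hypothetical and exist only for this illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Retry read() for as long as it fails only because a signal handler
 * interrupted it (errno == EINTR). Any other result is returned as-is. */
static ssize_t read_retry(int fd, void *buf, size_t count)
{
    ssize_t n;
    do {
        n = read(fd, buf, count);
    } while (n == -1 && errno == EINTR);
    return n;
}

/* Install a handler with SA_RESTART so that restartable system calls
 * resume automatically after the handler returns. Not every system
 * call honours this flag, so EINTR checks are still needed elsewhere. */
static int install_restarting_handler(int signo, void (*handler)(int))
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = handler;
    sa.sa_flags = SA_RESTART;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}
```

The manual loop is the more defensive choice, because it also covers the calls that SA_RESTART never restarts.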

signalfd()

The signalfd() system call was introduced to address some of these problems by converting asynchronous signal delivery into synchronous processing. signalfd() creates a file descriptor from which signals can be read, allowing a process to handle signals in the same way it handles I/O.

The system call works as follows. First, the set of signals to be received through signalfd() must be defined. A signal set (sigset_t) is created to hold those signals, and sigprocmask() is used to block them, so that they are no longer delivered through traditional signal handlers or default actions and no longer interrupt the process. After that, signalfd() is called with the signal set and returns a file descriptor. The file descriptor becomes readable only when one of the signals in the set is pending. This allows the process to monitor and handle signals synchronously. Once the file descriptor is readable, read() is used to fetch a signalfd_siginfo structure describing the signal directly from the file descriptor, and ordinary code can then carry out the work a handler would have done.
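The steps above can be sketched in C roughly as follows. This is a minimal illustration for a single signal; the helper names open_signal_fd() and read_one_signal() are hypothetical, and error handling is reduced to returning -1.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/signalfd.h>
#include <unistd.h>

/* Block `signo` so it is not delivered asynchronously, then return a
 * signalfd file descriptor for it (or -1 on error). */
static int open_signal_fd(int signo)
{
    sigset_t mask;
    sigemptyset(&mask);
    sigaddset(&mask, signo);
    if (sigprocmask(SIG_BLOCK, &mask, NULL) == -1)
        return -1;
    return signalfd(-1, &mask, SFD_CLOEXEC);
}

/* Read one pending signal from the descriptor. Blocks until a signal
 * in the set is pending. Returns the signal number, or -1 on error. */
static int read_one_signal(int sfd)
{
    struct signalfd_siginfo si;
    ssize_t n = read(sfd, &si, sizeof si);
    if (n != (ssize_t)sizeof si)
        return -1;
    return (int)si.ssi_signo;
}
```

A caller would typically invoke open_signal_fd(SIGINT) once at startup and then call read_one_signal() from its main loop whenever it is ready to look at signals.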

When a signal arrives, the file descriptor becomes readable, and a call such as epoll_wait() that is monitoring it returns to notify the process. The process does not have to deal with the signal immediately; when it is ready to respond, it calls read() on the file descriptor. The arrival of the signal itself no longer suspends whatever the process is doing in order to run a handler first.
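A small sketch of this pattern with epoll is shown below. It assumes the caller has already blocked the signals and created the signalfd; epoll_watch() and next_signal() are hypothetical helper names used only here.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/epoll.h>
#include <sys/signalfd.h>
#include <unistd.h>

/* Create an epoll instance that watches `fd` for readability.
 * Returns the epoll descriptor, or -1 on error. */
static int epoll_watch(int fd)
{
    int ep = epoll_create1(EPOLL_CLOEXEC);
    if (ep == -1)
        return -1;
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) == -1) {
        close(ep);
        return -1;
    }
    return ep;
}

/* Wait up to timeout_ms for the watched signalfd to become readable and
 * consume one signal from it. Returns the signal number, 0 on timeout,
 * or -1 on error. */
static int next_signal(int ep, int timeout_ms)
{
    struct epoll_event ev;
    int n = epoll_wait(ep, &ev, 1, timeout_ms);
    if (n <= 0)
        return n;               /* 0 = timeout, -1 = error */
    struct signalfd_siginfo si;
    if (read(ev.data.fd, &si, sizeof si) != (ssize_t)sizeof si)
        return -1;
    return (int)si.ssi_signo;
}
```

In a real event loop the same epoll instance would usually watch sockets and other descriptors as well, so signals are handled in the same place as all other I/O events.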

The system call solves the problems caused by asynchronous signal handling for the signals it covers, but signalfd() only works for the signals listed in the set (and SIGKILL and SIGSTOP cannot be received this way at all), which means that other signals can still cause the problems I mentioned above.

pidfd()

The pidfd interface (a process file descriptor, obtained with calls such as pidfd_open()) is used to solve the problem of Process ID (PID) reuse. A Unix-like system assigns every running process a unique PID, which the system uses as its identifier. Once a process has terminated and been reaped, its PID can be reused and assigned to a different process. A potential problem arises after such a reassignment. If process A monitors another process B by its PID, process B may terminate while A is not looking, and the PID may then be reassigned to a new process C. From then on, A is monitoring process C while believing it is still monitoring process B, which leads to faulty monitoring or a mistaken interaction between two processes.

The pidfd interface solves this potential problem by providing a file descriptor that refers to one specific process and is never redirected to another process after that one terminates. Instead of tracking the process by PID, the monitoring process tracks it through the file descriptor. Even if the PID is later reassigned to another process, the file descriptor still refers to the original process. By using the file descriptor, a program can monitor processes without worrying about PID reuse.

pidfds are useful in systems with high concurrency, such as an online game platform or a stock trading system, where processes are created and destroyed rapidly. When a child process is created with fork(), the parent can call pidfd_open() on the child's PID to obtain a file descriptor for it; clone3() with the CLONE_PIDFD flag can return the pidfd directly at creation time. The parent and other processes can then use this file descriptor to monitor and manage the child. epoll can monitor the file descriptor (as with signalfd()): it becomes readable when the process terminates. The pidfd stays tied to that one process for as long as the descriptor is open, so no PID reuse issues occur during process management.
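A minimal sketch of this flow with fork(), pidfd_open() and poll() is given below. The wrapper my_pidfd_open() and the function run_child_and_wait() are hypothetical names; the raw syscall() is used because older C libraries do not ship a pidfd_open() wrapper, and the fallback syscall number 434 is an assumption that holds on common architectures such as x86-64 and arm64.

```c
#define _GNU_SOURCE
#include <poll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434   /* assumption: value on x86-64, arm64, ... */
#endif

/* Thin wrapper around the pidfd_open system call (Linux 5.3+). */
static int my_pidfd_open(pid_t pid)
{
    return (int)syscall(SYS_pidfd_open, pid, 0);
}

/* Fork a child that exits with `code`, wait for it through a pidfd,
 * and return its exit status (or -1 on any failure). */
static int run_child_and_wait(int code)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(code);                      /* child: exit immediately */

    /* Safe for our own child: its PID cannot be reused before we reap it. */
    int pfd = my_pidfd_open(pid);
    if (pfd == -1) {
        waitpid(pid, NULL, 0);
        return -1;
    }

    /* The pidfd becomes readable when the child terminates. */
    struct pollfd p = { .fd = pfd, .events = POLLIN };
    int ready = poll(&p, 1, 5000);
    close(pfd);                           /* release the descriptor */

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (ready != 1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

The same descriptor could be registered with epoll instead of poll() when many children have to be watched at once.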

However, since every pidfd consumes a file descriptor and the number of file descriptors a process may hold is limited, the descriptors have to be closed explicitly with close() once they are no longer needed, so that the system keeps working correctly.

userfaultfd()

When a page fault occurs, fetching the memory from disk brings significant overhead, as it involves address translation, a lookup in the page table, and disk I/O to fetch the correct data. This is normally controlled entirely by the kernel, and the only thing the user program can do is wait.

But in some cases, leaving this to the kernel is not the best option, especially when the memory contents live on a remote machine. userfaultfd() allows a user-space program to take control of page fault handling. The program can intercept page faults and decide how they are resolved, which provides flexibility in handling them.

When userfaultfd() is called, the kernel creates a file descriptor that serves as a communication channel between user space and the kernel. The program is notified of a page fault when the descriptor becomes readable (POLLIN) and read() returns a message describing the fault, which gives it the chance to react. It can then use the various UFFDIO_* ioctls to manage the virtual memory ranges registered with the userfaultfd and to resolve the fault, for example by copying the page from another buffer or by reading it from a remote source. Just as importantly, the userfaultfd descriptor can be passed over Unix domain sockets, so one manager process can handle the faults of multiple processes, and its operations do not involve heavyweight kernel structures such as virtual memory areas (VMAs).

I have also learned that userfaultfd() is extremely helpful for postcopy live migration, a method used to migrate a virtual machine (VM) from one physical machine to another. Precopy migration requires several rounds of copying to move as much memory as possible from the source machine before switching to the new machine, and pages that the still-running VM modifies in the meantime (dirty pages) have to be sent again. Postcopy migration, on the other hand, lets the virtual machine start running on the target machine after a single precopy pass. Since not all data has been transferred yet, whenever the VM accesses memory that hasn't been migrated, a userfaultfd fault is triggered. Quick EMUlator (QEMU) then requests the missing memory page from the source machine through a bidirectional socket, while the virtual machine continues with other work. This method can lower the VM downtime during migration.

Whenever a page fault triggers userfaultfd(), the descriptor signals POLLIN. In postcopy migration, a postcopy thread determines whether the requested page is all zeros or contains actual data. If it is all zeros, UFFDIO_ZEROPAGE creates a zero page locally without anything being sent from the source. If it contains actual data, the page is transmitted and the postcopy thread uses the UFFDIO_COPY ioctl to map it into the target machine's memory.

Based on the link provided on the HW page, I designed a relatively simple simulation to experiment with the userfaultfd() system call. In this setup, whenever a page fault occurs in the registered region, the kernel does not resolve it on its own; instead, the simulation intercepts the fault and resolves it by supplying a page filled with the letter 'A' through the UFFDIO_COPY operation. It also measures and displays the latency, which is the time taken to handle each page fault.
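A minimal sketch of a simulation of this kind could look like the following. It is an illustration under stated assumptions rather than the exact program used for the experiment: it registers a single page, handles a single fault, and the names fault_handler() and demo_userfault() are hypothetical. On systems where unprivileged userfaultfd is disabled or filtered, demo_userfault() simply returns -1.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <pthread.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

static long page_size;

/* Handler thread: wait for one missing-page fault on the userfaultfd and
 * resolve it by copying in a page filled with 'A' via UFFDIO_COPY. */
static void *fault_handler(void *arg)
{
    int uffd = *(int *)arg;
    char *src = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (src == MAP_FAILED)
        return NULL;
    memset(src, 'A', page_size);

    struct uffd_msg msg;
    if (read(uffd, &msg, sizeof msg) == (ssize_t)sizeof msg &&
        msg.event == UFFD_EVENT_PAGEFAULT) {
        struct uffdio_copy copy = {
            /* round the faulting address down to its page boundary */
            .dst = msg.arg.pagefault.address &
                   ~((unsigned long long)page_size - 1),
            .src = (unsigned long)src,
            .len = (unsigned long)page_size,
            .mode = 0,
        };
        ioctl(uffd, UFFDIO_COPY, &copy);   /* also wakes the faulting thread */
    }
    munmap(src, page_size);
    return NULL;
}

/* Touch one userfaultfd-registered page and time the access. Returns the
 * first byte of the page ('A' on success) or -1 if setup fails. */
static int demo_userfault(long *latency_ns)
{
    page_size = sysconf(_SC_PAGESIZE);
    int uffd = -1;
#ifdef UFFD_USER_MODE_ONLY
    /* Linux 5.11+: allowed without privileges for user-mode faults. */
    uffd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | UFFD_USER_MODE_ONLY);
#endif
    if (uffd == -1)
        uffd = (int)syscall(SYS_userfaultfd, O_CLOEXEC);
    if (uffd == -1)
        return -1;

    struct uffdio_api api = { .api = UFFD_API, .features = 0 };
    char *region = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    struct uffdio_register reg = {
        .range = { .start = (unsigned long)region,
                   .len = (unsigned long)page_size },
        .mode = UFFDIO_REGISTER_MODE_MISSING,
    };
    pthread_t thr;
    if (region == MAP_FAILED ||
        ioctl(uffd, UFFDIO_API, &api) == -1 ||
        ioctl(uffd, UFFDIO_REGISTER, &reg) == -1 ||
        pthread_create(&thr, NULL, fault_handler, &uffd) != 0) {
        if (region != MAP_FAILED)
            munmap(region, page_size);
        close(uffd);
        return -1;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    char first = *(volatile char *)region;  /* faults; blocks until resolved */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    *latency_ns = (t1.tv_sec - t0.tv_sec) * 1000000000L +
                  (t1.tv_nsec - t0.tv_nsec);

    pthread_join(thr, NULL);
    munmap(region, page_size);
    close(uffd);
    return first;
}
```

The measured value includes the thread wake-up and the copy itself, so it is best read as the end-to-end cost of one user-space-handled fault rather than the cost of the ioctl alone.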