Some features added in recent years to Linux and other modern Unix operating systems.

A supplement to "Advanced Programming in the UNIX Environment" by Stevens.

sendfile

The sendfile system call was added to FreeBSD 3.0 in 1998 and Linux 2.2 in 1999. It adds a way to copy data from one file descriptor to another without first copying the data into process memory.

Here is the signature on Linux:

#include <sys/sendfile.h>

ssize_t sendfile(int out_fd, int in_fd, off_t * offset, size_t count);

The sendfile on FreeBSD has a different signature, and it can only be used to copy data from a file descriptor to a socket. On Linux, the input file descriptor must support mmap-like operations, so it can be a regular file but not a socket. Also, before Linux 2.6.33 the output file descriptor had to be a socket; since then it can be any file.
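As a rough illustration of the Linux call, here is a minimal sketch that copies one regular file to another. The function name copy_file is hypothetical, and the sketch assumes a kernel that allows a non-socket output descriptor:

```c
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy the regular file src to dst with sendfile.
 * Returns 0 on success, -1 on error. */
int copy_file(const char *src, const char *dst) {
    int in_fd = open(src, O_RDONLY);
    if (in_fd < 0)
        return -1;

    struct stat st;
    if (fstat(in_fd, &st) < 0) {
        close(in_fd);
        return -1;
    }

    int out_fd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out_fd < 0) {
        close(in_fd);
        return -1;
    }

    off_t offset = 0;
    int rc = 0;
    while (offset < st.st_size) {
        /* sendfile advances offset by the number of bytes it copied. */
        ssize_t n = sendfile(out_fd, in_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n < 0) {
            rc = -1;
            break;
        }
        if (n == 0)
            break;  /* the source shrank while we were copying */
    }

    close(in_fd);
    close(out_fd);
    return rc;
}
```

The loop is there because sendfile, like write, may transfer fewer bytes than requested.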

There is a pip package pysendfile for using this feature from Python.

epoll

If a process needs to read from multiple file descriptors, it is a bad idea to block on one of them, since data could arrive at another. For example, if another process has the two files open for writing and is blocked writing to the other file descriptor, a deadlock results.

There are two POSIX system calls for dealing with the situation: select and the newer poll. select imposes a limit, FD_SETSIZE (usually 1024), on the file descriptors that can be multiplexed at the same time, whereas poll does not. However, both select and poll become slow with large numbers of file descriptors, because the entire list is passed to the kernel and scanned on every call. There are also pselect and ppoll variants which allow the process to atomically change the signal mask that is in effect while the process is blocked on the file descriptors.
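For comparison with the newer interfaces, here is a minimal sketch of multiplexing with poll. The helper name first_readable and the cap of 16 descriptors are assumptions made to keep the example short:

```c
#include <poll.h>

/* Block until one of fds[0..n-1] is readable or timeout_ms elapses.
 * Returns the index of the first readable descriptor, or -1 on timeout
 * or error. n is capped at 16 only to keep the sketch free of malloc. */
int first_readable(const int *fds, int n, int timeout_ms) {
    struct pollfd pfds[16];
    if (n > 16)
        n = 16;

    /* The whole list is rebuilt and handed to the kernel on each call. */
    for (int i = 0; i < n; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
    }

    if (poll(pfds, (nfds_t)n, timeout_ms) <= 0)
        return -1;

    /* The caller must scan the list to find out which descriptor is ready. */
    for (int i = 0; i < n; i++)
        if (pfds[i].revents & (POLLIN | POLLHUP))
            return i;
    return -1;
}
```

Both the setup loop and the scan loop run on every call, which is the per-call cost that grows with the number of descriptors.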

To fix the slowness, kqueue was introduced in FreeBSD 4.1 (2000) and epoll in Linux 2.5.44 (2002). The idea is to create a separate system call for declaring the list of file descriptors that can be blocked on. This allows the kernel to maintain a data structure containing the list so that it has less processing to do each time the process blocks.

Here are the Linux epoll signatures:

#include <sys/epoll.h>

int epoll_create(int size);
int epoll_create1(int flags);
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event *events,
               int maxevents, int timeout);
int epoll_pwait(int epfd, struct epoll_event *events,
                int maxevents, int timeout,
                const sigset_t *sigmask);

typedef union epoll_data {
    void    *ptr;
    int      fd;
    uint32_t u32;
    uint64_t u64;
} epoll_data_t;

struct epoll_event {
    uint32_t     events;    /* Epoll events */
    epoll_data_t data;      /* User data variable */
};

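epoll_create creates an epoll list and returns a file descriptor for it. The size argument has been ignored since Linux 2.6.8 but must be greater than zero; epoll_create1 drops it in favor of a flags argument. epoll_ctl adds, modifies, or removes (depending upon the 2nd argument) a file descriptor (the 3rd argument) in the epoll list (the 1st argument). The 4th argument is a pointer to an epoll_event struct. The events field of the struct is a bitmap: when passed to epoll_ctl it names the events the process is interested in, and when filled in by epoll_wait it indicates whether the file descriptor is available for reading or writing, as well as some error conditions. The data field is returned unchanged by epoll_wait, so the process can use it to identify the file descriptor.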

epoll_wait is used to block on the file descriptors. The process allocates an array of epoll_event structs. The 2nd argument points to the array, and the 3rd argument is the size of the array. A timeout in milliseconds can be specified, or set to -1 to block indefinitely. The system call returns the number of file descriptors available for i/o, or -1 on error.

close is used to free an epoll list.
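The calls fit together roughly as follows. This is a minimal sketch, not a pattern for production use; the function name wait_either is hypothetical, and the two descriptors are assumed to be distinct pipes or sockets (epoll does not accept regular files):

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Wait up to timeout_ms for either of two descriptors to become readable.
 * Returns the ready descriptor, or -1 on timeout or error. */
int wait_either(int fd_a, int fd_b, int timeout_ms) {
    int epfd = epoll_create1(0);
    if (epfd < 0)
        return -1;

    int fds[2] = { fd_a, fd_b };
    for (int i = 0; i < 2; i++) {
        struct epoll_event ev = {0};
        ev.events = EPOLLIN;   /* we are interested in readability */
        ev.data.fd = fds[i];   /* handed back unchanged by epoll_wait */
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i], &ev) < 0) {
            close(epfd);
            return -1;
        }
    }

    struct epoll_event ready[2];
    int n = epoll_wait(epfd, ready, 2, timeout_ms);
    int result = (n > 0) ? ready[0].data.fd : -1;

    close(epfd);   /* frees the epoll list */
    return result;
}
```

A real program would create the epoll list once and call epoll_wait in a loop; creating and closing it for each wait, as this sketch does, throws away the advantage over poll.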

kqueue in FreeBSD and Darwin has greater generality than epoll in Linux, but it is also more complicated to use. Rather than describe it, we describe the libuv library, which among other benefits allows one to write multiplexing code in a portable way.

libuv

Node.js started in 2009. libuv, the event loop library written for Node.js, was made available as a separate library in 2012.

libuv is available in package managers:

$ brew install libuv

$ sudo apt install libuv1-dev

Here is an implementation of tee using libuv:

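#include <fcntl.h>   /* O_CREAT, O_RDWR */
#include <stdlib.h>  /* malloc, free */
#include <string.h>  /* memcpy */
#include <uv.h>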

uv_pipe_t stdin_pipe;
uv_pipe_t stdout_pipe;
uv_pipe_t file_pipe;

typedef struct {
    uv_write_t req;
    uv_buf_t buf;
} write_req_t;

void free_write_req(uv_write_t *req) {
    write_req_t *wr = (write_req_t*) req;
    free(wr->buf.base);
    free(wr);
}

void on_stdout_write(uv_write_t *req, int status) {
    free_write_req(req);
}

void on_file_write(uv_write_t *req, int status) {
    free_write_req(req);
}

void write_data(uv_stream_t *dest, size_t size, uv_buf_t buf, uv_write_cb cb) {
    write_req_t *req = (write_req_t*) malloc(sizeof(write_req_t));
    req->buf = uv_buf_init((char*) malloc(size), size);
    memcpy(req->buf.base, buf.base, size);
    uv_write((uv_write_t*) req, (uv_stream_t*)dest, &req->buf, 1, cb);
}

void alloc_buffer(uv_handle_t *handle, size_t suggested_size, uv_buf_t *buf) {
    *buf = uv_buf_init((char*) malloc(suggested_size), suggested_size);
}

void read_stdin(uv_stream_t *stream, ssize_t nread, const uv_buf_t *buf) {
    if (nread < 0){
        if (nread == UV_EOF){
            // end of file
            uv_close((uv_handle_t *)&stdin_pipe, NULL);
            uv_close((uv_handle_t *)&stdout_pipe, NULL);
            uv_close((uv_handle_t *)&file_pipe, NULL);
        }
    } else if (nread > 0) {
        write_data((uv_stream_t *)&stdout_pipe, nread, *buf, on_stdout_write);
        write_data((uv_stream_t *)&file_pipe, nread, *buf, on_file_write);
    }

    // OK to free buffer as write_data copies it.
    if (buf->base)
        free(buf->base);
}

int main(int argc, char **argv) {
    uv_loop_t *loop = uv_default_loop();

    uv_pipe_init(loop, &stdin_pipe, 0);
    uv_pipe_open(&stdin_pipe, 0);

    uv_pipe_init(loop, &stdout_pipe, 0);
    uv_pipe_open(&stdout_pipe, 1);

    uv_fs_t file_req;
    int fd = uv_fs_open(loop, &file_req, argv[1], O_CREAT | O_RDWR, 0644, NULL);
    uv_pipe_init(loop, &file_pipe, 0);
    uv_pipe_open(&file_pipe, fd);

    uv_read_start((uv_stream_t*)&stdin_pipe, alloc_buffer, read_stdin);

    uv_run(loop, UV_RUN_DEFAULT);
    return 0;
}

Compile and run the program:

$ gcc -o uvtee uvtee.c -luv
$ cat /etc/hosts | ./uvtee output.txt

inotify

The inotify suite of system calls appeared in Linux 2.6.13 (2005). It provides an efficient way for a process to be notified of changes to the file system.

#include <sys/inotify.h>

int inotify_init(void);
int inotify_init1(int flags);
int inotify_add_watch(int fd, const char *pathname, uint32_t mask);
int inotify_rm_watch(int fd, int wd);

inotify_init creates a file descriptor for a list of monitored files. read is used to discover changes to those files, and close is used to release the list. inotify_init1 takes a flags argument so that a nonblocking file descriptor can be created by passing IN_NONBLOCK.

inotify_add_watch adds a file to the list of monitored files. The 3rd argument is a bit mask which specifies the types of operations which are monitored.

For a regular file, possible values are:

  • IN_ACCESS
  • IN_ATTRIB
  • IN_CLOSE_WRITE
  • IN_CLOSE_NOWRITE
  • IN_DELETE_SELF
  • IN_MODIFY
  • IN_MOVE_SELF
  • IN_OPEN

Additional values for directories:

  • IN_CREATE
  • IN_DELETE
  • IN_MOVED_FROM
  • IN_MOVED_TO

non-blocking inotify

kqueue on Darwin

chroot

chroot has been around since Version 7 Unix. It is useful to review it before discussing namespaces.

Despite the age and ubiquity of chroot, it is not part of the POSIX standard.

Here is the signature on Linux:

#include <unistd.h>

int chroot(const char *path);

If a process makes this call with the directory path "/home/bob/stuff", then the process loses the ability to open files outside of that directory, either for reading or writing. The process is said to be in a chroot jail. Any child processes inherit the limitation.

The process calling chroot has created a mapping from the files it can see of the form "/**" to files on the host operating system of the form "/home/bob/stuff/**". Namespaces, discussed below, work the same way: every resource in the child namespace is mapped to a resource in the parent namespace.

chroot is not particularly secure. If the process has open file descriptors to paths outside of the jail, they are maintained. If the working directory of a jailed process is outside of the jail, the jailed process can access files outside of the jail using "../../../foo" style paths.

chroot makes it possible to run an application with its own set of executables. One could even run the application as root, but if the process had access to a kill command, it could stop processes outside of the jail.

clone/setns/unshare

Linux namespaces are what make it possible to implement containers. Containers are a limited type of virtualization in which the host and guest are running the same operating system. In contrast to hypervisor virtualization, the processes, files, and other resources in the guest are also processes, files, or other resources in the host.

Linux namespaces allow the creation of jails or containers in which other operating system resources, such as user ids, process ids, and network resources, are mapped from container to host. A process which does not have a container mapping is not even visible inside the container and signals cannot be sent to it, even by a privileged process inside the container.

The mount namespace was the first, introduced with Linux 2.4.19 (2002). The user namespace was introduced with Linux 3.8 (2013).

Namespaces can be assigned to a process when it is created if it is created with the clone system call. A process can also change its namespace with the setns system call.

#include <sched.h>

int clone(int (*fn)(void *), void *child_stack,
          int flags, void *arg, ...
          /* pid_t *ptid, struct user_desc *tls, pid_t *ctid */ );
int setns(int fd, int nstype);
int unshare(int flags);

When cloning a new process, the third argument is a bitmask. The following bits can be set to create a new namespace:

  • CLONE_NEWUSER
  • CLONE_NEWPID
  • CLONE_NEWNS
  • CLONE_NEWUTS
  • CLONE_NEWNET
  • CLONE_NEWIPC

The namespaces for a process are available in the /proc directory. For example, for process 7236:

  • /proc/7236/ns/cgroup
  • /proc/7236/ns/ipc
  • /proc/7236/ns/mnt
  • /proc/7236/ns/net
  • /proc/7236/ns/pid
  • /proc/7236/ns/user
  • /proc/7236/ns/uts

These paths can be passed to the open system call to get a file descriptor which can be used as the 1st argument of setns. The 2nd argument is a bit mask similar to the 3rd argument of clone. If the 2nd argument is zero, the file descriptor can be for any namespace type.

The files in /proc/PID/ns are symlinks. Using readlink on them returns a string containing the namespace type and an inode number, such as "pid:[4026531836]". The inode number can be used as an identifier for the namespace.
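Reading one of these symlinks can be sketched as follows. The helper name namespace_id is hypothetical, and the sketch assumes /proc is mounted:

```c
#define _DEFAULT_SOURCE   /* glibc: expose the readlink declaration */
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Write the identifier of one namespace of the calling process into buf.
 * type is a name under /proc/self/ns such as "pid" or "mnt".
 * Returns 0 on success, -1 on error. */
int namespace_id(const char *type, char *buf, size_t len) {
    char path[64];
    if (len == 0)
        return -1;

    int written = snprintf(path, sizeof path, "/proc/self/ns/%s", type);
    if (written < 0 || written >= (int)sizeof path)
        return -1;

    ssize_t n = readlink(path, buf, len - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';   /* readlink does not null-terminate */
    return 0;
}
```

Two processes are in the same namespace of a given type exactly when the strings returned for that type are equal.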

The unshare system call creates new namespaces for the current process and joins those namespaces. The namespace types to create are specified by the argument, which is a bit mask.

user

Each user namespace, excluding the root user namespace, has mappings of uids and gids from the namespace to the parent namespace. These mappings are created by writing to the files

/proc/PID/uid_map
/proc/PID/gid_map

The writing process must have CAP_SETUID (for uid_map) or CAP_SETGID (for gid_map) in the user namespace of PID.

To use setns to join another user namespace, a process must have CAP_SYS_ADMIN in that namespace.
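Here is a minimal sketch of creating a user namespace and writing a uid mapping for it. The function name become_namespace_root is hypothetical. The sketch assumes a single-threaded process and a system that permits unprivileged user namespaces; where they are disabled, the function simply fails:

```c
#define _GNU_SOURCE   /* glibc: expose unshare and CLONE_NEWUSER */
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Move the calling process into a new user namespace in which it is uid 0,
 * mapped to its original uid in the parent namespace.
 * Returns 0 on success, -1 on error. */
int become_namespace_root(void) {
    uid_t outer_uid = geteuid();   /* must be read before unshare */

    if (unshare(CLONE_NEWUSER) < 0)
        return -1;

    /* Line format: <first uid inside> <first uid in parent> <count> */
    char line[64];
    int len = snprintf(line, sizeof line, "0 %lu 1\n",
                       (unsigned long)outer_uid);

    int fd = open("/proc/self/uid_map", O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, line, (size_t)len);
    close(fd);
    return n == (ssize_t)len ? 0 : -1;
}
```

After a successful call, getuid returns 0 inside the namespace, while processes in the parent namespace still see the original uid. An unprivileged process that also wants to write gid_map must first write "deny" to /proc/self/setgroups.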

pid

When a new pid namespace is created, the first process in the namespace is assigned PID 1 and acts as the init process of the namespace. Its parent PID, should it make a call to getppid, is 0. This process also becomes the parent of descendant processes which are orphaned. Signals are treated specially for PID 1: signals sent from inside the namespace for which the process does not have a signal handler are ignored. If the PID 1 for a pid namespace exits, the remaining processes in the namespace are killed, and attempting to fork into the namespace results in an ENOMEM error.

If it is desirable to use the /proc file system for a namespace that was created, it must be explicitly mounted. This command, if executed inside the namespace, hides the parent namespace /proc file system:

$ mount -t proc proc /proc

unshare can be used to create a new pid namespace, and setns to join an existing one. Neither puts the calling process in the pid namespace. Instead, children subsequently created by the calling process are placed in it, and after unshare the first such child becomes PID 1 in the new namespace. In this respect, unshare and setns behave differently than for other namespace types, but it means that getpid always returns the same value for a process.
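The behavior can be observed with a short experiment. This is a minimal sketch; the function name first_child_is_pid1 is hypothetical, and CLONE_NEWUSER is added to the flags only so that an unprivileged process has a chance of being allowed to create the pid namespace:

```c
#define _GNU_SOURCE   /* glibc: expose unshare and the CLONE_NEW* flags */
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create a new pid namespace and fork once.
 * Returns 1 if the first child saw itself as pid 1, 0 if it did not,
 * and -1 on error (for example, when the process lacks permission). */
int first_child_is_pid1(void) {
    if (unshare(CLONE_NEWUSER | CLONE_NEWPID) < 0)
        return -1;

    /* The caller keeps its own pid; only its children are affected. */
    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        /* In the child: report what getpid returns via the exit status. */
        _exit(getpid() == 1 ? 0 : 1);
    }

    int status;
    if (waitpid(child, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 0 ? 1 : 0;
}
```

Because the child here is PID 1 of the new namespace and exits immediately, any later fork by the caller fails with ENOMEM, as described above.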

mnt

uts

net

ipc