Advanced I/O

Chapter 6 FreeBSD System Programming

Copyright(C) 2001,2002,2003,2004 Nathan Boeger and Mana Tominaga

6.1 Advanced I/O and Process Resources

As we saw in the previous chapter, programs can have multiple file descriptors open at the same time. These descriptors aren't necessarily files; they can also be fifos, pipes, or sockets. As such, the ability to multiplex these open descriptors becomes important. For example, consider a simple mail reader program, like pine. It should of course allow the user not only to read and write email, but also to check for new email at the same time. That means the program receives input from at least two sources at any given point: one source is the user, the other is the descriptor checking for new mail. Multiplexing descriptors is a complex issue. One method is to mark all open descriptors non-blocking (O_NONBLOCK), and then loop through them until one is found that will allow an I/O operation. The problem with this approach is that the program loops constantly, and if no I/O is available for a long time, the process will tie up the CPU. The CPU load only worsens when multiple processes are looping on a small set of descriptors.
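To make the cost of the looping approach concrete, here is a minimal sketch of it. The helper names set_nonblocking and spin_read are made up for this illustration and are not part of any system API. Note that the outer loop never sleeps, which is exactly the problem described above.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Mark fd non-blocking. Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/*
 * Spin over `count` non-blocking descriptors until one of them returns
 * data or end of file. Stores the index of that descriptor in *which and
 * returns the result of read(); returns -1 on any error other than
 * "no data yet".
 */
ssize_t spin_read(const int *fds, int count, char *buf, size_t len, int *which)
{
    for (;;) {
        for (int i = 0; i < count; i++) {
            ssize_t n = read(fds[i], buf, len);

            if (n >= 0) {
                *which = i;
                return n;
            }
            if (errno != EAGAIN && errno != EINTR)
                return -1;
        }
        /* Nothing was ready: go around again, burning CPU the whole time. */
    }
}
```

A caller would first pass each descriptor to set_nonblocking and then hand the whole array to spin_read.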

Another approach is to set signal handlers to catch when I/O is available, and then put the process to sleep. This sounds good in theory if you only have a few open descriptors and infrequent I/O requests. Because the process is sleeping, it will not tie up the CPU, and it will only execute when I/O is available. However, the problem with this approach is that signal handling is somewhat expensive. For example, a web server receiving 100 requests per minute would need to catch signals almost constantly. The overhead from catching hundreds of signals per minute would be significant, not only for the process but also for the kernel, which has to send these signals.

So far, both options are limited and ineffective, with the common problem being that a process needs to know when I/O is available. However, this information is actually only known in advance by the kernel, because the kernel ultimately handles all the open descriptors on the system. For example, when a process sends data over a fifo to another, the sending process calls write, which is a system call and thus involves the kernel. The receiver will not know until the write system call is executed by the sender. So, a better way to multiplex file descriptors suggests itself: have the kernel manage it for the process. In other words, send the kernel a list of open descriptors and then wait until the kernel has one or more ready, or until a time-out has expired.

This is the approach taken by the select(), poll() and kqueue() interfaces. Through these, the kernel manages the descriptors and wakes the process when I/O is available. These interfaces elegantly handle the problems mentioned above: the process doesn't need to loop through the open descriptors, nor does it need to handle signals. The process will still incur a slight overhead when using these functions, however, because the I/O operations are executed after the return from these interfaces. Thus it takes at least two system calls to perform any operation. For example, say your program has two descriptors used for reading. You use select to wait for them to have data to read. This requires the process to first call select and then, when select returns, to call read on the descriptor. Ideally, you could do a single large blocking read against all of the open descriptors. Once one is ready to read, the read would return with the data inside the buffer and an indication of which descriptor the data was read from.

6.2 select

The first interface I will cover is select(). The format is:

  int select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, struct timeval *timeout);

The first argument to select has caused some confusion over the years. The proper usage of the nfds argument is to set it to the maximum file descriptor value plus one. In other words, if you have a set of descriptors {0, 1, 8}, the nfds parameter should be set to 9, because the highest numbered descriptor in your set is 8. Some mistake this parameter for the total number of descriptors plus one, which in our example would incorrectly give 4. Remember that a descriptor is simply an integer value, so your program will need to figure out which is the largest valued descriptor you want to select on.
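The rule for nfds can be captured in a few lines. The helper below, nfds_for, is a hypothetical name used only for this sketch; it scans an array of descriptors and returns the highest value plus one.

```c
#include <stddef.h>

/*
 * Return the value to pass as select's nfds argument for the given
 * descriptors: the highest descriptor plus one, or 0 for an empty list.
 */
int nfds_for(const int *fds, size_t count)
{
    int highest = -1;

    for (size_t i = 0; i < count; i++) {
        if (fds[i] > highest)
            highest = fds[i];
    }
    return highest + 1;
}
```

For the set {0, 1, 8} used above, nfds_for returns 9, not 4.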

Select will then examine the next three arguments, readfds, writefds and exceptfds, for any pending reading, writing or exceptional conditions, in that order. (For more information, see the select(2) man page.) Note that if you have no descriptors to watch in readfds, writefds or exceptfds, that argument should be passed to select as NULL.

The readfds, writefds and exceptfds arguments are to be set with the four macros listed below.

FD_ZERO(&fdset);

The FD_ZERO macro clears all the bits in the given descriptor set. One very important note: always call this macro on a set before adding descriptors to it. Otherwise the set may contain leftover bits, and select will behave unpredictably.

FD_SET(fd, &fdset);

The FD_SET macro is used when you want to add an individual descriptor to a set of active descriptors.

FD_CLR(fd, &fdset);

The FD_CLR macro is used when you want to remove an individual descriptor from a set of active descriptors.

FD_ISSET(fd, &fdset);

The FD_ISSET macro is used once select returns to test if a particular descriptor is ready for an I/O operation. Because select overwrites the sets with its results, they must be rebuilt before each call.

The final parameter is a time-out value. If the time-out value is set to NULL, then the call to select will block indefinitely until an operation is ready. However, if a specific time-out is desired, then the time-out value should point to a timeval structure holding the maximum time to wait. The timeval structure is as follows:

  struct timeval {
      long    tv_sec;         /* seconds */
      long    tv_usec;        /* microseconds */
  };

Upon a successful call to select, the number of ready descriptors is returned. If select returns because the time limit expired, 0 is returned. If an error occurs, -1 is returned and errno is set accordingly.
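Putting the pieces together, the following sketch waits on two descriptors with a time-out and then reads from whichever one is ready. The function name read_when_ready and its return convention (-2 for a time-out) are assumptions made for this example only. It also shows the two system calls, select and then read, mentioned earlier.

```c
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Wait up to `seconds` for fd_a or fd_b to become readable, then read from
 * whichever is ready. Returns the byte count from read() (0 means end of
 * file), -1 on error, or -2 if the time-out expired first. On a successful
 * read, *from holds the descriptor that was read.
 */
ssize_t read_when_ready(int fd_a, int fd_b, long seconds,
                        char *buf, size_t len, int *from)
{
    fd_set readfds;
    struct timeval tv;
    int nfds = (fd_a > fd_b ? fd_a : fd_b) + 1;  /* highest descriptor + 1 */
    int n;

    FD_ZERO(&readfds);               /* always start from an empty set */
    FD_SET(fd_a, &readfds);
    FD_SET(fd_b, &readfds);

    tv.tv_sec = seconds;
    tv.tv_usec = 0;

    n = select(nfds, &readfds, NULL, NULL, &tv);  /* first system call */
    if (n == -1)
        return -1;
    if (n == 0)
        return -2;                                /* time-out expired */

    *from = FD_ISSET(fd_a, &readfds) ? fd_a : fd_b;
    return read(*from, buf, len);                 /* second system call */
}
```

If both descriptors are ready, this sketch simply prefers fd_a; a real program would usually service every descriptor for which FD_ISSET is true.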

6.3 poll

For the most part, the I/O discussed is BSD specific. System V includes support for a special type of I/O known as streams. Streams, along with sockets, have a priority attribute sometimes referred to as the data-band. This data-band can be set to specify a high priority for certain data on the stream. BSD originally did not have support for this feature, but some have added System V emulation, and can support certain types. Because we don't focus on System V, we'll make reference to the data-band or data priority band only. For more information, see System V STREAMS.

The poll function is similar to select:

  int poll(struct pollfd *fds, unsigned int nfds, int timeout);

Unlike select, which is native to BSD, poll originated in System V Unix and was not supported on earlier versions of BSD. Today, poll is supported on all major BSD systems.

Similar to select, poll will multiplex a set of given file descriptors. When specifying them, you have to use an array of structures, with each structure representing a single file descriptor. One advantage of using poll over select is that you can detect a few rare conditions that select will not. These conditions are POLLERR, POLLHUP, and POLLNVAL, discussed later. Although there is much discussion about choosing select or poll, for the most part, it'll depend on your personal taste. The structure used by poll is the pollfd structure, as below:

  struct pollfd {
      int     fd;             /* which file descriptor to poll */
      short   events;         /* events we are interested in */
      short   revents;        /* events found on return */
  };

The fd member specifies the file descriptor you wish to poll. If you want to remove a particular descriptor, set the fd member for that descriptor to -1. Doing so avoids having to shuffle the array around, and poll will also clear the revents member for that entry.
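As a small illustration of that trick, the hypothetical helper below (retire_slot is a name invented here) closes a descriptor and marks its slot unused, leaving every other entry in the array where it was.

```c
#include <poll.h>
#include <unistd.h>

/*
 * Close the descriptor in slot i and mark the slot unused. poll skips
 * entries whose fd is negative, so the rest of the array stays in place.
 */
void retire_slot(struct pollfd *fds, int i)
{
    if (fds[i].fd >= 0)
        close(fds[i].fd);
    fds[i].fd = -1;
    fds[i].revents = 0;
}
```

The slot can be reused later simply by storing a new descriptor in its fd member.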

events, revents

The events member is a bitmask specifying the events in which you are interested for that specific descriptor. The revents member is also a bitmask, but its value is set by poll with the event(s) that have occurred on that specific descriptor. These events are defined below.

  #define POLLIN          0x0001

The POLLIN event lets you specify that your program will poll for readable data events for the descriptor. Note that this data will not include high priority data, such as out-of-band data on sockets.

  #define POLLPRI         0x0002

The POLLPRI event is used to specify that your program is interested in polling for any high priority events for the descriptor.

  #define POLLOUT         0x0004
  #define POLLWRNORM      POLLOUT

The POLLOUT and POLLWRNORM events are used to specify that your program is interested in when a write on a descriptor can be performed. On FreeBSD and OpenBSD they are the same; you can check this in your system header file (/usr/include/poll.h). Technically speaking, the difference between them is that POLLWRNORM will only detect when it's possible to perform a write when the data priority band is equal to 0.

  #define POLLRDNORM      0x0040

The POLLRDNORM event is used to specify that your program is interested in polling for normal data on the descriptor. Note that on some systems, this is specified to have the exact same behavior as POLLIN. However, on NetBSD and FreeBSD, this is not the same as the POLLIN event. Again, check your system header file (/usr/include/poll.h). Strictly speaking, POLLRDNORM will only detect when it is possible to perform a read when the data-band is equal to 0.

  #define POLLRDBAND      0x0080

The POLLRDBAND event is used to specify that your program is interested in knowing when it can read data with a non-zero data-band value.

  #define POLLWRBAND      0x0100

The POLLWRBAND event is used to specify that your program is interested in knowing when it can write data to the descriptor with a non-zero data-band value.

FreeBSD Specific Options

The next options are FreeBSD specific and are not well known or widely used. They're worth mentioning, however, because of the flexibility afforded. These are new options and poll is not guaranteed to detect these conditions; plus, they only work with the UFS file system. If you need your program to detect these types of events, it's best to use the kqueue interface, covered later.

  #define POLLEXTEND      0x0200

The POLLEXTEND event will be set if the file has been extended.

  #define POLLATTRIB      0x0400

The POLLATTRIB event will be set if any file attributes have changed.

  #define POLLNLINK       0x0800

The POLLNLINK event will be set if the file has been renamed, deleted, or unlinked.

  #define POLLWRITE       0x1000

The POLLWRITE event will be set if the file contents have been modified.

The next events are not valid flags for the pollfd events member and poll will ignore them. They are returned in the pollfd revents instead, to specify certain events that have happened.

  #define POLLERR         0x0008

The POLLERR event will specify that an error has occurred.

  #define POLLHUP         0x0010

The POLLHUP event specifies that a hang-up has occurred on the stream. POLLHUP and POLLOUT are mutually exclusive events, because a stream is no longer writable once a hang-up has occurred.

  #define POLLNVAL        0x0020

The POLLNVAL event will specify that the request to poll was invalid.

The final argument to poll is timeout, which specifies how long poll should wait, in milliseconds. When timeout is set to -1, poll will block until a requested event has occurred. When timeout is set to 0, poll will return immediately.

A positive integer is returned upon a successful call to poll. Its value is the number of descriptors on which events have occurred. If the time-out expires, poll returns 0. If an error occurs, poll returns -1.
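The sketch below ties these pieces together for a single descriptor: it polls for input with a time-out, checks the return-only bits in revents, and then reads. The enum and the name poll_and_read are invented for this example. Because systems differ in whether POLLIN accompanies POLLHUP at end of file, the sketch lets read decide between data and a closed stream.

```c
#include <poll.h>
#include <unistd.h>

/* Possible outcomes of one poll-then-read attempt (names are illustrative). */
enum poll_result { PR_TIMEOUT, PR_DATA, PR_CLOSED, PR_ERROR };

/*
 * Poll fd for input for up to timeout_ms milliseconds and, if it is
 * readable or has hung up, read from it. On PR_DATA, *nread holds the
 * number of bytes placed in buf.
 */
enum poll_result poll_and_read(int fd, int timeout_ms,
                               char *buf, size_t len, ssize_t *nread)
{
    struct pollfd pfd;
    int n;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    n = poll(&pfd, 1, timeout_ms);
    if (n == -1)
        return PR_ERROR;
    if (n == 0)
        return PR_TIMEOUT;

    /* POLLERR and POLLNVAL are reported even though we did not ask. */
    if (pfd.revents & (POLLERR | POLLNVAL))
        return PR_ERROR;

    /* POLLIN or POLLHUP: read() tells us whether data or EOF is next. */
    *nread = read(fd, buf, len);
    if (*nread > 0)
        return PR_DATA;
    return *nread == 0 ? PR_CLOSED : PR_ERROR;
}
```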

6.4 kqueue

So far, poll and select seem like elegant ways to multiplex file descriptors. To use either of those two functions, however, you need to create the list of descriptors, then send them to the kernel, and then upon return, look through the list again. That seems a bit inefficient. A better model would be to give the descriptor list to the kernel and then wait. Once one or more events happen, the kernel can notify the process with a list of only the file descriptors which had events, avoiding loops through the entire list every time a function returns. Although this small gain is not noticeable if the process only has a few descriptors open, for programs with thousands of open file descriptors, the performance gains are significant. This was one of the main goals behind the creation of kqueue. Also, the designers wanted a process to be able to detect a wider variety of events, such as file modification, file deletion, signals delivered, or a child process exit, with a flexible function call that subsumed other tasks. Handling signals, multiplexing file descriptors, and waiting for child processes can all be wrapped into this single kqueue interface because they are all waiting for an event to occur.

Another design consideration was for a process to have multiple instances of kqueues without interference. As you have seen, a process can set a signal handler; however, what happens if another part of the code also wants to catch that specific signal? Or worse, say a library function sets a signal handler for the same signal your code wants to catch? Debugging this to figure out why your program is not executing the signal handler you set would take hours. For the most part, these situations don't happen often. Good programmers will avoid setting signal handlers inside library functions. With large, complex programs these situations become hard to avoid, so ideally, we should be able to detect these events, as kqueue allows.

The way kqueue works is with filters. Each registered event is identified by an (ident, filter) tuple called a kevent, and each kqueue allows only one kevent for a given tuple. The filters are processed during the initial registration with kqueue and during any retrieval of events. If the condition being watched already holds at registration time, the event is placed onto the kqueue for retrieval. If multiple events exist for a given filter, they will be combined into a single kevent.

The kqueue API consists of two function calls and a macro to aid in setting events. These functions are outlined below.

  int kqueue(void);

The kqueue function will start a new kqueue. Upon a successful call the return value will be a descriptor that is to be used when interacting with the newly created kqueue. Each kqueue will have a unique descriptor associated with it. So, a program can have more than one unique kqueue open at once. The kqueue descriptors behave similarly to a regular file descriptor: they can be multiplexed.

One final note: kqueue descriptors are not inherited by child processes created by fork. If the child process is created by the rfork call, however, the RFFDG flag will need to be set to prevent them from being shared with the child process. If the kqueue function fails, -1 is returned and errno is set accordingly.

  int kevent(int kq, const struct kevent *changelist, int nchanges, struct kevent *eventlist, int nevents, const struct timespec *timeout);

The kevent function is used to interact with the kqueue. The first argument is a descriptor returned by kqueue. The changelist argument is an array of kevent structures of size nchanges. The changelist argument is used to register or modify events and will be processed before pending events are read from kqueue.

The eventlist argument is an array of kevent structures of size nevents. The kevent function returns events to the calling process by placing them inside the eventlist argument. The eventlist and changelist arguments can point to the same array if desired. The final argument is the time-out desired for kevent. When the time-out is specified as NULL, kevent will block until an event has occurred. When the time-out argument is non-NULL, kevent will block until an event has occurred or the time-out has expired, whichever comes first. When the time-out is specified with a zero value structure, kevent will return immediately with any pending events.

The return value from kevent will specify the number of events placed inside the eventlist array. If the number of events exceeds the size of the eventlist argument, the remaining events can be retrieved with subsequent calls to kevent. Errors that occur while processing the changelist will be placed in the eventlist argument, provided space is available. The events with errors will have the EV_ERROR bit set along with the system error placed inside the data member. For all other errors -1 is returned and errno is set accordingly.

The kevent structure is used to communicate with kqueue. The header file on FreeBSD can be found in the /usr/include/sys/event.h file. This will have the declaration of the struct kevent as well as other options and flags. Because kqueue is still fairly new compared with select or poll, it's constantly evolving and adding new features. Check your system header for any new or system specific options.

The kevent structure declaration in its native form:

  struct kevent {
        uintptr_t       ident;
        short           filter;
        u_short         flags;
        u_int           fflags;
        intptr_t        data;
        void            *udata;
  };

Now, let's look at the individual members:

ident

The ident member stores the value that identifies the source of the event; how it is interpreted depends on the filter. For example, if you want to watch a file descriptor, the ident member would be set to the target descriptor's value.

filter

The filter member is used to specify the filter you want the kernel to apply to the ident member.

flags

The flags member will specify to the kernel what actions along with any needed flags should be processed with the event. Upon return, the flags member can contain error conditions.

fflags

The fflags member is used to specify filter specific flags you want the kernel to use. Upon return, the fflags member can contain filter-specific return values.

data

The data member is used to hold any filter-specific data.

udata

The udata member is not used by kqueue, but its value passes through kqueue unmodified. The process can use this member to attach its own information to an event, or even a pointer to a function to call, so that it is available when the event is detected.

kqueue filters

The filters used by kqueue are listed below. Some filters will have filter-specific flags. Set these flags in the fflags member of the kevent structure.

  #define EVFILT_READ     (-1)

The EVFILT_READ filter will detect when data is available to read. The ident member of the kevent should be set to a valid descriptor. Although this filter behaves in the same manner as select or poll would, the filter will return events specific to the type of descriptor used.

If the descriptor references an open file, referred to as a vnode, the event will specify that the read offset has not reached the end of the file. The data member will contain the current offset relative to the end of the file and can be negative.

If the descriptor references a pipe or a fifo, then the filter will return when there is actual data to be read. The data member will contain the number of bytes available to read. The EV_EOF bit will specify that one of the writers has closed their connection. (See the kqueue man page for details on how the EVFILT_READ will behave when a socket is used.)

  #define EVFILT_WRITE    (-2)

The EVFILT_WRITE filter will detect whether it is possible to perform a write on the descriptor. If the descriptor references a pipe, fifo, or socket, then the data member will contain the amount of space remaining in the write buffer. The EV_EOF bit will specify that the reader has closed its connection. This filter is not valid for descriptors that reference open files (vnodes).

  #define EVFILT_AIO      (-3)

The EVFILT_AIO filter is used with asynchronous I/O operations and will detect conditions similar to those reported by the aio_error system call.

  #define EVFILT_VNODE    (-4)

The EVFILT_VNODE filter detects certain changes to a file on a file system. Set the ident member to a valid open file descriptor, and specify which events are desired with the fflags member. Upon return, the fflags member will contain the bitwise mask of events that have happened. These events are as follows:

  #define NOTE_DELETE     0x0001

The NOTE_DELETE fflag will specify that the process wants to know when the file has been deleted.

  #define NOTE_WRITE      0x0002

The NOTE_WRITE fflag will specify that the process wants to know when the file contents have changed.

  #define NOTE_EXTEND     0x0004

The NOTE_EXTEND fflag will specify that the process wants to know when the file has been extended.

  #define NOTE_ATTRIB     0x0008

The NOTE_ATTRIB fflag will specify that the process wants to know when the file attributes have changed.

  #define NOTE_LINK       0x0010

The NOTE_LINK fflag will specify that the process wants to know when the file's link count has changed. The link count can change when a file is hard linked by the link function call. (For more information, see the link(2) man page.)

  #define NOTE_RENAME     0x0020

The NOTE_RENAME fflag will specify that the process wants to know if the file gets renamed.

  #define NOTE_REVOKE     0x0040

The NOTE_REVOKE fflag will specify that access to the file was revoked. For more information, see man(2) revoke.

  #define EVFILT_PROC     (-5)  /* attached to struct proc */

The EVFILT_PROC filter is used by a process to detect events that occur in another process. Place the PID of the desired process in the ident member, and set the fflags member with the desired events to monitor. Upon return the events will be placed inside the fflags member. These events are set by bitwise OR'ing the following events:

  #define NOTE_EXIT       0x80000000

The NOTE_EXIT fflag will detect when the process has exited.

  #define NOTE_FORK       0x40000000

The NOTE_FORK fflag will detect when the process has called fork.

  #define NOTE_EXEC       0x20000000

The NOTE_EXEC fflag will detect when the process has called an exec function.

  #define NOTE_TRACK      0x00000001

The NOTE_TRACK fflag will cause kqueue to track a process across a fork call. The child process will be returned with the NOTE_CHILD flag set in fflags and the parent process's PID will be placed into the data member.

  #define NOTE_TRACKERR   0x00000002

The NOTE_TRACKERR fflag will be set if an error occurred when trying to track a child process. This is a return only fflag.

  #define NOTE_CHILD      0x00000004

The NOTE_CHILD fflag is set on a child process. This is a return only fflag.

  #define EVFILT_SIGNAL      (-6)

The EVFILT_SIGNAL filter is used to detect if a signal has been delivered to the process. Set the ident member to the signal number. The filter records every delivery of the signal and places the count inside the data member. This includes signals that are set to be ignored with SIG_IGN. The event will be placed onto the kqueue after the execution of the normal signal handling process. Note that this filter will set the EV_CLEAR flag internally.

  #define EVFILT_TIMER    (-7)

The EVFILT_TIMER filter will create a timer for kqueue to count the number of timer expirations. If a one-time timer is desired, then set the EV_ONESHOT flag. Identify the timer with the ident member and use the data member to specify the timeout in milliseconds. On return, the data member holds the number of times the timer has expired since the event was last retrieved. Note that this filter will set the EV_CLEAR flag internally.

kqueue actions

The kqueue actions are set by bitwise OR'ing the desired actions along with any desired flags.

  #define EV_ADD          0x0001

The EV_ADD action will add the event to the kqueue. Because duplicates are not allowed in kqueue, if you try to add an event when one already exists, the existing event will be overwritten by the newer addition. Note that when events are added, they are enabled by default unless you've set the EV_DISABLE flag.

  #define EV_DELETE       0x0002

The EV_DELETE action will remove the event from the kqueue.

  #define EV_ENABLE       0x0004

The EV_ENABLE action will enable the event in the kqueue. Note that newly added events will be enabled by default.

  #define EV_DISABLE      0x0008

The EV_DISABLE action will stop kqueue from returning information on the event. Note that kqueue will not remove the filter, however.

kqueue action flags

The kqueue action flags are defined below. These are to be used in conjunction with the actions listed above. They are set by bitwise OR'ing the desired flags with the actions.

  #define EV_ONESHOT      0x0010

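The EV_ONESHOT flag will notify kqueue to return only the first occurrence of the event. Once the process retrieves the event from the kqueue, the event is deleted.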

  #define EV_CLEAR        0x0020

The EV_CLEAR flag will notify kqueue to reset the state for the event once the process retrieves the event from the kqueue.

kqueue returned values

Return-only values are placed in the flags member of the kevent structure. These values are defined below.

  #define EV_EOF          0x8000

The EV_EOF will indicate an end of file condition.

  #define EV_ERROR        0x4000

The EV_ERROR will indicate that an error has occurred. The system error will be placed inside the data member.

6.5 Conclusion

This chapter explored multiplexing descriptors in BSD. As a programmer, you can choose among three interfaces: select, poll or kqueue. The three are similar in performance for a small set of descriptors, but with a large number of descriptors, kqueue is an elegant choice. In addition, kqueue can detect much more than I/O events. It can also detect signals, file modifications and events related to child processes. In the next chapter, we explore other ways to get information regarding child processes and determine the statistics of our current process, with a focus on new features in FreeBSD 5.x.
