Basic I/O

Chapter 5 FreeBSD System Programming

Copyright(C) 2001,2002,2003,2004 Nathan Boeger and Mana Tominaga


5.1 Basic I/O

For the most part, Unix espouses a simple design philosophy: "everything is a file." This is quite a powerful feature - it means a text file you edit has the same programmatic interface as a modem, printer, or network card. Just as you can read and write a text file, you should be able to perform the same basic operations - read and write - on those devices. Although the implementation of this ideal isn't perfect, BSD Unix adheres to it rather closely. It's another one of BSD's strengths - it's simple and elegant. Things that are not truly files are devices, which have entries in the /dev/ directory. Some devices require specialized operations, such as block reading and writing. An extreme example is an Ethernet device, which doesn't even have an entry in /dev/ before FreeBSD 5.

A good example of an operating system that treats everything like a file is Plan 9. Plan 9 has files for everything, even Ethernets and network protocols. Please see the Plan 9 Web site http://www.cs.bell-labs.com/plan9dist/ for more information about Plan 9.

In general, files represent the most basic, elemental form of data on a computer: essentially a linear sequence of bytes. When a compiled program is executed with one of the exec calls, the system reads the binary file into memory and then executes the code. It's irrelevant to exec where the program is located - it can reside on a floppy, hard disk, CD-ROM, or even a network-mounted file system halfway around the world. What matters is that the bytes can be read sequentially. The same is true of sending data over a network link. When a program sends data, the data itself is simply a linear sequence of bytes, sometimes called a stream. The program does not care that it's sending data over a network link, only how to write that data. These two fundamental operations, read and write, are found in most aspects of computing.

This chapter will cover the basic I/O subsystem and process resources.

5.2 I/O

Processes on Unix keep references to open files called file descriptors, which are integers. Whenever a process is created on Unix, it is given three file descriptors. These are:

0 standard in
1 standard out
2 standard error

These can be descriptors to read and write from the terminal, a file, or even another process's set of descriptors. Take the shell redirect, cat /etc/hosts >> hosts.out. The shell will open the file hosts.out and then execute the cat command with /etc/hosts as its argument. However, when the cat process writes to the standard out (1), the result doesn't go to the tty, but to the file hosts.out. (The cat program does not even know that it's writing to a file on the file system; it simply writes to the standard out file descriptor. We will see how to do this later in the chapter.)

The open and close functions are basic ways to manipulate descriptors.

The Open Function

   int     open(const char *path, int flags, ... /* mode_t mode */);

Upon a successful call, the open function will return a file descriptor associated with the file given in its arguments. This descriptor is just an integer index into the process's file descriptor table. On BSD this table is described by the filedesc structure, which can be found in the /usr/include/sys/filedesc.h header file; each entry in the table refers to a kernel structure that describes how to operate on the open file. When the process wants to perform any operations on this descriptor, it passes this integer value as an argument to read, write, and other I/O calls that require a file descriptor.

The kernel keeps a reference count for every open file. This reference count is incremented when a process opens the file, duplicates the descriptor, or passes the descriptor to a child across fork. A descriptor also stays open across exec calls unless its close-on-exec bit is set. When the reference count reaches 0, the file is closed. That means that if your program calls fork, the reference count will increase, and if the child calls fork as well, it will increase yet again. So, the file will stay open until its reference count reaches 0, that is, until all of the processes holding it either close it or exit.

The Close Function

  int close(int fd);

When a process wants to remove or close an open file descriptor, it calls the close function. This closes the given file descriptor, and decrements the reference count of the file it refers to. It's similar to exit - when a process calls exit, the reference counts for all of its open file descriptors are automatically decremented. Once the reference count reaches 0, the kernel will fully release the associated file entry.
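As a minimal sketch of the two calls working together, the following opens a file, closes it, and opens it again. The helper name reopen_reuses_fd is made up for illustration. Because open hands back the lowest unused descriptor number, the slot released by close is the one the second open receives.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: open path, close it, then open it again.
 * Returns 1 if the second open reused the descriptor number released
 * by close, 0 if it did not, and -1 if any call failed.
 */
int reopen_reuses_fd(const char *path)
{
    int first, second;

    first = open(path, O_RDONLY);
    if (first == -1)
        return -1;

    /* Release the descriptor; its slot in the table becomes free. */
    if (close(first) == -1)
        return -1;

    /* open returns the lowest unused descriptor, so the slot is reused. */
    second = open(path, O_RDONLY);
    if (second == -1)
        return -1;
    close(second);

    return first == second;
}
```

In a single-threaded program, calling reopen_reuses_fd("/dev/null") should return 1.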

The getdtablesize Function

  int getdtablesize(void);

The getdtablesize function will return the size of the file descriptor table. This can be used to examine a system imposed limit. You can also use the sysctl program to examine this like so:

bash$ sysctl kern.maxfilesperproc
kern.maxfilesperproc: 3722

Depending on your system, you can tune this at run time or when you compile your kernel. Note that getdtablesize will not tell you how many files your current process has open, only the maximum number your process could potentially have open.
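A small sketch of how the value might be used follows. Both helper names are hypothetical; the only system call involved is getdtablesize itself.

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if fd is a number that could index
 * the descriptor table, 0 otherwise. It says nothing about whether
 * fd is actually open.
 */
int fd_fits_table(int fd)
{
    return fd >= 0 && fd < getdtablesize();
}

/* Print the limit, much as the sysctl command above does. */
void print_table_size(void)
{
    printf("descriptor table size: %d\n", getdtablesize());
}
```

A check like fd_fits_table is only a sanity test on the number; to learn whether a descriptor is open you still have to ask the kernel, for example with fcntl.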

The fcntl Function

  int    fcntl(int fd, int cmd, ...);

The fcntl function will allow a process to manipulate file descriptors. It takes at a minimum two arguments: a valid file descriptor and a command. Depending on the command, fcntl may require a third argument. The following values are defined for the command argument. On FreeBSD, you can find them in the /usr/include/fcntl.h header.

The F_DUPFD Command

  #define  F_DUPFD     0

The F_DUPFD command is used to create a new file descriptor that is a duplicate of the original. (You can accomplish the same thing with the dup calls that are covered later.) Upon a successful call to fcntl with the F_DUPFD command, fcntl will return a new file descriptor with the following attributes:

The descriptor returned is the lowest available descriptor greater than or equal to the value of the third argument, and it references the same open file as the descriptor given as the first argument to fcntl.
The new file descriptor shares the same file offset as the original, and has the same access mode (i.e. O_RDONLY, O_RDWR, or O_WRONLY).
The new file descriptor will share the same file status flags.
The new file descriptor will have the close-on-exec flag turned off. That means the new file descriptor will remain open across exec calls.
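The attributes listed above can be checked with a short sketch. The helper names dup_at_or_above and shares_offset are invented for this example; the lseek call is used only to move and query the offset.

```c
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: duplicate fd onto the lowest free descriptor
 * numbered min_fd or higher. Returns the new descriptor, or -1.
 */
int dup_at_or_above(int fd, int min_fd)
{
    return fcntl(fd, F_DUPFD, min_fd);
}

/*
 * Hypothetical helper: move the offset through descriptor a and see
 * whether descriptor b observes the move. Returns 1 if the two share
 * an offset, 0 if they do not, and -1 on error.
 */
int shares_offset(int a, int b, off_t where)
{
    if (lseek(a, where, SEEK_SET) == (off_t)-1)
        return -1;
    return lseek(b, 0, SEEK_CUR) == where;
}
```

For a regular file, shares_offset(fd, dup_at_or_above(fd, 10), 5) should return 1, since both descriptors refer to the same open file.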

The F_GETFD Command

  #define  F_GETFD     1

The F_GETFD command is used to retrieve the close-on-exec flag status. The return value, when ANDed with FD_CLOEXEC, will be either 0 or nonzero. If 0, the close-on-exec flag is not set, so the file descriptor will remain open across exec calls. If nonzero, the close-on-exec flag is set, and the file descriptor will be closed on a successful call to one of the exec functions.

The F_SETFD Command

  #define  F_SETFD     2

The F_SETFD command is used to set the file descriptor's close-on-exec flag. The third argument is either FD_CLOEXEC to set the close-on-exec flag, or 0 to unset the close-on-exec flag.
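Here is one way the two commands might be combined, as a sketch. The names set_cloexec and get_cloexec are hypothetical. Reading the flags first and writing them back is a cautious habit in case a system defines descriptor flags other than FD_CLOEXEC.

```c
#include <fcntl.h>

/*
 * Hypothetical helper: turn the close-on-exec flag on (on != 0) or
 * off (on == 0) while leaving any other descriptor flags untouched.
 * Returns 0 on success and -1 on error.
 */
int set_cloexec(int fd, int on)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags == -1)
        return -1;
    if (on)
        flags |= FD_CLOEXEC;
    else
        flags &= ~FD_CLOEXEC;
    return fcntl(fd, F_SETFD, flags) == -1 ? -1 : 0;
}

/* Returns 1 if close-on-exec is set, 0 if it is clear, -1 on error. */
int get_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags == -1)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}
```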

The F_GETFL and F_SETFL Commands

  #define  F_GETFL    3
  #define  F_SETFL    4

The F_GETFL command will tell fcntl to return the current file status flags for the descriptor. The mode the file was opened with can be retrieved by ANDing the returned value with O_ACCMODE (#define O_ACCMODE 0x0003). The F_SETFL command will set the file status flags according to the third argument.
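The following sketch shows both commands in use. The helper names access_mode and add_status_flags are assumptions made for this example.

```c
#include <fcntl.h>

/*
 * Hypothetical helper: returns O_RDONLY, O_WRONLY or O_RDWR for fd,
 * or -1 if fd is not a valid descriptor.
 */
int access_mode(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1)
        return -1;
    /* Mask off everything except the access mode bits. */
    return flags & O_ACCMODE;
}

/*
 * Hypothetical helper: switch on extra status flags (O_APPEND,
 * O_NONBLOCK and so on) without clearing the ones already set.
 * Returns 0 on success and -1 on error.
 */
int add_status_flags(int fd, int extra)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | extra) == -1 ? -1 : 0;
}
```

Passing only the new flag to F_SETFL would clear any other changeable status flags, which is why the sketch reads the current value first.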

Common Flags

Some of these flags are also used in calls to open, and some can only be set by calling open with the desired flags. Here are the most common ones; check your system header file for other values.

  #define  O_RDONLY 0x0000

If the O_RDONLY flag is set, then the file is opened for reading only. Note that this O_RDONLY flag can only be set by a call to open; it cannot be set by calling fcntl with the F_SETFL command.

  #define  O_WRONLY 0x0001

If the O_WRONLY flag is set, the file is opened for writing only. This flag is set by open and cannot be set by calling fcntl with the F_SETFL command.

  #define  O_RDWR      0x0002

If the O_RDWR flag is set, the file is opened for reading and writing. Again, this flag can only be set by a call to open.

  #define  O_NONBLOCK  0x0004

If the O_NONBLOCK flag is set, operations on the file descriptor will not block but will instead return immediately. An example would be an open tty. Normally, if the user has not typed anything into the terminal, the call to read will block until the user types. When the O_NONBLOCK flag is set, the call to read will return immediately with -1 and errno set to EAGAIN.
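A non-blocking read might be wrapped as in the sketch below. The helper names and the -2 return convention are invented for illustration; the system reports "nothing to read yet" as a return of -1 from read with errno set to EAGAIN.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: add O_NONBLOCK to fd. Returns 0 or -1. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/*
 * Hypothetical helper: read without waiting. Returns the number of
 * bytes read (0 means end of file), -2 if no data is available yet,
 * or -1 on any other error.
 */
ssize_t read_if_ready(int fd, void *buf, size_t len)
{
    ssize_t n = read(fd, buf, len);

    if (n == -1 && errno == EAGAIN)
        return -2;
    return n;
}
```

A caller that receives -2 would normally do other work and try again later, rather than spin in a loop.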

  #define  O_APPEND 0x0008

If the O_APPEND flag is set, the file is opened in append mode, and each write to the file will begin at the end.

  #define  O_SHLOCK 0x0010

If the O_SHLOCK flag is set, a shared lock is obtained on the file when it is opened. A shared lock can be held by multiple processes at once. These locks have flock semantics; see the discussion of flock later in this chapter for more on shared file locks.

  #define  O_EXLOCK 0x0020

If the O_EXLOCK flag is set, an exclusive lock is obtained on the file when it is opened. Again, refer to the discussion of flock later in this chapter for details.

  #define  O_ASYNC     0x0040

If the O_ASYNC flag is set, the process or process group that owns the descriptor will be sent the SIGIO signal when I/O is possible on the file descriptor. See the Signals chapter for details.

  #define  O_FSYNC     0x0080

If the O_FSYNC flag is set, writes to the file descriptor will not simply be cached by the kernel. Instead, the data will be written to the media, and each call to write will block until the kernel finishes.

  #define  O_NOFOLLOW  0x0100

When the O_NOFOLLOW flag is set, the call to open will fail if the file is a symbolic link. If this flag is set on a valid file descriptor, then the file it refers to is not a symbolic link.

  #define  O_CREAT     0x0200

If the O_CREAT flag is set, the file will be created by the call to open if it does not already exist. (The missing e is interesting; when Ken Thompson, one of the creators of Unix, was asked what he would do differently if he were redesigning the system, he reportedly replied that he would spell creat with an e, or at least that's how the story goes.)

  #define  O_TRUNC     0x0400

If the O_TRUNC flag is set, the file will be truncated to zero length upon a successful call to open.

  #define  O_EXCL      0x0800

If the O_EXCL flag is set along with O_CREAT, the call to open will fail with an error if the file already exists.
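To show how the creation flags combine in a call to open, here is a brief sketch. The helper names and the 0644 mode are choices made for this example only.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: create path for writing, but only if it does
 * not exist yet. Returns a descriptor, or -1 with errno set (EEXIST
 * when the file is already there).
 */
int create_new_file(const char *path)
{
    /* 0644 requests rw-r--r--; the process umask may remove bits. */
    return open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);
}

/*
 * Hypothetical helper: open path for writing, creating it if needed
 * and discarding any previous contents.
 */
int create_or_truncate(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
}
```

Because the existence check and the creation happen inside one call, O_CREAT with O_EXCL is a common way to create lock files or temporary files without racing another process.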

  #define F_GETOWN          5

The F_GETOWN command is used to retrieve the current process or process group receiving the SIGIO signal for this descriptor. If the value is positive then it will represent a process; negative values represent process groups.

  #define F_SETOWN  6

The F_SETOWN command is used to set the process or process group that will receive the SIGIO signal when I/O is ready. To specify a process, use a positive value (a PID) as the third argument to fcntl. To specify a process group, use a negative value (the negated process group ID) as the third argument.

5.3 File Locking

So what happens when multiple processes attempt to write to the same file? They can trample each other, a problem that file locking is designed to prevent. This occurs because each process has its own file descriptor with its own offset. As each process writes to the file, its offset is advanced independently, so no process knows that others are writing. The resulting file will contain garbage because the multiple independent writes to the file can get intermixed. One way to solve this problem is to lock at the file level, so only one process can write to a file at any time. Another is to allow locks on regions inside the file, in a scheme called advisory file locking; advisory means that the locks only work if every process that touches the file checks for them. The fcntl function can provide this functionality. For the most part there are two types of locks: read locks and write locks. The difference is that any number of read locks can coexist on a region, so they will not interfere with other processes reading the file, but only one write lock can exist on a specified region, and only when no read locks are present.

The following structure is used as the third argument to fcntl when using advisory locks.

  struct flock {
          off_t   l_start;    /* starting offset */
          off_t   l_len;      /* len = 0 means until end of file */
          pid_t   l_pid;      /* lock owner */
          short   l_type;     /* lock type */
          short   l_whence;   /* type of l_start */
  };

Now let's go over each element in detail.

l_start

This is an offset in bytes relative to the position named by l_whence. In other words, the region starts at the l_whence position plus l_start.

l_len

This needs to be set to the length of the desired region in bytes. The lock will begin at l_whence + l_start and extend for l_len bytes. If l_len is 0, the lock extends to the end of the file, however large the file grows; to lock the entire file, set l_whence to SEEK_SET and both l_start and l_len to 0. Avoid negative values of l_len, as their behavior has not been consistent across systems.

l_pid

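When fcntl reports a conflicting lock (see the F_GETLK command below), this field holds the process ID (PID) of the process that owns that lock. It is ignored when you set or clear a lock.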

l_type

This needs to be set to the desired type of lock. The values are as follows:

F_RDLCK - a read lock
F_WRLCK - a write lock
F_UNLCK - used to clear the lock

l_whence

This field is the most confusing part of this interface. It determines the position from which l_start is measured, and needs to be set to one of the following:

SEEK_CUR - the current location in the file
SEEK_SET - the beginning of the file
SEEK_END - the end of the file

Commands to fcntl

The following commands to fcntl operate on locks.

  #define  F_GETLK      7

The F_GETLK command checks whether a lock could be granted. fcntl looks for a lock that conflicts with the one described by the flock structure. If a conflicting lock exists, fcntl will overwrite the flock structure passed to it with the conflicting lock's information. If there are no conflicting locks, then the original information in the flock structure will be preserved, except the l_type field will be set to F_UNLCK.

  #define  F_SETLK       8

The F_SETLK command will try to obtain the lock as described by the flock structure. This call will not block if the lock cannot be granted; instead, fcntl will return -1 immediately with errno set to EAGAIN. You can also use this command to clear a lock by setting the l_type value in the flock structure to F_UNLCK.

  #define  F_SETLKW   9

The F_SETLKW command will try to obtain the lock as described by the flock structure. This command to fcntl will block until the lock can be granted.
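Putting the structure and the commands together, a region lock might be wrapped like this. It is only a sketch: the helper names lock_region and region_write_blocker are made up, and every region is measured from the beginning of the file (SEEK_SET) to keep the arithmetic simple.

```c
#include <sys/types.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: apply type (F_RDLCK, F_WRLCK or F_UNLCK) to
 * len bytes starting at offset start. A len of 0 means "to the end
 * of the file". If wait is nonzero, F_SETLKW is used and the call
 * blocks until the lock is granted. Returns 0 on success, -1 on error.
 */
int lock_region(int fd, short type, off_t start, off_t len, int wait)
{
    struct flock fl;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;

    return fcntl(fd, wait ? F_SETLKW : F_SETLK, &fl);
}

/*
 * Hypothetical helper: returns the PID of a process whose lock would
 * block a write lock on the region, 0 if nothing is in the way, or
 * -1 on error. A process's own locks never conflict with itself.
 */
pid_t region_write_blocker(int fd, off_t start, off_t len)
{
    struct flock fl;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;

    if (fcntl(fd, F_GETLK, &fl) == -1)
        return -1;
    return fl.l_type == F_UNLCK ? 0 : fl.l_pid;
}
```

For example, lock_region(fd, F_WRLCK, 0, 0, 1) waits for a write lock on the whole file, and lock_region(fd, F_UNLCK, 0, 0, 0) releases it again.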

5.4 Why flock?

For the most part, the advisory file locking scheme is a good thing. However, the POSIX.1 interface has a few drawbacks. For one, all of a process's locks on a file are removed when any file descriptor for that file is closed. In other words, if your process has a file open and locked, and it calls a function that opens that same file, reads it and then closes it, all of your previous locks will be removed. This can cause serious problems if you are not sure what a library routine might do. Also, locks are not passed to child processes, so a child process must create its own locks independently. In addition, locks obtained prior to an exec call are not released until the process releases them, closes the file, or exits. So if you lock a region of a file and then call exec without releasing the locks or closing the file descriptor, that region will stay locked until the process terminates, which might not be the behavior you wanted. The designers of BSD, however, created a much simpler interface to file locking that is preferred by many - flock.

The flock call is used to lock entire files, the BSD preferred method, as opposed to the fcntl advisory locks, which cover regions. The flock scheme allows the locks to be passed down to child processes. Another advantage to using the flock call is that locks belong to the open file and not to an individual descriptor, which may be preferred in some cases. It means that multiple file descriptors that reference the same open file, such as those created by dup() or inherited across fork(), each refer to the same file lock. Descriptors obtained from separate calls to open(), by contrast, are treated independently. File locks with flock are similar to fcntl locks in that only one writer can exist while multiple readers are allowed. However, flock locks can be upgraded and downgraded. When calling flock the following operations are defined:

  #define   LOCK_SH        0x01      /* shared file lock */

The LOCK_SH operation is used to create a shared lock on a file (similar to a read lock with fcntl). Multiple processes can have a shared lock on a file.

  #define   LOCK_EX        0x02      /* exclusive file lock */

The LOCK_EX operation is used to create an exclusive lock on a file. When an exclusive lock is granted, no other locks can exist on the file, including shared locks.

  #define   LOCK_NB        0x04      /* don't block when locking */

By default, calls to flock will block until the lock is granted. However, if LOCK_NB is ORed with the desired operation, the call to flock will return immediately, either with success (0) or with -1 and errno set to EWOULDBLOCK.

  #define   LOCK_UN        0x08      /* unlock file */

The LOCK_UN is used to remove a lock on a file.

Locks with flock can be upgraded or downgraded, by calling flock with the desired operation. New successful calls will replace the previous lock with the newly granted one.
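The operations above might be used as in the following sketch. The helper names try_flock, release_flock and open_locked are hypothetical, as is the choice of 0644 for a newly created file.

```c
#include <sys/file.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to take a lock without blocking. op is
 * LOCK_SH or LOCK_EX. Returns 1 if the lock was granted, 0 if someone
 * else holds a conflicting lock, and -1 on any other error.
 */
int try_flock(int fd, int op)
{
    if (flock(fd, op | LOCK_NB) == 0)
        return 1;
    return errno == EWOULDBLOCK ? 0 : -1;
}

/* Hypothetical helper: drop whatever lock fd holds. Returns 0 or -1. */
int release_flock(int fd)
{
    return flock(fd, LOCK_UN);
}

/*
 * Hypothetical helper: open path read/write (creating it if needed)
 * and wait for an exclusive lock. Returns the descriptor, or -1.
 */
int open_locked(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);

    if (fd == -1)
        return -1;
    if (flock(fd, LOCK_EX) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Calling try_flock(fd, LOCK_SH) on a descriptor that already holds an exclusive lock downgrades it, and closing the descriptor returned by open_locked releases the lock as well.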

The dup Function

  int     dup(int old);

Just as the fcntl call can be used to create duplicates of existing file descriptors, the dup function also creates duplicate file descriptors. The dup call will return a new file descriptor that refers to the same open file as the old parameter. This means the two share a single file offset: calls to read(), write() and lseek() on either descriptor will move both. Also, all options set with fcntl will remain, with the exception of the close-on-exec bit. The close-on-exec bit will be turned off, so you can dup a file descriptor and then allow a child process to call one of the exec functions, a very common use of the dup function. The old parameter is the descriptor to be duplicated and must be a valid descriptor. The new file descriptor returned by a successful call to dup() will be the lowest unused file descriptor. That means if you close STDIN_FILENO (its value is 0) and then immediately call dup(), the value of your new file descriptor will be STDIN_FILENO. If the dup function fails for any reason, the return value will be -1 and errno will be set accordingly.

The dup2 Function

  int     dup2(int old, int new);

The dup2 function is similar to the dup function, except that the new parameter is the desired descriptor number for the duplicate. If the new parameter already references a valid open descriptor and its value is not the same as the old parameter, the new file descriptor will be closed first. If the new parameter equals the old parameter, then nothing happens. The returned value from a successful call to dup2 will be equal to the new parameter. If a call to dup2 fails, the return value will be -1 and errno will be set accordingly.
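This is the mechanism behind the shell redirect cat /etc/hosts >> hosts.out described earlier in the chapter. The sketch below shows roughly what a shell might do before it calls exec; the helper name redirect_append and the 0644 mode are assumptions, not the code of any particular shell.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: make target_fd refer to path, opened for
 * appending, the way a shell sets up ">> file". Returns 0 on success
 * and -1 on error.
 */
int redirect_append(int target_fd, const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);

    if (fd == -1)
        return -1;

    if (fd != target_fd) {
        /* dup2 closes target_fd first if it was already open. */
        if (dup2(fd, target_fd) == -1) {
            close(fd);
            return -1;
        }
        /* The original descriptor is no longer needed. */
        close(fd);
    }
    return 0;
}
```

A shell would call something like redirect_append(STDOUT_FILENO, "hosts.out") in the child after fork and before exec, so that the new program inherits a standard out that already points at the file.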

5.5 Interprocess Communication

Basic interprocess communication, better known as IPC, consists mainly of functions from System V. They are still very commonly used amongst the BSDs. The IPC functions allow programs to share data with each other. This is similar to the redirect that we just covered, except that a redirect works in one direction only. In the example program, redirect could share data with the cat command by setting its STDIN_FILENO, but the cat command could not share data with the redirect program. We could modify both to allow sharing in both directions, using a fancy algorithm in which they read from each other's file descriptors, but using open files is clumsy. BSD offers much better methods for interprocess communication.

The Pipe Function

  int pipe(int *array);

The pipe function will allocate two file descriptors and store them in the two-element array passed to it (ex: int array[2]; ). If successful, the array will contain two distinct file descriptors that allow for unidirectional communication. The first descriptor (array[0]) is opened for reading, and the other (array[1]) is opened for writing. So once you have successfully called pipe, you have a unidirectional communication channel between these two descriptors: whatever you write to array[1] can be read from array[0]. The benefit of the pipe function over a redirect is that you don't have to use any files.

The file descriptors behave exactly the same way as those for files; however, they do not have any file associated with them. Unix shells make use of the pipe function for commands that are piped to each other like so:

bash$ find /  -user frankie  | grep -i jpg  | more

Here the find command will have its output piped into the grep command, which will in turn have its output piped into the more command. When creating this sequence, the shell handles the actual setting up of the pipes. These programs have no idea that they are writing to another program, because they don't really need to know. From the example you can see that once pipe is called, it is normal for the process to fork. Once this is done the processes can communicate. If bidirectional communication is the goal, you can create two sets of pipes, one for the parent to communicate with the child, and the other for the child to communicate with the parent.

Communication with pipes follows two rules:

If the read end of the pipe is closed, attempts to write to that pipe will result in a SIGPIPE being delivered to the writing process.
If the write end of the pipe is closed, attempts to read from that pipe will result in read returning 0 or end of file. Closing the write end of the pipe is the only way to deliver the end of file to the reading end of the pipe.

A successful call to the pipe function will result in 0 being returned. If the call fails then -1 is returned and the errno is set accordingly.
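The usual pipe-then-fork pattern can be sketched as below. The function name read_from_child is hypothetical; the child writes a message and the parent reads it back. Notice that each side closes the end it does not use, which is what lets the parent see end of file.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that writes msg into a pipe and
 * collect what it wrote in buf. Returns the number of bytes read, or
 * -1 on error.
 */
ssize_t read_from_child(const char *msg, char *buf, size_t buflen)
{
    int fds[2];
    pid_t pid;
    ssize_t n = 0, total = 0;

    if (pipe(fds) == -1)
        return -1;

    pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: it only writes, so close the read end first. */
        size_t len = strlen(msg);
        int ok;

        close(fds[0]);
        ok = write(fds[1], msg, len) == (ssize_t)len;
        close(fds[1]);
        _exit(ok ? 0 : 1);
    }

    /* Parent: close the write end, or read would never return 0. */
    close(fds[1]);
    while ((size_t)total < buflen &&
           (n = read(fds[0], buf + total, buflen - (size_t)total)) > 0)
        total += n;
    close(fds[0]);

    /* Reap the child so it does not linger as a zombie. */
    waitpid(pid, NULL, 0);
    return n < 0 ? -1 : total;
}
```

For two-way traffic the same pattern is repeated with a second pipe running in the opposite direction.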

NOTE: On more modern BSDs, the pipe function supports bidirectional communication from a single set of descriptors. However this behavior is not portable, and thus is not recommended.

The mkfifo Function

  int mkfifo(const char *path, mode_t mode);

Pipes are useful when communicating between related processes, but for communicating between unrelated processes, use the mkfifo function. The mkfifo function actually creates a file in the file system. This file is only a marker for other processes to use when communicating. These files are called fifos (first in first out). When a process creates a fifo, writes to this fifo will not be written to the file, but will be read by another process instead. This behavior is very similar to a pipe; the fifos are also known as named pipes.

The mkfifo function takes two arguments. The first argument is a null-terminated string that specifies the path and file name. The second argument is the mode for that file. The mode argument specifies the standard Unix permissions for the file, such as owner read and write (see S_IRUSR, S_IRGRP etc. found in /usr/include/sys/stat.h).

Once mkfifo has been called successfully, the fifo will need to be opened for reading or writing using the open function. If the call to mkfifo fails, then -1 is returned and errno is set accordingly.

Creating fifos is similar to file creation in that the process must have sufficient permissions to create the fifo. The user id of the fifo will be set to the effective user id of the process, and on BSD the group id of the fifo will be set to that of the directory in which it is created.

An important note about fifos is that they are blocking by default. Thus, opening a fifo for reading will block until another process opens it for writing, and vice versa, and a read will block until the other end writes. To avoid this, use the O_NONBLOCK option to open. You'll get the following behaviors: calls to read will return immediately, with a return value of 0 if no process has the fifo open for writing, and calls to write on a fifo with no reader will result in a SIGPIPE signal.
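A short sketch of creating and opening a fifo follows. The helper names are invented for illustration, and the owner-only permissions are just one reasonable choice.

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a fifo at path that only its owner can
 * read and write. Returns 0 on success and -1 on error (for example
 * when path already exists).
 */
int make_private_fifo(const char *path)
{
    return mkfifo(path, S_IRUSR | S_IWUSR);
}

/*
 * Hypothetical helper: open the read end without waiting for a
 * writer to show up. Returns a descriptor, or -1.
 */
int open_fifo_reader(const char *path)
{
    return open(path, O_RDONLY | O_NONBLOCK);
}

/* Returns 1 if path names a fifo, 0 if it does not, -1 on error. */
int is_fifo(const char *path)
{
    struct stat sb;

    if (stat(path, &sb) == -1)
        return -1;
    return S_ISFIFO(sb.st_mode) ? 1 : 0;
}
```

The fifo stays in the file system after both ends are closed, so whichever program created it should also remove it with unlink when it is no longer needed.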

5.6 Message Queues

Another IPC mechanism, message queues, provides another way for processes to communicate. However unlike the others that I have mentioned, you should try to avoid using these if possible. If your program uses message queues, try to re-implement them as fifos or even Unix domain sockets. Before I discuss the reasons, here's a quick overview.

Message queues are similar to fifos, but instead of using a file for a reference, they use a key. This key is an integer (of type key_t). Once a message queue is created, data sent to the message queue is buffered by the kernel. The kernel has a finite amount of memory allocated for message queues. Once this buffer is filled, no more data can be sent until a process reads from the message queue. In that sense queues are reliable and for the most part non-blocking, even if the two processes are reading and writing at different speeds. This is different from a fifo, where a slow reading process can actually slow down a faster writing process (unless the O_NONBLOCK option is set). Another benefit is that data written to the message queue will be saved until another process reads it even if the writing process exits, unlike a fifo, where if the writing process exits the fifo is closed and the reading process will receive an end of file marker.

All of this good behavior of a message queue seems nice, but let's take a closer look. Say a process opens a message queue, writes a large amount of data that fills up the kernel buffer, and then exits. The kernel will have to store this data until another process reads it, and any other process that wishes to create a message queue and write will be denied. This will stay in effect until a process reads the data or until the system is rebooted. In a sense it's possible to create a simple denial of service against a message queue.

Another problem is that keys are not guaranteed to be unique. In other words, a process has no way to determine whether it's the only one using a specific message queue. With fifos, when a process creates a fifo, it has a better chance to know it's unique because a pre-agreed upon file location can be specified (i.e. /usr/local/myapp/fifo_dir ). Your application will be able to create a unique directory upon installation and thus almost guarantee a unique fifo location. Valid message queue keys can be generated by a function called ftok to help produce unique keys, but it's not a sure bet. The side effects of this problem can be hard to determine - your application could be reading data that it was not intended to read, or your program could be writing data that is being read by some other unintended process. In short, strange errors that are tough to debug will result when you use message queues.

If you still insist on using message queues, see the following man pages: ftok(3), msgget(3), msgctl(3), msgrcv(3), and msgsnd(3).

5.7 Conclusion

This chapter has covered various system calls for manipulating open file descriptors, including scenarios where the descriptor must be closed before we fork and exec. We've also discussed file locking, and setting and removing file locks, and special file descriptors such as fifos and queues, which in a sense don't store data on a file system at all. These system calls add a great deal of programmability and flexibility to BSD. But what happens once a process has multiple file descriptors open? The next chapter discusses handling multiple descriptors efficiently.
