Process and Kernel Services

Chapter 3 FreeBSD System Programming

Copyright(C) 2001,2002,2003,2004 Nathan Boeger and Mana Tominaga

Code samples for this chapter: here

3.1 Process and Kernel Services
process: a series of actions or operations conducing to an end; especially: a continuous operation or treatment especially in manufacture (www.m-w.com)

The preceding chapter covered the boot process for BSD systems. Now, let's look at what happens once the kernel has booted and is running. Like most other modern operating systems, BSD is a multiuser, multitasking operating system, meaning it supports multiple users sharing system resources, each running multiple processes. The concept of a 'process' offers a useful abstraction that comprises all the activities managed by an operating system. The term was first introduced by Multics designers during the 1960s, as a general encompassing term - as opposed to a 'job' - in a multiprogramming environment.

3.2 Scheduling

A process is a single instance of a running program. For example, when you run Netscape on a BSD system it creates a process while it's being executed. If you have three users logged into a BSD system all running the same Netscape program all at the same time, each user will have his or her own instance of Netscape, independent of the others. BSD systems can support many such processes at once. Every process will have associated with it a Process Identifier (PID). These processes will need resources and might have to access devices such as external storage.

When multiple processes are running on a system, the operating system maintains the illusion that they are all executing at the same time. Process priorities are managed by scheduling algorithms specific to the operating system; this area of computer science is extensive and highly specialized. See the Resources section for more information.

The operating system actually moves executing processes on and off the CPU(s). This way, each process gets a specific amount of time in execution on the CPU(s). This amount of time is called a 'slice'. The slice length is almost entirely determined by the kernel's scheduling algorithm. One adjustable input to this algorithm is the 'nice' value, which lets a process influence its own execution priority. These priority values are as follows:

Nice values	Priority
-20 to 0	Higher priority for execution
0 to 20	Lower priority for execution

Note: A higher nice value yields a lower priority, which may seem counterintuitive. However, consider this calculation:

(scheduling algo calculation) + (nice value)

Simple math will show you that adding a negative value to a positive one will result in a lower number, and as such, when these numbers are sorted, the lower ones will all come at the front of the execution queue.

The default nice value for all processes is 0. An executing process can raise its nice value (that is, lower its priority), but only processes running as root can lower their nice value (that is, raise their priority). BSDs provide two basic interfaces to manipulate and retrieve these values. These are:

int getpriority(int which, int who);
int setpriority(int which, int who, int prio);

and

int nice(int incr);

The nice function adds incr to the calling process's nice value, so a positive incr lowers the priority. Note that nice() is rather obsolete: it is not very flexible, and it is really just a wrapper around the newer functions (getpriority and setpriority). The preferred method is to use setpriority().

Because a valid priority can be -1, the return value from getpriority alone cannot be used to tell success from failure. Instead, clear errno (error number) before calling getpriority, and check afterwards whether it has been set. (For more information on errno, see the man(2) pages on errno, and read the example program, nice_code.c.) Also, setpriority and getpriority can set or get the priority value for other processes, by setting the 'which' and 'who' parameters, discussed later.

The example program, nice_code.c, demonstrates retrieving and setting the nice value for the current process. If it is executed by root and given a command line value of -20, the system will seem to stop responding. This is because the program then has the highest priority that can be set, and will dominate the system. Use caution, and be patient, when setting the priority below 0; depending on the CPU, full execution can take upwards of twenty minutes. It's recommended that you run the program using the time command, as below:

bash$ time nice_code 10

Then, adjust the command line value. That way, it is evident how long the process took to execute. For example, this adjusts the value below 0:

bash$ time nice_code -1

This makes the value large:

bash$ time nice_code 20

Also, try running the program as a non-root user, and attempt to set the priority below 0. The system should deny this; only processes running as root can lower their nice value. (Because Unix systems are multi-user and each process should get a fair amount of time on the CPU, only root can change a process priority to values below 0. This prevents users from monopolizing CPU resources by lowering their nice values so that only their processes would be executed.)

3.3 System Processes

The concept of 'protected mode' was introduced in the previous chapter. In short, this is a special mode in modern CPUs that allows, amongst other things, the operating system to protect memory. Given this, there are two modes of operation. The first is kernel-land, meaning that the process executes inside the kernel's memory space and hence operates within the kernel's privileged protected mode in the CPU. The second is userland, referring to any process that is executed and doesn't operate in the kernel's protected mode.

This distinction is very important, because any given process uses resources in both kernel and userland. These resources are in various forms, such as kernel structures that account for the process, allocated memory within the process, open files, and other execution context.

On a FreeBSD system, a few critical processes aid the kernel in performing its tasks. Some of these processes are completely implemented and run within kernel space, while others run in userland. These processes are as follows:

PID	Name
0	swapper
1	init
2	pagedaemon
3	vmdaemon
4	bufdaemon
5	syncer

All the processes listed above, with the exception of init, are implemented within the kernel. That means there is no regular binary program for these processes. They only resemble userland processes and, because they are implemented within the kernel, they operate within the kernel's privileged mode. This architecture results from various design considerations. For example, the pagedaemon process, which prevents thrashing, is only awakened when the system is low on memory. So if the system has lots of free memory, it never needs to be awakened. Thus, the advantage of running the pagedaemon as a separate process is that it can avoid using the CPU unless it's really necessary. However, it does add another process to the system that needs to be scheduled. But the scheduler calculations are very small and thus almost insignificant.

The init process is the mother of all processes. Every process, other than those that operate in the kernel's privileged mode, is a descendant of init. Also, orphaned processes (those whose parent has exited), including zombies, are inherited by init. It performs some administrative tasks as well, including spawning gettys for the system's ttys and executing an orderly shutdown of the system.

3.4 Process Creation and Process IDs

As mentioned above, when a program is executed, the kernel assigns it a unique PID. The PID is a positive integer and its value will range from 0 to PID_MAX, depending on the system. (For FreeBSD, /usr/include/sys/proc.h has PID_MAX set at 99999.) The kernel will assign the PID value using the next sequential available PID. So, when PID_MAX is reached, the values will loop around. This wraparound matters whenever a PID is used for anything other than identifying a currently running process.

Every process that runs on the system is created by another process. This is done by a few system calls that are discussed later in this chapter. When a process creates a new process, the original process is referred to as the parent and the new process is referred to as the child. This parent/child relationship provides an excellent analogy - every parent can have multiple children, and parent processes are themselves descended from another process. Processes can retrieve their own PID with the getpid function and their parent's PID with the getppid function.

Processes can also be grouped into process groups. Each process group is identified by a unique process group ID. Process groups were introduced to BSD as a mechanism for shells to control jobs. Take this example:

bash$ ps auwx | grep root | awk '{print $2}'

The programs ps, grep, and awk all belong to the same process group. This allows all the commands to be controlled by referencing a single job.

A process can get its process group ID by calling getpgrp or getpgid. The prototypes for getpgrp and the related PID functions are:

 
  #include <sys/types.h>
  #include <unistd.h>
  
     pid_t   getpid(void);
     pid_t   getppid(void);
     pid_t   getpgrp(void);

All of the functions listed above are guaranteed to be successful. However, as the FreeBSD man pages strongly suggest, the PID should not be used to create a unique file name. This is because the PID is guaranteed to be unique only while the process exists. Upon exit, the PID value is returned to the pool of unused PIDs and will be reused at some point (provided, of course, that the system stays running).

A simple source for getting these values is listed in proc_ids.c.

If you run the program like so:

bash$ sleep 10 |  ps auwx |  awk {'print $2'}  | ./proc_ids

And in another terminal run:

bash$ ps -j

Only one shell should run for these commands and each will have the same PPID and PGID when executed.

3.5 Processes from Processes

Processes can be created by other processes, and there are three ways to accomplish this in BSD. They are: fork, vfork and rfork, discussed in order. There are other calls (like system) that are really just wrappers for these three.

fork

When a process calls fork, a new process is created that is a duplicate of the parent:

 
  #include <sys/types.h>
  #include <unistd.h>

     pid_t   fork(void);

Unlike other function calls, when successful, fork returns twice: once in the parent process, where the return value is the new child's PID, and once in the child process, where the return value is 0. This way, you can distinguish between the two processes. Once a new process is created with fork, it is almost an exact duplicate of its parent. The traits it inherits are, in no particular order:

Controlling terminal
Working directory
Root directory
Nice value
Resource limits
Real, effective and saved user ID
Real, effective and saved group ID
Process group ID
Supplementary group ID
The set-user-id bits
The set-group-id bits
All saved environment variables
All open file descriptors and current offsets
Close-on-exec flags
File mode creation (umask)
Signals handling settings (SIG_DFL, SIG_IGN, addresses)
Signals mask

What is actually unique in the child is its new PID, a PPID which is set to the parent's PID, process resource utilization values set to 0, and its own copy of the parent's file descriptors. The child can close these file descriptors without disturbing the parent. However, each copied descriptor refers to the same open file as the parent's, so the two processes share the file offset: a read or write in one moves the offset seen by the other. Potentially, this may cause strange output, or other failures, if both processes read or write the same file descriptors without coordination.

After a new process is created, the order of execution is not known. The best way to control this is through the use of semaphores, pipes, or signals, discussed later. That way, reads and writes are controlled, so that one process cannot clobber the other and cause both to crash.

wait

The parent process should collect the child's exit status using one of the following wait system calls:

    #include <sys/types.h>
    #include <sys/wait.h>

     pid_t     wait(int *status);

     #include <sys/time.h>
     #include <sys/resource.h>

     pid_t     waitpid(pid_t wpid, int *status, int options);
     pid_t     wait3(int *status, int options, struct rusage *rusage);
     pid_t     wait4(pid_t  wpid, int *status, int options, struct rusage *rusage);

With the wait calls that take an options parameter, its value is a bitwise OR of zero or more of the following:

WNOHANG - do not block on wait. This will cause wait to return even if no child process has terminated.

WUNTRACED - set this if you want to wait for status of stopped and untraced children (due to SIGTTIN, SIGTTOU, SIGTSTP, or SIGSTOP signals).

WLINUXCLONE - set this if you want to wait for kthread spawned from linux_clone.

When using the wait4 and waitpid calls, note that the wpid parameter is the actual PID waited on. Specifying a -1 value will cause the call to wait for any available child process to terminate. When using calls with a rusage structure, if the structure is non-NULL then a summary of the resources used by the terminated process is returned. Of course, if the process has stopped then the rusage information is not available.

The wait function calls provide a means for the parent to gather information about its child processes upon exit. Once called, the actual calling process is blocked until either a child process terminates, or a signal is received. This can be avoided; a common tactic is to set the WNOHANG option in the call to wait. Upon a successful return, the status parameter will contain information about the exited process. Also, if the calling process is not interested in the exit status, a NULL parameter can be passed as status. More details on using the WNOHANG will be covered in the Signals section.

If the exit status information is of interest, macro definitions are available in /usr/include/sys/wait.h. The preferred method is to use these, for greater cross-platform portability. The three are listed below, with explanations on usage:

WIFEXITED(status) - If this evaluates to true (that is, it has a non-zero value) then the process has terminated normally by either a call to exit() or _exit().

WIFSIGNALED(status) - If this evaluates to true then the process terminated due to a signal.

WIFSTOPPED(status) - If this evaluates to true then the process has stopped and can be restarted. This macro should be used with the WUNTRACED option or when the child is being traced (such as with ptrace).

If needed, the following macros will extract the remaining status information:

WEXITSTATUS(status) - This is used when the WIFEXITED(status) macro evaluates to true. This will evaluate to the low-order 8 bits of the argument passed to exit() or _exit().

WTERMSIG(status) - This is used when the WIFSIGNALED(status) evaluates to true. This will produce the signal number that caused the process to terminate.

WCOREDUMP(status) - This is used when the WIFSIGNALED(status) evaluates to true. This macro evaluates to true if the terminated process created a core dump at the point at which the signal was received.

WSTOPSIG(status) - This is used when the WIFSTOPPED(status) evaluates to true. This macro will produce the signal number that caused the process to stop.

If the parent process exits before its children, or exits without ever collecting their exit status, init will by default automatically inherit the children and collect their exit status. Until that happens, a child that has exited but whose status has not been collected remains on the system as a zombie.

vfork and rfork

The function vfork is similar to fork and was introduced in 2.9BSD. The difference between the two is that vfork suspends the parent's execution and lets the child borrow the parent's address space and thread of execution until the child calls one of the exec functions or exits. This is designed to accommodate the exec function calls (discussed later), avoiding a full copy of the parent's address space, which would be rather inefficient. Actually, the use of vfork is not recommended because it is not guaranteed to be available across platforms. For example, Irix as of 5.x did not have vfork. Here is the prototype for vfork:

 
  #include <sys/types.h>
  #include <unistd.h>
  
    int  vfork(void);

The function call rfork is also quite similar to fork and vfork. It was introduced in Plan 9. Its main goal was to provide a more sophisticated method for controlling process creation and creating kernel threads. In FreeBSD/OpenBSD its support extends to simulating threads and the Linux clone call. In other words, rfork allows a faster and smaller process creation routine than fork. The caller can specify the resources to be shared with the child(ren) by bitwise OR'ing flags together. Here is the prototype for rfork:

 
  #include <unistd.h>

    int    rfork(int flags);

The resources that can be selected with rfork are as follows:

RFPROC - Set this flag when you want to create a new process; otherwise the other flags will only affect the current process. By default, this flag must always be set when used.

RFNOWAIT - Set this flag if you want the child process to be dissociated from the parent. Once the child process exits, it will not leave status information for the parent to collect.

RFFDG - Set this flag when you want the parent's file descriptor table to be copied. If not, the parent and child will share a common file descriptor table.

RFCFDG - This flag is mutually exclusive with the RFFDG flag. Set this flag if you want the child process to have a new clean file descriptor table.

RFMEM - Set this flag if you want to force the kernel to share the entire address space. This is typically done by directly sharing the hardware page table. This flag cannot be used directly from C because the child process would return on the same stack as the parent. If that's what you want, then the best method is to use the rfork_thread function, as listed below:

#include <unistd.h>

   int     rfork_thread(int flags, void *stack, int (*func)(void *arg), void *arg);

This will create a new process that will run on the specified stack, and call the specified function with its arguments. Unlike fork, the return value when successful will be the new process PID to the parent only, because the child will execute the supplied function directly. If it fails, the return value will be -1 and errno will be set.

RFSIGSHARE - This is a FreeBSD-specific flag and was recently used in a well-known buffer overflow exploit in FreeBSD 4.3. This flag allows the child process to share signals with the parent (done by sharing the sigacts structure).

RFLINUXTHPN - This is another FreeBSD-specific flag. This will cause the kernel to deliver SIGUSR1 instead of SIGCHLD upon child exit. We will discuss signals in the next chapter but for now, consider it as a way for rfork to mimic the Linux clone call.

The return values from rfork are similar to those of fork. The child will get a value of 0 and the parent will get the PID of the new process. One subtle distinction - rfork will sleep, if needed, until the necessary system resources become available. Also, a fork call may be simulated by an rfork call, as in rfork(RFFDG | RFPROC). However, it's not designed for backwards compatibility. If the call to rfork fails, the value -1 is returned and errno is set. (See the man page or header file for more information on the error codes.)

As evident, the rfork call offers a slimmed down version of fork. One drawback is that there have been some widely exposed security flaws with this function call. Also it's not very compatible even across the BSDs, let alone with other Unix variants. For example, currently NetBSD-1.5 does not support rfork, and not all of the flags that are available on FreeBSD are supported on OpenBSD. As such, the recommended interface for threads is to use pthreads.

3.6 Executing Binary Programs

A process would be useless if it was limited to being an exact copy of its parent process. Hence, the exec functions are used. These are designed to replace the current process image with a new process image. Take, for example, the shell running the ls command. First, the shell will fork, or vfork, depending on its implementation, and then it will call one of the exec functions. Once the exec function is successfully called, the newly created process is replaced with the ls executable and the exec itself never returns.

If scripts, such as shell or Perl, are the target programs, the process is similar to that of binary program execution. There is an additional step - calling the actual interpreter, which is named on the first line of the script. That is, the first two characters of the file will be #!. For example, the following shows the form of an interpreter invocation, along with an optional argument:

#!  interpreter [arg]

This command below invokes the Perl interpreter, with -w as the argument:

#!/usr/bin/perl  -w

All of the exec calls are listed below. For a source code example see exec.c. The basic structure for calling them is:
(target program), (args), (environment if needed).

  
  #include <sys/types.h>
  #include <unistd.h>
    
    extern char **environ;
    
    int     execl(const char *path, const char *arg0, ...  const char *argn,   NULL);
    int     execle(const char *path, const char *arg0, ... const char  *argn, NULL, char *const envp[]);
    int     execlp(const char *file, const char *arg0, ... const char *argn , NULL );
    int     exect(const char *path, char *const argv[], char *const envp[]);
    int     execv(const char *path, char *const argv[]);
    int     execvp(const char *file, char *const argv[]);

NOTE: When using the exec calls that have the arg, ... parameters, the arguments must be a sequence of pointers to null-terminated strings, as in arg0, arg1, arg2 ... argn, followed by a final NULL pointer. If the ending NULL argument is not specified, the function cannot tell where the list ends and the call will fail or misbehave. The sequence will be the arguments to the target program to be executed.

The exec calls that have the argv[] parameter take a NULL-terminated array of pointers to null-terminated strings. This array is the argument list that will be given to the target program to be executed. Again, the final pointer in the array must be NULL.

Another distinction among the exec calls is in how the target program is specified. The functions execlp and execvp will search your PATH environment variable for the target program, as long as the target does not contain a slash (/) character. The other function calls require a path to the executable.

Quick guide:

Array arguments and Sequence list

array: exect, execv, execvp

sequence: execl, execle, execlp

Path search and direct file

path: execl, execle, exect, execv

file: execlp, execvp

Specify environment and Inherit environment

specify: execle, exect

inherit: execl, execlp, execv, execvp

system

    
   #include <stdlib.h>

     int     system(const char *string);

Another key process execution function is system. This function is fairly intuitive. The supplied argument is passed directly to the shell. If a NULL argument is specified, the function will return a non-zero value if the shell is available and 0 if not. Once called, the process will wait until the shell exits and will return the shell's exit status. If a value of -1 is returned then fork or waitpid failed; an exit status of 127 indicates that the shell's execution failed.

While system is executing, some signals such as SIGINT and SIGQUIT will be ignored, and SIGCHLD may be blocked. Also, it's possible that the system function call can crash the calling process, such as if the child process writes to stderr.

As evident, BSD design has an easy but robust interface for process creation. This sophisticated design is comparable to any other modern operating system. The next chapter will cover signals and process management including resource usage, threads, and process limits.
