Saturday, May 23, 2009

LINUX System Programming -- File I/O

File I/O

This chapter introduces files, the most important abstraction in the Unix environment, and file I/O, the basis of the Linux programming model. It covers reading from and writing to files, along with other basic file I/O operations, and culminates with a discussion of how the Linux kernel implements and manages files.

Opening Files

Reading via read( )

#include <unistd.h>
ssize_t read (int fd, void *buf, size_t len);



Return Values:


  • The call returns a value equal to len. All len read bytes are stored in buf. The results are as intended.
  • The call returns a value less than len, but greater than zero. The read bytes are stored in buf.
  • The call returns 0. This indicates EOF. There is nothing to read.
  • The call blocks because no data is currently available. This won’t happen in non-blocking mode.
  • The call returns -1, and errno is set to EINTR. This indicates that a signal was received before any bytes were read. The call can be reissued.
  • The call returns -1, and errno is set to EAGAIN. This indicates that the read would block because no data is currently available, and that the request should be reissued later. This happens only in nonblocking mode.
  • The call returns -1, and errno is set to a value other than EINTR or EAGAIN. This indicates a more serious error.



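ssize_t ret;

/* Keep reading until len bytes have arrived or EOF is reached. */
while (len != 0 && (ret = read (fd, buf, len)) != 0) {
        if (ret == -1) {
                if (errno == EINTR)
                        continue;       /* interrupted: reissue the call */
                perror ("read");
                break;
        }
        len -= ret;     /* a partial read: ask only for what is left */
        buf += ret;     /* and store it after the bytes already read */
}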




Writing with write( )

Synchronized I/O

Direct I/O

Closing Files

Seeking with lseek( )

Positional Reads and Writes

Truncating Files

Multiplexed I/O

Multiplexed I/O allows an application to concurrently block on multiple file descriptors, and receive notification when any one of them becomes ready to read or write without blocking. Multiplexed I/O thus becomes the pivot point for the application, designed similarly to the following:
  1. Multiplexed I/O: Tell me when any of these file descriptors are ready for I/O.
  2. Sleep until one or more file descriptors are ready.
  3. Woken up: What is ready?
  4. Handle all file descriptors ready for I/O, without blocking.
  5. Go back to step 1, and start over.

select( )


#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

int select (int n,
fd_set *readfds,
fd_set *writefds,
fd_set *exceptfds,
struct timeval *timeout);

FD_CLR(int fd, fd_set *set);
FD_ISSET(int fd, fd_set *set);
FD_SET(int fd, fd_set *set);
FD_ZERO(fd_set *set);


A call to select( ) will block until the given file descriptors are ready to perform I/O, or until an optionally specified timeout has elapsed.
The timeout parameter is a pointer to a timeval structure, which is defined as follows:

#include <sys/time.h>
struct timeval {
long tv_sec; /* seconds */
long tv_usec; /* microseconds */
};

FD_ISSET tests whether a file descriptor is part of a given set. It is used after a call to select( ) returns, to test whether a given file descriptor is ready for action:
if (FD_ISSET(fd, &readfds))
/* 'fd' is readable without blocking! */


pselect( )

#define _XOPEN_SOURCE 600

#include <sys/select.h>

int pselect (int n,
fd_set *readfds,
fd_set *writefds,
fd_set *exceptfds,
const struct timespec *timeout,
const sigset_t *sigmask);

FD_CLR(int fd, fd_set *set);
FD_ISSET(int fd, fd_set *set);
FD_SET(int fd, fd_set *set);
FD_ZERO(fd_set *set);

There are three differences between pselect( ) and select( ):
  1. pselect( ) uses the timespec structure, not the timeval structure, for its timeout parameter. The timespec structure uses seconds and nanoseconds, not seconds and microseconds, providing theoretically superior timeout resolution. In practice, however, neither call reliably provides even microsecond resolution.
  2. A call to pselect( ) does not modify the timeout parameter. Consequently, this parameter does not need to be reinitialized on subsequent invocations.
  3. The select( ) system call does not have the sigmask parameter. With respect to signals, when this parameter is set to NULL, pselect( ) behaves like select( ).

poll( )

The poll( ) system call is System V’s multiplexed I/O solution. It solves several deficiencies in select( ), although select( ) is still often used (again, most likely out of habit, or in the name of portability):

#include <sys/poll.h>
int poll (struct pollfd *fds, unsigned int nfds, int timeout);

Unlike select( ), with its inefficient three bitmask-based sets of file descriptors, poll( ) employs a single array of nfds pollfd structures, pointed to by fds. The structure is defined as follows:
#include <sys/poll.h>

struct pollfd {
int fd; /* file descriptor */
short events; /* requested events to watch */
short revents; /* returned events witnessed */
};
POLLIN | POLLPRI is equivalent to select( )’s read event, and POLLOUT | POLLWRBAND is equivalent to select( )’s write event. POLLIN is equivalent to POLLRDNORM | POLLRDBAND, and POLLOUT is equivalent to POLLWRNORM.

poll( ) Versus select( )

Although they perform the same basic job, the poll( ) system call is superior to select( ) for a handful of reasons:
  • poll( ) does not require that the user calculate and pass in as a parameter the value of the highest-numbered file descriptor plus one.
  • poll( ) is more efficient for large-valued file descriptors. Imagine watching a single file descriptor with the value 900 via select( )—the kernel would have to check each bit of each passed-in set, up to the 900th bit.
  • select( )’s file descriptor sets are statically sized, introducing a tradeoff: they are small, limiting the maximum file descriptor that select( ) can watch, or they are inefficient. Operations on large bitmasks are not efficient, especially if it is not known whether they are sparsely populated. With poll( ), one can create an array of exactly the right size. Only watching one item? Just pass in a single structure.
  • With select( ), the file descriptor sets are reconstructed on return, so each subsequent call must reinitialize them. The poll( ) system call separates the input (events field) from the output (revents field), allowing the array to be reused without change.
  • The timeout parameter to select( ) is undefined on return. Portable code needs to reinitialize it. This is not an issue with pselect( ), however.
The select( ) system call does have a few things going for it, though:
  • select( ) is more portable, as some Unix systems do not support poll( ).
  • select( ) provides better timeout resolution: down to the microsecond. Both ppoll( ) and pselect( ) theoretically provide nanosecond resolution, but in practice, none of these calls reliably provides even microsecond resolution.

LINUX System Programming -- Introduction and Essential Concepts



System Call

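System programming starts with system calls. System calls (often shortened to syscalls) are function invocations made from user space into the kernel to request a service or resource from the operating system.
  • A user-space application tells the kernel which system call to execute and with what parameters via machine registers.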
  • As a system programmer, you usually do not need any knowledge of how the kernel handles system call invocation. That knowledge is encoded into the standard calling conventions for the architecture, and handled automatically by the compiler and the C library.

The C Library

The GNU C library (glibc) provides more than its name suggests. In addition to implementing the standard C library, glibc provides wrappers for system calls, threading support, and basic application facilities.

The C Compiler

APIs and ABIs -- Both define and describe the interfaces between different pieces of computer software.
  • APIs
    • Application programming interface
    • An API defines the interfaces by which one piece of software communicates with another at the source level.
    • A real-world example is the API defined by the C standard and implemented by the standard C library.
  • ABIs
    • Whereas an API defines a source interface, an ABI defines the low-level binary interface between two or more pieces of software on a particular architecture.
    • An ABI ensures binary compatibility, guaranteeing that a piece of object code will function on any system with the same ABI, without requiring recompilation.
    • ABIs are concerned with issues such as calling conventions, byte ordering, register use, system call invocation, linking, library behavior, and the binary object format.
    • The ABI is enforced by the toolchain: the compiler, the linker, and so on.

Standards


  • POSIX
    • In the mid-1980s, the Institute of Electrical and Electronics Engineers (IEEE) spearheaded an effort to standardize system-level interfaces on Unix systems. Richard Stallman, founder of the Free Software movement, suggested the standard be named POSIX (pronounced pahz-icks), which now stands for Portable Operating System Interface.
  • C Language Standards
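    • Dennis Ritchie and Brian Kernighan’s famed book, The C Programming Language, served as the informal specification of C for many years after its 1978 publication. This version of the language is known as K&R C.
    • In 1990, the International Organization for Standardization (ISO) ratified ISO C90.
    • In 1995, ISO published an updated version of the standard, ISO C95.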
    • This was followed in 1999 with a large update to the language, ISO C99, that introduced many new features, including inline functions, new data types, variable-length arrays, C++-style comments, and new library functions.

Linux and the Standards



Concepts of Linux Programming


Files and the Filesystem
  • Regular files
    • A regular file contains bytes of data, organized into a linear array called a byte stream.
    • Any of the bytes within a file may be read from or written to. These operations start at a specific byte, which is one’s conceptual “location” within the file. This location is called the file position or file offset.
    • Writing a byte to a file position beyond the end of the file will cause the intervening bytes to be padded with zeros.
    • It is not possible to write bytes to a position before the beginning of a file.
    • Writing a byte to the middle of a file overwrites the byte previously located at that offset. Thus, it is not possible to expand a file by writing into the middle of it.
    • The size of a file is measured in bytes, and is called its length.
    • A file is referenced by an inode (originally information node), which is assigned a unique numerical value. This value is called the inode number, often abbreviated as i-number or ino.
    • Files are always opened from user space by a name, not an inode number.

  • Directories and links
    • A directory acts as a mapping of human-readable names to inode numbers. A name and inode pair is called a link.
    • Initially, there is only one directory on the disk, the root directory. This directory is usually denoted by the path /.
    • A pathname that starts at the root directory is said to be fully qualified, and is called an absolute pathname.
    • Some pathnames are not fully qualified; instead, they are provided relative to some other directory (for example, todo/plunder). These paths are called relative pathnames. When provided with a relative pathname, the kernel begins the pathname resolution in the current working directory.
    • Hard links
      • When multiple links map different names to the same inode, we call them hard links.
      • Hard links allow for complex filesystem structures with multiple pathnames pointing to the same data.
      • Deleting a file involves unlinking it from the directory structure, which is done simply by removing its name and inode pair from a directory.
      • When a pathname is unlinked, the link count is decremented by one; only when it reaches zero are the inode and its associated data actually removed from the filesystem.
    • Symbolic links (symlinks)
      • Hard links cannot span filesystems because an inode number is meaningless outside of the inode’s own filesystem. To allow links that can span filesystems, and that are a bit simpler and less transparent, Unix systems also implement symbolic links (often shortened to symlinks).
      • A symbolic link that points to a nonexistent file is called a broken link.
  • Special files
    • Special files are kernel objects that are represented as files.
    • Linux supports four: block device files, character device files, named pipes, and Unix domain sockets.
    • Special files are a way to let certain abstractions fit into the filesystem, partaking in the everything-is-a-file paradigm. Linux provides a system call to create a special file.
    • Device files may be opened, read from, and written to, allowing user space to access and manipulate devices (both physical and virtual) on the system.
    • Unix devices are generally broken into two groups: character devices and block devices.
      • A character device is accessed as a linear queue of bytes. The device driver places bytes onto the queue, one by one, and user space reads the bytes in the order that they were placed on the queue.
      • A block device, in contrast, is accessed as an array of bytes. The device driver maps the bytes over a seekable device, and user space is free to access any valid bytes in the array, in any order.
    • Named pipes (often called FIFOs, short for “first in, first out”) are an interprocess communication (IPC) mechanism that provides a communication channel over a file descriptor, accessed via a special file.
    • Sockets are an advanced form of IPC that allow for communication between two different processes, not only on the same machine, but on two different machines. Unix domain sockets use a special file residing on a filesystem, often simply called a socket file.
  • Filesystems and namespaces
    • Linux, like all Unix systems, provides a global and unified namespace of files and directories.
    • A filesystem is a collection of files and directories in a formal and valid hierarchy.
    • Filesystems usually exist physically (i.e., are stored on disk), although Linux also supports virtual filesystems that exist only in memory, and network filesystems that exist on machines across the network.
    • Linux supports a wide range of filesystems, including media-specific filesystems (for example, ISO9660), network filesystems (NFS), native filesystems (ext3), filesystems from other Unix systems (XFS), and even filesystems from non-Unix systems (FAT).
    • The smallest addressable unit on a block device is the sector. A block device cannot transfer or access a unit of data smaller than a sector.
    • The smallest logically addressable unit on a filesystem is the block. The block is an abstraction of the filesystem.

Processes


  • Processes are object code in execution: active, alive, running programs. But they’re more than just object code—processes consist of data, resources, state, and a virtualized computer.
  • Processes begin life as executable object code, which is machine-runnable code in an executable format that the kernel understands (the format most common in Linux is ELF).
  • The most important and common sections are the text section, the data section, and the bss section.
  • Processes typically request and manipulate resources only through system calls.
  • A process’ resources, along with data and statistics related to the process, are stored inside the kernel in the process’ process descriptor.
  • Threads
    • Each process consists of one or more threads of execution (usually just called threads).
    • A thread is the unit of activity within a process, the abstraction responsible for executing code, and maintaining the process’ running state.
    • Most processes consist of only a single thread; they are called single-threaded. Processes that contain multiple threads are said to be multithreaded.
    • A thread consists of a stack (which stores its local variables, just as the process stack does on nonthreaded systems), processor state, and a current location in the object code (usually stored in the processor’s instruction pointer). The majority of the remaining parts of a process are shared among all threads.
    • Internally, the Linux kernel implements a unique view of threads: they are simply normal processes that happen to share some resources (most notably, an address space). In user space, Linux implements threads in accordance with POSIX 1003.1c (known as pthreads). The name of the current Linux thread implementation, which is part of glibc, is the Native POSIX Threading Library (NPTL).
  • Process hierarchy
    • Each process is identified by a unique positive integer called the process ID (pid). The first process, the init process, has pid 1.
    • New processes are created via the fork( ) system call. This system call creates a duplicate of the calling process. The original process is called the parent; the new process is called the child.
    • If a parent process terminates before its child, the kernel will reparent the child to the init process.
    • A process that has terminated, but not yet been waited upon, is called a zombie.
  • Users and Groups
    • Authorization in Linux is provided by users and groups. Each user is associated with a unique positive integer called the user ID (uid).
    • Each process is in turn associated with exactly one uid, which identifies the user running the process, and is called the process’ real uid.
    • In addition to the real uid, each process also has an effective uid, a saved uid, and a filesystem uid.
    • Each user may belong to one or more groups, including a primary or login group, listed in /etc/passwd, and possibly a number of supplemental groups, listed in /etc/group.
    • Each process is therefore also associated with a corresponding group ID (gid), and has a real gid, an effective gid, a saved gid, and a filesystem gid.
  • Permissions
    • Table 1-1. Permission bits and their values
  • Signals
    • Signals are a mechanism for one-way asynchronous notifications. A signal may be sent from the kernel to a process, from a process to another process, or from a process to itself.
    • Handled signals cause the execution of a user-supplied signal handler function. The program jumps to this function as soon as the signal is received, and (when the signal handler returns) the control of the program resumes at the previously interrupted instruction.
  • Interprocess Communication
    • Allowing processes to exchange information and notify each other of events is one of an operating system’s most important jobs.
    • IPC mechanisms supported by Linux include pipes, named pipes, semaphores, message queues, shared memory, and futexes (short for "fast userspace mutexes"; see "Futexes Are Tricky").

Headers

Linux system programming revolves around a handful of headers. Both the kernel itself and glibc provide the headers used in system-level programming. These headers include the standard C fare (for example, ), and the usual Unix offerings (say, ).

Error Handling

Table 1-2. Errors and their descriptions