The operating system's most fundamental task is to store the data needed for its operation and to make that data available. This is the domain of files and file systems, which we overview in this section.
A file is the basic entity used in computing to store information. Computer files can be considered the modern counterpart of the paper documents traditionally kept in the files of offices and libraries. A file is a block of related information that is available to a computer program (or to a user) as a single, contiguous block of data and that is retained on some kind of durable storage; a computer program itself can also be a file.
Traditionally, files contain textual or binary data. Among binary files it is worth distinguishing between executable and non-executable ones. Executable files contain instructions that the CPU can understand. (Note that some text files can also be treated as executable, because the command line interpreter can interpret their content and execute it. These files are called scripts. In the Windows environment these are the files with the extension .bat, while .exe and .com files are their binary counterparts.)
We can store any kind of data in files: text, pictures, sound, etc. There is no general rule for how this information must be stored. The format of a file is mostly defined by its content, since a file is solely a container for data, although on some platforms the format is usually indicated by the filename extension. Note that most files do not require an extension, but it is a convenient way to indicate the content.
Based on the content, the operating system can find the rules for how the bytes must be organized and interpreted meaningfully. For example, the bytes of a plain text file (e.g. a .txt file) are associated with characters based on the encoding, while the bytes of image, video, and audio files are interpreted otherwise. Most file types also allocate a few bytes for metadata, which allows a file to carry some basic information about itself. Where no separate metadata is kept, special values in the first (or last) few bytes can hold this information.
The way information is grouped into a file is entirely up to how it is designed. This has led to a plethora of more or less standardized file structures for all imaginable purposes, from the simplest to the most complex. Most computer files are used by computer programs which create, modify or delete the files for their own use on an as-needed basis. The programmers who create the programs decide what files are needed, how they are to be used and (often) their names.
Independently of the content, files have some common attributes that are used by all operating systems:
filename: used to access files.
Whether a given string of characters is a well-formed file name depends on the operating system being used. Early computers permitted only a few letters or digits in the name of a file, but modern computers allow long names (some up to 255 characters) containing almost any combination of Unicode letters and digits.
E.g. in DOS the base filename can be at most 8 characters long, taken from the letters of the English alphabet, the digits, and the hyphen and underscore characters. In Linux and Windows systems (from XP to 8) the length is limited to 255 characters, which can be almost any Unicode character, including the space.
extension: a suffix (separated from the base filename by a dot) to the name of a computer file applied to indicate the file format of its content.
Some file systems limit the length of the extension (such as the FAT file system without long filename support, which does not allow more than three characters) while others (such as NTFS) do not. Unix file systems treat the separator dot as an ordinary legal character of the name.
The exact definition, giving the criteria for deciding what part of the file name is its extension, belongs to the rules of the specific filesystem used; usually the extension is the substring which follows the last occurrence, if any, of the dot character.
size: expressed as number of bytes.
date: usually three different timestamps are stored. Namely: creation, last modification and last access date
time: the time part of the above timestamps.
In addition, each file system can extend this list with custom attributes. These attributes can serve privacy, permissions, compression or indexing as well.
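The "last dot" rule for extensions described above can be sketched in a POSIX shell with parameter expansion. The helper name `file_ext` is hypothetical and exists only for this illustration; real file systems and tools may apply different rules (for example, this naive rule reports `bashrc` as the extension of `.bashrc`).

```shell
# file_ext NAME: print the substring after the last dot of NAME,
# or an empty line if NAME contains no dot. (Hypothetical helper.)
file_ext() {
    case "$1" in
        *.*) printf '%s\n' "${1##*.}" ;;   # drop everything up to the last dot
        *)   printf '\n' ;;                # no dot, so no extension
    esac
}

file_ext "book.xml.bak"   # prints: bak
file_ext "README"         # prints an empty line
```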
In the narrow sense, a file system organizes data in an efficient manner on the device(s) that contain it. Files are identified by their filenames. On early systems that was all there was: there was no other way to organize data. Starting with the CP/M operating system in the 1970s, the concept of "drive letters" appeared to distinguish one disk or partition from another; hierarchical directories were initially absent. This method is still alive today, but most operating systems have adopted the philosophy of the common directory structure introduced by Unix.
Unix-like operating systems create a virtual file system, which makes all the files on all the devices appear to exist in a single hierarchy. This means, in those systems, there is one root directory, and every file existing on the system is located under it somewhere.
Unix-like systems assign a device name to each device, but this is not how the files on that device are accessed. Instead, to gain access to files on another device, the operating system must first be informed where in the directory tree those files should appear. This process is called mounting a file system. We should note that Microsoft's NTFS file system also supports this feature, although it is not well documented, in the form of reparse points (directories working as mount points for other file systems, so we can do without drive letters other than C:).
Today's file systems are not only for simple data storage; they are extended with several features, such as restricting and permitting access or maintaining integrity. There are several mechanisms used by file systems to control access to data. Usually the intent is to prevent a user or group of users from reading or modifying files. Another reason is to ensure that data is modified in a controlled way, so access may be restricted to a specific program. Encryption is another tool to prevent unwanted access to the data, but losing the encryption key means losing the data. Integrity, on the other hand, means that the file system structure remains consistent regardless of the actions of the programs accessing the data or of failures of the media or the power. The file system must be able to correct damaged structures.
In the stricter sense, a file system (or filesystem) is an abstraction to store, retrieve and update a set of files. It also describes the abstract data types used for storing the metadata.
Metadata is the bookkeeping information that is typically associated with each file within a file system. This includes all the attributes discussed in the previous section. Other information can include the file's device type (e.g. block, character, socket, subdirectory, etc.), its owner user ID and group ID, its access permissions and other file attributes (e.g. read-only, executable, etc.).
It is important to know that common file systems for storage devices allocate space in a granular manner, usually in multiples of the physical unit of the device. The file system is responsible for organizing files and directories, and for keeping track of which areas of the media belong to which file and which are not being used.
The size of the allocation unit is chosen when the file system is created. Choosing the allocation size based on the average size of the files expected to be in the file system can minimize the amount of unusable space, that is, the space which is lost because the file size is not an exact multiple of the allocation unit.
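The lost space can be estimated with simple integer arithmetic. The following is a minimal sketch; the function name `slack_bytes` and the sample sizes are assumptions made for the illustration, not values taken from any particular file system.

```shell
# slack_bytes SIZE UNIT: bytes left unused in the last allocation unit
# of a SIZE-byte file when space is handed out in UNIT-byte pieces.
slack_bytes() {
    size=$1
    unit=$2
    units=$(( (size + unit - 1) / unit ))   # round up to whole units
    echo $(( units * unit - size ))
}

slack_bytes 5000 4096   # two 4096-byte units are needed: 3192 bytes lost
slack_bytes 5000 512    # ten 512-byte units are needed: 120 bytes lost
```

With many small files a smaller allocation unit wastes less space, which is the trade-off the paragraph above describes.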
File system fragmentation occurs when unused space or single files are not contiguous. When a file is created and there is not an area of contiguous space available for its initial allocation the space must be assigned in fragments. When a file is modified such that it becomes larger it may exceed the space initially allocated to it, another allocation must be assigned elsewhere and the file becomes fragmented. This is also true for the free space. As files are deleted the space they were allocated eventually is considered available for use by other files. This creates alternating used and unused areas of various sizes.
However, there are several file systems that provide access to non-local files, such as files residing on a server, by acting as clients for a network protocol (e.g. NFS, SMB, or 9P clients). Others provide access to data that is not stored on a persistent device, and/or may be computed on request (e.g. procfs, which gives access to the currently running processes). These file systems are mostly unaffected by the above problems.
Generally, the following fundamental features are provided by all file systems: creating, moving and deleting files or folders. The operations supported for files are truncation, expansion (appending), creation, moving, deletion and in-place modification. Most file systems, however, do not support truncating a file from its beginning, although some may allow insertion or deletion at arbitrary positions inside the file. Before going into details it is important to understand the role of directories more thoroughly.
A directory is a file system entity that is in fact a file containing filenames and their additional attributes (the metadata is determined by the actual file system, provided it supports hierarchies). From this definition we can derive the following statement: if a directory contains files and directories are themselves files, then we can create directories inside a directory. A directory (also sometimes referred to as a folder) can be conveniently viewed as a container.
A directory contained inside another directory is called a subdirectory. The terms parent and child are often used to describe the relationship between a subdirectory and the directory in which it is cataloged, the latter being the parent. The top-most directory in such a file system, which does not have a parent of its own, is called the root directory.
However, the concept of a single root directory is not common to all operating systems. While Unix-like operating systems use only one root directory to contain the entire hierarchy and denote it with the slash (/), Windows creates a root directory for each drive, addressed by a drive letter and denoted with the backslash (\).
In many operating systems, programs have an associated working directory in which they execute. Typically, file names accessed by the program are assumed to reside within this directory if they are not specified with an explicit directory name. Unix-like systems use the pwd command to display it, while Windows uses the cd command without arguments to reach the same goal in the command line environment. When we want to access this information inside a script, we can use the PWD variable on Unix-like systems and the CD variable on Windows systems.
# Linux
[adamkoa@kkk ~]$ echo $PWD
/home/adamkoa
[adamkoa@kkk ~]$
# Windows
C:\Users\adamkoa\Documents\TAMOP-Op.Sys.notes\Book>echo %CD%
C:\Users\adamkoa\Documents\TAMOP-Op.Sys.notes\Book
C:\Users\adamkoa\Documents\TAMOP-Op.Sys.notes\Book>
Different operating systems display directories in different ways. Unix uses the letter 'd' at the front of the line, while Windows uses the '<DIR>' string to indicate the same information.
[adamkoa@kkk /mnt/data/TAMOP-Op.Sys.notes/Book]$ ls -l
total 4116
-rwxr--r-- 1 adamkoa adamkoa 374207 Apr 06 18:11 book.xhtml
-rw-rw-r-- 1 adamkoa adamkoa 371903 Apr 30 20:45 book.xml
-rw-rw-r-- 1 adamkoa adamkoa 371717 Apr 30 20:44 book.xml.bak
drwxr-xr-x 3 adamkoa adamkoa 4096 Apr 4 2011 images
drwxr-xr-x 2 adamkoa adamkoa 4096 Apr 1 2011 meta
d:\TAMOP-Op.Sys.notes\Book> dir
Volume in drive D is Data
Volume Serial Number is 22F3-AEC8
Directory of d:\TAMOP-Op.Sys.notes\Book
2011.04.30. 20:35 <DIR> .
2011.04.30. 20:35 <DIR> ..
2011.04.06. 18:11 374 207 book.xhtml
2011.04.30. 20:45 371 903 book.xml
2011.04.30. 20:44 371 717 book.xml.bak
2011.04.04. 20:00 <DIR> images
2011.04.01. 14:01 <DIR> meta
3 File(s) 1 117 827 bytes
4 Dir(s) 11 812 753 408 bytes free
In effect, directories let you sort your files into groups and place each related group into its own directory. This means you do not have to search an entire disk to find one type of file, just the directory it is most likely in, but you need to know its path.
In the graphical desktop environments of almost all modern operating systems, such as Microsoft Windows or GNU's GNOME, directories are referred to as folders, and the icons that represent them often resemble physical file folders.
There is a difference between a directory, which is a file system concept, and the graphical user interface metaphor (folder) that is used to represent it. Moreover, Windows uses the concept of special folders to help present the contents of the computer to the user in a fairly consistent way that frees the user from having to deal with absolute directory paths, which can vary between versions of Windows and between individual installations. These folders are the "Documents and Settings", the "Program Files" and the "Windows" folders, which can be placed anywhere on the storage under any name. [ Environment variables are used to point to the proper location. ]
A path is the address of an object (i.e., file, directory or link) on a file system. It points to a unique file system location by following the directory tree hierarchy, expressed as a string of characters in which path components, separated by a delimiting character, represent each directory. The delimiting character is most commonly the slash ("/", on Unix-like systems) or the backslash ("\", on Windows systems). [ The classic Mac OS is an exception because it uses the colon (":") as a delimiter. ]
Paths can be divided into two categories:
absolute or full path: a path that points to the same location on one file system regardless of the working directory. It is written in reference to a root directory. (it starts with the symbol of the root directory)
/home/adamkoa/foo.txt
On DOS and Windows systems this path is interpreted on the current drive. If you wish to refer to another drive, you first need to specify it with its associated letter (e.g. D:).
relative path: a path relative to the working directory of the user or application, so the full absolute path will not have to be given.
When a process refers to a file using a simple file name or relative path (as opposed to a file designated by a full path from a root directory), the reference is interpreted relative to the current working directory of the process. This means that a process with the working directory /home/adamkoa that asks to create the file foo.txt will end up creating the file /home/adamkoa/foo.txt .
In most operating systems the working directory can be changed by using the cd or chdir commands.
Inside the path the following symbols can be used:
. : the current (working) directory
.. : the parent directory
/ or \ : the root directory
~ : on UNIX the user's own (home) directory (also available in the HOME environment variable)
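How the . and .. symbols resolve can be tried out safely in a scratch directory. The sketch below assumes a POSIX shell with the mktemp utility available; the directory names home/user/docs are made up for the illustration.

```shell
# Build a small tree in a scratch directory (names are illustrative).
base=$(mktemp -d)
mkdir -p "$base/home/user/docs"

cd "$base/home/user/docs"
here=$(pwd -P)        # absolute (physical) path of the working directory

cd ..                 # ".." : step up to the parent directory
parent=$(pwd -P)

cd ./docs             # "."  : relative path starting at the current directory
back=$(pwd -P)

echo "$here"
echo "$parent"
echo "$back"          # the same location as the first line

cd / && rm -r "$base" # leave the scratch tree before removing it
```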
On modern operating systems one can use the space character inside a filename, which can cause several problems, especially on the command line interface, where the space is the default separator between arguments. In this case you need to escape the space with a backslash or put the whole path between quotes.
On Windows you can work around this problem by using the legacy DOS-style eight-character name that Windows assigns to any directory for substitution in environment variables. Using the directory listing command with the /x option we can get these names. For instance, the following will get you the eight-character names of all directories directly under the root:
C:\> dir /x
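On the Unix side, the quoting and escaping rules mentioned above can be illustrated as follows. This is a sketch for a POSIX sh or bash (other shells may split words differently); `count_args` is a hypothetical helper that only reports how many arguments it received.

```shell
dir=$(mktemp -d)

# Quoting keeps "my notes.txt" together as a single argument.
touch "$dir/my notes.txt"

# Escaping the space with a backslash names the same file.
[ -f "$dir"/my\ notes.txt ] && echo "found with backslash escape"

# Unquoted, the shell splits the value at the space into two arguments.
count_args() { echo "$#"; }
name="my notes.txt"
count_args $name      # 2
count_args "$name"    # 1

rm -r "$dir"
```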
Before we start discussing hidden files, we need to look back at the directory listings shown earlier. In the first case, on Unix, we cannot see the references to the current and parent directories. Nevertheless, we are aware that they must exist, because without them the file system cannot maintain the links between the individual directories. In Windows, by contrast, we immediately see the . and .. entries.
Hiding these entries has advantages and disadvantages. Basically, it is better to see a newly created directory as empty, because we believe it to be empty. Most ordinary users do not know what the dots mean. Nevertheless, these references are necessary for a working file system.
Unix-like systems normally spare users from seeing these entries because they rarely carry important information for us. We know that they exist and that they are in order, so why bother with them? Hiding objects can be useful for reducing visual "clutter" in directories, thereby making it easier for users to locate the desired files and subdirectories.
However, many operating systems and application programs routinely hide objects in order to reduce the chances of users accidentally damaging or deleting critical system and configuration files. On Windows hidden files and directories are used to protect those files, which are usually system files, from being accidentally modified or deleted by the user. Unfortunately viruses, spyware, and hijackers often hide their files in this way, making it hard to find and delete them.
In the Microsoft Windows operating systems, whether a file system object is hidden or not is an attribute of the item, along with such things as whether the file is read-only and a system file. Changing the visibility of such items is accomplished using a multi-step procedure.
Unix-like operating systems provide a larger set of attributes for file system objects but whether objects are hidden or not is not among the attributes. Rather, it is merely a superficial property that is easily changed by adding or removing a period from the beginning of the object name. In Unix-like operating systems, periods can appear anywhere within the name of a file, directory or link, and they can appear as many times as desired. However, usually, the only time that they have special significance is when used to indicate a hidden file or directory.
The dot is not a separate part of the filename. It is part of the filename!
Hidden items are, of course, completely visible to the operating system. They can also be visible to application programs, but they are not usually visible to user interfaces of application programs. Based on this we can conclude what are hidden files. A hidden file is a file that is not normally visible when examining the contents of the directory in which it resides. A hidden directory is a directory that is normally invisible when examining the contents of the directory in which it resides.
As we observed previously, every directory contains two special files whose names consist of dots only: one with a single dot and the other with two dots. (The current directory and its parent directory.) They are not hidden items, but their filenames nevertheless match the special pattern of hidden files on Unix-like operating systems, so the normal directory listing command (ls) omits them.
However, they can easily be made to appear by adding the -a option, which instructs ls to show all items in the designated directory. For example, the following command will display all items (inclusive of hidden items) in the root directory:
ls -a /
The -A option can be used with the ls command in place of the -a option to tell ls to list all items in a directory except for those two special files. For example, the following would list all items exclusive of them in the current directory:
ls -A
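The effect of the leading dot can be checked in a scratch directory. The following sketch assumes a POSIX shell with mktemp and an ls that supports -A, run as an ordinary user (some ls implementations show dot files by default for the superuser); the file names are made up.

```shell
dir=$(mktemp -d)
touch "$dir/visible.txt" "$dir/.hidden.txt"

plain=$(ls "$dir")       # default listing skips names starting with a dot
almost=$(ls -A "$dir")   # -A: everything except "." and ".."

echo "$plain"
echo "$almost"

# "Unhiding" is nothing more than a rename that drops the leading dot.
mv "$dir/.hidden.txt" "$dir/hidden.txt"
after=$(ls "$dir")
echo "$after"

rm -r "$dir"
```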
In Windows, as we have seen, visibility is controlled by an attribute which can be edited with the attrib command.
ATTRIB { +H | -H } [drive:][path][filename]
After we hide a file in Windows we cannot see it in the normal way. We need to instruct the listing command (dir) to display hidden files.
dir /A:H
A large majority of the files found on UNIX-like systems are ordinary files. Ordinary files contain ASCII (human-readable) text, executable program binaries, program data, and more. We have seen that directories are files. However, on UNIX everything is a file! From UNIX's point of view a file is not much more than a plain collection of bytes that one can read and/or write. Once you have a reference to a file - called a file descriptor - I/O access in UNIX is done using the same set of operations, the same API.
This key design principle consists of providing a unified paradigm for accessing a wide range of input/output resources: documents, directories, hard-drives, CD-Roms, modems, keyboards, printers, monitors, terminals and even some inter-process and network communications. The unified API feature is extremely empowering and fundamental for UNIX programs: you can write a program processing a file while being unaware of whether the file is actually stored on a local disk, stored on a remote drive somewhere on the network, streamed over the Internet, typed interactively by the user or even generated in memory by another program.
At this point we will list all the available types of files under the UNIX environment.
link: files that are a link to another file (or directory). These allow the target to be seen elsewhere, or under a different name, in the directory hierarchy (we will discuss this in more detail later).
Symbolic links are identified by the letter 'l' (as will also be detailed later, there are hard links as well, which cannot be told apart from each other).
lrwxrwxrwx termcap
named pipe (a.k.a. FIFO pipes): These enable two processes on the same computer to communicate like a water pipe: they allow a one-way flow from one place to another, connecting the standard I/O streams of the processes. They are explicitly created using the mkfifo command.
Pipes are identified by the letter 'p'. Details are in the following section about pipes.
prw-rw---- mypipe
socket: These also allow for inter-process communication like pipes but especially client server relations.
They are similar to named pipes but are full duplex (i.e. information can flow both ways) and allow for datagrams. This means, for example, that more than one client can connect to a server. They can also be connectionless, so programs can communicate without having to keep the connection open. Because they are treated as files, security can be implemented by using standard file permissions.
One example could be the printer daemon using the /var/run/printer file as a socket for printing jobs.
Sockets are identified by the letter 's'.
srwxrwxrwx printer
device: files representing the hardware inside the operating system. Examples are the keyboard, terminal, hard drives, memory, floppy, etc. Permissions can also be applied to these files.
The first letter of these files shows the communication method used by the device. This can be a c for character devices, which transfer data one byte (character) at a time, or a b for block devices, which transfer data in whole blocks.
crw------- /dev/kbd # keyboard
brw-rw---- /dev/hda # first IDE HDD (primary master)
# Partitions inside this device are available through the names /dev/hda1 - /dev/hda15.
# The first four serve for primary partitions, the fifth could be the extended partition and the remaining names are for logical drives.
brw-rw---- /dev/hdb # second IDE HDD (primary slave)
brw-rw---- /dev/hdc # third IDE HDD (secondary master)
brw-rw---- /dev/hdd # fourth IDE HDD (secondary slave)
brw-rw---- /dev/sda # first SCSI drive
# Partition names are similar to those of the HDDs
# /dev/sdb ... /dev/sdd also mean the same
lrwxrwxrwx /dev/cdrom -> hda # link to the CD-ROM
crw-rw---- /dev/ttyS0 to /dev/ttyS3 # 0 – 3 serial port
crw------- /dev/tty1 - /dev/tty6 # virtual console (AltF1-F6)
The three most important devices that have no hardware behind them:
crw-rw-rw- /dev/null
Discards all data written to it but reports that the write operation succeeded, and provides no data to any process that reads from it (yielding EOF immediately). Its exact counterpart in DOS was the NUL device. It is typically used to dispose of the unwanted output streams of a process, or as a convenient empty file for input streams. This is usually done by redirection, which will be discussed in the following section.
cat $filename 2>/dev/null >/dev/null
# If "$filename" does not exist there will be no error message (2>)
# If "$filename" exists its content will not be displayed (>)
# Its usage is important only if you want to test the command's output
# The meaning of 2> and > is available in the next section.
crw-rw-rw- /dev/random
Serves as a random number generator or as a pseudorandom number generator. Used directly it is not very useful, but it can be utilized as follows:
Getting a two byte integer:
od -An -N2 -i /dev/random
# -An :no address shown
# -N :size in bytes
# -i :output format (integer)
If we would like to get a number from an interval - e.g. from 100 to 1000:
echo $(( 100+(`od -An -N2 -i /dev/random` )%(1000-100+1) ))
# basic shell integer arithmetic done by the $(( ... )) construct
# command substitution done by the ` ... ` construct
Can be used to create temporary files:
touch `od -An -N2 -i /dev/random`.tmp
crw-rw-rw- /dev/zero
Provides as many null characters (ASCII NUL, 0x00) as are read from it. It is typically used with the dd utility program, which copies octet streams from a source to a destination; with /dev/zero as the source this destroys the existing data on a file system partition:
dd if=/dev/zero of=$FILE bs=$BLOCKSIZE count=$BLOCKS
# if= input file
# of= output file (/dev/<partition>)
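The behaviour of /dev/zero and /dev/null can be observed without touching a partition by writing to an ordinary temporary file instead. This is a minimal sketch; the block size and count are arbitrary values chosen for the illustration.

```shell
out=$(mktemp)

# Write 4 blocks of 1024 zero bytes into an ordinary file (not a partition).
dd if=/dev/zero of="$out" bs=1024 count=4 2>/dev/null

size=$(wc -c < "$out")                   # 4 * 1024 = 4096 bytes expected
nonzero=$(tr -d '\0' < "$out" | wc -c)   # bytes left after deleting the NULs

echo "size: $size, non-zero bytes: $nonzero"

# Reading /dev/null yields end-of-file at once, so nothing is counted.
empty=$(wc -c < /dev/null)
echo "bytes read from /dev/null: $empty"

rm -f "$out"
```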
Redirection can be performed on the standard input (stdin), standard output (stdout) and standard error (stderr). These are three standard POSIX file descriptors, corresponding to the three standard streams, which presumably every process should expect to have. Generally, a file descriptor is an index for an entry in a kernel-resident array data structure containing the details of open files. In POSIX this data structure is called a file descriptor table, and each process has its own file descriptor table.
Each process has these standard streams and always uses the following integer values to identify a given stream:
0 - standard input
1 - standard output
2 - standard error
By default, input generally comes from the keyboard or mouse, and output goes to the display monitor. The standard error, to which the program writes its error messages during execution, is by default the display monitor as well. More precisely, stdout and stderr are the files used by the process's parent process, because every process inherits its standard streams from its parent.
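Descriptors beyond 0, 1 and 2 work the same way. The sketch below, written for a POSIX shell, opens a temporary file as descriptor 3 (an arbitrary free number chosen for the illustration), writes through the descriptor, and then closes it.

```shell
out=$(mktemp)

exec 3> "$out"           # open the file for writing as descriptor 3
echo "first line"  >&3   # write through the descriptor, not the file name
echo "second line" >&3
exec 3>&-                # close descriptor 3

lines=$(wc -l < "$out")
cat "$out"
rm -f "$out"
```

Once the descriptor is open, the commands no longer need to know which file is behind it.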
With a redirection operator you can override these defaults so that a command or program takes input from some other device and sends output to a different device. This occurs before the command is executed. The following redirection operators may precede or appear anywhere within a simple command or may follow a command. Redirections are processed in the order they appear, from left to right.
< file : redirects stdin (reads from the given file)
> file : redirects stdout (writes to the given file; if it exists then overwrites it)
>> file : redirects stdout (writes to the given file; if it exists then appends to it)
2> file : redirects stderr (writes error messages to the given file)
&> file : redirects stdout and stderr to the same file
2>&1 : redirects stderr to the same location where stdout refers
1>&2 : redirects stdout to the same location where stderr refers
Examples:
dir > list.txt : the output of the dir command goes to list.txt - if it does not exist then it is created, otherwise it is overwritten
dir >> list.txt : the output of the dir command goes to list.txt - if it does not exist then it is created, otherwise the output is appended to the end of the file
sort < names.txt > list.txt : sorts names.txt and the result goes to list.txt
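That redirections are processed from left to right has a visible consequence for 2>&1. The following sketch for a POSIX shell uses a hypothetical helper, `both`, which writes one line to each output stream; the file names are made up.

```shell
dir=$(mktemp -d)

# Hypothetical helper: one line to stdout, one line to stderr.
both() { echo "to stdout"; echo "to stderr" >&2; }

both > "$dir/out.txt" 2> "$dir/err.txt"   # each stream to its own file
both > "$dir/all.txt" 2>&1                # stdout to the file, stderr follows it

# Left to right: stderr is first pointed at the current stdout, and only
# then is stdout redirected to the file, so stderr escapes the file.
leaked=$(both 2>&1 > "$dir/only_out.txt")

out=$(cat "$dir/out.txt")
err=$(cat "$dir/err.txt")
all=$(cat "$dir/all.txt")
echo "out.txt: $out"
echo "err.txt: $err"
echo "captured outside the file: $leaked"

rm -r "$dir"
```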
A pipe is a form of redirection that is used in Linux and other Unix-like operating systems to send the output of one program to another program for further processing. In this case redirection means transferring of the standard output to some other destination, to another program instead of the display monitor (which is its default destination). Pipes are used to create what can be visualized as a pipeline of commands, which is a temporary direct connection between two or more simple programs.
A pipe is designated in commands by the vertical bar character. The general syntax for pipes is:
command_1 | command_2 [| command_3 . . . ]
The chain can continue for any number of commands or programs. This creates an anonymous pipe, through which the I/O subsystem of the operating system handles the inter-process communication between the processes. Recall that we can also create named pipes with the mkfifo command.
Examples for anonymous pipes:
dir | sort : sorting the result of the dir command
dir | sort > \tmp\list.txt : sorting the result of the dir command and redirecting it to list.txt
dir | sort | more : sorting the result of the dir command, then displaying it one screen at a time with more
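A comparable pipeline on a Unix-like system might look as follows. This is only a sketch with made-up input data; printf stands in for a command that produces several lines of output.

```shell
# Three lines flow through sort and then head; no temporary file is created.
first=$(printf 'pear\napple\nfig\n' | sort | head -n 1)
echo "$first"     # apple

# Count the lines that contain the letter p (pear and apple).
count=$(printf 'pear\napple\nfig\n' | grep -c 'p')
echo "$count"     # 2
```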
Detailed examples can be found in the following section discussing filter programs. For now we will see one demonstration that can help relieve the disk subsystem by bypassing temporary files when feeding a database from a compressed file:
mkfifo --mode=0666 /tmp/namedPipe
gzip --stdout -d file.gz > /tmp/namedPipe
In another terminal we can issue the following command into the mysql prompt to feed the table from the pipe:
LOAD DATA INFILE '/tmp/namedPipe' INTO TABLE tableName;
Similarly, we can demonstrate the inter-process communication by using two terminal screens. On the first one we could create the pipe for compressing data, while on the other one we could prepare the data (as listing the content of a file):
mkfifo my_pipe
gzip -9 -c < my_pipe > out.gz
Serving the data:
cat my_file > my_pipe
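The two-terminal demonstration can also be reproduced in a single script by running the reader in the background. The sketch below replaces gzip with tr purely so that the result is easy to inspect; this substitution and the file names are assumptions of the example.

```shell
dir=$(mktemp -d)
fifo="$dir/my_pipe"
mkfifo "$fifo"

[ -p "$fifo" ] && echo "my_pipe is a named pipe"   # -p tests for a FIFO

# Reader: started in the background, it blocks until a writer appears.
tr 'a-z' 'A-Z' < "$fifo" > "$dir/out.txt" &

# Writer: feeds the pipe; the data is not stored on disk in between.
echo "hello through the pipe" > "$fifo"

wait                      # let the background reader finish
result=$(cat "$dir/out.txt")
echo "$result"            # HELLO THROUGH THE PIPE

rm -r "$dir"
```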
A notation similar to the pipes of Unix-like operating systems is used in Microsoft's MS-DOS operating system. However, the method of implementation is completely different. Sometimes the pipe-like mechanism used in MS-DOS is referred to as fake pipes because, instead of running two or more programs simultaneously and channeling the output data from one continuously to the next, MS-DOS uses a temporary file that first accumulates the entire output of the first program and only then feeds its contents to the next program.
This more closely resembles redirection through a file than it does the Unix concept of pipes. It takes more time because the second program cannot begin until the first has been completed, and it also consumes more system resources. This approach can be particularly disadvantageous if the first command produces a very large amount of output and/or does not terminate.
A similar solution on Unix-like systems, with a temporary file playing the role of the buffer:
cat my_file > tempfile1
more < tempfile1
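That both approaches deliver the same data, differing only in how it travels, can be checked with a short sketch. The input lines are made up, and sort stands in for more so that the output can be compared in a script.

```shell
dir=$(mktemp -d)
printf 'b\na\nc\n' > "$dir/my_file"

# Temporary-file style: the second step starts only after the first is done.
cat "$dir/my_file" > "$dir/tempfile1"
via_file=$(sort < "$dir/tempfile1")

# True pipe: both programs run concurrently, with no intermediate file.
via_pipe=$(cat "$dir/my_file" | sort)

[ "$via_file" = "$via_pipe" ] && echo "same result either way"
rm -r "$dir"
```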