How System Calls Work on Linux/i86

The HyperNews Linux KHG Discussion Pages

This section covers first the mechanisms provided by the 386 for handling system calls, and then shows how Linux uses those mechanisms. This is not a reference to the individual system calls: There are very many of them, new ones are added occasionally, and they are documented in man pages that should be on your Linux system.

What Does the 386 Provide?

The 386 recognizes two event classes: exceptions and interrupts. Both cause a forced context switch to a new procedure or task. Interrupts can occur at unexpected times during the execution of a program and are used to respond to signals from hardware. Exceptions are caused by the execution of instructions.

Two sources of interrupts are recognized by the 386: maskable interrupts and nonmaskable interrupts. Two sources of exceptions are recognized by the 386: processor-detected exceptions and programmed exceptions.

Each interrupt or exception has a number, which is referred to by the 386 literature as the vector. The NMI interrupt and the processor-detected exceptions have been assigned vectors in the range 0 through 31, inclusive. The vectors for maskable interrupts are determined by the hardware. External interrupt controllers put the vector on the bus during the interrupt-acknowledge cycle. Any vector in the range 32 through 255, inclusive, can be used for maskable interrupts or programmed exceptions. Here is a listing of all the possible interrupts and exceptions:

0	divide error
1	debug exception
2	NMI interrupt
3	Breakpoint
4	INTO-detected Overflow
5	BOUND range exceeded
6	Invalid opcode
7	coprocessor not available
8	double fault
9	coprocessor segment overrun
10	invalid task state segment
11	segment not present
12	stack fault
13	general protection
14	page fault
15	reserved
16	coprocessor error
17-31	reserved
32-255	maskable interrupts

The priority of simultaneous interrupts and exceptions is:

HIGHEST	Faults except debug faults
.	Trap instructions INTO, INT n, INT 3
.	Debug traps for this instruction
.	Debug traps for next instruction
.	NMI interrupt
LOWEST	INTR interrupt

How Linux Uses Interrupts and Exceptions

Under Linux, a system call is invoked by the instruction int 0x80, which the 386 handles as a programmed exception (often called a software interrupt). We use vector 0x80 to transfer control to the kernel. This interrupt vector is initialized during system startup, along with other important vectors like the system clock vector.

iBCS2 requires an lcall 0,7 instruction, which Linux can pass to the iBCS2 compatibility module when an iBCS2-compliant binary is being executed. In fact, Linux will assume that an iBCS2-compliant binary is being executed if an lcall 0,7 call is executed, and will automatically switch modes.

As of version 0.99.2 of Linux, there are 116 system calls. Documentation for these can be found in section 2 of the man pages. When a user invokes a system call, execution flow is as follows:

Each call is vectored through a stub in libc. Each call within the libc library is generally a syscallX() macro, where X is the number of parameters used by the actual routine. Some system calls are more complex than others because of variable-length argument lists, but even these complex system calls must use the same entry point: they just have more parameter setup overhead. Examples of complex system calls include open() and ioctl().
Each syscall macro expands to an assembly routine which sets up the calling stack frame and calls _system_call() through an interrupt, via the instruction int $0x80.
For example, the setuid system call is coded as
_syscall1(int,setuid,uid_t,uid);
which will expand to:
```
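_setuid:
  subl $4,%esp            # reserve 4 bytes for a local copy of uid
  pushl %ebx              # save %ebx, which carries the first argument
  movzwl 12(%esp),%eax    # load the 16-bit uid argument, zero-extended
  movl %eax,4(%esp)       # store it in the local slot
  movl $23,%eax           # system call number for setuid
  movl 4(%esp),%ebx       # first argument goes in %ebx
  int $0x80               # enter the kernel
  movl %eax,%edx          # kernel's return value
  testl %edx,%edx
  jge L2                  # non-negative: success
  negl %edx               # negative: make the error code positive
  movl %edx,_errno        # and store it in errno
  movl $-1,%eax           # the stub itself returns -1
  popl %ebx
  addl $4,%esp
  ret
L2:
  movl %edx,%eax          # success: return the kernel's value unchanged
  popl %ebx
  addl $4,%esp
  ret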
```
The macro definitions for the syscallX() macros can be found in /usr/include/linux/unistd.h, and the user-space system call library code can be found in /usr/src/libc/syscall/.
At this point no system code for the call has been executed. Not until the int $0x80 is executed does the call transfer to the kernel entry point _system_call(). This entry point is the same for all system calls. It is responsible for saving all registers, checking to make sure a valid system call was invoked, and then ultimately transferring control to the actual system call code via the offsets in the _sys_call_table. It is also responsible for calling _ret_from_sys_call() when the system call has been completed, but before returning to user space.
Actual code for the system_call entry point can be found in /usr/src/linux/kernel/sys_call.S. Actual code for many of the system calls can be found in /usr/src/linux/kernel/sys.c, and the rest are found elsewhere. find is your friend.
After the system call has executed, _ret_from_sys_call() is called. It checks to see if the scheduler should be run, and if so, calls it.
Upon return from the system call, the syscallX() macro code checks for a negative return value, and if there is one, puts a positive copy of the return value in the global variable _errno, so that it can be accessed by code like perror().

How Linux Initializes the system call vectors

The startup_32() code found in /usr/src/linux/boot/head.S starts everything off by calling setup_idt(). This routine sets up an IDT (Interrupt Descriptor Table) with 256 entries. No interrupt entry points are actually loaded by this routine, as that is done only after paging has been enabled and the kernel has been moved to 0xC0000000. Each of the 256 entries in a protected-mode IDT is an 8-byte gate descriptor, for a total of 2048 bytes. When start_kernel() (found in /usr/src/linux/init/main.c) is called, it invokes trap_init() (found in /usr/src/linux/kernel/traps.c). trap_init() sets up the IDT via the macro set_trap_gate() (found in /usr/include/asm/system.h). trap_init() initializes the interrupt descriptor table as shown here:

0	divide_error
1	debug
2	nmi
3	int3
4	overflow
5	bounds
6	invalid_op
7	device_not_available
8	double_fault
9	coprocessor_segment_overrun
10	invalid_TSS
11	segment_not_present
12	stack_segment
13	general_protection
14	page_fault
15	reserved
16	coprocessor_error
17	alignment_check
18-48	reserved

At this point the interrupt vector for the system calls is not set up. It is initialized by sched_init() (found in /usr/src/linux/kernel/sched.c). A call to set_system_gate (0x80, &system_call) sets interrupt 0x80 to be a vector to the system_call() entry point.

How to Add Your Own System Calls

Create a directory under the /usr/src/linux/ directory to hold your code.
Put any include files in /usr/include/sys/ and /usr/include/linux/.
Add the relocatable module produced by the link of your new kernel code to the ARCHIVES and the subdirectory to the SUBDIRS lines of the top level Makefile. See fs/Makefile, target fs.o for an example.
Add a #define __NR_xx to unistd.h to assign a call number for your system call, where xx, the index, is something descriptive relating to your system call. It will be used to set up the vector through sys_call_table to invoke your code.
Add an entry point for your system call to the sys_call_table in sys.h. It should match the index (xx) that you assigned in the previous step. The NR_syscalls variable will be recalculated automatically.
Modify any kernel code in kernel/, fs/, mm/, etc. to take into account the environment needed to support your new code.
Run make from the top level to produce the new kernel incorporating your new code.

At this point, you will have to either add a syscall to your libraries, or use the proper _syscalln() macro in your user program for your programs to access the new system call. The 386DX Microprocessor Programmer's Reference Manual is a helpful reference, as is James Turley's Advanced 80386 Programming Techniques. See the Annotated Bibliography.
