Next Previous Contents

6. Linux Multitasking

6.1 Overview

This section will analyze data structures--the mechanism used to manage multitasking environment under Linux.

Task States

A Linux Task can be one of the following states (according to [include/linux.h]):

  1. TASK_RUNNING, it means that it is in the "Ready List"
  2. TASK_INTERRUPTIBLE, task waiting for a signal or a resource (sleeping)
  3. TASK_UNINTERRUPTIBLE, task waiting for a resource (sleeping), it is in same "Wait Queue"
  4. TASK_ZOMBIE, task child without father
  5. TASK_STOPPED, task being debugged

Graphical Interaction

       ______________     CPU Available     ______________
      |              |  ---------------->  |              |
      | TASK_RUNNING |                     | Real Running |  
      |______________|  <----------------  |______________|
                           CPU Busy
            |   /|\       
Waiting for |    | Resource  
 Resource   |    | Available             
           \|/   |      
    ______________________                     
   |                      |
   | TASK_INTERRUPTIBLE / |
   | TASK-UNINTERRUPTIBLE |
   |______________________|
 
                     Main Multitasking Flow

6.2 Timeslice

PIT 8253 Programming

Each 10 ms (depending on HZ value) an IRQ0 comes, which helps us in a multitasking environment. This signal comes from PIC 8259 (in arch 386+) which is connected to PIT 8253 with a clock of 1.19318 MHz.

    _____         ______        ______        
   | CPU |<------| 8259 |------| 8253 |
   |_____| IRQ0  |______|      |___/|\|
                                    |_____ CLK 1.193.180 MHz
          
// From include/asm/param.h
#ifndef HZ 
#define HZ 100 
#endif
 
// From include/asm/timex.h
#define CLOCK_TICK_RATE 1193180 /* Underlying HZ */
 
// From include/linux/timex.h
#define LATCH ((CLOCK_TICK_RATE + HZ/2) / HZ) /* For divider */
 
// From arch/i386/kernel/i8259.c
outb_p(0x34,0x43); /* binary, mode 2, LSB/MSB, ch 0 */ 
outb_p(LATCH & 0xff , 0x40); /* LSB */
outb(LATCH >> 8 , 0x40); /* MSB */
 

So we program 8253 (PIT, Programmable Interval Timer) with LATCH = (1193180/HZ) = 11931.8 when HZ=100 (default). LATCH indicates the frequency divisor factor.

LATCH = 11931.8 gives to 8253 (in output) a frequency of 1193180 / 11931.8 = 100 Hz, so period = 10ms

So Timeslice = 1/HZ.

With each Timeslice we temporarily interrupt current process execution (without task switching), and we do some housekeeping work, after which we'll return back to our previous process.

Linux Timer IRQ ICA

Linux Timer IRQ
IRQ 0 [Timer]
 |  
\|/
|IRQ0x00_interrupt        //   wrapper IRQ handler
   |SAVE_ALL              ---   
      |do_IRQ                |   wrapper routines
         |handle_IRQ_event  ---
            |handler() -> timer_interrupt  // registered IRQ 0 handler
               |do_timer_interrupt
                  |do_timer  
                     |jiffies++;
                     |update_process_times  
                     |if (--counter <= 0) { // if time slice ended then
                        |counter = 0;        //   reset counter           
                        |need_resched = 1;   //   prepare to reschedule
                     |}
         |do_softirq
         |while (need_resched) { // if necessary
            |schedule             //   reschedule
            |handle_softirq
         |}
   |RESTORE_ALL
 

Functions can be found under:

Notes:

  1. Function "IRQ0x00_interrupt" (like others IRQ0xXY_interrupt) is directly pointed by IDT (Interrupt Descriptor Table, similar to Real Mode Interrupt Vector Table, see Cap 11 for more), so EVERY interrupt coming to the processor is managed by "IRQ0x#NR_interrupt" routine, where #NR is the interrupt number. We refer to it as "wrapper irq handler".
  2. wrapper routines are executed, like "do_IRQ","handle_IRQ_event" [arch/i386/kernel/irq.c].
  3. After this, control is passed to official IRQ routine (pointed by "handler()"), previously registered with "request_irq" [arch/i386/kernel/irq.c], in this case "timer_interrupt" [arch/i386/kernel/time.c].
  4. "timer_interrupt" [arch/i386/kernel/time.c] routine is executed and, when it ends,
  5. control backs to some assembler routines [arch/i386/kernel/entry.S].

Description:

To manage Multitasking, Linux (like every other Unix) uses a ''counter'' variable to keep track of how much CPU was used by the task. So, on each IRQ 0, the counter is decremented (point 4) and, when it reaches 0, we need to switch task to manage timesharing (point 4 "need_resched" variable is set to 1, then, in point 5 assembler routines control "need_resched" and call, if needed, "schedule" [kernel/sched.c]).

6.3 Scheduler

The scheduler is the piece of code that chooses what Task has to be executed at a given time.

Any time you need to change running task, select a candidate. Below is the ''schedule [kernel/sched.c]'' function.

|schedule
   |do_softirq // manages post-IRQ work
   |for each task
      |calculate counter
   |prepare_to__switch // does anything
   |switch_mm // change Memory context (change CR3 value)
   |switch_to (assembler)
      |SAVE ESP
      |RESTORE future_ESP
      |SAVE EIP
      |push future_EIP *** push parameter as we did a call 
         |jmp __switch_to (it does some TSS work) 
         |__switch_to()
          ..
         |ret *** ret from call using future_EIP in place of call address
      new_task

6.4 Bottom Half, Task Queues. and Tasklets

Overview

In classic Unix, when an IRQ comes (from a device), Unix makes "task switching" to interrogate the task that requested the device.

To improve performance, Linux can postpone the non-urgent work until later, to better manage high speed event.

This feature is managed since kernel 1.x by the "bottom half" (BH). The irq handler "marks" a bottom half, to be executed later, in scheduling time.

In the latest kernels there is a "task queue"that is more dynamic than BH and there is also a "tasklet" to manage multiprocessor environments.

BH schema is:

  1. Declaration
  2. Mark
  3. Execution

Declaration

#define DECLARE_TASK_QUEUE(q) LIST_HEAD(q)
#define LIST_HEAD(name) \
   struct list_head name = LIST_HEAD_INIT(name) 
struct list_head { 
   struct list_head *next, *prev; 
};
#define LIST_HEAD_INIT(name) { &(name), &(name) } 
 
      ''DECLARE_TASK_QUEUE'' [include/linux/tqueue.h, include/linux/list.h] 

"DECLARE_TASK_QUEUE(q)" macro is used to declare a structure named "q" managing task queue.

Mark

Here is the ICA schema for "mark_bh" [include/linux/interrupt.h] function:

|mark_bh(NUMBER)
   |tasklet_hi_schedule(bh_task_vec + NUMBER)
      |insert into tasklet_hi_vec
         |__cpu_raise_softirq(HI_SOFTIRQ) 
            |soft_active |= (1 << HI_SOFTIRQ)
 
                   ''mark_bh''[include/linux/interrupt.h]

For example, when an IRQ handler wants to "postpone" some work, it would "mark_bh(NUMBER)", where NUMBER is a BH declarated (see section before).

Execution

We can see this calling from "do_IRQ" [arch/i386/kernel/irq.c] function:

|do_softirq
   |h->action(h)-> softirq_vec[TASKLET_SOFTIRQ]->action -> tasklet_action
      |tasklet_vec[0].list->func
         

"h->action(h);" is the function has been previously queued.

6.5 Very low level routines

set_intr_gate

set_trap_gate

set_task_gate (not used).

(*interrupt)[NR_IRQS](void) = { IRQ0x00_interrupt, IRQ0x01_interrupt, ..}

NR_IRQS = 224 [kernel 2.4.2]

6.6 Task Switching

When does Task switching occur?

Now we'll see how the Linux Kernel switchs from one task to another.

Task Switching is needed in many cases, such as the following:

Task Switching

                           TASK SWITCHING TRICK
#define switch_to(prev,next,last) do {                                  \
        asm volatile("pushl %%esi\n\t"                                  \
                     "pushl %%edi\n\t"                                  \
                     "pushl %%ebp\n\t"                                  \
                     "movl %%esp,%0\n\t"        /* save ESP */          \
                     "movl %3,%%esp\n\t"        /* restore ESP */       \
                     "movl $1f,%1\n\t"          /* save EIP */          \
                     "pushl %4\n\t"             /* restore EIP */       \
                     "jmp __switch_to\n"                                \
                     "1:\t"                                             \
                     "popl %%ebp\n\t"                                   \
                     "popl %%edi\n\t"                                   \
                     "popl %%esi\n\t"                                   \
                     :"=m" (prev->thread.esp),"=m" (prev->thread.eip),  \
                      "=b" (last)                                       \
                     :"m" (next->thread.esp),"m" (next->thread.eip),    \
                      "a" (prev), "d" (next),                           \
                      "b" (prev));                                      \
} while (0)

Trick is here:

  1. ''pushl %4'' which puts future_EIP into the stack
  2. ''jmp __switch_to'' which execute ''__switch_to'' function, but in opposite of ''call'' we will return to valued pushed in point 1 (so new Task!)

      U S E R   M O D E                 K E R N E L     M O D E

 |          |     |          |       |          |     |          |
 |          |     |          | Timer |          |     |          |
 |          |     |  Normal  |  IRQ  |          |     |          |
 |          |     |   Exec   |------>|Timer_Int.|     |          |
 |          |     |     |    |       | ..       |     |          |
 |          |     |    \|/   |       |schedule()|     | Task1 Ret|
 |          |     |          |       |_switch_to|<--  |  Address |
 |__________|     |__________|       |          |  |  |          |
                                     |          |  |S |          | 
Task1 Data/Stack   Task1 Code        |          |  |w |          |
                                     |          | T|i |          |
                                     |          | a|t |          |
 |          |     |          |       |          | s|c |          |
 |          |     |          | Timer |          | k|h |          |
 |          |     |  Normal  |  IRQ  |          |  |i |          | 
 |          |     |   Exec   |------>|Timer_Int.|  |n |          |
 |          |     |     |    |       | ..       |  |g |          |
 |          |     |    \|/   |       |schedule()|  |  | Task2 Ret|
 |          |     |          |       |_switch_to|<--  |  Address |
 |__________|     |__________|       |__________|     |__________|
 
Task2 Data/Stack   Task2 Code        Kernel Code  Kernel Data/Stack

6.7 Fork

Overview

Fork is used to create another task. We start from a Task Parent, and we copy many data structures to Task Child.

 
                               |         |
                               | ..      |
         Task Parent           |         |
         |         |           |         |
         |  fork   |---------->|  CREATE |   
         |         |          /|   NEW   |
         |_________|         / |   TASK  |
                            /  |         |
             ---           /   |         |
             ---          /    | ..      |
                         /     |         |
         Task Child     / 
         |         |   /
         |  fork   |<-/
         |         |
         |_________|
              
                       Fork SysCall

What is not copied

New Task just created (''Task Child'') is almost equal to Parent (''Task Parent''), there are only few differences:

  1. obviously PID
  2. child ''fork()'' will return 0, while parent ''fork()'' will return PID of Task Child, to distinguish them each other in User Mode
  3. All child data pages are marked ''READ + EXECUTE'', no "WRITE'' (while parent has WRITE right for its own pages) so, when a write request comes, a ''Page Fault'' exception is generated which will create a new independent page: this mechanism is called ''Copy on Write'' (see Cap.10 for more).

Fork ICA

|sys_fork 
   |do_fork
      |alloc_task_struct 
         |__get_free_pages
       |p->state = TASK_UNINTERRUPTIBLE
       |copy_flags
       |p->pid = get_pid    
       |copy_files
       |copy_fs
       |copy_sighand
       |copy_mm // should manage CopyOnWrite (I part)
          |allocate_mm
          |mm_init
             |pgd_alloc -> get_pgd_fast
                |get_pgd_slow
          |dup_mmap
             |copy_page_range
                |ptep_set_wrprotect
                   |clear_bit // set page to read-only              
          |copy_segments // For LDT
       |copy_thread
          |childregs->eax = 0  
          |p->thread.esp = childregs // child fork returns 0
          |p->thread.eip = ret_from_fork // child starts from fork exit
       |retval = p->pid // parent fork returns child pid
       |SET_LINKS // insertion of task into the list pointers
       |nr_threads++ // Global variable
       |wake_up_process(p) // Now we can wake up just created child
       |return retval
              
               fork ICA
 

Copy on Write

To implement Copy on Write for Linux:

  1. Mark all copied pages as read-only, causing a Page Fault when a Task tries to write to them.
  2. Page Fault handler creates a new page.

 
 | Page 
 | Fault 
 | Exception
 |
 |
 -----------> |do_page_fault
                 |handle_mm_fault
                    |handle_pte_fault 
                       |do_wp_page        
                          |alloc_page      // Allocate a new page
                          |break_cow
                             |copy_cow_page // Copy old page to new one
                             |establish_pte // reconfig Page Table pointers
                                |set_pte
                            
              Page Fault ICA
 


Next Previous Contents