Linux


3. Linux Scheduling - the Code

3.1 Goodness()

In order to select the next process to run, the function goodness() gets called. It is easy to see that real-time processes get a large weight and are therefore always run before any other process. Second, every other process gets a first-approximation goodness value according to the number of clock ticks it has left in its time slice. Linux uses a preemptive scheduling mechanism that gives every process a fixed time slice (about 200 ms, set in include/linux/sched.h); the more ticks a process has left, the better its chances are to be selected.

PROC_CHANGE_PENALTY is a magic constant (set to 15) that tries to keep a running process on its current CPU. Evidently, this only makes sense on multi-processor machines running an SMP kernel (SMP stands for Symmetric Multi-Processing). It is important to keep a process on the same processor, because moving it from one processor to another loses the benefit of the warm L1/L2 caches, provoking cache misses and thus slowing the system down.

From now on, due to the highly complex matter, I will only consider scheduling on single-processor systems. (Thinking about concurrency gets me regularly stuck in a deadlock, and rebooting myself is a pain.)

mm_struct is the data structure that describes the virtual memory (the address space) of a task or process. So if two processes are tied in weight up to this point and one of them shares its MM with the current process (p->mm == this_mm), that one gets a slightly higher weight, because switching to it avoids an address-space switch.

From kernel/sched.c

 98 /*
 99  * This is the function that decides how desirable a process is..
100  * You can weigh different processes against each other depending
101  * on what CPU they've run on lately etc to try to handle cache
102  * and TLB miss penalties.
103  *
104  * Return values:
105  *       -1000: never select this
106  *           0: out of time, recalculate counters (but it might still be
107  *              selected)
108  *         +ve: "goodness" value (the larger, the better)
109  *       +1000: realtime process, select this.
110  */
111 
112 static inline int goodness(struct task_struct * p, int this_cpu,    
                               struct mm_struct *this_mm)
113 {
114         int weight;
115 
116         /*
117          * Realtime process, select the first one on the
118          * runqueue (taking priorities within processes
119          * into account).
120          */
121         if (p->policy != SCHED_OTHER) {
122                 weight = 1000 + p->rt_priority;
123                 goto out;
124         }
125 
126         /*
127          * Give the process a first-approximation goodness value
128          * according to the number of clock-ticks it has left.
129          *
130          * Don't do any other calculations if the time slice is
131          * over..
132          */
133         weight = p->counter;
134         if (!weight)
135                 goto out;
136                         
137 #ifdef CONFIG_SMP
138         /* Give a largish advantage to the same processor...   */
139         /* (this is equivalent to penalizing other processors) */
140         if (p->processor == this_cpu)
141                 weight += PROC_CHANGE_PENALTY;
142 #endif
143 
144         /* .. and a slight advantage to the current MM */
145         if (p->mm == this_mm || !p->mm)
146                 weight += 1;
147         weight += p->priority;
148 
149 out:
150         return weight;
151 }

3.2 Schedule()

Astonishingly, most of the scheduler code has little to do with scheduling per se; it deals with interrupts, with acquiring and releasing locks on important data structures, and optionally with SMP architectures.

All the code snippets in this section are from kernel/sched.c

438 asmlinkage void schedule(void) 
439 { 
440         struct schedule_data * sched_data; 
441         struct task_struct *prev, *next, *p; 
442         struct list_head *tmp; 
443         int this_cpu, c; 
444 
445         if (!current->active_mm) BUG(); 

The following lines deal with the handling of bottom halves generated by interrupts. It is important to see (and it took me a long time) that the time accounting is done here. See time accounting for more details.

446         if (tq_scheduler) 
447                 goto handle_tq_scheduler; 
448 tq_scheduler_back: 
449 
450         prev = current; 
451         this_cpu = prev->processor; 
452 

If we are in an interrupt we must oops (crash), because an interrupt does not run in a process context and we cannot schedule away from servicing the interrupt.

453         if (in_interrupt()) 
454                 goto scheduling_in_interrupt; 
455 
456         release_kernel_lock(prev, this_cpu); 
457 

Handling softirqs:

458         /* Do "administrative" work here while we don't hold any locks */ 
459         if (softirq_state[this_cpu].active & softirq_state[this_cpu].mask) 
460                 goto handle_softirq; 
461 handle_softirq_back: 
462 
463         /* 
464          * 'sched_data' is protected by the fact that we can run 
465          * only one process per CPU. 
466          */ 
467         sched_data = &aligned_data[this_cpu].schedule_data; 

Acquiring a lock on the runqueue.

468 
469         spin_lock_irq(&runqueue_lock); 
470 

If this is a real-time process scheduled with the round-robin policy and it has used up its time slice, it gets moved to the end of the runqueue.

471         /* move an exhausted RR process to be last.. */ 
472         if (prev->policy == SCHED_RR) 
473                 goto move_rr_last; 

The default behaviour for a process that is no longer runnable is to delete it from the runqueue; a TASK_RUNNING process stays, and an interruptible sleeper with a pending signal is made runnable again.

474 move_rr_back: 
475 
476         switch (prev->state & ~TASK_EXCLUSIVE) { 
477                 case TASK_INTERRUPTIBLE: 
478                         if (signal_pending(prev)) { 
479                                 prev->state = TASK_RUNNING; 
480                                 break; 
481                         } 
482                 default: 
483                         del_from_runqueue(prev); 
484                 case TASK_RUNNING: 
485         } 
486         prev->need_resched = 0; 
487 

Here the scheduling as described above is done:

488         /* 
489          * this is the scheduler proper: 
490          */ 
491 
492 repeat_schedule: 
493         /* 
494          * Default process to select.. 
495          */ 
496         next = idle_task(this_cpu); 
497         c = -1000; 
498         if (prev->state == TASK_RUNNING) 
499                 goto still_running; 
500 
501 still_running_back:
502         list_for_each(tmp, &runqueue_head) {
503                 p = list_entry(tmp, struct task_struct, run_list);
504                 if (can_schedule(p)) {
505                         int weight = goodness(p, this_cpu, prev->active_mm);
506                         if (weight > c)
507                                 c = weight, next = p;
508                 }
509         }
510 
511         /* Do we need to re-calculate counters? */
512         if (!c)
513                 goto recalculate;

The algorithm has found the process with the highest goodness. If we are lucky, it is the process already running, so not much has to be done.

514         /*
515          * from this point on nothing can prevent us from
516          * switching to the next task, save this fact in
517          * sched_data.
518          */
519         sched_data->curr = next;
520 #ifdef CONFIG_SMP
521         next->has_cpu = 1;
522         next->processor = this_cpu;
523 #endif
524         spin_unlock_irq(&runqueue_lock);
525 
526         if (prev == next)
527                 goto same_process;

The next 33 lines are SMP and hardware specific and not within the scope of this paper (and definitely not within the scope of my knowledge), so I skip them.

Doing some statistics:

560         kstat.context_swtch++; 

If we switch processes, the scheduler must prepare the system for the switch. This is done with the function prepare_to_switch(). Switching is hardware-specific; on an Intel processor nothing happens ;-). (This took me some time to figure out: prepare_to_switch() is a macro that expands to do { } while (0), which in turn gets optimised away by the compiler.)

561         /* 
562          * there are 3 processes which are affected by a context switch: 
563          *
564          * prev == .... ==> (last => next) 
565          * 
566          * It's the 'much more previous' 'prev' that is on next's stack, 
567          * but prev is set to (the just run) 'last' process by switch_to(). 
568          * This might sound slightly confusing but makes tons of sense. 
569          */ 
570         prepare_to_switch(); 

Now the memory context has to be switched, which probably means reloading page tables and LDTs.

571         { 
572                 struct mm_struct *mm = next->mm; 
573                 struct mm_struct *oldmm = prev->active_mm; 
574                 if (!mm) { 
575                         if (next->active_mm) BUG(); 
576                         next->active_mm = oldmm; 
577                         atomic_inc(&oldmm->mm_count); 
578                         enter_lazy_tlb(oldmm, next, this_cpu); 
579                 } else { 
580                         if (next->active_mm != mm) BUG(); 
581                         switch_mm(oldmm, mm, next, this_cpu); 
582                 } 
583 
584                 if (!prev->mm) { 
585                         prev->active_mm = NULL; 
586                         mmdrop(oldmm); 
587                 } 
588         } 
589 
590         /* 
591          * This just switches the register state and the 
592          * stack. 
593          */ 

Now the actual switch between the two processes occurs (switch_to(prev, next, prev)). This is again hardware-specific; on an Intel machine it means saving the stack pointer and the base pointer, and then re-establishing the state the next process was in the last time it ran, using the data saved in its task_struct.

594         switch_to(prev, next, prev); 

The next line is SMP-specific; nothing happens on a single-processor machine.

595         __schedule_tail(prev); 
596 
597 same_process: 
598         reacquire_kernel_lock(current); 
599         return; 
600 

This is the end of the main path; what follows are the targets of the goto jumps.

A recalculation must be done for all processes. This is done by halving each counter (p->counter >> 1 == p->counter / 2) and adding the process's priority. This formula takes into account both the process's history and its priority. If a process runs often, its credits are exhausted rapidly, while a process that seldom runs does not use up its credits this fast and thus gets a better chance to run. This scheme has a tendency to prioritize processes that deserve a rapid response time. (See S&G, page 716 for more details.)

601 recalculate: 
602         { 
603                 struct task_struct *p; 
604                 spin_unlock_irq(&runqueue_lock); 
605                 read_lock(&tasklist_lock); 
606                 for_each_task(p) 
607                         p->counter = (p->counter >> 1) + p->priority; 
608                 read_unlock(&tasklist_lock); 
609                 spin_lock_irq(&runqueue_lock); 
610         } 
611         goto repeat_schedule; 
612 
613 still_running: 
614         c = prev_goodness(prev, this_cpu, prev->active_mm); 
615         next = prev; 
616         goto still_running_back; 
617 
618 handle_softirq: 
619         do_softirq(); 
620         goto handle_softirq_back; 
621 
622 handle_tq_scheduler:
623         /*
624          * do not run the task queue with disabled interrupts,
625          * cli() wouldn't work on SMP
626          */
627         sti();
628         run_task_queue(&tq_scheduler); 
629         goto tq_scheduler_back; 
630 

As described earlier, if this process is scheduled with the round-robin policy and has used up its time slice, we reset its counter to its priority and move it to the end of the runqueue, thus reducing its chances to run again immediately.

631 move_rr_last: 
632         if (!prev->counter) { 
633                 prev->counter = prev->priority; 
634                 move_last_runqueue(prev); 
635         } 
636         goto move_rr_back; 
637 

We crash, provoked by the BUG() macro.

638 scheduling_in_interrupt: 
639         printk("Scheduling in interrupt\n"); 
640         BUG();         
641         return; 
642 } 
