In order to select the next process to run, the function goodness() gets called. It is easy to see that real-time processes get a very large weight and are always selected before any other process. Second, every ordinary process gets a first-approximation goodness value according to the time it has left to run: Linux uses a preemptive scheduling mechanism that gives every process a fixed time slice (set in include/linux/sched.h to 200 ms), and the more of its time slice a process has left (the more clock ticks remain in its counter), the better its chances of being selected.
PROC_CHANGE_PENALTY is a magic constant (set to 15) that tries to keep a running process on its current CPU. Evidently, this only makes sense on multi-processor machines running an SMP kernel (SMP stands for Symmetric Multi-Processing). It is important to keep a process on the same processor, because moving a process from one processor to another loses the benefit of the L1/L2 caches, provoking cache misses and thus slowing the system down.
From now on, due to the highly complex matter, I will only consider scheduling on systems with one processor. (Thinking about concurrency gets me regularly stuck in a deadlock, and rebooting myself is a pain.)
mm_struct is the data structure that describes the virtual memory of a task or process. So if two processes are tied up to this point and one of them has an mm (memory mapping) equal to the current mm (or is a kernel thread with no mm of its own, which can run in any address space), that one gets a slightly higher goodness, because no memory-mapping switch will be needed.
From kernel/sched.c
 98 /*
 99  * This is the function that decides how desirable a process is..
100  * You can weigh different processes against each other depending
101  * on what CPU they've run on lately etc to try to handle cache
102  * and TLB miss penalties.
103  *
104  * Return values:
105  *	 -1000: never select this
106  *	     0: out of time, recalculate counters (but it might still be
107  *		selected)
108  *	   +ve: "goodness" value (the larger, the better)
109  *	 +1000: realtime process, select this.
110  */
111
112 static inline int goodness(struct task_struct * p, int this_cpu, struct mm_struct *this_mm)
113 {
114 	int weight;
115
116 	/*
117 	 * Realtime process, select the first one on the
118 	 * runqueue (taking priorities within processes
119 	 * into account).
120 	 */
121 	if (p->policy != SCHED_OTHER) {
122 		weight = 1000 + p->rt_priority;
123 		goto out;
124 	}
125
126 	/*
127 	 * Give the process a first-approximation goodness value
128 	 * according to the number of clock-ticks it has left.
129 	 *
130 	 * Don't do any other calculations if the time slice is
131 	 * over..
132 	 */
133 	weight = p->counter;
134 	if (!weight)
135 		goto out;
136
137 #ifdef CONFIG_SMP
138 	/* Give a largish advantage to the same processor... */
139 	/* (this is equivalent to penalizing other processors) */
140 	if (p->processor == this_cpu)
141 		weight += PROC_CHANGE_PENALTY;
142 #endif
143
144 	/* .. and a slight advantage to the current MM */
145 	if (p->mm == this_mm || !p->mm)
146 		weight += 1;
147 	weight += p->priority;
148
149 out:
150 	return weight;
151 }
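To get a feel for how these bonuses stack up, here is a small stand-alone sketch (not kernel code; the fake_task structure, the task names and all the numbers are invented for illustration) that mimics the weight calculation above for a few hypothetical tasks:

/* Stand-alone sketch mimicking the goodness() weight calculation.
 * Everything here is made up for illustration; only the constant
 * PROC_CHANGE_PENALTY and the general formula follow the kernel. */
#include <stdio.h>

#define PROC_CHANGE_PENALTY 15   /* same magic constant as in the kernel */

struct fake_task {
    const char *name;
    int realtime;     /* non-zero: a real-time policy */
    int rt_priority;
    int counter;      /* clock ticks left in the time slice */
    int priority;     /* static priority */
    int same_cpu;     /* 1 if it last ran on this CPU */
    int same_mm;      /* 1 if it shares the current address space */
};

static int fake_goodness(const struct fake_task *p)
{
    int weight;

    if (p->realtime)                  /* real-time: always wins */
        return 1000 + p->rt_priority;

    weight = p->counter;              /* ticks left in the slice */
    if (!weight)                      /* slice used up: weight 0 */
        return 0;

    if (p->same_cpu)                  /* cache-affinity bonus (SMP) */
        weight += PROC_CHANGE_PENALTY;
    if (p->same_mm)                   /* avoid a memory-mapping switch */
        weight += 1;

    return weight + p->priority;
}

int main(void)
{
    struct fake_task tasks[] = {
        { "editor (fresh slice)",   0,  0, 20, 20, 1, 0 },
        { "compiler (half slice)",  0,  0, 10, 20, 1, 1 },
        { "daemon (slice used up)", 0,  0,  0, 20, 0, 0 },
        { "rt task",                1, 50,  0,  0, 0, 0 },
    };

    for (unsigned i = 0; i < sizeof(tasks) / sizeof(tasks[0]); i++)
        printf("%-24s goodness = %d\n",
               tasks[i].name, fake_goodness(&tasks[i]));
    return 0;
}

With these made-up numbers, an ordinary process with a full time slice on its own CPU scores 55, a half-used slice on the same CPU with the same mm scores 46, an exhausted slice scores 0, and the real-time task scores 1050.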
Astonishingly, most of the scheduler code does not have much to do with scheduling per se, but with handling interrupts, acquiring and releasing locks on important data structures, and, where applicable, dealing with SMP architectures.
All the code snippets in this section are from kernel/sched.c
438 asmlinkage void schedule(void)
439 {
440 	struct schedule_data * sched_data;
441 	struct task_struct *prev, *next, *p;
442 	struct list_head *tmp;
443 	int this_cpu, c;
444
445 	if (!current->active_mm) BUG();
The following lines deal with the handling of bottom halves generated by interrupts. It is important to see (and it took me a long time) that the time accounting is done here. See time accounting for more details.
446 	if (tq_scheduler)
447 		goto handle_tq_scheduler;
448 tq_scheduler_back:
449
450 	prev = current;
451 	this_cpu = prev->processor;
452
If we are in an interrupt, we must oops (crash), because an interrupt is not run in a process context and we cannot schedule away from servicing the interrupt.
453 	if (in_interrupt())
454 		goto scheduling_in_interrupt;
455
456 	release_kernel_lock(prev, this_cpu);
457
Handling softirqs:
458 	/* Do "administrative" work here while we don't hold any locks */
459 	if (softirq_state[this_cpu].active & softirq_state[this_cpu].mask)
460 		goto handle_softirq;
461 handle_softirq_back:
462
463 	/*
464 	 * 'sched_data' is protected by the fact that we can run
465 	 * only one process per CPU.
466 	 */
467 	sched_data = &aligned_data[this_cpu].schedule_data;
Acquiring a lock on the runqueue.
468
469 	spin_lock_irq(&runqueue_lock);
470
If this is a real-time process scheduled with the round-robin policy and it has exhausted its time slice, it will get moved to the end of the runqueue.
471 	/* move an exhausted RR process to be last.. */
472 	if (prev->policy == SCHED_RR)
473 		goto move_rr_last;
Back here, the default behaviour for a process that is no longer runnable is to delete it from the runqueue; a process that is still TASK_RUNNING stays on it, and an interruptible sleeper with a pending signal is made runnable again.
474 move_rr_back:
475
476 	switch (prev->state & ~TASK_EXCLUSIVE) {
477 		case TASK_INTERRUPTIBLE:
478 			if (signal_pending(prev)) {
479 				prev->state = TASK_RUNNING;
480 				break;
481 			}
482 		default:
483 			del_from_runqueue(prev);
484 		case TASK_RUNNING:
485 	}
486 	prev->need_resched = 0;
487
Here the actual scheduling decision, as described above, is made:
488 	/*
489 	 * this is the scheduler proper:
490 	 */
491
492 repeat_schedule:
493 	/*
494 	 * Default process to select..
495 	 */
496 	next = idle_task(this_cpu);
497 	c = -1000;
498 	if (prev->state == TASK_RUNNING)
499 		goto still_running;
500
501 still_running_back:
502 	list_for_each(tmp, &runqueue_head) {
503 		p = list_entry(tmp, struct task_struct, run_list);
504 		if (can_schedule(p)) {
505 			int weight = goodness(p, this_cpu, prev->active_mm);
506 			if (weight > c)
507 				c = weight, next = p;
508 		}
509 	}
510
511 	/* Do we need to re-calculate counters? */
512 	if (!c)
513 		goto recalculate;
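The loop above walks the runqueue with the kernel's intrusive list macros. As a rough illustration of how that works (a minimal sketch, not the kernel's implementation; the fake_task structure is invented), list_entry() recovers the enclosing structure from the embedded list_head by subtracting the member's offset:

/* Minimal sketch of an intrusive list traversal like the one above:
 * the list node is embedded in the larger structure, and
 * list_entry() subtracts the member offset to get back to the
 * enclosing structure.  Not the kernel's code. */
#include <stdio.h>
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

/* hypothetical, stripped-down task structure for illustration only */
struct fake_task {
    int counter;
    int priority;
    struct list_head run_list;   /* embedded list node */
};

#define list_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

int main(void)
{
    struct fake_task a = { 5, 20, { NULL, NULL } };
    struct fake_task b = { 0, 20, { NULL, NULL } };
    struct list_head head;

    /* link head -> a -> b -> head (circular doubly linked list) */
    head.next = &a.run_list;        head.prev = &b.run_list;
    a.run_list.next = &b.run_list;  a.run_list.prev = &head;
    b.run_list.next = &head;        b.run_list.prev = &a.run_list;

    /* walk the "runqueue" the way list_for_each() does */
    for (struct list_head *tmp = head.next; tmp != &head; tmp = tmp->next) {
        struct fake_task *p = list_entry(tmp, struct fake_task, run_list);
        printf("task with counter=%d priority=%d\n", p->counter, p->priority);
    }
    return 0;
}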
The algorithm has now found the process with the highest goodness. If we are lucky, it is the same process that is already running, so not much has to be done.
514 	/*
515 	 * from this point on nothing can prevent us from
516 	 * switching to the next task, save this fact in
517 	 * sched_data.
518 	 */
519 	sched_data->curr = next;
520 #ifdef CONFIG_SMP
521 	next->has_cpu = 1;
522 	next->processor = this_cpu;
523 #endif
524 	spin_unlock_irq(&runqueue_lock);
525
526 	if (prev == next)
527 		goto same_process;
The next 33 lines are SMP and hardware specific and not within the scope of this paper (and definitely not within the scope of my knowledge), so I skip them.
Doing some statistics:
560 kstat.context_swtch++;
If we switch processes, the scheduler must prepare the system for the switch. This is done with the function prepare_to_switch(). It is hardware-specific; on an Intel processor nothing happens ;-). (This took me some time to figure out: prepare_to_switch() is a macro that expands to do { } while (0), which the compiler then optimises away entirely; see the small illustration after the snippet below.)
561 	/*
562 	 * there are 3 processes which are affected by a context switch:
563 	 *
564 	 * prev == .... ==> (last => next)
565 	 *
566 	 * It's the 'much more previous' 'prev' that is on next's stack,
567 	 * but prev is set to (the just run) 'last' process by switch_to().
568 	 * This might sound slightly confusing but makes tons of sense.
569 	 */
570 	prepare_to_switch();
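To see why nothing survives in the compiled kernel, here is a minimal stand-alone sketch (not kernel code) of the do { } while (0) idiom:

/* Minimal illustration of the empty-macro idiom behind
 * prepare_to_switch() on i386.  The do { } while (0) form behaves
 * like a single statement, so it is safe after an "if" without
 * braces, and the compiler removes it entirely. */
#include <stdio.h>

#define prepare_to_switch() do { } while (0)

int main(void)
{
    int need_switch = 1;

    if (need_switch)
        prepare_to_switch();   /* expands to a no-op, optimised away */
    else
        printf("no switch needed\n");

    return 0;
}

The do { } while (0) wrapper is the usual trick for making a multi-statement (or, as here, empty) macro behave like one ordinary statement.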
Now the memory-mapping context has to be switched, which may mean reloading page tables and LDTs:
571 	{
572 		struct mm_struct *mm = next->mm;
573 		struct mm_struct *oldmm = prev->active_mm;
574 		if (!mm) {
575 			if (next->active_mm) BUG();
576 			next->active_mm = oldmm;
577 			atomic_inc(&oldmm->mm_count);
578 			enter_lazy_tlb(oldmm, next, this_cpu);
579 		} else {
580 			if (next->active_mm != mm) BUG();
581 			switch_mm(oldmm, mm, next, this_cpu);
582 		}
583
584 		if (!prev->mm) {
585 			prev->active_mm = NULL;
586 			mmdrop(oldmm);
587 		}
588 	}
589
590 	/*
591 	 * This just switches the register state and the
592 	 * stack.
593 	 */
Now the actual switch between the two processes occurs (switch_to(prev, next, prev)). This is again hardware-specific; on an Intel machine it means saving the stack pointer and the base pointer and then re-establishing the state the next process was in the last time it ran, using the data saved in its task_struct.
594 switch_to(prev, next, prev);
The next line is SMP-specific; nothing happens on a single-processor machine.
595 	__schedule_tail(prev);
596
597 same_process:
598 	reacquire_kernel_lock(current);
599 	return;
600
This is the end of the main path; what follows are the targets of the goto jumps.
A recalculation must be done for all processes. This is done by halving the counter (p->counter >> 1 is the same as p->counter / 2) and adding the process's priority. This formula takes into account both the process's history and its priority. If a process runs often, its credit is exhausted rapidly, while processes that seldom run do not use up their credit this fast and thus get a better chance to run. The scheme therefore tends to favour processes that deserve a rapid response time. (See S&G, page 716 for more details.) A small simulation of this effect follows the snippet below.
601 recalculate:
602 	{
603 		struct task_struct *p;
604 		spin_unlock_irq(&runqueue_lock);
605 		read_lock(&tasklist_lock);
606 		for_each_task(p)
607 			p->counter = (p->counter >> 1) + p->priority;
608 		read_unlock(&tasklist_lock);
609 		spin_lock_irq(&runqueue_lock);
610 	}
611 	goto repeat_schedule;
612
613 still_running:
614 	c = prev_goodness(prev, this_cpu, prev->active_mm);
615 	next = prev;
616 	goto still_running_back;
617
618 handle_softirq:
619 	do_softirq();
620 	goto handle_softirq_back;
621
622 handle_tq_scheduler:
623 	/*
624 	 * do not run the task queue with disabled interrupts,
625 	 * cli() wouldn't work on SMP
626 	 */
627 	sti();
628 	run_task_queue(&tq_scheduler);
629 	goto tq_scheduler_back;
630
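Here is a small stand-alone simulation (not kernel code; the process names, the priority value and the number of rounds are made up) of what the recalculation formula does to a CPU-bound process versus one that rarely runs:

/* Stand-alone simulation of p->counter = (p->counter >> 1) + p->priority.
 * "cpu_hog" always burns its whole slice before each recalculation, so
 * it restarts from its bare priority; "sleeper" rarely runs, so its
 * counter creeps up towards roughly twice its priority. */
#include <stdio.h>

int main(void)
{
    int priority = 20;   /* illustrative static priority */
    int cpu_hog  = 0;    /* counter already exhausted each round */
    int sleeper  = 20;   /* counter still untouched each round */

    for (int round = 1; round <= 6; round++) {
        cpu_hog = (cpu_hog >> 1) + priority;   /* 0 -> always 20 */
        sleeper = (sleeper >> 1) + priority;   /* climbs towards 39 */
        printf("round %d: cpu_hog=%d sleeper=%d\n", round, cpu_hog, sleeper);
        cpu_hog = 0;     /* the hog burns its whole slice again */
    }
    return 0;
}

With a priority of 20, the sleeper's counter climbs towards 39, roughly twice its priority, while the CPU hog always restarts from 20, so the seldom-running process wins the next goodness comparison.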
As described earlier, if this process is scheduled under the round-robin policy and has used up its counter, we reset the counter to the process's priority and move the process to the end of the runqueue, thus reducing its chance of being selected again immediately.
631 move_rr_last:
632 	if (!prev->counter) {
633 		prev->counter = prev->priority;
634 		move_last_runqueue(prev);
635 	}
636 	goto move_rr_back;
637
We crash, provoked by the BUG() macro.
638 scheduling_in_interrupt:
639 	printk("Scheduling in interrupt\n");
640 	BUG();
641 	return;
642 }