[Repost] ARM Linux Process Scheduling

 2010-11-8 23:48  1943 2 2 分类: MCU/ 嵌入式

I have been studying ARM Linux for a while and would like to share my notes on process management. Corrections are very welcome.

Process Creation and Termination
Process Scheduling and Dispatching
Process Switching
Process Synchronization and support for interprocess communication
Management of process control block
--------from <Operating Systems: Internals and Design Principles>

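Process Scheduling
Linux 2.4.x is a multitasking, time-sharing operating system with a non-preemptive kernel. User processes are scheduled preemptively, but code running in kernel mode is not preempted and only gives up the CPU voluntarily. If a kernel thread hogs the CPU and never releases it, the system cannot recover, so real-time performance is weak. Linux 2.6 improves on this by making the kernel itself preemptible.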

The kernel provides three scheduling policies, chosen according to how real-time a task is:

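SCHED_OTHER: non-real-time tasks, scheduled with the ordinary time-sharing policy;
SCHED_FIFO: short real-time tasks, scheduled first-in first-out. Unless a higher-priority process becomes runnable, the process keeps the CPU until it blocks, yields, or exits;
SCHED_RR: longer real-time tasks. Because such tasks run for a long time, FIFO is unsuitable and round-robin is used instead: when its time slice expires, the process is moved to the end of the run queue so that other real-time processes get a chance to run.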

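Note that SCHED_FIFO and SCHED_RR do not differ in priority; the main difference is the length of the tasks they are meant for. In addition, the policy field of task_struct carries a SCHED_YIELD bit, which is set when the process voluntarily gives up the CPU.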
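From user space the three policies are selected through the POSIX scheduling interface. The sketch below is only an illustration and is not taken from the kernel source discussed here: is_realtime_policy() and use_policy() are hypothetical helper names, while sched_setscheduler(), sched_get_priority_min() and sched_get_priority_max() are the standard calls declared in <sched.h>.

```c
#include <assert.h>
#include <sched.h>

/* Returns 1 for the two real-time policies, 0 for the time-sharing one. */
int is_realtime_policy(int policy)
{
    return policy == SCHED_FIFO || policy == SCHED_RR;
}

/*
 * Hypothetical helper: ask the kernel to run the calling process under
 * `policy`, using the lowest priority that is valid for that policy.
 * Returns 0 on success and -1 on failure, exactly as sched_setscheduler()
 * does. Switching to a real-time policy normally needs privileges.
 */
int use_policy(int policy)
{
    struct sched_param param;
    int lo = sched_get_priority_min(policy);
    int hi = sched_get_priority_max(policy);

    assert(lo <= hi);
    param.sched_priority = lo;
    return sched_setscheduler(0, policy, &param);   /* pid 0 = the caller */
}
```

Calling use_policy(SCHED_RR) as an unprivileged user is expected to fail with -1, which is a reasonable way to see that the real-time policies are treated differently from SCHED_OTHER.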

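On top of these three policies, processes are scheduled according to their priority. A priority is a simple integer: a weight assigned to each process to make it easy to decide which one should get the CPU. The higher the priority, the better the process's chance of getting CPU time.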

In Linux, a non-real-time process has two priorities: a static priority and a dynamic priority. A real-time process adds a third, the real-time priority.

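Static priority (priority): called "static" because it does not change over time; only the user can modify it. It gives the maximum time slice (20) the process is allowed before it is forced to compete with other processes for the CPU. In the 2.4 code quoted in this post the same role is played by nice, which NICE_TO_TICKS() converts to ticks.
Dynamic priority (counter): counter is the time slice the system allots to each process, and Linux also uses it as the process's dynamic priority. It decreases steadily while the process holds the CPU; when it reaches 0, the process is marked for rescheduling. It gives the time remaining in the current time slice (initially 20).
Real-time priority (rt_priority): goodness() adds a base of 1000 to rt_priority to obtain the weight of a real-time process. A process with a higher weight always runs before one with a lower weight. The weight of a non-real-time process is far below 1000, so real-time processes always come first.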
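How the three priorities combine into a single weight can be sketched as follows. This is a simplified user-space model written for this post, not the kernel's goodness(): the struct and function names are hypothetical, and the SMP and same-address-space bonuses of the real function are left out on purpose.

```c
#include <assert.h>

enum policy_model { POLICY_OTHER, POLICY_FIFO, POLICY_RR };

/* Hypothetical, stripped-down stand-in for the scheduling fields of task_struct. */
struct task_model {
    enum policy_model policy;
    int yielded;       /* models the SCHED_YIELD bit */
    int counter;       /* ticks left in the current time slice */
    int nice;          /* -20 (favoured) .. 19 (least favoured) */
    int rt_priority;   /* only meaningful for real-time tasks */
};

/* Simplified weight calculation; a higher result wins the CPU. */
int goodness_model(const struct task_model *p)
{
    assert(p != 0);
    if (p->yielded)
        return -1;                      /* task asked to step aside */
    if (p->policy != POLICY_OTHER)
        return 1000 + p->rt_priority;   /* real-time: always above 1000 */
    if (p->counter == 0)
        return 0;                       /* time slice used up */
    return p->counter + 20 - p->nice;   /* time-sharing: slice plus nice bonus */
}
```

With this model a SCHED_OTHER task can never reach 1000 for realistic counter values, which is the property the text relies on.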

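On every tick (that is, every timer interrupt), the system decrements the counter of the process currently holding the CPU. If counter drops to 0, need_resched is set to 1 and scheduling takes place on the way back from the interrupt. update_process_times() is a helper called by the timer interrupt handler: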

void update_process_times(int user_tick)
{
       struct task_struct *p = current;
       int cpu = smp_processor_id(), system = user_tick ^ 1;
       update_one_process(p, user_tick, system, cpu);
       if (p->pid) {
              if (--p->counter <= 0) {
                     p->counter = 0;
                     p->need_resched = 1;
              }
              if (p->nice > 0)
                     kstat.per_cpu_nice[cpu] += user_tick;
              else
                     kstat.per_cpu_user[cpu] += user_tick;
              kstat.per_cpu_system[cpu] += system;
       } else if (local_bh_count(cpu) || local_irq_count(cpu) > 1)
              kstat.per_cpu_system[cpu] += system;
}
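The counter bookkeeping in update_process_times() can be tried out in user space with a small model. Everything below is a sketch with hypothetical names (tick_task, tick_model, ticks_until_resched); it keeps only the counter and need_resched logic and drops the kstat accounting.

```c
#include <assert.h>

/* Hypothetical stand-in for the three task_struct fields the tick touches. */
struct tick_task {
    int pid;            /* 0 is the idle task, which is never charged */
    int counter;        /* remaining ticks in the time slice */
    int need_resched;   /* set when the slice is used up */
};

/* One timer tick charged to the running task. */
void tick_model(struct tick_task *p)
{
    assert(p != 0);
    if (p->pid == 0)
        return;
    if (--p->counter <= 0) {
        p->counter = 0;
        p->need_resched = 1;
    }
}

/* Counts how many ticks pass before the task is flagged for rescheduling. */
int ticks_until_resched(struct tick_task *p)
{
    int ticks = 0;

    assert(p != 0 && p->pid != 0);   /* the idle task would never be flagged */
    while (!p->need_resched) {
        tick_model(p);
        ticks++;
    }
    return ticks;
}
```

A task that starts with counter = 6 is flagged after 6 ticks, which corresponds to roughly 60 ms with a 10 ms timer period.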

Process scheduling in Linux is implemented in schedule(), which is called from the following ARM assembly fragment:

/*
* This is the fast syscall return path. We do as little as
* possible here, and this includes saving r0 back into the SVC
* stack.
*/
ret_fast_syscall:
       ldr   r1, [tsk, #TSK_NEED_RESCHED]
       ldr   r2, [tsk, #TSK_SIGPENDING]
       teq r1, #0            @ need_resched || sigpending
       teqeq     r2, #0
       bne slow
       fast_restore_user_regs
/*
* Ok, we need to do extra processing, enter the slow path.
*/

slow:      str   r0, [sp, #S_R0+S_OFF]!     @ returned r0
       b     1f

/*
* "slow" syscall return path. "why" tells us if this was a real syscall.
*/
reschedule:
       bl    SYMBOL_NAME(schedule)
ENTRY(ret_to_user)
ret_slow_syscall:
       ldr   r1, [tsk, #TSK_NEED_RESCHED]
       ldr   r2, [tsk, #TSK_SIGPENDING]
1:    teq r1, #0                   @ need_resched => schedule()
       bne reschedule           @ call schedule() if a reschedule is needed
       teq r2, #0                   @ sigpending => do_signal()
       blne       __do_signal
       restore_user_regs
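For readers less used to ARM assembly, the control flow of ret_slow_syscall above can be paraphrased in C. This is a model only: the struct and the *_model functions are hypothetical, and the call counters exist purely so the behaviour can be observed.

```c
#include <assert.h>

/* Hypothetical model of the two flags tested on the way back to user mode. */
struct ret_task {
    int need_resched;
    int sigpending;
    int schedule_calls;   /* bookkeeping for the model only */
    int signal_calls;     /* bookkeeping for the model only */
};

static void schedule_model(struct ret_task *t)
{
    t->need_resched = 0;
    t->schedule_calls++;
}

static void do_signal_model(struct ret_task *t)
{
    t->sigpending = 0;
    t->signal_calls++;
}

void ret_to_user_model(struct ret_task *t)
{
    assert(t != 0);
    /* "bne reschedule": call schedule(), then re-test the flag */
    while (t->need_resched)
        schedule_model(t);
    /* "teq r2, #0" followed by "blne __do_signal" */
    if (t->sigpending)
        do_signal_model(t);
    /* restore_user_regs would run here */
}
```

The loop mirrors the fact that the reschedule label falls straight back into ret_slow_syscall, so need_resched is checked again after every call to schedule().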

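This return path runs every time the kernel returns from an interrupt or a system call. schedule() ends up being invoked in the following situations: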

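1. On a process state transition, such as termination or sleep: when a process calls a function like sleep() or exit() that changes its state, that function calls schedule() itself.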

2. When a new process is added to the run queue:

ENTRY(ret_from_fork)
       bl    SYMBOL_NAME(schedule_tail)
       get_current_task tsk
       ldr   ip, [tsk, #TSK_PTRACE]           @ check for syscall tracing
       mov why, #1
       tst   ip, #PT_TRACESYS           @ are we tracing syscalls?
       beq ret_slow_syscall
       mov r1, sp
       mov r0, #1                          @ trace exit [IP = 1]
       bl    SYMBOL_NAME(syscall_trace)
       b     ret_slow_syscall          @ jump to the ret_slow_syscall fragment shown earlier

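3. After a timer interrupt: at initialization Linux sets the system timer period to 10 ms. When a timer interrupt occurs, the service routine timer_interrupt immediately calls do_timer(), which (through update_process_times() shown earlier) decrements the current process's counter and sets need_resched if it reaches 0. schedule() is then called on the way back from the timer interrupt.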

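4. When a process returns from a system call to user mode: the need_resched flag is checked, and schedule() is run if it is set. System calls are implemented with a software interrupt (SWI); the ARM SWI handler is shown below.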

       .align      5
ENTRY(vector_swi)
       save_user_regs
       zero_fp
       get_scno

       enable_irqs ip

       str   r4, [sp, #-S_OFF]!              @ push fifth arg

       get_current_task tsk
       ldr   ip, [tsk, #TSK_PTRACE]           @ check for syscall tracing
       bic   scno, scno, #0xff000000           @ mask off SWI op-code
       eor scno, scno, #OS_NUMBER << 20   @ check OS number
       adr tbl, sys_call_table              @ load syscall table pointer
       tst   ip, #PT_TRACESYS           @ are we tracing syscalls?
       bne __sys_trace

       adrsvc    al, lr, ret_fast_syscall         @ load the return address so that the sys_* routine
                                                               @ returns to ret_fast_syscall (shown earlier)
       cmp scno, #NR_syscalls           @ check upper syscall limit
       ldrcc       pc, [tbl, scno, lsl #2]           @ call sys_* routine

       add r1, sp, #S_OFF
2:    mov why, #0                       @ no longer a real syscall
       cmp scno, #ARMSWI_OFFSET
       eor r0, scno, #OS_NUMBER << 20 @ put OS number back
       bcs SYMBOL_NAME(arm_syscall)
       b     SYMBOL_NAME(sys_ni_syscall)     @ not private func

5. When the kernel finishes handling an interrupt and the process returns to user mode.

6. When a process calls schedule() directly to request rescheduling.
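The table lookup that vector_swi performs in item 4 (mask off the SWI opcode, bounds-check the number, then jump through sys_call_table) can be sketched in C. The table, the two sample handlers and dispatch_model() are hypothetical, and the OS_NUMBER adjustment done by the real handler is omitted to keep the model small.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef long (*syscall_fn)(long arg);

/* Two made-up handlers standing in for sys_* routines. */
static long sys_double_model(long arg) { return arg * 2; }
static long sys_negate_model(long arg) { return -arg; }

/* Hypothetical two-entry table playing the role of sys_call_table. */
static const syscall_fn sys_call_table_model[] = {
    sys_double_model,
    sys_negate_model,
};

#define NR_SYSCALLS_MODEL \
    (sizeof(sys_call_table_model) / sizeof(sys_call_table_model[0]))

long dispatch_model(unsigned long swi_word, long arg)
{
    /* "bic scno, scno, #0xff000000": keep only the 24-bit SWI comment field */
    unsigned long scno = swi_word & ~0xff000000UL;

    /* "cmp scno, #NR_syscalls" then "ldrcc pc, [tbl, scno, lsl #2]" */
    if (scno < NR_SYSCALLS_MODEL)
        return sys_call_table_model[scno](arg);

    return -ENOSYS;   /* stands in for the sys_ni_syscall fallback */
}
```

In the real handler lr is preloaded with ret_fast_syscall before the jump, which is how every system call ends up passing through the need_resched check on its way out.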

Analysis of schedule():

/*
* 'schedule()' is the scheduler function. It's a very simple and nice
* scheduler: it's not perfect, but certainly works for most things.
*
* The goto is "interesting".
*
* NOTE!! Task 0 is the 'idle' task, which gets called when no other
* tasks can run. It can not be killed, and it cannot sleep. The 'state'
* information in task[0] is never used.
*/
asmlinkage void schedule(void)
{
       struct schedule_data * sched_data;
       struct task_struct *prev, *next, *p;
       struct list_head *tmp;
       int this_cpu, c;

       spin_lock_prefetch(&runqueue_lock);

       if (!current->active_mm) BUG();
need_resched_back:
       prev = current;
       this_cpu = prev->processor;

       if (unlikely(in_interrupt())) {
              printk("Scheduling in interrupt\n");
              BUG();
       }

       release_kernel_lock(prev, this_cpu);

       /*
        * 'sched_data' is protected by the fact that we can run
        * only one process per CPU.
        */
       sched_data = & aligned_data[this_cpu].schedule_data;

       spin_lock_irq(&runqueue_lock);

       /* move an exhausted RR process to be last.. */
       if (unlikely(prev->policy == SCHED_RR))
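              /*
               * Under round-robin scheduling, check whether counter has reached 0;
               * if so, refill the time slice and move the task to the end of the run queue.
               */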
              if (!prev->counter) {
                     prev->counter = NICE_TO_TICKS(prev->nice);
                     move_last_runqueue(prev);
              }

       switch (prev->state) {
              case TASK_INTERRUPTIBLE:
                     /*
                      * A TASK_INTERRUPTIBLE task with a pending signal that can wake it
                      * is put back into TASK_RUNNING.
                      */
                     if (signal_pending(prev)) {
                            prev->state = TASK_RUNNING;
                            break;
                     }
              default:
                     del_from_runqueue(prev);
              case TASK_RUNNING:;
       }
       prev->need_resched = 0;

       /*
        * this is the scheduler proper:
        */

repeat_schedule:
       /*
        * Default process to select..
        */
       next = idle_task(this_cpu);
       c = -1000;
       list_for_each(tmp, &runqueue_head) {
              /*
               * Walk the run queue looking for the task with the highest weight;
               * that task gets the CPU.
               */
              p = list_entry(tmp, struct task_struct, run_list);
              if (can_schedule(p, this_cpu)) {
                     /*
                      * In goodness(), a real-time task gets weight = 1000 + p->rt_priority,
                      * so real-time tasks always outrank non-real-time ones.
                      */
                     int weight = goodness(p, this_cpu, prev->active_mm);
                     if (weight > c)              // note ">" rather than ">=": on a tie, the task found first wins
                            c = weight, next = p;
              }
       }

       /* Do we need to re-calculate counters? */
       if (unlikely(!c)) {
              /*
               * c == 0 means every runnable task has used up its time slice,
               * so the counters of all tasks are recalculated.
               */
              struct task_struct *p;

              spin_unlock_irq(&runqueue_lock);
              read_lock(&tasklist_lock);
              for_each_task(p)
                     p->counter = (p->counter >> 1) + NICE_TO_TICKS(p->nice);
              read_unlock(&tasklist_lock);
              spin_lock_irq(&runqueue_lock);
              goto repeat_schedule;
       }

       /*
        * from this point on nothing can prevent us from
        * switching to the next task, save this fact in sched_data.
        */
       sched_data->curr = next;
       task_set_cpu(next, this_cpu);
       spin_unlock_irq(&runqueue_lock);

       if (unlikely(prev == next)) {
              /* We won't go through the normal tail, so do this by hand */
              prev->policy &= ~SCHED_YIELD;
              goto same_process;
       }

       kstat.context_swtch++;
       /*
        * there are 3 processes which are affected by a context switch:
        *
        * prev == .... ==> (last => next)
        *
        * It's the 'much more previous' 'prev' that is on next's stack,
        * but prev is set to (the just run) 'last' process by switch_to().
        * This might sound slightly confusing but makes tons of sense.
        */
       prepare_to_switch();
       {
              struct mm_struct *mm = next->mm;
              struct mm_struct *oldmm = prev->active_mm;
              if (!mm) {                     // kernel thread: borrow the previous mm, no page-table switch
                     if (next->active_mm) BUG();
                     next->active_mm = oldmm;
                     atomic_inc(&oldmm->mm_count);
                     enter_lazy_tlb(oldmm, next, this_cpu);
              } else {
                     if (next->active_mm != mm) BUG();
                     switch_mm(oldmm, mm, next, this_cpu);       // user process: switch page tables
              }

              if (!prev->mm) {
                     prev->active_mm = NULL;
                     mmdrop(oldmm);
              }
       }

       /*
        * This just switches the register state and the stack.
        */
       switch_to(prev, next, prev);
       __schedule_tail(prev);

same_process:
       reacquire_kernel_lock(current);
       if (current->need_resched)
              goto need_resched_back;
       return;
}
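The two core steps of schedule(), picking the task with the highest weight and refilling the counters when every slice is used up, can be isolated in a small user-space model. The names are hypothetical, the weight field stands for whatever goodness() would return, and NICE_TO_TICKS_MODEL assumes the definition I believe 2.4 uses for HZ = 100, so treat the concrete tick values as an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed to match 2.4's NICE_TO_TICKS() for HZ = 100: nice 0 gives 6 ticks. */
#define NICE_TO_TICKS_MODEL(nice) (((20 - (nice)) >> 2) + 1)

/* Hypothetical run-queue entry. */
struct rq_task {
    int weight;    /* value goodness() would return for this task */
    int counter;   /* ticks left in the current slice */
    int nice;
};

/*
 * Returns the index of the task to run next, or -1 for the idle task.
 * The strict ">" means that on equal weights the earlier entry wins.
 */
int pick_next_model(const struct rq_task *rq, size_t n, int *best_weight)
{
    int c = -1000;
    int next = -1;
    size_t i;

    for (i = 0; i < n; i++) {
        if (rq[i].weight > c) {
            c = rq[i].weight;
            next = (int)i;
        }
    }
    if (best_weight)
        *best_weight = c;
    return next;
}

/* Refill step taken when the best weight found is 0. */
void recalc_counters_model(struct rq_task *tasks, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        tasks[i].counter = (tasks[i].counter >> 1)
                         + NICE_TO_TICKS_MODEL(tasks[i].nice);
}
```

Because half of any unused counter is carried over, a task that sleeps a lot accumulates a larger slice than one that always runs its slice down to 0, which favours interactive tasks.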

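switch_mm() switches page tables: it loads the physical start address of the next process's pgd into register c2 of CP15 (the translation table base). The virtual address of a process's pgd is stored in the pgd pointer of its mm_struct (next->pgd in the code below), and the __virt_to_phys macro converts it to a physical address.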

static inline void
switch_mm(struct mm_struct *prev, struct mm_struct *next,
         struct task_struct *tsk, unsigned int cpu)
{
       if (prev != next)
              cpu_switch_mm(next->pgd, tsk);
}

#define cpu_switch_mm(pgd,tsk) cpu_set_pgd(__virt_to_phys((unsigned long)(pgd)))

#define cpu_get_pgd()      \
       ({                                        \
              unsigned long pg;                     \
              __asm__("mrc p15, 0, %0, c2, c0, 0"      \
                      : "=r" (pg));                \
              pg &= ~0x3fff;                           \
              (pgd_t *)phys_to_virt(pg);         \
       })
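The address arithmetic behind these macros is easy to check in isolation. In the sketch below the two offsets are hypothetical values picked for illustration (the real PAGE_OFFSET and PHYS_OFFSET depend on the platform), and the *_model functions are stand-ins for the kernel macros, not copies of them. The 0x3fff mask reflects the 16 KB alignment of the ARM first-level page table.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical memory map, for illustration only. */
#define PAGE_OFFSET_MODEL UINT32_C(0xC0000000)   /* start of kernel virtual space */
#define PHYS_OFFSET_MODEL UINT32_C(0xA0000000)   /* start of physical RAM */

/* Linear kernel mapping: virtual to physical. */
uint32_t virt_to_phys_model(uint32_t va)
{
    assert(va >= PAGE_OFFSET_MODEL);
    return va - PAGE_OFFSET_MODEL + PHYS_OFFSET_MODEL;
}

/* Linear kernel mapping: physical to virtual. */
uint32_t phys_to_virt_model(uint32_t pa)
{
    assert(pa >= PHYS_OFFSET_MODEL);
    return pa - PHYS_OFFSET_MODEL + PAGE_OFFSET_MODEL;
}

/* Same masking as "pg &= ~0x3fff": drop the low 14 bits of the c2 value. */
uint32_t ttb_base_model(uint32_t c2_value)
{
    return c2_value & ~UINT32_C(0x3fff);
}
```

For example, a raw c2 value of 0xA0004018 yields a table base of 0xA0004000, which maps back to the virtual address 0xC0004000 under the assumed offsets.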

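switch_to() performs the context switch by calling the assembly routine __switch_to. The implementation is simple: it saves the context of prev, which is described by struct context_save_struct and covers the main registers, and then loads the context of next. The context is reached through thread.save in task_struct; TSS_SAVE is the offset of thread.save within task_struct.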

/*
* Register switch for ARMv3 and ARMv4 processors
* r0 = previous, r1 = next, return previous.
* previous and next are guaranteed not to be the same.
*/
ENTRY(__switch_to)
              stmfd      sp!, {r4 - sl, fp, lr}        @ Store most regs on stack
              mrs ip, cpsr
              str   ip, [sp, #-4]!                 @ Save cpsr_SVC
              str   sp, [r0, #TSS_SAVE]         @ Save sp_SVC
              ldr   sp, [r1, #TSS_SAVE]         @ Get saved sp_SVC
              ldr   r2, [r1, #TSS_DOMAIN]
              ldr   ip, [sp], #4
              mcr p15, 0, r2, c3, c0        @ Set domain register
              msr spsr, ip                 @ Save tasks CPSR into SPSR for this return
              ldmfd      sp!, {r4 - sl, fp, pc}^            @ Load all regs saved previously

struct context_save_struct {
       unsigned long cpsr;
       unsigned long r4;
       unsigned long r5;
       unsigned long r6;
       unsigned long r7;
       unsigned long r8;
       unsigned long r9;
       unsigned long sl;
       unsigned long fp;
       unsigned long pc;
};
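The order of the fields in context_save_struct follows directly from the two stores at the top of __switch_to: stmfd pushes r4 to fp plus lr, and the following str pushes cpsr below them. The user-space model below reproduces that layout on an ordinary array; the struct uses uint32_t so the sizes match a 32-bit ARM, and save_context_model() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same field order as context_save_struct, with fixed 32-bit fields. */
struct context_save_model {
    uint32_t cpsr;
    uint32_t r4, r5, r6, r7, r8, r9, sl, fp;
    uint32_t pc;   /* holds the saved lr, which becomes pc on restore */
};

enum { FRAME_WORDS = sizeof(struct context_save_model) / sizeof(uint32_t) };

/*
 * Builds the frame just below `sp` on a full-descending stack and returns
 * the new stack pointer, which is the value __switch_to stores at TSS_SAVE.
 * regs[] holds r4, r5, r6, r7, r8, r9, sl, fp in that order.
 */
uint32_t *save_context_model(uint32_t *sp, const uint32_t regs[8],
                             uint32_t lr, uint32_t cpsr)
{
    int i;

    assert(sp != 0 && regs != 0);
    sp -= FRAME_WORDS;
    sp[0] = cpsr;                 /* lowest address: pushed last */
    for (i = 0; i < 8; i++)
        sp[1 + i] = regs[i];      /* r4 .. fp */
    sp[9] = lr;                   /* highest address in the frame */
    return sp;
}
```

Restoring is the mirror image: __switch_to loads the saved sp of next, pops cpsr into spsr, and the final ldmfd reloads r4 to fp and puts the saved lr into pc.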