A Deep Dive into Linux Kernel Architecture -- Processes & Threads

Introduction

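Every programmer is familiar with the terms process and thread, yet describing them precisely is surprisingly hard; it is not obvious where to begin. So this article uses a concrete example to walk through the concepts: after you type the path of a.out, an executable you compiled yourself, into a terminal, how does Linux actually get that program running? If that question interests you, read on.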

My understanding of processes and threads in the Linux kernel has gone through the following stages:

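Knowing the two basic concepts: some theory about the resources processes and threads hold and how they are scheduled; that a process = program + data; and scattered notions such as the PC register, instructions, stack frames, and context switches.
Knowing that processes and threads are both independently schedulable units; that a process has its own address space, while threads share their process's address space but each has its own stack; and knowing the pthread interfaces and how to create threads with them.
Knowing that a thread in Linux is essentially a process, and that threads can be divided into kernel threads, schedulable threads, and user-space threads. Kernel threads are daemon threads started by the kernel to do kernel housekeeping, such as kswapd and the flusher threads. Schedulable threads are the ones the kernel scheduler can see, i.e. ordinary native threads. User-space threads, such as goroutines and gevent greenlets, are invisible to the kernel scheduler and need their own scheduling policy.
Knowing how processes and threads are scheduled in the kernel: the real-time scheduler, the completely fair scheduler, and other scheduling policies. Knowing that a schedulable thread has two different stacks, a kernel stack and a user-mode stack, and that a system call such as write or mmap switches from the user-mode stack to the kernel stack to run privileged code.
Knowing how the operating system starts the corresponding process after a command is typed into bash in a terminal, and how threads are created.
Knowing how instructions are loaded onto the CPU and begin executing, and how registers are handled on a context switch. (I have not reached this stage yet, but unless you do kernel development there is little need to.)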

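So let's use the Linux source and some man pages to dig into how processes and threads are implemented and, following the example from the introduction, see how Linux user processes and threads are created and executed.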

Processes & Threads

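task_struct and mm_struct are the key structures the kernel manages: the former describes a process/thread, the latter a process's virtual address space. The relevant kernel code follows.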

struct task_struct {
#ifdef CONFIG_THREAD_INFO_IN_TASK
    /*
     * For reasons of header soup (see current_thread_info()), this
     * must be the first element of task_struct.
     */
    struct thread_info      thread_info;
#endif
    /* -1 unrunnable, 0 runnable, >0 stopped: */
    volatile long           state;  // run state

    void                *stack;  // address of the kernel stack
    atomic_t            usage;
    /* Per task flags (PF_*), defined further below: */
    unsigned int            flags;
    unsigned int            ptrace;

#ifdef CONFIG_SMP  // scheduling-related
    struct llist_node       wake_entry;
    int             on_cpu; // whether the task is currently running on a CPU
#ifdef CONFIG_THREAD_INFO_IN_TASK
    /* Current CPU: */
    unsigned int            cpu; // CPU the task is assigned to
...
#endif
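// scheduling-related
    int             on_rq; // whether the task is on a runqueue (ready)
    int             prio; // priority
    int             static_prio; // static priority
    int             normal_prio; // normal priority
    unsigned int            rt_priority; // real-time priority
        // the priorities above feed a simple algorithm that computes the task's weight, which determines its time slice
    const struct sched_class    *sched_class; // scheduling class, e.g. real-time or completely fair
    struct sched_entity     se; // scheduling entity: the unit the completely fair scheduler actually schedules
    struct sched_rt_entity      rt; // real-time scheduling entity: the unit the real-time scheduler actually schedules
#ifdef CONFIG_CGROUP_SCHED
    struct task_group       *sched_task_group; // group scheduling, cgroup-related
#endif
    struct sched_dl_entity      dl; // scheduling entity for the deadline scheduler
...
    struct mm_struct        *mm; // virtual memory descriptor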
    struct mm_struct        *active_mm;
...
    pid_t               pid;  // ID of this task (the thread ID for a thread)
    pid_t               tgid; // thread group ID (the PID of the thread group leader)

    /* Real parent process: */
    struct task_struct __rcu    *real_parent; // real parent (can differ from parent while the task is being ptraced)

    /* Recipient of SIGCHLD, wait4() reports: */
    struct task_struct __rcu    *parent; // current parent

    /*
     * Children/sibling form the list of natural children:
     */
    struct list_head        children; // children (processes/threads this task has forked)
    struct list_head        sibling; // linkage in the parent's children list
    struct task_struct      *group_leader;

    /*
     * 'ptraced' is the list of tasks this task is using ptrace() on.
     *
     * This includes both natural children and PTRACE_ATTACH targets.
     * 'ptrace_entry' is this task's link on the p->parent->ptraced list.
     */
    struct list_head        ptraced;
    struct list_head        ptrace_entry;

    /* PID/PID hash table linkage. */ // thread-related structures
    struct pid          *thread_pid;
    struct hlist_node       pid_links[PIDTYPE_MAX];
    struct list_head        thread_group;
    struct list_head        thread_node;

    struct completion       *vfork_done;

    /* CLONE_CHILD_SETTID: */
    int __user          *set_child_tid;

    /* CLONE_CHILD_CLEARTID: */
    int __user          *clear_child_tid;

    u64             utime; // time spent in user mode
    u64             stime; // time spent in kernel mode
...
    /* Context switch counts: */
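    unsigned long           nvcsw; // voluntary context switches
    unsigned long           nivcsw; // involuntary context switches

    /* Monotonic time in nsecs: */
    u64             start_time; // start time (monotonic clock)

    /* Boot based time in nsecs: */
    u64             real_start_time; // start time (boot-based clock)
    char                comm[TASK_COMM_LEN]; // executable name, without the path
    struct nameidata        *nameidata;  // helper structure for path lookup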
...
    /* Filesystem information: */
    struct fs_struct        *fs; // filesystem information: root and current working directory

    /* Open file information: */
    struct files_struct     *files; // open file descriptors

    /* Namespaces: */
    struct nsproxy          *nsproxy; 

    /* Signal handlers: */
    struct signal_struct        *signal; // signal state shared by the thread group
    struct sighand_struct       *sighand; // signal handlers
...
       /* CPU-specific state of this task: */
    struct thread_struct        thread; // architecture-specific state such as saved registers
};

struct mm_struct {
    struct {
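                ... // virtual address space structures (the address space is fairly complex and not covered in detail here)
        unsigned long task_size;    /* size of task vm space */
        unsigned long highest_vm_end;   /* highest vma end address */
        pgd_t * pgd; // page global directory (top-level page table)
        atomic_t mm_users; // number of users of the user address space (threads sharing this mm_struct)
        atomic_t mm_count; // reference count on the mm_struct itself (all mm_users together count as one)
#ifdef CONFIG_MMU
        atomic_long_t pgtables_bytes;   /* PTE page table pages: memory used by page tables, in bytes */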
#endif
        int map_count;          /* number of VMAs */

        spinlock_t page_table_lock; /* Protects page tables and some
                         * counters
                         */
        struct rw_semaphore mmap_sem;
        struct list_head mmlist; /* List of maybe swapped mm's. These
                      * are globally strung together off
                      * init_mm.mmlist, and are protected
                      * by mmlist_lock
                      */
        ... // 统计信息

        unsigned long stack_vm;    /* VM_STACK */
        unsigned long def_flags;

        spinlock_t arg_lock; /* protect the below fields */
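        unsigned long start_code, end_code, start_data, end_data; // code segment, data segment
        unsigned long start_brk, brk, start_stack; // heap start, current heap end (program break), start of the process stack
        unsigned long arg_start, arg_end, env_start, env_end; // bounds of the argument area and of the environment variable area
...
        struct linux_binfmt *binfmt; // binary format handler used to load this program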
...
};

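The process is the kernel's original scheduling concept; it has existed as long as the kernel and is the entity being scheduled. Each process is identified by a unique process ID, and it has a separate process ID in every PID namespace it belongs to; process IDs play a key role in IPC. A process comes with a virtual address space, a kernel stack, a user-mode stack, a process group, namespaces, open files, signals, and other resources. Each process also carries scheduling data, mainly its priority and cgroup, which are used to compute its time slice while it runs. A process can be scheduled by different schedulers, chiefly the completely fair scheduler, the real-time scheduler, and the deadline scheduler. A process has three main states: blocked, ready, and running.
But ps aux shows many more state codes, which mean the following: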

PROCESS STATE CODES
       Here are the different values that the s, stat and state output
       specifiers (header "STAT" or "S") will display to describe the state of
       a process:
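               D    uninterruptible sleep (usually IO) // uninterruptible blocking: IO involves device operations, and making them interruptible is complicated, so the task usually enters the D state to avoid races (signals are not handled in this state, so it cannot be killed)
               R    running or runnable (on run queue)
               S    interruptible sleep (waiting for an event to complete)// interruptible blocking (signals are handled, so it can be killed)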
               T    stopped by job control signal
               t    stopped by debugger during the tracing
               W    paging (not valid since the 2.6.xx kernel)
               X    dead (should never be seen)
               Z    defunct ("zombie") process, terminated but not reaped by
                    its parent

       For BSD formats and when the stat keyword is used, additional
       characters may be displayed:

               <    high-priority (not nice to other users)
               N    low-priority (nice to other users)
               L    has pages locked into memory (for real-time and custom IO)
               s    is a session leader
               l    is multi-threaded (using CLONE_THREAD, like NPTL pthreads
                    do)
               +    is in the foreground process group

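A thread, by contrast, is a general, standardized concept rather than an operating-system one, and many operating systems were originally designed without threads. The threading model in common use is the one defined by pthread, a set of standard POSIX interfaces. Compared with plain processes, it defines a scheduling standard with lighter-weight resource allocation, easier communication, and smoother cooperation, and it provides interfaces for thread operations and thread communication. Linux was modeled on Unix and did not consider this design at first, so it has no special data structure for threads; the only schedulable structure is task_struct. The first library to implement pthread was LinuxThreads, but its pthread support was not fully conformant, because many kernel design decisions made the thread-related requirements hard to meet. To support pthread better, the kernel changed the relevant data structures in task_struct and the way they are managed, which led to the later NPTL library. NPTL conforms to the POSIX requirements far more closely, and this is why a thread on Linux is also called an LWP (light-weight process).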

LinuxThreads

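How it works:
A manager process creates the threads, and every thread operation, including creation and destruction, is carried out by that process. Every thread is essentially an independent process, so many POSIX requirements are not met and the design is quite limited.
Limitations of LinuxThreads:

A single manager process coordinates all threads, so creating and destroying threads is expensive (the manager process is the bottleneck).
The manager process itself needs context switches and can run on only one CPU, so scalability and performance are poor.
It is not POSIX-conformant: every thread has its own process ID; signals cannot be delivered at the process level; and user and group ID information is not common to all threads in a process.
The number of processes has an upper limit, and this threading scheme reaches it easily.
ps shows every thread.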

NPTL

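To conform better to POSIX, a new thread library, the Native POSIX Thread Library, was developed on top of the Linux 2.6 kernel. It removes many of the LinuxThreads limitations, conforms to POSIX, and greatly improves performance and stability.
From kernel 2.4 onward, task_struct creation accepts a richer set of parameters, which makes the library's job much easier and makes it possible to drop the manager process. The library can create tasks with different parameters, combined with a few other mechanisms, to tell threads and processes apart. With the thread group concept, the relationships between threads and signal handling can be managed much better, which removes most of the LinuxThreads limitations.
How it works:
The main system calls Linux exposes for creating processes are fork and clone. Their prototypes are:
pid_t fork(void);
int clone(int (*fn)(void *), void *stack, int flags, void *arg, ...
/* pid_t *parent_tid, void *tls, pid_t *child_tid */ );
fork can only create processes, because it takes no parameters that could express any difference. The flags argument of clone, on the other hand, offers rich configuration and is the key to distinguishing processes from threads. It is what lets a library implement a thread library well, so clone became the natural interface for creating threads.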

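With the appropriate flags, a new task_struct can share the parent task_struct's address space, open files, IO context, namespaces, and cgroup when it is created.
A thread group data structure identifies all threads of the same process, which makes process-level signal delivery possible.
The two fields pid and tgid distinguish processes from threads.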

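Let's look at the clone man page. clone uses flags to choose which resources the child shares with its parent, including open files, the address space, filesystem information, and the IO context (for example CLONE_FILES, CLONE_VM, CLONE_FS, CLONE_IO). The child still has its own stack, passed in the stack argument, and starts executing at fn. The CLONE_NEW* flags, such as CLONE_NEWIPC, work the other way: they give the child its own namespace, isolated from the parent's (this is the basis of container namespace isolation).
Built on a parameterized clone system call, NPTL can implement a POSIX-conformant thread library well. In addition, the 2.6 kernel implemented the futex system call, which makes synchronization between user-space threads much more efficient.
See create_thread in glibc's nptl directory for the implementation.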

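As the clone prototype shows, the caller passes the address of the entry function fn and also the stack address stack, so the stack of a thread created through clone has to be allocated in user space. With pthread the application does not manage it by hand: glibc allocates the stack and reclaims it after the thread has exited. The stack allocation in glibc looks like this: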

static int
allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
        ALLOCATE_STACK_PARMS)
{
  /* Get memory for the stack.  */
  if (__glibc_unlikely (attr->flags & ATTR_FLAG_STACKADDR))
    {
      uintptr_t adj;
      char *stackaddr = (char *) attr->stackaddr;

      /* Assume the same layout as the _STACK_GROWS_DOWN case, with struct
     pthread at the top of the stack block.  Later we adjust the guard
     location and stack address to match the _STACK_GROWS_UP case.  */
      if (_STACK_GROWS_UP)
        stackaddr += attr->stacksize;

      /* If the user also specified the size of the stack make sure it
     is large enough.  */
      if (attr->stacksize != 0
      && attr->stacksize < (__static_tls_size + MINIMAL_REST_STACK))
        return EINVAL;
      /* The user provided stack memory needs to be cleared.  */
      memset (pd, '\0', sizeof (struct pthread));
      ... // init pd
  } else {
    void *mem;
    const int prot = (PROT_READ | PROT_WRITE 
          | ((GL(dl_stack_flags) & PF_X) ? PROT_EXEC : 0));
    ...
    mem = mmap (NULL, size, prot,
              MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    ...
#if TLS_TCB_AT_TP
    pd = (struct pthread *) ((char *) mem + size - coloring) - 1;
#elif TLS_DTV_AT_TP
    pd = (struct pthread *) ((((uintptr_t) mem + size - coloring
                    - __static_tls_size)
                    & ~__static_tls_align_m1)
                   - TLS_PRE_TCB_SIZE);
#endif
  }
}

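As shown, glibc calls mmap with the MAP_STACK flag (a hint that the region will be used as a stack) to obtain an anonymous mapping that serves as the thread stack. With that, a schedulable native thread has been created in Linux.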

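So the key to thread creation on Linux is the implementation of clone. fork and clone both end up in the kernel function _do_fork; the only difference is that fork passes default arguments while clone allows detailed configuration. Here is the implementation of _do_fork: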

/*
 *  Ok, this is the main fork-routine.
 *
 * It copies the process, and if successful kick-starts
 * it and waits for it to finish using the VM if required.
 */
long _do_fork(unsigned long clone_flags,
          unsigned long stack_start,
          unsigned long stack_size,
          int __user *parent_tidptr,
          int __user *child_tidptr,
          unsigned long tls)
{
    struct completion vfork; // completion the parent waits on for vfork; vfork does not copy the page tables
    struct pid *pid;
    struct task_struct *p;
    int trace = 0;
    long nr;

    /*
     * Determine whether and which event to report to ptracer.  When
     * called from kernel_thread or CLONE_UNTRACED is explicitly
     * requested, no event is reported; otherwise, report if the event
     * for the type of forking is enabled.
     */
    if (!(clone_flags & CLONE_UNTRACED)) { // ptrace-related
        if (clone_flags & CLONE_VFORK)
            trace = PTRACE_EVENT_VFORK;
        else if ((clone_flags & CSIGNAL) != SIGCHLD)
            trace = PTRACE_EVENT_CLONE;
        else
            trace = PTRACE_EVENT_FORK;

        if (likely(!ptrace_event_enabled(current, trace)))
            trace = 0;
    }

    p = copy_process(clone_flags, stack_start, stack_size,
             child_tidptr, NULL, trace, tls, NUMA_NO_NODE); // copy the task_struct, i.e. create a new process/thread
    add_latent_entropy();

    if (IS_ERR(p))
        return PTR_ERR(p);

    /*
     * Do this prior waking up the new thread - the thread pointer
     * might get invalid after that point, if the thread exits quickly.
     */
    trace_sched_process_fork(current, p);

    pid = get_task_pid(p, PIDTYPE_PID);
    nr = pid_vnr(pid);

    if (clone_flags & CLONE_PARENT_SETTID)
        put_user(nr, parent_tidptr);

    if (clone_flags & CLONE_VFORK) { // special handling for vfork
        p->vfork_done = &vfork;
        init_completion(&vfork);
        get_task_struct(p);
    }
    wake_up_new_task(p); // wake the task: put it on a runqueue
    /* forking complete and child started to run, tell ptracer */
    if (unlikely(trace))
        ptrace_event_pid(trace, pid);
    if (clone_flags & CLONE_VFORK) {
        if (!wait_for_vfork_done(p, &vfork))
            ptrace_event_pid(PTRACE_EVENT_VFORK_DONE, pid);
    }
    put_pid(pid); // drop the reference taken by get_task_pid() above
    return nr;
}
/*
 * This creates a new process as a copy of the old one,
 * but does not actually start it yet.
 *
 * It copies the registers, and all the appropriate
 * parts of the process environment (as per the clone
 * flags). The actual kick-off is left to the caller.
 */
static __latent_entropy struct task_struct *copy_process(
                    unsigned long clone_flags,
                    unsigned long stack_start,
                    unsigned long stack_size,
                    int __user *child_tidptr,
                    struct pid *pid,
                    int trace,
                    unsigned long tls,
                    int node)
{
    int retval;
    struct task_struct *p;
    struct multiprocess_signals delayed;

    /*
     * Don't allow sharing the root directory with processes in a different
     * namespace
     */
    if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
        return ERR_PTR(-EINVAL);

    if ((clone_flags & (CLONE_NEWUSER|CLONE_FS)) == (CLONE_NEWUSER|CLONE_FS))
        return ERR_PTR(-EINVAL);

    /*
     * Thread groups must share signals as well, and detached threads
     * can only be started up within the thread group.
     */
    if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND))
        return ERR_PTR(-EINVAL);

    /*
     * Shared signal handlers imply shared VM. By way of the above,
     * thread groups also imply shared VM. Blocking this case allows
     * for various simplifications in other code.
     */
    if ((clone_flags & CLONE_SIGHAND) && !(clone_flags & CLONE_VM))
        return ERR_PTR(-EINVAL);

    /*
     * Siblings of global init remain as zombies on exit since they are
     * not reaped by their parent (swapper). To solve this and to avoid
     * multi-rooted process trees, prevent global and container-inits
     * from creating siblings.
     */
    if ((clone_flags & CLONE_PARENT) &&
                current->signal->flags & SIGNAL_UNKILLABLE)
        return ERR_PTR(-EINVAL);

    /*
     * If the new process will be in a different pid or user namespace
     * do not allow it to share a thread group with the forking task.
     */
    if (clone_flags & CLONE_THREAD) {
        if ((clone_flags & (CLONE_NEWUSER | CLONE_NEWPID)) ||
            (task_active_pid_ns(current) !=
                current->nsproxy->pid_ns_for_children))
            return ERR_PTR(-EINVAL);
    }

    /*
     * Force any signals received before this point to be delivered
     * before the fork happens.  Collect up signals sent to multiple
     * processes that happen during the fork and delay them so that
     * they appear to happen after the fork.
     */
    sigemptyset(&delayed.signal);
    INIT_HLIST_NODE(&delayed.node);

    spin_lock_irq(&current->sighand->siglock);
    if (!(clone_flags & CLONE_THREAD))
        hlist_add_head(&delayed.node, &current->signal->multiprocess);
    recalc_sigpending();
    spin_unlock_irq(&current->sighand->siglock);
    retval = -ERESTARTNOINTR;
    if (signal_pending(current))
        goto fork_out;

    retval = -ENOMEM;
    p = dup_task_struct(current, node); // duplicate the task_struct
    if (!p)
        goto fork_out;
    /*
     * This _must_ happen before we call free_task(), i.e. before we jump
     * to any of the bad_fork_* labels. This is to avoid freeing
     * p->set_child_tid which is (ab)used as a kthread's data pointer for
     * kernel threads (PF_KTHREAD).
     */
    p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL; // where to store the child's thread ID
    /*
     * Clear TID on mm_release()?
     */
    p->clear_child_tid = (clone_flags & CLONE_CHILD_CLEARTID) ? child_tidptr : NULL; // whether to clear the thread ID when the thread exits
    ftrace_graph_init_task(p);
    rt_mutex_init_task(p);
...
    retval = -EAGAIN;
    if (atomic_read(&p->real_cred->user->processes) >=
            task_rlimit(p, RLIMIT_NPROC)) {
        if (p->real_cred->user != INIT_USER &&
            !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
            goto bad_fork_free;
    }
    current->flags &= ~PF_NPROC_EXCEEDED;

    retval = copy_creds(p, clone_flags); // credentials
    if (retval < 0)
        goto bad_fork_free;

    /*
     * If multiple threads are within copy_process(), then this check
     * triggers too late. This doesn't hurt, the check is only there
     * to stop root fork bombs.
     */
    retval = -EAGAIN;
    if (nr_threads >= max_threads) // thread limit exceeded
        goto bad_fork_cleanup_count;

    delayacct_tsk_init(p);  /* Must remain after dup_task_struct() */
    ...
        // initialize accounting statistics
    p->utime = p->stime = p->gtime = 0;
#ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
    p->utimescaled = p->stimescaled = 0;
#endif
    prev_cputime_init(&p->prev_cputime);
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
    seqcount_init(&p->vtime.seqcount);
    p->vtime.starttime = 0;
    p->vtime.state = VTIME_INACTIVE;
#endif
#if defined(SPLIT_RSS_COUNTING)
    memset(&p->rss_stat, 0, sizeof(p->rss_stat));
#endif
    p->default_timer_slack_ns = current->timer_slack_ns;
#ifdef CONFIG_PSI
    p->psi_flags = 0;
#endif
...
    /* Perform scheduler related setup. Assign this task to a CPU. */
    retval = sched_fork(clone_flags, p);  // initialize the scheduling class, priorities and other scheduler state, and assign the task to a CPU; it is not put on a runqueue until wake_up_new_task()
    if (retval)
        goto bad_fork_cleanup_policy;
    retval = perf_event_init_task(p);
    if (retval)
        goto bad_fork_cleanup_policy;
    retval = audit_alloc(p);
    if (retval)
        goto bad_fork_cleanup_perf;
    /* copy all the process information */
    shm_init_task(p);
    retval = security_task_alloc(p, clone_flags);
    if (retval)
        goto bad_fork_cleanup_audit;
    retval = copy_semundo(clone_flags, p);
    if (retval)
        goto bad_fork_cleanup_security;
    retval = copy_files(clone_flags, p); // copy or share the open file table
    if (retval)
        goto bad_fork_cleanup_semundo;
    retval = copy_fs(clone_flags, p); // copy or share root and working directory information
    if (retval)
        goto bad_fork_cleanup_files;
    retval = copy_sighand(clone_flags, p); // copy or share the signal handlers
    if (retval)
        goto bad_fork_cleanup_fs;
    retval = copy_signal(clone_flags, p); // copy or share the signal state
    if (retval)
        goto bad_fork_cleanup_sighand;
    retval = copy_mm(clone_flags, p); // copy or share the address space
    if (retval)
        goto bad_fork_cleanup_signal;
    retval = copy_namespaces(clone_flags, p); // copy or share the namespaces
    if (retval)
        goto bad_fork_cleanup_mm;
    retval = copy_io(clone_flags, p); // copy or share the IO context
    if (retval)
        goto bad_fork_cleanup_namespaces;
    retval = copy_thread_tls(clone_flags, stack_start, stack_size, p, tls); // set up the thread's stack frame and register context
    if (retval)
        goto bad_fork_cleanup_io;
    ...
    stackleak_task_init(p); // initialize the kernel stack tracking used by STACKLEAK
    if (pid != &init_struct_pid) {
        pid = alloc_pid(p->nsproxy->pid_ns_for_children);
        if (IS_ERR(pid)) {
            retval = PTR_ERR(pid);
            goto bad_fork_cleanup_thread;
        }
    }
...
    /* ok, now we should be set up.. */
    p->pid = pid_nr(pid);
    if (clone_flags & CLONE_THREAD) { // thread: join the creator's thread group; exit_signal = -1 means no signal is sent to the parent when the thread exits
        p->exit_signal = -1;
        p->group_leader = current->group_leader;
        p->tgid = current->tgid;
    } else {
        if (clone_flags & CLONE_PARENT)
            p->exit_signal = current->group_leader->exit_signal;
        else
            p->exit_signal = (clone_flags & CSIGNAL);
        p->group_leader = p;
        p->tgid = p->pid;
    }
    ... // cgroup-related
    /*
     * From this point on we must avoid any synchronous user-space
     * communication until we take the tasklist-lock. In particular, we do
     * not want user-space to be able to predict the process start-time by
     * stalling fork(2) after we recorded the start_time but before it is
     * visible to the system.
     */
    p->start_time = ktime_get_ns();
    p->real_start_time = ktime_get_boot_ns();
    /*
     * Make it visible to the rest of the system, but dont wake it up yet.
     * Need tasklist lock for parent etc handling!
     */
    write_lock_irq(&tasklist_lock);
    /* CLONE_PARENT re-uses the old parent */
    if (clone_flags & (CLONE_PARENT|CLONE_THREAD)) { // set the parent
        p->real_parent = current->real_parent;
        p->parent_exec_id = current->parent_exec_id;
    } else {
        p->real_parent = current;
        p->parent_exec_id = current->self_exec_id;
    }
    klp_copy_process(p);
    spin_lock(&current->sighand->siglock);
    /*
     * Copy seccomp details explicitly here, in case they were changed
     * before holding sighand lock.
     */
    copy_seccomp(p); // Secure Computing
    rseq_fork(p, clone_flags);
    /* Don't start children in a dying pid namespace */
    if (unlikely(!(ns_of_pid(pid)->pid_allocated & PIDNS_ADDING))) {
        retval = -ENOMEM;
        goto bad_fork_cancel_cgroup;
    }
    /* Let kill terminate clone/fork in the middle */
    if (fatal_signal_pending(current)) {
        retval = -EINTR;
        goto bad_fork_cancel_cgroup;
    }
... // attach the pid and update statistics
    return p;
...
}
static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
{
    struct task_struct *tsk;
    unsigned long *stack;
    struct vm_struct *stack_vm_area __maybe_unused;
    int err;
    if (node == NUMA_NO_NODE)
        node = tsk_fork_get_node(orig); // pick the NUMA node to allocate from
    tsk = alloc_task_struct_node(node); // allocate the task_struct from its slab cache
    if (!tsk)
        return NULL;
    stack = alloc_thread_stack_node(tsk, node); // allocate the kernel stack
    if (!stack)
        goto free_tsk;
    if (memcg_charge_kernel_stack(tsk)) // charge the kernel stack to the memory cgroup
        goto free_stack;
    stack_vm_area = task_stack_vm_area(tsk); // vmalloc area backing the stack (when the stack is mapped through vmalloc)
    err = arch_dup_task_struct(tsk, orig); // copy the original task's data into the new task
    /*
     * arch_dup_task_struct() clobbers the stack-related fields.  Make
     * sure they're properly initialized before using any stack-related
     * functions again.
     */
    tsk->stack = stack; // point at the new kernel stack
#ifdef CONFIG_VMAP_STACK
    tsk->stack_vm_area = stack_vm_area; // point at the new stack's vmalloc area
#endif
#ifdef CONFIG_THREAD_INFO_IN_TASK
    atomic_set(&tsk->stack_refcount, 1);
#endif
    if (err)
        goto free_stack;
#ifdef CONFIG_SECCOMP
    /*
     * We must handle setting up seccomp filters once we're under
     * the sighand lock in case orig has changed between now and
     * then. Until then, filter must be NULL to avoid messing up
     * the usage counts on the error path calling free_task.
     */
    tsk->seccomp.filter = NULL;
#endif
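    setup_thread_stack(tsk, orig); // initialize the new task's thread info by copying the original task's
    clear_user_return_notifier(tsk); // clear the user-return notifier
    clear_tsk_need_resched(tsk); // clear the reschedule flag
    set_task_stack_end_magic(tsk); // write a magic value at the end of the kernel stack to detect overflow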

#ifdef CONFIG_STACKPROTECTOR
    tsk->stack_canary = get_random_canary();
#endif
    /*
     * One for us, one for whoever does the "release_task()" (usually
     * parent)
     */
    atomic_set(&tsk->usage, 2);
#ifdef CONFIG_BLK_DEV_IO_TRACE
    tsk->btrace_seq = 0;
#endif
    tsk->splice_pipe = NULL;
    tsk->task_frag.page = NULL;
    tsk->wake_q.next = NULL;
    account_kernel_stack(tsk, 1); // account for the new kernel stack
    kcov_task_init(tsk);
#ifdef CONFIG_FAULT_INJECTION
    tsk->fail_nth = 0;
#endif

#ifdef CONFIG_BLK_CGROUP
    tsk->throttle_queue = NULL;
    tsk->use_memdelay = 0;
#endif
#ifdef CONFIG_MEMCG
    tsk->active_memcg = NULL;
#endif
    return tsk;

free_stack:
    free_thread_stack(tsk);
free_tsk:
    free_task_struct(tsk);
    return NULL;
}

/*
 * wake_up_new_task - wake up a newly created task for the first time.
 *
 * This function will do some initial scheduler statistics housekeeping
 * that must be done for every newly created context, then puts the task
 * on the runqueue and wakes it.
 */
void wake_up_new_task(struct task_struct *p)
{
    struct rq_flags rf;
    struct rq *rq;
    raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
    p->state = TASK_RUNNING; // set the state
#ifdef CONFIG_SMP
    /*
     * Fork balancing, do it here and not earlier because:
     *  - cpus_allowed can change in the fork path
     *  - any previously selected CPU might disappear through hotplug
     *
     * Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
     * as we're not fully set-up yet.
     */
    p->recent_used_cpu = task_cpu(p); // record the most recently used CPU
    __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
#endif
    rq = __task_rq_lock(p, &rf);
    update_rq_clock(rq); // update the runqueue clock
    post_init_entity_util_avg(&p->se);

    activate_task(rq, p, ENQUEUE_NOCLOCK); // put the task on the runqueue
    p->on_rq = TASK_ON_RQ_QUEUED;
    trace_sched_wakeup_new(p);
    check_preempt_curr(rq, p, WF_FORK); // check whether the new task should preempt the current one
#ifdef CONFIG_SMP
    if (p->sched_class->task_woken) {
        /*
         * Nothing relies on rq->lock after this, so its fine to
         * drop it.
         */
        rq_unpin_lock(rq, &rf);
        p->sched_class->task_woken(rq, p);
        rq_repin_lock(rq, &rf);
    }
#endif
    task_rq_unlock(rq, p, &rf);
}

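Reading the code above, creating a new thread or process mainly involves the following steps (the original figure, "fork new task", summarized them):

dup_task_struct: allocate a new task_struct and kernel stack, and copy the parent's task_struct into it.
copy_creds, sched_fork: set up credentials and scheduler state.
copy_files, copy_fs, copy_sighand, copy_signal, copy_mm, copy_namespaces, copy_io: share or copy each resource according to clone_flags.
copy_thread_tls: set up the stack and register context of the new task.
alloc_pid: allocate a pid, then set tgid, group_leader, and the parent.
wake_up_new_task: put the task on a runqueue so the scheduler can pick it.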

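This is how a new process or thread is created, with its stack frame and register context set up so that, when the scheduler picks the task, it can resume directly from them. The most important part is deciding from the flags whether to copy the address space and open files, and setting up the user-space thread stack and the thread group.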

How a Process Is Created and Started

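Now that we know how a process or thread is created, how does the corresponding program get executed when you type a command line in a terminal?

In a terminal, run: ps -p $$
You will see something like this: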

PID TTY          TIME CMD
 21186 pts/1    00:00:00 bash

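Linux connects the bash process to the terminal through pts/1. A pts is a character device that receives the user's input; what you type is passed to the bash process, which interprets and executes it, and that is how the command gets run. The details of pts are not covered here.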

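What happens when we type ls in this terminal? What about when we run another bash script? Why does typing a program's name start a process? And since shell scripts and Python scripts are not binary executables, how can Linux run them directly? If you have never thought about these questions, now is a good time. :)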

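To answer these questions, the start-up process is described for two kinds of executables: loading and executing ELF files, and loading and executing script files.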

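As mentioned, a new process or thread must be created with fork or clone, so running ls in the terminal above amounts to bash forking a child process. The child then makes an exec system call, which does the real work of running the binary or script: it maps the executable's segments into the address space (code, data, environment variables, and so on) and initializes the stack and registers; once the task is scheduled onto a CPU, the dynamic libraries it depends on are loaded.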

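The exec man pages show that the call takes the path of the file to execute plus its arguments and environment variables. Now let's look at the Linux implementation.

do_execve is the common entry point for the exec family of system calls: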

/*
 * This structure is used to hold the arguments that are used when loading binaries.
 */
struct linux_binprm {
    char buf[BINPRM_BUF_SIZE]; // holds the first BINPRM_BUF_SIZE bytes of the executable file
#ifdef CONFIG_MMU
    struct vm_area_struct *vma;
    unsigned long vma_pages;
#else
# define MAX_ARG_PAGES  32
    struct page *page[MAX_ARG_PAGES];
#endif
    struct mm_struct *mm;
    unsigned long p; /* current top of mem */
    unsigned long argmin; /* rlimit marker for copy_strings() */
    unsigned int
        /*
         * True after the bprm_set_creds hook has been called once
         * (multiple calls can be made via prepare_binprm() for
         * binfmt_script/misc).
         */
        called_set_creds:1,
        /*
         * True if most recent call to the commoncaps bprm_set_creds
         * hook (due to multiple prepare_binprm() calls from the
         * binfmt_script/misc handlers) resulted in elevated
         * privileges.
         */
        cap_elevated:1,
        /*
         * Set by bprm_set_creds hook to indicate a privilege-gaining
         * exec has happened. Used to sanitize execution environment
         * and to set AT_SECURE auxv for glibc.
         */
        secureexec:1;
#ifdef __alpha__
    unsigned int taso:1;
#endif
    unsigned int recursion_depth; /* only for search_binary_handler() */
    struct file * file;
    struct cred *cred;  /* new credentials */
    int unsafe;     /* how unsafe this exec is (mask of LSM_UNSAFE_*) */
    unsigned int per_clear; /* bits to clear in current->personality */
    int argc, envc;
    const char * filename;  /* Name of binary as seen by procps */
    const char * interp;    /* Name of the binary really executed. Most
                   of the time same as filename, but could be
                   different for binfmt_{misc,script} */
    unsigned interp_flags;
    unsigned interp_data;
    unsigned long loader, exec;

    struct rlimit rlim_stack; /* Saved RLIMIT_STACK used during exec. */
} __randomize_layout;

int do_execve(struct filename *filename,
    const char __user *const __user *__argv,
    const char __user *const __user *__envp)
{
    struct user_arg_ptr argv = { .ptr.native = __argv };
    struct user_arg_ptr envp = { .ptr.native = __envp };
    return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
}
static int do_execveat_common(int fd, struct filename *filename,
                  struct user_arg_ptr argv,
                  struct user_arg_ptr envp,
                  int flags)
{
    return __do_execve_file(fd, filename, argv, envp, flags, NULL);
}

/*
 * sys_execve() executes a new program.
 */
static int __do_execve_file(int fd, struct filename *filename,
                struct user_arg_ptr argv,
                struct user_arg_ptr envp,
                int flags, struct file *file)
{
    char *pathbuf = NULL;
    struct linux_binprm *bprm;
    struct files_struct *displaced;
    int retval;

    if (IS_ERR(filename))
        return PTR_ERR(filename);

    /*
     * We move the actual failure in case of RLIMIT_NPROC excess from
     * set*uid() to execve() because too many poorly written programs
     * don't check setuid() return code.  Here we additionally recheck
     * whether NPROC limit is still exceeded.
     */
    if ((current->flags & PF_NPROC_EXCEEDED) &&
        atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
        retval = -EAGAIN;
        goto out_ret;
    }

    /* We're below the limit (still or again), so we don't want to make
     * further execve() calls fail. */
    current->flags &= ~PF_NPROC_EXCEEDED;

    retval = unshare_files(&displaced); // the task doing exec gets its own open file table instead of a shared one
    if (retval)
        goto out_ret;

    retval = -ENOMEM;
    bprm = kzalloc(sizeof(*bprm), GFP_KERNEL); // bprm holds the parameters for starting the program: the file, arguments, credentials, memory descriptor, etc.
    if (!bprm)
        goto out_files;

    retval = prepare_bprm_creds(bprm); // prepare the credentials
    if (retval)
        goto out_free;

    check_unsafe_exec(bprm);
    current->in_execve = 1;

    if (!file)
        file = do_open_execat(fd, filename, flags); // open the target executable
    retval = PTR_ERR(file);
    if (IS_ERR(file))
        goto out_unmark;

    sched_exec(); // decide from per-CPU load whether to migrate the task to another CPU

    bprm->file = file;
    if (!filename) {
        bprm->filename = "none";
    } else if (fd == AT_FDCWD || filename->name[0] == '/') {
        bprm->filename = filename->name;
    } else {
        if (filename->name[0] == '\0')
            pathbuf = kasprintf(GFP_KERNEL, "/dev/fd/%d", fd);
        else
            pathbuf = kasprintf(GFP_KERNEL, "/dev/fd/%d/%s",
                        fd, filename->name);
        if (!pathbuf) {
            retval = -ENOMEM;
            goto out_unmark;
        }
        /*
         * Record that a name derived from an O_CLOEXEC fd will be
         * inaccessible after exec. Relies on having exclusive access to
         * current->files (due to unshare_files above).
         */
        if (close_on_exec(fd, rcu_dereference_raw(current->files->fdt)))
            bprm->interp_flags |= BINPRM_FLAGS_PATH_INACCESSIBLE;
        bprm->filename = pathbuf;
    }
    bprm->interp = bprm->filename;
    retval = bprm_mm_init(bprm); // allocate a new mm_struct and set up the initial process stack in it
    if (retval)
        goto out_unmark;
    retval = prepare_arg_pages(bprm, argv, envp); // count the argument and environment strings and work out the stack space they may use
    if (retval < 0)
        goto out;
    retval = prepare_binprm(bprm); // read the beginning of the binary into the bprm->buf array
    if (retval < 0)
        goto out;
    retval = copy_strings_kernel(1, &bprm->filename, bprm); // copy the file name onto the new stack
    if (retval < 0)
        goto out;
    bprm->exec = bprm->p;
    retval = copy_strings(bprm->envc, envp, bprm); // copy the environment strings
    if (retval < 0)
        goto out;
    retval = copy_strings(bprm->argc, argv, bprm); // copy the argument strings
    if (retval < 0)
        goto out;
    would_dump(bprm, bprm->file);
    retval = exec_binprm(bprm); // start executing bprm, i.e. our target program
    if (retval < 0)
        goto out;
    /* execve succeeded */
    ...
}

static int exec_binprm(struct linux_binprm *bprm)
{
    pid_t old_pid, old_vpid;
    int ret;
    /* Need to fetch pid before load_binary changes it */
    old_pid = current->pid;
    rcu_read_lock();
    old_vpid = task_pid_nr_ns(current, task_active_pid_ns(current->parent));
    rcu_read_unlock();
    ret = search_binary_handler(bprm);
    ... // audit & trace
    return ret;
}

/*
 * cycle the list of binary formats handler, until one recognizes the image
 */
int search_binary_handler(struct linux_binprm *bprm)
{
    bool need_retry = IS_ENABLED(CONFIG_MODULES);
    struct linux_binfmt *fmt; // handler for one executable format: a.out, ELF, script, etc.
    int retval;
    /* This allows 4 levels of binfmt rewrites before failing hard. */
    if (bprm->recursion_depth > 5)
        return -ELOOP;
    retval = security_bprm_check(bprm);
    if (retval)
        return retval;
    retval = -ENOENT;
 retry:
    read_lock(&binfmt_lock);
    list_for_each_entry(fmt, &formats, lh) {
        if (!try_module_get(fmt->module))
            continue;
        read_unlock(&binfmt_lock);
        bprm->recursion_depth++;
        retval = fmt->load_binary(bprm); // format-specific loader: load_elf_binary, load_aout_binary, load_script, etc.
        read_lock(&binfmt_lock);
        put_binfmt(fmt);
        bprm->recursion_depth--;
        if (retval < 0 && !bprm->mm) {
            /* we got to flush_old_exec() and failed after it */
            read_unlock(&binfmt_lock);
            force_sigsegv(SIGSEGV, current);
            return retval;
        }
        if (retval != -ENOEXEC || !bprm->file) {
            read_unlock(&binfmt_lock);
            return retval;
        }
    }
    read_unlock(&binfmt_lock);
    if (need_retry) {
        if (printable(bprm->buf[0]) && printable(bprm->buf[1]) &&
            printable(bprm->buf[2]) && printable(bprm->buf[3]))
            return retval;
        if (request_module("binfmt-%04x", *(ushort *)(bprm->buf + 2)) < 0)
            return retval;
        need_retry = false;
        goto retry;
    }
    return retval;
}

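The calls in the code above can be simplified into the following call chain:

do_execve -> ... -> __do_execve_file -> exec_binprm -> search_binary_handler -> fmt->load_binary (load_elf_binary / load_aout_binary / load_script / ...)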

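execve mainly does the following:

Initializes the linux_binprm struct.
Fills in the launch information in linux_binprm from the binary file, the arguments and the environment variables.
Calls load_binary, which dispatches to a different loader depending on the type of the target executable, mainly a.out, ELF and script files.
Frees the linux_binprm struct and does the cleanup once the program has been started successfully.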
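
The dispatch step in search_binary_handler can be illustrated with a small userspace model. This is only a sketch under assumptions: `struct fmt_handler`, `search_handler` and the two toy loaders are hypothetical names, and the loaders merely look at the first bytes of the buffer instead of loading anything. What it keeps from the kernel code is the idea that each handler returns -ENOEXEC when it does not recognise the image, so the next handler in the list gets a chance.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-in for struct linux_binfmt. */
struct fmt_handler {
    const char *name;
    /* Returns 0 if the image is accepted, -ENOEXEC if it is not recognised. */
    int (*load)(const unsigned char *buf, size_t len);
};

static int toy_load_elf(const unsigned char *buf, size_t len)
{
    /* ELF images start with the four bytes 0x7f 'E' 'L' 'F'. */
    if (len >= 4 && memcmp(buf, "\177ELF", 4) == 0)
        return 0;
    return -ENOEXEC;
}

static int toy_load_script(const unsigned char *buf, size_t len)
{
    /* Scripts start with "#!". */
    if (len >= 2 && buf[0] == '#' && buf[1] == '!')
        return 0;
    return -ENOEXEC;
}

static const struct fmt_handler toy_formats[] = {
    { "elf",    toy_load_elf },
    { "script", toy_load_script },
};

/*
 * Walk the handler list in order. Return the name of the first handler
 * that does not answer -ENOEXEC, or NULL if nobody recognises the image.
 */
const char *search_handler(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    for (size_t i = 0; i < sizeof(toy_formats) / sizeof(toy_formats[0]); i++) {
        int ret = toy_formats[i].load(buf, len);
        if (ret != -ENOEXEC)
            return toy_formats[i].name;
    }
    return NULL;
}
```

Passing the first bytes of a file to `search_handler` tells which toy handler would claim it; the real kernel additionally handles module loading, recursion depth and failure after the point of no return, which are left out here.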

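Loading and executing ELF or a.out files

If you do not know what ELF (Executable and Linkable Format) is, it is worth looking up. In short, it is the standard binary file format on Linux: a file that follows it can be loaded and run directly by the system. Shared objects (.so) and the object files produced by the compiler are ELF files as well.

a.out is not a special form of ELF but an older, simpler executable format that predates it. Do not confuse the format with the default output file name: the file named a.out that a current gcc produces is an ELF executable and is loaded by load_elf_binary, while load_aout_binary only handles binaries in the legacy format. The legacy loader is shown below because it is short and follows the same overall steps.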
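
To make the point that object files, shared objects and executables are all ELF files concrete, here is a minimal sketch that classifies an in-memory image by the e_type field of its ELF header. `elf_kind` is a hypothetical helper, not a kernel or glibc function; it assumes a Linux system that provides <elf.h> and a file whose byte order matches the host.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <string.h>

/*
 * Classify an ELF image held in memory by its e_type field.
 * e_type is a 16-bit value stored right after the 16-byte e_ident array,
 * at the same offset in 32-bit and 64-bit ELF headers.
 * Assumption: the file's byte order is the same as the host's.
 */
const char *elf_kind(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    if (len < EI_NIDENT + sizeof(Elf64_Half) ||
        memcmp(buf, ELFMAG, SELFMAG) != 0)
        return "not ELF";

    Elf64_Half type;
    memcpy(&type, buf + EI_NIDENT, sizeof(type));

    switch (type) {
    case ET_REL:  return "relocatable object (.o)";
    case ET_EXEC: return "executable";
    case ET_DYN:  return "shared object or PIE executable";
    case ET_CORE: return "core dump";
    default:      return "other";
    }
}
```

Reading the first 18 bytes of a file into a buffer and passing them to `elf_kind` is enough to tell these cases apart.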

/*
 * These are the functions used to load a.out style executables and shared
 * libraries.  There is no binary dependent code anywhere else.
 */

static int load_aout_binary(struct linux_binprm * bprm)
{
    struct pt_regs *regs = current_pt_regs();
    struct exec ex;
    unsigned long error;
    unsigned long fd_offset;
    unsigned long rlim;
    int retval;

    ex = *((struct exec *) bprm->buf);      /* exec-header */
    if ((N_MAGIC(ex) != ZMAGIC && N_MAGIC(ex) != OMAGIC &&
         N_MAGIC(ex) != QMAGIC && N_MAGIC(ex) != NMAGIC) ||
        N_TRSIZE(ex) || N_DRSIZE(ex) ||
        i_size_read(file_inode(bprm->file)) < ex.a_text+ex.a_data+N_SYMSIZE(ex)+N_TXTOFF(ex)) {
        return -ENOEXEC;
    }

    /*
     * Requires a mmap handler. This prevents people from using a.out
     * as part of an exploit attack against /proc-related vulnerabilities.
     */
    if (!bprm->file->f_op->mmap)
        return -ENOEXEC;

    fd_offset = N_TXTOFF(ex);

    /* Check initial limits. This avoids letting people circumvent
     * size limits imposed on them by creating programs with large
     * arrays in the data or bss.
     */
    rlim = rlimit(RLIMIT_DATA); // check the data segment limit
    if (rlim >= RLIM_INFINITY)
        rlim = ~0;
    if (ex.a_data + ex.a_bss > rlim)
        return -ENOMEM;

    /* Flush all traces of the currently running executable */
    retval = flush_old_exec(bprm);
    if (retval)
        return retval;

    /* OK, This is the point of no return */
#ifdef __alpha__
    SET_AOUT_PERSONALITY(bprm, ex);
#else
    set_personality(PER_LINUX);
#endif
    setup_new_exec(bprm);

    current->mm->end_code = ex.a_text +
        (current->mm->start_code = N_TXTADDR(ex)); // code (text) segment range
    current->mm->end_data = ex.a_data +
        (current->mm->start_data = N_DATADDR(ex)); // data segment range
    current->mm->brk = ex.a_bss +
        (current->mm->start_brk = N_BSSADDR(ex)); // bss range; its end is the initial program break from which the heap grows

    retval = setup_arg_pages(bprm, STACK_TOP, EXSTACK_DEFAULT); // place the argument/environment pages at the final process stack location
    if (retval < 0)
        return retval;

    install_exec_creds(bprm);

    if (N_MAGIC(ex) == OMAGIC) { // the magic number determines the a.out variant
        unsigned long text_addr, map_size;
        loff_t pos;

        text_addr = N_TXTADDR(ex);

#ifdef __alpha__
        pos = fd_offset;
        map_size = ex.a_text+ex.a_data + PAGE_SIZE - 1;
#else
        pos = 32;
        map_size = ex.a_text+ex.a_data;
#endif
        error = vm_brk(text_addr & PAGE_MASK, map_size); // allocate virtual memory for text + data
        if (error)
            return error;

        error = read_code(bprm->file, text_addr, pos, // read text and data from the file into memory
                  ex.a_text+ex.a_data);
        if ((signed long)error < 0)
            return error;
    } else {
        if ((ex.a_text & 0xfff || ex.a_data & 0xfff) &&
            (N_MAGIC(ex) != NMAGIC) && printk_ratelimit())
        {
            printk(KERN_NOTICE "executable not page aligned\n");
        }

        if ((fd_offset & ~PAGE_MASK) != 0 && printk_ratelimit())
        {
            printk(KERN_WARNING 
                   "fd_offset is not page aligned. Please convert program: %pD\n",
                   bprm->file);
        }

        if (!bprm->file->f_op->mmap||((fd_offset & ~PAGE_MASK) != 0)) {
            error = vm_brk(N_TXTADDR(ex), ex.a_text+ex.a_data); // allocate virtual memory for text + data
            if (error)
                return error;

            read_code(bprm->file, N_TXTADDR(ex), fd_offset,
                  ex.a_text + ex.a_data); // read text and data from the file into memory
            goto beyond_if;
        }

        error = vm_mmap(bprm->file, N_TXTADDR(ex), ex.a_text,
            PROT_READ | PROT_EXEC,
            MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE | MAP_EXECUTABLE,
            fd_offset); // map the text segment from the file (a private file-backed mapping, not an anonymous one)

        if (error != N_TXTADDR(ex))
            return error;

        error = vm_mmap(bprm->file, N_DATADDR(ex), ex.a_data,
                PROT_READ | PROT_WRITE | PROT_EXEC,
                MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE | MAP_EXECUTABLE,
                fd_offset + ex.a_text); // map the data segment from the file (a private file-backed mapping, not an anonymous one)
        if (error != N_DATADDR(ex))
            return error;
    }
beyond_if:
    set_binfmt(&aout_format);

    retval = set_brk(current->mm->start_brk, current->mm->brk); // set up the bss region and the start of the heap
    if (retval < 0)
        return retval;

    current->mm->start_stack =
        (unsigned long) create_aout_tables((char __user *) bprm->p, bprm); // build the argv/envp tables on the user stack and record the initial stack pointer
#ifdef __alpha__
    regs->gp = ex.a_gpvalue;
#endif
    finalize_exec(bprm);
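    start_thread(regs, ex.a_entry, current->mm->start_stack); // point the saved user registers at the entry address and the new stack, so execution starts there on return to user mode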
    return 0;
}

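Only the load path for a.out executables is shown above. The ELF loader is too long to reproduce here: apart from a lot of handling of ELF format details, it goes through essentially the same process as the a.out loader, which consists of the following steps:

Argument and security checks.
Set the addresses of the code segment, data segment, heap, stack and argument area.
Determine the a.out variant from the magic number, then fill the code and data segments from the contents of the binary file.
Set up the registers and the initial stack, then start executing at the entry point.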
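
The address bookkeeping in the second step can be sketched as plain arithmetic. The model below is a simplification under stated assumptions: `struct aout_hdr`, `struct mm_layout` and `layout_segments` are hypothetical names, and the regions are assumed to sit back to back, whereas the kernel derives the data and bss start addresses from the N_DATADDR and N_BSSADDR macros, which depend on the a.out variant.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for the a.out exec header. */
struct aout_hdr {
    unsigned long a_text;   /* size of the code segment */
    unsigned long a_data;   /* size of the initialised data segment */
    unsigned long a_bss;    /* size of the zero-initialised data */
};

/* Hypothetical stand-in for the mm_struct fields set by the loader. */
struct mm_layout {
    unsigned long start_code, end_code;
    unsigned long start_data, end_data;
    unsigned long start_brk, brk;
};

/* Lay out code, data and bss one after another, starting at text_addr. */
struct mm_layout layout_segments(const struct aout_hdr *ex, unsigned long text_addr)
{
    assert(ex != NULL);
    struct mm_layout mm;

    mm.start_code = text_addr;
    mm.end_code   = mm.start_code + ex->a_text;
    mm.start_data = mm.end_code;        /* assumption: data directly follows code */
    mm.end_data   = mm.start_data + ex->a_data;
    mm.start_brk  = mm.end_data;        /* assumption: bss directly follows data */
    mm.brk        = mm.start_brk + ex->a_bss;
    return mm;
}
```

For a header with 0x2000 bytes of code, 0x1000 bytes of data and 0x800 bytes of bss loaded at 0x1000, this yields a code range of 0x1000 to 0x3000, a data range of 0x3000 to 0x4000 and an initial break at 0x4800.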

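Further reading: how dynamically linked programs are loaded

Anyone who has written C++ will be familiar with dynamic linking. So how do the shared objects (.so files) that a dynamically linked program depends on get loaded into memory on Linux? The loader code above does not appear to load any .so files, so what happens when the program's instructions call into code that lives in a shared object? (This part draws on the book 《程序员的自我修养》.)

Dynamic linking is mainly carried out by ld.so. A dynamically linked executable does not load its shared objects itself. Instead, its ELF program headers contain a PT_INTERP entry that names the dynamic linker (ld.so). load_elf_binary maps that interpreter into memory as well and starts execution at the interpreter's entry point rather than at the program's. ld.so then walks the executable's dependencies, maps each required shared object into memory, resolves the symbols, and finally jumps to the program's own entry point. dlopen and dlsym expose the same machinery to programs that want to load a library and look up symbols explicitly at run time. For the implementation details you have to read the glibc source, which is not covered here.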
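
How an executable names its dynamic linker can be seen by looking for the PT_INTERP program header. The following is a minimal sketch, not the kernel's code: `elf64_interp` is a hypothetical helper that only handles 64-bit ELF images already read into memory, and it assumes the file's byte order matches the host and that <elf.h> is available.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>
#include <string.h>

/*
 * Return a pointer to the interpreter path stored in the PT_INTERP segment
 * of a 64-bit ELF image, or NULL if there is none (for example a statically
 * linked binary) or the image looks malformed.
 */
const char *elf64_interp(const unsigned char *img, size_t len)
{
    assert(img != NULL);
    Elf64_Ehdr eh;

    if (len < sizeof(eh))
        return NULL;
    memcpy(&eh, img, sizeof(eh));
    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS64)
        return NULL;
    if (eh.e_phentsize != sizeof(Elf64_Phdr))
        return NULL;

    for (size_t i = 0; i < eh.e_phnum; i++) {
        Elf64_Phdr ph;
        size_t off = eh.e_phoff + i * sizeof(ph);

        /* The program header itself must lie inside the image. */
        if (off > len || len - off < sizeof(ph))
            return NULL;
        memcpy(&ph, img + off, sizeof(ph));
        if (ph.p_type != PT_INTERP)
            continue;

        /* The path must lie inside the image and be NUL-terminated. */
        if (ph.p_filesz == 0 || ph.p_offset > len ||
            len - ph.p_offset < ph.p_filesz)
            return NULL;
        if (img[ph.p_offset + ph.p_filesz - 1] != '\0')
            return NULL;
        return (const char *)(img + ph.p_offset);
    }
    return NULL;
}
```

On a typical x86-64 glibc system the string found this way is the path of the dynamic linker; the exact path varies by distribution and architecture, so none is asserted here.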

Loading and executing script files

When we run a script file, the interpreter has to be named at the very beginning of the file, for example:

#!/bin/bash
#!/bin/python2.7

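This is how Linux knows which interpreter to use for the script; without this line the kernel cannot execute the file and execve fails with ENOEXEC.

Let's read how load_script works: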

static int load_script(struct linux_binprm *bprm)
{
    const char *i_arg, *i_name;
    char *cp, *buf_end;
    struct file *file;
    int retval;

    /* Not ours to exec if we don't start with "#!". */
    if ((bprm->buf[0] != '#') || (bprm->buf[1] != '!')) // check that an interpreter line is present
        return -ENOEXEC;

    /*
     * If the script filename will be inaccessible after exec, typically
     * because it is a "/dev/fd/<fd>/.." path against an O_CLOEXEC fd, give
     * up now (on the assumption that the interpreter will want to load
     * this file).
     */
    if (bprm->interp_flags & BINPRM_FLAGS_PATH_INACCESSIBLE)
        return -ENOENT;

    /* Release since we are not mapping a binary into memory. */
    allow_write_access(bprm->file);
    fput(bprm->file);
    bprm->file = NULL;

    /*
     * This section handles parsing the #! line into separate
     * interpreter path and argument strings. We must be careful
     * because bprm->buf is not yet guaranteed to be NUL-terminated
     * (though the buffer will have trailing NUL padding when the
     * file size was smaller than the buffer size).
     *
     * We do not want to exec a truncated interpreter path, so either
     * we find a newline (which indicates nothing is truncated), or
     * we find a space/tab/NUL after the interpreter path (which
     * itself may be preceded by spaces/tabs). Truncating the
     * arguments is fine: the interpreter can re-read the script to
     * parse them on its own.
     */
    buf_end = bprm->buf + sizeof(bprm->buf) - 1;
    cp = strnchr(bprm->buf, sizeof(bprm->buf), '\n');
    if (!cp) { // no newline found
        cp = next_non_spacetab(bprm->buf + 2, buf_end);
        if (!cp)
            return -ENOEXEC; /* Entire buf is spaces/tabs */
        /*
         * If there is no later space/tab/NUL we must assume the
         * interpreter path is truncated.
         */
        if (!next_terminator(cp, buf_end))
            return -ENOEXEC;
        cp = buf_end;
    }
    /* NUL-terminate the buffer and any trailing spaces/tabs. */
    *cp = '\0';
    while (cp > bprm->buf) { // strip trailing spaces/tabs
        cp--;
        if ((*cp == ' ') || (*cp == '\t'))
            *cp = '\0';
        else
            break;
    }
    for (cp = bprm->buf+2; (*cp == ' ') || (*cp == '\t'); cp++); // skip leading spaces/tabs
    if (*cp == '\0')
        return -ENOEXEC; /* No interpreter name found */
    i_name = cp; // start of the interpreter path (the optional argument follows it on the same line)
    i_arg = NULL;
    for ( ; *cp && (*cp != ' ') && (*cp != '\t'); cp++)
        /* nothing */ ;
    while ((*cp == ' ') || (*cp == '\t'))
        *cp++ = '\0';
    if (*cp) // optional argument: everything after the interpreter path on the #! line is passed as one string
        i_arg = cp;
    /*
     * OK, we've parsed out the interpreter name and
     * (optional) argument.
     * Splice in (1) the interpreter's name for argv[0]
     *           (2) (optional) argument to interpreter
     *           (3) filename of shell script (replace argv[0])
     *
     * This is done in reverse order, because of how the
     * user environment and arguments are stored.
     */
    retval = remove_arg_zero(bprm); // drop the original argv[0]
    if (retval)
        return retval;
    retval = copy_strings_kernel(1, &bprm->interp, bprm); // push the script's path (bprm->interp still points at it) so the interpreter receives it as an argument
    if (retval < 0)
        return retval;
    bprm->argc++;
    if (i_arg) {
        retval = copy_strings_kernel(1, &i_arg, bprm); // push the optional interpreter argument
        if (retval < 0)
            return retval;
        bprm->argc++;
    }
    retval = copy_strings_kernel(1, &i_name, bprm); // push the real executable as the new argv[0], e.g. /bin/bash or /bin/python2.7 above
    if (retval)
        return retval;
    bprm->argc++;
    retval = bprm_change_interp(i_name, bprm); // now actually change bprm->interp to the interpreter
    if (retval < 0)
        return retval;

    /*
     * OK, now restart the process with the interpreter's dentry.
     */
    file = open_exec(i_name); // open the interpreter binary
    if (IS_ERR(file))
        return PTR_ERR(file);
    bprm->file = file;
    retval = prepare_binprm(bprm);
    if (retval < 0)
        return retval;
    return search_binary_handler(bprm); // load the real interpreter, which then interprets the script
}

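Reading the code shows that executing a script comes down to obtaining two things: 1. the path of the interpreter (plus an optional argument) from the #! line; 2. the path of the script file itself. The interpreter becomes the new file and argv[0] of linux_binprm, the script path is appended as an argument, and search_binary_handler is called again to load the real interpreter. Note that it is the script's path, not its contents, that is passed along. The interpreter is normally an ELF binary, so load_elf_binary loads it, and the interpreter then opens the script and interprets it.
This script mechanism is what makes Python scripts, PHP scripts and the like directly executable. Java bytecode is a different case: it does not start with #!, so running it directly relies on a separately registered handler such as binfmt_misc rather than on load_script.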
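
The parsing that load_script performs on the #! line can be reproduced in a few lines of userspace C. This is a sketch rather than the kernel's code: `struct shebang` and `parse_shebang` are hypothetical names, the input is assumed to be a writable NUL-terminated string, and the kernel's truncation checks against the fixed-size bprm->buf are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type; both pointers point into the caller's buffer. */
struct shebang {
    char *interp;   /* interpreter path, e.g. "/bin/bash" */
    char *arg;      /* the single optional argument, or NULL */
};

static int is_blank(char c)
{
    return c == ' ' || c == '\t';
}

/*
 * Parse a "#!" line in place.
 * Returns 0 on success, -1 if the buffer does not hold a usable "#!" line.
 */
int parse_shebang(char *line, struct shebang *out)
{
    assert(line != NULL && out != NULL);
    if (line[0] != '#' || line[1] != '!')
        return -1;

    /* Cut at the first newline, then strip trailing blanks. */
    char *end = strchr(line, '\n');
    if (!end)
        end = line + strlen(line);
    *end = '\0';
    while (end > line + 2 && is_blank(end[-1]))
        *--end = '\0';

    /* Skip blanks between "#!" and the interpreter path. */
    char *cp = line + 2;
    while (is_blank(*cp))
        cp++;
    if (*cp == '\0')
        return -1;              /* no interpreter name found */
    out->interp = cp;

    /* The path ends at the first blank; whatever follows is ONE argument. */
    while (*cp && !is_blank(*cp))
        cp++;
    while (is_blank(*cp))
        *cp++ = '\0';
    out->arg = *cp ? cp : NULL;
    return 0;
}
```

Note that the remainder of the line is kept as a single argument string, so a line such as `#! /bin/bash -x -e` hands the interpreter one argument `-x -e`, not two.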

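Summary

In Linux, threads and processes are the same kind of object: they are created in the same way, described by the same data structure and scheduled by the same scheduler. The only difference is that a thread is created with flags that make it share certain resources, such as mm and files, which makes it cheaper to create than a real process. In addition, to conform better to POSIX, a newly created thread is placed in a thread group so that signals can be handled properly.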
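
The effect of the sharing flags can be observed from userspace with the glibc clone() wrapper. The sketch below is an illustration under assumptions, not how pthread creates threads: `run_child` and `bump` are hypothetical names, only CLONE_VM is toggled (a real thread also passes flags such as CLONE_FILES and CLONE_THREAD), and the 64 KiB child stack size is arbitrary.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Child entry point: increment the integer the argument points to. */
static int bump(void *arg)
{
    (*(volatile int *)arg)++;
    return 0;
}

/*
 * Create a child with clone() and wait for it.
 * With share_vm != 0 the child runs in the parent's address space
 * (CLONE_VM), so its increment is visible to the parent. Without it the
 * child works on a private copy, as after fork().
 * Returns 0 on success, -1 on failure.
 */
int run_child(int share_vm, volatile int *counter)
{
    assert(counter != NULL);
    const size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);
    if (!stack)
        return -1;

    int flags = SIGCHLD | (share_vm ? CLONE_VM : 0);
    /* The stack grows downwards, so pass the top of the allocation. */
    pid_t pid = clone(bump, stack + stack_size, flags, (void *)counter);
    if (pid < 0) {
        free(stack);
        return -1;
    }

    int status;
    pid_t waited = waitpid(pid, &status, 0);
    free(stack);
    if (waited < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Calling `run_child(1, &counter)` changes the parent's counter, while `run_child(0, &counter)` leaves it untouched, which is exactly the thread versus process distinction described above.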

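When you log in to a remote Linux server and type a command that runs an executable file, the main steps are:

A remote connection is established, and a pts character device on the server is connected to a bash process.
The pts device handles the incoming input and receives what the user types.
When the user presses Enter, bash treats this as the end of a command and starts interpreting it.
bash forks a child process for each external command it runs.
The forked child calls the exec system call with the executable and arguments from the command line, while the parent bash goes back to accepting new user commands.
The Linux kernel parses the executable name and arguments passed to exec and builds the process's memory regions (code segment, data segment, argument area, environment area, stack, heap and so on). If a script is being executed, there is one more recursive pass until a real executable is reached.
The instruction pointer is set to the entry point in the code segment and execution begins.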
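
The fork and exec steps that the shell performs can be sketched in a few lines. This is a minimal illustration, assuming a POSIX system; `run_command` is a hypothetical helper, and a real shell additionally handles job control, redirections, builtins and signals.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run argv[0] with the given arguments, the way a shell runs an external
 * command: fork a child, exec the program in the child, wait in the parent.
 * Returns the program's exit status, or -1 on failure.
 */
int run_command(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);  /* only returns if the exec failed */
        _exit(127);             /* conventional "command not found" status */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, passing the argument vector `{ "sh", "-c", "exit 7", NULL }` returns 7, the exit status of the child.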

Last edited: 2020.01.19 13:10:24
