1 Terminology
VPN: virtual page number
PPN: physical page number
PTE: page-table entry
ASID: address space identifier
PMA: Physical Memory Attributes
PMP: Physical Memory Protection
PGD: Page Global Directory
PUD: Page Upper Directory
PMD: Page Middle Directory
PT: Page Table
TVM: Trap Virtual Memory
2 Linux Memory Structure
A 4 KB page may not be the optimal size; 8 KB or 16 KB might well be better choices, but 4 KB is a trade-off that was made in the past under the constraints of the time. Rather than fixating on the number 4 KB, we should pay attention to the factors that produced it, so that when we face a similar situation we can weigh the same considerations. This article looks at two factors that influence page size:
- A page size that is too small produces many more page-table entries, which increases TLB (translation lookaside buffer) pressure and adds lookup overhead during address translation;
- A page size that is too large wastes memory, causes internal fragmentation, and lowers memory utilization.
Both of these factors were weighed when page sizes were settled last century, and 4 KB ended up as the most common page size in operating systems. The following sections look at how each factor affects OS performance.
Each process sees its own independent virtual address space. Virtual memory is only a logical construct: the process still has to reach the physical memory behind it, and the virtual-to-physical translation is performed through the page table that each process holds.
In the four-level page-table structure shown in the figure above, the operating system uses the lowest 12 bits of a virtual address as the offset within the page; the remaining 36 bits are split into four groups, each serving as an index into the table at the next level. Any virtual address can thus be resolved to a physical address by walking this multi-level page table.
Because the size of the virtual address space is fixed and is divided evenly into N pages of equal size, the page size ultimately determines both the depth of the page-table hierarchy and the number of entries a process needs: the smaller the virtual page, the more virtual pages and page-table entries each process has.
PagesCount = VirtualMemory / PageSize
Because today's virtual pages are 4096 bytes, the low 12 bits of a virtual address select a byte within the page. If the page size dropped to 512 bytes, a four-level page table would have to grow to five levels (and a five-level one to six), which both adds memory accesses to every translation and increases the memory consumed by each process's page-table entries.
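To make the relationship between page size and table depth concrete, here is a minimal sketch that splits a virtual address into its per-level indices, assuming the common x86-64-style layout (48-bit virtual addresses, 4 KB pages, four levels of 9-bit indices); the constant names are chosen for illustration and are not kernel identifiers:

```c
#include <stdio.h>

/* With 4 KB pages and 8-byte entries, each table holds 512 entries, so every
 * level consumes 9 bits of the virtual address: 12 offset bits + 4 * 9 = 48. */
#define PAGE_SHIFT 12
#define INDEX_BITS 9
#define INDEX_MASK ((1ULL << INDEX_BITS) - 1)

int main(void)
{
    unsigned long long va = 0x00007f1234567abcULL;

    unsigned long long offset = va & ((1ULL << PAGE_SHIFT) - 1);
    unsigned long long pt  = (va >> PAGE_SHIFT) & INDEX_MASK;
    unsigned long long pmd = (va >> (PAGE_SHIFT + INDEX_BITS)) & INDEX_MASK;
    unsigned long long pud = (va >> (PAGE_SHIFT + 2 * INDEX_BITS)) & INDEX_MASK;
    unsigned long long pgd = (va >> (PAGE_SHIFT + 3 * INDEX_BITS)) & INDEX_MASK;

    printf("pgd=%llu pud=%llu pmd=%llu pt=%llu offset=0x%llx\n",
           pgd, pud, pmd, pt, offset);
    return 0;
}
```

Shrinking the page to 512 bytes would shorten the offset to 9 bits and shrink each table to 64 entries, which is exactly why the hierarchy would then need one or two extra levels.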
A PGD holds the addresses of a number of PUDs, each PUD holds the addresses of PMDs, and each PMD in turn holds the addresses of PTs. Each page-table entry points to a page frame, and a page frame is an actual page of physical memory.
PGD: Page Global Directory
In Linux, the user-space part of the PGD differs from process to process, but the kernel part is the same for all of them. Whenever a new process is created, a new page directory (PGD) is allocated for it, and the kernel's page-directory entries are copied from the kernel page directory swapper_pg_dir into the corresponding slots of the new PGD. The call path is roughly:
do_fork() --> copy_mm() --> mm_init() --> pgd_alloc() --> set_pgd_fast() --> get_pgd_slow() --> memcpy(&PGD + USER_PTRS_PER_PGD, swapper_pg_dir +USER_PTRS_PER_PGD, (PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t))
3 Processes and Physical Memory
3.1 Process Creation
mm_init(), defined in fork.c, contains the following code:

if (mm_alloc_pgd(mm))
    goto fail_nopgd;

mm_init() calls mm_alloc_pgd(), and this is where process creation starts to interact with the underlying physical memory. mm_alloc_pgd() is also defined in fork.c:
static inline int mm_alloc_pgd(struct mm_struct *mm)
{
mm->pgd = pgd_alloc(mm);
if (unlikely(!mm->pgd))
return -ENOMEM;
return 0;
}
pgd_alloc(), defined in pgalloc.h, allocates a page for the new pgd, returns that page's starting address, and copies the kernel's PGD entries into the current process's page directory. It calls __get_free_page() to obtain a free physical page for the process's page directory; __get_free_page() is simply the kernel's usual __get_free_pages() with order 0. In this way, process creation at the upper layers becomes directly tied to the underlying physical memory. The source of these functions is as follows:
#define __get_free_page(gfp_mask) \
__get_free_pages((gfp_mask), 0)
static inline pgd_t *pgd_alloc(struct mm_struct *mm)
{
pgd_t *pgd;
pgd = (pgd_t *)__get_free_page(GFP_KERNEL);
if (likely(pgd != NULL)) {
memset(pgd, 0, USER_PTRS_PER_PGD * sizeof(pgd_t));
/* Copy kernel mappings */
memcpy(pgd + USER_PTRS_PER_PGD,
init_mm.pgd + USER_PTRS_PER_PGD,
(PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t));
}
return pgd;
}
The init_mm structure, defined in init_mm.c, records all the information about the kernel's root page table; swapper_pg_dir is the global variable that holds the kernel PGD. The definition in init_mm.c is as follows:
/*
* For dynamically allocated mm_structs, there is a dynamically sized cpumask
* at the end of the structure, the size of which depends on the maximum CPU
* number the system can see. That way we allocate only as much memory for
* mm_cpumask() as needed for the hundreds, or thousands of processes that
* a system typically runs.
*
* Since there is only one init_mm in the entire system, keep it simple
* and size this cpu_bitmask to NR_CPUS.
*/
struct mm_struct init_mm = {
.mm_rb = RB_ROOT,
.pgd = swapper_pg_dir,
.mm_users = ATOMIC_INIT(2),
.mm_count = ATOMIC_INIT(1),
.mmap_sem = __RWSEM_INITIALIZER(init_mm.mmap_sem),
.page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
.arg_lock = __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
.mmlist = LIST_HEAD_INIT(init_mm.mmlist),
.user_ns = &init_user_ns,
.cpu_bitmap = { [BITS_TO_LONGS(NR_CPUS)] = 0},
INIT_MM_CONTEXT(init_mm)
};
3.2 __get_free_pages()
Each process's page directory is therefore split into two parts. The first part is the "user space" portion, which maps the process's own 3 GB of virtual addresses (0x0000 0000-0xBFFF FFFF); the second part is the "system space" portion, which maps the 1 GB of virtual addresses from 0xC000 0000 to 0xFFFF FFFF. The second part of every process's page directory is identical in Linux, so from a process's point of view it owns 4 GB of virtual address space: the lower 3 GB is its private user space, and the top 1 GB is the system space shared with the kernel and all other processes. Each process has its own PGD (Page Global Directory), which is a physical page containing an array of pgd_t entries.
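To make the split concrete, here is a minimal sketch of how the constants used by pgd_alloc() relate to each other, assuming a classic 32-bit two-level layout with a 3G/1G split (the values are illustrative assumptions, not taken from any particular architecture's headers):

```c
#include <stdio.h>

#define PAGE_OFFSET        0xC0000000UL                    /* kernel space starts at 3 GB */
#define PGDIR_SHIFT        22                              /* each PGD entry maps 4 MB */
#define PTRS_PER_PGD       1024                            /* 4 GB / 4 MB */
#define USER_PTRS_PER_PGD  (PAGE_OFFSET >> PGDIR_SHIFT)    /* 768 entries cover user space */

int main(void)
{
    /* pgd_alloc() zeroes the first USER_PTRS_PER_PGD entries (the per-process
     * user mapping) and copies the remaining entries from init_mm.pgd, so every
     * process shares the same kernel mapping. */
    printf("user entries: %lu, kernel entries: %lu\n",
           (unsigned long)USER_PTRS_PER_PGD,
           (unsigned long)(PTRS_PER_PGD - USER_PTRS_PER_PGD));
    return 0;
}
```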
4 RISC-V Addressing and Memory Protection
4.1 Virtual Address
An Sv32 virtual address is partitioned into a virtual page number (VPN) and page offset, as shown in
Figure 4.15.
**Sv32 page tables consist of 2^10 page-table entries (PTEs), each of four bytes, so one page table is exactly 4 KB in size.**
An Sv32 virtual address consists of three fields: VPN[1] (10 bits), VPN[0] (10 bits), and the page offset (12 bits).
As the table above shows, a page-table entry is made up of the fields PPN[1], PPN[0], RSW, D, A, G, U, X, W, R, and V.
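A minimal sketch of pulling those three fields out of a 32-bit virtual address (the widths follow the Sv32 layout just described; the example address is arbitrary):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t va = 0x80403abc;                 /* arbitrary example address */

    uint32_t offset = va & 0xfff;             /* bits 11:0  - 12-bit page offset */
    uint32_t vpn0   = (va >> 12) & 0x3ff;     /* bits 21:12 - index into the leaf table */
    uint32_t vpn1   = (va >> 22) & 0x3ff;     /* bits 31:22 - index into the root table */

    printf("VPN[1]=%u VPN[0]=%u offset=0x%x\n", vpn1, vpn0, offset);
    return 0;
}
```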
The U bit: indicates whether the page is accessible to user mode.
The G bit: designates a global mapping.
The A bit: indicates the virtual page has been read, written, or fetched from since the last time the A bit was cleared.
The D bit: indicates the virtual page has been written since the last time the D bit was cleared.
The permission R, W, and X bits: indicate whether the page is readable, writable, and executable, respectively.
The V bit: indicates whether the PTE is valid; if it is 0, all other bits in the PTE are don't-cares and may be used freely by software.
In the Linux kernel, PPN[1] and PPN[0] are collectively referred to as the PFN. The PTE format is described in pgtable-bits.h as follows:
/*
* PTE format:
* | XLEN-1 10 | 9 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0
* PFN reserved for SW D A G U X W R V
*/
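The bit masks matching this format are defined in the same header along these lines (the values follow directly from the layout above; the exact macro text may vary between kernel versions):

```c
#define _PAGE_PRESENT   (1 << 0)   /* V: entry is valid */
#define _PAGE_READ      (1 << 1)   /* R: readable */
#define _PAGE_WRITE     (1 << 2)   /* W: writable */
#define _PAGE_EXEC      (1 << 3)   /* X: executable */
#define _PAGE_USER      (1 << 4)   /* U: accessible from user mode */
#define _PAGE_GLOBAL    (1 << 5)   /* G: global mapping */
#define _PAGE_ACCESSED  (1 << 6)   /* A: page has been accessed */
#define _PAGE_DIRTY     (1 << 7)   /* D: page has been written */
#define _PAGE_SOFT      (3 << 8)   /* RSW: two bits reserved for software */
```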
4.2 The satp Register
For Sv32, the satp register consists of three fields: MODE (bit 31, which turns paging on or off), ASID (bits 30:22), and PPN (bits 21:0, the physical page number of the root page table).
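As a small illustration (not kernel code; the helper name is made up), an Sv32 satp value can be assembled from those fields like this:

```c
#include <stdint.h>

/* Sv32 satp layout: MODE[31] (1 = enable Sv32 paging), ASID[30:22], PPN[21:0].
 * root_pa is the physical address of the root page table; it must be page
 * aligned so that root_pa >> 12 is its physical page number. */
static inline uint32_t make_satp_sv32(uint32_t root_pa, uint32_t asid)
{
    uint32_t mode = 1u;                  /* turn on Sv32 translation */
    uint32_t ppn  = root_pa >> 12;       /* PPN of the root page table */

    return (mode << 31) | ((asid & 0x1ffu) << 22) | (ppn & 0x3fffffu);
}
```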
4.3 Virtual Address Translation Process
The process of translating a virtual address into a physical address is as follows:
Each application has its own Page Global Directory (PGD), which holds the physical page-frame addresses; the pgd_t structure array is defined in <asm/page.h>, and different architectures load the PGD in different ways.
A virtual address va is translated into a physical address pa as follows:
(1)If XLEN equals VALEN, proceed. (For Sv32, VALEN=32.) Otherwise, check whether each
bit of va[XLEN-1:VALEN] is equal to va[VALEN-1]. If not, stop and raise a page-fault
exception corresponding to the original access type.
(2)Let a be satp.ppn × PAGESIZE, and let i = LEVELS − 1. (For Sv32, PAGESIZE=2^12 and LEVELS=2.)
(3)Let pte be the value of the PTE at address a+va.vpn[i]×PTESIZE. (For Sv32, PTESIZE=4.)
If accessing pte violates a PMA or PMP check, raise an access exception corresponding to
the original access type.
(4)If pte.v = 0, or if pte.r = 0 and pte.w = 1, stop and raise a page-fault exception corresponding
to the original access type.
(5)Otherwise, the PTE is valid. If pte.r = 1 or pte.x = 1, go to step 6. Otherwise, this PTE is a pointer to the next level of the page table. Let i = i−1. If i < 0, stop and raise a page-fault exception corresponding to the original access type. Otherwise, let a = pte.ppn×PAGESIZE and go to step 3.
(6)A leaf PTE has been found. Determine if the requested memory access is allowed by the
pte.r, pte.w, pte.x, and pte.u bits, given the current privilege mode and the value of the
SUM and MXR fields of the mstatus register. If not, stop and raise a page-fault exception
corresponding to the original access type.
(7)If i > 0 and pte.ppn[i − 1 : 0] ≠ 0, this is a misaligned superpage; stop and raise a page-fault exception corresponding to the original access type.
(8)If pte.a = 0, or if the memory access is a store and pte.d = 0, either raise a page-fault
exception corresponding to the original access type, or:
• Set pte.a to 1 and, if the memory access is a store, also set pte.d to 1.
• If this access violates a PMA or PMP check, raise an access exception corresponding to
the original access type.
• This update and the loading of pte in step 3 must be atomic; in particular, no intervening store to the PTE may be perceived to have occurred in-between.
(9)The translation is successful. The translated physical address is given as follows:
• pa.pgoff = va.pgoff.
• If i > 0, then this is a superpage translation and pa.ppn[i − 1 : 0] = va.vpn[i − 1 : 0].
• pa.ppn[LEVELS − 1 : i] = pte.ppn[LEVELS − 1 : i].
The rough steps from virtual address to physical address are:
- Check whether XLEN equals VALEN (step 1 above).
- Compute the base address: a = satp.ppn × PAGESIZE (PAGESIZE = 4096); satp.ppn holds the physical page number of the current process's root page table.
- Compute the PTE address: pte address = a + va.vpn[i] × PTESIZE (for Sv32, PTESIZE = 4).
- Check the page attributes: if the X, W and R bits are all 0, the PTE points to the next level of the page table, so repeat the previous step one level down; otherwise the PTE is a valid leaf entry.
- Once the PMA and PMP checks also pass, the translation has succeeded.
- Finally, combine the resulting PPN with va.offset to obtain the actual physical address.
A condensed software model of this walk is sketched below.
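This is a sketch for illustration only: it skips the PMA/PMP checks, the permission check against the current privilege mode, and the A/D-bit updates, and the read_phys() helper that reads a PTE from physical memory is hypothetical:

```c
#include <stdint.h>

#define PAGESIZE 4096u
#define PTESIZE  4u
#define LEVELS   2

/* PTE flag bits, matching the Sv32 format shown earlier */
#define PTE_V (1u << 0)
#define PTE_R (1u << 1)
#define PTE_W (1u << 2)
#define PTE_X (1u << 3)

/* Hypothetical helper: read one 32-bit PTE from physical memory. */
extern uint32_t read_phys(uint64_t pa);

/* Translate an Sv32 virtual address; returns 0 on success, -1 on page fault. */
int sv32_translate(uint32_t satp_ppn, uint32_t va, uint64_t *pa)
{
    uint64_t a = (uint64_t)satp_ppn * PAGESIZE;                /* step 2 */

    for (int i = LEVELS - 1; i >= 0; i--) {
        uint32_t vpn = (va >> (12 + 10 * i)) & 0x3ff;
        uint32_t pte = read_phys(a + vpn * PTESIZE);           /* step 3 */

        if (!(pte & PTE_V) || (!(pte & PTE_R) && (pte & PTE_W)))
            return -1;                                         /* step 4: page fault */

        if (pte & (PTE_R | PTE_X)) {                           /* steps 5-6: leaf PTE */
            uint32_t ppn1 = (pte >> 20) & 0xfff;
            uint32_t ppn0 = (pte >> 10) & 0x3ff;

            if (i > 0 && ppn0 != 0)
                return -1;                                     /* step 7: misaligned superpage */
            if (i > 0)
                ppn0 = (va >> 12) & 0x3ff;                     /* step 9: superpage keeps va.vpn[0] */

            *pa = ((uint64_t)ppn1 << 22) | ((uint64_t)ppn0 << 12) | (va & 0xfff);
            return 0;                                          /* translation done */
        }

        a = (uint64_t)(pte >> 10) * PAGESIZE;                  /* step 5: descend one level */
    }

    return -1;                                                 /* no leaf found: page fault */
}
```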
4.4 Page Table Creation Process
The most typical case of a virtual address with no physical mapping is a user-space malloc(): Linux does not allocate physical memory for the returned range at that point. Only when the virtual address is actually written does the kernel take a page fault and map physical memory behind it. That is the rough picture, but the path from the page-fault exception to the finished mapping is quite involved; the key piece is the page-fault handler, which on RISC-V is do_page_fault(), defined in arch/riscv/mm/fault.c.
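As a trivial user-space illustration of this behaviour (nothing RISC-V specific), the write below, not the malloc() itself, is typically the moment the kernel takes the page fault and maps physical pages:

```c
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* malloc() only reserves virtual address space (large requests are usually
     * backed by an anonymous mmap); no physical pages are mapped yet. */
    char *buf = malloc(1 << 20);
    if (!buf)
        return 1;

    /* The first write to each page traps into the kernel, which runs the
     * architecture's page-fault handler (do_page_fault() on RISC-V) and, via
     * handle_mm_fault(), allocates and maps a physical page. */
    memset(buf, 0xa5, 1 << 20);

    free(buf);
    return 0;
}
```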
4.4.1 do_page_fault
The do_page_fault() function is implemented as follows:
/*
* This routine handles page faults. It determines the address and the
* problem, and then passes it off to one of the appropriate routines.
*/
asmlinkage void do_page_fault(struct pt_regs *regs)
{
struct task_struct *tsk;
struct vm_area_struct *vma;
struct mm_struct *mm;
unsigned long addr, cause;
unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
int code = SEGV_MAPERR;
vm_fault_t fault;
cause = regs->scause;
addr = regs->sbadaddr;
tsk = current;
mm = tsk->mm;
/*
* Fault-in kernel-space virtual memory on-demand.
* The 'reference' page table is init_mm.pgd.
*
* NOTE! We MUST NOT take any locks for this case. We may
* be in an interrupt or a critical region, and should
* only copy the information from the master page table,
* nothing more.
*/
if (unlikely((addr >= VMALLOC_START) && (addr <= VMALLOC_END)))
goto vmalloc_fault;
/* Enable interrupts if they were enabled in the parent context. */
if (likely(regs->sstatus & SR_SPIE))
local_irq_enable();
/*
* If we're in an interrupt, have no user context, or are running
* in an atomic region, then we must not take the fault.
*/
if (unlikely(faulthandler_disabled() || !mm))
goto no_context;
if (user_mode(regs))
flags |= FAULT_FLAG_USER;
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
retry:
down_read(&mm->mmap_sem);
vma = find_vma(mm, addr);
if (unlikely(!vma))
goto bad_area;
if (likely(vma->vm_start <= addr))
goto good_area;
if (unlikely(!(vma->vm_flags & VM_GROWSDOWN)))
goto bad_area;
if (unlikely(expand_stack(vma, addr)))
goto bad_area;
/*
* Ok, we have a good vm_area for this memory access, so
* we can handle it.
*/
good_area:
code = SEGV_ACCERR;
switch (cause) {
case EXC_INST_PAGE_FAULT:
if (!(vma->vm_flags & VM_EXEC))
goto bad_area;
break;
case EXC_LOAD_PAGE_FAULT:
if (!(vma->vm_flags & VM_READ))
goto bad_area;
break;
case EXC_STORE_PAGE_FAULT:
if (!(vma->vm_flags & VM_WRITE))
goto bad_area;
flags |= FAULT_FLAG_WRITE;
break;
default:
panic("%s: unhandled cause %lu", __func__, cause);
}
/*
* If for any reason at all we could not handle the fault,
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
fault = handle_mm_fault(vma, addr, flags);
/*
* If we need to retry but a fatal signal is pending, handle the
* signal first. We do not need to release the mmap_sem because it
* would already be released in __lock_page_or_retry in mm/filemap.c.
*/
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(tsk))
return;
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
else if (fault & VM_FAULT_SIGBUS)
goto do_sigbus;
BUG();
}
/*
* Major/minor page fault accounting is only done on the
* initial attempt. If we go through a retry, it is extremely
* likely that the page will be found in page cache at that point.
*/
if (flags & FAULT_FLAG_ALLOW_RETRY) {
if (fault & VM_FAULT_MAJOR) {
tsk->maj_flt++;
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ,
1, regs, addr);
} else {
tsk->min_flt++;
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN,
1, regs, addr);
}
if (fault & VM_FAULT_RETRY) {
/*
* Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
* of starvation.
*/
flags &= ~(FAULT_FLAG_ALLOW_RETRY);
flags |= FAULT_FLAG_TRIED;
/*
* No need to up_read(&mm->mmap_sem) as we would
* have already released it in __lock_page_or_retry
* in mm/filemap.c.
*/
goto retry;
}
}
up_read(&mm->mmap_sem);
return;
/*
* Something tried to access memory that isn't in our memory map.
* Fix it, but check if it's kernel or user first.
*/
bad_area:
up_read(&mm->mmap_sem);
/* User mode accesses just cause a SIGSEGV */
if (user_mode(regs)) {
do_trap(regs, SIGSEGV, code, addr, tsk);
return;
}
no_context:
/* Are we prepared to handle this kernel fault? */
if (fixup_exception(regs))
return;
/*
* Oops. The kernel tried to access some bad page. We'll have to
* terminate things with extreme prejudice.
*/
bust_spinlocks(1);
pr_alert("Unable to handle kernel %s at virtual address " REG_FMT "\n",
(addr < PAGE_SIZE) ? "NULL pointer dereference" :
"paging request", addr);
die(regs, "Oops");
do_exit(SIGKILL);
/*
* We ran out of memory, call the OOM killer, and return the userspace
* (which will retry the fault, or kill us if we got oom-killed).
*/
out_of_memory:
up_read(&mm->mmap_sem);
if (!user_mode(regs))
goto no_context;
pagefault_out_of_memory();
return;
do_sigbus:
up_read(&mm->mmap_sem);
/* Kernel mode? Handle exceptions or die */
if (!user_mode(regs))
goto no_context;
do_trap(regs, SIGBUS, BUS_ADRERR, addr, tsk);
return;
vmalloc_fault:
{
pgd_t *pgd, *pgd_k;
pud_t *pud, *pud_k;
p4d_t *p4d, *p4d_k;
pmd_t *pmd, *pmd_k;
pte_t *pte_k;
int index;
if (user_mode(regs))
goto bad_area;
/*
* Synchronize this task's top level page-table
* with the 'reference' page table.
*
* Do _not_ use "tsk->active_mm->pgd" here.
* We might be inside an interrupt in the middle
* of a task switch.
*
* Note: Use the old spbtr name instead of using the current
* satp name to support binutils 2.29 which doesn't know about
* the privileged ISA 1.10 yet.
*/
index = pgd_index(addr);
pgd = (pgd_t *)pfn_to_virt(csr_read(sptbr)) + index;
pgd_k = init_mm.pgd + index;
if (!pgd_present(*pgd_k))
goto no_context;
set_pgd(pgd, *pgd_k);
p4d = p4d_offset(pgd, addr);
p4d_k = p4d_offset(pgd_k, addr);
if (!p4d_present(*p4d_k))
goto no_context;
pud = pud_offset(p4d, addr);
pud_k = pud_offset(p4d_k, addr);
if (!pud_present(*pud_k))
goto no_context;
/*
* Since the vmalloc area is global, it is unnecessary
* to copy individual PTEs
*/
pmd = pmd_offset(pud, addr);
pmd_k = pmd_offset(pud_k, addr);
if (!pmd_present(*pmd_k))
goto no_context;
set_pmd(pmd, *pmd_k);
/*
* Make sure the actual PTE exists as well to
* catch kernel vmalloc-area accesses to non-mapped
* addresses. If we don't do this, this will just
* silently loop forever.
*/
pte_k = pte_offset_kernel(pmd_k, addr);
if (!pte_present(*pte_k))
goto no_context;
return;
}
}