Inside the Linux 6.8.12 Kernel: The x86_64 System Call Fast Path and Its Security Trade-offs
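A Deep Dive into the Linux Kernel's System Call Mechanism
In modern operating systems, user mode and kernel mode interact through system calls. This article examines the system call mechanism on x86_64 in Linux 6.8.12, covering MSR configuration, stack switching, and the safe return to user mode.
Overview
A system call is how user-mode code enters the kernel to perform a privileged operation. On x86-64 this is done with the syscall and sysret instructions. The walkthrough below starts with MSR configuration and then follows each stage of the path in turn.
1. System Call Initialization
The system call mechanism is set up mainly by the syscall_init function (shown here abridged):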
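    void syscall_init(void)
    {
        wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
        wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
        if (ia32_enabled()) {
            wrmsrl_cstar((unsigned long)entry_SYSCALL_compat);
            wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
        }
    }

- MSR_STAR: holds the segment selector bases used by syscall and sysret. wrmsr takes the low and high 32-bit halves separately. The low half (the legacy-mode SYSCALL target EIP, which a 64-bit kernel does not use) is 0. The high half is (__USER32_CS << 16) | __KERNEL_CS, so bits 47:32 carry the kernel CS base loaded by syscall and bits 63:48 carry the user CS base used by sysret.
- MSR_LSTAR: the address that syscall jumps to from 64-bit code, here entry_SYSCALL_64.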
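To make the selector arithmetic concrete, here is a minimal user-space sketch of how the value written to MSR_STAR decodes. The selector constants are assumptions based on the usual x86_64 GDT layout (__KERNEL_CS = 0x10, __USER32_CS = 0x23), and every sk_ name is a hypothetical helper for illustration, not a kernel function.

```c
#include <stdint.h>

/* Assumed selector values from the usual x86_64 GDT layout (illustration only) */
#define SK_KERNEL_CS 0x10u /* GDT entry 2, RPL 0 */
#define SK_USER32_CS 0x23u /* GDT entry 4, RPL 3 */

/* wrmsr(msr, low, high) writes high:low as a single 64-bit value */
static uint64_t sk_star_value(uint32_t low, uint32_t high)
{
    return ((uint64_t)high << 32) | low;
}

/* The value syscall_init() would program, under the assumptions above */
static uint64_t sk_star_default(void)
{
    return sk_star_value(0, (SK_USER32_CS << 16) | SK_KERNEL_CS);
}

/* SYSCALL loads CS from STAR[47:32] and SS from STAR[47:32] + 8 */
static uint16_t sk_syscall_cs(uint64_t star)
{
    return (uint16_t)(star >> 32);
}

static uint16_t sk_syscall_ss(uint64_t star)
{
    return (uint16_t)(sk_syscall_cs(star) + 8);
}

/* 64-bit SYSRET loads CS from STAR[63:48] + 16 and SS from STAR[63:48] + 8 */
static uint16_t sk_sysret_cs64(uint64_t star)
{
    return (uint16_t)((star >> 48) + 16);
}

static uint16_t sk_sysret_ss(uint64_t star)
{
    return (uint16_t)((star >> 48) + 8);
}
```

Under these assumed values, syscall loads CS = 0x10 and SS = 0x18, while a 64-bit sysret loads CS = 0x33 and SS = 0x2b. The +16 offset is why __USER32_CS, rather than the 64-bit user code selector, is the value written into the MSR.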
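2. From User Mode to Kernel Mode
When a syscall instruction executes, the processor jumps to entry_SYSCALL_64, which begins as follows:

    SYM_CODE_START(entry_SYSCALL_64)
        UNWIND_HINT_ENTRY
        ENDBR
        swapgs                                              /* switch to the kernel GS base */
        movq %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)        /* stash the user RSP */
        SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp               /* switch to the kernel page tables */
        movq PER_CPU_VAR(pcpu_hot + X86_top_of_stack), %rsp /* load the kernel stack */

The hardware does not switch stacks on syscall, so the entry code saves the user RSP and loads a kernel stack itself. The system call number arrives in RAX and is pushed as orig_ax a few instructions later. RCX holds the user return address, not the system call number.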
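3. Stack Switching and Context Saving
Next, the kernel builds a complete struct pt_regs on the kernel stack to hold the user-mode register state:

    PUSH_AND_CLEAR_REGS rax=$-ENOSYS    /* save the remaining general-purpose registers; the ax slot is set to -ENOSYS */

The push order mirrors the field layout of struct pt_regs exactly (in reverse, because the stack grows downward). The saved frame can therefore be passed to C code as a struct pt_regs pointer and later used to restore the user context in full.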
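The relationship between push order and struct layout can be sketched in user space. The struct below is an assumption that mirrors the x86_64 struct pt_regs field order as I understand it. The sk_ names, the selector constants 0x2b and 0x33, and the emulated stack are all hypothetical and exist only for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Field order assumed to mirror the x86_64 struct pt_regs (lowest address first) */
struct sk_pt_regs {
    uint64_t r15, r14, r13, r12, bp, bx;
    uint64_t r11, r10, r9, r8;
    uint64_t ax, cx, dx, si, di;
    uint64_t orig_ax;
    uint64_t ip, cs, flags, sp, ss;
};

#define SK_SLOTS (sizeof(struct sk_pt_regs) / sizeof(uint64_t))

/* Emulate pushq on a downward-growing stack */
static uint64_t *sk_push(uint64_t *sp, uint64_t value)
{
    *--sp = value;
    return sp;
}

/* Push the first six slots in entry-code order; the other slots stay zero */
static struct sk_pt_regs sk_build_frame(uint64_t user_sp, uint64_t flags,
                                        uint64_t ip, uint64_t nr)
{
    uint64_t stack[SK_SLOTS] = {0};
    uint64_t *sp = stack + SK_SLOTS;   /* empty stack: pointer one past the top slot */
    struct sk_pt_regs regs;

    sp = sk_push(sp, 0x2b);            /* ss: assumed __USER_DS value */
    sp = sk_push(sp, user_sp);         /* sp */
    sp = sk_push(sp, flags);           /* flags (from R11) */
    sp = sk_push(sp, 0x33);            /* cs: assumed __USER_CS value */
    sp = sk_push(sp, ip);              /* ip (from RCX) */
    (void)sk_push(sp, nr);             /* orig_ax (from RAX) */

    /* Reinterpret the emulated stack memory as the register frame */
    memcpy(&regs, stack, sizeof(regs));
    return regs;
}
```

Because the first value pushed lands at the highest address, ss ends up as the last field of the struct and orig_ax sits just below ip, which is the layout the C side expects.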
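4. System Call Dispatch
Once the frame is built, do_syscall_64(struct pt_regs *regs, int nr) takes over:

    bool do_syscall_64(struct pt_regs *regs, int nr)
    {
        add_random_kstack_offset();
        nr = syscall_enter_from_user_mode(regs, nr);
        instrumentation_begin();
        if (!do_syscall_x64(regs, nr) && !do_syscall_x32(regs, nr) && nr != -1)
            regs->ax = __x64_sys_ni_syscall(regs);  /* invalid number: -ENOSYS */
        instrumentation_end();
        syscall_exit_to_user_mode(regs);
        /* ... return-path checks follow, see section 5 ... */
    }

- syscall_enter_from_user_mode finishes the transition from user mode, runs entry-time work such as tracing and seccomp, and returns the (possibly rewritten) system call number.
- do_syscall_x64 bounds-checks the number, then looks up and runs the matching system call handler.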
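The dispatch step can be illustrated with a small user-space sketch. It uses a plain function-pointer table for brevity, which is an assumption for illustration and not how x64_sys_call is implemented. The register frame, the two handlers, and all sk_ names are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature register frame; the real struct pt_regs has many more fields */
struct sk_regs {
    long ax, di, si;
};

typedef long (*sk_handler)(const struct sk_regs *);

/* Two toy handlers standing in for real system calls */
static long sk_sys_add(const struct sk_regs *r) { return r->di + r->si; }
static long sk_sys_neg(const struct sk_regs *r) { return -r->di; }

static const sk_handler sk_table[] = { sk_sys_add, sk_sys_neg };
#define SK_NR (sizeof(sk_table) / sizeof(sk_table[0]))

/* Mirrors the shape of do_syscall_x64: true if nr was in range and handled */
static bool sk_dispatch(struct sk_regs *regs, int nr)
{
    /* A negative nr converts to a huge unsigned value and fails the range check */
    unsigned int unr = (unsigned int)nr;

    if (unr < SK_NR) {
        regs->ax = sk_table[unr](regs);
        return true;
    }
    return false;
}

/* Build a frame, dispatch, and fall back to -ENOSYS for unknown numbers */
static long sk_run(int nr, long a, long b)
{
    struct sk_regs regs = { .ax = 0, .di = a, .si = b };

    if (!sk_dispatch(&regs, nr))
        regs.ax = -ENOSYS;
    return regs.ax;
}
```

For example, sk_run(0, 2, 3) goes through the first handler, while an out-of-range or negative number ends up with -ENOSYS in the ax slot, matching the fallback in the excerpt above.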
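5. Choosing the Return Path
On the way out, the kernel decides whether to return with sysretq or iretq. The decision is made at the end of do_syscall_64, which returns true when sysretq is safe to use:

    /* tail of do_syscall_64(), after syscall_exit_to_user_mode(regs) */
    if (cpu_feature_enabled(X86_FEATURE_XENPV))
        return false;
    if (unlikely(regs->cx != regs->ip || regs->r11 != regs->flags))
        return false;
    if (unlikely(regs->cs != __USER_CS || regs->ss != __USER_DS))
        return false;
    if (unlikely(regs->ip >= TASK_SIZE_MAX))
        return false;
    if (unlikely(regs->flags & (X86_EFLAGS_RF | X86_EFLAGS_TF)))
        return false;
    return true;

- X86_FEATURE_XENPV: Xen PV guests always return with iretq.
- sysretq loads RIP from RCX and RFLAGS from R11, so the saved cx and r11 must equal ip and flags, and CS/SS must be the standard user segments that sysretq would load.
- A return address at or above TASK_SIZE_MAX (which covers non-canonical addresses) and the RF/TF debug flags are rejected, because sysretq either handles them unsafely or cannot restore them correctly.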
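The checks above can be exercised in isolation with a user-space sketch. The constants are assumptions: the selector values follow the usual x86_64 GDT layout, and SK_TASK_SIZE_MAX assumes 4-level paging with 4 KiB pages. The frame struct and sk_ helpers are hypothetical and cover only the fields the checks read. The Xen PV test is omitted.

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_USER_CS   0x33u        /* assumed __USER_CS value */
#define SK_USER_DS   0x2bu        /* assumed __USER_DS value */
#define SK_EFLAGS_TF (1u << 8)    /* trap flag */
#define SK_EFLAGS_RF (1u << 16)   /* resume flag */
/* Assumed 4-level paging: (1 << 47) minus one 4 KiB page */
#define SK_TASK_SIZE_MAX ((1ULL << 47) - 4096)

/* Only the saved-register fields that the checks look at */
struct sk_frame {
    uint64_t ip, cs, flags, ss, cx, r11;
};

/* A frame as the SYSCALL entry path would leave it: cx == ip, r11 == flags */
static struct sk_frame sk_make(uint64_t ip, uint64_t flags)
{
    struct sk_frame f = {
        .ip = ip, .cs = SK_USER_CS, .flags = flags,
        .ss = SK_USER_DS, .cx = ip, .r11 = flags,
    };
    return f;
}

/* Simulate something (e.g. a debugger) rewriting the saved ip but not cx */
static struct sk_frame sk_with_ip(struct sk_frame f, uint64_t ip)
{
    f.ip = ip;
    return f;
}

/* Same sequence of checks as the excerpt, minus the Xen PV case */
static bool sk_can_sysret(struct sk_frame f)
{
    if (f.cx != f.ip || f.r11 != f.flags)
        return false;
    if (f.cs != SK_USER_CS || f.ss != SK_USER_DS)
        return false;
    if (f.ip >= SK_TASK_SIZE_MAX)
        return false;
    if (f.flags & (SK_EFLAGS_RF | SK_EFLAGS_TF))
        return false;
    return true;
}
```

An untouched frame passes every check, while a frame whose saved ip was changed after entry, whose ip lies above the user address range, or whose flags include TF falls back to the iretq path.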
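6. Fast Return Path in Detail
When all of the checks pass, the kernel takes the fast return path:

    syscall_return_via_sysret:
        IBRS_EXIT
        POP_REGS pop_rdi=0                              /* restore every register except RDI and RSP */
        movq %rsp, %rdi                                 /* keep the current stack pointer in RDI */
        movq PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp    /* switch to the trampoline stack */
        pushq RSP-RDI(%rdi)                             /* push the saved user RSP */
        pushq (%rdi)                                    /* push the saved user RDI */
        STACKLEAK_ERASE_NOCLOBBER
        SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
        popq %rdi
        popq %rsp
        swapgs
        CLEAR_CPU_BUFFERS
        sysretq

- sysretq reloads RIP from RCX and RFLAGS from R11, loads the user CS and SS derived from MSR_STAR, and drops back to ring 3.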
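7. Security and Performance Trade-offs
The Linux system call path integrates several mitigations for security vulnerabilities in modern CPUs:
| Macro/label | Purpose |
|---|---|
| IBRS_ENTER | Keeps user-mode indirect branch training from influencing the kernel |
| UNTRAIN_RET | Clears the return predictor to defend against Retbleed |
| CLEAR_BRANCH_HISTORY | Flushes the branch history buffer to defend against Branch History Injection (BHI) |
| CLEAR_CPU_BUFFERS | Clears CPU buffers before returning to user mode |
These measures trade performance for security. Each one adds overhead to the system call path, but together they effectively mitigate these weaknesses in modern processors.
Conclusion
As this analysis shows, the path from user mode into the kernel involves many layers and details. Understanding these internals gives a better grasp of the operating system's core abstractions, and helps in writing more efficient programs and improving system security.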
Source details follow:

    void syscall_init(void)
    {
        wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
        wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
        if (ia32_enabled()) {
            wrmsrl_cstar((unsigned long)entry_SYSCALL_compat);
            /*
             * This only works on Intel CPUs.
             * On AMD CPUs these MSRs are 32-bit, CPU truncates MSR_IA32_SYSENTER_EIP.
             * This does not cause SYSENTER to jump to the wrong location, because
             * AMD doesn't allow SYSENTER in long mode (either 32- or 64-bit).
             */
            wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
        }
    }

### 64-Bit SYSCALL Instruction Entry
The **SYSCALL** instruction is the primary mechanism for entering the kernel from a user-space application in 64-bit environments. This entry point handles up to six arguments passed through registers, making it efficient and straightforward.
#### Register Usage on Entry
When invoking a system call using the SYSCALL instruction, the following register states are observed:
- **RAX**: Contains the system call number.
- **RCX**: Holds the return address.
- **R11**: Stores the saved RFLAGS value.
- **RDI**, **RSI**, **RDX**, **R10**, **R8**, and **R9** contain the function arguments.
The SYSCALL instruction saves RIP (the return address) in RCX, clears RFLAGS.RF, saves RFLAGS in R11, and then loads new SS, CS, and RIP values from previously programmed MSRs. It does not save anything on the stack and does not change RSP, so the kernel entry code has to switch to a kernel stack itself.
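The register convention can be seen from user space with a small inline-assembly wrapper. This is a minimal sketch that assumes an x86_64 Linux target and a GCC- or Clang-compatible compiler. raw_syscall3 is a hypothetical helper name, and real programs would normally go through the C library instead.

```c
#include <sys/syscall.h>   /* SYS_getpid */
#include <unistd.h>        /* getpid, for comparison */

/*
 * Issue a system call with up to three arguments.
 * RAX carries the number in and the result out; RDI, RSI and RDX carry
 * the arguments. RCX and R11 are listed as clobbered because SYSCALL
 * overwrites them with the return address and the saved RFLAGS.
 */
static long raw_syscall3(long nr, long a1, long a2, long a3)
{
    long ret;

    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(nr), "D"(a1), "S"(a2), "d"(a3)
                      : "rcx", "r11", "memory");
    return ret;
}
```

Calling raw_syscall3(SYS_getpid, 0, 0, 0) should return the same value as getpid(). A fourth argument would go in R10 rather than RCX, precisely because SYSCALL has already taken RCX for the return address.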
#### Entry Point Code
The entry point code for 64-bit SYSCALL includes several steps to ensure proper context switching:
```c
SYM_CODE_START(entry_SYSCALL_64)
    UNWIND_HINT_ENTRY
    ENDBR
    swapgs
    movq %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)    /* save user RSP in the TSS_sp2 scratch space */
    SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
    movq PER_CPU_VAR(pcpu_hot + X86_top_of_stack), %rsp
SYM_INNER_LABEL(entry_SYSCALL_64_safe_stack, SYM_L_GLOBAL)
    ANNOTATE_NOENDBR
    /* Construct struct pt_regs on stack */
    pushq $__USER_DS                                /* pt_regs->ss */
    pushq PER_CPU_VAR(cpu_tss_rw + TSS_sp2)         /* pt_regs->sp */
    pushq %r11                                      /* pt_regs->flags */
    pushq $__USER_CS                                /* pt_regs->cs */
    pushq %rcx                                      /* pt_regs->ip */
SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
    pushq %rax                                      /* pt_regs->orig_ax */
    PUSH_AND_CLEAR_REGS rax=$-ENOSYS
    movq %rsp, %rdi                                 /* pass the pt_regs pointer to do_syscall_64 */
    movslq %eax, %rsi                               /* sign-extend the lower 32 bits of the syscall number */
    IBRS_ENTER
    UNTRAIN_RET
    CLEAR_BRANCH_HISTORY
    call do_syscall_64                              /* returns with IRQs disabled */
    ALTERNATIVE "testb %al, %al; jz swapgs_restore_regs_and_return_to_usermode", \
        "jmp swapgs_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
```
#### do_syscall_64 Function Details
The do_syscall_64 function handles the actual system call processing:

    __visible noinstr bool do_syscall_64(struct pt_regs *regs, int nr)
    {
        add_random_kstack_offset();
        nr = syscall_enter_from_user_mode(regs, nr);
        instrumentation_begin();
        if (!do_syscall_x64(regs, nr) && !do_syscall_x32(regs, nr) && nr != -1) {
            regs->ax = __x64_sys_ni_syscall(regs);
        }
        instrumentation_end();
        syscall_exit_to_user_mode(regs);
        /* Check conditions for SYSRET or IRET */
        if (cpu_feature_enabled(X86_FEATURE_XENPV))
            return false;
        if (unlikely(regs->cx != regs->ip || regs->r11 != regs->flags))
            return false;
        if (unlikely(regs->cs != __USER_CS || regs->ss != __USER_DS))
            return false;
        if (unlikely(regs->ip >= TASK_SIZE_MAX))
            return false;
        if (unlikely(regs->flags & (X86_EFLAGS_RF | X86_EFLAGS_TF)))
            return false;
        /* Return true to use SYSRET */
        return true;
    }

#### do_syscall_x64 Function
The do_syscall_x64 function checks the system call number and calls the appropriate handler:

    static __always_inline bool do_syscall_x64(struct pt_regs *regs, int nr)
    {
        /* A negative nr becomes a very large unsigned value and fails the range check */
        unsigned int unr = nr;
        if (likely(unr < NR_syscalls)) {
            /* Clamp the index under speculation before it is used */
            unr = array_index_nospec(unr, NR_syscalls);
            regs->ax = x64_sys_call(regs, unr);
            return true;
        }
        return false;
    }

The SYSCALL mechanism is an essential part of the 64-bit operating system's interface with user-space applications. By providing a clean and efficient entry point, it ensures that system calls are handled securely and efficiently by the kernel.