Syscall Proxy
Enarx needs to support Keeps that are built on encrypted virtual machine technologies such as AMD SEV and IBM Power PEF. This means that we need to boot an operating system inside the guest virtual machine. However, existing operating systems do not meet the Enarx design principles (especially: minimal trusted computing base [TCB], external network stack and memory safety). Therefore, this page outlines a plan for building a minimal operating system that serves only the requirements needed to run Enarx.
High-Level Architecture
Existing Systems
A traditional virtualization stack (such as Qemu + Linux) is typically composed of four components:
- The Virtual Machine Manager (e.g. Qemu)
- The VM BIOS / Firmware (e.g. OVMF)
- The Guest Bootloader
- The Guest Kernel
In a traditional setup, the first two of these components are provided by the Host and the latter two are provided by the Tenant. The VMM sets up the virtual machine environment using KVM and handles the events generated by KVM. The VM BIOS has the main job of loading the guest bootloader from the guest disk image; though it often performs some other basic hardware initialization and provides a boot-time environment such as UEFI. Once the bootloader is loaded, its job is to find the guest kernel from the guest disk image. Finally, the kernel boots the rest of the system.
This setup involves multiple interfaces that have varying degrees of stabilization:
- The VMM => BIOS Interface
- The BIOS => Bootloader Interface
- The Bootloader => Kernel Interface
Because Enarx aims to reduce the trusted computing base and its associated attack surfaces, the complexity these interfaces add is not desirable. Further, since the Host is not trusted, introducing the Host into the trust chain via a Host-provided BIOS is not workable. Finally, the sheer number of interfaces between the VMM and the guest kernel makes security issues difficult to debug.
The Plan
In order to remove these problems, Enarx plans to produce three components when running in a VM-based TEE:
- The Enarx VMM
- The Enarx μKernel
- The Enarx Userspace WASM / WASI Runtime
These three components will be tightly coupled and shipped as an integrated system. The interfaces between the components will be considered an internal implementation detail that can be changed at any time. Enarx tenants will validate the cryptographic measurement of the three components (VMM Guest Memory Setup, μKernel and Userspace Runtime) as a single unit to reduce combinatorial complexity.
Syscall Proxying
In order to keep the TCB small (in particular, to keep a full network stack out of it), we intend to proxy syscalls to the host. This allows us to use as many of the host resources as possible while maintaining a small Keep size. It also allows for performance optimizations as Enarx gets more mature. The above chart shows a full trace of a single syscall across the various components. This works as follows:
1. An Enarx application, compiled to WebAssembly, makes a WASI call, for example read(). This causes a transition from the JIT-compiled code into our guest userspace Rust code. This does not entail a full context switch and should be fast.
2. The hand-crafted Rust code translates the WASI call into a Linux read() syscall. From here we leave Ring 3 (on x86; other architectures have similar structures) and jump into the μKernel, performing a context switch. At this point, some syscalls will be handled internally by the μKernel; for example, memory allocation where the virtual machine already has sufficient pages allocated to handle the request immediately.
3. All syscalls which cannot be handled internally by the μKernel must be passed to the host, so the guest μKernel passes the syscall request to the host (Linux) kernel. As an optimization, some syscalls may be handled by the host (Linux) kernel directly. For example, read() of a socket can be handled immediately by the host kernel, avoiding further context switches. This requires the (future) development of a Linux kernel module to handle these requests directly in the host kernel. Since this is an optimization step, we can wait until the interfaces have settled before pursuing it.
4. All syscalls which cannot be handled internally by the host kernel must cause a vmexit into the host VMM. For example, a request for additional pages to be mapped into the guest must be passed to the VMM, since that is the component which manages the allocated pages. As in the previous layers, any syscalls which can be handled directly in the VMM (for example, allocation from a pre-allocated memory pool) should be handled immediately to avoid further context switches.
5. In some cases, the VMM will have to re-enter the host kernel in order to fulfil the request. This is the slowest path and should be avoided wherever possible.
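To make the dispatch decisions above concrete, here is a minimal sketch of how the guest μKernel's syscall entry path might choose between servicing a request locally and forwarding it to the host. All of the names (SyscallRequest, try_handle_locally, proxy_to_host, and the pool helpers) are hypothetical and exist only to illustrate the flow; the real interfaces between the layers are an internal implementation detail.

```rust
// Hypothetical sketch only: these types and helpers are not Enarx APIs;
// they illustrate the dispatch flow described in the steps above.

const SYS_MMAP: u64 = 9; // Linux x86_64 syscall number for mmap

/// A syscall captured on entry from guest userspace (Ring 3).
pub struct SyscallRequest {
    pub nr: u64,        // Linux syscall number
    pub args: [u64; 6], // raw syscall arguments
}

pub enum Handled {
    Locally(i64), // the μKernel satisfied the request itself
    NeedsHost,    // must be proxied to the host kernel / VMM
}

/// Step 2: syscalls the μKernel can answer without leaving the guest,
/// e.g. mmap when the VM already holds enough pre-allocated pages.
fn try_handle_locally(req: &SyscallRequest) -> Handled {
    match req.nr {
        SYS_MMAP if pages_available(req.args[1] as usize) => {
            Handled::Locally(map_from_pool(req.args[1] as usize))
        }
        _ => Handled::NeedsHost,
    }
}

/// Steps 3-5: anything the μKernel cannot satisfy is handed to the host,
/// which may answer it directly or take a vmexit into the VMM.
fn proxy_to_host(_req: &SyscallRequest) -> i64 {
    unimplemented!("write the request into a shared page and signal the host")
}

pub fn handle_syscall(req: &SyscallRequest) -> i64 {
    match try_handle_locally(req) {
        Handled::Locally(ret) => ret,
        Handled::NeedsHost => proxy_to_host(req),
    }
}

// Stubs standing in for the μKernel's memory-pool bookkeeping.
fn pages_available(_len: usize) -> bool {
    false
}
fn map_from_pool(_len: usize) -> i64 {
    0
}
```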
Syscall Categories and their Layers
Memory Allocation: Memory allocation syscalls should be served by the μKernel from pre-allocated pools of huge pages. Allocation of huge pages should be passed through to the host layers.
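As a rough sketch of this idea, assuming 2 MiB huge pages and a simple free list, the pool might look something like the following; the HugePagePool type and the request_huge_pages_from_host helper are illustrative assumptions, not Enarx APIs:

```rust
// Hypothetical sketch: a guest-side pool of pre-allocated 2 MiB huge pages.
// Small allocations are served locally; only refills of whole huge pages
// are passed through to the host layers.

const HUGE_PAGE_SIZE: usize = 2 * 1024 * 1024;

pub struct HugePagePool {
    free: Vec<usize>, // guest-physical addresses of unused huge pages
}

impl HugePagePool {
    /// Serve an allocation locally if a huge page is available.
    pub fn alloc(&mut self) -> Option<usize> {
        self.free.pop()
    }

    /// Pool exhausted: ask the host (via the syscall proxy path) for more
    /// huge pages, then retry.
    pub fn alloc_or_refill(&mut self) -> usize {
        if let Some(addr) = self.alloc() {
            return addr;
        }
        self.free.extend(request_huge_pages_from_host(8));
        self.free.pop().expect("host refused to provide more memory")
    }
}

fn request_huge_pages_from_host(n: usize) -> Vec<usize> {
    // Placeholder: would be forwarded through the proxy described above.
    (0..n).map(|i| i * HUGE_PAGE_SIZE).collect()
}
```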
Networking: All networking syscalls should be passed to the host layers. This ensures that the network stack lives outside the TCB.
Filesystem: The guest μKernel should implement a filesystem on top of block encryption and authentication. Block IO should be passed to the host layers. It may even be possible to implement this functionality directly in userspace to reduce the number of context switches. Block authentication, block encryption and the filesystem should be implemented as reusable crates for use in other (non-VM-based keep) contexts.
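One way such reusable crates could be layered is sketched below; the trait and type names are illustrative assumptions rather than an existing Enarx design, and the cryptography is reduced to stubs:

```rust
// Hypothetical layering: each layer is a separate, reusable component
// wrapping the one below it. Names are illustrative only.

/// Raw block IO; in a VM-based Keep this is what gets proxied to the host.
pub trait BlockIo {
    fn read_block(&mut self, lba: u64, buf: &mut [u8]) -> Result<(), ()>;
    fn write_block(&mut self, lba: u64, buf: &[u8]) -> Result<(), ()>;
}

/// Transparently encrypts/decrypts blocks before they reach the host.
pub struct EncryptedBlocks<B: BlockIo> {
    inner: B,
    key: [u8; 32],
}

/// Verifies block integrity (e.g. via a hash tree) on top of encryption.
pub struct AuthenticatedBlocks<B: BlockIo> {
    inner: B,
    // root hash, per-block MACs, etc. would live here
}

impl<B: BlockIo> BlockIo for EncryptedBlocks<B> {
    fn read_block(&mut self, lba: u64, buf: &mut [u8]) -> Result<(), ()> {
        self.inner.read_block(lba, buf)?;
        decrypt_in_place(&self.key, lba, buf);
        Ok(())
    }
    fn write_block(&mut self, lba: u64, buf: &[u8]) -> Result<(), ()> {
        let mut tmp = buf.to_vec();
        encrypt_in_place(&self.key, lba, &mut tmp);
        self.inner.write_block(lba, &tmp)
    }
}

// The filesystem itself only ever sees the composed stack, e.g.:
// Filesystem<AuthenticatedBlocks<EncryptedBlocks<HostBlockDevice>>>

fn decrypt_in_place(_key: &[u8; 32], _lba: u64, _buf: &mut [u8]) {}
fn encrypt_in_place(_key: &[u8; 32], _lba: u64, _buf: &mut [u8]) {}
```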
Threading: Scheduling techniques such as NUMA awareness are extremely hard to implement well, so the μKernel should pass thread scheduling to the host layers where possible. One particular strategy to accomplish this is vCPU hotplug: when a new thread is created in guest userspace, a new vCPU is created by the VMM, so there is always a 1:1 mapping between userspace threads and vCPUs. The guest μKernel can pool pre-allocated vCPUs to increase performance.
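A minimal sketch of this 1:1 strategy, assuming a simple idle pool and a hypothetical hotplug request to the VMM (none of these names are real Enarx interfaces):

```rust
// Hypothetical sketch of the thread-to-vCPU strategy: creating a guest
// userspace thread consumes a vCPU, either from a pre-allocated pool or by
// asking the VMM to hot-plug a new one.

pub struct VcpuId(pub u32);

pub struct VcpuPool {
    idle: Vec<VcpuId>, // vCPUs created by the VMM but not yet running a thread
}

impl VcpuPool {
    /// Called by the μKernel when userspace creates a new thread (clone()).
    pub fn vcpu_for_new_thread(&mut self) -> VcpuId {
        // Fast path: reuse a pre-allocated idle vCPU.
        if let Some(vcpu) = self.idle.pop() {
            return vcpu;
        }
        // Slow path: ask the VMM to hot-plug another vCPU (a vmexit).
        request_vcpu_hotplug_from_vmm()
    }

    /// Called when a thread exits; its vCPU returns to the pool.
    pub fn release(&mut self, vcpu: VcpuId) {
        self.idle.push(vcpu);
    }
}

fn request_vcpu_hotplug_from_vmm() -> VcpuId {
    // Placeholder for the TBD VMM / μKernel interface.
    VcpuId(0)
}
```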
The VMM / μKernel Interface
The interface for interactions between the VMM and the guest μKernel is TBD and still needs to be devised.
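Whatever shape that interface ends up taking, the VMM side will ultimately be a KVM exit-handling loop. The sketch below uses the kvm-ioctls crate and assumes, purely for illustration, that the guest signals a proxied syscall by writing to an agreed IO port; neither the mechanism nor the port number is a decided design:

```rust
// Illustrative only: the port-IO "doorbell" and its number are assumptions,
// not a decided Enarx design. Uses the kvm-ioctls crate for the KVM side.

use kvm_ioctls::{VcpuExit, VcpuFd};

const SYSCALL_PROXY_PORT: u16 = 0x03f0; // hypothetical doorbell port

fn run_vcpu(vcpu: &mut VcpuFd) {
    loop {
        match vcpu.run().expect("KVM_RUN failed") {
            // The guest μKernel signals "a syscall request is waiting in the
            // shared page" by writing to an agreed IO port, causing a vmexit.
            VcpuExit::IoOut(port, _data) if port == SYSCALL_PROXY_PORT => {
                handle_proxied_syscall();
            }
            // Guest halted: this vCPU (and its userspace thread) is done.
            VcpuExit::Hlt => break,
            other => panic!("unhandled vmexit in this sketch: {:?}", other),
        }
    }
}

fn handle_proxied_syscall() {
    // Read the request from guest-shared memory, service it in the VMM or
    // forward it to the host kernel, then write the result back.
}
```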
The Kernel / μKernel Interface
This interface represents a series of optimizations over the previous interface. For syscalls which can be handled by the host kernel without VMM involvement, we should create a new device to handle them; perhaps we might call it virtio-syscall. Because this is an optimization, it can be delayed until the VMM / μKernel interface is more stable.
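Whatever form such a device takes, the requests it carries would presumably look something like the sketch below; the struct layout is purely illustrative and nothing about it is decided:

```rust
// Purely illustrative: one possible request/response shape for a
// hypothetical virtio-syscall device. Nothing about this layout is decided.

#[repr(C)]
pub struct VirtioSyscallRequest {
    pub nr: u64,        // Linux syscall number
    pub args: [u64; 6], // syscall arguments (pointers resolved to guest memory)
}

#[repr(C)]
pub struct VirtioSyscallResponse {
    pub ret: i64, // syscall return value or negative errno
}
```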
The μKernel / Userspace Interface
Although this interface will not be exposed directly to applications, we plan to have the μKernel implement a subset of the Linux Syscall ABI. The userspace runtime will be a static Linux ELF binary. This makes debugging of the two components significantly easier. It also allows the VMM / μKernel pair to be reused in contexts outside Enarx. The subset of the Linux Syscall ABI will be configurable at compile time, and only the syscalls necessary to execute Enarx will be included in the Enarx build of the μKernel binary. This ensures that the Keep only exposes the smallest possible attack surface to the Host, while giving opportunities for different syscall sets, or syscall masks, ("application profiles") to be created and managed for other use cases.
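A minimal sketch of how such a compile-time subset might be enforced, assuming the allowlist is selected at build time (for example via cargo features); the numbers and mechanism here are illustrative, not the actual Enarx build configuration:

```rust
// Minimal sketch: the allowlist would be selected at build time (e.g. via
// cargo features), so unused syscall handlers never end up in the binary.
// Numbers and names here are illustrative, not the real Enarx profile.

/// Syscalls included in this build of the μKernel ("application profile").
const ALLOWED_SYSCALLS: &[u64] = &[
    0,  // read
    1,  // write
    9,  // mmap
    60, // exit
];

const ENOSYS: i64 = 38;

/// Anything outside the configured subset is rejected before it can reach
/// the host, keeping the exposed attack surface as small as possible.
pub fn dispatch(nr: u64, args: [u64; 6]) -> i64 {
    if !ALLOWED_SYSCALLS.contains(&nr) {
        return -ENOSYS;
    }
    handle_linux_syscall(nr, args)
}

fn handle_linux_syscall(_nr: u64, _args: [u64; 6]) -> i64 {
    0 // stand-in for the real implementation of the supported subset
}
```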
Frequently Asked Questions
Why not implement the μKernel as a unikernel?
While this isn't completely off the table for future iterations (perhaps for further performance improvements), it has significant drawbacks. For example, by using the Linux Syscall ABI we can use standard Linux static ELF binaries, which makes it much easier to reproduce issues outside of the minimal VM context. There are also registers on some architectures that behave differently depending on the privilege level you are running at; if, for example, the JIT were to spill onto these registers, we could end up with problems that are very hard to debug.
How does attestation work?
This is TBD. Ideally we would bring up the guest and the tenant would talk directly to the userspace runtime, but specific attestation workflows may make this difficult (for example: pre-SNP SEV). This part of the architecture remains undefined for now.