3i and Grates

I. ThreeI

3i (Three-I) provides a general means for cages (and grates) to make system calls and intercept system calls via a programmable system call table. The goal is to enable complex functionality without modifying the microvisor or increasing the microvisor's trusted computing base.

1.1 Motivation

In a traditional Linux environment, extending or intercepting system calls (e.g., adding a new filesystem, tracing, or filtering calls) requires kernel modifications or mechanisms such as ptrace, which incur heavy overhead and depend on kernel-level mediation.

3i eliminates these constraints by introducing a user-space routing layer between cages, grates, and the underlying microvisor. Calls can be dispatched directly between grates or delegated to the microvisor when necessary, achieving kernel-level extensibility while keeping all new logic external to the kernel’s TCB.

1.2 Design Goals

3i is designed as a runtime-agnostic interposition layer that can operate on top of a wide range of isolation backends used to execute arbitrary code (e.g., software sandboxes or hardware-assisted memory protection). Its core goal is to provide a uniform mechanism for inter-cage call routing -- including system call interception, syscall customization, and cross-cage RPC -- without being tied to any specific runtime.

To achieve this, 3i exposes an abstraction that allows runtimes to register arbitrary entry and exit hooks, implemented as plain C-ABI function pointers. These hooks allow each backend to integrate its own cage-management requirements (e.g., switching execution contexts, updating thread-local state, or preparing runtime metadata) into the call path. Because 3i itself never assumes the presence of a particular runtime structure or object model, backend-specific behavior remains fully encapsulated in the corresponding adapter layer, leaving 3i’s core logic small, generic, and portable.
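As a rough sketch (the hook names and signatures here are hypothetical, chosen only to illustrate the shape of the interface), a backend adapter might expose hooks like these:

use std::ffi::c_void;

// Hypothetical C-ABI hook shapes: an opaque, backend-owned context plus the
// cage id being entered or exited.
pub type EntryHook = extern "C" fn(ctx: *mut c_void, cage_id: u64);
pub type ExitHook = extern "C" fn(ctx: *mut c_void, cage_id: u64);

extern "C" fn backend_entry(ctx: *mut c_void, cage_id: u64) {
    // e.g., switch thread-local state to the target cage's execution context
    let _ = (ctx, cage_id);
}

extern "C" fn backend_exit(ctx: *mut c_void, cage_id: u64) {
    // e.g., restore the caller's execution context before control returns
    let _ = (ctx, cage_id);
}

// The adapter would hand these pointers to 3i at startup through whatever
// registration call the backend integration provides.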

1.3 High-level Concepts

[todo] - add the figure of general lind

In traditional operating systems, a process makes a system call which traps into the kernel. Every process which makes a system call ends up trapping into the same kernel routine. In essence, there is one system call table which is shared by every process.

In contrast, in Lind, 3i provides a per-cage, per-system-call table. Each cage may define a function to serve as a system-call handler and register it for a specific system call (via register_handler). As a result, every system call of every cage can have its own distinct handler. When a cage issues a system call (make_syscall), the invocation is dispatched to the handler registered for that particular system call in that particular cage. Because handler tables are cage-local, multiple cages may register different handlers for the same system call without interfering with one another.
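The following sketch illustrates the idea of a per-cage, per-syscall table; the types and signatures are simplified stand-ins, not the actual 3i data structures:

use std::collections::HashMap;

// Simplified handler type: the function a grate (or RawPOSIX) exposes for one syscall.
type SyscallHandler = fn(args: &[u64]) -> i64;

// One table per cage; each table maps a syscall number to its handler.
struct CageTable {
    handlers: HashMap<u32, SyscallHandler>,
}

struct ThreeI {
    per_cage: HashMap<u64, CageTable>, // keyed by cage_id
}

impl ThreeI {
    // register_handler: bind `syscall_num` in `cage_id`'s table to `handler`.
    fn register_handler(&mut self, cage_id: u64, syscall_num: u32, handler: SyscallHandler) {
        self.per_cage
            .entry(cage_id)
            .or_insert_with(|| CageTable { handlers: HashMap::new() })
            .handlers
            .insert(syscall_num, handler);
    }

    // make_syscall: dispatch to whatever handler this cage registered for this syscall.
    fn make_syscall(&self, cage_id: u64, syscall_num: u32, args: &[u64]) -> i64 {
        match self.per_cage.get(&cage_id).and_then(|t| t.handlers.get(&syscall_num)) {
            Some(handler) => handler(args),
            None => -1, // no handler registered: fail in an ENOSYS-like way
        }
    }
}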

Specifically, 3i supports the following scenarios:

  1. Per-call routing within a single cage:
    Different system calls issued by the same cage can be handled by different grates (or RawPOSIX) by registering distinct handlers for each system call.

  2. Shared handlers across multiple cages:
    Multiple cages may register the same handler function (provided by a grate) by invoking register_handler with different cageids, enabling controlled sharing of system call implementations across cages.

As a matter of convenience, we call a cage that processes system calls a "grate". This is meant to convey the mental model of a cage calling down towards the microvisor / kernel and having a grate filter, transform, or handle those system calls. However, a grate is simply a cage; there is no special handling code or permission for it in 3i or the rest of the system. A grate may tend to make different system calls than a normal application, but it is still a cage, much like strace is still a normal Linux process that happens to use system calls like ptrace which are otherwise rare.

One important feature needed by a grate is the ability to read and write the memory of a cage whose system call it intercepts. For example, to handle a write system call, the grate must be able to read data out of the calling cage's buffer, which involves reading the calling cage's memory. 3i provides a function (copy_data_between_cages) to enable this feature safely.

Consider a grate that wishes to count how many times a specific cage invokes the write system call. The grate must increment its counter and then re-issue the write call on behalf of the originating cage. However, copying the user buffer into the grate’s own address space would be wasteful when the grate merely wants to observe and forward the call. To support this use case, make_syscall allows each argument of the system call to be annotated with a source cageid. The grate can therefore perform the forwarded write using its own system-call table while specifying that the buffer pointer resides in the calling cage’s address space. This per-argument cage identifier enables grates to distinguish which cage originated a system call and to safely access or forward data without unnecessary copying. This mechanism is required by many grates that interpose on system calls issued by other cages.
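As a sketch of that flow (the make_syscall signature, the SELF_CAGE constant, and the handler shape below are hypothetical stand-ins, not the actual 3i interface):

use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical syscall number and "my own cage" marker used only for this sketch.
const WRITE_SYSCALL: u32 = 1;
const SELF_CAGE: u64 = 0;

static WRITE_COUNT: AtomicU64 = AtomicU64::new(0);

// Handler the grate registers (via register_handler) for the monitored cage's
// write syscall. `buf` still points into the caller's address space.
fn counting_write_handler(caller_cage: u64, fd: u64, buf: u64, count: u64) -> i64 {
    WRITE_COUNT.fetch_add(1, Ordering::Relaxed);

    // Forward the write through the grate's own syscall table, annotating the
    // buffer argument with the originating cage's id so no copy is needed.
    // The by-value arguments (fd, count) carry the grate's own cage id since
    // no address translation applies to them.
    make_syscall(
        caller_cage, // cage on whose behalf the call is performed
        WRITE_SYSCALL,
        &[(fd, SELF_CAGE), (buf, caller_cage), (count, SELF_CAGE)],
    )
}

// Hypothetical stand-in for the real make_syscall entry point: each argument is
// paired with the cage id whose address space it refers to.
fn make_syscall(_target_cage: u64, _syscall_num: u32, _args: &[(u64, u64)]) -> i64 {
    0
}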

One final important feature of make_syscall is the ability for a grate to perform a system call as though another cage had issued it. Consider the fork system call: if a cage invokes fork and the grate simply executes a native fork, the grate -- not the originating cage -- would be duplicated. To prevent this, each make_syscall invocation explicitly specifies the target cage whose state and identity the system call should operate on.

Finally, 3i functions such as register_handler and copy_data_between_cages are themselves treated as system calls within 3i. They are 3i-specific APIs that participate in the same interception and dispatch framework, enabling grates to interpose on operations performed by other grates. This is the key mechanism that is used to provide security in 3i -- the ability to make a grate that can correctly namespace and enforce protections between cages, including calls to 3i.

1.4 3i Function Calls

[todo]: - a short instruction/example on how to write grates by using those functions

| Caller | Callee | Function | Interposable | Remarks |
| --- | --- | --- | --- | --- |
| grate | 3i | register_handler | Yes | Register a handler for a syscall |
| grate | 3i | copy_handler_table_to_cage | Yes | Overwrites the entire syscall handler table of a cage |
| grate | 3i | copy_data_between_cages | Yes | Copies memory across cages |
| grate | 3i | make_syscall | No | Call the registered handler for a syscall |
| WASM / NaCl / RawPOSIX | 3i | trigger_harsh_cage_exit | No | Kill a cage: see detailed explanation below |
| 3i / grate | grate / RawPOSIX | harsh_cage_exit | Yes | Notify that a cage was killed: see detailed explanation below |

NOTE:

- Interposable indicates whether the call is made via the system call table, and thus whether a grate could alter its behavior.

- Caller denotes the execution context that invokes the function (i.e., the component whose code initiates the transition into 3i or another cage). In other words, it represents the origin of the call site. A caller can be a grate, a normal cage, RawPOSIX, the runtime, or 3i itself; the Callee column indicates which subsystem receives and executes the request.

register_handler

This function registers an interposition rule, mapping a syscall number from a source cage to a handler function in a destination grate or cage. It is used to create per-syscall routing rules that enable one cage to interpose on, or handle, syscalls on behalf of another.

copy_handler_table_to_cage

This function copies the handler table used by a cage to another cage. This is often useful for calls like fork, so that a grate can later add or remove entries.
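For example, a grate that interposes on fork might, once the child cage exists, duplicate the parent's routing into the child and then adjust individual entries. The signature below is a hypothetical stand-in for illustration:

// Hypothetical signature for illustration only; the real call goes through 3i.
fn copy_handler_table_to_cage(_src_cage: u64, _dst_cage: u64) -> i64 {
    0
}

// After handling a fork on behalf of `parent`, give the new `child` cage the
// same syscall routing; register_handler can then override specific entries.
fn after_fork(parent: u64, child: u64) {
    copy_handler_table_to_cage(parent, child);
}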

copy_data_between_cages

This function copies memory across cages. One common use of this is to read arguments which are passed by reference instead of by value. The source and destination cages may each be different from the calling cage. This may be useful for some grates.
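A sketch of how a grate might use it when inspecting a write call (the signature shown is a hypothetical stand-in; the real call is issued through the 3i interface):

// Hypothetical signature for illustration only.
fn copy_data_between_cages(
    _src_cage: u64,
    _src_addr: u64,
    _dst_cage: u64,
    _dst_addr: u64,
    _len: u64,
) -> i64 {
    0
}

// A grate handling a write from `caller_cage` can pull the caller's buffer
// into a local staging buffer before inspecting or transforming it.
fn fetch_caller_buffer(self_cage: u64, caller_cage: u64, buf_addr: u64, count: u64) -> Vec<u8> {
    let mut staging = vec![0u8; count as usize];
    copy_data_between_cages(caller_cage, buf_addr, self_cage, staging.as_mut_ptr() as u64, count);
    staging
}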

make_syscall

This function actually performs a 3i call. It is not interposable. This is the most commonly used and simplest API, despite the number of arguments. All the code here does is route the call to the corresponding handler and deal with error situations.

Note that this call is itself not interposable, since this is the base call used to route other calls and do the interposition. In theory, this could be changed, but it doesn't seem useful to do so.

This is the main entry point used by cages or grates to invoke system calls through the 3i layer. The function inspects the caller’s interposition configuration (if any) and either routes the syscall to a grate for handling or directly invokes the corresponding function in the RawPOSIX layer.

trigger_harsh_cage_exit and harsh_cage_exit

This pair provides a way for grates to clean up if a cage is abruptly killed (perhaps due to a signal). trigger_harsh_cage_exit is invoked by the caging or signaling infrastructure to indicate that a cage will exit uncleanly. After receiving the notification, 3i cleans up its own data structures (the cage's system call table) and then triggers harsh_cage_exit, walking through the respective grates until reaching 3i's own version of the call. harsh_cage_exit can be thought of as notifying the grates and microvisor of the harsh exit of a program whose memory state cannot be relied upon. This is unlike exit_syscall, which is performed by a functioning program with intact memory as part of its termination.

Why not interposable? At the time trigger_harsh_cage_exit or harsh_cage_exit is invoked, the target cage or grate is assumed to have unreliable memory and control flow. During the execution of these calls, the syscall table of the target cage/grate is either being torn down or may already be corrupted, meaning the call path itself is no longer trustworthy.

The cleanup process must restore system-level invariants, for example by unmapping vmmap regions, cleaning up fdtables, waking waiters, and canceling schedulers or timers. Allowing these calls to be interposable would permit third-party grates to inject arbitrary logic (e.g., blocking, allocation, or reentrancy), which could stall or disrupt the teardown sequence, resulting in resource leaks, deadlocks, or zombie cages/grates.

II. Prototype Implementation - Lind-Wasm

[todo] - figure

2.1 Background - Wasmtime

Store

In Wasmtime, a Store is the top-level container that owns all runtime objects. A single Store may own multiple Instances, and every Instance must belong to exactly one Store. All runtime items, such as Functions, Tables, Memories, and Globals, are allocated within the Store and are tied to its lifetime.

Module & Instance

  • A Module is only a compiled binary: it contains code and type information but no runtime state.
  • An Instance is the executable instantiation of a Module within a Store.

You cannot read memories, tables, or globals, nor call functions, on a Module; all executable interactions happen through an Instance.
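The distinction looks like this in Wasmtime's public Rust API (a minimal sketch using the anyhow crate; exact details vary across Wasmtime versions):

use anyhow::Result;
use wasmtime::{Engine, Instance, Module, Store};

fn main() -> Result<()> {
    let engine = Engine::default();
    // A Module is only compiled code and type information; it holds no runtime state.
    let module = Module::new(&engine, r#"(module (func (export "run")))"#)?;
    // A Store owns all runtime objects and bounds their lifetimes.
    let mut store = Store::new(&engine, ());
    // An Instance is the executable instantiation of the Module inside that Store.
    let instance = Instance::new(&mut store, &module, &[])?;
    // Executable interaction (calling functions, reading memory) goes through the Instance.
    let run = instance.get_typed_func::<(), ()>(&mut store, "run")?;
    run.call(&mut store, ())?;
    Ok(())
}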

VMContext

Each Instance has an internal data structure called VMContext, a raw pointer used by the JIT-generated machine code. It holds information about globals, memories, tables, and other runtime state associated with the current instance.

Call Stack

Although WebAssembly defines an abstract operand stack and structured control flow, Wasmtime lowers all function calls and stack frames to the native call stack of the executing host thread. Each Wasm function is compiled into a normal machine function that receives a VMContext pointer as an implicit first argument. Local variables, temporaries, and control-flow state are therefore represented using standard native stack slots and registers.

Wasmtime attaches a VMRuntimeLimits structure to every VMContext, which stores a stack-limit pointer. At function-entry, compiled code inserts a prologue check comparing the current native stack pointer against this limit; exceeding it triggers a Wasmtime stack-overflow trap rather than a process-level segmentation fault.
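As a rough mental model (the types below are illustrative, and the real check is emitted as native instructions in the compiled function's prologue):

// Conceptual model only, not Wasmtime's actual types or code.
struct VMRuntimeLimitsModel {
    stack_limit: usize, // lowest native stack address the guest may use
}

fn prologue_check(current_sp: usize, limits: &VMRuntimeLimitsModel) -> Result<(), &'static str> {
    // The native stack grows downward, so dropping below the limit means the
    // guest has exhausted its allotted stack and must trap.
    if current_sp < limits.stack_limit {
        Err("wasm trap: call stack exhausted")
    } else {
        Ok(())
    }
}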

Memory

Wasmtime implements each linear memory as a sandboxed region in the host virtual address space. At instantiation time, the runtime reserves a contiguous virtual range using mmap and commits only the portion required by the module’s initial size.

Each memory is represented internally by a VMMemoryDefinition structure embedded in the instance’s VMContext. The VMContext is passed as an implicit argument to all JIT-compiled functions. Every load or store instruction is lowered to native code that first reads the memory’s base pointer and current length from the VMContext, performs an explicit bounds check, and then translates the Wasm address into a native pointer (base + offset).
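Conceptually, each lowered access behaves like the sketch below; the struct and function are illustrative models rather than Wasmtime's actual types, and the real logic is emitted as native machine code by the JIT.

// Illustrative model of how a load/store address is checked and translated.
struct LinearMemoryModel {
    base: *mut u8,         // host pointer to the start of the reserved region
    current_length: usize, // bytes currently accessible to the module
}

fn translate(mem: &LinearMemoryModel, wasm_addr: u32, access_size: usize) -> Option<*mut u8> {
    let addr = wasm_addr as usize;
    // Explicit bounds check against the memory's current length.
    if addr.checked_add(access_size)? > mem.current_length {
        return None; // out of bounds: the access traps instead of touching host memory
    }
    // Translate the Wasm address into a native pointer: base + offset.
    Some(unsafe { mem.base.add(addr) })
}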

2.2 Implementation

VMContext Pool Overview

Lind-wasm implements a global runtime-state lookup and pooling mechanism, shared between lind-wasm and lind-3i, that enables explicit, controlled transfers of execution across cages, grates, and threads.

Unlike conventional WebAssembly execution models, where control flow is confined to a single Wasmtime Store, Instance, and linear call stack, Lind-wasm supports cross-instance and cross-module execution transfers. These transfers are required to implement POSIX-like process semantics (each process manages its own state) and lind-3i's inter-cage and inter-grate call model.

To support this, lind-wasm must be able to:

  • Identify the correct Wasmtime runtime state without relying on implicit “current execution context” assumptions
  • Re-enter an existing Wasm instance from outside its original call stack
  • Support concurrent execution paths that operate on shared Wasm linear memory

Execution Scenarios Requiring Runtime Lookup

  1. Process-like Operations (fork, exec, exit)

Operations such as fork, exec, and exit require Wasmtime instances to be created, cloned, or destroyed. However, the logic that performs the semantic handling of these operations may execute in a different cage or grate than the one that originally issued the system call.

After RawPOSIX completes the semantic work, control must return to Wasmtime, not necessarily to the instance that initiated the call. As a result, lind-3i cannot rely on an implicit “current” runtime state. Instead, it must explicitly retrieve the correct execution context:

fork, exec, and exit conceptually create, replace, or terminate the execution state of a process. After RawPOSIX completes the semantic handling, lind-wasm must resume execution in a Wasmtime instance associated with an arbitrary cage_id, which may differ from the calling cage's. The appropriate runtime context is therefore retrieved by directly looking up the execution context associated with the target cage_id.

  2. Thread-like Operations

Thread operations introduce additional execution contexts within the same cage. These contexts are not part of the main execution flow and cannot be recovered via a global “current” state. Instead, lind-wasm explicitly looks up the runtime context associated with the corresponding (cage_id, tid) pair, ensuring correct control-flow transfer during thread creation, scheduling, and termination.

  3. Grate Calls (Cross-Module Execution Transfers)

Grate calls represent explicit execution jumps between Wasm modules, such as:

  • Cage -> Grate
  • Grate -> RawPOSIX
  • Grate -> Grate

These jumps are not standard Wasm function calls and cannot rely on a shared call stack or Store. Supporting them requires the ability to (1) locate a runtime state belonging to a different module, and (2) re-enter Wasm execution from outside the original stack frame. To achieve this, lind-wasm relies on the following invariant: each Wasmtime Store contains exactly one Wasm Instance, and each thread executes within its own independent Store / Instance pair. The Wasmtime VMContext pointer uniquely identifies the execution state of a running instance, so given a valid VMContext, lind-wasm can recover the associated Store and Instance using Wasmtime internals. Moreover, because VMContext is a raw pointer, it also allows lind-wasm to bypass the lifetime restrictions of Store and Instance.

Data structure

Because VMContext is opaque and lifetime-managed internally by Wasmtime, this module stores it as a raw pointer wrapped in a minimal abstraction:

use std::ffi::c_void;
use std::ptr::NonNull;

// Minimal wrapper around Wasmtime's opaque VMContext pointer.
pub struct VmCtxWrapper {
    pub vmctx: NonNull<c_void>,
}

Lind-Wasm maintains two global, per-cage pools of VMContext pointers:

  1. General execution context lookup table

The global VMCTX_QUEUES structure primarily manages execution contexts for the main thread (tid = 1) of each cage and is indexed by cage_id. Each cage owns a FIFO queue that stores the execution contexts currently available to it. The total number of cages is fixed at startup (MAX_CAGEID).

Importantly, table slots are never removed from the global pool. Instead, the contents of each slot (its queue) may be inserted or removed over time. When a cage terminates, its queue slot remains present, but the slot's content is cleared and set to None.

This design ensures that each table index always directly corresponds to a cage_id, eliminating the need for dynamic index management or lookup structures. As a result, cage_id can be used as a stable, constant-time index into the pool, reducing lookup overhead and avoiding additional search or indirection costs.

use std::collections::{HashMap, VecDeque};
use std::sync::{Mutex, OnceLock};

static VMCTX_QUEUES: OnceLock<Vec<Mutex<VecDeque<VmCtxWrapper>>>> = OnceLock::new();

  2. Thread Handling and Execution Context Lookup

To support thread-related operations, lind-wasm maintains a separate, thread-specific execution context table. This table is used only for non-main threads (tid != 1) and exists to support thread-related syscalls and thread exit. Each (cage_id, tid) maps to at most one VMContext. No pooling is performed. This table is not consulted for normal execution or grate calls.

static VMCTX_THREADS: OnceLock<Vec<Mutex<HashMap<u64, VmCtxWrapper>>>> = OnceLock::new();
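A simplified sketch of the pool accessors that the execution-flow diagram below calls get_vmctx and set_vmctx, building on the declarations above; the real lind-wasm code differs in detail:

fn get_vmctx(cage_id: u64) -> Option<VmCtxWrapper> {
    // cage_id indexes the outer Vec directly, so the lookup is constant time.
    let queues = VMCTX_QUEUES.get()?;
    queues.get(cage_id as usize)?.lock().ok()?.pop_front()
}

fn set_vmctx(cage_id: u64, vmctx: VmCtxWrapper) {
    // Return the context to its cage's queue once execution comes back.
    if let Some(queue) = VMCTX_QUEUES.get().and_then(|q| q.get(cage_id as usize)) {
        queue.lock().unwrap().push_back(vmctx);
    }
}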

Execution Flow

[todo] - use cases, end-to-end steps (first) --> explain each steps after that

[ Cage A ]
     │
     │ (system call / grate call)
     ▼
┌──────────────┐
│      3i      │
│ (dispatcher) │
└──────────────┘
     │
     │ redirect to wasmtime by MAKE_SYSCALL
     ▼
┌──────────────┐
│   wasmtime   │
└──────────────┘
     │
     │ lookup target cage_id
     ▼
┌────────────────────────────┐
│   Global VMContext Pool    │
│   get_vmctx(cage_id = G)   │
└────────────────────────────┘
     │
     │ returns VmCtxWrapper
     │ (raw VMContext*)
     ▼
┌────────────────────────────┐
│     Wasmtime internals     │
│   recover Store/Instance   │
│   from VMContext pointer   │
└────────────────────────────┘
     │
     │ enter wasm execution
     ▼
[ Grate G ]
     │
     │ Return
     ▼
┌────────────────────────────┐
│     Wasmtime internals     │
│set_vmctx(cageid, VMContext)│
│ put VMContext back to pool │
└────────────────────────────┘

Callback Definition (Wasmtime side):

The C-ABI callback function knows how to re-enter the Wasm module via the unified entry function.

Handler Registration:

When the Wasm module calls register_handler(), the redirection information entry is extracted and passed to 3i.

Cross-Cage Invocation:

When a syscall from cage A is routed to grate B:

  1. The syscall reaches 3i via make_syscall.
  2. 3i looks up the handler entry registered for B and the corresponding runtime function pointer for that cage id.
  3. 3i directly invokes the function pointer, re-entering the target Wasm instance through Wasmtime's runtime context.

Dispatch Inside Grate:

The Wasm entry function (in the module) receives a pointer identifying the target handler and dispatches control to the correct per-syscall implementation.