JefeOS Design Thesis

Xylem

We didn't bolt Kubernetes onto Linux. The OS itself is the cluster — and it runs your Linux containers too.

Thesis & direction — pre-implementation. Almost none of this is built yet.

A natively-clustered operating system with the cluster control plane folded into the kernel. · Identity decision: 2026-06-17 · Based on the internal design doc docs/design/xylem.md.

1. The thesis & the inversion

Today's industry stack is OS → container runtime → orchestrator, where the orchestrator (Kubernetes, Nomad, Borg, OpenStack) is an external control plane doing scheduling, replication, health-checking, failover, and autoscaling against an OS that knows nothing about the cluster. Xylem inverts that: it folds the control plane into the kernel.

A "service" becomes a first-class kernel object carrying a desired replica count and a supervision policy. Redundancy, scaling, and failover become kernel verbs — not YAML reconciled by an outside controller.

The arborist vs. the tissue

The name is the plant tissue. Xylem is a distributed transport and support network spanning the whole organism, redundant by construction: when a vessel cavitates (a local failure), flow reroutes around the dead vessel — built-in failover as a structural property. It regrows each season (self-healing / self-scaling) and it is load-bearing (structural).

k8s is an arborist standing outside the tree with shears and a clipboard, reconciling it toward a desired state. Xylem is the tissue inside the tree: routing around damage is the organism.

Aspect | External control plane (status quo) | Control plane in the kernel (Xylem)
Where the cluster's intelligence lives | A separate distributed system (etcd + controllers, Raft servers, a Borgmaster, a daemon mesh) | The kernel itself
What the OS knows about the cluster | Nothing: it schedules processes | Everything: it schedules cells across nodes
How "a replica died" is learned | kubelet watch → API server → controller (network hops, sampled, stale) | A kernel event, like an IRQ or page fault, with zero sampling latency
Redundancy / scaling / failover | Verbs of the external orchestrator | Verbs of the kernel

Stated honestly

There is no control plane to bolt on — but that does not delete the distributed-systems work. It relocates it into ring 0. Folding the control plane into the kernel moves the complexity; it does not make it vanish (see §7).

2. Why JefeOS, why now

JefeOS needs an identity that differentiates it from "yet another hobby Unix clone." The owner's decision is to be a natively-clustered OS — self-redundant, self-scaling, failure-resistant — chosen because it is more fun and less saturated than re-treading POSIX. The honest secondary reason it fits: failure-resistance is already JefeOS's de-facto through-line. The recent engineering culture has been reliability-first without anyone calling it "the cluster story," and that work is exactly the substrate Xylem needs.

Already-shipped reliability work | What it gives Xylem
Panic persistence + next-boot recovery (crash record survives reboot) | Node death is recorded and recoverable, not silently lost: the raw material of a supervisor's restart policy
Fault-survivable syscalls (bad user pointers return -EFAULT instead of crashing the kernel) | A bad cell cannot take the node down with it: the isolation that makes "kill and respawn" sound
Leak-free process teardown + regression test (6-cycle green, zero free-delta) | Re-replication doesn't leak the fleet to death over time
Orphan/zombie reaper in PID-1 | The reap discipline a supervisor needs before it can claim "N healthy replicas"

The "why now" is honest, too: the isolation unit Xylem operates on — the JSL-2 cell — is scoped and its gates have cleared (the NTFS dir-index gate closed 2026-06-17; Alpine apk read/write is functional, with the interactive login-prompt the last in-progress item). For the first time there is a concrete near-term substrate to build the long arc on top of. Xylem does not start now; its prerequisite just stopped being blocked.

3. JSL ⊥ Xylem: two orthogonal axes that compose at the cell

The single most important conceptual guard-rail in this document: JSL is not Xylem, and calling Xylem a "JSL tier" would be wrong. They are orthogonal axes that compose at exactly one point — the cell.

Axis | Question it answers | Operates on | Status
JSL (Linux-compat ladder) | "Can JefeOS run Linux software?" | Linux-subsystem objects (translated syscalls, then isolation cells) | JSL-1 near-done; JSL-2 gate-clear
Xylem (native clustering) | "Can JefeOS scale and heal itself?" | Cells (native or Linux), across a fleet | Pre-implementation

JSL is the horizontal axis (run more kinds of software on one node): JSL-1 is WSL1-style syscall translation, near-done through Alpine; JSL-2 is native containers, a single-node isolation cell. Xylem is the vertical axis (run the same software on more nodes, self-healingly). They meet at the cell:

JSL-2 builds the cell wall (isolate one workload on one node). Xylem is the tissue that grows, heals, and reroutes cells across the fleet (maintain N of this cell across M nodes). Xylem adds no new isolation mechanism — it adds the cross-node lifecycle of the boundary JSL-2 draws.

The payoff of keeping them separate: a cell's payload is content-agnostic, so a Linux workload in a Xylem cell inherits self-replication, migration, and self-healing for free — because those are properties of the cell, not of Linux. A Linux container on stock Linux is cluster-blind until k8s wraps it from outside; a Linux workload in a Xylem cell is cluster-aware by inheritance, with no external control plane. The capability is JefeOS's, not Linux's — which is exactly why calling Xylem a "JSL tier" would bury the differentiator.

4. The Linux-host fork: WSL1 vs WSL2 inside a cell

JefeOS still aims to be a great Linux host — "WSL 2.0"-class. That goal predates Xylem and it stays, reframed as the ecosystem on-ramp in service of the Xylem identity, not a competing destination. A Linux workload in a cell inherits Xylem's verbs for free (§3). But how Linux executes inside the cell is a genuine multi-year architectural fork — one to surface honestly, not paper over. It maps cleanly onto the WSL1 → WSL2 lineage:

Path A — WSL1-shaped (translate)

JSL syscall translation: SYSCALL → LSTAR → linux_syscall.cpp, serviced by JefeOS's own kernel. Fidelity is approximate — bug-for-bug Linux is unreachable. EXISTS today: this is how Alpine, apk, and real upstream packages (tree, jq) run under chroot right now. Cheap and incremental, but a perpetual treadmill against Linux's evolving syscall surface.
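
To make the translate-and-service shape of Path A concrete, here is a minimal user-space sketch, not a reflection of what linux_syscall.cpp actually contains. It shows a table keyed by Linux x86-64 syscall number that routes to host handlers and returns -ENOSYS for anything not yet translated. SyscallArgs, host_write, host_getpid, make_table, and dispatch_linux_syscall are all hypothetical names invented for illustration; only the syscall numbers (write = 1, getpid = 39) and ENOSYS = 38 are standard Linux x86-64 values.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical argument bundle: x86-64 Linux passes up to six register args.
struct SyscallArgs {
    std::uint64_t a[6];
};

using Handler = std::int64_t (*)(const SyscallArgs&);

constexpr std::int64_t kENOSYS = 38;    // Linux errno for "function not implemented"
constexpr std::size_t kTableSize = 64;  // illustration only; the real surface is far larger

// Hypothetical host-side handlers standing in for JefeOS kernel services.
inline std::int64_t host_write(const SyscallArgs& args) {
    // Pretend every byte was written: return the requested length (third arg).
    return static_cast<std::int64_t>(args.a[2]);
}

inline std::int64_t host_getpid(const SyscallArgs&) {
    return 1;  // placeholder pid
}

// Build the translation table: Linux syscall number -> host handler.
inline std::array<Handler, kTableSize> make_table() {
    std::array<Handler, kTableSize> table{};  // value-initialized: every slot is nullptr
    table[1] = &host_write;    // Linux x86-64 write
    table[39] = &host_getpid;  // Linux x86-64 getpid
    return table;
}

// Dispatch one Linux syscall; untranslated or out-of-range numbers fail with -ENOSYS.
inline std::int64_t dispatch_linux_syscall(std::uint64_t nr, const SyscallArgs& args) {
    static const std::array<Handler, kTableSize> table = make_table();
    if (nr >= table.size() || table[nr] == nullptr) {
        return -kENOSYS;
    }
    return table[nr](args);
}
```

Every empty slot in such a table is a Linux call the host has not yet learned to service, which is one way to picture the "perpetual treadmill" Path A implies.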

Path B — WSL2-shaped (real kernel in cell)

A real upstream Linux kernel runs inside a cell; JefeOS hosts it. Fidelity is true — it is Linux. ABSENT today: needs a hypervisor / kernel-hosting substrate JefeOS does not have. A major architectural pivot up front, then fidelity is free and permanent.

The lineage trap, stated plainly: the entire existing JSL/Alpine investment is Path A. Path B does not extend it — it stands beside it. That is why this is a fork to surface, not a step to schedule. The hybrid is probably the real answer over time: translate now, real-kernel-in-cell later, with both presenting to the fabric as "a cell." The fork is inside the cell; Xylem above it does not change.

Owner's resolution (2026-06-17)

JSL-1.x continues as incremental "better translation" (Path A keeps paying off near-term). JSL-2.0's headline becomes full kernel-in-a-cell — a real Linux kernel hosted inside a cell (WSL2-shaped), required for JSL to be credible at the 2.0 mark. Crucially, Xylem does not depend on JSL-2.0 shipping: it operates on cells regardless of what runs inside them.

5. Prior art as gold standards, NOT clone targets

Owner's constraint

We don't want to clone k8s / OpenStack / OpenShift / Nomad / Borg. We use them as gold standards for their use case and build what makes sense for Xylem. Every orchestrator below shares one assumption Xylem deliberately inverts — the cluster lives in an external control plane on top of cluster-blind OSes. We study what each does well and why, then build natively from first principles.

Cluster orchestrators — lessons to inherit, surfaces NOT to clone

System | Gold-standard lesson for Xylem | Do NOT clone
Kubernetes | Level-triggered reconciliation: a loop that continuously re-asserts "I want N healthy replicas" is self-correcting against missed events. Declarative desired-state is the right contract. | The external control plane + etcd-as-a-separate-quorum + the enormous declarative API surface. Full k8s API = stated non-goal.
Nomad | The "evaluation" as the unit of work + feasibility→scoring split. An orchestrator can be one tight binary, which maps naturally to "in the kernel." | The external 3–5 Raft-server topology + region/datacenter federation + HCL specs.
Borg | Replicate the brain (consensus) but let the scheduler run on a cached, loosely-synchronized view; reserve resources as first-class allocs. | The monolithic central Borgmaster as an external service tuned to Google scale + an operational army.
OpenStack | A cluster OS must own the substrate: placement is meaningless without an answer for network fabric, storage, and identity. | The "distributed monolith" of many daemons over a shared message bus. JefeOS is an OS, not an IaaS orchestrating other OSes.
OpenShift | Opinionated, secure-by-default + a coherent day-2 (lifecycle / upgrade / heal) story is a feature, not bloat. | It thickens the entire k8s external control plane + a large operator/API surface.

The OS-native resilience lineage — what Xylem inherits

Folding resilience into the kernel is one of the most repeatedly-attempted ideas in systems history, and most attempts died — almost never because the idea was wrong, but because they were beautiful islands with no software: technically superb systems stranded outside the ecosystem gravity well.

System | What it PROVED | The trap
MOSIX / Kerrighed / Plan 9 (Single-System-Image) | The cluster can look like one machine: the kernel migrates processes transparently; Plan 9 named resources uniformly via 9P. | The market evaporated, and they cleanly migrated only stateless processes.
Erlang/OTP + BEAM | The closest production proof of the thesis: supervision trees, "let it crash," hot code reload, location-transparent messaging running global telecom for decades. | Not an OS (a language island). Distributed Erlang punts split-brain: a human picks the winner on heal.
QNX Neutrino (microkernel) | "Failure-resistant + hot-swap" as a shipping commercial reality: restart a crashed driver without rebooting, with ordered multi-stage recovery. | Stayed vertical (automotive/embedded) and proprietary; single-node, no cluster fabric.
seL4 / Genode (capability microkernels) | Fault isolation as a first-class, even formally-verified property: kill a component with provably no collateral authority leak. | Proves the isolation primitive; gives no clustering.
Unikernels (MirageOS, Solo5) | The disposable cell, demonstrated: boots in tens of milliseconds, immutable, spawn-on-demand. | Sharpest island problem: you must rewrite your app into the library OS.

The unfair advantage. Every single-system-image OS died the same death: a gorgeous "one big computer" abstraction offered to a world whose software didn't run on it. JefeOS solves the ecosystem problem on a different axis (JSL) before it builds the clustering one (Xylem), and the two compose. The SSI OSes had to win the software war and the clustering war simultaneously and lost the first; Xylem fights them sequentially and orthogonally.

The caveat this lineage forces

The two systems that got furthest — Erlang/OTP and QNX — are precisely the two that mark the boundary. QNX restarts flawlessly on one node; Erlang supervises flawlessly until a stateful store partitions, at which point the best-in-class system stops and asks a human. Stateless cells are tractable; stateful cells are the dragon — the same place k8s itself bleeds (etcd is a separate Raft cluster precisely because this is the hard part).

6. The architecture, at a high level

Honesty gate

Almost nothing in this section exists today. JefeOS is a single-node kernel. The order matters: the cell comes first (JSL-2), then Xylem operates on cells. The table below classifies each asset honestly.

Asset Xylem needs | State today | Used by
Network stack (TCP / TLS 1.3 / SSH, DNS) | EXISTS (single global instance; TLS client-only) | Membership, replication transport
Preemptive scheduler with real load / mem / failure ground truth | EXISTS (single node) | Reconcile loop, scaling signal
Per-process page tables (own PML4, CR3-switched) | EXISTS (Phases 0–3) | Cell isolation, migration checkpoint
Panic persistence; fault-survivable syscalls; leak-free teardown | EXISTS (recently hardened) | Node-death detection, clean re-replication
JSL-2 isolation cell (namespaces + cgroups) | ABSENT (scoped, gate-clear) | The cell boundary Xylem manages
Cross-node anything (membership, consensus, migration) | ABSENT | All of Xylem
Per-netns / multi-interface networking | ABSENT (net stack is a global singleton) | Per-cell network identity across nodes

The cell — the unit Xylem manages

A cell is the atom of supervision: a named, supervised, relocatable unit of execution with a declared identity and a supervision contract. Xylem never schedules "a process" or "a container" directly — it schedules cells. The cell is the JSL-2 isolation cell, reused. Its payload is content-agnostic: a native JefeOS service inherits Xylem directly; a Linux workload (a JSL-1-translated process tree inside a JSL-2 cell) inherits it for free — the workload is Linux but the capability is JefeOS's.

Cluster membership — kernels watching each other

Before anything can be redundant, nodes must agree on who is alive. The differentiator: join / leave / death arrive as kernel events, namely node_up(id), node_down(id, reason), and node_suspect(id), delivered to the reconcile loop the way an IRQ or page fault is, not as log lines an external watcher scrapes. The minimal viable protocol is a deliberate split: gossip (SWIM-style) for liveness, and a small Raft group (3–5 voters) holding the authoritative desired-state log. We invent the state machine that rides the wire, not the wire protocol or crypto.
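
As a rough illustration of what "membership as events" could look like to the reconcile loop, the sketch below models the three events and a consumer that folds them into a live set. It is user-space C++ using the standard library for brevity, not kernel code, and nothing here exists in JefeOS. The names MembershipEvent, EventKind, DownReason, and LiveSet are hypothetical, and the reason values are invented, since the text says only that node_down carries a reason.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

using NodeId = std::uint32_t;

// The three event kinds named in the text.
enum class EventKind { NodeUp, NodeSuspect, NodeDown };

// Hypothetical reasons for a node_down event.
enum class DownReason { None, Timeout, Panic, GracefulLeave };

struct MembershipEvent {
    EventKind kind;
    NodeId id;
    DownReason reason;  // meaningful only for NodeDown
};

// Constructors mirroring the spelling used in the text.
inline MembershipEvent node_up(NodeId id) { return {EventKind::NodeUp, id, DownReason::None}; }
inline MembershipEvent node_suspect(NodeId id) { return {EventKind::NodeSuspect, id, DownReason::None}; }
inline MembershipEvent node_down(NodeId id, DownReason r) { return {EventKind::NodeDown, id, r}; }

// Reconcile-side view of membership, updated one event at a time.
class LiveSet {
public:
    void deliver(const MembershipEvent& ev) {
        switch (ev.kind) {
        case EventKind::NodeUp:
            alive_.insert(ev.id);
            suspect_.erase(ev.id);
            break;
        case EventKind::NodeSuspect:
            // Suspicion is not death: a known node stays alive but is flagged.
            if (alive_.count(ev.id) != 0) suspect_.insert(ev.id);
            break;
        case EventKind::NodeDown:
            alive_.erase(ev.id);
            suspect_.erase(ev.id);
            break;
        }
    }
    bool is_alive(NodeId id) const { return alive_.count(id) != 0; }
    bool is_suspect(NodeId id) const { return suspect_.count(id) != 0; }
    std::size_t alive_count() const { return alive_.size(); }

private:
    std::set<NodeId> alive_;
    std::set<NodeId> suspect_;
};
```

Keeping "suspect" separate from "down" gives the loop room to hold off on re-replication until the failure detector commits to a verdict.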

Service as a first-class kernel object

A service is the durable thing the user declares; a cell is a runtime instance. In Xylem a service is a kernel object, not a YAML manifest, carrying name, payload spec, desired_replicas, a supervision policy, a placement policy, and a scaling rule. The reconcile loop lives inside the scheduler, because the scheduler already holds ground truth at zero sampling latency:

Fact the loop needs | k8s gets it by | Xylem already has it
Real CPU / run-queue load | Scraping cgroup stats over the network | The scheduler's own run queue
Real free memory | metrics-server / cAdvisor scrape | The PMM's live free-page count
A replica died | kubelet watch → API server → controller | A task-exit / panic / fault-survival event in-kernel
A node died | Node heartbeat timeout at the API server | The membership kernel event

k8s reconciles a model of reality; Xylem reconciles reality, because the controller and the resource are the same address space. The honest cost: each node has local truth directly; cluster-wide desired state still needs agreement (the Raft group). We removed scrape latency; we did not remove the need for agreement.
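
The service object and the level-triggered comparison at the heart of the reconcile loop can be sketched in a few lines. This is a user-space illustration under stated assumptions, not a proposed kernel ABI. The field list follows the text (name, payload spec, desired_replicas, supervision policy, placement policy, scaling rule), but the concrete types, the RestartPolicy values, the single anti_affinity flag standing in for a placement policy, and the reconcile function are all hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical supervision policies; the text does not enumerate them.
enum class RestartPolicy { Always, OnFailure, Never };

struct ScalingRule {
    std::uint32_t min_replicas;
    std::uint32_t max_replicas;
};

// The durable thing the user declares. Cells are its runtime instances.
struct Service {
    std::string name;
    std::string payload_spec;
    std::uint32_t desired_replicas;
    RestartPolicy supervision;
    bool anti_affinity;  // stand-in for a richer placement policy
    ScalingRule scaling;
};

enum class Action { None, Spawn, Retire };

struct Decision {
    Action action;
    std::uint32_t count;
};

// Level-triggered: each pass compares desired state with observed state,
// so a missed event is corrected on the next pass.
inline Decision reconcile(const Service& svc, std::uint32_t healthy_cells) {
    assert(svc.scaling.min_replicas <= svc.scaling.max_replicas);
    if (healthy_cells < svc.desired_replicas) {
        return {Action::Spawn, svc.desired_replicas - healthy_cells};
    }
    if (healthy_cells > svc.desired_replicas) {
        return {Action::Retire, healthy_cells - svc.desired_replicas};
    }
    return {Action::None, 0};
}
```

In the design described above, healthy_cells would come from the scheduler's own bookkeeping rather than from a scrape; here it is simply a parameter.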

Self-redundancy, self-scaling, live migration

  • Self-redundancy is RAID for compute. desired_replicas = N is an invariant the kernel maintains. When node_down(B) fires, every cell B hosted is a deficit, and the reconcile loop schedules replacements onto survivors — honoring anti-affinity so it doesn't recreate the single point of failure. Re-replication must be single-writer (the Raft group arbitrates an ownership lease).
  • Self-scaling is the same reconcile loop with desired_replicas free to move between min/max on in-kernel load and memory signals. The autoscaler and the scheduler are the same loop, so a scale-out decision and its placement are one atomic act, not two controllers negotiating over the network.
  • Live migration is the second dividend of per-process page tables. A cell that owns a private PML4 is exactly a cell whose entire user address space can be walked, serialized, and reconstructed on a peer (the classic MOSIX trick). We didn't build migration machinery; we built isolation, and isolation is most of the checkpoint.
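
The self-redundancy bullet above can be sketched as a placement step: when a node dies, its cells become a deficit, and each replacement goes to the surviving node that currently hosts the fewest replicas of the service. Least-loaded spreading is only one simple way to approximate anti-affinity, and the single-writer lease arbitration is deliberately not modeled. Placement and replace_lost_cells are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

using NodeId = std::uint32_t;

// For one service: node id -> number of its cells hosted on that node.
// Survivors hosting zero cells must still be listed to be eligible.
using Placement = std::map<NodeId, std::uint32_t>;

// Handle node_down(dead): drop the dead node, then place each lost replica
// on the survivor with the fewest replicas (ties go to the lowest node id).
// Returns the chosen target node for each replacement, in order.
inline std::vector<NodeId> replace_lost_cells(Placement& replicas_on, NodeId dead) {
    std::vector<NodeId> targets;
    auto it = replicas_on.find(dead);
    if (it == replicas_on.end()) return targets;  // unknown node: nothing to do
    const std::uint32_t deficit = it->second;
    replicas_on.erase(it);
    if (replicas_on.empty()) return targets;      // no survivors to place on

    for (std::uint32_t i = 0; i < deficit; ++i) {
        auto best = replicas_on.begin();
        for (auto cur = replicas_on.begin(); cur != replicas_on.end(); ++cur) {
            if (cur->second < best->second) best = cur;
        }
        ++best->second;
        targets.push_back(best->first);
    }
    return targets;
}
```

With only one survivor, every replacement lands on it, which restores the replica count but not the spread; that is the situation the 2-node demo in §8 accepts.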

Two gaps the design must own

Service addressing / front-door (ABSENT): when a cell respawns on a different node, what address do clients use? Failover is not invisible to clients unless a stable VIP / DNS-SD record / re-routing front-door sits in front of the moving cells. JefeOS has the DNS resolver and net stack to build on, but no service-discovery layer exists yet.

Fleet observability (ABSENT): reliability culture lives on single-node dmesg / serial / panic-persistence. The intended shape is a xylem status command reading in-kernel reconcile state, but surfacing that truth to an operator is itself unbuilt.

7. The hard problems, stated honestly

The thesis is seductive; the discipline this section imposes is the price of the differentiator. Every hard problem k8s has, Xylem also has — now inside ring 0, where bugs are panics instead of crash-looped pods.

Hard problem | Where it bites Xylem | Honest posture
Consensus (Raft-in-kernel) | Persistent log, leader election, replication, snapshotting: every Raft edge case as kernel code, where a liveness bug is a wedge and a safety bug is data loss | Accept Raft, scope it narrowly to desired state (never the data plane), keep the voter set small (3–5)
Split-brain / partitions | A partition is indistinguishable from death to a failure detector; both sides may try to maintain N → 2N cells. The core hazard | Explicit CAP choice: the majority partition stays available and may act; the minority must stop creating/mutating cells. Deliberately unavailable: correct, not a bug
Stateful cells (the dragon) | "Maintain N replicas of Postgres" needs quorum writes, per-shard leader election, conflict resolution: replicated storage Xylem does not have | Exactly where k8s itself bleeds. Stateless cells first-class; stateful cells explicitly deferred, likely needing an external/replicated store
Security / cross-node multi-tenancy | A compromised node could lie in gossip, forge desired-state, or exfiltrate a migrated cell's whole address space (the checkpoint ships memory over the wire) | Node identity must be cryptographic (mutual TLS / SSH host keys; JefeOS has the primitives). "Trusted fleet" is an assumption to state, not an achievement. Hostile multi-tenancy is out of initial scope
CAP realities | Pervasive: membership, re-replication, scaling all make an implicit CAP choice | Make it explicit and uniform: Xylem is CP for authoritative actions. Eventual/AP only for non-authoritative liveness gossip

Bottom line. Xylem's bet is not that distributed systems are easy; it's that putting the control loop in the same address space as the ground truth removes a class of sampling/staleness/translation error external orchestrators fight forever, and that JefeOS's reliability-first culture is the right substrate to absorb the consensus and partition complexity that remains. The complexity moves into the kernel; it does not vanish.
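
The split-brain posture in the table above reduces to one gate in front of every authoritative action: act only while a strict majority of voters is reachable. The sketch below shows that rule in isolation. has_quorum, Mode, and partition_mode are hypothetical names, and the count of reachable voters is assumed to include the local node.

```cpp
#include <cassert>
#include <cstdint>

// Strict majority of the voter set. reachable_voters counts the local node.
inline bool has_quorum(std::uint32_t reachable_voters, std::uint32_t total_voters) {
    assert(reachable_voters <= total_voters);
    return reachable_voters * 2 > total_voters;
}

// What this side of a partition is allowed to do.
enum class Mode {
    Authoritative,  // majority side: may create, retire, and re-replicate cells
    ReadOnly        // minority side: keeps running cells, mutates nothing
};

inline Mode partition_mode(std::uint32_t reachable_voters, std::uint32_t total_voters) {
    return has_quorum(reachable_voters, total_voters) ? Mode::Authoritative
                                                      : Mode::ReadOnly;
}
```

Because at most one side of any partition can hold a strict majority, two sides can never both be Authoritative, which is what prevents the N → 2N hazard.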

8. The proof-of-thesis MVP: 2-node failover

The single demo that proves Xylem is real, concretely:

Two JefeOS nodes. Kernel-level membership between them. One stateless service declared at replicas=2, one cell on each node. Kill a node (power it off). The surviving kernel observes node_down as a kernel event, sees the replica deficit against its own ground truth, and auto-respawns the missing replica on itself — with no external orchestrator running anywhere in the demo.

  • It is the inversion, made visible. The failover happens with no external control plane anywhere — something you literally cannot demonstrate on stock Linux + k8s, where the orchestrator is the thing doing the failover.
  • It is honestly scoped. Stateless → no consensus-on-data, no stateful dragon. It still needs the tractable hard parts (membership, a kernel event, single-writer re-replication arbitration), so it is not a toy. It deliberately sidesteps the §6 addressing gap — called out, not hidden.
  • It is showable in one screen recording: tasklist on both nodes, kill one, watch the survivor's tasklist grow the replacement cell — driven by kernel logs, not a control-plane dashboard.

9. Phased roadmap (reliability-first sequencing)

Xylem is the long arc. Each phase is small, gated, and testable; the dev loop stays reliability-first throughout (a wedge or regression always preempts Xylem work). Effort figures are deliberately omitted — this is a direction, not a schedule, and distributed systems resist estimation.

  • Phase 0: Solidify the cell. JSL-2 isolation cell (namespaces + cgroups). Xylem cannot supervise cells it cannot cleanly isolate. Gate: JSL-2's own track.
  • Phase 1: Kernel membership. Two nodes discover + watch each other; node_up/down/suspect as kernel events. Gossip liveness first; a small Raft group for desired state. Gate: a node-to-node mutual-auth listener is new (net stack is a global singleton).
  • Phase 2: Single-service supervision. A service kernel object + in-kernel reconcile loop that restarts a failed local cell (QNX-HAM-on-one-node, generalized). Gate: inherits the hardened teardown/reaper paths.
  • Phase 3: Multi-node replicas + failover (★ the MVP). replicas=2 across two nodes; kill a node → survivor auto-respawns the replica. Single-writer re-replication via the Raft group. Gate: Phases 1+2 and the partition/split-brain story.
  • Phase 4: Service addressing + observability. A stable front-door so a client reaches a service whose cells moved, and a xylem status view of fleet/replica state. Gate: service-VIP / DNS-SD layer is new.
  • Phase 5: Live migration. Checkpoint / ship / resume a stateless, connection-light cell, draining a node without killing the workload. Gate: connection/fd migration needs the currently absent per-netns networking + shared storage.
  • Phase 6: Autoscale. desired_replicas moves between min/max on in-kernel load/memory signals, with hysteresis. Gate: coupled to SMP / cgroup-v2 CPU accounting maturity.
  • Phase 7: Stateful cells (last, hardest). Durable replicated state: the dragon. Gate: likely needs an external/replicated storage substrate; explicitly the long, dangerous arc.
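
The Phase 6 wording, "desired_replicas moves between min/max ... with hysteresis", can be sketched as a step function with two separated thresholds, so that a load hovering near a single cut-off does not make the replica count flap. The thresholds, the one-step-per-pass rule, and the names AutoscalePolicy and next_desired are assumptions for illustration; the text does not specify the signal or the rule.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

struct AutoscalePolicy {
    std::uint32_t min_replicas;
    std::uint32_t max_replicas;
    double scale_out_above;  // per-replica load above which one replica is added
    double scale_in_below;   // per-replica load below which one replica is removed
};

// One reconcile pass: move desired_replicas by at most one step.
// The gap between the two thresholds is the hysteresis band; inside it
// the count is left unchanged.
inline std::uint32_t next_desired(const AutoscalePolicy& p,
                                  std::uint32_t current,
                                  double load_per_replica) {
    assert(p.min_replicas <= p.max_replicas);
    assert(p.scale_in_below < p.scale_out_above);
    std::uint32_t next = current;
    if (load_per_replica > p.scale_out_above) {
        next = current + 1;
    } else if (load_per_replica < p.scale_in_below && current > 0) {
        next = current - 1;
    }
    // Clamp into [min_replicas, max_replicas].
    return std::max(p.min_replicas, std::min(next, p.max_replicas));
}
```

For example, with a policy of {2, 4, 0.8, 0.3}, a per-replica load of 0.5 leaves the count unchanged, while 0.9 adds a replica and 0.2 removes one, always within the 2 to 4 bounds.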

The sequencing is the point: stateless across a trusted fleet is the honest, reachable milestone (Phase 3); consensus, partitions, and stateful cells are the long, hard arc — the same arc every serious cluster system walks, now walked in C++ at ring 0.

10. Design decisions / direction

The owner resolved several of the design doc's open questions on 2026-06-17. These are direction, not shipped work:

Question | Resolution
The WSL1 / WSL2 Linux-host fork | JSL-1.x continues as incremental "better translation." JSL-2.0's headline becomes full kernel-in-a-cell (a real Linux kernel hosted in a cell, WSL2-shaped), required for JSL to be credible at the 2.0 mark. Xylem does not depend on JSL-2.0 shipping; it operates on cells regardless.
Cluster membership protocol (gossip vs Raft) | Deferred. The guiding principle is to model Xylem after actual plant xylem (biomimicry): decentralized, pressure/flow-driven, reroute-around-embolism, no central authority.
Stateful-cell substrate | The roadmap supports both: an external replicated store (pragmatic; start here) and eventual kernel-native replicated storage.
Security / multi-tenant model | Hostile multi-tenant security is not an initial goal (trusted-fleet posture); it remains an outstanding future possibility.

11. Open & breakout items

Two threads are explicitly held open for dedicated future sessions:

Membership & biomimicry

The gossip-vs-Raft boundary is deferred, with a strong design steer: model Xylem on actual plant xylem. Real xylem has no central authority — it is decentralized and pressure/flow-driven, and it reroutes around an embolized vessel as a structural property. Whether even liveness should be quorum-backed (simpler reasoning, worse scaling) or stay gossip-based is the open call.

Dual-kernel hot-swap (C++ ↔ Rust)

The originating thesis reached for a subsystem hot-swap angle. Hot-swapping a kernel subsystem (C++ → a Rust equivalent) at runtime is a multi-quarter architecture bet distinct from cell-level live-replaceability. It is open — the sustainability of JefeRust perpetually playing "catch-up" is genuinely questioned, and this is deferred to a dedicated roadmapping session.

A standing reminder

Everything in this whitepaper is a thesis and a direction. Almost none of the clustering described here is built — the foundations it stands on are. The actual clustering (service-as-kernel-object, replica supervision, fault failover, migration, service addressing, fleet observability) is multiple quarters away and will be the hard part. Read it as where JefeOS intends to go, not where it is.