karapace/docs/storage-format.md
Marco Allegretti e6e0f3dd6d docs: rewrite all documentation from source code
Delete 14 old docs files (AI-generated, riddled with Phase/M1/1.0
jargon, references to non-existent commands, stale CI snippets).

New documentation (6 files), written from repository source analysis:
- docs/architecture.md — crate graph, engine lifecycle, identity
  computation, runtime backends, store design, WAL, GC, unsafe blocks
- docs/cli-reference.md — all 23 commands with syntax, args, flags,
  exit codes, env vars, verified against crates/karapace-cli/src/main.rs
- docs/storage-format.md — directory layout, objects, layers, metadata,
  manifest format, lock file, WAL, atomic write contract
- docs/security-model.md — mount/device/env var policies with exact
  defaults from security.rs, trust assumptions, what is NOT protected
- docs/build-and-reproducibility.md — CI env vars, RUSTFLAGS, cargo
  profile, reproducibility verification, toolchain pinning
- docs/contributing.md — setup, verification, project layout, code
  standards, testing, CI workflows

README.md rewritten: concise, no marketing language, prerequisites
first, usage example, command table, limitations section.

CONTRIBUTING.md now points to docs/contributing.md.
CHANGELOG.md cleaned: removed M1-M8 labels, Phase refs, stale counts.
2026-02-23 01:25:07 +01:00

212 lines
5.6 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Storage Format
Store format version: **2**. Defined in `karapace-store/src/layout.rs::STORE_FORMAT_VERSION`.
## Directory layout
Default root: `~/.local/share/karapace`.
```
<root>/
store/
version # { "format_version": 2 }
.lock # flock(2) exclusive lock
objects/<blake3_hex> # content-addressable blobs
layers/<blake3_hex> # layer manifests (JSON)
metadata/<env_id> # environment metadata (JSON)
staging/ # temp workspace for atomic operations
wal/<op_id>.json # write-ahead log entries
env/
<env_id>/
upper/ # overlay writable layer
overlay/ # overlay mount point
images/
<cache_key>/
rootfs/ # extracted base image filesystem
```
Paths defined in `karapace-store/src/layout.rs::StoreLayout`.
## Version file
```json
{ "format_version": 2 }
```
Checked on every store access. Mismatched versions are rejected with `StoreError::VersionMismatch`.
## Objects
Content-addressable blobs keyed by blake3 hex digest of their content.
- Write: `NamedTempFile` in objects dir → write content → `sync_all()``persist()` (atomic rename)
- Read: read file → recompute blake3 → compare to filename → reject on mismatch
- Idempotent: writing identical content is a no-op
Defined in `karapace-store/src/objects.rs::ObjectStore`.
## Layers
JSON files in `store/layers/`. Each describes a tar archive stored in the object store.
```json
{
"hash": "<layer_hash>",
"kind": "Base | Dependency | Policy | Snapshot",
"parent": "<parent_hash> | null",
"object_refs": ["<hash>", ...],
"read_only": true,
"tar_hash": "<blake3_of_tar>"
}
```
Defined in `karapace-store/src/layers.rs::LayerManifest`.
**Layer kinds:**
| Kind | Hash computation | Parent |
|------|-----------------|--------|
| `Base` | `tar_hash` | None |
| `Dependency` | `tar_hash` | Base layer |
| `Policy` | `tar_hash` | — |
| `Snapshot` | `blake3("snapshot:{env_id}:{base_layer}:{tar_hash}")` | Base layer |
Layer integrity is verified on read: the file content is re-hashed and compared to the filename.
### Deterministic tar packing
`karapace-store/src/layers.rs::pack_layer(source_dir)`:
- Entries sorted by path
- Timestamps set to 0
- Owner set to `0:0`
- Permissions preserved
- Symlink targets preserved
**Dropped during packing:** extended attributes, device nodes, hardlinks (stored as regular files), SELinux labels, ACLs, sparse file holes.
`unpack_layer(tar_data, target_dir)` reverses the process.
## Metadata
JSON files in `store/metadata/`, one per environment. Filename is the `env_id`.
```json
{
"env_id": "...",
"short_id": "...",
"name": null,
"state": "Built",
"manifest_hash": "<object_hash>",
"base_layer": "<layer_hash>",
"dependency_layers": [],
"policy_layer": null,
"created_at": "RFC3339",
"updated_at": "RFC3339",
"ref_count": 1,
"checksum": "<blake3_of_json>"
}
```
Defined in `karapace-store/src/metadata.rs::EnvMetadata`.
**States:** `Defined`, `Built`, `Running`, `Frozen`, `Archived`.
**Checksum:** blake3 of the JSON content (excluding the checksum field itself). Computed on every `put()`, verified on every `get()`. Absent in legacy metadata (`#[serde(default)]`).
**Names:** optional, validated by `validate_env_name`: pattern `[a-zA-Z0-9_-]`, 164 characters. Unique across all environments.
## Manifest format
File: `karapace.toml`. Parsed by `karapace-schema/src/manifest.rs`.
```toml
manifest_version = 1
[base]
image = "rolling"
[system]
packages = ["git", "curl"]
[gui]
apps = []
[hardware]
gpu = false
audio = false
[mounts]
workspace = "./:/workspace"
[runtime]
backend = "namespace"
network_isolation = false
[runtime.resource_limits]
cpu_shares = 1024
memory_limit_mb = 4096
```
**Required:** `manifest_version` (must be `1`), `base.image` (non-empty).
**Optional:** all other sections. Unknown fields cause a parse error (`deny_unknown_fields`).
**Normalization** (`ManifestV1::normalize`): trim strings, sort and deduplicate packages/apps, sort mounts by label, lowercase backend name. Produces `NormalizedManifest` with a `canonical_json()` method.
## Lock file
File: `karapace.lock`. Written next to the manifest. TOML format.
```toml
lock_version = 2
env_id = "46e1d96f..."
short_id = "46e1d96fdd6f"
base_image = "rolling"
base_image_digest = "a1b2c3d4..."
runtime_backend = "namespace"
hardware_gpu = false
hardware_audio = false
network_isolation = false
[[resolved_packages]]
name = "git"
version = "2.44.0-1"
```
Defined in `karapace-schema/src/lock.rs::LockFile`.
**Verification:**
- `verify_integrity()`: recomputes `env_id` from locked fields, compares to stored value
- `verify_manifest_intent()`: checks manifest hasn't drifted from what was locked
## Hashing
All hashing uses **blake3**, 256-bit output, hex-encoded (64 characters).
Used for: object keys, layer hashes, env_id computation, metadata checksums, image content digests.
## Write-ahead log
`store/wal/<op_id>.json`. Defined in `karapace-store/src/wal.rs`.
```json
{
"op_id": "20260215120000123-a1b2c3d4",
"kind": "Build",
"env_id": "...",
"timestamp": "RFC3339",
"rollback_steps": [
{ "RemoveDir": "/path" },
{ "RemoveFile": "/path" }
]
}
```
**Operations:** `Build`, `Rebuild`, `Commit`, `Restore`, `Destroy`, `Gc`.
**Recovery:** on `Engine::new()`, all WAL entries are scanned. Each entry's rollback steps execute in reverse order. The entry is then deleted. Corrupt entries are silently removed.
## Atomic write contract
All store writes follow: `NamedTempFile::new_in(dir)` → write → `flush()``persist()` (atomic rename). No partial files are visible. Defined throughout `karapace-store`.