Checkpointing
Checkpoint and savepoint configuration, paths, and recovery.
Overview
Krishiv uses barrier-based checkpointing similar to Apache Flink. A barrier event is injected into all input streams simultaneously. When all operators have processed the barrier, the coordinator saves a consistent snapshot of all keyed state and source offsets.
Durability Profiles
| Profile | State Backend | Checkpoint Storage |
|---|---|---|
dev-local | In-memory | Ephemeral temp dir; not restart-durable. |
single-node-durable | RocksDB (local) | Local filesystem. Survives process restarts on same host. |
distributed-durable | RocksDB (restored from checkpoint) | Object store (S3/ADLS/GCS). Etcd tracks checkpoint metadata. |
Configuration
| Config Key | Default | Description |
|---|---|---|
krishiv.checkpoint.interval_ms | 60000 | Checkpoint interval in milliseconds. |
krishiv.checkpoint.storage.path | — | Local path or object-store URI (e.g. s3://bucket/checkpoints/). |
krishiv.checkpoint.retain | 3 | Number of completed checkpoints to retain. |
krishiv.checkpoint.min_pause_ms | 500 | Minimum gap between successive checkpoint barriers. |
krishiv.savepoint.path | — | Manual savepoint output path. Triggered via session.take_savepoint(). |
Recovery
# Resume from latest checkpoint
cargo run -p krishiv -- start --job-id <id> --resume-from latest
# Resume from specific checkpoint
cargo run -p krishiv -- start --job-id <id> --resume-from <checkpoint-uri>
# Resume from savepoint
cargo run -p krishiv -- start --job-id <id> --resume-from savepoint://<path>
Warning: Savepoints are operator-topology-aware. Adding, removing, or reordering operators between a savepoint and the recovery job may fail or produce incorrect state restoration.