Checkpointing | Krishiv

Overview

Krishiv uses barrier-based checkpointing similar to Apache Flink. A barrier event is injected into all input streams simultaneously. When all operators have processed the barrier, the coordinator saves a consistent snapshot of all keyed state and source offsets.
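
The completion condition described above can be sketched as a small coordinator-side tracker. This is only an illustration under stated assumptions: `BarrierTracker`, `ack`, and the operator names are hypothetical and are not part of the Krishiv API.

```rust
use std::collections::HashSet;

/// Hypothetical tracker (not a Krishiv type): records which operators
/// still have to process the barrier for one checkpoint.
struct BarrierTracker {
    checkpoint_id: u64,
    pending: HashSet<String>,
}

impl BarrierTracker {
    /// Start tracking a checkpoint across the given operators.
    fn new(checkpoint_id: u64, operators: &[&str]) -> Self {
        BarrierTracker {
            checkpoint_id,
            pending: operators.iter().map(|s| s.to_string()).collect(),
        }
    }

    fn checkpoint_id(&self) -> u64 {
        self.checkpoint_id
    }

    /// Record that `operator` has processed the barrier.
    /// Returns true only when this acknowledgement is the last one,
    /// i.e. the point at which a snapshot could be saved.
    fn ack(&mut self, operator: &str) -> bool {
        let was_pending = self.pending.remove(operator);
        was_pending && self.pending.is_empty()
    }

    /// True once every operator has processed the barrier.
    fn is_complete(&self) -> bool {
        self.pending.is_empty()
    }
}
```

With operators `source`, `map`, and `sink`, the first two calls to `ack` return `false` and the third returns `true`; a repeated acknowledgement from the same operator is ignored.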

Durability Profiles

Profile	State Backend	Checkpoint Storage
`dev-local`	In-memory	Ephemeral temp dir; not restart-durable.
`single-node-durable`	RocksDB (local)	Local filesystem. Survives process restarts on the same host.
`distributed-durable`	RocksDB (restored from checkpoint)	Object store (S3/ADLS/GCS). Checkpoint metadata is tracked in etcd.
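
As a reading aid, the table can be modelled as an enum keyed by profile name. The type and method names below are hypothetical, and `survives_process_restart` encodes one interpretation of the table (only `dev-local` is described as not restart-durable), not documented behavior.

```rust
use std::str::FromStr;

/// Hypothetical model of the three profiles in the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DurabilityProfile {
    DevLocal,
    SingleNodeDurable,
    DistributedDurable,
}

impl FromStr for DurabilityProfile {
    type Err = String;

    /// Parse the profile names exactly as they appear in the table.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "dev-local" => Ok(Self::DevLocal),
            "single-node-durable" => Ok(Self::SingleNodeDurable),
            "distributed-durable" => Ok(Self::DistributedDurable),
            other => Err(format!("unknown durability profile: {}", other)),
        }
    }
}

impl DurabilityProfile {
    /// State backend column of the table.
    fn state_backend(self) -> &'static str {
        match self {
            Self::DevLocal => "In-memory",
            Self::SingleNodeDurable => "RocksDB (local)",
            Self::DistributedDurable => "RocksDB (restored from checkpoint)",
        }
    }

    /// Assumption: every profile except `dev-local` keeps checkpoints
    /// across a process restart.
    fn survives_process_restart(self) -> bool {
        !matches!(self, Self::DevLocal)
    }
}
```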

Configuration

Config Key	Default	Description
`krishiv.checkpoint.interval_ms`	`60000`	Checkpoint interval in milliseconds.
`krishiv.checkpoint.storage.path`	—	Local path or object-store URI (e.g. `s3://bucket/checkpoints/`).
`krishiv.checkpoint.retain`	`3`	Number of completed checkpoints to retain.
`krishiv.checkpoint.min_pause_ms`	`500`	Minimum gap, in milliseconds, between successive checkpoint barriers.
`krishiv.savepoint.path`	—	Manual savepoint output path. Triggered via `session.take_savepoint()`.
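
To show how these keys might interact, here is a minimal sketch that mirrors the defaults from the table. `CheckpointConfig`, `next_barrier_at`, and `ids_to_prune` are hypothetical names. Two assumptions are baked in: the minimum pause is measured from the end of the previous checkpoint (the Flink-style interpretation), and checkpoint ids increase over time so the smallest ids are the oldest.

```rust
/// Hypothetical in-memory view of the `krishiv.checkpoint.*` keys.
#[derive(Debug, Clone, PartialEq)]
struct CheckpointConfig {
    interval_ms: u64,
    storage_path: Option<String>,
    retain: usize,
    min_pause_ms: u64,
}

impl Default for CheckpointConfig {
    /// Defaults copied from the configuration table.
    fn default() -> Self {
        CheckpointConfig {
            interval_ms: 60_000,
            storage_path: None, // no default in the table
            retain: 3,
            min_pause_ms: 500,
        }
    }
}

/// Earliest time (in ms) at which the next barrier may be injected.
/// Assumption: the interval counts from the previous barrier, and the
/// minimum pause counts from the end of the previous checkpoint.
fn next_barrier_at(cfg: &CheckpointConfig, last_start_ms: u64, last_end_ms: u64) -> u64 {
    (last_start_ms + cfg.interval_ms).max(last_end_ms + cfg.min_pause_ms)
}

/// Ids of completed checkpoints to delete so that only the newest
/// `retain` remain. Assumes ids increase over time.
fn ids_to_prune(cfg: &CheckpointConfig, completed: &[u64]) -> Vec<u64> {
    let mut sorted = completed.to_vec();
    sorted.sort_unstable();
    let excess = sorted.len().saturating_sub(cfg.retain);
    sorted.truncate(excess); // keep only the oldest `excess` ids
    sorted
}
```

Under these assumptions, a checkpoint that starts at 0 ms and finishes at 59,900 ms pushes the next barrier to 60,400 ms rather than 60,000 ms, because the 500 ms pause dominates.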

Recovery

```shell
# Resume from latest checkpoint
cargo run -p krishiv -- start --job-id <id> --resume-from latest

# Resume from specific checkpoint
cargo run -p krishiv -- start --job-id <id> --resume-from <checkpoint-uri>

# Resume from savepoint
cargo run -p krishiv -- start --job-id <id> --resume-from savepoint://<path>
```
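
The three `--resume-from` forms above can be told apart with a small parser. This is a sketch only: `ResumeTarget` and `parse_resume_from` are hypothetical names, and how Krishiv actually parses this flag is not specified here.

```rust
/// Hypothetical classification of a `--resume-from` value.
#[derive(Debug, PartialEq, Eq)]
enum ResumeTarget {
    /// The literal value `latest`.
    Latest,
    /// A `savepoint://<path>` value; holds the path after the scheme.
    Savepoint(String),
    /// Anything else is treated as a checkpoint URI.
    Checkpoint(String),
}

fn parse_resume_from(value: &str) -> Result<ResumeTarget, String> {
    if value.is_empty() {
        return Err("--resume-from requires a value".to_string());
    }
    if value == "latest" {
        return Ok(ResumeTarget::Latest);
    }
    if let Some(path) = value.strip_prefix("savepoint://") {
        if path.is_empty() {
            return Err("savepoint:// requires a path".to_string());
        }
        return Ok(ResumeTarget::Savepoint(path.to_string()));
    }
    // Assumption: every remaining value is a checkpoint URI.
    Ok(ResumeTarget::Checkpoint(value.to_string()))
}
```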

Warning: Savepoints are operator-topology-aware. If operators are added, removed, or reordered between taking a savepoint and starting the recovering job, recovery may fail or restore state incorrectly.
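
One way to guard against this before resuming is to compare the operator list recorded with the savepoint against the job's current operator list. The sketch below is purely illustrative: `check_topology` is a hypothetical helper, and it assumes operators can be identified by stable string names, which this page does not confirm.

```rust
/// Hypothetical pre-flight check: compares the ordered operator names
/// recorded with a savepoint against those of the job being started.
fn check_topology(savepoint_ops: &[&str], job_ops: &[&str]) -> Result<(), String> {
    if savepoint_ops == job_ops {
        return Ok(());
    }
    // Operators present in the job but not in the savepoint.
    let added: Vec<&str> = job_ops
        .iter()
        .copied()
        .filter(|op| !savepoint_ops.contains(op))
        .collect();
    // Operators present in the savepoint but not in the job.
    let removed: Vec<&str> = savepoint_ops
        .iter()
        .copied()
        .filter(|op| !job_ops.contains(op))
        .collect();
    if added.is_empty() && removed.is_empty() {
        // Same set of operators, different order.
        Err("operators were reordered".to_string())
    } else {
        Err(format!("operators added: {:?}, removed: {:?}", added, removed))
    }
}
```

Refusing to start on a mismatch is the conservative choice here, since a silent mismatch is the case that risks incorrect state restoration.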