Scheduler
Coordinator lifecycle, job scheduling, task assignment, and fencing.
Overview
The krishiv-scheduler crate implements the job coordinator. In single-node mode it runs in-process; in distributed mode it runs as a separate daemon accepting Flight/gRPC task-control connections from executors.
Coordinator
The coordinator is the single authoritative owner of job state within an epoch. It:
- Accepts job submissions from sessions.
- Fragments jobs into tasks and assigns them to executors.
- Tracks task liveness and triggers reassignment on failure.
- Issues epoch fences to prevent stale completions from being accepted.
- Writes job/task metadata to an in-memory store (dev) or to etcd (distributed-durable).
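
The liveness and reassignment responsibility above can be sketched as a small tracker. This is a minimal illustration, not the crate's implementation: `LivenessTracker`, `Decision`, and their methods are hypothetical names, task ids are assumed to be `u64`, and a task is assumed to time out once the elapsed time since its last heartbeat reaches the timeout.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical outcome of a liveness sweep for one silent task.
#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    /// Hand the task to another executor; `attempt` counts retries so far.
    Reassign { task_id: u64, attempt: u32 },
    /// Retry budget exhausted; the owning job should be failed.
    FailJob { task_id: u64 },
}

/// Hypothetical per-task bookkeeping.
struct Tracked {
    last_heartbeat: Instant,
    retries: u32,
}

pub struct LivenessTracker {
    timeout: Duration,
    max_retries: u32,
    tasks: HashMap<u64, Tracked>,
}

impl LivenessTracker {
    pub fn new(timeout: Duration, max_retries: u32) -> Self {
        Self { timeout, max_retries, tasks: HashMap::new() }
    }

    /// Start tracking a task at the moment it is assigned.
    pub fn assign(&mut self, task_id: u64, now: Instant) {
        self.tasks
            .entry(task_id)
            .and_modify(|t| t.last_heartbeat = now)
            .or_insert(Tracked { last_heartbeat: now, retries: 0 });
    }

    /// Record a heartbeat; unknown task ids are ignored.
    pub fn heartbeat(&mut self, task_id: u64, now: Instant) {
        if let Some(t) = self.tasks.get_mut(&task_id) {
            t.last_heartbeat = now;
        }
    }

    /// Find tasks whose heartbeat is overdue and decide what to do with them.
    pub fn sweep(&mut self, now: Instant) -> Vec<Decision> {
        let (timeout, max_retries) = (self.timeout, self.max_retries);
        let mut decisions = Vec::new();
        let mut exhausted = Vec::new();
        for (&task_id, t) in self.tasks.iter_mut() {
            if now.duration_since(t.last_heartbeat) < timeout {
                continue; // still alive
            }
            if t.retries < max_retries {
                t.retries += 1;
                t.last_heartbeat = now; // restart the clock for the new assignment
                decisions.push(Decision::Reassign { task_id, attempt: t.retries });
            } else {
                exhausted.push(task_id);
            }
        }
        for task_id in exhausted {
            self.tasks.remove(&task_id); // stop tracking; the job is failing
            decisions.push(Decision::FailJob { task_id });
        }
        decisions
    }
}
```

Passing `now` explicitly rather than calling `Instant::now()` inside the tracker keeps the sketch deterministic under test; a real coordinator would presumably drive `sweep` from a timer.
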
Task Lifecycle
| State | Description |
|---|---|
| Pending | Task created; waiting for an available executor. |
| Assigned | Task sent to an executor; heartbeat required. |
| Running | Executor acknowledged start. |
| Completed | Executor reported success; output is committed. |
| Failed | Executor reported failure or heartbeat timed out. May be retried. |
| Cancelled | Job cancelled by user; task instructed to stop. |
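
The states in the table can be modelled as an enum with an explicit transition check. The table does not spell out which transitions are legal, so the edges below are assumptions read off the descriptions (for example, that a retry moves `Failed` back to `Pending`); `TaskState` and `can_transition_to` are hypothetical names.

```rust
/// Task states from the lifecycle table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TaskState {
    Pending,
    Assigned,
    Running,
    Completed,
    Failed,
    Cancelled,
}

impl TaskState {
    /// Assumed set of legal transitions; everything else is rejected.
    pub fn can_transition_to(self, next: TaskState) -> bool {
        use TaskState::*;
        match (self, next) {
            // Normal forward path.
            (Pending, Assigned) => true,
            (Assigned, Running) => true,
            (Running, Completed) => true,
            // Failure report or heartbeat timeout while on an executor.
            (Assigned | Running, Failed) => true,
            // Retry: a failed task goes back to the queue (assumption).
            (Failed, Pending) => true,
            // User cancellation of work that has not finished (assumption).
            (Pending | Assigned | Running, Cancelled) => true,
            _ => false,
        }
    }

    /// States with no outgoing transitions in this sketch.
    pub fn is_terminal(self) -> bool {
        matches!(self, TaskState::Completed | TaskState::Cancelled)
    }
}
```

Centralizing the check in one `match` makes an illegal move such as `Completed` to `Running` fail in one obvious place instead of being scattered across handlers.
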
Fencing
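Each coordinator epoch is identified by a fence token that increases monotonically from one epoch to the next. Executors include this token in their completion reports. The coordinator rejects stale completions (those carrying a token from a prior epoch or from a replaced coordinator), which prevents double-commit.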
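
The fence check can be illustrated with a few lines of Rust. This is a sketch under stated assumptions: the token is modelled as a `u64` that is bumped by one per epoch, and `FenceToken`, `CompletionReport`, `FenceMismatch`, and `Coordinator` are hypothetical types, not the crate's actual API.

```rust
/// Hypothetical fence token; assumed to grow by one per coordinator epoch.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct FenceToken(pub u64);

/// Hypothetical completion report sent by an executor.
pub struct CompletionReport {
    pub task_id: u64,
    pub fence: FenceToken,
}

/// Returned when a report's token does not match the current epoch.
#[derive(Debug, PartialEq, Eq)]
pub struct FenceMismatch {
    pub reported: FenceToken,
    pub current: FenceToken,
}

pub struct Coordinator {
    current: FenceToken,
    committed: Vec<u64>,
}

impl Coordinator {
    pub fn new(initial: FenceToken) -> Self {
        Self { current: initial, committed: Vec::new() }
    }

    /// Start a new epoch; reports carrying the old token become stale.
    pub fn begin_new_epoch(&mut self) -> FenceToken {
        self.current = FenceToken(self.current.0 + 1);
        self.current
    }

    /// Commit a completion only if it carries the current epoch's token.
    pub fn accept_completion(&mut self, report: &CompletionReport) -> Result<(), FenceMismatch> {
        if report.fence != self.current {
            return Err(FenceMismatch { reported: report.fence, current: self.current });
        }
        self.committed.push(report.task_id);
        Ok(())
    }

    /// Task ids committed so far, in acceptance order.
    pub fn committed(&self) -> &[u64] {
        &self.committed
    }
}
```

In this sketch an executor that was assigned work under token 1 and reports after the coordinator has moved to token 2 gets `FenceMismatch` back, and nothing is committed.
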
Configuration
| Environment Variable | Default | Description |
|---|---|---|
| KRISHIV_COORDINATOR | — | Flight endpoint for remote sessions. |
| KRISHIV_COORDINATOR_BEARER_TOKEN | — | Bearer token for coordinator gRPC auth. |
| KRISHIV_COORDINATOR_BEARER_TOKENS | — | Comma/newline-separated accepted tokens (rotation). |
| KRISHIV_COORDINATOR_BEARER_TOKEN_FILE | — | File-based token for live reload. |
| KRISHIV_COORDINATOR_AUTH_RELOAD_INTERVAL_SECS | 60 | Interval for reloading token file. |
| KRISHIV_EXECUTOR_TASK_BEARER_TOKEN | — | Token for executor task-control gRPC. |
| KRISHIV_MAX_TASK_RETRIES | 3 | Maximum task retries before failing the job. |
| KRISHIV_HEARTBEAT_TIMEOUT_SECS | 30 | Executor heartbeat timeout before reassignment. |
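
As an illustration of how the numeric defaults and the token list in the table might be read, here is a small loader. It is a sketch, not the crate's code: `SchedulerConfig`, `load_config`, and `parse_or` are hypothetical, the lookup is injected as a closure so the example does not touch the process environment, and trimming whitespace around tokens and dropping empty entries are assumptions.

```rust
use std::fmt::Display;
use std::str::FromStr;
use std::time::Duration;

/// Hypothetical subset of the scheduler settings from the table.
#[derive(Debug, PartialEq, Eq)]
pub struct SchedulerConfig {
    pub max_task_retries: u32,
    pub heartbeat_timeout: Duration,
    pub auth_reload_interval: Duration,
    pub accepted_tokens: Vec<String>,
}

/// Parse `key` if it is set, otherwise fall back to `default`.
fn parse_or<T, F>(lookup: &F, key: &str, default: T) -> Result<T, String>
where
    T: FromStr,
    T::Err: Display,
    F: Fn(&str) -> Option<String>,
{
    match lookup(key) {
        None => Ok(default),
        Some(raw) => raw
            .trim()
            .parse::<T>()
            .map_err(|e| format!("{}: {}", key, e)),
    }
}

/// Build the config from any key/value source (environment, map, ...).
pub fn load_config<F>(lookup: F) -> Result<SchedulerConfig, String>
where
    F: Fn(&str) -> Option<String>,
{
    let max_task_retries = parse_or(&lookup, "KRISHIV_MAX_TASK_RETRIES", 3u32)?;
    let heartbeat_secs = parse_or(&lookup, "KRISHIV_HEARTBEAT_TIMEOUT_SECS", 30u64)?;
    let reload_secs =
        parse_or(&lookup, "KRISHIV_COORDINATOR_AUTH_RELOAD_INTERVAL_SECS", 60u64)?;

    // Split on commas and newlines; trimming and skipping blanks are assumptions.
    let accepted_tokens = lookup("KRISHIV_COORDINATOR_BEARER_TOKENS")
        .map(|raw| {
            raw.split(|c: char| c == ',' || c == '\n')
                .map(str::trim)
                .filter(|t| !t.is_empty())
                .map(String::from)
                .collect::<Vec<String>>()
        })
        .unwrap_or_default();

    Ok(SchedulerConfig {
        max_task_retries,
        heartbeat_timeout: Duration::from_secs(heartbeat_secs),
        auth_reload_interval: Duration::from_secs(reload_secs),
        accepted_tokens,
    })
}
```

To read from the real environment, the closure can be `|key| std::env::var(key).ok()`. Returning an error for a malformed number, rather than silently using the default, surfaces configuration typos at startup.
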