Available

Scheduler

Coordinator lifecycle, job scheduling, task assignment, and fencing.

Overview

The krishiv-scheduler crate implements the job coordinator. In single-node mode it runs in-process; in distributed mode it runs as a separate daemon accepting Flight/gRPC task-control connections from executors.

Coordinator

The coordinator is the single authoritative owner of job state within an epoch. It:

  • Accepts job submissions from sessions.
  • Fragments jobs into tasks and assigns them to executors.
  • Tracks task liveness and triggers reassignment on failure.
  • Issues epoch fences to prevent stale completions from being accepted.
  • Writes job/task metadata to an in-memory store (dev) or to etcd (distributed-durable).
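
The liveness-tracking responsibility above can be illustrated with a small sketch. It is not the krishiv-scheduler implementation: the `Liveness` type, its method names, and the choice to key heartbeats by a numeric executor id are all hypothetical. It only shows the general idea of recording the last heartbeat per executor and reporting those that have been silent for longer than a timeout.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical heartbeat tracker; names are illustrative only.
pub struct Liveness {
    timeout: Duration,
    last_seen: HashMap<u64, Instant>,
}

impl Liveness {
    pub fn new(timeout: Duration) -> Self {
        Self { timeout, last_seen: HashMap::new() }
    }

    /// Record a heartbeat from `executor_id` observed at `now`.
    pub fn heartbeat(&mut self, executor_id: u64, now: Instant) {
        self.last_seen.insert(executor_id, now);
    }

    /// Executors whose last heartbeat is older than the timeout at `now`,
    /// sorted by id. Their tasks would be candidates for reassignment.
    pub fn expired(&self, now: Instant) -> Vec<u64> {
        let mut ids: Vec<u64> = self
            .last_seen
            .iter()
            .filter(|(_, seen)| now.saturating_duration_since(**seen) > self.timeout)
            .map(|(id, _)| *id)
            .collect();
        ids.sort_unstable();
        ids
    }
}
```

Passing `now` explicitly rather than calling `Instant::now()` inside the tracker keeps the sketch deterministic and easy to test.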

Task Lifecycle

State      Description
Pending    Task created; waiting for an available executor.
Assigned   Task sent to an executor; heartbeat required.
Running    Executor acknowledged start.
Completed  Executor reported success; output is committed.
Failed     Executor reported failure or heartbeat timed out. May be retried.
Cancelled  Job cancelled by user; task instructed to stop.
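
One way to model these states is an enum with an explicit transition check. The sketch below is an assumption-laden illustration: the `TaskState` enum and `can_transition_to` method are hypothetical names, and the set of allowed transitions is inferred from the state descriptions above rather than taken from the crate's source.

```rust
/// Task states as listed in the lifecycle table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TaskState {
    Pending,
    Assigned,
    Running,
    Completed,
    Failed,
    Cancelled,
}

impl TaskState {
    /// Whether moving from `self` to `next` is allowed in this sketch.
    /// Assumption: transitions are inferred from the state descriptions.
    pub fn can_transition_to(self, next: TaskState) -> bool {
        use TaskState::*;
        matches!(
            (self, next),
            (Pending, Assigned)          // sent to an executor
                | (Assigned, Running)    // executor acknowledged start
                | (Running, Completed)   // executor reported success
                | (Assigned, Failed)     // heartbeat timed out before start
                | (Running, Failed)      // failure report or heartbeat timeout
                | (Failed, Pending)      // retry: back to the queue
                | (Pending, Cancelled)
                | (Assigned, Cancelled)
                | (Running, Cancelled)
        )
    }
}
```

Encoding the transitions in one `matches!` expression keeps the allowed edges visible in a single place, so an illegal move such as `Completed` to `Running` is rejected by construction.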

Fencing

Each coordinator epoch has a monotonically increasing fence token. Executors include this token in completion reports. The coordinator rejects stale completions (those carrying a token from a prior epoch or a replaced coordinator), which prevents double-commit.
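
The fence check can be sketched as a token comparison on every completion report. This is a minimal illustration, not the crate's API: `Coordinator`, `StaleFence`, `begin_epoch`, and `accept_completion` are hypothetical names, and the token is assumed to be a plain `u64` counter that is bumped once per epoch.

```rust
/// Error returned when a report carries a token from an older epoch.
#[derive(Debug, PartialEq, Eq)]
pub struct StaleFence {
    pub reported: u64,
    pub current: u64,
}

/// Hypothetical coordinator state holding the current fence token.
pub struct Coordinator {
    fence: u64,
    committed: Vec<u64>,
}

impl Coordinator {
    pub fn new() -> Self {
        Self { fence: 0, committed: Vec::new() }
    }

    /// Start a new epoch and return its fence token.
    /// Assumption: tokens are strictly increasing integers.
    pub fn begin_epoch(&mut self) -> u64 {
        self.fence += 1;
        self.fence
    }

    /// Accept a completion only if it carries the current epoch's token.
    pub fn accept_completion(&mut self, task_id: u64, fence: u64) -> Result<(), StaleFence> {
        if fence != self.fence {
            return Err(StaleFence { reported: fence, current: self.fence });
        }
        self.committed.push(task_id);
        Ok(())
    }

    /// Task ids whose output has been committed so far.
    pub fn committed(&self) -> &[u64] {
        &self.committed
    }
}
```

In this sketch, an executor that was handed a token before a coordinator change would see its late report rejected with `StaleFence`, so the task's output is committed at most once per accepted epoch.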

Configuration

Environment Variable                           Default  Description
KRISHIV_COORDINATOR                            -        Flight endpoint for remote sessions.
KRISHIV_COORDINATOR_BEARER_TOKEN               -        Bearer token for coordinator gRPC auth.
KRISHIV_COORDINATOR_BEARER_TOKENS              -        Comma/newline-separated accepted tokens (rotation).
KRISHIV_COORDINATOR_BEARER_TOKEN_FILE          -        File-based token for live reload.
KRISHIV_COORDINATOR_AUTH_RELOAD_INTERVAL_SECS  60       Interval for reloading token file.
KRISHIV_EXECUTOR_TASK_BEARER_TOKEN             -        Token for executor task-control gRPC.
KRISHIV_MAX_TASK_RETRIES                       3        Maximum task retries before failing the job.
KRISHIV_HEARTBEAT_TIMEOUT_SECS                 30       Executor heartbeat timeout before reassignment.
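
As an illustration of how the numeric settings and the token list might be read, here is a hedged sketch. The `SchedulerConfig` struct, `from_lookup`, `from_env`, and `parse_tokens` are hypothetical names, and silently falling back to the default on a missing or unparsable value is an assumption; the actual crate may report an error instead. Only the variable names and default values come from the table above.

```rust
use std::str::FromStr;
use std::time::Duration;

/// Hypothetical view of the numeric settings from the table.
#[derive(Debug, PartialEq, Eq)]
pub struct SchedulerConfig {
    pub max_task_retries: u32,
    pub heartbeat_timeout: Duration,
    pub auth_reload_interval: Duration,
}

/// Parse `raw` if present and valid, otherwise use `default`.
fn parse_or<T: FromStr>(raw: Option<String>, default: T) -> T {
    raw.and_then(|s| s.trim().parse().ok()).unwrap_or(default)
}

impl SchedulerConfig {
    /// Build the config from any key lookup, which keeps tests independent
    /// of the process environment.
    pub fn from_lookup(get: impl Fn(&str) -> Option<String>) -> Self {
        Self {
            max_task_retries: parse_or(get("KRISHIV_MAX_TASK_RETRIES"), 3),
            heartbeat_timeout: Duration::from_secs(parse_or(
                get("KRISHIV_HEARTBEAT_TIMEOUT_SECS"),
                30,
            )),
            auth_reload_interval: Duration::from_secs(parse_or(
                get("KRISHIV_COORDINATOR_AUTH_RELOAD_INTERVAL_SECS"),
                60,
            )),
        }
    }

    /// Read the same settings from the process environment.
    pub fn from_env() -> Self {
        Self::from_lookup(|key| std::env::var(key).ok())
    }
}

/// Split a KRISHIV_COORDINATOR_BEARER_TOKENS value on commas and newlines,
/// trimming whitespace and dropping empty entries.
pub fn parse_tokens(raw: &str) -> Vec<String> {
    raw.split(|c: char| c == ',' || c == '\n')
        .map(str::trim)
        .filter(|t| !t.is_empty())
        .map(String::from)
        .collect()
}
```

Calling `SchedulerConfig::from_env()` with none of the variables set would yield 3 retries, a 30-second heartbeat timeout, and a 60-second reload interval, matching the defaults in the table.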