Available

Shuffle

Shuffle service configuration for local, disk, and distributed deployments.

Overview

Shuffle moves data between tasks across partition boundaries. The krishiv-shuffle crate provides pluggable shuffle backends selected by the active durability profile.

Shuffle Backends

Backend	Durability Profile	Description
In-memory	`dev-local`	Arrow batches in a tokio channel. Zero durability; fastest path.
Local disk	`single-node-durable`	Batches spilled to local disk under `KRISHIV_SHUFFLE_DIR`. Survives a task restart on the same host.
Object store	`distributed-durable`	Batches written to S3/ADLS/GCS. Enables cross-host executor reassignment.
Flight	Any distributed	Direct Arrow Flight streaming between executor pairs (low-latency; no intermediate persistence).
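
The profile-to-backend mapping in the table can be sketched as a small lookup. This is an illustration only: `ShuffleBackend` and `default_backend` are hypothetical names, not the krishiv-shuffle API. Flight is left out because the table does not tie it to one specific profile.

```rust
/// Persisting backends from the table above (hypothetical type, for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ShuffleBackend {
    InMemory,
    LocalDisk,
    ObjectStore,
}

/// Maps a durability profile name to the backend listed for it in the table.
/// Returns `None` for a profile name the table does not list.
pub fn default_backend(profile: &str) -> Option<ShuffleBackend> {
    match profile {
        "dev-local" => Some(ShuffleBackend::InMemory),
        "single-node-durable" => Some(ShuffleBackend::LocalDisk),
        "distributed-durable" => Some(ShuffleBackend::ObjectStore),
        _ => None,
    }
}
```

For example, `default_backend("single-node-durable")` yields `Some(ShuffleBackend::LocalDisk)`, and an unlisted profile name yields `None` rather than silently falling back to a backend.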

Configuration

Environment Variable	Default	Description
`KRISHIV_SHUFFLE_DIR`	`/tmp/krishiv-shuffle`	Local disk spill directory (single-node profile).
`KRISHIV_SHUFFLE_OBJECT_STORE_URI`	—	Object store URI for distributed shuffle (e.g. `s3://bucket/shuffle/`).
`KRISHIV_SHUFFLE_PARTITIONS`	`auto`	Number of shuffle partitions. `auto` resolves to `executor_count × 2`.
`KRISHIV_SHUFFLE_BATCH_SIZE`	`8192`	Maximum rows per shuffle batch.
`KRISHIV_SHUFFLE_COMPRESS`	`lz4`	Shuffle payload compression: `lz4`, `zstd`, or `none`.
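
As a minimal sketch of how these variables and their defaults might be resolved, the following uses only the Rust standard library. The `ShuffleConfig` struct and the functions `resolve_partitions`, `parse_compression`, `from_lookup`, and `from_env` are hypothetical names introduced for illustration; the crate's actual configuration types may differ. The variable names and defaults come from the table above.

```rust
use std::path::PathBuf;

/// Compression choices from the table (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Compression {
    Lz4,
    Zstd,
    Uncompressed,
}

/// Resolved shuffle settings (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ShuffleConfig {
    pub dir: PathBuf,
    pub object_store_uri: Option<String>,
    pub partitions: usize,
    pub batch_size: usize,
    pub compress: Compression,
}

/// Unset or `auto` resolves to `executor_count * 2`; otherwise the value must be an integer.
pub fn resolve_partitions(raw: Option<&str>, executor_count: usize) -> Result<usize, String> {
    match raw.map(str::trim) {
        None | Some("auto") => Ok(executor_count * 2),
        Some(s) => s
            .parse::<usize>()
            .map_err(|e| format!("invalid KRISHIV_SHUFFLE_PARTITIONS {s:?}: {e}")),
    }
}

/// Unset resolves to the documented default, `lz4`.
pub fn parse_compression(raw: Option<&str>) -> Result<Compression, String> {
    match raw.map(str::trim) {
        None | Some("lz4") => Ok(Compression::Lz4),
        Some("zstd") => Ok(Compression::Zstd),
        Some("none") => Ok(Compression::Uncompressed),
        Some(other) => Err(format!("unknown KRISHIV_SHUFFLE_COMPRESS value {other:?}")),
    }
}

/// Builds a config from any key lookup, which keeps the logic testable
/// without mutating the process environment.
pub fn from_lookup<F>(get: F, executor_count: usize) -> Result<ShuffleConfig, String>
where
    F: Fn(&str) -> Option<String>,
{
    let batch_size = match get("KRISHIV_SHUFFLE_BATCH_SIZE") {
        None => 8192,
        Some(s) => s
            .trim()
            .parse::<usize>()
            .map_err(|e| format!("invalid KRISHIV_SHUFFLE_BATCH_SIZE {s:?}: {e}"))?,
    };
    Ok(ShuffleConfig {
        dir: PathBuf::from(
            get("KRISHIV_SHUFFLE_DIR").unwrap_or_else(|| "/tmp/krishiv-shuffle".to_string()),
        ),
        object_store_uri: get("KRISHIV_SHUFFLE_OBJECT_STORE_URI"),
        partitions: resolve_partitions(
            get("KRISHIV_SHUFFLE_PARTITIONS").as_deref(),
            executor_count,
        )?,
        batch_size,
        compress: parse_compression(get("KRISHIV_SHUFFLE_COMPRESS").as_deref())?,
    })
}

/// Reads the real process environment.
pub fn from_env(executor_count: usize) -> Result<ShuffleConfig, String> {
    from_lookup(|key| std::env::var(key).ok(), executor_count)
}
```

With no variables set and four executors, `from_env(4)` resolves to the defaults in the table with 8 partitions. Rejecting unknown compression values outright, instead of falling back to `lz4`, is a design choice of this sketch and not documented behavior.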

Partitioning

Krishiv supports three shuffle partitioning strategies:

Hash partitioning: rows are hashed on one or more key columns. Used for GROUP BY, joins, and keyed operators.
Range partitioning: rows are sorted and split at boundary values. Used for range-based aggregations.
Broadcast: one partition is replicated to all downstream tasks. Used for small build-side joins.
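
The three strategies above can be sketched in a few lines of standard-library Rust. This is a simplified illustration over plain values instead of Arrow batches. The function names are hypothetical, and the use of `DefaultHasher` is an assumption of the sketch, since the source does not say which hash function Krishiv uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash partitioning: equal keys always land in the same partition.
/// `DefaultHasher::new()` is stable within one build but is not guaranteed
/// to be stable across Rust releases, so a real system would pin a hash function.
/// Panics if `num_partitions` is zero.
pub fn hash_partition<K: Hash>(key: &K, num_partitions: usize) -> usize {
    assert!(num_partitions > 0, "num_partitions must be positive");
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() % num_partitions as u64) as usize
}

/// Range partitioning: `boundaries` must be sorted ascending and yields
/// `boundaries.len() + 1` partitions. A key equal to a boundary goes to
/// the partition above that boundary.
pub fn range_partition<K: Ord>(key: &K, boundaries: &[K]) -> usize {
    // Counts how many boundaries are <= key, via binary search.
    boundaries.partition_point(|b| b <= key)
}

/// Broadcast: every downstream task receives a full copy of the partition.
pub fn broadcast<T: Clone>(partition: &[T], num_tasks: usize) -> Vec<Vec<T>> {
    vec![partition.to_vec(); num_tasks]
}
```

With boundaries `[10, 20]`, `range_partition` places key `5` in partition 0, key `10` in partition 1, and key `25` in partition 2. Broadcast copies the whole partition once per task, which is why it suits only small build sides.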
