ProductDocsArchitectureBlogGitHubGitHubGet Started
Available

Shuffle

Shuffle service configuration for local, disk, and distributed deployments.

Overview

Shuffle moves data between tasks across partition boundaries. The krishiv-shuffle crate provides pluggable shuffle backends selected by the active durability profile.

Shuffle Backends

BackendDurability ProfileDescription
In-memorydev-localArrow batches in a tokio channel. Zero durability; fastest path.
Local disksingle-node-durableBatches spilled to local disk under KRISHIV_SHUFFLE_DIR. Survives task restart on same host.
Object storedistributed-durableBatches written to S3/ADLS/GCS. Enables cross-host executor reassignment.
FlightAny distributedDirect Arrow Flight streaming between executor pairs (low-latency; no intermediate persistence).

Configuration

Environment VariableDefaultDescription
KRISHIV_SHUFFLE_DIR/tmp/krishiv-shuffleLocal disk spill directory (single-node profile).
KRISHIV_SHUFFLE_OBJECT_STORE_URIObject store URI for distributed shuffle (e.g. s3://bucket/shuffle/).
KRISHIV_SHUFFLE_PARTITIONSautoNumber of shuffle partitions. auto = executor_count × 2.
KRISHIV_SHUFFLE_BATCH_SIZE8192Maximum rows per shuffle batch.
KRISHIV_SHUFFLE_COMPRESSlz4Shuffle payload compression: lz4 | zstd | none.

Partitioning

Krishiv supports three shuffle partitioning strategies:

  • Hash partitioning: rows are hashed on one or more key columns. Used for GROUP BY, joins, and keyed operators.
  • Range partitioning: rows are sorted and split at boundary values. Used for range-based aggregations.
  • Broadcast: one partition is replicated to all downstream tasks. Used for small build-side joins.