DataFrame | Krishiv

Overview

DataFrame is lazy — operations build a logical plan. Execution happens on collect(), show(), or write_* methods.

Transformation Methods

Method	Returns	Description
`select(*cols)`	`DataFrame`	Project columns or Column expressions.
`select_columns(*names)`	`DataFrame`	Project by column name strings.
`select_exprs(*exprs)`	`DataFrame`	Project using SQL expression strings.
`filter(condition)`	`DataFrame`	Apply a boolean Column or SQL string condition.
`filter_column(col_name, condition)`	`DataFrame`	Filter using a named column and a Column condition.
`group_by(*cols)`	`GroupedDataFrame`	Group by columns for aggregation.
`group_by_columns(*names)`	`GroupedDataFrame`	Group by column name strings.
`order_by(*cols, ascending=True)`	`DataFrame`	Sort by columns or Column expressions.
`sort(*cols)`	`DataFrame`	Alias for `order_by`.
`limit(n: int)`	`DataFrame`	Keep at most `n` rows.
`join(other, on, how='inner')`	`DataFrame`	Join with another DataFrame.
`join_on(other, condition, how='inner')`	`DataFrame`	Join using a Column condition expression.
`union(other)`	`DataFrame`	UNION ALL (preserves duplicates).
`union_distinct(other)`	`DataFrame`	UNION DISTINCT.
`intersect(other)`	`DataFrame`	INTERSECT DISTINCT.
`intersect_distinct(other)`	`DataFrame`	Alias for `intersect`.
`except_(other)`	`DataFrame`	EXCEPT DISTINCT.
`except_distinct(other)`	`DataFrame`	Alias for `except_`.
`distinct()`	`DataFrame`	Remove duplicate rows.
`with_column(name, col_expr)`	`DataFrame`	Add or replace a column.
`rename(old, new)`	`DataFrame`	Rename a column.
`drop_columns(*names)`	`DataFrame`	Remove columns by name.
`drop_nulls(*cols)`	`DataFrame`	Drop rows with nulls in any (or specific) columns.
`fill_null(value, *cols)`	`DataFrame`	Fill null values with a constant or per-column map.
`sample(fraction, seed=None)`	`DataFrame`	Random row sampling.
`repartition(n)`	`DataFrame`	Set the output partition count.
`alias(name)`	`DataFrame`	Alias the DataFrame as a subquery name.
`pivot(index, columns, values, agg='first')`	`DataFrame`	Pivot rows to columns.
`unpivot(ids, values, var_col, val_col)`	`DataFrame`	Melt columns into rows.
`cache()`	`DataFrame`	Materialise and cache the result in memory.
`persist()`	`DataFrame`	Same as `cache()`.
`unpersist()`	`DataFrame`	Release cached data.
`to_streaming()`	`StreamingDataFrame`	Convert to a structured streaming DataFrame.

Execution Methods

Method	Returns	Description
`collect() -> list[RecordBatch]`	`list`	Execute and collect all Arrow batches.
`collect_async() -> Awaitable`	`Awaitable`	Async collect for use in async contexts.
`collect_batches() -> list[RecordBatch]`	`list`	Alias for `collect()`.
`collect_pretty() -> str`	`str`	Execute and return a formatted ASCII table string.
`collect_with_stats() -> tuple`	`tuple`	Execute and return `(batches, ExecutionStats)`.
`show(n=20)`	`None`	Execute and print the first `n` rows.
`describe() -> DataFrame`	`DataFrame`	Return summary statistics (count, mean, std, min, max).
`num_rows() -> int`	`int`	Execute and return the row count.
`execute_stream_async() -> Awaitable`	`Awaitable`	Async execute returning a batch iterator.

Schema / Metadata

Method	Returns	Description
`schema() -> Schema`	`Schema`	Return the Arrow schema of the output.
`columns() -> list[str]`	`list[str]`	Return column names.
`is_bounded() -> bool`	`bool`	True for finite/batch DataFrames.
`boundedness() -> str`	`str`	"Bounded" or "Unbounded".
`explain(verbose=False) -> str`	`str`	Return the query plan as a string.
`explain_logical() -> str`	`str`	Return the logical plan only.
`explain_mode(mode) -> str`	`str`	Return plan in a specific format mode.
`create_or_replace_temp_view(name)`	`None`	Register the DataFrame as a named SQL temp view.

Write Methods

Method	Description
`write_parquet(path, compression=None)`	Write to a local Parquet file.
`write_parquet_with_options(path, opts)`	Write with typed options (compression, row_group_size).
`write_csv(path)`	Write to a CSV file.
`write_csv_with_options(path, opts)`	Write CSV with options (delimiter, has_header).
`write_json(path)`	Write as NDJSON.
`write_file(path)`	Write to a file (format inferred from extension).
`write_stream() -> DataStreamWriter`	Get a structured streaming writer for this DataFrame.