DataFrame
DataFrame class — lazy query plan builder for batch and streaming.
Overview
DataFrame is lazy — operations build a logical plan. Execution happens on collect(), show(), or write_* methods.
Transformation Methods
| Method | Returns | Description |
|---|---|---|
select(*cols) | DataFrame | Project columns or Column expressions. |
select_columns(*names) | DataFrame | Project by column name strings. |
select_exprs(*exprs) | DataFrame | Project using SQL expression strings. |
filter(condition) | DataFrame | Apply a boolean Column or SQL string condition. |
filter_column(col_name, condition) | DataFrame | Filter using a named column and a Column condition. |
group_by(*cols) | GroupedDataFrame | Group by columns for aggregation. |
group_by_columns(*names) | GroupedDataFrame | Group by column name strings. |
order_by(*cols, ascending=True) | DataFrame | Sort by columns or Column expressions. |
sort(*cols) | DataFrame | Alias for order_by. |
limit(n: int) | DataFrame | Keep at most n rows. |
join(other, on, how='inner') | DataFrame | Join with another DataFrame. |
join_on(other, condition, how='inner') | DataFrame | Join using a Column condition expression. |
union(other) | DataFrame | UNION ALL (preserves duplicates). |
union_distinct(other) | DataFrame | UNION DISTINCT. |
intersect(other) | DataFrame | INTERSECT DISTINCT. |
intersect_distinct(other) | DataFrame | Alias for intersect. |
except_(other) | DataFrame | EXCEPT DISTINCT. |
except_distinct(other) | DataFrame | Alias for except_. |
distinct() | DataFrame | Remove duplicate rows. |
with_column(name, col_expr) | DataFrame | Add or replace a column. |
rename(old, new) | DataFrame | Rename a column. |
drop_columns(*names) | DataFrame | Remove columns by name. |
drop_nulls(*cols) | DataFrame | Drop rows with nulls in any (or specific) columns. |
fill_null(value, *cols) | DataFrame | Fill null values with a constant or per-column map. |
sample(fraction, seed=None) | DataFrame | Random row sampling. |
repartition(n) | DataFrame | Set the output partition count. |
alias(name) | DataFrame | Alias the DataFrame as a subquery name. |
pivot(index, columns, values, agg='first') | DataFrame | Pivot rows to columns. |
unpivot(ids, values, var_col, val_col) | DataFrame | Melt columns into rows. |
cache() | DataFrame | Materialise and cache the result in memory. |
persist() | DataFrame | Same as cache(). |
unpersist() | DataFrame | Release cached data. |
to_streaming() | StreamingDataFrame | Convert to a structured streaming DataFrame. |
Execution Methods
| Method | Returns | Description |
|---|---|---|
collect() -> list[RecordBatch] | list | Execute and collect all Arrow batches. |
collect_async() -> Awaitable | Awaitable | Async collect for use in async contexts. |
collect_batches() -> list[RecordBatch] | list | Alias for collect(). |
collect_pretty() -> str | str | Execute and return a formatted ASCII table string. |
collect_with_stats() -> tuple | tuple | Execute and return (batches, ExecutionStats). |
show(n=20) | None | Execute and print the first n rows. |
describe() -> DataFrame | DataFrame | Return summary statistics (count, mean, std, min, max). |
num_rows() -> int | int | Execute and return the row count. |
execute_stream_async() -> Awaitable | Awaitable | Async execute returning a batch iterator. |
Schema / Metadata
| Method | Returns | Description |
|---|---|---|
schema() -> Schema | Schema | Return the Arrow schema of the output. |
columns() -> list[str] | list[str] | Return column names. |
is_bounded() -> bool | bool | True for finite/batch DataFrames. |
boundedness() -> str | str | "Bounded" or "Unbounded". |
explain(verbose=False) -> str | str | Return the query plan as a string. |
explain_logical() -> str | str | Return the logical plan only. |
explain_mode(mode) -> str | str | Return plan in a specific format mode. |
create_or_replace_temp_view(name) | None | Register the DataFrame as a named SQL temp view. |
Write Methods
| Method | Description |
|---|---|
write_parquet(path, compression=None) | Write to a local Parquet file. |
write_parquet_with_options(path, opts) | Write with typed options (compression, row_group_size). |
write_csv(path) | Write to a CSV file. |
write_csv_with_options(path, opts) | Write CSV with options (delimiter, has_header). |
write_json(path) | Write as NDJSON. |
write_file(path) | Write to a file (format inferred from extension). |
write_stream() -> DataStreamWriter | Get a structured streaming writer for this DataFrame. |