ProductDocsArchitectureBlogGitHubGitHubGet Started
Available

DataFrame

DataFrame class — lazy query plan builder for batch and streaming.

Overview

DataFrame is lazy — operations build a logical plan. Execution happens on collect(), show(), or write_* methods.

Transformation Methods

MethodReturnsDescription
select(*cols)DataFrameProject columns or Column expressions.
select_columns(*names)DataFrameProject by column name strings.
select_exprs(*exprs)DataFrameProject using SQL expression strings.
filter(condition)DataFrameApply a boolean Column or SQL string condition.
filter_column(col_name, condition)DataFrameFilter using a named column and a Column condition.
group_by(*cols)GroupedDataFrameGroup by columns for aggregation.
group_by_columns(*names)GroupedDataFrameGroup by column name strings.
order_by(*cols, ascending=True)DataFrameSort by columns or Column expressions.
sort(*cols)DataFrameAlias for order_by.
limit(n: int)DataFrameKeep at most n rows.
join(other, on, how='inner')DataFrameJoin with another DataFrame.
join_on(other, condition, how='inner')DataFrameJoin using a Column condition expression.
union(other)DataFrameUNION ALL (preserves duplicates).
union_distinct(other)DataFrameUNION DISTINCT.
intersect(other)DataFrameINTERSECT DISTINCT.
intersect_distinct(other)DataFrameAlias for intersect.
except_(other)DataFrameEXCEPT DISTINCT.
except_distinct(other)DataFrameAlias for except_.
distinct()DataFrameRemove duplicate rows.
with_column(name, col_expr)DataFrameAdd or replace a column.
rename(old, new)DataFrameRename a column.
drop_columns(*names)DataFrameRemove columns by name.
drop_nulls(*cols)DataFrameDrop rows with nulls in any (or specific) columns.
fill_null(value, *cols)DataFrameFill null values with a constant or per-column map.
sample(fraction, seed=None)DataFrameRandom row sampling.
repartition(n)DataFrameSet the output partition count.
alias(name)DataFrameAlias the DataFrame as a subquery name.
pivot(index, columns, values, agg='first')DataFramePivot rows to columns.
unpivot(ids, values, var_col, val_col)DataFrameMelt columns into rows.
cache()DataFrameMaterialise and cache the result in memory.
persist()DataFrameSame as cache().
unpersist()DataFrameRelease cached data.
to_streaming()StreamingDataFrameConvert to a structured streaming DataFrame.

Execution Methods

MethodReturnsDescription
collect() -> list[RecordBatch]listExecute and collect all Arrow batches.
collect_async() -> AwaitableAwaitableAsync collect for use in async contexts.
collect_batches() -> list[RecordBatch]listAlias for collect().
collect_pretty() -> strstrExecute and return a formatted ASCII table string.
collect_with_stats() -> tupletupleExecute and return (batches, ExecutionStats).
show(n=20)NoneExecute and print the first n rows.
describe() -> DataFrameDataFrameReturn summary statistics (count, mean, std, min, max).
num_rows() -> intintExecute and return the row count.
execute_stream_async() -> AwaitableAwaitableAsync execute returning a batch iterator.

Schema / Metadata

MethodReturnsDescription
schema() -> SchemaSchemaReturn the Arrow schema of the output.
columns() -> list[str]list[str]Return column names.
is_bounded() -> boolboolTrue for finite/batch DataFrames.
boundedness() -> strstr"Bounded" or "Unbounded".
explain(verbose=False) -> strstrReturn the query plan as a string.
explain_logical() -> strstrReturn the logical plan only.
explain_mode(mode) -> strstrReturn plan in a specific format mode.
create_or_replace_temp_view(name)NoneRegister the DataFrame as a named SQL temp view.

Write Methods

MethodDescription
write_parquet(path, compression=None)Write to a local Parquet file.
write_parquet_with_options(path, opts)Write with typed options (compression, row_group_size).
write_csv(path)Write to a CSV file.
write_csv_with_options(path, opts)Write CSV with options (delimiter, has_header).
write_json(path)Write as NDJSON.
write_file(path)Write to a file (format inferred from extension).
write_stream() -> DataStreamWriterGet a structured streaming writer for this DataFrame.