Parquet & Object Store
Reading and writing Parquet on local disk, S3, Azure ADLS, and GCS.
Local Parquet
-- Register a local Parquet file as a SQL table
CREATE EXTERNAL TABLE orders
STORED AS PARQUET
LOCATION 'data/orders.parquet';
-- Or a directory of Parquet files
CREATE EXTERNAL TABLE orders_partitioned
STORED AS PARQUET
LOCATION 'data/orders/';
S3 (AWS or S3-compatible)
CREATE EXTERNAL TABLE orders_s3
STORED AS PARQUET
LOCATION 's3://my-bucket/data/orders/'
OPTIONS (
'aws.region' = 'us-east-1',
'aws.access_key_id' = '...',
'aws.secret_access_key' = '...'
);
Credentials can also be supplied through the environment variables AWS_REGION, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY, or through instance metadata (EKS IRSA).
Azure ADLS Gen2
CREATE EXTERNAL TABLE orders_adls
STORED AS PARQUET
LOCATION 'abfss://mycontainer@myaccount.dfs.core.windows.net/path/'
OPTIONS (
'azure.account_name' = 'myaccount',
'azure.account_key' = '...'
);
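The ADLS location packs two names into one URL: the container sits before the `@`, and the storage account is the first label of the host. As a rough sketch using only the Python standard library, the URL can be taken apart to check that it agrees with the `azure.account_name` option. `split_abfss` is a hypothetical helper written for this illustration, not part of any library API.

```python
from urllib.parse import urlsplit


def split_abfss(location):
    """Split an abfs:// or abfss:// URL into (container, account, path).

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(location)
    if parts.scheme not in ("abfs", "abfss"):
        raise ValueError(f"not an ABFS URL: {location!r}")
    if not parts.username or not parts.hostname:
        raise ValueError(f"expected container@account host in {location!r}")
    # The container is the part before '@'; urlsplit exposes it as `username`.
    container = parts.username
    # The account is the first label of the host,
    # e.g. "myaccount" in "myaccount.dfs.core.windows.net".
    account = parts.hostname.split(".", 1)[0]
    return container, account, parts.path


container, account, path = split_abfss(
    "abfss://mycontainer@myaccount.dfs.core.windows.net/path/"
)
print(container, account, path)
```

A check such as `account == options["azure.account_name"]` before issuing the statement can catch a mismatched account early, assuming the two values are expected to match as they do in the example above.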
Google Cloud Storage
CREATE EXTERNAL TABLE orders_gcs
STORED AS PARQUET
LOCATION 'gs://my-gcs-bucket/path/'
OPTIONS (
'gcp.service_account_path' = '/path/to/service-account.json'
);
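The statements above all share one shape: a table name, a LOCATION, and an optional OPTIONS list. When the values come from configuration or a secrets store, the statement text can be generated rather than written by hand. The following is a minimal sketch; `sql_quote` and `external_table_ddl` are hypothetical helpers, and doubling single quotes is the standard SQL escape, assumed here to apply to this dialect as well.

```python
def sql_quote(value):
    """Wrap a value in single quotes, doubling any embedded single quote."""
    return "'" + str(value).replace("'", "''") + "'"


def external_table_ddl(name, location, options=None):
    """Build a CREATE EXTERNAL TABLE ... STORED AS PARQUET statement.

    Hypothetical helper for illustration only.
    """
    # Only plain identifiers are accepted, so the name can be used unquoted.
    if not name.isidentifier():
        raise ValueError(f"unsafe table name: {name!r}")
    lines = [
        f"CREATE EXTERNAL TABLE {name}",
        "STORED AS PARQUET",
        f"LOCATION {sql_quote(location)}",
    ]
    if options:
        pairs = ",\n".join(
            f"  {sql_quote(key)} = {sql_quote(val)}" for key, val in options.items()
        )
        lines.append(f"OPTIONS (\n{pairs}\n)")
    return "\n".join(lines) + ";"


print(
    external_table_ddl(
        "orders_gcs",
        "gs://my-gcs-bucket/path/",
        {"gcp.service_account_path": "/path/to/service-account.json"},
    )
)
```

The resulting string can be passed to whatever SQL entry point the session exposes. Keeping the quoting in one place avoids malformed statements when a path or key contains a single quote.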
Rust API
// Local
let df = session.read_parquet("data/orders.parquet").await?;
// S3 (via registered object store)
session.register_s3_object_store("my-bucket", s3_config)?;
session.register_parquet("orders", "s3://my-bucket/orders/").await?;
let df = session.table("orders")?;
Python API
import krishiv as ks
session = ks.Session.embedded()
# Local
df = session.read_parquet("data/orders.parquet")
df.show()
# Register and use as SQL table
session.register_parquet("orders", "data/orders.parquet")
session.sql("SELECT * FROM orders LIMIT 5").show()
# Write
df.write_parquet("/tmp/output.parquet")
Write Options
| Option | Default | Description |
|---|---|---|
| compression | snappy | One of snappy, gzip, zstd, lz4, none. |
| row_group_size | 1048576 | Rows per Parquet row group. |
| write_batch_size | 1024 | Rows per Arrow write batch. |
| max_row_group_size | 1048576 | Maximum row group size. |
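
To get a feel for what the row group default means for a given dataset, the arithmetic can be sketched in a few lines of Python. This assumes the writer fills each row group up to `row_group_size` rows before starting the next one; a real writer may also flush on other limits, so the result is an estimate. `row_group_sizes` is a hypothetical helper, not a library function.

```python
def row_group_sizes(total_rows, row_group_size=1_048_576):
    """Return the row count of each row group for a file of total_rows rows.

    Assumes row groups are filled to row_group_size in order (illustrative only).
    """
    if total_rows < 0 or row_group_size <= 0:
        raise ValueError("total_rows must be >= 0 and row_group_size > 0")
    full_groups, remainder = divmod(total_rows, row_group_size)
    sizes = [row_group_size] * full_groups
    if remainder:
        # The last row group holds whatever is left over.
        sizes.append(remainder)
    return sizes


sizes = row_group_sizes(2_500_000)
print(len(sizes), sizes)  # 3 [1048576, 1048576, 402848]
```

Under this assumption, a 2.5 million row file written with the default ends up with two full row groups and one partial group.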