Write to multiple files
Terality provides two additional methods that store a DataFrame in multiple files, as CSV or Parquet: DataFrame.to_csv_folder and DataFrame.to_parquet_folder.
In addition to the original parameters of their respective counterparts, to_csv and to_parquet, these methods accept four additional parameters that let you choose how to split the DataFrame and store it in several files.
DataFrame.to_csv_folder(
path_or_buf=None,
num_files=None, # new
num_rows_per_file=None, # new
in_memory_file_size=None, # new
with_leading_zeros=False, # new
sep=',',
na_rep='',
float_format=None,
columns=None,
header=True,
index=True,
index_label=None,
mode='w',
encoding=None,
compression='infer',
quoting=None,
quotechar='"',
line_terminator=None,
chunksize=None,
date_format=None,
doublequote=True,
escapechar=None,
decimal='.',
errors='strict',
storage_options=None
) -> None
DataFrame.to_parquet_folder(
path=None,
num_files=None, # new
num_rows_per_file=None, # new
in_memory_file_size=None, # new
with_leading_zeros=False, # new
engine='auto',
compression='snappy',
index=None,
partition_cols=None,
storage_options=None,
**kwargs
) -> None
path/path_or_buf:
The location to store the files. The basename must contain the special character *, which is replaced by the file number. Example: path="path/to/folder/file_name_*.parquet".
num_files: Optional[int]
-> The number of output files.
num_rows_per_file: Optional[int]
-> The number of rows in each output file. The total number of files is deduced from the number of rows in the DataFrame, so the last file may contain fewer rows.
in_memory_file_size: Optional[int]
-> The in-memory size, in megabytes, of each chunk of the input DataFrame to save. The total number of files is deduced from the memory size of the DataFrame. If none of num_files, num_rows_per_file or in_memory_file_size is set, chunks of 1GB (the maximum size allowed) are used.
with_leading_zeros: Optional[bool]
-> Whether the numbers in file names should have leading zeros so that all file names have the same length. Default: False.
Other parameters are strictly identical to DataFrame.to_csv and DataFrame.to_parquet.
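To make the relationship between the three splitting parameters concrete, here is a rough sketch in plain Python of how the number of output files could be deduced. It is not Terality's implementation: the helper estimate_num_files is hypothetical, the precedence used when several parameters are set at once is an assumption, and "1GB" is taken to mean 1000 MB.

```python
import math
from typing import Optional


def estimate_num_files(
    total_rows: int,
    total_memory_mb: float,
    num_files: Optional[int] = None,
    num_rows_per_file: Optional[int] = None,
    in_memory_file_size: Optional[int] = None,
    default_chunk_mb: int = 1000,  # assumption: "1GB" interpreted as 1000 MB
) -> int:
    """Hypothetical helper: estimate how many files a split would produce."""
    # Assumed precedence: num_files, then num_rows_per_file, then size.
    if num_files is not None:
        return num_files
    if num_rows_per_file is not None:
        # The last file holds the remaining rows, hence the ceiling.
        return math.ceil(total_rows / num_rows_per_file)
    # Fall back to the documented default chunk size when nothing is set.
    chunk_mb = in_memory_file_size if in_memory_file_size is not None else default_chunk_mb
    return max(1, math.ceil(total_memory_mb / chunk_mb))


# 10,000 rows split into files of 4,000 rows each.
by_rows = estimate_num_files(10_000, 0.08, num_rows_per_file=4000)
# A 2,500 MB DataFrame with no splitting parameter set.
by_default = estimate_num_files(10_000, 2500)
```

Under these assumptions, a DataFrame of 10,000 rows with num_rows_per_file=4000 gives three files, and a 2,500 MB DataFrame with no parameter set also gives three files.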
Here is an example of how to use to_csv_folder. The syntax is identical for to_parquet_folder.
import terality as pd
df = pd.DataFrame({"A": range(10_000)})
df.to_csv_folder(path_or_buf="folder/file_name_*.csv", num_rows_per_file=4000)
# creates 3 files :
# - folder/file_name_0.csv with 4000 rows.
# - folder/file_name_1.csv with 4000 rows.
# - folder/file_name_2.csv with 2000 rows.
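The file names in this example follow from replacing * with the file number, starting at 0. The sketch below reproduces that naming in plain Python so the effect of with_leading_zeros can be previewed. The helper expand_file_names is hypothetical, and the padding width (the number of digits of the largest file number) is an assumption rather than documented behavior.

```python
import posixpath
from typing import List


def expand_file_names(
    pattern: str, num_files: int, with_leading_zeros: bool = False
) -> List[str]:
    """Hypothetical helper: list the file names a '*' pattern would expand to."""
    folder, basename = posixpath.split(pattern)
    if "*" not in basename:
        raise ValueError("the basename must contain the special character *")
    # Assumption: pad to the number of digits of the largest file number.
    width = len(str(max(num_files - 1, 0))) if with_leading_zeros else 0
    return [
        posixpath.join(folder, basename.replace("*", str(i).zfill(width), 1))
        for i in range(num_files)
    ]


plain = expand_file_names("folder/file_name_*.csv", 3)
padded = expand_file_names("folder/file_name_*.csv", 12, with_leading_zeros=True)
```

With three files, plain contains the same names as in the example above. With twelve files and with_leading_zeros=True, the assumed padding yields names from folder/file_name_00.csv to folder/file_name_11.csv, which sort correctly in alphabetical order.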