Write to multiple files

Terality provides two additional methods that store a DataFrame across multiple CSV or Parquet files: DataFrame.to_csv_folder and DataFrame.to_parquet_folder.

In addition to the parameters of their respective counterparts, to_csv and to_parquet, these methods accept four extra parameters that control how the DataFrame is split and stored across several files.

DataFrame.to_csv_folder(
        path_or_buf=None, 
        num_files=None,  # new
        num_rows_per_file=None,  # new
        in_memory_file_size=None,  # new
        with_leading_zeros=False,  # new
        sep=',', 
        na_rep='', 
        float_format=None,    
        columns=None, 
        header=True, 
        index=True, 
        index_label=None, 
        mode='w', 
        encoding=None, 
        compression='infer', 
        quoting=None, 
        quotechar='"', 
        line_terminator=None, 
        chunksize=None, 
        date_format=None, 
        doublequote=True, 
        escapechar=None, 
        decimal='.', 
        errors='strict', 
        storage_options=None
) -> None
DataFrame.to_parquet_folder(
        path=None, 
        num_files=None, # new
        num_rows_per_file=None, # new
        in_memory_file_size=None, # new
        with_leading_zeros=False, # new
        engine='auto', 
        compression='snappy',
        index=None, 
        partition_cols=None, 
        storage_options=None, 
        **kwargs
) -> None

path / path_or_buf: The location where the files are stored. The basename must contain the special character *, which is replaced by the file number. Example: path="path/to/folder/file_name_*.parquet".
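As a rough illustration of the naming rule above, the sketch below expands a path pattern into a concrete file name. The helper expand_pattern is hypothetical and is not part of Terality; it only mimics the documented behavior (the * in the basename is replaced by the file number).

```python
import posixpath


def expand_pattern(pattern: str, file_number: int) -> str:
    """Hypothetical helper (not a Terality function): fill in the file number."""
    folder, basename = posixpath.split(pattern)
    # The documentation requires the placeholder to be in the basename.
    if "*" not in basename:
        raise ValueError("the basename must contain the special character '*'")
    return posixpath.join(folder, basename.replace("*", str(file_number), 1))


print(expand_pattern("path/to/folder/file_name_*.parquet", 0))
# path/to/folder/file_name_0.parquet
```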

num_files: Optional[int] -> The number of output files.

num_rows_per_file: Optional[int] -> The number of rows in each output file. The total number of files is deduced from the number of rows in the DataFrame; the last file may contain fewer rows.
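The way the file count follows from num_rows_per_file can be sketched with a ceiling division. The function rows_per_output_file is a hypothetical name used for illustration only; it assumes that every file is filled in order and that only the last one may be partial.

```python
import math
from typing import List


def rows_per_output_file(total_rows: int, num_rows_per_file: int) -> List[int]:
    """Hypothetical helper: row count of each output file, in order."""
    # Enough files to hold every row; the last one takes the remainder.
    num_files = math.ceil(total_rows / num_rows_per_file)
    return [
        min(num_rows_per_file, total_rows - i * num_rows_per_file)
        for i in range(num_files)
    ]


print(rows_per_output_file(10_000, 4000))
# [4000, 4000, 2000]
```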

in_memory_file_size: Optional[int] -> The in-memory size, in megabytes, of each chunk of the input DataFrame to save. The total number of files is deduced from the in-memory size of the DataFrame. If none of num_files, num_rows_per_file or in_memory_file_size is set, chunks of 1 GB (the maximum size allowed) are used.

with_leading_zeros: Optional[bool] -> Whether the numbers in file names should be padded with leading zeros so that all file names have the same length. Defaults to False.
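The effect of with_leading_zeros can be pictured with the sketch below. It is not Terality's implementation: numbered_names is a hypothetical helper, and the padding width (the number of digits in the largest file number) is an assumption made so that all names end up with the same length. File numbering starts at 0.

```python
from typing import List


def numbered_names(pattern: str, num_files: int, with_leading_zeros: bool = False) -> List[str]:
    """Hypothetical helper: file names produced for a given pattern."""
    # Assumption: pad to the width of the largest file number (num_files - 1).
    width = len(str(num_files - 1)) if with_leading_zeros else 0
    return [pattern.replace("*", str(i).zfill(width)) for i in range(num_files)]


names = numbered_names("folder/file_name_*.csv", 12, with_leading_zeros=True)
print(names[0], names[-1])
# folder/file_name_00.csv folder/file_name_11.csv
```

Fixed-width numbers keep the files in numeric order when they are sorted by name, which is convenient when listing or globbing the folder.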

All other parameters are strictly identical to those of DataFrame.to_csv and DataFrame.to_parquet.

At most one of num_files, num_rows_per_file or in_memory_file_size can be provided. If none of them is provided, the DataFrame is stored in chunks of 1 GB.
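The rule above amounts to a mutual-exclusion check on the three splitting parameters. The following sketch shows one way to express it; pick_split_option is a hypothetical function, and the exception type Terality actually raises is not documented here, so ValueError is an assumption.

```python
from typing import Optional


def pick_split_option(
    num_files: Optional[int] = None,
    num_rows_per_file: Optional[int] = None,
    in_memory_file_size: Optional[int] = None,
) -> str:
    """Hypothetical helper: name the splitting rule that applies."""
    options = {
        "num_files": num_files,
        "num_rows_per_file": num_rows_per_file,
        "in_memory_file_size": in_memory_file_size,
    }
    provided = [name for name, value in options.items() if value is not None]
    if len(provided) > 1:
        # Assumption: the real library may report this differently.
        raise ValueError("provide only one of: " + ", ".join(options))
    # With nothing provided, the documented default is chunks of 1 GB.
    return provided[0] if provided else "default 1 GB chunks"


print(pick_split_option(num_rows_per_file=4000))
# num_rows_per_file
```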

Here is an example of how to use to_csv_folder. The syntax is identical for to_parquet_folder.

import terality as pd

df = pd.DataFrame({"A": range(10_000)})
df.to_csv_folder(path_or_buf="folder/file_name_*.csv", num_rows_per_file=4000)
# creates 3 files:
#   - folder/file_name_0.csv with 4000 rows.
#   - folder/file_name_1.csv with 4000 rows.
#   - folder/file_name_2.csv with 2000 rows.
