Importing and exporting data (input/output)

Overview: how to load and export data

Terality uses the same methods as pandas to load data: read_csv and read_parquet (other formats are not yet supported). Terality also defines a from_pandas method. Example:

import terality as te

# Load all parquet files at this S3 location
df = te.read_parquet("s3://my-datasets/path/to/objects/")

# Load a CSV file from disk
df = te.read_csv("/path/to/my/data.csv")

# You can also convert a pandas DataFrame to Terality.
# Useful when data is in a format not supported by Terality.
import pandas as pd
df_pd = pd.read_xml("data.xml")
df_te = te.from_pandas(df_pd)
Additionally, you can export datasets with to_csv or to_parquet. For datasets with more than 1 GiB of data, use to_csv_folder or to_parquet_folder.
df.to_parquet_folder("s3://my-datasets/export/prefix/{}.parquet", num_files=10)
The sections below describe all the import/export options currently available with Terality.

Supported formats

Terality supports importing or exporting data in the following formats:
Format  | Import function        | Export function
CSV     | terality.read_csv      | terality.DataFrame.to_csv
Parquet | terality.read_parquet  | terality.DataFrame.to_parquet
These functions accept the same parameters as the pandas API.
Other pandas input/output functions (such as read_xml) are not supported yet. Please don't hesitate to suggest your preferred format by reaching out to us.
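For instance, read_csv should accept the usual pandas keyword arguments. The sketch below is illustrative only: the file path, separator, and column names are placeholder assumptions, not values from this guide.

import terality as te

# Pass pandas-style options to read_csv (placeholder path and columns).
df = te.read_csv(
    "s3://my-datasets/path/to/data.csv",
    sep=";",                          # non-default separator
    usecols=["user_id", "amount"],    # only load the columns you need
)

# Export accepts the usual pandas parameters as well.
df.to_csv("s3://my-datasets/export/data.csv", index=False)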

Exporting a big dataset: to_parquet_folder/to_csv_folder

When exporting a DataFrame with more than 1 GiB of data, use the to_csv_folder or to_parquet_folder methods instead. These methods write files in parallel to the output folder:
import terality as te

df = te.read_parquet("s3://bucket/some/key.parquet")

# Write the DataFrame in the Parquet format as 10 files.
# As a rule of thumb and for best performance, try to choose num_files so that
# each output file is about 1 GiB (use `df.info` to get the size of your DataFrame).
# The {} placeholder will be replaced by the file number.
# This method will write s3://bucket_2/export/prefix/1.parquet, s3://bucket_2/export/prefix/2.parquet...
# until s3://bucket_2/export/prefix/10.parquet.
df.to_parquet_folder("s3://bucket_2/export/prefix/{}.parquet", num_files=10)

# Same with CSV
df.to_csv_folder("s3://bucket_2/export/prefix/{}.csv", num_files=10)

Converting pandas DataFrames: from_pandas, to_pandas

If you have a pandas DataFrame or Series, you can import it into Terality with the from_pandas method. You can also convert a Terality DataFrame back to a pandas DataFrame with to_pandas.
This is useful when you want to import data with a format not yet supported by Terality: read it with pandas first, then call from_pandas.
import terality as te
import pandas as pd

# Terality does not support the XML format yet, but pandas does.
df_pd = pd.read_xml("data.xml")

# Convert the pandas DataFrame to a Terality DataFrame
df_te = te.from_pandas(df_pd)

# Convert the Terality DataFrame back to a pandas DataFrame
df_pd = df_te.to_pandas()
Using from_pandas is less performant than directly using read_parquet or read_csv, and requires loading the whole dataset into memory first. Prefer reading directly from a file when possible.
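For instance, when the data is already stored as a CSV file, reading it directly avoids the intermediate pandas DataFrame. A minimal sketch with placeholder paths:

import terality as te
import pandas as pd

# Slower path: pandas first loads the whole file into local memory,
# then from_pandas uploads it (placeholder path).
df_te = te.from_pandas(pd.read_csv("/path/to/my/data.csv"))

# Preferred path: let Terality read the file directly.
df_te = te.read_csv("/path/to/my/data.csv")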

Supported storage services

Local disk

Terality can read files from the local filesystem and write files back to it.
In this configuration, files are uploaded directly to the Terality servers by the Terality client. If you are on a metered or low-bandwidth Internet connection, we recommend using S3 or Azure Data Lake instead, as described in the next sections.
import terality as te

# Import
df = te.read_parquet("/path/to/some/file.parquet")

# Export
df.to_parquet("/path/to/output/file.parquet")

AWS S3

Terality can import files from AWS S3 (https://aws.amazon.com/s3/) and write files back to AWS S3.
The machine running the Terality code must have sufficient permissions to read objects from the source S3 bucket or write objects to the destination S3 bucket. No additional setup is necessary.
This method copies files directly from S3 to the Terality servers. Files are not downloaded to the machine running the Terality client.
import terality as te

# Import
df = te.read_parquet("s3://bucket/some/object.parquet")

# Export
df.to_parquet("s3://bucket/output/object.parquet")
All transfers are secured and your AWS credentials are never sent to the Terality servers. If you are curious about how this works, you can refer to this section: How can Terality import or export data without configuration and without accessing the user cloud credentials?
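If an import fails with an access error, you can verify from the same machine that your local AWS credentials can read the source object. This optional check uses boto3, not Terality, and the bucket and key below are placeholders:

import boto3

# Optional sanity check, independent of Terality: raises an error if the
# credentials on this machine cannot access the object (placeholder names).
s3 = boto3.client("s3")
s3.head_object(Bucket="bucket", Key="some/object.parquet")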

Azure Data Lake

Terality can import files from Azure Data Lake Storage Gen2 (https://azure.microsoft.com/en-us/services/storage/data-lake-storage/) and write files back to Azure Data Lake.
The machine running the Terality code must have sufficient permissions to read objects from or write objects to the source or destination storage account, as well as to generate a user delegation key. These permissions map to the standard Azure roles "Storage Blob Data Contributor", "Storage Blob Data Owner", or "Storage Blob Data Reader".
This method copies files directly from Azure Data Lake to the Terality servers. Files are not downloaded to the machine running the Terality client.
To use the Azure integration, you first need to install the Terality Azure extras Python package:
$ pip install 'terality[azure]'
Then, you can use the Azure integration:
import terality as te

# Import
df = te.read_parquet(
    "adfs://container/some/blob.parquet",
    storage_options={"account_name": "yourstorageaccountname"},
)

# Export
df.to_parquet(
    "adfs://container/output/object.parquet",
    # Unlike pandas, Terality also supports the "storage_options" parameter in `to_parquet`
    storage_options={"account_name": "yourstorageaccountname"},
)
All transfers are secured and your Azure credentials are never sent to the Terality servers. If you are curious about how this works, you can refer to this section: How can Terality import or export data without configuration and without accessing the user cloud credentials?

Other services

Services that are not directly supported by Terality, such as Snowflake or Databricks, often offer an option to export data to an AWS S3 bucket or Azure Data Lake Storage filesystem. You can thus use S3 or Data Lake Storage to transfer data between Terality and many other services.
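For example, once another service has exported its data as Parquet files to an S3 prefix, you can load them with a single call. The prefix below is a placeholder, not a path from this guide:

import terality as te

# Hypothetical S3 prefix where an external service (Snowflake, Databricks, ...)
# has exported its data as Parquet files.
df = te.read_parquet("s3://my-datasets/exports/from-other-service/")

# Process with Terality, then export back to S3 for the other service to pick up.
df.to_parquet_folder("s3://my-datasets/exports/processed/{}.parquet", num_files=10)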