Storage services

Supported storage services

Local disk

Terality can read files from the local filesystem and write files back to it.

In this configuration, the Terality client uploads files directly to the Terality servers. If you are on a metered or low-bandwidth Internet connection, we recommend using S3 or Azure Data Lake instead, as described in the next sections.

import terality as te

# Import
df = te.read_parquet("/path/to/some/file.parquet")

# Export
df.to_parquet("/path/to/output/file.parquet")

AWS S3

Terality can import files from AWS S3 (https://aws.amazon.com/s3/) and write files back to AWS S3.

The machine running the Terality code must have sufficient permissions to read objects from the source S3 bucket or write objects to the destination S3 bucket. No additional setup is necessary.
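
If you want to confirm that access before running a Terality import, you can perform a quick check from the same machine. The sketch below is optional and assumes the boto3 package plus a hypothetical bucket and object key; it simply attempts the same read access that Terality will need.

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

# Hypothetical bucket and object, matching the Terality import below.
bucket, key = "bucket", "some/object.parquet"

try:
    # head_object succeeds only if the current credentials can read the object.
    s3.head_object(Bucket=bucket, Key=key)
    print("Read access OK")
except ClientError as exc:
    print(f"Cannot read s3://{bucket}/{key}: {exc}")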

This method copies files directly from S3 to the Terality servers. Files are not downloaded to the machine running the Terality client.

import terality as te

# Import
df = te.read_parquet("s3://bucket/some/object.parquet")

# Export
df.to_parquet("s3://bucket/output/object.parquet")

Azure Data Lake

Terality can import files from Azure Data Lake Storage Gen2 (https://azure.microsoft.com/en-us/services/storage/data-lake-storage/) and write files back to Azure Data Lake.

The machine running the Terality code must have sufficient permissions to read objects from the source storage account or write objects to the destination storage account, as well as to generate a user delegation key. These permissions map to the standard Azure roles "Storage Blob Data Contributor", "Storage Blob Data Owner" or "Storage Blob Data Reader".
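
To check that your identity can generate a user delegation key (and therefore holds one of the roles above), you can request one directly. This is a minimal sketch, not required by the integration itself; it assumes the azure-identity and azure-storage-blob packages and a hypothetical storage account name.

from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# Hypothetical storage account: reuse the account name you pass to storage_options.
client = BlobServiceClient(
    "https://yourstorageaccountname.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)

# This call fails if the identity cannot generate a user delegation key.
now = datetime.now(timezone.utc)
client.get_user_delegation_key(key_start_time=now, key_expiry_time=now + timedelta(hours=1))
print("User delegation key obtained")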

This method copies files directly from Azure Data Lake to the Terality servers. Files are not downloaded to the machine running the Terality client.

To use the Azure integration, you first need to install the Terality Azure extras Python package:

$ pip install 'terality[azure]'

Then, you can use the Azure integration:

import terality as te

# Import
df = te.read_parquet(
    "adfs://container/some/blob.parquet",
    storage_options={"account_name": "yourstorageaccountname"}
)

# Export
df.to_parquet(
    "adfs://container/output/object.parquet",
    # Unlike pandas, Terality also supports the "storage_options" parameter in `to_parquet`
    storage_options={"account_name": "yourstorageaccountname"}
)

Databricks

Terality can read from and write to the Databricks File System (DBFS). DBFS paths are accessible from nodes within a Databricks cluster and start with /dbfs:

import terality as te

# Import
df = te.read_parquet("/dbfs/in/data.parquet")

# Export
df.to_parquet("/dbfs/out/data.parquet")

If the underlying DBFS storage is a supported object store such as AWS S3, you can also import data directly from that object store, as described in the previous sections. Using the underlying storage directly may reduce the time required to import or export data.
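
For example, if your DBFS path is a mount backed by S3, you can list the mounts to find the underlying s3:// location and pass it to Terality directly. The sketch below assumes a Databricks notebook (where dbutils is available) and a hypothetical mount backed by s3://your-bucket/data.

import terality as te

# List DBFS mounts and their underlying storage locations (Databricks notebook only).
for mount in dbutils.fs.mounts():
    print(mount.mountPoint, "->", mount.source)

# If /mnt/data is backed by s3://your-bucket/data, read from S3 directly:
df = te.read_parquet("s3://your-bucket/data/in/data.parquet")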

Other services

Services that are not directly supported by Terality, such as Snowflake, often offer an option to export data to an AWS S3 bucket or Azure Data Lake Storage filesystem. You can thus use S3 or Data Lake Storage to transfer data between Terality and many other services.
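
As an illustration, a Snowflake table can be unloaded to S3 with a COPY INTO <location> statement and then read by Terality from the same bucket. The sketch below uses the snowflake-connector-python package; the connection parameters, storage integration, export location and output file name are all hypothetical.

import snowflake.connector
import terality as te

# Hypothetical Snowflake connection.
conn = snowflake.connector.connect(
    account="your_account",
    user="your_user",
    password="your_password",
    warehouse="your_warehouse",
)

# Unload a table to Parquet files in S3 (hypothetical bucket and storage integration).
conn.cursor().execute(
    """
    COPY INTO 's3://your-bucket/exports/orders/'
    FROM my_db.my_schema.orders
    STORAGE_INTEGRATION = my_s3_integration
    FILE_FORMAT = (TYPE = PARQUET)
    HEADER = TRUE  -- keep the original column names in the Parquet output
    """
)

# Read one of the unloaded files with Terality (exact file names depend on the unload).
df = te.read_parquet("s3://your-bucket/exports/orders/data_0_0_0.snappy.parquet")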
