
S3 Destination Plugin

Latest: v4.8.0

This destination plugin lets you sync data from a CloudQuery source to remote S3 storage in various formats such as CSV, JSON and Parquet.

This is useful in many use cases, especially in data lakes, where you can query the data directly from Athena or load it into data warehouses such as BigQuery, Redshift, Snowflake and others.

Example

This example uses the parquet format to create Parquet files under s3://bucket_name/path/to/files, with each table placed in its own directory.

The (top level) spec section is described in the Destination Spec Reference.

kind: destination
spec:
  name: "s3"
  path: "cloudquery/s3"
  version: "v4.8.0"
  spec:
    bucket: "bucket_name"
    region: "region-name" # Example: us-east-1
    path: "path/to/files/{{TABLE}}/{{UUID}}.parquet"
    format: "parquet" # options: parquet, json, csv
    format_spec:
      # CSV-specific parameters:
      # delimiter: ","
      # skip_header: false

    # Optional parameters
    # compression: "" # options: gzip
    # no_rotate: false
    # athena: false # <- set this to true for Athena compatibility
    # test_write: true # tests the ability to write to the bucket before processing the data
    # endpoint: "" # Endpoint to use for S3 API calls.
    # endpoint_skip_tls_verify: false # Disable TLS verification if using an untrusted certificate
    # use_path_style: false
    # batch_size: 10000 # 10K entries
    # batch_size_bytes: 52428800 # 50 MiB
    # batch_timeout: 30s # 30 seconds

It is also possible to use {{YEAR}}, {{MONTH}}, {{DAY}} and {{HOUR}} in the path to create a directory structure based on the current time. For example:

path: "path/to/files/{{TABLE}}/dt={{YEAR}}-{{MONTH}}-{{DAY}}/{{UUID}}.parquet"

Other supported formats are json and csv.

Note that the S3 plugin only supports the append write mode.

The S3 destination utilizes batching, and supports batch_size, batch_size_bytes and batch_timeout options (see below).

S3 Spec

This is the (nested) spec used by the S3 destination plugin.

  • bucket (string) (required)

    Bucket to sync the files to.

  • region (string) (required)

    Region where the bucket is located.

  • path (string) (required)

    Path within the above bucket where the files will be uploaded. The path supports the following placeholder variables:

    • {{TABLE}} will be replaced with the table name
    • {{FORMAT}} will be replaced with the file format, such as csv, json or parquet. If compression is enabled, the format will be csv.gz, json.gz etc.
    • {{UUID}} will be replaced with a random UUID to uniquely identify each file
    • {{YEAR}} will be replaced with the current year in YYYY format
    • {{MONTH}} will be replaced with the current month in MM format
    • {{DAY}} will be replaced with the current day in DD format
    • {{HOUR}} will be replaced with the current hour in HH format
    • {{MINUTE}} will be replaced with the current minute in mm format

    Note that timestamps are in UTC and reflect the time at which the file is written, not the time the sync started.

  • format (string) (required)

    Format of the output file. Supported values are csv, json and parquet.

  • format_spec (format_spec) (optional)

    Optional parameters to change the format of the file.

  • compression (string) (optional) (default: empty)

    Compression algorithm to use. Supported values are empty or gzip. Not supported for parquet format.

  • no_rotate (boolean) (optional) (default: false)

    If set to true, the plugin will write to one file per table. Otherwise, for every batch a new file will be created with a different .<UUID> suffix.

  • athena (boolean) (optional) (default: false)

    When athena is set to true, the S3 plugin will sanitize keys in JSON columns to be compatible with the Hive Metastore / Athena. This allows tables to be created with a Glue Crawler and then queried via Athena, without changes to the table schema.

  • test_write (boolean) (optional) (default: true)

    Ensure write access to the given bucket and path by writing a test object on each sync. If you are sure that the bucket and path are writable, you can set this to false to skip the test.

  • endpoint (string) (optional) (default: empty)

    Endpoint to use for S3 API calls. This is useful for S3-compatible storage services such as MinIO. Note: if you want to use path-style addressing, i.e., https://s3.amazonaws.com/BUCKET/KEY, use_path_style should be enabled, too.

  • endpoint_skip_tls_verify (boolean) (optional) (default: false)

    Disable TLS verification for requests to your S3 endpoint. This option is intended for use with a custom endpoint set via the endpoint option.

  • use_path_style (boolean) (optional) (default: false)

    Enables path-style addressing for the endpoint option, i.e., https://s3.amazonaws.com/BUCKET/KEY. By default, the S3 client will use virtual hosted bucket addressing when possible (https://BUCKET.s3.amazonaws.com/KEY).

  • batch_size (integer) (optional) (default: 10000)

    Number of records to write before starting a new object.

  • batch_size_bytes (integer) (optional) (default: 52428800 (= 50 MiB))

    Number of bytes (as Arrow buffer size) to write before starting a new object.

  • batch_timeout (duration) (optional) (default: 30s (30 seconds))

    Inactivity time before starting a new object.
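
As an illustration of how these options combine, the following sketch writes gzip-compressed JSON, partitions objects by UTC date and hour, and rotates objects more often than the defaults. The bucket name, region, path prefix and batch values are assumptions chosen for the example, not recommendations.

```yaml
kind: destination
spec:
  name: "s3"
  path: "cloudquery/s3"
  version: "v4.8.0"
  spec:
    bucket: "example-bucket" # hypothetical bucket name
    region: "us-east-1"      # hypothetical region
    # {{FORMAT}} should expand to json.gz here, because compression is enabled
    path: "cloudquery/{{TABLE}}/dt={{YEAR}}-{{MONTH}}-{{DAY}}/hour={{HOUR}}/{{UUID}}.{{FORMAT}}"
    format: "json"
    compression: "gzip"        # not supported for the parquet format
    batch_size: 5000           # start a new object after 5,000 records
    batch_size_bytes: 10485760 # or after 10 MiB (Arrow buffer size)
    batch_timeout: 10s         # or after 10 seconds of inactivity
```

With these settings, object keys would be expected to follow the pattern cloudquery/<table name>/dt=YYYY-MM-DD/hour=HH/<UUID>.json.gz, with the date and hour taken from the UTC time at which each file is written.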

format_spec

  • delimiter (string) (optional) (default: ,)

    Character to use as the delimiter when the format is csv.

  • skip_header (boolean) (optional) (default: false)

    If set to true, the header row is not written as the first line of the file (applies when format is csv).
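
For example, a fragment of the nested spec that produces semicolon-delimited CSV files might look like the sketch below. The bucket, region and path values are placeholders, and the comment on skip_header reflects the reading suggested by the option name (true omits the header row).

```yaml
  spec:
    bucket: "example-bucket" # hypothetical bucket name
    region: "us-east-1"      # hypothetical region
    path: "cloudquery/{{TABLE}}/{{UUID}}.{{FORMAT}}"
    format: "csv"
    format_spec:
      delimiter: ";"    # default is ","
      skip_header: true # assumed to omit the header row; default is false
```

The format_spec options only take effect when format is csv, so they can be left out for json and parquet.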

Authentication

The plugin needs to be authenticated with your account(s) in order to write data to your S3 bucket.

The plugin requires only PutObject permissions (we will never make any changes to your cloud setup), so, following the principle of least privilege, it's recommended to grant it only PutObject permissions.

There are multiple ways to authenticate with AWS, and the plugin respects the AWS credential provider chain. This means that CloudQuery tries the following sources, in order of priority, when attempting to authenticate:

  • The AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN environment variables.
  • The credentials and config files in ~/.aws (the credentials file takes priority).
  • You can also use aws sso to authenticate CloudQuery; you can read more about it here.
  • IAM roles for AWS compute resources (including EC2 instances, Fargate and ECS containers).

You can read more about AWS authentication here and here.

Environment Variables

CloudQuery can use the credentials from the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN environment variables (AWS_SESSION_TOKEN can be optional for some accounts). For information on obtaining credentials, see the AWS guide.

To export the environment variables (on Linux/Mac; similar for Windows):

export AWS_ACCESS_KEY_ID='<YOUR_ACCESS_KEY_ID>'
export AWS_SECRET_ACCESS_KEY='<YOUR_SECRET_ACCESS_KEY>'
export AWS_SESSION_TOKEN='<YOUR_SESSION_TOKEN>'

Shared Configuration files

The plugin can use credentials from your credentials and config files in the .aws directory in your home folder. The contents of these files are practically interchangeable, but CloudQuery will prioritize credentials in the credentials file.

For information about obtaining credentials, see the AWS guide.

Here are example contents for a credentials file:

~/.aws/credentials
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY

You can also specify credentials for a different profile, and instruct CloudQuery to use the credentials from this profile instead of the default one.

For example:

~/.aws/credentials
[myprofile]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY

Then, export the AWS_PROFILE environment variable (on Linux/Mac; similar for Windows):

export AWS_PROFILE=myprofile

IAM Roles for AWS Compute Resources

The plugin can use IAM roles for AWS compute resources (including EC2 instances, Fargate and ECS containers). If you configured your AWS compute resources with IAM, the plugin will use these roles automatically. For more information on configuring IAM, see the AWS docs here and here.

User Credentials with MFA

To use IAM User credentials with MFA, run the STS "get-session-token" command with the IAM User's long-term security credentials (Access Key and Secret Access Key). For more information, see here.

aws sts get-session-token --serial-number <YOUR_MFA_SERIAL_NUMBER> --token-code <YOUR_MFA_TOKEN_CODE> --duration-seconds 3600

Then export the temporary credentials to your environment variables.

export AWS_ACCESS_KEY_ID=<YOUR_ACCESS_KEY_ID>
export AWS_SECRET_ACCESS_KEY=<YOUR_SECRET_ACCESS_KEY>
export AWS_SESSION_TOKEN=<YOUR_SESSION_TOKEN>

Using a Custom S3 Endpoint

If you are using a custom S3 endpoint, you can specify it using the endpoint spec option. If you're using authentication, the region option in the spec determines the signing region used.
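
As a sketch of what this could look like for an S3-compatible service such as MinIO, the nested spec below points the plugin at a custom endpoint. The endpoint URL, bucket name and region are hypothetical values; path-style addressing is enabled on the assumption that the service does not support virtual hosted bucket addressing.

```yaml
  spec:
    bucket: "example-bucket" # hypothetical bucket name
    region: "us-east-1"      # used as the signing region
    path: "cloudquery/{{TABLE}}/{{UUID}}.parquet"
    format: "parquet"
    endpoint: "https://minio.example.internal:9000" # hypothetical endpoint URL
    use_path_style: true # request https://ENDPOINT/BUCKET/KEY instead of https://BUCKET.ENDPOINT/KEY
    # endpoint_skip_tls_verify: true # only if the endpoint uses an untrusted certificate
```

Credentials for the custom endpoint are still resolved through the AWS credential provider chain described above, for example via the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.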