S3

S3 is a cloud file storage service by Amazon.

Features

Feature | Supported
Batch Mode | Yes
Deduplication | Yes (within a single batch only, see the Deduplication section)
Folder Macros | Yes

Configuration

Jitsu supports both Access Key based authentication and IAM Role based authentication for S3.

General parameters

Parameter name | Description
Authentication Method | accessKey - Access Key based authentication, iam - IAM Role based authentication
S3 Region | AWS Region of S3 bucket
S3 Bucket Name | S3 Bucket Name
Folder | Folder in the block storage bucket where files will be stored
Format | Format of the files stored in the block storage: ndjson - Newline Delimited JSON, ndjson_flat - Newline Delimited JSON flattened, csv - CSV
Compression | Compression algorithm used for the files stored in the block storage: gzip - GZIP, none - no compression

Configuration settings depend on the selected authentication method.

Access Key based authentication

Parameter name | Description
S3 Access Key Id | S3 Access Key Id
S3 Secret Access Key | S3 Secret Access Key
Endpoint | Custom endpoint of S3-compatible server (Optional)

IAM Role based authentication

Parameter name | Description
Role ARN | IAM role ARN

To set up IAM Role based authentication for S3, follow the Advanced: IAM Role for Jitsu section.
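The dependency between the authentication method and the other parameters can be sketched as a small validation helper. This is only an illustration: the key names (`authenticationMethod`, `accessKeyId`, `secretAccessKey`, `roleArn`, `format`, `compression`) are hypothetical and are not taken from Jitsu's actual configuration schema. The allowed values come from the tables above.

```javascript
// Hypothetical key names; the allowed values are the ones listed in the tables above.
const FORMATS = ["ndjson", "ndjson_flat", "csv"];
const COMPRESSIONS = ["gzip", "none"];

// Returns a list of problems found in a destination config object.
function validateS3Config(config) {
  const errors = [];
  if (!FORMATS.includes(config.format)) {
    errors.push(`format must be one of: ${FORMATS.join(", ")}`);
  }
  if (!COMPRESSIONS.includes(config.compression)) {
    errors.push(`compression must be one of: ${COMPRESSIONS.join(", ")}`);
  }
  if (config.authenticationMethod === "accessKey") {
    // Access Key based authentication needs both halves of the key pair.
    for (const key of ["accessKeyId", "secretAccessKey"]) {
      if (!config[key]) errors.push(`${key} is required for accessKey authentication`);
    }
  } else if (config.authenticationMethod === "iam") {
    // IAM Role based authentication needs only the role ARN.
    if (!config.roleArn) errors.push("roleArn is required for iam authentication");
  } else {
    errors.push("authenticationMethod must be accessKey or iam");
  }
  return errors;
}

console.log(validateS3Config({
  authenticationMethod: "accessKey",
  format: "ndjson",
  compression: "gzip",
}));
```

In this sketch the call reports the two missing key fields, since neither `accessKeyId` nor `secretAccessKey` was provided.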

Advanced: IAM Role for Jitsu

To allow Jitsu to connect to S3 using an IAM Role, perform the following steps in the AWS Console:

  • Create a new IAM Policy
  • Create a new IAM Role

Create a new IAM Policy

  • Sign in to your AWS Management Console and open the IAM console.
  • Go to Policies > Create policy.
  • Choose the JSON option, then paste the JSON below.
  • Assign a unique and descriptive name to the policy, provide a clear description, and then select Create Policy.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::${S3BucketName}",
        "arn:aws:s3:::${S3BucketName}/*"
      ]
    }
  ]
}
Tip

Make sure to replace the ${S3BucketName} macro with the S3 Bucket Name value from the Configuration section.
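If you prefer to generate the policy document instead of editing it by hand, the substitution can be sketched as follows. The helper name `buildBucketPolicy` and the bucket name `my-jitsu-bucket` are placeholders for illustration. The actions and resource patterns are copied from the policy above.

```javascript
// Builds the IAM policy shown above for a concrete bucket name.
function buildBucketPolicy(bucketName) {
  if (!bucketName) throw new Error("bucketName is required");
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Action: ["s3:PutObject", "s3:GetObject", "s3:ListBucket", "s3:DeleteObject"],
        // The bucket itself (needed for ListBucket) and every object inside it.
        Resource: [`arn:aws:s3:::${bucketName}`, `arn:aws:s3:::${bucketName}/*`],
      },
    ],
  };
}

// Print JSON that can be pasted into the policy editor.
console.log(JSON.stringify(buildBucketPolicy("my-jitsu-bucket"), null, 2));
```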

Create a new IAM Role

  • Sign in to your AWS Management Console and open the IAM console.
  • Go to Roles > Create role.
  • Under Trusted entity type, select Custom trust policy.
  • Paste the JSON below into the Custom trust policy field and replace the ${WorkspaceId} macro with your Jitsu Workspace ID (Jitsu UI -> Settings -> Workspace Settings).
  • In the policy selection screen, find and check the policy created in the Create policy section.
  • Assign a unique and descriptive name to the role, provide a clear description, and then select Create role.
  • Find the newly created role in the list and click on it.
  • Copy the ARN value from the Summary section and use it as the Role ARN in the Jitsu S3 configuration.

Custom trust policy:

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Effect": "Allow",
			"Principal": {
				"AWS": "arn:aws:iam::907458119157:root"
			},
			"Action": "sts:AssumeRole",
			"Condition": {
				"StringEquals": {
					"sts:ExternalId": "${WorkspaceId}"
				}
			}
		}
	]
}
Info

907458119157 is the Jitsu AWS account ID.
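The trust policy can be generated in the same way. In this sketch, `buildTrustPolicy` and the sample workspace ID are hypothetical. The account ID and the `sts:ExternalId` condition are taken from the trust policy above.

```javascript
// Jitsu AWS account ID, as stated in the trust policy above.
const JITSU_AWS_ACCOUNT_ID = "907458119157";

// Builds the custom trust policy for a given Jitsu Workspace ID.
function buildTrustPolicy(workspaceId) {
  if (!workspaceId) throw new Error("workspaceId is required");
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Principal: { AWS: `arn:aws:iam::${JITSU_AWS_ACCOUNT_ID}:root` },
        Action: "sts:AssumeRole",
        // The role can be assumed only when the caller passes this workspace ID
        // as the external ID.
        Condition: { StringEquals: { "sts:ExternalId": workspaceId } },
      },
    ],
  };
}

console.log(JSON.stringify(buildTrustPolicy("example-workspace-id"), null, 2));
```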

Advanced: Implementation Details

This section describes how Jitsu implements various modes and features for S3.

Batch Mode

Each batch run produces at least one file in the S3 bucket with the following name format:

<folder>/<table_name>_<batch_start_time>_<file_number>.<file_format>

file_number: when the number of available events is greater than the max batch size (default 10_000), Jitsu splits the batch into multiple files. file_number is the number of the file within the batch (starting from 1).
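The naming scheme can be illustrated with a short sketch. This is not Jitsu's implementation: the function `batchFileNames` and its parameter names are made up for the example, and the format of batch_start_time is not specified here, so it is passed through as an opaque string.

```javascript
// Lists the file names a single batch run would produce under the scheme above.
function batchFileNames({ folder, tableName, batchStartTime, fileFormat, eventCount, maxBatchSize = 10_000 }) {
  // At least one file per run; one more for every full maxBatchSize events.
  const fileCount = Math.max(1, Math.ceil(eventCount / maxBatchSize));
  const prefix = folder ? `${folder}/` : "";
  return Array.from(
    { length: fileCount },
    // file_number starts from 1.
    (_, i) => `${prefix}${tableName}_${batchStartTime}_${i + 1}.${fileFormat}`
  );
}

console.log(batchFileNames({
  folder: "events",
  tableName: "page",
  batchStartTime: "BATCH_START", // placeholder, real format not documented here
  fileFormat: "ndjson",
  eventCount: 25_000,
}));
```

With 25,000 events and the default limit of 10,000, the sketch yields three file names, numbered 1 to 3.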

Deduplication

Info

Deduplication happens only within a single batch. Jitsu doesn't guarantee deduplication across batches.
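To make the scope of this guarantee concrete, here is a minimal sketch of batch-scoped deduplication. It assumes events are identified by a `messageId` field and that the last occurrence wins. Both are assumptions for illustration, not a description of how Jitsu selects the key or the surviving event.

```javascript
// Removes duplicates inside one batch, keeping the last event seen for each key.
function dedupeBatch(events, keyOf = (event) => event.messageId) {
  const latest = new Map();
  for (const event of events) {
    latest.set(keyOf(event), event); // a later event overwrites an earlier one
  }
  return [...latest.values()];
}

const batch1 = [
  { messageId: "a", value: 1 },
  { messageId: "b", value: 2 },
  { messageId: "a", value: 3 },
];
const batch2 = [{ messageId: "a", value: 4 }];

// Each batch is deduplicated on its own, so "a" still appears once per batch.
const stored = [...dedupeBatch(batch1), ...dedupeBatch(batch2)];
console.log(stored.filter((event) => event.messageId === "a").length);
```

Because each batch is processed independently, a consumer reading the files may still need to deduplicate across batches.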

Organizing data into folders

You can use macros in the Folder configuration parameter to organize data into folders. Macros are replaced with the corresponding values during the batch run.

Supported macros:

Macro | Description
[DATE] | Date of the batch run in YYYY-MM-DD format
[TIMESTAMP] | Batch run time in unix timestamp format

You can use multiple macros in a single folder path. For example, events/[DATE]/[TIMESTAMP] will create a folder with the current date and time.
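Macro expansion can be sketched as below. The helper `expandFolderMacros` is hypothetical, and two details are assumptions: the date is computed in UTC, and the unix timestamp is expressed in seconds.

```javascript
// Expands [DATE] and [TIMESTAMP] in a folder path using the batch start time.
function expandFolderMacros(folder, batchStart) {
  const date = batchStart.toISOString().slice(0, 10); // YYYY-MM-DD, UTC (assumed)
  const timestamp = Math.floor(batchStart.getTime() / 1000); // seconds (assumed)
  return folder
    .replace(/\[DATE\]/g, date)
    .replace(/\[TIMESTAMP\]/g, String(timestamp));
}

console.log(expandFolderMacros("events/[DATE]/[TIMESTAMP]", new Date("2023-01-01T12:30:00Z")));
// prints "events/2023-01-01/1672576200"
```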

Note

Macro values are based on the batch start time and don't depend on the timestamps of the events in the batch. As a result, events from different days may be placed into the same date folder.

See Accurate organization of data into folders for an example of how to organize data into folders based on event timestamps.

Accurate organization of data into folders

It is possible to use Functions to organize data into folders based on event timestamps.

Using functions it is possible to change the destination table for a particular event.

The table name is used as a prefix for batch file names in S3. Slashes (/) in the file name work as directory separators and automatically create the corresponding directory structure in the S3 bucket. So it is possible to use functions to organize data into folders based on event timestamps or other event criteria.

Example:

export default async function(event, { log, fetch, props: config }) {
  // Change destination table to <date>/events. E.g: 2023-01-01/events.
  // After batch run S3 will contain folder 2023-01-01 with batch files inside.
  const date = event.timestamp.split('T')[0];
  event.JITSU_TABLE_NAME = `${date}/events`;
  return event;
}
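The example above assumes that every event carries an ISO 8601 timestamp string. A slightly more defensive variant is sketched below as a plain helper (`partitionByEventDate` is a made-up name). It falls back to a supplied date when the timestamp is missing or unparsable. Note that it derives the day in UTC, which can differ from the date written in a timestamp that has a non-UTC offset.

```javascript
// Returns a copy of the event routed to a <date>/events table.
// Falls back to fallbackDate when event.timestamp is missing or invalid.
function partitionByEventDate(event, fallbackDate = new Date()) {
  const parsed = new Date(event.timestamp);
  const date = Number.isNaN(parsed.getTime()) ? fallbackDate : parsed;
  const day = date.toISOString().slice(0, 10); // YYYY-MM-DD in UTC
  return { ...event, JITSU_TABLE_NAME: `${day}/events` };
}

const routed = partitionByEventDate({ type: "page", timestamp: "2023-01-01T10:00:00.000Z" });
console.log(routed.JITSU_TABLE_NAME); // prints "2023-01-01/events"
```

To use this logic in a Jitsu function, the body would go into the same export default async function(event, ...) form as the example above.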