ClickHouse

ClickHouse is an open-source column-oriented database management system specialized for online analytical processing of queries (OLAP). There's two ways to use ClickHouse:

Features

FeatureSupported
Batch Mode
Stream Mode*
Deduplication
Queries Optimization
Cluster Support
Distributed Tables

Configuration

Advanced: Implementation Details

This section describes how Jitsu implements various modes and features for ClickHouse

Batch Mode

In Batch Mode Jitsu use temporary table with Memory engine to store batch events before they are inserted into the destination table.

Details
-- Jitsu collects events batches in tmp file on file system.
INSERT INTO tmp_table (...) VALUES (...); -- bulk load data from tmp file into tmp_table using prepared statement in transaction
INSERT INTO target_table(...) SELECT ... FROM tmp_table

Stream Mode

Do not use in production

Stream mode in ClickHouse is limited by MergeTree engine capabilities. ClickHouse is not designed to handle a large number of individual inserts.

Stream Mode is not available by default and can be enabled only in advanced mode. Use it only for testing purposes.

Deduplication

Jitsu relies on underlying table engine. By default ReplacingMergeTree will be used.

Primary key column configured on connection level will be used both as PRIMARY KEY and ORDER BY.

ReplacingMergeTree engine performs deduplication by primary key in background during some time after insertion So it's still possible to get rows with duplicated primary key columns using ordinary SELECT.

To make sure that no duplicates are present in query results use FINAL modifier, e.g:

SELECT * FROM target_table FINAL
Tip

ReplacingMergeTree is not only way to deduplicate data in ClickHouse. There other approaches too. To implement them, create destination table before Jitsu starts inserting the data. In this case Jitsu will respect table engine and primary key columns you specified.

Queries Optimization

Timestamp connection setting is used to optimize SELECT queries. Jitsu creates tables partitioned by specified timestamp column and monthly partitioning, e.g. PARTITION BY toYYYYMM(_timestamp)

Cluster Support

Jitsu supports both ClickHouse clusters and single node instances. To use cluster, specify cluster name in Cluster connection setting. When working in cluster mode, Jitsu creates all tables in Replicated mode and all DDL operations are performed ON CLUSTER.

Distributed Tables

When working in cluster mode, Jitsu creates destination tables as Distributed tables. On each cluster node Jitsu creates local table with the local_ prefix that contains actual data.