Jitsu ❤️ ClickHouse: the easiest way to automate data collection on-prem

Vladimir Klimontovich

CEO and Co-Founder
January 5th, 2021

Data has become an invaluable asset that helps companies understand users, predict behavior, and identify trends. Jitsu is an open-source project designed to simplify event data collection. Jitsu supports a few data warehouses as storage backends, and ClickHouse is one of them.

ClickHouse is the first open-source SQL data warehouse to match the performance, maturity, and scalability of proprietary databases like Vertica and Snowflake.

This article shows how to set up Jitsu with ClickHouse and gives operational advice on achieving the best performance and reliability.

Getting data into ClickHouse is not as easy as it seems. Streaming millions of events from different applications, where each event has its own structure, can be very challenging. Things become even more complicated when different versions of the same application are running in production (for example, different versions of an iOS app).

Jitsu Architecture

Jitsu’s architecture is designed to be efficient and robust. It consists of a lightweight HTTP server that accepts an incoming event stream (JSON objects) and buffers it to local disk. A separate thread processes the buffer: it maps JSON to ClickHouse tables, adjusts the schema, and stores the data.
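The buffer-then-process flow described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Jitsu's actual implementation: DiskBuffer, run_worker, and the `store` callback are hypothetical names, and the file format (one JSON object per line) is an assumption.

```python
import json
import os
import threading


class DiskBuffer:
    """Append-only event log on local disk (hypothetical helper, not Jitsu's code)."""

    def __init__(self, path):
        self.path = path
        self.lock = threading.Lock()

    def append(self, event):
        # Called by the HTTP handler: persist the event before acknowledging it.
        with self.lock, open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")

    def drain(self, store):
        # Called by the background worker: hand all buffered events to `store`.
        # The file is truncated only after `store` returns without raising,
        # so a destination outage leaves the buffer intact.
        with self.lock:
            if not os.path.exists(self.path):
                return 0
            with open(self.path, encoding="utf-8") as f:
                events = [json.loads(line) for line in f if line.strip()]
            if events:
                store(events)
                open(self.path, "w").close()
            return len(events)


def run_worker(buffer, store, stop, interval=1.0):
    # Background loop: flush the buffer periodically until `stop` is set.
    while not stop.is_set():
        try:
            buffer.drain(store)
        except Exception:
            pass  # destination unavailable: keep events on disk and retry later
        stop.wait(interval)
```

A worker would be started with something like threading.Thread(target=run_worker, args=(buffer, store, threading.Event())). Holding the lock while `store` runs keeps the sketch short at the cost of blocking writers during a flush; a real system would likely rotate files instead.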

ClickHouse and Jitsu quick-start

In this section we’ll configure a single node installation of ClickHouse and Jitsu using official Docker images.

Note that this is a dev setup to get things going. In production scenarios you would want to deploy multiple Jitsu nodes and to enable ClickHouse replicas to ensure availability of data as well as scale throughput.

1. Pull latest Docker images

docker pull jitsucom/server:latest && docker pull yandex/clickhouse-server:latest

2. Start ClickHouse

mkdir ./clickhouse_data && docker run -d --name clickhouse-test -p 8123:8123 -v $PWD/clickhouse_data:/var/lib/clickhouse yandex/clickhouse-server

Make sure that ClickHouse is running: docker ps | grep clickhouse-test

3. Configure Jitsu

Put the following content into eventnative.yaml (EventNative is the legacy name of the Jitsu project, and some configuration files still use the old name):

server:
  auth:
    - server_secret: "ia7i92rqp3mh" # access token. We will need it later for sending events through HTTP API
destinations:
  clickhouse:
    mode: stream
    clickhouse:
      dsns:
        - "http://default:@host.docker.internal:8123?read_timeout=5m&timeout=5m"
      db: default
    data_layout:
      mappings:
        fields:
          - src: /field_1/sub_field_1
            action: remove
          - src: /field_2/sub_field_1
            dst: /field_10/sub_field_1
            action: move
          - src: /field_3/sub_field_1/sub_sub_field_1
            dst: /field_20
            action: move
            type: DateTime
          - dst: /constant_field
            action: constant
            value: 1000

Create a directory for logs: mkdir ./jitsu-data && chmod -R 777 ./jitsu-data

4. Start Jitsu

docker run -d -t --name jitsu-test -p 8001:8001 \
-v $PWD/eventnative.yaml:/home/eventnative/data/config/eventnative.yaml \
-v $PWD/jitsu-data:/home/eventnative/data/ jitsucom/server:latest

On Linux, add --add-host=host.docker.internal:host-gateway:

docker run --add-host=host.docker.internal:host-gateway -d -t --name jitsu-test -p 8001:8001 \
-v $PWD/eventnative.yaml:/home/eventnative/data/config/eventnative.yaml \
-v $PWD/jitsu-data:/home/eventnative/data/ jitsucom/server:latest


5. Send a test event and check that it landed in ClickHouse

Put the following JSON into ./api.json:

{
  "eventn_ctx": {
    "event_id": "19b9907d-e814-42d8-a16d-c5da51e01531"
  },
  "field_1": {
    "sub_field_1": "text1",
    "sub_field_2": 100
  },
  "field_2": "text2",
  "field_3": {
    "sub_field_1": {
      "sub_sub_field_1": "2020-09-25T12:38:27"
    }
  }
}

Run the following command:

curl -X POST -H "Content-Type: application/json" -d @./api.json 'http://localhost:8001/api/v1/s2s/event?token=ia7i92rqp3mh'

Then check that the event landed in the database:

echo 'SELECT * FROM events;' | curl 'http://localhost:8123/' --data-binary @-

You’ll see one event in the database. The test worked!
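For applications that send events from code rather than from the shell, the same request can be built with Python's standard library. This is a minimal sketch that mirrors the curl command above; build_event_request and send_event are hypothetical helper names, and the default base URL assumes the quick-start setup from this article.

```python
import json
import urllib.request


def build_event_request(event, token, base_url="http://localhost:8001"):
    # Same endpoint and token query parameter as the curl command above.
    url = f"{base_url}/api/v1/s2s/event?token={token}"
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_event(event, token, base_url="http://localhost:8001"):
    # Performs the network call; requires the Jitsu container to be running.
    request = build_event_request(event, token, base_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Splitting request construction from the network call makes it possible to inspect the URL and payload without a running server.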

6. Test event buffering

One of the core features of Jitsu is event buffering. Events are written to an internal queue with disk persistence. If a destination (ClickHouse in our case) is down, data won’t be lost! It will be kept locally until ClickHouse is up again.

Let’s test this feature!

Put the following JSON into ./api2.json:

{
  "eventn_ctx": {
    "event_id": "4748c7bb-50d4-43a7-91b4-21a5bcccb12e"
  },
  "field_1": {
    "sub_field_1": "text1",
    "sub_field_2": 100
  },
  "field_2": "text2",
  "field_3": {
    "sub_field_1": {
      "sub_sub_field_1": "2020-09-25T12:38:27"
    }
  }
}

Then run the following steps.

1. Shut down ClickHouse:

docker stop clickhouse-test

2. Send an event:

curl -X POST -H "Content-Type: application/json" -d @./api2.json 'http://localhost:8001/api/v1/s2s/event?token=ia7i92rqp3mh'

3. Verify that ClickHouse is down (the query should fail with a connection error):

echo 'SELECT * FROM events;' | curl 'http://localhost:8123/' --data-binary @-

4. Start ClickHouse again:

docker start clickhouse-test

5. Wait for 60 seconds, then verify that the event hasn’t been lost:

echo 'SELECT * FROM events;' | curl 'http://localhost:8123/' --data-binary @-

If you see the event on the last step, the test succeeded.
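Instead of waiting a fixed 60 seconds, the check can be automated by polling ClickHouse until the event appears. The sketch below is one way to do it under the quick-start assumptions (ClickHouse HTTP interface on localhost:8123, table named events); wait_for_event and clickhouse_select_all are hypothetical helper names.

```python
import time
import urllib.request


def clickhouse_select_all(url="http://localhost:8123/"):
    # Same query as the curl check above, sent as the POST body.
    request = urllib.request.Request(url, data=b"SELECT * FROM events;", method="POST")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")


def wait_for_event(run_query, event_id, timeout=120.0, interval=5.0, sleep=time.sleep):
    """Poll until `event_id` shows up in the query output or the timeout expires.

    `run_query` is any callable returning the raw query output as text;
    it may raise OSError while ClickHouse is still starting.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if event_id in run_query():
                return True
        except OSError:
            pass  # ClickHouse is not reachable yet
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

After restarting ClickHouse, calling wait_for_event(clickhouse_select_all, "4748c7bb-50d4-43a7-91b4-21a5bcccb12e") returns True as soon as the buffered event has been delivered, or False if it has not arrived within the timeout.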

Schema management with Jitsu and ClickHouse

Jitsu is designed to be a schema-less component in your stack. This means you don’t have to create table schemas in advance or maintain them. Jitsu takes care of it automatically! Each incoming JSON field is mapped to a SQL column. If the column is missing, Jitsu creates it in ClickHouse automatically.

This is particularly useful when one engineering team is in charge of event structure, and another team operates ClickHouse. As an example: a frontend developer may start sending very simple data to track product page views (product_id and price), and add more sophisticated fields later (currency, images). Neither change requires a manual schema migration.
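To make the idea concrete, here is a minimal sketch of how missing columns could be derived from an incoming event. It is not Jitsu's code: the function names, the underscore-joined column naming, and the JSON-to-ClickHouse type mapping are all assumptions made for illustration.

```python
def flatten(obj, prefix=""):
    """Turn nested objects into flat column names joined with underscores."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def clickhouse_type(value):
    # Assumed mapping for this sketch only. bool is tested before int because
    # bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "UInt8"
    if isinstance(value, int):
        return "Int64"
    if isinstance(value, float):
        return "Float64"
    return "String"


def missing_column_ddl(table, existing_columns, event):
    """Build ALTER TABLE statements for fields that have no column yet."""
    return [
        f"ALTER TABLE {table} ADD COLUMN IF NOT EXISTS {name} {clickhouse_type(value)}"
        for name, value in flatten(event).items()
        if name not in existing_columns
    ]


# The first release of the page-view event created these two columns.
existing = {"product_id", "product_price"}
# A later release adds a currency and nested image fields.
event = {
    "product_id": "1e48fb70-ef12-4ea9-ab10-fd0b910c49ce",
    "product_price": 399.99,
    "price_currency": "USD",
    "images": {"main": "picture1", "sub": "picture2"},
}
for statement in missing_column_ddl("events", existing, event):
    print(statement)
```

With these inputs the sketch prints three ADD COLUMN statements, for price_currency, images_main, and images_sub, while the two existing columns are left untouched.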

Example:

Source event:

{
  "product_id": "1e48fb70-ef12-4ea9-ab10-fd0b910c49ce",
  "product_price": 399.99,
  "price_currency": "USD",
  "product_type": "supplies",
  "product_release_start": "2020-09-25T12:38:27",
  "images": {
    "main": "picture1",
    "sub": "picture2"
  }
}

Mapping configuration details

Jitsu can be configured to apply particular transformations to incoming JSON objects, such as:

  • Removing fields
  • Renaming fields (including moving an element to another node)
  • Explicitly defining the SQL type of a node
  • Setting a constant

Rules:

- src: /field_1/sub_field_1
  action: remove
- src: /field_2/sub_field_1
  dst: /field_10/sub_field_1
  action: move
- src: /field_3/sub_field_1/sub_sub_field_1
  dst: /field_20
  action: move
  type: DateTime
- dst: /constant_field
  action: constant
  value: 1000

See how those rules are applied:

Source event:
{
 "eventn_ctx": {
   "event_id": "19b9907d-e814-42d8-a16d-c5da51e01530"
   // this field indicates a unique id
 },
 "field_1":  {
   "sub_field_1": "text1",
   "sub_field_2": 100
 },
 "field_2": "text2",
 "field_3": {
   "sub_field_1": {
     "sub_sub_field_1": "2020-09-25T12:38:27"
   }
 }
}
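The effect of the rules on this event can be traced with a small Python sketch. It is an illustration only, not Jitsu's mapping engine: apply_rules and its helpers are hypothetical, the `type` hint is ignored, a rule whose source path does not exist is assumed to be a no-op, and the result is shown before flattening.

```python
import copy


def _split(path):
    # "/field_1/sub_field_1" -> ["field_1", "sub_field_1"]
    return [part for part in path.split("/") if part]


def _pop(node, parts):
    """Remove the value at `parts` and drop parents left empty. Returns (found, value)."""
    if not isinstance(node, dict) or parts[0] not in node:
        return False, None
    if len(parts) == 1:
        return True, node.pop(parts[0])
    found, value = _pop(node[parts[0]], parts[1:])
    if found and node[parts[0]] == {}:
        del node[parts[0]]
    return found, value


def _put(node, parts, value):
    # Create intermediate objects as needed, then set the leaf.
    for part in parts[:-1]:
        node = node.setdefault(part, {})
    node[parts[-1]] = value


def apply_rules(event, rules):
    result = copy.deepcopy(event)  # leave the caller's event untouched
    for rule in rules:
        if rule["action"] == "remove":
            _pop(result, _split(rule["src"]))
        elif rule["action"] == "move":
            found, value = _pop(result, _split(rule["src"]))
            if found:  # assumed: a missing source path is skipped
                _put(result, _split(rule["dst"]), value)
        elif rule["action"] == "constant":
            _put(result, _split(rule["dst"]), rule["value"])
    return result


rules = [
    {"src": "/field_1/sub_field_1", "action": "remove"},
    {"src": "/field_2/sub_field_1", "dst": "/field_10/sub_field_1", "action": "move"},
    {"src": "/field_3/sub_field_1/sub_sub_field_1", "dst": "/field_20", "action": "move"},
    {"dst": "/constant_field", "action": "constant", "value": 1000},
]
source = {
    "eventn_ctx": {"event_id": "19b9907d-e814-42d8-a16d-c5da51e01530"},
    "field_1": {"sub_field_1": "text1", "sub_field_2": 100},
    "field_2": "text2",
    "field_3": {"sub_field_1": {"sub_sub_field_1": "2020-09-25T12:38:27"}},
}
mapped = apply_rules(source, rules)
```

In this sketch, field_1 keeps only sub_field_2, the timestamp moves to a top-level field_20, constant_field is set to 1000, and the second rule changes nothing because field_2 is a plain string in this event.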

See a full description of this feature in the documentation.

Performance Tips

ReplacingMergeTree (or ReplicatedReplacingMergeTree) is the best choice for data produced by Jitsu. Here’s why:

  • Usually, data produced by Jitsu is used in aggregated queries, such as the number of events per period satisfying filtering conditions. The MergeTree engine family shows great performance for aggregation queries.
  • ReplacingMergeTree (unlike ordinary MergeTree) has a nice side effect: data deduplication. Often, mistakes are found in data after it has been loaded, and sometimes a replay is required. Since Jitsu can optionally keep a copy of data locally for a while, it’s possible to write a script to fix the data and send it to Jitsu once again. ReplacingMergeTree will avoid data duplication provided each event has a unique id and the id is used as a key.
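The deduplication effect on a replay can be illustrated with a toy model. In ClickHouse, ReplacingMergeTree collapses rows that share a sorting key during background merges and, when no version column is configured, keeps the most recently inserted one; duplicates can therefore remain visible until a merge has run. The sketch below only mimics that outcome in plain Python; merge_replacing and the column name eventn_ctx_event_id are assumptions for illustration.

```python
def merge_replacing(rows, key="eventn_ctx_event_id"):
    """Keep the last inserted row for each key, as a merge would."""
    latest = {}
    for row in rows:
        latest[row[key]] = row  # later inserts overwrite earlier ones
    return list(latest.values())


# Original load, followed by a replay that fixes the price of event "a".
loaded = [
    {"eventn_ctx_event_id": "a", "product_price": 39999.0},
    {"eventn_ctx_event_id": "b", "product_price": 10.0},
]
replayed = [{"eventn_ctx_event_id": "a", "product_price": 399.99}]
merged = merge_replacing(loaded + replayed)
```

After the simulated merge there are still two rows, and event "a" carries the corrected price from the replay.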

If the destination table is missing, Jitsu creates it with the ReplacingMergeTree engine, or with ReplicatedReplacingMergeTree if the cluster size is greater than 1. The engine can also be configured manually. Read more about table creation in the documentation.

Learning More

  • Follow & star Jitsu on GitHub
  • Try a cloud version of Jitsu. It's free for up to 1 million events per month and supports ClickHouse as well
  • Check out Altinity — ClickHouse cloud provider

About Jitsu

Jitsu is an open-source data integration platform offering features like pulling data from APIs, streaming event-based data to databases, multiplexing, and many others.