Retroactive User Recognition on Clickhouse
Retroactive User Recognition feature needs to update data that was already loaded to a data warehouse
Clickhouse by its nature doesn't support
UPDATE queries, but it provides different solutions with important specifics.
Retroactive User Recognition on Clickhouse works with tables using ReplacingMergeTree or ReplicatedReplacingMergeTree engines and relies on its data deduplication mechanism.
Data deduplication occurs in the background at an unknown time, so you can’t plan for it. Some data may remain unprocessed.
Basically speaking, that means that
SELECT query will contain some duplicated rows when query time period intersects with
User Recognition anonymous events TTL period (Read more on TTL)
To avoid duplicated data in results you need to use
FINAL modifier after table name:
select * from events_with_user_recognition FINAL order by _timestamp desc;
FINAL is specified, ClickHouse fully merges the data before returning the result and thus performs all data transformations that happen during merges for the given table engine.
When Jitsu creates table in Clickhouse it automatically setups table engine compatible with Retroactive User Recognition.
If you need Retroactive User Recognition to work with existing table you need to make sure that correct
ENGINE and sorting key (
are set for table:
ENGINE = ReplacingMergeTree or
ENGINE = ReplicatedReplacingMergeTree
Otherwise, User Recognition needs to be turned off for that destination to avoid data duplication.
CREATE TABLE jitsu.events_with_user_recognition ( `_timestamp` DateTime, `api_key` String, `event_type` String, ... ) ENGINE = ReplacingMergeTree PARTITION BY toYYYYMM(_timestamp) ORDER BY eventn_ctx_event_id SETTINGS index_granularity = 8192