4. Online Analytical Processing

🔙

Date: 2025-01-23

Status

Accepted

Context

For audit and business intelligence purposes, it’s required to store and query various and numerous business events, such as ‘product added’, ‘order paid’.

Note: Uptrace accumulates technical metrics in ClickHouse and has no other options. This is a hint to use the same storage for both types of big data.

Decision

ClickHouse and Metabase.

Consequences

Business events are stored and aggregated in such a way that aggregation time of hundreds of millions of records is typically no more than 1 sec. (depends on proper data organization and queries). Ability to visualize data in an online constructor in the form of graphs and tables.

Options

ElasticSearch and Kibana

🔝

This system stores and visualizes big data, but only in terms of aggregated logs. Of course, all kinds of data can be stored there, however, the set of analytical capabilities is limited (approximately at the level of relational databases), and the query language is peculiar.

ElasticSearch was the standard solution before the emergence of new generation OLAP systems, which might not have appeared if ElasticSearch met the increased requirements in this area. Nevertheless, in the field of search and log visualization, as well as full-text search (of any data), it remains the leader. To compare, full-text search in ClickHouse is still in the experimental stage.

ElasticSearch 8’s license prohibits providing the product as a service (apparently to prevent clouds from reselling the free product), but allows for internal use in projects.

ClickHouse and Metabase

🔝

ClickHouse was developed by Yandex, has been serving Yandex Metrica since 2009, and became publicly available in 2016 under the Apache license. The vendor website offers a cloud version, but the product can be self-hosted for free.

Pros and Cons

Pros
  • The advantages of this system are discussed in many articles, for example in this one.
  • Comparison to ElasticSearch is discussed in this article, particularly the significantly higher degree of data compression and aggregation speed, which saves both time and hardware resources. Here’s a picture from that article - an example of aggregating one billion records:
    picture
  • The learning curve is reduced significantly because the query language is SQL, and there’s a Spring library which makes ClickHouse available via JdbcTemplate.
Cons
  • Has high RAM requirements: at least 4GB, preferably 8GB or more.
  • ClickHouse «doesn’t like» receiving frequent small batches of data - fixed by sending events not directly (which would slow down the applications - it’s a synchronous operation), but asynchronously - through Kafka.
  • For effective use, it requires studying specific approaches to data organization and processing, such as AggregatingMergeTree, which, however, is facilitated by good guides.

For data visualization, there’s Metabase. This online visualization constructor is not tied to any specific data source: there are plugins for ClickHouse, Postgres, and others.

Apache Superset

🔝

Pros and Cons

Pros
Cons
  • As noted on the competitor’s website (Metabase) - and this matches personal experimentation results in setting up and using both systems, – Apache Superset is noticeably more complex to set up and less intuitive in terms of interface.