6. Metrics Collection and Visualization

Date: 2025-01-24

Status

Accepted

Context

To evaluate code performance, it is necessary to collect metrics such as the execution time of DB queries and HTTP requests, CPU consumption, and RAM usage.

Decision

For performance measurement: Uptrace.

For collecting internal JVM metrics: Prometheus and Grafana (see their demo).
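To make "internal JVM metrics" concrete, here is a minimal sketch that reads a few such values through the JDK's standard management beans. It only prints them; in the setup chosen above, an exporter would presumably publish comparable values for Prometheus to scrape. The class and method names are hypothetical and chosen for illustration.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical class name; illustration only, not part of any exporter.
final class JvmMetricsSketch {

    /** Bytes of heap memory currently in use. */
    static long heapUsedBytes() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
    }

    /** Number of live threads, daemon and non-daemon. */
    static int liveThreadCount() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }

    /** Collections summed over all collectors; a collector reporting -1 (undefined) is skipped. */
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("heap used (bytes): " + heapUsedBytes());
        System.out.println("live threads:      " + liveThreadCount());
        System.out.println("GC runs so far:    " + totalGcCount());
    }
}
```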

Consequences

Ability to evaluate the performance of all project subsystems.

Options

Uptrace

Uptrace is a popular and powerful solution. The vendor website offers a cloud option; self-hosting is free.

Uptrace accumulates metrics in ClickHouse and stores metric metadata in Postgres. The graphs are useful because they display percentiles [1] instead of average values: p50, p90, p99. This helps answer questions such as how long it takes to process 90% of incoming REST requests, leaving the slowest 10% (the peaks) out of scope. Accordingly, business requirements for speed apply not to 100% of requests, which is technically impossible to guarantee, but to 90% or 99% of them.

Here's an example graph from the vendor website: [image: Uptrace graph example]
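The difference between percentiles and an average can be shown with a small sketch. It uses the nearest-rank definition of a percentile, which is an assumption made here for simplicity; Uptrace may compute percentiles differently. The class name, method names, and sample durations are made up for illustration.

```java
import java.util.Arrays;

// Hypothetical helper; illustration only, not how Uptrace computes percentiles.
final class PercentileSketch {

    /**
     * Nearest-rank percentile: the smallest sample such that at least
     * p percent of all samples are less than or equal to it.
     */
    static double percentile(double[] samples, int p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("samples must not be empty");
        }
        if (p < 1 || p > 100) {
            throw new IllegalArgumentException("p must be in 1..100");
        }
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        // 1-based rank = ceil(p * n / 100), computed in integer arithmetic.
        int rank = (int) (((long) p * sorted.length + 99) / 100);
        return sorted[rank - 1];
    }

    /** Arithmetic mean, for comparison with the percentiles. */
    static double mean(double[] samples) {
        return Arrays.stream(samples).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // Ten made-up request durations in milliseconds, with one slow outlier.
        double[] durationsMs = {12, 15, 11, 14, 13, 16, 12, 15, 14, 900};
        System.out.println("mean = " + mean(durationsMs));
        System.out.println("p50  = " + percentile(durationsMs, 50));
        System.out.println("p90  = " + percentile(durationsMs, 90));
        System.out.println("p99  = " + percentile(durationsMs, 99));
    }
}
```

With these sample values, the single 900 ms outlier pulls the mean to about 102 ms, while p50 (14 ms) and p90 (16 ms) still describe what most requests experience. Only p99 reflects the outlier.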

Pros and Cons

Pros
  • Attaches transparently to any Java application as a Java agent, so applications don't even know their speed is being measured.
  • Shows statistically meaningful values: percentiles rather than averages.
  • No configuration required: all graphs are available out of the box.
Cons
  • The metrics collection process reduces application performance. In production, this is mitigated by collecting only 10% of the metrics rather than turning collection off, because it's crucial to understand how production works.
  • ClickHouse is required for metrics accumulation. It has high RAM requirements: at least 4 GB, preferably 8 GB or more. However, if ClickHouse is already part of the project (for audit and/or business analytics), this is not a disadvantage.
  • This is not a Java code profiler. Only external system calls are measured: databases (execution time of each SQL query), message brokers, and REST calls (incoming and outgoing). Actual line-by-line code profiling is done by developers using other tools.
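The idea of collecting only 10% of the metrics can be illustrated with a probabilistic sampler. This is a minimal sketch under the assumption that the decision is made independently per request with a fixed ratio; it is not the agent's actual implementation, and in practice the ratio would normally be set in the agent's configuration rather than hand-written. The class and method names are hypothetical.

```java
import java.util.Random;

// Hypothetical sampler; illustration only, not the agent's real sampling logic.
final class SamplingSketch {
    private final double ratio;
    private final Random random;

    /** ratio is the fraction of requests to record, from 0.0 (none) to 1.0 (all). */
    SamplingSketch(double ratio, Random random) {
        if (ratio < 0.0 || ratio > 1.0) {
            throw new IllegalArgumentException("ratio must be in [0.0, 1.0]");
        }
        this.ratio = ratio;
        this.random = random;
    }

    /** Returns true for roughly `ratio` of all calls. nextDouble() is in [0.0, 1.0). */
    boolean shouldRecord() {
        return random.nextDouble() < ratio;
    }

    public static void main(String[] args) {
        SamplingSketch sampler = new SamplingSketch(0.10, new Random());
        int total = 100_000;
        int recorded = 0;
        for (int i = 0; i < total; i++) {
            if (sampler.shouldRecord()) {
                recorded++; // only these requests would pay the measurement overhead
            }
        }
        System.out.printf("recorded %d of %d requests%n", recorded, total);
    }
}
```

Because the sample is random, percentiles computed from it remain a reasonable estimate of the full traffic, while about 90% of requests skip the measurement cost.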

  1. A percentile is a value that a given percentage of the sample does not exceed. For example, p90 for query execution time is the number of seconds that 90% of queries do not exceed.