Observability
This year's film festival showed me how hard it is to reproduce the bugs users report. I can't realistically expect KAFE's users to write me detailed reports with screenshots and reproduction steps (as amazing as that would be). I'm glad any reports reach me at all. I therefore decided to invest some time in learning about instrumentation, observability, and other fancy words one hears about applications running in production.
The end result is six more Docker containers running on mlejnek:
blackbox
-- Prometheus's Blackbox exporter
It checks that https://kafe.muni.cz and https://games.muni.cz are up and running, measures their latency, and exports these metrics into Prometheus.
prometheus
-- Metrics monitor
It stores, filters, and displays metrics (variables over time) you throw at it.
otel
-- OpenTelemetry Collector (Contrib)
OpenTelemetry is a collection of APIs, SDKs, formats, etc. for getting, collecting, and exporting telemetry data.
From Kafe.Api I export traces and metrics using the OpenTelemetry NuGet packages, and logs through Serilog.
All of these reach the otel container, from which they are sent further to prometheus, loki, and tempo.
loki
-- Grafana's log aggregator
Gets and stores logs from otel and makes them available in Grafana.
tempo
-- Grafana's trace aggregator
Gets and stores traces from otel and makes them available in Grafana.
grafana
-- Observability front-end and visualizer
Reads all of the data collected by blackbox and otel and stored in prometheus, loki, and tempo.
Allows us to filter it and visualize it.
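To keep the moving parts straight, here is a small Python model of the data flow described in the list above. It is only an illustration of who talks to whom, not configuration for any of the containers. The names SIGNAL_BACKEND and path are made up for this sketch, and I am assuming that metrics end up in prometheus, logs in loki, and traces in tempo, as the container descriptions suggest.

```python
# Which backend stores each signal, as described in the list above.
# (Hypothetical helper names; this is a model, not real configuration.)
SIGNAL_BACKEND = {
    "metrics": "prometheus",
    "logs": "loki",
    "traces": "tempo",
}


def path(source: str, signal: str) -> list[str]:
    """Return the containers a signal passes through on its way to Grafana."""
    if source == "blackbox":
        # The Blackbox exporter only produces metrics, which Prometheus stores.
        if signal != "metrics":
            raise ValueError("blackbox only exports metrics")
        return ["blackbox", "prometheus", "grafana"]
    if source == "Kafe.Api":
        # Everything from the application goes through the collector first.
        return ["Kafe.Api", "otel", SIGNAL_BACKEND[signal], "grafana"]
    raise ValueError(f"unknown source: {source}")
```

For example, path("Kafe.Api", "traces") walks through otel and tempo before ending in grafana.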
These tools are all free and open-source. Together they let us see errors and exceptions more easily, and will hopefully make hunting for reported bugs easier.
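To give a rough idea of what the blackbox container does for the two sites, here is a minimal Python sketch of an HTTP probe. It is not how the Blackbox exporter is implemented. The metric names probe_success and probe_duration_seconds are the ones the exporter reports, but the functions http_status, probe, and to_exposition, the 5-second timeout, and the instance label in the output are assumptions made for this example.

```python
import time
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Fetch the URL and return its HTTP status code (raises on failure)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def probe(url: str, fetch: Callable[[str], int] = http_status) -> dict[str, float]:
    """Probe one URL and return a success flag and the duration in seconds."""
    started = time.perf_counter()
    try:
        success = 200 <= fetch(url) < 300
    except Exception:
        # Network errors, timeouts, and HTTP error statuses all count as "down".
        success = False
    duration = time.perf_counter() - started
    return {
        "probe_success": 1.0 if success else 0.0,
        "probe_duration_seconds": duration,
    }


def to_exposition(url: str, result: dict[str, float]) -> str:
    """Render a probe result as lines in the Prometheus text format."""
    return "\n".join(
        f'{name}{{instance="{url}"}} {value}' for name, value in result.items()
    )
```

Calling probe("https://kafe.muni.cz") makes a real request, and passing the result to to_exposition gives the kind of text Prometheus can scrape. The fetch parameter is only there so the probe can be tried out without touching the network.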