To close off the initial throw, I’ve added the command direction: the experiment can be aborted from the web UI, and of course other commands can be implemented the same way.
Also, a live graph is shown, and the sensor images are embedded in the live-updated template.
The implementation is POC quality: not clean, and not adding the abstractions from above, at least not explicitly.
In that case, it might be worth looking into what the ecosystem has to offer these days: by now someone must have built something decent for handling relatively complex telemetry, so we don’t have to write our own Frankensteins all over again.
And indeed it has. One example I found particularly interesting is Logfire. It integrates with the stdlib logging module with very little hassle. It is advertised as a cloud-based solution, but a local setup is possible too; it just needs to be pointed at a local OpenTelemetry collector instead of their cloud endpoint.
Then, the setup could look like this:
The code registers Logfire as a sink for logging and configures a local OtelCol endpoint (where to send stuff to).
A local OtelCol instance is up and running nearby, configured to route each kind of data to its corresponding sink.
The exact sink/UI layout can be different:
Option a: Uptrace for everything. It can ingest logs, metrics, and traces, and provides a unified UI to work with all of them. It can use either ClickHouse or PostgreSQL as a backend, and I’d go for the latter here because it’s much simpler to manage.
Option b: separate services for separate kinds of data + Grafana on top. Personally, in this case I’d go with the VM stack: VictoriaMetrics, VictoriaLogs, and also VictoriaTraces even though it’s not 1.0 yet (or Jaeger/Grafana Tempo until it stabilizes).
All in all, this setup requires very little in terms of code, but a lot in terms of orchestrating the local environment, plus some learning curve for the community to adopt either Uptrace or Grafana.
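A minimal sketch of the code side of that setup, assuming a collector listening on the default OTLP/HTTP port 4318 (the endpoint and service name here are placeholders, not part of anyone’s actual config):

```python
import logging
import os

import logfire

# Point the OTLP exporters at a local collector instead of Logfire's cloud.
# (The endpoint is an assumption; adjust to wherever your OtelCol listens.)
os.environ.setdefault("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4318")

# Don't ship anything to the hosted Logfire backend.
logfire.configure(send_to_logfire=False, service_name="experiment-runner")

# Register Logfire as a sink for the stdlib logging module.
logging.basicConfig(
    level=logging.INFO,
    handlers=[logfire.LogfireLoggingHandler()],
)

logging.getLogger(__name__).info("routed through Logfire to the local collector")
```

From here, the OtelCol pipeline config decides which backend each signal type lands in; the Python side doesn’t need to know.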
Now for the most interesting part, the snapshots.
OpenTelemetry should not, under any circumstances, be used to continuously ship huge chunks of binary data. It defeats the entire purpose and eventually kills the setup. What can be done instead is to handle it in two separate steps:
1. A snapshot is created and stored as a file, either on a filesystem for simplicity or in an S3-compatible backend (think MinIO, since we’re adding Otel-related services to the setup anyway).
2. The information about that file is sent over the Otel channels as a structured log with a specific tag; whatever UI is used to render it then loads the file from storage by its ID.
It is even possible, in theory, to enrich Logfire with a method to handle this, but that’s monkeypatching. It’s probably wiser to do it the other way around and call Logfire from a separate function that handles both steps, especially since this kind of event isn’t that universal, and calls to that function would likely be reserved for very narrow code paths.
All of this is just something that looks like it could work end-to-end, not necessarily something I’m advocating for.
While I’m with you on your approach for classic telemetry, as I now understand @tslominski, what’s needed beyond classic metrics, logs, and traces is a rather domain-specific UI that includes sensor and analysis data. I’m sure Grafana would be able to handle it just fine; in the end, one can always write a plugin. I’m not sure about the command direction though, as I haven’t built commands into Grafana before.
By the way, @Rich_Morin, check out Grafana’s query language. If the concerns are properly separated, one could use it instead of GraphQL, at least for the telemetry use cases, given a proper storage/source that supports it.
For now, I’m pleased with how the barebones Pyview setup has worked out: without modifying Monty core or adding any infrastructure like DBs, message brokers, or extra servers beyond Python itself.
FWIW, the Goog says this about temporal data storage for OpenTelemetry:
OpenTelemetry itself is not a time series database; it is an observability framework for generating, collecting, processing, and exporting telemetry data (logs, metrics, and traces). The collected data must be sent to a separate, compatible time series database for storage and analysis.
Common time series databases and observability backends that integrate with OpenTelemetry for metrics include:
Prometheus: A widely adopted open-source time series database and monitoring system. The OpenTelemetry Collector has a Prometheus receiver and exporter, allowing it to ingest Prometheus metrics or export OTLP metrics to Prometheus storage.
TimescaleDB: An open-source time series database implemented as a PostgreSQL extension, offering a familiar SQL interface with optimizations for time-oriented data.
InfluxDB: A purpose-built, open-source time series database designed for high-volume metrics.
ClickHouse: An open-source, high-performance columnar database that can store and efficiently query OpenTelemetry data, including time series metrics.
Vendor-Specific Backends: Many commercial observability platforms (e.g., Dynatrace, Elastic, Grafana, Splunk) offer backends that are OpenTelemetry compliant and provide their own optimized storage solutions for time series data.
OpenTelemetry’s main goal is to provide a vendor-agnostic standard for data collection, giving users flexibility in choosing their preferred analysis and storage backend.
If we’re talking about backends, there’s a reason why I mentioned VictoriaStack: for a local environment, resource utilization becomes very important. So,
Prometheus → VictoriaMetrics. VM is a drop-in replacement that consumes several times fewer resources and compresses data several times better too.
Loki → VictoriaLogs, for all the same reasons.
Trust me, I’ve been using all of these in production for years across several different projects as a day job, except VictoriaLogs: it became stable less than a year ago and we just didn’t switch to it in time, so now we’re stuck with Loki for a while.
On the other hand, this isn’t even important right now, because we’re not yet sure whether Otel is what we really need given our requirements, at least at the moment.
… VictoriaLogs (and other VictoriaMetrics products like VictoriaMetrics and VictoriaTraces) are fully open source, released under the Apache 2.0 license, offering a free-to-use, high-performance, and scalable solution for log management, with both single-node and cluster versions available on GitHub.
Key Points:
Open Source: VictoriaLogs is freely available and open source, meaning you can use it without licensing costs, paying only for your own infrastructure.
Apache 2.0 License: The project is released under the permissive Apache 2.0 license, making it suitable for commercial and personal use.
Comprehensive Suite: It’s part of the broader VictoriaMetrics ecosystem, which includes solutions for metrics (VictoriaMetrics) and tracing (VictoriaTraces).
Easy to Use: Designed to be simpler to set up and operate than alternatives like Elasticsearch or Grafana Loki, offering powerful query capabilities and compatibility with Unix tools.
Scalable & Efficient: It’s built for high performance, low memory usage, and scales linearly with resources, handling high-cardinality and high-volume logs efficiently.
I think there are two sinks: one for logs, and one for telemetry. This is so that we have separate domains of configuration. I haven’t given it too much thought, but I was thinking the telemetry sink would be a logger with the name logging.getLogger(f"telemetry.{__name__}"). I think that would give us the same control fidelity as normal logs, without conflicts.
Possibly, because they’re so different, a third domain for snapshots, logging.getLogger(f"snapshots.{__name__}"), but that’s just a not-yet-fully-thought-out escape valve in case we need yet another layer of control over data emission.
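As a sketch of that layering, the domains can be tuned independently through the standard logger hierarchy (the module names below are placeholders standing in for __name__):

```python
import logging

# Ordinary application logs vs. telemetry vs. (maybe) snapshots, as
# separate logger trees that share the stdlib machinery but not config.
app_log = logging.getLogger("monty.experiment")
telemetry = logging.getLogger("telemetry.monty.experiment")
snapshots = logging.getLogger("snapshots.monty.experiment")

# Each domain root gets its own level (and could get its own handlers)
# without touching the others; children inherit via the dotted hierarchy.
logging.getLogger("telemetry").setLevel(logging.DEBUG)
logging.getLogger("snapshots").setLevel(logging.WARNING)
```

Because level resolution walks up the dotted name, dialing "telemetry" up or "snapshots" down affects every module-level logger in that domain at once, which is the control fidelity being described.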
Made an approximation of how it could work. As the episodes are trivially parallel, the aggregation is more or less “show all”. But again, this was done without any change to the core, only via the config, which has proved to be quite a powerful extension point. Again, kudos!
The “Total Eval Steps” is a variation on the G-Counter CRDT theme (aggregating values from multiple named peers).
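For reference, the G-Counter idea in a few lines (a hypothetical sketch, not the code in the branch): each peer increments only its own slot, merging takes the per-peer max, and the displayed total is the sum, which is exactly “aggregating values from multiple named peers”:

```python
class GCounter:
    """Grow-only counter CRDT: per-peer counts, merged by per-peer max."""

    def __init__(self) -> None:
        self.counts: dict[str, int] = {}

    def increment(self, peer: str, n: int = 1) -> None:
        # Each peer only ever touches its own slot.
        self.counts[peer] = self.counts.get(peer, 0) + n

    def merge(self, other: "GCounter") -> None:
        # Commutative, associative, idempotent: safe to apply in any order.
        for peer, count in other.counts.items():
            self.counts[peer] = max(self.counts.get(peer, 0), count)

    @property
    def value(self) -> int:
        return sum(self.counts.values())
```

The merge being idempotent is what makes “show all” aggregation trivial for parallel episodes: re-receiving a peer’s state never double-counts.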
Everything is pushed to the branch.
There were some complexities, e.g. distributed task cancellation, but all in all I’m satisfied with the result.
Spent a couple of days with one actor-model implementation in Python, getting to know the implementation details a bit. Conclusion: it’s really hard to make it work. The Python runtime is simply not prepared for concurrency. Strapping async/await and asyncio onto it doesn’t solve fundamental problems such as fair scheduling, safety, or failure and data isolation.
@ash: “Monty’s codebase must remain sync. Introducing async into it will only affect the project negatively.”
Carl Sagan: “Extraordinary claims require extraordinary evidence.”
I’m quite certain that @ash can clarify and support this claim, but this topic is perhaps not the best place for that (let alone the ensuing discussion :-).