To close off the initial throw, I’ve added the command direction: the experiment can be aborted from the web UI, and of course other commands can be implemented the same way.
Also, a live graph is shown, and the sensor images are embedded in the live-updated template.
The implementation is POC quality: not clean, and not adding the abstractions from above, at least not explicitly.
In that case, it might be worth looking into what the ecosystem has to offer these days: by now someone must have built something decent for handling relatively complex telemetry, so we don’t have to write our own Frankensteins all over again.
And indeed it has. One example I found particularly interesting is Logfire. It integrates with the stdlib logging module with very little hassle. It is advertised as a cloud-based solution, but a local setup is possible too; it just needs to be pointed at a local OpenTelemetry collector instead of their cloud endpoint.
Then, the setup could look like this:
The code registers Logfire as a sink for logging and configures a local OtelCol endpoint (where to send stuff to).
A local OtelCol instance is up and running nearby, configured to route each kind of data to its corresponding sink.
The exact sink/UI layout can be different:
Option a: Uptrace for everything. It can ingest logs, metrics, and traces, and provides a unified UI to work with all of them. It can use either ClickHouse or PostgreSQL as a backend, and I’d go for the latter here because it’s much simpler to manage.
Option b: separate services for separate kinds of data + Grafana on top. Personally, in this case I’d go with the VM stack: VictoriaMetrics, VictoriaLogs, and also VictoriaTraces even though it’s not 1.0 yet (or Jaeger/Grafana Tempo until it stabilizes).
All in all, this setup requires very little in terms of code, but a lot in terms of orchestrating the local environment, plus some learning curve for the community to adopt either Uptrace or Grafana.
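A minimal sketch of the code side of that setup, assuming a collector listening on the default OTLP/HTTP port 4318 (the endpoint and service name here are placeholders, not part of anyone’s actual config):

```python
import logging
import os

import logfire

# Point the OTLP exporters at a local collector instead of Logfire's cloud.
# (The endpoint is an assumption; adjust to wherever your OtelCol listens.)
os.environ.setdefault("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4318")

# Don't ship anything to the hosted Logfire backend.
logfire.configure(send_to_logfire=False, service_name="experiment-runner")

# Register Logfire as a sink for the stdlib logging module.
logging.basicConfig(
    level=logging.INFO,
    handlers=[logfire.LogfireLoggingHandler()],
)

logging.getLogger(__name__).info("routed through Logfire to the local collector")
```

From here, the OtelCol pipeline config decides which backend each signal type lands in; the Python side doesn’t need to know.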
Now for the most interesting part, the snapshots.
OpenTelemetry should not, under any circumstances, be used to continuously ship huge chunks of binary data. It defeats the entire purpose and eventually kills the setup. What can be done instead is to handle it in two separate steps:
1. A snapshot is created and stored as a file, either on a filesystem for simplicity or in an S3-compatible backend (think MinIO, since we’re adding Otel-related services to the setup anyway).
2. The information about that file is sent over the Otel channels as a structured log with a specific tag; whatever UI is used to render it then loads the file from storage by its ID.
It is even possible, in theory, to enrich Logfire with a method to handle this, but that’s monkeypatching. It’s probably wiser to do it the other way around and call Logfire from a separate function that handles both steps, especially since this kind of event isn’t that universal, and calls to that function would likely be reserved for very narrow code paths.
All of this is just something that looks like it could work end-to-end, not necessarily something I’m advocating for.
While I’m with you on your approach for classic telemetry, as I now understand @tslominski, what’s needed beyond classic metrics, logs, and traces is a rather domain-specific UI that includes sensor and analysis data. I’m sure Grafana would be able to handle it just fine; in the end, one can always write a plugin. I’m not sure about the command direction though, as I haven’t built commands into Grafana before.
By the way, @Rich_Morin, check out Grafana’s query language. If the concerns are properly separated, one could use it instead of GraphQL, at least for the telemetry use cases, given a proper storage/source that supports it.
For now, I’m pleased with how the barebones Pyview setup has worked out: without modifying Monty core or adding any infrastructure like DBs, message brokers, or extra servers beyond Python itself.
FWIW, the Goog says this about temporal data storage for OpenTelemetry:
OpenTelemetry itself is not a time series database; it is an observability framework for generating, collecting, processing, and exporting telemetry data (logs, metrics, and traces). The collected data must be sent to a separate, compatible time series database for storage and analysis.
Common time series databases and observability backends that integrate with OpenTelemetry for metrics include:
Prometheus: A widely adopted open-source time series database and monitoring system. The OpenTelemetry Collector has a Prometheus receiver and exporter, allowing it to ingest Prometheus metrics or export OTLP metrics to Prometheus storage.
TimescaleDB: An open-source time series database implemented as a PostgreSQL extension, offering a familiar SQL interface with optimizations for time-oriented data.
InfluxDB: A purpose-built, open-source time series database designed for high-volume metrics.
ClickHouse: An open-source, high-performance columnar database that can store and efficiently query OpenTelemetry data, including time series metrics.
Vendor-Specific Backends: Many commercial observability platforms (e.g., Dynatrace, Elastic, Grafana, Splunk) offer backends that are OpenTelemetry compliant and provide their own optimized storage solutions for time series data.
OpenTelemetry’s main goal is to provide a vendor-agnostic standard for data collection, giving users flexibility in choosing their preferred analysis and storage backend.
If we’re talking about backends, there’s a reason why I mentioned VictoriaStack: for a local environment, resource utilization becomes very important. So,
Prometheus → VictoriaMetrics. VM is a drop-in replacement that consumes several times fewer resources and compresses data several times better too.
Loki → VictoriaLogs, for all the same reasons.
Trust me, I’ve been using all of these in production for years across several different projects as a day job, except VictoriaLogs: it became stable less than a year ago and we just didn’t switch to it in time, so now we’re stuck with Loki for a while.
On the other hand, this isn’t even important right now, because we’re not yet sure whether Otel is what we really need given our requirements, at least at the moment.
… VictoriaLogs (and other VictoriaMetrics products like VictoriaMetrics and VictoriaTraces) are fully open source, released under the Apache 2.0 license, offering a free-to-use, high-performance, and scalable solution for log management, with both single-node and cluster versions available on GitHub.
Key Points:
Open Source: VictoriaLogs is freely available and open source, meaning you can use it without licensing costs, paying only for your own infrastructure.
Apache 2.0 License: The project is released under the permissive Apache 2.0 license, making it suitable for commercial and personal use.
Comprehensive Suite: It’s part of the broader VictoriaMetrics ecosystem, which includes solutions for metrics (VictoriaMetrics) and tracing (VictoriaTraces).
Easy to Use: Designed to be simpler to set up and operate than alternatives like Elasticsearch or Grafana Loki, offering powerful query capabilities and compatibility with Unix tools.
Scalable & Efficient: It’s built for high performance, low memory usage, and scales linearly with resources, handling high-cardinality and high-volume logs efficiently.
I think there are two sinks: one for logs, and one for telemetry. This is so that we have separate domains of configuration. I haven’t given it too much thought, but I was thinking the telemetry sink would be a logger with the name logging.getLogger(f"telemetry.{__name__}"). I think that would give us the same control fidelity as normal logs, without conflicts.
Possibly, because they’re so different, a third domain for snapshots, logging.getLogger(f"snapshots.{__name__}"), but that’s just a not-yet-fully-thought-out escape valve in case we need yet another layer of control over data emission.
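As a sketch of that layering, the domains can be tuned independently through the standard logger hierarchy (the module names below are placeholders standing in for __name__):

```python
import logging

# Ordinary application logs vs. telemetry vs. (maybe) snapshots, as
# separate logger trees that share the stdlib machinery but not config.
app_log = logging.getLogger("monty.experiment")
telemetry = logging.getLogger("telemetry.monty.experiment")
snapshots = logging.getLogger("snapshots.monty.experiment")

# Each domain root gets its own level (and could get its own handlers)
# without touching the others; children inherit via the dotted hierarchy.
logging.getLogger("telemetry").setLevel(logging.DEBUG)
logging.getLogger("snapshots").setLevel(logging.WARNING)
```

Because level resolution walks up the dotted name, dialing "telemetry" up or "snapshots" down affects every module-level logger in that domain at once, which is the control fidelity being described.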
Made an approximation of how it could work. As the episodes are trivially parallel, the aggregation is more or less “show all”. But again, this was done without any change to the core, only via the config, which has proved to be quite a powerful extension point. Again, kudos!
The “Total Eval Steps” is a variation on the G-Counter CRDT theme (aggregating values from multiple named peers).
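For reference, the G-Counter idea in a few lines (a hypothetical sketch, not the code in the branch): each peer increments only its own slot, merging takes the per-peer max, and the displayed total is the sum, which is exactly “aggregating values from multiple named peers”:

```python
class GCounter:
    """Grow-only counter CRDT: per-peer counts, merged by per-peer max."""

    def __init__(self) -> None:
        self.counts: dict[str, int] = {}

    def increment(self, peer: str, n: int = 1) -> None:
        # Each peer only ever touches its own slot.
        self.counts[peer] = self.counts.get(peer, 0) + n

    def merge(self, other: "GCounter") -> None:
        # Commutative, associative, idempotent: safe to apply in any order.
        for peer, count in other.counts.items():
            self.counts[peer] = max(self.counts.get(peer, 0), count)

    @property
    def value(self) -> int:
        return sum(self.counts.values())
```

The merge being idempotent is what makes “show all” aggregation trivial for parallel episodes: re-receiving a peer’s state never double-counts.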
Everything is pushed to the branch.
There were some complexities, e.g. distributed task cancellation, but all in all I’m satisfied with the result.
Spent a couple of days with one actor-model implementation in Python, getting to know the implementation details a bit. Conclusion: it’s really hard to make it work. The Python runtime is simply not prepared for concurrency. Strapping async/await and asyncio onto it doesn’t solve fundamental problems such as fair scheduling, safety, or failure and data isolation.
@ash: “Monty’s codebase must remain sync. Introducing async into it will only affect the project negatively.”
Carl Sagan: “Extraordinary claims require extraordinary evidence.”
I’m quite certain that @ash can clarify and support this claim, but this topic is perhaps not the best place for that (let alone the ensuing discussion :-).