What if (?) Monty had a live view

Over the inevitable holiday season, I’ve spent some time burning the world’s electricity (using Cursor) to explore a pattern built on a LiveView implementation in Python: pyview. It uses the LiveView protocol for server-side and mixed web UIs that I’ve mentioned before.

After trying out the Python library on a departures monitor, I thought: why not demonstrate a live view for a Monty experiment, so as not to need any external services and to be able to control everything about the UI? One of the main ideas: no JS-side polling or complex single-page apps needed. Most logic is server-side Python. Here’s a small screencast:

monty-live-view

Do not mind the actual contents of the view - this could be anything, even something specific to the experiment being run. One could even imagine reusing just the HTML template and the Python LiveView class for other experiments, with some preparation. The intent is to pave the way and to work around limitations using the learnings from Erlang/Elixir.

In addition to the live view, several other patterns are tried out, e.g. single-shot scripts to set up/run things like the experiment itself or quality checks, supported by some dedicated Python libraries, tools, and custom wrappers. Also, since pyview requires Python 3.11+, it had to be put into a separate process, hence the architecture.

I’ve put it into the ongoing WIP pull request just as a reference. I don’t think this should be merged as is. Alternatively, I can open a merge request on my fork. WIP: live web view for an experiment by d-led · Pull Request #677 · thousandbrainsproject/tbp.monty · GitHub

The learnings: for this task there’s more code to be written in Python than in Elixir, and the safeties are not given for free, so one has to implement them oneself.

This time I haven’t written the code myself but let the “code genie” do it for me, while constraining it with a process and quality-check scripts. Thus, pardon the uncanny quality. What’s hopefully worth picking up is the demonstration itself and, perhaps, the sketch of a message-driven architecture. As a next challenge, I’ll try to see whether I can reproduce that particular experiment in Elixir, likewise to check what could come out of it.

A sketch of the approach:

flowchart LR
    subgraph Users[" "]
        User1("fa:fa-user User 1")
        User2("fa:fa-user User 2")
    end

    subgraph LiveViewServer["`Live View (py 3.11)`"]
        StateManager@{ shape: das, label: "ExperimentStateManager\n(ZMQ SUB)" }
        LiveView1[LiveView 1] -- subscribed to --> StateManager
        LiveView2[LiveView 2] -- subscribed to --> StateManager
    end

    subgraph MontyExperiment["Monty Experiment (py 3.8)"]
        %% Sensors[Sensors] -- publishes via --> ZmqBroadcaster
        %% LearningModules[Learning Modules] -- publishes via --> ZmqBroadcaster
        %% ProgressLoggers[Progress Loggers] -- publishes via --> ZmqBroadcaster
        Experiment[MontyExperimentWithLiveView] -- uses --> ZmqBroadcaster
        ZmqBroadcaster@{ shape: das, label: "ZmqBroadcaster\n(ZMQ PUB)" }
    end

    ZmqBroadcaster -. ZMQ messages .-> StateManager
    LiveView1 -- serves to --> User1
    LiveView2 -- serves to --> User2

pyview has a Python-internal pub/sub implementation, while the inter-process communication is done via ZeroMQ. This is an approach I’ve seen work in many places and have implemented in various flavors myself.
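The inter-process link can be sketched as follows (a minimal sketch using pyzmq; the address, topic handling, and payload shape are illustrative, not the ones from the PR):

```python
# Minimal sketch of the ZmqBroadcaster -> ExperimentStateManager link
# (illustrative address and payload; the real code lives in the PR).
import time
import zmq

ctx = zmq.Context.instance()

# experiment side (the py3.8 process): a PUB socket broadcasts state
pub = ctx.socket(zmq.PUB)
pub.bind("inproc://experiment-state")  # e.g. "tcp://*:5555" across processes

# live view side (the py3.11 process): a SUB socket receives it
sub = ctx.socket(zmq.SUB)
sub.connect("inproc://experiment-state")
sub.setsockopt(zmq.SUBSCRIBE, b"")  # no topic filtering

time.sleep(0.1)  # let the subscription propagate (PUB/SUB slow-joiner)

pub.send_json({"episode": 1, "step": 42, "mlh": "mug"})
state = sub.recv_json()
print(state)  # {'episode': 1, 'step': 42, 'mlh': 'mug'}
```

With PUB/SUB, slow or absent subscribers never block the experiment; the live views just miss messages they weren’t connected for.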

Unless something is broken (hopefully not): if the Monty conda environment is set up, running scripts/setup.sh and scripts/run.sh in the contrib experiment repo should start the experiment and the live view, which is accessible at port 8000.

The up-to-date description will be in README.md.

Thank you for sharing @DLed. Great to see a telemetry prototype!

What sort of data and views are you thinking of being in scope? Any thoughts on relationship with GitHub - thousandbrainsproject/tbp.plot: Tooling for plotting tbp.monty visualizations?

Very cool; especially the part about not requiring Elixir, Phoenix, etc. Did you notice anything that seems to be “missing” in your version (e.g., due to limitations in Python and/or Monty)?

Ah, I’ve missed out on tbp.plot! Indeed, that’s the kind of visualization I was thinking about as well: customizable and, perhaps, wrappable in an easy-to-use config. I’m not sure I’d put it strictly under telemetry, but the point is that you can look into, or maybe even steer (if interactive), the experiments from a web UI.

So, in effect, it could become like tbp.plot, but during the experiments rather than after the fact. Having a web UI could, perhaps, open up the views to easier design than the matplotlib derivatives.

@Rich_Morin the main limitations are the missing actor abstraction and the safeties of the BEAM. It’s harder to think about the live views as robust independent processes than it is in Phoenix. As Joe Armstrong said on some occasions: “in all technologies you manage many connections on a server (which requires juggling, taking care of shared state, etc.), however in Erlang we have one server per connection”. It’s a bit of an exaggeration but shows how one can think about it. So, without reading the code of pyview, which I haven’t done much of, I can’t say for sure how safe it is, e.g. with regard to state corruption, or how easy it is to get it to crash. I’m test-driving it on a live departures monitor. So far it hasn’t misbehaved for the end user.

Yes, agreed, that sounds promising and exciting.

What I mean by telemetry is the… data that makes those visualizations possible (your usual logs, metrics, but also buffer “snapshots” to see what a sensor sees). I believe one of the enablers of visualizations like this is to have an ability to emit telemetry using structured logging. This way, Monty emits structured data, out of band, and any visualization can consume the log stream and do whatever it wants to do.

I noticed that the prototype reads the configuration. I think if you were to attempt to visualize what a sensor module sees, or the internal state of a learning module, then that would require reaching into a lot of machinery to get the data out. In essence, you’d have to replicate Monty’s artisanal data loggers. This is where I think an out-of-band separate telemetry stream would be useful. Also, an out-of-band telemetry stream would help to avoid OOM-kill due to storing telemetry snapshots of too many learning modules (Undesirable Effect 116 Platform detailed logging explodes memory.)

Yes, this definitely is the way. One can see it as the typical IoT/edge/cloud/data-center/thin-client separation, where the sensors stream or provide pullable data, the edge collects it and does something (e.g. local learning), and the bigger computers do the bigger computations; yet even they can benefit from splitting the requirements for compute resources. A visualization server, for example, can run dedicated visualization databases and analyses (akin to the Grafana renderer service) to support better visualizations. Splitting the command and query paths can help simplify the design of the interactive bits (similar to my mermaidlive demo).

I’ve tried to make the demo non-intrusive to the tbp.monty core code and to hook into what I could without touching the core, hence into the experiment itself. However, given the right abstractions, e.g. a central events object (port) for telemetry, one could route it to /dev/null, a log file, an OpenTelemetry collector, or just a ZeroMQ socket with an agreed-upon format/protocol.

As for structured logging, one thing I wouldn’t do these days is “scrape” the actual structured log files or streams. Parsing should be trivial, so skipping the serialization bit while still within one Python process might be a good thing. Only at the exit, out of the port/adapter, should the serialization happen, in whatever format the adapter needs.

I’d like to promote the idea of a separate “console” or “dashboard” program which can interrogate Monty components (e.g., learning or sensor modules) using (say) GraphQL. Could this fit into the current design?

I’m thinking that while inside Python, other Python things can hook into the logging pipeline to do whatever. But, what I’d like there to be is a telemetry logging pipeline for things to hook into, not reach into implementation details of the Platform.

I’m thinking every file can access two loggers:

logger = logging.getLogger(__name__)
telemetry = monty_telemetry.getTelemetry(__name__)

where getTelemetry is a wrapper around logging.getLogger prefixing "telemetry." to all the logger names or something similar. (naming provisional)
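A minimal sketch of what such a wrapper could look like (module and function names are provisional, as noted above):

```python
# Provisional sketch: telemetry loggers live in their own "telemetry."
# namespace, so handlers and filters can target them separately from logs.
import logging

TELEMETRY_PREFIX = "telemetry."

def getTelemetry(name):
    """Like logging.getLogger, but namespaced under 'telemetry.'."""
    return logging.getLogger(TELEMETRY_PREFIX + name)

telemetry = getTelemetry("monty.sensor_module")
print(telemetry.name)  # telemetry.monty.sensor_module

# a handler attached to the namespace root sees only telemetry records
logging.getLogger("telemetry").addHandler(logging.NullHandler())
```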

nah :smiley:

Logging is a separate concern from telemetry. I’d not abuse it, as it’s been optimized for a different purpose. You can easily stream millions of points per second in a dedicated slim stack, but trying to force logging pipelines to do that might invite unnecessary workarounds.

The logging pipeline could sit at the end of the domain-specific port that’d translate events into structured logging, but if you keep them in Python until then, you’re free to stream them via MQTT, ZeroMQ+JSON, or even gRPC if needed.

From Parnas’ gem:

We propose instead that one begins with a list of difficult design decisions or design decisions which are likely to change. Each module is then designed to hide such a decision from the others.

IDEone runnable :smiley:

from abc import ABC, abstractmethod
from typing import Any, List, Tuple

class TelemetrySink(ABC):
    """Interface that hides implementation details of telemetry handling"""

    @abstractmethod
    def handle(self, *args):
        # type: (*Tuple[str, Any]) -> None
        """Process telemetry given as key-value pairs"""
        raise NotImplementedError

class Events:
    """
    Facade that hides the complexity of multi-sink telemetry processing
    """

    def __init__(self):
        self._sinks = []  # type: List[TelemetrySink]

    # ...
    def emit(self, *args):
        """
        Delegate the event to all sinks

        The client doesn't need to know:
        - how many sinks exist
        - what each sink does
        - whether sinks can be added/removed at runtime
        """
        for sink in self._sinks:
            sink.handle(*args)

Are we talking about the same thing?

Yes, logging is a separate concern from telemetry, hence I’m thinking of the two pipelines, a pipeline for each concern.

When I say logging, I mean the Python logging module, which keeps everything in Python and does not serialize anything until actual emission. There is no serialization until a handler is configured to serialize the data… into whatever: mqtt/zeromq+json/grpc/etc.
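To illustrate the point with a hypothetical handler (not an existing one): the LogRecord stays a plain in-memory Python object, and serialization happens only inside a handler configured at the edge of the pipeline.

```python
# Hypothetical handler: serialization happens only here, at emission.
# A ZMQ/MQTT handler would send the bytes instead of collecting them.
import json
import logging

class JsonLinesHandler(logging.Handler):
    def __init__(self):
        super().__init__()
        self.lines = []  # stand-in for a socket or stream

    def emit(self, record):
        # only now does the in-memory LogRecord become bytes/text
        self.lines.append(json.dumps({
            "logger": record.name,
            "level": record.levelname,
            "event": record.getMessage(),
            "data": getattr(record, "data", None),
        }))

logger = logging.getLogger("telemetry.demo")
logger.setLevel(logging.INFO)
handler = JsonLinesHandler()
logger.addHandler(handler)

# 'extra' attaches a structured payload to the in-memory record
logger.info("episode_end", extra={"data": {"episode": 3, "steps": 120}})
print(handler.lines[0])
```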

What is abusive about using the logging module for telemetry? LogRecord objects seem to just be some objects in memory. Are you concerned that creating in-memory LogRecords is too much overhead for telemetry? The positive I see in using logging is that it exists and everyone uses it, so I’m pretty sure it works.

FWIW, I’m open to using a tracing library for telemetry. But, that seems like a heavier lift than logging structured events.

I think my concern is that logging is quite a closed concern, whereas different decisions may require adding something to the metrics/telemetry sink that logging was not designed for. A telemetry sink, in the sketched approach, would be just another adapter.

In OpenTelemetry, which is not the first iteration of its kind, the different kinds of payload are explicitly treated differently, based on operational experience with observability:

Signals | OpenTelemetry - traces, metrics and logs (+ some more).

What I see from the Monty context is that we’re rather talking about metrics (not log lines). Perhaps metrics with baggage.

I don’t think logging is wrong as such. There’s just the question of what happens when one does log a line. For now, I think, it’s not much different from the custom sketch I pasted above. But it somehow feels wrong to use the generic, unspecific logger interface for domain-specific events. A LogRecord might contain things that are perhaps not wanted, for whatever reason.

Also, for refactoring reasons: having a domain-specific type will help, e.g., find all usages within the code without false positives. Searching for the unspecific logger will deliver a long list of all the places that log, and picking the metrics out of them might get hard. Perhaps this is my most convincing argument, at least for me, from experience.

I’ve abused structured logging for metrics before :smiley:. But in the last few years, always via a domain-specific interface, for the reason above.

Right now logging is your best bet, because the code relies heavily on it. But once Monty has the capacity to inject a logger and pass it around, suddenly logging is excessive. It’s also synchronous, which is especially important if there’s a plan to fire off lots of events and send them to various places over the network. So ideally it becomes a small but extendable custom class with an asynchronous queue and a task that processes that queue, sending whatever has been put into it wherever it needs to be sent. Something like this, a short version of what I’ve been using myself lately.
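A minimal sketch of such a queue-backed emitter (assumptions: a background thread drains the queue and a sentinel stops it; a real version would batch and ship events over the network, and the names here are illustrative):

```python
# Sketch: emit() never blocks the experiment loop; a worker thread
# drains the queue and hands each event to a pluggable sink.
import queue
import threading

class AsyncTelemetry:
    def __init__(self, sink):
        self._q = queue.Queue()
        self._sink = sink  # e.g. a ZMQ/MQTT sender; here any callable
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def emit(self, event):
        self._q.put(event)  # cheap and non-blocking for the caller

    def _drain(self):
        while True:
            event = self._q.get()
            if event is None:  # sentinel: stop the worker
                break
            self._sink(event)

    def close(self):
        self._q.put(None)
        self._worker.join()

received = []
telemetry = AsyncTelemetry(received.append)
telemetry.emit({"step": 1})
telemetry.emit({"step": 2})
telemetry.close()
print(received)  # [{'step': 1}, {'step': 2}]
```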

One thing that those unfamiliar with Elixir and Phoenix might not know, and what Rich is perhaps alluding to, is that the Erlang runtime has telemetry, tracing, and remote-query abilities built in, meaning that one can connect to a running system, find its internal processes, send messages to them, and receive answers. A web version of that functionality is called a live dashboard, which can plot custom metrics or show the innards of the running system.

The big three (logs, metrics, and traces) are insufficient for the Monty use case. We also need a fourth kind, what I’ve been calling “snapshots”. I wasn’t able to find a good industry-standard name for these; the names depend on context: buffers, frames, snapshots, scrapes, etc. They’re the things that would allow you to see the live visualization. Which reminds me, have you run an experiment with experiment.config.show_sensor_output=true?

python run.py experiment=base_config_10distinctobj_dist_agent experiment.config.show_sensor_output=true

show_sensor_output

What I’m calling “snapshots” are the things that go into “Camera image”, “Sensor depth image”, and “MLH”. They’re not logs, metrics, or traces.

And no, I’m not saying we should base64 encode those in a structured log :slightly_smiling_face:. What I am saying is that in the telemetry pipeline, references to those snapshots should be available for a handler to do a live visualization of the kind the live plotter is doing.

I think we’re thinking similar things here. This is why I’m suggesting something along the lines of:

logger = logging.getLogger(__name__)
telemetry = monty_telemetry.getTelemetry(__name__)

I’m not advocating for an unspecific logger.

The other component I’m thinking of is the event schema: (Injection #82 Use schema for structured telemetry). The shape of all the data we emit that might be of scientific interest is not generic. A SensorModule will emit an event shape that’s different from a LearningModule’s. Different types of LearningModules might emit different shapes of data. I think less in terms of Prometheus/Datadog metrics and more along the lines of Honeycomb “wide” events (lots of baggage), with, in Monty’s case, each event also having a unique schema.
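Per-module event shapes could be expressed as explicit types, along these purely illustrative lines (the field names are made up, not Monty’s):

```python
# Illustrative per-module "wide" event shapes: each module type emits
# an event with its own schema rather than a generic metric.
from dataclasses import dataclass, field, asdict
from typing import Dict, List

@dataclass
class SensorModuleEvent:
    sensor_id: str
    step: int
    features: Dict[str, float] = field(default_factory=dict)

@dataclass
class LearningModuleEvent:
    lm_id: str
    step: int
    possible_objects: List[str] = field(default_factory=list)
    mlh: str = ""  # most likely hypothesis

event = LearningModuleEvent("LM_0", 42, ["mug", "bowl"], "mug")
print(asdict(event))  # plain dict, ready for any serializing handler
```

A side benefit, as argued above: searching for `LearningModuleEvent` finds exactly the telemetry emission sites, with no false positives from ordinary logging.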

For experimental purposes, events are more likely to be useful than gauges, counters, or histograms. Experiments only care about what happens at the end of the step, not about a fixed-interval scrape. So… there’s also the experimental vs. operational use case to consider.

The synchronous bit is a good highlight.

The context of wanting to remove experimental framing from the platform is relevant here. I’m thinking that, as the first step, we would transition from the current way of collecting telemetry, where we call a post_episode hook on MontyHandlers, to emitting generic Honeycomb-style “wide” events that contain all the telemetry that may be of scientific interest. Shifting to a telemetry wrapper around the logging module would allow us to get rid of our artisanal SILENT, BASIC, DETAILED, and SELECTIVE log levels in favor of the usual log levels, and gets us module-specific, fine-grained telemetry emission control “for free.”
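The “for free” part can be sketched with the standard logger hierarchy (the module paths below are illustrative, not Monty’s actual names):

```python
# With telemetry loggers named after their modules, the stdlib logging
# hierarchy gives per-module emission control without custom levels.
import logging

# silence one module's telemetry, keep another one verbose
logging.getLogger("telemetry.monty.sensor_modules").setLevel(logging.CRITICAL)
logging.getLogger("telemetry.monty.learning_modules").setLevel(logging.DEBUG)

# child loggers inherit the effective level from their module's logger
sm = logging.getLogger("telemetry.monty.sensor_modules.sm_0")
lm = logging.getLogger("telemetry.monty.learning_modules.lm_0")
print(sm.isEnabledFor(logging.INFO), lm.isEnabledFor(logging.DEBUG))  # False True
```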

I think going to logging like this will make it clearer what our next move should be. And at that point, things will look a lot more pipelined and standard, so I’m hoping it’ll be easier to refactor to a follow-on architecture.

Yes, the blob snapshots are a perfect example of why a custom interface makes sense: one handler could immediately send the blobs, while another could just keep them in memory or put them on the filesystem. The non-generic schema is another one. The streaming architecture is about preparedness for operation (vs. experiments).

Indeed, I must have overlooked that the telemetry would be its own API. Perhaps the name “telemetry” has existing connotations, hence the confusion. Sorry.

So, then, the logger can be omitted, and the metrics, logs, and snapshots can all go through the telemetry interface, right? In what you propose, it’s a custom interface, but with only one sink: the logger with the __name__.

Although Elixir’s telemetry support is very powerful, the built-in stuff is quite BEAM-specific. There are also Elixir libraries for Telemetry and OpenTelemetry, but these seem to be mostly driven by the reporting process.

My desire is to have Monty incorporate either GraphQL or some other protocol that would support asynchronous (i.e., unexpected) inquiries. In an Elixir system, this would be very simple, requiring only the specification of which part(s) of the current process state to return. (In an OOP language such as Python, I suspect that things might get a lot more difficult…)

Using this sort of protocol (GraphQL not required), any process “in the Monty ecosystem” could request and (probably :-) receive any desired runtime information. And a pony…