Habitat IPC configuration

Hello,

In my spare time I have been working on a version of Monty that talks to Habitat in a different process via IPC, inspired by the gRPC prototype done by the team last year. The branch now passes all the tests. What I have done with this is validate the basic data exchange protocol between Monty and the simulator, without gRPC (which I considered a sub-optimal transport). It can now be further worked on to use different transport protocols or even different serialization formats (currently protobuf → maybe Cap'n Proto?). The most basic transport in place right now is based on multiprocessing.queues. Tests take 50% longer with this approach (I haven't run any experiment with it yet).
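To give a rough idea of the shape of that basic transport, here is a minimal sketch with made-up message names (not the actual branch code), assuming one multiprocessing queue per direction:

import multiprocessing as mp


def server(requests: mp.Queue, responses: mp.Queue) -> None:
    """Simulator side: would own habitat_sim and answer one request at a time."""
    while True:
        msg = requests.get()
        if msg["type"] == "shutdown":
            break
        if msg["type"] == "step":
            # the real server calls into the simulator here and serializes
            # the observations (protobuf in my branch)
            responses.put({"type": "observations", "payload": {}})


if __name__ == "__main__":
    requests, responses = mp.Queue(), mp.Queue()
    proc = mp.Process(target=server, args=(requests, responses))
    proc.start()

    # client side (Monty): send an action, block until observations arrive
    requests.put({"type": "step", "payload": {"action": "move_forward"}})
    print(responses.get())

    requests.put({"type": "shutdown"})
    proc.join()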

This was something I did partly to prove to myself that I could do it, partly as fertile ground to explore some low-level ideas, and partly as a nice introduction to a project I have always admired. Eventually it could become a PR, but over time it got a life of its own due to some side-quests. That said, I have been silently adjusting the code to the larger architectural direction the system is taking, but in its current state some architectural questions are arising for which I think I may have a privileged view, given how far I took it, and that's why I'm sharing one of them here and asking for insights.

The main open question for me going forward relates to configuration. In my version, Habitat has a client module and a server module. The server encapsulates the code to be externalized, eventually to a different repo, and eventually using a different Python version. I had to change most of the class names in the Hydra config related to environment initialization to point to the server module, but this only works as long as that module exists in the Monty repo. Externalizing those classes (and eventually locking them to a distinct Python version) will break experiment initialization. My question to the team is therefore: what are your plans for this topic, please?

Thank you :slight_smile:
Nuno

6 Likes

@nunoo Welcome to the community and what an amazing first post! :tada:

3 Likes

This sounds great and would finally allow a move towards modern Python. To my taste I'd prefer untyped IPC, e.g. via ZeroMQ, plus optional contract checking in the spirit of Postel's Law, but even that could just be one adapter to a port.
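A minimal sketch of what I mean by one adapter to a port, assuming pyzmq (endpoint and message shape are illustrative); the optional contract check would sit behind the same socket:

import zmq


def serve(endpoint: str = "tcp://127.0.0.1:5555") -> None:
    """Untyped request/reply: liberal in what it accepts, conservative in what it sends."""
    sock = zmq.Context.instance().socket(zmq.REP)
    sock.bind(endpoint)
    while True:
        msg = sock.recv_json()  # any JSON object is accepted
        reply = {"ok": True, "echo": msg.get("type") if isinstance(msg, dict) else None}
        sock.send_json(reply)   # reply with a small, stable shape


def request(payload: dict, endpoint: str = "tcp://127.0.0.1:5555") -> dict:
    sock = zmq.Context.instance().socket(zmq.REQ)
    sock.connect(endpoint)
    sock.send_json(payload)
    return sock.recv_json()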

FYI What if (?) experiments had a live view

2 Likes

Hi nunoo,

That’s amazing work! I had started looking into Habitat IPC myself back in September. However, before going forward with the Habitat separation, Tristan suggested first focusing on integrating the changes from prototype.simulator’s main branch, which is nearly done.

Although I am a big fan of IPC, in the meantime I started to doubt that Python’s multiprocessing was the right approach for the simulator, because it restricts the whole system to Python. Ideally, simulator communication should be language-agnostic; e.g. the Habitat endpoint could run entirely in C++ for maximum performance. @DLed’s suggestion about ZeroMQ is definitely in my line of thinking.

Tristan did however say that we should try out different serialization/communication approaches, so your work is in the right direction, and it’s great to have another voice at the table.

Tests taking 50% longer is absolutely an improvement over the team’s 1000% with gRPC :laughing:

You could try running a simple benchmark with the main repo and then your fork to get a more accurate number:

python run.py experiment=randrot_10distinctobj_surf_agent
3 Likes

Indeed, I’d not use thread- or even coroutine-centric abstractions for all three concerns: concurrency, parallelism, and distribution. The Actor Model encapsulates all of them without extra mental overhead. Vaughn Vernon recently published the Python port of his actor model library: GitHub - VaughnVernon/DomoActors-Py: Actor Model toolkit for Python: Fault-tolerant, message-driven concurrency.
I thought about trying that in my demo but left it for later. I’ve successfully interfaced many technologies with the actor abstraction, with, for example, ZeroMQ as the messaging layer, enabling concurrency even in single-threaded, non-async software.

JSON is a safe bet for serialization in most cases, and msgpack has proven fine when squeezing out bytes and, depending on the context, some extra performance is needed. However, for the most part, strongly-typed approaches like protobuf/gRPC have increased the perceived cognitive load of teams.
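For illustration, the two are nearly drop-in replacements for each other (assuming the msgpack package and an illustrative observation dict), which keeps the choice reversible:

import json

import msgpack

observation = {"agent_id": 0, "depth": [0.41, 0.42, 0.40], "on_object": True}

as_json = json.dumps(observation).encode()  # human-readable, universally supported
as_msgpack = msgpack.packb(observation)     # smaller on the wire, faster to decode

assert json.loads(as_json) == msgpack.unpackb(as_msgpack) == observation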

Trying out a mini demo, e.g. starting from a simple p2p CMP exchange and then adding one more significant field to the protocol, would be a nice test for the team to see which tech would fit, beyond the purely technical evaluation.

1 Like

Hi @AgentRev and @DLed, other than considering that gRPC is not performant enough, this work is not overly opinionated about the exact transport to use. It should be as fast as possible, so in principle shared-memory based (or loopback), and it should allow different Python interpreters, hence multiprocessing.queues are certainly out of the question. Nevertheless, these queues are still the easiest transport for conquering the immediate challenge: refining the data protocol (and the simulator lifecycle). Once that is stabilized, other transports can be tested.

There are a number of open questions in this area, for which I have seen answers popping up naturally as the larger roadmap materializes, and that’s just fine; it gives people time to tackle them organically, driven by higher abstractions and product design concerns, and those decisions can then be mapped onto the data protocol. As I see it, the specific details of the data protocol implementation (e.g. how to serialize a numpy array) do not need to drive the design of the data flows, even if those details may ultimately determine whether the flows succeed. This is why this work has mainly been a listener’s job in a project that is constantly improving on its own.

Two simple examples: the environment interface is half ready to accept multiple agents, even though no experiment (at least in the tests) uses that yet; and there are some points in the code where the internals of an environment are used (and abused) directly. These are just two examples of things that will be refined eventually and to which the data protocol can simply adapt. Also, recently the apply_actions method in the simulator interface was replaced by the step method, with multiple return values, again something easy to accommodate in the data protocol. Another one: the object config aggregate is too flexible to be serialized successfully other than by using pickle, but pickle will not survive different Python versions. In general there is a larger tension to be resolved between the desirable flexibility of configuration in Monty and having performant serialization, and this tension will become evident to everyone at a certain point, as this migration becomes more of a priority and its impact in terms of product management becomes more obvious.

Right now, my focus was entirely on making a proof of concept, which worked within some constraints (e.g. mocking habitat_sim in a separate server process doesn’t work, using a daemon pool to parallelize experiments does not allow spawning sub-processes, etc.), and which gave me a basis to play with other transports. If, in addition, it can be inspirational to others, all the best.
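To make the numpy example concrete, a pickle-free encoding can be as simple as shipping dtype, shape, and raw bytes, and pinning down exactly this kind of detail is what I mean by the data protocol (a sketch only, not the actual branch code):

import numpy as np


def encode_array(arr: np.ndarray) -> dict:
    """Pickle-free: survives different Python versions and interpreters."""
    return {"dtype": str(arr.dtype), "shape": list(arr.shape), "data": arr.tobytes()}


def decode_array(msg: dict) -> np.ndarray:
    return np.frombuffer(msg["data"], dtype=msg["dtype"]).reshape(msg["shape"])


depth = np.random.rand(64, 64).astype(np.float32)
assert np.array_equal(depth, decode_array(encode_array(depth)))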
The main reason I surfaced my work now, despite these limitations, is that reaching the milestone of passing tests (notwithstanding changes that will still pop up from running experiments) provides a more tangible basis to raise what I perceive will eventually become a blocking configuration issue. Faithful to the thread title, this is indeed the only topic I’d like to discuss here, if possible. This is certainly not a call to discuss all aspects of this migration, which in my humble opinion can only benefit from a very disciplined, incremental approach: data and simulator lifecycle first (functional concerns), and only after that, transport and serialization (performance concerns), which I think is aligned with the larger strategy you mention.

Thank you!

PS: as a more general note going forward, this thread was not intended as an opportunity to debate the actor model in Monty, a topic that can only benefit from its own threads, I think.

4 Likes

Although a number of topics have discussed the actor model and IPC, I can’t recall many that specifically focus on these topics. So, I just started Acts like IPC: an interoperability cage match. Let’s honor @nunoo’s request and take this topical spur over there…

1 Like

I would like to mention that I actually prototyped a solution for the configuration issue when I started, consisting of lazily loading the server-side config from the exact same Hydra cfg files, based on a configuration file name provided on environment start. Soon afterwards, though, I noticed that this file name was not only dynamically unavailable but also that there were tests constructing the cfg entirely in memory. I abandoned that approach in order to move on, but maybe it can be considered as an option going forward?
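For reference, that abandoned prototype had roughly the following shape (illustrative names, assuming a recent Hydra and that the same config files are also available on the server side):

from hydra import compose, initialize_config_dir


def load_server_config(config_dir: str, config_name: str):
    """Lazily rebuild the experiment config on the server from the same yaml files."""
    with initialize_config_dir(config_dir=config_dir, version_base=None):
        return compose(config_name=config_name)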

PS: As an aside, I did run the first experiment, randrot_10distinctobj_surf_agent, today, using both the original and the IPC version and deleting the results between runs. I detected yet another type inconsistency in the code, fixed it, and committed. I’ll try to make an individual PR against main for that one when time allows. For future reference, the observed experiment durations (in seconds) were: original → 590, IPC → 647.

3 Likes

Hi @nunoo, thank you for putting IPC together. Wonderful to see a prototype of it.

Future configuration is a broad and (as you pointed out) evolving topic, but hopefully I can provide some context that might be helpful.

One related point I want to highlight is the recent discussion about the Simulator vs Environment protocol, specifically this illustration:

It seems to me that Environment will always be on the local/Monty side of things as the interface, and therefore, part of Monty configuration. However, a Simulator and others can be remote (or local). How are we going to organize this?

My current high-level thinking is that we have Monty (local) and an Environment (local or remote). Please note that Environment at this level is different from the Environment class inside Monty, which will interface with the Environment.

However, because Environment can be remote, we need another thing that knows about Monty and the Environment. I’ve been calling this Runtime.

The two things you can do to a Runtime are run() and step().

Runtime.run() just starts a forever loop.

Runtime.step() takes one turn of the loop:

def step(self):
    # ask the environment for the consequences of the last actions
    self.observations, self.proprioceptive_state = self.environment.step(self.actions)
    # let Monty decide the next actions from what it just sensed
    self.actions = self.monty.step(self.observations, self.proprioceptive_state)

Relevant to your question, this future Runtime is the configuration container for both Monty and Environment.

Lastly, we want to continue to run Experiments.

Experiments will configure the Runtime and then execute it with fine-grained control via Runtime.step().
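In code, the shape I have in mind is roughly this (names and signatures are illustrative, none of this exists yet):

class Runtime:
    """Knows about both Monty and the Environment; owns the loop state."""

    def __init__(self, monty, environment):
        self.monty = monty
        self.environment = environment
        self.actions = None
        self.observations = None
        self.proprioceptive_state = None

    def step(self):
        # one turn of the loop, as above
        self.observations, self.proprioceptive_state = self.environment.step(self.actions)
        self.actions = self.monty.step(self.observations, self.proprioceptive_state)

    def run(self):
        # the forever loop
        while True:
            self.step()


class Experiment:
    """Configures a Runtime, then drives it with fine-grained control."""

    def __init__(self, runtime: Runtime):
        self.runtime = runtime

    def run_episode(self, num_steps: int):
        for _ in range(num_steps):
            self.runtime.step()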

So, this is how I’m thinking we’ll evolve everything. This way, there are four configuration domains: Monty, Environment, Runtime, and Experiment. None of that is present in configuration right now.

In the short term, we want to do more dependency injection instead of crafting objects inside constructors based on arguments.
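For example, something in this spirit (hypothetical classes, just to show the direction):

class Simulator:
    """Stand-in for any concrete simulator: local habitat_sim, an IPC client, or a mock."""

    def __init__(self, scene: str = "default"):
        self.scene = scene

    def step(self, actions):
        return {}, {}


# Before: the environment crafts its own simulator from constructor arguments,
# locking it to one concrete choice.
class EnvironmentBefore:
    def __init__(self, sim_args: dict):
        self.simulator = Simulator(**sim_args)


# After: the already-constructed dependency is injected; a remote simulator or a
# test double drops in without touching the environment.
class EnvironmentAfter:
    def __init__(self, simulator: Simulator):
        self.simulator = simulator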

Ok, I’ve said some things. I’m not sure if this is what you asked about :slight_smile:.

Cheers,

Tristan

4 Likes

Thank you for sharing the big picture Tristan!

I think that perhaps my concerns will be present in that picture also, so let me try to make them a bit more concrete:

Take this piece of configuration as an example:

agent_type: ${monty.class:tbp.monty.simulators.habitat_ipc.server.MultiSensorAgent}
env_init_func: ${monty.class:tbp.monty.simulators.habitat_ipc.client.environment.HabitatEnvironment}

It is crucial that no one tries to instantiate MultiSensorAgent in Monty, and that only the simulator (the server) does that.
The same is true for tests, which in a few places directly access classes that will only be available on the server side.
I believe that before migrating, it may be necessary to encode this design constraint explicitly in the current code base (e.g. start moving some tests to separate packages). Maybe the fact that my branch already marks all those places can provide a map for that preliminary work?

Then comes usability. I think everyone would agree that this migration is a necessary evil: it will inevitably make for a more complicated product for scientists to use. Some product design concerns that grew on me as the work advanced:

  • Monty tries to be as abstract as possible, delegating concrete class resolution to runtime as much as it can and enforcing loose contracts for maximum flexibility. New concepts may flow more easily through a larger pipe than through a stricter one. But this new client-server interface needs to enforce a minimally stricter contract, one that deals with the lack of a universal serialization mechanism across different Python interpreters (words from someone who naively did try to develop one…). This opens the door to a use case where users may need to adapt the IPC contract for some types of changes, an undesirable technicality they currently don’t need to handle (as an aside, this makes me think of automatically generating a complete Monty version, IPC included, based on config… but that’s another story). The current ObjectConfig class is a good example of something that, to me, is simply too generic to serialize properly (see the sketch after this list).

  • Users may end up having to touch a different repo, create PRs in two repos, create a distro, etc., in order to get their experiments running whenever changes are also needed in the server code base. Not ideal.

  • One other concern relates to the simulator life cycle. It seems to me that the current convenience of launching a simulator instance per experiment is ideal, and that we should try to preserve it as a model precisely because of how flexible and transparent it is. It beats forcing a user to start a sim first and execute the experiment second, or forcing that orchestration in tests, and it seems much easier than a model where one sim would be shared across experiments. I think modern operating systems have enough resources to support this model, and the fact that all tests pass right now shows that it is feasible.
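As a concrete illustration of the contract point above, a narrow, explicitly versioned message type (hypothetical, nothing like this exists in my branch) serializes predictably where a free-form config aggregate does not:

from dataclasses import asdict, dataclass


@dataclass
class AddObjectRequest:
    """A deliberately narrow, versioned slice of what an object config would carry."""
    protocol_version: int
    name: str
    position: tuple
    scale: float = 1.0


req = AddObjectRequest(protocol_version=1, name="mug", position=(0.0, 1.5, 0.0))
wire = asdict(req)  # a dict of primitives: friendly to json/msgpack/protobuf alike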

I hope this makes my points clearer, and that they may be useful going forward.

Thank you!

2 Likes

I think we’re thinking along similar lines. Although, in the future state, I’d expect it to be something like

experiment:
  ...
  runtime:
    ...
    monty:
      ...
      environment:
        _target_: ${monty.class:tbp.monty.environments.HabitatEnvironment}
        ...
    environment:
      _target_: ${monty.class:tbp.monty.simulators.HabitatSimulator}
      ...

Monty would only receive the monty: ... portion of the configuration to instantiate. Similarly, Environment would only receive the environment: ... portion (the sibling of monty:) to instantiate.

In my translation above, the habitat_ipc.server part of the tree ends up being the simulator (there’s a separate discussion about how, on the environment side, environment and simulator are pretty much the same). The habitat_ipc.client part of the tree ends up being the HabitatEnvironment, which is basically the interface to the environment (local or remote).

I think the general shape is like the above, although the details will probably look different once we get to that point.

Yes, this aligns with our intentions as well. Namespace things correctly before extracting them. That’s part of our plan for extracting experimental framing, for example.

2 Likes