Prototype: HypothesesUpdater operations on CUDA GPUs

In Preliminary GPU Acceleration Data Indicates Significant Speedup of Hypothesis Update Operations, @collin demonstrated that hypotheses update operations execute faster on a GPU :folded_hands:.

How does this work fit into the overall Platform work?

One of Monty’s Undesirable Effects is (115 Most experiment episodes take more than 1 second to execute). @collin’s data provided solid grounding for the intuitive idea that if Monty executes only on the CPU, there are systems with GPUs, and HypothesesUpdater operations execute faster on a GPU, then that contributes some magnitude to most experiment episodes taking more than 1 second to execute.

What we’d like to get to is the Desired Effect of (100 Most experiment episodes take less than 1 second to execute). One way of improving the situation is to make (Injection #1 When available, execute hypotheses update operations on a GPU), which will allow the hypotheses update operations to execute on CPU or GPU. Once hypotheses update operations can execute on CPU or GPU, and knowing that (102 Hypotheses update operations execute faster on a GPU) and that (103 There are systems with GPUs), then that will contribute some magnitude to getting to most experiment episodes taking less than 1 second to execute.

@collin, here is what I’m thinking and a proposal for how we might go forward with this. Please note that I made this up overnight, so feel free to propose something different or point out where things don’t fit or don’t seem correct or true.

I’m conceptualizing everything into three main phases: 1) create a prototype, 2) validate the prototype, and 3) integrate the prototype. (This approach stems from RFC 14 Conducting Research While Building a Stable Platform).

I think the most accessible work will be 1) & 2), and that’s where we should focus our immediate effort. To this end, what I am proposing is that the prototype be created on a fork of tbp.monty so that we can get to validation as soon as possible. Here’s my initial impression of the work that needs to be done:

I think with this minimal work we’ll be able to run some benchmarks and evaluate the effect.

Now, you may have noticed a line going off the top of the illustration. That line leads to phase 3) integrating the prototype. For transparency, I want to share what I’m thinking we might want to do there:

As you can see, there’s a whole bunch more happening there. In summary, I think we’ll want the HypothesesUpdater CUDA GPU backend to be what I’m starting to call a “community component.” So it wouldn’t be part of tbp.monty, but a separate package. The primary motivation is one of the “obstacles” (shown as stop signs/octagons): our lack of in-house capacity for maintaining CUDA GPU kernels.

The primary reason this might take a lot of effort is that yours would be the first community component, so on the TBP side we’ll need to figure out a bunch of things. If it helps, I’m thinking your part would only be the (Create a community HypothesesUpdater CUDA GPU backend) work, and the rest would largely fall upon TBP, with perhaps additional collaboration on a few of the details.

So, that’s the overview. Would love to hear your thoughts.

Cheers,

Tristan

Thanks @tslominski for putting together this thoughtful proposal overview of how to integrate a GPU backend into the HypothesesUpdater.

Generally, the plan and organization that you have outlined make sense to me. The three phases make intuitive sense and align with my understanding of your research prototyping workflow. I agree that focusing our immediate efforts on phases 1 (prototype development) and 2 (prototype validation) will be most valuable and will provide a synchronization point to evaluate and consider integration into Monty.

Regarding the structure of this GPU backend as a community component, I trust you to know what is best for the Monty project. In general, I am happy for this to be the guinea pig community component; however, I do have some questions about how this will work.

How will community components fit within Monty versioning? This seems like a natural place for synchronization issues to arise if community components don’t update at the same interval or aren’t included in verification of the whole Monty system with the same rigor.

What is the ideal interface between community components and Monty code? I would imagine we would want a clearly defined interface where the Monty code doesn’t need to consider the implementation of the community component; however, there are some technical details of how I envision a GPU backend that may make such a clean separation challenging. More details below.

How do you envision environment dependencies on community components? The GPU backend will necessitate environment changes, not only to the Conda environment but also to the hardware platform.

Potential use of CUDA streams for better parallelization across LMs

While most of the GPU implementation could be reasonably abstracted away behind a backend class, similar to the current interface between LMs and the HypothesesUpdater class, I have some ideas for a more efficient implementation that may complicate the existing interface. As I referenced in the discussion with @vclay in my original post, even if we don’t explicitly batch operations across LMs, there are still ways we could capitalize on the inherent parallelism of LMs through the use of CUDA streams.

Since GPUs are designed for much higher throughput than individual LM operations require, it may be worth investigating an approach that utilizes the full available hardware. GPUs not only support parallelism within a single kernel, they can also run multiple kernels in parallel through the use of CUDA streams. Each CUDA stream is essentially a separate sequence of kernel dispatches from the host, which lets the GPU schedule kernels from multiple streams to run simultaneously depending on the GPU resources available. If we design Monty such that each LM has its own CUDA stream, we could potentially see significant performance gains by having LMs run their streams in parallel, while conceptually keeping each LM implementation separate.

In practice, this would be simpler than the synchronization necessary for the single-kernel dispatch system in my PoC, but it would still require some thoughtful refactoring to support. The key idea is that we would want to separate the calls that dispatch hypothesis updates from the code that processes the results of those updates. For example, instead of one loop through the LMs where each iteration does all hypothesis updates for that LM (and the next LM can’t start work until this LM is done with its work), we would do two loops through the LMs. The first loop would dispatch the GPU work on different CUDA streams, while the second loop would process the results from these dispatches. Separating into two loops like this frees the GPU to parallelize the work across LMs, even though each LM is defined separately in the code.
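
To sketch what that two-loop pattern could look like with PyTorch CUDA streams (the LM method names below are placeholders, not the actual Monty API):

```python
import torch


def step_learning_modules(learning_modules):
    """Two-loop dispatch/process sketch; LM method names are hypothetical."""
    streams = [torch.cuda.Stream() for _ in learning_modules]
    pending = []

    # Loop 1: enqueue each LM's hypothesis-update kernels on its own stream.
    # The host returns immediately after queuing the work.
    for lm, stream in zip(learning_modules, streams):
        with torch.cuda.stream(stream):
            handle = lm.hypotheses_updater.dispatch_update(lm.current_inputs)
        pending.append((lm, stream, handle))

    # Loop 2: wait for each stream to finish, then process results on the host.
    for lm, stream, handle in pending:
        stream.synchronize()
        lm.process_hypotheses_update(handle)
```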

Note that unlike in my PoC where each LM needs to run the same operation in order to batch them together, in this form of CUDA stream parallelization the LMs can be processing different data or performing different steps entirely. So each LM can still be defined separately in the code and have different behavior, but we would capitalize on the inherent parallelization of LMs and better utilize the hardware available.

We would need to investigate the CUDA stream approach more, and it may be more complicated than the simple two-loop system described above. In particular, hypothesis update steps are actually a sequence of multiple GPU kernels, not a single dispatch.

Since this approach would change how hypotheses update operations are dispatched, it would require some refactoring of how LMs interact with the HypothesesUpdater. Currently, each LM has its own HypothesesUpdater object, which it calls internally. Perhaps we would shift to an approach like having a single HypothesesDispatcher object that iterates over the LMs. Though there are multiple ways we could approach these changes, I bring it up now so that we can discuss how this fits within the context of separating out the GPU backend into a community component.
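
As one possible shape for such a dispatcher, purely a sketch with placeholder names:

```python
class HypothesesDispatcher:
    """Sketch of a single dispatcher replacing per-LM HypothesesUpdater calls."""

    def __init__(self, backend):
        # `backend` would be a CPU or CUDA implementation chosen at config time.
        self.backend = backend

    def step(self, learning_modules):
        # Dispatch all LMs' updates first, then collect and hand back results,
        # so the backend is free to overlap work across LMs.
        handles = [
            self.backend.dispatch(lm.hypotheses_update_inputs())
            for lm in learning_modules
        ]
        for lm, handle in zip(learning_modules, handles):
            lm.receive_hypotheses_update(self.backend.collect(handle))
```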

I am happy to have a call to talk through this in more detail or to continue the chat here.

Hey @collin,

Regarding community components and versioning, I think that we can defer deciding the exact specifics until after the prototype is working.


However, I don’t want to ignore your query. RFC 7 Monty Versioning specifies that we will increment the MAJOR version whenever backwards-incompatible changes are made. So, ideally, the community component would specify the minimum supported tbp.monty version as a dependency, e.g., >= 1, or >= 1.1 if depending on a feature released at 1.1. Per RFC 7, we promise to increment the MAJOR version if we break the API.

Having said the above, MAJOR version 0 (Monty’s current major version) is special, in that we may introduce breaking changes while only incrementing the MINOR version. For example, as of this writing, tbp.monty is version 0.10. If we break the API, we would release version 0.11. I believe the community component dependency versioning approach should be the same, in that it would specify the minimum tbp.monty version, e.g., >= 0.10. However, because of MAJOR version 0, an additional burden is placed on the component maintainer to track all MINOR version changes and look for any breaking API changes that may require an upgrade. I intend to include release notes in our release PRs, highlighting breaking changes, for example: chore: version 0.10.0 by tristanls · Pull Request #427 · thousandbrainsproject/tbp.monty · GitHub.
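
Purely for illustration (the distribution name and packaging details here are assumptions on my part, not decisions), the dependency declaration could look something like this:

```python
# Hypothetical packaging sketch for the community component; the package name
# and tooling are placeholders, only the version constraint matters here.
from setuptools import setup

setup(
    name="tbp-hypotheses-updater-cuda",  # placeholder name
    version="0.1.0",
    install_requires=[
        # Minimum supported tbp.monty release; under MAJOR version 0 this needs
        # re-checking on every MINOR release for breaking API changes.
        "tbp.monty>=0.10",
    ],
)
```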


Regarding environment dependencies, I don’t know what you mean by environment changes. Do you mean conditionally compiling CUDA kernels or downloading them for the specific architecture? One reason to proceed with the prototype before settling on an approach is that I imagine we’ll learn what is needed.


Potential use of CUDA streams for better parallelization across LMs.

Have you had a chance to see the data flow in the Draft RFC Cortical Messaging Protocol (CMP) v1?

When I was putting this together, one of my goals was to better align the Monty step with a data flow paradigm. Figure 1 is focused on the Cortical Message flow, but it also gives an intuition for the sequence of phases each step must execute. This sequence might help inform parallelization opportunities for LMs and other modules. Roughly:

  • Sensor Modules must all do similar computation together
  • Learning Modules must all process incoming messages, observations, and state together
  • Learning Modules must all send out votes together
  • Learning Modules must all receive and process votes together
  • Learning Modules must all compute and send out Cortical Messages together
  • Goal State Selectors must all process and output goals together
  • Motor Modules must all process goals, observations, and state together

Because these phases are tied to messaging, I don’t think we’ll want Learning Modules to do different things in parallel (at least not for what I’m proposing for CMP v1). I think we might have Learning Modules all do similar ops (or no-ops) in each phase.
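
To make the phase ordering concrete, here is a rough sketch of a phase-ordered step; every name below is an illustrative placeholder, not the actual tbp.monty API:

```python
def monty_step(sensor_modules, learning_modules, goal_state_selectors, motor_modules):
    """Illustrative phase-ordered step; names and signatures are placeholders."""
    # Sensor Modules all do similar computation together.
    observations = [sm.process() for sm in sensor_modules]

    # Learning Modules all process incoming messages, observations, and state.
    for lm in learning_modules:
        lm.process_inputs(observations)

    # Learning Modules all send out votes, then all receive and process votes.
    votes = [lm.send_votes() for lm in learning_modules]
    for lm in learning_modules:
        lm.receive_votes(votes)

    # Learning Modules all compute and send out Cortical Messages together.
    messages = [lm.compute_cortical_message() for lm in learning_modules]

    # Goal State Selectors all process and output goals together.
    goals = [gss.select_goal(messages) for gss in goal_state_selectors]

    # Motor Modules all process goals, observations, and state together.
    for mm in motor_modules:
        mm.process(goals, observations)
```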

In principle, I don’t have an issue with rearchitecting Monty to support performance improvements. I’d like our main loop’s data flow to improve anyway. Of course, it would have to be done for a concrete improvement, and we’ll have to see how the details work out and make sure the CPU path doesn’t suffer for some reason.

But again, I wouldn’t focus on separating the community component yet…


The prototype I am proposing is not a community component. I think what we’d want to do with the prototype is to update/refactor Monty however we need to allow for both CPU and GPU support.
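
As a minimal sketch of what “both CPU and GPU support” could mean behind a single interface (the selection logic and the CUDA-backed class are assumptions on my part, not a design decision):

```python
import torch


def make_hypotheses_updater(cpu_updater_cls, cuda_updater_cls, prefer_gpu=True):
    """Pick a device-specific HypothesesUpdater implementation (sketch).

    `cpu_updater_cls` would be the existing DefaultHypothesesUpdater and
    `cuda_updater_cls` a hypothetical CUDA-backed counterpart; constructor
    arguments are omitted for brevity.
    """
    if prefer_gpu and torch.cuda.is_available():
        return cuda_updater_cls()
    return cpu_updater_cls()
```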

I imagine that after the prototype is complete, everything will become more concrete and (at least I) will better understand the specific challenges we’ll encounter in extracting the community component.


I’ll reach out via chat to schedule a meeting to discuss further.

I agree that we don’t need to have the specifics figured out right now; perhaps I was jumping the gun on some of these details. In any case, it was helpful to get some clarity on how you envision this separation of components. Thanks for responding.

Versioning

Aligning the community component with the MAJOR version makes sense. The component should stay functional as long as the API stays the same, assuming we define the interface between the community component and Monty well. As for MAJOR version 0 being a special case, that makes sense, and I’m fine with some integration pain points as the platform stabilizes.

Environment changes

I agree that these details can be ironed out later; I just wanted to bring it up to get a sense of how much integration effort we are OK with community components adding. GPU environments can be a bit of a pain point: making sure there is an Nvidia GPU with the right driver and CUDA toolkit version for the CUDA backend of PyTorch. This prototype phase should uncover these challenges.
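
For example, a minimal runtime check of the kind the prototype could use, relying only on the standard PyTorch API, to decide whether the CUDA path is usable at all:

```python
import torch

# Report whether PyTorch's CUDA backend is usable on this machine and fall
# back to the CPU path otherwise.
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"CUDA toolkit (bundled with PyTorch): {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name(0)}")
else:
    device = torch.device("cpu")
    print("No usable CUDA device; falling back to the CPU HypothesesUpdater.")
```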

LM independence

I had not seen that data flow diagram; it helps me understand the vision. I agree with having modules process together in a pipeline, which will enable us to better optimize for performance. The reason I brought up the potential of LMs doing different things in parallel was to address Viviane’s concerns about the stacked approach I outlined in my original post. She pointed out the challenges of stacking operations across LMs due to their difficult-to-predict input processing and conceptually independent design. The CUDA streams approach would strike a better balance between allowing limited differences between LMs and capitalizing on the parallelism of their operations. In general, the more the LMs do the same operations at the same time, the more we can take advantage of the parallelism.

Concluding thoughts

I better understand how you see the difference between the initial prototype in phases 1 and 2 and the separation into a community component in phase 3. These first phases will help us uncover what needs to be done for phase 3 and answer many of the questions I have.

Looking forward to meeting and starting this effort together :slight_smile:

@rmounir, the ResamplingHypothesesUpdater author, recommended that we focus the prototype on the DefaultHypothesesUpdater, as resampling is new and actively being worked on, and we don’t have many benchmarks or a baseline for it yet.

@collin, I attempted to update the plan further based on our conversation today.

I added the “Implement the prototype” step so that it’s easier to see what needs to be done before we run benchmarks.

On the left, I added the (Refactor MontyForGraphMatching._step_learning_modules to use “grouping” of LMs) we discussed.

On the right, I added the part about parallelizing across object graphs.

I also updated the hypotheses updater items to refer specifically to DefaultHypothesesUpdater, as recommended by @rmounir.

Right now, these are depicted as three independent work streams, but we’ll probably need to work through in more detail whether that is true and how to combine them.

@tslominski that looks good, thanks for updating the plan.

Since the hypothesis updates happen for LM-object graph pairs, if we set up the object graph parallelization on the right to handle arbitrary LM-object graph pairs (even though, to start, the LM will be the same), then it might be straightforward to scale to include the other LMs in the group once we get to the grouping stage.
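
As a small sketch of that idea (placeholder names only): if the dispatch path takes a flat list of (LM, object graph) pairs, then moving from a single LM to a group of LMs only changes how the list of pairs is built.

```python
def build_work_items(learning_modules):
    # One work item per (LM, object graph) pair; with a single LM this reduces
    # to parallelizing across that LM's object graphs.
    return [(lm, graph) for lm in learning_modules for graph in lm.object_graphs]


def dispatch_hypothesis_updates(work_items, backend):
    # Dispatch everything first, then collect, so the backend is free to
    # parallelize across all pairs.
    handles = [backend.dispatch(lm, graph) for lm, graph in work_items]
    return [backend.collect(handle) for handle in handles]
```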
