Background and Methodology
Here's a brief update on my investigation into GPU acceleration of KNN within Monty. The starting point was to see whether a GPU-based algorithm could be a drop-in replacement for the default CPU-based SciPy KDTree algorithm.
In replacing the current KDTree implementation, we need to consider both query efficiency and build efficiency, since many KNN algorithms use custom data structures which take preprocessing time to organize.
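To make that build/query split concrete, here is a minimal sketch of the baseline pattern with SciPy (the shapes, offset, and k value are illustrative, not taken from Monty's code):

```python
import numpy as np
from scipy.spatial import KDTree

# Build phase: organize the model locations into a tree up front.
model_points = np.random.rand(5210, 3)  # illustrative stand-in for model locations
tree = KDTree(model_points)             # preprocessing cost paid once

# Query phase: the tree is then queried repeatedly.
query_points = model_points[:100] + 1e-3            # illustrative query batch
distances, indices = tree.query(query_points, k=5)  # k nearest neighbors per query
```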
So far I have tested the approaches below. Note that only two of them are GPU-accelerated (I also tried a few others that didn't produce correct results or weren't worth including).
- KDTree: Monty's default SciPy implementation
- FAISS CPU: CPU implementation of IndexIVFFlat from the FAISS library
- FAISS GPU: GPU implementation of IndexIVFFlat from the FAISS library
- Open3D: the Open3D library's CPU-based KDTreeFlann implementation for operating on point cloud data
- PCL GPU KNN: the Point Cloud Library's GPU-based KDTreeFlann implementation for operating on point cloud data
Results
To test these approaches, I used the mug model learned in the surf_agent_1lm_2obj tutorial (let me know if there is a better/more realistic model/experiment to try!). The first test uses the 5,210 model locations, with a small offset added, as the query points. I recorded the query time averaged across 1,000 runs, as well as the build time, shown below.
I also ran a small scaling test, using the same model points but with random query points at varying batch sizes (results shown below).
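Both tests follow the same measurement pattern; here is a simplified sketch of that kind of harness (illustrative, not the exact benchmark script I ran):

```python
import time
import numpy as np
from scipy.spatial import KDTree

def benchmark_kdtree(model_points, batch_sizes, k=5, runs=1000):
    """Time the one-off tree build, then the average query time per batch size."""
    start = time.perf_counter()
    tree = KDTree(model_points)
    build_time = time.perf_counter() - start

    avg_query_times = {}
    for n in batch_sizes:
        queries = np.random.rand(n, model_points.shape[1])  # random query batch
        start = time.perf_counter()
        for _ in range(runs):
            tree.query(queries, k=k)
        avg_query_times[n] = (time.perf_counter() - start) / runs
    return build_time, avg_query_times
```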
These experiments were performed on my Windows laptop with the following specs:
- CPU: Intel(R) Core™ Ultra 9 185H @ 2.50 GHz
- GPU: NVIDIA GeForce RTX 4070 Laptop GPU
- RAM: 32 GB
Of course, these numbers would be more meaningful in a runtime environment representative of the typical Monty researcher.
Based on this experiment, the only real contender is the FAISS GPU algorithm, which presents an interesting tradeoff: it provides the fastest query times but comes at a steep build cost. That could be useful for evaluation-only experiments, but it may not be ideal for the core use case of continual learning.
Also, it is interesting to note that the IndexIVFFlat algorithm relies on clustering the indexed points ahead of time. If the number of clusters is set to anything higher than 1, the algorithm is not guaranteed to return the exact correct answers, making it an approximate KNN search (this was true for me experimentally). However, with only one cluster, as in the data shown above, the algorithm is an exact KNN search (guaranteed to find the true k nearest neighbors), since every query scans all of the indexed points. I would have expected this exhaustive scan to hurt performance greatly, but the query time is still the fastest of any algorithm I have evaluated, presumably because a brute-force scan maps well onto GPU parallelism. This leads me to wonder if the build time could be reduced for this special case with only 1 cluster…
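For reference, here is roughly how the single-cluster configuration looks in FAISS (a sketch assuming a standard single-GPU setup; the array contents are illustrative):

```python
import faiss
import numpy as np

d, k, nlist = 3, 5, 1  # nlist=1: one Voronoi cell, so the search is exhaustive and exact
model_points = np.random.rand(5210, d).astype(np.float32)  # FAISS expects float32
queries = np.random.rand(100, d).astype(np.float32)

quantizer = faiss.IndexFlatL2(d)  # coarse quantizer used to assign points to cells
cpu_index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)
gpu_index = faiss.index_cpu_to_gpu(faiss.StandardGpuResources(), 0, cpu_index)

gpu_index.train(model_points)  # k-means over nlist centroids; trivial when nlist=1
gpu_index.add(model_points)    # build: store points in their assigned cell
distances, indices = gpu_index.search(queries, k)  # exact KNN when nlist=1
```

Since nlist=1 is logically equivalent to brute force, a flat GPU index (FAISS's GpuIndexFlatL2, if available in the build) might sidestep the IVF training step entirely; that could be worth checking.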
Recommendation
- For evaluation-heavy workflows: consider FAISS GPU
- For continual learning: stick with SciPy KDTree for now
- For future development: focus on LM-level parallelization
I think a more promising direction for GPU acceleration of Monty would be parallelization of LMs and/or object comparisons. This would require a more thoughtful rework of some of the Monty code, with GPU parallelization of several operations. I plan to start on this once I am back from vacation, but am also open to thoughts/feedback from the community!
TLDR: SciPy's KDTree implementation for KNN is still pretty good. If we do want to speed up the KNN query operation at the cost of increased build time, there is a GPU-accelerated version we can go with.
P.S. I spent the most time trying to get this algorithm working (GitHub - horizon-research/rtnn). The approach leverages ray-tracing acceleration hardware on GPUs to do KNN. I thought this was a promising direction, and still do to an extent; however, this open-source implementation seems poorly maintained: it was a pain to set up, and in the end it gave incorrect results. It could be worth looking into more in the future.

