GPU Acceleration of KNN Search - Update

Background and Methodology

Giving a brief update on my investigation into GPU acceleration of KNN within Monty. The starting point for this investigation was to see if a GPU-based algorithm could be a drop-in replacement for the default CPU-based Scipy KDTree algorithm.

In replacing the current KDTree implementation, we need to consider both query efficiency and build efficiency, since many KNN algorithms use custom data structures which take preprocessing time to organize.

So far I have tested the approaches listed below; a rough sketch of the two main interfaces follows the list. Note that only two of these are GPU accelerated (I also tried a few others that either returned incorrect results or weren’t worth including).

  • KDTree: Monty’s default Scipy implementation
  • FAISS CPU: CPU implementation of IndexIVFFlat from the FAISS library
  • FAISS GPU: GPU implementation of IndexIVFFlat from the FAISS library
  • Open3D: Open3D library’s CPU-based KDTreeFlann implementation for operating on point cloud data
  • PCL GPU KNN: Point Cloud Library’s GPU-based KDTreeFlann implementation for operating on point cloud data
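For context, here is a rough sketch of what the two main interfaces look like (a simplified illustration on dummy data, not the exact benchmark code; the offset, `k`, and FAISS parameters are placeholder values):

```python
# Simplified sketch of the two main interfaces on dummy data (not the actual
# benchmark code). Assumes faiss-gpu is installed; shapes and k are illustrative.
import numpy as np
from scipy.spatial import KDTree
import faiss

points = np.random.rand(5210, 3).astype(np.float32)            # model locations
queries = (points + 1e-3 * np.random.rand(*points.shape)).astype(np.float32)
k = 10

# Scipy KDTree: cheap to build, queries run on the CPU.
tree = KDTree(points)
dists, idx = tree.query(queries, k=k)

# FAISS GPU IndexIVFFlat: train/add (the "build") is the expensive part.
res = faiss.StandardGpuResources()
index = faiss.GpuIndexIVFFlat(res, 3, 1, faiss.METRIC_L2)       # dim=3, nlist=1
index.train(points)
index.add(points)
sq_dists, idx = index.search(queries, k)                        # squared L2 distances
```

With `nlist=1` the IVF index degenerates to a brute-force scan of all points on the GPU, which matches the exact-KNN setting used for the results below.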

Results

To test these approaches I used the mug model learned from the surf_agent_1lm_2obj tutorial (let me know if there is a better/more realistic model/experiment to try!). This first test uses the 5210 model locations with a small offset added as the query points. I recorded the build time and the query time averaged across 1000 runs, shown below.

I also ran a small scaling test, using the same model points but with random query points at varying batch sizes; those results are shown below as well.
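In case it is useful, this is roughly the shape such a timing harness can take (a simplified sketch on random data; the run count, `k`, and batch sizes here are placeholders, not the exact values behind the numbers above):

```python
# Rough shape of a timing harness for these comparisons. Values are placeholders.
import time
import numpy as np
from scipy.spatial import KDTree

def time_build_and_query(build_fn, query_fn, points, queries, k=10, runs=100):
    """Return (build_seconds, mean_query_seconds) for one approach."""
    t0 = time.perf_counter()
    index = build_fn(points)                 # e.g. KDTree(points) or FAISS train+add
    build_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    for _ in range(runs):
        query_fn(index, queries, k)
    mean_query_s = (time.perf_counter() - t0) / runs
    return build_s, mean_query_s

points = np.random.rand(5210, 3).astype(np.float32)   # stand-in for the model locations

# Scaling test: same model points, random query batches of increasing size.
for batch in (1_000, 10_000, 100_000):
    queries = np.random.rand(batch, 3).astype(np.float32)
    b, q = time_build_and_query(KDTree, lambda t, qs, kk: t.query(qs, k=kk), points, queries)
    print(f"batch={batch}: build={b:.4f}s, mean query={q:.6f}s")
```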

These experiments were performed on my Windows laptop with the following specs:

  • CPU: Intel Core Ultra 9 185H @ 2.50 GHz
  • GPU: NVIDIA GeForce RTX 4070 Laptop GPU
  • RAM: 32 GB

Of course, these numbers would be more meaningful if collected on a runtime environment representative of what a typical Monty researcher uses.

Based on this experiment, the only real contender is the FAISS GPU algorithm, which presents an interesting tradeoff. This algorithm provides the fastest query times, but comes at a steep build cost. This could be useful for evaluation-only experiments, but for the core use case of continual learning it may not be ideal.

Also, it is interesting to note that this IndexIVFFlat algorithm relies on clustering the indexed points ahead of time. If the number of clusters is set to anything higher than 1, the algorithm is not guaranteed to return the exact correct answers, making it an approximate KNN search (this was true for me experimentally). However, if only one cluster is used, as in the data shown above, the algorithm performs exact KNN (it is guaranteed to find the true k nearest neighbors). I would have expected the single-cluster case to hurt query performance greatly, since every query has to scan the full set of points, but the query time is still the fastest of any algorithm I have evaluated. This leads me to wonder if the build time could be reduced for this special case with only 1 cluster…
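To make the exact-vs-approximate distinction concrete, the relevant FAISS knobs are `nlist` (the number of clusters created when the index is trained) and `nprobe` (how many clusters are scanned per query). A hedged sketch, using the CPU IndexIVFFlat for brevity and illustrative parameter values:

```python
# Sketch of the nlist/nprobe tradeoff with IndexIVFFlat (the CPU index is shown
# for brevity; the GPU variant exposes the same knobs). Values are illustrative.
import numpy as np
import faiss

d, k = 3, 10
points = np.random.rand(5210, d).astype(np.float32)
queries = (points + 1e-3).astype(np.float32)

# nlist=1: a single cluster, so every query scans all points -> exact KNN.
quantizer_exact = faiss.IndexFlatL2(d)
exact = faiss.IndexIVFFlat(quantizer_exact, d, 1)
exact.train(points)
exact.add(points)
_, exact_idx = exact.search(queries, k)

# nlist>1 with nprobe < nlist: only some clusters are scanned -> approximate.
quantizer_approx = faiss.IndexFlatL2(d)
approx = faiss.IndexIVFFlat(quantizer_approx, d, 64)
approx.train(points)
approx.add(points)
approx.nprobe = 4               # more probes -> better recall, slower queries
_, approx_idx = approx.search(queries, k)

# Crude recall measure: how many of the true k neighbors the approximate index found.
recall = np.mean([len(set(a) & set(e)) / k for a, e in zip(approx_idx, exact_idx)])
print(f"recall@{k}: {recall:.3f}")
```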

Recommendation

For evaluation-heavy workflows: Consider FAISS GPU

For continual learning: Stick with Scipy KDTree for now

For future development: Focus on LM-level parallelization

I think a more promising direction for GPU acceleration of Monty would be parallelization of LMs and/or object comparisons. This would require a more thoughtful rework of some of the Monty code, along with GPU parallelization of several operations. I plan to start on this once I am back from vacation :slight_smile: but I am also open to thoughts/feedback from the community!

TLDR: Scipy’s KDTree implementation for KNN is still pretty good. If we do want to speed up the KNN query operation at the cost of increased build time, there is a GPU-accelerated option we can go with.

P.S. I spent the most time on getting this algorithm working (GitHub - horizon-research/rtnn). This approach leverages ray-tracing acceleration hardware on GPUs to do KNN. I thought this was a promising direction, and still do to an extent; however, this open-source implementation seems poorly maintained: it was a pain to set up, and in the end it gave incorrect results. It could be worth looking into more in the future.


Any reason to avoid ANN (approximate nearest neighbor) indexing? There are quite a few libraries/algorithms listed at ANN-Benchmarks.

Wow, this is a great analysis and summary! Thank you so much for looking into this and sharing your results.

The FAISS GPU tradeoff is interesting. It’s similar to the tradeoff I found when looking into locality-sensitive hashing (LSH) as an alternative. Basically, there are ways to reduce query time at the cost of build time. Like you said, this may be acceptable for inference-heavy experiments. Also, a build time of <2 seconds still seems like it could be acceptable in many applications.

Would it be difficult to integrate a flag to switch between the CPU KDTree option and the FAISS GPU option in Monty? We could then run some continual learning experiments (like the benchmark experiments here: Benchmark Experiments, or the one outlined in this tutorial: Unsupervised Continual Learning) and see if the cost of building the graph is offset by the gains during inference (since we query much more frequently than we update graphs).

It also seems like the FAISS algorithms scale much better with batch size. This would allow us to test more hypotheses than we currently do and thereby achieve higher accuracies, which could be an interesting option for some applications when GPUs are available.

I would also be curious to know how the accuracy is affected if you change the number of clusters. Have you looked into that? I am not sure how different the results of the nearest neighbor lookup will become, but if there are just slight variations, then Monty might still be able to deal with it quite well.

Don’t feel like you need to look into all these followup questions if you think there are more promising directions. This is just what came to my mind :slight_smile: Thanks again for sharing this thorough investigation!

- Viviane

Hi @vclay, thank you for the thoughtful response and ideas.

Now that I am back home, I will start in on some of these next steps. Not sure yet how hard adding a flag to switch between the implementations would be, but it seems like a good starting point for me to get my hands dirty in the Monty code :slight_smile:

Once integrated we could then explore scaling to larger numbers of hypotheses and investigate the impact of batch size on Monty performance.
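As a very rough first sketch of what that switch could look like (all of the names here are hypothetical placeholders I made up for illustration, not existing Monty APIs):

```python
# Purely illustrative: none of these names are existing Monty APIs, just
# placeholders for wherever the KDTree is currently constructed.
import numpy as np
from scipy.spatial import KDTree


def build_knn_index(points: np.ndarray, backend: str = "kdtree"):
    """Return an object exposing a KDTree-style query(points, k) -> (dists, idx)."""
    if backend == "kdtree":
        return KDTree(points)

    if backend == "faiss_gpu":
        import faiss  # only needed when this backend is requested

        pts = np.asarray(points, dtype=np.float32)
        res = faiss.StandardGpuResources()
        index = faiss.GpuIndexIVFFlat(res, pts.shape[1], 1, faiss.METRIC_L2)  # nlist=1 -> exact
        index.train(pts)
        index.add(pts)

        class _FaissKnn:
            def query(self, q, k=1):
                d, i = index.search(np.asarray(q, dtype=np.float32), k)
                return np.sqrt(d), i  # FAISS returns squared L2 distances

        return _FaissKnn()

    raise ValueError(f"unknown KNN backend: {backend}")
```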

Let me know if you have any other thoughts before I dive in.


That sounds great! Thank you for diving deeper on this. Let me know how it goes and if you run into any issues when integrating with Monty :slight_smile: