RF: Fix GPU performance regression

This MR resolves a performance regression in the GPU build, which was introduced by the changes made to support a CPU build. The regression had two causes:

  • A CUDA device function that calls a function compiled into a different object file runs much more slowly than it does when the callee is defined in the same source file, presumably because the compiler cannot inline the call across object files. This has been resolved by inlining all of the index-related functions in inc/IndexUtils.inl.
  • The use of lambda functions within CUDA device functions does not seem to be a performance issue in and of itself, but there was a substantial slowdown when creating lambda functions which read from or wrote to device memory directly. This has been resolved by changing the IndexUtils::populate_hessian function: instead of accepting a function that updates the values in the hessian matrix, it now populates and returns an array of index pairs (diagonal and row) into a sparse matrix tile. The caller is then responsible for updating the hessian matrix. This avoids the need to create lambda functions/closures, and also feels to me like a much cleaner design.
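The two fixes above can be sketched roughly as follows. This is a minimal illustration and not the actual MMORF code: the `sketch` namespace, the `SKETCH_HOST_DEVICE` macro, the `IndexPair`/`IndexPairs` types, the three-diagonal tile layout, and both function signatures are assumptions made for the example. Every function is defined `inline` in a single header-style unit, and the index function returns (diagonal, row) pairs instead of taking a callback.

```cpp
#include <cassert>

// Assumed helper macro: CUDA qualifiers under nvcc, nothing in a CPU-only build.
#ifdef __CUDACC__
#define SKETCH_HOST_DEVICE __host__ __device__
#else
#define SKETCH_HOST_DEVICE
#endif

namespace sketch {

// Hypothetical position of one entry in a sparse matrix tile stored by diagonal.
struct IndexPair {
    int diagonal;  // column minus row
    int row;
};

// Hypothetical fixed-size result: at most three neighbours (left, self, right).
constexpr int kMaxPairs = 3;
struct IndexPairs {
    IndexPair pairs[kMaxPairs];
    int count;
};

// Defined inline so the definition is visible in every translation unit that
// includes it, rather than being compiled into a separate object file.
SKETCH_HOST_DEVICE inline IndexPair to_index_pair(int row, int col) {
    return IndexPair{col - row, row};
}

// Returns the index pairs instead of accepting a lambda that writes to memory.
SKETCH_HOST_DEVICE inline IndexPairs populate_hessian(int row, int n_cols) {
    IndexPairs out{};  // value-initialised, so count starts at 0
    for (int offset = -1; offset <= 1; ++offset) {
        const int col = row + offset;
        if (col >= 0 && col < n_cols) {
            out.pairs[out.count++] = to_index_pair(row, col);
        }
    }
    return out;
}

// The caller owns the memory update. Assumed layout: diagonals -1, 0, +1 are
// stored as three consecutive arrays of length n_rows.
SKETCH_HOST_DEVICE inline void update_hessian(float* hessian, int n_rows,
                                              int n_cols, int row, float value) {
    assert(row >= 0 && row < n_rows);
    const IndexPairs idx = populate_hessian(row, n_cols);
    for (int i = 0; i < idx.count; ++i) {
        const int slot = idx.pairs[i].diagonal + 1;
        hessian[slot * n_rows + idx.pairs[i].row] += value;
    }
}

}  // namespace sketch
```

Because the index computation no longer touches the hessian, it needs no closure, and the same definitions can be shared by the CPU and GPU builds.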

As a result of these changes, separate MMORF::IndexUtilsGPU and MMORF::IndexUtilsCPU namespaces are no longer needed; there is now just a single MMORF::IndexUtils namespace.
