ENH: CPU implementation of `SparseDiagonalMatrixTiled` class
This MR depends on !21 and !24 which must be merged first.

The `SparseDiagonalMatrixTiled` class has been restructured to allow for separate CPU and GPU implementations.
- There is a single `SPDT` class definition in `inc/SparseDiagonalMatrixTiled.h`. The class has been adjusted so that it does not use the PIMPL idiom, but instead holds a reference to an opaque `SPDT::Data` instance.
- Platform agnostic logic is contained in `src/SparseDiagonalMatrixTiled.cpp`.
- CPU and GPU specific logic is contained in `src/SparseDiagonalMatrixTiledCPU.cpp` and `src/SparseDiagonalMatrixTiledGPU.cpp`, which also contain definitions of the `SPDT::Data` class: the GPU implementation uses `thrust::device_vector` to store the matrix data, and the CPU implementation uses `std::vector`.
- The implementation can be selected at compilation time simply by linking either `SparseDiagonalMatrixTiledCPU.o` or `SparseDiagonalMatrixTiledGPU.o`.
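
As a rough illustration of the layout described in the list above, the following single-file sketch shows a class that only forward-declares its `Data` type, with the definition supplied separately. The class name `SparseDiagonalMatrixTiledSketch`, the `size()` member, and the use of `std::unique_ptr` are assumptions made for the example and are not taken from this MR; in a real build the two marked sections would sit in the header and in the CPU source file respectively.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// ---- Would live in the shared header (hypothetical names) ----
class SparseDiagonalMatrixTiledSketch {
public:
    explicit SparseDiagonalMatrixTiledSketch(std::size_t n_values);
    ~SparseDiagonalMatrixTiledSketch();  // defined where Data is complete

    std::size_t size() const;

    struct Data;  // opaque here; each platform file supplies the definition

private:
    std::unique_ptr<Data> data_;
};

// ---- Would live in the CPU-specific source file ----
// A GPU-specific file would define the same struct around a
// thrust::device_vector instead, and only one of the two is linked.
struct SparseDiagonalMatrixTiledSketch::Data {
    std::vector<float> values;
};

SparseDiagonalMatrixTiledSketch::SparseDiagonalMatrixTiledSketch(std::size_t n_values)
    : data_(new Data{std::vector<float>(n_values, 0.0f)}) {}

SparseDiagonalMatrixTiledSketch::~SparseDiagonalMatrixTiledSketch() = default;

std::size_t SparseDiagonalMatrixTiledSketch::size() const {
    assert(data_ != nullptr);
    return data_->values.size();
}
```

Because client code only ever sees the forward declaration, swapping the object file that defines `Data` does not require recompiling anything that includes the header.
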

Some changes have been made to the `SPDT` interface to remove coupling to CUDA device memory, and allow for a CPU implementation:
- The `SPDT.get_element` and `SPDT.set_element` functions have been removed. The only use of these functions was by the `CostFxnPointCloudWarpField` class, which has been adjusted so that the matrix data is created on the CPU, and then copied into the SPDT matrix via a call to the new `SPDT.set_elements` function. This change has resulted in a substantial speed-up of the initialisation logic in the `CostFxnPointCloudWarpField` class, and has allowed CPU-GPU synchronisation logic to be removed from the `SPDT` class.
- The `SPDT.get_raw_pointer` and `SPDT.get_const_raw_pointer` functions have been removed and replaced with `SPDT.get_tile` and `SPDT.set_tile` functions, which allow a client to read/write entire matrix tiles. This means that a copy must be made when reading/writing matrix data, but this doesn't seem to have a substantial effect on performance.
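
To make the copy semantics of the two interface changes above concrete, here is a minimal CPU-only sketch. The class `TileStoreSketch`, its flat storage layout, and the exact signatures of `set_elements`, `get_tile` and `set_tile` are assumptions for illustration only; the real `SPDT` signatures are not shown in this description.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the CPU storage: n_tiles tiles of tile_size
// values each, held contiguously in one std::vector.
class TileStoreSketch {
public:
    TileStoreSketch(std::size_t n_tiles, std::size_t tile_size)
        : tile_size_(tile_size), values_(n_tiles * tile_size, 0.0f) {}

    // Bulk upload: the caller builds all values on the host and hands them
    // over in a single call, instead of one call per element.
    void set_elements(const std::vector<float>& host_values) {
        assert(host_values.size() == values_.size());
        values_ = host_values;
    }

    // Returns a copy of one tile; the caller never sees internal storage.
    std::vector<float> get_tile(std::size_t tile) const {
        assert((tile + 1) * tile_size_ <= values_.size());
        const auto first =
            values_.begin() + static_cast<std::ptrdiff_t>(tile * tile_size_);
        return std::vector<float>(
            first, first + static_cast<std::ptrdiff_t>(tile_size_));
    }

    // Copies the caller's tile into internal storage.
    void set_tile(std::size_t tile, const std::vector<float>& tile_values) {
        assert(tile_values.size() == tile_size_);
        assert((tile + 1) * tile_size_ <= values_.size());
        std::copy(tile_values.begin(), tile_values.end(),
                  values_.begin() + static_cast<std::ptrdiff_t>(tile * tile_size_));
    }

private:
    std::size_t tile_size_;
    std::vector<float> values_;
};
```

With an interface of this shape, editing the vector returned by `get_tile` has no effect on the matrix until it is passed back through `set_tile`, which is the extra copy the second bullet refers to.
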

In order to allow the CPU `SPDT` implementation to be used with GPU code (e.g. cost functions for which we don't yet have CPU implementations), a few functions have been added in `src/SparseDiagonalMatrixTiled_temporary_shim.cu`. These functions simply copy between `thrust::device_vector`/`thrust::host_vector` and `std::vector`, and can be removed once we have a full CPU implementation.
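
The kind of copy helper described above can be sketched generically. The function names `to_std_vector` and `from_std_vector` are hypothetical and are not the functions in the shim file. The sketch is plain C++ that relies only on `begin()`/`end()` and an iterator-range constructor, both of which `thrust::host_vector` and `thrust::device_vector` provide, so it compiles without Thrust when used with standard containers.

```cpp
#include <vector>

// Hypothetical helper: copy any container exposing begin()/end() into a
// std::vector<float>. Intended for host-side containers such as
// thrust::host_vector<float>; device data would normally be moved to a
// host_vector first so that the transfer happens as one bulk copy.
template <typename HostContainer>
std::vector<float> to_std_vector(const HostContainer& src) {
    return std::vector<float>(src.begin(), src.end());
}

// Hypothetical helper for the opposite direction. Dest must be
// constructible from an iterator range, as the Thrust vectors are.
template <typename Dest>
Dest from_std_vector(const std::vector<float>& src) {
    return Dest(src.begin(), src.end());
}
```
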