ENH: CPU implementation of SparseDiagonalMatrixTiled class
This MR depends on !21 (merged) and !24 (merged) which must be merged first.
The SparseDiagonalMatrixTiled class has been restructured to allow for separate CPU and GPU implementations.
- There is a single `SPDT` class definition in `inc/SparseDiagonalMatrixTiled.h`. The class has been adjusted so that it does not use the PIMPL idiom, but instead holds a reference to an opaque `SPDT::Data` instance.
- Platform-agnostic logic is contained in `src/SparseDiagonalMatrixTiled.cpp`.
- CPU- and GPU-specific logic is contained in `src/SparseDiagonalMatrixTiledCPU.cpp` and `src/SparseDiagonalMatrixTiledGPU.cu` respectively, which also contain the corresponding definitions of the `SPDT::Data` class - the GPU implementation uses `thrust::device_vector` to store the matrix data, and the CPU implementation uses `std::vector`.
- The implementation can be selected at compile time simply by linking either `SparseDiagonalMatrixTiledCPU.o` or `SparseDiagonalMatrixTiledGPU.o`.
Some changes have been made to the SPDT interface to remove coupling to CUDA device memory, and allow for a CPU implementation:
- The `SPDT.get_element` and `SPDT.set_element` functions have been removed. The only use of these functions was by the `CostFxnPointCloudWarpField` class, which has been adjusted so that the matrix data is created on the CPU, and then copied into the SPDT matrix via a call to the new `SPDT.set_elements` function. This change has resulted in a substantial speed-up of the initialisation logic in the `CostFxnPointCloudWarpField` class, and has allowed CPU-GPU synchronisation logic to be removed from the `SPDT` class.
- The `SPDT.get_raw_pointer` and `SPDT.get_const_raw_pointer` functions have been removed and replaced with `SPDT.get_tile` and `SPDT.set_tile` functions, which allow a client to read/write entire matrix tiles. This means that a copy must be made when reading/writing matrix data, but this doesn't seem to have a substantial effect on performance.
In order to allow the CPU SPDT implementation to be used with GPU code (e.g. cost functions for which we don't yet have CPU implementations), a few functions have been added in `src/SparseDiagonalMatrixTiled_temporary_shim.cu` - these functions simply copy between `thrust::device_vector`/`thrust::host_vector` and `std::vector`, and can be removed once we have a full CPU implementation.
This MR also adjusts the MMORF Makefile to allow for CPU and GPU executables to be installed alongside each other, and for binary files to live alongside each other in a local development environment:
- CPU object files are saved to `build/CPU/`
- GPU object files are saved to `build/GPU/`
- The CPU executable is saved to `build/CPU/mmorf_cpu`
- The GPU executable is saved to `build/GPU/mmorf_cuda11.0` (when compiled against CUDA 11.0)
- Both `mmorf_cpu` and `mmorf_cuda11.0` will be installed into `$FSLDIR/bin/`, along with a new Python script called `$FSLDIR/bin/mmorf`, which will call `$FSLDIR/bin/mmorf_cuda11.0` if a GPU is available, or fall back to `$FSLDIR/bin/mmorf_cpu` otherwise.
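The layout above, together with backend selection by linking, implies Makefile rules along these lines (a sketch only - the variable and target names are illustrative, not taken from the actual MMORF Makefile):

```make
# Illustrative sketch, not the actual MMORF Makefile.
CPU_OBJDIR := build/CPU
GPU_OBJDIR := build/GPU

$(CPU_OBJDIR)/%.o: src/%.cpp
	$(CXX) $(CXXFLAGS) -c $< -o $@

$(GPU_OBJDIR)/%.o: src/%.cu
	$(NVCC) $(NVCCFLAGS) -c $< -o $@

# The backend is chosen purely by which SPDT object file is linked in.
$(CPU_OBJDIR)/mmorf_cpu: $(COMMON_OBJS) $(CPU_OBJDIR)/SparseDiagonalMatrixTiledCPU.o
	$(CXX) $^ -o $@

$(GPU_OBJDIR)/mmorf_cuda11.0: $(COMMON_OBJS) $(GPU_OBJDIR)/SparseDiagonalMatrixTiledGPU.o
	$(NVCC) $^ -o $@
```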
The CPU/GPU versions of MMORF will be published as two separate conda packages which can be installed into the same environment:
- `mmorf-cuda-11.0` provides `$FSLDIR/bin/mmorf`, `$FSLDIR/bin/mmorf_cuda11.0`, and `$FSLDIR/bin/tensor_average`, although this may be removed/changed at some point.
- `mmorf-cpu` provides `$FSLDIR/bin/mmorf_cpu`