ENH: CPU implementation of `VolumeBSpline` class
Note that this MR depends on the following:
New CPU implementation of the `VolumeBSpline` class, using the spline interpolation implementation from `miscmaths/splinterpolator.h`. This gives results identical to the existing CUDA-based implementation, up to floating point precision.
The MMORF code has been restructured slightly to accommodate both CPU and GPU implementations:
- The `VolumeBSpline` class (`src/VolumeBSpline.cpp`) has been made CPU/GPU-agnostic. Data access/interpolation is now delegated to a `VolumeBSplineSampler` instance, with its interface defined in `inc/VolumeBSplineSampler.h`.
- CUDA-based interpolation is handled by a `VolumeBSplineSampler` implementation in `src/VolumeBSplineSamplerGPU.cu`.
- CPU-based interpolation is handled by an implementation in `src/VolumeBSplineSamplerCPU.cpp`.
- The implementation is selected at compile time, by linking against either `VolumeBSplineSamplerCPU.o` or `VolumeBSplineSamplerGPU.o`.
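To illustrate the delegation pattern, the sketch below shows what such a sampler interface might look like. The class and method names beyond `VolumeBSplineSampler` itself are hypothetical, and the toy CPU implementation uses nearest-neighbour lookup rather than real B-spline interpolation; the actual interface in `inc/VolumeBSplineSampler.h` will differ.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch: VolumeBSpline delegates all data access/interpolation
// to an implementation of this abstract class. The GPU and CPU samplers would
// both derive from it, and exactly one is linked into the final binary.
class VolumeBSplineSampler {
public:
  virtual ~VolumeBSplineSampler() = default;
  // Sample the interpolated volume at a continuous coordinate (x, y, z).
  virtual float sample(float x, float y, float z) const = 0;
};

// Toy CPU implementation for illustration only: nearest-neighbour lookup
// instead of spline interpolation.
class NearestSamplerCPU : public VolumeBSplineSampler {
public:
  NearestSamplerCPU(std::vector<float> data,
                    std::size_t nx, std::size_t ny, std::size_t nz)
    : data_(std::move(data)), nx_(nx), ny_(ny), nz_(nz) {}

  float sample(float x, float y, float z) const override {
    auto ix = static_cast<std::size_t>(x + 0.5f);  // round to nearest voxel
    auto iy = static_cast<std::size_t>(y + 0.5f);
    auto iz = static_cast<std::size_t>(z + 0.5f);
    return data_[ix + nx_ * (iy + ny_ * iz)];      // x-fastest flattening
  }

private:
  std::vector<float> data_;
  std::size_t nx_, ny_, nz_;
};
```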
The `VolumeBSplineSamplerCPU` implementation parallelises sampling across voxels via `std::thread` objects. This is performed in a general-purpose `parfor` function, defined in `inc/Threading.h`. The `splinterpolator` module also uses `std::thread` objects to parallelise the initial calculation of the spline coefficients. The `parfor` function essentially performs the same task as `std::transform`, but allows explicit control over the number of threads used.
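A `parfor`-style helper along these lines can be sketched as follows. This is an illustration of the idea, not the actual code in `inc/Threading.h`; the real signature may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Sketch of a parfor-style helper: apply func to every index in [0, n),
// splitting the range into contiguous chunks, one per std::thread.
// Unlike std::transform, the caller controls the number of threads.
template <typename Func>
void parfor(std::size_t n, unsigned nthreads, Func func) {
  if (nthreads <= 1) {                       // serial fallback
    for (std::size_t i = 0; i < n; ++i) func(i);
    return;
  }
  std::vector<std::thread> threads;
  std::size_t chunk = (n + nthreads - 1) / nthreads;  // ceil(n / nthreads)
  for (unsigned t = 0; t < nthreads; ++t) {
    std::size_t begin = t * chunk;
    std::size_t end   = std::min(n, begin + chunk);
    if (begin >= end) break;                 // no work left for this thread
    threads.emplace_back([=] {
      for (std::size_t i = begin; i < end; ++i) func(i);
    });
  }
  for (auto& th : threads) th.join();        // wait for all chunks to finish
}
```

Each thread owns a disjoint index range, so `func` may write to distinct elements of a shared output buffer without synchronisation.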
During this work, an issue was found in the CUDA interpolation routines, implemented in the vendored `CubicInterpolationCUDA` library. The built-in `tex3D<float>` linear interpolation routines (which are used for spline interpolation) use 8-bit precision to discretise texture coordinates, which resulted in artifacts in interpolated volume gradient fields. They have therefore been replaced with our own linear interpolation routine, which uses 32-bit precision with no apparent drop in speed. Note that this change does result in slight numeric differences in MMORF outputs.
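The effect of the coordinate quantisation can be illustrated with a 1D host-side sketch (this is not the code in the MR, just a demonstration of the precision difference): quantising the fractional coordinate to 8-bit fixed point means the interpolation weight can only take 256 distinct values per cell, which is too coarse for smooth gradient fields.

```cpp
#include <cassert>
#include <cmath>

// Linear interpolation between v0 and v1 with full float precision,
// as in a software interpolation routine.
float lerp_full(float v0, float v1, float frac) {
  return v0 + frac * (v1 - v0);
}

// Emulation of the texture-unit behaviour described above: the fractional
// coordinate is snapped to 8-bit fixed point (steps of 1/256) before
// interpolating, so small coordinate changes can produce no change in output.
float lerp_quantised(float v0, float v1, float frac) {
  float q = std::round(frac * 256.0f) / 256.0f;
  return v0 + q * (v1 - v0);
}
```

For a fractional coordinate of 0.001, the quantised version snaps to 0 and returns `v0` exactly, while the full-precision version moves smoothly between the two samples; this is the kind of stair-stepping that shows up as artifacts in gradient fields.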
Another change in this MR is a rearrangement of some utility functions for converting between 3D and 1D (flattened/unravelled) indices (e.g. `index_vol_to_lin`, `index_lin_to_vol`, etc.). These functions need to be compiled for both host and device in the CUDA implementation, and also for the CPU implementation. To avoid duplicating code, these routines have been moved into an "inline" file, `src/IndexUtils.inl`, which is included (via `#include "IndexUtils.inl"`) in both `src/IndexUtilsCPU.cpp` and `src/IndexUtilsGPU.cu`. This is slightly clunky, but avoids the need to maintain separate identical implementations.
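For reference, index conversions of this kind can be sketched as below. The function names follow the MR, but the signatures and the x-fastest flattening order are assumptions; in the real `src/IndexUtils.inl` the routines would additionally carry `__host__ __device__` qualifiers when compiled for CUDA.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Sketch of 3D <-> 1D index conversion for a volume of size nx * ny * nz,
// assuming x varies fastest in the flattened array. Illustrative signatures;
// the actual MMORF routines may differ.
inline std::size_t index_vol_to_lin(std::size_t x, std::size_t y, std::size_t z,
                                    std::size_t nx, std::size_t ny) {
  return x + nx * (y + ny * z);
}

inline std::array<std::size_t, 3> index_lin_to_vol(std::size_t lin,
                                                   std::size_t nx,
                                                   std::size_t ny) {
  return { lin % nx,            // x: fastest-varying dimension
           (lin / nx) % ny,     // y
           lin / (nx * ny) };   // z: slowest-varying dimension
}
```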