**FUNPACK** - the FMRIB UKBioBank Normalisation, Parsing And Cleaning Kit
=========================================================================


.. image:: https://img.shields.io/pypi/v/fmrib-unpack.svg
   :target: https://pypi.python.org/pypi/fmrib-unpack/

.. image:: https://anaconda.org/conda-forge/fmrib-unpack/badges/version.svg
   :target: https://anaconda.org/conda-forge/fmrib-unpack


.. image:: https://zenodo.org/badge/DOI/10.5281/zenodo.1997626.svg
   :target: https://doi.org/10.5281/zenodo.1997626

.. image:: https://git.fmrib.ox.ac.uk/fsl/funpack/badges/master/coverage.svg
   :target: https://git.fmrib.ox.ac.uk/fsl/funpack/commits/master/


**FUNPACK** is a Python library for pre-processing of UK BioBank data.


    FUNPACK is developed at the Wellcome Centre for Integrative Neuroimaging
    (WIN@FMRIB), University of Oxford. FUNPACK is in no way endorsed,
    sanctioned, or validated by the `UK BioBank
    <https://www.ukbiobank.ac.uk/>`_.

    FUNPACK comes bundled with metadata about the variables present in UK
    BioBank data sets. This metadata can be obtained from the `UK BioBank
    online data showcase <https://biobank.ctsu.ox.ac.uk/showcase/index.cgi>`_.


Installation
------------


Install FUNPACK via pip::

    pip install fmrib-unpack


Or from ``conda-forge``::

    conda install -c conda-forge fmrib-unpack


The FUNPACK source code can be found at
https://git.fmrib.ox.ac.uk/fsl/funpack/.


Introductory notebook
---------------------


The ``fmrib_unpack_demo`` command will start a Jupyter Notebook which introduces
the main features provided by FUNPACK. A non-interactive version of this
notebook can be found at
https://open.win.ox.ac.uk/pages/fsl/funpack/demo.html.

If you are using ``pip``, you need to install a few additional dependencies::

    pip install fmrib-unpack[demo]


You can then start the demo by running ``fmrib_unpack_demo``.


.. note:: The introductory notebook uses ``bash``, so is unlikely to work on
          Windows.


Usage
-----


General usage is as follows::

    fmrib_unpack [options] output.tsv input1.tsv input2.tsv


You can get information on all of the options by typing ``fmrib_unpack --help``.

    The ``fmrib_unpack`` command was called ``funpack`` in older versions of
    FUNPACK, but was changed to ``fmrib_unpack`` in 3.0.0 to avoid a naming
    conflict with an `unrelated software package
    <https://heasarc.gsfc.nasa.gov/fitsio/>`_.


Options can be specified on the command line, and/or stored in a configuration
file. For example, the options in the following command line::

    fmrib_unpack \
      --overwrite \
      --write_log \
      --icd10_map_file icd_codes.tsv \
      --category 10 \
      --category 11 \
      output.tsv input1.tsv input2.tsv


Could be stored in a configuration file ``config.txt``::

    overwrite
    write_log
    icd10_map_file icd_codes.tsv
    category       10
    category       11


And then executed as follows::

    fmrib_unpack -cfg config.txt output.tsv input1.tsv input2.tsv
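
In the example above, each long option loses its leading dashes and goes on
its own line in the configuration file, followed by its argument if it has
one. The Python sketch below illustrates that mapping. The
``options_to_config`` helper is hypothetical and is not part of FUNPACK; it
only handles ``--long`` style options like those shown in this README.

```python
def options_to_config(args):
    """Convert ``--option [value ...]`` tokens into config-file lines.

    Hypothetical helper for illustration only; not part of FUNPACK.
    """
    lines = []
    for token in args:
        if token.startswith('--'):
            # A new option: strip the leading dashes and start a new line
            lines.append([token[2:]])
        elif lines:
            # A value belonging to the most recently seen option
            lines[-1].append(token)
        else:
            raise ValueError(f'value {token!r} appears before any option')
    return [' '.join(parts) for parts in lines]


options = ['--overwrite',
           '--write_log',
           '--icd10_map_file', 'icd_codes.tsv',
           '--category', '10',
           '--category', '11']

# Prints the same five lines as the config.txt example above
print('\n'.join(options_to_config(options)))
```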


Features
--------


FUNPACK allows you to perform various data sanitisation and processing steps
on your data, such as:

 * **NA value replacement**: Specific values for some columns can be replaced
   with NA, for example, variables where a value of -1 indicates *Do not know*.

 * **Categorical recoding**: Certain categorical columns can be recoded. For
   example, variables where a value of 555 represents *half* can be recoded
   so that 555 is replaced with 0.5.

 * **Child value replacement**: NA values within some columns which are
   dependent upon other columns may have values inserted based on the values
   of their parent columns.

See the introductory notebook for a more comprehensive overview of the features
available in FUNPACK.
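
The three steps listed above can be sketched in plain Python. This is only an
illustration of what each step does to the data, not FUNPACK's implementation:
the rows, column names and helper functions are all made up, and ``None``
stands in for NA.

```python
# Hypothetical rows; column names are invented for this example
rows = [
    {'eid': 1, 'frequency': -1, 'portions': 555, 'ever_smoked': 0, 'per_day': None},
    {'eid': 2, 'frequency': 3,  'portions': 2,   'ever_smoked': 1, 'per_day': 10},
    {'eid': 3, 'frequency': 1,  'portions': 555, 'ever_smoked': 1, 'per_day': None},
]


def replace_na(rows, column, na_values):
    """NA value replacement: turn the listed values into NA."""
    for row in rows:
        if row[column] in na_values:
            row[column] = None


def recode(rows, column, mapping):
    """Categorical recoding: replace raw levels with new levels."""
    for row in rows:
        row[column] = mapping.get(row[column], row[column])


def fill_child(rows, child, parent, parent_value, child_value):
    """Child value replacement: fill NA in a child column when the
    parent column has a given value."""
    for row in rows:
        if row[child] is None and row[parent] == parent_value:
            row[child] = child_value


replace_na(rows, 'frequency', {-1})               # -1 means "Do not know"
recode(rows, 'portions', {555: 0.5})              # 555 means "half"
fill_child(rows, 'per_day', 'ever_smoked', 0, 0)  # never smoked => 0 per day
```

After these calls, the first row has an NA ``frequency``, a ``portions`` value
of 0.5 and a ``per_day`` value of 0. The third row keeps its NA ``per_day``,
because its parent value does not match the condition.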


Built-in rules
--------------


FUNPACK contains a large number of built-in rules which have been specifically
written to pre-process UK BioBank data variables. These rules are stored in
the following files (the ``funpack/`` directory is located in your Python
environment directory):

 * ``funpack/configs/fmrib/datacodings_*.tsv``: Cleaning rules for data codings
 * ``funpack/configs/fmrib/variables_*.tsv``: Cleaning rules for individual
   variables
 * ``funpack/configs/fmrib/processing.tsv``: Processing steps
 * ``funpack/configs/fmrib/categories.tsv``: Variable categories


You can use these rules by using the FMRIB configuration profile::

    fmrib_unpack -cfg fmrib output.tsv input.tsv


You can customise or replace these files as you see fit. You can also pass
your own versions of these files to FUNPACK via the ``--variable_file``,
``--datacoding_file``, ``--type_file``, ``--processing_file``, and
``--category_file`` command-line options respectively. FUNPACK will load all
variable and datacoding files, and merge them into a single table which
contains the cleaning rules for each variable.


Creating your own rule files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^


To define rules at the *data-coding* level, create one or more ``.tsv`` files
with an ``ID`` column containing the data-coding ID, and any of the following
columns:


  - ``NAValues``: A comma-separated list of values to replace with NA
  - ``RawLevels``: A comma-separated list of values to be replaced with
    corresponding values in ``NewLevels``.
  - ``NewLevels``: A comma-separated list of replacement values for each
    of the values listed in ``RawLevels``.

To apply these rules, pass your ``.tsv`` file(s) to ``fmrib_unpack`` with the
``--datacoding_file`` option. They will be applied to all variables which
use the data-coding(s) listed in the file(s).
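
As a rough illustration, the Python sketch below writes such a file with the
standard ``csv`` module. The data-coding ID ``1234`` and the values in the
rule are invented for the example, and ``write_datacoding_rules`` is a
hypothetical helper rather than a FUNPACK function.

```python
import csv
import os
import tempfile

COLUMNS = ['ID', 'NAValues', 'RawLevels', 'NewLevels']


def write_datacoding_rules(path, rules):
    """Write data-coding rules to a tab-separated file with a header row."""
    with open(path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS,
                                delimiter='\t', lineterminator='\n')
        writer.writeheader()
        writer.writerows(rules)


# Made-up rule: for data-coding 1234, treat -1 and -3 as NA,
# and recode the value 555 to 0.5
rules = [
    {'ID': '1234', 'NAValues': '-1,-3', 'RawLevels': '555', 'NewLevels': '0.5'},
]

# Written to a temporary directory here; use any location you like
path = os.path.join(tempfile.mkdtemp(), 'my_datacodings.tsv')
write_datacoding_rules(path, rules)
```

The resulting file could then be passed to ``fmrib_unpack`` via the
``--datacoding_file`` option.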


To define rules at the *variable* level, create one or more ``.tsv`` files
with an ``ID`` column containing the variable ID, and any of the following
columns:


  - ``NAValues``: As above
  - ``RawLevels``: As above
  - ``NewLevels``: As above
  - ``ParentValues``: A comma-separated list of expressions on parent
    variables, defining conditions which should trigger child-value
    replacement.
  - ``ChildValues``: A comma-separated list of values to insert into the
    variable when the corresponding expression in ``ParentValues`` evaluates
    to true.
  - ``Clean``: A comma-separated list of cleaning functions to apply to the
    variable.


Output
------


The main output of FUNPACK is a plain-text file\ [*]_ which contains the input
data, after cleaning and processing, potentially with some columns removed,
and new columns added.


If you used the ``--suppress_non_numerics`` option, the main output file will
only contain the numeric columns. You can combine this with the
``--write_non_numerics`` option to save non-numeric columns to a separate
file.


You can use any tool of your choice to load this output file, such as Python,
MATLAB, or Excel. It is also possible to pass the output back into
FUNPACK.


.. [*] If your output file name ends with ``.csv``, the file will be
       comma-separated, and if your output file name ends with ``.tsv``, the
       file will be tab-separated.
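
For Python users, a minimal sketch of loading the output with only the
standard library might look like this. The ``load_output`` function is a
hypothetical helper, not part of FUNPACK. It picks the delimiter from the file
suffix as described above, and returns every value as a string, leaving type
conversion and missing-value handling to the caller.

```python
import csv
import os


def load_output(path):
    """Load a FUNPACK output file into a list of dicts, one per row.

    Hypothetical helper: the delimiter is chosen from the file suffix
    (comma for ``.csv``, tab otherwise).
    """
    suffix = os.path.splitext(path)[1].lower()
    delimiter = ',' if suffix == '.csv' else '\t'
    with open(path, newline='') as f:
        # The first row of the file is used as the column names
        return list(csv.DictReader(f, delimiter=delimiter))


# Example usage, assuming a file called out.tsv exists:
#
#     rows = load_output('out.tsv')
#     print(rows[0]['eid'])
```

If you have ``pandas`` installed, ``pandas.read_csv('out.tsv', sep='\t')``
is a more convenient alternative which also infers column types.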


Loading output into MATLAB
^^^^^^^^^^^^^^^^^^^^^^^^^^


.. |readtable| replace:: ``readtable``
.. _readtable: https://uk.mathworks.com/help/matlab/ref/readtable.html

.. |table| replace:: ``table``
.. _table: https://uk.mathworks.com/help/matlab/ref/table.html


If you are using MATLAB, you have several options for loading the FUNPACK
output. The best option is |readtable|_, which will load column names, and
will handle both non-numeric data and missing values.  Use ``readtable`` like
so (assuming that you generated a tab-separated file)::

    data = readtable('out.tsv', 'FileType', 'text');


The ``readtable`` function returns a |table|_ object, which stores each column
as a separate vector (or cell-array for non-numeric columns). If you are only
interested in numeric columns, you can retrieve them as an array like this::

    data    = data(:, vartype('numeric'));
    rawdata = data.Variables;


The ``readtable`` function will potentially rename the column names to ensure
that they are valid MATLAB identifiers. You can retrieve the original
names from the ``table`` object like so::

    colnames = data.Properties.VariableDescriptions';


If you have used the ``--write_description`` or ``--description_file``
options, you can load in the descriptions for each column as follows::

    descs = readtable('out_descriptions.tsv', ...
                      'FileType', 'text', ...
                      'Delimiter', '\t',  ...
                      'ReadVariableNames',false);
    descs = [descs; {'eid', 'ID'}];
    idxs  = cellfun(@(x) find(strcmp(descs.Var1, x)), colnames, ...
                    'UniformOutput', false);
    idxs  = cell2mat(idxs);
    descs = descs.Var2(idxs);


Tests
-----


To run the test suite, you need to install some additional dependencies::

    pip install fmrib-unpack[test]


Then you can run the test suite using ``pytest``::

    pytest


macOS issues
------------


FUNPACK makes extensive use of the Python `multiprocessing
<https://docs.python.org/3/library/multiprocessing.html>`_ module to speed up
certain steps in its processing pipeline.  FUNPACK relies on the POSIX `fork()
<https://www.man7.org/linux/man-pages/man2/fork.2.html>`_ mechanism, so that
worker processes may inexpensively inherit the memory space of the main
process (often referred to as *copy-on-write*).  This is to avoid having to
serialise the data set being processed (stored internally as a
``pandas.DataFrame``).
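
The idea can be sketched as follows. This is not FUNPACK's actual code: it
uses individual ``multiprocessing.Process`` objects rather than a pool to keep
the example short, a small dictionary stands in for the data set, and the
function names are made up. It assumes a POSIX platform where the ``fork``
start method is available.

```python
import multiprocessing as mp

# Stand-in for a large data set held in memory by the parent process
DATA = {'a': list(range(1000)), 'b': list(range(1000, 2000))}


def _sum_column(name, results):
    # DATA is inherited from the parent through fork(), so the data set
    # itself is never serialised; only the small result is sent back.
    results.put((name, sum(DATA[name])))


def column_sums():
    """Sum each column of DATA in a separate forked worker process."""
    ctx = mp.get_context('fork')  # explicitly request fork (POSIX only)
    results = ctx.SimpleQueue()
    workers = [ctx.Process(target=_sum_column, args=(name, results))
               for name in DATA]
    for worker in workers:
        worker.start()
    # Collect one result per worker before joining
    sums = dict(results.get() for _ in workers)
    for worker in workers:
        worker.join()
    return sums


if __name__ == '__main__':
    print(column_sums())
```

With the ``spawn`` start method, each worker would instead start from a fresh
interpreter, and any data it needs would have to be pickled and sent to it.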


In Python 3.8 on macOS, the default method used by the ``multiprocessing``
module was changed from ``fork`` to ``spawn``, due to changes in macOS 10.13
restricting the use of ``fork()`` for safety reasons. Some background
information on this change can be found at https://bugs.python.org/issue33725,
and at `this blog post
<https://wefearchange.org/2018/11/forkmacos.rst.html>`_.


FUNPACK therefore explicitly sets the method used by the ``multiprocessing``
module to ``fork``, to take advantage of copy-on-write semantics.  Using ``fork()``
on macOS *should* be safe for single-threaded parent processes, but as FUNPACK
calls ``fork()`` numerous times (by creating and discarding
``multiprocessing.Pool()`` objects on an as-needed basis), this assumption may
not be valid, and FUNPACK may crash with an error message resembling the
following::


    +[SomeClass initialize] may have been in progress in another thread
    when fork() was called. We cannot safely call it or ignore it in the
    fork() child process. Crashing instead.


You might be able to work around this error by setting an environment variable
before calling FUNPACK, like so::


    export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
    fmrib_unpack ...


Citing
------


If you would like to cite FUNPACK, please refer to its `Zenodo page
<https://doi.org/10.5281/zenodo.1997626>`_.