``ukbparse`` - the FMRIB UK BioBank data parser
===============================================


.. note:: ``ukbparse`` has been superseded by ``funpack`` and will no longer
          be developed. Head to https://git.fmrib.ox.ac.uk/fsl/funpack for
          more information.


.. image:: https://img.shields.io/pypi/v/ukbparse.svg
   :target: https://pypi.python.org/pypi/ukbparse/

.. image:: https://anaconda.org/conda-forge/ukbparse/badges/version.svg
   :target: https://anaconda.org/conda-forge/ukbparse

.. image:: https://zenodo.org/badge/DOI/10.5281/zenodo.1997626.svg
   :target: https://doi.org/10.5281/zenodo.1997626

.. image:: https://git.fmrib.ox.ac.uk/fsl/ukbparse/badges/master/coverage.svg
   :target: https://git.fmrib.ox.ac.uk/fsl/ukbparse/commits/master/


``ukbparse`` is a Python library for pre-processing of UK BioBank data.


    ``ukbparse`` is developed at the Wellcome Centre for Integrative
    Neuroimaging (WIN@FMRIB), University of Oxford. ``ukbparse`` is in no way
    endorsed, sanctioned, or validated by the `UK BioBank
    <https://www.ukbiobank.ac.uk/>`_.

    ``ukbparse`` comes bundled with metadata about the variables present in UK
    BioBank data sets. This metadata can be obtained from the `UK BioBank
    online data showcase <https://biobank.ctsu.ox.ac.uk/showcase/index.cgi>`_.


Installation
------------


Install ``ukbparse`` via pip::

    pip install ukbparse


Or from ``conda-forge``::

    conda install -c conda-forge ukbparse


Introductory notebook
---------------------


The ``ukbparse_demo`` command will start a Jupyter Notebook which introduces
the main features provided by ``ukbparse``. To run it, you need to install a
few additional dependencies::


    pip install ukbparse[demo]


You can then start the demo by running ``ukbparse_demo``.


.. note:: The introductory notebook uses ``bash``, so is unlikely to work on
          Windows.


Usage
-----


General usage is as follows::


    ukbparse [options] output.tsv input1.tsv input2.tsv


You can get information on all of the options by typing ``ukbparse --help``.


Options can be specified on the command line, and/or stored in a configuration
file. For example, the options in the following command line::


    ukbparse \
      --overwrite \
      --import_all \
      --log_file log.txt \
      --icd10_map_file icd_codes.tsv \
      --category 10 \
      --category 11 \
      output.tsv input1.tsv input2.tsv


Could be stored in a configuration file ``config.txt``::


    overwrite
    import_all
    log_file       log.txt
    icd10_map_file icd_codes.tsv
    category       10
    category       11


And then executed as follows::


    ukbparse -cfg config.txt output.tsv input1.tsv input2.tsv
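

The correspondence between the two forms can be sketched in Python. This is
only an illustration of the mapping shown above, not how ``ukbparse`` itself
reads configuration files. The ``config_to_args`` helper is hypothetical, and
assumes that each non-empty line holds an option name, optionally followed by
whitespace and a value::

    import shlex


    def config_to_args(config_text):
        """Convert configuration file text into command-line arguments.

        Hypothetical helper: each non-empty line is assumed to hold an
        option name, optionally followed by one or more values.
        """
        args = []
        for line in config_text.splitlines():
            words = shlex.split(line)
            if not words:
                continue  # skip blank lines
            name, values = words[0], words[1:]
            args.append('--' + name)
            args.extend(values)
        return args


    # The contents of config.txt from the example above
    config = """
    overwrite
    import_all
    log_file       log.txt
    icd10_map_file icd_codes.tsv
    category       10
    category       11
    """

    print(config_to_args(config))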


Customising
-----------


``ukbparse`` contains a large number of built-in rules which have been
specifically written to pre-process UK BioBank data variables. These rules are
stored in the following files:


 * ``ukbparse/data/variables_*.tsv``: Cleaning rules for individual variables
 * ``ukbparse/data/datacodings_*.tsv``: Cleaning rules for data codings
 * ``ukbparse/data/types.tsv``: Cleaning rules for specific types
 * ``ukbparse/data/processing.tsv``: Processing steps

You can customise or replace these files as you see fit. You can also pass
your own versions of these files to ``ukbparse`` via the ``--variable_file``,
``--datacoding_file``, ``--type_file`` and ``--processing_file`` command-line
options respectively. ``ukbparse`` will load all variable and datacoding files,
and merge them into a single table which contains the cleaning rules for each
variable.

Finally, you can use the ``--no_builtins`` option to bypass all of the
built-in cleaning and processing rules.


Output
------


The main output of ``ukbparse`` is a plain-text tab-delimited\ [*]_ file which
contains the input data, after cleaning and processing, potentially with
some columns removed, and new columns added.


If you used the ``--non_numeric_file`` option, the main output file will only
contain the numeric columns; non-numeric columns will be saved to a separate
file.


You can use any tool of your choice to load this output file, such as Python,
MATLAB, or Excel. It is also possible to pass the output back into
``ukbparse``.
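

In Python, for example, the output can be read with the standard ``csv``
module. The following is a minimal sketch, not part of ``ukbparse``: the
``load_output`` helper and the sample column names are made up for
illustration, and it assumes that the file is tab-delimited, has a header
row, and identifies subjects by an ``eid`` column::

    import csv
    import io


    def load_output(fileobj, sep='\t'):
        """Read delimited output into a {eid: {column: value}} dict.

        Values are left as strings; empty fields are returned as empty
        strings.
        """
        reader = csv.DictReader(fileobj, delimiter=sep)
        return {row['eid']: row for row in reader}


    # Stand-in for open('output.tsv', newline=''); the columns are invented.
    sample = io.StringIO('eid\t1-0.0\t2-0.0\n'
                         '1\t31\t65.2\n'
                         '2\t\t70.1\n')

    data = load_output(sample)
    print(data['2']['2-0.0'])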


.. [*] You can change the delimiter via the ``--tsv_sep`` / ``-ts`` option.


Loading output into MATLAB
^^^^^^^^^^^^^^^^^^^^^^^^^^


.. |readtable| replace:: ``readtable``
.. _readtable: https://uk.mathworks.com/help/matlab/ref/readtable.html

.. |table| replace:: ``table``
.. _table: https://uk.mathworks.com/help/matlab/ref/table.html


If you are using MATLAB, you have several options for loading the ``ukbparse``
output. The best option is |readtable|_, which will load column names, and
will handle both non-numeric data and missing values.  Use ``readtable`` like
so::

    data = readtable('out.tsv', 'FileType', 'text');


The ``readtable`` function returns a |table|_ object, which stores each column
as a separate vector (or cell-array for non-numeric columns). If you are only
interested in numeric columns, you can retrieve them as an array like this::

    rawdata =  data(:, vartype('numeric')).Variables;


The ``readtable`` function may rename columns to ensure that their names are
valid MATLAB identifiers. You can retrieve the original names from the
``table`` object like so::

    colnames        = data.Properties.VariableDescriptions;
    colnames        = regexp(colnames, '''(.+)''', 'tokens', 'once');
    empty           = cellfun(@isempty, colnames);
    colnames(empty) = data.Properties.VariableNames(empty);
    colnames        = vertcat(colnames{:});


If you have used the ``--description_file`` option, you can load in the
descriptions for each column as follows::

    descs = readtable('descriptions.tsv', ...
                      'FileType', 'text', ...
                      'Delimiter', '\t',  ...
                      'ReadVariableNames',false);
    descs = [descs; {'eid', 'ID'}];
    idxs  = cellfun(@(x) find(strcmp(descs.Var1, x)), colnames, ...
                    'UniformOutput', false);
    idxs  = cell2mat(idxs);
    descs = descs.Var2(idxs);
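

The same lookup can be sketched in Python. This is an illustration rather
than part of ``ukbparse``. It assumes, as the MATLAB code above does, that the
description file is tab-delimited with no header row, with the column name in
the first field and its description in the second. The ``load_descriptions``
helper and the sample data are made up::

    import csv
    import io


    def load_descriptions(fileobj):
        """Return a {column name: description} dict from a description file."""
        reader = csv.reader(fileobj, delimiter='\t')
        descs = {row[0]: row[1] for row in reader if len(row) >= 2}
        # As in the MATLAB example, add an entry for the subject ID column.
        descs.setdefault('eid', 'ID')
        return descs


    # Stand-in for open('descriptions.tsv', newline='').
    sample = io.StringIO('1-0.0\tFirst variable\n2-0.0\tSecond variable\n')
    descs = load_descriptions(sample)

    # Look up the description of each column, in column order.
    colnames = ['eid', '1-0.0', '2-0.0']
    print([descs[c] for c in colnames])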


Tests
-----

To run the test suite, you need to install some additional dependencies::

      pip install ukbparse[test]


Then you can run the test suite using ``pytest``::

    pytest


Citing
------


If you would like to cite ``ukbparse``, please refer to its `Zenodo page
<https://doi.org/10.5281/zenodo.1997626>`_.