"description" : "<p>ukbparse is a Python library for pre-processing of UK BioBank tabular data.</p>\n\n<p>The ukbparse library is developed at the Wellcome Centre for Integrative Neuroimaging (FMRIB), at the University of Oxford. It is hosted at <a href=\"https://git.fmrib.ox.ac.uk/fsl/ukbparse/\">https://git.fmrib.ox.ac.uk/fsl/ukbparse/</a>.</p>",
"description" : "<p><tt>ukbparse</tt> - the FMRIB UK BioBank data processing library</p>\n\n<p><tt>ukbparse</tt> is a Python library for pre-processing of UK BioBank tabular data.</p>\n\n<p>The <tt>ukbparse</tt> library is developed at the Wellcome Centre for Integrative Neuroimaging (FMRIB), at the University of Oxford. It is hosted at <a href=\"https://git.fmrib.ox.ac.uk/fsl/ukbparse/\">https://git.fmrib.ox.ac.uk/fsl/ukbparse/</a>.</p>",
"*The examples in this notebook assume that you have installed `ukbparse` 0.17.0 or newer.*"
"*The examples in this notebook assume that you have installed `ukbparse` 0.20.0 or newer.*"
]
},
{
...
...
@@ -980,7 +980,7 @@
"metadata": {},
"outputs": [],
"source": [
"ukbparse -nb -q -ow -apr all \"removeIfSparse(minpres=0.65, absolute=False)\" out.tsv data_15.tsv\n",
"ukbparse -nb -q -ow -apr all \"removeIfSparse(minpres=0.65, abspres=False)\" out.tsv data_15.tsv\n",
"cat out.tsv"
]
},
...
...
%% Cell type:markdown id: tags:

# `ukbparse`
> Paul McCarthy <paul.mccarthy@ndcn.ox.ac.uk> ([WIN@FMRIB](https://www.win.ox.ac.uk/))
`ukbparse` is a command-line program which you can use to extract data from UK BioBank (and other tabular) data.
You can give `ukbparse` one or more input files (e.g. `.csv`, `.tsv`), and it will merge them together, perform some preprocessing, and produce a single output file.
A large number of rules are built into `ukbparse` which are specific to the UK BioBank data set. But you can control and customise everything that `ukbparse` does to your data, including which rows and columns to extract, and which cleaning/processing steps to perform on each column.
The `ukbparse` source code is available at https://git.fmrib.ox.ac.uk/fsl/ukbparse. You can install `ukbparse` into a Python environment using `pip`:
pip install ukbparse
Get command-line help by typing:
ukbparse -h
*The examples in this notebook assume that you have installed `ukbparse` 0.17.0 or newer.*
*The examples in this notebook assume that you have installed `ukbparse` 0.20.0 or newer.*
All data files are loaded in, unwanted columns and subjects are dropped, and the data files are merged into a single table (a.k.a. data frame). Multiple files can be merged according to an index column (e.g. subject ID). Or, if the input files contain the same columns/subjects, they can be naively concatenated along rows or columns.
## 2. Cleaning
The following cleaning steps are applied to each column:
1.**NA value replacement:** Specific values for some columns are replaced with NA, for example, variables where a value of `-1` indicates *Do not know*.
2.**Variable-specific cleaning functions:** Certain columns are re-formatted - for example, the [ICD10](https://en.wikipedia.org/wiki/ICD-10) disease codes are converted to integer representations.
3.**Categorical recoding:** Certain categorical columns are re-coded.
4.**Child value replacement:** NA values within some columns which are dependent upon other columns may have values inserted based on the values of their parent columns.
## 3. Processing
During the processing stage, columns may be removed, merged, or expanded into additional columns. For example, a categorical column may be expanded into a set of binary columns, one for each category.
A column may also be removed on the basis of being too sparse, or being redundant with respect to another column.
## 4. Export
The processed data can be saved as a `.csv`, `.tsv`, or `.hdf5` file.
%% Cell type:markdown id: tags:
# Examples
Throughout these examples, we are going to use a few command line options, which you will probably **not** normally want to use:
-`-nb` (short for `--no_builtins`): This tells `ukbparse` not to use the built-in processing rules, which are specifically tailored for UK BioBank data.
-`-ow` (short for `--overwrite`): This tells `ukbparse` not to complain if the output file already exists.
-`-q` (short for `--quiet`): This tells `ukbparse` to be quiet.
Without the `-q` option, `ukbparse` can be quite verbose, which can be annoying, but is very useful when things go wrong. A good strategy is to tell `ukbparse` to send all of its output to a log file with the `--log_file` (or `-lf`) option. For example:
ukbparse --log_file log.txt out.tsv in.tsv
Here's the first example input data set, with UK BioBank-style column names:
%% Cell type:code id: tags:
``` bash
cat data_01.tsv
```
%% Cell type:markdown id: tags:
The numbers in each column name represent:
1. The variable ID
2. The visit, for variables which were collected at multiple points in time.
3. The "instance", for multi-valued variables.
Note that one **variable** is typically associated with several **columns**, although we're keeping things simple for this first example - there is only one visit for each variable, and there are no mulit-valued variables.
%% Cell type:markdown id: tags:
# Import examples
## Selecting variables (columns)
You can specify which variables you want to load in the following ways, using the `--variable` (`-v` for short) and `--category` (`-c` for short) command line options:
* By variable ID
* By variable ranges
* By a text file which contains the IDs you want to keep.
* By pre-defined variable categories
* By column name
### Selecting individual variables
Simply provide the IDs of the variables you want to extract:
%% Cell type:code id: tags:
``` bash
ukbparse -nb-q-ow-v 1 -v 5 out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
### Selecting variable ranges
The `--variable`/`-v` option accepts MATLAB-style ranges of the form `start:step:stop` (where the `stop` is inclusive):
%% Cell type:code id: tags:
``` bash
ukbparse -nb-q-ow-v 1:3:10 out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
### Selecting variables with a file
If your variables of interest are listed in a plain-text file, you can simply pass that file:
%% Cell type:code id: tags:
``` bash
echo-e"1\n6\n9"> vars.txt
ukbparse -nb-q-ow-v vars.txt out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
### Selecting variables from pre-defined categories
Some UK BioBank-specific categories are baked into `ukbparse`, but you can also define your own categories - you just need to create a `.tsv` file, and pass it to `ukbparse` via the `--category_file` (`-cf` for short):