%% Cell type:markdown id: tags: | ||
# FUNPACK overview | ||
 | ||
> **Note:** If you have FUNPACK installed, you can start an interactive | ||
> version of this page by running `fmrib_unpack_demo`. | ||
FUNPACK is a command-line program which you can use to extract data from UK
BioBank (and other tabular) data sets. You can run FUNPACK by calling the
`fmrib_unpack` command.
You can give FUNPACK one or more input files (e.g. `.csv`, `.tsv`), and it | ||
will merge them together, perform some preprocessing, and produce a single | ||
output file. | ||
A large number of rules are built into FUNPACK which are specific to the UK | ||
BioBank data set. But you can control and customise everything that FUNPACK | ||
does to your data, including which rows and columns to extract, and which | ||
cleaning/processing steps to perform on each column. | ||
**Important:** The examples in this notebook assume that you have installed
FUNPACK 3.3.0 or newer.
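If you want to check this requirement from a script, one option is to compare version strings with `sort -V`. The following is a minimal sketch: the `version_ge` helper is hypothetical (it is not part of FUNPACK), it assumes a `sort` which supports the `-V` option, and it assumes that `fmrib_unpack -V` prints the version number as the last word of its output.

``` bash
# Hypothetical helper (not part of FUNPACK): succeeds if version $1 >= version $2.
# Assumes a sort implementation which supports -V (version sort).
version_ge() {
    [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

required="3.3.0"

# Only query FUNPACK if it is actually installed.
if command -v fmrib_unpack > /dev/null 2>&1; then
    # Assumes output of the form "funpack 3.3.0".
    installed=$(fmrib_unpack -V | awk '{print $NF}')
    if version_ge "$installed" "$required"; then
        echo "FUNPACK $installed is new enough"
    else
        echo "FUNPACK $installed is older than $required"
    fi
else
    echo "fmrib_unpack not found on PATH"
fi
```

A plain string comparison would not be sufficient here, as it would rank `3.10.0` below `3.3.0`.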
> **Note:** The `fmrib_unpack` command was called `funpack` in older versions | ||
> of FUNPACK, but was changed to `fmrib_unpack` in 3.0.0 to avoid a naming | ||
> conflict with an [unrelated software | ||
> package](https://heasarc.gsfc.nasa.gov/fitsio/). | ||
%% Cell type:code id: tags: | ||
``` bash | ||
fmrib_unpack -V | ||
``` | ||
%%%% Output: stream | ||
funpack 3.3.0 | ||
%% Cell type:markdown id: tags: | ||
> **Note:** If the above command produces a `NameError`, you may need to | ||
> change the Jupyter Notebook kernel type to **Bash** - you can do so via the | ||
> **Kernel -> Change Kernel** menu option. | ||
## Contents | ||
1. [Overview](#Overview) | ||
2. [Examples](#Examples) | ||
3. [Import examples](#Import-examples) | ||
4. [Cleaning examples](#Cleaning-examples) | ||
5. [Processing examples](#Processing-examples) | ||
6. [Custom cleaning, processing and loading - funpack plugins](#Custom-cleaning,-processing-and-loading---funpack-plugins)
7. [Miscellaneous topics](#Miscellaneous-topics) | ||
## Overview | ||
FUNPACK performs the following steps: | ||
### 1. Import | ||
All data files are loaded in, unwanted columns and subjects are dropped, and | ||
the data files are merged into a single table (a.k.a. data frame). Multiple | ||
files can be merged according to an index column (e.g. subject ID). Or, if the | ||
input files contain the same columns/subjects, they can be naively | ||
concatenated along rows or columns. | ||
> _Note:_ FUNPACK refers to UK Biobank **Data fields** as **variables**. The | ||
> two terms can be considered equivalent. | ||
### 2. Cleaning | ||
The following cleaning steps are applied to each column: | ||
1. **NA value replacement:** For some columns, specific values are replaced
   with NA; for example, in variables where a value of `-1` indicates *Do not
   know*.
2. **Variable-specific cleaning functions:** Certain columns are | ||
re-formatted; for example, the [ICD10](https://en.wikipedia.org/wiki/ICD-10) | ||
disease codes can be converted to integer representations. | ||
3. **Categorical recoding:** Certain categorical columns are re-coded. | ||
4. **Child value replacement:** NA values within some columns which are | ||
dependent upon other columns may have values inserted based on the values | ||
of their parent columns. | ||
### 3. Processing | ||
During the processing stage, columns may be removed, merged, or expanded into | ||
additional columns. For example, a categorical column may be expanded into a set | ||
of binary columns, one for each category. | ||
A column may also be removed on the basis of being too sparse, or being | ||
redundant with respect to another column. | ||
### 4. Export | ||
The processed data can be saved as a `.csv`, `.tsv`, or `.hdf5` file. | ||
## Examples | ||
Throughout these examples, we are going to use a few command line | ||
options, which you will probably **not** normally want to use: | ||
- We will alias `fmrib_unpack` to `funpack`, to save a little typing. | ||
- `-ow` (short for `--overwrite`): This tells `fmrib_unpack` not to complain | ||
if the output file already exists. | ||
- `-q` (short for `--quiet`): This tells `fmrib_unpack` to be quiet. Without | ||
the `-q` option, `fmrib_unpack` can be quite verbose, which can be | ||
annoying, but is very useful when things go wrong. A good strategy is to | ||
tell `fmrib_unpack` to produce verbose output using the `--noisy` (`-n` for | ||
short) option, and to send all of its output to a log file with the | ||
`--log_file` (or `-lf`) option. For example: | ||
> ``` | ||
> fmrib_unpack -n -n -n -lf log.txt out.tsv in.tsv | ||
> ``` | ||
%% Cell type:code id: tags: | ||
``` bash | ||
alias funpack="fmrib_unpack -ow -q" | ||
``` | ||
%% Cell type:markdown id: tags: | ||
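Bash only expands aliases in interactive shells by default, so the alias above works in a notebook or terminal, but not inside a shell script. A shell function is one alternative. This is only a sketch: the name `funpack_q` is hypothetical, and deliberately differs from the alias so that the two do not clash.

``` bash
# Script-friendly equivalent of the alias: a function which forwards all
# of its arguments to fmrib_unpack, adding the -ow and -q options.
# The name funpack_q is hypothetical, not something FUNPACK provides.
funpack_q() {
    fmrib_unpack -ow -q "$@"
}
```

In a script you could then call, for example, `funpack_q -v 1 out.tsv data_01.tsv`.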
Here's the first example input data set, with UK BioBank-style column names: | ||
%% Cell type:code id: tags: | ||
``` bash | ||
cat data_01.tsv | ||
``` | ||
%%%% Output: stream | ||
eid 1-0.0 2-0.0 3-0.0 4-0.0 5-0.0 6-0.0 7-0.0 8-0.0 9-0.0 10-0.0 | ||
1 31 65 10 11 84 22 56 65 90 12 | ||
2 56 52 52 42 89 35 3 65 50 67 | ||
3 45 84 20 84 93 36 96 62 48 59 | ||
4 7 46 37 48 80 20 18 72 37 27 | ||
5 8 86 51 68 80 84 11 28 69 10 | ||
6 6 29 85 59 7 46 14 60 73 80 | ||
7 24 49 41 46 92 23 39 68 7 63 | ||
8 80 92 97 30 92 83 98 36 6 23 | ||
9 84 59 89 79 16 12 95 73 2 62 | ||
10 23 96 67 41 8 20 97 57 59 23 | ||
%% Cell type:markdown id: tags: | ||
The numbers in each column name typically represent: | ||
1. The variable ID | ||
2. The visit, for variables which were collected at multiple points in time. | ||
3. The "instance", for multi-valued variables. | ||
Note that one **variable** is typically associated with several **columns**, | ||
although we're keeping things simple for this first example - there is only | ||
one visit for each variable, and there are no multi-valued variables.
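To make this naming convention concrete, the following sketch splits a column name into its three parts using Bash parameter expansion. It assumes names of the form `<variable>-<visit>.<instance>`; the column name `20-2.1` is hypothetical (it does not appear in the example data), and this is not how FUNPACK itself parses column names.

``` bash
# Hypothetical column name: variable 20, visit 2, instance 1.
col="20-2.1"

variable="${col%%-*}"   # everything before the first "-"
rest="${col#*-}"        # everything after the first "-", i.e. "2.1"
visit="${rest%%.*}"     # everything before the first "."
instance="${rest#*.}"   # everything after the first "."

echo "variable=$variable visit=$visit instance=$instance"
# prints: variable=20 visit=2 instance=1
```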
> _Most but not all_ variables in the UK BioBank contain data collected at
> different visits, i.e. the occasions on which participants attended a UK
> BioBank assessment centre. However, there are some variables (e.g. [ICD10
> diagnosis codes](https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41202))
> for which this is not the case.
## Import examples | ||
### Selecting variables (columns) | ||
You can specify which variables you want to load in the following ways, using | ||
the `--variable` (`-v` for short), `--category` (`-c` for short) and | ||
`--column` (`-co` for short) command line options: | ||
* By variable ID | ||
* By variable ranges | ||
* By a text file which contains the IDs you want to keep. | ||
* By pre-defined variable categories | ||
* By column name | ||
#### Selecting individual variables | ||
Simply provide the IDs of the variables you want to extract: | ||
%% Cell type:code id: tags: | ||
``` bash | ||
funpack -v 1 -v 5 out.tsv data_01.tsv | ||
cat out.tsv | ||
``` | ||
%%%% Output: stream | ||
eid 1-0.0 5-0.0 | ||
1 31 84.0 | ||
2 56 89.0 | ||
3 45 93.0 | ||
4 7 80.0 | ||
5 8 80.0 | ||
6 6 7.0 | ||
7 24 92.0 | ||
8 80 92.0 | ||
9 84 16.0 | ||
10 23 8.0 | ||
%% Cell type:markdown id: tags: | ||
#### Selecting variable ranges | ||
The `--variable`/`-v` option accepts MATLAB-style ranges of the form | ||
`start:step:stop` (where the `stop` is inclusive): | ||
%% Cell type:code id: tags: | ||
``` bash | ||
funpack -v 1:3:10 out.tsv data_01.tsv | ||
cat out.tsv | ||
``` | ||
%%%% Output: stream | ||
eid 1-0.0 4-0.0 7-0.0 10-0.0 | ||
1 31 11.0 56 12 | ||
2 56 42.0 3 67 | ||
3 45 84.0 96 59 | ||
4 7 48.0 18 27 | ||
5 8 68.0 11 10 | ||
6 6 59.0 14 80 | ||
7 24 46.0 39 63 | ||
8 80 30.0 98 23 | ||
9 84 79.0 95 62 | ||
10 23 41.0 97 23 | ||
%% Cell type:markdown id: tags: | ||
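If you are unsure which IDs a range will select, you can preview it with standard tools. The sketch below expands `1:3:10` with `seq`, which takes the same start, step and (inclusive) stop values. It only mimics the range syntax described above, and does not call FUNPACK.

``` bash
range="1:3:10"

# Split "start:step:stop" on the colons.
IFS=: read -r start step stop <<< "$range"

# seq takes FIRST INCREMENT LAST, where LAST is inclusive.
ids=$(seq "$start" "$step" "$stop" | paste -s -d ' ' -)
echo "$ids"   # prints: 1 4 7 10
```

These are the same variable IDs that appear in the column names of the output above.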
#### Selecting variables with a file | ||
If your variables of interest are listed in a plain-text file, you can simply | ||
pass that file: | ||
%% Cell type:code id: tags: | ||
``` bash | ||
echo -e "1\n6\n9" > vars.txt | ||
funpack -v vars.txt out.tsv data_01.tsv | ||
cat out.tsv | ||
``` | ||
%%%% Output: stream | ||
eid 1-0.0 6-0.0 9-0.0 | ||
1 31 22.0 90 | ||
2 56 35.0 50 | ||
3 45 36.0 48 | ||
4 7 20.0 37 | ||
5 8 84.0 69 | ||
6 6 46.0 73 | ||
7 24 23.0 7 | ||
8 80 83.0 6 | ||
9 84 12.0 2 | ||
10 23 20.0 59 | ||
%% Cell type:markdown id: tags: | ||
#### Selecting variables from pre-defined categories | ||
Some UK BioBank-specific categories are [built into | ||
`funpack`](#Built-in-rules), but you can also define your own categories - you | ||
just need to create a `.tsv` file, and pass it to `funpack` via the
`--category_file` (`-cf` for short) option:
%% Cell type:code id: tags: | ||
``` bash | ||
echo -e "ID\tCategory\tVariables" > custom_categories.tsv | ||
echo -e "1\tCool variables\t1:5,7" >> custom_categories.tsv | ||
echo -e "2\tUncool variables\t6,8:10" >> custom_categories.tsv | ||
cat custom_categories.tsv | ||
``` | ||
%%%% Output: stream | ||
ID Category Variables | ||
1 Cool variables 1:5,7 | ||
2 Uncool variables 6,8:10 | ||
%% Cell type:markdown id: tags: | ||
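Because the category file is tab-separated, a stray space in place of a tab is an easy mistake to make. As a sanity check, the sketch below recreates the file with `printf` (which, unlike `echo -e`, behaves the same across shells) and uses `awk` to confirm that every row has exactly three fields. The three-column layout is taken from the example above; the check itself is not part of FUNPACK.

``` bash
# Recreate the category file; printf repeats the format for each
# group of three arguments.
printf '%s\t%s\t%s\n' \
    "ID" "Category"         "Variables" \
    "1"  "Cool variables"   "1:5,7"     \
    "2"  "Uncool variables" "6,8:10"    > custom_categories.tsv

# Count the rows which do not have exactly three tab-separated fields.
bad_rows=$(awk -F '\t' 'NF != 3' custom_categories.tsv | wc -l)
echo "malformed rows: $((bad_rows))"   # prints: malformed rows: 0
```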
Use the `--category` (`-c` for short) option to select categories to output.
You can refer to categories by their ID:
%% Cell type:code id: tags: | ||
``` bash | ||
funpack -cf custom_categories.tsv -c 1 out.tsv data_01.tsv | ||
cat out.tsv | ||
``` | ||
%%%% Output: stream | ||
eid 1-0.0 2-0.0 3-0.0 4-0.0 5-0.0 7-0.0 | ||
1 31 65 10.0 11.0 84.0 56 | ||
2 56 52 52.0 42.0 89.0 3 | ||
3 45 84 20.0 84.0 93.0 96 | ||
4 7 46 37.0 48.0 80.0 18 | ||
5 8 86 51.0 68.0 80.0 11 | ||
6 6 29 85.0 59.0 7.0 14 | ||
7 24 49 41.0 46.0 92.0 39 | ||
8 80 92 97.0 30.0 92.0 98 | ||
9 84 59 89.0 79.0 16.0 95 | ||
10 23 96 67.0 41.0 8.0 97 | ||
%% Cell type:markdown id: tags: | ||
Or by name: | ||
%% Cell type:code id: tags: | ||
``` bash | ||
funpack -cf custom_categories.tsv -c uncool out.tsv data_01.tsv | ||
cat out.tsv | ||
``` | ||
%%%% Output: stream | ||
eid 6-0.0 8-0.0 9-0.0 10-0.0 | ||
1 22.0 65 90 12 | ||
2 35.0 65 50 67 | ||
3 36.0 62 48 59 | ||
4 20.0 72 37 27 | ||
5 84.0 28 69 10 | ||
6 46.0 60 73 80 | ||
7 23.0 68 7 63 | ||
8 83.0 36 6 23 | ||
9 12.0 73 2 62 | ||
10 20.0 57 59 23 | ||
%% Cell type:markdown id: tags: | ||
#### Selecting column names | ||
If you are working with data that has non-UK BioBank style column names, you
can use the `--column` (`-co` for short) option to select individual columns
by their name, rather than by the variable with which they are associated. The
`--column` option accepts full column names, and also shell-style wildcard
patterns:
%% Cell type:code id: tags: | ||
``` bash | ||
funpack -co 4-0.0 -co "??-0.0" out.tsv data_01.tsv | ||
cat out.tsv | ||
``` | ||
%%%% Output: stream | ||
eid 4-0.0 10-0.0 | ||
1 11.0 12 | ||
2 42.0 67 | ||
3 84.0 59 | ||
4 48.0 27 | ||
5 68.0 10 | ||
6 59.0 80 | ||
7 46.0 63 | ||
8 30.0 23 | ||
9 79.0 62 | ||
10 41.0 23 | ||
%% Cell type:markdown id: tags: | ||
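To see why the pattern `??-0.0` selects only `10-0.0`, recall that in shell-style patterns each `?` matches exactly one character. The sketch below applies the pattern to the column names of `data_01.tsv` using Bash's own pattern matching. It assumes that FUNPACK's wildcards behave like ordinary shell patterns, as the description above suggests.

``` bash
pattern="??-0.0"
matches=""

for col in 1-0.0 2-0.0 3-0.0 4-0.0 5-0.0 6-0.0 7-0.0 8-0.0 9-0.0 10-0.0; do
    # The right-hand side is deliberately unquoted, so that it is
    # treated as a pattern rather than as a literal string.
    if [[ "$col" == $pattern ]]; then
        matches="$matches $col"
    fi
done

echo "matched:$matches"   # prints: matched: 10-0.0
```

The single-digit names such as `4-0.0` have only one character before the `-`, which is why `4-0.0` had to be requested separately in the example above.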
### Selecting subjects (rows) | ||
`funpack` assumes that the first column in every input file is a subject | ||
ID. You can specify which subjects you want to load via the `--subject` (`-s` | ||
for short) option. You can specify subjects in the same way that you specified | ||
variables above, and also: | ||
* By specifying a conditional expression on variable values - only subjects | ||
for which the expression evaluates to true will be imported | ||
* By specifying subjects to exclude | ||
#### Selecting individual subjects | ||
%% Cell type:code id: tags: | ||
``` bash | ||
funpack -s 1 -s 3 -s 5 out.tsv data_01.tsv | ||
cat out.tsv | ||
``` | ||
%%%% Output: stream | ||
eid 1-0.0 2-0.0 3-0.0 4-0.0 5-0.0 6-0.0 7-0.0 8-0.0 9-0.0 10-0.0 | ||
1 31 65 10.0 11.0 84.0 22.0 56 65 90 12 | ||
3 45 84 20.0 84.0 93.0 36.0 96 62 48 59 | ||
5 8 86 51.0 68.0 80.0 84.0 11 28 69 10 | ||
%% Cell type:markdown id: tags: | ||
#### Selecting subject ranges | ||
%% Cell type:code id: tags: | ||
``` bash | ||
funpack -s 2:2:10 out.tsv data_01.tsv | ||
cat out.tsv | ||
``` | ||
%%%% Output: stream | ||
eid 1-0.0 2-0.0 3-0.0 4-0.0 5-0.0 6-0.0 7-0.0 8-0.0 9-0.0 10-0.0 | ||
2 56 52 52.0 42.0 89.0 35.0 3 65 50 67 | ||
4 7 46 37.0 48.0 80.0 20.0 18 72 37 27 | ||
6 6 29 85.0 59.0 7.0 46.0 14 60 73 80 | ||
8 80 92 97.0 30.0 92.0 83.0 98 36 6 23 | ||
10 23 96 67.0 41.0 8.0 20.0 97 57 59 23 | ||
%% Cell type:markdown id: tags: | ||
#### Selecting subjects from a file | ||
%% Cell type:code id: tags: | ||
``` bash | ||
echo -e "5\n6\n7\n8\n9\n10" > subjects.txt | ||
funpack -s subjects.txt out.tsv data_01.tsv | ||
cat out.tsv | ||
``` | ||
%%%% Output: stream | ||
eid 1-0.0 2-0.0 3-0.0 4-0.0 5-0.0 6-0.0 7-0.0 8-0.0 9-0.0 10-0.0 | ||
5 8 86 51.0 68.0 80.0 84.0 11 28 69 10 | ||
6 6 29 85.0 59.0 7.0 46.0 14 60 73 80 | ||
7 24 49 41.0 46.0 92.0 23.0 39 68 7 63 | ||
8 80 92 97.0 30.0 92.0 83.0 98 36 6 23 | ||
9 84 59 89.0 79.0 16.0 12.0 95 73 2 62 | ||
10 23 96 67.0 41.0 8.0 20.0 97 57 59 23 | ||
%% Cell type:markdown id: tags: | ||
#### Selecting subjects by variable value | ||
The `--subject` option accepts *variable expressions* - you can write an | ||
expression performing numerical comparisons on variables (denoted with a | ||
leading `v`) and combine these expressions using boolean algebra. Only | ||
subjects for which the expression evaluates to true will be imported. For | ||
example, to only import subjects where variable 1 is greater than 10, and | ||
variable 2 is less than 70, you can type: | ||
%% Cell type:code id: tags: | ||
``` bash | ||
funpack -s "v1 > 10 && v2 < 70" out.tsv data_01.tsv | ||
cat out.tsv | ||
``` | ||
%%%% Output: stream | ||
eid 1-0.0 2-0.0 3-0.0 4-0.0 5-0.0 6-0.0 7-0.0 8-0.0 9-0.0 10-0.0 | ||
1 31 65 10.0 11.0 84.0 22.0 56 65 90 12 | ||
2 56 52 52.0 42.0 89.0 35.0 3 65 50 67 | ||
7 24 49 41.0 46.0 92.0 23.0 39 68 7 63 | ||
9 84 59 89.0 79.0 16.0 12.0 95 73 2 62 | ||
%% Cell type:markdown id: tags: | ||
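You can cross-check a simple expression like this with `awk`. The sketch below is not FUNPACK syntax: it recreates the first three columns of `data_01.tsv` in a hypothetical file called `subset.tsv`, and applies the equivalent condition to columns 2 and 3 (variables 1 and 2).

``` bash
# Subject ID, variable 1 and variable 2, copied from data_01.tsv.
printf '%s\t%s\t%s\n' \
    eid 1-0.0 2-0.0 \
    1  31 65 \
    2  56 52 \
    3  45 84 \
    4  7  46 \
    5  8  86 \
    6  6  29 \
    7  24 49 \
    8  80 92 \
    9  84 59 \
    10 23 96 > subset.tsv

# Skip the header row, then keep subjects with v1 > 10 and v2 < 70.
selected=$(awk -F '\t' 'NR > 1 && $2 > 10 && $3 < 70 {print $1}' subset.tsv \
           | paste -s -d ' ' -)
echo "$selected"   # prints: 1 2 7 9
```

The selected subject IDs match the rows in the FUNPACK output above.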
The following symbols can be used in variable expressions: | ||
| Symbol          | Meaning                         |
|-----------------|---------------------------------|
| `==`            | equal to                        |
| `!=`            | not equal to                    |
| `>`             | greater than                    |
| `>=`            | greater than or equal to        |
| `<`             | less than                       |
| `<=`            | less than or equal to           |
| `na`            | N/A                             |
| `&&`            | logical and                     |
| <code>||</code> | logical or                      |
| `~`             | logical not                     |
| `contains`      | contains sub-string             |
| `all`           | all columns must meet condition |