Commit 0ff56f14 authored by Paul McCarthy

DOC: create alias, so we don't have to have -q -ow in every command

parent ae71d267
......@@ -205,9 +205,22 @@
"\n",
" > ```\n",
" > funpack -n -n -n -lf log.txt out.tsv in.tsv\n",
" > ```\n",
"\n",
"\n",
" > ```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"alias funpack=\"funpack -ow -q\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's the first example input data set, with UK BioBank-style column names:"
]
},
......@@ -274,7 +287,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -v 1 -v 5 out.tsv data_01.tsv\n",
"funpack -v 1 -v 5 out.tsv data_01.tsv\n",
"cat out.tsv"
]
},
......@@ -295,7 +308,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -v 1:3:10 out.tsv data_01.tsv\n",
"funpack -v 1:3:10 out.tsv data_01.tsv\n",
"cat out.tsv"
]
},
......@@ -317,7 +330,7 @@
"outputs": [],
"source": [
"echo -e \"1\\n6\\n9\" > vars.txt\n",
"funpack -q -ow -v vars.txt out.tsv data_01.tsv\n",
"funpack -v vars.txt out.tsv data_01.tsv\n",
"cat out.tsv"
]
},
......@@ -360,7 +373,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -cf custom_categories.tsv -c 1 out.tsv data_01.tsv\n",
"funpack -cf custom_categories.tsv -c 1 out.tsv data_01.tsv\n",
"cat out.tsv"
]
},
......@@ -377,7 +390,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -cf custom_categories.tsv -c uncool out.tsv data_01.tsv\n",
"funpack -cf custom_categories.tsv -c uncool out.tsv data_01.tsv\n",
"cat out.tsv"
]
},
......@@ -400,7 +413,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -co 4-0.0 -co \"??-0.0\" out.tsv data_01.tsv\n",
"funpack -co 4-0.0 -co \"??-0.0\" out.tsv data_01.tsv\n",
"cat out.tsv"
]
},
......@@ -430,7 +443,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -s 1 -s 3 -s 5 out.tsv data_01.tsv\n",
"funpack -s 1 -s 3 -s 5 out.tsv data_01.tsv\n",
"cat out.tsv"
]
},
......@@ -447,7 +460,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -s 2:2:10 out.tsv data_01.tsv\n",
"funpack -s 2:2:10 out.tsv data_01.tsv\n",
"cat out.tsv"
]
},
......@@ -465,7 +478,7 @@
"outputs": [],
"source": [
"echo -e \"5\\n6\\n7\\n8\\n9\\n10\" > subjects.txt\n",
"funpack -q -ow -s subjects.txt out.tsv data_01.tsv\n",
"funpack -s subjects.txt out.tsv data_01.tsv\n",
"cat out.tsv"
]
},
......@@ -490,7 +503,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -s \"v1 > 10 && v2 < 70\" out.tsv data_01.tsv\n",
"funpack -s \"v1 > 10 && v2 < 70\" out.tsv data_01.tsv\n",
"cat out.tsv"
]
},
......@@ -554,7 +567,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -s \"v33 >= 1965-01-01 && v1 == 'B'\" out.tsv data_02.tsv\n",
"funpack -s \"v33 >= 1965-01-01 && v1 == 'B'\" out.tsv data_02.tsv\n",
"cat out.tsv"
]
},
......@@ -578,10 +591,10 @@
"faster, and will require less memory.\n",
"\n",
"\n",
"The `--ids-only` option can be used to generate an output file which only\n",
"The `--ids_only` option can be used to generate an output file which only\n",
"contains the IDs of the rows which would have been output, so can be used to\n",
"generate a subject ID file. Let's re-run one of the examples above, but this\n",
"time with the `--ids-only` option:"
"time with the `--ids_only` option:"
]
},
{
......@@ -590,7 +603,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow --ids-only -v 1,2 -s \"v1 > 10 && v2 < 70\" subjects.txt data_01.tsv\n",
"funpack --ids_only -v 1,2 -s \"v1 > 10 && v2 < 70\" subjects.txt data_01.tsv\n",
"cat subjects.txt"
]
},
......@@ -617,7 +630,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -s 1:8 -ex 5:10 out.tsv data_01.tsv\n",
"funpack -s 1:8 -ex 5:10 out.tsv data_01.tsv\n",
"cat out.tsv"
]
},
......@@ -657,7 +670,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -vi last out.tsv data_03.tsv\n",
"funpack -vi last out.tsv data_03.tsv\n",
"cat out.tsv"
]
},
......@@ -674,7 +687,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -vi 1 out.tsv data_03.tsv\n",
"funpack -vi 1 out.tsv data_03.tsv\n",
"cat out.tsv"
]
},
......@@ -704,7 +717,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -v 2 -s 'v2 < 50' out.tsv data_03.tsv\n",
"funpack -v 2 -s 'v2 < 50' out.tsv data_03.tsv\n",
"cat out.tsv"
]
},
......@@ -724,7 +737,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -v 2 -s 'all(v2 < 50)' out.tsv data_03.tsv\n",
"funpack -v 2 -s 'all(v2 < 50)' out.tsv data_03.tsv\n",
"cat out.tsv"
]
},
......@@ -741,7 +754,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -v 2,3 -s 'any(v2 < 50) && all(v3 >= 40)' out.tsv data_03.tsv\n",
"funpack -v 2,3 -s 'any(v2 < 50) && all(v3 >= 40)' out.tsv data_03.tsv\n",
"cat out.tsv"
]
},
......@@ -786,7 +799,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow out.tsv data_04.tsv data_05.tsv\n",
"funpack out.tsv data_04.tsv data_05.tsv\n",
"cat out.tsv"
]
},
......@@ -804,7 +817,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -ms outer out.tsv data_04.tsv data_05.tsv\n",
"funpack -ms outer out.tsv data_04.tsv data_05.tsv\n",
"cat out.tsv"
]
},
......@@ -843,7 +856,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -ma rows out.tsv data_06.tsv data_07.tsv\n",
"funpack -ma rows out.tsv data_06.tsv data_07.tsv\n",
"cat out.tsv"
]
},
......@@ -861,7 +874,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -ma rows -ms outer out.tsv data_06.tsv data_07.tsv\n",
"funpack -ma rows -ms outer out.tsv data_06.tsv data_07.tsv\n",
"cat out.tsv"
]
},
......@@ -901,7 +914,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -ms naive out.tsv data_08.tsv data_09.tsv\n",
"funpack -ms naive out.tsv data_08.tsv data_09.tsv\n",
"cat out.tsv"
]
},
......@@ -935,7 +948,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -ms naive -ma rows out.tsv data_10.tsv data_11.tsv\n",
"funpack -ms naive -ma rows out.tsv data_10.tsv data_11.tsv\n",
"cat out.tsv"
]
},
......@@ -993,7 +1006,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -nv 1 \" -1\" -nv 2 \" -1,0\" -nv 3 \"1,2\" out.tsv data_12.tsv\n",
"funpack -nv 1 \" -1\" -nv 2 \" -1,0\" -nv 3 \"1,2\" out.tsv data_12.tsv\n",
"cat out.tsv"
]
},
......@@ -1040,7 +1053,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -cl 1 \"codeToNumeric('icd10')\" out.tsv data_13.tsv\n",
"funpack -cl 1 \"codeToNumeric('icd10')\" out.tsv data_13.tsv\n",
"cat out.tsv"
]
},
......@@ -1082,7 +1095,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -cl 1 \"flattenHierarchical(name='icd10')\" out.tsv data_13.tsv\n",
"funpack -cl 1 \"flattenHierarchical(name='icd10')\" out.tsv data_13.tsv\n",
"cat out.tsv"
]
},
......@@ -1105,7 +1118,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -cl 1 \"codeToNumeric('icd10')\" -imf icd10_codes.tsv out.tsv data_13.tsv\n",
"funpack -cl 1 \"codeToNumeric('icd10')\" -imf icd10_codes.tsv out.tsv data_13.tsv\n",
"cat icd10_codes.tsv"
]
},
......@@ -1150,7 +1163,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -re 1 \"300,444,555\" \"3,0.25,0.5\" out.tsv data_14.tsv\n",
"funpack -re 1 \"300,444,555\" \"3,0.25,0.5\" out.tsv data_14.tsv\n",
"cat out.tsv"
]
},
......@@ -1213,7 +1226,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -cv 2 \"v1 == 0\" \"0\" out.tsv data_15.tsv\n",
"funpack -cv 2 \"v1 == 0\" \"0\" out.tsv data_15.tsv\n",
"cat out.tsv"
]
},
......@@ -1287,7 +1300,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -apr all \"removeIfSparse(minpres=8)\" out.tsv data_16.tsv\n",
"funpack -apr all \"removeIfSparse(minpres=8)\" out.tsv data_16.tsv\n",
"cat out.tsv"
]
},
......@@ -1305,7 +1318,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -apr all \"removeIfSparse(minpres=0.65, abspres=False)\" out.tsv data_16.tsv\n",
"funpack -apr all \"removeIfSparse(minpres=0.65, abspres=False)\" out.tsv data_16.tsv\n",
"cat out.tsv"
]
},
......@@ -1346,7 +1359,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -apr all \"removeIfRedundant(0.9)\" out.tsv data_17.tsv\n",
"funpack -apr all \"removeIfRedundant(0.9)\" out.tsv data_17.tsv\n",
"cat out.tsv"
]
},
......@@ -1387,7 +1400,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -apr all \"removeIfRedundant(0.9, nathres=0.6)\" out.tsv data_18.tsv\n",
"funpack -apr all \"removeIfRedundant(0.9, nathres=0.6)\" out.tsv data_18.tsv\n",
"cat out.tsv"
]
},
......@@ -1426,7 +1439,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -apr 1 \"binariseCategorical\" out.tsv data_19.tsv\n",
"funpack -apr 1 \"binariseCategorical\" out.tsv data_19.tsv\n",
"cat out.tsv"
]
},
......@@ -1445,7 +1458,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -apr 1 \"binariseCategorical(replace=False, nameFormat='{vid}:{value}')\" out.tsv data_19.tsv\n",
"funpack -apr 1 \"binariseCategorical(replace=False, nameFormat='{vid}:{value}')\" out.tsv data_19.tsv\n",
"cat out.tsv"
]
},
......@@ -1507,7 +1520,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -p plugin_1.py -cl 1 drop_odd_values -cl 2 drop_odd_values out.tsv data_20.tsv\n",
"funpack -p plugin_1.py -cl 1 drop_odd_values -cl 2 drop_odd_values out.tsv data_20.tsv\n",
"cat out.tsv"
]
},
......@@ -1548,7 +1561,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -p plugin_2.py -apr \"1,2\" \"sum_squares\" out.tsv data_20.tsv\n",
"funpack -p plugin_2.py -apr \"1,2\" \"sum_squares\" out.tsv data_20.tsv\n",
"cat out.tsv"
]
},
......@@ -1609,7 +1622,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -p plugin_3.py -l data_21.tsv my_datefile_loader out.tsv data_21.tsv\n",
"funpack -p plugin_3.py -l data_21.tsv my_datefile_loader out.tsv data_21.tsv\n",
"cat out.tsv"
]
},
......@@ -1659,7 +1672,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -nb -q -ow -s \"v1 contains 'A'\" out.tsv data_22.tsv\n",
"funpack -nb -s \"v1 contains 'A'\" out.tsv data_22.tsv\n",
"cat out.tsv"
]
},
......@@ -1683,7 +1696,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -nb -q -ow -esn out.tsv data_22.tsv\n",
"funpack -nb -esn out.tsv data_22.tsv\n",
"cat out.tsv"
]
},
......@@ -1701,7 +1714,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -nb -q -ow -esn -wnn out.tsv data_22.tsv\n",
"funpack -nb -esn -wnn out.tsv data_22.tsv\n",
"cat out.tsv\n",
"cat out_non_numerics.tsv"
]
......@@ -1727,7 +1740,7 @@
"outputs": [],
"source": [
"funpack \\\n",
" -nb -q -ow -d \\\n",
" -nb -d \\\n",
" -nv 1 \"7,8,9\" \\\n",
" -re 2 \"1,2,3\" \"100,200,300\" \\\n",
" -cv 3 \"v4 != 20\" \"25\" \\\n",
......@@ -1760,7 +1773,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -q -ow -cfg fmrib -d out.tsv ukbcols.csv"
"funpack -cfg fmrib -d out.tsv ukbcols.csv"
]
},
{
......@@ -1932,7 +1945,7 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -nb -q -ow -vf custom_variables.tsv -cf custom_categories.tsv -wu out.tsv data_01.tsv\n",
"funpack -nb -vf custom_variables.tsv -cf custom_categories.tsv -wu out.tsv data_01.tsv\n",
"cat out_unknown_vars.txt"
]
},
......@@ -1958,8 +1971,8 @@
"metadata": {},
"outputs": [],
"source": [
"funpack -nb -q -ow -vf custom_variables.tsv -cf custom_categories.tsv -c unknown unknowns.tsv data_01.tsv\n",
"funpack -nb -q -ow -vf custom_variables.tsv -cf custom_categories.tsv -c uncategorised uncategorised.tsv data_01.tsv\n",
"funpack -nb -vf custom_variables.tsv -cf custom_categories.tsv -c unknown unknowns.tsv data_01.tsv\n",
"funpack -nb -vf custom_variables.tsv -cf custom_categories.tsv -c uncategorised uncategorised.tsv data_01.tsv\n",
"echo \"Unknown variables:\"\n",
"cat unknowns.tsv\n",
"echo \"Uncategorised variables:\"\n",
......
%% Cell type:markdown id: tags:
![win logo](win.png)
# `funpack` (https://git.fmrib.ox.ac.uk/fsl/funpack)
> Paul McCarthy &lt;paul.mccarthy@ndcn.ox.ac.uk&gt;
> ([WIN@FMRIB](https://www.win.ox.ac.uk/))
`funpack` is a command-line program which you can use to extract data from UK
BioBank (and other tabular) data.
You can give `funpack` one or more input files (e.g. `.csv`, `.tsv`), and it
will merge them together, perform some preprocessing, and produce a single
output file.
A large number of rules are built into `funpack` which are specific to the UK
BioBank data set. But you can control and customise everything that `funpack`
does to your data, including which rows and columns to extract, and which
cleaning/processing steps to perform on each column.
`funpack` comes installed with recent versions of
[FSL](https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/). You can also install `funpack`
via `conda`:
> ```
> conda install -c conda-forge fmrib-unpack
> ```
Or using `pip`:
> ```
> pip install fmrib-unpack
> ```
Get command-line help by typing:
> ```
> funpack -h
> ```
**Important** The examples in this notebook assume that you have installed `funpack`
2.5.0 or newer.
%% Cell type:code id: tags:
```
funpack -V
```
%% Cell type:markdown id: tags:
> _Note:_ If the above command produces a `NameError`, you may need to change
> the Jupyter Notebook kernel type to **Bash** - you can do so via the
> **Kernel -> Change Kernel** menu option.
### Contents
1. [Overview](#Overview)
1. [Import](#1.-Import)
2. [Cleaning](#2.-Cleaning)
3. [Processing](#3.-Processing)
4. [Export](#4.-Export)
2. [Examples](#Examples)
3. [Import examples](#Import-examples)
1. [Selecting variables (columns)](#Selecting-variables-(columns))
1. [Selecting individual variables](#Selecting-individual-variables)
2. [Selecting variable ranges](#Selecting-variable-ranges)
3. [Selecting variables with a file](#Selecting-variables-with-a-file)
4. [Selecting variables from pre-defined categories](#Selecting-variables-from-pre-defined-categories)
2. [Selecting subjects (rows)](#Selecting-subjects-(rows))
1. [Selecting individual subjects](#Selecting-individual-subjects)
2. [Selecting subject ranges](#Selecting-subject-ranges)
3. [Selecting subjects from a file](#Selecting-subjects-from-a-file)
4. [Selecting subjects by variable value](#Selecting-subjects-by-variable-value)
5. [Excluding subjects](#Excluding-subjects)
3. [Selecting visits](#Selecting-visits)
1. [Evaluating expressions across visits](#Evaluating-expressions-across-visits)
4. [Merging multiple input files](#Merging-multiple-input-files)
1. [Merging by subject](#Merging-by-subject)
2. [Merging by column](#Merging-by-column)
3. [Naive merging](#Naive-merging)
4. [Cleaning examples](#Cleaning-examples)
1. [NA insertion](#NA-insertion)
2. [Variable-specific cleaning functions](#Variable-specific-cleaning-functions)
3. [Categorical recoding](#Categorical-recoding)
4. [Child value replacement](#Child-value-replacement)
5. [Processing examples](#Processing-examples)
1. [Sparsity check](#Sparsity-check)
2. [Redundancy check](#Redundancy-check)
3. [Categorical binarisation](#Categorical-binarisation)
6. [Custom cleaning, processing and loading - funpack plugins](#Custom-cleaning,-processing-and-loading---funpack-plugins)
1. [Custom cleaning functions](#Custom-cleaning-functions)
2. [Custom processing functions](#Custom-processing-functions)
3. [Custom file loaders](#Custom-file-loaders)
7. [Miscellaneous topics](#Miscellaneous-topics)
1. [Non-numeric data](#Non-numeric-data)
2. [Dry run](#Dry-run)
3. [Built-in rules](#Built-in-rules)
4. [Using a configuration file](#Using-a-configuration-file)
5. [Working with unknown/uncategorised variables](#Working-with-unknown/uncategorised-variables)
# Overview
`funpack` performs the following steps:
## 1. Import
All data files are loaded in, unwanted columns and subjects are dropped, and
the data files are merged into a single table (a.k.a. data frame). Multiple
files can be merged according to an index column (e.g. subject ID). Or, if the
input files contain the same columns/subjects, they can be naively
concatenated along rows or columns.
> _Note:_ FUNPACK refers to UK Biobank **Data fields** as **variables**. The
> two terms can be considered equivalent.
## 2. Cleaning
The following cleaning steps are applied to each column:
1. **NA value replacement:** Specific values for some columns are replaced
with NA, for example, variables where a value of `-1` indicates *Do not
know*.
2. **Variable-specific cleaning functions:** Certain columns are
re-formatted; for example, the [ICD10](https://en.wikipedia.org/wiki/ICD-10)
disease codes can be converted to integer representations.
3. **Categorical recoding:** Certain categorical columns are re-coded.
4. **Child value replacement:** NA values within some columns which are
dependent upon other columns may have values inserted based on the values
of their parent columns.
## 3. Processing
During the processing stage, columns may be removed, merged, or expanded into
additional columns. For example, a categorical column may be expanded into a set
of binary columns, one for each category.
A column may also be removed on the basis of being too sparse, or being
redundant with respect to another column.
## 4. Export
The processed data can be saved as a `.csv`, `.tsv`, or `.hdf5` file.
# Examples
Throughout these examples, we are going to use a few command line
options, which you will probably **not** normally want to use:
- `-ow` (short for `--overwrite`): This tells `funpack` not to complain if
the output file already exists.
- `-q` (short for `--quiet`): This tells `funpack` to be quiet. Without the
`-q` option, `funpack` can be quite verbose, which can be annoying, but is
very useful when things go wrong. A good strategy is to tell `funpack` to
produce verbose output using the `--noisy` (`-n` for short) option, and to
send all of its output to a log file with the `--log_file` (or `-lf`)
option. For example:
> ```
> funpack -n -n -n -lf log.txt out.tsv in.tsv
> ```
%% Cell type:code id: tags:
```
alias funpack="funpack -ow -q"
```
%% Cell type:markdown id: tags:
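With this alias in place, every `funpack` call in the rest of this notebook behaves as if `-ow -q` had been typed first. Here is a minimal sketch of the same idea written as a shell function. The names `fake_funpack` and `fp` are hypothetical stand-ins, used so that the sketch runs without `funpack` installed:

```shell
# Hypothetical stand-in for the real program: it just prints its arguments
fake_funpack() { echo "args: $*"; }

# Wrapper which always passes -ow -q before any user-supplied options
fp() { fake_funpack -ow -q "$@"; }

fp -v 1 out.tsv data_01.tsv
```

In Bash, you can bypass an alias for a single call with `command funpack ...`, or remove it again with `unalias funpack`.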
Here's the first example input data set, with UK BioBank-style column names:
%% Cell type:code id: tags:
```
cat data_01.tsv
```
%% Cell type:markdown id: tags:
The numbers in each column name typically represent:
1. The variable ID
2. The visit, for variables which were collected at multiple points in time.
3. The "instance", for multi-valued variables.
Note that one **variable** is typically associated with several **columns**,
although we're keeping things simple for this first example - there is only
one visit for each variable, and there are no multi-valued variables.
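As a rough sketch, the three numbers can be pulled out of a column name with shell parameter expansion. This assumes the `variable-visit.instance` layout seen in names such as `4-0.0`; the column name `20-1.3` is made up for illustration:

```shell
col="20-1.3"              # hypothetical column name
variable="${col%%-*}"     # text before the first "-"
rest="${col#*-}"          # text after the first "-"
visit="${rest%%.*}"       # text before the first "."
instance="${rest#*.}"     # text after the first "."
echo "variable=$variable visit=$visit instance=$instance"
```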
> _Most but not all_ variables in the UK BioBank contain data collected at
> different visits, the times that the participants visited a UK BioBank
> assessment centre. However there are some variables (e.g. [ICD10 diagnosis
> codes](https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41202)) for which
> this is not the case.
# Import examples
## Selecting variables (columns)
You can specify which variables you want to load in the following ways, using
the `--variable` (`-v` for short), `--category` (`-c` for short) and
`--column` (`-co` for short) command line options:
* By variable ID
* By variable ranges
* By a text file which contains the IDs you want to keep.
* By pre-defined variable categories
* By column name
### Selecting individual variables
Simply provide the IDs of the variables you want to extract:
%% Cell type:code id: tags:
```
funpack -q -ow -v 1 -v 5 out.tsv data_01.tsv
funpack -v 1 -v 5 out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
### Selecting variable ranges
The `--variable`/`-v` option accepts MATLAB-style ranges of the form
`start:step:stop` (where the `stop` is inclusive):
%% Cell type:code id: tags:
```
funpack -q -ow -v 1:3:10 out.tsv data_01.tsv
funpack -v 1:3:10 out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
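If you are unsure which IDs a `start:step:stop` range covers, the standard `seq` command takes the same three numbers (in the order `start step stop`) and also includes the stop value when it is reached. This sketch only illustrates the range arithmetic; it does not call `funpack`:

```shell
# The IDs covered by the range 1:3:10, joined with commas
ids=$(seq 1 3 10 | paste -sd, -)
echo "$ids"
```

This prints `1,4,7,10`, which are the variables selected by `-v 1:3:10` above.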
### Selecting variables with a file
If your variables of interest are listed in a plain-text file, you can simply
pass that file:
%% Cell type:code id: tags:
```
echo -e "1\n6\n9" > vars.txt
funpack -q -ow -v vars.txt out.tsv data_01.tsv
funpack -v vars.txt out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
### Selecting variables from pre-defined categories
Some UK BioBank-specific categories are [built into
`funpack`](#Built-in-rules), but you can also define your own categories - you
just need to create a `.tsv` file, and pass it to `funpack` via the
`--category_file` (`-cf` for short) option:
%% Cell type:code id: tags:
```
echo -e "ID\tCategory\tVariables" > custom_categories.tsv
echo -e "1\tCool variables\t1:5,7" >> custom_categories.tsv
echo -e "2\tUncool variables\t6,8:10" >> custom_categories.tsv
cat custom_categories.tsv
```
%% Cell type:markdown id: tags:
Use the `--category` (`-c` for short) option to select categories to output. You can
refer to categories by their ID:
%% Cell type:code id: tags:
```
funpack -q -ow -cf custom_categories.tsv -c 1 out.tsv data_01.tsv
funpack -cf custom_categories.tsv -c 1 out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
Or by name:
%% Cell type:code id: tags:
```
funpack -q -ow -cf custom_categories.tsv -c uncool out.tsv data_01.tsv
funpack -cf custom_categories.tsv -c uncool out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
### Selecting column names
If you are working with data that has non-UK BioBank style column names, you
can use the `--column` (`-co` for short) option to select individual columns by their
name, rather than the variable with which they are associated. The `--column`
option accepts full column names, and also shell-style wildcard patterns:
%% Cell type:code id: tags:
```
funpack -q -ow -co 4-0.0 -co "??-0.0" out.tsv data_01.tsv
funpack -co 4-0.0 -co "??-0.0" out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
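In a shell-style pattern, each `?` stands for exactly one character, so `??-0.0` only matches column names with a two-character variable ID. The sketch below uses the shell's own `case` matching to show this, on the assumption that `funpack` follows the same wildcard rules; the helper `matches` and the list of column names are hypothetical:

```shell
matches() {
    # $1: column name, $2: shell-style wildcard pattern
    case "$1" in
        $2) return 0 ;;
        *)  return 1 ;;
    esac
}

for col in 1-0.0 4-0.0 10-0.0 10-1.0; do
    if matches "$col" "??-0.0"; then
        echo "$col"
    fi
done
```

Only `10-0.0` is printed. Remember to quote patterns on the command line, so that the shell does not try to expand them against file names before `funpack` sees them.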
## Selecting subjects (rows)
`funpack` assumes that the first column in every input file is a subject
ID. You can specify which subjects you want to load via the `--subject` (`-s`
for short) option. You can specify subjects in the same way that you specified
variables above, and also:
* By specifying a conditional expression on variable values - only subjects
for which the expression evaluates to true will be imported
* By specifying subjects to exclude
### Selecting individual subjects
%% Cell type:code id: tags:
```
funpack -q -ow -s 1 -s 3 -s 5 out.tsv data_01.tsv
funpack -s 1 -s 3 -s 5 out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
### Selecting subject ranges
%% Cell type:code id: tags:
```
funpack -q -ow -s 2:2:10 out.tsv data_01.tsv
funpack -s 2:2:10 out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
### Selecting subjects from a file
%% Cell type:code id: tags:
```
echo -e "5\n6\n7\n8\n9\n10" > subjects.txt
funpack -q -ow -s subjects.txt out.tsv data_01.tsv
funpack -s subjects.txt out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
### Selecting subjects by variable value
The `--subject` option accepts *variable expressions* - you can write an
expression performing numerical comparisons on variables (denoted with a
leading `v`) and combine these expressions using boolean algebra. Only
subjects for which the expression evaluates to true will be imported. For
example, to only import subjects where variable 1 is greater than 10, and
variable 2 is less than 70, you can type:
%% Cell type:code id: tags:
```
funpack -q -ow -s "v1 > 10 && v2 < 70" out.tsv data_01.tsv
funpack -s "v1 > 10 && v2 < 70" out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
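To make the logic of such an expression concrete, here is the same condition applied by hand with `awk` to a small, made-up table. This is only an illustration of the row filter, not of how `funpack` evaluates expressions, and the file name `demo_expr.tsv` is hypothetical:

```shell
# Columns: subject ID, variable 1, variable 2
printf 'eid\t1-0.0\t2-0.0\n1\t5\t60\n2\t20\t65\n3\t30\t80\n' > demo_expr.tsv

# Keep the IDs of rows where variable 1 > 10 and variable 2 < 70
selected=$(awk -F'\t' 'NR > 1 && $2 > 10 && $3 < 70 { print $1 }' demo_expr.tsv)
echo "$selected"
```

Only subject 2 passes both comparisons.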
The following symbols can be used in variable expressions:
| Symbol | Meaning |
|---------------------------|---------------------------------|
| `==` | equal to |
| `!=` | not equal to |
| `>` | greater than |
| `>=` | greater than or equal to |
| `<` | less than |
| `<=` | less than or equal to |
| `na` | N/A |
| `&&` | logical and |
| <code>&#x7c;&#x7c;</code> | logical or |
| `~` | logical not |
| `contains` | Contains sub-string |
| `all` | all columns must meet condition |
| `any` | any column must meet condition |
| `()` | to denote precedence |
> The `all` and `any` symbols allow you to control how an expression is
> evaluated across multiple columns which are associated with one variable
> (e.g. separate columns for each visit). We will give an example of this in
> the section on [selecting visits](#Selecting-visits), below.
Non-numeric (i.e. string) variables can be used in these expressions in
conjunction with the `==`, `!=`, and `contains` operators, and date/time
variables can be compared using the `==`, `!=`, `>`, `>=`, `<`, and `<=`
operators. For example, imagine that we have the following data set:
%% Cell type:code id: tags:
```
cat data_02.tsv
```
%% Cell type:markdown id: tags:
And we want to identify subjects who were born during or after 1965 (variable
33), and who have a value of `B` for variable 1:
%% Cell type:code id: tags:
```
funpack -q -ow -s "v33 >= 1965-01-01 && v1 == 'B'" out.tsv data_02.tsv
funpack -s "v33 >= 1965-01-01 && v1 == 'B'" out.tsv data_02.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
> When comparing dates and times, you must use the formats `YYYY-MM-DD` and
> `YYYY-MM-DD HH:MM:SS` respectively. When comparing strings, you must surround values
> with single or double quotes.
Evaluating a variable expression requires the data for every subject to be
loaded into memory, so that the conditional expression can be evaluated.
A useful strategy, if you intend to work with the same subset of subjects more
than once, is to use FUNPACK once, to identify the subjects of interest, save
their IDs to a text file, and on subsequent calls to FUNPACK, use the text
file to select subjects. This means that subsequent FUNPACK runs will be
faster, and will require less memory.
The `--ids-only` option can be used to generate an output file which only
The `--ids_only` option can be used to generate an output file which only
contains the IDs of the rows which would have been output, so can be used to
generate a subject ID file. Let's re-run one of the examples above, but this
time with the `--ids-only` option:
time with the `--ids_only` option:
%% Cell type:code id: tags:
```
funpack -q -ow --ids-only -v 1,2 -s "v1 > 10 && v2 < 70" subjects.txt data_01.tsv
funpack --ids_only -v 1,2 -s "v1 > 10 && v2 < 70" subjects.txt data_01.tsv
cat subjects.txt
```
%% Cell type:markdown id: tags:
Now we can use `subjects.txt` with the `-s` option in subsequent FUNPACK
calls, to select the same subjects without having to re-evaluate the variable
expression.
### Excluding subjects
The `--exclude` (`-ex` for short) option allows you to exclude subjects - it
accepts individual IDs, an ID range, or a file containing IDs. The
`--exclude`/`-ex` option takes precedence over the `--subject`/`-s` option:
%% Cell type:code id: tags:
```
funpack -q -ow -s 1:8 -ex 5:10 out.tsv data_01.tsv
funpack -s 1:8 -ex 5:10 out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
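Because exclusion takes precedence, combining `-s 1:8` with `-ex 5:10` amounts to a set difference. A minimal sketch of that arithmetic with standard tools, using hypothetical file names and without calling `funpack`:

```shell
seq 1 8  > demo_include.txt   # IDs requested with -s 1:8
seq 5 10 > demo_exclude.txt   # IDs removed with -ex 5:10

# Keep the requested IDs which do not exactly match any excluded ID
kept=$(grep -vxFf demo_exclude.txt demo_include.txt | paste -sd' ' -)
echo "$kept"
```

This leaves subjects `1 2 3 4`.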
## Selecting visits
Many variables in the UK BioBank data contain observations at multiple points in
time, or visits. `funpack` allows you to specify which visits you are interested
in. Here is an example data set with variables that have data for multiple
visits (remember that the second number in the column names denotes the visit):
%% Cell type:code id: tags:
```
cat data_03.tsv
```
%% Cell type:markdown id: tags:
We can use the `--visit` (`-vi` for short) option to get just the last visit for
each variable:
%% Cell type:code id: tags:
```
funpack -q -ow -vi last out.tsv data_03.tsv
funpack -vi last out.tsv data_03.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
You can also specify which visit you want by its number:
%% Cell type:code id: tags:
```
funpack -q -ow -vi 1 out.tsv data_03.tsv
funpack -vi 1 out.tsv data_03.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
> Variables which are not associated with specific visits (e.g. [ICD10
> diagnosis codes](https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41202))
> will not be affected by the `-vi` option.
### Evaluating expressions across visits
The variable expressions described above in the section on [selecting
subjects](#Selecting-subjects-by-variable-value) will be applied to all of
the columns associated with a variable. By default, an expression will
evaluate to true where the values in _any_ column associated with the
variable evaluate to true. For example, we can extract the data for subjects
where the values of any column of variable 2 were less than 50:
%% Cell type:code id: tags:
```
funpack -q -ow -v 2 -s 'v2 < 50' out.tsv data_03.tsv
funpack -v 2 -s 'v2 < 50' out.tsv data_03.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
We can use the `any` and `all` operators to control how an expression is
evaluated across the columns of a variable. For example, we may only be
interested in subjects for whom all columns of variable 2 were less than
50:
%% Cell type:code id: tags:
```
funpack -q -ow -v 2 -s 'all(v2 < 50)' out.tsv data_03.tsv
funpack -v 2 -s 'all(v2 < 50)' out.tsv data_03.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
We can use `any` and `all` in expressions involving multiple variables:
%% Cell type:code id: tags:
```
funpack -q -ow -v 2,3 -s 'any(v2 < 50) && all(v3 >= 40)' out.tsv data_03.tsv
funpack -v 2,3 -s 'any(v2 < 50) && all(v3 >= 40)' out.tsv data_03.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
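The difference between `any` and `all` can be sketched with `awk` on a small, made-up table which has two visit columns for variable 2. This only illustrates the logic described above; it is not how `funpack` evaluates expressions, and the file name `demo_visits.tsv` is hypothetical:

```shell
# Columns: subject ID, variable 2 at visit 0, variable 2 at visit 1
printf 'eid\t2-0.0\t2-1.0\n1\t40\t45\n2\t40\t60\n3\t70\t80\n' > demo_visits.tsv

# any(v2 < 50): at least one visit column is below 50
any_ids=$(awk -F'\t' 'NR > 1 && ($2 < 50 || $3 < 50) { print $1 }' demo_visits.tsv | paste -sd' ' -)

# all(v2 < 50): every visit column is below 50
all_ids=$(awk -F'\t' 'NR > 1 && ($2 < 50 && $3 < 50) { print $1 }' demo_visits.tsv | paste -sd' ' -)

echo "any: $any_ids"
echo "all: $all_ids"
```

Subjects 1 and 2 satisfy the `any` form, but only subject 1 satisfies the `all` form.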
## Merging multiple input files
If your data is split across multiple files, you can specify how `funpack`
should merge them together.
### Merging by subject
For example, let's say we have these two input files (shown side-by-side):
%% Cell type:code id: tags:
```
echo " " | paste data_04.tsv - data_05.tsv
```
%% Cell type:markdown id: tags:
Note that each file contains different variables, and different, but
overlapping, subjects. By default, when you pass these files to `funpack`, it
will output the intersection of the two files (more formally known as an
*inner join*), i.e. subjects which are present in both files: