Commit 333591f5 authored by Paul McCarthy

Merge branch 'rf/date-norm-in-cleaning' into 'master'

Rf/date norm in cleaning

See merge request !61
parents 8caed2de 37d137d3
@@ -2,8 +2,8 @@ FUNPACK changelog
 =================
-2.3.0 (Under development)
--------------------------
+2.3.0 (Tuesday 12th May 2020)
+-----------------------------
 Changed
@@ -15,7 +15,11 @@ Changed
   variables. This should give superior performance.
 * Revisited the :meth:`.DataTable.merge` to optimise performance in all
   scenarios.
-* Improved performance of the :mod:`.fmrib` date/time normalisation routines.
+* Improved performance of the :mod:`.fmrib` date/time normalisation routines,
+  and changed their usage so they are now applied as "cleaning" functions
+  after data import, rather than just before export. This means that date/
+  time columns can be subjected to the redundancy check (as they will have
+  a numeric type), and will improve data export performance.
 2.2.1 (Monday 4th May 2020)
@@ -6,7 +6,7 @@
 #
-__version__ = '2.3.0.dev0'
+__version__ = '2.3.0'
 """The ``funpack`` versioning scheme roughly follows Semantic Versioning
 conventions.
 """
@@ -8,9 +8,20 @@
 # Use local settings
 config_file local

+# Contains some FMRIB-specific plugin functions,
+# including date/time normalisation.
+plugin_file fmrib
+
+# Drop non-numeric columns - the main output
+# file only contains numeric data.
+suppress_non_numerics
+
+# Only import variables from FMRIB-curated categories,
+# largely drawn from showcase categories
+category_file fmrib/categories.tsv
+
 #
-# FUNPACK processing stages
+# FUNPACK cleaning/processing stages
 #
 # - NA insertion
 # - Categorical recoding
@@ -38,33 +49,22 @@ config_file local
 # - NA insertion
 datacoding_file fmrib/datacodings_navalues.tsv

 # - Categorical recoding
 datacoding_file fmrib/datacodings_recoding.tsv

 # - Cleaning
 variable_file fmrib/variables_clean.tsv

-# - Child value replacement
-variable_file fmrib/variables_parentvalues.tsv
-
-# - Processing
-processing_file fmrib/processing.tsv
-
-# FMRIB-curated categories, largely drawn from showcase categories
-category_file fmrib/categories.tsv
-
-#
-# FMRIB processing of dates
-#
+# Date/timestamp normalisation (performed in the FUNPACK cleaning stage)
+#
 # Converts a date or date+time into a single value x, where floor(x) is the
 # calendar year and the fraction day/time within the year *except* 'a day'
 # is redefined as the time between 7am and 8pm (scanning only takes place
 # within these hours.
-#
-plugin_file fmrib
-date_format FMRIBImagingDate
-time_format FMRIBImagingTime
-
-# Drop non-numeric columns - the main output file only contains numeric data.
-suppress_non_numerics
+type_file fmrib/datetime_formatting.tsv
+
+# - Child value replacement
+variable_file fmrib/variables_parentvalues.tsv
+
+# - Processing
+processing_file fmrib/processing.tsv

New file (the type-mapping TSV referenced by `type_file` above):

@@ -0,0 +1,3 @@
+Type	Clean
+date	normalisedDate
+time	normalisedAcquisitionTime
\ No newline at end of file
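For context, the `fmrib` configuration above is selected at run time with funpack's `-cfg` option. A minimal invocation might look like the sketch below, where `out.tsv` and `in.csv` are placeholder output/input file names:

```
funpack -cfg fmrib out.tsv in.csv
```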
@@ -89,12 +89,13 @@ def formatColumn(col,
     # fall back to date/time formatting
     # if relevant for this column
-    if formatter is None:
-        if   vtype == util.CTYPES.date:
-            formatter = dateFormat
-        elif vtype == util.CTYPES.time or \
-             pdtypes.is_datetime64_any_dtype(series):
-            formatter = timeFormat
+    if formatter is None and pdtypes.is_datetime64_any_dtype(series):
+        # use dateFormat if we know the column
+        # is a date (and not datetime), otherwise
+        # use timeFormat if the column is a
+        # datetime, or unknown type.
+        if vtype == util.CTYPES.date: formatter = dateFormat
+        else:                         formatter = timeFormat

     if formatter is not None:
         log.debug('Formatting column %s%s with %s formatter',
@@ -88,30 +88,33 @@ def load_FMRIBImaging(infile):
     return df

-@funpack.formatter('FMRIBImagingDate')
-def normalisedDate(dtable, column, series):
+@funpack.cleaner()
+def normalisedDate(dtable, vid):
     """Converts date values into a numeric fractional year representation.

     Converts a date into a single value x, where ``floor(x)`` is the calendar
     year and the ``x mod 1`` is the fractional day within the year. The
     conversion takes leap years into account.
     """
-    datetimes = series.to_numpy()
-    years     = datetimes.astype('datetime64[Y]')
-    days      = datetimes.astype('datetime64[D]')
-
-    # convert to day of year
-    # calculate fraction of day
-    days  = (days  - years).astype(np.float32)
-    years = (years + 1970) .astype(np.float32)
-    leaps = pd.DatetimeIndex(datetimes).is_leap_year + 365
-
-    # calculate and return fraction of year
-    return pd.Series(years + (days / leaps), name=series.name)
+    for col in dtable.columns(vid):
+        series    = dtable[:, col.name]
+        datetimes = series.to_numpy()
+        years     = datetimes.astype('datetime64[Y]')
+        days      = datetimes.astype('datetime64[D]')
+
+        # convert to day of year
+        # calculate fraction of day
+        days  = (days  - years).astype(np.float32)
+        years = (years + 1970) .astype(np.float32)
+        leaps = pd.DatetimeIndex(datetimes).is_leap_year + 365
+
+        # calculate fraction of year
+        dtable[:, col.name] = years + (days / leaps)

-@funpack.formatter('FMRIBImagingTime')
-def normalisedAcquisitionTime(dtable, column, series):
+@funpack.cleaner()
+def normalisedAcquisitionTime(dtable, vid):
     """Converts timestamps into a numeric fractional year representation.

     Converts a date or date+time into a single value x, where `floor(x)` is the
@@ -119,23 +122,25 @@ def normalisedAcquisitionTime(dtable, column, series):
     redefined as the time between 7am and 8pm (UK BioBank scanning only takes
     place within these hours).
     """
-    datetimes = series.to_numpy()
-    years     = datetimes.astype('datetime64[Y]')
-    days      = datetimes.astype('datetime64[D]')
-    hours     = datetimes.astype('datetime64[h]')
-    mins      = datetimes.astype('datetime64[m]')
-    secs      = datetimes.astype('datetime64[s]')
-
-    # convert to day of year, hour
-    # of day, second of hour, then
-    # calculate fraction of day
-    secs     = (secs  - mins) .astype(np.float32)
-    mins     = (mins  - hours).astype(np.float32)
-    hours    = (hours - days) .astype(np.float32)
-    days     = (days  - years).astype(np.float32)
-    years    = (years + 1970) .astype(np.float32)
-    dayfracs = ((hours - 7) + (mins / 60) + (secs / 3600)) / 13
-    leaps    = pd.DatetimeIndex(datetimes).is_leap_year + 365
-
-    # calculate and return fraction of year
-    return pd.Series(years + (days + dayfracs) / leaps, name=series.name)
+    for col in dtable.columns(vid):
+        series    = dtable[:, col.name]
+        datetimes = series.to_numpy()
+        years     = datetimes.astype('datetime64[Y]')
+        days      = datetimes.astype('datetime64[D]')
+        hours     = datetimes.astype('datetime64[h]')
+        mins      = datetimes.astype('datetime64[m]')
+        secs      = datetimes.astype('datetime64[s]')
+
+        # convert to day of year, hour
+        # of day, second of hour, then
+        # calculate fraction of day
+        secs     = (secs  - mins) .astype(np.float32)
+        mins     = (mins  - hours).astype(np.float32)
+        hours    = (hours - days) .astype(np.float32)
+        days     = (days  - years).astype(np.float32)
+        years    = (years + 1970) .astype(np.float32)
+        dayfracs = ((hours - 7) + (mins / 60) + (secs / 3600)) / 13
+        leaps    = pd.DatetimeIndex(datetimes).is_leap_year + 365
+
+        # calculate and return fraction of year
+        dtable[:, col.name] = years + (days + dayfracs) / leaps
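The two cleaners above are wired in through the new type-mapping file in the `fmrib` configuration, but registered cleaner functions can in principle also be applied by hand with funpack's `--clean` option, which is described in the demo notebook below. A hypothetical invocation is sketched here - the variable ID `1` and the file `dates.csv` are placeholders, not part of this change:

```
funpack -q -ow --plugin_file fmrib -cl 1 "normalisedDate" out.tsv dates.csv
```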
@@ -57,7 +57,7 @@
     "\n",
     "\n",
     "**Important** The examples in this notebook assume that you have installed `funpack`\n",
-    "2.3.0.dev0 or newer."
+    "2.3.0 or newer."
    ]
   },
   {
@@ -1664,7 +1664,7 @@
     "\n",
     "`funpack` has a large number of hand-crafted rules built in, which are\n",
     "specific to variables found in the UK BioBank data set. These rules are part\n",
-    "of the ``fmrib`` configuration, which can be used by adding `-cfg fmrib` to\n",
+    "of the `fmrib` configuration, which can be used by adding `-cfg fmrib` to\n",
     "the command-line options.\n",
     "\n",
     "\n",
@@ -1679,7 +1679,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "funpack -q -ow -cfg fmrib -d out.tsv ukb_dataset_1_only_column_names.tsv ukb_dataset_2_only_column_names.tsv"
+    "funpack -q -ow -cfg fmrib -d out.tsv ukbcols.csv"
    ]
   },
   {
@@ -1888,5 +1888,5 @@
  ],
  "metadata":{"kernelspec":{"display_name":"Bash","language":"bash","name":"bash"},"language_info":{"codemirror_mode":"shell","file_extension":".sh","mimetype":"text/x-sh","name":"bash"}},
  "nbformat": 4,
- "nbformat_minor": 2
+ "nbformat_minor": 4
 }
%% Cell type:markdown id: tags:

![win logo](win.png)

# `funpack` (https://git.fmrib.ox.ac.uk/fsl/funpack)

> Paul McCarthy <paul.mccarthy@ndcn.ox.ac.uk>
> ([WIN@FMRIB](https://www.win.ox.ac.uk/))

`funpack` is a command-line program which you can use to extract data from UK
BioBank (and other tabular) data.

You can give `funpack` one or more input files (e.g. `.csv`, `.tsv`), and it
will merge them together, perform some preprocessing, and produce a single
output file.

A large number of rules are built into `funpack` which are specific to the UK
BioBank data set. But you can control and customise everything that `funpack`
does to your data, including which rows and columns to extract, and which
cleaning/processing steps to perform on each column.

`funpack` comes installed with recent versions of
[FSL](https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/). You can also install `funpack`
via `conda`:

> ```
> conda install -c conda-forge fmrib-unpack
> ```

Or using `pip`:

> ```
> pip install fmrib-unpack
> ```

Get command-line help by typing:

> ```
> funpack -h
> ```

**Important** The examples in this notebook assume that you have installed `funpack`
2.3.0 or newer.

%% Cell type:code id: tags:

``` bash
funpack -V
```
%% Cell type:markdown id: tags:

### Contents

1. [Overview](#Overview)
    1. [Import](#1.-Import)
    2. [Cleaning](#2.-Cleaning)
    3. [Processing](#3.-Processing)
    4. [Export](#4.-Export)
2. [Examples](#Examples)
3. [Import examples](#Import-examples)
    1. [Selecting variables (columns)](#Selecting-variables-(columns))
        1. [Selecting individual variables](#Selecting-individual-variables)
        2. [Selecting variable ranges](#Selecting-variable-ranges)
        3. [Selecting variables with a file](#Selecting-variables-with-a-file)
        4. [Selecting variables from pre-defined categories](#Selecting-variables-from-pre-defined-categories)
    2. [Selecting subjects (rows)](#Selecting-subjects-(rows))
        1. [Selecting individual subjects](#Selecting-individual-subjects)
        2. [Selecting subject ranges](#Selecting-subject-ranges)
        3. [Selecting subjects from a file](#Selecting-subjects-from-a-file)
        4. [Selecting subjects by variable value](#Selecting-subjects-by-variable-value)
        5. [Excluding subjects](#Excluding-subjects)
    3. [Selecting visits](#Selecting-visits)
        1. [Evaluating expressions across visits](#Evaluating-expressions-across-visits)
    4. [Merging multiple input files](#Merging-multiple-input-files)
        1. [Merging by subject](#Merging-by-subject)
        2. [Merging by column](#Merging-by-column)
        3. [Naive merging](#Naive-merging)
4. [Cleaning examples](#Cleaning-examples)
    1. [NA insertion](#NA-insertion)
    2. [Variable-specific cleaning functions](#Variable-specific-cleaning-functions)
    3. [Categorical recoding](#Categorical-recoding)
    4. [Child value replacement](#Child-value-replacement)
5. [Processing examples](#Processing-examples)
    1. [Sparsity check](#Sparsity-check)
    2. [Redundancy check](#Redundancy-check)
    3. [Categorical binarisation](#Categorical-binarisation)
6. [Custom cleaning, processing and loading - funpack plugins](#Custom-cleaning,-processing-and-loading---funpack-plugins)
    1. [Custom cleaning functions](#Custom-cleaning-functions)
    2. [Custom processing functions](#Custom-processing-functions)
    3. [Custom file loaders](#Custom-file-loaders)
7. [Miscellaneous topics](#Miscellaneous-topics)
    1. [Non-numeric data](#Non-numeric-data)
    2. [Dry run](#Dry-run)
    3. [Built-in rules](#Built-in-rules)
    4. [Using a configuration file](#Using-a-configuration-file)
    5. [Working with unknown/uncategorised variables](#Working-with-unknown/uncategorised-variables)
# Overview

`funpack` performs the following steps:

## 1. Import

All data files are loaded in, unwanted columns and subjects are dropped, and
the data files are merged into a single table (a.k.a. data frame). Multiple
files can be merged according to an index column (e.g. subject ID). Or, if the
input files contain the same columns/subjects, they can be naively
concatenated along rows or columns.

## 2. Cleaning

The following cleaning steps are applied to each column:

1. **NA value replacement:** Specific values for some columns are replaced
   with NA, for example, variables where a value of `-1` indicates *Do not
   know*.
2. **Variable-specific cleaning functions:** Certain columns are
   re-formatted; for example, the [ICD10](https://en.wikipedia.org/wiki/ICD-10)
   disease codes can be converted to integer representations.
3. **Categorical recoding:** Certain categorical columns are re-coded.
4. **Child value replacement:** NA values within some columns which are
   dependent upon other columns may have values inserted based on the values
   of their parent columns.

## 3. Processing

During the processing stage, columns may be removed, merged, or expanded into
additional columns. For example, a categorical column may be expanded into a set
of binary columns, one for each category.

A column may also be removed on the basis of being too sparse, or being
redundant with respect to another column.

## 4. Export

The processed data can be saved as a `.csv`, `.tsv`, or `.hdf5` file.
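For example, a minimal export command might look like the sketch below. This assumes that the export format is chosen from the output file's suffix, and `in.tsv`/`out.hdf5` are placeholder file names:

> ```
> funpack out.hdf5 in.tsv
> ```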
# Examples

Throughout these examples, we are going to use a few command line
options, which you will probably **not** normally want to use:

- `-ow` (short for `--overwrite`): This tells `funpack` not to complain if
  the output file already exists.
- `-q` (short for `--quiet`): This tells `funpack` to be quiet. Without the
  `-q` option, `funpack` can be quite verbose, which can be annoying, but is
  very useful when things go wrong. A good strategy is to tell `funpack` to
  produce verbose output using the `--noisy` (`-n` for short) option, and to
  send all of its output to a log file with the `--log_file` (or `-lf`)
  option. For example:

> ```
> funpack -n -n -n -lf log.txt out.tsv in.tsv
> ```

Here's the first example input data set, with UK BioBank-style column names:

%% Cell type:code id: tags:

``` bash
cat data_01.tsv
```

%% Cell type:markdown id: tags:

The numbers in each column name typically represent:

1. The variable ID
2. The visit, for variables which were collected at multiple points in time.
3. The "instance", for multi-valued variables.

Note that one **variable** is typically associated with several **columns**,
although we're keeping things simple for this first example - there is only
one visit for each variable, and there are no multi-valued variables.
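For example (an illustration of the naming scheme, using one of the columns from the data set above):

> ```
> 4-0.0  →  variable ID 4, visit 0, instance 0
> ```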
> _Most but not all_ variables in the UK BioBank contain data collected at
> different visits, the times that the participants visited a UK BioBank
> assessment centre. However there are some variables (e.g. [ICD10 diagnosis
> codes](https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41202)) for which
> this is not the case.

# Import examples

## Selecting variables (columns)

You can specify which variables you want to load in the following ways, using
the `--variable` (`-v` for short), `--category` (`-c` for short) and
`--column` (`-co` for short) command line options:

* By variable ID
* By variable ranges
* By a text file which contains the IDs you want to keep.
* By pre-defined variable categories
* By column name

### Selecting individual variables

Simply provide the IDs of the variables you want to extract:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -v 1 -v 5 out.tsv data_01.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

### Selecting variable ranges

The `--variable`/`-v` option accepts MATLAB-style ranges of the form
`start:step:stop` (where the `stop` is inclusive):

%% Cell type:code id: tags:

``` bash
funpack -q -ow -v 1:3:10 out.tsv data_01.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

### Selecting variables with a file

If your variables of interest are listed in a plain-text file, you can simply
pass that file:

%% Cell type:code id: tags:

``` bash
echo -e "1\n6\n9" > vars.txt
funpack -q -ow -v vars.txt out.tsv data_01.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

### Selecting variables from pre-defined categories

Some UK BioBank-specific categories are [built into
`funpack`](#Built-in-rules), but you can also define your own categories - you
just need to create a `.tsv` file, and pass it to `funpack` via the
`--category_file` (`-cf` for short):

%% Cell type:code id: tags:

``` bash
echo -e "ID\tCategory\tVariables" > custom_categories.tsv
echo -e "1\tCool variables\t1:5,7" >> custom_categories.tsv
echo -e "2\tUncool variables\t6,8:10" >> custom_categories.tsv
cat custom_categories.tsv
```

%% Cell type:markdown id: tags:
Use the `--category` (`-c` for short) option to select categories to output. You can
refer to categories by their ID:
%% Cell type:code id: tags:

``` bash
funpack -q -ow -cf custom_categories.tsv -c 1 out.tsv data_01.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

Or by name:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -cf custom_categories.tsv -c uncool out.tsv data_01.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

### Selecting column names
If you are working with data that has non-UK BioBank style column names, you
can use the `--column` (`-co` for short) option to select individual columns by
their name, rather than the variable with which they are associated. The `--column`
option accepts full column names, and also shell-style wildcard patterns:
%% Cell type:code id: tags:

``` bash
funpack -q -ow -co 4-0.0 -co "??-0.0" out.tsv data_01.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

## Selecting subjects (rows)

`funpack` assumes that the first column in every input file is a subject
ID. You can specify which subjects you want to load via the `--subject` (`-s`
for short) option. You can specify subjects in the same way that you specified
variables above, and also:

* By specifying a conditional expression on variable values - only subjects
  for which the expression evaluates to true will be imported
* By specifying subjects to exclude

### Selecting individual subjects

%% Cell type:code id: tags:

``` bash
funpack -q -ow -s 1 -s 3 -s 5 out.tsv data_01.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

### Selecting subject ranges

%% Cell type:code id: tags:

``` bash
funpack -q -ow -s 2:2:10 out.tsv data_01.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

### Selecting subjects from a file

%% Cell type:code id: tags:

``` bash
echo -e "5\n6\n7\n8\n9\n10" > subjects.txt
funpack -q -ow -s subjects.txt out.tsv data_01.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

### Selecting subjects by variable value

The `--subject` option accepts *variable expressions* - you can write an
expression performing numerical comparisons on variables (denoted with a
leading `v`) and combine these expressions using boolean algebra. Only
subjects for which the expression evaluates to true will be imported. For
example, to only import subjects where variable 1 is greater than 10, and
variable 2 is less than 70, you can type:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -sp -s "v1 > 10 && v2 < 70" out.tsv data_01.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

The following symbols can be used in variable expressions:

| Symbol                    | Meaning                         |
|---------------------------|---------------------------------|
| `==`                      | equal to                        |
| `!=`                      | not equal to                    |
| `>`                       | greater than                    |
| `>=`                      | greater than or equal to        |
| `<`                       | less than                       |
| `<=`                      | less than or equal to           |
| `na`                      | N/A                             |
| `&&`                      | logical and                     |
| <code>&#x7c;&#x7c;</code> | logical or                      |
| `~`                       | logical not                     |
| `contains`                | Contains sub-string             |
| `all`                     | all columns must meet condition |
| `any`                     | any column must meet condition  |
| `()`                      | to denote precedence            |
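These can be combined - for instance, an illustrative command (along the same lines as the example above, not one that appears elsewhere in this notebook) which imports subjects for whom variable 1 is *not* greater than 10, or variable 2 is at least 70:

> ```
> funpack -q -ow -sp -s "~(v1 > 10) || v2 >= 70" out.tsv data_01.tsv
> ```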
Non-numeric (i.e. string) variables can be used in these expressions in
conjunction with the `==`, `!=`, and `contains` operators. An example of such
an expression is given in the section on [non-numeric
data](#Non-numeric-data), below.

The `all` and `any` symbols allow you to control how an expression is
evaluated across multiple columns which are associated with one variable
(e.g. separate columns for each visit). We will give an example of this in the
section on [selecting visits](#Selecting-visits), below.

### Excluding subjects

The `--exclude` (`-ex` for short) option allows you to exclude subjects - it
accepts individual IDs, an ID range, or a file containing IDs. The
`--exclude`/`-ex` option takes precedence over the `--subject`/`-s` option:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -s 1:8 -ex 5:10 out.tsv data_01.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

## Selecting visits

Many variables in the UK BioBank data contain observations at multiple points in
time, or visits. `funpack` allows you to specify which visits you are interested
in. Here is an example data set with variables that have data for multiple
visits (remember that the second number in the column names denotes the visit):

%% Cell type:code id: tags:

``` bash
cat data_02.tsv
```

%% Cell type:markdown id: tags:

We can use the `--visit` (`-vi` for short) option to get just the last visit for
each variable:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -vi last out.tsv data_02.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

You can also specify which visit you want by its number:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -vi 1 out.tsv data_02.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

> Variables which are not associated with specific visits (e.g. [ICD10
> diagnosis codes](https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41202))
> will not be affected by the `-vi` option.

### Evaluating expressions across visits
The variable expressions described above in the section on [selecting
subjects](#Selecting-subjects-by-variable-value) will be applied to all of
the columns associated with a variable. By default, an expression will
evaluate to true where the values in _any_ column associated with the
variable evaluate to true. For example, we can extract the data for subjects
where the values of any column of variable 2 were less than 50:
%% Cell type:code id: tags:

``` bash
funpack -q -ow -v 2 -s 'v2 < 50' out.tsv data_02.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:
We can use the `any` and `all` operators to control how an expression is
evaluated across the columns of a variable. For example, we may only be
interested in subjects for whom all columns of variable 2 were less than
50:
%% Cell type:code id: tags:

``` bash
funpack -q -ow -v 2 -s 'all(v2 < 50)' out.tsv data_02.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

We can use `any` and `all` in expressions involving multiple variables:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -v 2,3 -s 'any(v2 < 50) && all(v3 >= 40)' out.tsv data_02.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

## Merging multiple input files

If your data is split across multiple files, you can specify how `funpack`
should merge them together.

### Merging by subject
For example, let's say we have these two input files (shown side-by-side):
%% Cell type:code id: tags:

``` bash
echo " " | paste data_03.tsv - data_04.tsv
```

%% Cell type:markdown id: tags:

Note that each file contains different variables, and different, but
overlapping, subjects. By default, when you pass these files to `funpack`, it
will output the intersection of the two files (more formally known as an
*inner join*), i.e. subjects which are present in both files:

%% Cell type:code id: tags:

``` bash
funpack -q -ow out.tsv data_03.tsv data_04.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

If you want to keep all subjects, you can instruct `funpack` to output the union
(a.k.a. *outer join*) via the `--merge_strategy` (`-ms` for short) option:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -ms outer out.tsv data_03.tsv data_04.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

### Merging by column

Your data may be organised in a different way. For example, these next two
files contain different groups of subjects, but overlapping columns:

%% Cell type:code id: tags:

``` bash
echo " " | paste data_05.tsv - data_06.tsv
```

%% Cell type:markdown id: tags:

In this case, we need to tell `funpack` to merge along the row axis, rather than
along the column axis. We can do this with the `--merge_axis` (`-ma` for short)
option:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -ma rows out.tsv data_05.tsv data_06.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

Again, if we want to retain all columns, we can tell `funpack` to perform an
outer join with the `-ms` option:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -ma rows -ms outer out.tsv data_05.tsv data_06.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

### Naive merging
Finally, your data may be organised such that you simply want to "paste", or
concatenate, the files together, along either rows or columns. For example, your
data files might look like this:
%% Cell type:code id: tags:

``` bash
echo " " | paste data_07.tsv - data_08.tsv
```

%% Cell type:markdown id: tags:

Here, we have columns for different variables on the same set of subjects, and
we just need to concatenate them together horizontally. We do this by using
`--merge_strategy naive` (`-ms naive` for short):

%% Cell type:code id: tags:

``` bash
funpack -q -ow -ms naive out.tsv data_07.tsv data_08.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

For files which need to be concatenated vertically, such as these:

%% Cell type:code id: tags:

``` bash
echo " " | paste data_09.tsv - data_10.tsv
```

%% Cell type:markdown id: tags:

We need to tell `funpack` which axis to concatenate along, again using the `-ma`
option:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -ms naive -ma rows out.tsv data_09.tsv data_10.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

# Cleaning examples
Once the data has been imported, a sequence of cleaning steps is applied to
each column.
## NA insertion

For some variables it may make sense to discard or ignore certain values. For
example, if an individual selects *Do not know* to a question such as *How
much milk did you drink yesterday?*, that answer will be coded with a specific
value (e.g. `-1`). It does not make any sense to include these values in most
analyses, so `funpack` can be used to mark such values as *Not Available
(NA)*.

A large number of NA insertion rules, specific to UK BioBank variables, are
coded into `funpack`, and are applied when you use the `-cfg fmrib` option
(see the section below on [built-in rules](#Built-in-rules)). You can also
specify your own rules via the `--na_values` (`-nv` for short) option.

Let's say we have this data set:

%% Cell type:code id: tags:

``` bash
cat data_11.tsv
```

%% Cell type:markdown id: tags:

For variable 1, we want to ignore values of -1, for variable 2 we want to
ignore -1 and 0, and for variable 3 we want to ignore 1 and 2:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -nv 1 " -1" -nv 2 " -1,0" -nv 3 "1,2" out.tsv data_11.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

The `--na_values` option expects two arguments:

* The variable ID
* A comma-separated list of values to replace with NA

## Variable-specific cleaning functions

A small number of cleaning/preprocessing functions are built into `funpack`,
which can be applied to specific variables. For example, some variables in the
UK BioBank contain ICD10 disease codes, which may be more useful if converted
to a numeric format (e.g. to make them easy to load into MATLAB). Imagine
that we have some data with ICD10 codes:

%% Cell type:code id: tags:

``` bash
cat data_12.tsv
```

%% Cell type:markdown id: tags:

We can use the `--clean` (`-cl` for short) option with the built-in
`codeToNumeric` cleaning function to convert the codes to a numeric
representation<sup>*</sup>:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -cl 1 "codeToNumeric('icd10')" out.tsv data_12.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

> <sup>*</sup>The `codeToNumeric` function will replace each ICD10 code with
> the corresponding *Node* number, as defined in the UK [BioBank ICD10 data
> coding](http://biobank.ctsu.ox.ac.uk/crystal/coding.cgi?id=19).

The `--clean` option expects two arguments:

* The variable ID
* The cleaning function to apply. Some cleaning functions accept
  arguments - refer to the command-line help for a summary of available
  functions.

You can define your own cleaning functions by passing them in as a
`--plugin_file` (see the [section on custom plugins
below](#Custom-cleaning,-processing-and-loading---funpack-plugins)).

### Example: flattening hierarchical data
Several variables in the UK Biobank (including the ICD10 disease
categorisations) are organised in a hierarchical manner - each value is a
child of a more general parent category. The `flattenHierarchical` cleaning
function can be used to replace each value in a data set with the value that
corresponds to a parent category. Let's apply this to our example ICD10 data
set.
%% Cell type:code id: tags:

``` bash
funpack -q -ow -cl 1 "flattenHierarchical(name='icd10')" out.tsv data_12.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

### Aside: ICD10 mapping file

`funpack` has a feature specific to these ICD10 disease categorisations - you
can use the `--icd10_map_file` (`-imf` for short) option to tell `funpack` to
save a file which contains a list of all ICD10 codes that were present in the
input data, and the corresponding numerical codes that `funpack` generated:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -cl 1 "codeToNumeric('icd10')" -imf icd10_codes.tsv out.tsv data_12.tsv
cat icd10_codes.tsv
```

%% Cell type:markdown id: tags:

## Categorical recoding

You may have some categorical data which is coded in an awkward manner, such as
in this example, which encodes the amount of some item that an individual has
consumed:

![data coding example](coding.png)

You can use the `--recoding` (`-re` for short) option to recode data like this
into something more useful. For example, given this data:

%% Cell type:code id: tags:

``` bash
cat data_13.tsv
```

%% Cell type:markdown id: tags:

Let's recode it to be more monotonic:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -re 1 "300,444,555" "3,0.25,0.5" out.tsv data_13.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

The `--recoding` option expects three arguments:

* The variable ID
* A comma-separated list of the values to be replaced
* A comma-separated list of the values to replace them with

## Child value replacement

Imagine that we have these two questions:

* **1**: *Do you currently smoke cigarettes?*
* **2**: *How many cigarettes did you smoke yesterday?*

Now, question 2 was only asked if the answer to question 1 was *Yes*. So for
all individuals who answered *No* to question 1, we will have a missing value
for question 2. But for some analyses, it would make more sense to have a
value of 0, rather than NA, for these subjects.

`funpack` can handle these sorts of dependencies by way of *child value
replacement*. For question 2, we can define a conditional variable expression
such that when both question 2 is NA and question 1 is *No*, we can insert a
value of 0 into question 2.
This scenario is demonstrated in this example data set (where, for
question 1, values of `1` and `0` represent *Yes* and *No* respectively):
%% Cell type:code id: tags:
``` bash
cat data_14.tsv
```
%% Cell type:markdown id: tags:
We can fill in the values for variable 2 by using the `--child_values` (`-cv`
for short) option:
%% Cell type:code id: tags:
``` bash
funpack -q -ow -cv 2 "v1 == 0" "0" out.tsv data_14.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
The `--child_values` option expects three arguments:
* The variable ID
* An expression evaluating some condition on the parent variable(s)
* A value to replace NA with where the expression evaluates to true.
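For completeness, here is the same command with the long-form option name
spelled out - a sketch which should behave identically to the `-cv` example
above:
%% Cell type:code id: tags:
``` bash
# Hedged example: identical to the -cv invocation above, using the long option
funpack -q -ow --child_values 2 "v1 == 0" "0" out.tsv data_14.tsv
cat out.tsv
```
%% Cell type:markdown id: tags: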
# Processing examples
After every column has been cleaned, the entire data set undergoes a series of
processing steps. The processing stage may result in columns being removed or
manipulated, or new columns being added.
The processing stage can be controlled with these options:
* `--prepend_process` (`-ppr` for short): Apply a processing function before
the built-in processing
* `--append_process` (`-apr` for short): Apply a processing function after the
built-in processing
A default set of processing steps is applied when you use the `fmrib`
configuration profile via `-cfg fmrib` - see the section on [built-in
rules](#Built-in-rules).
The `--prepend_process` and `--append_process` options require two arguments:
* The variable ID(s) to apply the function to, or `all` to denote all
variables.
* The processing function to apply. The available processing functions are
listed in the command line help, or you can write your own and pass it in
as a plugin file
([see below](#Custom-cleaning,-processing-and-loading---funpack-plugins)).
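As a sketch (using the data_15.tsv file introduced in the next section, and
the built-in `removeIfSparse` and `removeIfRedundant` processing functions),
both options can be given in a single invocation:
%% Cell type:code id: tags:
``` bash
# Hedged example: run a sparsity check before, and a redundancy check after,
# the built-in processing steps.
funpack -q -ow \
        -ppr all "removeIfSparse(minpres=8)" \
        -apr all "removeIfRedundant(0.9)"    \
        out.tsv data_15.tsv
cat out.tsv
```
%% Cell type:markdown id: tags: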
## Sparsity check
The `removeIfSparse` process will remove columns that are deemed to have too
many missing values. If we take this data set:
%% Cell type:code id: tags:
``` bash
cat data_15.tsv
```
%% Cell type:markdown id: tags:
Imagine that our analysis requires at least 8 values per variable to work. We
can use the `minpres` argument to `removeIfSparse` to drop any columns which
do not meet this threshold:
%% Cell type:code id: tags:
``` bash
funpack -q -ow -apr all "removeIfSparse(minpres=8)" out.tsv data_15.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
You can also specify `minpres` as a proportion, rather than an absolute count,
by setting `abspres=False`, e.g.:
%% Cell type:code id: tags:
``` bash
funpack -q -ow -apr all "removeIfSparse(minpres=0.65, abspres=False)" out.tsv data_15.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
## Redundancy check
You may wish to remove columns which contain redundant information. The
`removeIfRedundant` process calculates the pairwise correlation between all
columns, and removes columns with a correlation above a threshold that you
provide. Imagine that we have this data set:
%% Cell type:code id: tags:
``` bash
cat data_16.tsv
```
%% Cell type:markdown id: tags:
The data in column `2-0.0` is effectively equivalent to the data in column
`1-0.0`, so it is of no use to us. We can tell `funpack` to remove it like so:
%% Cell type:code id: tags:
``` bash
funpack -q -ow -apr all "removeIfRedundant(0.9)" out.tsv data_16.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
The `removeIfRedundant` process can also calculate the correlation between the
patterns of missing values in each column. Consider this example:
%% Cell type:code id: tags:
``` bash
cat data_17.tsv
```
%% Cell type:markdown id: tags:
All three columns are highly correlated, but the pattern of missing values in
column `3-0.0` is different to that of the other columns.
If we use the `nathres` argument, `funpack` will only remove a column if the
correlation of its present values *and* the correlation of its missing-value
pattern both exceed their respective thresholds. Note that, of a correlated
pair, the column which contains more missing values is the one that gets
removed:
%% Cell type:code id: tags:
``` bash
funpack -q -ow -apr all "removeIfRedundant(0.9, nathres=0.6)" out.tsv data_17.tsv
cat out.tsv
```