Commit 946f7cd3 authored by Paul McCarthy

Merge branch 'rf/descriptions' into 'master'

Rf/descriptions

See merge request fsl/ukbparse!124
parents 459df7a0 d04a1c7d
Pipeline #3703 canceled with stages
in 4 minutes and 22 seconds
......@@ -2,6 +2,31 @@
======================
0.21.0 (Thursday 8th May 2019)
------------------------------
Added
^^^^^
* :class:`.Column` objects now have a ``metadata`` attribute which may be used
in the column description (if the ``--description_file`` option is used).
Processing functions can set the metadata for newly added columns.
* New ``metaproc`` plugin type to manipulate column metadata.
* All processing functions accept a ``metaproc`` argument, allowing a
``metaproc`` function to be applied to any column metadata that is returned
by the processing function.
Changed
^^^^^^^
* The :func:`.binariseCategorical` function sets the categorical value as
column metadata on the new binarised columns.
0.20.1 (Wednesday 8th May 2019)
-------------------------------
......
......@@ -196,6 +196,7 @@ descriptions for each column as follows::
descs = readtable('descriptions.tsv', ...
'FileType', 'text', ...
'Delimiter', '\t', ...
'ReadVariableNames',false);
descs = [descs; {'eid', 'ID'}];
idxs = cellfun(@(x) find(strcmp(descs.Var1, x)), colnames, ...
......
......@@ -6,7 +6,7 @@
#
__version__ = '0.20.1'
__version__ = '0.21.0'
"""The ``ukbparse`` versioning scheme roughly follows Semantic Versioning
conventions.
"""
......
......@@ -12,16 +12,18 @@ and for cleaning and processing.
The following plugin types currently exist:
+-------------------+------------------------------------------------------+
+-------------------+-------------------------------------------------------+
| Plugin type | |
+-------------------+------------------------------------------------------|
+-------------------+-------------------------------------------------------|
| ``sniffer`` | Return information about the columns in a file |
| ``loader`` | Load data from a file |
| ``cleaner`` | Run a cleaning function on a single column |
| ``processor`` | Run a processing fnction on one or more data columns |
| ``processor`` | Run a processing function on one or more data columns |
| ``metaproc`` | Run a function on a :class:`.Column` ``metadata`` |
| | value |
| ``formatter`` | Format a column for output |
| ``exporter`` | Export the processed data set |
+-------------------+------------------------------------------------------+
+-------------------+-------------------------------------------------------+
To ensure that the ``ukbparse`` command line help is nicely formatted, all
......@@ -66,6 +68,7 @@ PLUGIN_TYPES = ['loader',
'formatter',
'cleaner',
'processor',
'metaproc',
'exporter']
......@@ -181,6 +184,7 @@ def registerBuiltIns():
import ukbparse.exporting_tsv as uet
import ukbparse.cleaning_functions as cf
import ukbparse.processing_functions as pf
import ukbparse.metaproc_functions as mf
if firstTime:
loglevel = log.getEffectiveLevel()
......@@ -191,6 +195,7 @@ def registerBuiltIns():
importlib.reload(uet)
importlib.reload(cf)
importlib.reload(pf)
importlib.reload(mf)
if firstTime:
log.setLevel(loglevel)
......
Variable Process
40001 binariseCategorical(acrossVisits=True, acrossInstances=True)
40002 binariseCategorical(acrossVisits=True, acrossInstances=True)
40006 binariseCategorical(acrossVisits=True, acrossInstances=True)
41201 binariseCategorical(acrossVisits=True, acrossInstances=True)
41202 binariseCategorical(acrossVisits=True, acrossInstances=True)
41204 binariseCategorical(acrossVisits=True, acrossInstances=True)
41270 binariseCategorical(acrossVisits=True, acrossInstances=True)
40001 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='icd10.numdesc')
40002 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='icd10.numdesc')
40006 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='icd10.numdesc')
41201 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='icd10.numdesc')
41202 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='icd10.numdesc')
41204 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='icd10.numdesc')
41270 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='icd10.numdesc')
all_independent_except,40001,40002,40006,41202,41204,41270 removeIfSparse(minpres=51, maxcat=0.99, abscat=False)
40001 removeIfSparse(mincat=10)
40002 removeIfSparse(mincat=10)
......
......@@ -42,7 +42,8 @@ class Column(object):
index,
vid=None,
visit=0,
instance=0):
instance=0,
metadata=None):
self.datafile = datafile
self.name = name
......@@ -50,6 +51,7 @@ class Column(object):
self.vid = vid
self.visit = visit
self.instance = instance
self.metadata = metadata
def __str__(self):
......@@ -356,7 +358,7 @@ class DataTable(object):
else: self.__varmap.pop(col.vid)
def addColumns(self, series, vids=None):
def addColumns(self, series, vids=None, meta=None):
"""Adds one or more new columns to the data set.
:arg series: Sequence of ``pandas.Series`` objects containing the
......@@ -365,10 +367,12 @@ class DataTable(object):
:arg vids: Sequence of variables each new column is associated
with. If ``None`` (the default), variable IDs are
automatically assigned.
:arg meta: Sequence of metadata associated with each new column.
"""
if vids is None:
vids = [None] * len(series)
if vids is None: vids = [None] * len(series)
if meta is None: meta = [None] * len(series)
for s in series:
if s.name in self.__data.columns:
......@@ -390,13 +394,13 @@ class DataTable(object):
# a vid for each column starting from here.
startvid = max(max(self.variables) + 1, AUTO_VARIABLE_ID)
for s, idx, vid in zip(series, idxs, vids):
for s, idx, vid, m in zip(series, idxs, vids, meta):
if vid is None:
vid = startvid
startvid = startvid + 1
col = Column(None, s.name, idx, vid, 0, 0)
col = Column(None, s.name, idx, vid, 0, 0, m)
self.__data[s.name] = s
# new column on existing variable.
......
......@@ -17,6 +17,8 @@ import warnings
import datetime
import calendar
import pandas as pd
import ukbparse
import ukbparse.util as util
import ukbparse.icd10 as icd10
......@@ -420,22 +422,42 @@ def doDescriptionExport(dtable, args):
with util.timed('Description export', log):
cols = dtable.allColumns[1:]
vartable = dtable.vartable
try:
with open(args.description_file, 'wt') as f:
for c in cols:
for col in cols:
desc = generateDescription(dtable, col)
f.write('{}\t{}\n'.format(col.name, desc))
except Exception as e:
log.warning('Failed to export descriptions: {}'.format(e),
exc_info=True)
desc = vartable.loc[c.vid, 'Description']
if desc == c.name:
def generateDescription(dtable, col):
"""Called by :func:`doDescriptionExport`. Generates and returns a
suitable description for the given column.
:arg dtable: :class:`.DataTable` instance
:arg col: :class:`.Column` instance
"""
vartable = dtable.vartable
desc = vartable.loc[col.vid, 'Description']
if pd.isna(desc) or (desc == col.name):
desc = 'n/a'
f.write('{}\t{}\n'.format(c.name, desc))
# If metadata has been added to the column,
# we add it to the description. See the
# binariseCategorical processing function
# for an example of this.
if col.metadata is not None:
suffix = ' ({})'.format(col.metadata)
else:
suffix = ' ({}.{})'.format(col.visit, col.instance)
return '{}{}'.format(desc, suffix)
except Exception as e:
log.warning('Failed to export descriptions: {}'.format(e),
exc_info=True)
def configLogging(args):
......
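The description logic introduced in the hunk above can be summarised in a small standalone sketch. This is a simplified illustration, not the `ukbparse` code itself: the `Column` namedtuple and the plain `descriptions` dictionary are stand-ins (assumptions) for the real `Column` class and the variable table, and the example values are made up.

```python
from collections import namedtuple

# Stand-in for ukbparse's Column class (assumption: only these fields matter here).
Column = namedtuple('Column', ['name', 'vid', 'visit', 'instance', 'metadata'])


def generate_description(descriptions, col):
    """Build a description string for one column.

    descriptions: dict mapping variable ID -> description (hypothetical
                  replacement for the variable table lookup).
    """
    desc = descriptions.get(col.vid)

    # Fall back to 'n/a' when there is no useful description.
    if desc is None or desc == col.name:
        desc = 'n/a'

    # Prefer column metadata; otherwise use "visit.instance".
    if col.metadata is not None:
        suffix = ' ({})'.format(col.metadata)
    else:
        suffix = ' ({}.{})'.format(col.visit, col.instance)

    return '{}{}'.format(desc, suffix)


descriptions = {1: 'Age', 2: 'Diagnoses'}
plain = Column('1-0.0', 1, 0, 0, None)
binarised = Column('2-0.3', 2, 0, 0, 'category 3')
unknown = Column('9-1.2', 9, 1, 2, None)

print(generate_description(descriptions, plain))      # Age (0.0)
print(generate_description(descriptions, binarised))  # Diagnoses (category 3)
print(generate_description(descriptions, unknown))    # n/a (1.2)
```

Columns created by a processing function that returns metadata get the metadata in brackets; all other columns get their visit and instance numbers.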
#!/usr/bin/env python
#
# metaproc_functions.py - Functions for manipulating column metadata.
#
# Author: Paul McCarthy <pauldmccarthy@gmail.com>
#
"""This module contains ``metaproc`` functions - functions for manipulating
column metadata.
Some :class:`.Column` instances have a ``metadata`` attribute, containing some
additional information about the column. The functions in this module can be
used to modify these metadata values. Currently, column metadata is only used
to generate a description of each column (via the ``--description_file``
command-line option).
"""
from . import icd10
from . import custom
from . import hierarchy
@custom.metaproc('icd10.numdesc')
def icd10DescriptionFromNumeric(val):
"""Generates a description for a numeric ICD10 code. """
val = icd10.numericToCode(val)
hier = hierarchy.getHierarchyFilePath(name='icd10')
hier = hierarchy.loadHierarchyFile(hier)
desc = hier.description(val)
return '{} - {}'.format(val, desc)
@custom.metaproc('icd10.codedesc')
def icd10DescriptionFromCode(val):
"""Generates a description for an ICD10 code. """
hier = hierarchy.getHierarchyFilePath(name='icd10')
hier = hierarchy.loadHierarchyFile(hier)
desc = hier.description(val)
return '{} - {}'.format(val, desc)
......@@ -36,8 +36,8 @@ import collections
import pyparsing as pp
import ukbparse.util as util
import ukbparse.custom as custom
from . import util
from . import custom
log = logging.getLogger(__name__)
......@@ -136,10 +136,11 @@ def runProcess(proc, dtable, vids):
remove = []
add = []
addvids = []
addmeta = []
def genvids(result, vi, si):
if result[vi] is None: return [None] * len(result[si])
else: return result[vi]
def expand(res, length):
if res is None: return [None] * length
else: return res
for result in results:
if result is None:
......@@ -152,15 +153,21 @@ def runProcess(proc, dtable, vids):
# series/vids to add
if len(result) == 2:
add .extend( result[0])
addvids.extend(genvids(result, 1, 0))
add .extend(result[0])
addvids.extend(expand(result[1], len(result[0])))
addmeta.extend(expand(None, len(result[0])))
# columns to remove, and
# series/vids to add
elif len(result) == 3:
remove .extend( result[0])
add .extend( result[1])
addvids.extend(genvids(result, 2, 1))
elif len(result) in (3, 4):
if len(result) == 3:
result = list(result) + [None]
remove .extend(result[0])
add .extend(result[1])
addvids.extend(expand(result[2], len(result[1])))
addmeta.extend(expand(result[3], len(result[1])))
else:
raise error
......@@ -170,8 +177,8 @@ def runProcess(proc, dtable, vids):
else:
raise error
if len(add) > 0: dtable.addColumns(add, addvids, addmeta)
if len(remove) > 0: dtable.removeColumns(remove)
if len(add) > 0: dtable.addColumns(add, addvids)
class NoSuchProcessError(Exception):
......@@ -204,6 +211,7 @@ class Process(object):
self.__name = name
self.__args = args
self.__kwargs = kwargs
self.__metaproc = kwargs.pop('metaproc', None)
def __repr__(self):
......@@ -240,12 +248,28 @@ class Process(object):
"""Run the process on the data, passing it the given arguments,
and any arguments that were passed to :meth:`__init__`.
"""
return custom.run(self.__ptype,
result = custom.run(self.__ptype,
self.__name,
*args,
*self.__args,
**self.__kwargs)
if self.__metaproc is not None and \
isinstance(result, tuple) and \
len(result) == 4:
meta = result[3]
mproc = self.__metaproc
try:
meta = [custom.runMetaproc(mproc, m) for m in meta]
except Exception as e:
log.warning('Metadata processing function failed: %s', e)
result = tuple(list(result[:3]) + [meta])
return result
def parseProcesses(procs, ptype):
"""Parses the given string containing one or more comma-separated process
......
......@@ -43,6 +43,12 @@ Furthermore, all processing functions must return one of the following:
- List of ``Series`` to be added
- List of variable IDs for each new ``Series``.
- A ``tuple`` of length 4, containing the above, and:
- List of metadata associated with each of the new ``Series``. This will
be added to the :class:`.Column` objects that represent each of the new
``Series``.
The following processing functions are defined:
.. autosummary::
......@@ -217,6 +223,7 @@ def binariseCategorical(dtable,
toremove = []
newseries = []
newvids = []
newmeta = []
for vid in vids:
......@@ -261,13 +268,14 @@ def binariseCategorical(dtable,
}
newvids .append(vid)
newmeta .append(val)
newseries.append(pd.Series(
col,
index=dtable.index,
name=nameFormat.format(**fmtargs)))
if replace: return toremove, newseries, newvids
else: return newseries, newvids
if replace: return toremove, newseries, newvids, newmeta
else: return [], newseries, newvids, newmeta
@custom.processor()
......
......@@ -30,7 +30,7 @@
"\n",
" ukbparse -h\n",
" \n",
"*The examples in this notebook assume that you have installed `ukbparse` 0.20.1 or newer.*"
"*The examples in this notebook assume that you have installed `ukbparse` 0.21.0 or newer.*"
]
},
{
......
%% Cell type:markdown id: tags:
![image.png](attachment:image.png)
# `ukbparse`
> Paul McCarthy &lt;paul.mccarthy@ndcn.ox.ac.uk&gt; ([WIN@FMRIB](https://www.win.ox.ac.uk/))
`ukbparse` is a command-line program which you can use to extract data from UK BioBank (and other tabular) data.
You can give `ukbparse` one or more input files (e.g. `.csv`, `.tsv`), and it will merge them together, perform some preprocessing, and produce a single output file.
A large number of rules are built into `ukbparse` which are specific to the UK BioBank data set. But you can control and customise everything that `ukbparse` does to your data, including which rows and columns to extract, and which cleaning/processing steps to perform on each column.
The `ukbparse` source code is available at https://git.fmrib.ox.ac.uk/fsl/ukbparse. You can install `ukbparse` into a Python environment using `pip`:
pip install ukbparse
Get command-line help by typing:
ukbparse -h
*The examples in this notebook assume that you have installed `ukbparse` 0.20.1 or newer.*
*The examples in this notebook assume that you have installed `ukbparse` 0.21.0 or newer.*
%% Cell type:code id: tags:
``` bash
ukbparse -V
```
%% Cell type:markdown id: tags:
### Contents
1. [Overview](#Overview)
1. [Import](#1.-Import)
2. [Cleaning](#2.-Cleaning)
3. [Processing](#3.-Processing)
4. [Export](#4.-Export)
2. [Examples](#Examples)
3. [Import examples](#Import-examples)
1. [Selecting variables (columns)](#Selecting-variables-(columns))
1. [Selecting individual variables](#Selecting-individual-variables)
2. [Selecting variable ranges](#Selecting-variable-ranges)
3. [Selecting variables with a file](#Selecting-variables-with-a-file)
4. [Selecting variables from pre-defined categories](#Selecting-variables-from-pre-defined-categories)
2. [Selecting subjects (rows)](#Selecting-subjects-(rows))
1. [Selecting individual subjects](#Selecting-individual-subjects)
2. [Selecting subject ranges](#Selecting-subject-ranges)
3. [Selecting subjects from a file](#Selecting-subjects-from-a-file)
4. [Selecting subjects by variable value](#Selecting-subjects-by-variable-value)
5. [Excluding subjects](#Excluding-subjects)
3. [Selecting visits](#Selecting-visits)
4. [Merging multiple input files](#Merging-multiple-input-files)
1. [Merging by subject](#Merging-by-subject)
2. [Merging by column](#Merging-by-column)
3. [Naive merging](#Naive-merging)
4. [Cleaning examples](#Cleaning-examples)
1. [NA insertion](#NA-insertion)
2. [Variable-specific cleaning functions](#Variable-specific-cleaning-functions)
3. [Categorical recoding](#Categorical-recoding)
4. [Child value replacement](#Child-value-replacement)
5. [Processing examples](#Processing-examples)
1. [Sparsity check](#Sparsity-check)
2. [Redundancy check](#Redundancy-check)
3. [Categorical binarisation](#Categorical-binarisation)
6. [Custom cleaning, processing and loading - ukbparse plugins](#Custom-cleaning,-processing-and-loading---ukbparse-plugins)
1. [Custom cleaning functions](#Custom-cleaning-functions)
2. [Custom processing functions](#Custom-processing-functions)
3. [Custom file loaders](#Custom-file-loaders)
7. [Miscellaneous topics](#Miscellaneous-topics)
1. [Non-numeric data](#Non-numeric-data)
2. [Dry run](#Dry-run)
3. [Built-in rules](#Built-in-rules)
4. [Using a configuration file](#Using-a-configuration-file)
5. [Reporting unknown variables](#Reporting-unknown-variables)
6. [Low-memory mode](#Low-memory-mode)
%% Cell type:markdown id: tags:
# Overview
`ukbparse` performs the following steps:
## 1. Import
All data files are loaded in, unwanted columns and subjects are dropped, and the data files are merged into a single table (a.k.a. data frame). Multiple files can be merged according to an index column (e.g. subject ID). Or, if the input files contain the same columns/subjects, they can be naively concatenated along rows or columns.
## 2. Cleaning
The following cleaning steps are applied to each column:
1. **NA value replacement:** Specific values for some columns are replaced with NA, for example, variables where a value of `-1` indicates *Do not know*.
2. **Variable-specific cleaning functions:** Certain columns are re-formatted - for example, the [ICD10](https://en.wikipedia.org/wiki/ICD-10) disease codes are converted to integer representations.
3. **Categorical recoding:** Certain categorical columns are re-coded.
4. **Child value replacement:** NA values within some columns which are dependent upon other columns may have values inserted based on the values of their parent columns.
## 3. Processing
During the processing stage, columns may be removed, merged, or expanded into additional columns. For example, a categorical column may be expanded into a set of binary columns, one for each category.
A column may also be removed on the basis of being too sparse, or being redundant with respect to another column.
## 4. Export
The processed data can be saved as a `.csv`, `.tsv`, or `.hdf5` file.
%% Cell type:markdown id: tags:
# Examples
Throughout these examples, we are going to use a few command line options, which you will probably **not** normally want to use:
- `-nb` (short for `--no_builtins`): This tells `ukbparse` not to use the built-in processing rules, which are specifically tailored for UK BioBank data.
- `-ow` (short for `--overwrite`): This tells `ukbparse` not to complain if the output file already exists.
- `-q` (short for `--quiet`): This tells `ukbparse` to be quiet.
Without the `-q` option, `ukbparse` can be quite verbose, which can be annoying, but is very useful when things go wrong. A good strategy is to tell `ukbparse` to send all of its output to a log file with the `--log_file` (or `-lf`) option. For example:
ukbparse --log_file log.txt out.tsv in.tsv
Here's the first example input data set, with UK BioBank-style column names:
%% Cell type:code id: tags:
``` bash
cat data_01.tsv
```
%% Cell type:markdown id: tags:
The numbers in each column name represent:
1. The variable ID
2. The visit, for variables which were collected at multiple points in time.
3. The "instance", for multi-valued variables.
Note that one **variable** is typically associated with several **columns**, although we're keeping things simple for this first example - there is only one visit for each variable, and there are no multi-valued variables.
%% Cell type:markdown id: tags:
# Import examples
## Selecting variables (columns)
You can specify which variables you want to load in the following ways, using the `--variable` (`-v` for short) and `--category` (`-c` for short) command line options:
* By variable ID
* By variable ranges
* By a text file which contains the IDs you want to keep.
* By pre-defined variable categories
* By column name
### Selecting individual variables
Simply provide the IDs of the variables you want to extract:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -v 1 -v 5 out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
### Selecting variable ranges
The `--variable`/`-v` option accepts MATLAB-style ranges of the form `start:step:stop` (where the `stop` is inclusive):
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -v 1:3:10 out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
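If the inclusive `stop` is unfamiliar, the range syntax can be sketched in a few lines of Python. The `expand_range` helper below is purely illustrative (it is not the parser that `ukbparse` uses), and it also accepts the two-part `start:stop` form that appears in later examples:

```python
def expand_range(spec):
    """Expand 'start:stop' or 'start:step:stop' into a list of integers.

    The stop value is inclusive, as in MATLAB. Hypothetical helper,
    written only to illustrate the syntax.
    """
    parts = [int(p) for p in spec.split(':')]

    if len(parts) == 1:
        return parts
    elif len(parts) == 2:
        start, stop = parts
        step = 1
    elif len(parts) == 3:
        start, step, stop = parts
    else:
        raise ValueError('Invalid range: {}'.format(spec))

    if step <= 0:
        raise ValueError('Step must be positive: {}'.format(spec))

    # range() excludes its end point, so add one to include stop.
    return list(range(start, stop + 1, step))


print(expand_range('1:3:10'))  # [1, 4, 7, 10]
print(expand_range('1:5'))     # [1, 2, 3, 4, 5]
```

So `-v 1:3:10` in the command above asks for variables 1, 4, 7 and 10.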
### Selecting variables with a file
If your variables of interest are listed in a plain-text file, you can simply pass that file:
%% Cell type:code id: tags:
``` bash
echo -e "1\n6\n9" > vars.txt
ukbparse -nb -q -ow -v vars.txt out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
### Selecting variables from pre-defined categories
Some UK BioBank-specific categories are baked into `ukbparse`, but you can also define your own categories - you just need to create a `.tsv` file, and pass it to `ukbparse` via the `--category_file` (`-cf` for short) option:
%% Cell type:code id: tags:
``` bash
echo -e "ID\tCategory\tVariables" > custom_categories.tsv
echo -e "1\tCool variables\t1:5,7" >> custom_categories.tsv
echo -e "2\tUncool variables\t6,8:10" >> custom_categories.tsv
```
%% Cell type:markdown id: tags:
Use the `--category` (`-c` for short) option to select categories to output. You can refer to categories by their ID:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -cf custom_categories.tsv -c 1 out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
Or by name:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -cf custom_categories.tsv -c uncool out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
### Selecting column names
If you are working with data that has non-UK BioBank style column names, you can use the `--column` (`-co` for short) option to select individual columns by their name, rather than the variable with which they are associated. The `--column` option accepts full column names, and also shell-style wildcard patterns:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -co 4-0.0 -co "??-0.0" out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
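To get a feel for how shell-style wildcards match column names, here is a small sketch using Python's standard `fnmatch` module. The column names are made up, and whether `ukbparse` itself uses `fnmatch` is an assumption; the point is only how a pattern such as `??-0.0` behaves:

```python
from fnmatch import fnmatchcase


def select_columns(columns, patterns):
    """Return the columns which match at least one pattern."""
    return [c for c in columns
            if any(fnmatchcase(c, p) for p in patterns)]


# Hypothetical column names, in UK BioBank style.
columns = ['1-0.0', '4-0.0', '10-0.0', '10-1.0', '12-0.0']

# '?' matches exactly one character, so '??-0.0' only matches
# two-character variable IDs at visit 0, instance 0.
selected = select_columns(columns, ['4-0.0', '??-0.0'])
print(selected)  # ['4-0.0', '10-0.0', '12-0.0']
```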
## Selecting subjects (rows)
`ukbparse` assumes that the first column in every input file is a subject ID. You can specify which subjects you want to load via the `--subject` (`-s` for short) option. You can specify subjects in the same way that you specified variables above, and also:
* By specifying a conditional expression on variable values - only subjects for which the expression evaluates to true will be imported
* By specifying subjects to exclude
### Selecting individual subjects
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -s 1 -s 3 -s 5 out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
### Selecting subject ranges
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -s 2:2:10 out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
### Selecting subjects from a file
%% Cell type:code id: tags:
``` bash
echo -e "5\n6\n7\n8\n9\n10" > subjects.txt
ukbparse -nb -q -ow -s subjects.txt out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
### Selecting subjects by variable value
The `--subject` option accepts *variable expressions* - you can write an expression performing numerical comparisons on variables (denoted with a leading `v`) and combine these expressions using boolean algebra. Only subjects for which the expression evaluates to true will be imported. For example, to only import subjects where variable 1 is greater than 10, and variable 2 is less than 70, you can type:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -sp -s "v1 > 10 && v2 < 70" out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
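The expression in the command above is effectively a row filter. As a rough sketch in plain Python (with made-up data, and with no claim about how `ukbparse` evaluates expressions internally), `v1 > 10 && v2 < 70` behaves like this:

```python
# Made-up rows: one dictionary per subject.
rows = [
    {'eid': 1, 'v1': 5,  'v2': 30},
    {'eid': 2, 'v1': 25, 'v2': 60},
    {'eid': 3, 'v1': 40, 'v2': 90},
    {'eid': 4, 'v1': 11, 'v2': 69},
]


def keep(row):
    # Python equivalent of the expression "v1 > 10 && v2 < 70"
    return row['v1'] > 10 and row['v2'] < 70


kept = [row['eid'] for row in rows if keep(row)]
print(kept)  # [2, 4]
```

Subject 1 fails the first condition, subject 3 fails the second, and only subjects 2 and 4 satisfy both.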
The following symbols can be used in variable expressions:
| Symbol | Meaning |
|---------------------------|--------------------------|
| `==` | equal to |
| `!=` | not equal to |
| `>` | greater than |
| `>=` | greater than or equal to |
| `<` | less than |
| `<=` | less than or equal to |
| `na` | N/A |
| `&&` | logical and |
| <code>&#x7c;&#x7c;</code> | logical or |
| `~` | logical not |
| `()` | To denote precedence |
### Excluding subjects
The `--exclude` (`-ex` for short) option allows you to exclude subjects - it accepts individual IDs, an ID range, or a file containing IDs. The `--exclude`/`-ex` option takes precedence over the `--subject`/`-s` option:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -s 1:8 -ex 5:10 out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
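The precedence rule amounts to a set difference. A minimal sketch of the command above, assuming the input file contains every requested subject:

```python
requested = set(range(1, 9))   # -s 1:8   (stop is inclusive)
excluded = set(range(5, 11))   # -ex 5:10 (stop is inclusive)

# Exclusion wins: a subject listed in both sets is dropped.
kept = sorted(requested - excluded)
print(kept)  # [1, 2, 3, 4]
```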
## Selecting visits
%% Cell type:markdown id: tags:
Many variables in the UK BioBank data contain observations at multiple points in time, or visits. `ukbparse` allows you to specify which visits you are interested in. Here is an example data set with variables that have data for multiple visits (remember that the second number in the column names denotes the visit):
%% Cell type:code id: tags:
``` bash
cat data_02.tsv
```
%% Cell type:markdown id: tags:
We can use the `--visit` (`-vi` for short) option to get just the last visit for each variable:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -vi last out.tsv data_02.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
You can also specify which visit you want by its number:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -vi 1 out.tsv data_02.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
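Selecting the last visit can be pictured as follows. This sketch parses UK BioBank-style column names (`variable-visit.instance`) and keeps, for each variable, the columns with the highest visit number. The column names are made up, and the interpretation of `last` as "highest visit number per variable" is an assumption based on the description above:

```python
import re

# Matches names of the form <variable>-<visit>.<instance>
COLUMN_PATTERN = re.compile(r'^(\d+)-(\d+)\.(\d+)$')


def last_visit_columns(columns):
    """Return the columns belonging to the last visit of each variable."""
    parsed = []
    for name in columns:
        match = COLUMN_PATTERN.match(name)
        if match is None:
            continue  # e.g. the subject ID column
        vid, visit, _instance = (int(g) for g in match.groups())
        parsed.append((name, vid, visit))

    # Find the highest visit number for each variable.
    last = {}
    for _name, vid, visit in parsed:
        last[vid] = max(last.get(vid, visit), visit)

    return [name for name, vid, visit in parsed if visit == last[vid]]


columns = ['eid', '1-0.0', '1-1.0', '2-0.0', '2-1.0', '2-2.0', '3-0.0']
print(last_visit_columns(columns))  # ['1-1.0', '2-2.0', '3-0.0']
```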
## Merging multiple input files
If your data is split across multiple files, you can specify how `ukbparse` should merge them together.
### Merging by subject
For example, let's say we have these two input files (shown side-by-side):
%% Cell type:code id: tags:
``` bash
echo " " | paste data_03.tsv - data_04.tsv
```
%% Cell type:markdown id: tags:
Note that each file contains different variables, and different, but overlapping, subjects. By default, when you pass these files to `ukbparse`, it will output the intersection of the two files (more formally known as an *inner join*), i.e. subjects which are present in both files:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow out.tsv data_03.tsv data_04.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
If you want to keep all subjects, you can instruct `ukbparse` to output the union (a.k.a. *outer join*) via the `--merge_strategy` (`-ms` for short) option:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -ms outer out.tsv data_03.tsv data_04.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
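The difference between the two merge strategies can be sketched in plain Python. The data here is made up (it is not the contents of `data_03.tsv` and `data_04.tsv`), and the `merge` function is only an illustration of inner and outer joins on subject ID, not the `ukbparse` implementation:

```python
# Each "file" maps subject ID -> {column name: value}.
file_a = {1: {'1-0.0': 10}, 2: {'1-0.0': 20}, 3: {'1-0.0': 30}}
file_b = {2: {'2-0.0': 200}, 3: {'2-0.0': 300}, 4: {'2-0.0': 400}}


def merge(a, b, strategy='inner'):
    """Merge two tables on subject ID."""
    if strategy == 'inner':
        subjects = a.keys() & b.keys()   # subjects present in both
    elif strategy == 'outer':
        subjects = a.keys() | b.keys()   # subjects present in either
    else:
        raise ValueError('Unknown strategy: {}'.format(strategy))

    columns = sorted({c for table in (a, b)
                      for row in table.values()
                      for c in row})

    merged = {}
    for eid in sorted(subjects):
        row = {**a.get(eid, {}), **b.get(eid, {})}
        # Missing values are represented by None.
        merged[eid] = {c: row.get(c) for c in columns}
    return merged


print(sorted(merge(file_a, file_b, 'inner')))  # [2, 3]
print(sorted(merge(file_a, file_b, 'outer')))  # [1, 2, 3, 4]
```

With the outer join, subjects 1 and 4 are kept, but each has a missing value for the column that came from the file they do not appear in.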
### Merging by column
Your data may be organised in a different way. For example, these next two files contain different groups of subjects, but overlapping columns:
%% Cell type:code id: tags:
``` bash
echo " " | paste data_05.tsv - data_06.tsv
```
%% Cell type:markdown id: tags:
In this case, we need to tell `ukbparse` to merge along the row axis, rather than along the column axis. We can do this with the `--merge_axis` (`-ma` for short) option:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -ma rows out.tsv data_05.tsv data_06.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
Again, if we want to retain all columns, we can tell `ukbparse` to perform an outer join with the `-ms` option:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -ma rows -ms outer out.tsv data_05.tsv data_06.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
### Naive merging
Finally, your data may be organised such that you simply want to "paste", or concatenate them together, along either rows or columns. For example, your data files might look like this:
%% Cell type:code id: tags:
``` bash
echo " " | paste data_07.tsv - data_08.tsv
```
%% Cell type:markdown id: tags:
Here, we have columns for different variables on the same set of subjects, and we just need to concatenate them together horizontally. We do this by using `--merge_strategy naive` (`-ms naive` for short):
%% Cell type:code id: tags:
``` bash
ukbparse -q -ow -ms naive out.tsv data_07.tsv data_08.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
For files which need to be concatenated vertically, such as these:
%% Cell type:code id: tags:
``` bash
echo " " | paste data_09.tsv - data_10.tsv
```
%% Cell type:markdown id: tags:
We need to tell `ukbparse` which axis to concatenate along, again using the `-ma` option:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -ms naive -ma rows out.tsv data_09.tsv data_10.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
# Cleaning examples
Once the data has been imported, a sequence of cleaning steps is applied to each column.
## NA insertion
For some variables it may make sense to discard or ignore certain values. For example, if an individual selects *"Do not know"* to a question such as *"How much milk did you drink yesterday?"*, that answer will be coded with a specific value (e.g. `-1`). It does not make any sense to include these values in most analyses, so `ukbparse` can be used to mark such values as *Not Available (NA)*.
A large number of NA insertion rules, specific to UK BioBank variables, are coded into `ukbparse` (although they will not be used in these examples, as we are using the `--no_builtins`/`-nb` option). You can also specify your own rules via the `--na_values` (`-nv` for short) option.
Let's say we have this data set:
%% Cell type:code id: tags:
``` bash
cat data_11.tsv
```
%% Cell type:markdown id: tags:
For variable 1, we want to ignore values of -1, for variable 2 we want to ignore -1 and 0, and for variable 3 we want to ignore 1 and 2:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -nv 1 " -1" -nv 2 " -1,0" -nv 3 "1,2" out.tsv data_11.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
> The `--na_values` option expects two arguments:
> * The variable ID
> * A comma-separated list of values to replace with NA
%% Cell type:markdown id: tags:
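Conceptually, NA insertion is a per-variable lookup. The sketch below applies the same three rules as the command above to some made-up values (not the contents of `data_11.tsv`), using `None` to stand for NA:

```python
# Variable ID -> values which should be treated as missing.
na_values = {1: [-1], 2: [-1, 0], 3: [1, 2]}

# Made-up data: variable ID -> one value per subject.
data = {
    1: [4, -1, 7],
    2: [0, 3, -1],
    3: [1, 2, 9],
}

cleaned = {
    vid: [None if v in na_values.get(vid, []) else v for v in values]
    for vid, values in data.items()
}

print(cleaned[1])  # [4, None, 7]
print(cleaned[2])  # [None, 3, None]
print(cleaned[3])  # [None, None, 9]
```

Note that the rules are specific to each variable: a value of `1` is replaced in variable 3, but would be left alone in variables 1 and 2.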
## Variable-specific cleaning functions
A small number of cleaning/preprocessing functions are built into `ukbparse`, which can be applied to specific variables. For example, some variables in the UK BioBank contain ICD10 disease codes, which may be more useful if converted to a numeric format. Imagine that we have some data with ICD10 codes:
%% Cell type:code id: tags:
``` bash
cat data_12.tsv
```
%% Cell type:markdown id: tags:
We can use the `--clean` (`-cl` for short) option with the built-in `convertICD10Codes` cleaning function to convert the codes to a numeric representation:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -cl 1 convertICD10Codes out.tsv data_12.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
> The `--clean` option expects two arguments:
> * The variable ID
> * The cleaning function to apply. Some cleaning functions accept arguments - refer to the command-line help for a summary of available functions.
>
> You can define your own cleaning functions by passing them in as a `--plugin_file` (see the [section on custom plugins below](#Custom-cleaning,-processing-and-loading---ukbparse-plugins)).
### Example: flattening hierarchical data
Several variables in the UK Biobank (including the ICD10 disease categorisations) are organised in a hierarchical manner - each value is a child of a more general parent category. The `flattenHierarchical` cleaning function can be used to replace each value in a data set with the value that corresponds to a parent category. Let's apply this to our example ICD10 data set.
> `ukbparse` needs to know the data coding of hierarchical variables, as it uses this to look up an internal table containing the hierarchy information. So in this example we are creating a dummy variable table file which tells `ukbparse` that the example data uses [data coding 19](https://biobank.ctsu.ox.ac.uk/crystal/coding.cgi?id=19), which is the ICD10 data coding.
%% Cell type:code id: tags:
``` bash
echo -e "ID\tType\tDescription\tDataCoding\tNAValues\tRawLevels\tNewLevels\tParentValues\tChildValues\tClean
1\t\t\t19
" > variables.tsv
ukbparse -nb -q -ow -vf variables.tsv -cl 1 flattenHierarchical out.tsv data_12.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
### Aside: ICD10 mapping file
`ukbparse` has a feature specific to these ICD10 disease categorisations - you can use the `--icd10_map_file` (`-imf` for short) option to tell `ukbparse` to save a file which contains a list of all ICD10 codes that were present in the input data, and the corresponding numerical codes that `ukbparse` generated:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -cl 1 convertICD10Codes -imf icd10_codes.tsv out.tsv data_12.tsv
cat icd10_codes.tsv
```
%% Cell type:markdown id: tags:
## Categorical recoding
%% Cell type:markdown id: tags:
You may have some categorical data which is coded in an awkward manner, such as in this example, which encodes the amount of some item that an individual has consumed:
<img src="attachment:image.png" width="100"/>
You can use the `--recoding` (`-re` for short) option to recode data like this into something more useful. For example, given this data:
%% Cell type:code id: tags:
``` bash
cat data_13.tsv
```
%% Cell type:markdown id: tags:
Let's recode it so that the values are monotonic:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -re 1 "300,444,555" "3,0.25,0.5" out.tsv data_13.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
The `--recoding` option expects three arguments:
* The variable ID
* A comma-separated list of the values to be replaced
* A comma-separated list of the values to replace them with
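Conceptually, recoding is a value-by-value lookup. The following is a minimal sketch of that idea in plain Python, not the `ukbparse` implementation; the `recode` function and the sample column are made up for illustration (the column is not the contents of `data_13.tsv`).

```python
def recode(values, raw_levels, new_levels):
    """Replace each raw level with its new level, leaving other values alone."""
    mapping = dict(zip(raw_levels, new_levels))
    return [mapping.get(value, value) for value in values]


# Made-up column using the same codes as the command above.
column = [1, 2, 300, 444, 555, None]
recoded = recode(column, [300, 444, 555], [3, 0.25, 0.5])
print(recoded)  # [1, 2, 3, 0.25, 0.5, None]
```

Values that do not appear in the list of raw levels, including missing values, pass through unchanged in this sketch.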
%% Cell type:markdown id: tags:
## Child value replacement
Imagine that we have these two questions:
* **1**: *Do you currently smoke cigarettes?*
* **2**: *How many cigarettes did you smoke yesterday?*
Now, question 2 was only asked if the answer to question 1 was *"Yes"*. So for all individuals who answered *"No"* to question 1, we will have a missing value for question 2. But for some analyses, it would make more sense to have a value of 0, rather than NA, for these subjects.
`ukbparse` can handle these sorts of dependencies by way of *child value replacement*. For question 2, we can define a conditional expression such that, wherever question 2 is NA and question 1 is *"No"*, a value of 0 is inserted into question 2.
This scenario is demonstrated in this example data set (where, for question 1, values of `1` and `0` represent *"Yes"* and *"No"* respectively):
%% Cell type:code id: tags:
``` bash
cat data_14.tsv
```
%% Cell type:markdown id: tags:
We can fill in the values for variable 2 by using the `--child_values` (`-cv` for short) option:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -cv 2 "v1 == 0" "0" out.tsv data_14.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
> The `--child_values` option expects three arguments:
> * The variable ID
> * An expression evaluating some condition on the parent variable(s)
> * A value to replace NA with where the expression evaluates to true.
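
The logic of child value replacement can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the `ukbparse` implementation: the `fill_child` function is hypothetical, missing values are represented by `None`, and the parent condition is passed as a Python callable rather than as an expression string such as `"v1 == 0"`.

```python
def fill_child(child, parent, condition, fill_value):
    """Fill missing child values wherever the parent condition holds."""
    return [
        fill_value if c is None and condition(p) else c
        for c, p in zip(child, parent)
    ]


# Made-up answers: question 1 (1 = "Yes", 0 = "No") and question 2.
q1 = [1, 0, 0, 1, None]
q2 = [20, None, None, None, None]

filled = fill_child(q2, q1, lambda v1: v1 == 0, 0)
print(filled)  # [20, 0, 0, None, None]
```

Note that a missing value in question 2 is left alone when question 1 is *"Yes"* or is itself missing, because the condition is not satisfied in either case.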
%% Cell type:markdown id: tags:
# Processing examples
After every column has been cleaned, the entire data set undergoes a series of processing steps. The processing stage may result in columns being removed or manipulated, or new columns being added.
The processing stage can be controlled with these options:
* `--prepend_process` (`-ppr` for short): Apply a processing function before the built-in processing
* `--append_process` (`-apr` for short): Apply a processing function after the built-in processing
(But remember that in these examples we are using the `--no_builtins`/`-nb` option, so the built-in processing steps are not applied.)
The `--prepend_process` and `--append_process` options require two arguments:
* The variable ID(s) to apply the function to, or `all` to denote all variables.
* The processing function to apply. The available processing functions are listed in the command line help, or you can write your own and pass it in as a plugin file ([see below](#Custom-cleaning,-processing-and-loading----ukbparse-plugins)).
## Sparsity check
The `removeIfSparse` process will remove columns that are deemed to have too many missing values. If we take this data set:
%% Cell type:code id: tags:
``` bash
cat data_15.tsv
```
%% Cell type:markdown id: tags:
Imagine that our analysis requires at least 8 values per variable to work. We can use the `minpres` option of the `removeIfSparse` process to drop any columns which do not meet this threshold:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -apr all "removeIfSparse(minpres=8)" out.tsv data_15.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
You can also specify `minpres` as a proportion, rather than an absolute number, e.g.:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -apr all "removeIfSparse(minpres=0.65, abspres=False)" out.tsv data_15.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
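The sparsity test applied by `removeIfSparse` can be approximated in plain Python as shown below. This is a sketch rather than the real implementation: the `is_sparse` function and the sample data are made up, and treating a column that has exactly `minpres` values as acceptable is an assumption. Only the `minpres` and `abspres` names are taken from the examples above.

```python
def is_sparse(column, minpres, abspres=True):
    """Return True if a column has fewer present values than minpres.

    If abspres is False, minpres is interpreted as a proportion of rows.
    """
    present = sum(value is not None for value in column)
    if not abspres:
        present = present / len(column) if column else 0.0
    return present < minpres


# Made-up data: 10, 6 and 7 present values out of 10 rows.
data = {
    "1-0.0": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "2-0.0": [1, 2, 3, 4, 5, 6, None, None, None, None],
    "3-0.0": [1, 2, 3, 4, 5, 6, 7, None, None, None],
}

kept_abs = [name for name, col in data.items() if not is_sparse(col, 8)]
kept_prop = [name for name, col in data.items()
             if not is_sparse(col, 0.65, abspres=False)]
print(kept_abs)   # ['1-0.0']
print(kept_prop)  # ['1-0.0', '3-0.0']
```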
## Redundancy check
You may wish to remove columns which contain redundant information. The `removeIfRedundant` process calculates the pairwise correlation between all columns, and removes columns with a correlation above a threshold that you provide. Imagine that we have this data set:
%% Cell type:code id: tags:
``` bash
cat data_16.tsv
```
%% Cell type:markdown id: tags:
The data in column `2-0.0` is effectively equivalent to the data in column `1-0.0`, so is not of any use to us. We can tell `ukbparse` to remove it like so:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -apr all "removeIfRedundant(0.9)" out.tsv data_16.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
The `removeIfRedundant` process can also calculate the correlation of the patterns of missing values between variables. Consider this example:
%% Cell type:code id: tags:
``` bash
cat data_17.tsv
```
%% Cell type:markdown id: tags:
All three columns are highly correlated, but the pattern of missing values in column `3-0.0` is different to that of the other columns.
If we use the `nathres` option, `ukbparse` will only remove columns where both the correlation of the present values and the correlation of the missing-value patterns meet their thresholds. Note that the column which contains more missing values will be the one that gets removed:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -apr all "removeIfRedundant(0.9, nathres=0.6)" out.tsv data_17.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
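The two-part redundancy test described above can be sketched in plain Python. This is an illustration under several assumptions, not the `ukbparse` implementation: the `pearson` and `is_redundant` functions and the sample columns are made up, the value correlation is computed only over rows where both columns are present, and its absolute value is compared against the threshold. Only the `nathres` name comes from the example above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # constant input: treat as uncorrelated
    return cov / (sx * sy)


def is_redundant(a, b, corrthres, nathres=None):
    """Check value correlation and, optionally, missingness correlation."""
    pairs = [(x, y) for x, y in zip(a, b)
             if x is not None and y is not None]
    if len(pairs) < 2:
        return False
    corr = pearson([p[0] for p in pairs], [p[1] for p in pairs])
    if abs(corr) <= corrthres:
        return False
    if nathres is None:
        return True
    # Correlate the missing-value patterns (1.0 = missing, 0.0 = present).
    na_a = [float(x is None) for x in a]
    na_b = [float(y is None) for y in b]
    return pearson(na_a, na_b) > nathres


col1 = [1, 2, 3, 4, 5, 6, None, None]
col2 = [2, 4, 6, 8, 10, 12, None, None]   # same missing pattern as col1
col3 = [None, None, 3, 4, 5, 6, 7, 8]     # different missing pattern

print(is_redundant(col1, col2, 0.9, nathres=0.6))  # True
print(is_redundant(col1, col3, 0.9))               # True
print(is_redundant(col1, col3, 0.9, nathres=0.6))  # False
```

The sketch only decides whether a pair of columns is redundant; choosing which of the two to drop (the one with more missing values, as noted above) is left out.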
## Categorical binarisation
The `binariseCategorical` process takes a column containing categorical labels, and replaces it with
a set of new binary columns, one for each category. Imagine that we have this data:
%% Cell type:code id: tags:
``` bash
cat data_18.tsv
```
%% Cell type:markdown id: tags:
We can use the `binariseCategorical` process to split column `1-0.0` into a separate column for each category:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -apr 1 "binariseCategorical" out.tsv data_18.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
There are a few options to `binariseCategorical`, including controlling whether the original column is removed, and also the naming of the newly created columns:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -apr 1 "binariseCategorical(replace=False, nameFormat='{vid}:{value}')" out.tsv data_18.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
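The transformation performed by `binariseCategorical` can be sketched in plain Python. This is an illustration only: the `binarise` function and the sample column are made up, the default column naming simply reuses the `'{vid}:{value}'` format string from the command above, and marking missing values as `0` in every new column is an assumption rather than documented behaviour.

```python
def binarise(column, vid, name_format="{vid}:{value}"):
    """Return one binary column per category found in `column`."""
    categories = sorted({value for value in column if value is not None})
    return {
        name_format.format(vid=vid, value=category):
            [int(value == category) for value in column]
        for category in categories
    }


# Made-up categorical column for variable 1.
labels = [1, 2, 1, 3, None]
binary = binarise(labels, vid=1)

for name, values in binary.items():
    print(name, values)
```

Each new column contains `1` where the original column held that category and `0` elsewhere.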
# Custom cleaning, processing and loading - `ukbparse` plugins
If you want to apply some specific cleaning or processing function to a variable, you can code your functions up in Python, and then tell `ukbparse` to apply them.
As an example, let's say we have some data like this:
%% Cell type:code id: tags:
``` bash
cat data_19.tsv
```
%% Cell type:markdown id: tags:
## Custom cleaning functions
For our analysis, we are only interested in the even values in columns 1 and 2. Let's write a cleaning function which replaces all odd values with NA:
%% Cell type:code id: tags:
``` bash
cat plugin_1.py | pygmentize
```
%% Cell type:markdown id: tags:
To use our custom cleaner function, we simply pass our plugin file to `ukbparse` using the `--plugin_file` (`-p` for short) option:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -p plugin_1.py -cl 1 drop_odd_values -cl 2 drop_odd_values out.tsv data_19.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
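Independently of the `ukbparse` plugin interface, the core logic of a cleaner that discards odd values can be sketched in plain Python. The `drop_odd` function below is hypothetical and operates on an ordinary list, with `None` standing in for NA; a real plugin such as `drop_odd_values` must additionally be written against the `ukbparse` plugin interface, which is not shown in this sketch.

```python
def drop_odd(values):
    """Replace odd values with None, keeping even values as they are."""
    return [
        value if value is not None and value % 2 == 0 else None
        for value in values
    ]


cleaned = drop_odd([1, 2, 3, 4, None, 10])
print(cleaned)  # [None, 2, None, 4, None, 10]
```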
## Custom processing functions
Recall that **cleaning** functions are applied independently to each column, whereas **processing** functions may be applied to multiple columns simultaneously, and may add and/or remove columns. Let's say we want to derive a new column from columns `1-0.0` and `2-0.0` in our example data set. Our plugin file might look like this:
%% Cell type:code id: tags:
``` bash
cat plugin_2.py | pygmentize
```
%% Cell type:markdown id: tags:
Again, to use our plugin, we pass it to `ukbparse` via the `--plugin_file`/`-p` option:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -p plugin_2.py -apr "1,2" "sum_squares" out.tsv data_19.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
## Custom file loaders
You might want to load some auxiliary data which is in an awkward format that cannot be automatically parsed by `ukbparse`. For example, you may have a file which has acquisition date information separated into *year*, *month* and *day* columns, e.g.:
%% Cell type:code id: tags:
``` bash
cat data_20.tsv
```
%% Cell type:markdown id: tags:
These three columns would be better loaded as a single column. So we can write a plugin to load this file for us. We need to write two functions:
* A "sniffer" function, which returns information about the columns contained in the file
* A "loader" function which loads the file, returning it as a `pandas.DataFrame`.
%% Cell type:code id: tags:
``` bash
cat plugin_3.py | pygmentize