Commit 19a96d72 authored by Paul McCarthy 🚵

Merge branch 'bf/recoding_bug' into 'master'

Bf/recoding bug

See merge request fsl/ukbparse!121
parents 98a9f0b7 b8974248
Pipeline #3670 passed with stages in 18 minutes and 43 seconds
......@@ -2,6 +2,19 @@
======================
0.20.0 (Monday 6th May 2019)
----------------------------
Fixed
^^^^^
* Fixed a bug in the categorical recoding rules for Data Coding `100012
<https://biobank.ctsu.ox.ac.uk/crystal/coding.cgi?id=100012>`_.
0.19.2 (Friday 26th April 2019)
-------------------------------
......
......@@ -43,9 +43,6 @@ Or from ``conda-forge``::
conda install -c conda-forge ukbparse
Comprehensive documentation does not yet exist.
Introductory notebook
---------------------
......@@ -127,7 +124,7 @@ stored in the following files:
You can customise or replace these files as you see fit. You can also pass
your own versions of these files to ``ukbparse`` via the ``--variable_file``,
``--datacoding_file``, ``--type_file`` and ``--processing_file`` command-line
options respectively.``ukbparse`` will load all variable and datacoding files,
options respectively. ``ukbparse`` will load all variable and datacoding files,
and merge them into a single table which contains the cleaning rules for each
variable.
......
Data importing
--------------
In order to load the `input data <inputs.rst>`_, ``ukbparse`` follows these
steps:
1. Load variable, data coding, and processing tables
2. Fill in variable table from data coding table as needed (the ``NAValues``,
``RawLevels`` and ``NewLevels`` columns). After this step, the data coding
table is no longer used - all information is in the variable table.
3. Parse dependency expressions (``ParentValues``) to figure out dependencies.
4. Load data. Only those variables which are listed in the variable table, and
which are not marked for removal in the processing table, are loaded.
.. todo:: Need to set data types - cast categoricals, minimal numpy data
type
5. Re-encode missing values (``NAValues``, e.g. (-1,-3) -> nan)
6. Apply pre-processing steps - refer to the page on `processing
<processing.rst>`_
7. Apply dependency expressions. In order for expressions to evaluate
correctly, we need to apply them in order from most "childish"
(i.e. variables with no children) to least childish (variables with no
parents).
8. Re-encode variable values (replace ``RawLevels`` with ``NewLevels``,
e.g. 555 -> 0.5)
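
As a rough illustration of steps 5 and 8, the re-encoding can be thought of as
simple value replacement on each column. The following sketch uses pandas
(which ``ukbparse`` uses internally), with made-up values - it is not the
actual ``ukbparse`` implementation::

    import numpy as np
    import pandas as pd

    # a made-up column for a single variable
    col = pd.Series([1, -1, 555, -3, 2])

    # step 5: re-encode missing values (NAValues), e.g. (-1, -3) -> nan
    col = col.replace([-1, -3], np.nan)

    # step 8: re-encode variable values (RawLevels -> NewLevels), e.g. 555 -> 0.5
    col = col.replace({555: 0.5})

    # col now contains [1.0, nan, 0.5, nan, 2.0]
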
``ukbparse`` inputs
===================
The primary input to ``ukbparse`` is a file which contains a two-dimensional
matrix of data for ``nsubjects`` (the rows) and ``nvariables`` (the columns).
It is assumed that the first row in the file contains identifiers for each
variable.
.. todo:: Describe ID format and input data more - visits and
          multi-valued variables.
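
For illustration, the first few rows and columns of an input file might look
like this (the values are taken from the toy data set used in the
introductory notebook; column names are of the form
``<variable>-<visit>.<instance>``)::

    eid  1-0.0  2-0.0  3-0.0
    1    31     65     10
    2    56     52     52
    3    45     84     20
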
Additionally, a number of tables are expected, which contain information about
the variables in the data, and specify processing that is to be applied to
them:
- The *data coding* table
- The *variable* table
- The *processing* table
Each of these tables is expected to be specified as a ``tsv`` (tab-separated
value) file.
Data types
----------
The following data types may be present in the input data:
========================== ===============================================
Type Description
========================== ===============================================
**Sequence** A sequentially increasing integer
**Categorical (single)** A single selection from a set of categories
**Categorical (multiple)** One or more selections from a set of categories
**Integer** An integer number
**Continuous**             A decimal number
**Text** Text
**Time** A time
**Date**                   A date
**Compound** A combination of other types
========================== ===============================================
.. note:: Multiple items may be present for any type - for categoricals, this
          is explicit in the name ("categorical (single)" or "categorical
          (multiple)").
Data coding table
-----------------
Many (but not all) variables are assumed to adhere to a specific "data
coding", which defines a sub-set of values the variable may take.
The data coding table is a ``tsv`` file which contains a row for each known
data coding, defining how variables with that coding should be imported and
stored.
Each row in the data coding table has the following columns:
============= ================================================================
Name Meaning
============= ================================================================
**ID** An integer ID for this data coding.
**Type** Name of the type after it has been loaded - one of
- binary
- ordinal
- continuous
- categorical (single)
- categorical (multiple)
- text
**NumValues** Number of values defined in this data coding.
**NAValues** Comma-separated list of values which are to be recoded to *NA*
(not available).
**RawLevels** Comma-separated list of values denoting values which are to be
recoded to corresponding **NewLevels** on import. ``NA`` may be
used. Need not be exhaustive.
**NewLevels** Comma-separated list of values, one for each value in
**RawLevels**, denoting new recoded values. ``NA`` may be used.
============= ================================================================
.. note:: The **Type** column may not be necessary - taken at face-value, all
of the data codings are categorical, but they may be used in
variables of different types (e.g. data coding `100291 <http://biobank.ctsu.ox.ac.uk/crystal/coding.cgi?id=100291>`_
is used in
continuous variables to encode missing values).
The **NumValues** column may also be unnecessary for similar
reasons. It is intended to be used as a hint in determining the
data type required to store a particular variable, but the number
of values in the data coding does not necessarily correspond to the
number of values that a variable using that data coding may take.
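
For illustration, a hypothetical data coding row might look like this (the ID
and values are made up; columns are shown space-aligned for readability, but
are tab-separated in the file)::

    ID    Type                  NumValues  NAValues  RawLevels  NewLevels
    9999  categorical (single)  5          -1,-3     555        0.5

This would define a data coding for which -1 and -3 are recoded to *NA*, and
555 is recoded to 0.5.
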
Variable table
--------------
The variable table is a ``tsv`` file which contains a row for every known
variable, and any special pre-processing which should be applied to that
variable.
Each row in the variable table has the following columns:
================ =============================================================
Name Meaning
================ =============================================================
**ID** An integer ID for this variable, called **UDI** in Biobank-ese.
**Type** Data type (see above).
**Description** Description.
**DataCoding** Data coding ID (may be *NA*).
**NAValues** May be used to override **NAValues** from data coding table.
**RawLevels** May be used to override **RawLevels** from data coding table.
**NewLevels** May be used to override **NewLevels** from data coding table.
**ParentValues** One or more rules to be used where this variable has a
missing value. The rules are expressions defining values that
parent variables of this variable may take to cause the value
for this variable to be replaced. See below.
**ChildValues** A list of comma-separated values (one for each rule in
**ParentValues**), specifying the replacement value.
**Preprocess** List of pre-processing steps to apply to this variable,
separated with commas.
================ =============================================================
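
Similarly, a hypothetical variable table row might look like this (all values
are made up; columns are shown space-aligned for readability, but are
tab-separated in the file)::

    ID    Type                  Description       DataCoding  NAValues  RawLevels  NewLevels  ParentValues  ChildValues  Preprocess
    1234  Categorical (single)  Example variable  NA          -1,-3     555        0.5        v4567 == 1    0            keepVisits('last')

Here, -1 and -3 are treated as *NA*, 555 is recoded to 0.5, a missing value is
replaced with 0 when variable 4567 has a value of 1, and only the last visit
is kept.
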
**NAValues**, **RawLevels**, and **NewLevels**
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The **NAValues**, **RawLevels**, and **NewLevels** columns only need to be
used for variables which do not have a data coding, or which have *NA* or
recoding conventions that differ from their data coding.
**ParentValues** and **ChildValues**
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The **ParentValues** and **ChildValues** columns may be used when values for
this variable are missing, but the values of its parent variables can be used
to infer a replacement value.
**ParentValues** comprises one or more conditional expressions which define
the values of the parent variables that cause the child value to be replaced.
**ChildValues** comprises one or more values, corresponding to the expressions
in **ParentValues**, and defining the replacement value. Multiple expressions
in **ParentValues**, and values in **ChildValues**, must be separated with
commas. All spaces are ignored.
A single expression in the **ParentValues** column may contain one or more
comparisons of a parent variable ID to a numeric value. The following
comparison operators may be used:
======== ========================
Operator Meaning
======== ========================
``==``   Equal to
``!=``   Not equal to
``>``    Greater than
``>=``   Greater than or equal to
``<``    Less than
``<=``   Less than or equal to
======== ========================
Parent variables are denoted by the letter ``'v'`` followed by their numeric
ID, and must always appear on the left side of the statement.
For example, the following comparison will evaluate to true when the
variable with ID 1234 has a value less than or equal to 6::
v1234 <= 6
.. todo:: Do we need comparisons on non-numeric, compound, categorical, and
array variables? Only allow == and != for non-numerics, and to
denote presence or absence in a multiple/array variable?
The values used in an expression must be specified as they are *before* the
parent variable's **RawLevels** are recoded to **NewLevels**, but *after* its
**NAValues** have been recoded.
Multiple comparison statements may be combined with logical operators ``&&``
(logical *and*), ``||`` (logical *or*), and ``~`` (logical *not*). Round
brackets ``()`` may be used to enforce precedence.
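
For example, the following expression evaluates to true when the variable
with ID 1234 has a value less than or equal to 6, and the variable with ID
4567 does not have a value of 2::

    (v1234 <= 6) && (v4567 != 2)
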
Each expression in **ParentValues** must be accompanied by a value in
**ChildValues** - this is used to fill in the variable value in the event the
expression evaluates to true.
Multiple expressions are applied in succession, i.e. a second
expression/value pair may overwrite the value of the first. Note that the
replacement value given to the variable will be subject to any re-encoding
specified in the **RawLevels** and **NewLevels** fields, as this stage is
applied after the parent variable replacement stage. See the page on
`importing <importing.rst>`_ for more details.
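
As a hypothetical illustration (the variable IDs and values are made up), a
variable might have the following **ParentValues** and **ChildValues**
entries::

    ParentValues:  v4567 == 1 || v4567 == 2, v8910 >= 3
    ChildValues:   0, 1

If this variable's value is missing, it is set to 0 when variable 4567 has a
value of 1 or 2, and then set to 1 when variable 8910 has a value greater
than or equal to 3 (the second rule overwriting the first if both expressions
evaluate to true).
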
The **Preprocess** column
-------------------------
The **Preprocess** column contains one or more preprocessing steps that
should be applied to this variable, separated with commas. More details on the
available preprocessing steps and their format can be found on the
`processing <processing.rst>`_ page.
Processing table
----------------
The processing table is a ``tsv`` file which defines special processing rules
that should be applied to specific variables.
Each row in the processing table has the following columns:
.. todo:: Allow multiple comma-separated variable IDs on each line
============== ================================================================
Name Meaning
============== ================================================================
**ID** Variable ID
**Process** List of processing functions to apply to this variable,
separated with commas.
============== ================================================================
See the page on `processing <processing.rst>`_ for more details.
Data processing
===============
Variable-specific preprocessing and processing are defined in the processing
table. Pre-processing occurs during the `data import stage <importing.rst>`_,
and processing occurs immediately afterwards.
The preprocessing and processing steps for a variable are each specified as
comma-separated lists of processing function names. For example::
process1, process2('arg1'), process3(2, 3)
Preprocessing steps
-------------------
The following preprocessing options may be specified in the variable
table. These are applied during data import.
============== ============================================================
Name Meaning
============== ============================================================
``remove`` Remove the variable - it is not loaded at all
``keepVisits`` Discard all but the specified visits. Visits are specified
as one or more integer arguments, or one of the constants
``'first'`` and ``'last'``.
``fillVisits`` Fill NA values for a given visit from other visits. Takes a
single optional argument, one of ``'mode'`` (the default),
or ``'mean'``.
============== ============================================================
.. todo:: examples
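
For example, a made-up **Preprocess** entry combining two of the functions
above might look like this::

    keepVisits(0, 1), fillVisits('mean')

This would keep only the first two visits, and fill missing values for a
visit from the other visits, using the mean.
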
Processing steps
----------------
================ =============================================
Name Meaning
================ =============================================
``fillMissing`` Replace missing (*NA*) values with a constant
``gaussianise`` Apply gaussian normalisation
================ =============================================
.. todo:: examples
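
For example, a made-up **Process** entry might look like this (here we are
assuming that the replacement constant is passed as an argument to
``fillMissing``)::

    fillMissing(0), gaussianise

This would replace missing values with 0, and then gaussianise the column.
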
......@@ -13,7 +13,7 @@ ID RawLevels NewLevels
100017 444,555,300 0.25,0.5,3
100013
100011 10,3060,1030,12,24,600 1,2,3,4,5,6
100012 0,13,35,57,79,912,1200 2,3,4,5,6,7
100012 1,13,35,57,79,912,1200 0.5,2,4,6,8,10.5,13
100290 -10 0.5
488 -1001 0
493 0,-131,-141 1,2,3
......
......@@ -113,6 +113,11 @@ class Expression(object):
return self.__origExpr
def __repr__(self):
"""Return the original string representation of the expression. """
return str(self)
@property
def variables(self):
"""Return a list of all variables used in the expression. """
......
......@@ -37,6 +37,7 @@ use variable IDs.
import itertools as it
import functools as ft
import os.path as op
import re
import logging
......@@ -559,6 +560,31 @@ def loadVariableTable(datafiles,
# Merge clean options into variable table
mergeCleanFunctions(vartable, tytable, clean, typeClean, globalClean)
# Check, where we can, that the
# vartable contains valid rules
def checkLengths(col1, col2, row):
val1 = row[col1]
val2 = row[col2]
isna1 = pd.isna(val1)
isna2 = pd.isna(val2)
# ugh. if the value is a sequence, isna
# will return a sequence of bools
if not isinstance(isna1, bool): isna1 = False
if not isinstance(isna2, bool): isna2 = False
if isna1 and isna2:
return
if isna1 or isna2 or (len(val1) != len(val2)):
raise ValueError('Columns don\'t match [len({}) != '
'len({})]: {}'.format(val1, val2, row.name))
checkRecoding = ft.partial(checkLengths, 'RawLevels', 'NewLevels')
checkParentValues = ft.partial(checkLengths, 'ParentValues', 'ChildValues')
vartable.apply(checkRecoding, axis=1)
vartable.apply(checkParentValues, axis=1)
return vartable, unknownVars, uncleanVars
......
......@@ -3076,7 +3076,7 @@
" 104590: [555. 400.] -> [0.5 4. ]\n",
" 104900: [ 10. 3060. 1030. 12. 24. 600.] -> [1. 2. 3. 4. 5. 6.]\n",
" 104910: [ 10. 3060. 1030. 12. 24. 600.] -> [1. 2. 3. 4. 5. 6.]\n",
" 104920: [ 0. 13. 35. 57. 79. 912. 1200.] -> [2. 3. 4. 5. 6. 7.]\n",
" 104920: [1.00e+00 1.30e+01 3.50e+01 5.70e+01 7.90e+01 9.12e+02 1.20e+03] -> [ 0.5 2. 4. 6. 8. 10.5 13. ]\n",
"\n",
"Processing: True\n",
" 1: [40001] -> [binariseCategorical[processor](acrossVisits=True,acrossInstances=True)]\n",
......
%% Cell type:markdown id: tags:
![image.png](attachment:image.png)
# `ukbparse`
> Paul McCarthy &lt;paul.mccarthy@ndcn.ox.ac.uk&gt; ([WIN@FMRIB](https://www.win.ox.ac.uk/))
`ukbparse` is a command-line program which you can use to extract data from UK BioBank (and other tabular) data sets.
You can give `ukbparse` one or more input files (e.g. `.csv`, `.tsv`), and it will merge them together, perform some preprocessing, and produce a single output file.
A large number of rules are built into `ukbparse` which are specific to the UK BioBank data set. But you can control and customise everything that `ukbparse` does to your data, including which rows and columns to extract, and which cleaning/processing steps to perform on each column.
The `ukbparse` source code is available at https://git.fmrib.ox.ac.uk/fsl/ukbparse. You can install `ukbparse` into a Python environment using `pip`:
pip install ukbparse
Get command-line help by typing:
ukbparse -h
*The examples in this notebook assume that you have installed `ukbparse` 0.19.2 or newer.*
%% Cell type:code id: tags:
``` bash
ukbparse -V
```
%%%% Output: stream
ukbparse 0.19.2
%% Cell type:markdown id: tags:
### Contents
1. [Overview](#Overview)
1. [Import](#1.-Import)
2. [Cleaning](#2.-Cleaning)
3. [Processing](#3.-Processing)
4. [Export](#4.-Export)
2. [Examples](#Examples)
3. [Import examples](#Import-examples)
1. [Selecting variables (columns)](#Selecting-variables-(columns))
1. [Selecting individual variables](#Selecting-individual-variables)
2. [Selecting variable ranges](#Selecting-variable-ranges)
3. [Selecting variables with a file](#Selecting-variables-with-a-file)
4. [Selecting variables from pre-defined categories](#Selecting-variables-from-pre-defined-categories)
5. [Selecting column names](#Selecting-column-names)
2. [Selecting subjects (rows)](#Selecting-subjects-(rows))
1. [Selecting individual subjects](#Selecting-individual-subjects)
2. [Selecting subject ranges](#Selecting-subject-ranges)
3. [Selecting subjects from a file](#Selecting-subjects-from-a-file)
4. [Selecting subjects by variable value](#Selecting-subjects-by-variable-value)
5. [Excluding subjects](#Excluding-subjects)
3. [Selecting visits](#Selecting-visits)
4. [Merging multiple input files](#Merging-multiple-input-files)
1. [Merging by subject](#Merging-by-subject)
2. [Merging by column](#Merging-by-column)
3. [Naive merging](#Naive-merging)
4. [Cleaning examples](#Cleaning-examples)
1. [NA insertion](#NA-insertion)
2. [Variable-specific cleaning functions](#Variable-specific-cleaning-functions)
3. [Categorical recoding](#Categorical-recoding)
4. [Child value replacement](#Child-value-replacement)
5. [Processing examples](#Processing-examples)
1. [Sparsity check](#Sparsity-check)
2. [Redundancy check](#Redundancy-check)
3. [Categorical binarisation](#Categorical-binarisation)
6. [Custom cleaning, processing and loading - ukbparse plugins](#Custom-cleaning,-processing-and-loading---ukbparse-plugins)
1. [Custom cleaning functions](#Custom-cleaning-functions)
2. [Custom processing functions](#Custom-processing-functions)
3. [Custom file loaders](#Custom-file-loaders)
7. [Miscellaneous topics](#Miscellaneous-topics)
1. [Non-numeric data](#Non-numeric-data)
2. [Dry run](#Dry-run)
3. [Built-in rules](#Built-in-rules)
4. [Using a configuration file](#Using-a-configuration-file)
5. [Reporting unknown variables](#Reporting-unknown-variables)
6. [Low-memory mode](#Low-memory-mode)
%% Cell type:markdown id: tags:
# Overview
`ukbparse` performs the following steps:
## 1. Import
All data files are loaded in, unwanted columns and subjects are dropped, and the data files are merged into a single table (a.k.a. data frame). Multiple files can be merged according to an index column (e.g. subject ID). Or, if the input files contain the same columns/subjects, they can be naively concatenated along rows or columns.
## 2. Cleaning
The following cleaning steps are applied to each column:
1. **NA value replacement:** Specific values for some columns are replaced with NA, for example, variables where a value of `-1` indicates *Do not know*.
2. **Variable-specific cleaning functions:** Certain columns are re-formatted - for example, the [ICD10](https://en.wikipedia.org/wiki/ICD-10) disease codes are converted to integer representations.
3. **Categorical recoding:** Certain categorical columns are re-coded.
4. **Child value replacement:** NA values within some columns which are dependent upon other columns may have values inserted based on the values of their parent columns.
## 3. Processing
During the processing stage, columns may be removed, merged, or expanded into additional columns. For example, a categorical column may be expanded into a set of binary columns, one for each category.
A column may also be removed on the basis of being too sparse, or being redundant with respect to another column.
## 4. Export
The processed data can be saved as a `.csv`, `.tsv`, or `.hdf5` file.
%% Cell type:markdown id: tags:
# Examples
Throughout these examples, we are going to use a few command line options, which you will probably **not** normally want to use:
- `-nb` (short for `--no_builtins`): This tells `ukbparse` not to use the built-in processing rules, which are specifically tailored for UK BioBank data.
- `-ow` (short for `--overwrite`): This tells `ukbparse` not to complain if the output file already exists.
- `-q` (short for `--quiet`): This tells `ukbparse` to be quiet.
Without the `-q` option, `ukbparse` can be quite verbose, which can be annoying, but is very useful when things go wrong. A good strategy is to tell `ukbparse` to send all of its output to a log file with the `--log_file` (or `-lf`) option. For example:
ukbparse --log_file log.txt out.tsv in.tsv
Here's the first example input data set, with UK BioBank-style column names:
%% Cell type:code id: tags:
``` bash
cat data_01.tsv
```
%%%% Output: stream
eid 1-0.0 2-0.0 3-0.0 4-0.0 5-0.0 6-0.0 7-0.0 8-0.0 9-0.0 10-0.0
1 31 65 10 11 84 22 56 65 90 12
2 56 52 52 42 89 35 3 65 50 67
3 45 84 20 84 93 36 96 62 48 59
4 7 46 37 48 80 20 18 72 37 27
5 8 86 51 68 80 84 11 28 69 10
6 6 29 85 59 7 46 14 60 73 80
7 24 49 41 46 92 23 39 68 7 63
8 80 92 97 30 92 83 98 36 6 23
9 84 59 89 79 16 12 95 73 2 62
10 23 96 67 41 8 20 97 57 59 23
%% Cell type:markdown id: tags:
The numbers in each column name represent:
1. The variable ID
2. The visit, for variables which were collected at multiple points in time.
3. The "instance", for multi-valued variables.
Note that one **variable** is typically associated with several **columns**, although we're keeping things simple for this first example - there is only one visit for each variable, and there are no multi-valued variables.
%% Cell type:markdown id: tags:
# Import examples
## Selecting variables (columns)
You can specify which variables you want to load in the following ways, using the `--variable` (`-v` for short) and `--category` (`-c` for short) command line options:
* By variable ID
* By variable ranges
* By a text file which contains the IDs you want to keep.
* By pre-defined variable categories
* By column name
### Selecting individual variables
Simply provide the IDs of the variables you want to extract:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -v 1 -v 5 out.tsv data_01.tsv
cat out.tsv
```
%%%% Output: stream
eid 1-0.0 5-0.0
1 31 84
2 56 89
3 45 93
4 7 80
5 8 80
6 6 7
7 24 92
8 80 92
9 84 16
10 23 8
%% Cell type:markdown id: tags:
### Selecting variable ranges
The `--variable`/`-v` option accepts MATLAB-style ranges of the form `start:step:stop` (where the `stop` is inclusive):
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -v 1:3:10 out.tsv data_01.tsv
cat out.tsv
```
%%%% Output: stream
eid 1-0.0 4-0.0 7-0.0 10-0.0
1 31 11 56 12
2 56 42 3 67
3 45 84 96 59
4 7 48 18 27
5 8 68 11 10
6 6 59 14 80
7 24 46 39 63
8 80 30 98 23
9 84 79 95 62
10 23 41 97 23
%% Cell type:markdown id: tags:
### Selecting variables with a file
If your variables of interest are listed in a plain-text file, you can simply pass that file:
%% Cell type:code id: tags:
``` bash
echo -e "1\n6\n9" > vars.txt
ukbparse -nb -q -ow -v vars.txt out.tsv data_01.tsv
cat out.tsv
```
%%%% Output: stream
eid 1-0.0 6-0.0 9-0.0
1 31 22 90
2 56 35 50
3 45 36 48
4 7 20 37
5 8 84 69
6 6 46 73
7 24 23 7
8 80 83 6
9 84 12 2
10 23 20 59
%% Cell type:markdown id: tags:
### Selecting variables from pre-defined categories
Some UK BioBank-specific categories are baked into `ukbparse`, but you can also define your own categories - you just need to create a `.tsv` file, and pass it to `ukbparse` via the `--category_file` option (`-cf` for short):
%% Cell type:code id: tags:
``` bash
echo -e "ID\tCategory\tVariables" > custom_categories.tsv
echo -e "1\tCool variables\t1:5,7" >> custom_categories.tsv
echo -e "2\tUncool variables\t6,8:10" >> custom_categories.tsv
```
%% Cell type:markdown id: tags:
Use the `--category` option (`-c` for short) to select categories to output. You can refer to categories by their ID:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -cf custom_categories.tsv -c 1 out.tsv data_01.tsv
cat out.tsv
```
%%%% Output: stream
eid 1-0.0 2-0.0 3-0.0 4-0.0 5-0.0 7-0.0
1 31 65 10 11 84 56
2 56 52 52 42 89 3
3 45 84 20 84 93 96
4 7 46 37 48 80 18
5 8 86 51 68 80 11
6 6 29 85 59 7 14
7 24 49 41 46 92 39
8 80 92 97 30 92 98
9 84 59 89 79 16 95
10 23 96 67 41 8 97
%% Cell type:markdown id: tags:
Or by name:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -cf custom_categories.tsv -c uncool out.tsv data_01.tsv
cat out.tsv
```
%%%% Output: stream
eid 6-0.0 8-0.0 9-0.0 10-0.0
1 22 65 90 12
2 35 65 50 67
3 36 62 48 59
4 20 72 37 27
5 84 28 69 10
6 46 60 73 80
7 23 68 7 63
8 83 36 6 23
9 12 73 2 62
10 20 57 59 23
%% Cell type:markdown id: tags:
### Selecting column names
If you are working with data that has non-UK BioBank style column names, you can use the `--column` option (`-co` for short) to select individual columns by their name, rather than the variable with which they are associated. The `--column` option accepts full column names, and also shell-style wildcard patterns:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -co 4-0.0 -co "??-0.0" out.tsv data_01.tsv
cat out.tsv
```
%%%% Output: stream
eid 4-0.0 10-0.0
1 11 12
2 42 67
3 84 59
4 48 27
5 68 10
6 59 80
7 46 63
8 30 23
9 79 62
10 41 23
%% Cell type:markdown id: tags:
## Selecting subjects (rows)
`ukbparse` assumes that the first column in every input file is a subject ID. You can specify which subjects you want to load via the `--subject` (`-s` for short) option. You can specify subjects in the same way that you specified variables above, and also:
* By specifying a conditional expression on variable values - only subjects for which the expression evaluates to true will be imported
* By specifying subjects to exclude
### Selecting individual subjects
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -s 1 -s 3 -s 5 out.tsv data_01.tsv
cat out.tsv
```
%%%% Output: stream
eid 1-0.0 2-0.0 3-0.0 4-0.0 5-0.0 6-0.0 7-0.0 8-0.0 9-0.0 10-0.0
1 31 65 10 11 84 22 56 65 90 12
3 45 84 20 84 93 36 96 62 48 59
5 8 86 51 68 80 84 11 28 69 10
%% Cell type:markdown id: tags:
### Selecting subject ranges
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -s 2:2:10 out.tsv data_01.tsv
cat out.tsv
```
%%%% Output: stream
eid 1-0.0 2-0.0 3-0.0 4-0.0 5-0.0 6-0.0 7-0.0 8-0.0 9-0.0 10-0.0
2 56 52 52 42 89 35 3 65 50 67
4 7 46 37 48 80 20 18 72 37 27
6 6 29 85 59 7 46 14 60 73 80
8 80 92 97 30 92 83 98 36 6 23
10 23 96 67 41 8 20 97 57 59 23
%% Cell type:markdown id: tags:
### Selecting subjects from a file
%% Cell type:code id: tags:
``` bash
echo -e "5\n6\n7\n8\n9\n10" > subjects.txt
ukbparse -nb -q -ow -s subjects.txt out.tsv data_01.tsv
cat out.tsv
```
%%%% Output: stream
eid 1-0.0 2-0.0 3-0.0 4-0.0 5-0.0 6-0.0 7-0.0 8-0.0 9-0.0 10-0.0
5 8 86 51 68 80 84 11 28 69 10
6 6 29 85 59 7 46 14 60 73 80
7 24 49 41 46 92 23 39 68 7 63
8 80 92 97 30 92 83 98 36 6 23
9 84 59 89 79 16 12 95 73 2 62
10 23 96 67 41 8 20 97 57 59 23
%% Cell type:markdown id: tags:
### Selecting subjects by variable value
The `--subject` option accepts *variable expressions* - you can write an expression performing numerical comparisons on variables (denoted with a leading `v`) and combine these expressions using boolean algebra. Only subjects for which the expression evaluates to true will be imported. For example, to only import subjects where variable 1 is greater than 10, and variable 2 is less than 70, you can type:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -sp -s "v1 > 10 && v2 < 70" out.tsv data_01.tsv
cat out.tsv
```
%%%% Output: stream
eid 1-0.0 2-0.0 3-0.0 4-0.0 5-0.0 6-0.0 7-0.0 8-0.0 9-0.0 10-0.0
1 31 65 10 11 84 22 56 65 90 12
2 56 52 52 42 89 35 3 65 50 67
7 24 49 41 46 92 23 39 68 7 63
9 84 59 89 79 16 12 95 73 2 62
%% Cell type:markdown id: tags:
The following symbols can be used in variable expressions:
| Symbol | Meaning |
|---------------------------|--------------------------|
| `==` | equal to |
| `!=` | not equal to |
| `>` | greater than |
| `>=` | greater than or equal to |
| `<` | less than |
| `<=` | less than or equal to |
| `na` | N/A |
| `&&` | logical and |
| <code>&#x7c;&#x7c;</code> | logical or |
| `~` | logical not |
| `()` | To denote precedence |
### Excluding subjects
The `--exclude` (`-ex` for short) option allows you to exclude subjects - it accepts individual IDs, an ID range, or a file containing IDs. The `--exclude`/`-ex` option takes precedence over the `--subject`/`-s` option:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -s 1:8 -ex 5:10 out.tsv data_01.tsv
cat out.tsv
```
%%%% Output: stream
eid 1-0.0 2-0.0 3-0.0 4-0.0 5-0.0 6-0.0 7-0.0 8-0.0 9-0.0 10-0.0
1 31 65 10 11 84 22 56 65 90 12
2 56 52 52 42 89 35 3 65 50 67
3 45 84 20 84 93 36 96 62 48 59
4 7 46 37 48 80 20 18 72 37 27
%% Cell type:markdown id: tags:
## Selecting visits
%% Cell type:markdown id: tags:
Many variables in the UK BioBank data contain observations at multiple points in time, or visits. `ukbparse` allows you to specify which visits you are interested in. Here is an example data set with variables that have data for multiple visits (remember that the second number in the column names denotes the visit):
%% Cell type:code id: tags:
``` bash
cat data_02.tsv
```
%%%% Output: stream
eid 1-0.0 2-0.0 2-1.0 2-2.0 3-0.0 3-1.0 4-0.0 5-0.0
1 86 76 82 75 34 99 50 5
2 20 25 40 44 30 57 54 44
3 85 2 48 42 23 77 84 27
4 23 30 18 97 44 55 97 20
5 83 45 76 51 18 64 8 33
%% Cell type:markdown id: tags:
We can use the `--visit` (`-vi` for short) option to get just the last visit for each variable:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -vi last out.tsv data_02.tsv
cat out.tsv
```
%%%% Output: stream
eid 1-0.0 2-2.0 3-1.0 4-0.0 5-0.0
1 86 75 99 50 5
2 20 44 57 54 44
3 85 42 77 84 27
4 23 97 55 97 20
5 83 51 64 8 33
%% Cell type:markdown id: tags:
You can also specify which visit you want by its number:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -vi 1 out.tsv data_02.tsv
cat out.tsv
```
%%%% Output: stream
eid 2-1.0 3-1.0
1 82 99
2 40 57
3 48 77
4 18 55
5 76 64
%% Cell type:markdown id: tags:
## Merging multiple input files
If your data is split across multiple files, you can specify how `ukbparse` should merge them together.
### Merging by subject
For example, let's say we have these two input files (shown side-by-side):
%% Cell type:code id: tags:
``` bash
echo " " | paste data_03.tsv - data_04.tsv
```
%%%% Output: stream
eid 1-0.0 2-0.0 3-0.0 eid 4-0.0 5-0.0 6-0.0
1 89 47 26 2 19 17 62
2 94 37 70 3 41 12 7
3 63 5 97 4 8 86 9
4 98 97 91 5 7 65 71
5 37 10 11 6 3 23 15
%% Cell type:markdown id: tags:
Note that each file contains different variables, and different, but overlapping, subjects. By default, when you pass these files to `ukbparse`, it will output the intersection of the two files (more formally known as an *inner join*), i.e. subjects which are present in both files:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow out.tsv data_03.tsv data_04.tsv
cat out.tsv
```
%%%% Output: stream
eid 1-0.0 2-0.0 3-0.0 4-0.0 5-0.0 6-0.0
2 94 37 70 19 17 62
3 63 5 97 41 12 7
4 98 97 91 8 86 9
5 37 10 11 7 65 71
%% Cell type:markdown id: tags:
If you want to keep all subjects, you can instruct `ukbparse` to output the union (a.k.a. *outer join*) via the `--merge_strategy` (`-ms` for short) option:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -ms outer out.tsv data_03.tsv data_04.tsv
cat out.tsv
```
%%%% Output: stream
eid 1-0.0 2-0.0 3-0.0 4-0.0 5-0.0 6-0.0
1 89.0 47.0 26.0
2 94.0 37.0 70.0 19.0 17.0 62.0
3 63.0 5.0 97.0 41.0 12.0 7.0
4 98.0 97.0 91.0 8.0 86.0 9.0
5 37.0 10.0 11.0 7.0 65.0 71.0
6 3.0 23.0 15.0
%% Cell type:markdown id: tags:
### Merging by column
Your data may be organised in a different way. For example, these next two files contain different groups of subjects, but overlapping columns:
%% Cell type:code id: tags:
``` bash
echo " " | paste data_05.tsv - data_06.tsv
```
%%%% Output: stream
eid 1-0.0 2-0.0 3-0.0 4-0.0 5-0.0 eid 2-0.0 3-0.0 4-0.0 5-0.0 6-0.0
1 69 80 70 60 42 4 17 36 56 90 12
2 64 15 82 99 67 5 63 16 87 57 63
3 33 67 58 96 26 6 43 19 84 53 63
%% Cell type:markdown id: tags:
In this case, we need to tell `ukbparse` to merge along the row axis, rather than along the column axis. We can do this with the `--merge_axis` (`-ma` for short) option:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -ma rows out.tsv data_05.tsv data_06.tsv
cat out.tsv
```
%%%% Output: stream
eid 2-0.0 3-0.0 4-0.0 5-0.0
1 80 70 60 42
2 15 82 99 67
3 67 58 96 26
4 17 36 56 90
5 63 16 87 57
6 43 19 84 53
%% Cell type:markdown id: tags:
Again, if we want to retain all columns, we can tell `ukbparse` to perform an outer join with the `-ms` option:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -ma rows -ms outer out.tsv data_05.tsv data_06.tsv
cat out.tsv
```
%%%% Output: stream
eid 1-0.0 2-0.0 3-0.0 4-0.0 5-0.0 6-0.0
1 69.0 80 70 60 42
2 64.0 15 82 99 67
3 33.0 67 58 96 26
4 17 36 56 90 12.0
5 63 16 87 57 63.0
6 43 19 84 53 63.0
%% Cell type:markdown id: tags:
### Naive merging
Finally, your data may be organised such that you simply want to "paste", or concatenate them together, along either rows or columns. For example, your data files might look like this:
%% Cell type:code id: tags:
``` bash
echo " " | paste data_07.tsv - data_08.tsv
```
%%%% Output: stream
eid 1-0.0 2-0.0 3-0.0 eid 4-0.0 5-0.0 6-0.0
1 30 99 57 1 16 54 60
2 3 6 75 2 43 59 9
3 13 91 36 3 71 73 38
%% Cell type:markdown id: tags:
Here, we have columns for different variables on the same set of subjects, and we just need to concatenate them together horizontally. We do this by using `--merge_strategy naive` (`-ms naive` for short):
%% Cell type:code id: tags:
``` bash
ukbparse -q -ow -ms naive out.tsv data_07.tsv data_08.tsv
cat out.tsv
```
%%%% Output: stream
eid 1-0.0 2-0.0 3-0.0 4-0.0 5-0.0 6-0.0
1 30 99 57.0 16.0 54.0 60.0
2 3 6 75.0 43.0 59.0 9.0
3 13 91 36.0 71.0 73.0 38.0
%% Cell type:markdown id: tags:
For files which need to be concatenated vertically, such as these:
%% Cell type:code id: tags:
``` bash
echo " " | paste data_09.tsv - data_10.tsv
```
%%%% Output: stream
eid 1-0.0 2-0.0 3-0.0 eid 1-0.0 2-0.0 3-0.0
1 16 34 10 4 40 89 58
2 62 78 16 5 25 75 9
3 72 29 53 6 28 74 57
%% Cell type:markdown id: tags:
We need to tell `ukbparse` which axis to concatenate along, again using the `-ma` option:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -ms naive -ma rows out.tsv data_09.tsv data_10.tsv
cat out.tsv
```
%%%% Output: stream
eid 1-0.0 2-0.0 3-0.0
1 16 34 10
2 62 78 16
3 72 29 53
4 40 89 58
5 25 75 9
6 28 74 57
%% Cell type:markdown id: tags:
# Cleaning examples
Once the data has been imported, a sequence of cleaning steps are applied to each column.
## NA insertion
For some variables it may make sense to discard or ignore certain values. For example, if an individual selects *"Do not know"* in answer to a question such as *"How much milk did you drink yesterday?"*, that answer will be coded with a specific value (e.g. `-1`). It does not make any sense to include these values in most analyses, so `ukbparse` can be used to mark such values as *Not Available (NA)*.
A large number of NA insertion rules, specific to UK BioBank variables, are coded into `ukbparse` (although they will not be used in these examples, as we are using the `--no_builtins`/`-nb` option). You can also specify your own rules via the `--na_values` (`-nv` for short) option.
Let's say we have this data set:
%% Cell type:code id: tags:
``` bash
cat data_11.tsv
```
%%%% Output: stream
eid 1-0.0 2-0.0 3-0.0
1 4 1 6
2 2 6 0
3 7 0 -1
4 -1 6 1
5 2 8 4
6 0 2 7
7 -1 0 0
8 7 7 2
9 4 -1 -1
10 8 -1 2
%% Cell type:markdown id: tags:
For variable 1, we want to ignore values of -1, for variable 2 we want to ignore -1 and 0, and for variable 3 we want to ignore 1 and 2:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -nv 1 " -1" -nv 2 " -1,0" -nv 3 "1,2" out.tsv data_11.tsv
cat out.tsv
```
%%%% Output: stream
eid 1-0.0 2-0.0 3-0.0
1 4.0 1.0 6.0
2 2.0 6.0 0.0
3 7.0 -1.0
4 6.0
5 2.0 8.0 4.0
6 0.0 2.0 7.0
7 0.0
8 7.0 7.0
9 4.0 -1.0
10 8.0
%% Cell type:markdown id: tags:
> The `--na_values` option expects two arguments:
> * The variable ID
> * A comma-separated list of values to replace with NA
%% Cell type:markdown id: tags:
## Variable-specific cleaning functions
A small number of cleaning/preprocessing functions are built into `ukbparse`, which can be applied to specific variables. For example, some variables in the UK BioBank contain ICD10 disease codes, which may be more useful if converted to a numeric format. Imagine that we have some data with ICD10 codes:
%% Cell type:code id: tags:
``` bash
cat data_12.tsv
```
%%%% Output: stream
eid 1-0.0
1 A481
2 A590
3 B391
4 D596
5 Z980
%% Cell type:markdown id: tags:
We can use the `--clean` (`-cl` for short) option with the built-in `convertICD10Codes` cleaning function to convert the codes to a numeric representation:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -cl 1 convertICD10Codes out.tsv data_12.tsv
cat out.tsv
```
%%%% Output: stream
eid 1-0.0
1 534
2 596
3 932
4 2159
5 19143
%% Cell type:markdown id: tags:
> The `--clean` option expects two arguments:
> * The variable ID
> * The cleaning function to apply. Some cleaning functions accept arguments - refer to the command-line help for a summary of available functions.
>
> You can define your own cleaning functions by passing them in as a `--plugin_file` (see the [section on custom plugins below](#Custom-cleaning,-processing-and-loading---ukbparse-plugins)).
### Example: flattening hierarchical data
Several variables in the UK Biobank (including the ICD10 disease categorisations) are organised in a hierarchical manner - each value is a child of a more general parent category. The `flattenHierarchical` cleaning function can be used to replace each value in a data set with the value that corresponds to a parent category. Let's apply this to our example ICD10 data set.
> `ukbparse` needs to know the data coding of hierarchical variables, as it uses this to look up an internal table containing the hierarchy information. So in this example we are creating a dummy variable table file which tells `ukbparse` that the example data uses [data coding 19](https://biobank.ctsu.ox.ac.uk/crystal/coding.cgi?id=19), which is the ICD10 data coding.
%% Cell type:code id: tags:
``` bash
echo -e "ID\tType\tDescription\tDataCoding\tNAValues\tRawLevels\tNewLevels\tParentValues\tChildValues\tClean
1\t\t\t19
" > variables.tsv
ukbparse -nb -q -ow -vf variables.tsv -cl 1 flattenHierarchical out.tsv data_12.tsv
cat out.tsv
```
%%%% Output: stream
eid 1-0.0