FSL / funpack

Commit 333591f5, authored May 13, 2020 by Paul McCarthy

Merge branch 'rf/date-norm-in-cleaning' into 'master'

Rf/date norm in cleaning

See merge request !61

Parents: 8caed2de, 37d137d3
Changes: 14
Pipelines: 2
CHANGELOG.rst
@@ -2,8 +2,8 @@ FUNPACK changelog
 =================
 
 
-2.3.0 (Under development)
--------------------------
+2.3.0 (Tuesday 12th May 2020)
+-----------------------------
 
 
 Changed
@@ -15,7 +15,11 @@ Changed
   variables. This should give superior performance.
 * Revisited the :meth:`.DataTable.merge` to optimise performance in all
   scenarios.
-* Improved performance of the :mod:`.fmrib` date/time normalisation routines.
+* Improved performance of the :mod:`.fmrib` date/time normalisation routines,
+  and changed their usage so they are now applied as "cleaning" functions
+  after data import, rather than just before export. This means that date/
+  time columns can be subjected to the redundancy check (as they will have
+  a numeric type), and will improve data export performance.
 
 
 2.2.1 (Monday 4th May 2020)
funpack/__init__.py
@@ -6,7 +6,7 @@
 #
 
 
-__version__ = '2.3.0.dev0'
+__version__ = '2.3.0'
 """The ``funpack`` versioning scheme roughly follows Semantic Versioning
 conventions.
 """
funpack/configs/fmrib.cfg
@@ -8,9 +8,20 @@
 # Use local settings
 config_file local
 
+# Contains some FMRIB-specific plugin functions,
+# including date/time normalisation.
+plugin_file fmrib
+
+# Drop non-numeric columns - the main output
+# file only contains numeric data.
+suppress_non_numerics
+
+# Only import variables from FMRIB-curated categories,
+# largely drawn from showcase categories
+category_file fmrib/categories.tsv
 
 #
-# FUNPACK processing stages
+# FUNPACK cleaning/processing stages
 #
 # - NA insertion
 # - Categorical recoding
@@ -38,33 +49,22 @@ config_file local
 
 # - NA insertion
 datacoding_file fmrib/datacodings_navalues.tsv
 
 # - Categorical recoding
 datacoding_file fmrib/datacodings_recoding.tsv
 
 # - Cleaning
 variable_file fmrib/variables_clean.tsv
 
-# - Child value replacement
-variable_file fmrib/variables_parentvalues.tsv
-
-# - Processing
-processing_file fmrib/processing.tsv
-
-# FMRIB-curated categories, largely drawn from showcase categories
-category_file fmrib/categories.tsv
-
-#
-# FMRIB processing of dates
-#
+# Date/timestamp normalisation (performed in the FUNPACK cleaning stage)
 # Converts a date or date+time into a single value x, where floor(x) is the
 # calendar year and the fraction day/time within the year *except* 'a day'
 # is redefined as the time between 7am and 8pm (scanning only takes place
 # within these hours.
-#
-plugin_file fmrib
-date_format FMRIBImagingDate
-time_format FMRIBImagingTime
+type_file fmrib/datetime_formatting.tsv
 
-# Drop non-numeric columns - the main output file only contains numeric data.
-suppress_non_numerics
+# - Child value replacement
+variable_file fmrib/variables_parentvalues.tsv
+
+# - Processing
+processing_file fmrib/processing.tsv
funpack/configs/fmrib/datetime_formatting.tsv
0 → 100644
Type Clean
date normalisedDate
time normalisedAcquisitionTime
\ No newline at end of file
funpack/exporting.py
@@ -89,12 +89,13 @@ def formatColumn(col,
 
     # fall back to date/time formatting
     # if relevant for this column
-    if formatter is None:
-        if vtype == util.CTYPES.date:
-            formatter = dateFormat
-        elif vtype == util.CTYPES.time or \
-             pdtypes.is_datetime64_any_dtype(series):
-            formatter = timeFormat
+    if formatter is None and pdtypes.is_datetime64_any_dtype(series):
+        # use dateFormat if we know the column
+        # is a date (and not datetime), otherwise
+        # use timeFormat if the column is a
+        # datetime, or unknown type.
+        if vtype == util.CTYPES.date: formatter = dateFormat
+        else:                         formatter = timeFormat
 
     if formatter is not None:
         log.debug('Formatting column %s%s with %s formatter',
funpack/plugins/fmrib.py
@@ -88,30 +88,33 @@ def load_FMRIBImaging(infile):
     return df
 
 
-@funpack.formatter('FMRIBImagingDate')
-def normalisedDate(dtable, column, series):
+@funpack.cleaner()
+def normalisedDate(dtable, vid):
     """Converts date values into a numeric fractional year representation.
 
     Converts a date into a single value x, where ``floor(x)`` is the calendar
     year and the ``x mod 1`` is the fractional day within the year. The
     conversion takes leap years into account.
     """
-    datetimes = series.to_numpy()
-    years     = datetimes.astype('datetime64[Y]')
-    days      = datetimes.astype('datetime64[D]')
 
-    # convert to day of year
-    # calculate fraction of day
-    days  = (days  - years).astype(np.float32)
-    years = (years + 1970) .astype(np.float32)
-    leaps = pd.DatetimeIndex(datetimes).is_leap_year + 365
+    for col in dtable.columns(vid):
+        series    = dtable[:, col.name]
+        datetimes = series.to_numpy()
+        years     = datetimes.astype('datetime64[Y]')
+        days      = datetimes.astype('datetime64[D]')
 
-    # calculate and return fraction of year
-    return pd.Series(years + (days / leaps), name=series.name)
+        # convert to day of year
+        # calculate fraction of day
+        days  = (days  - years).astype(np.float32)
+        years = (years + 1970) .astype(np.float32)
+        leaps = pd.DatetimeIndex(datetimes).is_leap_year + 365
+
+        # calculate fraction of year
+        dtable[:, col.name] = years + (days / leaps)
 
 
-@funpack.formatter('FMRIBImagingTime')
-def normalisedAcquisitionTime(dtable, column, series):
+@funpack.cleaner()
+def normalisedAcquisitionTime(dtable, vid):
     """Converts timestamps into a numeric fractional year representation.
 
     Converts a date or date+time into a single value x, where `floor(x)` is the
@@ -119,23 +122,25 @@ def normalisedAcquisitionTime(dtable, column, series):
     redefined as the time between 7am and 8pm (UK BioBank scanning only takes
     place within these hours).
     """
-    datetimes = series.to_numpy()
-    years     = datetimes.astype('datetime64[Y]')
-    days      = datetimes.astype('datetime64[D]')
-    hours     = datetimes.astype('datetime64[h]')
-    mins      = datetimes.astype('datetime64[m]')
-    secs      = datetimes.astype('datetime64[s]')
-
-    # convert to day of year, hour
-    # of day, second of hour, then
-    # calculate fraction of day
-    secs     = (secs  - mins) .astype(np.float32)
-    mins     = (mins  - hours).astype(np.float32)
-    hours    = (hours - days) .astype(np.float32)
-    days     = (days  - years).astype(np.float32)
-    years    = (years + 1970) .astype(np.float32)
-    dayfracs = ((hours - 7) + (mins / 60) + (secs / 3600)) / 13
-    leaps    = pd.DatetimeIndex(datetimes).is_leap_year + 365
-
-    # calculate and return fraction of year
-    return pd.Series(years + (days + dayfracs) / leaps, name=series.name)
+    for col in dtable.columns(vid):
+        series    = dtable[:, col.name]
+        datetimes = series.to_numpy()
+        years     = datetimes.astype('datetime64[Y]')
+        days      = datetimes.astype('datetime64[D]')
+        hours     = datetimes.astype('datetime64[h]')
+        mins      = datetimes.astype('datetime64[m]')
+        secs      = datetimes.astype('datetime64[s]')
+
+        # convert to day of year, hour
+        # of day, second of hour, then
+        # calculate fraction of day
+        secs     = (secs  - mins) .astype(np.float32)
+        mins     = (mins  - hours).astype(np.float32)
+        hours    = (hours - days) .astype(np.float32)
+        days     = (days  - years).astype(np.float32)
+        years    = (years + 1970) .astype(np.float32)
+        dayfracs = ((hours - 7) + (mins / 60) + (secs / 3600)) / 13
+        leaps    = pd.DatetimeIndex(datetimes).is_leap_year + 365
+
+        # calculate and return fraction of year
+        dtable[:, col.name] = years + (days + dayfracs) / leaps
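The arithmetic in the two `fmrib` cleaning functions can be sanity-checked against a scalar version. The following is a minimal standard-library sketch and is not part of this commit: the function names `fractional_year` and `fractional_acquisition_time` are hypothetical, and the code operates on single `datetime` values rather than on `DataTable` columns. It follows the formulas in the diff: the zero-based day of year divided by 365 or 366, with the time of day rescaled to the 13 hours between 7am and 8pm.

```python
from calendar import isleap
from datetime import datetime


def fractional_year(dt):
    """Return the year plus (zero-based day of year / days in that year)."""
    day_of_year = dt.timetuple().tm_yday - 1        # 1st January -> 0
    days_in_year = 366 if isleap(dt.year) else 365  # leap years have 366 days
    return dt.year + day_of_year / days_in_year


def fractional_acquisition_time(dt):
    """As fractional_year, but a 'day' is the 13 hours from 7am to 8pm."""
    day_of_year = dt.timetuple().tm_yday - 1
    days_in_year = 366 if isleap(dt.year) else 365
    # Fraction of the 7am-8pm window that has elapsed at this time of day
    day_frac = ((dt.hour - 7) + dt.minute / 60 + dt.second / 3600) / 13
    return dt.year + (day_of_year + day_frac) / days_in_year


print(fractional_year(datetime(2020, 1, 1)))                    # 2020.0
print(fractional_acquisition_time(datetime(2020, 1, 1, 7, 0)))  # 2020.0
```

A timestamp at 13:30 is halfway through the 13-hour window, so it contributes half a day to the numerator.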
funpack/scripts/demo/funpack_demonstration.ipynb
@@ -57,7 +57,7 @@
     "\n",
     "\n",
     "**Important** The examples in this notebook assume that you have installed `funpack`\n",
-    "2.3.0.dev0 or newer."
+    "2.3.0 or newer."
    ]
   },
   {
@@ -1664,7 +1664,7 @@
     "\n",
     "`funpack` has a large number of hand-crafted rules built in, which are\n",
     "specific to variables found in the UK BioBank data set. These rules are part\n",
-    "of the ``fmrib`` configuration, which can be used by adding `-cfg fmrib` to\n",
+    "of the `fmrib` configuration, which can be used by adding `-cfg fmrib` to\n",
     "the command-line options.\n",
     "\n",
     "\n",
@@ -1679,7 +1679,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "funpack -q -ow -cfg fmrib -d out.tsv ukb_dataset_1_only_column_names.tsv ukb_dataset_2_only_column_names.tsv"
+    "funpack -q -ow -cfg fmrib -d out.tsv ukbcols.csv"
    ]
   },
   {
@@ -1888,5 +1888,5 @@
  ],
  "metadata":{"kernelspec":{"display_name":"Bash","language":"bash","name":"bash"},"language_info":{"codemirror_mode":"shell","file_extension":".sh","mimetype":"text/x-sh","name":"bash"}},
  "nbformat": 4,
- "nbformat_minor": 2
+ "nbformat_minor": 4
 }
%% Cell type:markdown id: tags:

# `funpack` (https://git.fmrib.ox.ac.uk/fsl/funpack)

> Paul McCarthy <paul.mccarthy@ndcn.ox.ac.uk>
> ([WIN@FMRIB](https://www.win.ox.ac.uk/))

`funpack` is a command-line program which you can use to extract data from UK
BioBank (and other tabular) data.

You can give `funpack` one or more input files (e.g. `.csv`, `.tsv`), and it
will merge them together, perform some preprocessing, and produce a single
output file.

A large number of rules are built into `funpack` which are specific to the UK
BioBank data set. But you can control and customise everything that `funpack`
does to your data, including which rows and columns to extract, and which
cleaning/processing steps to perform on each column.

`funpack` comes installed with recent versions of
[FSL](https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/). You can also install `funpack`
via `conda`:

> ```
> conda install -c conda-forge fmrib-unpack
> ```

Or using `pip`:

> ```
> pip install fmrib-unpack
> ```

Get command-line help by typing:

> ```
> funpack -h
> ```

**Important** The examples in this notebook assume that you have installed `funpack`
2.3.0 or newer.

%% Cell type:code id: tags:

```bash
funpack -V
```

%% Cell type:markdown id: tags:
### Contents

1. [Overview](#Overview)
    1. [Import](#1.-Import)
    2. [Cleaning](#2.-Cleaning)
    3. [Processing](#3.-Processing)
    4. [Export](#4.-Export)
2. [Examples](#Examples)
3. [Import examples](#Import-examples)
    1. [Selecting variables (columns)](#Selecting-variables-(columns))
        1. [Selecting individual variables](#Selecting-individual-variables)
        2. [Selecting variable ranges](#Selecting-variable-ranges)
        3. [Selecting variables with a file](#Selecting-variables-with-a-file)
        4. [Selecting variables from pre-defined categories](#Selecting-variables-from-pre-defined-categories)
    2. [Selecting subjects (rows)](#Selecting-subjects-(rows))
        1. [Selecting individual subjects](#Selecting-individual-subjects)
        2. [Selecting subject ranges](#Selecting-subject-ranges)
        3. [Selecting subjects from a file](#Selecting-subjects-from-a-file)
        4. [Selecting subjects by variable value](#Selecting-subjects-by-variable-value)
        5. [Excluding subjects](#Excluding-subjects)
    3. [Selecting visits](#Selecting-visits)
        1. [Evaluating expressions across visits](#Evaluating-expressions-across-visits)
    4. [Merging multiple input files](#Merging-multiple-input-files)
        1. [Merging by subject](#Merging-by-subject)
        2. [Merging by column](#Merging-by-column)
        3. [Naive merging](#Merging-by-column)
4. [Cleaning examples](#Cleaning-examples)
    1. [NA insertion](#NA-insertion)
    2. [Variable-specific cleaning functions](#Variable-specific-cleaning-functions)
    3. [Categorical recoding](#Categorical-recoding)
    4. [Child value replacement](#Child-value-replacement)
5. [Processing examples](#Processing-examples)
    1. [Sparsity check](#Sparsity-check)
    2. [Redundancy check](#Redundancy-check)
    3. [Categorical binarisation](#Categorical-binarisation)
6. [Custom cleaning, processing and loading - funpack plugins](#Custom-cleaning,-processing-and-loading---funpack-plugins)
    1. [Custom cleaning functions](#Custom-cleaning-functions)
    2. [Custom processing functions](#Custom-processing-functions)
    3. [Custom file loaders](#Custom-file-loaders)
7. [Miscellaneous topics](#Miscellaneous-topics)
    1. [Non-numeric data](#Non-numeric-data)
    2. [Dry run](#Dry-run)
    3. [Built-in rules](#Built-in-rules)
    4. [Using a configuration file](#Using-a-configuration-file)
    5. [Working with unknown/uncategorised variables](#Working-with-unknown/uncategorised-variables)
# Overview

`funpack` performs the following steps:

## 1. Import

All data files are loaded in, unwanted columns and subjects are dropped, and
the data files are merged into a single table (a.k.a. data frame). Multiple
files can be merged according to an index column (e.g. subject ID). Or, if the
input files contain the same columns/subjects, they can be naively
concatenated along rows or columns.

## 2. Cleaning

The following cleaning steps are applied to each column:

1. **NA value replacement:** Specific values for some columns are replaced
   with NA, for example, variables where a value of `-1` indicates *Do not
   know*.

2. **Variable-specific cleaning functions:** Certain columns are
   re-formatted; for example, the [ICD10](https://en.wikipedia.org/wiki/ICD-10)
   disease codes can be converted to integer representations.

3. **Categorical recoding:** Certain categorical columns are re-coded.

4. **Child value replacement:** NA values within some columns which are
   dependent upon other columns may have values inserted based on the values
   of their parent columns.

## 3. Processing

During the processing stage, columns may be removed, merged, or expanded into
additional columns. For example, a categorical column may be expanded into a set
of binary columns, one for each category.

A column may also be removed on the basis of being too sparse, or being
redundant with respect to another column.
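To make the expansion of a categorical column into binary columns concrete, here is a minimal plain-Python sketch. It is an illustration only and does not use `funpack`: the function name `binarise` is hypothetical, and the treatment of missing values (`None` becomes `0` in every output column) is an assumption, not a description of what `funpack` does.

```python
def binarise(values):
    """Expand a categorical column into one 0/1 column per category.

    Missing values (None) are assumed to map to 0 in every output column.
    """
    categories = sorted(set(v for v in values if v is not None))
    return {cat: [int(v == cat) for v in values] for cat in categories}


columns = binarise([1, 2, 2, 3, 1])
for category, column in columns.items():
    print(category, column)
```

Each output column has the same length as the input column, with a `1` wherever the subject's value equals that category.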

## 4. Export

The processed data can be saved as a `.csv`, `.tsv`, or `.hdf5` file.

# Examples

Throughout these examples, we are going to use a few command line
options, which you will probably **not** normally want to use:

- `-ow` (short for `--overwrite`): This tells `funpack` not to complain if
  the output file already exists.

- `-q` (short for `--quiet`): This tells `funpack` to be quiet. Without the
  `-q` option, `funpack` can be quite verbose, which can be annoying, but is
  very useful when things go wrong. A good strategy is to tell `funpack` to
  produce verbose output using the `--noisy` (`-n` for short) option, and to
  send all of its output to a log file with the `--log_file` (or `-lf`)
  option. For example:

> ```
> funpack -n -n -n -lf log.txt out.tsv in.tsv
> ```

Here's the first example input data set, with UK BioBank-style column names:

%% Cell type:code id: tags:

```bash
cat data_01.tsv
```

%% Cell type:markdown id: tags:

The numbers in each column name typically represent:

1. The variable ID
2. The visit, for variables which were collected at multiple points in time.
3. The "instance", for multi-valued variables.

Note that one **variable** is typically associated with several **columns**,
although we're keeping things simple for this first example - there is only
one visit for each variable, and there are no multi-valued variables.

> _Most but not all_ variables in the UK BioBank contain data collected at
> different visits, the times that the participants visited a UK BioBank
> assessment centre. However there are some variables (e.g. [ICD10 diagnosis
> codes](https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41202)) for which
> this is not the case.
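
As an illustration of the naming convention, the sketch below splits a column name into its three numbers. It assumes the layout `<variable>-<visit>.<instance>` (as in the column name `4-0.0` used later in this notebook). The helper `parse_column_name` is hypothetical and is not part of `funpack`.

```python
import re

# Assumed layout: "<variable>-<visit>.<instance>", e.g. "4-0.0"
COLUMN_PATTERN = re.compile(r'^(\d+)-(\d+)\.(\d+)$')


def parse_column_name(name):
    """Split a UK BioBank-style column name into (variable, visit, instance)."""
    match = COLUMN_PATTERN.match(name)
    if match is None:
        raise ValueError(f'not a UK BioBank-style column name: {name!r}')
    variable, visit, instance = (int(group) for group in match.groups())
    return variable, visit, instance


print(parse_column_name('4-0.0'))  # (4, 0, 0)
```

Grouping columns by the first element of the returned tuple recovers the one-variable-to-many-columns relationship described above.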

# Import examples

## Selecting variables (columns)

You can specify which variables you want to load in the following ways, using
the `--variable` (`-v` for short), `--category` (`-c` for short) and
`--column` (`-co` for short) command line options:

* By variable ID
* By variable ranges
* By a text file which contains the IDs you want to keep.
* By pre-defined variable categories
* By column name

### Selecting individual variables

Simply provide the IDs of the variables you want to extract:

%% Cell type:code id: tags:

```bash
funpack -q -ow -v 1 -v 5 out.tsv data_01.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

### Selecting variable ranges

The `--variable`/`-v` option accepts MATLAB-style ranges of the form
`start:step:stop` (where the `stop` is inclusive):

%% Cell type:code id: tags:

```bash
funpack -q -ow -v 1:3:10 out.tsv data_01.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:
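
The inclusive `stop` is the main difference from Python's own `range`. The following sketch shows which variable IDs a `start:step:stop` range such as `1:3:10` refers to. The helper `expand_range` is hypothetical and only illustrates the semantics described above, not how `funpack` parses its arguments.

```python
def expand_range(spec):
    """Expand 'start:step:stop' (stop inclusive) into a list of integers."""
    start, step, stop = (int(part) for part in spec.split(':'))
    if step <= 0:
        raise ValueError('step must be a positive integer')
    # Python ranges exclude the end point, so add 1 to include `stop`
    return list(range(start, stop + 1, step))


print(expand_range('1:3:10'))  # [1, 4, 7, 10]
```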

### Selecting variables with a file

If your variables of interest are listed in a plain-text file, you can simply
pass that file:

%% Cell type:code id: tags:

```bash
echo -e "1\n6\n9" > vars.txt
funpack -q -ow -v vars.txt out.tsv data_01.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

### Selecting variables from pre-defined categories

Some UK BioBank-specific categories are [built into
`funpack`](#Built-in-rules), but you can also define your own categories - you
just need to create a `.tsv` file, and pass it to `funpack` via the
`--category_file` (`-cf` for short) option:

%% Cell type:code id: tags:

```bash
echo -e "ID\tCategory\tVariables" > custom_categories.tsv
echo -e "1\tCool variables\t1:5,7" >> custom_categories.tsv
echo -e "2\tUncool variables\t6,8:10" >> custom_categories.tsv
cat custom_categories.tsv
```

%% Cell type:markdown id: tags:
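
To show how the `Variables` column of such a category file can be read, here is a minimal sketch using only the Python standard library. It is not `funpack`'s own parser: the helpers `expand_variables` and `load_categories` are hypothetical, and the sketch assumes that a two-part range such as `1:5` means every ID from 1 to 5 inclusive.

```python
import csv
import io


def expand_variables(spec):
    """Expand e.g. '1:5,7' into [1, 2, 3, 4, 5, 7] (ranges assumed inclusive)."""
    variables = []
    for token in spec.split(','):
        parts = [int(part) for part in token.split(':')]
        if len(parts) == 1:                      # a single variable ID
            variables.append(parts[0])
        elif len(parts) == 2:                    # start:stop
            start, stop = parts
            variables.extend(range(start, stop + 1))
        else:                                    # start:step:stop
            start, step, stop = parts
            variables.extend(range(start, stop + 1, step))
    return variables


def load_categories(fileobj):
    """Read a tab-separated category file into {ID: (name, variable IDs)}."""
    reader = csv.DictReader(fileobj, delimiter='\t')
    return {int(row['ID']): (row['Category'], expand_variables(row['Variables']))
            for row in reader}


# Same contents as custom_categories.tsv above
example = io.StringIO(
    'ID\tCategory\tVariables\n'
    '1\tCool variables\t1:5,7\n'
    '2\tUncool variables\t6,8:10\n')
categories = load_categories(example)
print(categories[1])  # ('Cool variables', [1, 2, 3, 4, 5, 7])
```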

Use the `--category` (`-c` for short) option to select categories to output. You can
refer to categories by their ID:

%% Cell type:code id: tags:

```bash
funpack -q -ow -cf custom_categories.tsv -c 1 out.tsv data_01.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

Or by name:

%% Cell type:code id: tags:

```bash
funpack -q -ow -cf custom_categories.tsv -c uncool out.tsv data_01.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

### Selecting column names

If you are working with data that has non-UK BioBank style column names, you
can use the `--column` (`-co` for short) option to select individual columns by their
name, rather than the variable with which they are associated. The `--column`
option accepts full column names, and also shell-style wildcard patterns:

%% Cell type:code id: tags:

```bash
funpack -q -ow -co 4-0.0 -co "??-0.0" out.tsv data_01.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:
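
Shell-style wildcards can be tried out in Python with the standard `fnmatch` module, where `?` matches exactly one character. The sketch below is illustrative only: the helper `select_columns` and the list of column names are hypothetical, and whether `funpack` itself uses `fnmatch` is not stated here.

```python
from fnmatch import fnmatchcase


def select_columns(columns, patterns):
    """Return the columns matching any pattern, in their original order."""
    return [col for col in columns
            if any(fnmatchcase(col, pattern) for pattern in patterns)]


# Hypothetical column names, for illustration only
columns = ['eid', '1-0.0', '4-0.0', '9-0.0', '10-0.0']
print(select_columns(columns, ['4-0.0', '??-0.0']))  # ['4-0.0', '10-0.0']
```

The pattern `??-0.0` matches only names with exactly two characters before `-0.0`, which is why `1-0.0` and `9-0.0` are not selected.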
## Selecting subjects (rows)

`funpack` assumes that the first column in every input file is a subject
ID. You can specify which subjects you want to load via the `--subject` (`-s`
for short) option. You can specify subjects in the same way that you specified
variables above, and also:

* By specifying a conditional expression on variable values - only subjects
  for which the expression evaluates to true will be imported
* By specifying subjects to exclude

### Selecting individual subjects

%% Cell type:code id: tags:

``` bash
funpack -q -ow -s 1 -s 3 -s 5 out.tsv data_01.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

### Selecting subject ranges

%% Cell type:code id: tags:

``` bash
funpack -q -ow -s 2:2:10 out.tsv data_01.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:
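
The `2:2:10` argument above is a range. As a rough sketch, assuming the range is read MATLAB-style as `start:step:stop` with an inclusive `stop` (check the command-line help to confirm), it could be expanded as follows. `expand_range` is a hypothetical helper, not part of `funpack`.

```python
def expand_range(spec):
    """Expand "start:stop" or "start:step:stop" into a list of IDs.

    Assumes MATLAB-style semantics, i.e. the stop value is inclusive.
    """
    parts = [int(p) for p in spec.split(":")]
    if len(parts) == 2:
        start, stop = parts
        step = 1
    elif len(parts) == 3:
        start, step, stop = parts
    else:
        raise ValueError(f"not a range: {spec!r}")
    return list(range(start, stop + 1, step))


print(expand_range("2:2:10"))
print(expand_range("1:8"))
```

Under that assumption, `2:2:10` covers the even subject IDs from 2 to 10.
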
### Selecting subjects from a file

%% Cell type:code id: tags:

``` bash
echo -e "5\n6\n7\n8\n9\n10" > subjects.txt
funpack -q -ow -s subjects.txt out.tsv data_01.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

### Selecting subjects by variable value

The `--subject` option accepts *variable expressions* - you can write an
expression performing numerical comparisons on variables (denoted with a
leading `v`) and combine these expressions using boolean algebra. Only
subjects for which the expression evaluates to true will be imported. For
example, to only import subjects where variable 1 is greater than 10, and
variable 2 is less than 70, you can type:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -sp -s "v1 > 10 && v2 < 70" out.tsv data_01.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:
The following symbols can be used in variable expressions:

| Symbol                    | Meaning                         |
|---------------------------|---------------------------------|
| `==`                      | equal to                        |
| `!=`                      | not equal to                    |
| `>`                       | greater than                    |
| `>=`                      | greater than or equal to        |
| `<`                       | less than                       |
| `<=`                      | less than or equal to           |
| `na`                      | N/A                             |
| `&&`                      | logical and                     |
| <code>&#x7c;&#x7c;</code> | logical or                      |
| `~`                       | logical not                     |
| `contains`                | Contains sub-string             |
| `all`                     | all columns must meet condition |
| `any`                     | any column must meet condition  |
| `()`                      | to denote precedence            |

Non-numeric (i.e. string) variables can be used in these expressions in
conjunction with the `==`, `!=`, and `contains` operators. An example of such
an expression is given in the section on [non-numeric
data](#Non-numeric-data), below.

The `all` and `any` symbols allow you to control how an expression is
evaluated across multiple columns which are associated with one variable
(e.g. separate columns for each visit). We will give an example of this in the
section on [selecting visits](#Selecting-visits), below.

### Excluding subjects

The `--exclude` (`-ex` for short) option allows you to exclude subjects - it
accepts individual IDs, an ID range, or a file containing IDs. The
`--exclude`/`-ex` option takes precedence over the `--subject`/`-s` option:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -s 1:8 -ex 5:10 out.tsv data_01.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:
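
The precedence rule can be pictured as a set difference: a subject is kept only if it is included and not excluded. The sketch below is a plain-Python illustration, not `funpack` code. It assumes that the ranges `1:8` and `5:10` are inclusive at both ends, and the subject IDs are hypothetical.

```python
def select_subjects(all_ids, include, exclude):
    """Keep IDs that are included and not excluded; exclusion wins."""
    include = set(include)
    exclude = set(exclude)
    return [i for i in all_ids if i in include and i not in exclude]


all_ids = list(range(1, 11))          # hypothetical subject IDs 1..10
kept = select_subjects(all_ids,
                       range(1, 9),   # stands in for -s 1:8
                       range(5, 11))  # stands in for -ex 5:10
print(kept)
```

Subjects 5 to 8 appear in both lists, and are dropped because the exclusion takes precedence.
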
## Selecting visits

Many variables in the UK BioBank data contain observations at multiple points in
time, or visits. `funpack` allows you to specify which visits you are interested
in. Here is an example data set with variables that have data for multiple
visits (remember that the second number in the column names denotes the visit):

%% Cell type:code id: tags:

``` bash
cat data_02.tsv
```

%% Cell type:markdown id: tags:

We can use the `--visit` (`-vi` for short) option to get just the last visit for
each variable:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -vi last out.tsv data_02.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

You can also specify which visit you want by its number:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -vi 1 out.tsv data_02.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

> Variables which are not associated with specific visits (e.g. [ICD10
> diagnosis codes](https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41202))
> will not be affected by the `-vi` option.
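
As an illustration of what "last visit" means, the following sketch picks, for each variable, the columns belonging to its highest visit number. It assumes column names of the form `variable-visit.instance`, and `last_visit_columns` is a hypothetical helper. It does not model the exemption for variables without visits described in the note above.

```python
import re
from collections import defaultdict


def last_visit_columns(columns):
    """Keep, for each variable, only the columns of its highest visit."""
    parsed = []
    for col in columns:
        m = re.fullmatch(r"(\d+)-(\d+)\.(\d+)", col)
        if m is None:
            continue  # skip anything else, e.g. the subject ID column
        parsed.append((col, int(m.group(1)), int(m.group(2))))

    # highest visit number seen for each variable
    last = defaultdict(int)
    for _, var, visit in parsed:
        last[var] = max(last[var], visit)

    return [col for col, var, visit in parsed if visit == last[var]]


cols = ["eid", "1-0.0", "1-1.0", "1-2.0", "2-0.0", "2-1.0", "3-0.0"]
print(last_visit_columns(cols))
```

Each variable keeps the columns from its own last visit, so variables with different numbers of visits are handled independently.
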
### Evaluating expressions across visits

The variable expressions described above in the section on [selecting
subjects](#Selecting-subjects-by-variable-value) will be applied to all of
the columns associated with a variable. By default, an expression will
evaluate to true where the values in _any_ column associated with the
variable evaluate to true. For example, we can extract the data for subjects
where the values of any column of variable 2 were less than 50:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -v 2 -s 'v2 < 50' out.tsv data_02.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

We can use the `any` and `all` operators to control how an expression is
evaluated across the columns of a variable. For example, we may only be
interested in subjects for whom all columns of variable 2 were less than
50:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -v 2 -s 'all(v2 < 50)' out.tsv data_02.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

We can use `any` and `all` in expressions involving multiple variables:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -v 2,3 -s 'any(v2 < 50) && all(v3 >= 40)' out.tsv data_02.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:
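
The difference between `any` and `all` can be sketched with Python's built-in functions of the same name. This is only an illustration of the semantics described above: the data is invented, the `evaluate` helper is hypothetical, and missing values are not handled.

```python
def evaluate(rows, columns, predicate, combine=any):
    """Return the subject IDs for which predicate holds across columns.

    combine is the built-in any (at least one column must pass) or
    all (every column must pass).
    """
    return [sid for sid, row in rows.items()
            if combine(predicate(row[c]) for c in columns)]


# Hypothetical data: subject ID -> values of variable 2 at three visits
rows = {
    1: {"2-0.0": 30, "2-1.0": 45, "2-2.0": 20},
    2: {"2-0.0": 60, "2-1.0": 40, "2-2.0": 70},
    3: {"2-0.0": 80, "2-1.0": 90, "2-2.0": 55},
}
cols = ["2-0.0", "2-1.0", "2-2.0"]


def below_50(value):
    return value < 50


print(evaluate(rows, cols, below_50, any))
print(evaluate(rows, cols, below_50, all))
```

Subject 2 passes under `any` because one visit is below 50, but fails under `all`.
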
## Merging multiple input files

If your data is split across multiple files, you can specify how `funpack`
should merge them together.

### Merging by subject

For example, let's say we have these two input files (shown side-by-side):

%% Cell type:code id: tags:

``` bash
echo " " | paste data_03.tsv - data_04.tsv
```

%% Cell type:markdown id: tags:

Note that each file contains different variables, and different, but
overlapping, subjects. By default, when you pass these files to `funpack`, it
will output the intersection of the two files (more formally known as an
*inner join*), i.e. subjects which are present in both files:

%% Cell type:code id: tags:

``` bash
funpack -q -ow out.tsv data_03.tsv data_04.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

If you want to keep all subjects, you can instruct `funpack` to output the union
(a.k.a. *outer join*) via the `--merge_strategy` (`-ms` for short) option:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -ms outer out.tsv data_03.tsv data_04.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:
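
The two strategies correspond to the intersection and union of the subject IDs. The sketch below shows the idea with plain dictionaries. It is not how `funpack` implements merging, and the tables and the `merge_by_subject` helper are hypothetical.

```python
def merge_by_subject(left, right, strategy="inner"):
    """Join two {subject_id: {column: value}} tables on subject ID."""
    if strategy == "inner":
        ids = left.keys() & right.keys()   # subjects in both tables
    elif strategy == "outer":
        ids = left.keys() | right.keys()   # subjects in either table
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    columns = sorted({c for table in (left, right)
                      for row in table.values() for c in row})
    merged = {}
    for sid in sorted(ids):
        row = {**left.get(sid, {}), **right.get(sid, {})}
        # columns with no data for this subject are filled with None
        merged[sid] = {c: row.get(c) for c in columns}
    return merged


a = {1: {"1-0.0": 10}, 2: {"1-0.0": 20}, 3: {"1-0.0": 30}}
b = {2: {"2-0.0": 5}, 3: {"2-0.0": 6}, 4: {"2-0.0": 7}}

inner = merge_by_subject(a, b, "inner")
outer = merge_by_subject(a, b, "outer")
print(sorted(inner))
print(sorted(outer))
```

With an outer join, subjects that appear in only one table are kept, and the columns they lack are left empty.
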
### Merging by column

Your data may be organised in a different way. For example, these next two
files contain different groups of subjects, but overlapping columns:

%% Cell type:code id: tags:

``` bash
echo " " | paste data_05.tsv - data_06.tsv
```

%% Cell type:markdown id: tags:

In this case, we need to tell `funpack` to merge along the row axis, rather than
along the column axis. We can do this with the `--merge_axis` (`-ma` for short)
option:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -ma rows out.tsv data_05.tsv data_06.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

Again, if we want to retain all columns, we can tell `funpack` to perform an
outer join with the `-ms` option:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -ma rows -ms outer out.tsv data_05.tsv data_06.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

### Naive merging

Finally, your data may be organised such that you simply want to "paste", or
concatenate them together, along either rows or columns. For example, your
data files might look like this:

%% Cell type:code id: tags:

``` bash
echo " " | paste data_07.tsv - data_08.tsv
```

%% Cell type:markdown id: tags:

Here, we have columns for different variables on the same set of subjects, and
we just need to concatenate them together horizontally. We do this by using
`--merge_strategy naive` (`-ms naive` for short):

%% Cell type:code id: tags:

``` bash
funpack -q -ow -ms naive out.tsv data_07.tsv data_08.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

For files which need to be concatenated vertically, such as these:

%% Cell type:code id: tags:

``` bash
echo " " | paste data_09.tsv - data_10.tsv
```

%% Cell type:markdown id: tags:

We need to tell `funpack` which axis to concatenate along, again using the `-ma`
option:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -ms naive -ma rows out.tsv data_09.tsv data_10.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

# Cleaning examples

Once the data has been imported, a sequence of cleaning steps is applied to
each column.

## NA insertion

For some variables it may make sense to discard or ignore certain values. For
example, if an individual selects *Do not know* to a question such as *How
much milk did you drink yesterday?*, that answer will be coded with a specific
value (e.g. `-1`). It does not make any sense to include these values in most
analyses, so `funpack` can be used to mark such values as *Not Available
(NA)*.

A large number of NA insertion rules, specific to UK BioBank variables, are
coded into `funpack`, and are applied when you use the `-cfg fmrib` option
(see the section below on [built-in rules](#Built-in-rules)). You can also
specify your own rules via the `--na_values` (`-nv` for short) option.

Let's say we have this data set:

%% Cell type:code id: tags:

``` bash
cat data_11.tsv
```

%% Cell type:markdown id: tags:

For variable 1, we want to ignore values of -1, for variable 2 we want to
ignore -1 and 0, and for variable 3 we want to ignore 1 and 2:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -nv 1 " -1" -nv 2 " -1,0" -nv 3 "1,2" out.tsv data_11.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

The `--na_values` option expects two arguments:

* The variable ID
* A comma-separated list of values to replace with NA
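
Conceptually, NA insertion is a per-variable value replacement. A minimal sketch, using the same rules as the command above but on invented data (`insert_na` is a hypothetical helper, and NaN stands in for NA):

```python
import math


def insert_na(values, na_values):
    """Replace every value listed in na_values with NaN."""
    na_values = set(na_values)
    return [math.nan if v in na_values else v for v in values]


# variable ID -> values to treat as missing (mirrors the -nv arguments)
rules = {1: [-1], 2: [-1, 0], 3: [1, 2]}

# hypothetical data: variable ID -> column values
data = {1: [4, -1, 7], 2: [0, 3, -1], 3: [1, 2, 5]}

cleaned = {vid: insert_na(vals, rules.get(vid, ()))
           for vid, vals in data.items()}
print(cleaned)
```

Values that are not listed in a rule pass through unchanged, so a value of `0` is only removed from variable 2.
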
## Variable-specific cleaning functions

A small number of cleaning/preprocessing functions are built into `funpack`,
which can be applied to specific variables. For example, some variables in the
UK BioBank contain ICD10 disease codes, which may be more useful if converted
to a numeric format (e.g. to make them easy to load into MATLAB). Imagine
that we have some data with ICD10 codes:

%% Cell type:code id: tags:

``` bash
cat data_12.tsv
```

%% Cell type:markdown id: tags:

We can use the `--clean` (`-cl` for short) option with the built-in
`codeToNumeric` cleaning function to convert the codes to a numeric
representation<sup>*</sup>:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -cl 1 "codeToNumeric('icd10')" out.tsv data_12.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

> <sup>*</sup>The `codeToNumeric` function will replace each ICD10 code with
> the corresponding *Node* number, as defined in the UK [BioBank ICD10 data
> coding](http://biobank.ctsu.ox.ac.uk/crystal/coding.cgi?id=19).

The `--clean` option expects two arguments:

* The variable ID
* The cleaning function to apply. Some cleaning functions accept
  arguments - refer to the command-line help for a summary of available
  functions.

You can define your own cleaning functions by passing them in as a
`--plugin_file` (see the [section on custom plugins
below](#Custom-cleaning,-processing-and-loading---funpack-plugins)).

### Example: flattening hierarchical data

Several variables in the UK Biobank (including the ICD10 disease
categorisations) are organised in a hierarchical manner - each value is a
child of a more general parent category. The `flattenHierarchical` cleaning
function can be used to replace each value in a data set with the value that
corresponds to a parent category. Let's apply this to our example ICD10 data
set.

%% Cell type:code id: tags:

``` bash
funpack -q -ow -cl 1 "flattenHierarchical(name='icd10')" out.tsv data_12.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:
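
The idea of replacing a value with a parent category can be sketched as a walk up a child-to-parent table. Everything in this example is illustrative: the small hierarchy, the `flatten` helper and its `level` argument are assumptions made for the sketch, and do not describe the arguments or behaviour of `flattenHierarchical` itself.

```python
# Illustrative hierarchy: code -> parent code (None marks the top level)
PARENTS = {
    "Chapter I": None,
    "A00-A09": "Chapter I",
    "A01": "A00-A09",
    "A01.0": "A01",
}


def ancestors(code):
    """Return the chain from the top-level category down to code."""
    chain = [code]
    while PARENTS[chain[-1]] is not None:
        chain.append(PARENTS[chain[-1]])
    return chain[::-1]


def flatten(code, level=0):
    """Replace code with its ancestor at the given depth (0 = top level).

    Codes which sit above the requested depth are returned unchanged.
    """
    chain = ancestors(code)
    return chain[min(level, len(chain) - 1)]


print(flatten("A01.0"))
print(flatten("A01.0", level=1))
```

Flattening to a shallow level collapses many specific codes onto a handful of broad categories, which can be easier to use in group-level analyses.
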
### Aside: ICD10 mapping file

`funpack` has a feature specific to these ICD10 disease categorisations - you
can use the `--icd10_map_file` (`-imf` for short) option to tell `funpack` to
save a file which contains a list of all ICD10 codes that were present in the
input data, and the corresponding numerical codes that `funpack` generated:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -cl 1 "codeToNumeric('icd10')" -imf icd10_codes.tsv out.tsv data_12.tsv
cat icd10_codes.tsv
```

%% Cell type:markdown id: tags:

## Categorical recoding

You may have some categorical data which is coded in an awkward manner, such as
in this example, which encodes the amount of some item that an individual has
consumed:


You can use the `--recoding` (`-re` for short) option to recode data like this
into something more useful. For example, given this data:

%% Cell type:code id: tags:

``` bash
cat data_13.tsv
```

%% Cell type:markdown id: tags:

Let's recode it to be more monotonic:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -re 1 "300,444,555" "3,0.25,0.5" out.tsv data_13.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

The `--recoding` option expects three arguments:

* The variable ID
* A comma-separated list of the values to be replaced
* A comma-separated list of the values to replace them with
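
The two lists are paired up position by position. A minimal sketch of that pairing, using the same raw and new values as the command above but on invented data (`recode` is a hypothetical helper):

```python
def recode(values, raw, new):
    """Replace each value found in raw with the value at the same
    position in new; all other values are left unchanged."""
    if len(raw) != len(new):
        raise ValueError("raw and new must have the same length")
    mapping = dict(zip(raw, new))
    return [mapping.get(v, v) for v in values]


values = [1, 300, 2, 444, 555, 3]   # hypothetical column
recoded = recode(values, [300, 444, 555], [3, 0.25, 0.5])
print(recoded)
```

Because the lists are matched by position, they must have the same length, which the sketch checks explicitly.
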
## Child value replacement

Imagine that we have these two questions:

* **1**: *Do you currently smoke cigarettes?*
* **2**: *How many cigarettes did you smoke yesterday?*

Now, question 2 was only asked if the answer to question 1 was *Yes*. So for
all individuals who answered *No* to question 1, we will have a missing value
for question 2. But for some analyses, it would make more sense to have a
value of 0, rather than NA, for these subjects.

`funpack` can handle these sorts of dependencies by way of *child value
replacement*. For question 2, we can define a conditional variable expression
such that when both question 2 is NA and question 1 is *No*, we can insert a
value of 0 into question 2.

This scenario is demonstrated in this example data set (where, for
question 1 values of `1` and `0` represent *Yes* and *No* respectively):

%% Cell type:code id: tags:

``` bash
cat data_14.tsv
```

%% Cell type:markdown id: tags:

We can fill in the values for variable 2 by using the `--child_values` (`-cv`
for short) option:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -cv 2 "v1 == 0" "0" out.tsv data_14.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

The `--child_values` option expects three arguments:

* The variable ID
* An expression evaluating some condition on the parent variable(s)
* A value to replace NA with where the expression evaluates to true.
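
The rule only touches entries where the child value is missing and the parent condition holds. A sketch of that logic on invented data, with `None` standing in for NA (`fill_child_values` is a hypothetical helper):

```python
def fill_child_values(child, parent, condition, value):
    """Fill missing (None) child values where condition(parent) is true."""
    return [value if c is None and condition(p) else c
            for c, p in zip(child, parent)]


def answered_no(parent_value):
    return parent_value == 0


smokes = [1, 0, 1, 0, None]              # question 1: 1 = Yes, 0 = No
per_day = [20, None, None, None, None]   # question 2
filled = fill_child_values(per_day, smokes, answered_no, 0)
print(filled)
```

Missing values are left alone when the parent answer was *Yes* or was itself missing, since in those cases a value of 0 cannot be inferred.
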
# Processing examples

After every column has been cleaned, the entire data set undergoes a series of
processing steps. The processing stage may result in columns being removed or
manipulated, or new columns being added.

The processing stage can be controlled with these options:

* `--prepend_process` (`-ppr` for short): Apply a processing function before
  the built-in processing
* `--append_process` (`-apr` for short): Apply a processing function after the
  built-in processing

A default set of processing steps is applied when you use the `fmrib`
configuration profile via `-cfg fmrib` - see the section on [built-in
rules](#Built-in-rules).

The `--prepend_process` and `--append_process` options require two arguments:

* The variable ID(s) to apply the function to, or `all` to denote all
  variables.
* The processing function to apply. The available processing functions are
  listed in the command line help, or you can write your own and pass it in
  as a plugin file
  ([see below](#Custom-cleaning,-processing-and-loading---funpack-plugins)).

## Sparsity check

The `removeIfSparse` process will remove columns that are deemed to have too
many missing values. If we take this data set:

%% Cell type:code id: tags:

``` bash
cat data_15.tsv
```

%% Cell type:markdown id: tags:

Imagine that our analysis requires at least 8 values per variable to work. We
can use the `minpres` option to `funpack` to drop any columns which do not meet
this threshold:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -apr all "removeIfSparse(minpres=8)" out.tsv data_15.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:

You can also specify `minpres` as a proportion, rather than an absolute number.
e.g.:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -apr all "removeIfSparse(minpres=0.65, abspres=False)" out.tsv data_15.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:
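
The two forms of the threshold can be compared with a short sketch. The `minpres` and `abspres` names are borrowed from the commands above, but the `is_sparse` helper and the exact comparison it uses (strictly fewer present values than the threshold) are assumptions made for illustration.

```python
def is_sparse(values, minpres, abspres=True):
    """Return True if a column has fewer present values than minpres.

    With abspres=False, minpres is read as a proportion of the number
    of rows rather than as an absolute count.
    """
    present = sum(v is not None for v in values)
    threshold = minpres if abspres else minpres * len(values)
    return present < threshold


# hypothetical column with 7 of 10 values present
column = [1, None, 3, None, 5, 6, 7, None, 9, 10]

print(is_sparse(column, 8))
print(is_sparse(column, 0.65, abspres=False))
```

The same column is dropped under the absolute threshold of 8, but kept under the proportional threshold, because 7 out of 10 present values exceeds 65%.
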
## Redundancy check

You may wish to remove columns which contain redundant information. The
`removeIfRedundant` process calculates the pairwise correlation between all
columns, and removes columns with a correlation above a threshold that you
provide. Imagine that we have this data set:

%% Cell type:code id: tags:

``` bash
cat data_16.tsv
```

%% Cell type:markdown id: tags:

The data in column `2-0.0` is effectively equivalent to the data in column
`1-0.0`, so is not of any use to us. We can tell `funpack` to remove it like
so:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -apr all "removeIfRedundant(0.9)" out.tsv data_16.tsv
cat out.tsv
```

%% Cell type:markdown id: tags:
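
To make the pairwise correlation idea concrete, here is a small sketch that flags column pairs whose Pearson correlation exceeds a threshold. It is not the `funpack` implementation: the use of the absolute correlation, the invented data, and the `redundant_pairs` helper are all assumptions, and missing values are ignored.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def redundant_pairs(columns, threshold):
    """Return column-name pairs whose absolute correlation is above
    the threshold."""
    names = list(columns)
    return [(a, b)
            for i, a in enumerate(names) for b in names[i + 1:]
            if abs(pearson(columns[a], columns[b])) > threshold]


cols = {
    "1-0.0": [1, 2, 3, 4, 5],
    "2-0.0": [2, 4, 6, 8, 11],   # nearly a multiple of 1-0.0
    "3-0.0": [5, 1, 4, 2, 3],    # unrelated
}
pairs = redundant_pairs(cols, 0.9)
print(pairs)
```

Only the first two columns are flagged as a redundant pair; one column of each such pair would then be a candidate for removal.
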
The `removeIfRedundant` process can also calculate the correlation of the
patterns of missing values between variables - Consider this example:

%% Cell type:code id: tags:

``` bash
cat data_17.tsv
```

%% Cell type:markdown id: tags:

All three columns are highly correlated, but the pattern of missing values in
column `3-0.0` is different to that of the other columns.

If we use the `nathres` option, `funpack` will only remove columns where the
correlation of both present and missing values meet the thresholds. Note that
the column which contains more missing values will be the one that gets
removed:

%% Cell type:code id: tags:

``` bash
funpack -q -ow -apr all "removeIfRedundant(0.9, nathres=0.6)" out.tsv data_17.tsv
cat out.tsv
```