Commit 1bdefcb1 authored by Paul McCarthy

DOC: tweaks

parent d75adec2
......@@ -620,10 +620,10 @@ Changed
* Changes to the ``fmrib`` configuration - variables
`41202 <http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41202>`_,
`41203 <http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41203>`_,
`41270 <http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41270>`_, and
`41271 <http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41271>`_ are
`41202 <http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41202>`__,
`41203 <http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41203>`__,
`41270 <http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41270>`__, and
`41271 <http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41271>`__ are
binarised, and the binarised values replaced with diagnosis dates from
the corresponding date variables.
* The processing function interface has been changed - processing functions
......@@ -805,7 +805,7 @@ Fixed
* Fixed a bug where non-numeric variables (e.g.
`41271 <https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41271>`_ ) were
`41271 <https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41271>`__) were
being interpreted by ``pandas`` as being numeric.
......
......@@ -33,14 +33,15 @@ Installation
------------
Install FUNPACK via pip::
Install FUNPACK from ``conda-forge``::
pip install fmrib-unpack
conda install -c conda-forge fmrib-unpack
Or from ``conda-forge``::
Or using ``pip``::
pip install fmrib-unpack
conda install -c conda-forge fmrib-unpack
The FUNPACK source code can be found at
......@@ -139,7 +140,7 @@ Built-in rules
FUNPACK contains a large number of built-in rules which have been specifically
written to pre-process UK BioBank data variables. These rules are stored in
the following files[*]_:
the following files [*]_:
* ``funpack/configs/fmrib/datacodings_*.tsv``: Cleaning rules for data
codings
......@@ -172,7 +173,7 @@ contains the cleaning rules for each variable.
https://git.fmrib.ox.ac.uk/fsl/funpack-fmrib-config/. However, it is
automatically installed alongside FUNPACK, so if you have FUNPACK,
you can use the ``fmrib`` profile. If you are using FUNPACK from a
source checkout, you may need to install the
source checkout, you may need to manually install the
``fmrib-unpack-fmrib-config`` package from `PyPI
<https://pypi.org/project/fmrib-unpack-fmrib-config/>`_ or
`conda-forge
......@@ -221,7 +222,7 @@ Output
------
The main output of FUNPACK is a plain-text file[*]_ which contains the input
The main output of FUNPACK is a plain-text file [*]_ which contains the input
data, after cleaning and processing, potentially with some columns removed,
and new columns added.
......
......@@ -163,7 +163,7 @@ The numbers in each column name typically represent:
Note that one **variable** is typically associated with several **columns**,
although we're keeping things simple for this first example - there is only
one visit for each variable, and there are no mulit-valued variables.
one visit for each variable, and there are no multi-valued variables.
> _Most but not all_ variables in the UK BioBank contain data collected at
......@@ -233,10 +233,10 @@ cat out.tsv
#### Selecting variables from pre-defined categories
Some UK BioBank-specific categories are [built into
`funpack`](#Built-in-rules), but you can also define your own categories - you
just need to create a `.tsv` file, and pass it to `funpack` via the
`--category_file` (`-cf` for short):
Some UK BioBank-specific categories are
[built into FUNPACK](#Built-in-rules), but you can also define your own
categories - you just need to create a `.tsv` file, and pass it to
FUNPACK via the `--category_file` (`-cf` for short):
```
......@@ -284,7 +284,7 @@ cat out.tsv
### Selecting subjects (rows)
`funpack` assumes that the first column in every input file is a subject
FUNPACK assumes that the first column in every input file is a subject
ID. You can specify which subjects you want to load via the `--subject` (`-s`
for short) option. You can specify subjects in the same way that you specified
variables above, and also:
......@@ -345,20 +345,20 @@ The following symbols can be used in variable expressions:
| Symbol | Meaning |
|---------------------------|---------------------------------|
| `==` | equal to |
| `!=` | not equal to |
| `>` | greater than |
| `>=` | greater than or equal to |
| `<` | less than |
| `<=` | less than or equal to |
| `na` | N/A |
| `&&` | logical and |
| <code>==</code> | equal to |
| <code>!=</code> | not equal to |
| <code>&gt;</code> | greater than |
| <code>&gt;=</code> | greater than or equal to |
| <code>&lt;</code> | less than |
| <code>&lt;=</code> | less than or equal to |
| <code>na</code> | N/A |
| <code>&amp;&amp;</code> | logical and |
| <code>&#x7c;&#x7c;</code> | logical or |
| `~` | logical not |
| `contains` | Contains sub-string |
| `all` | all columns must meet condition |
| `any` | any column must meet condition |
| `()` | to denote precedence |
| <code>~</code> | logical not |
| <code>contains</code> | Contains sub-string |
| <code>all</code> | all columns must meet condition |
| <code>any</code> | any column must meet condition |
| <code>()</code> | to denote precedence |
> The `all` and `any` symbols allow you to control how an expression is
......@@ -399,9 +399,10 @@ loaded into memory, so that the conditional expression can be evaluated.
A useful strategy, if you intend to work with the same subset of subjects more
than once, is to use FUNPACK once, to identify the subjects of interest, save
their IDs to a text file, and on subsequent calls to FUNPACK, use the text
file to select subjects. This means that subsequent FUNPACK runs will be
faster, and will require less memory.
their IDs to a text file (e.g. `subjects.txt`), and on subsequent calls to
FUNPACK, use the text file to select subjects (e.g. `fmrib_unpack -s
subjects.txt`). This means that subsequent FUNPACK runs will be faster, and
will require less memory.
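
The strategy above relies on FUNPACK treating the first column of every file as the subject ID. As a rough sketch only (this is not part of FUNPACK), the IDs could be pulled out of a previous output file with plain Python. The function names are hypothetical, and the sketch assumes a tab-separated file with a single header row and that a file with one ID per line is an acceptable subject file:

```python
import csv


def read_subject_ids(lines):
    """Return the first column (assumed to be the subject ID) of each
    data row in an iterable of tab-separated lines, skipping the header."""
    reader = csv.reader(lines, delimiter="\t")
    next(reader, None)  # skip the header row
    return [row[0] for row in reader if row]


def write_subject_ids(tsv_path, ids_path):
    """Read IDs from a tab-separated file and write them one per line."""
    with open(tsv_path, newline="") as src:
        ids = read_subject_ids(src)
    with open(ids_path, "w") as dst:
        dst.write("\n".join(ids) + "\n")
    return ids
```

For instance, `write_subject_ids("out.tsv", "subjects.txt")` would produce a file that could then be passed to later runs with `-s subjects.txt`.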
The `--ids_only` option can be used to generate an output file which only
......@@ -438,7 +439,7 @@ cat out.tsv
Many variables in the UK BioBank data contain observations at multiple points in
time, or visits. `funpack` allows you to specify which visits you are interested
time, or visits. FUNPACK allows you to specify which visits you are interested
in. Here is an example data set with variables that have data for multiple
visits (remember that the second number in the column names denotes the visit):
......@@ -512,14 +513,14 @@ cat out.tsv
### Merging multiple input files
If your data is split across multiple files, you can specify how `funpack`
If your data is split across multiple files, you can specify how FUNPACK
should merge them together.
#### Merging by subject
For example, let's say we have these two input files (shown side-by- side):
For example, let's say we have these two input files (shown side-by-side):
```
......@@ -528,7 +529,7 @@ echo " " | paste data_04.tsv - data_05.tsv
Note that each file contains different variables, and different, but
overlapping, subjects. By default, when you pass these files to `funpack`, it
overlapping, subjects. By default, when you pass these files to FUNPACK, it
will output the intersection of the two files (more formally known as an
*inner join*), i.e. subjects which are present in both files:
......@@ -539,7 +540,7 @@ cat out.tsv
```
If you want to keep all subjects, you can instruct `funpack` to output the union
If you want to keep all subjects, you can instruct FUNPACK to output the union
(a.k.a. *outer join*) via the `--merge_strategy` (`-ms` for short) option:
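
To make the difference between the two strategies concrete, here is a minimal plain-Python sketch of an inner versus an outer join on subject ID. It is an illustration only, not FUNPACK's implementation. The function name, the `"inner"`/`"outer"` strategy labels and the dictionary layout are assumptions made for this example:

```python
def merge_by_subject(left, right, strategy="inner"):
    """Merge two tables keyed by subject ID.

    Each table maps subject ID -> {column name: value}. With "inner",
    only subjects present in both tables are kept. With "outer", all
    subjects are kept and missing values are filled with None.
    """
    if strategy == "inner":
        subjects = left.keys() & right.keys()
    elif strategy == "outer":
        subjects = left.keys() | right.keys()
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    # Collect the column names of each table so that gaps can be filled
    left_cols = sorted({c for row in left.values() for c in row})
    right_cols = sorted({c for row in right.values() for c in row})

    merged = {}
    for subject in sorted(subjects):
        row = {c: left.get(subject, {}).get(c) for c in left_cols}
        row.update({c: right.get(subject, {}).get(c) for c in right_cols})
        merged[subject] = row
    return merged
```

With an outer join, a subject that appears in only one file still gets a row, with `None` standing in for the columns that came from the other file.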
......@@ -561,7 +562,7 @@ echo " " | paste data_06.tsv - data_07.tsv
```
In this case, we need to tell `funpack` to merge along the row axis, rather than
In this case, we need to tell FUNPACK to merge along the row axis, rather than
along the column axis. We can do this with the `--merge_axis` (`-ma` for short)
option:
......@@ -572,7 +573,7 @@ cat out.tsv
```
Again, if we want to retain all columns, we can tell `funpack` to perform an
Again, if we want to retain all columns, we can tell FUNPACK to perform an
outer join with the `-ms` option:
......@@ -614,7 +615,7 @@ echo " " | paste data_10.tsv - data_11.tsv
```
We need to tell `funpack` which axis to concatenate along, again using the `-ma`
We need to tell FUNPACK which axis to concatenate along, again using the `-ma`
option:
......@@ -638,12 +639,12 @@ For some variables it may make sense to discard or ignore certain values. For
example, if an individual selects *Do not know* to a question such as *How
much milk did you drink yesterday?*, that answer will be coded with a specific
value (e.g. `-1`). It does not make any sense to include these values in most
analyses, so `funpack` can be used to mark such values as *Not Available
analyses, so FUNPACK can be used to mark such values as *Not Available
(NA)*.
A large number of NA insertion rules, specific to UK BioBank variables, are
coded into `funpack`, and are applied when you use the `-cfg fmrib` option
coded into FUNPACK, and are applied when you use the `-cfg fmrib` option
(see the section below on [built-in rules](#Built-in-rules)). You can also
specify your own rules via the `--na_values` (`-nv` for short) option.
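
Conceptually, NA insertion just replaces a set of coded values with a missing-value marker. A minimal sketch in plain Python, assuming a hypothetical helper name and using `None` as the marker (this is not FUNPACK's own code):

```python
def insert_na(values, na_codes):
    """Replace every value that appears in na_codes with None."""
    return [None if v in na_codes else v for v in values]
```

For a question where `-1` means *Do not know*, `insert_na(answers, {-1})` would leave all other answers untouched.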
......@@ -675,7 +676,7 @@ The `--na_values` option expects two arguments:
### Variable-specific cleaning functions
A small number of cleaning/preprocessing functions are built into `funpack`,
A small number of cleaning/preprocessing functions are built into FUNPACK,
which can be applied to specific variables. For example, some variables in the
UK BioBank contain ICD10 disease codes, which may be more useful if converted
to a numeric format (e.g. to make them easy to load into MATLAB). Imagine
......@@ -689,7 +690,7 @@ cat data_13.tsv
We can use the `--clean` (`-cl` for short) option with the built-in
`codeToNumeric` cleaning function to convert the codes to a numeric
representation<sup>*</sup>:
representation<sup>&amp;</sup>:
```
......@@ -698,9 +699,9 @@ cat out.tsv
```
> <sup>*</sup>The `codeToNumeric` function will replace each ICD10 code with
> the corresponding *Node* number, as defined in the UK [BioBank ICD10 data
> coding](http://biobank.ctsu.ox.ac.uk/crystal/coding.cgi?id=19).
> <sup>&amp;</sup>The `codeToNumeric` function will replace each ICD10 code
> with the corresponding *Node* number, as defined in the UK [BioBank ICD10
> data coding](http://biobank.ctsu.ox.ac.uk/crystal/coding.cgi?id=19).
The `--clean` option expects two arguments:
......@@ -711,9 +712,10 @@ The `--clean` option expects two arguments:
functions.
You can define your own cleaning functions by passing them in as a
`--plugin_file` (see the [section on custom plugins
below](#Custom-cleaning,-processing-and-loading---funpack-plugins)).
You can find a list of the built-in cleaning functions
[here](function_reference.html). You can also define your own cleaning
functions by passing them in as a `--plugin_file` (see the [section on custom
plugins below](#Custom-cleaning,-processing-and-loading---funpack-plugins)).
#### Example: flattening hierarchical data
......@@ -736,10 +738,10 @@ cat out.tsv
#### Aside: ICD10 mapping file
`funpack` has a feature specific to these ICD10 disease categorisations - you
can use the `--icd10_map_file` (`-imf` for short) option to tell `funpack` to
FUNPACK has a feature specific to these ICD10 disease categorisations - you
can use the `--icd10_map_file` (`-imf` for short) option to tell FUNPACK to
save a file which contains a list of all ICD10 codes that were present in the
input data, and the corresponding numerical codes that `funpack` generated:
input data, and the corresponding numerical codes that FUNPACK generated:
```
......@@ -800,7 +802,7 @@ for question 2. But for some analyses, it would make more sense to have a
value of 0, rather than NA, for these subjects.
`funpack` can handle these sorts of dependencies by way of *child value
FUNPACK can handle these sorts of dependencies by way of *child value
replacement*. For question 2, we can define a conditional variable expression
such that when both question 2 is NA and question 1 is *No*, we can insert a
value of 0 into question 2.
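
The rule described above can be sketched in a few lines of plain Python. This is only an illustration of the logic, not FUNPACK's implementation. The function name is hypothetical, and the sketch assumes that *No* is coded as `0` and that missing values are represented by `None`:

```python
def replace_child_values(parent, child, no_code=0, fill=0):
    """Fill missing child answers where the parent answer was 'No'.

    parent and child are equal-length lists, one entry per subject.
    A child value is replaced with fill only if it is missing (None)
    and the matching parent value equals no_code.
    """
    return [
        fill if (c is None and p == no_code) else c
        for p, c in zip(parent, child)
    ]
```

Subjects who did not answer question 1 at all keep their missing value for question 2, because the condition requires question 1 to be *No*.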
......@@ -858,9 +860,10 @@ The `--prepend_process` and `--append_process` options require two arguments:
* The variable ID(s) to apply the function to, or `all` to denote all
variables.
* The processing function to apply. The available processing functions are
listed in the command line help, or you can write your own and pass it in
as a plugin file
([see below](#Custom-cleaning,-processing-and-loading---funpack-plugins)).
listed [here](function_reference.html) and in the [command line
help](command_line.html), or you can write your own and pass it in as a
plugin file ([see
below](#Custom-cleaning,-processing-and-loading---funpack-plugins)).
### Sparsity check
......@@ -876,8 +879,8 @@ cat data_16.tsv
Imagine that our analysis requires at least 8 values per variable to work. We
can use the `minpres` option to `funpack` to drop any columns which do not meet
this threshold:
can use the `minpres` option to `removeIfSparse` to drop any columns which do
not meet this threshold:
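
As a rough illustration of what a minimum-present threshold means (this is a plain-Python sketch, not the `removeIfSparse` implementation), a column is kept only if it has at least `minpres` non-missing values. The function name and the use of `None` for missing values are assumptions for this example:

```python
def drop_sparse_columns(table, minpres):
    """Keep only the columns with at least minpres non-missing values.

    table maps column name -> list of values, with None marking a
    missing value.
    """
    return {
        name: values
        for name, values in table.items()
        if sum(v is not None for v in values) >= minpres
    }
```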
```
......@@ -911,7 +914,7 @@ cat data_17.tsv
The data in column `2-0.0` is effectively equivalent to the data in column
`1-0.0`, so is not of any use to us. We can tell `funpack` to remove it like
`1-0.0`, so is not of any use to us. We can tell FUNPACK to remove it like
so:
......@@ -934,7 +937,7 @@ All three columns are highly correlated, but the pattern of missing values in
column `3-0.0` is different to that of the other columns.
If we use the `nathres` option, `funpack` will only remove columns where the
If we use the `nathres` option, FUNPACK will only remove columns where the
correlation of both present and missing values meet the thresholds. Note that
the column which contains more missing values will be the one that gets
removed:
......@@ -980,11 +983,11 @@ cat out.tsv
```
## Custom cleaning, processing and loading - `funpack` plugins
## Custom cleaning, processing and loading - FUNPACK plugins
If you want to apply some specific cleaning or processing function to a
variable, you can code your functions up in python, and then tell `funpack` to
variable, you can code your functions up in python, and then tell FUNPACK to
apply them.
......@@ -1008,7 +1011,7 @@ cat plugin_1.py | pygmentize
```
To use our custom cleaner function, we simply pass our plugin file to `funpack`
To use our custom cleaner function, we simply pass our plugin file to FUNPACK
using the `--plugin_file` (`-p` for short) option:
......@@ -1033,7 +1036,7 @@ cat plugin_2.py | pygmentize
```
Again, to use our plugin, we pass it to `funpack` via the `--plugin`/`-p`
Again, to use our plugin, we pass it to FUNPACK via the `--plugin`/`-p`
option:
......@@ -1047,7 +1050,7 @@ cat out.tsv
You might want to load some auxiliary data which is in an awkward format that
cannot be automatically parsed by `funpack`. For example, you may have a file
cannot be automatically parsed by FUNPACK. For example, you may have a file
which has acquisition date information separated into *year*, *month* and
*day* columns, e.g.:
......@@ -1103,9 +1106,9 @@ in the `A` category:
> Note the use of the `-nb` (`--no_builtins`) option here - this tells
> `funpack` to ignore its built-in variable table, which contains information
> FUNPACK to ignore its built-in variable table, which contains information
> about the type of each UK BioBank variable, and would otherwise interfere
> with this example.
> with this example. You would almost never need to use this option normally.
```
......@@ -1114,13 +1117,13 @@ cat out.tsv
```
By default, `funpack` will save columns containing non-numeric data to the
By default, FUNPACK will save columns containing non-numeric data to the
main output file, just like any other column. However, there are a couple of
options you can use to control what `funpack` does with non-numeric data.
options you can use to control what FUNPACK does with non-numeric data.
If you only care about numeric columns, you can use the
`--suppress_non_numerics` option (`-esn` for short) - this tells `funpack` to
`--suppress_non_numerics` option (`-esn` for short) - this tells FUNPACK to
discard all columns that are not numeric:
......@@ -1130,7 +1133,7 @@ cat out.tsv
```
You can also tell `funpack` to save all non-numeric columns to a separate file,
You can also tell FUNPACK to save all non-numeric columns to a separate file,
using the `--write_non_numerics` option (`-wnn` for short):
......@@ -1144,11 +1147,11 @@ cat out_non_numerics.tsv
### Dry run
The `--dry_run` (`-d` for short) option allows you to see what `funpack` is
The `--dry_run` (`-d` for short) option allows you to see what FUNPACK is
going to do - it is useful to perform a dry run before running a large
processing job, which could take a long time. For example, if we have a
complicated configuration such as the following, we can use the `--dry_run`
option to check that `funpack` is going to do what we expect:
option to check that FUNPACK is going to do what we expect:
```
......@@ -1166,7 +1169,7 @@ funpack \
### Built-in rules
`funpack` has a large number of hand-crafted rules built in, which are
FUNPACK has a large number of hand-crafted rules built in, which are
specific to variables found in the UK BioBank data set. These rules are part
of the `fmrib` configuration, which can be used by adding `-cfg fmrib` to
the command-line options.
......@@ -1183,7 +1186,7 @@ funpack -cfg fmrib -d out.tsv ukbcols.csv
All of these rules are coded in a set of `.cfg` and `.tsv` files which are
installed alongside the `funpack` source code:
installed alongside the FUNPACK source code:
```
......@@ -1194,19 +1197,19 @@ ls -l $cfgdir/fmrib/
The key files are:
* **`variables_*.tsv`**: Child value replacement, and cleaning rules for each
* `variables_*.tsv`: Child value replacement, and cleaning rules for each
variable.
* **`datacodings_*.tsv`**: NA insertion and recoding rules for data codings -
* `datacodings_*.tsv`: NA insertion and recoding rules for data codings -
these are used when rules are not explicitly specified in the
`variables_*.tsv` files, for a variable which uses a given data coding.
* **`processing.tsv`**: List of all processing functions that are applied, in
* `processing.tsv`: List of all processing functions that are applied, in
order.
* **`categories.tsv`**: Variable categories, for use with the `--category`/`-c`
* `categories.tsv`: Variable categories, for use with the `--category`/`-c`
option.
> Note that these rules are released as a separate package called
> [`fmrib-unpack-fmrib-config`](https://git.fmrib.ox.ac.uk/fsl/funpack-fmrib-config/),
> [fmrib-unpack-fmrib-config](https://git.fmrib.ox.ac.uk/fsl/funpack-fmrib-config/),
> but are automatically installed when you install FUNPACK.
......@@ -1231,7 +1234,7 @@ will override built-in rules for the same variables/datacodings.
The variable and datacoding rules can be stored across multiple files - for
example, you may want to write all of the NA insertion rules in one file, and
all of the child value replacement rules in another. ``funpack`` will merge
all of the child value replacement rules in another. FUNPACK will merge
all of the built-in files and any additionally provided files together, so you
do not need to maintain large and unwieldy variable/datacoding tables.
......@@ -1239,9 +1242,9 @@ do not need to maintain large and unwieldy variable/datacoding tables.
### Using a configuration file
`funpack` has an extensive command-line interface, but you don't need to pass
FUNPACK has an extensive command-line interface, but you don't need to pass
all of the settings via the command-line. Instead, you can put them into a
file, and give that file to `funpack` with the `--config` (`-cfg` for short)
file, and give that file to FUNPACK with the `--config` (`-cfg` for short)
option. You need to use the long-form of each command-line option, without the
leading `--`.
......@@ -1266,7 +1269,7 @@ cat config.txt
```
Now we can pass this file to `funpack` instead of having to pass all of the
Now we can pass this file to FUNPACK instead of having to pass all of the
command line options:
......@@ -1279,17 +1282,17 @@ funpack -d -cfg config.txt out.tsv data_01.tsv
Future UK BioBank data releases may contain new variables that are not present
in the built-in variable table used by `funpack`, and thus are not recognised
in the built-in variable table used by FUNPACK, and thus are not recognised
as UKB variables.
Furthermore, if you are working with the hand-crafted variable categories from
the [built-in `fmrib` configuration](#Built-in-rules), or your own categories,
the [built-in fmrib configuration](#Built-in-rules), or your own categories,
a new data release may contain variables which are not included in any
category.
To help you identify these new variables, `funpack` has the ability to inform
To help you identify these new variables, FUNPACK has the ability to inform
you about variables that it does not know about, or that are not categorised.
......@@ -1308,11 +1311,11 @@ cat custom_categories.tsv
Now we can use the `--write_unknown_vars` option (`-wu` for short) to generate
a summary of the variables which `funpack` did not recognise, or which were
a summary of the variables which FUNPACK did not recognise, or which were
not in any category:
> Again we are using the `-nb` (`--no_builtins`) option, which tells `funpack`
> Again we are using the `-nb` (`--no_builtins`) option, which tells FUNPACK
> not to load its built-in table of UK BioBank variables.
......