Commit 1e8773fe authored by Paul McCarthy

Merge branch 'enh/mincat' into 'master'

Enh/mincat

See merge request fsl/ukbparse!122
parents 19a96d72 8e0df3fc
......@@ -4,7 +4,7 @@
"upload_type" : "software",
"version" : "{{VERSION}}",
"publication_date" : "{{DATE}}",
"description" : "<p>ukbparse is a Python library for pre-processing of UK BioBank tabular data.</p>\n\n<p>The ukbparse library is developed at the Wellcome Centre for Integrative Neuroimaging (FMRIB), at the University of Oxford. It is hosted at <a href=\"https://git.fmrib.ox.ac.uk/fsl/ukbparse/\">https://git.fmrib.ox.ac.uk/fsl/ukbparse/</a>.</p>",
"description" : "<p><tt>ukbparse</tt> - the FMRIB UK BioBank data processing library</p>\n\n<p><tt>ukbparse</tt> is a Python library for pre-processing of UK BioBank tabular data.</p>\n\n<p>The <tt>ukbparse</tt> library is developed at the Wellcome Centre for Integrative Neuroimaging (FMRIB), at the University of Oxford. It is hosted at <a href=\"https://git.fmrib.ox.ac.uk/fsl/ukbparse/\">https://git.fmrib.ox.ac.uk/fsl/ukbparse/</a>.</p>",
"keywords" : ["python", "biobank"],
"access_right" : "open",
"license" : "Apache-2.0",
......
......@@ -2,8 +2,32 @@
======================
0.20.0 (Monday 6th May 2019)
----------------------------
0.20.0 (Tuesday 7th May 2019)
-----------------------------
Added
^^^^^
* The :func:`.isSparse` and :func:`.removeIfSparse` functions accept
a new option, ``mincat``, which allows a categorical variable to be deemed
sparse if the size of its smallest category is below a given threshold.
* New ``--description_file`` option which, for UK BioBank data, saves the
description for each column to a text file.
Changed
^^^^^^^
* The ``absolute`` parameter to the :func:`.isSparse` and
:func:`.removeIfSparse` functions is deprecated. Instead, they now accept
``abspres`` and ``abscat`` arguments, allowing the
absoluteness/proportionality of the ``minpres`` and ``mincat``/``maxcat``
options to be specified separately.
* Changed default processing rules so that ICD10 variables undergo a slightly
different sparsity test.
Fixed
......
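To make the new `mincat` rule concrete, here is a minimal standalone sketch of the criterion described in the changelog above. It mirrors the changelog description rather than calling `ukbparse` itself, and the column values and thresholds are invented for illustration:

``` python
import pandas as pd

def smallest_category_too_small(series, mincat):
    """Mirror of the new ``mincat`` sparsity criterion: a categorical
    column is deemed sparse if its smallest category contains fewer
    than ``mincat`` entries. (In ukbparse, ``mincat`` may also be
    given as a proportion, via ``abscat=False``.)
    """
    counts = series.dropna().value_counts()
    return counts.min() < mincat

# category 3 occurs only twice in this made-up column
col = pd.Series([1, 1, 1, 2, 2, 2, 2, 3, 3, None])

print(smallest_category_too_small(col, mincat=3))   # True  (smallest category has 2 entries)
print(smallest_category_too_small(col, mincat=2))   # False
```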
``ukbparse`` - the UK BioBank data parser
=========================================
``ukbparse`` - the FMRIB UK BioBank data parser
===============================================
.. image:: https://img.shields.io/pypi/v/ukbparse.svg
......@@ -180,6 +180,24 @@ interested in numeric columns, you can retrieve them as an array like this::
data = data(:, vartype('numeric')).Variables;
The ``readtable`` function will potentially rename the column names to ensure
that they are valid MATLAB identifiers. You can retrieve the original
names from the ``table`` object like so::
colnames = regexp(data.Properties.VariableDescriptions,
'''(.+)''', 'tokens', 'once');
colnames = vertcat(colnames{:});
If you have used the ``--description_file`` option, you can load in the
descriptions for each column as follows::
descs = readtable('descriptions.tsv',
'FileType', 'text',
'ReadVariableNames',false);
descs = descs.Var2;
Tests
-----
......
......@@ -45,7 +45,7 @@ with open(op.join(basedir, 'README.rst'), 'rt') as f:
setup(
name='ukbparse',
version=version,
description='UK Biobank data processing library',
description='The FMRIB UK Biobank data processing library',
long_description=readme,
url='https://git.fmrib.ox.ac.uk/fsl/ukbparse',
author='Paul McCarthy',
......@@ -55,6 +55,7 @@ setup(
classifiers=[
'Programming Language :: Python :: 3.5',
'Programming Language :: Python :: 3.6',
'Programming Language :: Python :: 3.7',
],
install_requires=install_requires,
......
......@@ -6,7 +6,7 @@
#
__version__ = '0.19.2'
__version__ = '0.20.0'
"""The ``ukbparse`` versioning scheme roughly follows Semantic Versioning
conventions.
"""
......
......@@ -133,7 +133,8 @@ CLI_ARGUMENTS = collections.OrderedDict((
(('etf', 'time_format'), {'default' : 'default'}),
(('nr', 'num_rows'), {'type' : int}),
(('uf', 'unknown_vars_file'), {}),
(('imf', 'icd10_map_file'), {})]),
(('imf', 'icd10_map_file'), {}),
(('def', 'description_file'), {})]),
('TSV export options', [
(('ts', 'tsv_sep'), {'default' : DEFAULT_TSV_SEP}),
......@@ -363,6 +364,9 @@ CLI_ARGUMENT_HELP = {
'icd10_map_file' :
'Save converted ICD10 code mappings to file',
'description_file' :
'Save descriptions of each column to file',
# TSV export options
'tsv_sep' :
'Column separator string to use in output file (default: "{}")'.format(
......
......@@ -7,6 +7,7 @@ log_file log.txt
unknown_vars_file unknowns.tsv
non_numeric_file non_numerics.tsv
icd10_map_file icd10_codes.tsv
description_file descriptions.tsv
plugin_file fmrib
loader FMRIB_internal_info.txt FMRIBImaging
date_format FMRIBImagingDate
......
......@@ -6,5 +6,11 @@ Variable Process
41202 binariseCategorical(acrossVisits=True, acrossInstances=True)
41204 binariseCategorical(acrossVisits=True, acrossInstances=True)
41270 binariseCategorical(acrossVisits=True, acrossInstances=True)
all_independent removeIfSparse(minpres=51, maxcat=0.99)
all_independent_except,40001,40002,40006,41202,41204,41270 removeIfSparse(minpres=51, maxcat=0.99, abscat=False)
40001 removeIfSparse(mincat=10)
40002 removeIfSparse(mincat=10)
40006 removeIfSparse(mincat=10)
41202 removeIfSparse(mincat=10)
41204 removeIfSparse(mincat=10)
41270 removeIfSparse(mincat=10)
all removeIfRedundant(0.99, 0.2)
......@@ -119,19 +119,44 @@ def convert_ParentValues(val):
def convert_Process_Variable(val):
"""Convert a string containing a process variable specification - one of:
- One or more comma-separated MATLAB-style ``start:stop:step`` ranges.
- ``'all'``, indicating that the process is to be applied to all
variables simultaneously
- ``'all_independent'``,, indicating that the process is to be applied
- ``'all_independent'``, indicating that the process is to be applied
to all variables independently
- One or more comma-separated MATLAB-style ``start:stop:step`` ranges.
- ``'all_except,'``, followed by one or more comma-separated MATLAB-style
ranges, indicating that the process is to be applied to all variables
simultaneously, except for the specified variables.
- ``'all_independent_except,'``, followed by one or more comma-separated
MATLAB-style ranges, indicating that the process is to be applied to
all variables independently, except for the specified variables.
:returns: A tuple containing:
- The process variable type - one of ``'all'``,
``'all_independent'``, ``'all_except'``,
``'all_independent_except'``, or ``'vids'``
- A list of variable IDs (empty if the process variable type
is ``'all'`` or ``'all_independent'``).
"""
if val in ('all', 'all_independent'):
return val
tokens = convert_comma_sep_text(val)
if len(tokens) == 1 and \
tokens[0] in ('all', 'all_independent',
'all_except', 'all_independent_except'):
return tokens[0], []
if tokens[0] in ('all_except', 'all_independent_except'):
ptype = tokens[0]
tokens = tokens[1:]
else:
tokens = convert_comma_sep_text(val)
return list(it.chain(*[util.parseMatlabRange(t) for t in tokens]))
ptype = 'vids'
return ptype, list(it.chain(*[util.parseMatlabRange(t) for t in tokens]))
def convert_Process(ptype, val):
......@@ -977,6 +1002,8 @@ def loadProcessingTable(procfile=None,
log.debug('Loading processing table from %s', procfile)
proctable = pd.read_csv(procfile, '\t',
index_col=False,
skip_blank_lines=True,
comment='#',
converters=PROCTABLE_CONVERTERS)
else:
......
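For reference, the return values described in the updated `convert_Process_Variable` docstring look like this. The sketch below is a self-contained mirror of the parsing logic shown in the diff, not a call into ukbparse; `parse_matlab_range` is a stand-in for `util.parseMatlabRange`, assumed to expand `start[:step]:stop` ranges with an inclusive stop:

``` python
import itertools as it

def parse_matlab_range(r):
    # stand-in for util.parseMatlabRange: 'start[:step]:stop', stop inclusive
    parts = [int(p) for p in r.split(':')]
    if   len(parts) == 1: start, step, stop = parts[0], 1, parts[0]
    elif len(parts) == 2: start, step, stop = parts[0], 1, parts[1]
    else:                 start, step, stop = parts
    return list(range(start, stop + 1, step))

def convert_process_variable(val):
    # mirror of the new parsing behaviour shown above
    tokens = [t.strip() for t in val.split(',')]
    if len(tokens) == 1 and tokens[0] in ('all', 'all_independent',
                                          'all_except', 'all_independent_except'):
        return tokens[0], []
    if tokens[0] in ('all_except', 'all_independent_except'):
        ptype, tokens = tokens[0], tokens[1:]
    else:
        ptype = 'vids'
    return ptype, list(it.chain(*[parse_matlab_range(t) for t in tokens]))

print(convert_process_variable('all_independent'))         # ('all_independent', [])
print(convert_process_variable('1:3,10'))                   # ('vids', [1, 2, 3, 10])
print(convert_process_variable('all_except,40001,40002'))   # ('all_except', [40001, 40002])
```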
......@@ -7,15 +7,15 @@
"""This module contains the ``ukbparse`` entry point. """
import multiprocessing as mp
import sys
import shutil
import logging
import fnmatch
import tempfile
import warnings
import datetime
import calendar
import multiprocessing as mp
import sys
import shutil
import logging
import fnmatch
import tempfile
import warnings
import datetime
import calendar
import ukbparse
import ukbparse.util as util
......@@ -115,6 +115,7 @@ def main(argv=None):
finaliseColumns( dtable, args, unknowns, unprocessed)
doExport( dtable, args)
doICD10Export( args)
doDescriptionExport(dtable, args)
finally:
# shutdown the pool gracefully
......@@ -410,6 +411,33 @@ def doICD10Export(args):
exc_info=True)
def doDescriptionExport(dtable, args):
"""If a ``--description_file`` has been specified, a description for every
column is saved out to the file.
"""
if args.description_file is None:
return
with util.timed('Description export', log):
cols = dtable.allColumns[1:]
vartable = dtable.vartable
try:
with open(args.description_file, 'wt') as f:
for c in cols:
desc = vartable.loc[c.vid, 'Description']
if desc == c.name:
desc = 'n/a'
f.write('{}\t{}\n'.format(c.name, desc))
except Exception as e:
log.warning('Failed to export descriptions: {}'.format(e),
exc_info=True)
def configLogging(args):
"""Configures ``ukbparse`` logging.
......
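The file written by `doDescriptionExport` is a plain two-column, tab-separated list of column name and description, with no header row (as the write loop above shows). A small sketch of reading it back with pandas; the `descriptions.tsv` name is just the default used in the FMRIB configuration earlier in this diff:

``` python
import pandas as pd

# read the name/description pairs written via --description_file;
# header=None because doDescriptionExport writes no header row
descs = pd.read_csv('descriptions.tsv', sep='\t', header=None,
                    names=['column', 'description'])

print(descs.head())
```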
......@@ -58,22 +58,39 @@ def processData(dtable):
# a previously executed process may
# add/remove variables to/from the
# data.
all_vids = dtable.variables
all_vids = [v for v in all_vids if v != 0]
vids = ptable.loc[i, 'Variable']
procs = ptable.loc[i, 'Process']
all_vids = dtable.variables
all_vids = [v for v in all_vids if v != 0]
# For each process, the processing table
# contains a "process variable type",
# a list of vids, and the process itself.
# The pvtype is one of:
# - vids: apply the process to the specified vids
# - all: apply the process to all vids
# - all_independent: apply the process independently to all
# vids
# - all_except: apply the process to all vids except the
# specified ones
# - all_independent_except: apply the process independently to all
# vids except the specified ones
pvtype, vids = ptable.loc[i, 'Variable']
procs = ptable.loc[i, 'Process']
# Build a list of lists of vids, with
# each vid list a group of variables
# that is to be processed together.
# apply independently to every variable
if vids == 'all_independent':
vids = [[v] for v in all_vids]
if pvtype in ('all_independent', 'all_independent_except'):
if pvtype.endswith('except'): exclude = vids
else: exclude = []
vids = [[v] for v in all_vids if v not in exclude]
# apply simultaneously to every variable
elif vids == 'all':
vids = [all_vids]
elif pvtype in ('all', 'all_except'):
if pvtype.endswith('except'): exclude = vids
else: exclude = []
vids = [[v for v in all_vids if v not in exclude]]
# apply to specified variables
else:
......
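To illustrate the grouping logic above, here is a standalone sketch (not ukbparse code) of how each process variable type combines `all_vids` with the listed vids to produce the groups of variables that are processed together. The handling of the plain `vids` type is an assumption, as that branch is truncated in this diff:

``` python
def build_vid_groups(pvtype, vids, all_vids):
    """Mirror of the grouping logic in processData: returns a list of
    vid lists, each of which is processed together."""
    if pvtype in ('all_independent', 'all_independent_except'):
        exclude = vids if pvtype.endswith('except') else []
        return [[v] for v in all_vids if v not in exclude]
    elif pvtype in ('all', 'all_except'):
        exclude = vids if pvtype.endswith('except') else []
        return [[v for v in all_vids if v not in exclude]]
    else:  # 'vids' - assumed: process the specified variables together
        return [vids]

all_vids = [1, 2, 3, 4]
print(build_vid_groups('all',                    [],     all_vids))  # [[1, 2, 3, 4]]
print(build_vid_groups('all_independent',        [],     all_vids))  # [[1], [2], [3], [4]]
print(build_vid_groups('all_except',             [2],    all_vids))  # [[1, 3, 4]]
print(build_vid_groups('all_independent_except', [2, 3], all_vids))  # [[1], [4]]
print(build_vid_groups('vids',                   [2, 3], all_vids))  # [[2, 3]]
```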
......@@ -57,6 +57,7 @@ The following processing functions are defined:
import itertools as it
import logging
import warnings
import collections
import pandas as pd
......@@ -74,15 +75,24 @@ def removeIfSparse(dtable,
vids,
minpres=None,
minstd=None,
mincat=None,
maxcat=None,
absolute=True):
"""removeIfSparse([minpres], [minstd], [maxcat], [absolute])
abspres=True,
abscat=True,
absolute=None):
"""removeIfSparse([minpres], [minstd], [mincat], [maxcat], [abspres], [abscat])
Removes columns deemed to be sparse.
Removes columns for the variables in ``vids`` if they are sparse.
See the :func:`.isSparse` function.
"""
""" # noqa
if absolute is not None:
warnings.warn('The absolute argument to removeIfSparse is deprecated '
'and will be removed in ukbparse 1.0.0. Use abspres/'
'abscat instead.', DeprecationWarning, stacklevel=1)
abspres = absolute
remove = []
......@@ -98,8 +108,10 @@ def removeIfSparse(dtable,
vtype,
minpres=minpres,
minstd=minstd,
mincat=mincat,
maxcat=maxcat,
absolute=absolute)
abspres=abspres,
abscat=abscat)
if isSparse:
log.debug('Dropping sparse column %s (%s: %f)',
......@@ -255,7 +267,7 @@ def binariseCategorical(dtable,
name=nameFormat.format(**fmtargs)))
if replace: return toremove, newseries, newvids
else: return [], newseries, newvids
else: return newseries, newvids
@custom.processor()
......@@ -304,4 +316,4 @@ def expandCompound(dtable, vids, nameFormat=None, replace=True):
name=name))
if replace: return columns, newseries, newvids
else: return [], newseries, newvids
else: return newseries, newvids
......@@ -18,9 +18,9 @@
import logging
import itertools as it
import warnings
import collections
import itertools as it
import numpy as np
import pandas as pd
......@@ -35,22 +35,30 @@ def isSparse(data,
ctype,
minpres=None,
minstd=None,
mincat=None,
maxcat=None,
absolute=True):
abspres=True,
abscat=True,
absolute=None):
"""Returns ``True`` if the given data looks sparse, ``False`` otherwise.
Used by :func:`removeIfSparse`.
The check is based on up to three criteria:
The check is based on the following criteria:
- The number/proportion of non-NA values must be greater than
or equal to ``minpres``.
- The standard deviation of the data must be greater than ``minstd``.
- For integer and categorical types, the proportion of the largest
- For integer and categorical types, the number/proportion of the largest
category must be less than ``maxcat``.
- For integer and categorical types, the number/proportion of the smallest
category must be at least ``mincat``.
If any of these criteria are not met, the data is considered to be sparse.
Each criterion can be disabled by passing in ``None`` for the relevant
parameter.
......@@ -63,28 +71,58 @@ def isSparse(data,
:arg minstd: Minimum standard deviation, for numeric/categorical types.
:arg maxcat: Maximum size (as a proportion) of largest category,
:arg mincat: Minimum size/proportion of smallest category,
for integer/categorical types.
:arg absolute: If ``True`` (the default), ``minpres`` is interpreted as
:arg maxcat: Maximum size/proportion of largest category,
for integer/categorical types.
:arg abspres: If ``True`` (the default), ``minpres`` is interpreted as
an absolute count. Otherwise ``minpres`` is interpreted
as a proportion.
:arg abscat: If ``True`` (the default), ``mincat`` and ``maxcat`` are
interpreted as absolute counts. Otherwise ``mincat`` and
``maxcat`` are interpreted as proportions.
:returns: A tuple containing:
- ``True`` if the data is sparse, ``False`` otherwise.
- If the data is sparse, one of ``'minpres'``,
``'minstd'``, or ``'maxcat'``, indicating the cause of
its sparsity. ``None`` if the data is not sparse.
``'minstd'``, ``'mincat'``, or ``'maxcat'``, indicating
the cause of its sparsity. ``None`` if the data is not
sparse.
- If the data is sparse, the value of the criterion which
caused the data to fail the test. ``None`` if the data
is not sparse.
"""
if absolute is not None:
warnings.warn('The absolute argument to isSparse is deprecated '
'and will be removed in ukbparse 1.0.0. Use abspres/'
'abscat instead.', DeprecationWarning, stacklevel=1)
abspres = absolute
presmask = data.notnull()
present = data[presmask]
ntotal = len(data)
npresent = len(present)
def fixabs(val, isabs, repl=npresent):
# Turn proportion into
# an absolute count
if not isabs:
val = val * ntotal
if (val % 1) >= 0.5: val = np.ceil(val)
else: val = np.floor(val)
# ignore the threshold if it is
# bigger than the total data length
if len(data) < val: return repl
else: return val
iscategorical = ctype in (util.CTYPES.integer,
util.CTYPES.categorical_single,
......@@ -93,19 +131,8 @@ def isSparse(data,
# not enough values
if minpres is not None:
if absolute:
pres = len(present)
# ignore absolute minpres threshold if
# total data length is less than it
if len(data) < minpres:
minpres = pres
else:
pres = float(len(present)) / len(data)
if pres < minpres:
return True, 'minpres', pres
if npresent < fixabs(minpres, abspres):
return True, 'minpres', npresent
# stddev is not large enough (for
# numerical/categorical types)
......@@ -115,14 +142,25 @@ def isSparse(data,
return True, 'minstd', std
# for categorical types
if iscategorical and maxcat is not None:
if iscategorical and ((maxcat is not None) or (mincat is not None)):
if maxcat is not None: maxcat = fixabs(maxcat, abscat, npresent + 1)
if mincat is not None: mincat = fixabs(mincat, abscat)
# one category is too dominant
# mincat - smallest category is too small
# maxcat - one category is too dominant
uniqvals = np.unique(present)
uniqcounts = [sum(present == u) for u in uniqvals]
catcount = float(max(uniqcounts)) / len(present)
if catcount >= maxcat:
return True, 'maxcat', catcount
nmincat = min(uniqcounts)
nmaxcat = max(uniqcounts)
if mincat is not None:
if nmincat < mincat:
return True, 'mincat', nmincat
if maxcat is not None:
if nmaxcat >= maxcat:
return True, 'maxcat', nmaxcat
return False, None, None
......@@ -165,7 +203,7 @@ def redundantColumns(data, columns, corrthres, nathres=None):
namask = np.asarray(pd.isna(data[columns]), dtype=np.float16)
# normalise ecah column of the namask
# normalise each column of the namask
# array to unit standard deviation, but
# only for those columns which have a
# stddev greater than 0 (those that
......
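The `fixabs` helper above is what the new `abspres`/`abscat` flags hinge on: a proportional threshold is converted into an absolute count (rounded, over the full column length) before the comparison, and a threshold larger than the column is ignored. A standalone sketch of that conversion, using the same 0.65 proportion that appears in the updated notebook example later in this diff:

``` python
import numpy as np

def fixabs(val, isabs, ntotal, repl):
    """Mirror of the fixabs helper: turn a proportion into an absolute
    count, and fall back to ``repl`` if the resulting threshold is
    bigger than the column length ``ntotal``."""
    if not isabs:
        val = val * ntotal
        val = np.ceil(val) if (val % 1) >= 0.5 else np.floor(val)
    return repl if ntotal < val else val

# minpres=0.65 with abspres=False on a column of 100 rows becomes
# an absolute threshold of 65 non-missing values
print(fixabs(0.65, False, ntotal=100, repl=80))   # 65.0
# an absolute threshold larger than the column is ignored
print(fixabs(500,  True,  ntotal=100, repl=80))   # 80
```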
......@@ -30,7 +30,7 @@
"\n",
" ukbparse -h\n",
" \n",
"*The examples in this notebook assume that you have installed `ukbparse` 0.17.0 or newer.*"
"*The examples in this notebook assume that you have installed `ukbparse` 0.20.0 or newer.*"
]
},
{
......@@ -980,7 +980,7 @@
"metadata": {},
"outputs": [],
"source": [
"ukbparse -nb -q -ow -apr all \"removeIfSparse(minpres=0.65, absolute=False)\" out.tsv data_15.tsv\n",
"ukbparse -nb -q -ow -apr all \"removeIfSparse(minpres=0.65, abspres=False)\" out.tsv data_15.tsv\n",
"cat out.tsv"
]
},
......
%% Cell type:markdown id: tags:
![image.png](attachment:image.png)
# `ukbparse`
> Paul McCarthy &lt;paul.mccarthy@ndcn.ox.ac.uk&gt; ([WIN@FMRIB](https://www.win.ox.ac.uk/))
`ukbparse` is a command-line program which you can use to extract data from UK BioBank (and other) tabular data sets.
You can give `ukbparse` one or more input files (e.g. `.csv`, `.tsv`), and it will merge them together, perform some preprocessing, and produce a single output file.
A large number of rules are built into `ukbparse` which are specific to the UK BioBank data set. But you can control and customise everything that `ukbparse` does to your data, including which rows and columns to extract, and which cleaning/processing steps to perform on each column.
The `ukbparse` source code is available at https://git.fmrib.ox.ac.uk/fsl/ukbparse. You can install `ukbparse` into a Python environment using `pip`:
pip install ukbparse
Get command-line help by typing:
ukbparse -h
*The examples in this notebook assume that you have installed `ukbparse` 0.17.0 or newer.*
*The examples in this notebook assume that you have installed `ukbparse` 0.20.0 or newer.*
%% Cell type:code id: tags:
``` bash
ukbparse -V
```
%% Cell type:markdown id: tags:
### Contents
1. [Overview](#Overview)
1. [Import](#1.-Import)
2. [Cleaning](#2.-Cleaning)
3. [Processing](#3.-Processing)
4. [Export](#4.-Export)
2. [Examples](#Examples)
3. [Import examples](#Import-examples)
1. [Selecting variables (columns)](#Selecting-variables-(columns))
1. [Selecting individual variables](#Selecting-individual-variables)
2. [Selecting variable ranges](#Selecting-variable-ranges)
3. [Selecting variables with a file](#Selecting-variables-with-a-file)
4. [Selecting variables from pre-defined categories](#Selecting-variables-from-pre-defined-categories)
2. [Selecting subjects (rows)](#Selecting-subjects-(rows))
1. [Selecting individual subjects](#Selecting-individual-subjects)
2. [Selecting subject ranges](#Selecting-subject-ranges)
3. [Selecting subjects from a file](#Selecting-subjects-from-a-file)
4. [Selecting subjects by variable value](#Selecting-subjects-by-variable-value)
5. [Excluding subjects](#Excluding-subjects)
3. [Selecting visits](#Selecting-visits)
4. [Merging multiple input files](#Merging-multiple-input-files)
1. [Merging by subject](#Merging-by-subject)
2. [Merging by column](#Merging-by-column)
3. [Naive merging](#Naive-merging)
4. [Cleaning examples](#Cleaning-examples)
1. [NA insertion](#NA-insertion)
2. [Variable-specific cleaning functions](#Variable-specific-cleaning-functions)
3. [Categorical recoding](#Categorical-recoding)
4. [Child value replacement](#Child-value-replacement)
5. [Processing examples](#Processing-examples)
1. [Sparsity check](#Sparsity-check)
2. [Redundancy check](#Redundancy-check)
3. [Categorical binarisation](#Categorical-binarisation)
6. [Custom cleaning, processing and loading - ukbparse plugins](#Custom-cleaning,-processing-and-loading---ukbparse-plugins)
1. [Custom cleaning functions](#Custom-cleaning-functions)
2. [Custom processing functions](#Custom-processing-functions)
3. [Custom file loaders](#Custom-file-loaders)
7. [Miscellaneous topics](#Miscellaneous-topics)
1. [Non-numeric data](#Non-numeric-data)
2. [Dry run](#Dry-run)
3. [Built-in rules](#Built-in-rules)
4. [Using a configuration file](#Using-a-configuration-file)
5. [Reporting unknown variables](#Reporting-unknown-variables)
6. [Low-memory mode](#Low-memory-mode)
%% Cell type:markdown id: tags:
# Overview
`ukbparse` performs the following steps:
## 1. Import
All data files are loaded in, unwanted columns and subjects are dropped, and the data files are merged into a single table (a.k.a. data frame). Multiple files can be merged according to an index column (e.g. subject ID). Or, if the input files contain the same columns/subjects, they can be naively concatenated along rows or columns.
## 2. Cleaning
The following cleaning steps are applied to each column:
1. **NA value replacement:** Specific values for some columns are replaced with NA, for example, variables where a value of `-1` indicates *Do not know*.
2. **Variable-specific cleaning functions:** Certain columns are re-formatted - for example, the [ICD10](https://en.wikipedia.org/wiki/ICD-10) disease codes are converted to integer representations.
3. **Categorical recoding:** Certain categorical columns are re-coded.
4. **Child value replacement:** NA values within some columns which are dependent upon other columns may have values inserted based on the values of their parent columns.
## 3. Processing
During the processing stage, columns may be removed, merged, or expanded into additional columns. For example, a categorical column may be expanded into a set of binary columns, one for each category.
A column may also be removed on the basis of being too sparse, or being redundant with respect to another column.
## 4. Export
The processed data can be saved as a `.csv`, `.tsv`, or `.hdf5` file.
%% Cell type:markdown id: tags:
# Examples
Throughout these examples, we are going to use a few command line options, which you will probably **not** normally want to use:
- `-nb` (short for `--no_builtins`): This tells `ukbparse` not to use the built-in processing rules, which are specifically tailored for UK BioBank data.
- `-ow` (short for `--overwrite`): This tells `ukbparse` not to complain if the output file already exists.
- `-q` (short for `--quiet`): This tells `ukbparse` to be quiet.
Without the `-q` option, `ukbparse` can be quite verbose, which can be annoying, but is very useful when things go wrong. A good strategy is to tell `ukbparse` to send all of its output to a log file with the `--log_file` (or `-lf`) option. For example:
ukbparse --log_file log.txt out.tsv in.tsv
Here's the first example input data set, with UK BioBank-style column names:
%% Cell type:code id: tags:
``` bash
cat data_01.tsv
```
%% Cell type:markdown id: tags:
The numbers in each column name represent:
1. The variable ID
2. The visit, for variables which were collected at multiple points in time.
3. The "instance", for multi-valued variables.
Note that one **variable** is typically associated with several **columns**, although we're keeping things simple for this first example - there is only one visit for each variable, and there are no multi-valued variables.
%% Cell type:markdown id: tags:
# Import examples
## Selecting variables (columns)
You can specify which variables you want to load in the following ways, using the `--variable` (`-v` for short) and `--category` (`-c` for short) command line options:
* By variable ID
* By variable ranges
* By a text file which contains the IDs you want to keep.
* By pre-defined variable categories
* By column name
### Selecting individual variables
Simply provide the IDs of the variables you want to extract:
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -v 1 -v 5 out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
### Selecting variable ranges
The `--variable`/`-v` option accepts MATLAB-style ranges of the form `start:step:stop` (where the `stop` is inclusive):
%% Cell type:code id: tags:
``` bash
ukbparse -nb -q -ow -v 1:3:10 out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
### Selecting variables with a file
If your variables of interest are listed in a plain-text file, you can simply pass that file:
%% Cell type:code id: tags:
``` bash
echo -e "1\n6\n9" > vars.txt
ukbparse -nb -q -ow -v vars.txt out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
### Selecting variables from pre-defined categories
Some UK BioBank-specific categories are baked into `ukbparse`, but you can also define your own categories - you just need to create a `.tsv` file, and pass it to `ukbparse` via the `--category_file` option (`-cf` for short):
%% Cell type:code id: tags:
``` bash
echo -e "ID\tCategory\tVariables" > custom_categories.tsv
echo -e "1\tCool variables\t1:5,7" >> custom_categories.tsv