Commit 946f7cd3 authored by Paul McCarthy

Merge branch 'rf/descriptions' into 'master'

Rf/descriptions

See merge request fsl/ukbparse!124
parents 459df7a0 d04a1c7d
...@@ -2,6 +2,31 @@
======================
0.21.0 (Thursday 9th May 2019)
------------------------------
Added
^^^^^
* :class:`.Column` objects now have a ``metadata`` attribute which may be used
in the column description (if the ``--description_file`` option is used).
Processing functions can set the metadata for newly added columns.
* New ``metaproc`` plugin type to manipulate column metadata.
* All processing functions accept a ``metaproc`` argument, allowing a
``metaproc`` function to be applied to any column metadata that is returned
by the processing function.
Changed
^^^^^^^
* The :func:`.binariseCategorical` function sets the categorical value as
column metadata on the new binarised columns.
0.20.1 (Wednesday 8th May 2019)
-------------------------------
......
...@@ -196,6 +196,7 @@ descriptions for each column as follows::
descs = readtable('descriptions.tsv', ...
'FileType', 'text', ...
'Delimiter', '\t', ...
'ReadVariableNames',false);
descs = [descs; {'eid', 'ID'}];
idxs = cellfun(@(x) find(strcmp(descs.Var1, x)), colnames, ...
......
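The MATLAB snippet above reads the tab-separated file written by the ``--description_file`` option. The same lookup can be built in Python with only the standard library — a minimal sketch, in which the file contents are made-up stand-ins for real output:

```python
import csv

# Made-up stand-in for a descriptions.tsv written by
# ukbparse --description_file: two tab-separated columns,
# no header row - column name, then description.
with open('descriptions.tsv', 'wt') as f:
    f.write('31-0.0\tSex (0.0)\n')
    f.write('40001-0.0\tUnderlying (primary) cause of death (A011)\n')

# Build a column-name -> description lookup
with open('descriptions.tsv', newline='') as f:
    descs = dict(csv.reader(f, delimiter='\t'))

# Mirror the MATLAB example, which appends an entry for the ID column
descs['eid'] = 'ID'
```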
...@@ -6,7 +6,7 @@
#
__version__ = '0.21.0'
"""The ``ukbparse`` versioning scheme roughly follows Semantic Versioning
conventions.
"""
......
...@@ -12,16 +12,18 @@ and for cleaning and processing.
The following plugin types currently exist:
+-------------------+-------------------------------------------------------+
| Plugin type       |                                                       |
+-------------------+-------------------------------------------------------+
| ``sniffer``       | Return information about the columns in a file        |
| ``loader``        | Load data from a file                                 |
| ``cleaner``       | Run a cleaning function on a single column            |
| ``processor``     | Run a processing function on one or more data columns |
| ``metaproc``      | Run a function on a :class:`.Column` ``metadata``     |
|                   | value                                                 |
| ``formatter``     | Format a column for output                            |
| ``exporter``      | Export the processed data set                         |
+-------------------+-------------------------------------------------------+
To ensure that the ``ukbparse`` command line help is nicely formatted, all
...@@ -66,6 +68,7 @@ PLUGIN_TYPES = ['loader',
'formatter',
'cleaner',
'processor',
'metaproc',
'exporter']
...@@ -181,6 +184,7 @@ def registerBuiltIns():
import ukbparse.exporting_tsv as uet
import ukbparse.cleaning_functions as cf
import ukbparse.processing_functions as pf
import ukbparse.metaproc_functions as mf
if firstTime:
loglevel = log.getEffectiveLevel()
...@@ -191,6 +195,7 @@ def registerBuiltIns():
importlib.reload(uet)
importlib.reload(cf)
importlib.reload(pf)
importlib.reload(mf)
if firstTime:
log.setLevel(loglevel)
......
Variable Process
40001 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='icd10.numdesc')
40002 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='icd10.numdesc')
40006 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='icd10.numdesc')
41201 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='icd10.numdesc')
41202 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='icd10.numdesc')
41204 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='icd10.numdesc')
41270 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='icd10.numdesc')
all_independent_except,40001,40002,40006,41202,41204,41270 removeIfSparse(minpres=51, maxcat=0.99, abscat=False)
40001 removeIfSparse(mincat=10)
40002 removeIfSparse(mincat=10)
......
...@@ -42,7 +42,8 @@ class Column(object):
index,
vid=None,
visit=0,
instance=0,
metadata=None):
self.datafile = datafile
self.name = name
...@@ -50,6 +51,7 @@ class Column(object):
self.vid = vid
self.visit = visit
self.instance = instance
self.metadata = metadata
def __str__(self):
...@@ -356,7 +358,7 @@ class DataTable(object):
else: self.__varmap.pop(col.vid)
def addColumns(self, series, vids=None, meta=None):
"""Adds one or more new columns to the data set.
:arg series: Sequence of ``pandas.Series`` objects containing the
...@@ -365,10 +367,12 @@ class DataTable(object):
:arg vids: Sequence of variables each new column is associated
with. If ``None`` (the default), variable IDs are
automatically assigned.
:arg meta: Sequence of metadata associated with each new column.
"""
if vids is None: vids = [None] * len(series)
if meta is None: meta = [None] * len(series)
for s in series:
if s.name in self.__data.columns:
...@@ -390,13 +394,13 @@ class DataTable(object):
# a vid for each column starting from here.
startvid = max(max(self.variables) + 1, AUTO_VARIABLE_ID)
for s, idx, vid, m in zip(series, idxs, vids, meta):
if vid is None:
vid = startvid
startvid = startvid + 1
col = Column(None, s.name, idx, vid, 0, 0, m)
self.__data[s.name] = s
# new column on existing variable.
......
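The ``addColumns`` hunk above pads the ``vids`` and ``meta`` sequences with ``None`` and auto-assigns variable IDs to columns that lack one. A standalone sketch of that pairing logic (``AUTO_VARIABLE_ID`` here is a stand-in value, not the real constant, and plain tuples stand in for ``Column`` objects):

```python
AUTO_VARIABLE_ID = 60000  # stand-in for the real constant

def pair_columns(names, existing_vids, vids=None, meta=None):
    """Mimics DataTable.addColumns: pads missing vid/meta sequences
    with None, and auto-assigns vids, starting above both the
    existing vids and AUTO_VARIABLE_ID, to columns without one."""
    if vids is None: vids = [None] * len(names)
    if meta is None: meta = [None] * len(names)

    startvid = max(max(existing_vids) + 1, AUTO_VARIABLE_ID)

    cols = []
    for name, vid, m in zip(names, vids, meta):
        if vid is None:
            vid      = startvid
            startvid = startvid + 1
        cols.append((name, vid, m))
    return cols
```

For example, ``pair_columns(['a', 'b'], [31, 40001], vids=[None, 123], meta=['x', None])`` gives column ``a`` the first auto-assigned vid, while ``b`` keeps its explicit vid ``123``.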
...@@ -17,6 +17,8 @@ import warnings
import datetime
import calendar
import pandas as pd
import ukbparse
import ukbparse.util as util
import ukbparse.icd10 as icd10
...@@ -419,25 +421,45 @@ def doDescriptionExport(dtable, args):
return
with util.timed('Description export', log):
cols = dtable.allColumns[1:]
try:
with open(args.description_file, 'wt') as f:
for col in cols:
desc = generateDescription(dtable, col)
f.write('{}\t{}\n'.format(col.name, desc))
except Exception as e:
log.warning('Failed to export descriptions: {}'.format(e),
exc_info=True)
def generateDescription(dtable, col):
"""Called by :func:`doDescriptionExport`. Generates and returns a
suitable description for the given column.
:arg dtable: :class:`.DataTable` instance
:arg col: :class:`.Column` instance
"""
vartable = dtable.vartable
desc = vartable.loc[col.vid, 'Description']
if pd.isna(desc) or (desc == col.name):
desc = 'n/a'
# If metadata has been added to the column,
# we add it to the description. See the
# binariseCategorical processing function
# for an example of this.
if col.metadata is not None:
suffix = ' ({})'.format(col.metadata)
else:
suffix = ' ({}.{})'.format(col.visit, col.instance)
return '{}{}'.format(desc, suffix)
def configLogging(args):
"""Configures ``ukbparse`` logging.
......
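The new ``generateDescription`` logic can be exercised in isolation. A simplified sketch, with the ``vartable`` lookup and the ``pd.isna`` check replaced by plain arguments:

```python
def generate_description(desc, name, visit, instance, metadata=None):
    """Simplified sketch of generateDescription: fall back to 'n/a'
    when no real description exists, then suffix either the column
    metadata or the visit/instance pair."""
    if desc is None or desc == name:
        desc = 'n/a'

    # If metadata has been set on the column (e.g. by
    # binariseCategorical), it replaces the visit/instance suffix.
    if metadata is not None: suffix = ' ({})'.format(metadata)
    else:                    suffix = ' ({}.{})'.format(visit, instance)

    return '{}{}'.format(desc, suffix)
```

So a plain column gets a ``(visit.instance)`` suffix, while a binarised ICD10 column gets its categorical value instead.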
#!/usr/bin/env python
#
# metaproc_functions.py - Functions for manipulating column metadata.
#
# Author: Paul McCarthy <pauldmccarthy@gmail.com>
#
"""This module contains ``metaproc`` functions - functions for manipulating
column metadata.
Some :class:`.Column` instances have a ``metadata`` attribute, containing some
additional information about the column. The functions in this module can be
used to modify these metadata values. Currently, column metadata is only used
to generate a description of each column (via the ``--description_file``
command-line option).
"""
from . import icd10
from . import custom
from . import hierarchy
@custom.metaproc('icd10.numdesc')
def icd10DescriptionFromNumeric(val):
"""Generates a description for a numeric ICD10 code. """
val = icd10.numericToCode(val)
hier = hierarchy.getHierarchyFilePath(name='icd10')
hier = hierarchy.loadHierarchyFile(hier)
desc = hier.description(val)
return '{} - {}'.format(val, desc)
@custom.metaproc('icd10.codedesc')
def icd10DescriptionFromCode(val):
"""Generates a description for an ICD10 code. """
hier = hierarchy.getHierarchyFilePath(name='icd10')
hier = hierarchy.loadHierarchyFile(hier)
desc = hier.description(val)
return '{} - {}'.format(val, desc)
...@@ -36,8 +36,8 @@ import collections
import pyparsing as pp
from . import util
from . import custom
log = logging.getLogger(__name__)
...@@ -136,10 +136,11 @@ def runProcess(proc, dtable, vids):
remove = []
add = []
addvids = []
addmeta = []
def expand(res, length):
if res is None: return [None] * length
else: return res
for result in results:
if result is None:
...@@ -152,15 +153,21 @@ def runProcess(proc, dtable, vids):
# series/vids to add
if len(result) == 2:
add .extend(result[0])
addvids.extend(expand(result[1], len(result[0])))
addmeta.extend(expand(None, len(result[0])))
# columns to remove, and
# series/vids to add
elif len(result) in (3, 4):
if len(result) == 3:
result = list(result) + [None]
remove .extend(result[0])
add .extend(result[1])
addvids.extend(expand(result[2], len(result[1])))
addmeta.extend(expand(result[3], len(result[1])))
else:
raise error
...@@ -170,8 +177,8 @@ def runProcess(proc, dtable, vids):
else:
raise error
if len(add) > 0: dtable.addColumns(add, addvids, addmeta)
if len(remove) > 0: dtable.removeColumns(remove)
class NoSuchProcessError(Exception):
...@@ -200,10 +207,11 @@ class Process(object):
# cleaner functions are not
# defined in processing_functions,
# so in this case func will be None.
self.__ptype = ptype
self.__name = name
self.__args = args
self.__kwargs = kwargs
self.__metaproc = kwargs.pop('metaproc', None)
def __repr__(self):
...@@ -240,11 +248,27 @@ class Process(object):
"""Run the process on the data, passing it the given arguments,
and any arguments that were passed to :meth:`__init__`.
"""
result = custom.run(self.__ptype,
self.__name,
*args,
*self.__args,
**self.__kwargs)
if self.__metaproc is not None and \
isinstance(result, tuple) and \
len(result) == 4:
meta = result[3]
mproc = self.__metaproc
try:
meta = [custom.runMetaproc(mproc, m) for m in meta]
except Exception as e:
log.warning('Metadata processing function failed: %s', e)
result = tuple(list(result[:3]) + [meta])
return result
def parseProcesses(procs, ptype):
......
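With these changes, ``runProcess`` accepts 2-, 3-, or 4-element results, padding shorter forms so that every new column gets a vid slot and a metadata slot. The accumulation can be sketched standalone (simplified; strings stand in for ``pandas.Series`` and ``Column`` objects):

```python
def accumulate(results):
    """Simplified sketch of the result handling in runProcess:
    collects columns to remove/add, padding missing vid and
    metadata lists with None."""
    remove, add, addvids, addmeta = [], [], [], []

    def expand(res, length):
        if res is None: return [None] * length
        else:           return res

    for result in results:
        if result is None:
            continue

        if   len(result) == 2:              # (add, vids)
            result = ([],) + tuple(result) + (None,)
        elif len(result) == 3:              # (remove, add, vids)
            result = tuple(result) + (None,)
        elif len(result) != 4:
            raise ValueError('Invalid process result')

        remove .extend(result[0])
        add    .extend(result[1])
        addvids.extend(expand(result[2], len(result[1])))
        addmeta.extend(expand(result[3], len(result[1])))

    return remove, add, addvids, addmeta
```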
...@@ -43,6 +43,12 @@ Furthermore, all processing functions must return one of the following:
- List of ``Series`` to be added
- List of variable IDs for each new ``Series``.
- A ``tuple`` of length 4, containing the above, and:
- List of metadata associated with each of the new ``Series``. This will
be added to the :class:`.Column` objects that represent each of the new
``Series``.
The following processing functions are defined:
.. autosummary::
...@@ -217,6 +223,7 @@ def binariseCategorical(dtable,
toremove = []
newseries = []
newvids = []
newmeta = []
for vid in vids:
...@@ -261,13 +268,14 @@ def binariseCategorical(dtable,
}
newvids .append(vid)
newmeta .append(val)
newseries.append(pd.Series(
col,
index=dtable.index,
name=nameFormat.format(**fmtargs)))
if replace: return toremove, newseries, newvids, newmeta
else: return [], newseries, newvids, newmeta
@custom.processor()
......
...@@ -30,7 +30,7 @@
"\n",
" ukbparse -h\n",
" \n",
"*The examples in this notebook assume that you have installed `ukbparse` 0.21.0 or newer.*"
]
},
{
......
%% Cell type:markdown id: tags:
![image.png](attachment:image.png)
# `ukbparse`
> Paul McCarthy &lt;paul.mccarthy@ndcn.ox.ac.uk&gt; ([WIN@FMRIB](https://www.win.ox.ac.uk/))
`ukbparse` is a command-line program which you can use to extract data from UK BioBank (and other tabular) data sets.
You can give `ukbparse` one or more input files (e.g. `.csv`, `.tsv`), and it will merge them together, perform some preprocessing, and produce a single output file.
A large number of rules are built into `ukbparse` which are specific to the UK BioBank data set, but you can control and customise everything that `ukbparse` does to your data, including which rows and columns to extract, and which cleaning/processing steps to perform on each column.
The `ukbparse` source code is available at https://git.fmrib.ox.ac.uk/fsl/ukbparse. You can install `ukbparse` into a Python environment using `pip`:
    pip install ukbparse
Get command-line help by typing:
    ukbparse -h
*The examples in this notebook assume that you have installed `ukbparse` 0.21.0 or newer.*
%% Cell type:code id: tags:
``` bash
ukbparse -V
```
%% Cell type:markdown id: tags:
### Contents
1. [Overview](#Overview)
    1. [Import](#1.-Import)
    2. [Cleaning](#2.-Cleaning)
    3. [Processing](#3.-Processing)
    4. [Export](#4.-Export)
2. [Examples](#Examples)
3. [Import examples](#Import-examples)
    1. [Selecting variables (columns)](#Selecting-variables-(columns))
        1. [Selecting individual variables](#Selecting-individual-variables)
        2. [Selecting variable ranges](#Selecting-variable-ranges)
        3. [Selecting variables with a file](#Selecting-variables-with-a-file)
        4. [Selecting variables from pre-defined categories](#Selecting-variables-from-pre-defined-categories)
    2. [Selecting subjects (rows)](#Selecting-subjects-(rows))
        1. [Selecting individual subjects](#Selecting-individual-subjects)
        2. [Selecting subject ranges](#Selecting-subject-ranges)
        3. [Selecting subjects from a file](#Selecting-subjects-from-a-file)
        4. [Selecting subjects by variable value](#Selecting-subjects-by-variable-value)
        5. [Excluding subjects](#Excluding-subjects)
    3. [Selecting visits](#Selecting-visits)
    4. [Merging multiple input files](#Merging-multiple-input-files)
        1. [Merging by subject](#Merging-by-subject)
        2. [Merging by column](#Merging-by-column)
        3. [Naive merging](#Naive-merging)
4. [Cleaning examples](#Cleaning-examples)
    1. [NA insertion](#NA-insertion)
    2. [Variable-specific cleaning functions](#Variable-specific-cleaning-functions)
    3. [Categorical recoding](#Categorical-recoding)
    4. [Child value replacement](#Child-value-replacement)
5. [Processing examples](#Processing-examples)
    1. [Sparsity check](#Sparsity-check)
    2. [Redundancy check](#Redundancy-check)
    3. [Categorical binarisation](#Categorical-binarisation)
6. [Custom cleaning, processing and loading - ukbparse plugins](#Custom-cleaning,-processing-and-loading---ukbparse-plugins)
    1. [Custom cleaning functions](#Custom-cleaning-functions)
    2. [Custom processing functions](#Custom-processing-functions)
    3. [Custom file loaders](#Custom-file-loaders)
7. [Miscellaneous topics](#Miscellaneous-topics)
    1. [Non-numeric data](#Non-numeric-data)
    2. [Dry run](#Dry-run)
    3. [Built-in rules](#Built-in-rules)
    4. [Using a configuration file](#Using-a-configuration-file)
    5. [Reporting unknown variables](#Reporting-unknown-variables)
    6. [Low-memory mode](#Low-memory-mode)
%% Cell type:markdown id: tags:
# Overview
`ukbparse` performs the following steps:
## 1. Import
All data files are loaded in, unwanted columns and subjects are dropped, and the data files are merged into a single table (a.k.a. data frame). Multiple files can be merged according to an index column (e.g. subject ID). Or, if the input files contain the same columns/subjects, they can be naively concatenated along rows or columns.
## 2. Cleaning
The following cleaning steps are applied to each column:
1. **NA value replacement:** Specific values for some columns are replaced with NA, for example, variables where a value of `-1` indicates *Do not know*.
2. **Variable-specific cleaning functions:** Certain columns are re-formatted - for example, the [ICD10](https://en.wikipedia.org/wiki/ICD-10) disease codes are converted to integer representations.
3. **Categorical recoding:** Certain categorical columns are re-coded.
4. **Child value replacement:** NA values within some columns which are dependent upon other columns may have values inserted based on the values of their parent columns.
## 3. Processing
During the processing stage, columns may be removed, merged, or expanded into additional columns. For example, a categorical column may be expanded into a set of binary columns, one for each category.
A column may also be removed on the basis of being too sparse, or being redundant with respect to another column.
## 4. Export