Commit da176971 authored by Paul McCarthy

Merge branch 'bf/binarise-categorical' into 'master'

New --add_aux_var option

See merge request !97
parents 48c0d5ff 616cb27e
......@@ -2,6 +2,42 @@ FUNPACK release history
=======================
3.3.0 (Monday 27th June 2022)
-----------------------------
Added
^^^^^
* New ``--add_aux_vars`` option, which causes all auxiliary data-fields
specified in the processing rules to be selected for import, if present
in the data, and not already selected. This option is enabled in the
``fmrib`` configuration profile.
* The ``$FUNPACK_CONFIG_DIR`` environment variable can be used to direct
FUNPACK to a directory containing configuration, table, and plugin files.
Changed
^^^^^^^
* The ``fmrib`` configuration profile has been removed from FUNPACK, and is
now being released independently as a dependency of FUNPACK (see
https://git.fmrib.ox.ac.uk/fsl/funpack-fmrib-config). However, it will be
installed automatically alongside FUNPACK, so from an end-user's
perspective, this change will have no effect on usage.
Deprecated
^^^^^^^^^^
* The use of ``broadcast_`` arguments with processing functions has been
deprecated, and will be removed in FUNPACK 4.0.0. The alternative to
broadcast arguments is for processing functions to perform their own
parallelisation internally.
3.2.3 (Wednesday 1st June 2022)
-------------------------------
......@@ -584,10 +620,10 @@ Changed
* Changes to the ``fmrib`` configuration - variables
`41202 <http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41202>`_,
`41203 <http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41203>`_,
`41270 <http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41270>`_, and
`41271 <http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41271>`_ are
`41202 <http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41202>`__,
`41203 <http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41203>`__,
`41270 <http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41270>`__, and
`41271 <http://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41271>`__ are
binarised, and the binarised values replaced with diagnosis dates from
the corresponding date variables.
* The processing function interface has been changed - processing functions
......@@ -769,7 +805,7 @@ Fixed
* Fixed a bug where non-numeric variables (e.g.
`41271 <https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41271>`_ ) were
`41271 <https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41271>`__) were
being interpreted by ``pandas`` as being numeric.
......
......@@ -33,14 +33,15 @@ Installation
------------
Install FUNPACK via pip::
Install FUNPACK from ``conda-forge``::
pip install fmrib-unpack
conda install -c conda-forge fmrib-unpack
Or from ``conda-forge``::
Or using ``pip``::
pip install fmrib-unpack
conda install -c conda-forge fmrib-unpack
The FUNPACK source code can be found at
......@@ -139,15 +140,20 @@ Built-in rules
FUNPACK contains a large number of built-in rules which have been specifically
written to pre-process UK BioBank data variables. These rules are stored in
the following files (the `funpack/` directory is located in your Python
environment directory):
the following files [*]_:
* ``funpack/configs/fmrib/datacodings_*.tsv``: Cleaning rules for data codings
* ``funpack/configs/fmrib/datacodings_*.tsv``: Cleaning rules for data
codings
* ``funpack/configs/fmrib/variables_*.tsv``: Cleaning rules for individual
variables
* ``funpack/configs/fmrib/processing.tsv``: Processing steps
* ``funpack/configs/fmrib/categories.tsv``: Variable categories
.. [*] The ``funpack/`` directory is located in
``<python-env>/lib/python<X.Y>/site-packages/funpack/``, where
``<python-env>`` is the location of your Python environment directory,
and ``<X.Y>`` is the Python version you are using.
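To check where that directory is on a given system, a short Python sketch such as the following may help. It uses only the standard library. ``package_dir`` is a hypothetical helper name, not part of FUNPACK, and the sketch assumes the package is importable under the name ``funpack``.

```python
import importlib.util
import os.path as op


def package_dir(name):
    """Return the directory of an installed package, or None if the
    package cannot be found in the current Python environment.
    """
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return op.dirname(spec.origin)


if __name__ == '__main__':
    # Prints None if FUNPACK is not installed in this environment.
    print(package_dir('funpack'))
```

The rule files listed above would then be expected under the ``configs/fmrib/`` sub-directory of the printed location.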
You can use these rules by using the FMRIB configuration profile::
......@@ -162,6 +168,18 @@ variable and datacoding files, and merge them into a single table which
contains the cleaning rules for each variable.
.. note:: The ``fmrib`` configuration profile is managed and released
separately from FUNPACK at
https://git.fmrib.ox.ac.uk/fsl/funpack-fmrib-config/. However, it is
automatically installed alongside FUNPACK, so if you have FUNPACK,
you can use the ``fmrib`` profile. If you are using FUNPACK from a
source checkout, you may need to manually install the
``fmrib-unpack-fmrib-config`` package from `PyPI
<https://pypi.org/project/fmrib-unpack-fmrib-config/>`_ or
`conda-forge
<https://anaconda.org/conda-forge/fmrib-unpack-fmrib-config>`_.
Creating your own rule files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
......@@ -204,7 +222,7 @@ Output
------
The main output of FUNPACK is a plain-text file[*]_ which contains the input
The main output of FUNPACK is a plain-text file [*]_ which contains the input
data, after cleaning and processing, potentially with some columns removed,
and new columns added.
......@@ -279,7 +297,7 @@ Tests
To run the test suite, you need to install some additional dependencies::
pip install fmrib-unpack[test]
pip install fmrib-unpack[test]
Then you can run the test suite using ``pytest``::
......
details summary p {
display: inline;
}
FUNPACK command-line interface
==============================
FUNPACK is controlled entirely through its command-line interface. You can
control which data fields and subjects to import, which cleaning/processing
rules to apply, and how the output should be formatted.
Configuration files
-------------------
FUNPACK can also be controlled through configuration files - a FUNPACK
configuration file simply contains a list of command-line options, without the
leading ``--``. For example, the options in the following command-line::
fmrib_unpack \
--overwrite \
--write_log \
--icd10_map_file icd_codes.tsv \
--category 10 \
--category 11 \
output.tsv input1.tsv input2.tsv
Could be stored in a configuration file ``config.txt``::
overwrite
write_log
icd10_map_file icd_codes.tsv
category 10
category 11
And then executed as follows::
fmrib_unpack -cfg config.txt output.tsv input1.tsv input2.tsv
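The correspondence between the two forms can be sketched in a few lines of Python. This is only an illustration of the idea (prefix each option name with ``--``), not FUNPACK's own parser; ``config_to_argv`` is a hypothetical name.

```python
import shlex


def config_to_argv(text):
    """Convert the text of a configuration file into a list of
    command-line arguments, prefixing each option name with ``--``.
    """
    argv = []
    for line in text.splitlines():
        # shlex drops "# ..." comments and honours quoting
        words = shlex.split(line, comments=True)
        if not words:
            continue
        argv.append('--' + words[0])
        argv.extend(words[1:])
    return argv


config = """
overwrite
write_log
icd10_map_file icd_codes.tsv
category 10
category 11
"""

argv = config_to_argv(config)
print(argv)
```

The resulting list contains the same options, in the same order, as the command line shown at the top of this section.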
A FUNPACK configuration file can *include* other configuration files, making
it possible to build a *configuration profile* which is organised across
several configuration files. For example, you may wish to store variable
selectors in one file ``variables.cfg``::
variable 31 # sex
variable 33 # dob
variable 41202 # ICD10 main diagnoses
Then your main configuration file can include ``variables.cfg`` via the
``config_file`` argument, along with any other options::
overwrite
write_log
config_file variables.cfg
The FMRIB configuration profile (described :ref:`here
<fmrib_configuration_profile>`) uses this feature to organise the options and
rules that it applies.
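One way to picture the include mechanism is as a recursive expansion of ``config_file`` entries. The sketch below is an illustration under stated assumptions, not FUNPACK's implementation: ``expand_config`` is a hypothetical name, and resolving relative includes against the including file's directory is an assumption made for the example (FUNPACK may search other locations).

```python
import os.path as op
import shlex
import tempfile


def expand_config(path, _stack=()):
    """Return a flat list of (option, arguments) pairs, with every
    ``config_file`` entry replaced by the options of the included file.
    """
    path = op.abspath(path)
    if path in _stack:
        raise ValueError(f'circular include: {path}')

    options = []
    with open(path, 'rt') as f:
        for line in f:
            words = shlex.split(line, comments=True)
            if not words:
                continue
            name, args = words[0], words[1:]
            if name == 'config_file':
                # Assumption: relative paths are resolved against
                # the directory of the including file.
                for inc in args:
                    inc = op.join(op.dirname(path), inc)
                    options.extend(expand_config(inc, _stack + (path,)))
            else:
                options.append((name, args))
    return options


# Recreate a cut-down version of the two files described above
with tempfile.TemporaryDirectory() as d:
    with open(op.join(d, 'variables.cfg'), 'wt') as f:
        f.write('variable 31 # sex\nvariable 33 # dob\n')
    with open(op.join(d, 'main.cfg'), 'wt') as f:
        f.write('overwrite\nwrite_log\nconfig_file variables.cfg\n')
    result = expand_config(op.join(d, 'main.cfg'))

print(result)
```

Tracking the chain of including files in ``_stack`` lets the sketch reject circular includes while still allowing the same file to be included from two different places.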
Command-line reference
----------------------
All of the command-line options that FUNPACK accepts are documented below:
..
Defined in conf.py
.. codesub:: none
|funpack_cli_help|
......@@ -17,19 +17,32 @@
# -- Project information -----------------------------------------------------
import os.path as op
import datetime
import funpack
import funpack.config as config
import funpack.custom as custom
date = datetime.date.today()
date = datetime.date.today()
author = 'Paul McCarthy'
project = 'FUNPACK'
release = '{}'.format(funpack.__version__)
copyright = f'{date.year}, Paul McCarthy, University of Oxford, Oxford, UK'
project = 'FUNPACK'
copyright = u'{}, Paul McCarthy, University of Oxford, Oxford, UK'.format(
date.year)
author = 'Paul McCarthy'
# substitutions used in various places throughout the docs
custom.registerBuiltIns()
docdir = op.dirname(op.abspath(__file__))
cfgdir = op.relpath(funpack.findConfigDir(), docdir)
release = '{}'.format(funpack.__version__)
# See funpack_sphinx_exts.SubstitutionCode -
# hack to allow multi-line substitutions
clihelp = config.makeParser().format_help().replace('\n', '//n//')
rst_prolog = f"""
.. |cfgdir| replace:: {cfgdir}
.. |funpack_cli_help| replace:: {clihelp}
"""
# -- General configuration ---------------------------------------------------
......@@ -41,7 +54,8 @@ extensions = [
'sphinx.ext.autodoc',
'sphinx.ext.viewcode',
'sphinx.ext.autosummary',
'nbsphinx'
'nbsphinx',
'funpack_sphinx_exts',
]
# Add any paths that contain templates here, relative to this directory.
......@@ -64,3 +78,5 @@ html_theme = 'sphinx_rtd_theme'
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
html_css_files = ['custom.css']
.. _fmrib_configuration_profile:
FMRIB configuration profile
===========================
FUNPACK comes with a built-in configuration profile (the `"FMRIB"
<https://www.win.ox.ac.uk/about/locations/fmrib/>`_ profile), containing a
range of processing rules for a large number of UK BioBank data fields. These
rules can be applied by running::
fmrib_unpack -cfg fmrib <output.tsv> <input.csv>
The ``fmrib`` configuration profile is installed alongside the FUNPACK source
code - it can be viewed online `here
<https://git.fmrib.ox.ac.uk/fsl/funpack-fmrib-config/-/tree/master/funpack/configs>`_,
or found in your local FUNPACK installation within
``<python-env>/lib/python<X.Y>/site-packages/funpack/configs/`` (replacing
``<python-env>`` with the location of your Python environment, and ``<X.Y>``
with the Python version).
.. note:: The ``fmrib`` configuration profile is managed independently from
the FUNPACK source code at
https://git.fmrib.ox.ac.uk/fsl/funpack-fmrib-config/, but is always
installed alongside FUNPACK.
The ``fmrib`` configuration profile is split across several files, each of
which is described below. Click on the arrow to the left of each section to
view the contents of that file.
..
|cfgdir| is defined in conf.py
.. details::
.. summary:: ``funpack/configs/fmrib.cfg``: Top-level configuration file,
containing general settings, and references to the other
configuration files.
.. includesub:: |cfgdir|/fmrib.cfg
:literal:
.. details::
.. summary:: ``funpack/configs/local.cfg``: Included by ``fmrib.cfg``.
Contains some miscellaneous settings related to performance
and error-checking.
.. includesub:: |cfgdir|/local.cfg
:literal:
.. details::
.. summary:: ``funpack/configs/fmrib/categories.tsv``: Definition of FMRIB
datafield categories - groups of related data-fields.
Categories can be selected with the ``-c/--category`` option.
.. includesub:: |cfgdir|/fmrib/categories.tsv
:literal:
.. details::
.. summary:: ``funpack/configs/fmrib/datacodings_navalues.tsv``: NA value
replacement rules. The ``ID`` column specifies the UKB
data-coding, and the ``NAValues`` column is a comma-separated
list of values which will be removed. Each rule is applied to
every data-field that uses the corresponding data-coding. Only
the first 20 rules are shown here.
.. includesub:: |cfgdir|/fmrib/datacodings_navalues.tsv
:literal:
:end-line: 21
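As a rough illustration of what an NA value rule does, consider the following sketch. It is not FUNPACK's implementation: the table contents and the helper names ``load_na_rules`` and ``apply_na_rule`` are made up for the example, and only the ``ID`` and ``NAValues`` column names come from the description above.

```python
import csv
import io
import math

# Illustrative table only; these are not rules from the fmrib profile.
TABLE = 'ID\tNAValues\n13\t-1,-3\n100\t-818\n'


def load_na_rules(text):
    """Map each data-coding ID to the set of values to be removed."""
    rules = {}
    reader = csv.DictReader(io.StringIO(text), delimiter='\t')
    for row in reader:
        values = row['NAValues'].split(',')
        rules[int(row['ID'])] = {float(v) for v in values}
    return rules


def apply_na_rule(values, na_values):
    """Replace every value listed in the rule with NaN."""
    return [math.nan if v in na_values else v for v in values]


rules   = load_na_rules(TABLE)
cleaned = apply_na_rule([1.0, -1.0, 2.0, -3.0], rules[13])
print(cleaned)
```

In this sketch the second and fourth values are replaced with NaN, because ``-1`` and ``-3`` are listed for the (hypothetical) data-coding 13.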
.. details::
.. summary:: ``funpack/configs/fmrib/datacodings_recoding.tsv``:
Categorical recoding rules. The ``ID`` column specifies the
UKB data-coding, the ``RawLevels`` column is a comma-separated
list of values to be replaced, and the ``NewLevels`` column is
a comma-separated list of values to replace the ``RawLevels``
with. Each rule is applied to every data-field that uses the
corresponding data-coding. Only the first 20 rules are shown
here.
.. includesub:: |cfgdir|/fmrib/datacodings_recoding.tsv
:literal:
:end-line: 21
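A recoding rule can be thought of as a lookup table built from the two comma-separated lists. The sketch below assumes that ``RawLevels`` and ``NewLevels`` are paired by position and have the same length; that pairing, the example values, and the name ``make_recoder`` are assumptions for illustration rather than a description of FUNPACK's internals.

```python
def make_recoder(raw_levels, new_levels):
    """Build a function which replaces each raw level with the new
    level at the same position, leaving all other values unchanged.
    """
    raw = [float(v) for v in raw_levels.split(',')]
    new = [float(v) for v in new_levels.split(',')]
    if len(raw) != len(new):
        raise ValueError('RawLevels and NewLevels must have '
                         'the same number of entries')
    mapping = dict(zip(raw, new))

    def recode(values):
        return [mapping.get(v, v) for v in values]

    return recode


# Hypothetical rule: 1 -> 0.5, 2 -> 1.5, -7 -> 0
recode  = make_recoder('1,2,-7', '0.5,1.5,0')
recoded = recode([1.0, 2.0, -7.0, 3.0])
print(recoded)
```

Values that do not appear in ``RawLevels`` (``3.0`` in this example) pass through untouched.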
.. details::
.. summary:: ``funpack/configs/fmrib/variables_clean.tsv``: Custom cleaning
functions. The ``ID`` column specifies the UKB data-field, and
the ``Clean`` column specifies the cleaning function to apply
(see :ref:`here <cleaning_functions>` for an overview of all
built-in cleaning functions). Only the first 20 rules are
shown here.
.. includesub:: |cfgdir|/fmrib/variables_clean.tsv
:literal:
:end-line: 21
.. details::
.. summary:: ``funpack/configs/fmrib/datetime_formatting.tsv``: Output
format for all date/time data-fields. These functions are
defined in the built-in :mod:`funpack.plugins.fmrib` plugin
module.
.. includesub:: |cfgdir|/fmrib/datetime_formatting.tsv
:literal:
.. details::
.. summary:: ``funpack/configs/fmrib/variables_parentvalues.tsv``: Child
value replacement rules applied to data fields. The ``ID``
column specifies the UKB data-field, the ``ParentValues``
column is an expression which is evaluated on the parent
data-fields, and the ``ChildValues`` column is a value to set
the data-field to when the expression evaluates to true. Only
the first 20 rules are shown here.
.. includesub:: |cfgdir|/fmrib/variables_parentvalues.tsv
:literal:
:end-line: 21
.. details::
.. summary:: ``funpack/configs/fmrib/processing.tsv``: Processing rules
applied to the data set after all cleaning stages have been
performed. See :ref:`here <processing_functions>` for an
overview of all of the built-in processing functions.
.. includesub:: |cfgdir|/fmrib/processing.tsv
:literal:
|
Some additional configuration profiles enhance the ``fmrib`` profile with
some extra options.
The ``fmrib_standard`` profile incorporates all options from the ``fmrib``
profile. However, it only loads datafields from the categories listed in
``fmrib_cats.cfg``, and it also configures ``funpack`` to output verbose
logging information and additional summary files.
.. details::
.. summary:: ``fmrib_standard.cfg``
.. includesub:: |cfgdir|/fmrib_standard.cfg
:literal:
.. details::
.. summary:: ``fmrib_logs.cfg``
.. includesub:: |cfgdir|/fmrib_logs.cfg
:literal:
.. details::
.. summary:: ``fmrib_cats.cfg``
.. includesub:: |cfgdir|/fmrib_cats.cfg
:literal:
|
The ``fmrib_new_release`` profile is the same as the ``fmrib_standard``
profile, but also loads and processes any unknown or uncategorised
datafields. This is useful when processing a new data set which may contain
newly added data fields that you have not yet encountered.
.. details::
.. summary:: ``fmrib_new_release.cfg``
.. includesub:: |cfgdir|/fmrib_new_release.cfg
:literal:
......@@ -58,14 +58,16 @@ For example, this command would run cleaning functinos ``func1``, ``func2``,
output.tsv input.csv
.. _cleaning_functions:
Cleaning functions
------------------
These functions may be used with the ``-cl`` / ``--clean`` command-line
option. For example, to apply ``fillMissing`` and ``flattenHierarchical`` to
the columns of data field `41202
<https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41202>`_::
Built-in cleaning functions are defined in the
:mod:`funpack.cleaning_functions` module. These functions may be used with
the ``-cl`` / ``--clean`` command-line option. For example, to apply
``fillMissing`` and ``flattenHierarchical`` to the columns of data field
`41202 <https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41202>`_::
fmrib_unpack -cl 20002 "fillVisits,flattenHierarchical" output.tsv input.csv
......@@ -89,15 +91,19 @@ the columns of data field `41202
:noindex:
.. _processing_functions:
Processing functions
--------------------
These functions may be used with the ``-ppr`` / ``--prepend_process`` and
``-apr`` / ``--append_process`` command-line flags. You have the option to
prepend or append processing functions in case you are using a pre-defined
processing table or configuration profile, but wish to perform some additional
steps for specific data fields.
Built-in processing functions are defined in the
:mod:`funpack.processing_functions` module. These functions may be used with
the ``-ppr`` / ``--prepend_process`` and ``-apr`` / ``--append_process``
command-line flags. You have the option to prepend or append processing
functions in case you are using a pre-defined processing table or
configuration profile, but wish to perform some additional steps for specific
data fields.
For example, to apply ``removeIfSparse`` and ``binariseCategorical`` to the
......
#!/usr/bin/env fslpython
#
# sphinx extensions which:
# - embed content within a HTML <details> element
# - allows substitutions in paths passed to include::
# - allows substitutions in codeblock::
from docutils import nodes
from docutils.parsers.rst import Directive
from sphinx.directives.other import Include
from sphinx.directives.code import CodeBlock
from sphinx.util.nodes import nested_parse_with_titles


class DetailsNode(nodes.Element):
    pass


class SummaryNode(nodes.Element):
    pass


class DetailsDirective(Directive):
    has_content = True

    def run(self):
        node = DetailsNode()
        nested_parse_with_titles(self.state, self.content, node)
        return [node]


class SummaryDirective(Directive):
    has_content = True

    def run(self):
        node = SummaryNode()
        self.state.nested_parse(self.content, self.content_offset, node)
        return [node]


def visitSummaryNode(self, node):
    self.body.append('<details><summary>')


def departSummaryNode(self, node):
    self.body.append('</summary>')


def visitDetailsNode(self, node):
    self.body.append('<p>')


def departDetailsNode(self, node):
    self.body.append("</p></details>")


class SubstitutionInclude(Include):
    def run(self):
        filename = self.arguments[0]
        for orig, repl in self.state.document.substitution_defs.items():
            filename = filename.replace(f'|{orig}|', repl.astext())
        self.arguments[0] = filename
        return super().run()


class SubstitutionCode(CodeBlock):
    def run(self):
        newcontent = []
        for line in self.content:
            for orig, repl in self.state.document.substitution_defs.items():
                line = line.replace(f'|{orig}|', repl.astext())
            # hack to allow multi-line substitutions
            line = line.replace('//n//', '\n')
            newcontent.append(line)
        self.content = newcontent
        return super().run()


def setup(app):
    app.add_node(DetailsNode, html=(visitDetailsNode, departDetailsNode))
    app.add_node(SummaryNode, html=(visitSummaryNode, departSummaryNode))
    app.add_directive('details', DetailsDirective)
    app.add_directive('summary', SummaryDirective)
    app.add_directive('includesub', SubstitutionInclude)
    app.add_directive('codesub', SubstitutionCode)
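The ``//n//`` placeholder used by ``conf.py`` and ``SubstitutionCode`` can be tried out without Sphinx. The sketch below mimics the two steps with plain strings: ``conf.py`` encodes newlines before the text goes into a ``replace::`` substitution, and the directive decodes them after substituting. ``substitute`` and the sample help text are hypothetical stand-ins, and a plain ``dict`` replaces the ``substitution_defs`` mapping of docutils nodes.

```python
def substitute(line, defs):
    """Replace each |name| in the line with its definition, then turn
    the //n// placeholder back into real newlines.
    """
    for name, value in defs.items():
        line = line.replace(f'|{name}|', value)
    return line.replace('//n//', '\n')


# Stand-in for the output of an argparse format_help() call
help_text = 'usage: prog [-h]\n\noptions:\n  -h  show help'

# Encoding step, as performed in conf.py
encoded = help_text.replace('\n', '//n//')
defs    = {'funpack_cli_help': encoded}

# Decoding step, as performed by the codesub directive
decoded = substitute('|funpack_cli_help|', defs)
print(decoded)
```

The round trip returns the original multi-line text, which is the property that the placeholder is there to provide.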
......@@ -8,6 +8,8 @@ FUNPACK
self
demo.ipynb
function_reference
command_line
fmrib_profile
apiref
changelog
......
......@@ -6,18 +6,19 @@
#
__version__ = '3.2.3'
__version__ = '3.3.0'
"""The ``funpack`` versioning scheme roughly follows Semantic Versioning
conventions.
"""
from .custom import (loader, # noqa
from .util import findConfigDir # noqa
from .custom import (loader, # noqa
sniffer,
formatter,
exporter,
processor,
metaproc,
cleaner)
from .datatable import (DataTable, # noqa
from .datatable import (DataTable, # noqa
Column)
......@@ -5,9 +5,10 @@
# Author: Paul McCarthy <pauldmccarthy@gmail.com>
#
"""This module contains functions for parsing ``funpack`` command line
arguments and configuration files. A ``funpack`` configuration file
is simply a plain-text file which contains command-line options,
sans-hyphens.
arguments and configuration files.
A ``funpack`` configuration file is simply a plain-text file which contains
command-line options, sans-hyphens.
"""
......@@ -15,12 +16,13 @@ import os.path as op
import functools as ft
import itertools as it
import multiprocessing as mp
import os
import sys
import site
import shlex
import logging
import argparse
import collections
import numpy as np
import funpack
import funpack.util as util
......@@ -56,14 +58,17 @@ CLI_ARGUMENTS = collections.OrderedDict((
'default' : DEFAULT_MERGE_AXIS}),
(('ms', 'merge_strategy'), {'choices' : AVAILABLE_MERGE_STRATEGIES,
'default' : DEFAULT_MERGE_STRATEGY}),
(('rm', 'remove_duplicates'), {'action' : 'store_true'}),
(('rd', 'rename_duplicates'), {'action' : 'store_true'}),
(('cfg', 'config_file'), {'action' : 'append'}),
(('vf', 'variable_file'), {'action' : 'append'}),
(('df', 'datacoding_file'), {'action' : 'append'}),
(('tf', 'type_file'), {}),
(('pf', 'processing_file'), {}),
(('cf', 'category_file'), {})]),
(('rm', 'remove_duplicates'), {'action' : 'store_true'}),
(('rd', 'rename_duplicates'), {'action' : 'store_true'}),
(('cfg', 'config_file'), {'action' : 'append',
'type' : util.findConfigFile}),
(('vf', 'variable_file'), {'action' : 'append',
'type' : util.findTableFile}),
(('df', 'datacoding_file'), {'action' : 'append',
'type' : util.findTableFile}),
(('tf', 'type_file'), {'type' : util.findTableFile}),
(('pf', 'processing_file'), {'type' : util.findTableFile}),
(('cf', 'category_file'), {'type' : util.findTableFile})]),