Commit 88cc84b5 authored by Paul McCarthy

DOC: Removed obsolete docs

parent 98a9f0b7
Data importing
--------------
In order to load the `input data <inputs.rst>`_, ``ukbparse`` follows these
steps:

1. Load the variable, data coding, and processing tables.

2. Fill in the variable table from the data coding table as needed (the
   ``NAValues``, ``RawLevels`` and ``NewLevels`` columns). After this step,
   the data coding table is no longer used - all information is in the
   variable table.

3. Parse dependency expressions (``ParentValues``) to figure out dependencies.

4. Load data. Only those variables which are listed in the variable table, and
   which are not marked for removal in the processing table, are loaded.

   .. todo:: Need to set data types - cast categoricals, minimal numpy data
             type

5. Re-encode missing values (``NAValues``, e.g. (-1,-3) -> nan).

6. Apply pre-processing steps - refer to the page on `processing
   <processing.rst>`_.

7. Apply dependency expressions. In order for expressions to evaluate
   correctly, we need to apply them in order from most "childish"
   (i.e. variables with no children) to least childish (variables with no
   parents).

8. Re-encode variable values (replace ``RawLevels`` with ``NewLevels``,
   e.g. 555 -> 0.5).

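
The ordering required in step 7 is a topological ordering of the dependency
graph. The following is a minimal sketch, not the ``ukbparse`` implementation:
the function name ``application_order`` and the variable IDs are hypothetical,
and it assumes the dependencies have already been extracted from
``ParentValues`` as a mapping from each child to its parents.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def application_order(parents_of):
    """Return variable IDs ordered so that each variable precedes its parents.

    parents_of maps a (hypothetical) child variable ID to the IDs of the
    parent variables referenced in its ParentValues expressions.
    """
    sorter = TopologicalSorter()
    for child, parents in parents_of.items():
        sorter.add(child)
        for parent in parents:
            # Declaring the child as a predecessor of the parent makes
            # static_order() emit the child first.
            sorter.add(parent, child)
    return list(sorter.static_order())


# Invented example: variable 30 depends on 20, which depends on 10.
order = application_order({30: [20], 20: [10]})
print(order)
```

``TopologicalSorter`` raises ``graphlib.CycleError`` if the dependencies are
circular, which is a convenient way to detect an invalid variable table.
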
``ukbparse`` inputs
===================
The primary input to ``ukbparse`` is a file which contains a two-dimensional
matrix of data for ``nsubjects`` (the rows) and ``nvariables`` (the columns).
It is assumed that the first row in the file contains identifiers for each
variable.

.. todo:: Describe ID format and input data more - visits and
          multi-valued variables.

Additionally, a number of tables are expected, which contain information about
the variables in the data, and specify processing that is to be applied to
them:
- The *data coding* table
- The *variable* table
- The *processing* table
Each of these tables is expected to be specified as a ``tsv`` (tab-separated
value) file.
Data types
----------
The following data types may be present in the input data:

========================== ===============================================
Type                       Description
========================== ===============================================
**Sequence**               A sequentially increasing integer
**Categorical (single)**   A single selection from a set of categories
**Categorical (multiple)** One or more selections from a set of categories
**Integer**                An integer number
**Continuous**             A decimal number
**Text**                   Text
**Time**                   A time
**Date**                   A date
**Compound**               A combination of other types
========================== ===============================================

.. note:: Multiple items may be present for any type - for categoricals, this
          is explicit in the name ("categorical (single)" or "categorical
          (multiple)").

Data coding table
-----------------
Many (but not all) variables are assumed to adhere to a specific "data
coding", which defines a sub-set of values the variable may take.
The data coding table is a ``tsv`` file which contains a row for each known
data coding, defining how variables with that coding should be imported and
stored.
Each row in the data coding table has the following columns:

============= ================================================================
Name          Meaning
============= ================================================================
**ID**        An integer ID for this data coding.
**Type**      Name of the type after it has been loaded - one of:

              - binary
              - ordinal
              - continuous
              - categorical (single)
              - categorical (multiple)
              - text

**NumValues** Number of values defined in this data coding.
**NAValues**  Comma-separated list of values which are to be recoded to *NA*
              (not available).
**RawLevels** Comma-separated list of values denoting values which are to be
              recoded to corresponding **NewLevels** on import. ``NA`` may be
              used. Need not be exhaustive.
**NewLevels** Comma-separated list of values, one for each value in
              **RawLevels**, denoting new recoded values. ``NA`` may be used.
============= ================================================================

.. note:: The **Type** column may not be necessary - taken at face-value, all
          of the data codings are categorical, but they may be used in
          variables of different types (e.g. data coding `100291
          <http://biobank.ctsu.ox.ac.uk/crystal/coding.cgi?id=100291>`_ is
          used in continuous variables to encode missing values).

          The **NumValues** column may also be unnecessary for similar
          reasons. It is intended to be used as a hint in determining the
          data type required to store a particular variable, but the number
          of values in the data coding does not necessarily correspond to the
          number of values that a variable using that data coding may take.

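
To make the column formats concrete, the sketch below reads a one-row data
coding table and applies its **NAValues**, **RawLevels** and **NewLevels** to
a few values. It is only an illustration: the coding ID ``9999``, the helper
names ``parse_values`` and ``recode``, and the choice to store everything as
floats are assumptions. It also applies both recodings in one pass, whereas
``ukbparse`` applies them at separate stages of the import.

```python
import csv
import io
import math


def parse_values(cell):
    """Split a comma-separated cell into floats, mapping 'NA' to nan."""
    if not cell.strip():
        return []
    return [math.nan if tok.strip() == 'NA' else float(tok)
            for tok in cell.split(',')]


# Hypothetical one-row data coding table (the ID and values are invented).
TABLE = ("ID\tType\tNumValues\tNAValues\tRawLevels\tNewLevels\n"
         "9999\tcontinuous\t3\t-1,-3\t555\t0.5\n")

row = next(csv.DictReader(io.StringIO(TABLE), delimiter='\t'))
na_values = parse_values(row['NAValues'])
recoding = dict(zip(parse_values(row['RawLevels']),
                    parse_values(row['NewLevels'])))


def recode(value):
    """Recode NA values first, then map raw levels to new levels."""
    if value in na_values:
        return math.nan
    return recoding.get(value, value)


recoded = [recode(v) for v in [2.0, -1.0, 555.0, -3.0]]
print(recoded)
```

Values that appear in neither list (``2.0`` here) pass through unchanged,
which matches the statement that **RawLevels** need not be exhaustive.
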
Variable table
--------------
The variable table is a ``tsv`` file which contains a row for every known
variable, and any special pre-processing which should be applied to that
variable.
Each row in the variable table has the following columns:

================ =============================================================
Name             Meaning
================ =============================================================
**ID**           An integer ID for this variable, called **UDI** in Biobank-ese.
**Type**         Data type (see above).
**Description**  Description.
**DataCoding**   Data coding ID (may be *NA*).
**NAValues**     May be used to override **NAValues** from data coding table.
**RawLevels**    May be used to override **RawLevels** from data coding table.
**NewLevels**    May be used to override **NewLevels** from data coding table.
**ParentValues** One or more rules to be used where this variable has a
                 missing value. The rules are expressions defining values that
                 parent variables of this variable may take to cause the value
                 for this variable to be replaced. See below.
**ChildValues**  A list of comma-separated values (one for each rule in
                 **ParentValues**), specifying the replacement value.
**Preprocess**   List of pre-processing steps to apply to this variable,
                 separated with commas.
================ =============================================================

**NAValues**, **RawLevels**, and **NewLevels**
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The **NAValues**, **RawLevels**, and **NewLevels** columns only need to be
used for variables which do not have a data coding, or which have *NA* or
recoding conventions that differ from their data coding.
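
The override behaviour can be sketched as follows. This is an illustration
under stated assumptions rather than the ``ukbparse`` code: rows are modelled
as plain dictionaries of strings, an empty cell means "not specified", the
function name ``fill_from_coding`` and all IDs are hypothetical, and each of
the three columns is filled independently.

```python
COLUMNS = ('NAValues', 'RawLevels', 'NewLevels')


def fill_from_coding(variable, codings):
    """Copy empty NAValues/RawLevels/NewLevels cells from the data coding.

    Cells that are already set in the variable row take precedence.
    """
    coding = codings.get(variable.get('DataCoding'), {})
    filled = dict(variable)
    for column in COLUMNS:
        if not filled.get(column):
            filled[column] = coding.get(column, '')
    return filled


# Invented example rows.
codings = {'9999': {'NAValues': '-1,-3',
                    'RawLevels': '555',
                    'NewLevels': '0.5'}}
variable = {'ID': '1234', 'DataCoding': '9999',
            'NAValues': '-1', 'RawLevels': '', 'NewLevels': ''}

filled = fill_from_coding(variable, codings)
print(filled)
```

Here the variable keeps its own **NAValues** (``-1``) but inherits
**RawLevels** and **NewLevels** from the data coding.
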
**ParentValues** and **ChildValues**
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The **ParentValues** and **ChildValues** columns may be used when the current
variable values are missing, and parent variables of this variable have
values which may be used to infer the value of this variable.
**ParentValues** comprises one or more conditional expressions which define
the values of the parent variables that cause the child value to be replaced.
**ChildValues** comprises one or more values, corresponding to the expressions
in **ParentValues**, and defining the replacement value. Multiple expressions
in **ParentValues**, and values in **ChildValues**, must be separated with
commas. All spaces are ignored.
A single expression in the **ParentValues** column may contain one or more
comparisons of a parent variable ID to a numeric value. The following
comparison operators may be used:

======== ========================
Operator Meaning
======== ========================
``==``   Equal to
``!=``   Not equal to
``>``    Greater than
``>=``   Greater than or equal to
``<``    Less than
``<=``   Less than or equal to
======== ========================

Parent variables are denoted by the letter ``'v'`` followed by their numeric
ID, and must always appear on the left side of the statement.
For example, the following comparison will evaluate to true when the
variable with ID 1234 has a value less than or equal to 6::

    v1234 <= 6

.. todo:: Do we need comparisons on non-numeric, compound, categorical, and
          array variables? Only allow == and != for non-numerics, and to
          denote presence or absence in a multiple/array variable?

The values used in an expression are to be specified as they were *prior to
any recoding of* **RawLevels** *to* **NewLevels** of that parent's values, but
*after recoding of* **NAValues**.
Multiple comparison statements may be combined with logical operators ``&&``
(logical *and*), ``||`` (logical *or*), and ``~`` (logical *not*). Round
brackets ``()`` may be used to enforce precedence.
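
A small recursive-descent evaluator illustrates how such expressions could be
interpreted. This is a sketch, not the parser used by ``ukbparse``: the
function name ``evaluate`` is hypothetical, and the precedence it applies
(``~`` binds tightest, then ``&&``, then ``||``) is an assumption, since the
text above only says that brackets may be used to enforce precedence.

```python
import operator
import re

TOKEN = re.compile(r'v\d+|==|!=|>=|<=|>|<|&&|\|\||~|\(|\)|-?\d+(?:\.\d+)?')
COMPARE = {'==': operator.eq, '!=': operator.ne, '>': operator.gt,
           '>=': operator.ge, '<': operator.lt, '<=': operator.le}


def evaluate(expression, values):
    """Evaluate an expression against a {variable ID: value} mapping."""
    text = expression.replace(' ', '')        # all spaces are ignored
    tokens = TOKEN.findall(text)
    if ''.join(tokens) != text:
        raise ValueError(f'cannot tokenise {expression!r}')
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def parse_or():
        result = parse_and()
        while peek() == '||':
            take()
            rhs = parse_and()
            result = result or rhs
        return result

    def parse_and():
        result = parse_not()
        while peek() == '&&':
            take()
            rhs = parse_not()
            result = result and rhs
        return result

    def parse_not():
        if peek() == '~':
            take()
            return not parse_not()
        if peek() == '(':
            take()
            result = parse_or()
            if take() != ')':
                raise ValueError('expected )')
            return result
        # A comparison: the variable must be on the left-hand side.
        variable, op, number = take(), take(), take()
        if (variable is None or not variable.startswith('v')
                or op not in COMPARE or number is None):
            raise ValueError(f'malformed comparison in {expression!r}')
        return COMPARE[op](values[int(variable[1:])], float(number))

    result = parse_or()
    if pos != len(tokens):
        raise ValueError(f'unexpected token {peek()!r}')
    return result


print(evaluate('v1234 <= 6', {1234: 6}))
print(evaluate('~(v1 == 1) && v2 > 0 || v3 != 3', {1: 1, 2: 5, 3: 3}))
```

Because comparisons against ``nan`` are always false in Python, a missing
parent value makes every comparison other than ``!=`` evaluate to false in
this sketch.
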
Each expression in **ParentValues** must be accompanied by a value in
**ChildValues** - this is used to fill in the variable value in the event the
expression evaluates to true.
Multiple expressions are applied in succession, i.e. a second
expression/value pair may overwrite the value of the first. Note that the
replacement value given to the variable will be subject to any re-encoding
specified in the **RawLevels** and **NewLevels** fields, as this stage is
applied after the parent variable replacement stage. See the page on
`importing <importing.rst>`_ for more details.
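
The interaction between rule order and re-encoding can be sketched like this.
The names ``apply_rules``, ``rules`` and ``new_levels`` are hypothetical, the
parsed expressions are stood in for by plain Python predicates, and the
sketch assumes that every rule is tested whenever the original value is
missing.

```python
import math


def apply_rules(value, parent_values, rules):
    """Fill a missing value from (predicate, replacement) pairs, in order."""
    if not math.isnan(value):
        return value                 # rules only apply to missing values
    for predicate, replacement in rules:
        if predicate(parent_values):
            value = replacement      # a later match overwrites an earlier one
    return value


# Two invented ParentValues/ChildValues pairs for parent variable 1234.
rules = [(lambda v: v[1234] <= 6, 0.0),
         (lambda v: v[1234] == 0, 555.0)]
new_levels = {555.0: 0.5}            # RawLevels -> NewLevels recoding

filled = apply_rules(math.nan, {1234: 0}, rules)   # both rules match
final = new_levels.get(filled, filled)             # re-encoding comes last
print(filled, final)
```

Both rules match, so the second replacement (``555``) wins, and the later
re-encoding stage then turns it into ``0.5``.
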
The **Preprocess** column
^^^^^^^^^^^^^^^^^^^^^^^^^
The **Preprocess** column contains one or more preprocessing steps that
should be applied to this variable, separated with commas. More details on the
available preprocessing steps and the format are on the `processing
<processing.rst>`_ page.
Processing table
----------------
The processing table is a ``tsv`` file which defines special processing rules
that should be applied to specific variables.
Each row in the processing table has the following columns:

.. todo:: Allow multiple comma-separated variable IDs on each line.

============== ================================================================
Name           Meaning
============== ================================================================
**ID**         Variable ID
**Process**    List of processing functions to apply to this variable,
               separated with commas.
============== ================================================================

See the page on `processing <processing.rst>`_ for more details.
Data processing
===============
Variable-specific preprocessing and processing are defined in the processing
table. Pre-processing occurs during the `data import stage <importing.rst>`_,
and processing occurs immediately afterwards.
The preprocessing and processing steps for a variable are each specified as
comma-separated lists of processing function names. For example::

    process1, process2('arg1'), process3(2, 3)

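
One way to read such a specification is to lean on Python's own parser, since
the example above happens to be a valid Python expression. The sketch below
does this with the standard ``ast`` module. Treating the format as
Python-compatible is an assumption, and ``parse_steps`` is a hypothetical
helper, not a ``ukbparse`` function.

```python
import ast


def parse_steps(spec):
    """Parse "name, name(arg, ...)" into a list of (name, args) tuples."""
    tree = ast.parse(spec.strip(), mode='eval').body
    # A single step parses as one node, several steps as a tuple of nodes.
    nodes = tree.elts if isinstance(tree, ast.Tuple) else [tree]
    steps = []
    for node in nodes:
        if isinstance(node, ast.Name):
            steps.append((node.id, ()))
        elif (isinstance(node, ast.Call)
              and isinstance(node.func, ast.Name)
              and not node.keywords):
            # literal_eval only accepts literals, so no code is executed.
            args = tuple(ast.literal_eval(arg) for arg in node.args)
            steps.append((node.func.id, args))
        else:
            raise ValueError(f'unsupported step in {spec!r}')
    return steps


steps = parse_steps("process1, process2('arg1'), process3(2, 3)")
print(steps)
```

Each ``(name, args)`` pair can then be looked up in a registry of processing
functions and called with the parsed arguments.
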
Preprocessing steps
-------------------
The following preprocessing options may be specified in the variable
table. These are applied during data import.

============== ============================================================
Name           Meaning
============== ============================================================
``remove``     Remove the variable - it is not loaded at all
``keepVisits`` Discard all but the specified visits. Visits are specified
               as one or more integer arguments, or one of the constants
               ``'first'`` and ``'last'``.
``fillVisits`` Fill NA values for a given visit from other visits. Takes a
               single optional argument, one of ``'mode'`` (the default),
               or ``'mean'``.
============== ============================================================

.. todo:: examples
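
As a rough illustration of what ``keepVisits`` and ``fillVisits`` describe,
the sketch below operates on the values of one variable for one subject,
stored as a ``{visit index: value}`` dictionary. That data layout and the
function names ``keep_visits`` and ``fill_visits`` are assumptions made for
the example. The real functions operate on the loaded data set.

```python
import math
import statistics


def keep_visits(by_visit, *visits):
    """Keep only the requested visits from a {visit index: value} mapping."""
    available = sorted(by_visit)
    if not available:
        return {}
    wanted = set()
    for visit in visits:
        if visit == 'first':
            wanted.add(available[0])
        elif visit == 'last':
            wanted.add(available[-1])
        else:
            wanted.add(int(visit))
    return {v: by_visit[v] for v in available if v in wanted}


def fill_visits(by_visit, method='mode'):
    """Replace nan entries with the mode (default) or mean of other visits."""
    if method not in ('mode', 'mean'):
        raise ValueError(f'unknown method {method!r}')
    observed = [v for v in by_visit.values() if not math.isnan(v)]
    if not observed:
        return dict(by_visit)        # nothing to fill from
    summarise = statistics.mean if method == 'mean' else statistics.mode
    fill = summarise(observed)
    return {k: (fill if math.isnan(v) else v) for k, v in by_visit.items()}


data = {0: 1.5, 1: 2.5, 2: 3.5}
print(keep_visits(data, 0, 'last'))
print(fill_visits({0: 1.0, 1: math.nan, 2: 3.0}, 'mean'))
```
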
Processing steps
----------------

================ =============================================
Name             Meaning
================ =============================================
``fillMissing``  Replace missing (*NA*) values with a constant
``gaussianise``  Apply gaussian normalisation
================ =============================================

.. todo:: examples
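
The two steps might look roughly like the sketch below. The table does not
define "gaussian normalisation", so the rank-based inverse normal transform
used here is an assumption, as are the function names ``fill_missing`` and
``gaussianise`` and the use of plain lists of floats. Ties are not averaged
in this sketch.

```python
import math
from statistics import NormalDist


def fill_missing(values, constant):
    """Replace nan entries with a constant."""
    return [constant if math.isnan(v) else v for v in values]


def gaussianise(values):
    """Rank-based inverse normal transform; nan entries are passed through."""
    present = sorted((v, i) for i, v in enumerate(values)
                     if not math.isnan(v))
    n = len(present)
    out = [math.nan] * len(values)
    normal = NormalDist()            # standard normal: mean 0, std. dev. 1
    for rank, (_, i) in enumerate(present, start=1):
        # Map the rank to a quantile strictly inside (0, 1).
        out[i] = normal.inv_cdf((rank - 0.5) / n)
    return out


filled = fill_missing([1.0, math.nan, 3.0], 0.0)
g = gaussianise([3.0, 1.0, 2.0])
print(filled, g)
```

The transformed values preserve the ordering of the input and are symmetric
about zero.
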