Commit 2358fc8a authored by Paul McCarthy

RF: parseColumnName accepts non-numeric instance, and will also parse names with no visit
parent e61172c4
......@@ -75,15 +75,18 @@ from the ``funpack/data/type.txt`` file (see :func:`.loadTableBases`).
 def parseColumnName(name):
     """Parses a UK Biobank column name, returns the components.
-    Two column naming formats are supported. The name is expected to be either
-    a string of the form::
+    Two column naming formats are supported. The name is expected to be
+    a string of one of the following forms::
         variable-visit.instance
-    Or a string of the form::
+        variable.instance
         f.variable.visit.instance
+    where ``variable`` and ``visit`` are integers. ``instance`` is typically
+    also an integer, but non-numeric values for ``instance`` are
+    accepted. This (and the second form above) is to allow parsing of derived
+    columns (see e.g. the :func:`.binariseCategorical` processing function).
     Some variables have the form::
         f.variable..visit.instance
......@@ -97,11 +100,11 @@ def parseColumnName(name):
     a column name (``visit`` above) corresponds to the assessment
     visit. However, there are a small number of variables which are
     not associated with a specific visit, and thus for which this
-    number does not corresopnd to a visit (e.g. variable 40006), but
+    number does not correspond to a visit (e.g. variable 40006), but
     to some other coding.
     Confusingly, the UK Biobank showcase refers to the coding that a
-    variable adhers to as an "instancing", whilst also using the
+    variable adheres to as an "instancing", whilst also using the
     term "instance" to refer to the columns of multi-valued
     variables - the ``instance`` element of the column name.
......@@ -112,27 +115,48 @@ def parseColumnName(name):
     :returns: A tuple containing:
                - variable ID
                - visit number
-               - instance number
+               - instance (may be an integer or a string)
     """
-    if name.startswith('f'):
-        pat = re.compile(r'f\.([0-9]+)\.(\.)?([0-9]+)\.([0-9]+)')
-    else:
-        pat = re.compile(r'([0-9]+)-(-)?([0-9]+)\.([0-9]+)')
+    def parse_norm(grps):
+        vid = int(grps[0])
+        visit = int(grps[2])
+        instance = grps[3]
+        if grps[1] is not None:
+            visit = -visit
+        return vid, visit, instance
-    parts = pat.fullmatch(name)
+    def parse_deriv(grps):
+        vid = int(grps[0])
+        instance = grps[1]
+        return vid, 0, instance
-    if parts is None:
-        raise ValueError('Invalid column name: {}'.format(name))
+    patterns = [
+        (r'([0-9]+)-(-)?([0-9]+)\.(.+)', parse_norm),
+        (r'([0-9]+)\.(.+)', parse_deriv),
+        (r'f\.([0-9]+)\.(\.)?([0-9]+)\.([0-9]+)', parse_norm)
+    ]
-    parts = parts.groups()
+    for pat, parse in patterns:
-    vid = int(parts[0])
-    visit = int(parts[2])
-    instance = int(parts[3])
+        pat = re.compile(pat)
+        match = pat.fullmatch(name)
-    if parts[1] is not None:
-        visit = -visit
+        if match is None:
+            continue
+        vid, visit, instance = parse(match.groups())
+        # accept numeric/non-numeric instance
+        try:
+            instance = int(instance)
+        except ValueError:
+            pass
+        break
+    if match is None:
+        raise ValueError('Invalid column name: {}'.format(name))
     return (vid, visit, instance)
......
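For readers who want to check the behaviour this commit describes, the following is a minimal standalone sketch of the post-change parsing logic, assembled from the added lines of the diff. The names `parse_column_name`, `_parse_norm`, `_parse_deriv` and `_PATTERNS` are hypothetical and chosen for this illustration only; the patterns are pre-compiled and the loop returns early instead of using `break`, so this is not the function as committed, only an approximation of it.

```python
import re


def _parse_norm(grps):
    # Groups: variable, optional extra separator, visit, instance.
    vid = int(grps[0])
    visit = int(grps[2])
    if grps[1] is not None:
        # A doubled separator ("--" or "..") marks a negative visit value.
        visit = -visit
    return vid, visit, grps[3]


def _parse_deriv(grps):
    # Names with no visit component; the visit is reported as 0.
    return int(grps[0]), 0, grps[1]


# Patterns are tried in order, mirroring the list added in the diff.
_PATTERNS = [
    (re.compile(r'([0-9]+)-(-)?([0-9]+)\.(.+)'), _parse_norm),
    (re.compile(r'([0-9]+)\.(.+)'), _parse_deriv),
    (re.compile(r'f\.([0-9]+)\.(\.)?([0-9]+)\.([0-9]+)'), _parse_norm),
]


def parse_column_name(name):
    """Return (variable, visit, instance) for a column name (sketch)."""
    for pat, parse in _PATTERNS:
        match = pat.fullmatch(name)
        if match is None:
            continue
        vid, visit, instance = parse(match.groups())
        # Accept numeric and non-numeric instances.
        try:
            instance = int(instance)
        except ValueError:
            pass
        return vid, visit, instance
    raise ValueError('Invalid column name: {}'.format(name))


if __name__ == '__main__':
    for col in ['31-0.0', '40006--1.2', '31.male', 'f.31.0.0']:
        print(col, parse_column_name(col))
```

Note that, as in the diff, only the first two patterns allow a non-numeric instance; the `f.`-prefixed form still requires digits.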