FSL / ukbparse
Commit ae664cb5, authored Mar 22, 2019 by Paul McCarthy
Merge branch 'rf/use_schema' into 'master'
Rf/use schema

Closes #4

See merge request fsl/ukbparse!114
Parents: 1c9f55c0, 580fca09
Pipeline #3491 passed with stages in 11 minutes and 3 seconds
CHANGELOG.rst
...
...
@@ -2,6 +2,39 @@
======================
0.16.0 (Friday 22nd March 2019)
-------------------------------
Changed
^^^^^^^
* Full variable and datacoding table files no longer need to be provided -
``ukbparse`` uses ``ukbparse/data/field.txt`` and
``ukbparse/data/encoding.txt`` files, obtained from the UK Biobank showcase
website, as the basis for recognising variables and data codings. The
``--variable_file``/``-vf`` and ``--datacoding_file``/``-df`` options now
accept partial table definitions - these will be merged with the built-in
rules (still stored in ``ukbparse/data/variables_*.tsv`` and
``ukbparse/data/datacodings_*.tsv``) when ``ukbparse`` is invoked.
Deprecated
^^^^^^^^^^
* The ``ukbparse_htmlparse``, ``ukbparse_join``, and
  ``ukbparse_compare_tables`` commands.
Removed
^^^^^^^
* The ``--icd10_file`` command-line option has been removed.
0.15.1 (Thursday 21st March 2019)
---------------------------------
...
...
@@ -26,6 +59,8 @@ Added
leaf values with parent values.
* New :mod:`.hierarchy` module which contains helper functions and data
structures for working with hierarchical variables.
* Definitions for all hierarchical UK Biobank variables are located in the
``ukbparse/data/hierarchy/`` directory.
Deprecated
...
...
README.rst
...
...
@@ -118,57 +118,21 @@ specifically written to pre-process UK BioBank data variables. These rules are
stored in the following files:
* ``ukbparse/data/variables.tsv``: Cleaning rules for individual variables
* ``ukbparse/data/datacodings.tsv``: Cleaning rules for data codings
* ``ukbparse/data/variables_*.tsv``: Cleaning rules for individual variables
* ``ukbparse/data/datacodings_*.tsv``: Cleaning rules for data codings
* ``ukbparse/data/types.tsv``: Cleaning rules for specific types
* ``ukbparse/data/processing.tsv``: Processing steps
You can customise or replace these files as you see fit. You can also pass
your own versions of these files to ``ukbparse`` via the ``--variable_file``,
``--datacoding_file``, ``--type_file`` and ``--processing_file`` command-line
options respectively.
The ``variables.tsv`` file defines all of the variables that ``ukbparse`` is
aware of. If your UK BioBank data set contains variables which are not listed
in this file, you may wish to generate your own version - you can do so
by following these steps:
1. Use the ``ukbconv`` utility (available through the `BioBank Data showcase
<http://biobank.ctsu.ox.ac.uk/showcase/>`_) to generate a HTML file
describing all of the variables in your data set, and data codings used by
them.
2. Use the ``ukbparse_htmlparse`` command to convert this ``html`` file into
variable and data coding "base" files, which just contain the meta-data
for each variable/data coding.
3. Code up your custom cleaning rules for each variable and data coding, in
the same format as can be seen in the ``ukbparse/data/`` directory. For
data codings, create these files:
* ``datacodings_navalues.tsv``: contains NA value replacement rules
* ``datacodings_recoding.tsv``: contains categorical recoding rules
And for variables, create these files:
* ``variables_navalues.tsv``: Contains NA value replacement rules
* ``variables_recoding.tsv``: Contains categorical recoding rules
* ``variables_clean.tsv``: Contains variable-specific cleaning functions
* ``variables_parentvalues.tsv``: Contains child value replacement rules.
4. Use the ``ukbparse_join`` command to generate the final variable and data
coding tables from your base files, e.g.::
options respectively. ``ukbparse`` will load all variable and datacoding files,
and merge them into a single table which contains the cleaning rules for each
variable.
ukbparse_join final_variables_table.tsv \
variables_base.tsv \
variables_navalues.tsv \
variables_recoding.tsv \
variables_parentvalues.tsv \
variables_clean.tsv
ukbparse_join final_datacodings.tsv \
datacodings_base.tsv \
datacodings_navalues.tsv \
datacodings_recoding.tsv
Finally, you can use the ``--no_builtins`` option to bypass all of the
built-in cleaning and processing rules.
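The merging of partial table files described above can be pictured with a minimal sketch. It is not the ``ukbparse`` implementation: ``load_table`` and ``merge_tables`` are hypothetical helpers, the cell values are illustrative, and the rule that later files win for a given cell is an assumption, since the README does not state a precedence.

```python
import csv
import io


def load_table(text):
    """Parse a tab-separated table into {ID: {column: value}}.

    Empty cells are dropped, so a partial table only contributes
    the cells it actually defines.
    """
    rows = csv.DictReader(io.StringIO(text), delimiter='\t')
    return {row['ID']: {k: v for k, v in row.items() if k != 'ID' and v}
            for row in rows}


def merge_tables(tables):
    """Merge partial tables on ID (assumption: later tables win per cell)."""
    merged = {}
    for table in tables:
        for rowid, cells in table.items():
            merged.setdefault(rowid, {}).update(cells)
    return merged


# Two partial data coding tables, each defining different columns.
navalues = 'ID\tNAValues\n13\t-1,-3\n90\t-3\n'
recoding = 'ID\tRawLevels\tNewLevels\n100001\t555,200\t0.5,2\n13\t\t\n'

merged = merge_tables([load_table(navalues), load_table(recoding)])
for rowid in sorted(merged, key=int):
    print(rowid, merged[rowid])
```

Keying every table on the ID column is what allows a user-supplied file to define only the rows and columns it cares about.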
Tests
...
...
@@ -191,4 +155,4 @@ Citing
If you would like to cite ``ukbparse``, please refer to its `Zenodo page
<https://zenodo.org/record/2203808#.XBDJ-xP7RE4>`_.
<https://doi.org/10.5281/zenodo.1997626>`_.
ukbparse/__init__.py
...
...
@@ -6,7 +6,7 @@
 #
-__version__ = '0.15.1'
+__version__ = '0.16.0'
 """The ``ukbparse`` versioning scheme roughly follows Semantic Versioning
 conventions.
 """
...
...
ukbparse/config.py
...
...
@@ -13,6 +13,7 @@ import os.path as op
 import functools as ft
 import multiprocessing as mp
 import sys
+import glob
 import shlex
 import logging
 import argparse
...
...
@@ -33,12 +34,13 @@ log = logging.getLogger(__name__)
 VERSION = ukbparse.__version__
 UKBPARSEDIR = op.dirname(__file__)
-DEFAULT_VFILE = op.join(UKBPARSEDIR, 'data', 'variables.tsv')
-DEFAULT_DFILE = op.join(UKBPARSEDIR, 'data', 'datacodings.tsv')
 DEFAULT_TFILE = op.join(UKBPARSEDIR, 'data', 'types.tsv')
 DEFAULT_PFILE = op.join(UKBPARSEDIR, 'data', 'processing.tsv')
 DEFAULT_CFILE = op.join(UKBPARSEDIR, 'data', 'categories.tsv')
-DEFAULT_ICD10FILE = op.join(UKBPARSEDIR, 'data', 'icd10.tsv')
+DEFAULT_VFILES = op.join(UKBPARSEDIR, 'data', 'variables_*.tsv')
+DEFAULT_DFILES = op.join(UKBPARSEDIR, 'data', 'datacodings_*.tsv')
+DEFAULT_VFILES = list(glob.glob(DEFAULT_VFILES))
+DEFAULT_DFILES = list(glob.glob(DEFAULT_DFILES))
 DEFAULT_MERGE_AXIS = importing.MERGE_AXIS
 DEFAULT_MERGE_STRATEGY = importing.MERGE_STRATEGY
 DEFAULT_EXPORT_FORMAT = exporting.EXPORT_FORMAT
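The new defaults are built by expanding a glob pattern into a list of files. A minimal standalone sketch of that idea follows; ``find_rule_files`` is a hypothetical helper, and the temporary directory stands in for ``ukbparse/data/``. Note that ``glob.glob`` already returns a list and makes no ordering guarantee, so the sketch sorts the result, which the code in this commit does not do.

```python
import glob
import os.path as op
import tempfile


def find_rule_files(datadir, prefix):
    """Return the sorted '<prefix>_*.tsv' files found in datadir."""
    pattern = op.join(datadir, '{}_*.tsv'.format(prefix))
    return sorted(glob.glob(pattern))


# Build a throwaway directory holding a few empty rule files.
with tempfile.TemporaryDirectory() as datadir:
    names = ['variables_navalues.tsv',
             'variables_recoding.tsv',
             'datacodings_navalues.tsv',
             'types.tsv']
    for name in names:
        open(op.join(datadir, name), 'w').close()

    vfiles = [op.basename(f) for f in find_rule_files(datadir, 'variables')]
    dfiles = [op.basename(f) for f in find_rule_files(datadir, 'datacodings')]

print(vfiles)
print(dfiles)
```

Because the pattern requires an underscore after the prefix, a file named plain ``variables.tsv`` would not be picked up, which matches the removal of the single-file defaults.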
...
...
@@ -66,12 +68,13 @@ CLI_ARGUMENTS = collections.OrderedDict((
         (('ms', 'merge_strategy'), {'choices': AVAILABLE_MERGE_STRATEGIES,
                                     'default': DEFAULT_MERGE_STRATEGY}),
         (('cfg', 'config_file'), {}),
-        (('vf', 'variable_file'), {'default': DEFAULT_VFILE}),
-        (('df', 'datacoding_file'), {'default': DEFAULT_DFILE}),
+        (('vf', 'variable_file'), {'action': 'append',
+                                   'default': DEFAULT_VFILES}),
+        (('df', 'datacoding_file'), {'action': 'append',
+                                     'default': DEFAULT_DFILES}),
         (('tf', 'type_file'), {'default': DEFAULT_TFILE}),
         (('pf', 'processing_file'), {'default': DEFAULT_PFILE}),
-        (('cf', 'category_file'), {'default': DEFAULT_CFILE}),
-        (('if', 'icd10_file'), {'default': DEFAULT_ICD10FILE})]),
+        (('cf', 'category_file'), {'default': DEFAULT_CFILE})]),
     ('Import options', [
         (('ia', 'import_all'), {'action': 'store_true'}),
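Combining ``'action': 'append'`` with a list ``'default'`` has a specific, documented ``argparse`` behaviour: values given on the command line are appended after the default entries rather than replacing them. The sketch below shows this with plain ``argparse``. It assumes the ``-vf``/``--variable_file`` flag names from the changelog, uses made-up file names, and does not reproduce how ``ukbparse`` builds its parser from ``CLI_ARGUMENTS``.

```python
import argparse

# Stand-in for the globbed built-in rule files (hypothetical names).
DEFAULT_VFILES = ['variables_navalues.tsv', 'variables_recoding.tsv']


def make_parser():
    """Build a parser with one appendable option that has a list default."""
    parser = argparse.ArgumentParser()
    parser.add_argument('-vf', '--variable_file',
                        action='append',
                        default=DEFAULT_VFILES)
    return parser


# No option given: the namespace holds the defaults unchanged.
no_args = make_parser().parse_args([])

# Option given twice: argparse copies the default list, then appends.
with_args = make_parser().parse_args(['-vf', 'mine.tsv', '-vf', 'more.tsv'])

print(no_args.variable_file)
print(with_args.variable_file)
```

With this behaviour, user-supplied files end up after the built-in ones in the parsed list, which fits the changelog's statement that partial tables are merged with the built-in rules.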
...
...
@@ -165,6 +168,11 @@ CLI_ARGUMENTS = collections.OrderedDict((
 CLI_DESCRIPTIONS = {
+    'Inputs':
+    'The --variable_file and --datacoding_file options can be used multiple\n'
+    'times - all provided files will be merged into a single table using the\n'
+    'variable/data coding IDs.',
     'Import options':
     'Using the --import_all option will increase RAM and CPU requirements,\n'
     'as it forces every column to be loaded and processed. The purpose of\n'
...
...
@@ -214,12 +222,12 @@ CLI_ARGUMENT_HELP = {
     'File containing default command line arguments.',
     'variable_file':
-    'File containing rules for handling variables '
-    '(default: {}).'.format(DEFAULT_VFILE),
+    'File(s) containing rules for handling variables '
+    '(default: {}).'.format(DEFAULT_VFILES),
     'datacoding_file':
-    'File containing rules for handling data codings '
-    '(default: {}).'.format(DEFAULT_DFILE),
+    'File(s) containing rules for handling data codings '
+    '(default: {}).'.format(DEFAULT_DFILES),
     'type_file':
     'File containing rules for handling types '
...
...
@@ -233,10 +241,6 @@ CLI_ARGUMENT_HELP = {
     'File containing variable categories '
     '(default: {}).'.format(DEFAULT_CFILE),
-    'icd10_file':
-    'File containing ICD10 hierarchy '
-    '(default: {}).'.format(DEFAULT_ICD10FILE),
     # Import options
     'import_all':
     'Import and process all columns, and apply --variable/--category/'
...
...
@@ -578,11 +582,11 @@ def parseArgs(argv=None, namespace=None):
         args.skip_processing = True
     if args.no_builtins:
-        if args.variable_file == DEFAULT_VFILE: args.variable_file = None
-        if args.datacoding_file == DEFAULT_DFILE: args.datacoding_file = None
-        if args.type_file == DEFAULT_TFILE: args.type_file = None
-        if args.processing_file == DEFAULT_PFILE: args.processing_file = None
-        if args.category_file == DEFAULT_CFILE: args.category_file = None
+        if args.variable_file == DEFAULT_VFILES: args.variable_file = None
+        if args.datacoding_file == DEFAULT_DFILES: args.datacoding_file = None
+        if args.type_file == DEFAULT_TFILE: args.type_file = None
+        if args.processing_file == DEFAULT_PFILE: args.processing_file = None
+        if args.category_file == DEFAULT_CFILE: args.category_file = None
     # the importing.loadData function accepts
     # either a single encoding, or one encoding
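The ``--no_builtins`` handling above follows one pattern: any option still equal to its built-in default is reset to ``None``. A compact, table-driven sketch of that pattern is given here. ``clear_builtin_defaults`` and the ``DEFAULTS`` mapping are hypothetical, the file names are invented, and the sketch is not part of ``ukbparse``. It also shows that the check works the same way for the new list-valued defaults, because Python compares lists element by element.

```python
from types import SimpleNamespace

# Hypothetical built-in defaults: one list-valued, one single file.
DEFAULTS = {
    'variable_file': ['variables_navalues.tsv'],
    'type_file': 'types.tsv',
}


def clear_builtin_defaults(args, defaults):
    """Set every option that still equals its built-in default to None."""
    for name, default in defaults.items():
        if getattr(args, name) == default:
            setattr(args, name, None)
    return args


# variable_file was left at its default; type_file was overridden.
args = SimpleNamespace(variable_file=['variables_navalues.tsv'],
                       type_file='my_types.tsv')
clear_builtin_defaults(args, DEFAULTS)

print(args.variable_file)
print(args.type_file)
```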
...
...
ukbparse/data/datacodings.tsv
deleted
100644 → 0
ID NAValues RawLevels NewLevels
2
3
4
5
6
7
8
9
10
12
13 -1,-3
14 -1,-3
15
16
18
19
21
22
23
24
27
29
30
31
32
33
35
37 -1,-3
38
39
47
49
50
51
54
59
60
74
75
76
77
78
79
80
81
82
83
84
85
86
87
89
90 -3
91
92
93
100
101 -1
170
201
212
216
217
219
220
223
224
225
227
228
229
230
231
238
239
261
262
263
264
265
266
267
268
269
270
271
272
300
400
401
402 0
403
470
479
480 -818
485 -1
486 -121,-818
487
488 -1001 0
489
493 -121 0,-131,-141 1,2,3
494 -1520,-2030,-3040,-4000 1,2,3,4
496 -818 111,112,113,114 1,2,3,4
497
498
500
634
777
778
946 -818 -1001 0.5
1001 -1,-3
1002
1007
1010 -11,-13,-21,-23
1101
1204
1205
1206
1207
1208
1209
1210
2226
5001
5002
5003
5004
5005
5006
5007
5008
5009
5010
5011
5012
5013
5014
22000
100001 555,200 0.5,2
100002 111,555,300 0,0.5,3
100003 555,300 0.5,3
100004 555,400 0.5,4
100005 444,555,500 0.25,0.5,5
100006 555,600 0.5,6
100007 600 6
100008
100009 1,111 2,1
100010
100011 10,3060,1030,12,24,600 1,2,3,4,5,6
100012 0,13,35,57,79,912,1200 2,3,4,5,6,7
100013
100014
100015
100016 555,500 0.5,5
100017 444,555,300 0.25,0.5,3
100256 6,7
100257
100258
100260 3,6,7
100261 -1,4
100263 6,7
100267
100270 6,7,9
100271
100272
100273
100274 6,7
100280
100282 9
100286 -3
100287 -3
100288 -1,-3
100289 -1,-3
100290 -1,-3
100291 -1,-3
100292 -3
100293 -1,-3
100294 -1,-3
100295 -3
100298 -1,-3
100299 -3
100300 -1,-3
100301 -1,-3
100305 -3
100306 -1,-2,-3
100307 -1,-2,-3
100313 -3
100314 -1,-3
100316 -3
100317 -1,-3
100318 -1,-3
100327 -1,-3
100328 -3
100329 -1,-3
100334 -1,-3
100335 -1,-3
100336 -1,-3
100337 -1,-3
100338 -1,-3
100339 -1,-3
100341 -1,-3
100342 -1,-3
100343 -3
100345 -1,-3
100346 -1,-3
100347 -3
100348 -3
100349 -1,-3
100351 -3
100352 -3
100353 -1
100355 -1,-3
100356 -1,-3
100357 -3
100358 -3
100359 -3
100360 -3
100361 -1,-3
100369 -1,-3
100370 -3
100373 -1,-3
100377 -1,-3
100385 -3
100387 -1,-3
100388 -1,-3
100389 -1,-3
100391 -1,-3
100393 -1,-3
100394 -3
100397 -1,-3
100398 -3
100400 -3
100401 -1,-3
100402 -3
100416 -1,-3
100417 -1,-3
100418 -1,-3
100420 -1,-3
100428 -1,-3
100429 -1,-3
100430 -3
100431 -1,-3
100432 -1,-3
100434 -1,-3
100435 -1,-3
100478 -1,-3
100479 -1,-3
100484 -1,-3
100498
100499 -1,-3
100500 -1,-3
100501 -1,-3
100502 -3
100503
100504 -1,-3
100508 -1,-3
100510 -1,-3
100511 -1,-3
100514 -1,-3
100515
100523 -1,-3
100536 -1,-3
100537 -1,-3
100538 -3
100539 -3
100540 -1,-3
100549 -1,-3
100550 -1,-3
100552 -1,-3
100553 -3
100563 -3
100564 -3
100567 -1,-3
100569 -1,-3
100570 -1,-3
100572 -1,-3
100579 -3
100582 -1,-3
100584 -3
100585 -1,-3
100586 -3,-4
100595 -1,-3
100598 -1,-3
100599 -3,-5
100603 -1,-3
100605 -3
100610 -3
100617 -1,-3
100622 -1,-3
100625 -1,-3
100626 -1,-3
100628 -1,-3
100629 -3
100630 -1,-3
100631 -1,-3
100635 -1,-3
100636 -1,-3
100637 -1,-3
100639 -3
100640
100641
100642
100643 -1,-3
100644 -1,-3
100645 -1,-3
100646 -1,-3
100647 -1,-3
100648 -1,-3
100649 -1,-3
100650 -1,-3
100651 -1,-3
100652 -1,-3
100653 -1,-3
100654 -1,-3
100655 -1,-3
100656 -1,-3
100657 -1,-3
100658 -3
100659 -1,-3
100662 -3
100663 -1,-3
100664 -1,-3
100668 -1
100669 -1,-3
100672 -3
100673 -1,-3
100674 -1,-3
100683 -1,-3
100685 -3
100686 -3
100688 -1,-3
100689 -1,-3
100690 -3
100691 -3
100692 -3
100693
100696
100698 -313
100699 1,99
ukbparse/data/datacodings_base.tsv
deleted
100644 → 0
ID
2
3
4
5
6
7
8
9
10
12
13
14
15
16
18
19
21
22
23
24
27
29
30
31
32
33
35
37
38
39
47
49
50
51
54
59
60
74
75
76
77
78
79
80
81
82
83
84
85
86
87
89
90
91
92
93
100
101
170
201
212
216
217
219
220
223
224
225
227
228
229
230
231
238
239
261
262
263
264
265
266
267