Commit ae664cb5 authored by Paul McCarthy

Merge branch 'rf/use_schema' into 'master'

Rf/use schema

Closes #4

See merge request fsl/ukbparse!114
parents 1c9f55c0 580fca09
Pipeline #3491 passed with stages
in 11 minutes and 3 seconds
......@@ -2,6 +2,39 @@
======================
0.16.0 (Friday 22nd March 2019)
-------------------------------
Changed
^^^^^^^
* Full variable and datacoding table files no longer need to be provided -
``ukbparse`` uses ``ukbparse/data/field.txt`` and
``ukbparse/data/encoding.txt`` files, obtained from the UK Biobank showcase
website, as the basis for recognising variables and data codings. The
``--variable_file``/``-vf`` and ``--datacoding_file``/``-df`` options now
accept partial table definitions - these will be merged with the built-in
rules (still stored in ``ukbparse/data/variables_*.tsv`` and
``ukbparse/data/datacodings_*.tsv``) when ``ukbparse`` is invoked.
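
As an illustration of what merging partial table definitions by ID could look like, here is a minimal sketch using only the standard library. It is not the ``ukbparse`` implementation: the ``merge_tables`` helper, the rule that later non-empty cells take precedence, and the ``RawLevels``/``NewLevels`` values in ``custom`` are all assumptions made for the example.

```python
import csv
import io


def merge_tables(*tables):
    """Merge tab-separated tables on their ID column.

    Hypothetical helper: tables are processed in order, and a later
    non-empty cell overrides an earlier value for the same ID/column.
    """
    merged = {}
    for text in tables:
        for row in csv.DictReader(io.StringIO(text), delimiter='\t'):
            entry = merged.setdefault(row['ID'], {})
            for column, value in row.items():
                if column != 'ID' and value:
                    entry[column] = value
    return merged


# A "built-in" table, and a partial user table that only adds recoding
# rules for one data coding (the recoding values are made up).
builtin = "ID\tNAValues\n13\t-1,-3\n90\t-3\n"
custom = "ID\tRawLevels\tNewLevels\n13\t1,2\t0,1\n"

merged = merge_tables(builtin, custom)
print(merged['13'])
print(merged['90'])
```

In this sketch the entry for ID 13 ends up with rules from both tables, while ID 90 keeps only its built-in NA values.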
Deprecated
^^^^^^^^^^
* The ``ukbparse_htmlparse``, ``ukbparse_join``, and
``ukbparse_compare_tables`` commands.
Removed
^^^^^^^
* The ``--icd10_file`` command-line option has been removed.
0.15.1 (Thursday 21st March 2019)
---------------------------------
......@@ -26,6 +59,8 @@ Added
leaf values with parent values.
* New :mod:`.hierarchy` module which contains helper functions and data
structures for working with hierarchical variables.
* Definitions for all hierarchical UK Biobank variables are located in the
``ukbparse/data/hierarchy/`` directory.
Deprecated
......
......@@ -118,57 +118,21 @@ specifically written to pre-process UK BioBank data variables. These rules are
stored in the following files:
* ``ukbparse/data/variables.tsv``: Cleaning rules for individual variables
* ``ukbparse/data/datacodings.tsv``: Cleaning rules for data codings
* ``ukbparse/data/variables_*.tsv``: Cleaning rules for individual variables
* ``ukbparse/data/datacodings_*.tsv``: Cleaning rules for data codings
* ``ukbparse/data/types.tsv``: Cleaning rules for specific types
* ``ukbparse/data/processing.tsv``: Processing steps
You can customise or replace these files as you see fit. You can also pass
your own versions of these files to ``ukbparse`` via the ``--variable_file``,
``--datacoding_file``, ``--type_file`` and ``--processing_file`` command-line
options respectively.
The ``variables.tsv`` file defines all of the variables that ``ukbparse`` is
aware of. If your UK BioBank data set contains variables which are not listed
in this file, you may wish to generate your own version - you can do so
by following these steps:
1. Use the ``ukbconv`` utility (available through the `BioBank Data showcase
<http://biobank.ctsu.ox.ac.uk/showcase/>`_) to generate a HTML file
describing all of the variables in your data set, and data codings used by
them.
2. Use the ``ukbparse_htmlparse`` command to convert this ``html`` file into
variable and data coding "base" files, which just contain the meta-data
for each variable/data coding.
3. Code up your custom cleaning rules for each variable and data coding, in
the same format as can be seen in the ``ukbparse/data/`` directory. For
data codings, create these files:
* ``datacodings_navalues.tsv``: contains NA value replacement rules
* ``datacodings_recoding.tsv``: contains categorical recoding rules
And for variables, create these files:
* ``variables_navalues.tsv``: Contains NA value replacement rules
* ``variables_recoding.tsv``: Contains categorical recoding rules
* ``variables_clean.tsv``: Contains variable-specific cleaning functions
* ``variables_parentvalues.tsv``: Contains child value replacement rules.
4. Use the ``ukbparse_join`` command to generate the final variable and data
coding tables from your base files, e.g.::
options respectively. ``ukbparse`` will load all variable and datacoding files,
and merge them into a single table which contains the cleaning rules for each
variable.
ukbparse_join final_variables_table.tsv \
variables_base.tsv \
variables_navalues.tsv \
variables_recoding.tsv \
variables_parentvalues.tsv \
variables_clean.tsv
ukbparse_join final_datacodings.tsv \
datacodings_base.tsv \
datacodings_navalues.tsv \
datacodings_recoding.tsv
Finally, you can use the ``--no_builtins`` option to bypass all of the
built-in cleaning and processing rules.
Tests
......@@ -191,4 +155,4 @@ Citing
If you would like to cite ``ukbparse``, please refer to its `Zenodo page
<https://zenodo.org/record/2203808#.XBDJ-xP7RE4>`_.
<https://doi.org/10.5281/zenodo.1997626>`_.
......@@ -6,7 +6,7 @@
#
__version__ = '0.15.1'
__version__ = '0.16.0'
"""The ``ukbparse`` versioning scheme roughly follows Semantic Versioning
conventions.
"""
......
......@@ -13,6 +13,7 @@ import os.path as op
import functools as ft
import multiprocessing as mp
import sys
import glob
import shlex
import logging
import argparse
......@@ -33,12 +34,13 @@ log = logging.getLogger(__name__)
VERSION = ukbparse.__version__
UKBPARSEDIR = op.dirname(__file__)
DEFAULT_VFILE = op.join(UKBPARSEDIR, 'data', 'variables.tsv')
DEFAULT_DFILE = op.join(UKBPARSEDIR, 'data', 'datacodings.tsv')
DEFAULT_TFILE = op.join(UKBPARSEDIR, 'data', 'types.tsv')
DEFAULT_PFILE = op.join(UKBPARSEDIR, 'data', 'processing.tsv')
DEFAULT_CFILE = op.join(UKBPARSEDIR, 'data', 'categories.tsv')
DEFAULT_ICD10FILE = op.join(UKBPARSEDIR, 'data', 'icd10.tsv')
DEFAULT_VFILES = op.join(UKBPARSEDIR, 'data', 'variables_*.tsv')
DEFAULT_DFILES = op.join(UKBPARSEDIR, 'data', 'datacodings_*.tsv')
DEFAULT_VFILES = list(glob.glob(DEFAULT_VFILES))
DEFAULT_DFILES = list(glob.glob(DEFAULT_DFILES))
DEFAULT_MERGE_AXIS = importing.MERGE_AXIS
DEFAULT_MERGE_STRATEGY = importing.MERGE_STRATEGY
DEFAULT_EXPORT_FORMAT = exporting.EXPORT_FORMAT
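
The wildcard defaults above are expanded with ``glob.glob``, which already returns a list and makes no promise about the order of its results. The sketch below shows the same pattern against a temporary directory. The ``find_rule_files`` helper and the ``sorted`` call are assumptions added for illustration and are not part of this commit.

```python
import glob
import os.path as op
import tempfile


def find_rule_files(datadir, prefix):
    """Return all '<prefix>_*.tsv' files in datadir, in sorted order.

    Hypothetical helper. glob.glob returns matches in arbitrary order,
    so sorting is added here to make the result reproducible.
    """
    pattern = op.join(datadir, '{}_*.tsv'.format(prefix))
    return sorted(glob.glob(pattern))


with tempfile.TemporaryDirectory() as datadir:
    # Create a few empty stand-in rule files.
    for name in ('variables_navalues.tsv',
                 'variables_clean.tsv',
                 'datacodings_navalues.tsv',
                 'types.tsv'):
        open(op.join(datadir, name), 'w').close()

    vfiles = [op.basename(f) for f in find_rule_files(datadir, 'variables')]
    dfiles = [op.basename(f) for f in find_rule_files(datadir, 'datacodings')]

print(vfiles)
print(dfiles)
```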
......@@ -66,12 +68,13 @@ CLI_ARGUMENTS = collections.OrderedDict((
(('ms', 'merge_strategy'), {'choices' : AVAILABLE_MERGE_STRATEGIES,
'default' : DEFAULT_MERGE_STRATEGY}),
(('cfg', 'config_file'), {}),
(('vf', 'variable_file'), {'default' : DEFAULT_VFILE}),
(('df', 'datacoding_file'), {'default' : DEFAULT_DFILE}),
(('vf', 'variable_file'), {'action' : 'append',
'default' : DEFAULT_VFILES}),
(('df', 'datacoding_file'), {'action' : 'append',
'default' : DEFAULT_DFILES}),
(('tf', 'type_file'), {'default' : DEFAULT_TFILE}),
(('pf', 'processing_file'), {'default' : DEFAULT_PFILE}),
(('cf', 'category_file'), {'default' : DEFAULT_CFILE}),
(('if', 'icd10_file'), {'default' : DEFAULT_ICD10FILE})]),
(('cf', 'category_file'), {'default' : DEFAULT_CFILE})]),
('Import options', [
(('ia', 'import_all'), {'action' : 'store_true'}),
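
Combining ``'action' : 'append'`` with a list default has a specific effect in ``argparse``: values given on the command line are appended to a copy of the default list rather than replacing it. The standalone sketch below demonstrates this. The parser and the file names in ``BUILTIN_VFILES`` are stand-ins for illustration, not the actual ``ukbparse`` parser.

```python
import argparse

# Stand-in for the list of built-in rule files (names are hypothetical).
BUILTIN_VFILES = ['variables_navalues.tsv', 'variables_recoding.tsv']

parser = argparse.ArgumentParser()
parser.add_argument('-vf', '--variable_file',
                    action='append',
                    default=BUILTIN_VFILES)

# No option given: the default list is used as-is.
untouched = parser.parse_args([])

# Option given twice: both values are appended after the defaults.
extended = parser.parse_args(['-vf', 'a.tsv', '-vf', 'b.tsv'])

print(untouched.variable_file)
print(extended.variable_file)
```

Because ``argparse`` copies the default before appending, ``BUILTIN_VFILES`` itself is left unchanged after parsing.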
......@@ -165,6 +168,11 @@ CLI_ARGUMENTS = collections.OrderedDict((
CLI_DESCRIPTIONS = {
'Inputs' :
'The --variable_file and --datacoding_file options can be used multiple\n'
'times - all provided files will be merged into a single table using the\n'
'variable/data coding IDs.',
'Import options' :
'Using the --import_all option will increase RAM and CPU requirements,\n'
'as it forces every column to be loaded and processed. The purpose of\n'
......@@ -214,12 +222,12 @@ CLI_ARGUMENT_HELP = {
'File containing default command line arguments.',
'variable_file' :
'File containing rules for handling variables '
'(default: {}).'.format(DEFAULT_VFILE),
'File(s) containing rules for handling variables '
'(default: {}).'.format(DEFAULT_VFILES),
'datacoding_file' :
'File containing rules for handling data codings '
'(default: {}).'.format(DEFAULT_DFILE),
'File(s) containing rules for handling data codings '
'(default: {}).'.format(DEFAULT_DFILES),
'type_file' :
'File containing rules for handling types '
......@@ -233,10 +241,6 @@ CLI_ARGUMENT_HELP = {
'File containing variable categories '
'(default: {}).'.format(DEFAULT_CFILE),
'icd10_file' :
'File containing ICD10 hierarchy '
'(default: {}).'.format(DEFAULT_ICD10FILE),
# Import options
'import_all' :
'Import and process all columns, and apply --variable/--category/'
......@@ -578,11 +582,11 @@ def parseArgs(argv=None, namespace=None):
args.skip_processing = True
if args.no_builtins:
if args.variable_file == DEFAULT_VFILE: args.variable_file = None
if args.datacoding_file == DEFAULT_DFILE: args.datacoding_file = None
if args.type_file == DEFAULT_TFILE: args.type_file = None
if args.processing_file == DEFAULT_PFILE: args.processing_file = None
if args.category_file == DEFAULT_CFILE: args.category_file = None
if args.variable_file == DEFAULT_VFILES: args.variable_file = None
if args.datacoding_file == DEFAULT_DFILES: args.datacoding_file = None
if args.type_file == DEFAULT_TFILE: args.type_file = None
if args.processing_file == DEFAULT_PFILE: args.processing_file = None
if args.category_file == DEFAULT_CFILE: args.category_file = None
# the importing.loadData function accepts
# either a single encoding, or one encoding
......
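
The ``--no_builtins`` handling in the hunk above resets an option to ``None`` only when its value still equals the default. The sketch below imitates that check on a plain ``argparse.Namespace``. The ``drop_untouched_defaults`` helper and the file names are hypothetical. In this sketch, a list option that has been extended with a user file no longer equals its default and is therefore left alone; whether ``ukbparse`` deals with that case elsewhere is not visible in this diff.

```python
import argparse


def drop_untouched_defaults(args, defaults):
    """Set each option to None if it still equals its default value.

    Hypothetical helper mirroring the equality checks in parseArgs.
    """
    for name, default in defaults.items():
        if getattr(args, name) == default:
            setattr(args, name, None)
    return args


# Stand-in defaults (file names are made up for the example).
defaults = {
    'variable_file': ['variables_navalues.tsv'],
    'type_file': 'types.tsv',
}

# type_file is untouched; variable_file has a user file appended.
args = argparse.Namespace(
    variable_file=['variables_navalues.tsv', 'mine.tsv'],
    type_file='types.tsv')

drop_untouched_defaults(args, defaults)
print(args.type_file)
print(args.variable_file)
```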
ID NAValues RawLevels NewLevels
2
3
4
5
6
7
8
9
10
12
13 -1,-3
14 -1,-3
15
16
18
19
21
22
23
24
27
29
30
31
32
33
35
37 -1,-3
38
39
47
49
50
51
54
59
60
74
75
76
77
78
79
80
81
82
83
84
85
86
87
89
90 -3
91
92
93
100
101 -1
170
201
212
216
217
219
220
223
224
225
227
228
229
230
231
238
239
261
262
263
264
265
266
267
268
269
270
271
272
300
400
401
402 0
403
470
479
480 -818
485 -1
486 -121,-818
487
488 -1001 0
489
493 -121 0,-131,-141 1,2,3
494 -1520,-2030,-3040,-4000 1,2,3,4
496 -818 111,112,113,114 1,2,3,4
497
498
500
634
777
778
946 -818 -1001 0.5
1001 -1,-3
1002
1007
1010 -11,-13,-21,-23
1101
1204
1205
1206
1207
1208
1209
1210
2226
5001
5002
5003
5004
5005
5006
5007
5008
5009
5010
5011
5012
5013
5014
22000
100001 555,200 0.5,2
100002 111,555,300 0,0.5,3
100003 555,300 0.5,3
100004 555,400 0.5,4
100005 444,555,500 0.25,0.5,5
100006 555,600 0.5,6
100007 600 6
100008
100009 1,111 2,1
100010
100011 10,3060,1030,12,24,600 1,2,3,4,5,6
100012 0,13,35,57,79,912,1200 2,3,4,5,6,7
100013
100014
100015
100016 555,500 0.5,5
100017 444,555,300 0.25,0.5,3
100256 6,7
100257
100258
100260 3,6,7
100261 -1,4
100263 6,7
100267
100270 6,7,9
100271
100272
100273
100274 6,7
100280
100282 9
100286 -3
100287 -3
100288 -1,-3
100289 -1,-3
100290 -1,-3
100291 -1,-3
100292 -3
100293 -1,-3
100294 -1,-3
100295 -3
100298 -1,-3
100299 -3
100300 -1,-3
100301 -1,-3
100305 -3
100306 -1,-2,-3
100307 -1,-2,-3
100313 -3
100314 -1,-3
100316 -3
100317 -1,-3
100318 -1,-3
100327 -1,-3
100328 -3
100329 -1,-3
100334 -1,-3
100335 -1,-3
100336 -1,-3
100337 -1,-3
100338 -1,-3
100339 -1,-3
100341 -1,-3
100342 -1,-3
100343 -3
100345 -1,-3
100346 -1,-3
100347 -3
100348 -3
100349 -1,-3
100351 -3
100352 -3
100353 -1
100355 -1,-3
100356 -1,-3
100357 -3
100358 -3
100359 -3
100360 -3
100361 -1,-3
100369 -1,-3
100370 -3
100373 -1,-3
100377 -1,-3
100385 -3
100387 -1,-3
100388 -1,-3
100389 -1,-3
100391 -1,-3
100393 -1,-3
100394 -3
100397 -1,-3
100398 -3
100400 -3
100401 -1,-3
100402 -3
100416 -1,-3
100417 -1,-3
100418 -1,-3
100420 -1,-3
100428 -1,-3
100429 -1,-3
100430 -3
100431 -1,-3
100432 -1,-3
100434 -1,-3
100435 -1,-3
100478 -1,-3
100479 -1,-3
100484 -1,-3
100498
100499 -1,-3
100500 -1,-3
100501 -1,-3
100502 -3
100503
100504 -1,-3
100508 -1,-3
100510 -1,-3
100511 -1,-3
100514 -1,-3
100515
100523 -1,-3
100536 -1,-3
100537 -1,-3
100538 -3
100539 -3
100540 -1,-3
100549 -1,-3
100550 -1,-3
100552 -1,-3
100553 -3
100563 -3
100564 -3
100567 -1,-3
100569 -1,-3
100570 -1,-3
100572 -1,-3
100579 -3
100582 -1,-3
100584 -3
100585 -1,-3
100586 -3,-4
100595 -1,-3
100598 -1,-3
100599 -3,-5
100603 -1,-3
100605 -3
100610 -3
100617 -1,-3
100622 -1,-3
100625 -1,-3
100626 -1,-3
100628 -1,-3
100629 -3
100630 -1,-3
100631 -1,-3
100635 -1,-3
100636 -1,-3
100637 -1,-3
100639 -3
100640
100641
100642
100643 -1,-3
100644 -1,-3
100645 -1,-3
100646 -1,-3
100647 -1,-3
100648 -1,-3
100649 -1,-3
100650 -1,-3
100651 -1,-3
100652 -1,-3
100653 -1,-3
100654 -1,-3
100655 -1,-3
100656 -1,-3
100657 -1,-3
100658 -3
100659 -1,-3
100662 -3
100663 -1,-3
100664 -1,-3
100668 -1
100669 -1,-3
100672 -3
100673 -1,-3
100674 -1,-3
100683 -1,-3
100685 -3
100686 -3
100688 -1,-3
100689 -1,-3
100690 -3
100691 -3
100692 -3
100693
100696
100698 -313
100699 1,99
ID
2
3
4
5
6
7
8
9
10
12
13
14
15
16
18
19
21
22
23
24
27
29
30
31
32
33
35
37
38
39
47
49
50
51
54
59
60
74
75
76
77
78
79
80
81
82
83
84
85
86
87
89
90
91
92
93
100
101
170
201
212
216
217
219
220
223
224
225
227
228
229
230
231
238
239
261
262
263
264
265
266
267