Commit 386e2f38 authored by Paul McCarthy

Merge branch 'enh/parallel-import' into 'master'

Enh/parallel import

Closes #4

See merge request !25
parents 981a41d5 4e87b352
[run]
concurrency = multiprocessing
concurrency = thread multiprocessing
source = funpack
[report]
......
......@@ -2,8 +2,63 @@ FUNPACK changelog
=================
1.4.5 (Thursday 5th December)
-----------------------------
1.5.0 (Under development)
-------------------------
Added
^^^^^
* New :func:`.util.wc` function to count the rows (lines) of a file;
this is simply a wrapper around the UNIX ``wc`` tool.
* New :func:`.util.cat` function to concatenate multiple files together;
this is simply a wrapper around the UNIX ``cat`` tool.
* New :meth:`.DataTable.subtable` and :meth:`.DataTable.merge` methods, to aid
in passing data to/from worker processes.
* Processing functions can now be specified to run independently on a subset
of variables by using ``'independent'`` in the variable list.
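As a rough illustration of the ``wc`` and ``cat`` wrappers listed above, the sketch below shells out to the UNIX tools. It is not the FUNPACK implementation: the function signatures, and the pure-Python fallback used when the tools are not installed, are assumptions made for this example.

```python
import shutil
import subprocess


def wc(fname):
    """Return the number of newline characters in fname, as ``wc -l`` does."""
    if shutil.which('wc') is not None:
        result = subprocess.run(['wc', '-l', fname],
                                check=True,
                                capture_output=True,
                                text=True)
        # wc prints "<count> <filename>", possibly with leading spaces
        return int(result.stdout.split()[0])

    # Assumed fallback for platforms without wc:
    # count newline bytes, reading 1MB at a time
    with open(fname, 'rb') as f:
        return sum(chunk.count(b'\n')
                   for chunk in iter(lambda: f.read(1 << 20), b''))


def cat(files, dest):
    """Concatenate the given files, in order, into dest."""
    with open(dest, 'wb') as out:
        if shutil.which('cat') is not None:
            subprocess.run(['cat'] + list(files), check=True, stdout=out)
            return

        # Assumed fallback for platforms without cat
        for fname in files:
            with open(fname, 'rb') as f:
                shutil.copyfileobj(f, out)
```

Helpers like these fit the chunked import and export described in this release: ``wc`` gives the row count needed to divide a file between workers, and ``cat`` joins the per-worker output files.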
Changed
^^^^^^^
* FUNPACK will now parallelise tasks by default; previously it would only
parallelise tasks if ``--low_memory`` mode were selected.
* The data import stage is parallelised by using multiple processes to read
different chunks of the input file(s), and then concatenating the resulting
``pandas.DataFrame`` objects afterwards.
* Cleaning functions are executed on each variable in parallel.
* Each processing step is executed in parallel where possible
(e.g. ``independent`` processes), but processing steps are still executed
sequentially. New columns created by processing functions are saved to
disk, and re-loaded by the main process, rather than being passed back to
the main process via inter-process communication.
* The ``removeIfRedundant`` process now compares pairs of columns in parallel.
* The data export stage is parallelised by writing chunks of rows to different
files, and then concatenating them into a single output file afterwards.
* The ``--variable``, ``--subject`` and ``--exclude`` options now accept
comma-separated mixtures of IDs and MATLAB-style ranges.
* Updates to FMRIB categories.
* Updates to FMRIB processing rules, to take advantage of parallelism.
* The :mod:`icd10` module must now be initialised via the
:func:`.icd10.initialise` function, when it is to be used in a multiprocessing
context. This is not necessary when ``funpack`` is configured to not
parallelise tasks (e.g. with ``--num_jobs 1``).
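The comma-separated mixtures of IDs and MATLAB-style ranges accepted by ``--variable``, ``--subject`` and ``--exclude`` could be expanded along the following lines. The helper names ``parse_range`` and ``parse_id_list`` are hypothetical (FUNPACK itself uses ``util.parseMatlabRange``, whose exact behaviour is not shown here), and the sketch assumes ``start:stop`` and ``start:step:stop`` forms with an inclusive stop and a positive step.

```python
def parse_range(token):
    """Expand a single ID, or a start:stop / start:step:stop range.

    Raises ValueError if the token is not a valid ID or range.
    """
    parts = [int(p) for p in token.strip().split(':')]

    if len(parts) == 1:
        return parts
    elif len(parts) == 2:
        start, stop = parts
        step = 1
    elif len(parts) == 3:
        start, step, stop = parts
    else:
        raise ValueError('Invalid range: {}'.format(token))

    # Assumption: only positive steps are supported in this sketch
    if step <= 0:
        raise ValueError('Invalid step: {}'.format(token))

    # The stop value is inclusive, as in MATLAB
    return list(range(start, stop + 1, step))


def parse_id_list(text):
    """Expand a comma-separated mixture of IDs and ranges."""
    ids = []
    for token in text.split(','):
        ids.extend(parse_range(token))
    return ids
```

For example, ``parse_id_list('31,34,22000:22003,2714:10:2744')`` expands to ``[31, 34, 22000, 22001, 22002, 22003, 2714, 2724, 2734, 2744]``.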
Deprecated
^^^^^^^^^^
* The ``--low_memory`` and ``--work_dir`` options have been deprecated, and no
longer have any effect. The :mod:`.storage` module is no longer used, but is
still present for possible future usage.
1.4.5 (Thursday 5th December 2019)
----------------------------------
Changed
......
......@@ -3,7 +3,7 @@
.. image:: https://img.shields.io/pypi/v/fmrib-unpack.svg
:target: https://pypi.python.org/pypi/funpack/
:target: https://pypi.python.org/pypi/fmrib-unpack/
.. image:: https://anaconda.org/conda-forge/fmrib-unpack/badges/version.svg
:target: https://anaconda.org/conda-forge/fmrib-unpack
......@@ -35,7 +35,6 @@ Installation
Install FUNPACK via pip::
pip install fmrib-unpack
......@@ -44,27 +43,17 @@ Or from ``conda-forge``::
conda install -c conda-forge fmrib-unpack
Introductory notebook
---------------------
The ``funpack_demo`` command will start a Jupyter Notebook which introduces
the main features provided by FUNPACK. To run it, you need to install a few
additional dependencies::
the main features provided by FUNPACK. If you are using ``pip``, you need to
install a few additional dependencies::
pip install fmrib-unpack[demo]
.. note:: If you have installed FUNPACK into an FSL environment, you will
need to install the notebook dependencies like this (you may
need administrative privileges)::
source $FSLDIR/fslpython/bin/activate fslpython
pip install fmrib-unpack[demo]
You can then start the demo by running ``funpack_demo``.
......@@ -78,7 +67,6 @@ Usage
General usage is as follows::
funpack [options] output.tsv input1.tsv input2.tsv
......@@ -88,7 +76,6 @@ You can get information on all of the options by typing ``funpack --help``.
Options can be specified on the command line, and/or stored in a configuration
file. For example, the options in the following command line::
funpack \
--overwrite \
--import_all \
......@@ -101,7 +88,6 @@ file. For example, the options in the following command line::
Could be stored in a configuration file ``config.txt``::
overwrite
import_all
log_file log.txt
......@@ -112,7 +98,6 @@ Could be stored in a configuration file ``config.txt``::
And then executed as follows::
funpack -cfg config.txt output.tsv input1.tsv input2.tsv
......@@ -146,7 +131,6 @@ FUNPACK contains a large number of built-in rules which have been specifically
written to pre-process UK BioBank data variables. These rules are stored in
the following files:
* ``funpack/configs/fmrib/datacodings_*.tsv``: Cleaning rules for data codings
* ``funpack/configs/fmrib/variables_*.tsv``: Cleaning rules for individual
variables
......@@ -285,7 +269,6 @@ Tests
To run the test suite, you need to install some additional dependencies::
pip install fmrib-unpack[test]
......
......@@ -6,7 +6,7 @@
#
__version__ = '1.4.5'
__version__ = '1.5.0.dev0'
"""The ``funpack`` versioning scheme roughly follows Semantic Versioning
conventions.
"""
......
......@@ -12,7 +12,7 @@ a set of cleaning steps on the data.
import itertools as it
import functools as ft
import logging
import functools
import numpy as np
......@@ -68,11 +68,6 @@ def cleanData(dtable,
return dtable
def _runNAInsertion(dtable, col, navals):
"""Performs NA insertion on one column. """
dtable[:, col] = dtable[:, col].replace(navals)
def applyNAInsertion(dtable):
"""Re-codes data which should be interpreted as missing/not available.
......@@ -92,9 +87,6 @@ def applyNAInsertion(dtable):
log.debug('Recoding missing values as NA for %u variables ...',
len(vids))
allcols = []
allnavals = []
for vid in vids:
if not dtable.present(vid):
......@@ -103,13 +95,7 @@ def applyNAInsertion(dtable):
navals = {v : np.nan for v in vtable['NAValues'][vid]}
for col in dtable.columns(vid):
allcols .append(col.name)
allnavals.append(navals)
job = ft.partial(_runNAInsertion, dtable)
with dtable.pool() as pool:
pool.starmap(job, zip(allcols, allnavals))
dtable[:, col.name] = dtable[:, col.name].replace(navals)
def _runCleaningFunctions(dtable, procs, vid):
......@@ -120,6 +106,7 @@ def _runCleaningFunctions(dtable, procs, vid):
for proc in procs.values():
proc.run(dtable, vid)
return dtable
def applyCleaningFunctions(dtable):
......@@ -130,8 +117,9 @@ def applyCleaningFunctions(dtable):
vtable = dtable.vartable
vids = vtable.index[vtable['Clean'].notna()]
allprocs = []
allvids = []
allprocs = []
allvids = []
subtables = []
for vid in vids:
......@@ -139,13 +127,18 @@ def applyCleaningFunctions(dtable):
continue
procs = vtable.loc[vid, 'Clean']
cols = dtable.columns(vid)
allprocs.append(procs)
allvids .append(vid)
allprocs .append(procs)
allvids .append(vid)
subtables.append(dtable.subtable(cols))
with dtable.pool() as pool:
pool.starmap(ft.partial(_runCleaningFunctions, dtable),
zip(allprocs, allvids))
subtables = pool.starmap(_runCleaningFunctions,
zip(subtables, allprocs, allvids))
for subtable in subtables:
dtable.merge(subtable)
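The pattern in the hunk above (split the table into per-variable sub-tables, run the cleaning functions on each one in a worker, then merge the results back) can be sketched without FUNPACK itself. In this sketch a plain ``dict`` of columns stands in for ``DataTable``, ``subtable`` and ``merge`` are hypothetical free functions rather than the real methods, and a thread pool stands in for the process pool so that the example runs unchanged on any platform.

```python
from multiprocessing.pool import ThreadPool


def subtable(table, cols):
    """Copy the named columns into a new, independent table."""
    return {c: list(table[c]) for c in cols}


def merge(table, sub):
    """Copy the columns of sub back into table."""
    table.update(sub)


def run_cleaning(sub, procs, vid):
    """Apply each cleaning function to every column of sub.

    The sub-table is returned because, with a process pool, the worker
    would be operating on a copy, and in-place changes would be lost.
    """
    for proc in procs:
        for col in sub:
            sub[col] = [proc(v) for v in sub[col]]
    return sub


def apply_cleaning(table, columns, cleaners):
    """Clean each variable separately, then merge the results.

    columns maps variable ID -> column names; cleaners maps
    variable ID -> list of functions (both hypothetical structures).
    """
    vids      = [vid for vid in cleaners if vid in columns]
    subtables = [subtable(table, columns[vid]) for vid in vids]
    allprocs  = [cleaners[vid] for vid in vids]

    with ThreadPool(2) as pool:
        results = pool.starmap(run_cleaning,
                               zip(subtables, allprocs, vids))

    for sub in results:
        merge(table, sub)
```

Passing only the relevant columns to each worker, rather than binding the whole table with ``ft.partial``, keeps the amount of data sent to each worker small. This appears to be the motivation for the new ``subtable`` and ``merge`` methods.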
def _runChildValues(dtable, exprs, cvals, vid):
......@@ -222,7 +215,7 @@ def _runChildValues(dtable, exprs, cvals, vid):
# Now we can combine all expressions
# to get the final result, and restrict
# the mask to only affect missing values.
mask = functools.reduce(lambda a, b: a | b, masks)
mask = ft.reduce(lambda a, b: a | b, masks)
mask = mask & dtable[:, colname].isna()
# Finally we apply it to the data.
......@@ -257,22 +250,8 @@ def applyChildValues(dtable):
# parent variables
evalOrder = [eo[1] for eo in evalOrder]
for vidlevel in evalOrder:
job = ft.partial(_runChildValues, dtable, exprs, cvals)
with dtable.pool() as pool:
pool.map(job, vidlevel)
def _runNewLevels(dtable, col, valmap):
"""Recodes categoricals for the given column. """
name = col.name
old = dtable[:, name]
new = old.replace(valmap)
corr = old.corr(new)
dtable[:, name] = new
if corr < 0:
dtable.addFlag(col, 'inverted')
for vid in vidlevel:
_runChildValues(dtable, exprs, cvals, vid)
def applyNewLevels(dtable):
......@@ -296,19 +275,18 @@ def applyNewLevels(dtable):
log.debug('Recoding categoricals for %u variables ...',
len(vids))
cols = []
valmaps = []
for vid in vids:
if not dtable.present(vid):
continue
valmap = {r : n for r, n in zip(rawlevels[vid], newlevels[vid])}
valmap = dict(zip(rawlevels[vid], newlevels[vid]))
for col in dtable.columns(vid):
cols .append(col)
valmaps.append(valmap)
old = dtable[:, col.name]
new = old.replace(valmap)
corr = old.corr(new)
dtable[:, col.name] = new
with dtable.pool() as pool:
pool.starmap(ft.partial(_runNewLevels, dtable), zip(cols, valmaps))
if corr < 0:
dtable.addFlag(col, 'inverted')
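The recoding step above flags a column as ``inverted`` when the recoded values correlate negatively with the originals. A minimal stand-alone sketch of that check follows. It uses plain numeric lists and a hand-written Pearson correlation in place of ``pandas``, the function names are hypothetical, and it assumes non-empty input with no missing values.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Returns NaN if either sequence is constant.
    """
    n   = len(x)
    mx  = sum(x) / n
    my  = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx  = sum((a - mx) ** 2 for a in x)
    vy  = sum((b - my) ** 2 for b in y)

    if vx == 0 or vy == 0:
        return float('nan')

    return cov / math.sqrt(vx * vy)


def recode(values, valmap):
    """Recode values via valmap, leaving unmapped values unchanged.

    Returns the recoded values, and a flag which is True if the
    recoding has reversed the ordering of the original values.
    """
    new      = [valmap.get(v, v) for v in values]
    inverted = pearson(values, new) < 0
    return new, inverted
```

A NaN correlation compares as ``False`` against zero, so a constant column is never flagged as inverted.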
......@@ -121,7 +121,6 @@ CLI_ARGUMENTS = collections.OrderedDict((
(('oi', 'output_id_column'), {}),
(('edf', 'date_format'), {'default' : 'default'}),
(('etf', 'time_format'), {'default' : 'default'}),
(('nr', 'num_rows'), {'type' : int}),
(('uf', 'unknown_vars_file'), {}),
(('imf', 'icd10_map_file'), {}),
(('def', 'description_file'), {}),
......@@ -131,6 +130,7 @@ CLI_ARGUMENTS = collections.OrderedDict((
(('ts', 'tsv_sep'), {'default' : DEFAULT_TSV_SEP}),
(('tm', 'tsv_missing_values'), {'default' : ''}),
(('nn', 'non_numeric_file'), {}),
(('nr', 'num_rows'), {'type' : int}),
(('tvf', 'tsv_var_format'), {'nargs' : 2,
'metavar' : ('VID', 'FORMATTER'),
'action' : 'append'})]),
......@@ -346,9 +346,6 @@ CLI_ARGUMENT_HELP = {
'time_format' :
'Formatter to use for time variables (default: "default").',
'num_rows' :
'Number of rows to write at a time.',
'unknown_vars_file' :
'Save list of unknown variables/columns to file. Only applicable if '
'--import_all is enabled.',
......@@ -369,6 +366,8 @@ CLI_ARGUMENT_HELP = {
'tsv_missing_values' :
'String to use for missing values in output file (default: empty '
'string).' ,
'num_rows' :
'Number of rows to write at a time. Ignored if --num_jobs is set to 1.',
'non_numeric_file' :
'Export all non-numeric columns (after formatting) to this file instead '
'of the primary output file.',
......@@ -386,15 +385,13 @@ CLI_ARGUMENT_HELP = {
'version' : 'Print version and exit.',
'dry_run' : 'Print a summary of what would happen and exit.',
'no_builtins' : 'Do not use the built in variable or data coding tables.',
'low_memory' : 'Store intermediate results on disk, rather than in RAM. '
'Use this flag on systems which cannot store the full '
'data set in RAM. ',
'work_dir' : 'Directory to store intermediate files (default: '
'temporary directory). Only relevant when using '
'--low_memory.',
'low_memory' : 'Deprecated, has no effect.',
'work_dir' : 'Deprecated, has no effect.',
'log_file' : 'Save log messages to file.',
'num_jobs' : 'Maximum number of jobs to run in parallel. '
'(default: {}).'.format(mp.cpu_count()),
'num_jobs' : 'Maximum number of jobs to run in parallel. Set to 1 '
'to disable parallelisation. '
'(default: number of available CPUs [{} on '
'this platform]).'.format(mp.cpu_count()),
'pass_through' : 'Do not perform any cleaning or processing on the data - '
'implies --skip_insertna, --skip_childvalues, '
'--skip_clean_funcs, --skip_recoding, and '
......@@ -597,11 +594,6 @@ def parseArgs(argv=None, namespace=None):
'either one encoding, or one encoding for each '
'input file.')
# parallelisation only allowed
# in low_memory mode
if not args.low_memory:
args.num_jobs = 0
# turn loaders into dict of { absfile : name } mappings
if args.loader is not None:
args.loader = {op.realpath(f) : n for f, n in args.loader}
......@@ -647,26 +639,24 @@ def parseArgs(argv=None, namespace=None):
parsed = [int(t.strip()) for t in parsed]
else:
# Or they may be an ID or matlab-style
# start[:step[:stop]] range, both handled
# by the parseMatlabRange function.
try: parsed = util.parseMatlabRange(thing)
except ValueError: parsed = None
# Or they may be a comma-separated
# list of IDs
if parsed is None:
try:
parsed = [int(v) for v in thing.split(',')]
# Or they may be one or more comma-separated
# IDs or matlab start:step[:stop] ranges,
# both handled by the parseMatlabRange function.
try:
parsed = []
for tkn in thing.split(','):
parsed.extend(util.parseMatlabRange(tkn))
except ValueError:
parsed = None
# --subject may also be an expression,
# so if error is False, and the range/
# list parses fail, we pass the argument
# through. Otherwise we propagate the
# error.
except ValueError:
if error:
raise
if error:
raise
if parsed is None:
failed.append(thing)
......
ID Category Variables
1 age, sex, brain MRI protocol, Phase 31,34,21022,22200,25780
2 genetics 21000,22000:22125,22201:22325
3 early life factors 52,129,130,1677,1687,1697,1737,1767,1777,1787,20022
2 genetics 21000,22000:22125,22201:22325,22182,22800:22823
3 early life factors 52,129,130,1677,1687,1697,1737,1767,1777,1787,21066,20022
10 lifestyle and environment - general 3:6,132,189,670,680,699,709,728,738,767,777,1031,1797,1807,1835,1845,1873,1883,2139,2149,2159,2237,2375,2385,2395,2405,2267,2277,2714:10:2834,2946,3526,3536,3546,3581,3591,3659,3669,3700,3710,3720,3829,3839,3849,3872,3882,3912,3942,3972,3982,4501,4674,4825,4836,5057,6138,6142,6139:6141,6145:6146,6160,10016,10105,10114,10721,10722,10740,10749,10860,10877,10886,20074:20075,20110:20113,20118:20119,20121,22501,22599,22606,22700,22702,22704,24003:24019,24024,24500:24508,26410:26434
11 lifestyle and environment - exercise and work 1001,1011,796,806,816,826,845,864,874,884,894,904,914,924,943,971,981,991,1021,1050:10:1220,2624,2634,3426,3637,3647,6143,6162,6164,10953,10962,10971,22604,22605,22607:22615,22620,22630,22631,22640:22655,104900,104910,104920
12 lifestyle and environment - food and drink 1289:10:1389,1408:10:1548,2654,3089,3680,6144,10007,10723,10767,10776,10855,10912,20084:20094,20098:20109,100001:100009,100011:100019,100021:100025,100010:10:100560,100760:10:104670
......@@ -11,15 +11,15 @@ ID Category Variables
21 physical measures - bone density and sizes 77,78,3083:3086,3143:3144,3146:3148,4092,4095,4100:4101,4103:4106,4119:4120,4122:4125,4138:4147,23200:23243,23290:23320
22 physical measures - cardiac & blood vessels 93:95,102,4079,4080,4136,4194:4196,4198:4200,4204:4205,4207,5983,5984,5986,5992,5993,6014:6017,6019,6020,6022,6024,6032:6034,6038,6039,12673:12687,12336,12338,12340,12697,12698,12702,21021,22330:22338,22420:22426,22670:22685
23 hearing test 4229:4230,4232:4237,4239:4247,4249,4268:4270,4272,4275:4277,4279,4849,10793,20019,20021,20060
24 eye test 5076:5077,5082:5091,5096:5119,5132:5136,5138:5149,5152,5155:5164,5181:5183,5186,5188,5190,5193,5198:5199,5201,5202,5206,5208,5209,5215,5221,5237,5251,5254:5259,5262:5267,5274,5276,5292,5306,5324:5328,6070:6075,20052,20055,20261:20262
24 eye test 5076:5079,5082:5091,5096:5119,5132:5136,5138:5149,5152,5155:5164,5181:5183,5186,5188,5190,5193,5198:5199,5201,5202,5204,5206,5208,5209,5211,5215,5221,5237,5251,5254:5259,5262:5267,5274,5276,5292,5306,5324:5328,6070:6075,20052,20055,20261:20262
25 physical activity measures 5985,90002:90003,90010:90013,90015:90177,90179:90195
26 abdominal measures 22415:22416
30 blood assays 74,23000:23044,23049:23059,23062,23065:23068,23070,23073,23074,30000:10:30300,30104,30112,30114,30172,30174,30242,30252,30254,30314:10:30344,30364:10:30424,30500:10:30530,30600:10:30890
31 brain IDPs 25000:25746,25756:25758,25761:25768,25781:25920,26500:26502,26504:26506,26508:26513,26517:26518,26520:26524,26526:26528,26530:26552,26554:26720,26722:26723,26725:26727,26732,26734:26740,26743,26746:26750,26752,26754:26757,26759:26761,26763,26766,26768:26769,26771,26773:26774,26777,26780,26781:26782,26784,26786,26788:26790,26792:26796,26799,26801:26807,26810,26813:26819,26821:26824,26826:26827,26833,26835:26837,26839:26841,26844,26847:26851,26853:26857,26860:26862,26864,26867,26869:26870,26873:26875,26878,26881:26883,26885,26887,26889:26891,26893:26895,26897,26900,26902:26908,26911,26914:27772
30 blood assays 74,23000:23044,23049:23060,23062,23063,23065:23071,23073:23075,30000:10:30300,30104,30112,30114,30172,30174,30242,30252,30254,30314:10:30344,30364:10:30424,30500:10:30530,30600:10:30890
31 brain IDPs 25000:25746,25754:25759,25761:25768,25781:25920,26500:26508:26513,26517:26518,26520:26552,26554:26720,26722:26723,26725:26727,26732,26734:26740,26743,26746:26750,26752,26754:26757,26759:26761,26763,26766,26768:26769,26771,26773:26774,26777,26780,26781:26782,26784,26786,26788:26790,26792:26796,26799,26801:26807,26810,26813:26819,26821:26824,26826:26827,26833,26835:26837,26839:26841,26844,26847:26851,26853:26857,26860:26862,26864,26867,26869:26870,26873:26875,26878,26881:26883,26885,26887,26889:26891,26893:26895,26897,26900,26902:26908,26911,26914:27772
32 cognitive phenotypes 62,111,396:404,630,4250:4256,4258:4260,4281:4283,4285,4287,4290:4292,4294,4924,4935,4957,4968,4979,4990,5001,5012,5556,5699,5779,5790,5866,6312,6332,6333,6348:6351,6362,6373,6374,6382,6383,6671,6770:6773,10133:10134,10136:10144,10146:10147,10241,10609:10610,10612,20016,20018,20023,20082,20128:20157,20159,20165,20167,20169:2:20195,20196:2:20200,20229,20230,20240,20242,20244:20248,23321:23324
50 health and medical history, health outcomes 84,87,92,134:137,2178,2188,2207,2217,2227,2247,2257,2296,2316,2335:10:2365,2415,2443:10:2473,2492,2674,2684,2694,2704,2844,2956:10:2986,3005,3079,3140,3393,3404,3414,3571,3606,3616,3627,3741,3751,3761,3773,3786,3799,3809,3894,3992,4012,4022,4041,4056,4067,4689,4700,4717,4728,4792,4803,4814,5408,5419,5430,5441,5452,5463,5474,5485,5496,5507,5518,5529,5540,5610,5832,5843,5855,5877,5890,5901,5912,5923,5934,5945,6119,6147,6148,6149,6150,6151,6152,6153,6154,6155,6159,6177,6179,6205,10004:10006,10854,20001:20011,20199,21024:21027,21035:21044,21047,21064:21065,21067,21071,21073:21076,22126:22181,22502:22505,22616,22618,22619,40001:41253,41256,41258,41266,41267,41269,41271,41273,41275,41276,41277,41278,41284,41285,41286,42000:42013
51 mental health self-report 1920:10:2110,4526,4537,4548,4559,4570,4581,4598,4609,4620,4631,4642,4653,5375,5386,5663,5674,6156,20122,20126:20127,20401,20417:20423,20425:20429,20431:20442,20445:20450,20453:20460,20465:20467,20470,20473,20476,20477,20479,20481:20483,20485:20500,20502,20505:20512,20514:20526,20528:20544,20546:20551,20553:20554,21062:21063
60 health dates 41257,41260,41262,41263,41268,41280:41283,42014,42016,130004,130008,130014:2:130020,130062,130064,130070,130082,130106,130134,130174:2:130178,130184:2:130190,130194,130202,130216,130218,130224:2:130230,130264,130310,130320,130336,130338,130342,130344,130622,130624,130648,130656:2:130660,130664,130670,130686,130696:2:130708,130714,130718,130722,130726,130734,130736,130770,130774,130792,130814,130818,130820,130826,130828,130832,130854,130868,130892:2:130898,130902:2:130910,130914,130918,130922,130924,130998:2:131022,131030,131032,131042,131046,131048,131052:2:131056,131060:2:131064,131070:2:131076,131086,131102,131114,131124,131128:2:131132,131136,131138,131142,131144,131148,131150,131154,131158,131164,131166,131178:2:131186,131190,131192,131198,131204,131208:2:131212,131216,131222,131224,131228,131230,131234,131236,131242,131252,131256:2:131264,131270,131282,131286,131296,131298,131304:2:131308,131314,131316,131322,131324,131338,131342,131344,131348:2:131356,131366,131368,131370,131374,131382,131386,131390,131392,131396,131402,131404,131408,131410,131414,131416,131424:2:131432,131436,131442,131456,131458,131462:2:131476,131480:2:131484,131490:2:131494,131498,131528,131534,131538,131540,131546,131548,131554,131556,131560:2:131586,131590:2:131594,131598:2:131604,131608,131612:2:131620,131624:2:131654,131666:2:131670,131674:2:131684,131688,131692,131698:2:131708,131720,131722,131726,131730,131734:2:131748,131754,131760,131768,131774,131778,131782,131790:2:131798,131802:2:131806,131810,131812,131822:131826,131830,131836,131850,131852,131858,131864,131868:2:131888,131892,131900,131906,131910:2:131914,131916,131918,131922:2:131928,131934,131938:2:131942,131946:2:131950,131954:2:131964,131970:2:131974,131980,131988:2:131992,132002,132008,132016,132020,132022,132030:2:132038,132042,132050,132054:2:132058,132062:2:132066,132070:2:132078,132082:2:132088,132092,132096:2:132106,132110,132112,132116,132118,132122,132124,132128:2:132152,132156,132160,132162,132166:
2:132170,132186,132192,132194,132202,132206,132216,132220,132224,132230,132238:2:132244,132250,132252,132260:2:132264,132268,132274:2:132280,132298,132522,132532,132542,132562,132574,132312
70 health sources 42015,42017,130005,130009,130015:2:130019,130063,130071,130083,130107,130135,130175,130177,130185:2:130191,130195,130203,130217,130219,130225,130231,130265,130311,130337,130343,130345,130623,130625,130649,130657:2:130661,130665,130671,130687,130697:2:130709,130715,130719,130723,130727,130735,130737,130771,130775,130793,130815,130819,130821,130827,130829,130833,130855,130869,130893:2:130899,130905:2:130911,130915,130919,130923,130925,130999,131001,131023,131031,131033,131043,131047,131049,131053,131055,131057,131061,131063,131065,131071,131075,131077,131087,131103,131115,131129,131131,131133,131137,131139,131145,131151,131155,131159,131167,131179:2:131187,131191,131193,131199,131205,131209,131211,131213,131217,131229,131231,131237,131243,131253,131257,131259,131261,131265,131271,131283,131287,131297,131299,131305,131307,131309,131315,131317,131323,131325,131339,131343,131345,131349:131357,131361,131367:131371,131375,131383,131387,131391,131393,131397,131403,131409,131411,131415,131417,131425,131431,131457,131463:2:131477,131481,131483,131485,131491,131493,131495,131499,131529,131535,131539,131541,131547,131549,131555,131561,131563,131565:2:131587,131591,131593,131595,131599:2:131605,131613:2:131621,131625:2:131653,131667:2:131671,131675:2:131685,131689,131693,131701:2:131709,131727,131731,131735:2:131743,131749,131755,131761,131769,131775,131783,131793,131795,131797,131803,131805,131807,131811,131813,131823,131825,131827,131831,131837,131851,131859,131865,131869,131871,131873,131877:2:131889,131893,131901,131907,131911,131913:2:131919,131923:2:131929,131935,131939,131941,131943,131947,131949,131951,131955:2:131965,131971,131973,131975,131981,131989,131991,131993,132003,132009,132017,132021,132023,132033,132035,132037,132039,132043,132051,132055,132057,132059,132063,132067,132071:2:132079,132083:2:132089,132093,132097:2:132107,132111,132113,132117,132119,132123,132125,132129:2:132153,132157,132161,132163,132167,132169,132171,132187,132193,132195,1322
03,132213,132217,132245,132269,132275:2:132281,132299,132523,132533,132543,132563,132575,132312:132313
50 health and medical history, health outcomes 84,87,92,134:137,2178,2188,2207,2217,2227,2247,2257,2296,2316,2335:10:2365,2415,2443:10:2473,2492,2674,2684,2694,2704,2844,2956:10:2986,3005,3079,3140,3393,3404,3414,3571,3606,3616,3627,3741,3751,3761,3773,3786,3799,3809,3894,3992,4012,4022,4041,4056,4067,4689,4700,4717,4728,4792,4803,4814,5408,5419,5430,5441,5452,5463,5474,5485,5496,5507,5518,5529,5540,5610,5832,5843,5855,5877,5890,5901,5912,5923,5934,5945,6119,6147,6148,6149,6150,6151,6152,6153,6154,6155,6159,6177,6179,6205,10004:10006,10854,20001:20011,20199,21024:21045,21047:21061,21064:21065,21067,21068,21070:21076,22126:22181,22502:22505,22616,22618,22619,40001:41253,41256,41258,41266,41267,41269,41271,41273,41275,41276,41277,41278,41284,41285,41286,42000:42013
51 mental health self-report 1920:10:2110,4526,4537,4548,4559,4570,4581,4598,4609,4620,4631,4642,4653,5375,5386,5663,5674,6156,20122,20126:20127,20401,20411,20417:20423,20425:20429,20431:20442,20445:20450,20453:20460,20463,20465:20467,20470:20471,20473,20476,20477,20479:20484,20485:20502,20505:20544,20546:20551,20553:20554,21062:21063
60 health dates 41257,41260,41262,41263,41268,41280:41283,42014,42016,130004,130008,130014:2:130020,130062,130064,130070,130082,130106,130134,130174:2:130178,130184:2:130190,130194,130202,130216,130218,130224:2:130230,130264,130310,130320,130336,130338,130342,130344,130622,130624,130648,130656:2:130660,130664,130670,130686,130696:2:130708,130714,130718,130722,130726,130734,130736,130770,130774,130792,130814,130818,130820,130826,130828,130832,130854,130868,130892:2:130898,130902:2:130910,130914,130918,130922,130924,130998,131000,131022,131030,131032,131042,131046,131048,131052:2:131056,131060:2:131064,131070:2:131076,131086,131102,131114,131124,131128:2:131132,131136,131138,131142,131144,131148,131150,131154,131158,131164,131166,131178:2:131186,131190,131192,131198,131204,131208:2:131212,131216,131222,131224,131228,131230,131234,131236,131242,131252,131256:2:131264,131270,131282,131286,131296,131298,131304:2:131308,131314,131316,131322,131324,131338,131342,131344,131348:2:131356,131360,131366:2:131370,131374,131382,131386,131390,131392,131396,131402,131404,131408,131410,131414,131416,131424:2:131432,131436,131442,131456,131458,131462:2:131476,131480:2:131484,131490:2:131494,131498,131528,131534,131538,131540,131546,131548,131554,131556,131560:2:131586,131590:2:131594,131598:2:131604,131608,131612:2:131620,131624:2:131654,131666:2:131670,131674:2:131684,131688,131692,131698:2:131708,131720,131722,131726,131730,131734:2:131742,131746,131748,131754,131760,131768,131774,131778,131782,131790:2:131798,131802:2:131806,131810,131812,131822:131826,131830,131836,131850,131852,131858,131864,131868:2:131888,131892,131900,131906,131910:2:131914,131916,131918,131922:2:131928,131934,131938:2:131942,131946:2:131950,131954:2:131964,131970:2:131974,131980,131988:2:131992,132002,132008,132016,132020,132022,132030:2:132038,132042,132050,132054:2:132058,132062:2:132066,132070:2:132078,132082:2:132088,132092,132096:2:132106,132110,132112,132116,132118,132122,132124,132128:2:132152,132156,
132160,132162,132166:2:132170,132186,132192,132194,132202,132206,132212,132216,132220,132224,132230,132238:2:132244,132250,132252,132260:2:132264,132268,132274:2:132280,132298,132522,132532,132542,132562,132574,132312
70 health sources 42015,42017,130005,130009,130015:2:130019,130063,130065,130071,130083,130107,130135,130175:130179,130185:2:130191,130195,130203,130217,130219,130225,130231,130265,130311,130337,130343,130345,130623,130625,130649,130657:2:130661,130665,130671,130687,130697:2:130709,130715,130719,130723,130727,130735,130737,130771,130775,130793,130815,130819,130821,130827,130829,130833,130855,130869,130893:2:130899,130903:2:130911,130915,130919,130923,130925,130999,131001,131023,131031,131033,131043,131047,131049,131053,131055,131057,131061,131063,131065,131071:2:131077,131087,131103,131115,131125,131129,131131,131133,131137,131139,131145,131149,131151,131155,131159,131165,131167,131179:2:131187,131191,131193,131199,131205,131209,131211,131213,131217,131223,131225,131229,131231,131237,131243,131253,131257:2:131265,131271,131283,131287,131297,131299,131305,131307,131309,131315,131317,131323,131325,131339,131343,131345,131349:131357,131361,131367:131371,131375,131383,131387,131391,131393,131397,131403,131409,131411,131415,131417,131425:2:131433,131437,131443,131457,131459,131463:2:131477,131481,131483,131485,131491:2:131495,131499,131529,131535,131539,131541,131547,131549,131555,131557,131561,131563,131565:2:131587,131591,131593,131595,131599:2:131605,131609,131613:2:131621,131625:2:131655,131667:2:131671,131675:2:131685,131689,131693,131701:2:131709,131727,131731,131735:2:131743,131747,131749,131755,131761,131769,131775,131779,131783,131793,131795,131797,131803,131805,131807,131811,131813,131823,131825,131827,131831,131837,131851,131859,131865,131869:2:131889,131893,131901,131907,131911,131913:2:131919,131923:2:131929,131935,131939,131941,131943,131947,131949,131951,131955:2:131965,131971,131973,131975,131981,131989,131991,131993,132003,132009,132017,132021,132023,132031:2:132039,132043,132051,132055,132057,132059,132063:2:132067,132071:2:132079,132083:2:132089,132093,132097:2:132107,132111,132113,132117,132119,132123,132125,132129:2:132153,132157,132161,132163,132167
,132169,132171,132187,132193,132195,132203,132207,132213,132217,132221,132225,132245,132265,132269,132275:2:132281,132299,132523,132533,132543,132563,132575,132313
98 pending 41259,41261,41264,42038:42040
99 miscellaneous 19,21,35:45,53:55,68,96,120,757,1647,2129,3061,3066,3077,3081:3082,3090,3137,3166,4081,4093,4096,4206,4257,4286,4288:4289,4293,4295,5214,5253,5270,5987:5988,5991,10145,10697,12139:12141,12148,12187,12188,12223,12224,12253,12254,12323,12623,12624,12651,12652,12654,12658,12663,12664,12671,12695,12699,12700,12704,12706,12848,12851,12854,20012:20014,20024,20031:20032,20041:20054,20058:20059,20061:20062,20072,20077:20081,20083,20114:20115,20158,20201:20227,20249:20253,20400,21003,21023,21611,21621,21622,21625,21631,21634,21642,21651,21661:21666,21671,21711,21721:21723,21725,21731:21734,21736,21738,21741,21742,21751,21761:21766,21771,21811,21833,21834,22499,22500,22600:22603,22617,22660:22664,23048,25747:25753,30001:10:30301,30002:10:30302,30003:10:30303,30004:10:30304,30522,30532,30615,30621,30622,30635,30645,30665,30666,30691,30692,30725,30751,30755,30775,30791,30795,30796,30805,30806,30825,30826,30835,30845,30855,30856,30885,30895,30897,40000,105010,105030,110005,110006,110008
99 miscellaneous 19,21,35:45,53:55,68,96,120,200,393,757,1647,2129,3060,3061,3066,3077,3081:3082,3090,3137,3166,4081,4093,4096,4206,4238,4248,4257,4286,4288:4289,4293,4295,5074,5075,5080,5081,5214,5253,5270,5987:5988,5991,6023,6025,6334,10145,10697,12139:12141,12148,12187,12188,12223,12224,12253,12254,12291,12323,12623,12624,12651:12654,12658,12663,12664,12671,12688,12695,12699,12700,12704,12706,12848,12851,12854,20012:20014,20024:20025,20031:20032,20035,20041:20054,20058:20059,20061:20062,20072,20077:20081,20083,20114:20115,20158,20201:20227,20249:20253,20400,21003,21023,21069,21611,21621,21622,21625,21631,21634,21642,21651,21661:21666,21671,21711,21721:21723,21725,21731:21734,21736,21738,21741,21742,21751,21761:21766,21771,21811,21821:21823,21825,21831:21834,21836,21838,21841:21842,21851,21861:21866,21871,22499,22500,22600:22603,22617,22660:22664,23048,23160,25747:25753,30001:10:30301,30002:10:30302,30003:10:30303,30004:10:30304,30354,30502:10:30522,30532,30601:10:30891,30615,30622,30635,30645,30665,30666,30692,30725,30755,30775,30795,30796,30805,30806,30825,30826,30835,30845,30855,30856,30875,30885,30895,30897,40000,105010,105030,110005,110006,110008
Variable Process
6155 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='codingdesc')
20001 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='hierarchycodedesc')
20002 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='hierarchycodedesc')
20003 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='codingdesc')
20004 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='hierarchycodedesc')
20199 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='codingdesc')
40001 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='hierarchynumdesc')
40002 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='hierarchynumdesc')
40006 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='hierarchynumdesc')
40011 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='hierarchycodedesc')
40012 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='hierarchycodedesc')
40013 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='hierarchynumdesc')
41200 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='hierarchynumdesc')
41201 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='hierarchynumdesc')
41202 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='hierarchynumdesc')
41203 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='hierarchynumdesc')
41204 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='hierarchynumdesc')
41205 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='hierarchynumdesc')
41210 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='hierarchynumdesc')
41256 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='hierarchynumdesc')
41258 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='hierarchynumdesc')
41270 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='hierarchynumdesc')
41271 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='hierarchynumdesc')
41272 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='hierarchynumdesc')
41273 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='hierarchynumdesc')
independent,6155,20003,20199 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='codingdesc')
independent,20001,20002,20004,40011,40012 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='hierarchycodedesc')
independent,40001,40002,40006,40013,41200,41201,41202,41203,41204,41205,41210,41256,41258,41270,41271,41272,41273 binariseCategorical(acrossVisits=True, acrossInstances=True, metaproc='hierarchynumdesc')
all_independent_except,6155,20001,20002,20003,20004,20199,40001,40002,40006,40011,40012,40013,41200,41201,41202,41203,41204,41205,41210,41256,41258,41270,41271,41272,41273 removeIfSparse(minpres=51, maxcat=0.99, minstd=1e-6, abscat=False)
6155 removeIfSparse(mincat=10)
20001 removeIfSparse(mincat=10)
20002 removeIfSparse(mincat=10)
20003 removeIfSparse(mincat=10)
20004 removeIfSparse(mincat=10)
20199 removeIfSparse(mincat=10)
40001 removeIfSparse(mincat=10)
40002 removeIfSparse(mincat=10)
40006 removeIfSparse(mincat=10)
40011 removeIfSparse(mincat=10)
40012 removeIfSparse(mincat=10)
40013 removeIfSparse(mincat=10)
41200 removeIfSparse(mincat=10)
41201 removeIfSparse(mincat=10)
41202 removeIfSparse(mincat=10)
41203 removeIfSparse(mincat=10)
41204 removeIfSparse(mincat=10)
41205 removeIfSparse(mincat=10)
41210 removeIfSparse(mincat=10)
41256 removeIfSparse(mincat=10)
41258 removeIfSparse(mincat=10)
41270 removeIfSparse(mincat=10)
41271 removeIfSparse(mincat=10)
41272 removeIfSparse(mincat=10)
41273 removeIfSparse(mincat=10)
independent,6155,20001,20002,20003,20004,20199,40001,40002,40006,40011,40012,40013,41200,41201,41202,41203,41204,41205,41210,41256,41258,41270,41271,41272,41273 removeIfSparse(mincat=10)
all removeIfRedundant(0.99, 0.2)
ID Type
20003 float64
20199 float64
......@@ -16,7 +16,6 @@ import contextlib
import collections
from . import storage
from . import loadtables
......@@ -29,7 +28,7 @@ variable IDs really should not conflict with actual UKB variable IDs.
"""
class Column(object):
class Column:
"""The ``Column`` is a simple container class containing metadata
about a single column in a data file.
......@@ -79,7 +78,7 @@ class Column(object):
self.instance == other.instance)
class DataTable(object):
class DataTable:
"""The ``DataTable`` is a simple container class.
It keeps references to the variable and processing tables, and the data
......@@ -126,6 +125,32 @@ class DataTable(object):
Columns can be "flagged" with metadata labels via the :meth:`addFlag`
method. All of the flags on a column can be retrieved via the
:meth:`getFlags` method.

The :meth:`subtable` method can be used to generate a replica
``DataTable`` with a specific subset of columns. It is intended for
parallelisation, so that only the required data is copied over to tasks
running in other processes. The :meth:`subtable` and :meth:`merge` methods
are intended to be used like so:

1. Create subtables which only contain data for specific columns::

    cols      = ['1-0.0', '2-0.0', '3-0.0']
    subtables = [dtable.subtable([c]) for c in cols]

2. Use multiprocessing to perform parallel processing on each column::

    def mytask(dtable, col):
        dtable[:, col] += 5
        return dtable

    with dtable.pool() as pool:
        subtables = pool.starmap(mytask, zip(subtables, cols))

3. Merge the results back into the main table::

    for subtable in subtables:
        dtable.merge(subtable)
"""
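Review note: the subtable/process/merge pattern described in the docstring can be tried out without FUNPACK. The sketch below is only an illustration under stated assumptions: a plain dict of column name to list of values stands in for the ``DataTable``, the helpers ``subtable``, ``merge`` and ``process_in_parallel`` are hypothetical and are not the methods added in this commit, and ``multiprocessing.pool.ThreadPool`` is swapped in for a process pool so the example runs anywhere without pickling concerns.

```python
from multiprocessing.pool import ThreadPool


def subtable(table, cols):
    """Hypothetical helper: copy only the requested columns."""
    return {c: list(table[c]) for c in cols}


def merge(table, sub):
    """Hypothetical helper: copy the columns of ``sub`` back into ``table``."""
    for name, values in sub.items():
        table[name] = list(values)


def mytask(sub, col):
    """Per-column task, mirroring the ``+= 5`` example in the docstring."""
    sub[col] = [v + 5 for v in sub[col]]
    return sub


def process_in_parallel(table, cols, nworkers=2):
    # 1. One small subtable per column, so each task only sees its own data.
    subtables = [subtable(table, [c]) for c in cols]

    # 2. Run the task on every (subtable, column) pair.
    #    ThreadPool exposes the same starmap API as multiprocessing.Pool.
    with ThreadPool(nworkers) as pool:
        results = pool.starmap(mytask, zip(subtables, cols))

    # 3. Merge the modified columns back into the main table.
    for sub in results:
        merge(table, sub)
    return table


table = {'1-0.0': [1, 2], '2-0.0': [3, 4], '3-0.0': [5, 6]}
process_in_parallel(table, ['1-0.0', '2-0.0'])
```

Columns that were not handed to a task (``'3-0.0'`` here) are left untouched, which is the point of copying only the required data to each worker.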
......@@ -138,9 +163,7 @@ class DataTable(object):
pool=None):
"""Create a ``DataTable``.
:arg data: ``pandas.DataFrame``, or
:class:`.storage.HDF5StoreCollection`, containing the
data.
:arg data: ``pandas.DataFrame`` containing the data.
:arg columns: List of :class:`.Column` objects, representing the
columns that are in the data.
:arg vartable: ``pandas.DataFrame`` containing the variable table.
......@@ -149,9 +172,6 @@ class DataTable(object):
:arg pool: ``multiprocessing.Pool`` for parallelising tasks.
"""
isstore = isinstance(data, storage.HDFStoreCollection)
self.__isstore = isstore
self.__data = data
self.__vartable = vartable
self.__proctable = proctable
......@@ -161,9 +181,13 @@ class DataTable(object):
# The varmap is a dictionary of
# { vid : [Column] } mappings,
# and the colmap is a dictionary
# of { name : Column } mappings
self.__varmap = collections.OrderedDict()
self.__colmap = collections.OrderedDict()
for col in columns:
    self.__colmap[col.name] = col
    if col.vid in self.__varmap: self.__varmap[col.vid].append(col)
    else:                        self.__varmap[col.vid] = [col]
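Review note: the two lookup structures built in this hunk can be sketched in isolation. This is a minimal illustration, assuming a ``namedtuple`` stand-in for :class:`.Column` with only ``name``, ``vid`` and ``instance`` fields; the ``build_maps`` function is hypothetical and does not exist in FUNPACK.

```python
import collections

# Stand-in for the Column class; the real class carries more metadata.
Column = collections.namedtuple('Column', ['name', 'vid', 'instance'])


def build_maps(columns):
    """Build { vid : [Column] } and { name : Column } mappings.

    Both dictionaries preserve the order in which columns are given.
    """
    varmap = collections.OrderedDict()
    colmap = collections.OrderedDict()
    for col in columns:
        colmap[col.name] = col
        # Group every column of the same variable under its variable ID.
        varmap.setdefault(col.vid, []).append(col)
    return varmap, colmap


columns = [Column('1-0.0', 1, 0),
           Column('2-0.0', 2, 0),
           Column('1-0.1', 1, 1)]
varmap, colmap = build_maps(columns)
```

Here ``varmap[1]`` holds both columns of variable 1, while ``colmap`` gives direct access to a single column by name. Using ``setdefault`` is equivalent to the explicit ``if``/``else`` in the diff.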
......@@ -171,11 +195,11 @@ class DataTable(object):
def __getstate__(self):
"""Returns the state of this :class:`.DataTable` for pickling. """
return (self.__data,
self.__isstore,
self.__vartable,
self.__proctable,
self.__cattable,
self.__varmap,
self.__colmap,
self.__flags)
......@@ -183,11 +207,11 @@ class DataTable(object):
"""Set the state of this :class:`.DataTable` for unpickling. """