Commit 57754a97 authored by Paul McCarthy

Merge branch 'rf/categories' into 'master'

Rf/categories

See merge request !49
parents 7b32eb1a 063fe822
......@@ -2,6 +2,17 @@ FUNPACK changelog
=================
1.9.1 (Sunday 29th March 2020)
------------------------------
Changed
^^^^^^^
* Updates to FMRIB categories.
1.9.0 (Friday 28th February 2020)
---------------------------------
......
......@@ -6,7 +6,7 @@
#
__version__ = '1.9.0'
__version__ = '1.9.1'
"""The ``funpack`` versioning scheme roughly follows Semantic Versioning
conventions.
"""
......
......@@ -2,7 +2,7 @@ ID Category Variables
1 age, sex, brain MRI protocol, Phase 31,34,21022,22200,25780
2 genetics 21000,22000:22125,22201:22325,22182,22800:22823
3 early life factors 52,129,130,1677,1687,1697,1737,1767,1777,1787,21066,20022
10 lifestyle and environment - general 3:6,132,189,670,680,699,709,728,738,767,777,1031,1797,1807,1835,1845,1873,1883,2139,2149,2159,2237,2375,2385,2395,2405,2267,2277,2714:10:2834,2946,3526,3536,3546,3581,3591,3659,3669,3700,3710,3720,3829,3839,3849,3872,3882,3912,3942,3972,3982,4501,4674,4825,4836,5057,6138,6142,6139:6141,6145:6146,6160,10016,10105,10114,10721,10722,10740,10749,10860,10877,10886,20074:20075,20110:20113,20118:20119,20121,22501,22599,22606,22700,22702,22704,24003:24019,24024,24500:24508,26410:26434
10 lifestyle and environment - general 3:6,132,189,670,680,699,709,728,738,767,777,1031,1797,1807,1835,1845,1873,1883,2139,2149,2159,2237,2375,2385,2395,2405,2267,2277,2714:10:2834,2946,3526,3536,3546,3581,3591,3659,3669,3700,3710,3720,3829,3839,3849,3872,3882,3912,3942,3972,3982,4501,4674,4825,4836,5057,6138,6142,6139:6141,6145:6146,6160,10016,10105,10114,10721,10722,10740,10749,10860,10877,10886,20074:20075,20110:20113,20118:20119,20121,22501,22599,22606,22700,22702,22704,24003:24020,24024,24500:24508,26410:26434
11 lifestyle and environment - exercise and work 1001,1011,796,806,816,826,845,864,874,884,894,904,914,924,943,971,981,991,1021,1050:10:1220,2624,2634,3426,3637,3647,6143,6162,6164,10953,10962,10971,22604,22605,22607:22615,22620,22630,22631,22640:22655,104900,104910,104920
12 lifestyle and environment - food and drink 1289:10:1389,1408:10:1548,2654,3089,3680,6144,10007,10723,10767,10776,10855,10912,20084:20094,20098:20109,100001:100009,100011:100019,100021:100025,100010:10:100560,100760:10:104670
13 lifestyle and environment - alcohol 1558:10:1628,2664,3731,3859,4407,4418,4429,4440,4451,4462,5364,10818,20095:20097,20117,20403:20410,20414:20416,100580:10:100740
......@@ -11,15 +11,15 @@ ID Category Variables
21 physical measures - bone density and sizes 77,78,3083:3086,3143:3144,3146:3148,4092,4095,4100:4101,4103:4106,4119:4120,4122:4125,4138:4147,23200:23243,23290:23320
22 physical measures - cardiac & blood vessels 93:95,102,4079,4080,4136,4194:4196,4198:4200,4204:4205,4207,5983,5984,5986,5992,5993,6014:6017,6019,6020,6022,6024,6032:6034,6038,6039,12673:12687,12336,12338,12340,12697,12698,12702,21021,22330:22338,22420:22426,22670:22685
23 hearing test 4229:4230,4232:4237,4239:4247,4249,4268:4270,4272,4275:4277,4279,4849,10793,20019,20021,20060
24 eye test 5076:5079,5082:5091,5096:5119,5132:5136,5138:5149,5152,5155:5164,5181:5183,5186,5188,5190,5193,5198:5199,5201,5202,5204,5206,5208,5209,5211,5215,5221,5237,5251,5254:5259,5262:5267,5274,5276,5292,5306,5324:5328,6070:6075,20052,20055,20261:20262
24 eye test 5076:5079,5082:5091,5096:5119,5132:5136,5138:5149,5152,5155:5164,5181:5183,5186,5188,5190,5193,5198:5199,5201,5202,5204,5206,5208,5209,5211,5215,5221,5237,5251,5254:5259,5261:5267,5273,5274,5276,5292,5306,5324:5328,6070:6075,20052,20055,20261:20262
25 physical activity measures 5985,90002:90003,90010:90013,90015:90177,90179:90195
26 abdominal measures 22415:22416
30 blood assays 74,23000:23044,23049:23060,23062,23063,23065:23071,23073:23075,30000:10:30300,30104,30112,30114,30172,30174,30242,30252,30254,30314:10:30344,30364:10:30424,30500:10:30530,30600:10:30890
31 brain IDPs 25000:25746,25754:25759,25761:25768,25781:25920,26500:26508:26513,26517:26518,26520:26552,26554:26720,26722:26723,26725:26727,26732,26734:26740,26743,26746:26750,26752,26754:26757,26759:26761,26763,26766,26768:26769,26771,26773:26774,26777,26780,26781:26782,26784,26786,26788:26790,26792:26796,26799,26801:26807,26810,26813:26819,26821:26824,26826:26827,26833,26835:26837,26839:26841,26844,26847:26851,26853:26857,26860:26862,26864,26867,26869:26870,26873:26875,26878,26881:26883,26885,26887,26889:26891,26893:26895,26897,26900,26902:26908,26911,26914:27772
32 cognitive phenotypes 62,111,396:404,630,4250:4256,4258:4260,4281:4283,4285,4287,4290:4292,4294,4924,4935,4957,4968,4979,4990,5001,5012,5556,5699,5779,5790,5866,6312,6332,6333,6348:6351,6362,6373,6374,6382,6383,6671,6770:6773,10133:10134,10136:10144,10146:10147,10241,10609:10610,10612,20016,20018,20023,20082,20128:20157,20159,20165,20167,20169:2:20195,20196:2:20200,20229,20230,20240,20242,20244:20248,23321:23324
50 health and medical history, health outcomes 84,87,92,134:137,2178,2188,2207,2217,2227,2247,2257,2296,2316,2335:10:2365,2415,2443:10:2473,2492,2674,2684,2694,2704,2844,2956:10:2986,3005,3079,3140,3393,3404,3414,3571,3606,3616,3627,3741,3751,3761,3773,3786,3799,3809,3894,3992,4012,4022,4041,4056,4067,4689,4700,4717,4728,4792,4803,4814,5408,5419,5430,5441,5452,5463,5474,5485,5496,5507,5518,5529,5540,5610,5832,5843,5855,5877,5890,5901,5912,5923,5934,5945,6119,6147,6148,6149,6150,6151,6152,6153,6154,6155,6159,6177,6179,6205,10004:10006,10854,20001:20011,20199,21024:21045,21047:21061,21064:21065,21067,21068,21070:21076,22126:22181,22502:22505,22616,22618,22619,40001:41253,41256,41258,41266,41267,41269,41271,41273,41275,41276,41277,41278,41284,41285,41286,42000:42013
31 brain IDPs 25000:25746,25754:25759,25761:25768,25781:25920,26500:26514,26517:26518,26520:27772
32 cognitive phenotypes 62,111,396:404,630,4250:4256,4258:4260,4281:4283,4285,4287,4290:4292,4294,4924,4935,4957,4968,4979,4990,5001,5012,5556,5699,5779,5790,5866,6312,6332,6333,6348:6351,6362,6373,6374,6382,6383,6671,6770:6773,10133:10134,10136:10144,10146:10147,10241,10609:10610,10612,20016,20018,20023,20082,20128:20157,20159,20165,20167,20169:2:20195,20196:2:20200,20229,20230,20240,20242,20244:20248,21004,23321:23324
50 health and medical history, health outcomes 84,87,92,134:137,2178,2188,2207,2217,2227,2247,2257,2296,2316,2335:10:2365,2415,2443:10:2473,2492,2674,2684,2694,2704,2844,2956:10:2986,3005,3079,3140,3393,3404,3414,3571,3606,3616,3627,3741,3751,3761,3773,3786,3799,3809,3894,3992,4012,4022,4041,4056,4067,4689,4700,4717,4728,4792,4803,4814,5408,5419,5430,5441,5452,5463,5474,5485,5496,5507,5518,5529,5540,5610,5832,5843,5855,5877,5890,5901,5912,5923,5934,5945,6119,6147,6148,6149,6150,6151,6152,6153,6154,6155,6159,6177,6179,6205,10004:10006,10854,20001:20011,20199,21024:21045,21047:21061,21064:21065,21067,21068,21070:21076,22126:22181,22502:22505,22616,22618,22619,40001:41253,41256,41258,41266,41267,41269:41273,41275:41278,41284:41286,42000:42013
51 mental health self-report 1920:10:2110,4526,4537,4548,4559,4570,4581,4598,4609,4620,4631,4642,4653,5375,5386,5663,5674,6156,20122,20126:20127,20401,20411,20417:20423,20425:20429,20431:20442,20445:20450,20453:20460,20463,20465:20467,20470:20471,20473,20476,20477,20479:20484,20485:20502,20505:20544,20546:20551,20553:20554,21062:21063
60 health dates 41257,41260,41262,41263,41268,41280:41283,42014,42016,130004,130008,130014:2:130020,130062,130064,130070,130082,130106,130134,130174:2:130178,130184:2:130190,130194,130202,130216,130218,130224:2:130230,130264,130310,130320,130336,130338,130342,130344,130622,130624,130648,130656:2:130660,130664,130670,130686,130696:2:130708,130714,130718,130722,130726,130734,130736,130770,130774,130792,130814,130818,130820,130826,130828,130832,130854,130868,130892:2:130898,130902:2:130910,130914,130918,130922,130924,130998,131000,131022,131030,131032,131042,131046,131048,131052:2:131056,131060:2:131064,131070:2:131076,131086,131102,131114,131124,131128:2:131132,131136,131138,131142,131144,131148,131150,131154,131158,131164,131166,131178:2:131186,131190,131192,131198,131204,131208:2:131212,131216,131222,131224,131228,131230,131234,131236,131242,131252,131256:2:131264,131270,131282,131286,131296,131298,131304:2:131308,131314,131316,131322,131324,131338,131342,131344,131348:2:131356,131360,131366:2:131370,131374,131382,131386,131390,131392,131396,131402,131404,131408,131410,131414,131416,131424:2:131432,131436,131442,131456,131458,131462:2:131476,131480:2:131484,131490:2:131494,131498,131528,131534,131538,131540,131546,131548,131554,131556,131560:2:131586,131590:2:131594,131598:2:131604,131608,131612:2:131620,131624:2:131654,131666:2:131670,131674:2:131684,131688,131692,131698:2:131708,131720,131722,131726,131730,131734:2:131742,131746,131748,131754,131760,131768,131774,131778,131782,131790:2:131798,131802:2:131806,131810,131812,131822:131826,131830,131836,131850,131852,131858,131864,131868:2:131888,131892,131900,131906,131910:2:131914,131916,131918,131922:2:131928,131934,131938:2:131942,131946:2:131950,131954:2:131964,131970:2:131974,131980,131988:2:131992,132002,132008,132016,132020,132022,132030:2:132038,132042,132050,132054:2:132058,132062:2:132066,132070:2:132078,132082:2:132088,132092,132096:2:132106,132110,132112,132116,132118,132122,132124,132128:2:132152,132156,
132160,132162,132166:2:132170,132186,132192,132194,132202,132206,132212,132216,132220,132224,132230,132238:2:132244,132250,132252,132260:2:132264,132268,132274:2:132280,132298,132522,132532,132542,132562,132574,132312
70 health sources 42015,42017,130005,130009,130015:2:130019,130063,130065,130071,130083,130107,130135,130175:130179,130185:2:130191,130195,130203,130217,130219,130225,130231,130265,130311,130337,130343,130345,130623,130625,130649,130657:2:130661,130665,130671,130687,130697:2:130709,130715,130719,130723,130727,130735,130737,130771,130775,130793,130815,130819,130821,130827,130829,130833,130855,130869,130893:2:130899,130903:2:130911,130915,130919,130923,130925,130999,131001,131023,131031,131033,131043,131047,131049,131053,131055,131057,131061,131063,131065,131071:2:131077,131087,131103,131115,131125,131129,131131,131133,131137,131139,131145,131149,131151,131155,131159,131165,131167,131179:2:131187,131191,131193,131199,131205,131209,131211,131213,131217,131223,131225,131229,131231,131237,131243,131253,131257:2:131265,131271,131283,131287,131297,131299,131305,131307,131309,131315,131317,131323,131325,131339,131343,131345,131349:131357,131361,131367:131371,131375,131383,131387,131391,131393,131397,131403,131409,131411,131415,131417,131425:2:131433,131437,131443,131457,131459,131463:2:131477,131481,131483,131485,131491:2:131495,131499,131529,131535,131539,131541,131547,131549,131555,131557,131561,131563,131565:2:131587,131591,131593,131595,131599:2:131605,131609,131613:2:131621,131625:2:131655,131667:2:131671,131675:2:131685,131689,131693,131701:2:131709,131727,131731,131735:2:131743,131747,131749,131755,131761,131769,131775,131779,131783,131793,131795,131797,131803,131805,131807,131811,131813,131823,131825,131827,131831,131837,131851,131859,131865,131869:2:131889,131893,131901,131907,131911,131913:2:131919,131923:2:131929,131935,131939,131941,131943,131947,131949,131951,131955:2:131965,131971,131973,131975,131981,131989,131991,131993,132003,132009,132017,132021,132023,132031:2:132039,132043,132051,132055,132057,132059,132063:2:132067,132071:2:132079,132083:2:132089,132093,132097:2:132107,132111,132113,132117,132119,132123,132125,132129:2:132153,132157,132161,132163,132167
,132169,132171,132187,132193,132195,132203,132207,132213,132217,132221,132225,132245,132265,132269,132275:2:132281,132299,132523,132533,132543,132563,132575,132313
60 health dates 41257,41260,41262,41263,41268,41280:41283,42014,42016,130004,130008,130014:2:130020,130062,130064,130070,130082,130106,130134,130174:2:130178,130184:2:130190,130194,130202,130216,130218,130224:2:130230,130264,130310,130320,130336,130338,130342,130344,130622,130624,130648,130656:2:130660,130664,130670,130686,130696:2:130708,130714,130718,130722,130726,130734,130736,130770,130774,130792,130814,130818,130820,130826,130828,130832,130854,130868,130892:2:130898,130902:2:130910,130914,130918,130922,130924,130998,131000,131022,131030,131032,131042,131046,131048,131052:2:131056,131060:2:131064,131070:2:131076,131086,131102,131114,131118,131124,131128:2:131132,131136,131138,131142,131144,131148,131150,131154,131158,131164,131166,131178:2:131186,131190,131192,131196,131198,131204,131208:2:131212,131216,131222,131224,131228,131230,131234,131236,131242,131252,131256:2:131264,131270,131282,131286,131296,131298,131304:2:131308,131314,131316,131322,131324,131338,131342,131344,131348:2:131356,131360,131366:2:131370,131374,131382,131386,131390,131392,131396,131402,131404,131408,131410,131414,131416,131424:2:131432,131436,131442,131456,131458,131462:2:131476,131480:2:131484,131490:2:131494,131498,131528,131534,131538,131540,131546,131548,131554,131556,131560:2:131586,131590:2:131594,131598:2:131604,131608,131612:2:131620,131624:2:131654,131666:2:131670,131674:2:131684,131688,131692,131698:2:131708,131720,131722,131726,131730,131734:2:131742,131746,131748,131754,131760,131768,131774,131778,131782,131790:2:131798,131802:2:131806,131810,131812,131822:131826,131830,131836,131850,131852,131858,131864,131868:2:131888,131892,131894,131900,131906,131910:2:131914,131916,131918,131922:2:131928,131934,131938:2:131942,131946:2:131950,131954:2:131964,131970:2:131974,131980,131988:2:131992,132002,132008,132016,132020,132022,132030:2:132038,132042,132050,132054:2:132058,132062:2:132066,132070:2:132078,132082:2:132088,132092,132096:2:132106,132110,132112,132116,132118,132122,132124,13
2128:2:132152,132156,132160:2:132170,132186,132192,132194,132202,132206,132212,132216,132220,132224,132230,132238:2:132244,132250,132252,132260:2:132264,132268,132274:2:132280,132298,132522,132532,132542,132562,132574,132312
70 health sources 42015,42017,130005,130009,130015:2:130019,130063,130065,130071,130083,130107,130135,130175:130179,130185:2:130191,130195,130203,130217,130219,130225,130231,130265,130311,130337,130343,130345,130623,130625,130649,130657:2:130661,130665,130671,130687,130697:2:130709,130715,130719,130723,130727,130735,130737,130771,130775,130793,130815,130819,130821,130827,130829,130833,130855,130869,130893:2:130899,130903:2:130911,130915,130919,130923,130925,130999,131001,131023,131031,131033,131043,131047,131049,131053,131055,131057,131061,131063,131065,131071:2:131077,131087,131103,131115,131119,131125,131129,131131,131133,131137,131139,131145,131149,131151,131155,131159,131165,131167,131179:2:131187,131191,131193,131197,131199,131205,131209,131211,131213,131217,131223,131225,131229,131231,131237,131243,131253,131257:2:131265,131271,131283,131287,131297,131299,131305,131307,131309,131315,131317,131323,131325,131339,131343,131345,131349:131357,131361,131367:131371,131375,131383,131387,131391,131393,131397,131403,131409,131411,131415,131417,131425:2:131433,131437,131443,131457,131459,131463:2:131477,131481,131483,131485,131491:2:131495,131499,131529,131535,131539,131541,131547,131549,131555,131557,131561,131563,131565:2:131587,131591,131593,131595,131599:2:131605,131609,131613:2:131621,131625:2:131655,131667:2:131671,131675:2:131685,131689,131693,131701:2:131709,131727,131731,131735:2:131743,131747,131749,131755,131761,131769,131775,131779,131783,131793,131795,131797,131803,131805,131807,131811,131813,131823,131825,131827,131831,131837,131851,131859,131865,131869:2:131889,131893,131895,131901,131907,131911,131913:2:131919,131923:2:131929,131935,131939,131941,131943,131947,131949,131951,131955:2:131965,131971,131973,131975,131981,131989,131991,131993,132003,132009,132017,132021,132023,132031:2:132039,132043,132051,132055,132057,132059,132063:2:132067,132071:2:132079,132083:2:132089,132093,132097:2:132107,132111,132113,132117,132119,132123,132125,132129:2:132153,132157
,132161:2:132171,132187,132193,132195,132203,132207,132213,132217,132221,132225,132245,132265,132269,132275:2:132281,132299,132523,132533,132543,132563,132575,132313
98 pending 41259,41261,41264,42038:42040
99 miscellaneous 19,21,35:45,53:55,68,96,120,200,393,757,1647,2129,3060,3061,3066,3077,3081:3082,3090,3137,3166,4081,4093,4096,4206,4238,4248,4257,4286,4288:4289,4293,4295,5074,5075,5080,5081,5214,5253,5270,5987:5988,5991,6023,6025,6334,10145,10697,12139:12141,12148,12187,12188,12223,12224,12253,12254,12291,12323,12623,12624,12651:12654,12658,12663,12664,12671,12688,12695,12699,12700,12704,12706,12848,12851,12854,20012:20014,20024:20025,20031:20032,20035,20041:20054,20058:20059,20061:20062,20072,20077:20081,20083,20114:20115,20158,20201:20227,20249:20253,20400,21003,21023,21069,21611,21621,21622,21625,21631,21634,21642,21651,21661:21666,21671,21711,21721:21723,21725,21731:21734,21736,21738,21741,21742,21751,21761:21766,21771,21811,21821:21823,21825,21831:21834,21836,21838,21841:21842,21851,21861:21866,21871,22499,22500,22600:22603,22617,22660:22664,23048,23160,25747:25753,30001:10:30301,30002:10:30302,30003:10:30303,30004:10:30304,30354,30502:10:30522,30532,30601:10:30891,30615,30622,30635,30645,30665,30666,30692,30725,30755,30775,30795,30796,30805,30806,30825,30826,30835,30845,30855,30856,30875,30885,30895,30897,40000,105010,105030,110005,110006,110008
99 miscellaneous 19,21,35:45,53:55,68,96,120,200,393,757,1647,2129,3060,3061,3066,3077,3081:3082,3090,3132,3137,3166,4081,4093,4096,4206,4238,4248,4257,4286,4288:4289,4293,4295,5074,5075,5080,5081,5214,5253,5270,5987:5988,5991,6023,6025,6334,10145,10697,12139:12141,12148,12187,12188,12223,12224,12253,12254,12291,12323,12623,12624,12651:12654,12658,12663,12664,12671,12688,12695,12699,12700,12704,12706,12848,12851,12854,20012:20014,20024:20025,20031:20032,20035,20041:20054,20058:20059,20061:20062,20072,20077:20081,20083,20114:20115,20158,20201:20227,20249:20254,20259,20260,20263,20400,21003,21011:21018,21023,21069,21611,21621,21622,21625,21631,21634,21642,21651,21661:21666,21671,21711,21721:21723,21725,21731:21734,21736,21738,21741,21742,21751,21761:21766,21771,21811,21821:21823,21825,21831:21834,21836,21838,21841:21842,21851,21861:21866,21871,22499,22500,22600:22603,22617,22660:22664,23048,23160:23164,25747:25753,30001:10:30301,30002:10:30302,30003:10:30303,30004:10:30304,30354,30502:10:30522,30532,30601:10:30891,30615,30622,30635,30645,30665,30666,30692,30725,30755,30775,30795,30796,30805,30806,30825,30826,30835,30845,30855,30856,30875,30885,30895,30897,40000,90001,90004,105010,105030,110005,110006,110008
%% Cell type:markdown id: tags:
![image.png](attachment:image.png)
# `funpack`
> Paul McCarthy <paul.mccarthy@ndcn.ox.ac.uk> ([WIN@FMRIB](https://www.win.ox.ac.uk/))
`funpack` is a command-line program which you can use to extract data from UK BioBank (and other tabular) data sets.
You can give `funpack` one or more input files (e.g. `.csv`, `.tsv`), and it will merge them together, perform some preprocessing, and produce a single output file.
A large number of rules are built into `funpack` which are specific to the UK BioBank data set. But you can control and customise everything that `funpack` does to your data, including which rows and columns to extract, and which cleaning/processing steps to perform on each column.
The `funpack` source code is available at https://git.fmrib.ox.ac.uk/fsl/funpack. You can install `funpack` into a Python environment using `pip`:
    pip install fmrib-unpack
Get command-line help by typing:
    funpack -h
*The examples in this notebook assume that you have installed `funpack` 1.9.0 or newer.*
*The examples in this notebook assume that you have installed `funpack` 1.9.1 or newer.*
%% Cell type:code id: tags:
``` bash
funpack -V
```
%% Cell type:markdown id: tags:
### Contents
1. [Overview](#Overview)
1. [Import](#1.-Import)
2. [Cleaning](#2.-Cleaning)
3. [Processing](#3.-Processing)
4. [Export](#4.-Export)
2. [Examples](#Examples)
3. [Import examples](#Import-examples)
1. [Selecting variables (columns)](#Selecting-variables-(columns))
1. [Selecting individual variables](#Selecting-individual-variables)
2. [Selecting variable ranges](#Selecting-variable-ranges)
3. [Selecting variables with a file](#Selecting-variables-with-a-file)
4. [Selecting variables from pre-defined categories](#Selecting-variables-from-pre-defined-categories)
2. [Selecting subjects (rows)](#Selecting-subjects-(rows))
1. [Selecting individual subjects](#Selecting-individual-subjects)
2. [Selecting subject ranges](#Selecting-subject-ranges)
3. [Selecting subjects from a file](#Selecting-subjects-from-a-file)
4. [Selecting subjects by variable value](#Selecting-subjects-by-variable-value)
5. [Excluding subjects](#Excluding-subjects)
3. [Selecting visits](#Selecting-visits)
1. [Evaluating expressions across visits](#Evaluating-expressions-across-visits)
4. [Merging multiple input files](#Merging-multiple-input-files)
1. [Merging by subject](#Merging-by-subject)
2. [Merging by column](#Merging-by-column)
3. [Naive merging](#Naive-merging)
4. [Cleaning examples](#Cleaning-examples)
1. [NA insertion](#NA-insertion)
2. [Variable-specific cleaning functions](#Variable-specific-cleaning-functions)
3. [Categorical recoding](#Categorical-recoding)
4. [Child value replacement](#Child-value-replacement)
5. [Processing examples](#Processing-examples)
1. [Sparsity check](#Sparsity-check)
2. [Redundancy check](#Redundancy-check)
3. [Categorical binarisation](#Categorical-binarisation)
6. [Custom cleaning, processing and loading - funpack plugins](#Custom-cleaning,-processing-and-loading---funpack-plugins)
1. [Custom cleaning functions](#Custom-cleaning-functions)
2. [Custom processing functions](#Custom-processing-functions)
3. [Custom file loaders](#Custom-file-loaders)
7. [Miscellaneous topics](#Miscellaneous-topics)
1. [Non-numeric data](#Non-numeric-data)
2. [Dry run](#Dry-run)
3. [Built-in rules](#Built-in-rules)
4. [Using a configuration file](#Using-a-configuration-file)
5. [Reporting unknown variables](#Reporting-unknown-variables)
6. [Low-memory mode](#Low-memory-mode)
%% Cell type:markdown id: tags:
# Overview
`funpack` performs the following steps:
## 1. Import
All data files are loaded in, unwanted columns and subjects are dropped, and the data files are merged into a single table (a.k.a. data frame). Multiple files can be merged according to an index column (e.g. subject ID). Alternatively, if the input files contain the same columns or the same subjects, they can be naively concatenated along rows or columns.
## 2. Cleaning
The following cleaning steps are applied to each column:
1. **NA value replacement:** Specific values for some columns are replaced with NA, for example, variables where a value of `-1` indicates *Do not know*.
2. **Variable-specific cleaning functions:** Certain columns are re-formatted - for example, the [ICD10](https://en.wikipedia.org/wiki/ICD-10) disease codes can be converted to integer representations.
3. **Categorical recoding:** Certain categorical columns are re-coded.
4. **Child value replacement:** NA values within some columns which are dependent upon other columns may have values inserted based on the values of their parent columns.
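As a rough illustration of the first cleaning step, the following Python sketch replaces a "Do not know" code with a missing value. The function name `insert_na` and the rule set `NA_VALUES` are hypothetical, and `None` merely stands in for NA; this is not how `funpack` itself is implemented.

```python
# Hypothetical rule: for this column, -1 means "Do not know"
NA_VALUES = {-1}


def insert_na(column, na_values):
    """Replace any value listed in na_values with None (standing in for NA)."""
    return [None if v in na_values else v for v in column]


print(insert_na([3, -1, 5, -1], NA_VALUES))   # [3, None, 5, None]
```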
## 3. Processing
During the processing stage, columns may be removed, merged, or expanded into additional columns. For example, a categorical column may be expanded into a set of binary columns, one for each category.
A column may also be removed on the basis of being too sparse, or being redundant with respect to another column.
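The expansion of a categorical column into binary columns can be pictured with a small Python sketch. The function `binarise` is hypothetical and only illustrates the idea under simple assumptions (no missing values, one output column per distinct value); it does not reflect the options that `funpack` provides.

```python
def binarise(column):
    """Expand a categorical column into one 0/1 column per category."""
    categories = sorted(set(column))
    return {cat: [1 if v == cat else 0 for v in column]
            for cat in categories}


print(binarise([2, 1, 2, 3]))
# {1: [0, 1, 0, 0], 2: [1, 0, 1, 0], 3: [0, 0, 0, 1]}
```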
## 4. Export
The processed data can be saved as a `.csv`, `.tsv`, or `.hdf5` file.
%% Cell type:markdown id: tags:
# Examples
Throughout these examples, we are going to use a few command line options, which you will probably **not** normally want to use:
- `-ow` (short for `--overwrite`): This tells `funpack` not to complain if the output file already exists.
- `-q` (short for `--quiet`): This tells `funpack` to be quiet.
Without the `-q` option, `funpack` can be quite verbose, which can be annoying, but is very useful when things go wrong. A good strategy is to tell `funpack` to produce verbose output using the `--noisy` (`-n` for short) option, and to send all of its output to a log file with the `--log_file` (or `-lf`) option. For example:
    funpack -n -n -n -lf log.txt out.tsv in.tsv
Here's the first example input data set, with UK BioBank-style column names:
%% Cell type:code id: tags:
``` bash
cat data_01.tsv
```
%% Cell type:markdown id: tags:
The numbers in each column name typically represent:
1. The variable ID
2. The visit, for variables which were collected at multiple points in time.
3. The "instance", for multi-valued variables.
Note that one **variable** is typically associated with several **columns**, although we're keeping things simple for this first example - there is only one visit for each variable, and there are no multi-valued variables.
> _Most but not all_ variables in the UK BioBank contain data collected at different visits, i.e. the times at which participants visited a UK BioBank assessment centre. However, there are some variables (e.g. [ICD10 diagnosis codes](https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41202)) for which this is not the case.
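To make the naming scheme concrete, here is a minimal Python sketch that splits a UK BioBank-style column name into its three parts. It assumes the `variable-visit.instance` layout used in these examples (e.g. `4-0.0`); the name `parse_column` is hypothetical and is not part of `funpack`.

```python
import re

# Assumed layout: <variable>-<visit>.<instance>, e.g. "4-0.0"
COLUMN_PATTERN = re.compile(r'^(\d+)-(\d+)\.(\d+)$')


def parse_column(name):
    """Return (variable, visit, instance) for a UK BioBank-style column
    name, or None if the name does not follow that layout.
    """
    match = COLUMN_PATTERN.match(name)
    if match is None:
        return None
    variable, visit, instance = (int(g) for g in match.groups())
    return variable, visit, instance


print(parse_column('4-0.0'))   # (4, 0, 0)
print(parse_column('eid'))     # None
```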
%% Cell type:markdown id: tags:
# Import examples
## Selecting variables (columns)
You can specify which variables you want to load in the following ways, using the `--variable` (`-v` for short), `--category` (`-c` for short) and `--column` (`-co` for short) command line options:
* By variable ID
* By variable ranges
* By a text file which contains the IDs you want to keep
* By pre-defined variable categories
* By column name
### Selecting individual variables
Simply provide the IDs of the variables you want to extract:
%% Cell type:code id: tags:
``` bash
funpack -q -ow -v 1 -v 5 out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
### Selecting variable ranges
The `--variable`/`-v` option accepts MATLAB-style ranges of the form `start:step:stop` (where the `stop` is inclusive):
%% Cell type:code id: tags:
``` bash
funpack -q -ow -v 1:3:10 out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
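The range syntax described above can be sketched in Python as follows. This is an illustration of the semantics only (two-part `start:stop` and three-part `start:step:stop` forms, with an inclusive stop); the function `expand_range` is hypothetical and is not the parser that `funpack` uses.

```python
def expand_range(spec):
    """Expand 'start:stop' or 'start:step:stop' into a list of integers.
    The stop value is inclusive, as with MATLAB-style ranges.
    """
    parts = [int(p) for p in spec.split(':')]
    if len(parts) == 1:
        return parts
    if len(parts) == 2:
        start, stop = parts
        step = 1
    elif len(parts) == 3:
        start, step, stop = parts
    else:
        raise ValueError(f'Invalid range: {spec}')
    return list(range(start, stop + 1, step))


print(expand_range('1:3:10'))   # [1, 4, 7, 10]
print(expand_range('1:5'))      # [1, 2, 3, 4, 5]
```

Under these semantics, `-v 1:3:10` refers to variables 1, 4, 7 and 10.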
### Selecting variables with a file
If your variables of interest are listed in a plain-text file, you can simply pass that file:
%% Cell type:code id: tags:
``` bash
echo -e "1\n6\n9" > vars.txt
funpack -q -ow -v vars.txt out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
### Selecting variables from pre-defined categories
Some UK BioBank-specific categories are baked into `funpack`, but you can also define your own categories - you just need to create a `.tsv` file, and pass it to `funpack` via the `--category_file` (`-cf` for short) option:
%% Cell type:code id: tags:
``` bash
echo -e "ID\tCategory\tVariables" > custom_categories.tsv
echo -e "1\tCool variables\t1:5,7" >> custom_categories.tsv
echo -e "2\tUncool variables\t6,8:10" >> custom_categories.tsv
cat custom_categories.tsv
```
%% Cell type:markdown id: tags:
Use the `--category` (`-c` for short) option to select categories to output. You can refer to categories by their ID:
%% Cell type:code id: tags:
``` bash
funpack -q -ow -cf custom_categories.tsv -c 1 out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
Or by name:
%% Cell type:code id: tags:
``` bash
funpack -q -ow -cf custom_categories.tsv -c uncool out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
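For readers who want to inspect a category file outside of `funpack`, the following Python sketch parses the three-column layout shown above and looks a category up by ID or by name. The helper names are hypothetical, and the name-matching rule (a case-insensitive fragment of the category name) is an assumption inferred from the `-c uncool` example, not a statement of how `funpack` matches names.

```python
import csv
import io

# Same content as the custom_categories.tsv file created above
CATEGORY_TSV = (
    'ID\tCategory\tVariables\n'
    '1\tCool variables\t1:5,7\n'
    '2\tUncool variables\t6,8:10\n'
)


def load_categories(text):
    """Parse a category table into a list of (id, name, variables) tuples,
    leaving the variable specification as an unexpanded string.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter='\t')
    return [(int(row['ID']), row['Category'], row['Variables'])
            for row in reader]


def find_category(categories, key):
    """Look a category up by its ID, or by a case-insensitive fragment
    of its name (the name-matching rule is an assumption).
    """
    for cid, name, variables in categories:
        by_id = str(cid) == key
        by_name = (not key.isdigit()) and key.lower() in name.lower()
        if by_id or by_name:
            return cid, name, variables
    return None


categories = load_categories(CATEGORY_TSV)
print(find_category(categories, '1'))        # (1, 'Cool variables', '1:5,7')
print(find_category(categories, 'uncool'))   # (2, 'Uncool variables', '6,8:10')
```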
### Selecting column names
If you are working with data that has non-UK BioBank style column names, you can use the `--column` (`-co` for short) option to select individual columns by their name, rather than the variable with which they are associated. The `--column` option accepts full column names, and also shell-style wildcard patterns:
%% Cell type:code id: tags:
``` bash
funpack -q -ow -co 4-0.0 -co "??-0.0" out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
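Shell-style wildcard matching of the kind used above can be tried out with Python's standard `fnmatch` module, where `?` matches exactly one character. The column names below are hypothetical, and whether `funpack` uses `fnmatch` internally is an assumption; the sketch only shows which names the two patterns from the example would match.

```python
from fnmatch import fnmatchcase

# Hypothetical column names in the variable-visit.instance style
columns = ['1-0.0', '4-0.0', '9-0.0', '10-0.0']
patterns = ['4-0.0', '??-0.0']

# Keep a column if it matches at least one pattern
selected = [c for c in columns
            if any(fnmatchcase(c, p) for p in patterns)]
print(selected)   # ['4-0.0', '10-0.0']
```

The pattern `??-0.0` only matches names with a two-character variable ID, which is why `10-0.0` is selected but `1-0.0` and `9-0.0` are not.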
## Selecting subjects (rows)
`funpack` assumes that the first column in every input file is a subject ID. You can specify which subjects you want to load via the `--subject` (`-s` for short) option. You can specify subjects in the same way that you specified variables above, and also:
* By specifying a conditional expression on variable values - only subjects for which the expression evaluates to true will be imported
* By specifying subjects to exclude
### Selecting individual subjects
%% Cell type:code id: tags:
``` bash
funpack -q -ow -s 1 -s 3 -s 5 out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
### Selecting subject ranges
%% Cell type:code id: tags:
``` bash
funpack -q -ow -s 2:2:10 out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
### Selecting subjects from a file
%% Cell type:code id: tags:
``` bash
echo -e "5\n6\n7\n8\n9\n10" > subjects.txt
funpack -q -ow -s subjects.txt out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
### Selecting subjects by variable value
The `--subject` option accepts *variable expressions* - you can write an expression performing numerical comparisons on variables (denoted with a leading `v`) and combine these expressions using boolean algebra. Only subjects for which the expression evaluates to true will be imported. For example, to only import subjects where variable 1 is greater than 10, and variable 2 is less than 70, you can type:
%% Cell type:code id: tags:
``` bash
funpack -q -ow -sp -s "v1 > 10 && v2 < 70" out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
The following symbols can be used in variable expressions:
| Symbol | Meaning |
|---------------------------|---------------------------------|
| `==` | equal to |
| `!=` | not equal to |
| `>` | greater than |
| `>=` | greater than or equal to |
| `<` | less than |
| `<=` | less than or equal to |
| `na` | N/A |
| `&&` | logical and |
| <code>&#x7c;&#x7c;</code> | logical or |
| `~` | logical not |
| `contains` | Contains sub-string |
| `all` | all columns must meet condition |
| `any` | any column must meet condition |
| `()` | to denote precedence |
Non-numeric (i.e. string) variables can be used in these expressions in conjunction with the `==`, `!=`, and `contains` operators. An example of such an expression is given in the section on [non-numeric data](#Non-numeric-data), below.
The `all` and `any` symbols allow you to control how an expression is evaluated across multiple columns which are associated with one variable (e.g. separate columns for each visit). We will give an example of this in the section on [selecting visits](#Selecting-visits), below.
### Excluding subjects
The `--exclude` (`-ex` for short) option allows you to exclude subjects - it accepts individual IDs, an ID range, or a file containing IDs. The `--exclude`/`-ex` option takes precedence over the `--subject`/`-s` option:
%% Cell type:code id: tags:
``` bash
funpack -q -ow -s 1:8 -ex 5:10 out.tsv data_01.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
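The precedence of exclusion over selection in the `-s 1:8 -ex 5:10` example can be pictured as a set difference. This Python sketch is only an illustration of that rule, with inclusive range ends as described earlier.

```python
selected = set(range(1, 9))     # -s 1:8   (stop is inclusive)
excluded = set(range(5, 11))    # -ex 5:10

# Exclusion takes precedence, so excluded IDs are removed from the selection
kept = sorted(selected - excluded)
print(kept)   # [1, 2, 3, 4]
```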
## Selecting visits
%% Cell type:markdown id: tags:
Many variables in the UK BioBank data contain observations at multiple points in time, or visits. `funpack` allows you to specify which visits you are interested in. Here is an example data set with variables that have data for multiple visits (remember that the second number in the column names denotes the visit):
%% Cell type:code id: tags:
``` bash
cat data_02.tsv
```
%% Cell type:markdown id: tags:
We can use the `--visit` (`-vi` for short) option to get just the last visit for each variable:
%% Cell type:code id: tags:
``` bash
funpack -q -ow -vi last out.tsv data_02.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
You can also specify which visit you want by its number:
%% Cell type:code id: tags:
``` bash
funpack -q -ow -vi 1 out.tsv data_02.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
> Variables which are not associated with specific visits (e.g. [ICD10 diagnosis codes](https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=41202)) will not be affected by the `-vi` option.
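One way to picture what "the last visit for each variable" means is the Python sketch below. It assumes column names in the `variable-visit.instance` style and takes "last" to mean the highest visit number present for each variable; the helper names and column names are hypothetical, and this is not the `funpack` implementation.

```python
def split_column(name):
    """Split 'variable-visit.instance' into three integers."""
    variable, rest = name.split('-')
    visit, instance = rest.split('.')
    return int(variable), int(visit), int(instance)


def last_visit_columns(columns):
    """Keep, for each variable, only the columns from its highest visit."""
    last = {}
    for name in columns:
        variable, visit, _ = split_column(name)
        last[variable] = max(visit, last.get(variable, visit))
    return [name for name in columns
            if split_column(name)[1] == last[split_column(name)[0]]]


columns = ['1-0.0', '2-0.0', '2-1.0', '2-2.0', '3-0.0', '3-1.0']
print(last_visit_columns(columns))   # ['1-0.0', '2-2.0', '3-1.0']
```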
%% Cell type:markdown id: tags:
### Evaluating expressions across visits
The variable expressions described above in the section on [selecting subjects](#Selecting-subjects-by-variable-value) will be applied to all of the columns associated with a variable. By default, an expression will evaluate to true where the values in _any_ column associated with the variable evaluate to true. For example, we can extract the data for subjects where the values of any column of variable 2 were less than 50:
%% Cell type:code id: tags:
``` bash
funpack -q -ow -v 2 -s 'v2 < 50' out.tsv data_02.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
We can use the `any` and `all` operators to control how an expression is evaluated across the columns of a variable. For example, we may only be interested in subjects for whom all columns of variable 2 were less than 50:
%% Cell type:code id: tags:
``` bash
funpack -q -ow -v 2 -s 'all(v2 < 50)' out.tsv data_02.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
We can use `any` and `all` in expressions involving multiple variables:
%% Cell type:code id: tags:
``` bash
funpack -q -ow -v 2,3 -s 'any(v2 < 50) && all(v3 >= 40)' out.tsv data_02.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
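The difference between `any` and `all` can be sketched for a single subject as follows. This is an illustration only: it does not call `funpack`, it does not model missing values, and `any_all_below` is a hypothetical helper name.

```bash
# Hypothetical helper: given a threshold and the per-visit values of one
# variable for a single subject, report whether "any" and "all" of the
# values fall below the threshold.
any_all_below() {
    threshold="$1"; shift
    printf '%s\n' "$@" | awk -v t="$threshold" '
        $1 < t { below++ }
        END {
            printf "any=%s all=%s\n",
                (below > 0   ? "true" : "false"),
                (below == NR ? "true" : "false")
        }'
}

any_all_below 50 30 70 45   # one visit is 70, so "any" holds but "all" does not
any_all_below 50 30 20 45   # every visit is below 50, so both hold
```

A subject passes an `any` test as soon as one column satisfies the condition, whereas an `all` test requires every column to satisfy it.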
## Merging multiple input files
If your data is split across multiple files, you can specify how `funpack` should merge them together.
### Merging by subject
For example, let's say we have these two input files (shown side-by-side):
%% Cell type:code id: tags:
``` bash
echo " " | paste data_03.tsv - data_04.tsv
```
%% Cell type:markdown id: tags:
Note that each file contains different variables, and different, but overlapping, subjects. By default, when you pass these files to `funpack`, it will output the intersection of the two files (more formally known as an *inner join*), i.e. subjects which are present in both files:
%% Cell type:code id: tags:
``` bash
funpack -q -ow out.tsv data_03.tsv data_04.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
If you want to keep all subjects, you can instruct `funpack` to output the union (a.k.a. *outer join*) via the `--merge_strategy` (`-ms` for short) option:
%% Cell type:code id: tags:
``` bash
funpack -q -ow -ms outer out.tsv data_03.tsv data_04.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
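The inner/outer distinction is the same one made by the standard `join` utility. The sketch below uses two small made-up tables (not the demo data files) and writes `NA` for missing cells purely as a placeholder chosen for this example. It is an analogy, and not a description of how `funpack` performs the merge.

```bash
# Two small hypothetical tables keyed on subject ID (already sorted by ID).
tmp="$(mktemp -d)"
printf '1\ta1\n2\ta2\n3\ta3\n' > "$tmp/left.tsv"
printf '2\tb2\n3\tb3\n4\tb4\n' > "$tmp/right.tsv"
tab="$(printf '\t')"

# Inner join: only subjects present in both files (2 and 3).
inner="$(join -t "$tab" "$tmp/left.tsv" "$tmp/right.tsv")"

# Outer join: all subjects (1 to 4), with NA where a file has no data.
outer="$(join -t "$tab" -a 1 -a 2 -e NA -o 0,1.2,2.2 \
              "$tmp/left.tsv" "$tmp/right.tsv")"

printf 'inner:\n%s\nouter:\n%s\n' "$inner" "$outer"
rm -r "$tmp"
```

Subjects 1 and 4 each appear in only one input, so they are dropped by the inner join and padded with missing values by the outer join.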
### Merging by column
Your data may be organised in a different way. For example, these next two files contain different groups of subjects, but overlapping columns:
%% Cell type:code id: tags:
``` bash
echo " " | paste data_05.tsv - data_06.tsv
```
%% Cell type:markdown id: tags:
In this case, we need to tell `funpack` to merge along the row axis, rather than along the column axis. We can do this with the `--merge_axis` (`-ma` for short) option:
%% Cell type:code id: tags:
``` bash
funpack -q -ow -ma rows out.tsv data_05.tsv data_06.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
Again, if we want to retain all columns, we can tell `funpack` to perform an outer join with the `-ms` option:
%% Cell type:code id: tags:
``` bash
funpack -q -ow -ma rows -ms outer out.tsv data_05.tsv data_06.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
### Naive merging
Finally, your data may be organised such that you simply want to "paste", or concatenate, the files together, along either rows or columns. For example, your data files might look like this:
%% Cell type:code id: tags:
``` bash
echo " " | paste data_07.tsv - data_08.tsv
```
%% Cell type:markdown id: tags:
Here, we have columns for different variables on the same set of subjects, and we just need to concatenate them together horizontally. We do this by using `--merge_strategy naive` (`-ms naive` for short):
%% Cell type:code id: tags:
``` bash
funpack -q -ow -ms naive out.tsv data_07.tsv data_08.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
For files which need to be concatenated vertically, such as these:
%% Cell type:code id: tags:
``` bash
echo " " | paste data_09.tsv - data_10.tsv
```
%% Cell type:markdown id: tags:
We need to tell `funpack` which axis to concatenate along, again using the `-ma` option:
%% Cell type:code id: tags:
``` bash
funpack -q -ow -ms naive -ma rows out.tsv data_09.tsv data_10.tsv
cat out.tsv
```
%% Cell type:markdown id: tags:
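Naive merging is conceptually close to what `paste` and `cat` do: rows or columns are combined purely by position, with no matching on subject ID or column name. The sketch below shows both directions on made-up files. The handling of the ID column and the header row is an assumption of this sketch, not a statement about `funpack` internals.

```bash
tmp="$(mktemp -d)"
printf 'eid\t1-0.0\n1\t10\n2\t20\n' > "$tmp/a.tsv"   # subjects 1-2, variable 1
printf 'eid\t2-0.0\n1\t30\n2\t40\n' > "$tmp/b.tsv"   # subjects 1-2, variable 2
printf 'eid\t1-0.0\n3\t50\n4\t60\n' > "$tmp/c.tsv"   # subjects 3-4, variable 1

# Column-wise: rows are matched by position only, so the second file's
# ID column is dropped and its remaining columns are appended.
by_columns="$(cut -f2- "$tmp/b.tsv" | paste "$tmp/a.tsv" -)"

# Row-wise: the rows of the second file are appended below the first,
# skipping its repeated header line.
by_rows="$(cat "$tmp/a.tsv"; tail -n +2 "$tmp/c.tsv")"

printf 'by columns:\n%s\nby rows:\n%s\n' "$by_columns" "$by_rows"
rm -r "$tmp"
```

Because nothing is matched, this kind of concatenation is only safe when the files already agree on row order (for columns) or on column layout (for rows).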
# Cleaning examples
Once the data has been imported, a sequence of cleaning steps is applied to each column.
## NA insertion
For some variables it may make sense to discard or ignore certain values. For example, if an individual selects *"Do not know"* to a question such as *"How much milk did you drink yesterday?"*, that answer will be coded with a specific value (e.g. `-1`). It does not make any sense to include these values in most analyses, so `funpack` can be used to mark such values as *Not Available (NA)*.
A large number of NA insertion rules, specific to UK BioBank variables, are coded into `funpack`, and are applied when you use the `-cfg fmrib` option (see the section below on [built-in rules](#Built-in-rules)). You can also specify your own rules via the `--na_values` (`-nv` for short) option.
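The effect of an NA insertion rule can be sketched with `awk`: every cell in a given column whose value appears in a list is blanked out. This is only an illustration. `insert_na` is a hypothetical helper, values are compared as text (so `-1` would not match `-1.0`), and representing NA as an empty field is an assumption of this sketch.

```bash
# Hypothetical helper: blank out the given comma-separated values in one
# column of a tab-separated file read from stdin (the header row is kept).
#   $1: column number, $2: comma-separated values to treat as NA
insert_na() {
    awk -F'\t' -v OFS='\t' -v col="$1" -v vals="$2" '
        BEGIN { n = split(vals, v, ","); for (i = 1; i <= n; i++) na[v[i]] = 1 }
        NR > 1 && ($col in na) { $col = "" }
        { print }'
}

# Treat -1 and 0 in the second column as missing.
printf 'eid\t1-0.0\n1\t-1\n2\t7\n3\t0\n' | insert_na 2 '-1,0'
```

Subjects 1 and 3 end up with an empty cell, while the value 7 for subject 2 is left untouched.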
Let's say we have this data set:
%% Cell type:code id: tags:
``` bash
cat data_11.tsv
```
%% Cell type:markdown id: tags:
For variable 1, we want to ignore values of -1, for variable 2 we want to ignore -1 and 0, and for variable 3 we want to ignore 1 and 2:
%% Cell type:code id: tags: