diff --git a/advanced_topics/08_fslpy.md b/advanced_topics/08_fslpy.md
new file mode 100644
index 0000000000000000000000000000000000000000..024c8cdd61dcfda2d2161417463ea1e0a8c26ea2
--- /dev/null
+++ b/advanced_topics/08_fslpy.md
@@ -0,0 +1,247 @@
+# `fslpy`
+
+`fslpy` is a Python library which is built into FSL, and contains a range of
+functionality for working with neuroimaging data in an FSL context.
+
+This practical highlights some of the most useful features provided by
+`fslpy`. You may find `fslpy` useful if you are writing Python code to
+perform analyses and image processing in conjunction with FSL.
+
+
+> **Note**: `fslpy` is distinct from `fslpython` - `fslpython` is the Python
+> environment that is baked into FSL. `fslpy` is a Python library which is
+> installed into the `fslpython` environment.
+
+
+* [The `Image` class, and other data types](#the-image-class-and-other-data-types)
+* [FSL atlases](#fsl-atlases)
+* [The `filetree`](#the-filetree)
+* [NIfTI coordinate systems](#nifti-coordinate-systems)
+* [Image processing](#image-processing)
+* [FSL wrapper functions](#fsl-wrapper-functions)
+
+
+<a class="anchor" id="the-image-class-and-other-data-types"></a>
+## The `Image` class, and other data types
+
+
+The `fsl.data.image` module provides the `Image` class, which sits on top of
+`nibabel` and contains some handy functionality if you need to work with
+coordinate transformations, or do some FSL-specific processing. The `Image`
+class provides features such as:
+
+- Support for NIFTI1, NIFTI2, and ANALYZE image files
+- Access to affine transformations between the voxel, FSL and world coordinate
+  systems
+- Ability to load metadata from BIDS sidecar files
+
+Some simple image processing routines are also provided - these are covered
+[below](#image-processing).
+
+
+### Creating images
+
+It's easy to create an `Image` - you can create one from a file name:
+
+```
+import os.path as op
+import nibabel as nib
+import numpy as np
+from fsl.data.image import Image
+stddir = op.expandvars('${FSLDIR}/data/standard/')
+
+# load a FSL image - the file
+# suffix is optional, just like
+# in real FSL-land!
+img = Image(op.join(stddir, 'MNI152_T1_1mm'))
+```
+
+You can create an `Image` from an existing `nibabel` image:
+
+```
+# load a nibabel image, and
+# convert it into an FSL image
+nibimg = nib.load(op.join(stddir, 'MNI152_T1_1mm.nii.gz'))
+img = Image(nibimg)
+```
+
+Or you can create an `Image` from a `numpy` array:
+
+```
+data = np.zeros((100, 100, 100))
+img = Image(data, xform=np.eye(4))
+```
+
+
+
+
+
+<a class="anchor" id="fsl-atlases"></a>
+## FSL atlases
+
+<a class="anchor" id="the-filetree"></a>
+## The `filetree`
+
+<a class="anchor" id="nifti-coordinate-systems"></a>
+## NIfTI coordinate systems
+
+<a class="anchor" id="image-processing"></a>
+## Image processing
+
+<a class="anchor" id="fsl-wrapper-functions"></a>
+## FSL wrapper functions
+
+
+
+<a class="anchor" id="nifti-coordinate-systems"></a>
+## NIfTI coordinate systems
+
+
+
+The `getAffine` method gives you access to affine transformations which can be
+used to convert coordinates between the different coordinate systems
+associated with an image. Have some MNI coordinates you'd like to convert to
+voxels? Easy!
+
+```
+mnicoords = np.array([[0, 0, 0],
+                      [0, -18, 18]])
+
+world2vox = img.getAffine('world', 'voxel')
+vox2world = img.getAffine('voxel', 'world')
+
+# Apply the world->voxel
+# affine to the coordinates
+voxcoords = (np.dot(world2vox[:3, :3], mnicoords.T)).T + world2vox[:3, 3]
+
+# The code above is a bit fiddly, so
+# instead of figuring it out, you can
+# just use the transform() function:
+from fsl.transform.affine import transform
+voxcoords = transform(mnicoords, world2vox)
+
+# just to double check, let's transform
+# those voxel coordinates back into world
+# coordinates
+backtomni = transform(voxcoords, vox2world)
+
+for m, v, b in zip(mnicoords, voxcoords, backtomni):
+    print(m, '->', v, '->', b)
+```
+
+> The `Image.getAffine` method can give you transformation matrices
+> between any of these coordinate systems:
+>
+> - `'voxel'`: Image data voxel coordinates
+> - `'world'`: mm coordinates, defined by the sform/qform of an image
+> - `'fsl'`: The FSL coordinate system, used internally by many FSL tools
+>   (e.g. FLIRT)
+
+
+
+Oh, that example was too easy I hear you say? Try this one on for size. Let's
+say we have run FEAT on some task fMRI data, and want to get the MNI
+coordinates of the voxel with peak activation.
+
+> This is what people used to use `Featquery` for, back in the un-enlightened
+> days.
+
+
+Let's start by identifying the voxel with the largest absolute t-statistic:
+
+```
+featdir = op.join('05_nifti', 'fmri.feat')
+
+# The Image.data attribute returns a
+# numpy array containing, well, the
+# image data.
+tstat1 = Image(op.join(featdir, 'stats', 'tstat1')).data
+
+# Recall from the numpy practical that
+# argmax gives us a 1D index into a
+# flattened view of the array. We can
+# use the unravel_index function to
+# convert it into a 3D index.
+peakvox = np.abs(tstat1).argmax()
+peakvox = np.unravel_index(peakvox, tstat1.shape)
+print('Peak voxel coordinates for tstat1:', peakvox, tstat1[peakvox])
+```
+
+
+Now that we've got the voxel coordinates in functional space, we need to
+transform them into MNI space. FEAT provides a transformation which goes
+directly from functional to standard space, in the `reg` directory:
+
+```
+func2std = np.loadtxt(op.join(featdir, 'reg', 'example_func2standard.mat'))
+```
+
+But ... wait a minute ... this is a FLIRT matrix! We can't just plug voxel
+coordinates into a FLIRT matrix and expect to get sensible results, because
+FLIRT works in an internal FSL coordinate system, which is not quite
+`'voxel'`, and not quite `'world'`. So we need to do a little more work.
+Let's start by loading our functional image, and the MNI152 template (the
+source and reference images of our FLIRT matrix):
+
+```
+func = Image(op.join(featdir, 'reg', 'example_func'))
+std = Image(op.expandvars(op.join('$FSLDIR', 'data', 'standard', 'MNI152_T1_2mm')))
+```
+
+
+Now we can use them to get affines which convert between all of the different
+coordinate systems - we're going to combine them into a single uber-affine,
+which transforms our functional-space voxels into MNI world coordinates via:
+
+ 1. functional voxels -> FLIRT source space
+ 2. FLIRT source space -> FLIRT reference space
+ 3. FLIRT reference space -> MNI world coordinates
+
+
+```
+vox2fsl = func.getAffine('voxel', 'fsl')
+fsl2mni = std .getAffine('fsl', 'world')
+```
+
+Combining two affines into one is just a simple dot-product. There is a
+`concat()` function which does this for us, for any number of affines:
+
+```
+from fsl.transform.affine import concat
+
+# To combine affines together, we
+# have to list them in reverse -
+# linear algebra is *weird*.
+funcvox2mni = concat(fsl2mni, func2std, vox2fsl)
+```
+
+So we've now got some voxel coordinates from our functional data, and an affine
+to transform into MNI world coordinates. The rest is easy:
+
+```
+mnicoords = transform(peakvox, funcvox2mni)
+print('Peak activation (MNI coordinates):', mnicoords)
+```
+
+
+> Note that in the above example we are only applying a linear transformation
+> into MNI space - in reality you would also want to apply your non-linear
+> structural-to-standard transformation. But this is left as an exercise
+> for the reader ;).
+
+
+<a class="anchor" id="image-processing"></a>
+## Image processing
+
+Now, it's all well and good to look at t-statistic values and voxel
+coordinates and so on and so forth, but let's spice things up a bit and look
+at some images. Let's display our peak activation location in MNI space. To
+do this, we're going to resample our functional image into
+
+```
+from IPython.display import Image as Screenshot
+!fsleyes render -of screenshot.png -std
+```
+
+
+### (Advanced) Transform coordinates with nonlinear warpfields
+
+
+have to use your own dataset
diff --git a/getting_started/01_basics.ipynb b/getting_started/01_basics.ipynb
index 11cc06dce3e1220432f84ca70c8907ecb150d461..677b36d559ec4b943240a696cc8fc7b4f8c603a0 100644
--- a/getting_started/01_basics.ipynb
+++ b/getting_started/01_basics.ipynb
@@ -16,7 +16,17 @@
 "(including the text blocks, so you can just move down the document\n",
 "with shift + enter).\n",
 "\n",
- "It is also possible to _change_ the contents of each code block (these pages are completely interactive) so do experiment with the code you see and try some variations!\n",
+ "It is also possible to _change_ the contents of each code block (these pages\n",
+ "are completely interactive) so do experiment with the code you see and try\n",
+ "some variations!\n",
+ "\n",
+ "> **Important**: We are exclusively using Python 3 in FSL - as of FSL 6.0.4 we\n",
+ "> are using Python 3.7.
There are some subtle differences between Python 2 and\n", + "> Python 3, but instead of learning about these differences, it is easier to\n", + "> simply forget that Python 2 exists. When you are googling for Python help,\n", + "> make sure that the pages you find are relevant to Python 3 and *not* Python\n", + "> 2! The official Python docs can be found at https://docs.python.org/3/ (note\n", + "> the _/3/_ at the end!).\n", "\n", "## Contents\n", "\n", @@ -64,17 +74,9 @@ }, { "cell_type": "code", - "execution_count": 130, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "4\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "a = 4\n", "b = 3.6\n", @@ -93,19 +95,9 @@ }, { "cell_type": "code", - "execution_count": 131, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[10, 20, 30]\n", - "{'b': 20, 'a': 10}\n", - "4 3.6 abc\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "print(d)\n", "print(e)\n", @@ -135,17 +127,9 @@ }, { "cell_type": "code", - "execution_count": 132, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "test string :: another test string\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "s1 = \"test string\"\n", "s2 = 'another test string'\n", @@ -161,20 +145,9 @@ }, { "cell_type": "code", - "execution_count": 133, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "This is\n", - "a string over\n", - "multiple lines\n", - "\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "s3 = '''This is\n", "a string over\n", @@ -190,23 +163,16 @@ "<a class=\"anchor\" id=\"Format\"></a>\n", "### Format\n", "\n", - "More interesting strings can be created using the `format` statement, which is very useful in print 
statements:" + "More interesting strings can be created using the\n", + "[`format`](https://docs.python.org/3/library/string.html#formatstrings)\n", + "statement, which is very useful in print statements:" ] }, { "cell_type": "code", - "execution_count": 134, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "The numerical value is 1 and a name is PyTreat\n", - "A name is PyTreat and a number is 1\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "x = 1\n", "y = 'PyTreat'\n", @@ -219,8 +185,27 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "There are also other options along these lines, but this is the more modern version, although you will see plenty of the other alternatives in \"old\" code (to python coders this means anything written before last week).\n", - "\n", + "Python also supports C-style [`%`\n", + "formatting](https://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x = 1\n", + "y = 'PyTreat'\n", + "s = 'The numerical value is %i and a name is %s' % (x, y)\n", + "print(s)\n", + "print('A name is %s and a number is %i' % (y, x))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ "<a class=\"anchor\" id=\"String-manipulation\"></a>\n", "### String manipulation\n", "\n", @@ -229,18 +214,9 @@ }, { "cell_type": "code", - "execution_count": 135, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "THIS IS A TEST STRING\n", - "this is a test string\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "s = 'This is a Test String'\n", "print(s.upper())\n", @@ -256,17 +232,9 @@ }, { "cell_type": "code", - "execution_count": 136, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ 
- "This is a Better String\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "s = 'This is a Test String'\n", "s2 = s.replace('Test', 'Better')\n", @@ -282,17 +250,9 @@ }, { "cell_type": "code", - "execution_count": 137, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "This is a Test String :: This is a Better String\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "s3 = s + ' :: ' + s2\n", "print(s3)" @@ -307,17 +267,9 @@ }, { "cell_type": "code", - "execution_count": 138, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "This is an example of an example String\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "import re\n", "s = 'This is a test of a Test String'\n", @@ -333,23 +285,15 @@ "\n", "For more information on matching and substitutions, look up the regular expression module on the web.\n", "\n", - "Two common and convenient string methods are `strip()` and `split()`. The first will remove any whitespace at the beginning and end of a string:" + "Two common and convenient string methods are `strip()` and `split()`. 
The\n", + "first will remove any whitespace at the beginning and end of a string:" ] }, { "cell_type": "code", - "execution_count": 139, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "* A very spacy string *\n", - "*A very spacy string*\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "s2 = ' A very spacy string '\n", "print('*' + s2 + '*')\n", @@ -365,18 +309,9 @@ }, { "cell_type": "code", - "execution_count": 140, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "['This', 'is', 'a', 'test', 'of', 'a', 'Test', 'String']\n", - "['A', 'very', 'spacy', 'string']\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "print(s.split())\n", "print(s2.split())" @@ -391,17 +326,9 @@ }, { "cell_type": "code", - "execution_count": 141, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[' This is', ' as you can see ', ' a very weirdly spaced and punctuated string ... ']\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "s4 = ' This is, as you can see , a very weirdly spaced and punctuated string ... 
'\n",
 "print(s4.split(','))"
@@ -411,28 +338,44 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "There are more powerful ways of dealing with this like csv files/strings, which are covered in later practicals, but even this can get you a long way.\n",
+ "A neat trick, if you want to change the delimiter in some structured data (e.g.\n",
+ "replace `,` with `\\t`), is to use `split()` in combination with another string\n",
+ "method, `join()`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "csvdata = 'some,comma,separated,data'\n",
+ "tsvdata = '\\t'.join(csvdata.split(','))\n",
+ "tsvdata = tsvdata.replace('comma', 'tab')\n",
+ "print('csvdata:', csvdata)\n",
+ "print('tsvdata:', tsvdata)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "There are more powerful ways of dealing with this like csv files/strings,\n",
+ "which are covered in later practicals, but even this can get you a long way.\n",
 "\n",
- "> Note that strings in python 3 are _unicode_ so can represent Chinese characters, etc, and is therefore very flexible. However, in general you can just be blissfully ignorant of this fact.\n",
+ "> Note that strings in python 3 are _unicode_ so can represent Chinese\n",
+ "> characters, etc, and are therefore very flexible.
However, in general you\n", + "> can just be blissfully ignorant of this fact.\n", "\n", - "Strings can be converted to integer or floating-point values by using the `int()` and `float()` calls:" + "Strings can be converted to integer or floating-point values by using the\n", + "`int()` and `float()` calls:" ] }, { "cell_type": "code", - "execution_count": 142, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "232.03\n", - "25.03\n", - "25.03\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "sint='23'\n", "sfp='2.03'\n", @@ -452,8 +395,8 @@ "<a class=\"anchor\" id=\"Tuples-and-lists\"></a>\n", "## Tuples and lists\n", "\n", - "Both tuples and lists are builtin python types and are like vectors, \n", - "but for numerical vectors and arrays it is much better to use _numpy_\n", + "Both tuples and lists are builtin python types and are like vectors,\n", + "but for numerical vectors and arrays it is much better to use `numpy`\n", "arrays (or matrices), which are covered in a later tutorial.\n", "\n", "A tuple is like a list or a vector, but with less flexibility than a full list (tuples are immutable), however anything can be stored in either a list or tuple, without any consistency being required. Tuples are defined using round brackets and lists are defined using square brackets. 
For example:" @@ -461,18 +404,9 @@ }, { "cell_type": "code", - "execution_count": 143, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "(3, 7.6, 'str')\n", - "[1, 'mj', -5.4]\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "xtuple = (3, 7.6, 'str')\n", "xlist = [1, 'mj', -5.4]\n", @@ -489,18 +423,9 @@ }, { "cell_type": "code", - "execution_count": 144, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "x2 is: ((3, 7.6, 'str'), [1, 'mj', -5.4])\n", - "x3 is: [(3, 7.6, 'str'), [1, 'mj', -5.4]]\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "x2 = (xtuple, xlist)\n", "x3 = [xtuple, xlist]\n", @@ -520,17 +445,9 @@ }, { "cell_type": "code", - "execution_count": 145, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[10, 20, 30, 70, 80]\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "a = [10, 20, 30]\n", "a = a + [70]\n", @@ -542,7 +459,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "> Similar things can be done for tuples, except for the last one: that is, a += (80) as a tuple is immutable so cannot be changed like this. 
\n",
+ "> Similar things can be done for tuples, but a one-element tuple needs a\n",
+ "> trailing comma, e.g. `a += (80,)`; tuples are immutable, so this creates a new tuple.\n",
 "\n",
 "<a class=\"anchor\" id=\"Indexing\"></a>\n",
 "### Indexing\n",
@@ -552,17 +470,9 @@
 },
 {
 "cell_type": "code",
- "execution_count": 146,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "20\n"
- ]
- }
- ],
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
 "source": [
 "d = [10, 20, 30]\n",
 "print(d[1])"
@@ -578,18 +488,9 @@
 },
 {
 "cell_type": "code",
- "execution_count": 147,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "10\n",
- "30\n"
- ]
- }
- ],
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
 "source": [
 "a = [10, 20, 30, 40, 50, 60]\n",
 "print(a[0])\n",
@@ -600,23 +501,15 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "Indices naturally run from 0 to N-1, _but_ negative numbers can be used to reference from the end (circular wrap-around)."
+ "Indices naturally run from 0 to N-1, _but_ negative numbers can be used to\n",
+ "reference from the end (circular wrap-around)."
] }, { "cell_type": "code", - "execution_count": 148, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "60\n", - "10\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "print(a[-1])\n", "print(a[-6])" @@ -631,42 +524,18 @@ }, { "cell_type": "code", - "execution_count": 149, - "metadata": {}, - "outputs": [ - { - "ename": "IndexError", - "evalue": "list index out of range", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mIndexError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m<ipython-input-149-f4cf4536701c>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m-\u001b[0m\u001b[0;36m7\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", - "\u001b[0;31mIndexError\u001b[0m: list index out of range" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "print(a[-7])" ] }, { "cell_type": "code", - "execution_count": 150, - "metadata": {}, - "outputs": [ - { - "ename": "IndexError", - "evalue": "list index out of range", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mIndexError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m<ipython-input-150-52d95fbe5286>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m6\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", - "\u001b[0;31mIndexError\u001b[0m: list index out of range" - ] - } - ], + "execution_count": 
null, + "metadata": {}, + "outputs": [], "source": [ "print(a[6])" ] @@ -680,17 +549,9 @@ }, { "cell_type": "code", - "execution_count": 151, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "6\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "print(len(a))" ] @@ -704,18 +565,9 @@ }, { "cell_type": "code", - "execution_count": 152, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "20\n", - "40\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "b = [[10, 20, 30], [40, 50, 60]]\n", "print(b[0][1])\n", @@ -739,17 +591,9 @@ }, { "cell_type": "code", - "execution_count": 153, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[10, 20, 30]\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "print(a[0:3])" ] @@ -766,18 +610,9 @@ }, { "cell_type": "code", - "execution_count": 154, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[10, 20, 30]\n", - "[20, 30]\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "a = [10, 20, 30, 40, 50, 60]\n", "print(a[0:3]) # same as a(1:3) in MATLAB\n", @@ -791,26 +626,14 @@ "> _*Pitfall:*_\n", ">\n", "> Unlike in MATLAB, you cannot use a list as indices instead of an\n", - "> integer or a slice (although these can be done in _numpy_)." + "> integer or a slice (although these can be done in `numpy`)." 
] }, { "cell_type": "code", - "execution_count": 155, - "metadata": {}, - "outputs": [ - { - "ename": "TypeError", - "evalue": "list indices must be integers or slices, not list", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m<ipython-input-155-aad7915ae3d8>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0mb\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0;36m3\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m4\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mb\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", - "\u001b[0;31mTypeError\u001b[0m: list indices must be integers or slices, not list" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "b = [3, 4]\n", "print(a[b])" @@ -825,19 +648,9 @@ }, { "cell_type": "code", - "execution_count": 156, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[10, 20, 30]\n", - "[20, 30, 40, 50, 60]\n", - "[10, 20, 30, 40, 50]\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "print(a[:3])\n", "print(a[1:])\n", @@ -855,19 +668,9 @@ }, { "cell_type": "code", - "execution_count": 157, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[10, 30]\n", - "[10, 30, 50]\n", - "[60, 50, 40, 30, 20, 10]\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "print(a[0:4:2])\n", "print(a[::2])\n", @@ -889,17 +692,9 @@ }, { "cell_type": "code", - "execution_count": 158, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - 
"output_type": "stream", - "text": [ - "[10, 20, 30, 10, 20, 30, 10, 20, 30, 10, 20, 30]\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "d = [10, 20, 30]\n", "print(d * 4)" @@ -914,21 +709,9 @@ }, { "cell_type": "code", - "execution_count": 159, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[10, 20, 30, 40]\n", - "[10, 20, 30, 40, 50, 60]\n", - "[10, 20, 30, 40, 50, 60, 70, 80]\n", - "[10, 30, 40, 50, 60, 70, 80]\n", - "[30, 40, 50, 60, 70, 80]\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "d.append(40)\n", "print(d)\n", @@ -954,19 +737,9 @@ }, { "cell_type": "code", - "execution_count": 160, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "10\n", - "20\n", - "30\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "d = [10, 20, 30]\n", "for x in d:\n", @@ -987,134 +760,9 @@ }, { "cell_type": "code", - "execution_count": 161, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Help on list object:\n", - "\n", - "class list(object)\n", - " | list() -> new empty list\n", - " | list(iterable) -> new list initialized from iterable's items\n", - " | \n", - " | Methods defined here:\n", - " | \n", - " | __add__(self, value, /)\n", - " | Return self+value.\n", - " | \n", - " | __contains__(self, key, /)\n", - " | Return key in self.\n", - " | \n", - " | __delitem__(self, key, /)\n", - " | Delete self[key].\n", - " | \n", - " | __eq__(self, value, /)\n", - " | Return self==value.\n", - " | \n", - " | __ge__(self, value, /)\n", - " | Return self>=value.\n", - " | \n", - " | __getattribute__(self, name, /)\n", - " | Return getattr(self, name).\n", - " | \n", - " | __getitem__(...)\n", - " | x.__getitem__(y) <==> x[y]\n", - " | \n", - " | __gt__(self, value, /)\n", - " | Return self>value.\n", - " | \n", - 
" | __iadd__(self, value, /)\n", - " | Implement self+=value.\n", - " | \n", - " | __imul__(self, value, /)\n", - " | Implement self*=value.\n", - " | \n", - " | __init__(self, /, *args, **kwargs)\n", - " | Initialize self. See help(type(self)) for accurate signature.\n", - " | \n", - " | __iter__(self, /)\n", - " | Implement iter(self).\n", - " | \n", - " | __le__(self, value, /)\n", - " | Return self<=value.\n", - " | \n", - " | __len__(self, /)\n", - " | Return len(self).\n", - " | \n", - " | __lt__(self, value, /)\n", - " | Return self<value.\n", - " | \n", - " | __mul__(self, value, /)\n", - " | Return self*value.n\n", - " | \n", - " | __ne__(self, value, /)\n", - " | Return self!=value.\n", - " | \n", - " | __new__(*args, **kwargs) from builtins.type\n", - " | Create and return a new object. See help(type) for accurate signature.\n", - " | \n", - " | __repr__(self, /)\n", - " | Return repr(self).\n", - " | \n", - " | __reversed__(...)\n", - " | L.__reversed__() -- return a reverse iterator over the list\n", - " | \n", - " | __rmul__(self, value, /)\n", - " | Return self*value.\n", - " | \n", - " | __setitem__(self, key, value, /)\n", - " | Set self[key] to value.\n", - " | \n", - " | __sizeof__(...)\n", - " | L.__sizeof__() -- size of L in memory, in bytes\n", - " | \n", - " | append(...)\n", - " | L.append(object) -> None -- append object to end\n", - " | \n", - " | clear(...)\n", - " | L.clear() -> None -- remove all items from L\n", - " | \n", - " | copy(...)\n", - " | L.copy() -> list -- a shallow copy of L\n", - " | \n", - " | count(...)\n", - " | L.count(value) -> integer -- return number of occurrences of value\n", - " | \n", - " | extend(...)\n", - " | L.extend(iterable) -> None -- extend list by appending elements from the iterable\n", - " | \n", - " | index(...)\n", - " | L.index(value, [start, [stop]]) -> integer -- return first index of value.\n", - " | Raises ValueError if the value is not present.\n", - " | \n", - " | insert(...)\n", - " | 
L.insert(index, object) -- insert object before index\n", - " | \n", - " | pop(...)\n", - " | L.pop([index]) -> item -- remove and return item at index (default last).\n", - " | Raises IndexError if list is empty or index is out of range.\n", - " | \n", - " | remove(...)\n", - " | L.remove(value) -> None -- remove first occurrence of value.\n", - " | Raises ValueError if the value is not present.\n", - " | \n", - " | reverse(...)\n", - " | L.reverse() -- reverse *IN PLACE*\n", - " | \n", - " | sort(...)\n", - " | L.sort(key=None, reverse=False) -> None -- stable sort *IN PLACE*\n", - " | \n", - " | ----------------------------------------------------------------------\n", - " | Data and other attributes defined here:\n", - " | \n", - " | __hash__ = None\n", - "\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "help(d)" ] @@ -1128,64 +776,9 @@ }, { "cell_type": "code", - "execution_count": 162, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "['__add__',\n", - " '__class__',\n", - " '__contains__',\n", - " '__delattr__',\n", - " '__delitem__',\n", - " '__dir__',\n", - " '__doc__',\n", - " '__eq__',\n", - " '__format__',\n", - " '__ge__',\n", - " '__getattribute__',\n", - " '__getitem__',\n", - " '__gt__',\n", - " '__hash__',\n", - " '__iadd__',\n", - " '__imul__',\n", - " '__init__',\n", - " '__iter__',\n", - " '__le__',\n", - " '__len__',\n", - " '__lt__',\n", - " '__mul__',\n", - " '__ne__',\n", - " '__new__',\n", - " '__reduce__',\n", - " '__reduce_ex__',\n", - " '__repr__',\n", - " '__reversed__',\n", - " '__rmul__',\n", - " '__setattr__',\n", - " '__setitem__',\n", - " '__sizeof__',\n", - " '__str__',\n", - " '__subclasshook__',\n", - " 'append',\n", - " 'clear',\n", - " 'copy',\n", - " 'count',\n", - " 'extend',\n", - " 'index',\n", - " 'insert',\n", - " 'pop',\n", - " 'remove',\n", - " 'reverse',\n", - " 'sort']" - ] - }, - "execution_count": 162, - "metadata": {}, - "output_type": "execute_result" 
- } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "dir(d)" ] @@ -1194,7 +787,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "> Note that google is often more helpful! At least, as long as you find pages relating to the right version of python - we use python 3 for FSL, so check that what you find is appropriate for that.\n", + "> Note that google is often more helpful! At least, as long as you find pages\n", + "> relating to Python 3 - Python 2 is no longer supported, but there is still\n", + "> lots of information about it on the internet, so be careful!\n", "\n", "---\n", "\n", @@ -1206,20 +801,9 @@ }, { "cell_type": "code", - "execution_count": 163, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "2\n", - "dict_keys(['b', 'a'])\n", - "dict_values([20, 10])\n", - "10\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "e = {'a' : 10, 'b': 20}\n", "print(len(e))\n", @@ -1244,17 +828,9 @@ }, { "cell_type": "code", - "execution_count": 164, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'b': 20, 'a': 10, 'c': 555}\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "e['c'] = 555 # just like in Biobank! 
;)\n", "print(e)" @@ -1272,18 +848,9 @@ }, { "cell_type": "code", - "execution_count": 165, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "{'a': 10, 'c': 555}\n", - "{'a': 10}\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "e.pop('b')\n", "print(e)\n", @@ -1303,19 +870,9 @@ }, { "cell_type": "code", - "execution_count": 166, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "('b', 20)\n", - "('a', 10)\n", - "('c', 555)\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "e = {'a' : 10, 'b': 20, 'c':555}\n", "for k, v in e.items():\n", @@ -1333,19 +890,9 @@ }, { "cell_type": "code", - "execution_count": 167, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "('b', 20)\n", - "('a', 10)\n", - "('c', 555)\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "for k in e:\n", " print((k, e[k]))" @@ -1355,31 +902,29 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "> Note that in both cases the order is arbitrary. The `sorted` function can be used if you want keys in a sorted order; e.g. 
`for k in sorted(e):` ...\n", + "> In older versions of Python 3, there was no guarantee of ordering when using dictionaries.\n", + "> However, as of Python 3.7, dictionaries will remember the order in which items are inserted,\n", + "> and the `keys()`, `values()`, and `items()` methods will return elements in that order.\n", ">\n", - "> There are also [other options](https://docs.python.org/3.5/library/collections.html#collections.OrderedDict) if you want a dictionary with ordering.\n", + "\n", + "> If you want a dictionary with ordering, *and* you want your code to work with\n", + "> Python versions older than 3.7, you can use the\n", + "> [`OrderedDict`](https://docs.python.org/3/library/collections.html#collections.OrderedDict)\n", + "> class.\n", "\n", "---\n", "\n", "<a class=\"anchor\" id=\"Copying-and-references\"></a>\n", - "## Copying and references \n", + "## Copying and references\n", "\n", "In python there are immutable types (e.g. numbers) and mutable types (e.g. lists). The main thing to know is that assignment can sometimes create separate copies and sometimes create references (as in C++). In general, the more complicated types are assigned via references. 
For example:" ] }, { "cell_type": "code", - "execution_count": 168, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "7\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "a = 7\n", "b = a\n", @@ -1396,17 +941,9 @@ }, { "cell_type": "code", - "execution_count": 169, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[8888]\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "a = [7]\n", "b = a\n", @@ -1423,17 +960,9 @@ }, { "cell_type": "code", - "execution_count": 170, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[7, 7]\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "a = [7]\n", "b = a * 2\n", @@ -1450,17 +979,9 @@ }, { "cell_type": "code", - "execution_count": 171, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[7]\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "a = [7]\n", "b = list(a)\n", @@ -1477,18 +998,9 @@ }, { "cell_type": "code", - "execution_count": 172, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "(2, 5, 7)\n", - "[2, 5, 7]\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "xt = (2, 5, 7)\n", "xl = list(xt)\n", @@ -1507,24 +1019,9 @@ }, { "cell_type": "code", - "execution_count": 173, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "a: [5]\n", - "x: [5, 10]\n", - "a: [5, 10]\n", - "x: [5, 10, 10]\n", - "a: [5, 10]\n", - "return value: [5, 10, 10]\n", - "a: [5, 10]\n", - "b: [5, 10, 10]\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "def foo1(x):\n", " x.append(10)\n", @@ -1563,7 +1060,11 @@ "<a class=\"anchor\" 
id=\"Boolean-operators\"></a>\n", "### Boolean operators\n", "\n", - "There is a boolean type in python that can be `True` or `False` (note the capitals). Other values can also be used for True or False (e.g., 1 for True; 0 or None or [] or {} or \"\") although they are not considered 'equal' in the sense that the operator `==` would consider them the same.\n", + "There is a boolean type in python that can be `True` or `False` (note the\n", + "capitals). Other values can also be used for True or False (e.g., `1` for\n", + "`True`; `0` or `None` or `[]` or `{}` or `\"\"` for `False`) although they are\n", + "not considered 'equal' in the sense that the operator `==` would consider them\n", + "the same.\n", "\n", "Relevant boolean and comparison operators include: `not`, `and`, `or`, `==` and `!=`\n", "\n", @@ -1572,21 +1073,9 @@ }, { "cell_type": "code", - "execution_count": 174, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Not a is: False\n", - "Not 1 is: False\n", - "Not 0 is: True\n", - "Not {} is: True\n", - "{}==0 is: False\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "a = True\n", "print('Not a is:', not a)\n", @@ -1605,19 +1094,9 @@ }, { "cell_type": "code", - "execution_count": 175, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "False\n", - "True\n", - "True\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "print('the' in 'a number of words')\n", "print('of' in 'a number of words')\n", @@ -1639,18 +1118,9 @@ }, { "cell_type": "code", - "execution_count": 176, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "0.5890515724950383\n", - "Positive\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "import random\n", "a = random.uniform(-1, 1)\n", @@ -1672,17 +1142,9 @@ }, { "cell_type": 
"code", - "execution_count": 177, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Variable is true, or at least not empty\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "a = [] # just one of many examples\n", "if not a:\n", @@ -1693,7 +1155,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "This can be useful for functions where a variety of possible input types are being dealt with. \n", + "This can be useful for functions where a variety of possible input types are being dealt with.\n", "\n", "---\n", "\n", @@ -1705,21 +1167,9 @@ }, { "cell_type": "code", - "execution_count": 178, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "2\n", - "is\n", - "more\n", - "than\n", - "1\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "for x in [2, 'is', 'more', 'than', 1]:\n", " print(x)" @@ -1736,23 +1186,9 @@ }, { "cell_type": "code", - "execution_count": 179, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "2\n", - "3\n", - "4\n", - "5\n", - "6\n", - "7\n", - "8\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "for x in range(2, 9):\n", " print(x)" @@ -1769,18 +1205,9 @@ }, { "cell_type": "code", - "execution_count": 180, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "4\n", - "7\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "x, y = [4, 7]\n", "print(x)\n", @@ -1796,21 +1223,9 @@ }, { "cell_type": "code", - "execution_count": 181, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[('Some', 0), ('set', 1), ('of', 2), ('items', 3)]\n", - "0 Some\n", - "1 set\n", - "2 of\n", - "3 items\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + 
"outputs": [], "source": [ "alist = ['Some', 'set', 'of', 'items']\n", "blist = list(range(len(alist)))\n", @@ -1833,17 +1248,9 @@ }, { "cell_type": "code", - "execution_count": 182, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "34.995996566662235\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "import random\n", "n = 0\n", @@ -1881,17 +1288,9 @@ }, { "cell_type": "code", - "execution_count": 183, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "0.33573141209899227 0.11271558106998338\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "import random\n", "x = random.uniform(0, 1)\n", @@ -1911,18 +1310,9 @@ }, { "cell_type": "code", - "execution_count": 184, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]\n", - "[0, 1, 4, 9, 16, 25, 36, 64, 81]\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "v1 = [ x**2 for x in range(10) ]\n", "print(v1)\n", @@ -1959,19 +1349,9 @@ }, { "cell_type": "code", - "execution_count": 185, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "(22.360679774997898, 500)\n", - "37.416573867739416\n", - "37.416573867739416\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "def myfunc(x, y, z=0):\n", " r2 = x*x + y*y + z*z\n", @@ -2002,17 +1382,9 @@ }, { "cell_type": "code", - "execution_count": 186, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "22.360679774997898 30 60\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "def myfunc(x, y, z=0, flag=''):\n", " if flag=='L1':\n", @@ -2041,7 +1413,7 @@ "Let's say you are given a single string with comma separated 
elements\n", "that represent filenames and ID codes: e.g., `/vols/Data/pytreat/AAC, 165873, /vols/Data/pytreat/AAG, 170285, ...`\n", "\n", - "Write some code to do the following: \n", + "Write some code to do the following:\n", " * separate out the filenames and ID codes into separate lists (ID's\n", " should be numerical values, not strings) - you may need several steps for this\n", " * loop over the two and generate a _string_ that could be used to\n", @@ -2064,25 +1436,7 @@ ] } ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.5.2" - } - }, + "metadata": {}, "nbformat": 4, "nbformat_minor": 2 } diff --git a/getting_started/01_basics.md b/getting_started/01_basics.md index 9c5928de0ac64d057b56057096881f180d495089..ac9372c7e12bd688b410200a76fe65eefdc69517 100644 --- a/getting_started/01_basics.md +++ b/getting_started/01_basics.md @@ -10,7 +10,17 @@ explanations. You can run each block by using _shift + enter_ (including the text blocks, so you can just move down the document with shift + enter). -It is also possible to _change_ the contents of each code block (these pages are completely interactive) so do experiment with the code you see and try some variations! +It is also possible to _change_ the contents of each code block (these pages +are completely interactive) so do experiment with the code you see and try +some variations! + +> **Important**: We are exclusively using Python 3 in FSL - as of FSL 6.0.4 we +> are using Python 3.7. There are some subtle differences between Python 2 and +> Python 3, but instead of learning about these differences, it is easier to +> simply forget that Python 2 exists. 
When you are googling for Python help, +> make sure that the pages you find are relevant to Python 3 and *not* Python +> 2! The official Python docs can be found at https://docs.python.org/3/ (note +> the _/3/_ at the end!). ## Contents @@ -104,7 +114,10 @@ print(s3) <a class="anchor" id="Format"></a> ### Format -More interesting strings can be created using the `format` statement, which is very useful in print statements: +More interesting strings can be created using the +[`format`](https://docs.python.org/3/library/string.html#formatstrings) +statement, which is very useful in print statements: + ``` x = 1 y = 'PyTreat' @@ -113,7 +126,16 @@ print(s) print('A name is {} and a number is {}'.format(y, x)) ``` -There are also other options along these lines, but this is the more modern version, although you will see plenty of the other alternatives in "old" code (to python coders this means anything written before last week). +Python also supports C-style [`%` +formatting](https://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting): + +``` +x = 1 +y = 'PyTreat' +s = 'The numerical value is %i and a name is %s' % (x, y) +print(s) +print('A name is %s and a number is %i' % (y, x)) +``` <a class="anchor" id="String-manipulation"></a> ### String manipulation @@ -150,7 +172,8 @@ where the `r` before the quote is used to force the regular expression specifica For more information on matching and substitutions, look up the regular expression module on the web. -Two common and convenient string methods are `strip()` and `split()`. The first will remove any whitespace at the beginning and end of a string: +Two common and convenient string methods are `strip()` and `split()`. 
The +first will remove any whitespace at the beginning and end of a string: ``` s2 = ' A very spacy string ' @@ -159,6 +182,7 @@ print('*' + s2.strip() + '*') ``` With `split()` we can tokenize a string (to turn it into a list of strings) like this: + ``` print(s.split()) print(s2.split()) @@ -170,11 +194,27 @@ s4 = ' This is, as you can see , a very weirdly spaced and punctuated print(s4.split(',')) ``` -There are more powerful ways of dealing with this like csv files/strings, which are covered in later practicals, but even this can get you a long way. +A neat trick, if you want to change the delimiter in some structured data (e.g. +replace `,` with `\t`), is to use `split()` in combination with another string +method, `join()`: +``` +csvdata = 'some,comma,separated,data' +tsvdata = '\t'.join(csvdata.split(',')) +tsvdata = tsvdata.replace('comma', 'tab') +print('csvdata:', csvdata) +print('tsvdata:', tsvdata) +``` + + +There are more powerful ways of dealing with this like csv files/strings, +which are covered in later practicals, but even this can get you a long way. -> Note that strings in python 3 are _unicode_ so can represent Chinese characters, etc, and is therefore very flexible. However, in general you can just be blissfully ignorant of this fact. +> Note that strings in python 3 are _unicode_ so can represent Chinese +> characters, etc, and is therefore very flexible. However, in general you +> can just be blissfully ignorant of this fact. 
-Strings can be converted to integer or floating-point values by using the `int()` and `float()` calls: +Strings can be converted to integer or floating-point values by using the +`int()` and `float()` calls: ``` sint='23' @@ -191,8 +231,8 @@ print(float(sint) + float(sfp)) <a class="anchor" id="Tuples-and-lists"></a> ## Tuples and lists -Both tuples and lists are builtin python types and are like vectors, -but for numerical vectors and arrays it is much better to use _numpy_ +Both tuples and lists are builtin python types and are like vectors, +but for numerical vectors and arrays it is much better to use `numpy` arrays (or matrices), which are covered in a later tutorial. A tuple is like a list or a vector, but with less flexibility than a full list (tuples are immutable), however anything can be stored in either a list or tuple, without any consistency being required. Tuples are defined using round brackets and lists are defined using square brackets. For example: @@ -222,7 +262,8 @@ a += [80] print(a) ``` -> Similar things can be done for tuples, except for the last one: that is, a += (80) as a tuple is immutable so cannot be changed like this. +> Similar things can be done for tuples, except for the last one: that is, +> `a += (80)` as a tuple is immutable so cannot be changed like this. <a class="anchor" id="Indexing"></a> ### Indexing @@ -242,7 +283,9 @@ print(a[0]) print(a[2]) ``` -Indices naturally run from 0 to N-1, _but_ negative numbers can be used to reference from the end (circular wrap-around). +Indices naturally run from 0 to N-1, _but_ negative numbers can be used to +reference from the end (circular wrap-around). + ``` print(a[-1]) print(a[-6]) @@ -294,7 +337,7 @@ print(a[1:3]) # same as a(2:3) in MATLAB > _*Pitfall:*_ > > Unlike in MATLAB, you cannot use a list as indices instead of an -> integer or a slice (although these can be done in _numpy_). +> integer or a slice (although these can be done in `numpy`). 
``` b = [3, 4] @@ -372,7 +415,9 @@ dir(d) ``` -> Note that google is often more helpful! At least, as long as you find pages relating to the right version of python - we use python 3 for FSL, so check that what you find is appropriate for that. +> Note that google is often more helpful! At least, as long as you find pages +> relating to Python 3 - Python 2 is no longer supported, but there is still +> lots of information about it on the internet, so be careful! --- @@ -430,14 +475,20 @@ for k in e: print((k, e[k])) ``` -> Note that in both cases the order is arbitrary. The `sorted` function can be used if you want keys in a sorted order; e.g. `for k in sorted(e):` ... +> In older versions of Python 3, there was no guarantee of ordering when using dictionaries. +> However, as of Python 3.7, dictionaries will remember the order in which items are inserted, +> and the `keys()`, `values()`, and `items()` methods will return elements in that order. > -> There are also [other options](https://docs.python.org/3.5/library/collections.html#collections.OrderedDict) if you want a dictionary with ordering. + +> If you want a dictionary with ordering, *and* you want your code to work with +> Python versions older than 3.7, you can use the +> [`OrderedDict`](https://docs.python.org/3/library/collections.html#collections.OrderedDict) +> class. --- <a class="anchor" id="Copying-and-references"></a> -## Copying and references +## Copying and references In python there are immutable types (e.g. numbers) and mutable types (e.g. lists). The main thing to know is that assignment can sometimes create separate copies and sometimes create references (as in C++). In general, the more complicated types are assigned via references. For example: ``` @@ -517,7 +568,11 @@ print('b: ', b) <a class="anchor" id="Boolean-operators"></a> ### Boolean operators -There is a boolean type in python that can be `True` or `False` (note the capitals). 
Other values can also be used for True or False (e.g., 1 for True; 0 or None or [] or {} or "") although they are not considered 'equal' in the sense that the operator `==` would consider them the same. +There is a boolean type in python that can be `True` or `False` (note the +capitals). Other values can also be used for True or False (e.g., `1` for +`True`; `0` or `None` or `[]` or `{}` or `""` for `False`) although they are +not considered 'equal' in the sense that the operator `==` would consider them +the same. Relevant boolean and comparison operators include: `not`, `and`, `or`, `==` and `!=` @@ -564,7 +619,7 @@ a = [] # just one of many examples if not a: print('Variable is true, or at least not empty') ``` -This can be useful for functions where a variety of possible input types are being dealt with. +This can be useful for functions where a variety of possible input types are being dealt with. --- @@ -722,7 +777,7 @@ You will often see python functions called with these named arguments. 
In fact, Let's say you are given a single string with comma separated elements that represent filenames and ID codes: e.g., `/vols/Data/pytreat/AAC, 165873, /vols/Data/pytreat/AAG, 170285, ...` -Write some code to do the following: +Write some code to do the following: * separate out the filenames and ID codes into separate lists (ID's should be numerical values, not strings) - you may need several steps for this * loop over the two and generate a _string_ that could be used to @@ -738,5 +793,3 @@ Write some code to do the following: mstr = '/vols/Data/pytreat/AAC, 165873, /vols/Data/pytreat/AAG, 170285, /vols/Data/pytreat/AAH, 196792, /vols/Data/pytreat/AAK, 212577, /vols/Data/pytreat/AAQ, 385376, /vols/Data/pytreat/AB, 444600, /vols/Data/pytreat/AC6, 454578, /vols/Data/pytreat/V8, 501502, /vols/Data/pytreat/2YK, 667688, /vols/Data/pytreat/C3PO, 821971' ``` - - diff --git a/getting_started/02_text_io.ipynb b/getting_started/02_text_io.ipynb index 43ef0db4be79d6820bdd4e19e953e73f6bb0c53b..310b3adaf7c93d023898036863f087a6eaa9f199 100644 --- a/getting_started/02_text_io.ipynb +++ b/getting_started/02_text_io.ipynb @@ -6,11 +6,16 @@ "source": [ "# Text input/output\n", "\n", - "In this section we will explore how to write and/or retrieve our data from text files.\n", + "In this section we will explore how to write and/or retrieve our data from\n", + "text files.\n", "\n", - "Most of the functionality for reading/writing files and manipulating strings is available without any imports. However, you can find some additional functionality in the [`string`](https://docs.python.org/3.6/library/string.html) module.\n", + "Most of the functionality for reading/writing files and manipulating strings\n", + "is available without any imports. However, you can find some additional\n", + "functionality in the\n", + "[`string`](https://docs.python.org/3/library/string.html) module.\n", "\n", - "Most of the string functions are available as methods on string objects. 
This means that you can use the ipython autocomplete to check for them." + "Most of the string functions are available as methods on string objects. This\n", + "means that you can use the ipython autocomplete to check for them." ] }, { @@ -28,7 +33,10 @@ "metadata": {}, "outputs": [], "source": [ - "empty_string. # after running the code block above, put your cursor behind the dot and press tab to get a list of methods" + "# after running the code block above,\n", + "# put your cursor after the dot and\n", + "# press tab to get a list of methods\n", + "empty_string." ] }, { @@ -49,11 +57,20 @@ "* [Exercises](#exercises)\n", "\n", "<a class=\"anchor\" id=\"reading-writing-files\"></a>\n", + "\n", "## Reading/writing files\n", - "The syntax to open a file in python is `with open(<filename>, <mode>) as <file_object>: <block of code>`, where\n", + "\n", + "\n", + "The syntax to open a file in python is `with open(<filename>, <mode>) as\n", + "<file_object>: <block of code>`, where\n", "* `filename` is a string with the name of the file\n", - "* `mode` is one of 'r' (for read-only access), 'w' (for writing a file, this wipes out any existing content), 'a' (for appending to an existing file).\n", - "* `file_object` is a variable name which will be used within the `block of code` to access the opened file.\n", + "\n", + "* `mode` is one of `'r'` (for read-only access), `'w'` (for writing a file,\n", + " this wipes out any existing content), `'a'` (for appending to an existing\n", + " file).\n", + "\n", + "* `file_object` is a variable name which will be used within the `block of\n", + " code` to access the opened file.\n", "\n", "For example the following will read all the text in `README.md` and print it." ] @@ -72,10 +89,37 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "> The `with` statement is an advanced python feature, however you will probably only encounter it when opening files. 
In that context it merely ensures that the file will be properly closed as soon as the program leaves the `with` statement (even if an error is raised within the `with` statement).\n", - "\n", - "You could also use the `readlines()` method to get a list of all the lines.\n", + "> The `with` statement is an advanced python feature, however you will\n", + "> probably only encounter it when opening files. In that context it merely\n", + "> ensures that the file will be properly closed as soon as the program leaves\n", + "> the `with` statement (even if an error is raised within the `with`\n", + "> statement).\n", "\n", + "You could also use the `readlines()` method to get a list of all the lines, or\n", + "simply \"loop over\" the file object to get the lines one by one:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "with open('README.md', 'r') as readme_file:\n", + " print('First five lines...')\n", + " for i, line in enumerate(readme_file):\n", + " # each line is returned with its\n", + " # newline character still intact,\n", + " # so we use rstrip() to remove it.\n", + " print('{}: {}'.format(i, line.rstrip()))\n", + " if i == 4:\n", + " break" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ "A very similar syntax is used to write files:" ] }, @@ -94,7 +138,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Note that no new line characters get added automatically. We can investigate the resulting file using" + "Note that no new line characters get added automatically. We can investigate\n", + "the resulting file using" ] }, { @@ -110,7 +155,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "> Any lines starting with `!` will be interpreted as shell commands by ipython. It is great when playing around in the ipython notebook or in the ipython terminal, however it is an ipython-only feature and hence is not available when writing python scripts. 
How to call shell commands from python will be discussed in the `scripts` practical.\n", + "> In Jupyter notebook, (and in `ipython`/`fslipython`), any lines starting\n", + "> with `!` will be interpreted as shell commands. It is great when playing\n", + "> around in a Jupyter notebook or in the `ipython` terminal, however it is an\n", + "> ipython-only feature and hence is not available when writing python\n", + "> scripts. How to call shell commands from python will be discussed in the\n", + "> `scripts` practical.\n", "\n", "If we want to add to the existing file we can open it in the append mode:" ] @@ -130,14 +180,18 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Below we will discuss how we can convert python objects to strings to store in these files and how to extract those python objects from strings again.\n", + "Below we will discuss how we can convert python objects to strings to store in\n", + "these files and how to extract those python objects from strings again.\n", "\n", "<a class=\"anchor\" id=\"creating-new-strings\"></a>\n", "## Creating new strings\n", "\n", "<a class=\"anchor\" id=\"string-syntax\"></a>\n", "### String syntax\n", - "Single-line strings can be created in python using either single or double quotes" + "\n", + "\n", + "Single-line strings can be created in python using either single or double\n", + "quotes:" ] }, { @@ -155,7 +209,10 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The main rationale for choosing between single or double quotes, is whether the string itself will contain any quotes. You can include a single quote in a string surrounded by single quotes by escaping it with the `\\` character, however in such a case it would be more convenient to use double quotes:" + "The main rationale for choosing between single or double quotes, is whether\n", + "the string itself will contain any quotes. 
You can include a single quote in a\n", + "string surrounded by single quotes by escaping it with the `\\` character,\n", + "however in such a case it would be more convenient to use double quotes:" ] }, { @@ -245,9 +302,19 @@ "source": [ "<a class=\"anchor\" id=\"unicode-versus-bytes\"></a>\n", "#### unicode versus bytes\n", - "To encourage the spread of python around the world, python 3 switched to using unicode as the default for strings and code (which is one of the main reasons for the incompatibility between python 2 and 3).\n", - "This means that each element in a string is a unicode character (using [UTF-8 encoding](https://docs.python.org/3/howto/unicode.html)), which can consist of one or more bytes.\n", - "The advantage is that any unicode characters can now be used in strings or in the code itself:" + "\n", + "> **Note**: You can safely skip this section if you do not have any plans to\n", + "> work with binary files or non-English text in Python, and you do not want\n", + "> to know how to insert poop emojis into your code.\n", + "\n", + "\n", + "To encourage the spread of python around the world, python 3 switched to using\n", + "unicode as the default for strings and code (which is one of the main reasons\n", + "for the incompatibility between python 2 and 3). This means that each element\n", + "in a string is a unicode character (using [UTF-8\n", + "encoding](https://docs.python.org/3/howto/unicode.html)), which can consist of\n", + "one or more bytes. The advantage is that any unicode characters can now be\n", + "used in strings or in the code itself:" ] }, { @@ -264,7 +331,10 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In python 2 each element in a string was a single byte rather than a potentially multi-byte character. You can convert back to interpreting your sequence as a unicode string or a byte array using:\n", + "In python 2 each element in a string was a single byte rather than a\n", + "potentially multi-byte character. 
You can convert back to interpreting your\n", + "sequence as a unicode string or a byte array using:\n", + "\n", "* `encode()` called on a string converts it into a bytes array (`bytes` object)\n", "* `decode()` called on a `bytes` array converts it into a unicode string." ] @@ -283,7 +353,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "These byte arrays can be created directly be prepending the quotes enclosing the string with a `b`, which tells python 3 to interpret the following as a byte array:" + "These byte arrays can be created directly by prepending the quotes enclosing\n", + "the string with a `b`, which tells python 3 to interpret the following as a\n", + "byte array:" ] }, { @@ -300,9 +372,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Especially in code dealing with strings (e.g., reading/writing of files) many of the errors arising of running python 2 code in python 3 arise from the mixing of unicode strings with byte arrays. Decoding and/or encoding some of these objects can often fix these issues.\n", + "Especially in code dealing with strings (e.g., reading/writing of files) many\n", + "of the errors arising of running python 2 code in python 3 arise from the\n", + "mixing of unicode strings with byte arrays. Decoding and/or encoding some of\n", + "these objects can often fix these issues.\n", "\n", - "By default any file opened in python will be interpreted as unicode. If you want to treat a file as raw bytes, you have to include a 'b' in the `mode` when calling the `open()` function:" + "By default any file opened in python will be interpreted as unicode. If you\n", + "want to treat a file as raw bytes, you have to include a 'b' in the `mode`\n", + "when calling the `open()` function:" ] }, { @@ -320,14 +397,21 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "> We use the `expandvars()` function here to insert the FSLDIR environmental variable into our string. 
This function will be presented in the file management practical.\n", + "> We use the `expandvars()` function here to insert the FSLDIR environmental\n", + "> variable into our string. This function will be presented in the file\n", + "> management practical.\n", + "\n", "\n", "<a class=\"anchor\" id=\"converting-objects-into-strings\"></a>\n", "### converting objects into strings\n", - "There are two functions to convert python objects into strings, `repr()` and `str()`.\n", - "All other functions that rely on string-representations of python objects will use one of these two (for example the `print()` function will call `str()` on the object).\n", "\n", - "The goal of the `str()` function is to be readable, while the goal of `repr()` is to be unambiguous. Compare" + "There are two functions to convert python objects into strings, `repr()` and\n", + "`str()`. All other functions that rely on string-representations of python\n", + "objects will use one of these two (for example the `print()` function will\n", + "call `str()` on the object).\n", + "\n", + "The goal of the `str()` function is to be readable, while the goal of `repr()`\n", + "is to be unambiguous. Compare" ] }, { @@ -445,10 +529,10 @@ "<a class=\"anchor\" id=\"string-formatting\"></a>\n", "### String formatting\n", "Using the techniques in [Combining strings](#combining-strings) we can build simple strings. For longer strings it is often useful to first write a template strings with some placeholders, where variables are later inserted. 
Built into python are currently 4 different ways of doing this (with many packages providing similar capabilities):\n", - "* the recommended [new-style formatting](https://docs.python.org/3.6/library/string.html#format-string-syntax).\n", + "* the recommended [new-style formatting](https://docs.python.org/3/library/string.html#format-string-syntax).\n", "* printf-like [old-style formatting](https://docs.python.org/3/library/stdtypes.html#old-string-formatting)\n", - "* [formatted string literals](https://docs.python.org/3.6/reference/lexical_analysis.html#f-strings) (these are only available in python 3.6+)\n", - "* bash-like [template-strings](https://docs.python.org/3.6/library/string.html#template-strings)\n", + "* [formatted string literals](https://docs.python.org/3/reference/lexical_analysis.html#f-strings) (these are only available in python 3.6+)\n", + "* bash-like [template-strings](https://docs.python.org/3/library/string.html#template-strings)\n", "\n", "Here we provide a single example using the first three methods, so you can recognize them in the future.\n", "\n", @@ -520,6 +604,14 @@ "\n", "<a class=\"anchor\" id=\"extracting-information-from-strings\"></a>\n", "## Extracting information from strings\n", + "\n", + "The techniques shown in this section are useful if you are loading data from a\n", + "small text file or user input, or parsing a small amount of output from\n", + "e.g. `fslstats`. However, if you are working with large structured text data\n", + "(e.g. 
a big `csv` file), you should use the I/O capabilities of `numpy` or\n", + "`pandas` instead of doing things manually - this is covered in separate\n", + "practicals.\n", + "\n", + "<a class=\"anchor\" id=\"splitting-strings\"></a>\n", "### Splitting strings\n", "The simplest way to extract a sub-string is to use slicing" @@ -598,9 +690,14 @@ "source": [ "> We use the syntax `[<expr> for <element> in <sequence>]` here which applies the `expr` to each `element` in the `sequence` and returns the resulting list. This is a list comprehension - a convenient form in python to create a new list from the old one.\n", "\n", + "\n", "<a class=\"anchor\" id=\"converting-strings-to-numbers\"></a>\n", "### Converting strings to numbers\n", - "Once you have extracted a number from a string, you can convert it into an actual integer or float by calling respectively `int()` or `float()` on it. `float()` understands a wide variety of different ways to write numbers:" + "\n", + "\n", + "Once you have extracted a number from a string, you can convert it into an\n", + "actual integer or float by calling respectively `int()` or `float()` on\n", + "it. `float()` understands a wide variety of different ways to write numbers:" ] }, { diff --git a/getting_started/02_text_io.md b/getting_started/02_text_io.md index afab42f8e05d1f2029d26ff4031ca1578b97394c..42e93b09b5883d25ded5c0540d221bc3dfb540a1 100644 --- a/getting_started/02_text_io.md +++ b/getting_started/02_text_io.md @@ -1,16 +1,25 @@ # Text input/output -In this section we will explore how to write and/or retrieve our data from text files. +In this section we will explore how to write and/or retrieve our data from +text files. -Most of the functionality for reading/writing files and manipulating strings is available without any imports. However, you can find some additional functionality in the [`string`](https://docs.python.org/3.6/library/string.html) module.
+Most of the functionality for reading/writing files and manipulating strings +is available without any imports. However, you can find some additional +functionality in the +[`string`](https://docs.python.org/3/library/string.html) module. + +Most of the string functions are available as methods on string objects. This +means that you can use the ipython autocomplete to check for them. -Most of the string functions are available as methods on string objects. This means that you can use the ipython autocomplete to check for them. ``` empty_string = '' ``` ``` -empty_string. # after running the code block above, put your cursor behind the dot and press tab to get a list of methods +# after running the code block above, +# put your cursor after the dot and +# press tab to get a list of methods +empty_string. ``` * [Reading/writing files](#reading-writing-files) @@ -27,20 +36,47 @@ empty_string. # after running the code block above, put your cursor behind th * [Exercises](#exercises) <a class="anchor" id="reading-writing-files"></a> + ## Reading/writing files -The syntax to open a file in python is `with open(<filename>, <mode>) as <file_object>: <block of code>`, where + + +The syntax to open a file in python is `with open(<filename>, <mode>) as +<file_object>: <block of code>`, where * `filename` is a string with the name of the file -* `mode` is one of 'r' (for read-only access), 'w' (for writing a file, this wipes out any existing content), 'a' (for appending to an existing file). -* `file_object` is a variable name which will be used within the `block of code` to access the opened file. + +* `mode` is one of `'r'` (for read-only access), `'w'` (for writing a file, + this wipes out any existing content), `'a'` (for appending to an existing + file). + +* `file_object` is a variable name which will be used within the `block of + code` to access the opened file. For example the following will read all the text in `README.md` and print it. 
``` with open('README.md', 'r') as readme_file: print(readme_file.read()) ``` -> The `with` statement is an advanced python feature, however you will probably only encounter it when opening files. In that context it merely ensures that the file will be properly closed as soon as the program leaves the `with` statement (even if an error is raised within the `with` statement). -You could also use the `readlines()` method to get a list of all the lines. +> The `with` statement is an advanced python feature, however you will +> probably only encounter it when opening files. In that context it merely +> ensures that the file will be properly closed as soon as the program leaves +> the `with` statement (even if an error is raised within the `with` +> statement). + +You could also use the `readlines()` method to get a list of all the lines, or +simply "loop over" the file object to get the lines one by one: + +``` +with open('README.md', 'r') as readme_file: + print('First five lines...') + for i, line in enumerate(readme_file): + # each line is returned with its + # newline character still intact, + # so we use rstrip() to remove it. + print('{}: {}'.format(i, line.rstrip())) + if i == 4: + break +``` A very similar syntax is used to write files: ``` @@ -48,11 +84,21 @@ with open('02_text_io/my_file', 'w') as my_file: my_file.write('This is my first line\n') my_file.writelines(['Second line\n', 'and the third\n']) ``` -Note that no new line characters get added automatically. We can investigate the resulting file using + +Note that no new line characters get added automatically. We can investigate +the resulting file using + ``` !cat 02_text_io/my_file ``` -> Any lines starting with `!` will be interpreted as shell commands by ipython. It is great when playing around in the ipython notebook or in the ipython terminal, however it is an ipython-only feature and hence is not available when writing python scripts. 
How to call shell commands from python will be discussed in the `scripts` practical. + + +> In Jupyter notebook, (and in `ipython`/`fslipython`), any lines starting +> with `!` will be interpreted as shell commands. It is great when playing +> around in a Jupyter notebook or in the `ipython` terminal, however it is an +> ipython-only feature and hence is not available when writing python +> scripts. How to call shell commands from python will be discussed in the +> `scripts` practical. If we want to add to the existing file we can open it in the append mode: ``` @@ -61,21 +107,30 @@ with open('02_text_io/my_file', 'a') as my_file: !cat 02_text_io/my_file ``` -Below we will discuss how we can convert python objects to strings to store in these files and how to extract those python objects from strings again. +Below we will discuss how we can convert python objects to strings to store in +these files and how to extract those python objects from strings again. <a class="anchor" id="creating-new-strings"></a> ## Creating new strings <a class="anchor" id="string-syntax"></a> ### String syntax -Single-line strings can be created in python using either single or double quotes + + +Single-line strings can be created in python using either single or double +quotes: + ``` a_string = 'To be or not to be' same_string = "To be or not to be" print(a_string == same_string) ``` -The main rationale for choosing between single or double quotes, is whether the string itself will contain any quotes. You can include a single quote in a string surrounded by single quotes by escaping it with the `\` character, however in such a case it would be more convenient to use double quotes: +The main rationale for choosing between single or double quotes, is whether +the string itself will contain any quotes. 
You can include a single quote in a +string surrounded by single quotes by escaping it with the `\` character, +however in such a case it would be more convenient to use double quotes: + ``` a_string = "That's the question" same_string = 'That\'s the question' @@ -110,45 +165,78 @@ print("The 'c' and 'd' got concatenated, because we forgot the comma:", my_list_ <a class="anchor" id="unicode-versus-bytes"></a> #### unicode versus bytes -To encourage the spread of python around the world, python 3 switched to using unicode as the default for strings and code (which is one of the main reasons for the incompatibility between python 2 and 3). -This means that each element in a string is a unicode character (using [UTF-8 encoding](https://docs.python.org/3/howto/unicode.html)), which can consist of one or more bytes. -The advantage is that any unicode characters can now be used in strings or in the code itself: + +> **Note**: You can safely skip this section if you do not have any plans to +> work with binary files or non-English text in Python, and you do not want +> to know how to insert poop emojis into your code. + + +To encourage the spread of python around the world, python 3 switched to using +unicode as the default for strings and code (which is one of the main reasons +for the incompatibility between python 2 and 3). This means that each element +in a string is a unicode character (using [UTF-8 +encoding](https://docs.python.org/3/howto/unicode.html)), which can consist of +one or more bytes. The advantage is that any unicode characters can now be +used in strings or in the code itself: + ``` Δ = "café" print(Δ) ``` -In python 2 each element in a string was a single byte rather than a potentially multi-byte character. You can convert back to interpreting your sequence as a unicode string or a byte array using: +In python 2 each element in a string was a single byte rather than a +potentially multi-byte character. 
You can convert back to interpreting your +sequence as a unicode string or a byte array using: + * `encode()` called on a string converts it into a bytes array (`bytes` object) * `decode()` called on a `bytes` array converts it into a unicode string. + ``` delta = "Δ" print('The character', delta, 'consists of the following 2 bytes', delta.encode()) ``` -These byte arrays can be created directly be prepending the quotes enclosing the string with a `b`, which tells python 3 to interpret the following as a byte array: +These byte arrays can be created directly by prepending the quotes enclosing +the string with a `b`, which tells python 3 to interpret the following as a +byte array: + ``` a_byte_array = b'\xce\xa9' print('The two bytes ', a_byte_array, ' become single unicode character (', a_byte_array.decode(), ') with UTF-8 encoding') ``` -Especially in code dealing with strings (e.g., reading/writing of files) many of the errors arising of running python 2 code in python 3 arise from the mixing of unicode strings with byte arrays. Decoding and/or encoding some of these objects can often fix these issues. +Especially in code dealing with strings (e.g., reading/writing of files) many +of the errors arising from running python 2 code in python 3 arise from the +mixing of unicode strings with byte arrays. Decoding and/or encoding some of +these objects can often fix these issues. + +By default any file opened in python will be interpreted as unicode. If you +want to treat a file as raw bytes, you have to include a 'b' in the `mode` +when calling the `open()` function: -By default any file opened in python will be interpreted as unicode.
If you want to treat a file as raw bytes, you have to include a 'b' in the `mode` when calling the `open()` function: ``` import os.path as op with open(op.expandvars('${FSLDIR}/data/standard/MNI152_T1_1mm.nii.gz'), 'rb') as gzipped_nifti: print('First few bytes of gzipped NIFTI file:', gzipped_nifti.read(10)) ``` -> We use the `expandvars()` function here to insert the FSLDIR environmental variable into our string. This function will be presented in the file management practical. + +> We use the `expandvars()` function here to insert the FSLDIR environmental +> variable into our string. This function will be presented in the file +> management practical. + <a class="anchor" id="converting-objects-into-strings"></a> ### converting objects into strings -There are two functions to convert python objects into strings, `repr()` and `str()`. -All other functions that rely on string-representations of python objects will use one of these two (for example the `print()` function will call `str()` on the object). -The goal of the `str()` function is to be readable, while the goal of `repr()` is to be unambiguous. Compare +There are two functions to convert python objects into strings, `repr()` and +`str()`. All other functions that rely on string-representations of python +objects will use one of these two (for example the `print()` function will +call `str()` on the object). + +The goal of the `str()` function is to be readable, while the goal of `repr()` +is to be unambiguous. Compare + ``` print(str("3")) print(str(3)) @@ -198,10 +286,10 @@ print(full_string) <a class="anchor" id="string-formatting"></a> ### String formatting Using the techniques in [Combining strings](#combining-strings) we can build simple strings. For longer strings it is often useful to first write a template strings with some placeholders, where variables are later inserted. 
Built into python are currently 4 different ways of doing this (with many packages providing similar capabilities): -* the recommended [new-style formatting](https://docs.python.org/3.6/library/string.html#format-string-syntax). +* the recommended [new-style formatting](https://docs.python.org/3/library/string.html#format-string-syntax). * printf-like [old-style formatting](https://docs.python.org/3/library/stdtypes.html#old-string-formatting) -* [formatted string literals](https://docs.python.org/3.6/reference/lexical_analysis.html#f-strings) (these are only available in python 3.6+) -* bash-like [template-strings](https://docs.python.org/3.6/library/string.html#template-strings) +* [formatted string literals](https://docs.python.org/3/reference/lexical_analysis.html#f-strings) (these are only available in python 3.6+) +* bash-like [template-strings](https://docs.python.org/3/library/string.html#template-strings) Here we provide a single example using the first three methods, so you can recognize them in the future. @@ -238,6 +326,14 @@ This code block will fail in fslpython, since it uses python 3.5. <a class="anchor" id="extracting-information-from-strings"></a> ## Extracting information from strings + +The techniques shown in this section are useful if you are loading data from a +small text file or user input, or parsing a small amount of output from +e.g. `fslstats`. However, if you are working with large structured text data +(e.g. a big `csv` file), you should use the I/O capabilities of `numpy` or +`pandas` instead of doing things manually - this is covered in separate +practicals. + <a class="anchor" id="splitting-strings"></a> ### Splitting strings The simplest way to extract a sub-string is to use slicing @@ -271,9 +367,15 @@ print(list_without_whitespace) ``` > We use the syntax `[<expr> for <element> in <sequence>]` here which applies the `expr` to each `element` in the `sequence` and returns the resulting list.
This is a list comprehension - a convenient form in python to create a new list from the old one. + <a class="anchor" id="converting-strings-to-numbers"></a> ### Converting strings to numbers -Once you have extracted a number from a string, you can convert it into an actual integer or float by calling respectively `int()` or `float()` on it. `float()` understands a wide variety of different ways to write numbers: + + +Once you have extracted a number from a string, you can convert it into an +actual integer or float by calling respectively `int()` or `float()` on +it. `float()` understands a wide variety of different ways to write numbers: + ``` print(int("3")) print(float("3")) @@ -282,6 +384,7 @@ print(float("3.213e5")) print(float("3.213E-25")) ``` + <a class="anchor" id="regular-expressions"></a> ### Regular expressions Regular expressions are used for looking for specific patterns in a longer string. This can be used to extract specific information from a well-formatted string or to modify a string. In python regular expressions are available in the [re](https://docs.python.org/3/library/re.html#re-syntax) module. 
diff --git a/getting_started/03_file_management.ipynb b/getting_started/03_file_management.ipynb index 2dd68d75c84b58c1775c4740eadcfa19e97dd9ad..a677f0d0867b13e45ce59782acd44b90dc6d7447 100644 --- a/getting_started/03_file_management.ipynb +++ b/getting_started/03_file_management.ipynb @@ -15,11 +15,11 @@ "across the following modules:\n", "\n", "\n", - " - [`os`](https://docs.python.org/3.5/library/os.html)\n", - " - [`shutil`](https://docs.python.org/3.5/library/shutil.html)\n", - " - [`os.path`](https://docs.python.org/3.5/library/os.path.html)\n", - " - [`glob`](https://docs.python.org/3.5/library/glob.html)\n", - " - [`fnmatch`](https://docs.python.org/3.5/library/fnmatch.html)\n", + " - [`os`](https://docs.python.org/3/library/os.html)\n", + " - [`shutil`](https://docs.python.org/3/library/shutil.html)\n", + " - [`os.path`](https://docs.python.org/3/library/os.path.html)\n", + " - [`glob`](https://docs.python.org/3/library/glob.html)\n", + " - [`fnmatch`](https://docs.python.org/3/library/fnmatch.html)\n", "\n", "\n", "The `os` and `shutil` modules have functions allowing you to manage _files and\n", @@ -28,7 +28,7 @@ "\n", "\n", "> Another standard library -\n", - "> [`pathlib`](https://docs.python.org/3.5/library/pathlib.html) - was added in\n", + "> [`pathlib`](https://docs.python.org/3/library/pathlib.html) - was added in\n", "> Python 3.4, and provides an object-oriented interface to path management. We\n", "> aren't going to cover `pathlib` here, but feel free to take a look at it if\n", "> you are into that sort of thing.\n", @@ -296,7 +296,7 @@ "> Note that `os.walk` does not guarantee a specific ordering in the lists of\n", "> files and sub-directories that it returns. 
However, you can force an\n", "> ordering quite easily - see its\n", - "> [documentation](https://docs.python.org/3.5/library/os.html#os.walk) for\n", + "> [documentation](https://docs.python.org/3/library/os.html#os.walk) for\n", "> more details.\n", "\n", "\n", @@ -501,7 +501,7 @@ "> ```\n", ">\n", "> Take a look at the [official Python\n", - "> tutorial](https://docs.python.org/3.5/tutorial/controlflow.html#defining-functions)\n", + "> tutorial](https://docs.python.org/3/tutorial/controlflow.html#defining-functions)\n", "> for more details on defining your own functions.\n", "\n", "\n", @@ -608,7 +608,11 @@ "> Correct handling of them is an open problem in Computer Science, and is\n", "> considered by many to be unsolvable. For `imglob`, `imcp`, and `immv`-like\n", "> functionality, check out the `fsl.utils.path` and `fsl.utils.imcp` modules,\n", - "> part of the [`fslpy` project](https://pypi.python.org/pypi/fslpy).\n", + "> part of the [`fslpy`\n", + "> project](https://users.fmrib.ox.ac.uk/~paulmc/fsleyes/fslpy/latest/). If you\n", + "> are using `fslpython`, then you already have access to all of the functions\n", + "> in `fslpy`.\n", + "\n", "\n", "\n", "<a class=\"anchor\" id=\"absolute-and-relative-paths\"></a>\n", @@ -751,7 +755,7 @@ "source": [ "Now that we have this function, we can sort the directories in one line of\n", "code, via the built-in\n", - "[`sorted`](https://docs.python.org/3.5/library/functions.html#sorted)\n", + "[`sorted`](https://docs.python.org/3/library/functions.html#sorted)\n", "function. The directories will be sorted according to the `key` function that\n", "we specify, which provides a mapping from each directory to a sortable\n", ""key":" @@ -812,10 +816,10 @@ "\n", "Note that the syntax used by `glob` and `fnmatch` is similar, but __not__\n", "identical to the syntax that you are used to from `bash`. 
Refer to the\n", - "[`fnmatch` module](https://docs.python.org/3.5/library/fnmatch.html)\n", + "[`fnmatch` module](https://docs.python.org/3/library/fnmatch.html)\n", "documentation for details. If you need more complicated pattern matching, you\n", "can use regular expressions, available via the [`re`\n", - "module](https://docs.python.org/3.5/library/re.html).\n", + "module](https://docs.python.org/3/library/re.html).\n", "\n", "\n", "For example, let's retrieve all images that are in our data set:" @@ -1080,7 +1084,7 @@ "\n", "> There are many different types of exceptions in Python - a list of all the\n", "> built-in ones can be found\n", - "> [here](https://docs.python.org/3.5/library/exceptions.html). It is also easy\n", + "> [here](https://docs.python.org/3/library/exceptions.html). It is also easy\n", "> to define your own exceptions by creating a sub-class of `Exception` (beyond\n", "> the scope of this practical).\n", "\n", @@ -1346,7 +1350,7 @@ "\n", "\n", "You can read more about handling exceptions in Python\n", - "[here](https://docs.python.org/3.5/tutorial/errors.html).\n", + "[here](https://docs.python.org/3/tutorial/errors.html).\n", "\n", "\n", "### Raising exceptions\n", diff --git a/getting_started/03_file_management.md b/getting_started/03_file_management.md index 63b47989412b35a5a805f997ddbc5e424254f44f..0c4979a8a3321811f0381fdfb5277b63c75d999d 100644 --- a/getting_started/03_file_management.md +++ b/getting_started/03_file_management.md @@ -9,11 +9,11 @@ Most of Python's built-in functionality for managing files and paths is spread across the following modules: - - [`os`](https://docs.python.org/3.5/library/os.html) - - [`shutil`](https://docs.python.org/3.5/library/shutil.html) - - [`os.path`](https://docs.python.org/3.5/library/os.path.html) - - [`glob`](https://docs.python.org/3.5/library/glob.html) - - [`fnmatch`](https://docs.python.org/3.5/library/fnmatch.html) + - [`os`](https://docs.python.org/3/library/os.html) + - 
[`shutil`](https://docs.python.org/3/library/shutil.html) + - [`os.path`](https://docs.python.org/3/library/os.path.html) + - [`glob`](https://docs.python.org/3/library/glob.html) + - [`fnmatch`](https://docs.python.org/3/library/fnmatch.html) The `os` and `shutil` modules have functions allowing you to manage _files and @@ -22,7 +22,7 @@ managing file and directory _paths_. > Another standard library - -> [`pathlib`](https://docs.python.org/3.5/library/pathlib.html) - was added in +> [`pathlib`](https://docs.python.org/3/library/pathlib.html) - was added in > Python 3.4, and provides an object-oriented interface to path management. We > aren't going to cover `pathlib` here, but feel free to take a look at it if > you are into that sort of thing. @@ -226,7 +226,7 @@ for root, dirs, files in os.walk('raw_mri_data'): > Note that `os.walk` does not guarantee a specific ordering in the lists of > files and sub-directories that it returns. However, you can force an > ordering quite easily - see its -> [documentation](https://docs.python.org/3.5/library/os.html#os.walk) for +> [documentation](https://docs.python.org/3/library/os.html#os.walk) for > more details. @@ -391,7 +391,7 @@ def whatisit(path, existonly=False): > ``` > > Take a look at the [official Python -> tutorial](https://docs.python.org/3.5/tutorial/controlflow.html#defining-functions) +> tutorial](https://docs.python.org/3/tutorial/controlflow.html#defining-functions) > for more details on defining your own functions. @@ -474,7 +474,11 @@ print('Suffix: {}'.format(suffix)) > Correct handling of them is an open problem in Computer Science, and is > considered by many to be unsolvable. For `imglob`, `imcp`, and `immv`-like > functionality, check out the `fsl.utils.path` and `fsl.utils.imcp` modules, -> part of the [`fslpy` project](https://pypi.python.org/pypi/fslpy). +> part of the [`fslpy` +> project](https://users.fmrib.ox.ac.uk/~paulmc/fsleyes/fslpy/latest/). 
If you +> are using `fslpython`, then you already have access to all of the functions +> in `fslpy`. + <a class="anchor" id="absolute-and-relative-paths"></a> @@ -577,7 +581,7 @@ print(get_subject_id('raw_mri_data/subj_9')) Now that we have this function, we can sort the directories in one line of code, via the built-in -[`sorted`](https://docs.python.org/3.5/library/functions.html#sorted) +[`sorted`](https://docs.python.org/3/library/functions.html#sorted) function. The directories will be sorted according to the `key` function that we specify, which provides a mapping from each directory to a sortable "key": @@ -622,10 +626,10 @@ pattern matching logic. Note that the syntax used by `glob` and `fnmatch` is similar, but __not__ identical to the syntax that you are used to from `bash`. Refer to the -[`fnmatch` module](https://docs.python.org/3.5/library/fnmatch.html) +[`fnmatch` module](https://docs.python.org/3/library/fnmatch.html) documentation for details. If you need more complicated pattern matching, you can use regular expressions, available via the [`re` -module](https://docs.python.org/3.5/library/re.html). +module](https://docs.python.org/3/library/re.html). For example, let's retrieve all images that are in our data set: @@ -834,7 +838,7 @@ errors. For example, when you type CTRL+C into a running Python program, a > There are many different types of exceptions in Python - a list of all the > built-in ones can be found -> [here](https://docs.python.org/3.5/library/exceptions.html). It is also easy +> [here](https://docs.python.org/3/library/exceptions.html). It is also easy > to define your own exceptions by creating a sub-class of `Exception` (beyond > the scope of this practical). @@ -1037,7 +1041,7 @@ finally: You can read more about handling exceptions in Python -[here](https://docs.python.org/3.5/tutorial/errors.html). +[here](https://docs.python.org/3/tutorial/errors.html). 
### Raising exceptions diff --git a/getting_started/04_numpy.ipynb b/getting_started/04_numpy.ipynb index bccf81a46347a23fed03a371d09d3f0096f1204b..ce634dd4a41927e607d585cd04a6862296e56e46 100644 --- a/getting_started/04_numpy.ipynb +++ b/getting_started/04_numpy.ipynb @@ -19,11 +19,6 @@ "alternative to Matlab as a scientific computing platform.\n", "\n", "\n", - "The `fslpython` environment currently includes [Numpy\n", - "1.11.1](https://docs.scipy.org/doc/numpy-1.11.0/index.html), which is a little\n", - "out of date, but we will update it for the next release of FSL.\n", - "\n", - "\n", "## Contents\n", "\n", "\n", @@ -102,7 +97,7 @@ "source": [ "For simple tasks, you could stick with processing your data using python\n", "lists, and the built-in\n", - "[`math`](https://docs.python.org/3.5/library/math.html) library. And this\n", + "[`math`](https://docs.python.org/3/library/math.html) library. And this\n", "might be tempting, because it does look quite a lot like what you might type\n", "into Matlab.\n", "\n", @@ -444,8 +439,8 @@ "\n", "\n", "> <sup>2</sup> Python, being an object-oriented language, distinguishes\n", - "> between _functions_ and _methods_. Hopefully we all know what a function is\n", - "> - a _method_ is simply the term used to refer to a function that is\n", + "> between _functions_ and _methods_. Hopefully we all know what a function\n", + "> is - a _method_ is simply the term used to refer to a function that is\n", "> associated with a specific object. Similarly, the term _attribute_ is used\n", "> to refer to some piece of information that is attached to an object, such as\n", "> `z.shape`, or `z.dtype`.\n", @@ -819,7 +814,7 @@ "### Broadcasting\n", "\n", "\n", - "One of the coolest features of Numpy is _broadcasting_<sup>3</sup>.\n", + "One of the coolest features of Numpy is *broadcasting*<sup>3</sup>.\n", "Broadcasting allows you to perform element-wise operations on arrays which\n", "have a different shape. 
For each axis in the two arrays, Numpy will implicitly\n", "expand the shape of the smaller axis to match the shape of the larger one. You\n", @@ -1272,9 +1267,9 @@ "\n", "> <sup>4</sup> Even though these are FLIRT transforms, this is just a toy\n", "> example. Look\n", - "> [here](https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/FLIRT/FAQ#What_is_the_format_of_the_matrix_used_by_FLIRT.2C_and_how_does_it_relate_to_the_transformation_parameters.3F)\n", + "> [here](https://users.fmrib.ox.ac.uk/~paulmc/fsleyes/fslpy/latest/fsl.transform.flirt.html)\n", "> and\n", - "> [here](https://git.fmrib.ox.ac.uk/fsl/fslpy/blob/1.6.2/fsl/utils/transform.py#L537)\n", + "> [here](https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/FLIRT/FAQ#What_is_the_format_of_the_matrix_used_by_FLIRT.2C_and_how_does_it_relate_to_the_transformation_parameters.3F)\n", "> if you actually need to work with FLIRT transforms.\n", "\n", "\n", diff --git a/getting_started/04_numpy.md b/getting_started/04_numpy.md index 3a28cc830442b08628633cd04372f9091a6683ee..db6fcf904f7a5a9b7d1c0d2739f183f2db2e9cf7 100644 --- a/getting_started/04_numpy.md +++ b/getting_started/04_numpy.md @@ -13,11 +13,6 @@ important Python libraries, and it (along with its partners alternative to Matlab as a scientific computing platform. -The `fslpython` environment currently includes [Numpy -1.11.1](https://docs.scipy.org/doc/numpy-1.11.0/index.html), which is a little -out of date, but we will update it for the next release of FSL. - - ## Contents @@ -80,7 +75,7 @@ xyz_coords = [[-11.4, 1.0, 22.6], For simple tasks, you could stick with processing your data using python lists, and the built-in -[`math`](https://docs.python.org/3.5/library/math.html) library. And this +[`math`](https://docs.python.org/3/library/math.html) library. And this might be tempting, because it does look quite a lot like what you might type into Matlab. @@ -334,8 +329,8 @@ view of the array. 
> <sup>2</sup> Python, being an object-oriented language, distinguishes -> between _functions_ and _methods_. Hopefully we all know what a function is -> - a _method_ is simply the term used to refer to a function that is +> between _functions_ and _methods_. Hopefully we all know what a function +> is - a _method_ is simply the term used to refer to a function that is > associated with a specific object. Similarly, the term _attribute_ is used > to refer to some piece of information that is attached to an object, such as > `z.shape`, or `z.dtype`. @@ -611,7 +606,7 @@ like what you might expect from Matlab. You can find a brief overview of the ### Broadcasting -One of the coolest features of Numpy is _broadcasting_<sup>3</sup>. +One of the coolest features of Numpy is *broadcasting*<sup>3</sup>. Broadcasting allows you to perform element-wise operations on arrays which have a different shape. For each axis in the two arrays, Numpy will implicitly expand the shape of the smaller axis to match the shape of the larger one. You @@ -963,9 +958,9 @@ correct: > <sup>4</sup> Even though these are FLIRT transforms, this is just a toy > example. Look -> [here](https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/FLIRT/FAQ#What_is_the_format_of_the_matrix_used_by_FLIRT.2C_and_how_does_it_relate_to_the_transformation_parameters.3F) +> [here](https://users.fmrib.ox.ac.uk/~paulmc/fsleyes/fslpy/latest/fsl.transform.flirt.html) > and -> [here](https://git.fmrib.ox.ac.uk/fsl/fslpy/blob/1.6.2/fsl/utils/transform.py#L537) +> [here](https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/FLIRT/FAQ#What_is_the_format_of_the_matrix_used_by_FLIRT.2C_and_how_does_it_relate_to_the_transformation_parameters.3F) > if you actually need to work with FLIRT transforms. 
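The broadcasting rule touched by the `04_numpy` hunks above (each axis of the smaller shape is implicitly expanded to match the larger one) can be sketched in plain Python. This is only a minimal illustration, assuming the usual NumPy convention that shapes are compared from the trailing axis backwards and that an axis of length 1, or a missing leading axis, is stretched to fit. `broadcast_shape` is a hypothetical helper written for this note; it is not part of NumPy or of the practical.

```python
from itertools import zip_longest


def broadcast_shape(shape_a, shape_b):
    """Return the shape two array shapes broadcast to, or raise ValueError.

    Hypothetical helper for illustration only (not a NumPy function).
    """
    result = []
    # Walk both shapes from the trailing (last) axis backwards. A missing
    # leading axis is treated as if it had length 1.
    for a, b in zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1):
        if a == b or b == 1:
            result.append(a)      # axes agree, or b is stretched to match a
        elif a == 1:
            result.append(b)      # a is stretched to match b
        else:
            raise ValueError(
                f"shapes {shape_a} and {shape_b} cannot be broadcast")
    return tuple(reversed(result))


print(broadcast_shape((3, 4), (4,)))       # (3, 4)
print(broadcast_shape((5, 1, 3), (4, 1)))  # (5, 4, 3)
```

Shapes such as `(3, 4)` and `(3,)` make this helper raise `ValueError`, which corresponds to the case where NumPy itself refuses to combine the two arrays element-wise.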
diff --git a/getting_started/05_nifti.ipynb b/getting_started/05_nifti.ipynb index 0379ccf60db6ab2e28fc6cd2f25c026cc5a829e6..9d8551e0970ce313c6801143e06d5b4859db3b2d 100644 --- a/getting_started/05_nifti.ipynb +++ b/getting_started/05_nifti.ipynb @@ -6,7 +6,17 @@ "source": [ "# NIfTI images and python\n", "\n", - "The [nibabel](http://nipy.org/nibabel/) module is used to read and write NIfTI images and also some other medical imaging formats (e.g., ANALYZE, GIFTI, MINC, MGH). This module is included within the FSL python environment.\n", + "The [`nibabel`](http://nipy.org/nibabel/) module is used to read and write NIfTI\n", + "images and also some other medical imaging formats (e.g., ANALYZE, GIFTI,\n", + "MINC, MGH). `nibabel` is included within the FSL python environment.\n", + "\n", + "\n", + "Building upon `nibabel`, the\n", + "[`fslpy`](https://users.fmrib.ox.ac.uk/~paulmc/fsleyes/fslpy/latest/) library\n", + "contains a number of FSL-specific classes and functions which you may find\n", + "useful. 
But let's start with `nibabel` - `fslpy` is introduced in a different\n", + "practical (`advanced_topics/08_fslpy.ipynb`).\n", + "\n", "\n", "## Contents\n", "\n", @@ -36,10 +46,12 @@ "import os.path as op\n", "filename = op.expandvars('${FSLDIR}/data/standard/MNI152_T1_1mm.nii.gz')\n", "imobj = nib.load(filename, mmap=False)\n", + "\n", "# display header object\n", "imhdr = imobj.header\n", - "# extract data (as an numpy array)\n", - "imdat = imobj.get_data().astype(float)\n", + "\n", + "# extract data (as a numpy array)\n", + "imdat = imobj.get_fdata()\n", "print(imdat.shape)" ] }, @@ -47,29 +59,37 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "> Make sure you use the full filename, including the .nii.gz extension.\n", + "> Make sure you use the full filename, including the `.nii.gz` extension.\n", + "> `fslpy` provides FSL-like automatic file suffix detection though.\n", "\n", + "> We use the `expandvars()` function above to insert the FSLDIR\n", + "> environmental variable into our string. This function is\n", + "> discussed more fully in the file management practical.\n", "\n", - "> We use the expandvars() function above to insert the FSLDIR\n", - ">environmental variable into our string. This function is\n", - ">discussed more fully in the file management practical.\n", - " \n", - "Reading the data off the disk is not done until `get_data()` is called.\n", + "Reading the data off the disk is not done until `get_fdata()` is called.\n", "\n", "> Pitfall:\n", ">\n", - "> The option `mmap=False`is necessary as turns off memory mapping, which otherwise would be invoked for uncompressed NIfTI files but not for compressed files. Since some functionality behaves differently on memory mapped objects, it is advisable to turn this off.\n", + "> The option `mmap=False` disables memory mapping, which would otherwise be\n", + "> invoked for uncompressed NIfTI files but not for compressed files. 
Since\n", + "> some functionality behaves differently on memory mapped objects, it is\n", + "> advisable to turn this off unless you specifically want it.\n", "\n", "Once the data is read into a numpy array then it is easily manipulated.\n", "\n", - "> We recommend converting it to float at the start to avoid problems with integer arithmetic and overflow, though this is not compulsory.\n", + "> The `get_fdata` method will return floating point data, regardless of the\n", + "> underlying image data type. If you want the image data in the type that it\n", + "> is stored (e.g. integer ROI labels), then use\n", + "> `imdat = np.asanyarray(imobj.dataobj)` instead.\n", "\n", "---\n", "\n", "<a class=\"anchor\" id=\"header-info\"></a>\n", "## Header info\n", "\n", - "There are many methods available on the header object - for example, look at `dir(imhdr)` or `help(imhdr)` or the [nibabel webpage about NIfTI images](http://nipy.org/nibabel/nifti_images.html)\n", + "There are many methods available on the header object - for example, look at\n", + "`dir(imhdr)` or `help(imhdr)` or the [nibabel webpage about NIfTI\n", + "images](http://nipy.org/nibabel/nifti_images.html)\n", "\n", "<a class=\"anchor\" id=\"voxel-sizes\"></a>\n", "### Voxel sizes\n", @@ -137,7 +157,11 @@ "<a class=\"anchor\" id=\"writing-images\"></a>\n", "## Writing images\n", "\n", - "If you have created a modified image by making or modifying a numpy array then you need to put this into a NIfTI image object in order to save it to a file. 
The easiest way to do this is to copy all the header info from an existing image like this:" + "\n", + "If you have created a modified image by making or modifying a numpy array then\n", + "you need to put this into a NIfTI image object in order to save it to a file.\n", + "The easiest way to do this is to copy all the header info from an existing\n", + "image like this:" ] }, { @@ -156,7 +180,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "where `newdata` is the numpy array (the above is a random example only) and `imhdr` is the existing image header (as above).\n", + "where `newdata` is the numpy array (the above is a random example only) and\n", + "`imhdr` is the existing image header (as above).\n", "\n", "> It is possible to also just pass in an affine matrix rather than a\n", "> copied header, but we *strongly* recommend against this if you are\n", @@ -167,15 +192,24 @@ "> whenever possible, and just use the affine matrix option if you are\n", "> creating an entirely separate image, like a simulation.\n", "\n", - "If the voxel size of the image is different, then extra modifications will be required. For this, or for building an image from scratch, see the [nibabel documentation](http://nipy.org/nibabel/nifti_images.html) on NIfTI images.\n", + "If the voxel size of the image is different, then extra modifications will be\n", + "required. Take a look at the `fslpy` practical for some extra image\n", + "manipulation options, including cropping and resampling\n", + "(`advanced_topics/08_fslpy.ipynb`).\n", "\n", "---\n", "\n", - "<a class=\"anchor\" id=\"exercise\"></a>\n", + "\n", + "<a class=\"anchor\" id=\"exercises\"></a>\n", "## Exercise\n", "\n", - "Write some code to read in a 4D fMRI image (you can find one [here] if\n", - "you don't have one handy), calculate the tSNR and then save the 3D result." 
+ "\n", + "Write some code to read in a 4D fMRI image (you can find one\n", + "[here](http://www.fmrib.ox.ac.uk/~mark/files/av.nii.gz) if you don't have one\n", + "handy), calculate the tSNR and then save the 3D result.\n", + "\n", + "> The tSNR of a time series signal is simply its mean divided by its standard\n", + "> deviation." ] }, { diff --git a/getting_started/05_nifti.md b/getting_started/05_nifti.md index 5e554469539e111084f5ae87aa29ad8c9f534316..67139c538b3f79da2dd24057f8e82acaae4cebe7 100644 --- a/getting_started/05_nifti.md +++ b/getting_started/05_nifti.md @@ -1,6 +1,16 @@ # NIfTI images and python -The [nibabel](http://nipy.org/nibabel/) module is used to read and write NIfTI images and also some other medical imaging formats (e.g., ANALYZE, GIFTI, MINC, MGH). This module is included within the FSL python environment. +The [`nibabel`](http://nipy.org/nibabel/) module is used to read and write NIfTI +images and also some other medical imaging formats (e.g., ANALYZE, GIFTI, +MINC, MGH). `nibabel` is included within the FSL python environment. + + +Building upon `nibabel`, the +[`fslpy`](https://users.fmrib.ox.ac.uk/~paulmc/fsleyes/fslpy/latest/) library +contains a number of FSL-specific classes and functions which you may find +useful. But let's start with `nibabel` - `fslpy` is introduced in a different +practical (`advanced_topics/08_fslpy.ipynb`). + ## Contents @@ -24,36 +34,46 @@ import nibabel as nib import os.path as op filename = op.expandvars('${FSLDIR}/data/standard/MNI152_T1_1mm.nii.gz') imobj = nib.load(filename, mmap=False) + # display header object imhdr = imobj.header -# extract data (as an numpy array) -imdat = imobj.get_data().astype(float) + +# extract data (as a numpy array) +imdat = imobj.get_fdata() print(imdat.shape) ``` -> Make sure you use the full filename, including the .nii.gz extension. +> Make sure you use the full filename, including the `.nii.gz` extension. 
+> `fslpy` provides FSL-like automatic file suffix detection though. +> We use the `expandvars()` function above to insert the FSLDIR +> environmental variable into our string. This function is +> discussed more fully in the file management practical. -> We use the expandvars() function above to insert the FSLDIR ->environmental variable into our string. This function is ->discussed more fully in the file management practical. - -Reading the data off the disk is not done until `get_data()` is called. +Reading the data off the disk is not done until `get_fdata()` is called. > Pitfall: > -> The option `mmap=False`is necessary as turns off memory mapping, which otherwise would be invoked for uncompressed NIfTI files but not for compressed files. Since some functionality behaves differently on memory mapped objects, it is advisable to turn this off. +> The option `mmap=False` disables memory mapping, which would otherwise be +> invoked for uncompressed NIfTI files but not for compressed files. Since +> some functionality behaves differently on memory mapped objects, it is +> advisable to turn this off unless you specifically want it. Once the data is read into a numpy array then it is easily manipulated. -> We recommend converting it to float at the start to avoid problems with integer arithmetic and overflow, though this is not compulsory. +> The `get_fdata` method will return floating point data, regardless of the +> underlying image data type. If you want the image data in the type that it +> is stored (e.g. integer ROI labels), then use +> `imdat = np.asanyarray(imobj.dataobj)` instead. 
--- <a class="anchor" id="header-info"></a> ## Header info -There are many methods available on the header object - for example, look at `dir(imhdr)` or `help(imhdr)` or the [nibabel webpage about NIfTI images](http://nipy.org/nibabel/nifti_images.html) +There are many methods available on the header object - for example, look at +`dir(imhdr)` or `help(imhdr)` or the [nibabel webpage about NIfTI +images](http://nipy.org/nibabel/nifti_images.html) <a class="anchor" id="voxel-sizes"></a> ### Voxel sizes @@ -91,7 +111,11 @@ print(affine, code) <a class="anchor" id="writing-images"></a> ## Writing images -If you have created a modified image by making or modifying a numpy array then you need to put this into a NIfTI image object in order to save it to a file. The easiest way to do this is to copy all the header info from an existing image like this: + +If you have created a modified image by making or modifying a numpy array then +you need to put this into a NIfTI image object in order to save it to a file. +The easiest way to do this is to copy all the header info from an existing +image like this: ``` newdata = imdat * imdat @@ -99,7 +123,9 @@ newhdr = imhdr.copy() newobj = nib.nifti1.Nifti1Image(newdata, None, header=newhdr) nib.save(newobj, "mynewname.nii.gz") ``` -where `newdata` is the numpy array (the above is a random example only) and `imhdr` is the existing image header (as above). + +where `newdata` is the numpy array (the above is a random example only) and +`imhdr` is the existing image header (as above). > It is possible to also just pass in an affine matrix rather than a > copied header, but we *strongly* recommend against this if you are @@ -110,17 +136,25 @@ where `newdata` is the numpy array (the above is a random example only) and `imh > whenever possible, and just use the affine matrix option if you are > creating an entirely separate image, like a simulation. -If the voxel size of the image is different, then extra modifications will be required. 
For this, or for building an image from scratch, see the [nibabel documentation](http://nipy.org/nibabel/nifti_images.html) on NIfTI images. +If the voxel size of the image is different, then extra modifications will be +required. Take a look at the `fslpy` practical for some extra image +manipulation options, including cropping and resampling +(`advanced_topics/08_fslpy.ipynb`). --- -<a class="anchor" id="exercise"></a> + +<a class="anchor" id="exercises"></a> ## Exercise -Write some code to read in a 4D fMRI image (you can find one [here](http://www.fmrib.ox.ac.uk/~mark/files/av.nii.gz) if -you don't have one handy), calculate the tSNR and then save the 3D result. + +Write some code to read in a 4D fMRI image (you can find one +[here](http://www.fmrib.ox.ac.uk/~mark/files/av.nii.gz) if you don't have one +handy), calculate the tSNR and then save the 3D result. + +> The tSNR of a time series signal is simply its mean divided by its standard +> deviation. ``` # Calculate tSNR ``` - diff --git a/getting_started/06_plotting.ipynb b/getting_started/06_plotting.ipynb index cb00bc0a683e41a774668a0c5b05660d2604d340..447ff90cc5f5cb354107d49a3027bf228cec92ff 100644 --- a/getting_started/06_plotting.ipynb +++ b/getting_started/06_plotting.ipynb @@ -34,7 +34,7 @@ }, { "cell_type": "code", - "execution_count": 37, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -59,30 +59,9 @@ }, { "cell_type": "code", - "execution_count": 38, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "<matplotlib.text.Text at 0x119885748>" - ] - }, - "execution_count": 38, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "<matplotlib.figure.Figure at 0x11b4c7908>" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", @@ -114,27 +93,9 @@ }, { "cell_type": "code", - 
"execution_count": 39, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "#348ABD\n" - ] - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "<matplotlib.figure.Figure at 0x1197cacf8>" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "hdl = plt.plot(x, cosx)\n", "print(hdl[0].get_color())\n", @@ -158,20 +119,9 @@ }, { "cell_type": "code", - "execution_count": 40, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "<matplotlib.figure.Figure at 0x11a23c128>" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "r = np.random.rand(1000)\n", "n,bins,_ = plt.hist((r-0.5)**2, bins=30)" @@ -194,30 +144,9 @@ }, { "cell_type": "code", - "execution_count": 41, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "<matplotlib.legend.Legend at 0x11a50dfd0>" - ] - }, - "execution_count": 41, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "<matplotlib.figure.Figure at 0x119def588>" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "samp1 = r[0:10]\n", "samp2 = r[10:20]\n", @@ -241,34 +170,13 @@ }, { "cell_type": "code", - "execution_count": 42, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "(0.1039996137295216, 0.96287533552434978)" - ] - }, - "execution_count": 42, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "<matplotlib.figure.Figure at 0x11b481630>" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "fig, ax = plt.subplots()\n", "# setup some sizes for each point (arbitrarily example here)\n", - "ssize = 100*abs(samp1-samp2) + 
10 \n", + "ssize = 100*abs(samp1-samp2) + 10\n", "ax.scatter(samp1, samp2, s=ssize, alpha=0.5)\n", "# now add the y=x line\n", "allsamps = np.hstack((samp1,samp2))\n", @@ -298,30 +206,9 @@ }, { "cell_type": "code", - "execution_count": 43, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "<matplotlib.text.Text at 0x119d32a90>" - ] - }, - "execution_count": 43, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "<matplotlib.figure.Figure at 0x11a5331d0>" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "plt.subplot(2, 1, 1)\n", "plt.plot(x,cosx, '.-')\n", @@ -345,30 +232,43 @@ }, { "cell_type": "code", - "execution_count": 44, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "\n", - "text/plain": [ - "<matplotlib.figure.Figure at 0x1197ca390>" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "import nibabel as nib\n", "import os.path as op\n", "nim = nib.load(op.expandvars('${FSLDIR}/data/standard/MNI152_T1_1mm.nii.gz'), mmap=False)\n", "imdat = nim.get_data().astype(float)\n", - "plt.imshow(imdat[:,:,70], cmap=plt.cm.gray)\n", + "imslc = imdat[:,:,70]\n", + "plt.imshow(imslc, cmap=plt.cm.gray)\n", "plt.colorbar()\n", "plt.grid('off')" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that matplotlib will use the **voxel data orientation**, and that\n", + "configuring the plot orientation is **your responsibility**. To rotate a\n", + "slice, simply transpose the data (`.T`). 
To invert the data along an\n", + "axis, you don't need to modify the data - simply swap the axis limits around:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "plt.imshow(imslc.T, cmap=plt.cm.gray)\n", + "plt.xlim(reversed(plt.xlim()))\n", + "plt.ylim(reversed(plt.ylim()))\n", + "plt.colorbar()\n", + "plt.grid('off')\n" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -379,30 +279,9 @@ }, { "cell_type": "code", - "execution_count": 45, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "<mpl_toolkits.mplot3d.art3d.Line3DCollection at 0x11b74afd0>" - ] - }, - "execution_count": 45, - "metadata": {}, - "output_type": "execute_result" - }, - { - "data": { - "image/png": "\n", - "text/plain": [ - "<matplotlib.figure.Figure at 0x1197d75f8>" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "outputs": [], "source": [ "# Taken from https://matplotlib.org/gallery/mplot3d/wire3d.html#sphx-glr-gallery-mplot3d-wire3d-py\n", "\n", @@ -453,47 +332,15 @@ }, { "cell_type": "code", - "execution_count": 46, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Make up some data and do the funky plot" ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] } ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.5.2" - } - }, + "metadata": {}, "nbformat": 4, "nbformat_minor": 2 } diff --git a/getting_started/06_plotting.md b/getting_started/06_plotting.md index
00e8e695c15c7f934da66b76f72a0d481a6f9753..ddcd7ecb70dd1a6da5369ace7a0e270ad4de55a7 100644 --- a/getting_started/06_plotting.md +++ b/getting_started/06_plotting.md @@ -109,7 +109,7 @@ there is also an alternative: `scatter()` ``` fig, ax = plt.subplots() # setup some sizes for each point (arbitrarily example here) -ssize = 100*abs(samp1-samp2) + 10 +ssize = 100*abs(samp1-samp2) + 10 ax.scatter(samp1, samp2, s=ssize, alpha=0.5) # now add the y=x line allsamps = np.hstack((samp1,samp2)) @@ -153,11 +153,28 @@ import nibabel as nib import os.path as op nim = nib.load(op.expandvars('${FSLDIR}/data/standard/MNI152_T1_1mm.nii.gz'), mmap=False) imdat = nim.get_data().astype(float) -plt.imshow(imdat[:,:,70], cmap=plt.cm.gray) +imslc = imdat[:,:,70] +plt.imshow(imslc, cmap=plt.cm.gray) plt.colorbar() plt.grid('off') ``` +Note that matplotlib will use the **voxel data orientation**, and that +configuring the plot orientation is **your responsibility**. To rotate a +slice, simply transpose the data (`.T`). To invert the data along an +axis, you don't need to modify the data - simply swap the axis limits around: + + +``` +plt.imshow(imslc.T, cmap=plt.cm.gray) +plt.xlim(reversed(plt.xlim())) +plt.ylim(reversed(plt.ylim())) +plt.colorbar() +plt.grid('off') + +``` + + <a class="anchor" id="3D-plots"></a> ### 3D plots @@ -208,5 +225,3 @@ example code from the docs).
``` # Make up some data and do the funky plot ``` - - diff --git a/getting_started/07_jupyter.ipynb b/getting_started/07_jupyter.ipynb index d80ba601db7dd834af4c3d36c9b56055f3565e63..4cabf6b81bfef791ee40f0cc93d2517e51fe5d64 100644 --- a/getting_started/07_jupyter.ipynb +++ b/getting_started/07_jupyter.ipynb @@ -5,14 +5,27 @@ "metadata": {}, "source": [ "# Jupyter notebook and IPython\n", - "Our main interaction with python so far has been through the [Jupyter notebook](http://jupyter.org/).\n", - "These notebooks are extremely popular these days within the python scientific community, however they support many more languages, such as R and octave (and even matlab with the right [plugin](https://github.com/Calysto/matlab_kernel)).\n", - "They allow for interactive analysis of your data interspersed by explanatory notes (including LaTeX) with inline plotting.\n", - "However, they can not be called as scripts on the command line or be imported from other python code, which makes them rather stand-alone.\n", - "This makes them more useful for analysis that needs to be reproducible, but does not need to be replicated on different datasets (e.g., making a plot for a paper).\n", "\n", - "For more ad-hoc analysis it can be useful to just use the command line (i.e., a REPL).\n", - "We strongly recommend to use the IPython (available as `ipython` or `fslipython`) rather than default python REPL (available through `python` or `fslpython`)\n", + "Our main interaction with python so far has been through the [Jupyter\n", + "notebook](http://jupyter.org/). These notebooks are extremely popular these\n", + "days within the python scientific community, however they support many more\n", + "languages, such as R and octave (and even matlab with the right\n", + "[plugin](https://github.com/Calysto/matlab_kernel)). They allow for\n", + "interactive analysis of your data interspersed by explanatory notes (including\n", + "LaTeX) with inline plotting. 
However, they can not be called as scripts on\n", + "the command line or be imported from other python code, which makes them\n", + "rather stand-alone. This makes them more useful for analysis that needs to be\n", + "reproducible, but does not need to be replicated on different datasets (e.g.,\n", + "making a plot for a paper).\n", + "\n", + "For more ad-hoc analysis it can be useful to just use the command line (i.e.,\n", + "a REPL<sup>*</sup>). We strongly recommend using IPython (available as\n", + "`fslipython` or `ipython`) rather than the default python REPL (available through\n", + "`fslpython` or `python`), as IPython is much more user-friendly.\n", + "\n", + "> <sup>*</sup>REPL = **R**ead-**E**val-**P**rint-**L**oop - the geeky term for\n", + "> an interactive prompt. You may hear younger generations using the term\n", + "> [ESRR](https://www.youtube.com/watch?v=wBoRkg5-Ieg) instead.\n", "\n", "Both Ipython and the jupyter notebook offer a whole range of magic commands, which all start with a `%` sign.\n", "* A magic command starting with a single `%` sign will only affect the single line.\n", diff --git a/getting_started/07_jupyter.md b/getting_started/07_jupyter.md index d5e9de6bada6166e8dd93e0c0d8a1b40251983dc..221d1f951797cbecf8df8a5db9eddd768156b101 100644 --- a/getting_started/07_jupyter.md +++ b/getting_started/07_jupyter.md @@ -1,12 +1,25 @@ # Jupyter notebook and IPython -Our main interaction with python so far has been through the [Jupyter notebook](http://jupyter.org/). -These notebooks are extremely popular these days within the python scientific community, however they support many more languages, such as R and octave (and even matlab with the right [plugin](https://github.com/Calysto/matlab_kernel)). -They allow for interactive analysis of your data interspersed by explanatory notes (including LaTeX) with inline plotting.
-However, they can not be called as scripts on the command line or be imported from other python code, which makes them rather stand-alone. -This makes them more useful for analysis that needs to be reproducible, but does not need to be replicated on different datasets (e.g., making a plot for a paper). -For more ad-hoc analysis it can be useful to just use the command line (i.e., a REPL). -We strongly recommend to use the IPython (available as `ipython` or `fslipython`) rather than default python REPL (available through `python` or `fslpython`) +Our main interaction with python so far has been through the [Jupyter +notebook](http://jupyter.org/). These notebooks are extremely popular these +days within the python scientific community, however they support many more +languages, such as R and octave (and even matlab with the right +[plugin](https://github.com/Calysto/matlab_kernel)). They allow for +interactive analysis of your data interspersed by explanatory notes (including +LaTeX) with inline plotting. However, they can not be called as scripts on +the command line or be imported from other python code, which makes them +rather stand-alone. This makes them more useful for analysis that needs to be +reproducible, but does not need to be replicated on different datasets (e.g., +making a plot for a paper). + +For more ad-hoc analysis it can be useful to just use the command line (i.e., +a REPL<sup>*</sup>). We strongly recommend using IPython (available as +`fslipython` or `ipython`) rather than the default python REPL (available through +`fslpython` or `python`), as IPython is much more user-friendly. + +> <sup>*</sup>REPL = **R**ead-**E**val-**P**rint-**L**oop - the geeky term for +> an interactive prompt. You may hear younger generations using the term +> [ESRR](https://www.youtube.com/watch?v=wBoRkg5-Ieg) instead. Both Ipython and the jupyter notebook offer a whole range of magic commands, which all start with a `%` sign.
* A magic command starting with a single `%` sign will only affect the single line. @@ -206,4 +219,3 @@ We can now run this script You can access the full history of your session using `%history`. To save the history to a file use `%history -f <filename>`. You will probably have to clean a lot of erroneous commands you typed from that file before you are able to run it as a script. - diff --git a/getting_started/08_scripts.ipynb b/getting_started/08_scripts.ipynb index 16faa7ec08486151715108408828cb714ff3908a..2fa5d9d389cec346c1b77011b25f604861b43bf2 100644 --- a/getting_started/08_scripts.ipynb +++ b/getting_started/08_scripts.ipynb @@ -6,10 +6,19 @@ "source": [ "# Callable scripts in python\n", "\n", - "In this tutorial we will cover how to write simple stand-alone scripts in python that can be used as alternatives to bash scripts.\n", + "In this tutorial we will cover how to write simple stand-alone scripts in\n", + "python that can be used as alternatives to bash scripts.\n", "\n", - "There are some code blocks within this webpage, but for this practical we _**strongly\n", - "recommend that you write the code in an IDE or editor**_ instead and then run the scripts from a terminal.\n", + "**Important**: Throughout this series of practicals we have been working\n", + "entirely within the Jupyter notebook environment. But it's now time to\n", + "graduate to writing *real* Python scripts, and running them within a\n", + "*real* environment.\n", + "\n", + "So within this practical there are some code blocks, but instead of running\n", + "them inside the notebook we **strongly recommend that you write the code in\n", + "an IDE or editor**, and then run the scripts from a terminal.
[Don't\n", + "panic](https://www.youtube.com/watch?v=KojYatpLPSE), we're right here,\n", + "ready to help.\n", "\n", "## Contents\n", "\n", @@ -73,6 +82,7 @@ "outputs": [], "source": [ "import subprocess as sp\n", + "import shlex\n", "sp.run(['ls', '-la'])" ] }, @@ -80,6 +90,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "> Passing the arguments as a list is good practice and improves the safety of\n", + "> the call.\n", + "\n", "To suppress the output do this:" ] }, @@ -105,7 +118,7 @@ "metadata": {}, "outputs": [], "source": [ - "spobj = sp.run('ls -la'.split(), stdout = sp.PIPE)\n", + "spobj = sp.run(shlex.split('ls -la'), stdout = sp.PIPE)\n", "sout = spobj.stdout.decode('utf-8')\n", "print(sout)" ] @@ -114,7 +127,16 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "> Note that the `decode` call in the middle line converts the string from a byte string to a normal string. In Python 3 there is a distinction between strings (sequences of characters, possibly using multiple bytes to store each character) and bytes (sequences of bytes). The world has moved on from ASCII, so in this day and age, this distinction is absolutely necessary, and Python does a fairly good job of it.\n", + "> shlex.split and shlex.quote are functions designed to break up and quote\n", + "> (respectively) shell command lines and arguments. Quoting of user provided\n", + "> arguments helps to prevent unintended consequences from inappropriate inputs.\n", + ">\n", + "> Note that the `decode` call in the middle line converts the string from a byte\n", + "> string to a normal string. In Python 3 there is a distinction between strings\n", + "> (sequences of characters, possibly using multiple bytes to store each\n", + "> character) and bytes (sequences of bytes). 
The world has moved on from ASCII,\n", + "> so in this day and age, this distinction is absolutely necessary, and Python\n", + "> does a fairly good job of it.\n", "\n", "If the output is numerical then this can be extracted like this:" ] @@ -129,8 +151,9 @@ "fsldir = os.getenv('FSLDIR')\n", "spobj = sp.run([fsldir+'/bin/fslstats', fsldir+'/data/standard/MNI152_T1_1mm_brain', '-V'], stdout = sp.PIPE)\n", "sout = spobj.stdout.decode('utf-8')\n", - "vol_vox = float(sout.split()[0])\n", - "vol_mm = float(sout.split()[1])\n", + "results = sout.split()\n", + "vol_vox = float(results[0])\n", + "vol_mm = float(results[1])\n", "print('Volumes are: ', vol_vox, ' in voxels and ', vol_mm, ' in mm')" ] }, @@ -147,6 +170,7 @@ "metadata": {}, "outputs": [], "source": [ + "import shlex\n", "commands = \"\"\"\n", "{fsldir}/bin/fslmaths {t1} -bin {t1_mask}\n", "{fsldir}/bin/fslmaths {t2} -mas {t1_mask} {t2_masked}\n", @@ -156,10 +180,10 @@ "commands = commands.format(t1 = 't1.nii.gz', t1_mask = 't1_mask', t2 = 't2', t2_masked = 't2_masked', fsldir = fsldirpath)\n", "\n", "sout=[]\n", - "for cmd in commands.split('\\n'):\n", + "for cmd in commands.splitlines():\n", " if cmd: # avoids empty strings getting passed to sp.run()\n", " print('Running command: ', cmd)\n", - " spobj = sp.run(cmd.split(), stdout = sp.PIPE)\n", + " spobj = sp.run(shlex.split(cmd), stdout = sp.PIPE)\n", " sout.append(spobj.stdout.decode('utf-8'))" ] }, @@ -167,6 +191,24 @@ "cell_type": "markdown", "metadata": {}, "source": [ + "> Don't be tempted to use the shell=True argument to subprocess.run, especially\n", + "> if you are dealing with user input - if the user gave\n", + "> *myfile; rm -f ~*\n", + "> as a file name and you called the command with shell=True **and** you\n", + "> passed the command in as a string then bad things happen!\n", + ">\n", + "> The safe way to use these kinds of inputs is to pass them through shlex.quote()\n", + "> before sending.\n", + ">\n", + "> ```a = shlex.quote('myfile; rm -f 
~')\n", + "> cmd = \"ls {}\".format(a)\n", + "> sp.run(shlex.split(cmd))```\n", + "\n", + "\n", + "> If you're calling lots of FSL tools, the `fslpy` library has a number of\n", + "> *wrapper* functions, which can be used to call an FSL command directly\n", + "> from Python - check out `advanced_topics/08_fslpy.ipynb`.\n", + "\n", "<a class=\"anchor\" id=\"command-line-arguments\"></a>\n", "## Command line arguments\n", "\n", @@ -190,6 +232,8 @@ "source": [ "For more sophisticated argument parsing you can use `argparse` - good documentation and examples of this can be found on the web.\n", "\n", + "> argparse can automatically produce help text for the user, validate input etc., so it is strongly recommended.\n", + "\n", "---\n", "\n", "<a class=\"anchor\" id=\"example-script\"></a>\n", @@ -213,7 +257,7 @@ "outfile=$2\n", "# mask input image with MNI\n", "$FSLDIR/bin/fslmaths $infile -mas $FSLDIR/data/standard/MNI152_T1_1mm_brain $outfile\n", - "# calculate volumes of masked image \n", + "# calculate volumes of masked image\n", "vv=`$FSLDIR/bin/fslstats $outfile -V`\n", "vol_vox=`echo $vv | awk '{ print $1 }'`\n", "vol_mm=`echo $vv | awk '{ print $2 }'`\n", @@ -244,11 +288,12 @@ "outfile = sys.argv[2]\n", "# mask input image with MNI\n", "spobj = sp.run([fsldir+'/bin/fslmaths', infile, '-mas', fsldir+'/data/standard/MNI152_T1_1mm_brain', outfile], stdout = sp.PIPE)\n", - "# calculate volumes of masked image \n", + "# calculate volumes of masked image\n", "spobj = sp.run([fsldir+'/bin/fslstats', outfile, '-V'], stdout = sp.PIPE)\n", "sout = spobj.stdout.decode('utf-8')\n", - "vol_vox = float(sout.split()[0])\n", - "vol_mm = float(sout.split()[1])\n", + "results = sout.split()\n", + "vol_vox = float(results[0])\n", + "vol_mm = float(results[1])\n", "print('Volumes are: ', vol_vox, ' in voxels and ', vol_mm, ' in mm')" ] }, @@ -275,25 +320,7 @@ ] } ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - 
"language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.5.2" - } - }, + "metadata": {}, "nbformat": 4, "nbformat_minor": 2 } diff --git a/getting_started/08_scripts.md b/getting_started/08_scripts.md index ec759f3f5b655e1c3d32ab881f1bc74b8e5dce32..ab06a61c8777523ff59d7596091eb63070142b52 100644 --- a/getting_started/08_scripts.md +++ b/getting_started/08_scripts.md @@ -1,9 +1,18 @@ # Callable scripts in python -In this tutorial we will cover how to write simple stand-alone scripts in python that can be used as alternatives to bash scripts. +In this tutorial we will cover how to write simple stand-alone scripts in +python that can be used as alternatives to bash scripts. -There are some code blocks within this webpage, but for this practical we _**strongly -recommend that you write the code in an IDE or editor**_ instead and then run the scripts from a terminal. +**Important**: Throughout this series of practicals we have been working +entirely within the Jupyter notebook environment. But it's now time to +graduate to writing *real* Python scripts, and running them within a +*real* enviromnent. + +So within this practical there are some code blocks, but instead of running +them inside the notebook we **strongly recommend that you write the code in +an IDE or editor**,and then run the scripts from a terminal. [Don't +panic](https://www.youtube.com/watch?v=KojYatpLPSE), we're right here, +ready to help. ## Contents @@ -63,10 +72,10 @@ print(sout) > arguments helps to prevent unintended consequences from inappropriate inputs. > > Note that the `decode` call in the middle line converts the string from a byte -> string to a normal string. 
In Python 3 there is a distinction between strings -> (sequences of characters, possibly using multiple bytes to store each -> character) and bytes (sequences of bytes). The world has moved on from ASCII, -> so in this day and age, this distinction is absolutely necessary, and Python +> string to a normal string. In Python 3 there is a distinction between strings +> (sequences of characters, possibly using multiple bytes to store each +> character) and bytes (sequences of bytes). The world has moved on from ASCII, +> so in this day and age, this distinction is absolutely necessary, and Python > does a fairly good job of it. If the output is numerical then this can be extracted like this: @@ -100,7 +109,7 @@ for cmd in commands.splitlines(): sout.append(spobj.stdout.decode('utf-8')) ``` -> Don't be tempted to use the shell=True argument to subprocess.run, especially +> Don't be tempted to use the shell=True argument to subprocess.run, especially > if you are dealing with user input - if the user gave > *myfile; rm -f ~* > as a file name and you called the command with shell=True **and** you @@ -113,6 +122,11 @@ for cmd in commands.splitlines(): > cmd = "ls {}".format(a) > sp.run(shlex.split(cmd))``` + +> If you're calling lots of FSL tools, the `fslpy` library has a number of +> *wrapper* functions, which can be used to call an FSL command directly +> from Python - check out `advanced_topics/08_fslpy.ipynb`. 
+ <a class="anchor" id="command-line-arguments"></a> ## Command line arguments @@ -144,7 +158,7 @@ infile=$1 outfile=$2 # mask input image with MNI $FSLDIR/bin/fslmaths $infile -mas $FSLDIR/data/standard/MNI152_T1_1mm_brain $outfile -# calculate volumes of masked image +# calculate volumes of masked image vv=`$FSLDIR/bin/fslstats $outfile -V` vol_vox=`echo $vv | awk '{ print $1 }'` vol_mm=`echo $vv | awk '{ print $2 }'` @@ -166,7 +180,7 @@ infile = sys.argv[1] outfile = sys.argv[2] # mask input image with MNI spobj = sp.run([fsldir+'/bin/fslmaths', infile, '-mas', fsldir+'/data/standard/MNI152_T1_1mm_brain', outfile], stdout = sp.PIPE) -# calculate volumes of masked image +# calculate volumes of masked image spobj = sp.run([fsldir+'/bin/fslstats', outfile, '-V'], stdout = sp.PIPE) sout = spobj.stdout.decode('utf-8') results = sout.split() @@ -186,4 +200,3 @@ mean or a _sum_ (and hence can do something that fslstats cannot!) ``` # Don't write anything here - do it in a standalone script! 
``` - diff --git a/getting_started/09_pandas.ipynb b/getting_started/09_pandas.ipynb index 7c50300ad7b5eb18280f746acaff9e0e8bd4ec39..4c330f68381c403eadc97b50e1fe56c5cef716ff 100644 --- a/getting_started/09_pandas.ipynb +++ b/getting_started/09_pandas.ipynb @@ -4,21 +4,19 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Pandas" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Pandas is a data analysis library focussed on the cleaning and exploration of tabular data.\n", + "# Pandas\n", + "\n", + "Pandas is a data analysis library focused on the cleaning and exploration of\n", + "tabular data.\n", "\n", "Some useful links are:\n", "- [main website](https://pandas.pydata.org)\n", "- [documentation](http://pandas.pydata.org/pandas-docs/stable/)<sup>1</sup>\n", - "- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)<sup>1</sup> by Jake van der Plas\n", + "- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)<sup>1</sup> by\n", + " Jake van der Plas\n", "\n", - "<sup>1</sup> This tutorial borrows heavily from the pandas documentation and the Python Data Science Handbook" + "<sup>1</sup> This tutorial borrows heavily from the pandas documentation and\n", + "the Python Data Science Handbook" ] }, { @@ -29,6 +27,7 @@ "source": [ "%pylab inline\n", "import pandas as pd # pd is the usual abbreviation for pandas\n", + "import matplotlib.pyplot as plt # matplotlib for plotting\n", "import seaborn as sns # seaborn is the main plotting library for Pandas\n", "import statsmodels.api as sm # statsmodels fits linear models to pandas data\n", "import statsmodels.formula.api as smf\n", @@ -40,22 +39,21 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Loading in data" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Pandas supports a wide range of I/O tools to load from text files, binary files, and SQL databases. 
You can find a table with all formats [here](http://pandas.pydata.org/pandas-docs/stable/io.html)." + "> We will mostly be using `seaborn` instead of `matplotlib` for\n", + "> visualisation. But `seaborn` is actually an extension to `matplotlib`, so we\n", + "> are still using the latter under the hood.\n", + "\n", + "## Loading in data\n", + "\n", + "Pandas supports a wide range of I/O tools to load from text files, binary files,\n", + "and SQL databases. You can find a table with all formats\n", + "[here](http://pandas.pydata.org/pandas-docs/stable/io.html)." ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv')\n", @@ -66,9 +64,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "This loads the data into a [DataFrame](https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.DataFrame.html) object, which is the main object we will be interacting with in pandas. It represents a table of data.\n", - "\n", - "The other file formats all start with `pd.read_{format}`. Note that we can provide the URL to the dataset, rather than download it beforehand.\n", + "This loads the data into a\n", + "[`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)\n", + "object, which is the main object we will be interacting with in pandas. It\n", + "represents a table of data. The other file formats all start with\n", + "`pd.read_{format}`. 
Note that we can provide the URL to the dataset, rather\n", + "than download it beforehand.\n", "\n", "We can write out the dataset using `dataframe.to_{format}(<filename)`:" ] @@ -76,9 +77,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic.to_csv('titanic_copy.csv', index=False) # we set index to False to prevent pandas from storing the row names" @@ -88,7 +87,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "If you can not connect to the internet, you can run the command below to load this locally stored titanic dataset" + "If you can not connect to the internet, you can run the command below to load\n", + "this locally stored titanic dataset" ] }, { @@ -105,15 +105,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Note that the titanic dataset was also available to us as one of the standard datasets included with seaborn. We could load it from there using" + "Note that the titanic dataset was also available to us as one of the standard\n", + "datasets included with seaborn. We could load it from there using" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "sns.load_dataset('titanic')" @@ -123,15 +122,15 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Dataframes can also be created from other python objects, using pd.DataFrame.from_{other type}. The most useful of these is from_dict, which converts a mapping of the columns to a pandas DataFrame (i.e., table).\n" + "`Dataframes` can also be created from other python objects, using\n", + "`pd.DataFrame.from_{other type}`. The most useful of these is `from_dict`,\n", + "which converts a mapping of the columns to a pandas `DataFrame` (i.e., table)." 
] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "pd.DataFrame.from_dict({\n", @@ -147,15 +146,15 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For many applications (e.g., ICA, machine learning input) you might want to extract your data as a numpy array. The underlying numpy array can be accessed using the `values` attribute" + "For many applications (e.g., ICA, machine learning input) you might want to\n", + "extract your data as a numpy array. The underlying numpy array can be accessed\n", + "using the `values` attribute" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic.values" @@ -165,15 +164,15 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Note that the type of the returned array is the most common type (in this case object). If you just want the numeric parts of the table you can use `select_dtype`, which selects specific columns based on their dtype:" + "Note that the type of the returned array is the most common type (in this case\n", + "object). 
If you just want the numeric parts of the table you can use\n", + "`select_dtypes`, which selects specific columns based on their dtype:" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic.select_dtypes(include=np.number).values" @@ -184,16 +183,14 @@ "metadata": {}, "source": [ "Note that the numpy array has no information on the column names or row indices.\n", - "\n", - "Alternatively, when you want to include the categorical variables in your later analysis (e.g., for machine learning), you can extract dummy variables using: " + "Alternatively, when you want to include the categorical variables in your later\n", + "analysis (e.g., for machine learning), you can extract dummy variables using:" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "pd.get_dummies(titanic)" @@ -203,36 +200,19 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Accessing parts of the data" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "[Documentation on indexing](http://pandas.pydata.org/pandas-docs/stable/indexing.html)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Selecting columns by name" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ + "## Accessing parts of the data\n", + "\n", + "[Documentation on indexing](http://pandas.pydata.org/pandas-docs/stable/indexing.html)\n", + "\n", + "### Selecting columns by name\n", + "\n", "Single columns can be selected using the normal python indexing:" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic['embark_town']" @@ -242,15 +222,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "If the column names are simple strings (not required) we can also access it 
directly as an attribute" + "If the column names are simple strings (not required) we can also access it\n", + "directly as an attribute" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic.embark_town" @@ -260,17 +239,17 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Note that this returns a pandas [Series](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.Series.html) rather than a DataFrame object. A Series is simply a 1-dimensional array representing a single column.\n", - "\n", - "Multiple columns can be returned by providing a list of columns names. This will return a DataFrame:" + "Note that this returns a pandas\n", + "[`Series`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html)\n", + "rather than a `DataFrame` object. A `Series` is simply a 1-dimensional array\n", + "representing a single column. Multiple columns can be returned by providing a\n", + "list of columns names. This will return a `DataFrame`:" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic[['class', 'alive']]" @@ -280,15 +259,15 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Note that you have to provide a list here (square brackets). If you provide a tuple (round brackets) pandas will think you are trying to access a single column that has that tuple as a name:" + "Note that you have to provide a list here (square brackets). 
If you provide a\n", + "tuple (round brackets) pandas will think you are trying to access a single\n", + "column that has that tuple as a name:" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic[('class', 'alive')]" @@ -298,29 +277,21 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In this case there is no column called ('class', 'alive') leading to an error. Later on we will see some uses to having columns named like this." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Indexing rows by name or integer" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Individual rows can be accessed based on their name (i.e., the index) or integer (i.e., which row it is in). In our current table this will give the same results. To ensure that these are different, let's sort our titanic dataset based on the passenger fare:" + "In this case there is no column called `('class', 'alive')` leading to an\n", + "error. Later on we will see some uses to having columns named like this.\n", + "\n", + "### Indexing rows by name or integer\n", + "\n", + "Individual rows can be accessed based on their name (i.e., the index) or integer\n", + "(i.e., which row it is in). In our current table this will give the same\n", + "results. 
To ensure that these are different, let's sort our titanic dataset\n", + "based on the passenger fare:" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic_sorted = titanic.sort_values('fare')\n", @@ -331,17 +302,16 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Note that the re-sorting did not change the values in the index (i.e., left-most column).\n", + "Note that the re-sorting did not change the values in the index (i.e., left-most\n", + "column).\n", "\n", - "We can select the first row of this newly sorted table using iloc" + "We can select the first row of this newly sorted table using `iloc`" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic_sorted.iloc[0]" @@ -357,9 +327,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic_sorted.loc[0]" @@ -369,15 +337,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Note that this gives the same passenger as the first row of the initial table before sorting" + "Note that this gives the same passenger as the first row of the initial table\n", + "before sorting" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic.iloc[0]" @@ -387,15 +354,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Another common way to access the first or last N rows of a table is using the head/tail methods" + "Another common way to access the first or last N rows of a table is using the\n", + "head/tail methods" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic_sorted.head(3)" @@ -404,9 +370,7 @@ { "cell_type": "code", "execution_count": 
null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic_sorted.tail(3)" @@ -416,15 +380,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Note that nearly all methods in pandas return a new Dataframe, which means that we can easily call another method on them" + "Note that nearly all methods in pandas return a new `DataFrame`, which means\n", + "that we can easily call another method on them" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic_sorted.tail(10).head(5) # select the first 5 of the last 10 passengers in the database" @@ -433,9 +396,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic_sorted.iloc[-10:-5] # alternative way to get the same passengers" @@ -445,15 +406,15 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Exercise: use sorting and tail/head or indexing to find the 10 youngest passengers on the titanic. Try to do this on a single line by chaining calls to the titanic dataframe object" + "**Exercise**: use sorting and tail/head or indexing to find the 10 youngest\n", + "passengers on the titanic. Try to do this on a single line by chaining calls\n", + "to the titanic `DataFrame` object" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic.sort_values..."
@@ -463,22 +424,15 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Indexing rows by value" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ + "### Indexing rows by value\n", + "\n", "One final way to select specific columns is by their value" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic[titanic.sex == 'female'] # selects all females" @@ -487,9 +441,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "# select all passengers older than 60 who departed from Southampton\n", @@ -500,17 +452,20 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Note that this required typing \"titanic\" quite often. A quicker way to get the same result is using the `query` method, which is described in detail [here](http://pandas.pydata.org/pandas-docs/stable/indexing.html#the-query-method) (note that using the `query` method is also faster and uses a lot less memory).\n", + "Note that this required typing `titanic` quite often. A quicker way to get the\n", + "same result is using the `query` method, which is described in detail\n", + "[here](http://pandas.pydata.org/pandas-docs/stable/indexing.html#the-query-method)\n", + "(note that using the `query` method is also faster and uses a lot less\n", + "memory).\n", "\n", - "> You may have trouble using the query method with columns which have a name that cannot be used as a Python identifier." + "> You may have trouble using the `query` method with columns which have\n", + "a name that cannot be used as a Python identifier." 
] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic.query('(age > 60) & (embark_town == \"Southampton\")')" @@ -520,15 +475,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Particularly useful when selecting data like this is the `isna` method which finds all missing data" + "Particularly useful when selecting data like this is the `isna` method which\n", + "finds all missing data" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic[~titanic.age.isna()] # select first few passengers whose age is not N/A" @@ -544,9 +498,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic.dropna() # drops all passengers that have some datapoint missing" @@ -555,9 +507,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic.dropna(subset=['age', 'fare']) # Only drop passengers with missing ages or fares" @@ -567,15 +517,15 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Exercise: use sorting, indexing by value, dropna and tail/head or indexing to find the 10 oldest female passengers on the titanic. Try to do this on a single line by chaining calls to the titanic dataframe object" + "**Exercise**: use sorting, indexing by value, `dropna` and `tail`/`head` or\n", + "indexing to find the 10 oldest female passengers on the titanic. Try to do\n", + "this on a single line by chaining calls to the titanic `DataFrame` object" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic..." 
@@ -585,24 +535,16 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Plotting the data" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Before we start analyzing the data, let's play around with visualizing it. \n", + "## Plotting the data\n", "\n", + "Before we start analyzing the data, let's play around with visualizing it.\n", "Pandas does have some basic built-in plotting options:" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic.fare.hist(bins=20, log=True)" @@ -611,9 +553,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic.age.plot()" @@ -623,15 +563,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Individual columns are essentially 1D arrays, so we can use them as such in matplotlib" + "Individual columns are essentially 1D arrays, so we can use them as such in\n", + "`matplotlib`" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "plt.scatter(titanic.age, titanic.fare)" @@ -641,17 +580,21 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "However, for most purposes much nicer plots can be obtained using [Seaborn](https://seaborn.pydata.org). Seaborn has support to produce plots showing the [univariate](https://seaborn.pydata.org/tutorial/distributions.html#plotting-univariate-distributions) or [bivariate](https://seaborn.pydata.org/tutorial/distributions.html#plotting-bivariate-distributions) distribution of data in a single or a grid of plots.\n", - "\n", - "Most of the seaborn plotting functions expect to get a pandas dataframe (although they will work with Numpy arrays as well). So we can plot age vs. 
fare like:" + "However, for most purposes much nicer plots can be obtained using\n", + "[Seaborn](https://seaborn.pydata.org). Seaborn has support to produce plots\n", + "showing the\n", + "[univariate](https://seaborn.pydata.org/tutorial/distributions.html#plotting-univariate-distributions)\n", + "or\n", + "[bivariate](https://seaborn.pydata.org/tutorial/distributions.html#plotting-bivariate-distributions)\n", + "distribution of data in a single or a grid of plots. Most of the seaborn\n", + "plotting functions expect to get a pandas `DataFrame` (although they will work\n", + "with Numpy arrays as well). So we can plot age vs. fare like:" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "sns.jointplot('age', 'fare', data=titanic)" @@ -661,15 +604,15 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Exercise: check the documentation from `sns.jointplot` (hover the mouse over the text \"jointplot\" and press shift-tab) to find out how to turn the scatter plot into a density (kde) map" + "**Exercise**: check the documentation from `sns.jointplot` (hover the mouse\n", + "over the text `jointplot` and press shift-tab) to find out how to turn the\n", + "scatter plot into a density (kde) map" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "sns.jointplot('age', 'fare', data=titanic, ...)" @@ -679,15 +622,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Here is just a brief example of how we can use multiple columns to illustrate the data in more detail" + "Here is just a brief example of how we can use multiple columns to illustrate\n", + "the data in more detail" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "sns.relplot(x='age', y='fare', col='class', hue='sex', data=titanic,\n", 
@@ -698,15 +640,16 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Exercise: Split the plot above into two rows with the first row including the passengers who survived and the second row those who did not (you might have to check the documentation again by using shift-tab while overing the mouse over `relplot`) " + "**Exercise**: Split the plot above into two rows with the first row including\n", + "the passengers who survived and the second row those who did not (you might\n", + "have to check the documentation again by using shift-tab while hovering the\n", + "mouse over `relplot`)" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "sns.relplot(x='age', y='fare', col='class', hue='sex', data=titanic,\n", @@ -717,19 +660,20 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "One of the nice thing of Seaborn is how easy it is to update how these plots look. You can read more about that [here](https://seaborn.pydata.org/tutorial/aesthetics.html). For example, to increase the font size to get a plot more approriate for a talk, you can use:" + "One of the nice things about Seaborn is how easy it is to update how these plots\n", + "look. You can read more about that\n", + "[here](https://seaborn.pydata.org/tutorial/aesthetics.html).
For example, to\n", + "increase the font size to get a plot more appropriate for a talk, you can use:" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "sns.set_context('talk')\n", - "sns.violinplot(x='class', y='age', hue='sex', data=titanic, split=True, \n", + "sns.violinplot(x='class', y='age', hue='sex', data=titanic, split=True,\n", " order=('First', 'Second', 'Third'))" ] }, @@ -737,22 +681,17 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Summarizing the data (mean, std, etc.)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "There are a large number of built-in methods to summarize the observations in a Pandas dataframe. Most of these will return a Series with the columns names as index:" + "## Summarizing the data (mean, std, etc.)\n", + "\n", + "There are a large number of built-in methods to summarize the observations in\n", + "a Pandas `DataFrame`.
Most of these will return a `Series` with the columns\n", + "names as index:" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic.mean()" @@ -761,9 +700,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic.quantile(0.75)" @@ -773,15 +710,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "One very useful one is `describe`, which gives an overview of many common summary measures" + "One very useful one is `describe`, which gives an overview of many common\n", + "summary measures" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic.describe()" @@ -793,21 +729,20 @@ "source": [ "Note that non-numeric columns are ignored when summarizing data in this way.\n", "\n", - "We can also define our own functions to apply to the columns (in this case we have to explicitly set the data types)." + "We can also define our own functions to apply to the columns (in this case we\n", + "have to explicitly set the data types)." 
] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "def mad(series):\n", " \"\"\"\n", " Computes the median absolute deviatation (MAD)\n", - " \n", + "\n", " This is a outlier-resistant measure of the standard deviation\n", " \"\"\"\n", " no_nan = series.dropna()\n", @@ -820,15 +755,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We can also provide multiple functions to the `apply` method (note that functions can be provided as strings)" + "We can also provide multiple functions to the `apply` method (note that\n", + "functions can be provided as strings)" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic.select_dtypes(np.number).apply(['mean', np.median, np.std, mad])" @@ -838,22 +772,20 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Grouping by" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "One of the more powerful features of is `groupby`, which splits the dataset on a categorical variable. The book contains a clear tutorial on that feature [here](https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html). You can check the pandas documentation [here](http://pandas.pydata.org/pandas-docs/stable/groupby.html) for a more formal introduction. One simple use is just to put it into a loop" + "### Grouping by\n", + "\n", + "One of the more powerful features of is `groupby`, which splits the dataset on\n", + "a categorical variable. The book contains a clear tutorial on that feature\n", + "[here](https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html). You\n", + "can check the pandas documentation\n", + "[here](http://pandas.pydata.org/pandas-docs/stable/groupby.html) for a more\n", + "formal introduction. 
One simple use is just to put it into a loop" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "for cls, part_table in titanic.groupby('class'):\n", @@ -864,7 +796,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "However, it is more often combined with one of the aggregation functions discussed above as illustrated in this figure from the [Python data science handbook](https://jakevdp.github.io/PythonDataScienceHandbook/06.00-figure-code.html#Split-Apply-Combine)\n", + "However, it is more often combined with one of the aggregation functions\n", + "discussed above as illustrated in this figure from the [Python data science\n", + "handbook](https://jakevdp.github.io/PythonDataScienceHandbook/06.00-figure-code.html#Split-Apply-Combine)\n", "\n", "" ] @@ -872,9 +806,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic.groupby('class').mean()" @@ -890,9 +822,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic.groupby(['class', 'survived']).mean() # as always in pandas supply multiple column names as lists, not tuples" @@ -902,15 +832,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "When grouping it can help to use the `cut` method to split a continuous variable into a categorical one" + "When grouping it can help to use the `cut` method to split a continuous variable\n", + "into a categorical one" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic.groupby(['class', pd.cut(titanic.age, bins=(0, 18, 50, np.inf))]).mean()" @@ -926,9 +855,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ 
"titanic.groupby(['class', 'survived']).aggregate((np.median, mad))" @@ -938,17 +865,18 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Note that both the index (on the left) and the column names (on the top) now have multiple levels. Such a multi-level index is referred to as `MultiIndex`. This does complicate selecting specific columns/rows. You can read more of using `MultiIndex` [here](http://pandas.pydata.org/pandas-docs/stable/advanced.html).\n", - "\n", - "The short version is that columns can be selected using direct indexing (as discussed above)" + "Note that both the index (on the left) and the column names (on the top) now\n", + "have multiple levels. Such a multi-level index is referred to as `MultiIndex`.\n", + "This does complicate selecting specific columns/rows. You can read more of using\n", + "`MultiIndex` [here](http://pandas.pydata.org/pandas-docs/stable/advanced.html).\n", + "The short version is that columns can be selected using direct indexing (as\n", + "discussed above)" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "df_full = titanic.groupby(['class', 'survived']).aggregate((np.median, mad))" @@ -957,9 +885,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "df_full[('age', 'median')] # selects median age column; note that the round brackets are optional" @@ -968,9 +894,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "df_full['age'] # selects both age columns" @@ -980,15 +904,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Remember that indexing based on the index was done through `loc`. The rest is the same as for the columns above" + "Remember that indexing based on the index was done through `loc`. 
The rest is\n", + "the same as for the columns above" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "df_full.loc[('First', 0)]" @@ -997,9 +920,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "df_full.loc['First']\n" @@ -1015,9 +936,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "df_full.xs(0, level='survived') # selects all the zero's from the survived index" @@ -1026,34 +945,26 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, - "outputs": [], - "source": [ - "df_full.xs('mad', axis=1, level=1) # selects mad from the second level in the columns (i.e., axis=1) " - ] - }, - { - "cell_type": "markdown", "metadata": {}, + "outputs": [], "source": [ - "## Reshaping tables" + "df_full.xs('mad', axis=1, level=1) # selects mad from the second level in the columns (i.e., axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "If we were interested in how the survival rate depends on the class and sex of the passengers we could simply use a groupby:" + "## Reshaping tables\n", + "\n", + "If we were interested in how the survival rate depends on the class and sex of\n", + "the passengers we could simply use a groupby:" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic.groupby(['class', 'sex']).survived.mean()" @@ -1063,7 +974,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "However, this single-column table is difficult to read. The reason for this is that the indexing is multi-leveled (called `MultiIndex` in pandas), while there is only a single column. We would like to move one of the levels in the index to the columns. 
This can be done using `stack`/`unstack`:\n", + "However, this single-column table is difficult to read. The reason for this is\n", + "that the indexing is multi-leveled (called `MultiIndex` in pandas), while there\n", + "is only a single column. We would like to move one of the levels in the index to\n", + "the columns. This can be done using `stack`/`unstack`:\n", + "\n", "- `unstack`: Moves one levels in the index to the columns\n", "- `stack`: Moves one of levels in the columns to the index" ] @@ -1071,9 +986,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic.groupby(['class', 'sex']).survived.mean().unstack('sex')" @@ -1083,7 +996,15 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The former table, where the different groups are defined in different rows, is often referred to as long-form. After unstacking the table is often referred to as wide-form as the different group (sex in this case) is now represented as different columns. In pandas some operations are easier on long-form tables (e.g., `groupby`) while others require wide_form tables (e.g., making scatter plots of two variables). You can go back and forth using `unstack` or `stack` as illustrated above, but as this is a crucial part of pandas there are many alternatives, such as `pivot_table`, `melt`, and `wide_to_long`, which we will discuss below.\n", + "The former table, where the different groups are defined in different rows, is\n", + "often referred to as long-form. After unstacking the table is often referred to\n", + "as wide-form as the different group (sex in this case) is now represented as\n", + "different columns. In pandas some operations are easier on long-form tables\n", + "(e.g., `groupby`) while others require wide_form tables (e.g., making scatter\n", + "plots of two variables). 
You can go back and forth using `unstack` or `stack` as\n", + "illustrated above, but as this is a crucial part of pandas there are many\n", + "alternatives, such as `pivot_table`, `melt`, and `wide_to_long`, which we will\n", + "discuss below.\n", "\n", "We can prettify the table further using seaborn" ] @@ -1091,12 +1012,10 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ - "ax = sns.heatmap(titanic.groupby(['class', 'sex']).survived.mean().unstack('sex'), \n", + "ax = sns.heatmap(titanic.groupby(['class', 'sex']).survived.mean().unstack('sex'),\n", " annot=True)\n", "ax.set_title('survival rate')" ] @@ -1105,22 +1024,18 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Note that there are also many ways to produce prettier tables in pandas (e.g., color all the negative values). This is documented [here](http://pandas.pydata.org/pandas-docs/stable/style.html)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Because this stacking/unstacking is fairly common after a groupby operation, there is a shortcut for it: `pivot_table`" + "Note that there are also many ways to produce prettier tables in pandas (e.g.,\n", + "color all the negative values). 
This is documented\n", + "[here](http://pandas.pydata.org/pandas-docs/stable/style.html).\n", + "\n", + "Because this stacking/unstacking is fairly common after a groupby operation,\n", + "there is a shortcut for it: `pivot_table`" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "titanic.pivot_table('survived', 'class', 'sex')" @@ -1136,9 +1051,7 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "sns.heatmap(titanic.pivot_table('survived', ['class', 'embark_town'], ['sex', pd.cut(titanic.age, (0, 18, np.inf))]), annot=True)" @@ -1154,12 +1067,10 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ - "sns.heatmap(titanic.pivot_table('survived', ['class', 'embark_town'], ['sex', pd.cut(titanic.age, (0, 18, np.inf))], \n", + "sns.heatmap(titanic.pivot_table('survived', ['class', 'embark_town'], ['sex', pd.cut(titanic.age, (0, 18, np.inf))],\n", " aggfunc='count'), annot=True)" ] }, @@ -1167,7 +1078,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "As in `groupby` the aggregation function can be a string of a common aggregation function, or any function that should be applied.\n", + "As in `groupby` the aggregation function can be a string of a common aggregation\n", + "function, or any function that should be applied.\n", "\n", "We can even apply different aggregate functions to different columns" ] @@ -1175,12 +1087,10 @@ { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ - "titanic.pivot_table(index='class', columns='sex', \n", + "titanic.pivot_table(index='class', columns='sex',\n", " aggfunc={'survived': 'count', 'fare': np.mean}) # compute number of survivors and mean fare\n" ] }, @@ -1188,15 +1098,16 @@ 
"cell_type": "markdown", "metadata": {}, "source": [ - "The opposite of `pivot_table` is `melt`. This can be used to change a wide-form table into a long-form table. This is not particularly useful on the titanic dataset, so let's create a new table where this might be useful. Let's say we have a dataset listing the FA and MD values in various WM tracts:" + "The opposite of `pivot_table` is `melt`. This can be used to change a wide-form\n", + "table into a long-form table. This is not particularly useful on the titanic\n", + "dataset, so let's create a new table where this might be useful. Let's say we\n", + "have a dataset listing the FA and MD values in various WM tracts:" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "tracts = ('Corpus callosum', 'Internal capsule', 'SLF', 'Arcuate fasciculus')\n", @@ -1211,15 +1122,17 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "This wide-form table (i.e., all the information is in different columns) makes it hard to select just all the FA values or only the values associated with the SLF. For this it would be easier to lismt all the values in a single column. Most of the tools discussed above (e.g., `group_by` or `seaborn` plotting) work better with long-form data, which we can obtain from `melt`: " + "This wide-form table (i.e., all the information is in different columns) makes\n", + "it hard to select just all the FA values or only the values associated with the\n", + "SLF. 
For this it would be easier to list all the values in a single column.\n", + "Most of the tools discussed above (e.g., `group_by` or `seaborn` plotting) work\n", + "better with long-form data, which we can obtain from `melt`:" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "df_long = df_wide.melt('subject', var_name='measurement', value_name='dti_value')\n", @@ -1230,15 +1143,18 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We can see that `melt` took all the columns (we could also have specified a specific sub-set) and returned each measurement as a seperate row. We probably want to seperate the measurement column into the measurement type (FA or MD) and the tract name. Many string manipulation function are available in the `DataFrame` object under `DataFrame.str` ([tutorial](http://pandas.pydata.org/pandas-docs/stable/text.html))" + "We can see that `melt` took all the columns (we could also have specified a\n", + "specific sub-set) and returned each measurement as a seperate row. We probably\n", + "want to seperate the measurement column into the measurement type (FA or MD) and\n", + "the tract name. Many string manipulation function are available in the\n", + "`DataFrame` object under `DataFrame.str`\n", + "([tutorial](http://pandas.pydata.org/pandas-docs/stable/text.html))" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "df_long['variable'] = df_long.measurement.str.slice(0, 2) # first two letters correspond to FA or MD\n", @@ -1250,17 +1166,16 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Finally we probably do want the FA and MD variables as different columns. \n", + "Finally we probably do want the FA and MD variables as different columns.\n", "\n", - "*Exercise*: Use `pivot_table` or `stack`/`unstack` to create a column for MD and FA." 
+ "**Exercise**: Use `pivot_table` or `stack`/`unstack` to create a column for MD\n", + "and FA." ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "df_unstacked = df_long." @@ -1270,15 +1185,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We can now use the tools discussed above to visualize the table (`seaborn`) or to group the table based on tract (`groupby` or `pivot_table`)." + "We can now use the tools discussed above to visualize the table (`seaborn`) or\n", + "to group the table based on tract (`groupby` or `pivot_table`)." ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "# feel free to analyze this random data in more detail" @@ -1288,15 +1202,16 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "In general pandas is better at handling long-form than wide-form data, although for better visualization of the data an intermediate format is often best. One exception is calculating a covariance (`DataFrame.cov`) or correlation (`DataFrame.corr`) matrices which computes the correlation between each column:" + "In general pandas is better at handling long-form than wide-form data, although\n", + "for better visualization of the data an intermediate format is often best. 
One\n", + "exception is calculating a covariance (`DataFrame.cov`) or correlation\n", + "(`DataFrame.corr`) matrices which computes the correlation between each column:" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "sns.heatmap(df_wide.corr(), cmap=sns.diverging_palette(240, 10, s=99, n=300), )" @@ -1306,24 +1221,27 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Linear fitting (statsmodels)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Linear fitting between the different columns is available through the [statsmodels](https://www.statsmodels.org/stable/index.html) library. A nice way to play around with a wide variety of possible models is to use R-style functions. The usage of the functions in stastmodels is described [here](https://www.statsmodels.org/dev/example_formulas.html). You can find a more detailed description of the R-style functions [here](https://patsy.readthedocs.io/en/latest/formulas.html#the-formula-language). \n", + "## Linear fitting (`statsmodels`)\n", + "\n", + "Linear fitting between the different columns is available through the\n", + "[`statsmodels`](https://www.statsmodels.org/stable/index.html) library. A nice\n", + "way to play around with a wide variety of possible models is to use R-style\n", + "functions. The usage of the functions in `statsmodels` is described\n", + "[here](https://www.statsmodels.org/dev/example_formulas.html). You can find a\n", + "more detailed description of the R-style functions\n", + "[here](https://patsy.readthedocs.io/en/latest/formulas.html#the-formula-\n", + "language).\n", "\n", - "In short these functions describe the linear model as a string. For example, \"y ~ x + a + x * a\" fits the variable `y` as a function of `x`, `a`, and the interaction between `x` and `a`. The intercept is included by default (you can add \"+ 0\" to remove it)." 
+ "In short these functions describe the linear model as a string. For example,\n", + "`\"y ~ x + a + x * a\"` fits the variable `y` as a function of `x`, `a`, and the\n", + "interaction between `x` and `a`. The intercept is included by default (you can\n", + "add `\"+ 0\"` to remove it)." ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "result = smf.logit('survived ~ age + sex + age * sex', data=titanic).fit()\n", @@ -1334,17 +1252,18 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Note that statsmodels understands categorical variables and automatically replaces them with dummy variables.\n", + "Note that `statsmodels` understands categorical variables and automatically\n", + "replaces them with dummy variables.\n", "\n", - "Above we used logistic regression, which is appropriate for the binary survival rate. A wide variety of linear models are available. Let's try a GLM, but assume that the fare is drawn from a Gamma distribution:" + "Above we used logistic regression, which is appropriate for the binary\n", + "survival rate. A wide variety of linear models are available. Let's try a GLM,\n", + "but assume that the fare is drawn from a Gamma distribution:" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "collapsed": true - }, + "metadata": {}, "outputs": [], "source": [ "age_dmean = titanic.age - titanic.age.mean()\n", @@ -1359,66 +1278,32 @@ "Cherbourg passengers clearly paid a lot more...\n", "\n", "\n", - "Note that we did not actually add the age_dmean to the dataframe. Statsmodels (or more precisely the underlying [patsy](https://patsy.readthedocs.io/en/latest/) library) automatically extracted this from our environment. This can lead to confusing behaviour..." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# More reading" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ + "Note that we did not actually add the `age_dmean` to the\n", + "`DataFrame`. `statsmodels` (or more precisely the underlying\n", + "[patsy](https://patsy.readthedocs.io/en/latest/) library) automatically\n", + "extracted this from our environment. This can lead to confusing behaviour...\n", + "\n", + "# More reading\n", + "\n", "Other useful features\n", - "- [Concatenating](https://jakevdp.github.io/PythonDataScienceHandbook/03.06-concat-and-append.html) and [merging](https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html) of tables\n", - "- [Lots of](http://pandas.pydata.org/pandas-docs/stable/basics.html#dt-accessor) [time](http://pandas.pydata.org/pandas-docs/stable/timeseries.html) [series](http://pandas.pydata.org/pandas-docs/stable/timedeltas.html) support\n", - "- [Rolling Window functions](http://pandas.pydata.org/pandas-docs/stable/computation.html#window-functions) for after you have meaningfully sorted your data\n", + "\n", + "- [Concatenating](https://jakevdp.github.io/PythonDataScienceHandbook/03.06-concat-and-append.html)\n", + " and\n", + " [merging](https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html)\n", + " of tables\n", + "- [Lots\n", + " of](http://pandas.pydata.org/pandas-docs/stable/basics.html#dt-accessor)\n", + " [time](http://pandas.pydata.org/pandas-docs/stable/timeseries.html)\n", + " [series](http://pandas.pydata.org/pandas-docs/stable/timedeltas.html)\n", + " support\n", + "- [Rolling Window\n", + " functions](http://pandas.pydata.org/pandas-docs/stable/computation.html#window-\n", + " functions) for after you have meaningfully sorted your data\n", "- and much, much more" ] } ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - 
"codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.2" - }, - "toc": { - "colors": { - "hover_highlight": "#DAA520", - "running_highlight": "#FF0000", - "selected_highlight": "#FFD700" - }, - "moveMenuLeft": true, - "nav_menu": { - "height": "225px", - "width": "252px" - }, - "navigate_menu": true, - "number_sections": true, - "sideBar": true, - "threshold": 4, - "toc_cell": false, - "toc_section_display": "block", - "toc_window_display": false - } - }, + "metadata": {}, "nbformat": 4, "nbformat_minor": 2 } diff --git a/getting_started/09_pandas.md b/getting_started/09_pandas.md new file mode 100644 index 0000000000000000000000000000000000000000..bb99481ef0d52ff8b996fec024b56e496bfe7af5 --- /dev/null +++ b/getting_started/09_pandas.md @@ -0,0 +1,658 @@ +# Pandas + +Pandas is a data analysis library focused on the cleaning and exploration of +tabular data. + +Some useful links are: +- [main website](https://pandas.pydata.org) +- [documentation](http://pandas.pydata.org/pandas-docs/stable/)<sup>1</sup> +- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)<sup>1</sup> by + Jake van der Plas + +<sup>1</sup> This tutorial borrows heavily from the pandas documentation and +the Python Data Science Handbook + +``` +%pylab inline +import pandas as pd # pd is the usual abbreviation for pandas +import matplotlib.pyplot as plt # matplotlib for plotting +import seaborn as sns # seaborn is the main plotting library for Pandas +import statsmodels.api as sm # statsmodels fits linear models to pandas data +import statsmodels.formula.api as smf +from IPython.display import Image +sns.set() # use the prettier seaborn plotting settings rather than the default matplotlib one +``` + +> We will mostly be using `seaborn` instead of `matplotlib` for +> visualisation. 
But `seaborn` is actually an extension to `matplotlib`, so we +> are still using the latter under the hood. + +## Loading in data + +Pandas supports a wide range of I/O tools to load from text files, binary files, +and SQL databases. You can find a table with all formats +[here](http://pandas.pydata.org/pandas-docs/stable/io.html). + +``` +titanic = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv') +titanic +``` + +This loads the data into a +[`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) +object, which is the main object we will be interacting with in pandas. It +represents a table of data. The readers for the other file formats are all +named `pd.read_{format}`. Note that we can provide the URL to the dataset, rather +than download it beforehand. + +We can write out the dataset using `dataframe.to_{format}(<filename>)`: + +``` +titanic.to_csv('titanic_copy.csv', index=False) # we set index to False to prevent pandas from storing the row names +``` + +If you cannot connect to the internet, you can run the command below to load +this locally stored titanic dataset + +``` +titanic = pd.read_csv('09_pandas/titanic.csv') +titanic +``` + +Note that the titanic dataset is also available to us as one of the standard +example datasets provided by seaborn. We could load it from there using + +``` +sns.load_dataset('titanic') +``` + +A `DataFrame` can also be created from other Python objects, using +`pd.DataFrame.from_{other type}`. The most useful of these is `from_dict`, +which converts a mapping of the columns to a pandas `DataFrame` (i.e., table). + +``` +pd.DataFrame.from_dict({ + 'random numbers': np.random.rand(5), + 'sequence (int)': np.arange(5), + 'sequence (float)': np.linspace(0, 5, 5), + 'letters': list('abcde'), + 'constant_value': 'same_value' +}) +``` + +For many applications (e.g., ICA, machine learning input) you might want to +extract your data as a numpy array. 
The underlying numpy array can be accessed +using the `values` attribute + +``` +titanic.values +``` + +Note that the dtype of the returned array is the most general type needed to +hold all columns (in this case `object`). If you just want the numeric parts of the table you can use +`select_dtypes`, which selects specific columns based on their dtype: + +``` +titanic.select_dtypes(include=np.number).values +``` + +Note that the numpy array has no information on the column names or row indices. +Alternatively, when you want to include the categorical variables in your later +analysis (e.g., for machine learning), you can extract dummy variables using: + +``` +pd.get_dummies(titanic) +``` + +## Accessing parts of the data + +[Documentation on indexing](http://pandas.pydata.org/pandas-docs/stable/indexing.html) + +### Selecting columns by name + +Single columns can be selected using the normal Python indexing: + +``` +titanic['embark_town'] +``` + +If a column name is a simple string (this is not required) we can also access +the column directly as an attribute + +``` +titanic.embark_town +``` + +Note that this returns a pandas +[`Series`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html) +rather than a `DataFrame` object. A `Series` is simply a 1-dimensional array +representing a single column. Multiple columns can be returned by providing a +list of column names. This will return a `DataFrame`: + +``` +titanic[['class', 'alive']] +``` + +Note that you have to provide a list here (square brackets). If you provide a +tuple (round brackets) pandas will think you are trying to access a single +column that has that tuple as a name: + +``` +titanic[('class', 'alive')] +``` + +In this case there is no column called `('class', 'alive')`, leading to an +error. Later on we will see some uses for having columns named like this. + +### Indexing rows by name or integer + +Individual rows can be accessed based on their name (i.e., the index) or integer +(i.e., which row it is in). 
In our current table this will give the same +results. To see the difference between the two, let's sort our titanic dataset +based on the passenger fare: + +``` +titanic_sorted = titanic.sort_values('fare') +titanic_sorted +``` + +Note that the re-sorting did not change the values in the index (i.e., left-most +column). + +We can select the first row of this newly sorted table using `iloc` + +``` +titanic_sorted.iloc[0] +``` + +We can select the row with the index 0 using + +``` +titanic_sorted.loc[0] +``` + +Note that this gives the same passenger as the first row of the initial table +before sorting + +``` +titanic.iloc[0] +``` + +Another common way to access the first or last N rows of a table is using the +`head`/`tail` methods + +``` +titanic_sorted.head(3) +``` + +``` +titanic_sorted.tail(3) +``` + +Note that nearly all methods in pandas return a new `DataFrame`, which means +that we can easily call another method on the result + +``` +titanic_sorted.tail(10).head(5) # select the first 5 of the last 10 passengers in the database +``` + +``` +titanic_sorted.iloc[-10:-5] # alternative way to get the same passengers +``` + +**Exercise**: use sorting and tail/head or indexing to find the 10 youngest +passengers on the titanic. Try to do this on a single line by chaining calls +to the titanic `DataFrame` object + +```{.python .input} +titanic.sort_values... +``` + +### Indexing rows by value + +One final way to select specific rows is by their value + +``` +titanic[titanic.sex == 'female'] # selects all females +``` + +``` +# select all passengers older than 60 who departed from Southampton +titanic[(titanic.age > 60) & (titanic['embark_town'] == 'Southampton')] +``` + +Note that this required typing `titanic` quite often. 
A quicker way to get the +same result is using the `query` method, which is described in detail +[here](http://pandas.pydata.org/pandas-docs/stable/indexing.html#the-query-method) +(note that for large tables the `query` method can also be faster and use less +memory). + +> You may have trouble using the `query` method with columns which have +> a name that cannot be used as a Python identifier. + +``` +titanic.query('(age > 60) & (embark_town == "Southampton")') +``` + +Particularly useful when selecting data like this is the `isna` method which +finds all missing data + +``` +titanic[~titanic.age.isna()] # select all passengers whose age is not N/A +``` + +Removing missing values like this is so common that it has its own method + +``` +titanic.dropna() # drops all passengers that have some datapoint missing +``` + +``` +titanic.dropna(subset=['age', 'fare']) # Only drop passengers with missing ages or fares +``` + +**Exercise**: use sorting, indexing by value, `dropna` and `tail`/`head` or +indexing to find the 10 oldest female passengers on the titanic. Try to do +this on a single line by chaining calls to the titanic `DataFrame` object + +``` +titanic... +``` + +## Plotting the data + +Before we start analyzing the data, let's play around with visualizing it. +Pandas does have some basic built-in plotting options: + +``` +titanic.fare.hist(bins=20, log=True) +``` + +``` +titanic.age.plot() +``` + +Individual columns are essentially 1D arrays, so we can use them as such in +`matplotlib` + +``` +plt.scatter(titanic.age, titanic.fare) +``` + +However, for most purposes much nicer plots can be obtained using +[Seaborn](https://seaborn.pydata.org). Seaborn can produce plots +showing the +[univariate](https://seaborn.pydata.org/tutorial/distributions.html#plotting-univariate-distributions) +or +[bivariate](https://seaborn.pydata.org/tutorial/distributions.html#plotting-bivariate-distributions) +distribution of data in a single plot or a grid of plots. 
Most of the seaborn +plotting functions expect to get a pandas `DataFrame` (although they will work +with NumPy arrays as well). So we can plot age vs. fare like: + +``` +sns.jointplot(x='age', y='fare', data=titanic) +``` + +**Exercise**: check the documentation for `sns.jointplot` (hover the mouse +over the text `jointplot` and press shift-tab) to find out how to turn the +scatter plot into a density (kde) map + +``` +sns.jointplot(x='age', y='fare', data=titanic, ...) +``` + +Here is just a brief example of how we can use multiple columns to illustrate +the data in more detail + +``` +sns.relplot(x='age', y='fare', col='class', hue='sex', data=titanic, + col_order=('First', 'Second', 'Third')) +``` + +**Exercise**: Split the plot above into two rows with the first row including +the passengers who survived and the second row those who did not (you might +have to check the documentation again by using shift-tab while hovering the +mouse over `relplot`) + +``` +sns.relplot(x='age', y='fare', col='class', hue='sex', data=titanic, + col_order=('First', 'Second', 'Third')...) +``` + +One of the nice things about Seaborn is how easy it is to update how these plots +look. You can read more about that +[here](https://seaborn.pydata.org/tutorial/aesthetics.html). For example, to +increase the font size to get a plot more appropriate for a talk, you can use: + +``` +sns.set_context('talk') +sns.violinplot(x='class', y='age', hue='sex', data=titanic, split=True, + order=('First', 'Second', 'Third')) +``` + +## Summarizing the data (mean, std, etc.) + +There are a large number of built-in methods to summarize the observations in +a Pandas `DataFrame`. Most of these will return a `Series` with the column +names as index: + +``` +titanic.mean() +``` + +``` +titanic.quantile(0.75) +``` + +One very useful one is `describe`, which gives an overview of many common +summary measures + +``` +titanic.describe() +``` + +Note that non-numeric columns are ignored when summarizing data in this way.
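
Whether non-numeric columns are skipped silently depends on the pandas version: recent releases (2.0 and later, as far as we are aware) may raise an error for `mean` on a table containing text columns instead of ignoring them. A minimal sketch of restricting a summary to numeric columns explicitly is given below; the `toy` table and its values are made up for illustration and are not part of the titanic dataset.

```python
import pandas as pd

# hypothetical toy table mixing numeric and non-numeric columns
toy = pd.DataFrame({
    'age': [22., 38., 26., 35.],
    'fare': [7.25, 71.28, 7.93, 53.10],
    'sex': ['male', 'female', 'female', 'female'],
})

# ask explicitly for a numeric-only summary;
# the result is a Series indexed by the numeric column names
means = toy.mean(numeric_only=True)
print(means)

# alternative: select the numeric columns first, then summarize
same = toy.select_dtypes('number').mean()
print(same)
```

Both calls give one value per numeric column, so the `sex` column never enters the computation.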
+ +We can also define our own functions to apply to the columns (in this case we +have to explicitly select the numeric columns). + +``` +def mad(series): +    """ +    Computes the median absolute deviation (MAD) + +    This is an outlier-resistant measure of the standard deviation +    """ +    no_nan = series.dropna() +    return np.median(abs(no_nan - np.nanmedian(no_nan))) + +titanic.select_dtypes(np.number).apply(mad) +``` + +We can also provide multiple functions to the `apply` method (note that +functions can be provided as strings) + +``` +titanic.select_dtypes(np.number).apply(['mean', np.median, np.std, mad]) +``` + +### Grouping by + +One of the more powerful features of pandas is `groupby`, which splits the dataset on +a categorical variable. The Python Data Science Handbook contains a clear tutorial on that feature +[here](https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html). You +can check the pandas documentation +[here](http://pandas.pydata.org/pandas-docs/stable/groupby.html) for a more +formal introduction.
One simple use is just to put it into a loop + +``` +for cls, part_table in titanic.groupby('class'): +    print(f'Mean fare in {cls.lower()} class: {part_table.fare.mean()}') +``` + +However, it is more often combined with one of the aggregation functions +discussed above as illustrated in this figure from the [Python data science +handbook](https://jakevdp.github.io/PythonDataScienceHandbook/06.00-figure-code.html#Split-Apply-Combine) + + + +``` +titanic.groupby('class').mean() +``` + +We can also group by multiple variables at once + +``` +titanic.groupby(['class', 'survived']).mean() # as always in pandas supply multiple column names as lists, not tuples +``` + +When grouping it can help to use the `cut` function to split a continuous variable +into a categorical one + +``` +titanic.groupby(['class', pd.cut(titanic.age, bins=(0, 18, 50, np.inf))]).mean() +``` + +We can use the `aggregate` method to apply several functions to each column + +``` +titanic.groupby(['class', 'survived']).aggregate((np.median, mad)) +``` + +Note that both the index (on the left) and the column names (on the top) now +have multiple levels. Such a multi-level index is referred to as `MultiIndex`. +This does complicate selecting specific columns/rows. You can read more about using +`MultiIndex` [here](http://pandas.pydata.org/pandas-docs/stable/advanced.html). +The short version is that columns can be selected using direct indexing (as +discussed above) + +``` +df_full = titanic.groupby(['class', 'survived']).aggregate((np.median, mad)) +``` + +``` +df_full[('age', 'median')] # selects median age column; note that the round brackets are optional +``` + +``` +df_full['age'] # selects both age columns +``` + +Remember that indexing based on the index was done through `loc`.
The rest is +the same as for the columns above + +``` +df_full.loc[('First', 0)] +``` + +``` +df_full.loc['First'] +``` + +More advanced use of the `MultiIndex` is possible through `xs`: + +``` +df_full.xs(0, level='survived') # selects all the zeros from the survived index +``` + +``` +df_full.xs('mad', axis=1, level=1) # selects mad from the second level in the columns (i.e., axis=1) +``` + +## Reshaping tables + +If we were interested in how the survival rate depends on the class and sex of +the passengers we could simply use a groupby: + +``` +titanic.groupby(['class', 'sex']).survived.mean() +``` + +However, this single-column table is difficult to read. The reason for this is +that the indexing is multi-leveled (called `MultiIndex` in pandas), while there +is only a single column. We would like to move one of the levels in the index to +the columns. This can be done using `stack`/`unstack`: + +- `unstack`: Moves one of the levels in the index to the columns +- `stack`: Moves one of the levels in the columns to the index + +``` +titanic.groupby(['class', 'sex']).survived.mean().unstack('sex') +``` + +The former table, where the different groups are defined in different rows, is +often referred to as long-form. After unstacking the table is often referred to +as wide-form as the different groups (sex in this case) are now represented as +different columns. In pandas some operations are easier on long-form tables +(e.g., `groupby`) while others require wide-form tables (e.g., making scatter +plots of two variables). You can go back and forth using `unstack` or `stack` as +illustrated above, but as this is a crucial part of pandas there are many +alternatives, such as `pivot_table`, `melt`, and `wide_to_long`, which we will +discuss below.
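
The round trip between long-form and wide-form with `unstack` and `stack` can be sketched on a small table. The table below is a made-up stand-in (the `rate` values are invented, not computed from the titanic data), so treat it as an illustration of the mechanics only.

```python
import pandas as pd

# hypothetical long-form table: one row per (class, sex) combination
long_form = pd.DataFrame({
    'class': ['First', 'First', 'Second', 'Second'],
    'sex': ['female', 'male', 'female', 'male'],
    'rate': [0.9, 0.4, 0.8, 0.2],
}).set_index(['class', 'sex'])['rate']

# unstack moves the 'sex' level of the index into the columns (wide-form)
wide_form = long_form.unstack('sex')
print(wide_form)

# stack moves the column level back into the index (long-form again)
round_trip = wide_form.stack()
print(round_trip)
```

In the wide-form table each class is a single row with one column per sex, which is the layout a heatmap or a side-by-side comparison needs.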
+ +We can prettify the table further using seaborn + +``` +ax = sns.heatmap(titanic.groupby(['class', 'sex']).survived.mean().unstack('sex'), + annot=True) +ax.set_title('survival rate') +``` + +Note that there are also many ways to produce prettier tables in pandas (e.g., +color all the negative values). This is documented +[here](http://pandas.pydata.org/pandas-docs/stable/style.html). + +Because this stacking/unstacking is fairly common after a groupby operation, +there is a shortcut for it: `pivot_table` + +``` +titanic.pivot_table('survived', 'class', 'sex') +``` + +As usual in pandas, we can also provide multiple column names + +``` +sns.heatmap(titanic.pivot_table('survived', ['class', 'embark_town'], ['sex', pd.cut(titanic.age, (0, 18, np.inf))]), annot=True) +``` + +We can also change the function to be used to aggregate the data + +``` +sns.heatmap(titanic.pivot_table('survived', ['class', 'embark_town'], ['sex', pd.cut(titanic.age, (0, 18, np.inf))], + aggfunc='count'), annot=True) +``` + +As in `groupby` the aggregation function can be the name of a common aggregation +function given as a string, or any function that should be applied. + +We can even apply different aggregate functions to different columns + +``` +titanic.pivot_table(index='class', columns='sex', + aggfunc={'survived': 'count', 'fare': np.mean}) # count the passengers and compute the mean fare +``` + +The opposite of `pivot_table` is `melt`. This can be used to change a wide-form +table into a long-form table. This is not particularly useful on the titanic +dataset, so let's create a new table where this might be useful.
Let's say we +have a dataset listing the fractional anisotropy (FA) and mean diffusivity (MD) +values in various white matter (WM) tracts: + +``` +tracts = ('Corpus callosum', 'Internal capsule', 'SLF', 'Arcuate fasciculus') +df_wide = pd.DataFrame.from_dict(dict({'subject': list('ABCDEFGHIJ')}, **{ + f'FA({tract})': np.random.rand(10) for tract in tracts }, **{ + f'MD({tract})': np.random.rand(10) * 1e-3 for tract in tracts +})) +df_wide +``` + +This wide-form table (i.e., all the information is in different columns) makes +it hard to select just all the FA values or only the values associated with the +SLF. For this it would be easier to list all the values in a single column. +Most of the tools discussed above (e.g., `groupby` or `seaborn` plotting) work +better with long-form data, which we can obtain from `melt`: + +``` +df_long = df_wide.melt('subject', var_name='measurement', value_name='dti_value') +df_long.head(12) +``` + +We can see that `melt` took all the columns other than `subject` (we could also +have specified a specific sub-set) and returned each measurement as a separate row. We probably +want to separate the measurement column into the measurement type (FA or MD) and +the tract name. Many string manipulation functions are available on a +`Series` object under `Series.str` +([tutorial](http://pandas.pydata.org/pandas-docs/stable/text.html)) + +``` +df_long['variable'] = df_long.measurement.str.slice(0, 2) # first two letters correspond to FA or MD +df_long['tract'] = df_long.measurement.str.slice(3, -1) # fourth till the second-to-last letter correspond to the tract +df_long.head(12) +``` + +Finally we probably do want the FA and MD variables as different columns. + +**Exercise**: Use `pivot_table` or `stack`/`unstack` to create a column for MD +and FA. + +``` +df_unstacked = df_long. +``` + +We can now use the tools discussed above to visualize the table (`seaborn`) or +to group the table based on tract (`groupby` or `pivot_table`).
+ +``` +# feel free to analyze this random data in more detail +``` + +In general pandas is better at handling long-form than wide-form data, although +for better visualization of the data an intermediate format is often best. One +exception is calculating a covariance (`DataFrame.cov`) or correlation +(`DataFrame.corr`) matrix, which compares each pair of columns: + +``` +sns.heatmap(df_wide.corr(), cmap=sns.diverging_palette(240, 10, s=99, n=300)) +``` + +## Linear fitting (`statsmodels`) + +Linear fitting between the different columns is available through the +[`statsmodels`](https://www.statsmodels.org/stable/index.html) library. A nice +way to play around with a wide variety of possible models is to use R-style +formulas. The usage of these formulas in `statsmodels` is described +[here](https://www.statsmodels.org/dev/example_formulas.html). You can find a +more detailed description of the R-style formulas +[here](https://patsy.readthedocs.io/en/latest/formulas.html#the-formula-language). + +In short these formulas describe the linear model as a string. For example, +`"y ~ x + a + x * a"` fits the variable `y` as a function of `x`, `a`, and the +interaction between `x` and `a`. The intercept is included by default (you can +add `"+ 0"` to remove it). + +``` +result = smf.logit('survived ~ age + sex + age * sex', data=titanic).fit() +print(result.summary()) +``` + +Note that `statsmodels` understands categorical variables and automatically +replaces them with dummy variables. + +Above we used logistic regression, which is appropriate for the binary +survival outcome. A wide variety of linear models are available. Let's try a GLM +for the fare (to assume that the fare is drawn from a Gamma distribution rather +than the default Gaussian, pass `family=sm.families.Gamma()`, with `sm` being +`statsmodels.api`): + +``` +age_dmean = titanic.age - titanic.age.mean() +result = smf.glm('fare ~ age_dmean + embark_town', data=titanic).fit() +print(result.summary()) +``` + +Cherbourg passengers clearly paid a lot more...
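
As noted above, `statsmodels` replaces categorical variables with dummy variables. A minimal sketch of what that encoding amounts to is given below, done by hand with `pd.get_dummies` and ordinary least squares from NumPy rather than with `statsmodels` itself. The `toy` table, the town labels and the coefficients are all made up for illustration, and dropping the first level as the reference category is our assumption about the default coding.

```python
import numpy as np
import pandas as pd

# hypothetical toy data: fare depends linearly on age plus a per-town offset
toy = pd.DataFrame({
    'age': [20., 30., 40., 50., 25., 35.],
    'town': ['A', 'B', 'C', 'A', 'B', 'C'],
})
offsets = {'A': 0., 'B': 5., 'C': 10.}
toy['fare'] = 2. + 0.5 * toy.age + toy.town.map(offsets)

# one 0/1 column per town, dropping the first level ('A') as the reference
dummies = pd.get_dummies(toy.town, prefix='town', drop_first=True).astype(float)

# design matrix: intercept, age, and the dummy columns
intercept = pd.Series(1., index=toy.index, name='Intercept')
design = pd.concat([intercept, toy[['age']], dummies], axis=1)

# ordinary least squares fit of fare on the design matrix
coef, *_ = np.linalg.lstsq(design.to_numpy(), toy.fare.to_numpy(), rcond=None)
fit = pd.Series(coef, index=design.columns)
print(fit)
```

Each dummy coefficient is the offset of that town relative to the reference town, which is how the per-town rows in a regression summary are usually read.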
+ + +Note that we did not actually add `age_dmean` to the +`DataFrame`. `statsmodels` (or more precisely the underlying +[patsy](https://patsy.readthedocs.io/en/latest/) library) automatically +extracted this from our environment. This can lead to confusing behaviour... + +# More reading + +Other useful features + +- [Concatenating](https://jakevdp.github.io/PythonDataScienceHandbook/03.06-concat-and-append.html) + and + [merging](https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html) + of tables +- [Lots + of](http://pandas.pydata.org/pandas-docs/stable/basics.html#dt-accessor) + [time](http://pandas.pydata.org/pandas-docs/stable/timeseries.html) + [series](http://pandas.pydata.org/pandas-docs/stable/timedeltas.html) + support +- [Rolling Window + functions](http://pandas.pydata.org/pandas-docs/stable/computation.html#window-functions) + for after you have meaningfully sorted your data +- and much, much more