diff --git a/advanced_programming/threading.ipynb b/advanced_programming/threading.ipynb
index 69ec51b3309fd23189f21b829d7ce4ed5c5879f0..caf598513c185d4c70fdf08aeb62590247904aec 100644
--- a/advanced_programming/threading.ipynb
+++ b/advanced_programming/threading.ipynb
@@ -2,6 +2,7 @@
  "cells": [
   {
    "cell_type": "markdown",
+   "id": "12ef343d",
    "metadata": {},
    "source": [
     "# Threading and parallel processing\n",
@@ -12,7 +13,12 @@
     "true parallelism in the\n",
     "[`multiprocessing`](https://docs.python.org/3/library/multiprocessing.html)\n",
     "module.  If you want to be impressed, skip straight to the section on\n",
-    "[`multiprocessing`](todo).\n",
+    "[`multiprocessing`](multiprocessing).\n",
+    "\n",
+    "\n",
+    "> *Note*: This notebook covers features that are built-in to the Python\n",
+    "> programming language. However, there are many other parallelisation options\n",
+    "> available to you through third-party libraries - some of them are covered in `applications/parallel/parallel.ipynb`.\n",
     "\n",
     "\n",
     "> *Note*: If you are familiar with a \"real\" programming language such as C++\n",
@@ -40,6 +46,7 @@
     "    * [`Pool.map`](#pool-map)\n",
     "    * [`Pool.apply_async`](#pool-apply-async)\n",
     "* [Sharing data between processes](#sharing-data-between-processes)\n",
+    "  * [Memory-mapping](#memory-mapping)\n",
     "  * [Read-only sharing](#read-only-sharing)\n",
     "  * [Read/write sharing](#read-write-sharing)\n",
     "\n",
@@ -61,6 +68,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "956c477f",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -85,6 +93,7 @@
   },
   {
    "cell_type": "markdown",
+   "id": "c7f0f9ad",
    "metadata": {},
    "source": [
     "You can also `join` a thread, which will block execution in the current thread\n",
@@ -94,6 +103,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "f6e3d5e6",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -107,6 +117,7 @@
   },
   {
    "cell_type": "markdown",
+   "id": "41def024",
    "metadata": {},
    "source": [
     "<a class=\"anchor\" id=\"subclassing-thread\"></a>\n",
@@ -120,6 +131,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "dbf8e4ff",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -142,6 +154,7 @@
   },
   {
    "cell_type": "markdown",
+   "id": "a3218617",
    "metadata": {},
    "source": [
     "<a class=\"anchor\" id=\"daemon-threads\"></a>\n",
@@ -162,6 +175,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "5a0b442b",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -171,6 +185,7 @@
   },
   {
    "cell_type": "markdown",
+   "id": "f04828ff",
    "metadata": {},
    "source": [
     "See the [`Thread`\n",
@@ -205,6 +220,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "33d52d8b",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -221,6 +237,7 @@
   },
   {
    "cell_type": "markdown",
+   "id": "281eb07a",
    "metadata": {},
    "source": [
     "But if we protect the critical section with a `Lock` object, the output will\n",
@@ -230,6 +247,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "40802e7f",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -250,6 +268,7 @@
   },
   {
    "cell_type": "markdown",
+   "id": "88acba0e",
    "metadata": {},
    "source": [
     "> Instead of using a `Lock` object in a `with` statement, it is also possible\n",
@@ -273,6 +292,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "4c30c31a",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -312,6 +332,7 @@
   },
   {
    "cell_type": "markdown",
+   "id": "c69dbe16",
    "metadata": {},
    "source": [
     "Try removing the `mutex` lock from the two methods in the above code, and see\n",
@@ -335,6 +356,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "b0b933b6",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -359,6 +381,7 @@
   },
   {
    "cell_type": "markdown",
+   "id": "2a6a36e2",
    "metadata": {},
    "source": [
     "<a class=\"anchor\" id=\"the-global-interpreter-lock-gil\"></a>\n",
@@ -479,6 +502,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "daadc9c9",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -486,36 +510,21 @@
     "import multiprocessing as mp\n",
     "import numpy           as np\n",
     "\n",
-    "# We must avoid concurrent accesses to the\n",
-    "# print function when running parallel code\n",
-    "# within the Jupyter notebook environment.\n",
-    "# This would not be necessary when executing\n",
-    "# code normally (i.e. outside of Jupyter\n",
-    "# notebook)\n",
-    "lock = mp.Lock()\n",
-    "\n",
     "def crunchImage(imgfile):\n",
     "\n",
-    "    # Load a nifti image, do stuff\n",
-    "    # to it. Use your imagination\n",
-    "    # to fill in this function.\n",
+    "    # Load a nifti image and calculate some\n",
+    "    # metric from the image. Use your\n",
+    "    # imagination to fill in this function.\n",
     "    time.sleep(2)\n",
-    "\n",
-    "    # numpy's random number generator\n",
-    "    # will be initialised in the same\n",
-    "    # way in each process, so let's\n",
-    "    # re-seed it.\n",
     "    np.random.seed()\n",
-    "    result = np.random.randint(1, 100, 1)\n",
-    "\n",
-    "    with lock:\n",
-    "        print(imgfile, ':', result)\n",
+    "    result = np.random.randint(1, 100, 1)[0]\n",
     "\n",
     "    return result\n",
     "\n",
+    "\n",
     "imgfiles = [f'{i:02d}.nii.gz' for i in range(20)]\n",
     "\n",
-    "print('Crunching images...')\n",
+    "print(f'Crunching {len(imgfiles)} images...')\n",
     "\n",
     "start = time.time()\n",
     "\n",
@@ -524,11 +533,15 @@
     "\n",
     "end = time.time()\n",
     "\n",
+    "for imgfile, result in zip(imgfiles, results):\n",
+    "   print(f'Result for {imgfile}: {result}')\n",
+    "\n",
     "print('Total execution time: {:0.2f} seconds'.format(end - start))"
    ]
   },
   {
    "cell_type": "markdown",
+   "id": "51d5ae8a",
    "metadata": {},
    "source": [
     "The `Pool.map` method only works with functions that accept one argument, such\n",
@@ -541,10 +554,10 @@
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "60ce3e5b",
    "metadata": {},
    "outputs": [],
    "source": [
-    "lock = mp.Lock()\n",
     "def crunchImage(imgfile, modality):\n",
     "    time.sleep(2)\n",
     "\n",
@@ -555,10 +568,8 @@
     "    elif modality == 't2':\n",
     "        result = np.random.randint(100, 200, 1)\n",
     "\n",
-    "    with lock:\n",
-    "        print(imgfile, ': ', result)\n",
+    "    return result[0]\n",
     "\n",
-    "    return result\n",
     "\n",
     "imgfiles   = [f't1_{i:02d}.nii.gz' for i in range(10)] + \\\n",
     "             [f't2_{i:02d}.nii.gz' for i in range(10)]\n",
@@ -575,11 +586,15 @@
     "\n",
     "end = time.time()\n",
     "\n",
+    "for imgfile, modality, result in zip(imgfiles, modalities, results):\n",
+    "    print(f'{imgfile} [{modality}]: {result}')\n",
+    "\n",
     "print('Total execution time: {:0.2f} seconds'.format(end - start))"
    ]
   },
   {
    "cell_type": "markdown",
+   "id": "99d35451",
    "metadata": {},
    "source": [
     "The `map` and `starmap` methods also have asynchronous equivalents `map_async`\n",
@@ -610,6 +625,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "b791805e",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -620,9 +636,9 @@
     "\n",
     "def linear_registration(src, ref):\n",
     "    time.sleep(1)\n",
-    "\n",
     "    return np.eye(4)\n",
     "\n",
+    "\n",
     "def nonlinear_registration(src, ref, affine):\n",
     "\n",
     "    time.sleep(3)\n",
@@ -630,13 +646,15 @@
     "    # this number represents a non-linear warp\n",
     "    # field - use your imagination people!\n",
     "    np.random.seed()\n",
-    "    return np.random.randint(1, 100, 1)\n",
+    "    return np.random.randint(1, 100, 1)[0]\n",
+    "\n",
     "\n",
     "t1s = [f'{i:02d}_t1.nii.gz' for i in range(20)]\n",
     "std = 'MNI152_T1_2mm.nii.gz'\n",
     "\n",
     "print('Running structural-to-standard registration '\n",
-    "      'on {} subjects...'.format(len(t1s)))\n",
+    "      f'on {len(t1s)} subjects...')\n",
+    "\n",
     "\n",
     "# Run linear registration on all the T1s.\n",
     "start = time.time()\n",
@@ -653,10 +671,12 @@
     "    for i, r in enumerate(linresults):\n",
     "        linresults[i] = r.get()\n",
     "\n",
+    "\n",
     "end = time.time()\n",
     "\n",
     "print('Linear registrations completed in '\n",
-    "      '{:0.2f} seconds'.format(end - start))\n",
+    "      f'{end - start:0.2f} seconds')\n",
+    "\n",
     "\n",
     "# Run non-linear registration on all the T1s,\n",
     "# using the linear registrations to initialise.\n",
@@ -670,6 +690,7 @@
     "    for i, r in enumerate(nlinresults):\n",
     "        nlinresults[i] = r.get()\n",
     "\n",
+    "\n",
     "end = time.time()\n",
     "\n",
     "print('Non-linear registrations completed in '\n",
@@ -677,11 +698,12 @@
     "\n",
     "print('Non linear registrations:')\n",
     "for t1, result in zip(t1s, nlinresults):\n",
-    "    print(t1, ':', result)"
+    "    print(f'{t1} : {result}')"
    ]
   },
   {
    "cell_type": "markdown",
+   "id": "495c9c03",
    "metadata": {},
    "source": [
     "<a class=\"anchor\" id=\"sharing-data-between-processes\"></a>\n",
@@ -707,7 +729,9 @@
     "\n",
     "> <sup>1</sup>*Pickleable* is the term used in the Python world to refer to\n",
     "> something that is *serialisable* - basically, the process of converting an\n",
-    "> in-memory object into a binary form that can be stored and/or transmitted.\n",
+    "> in-memory object into a binary form that can be stored and/or transmitted,\n",
+    "> and then loaded back into memory at some point in the future (in the same\n",
+    "> process, or in another process).\n",
     "\n",
     "\n",
     "There is obviously some overhead in copying data back and forth between the\n",
@@ -717,10 +741,134 @@
     "computation time, rather than I/O between the parent and child processes.\n",
     "\n",
     "\n",
-    "However, if you are working with a large dataset, you have determined that\n",
-    "copying data between processes is having a substantial impact on your\n",
-    "performance, and instead wish to *share* a single copy of the data between\n",
-    "the processes, you will need to:\n",
+    "However, if you are working with a large data set, where copying it between\n",
+    "processes is not viable, you have a couple of options available to you.\n",
+    "\n",
+    "\n",
+    "<a class=\"anchor\" id=\"memory-mapping\"></a>\n",
+    "### Memory-mapping\n",
+    "\n",
+    "\n",
+    "One method for sharing a large `numpy` array between multiple processes is to\n",
+    "use a _memory-mapped_ array. This is a feature built into `numpy` which\n",
+    "stores your data in a regular file, instead of in memory.  This allows your\n",
+    "data to be simultaneously read and written by multiple processes, and is fairly\n",
+    "straightforward to accomplish.\n",
+    "\n",
+    "For example, let's say you have some 4D fMRI data, and wish to fit a\n",
+    "complicated model to the time series at each voxel.  First we will load our 4D\n",
+    "data, and pre-allocate another array to store the fitted model parameters:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3384d747",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import                    time\n",
+    "import functools       as ft\n",
+    "import multiprocessing as mp\n",
+    "import numpy           as np\n",
+    "\n",
+    "# Store the parameters that are required\n",
+    "# to create our memory-mapped arrays, as\n",
+    "# we need to re-use them a couple of times.\n",
+    "#\n",
+    "# Note that in practice you would usually\n",
+    "# want to store these files in a temporary\n",
+    "# directory, and/or ensure that they are\n",
+    "# deleted once you are finished.\n",
+    "data_params  = dict(filename='data.mmap',  shape=(91, 109, 91, 50), dtype=np.float32)\n",
+    "model_params = dict(filename='model.mmap', shape=(91, 109, 91),     dtype=np.float32)\n",
+    "\n",
+    "# Load our data as a memory-mapped array (we\n",
+    "# are using random data for this example)\n",
+    "data    = np.memmap(**data_params, mode='w+')\n",
+    "data[:] = np.random.random((91, 109, 91, 50)).astype(np.float32)\n",
+    "data.flush()\n",
+    "\n",
+    "# Pre-allocate space to store the fitted\n",
+    "# model parameters\n",
+    "model = np.memmap(**model_params, mode='w+')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "15ae0eb1",
+   "metadata": {},
+   "source": [
+    "> If your image files are uncompressed (i.e. `.nii` rather than `.nii.gz`),\n",
+    "> you can instruct `nibabel` and `fslpy` to load them as a memory-map by\n",
+    "> passing `mmap=True` to the `nibabel.load` function, and the\n",
+    "> `fsl.data.image.Image` constructor.\n",
+    "\n",
+    "\n",
+    "Now we will write our model fitting function so that it works on one slice at\n",
+    "a time - this will allow us to process multiple slices in parallel. Note\n",
+    "that, within this function, we have to _re-load_ the memory-mapped arrays. In\n",
+    "this example we have written this function so as to expect the arguments\n",
+    "required to create the two memory-maps to be passed in (the `data_params` and\n",
+    "`model_params` dictionaries that we created above):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2daf1f1b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def fit_model(indata, outdata, sliceidx):\n",
+    "\n",
+    "    indata  = np.memmap(**indata,  mode='r')\n",
+    "    outdata = np.memmap(**outdata, mode='r+')\n",
+    "\n",
+    "    # sleep to simulate expensive model fitting\n",
+    "    print(f'Fitting model at slice {sliceidx}')\n",
+    "    time.sleep(1)\n",
+    "    outdata[:, :, sliceidx] = indata[:, :, sliceidx, :].mean() + sliceidx"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "90b8f2e3",
+   "metadata": {},
+   "source": [
+    "Now we can use `multiprocessing` to fit the model in parallel across all of the\n",
+    "image slices:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ffb4c693",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "fit_function = ft.partial(fit_model, data_params, model_params)\n",
+    "slice_idxs   = list(range(91))\n",
+    "\n",
+    "with mp.Pool(processes=16) as pool:\n",
+    "    pool.map(fit_function, slice_idxs)\n",
+    "\n",
+    "print(model)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dd0dd890",
+   "metadata": {},
+   "source": [
+    "<a class=\"anchor\" id=\"read-only-sharing\"></a>\n",
+    "### Read-only sharing\n",
+    "\n",
+    "\n",
+    "If you are working with a large dataset, you have determined that copying data\n",
+    "between processes is having a substantial impact on your performance, and have\n",
+    "also decided that memory-mapping is not an option for you, and instead wish to\n",
+    "*share* a single copy of the data between the processes, you will need to:\n",
     "\n",
     " 1. Structure your code so that the data you want to share is accessible at\n",
     "    the *module level*.\n",
@@ -728,7 +876,7 @@
     "\n",
     "\n",
     "This is because, when you create a `Pool`, what actually happens is that the\n",
-    "process your Pythonn script is running in will [**fork**][wiki-fork] itself -\n",
+    "process your Python script is running in will [**fork**][wiki-fork] itself -\n",
     "the child processes that are created are used as the worker processes by the\n",
     "`Pool`. And if you create/load your data in your main process *before* this\n",
     "fork occurs, all of the child processes will inherit the memory space of the\n",
@@ -739,10 +887,6 @@
     "[wiki-fork]: https://en.wikipedia.org/wiki/Fork_(system_call)\n",
     "\n",
     "\n",
-    "<a class=\"anchor\" id=\"read-only-sharing\"></a>\n",
-    "### Read-only sharing\n",
-    "\n",
-    "\n",
     "Let's see this in action with a simple example. We'll start by defining a\n",
     "horrible little helper function which allows us to track the total memory\n",
     "usage:"
@@ -751,6 +895,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "13fe8356",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -776,6 +921,7 @@
   },
   {
    "cell_type": "markdown",
+   "id": "398f7b19",
    "metadata": {},
    "source": [
     "Now our task is simply to calculate the sum of a large array of numbers. We're\n",
@@ -786,6 +932,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "66b9917b",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -835,6 +982,7 @@
   },
   {
    "cell_type": "markdown",
+   "id": "9b06285f",
    "metadata": {},
    "source": [
     "You should be able to see that only one copy of `data` is created, and is\n",
@@ -922,6 +1070,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "d7a8f363",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -949,6 +1098,7 @@
   },
   {
    "cell_type": "markdown",
+   "id": "4f0cbe28",
    "metadata": {},
    "source": [
     "Rather than passing the input and output data arrays in as arguments to the\n",
@@ -974,6 +1124,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "c3ff121f",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -1044,6 +1195,7 @@
   },
   {
    "cell_type": "markdown",
+   "id": "09269b65",
    "metadata": {},
    "source": [
     "Now we can call our `process_data` function just like any other function:"
@@ -1052,6 +1204,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
+   "id": "27a07401",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -1068,5 +1221,5 @@
  ],
  "metadata": {},
  "nbformat": 4,
- "nbformat_minor": 4
+ "nbformat_minor": 5
 }
diff --git a/advanced_programming/threading.md b/advanced_programming/threading.md
index a6e045315658ab6b2b3ff893bcf36a628a6f207c..78cce4738ebded16331ee2345ebb0d872b7b2101 100644
--- a/advanced_programming/threading.md
+++ b/advanced_programming/threading.md
@@ -6,7 +6,12 @@ The Python language has built-in support for multi-threading in the
 true parallelism in the
 [`multiprocessing`](https://docs.python.org/3/library/multiprocessing.html)
 module.  If you want to be impressed, skip straight to the section on
-[`multiprocessing`](todo).
+[`multiprocessing`](multiprocessing).
+
+
+> *Note*: This notebook covers features that are built into the Python
+> programming language. However, there are many other parallelisation options
+> available to you through third-party libraries - some of them are covered in `applications/parallel/parallel.ipynb`.
 
 
 > *Note*: If you are familiar with a "real" programming language such as C++
@@ -34,6 +39,7 @@ module.  If you want to be impressed, skip straight to the section on
     * [`Pool.map`](#pool-map)
     * [`Pool.apply_async`](#pool-apply-async)
 * [Sharing data between processes](#sharing-data-between-processes)
+  * [Memory-mapping](#memory-mapping)
   * [Read-only sharing](#read-only-sharing)
   * [Read/write sharing](#read-write-sharing)
 
@@ -411,36 +417,21 @@ import                    time
 import multiprocessing as mp
 import numpy           as np
 
-# We must avoid concurrent accesses to the
-# print function when running parallel code
-# within the Jupyter notebook environment.
-# This would not be necessary when executing
-# code normally (i.e. outside of Jupyter
-# notebook)
-lock = mp.Lock()
-
 def crunchImage(imgfile):
 
-    # Load a nifti image, do stuff
-    # to it. Use your imagination
-    # to fill in this function.
+    # Load a nifti image and calculate some
+    # metric from the image. Use your
+    # imagination to fill in this function.
     time.sleep(2)
-
-    # numpy's random number generator
-    # will be initialised in the same
-    # way in each process, so let's
-    # re-seed it.
     np.random.seed()
-    result = np.random.randint(1, 100, 1)
-
-    with lock:
-        print(imgfile, ':', result)
+    result = np.random.randint(1, 100, 1)[0]
 
     return result
 
+
 imgfiles = [f'{i:02d}.nii.gz' for i in range(20)]
 
-print('Crunching images...')
+print(f'Crunching {len(imgfiles)} images...')
 
 start = time.time()
 
@@ -449,6 +440,9 @@ with mp.Pool(processes=16) as p:
 
 end = time.time()
 
+for imgfile, result in zip(imgfiles, results):
+    print(f'Result for {imgfile}: {result}')
+
 print('Total execution time: {:0.2f} seconds'.format(end - start))
 ```
 
@@ -461,7 +455,6 @@ method instead:
 
 
 ```
-lock = mp.Lock()
 def crunchImage(imgfile, modality):
     time.sleep(2)
 
@@ -472,10 +465,8 @@ def crunchImage(imgfile, modality):
     elif modality == 't2':
         result = np.random.randint(100, 200, 1)
 
-    with lock:
-        print(imgfile, ': ', result)
+    return result[0]
 
-    return result
 
 imgfiles   = [f't1_{i:02d}.nii.gz' for i in range(10)] + \
              [f't2_{i:02d}.nii.gz' for i in range(10)]
@@ -492,6 +483,9 @@ with mp.Pool(processes=16) as pool:
 
 end = time.time()
 
+for imgfile, modality, result in zip(imgfiles, modalities, results):
+    print(f'{imgfile} [{modality}]: {result}')
+
 print('Total execution time: {:0.2f} seconds'.format(end - start))
 ```
 
@@ -529,9 +523,9 @@ import numpy           as np
 
 def linear_registration(src, ref):
     time.sleep(1)
-
     return np.eye(4)
 
+
 def nonlinear_registration(src, ref, affine):
 
     time.sleep(3)
@@ -539,13 +533,15 @@ def nonlinear_registration(src, ref, affine):
     # this number represents a non-linear warp
     # field - use your imagination people!
     np.random.seed()
-    return np.random.randint(1, 100, 1)
+    return np.random.randint(1, 100, 1)[0]
+
 
 t1s = [f'{i:02d}_t1.nii.gz' for i in range(20)]
 std = 'MNI152_T1_2mm.nii.gz'
 
 print('Running structural-to-standard registration '
-      'on {} subjects...'.format(len(t1s)))
+      f'on {len(t1s)} subjects...')
+
 
 # Run linear registration on all the T1s.
 start = time.time()
@@ -562,10 +558,12 @@ with mp.Pool(processes=16) as pool:
     for i, r in enumerate(linresults):
         linresults[i] = r.get()
 
+
 end = time.time()
 
 print('Linear registrations completed in '
-      '{:0.2f} seconds'.format(end - start))
+      f'{end - start:0.2f} seconds')
+
 
 # Run non-linear registration on all the T1s,
 # using the linear registrations to initialise.
@@ -579,6 +577,7 @@ with mp.Pool(processes=16) as pool:
     for i, r in enumerate(nlinresults):
         nlinresults[i] = r.get()
 
+
 end = time.time()
 
 print('Non-linear registrations completed in '
@@ -586,7 +585,7 @@ print('Non-linear registrations completed in '
 
 print('Non linear registrations:')
 for t1, result in zip(t1s, nlinresults):
-    print(t1, ':', result)
+    print(f'{t1} : {result}')
 ```
 
 
@@ -613,7 +612,9 @@ third-party library).
 
 > <sup>1</sup>*Pickleable* is the term used in the Python world to refer to
 > something that is *serialisable* - basically, the process of converting an
-> in-memory object into a binary form that can be stored and/or transmitted.
+> in-memory object into a binary form that can be stored and/or transmitted,
+> and then loaded back into memory at some point in the future (in the same
+> process, or in another process).
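+>
+> For example, a minimal sketch of a round-trip using the built-in
+> `pickle` module:
+>
+> ```
+> import pickle
+> binary = pickle.dumps({'shape': (91, 109, 91)})  # serialise to bytes
+> obj    = pickle.loads(binary)                    # load back into memory
+> ```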
 
 
 There is obviously some overhead in copying data back and forth between the
@@ -623,10 +624,101 @@ important - the performance bottleneck is typically going to be the
 computation time, rather than I/O between the parent and child processes.
 
 
-However, if you are working with a large dataset, you have determined that
-copying data between processes is having a substantial impact on your
-performance, and instead wish to *share* a single copy of the data between
-the processes, you will need to:
+However, if you are working with a large data set, where copying it between
+processes is not viable, you have a couple of options available to you.
+
+
+<a class="anchor" id="memory-mapping"></a>
+### Memory-mapping
+
+
+One method for sharing a large `numpy` array between multiple processes is to
+use a _memory-mapped_ array. This is a feature built into `numpy` which
+stores your data in a regular file, instead of in memory.  This allows your
+data to be simultaneously read and written by multiple processes, and is fairly
+straightforward to accomplish.
+
+For example, let's say you have some 4D fMRI data, and wish to fit a
+complicated model to the time series at each voxel.  First we will load our 4D
+data, and pre-allocate another array to store the fitted model parameters:
+
+
+```
+import                    time
+import functools       as ft
+import multiprocessing as mp
+import numpy           as np
+
+# Store the parameters that are required
+# to create our memory-mapped arrays, as
+# we need to re-use them a couple of times.
+#
+# Note that in practice you would usually
+# want to store these files in a temporary
+# directory, and/or ensure that they are
+# deleted once you are finished.
+data_params  = dict(filename='data.mmap',  shape=(91, 109, 91, 50), dtype=np.float32)
+model_params = dict(filename='model.mmap', shape=(91, 109, 91),     dtype=np.float32)
+
+# Load our data as a memory-mapped array (we
+# are using random data for this example)
+data    = np.memmap(**data_params, mode='w+')
+data[:] = np.random.random((91, 109, 91, 50)).astype(np.float32)
+data.flush()
+
+# Pre-allocate space to store the fitted
+# model parameters
+model = np.memmap(**model_params, mode='w+')
+```
+
+
+> If your image files are uncompressed (i.e. `.nii` rather than `.nii.gz`),
+> you can instruct `nibabel` and `fslpy` to load them as a memory-map by
+> passing `mmap=True` to the `nibabel.load` function, and the
+> `fsl.data.image.Image` constructor.
+
+
+Now we will write our model fitting function so that it works on one slice at
+a time - this will allow us to process multiple slices in parallel. Note
+that, within this function, we have to _re-load_ the memory-mapped arrays. In
+this example the function expects the arguments required to re-create the
+two memory-maps to be passed in (the `data_params` and `model_params`
+dictionaries that we created above):
+
+```
+def fit_model(indata, outdata, sliceidx):
+
+    indata  = np.memmap(**indata,  mode='r')
+    outdata = np.memmap(**outdata, mode='r+')
+
+    # sleep to simulate expensive model fitting
+    print(f'Fitting model at slice {sliceidx}')
+    time.sleep(1)
+    outdata[:, :, sliceidx] = indata[:, :, sliceidx, :].mean() + sliceidx
+```
+
+Now we can use `multiprocessing` to fit the model in parallel across all of the
+image slices:
+
+```
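+# Fix the first two arguments to fit_model, so
+# that Pool.map only needs to supply the slice
+# index for each call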
+fit_function = ft.partial(fit_model, data_params, model_params)
+slice_idxs   = list(range(91))
+
+with mp.Pool(processes=16) as pool:
+    pool.map(fit_function, slice_idxs)
+
+print(model)
+```
+
+
+<a class="anchor" id="read-only-sharing"></a>
+### Read-only sharing
+
+
+If you are working with a large dataset, have determined that copying data
+between processes is having a substantial impact on your performance, and have
+also decided that memory-mapping is not an option for you, but still wish to
+*share* a single copy of the data between the processes, you will need to:
 
  1. Structure your code so that the data you want to share is accessible at
     the *module level*.
@@ -634,7 +726,7 @@ the processes, you will need to:
 
 
 This is because, when you create a `Pool`, what actually happens is that the
-process your Pythonn script is running in will [**fork**][wiki-fork] itself -
+process your Python script is running in will [**fork**][wiki-fork] itself -
 the child processes that are created are used as the worker processes by the
 `Pool`. And if you create/load your data in your main process *before* this
 fork occurs, all of the child processes will inherit the memory space of the
@@ -645,10 +737,6 @@ any copying required.
 [wiki-fork]: https://en.wikipedia.org/wiki/Fork_(system_call)
 
 
-<a class="anchor" id="read-only-sharing"></a>
-### Read-only sharing
-
-
 Let's see this in action with a simple example. We'll start by defining a
 horrible little helper function which allows us to track the total memory
 usage:
diff --git a/applications/README.md b/applications/README.md
index 89ddea7a2a090508bda2950b4ab29efd3b634344..e1cfe31aa0248ee8f1219d977f53e19d71bc70a9 100644
--- a/applications/README.md
+++ b/applications/README.md
@@ -12,3 +12,4 @@ Practicals are split into the following sub-categories (and sub-folders):
 3. `matlab_vs_python` : a number of data analysis examples with head-to-head comparisons between matlab and python code.
 4. `modelling` : multiple examples of analysis methods in python.
 5. `pandas` : processing and analysing tabular data with the powerful Pandas package.
+6. `parallel` : third-party libraries and strategies for parallelising Python code.
diff --git a/applications/parallel/parallel.ipynb b/applications/parallel/parallel.ipynb
new file mode 100644
index 0000000000000000000000000000000000000000..83b765aa7a9a66ee26f9243a7423e7ccad69e3d0
--- /dev/null
+++ b/applications/parallel/parallel.ipynb
@@ -0,0 +1,1010 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "2b27adf9",
+   "metadata": {},
+   "source": [
+    "# Parallel processing in Python\n",
+    "\n",
+    "\n",
+    "While Python has built-in support for threading and parallelising in the\n",
+    "[`threading`](https://docs.python.org/3/library/threading.html) and\n",
+    "[`multiprocessing`](https://docs.python.org/3/library/multiprocessing.html)\n",
+    "modules (covered in `advanced_programming/threading.ipynb`), there are a range of third-party libraries which you can use to improve the performance of your code.\n",
+    "\n",
+    "Contents:\n",
+    "\n",
+    "* [Do you really need to parallelise your code?](#do-you-need-to-parallelise)\n",
+    "  * [Profilng your code](#profiling-your-code)\n",
+    "* [JobLib](#joblib)\n",
+    "* [Dask](#dask)\n",
+    "  * [Dask Numpy API](#dask-numpy-api)\n",
+    "  * [Low-level Dask API](#low-level-dask-api)\n",
+    "  * [Distributing computation with Dask](#distributing-computation-with-dask)\n",
+    "* [fsl-pipe](#fsl-pipe)\n",
+    "* [fsl-sub](#fsl-sub)\n",
+    "\n",
+    "\n",
+    "<a class=\"anchor\" id=\"do-you-need-to-parallelise\"></a>\n",
+    "# Do you really need to parallelise your code?\n",
+    "\n",
+    "\n",
+    "Before diving in and tearing your code apart, you should think very carefully about your problem, and the hardware you are using to run your code. For example, if you are processing MRI data for a number of subjects on the FMRIB cluster (where each node on the cluster has fairly modest hardware specs, meaning that within-process parallelisation will have limited benefits), the most efficient option may be to write your code in a single-threaded manner, and to parallelise across data sets, processing one data set on each cluster node.\n",
+    "Your best option may simply be to repeatedly call `fsl_sub`- see the [section below](#fsl-sub) for more details on this approach.\n",
+    "\n",
+    "\n",
+    "<a class=\"anchor\" id=\"#profiling-your-code\"></a>\n",
+    "## Profilng your code\n",
+    "\n",
+    "\n",
+    "Once you have decided that you need to parallelise some steps in your program, **STOP!** As the great Donald Knuth once said:\n",
+    "\n",
+    "> Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time:\n",
+    ">\n",
+    "> **premature optimization is the root of all evil**.\n",
+    ">\n",
+    "> Yet we should not pass up our opportunities in that critical 3%.\n",
+    "\n",
+    "\n",
+    "What Knuth is essentially saying is that there is no point in optimising or rewriting a piece of code unless it is actually going to have a positive impact on performance. In other words, before you refactor your code, you need to ensure that you _understand_ where the bottlenecks are, so that you know which parts of the code _should_ be re-written.\n",
+    "\n",
+    "One way in which we can find those bottlenecks is through _profiling_ - running our code, and timing each part of it, to identify which parts are causing our program to run slowly.\n",
+    "\n",
+    "As a simple example, consider this code. It is running slowly, and we want to know why:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "92119f45",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "\n",
+    "def process_data(datafile):\n",
+    "    data = np.loadtxt(datafile)\n",
+    "    result = data.mean()\n",
+    "    return result\n",
+    "\n",
+    "print(process_data('mydata.txt'))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "28a3e8e1",
+   "metadata": {},
+   "source": [
+    "> The `mydata.txt` file can be generated by running this code:\n",
+    ">\n",
+    "> ```\n",
+    "> import numpy as np\n",
+    "> data = np.random.random((10000, 1000))\n",
+    "> np.savetxt('mydata.txt', data)\n",
+    "> ```\n",
+    "\n",
+    "\n",
+    "For small functions, you can profile code manually by using the `time` module, e.g.:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "49fa2ba9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import time\n",
+    "\n",
+    "start  = time.time()\n",
+    "result = process_data('mydata.txt')\n",
+    "end    = time.time()\n",
+    "\n",
+    "print(f'Total #seconds: {end - start}')\n",
+    "print(result)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6c6de92c",
+   "metadata": {},
+   "source": [
+    "We could also do the same within the `process_data` function, and calculate and report the timings for each individual section within.  But there is a library called `line_profiler` which can do this for you - it will time every single line of code, and print a report for you:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2413b91c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%load_ext line_profiler\n",
+    "%lprun -f process_data process_data('mydata.txt')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d3cf2572",
+   "metadata": {},
+   "source": [
+    "> `line_profiler` is not installed as part of FSL by default, but you can install it with this command:\n",
+    "> ```$FSLDIR/bin/conda install -p $FSLDIR line_profiler```\n",
+    ">\n",
+    "> And you can also use it from the command-line, if you are not working in a Jupyter notebook. You can learn more about `line_profiler` at https://github.com/pyutils/line_profiler\n",
+    "\n",
+    "\n",
+    "We can see that most of the processing time was actually in _loading_ the data from file, and not in the data _processing_ (simply calculating the mean in this toy example). Without profiling your code, you may have naively thought that the data processing step was the culprit, and needed to be optimised. But what profiling has told us here is that our problem lies elsewhere, and perhaps we need to think about changing how we are storing our data.\n",
+    "\n",
+    "\n",
+    "But let's assume from this point on that we _do_ need to optimise our code...\n",
+    "\n",
+    "\n",
+    "<a class=\"anchor\" id=\"joblib\"></a>\n",
+    "# JobLib\n",
+    "\n",
+    "https://joblib.readthedocs.io/en/stable/\n",
+    "\n",
+    "\n",
+    "JobLib has been around for a while, and is a simple and lightweight library with two main features:\n",
+    "\n",
+    " - An API for \"embarrassingly parallel\" tasks.\n",
+    " - An API for caching/re-using the results of time-consuming computations (which we won't discuss in this notebook, but you can read about [here](https://joblib.readthedocs.io/en/stable/memory.html)).\n",
+    "\n",
+    "On the surface, JobLib does not provide any functionality that cannot already be accomplished with built-in libraries such as `multiprocessing` and `functools`. However, the JobLib developers have put a lot of effort into ensuring that JobLib:\n",
+    "\n",
+    " - is easy to use,\n",
+    " - works consistently across all of the major platforms, and\n",
+    " - works as efficiently as possible with Numpy arrays\n",
+    "\n",
+    "We can use the JobLib API by importing the `joblib` package (we'll also use `numpy` and `time` in this example):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0198ff8d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import joblib\n",
+    "import time"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f64fac25",
+   "metadata": {},
+   "source": [
+    "Now, let's say that we have a collection of `numpy` arrays containing image data for a group of subjects:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a79aacb0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "datasets = [np.random.random((91, 109, 91)) for i in range(10)]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1e30049b",
+   "metadata": {},
+   "source": [
+    "And we want to calculate some metric on each image. We simply need to define a function which contains the calculation we want to perform (the `time.sleep` call is there just to simulate a complex calculation):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ba63be7a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def do_calculation(input_data):\n",
+    "    time.sleep(2)\n",
+    "    return input_data.mean()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d3d2b0b8",
+   "metadata": {},
+   "source": [
+    "We can now use `joblib.Parallel` and `joblib.delayed` to parallelise those calculations. `joblib.Parallel` sets up a pool of worker processes, and `joblib.delayed` schedules a single instantiation of the function, and associated data, which is to be executed. We can execute a sequence of `delayed` tasks by passing them to the `Parallel` pool:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "814aef5c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "with joblib.Parallel(n_jobs=-1) as pool:\n",
+    "    tasks   = [joblib.delayed(do_calculation)(d) for d in datasets]\n",
+    "    results = pool(tasks)\n",
+    "print(results)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e3bd3114",
+   "metadata": {},
+   "source": [
+    "Just like with `multiprocessing`, JobLib is susceptible to the same problems with regard to sharing data between processes (these problems are discussed in the `advanced_programming/threading.ipynb` notebook). In the example above, each dataset is serialised and copied from the main process to the worker processes, and then the result copied back.  This behaviour could be a performance bottleneck for your own task, or you may be working on a system with limited memory which is incapable of storing several copies of the data.\n",
+    "\n",
+    "To deal with this, we can use memory-mapped Numpy arrays. This is a feature built into Numpy, and supported by JobLib, which invlolves storing your data in your file system instead of in memory. This allows your data to be simultaneously read and written by multiple processes.\n",
+    "\n",
+    "> Some other options for sharing data between processes are discussed in the `advanced_programming/threading.ipynb` notebook.\n",
+    "\n",
+    "You can create `numpy.memmap` arrays directly, but JobLib will also  automatically convert any Numpy arrays which are larger than a specified threshold into memory-mapped arrays. This threshold defaults to 1MB, and you can change it via the `n_bytes` option when you create a `joblib.Parallel` pool.\n",
+    "\n",
+    "To see this in action, imagine that you have some 4D fMRI data, and wish to fit a complicated model to the time series at each voxel. We will write our model fitting function so that it works on one slice at a time:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9b7106d6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def fit_model(indata, outdata, sliceidx):\n",
+    "    print(f'Fitting model at slice {sliceidx}')\n",
+    "    time.sleep(1)\n",
+    "    outdata[:, :, sliceidx] = indata[:, :, sliceidx, :].mean() + sliceidx"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "44eff256",
+   "metadata": {},
+   "source": [
+    "Now we will load our 4D data, and pre-allocate another array to store the fitted model parameters."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2e548f25",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Imagine that we have loaded this data from a file\n",
+    "data  = np.random.random((91, 109, 91, 50)).astype(np.float32)\n",
+    "\n",
+    "# Pre-allocate space to store the fitted model parameters\n",
+    "model = np.memmap('model.mmap',\n",
+    "                  shape=(91, 109, 91),\n",
+    "                  dtype=np.float32,\n",
+    "                  mode='w+')\n",
+    "\n",
+    "# Fit our model, processing slices in parallel\n",
+    "with joblib.Parallel(n_jobs=-1) as pool:\n",
+    "    pool(joblib.delayed(fit_model)(data, model, slc) for slc in range(91))\n",
+    "\n",
+    "print(model)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c5a776e2",
+   "metadata": {},
+   "source": [
+    "<a class=\"anchor\" id=\"dask\"></a>\n",
+    "# Dask\n",
+    "\n",
+    "https://www.dask.org/\n",
+    "\n",
+    "\n",
+    "Dask is a very powerful library which you can use to parallelise your code, and to distribute computation across multiple machines. You can use Dask locally on your own computer, on HPC clusters (e.g. SLURM and SGE) and with cloud compute platforms (e.g. AWS, Azure).\n",
+    "\n",
+    "Dask has two main components:\n",
+    "\n",
+    " - APIs for defining tasks/jobs - there is a low-level API comparable to that provided by JobLib, but Dask also has sophisticated high-level APIs for working with Pandas and Numpy-style data.\n",
+    "\n",
+    " - A task scheduler which builds a graph of all the tasks that need to be executed, and which manges their execution, either locally or remotely.\n",
+    "\n",
+    "We will introduce the Numpy API and the low-level API, and then demonstrate how to use Dask to perform calculations on a SGE cluster.\n",
+    "\n",
+    "\n",
+    "<a class=\"anchor\" id=\"dask-numpy-api\"></a>\n",
+    "## Dask Numpy API\n",
+    "\n",
+    "https://docs.dask.org/en/stable/array.html\n",
+    "\n",
+    "\n",
+    "To use the Dask Numpy API, simply import the `dask.array` package, and use it instead of the `numpy` package:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "10dabd47",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy      as np\n",
+    "import dask.array as da\n",
+    "\n",
+    "data = da.random.random((1000, 1000, 1000, 20)).astype(np.float32)\n",
+    "data"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6c7c7ecd",
+   "metadata": {},
+   "source": [
+    "If you do the numbers, you will realise that the above call has created an array which requires **74 gigabytes** of memory - this is far more memory than what is available in most consumer level computers.\n",
+    "\n",
+    "The call would almost certainly fail if you made it using `np.random.random` instead of `da.random.random`, as Numpy would attempt to create the entire data set (and in fact would temporarily require up to **150 gigabytes** of memory, as Numpy uses double-precision floating point (`float64`) values by default).\n",
+    "\n",
+    "However, this worked because:\n",
+    "\n",
+    " - Dask automatically splits arrays into smaller chunks - in the above code, `data` has been split into 1331 smaller arrays, each of which has shape `(94, 94, 94, 10)`, and requires only 64 megabytes.\n",
+    "\n",
+    " - Most operations in the `dask.array` package are _lazily evaluated_. The above call has not actually created any data - Dask has just created a set of jobs which describe how to create a large array of random values.\n",
+    "\n",
+    "When using `dask.array`, nothing is actually executed until you call the `compute()` function. For example, we can try calculating the mean of our large array:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "724334d6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "m = data.mean()\n",
+    "m"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "98438c67",
+   "metadata": {},
+   "source": [
+    "But again, this will not actually calculate the mean - it just defines a task which, when executed, will calculate the mean over the `data` array.\n",
+    "\n",
+    "We can _execute_ that task by callintg `compute()` on the result:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fe23c705",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(m.compute())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e558f8cb",
+   "metadata": {},
+   "source": [
+    "Dask arrays support _most_ of the functionality of Numpy arrays, but there are a few exceptions, most notably the `numpy.linalg` package.\n",
+    "\n",
+    "\n",
+    "> Remember that Dask also provides a Pandas-style API, accessible in the `dask.dataframe` package - you can read about it at https://docs.dask.org/en/stable/dataframe.html.\n",
+    "\n",
+    "\n",
+    "<a class=\"anchor\" id=\"low-level-dask-api\"></a>\n",
+    "## Low-level Dask API\n",
+    "\n",
+    "\n",
+    "In addition to the Numpy and Pandas APIs, Dask also has a lower-level interface which allows you to create and lazily execute a pipeline made up of Python functions. This API is based around `dask.delayed`, which is a Python decorator that can be applied to any function, and which tells Dask that a function should be lazily executed.\n",
+    "\n",
+    "As a very simple example (taken from the [Dask documentation](https://docs.dask.org/en/stable/delayed.html#example)), consider this simple numerical task, where we have a set of numbers and, for each number `x`, we want to calculate `(x * x)`, and then add all of the results together (i.e. the sum of squares). We'll start by defining a function for each of the operations that we need to perform:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9c3acf29",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def square(x):\n",
+    "    return x * x\n",
+    "\n",
+    "def sum(values):\n",
+    "    total = 0\n",
+    "    for v in values:\n",
+    "        total = total + v\n",
+    "    return total"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "09c6f537",
+   "metadata": {},
+   "source": [
+    "We could solve this problem without parallelism by using conventional Python code, which might look something like the following:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "77e896a4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data   = [1, 2, 3, 4, 5]\n",
+    "output = []\n",
+    "\n",
+    "for x in data:\n",
+    "    s = square(x)\n",
+    "    output.append(s)\n",
+    "\n",
+    "total = sum(output)\n",
+    "print(total)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b731893c",
+   "metadata": {},
+   "source": [
+    "However, this problem is inherently parallelisable - we are independently performing the same series of steps to each of our inputs, before the final tallying. We can use the `dask.delayed` function to take advantage of this:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a79b728e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import dask\n",
+    "\n",
+    "output = []\n",
+    "for x in data:\n",
+    "    a = dask.delayed(square)(x)\n",
+    "    output.append(a)\n",
+    "\n",
+    "total = dask.delayed(sum)(output)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1cccc09a",
+   "metadata": {},
+   "source": [
+    "We have not actually performed any computations yet. What we have done is built a _graph_ of operations that we need to perform. Dask refers to this as a **task graph**. Dask keeps track of the dependencies of each function that we add, and so is able to determine the order in they need to be executed, and also which steps do not depend on each other and so can be executed in parallel. Dask can even visualise this task graph for us:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8d9b95b0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "total.visualize()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9c387b05",
+   "metadata": {},
+   "source": [
+    "And then when we are ready, we can call `compute()` to actually do the calculation:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f5d1d09c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "total.compute()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c1c95d5f",
+   "metadata": {},
+   "source": [
+    "For a more realistic example, let's imagine that we have T1 MRI images for five subjects, and we want to perform basic structural preprocessing on each of them (reorientation, FOV reduction, and brain extraction). Here we're creating this example data set (all of the T1 images are just a copy of the `bighead.nii.gz` image, from the FSL course data)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f6f614f2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "import shutil\n",
+    "\n",
+    "os.makedirs('braindata', exist_ok=True)\n",
+    "for i in range(1, 6):\n",
+    "    shutil.copy('../../applications/fslpy/bighead.nii.gz', f'braindata/{i:02d}.nii.gz')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ababc17d",
+   "metadata": {},
+   "source": [
+    "And now we can build our pipeline. The [fslpy](https://open.win.ox.ac.uk/pages/fsl/fslpy/) library has a collection of functions which we can use to call the FSL commands that we need.\n",
+    "\n",
+    "\n",
+    "We need to do a little work, as by default `dask.delayed` assumes that the dependencies of a function (i.e. other `delayed` functions) are passed to the function as arguments. There are [other methods](https://docs.dask.org/en/stable/delayed.html#indirect-dependencies) of dealing with this, but often the easiest option is simply to write a few small functions that define each of our tasks, in a manner that satisfies this assumption:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fe917598",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import fsl.wrappers as fw\n",
+    "\n",
+    "def reorient(input, output):\n",
+    "    fw.fslreorient2std(input, output)\n",
+    "    return output\n",
+    "\n",
+    "def fov(input, output):\n",
+    "    fw.robustfov(input, output)\n",
+    "    return output\n",
+    "\n",
+    "def bet(input, output):\n",
+    "    fw.bet(input, output)\n",
+    "    return output"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a1cd4b06",
+   "metadata": {},
+   "source": [
+    "Again we use `dask.delayed` to build up a graph of tasks that need executing, starting from the input files, and ending with our final brain-extracted outputs. Again, we are not actually executing anything here - we're just building up a task graph that Dask will then execute for us later:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d03c2ae2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import glob\n",
+    "import dask\n",
+    "import fsl.data.image as fslimage\n",
+    "\n",
+    "inputs = list(glob.glob('braindata/??.nii.gz'))\n",
+    "tasks  = []\n",
+    "\n",
+    "for input in inputs:\n",
+    "    basename = fslimage.removeExt(input)\n",
+    "    r = dask.delayed(reorient)(input, f'{basename}_reorient.nii.gz')\n",
+    "    f = dask.delayed(fov)(r,          f'{basename}_fov.nii.gz')\n",
+    "    b = dask.delayed(bet)(f,          f'{basename}_brain.nii.gz')\n",
+    "    tasks.append(b)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e0215eaf",
+   "metadata": {},
+   "source": [
+    "In the previous example we had a single output (the result of summing the squared input values) upon which we called `visualize()`. Here we have a list of independent tasks (in the `tasks` list). However, we can still visualise them by passing the entire list to the `dask.visualize()` function:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a08d0d91",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dask.visualize(*tasks)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "244a2ae9",
+   "metadata": {},
+   "source": [
+    "And, similarly, we can call `dask.compute` to run them all at once:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6b98bb2a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "outputs = dask.compute(*tasks)\n",
+    "print(outputs)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6d77a9f8",
+   "metadata": {},
+   "source": [
+    "<a class=\"anchor\" id=\"distributing-computation-with-dask\"></a>\n",
+    "## Distributing computation with Dask\n",
+    "\n",
+    "\n",
+    "In the examples above, Dask was running your tasks in parallel on your local machine. But it is easy to instruct Dask to distribute your computations across multiple machines. For example, Dask has support for executing tasks on a SGE or SLURM cluster, which we have available in Oxford at FMRIB and the BDI.\n",
+    "\n",
+    "To use this functionality, we need an additional library called `dask-jobqueue` which, at the moment, is not installed as part of FSL.\n",
+    "\n",
+    "> Note that the code cells below will not work if you are running this notebook on your own computer (unless you happen to have a SGE cluster system installed on your laptop).\n",
+    "\n",
+    "\n",
+    "The first step is to create a `SGECluster` (or `SLURMCluster`) object. You need to populate this object with information about a _single job_ in your workflow. At a minumum, you must specify the number of cores and memory that are available to a single job:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e206a399",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from dask_jobqueue import SGECluster\n",
+    "cluster = SGECluster(cores=2, memory='16GB')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a69e8eaa",
+   "metadata": {},
+   "source": [
+    "You can also set a range of other options, including specifying a queue name (e.g. `queue='short.q'`), or specifying the total amount of time that your job will run for (e.g. `walltime='01:00:00'` for one hour).\n",
+    "\n",
+    "\n",
+    "The next step is to create some jobs by calling the `scale()` method:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "20c0da1a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cluster.scale(jobs=5)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "96c05bcd",
+   "metadata": {},
+   "source": [
+    "Behind the scenes, the `scale()` method uses `qsub` to create five \"worker processes\", each running on a different cluster node. In the first instance, these worker processes will be sitting idle, doing nothing, and waiting for you to give them a task to run.\n",
+    "\n",
+    "The final step is to create  \"client\" object. This is an important step which configures Dask to use the cluster we have just set up:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e3742ee0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "client = cluster.get_client()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d4053813",
+   "metadata": {},
+   "source": [
+    "After creating a client, any Dask computations that we request will be performed on the cluster by our worker processes:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2b8fc2c0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import dask.array as da\n",
+    "data = da.random.random((1000, 1000, 1000, 10))\n",
+    "print(data.mean().compute())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e0e99795",
+   "metadata": {},
+   "source": [
+    "<a class=\"anchor\" id=\"fsl-pipe\"></a>\n",
+    "# fsl-pipe\n",
+    "\n",
+    "\n",
+    "https://open.win.ox.ac.uk/pages/fsl/fsl-pipe/\n",
+    "\n",
+    "\n",
+    "fsl-pipe is a Python library (written by our very own Michiel Cottaar) which builds upon another library called [file-tree](https://open.win.ox.ac.uk/pages/fsl/file-tree/), and allows you to write an analysis pipeline in a _declarative_ manner. A pipeline is defined by:\n",
+    "\n",
+    " - A _file tree_ which defines the directory structure of the input and output files of the pipeline.\n",
+    " - A set of _recipes_ (Python functions) describing how each of the pipeline outputs should be produced.\n",
+    "\n",
+    "In a similar vein to [Dask task graphs](#low-level-dask-api), fsl-pipe will automatically determine which recipes should be executed, and in what order, to produce whichever output file(s) you request. You can instruct fsl-pipe to run your recipes locally, be parallelised or distributed using Dask, or be executed on a cluster using `fsl_sub`.\n",
+    "\n",
+    "\n",
+    "For example, let's again imagine that we have some T1 images upon which we wish to perform basic structural preprocessing, and which are arranged in the following manner:\n",
+    "\n",
+    "> ```\n",
+    "> subjectA/\n",
+    ">     T1w.nii.gz\n",
+    "> subjectB/\n",
+    ">     T1w.nii.gz\n",
+    "> subjectC/\n",
+    ">     T1w.nii.gz\n",
+    "> ```\n",
+    "\n",
+    "The code cell below will automatically create a dummy data set with the above structure:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4d42fa32",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "import shutil\n",
+    "\n",
+    "for subj in 'ABC':\n",
+    "    subjdir = f'mydata/subject{subj}'\n",
+    "    os.makedirs(subjdir, exist_ok=True)\n",
+    "    shutil.copy('../../applications/fslpy/bighead.nii.gz', f'{subjdir}/T1w.nii.gz')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "01575655",
+   "metadata": {},
+   "source": [
+    "We must first describe the structure of our data, and save that description as a file-tree file (e.g. `mydata.tree`). This file contains a placeholder for the subject ID, and gives both the input and any desired output files unique identifiers:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4de54ba7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%writefile mydata.tree\n",
+    "subject{subject}\n",
+    "    T1w.nii.gz          (t1)\n",
+    "    T1w_brain.nii.gz    (t1_brain)\n",
+    "    T1w_fov.nii.gz      (t1_fov)\n",
+    "    T1w_reorient.nii.gz (t1_reorient)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "99006f8c",
+   "metadata": {},
+   "source": [
+    "Now we need to define our _recipes_, the individual processing steps that we want to perform, and that will generate our output files. This is similar to what we did above when we were using [Dask](#low-level-dask-api) - we define functions which perform each step."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "819900fa",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from fsl.wrappers import bet, fslreorient2std, robustfov\n",
+    "from fsl_pipe import Pipeline, In, Out\n",
+    "\n",
+    "def reorient(t1 : In, t1_reorient : Out):\n",
+    "    fslreorient2std(t1, t1_reorient)\n",
+    "\n",
+    "def fov(t1_reorient : In, t1_fov : Out):\n",
+    "    robustfov(t1_reorient, t1_fov)\n",
+    "\n",
+    "def brain_extract(t1_fov : In, t1_brain : Out):\n",
+    "    bet(t1_fov, t1_brain)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "91375ad3",
+   "metadata": {},
+   "source": [
+    "Note that we have also annotated the inputs and outputs of each function, and have used the same identifiers that we used in our `mydata.tree` file above. By doing this, `fsl-pipe` is automatically able to determine the steps that would be required in order to generate the output files that we request.\n",
+    "\n",
+    "Once we have defined all of our recipes, we need to create a `Pipeline` object, and add all of our functions to it:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "25e48fd7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pipe = Pipeline()\n",
+    "pipe(reorient)\n",
+    "pipe(fov)\n",
+    "pipe(brain_extract)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d1d5024f",
+   "metadata": {},
+   "source": [
+    "> Note that it is also possible (and equivalent) to create the `Pipeline` before defining our recipe functions, and to use the pipeline as a decorator, e.g.:\n",
+    ">\n",
+    "> ```\n",
+    "> pipe = Pipeline()\n",
+    ">\n",
+    "> @pipe\n",
+    "> def reorient(t1 : In, t1_reorient : Out):\n",
+    ">     ...\n",
+    "> ```\n",
+    ">\n",
+    "\n",
+    "\n",
+    "We now need to create a `FileTree` object which is used to generate file paths for the input and output files. The `update_glob('t1')` method instructs the `FileTree` to scan the file system, and to generate file paths for all T1 images that are present."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9d49246c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from file_tree import FileTree\n",
+    "tree = FileTree.read('mydata.tree', './mydata/').update_glob('t1')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "84d9df1f",
+   "metadata": {},
+   "source": [
+    "Then it is very easy to run our pipeline:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d0c92b06",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "jobs = pipe.generate_jobs(tree)\n",
+    "jobs.run()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e486a07f",
+   "metadata": {},
+   "source": [
+    "The default behaviour when using fsl-pipe locally is to for one task to be executed at a time. However, if you run fsl-pipe on the cluster, it will automatically submit the jobs using `fsl_sub`. You can also tell fsl-pipe to execute the pipeline using Dask, which will cause any independent jobs to be executed in parallel, e.g.:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a61d7603",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "jobs.run(method='dask')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "458be3fe",
+   "metadata": {},
+   "source": [
+    "The above example is just one way of using fsl-pipe - the library has several powerful features, including its own command-line interface, and the ability to skip jobs for output files that already exist.\n",
+    "\n",
+    "\n",
+    "<a class=\"anchor\" id=\"fsl-sub\"></a>\n",
+    "## fsl-sub\n",
+    "\n",
+    "\n",
+    "You can use the venerable `fsl_sub` to run several tasks in parallel both on your local machine, and on a HPC cluster, such as the ones available to us at FMRIB and the BDI. `fsl_sub` is typically called from the command-line, e.g.:\n",
+    "\n",
+    "> ```\n",
+    "> for dataset in mydata/sub-*; do\n",
+    ">   fsl_sub ./my_processing_script.py ${dataset} --jobram 16\n",
+    "> done\n",
+    "> ```\n",
+    "\n",
+    "\n",
+    "If you are working on a local machine, `fsl_sub` will block until your command has completed. You can run multiple commands in parallel by using \"array tasks\" - save each of the commands you want to run to a text file, e.g.:\n",
+    "\n",
+    "> ```\n",
+    "> echo \"./my_processing_script.py mydata/sub-01\"  > tasks.txt\n",
+    "> echo \"./my_processing_script.py mydata/sub-02\" >> tasks.txt\n",
+    "> echo \"./my_processing_script.py mydata/sub-03\" >> tasks.txt\n",
+    "> echo \"./my_processing_script.py mydata/sub-04\" >> tasks.txt\n",
+    "> echo \"./my_processing_script.py mydata/sub-05\" >> tasks.txt\n",
+    "> ```\n",
+    "\n",
+    "And then run `fsl_sub -t tasks.txt` - each of your commands will be executed in parallel. If you are working on a cluster, `fsl_sub` will schedule all of the commands to be executed  simultaneously.\n",
+    "\n",
+    "You can also call `fsl_sub` from Python by using functions from the `fslpy` library:\n",
+    "\n",
+    "> ```\n",
+    "> from glob import glob\n",
+    "> from fsl.wrappers import fsl_sub\n",
+    ">\n",
+    "> for dataset in glob('mydata/sub-*'):\n",
+    ">     fsl_sub(f'./my_processing_script.py ${dataset}', jobram=16)\n",
+    "> ```\n",
+    "\n",
+    "And the `fslpy` wrapper functions allow you to run FSL commands with `fsl_sub`, and to specify job dependencies. For example, to submit a collection of jobs, you can pass `submit=True`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "37134ca6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from fsl.data.image import removeExt\n",
+    "from fsl.wrappers import bet\n",
+    "from glob import glob\n",
+    "\n",
+    "for t1 in glob('braindata/??.nii.gz'):\n",
+    "    t1 = removeExt(t1)\n",
+    "    bet(t1, f'{t1}_brain', submit={'jobram':16})"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "681b1a8c",
+   "metadata": {},
+   "source": [
+    "You can also specify dependencies by using the ID of a previously submitted job, e.g.:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e7c2329c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from fsl.data.image import removeExt\n",
+    "from fsl.wrappers import robustfov, bet\n",
+    "from glob import glob\n",
+    "\n",
+    "for t1 in glob('braindata/??.nii.gz'):\n",
+    "    t1 = removeExt(t1)\n",
+    "    jid = robustfov(t1, f'{t1}_fov', submit=True)\n",
+    "    bet(f'{t1}_fov', f'{t1}_brain', submit={'jobram':16, 'jobhold' : jid})"
+   ]
+  }
+ ],
+ "metadata": {},
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/applications/parallel/parallel.md b/applications/parallel/parallel.md
new file mode 100644
index 0000000000000000000000000000000000000000..d51a17c7fe968ed9fc8abaa4fa316bfbb150bb4b
--- /dev/null
+++ b/applications/parallel/parallel.md
@@ -0,0 +1,611 @@
+# Parallel processing in Python
+
+
+While Python has built-in support for threading and parallelism in the
+[`threading`](https://docs.python.org/3/library/threading.html) and
+[`multiprocessing`](https://docs.python.org/3/library/multiprocessing.html)
+modules (covered in `advanced_programming/threading.ipynb`), there are a range of third-party libraries which you can use to improve the performance of your code.
+
+Contents:
+
+* [Do you really need to parallelise your code?](#do-you-need-to-parallelise)
+  * [Profiling your code](#profiling-your-code)
+* [JobLib](#joblib)
+* [Dask](#dask)
+  * [Dask Numpy API](#dask-numpy-api)
+  * [Low-level Dask API](#low-level-dask-api)
+  * [Distributing computation with Dask](#distributing-computation-with-dask)
+* [fsl-pipe](#fsl-pipe)
+* [fsl-sub](#fsl-sub)
+
+
+<a class="anchor" id="do-you-need-to-parallelise"></a>
+# Do you really need to parallelise your code?
+
+
+Before diving in and tearing your code apart, you should think very carefully about your problem, and the hardware you are using to run your code. For example, if you are processing MRI data for a number of subjects on the FMRIB cluster (where each node on the cluster has fairly modest hardware specs, meaning that within-process parallelisation will have limited benefits), the most efficient option may be to write your code in a single-threaded manner, and to parallelise across data sets, processing one data set on each cluster node.
+Your best option may simply be to repeatedly call `fsl_sub` - see the [section below](#fsl-sub) for more details on this approach.
+
+
+<a class="anchor" id="#profiling-your-code"></a>
+## Profilng your code
+
+
+Once you have decided that you need to parallelise some steps in your program, **STOP!** As the great Donald Knuth once said:
+
+> Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time:
+>
+> **premature optimization is the root of all evil**.
+>
+> Yet we should not pass up our opportunities in that critical 3%.
+
+
+What Knuth is essentially saying is that there is no point in optimising or rewriting a piece of code unless it is actually going to have a positive impact on performance. In other words, before you refactor your code, you need to ensure that you _understand_ where the bottlenecks are, so that you know which parts of the code _should_ be re-written.
+
+One way in which we can find those bottlenecks is through _profiling_ - running our code, and timing each part of it, to identify which parts are causing our program to run slowly.
+
+As a simple example, consider this code. It is running slowly, and we want to know why:
+
+
+```
+import numpy as np
+
+def process_data(datafile):
+    data = np.loadtxt(datafile)
+    result = data.mean()
+    return result
+
+print(process_data('mydata.txt'))
+```
+
+
+> The `mydata.txt` file can be generated by running this code:
+>
+> ```
+> import numpy as np
+> data = np.random.random((10000, 1000))
+> np.savetxt('mydata.txt', data)
+> ```
+
+
+For small functions, you can profile code manually by using the `time` module, e.g.:
+
+
+```
+import time
+
+start  = time.time()
+result = process_data('mydata.txt')
+end    = time.time()
+
+print(f'Total #seconds: {end - start}')
+print(result)
+```
+
+
+We could also do the same within the `process_data` function, calculating and reporting the timings for each individual section. But there is a library called `line_profiler` which can do this for you - it will time every single line of code, and print a report:
+
+
+```
+%load_ext line_profiler
+%lprun -f process_data process_data('mydata.txt')
+```
+
+
+> `line_profiler` is not installed as part of FSL by default, but you can install it with this command:
+>
+> ```
+> $FSLDIR/bin/conda install -p $FSLDIR line_profiler
+> ```
+>
+> And you can also use it from the command-line, if you are not working in a Jupyter notebook. You can learn more about `line_profiler` at https://github.com/pyutils/line_profiler
+
+
+We can see that most of the processing time was actually in _loading_ the data from file, and not in the data _processing_ (simply calculating the mean in this toy example). Without profiling your code, you may have naively thought that the data processing step was the culprit, and needed to be optimised. But what profiling has told us here is that our problem lies elsewhere, and perhaps we need to think about changing how we are storing our data.
+
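+> For instance, one simple change (a sketch - Numpy's binary `.npy` format is typically much faster to load than text):
+>
+> ```
+> import numpy as np
+> data = np.loadtxt('mydata.txt')
+> np.save('mydata.npy', data)   # convert the data to binary format once...
+> data = np.load('mydata.npy')  # ...and subsequent loads are much faster
+> ```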
+
+But let's assume from this point on that we _do_ need to optimise our code...
+
+
+<a class="anchor" id="joblib"></a>
+# JobLib
+
+https://joblib.readthedocs.io/en/stable/
+
+
+JobLib has been around for a while, and is a simple and lightweight library with two main features:
+
+ - An API for "embarrassingly parallel" tasks.
+ - An API for caching/re-using the results of time-consuming computations (which we won't discuss in this notebook, but you can read about [here](https://joblib.readthedocs.io/en/stable/memory.html)).
+
+On the surface, JobLib does not provide any functionality that cannot already be accomplished with built-in libraries such as `multiprocessing` and `functools`. However, the JobLib developers have put a lot of effort into ensuring that JobLib:
+
+ - is easy to use,
+ - works consistently across all of the major platforms, and
+ - works as efficiently as possible with Numpy arrays
+
+We can use the JobLib API by importing the `joblib` package (we'll also use `numpy` and `time` in this example):
+
+```
+import numpy as np
+import joblib
+import time
+```
+
+Now, let's say that we have a collection of `numpy` arrays containing image data for a group of subjects:
+
+
+```
+datasets = [np.random.random((91, 109, 91)) for i in range(10)]
+```
+
+And we want to calculate some metric on each image. We simply need to define a function which contains the calculation we want to perform (the `time.sleep` call is there just to simulate a complex calculation):
+
+```
+def do_calculation(input_data):
+    time.sleep(2)
+    return input_data.mean()
+```
+
+We can now use `joblib.Parallel` and `joblib.delayed` to parallelise those calculations. `joblib.Parallel` sets up a pool of worker processes, while `joblib.delayed` wraps up a single invocation of the function, along with the data it is to be called with, for later execution. We can execute a sequence of `delayed` tasks by passing them to the `Parallel` pool:
+
+
+```
+with joblib.Parallel(n_jobs=-1) as pool:
+    tasks   = [joblib.delayed(do_calculation)(d) for d in datasets]
+    results = pool(tasks)
+print(results)
+```
+
+Just like with `multiprocessing`, JobLib is susceptible to the same problems with regard to sharing data between processes (these problems are discussed in the `advanced_programming/threading.ipynb` notebook). In the example above, each dataset is serialised and copied from the main process to the worker processes, and then the result copied back.  This behaviour could be a performance bottleneck for your own task, or you may be working on a system with limited memory which is incapable of storing several copies of the data.
+
+To deal with this, we can use memory-mapped Numpy arrays. This is a feature built into Numpy, and supported by JobLib, which involves storing your data on the file system instead of in memory. This allows your data to be simultaneously read and written by multiple processes.
+
+> Some other options for sharing data between processes are discussed in the `advanced_programming/threading.ipynb` notebook.
+
+You can create `numpy.memmap` arrays directly, but JobLib will also automatically convert any Numpy arrays which are larger than a specified threshold into memory-mapped arrays. This threshold defaults to 1MB, and you can change it via the `max_nbytes` option when you create a `joblib.Parallel` pool.
+
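+> For example, a sketch (`max_nbytes` accepts a number of bytes, or a string such as `'50M'`):
+>
+> ```
+> with joblib.Parallel(n_jobs=-1, max_nbytes='50M') as pool:
+>     results = pool(joblib.delayed(do_calculation)(d) for d in datasets)
+> ```
+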
+To see this in action, imagine that you have some 4D fMRI data, and wish to fit a complicated model to the time series at each voxel. We will write our model fitting function so that it works on one slice at a time:
+
+
+```
+def fit_model(indata, outdata, sliceidx):
+    print(f'Fitting model at slice {sliceidx}')
+    time.sleep(1)
+    outdata[:, :, sliceidx] = indata[:, :, sliceidx, :].mean() + sliceidx
+```
+
+Now we will load our 4D data, and pre-allocate another array to store the fitted model parameters.
+
+
+```
+# Imagine that we have loaded this data from a file
+data  = np.random.random((91, 109, 91, 50)).astype(np.float32)
+
+# Pre-allocate space to store the fitted model parameters
+model = np.memmap('model.mmap',
+                  shape=(91, 109, 91),
+                  dtype=np.float32,
+                  mode='w+')
+
+# Fit our model, processing slices in parallel
+with joblib.Parallel(n_jobs=-1) as pool:
+    pool(joblib.delayed(fit_model)(data, model, slc) for slc in range(91))
+
+print(model)
+```
+
+<a class="anchor" id="dask"></a>
+# Dask
+
+https://www.dask.org/
+
+
+Dask is a very powerful library which you can use to parallelise your code, and to distribute computation across multiple machines. You can use Dask locally on your own computer, on HPC clusters (e.g. SLURM and SGE) and with cloud compute platforms (e.g. AWS, Azure).
+
+Dask has two main components:
+
+ - APIs for defining tasks/jobs - there is a low-level API comparable to that provided by JobLib, but Dask also has sophisticated high-level APIs for working with Pandas and Numpy-style data.
+
+ - A task scheduler which builds a graph of all the tasks that need to be executed, and which manages their execution, either locally or remotely.
+
+We will introduce the Numpy API and the low-level API, and then demonstrate how to use Dask to perform calculations on an SGE cluster.
+
+
+<a class="anchor" id="dask-numpy-api"></a>
+## Dask Numpy API
+
+https://docs.dask.org/en/stable/array.html
+
+
+To use the Dask Numpy API, simply import the `dask.array` package, and use it instead of the `numpy` package:
+
+```
+import numpy      as np
+import dask.array as da
+
+data = da.random.random((1000, 1000, 1000, 20)).astype(np.float32)
+data
+```
+
+If you do the numbers (1000 × 1000 × 1000 × 20 `float32` values, at 4 bytes each), you will realise that the above call has created an array which requires **74 gigabytes** of memory - far more than is available in most consumer-level computers.
+
+The call would almost certainly fail if you made it using `np.random.random` instead of `da.random.random`, as Numpy would attempt to create the entire data set (and in fact would temporarily require up to **150 gigabytes** of memory, as Numpy uses double-precision floating point (`float64`) values by default).
+
+However, this worked because:
+
+ - Dask automatically splits arrays into smaller chunks - in the above code, `data` has been split into 1331 smaller arrays, each of which has shape `(94, 94, 94, 20)`, and requires only 64 megabytes (the sketch below shows how to inspect and control this chunking).
+
+ - Most operations in the `dask.array` package are _lazily evaluated_. The above call has not actually created any data - Dask has just created a set of jobs which describe how to create a large array of random values.
+
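+For the curious, the chunking is reported by the array's `chunks` attribute, and can also be set explicitly - a small sketch (both the attribute and the `chunks` argument are standard `dask.array` features):
+
+```
+import numpy      as np
+import dask.array as da
+
+data = da.random.random((1000, 1000, 1000, 20)).astype(np.float32)
+
+# chunks reports the chunk shape along each dimension
+print(data.chunks)
+
+# the chunking can also be set explicitly via the chunks argument
+custom = da.random.random((1000, 1000, 1000, 20), chunks=(250, 250, 250, 20))
+print(custom.chunks)
+```
+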
+When using `dask.array`, nothing is actually executed until you call the `compute()` function. For example, we can try calculating the mean of our large array:
+
+
+```
+m = data.mean()
+m
+```
+
+But again, this will not actually calculate the mean - it just defines a task which, when executed, will calculate the mean over the `data` array.
+
+We can _execute_ that task by calling `compute()` on the result:
+
+```
+print(m.compute())
+```
+
+
+Dask arrays support _most_ of the functionality of Numpy arrays, but there are a few exceptions, most notably the `numpy.linalg` package.
+
+
+> Remember that Dask also provides a Pandas-style API, accessible in the `dask.dataframe` package - you can read about it at https://docs.dask.org/en/stable/dataframe.html.
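+>
+> For instance, a tiny sketch (`dd.read_csv` is the Dask analogue of `pandas.read_csv`, and accepts glob patterns; `mydata-*.csv` is a placeholder):
+>
+> ```
+> import dask.dataframe as dd
+> df = dd.read_csv('mydata-*.csv')  # lazily reads a collection of CSV files
+> print(df.describe().compute())    # computed in parallel, chunk by chunk
+> ```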
+
+
+<a class="anchor" id="low-level-dask-api"></a>
+## Low-level Dask API
+
+
+In addition to the Numpy and Pandas APIs, Dask also has a lower-level interface which allows you to create and lazily execute a pipeline made up of Python functions. This API is based around `dask.delayed`, which is a Python decorator that can be applied to any function, and which tells Dask that a function should be lazily executed.
+
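+For example, a minimal sketch of the decorator form (wrapping a function at definition time is equivalent to wrapping it at call time, which is what the examples below do):
+
+```
+import dask
+
+@dask.delayed
+def double(x):
+    return x * 2
+
+result = double(3)       # returns a Delayed object - nothing has been computed yet
+print(result.compute())  # executes the task, and prints 6
+```
+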
+As a very simple example (taken from the [Dask documentation](https://docs.dask.org/en/stable/delayed.html#example)), consider this numerical task, where we have a set of numbers and, for each number `x`, we want to calculate `(x * x)`, and then add all of the results together (i.e. the sum of squares). We'll start by defining a function for each of the operations that we need to perform:
+
+```
+def square(x):
+    return x * x
+
+def sum(values):
+    total = 0
+    for v in values:
+        total = total + v
+    return total
+```
+
+We could solve this problem without parallelism by using conventional Python code, which might look something like the following:
+
+```
+data   = [1, 2, 3, 4, 5]
+output = []
+
+for x in data:
+    s = square(x)
+    output.append(s)
+
+total = sum(output)
+print(total)
+```
+
+However, this problem is inherently parallelisable - we are independently performing the same series of steps to each of our inputs, before the final tallying. We can use the `dask.delayed` function to take advantage of this:
+
+
+```
+import dask
+
+output = []
+for x in data:
+    a = dask.delayed(square)(x)
+    output.append(a)
+
+total = dask.delayed(sum)(output)
+```
+
+We have not actually performed any computations yet. What we have done is built a _graph_ of operations that we need to perform. Dask refers to this as a **task graph**. Dask keeps track of the dependencies of each function that we add, and so is able to determine the order in which they need to be executed, and also which steps do not depend on each other and so can be executed in parallel. Dask can even visualise this task graph for us:
+
+```
+total.visualize()
+```
+
+And then when we are ready, we can call `compute()` to actually do the calculation:
+
+```
+total.compute()
+```
+
+
+For a more realistic example, let's imagine that we have T1 MRI images for five subjects, and we want to perform basic structural preprocessing on each of them (reorientation, FOV reduction, and brain extraction). Here we're creating this example data set (all of the T1 images are just a copy of the `bighead.nii.gz` image, from the FSL course data).
+
+
+```
+import os
+import shutil
+
+os.makedirs('braindata', exist_ok=True)
+for i in range(1, 6):
+    shutil.copy('../../applications/fslpy/bighead.nii.gz', f'braindata/{i:02d}.nii.gz')
+```
+
+And now we can build our pipeline. The [fslpy](https://open.win.ox.ac.uk/pages/fsl/fslpy/) library has a collection of functions which we can use to call the FSL commands that we need.
+
+
+We need to do a little work, as by default `dask.delayed` assumes that the dependencies of a function (i.e. other `delayed` functions) are passed to the function as arguments. There are [other methods](https://docs.dask.org/en/stable/delayed.html#indirect-dependencies) of dealing with this, but often the easiest option is simply to write a few small functions that define each of our tasks, in a manner that satisfies this assumption:
+
+
+```
+import fsl.wrappers as fw
+
+def reorient(input, output):
+    fw.fslreorient2std(input, output)
+    return output
+
+def fov(input, output):
+    fw.robustfov(input, output)
+    return output
+
+def bet(input, output):
+    fw.bet(input, output)
+    return output
+```
+
+Again we use `dask.delayed` to build up a graph of tasks that need executing, starting from the input files, and ending with our final brain-extracted outputs. As before, we are not actually executing anything here - we're just building up a task graph that Dask will then execute for us later:
+
+```
+import glob
+import dask
+import fsl.data.image as fslimage
+
+inputs = list(glob.glob('braindata/??.nii.gz'))
+tasks  = []
+
+for input in inputs:
+    basename = fslimage.removeExt(input)
+    r = dask.delayed(reorient)(input, f'{basename}_reorient.nii.gz')
+    f = dask.delayed(fov)(r,          f'{basename}_fov.nii.gz')
+    b = dask.delayed(bet)(f,          f'{basename}_brain.nii.gz')
+    tasks.append(b)
+```
+
+In the previous example we had a single output (the result of summing the squared input values) upon which we called `visualize()`. Here we have a list of independent tasks (in the `tasks` list). However, we can still visualise them by passing the entire list to the `dask.visualize()` function:
+
+```
+dask.visualize(*tasks)
+```
+
+And, similarly, we can call `dask.compute` to run them all at once:
+
+```
+outputs = dask.compute(*tasks)
+print(outputs)
+```
+
+
+<a class="anchor" id="distributing-computation-with-dask"></a>
+## Distributing computation with Dask
+
+
+In the examples above, Dask was running your tasks in parallel on your local machine. But it is easy to instruct Dask to distribute your computations across multiple machines. For example, Dask has support for executing tasks on an SGE or SLURM cluster, which we have available in Oxford at FMRIB and the BDI.
+
+To use this functionality, we need an additional library called `dask-jobqueue` which, at the moment, is not installed as part of FSL.
+
+> Note that the code cells below will not work if you are running this notebook on your own computer (unless you happen to have an SGE cluster system installed on your laptop).
+
+
+The first step is to create an `SGECluster` (or `SLURMCluster`) object. You need to populate this object with information about a _single job_ in your workflow. At a minimum, you must specify the number of cores and memory that are available to a single job:
+
+
+```
+from dask_jobqueue import SGECluster
+cluster = SGECluster(cores=2, memory='16GB')
+```
+
+You can also set a range of other options, including specifying a queue name (e.g. `queue='short.q'`), or specifying the total amount of time that your job will run for (e.g. `walltime='01:00:00'` for one hour).
+
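+> For example, a sketch (`queue` and `walltime` are options supported by `dask-jobqueue`; `short.q` is just a placeholder queue name):
+>
+> ```
+> cluster = SGECluster(cores=2, memory='16GB',
+>                      queue='short.q', walltime='01:00:00')
+> ```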
+
+The next step is to create some jobs by calling the `scale()` method:
+
+```
+cluster.scale(jobs=5)
+```
+
+Behind the scenes, the `scale()` method uses `qsub` to create five "worker processes", each running on a different cluster node. In the first instance, these worker processes will be sitting idle, doing nothing, and waiting for you to give them a task to run.
+
+The final step is to create a "client" object. This is an important step which configures Dask to use the cluster we have just set up:
+
+```
+client = cluster.get_client()
+```
+
+After creating a client, any Dask computations that we request will be performed on the cluster by our worker processes:
+
+```
+import dask.array as da
+data = da.random.random((1000, 1000, 1000, 10))
+print(data.mean().compute())
+```
+
+
+<a class="anchor" id="fsl-pipe"></a>
+# fsl-pipe
+
+
+https://open.win.ox.ac.uk/pages/fsl/fsl-pipe/
+
+
+fsl-pipe is a Python library (written by our very own Michiel Cottaar) which builds upon another library called [file-tree](https://open.win.ox.ac.uk/pages/fsl/file-tree/), and allows you to write an analysis pipeline in a _declarative_ manner. A pipeline is defined by:
+
+ - A _file tree_ which defines the directory structure of the input and output files of the pipeline.
+ - A set of _recipes_ (Python functions) describing how each of the pipeline outputs should be produced.
+
+In a similar vein to [Dask task graphs](#low-level-dask-api), fsl-pipe will automatically determine which recipes should be executed, and in what order, to produce whichever output file(s) you request. You can instruct fsl-pipe to run your recipes locally, to parallelise or distribute them using Dask, or to execute them on a cluster using `fsl_sub`.
+
+
+For example, let's again imagine that we have some T1 images upon which we wish to perform basic structural preprocessing, and which are arranged in the following manner:
+
+> ```
+> subjectA/
+>     T1w.nii.gz
+> subjectB/
+>     T1w.nii.gz
+> subjectC/
+>     T1w.nii.gz
+> ```
+
+The code cell below will automatically create a dummy data set with the above structure:
+
+
+```
+import os
+import shutil
+
+for subj in 'ABC':
+    subjdir = f'mydata/subject{subj}'
+    os.makedirs(subjdir, exist_ok=True)
+    shutil.copy('../../applications/fslpy/bighead.nii.gz', f'{subjdir}/T1w.nii.gz')
+```
+
+
+We must first describe the structure of our data, and save that description as a file-tree file (e.g. `mydata.tree`). This file contains a placeholder for the subject ID, and gives both the input and any desired output files unique identifiers:
+
+
+```
+%%writefile mydata.tree
+subject{subject}
+    T1w.nii.gz          (t1)
+    T1w_brain.nii.gz    (t1_brain)
+    T1w_fov.nii.gz      (t1_fov)
+    T1w_reorient.nii.gz (t1_reorient)
+```
+
+
+Now we need to define our _recipes_, the individual processing steps that we want to perform, and that will generate our output files. This is similar to what we did above when we were using [Dask](#low-level-dask-api) - we define functions which perform each step.
+
+```
+from fsl.wrappers import bet, fslreorient2std, robustfov
+from fsl_pipe import Pipeline, In, Out
+
+def reorient(t1 : In, t1_reorient : Out):
+    fslreorient2std(t1, t1_reorient)
+
+def fov(t1_reorient : In, t1_fov : Out):
+    robustfov(t1_reorient, t1_fov)
+
+def brain_extract(t1_fov : In, t1_brain : Out):
+    bet(t1_fov, t1_brain)
+```
+
+Note that we have also annotated the inputs and outputs of each function, and have used the same identifiers that we used in our `mydata.tree` file above. By doing this, `fsl-pipe` is automatically able to determine the steps that would be required in order to generate the output files that we request.
+
+Once we have defined all of our recipes, we need to create a `Pipeline` object, and add all of our functions to it:
+
+
+```
+pipe = Pipeline()
+pipe(reorient)
+pipe(fov)
+pipe(brain_extract)
+```
+
+
+> Note that it is also possible (and equivalent) to create the `Pipeline` before defining our recipe functions, and to use the pipeline as a decorator, e.g.:
+>
+> ```
+> pipe = Pipeline()
+>
+> @pipe
+> def reorient(t1 : In, t1_reorient : Out):
+>     ...
+> ```
+>
+
+
+We now need to create a `FileTree` object which is used to generate file paths for the input and output files. The `update_glob('t1')` method instructs the `FileTree` to scan the file system, and to generate file paths for all T1 images that are present.
+
+
+```
+from file_tree import FileTree
+tree = FileTree.read('mydata.tree', './mydata/').update_glob('t1')
+```
+
+Then it is very easy to run our pipeline:
+
+```
+jobs = pipe.generate_jobs(tree)
+jobs.run()
+```
+
+The default behaviour when using fsl-pipe locally is for one task to be executed at a time. However, if you run fsl-pipe on the cluster, it will automatically submit the jobs using `fsl_sub`. You can also tell fsl-pipe to execute the pipeline using Dask, which will cause any independent jobs to be executed in parallel, e.g.:
+
+
+```
+jobs.run(method='dask')
+```
+
+The above example is just one way of using fsl-pipe - the library has several powerful features, including its own command-line interface, and the ability to skip jobs for output files that already exist.
+
+
+<a class="anchor" id="fsl-sub"></a>
+# fsl-sub
+
+
+You can use the venerable `fsl_sub` to run several tasks in parallel both on your local machine, and on an HPC cluster, such as the ones available to us at FMRIB and the BDI. `fsl_sub` is typically called from the command-line, e.g.:
+
+> ```
+> for dataset in mydata/sub-*; do
+>   fsl_sub ./my_processing_script.py ${dataset} --jobram 16
+> done
+> ```
+
+
+If you are working on a local machine, `fsl_sub` will block until your command has completed. You can run multiple commands in parallel by using "array tasks" - save each of the commands you want to run to a text file, e.g.:
+
+> ```
+> echo "./my_processing_script.py mydata/sub-01"  > tasks.txt
+> echo "./my_processing_script.py mydata/sub-02" >> tasks.txt
+> echo "./my_processing_script.py mydata/sub-03" >> tasks.txt
+> echo "./my_processing_script.py mydata/sub-04" >> tasks.txt
+> echo "./my_processing_script.py mydata/sub-05" >> tasks.txt
+> ```
+
+And then run `fsl_sub -t tasks.txt` - each of your commands will be executed in parallel. If you are working on a cluster, `fsl_sub` will schedule all of the commands to be executed simultaneously.
+
+You can also call `fsl_sub` from Python by using functions from the `fslpy` library:
+
+> ```
+> from glob import glob
+> from fsl.wrappers import fsl_sub
+>
+> for dataset in glob('mydata/sub-*'):
+>     fsl_sub(f'./my_processing_script.py {dataset}', jobram=16)
+> ```
+
+And the `fslpy` wrapper functions allow you to run FSL commands with `fsl_sub`, and to specify job dependencies. For example, to submit a collection of jobs, you can pass `submit=True`:
+
+```
+from fsl.data.image import removeExt
+from fsl.wrappers import bet
+from glob import glob
+
+for t1 in glob('braindata/??.nii.gz'):
+    t1 = removeExt(t1)
+    bet(t1, f'{t1}_brain', submit={'jobram':16})
+```
+
+
+You can also specify dependencies by using the ID of a previously submitted job, e.g.:
+
+
+```
+from fsl.data.image import removeExt
+from fsl.wrappers import robustfov, bet
+from glob import glob
+
+for t1 in glob('braindata/??.nii.gz'):
+    t1 = removeExt(t1)
+    jid = robustfov(t1, f'{t1}_fov', submit=True)
+    bet(f'{t1}_fov', f'{t1}_brain', submit={'jobram':16, 'jobhold' : jid})
+```