%% Cell type:markdown id: tags:
# Threading and parallel processing
The Python language has built-in support for multi-threading in the
[`threading`](https://docs.python.org/3.5/library/threading.html) module, and
true parallelism in the
[`multiprocessing`](https://docs.python.org/3.5/library/multiprocessing.html)
module. If you want to be impressed, skip straight to the section on
[`multiprocessing`](todo).
## Threading
The [`threading`](https://docs.python.org/3.5/library/threading.html) module
provides a traditional multi-threading API that should be familiar to you if
you have worked with threads in other languages.
Running a task in a separate thread in Python is easy - simply create a
`Thread` object, and pass it the function or method that you want it to
run. Then call its `start` method:
%% Cell type:code id: tags:
```
import time
import threading

def longRunningTask(niters):
    for i in range(niters):
        if i % 2 == 0: print('Tick')
        else:          print('Tock')
        time.sleep(0.5)

t = threading.Thread(target=longRunningTask, args=(8,))

t.start()

while t.is_alive():
    time.sleep(0.4)
    print('Waiting for thread to finish...')
print('Finished!')
```
%% Cell type:markdown id: tags:
You can also `join` a thread, which will block execution in the current thread
until the thread that has been `join`ed has finished:
%% Cell type:code id: tags:
```
t = threading.Thread(target=longRunningTask, args=(6,))
t.start()
print('Joining thread ...')
t.join()
print('Finished!')
```
%% Cell type:markdown id: tags:
### Subclassing `Thread`
It is also possible to sub-class the `Thread` class, and override its `run`
method:
%% Cell type:code id: tags:
```
class LongRunningThread(threading.Thread):

    def __init__(self, niters, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.niters = niters

    def run(self):
        for i in range(self.niters):
            if i % 2 == 0: print('Tick')
            else:          print('Tock')
            time.sleep(0.5)

t = LongRunningThread(6)
t.start()
t.join()
print('Done')
```
%% Cell type:markdown id: tags:
### Daemon threads
By default, a Python application will not exit until _all_ active threads have
finished execution. If you want to run a task in the background for the
duration of your application, you can mark it as a `daemon` thread - when all
non-daemon threads in a Python application have finished, all daemon threads
will be halted, and the application will exit.
You can mark a thread as being a daemon by setting an attribute on it after
creation:
%% Cell type:code id: tags:
```
t = threading.Thread(target=longRunningTask, args=(8,))
t.daemon = True
```
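%% Cell type:markdown id: tags:
Alternatively, the `daemon` flag can be passed directly to the `Thread`
constructor (supported from Python 3.3 onwards):
%% Cell type:code id: tags:
```
t = threading.Thread(target=longRunningTask, args=(8,), daemon=True)
```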
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
See the [`Thread` See the [`Thread`
documentation](https://docs.python.org/3.5/library/threading.html#thread-objects) documentation](https://docs.python.org/3.5/library/threading.html#thread-objects)
for more details. for more details.
### Thread synchronisation ### Thread synchronisation
The `threading` module provides some useful thread-synchronisation primitives The `threading` module provides some useful thread-synchronisation primitives
- the `Lock`, `RLock` (re-entrant `Lock`), and `Event` classes. The - the `Lock`, `RLock` (re-entrant `Lock`), and `Event` classes. The
`threading` module also provides `Condition` and `Semaphore` classes - refer `threading` module also provides `Condition` and `Semaphore` classes - refer
to the [documentation](https://docs.python.org/3.5/library/threading.html) for to the [documentation](https://docs.python.org/3.5/library/threading.html) for
more details. more details.
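For example, a `Semaphore` can be used to limit the number of threads that are
inside a section of code at the same time - a minimal sketch (the limit of two
threads is arbitrary):
%% Cell type:code id: tags:
```
sem = threading.Semaphore(2)

def limitedTask(tid):
    # At most two threads may hold the
    # semaphore at any one time - the
    # others block until one releases it.
    with sem:
        print('Thread {} working'.format(tid))
        time.sleep(0.5)
    print('Thread {} done'.format(tid))

threads = [threading.Thread(target=limitedTask, args=(i,)) for i in range(6)]

for t in threads:
    t.start()
```
%% Cell type:markdown id: tags: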
#### `Lock`
The [`Lock`](https://docs.python.org/3.5/library/threading.html#lock-objects)
class (and its re-entrant version, the
[`RLock`](https://docs.python.org/3.5/library/threading.html#rlock-objects))
prevents a block of code from being accessed by more than one thread at a
time. For example, if we have multiple threads running this `task` function,
their [outputs](https://www.youtube.com/watch?v=F5fUFnfPpYU) will inevitably
become intertwined:
%% Cell type:code id: tags:
```
def task():
    for i in range(5):
        print('{} Woozle '.format(i), end='')
        time.sleep(0.1)
        print('Wuzzle')

threads = [threading.Thread(target=task) for i in range(5)]

for t in threads:
    t.start()
```
%% Cell type:markdown id: tags:
But if we protect the critical section with a `Lock` object, the output will
look more sensible:
%% Cell type:code id: tags:
```
lock = threading.Lock()

def task():
    for i in range(5):
        with lock:
            print('{} Woozle '.format(i), end='')
            time.sleep(0.1)
            print('Wuzzle')

threads = [threading.Thread(target=task) for i in range(5)]

for t in threads:
    t.start()
```
%% Cell type:markdown id: tags:
> Instead of using a `Lock` object in a `with` statement, it is also possible
> to manually call its `acquire` and `release` methods:
>
>     def task():
>         for i in range(5):
>             lock.acquire()
>             print('{} Woozle '.format(i), end='')
>             time.sleep(0.1)
>             print('Wuzzle')
>             lock.release()
Python does not have any built-in constructs to implement `Lock`-based mutual
exclusion across several functions or methods - each function/method must
explicitly acquire/release a shared `Lock` instance. However, it is relatively
straightforward to implement a decorator which does this for you:
%% Cell type:code id: tags:
```
def mutex(func, lock):
    def wrapper(*args):
        with lock:
            func(*args)
    return wrapper

class MyClass(object):

    def __init__(self):
        lock = threading.Lock()
        self.safeFunc1 = mutex(self.safeFunc1, lock)
        self.safeFunc2 = mutex(self.safeFunc2, lock)

    def safeFunc1(self):
        time.sleep(0.1)
        print('safeFunc1 start')
        time.sleep(0.2)
        print('safeFunc1 end')

    def safeFunc2(self):
        time.sleep(0.1)
        print('safeFunc2 start')
        time.sleep(0.2)
        print('safeFunc2 end')

mc = MyClass()

f1threads = [threading.Thread(target=mc.safeFunc1) for i in range(4)]
f2threads = [threading.Thread(target=mc.safeFunc2) for i in range(4)]

for t in f1threads + f2threads:
    t.start()
```
%% Cell type:markdown id: tags:
Try removing the `mutex` lock from the two methods in the above code, and see
what it does to the output.
#### `Event`
The
[`Event`](https://docs.python.org/3.5/library/threading.html#event-objects)
class is essentially a boolean [semaphore][semaphore-wiki]. It can be used to
signal events between threads. Threads can `wait` on the event, and be awoken
when the event is `set` by another thread:
[semaphore-wiki]: https://en.wikipedia.org/wiki/Semaphore_(programming)
%% Cell type:code id: tags:
```
import numpy as np

processingFinished = threading.Event()

def processData(data):
    print('Processing data ...')
    time.sleep(2)
    print('Result: {}'.format(data.mean()))
    processingFinished.set()

data = np.random.randint(1, 100, 100)

t = threading.Thread(target=processData, args=(data,))
t.start()

processingFinished.wait()
print('Processing finished!')
```
%% Cell type:markdown id: tags:
### The Global Interpreter Lock (GIL)
The [_Global Interpreter
Lock_](https://docs.python.org/3/c-api/init.html#thread-state-and-the-global-interpreter-lock)
is an implementation detail of [CPython](https://github.com/python/cpython)
(the official Python interpreter). The GIL means that a multi-threaded
program written in pure Python is not able to take advantage of multiple
cores - this essentially means that only one thread may be executing at any
point in time.
The `threading` module does still have its uses though, as this GIL problem
does not affect tasks which involve calls to system or natively compiled
libraries (e.g. file and network I/O, Numpy operations, etc.). So you can,
for example, perform some expensive processing on a Numpy array in a thread
running on one core, whilst having another thread (e.g. user interaction)
running on another core.
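%% Cell type:markdown id: tags:
For example, Numpy releases the GIL while performing many of its operations,
so two threads performing large matrix multiplications can genuinely run at
the same time - a minimal sketch (the matrix sizes are arbitrary, and timings
will vary between systems):
%% Cell type:code id: tags:
```
import threading
import numpy as np

def heavyComputation(name):
    # Numpy releases the GIL inside dot,
    # so these calls can run concurrently
    # on different cores.
    a = np.random.random((2000, 2000))
    b = a.dot(a)
    print('{}: done ({:0.2f})'.format(name, b.mean()))

threads = [threading.Thread(target=heavyComputation, args=('thread {}'.format(i),))
           for i in range(2)]

for t in threads: t.start()
for t in threads: t.join()
```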
%% Cell type:markdown id: tags:
## Multiprocessing
For true parallelism, you should check out the
[`multiprocessing`](https://docs.python.org/3.5/library/multiprocessing.html)
module.
The `multiprocessing` module spawns sub-processes, rather than threads, and so
is not subject to the GIL constraints that the `threading` module suffers
from. It provides two APIs - a "traditional" equivalent to that provided by
the `threading` module, and a powerful higher-level API.
### `threading`-equivalent API
The
[`Process`](https://docs.python.org/3.5/library/multiprocessing.html#the-process-class)
class is the `multiprocessing` equivalent of the
[`threading.Thread`](https://docs.python.org/3.5/library/threading.html#thread-objects)
class. `multiprocessing` also has equivalents of the [`Lock` and `Event`
classes](https://docs.python.org/3.5/library/multiprocessing.html#synchronization-between-processes),
and the other synchronisation primitives provided by `threading`.
So you can simply replace `threading.Thread` with `multiprocessing.Process`,
and you will have true parallelism.
Because your "threads" are now independent processes, you need to be a little
careful about how to share information across them. Fortunately, the
`multiprocessing` module provides [`Queue` and `Pipe`
classes](https://docs.python.org/3.5/library/multiprocessing.html#exchanging-objects-between-processes)
which make it easy to share data across processes.
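For example, a parent process can collect results from its workers via a
`Queue` - a minimal sketch (the `worker` function is made up for
illustration):
%% Cell type:code id: tags:
```
import multiprocessing as mp

def worker(q, x):
    # Push the result onto the queue
    # that is shared with the parent
    q.put(x * x)

q     = mp.Queue()
procs = [mp.Process(target=worker, args=(q, i)) for i in range(4)]

for p in procs: p.start()
for p in procs: p.join()

# Results arrive in completion
# order, not submission order.
results = [q.get() for _ in procs]
print(results)
```
%% Cell type:markdown id: tags: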
### Higher-level API - the `multiprocessing.Pool`
The real advantages of `multiprocessing` lie in its higher level API, centred
around the [`Pool`
class](https://docs.python.org/3.5/library/multiprocessing.html#using-a-pool-of-workers).
Essentially, you create a `Pool` of worker processes - you specify the number
of processes when you create the pool.
> The best number of processes to use for a `Pool` will depend on the system
> you are running on (number of cores), and the tasks you are running (e.g.
> I/O bound or CPU bound).
Once you have created a `Pool`, you can use its methods to automatically
parallelise tasks. The most useful are the `map`, `starmap` and
`apply_async` methods.
#### `Pool.map`
The
[`Pool.map`](https://docs.python.org/3.5/library/multiprocessing.html#multiprocessing.pool.Pool.map)
method is the multiprocessing equivalent of the built-in
[`map`](https://docs.python.org/3.5/library/functions.html#map) function - it
is given a function, and a sequence, and it applies the function to each
element in the sequence.
%% Cell type:code id: tags:
```
import time
import multiprocessing as mp
import numpy as np

def crunchImage(imgfile):

    # Load a nifti image, do stuff
    # to it. Use your imagination
    # to fill in this function.
    time.sleep(2)

    # numpy's random number generator
    # will be initialised in the same
    # way in each process, so let's
    # re-seed it.
    np.random.seed()
    result = np.random.randint(1, 100, 1)

    print(imgfile, ':', result)
    return result

imgfiles = ['{:02d}.nii.gz'.format(i) for i in range(20)]

p = mp.Pool(processes=16)

print('Crunching images...')

start   = time.time()
results = p.map(crunchImage, imgfiles)
end     = time.time()

print('Total execution time: {:0.2f} seconds'.format(end - start))
```
%% Cell type:markdown id: tags:
The `Pool.map` method only works with functions that accept one argument, such
as our `crunchImage` function above. If you have a function which accepts
multiple arguments, use the
[`Pool.starmap`](https://docs.python.org/3.5/library/multiprocessing.html#multiprocessing.pool.Pool.starmap)
method instead:
%% Cell type:code id: tags:
```
def crunchImage(imgfile, modality):
    time.sleep(2)

    np.random.seed()

    if modality == 't1':
        result = np.random.randint(1, 100, 1)
    elif modality == 't2':
        result = np.random.randint(100, 200, 1)

    print(imgfile, ': ', result)
    return result

imgfiles   = ['t1_{:02d}.nii.gz'.format(i) for i in range(10)] + \
             ['t2_{:02d}.nii.gz'.format(i) for i in range(10)]
modalities = ['t1'] * 10 + ['t2'] * 10

pool = mp.Pool(processes=16)

args = [(f, m) for f, m in zip(imgfiles, modalities)]

print('Crunching images...')

start   = time.time()
results = pool.starmap(crunchImage, args)
end     = time.time()

print('Total execution time: {:0.2f} seconds'.format(end - start))
```
%% Cell type:markdown id: tags:
The `map` and `starmap` methods also have asynchronous equivalents `map_async`
and `starmap_async`, which return immediately. Refer to the
[`Pool`](https://docs.python.org/3.5/library/multiprocessing.html#module-multiprocessing.pool)
documentation for more details.
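For example, `map_async` returns an `AsyncResult` straight away, leaving the
main process free to do other work while the jobs run - a minimal sketch,
re-using the `pool` created above (`processFile` is a made-up stand-in
function):
%% Cell type:code id: tags:
```
def processFile(fname):
    time.sleep(1)
    return fname.upper()

fnames      = ['{:02d}.nii.gz'.format(i) for i in range(8)]
asyncresult = pool.map_async(processFile, fnames)

# The main process is free to do other
# work here while the jobs run ...
print('Jobs submitted - doing other work ...')

# ... and can collect the results
# when it needs them.
print(asyncresult.get())
```
%% Cell type:markdown id: tags: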
#### `Pool.apply_async`
The
[`Pool.apply`](https://docs.python.org/3.5/library/multiprocessing.html#multiprocessing.pool.Pool.apply)
method will execute a function on one of the processes, and block until it has
finished. The
[`Pool.apply_async`](https://docs.python.org/3.5/library/multiprocessing.html#multiprocessing.pool.Pool.apply_async)
method returns immediately, and is thus more suited to asynchronously
scheduling multiple jobs to run in parallel.
`apply_async` returns an object of type
[`AsyncResult`](https://docs.python.org/3.5/library/multiprocessing.html#multiprocessing.pool.AsyncResult).
An `AsyncResult` object has `wait` and `get` methods which will block until
the job has completed.
%% Cell type:code id: tags:
```
import time
import multiprocessing as mp
import numpy as np

def linear_registration(src, ref):
    time.sleep(1)
    return np.eye(4)

def nonlinear_registration(src, ref, affine):
    time.sleep(3)

    # this number represents a non-linear warp
    # field - use your imagination people!
    np.random.seed()
    return np.random.randint(1, 100, 1)

t1s = ['{:02d}_t1.nii.gz'.format(i) for i in range(20)]
std = 'MNI152_T1_2mm.nii.gz'

pool = mp.Pool(processes=16)

print('Running structural-to-standard registration '
      'on {} subjects...'.format(len(t1s)))

# Run linear registration on all the T1s.
#
# We build a list of AsyncResult objects
linresults = [pool.apply_async(linear_registration, (t1, std))
              for t1 in t1s]

# Then we wait for each job to finish,
# and replace its AsyncResult object
# with the actual result - an affine
# transformation matrix.
start = time.time()
for i, r in enumerate(linresults):
    linresults[i] = r.get()
end = time.time()

print('Linear registrations completed in '
      '{:0.2f} seconds'.format(end - start))

# Run non-linear registration on all the T1s,
# using the linear registrations to initialise.
nlinresults = [pool.apply_async(nonlinear_registration, (t1, std, aff))
               for (t1, aff) in zip(t1s, linresults)]

# Wait for each non-linear reg to finish,
# and store the resulting warp field.
start = time.time()
for i, r in enumerate(nlinresults):
    nlinresults[i] = r.get()
end = time.time()

print('Non-linear registrations completed in '
      '{:0.2f} seconds'.format(end - start))

print('Non linear registrations:')
for t1, result in zip(t1s, nlinresults):
    print(t1, ':', result)
```
%% Cell type:markdown id: tags:
### Sharing data between processes
When you use the `Pool.map` method (or any of the other methods we have shown)
to run a function on a sequence of items, those items must be copied into the
memory of each of the child processes. When the child processes are finished,
the data that they return then has to be copied back to the parent process.
Any items which you wish to pass to a function that is executed by a `Pool`
must be picklable - the built-in
[`pickle`](https://docs.python.org/3.5/library/pickle.html) module is used by
`multiprocessing` to serialise and de-serialise the data passed into and
returned from a child process. The majority of standard Python types (`list`,
`dict`, `str` etc), and Numpy arrays, can be pickled and unpickled, so you only
need to worry about this detail if you are passing objects of a custom type
(e.g. instances of classes that you have written, or that are defined in some
third-party library).
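You can check whether an object will survive the trip before handing it to a
`Pool`, by passing it through `pickle` yourself - a minimal sketch:
%% Cell type:code id: tags:
```
import pickle
import numpy as np

obj = {'name' : 'subject01', 'data' : np.arange(10)}

# Serialise and de-serialise the object,
# as multiprocessing does when passing it
# to/from a child process. An unpicklable
# object would raise an error here.
copy = pickle.loads(pickle.dumps(obj))
print(copy)
```
%% Cell type:markdown id: tags: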
There is obviously some overhead in copying data back and forth between the
main process and the worker processes. For most computationally intensive
tasks, this communication overhead is not important - the performance
bottleneck is typically going to be the computation time, rather than I/O
between the parent and child processes. You may need to spend some time
adjusting the way in which you split up your data, and the number of
processes, in order to get the best performance.
However, if you have determined that copying data between processes is having
a substantial impact on your performance, the `multiprocessing` module
provides the [`Value`, `Array`, and `RawArray`
classes](https://docs.python.org/3.5/library/multiprocessing.html#shared-ctypes-objects),
which allow you to share individual values (`Value`), or arrays of values
(`Array` and `RawArray`), between processes.
The `Array` and `RawArray` classes essentially wrap a typed pointer (from the
built-in [`ctypes`](https://docs.python.org/3.5/library/ctypes.html) module)
to a block of memory. We can use the `Array` or `RawArray` class to share a
Numpy array between our worker processes. The difference between an `Array`
and a `RawArray` is that the former offers synchronised (i.e. process-safe)
access to the shared memory. This is necessary if your child processes will be
modifying the same parts of your data.
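For instance, an `Array` carries its own lock, which can be acquired via its
`get_lock` method when several processes might write to overlapping regions of
the shared memory - a minimal sketch:
%% Cell type:code id: tags:
```
import multiprocessing as mp
import ctypes

# A synchronised shared-memory
# array of 10 doubles
sarr = mp.Array(ctypes.c_double, 10)

# Acquire the lock around a compound
# update which must not be interleaved
# with writes from other processes.
with sarr.get_lock():
    for i in range(len(sarr)):
        sarr[i] = i * 2

print(sarr[:])
```
%% Cell type:markdown id: tags: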
Due to the way that shared memory works, in order to share a Numpy array
between different processes you need to structure your code so that the
array(s) you want to share are accessible at the _module level_. Furthermore,
we need to make sure that our input and output arrays are located in shared
memory - we can do this via the `Array` or `RawArray`.
As an example, let's say we want to parallelise processing of an image by
having each worker process perform calculations on a chunk of the image.
First, let's define a function which does the calculation on a specified set
of image coordinates:
%% Cell type:code id: tags:
```
import multiprocessing as mp
import ctypes
import numpy as np

np.set_printoptions(suppress=True)

def process_chunk(shape, idxs):

    # Get references to our
    # input/output data, and
    # create Numpy array views
    # into them.
    sindata  = process_chunk.input_data
    soutdata = process_chunk.output_data
    indata   = np.ctypeslib.as_array(sindata) .reshape(shape)
    outdata  = np.ctypeslib.as_array(soutdata).reshape(shape)

    # Do the calculation on
    # the specified voxels
    outdata[idxs] = indata[idxs] ** 2
```
%% Cell type:markdown id: tags:
Rather than passing the input and output data arrays in as arguments to the
`process_chunk` function, we set them as attributes of the `process_chunk`
function. This makes the input/output data accessible at the module level,
which is required in order to share the data between the main process and the
child processes.
Now let's define a second function which processes an entire image. It does
the following:
1. Initialises shared memory areas to store the input and output data.
2. Copies the input data into shared memory.
3. Sets the input and output data as attributes of the `process_chunk` function.
4. Creates sets of indices into the input data which, for each worker process,
specifies the portion of the data that it is responsible for.
5. Creates a worker pool, and runs the `process_chunk` function for each set
of indices.
%% Cell type:code id: tags:
```
def process_dataset(data):

    nprocs   = 8
    origData = data

    # Create arrays to store the
    # input and output data
    sindata  = mp.RawArray(ctypes.c_double, data.size)
    soutdata = mp.RawArray(ctypes.c_double, data.size)
    data     = np.ctypeslib.as_array(sindata).reshape(data.shape)
    outdata  = np.ctypeslib.as_array(soutdata).reshape(data.shape)

    # Copy the input data
    # into shared memory
    data[:] = origData

    # Make the input/output data
    # accessible to the process_chunk
    # function. This must be done
    # *before* the worker pool is created.
    process_chunk.input_data  = sindata
    process_chunk.output_data = soutdata

    # number of voxels to be computed
    # by each worker process.
    nvox = int(data.size / nprocs)

    # Generate coordinates for
    # every voxel in the image
    xlen, ylen, zlen = data.shape
    xs, ys, zs = np.meshgrid(np.arange(xlen),
                             np.arange(ylen),
                             np.arange(zlen))

    xs = xs.flatten()
    ys = ys.flatten()
    zs = zs.flatten()

    # We're going to pass each worker
    # process a list of indices, which
    # specify the data items which that
    # worker process needs to compute.
    xs = [xs[nvox * i:nvox * i + nvox] for i in range(nprocs)] + \
         [xs[nvox * nprocs:]]
    ys = [ys[nvox * i:nvox * i + nvox] for i in range(nprocs)] + \
         [ys[nvox * nprocs:]]
    zs = [zs[nvox * i:nvox * i + nvox] for i in range(nprocs)] + \
         [zs[nvox * nprocs:]]

    # Build the argument lists for
    # each worker process.
    args = [(data.shape, (x, y, z)) for x, y, z in zip(xs, ys, zs)]

    # Create a pool of worker
    # processes and run the jobs.
    pool = mp.Pool(processes=nprocs)

    pool.starmap(process_chunk, args)

    return outdata
```
%% Cell type:markdown id: tags:
Now we can call our `process_dataset` function just like any other function:
%% Cell type:code id: tags:
```
data = np.array(np.arange(64).reshape((4, 4, 4)), dtype=np.float64)
outdata = process_dataset(data)
print('Input')
print(data)
print('Output')
print(outdata)
```
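%% Cell type:markdown id: tags:
> Note that this sharing strategy relies on the child processes inheriting
> module-level state from the parent, which happens when processes are
> started with the `fork` method (the default on Linux). In a standalone
> script, and on platforms which use the `spawn` start method, you should
> also protect the entry point with an `if __name__ == '__main__':` guard,
> e.g.:
>
>     if __name__ == '__main__':
>         data    = np.arange(64).reshape((4, 4, 4)).astype(np.float64)
>         outdata = process_dataset(data)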