Skip to content
Snippets Groups Projects

Compare revisions

Changes are shown as if the source revision was being merged into the target revision. Learn more about comparing revisions.

Source

Select target project
No results found

Target

Select target project
  • fsl/pytreat-practicals-2020
  • mchiew/pytreat-practicals-2020
  • ndcn0236/pytreat-practicals-2020
  • nichols/pytreat-practicals-2020
4 results
Show changes
Showing
with 5067 additions and 399 deletions
File added
%% Cell type:markdown id: tags:
Imports
%% Cell type:code id: tags:
``` python
import h5py
import matplotlib.pyplot as plt
import numpy as np
```
%% Cell type:markdown id: tags:
# Load data
Load complex image data from MATLAB mat-file (v7.3 or later), which is actually an HDF5 format
Complex data is loaded as a (real, imag) tuple, so it neds to be explicitly converted to complex double
In this section:
- using h5py module
- np.transpose
- 1j as imaginary constant
%% Cell type:code id: tags:
``` python
# get hdf5 object for the mat-file
h = h5py.File('data.mat','r')
# get img variable from the mat-file
dat = h.get('img')
# turn array of (real, imag) tuples into an array of complex doubles
# transpose to keep data in same orientation as MATLAB
img = np.transpose(dat['real'] + 1j*dat['imag'])
```
%% Cell type:markdown id: tags:
# 6/8 Partial Fourier sampling
Fourier transform the image to get k-space data, and add complex Gaussian noise
To simulate 6/8 Partial Fourier sampling, zero out the bottom 1/4 of k-space
In this section:
- np.random.randn
- np.fft
- 0-based indexing
- image plotting
%% Cell type:code id: tags:
``` python
# generate normally-distributed complex noise
n = np.random.randn(96,96) + 1j*np.random.randn(96,96)
# Fourier transform the image and add noise
y = np.fft.fftshift(np.fft.fft2(img), axes=0) + n
# set bottom 24/96 lines to 0
y[72:,:] = 0
# show sampling
_, ax = plt.subplots()
ax.imshow(np.log(np.abs(np.fft.fftshift(y, axes=1))))
```
%% Cell type:markdown id: tags:
# Estimate phase
Filter the k-space data and extract a low-resolution phase estimate
Filtering can help reduce ringing in the phase image
In this section:
- np.pad
- np.hanning
- reshaping 1D array to 2D array using np.newaxis (or None)
- subplots with titles
%% Cell type:code id: tags:
``` python
# create zero-padded hanning filter for ky-filtering
filt = np.pad(np.hanning(48),24,'constant')
# reshape 1D array into 2D array
filt = filt[:,np.newaxis]
# or
# filt = filt[:,None]
# generate low-res image with inverse Fourier transform
low = np.fft.ifft2(np.fft.ifftshift(y*filt, axes=0))
# get phase image
phs = np.exp(1j*np.angle(low))
# show phase estimate alongside true phase
_, ax = plt.subplots(1,2)
ax[0].imshow(np.angle(img))
ax[0].set_title('True image phase')
ax[1].imshow(np.angle(phs))
ax[1].set_title('Estimated phase')
```
%% Cell type:markdown id: tags:
# POCS reconstruction
Perform the projection-onto-convex-sets (POCS) partial Fourier reconstruction method.
POCS is an iterative scheme estimates the reconstructed image as any element in the intersection of the following two (convex) sets:
1. Set of images consistent with the measured data
2. Set of images that are non-negative real
This requires prior knowledge of the image phase (hence the estimate above), and it works because although we have less than a full k-space of measurements, we're now only estimating half the number of free parameters (real values only, instead of real + imag), and we're no longer under-determined. Equivalently, consider the fact that real-valued images have conjugate symmetric k-spaces, so we only require half of k-space to reconstruct our image.
In this section:
- np.zeros
- range() builtin
- point-wise multiplication (*)
- np.fft operations default to last axis, not first
- np.maximum vs np.max
%% Cell type:code id: tags:
``` python
# initialise image estimate to be zeros
est = np.zeros((96,96))
# set the number of iterations
iters = 10
# each iteration cycles between projections
for i in range(iters):
# projection onto data-consistent set:
# use a-priori phase to get complex image
est = est*phs
# Fourier transform to get k-space
est = np.fft.fftshift(np.fft.fft2(est), axes=0)
# replace data with measured lines
est[:72,:] = y[:72,:]
# inverse Fourier transform to get back to image space
est = np.fft.ifft2(np.fft.ifftshift(est, axes=0))
# projection onto non-negative reals:
# remove a-priori phase
est = est*np.conj(phs)
# get real part
est = np.real(est)
# ensure output is non-negative
est = np.maximum(est, 0)
```
%% Cell type:markdown id: tags:
# Display error and plot reconstruction
The POCS reconstruction is compared to a zero-filled reconstruction (i.e., where the missing data is zeroed prior to inverse Fourier Transform)
In this section:
- print formatted strings to standard output
- 2D subplots with min/max scales, figure size
- np.sum sums over all elements by default
%% Cell type:code id: tags:
``` python
# compute zero-filled recon
zf = np.fft.ifft2(np.fft.ifftshift(y, axes=0))
# compute rmse for zero-filled and POCS recon
err_zf = np.sqrt(np.sum(np.abs(zf - img)**2))
err_pocs = np.sqrt(np.sum(np.abs(est*phs - img)**2))
# print errors
print(f'RMSE for zero-filled recon: {err_zf}')
print(f'RMSE for POCS recon: {err_pocs}')
# plot both recons side-by-side
_, ax = plt.subplots(2,2,figsize=(16,16))
# plot zero-filled
ax[0,0].imshow(np.abs(zf), vmin=0, vmax=1)
ax[0,0].set_title('Zero-filled')
ax[1,0].plot(np.abs(zf[:,47]))
# plot POCS
ax[0,1].imshow(est, vmin=0, vmax=1)
ax[0,1].set_title('POCS recon')
ax[1,1].plot(np.abs(est[:,47]))
```
%% Load Data
% get matfile object
h = matfile('data.mat');
% get img variable from the mat-file
img = h.img;
%% 6/8 Partial Fourier sampling
% generate normally-distributed complex noise
n = randn(96) + 1j*randn(96);
% Fourier transform the image and add noise
y = fftshift(fft2(img),1) + n;
% set bottom 24/96 lines to 0
y(73:end,:) = 0;
% show sampling
figure();
imshow(log(abs(fftshift(y,2))), [], 'colormap', jet)
%% Estimate phase
% create zero-padded hanning filter for ky-filtering
filt = padarray(hann(48),24);
% generate low-res image with inverse Fourier transform
low = ifft2(ifftshift(y.*filt,1));
% get phase image
phs = exp(1j*angle(low));
% show phase estimate alongside true phase
figure();
subplot(1,2,1);
imshow(angle(img), [-pi,pi], 'colormap', hsv);
title('True image phase');
subplot(1,2,2);
imshow(angle(phs), [-pi,pi], 'colormap', hsv)
title('Estimated phase');
%% POCS reconstruction
% initialise image estimate to be zeros
est = zeros(96);
% set number of iterations
iters = 10;
% each iteration cycles between projections
for i = 1:iters
% projection onto data-consistent set:
% use a-priori phase to get complex image
est = est.*phs;
% Fourier transform to get k-space
est = fftshift(fft2(est), 1);
% replace data with measured lines
est(1:72,:) = y(1:72,:);
% inverse Fourier transform to get back to image space
est = ifft2(ifftshift(est, 1));
% projection onto non-negative reals
% remove a-priori phase
est = est.*conj(phs);
% get real part
est = real(est);
% ensure output is non-negative
est = max(est, 0);
end
%% Display error and plot reconstruction
% compute zero-filled recon
zf = ifft2(ifftshift(y, 1));
% compute rmse for zero-filled and POCS recon
err_zf = norm(zf(:) - img(:));
err_pocs = norm(est(:).*phs(:) - img(:));
% print errors
fprintf(1, 'RMSE for zero-filled recon: %f\n', err_zf);
fprintf(1, 'RMSE for POCS recon: %f\n', err_pocs);
% plot both recons side-by-side
figure();
% plot zero-filled
subplot(2,2,1);
imshow(abs(zf), [0 1]);
title('Zero-Filled');
subplot(2,2,3);
plot(abs(zf(:,48)), 'linewidth', 2);
% plot POCS
subplot(2,2,2);
imshow(est, [0 1]);
title('POCS recon');
subplot(2,2,4);
plot(abs(est(:,48)), 'linewidth', 2);
%% Cell type:markdown id: tags:
## Matlab v Python : introduction
This is a very small piece of code that fits RBFs to some data
In this section
- numpy
- matlab-like functions like np.linspace
%% Cell type:code id: tags:
``` python
import numpy as np
# Generate noisy sine wave
x = np.linspace(0,10,100)
y = np.sin(3*x) + np.random.randn(x.size)*.5
```
%% Cell type:markdown id: tags:
We define our Radial basis functions as $\textrm{RBF}(x) = \exp\left[-\frac{(x-c)^2}{\sigma^2}\right]$ where $c$ is the centre and $\sigma$ the extent. By having multiple such functions with different centres we can fit an arbitrary function.
In this section
- defining a function (two options)
- list comprehension
- power with double *
- numpy array from list
- transpose
%% Cell type:code id: tags:
``` python
# Define RBF atom
# two options:
# inline function definition (not available in matlab)
# lambda : like @ in matlab
sig = 2 # extent of an atom
# option 1
# def rbf(x,c):
# return np.exp(-(x-c)**2/sig**2)
# option 2
rbf = lambda x,c : np.exp(-(x-c)**2/sig**2)
# create a design matrix
# (using list comprehension to show off)
ci = np.linspace(0,10,20) # centres of the atoms
desmat = [rbf(x,c) for c in ci] # desmat contains a list of atoms
# from list to numpy array
desmat = np.asarray(desmat).T
```
%% Cell type:markdown id: tags:
Now we can fit to the sine wave with pseudoinverse.
In this section:
- pinv
- basic plotting and prettifying
- saving figure to file
%% Cell type:code id: tags:
``` python
# invert model
beta = np.linalg.pinv(desmat)@y.T
import matplotlib.pyplot as plt
# plot data, RBFs, and fitted model
plt.figure()
plt.plot(x,y,'.')
plt.plot(x,desmat,'k',alpha=.2)
plt.plot(x,desmat@beta)
# make it pretty
plt.grid()
plt.xlabel('x')
plt.ylabel('y')
plt.title('RBF fitting')
plt.savefig('/Users/saad/Desktop/RBF.pdf')
```
%% Output
%% Cell type:markdown id: tags:
## The end.
%% Cell type:code id: tags:
``` python
```
%% Fitting Radial Basis Functions to some data
% Generate a noisy sine wave and plot it
x = linspace(0,10,100);
y = sin(3*x) + randn(size(x))*.5;
figure,
plot(x,y)
%% Fit RBF
% This defines a RBF atom function
sig = 2;
rbf = @(x,c)(exp(-(x-c).^2/sig^2));
% design matrix for fitting
xi = linspace(0,10,20);
desmat = zeros(length(x),length(xi));
for i=1:length(xi)
% each column is an RBF centered around xi
desmat(:,i) = rbf(x,xi(i));
end
% fit model
beta = desmat\y';
% plot fit
figure(1),hold on
plot(x,y,'.','markersize',10)
h = plot(x,desmat,'k'); for i =1:20;h(i).Color=[0,0,0,0.2];end
plot(x,desmat*beta,'-','linewidth',2)
% make it a little prettier
grid on
xlabel('x')
ylabel('y')
set(gca,'fontsize',16)
title('RBF fitting')
%% save figure
print -depsc ~/Desktop/RBF.eps
close(1)
!open ~/Desktop/RBF.eps
%% Cell type:markdown id: tags:
![Running in parallel](parallel.png)
# Multiprocessing and multithreading in Python
## Why use multiprocessing?
%% Cell type:code id: tags:
``` python
import multiprocessing
multiprocessing.cpu_count()
```
%% Output
4
%% Cell type:markdown id: tags:
*Almost all CPUs these days are multi-core.*
CPU-intensive programs will not be efficient unless they take advantage of this!
%% Cell type:markdown id: tags:
# General plan
Walk through a basic application of multiprocessing, hopefully relevant to the kind of work you might want to do.
Not a comprehensive guide to Python multithreading/multiprocessing.
%% Cell type:markdown id: tags:
![voxel](voxel.png)
## Sample application
Assume we are doing some voxelwise image processing - i.e. running a computationally intensive calculation *independently* on each voxel in a (possibly large) image.
*(Such problems are sometimes called 'embarrassingly parallel')*
This is in a Python module called my_analysis. Here we simulate this by just calculating a large number of exponentials for each voxel.
%% Cell type:code id: tags:
``` python
# my_analysis.py
import math
import numpy
def calculate_voxel(val):
# 'Slow' voxelwise calculation
for i in range(30000):
b = math.exp(val)
return b
```
%% Cell type:markdown id: tags:
We're going to run this on a Numpy array. `numpy.vectorize` is a convenient function to apply a function to every element of the array, but it is *not* doing anything clever - it is no different than looping over the x, y, and z co-ordinates.
We're also giving the data an ID - this will be used later when we have multiple threads.
%% Cell type:code id: tags:
``` python
def calculate_data(data, id=0):
# Run 'calculate_voxel' on each voxel in data
print("Id: %i: Processing %i voxels" % (id, data.size))
vectorized = numpy.vectorize(calculate_voxel)
vectorized(data)
print("Id: %i: Done" % id)
return data
```
%% Cell type:markdown id: tags:
Here's some Python code to run our analysis on a random Numpy array, and time how long it takes
%% Cell type:code id: tags:
``` python
import numpy
import timeit
import my_analysis
def run():
data = numpy.random.rand(16, 16, 16)
my_analysis.calculate_data(data)
t = timeit.timeit(run, number=1)
print("Data processing took %.2f seconds" % t)
```
%% Output
Id: 0: Processing 4096 voxels
Id: 0: Done
Data processing took 26.44 seconds
%% Cell type:markdown id: tags:
So, it took a little while.
%% Cell type:markdown id: tags:
If we watch what's going on while this runs, we can see the program is not using all of our CPU. It's only working on one core.
![Running in serial](onecore.png)
%% Cell type:markdown id: tags:
## What we want
It would be nice to split the data up into chunks and give one to each core. Then we could get through the processing 8 times as fast.
![Running in parallel](multicore.png)
%% Cell type:markdown id: tags:
# Multithreading attempt
*Threads* are a way for a program to run more than one task at a time. Let's try using this on our application, using the Python `threading` module.
%% Cell type:markdown id: tags:
## Splitting the data up
We're going to need to split the data up into chunks. Numpy has a handy function `numpy.split` which slices the array up into equal portions along a specified axis:
chunks = numpy.split(full_data, num_chunks, axis)
*The data must split up equally along this axis! We will solve this problem later*
%% Cell type:markdown id: tags:
## Creating a new thread for each chunk
def function_to_call(args, arg2, arg3):
...do something
...
import threading
thread = threading.Thread(target=function_to_call,
args=[arg1, arg2, arg3])
%% Cell type:markdown id: tags:
## Waiting for the threads to complete
thread.join()
- This waits until `thread` has completed
- So, if we have more than one thread we need to keep a list and wait for them all to finish:
for thread in threads:
thread.join()
%% Cell type:markdown id: tags:
## Example code
The example code is below - let's see how it does!
%% Cell type:code id: tags:
``` python
import threading
def multithread_process(data):
n_workers = 4
# Split the data into chunks along axis 0
# We are assuming this axis is divisible by the number of workers!
chunks = numpy.split(data, n_workers, axis=0)
# Start a worker for each chunk
workers = []
for idx, chunk in enumerate(chunks):
print("Starting worker for part %i" % idx)
w = threading.Thread(target=my_analysis.calculate_data, args=[chunk, idx])
workers.append(w)
w.start()
# Wait for each worker to finish
for w in workers:
w.join()
def run_with_threads():
data = numpy.random.rand(16, 16, 16)
multithread_process(data)
t = timeit.timeit(run_with_threads, number=1)
print("Data processing took %.2f seconds" % t)
```
%% Output
Starting worker for part 0
Starting worker for part 1
Id: 0: Processing 1024 voxels
Id: 1: Processing 1024 voxels
Starting worker for part 2Id: 2: Processing 1024 voxels
Starting worker for part 3Id: 3: Processing 1024 voxels
Id: 1: DoneId: 0: Done
Id: 2: Done
Id: 3: Done
Data processing took 132.90 seconds
%% Cell type:markdown id: tags:
# The Big Problem with Python threads
%% Cell type:markdown id: tags:
**Only one thread can execute Python code at a time**
![Python multithreading](thread_gil.png)
This is what's really going on.
%% Cell type:markdown id: tags:
The reason is something called the **Global Interpreter Lock (GIL)**. Only one thread can have it, and you can only execute Python code when you have the GIL.
%% Cell type:markdown id: tags:
## So, does that mean Python threads are useless?
No, not completely. They're useful for:
- Making a user interface continue to respond while a calculation takes place in the background
- A web server handling multiple requests.
- *The GIL is not required while waiting for network connections*
- Doing calculations in parallel which are running in native (C/C++) code
- *The GIL is not required while running native code*
%% Cell type:markdown id: tags:
### But for doing CPU-intensive Python calculations in parallel, yes Python threads are essentially useless
%% Cell type:markdown id: tags:
## Can multiprocessing help?
%% Cell type:markdown id: tags:
### Differences between threads and processes
- Threads are quicker to start up and generally require fewer resources
- Threads share memory with the main process
- Don't need to copy your data to pass it to a thread
- Don't need to copy the output data back to the main program
- Processes have their own memory space
- Data needs to be copied from the main program to the process
- Any output needs to be copied back
- However, importantly for Python, *Each process has its own GIL so they can run at the same time as others*
%% Cell type:markdown id: tags:
## Multiprocessing attempt
Multiprocessing is normally more work than multithreading.
However Python tries *very hard* to make multiprocessing as easy as multithreading.
- `import multiprocessing` instead of `import threading`
- `multiprocessing.Process()` instead of `threading.Thread()`
%% Cell type:code id: tags:
``` python
import multiprocessing
def multiprocess_process(data):
n_workers = 4
# Split the data into chunks along axis 0
# We are assuming this axis is divisible by the number of workers!
chunks = numpy.split(data, n_workers, axis=0)
workers = []
for idx, chunk in enumerate(chunks):
print("Starting worker for chunk %i" % idx)
w = multiprocessing.Process(target=my_analysis.calculate_data, args=[chunk, idx])
workers.append(w)
w.start()
# Wait for workers to complete
for w in workers:
w.join()
def run_with_processes():
data = numpy.random.rand(16, 16, 16)
multiprocess_process(data)
if __name__ == "__main__":
t = timeit.timeit(run_with_processes, number=1)
print("Data processing took %.2f seconds" % t)
```
%% Output
Starting worker for chunk 0
Starting worker for chunk 1
Starting worker for chunk 2
Starting worker for chunk 3
Data processing took 9.74 seconds
%% Cell type:markdown id: tags:
# Multiprocessing works!
%% Cell type:markdown id: tags:
## BUT
# Caveats and gotchas
Before we just run off and replace all our threads with processes there are a few things we need to bear in mind:
## Data copying
- Python *copied* each chunk of data to each worker. If the data was very large this could be a significant overhead
- Python needs to know *how* to copy all the data we pass to the process.
- This is fine for normal data types (strings, lists, dictionaries, etc) and Numpy arrays
- Can get trouble if you try to pass complex objects to your function
- Anything you pass to the worker needs to be support the `pickle` module
## The global variable problem
- Can't rely on global variables being copied
%% Cell type:markdown id: tags:
## The output problem
If you change data in a subprocess, your main program will not see it.
## Example:
%% Cell type:code id: tags:
``` python
# my_analysis.py
def add_one(data):
data += 1
```
%% Cell type:code id: tags:
``` python
data = numpy.zeros([2, 2])
print("Starting with zeros")
print(data)
#my_analysis.add_one(data)
#w = threading.Thread(target=my_analysis.add_one, args=[data,])
w = multiprocessing.Process(target=my_analysis.add_one, args=[data,])
w.start()
w.join()
print("I think my worker just added one")
print(data)
```
%% Output
Starting with zeros
[[ 0. 0.]
[ 0. 0.]]
I think my worker just added one
[[ 0. 0.]
[ 0. 0.]]
%% Cell type:markdown id: tags:
# Making multiprocessing work better
%% Cell type:markdown id: tags:
## Problems to solve:
- Dividing data amongst processes
- Returning data from process
- Status updates from process
%% Cell type:markdown id: tags:
## Worker pools
A *Pool* is a fixed number of worker processes which can be given tasks to do.
We can give a pool as many tasks as we like - once a worker finishes a task it will start another, until they're all done.
We can create a pool using:
multiprocessing.Pool(num_workers)
- If the number of workers in the pool is equal to the number of cores, we should be able to keep our CPU busy.
- Pools are good for load balancing if some slices are more work than others
%% Cell type:markdown id: tags:
![Worker pool](pool.png)
%% Cell type:markdown id: tags:
## Splitting our data into chunks for the pool
- Now we can split our data up into as many chunks as we like
- Easiest solution is to use 1-voxel slices along one of the axes `split_axis`:
- `numpy.split(data, data.shape[split_axis], axis=split_axis)`
%% Cell type:markdown id: tags:
## Giving the tasks to the pool
- Easiest way is to use:
Pool.map(function, task_args)
- task_args is a *sequence of sequences*
- Each element in `task_args` is a sequence of arguments for one task
- The length of `task_args` is the number of tasks
#### Example `task_args` for 5 tasks, each being passed an ID and a chunk of data
[
[0, chunk_1], # task 1
[1, chunk_2], # task_2
[2, chunk_3], # task 3
[3, chunk_4], # task_4
[4, chunk_5], # task_5
]
- If we have a list of chunks we can generate this with `enumerate(chunks)`
- **Arguments are passed to the task in a slightly different way compared to `multiprocessing.Process()`**
%% Cell type:code id: tags:
``` python
# my_analysis.py
def do_task(args):
# Pool.map passes all our arguments as a single tuple, so unpack it
# and pass the arguments to the real calculate function.
id, data = args
return calculate_data(data, id)
```
%% Cell type:markdown id: tags:
- If you're using Python 3, look into `Pool.starmap`
%% Cell type:markdown id: tags:
## Getting the output and putting it back together
- `Pool.map()` captures the return value of your worker function
- It returns a list of all of the return values for each task
- for us this is a list of Numpy arrays, one for each slice
- `numpy.concatenate(list_of_slices, split_axis)` will combine them back into a single data item for us
%% Cell type:markdown id: tags:
## The full example
%% Cell type:code id: tags:
``` python
# Split our data along the x-axis
SPLIT_AXIS = 0
def pool_process(data):
n_workers = 4
# Split the data into 1-voxel slices along axis 0
parts = numpy.split(data, data.shape[SPLIT_AXIS], axis=SPLIT_AXIS)
print("Input data shape=%s" % str(data.shape))
print("Processing %i parts with %i workers" % (len(parts), n_workers))
# Create a pool - normally this would be 1 worker per CPU core
pool = multiprocessing.Pool(n_workers)
# Send the tasks to the workers
list_of_slices = pool.map(my_analysis.do_task, enumerate(parts))
# Combine the return data back into a single array
processed_array = numpy.concatenate(list_of_slices, SPLIT_AXIS)
print("Processed data, output shape=%s" % str(processed_array.shape))
def run_with_pool():
data = numpy.random.rand(23, 13, 11)
pool_process(data)
t = timeit.timeit(run_with_pool, number=1)
print("Data processing took %.2f seconds" % t)
```
%% Output
Input data shape=(23L, 13L, 11L)
Processing 23 parts with 4 workers
Processed data, output shape=(23L, 13L, 11L)
Data processing took 8.08 seconds
%% Cell type:markdown id: tags:
# Communication / Status updates?
- Would be nice if workers could communicate their progress as they work. One way to do this is using a `Queue`.
%% Cell type:markdown id: tags:
## Queues
A Queue is often used to send status updates from the process to the main program.
- Shared between the main program and the subprocesses
- Create it with `multiprocessing.Manager().Queue()`
- Pass it to the worker thread like any other argument
- Worker calls `queue.put()` to send some data to the queue
- Main program calls `queue.get()` to get data off the queue
- Queue is FIFO (First In First Out)
- Need to have a thread running which checks the queue for updates every so often
- This is a good use for threads!
![Queue](queue_put.png)
%% Cell type:markdown id: tags:
## Modify our example to report progress
%% Cell type:code id: tags:
``` python
# my_analysis.py
def calculate_data_and_report(args):
id, data, queue = args
# Run 'calculate_voxel' on each voxel in data
vectorized = numpy.vectorize(calculate_voxel)
vectorized(data)
# Report our ID and how many voxels we have done to the queue
queue.put((id, data.size))
```
%% Cell type:markdown id: tags:
## Create a thread to monitor the queue for updates
I've done this as a class, because it that's the easiest way
%% Cell type:code id: tags:
``` python
class QueueChecker():
def __init__(self, queue, num_voxels, interval_seconds=1):
self._queue = queue
self._num_voxels = num_voxels
self._interval_seconds = interval_seconds
self._voxels_done = 0
self._cancel = False
self._restart_timer()
def cancel(self):
self._cancel = True
def _restart_timer(self):
self._timer = threading.Timer(self._interval_seconds, self._check_queue)
self._timer.start()
def _check_queue(self):
while not self._queue.empty():
id, voxels_done = self._queue.get()
self._voxels_done += voxels_done
percent = int(100*float(self._voxels_done)/self._num_voxels)
print("%i%% complete" % percent)
if not self._cancel:
self._restart_timer()
```
%% Cell type:markdown id: tags:
## Modify our main program to pass the queue to each of our workers
We need to create the queue and the `QueueChecker` and make sure each task includes a copy of the queue
%% Cell type:code id: tags:
``` python
# Split our data along the x-axis
SPLIT_AXIS = 0
reload(my_analysis)
def pool_process(data):
n_workers = 4
# Split the data into 1-voxel slices along axis 0
parts = numpy.split(data, data.shape[SPLIT_AXIS], axis=SPLIT_AXIS)
print("Input data shape=%s" % str(data.shape))
print("We are processing %i parts with %i workers" % (len(parts), n_workers))
pool = multiprocessing.Pool(n_workers)
# Create the queue
queue = multiprocessing.Manager().Queue()
checker = QueueChecker(queue, data.size)
# Note that we need to pass the queue as an argument to the worker
args = [(id, part, queue) for id, part in enumerate(parts)]
list_of_slices = pool.map(my_analysis.calculate_data_and_report, args)
checker.cancel()
# Join processed data back together again
processed_array = numpy.concatenate(list_of_slices, SPLIT_AXIS)
print("Processed data, output shape=%s" % str(processed_array.shape))
def run_with_pool():
data = numpy.random.rand(23, 19, 17)
pool_process(data)
t = timeit.timeit(run_with_pool, number=1)
print("Data processing took %.2f seconds" % t)
```
%% Output
Input data shape=(23L, 19L, 17L)
We are processing 23 parts with 4 workers
0% complete
0% complete
13% complete
17% complete
17% complete
34% complete
34% complete
47% complete
52% complete
56% complete
69% complete
69% complete
78% complete
86% complete
95% complete
Processed data, output shape=(23L, 19L, 17L)
Data processing took 17.39 seconds
100% complete
%% Cell type:markdown id: tags:
# Summary
%% Cell type:markdown id: tags:
## What we've covered
- Limitations of threading for parallel processing in Python
- How to split up a simple voxel-processing task into separate chunks
- `numpy.split()`
- How to run each chunk in parallel using multiprocessing
- `multiprocessing.Process`
- How to separate the number of tasks from the number of workers
- `multiprocessing.Pool()`
- `Pool.map()`
- How to get output back from the workers and join it back together again
- `numpy.concatenate()`
- How to pass back progress information from our worker processes
- `multiprocessing.manager.Queue()`
- Using a threading.Timer object to monitor the queue and display updates
%% Cell type:markdown id: tags:
## Things I haven't covered
Loads of stuff!
%% Cell type:markdown id: tags:
### Threading
- Locking of shared data (so only one thread can use it at a time)
- Thread-local storage (see `threading.local()`)
- See Paul's tutorial on the PyTreat GIT for more information
- Or see the `threading` Python documentation for full details
%% Cell type:markdown id: tags:
### Multiprocessing
- Passing data *between* workers
- Can use `Queue` for one-way traffic
- Use `Pipe` for two-way communication between one worker and another
- May be required when your problem is not 'embarrasingly parallel'
- Sharing memory
- Way to avoid copying large amounts of data
- Look at `multiprocessing.Array`
- Need to convert Numpy array into a ctypes array
- Shared memory has pitfalls
- *Don't go here unless you have aready determined that data copying is a bottleneck*
- Running workers asynchronously
- So main program doesn't have to wait for them to finish
- Instead, a function is called every time a task is finished
- see `multiprocessing.apply_async()` for more information
- Error handling
- Needs a bit of care - very easy to 'lose' errors
- Workers should catch all exceptions
- And should return a value to signal when a task has failed
- Main program decides how to deal with it
%% Cell type:markdown id: tags:
## Always remember
**Python is not the best tool for every job!**
If you are really after performance, consider implementing your algorithm in multi-threaded C/C++ and then create a Python interface.
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
# Main scientific python libraries
See https://scipy.org/
Most of these packages have or are in the progress of dropping support for python2.
So use python3!
## [Numpy](http://www.numpy.org/): arrays
This is the main library underlying (nearly) all of the scientific python ecosystem.
See the tutorial in the beginner session or [the official numpy tutorial](https://docs.scipy.org/doc/numpy-dev/user/quickstart.html) for usage details.
The usual nickname of numpy is np:
%% Cell type:code id: tags:
```
import numpy as np
```
%% Cell type:markdown id: tags:
Numpy includes support for:
- N-dimensional arrays with various datatypes
- masked arrays
- matrices
- structured/record array
- basic functions (e.g., sin, log, arctan, polynomials)
- basic linear algebra
- random number generation
## [Scipy](https://scipy.org/scipylib/index.html): most general scientific tools
At the top level this module includes all of the basic functionality from numpy.
You could import this as, but you might as well import numpy directly.
%% Cell type:code id: tags:
```
import scipy as sp
```
%% Cell type:markdown id: tags:
The main strength in scipy lies in its sub-packages:
%% Cell type:code id: tags:
```
from scipy import optimize
def costfunc(params):
return (params[0] - 3) ** 2
optimize.minimize(costfunc, x0=[0], method='l-bfgs-b')
```
%% Cell type:markdown id: tags:
Tutorials for all sub-packages can be found [here](https://docs.scipy.org/doc/scipy-1.0.0/reference/).
Alternative for `scipy.ndimage`:
- [Scikit-image](http://scikit-image.org/docs/stable/auto_examples/) for image manipulation/segmentation/feature detection
## [Matplotlib](https://matplotlib.org/): Main plotting library
%% Cell type:code id: tags:
```
import matplotlib as mpl
mpl.use('nbagg')
import matplotlib.pyplot as plt
```
%% Cell type:markdown id: tags:
The matplotlib tutorials are [here](https://matplotlib.org/tutorials/index.html)
%% Cell type:code id: tags:
```
x = np.linspace(0, 2, 100)
plt.plot(x, x, label='linear')
plt.plot(x, x**2, label='quadratic')
plt.plot(x, x**3, label='cubic')
plt.xlabel('x label')
plt.ylabel('y label')
plt.title("Simple Plot")
plt.legend()
plt.show()
```
%% Cell type:markdown id: tags:
Alternatives:
- [Mayavi](http://docs.enthought.com/mayavi/mayavi/): 3D plotting (hard to install)
- [Bokeh](https://bokeh.pydata.org/en/latest/) among many others: interactive plots in the browser (i.e., in javascript)
## [Ipython](http://ipython.org/)/[Jupyter](https://jupyter.org/) notebook: interactive python environments
Supports:
- run code in multiple languages
%% Cell type:code id: tags:
```
%%bash
for name in python ruby ; do
echo $name
done
```
%% Cell type:markdown id: tags:
Features:
- debugging
%% Cell type:code id: tags:
```
from scipy import optimize
def costfunc(params):
return 1 / params[0] ** 2
optimize.minimize(costfunc, x0=[0], method='l-bfgs-b')
if params[0] <= 0:
raise ValueError('Input variable is too low')
return 1 / params[0]
optimize.minimize(costfunc, x0=[3], method='l-bfgs-b')
```
%% Cell type:code id: tags:
```
%debug
```
%% Cell type:markdown id: tags:
- timing/profiling
%% Cell type:code id: tags:
```
%%prun
plt.plot([0, 3])
```
%% Cell type:markdown id: tags:
- getting help
%% Cell type:code id: tags:
```
plt.plot?
```
%% Cell type:markdown id: tags:
- [and much more...](https://ipython.readthedocs.io/en/stable/interactive/magics.html)
The next generation is already out: [jupyterlab](https://jupyterlab.readthedocs.io/en/latest/)
There are many [useful extensions available](https://github.com/ipython-contrib/jupyter_contrib_nbextensions).
## [Pandas](https://pandas.pydata.org/): Analyzing "clean" data
Once your data is in tabular form (e.g. Biobank IDP's), you want to use pandas dataframes to analyze them.
This brings most of the functionality of R into python.
Pandas has excellent support for:
- fast IO to many tabular formats
- accurate handling of missing data
- Many, many routines to handle data
- group by categorical data (e.g., male/female)
- joining/merging data (all SQL-like operations and much more)
- time series support
- statistical models through [statsmodels](http://www.statsmodels.org/stable/index.html)
- plotting though [seaborn](https://seaborn.pydata.org/)
- Use [dask](https://dask.pydata.org/en/latest/) if your data is too big for memory (or if you want to run in parallel)
You should also install `numexpr` and `bottleneck` for optimal performance.
For the documentation check [here](http://pandas.pydata.org/pandas-docs/stable/index.html)
### Adjusted example from statsmodels tutorial
%% Cell type:code id: tags:
```
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
```
%% Cell type:code id: tags:
```
df = sm.datasets.get_rdataset("Guerry", "HistData").data
df
```
%% Cell type:code id: tags:
```
df.describe()
```
%% Cell type:code id: tags:
```
df.groupby('Region').mean()
```
%% Cell type:code id: tags:
```
results = smf.ols('Lottery ~ Literacy + np.log(Pop1831)', data=df).fit()
results.summary()
```
%% Cell type:code id: tags:
```
df['log_pop'] = np.log(df.Pop1831)
df
```
%% Cell type:code id: tags:
```
results = smf.ols('Lottery ~ Literacy + log_pop', data=df).fit()
results.summary()
```
%% Cell type:code id: tags:
```
results = smf.ols('Lottery ~ Literacy + np.log(Pop1831) + Region', data=df).fit()
results.summary()
```
%% Cell type:code id: tags:
```
results = smf.ols('Lottery ~ Literacy + np.log(Pop1831) + Region + Region * Literacy', data=df).fit()
results.summary()
```
%% Cell type:code id: tags:
```
%matplotlib nbagg
import seaborn as sns
sns.pairplot(df, hue="Region", vars=('Lottery', 'Literacy', 'log_pop'))
```
%% Cell type:markdown id: tags:
## [Sympy](http://www.sympy.org/en/index.html): Symbolic programming
%% Cell type:code id: tags:
```
import sympy as sym # no standard nickname
```
%% Cell type:code id: tags:
```
x, a, b, c = sym.symbols('x, a, b, c')
sym.solve(a * x ** 2 + b * x + c, x)
```
%% Cell type:code id: tags:
```
sym.integrate(x/(x**2+a*x+2), x)
```
%% Cell type:code id: tags:
```
f = sym.utilities.lambdify((x, a), sym.integrate((x**2+a*x+2), x))
f(np.random.rand(10), np.random.rand(10))
```
%% Cell type:markdown id: tags:
# Other topics
## [Argparse](https://docs.python.org/3.6/howto/argparse.html): Command line arguments
%% Cell type:code id: tags:
```
%%writefile test_argparse.py
import argparse
def main():
parser = argparse.ArgumentParser(description="calculate X to the power of Y")
parser.add_argument("-v", "--verbose", action="store_true")
parser.add_argument("x", type=int, help="the base")
parser.add_argument("y", type=int, help="the exponent")
args = parser.parse_args()
answer = args.x**args.y
if args.verbose:
print("{} to the power {} equals {}".format(args.x, args.y, answer))
else:
print("{}^{} == {}".format(args.x, args.y, answer))
if __name__ == '__main__':
main()
```
%% Cell type:code id: tags:
```
%run test_argparse.py 3 8 -v
```
%% Cell type:code id: tags:
```
%run test_argparse.py -h
```
%% Cell type:code id: tags:
```
%run test_argparse.py 3 8.5
```
%% Cell type:markdown id: tags:
Alternatives:
- [docopt](http://docopt.org/): You write a usage string, docopt will generate the parser
> ```
> # example from https://realpython.com/blog/python/comparing-python-command-line-parsing-libraries-argparse-docopt-click/
> """Greeter.
>
> Usage:
> commands.py hello
> commands.py goodbye
> commands.py -h | --help
>
> Options:
> -h --help Show this screen.
> """
> from docopt import docopt
>
> if __name__ == '__main__':
> arguments = docopt(__doc__)
> ```
- [clize](http://clize.readthedocs.io/en/stable/why.html): You write a function, clize will generate the parser
> ```
> from clize import run
>
> def echo(word):
> return word
>
> if __name__ == '__main__':
> run(echo)
> ```
### [Gooey](https://github.com/chriskiehl/Gooey): GUI from command line tool
%% Cell type:code id: tags:
```
%%writefile test_gooey.py
import argparse
from gooey import Gooey
@Gooey
def main():
parser = argparse.ArgumentParser(description="calculate X to the power of Y")
parser.add_argument("-v", "--verbose", action="store_true")
parser.add_argument("x", type=int, help="the base")
parser.add_argument("y", type=int, help="the exponent")
args = parser.parse_args()
answer = args.x**args.y
if args.verbose:
print("{} to the power {} equals {}".format(args.x, args.y, answer))
else:
print("{}^{} == {}".format(args.x, args.y, answer))
if __name__ == '__main__':
main()
```
%% Cell type:code id: tags:
```
!python.app test_gooey.py
```
%% Cell type:code id: tags:
```
!gcoord_gui
```
%% Cell type:markdown id: tags:
## [Jinja2](http://jinja.pocoo.org/docs/2.10/): Templating language
Jinja2 allows to create templates of files with placeholders, where future content will go.
This allows for the creation of a large number of similar files.
This can for example be used to produce static HTML output in a highly flexible manner.
%% Cell type:code id: tags:
```
%%writefile image_list.jinja2
<!DOCTYPE html>
<html lang="en">
<head>
{% block head %}
<title>{{ title }}</title>
{% endblock %}
</head>
<body>
<div id="content">
{% block content %}
{% for description, filenames in images %}
<p>
{{ description }}
</p>
{% for filename in filenames %}
<a href="{{ filename }}">
<img src="{{ filename }}">
</a>
{% endfor %}
{% endfor %}
{% endblock %}
</div>
<footer>
Created on {{ time }}
</footer>
</body>
</html>
```
%% Cell type:code id: tags:
```
import numpy as np
import matplotlib.pyplot as plt
plt.ioff()
def plot_sine(amplitude, frequency):
x = np.linspace(0, 2 * np.pi, 100)
y = amplitude * np.sin(frequency * x)
plt.plot(x, y)
plt.xticks([0, np.pi, 2 * np.pi], ['0', '$\pi$', '$2 \pi$'])
plt.ylim(-1.1, 1.1)
filename = 'plots/A{:.2f}_F{:.2f}.png'.format(amplitude, frequency)
plt.title('A={:.2f}, F={:.2f}'.format(amplitude, frequency))
plt.savefig(filename)
plt.close(plt.gcf())
return filename
!mkdir plots
amplitudes = [plot_sine(A, 1.) for A in [0.1, 0.3, 0.7, 1.0]]
frequencies = [plot_sine(1., F) for F in [1, 2, 3, 4, 5, 6]]
plt.ion()
```
%% Cell type:code id: tags:
```
from jinja2 import Environment, FileSystemLoader
from datetime import datetime
loader = FileSystemLoader('.')
env = Environment(loader=loader)
template = env.get_template('image_list.jinja2')
images = [
('Varying the amplitude', amplitudes),
('Varying the frequency', frequencies),
]
with open('image_list.html', 'w') as f:
f.write(template.render(title='Lots of sines',
images=images, time=datetime.now()))
```
%% Cell type:code id: tags:
```
!open image_list.html
```
%% Cell type:markdown id: tags:
## Neuroimage packages
The [nipy](http://nipy.org/) ecosystem covers most of these.
## [networkx](https://networkx.github.io/): graph theory
## GUI
- [tkinter](https://docs.python.org/3.6/library/tkinter.html): thin wrapper around Tcl/Tk; included in python
- [wxpython](https://www.wxpython.org/): Wrapper around the C++ wxWidgets library
%% Cell type:code id: tags:
```
%%writefile wx_hello_world.py
"""
Hello World, but with more meat.
"""
import wx
class HelloFrame(wx.Frame):
"""
A Frame that says Hello World
"""
def __init__(self, *args, **kw):
# ensure the parent's __init__ is called
super(HelloFrame, self).__init__(*args, **kw)
# create a panel in the frame
pnl = wx.Panel(self)
# and put some text with a larger bold font on it
st = wx.StaticText(pnl, label="Hello World!", pos=(25,25))
font = st.GetFont()
font.PointSize += 10
font = font.Bold()
st.SetFont(font)
# create a menu bar
self.makeMenuBar()
# and a status bar
self.CreateStatusBar()
self.SetStatusText("Welcome to wxPython!")
def makeMenuBar(self):
"""
A menu bar is composed of menus, which are composed of menu items.
This method builds a set of menus and binds handlers to be called
when the menu item is selected.
"""
# Make a file menu with Hello and Exit items
fileMenu = wx.Menu()
# The "\t..." syntax defines an accelerator key that also triggers
# the same event
helloItem = fileMenu.Append(-1, "&Hello...\tCtrl-H",
"Help string shown in status bar for this menu item")
fileMenu.AppendSeparator()
# When using a stock ID we don't need to specify the menu item's
# label
exitItem = fileMenu.Append(wx.ID_EXIT)
# Now a help menu for the about item
helpMenu = wx.Menu()
aboutItem = helpMenu.Append(wx.ID_ABOUT)
# Make the menu bar and add the two menus to it. The '&' defines
# that the next letter is the "mnemonic" for the menu item. On the
# platforms that support it those letters are underlined and can be
# triggered from the keyboard.
menuBar = wx.MenuBar()
menuBar.Append(fileMenu, "&File")
menuBar.Append(helpMenu, "&Help")
# Give the menu bar to the frame
self.SetMenuBar(menuBar)
# Finally, associate a handler function with the EVT_MENU event for
# each of the menu items. That means that when that menu item is
# activated then the associated handler function will be called.
self.Bind(wx.EVT_MENU, self.OnHello, helloItem)
self.Bind(wx.EVT_MENU, self.OnExit, exitItem)
self.Bind(wx.EVT_MENU, self.OnAbout, aboutItem)
def OnExit(self, event):
"""Close the frame, terminating the application."""
self.Close(True)
def OnHello(self, event):
"""Say hello to the user."""
wx.MessageBox("Hello again from wxPython")
def OnAbout(self, event):
"""Display an About Dialog"""
wx.MessageBox("This is a wxPython Hello World sample",
"About Hello World 2",
wx.OK|wx.ICON_INFORMATION)
if __name__ == '__main__':
# When this module is run (not imported) then create the app, the
# frame, show it, and start the event loop.
app = wx.App()
frm = HelloFrame(None, title='Hello World 2')
frm.Show()
app.MainLoop()
```
%% Cell type:code id: tags:
```
!python.app wx_hello_world.py
```
%% Cell type:markdown id: tags:
## Machine learning
- scikit-learn
- theano/tensorflow/pytorch
- keras
## [pymc3](http://docs.pymc.io/): Pobabilstic programming
%% Cell type:code id: tags:
```
import numpy as np
import matplotlib.pyplot as plt
# Initialize random number generator
np.random.seed(123)
# True parameter values
alpha, sigma = 1, 1
beta = [1, 2.5]
# Size of dataset
size = 100
# Predictor variable
X1 = np.random.randn(size)
X2 = np.random.randn(size) * 0.2
# Simulate outcome variable
Y = alpha + beta[0]*X1 + beta[1]*X2 + np.random.randn(size)*sigma
```
%% Cell type:code id: tags:
```
import pymc3 as pm
basic_model = pm.Model()
with basic_model:
# Priors for unknown model parameters
alpha = pm.Normal('alpha', mu=0, sd=10)
beta = pm.Normal('beta', mu=0, sd=10, shape=2)
sigma = pm.HalfNormal('sigma', sd=1)
# Expected value of outcome
mu = alpha + beta[0]*X1 + beta[1]*X2
# Likelihood (sampling distribution) of observations
Y_obs = pm.Normal('Y_obs', mu=mu, sd=sigma, observed=Y)
```
%% Cell type:code id: tags:
```
with basic_model:
# obtain starting values via MAP
start = pm.find_MAP(fmin=optimize.fmin_powell)
# instantiate sampler
step = pm.Slice()
# draw 5000 posterior samples
trace = pm.sample(5000, step=step, start=start)
```
%% Cell type:code id: tags:
```
_ = pm.traceplot(trace)
```
%% Cell type:code id: tags:
```
pm.summary(trace)
```
%% Cell type:markdown id: tags:
Alternative: [pystan](https://pystan.readthedocs.io/en/latest/): wrapper around the [Stan](http://mc-stan.org/users/) probabilistic programming language.
Alternatives:
- [pystan](https://pystan.readthedocs.io/en/latest/): wrapper around the [Stan](http://mc-stan.org/users/) probabilistic programming language.
- [emcee](http://dfm.io/emcee/current/): if you just want MCMC
## [Pycuda](https://documen.tician.de/pycuda/): Programming the GPU
Wrapper around [Cuda](https://developer.nvidia.com/cuda-zone).
The alternative [Pyopencl](https://documen.tician.de/pyopencl/) provides a very similar wrapper around [OpenCL](https://www.khronos.org/opencl/).
%% Cell type:code id: tags:
```
import pycuda.autoinit
import pycuda.driver as drv
- The alternative [Pyopencl](https://documen.tician.de/pyopencl/) provides a very similar wrapper around [OpenCL](https://www.khronos.org/opencl/).
- Also see [pyopenGL](http://pyopengl.sourceforge.net/): graphics programming in python (used in FSLeyes)
from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(double *dest, double *a, double *b)
{
const int i = threadIdx.x;
dest[i] = a[i] * b[i];
}
""")
multiply_them = mod.get_function("multiply_them")
a = np.random.randn(400)
b = np.random.randn(400)
dest = np.zeros_like(a)
multiply_them(
drv.Out(dest), drv.In(a), drv.In(b),
block=(400,1,1), grid=(1,1))
print(dest-a*b)
```
%% Cell type:markdown id: tags:
Also see [pyopenGL](http://pyopengl.sourceforge.net/): graphics programming in python (used in FSLeyes)
## Testing
- [unittest](https://docs.python.org/3.6/library/unittest.html): python built-in testing
%% Cell type:code id: tags:
```
import unittest
class TestStringMethods(unittest.TestCase):
def test_upper(self):
self.assertEqual('foo'.upper(), 'FOO')
def test_isupper(self):
self.assertTrue('FOO'.isupper())
self.assertFalse('Foo'.isupper())
def test_split(self):
s = 'hello world'
self.assertEqual(s.split(), ['hello', 'world'])
# check that s.split fails when the separator is not a string
with self.assertRaises(TypeError):
s.split(2)
if __name__ == '__main__':
unittest.main()
```
%% Cell type:markdown id: tags:
- [doctest](https://docs.python.org/3.6/library/doctest.html): checks the example usage in the documentation
%% Cell type:code id: tags:
```
def factorial(n):
"""Return the factorial of n, an exact integer >= 0.
>>> [factorial(n) for n in range(6)]
[1, 1, 2, 6, 24, 120]
>>> factorial(30)
265252859812191058636308480000000
>>> factorial(-1)
Traceback (most recent call last):
...
ValueError: n must be >= 0
Factorials of floats are OK, but the float must be an exact integer:
>>> factorial(30.1)
Traceback (most recent call last):
...
ValueError: n must be exact integer
>>> factorial(30.0)
265252859812191058636308480000000
It must also not be ridiculously large:
>>> factorial(1e100)
Traceback (most recent call last):
...
OverflowError: n too large
"""
import math
if not n >= 0:
raise ValueError("n must be >= 0")
if math.floor(n) != n:
raise ValueError("n must be exact integer")
if n+1 == n: # catch a value like 1e300
raise OverflowError("n too large")
result = 1
factor = 2
while factor <= n:
result *= factor
factor += 1
return result
if __name__ == "__main__":
import doctest
doctest.testmod()
```
%% Cell type:markdown id: tags:
Two external packages provide more convenient unit tests:
- [py.test](https://docs.pytest.org/en/latest/)
- [nose2](http://nose2.readthedocs.io/en/latest/usage.html)
%% Cell type:code id: tags:
```
# content of test_sample.py
def inc(x):
return x + 1
def test_answer():
assert inc(3) == 5
```
%% Cell type:markdown id: tags:
- [coverage](https://coverage.readthedocs.io/en/coverage-4.5.1/): measures which part of the code is covered by the tests
## Linters
Linters check the code for any syntax errors, [style errors](https://www.python.org/dev/peps/pep-0008/), unused variables, unreachable code, etc.
- [pylint](https://pypi.python.org/pypi/pylint): most extensive linter
- [pyflake](https://pypi.python.org/pypi/pyflakes): if you think pylint is too strict
- [pep8](https://pypi.python.org/pypi/pep8): just checks for style errors
### Optional static typing
- Document how your method/function should be called
- Static checking of whether your type hints are still up to date
- Static checking of whether you call your own function correctly
- Even if you don't assign types yourself, static type checking can still check whether you call typed functions/methods from other packages correctly.
%% Cell type:code id: tags:
```
from typing import List
def greet_all(names: List[str]) -> None:
for name in names:
print('Hello, {}'.format(name))
greet_all(['python', 'java', 'C++']) # type checker will be fine with this
greet_all('matlab') # this will actually run fine, but type checker will raise an error
```
%% Cell type:markdown id: tags:
Packages:
- [typing](https://docs.python.org/3/library/typing.html): built-in library containing generics, unions, etc.
- [mypy](http://mypy-lang.org/): linter doing static type checking
- [pyAnnotate](https://github.com/dropbox/pyannotate): automatically assign types to most of your functions/methods based on runtime
## Web frameworks
- [Django2](https://www.djangoproject.com/): includes the most features, but also forces you to do things their way
- [Pyramid](https://trypyramid.com): Intermediate options
- [Flask](http://flask.pocoo.org/): Bare-bone web framework, but many extensions available
There are also many, many libraries to interact with databases, but you will have to google those yourself.
# Quick mentions
- [trimesh](https://github.com/mikedh/trimesh): Triangular mesh algorithms
- [Pillow](https://pillow.readthedocs.io/en/latest/): Read/write/manipulate a wide variety of images (png, jpg, tiff, etc.)
- [psychopy](http://www.psychopy.org/): equivalent of psychtoolbox (workshop coming up in April in Nottingham)
- [Sphinx](http://www.sphinx-doc.org/en/master/): documentation generator
- [Buit-in libraries](https://docs.python.org/3/py-modindex.html)
- [collections](https://docs.python.org/3.6/library/collections.html): deque, OrderedDict, namedtuple, and more
- [datetime](https://docs.python.org/3/library/datetime.html): Basic date and time types
- [functools](https://docs.python.org/3/library/functools.html): caching, decorators, and support for functional programming
- [json](https://docs.python.org/3/library/json.html)/[ipaddress](https://docs.python.org/3/library/ipaddress.html)/[xml](https://docs.python.org/3/library/xml.html#module-xml): parsing/writing
- [itertools](https://docs.python.org/3/library/itertools.html): more tools to loop over sequences
- [logging](https://docs.python.org/3/library/logging.htm): log your output to stdout or a file (more flexible than print statements)
- [multiprocessing](https://docs.python.org/3/library/multiprocessing.html)
- [os](https://docs.python.org/3/library/os.html#module-os)/[sys](https://docs.python.org/3/library/sys.html): Miscellaneous operating system interfaces
- [os.path](https://docs.python.org/3/library/os.path.html)/[pathlib](https://docs.python.org/3/library/pathlib.html): utilities to deal with filesystem paths (latter provides an object-oriented interface)
- [pickle](https://docs.python.org/3/library/pickle.html): Store/load any python object
- [shutil](https://docs.python.org/3/library/shutil.html): copy/move files
- [subprocess](https://docs.python.org/3/library/subprocess.html): call shell commands
- [time](https://docs.python.org/3/library/time.html)/[timeit](https://docs.python.org/3/library/timeit.html): Timing your code
- [turtle](https://docs.python.org/3/library/turtle.html#module-turtle): teach python to your kids!
- [warnings](https://docs.python.org/3/library/warnings.html#module-warnings): tell people they are not using your code properly
%% Cell type:code id: tags:
```
from turtle import *
color('red', 'yellow')
begin_fill()
speed(10)
while True:
forward(200)
left(170)
if abs(pos()) < 1:
break
end_fill()
done()
```
%% Cell type:code id: tags:
```
import this
```
......
......@@ -39,6 +39,9 @@ optimize.minimize(costfunc, x0=[0], method='l-bfgs-b')
Tutorials for all sub-packages can be found [here](https://docs.scipy.org/doc/scipy-1.0.0/reference/).
Alternative for `scipy.ndimage`:
- [Scikit-image](http://scikit-image.org/docs/stable/auto_examples/) for image manipulation/segmentation/feature detection
## [Matplotlib](https://matplotlib.org/): Main plotting library
```
import matplotlib as mpl
......@@ -69,22 +72,17 @@ Alternatives:
- [Bokeh](https://bokeh.pydata.org/en/latest/) among many others: interactive plots in the browser (i.e., in javascript)
## [Ipython](http://ipython.org/)/[Jupyter](https://jupyter.org/) notebook: interactive python environments
Supports:
- run code in multiple languages
```
%%bash
for name in python ruby ; do
echo $name
done
```
Features:
- debugging
```
from scipy import optimize
def costfunc(params):
return 1 / params[0] ** 2
optimize.minimize(costfunc, x0=[0], method='l-bfgs-b')
if params[0] <= 0:
raise ValueError('Input variable is too low')
return 1 / params[0]
optimize.minimize(costfunc, x0=[3], method='l-bfgs-b')
```
```
%debug
```
......@@ -552,39 +550,16 @@ _ = pm.traceplot(trace)
pm.summary(trace)
```
Alternative: [pystan](https://pystan.readthedocs.io/en/latest/): wrapper around the [Stan](http://mc-stan.org/users/) probabilistic programming language.
Alternatives:
- [pystan](https://pystan.readthedocs.io/en/latest/): wrapper around the [Stan](http://mc-stan.org/users/) probabilistic programming language.
- [emcee](http://dfm.io/emcee/current/): if you just want MCMC
## [Pycuda](https://documen.tician.de/pycuda/): Programming the GPU
Wrapper around [Cuda](https://developer.nvidia.com/cuda-zone).
The alternative [Pyopencl](https://documen.tician.de/pyopencl/) provides a very similar wrapper around [OpenCL](https://www.khronos.org/opencl/).
```
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(double *dest, double *a, double *b)
{
const int i = threadIdx.x;
dest[i] = a[i] * b[i];
}
""")
- The alternative [Pyopencl](https://documen.tician.de/pyopencl/) provides a very similar wrapper around [OpenCL](https://www.khronos.org/opencl/).
- Also see [pyopenGL](http://pyopengl.sourceforge.net/): graphics programming in python (used in FSLeyes)
multiply_them = mod.get_function("multiply_them")
a = np.random.randn(400)
b = np.random.randn(400)
dest = np.zeros_like(a)
multiply_them(
drv.Out(dest), drv.In(a), drv.In(b),
block=(400,1,1), grid=(1,1))
print(dest-a*b)
```
Also see [pyopenGL](http://pyopengl.sourceforge.net/): graphics programming in python (used in FSLeyes)
## Testing
- [unittest](https://docs.python.org/3.6/library/unittest.html): python built-in testing
```
......@@ -676,6 +651,7 @@ Linters check the code for any syntax errors, [style errors](https://www.python.
- [pylint](https://pypi.python.org/pypi/pylint): most extensive linter
- [pyflake](https://pypi.python.org/pypi/pyflakes): if you think pylint is too strict
- [pep8](https://pypi.python.org/pypi/pep8): just checks for style errors
### Optional static typing
- Document how your method/function should be called
- Static checking of whether your type hints are still up to date
......@@ -710,37 +686,18 @@ There are also many, many libraries to interact with databases, but you will hav
- [trimesh](https://github.com/mikedh/trimesh): Triangular mesh algorithms
- [Pillow](https://pillow.readthedocs.io/en/latest/): Read/write/manipulate a wide variety of images (png, jpg, tiff, etc.)
- [psychopy](http://www.psychopy.org/): equivalent of psychtoolbox (workshop coming up in April in Nottingham)
- [Sphinx](http://www.sphinx-doc.org/en/master/): documentation generator
- [Buit-in libraries](https://docs.python.org/3/py-modindex.html)
- [collections](https://docs.python.org/3.6/library/collections.html): deque, OrderedDict, namedtuple, and more
- [datetime](https://docs.python.org/3/library/datetime.html): Basic date and time types
- [functools](https://docs.python.org/3/library/functools.html): caching, decorators, and support for functional programming
- [json](https://docs.python.org/3/library/json.html)/[ipaddress](https://docs.python.org/3/library/ipaddress.html)/[xml](https://docs.python.org/3/library/xml.html#module-xml): parsing/writing
- [itertools](https://docs.python.org/3/library/itertools.html): more tools to loop over sequences
- [logging](https://docs.python.org/3/library/logging.htm): log your output to stdout or a file (more flexible than print statements)
- [multiprocessing](https://docs.python.org/3/library/multiprocessing.html)
- [os](https://docs.python.org/3/library/os.html#module-os)/[sys](https://docs.python.org/3/library/sys.html): Miscellaneous operating system interfaces
- [os.path](https://docs.python.org/3/library/os.path.html)/[pathlib](https://docs.python.org/3/library/pathlib.html): utilities to deal with filesystem paths (latter provides an object-oriented interface)
- [pickle](https://docs.python.org/3/library/pickle.html): Store/load any python object
- [shutil](https://docs.python.org/3/library/shutil.html): copy/move files
- [subprocess](https://docs.python.org/3/library/subprocess.html): call shell commands
- [time](https://docs.python.org/3/library/time.html)/[timeit](https://docs.python.org/3/library/timeit.html): Timing your code
- [turtle](https://docs.python.org/3/library/turtle.html#module-turtle): teach python to your kids!
- [warnings](https://docs.python.org/3/library/warnings.html#module-warnings): tell people they are not using your code properly
```
from turtle import *
color('red', 'yellow')
begin_fill()
speed(10)
while True:
forward(200)
left(170)
if abs(pos()) < 1:
break
end_fill()
done()
```
```
import this
```
File added
talks/pandas/group_by.png

24.8 KiB

%% Cell type:markdown id: tags:
# Pandas
Follow along online at: https://git.fmrib.ox.ac.uk/fsl/pytreat-practicals-2020/-/blob/master/talks/pandas/pandas.ipynb
Pandas is a data analysis library focused on the cleaning and exploration of
tabular data.
Some useful links are:
- [main website](https://pandas.pydata.org)
- [documentation](http://pandas.pydata.org/pandas-docs/stable/)<sup>1</sup>
- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)<sup>1</sup> by
Jake van der Plas
- [List of Pandas tutorials](https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html)
<sup>1</sup> This tutorial borrows heavily from the pandas documentation and
the Python Data Science Handbook
%% Cell type:code id: tags:
```
%pylab inline
import pandas as pd # pd is the usual abbreviation for pandas
import matplotlib.pyplot as plt # matplotlib for plotting
import seaborn as sns # seaborn is the main plotting library for Pandas
import statsmodels.api as sm # statsmodels fits linear models to pandas data
import statsmodels.formula.api as smf
from IPython.display import Image
sns.set() # use the prettier seaborn plotting settings rather than the default matplotlib one
```
%% Cell type:markdown id: tags:
> We will mostly be using `seaborn` instead of `matplotlib` for
> visualisation. But `seaborn` is actually an extension to `matplotlib`, so we
> are still using the latter under the hood.
## Loading in data
Pandas supports a wide range of I/O tools to load from text files, binary files,
and SQL databases. You can find a table with all formats
[here](http://pandas.pydata.org/pandas-docs/stable/io.html).
%% Cell type:code id: tags:
```
titanic = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv')
titanic
```
%% Cell type:markdown id: tags:
This loads the data into a
[`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)
object, which is the main object we will be interacting with in pandas. It
represents a table of data. The other file formats all start with
`pd.read_{format}`. Note that we can provide the URL to the dataset, rather
than download it beforehand.
We can write out the dataset using `dataframe.to_{format}(<filename>)`:
%% Cell type:code id: tags:
```
titanic.to_csv('titanic_copy.csv', index=False) # we set index to False to prevent pandas from storing the row names
```
%% Cell type:markdown id: tags:
If you can not connect to the internet, you can run the command below to load
this locally stored titanic dataset
%% Cell type:code id: tags:
```
titanic = pd.read_csv('titanic.csv')
titanic
```
%% Cell type:markdown id: tags:
Note that the titanic dataset was also available to us as one of the standard
datasets included with seaborn. We could load it from there using
%% Cell type:code id: tags:
```
sns.load_dataset('titanic')
```
%% Cell type:markdown id: tags:
`Dataframes` can also be created from other python objects, using
`pd.DataFrame.from_{other type}`. The most useful of these is `from_dict`,
which converts a mapping of the columns to a pandas `DataFrame` (i.e., table).
%% Cell type:code id: tags:
```
pd.DataFrame.from_dict({
'random numbers': np.random.rand(5),
'sequence (int)': np.arange(5),
'sequence (float)': np.linspace(0, 5, 5),
'letters': list('abcde'),
'constant_value': 'same_value'
})
```
%% Cell type:markdown id: tags:
For many applications (e.g., ICA, machine learning input) you might want to
extract your data as a numpy array. The underlying numpy array can be accessed
using the `to_numpy` method
%% Cell type:code id: tags:
```
titanic.to_numpy()
```
%% Cell type:markdown id: tags:
Note that the type of the returned array is the most common type (in this case
object). If you just want the numeric parts of the table you can use
`select_dtypes`, which selects specific columns based on their dtype:
%% Cell type:code id: tags:
```
titanic.select_dtypes(include=np.number).to_numpy()
```
%% Cell type:markdown id: tags:
Note that the numpy array has no information on the column names or row indices.
Alternatively, when you want to include the categorical variables in your later
analysis (e.g., for machine learning), you can extract dummy variables using:
%% Cell type:code id: tags:
```
pd.get_dummies(titanic)
```
%% Cell type:markdown id: tags:
## Accessing parts of the data
[Documentation on indexing](http://pandas.pydata.org/pandas-docs/stable/indexing.html)
### Selecting columns by name
Single columns can be selected using the normal python indexing:
%% Cell type:code id: tags:
```
titanic['embark_town']
```
%% Cell type:markdown id: tags:
If the column names are simple strings (not required) we can also access it
directly as an attribute
%% Cell type:code id: tags:
```
titanic.embark_town
```
%% Cell type:markdown id: tags:
Note that this returns a pandas
[`Series`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html)
rather than a `DataFrame` object. A `Series` is simply a 1-dimensional array
representing a single column. Multiple columns can be returned by providing a
list of columns names. This will return a `DataFrame`:
%% Cell type:code id: tags:
```
titanic[['class', 'alive']]
```
%% Cell type:markdown id: tags:
Note that you have to provide a list here (square brackets). If you provide a
tuple (round brackets) pandas will think you are trying to access a single
column that has that tuple as a name:
%% Cell type:code id: tags:
```
titanic[('class', 'alive')]
```
%% Cell type:markdown id: tags:
In this case there is no column called `('class', 'alive')` leading to an
error. Later on we will see some uses to having columns named like this.
### Indexing rows by name or integer
Individual rows can be accessed based on their name (i.e., the index) or integer
(i.e., which row it is in). In our current table this will give the same
results. To ensure that these are different, let's sort our titanic dataset
based on the passenger fare:
%% Cell type:code id: tags:
```
titanic_sorted = titanic.sort_values('fare')
titanic_sorted
```
%% Cell type:markdown id: tags:
Note that the re-sorting did not change the values in the index (i.e., left-most
column).
We can select the first row of this newly sorted table using `iloc`
%% Cell type:code id: tags:
```
titanic_sorted.iloc[0]
```
%% Cell type:markdown id: tags:
We can select the row with the index 0 using
%% Cell type:code id: tags:
```
titanic_sorted.loc[0]
```
%% Cell type:markdown id: tags:
Note that this gives the same passenger as the first row of the initial table
before sorting
%% Cell type:code id: tags:
```
titanic.iloc[0]
```
%% Cell type:markdown id: tags:
Another common way to access the first or last N rows of a table is using the
head/tail methods
%% Cell type:code id: tags:
```
titanic_sorted.head(3)
```
%% Cell type:code id: tags:
```
titanic_sorted.tail(3)
```
%% Cell type:markdown id: tags:
Note that nearly all methods in pandas return a new `Dataframe`, which means
that we can easily call another method on them
%% Cell type:code id: tags:
```
titanic_sorted.tail(10).head(5) # select the first 5 of the last 10 passengers in the database
```
%% Cell type:code id: tags:
```
titanic_sorted.iloc[-10:-5] # alternative way to get the same passengers
```
%% Cell type:markdown id: tags:
**Exercise**: use sorting and tail/head or indexing to find the 10 youngest
passengers on the titanic. Try to do this on a single line by chaining calls
to the titanic `DataFrame` object
%% Cell type:code id: tags:
```
titanic.sort_values...
```
%% Cell type:markdown id: tags:
### Indexing rows by value
One final way to select specific columns is by their value
%% Cell type:code id: tags:
```
titanic[titanic.sex == 'female'] # selects all females
```
%% Cell type:code id: tags:
```
# select all passengers older than 60 who departed from Southampton
titanic[(titanic.age > 60) & (titanic['embark_town'] == 'Southampton')]
```
%% Cell type:markdown id: tags:
Note that this required typing `titanic` quite often. A quicker way to get the
same result is using the `query` method, which is described in detail
[here](http://pandas.pydata.org/pandas-docs/stable/indexing.html#the-query-method)
(note that using the `query` method is also faster and uses a lot less
memory).
> You may have trouble using the `query` method with columns which have
a name that cannot be used as a Python identifier.
%% Cell type:code id: tags:
```
titanic.query('(age > 60) & (embark_town == "Southampton")')
```
%% Cell type:markdown id: tags:
When selecting a categorical multiple options from a categorical values you
might want to use `isin`:
%% Cell type:code id: tags:
```
titanic[titanic['class'].isin(['First','Second'])]
```
%% Cell type:markdown id: tags:
Particularly useful when selecting data like this is the `isna` method which
finds all missing data
%% Cell type:code id: tags:
```
titanic[~titanic.age.isna()] # select first few passengers whose age is not N/A
```
%% Cell type:markdown id: tags:
This removing of missing numbers is so common that it has is own method
%% Cell type:code id: tags:
```
titanic.dropna() # drops all passengers that have some datapoint missing
```
%% Cell type:code id: tags:
```
titanic.dropna(subset=['age', 'fare']) # Only drop passengers with missing ages or fares
```
%% Cell type:markdown id: tags:
**Exercise**: use sorting, indexing by value, `dropna` and `tail`/`head` or
indexing to find the 10 oldest female passengers on the titanic. Try to do
this on a single line by chaining calls to the titanic `DataFrame` object
%% Cell type:code id: tags:
```
titanic...
```
%% Cell type:markdown id: tags:
## Plotting the data
Before we start analyzing the data, let's play around with visualizing it.
Pandas does have some basic built-in plotting options:
%% Cell type:code id: tags:
```
titanic.fare.hist(bins=20, log=True)
```
%% Cell type:code id: tags:
```
titanic.age.plot()
```
%% Cell type:markdown id: tags:
To plot all variables simply call `plot` or `hist` on the full dataframe
rather than a single Series (i.e., column). You might want to set `subplots=True`
to plot each variable in a different subplot.
Individual Series are essentially 1D arrays, so we can use them as such in
`matplotlib`
%% Cell type:code id: tags:
```
plt.scatter(titanic.age, titanic.fare)
```
%% Cell type:markdown id: tags:
However, for most purposes much nicer plots can be obtained using
[Seaborn](https://seaborn.pydata.org). Seaborn has support to produce plots
showing the
[univariate](https://seaborn.pydata.org/tutorial/distributions.html#plotting-univariate-distributions)
or
[bivariate](https://seaborn.pydata.org/tutorial/distributions.html#plotting-bivariate-distributions)
distribution of data in a single or a grid of plots. Most of the seaborn
plotting functions expect to get a pandas `DataFrame` (although they will work
with Numpy arrays as well). So we can plot age vs. fare like:
%% Cell type:code id: tags:
```
sns.jointplot('age', 'fare', data=titanic)
```
%% Cell type:markdown id: tags:
**Exercise**: check the documentation from `sns.jointplot` (hover the mouse
over the text `jointplot` and press shift-tab) to find out how to turn the
scatter plot into a density (kde) map
%% Cell type:code id: tags:
```
sns.jointplot('age', 'fare', data=titanic, ...)
```
%% Cell type:markdown id: tags:
Here is just a brief example of how we can use multiple columns to illustrate
the data in more detail
%% Cell type:code id: tags:
```
sns.relplot(x='age', y='fare', col='class', hue='sex', data=titanic,
col_order=('First', 'Second', 'Third'))
```
%% Cell type:markdown id: tags:
**Exercise**: Split the plot above into two rows with the first row including
the passengers who survived and the second row those who did not (you might
have to check the documentation again by using shift-tab while overing the
mouse over `relplot`)
%% Cell type:code id: tags:
```
sns.relplot(x='age', y='fare', col='class', hue='sex', data=titanic,
col_order=('First', 'Second', 'Third')...)
```
%% Cell type:markdown id: tags:
One of the nice thing of Seaborn is how easy it is to update how these plots
look. You can read more about that
[here](https://seaborn.pydata.org/tutorial/aesthetics.html). For example, to
increase the font size to get a plot more approriate for a talk, you can use:
%% Cell type:code id: tags:
```
sns.set_context('talk')
sns.violinplot(x='class', y='age', hue='sex', data=titanic, split=True,
order=('First', 'Second', 'Third'))
```
%% Cell type:markdown id: tags:
## Summarizing the data (mean, std, etc.)
There are a large number of built-in methods to summarize the observations in
a Pandas `DataFrame`. Most of these will return a `Series` with the columns
names as index:
%% Cell type:code id: tags:
```
titanic.mean()
```
%% Cell type:code id: tags:
```
titanic.quantile(0.75)
```
%% Cell type:markdown id: tags:
One very useful one is `describe`, which gives an overview of many common
summary measures
%% Cell type:code id: tags:
```
titanic.describe()
```
%% Cell type:markdown id: tags:
For a more detailed exploration of the data, you might want to check
[pandas_profiliing](https://pandas-profiling.github.io/pandas-profiling/docs/)
(not installed in fslpython, so the following will not run in fslpython):
%% Cell type:code id: tags:
```
from pandas_profiling import ProfileReport
profile = ProfileReport(titanic, title='Titanic Report', html={'style':{'full_width':True}})
profile.to_widgets()
```
%% Cell type:markdown id: tags:
Note that non-numeric columns are ignored when summarizing data in this way.
We can also define our own functions to apply to the columns (in this case we
have to explicitly set the data types).
%% Cell type:code id: tags:
```
def mad(series):
"""
Computes the median absolute deviatation (MAD)
This is a outlier-resistant measure of the standard deviation
"""
no_nan = series.dropna()
return np.median(abs(no_nan - np.nanmedian(no_nan)))
titanic.select_dtypes(np.number).apply(mad)
```
%% Cell type:markdown id: tags:
We can also provide multiple functions to the `apply` method (note that
functions can be provided as strings)
%% Cell type:code id: tags:
```
titanic.select_dtypes(np.number).apply(['mean', np.median, np.std, mad])
```
%% Cell type:markdown id: tags:
### Grouping by
One of the more powerful features of is `groupby`, which splits the dataset on
a categorical variable. The book contains a clear tutorial on that feature
[here](https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html). You
can check the pandas documentation
[here](http://pandas.pydata.org/pandas-docs/stable/groupby.html) for a more
formal introduction. One simple use is just to put it into a loop
%% Cell type:code id: tags:
```
for cls, part_table in titanic.groupby('class'):
print(f'Mean fare in {cls.lower()} class: {part_table.fare.mean()}')
```
%% Cell type:markdown id: tags:
However, it is more often combined with one of the aggregation functions
discussed above as illustrated in this figure from the [Python data science
handbook](https://jakevdp.github.io/PythonDataScienceHandbook/06.00-figure-code.html#Split-Apply-Combine)
![group by image](group_by.png)
%% Cell type:code id: tags:
```
titanic.groupby('class').mean()
```
%% Cell type:markdown id: tags:
We can also group by multiple variables at once
%% Cell type:code id: tags:
```
titanic.groupby(['class', 'survived']).mean() # as always in pandas supply multiple column names as lists, not tuples
```
%% Cell type:markdown id: tags:
When grouping it can help to use the `cut` method to split a continuous variable
into a categorical one
%% Cell type:code id: tags:
```
titanic.groupby(['class', pd.cut(titanic.age, bins=(0, 18, 50, np.inf))]).mean()
```
%% Cell type:markdown id: tags:
We can use the `aggregate` method to apply a different function to each series
%% Cell type:code id: tags:
```
titanic.groupby(['class', 'survived']).aggregate((np.median, mad))
```
%% Cell type:markdown id: tags:
Note that both the index (on the left) and the column names (on the top) now
have multiple levels. Such a multi-level index is referred to as `MultiIndex`.
This does complicate selecting specific columns/rows. You can read more of using
`MultiIndex` [here](http://pandas.pydata.org/pandas-docs/stable/advanced.html).
The short version is that columns can be selected using direct indexing (as
discussed above)
%% Cell type:code id: tags:
```
df_full = titanic.groupby(['class', 'survived']).aggregate((np.median, mad))
```
%% Cell type:code id: tags:
```
df_full[('age', 'median')] # selects median age column; note that the round brackets are optional
```
%% Cell type:code id: tags:
```
df_full['age'] # selects both age columns
```
%% Cell type:markdown id: tags:
Remember that indexing based on the index was done through `loc`. The rest is
the same as for the columns above
%% Cell type:code id: tags:
```
df_full.loc[('First', 0)]
```
%% Cell type:code id: tags:
```
df_full.loc['First']
```
%% Cell type:markdown id: tags:
More advanced use of the `MultiIndex` is possible through `xs`:
%% Cell type:code id: tags:
```
df_full.xs(0, level='survived') # selects all the zero's from the survived index
```
%% Cell type:code id: tags:
```
df_full.xs('mad', axis=1, level=1) # selects mad from the second level in the columns (i.e., axis=1)
```
%% Cell type:markdown id: tags:
## Reshaping tables
If we were interested in how the survival rate depends on the class and sex of
the passengers we could simply use a groupby:
%% Cell type:code id: tags:
```
titanic.groupby(['class', 'sex']).survived.mean()
```
%% Cell type:markdown id: tags:
However, this single-column table is difficult to read. The reason for this is
that the indexing is multi-leveled (called `MultiIndex` in pandas), while there
is only a single column. We would like to move one of the levels in the index to
the columns. This can be done using `stack`/`unstack`:
- `unstack`: Moves one levels in the index to the columns
- `stack`: Moves one of levels in the columns to the index
%% Cell type:code id: tags:
```
titanic.groupby(['class', 'sex']).survived.mean().unstack('sex')
```
%% Cell type:markdown id: tags:
The former table, where the different groups are defined in different rows, is
often referred to as long-form. After unstacking the table is often referred to
as wide-form as the different group (sex in this case) is now represented as
different columns. In pandas some operations are easier on long-form tables
(e.g., `groupby`) while others require wide_form tables (e.g., making scatter
plots of two variables). You can go back and forth using `unstack` or `stack` as
illustrated above, but as this is a crucial part of pandas there are many
alternatives, such as `pivot_table`, `melt`, and `wide_to_long`, which we will
discuss below.
We can prettify the table further using seaborn
%% Cell type:code id: tags:
```
ax = sns.heatmap(titanic.groupby(['class', 'sex']).survived.mean().unstack('sex'),
annot=True)
ax.set_title('survival rate')
```
%% Cell type:markdown id: tags:
Note that there are also many ways to produce prettier tables in pandas (e.g.,
color all the negative values). This is documented
[here](http://pandas.pydata.org/pandas-docs/stable/style.html).
Because this stacking/unstacking is fairly common after a groupby operation,
there is a shortcut for it: `pivot_table`
%% Cell type:code id: tags:
```
titanic.pivot_table('survived', 'class', 'sex')
```
%% Cell type:markdown id: tags:
As usual in pandas, where we can also provide multiple column names
%% Cell type:code id: tags:
```
sns.heatmap(titanic.pivot_table('survived', ['class', 'embark_town'], ['sex', pd.cut(titanic.age, (0, 18, np.inf))]), annot=True)
```
%% Cell type:markdown id: tags:
We can also change the function to be used to aggregate the data
%% Cell type:code id: tags:
```
sns.heatmap(titanic.pivot_table('survived', ['class', 'embark_town'], ['sex', pd.cut(titanic.age, (0, 18, np.inf))],
aggfunc='count'), annot=True)
```
%% Cell type:markdown id: tags:
As in `groupby` the aggregation function can be a string of a common aggregation
function, or any function that should be applied.
We can even apply different aggregate functions to different columns
%% Cell type:code id: tags:
```
titanic.pivot_table(index='class', columns='sex',
aggfunc={'survived': 'count', 'fare': np.mean}) # compute number of survivors and mean fare
```
%% Cell type:markdown id: tags:
The opposite of `pivot_table` is `melt`. This can be used to change a wide-form
table into a long-form table. This is not particularly useful on the titanic
dataset, so let's create a new table where this might be useful. Let's say we
have a dataset listing the FA and MD values in various WM tracts:
%% Cell type:code id: tags:
```
tracts = ('Corpus callosum', 'Internal capsule', 'SLF', 'Arcuate fasciculus')
df_wide = pd.DataFrame.from_dict(dict({'subject': list('ABCDEFGHIJ')}, **{
f'FA({tract})': np.random.rand(10) for tract in tracts }, **{
f'MD({tract})': np.random.rand(10) * 1e-3 for tract in tracts
}))
df_wide
```
%% Cell type:markdown id: tags:
This wide-form table (i.e., all the information is in different columns) makes
it hard to select just all the FA values or only the values associated with the
SLF. For this it would be easier to list all the values in a single column.
Most of the tools discussed above (e.g., `group_by` or `seaborn` plotting) work
better with long-form data, which we can obtain from `melt`:
%% Cell type:code id: tags:
```
df_long = df_wide.melt('subject', var_name='measurement', value_name='dti_value')
df_long.head(12)
```
%% Cell type:markdown id: tags:
We can see that `melt` took all the columns (we could also have specified a
specific sub-set) and returned each measurement as a seperate row. We probably
want to seperate the measurement column into the measurement type (FA or MD) and
the tract name. Many string manipulation function are available in the
`DataFrame` object under `DataFrame.str`
([tutorial](http://pandas.pydata.org/pandas-docs/stable/text.html))
%% Cell type:code id: tags:
```
df_long['variable'] = df_long.measurement.str.slice(0, 2) # first two letters correspond to FA or MD
df_long['tract'] = df_long.measurement.str.slice(3, -1) # fourth till the second-to-last letter correspond to the tract
df_long.head(12)
```
%% Cell type:markdown id: tags:
Finally we probably do want the FA and MD variables as different columns.
**Exercise**: Use `pivot_table` or `stack`/`unstack` to create a column for MD
and FA.
%% Cell type:code id: tags:
```
df_unstacked = df_long.
```
%% Cell type:markdown id: tags:
We can now use the tools discussed above to visualize the table (`seaborn`) or
to group the table based on tract (`groupby` or `pivot_table`).
%% Cell type:code id: tags:
```
# feel free to analyze this random data in more detail
```
%% Cell type:markdown id: tags:
In general pandas is better at handling long-form than wide-form data, although
for better visualization of the data an intermediate format is often best. One
exception is calculating a covariance (`DataFrame.cov`) or correlation
(`DataFrame.corr`) matrices which computes the correlation between each column:
%% Cell type:code id: tags:
```
sns.heatmap(df_wide.corr(), cmap=sns.diverging_palette(240, 10, s=99, n=300), )
```
%% Cell type:markdown id: tags:
## Linear fitting (`statsmodels`)
Linear fitting between the different columns is available through the
[`statsmodels`](https://www.statsmodels.org/stable/index.html) library. A nice
way to play around with a wide variety of possible models is to use R-style
functions. The usage of the functions in `statsmodels` is described
[here](https://www.statsmodels.org/dev/example_formulas.html). You can find a
more detailed description of the R-style functions
[here](https://patsy.readthedocs.io/en/latest/formulas.html#the-formula-
language).
In short these functions describe the linear model as a string. For example,
`"y ~ x + a + x * a"` fits the variable `y` as a function of `x`, `a`, and the
interaction between `x` and `a`. The intercept is included by default (you can
add `"+ 0"` to remove it).
%% Cell type:code id: tags:
```
result = smf.logit('survived ~ age + sex + age * sex', data=titanic).fit()
print(result.summary())
```
%% Cell type:markdown id: tags:
Note that `statsmodels` understands categorical variables and automatically
replaces them with dummy variables.
Above we used logistic regression, which is appropriate for the binary
survival rate. A wide variety of linear models are available. Let's try a GLM,
but assume that the fare is drawn from a Gamma distribution:
%% Cell type:code id: tags:
```
age_dmean = titanic.age - titanic.age.mean()
result = smf.glm('fare ~ age_dmean + embark_town', data=titanic).fit()
print(result.summary())
```
%% Cell type:markdown id: tags:
Cherbourg passengers clearly paid a lot more...
Note that we did not actually add the `age_dmean` to the
`DataFrame`. `statsmodels` (or more precisely the underlying
[patsy](https://patsy.readthedocs.io/en/latest/) library) automatically
extracted this from our environment. This can lead to confusing behaviour...
# More reading
Other useful features
- [Concatenating and merging tables](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/08_combine_dataframes.html)
- [Lots of time series support](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/09_timeseries.html)
- [Rolling Window
functions](http://pandas.pydata.org/pandas-docs/stable/computation.html#window-
functions) for after you have meaningfully sorted your data
- and much, much more
# Pandas
Follow along online at: https://git.fmrib.ox.ac.uk/fsl/pytreat-practicals-2020/-/blob/master/talks/pandas/pandas.ipynb
Pandas is a data analysis library focused on the cleaning and exploration of
tabular data.
Some useful links are:
- [main website](https://pandas.pydata.org)
- [documentation](http://pandas.pydata.org/pandas-docs/stable/)<sup>1</sup>
- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)<sup>1</sup> by
Jake van der Plas
- [List of Pandas tutorials](https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html)
<sup>1</sup> This tutorial borrows heavily from the pandas documentation and
the Python Data Science Handbook
```
%pylab inline
import pandas as pd # pd is the usual abbreviation for pandas
import matplotlib.pyplot as plt # matplotlib for plotting
import seaborn as sns # seaborn is the main plotting library for Pandas
import statsmodels.api as sm # statsmodels fits linear models to pandas data
import statsmodels.formula.api as smf
from IPython.display import Image
sns.set() # use the prettier seaborn plotting settings rather than the default matplotlib one
```
> We will mostly be using `seaborn` instead of `matplotlib` for
> visualisation. But `seaborn` is actually an extension to `matplotlib`, so we
> are still using the latter under the hood.
## Loading in data
Pandas supports a wide range of I/O tools to load from text files, binary files,
and SQL databases. You can find a table with all formats
[here](http://pandas.pydata.org/pandas-docs/stable/io.html).
```
titanic = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv')
titanic
```
This loads the data into a
[`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)
object, which is the main object we will be interacting with in pandas. It
represents a table of data. The other file formats all start with
`pd.read_{format}`. Note that we can provide the URL to the dataset, rather
than download it beforehand.
We can write out the dataset using `dataframe.to_{format}(<filename>)`:
```
titanic.to_csv('titanic_copy.csv', index=False) # we set index to False to prevent pandas from storing the row names
```
If you can not connect to the internet, you can run the command below to load
this locally stored titanic dataset
```
titanic = pd.read_csv('titanic.csv')
titanic
```
Note that the titanic dataset was also available to us as one of the standard
datasets included with seaborn. We could load it from there using
```
sns.load_dataset('titanic')
```
`Dataframes` can also be created from other python objects, using
`pd.DataFrame.from_{other type}`. The most useful of these is `from_dict`,
which converts a mapping of the columns to a pandas `DataFrame` (i.e., table).
```
pd.DataFrame.from_dict({
'random numbers': np.random.rand(5),
'sequence (int)': np.arange(5),
'sequence (float)': np.linspace(0, 5, 5),
'letters': list('abcde'),
'constant_value': 'same_value'
})
```
For many applications (e.g., ICA, machine learning input) you might want to
extract your data as a numpy array. The underlying numpy array can be accessed
using the `to_numpy` method
```
titanic.to_numpy()
```
Note that the type of the returned array is the most common type (in this case
object). If you just want the numeric parts of the table you can use
`select_dtypes`, which selects specific columns based on their dtype:
```
titanic.select_dtypes(include=np.number).to_numpy()
```
Note that the numpy array has no information on the column names or row indices.
Alternatively, when you want to include the categorical variables in your later
analysis (e.g., for machine learning), you can extract dummy variables using:
```
pd.get_dummies(titanic)
```
## Accessing parts of the data
[Documentation on indexing](http://pandas.pydata.org/pandas-docs/stable/indexing.html)
### Selecting columns by name
Single columns can be selected using the normal python indexing:
```
titanic['embark_town']
```
If the column names are simple strings (not required) we can also access it
directly as an attribute
```
titanic.embark_town
```
Note that this returns a pandas
[`Series`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html)
rather than a `DataFrame` object. A `Series` is simply a 1-dimensional array
representing a single column. Multiple columns can be returned by providing a
list of columns names. This will return a `DataFrame`:
```
titanic[['class', 'alive']]
```
Note that you have to provide a list here (square brackets). If you provide a
tuple (round brackets) pandas will think you are trying to access a single
column that has that tuple as a name:
```
titanic[('class', 'alive')]
```
In this case there is no column called `('class', 'alive')` leading to an
error. Later on we will see some uses to having columns named like this.
### Indexing rows by name or integer
Individual rows can be accessed based on their name (i.e., the index) or integer
(i.e., which row it is in). In our current table this will give the same
results. To ensure that these are different, let's sort our titanic dataset
based on the passenger fare:
```
titanic_sorted = titanic.sort_values('fare')
titanic_sorted
```
Note that the re-sorting did not change the values in the index (i.e., left-most
column).
We can select the first row of this newly sorted table using `iloc`
```
titanic_sorted.iloc[0]
```
We can select the row with the index 0 using
```
titanic_sorted.loc[0]
```
Note that this gives the same passenger as the first row of the initial table
before sorting
```
titanic.iloc[0]
```
Another common way to access the first or last N rows of a table is using the
head/tail methods
```
titanic_sorted.head(3)
```
```
titanic_sorted.tail(3)
```
Note that nearly all methods in pandas return a new `Dataframe`, which means
that we can easily call another method on them
```
titanic_sorted.tail(10).head(5) # select the first 5 of the last 10 passengers in the database
```
```
titanic_sorted.iloc[-10:-5] # alternative way to get the same passengers
```
**Exercise**: use sorting and tail/head or indexing to find the 10 youngest
passengers on the titanic. Try to do this on a single line by chaining calls
to the titanic `DataFrame` object
```{.python .input}
titanic.sort_values...
```
### Indexing rows by value
One final way to select specific columns is by their value
```
titanic[titanic.sex == 'female'] # selects all females
```
```
# select all passengers older than 60 who departed from Southampton
titanic[(titanic.age > 60) & (titanic['embark_town'] == 'Southampton')]
```
Note that this required typing `titanic` quite often. A quicker way to get the
same result is using the `query` method, which is described in detail
[here](http://pandas.pydata.org/pandas-docs/stable/indexing.html#the-query-method)
(note that using the `query` method is also faster and uses a lot less
memory).
> You may have trouble using the `query` method with columns which have
a name that cannot be used as a Python identifier.
```
titanic.query('(age > 60) & (embark_town == "Southampton")')
```
When selecting a categorical multiple options from a categorical values you
might want to use `isin`:
```
titanic[titanic['class'].isin(['First','Second'])]
```
Particularly useful when selecting data like this is the `isna` method which
finds all missing data
```
titanic[~titanic.age.isna()] # select first few passengers whose age is not N/A
```
This removing of missing numbers is so common that it has is own method
```
titanic.dropna() # drops all passengers that have some datapoint missing
```
```
titanic.dropna(subset=['age', 'fare']) # Only drop passengers with missing ages or fares
```
**Exercise**: use sorting, indexing by value, `dropna` and `tail`/`head` or
indexing to find the 10 oldest female passengers on the titanic. Try to do
this on a single line by chaining calls to the titanic `DataFrame` object
```
titanic...
```
## Plotting the data
Before we start analyzing the data, let's play around with visualizing it.
Pandas does have some basic built-in plotting options:
```
titanic.fare.hist(bins=20, log=True)
```
```
titanic.age.plot()
```
To plot all variables simply call `plot` or `hist` on the full dataframe
rather than a single Series (i.e., column). You might want to set `subplots=True`
to plot each variable in a different subplot.
Individual Series are essentially 1D arrays, so we can use them as such in
`matplotlib`
```
plt.scatter(titanic.age, titanic.fare)
```
However, for most purposes much nicer plots can be obtained using
[Seaborn](https://seaborn.pydata.org). Seaborn has support to produce plots
showing the
[univariate](https://seaborn.pydata.org/tutorial/distributions.html#plotting-univariate-distributions)
or
[bivariate](https://seaborn.pydata.org/tutorial/distributions.html#plotting-bivariate-distributions)
distribution of data in a single or a grid of plots. Most of the seaborn
plotting functions expect to get a pandas `DataFrame` (although they will work
with Numpy arrays as well). So we can plot age vs. fare like:
```
sns.jointplot('age', 'fare', data=titanic)
```
**Exercise**: check the documentation from `sns.jointplot` (hover the mouse
over the text `jointplot` and press shift-tab) to find out how to turn the
scatter plot into a density (kde) map
```
sns.jointplot('age', 'fare', data=titanic, ...)
```
Here is just a brief example of how we can use multiple columns to illustrate
the data in more detail
```
sns.relplot(x='age', y='fare', col='class', hue='sex', data=titanic,
col_order=('First', 'Second', 'Third'))
```
**Exercise**: Split the plot above into two rows with the first row including
the passengers who survived and the second row those who did not (you might
have to check the documentation again by using shift-tab while overing the
mouse over `relplot`)
```
sns.relplot(x='age', y='fare', col='class', hue='sex', data=titanic,
col_order=('First', 'Second', 'Third')...)
```
One of the nice thing of Seaborn is how easy it is to update how these plots
look. You can read more about that
[here](https://seaborn.pydata.org/tutorial/aesthetics.html). For example, to
increase the font size to get a plot more approriate for a talk, you can use:
```
sns.set_context('talk')
sns.violinplot(x='class', y='age', hue='sex', data=titanic, split=True,
order=('First', 'Second', 'Third'))
```
## Summarizing the data (mean, std, etc.)
There are a large number of built-in methods to summarize the observations in
a Pandas `DataFrame`. Most of these will return a `Series` with the columns
names as index:
```
titanic.mean()
```
```
titanic.quantile(0.75)
```
One very useful one is `describe`, which gives an overview of many common
summary measures
```
titanic.describe()
```
For a more detailed exploration of the data, you might want to check
[pandas_profiliing](https://pandas-profiling.github.io/pandas-profiling/docs/)
(not installed in fslpython, so the following will not run in fslpython):
```
from pandas_profiling import ProfileReport
profile = ProfileReport(titanic, title='Titanic Report', html={'style':{'full_width':True}})
profile.to_widgets()
```
Note that non-numeric columns are ignored when summarizing data in this way.
We can also define our own functions to apply to the columns (in this case we
have to explicitly set the data types).
```
def mad(series):
"""
Computes the median absolute deviatation (MAD)
This is a outlier-resistant measure of the standard deviation
"""
no_nan = series.dropna()
return np.median(abs(no_nan - np.nanmedian(no_nan)))
titanic.select_dtypes(np.number).apply(mad)
```
We can also provide multiple functions to the `apply` method (note that
functions can be provided as strings)
```
titanic.select_dtypes(np.number).apply(['mean', np.median, np.std, mad])
```
### Grouping by
One of the more powerful features of is `groupby`, which splits the dataset on
a categorical variable. The book contains a clear tutorial on that feature
[here](https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html). You
can check the pandas documentation
[here](http://pandas.pydata.org/pandas-docs/stable/groupby.html) for a more
formal introduction. One simple use is just to put it into a loop
```
for cls, part_table in titanic.groupby('class'):
print(f'Mean fare in {cls.lower()} class: {part_table.fare.mean()}')
```
However, it is more often combined with one of the aggregation functions
discussed above as illustrated in this figure from the [Python data science
handbook](https://jakevdp.github.io/PythonDataScienceHandbook/06.00-figure-code.html#Split-Apply-Combine)
![group by image](group_by.png)
```
titanic.groupby('class').mean()
```
We can also group by multiple variables at once
```
titanic.groupby(['class', 'survived']).mean() # as always in pandas supply multiple column names as lists, not tuples
```
When grouping it can help to use the `cut` method to split a continuous variable
into a categorical one
```
titanic.groupby(['class', pd.cut(titanic.age, bins=(0, 18, 50, np.inf))]).mean()
```
We can use the `aggregate` method to apply a different function to each series
```
titanic.groupby(['class', 'survived']).aggregate((np.median, mad))
```
Note that both the index (on the left) and the column names (on the top) now
have multiple levels. Such a multi-level index is referred to as `MultiIndex`.
This does complicate selecting specific columns/rows. You can read more of using
`MultiIndex` [here](http://pandas.pydata.org/pandas-docs/stable/advanced.html).
The short version is that columns can be selected using direct indexing (as
discussed above)
```
df_full = titanic.groupby(['class', 'survived']).aggregate((np.median, mad))
```
```
df_full[('age', 'median')] # selects median age column; note that the round brackets are optional
```
```
df_full['age'] # selects both age columns
```
Remember that indexing based on the index was done through `loc`. The rest is
the same as for the columns above
```
df_full.loc[('First', 0)]
```
```
df_full.loc['First']
```
More advanced use of the `MultiIndex` is possible through `xs`:
```
df_full.xs(0, level='survived') # selects all the zero's from the survived index
```
```
df_full.xs('mad', axis=1, level=1) # selects mad from the second level in the columns (i.e., axis=1)
```
## Reshaping tables
If we were interested in how the survival rate depends on the class and sex of
the passengers we could simply use a groupby:
```
titanic.groupby(['class', 'sex']).survived.mean()
```
However, this single-column table is difficult to read. The reason for this is
that the indexing is multi-leveled (called `MultiIndex` in pandas), while there
is only a single column. We would like to move one of the levels in the index to
the columns. This can be done using `stack`/`unstack`:
- `unstack`: Moves one levels in the index to the columns
- `stack`: Moves one of levels in the columns to the index
```
titanic.groupby(['class', 'sex']).survived.mean().unstack('sex')
```
The former table, where the different groups are defined in different rows, is
often referred to as long-form. After unstacking the table is often referred to
as wide-form as the different group (sex in this case) is now represented as
different columns. In pandas some operations are easier on long-form tables
(e.g., `groupby`) while others require wide_form tables (e.g., making scatter
plots of two variables). You can go back and forth using `unstack` or `stack` as
illustrated above, but as this is a crucial part of pandas there are many
alternatives, such as `pivot_table`, `melt`, and `wide_to_long`, which we will
discuss below.
We can prettify the table further using seaborn
```
ax = sns.heatmap(titanic.groupby(['class', 'sex']).survived.mean().unstack('sex'),
annot=True)
ax.set_title('survival rate')
```
Note that there are also many ways to produce prettier tables in pandas (e.g.,
color all the negative values). This is documented
[here](http://pandas.pydata.org/pandas-docs/stable/style.html).
Because this stacking/unstacking is fairly common after a groupby operation,
there is a shortcut for it: `pivot_table`
```
titanic.pivot_table('survived', 'class', 'sex')
```
As usual in pandas, where we can also provide multiple column names
```
sns.heatmap(titanic.pivot_table('survived', ['class', 'embark_town'], ['sex', pd.cut(titanic.age, (0, 18, np.inf))]), annot=True)
```
We can also change the function to be used to aggregate the data
```
sns.heatmap(titanic.pivot_table('survived', ['class', 'embark_town'], ['sex', pd.cut(titanic.age, (0, 18, np.inf))],
aggfunc='count'), annot=True)
```
As in `groupby` the aggregation function can be a string of a common aggregation
function, or any function that should be applied.
We can even apply different aggregate functions to different columns
```
titanic.pivot_table(index='class', columns='sex',
aggfunc={'survived': 'count', 'fare': np.mean}) # compute number of survivors and mean fare
```
The opposite of `pivot_table` is `melt`. This can be used to change a wide-form
table into a long-form table. This is not particularly useful on the titanic
dataset, so let's create a new table where this might be useful. Let's say we
have a dataset listing the FA and MD values in various WM tracts:
```
tracts = ('Corpus callosum', 'Internal capsule', 'SLF', 'Arcuate fasciculus')
df_wide = pd.DataFrame.from_dict(dict({'subject': list('ABCDEFGHIJ')}, **{
f'FA({tract})': np.random.rand(10) for tract in tracts }, **{
f'MD({tract})': np.random.rand(10) * 1e-3 for tract in tracts
}))
df_wide
```
This wide-form table (i.e., all the information is in different columns) makes
it hard to select just all the FA values or only the values associated with the
SLF. For this it would be easier to list all the values in a single column.
Most of the tools discussed above (e.g., `group_by` or `seaborn` plotting) work
better with long-form data, which we can obtain from `melt`:
```
df_long = df_wide.melt('subject', var_name='measurement', value_name='dti_value')
df_long.head(12)
```
We can see that `melt` took all the columns (we could also have specified a
specific sub-set) and returned each measurement as a seperate row. We probably
want to seperate the measurement column into the measurement type (FA or MD) and
the tract name. Many string manipulation function are available in the
`DataFrame` object under `DataFrame.str`
([tutorial](http://pandas.pydata.org/pandas-docs/stable/text.html))
```
df_long['variable'] = df_long.measurement.str.slice(0, 2) # first two letters correspond to FA or MD
df_long['tract'] = df_long.measurement.str.slice(3, -1) # fourth till the second-to-last letter correspond to the tract
df_long.head(12)
```
Finally we probably do want the FA and MD variables as different columns.
**Exercise**: Use `pivot_table` or `stack`/`unstack` to create a column for MD
and FA.
```
df_unstacked = df_long.
```
We can now use the tools discussed above to visualize the table (`seaborn`) or
to group the table based on tract (`groupby` or `pivot_table`).
```
# feel free to analyze this random data in more detail
```
In general pandas is better at handling long-form than wide-form data, although
for better visualization of the data an intermediate format is often best. One
exception is calculating a covariance (`DataFrame.cov`) or correlation
(`DataFrame.corr`) matrices which computes the correlation between each column:
```
sns.heatmap(df_wide.corr(), cmap=sns.diverging_palette(240, 10, s=99, n=300), )
```
## Linear fitting (`statsmodels`)
Linear fitting between the different columns is available through the
[`statsmodels`](https://www.statsmodels.org/stable/index.html) library. A nice
way to play around with a wide variety of possible models is to use R-style
functions. The usage of the functions in `statsmodels` is described
[here](https://www.statsmodels.org/dev/example_formulas.html). You can find a
more detailed description of the R-style functions
[here](https://patsy.readthedocs.io/en/latest/formulas.html#the-formula-
language).
In short these functions describe the linear model as a string. For example,
`"y ~ x + a + x * a"` fits the variable `y` as a function of `x`, `a`, and the
interaction between `x` and `a`. The intercept is included by default (you can
add `"+ 0"` to remove it).
```
result = smf.logit('survived ~ age + sex + age * sex', data=titanic).fit()
print(result.summary())
```
Note that `statsmodels` understands categorical variables and automatically
replaces them with dummy variables.
Above we used logistic regression, which is appropriate for the binary
survival rate. A wide variety of linear models are available. Let's try a GLM,
but assume that the fare is drawn from a Gamma distribution:
```
age_dmean = titanic.age - titanic.age.mean()
result = smf.glm('fare ~ age_dmean + embark_town', data=titanic).fit()
print(result.summary())
```
Cherbourg passengers clearly paid a lot more...
Note that we did not actually add the `age_dmean` to the
`DataFrame`. `statsmodels` (or more precisely the underlying
[patsy](https://patsy.readthedocs.io/en/latest/) library) automatically
extracted this from our environment. This can lead to confusing behaviour...
# More reading
Other useful features
- [Concatenating and merging tables](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/08_combine_dataframes.html)
- [Lots of time series support](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/09_timeseries.html)
- [Rolling Window
functions](http://pandas.pydata.org/pandas-docs/stable/computation.html#window-
functions) for after you have meaningfully sorted your data
- and much, much more
survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
0,3,male,,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True
0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
0,3,male,2.0,3,1,21.075,S,Third,child,False,,Southampton,no,False
1,3,female,27.0,0,2,11.1333,S,Third,woman,False,,Southampton,yes,False
1,2,female,14.0,1,0,30.0708,C,Second,child,False,,Cherbourg,yes,False
1,3,female,4.0,1,1,16.7,S,Third,child,False,G,Southampton,yes,False
1,1,female,58.0,0,0,26.55,S,First,woman,False,C,Southampton,yes,True
0,3,male,20.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
0,3,male,39.0,1,5,31.275,S,Third,man,True,,Southampton,no,False
0,3,female,14.0,0,0,7.8542,S,Third,child,False,,Southampton,no,True
1,2,female,55.0,0,0,16.0,S,Second,woman,False,,Southampton,yes,True
0,3,male,2.0,4,1,29.125,Q,Third,child,False,,Queenstown,no,False
1,2,male,,0,0,13.0,S,Second,man,True,,Southampton,yes,True
0,3,female,31.0,1,0,18.0,S,Third,woman,False,,Southampton,no,False
1,3,female,,0,0,7.225,C,Third,woman,False,,Cherbourg,yes,True
0,2,male,35.0,0,0,26.0,S,Second,man,True,,Southampton,no,True
1,2,male,34.0,0,0,13.0,S,Second,man,True,D,Southampton,yes,True
1,3,female,15.0,0,0,8.0292,Q,Third,child,False,,Queenstown,yes,True
1,1,male,28.0,0,0,35.5,S,First,man,True,A,Southampton,yes,True
0,3,female,8.0,3,1,21.075,S,Third,child,False,,Southampton,no,False
1,3,female,38.0,1,5,31.3875,S,Third,woman,False,,Southampton,yes,False
0,3,male,,0,0,7.225,C,Third,man,True,,Cherbourg,no,True
0,1,male,19.0,3,2,263.0,S,First,man,True,C,Southampton,no,False
1,3,female,,0,0,7.8792,Q,Third,woman,False,,Queenstown,yes,True
0,3,male,,0,0,7.8958,S,Third,man,True,,Southampton,no,True
0,1,male,40.0,0,0,27.7208,C,First,man,True,,Cherbourg,no,True
1,1,female,,1,0,146.5208,C,First,woman,False,B,Cherbourg,yes,False
1,3,female,,0,0,7.75,Q,Third,woman,False,,Queenstown,yes,True
0,2,male,66.0,0,0,10.5,S,Second,man,True,,Southampton,no,True
0,1,male,28.0,1,0,82.1708,C,First,man,True,,Cherbourg,no,False
0,1,male,42.0,1,0,52.0,S,First,man,True,,Southampton,no,False
1,3,male,,0,0,7.2292,C,Third,man,True,,Cherbourg,yes,True
0,3,male,21.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
0,3,female,18.0,2,0,18.0,S,Third,woman,False,,Southampton,no,False
1,3,female,14.0,1,0,11.2417,C,Third,child,False,,Cherbourg,yes,False
0,3,female,40.0,1,0,9.475,S,Third,woman,False,,Southampton,no,False
0,2,female,27.0,1,0,21.0,S,Second,woman,False,,Southampton,no,False
0,3,male,,0,0,7.8958,C,Third,man,True,,Cherbourg,no,True
1,2,female,3.0,1,2,41.5792,C,Second,child,False,,Cherbourg,yes,False
1,3,female,19.0,0,0,7.8792,Q,Third,woman,False,,Queenstown,yes,True
0,3,male,,0,0,8.05,S,Third,man,True,,Southampton,no,True
0,3,male,,1,0,15.5,Q,Third,man,True,,Queenstown,no,False
1,3,female,,0,0,7.75,Q,Third,woman,False,,Queenstown,yes,True
0,3,male,,2,0,21.6792,C,Third,man,True,,Cherbourg,no,False
0,3,female,18.0,1,0,17.8,S,Third,woman,False,,Southampton,no,False
0,3,male,7.0,4,1,39.6875,S,Third,child,False,,Southampton,no,False
0,3,male,21.0,0,0,7.8,S,Third,man,True,,Southampton,no,True
1,1,female,49.0,1,0,76.7292,C,First,woman,False,D,Cherbourg,yes,False
1,2,female,29.0,1,0,26.0,S,Second,woman,False,,Southampton,yes,False
0,1,male,65.0,0,1,61.9792,C,First,man,True,B,Cherbourg,no,False
1,1,male,,0,0,35.5,S,First,man,True,C,Southampton,yes,True
1,2,female,21.0,0,0,10.5,S,Second,woman,False,,Southampton,yes,True
0,3,male,28.5,0,0,7.2292,C,Third,man,True,,Cherbourg,no,True
1,2,female,5.0,1,2,27.75,S,Second,child,False,,Southampton,yes,False
0,3,male,11.0,5,2,46.9,S,Third,child,False,,Southampton,no,False
0,3,male,22.0,0,0,7.2292,C,Third,man,True,,Cherbourg,no,True
1,1,female,38.0,0,0,80.0,,First,woman,False,B,,yes,True
0,1,male,45.0,1,0,83.475,S,First,man,True,C,Southampton,no,False
0,3,male,4.0,3,2,27.9,S,Third,child,False,,Southampton,no,False
0,1,male,,0,0,27.7208,C,First,man,True,,Cherbourg,no,True
1,3,male,,1,1,15.2458,C,Third,man,True,,Cherbourg,yes,False
1,2,female,29.0,0,0,10.5,S,Second,woman,False,F,Southampton,yes,True
0,3,male,19.0,0,0,8.1583,S,Third,man,True,,Southampton,no,True
1,3,female,17.0,4,2,7.925,S,Third,woman,False,,Southampton,yes,False
0,3,male,26.0,2,0,8.6625,S,Third,man,True,,Southampton,no,False
0,2,male,32.0,0,0,10.5,S,Second,man,True,,Southampton,no,True
0,3,female,16.0,5,2,46.9,S,Third,woman,False,,Southampton,no,False
0,2,male,21.0,0,0,73.5,S,Second,man,True,,Southampton,no,True
0,3,male,26.0,1,0,14.4542,C,Third,man,True,,Cherbourg,no,False
1,3,male,32.0,0,0,56.4958,S,Third,man,True,,Southampton,yes,True
0,3,male,25.0,0,0,7.65,S,Third,man,True,F,Southampton,no,True
0,3,male,,0,0,7.8958,S,Third,man,True,,Southampton,no,True
0,3,male,,0,0,8.05,S,Third,man,True,,Southampton,no,True
1,2,male,0.83,0,2,29.0,S,Second,child,False,,Southampton,yes,False
1,3,female,30.0,0,0,12.475,S,Third,woman,False,,Southampton,yes,True
0,3,male,22.0,0,0,9.0,S,Third,man,True,,Southampton,no,True
1,3,male,29.0,0,0,9.5,S,Third,man,True,,Southampton,yes,True
1,3,female,,0,0,7.7875,Q,Third,woman,False,,Queenstown,yes,True
0,1,male,28.0,0,0,47.1,S,First,man,True,,Southampton,no,True
1,2,female,17.0,0,0,10.5,S,Second,woman,False,,Southampton,yes,True
1,3,female,33.0,3,0,15.85,S,Third,woman,False,,Southampton,yes,False
0,3,male,16.0,1,3,34.375,S,Third,man,True,,Southampton,no,False
0,3,male,,0,0,8.05,S,Third,man,True,,Southampton,no,True
1,1,female,23.0,3,2,263.0,S,First,woman,False,C,Southampton,yes,False
0,3,male,24.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
0,3,male,29.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
0,3,male,20.0,0,0,7.8542,S,Third,man,True,,Southampton,no,True
0,1,male,46.0,1,0,61.175,S,First,man,True,E,Southampton,no,False
0,3,male,26.0,1,2,20.575,S,Third,man,True,,Southampton,no,False
0,3,male,59.0,0,0,7.25,S,Third,man,True,,Southampton,no,True
0,3,male,,0,0,8.05,S,Third,man,True,,Southampton,no,True
0,1,male,71.0,0,0,34.6542,C,First,man,True,A,Cherbourg,no,True
1,1,male,23.0,0,1,63.3583,C,First,man,True,D,Cherbourg,yes,False
1,2,female,34.0,0,1,23.0,S,Second,woman,False,,Southampton,yes,False
0,2,male,34.0,1,0,26.0,S,Second,man,True,,Southampton,no,False
0,3,female,28.0,0,0,7.8958,S,Third,woman,False,,Southampton,no,True
0,3,male,,0,0,7.8958,S,Third,man,True,,Southampton,no,True
0,1,male,21.0,0,1,77.2875,S,First,man,True,D,Southampton,no,False
0,3,male,33.0,0,0,8.6542,S,Third,man,True,,Southampton,no,True
0,3,male,37.0,2,0,7.925,S,Third,man,True,,Southampton,no,False
0,3,male,28.0,0,0,7.8958,S,Third,man,True,,Southampton,no,True
1,3,female,21.0,0,0,7.65,S,Third,woman,False,,Southampton,yes,True
1,3,male,,0,0,7.775,S,Third,man,True,,Southampton,yes,True
0,3,male,38.0,0,0,7.8958,S,Third,man,True,,Southampton,no,True
1,3,female,,1,0,24.15,Q,Third,woman,False,,Queenstown,yes,False
0,1,male,47.0,0,0,52.0,S,First,man,True,C,Southampton,no,True
0,3,female,14.5,1,0,14.4542,C,Third,child,False,,Cherbourg,no,False
0,3,male,22.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
0,3,female,20.0,1,0,9.825,S,Third,woman,False,,Southampton,no,False
0,3,female,17.0,0,0,14.4583,C,Third,woman,False,,Cherbourg,no,True
0,3,male,21.0,0,0,7.925,S,Third,man,True,,Southampton,no,True
0,3,male,70.5,0,0,7.75,Q,Third,man,True,,Queenstown,no,True
0,2,male,29.0,1,0,21.0,S,Second,man,True,,Southampton,no,False
0,1,male,24.0,0,1,247.5208,C,First,man,True,B,Cherbourg,no,False
0,3,female,2.0,4,2,31.275,S,Third,child,False,,Southampton,no,False
0,2,male,21.0,2,0,73.5,S,Second,man,True,,Southampton,no,False
0,3,male,,0,0,8.05,S,Third,man,True,,Southampton,no,True
0,2,male,32.5,1,0,30.0708,C,Second,man,True,,Cherbourg,no,False
1,2,female,32.5,0,0,13.0,S,Second,woman,False,E,Southampton,yes,True
0,1,male,54.0,0,1,77.2875,S,First,man,True,D,Southampton,no,False
1,3,male,12.0,1,0,11.2417,C,Third,child,False,,Cherbourg,yes,False
0,3,male,,0,0,7.75,Q,Third,man,True,,Queenstown,no,True
1,3,male,24.0,0,0,7.1417,S,Third,man,True,,Southampton,yes,True
1,3,female,,1,1,22.3583,C,Third,woman,False,F,Cherbourg,yes,False
0,3,male,45.0,0,0,6.975,S,Third,man,True,,Southampton,no,True
0,3,male,33.0,0,0,7.8958,C,Third,man,True,,Cherbourg,no,True
0,3,male,20.0,0,0,7.05,S,Third,man,True,,Southampton,no,True
0,3,female,47.0,1,0,14.5,S,Third,woman,False,,Southampton,no,False
1,2,female,29.0,1,0,26.0,S,Second,woman,False,,Southampton,yes,False
0,2,male,25.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
0,2,male,23.0,0,0,15.0458,C,Second,man,True,,Cherbourg,no,True
1,1,female,19.0,0,2,26.2833,S,First,woman,False,D,Southampton,yes,False
0,1,male,37.0,1,0,53.1,S,First,man,True,C,Southampton,no,False
0,3,male,16.0,0,0,9.2167,S,Third,man,True,,Southampton,no,True
0,1,male,24.0,0,0,79.2,C,First,man,True,B,Cherbourg,no,True
0,3,female,,0,2,15.2458,C,Third,woman,False,,Cherbourg,no,False
1,3,female,22.0,0,0,7.75,S,Third,woman,False,,Southampton,yes,True
1,3,female,24.0,1,0,15.85,S,Third,woman,False,,Southampton,yes,False
0,3,male,19.0,0,0,6.75,Q,Third,man,True,,Queenstown,no,True
0,2,male,18.0,0,0,11.5,S,Second,man,True,,Southampton,no,True
0,2,male,19.0,1,1,36.75,S,Second,man,True,,Southampton,no,False
1,3,male,27.0,0,0,7.7958,S,Third,man,True,,Southampton,yes,True
0,3,female,9.0,2,2,34.375,S,Third,child,False,,Southampton,no,False
0,2,male,36.5,0,2,26.0,S,Second,man,True,F,Southampton,no,False
0,2,male,42.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
0,2,male,51.0,0,0,12.525,S,Second,man,True,,Southampton,no,True
1,1,female,22.0,1,0,66.6,S,First,woman,False,C,Southampton,yes,False
0,3,male,55.5,0,0,8.05,S,Third,man,True,,Southampton,no,True
0,3,male,40.5,0,2,14.5,S,Third,man,True,,Southampton,no,False
0,3,male,,0,0,7.3125,S,Third,man,True,,Southampton,no,True
0,1,male,51.0,0,1,61.3792,C,First,man,True,,Cherbourg,no,False
1,3,female,16.0,0,0,7.7333,Q,Third,woman,False,,Queenstown,yes,True
0,3,male,30.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
0,3,male,,0,0,8.6625,S,Third,man,True,,Southampton,no,True
0,3,male,,8,2,69.55,S,Third,man,True,,Southampton,no,False
0,3,male,44.0,0,1,16.1,S,Third,man,True,,Southampton,no,False
1,2,female,40.0,0,0,15.75,S,Second,woman,False,,Southampton,yes,True
0,3,male,26.0,0,0,7.775,S,Third,man,True,,Southampton,no,True
0,3,male,17.0,0,0,8.6625,S,Third,man,True,,Southampton,no,True
0,3,male,1.0,4,1,39.6875,S,Third,child,False,,Southampton,no,False
1,3,male,9.0,0,2,20.525,S,Third,child,False,,Southampton,yes,False
1,1,female,,0,1,55.0,S,First,woman,False,E,Southampton,yes,False
0,3,female,45.0,1,4,27.9,S,Third,woman,False,,Southampton,no,False
0,1,male,,0,0,25.925,S,First,man,True,,Southampton,no,True
0,3,male,28.0,0,0,56.4958,S,Third,man,True,,Southampton,no,True
0,1,male,61.0,0,0,33.5,S,First,man,True,B,Southampton,no,True
0,3,male,4.0,4,1,29.125,Q,Third,child,False,,Queenstown,no,False
1,3,female,1.0,1,1,11.1333,S,Third,child,False,,Southampton,yes,False
0,3,male,21.0,0,0,7.925,S,Third,man,True,,Southampton,no,True
0,1,male,56.0,0,0,30.6958,C,First,man,True,A,Cherbourg,no,True
0,3,male,18.0,1,1,7.8542,S,Third,man,True,,Southampton,no,False
0,3,male,,3,1,25.4667,S,Third,man,True,,Southampton,no,False
0,1,female,50.0,0,0,28.7125,C,First,woman,False,C,Cherbourg,no,True
0,2,male,30.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
0,3,male,36.0,0,0,0.0,S,Third,man,True,,Southampton,no,True
0,3,female,,8,2,69.55,S,Third,woman,False,,Southampton,no,False
0,2,male,,0,0,15.05,C,Second,man,True,,Cherbourg,no,True
0,3,male,9.0,4,2,31.3875,S,Third,child,False,,Southampton,no,False
1,2,male,1.0,2,1,39.0,S,Second,child,False,F,Southampton,yes,False
1,3,female,4.0,0,2,22.025,S,Third,child,False,,Southampton,yes,False
0,1,male,,0,0,50.0,S,First,man,True,A,Southampton,no,True
1,3,female,,1,0,15.5,Q,Third,woman,False,,Queenstown,yes,False
1,1,male,45.0,0,0,26.55,S,First,man,True,,Southampton,yes,True
0,3,male,40.0,1,1,15.5,Q,Third,man,True,,Queenstown,no,False
0,3,male,36.0,0,0,7.8958,S,Third,man,True,,Southampton,no,True
1,2,female,32.0,0,0,13.0,S,Second,woman,False,,Southampton,yes,True
0,2,male,19.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
1,3,female,19.0,1,0,7.8542,S,Third,woman,False,,Southampton,yes,False
1,2,male,3.0,1,1,26.0,S,Second,child,False,F,Southampton,yes,False
1,1,female,44.0,0,0,27.7208,C,First,woman,False,B,Cherbourg,yes,True
1,1,female,58.0,0,0,146.5208,C,First,woman,False,B,Cherbourg,yes,True
0,3,male,,0,0,7.75,Q,Third,man,True,,Queenstown,no,True
0,3,male,42.0,0,1,8.4042,S,Third,man,True,,Southampton,no,False
1,3,female,,0,0,7.75,Q,Third,woman,False,,Queenstown,yes,True
0,2,female,24.0,0,0,13.0,S,Second,woman,False,,Southampton,no,True
0,3,male,28.0,0,0,9.5,S,Third,man,True,,Southampton,no,True
0,3,male,,8,2,69.55,S,Third,man,True,,Southampton,no,False
0,3,male,34.0,0,0,6.4958,S,Third,man,True,,Southampton,no,True
0,3,male,45.5,0,0,7.225,C,Third,man,True,,Cherbourg,no,True
1,3,male,18.0,0,0,8.05,S,Third,man,True,,Southampton,yes,True
0,3,female,2.0,0,1,10.4625,S,Third,child,False,G,Southampton,no,False
0,3,male,32.0,1,0,15.85,S,Third,man,True,,Southampton,no,False
1,3,male,26.0,0,0,18.7875,C,Third,man,True,,Cherbourg,yes,True
1,3,female,16.0,0,0,7.75,Q,Third,woman,False,,Queenstown,yes,True
1,1,male,40.0,0,0,31.0,C,First,man,True,A,Cherbourg,yes,True
0,3,male,24.0,0,0,7.05,S,Third,man,True,,Southampton,no,True
1,2,female,35.0,0,0,21.0,S,Second,woman,False,,Southampton,yes,True
0,3,male,22.0,0,0,7.25,S,Third,man,True,,Southampton,no,True
0,2,male,30.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
0,3,male,,1,0,7.75,Q,Third,man,True,,Queenstown,no,False
1,1,female,31.0,1,0,113.275,C,First,woman,False,D,Cherbourg,yes,False
1,3,female,27.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
0,2,male,42.0,1,0,27.0,S,Second,man,True,,Southampton,no,False
1,1,female,32.0,0,0,76.2917,C,First,woman,False,D,Cherbourg,yes,True
0,2,male,30.0,0,0,10.5,S,Second,man,True,,Southampton,no,True
1,3,male,16.0,0,0,8.05,S,Third,man,True,,Southampton,yes,True
0,2,male,27.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
0,3,male,51.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
0,3,male,,0,0,7.8958,S,Third,man,True,,Southampton,no,True
1,1,male,38.0,1,0,90.0,S,First,man,True,C,Southampton,yes,False
0,3,male,22.0,0,0,9.35,S,Third,man,True,,Southampton,no,True
1,2,male,19.0,0,0,10.5,S,Second,man,True,,Southampton,yes,True
0,3,male,20.5,0,0,7.25,S,Third,man,True,,Southampton,no,True
0,2,male,18.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
0,3,female,,3,1,25.4667,S,Third,woman,False,,Southampton,no,False
1,1,female,35.0,1,0,83.475,S,First,woman,False,C,Southampton,yes,False
0,3,male,29.0,0,0,7.775,S,Third,man,True,,Southampton,no,True
0,2,male,59.0,0,0,13.5,S,Second,man,True,,Southampton,no,True
1,3,female,5.0,4,2,31.3875,S,Third,child,False,,Southampton,yes,False
0,2,male,24.0,0,0,10.5,S,Second,man,True,,Southampton,no,True
0,3,female,,0,0,7.55,S,Third,woman,False,,Southampton,no,True
0,2,male,44.0,1,0,26.0,S,Second,man,True,,Southampton,no,False
1,2,female,8.0,0,2,26.25,S,Second,child,False,,Southampton,yes,False
0,2,male,19.0,0,0,10.5,S,Second,man,True,,Southampton,no,True
0,2,male,33.0,0,0,12.275,S,Second,man,True,,Southampton,no,True
0,3,female,,1,0,14.4542,C,Third,woman,False,,Cherbourg,no,False
1,3,female,,1,0,15.5,Q,Third,woman,False,,Queenstown,yes,False
0,2,male,29.0,0,0,10.5,S,Second,man,True,,Southampton,no,True
0,3,male,22.0,0,0,7.125,S,Third,man,True,,Southampton,no,True
0,3,male,30.0,0,0,7.225,C,Third,man,True,,Cherbourg,no,True
0,1,male,44.0,2,0,90.0,Q,First,man,True,C,Queenstown,no,False
0,3,female,25.0,0,0,7.775,S,Third,woman,False,,Southampton,no,True
1,2,female,24.0,0,2,14.5,S,Second,woman,False,,Southampton,yes,False
1,1,male,37.0,1,1,52.5542,S,First,man,True,D,Southampton,yes,False
0,2,male,54.0,1,0,26.0,S,Second,man,True,,Southampton,no,False
0,3,male,,0,0,7.25,S,Third,man,True,,Southampton,no,True
0,3,female,29.0,1,1,10.4625,S,Third,woman,False,G,Southampton,no,False
0,1,male,62.0,0,0,26.55,S,First,man,True,C,Southampton,no,True
0,3,male,30.0,1,0,16.1,S,Third,man,True,,Southampton,no,False
0,3,female,41.0,0,2,20.2125,S,Third,woman,False,,Southampton,no,False
1,3,female,29.0,0,2,15.2458,C,Third,woman,False,,Cherbourg,yes,False
1,1,female,,0,0,79.2,C,First,woman,False,,Cherbourg,yes,True
1,1,female,30.0,0,0,86.5,S,First,woman,False,B,Southampton,yes,True
1,1,female,35.0,0,0,512.3292,C,First,woman,False,,Cherbourg,yes,True
1,2,female,50.0,0,1,26.0,S,Second,woman,False,,Southampton,yes,False
0,3,male,,0,0,7.75,Q,Third,man,True,,Queenstown,no,True
1,3,male,3.0,4,2,31.3875,S,Third,child,False,,Southampton,yes,False
0,1,male,52.0,1,1,79.65,S,First,man,True,E,Southampton,no,False
0,1,male,40.0,0,0,0.0,S,First,man,True,B,Southampton,no,True
0,3,female,,0,0,7.75,Q,Third,woman,False,,Queenstown,no,True
0,2,male,36.0,0,0,10.5,S,Second,man,True,,Southampton,no,True
0,3,male,16.0,4,1,39.6875,S,Third,man,True,,Southampton,no,False
1,3,male,25.0,1,0,7.775,S,Third,man,True,,Southampton,yes,False
1,1,female,58.0,0,1,153.4625,S,First,woman,False,C,Southampton,yes,False
1,1,female,35.0,0,0,135.6333,S,First,woman,False,C,Southampton,yes,True
0,1,male,,0,0,31.0,S,First,man,True,,Southampton,no,True
1,3,male,25.0,0,0,0.0,S,Third,man,True,,Southampton,yes,True
1,2,female,41.0,0,1,19.5,S,Second,woman,False,,Southampton,yes,False
0,1,male,37.0,0,1,29.7,C,First,man,True,C,Cherbourg,no,False
1,3,female,,0,0,7.75,Q,Third,woman,False,,Queenstown,yes,True
1,1,female,63.0,1,0,77.9583,S,First,woman,False,D,Southampton,yes,False
0,3,female,45.0,0,0,7.75,S,Third,woman,False,,Southampton,no,True
0,2,male,,0,0,0.0,S,Second,man,True,,Southampton,no,True
0,3,male,7.0,4,1,29.125,Q,Third,child,False,,Queenstown,no,False
1,3,female,35.0,1,1,20.25,S,Third,woman,False,,Southampton,yes,False
0,3,male,65.0,0,0,7.75,Q,Third,man,True,,Queenstown,no,True
0,3,male,28.0,0,0,7.8542,S,Third,man,True,,Southampton,no,True
0,3,male,16.0,0,0,9.5,S,Third,man,True,,Southampton,no,True
1,3,male,19.0,0,0,8.05,S,Third,man,True,,Southampton,yes,True
0,1,male,,0,0,26.0,S,First,man,True,A,Southampton,no,True
0,3,male,33.0,0,0,8.6625,C,Third,man,True,,Cherbourg,no,True
1,3,male,30.0,0,0,9.5,S,Third,man,True,,Southampton,yes,True
0,3,male,22.0,0,0,7.8958,S,Third,man,True,,Southampton,no,True
1,2,male,42.0,0,0,13.0,S,Second,man,True,,Southampton,yes,True
1,3,female,22.0,0,0,7.75,Q,Third,woman,False,,Queenstown,yes,True
1,1,female,26.0,0,0,78.85,S,First,woman,False,,Southampton,yes,True
1,1,female,19.0,1,0,91.0792,C,First,woman,False,B,Cherbourg,yes,False
0,2,male,36.0,0,0,12.875,C,Second,man,True,D,Cherbourg,no,True
0,3,female,24.0,0,0,8.85,S,Third,woman,False,,Southampton,no,True
0,3,male,24.0,0,0,7.8958,S,Third,man,True,,Southampton,no,True
0,1,male,,0,0,27.7208,C,First,man,True,,Cherbourg,no,True
0,3,male,23.5,0,0,7.2292,C,Third,man,True,,Cherbourg,no,True
0,1,female,2.0,1,2,151.55,S,First,child,False,C,Southampton,no,False
1,1,male,,0,0,30.5,S,First,man,True,C,Southampton,yes,True
1,1,female,50.0,0,1,247.5208,C,First,woman,False,B,Cherbourg,yes,False
1,3,female,,0,0,7.75,Q,Third,woman,False,,Queenstown,yes,True
1,3,male,,2,0,23.25,Q,Third,man,True,,Queenstown,yes,False
0,3,male,19.0,0,0,0.0,S,Third,man,True,,Southampton,no,True
1,2,female,,0,0,12.35,Q,Second,woman,False,E,Queenstown,yes,True
0,3,male,,0,0,8.05,S,Third,man,True,,Southampton,no,True
1,1,male,0.92,1,2,151.55,S,First,child,False,C,Southampton,yes,False
1,1,female,,0,0,110.8833,C,First,woman,False,,Cherbourg,yes,True
1,1,female,17.0,1,0,108.9,C,First,woman,False,C,Cherbourg,yes,False
0,2,male,30.0,1,0,24.0,C,Second,man,True,,Cherbourg,no,False
1,1,female,30.0,0,0,56.9292,C,First,woman,False,E,Cherbourg,yes,True
1,1,female,24.0,0,0,83.1583,C,First,woman,False,C,Cherbourg,yes,True
1,1,female,18.0,2,2,262.375,C,First,woman,False,B,Cherbourg,yes,False
0,2,female,26.0,1,1,26.0,S,Second,woman,False,,Southampton,no,False
0,3,male,28.0,0,0,7.8958,S,Third,man,True,,Southampton,no,True
0,2,male,43.0,1,1,26.25,S,Second,man,True,,Southampton,no,False
1,3,female,26.0,0,0,7.8542,S,Third,woman,False,,Southampton,yes,True
1,2,female,24.0,1,0,26.0,S,Second,woman,False,,Southampton,yes,False
0,2,male,54.0,0,0,14.0,S,Second,man,True,,Southampton,no,True
1,1,female,31.0,0,2,164.8667,S,First,woman,False,C,Southampton,yes,False
1,1,female,40.0,1,1,134.5,C,First,woman,False,E,Cherbourg,yes,False
0,3,male,22.0,0,0,7.25,S,Third,man,True,,Southampton,no,True
0,3,male,27.0,0,0,7.8958,S,Third,man,True,,Southampton,no,True
1,2,female,30.0,0,0,12.35,Q,Second,woman,False,,Queenstown,yes,True
1,2,female,22.0,1,1,29.0,S,Second,woman,False,,Southampton,yes,False
0,3,male,,8,2,69.55,S,Third,man,True,,Southampton,no,False
1,1,female,36.0,0,0,135.6333,C,First,woman,False,C,Cherbourg,yes,True
0,3,male,61.0,0,0,6.2375,S,Third,man,True,,Southampton,no,True
1,2,female,36.0,0,0,13.0,S,Second,woman,False,D,Southampton,yes,True
1,3,female,31.0,1,1,20.525,S,Third,woman,False,,Southampton,yes,False
1,1,female,16.0,0,1,57.9792,C,First,woman,False,B,Cherbourg,yes,False
1,3,female,,2,0,23.25,Q,Third,woman,False,,Queenstown,yes,False
0,1,male,45.5,0,0,28.5,S,First,man,True,C,Southampton,no,True
0,1,male,38.0,0,1,153.4625,S,First,man,True,C,Southampton,no,False
0,3,male,16.0,2,0,18.0,S,Third,man,True,,Southampton,no,False
1,1,female,,1,0,133.65,S,First,woman,False,,Southampton,yes,False
0,3,male,,0,0,7.8958,S,Third,man,True,,Southampton,no,True
0,1,male,29.0,1,0,66.6,S,First,man,True,C,Southampton,no,False
1,1,female,41.0,0,0,134.5,C,First,woman,False,E,Cherbourg,yes,True
1,3,male,45.0,0,0,8.05,S,Third,man,True,,Southampton,yes,True
0,1,male,45.0,0,0,35.5,S,First,man,True,,Southampton,no,True
1,2,male,2.0,1,1,26.0,S,Second,child,False,F,Southampton,yes,False
1,1,female,24.0,3,2,263.0,S,First,woman,False,C,Southampton,yes,False
0,2,male,28.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
0,2,male,25.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
0,2,male,36.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
1,2,female,24.0,0,0,13.0,S,Second,woman,False,F,Southampton,yes,True
1,2,female,40.0,0,0,13.0,S,Second,woman,False,,Southampton,yes,True
1,3,female,,1,0,16.1,S,Third,woman,False,,Southampton,yes,False
1,3,male,3.0,1,1,15.9,S,Third,child,False,,Southampton,yes,False
0,3,male,42.0,0,0,8.6625,S,Third,man,True,,Southampton,no,True
0,3,male,23.0,0,0,9.225,S,Third,man,True,,Southampton,no,True
0,1,male,,0,0,35.0,S,First,man,True,C,Southampton,no,True
0,3,male,15.0,1,1,7.2292,C,Third,child,False,,Cherbourg,no,False
0,3,male,25.0,1,0,17.8,S,Third,man,True,,Southampton,no,False
0,3,male,,0,0,7.225,C,Third,man,True,,Cherbourg,no,True
0,3,male,28.0,0,0,9.5,S,Third,man,True,,Southampton,no,True
1,1,female,22.0,0,1,55.0,S,First,woman,False,E,Southampton,yes,False
0,2,female,38.0,0,0,13.0,S,Second,woman,False,,Southampton,no,True
1,3,female,,0,0,7.8792,Q,Third,woman,False,,Queenstown,yes,True
1,3,female,,0,0,7.8792,Q,Third,woman,False,,Queenstown,yes,True
0,3,male,40.0,1,4,27.9,S,Third,man,True,,Southampton,no,False
0,2,male,29.0,1,0,27.7208,C,Second,man,True,,Cherbourg,no,False
0,3,female,45.0,0,1,14.4542,C,Third,woman,False,,Cherbourg,no,False
0,3,male,35.0,0,0,7.05,S,Third,man,True,,Southampton,no,True
0,3,male,,1,0,15.5,Q,Third,man,True,,Queenstown,no,False
0,3,male,30.0,0,0,7.25,S,Third,man,True,,Southampton,no,True
1,1,female,60.0,1,0,75.25,C,First,woman,False,D,Cherbourg,yes,False
1,3,female,,0,0,7.2292,C,Third,woman,False,,Cherbourg,yes,True
1,3,female,,0,0,7.75,Q,Third,woman,False,,Queenstown,yes,True
1,1,female,24.0,0,0,69.3,C,First,woman,False,B,Cherbourg,yes,True
1,1,male,25.0,1,0,55.4417,C,First,man,True,E,Cherbourg,yes,False
0,3,male,18.0,1,0,6.4958,S,Third,man,True,,Southampton,no,False
0,3,male,19.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
0,1,male,22.0,0,0,135.6333,C,First,man,True,,Cherbourg,no,True
0,3,female,3.0,3,1,21.075,S,Third,child,False,,Southampton,no,False
1,1,female,,1,0,82.1708,C,First,woman,False,,Cherbourg,yes,False
1,3,female,22.0,0,0,7.25,S,Third,woman,False,,Southampton,yes,True
0,1,male,27.0,0,2,211.5,C,First,man,True,C,Cherbourg,no,False
0,3,male,20.0,0,0,4.0125,C,Third,man,True,,Cherbourg,no,True
0,3,male,19.0,0,0,7.775,S,Third,man,True,,Southampton,no,True
1,1,female,42.0,0,0,227.525,C,First,woman,False,,Cherbourg,yes,True
1,3,female,1.0,0,2,15.7417,C,Third,child,False,,Cherbourg,yes,False
0,3,male,32.0,0,0,7.925,S,Third,man,True,,Southampton,no,True
1,1,female,35.0,1,0,52.0,S,First,woman,False,,Southampton,yes,False
0,3,male,,0,0,7.8958,S,Third,man,True,,Southampton,no,True
0,2,male,18.0,0,0,73.5,S,Second,man,True,,Southampton,no,True
0,3,male,1.0,5,2,46.9,S,Third,child,False,,Southampton,no,False
1,2,female,36.0,0,0,13.0,S,Second,woman,False,,Southampton,yes,True
0,3,male,,0,0,7.7292,Q,Third,man,True,,Queenstown,no,True
1,2,female,17.0,0,0,12.0,C,Second,woman,False,,Cherbourg,yes,True
1,1,male,36.0,1,2,120.0,S,First,man,True,B,Southampton,yes,False
1,3,male,21.0,0,0,7.7958,S,Third,man,True,,Southampton,yes,True
0,3,male,28.0,2,0,7.925,S,Third,man,True,,Southampton,no,False
1,1,female,23.0,1,0,113.275,C,First,woman,False,D,Cherbourg,yes,False
1,3,female,24.0,0,2,16.7,S,Third,woman,False,G,Southampton,yes,False
0,3,male,22.0,0,0,7.7958,S,Third,man,True,,Southampton,no,True
0,3,female,31.0,0,0,7.8542,S,Third,woman,False,,Southampton,no,True
0,2,male,46.0,0,0,26.0,S,Second,man,True,,Southampton,no,True
0,2,male,23.0,0,0,10.5,S,Second,man,True,,Southampton,no,True
1,2,female,28.0,0,0,12.65,S,Second,woman,False,,Southampton,yes,True
1,3,male,39.0,0,0,7.925,S,Third,man,True,,Southampton,yes,True
0,3,male,26.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
0,3,female,21.0,1,0,9.825,S,Third,woman,False,,Southampton,no,False
0,3,male,28.0,1,0,15.85,S,Third,man,True,,Southampton,no,False
0,3,female,20.0,0,0,8.6625,S,Third,woman,False,,Southampton,no,True
0,2,male,34.0,1,0,21.0,S,Second,man,True,,Southampton,no,False
0,3,male,51.0,0,0,7.75,S,Third,man,True,,Southampton,no,True
1,2,male,3.0,1,1,18.75,S,Second,child,False,,Southampton,yes,False
0,3,male,21.0,0,0,7.775,S,Third,man,True,,Southampton,no,True
0,3,female,,3,1,25.4667,S,Third,woman,False,,Southampton,no,False
0,3,male,,0,0,7.8958,S,Third,man,True,,Southampton,no,True
0,3,male,,0,0,6.8583,Q,Third,man,True,,Queenstown,no,True
1,1,female,33.0,1,0,90.0,Q,First,woman,False,C,Queenstown,yes,False
0,2,male,,0,0,0.0,S,Second,man,True,,Southampton,no,True
1,3,male,44.0,0,0,7.925,S,Third,man,True,,Southampton,yes,True
0,3,female,,0,0,8.05,S,Third,woman,False,,Southampton,no,True
1,2,female,34.0,1,1,32.5,S,Second,woman,False,,Southampton,yes,False
1,2,female,18.0,0,2,13.0,S,Second,woman,False,,Southampton,yes,False
0,2,male,30.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
0,3,female,10.0,0,2,24.15,S,Third,child,False,,Southampton,no,False
0,3,male,,0,0,7.8958,C,Third,man,True,,Cherbourg,no,True
0,3,male,21.0,0,0,7.7333,Q,Third,man,True,,Queenstown,no,True
0,3,male,29.0,0,0,7.875,S,Third,man,True,,Southampton,no,True
0,3,female,28.0,1,1,14.4,S,Third,woman,False,,Southampton,no,False
0,3,male,18.0,1,1,20.2125,S,Third,man,True,,Southampton,no,False
0,3,male,,0,0,7.25,S,Third,man,True,,Southampton,no,True
1,2,female,28.0,1,0,26.0,S,Second,woman,False,,Southampton,yes,False
1,2,female,19.0,0,0,26.0,S,Second,woman,False,,Southampton,yes,True
0,3,male,,0,0,7.75,Q,Third,man,True,,Queenstown,no,True
1,3,male,32.0,0,0,8.05,S,Third,man,True,E,Southampton,yes,True
1,1,male,28.0,0,0,26.55,S,First,man,True,C,Southampton,yes,True
1,3,female,,1,0,16.1,S,Third,woman,False,,Southampton,yes,False
1,2,female,42.0,1,0,26.0,S,Second,woman,False,,Southampton,yes,False
0,3,male,17.0,0,0,7.125,S,Third,man,True,,Southampton,no,True
0,1,male,50.0,1,0,55.9,S,First,man,True,E,Southampton,no,False
1,1,female,14.0,1,2,120.0,S,First,child,False,B,Southampton,yes,False
0,3,female,21.0,2,2,34.375,S,Third,woman,False,,Southampton,no,False
1,2,female,24.0,2,3,18.75,S,Second,woman,False,,Southampton,yes,False
0,1,male,64.0,1,4,263.0,S,First,man,True,C,Southampton,no,False
0,2,male,31.0,0,0,10.5,S,Second,man,True,,Southampton,no,True
1,2,female,45.0,1,1,26.25,S,Second,woman,False,,Southampton,yes,False
0,3,male,20.0,0,0,9.5,S,Third,man,True,,Southampton,no,True
0,3,male,25.0,1,0,7.775,S,Third,man,True,,Southampton,no,False
1,2,female,28.0,0,0,13.0,S,Second,woman,False,,Southampton,yes,True
1,3,male,,0,0,8.1125,S,Third,man,True,,Southampton,yes,True
1,1,male,4.0,0,2,81.8583,S,First,child,False,A,Southampton,yes,False
1,2,female,13.0,0,1,19.5,S,Second,child,False,,Southampton,yes,False
1,1,male,34.0,0,0,26.55,S,First,man,True,,Southampton,yes,True
1,3,female,5.0,2,1,19.2583,C,Third,child,False,,Cherbourg,yes,False
1,1,male,52.0,0,0,30.5,S,First,man,True,C,Southampton,yes,True
0,2,male,36.0,1,2,27.75,S,Second,man,True,,Southampton,no,False
0,3,male,,1,0,19.9667,S,Third,man,True,,Southampton,no,False
0,1,male,30.0,0,0,27.75,C,First,man,True,C,Cherbourg,no,True
1,1,male,49.0,1,0,89.1042,C,First,man,True,C,Cherbourg,yes,False
0,3,male,,0,0,8.05,S,Third,man,True,,Southampton,no,True
1,3,male,29.0,0,0,7.8958,C,Third,man,True,,Cherbourg,yes,True
0,1,male,65.0,0,0,26.55,S,First,man,True,E,Southampton,no,True
1,1,female,,1,0,51.8625,S,First,woman,False,D,Southampton,yes,False
1,2,female,50.0,0,0,10.5,S,Second,woman,False,,Southampton,yes,True
0,3,male,,0,0,7.75,Q,Third,man,True,,Queenstown,no,True
1,1,male,48.0,0,0,26.55,S,First,man,True,E,Southampton,yes,True
0,3,male,34.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
0,1,male,47.0,0,0,38.5,S,First,man,True,E,Southampton,no,True
0,2,male,48.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
0,3,male,,0,0,8.05,S,Third,man,True,,Southampton,no,True
0,3,male,38.0,0,0,7.05,S,Third,man,True,,Southampton,no,True
0,2,male,,0,0,0.0,S,Second,man,True,,Southampton,no,True
0,1,male,56.0,0,0,26.55,S,First,man,True,,Southampton,no,True
0,3,male,,0,0,7.725,Q,Third,man,True,,Queenstown,no,True
1,3,female,0.75,2,1,19.2583,C,Third,child,False,,Cherbourg,yes,False
0,3,male,,0,0,7.25,S,Third,man,True,,Southampton,no,True
0,3,male,38.0,0,0,8.6625,S,Third,man,True,,Southampton,no,True
1,2,female,33.0,1,2,27.75,S,Second,woman,False,,Southampton,yes,False
1,2,female,23.0,0,0,13.7917,C,Second,woman,False,D,Cherbourg,yes,True
0,3,female,22.0,0,0,9.8375,S,Third,woman,False,,Southampton,no,True
0,1,male,,0,0,52.0,S,First,man,True,A,Southampton,no,True
0,2,male,34.0,1,0,21.0,S,Second,man,True,,Southampton,no,False
0,3,male,29.0,1,0,7.0458,S,Third,man,True,,Southampton,no,False
0,3,male,22.0,0,0,7.5208,S,Third,man,True,,Southampton,no,True
1,3,female,2.0,0,1,12.2875,S,Third,child,False,,Southampton,yes,False
0,3,male,9.0,5,2,46.9,S,Third,child,False,,Southampton,no,False
0,2,male,,0,0,0.0,S,Second,man,True,,Southampton,no,True
0,3,male,50.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
1,3,female,63.0,0,0,9.5875,S,Third,woman,False,,Southampton,yes,True
1,1,male,25.0,1,0,91.0792,C,First,man,True,B,Cherbourg,yes,False
0,3,female,,3,1,25.4667,S,Third,woman,False,,Southampton,no,False
1,1,female,35.0,1,0,90.0,S,First,woman,False,C,Southampton,yes,False
0,1,male,58.0,0,0,29.7,C,First,man,True,B,Cherbourg,no,True
0,3,male,30.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
1,3,male,9.0,1,1,15.9,S,Third,child,False,,Southampton,yes,False
0,3,male,,1,0,19.9667,S,Third,man,True,,Southampton,no,False
0,3,male,21.0,0,0,7.25,S,Third,man,True,,Southampton,no,True
0,1,male,55.0,0,0,30.5,S,First,man,True,C,Southampton,no,True
0,1,male,71.0,0,0,49.5042,C,First,man,True,,Cherbourg,no,True
0,3,male,21.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
0,3,male,,0,0,14.4583,C,Third,man,True,,Cherbourg,no,True
1,1,female,54.0,1,0,78.2667,C,First,woman,False,D,Cherbourg,yes,False
0,3,male,,0,0,15.1,S,Third,man,True,,Southampton,no,True
0,1,female,25.0,1,2,151.55,S,First,woman,False,C,Southampton,no,False
0,3,male,24.0,0,0,7.7958,S,Third,man,True,,Southampton,no,True
0,3,male,17.0,0,0,8.6625,S,Third,man,True,,Southampton,no,True
0,3,female,21.0,0,0,7.75,Q,Third,woman,False,,Queenstown,no,True
0,3,female,,0,0,7.6292,Q,Third,woman,False,,Queenstown,no,True
0,3,female,37.0,0,0,9.5875,S,Third,woman,False,,Southampton,no,True
1,1,female,16.0,0,0,86.5,S,First,woman,False,B,Southampton,yes,True
0,1,male,18.0,1,0,108.9,C,First,man,True,C,Cherbourg,no,False
1,2,female,33.0,0,2,26.0,S,Second,woman,False,,Southampton,yes,False
1,1,male,,0,0,26.55,S,First,man,True,,Southampton,yes,True
0,3,male,28.0,0,0,22.525,S,Third,man,True,,Southampton,no,True
1,3,male,26.0,0,0,56.4958,S,Third,man,True,,Southampton,yes,True
1,3,male,29.0,0,0,7.75,Q,Third,man,True,,Queenstown,yes,True
0,3,male,,0,0,8.05,S,Third,man,True,,Southampton,no,True
1,1,male,36.0,0,0,26.2875,S,First,man,True,E,Southampton,yes,True
1,1,female,54.0,1,0,59.4,C,First,woman,False,,Cherbourg,yes,False
0,3,male,24.0,0,0,7.4958,S,Third,man,True,,Southampton,no,True
0,1,male,47.0,0,0,34.0208,S,First,man,True,D,Southampton,no,True
1,2,female,34.0,0,0,10.5,S,Second,woman,False,F,Southampton,yes,True
0,3,male,,0,0,24.15,Q,Third,man,True,,Queenstown,no,True
1,2,female,36.0,1,0,26.0,S,Second,woman,False,,Southampton,yes,False
0,3,male,32.0,0,0,7.8958,S,Third,man,True,,Southampton,no,True
1,1,female,30.0,0,0,93.5,S,First,woman,False,B,Southampton,yes,True
0,3,male,22.0,0,0,7.8958,S,Third,man,True,,Southampton,no,True
0,3,male,,0,0,7.225,C,Third,man,True,,Cherbourg,no,True
1,1,female,44.0,0,1,57.9792,C,First,woman,False,B,Cherbourg,yes,False
0,3,male,,0,0,7.2292,C,Third,man,True,,Cherbourg,no,True
0,3,male,40.5,0,0,7.75,Q,Third,man,True,,Queenstown,no,True
1,2,female,50.0,0,0,10.5,S,Second,woman,False,,Southampton,yes,True
0,1,male,,0,0,221.7792,S,First,man,True,C,Southampton,no,True
0,3,male,39.0,0,0,7.925,S,Third,man,True,,Southampton,no,True
0,2,male,23.0,2,1,11.5,S,Second,man,True,,Southampton,no,False
1,2,female,2.0,1,1,26.0,S,Second,child,False,,Southampton,yes,False
0,3,male,,0,0,7.2292,C,Third,man,True,,Cherbourg,no,True
0,3,male,17.0,1,1,7.2292,C,Third,man,True,,Cherbourg,no,False
1,3,female,,0,2,22.3583,C,Third,woman,False,,Cherbourg,yes,False
0,3,female,30.0,0,0,8.6625,S,Third,woman,False,,Southampton,no,True
1,2,female,7.0,0,2,26.25,S,Second,child,False,,Southampton,yes,False
0,1,male,45.0,0,0,26.55,S,First,man,True,B,Southampton,no,True
1,1,female,30.0,0,0,106.425,C,First,woman,False,,Cherbourg,yes,True
0,3,male,,0,0,14.5,S,Third,man,True,,Southampton,no,True
1,1,female,22.0,0,2,49.5,C,First,woman,False,B,Cherbourg,yes,False
1,1,female,36.0,0,2,71.0,S,First,woman,False,B,Southampton,yes,False
0,3,female,9.0,4,2,31.275,S,Third,child,False,,Southampton,no,False
0,3,female,11.0,4,2,31.275,S,Third,child,False,,Southampton,no,False
1,2,male,32.0,1,0,26.0,S,Second,man,True,,Southampton,yes,False
0,1,male,50.0,1,0,106.425,C,First,man,True,C,Cherbourg,no,False
0,1,male,64.0,0,0,26.0,S,First,man,True,,Southampton,no,True
1,2,female,19.0,1,0,26.0,S,Second,woman,False,,Southampton,yes,False
1,2,male,,0,0,13.8625,C,Second,man,True,,Cherbourg,yes,True
0,3,male,33.0,1,1,20.525,S,Third,man,True,,Southampton,no,False
1,2,male,8.0,1,1,36.75,S,Second,child,False,,Southampton,yes,False
1,1,male,17.0,0,2,110.8833,C,First,man,True,C,Cherbourg,yes,False
0,2,male,27.0,0,0,26.0,S,Second,man,True,,Southampton,no,True
0,3,male,,0,0,7.8292,Q,Third,man,True,,Queenstown,no,True
1,3,male,22.0,0,0,7.225,C,Third,man,True,,Cherbourg,yes,True
1,3,female,22.0,0,0,7.775,S,Third,woman,False,,Southampton,yes,True
0,1,male,62.0,0,0,26.55,S,First,man,True,,Southampton,no,True
1,1,female,48.0,1,0,39.6,C,First,woman,False,A,Cherbourg,yes,False
0,1,male,,0,0,227.525,C,First,man,True,,Cherbourg,no,True
1,1,female,39.0,1,1,79.65,S,First,woman,False,E,Southampton,yes,False
1,3,female,36.0,1,0,17.4,S,Third,woman,False,,Southampton,yes,False
0,3,male,,0,0,7.75,Q,Third,man,True,,Queenstown,no,True
0,3,male,40.0,0,0,7.8958,S,Third,man,True,,Southampton,no,True
0,2,male,28.0,0,0,13.5,S,Second,man,True,,Southampton,no,True
0,3,male,,0,0,8.05,S,Third,man,True,,Southampton,no,True
0,3,female,,0,0,8.05,S,Third,woman,False,,Southampton,no,True
0,3,male,24.0,2,0,24.15,S,Third,man,True,,Southampton,no,False
0,3,male,19.0,0,0,7.8958,S,Third,man,True,,Southampton,no,True
0,3,female,29.0,0,4,21.075,S,Third,woman,False,,Southampton,no,False
0,3,male,,0,0,7.2292,C,Third,man,True,,Cherbourg,no,True
1,3,male,32.0,0,0,7.8542,S,Third,man,True,,Southampton,yes,True
1,2,male,62.0,0,0,10.5,S,Second,man,True,,Southampton,yes,True
1,1,female,53.0,2,0,51.4792,S,First,woman,False,C,Southampton,yes,False
1,1,male,36.0,0,0,26.3875,S,First,man,True,E,Southampton,yes,True
1,3,female,,0,0,7.75,Q,Third,woman,False,,Queenstown,yes,True
0,3,male,16.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
0,3,male,19.0,0,0,14.5,S,Third,man,True,,Southampton,no,True
1,2,female,34.0,0,0,13.0,S,Second,woman,False,,Southampton,yes,True
1,1,female,39.0,1,0,55.9,S,First,woman,False,E,Southampton,yes,False
0,3,female,,1,0,14.4583,C,Third,woman,False,,Cherbourg,no,False
1,3,male,32.0,0,0,7.925,S,Third,man,True,,Southampton,yes,True
1,2,female,25.0,1,1,30.0,S,Second,woman,False,,Southampton,yes,False
1,1,female,39.0,1,1,110.8833,C,First,woman,False,C,Cherbourg,yes,False
0,2,male,54.0,0,0,26.0,S,Second,man,True,,Southampton,no,True
0,1,male,36.0,0,0,40.125,C,First,man,True,A,Cherbourg,no,True
0,3,male,,0,0,8.7125,C,Third,man,True,,Cherbourg,no,True
1,1,female,18.0,0,2,79.65,S,First,woman,False,E,Southampton,yes,False
0,2,male,47.0,0,0,15.0,S,Second,man,True,,Southampton,no,True
1,1,male,60.0,1,1,79.2,C,First,man,True,B,Cherbourg,yes,False
0,3,male,22.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
0,3,male,,0,0,8.05,S,Third,man,True,,Southampton,no,True
0,3,male,35.0,0,0,7.125,S,Third,man,True,,Southampton,no,True
1,1,female,52.0,1,0,78.2667,C,First,woman,False,D,Cherbourg,yes,False
0,3,male,47.0,0,0,7.25,S,Third,man,True,,Southampton,no,True
0,3,female,,0,2,7.75,Q,Third,woman,False,,Queenstown,no,False
0,2,male,37.0,1,0,26.0,S,Second,man,True,,Southampton,no,False
0,3,male,36.0,1,1,24.15,S,Third,man,True,,Southampton,no,False
1,2,female,,0,0,33.0,S,Second,woman,False,,Southampton,yes,True
0,3,male,49.0,0,0,0.0,S,Third,man,True,,Southampton,no,True
0,3,male,,0,0,7.225,C,Third,man,True,,Cherbourg,no,True
1,1,male,49.0,1,0,56.9292,C,First,man,True,A,Cherbourg,yes,False
1,2,female,24.0,2,1,27.0,S,Second,woman,False,,Southampton,yes,False
0,3,male,,0,0,7.8958,S,Third,man,True,,Southampton,no,True
0,1,male,,0,0,42.4,S,First,man,True,,Southampton,no,True
0,3,male,44.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
1,1,male,35.0,0,0,26.55,C,First,man,True,,Cherbourg,yes,True
0,3,male,36.0,1,0,15.55,S,Third,man,True,,Southampton,no,False
0,3,male,30.0,0,0,7.8958,S,Third,man,True,,Southampton,no,True
1,1,male,27.0,0,0,30.5,S,First,man,True,,Southampton,yes,True
1,2,female,22.0,1,2,41.5792,C,Second,woman,False,,Cherbourg,yes,False
1,1,female,40.0,0,0,153.4625,S,First,woman,False,C,Southampton,yes,True
0,3,female,39.0,1,5,31.275,S,Third,woman,False,,Southampton,no,False
0,3,male,,0,0,7.05,S,Third,man,True,,Southampton,no,True
1,3,female,,1,0,15.5,Q,Third,woman,False,,Queenstown,yes,False
0,3,male,,0,0,7.75,Q,Third,man,True,,Queenstown,no,True
0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
1,2,female,24.0,1,2,65.0,S,Second,woman,False,,Southampton,yes,False
0,3,male,34.0,1,1,14.4,S,Third,man,True,,Southampton,no,False
0,3,female,26.0,1,0,16.1,S,Third,woman,False,,Southampton,no,False
1,2,female,4.0,2,1,39.0,S,Second,child,False,F,Southampton,yes,False
0,2,male,26.0,0,0,10.5,S,Second,man,True,,Southampton,no,True
0,3,male,27.0,1,0,14.4542,C,Third,man,True,,Cherbourg,no,False
1,1,male,42.0,1,0,52.5542,S,First,man,True,D,Southampton,yes,False
1,3,male,20.0,1,1,15.7417,C,Third,man,True,,Cherbourg,yes,False
0,3,male,21.0,0,0,7.8542,S,Third,man,True,,Southampton,no,True
0,3,male,21.0,0,0,16.1,S,Third,man,True,,Southampton,no,True
0,1,male,61.0,0,0,32.3208,S,First,man,True,D,Southampton,no,True
0,2,male,57.0,0,0,12.35,Q,Second,man,True,,Queenstown,no,True
1,1,female,21.0,0,0,77.9583,S,First,woman,False,D,Southampton,yes,True
0,3,male,26.0,0,0,7.8958,S,Third,man,True,,Southampton,no,True
0,3,male,,0,0,7.7333,Q,Third,man,True,,Queenstown,no,True
1,1,male,80.0,0,0,30.0,S,First,man,True,A,Southampton,yes,True
0,3,male,51.0,0,0,7.0542,S,Third,man,True,,Southampton,no,True
1,1,male,32.0,0,0,30.5,C,First,man,True,B,Cherbourg,yes,True
0,1,male,,0,0,0.0,S,First,man,True,,Southampton,no,True
0,3,female,9.0,3,2,27.9,S,Third,child,False,,Southampton,no,False
1,2,female,28.0,0,0,13.0,S,Second,woman,False,,Southampton,yes,True
0,3,male,32.0,0,0,7.925,S,Third,man,True,,Southampton,no,True
0,2,male,31.0,1,1,26.25,S,Second,man,True,,Southampton,no,False
0,3,female,41.0,0,5,39.6875,S,Third,woman,False,,Southampton,no,False
0,3,male,,1,0,16.1,S,Third,man,True,,Southampton,no,False
0,3,male,20.0,0,0,7.8542,S,Third,man,True,,Southampton,no,True
1,1,female,24.0,0,0,69.3,C,First,woman,False,B,Cherbourg,yes,True
0,3,female,2.0,3,2,27.9,S,Third,child,False,,Southampton,no,False
1,3,male,,0,0,56.4958,S,Third,man,True,,Southampton,yes,True
1,3,female,0.75,2,1,19.2583,C,Third,child,False,,Cherbourg,yes,False
1,1,male,48.0,1,0,76.7292,C,First,man,True,D,Cherbourg,yes,False
0,3,male,19.0,0,0,7.8958,S,Third,man,True,,Southampton,no,True
1,1,male,56.0,0,0,35.5,C,First,man,True,A,Cherbourg,yes,True
0,3,male,,0,0,7.55,S,Third,man,True,,Southampton,no,True
1,3,female,23.0,0,0,7.55,S,Third,woman,False,,Southampton,yes,True
0,3,male,,0,0,7.8958,S,Third,man,True,,Southampton,no,True
1,2,female,18.0,0,1,23.0,S,Second,woman,False,,Southampton,yes,False
0,3,male,21.0,0,0,8.4333,S,Third,man,True,,Southampton,no,True
1,3,female,,0,0,7.8292,Q,Third,woman,False,,Queenstown,yes,True
0,3,female,18.0,0,0,6.75,Q,Third,woman,False,,Queenstown,no,True
0,2,male,24.0,2,0,73.5,S,Second,man,True,,Southampton,no,False
0,3,male,,0,0,7.8958,S,Third,man,True,,Southampton,no,True
0,3,female,32.0,1,1,15.5,Q,Third,woman,False,,Queenstown,no,False
0,2,male,23.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
0,1,male,58.0,0,2,113.275,C,First,man,True,D,Cherbourg,no,False
1,1,male,50.0,2,0,133.65,S,First,man,True,,Southampton,yes,False
0,3,male,40.0,0,0,7.225,C,Third,man,True,,Cherbourg,no,True
0,1,male,47.0,0,0,25.5875,S,First,man,True,E,Southampton,no,True
0,3,male,36.0,0,0,7.4958,S,Third,man,True,,Southampton,no,True
1,3,male,20.0,1,0,7.925,S,Third,man,True,,Southampton,yes,False
0,2,male,32.0,2,0,73.5,S,Second,man,True,,Southampton,no,False
0,2,male,25.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
0,3,male,,0,0,7.775,S,Third,man,True,,Southampton,no,True
0,3,male,43.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
1,1,female,,1,0,52.0,S,First,woman,False,C,Southampton,yes,False
1,2,female,40.0,1,1,39.0,S,Second,woman,False,,Southampton,yes,False
0,1,male,31.0,1,0,52.0,S,First,man,True,B,Southampton,no,False
0,2,male,70.0,0,0,10.5,S,Second,man,True,,Southampton,no,True
1,2,male,31.0,0,0,13.0,S,Second,man,True,,Southampton,yes,True
0,2,male,,0,0,0.0,S,Second,man,True,,Southampton,no,True
0,3,male,18.0,0,0,7.775,S,Third,man,True,,Southampton,no,True
0,3,male,24.5,0,0,8.05,S,Third,man,True,,Southampton,no,True
1,3,female,18.0,0,0,9.8417,S,Third,woman,False,,Southampton,yes,True
0,3,female,43.0,1,6,46.9,S,Third,woman,False,,Southampton,no,False
1,1,male,36.0,0,1,512.3292,C,First,man,True,B,Cherbourg,yes,False
0,3,female,,0,0,8.1375,Q,Third,woman,False,,Queenstown,no,True
1,1,male,27.0,0,0,76.7292,C,First,man,True,D,Cherbourg,yes,True
0,3,male,20.0,0,0,9.225,S,Third,man,True,,Southampton,no,True
0,3,male,14.0,5,2,46.9,S,Third,child,False,,Southampton,no,False
0,2,male,60.0,1,1,39.0,S,Second,man,True,,Southampton,no,False
0,2,male,25.0,1,2,41.5792,C,Second,man,True,,Cherbourg,no,False
0,3,male,14.0,4,1,39.6875,S,Third,child,False,,Southampton,no,False
0,3,male,19.0,0,0,10.1708,S,Third,man,True,,Southampton,no,True
0,3,male,18.0,0,0,7.7958,S,Third,man,True,,Southampton,no,True
1,1,female,15.0,0,1,211.3375,S,First,child,False,B,Southampton,yes,False
1,1,male,31.0,1,0,57.0,S,First,man,True,B,Southampton,yes,False
1,3,female,4.0,0,1,13.4167,C,Third,child,False,,Cherbourg,yes,False
1,3,male,,0,0,56.4958,S,Third,man,True,,Southampton,yes,True
0,3,male,25.0,0,0,7.225,C,Third,man,True,,Cherbourg,no,True
0,1,male,60.0,0,0,26.55,S,First,man,True,,Southampton,no,True
0,2,male,52.0,0,0,13.5,S,Second,man,True,,Southampton,no,True
0,3,male,44.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
1,3,female,,0,0,7.7333,Q,Third,woman,False,,Queenstown,yes,True
0,1,male,49.0,1,1,110.8833,C,First,man,True,C,Cherbourg,no,False
0,3,male,42.0,0,0,7.65,S,Third,man,True,F,Southampton,no,True
1,1,female,18.0,1,0,227.525,C,First,woman,False,C,Cherbourg,yes,False
1,1,male,35.0,0,0,26.2875,S,First,man,True,E,Southampton,yes,True
0,3,female,18.0,0,1,14.4542,C,Third,woman,False,,Cherbourg,no,False
0,3,male,25.0,0,0,7.7417,Q,Third,man,True,,Queenstown,no,True
0,3,male,26.0,1,0,7.8542,S,Third,man,True,,Southampton,no,False
0,2,male,39.0,0,0,26.0,S,Second,man,True,,Southampton,no,True
1,2,female,45.0,0,0,13.5,S,Second,woman,False,,Southampton,yes,True
1,1,male,42.0,0,0,26.2875,S,First,man,True,E,Southampton,yes,True
1,1,female,22.0,0,0,151.55,S,First,woman,False,,Southampton,yes,True
1,3,male,,1,1,15.2458,C,Third,man,True,,Cherbourg,yes,False
1,1,female,24.0,0,0,49.5042,C,First,woman,False,C,Cherbourg,yes,True
0,1,male,,0,0,26.55,S,First,man,True,C,Southampton,no,True
1,1,male,48.0,1,0,52.0,S,First,man,True,C,Southampton,yes,False
0,3,male,29.0,0,0,9.4833,S,Third,man,True,,Southampton,no,True
0,2,male,52.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
0,3,male,19.0,0,0,7.65,S,Third,man,True,F,Southampton,no,True
1,1,female,38.0,0,0,227.525,C,First,woman,False,C,Cherbourg,yes,True
1,2,female,27.0,0,0,10.5,S,Second,woman,False,E,Southampton,yes,True
0,3,male,,0,0,15.5,Q,Third,man,True,,Queenstown,no,True
0,3,male,33.0,0,0,7.775,S,Third,man,True,,Southampton,no,True
1,2,female,6.0,0,1,33.0,S,Second,child,False,,Southampton,yes,False
0,3,male,17.0,1,0,7.0542,S,Third,man,True,,Southampton,no,False
0,2,male,34.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
0,2,male,50.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
1,1,male,27.0,1,0,53.1,S,First,man,True,E,Southampton,yes,False
0,3,male,20.0,0,0,8.6625,S,Third,man,True,,Southampton,no,True
1,2,female,30.0,3,0,21.0,S,Second,woman,False,,Southampton,yes,False
1,3,female,,0,0,7.7375,Q,Third,woman,False,,Queenstown,yes,True
0,2,male,25.0,1,0,26.0,S,Second,man,True,,Southampton,no,False
0,3,female,25.0,1,0,7.925,S,Third,woman,False,,Southampton,no,False
1,1,female,29.0,0,0,211.3375,S,First,woman,False,B,Southampton,yes,True
0,3,male,11.0,0,0,18.7875,C,Third,child,False,,Cherbourg,no,True
0,2,male,,0,0,0.0,S,Second,man,True,,Southampton,no,True
0,2,male,23.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
0,2,male,23.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
0,3,male,28.5,0,0,16.1,S,Third,man,True,,Southampton,no,True
0,3,female,48.0,1,3,34.375,S,Third,woman,False,,Southampton,no,False
1,1,male,35.0,0,0,512.3292,C,First,man,True,B,Cherbourg,yes,True
0,3,male,,0,0,7.8958,S,Third,man,True,,Southampton,no,True
0,3,male,,0,0,7.8958,S,Third,man,True,,Southampton,no,True
1,1,male,,0,0,30.0,S,First,man,True,D,Southampton,yes,True
0,1,male,36.0,1,0,78.85,S,First,man,True,C,Southampton,no,False
1,1,female,21.0,2,2,262.375,C,First,woman,False,B,Cherbourg,yes,False
0,3,male,24.0,1,0,16.1,S,Third,man,True,,Southampton,no,False
1,3,male,31.0,0,0,7.925,S,Third,man,True,,Southampton,yes,True
0,1,male,70.0,1,1,71.0,S,First,man,True,B,Southampton,no,False
0,3,male,16.0,1,1,20.25,S,Third,man,True,,Southampton,no,False
1,2,female,30.0,0,0,13.0,S,Second,woman,False,,Southampton,yes,True
0,1,male,19.0,1,0,53.1,S,First,man,True,D,Southampton,no,False
0,3,male,31.0,0,0,7.75,Q,Third,man,True,,Queenstown,no,True
1,2,female,4.0,1,1,23.0,S,Second,child,False,,Southampton,yes,False
1,3,male,6.0,0,1,12.475,S,Third,child,False,E,Southampton,yes,False
0,3,male,33.0,0,0,9.5,S,Third,man,True,,Southampton,no,True
0,3,male,23.0,0,0,7.8958,S,Third,man,True,,Southampton,no,True
1,2,female,48.0,1,2,65.0,S,Second,woman,False,,Southampton,yes,False
1,2,male,0.67,1,1,14.5,S,Second,child,False,,Southampton,yes,False
0,3,male,28.0,0,0,7.7958,S,Third,man,True,,Southampton,no,True
0,2,male,18.0,0,0,11.5,S,Second,man,True,,Southampton,no,True
0,3,male,34.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
1,1,female,33.0,0,0,86.5,S,First,woman,False,B,Southampton,yes,True
0,3,male,,0,0,14.5,S,Third,man,True,,Southampton,no,True
0,3,male,41.0,0,0,7.125,S,Third,man,True,,Southampton,no,True
1,3,male,20.0,0,0,7.2292,C,Third,man,True,,Cherbourg,yes,True
1,1,female,36.0,1,2,120.0,S,First,woman,False,B,Southampton,yes,False
0,3,male,16.0,0,0,7.775,S,Third,man,True,,Southampton,no,True
1,1,female,51.0,1,0,77.9583,S,First,woman,False,D,Southampton,yes,False
0,1,male,,0,0,39.6,C,First,man,True,,Cherbourg,no,True
0,3,female,30.5,0,0,7.75,Q,Third,woman,False,,Queenstown,no,True
0,3,male,,1,0,24.15,Q,Third,man,True,,Queenstown,no,False
0,3,male,32.0,0,0,8.3625,S,Third,man,True,,Southampton,no,True
0,3,male,24.0,0,0,9.5,S,Third,man,True,,Southampton,no,True
0,3,male,48.0,0,0,7.8542,S,Third,man,True,,Southampton,no,True
0,2,female,57.0,0,0,10.5,S,Second,woman,False,E,Southampton,no,True
0,3,male,,0,0,7.225,C,Third,man,True,,Cherbourg,no,True
1,2,female,54.0,1,3,23.0,S,Second,woman,False,,Southampton,yes,False
0,3,male,18.0,0,0,7.75,S,Third,man,True,,Southampton,no,True
0,3,male,,0,0,7.75,Q,Third,man,True,F,Queenstown,no,True
1,3,female,5.0,0,0,12.475,S,Third,child,False,,Southampton,yes,True
0,3,male,,0,0,7.7375,Q,Third,man,True,,Queenstown,no,True
1,1,female,43.0,0,1,211.3375,S,First,woman,False,B,Southampton,yes,False
1,3,female,13.0,0,0,7.2292,C,Third,child,False,,Cherbourg,yes,True
1,1,female,17.0,1,0,57.0,S,First,woman,False,B,Southampton,yes,False
0,1,male,29.0,0,0,30.0,S,First,man,True,D,Southampton,no,True
0,3,male,,1,2,23.45,S,Third,man,True,,Southampton,no,False
0,3,male,25.0,0,0,7.05,S,Third,man,True,,Southampton,no,True
0,3,male,25.0,0,0,7.25,S,Third,man,True,,Southampton,no,True
1,3,female,18.0,0,0,7.4958,S,Third,woman,False,,Southampton,yes,True
0,3,male,8.0,4,1,29.125,Q,Third,child,False,,Queenstown,no,False
1,3,male,1.0,1,2,20.575,S,Third,child,False,,Southampton,yes,False
0,1,male,46.0,0,0,79.2,C,First,man,True,B,Cherbourg,no,True
0,3,male,,0,0,7.75,Q,Third,man,True,,Queenstown,no,True
0,2,male,16.0,0,0,26.0,S,Second,man,True,,Southampton,no,True
0,3,female,,8,2,69.55,S,Third,woman,False,,Southampton,no,False
0,1,male,,0,0,30.6958,C,First,man,True,,Cherbourg,no,True
0,3,male,25.0,0,0,7.8958,S,Third,man,True,,Southampton,no,True
0,2,male,39.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
1,1,female,49.0,0,0,25.9292,S,First,woman,False,D,Southampton,yes,True
1,3,female,31.0,0,0,8.6833,S,Third,woman,False,,Southampton,yes,True
0,3,male,30.0,0,0,7.2292,C,Third,man,True,,Cherbourg,no,True
0,3,female,30.0,1,1,24.15,S,Third,woman,False,,Southampton,no,False
0,2,male,34.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
1,2,female,31.0,1,1,26.25,S,Second,woman,False,,Southampton,yes,False
1,1,male,11.0,1,2,120.0,S,First,child,False,B,Southampton,yes,False
1,3,male,0.42,0,1,8.5167,C,Third,child,False,,Cherbourg,yes,False
1,3,male,27.0,0,0,6.975,S,Third,man,True,,Southampton,yes,True
0,3,male,31.0,0,0,7.775,S,Third,man,True,,Southampton,no,True
0,1,male,39.0,0,0,0.0,S,First,man,True,A,Southampton,no,True
0,3,female,18.0,0,0,7.775,S,Third,woman,False,,Southampton,no,True
0,2,male,39.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
1,1,female,33.0,1,0,53.1,S,First,woman,False,E,Southampton,yes,False
0,3,male,26.0,0,0,7.8875,S,Third,man,True,,Southampton,no,True
0,3,male,39.0,0,0,24.15,S,Third,man,True,,Southampton,no,True
0,2,male,35.0,0,0,10.5,S,Second,man,True,,Southampton,no,True
0,3,female,6.0,4,2,31.275,S,Third,child,False,,Southampton,no,False
0,3,male,30.5,0,0,8.05,S,Third,man,True,,Southampton,no,True
0,1,male,,0,0,0.0,S,First,man,True,B,Southampton,no,True
0,3,female,23.0,0,0,7.925,S,Third,woman,False,,Southampton,no,True
0,2,male,31.0,1,1,37.0042,C,Second,man,True,,Cherbourg,no,False
0,3,male,43.0,0,0,6.45,S,Third,man,True,,Southampton,no,True
0,3,male,10.0,3,2,27.9,S,Third,child,False,,Southampton,no,False
1,1,female,52.0,1,1,93.5,S,First,woman,False,B,Southampton,yes,False
1,3,male,27.0,0,0,8.6625,S,Third,man,True,,Southampton,yes,True
0,1,male,38.0,0,0,0.0,S,First,man,True,,Southampton,no,True
1,3,female,27.0,0,1,12.475,S,Third,woman,False,E,Southampton,yes,False
0,3,male,2.0,4,1,39.6875,S,Third,child,False,,Southampton,no,False
0,3,male,,0,0,6.95,Q,Third,man,True,,Queenstown,no,True
0,3,male,,0,0,56.4958,S,Third,man,True,,Southampton,no,True
1,2,male,1.0,0,2,37.0042,C,Second,child,False,,Cherbourg,yes,False
1,3,male,,0,0,7.75,Q,Third,man,True,,Queenstown,yes,True
1,1,female,62.0,0,0,80.0,,First,woman,False,B,,yes,True
1,3,female,15.0,1,0,14.4542,C,Third,child,False,,Cherbourg,yes,False
1,2,male,0.83,1,1,18.75,S,Second,child,False,,Southampton,yes,False
0,3,male,,0,0,7.2292,C,Third,man,True,,Cherbourg,no,True
0,3,male,23.0,0,0,7.8542,S,Third,man,True,,Southampton,no,True
0,3,male,18.0,0,0,8.3,S,Third,man,True,,Southampton,no,True
1,1,female,39.0,1,1,83.1583,C,First,woman,False,E,Cherbourg,yes,False
0,3,male,21.0,0,0,8.6625,S,Third,man,True,,Southampton,no,True
0,3,male,,0,0,8.05,S,Third,man,True,,Southampton,no,True
1,3,male,32.0,0,0,56.4958,S,Third,man,True,,Southampton,yes,True
1,1,male,,0,0,29.7,C,First,man,True,C,Cherbourg,yes,True
0,3,male,20.0,0,0,7.925,S,Third,man,True,,Southampton,no,True
0,2,male,16.0,0,0,10.5,S,Second,man,True,,Southampton,no,True
1,1,female,30.0,0,0,31.0,C,First,woman,False,,Cherbourg,yes,True
0,3,male,34.5,0,0,6.4375,C,Third,man,True,,Cherbourg,no,True
0,3,male,17.0,0,0,8.6625,S,Third,man,True,,Southampton,no,True
0,3,male,42.0,0,0,7.55,S,Third,man,True,,Southampton,no,True
0,3,male,,8,2,69.55,S,Third,man,True,,Southampton,no,False
0,3,male,35.0,0,0,7.8958,C,Third,man,True,,Cherbourg,no,True
0,2,male,28.0,0,1,33.0,S,Second,man,True,,Southampton,no,False
1,1,female,,1,0,89.1042,C,First,woman,False,C,Cherbourg,yes,False
0,3,male,4.0,4,2,31.275,S,Third,child,False,,Southampton,no,False
0,3,male,74.0,0,0,7.775,S,Third,man,True,,Southampton,no,True
0,3,female,9.0,1,1,15.2458,C,Third,child,False,,Cherbourg,no,False
1,1,female,16.0,0,1,39.4,S,First,woman,False,D,Southampton,yes,False
0,2,female,44.0,1,0,26.0,S,Second,woman,False,,Southampton,no,False
1,3,female,18.0,0,1,9.35,S,Third,woman,False,,Southampton,yes,False
1,1,female,45.0,1,1,164.8667,S,First,woman,False,,Southampton,yes,False
1,1,male,51.0,0,0,26.55,S,First,man,True,E,Southampton,yes,True
1,3,female,24.0,0,3,19.2583,C,Third,woman,False,,Cherbourg,yes,False
0,3,male,,0,0,7.2292,C,Third,man,True,,Cherbourg,no,True
0,3,male,41.0,2,0,14.1083,S,Third,man,True,,Southampton,no,False
0,2,male,21.0,1,0,11.5,S,Second,man,True,,Southampton,no,False
1,1,female,48.0,0,0,25.9292,S,First,woman,False,D,Southampton,yes,True
0,3,female,,8,2,69.55,S,Third,woman,False,,Southampton,no,False
0,2,male,24.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
1,2,female,42.0,0,0,13.0,S,Second,woman,False,,Southampton,yes,True
1,2,female,27.0,1,0,13.8583,C,Second,woman,False,,Cherbourg,yes,False
0,1,male,31.0,0,0,50.4958,S,First,man,True,A,Southampton,no,True
0,3,male,,0,0,9.5,S,Third,man,True,,Southampton,no,True
1,3,male,4.0,1,1,11.1333,S,Third,child,False,,Southampton,yes,False
0,3,male,26.0,0,0,7.8958,S,Third,man,True,,Southampton,no,True
1,1,female,47.0,1,1,52.5542,S,First,woman,False,D,Southampton,yes,False
0,1,male,33.0,0,0,5.0,S,First,man,True,B,Southampton,no,True
0,3,male,47.0,0,0,9.0,S,Third,man,True,,Southampton,no,True
1,2,female,28.0,1,0,24.0,C,Second,woman,False,,Cherbourg,yes,False
1,3,female,15.0,0,0,7.225,C,Third,child,False,,Cherbourg,yes,True
0,3,male,20.0,0,0,9.8458,S,Third,man,True,,Southampton,no,True
0,3,male,19.0,0,0,7.8958,S,Third,man,True,,Southampton,no,True
0,3,male,,0,0,7.8958,S,Third,man,True,,Southampton,no,True
1,1,female,56.0,0,1,83.1583,C,First,woman,False,C,Cherbourg,yes,False
1,2,female,25.0,0,1,26.0,S,Second,woman,False,,Southampton,yes,False
0,3,male,33.0,0,0,7.8958,S,Third,man,True,,Southampton,no,True
0,3,female,22.0,0,0,10.5167,S,Third,woman,False,,Southampton,no,True
0,2,male,28.0,0,0,10.5,S,Second,man,True,,Southampton,no,True
0,3,male,25.0,0,0,7.05,S,Third,man,True,,Southampton,no,True
0,3,female,39.0,0,5,29.125,Q,Third,woman,False,,Queenstown,no,False
0,2,male,27.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
1,1,female,19.0,0,0,30.0,S,First,woman,False,B,Southampton,yes,True
0,3,female,,1,2,23.45,S,Third,woman,False,,Southampton,no,False
1,1,male,26.0,0,0,30.0,C,First,man,True,C,Cherbourg,yes,True
0,3,male,32.0,0,0,7.75,Q,Third,man,True,,Queenstown,no,True
%% Cell type:markdown id: tags:
![Translation](trans.png)
# Calling C++ code from Python
## Problem
- We have some existing C++ code which operates on array/image data
- We want to call it from Python
- We want to use Numpy arrays to pass input and receive output
- **Ideally, want to avoid too much copying of large data**
%% Cell type:markdown id: tags:
## Solution I will present
- Build a **Cython** extension
%% Cell type:markdown id: tags:
## Alternative solutions I will mention briefly
- Create a pure-C API and use `ctypes`
- Wrapper-generators (e.g. `swig`)
- Wrapping a command line tool
%% Cell type:markdown id: tags:
# Sample C++ code
#include "newimage/newimageall.h"
void process_volume(NEWIMAGE::volume4D<float> &invol)
{
// Do some clever stuff
invol.binarise(0.5);
}
int main(int argc, char **argv)
{
char *input_file = argv[1];
char *output_file = argv[2];
NEWIMAGE::volume4D<float> invol;
read_volume4D(invol, input_file);
process_volume(invol);
save_volume4D(invol, output_file);
}
%% Cell type:markdown id: tags:
## First provide an entry point using C++ native types
#include <vector>
#include <iostream>
std::vector<float> process_vectors(std::vector<float> &input, int nx, int ny, int nz, int nt)
{
// This is just so we can see if the data has been copied
std::cerr << "In C++ the input vector starts at address " << &input[0] << std::endl;
// Here we ought to check that nx, ny, nz, nt is consistent with overall length of input
// Create a volume4D using an existing data buffer
// when we do this, NEWIMAGE will not try to delete the data buffer
NEWIMAGE::volume4D<float> invol(nx, ny, nz, nt, &input[0]);
// Do our processing step
process_volume(invol);
// Input data has been modified, so return it directly
return input;
}
%% Cell type:markdown id: tags:
# Array ordering
![Ordering](abc.png)
If `input` is a 4D image, it's pretty clear that the first element is the voxel with co-ordinates `(0, 0, 0, 0)`
But what is the next element?
Is it voxel `(1, 0, 0, 0)`?
Or `(0, 0, 0, 1)`?
If the *first* axis is the one which varies fastest, we are using **Column-Major** ordering
If the *last* axis is the one which varies fastest, we are using **Row-Major** ordering
%% Cell type:markdown id: tags:
## So, which is the standard?
| Row-major | Column-major |
| ------------ | --------- |
| C/C++ native arrays | Fortran |
| Python/Numpy default | Matlab |
| SAS | FSL NEWIMAGE |
%% Cell type:markdown id: tags:
### Here, we will need to make sure our Numpy arrays are passed as 1-dimensional float arrays in Column-major order to match NEWIMAGE
data.flatten(order='F').astype(np.float32)
- `'F'` stands for 'Fortran order'
- A C++ `float` is *almost* guaranteed to be 32 bits
%% Cell type:markdown id: tags:
![Cython](cython.png)
# Cython extension
## First, the Cython wrapper
%% Cell type:code id: tags:
``` python
# my_analysis_wrapper.pyx
import numpy as np
cimport numpy as np
from libcpp.vector cimport vector
cdef extern from "my_analysis.h":
vector[float] process_vector(vector[float] &, int, int, int, int)
def process_using_vectors(data):
# Save the dimensions of the data because we're going to flatten it to 1D array
# Should be checking the dimensions at this point!
nx, ny, nz, nt = data.shape
# Convert data to 1D in Column-major (Fortran) order
# This always copies the data
data = data.flatten(order='F').astype(np.float32)
# This line is just so we can see if the data is being copied
print("In python the input data starts at %X" % data.__array_interface__['data'][0])
# Call the C++ code
output = process_vectors(data, nx, ny, nz, nt)
# Output is a 1D array in Fortran order - turn it back into a multidimensional array
# This should not copy the data
output = np.reshape(output, [nx, ny, nz, nt], order='F')
print("In python the reshaped data starts at %X" % output.__array_interface__['data'][0])
return output
```
%% Cell type:markdown id: tags:
## Next, build the extension
This would normally go in `setup.py`
%% Cell type:code id: tags:
``` python
import os
import sys
import numpy
from setuptools import setup
from Cython.Build import cythonize
from setuptools.extension import Extension
# My Cython extension
fsldir = os.environ["FSLDIR"]
ext = Extension("my_analysis_wrapper",
sources=['my_analysis_wrapper.pyx',
'my_analysis.cpp'],
language="c++",
include_dirs=[".", numpy.get_include(),
os.path.join(fsldir, "include"),
os.path.join(fsldir, "extras/include"),
os.path.join(fsldir, "extras/include/newmat")],
libraries=['newimage', 'miscmaths', 'fslio', 'niftiio', 'newmat', 'znz', "zlib"],
library_dirs=[os.path.join(fsldir, "lib"), os.path.join(fsldir, "extras/lib")])
# setup parameters
setup(name='my_app',
description='My Python application which calls C++',
ext_modules=cythonize(ext))
```
%% Cell type:code id: tags:
``` python
%run setup.py build_ext
```
%% Output
running build_ext
copying build\lib.win-amd64-2.7\my_analysis_wrapper.pyd ->
%% Cell type:code id: tags:
``` python
import numpy
import my_analysis_wrapper
import matplotlib.pyplot as plt
data = numpy.random.rand(10, 10, 10, 10)
plt.imshow(data[:,:,5,5])
plt.show()
output = my_analysis_wrapper.process_with_vectors(data)
plt.imshow(output[:,:,5,5])
plt.show()
```
%% Output
In python the input data starts at A55A980
In C++ the input vector starts at address 000000000A87D5A0
In python the reshaped data starts at A6B2040
%% Cell type:markdown id: tags:
# Great, it worked
## But we did copy our data - several times
- We copied the input once in Python to flatten it in the right order
- Cython copied it again, because the pointer to the memory in the `vector` is different from the pointer to the data in the Numpy array.
- Similarly, converting the output `vector` back to a Numpy array would involve a copy
## Do we care?
It depends on:
- Is the processing time per-voxel much greater than the data copying time?
- If so, copying will not add significant overhead
- Might the data be comparable in size to system memory?
- If so, copying may result in swapping and significant slowness
%% Cell type:markdown id: tags:
# Solution with less copying
We can't use a `vector`, it needs to be free to manage its own memory, not use an existing fixed buffer.
Instead pass a pure C array:
%% Cell type:markdown id: tags:
void process_array(float *input, int nx, int ny, int nz, int nt)
{
cerr << "In C++ the input array starts at address " << input << std::endl;
NEWIMAGE::volume4D<float> invol(nx, ny, nz, nt, input);
process_volume(invol);
// Volume data buffer is modified directly, so provided it was not copied
// we should be able to see the output directly in Python
}
%% Cell type:markdown id: tags:
- Note that we cannot check the size of the `input` buffer! It had better be correct
%% Cell type:code id: tags:
``` python
# my_analysis_wrapper.pyx
import numpy as np
cimport numpy as np
from libcpp.vector cimport vector
cdef extern from "my_analysis.h":
void process_array(float *, int, int, int, int)
def process_c(np.ndarray[np.float32_t, ndim=1] input,
nx, ny, nz, nt):
process_array(&input[0], nx, ny, nz, nt)
def process_with_arrays(data):
# Save the dimension of the data because we're going to flatten it to 1D array
nx, ny, nz, nt = data.shape
# Convert data to 1D in Column-major (Fortran) order
data = data.flatten(order='F').astype(np.float32)
print("In python the data starts at %X" % data.__array_interface__['data'][0])
process_c(data, nx, ny, nz, nt)
data = np.reshape(data, [nx, ny, nz, nt], order='F')
print("In python the reshaped data starts at %X" % data.__array_interface__['data'][0])
return data
```
%% Cell type:code id: tags:
``` python
import numpy
import my_analysis_wrapper
plt.imshow(data[:,:,5,5])
plt.show()
output = my_analysis_wrapper.process_with_arrays(data)
plt.imshow(output[:,:,5,5])
plt.show()
```
%% Output
In python the data starts at A870040
In C++ the input array starts at address 000000000A870040
In python the reshaped data starts at A870040
%% Cell type:markdown id: tags:
- We copied our input data once when we flattened it into Fortran order
- C++ code operated directly on that buffer
- Output data was not copied when reshaped
%% Cell type:markdown id: tags:
# Summary
- Easy-ish recipe for passing Numpy arrays to C++ either as a `std::vector` or as a `float *` array.
- Can construct `NEWIMAGE::volume<float>` or other complex containers from within C++
- Easy modification to instead use `double` array
- Can pass Python strings to `C++ std::string` and other C++ containers in a similar way
%% Cell type:markdown id: tags:
# Alternatives (Briefly!)
## Why?
![Compatibility](cython_compat1.png)
## Can we assume that our newly compiled Cython/C++ code will link correctly with `libnewimage.a`?
- Often, yes, but in general, no
- It depends on the compiler used for each - ideally they need to match
- The compiler of your Cython extension is **fixed** by the version of Python you are using
- Might need to recompile your dependency libraries with this compiler
- If you can't do this (e.g. commercial binary) you may be **stuck**
%% Cell type:markdown id: tags:
## Three common problem scenarios
- On Mac, need to use the same C++ standard library (either `libc++` or `libstdc++`)
- On Python 2, C++ compiler will be very old (may not support all of C++11)
- On Windows, no two versions of VC++ are binary compatible
%% Cell type:markdown id: tags:
# Alternative approach where this is a problem
- Make your code a shared library with a *Pure C* API
- Use `ctypes`
## `ctypes`
- Part of Python standard library
- Allows you to call library functions from 'C' shared library (not C++)
- **Pure 'C' libraries are (generally) binary compatible on a given platform**
- We have to load the library manually
- We have to tell Python about the input and return types
%% Cell type:markdown id: tags:
# Pure 'C' API for our processing function
// my_analysis_purec.h
#ifdef __cplusplus
extern "C" {
#endif
void process_array(float *input, int nx, int ny, int nz, int nt);
#ifdef __cplusplus
}
#endif
%% Cell type:markdown id: tags:
- Note need to use `extern "C" { }` if we may want to include this header from C++
- **On Windows, additional code is required to make the shared library (DLL) link correctly!**
- Note that the implementation *can* use C++
%% Cell type:code id: tags:
``` python
import ctypes import CDLL, c_int, c_char_p
import numpy as np
import numpy.ctypeslib
def process_ctypes(data):
clib = ctypes.cdll.LoadLibrary("libmy_analysis.so")
# This is the data type of a 1-D Numpy array
c_float_arr = numpy.ctypeslib.ndpointer(dtype=np.float32, ndim=1, flags='CONTIGUOUS')
# This specifies the argument types for the 'process_array' function
# This is not actually required but enables ctypes to do some error checking
clib.process_array.argtypes = [c_float_arr, c_int, c_int, c_int, c_int]
# Put the Numpy data into row-major order and make sure it is contiguous in memory
item = np.ascontiguousarray(item.flatten(order='F'), dtype=np.float32)
clib.process_carray(data, shape[0], shape[1], shape[2], shape[3])
```
%% Cell type:markdown id: tags:
# Comparison with Cython
## Cython advantages
- Python wrapper is probably a little quicker and cleaner to write
- Don't need to produce a new pure-C API provided we have an entry point using C++ types
- Potential for better error-checking
- Might integrate well if you are already using Cython
- No need to build a shared library
## `ctypes` advantages
- Part of the Python standard library
- No additional compile step in `setup.py`
- Binary compatibility - no need to be tied to a single (perhaps old) C++ compiler
## Conclusion?
- Use Cython when you can, `ctypes` if you have to
%% Cell type:markdown id: tags:
# Other alternatives (briefly for completeness)
## Wrapper Generators (SWIG, shiboken, others)
- Run a preprocessor on your C++ code to generate an 'automatic' Python wrapper
- Generally need to write an 'interface specifier' for each C++ header to describe how it interfaces to Python
- Great when you have a large, complex C++ API which needs to be consistently exposed to Python (e.g. wx/wxpython, QT/PyQT)
- SWIG can support other languages as well as Python
- Probably more work than Cython/ctypes if you have a single simple API
%% Cell type:markdown id: tags:
## Just Wrap the Command Line
- Quick and dirty
- Copies all data to/from filesystem
- Need to go via command line API, create temp directories, etc
- Don't overlook as a way of getting started - can move to other solution later
%% Cell type:code id: tags:
``` python
import os
import tempfile
import subprocess
import tempfile
import shutil
import numpy as np
import nibabel as nib
os.environ["FSLOUTPUTTYPE"] = "NIFTI_GZ"
def binarise(data):
# Remember the directory where we started
cwd_orig = os.getcwd()
try:
# Create a temporary directory
tempdir = tempfile.mkdtemp("fsl")
# Save input data in temp directory
os.chdir(tempdir)
tmpin = nib.Nifti1Image(data, np.identity(4))
tmpin.to_filename("in.nii.gz")
# Run a command from $FSLDIR
fslmaths = os.path.join(os.environ["FSLDIR"], "bin", "fslmaths")
# We could use os.system here if we don't care about returning the stdout/stderr
p = subprocess.Popen([fslmaths, "in.nii.gz", "-thr", "0.5", "-bin", "out.nii.gz"],
stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
cmd_stdout = ""
while 1:
retcode = p.poll()
cmd_stdout += p.stdout.readline()
if retcode is not None: break
if retcode != 0:
raise RuntimeError("Error: %s" % cmd_stdout)
# Load the output file and return it with the command standard output
out_nii = nib.load("out.nii.gz")
return out_nii.get_data(), cmd_stdout
finally:
# Change back to our starting directory
os.chdir(cwd_orig)
data = np.random.rand(10, 10, 10, 10)
plt.imshow(data[:,:,5,5])
plt.show()
output, stdout = binarise(data)
plt.imshow(output[:,:,5,5])
plt.show()
```
%% Output
%% Cell type:markdown id: tags:
# Summary
## What we've done
- It's not that hard to call existing C++ code from Python
- Need to be a bit careful with Numpy arrays
- Cython is probably the easiest method
- Data copying can be minimised by passing data as C arrays (`float *` etc)
- `ctypes` may be a good alternative if you have binary compatibility issues
- Can always wrap a command line tool as a way of getting started!
%% Cell type:code id: tags:
``` python
```
File added
Source diff could not be displayed: it is too large. Options to address this: view the blob.
%% Cell type:markdown id: tags:
# Numpy vectorizing
%% Cell type:code id: tags:
``` python
import numpy as np
```
%% Cell type:code id: tags:
``` python
def root(a, b, c):
D = b ** 2 - 4 * a * c
if D < 0:
return np.nan, np.nan
x1 = (-b + np.sqrt(D)) / (2 * a)
x2 = (-b - np.sqrt(D)) / (2 * a)
return x1, x2
root(1, 3, 4), root(-1, 3, 4), root(1, 0, 0)
```
%% Output
((nan, nan), (-1.0, 4.0), (0.0, 0.0))
%% Cell type:code id: tags:
``` python
a = np.random.randn(int(1e6))
b = np.random.randn(int(1e6))
c = np.random.randn(int(1e6))
root(a, b, c)
```
%% Output
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-19-5d1fed3ed2df> in <module>()
2 b = np.random.randn(int(1e6))
3 c = np.random.randn(int(1e6))
----> 4 root(a, b, c)
<ipython-input-18-54b500cd66b1> in root(a, b, c)
1 def root(a, b, c):
2 D = b ** 2 - 4 * a * c
----> 3 if D < 0:
4 return np.nan, np.nan
5 x1 = (-b + np.sqrt(D)) / (2 * a)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
%% Cell type:code id: tags:
``` python
%%time
np.array([root(av, bv, cv) for av, bv, cv in zip(a, b, c)])
```
%% Output
CPU times: user 3.64 s, sys: 65.9 ms, total: 3.71 s
Wall time: 3.71 s
array([[-0.55797188, 1.11121928],
[ nan, nan],
[ 0.76166857, -1.70637551],
...,
[-0.56859783, 0.98659831],
[ nan, nan],
[-0.53531175, 0.18339357]])
%% Cell type:code id: tags:
``` python
root_vec = np.vectorize(root)
np.vectorize?
```
%% Cell type:code id: tags:
``` python
%%time
root_vec(a, b, c)
```
%% Output
CPU times: user 1.85 s, sys: 71.9 ms, total: 1.92 s
Wall time: 1.92 s
(array([-0.55797188, nan, 0.76166857, ..., -0.56859783,
nan, -0.53531175]),
array([ 1.11121928, nan, -1.70637551, ..., 0.98659831,
nan, 0.18339357]))
%% Cell type:code id: tags:
``` python
%%time
root_vec(a, b, 7)
```
%% Output
CPU times: user 1.66 s, sys: 62.7 ms, total: 1.73 s
Wall time: 1.73 s
(array([-1.72246968, -2.63787563, nan, ..., -2.20722406,
nan, -1.84415698]),
array([2.27571708, 2.04353033, nan, ..., 2.62522454, nan,
1.4922388 ]))
%% Cell type:markdown id: tags:
# Numba
%% Cell type:code id: tags:
``` python
import numba
```
%% Cell type:code id: tags:
``` python
root_jit = numba.jit(root)
```
%% Cell type:code id: tags:
``` python
%timeit root(23, 78, 19.0)
```
%% Output
3.08 µs ± 192 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%% Cell type:code id: tags:
``` python
%timeit root_jit(23, 78, 19.0)
```
%% Output
The slowest run took 11.09 times longer than the fastest. This could mean that an intermediate result is being cached.
786 ns ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%% Cell type:code id: tags:
``` python
%%time
root_jit(a, b, 7)
```
%% Output
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
The above exception was the direct cause of the following exception:
SystemError Traceback (most recent call last)
<timed eval> in <module>()
SystemError: CPUDispatcher(<function root at 0x114035620>) returned a result with an error set
%% Cell type:code id: tags:
``` python
root_jit_vec = np.vectorize(root_jit)
%time root_jit_vec(a, b, 7)
```
%% Output
CPU times: user 406 ms, sys: 57.2 ms, total: 464 ms
Wall time: 464 ms
(array([-1.72246968, -2.63787563, nan, ..., -2.20722406,
nan, -1.84415698]),
array([2.27571708, 2.04353033, nan, ..., 2.62522454, nan,
1.4922388 ]))
%% Cell type:code id: tags:
``` python
@np.vectorize
@numba.jit
def fast_root(a, b, c):
D = b ** 2 - 4 * a * c
if D < 0:
return np.nan, np.nan
x1 = (-b + np.sqrt(D)) / (2 * a)
x2 = (-b - np.sqrt(D)) / (2 * a)
return x1, x2
%time fast_root(a, b, 7)
```
%% Output
CPU times: user 391 ms, sys: 58.2 ms, total: 449 ms
Wall time: 448 ms
(array([-1.72246968, -2.63787563, nan, ..., -2.20722406,
nan, -1.84415698]),
array([2.27571708, 2.04353033, nan, ..., 2.62522454, nan,
1.4922388 ]))
%% Cell type:code id: tags:
``` python
@numba.vectorize
def fast_root(a, b, c):
D = b ** 2 - 4 * a * c
if D < 0:
return np.nan, np.nan
x1 = (-b + np.sqrt(D)) / (2 * a)
x2 = (-b - np.sqrt(D)) / (2 * a)
return x1, x2
%time fast_root(a, b, 7)
```
%% Output
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
<timed eval> in <module>()
~/miniconda3/lib/python3.6/site-packages/numba/npyufunc/dufunc.py in _compile_for_args(self, *args, **kws)
166 argty = argty.dtype
167 argtys.append(argty)
--> 168 return self._compile_for_argtys(tuple(argtys))
169
170 def _compile_for_argtys(self, argtys, return_type=None):
~/miniconda3/lib/python3.6/site-packages/numba/npyufunc/dufunc.py in _compile_for_argtys(self, argtys, return_type)
188 cres, argtys, return_type)
189 dtypenums, ptr, env = ufuncbuilder._build_element_wise_ufunc_wrapper(
--> 190 cres, actual_sig)
191 self._add_loop(utils.longint(ptr), dtypenums)
192 self._keepalive.append((ptr, cres.library, env))
~/miniconda3/lib/python3.6/site-packages/numba/npyufunc/ufuncbuilder.py in _build_element_wise_ufunc_wrapper(cres, signature)
163 # Get dtypes
164 dtypenums = [as_dtype(a).num for a in signature.args]
--> 165 dtypenums.append(as_dtype(signature.return_type).num)
166 return dtypenums, ptr, env
167
~/miniconda3/lib/python3.6/site-packages/numba/numpy_support.py in as_dtype(nbtype)
134 return as_dtype(nbtype.dtype)
135 raise NotImplementedError("%r cannot be represented as a Numpy dtype"
--> 136 % (nbtype,))
137
138
NotImplementedError: (float64 x 2) cannot be represented as a Numpy dtype
%% Cell type:markdown id: tags:
# But run the profiler first
%% Cell type:code id: tags:
``` python
def fib(n):
if n <= 1:
return 1
else:
return fib(n-2) + fib(n-1)
%time fib(33)
```
%% Output
CPU times: user 1.56 s, sys: 3.5 ms, total: 1.56 s
Wall time: 1.56 s
5702887
%% Cell type:code id: tags:
``` python
@numba.jit
def fib_jit(n):
if n <= 1:
return 1
else:
return fib_jit(n-2) + fib_jit(n-1)
%time fib_jit(33)
```
%% Output
CPU times: user 63.4 ms, sys: 1.79 ms, total: 65.2 ms
Wall time: 64.2 ms
5702887
%% Cell type:code id: tags:
``` python
%time fib_jit(41)
```
%% Output
CPU times: user 1.28 s, sys: 5.26 ms, total: 1.28 s
Wall time: 1.28 s
267914296
%% Cell type:code id: tags:
``` python
%prun fib(33)
%prun fib_jit(41)
```
%% Output
%% Cell type:code id: tags:
``` python
%prun fib_jit(41)
%prun fib(33)
```
%% Output
%% Cell type:code id: tags:
``` python
from functools import lru_cache
@lru_cache(None)
def fib_cache(n):
if n <= 1:
return 1
else:
return fib_cache(n-2) + fib_cache(n-1)
%time fib_cache(33)
```
%% Output
CPU times: user 20 µs, sys: 1 µs, total: 21 µs
Wall time: 24.1 µs
%% Cell type:code id: tags:
``` python
lru_cache?
```
%% Cell type:code id: tags:
``` python
%time fib_cache(500)
```
%% Output
CPU times: user 261 µs, sys: 1e+03 ns, total: 262 µs
Wall time: 267 µs
225591516161936330872512695036072072046011324913758190588638866418474627738686883405015987052796968498626
%% Cell type:markdown id: tags:
# Optimizing numpy algebra
%% Cell type:markdown id: tags:
Multiplying/adding arrays incurs a lot of overhead, because lots of intermediate arrays get created and discarded again
%% Cell type:code id: tags:
``` python
a = np.random.randn(int(1e7))
b = np.random.randn(int(1e7))
%timeit a * b + 4.1 * a < 3 * b
```
%% Output
139 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%% Cell type:code id: tags:
``` python
import numexpr as ne
%timeit ne.evaluate('a * b + 4.1 * a < 3 * b')
```
%% Output
11.4 ms ± 484 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%% Cell type:markdown id: tags:
Efficiency gain reduces once the operations become more complex
%% Cell type:code id: tags:
``` python
%timeit a * np.sqrt(abs(b)) < 3 * b
c = np.random.randn(int(1e7))
%timeit (-b - np.sqrt(b ** 2 - 4 * a * c) ) / (2 * a)
```
%% Output
141 ms ± 1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%% Cell type:code id: tags:
``` python
%timeit ne.evaluate('a * sqrt(abs(b)) < 3 * b')
%timeit ne.evaluate('(-b - sqrt(b ** 2 - 4 * a * c) ) / (2 * a)')
```
%% Output
12.6 ms ± 986 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%% Cell type:markdown id: tags:
# Cython
%% Cell type:markdown id: tags:
http://docs.cython.org/en/latest/
Compiles most python programs to C code for extra speed (but less introspection). Also allows direct calling of C library code.
Can include python classes and much more.
%% Cell type:code id: tags:
``` python
%load_ext cython
```
%% Output
The cython extension is already loaded. To reload it, use:
%reload_ext cython
%% Cell type:code id: tags:
``` python
def primes(kmax):
p = {}
result = []
if kmax > 1000:
kmax = 1000
k = 0
n = 2
while k < kmax:
i = 0
while i < k and n % p[i] != 0:
i = i + 1
if i == k:
p[k] = n
k = k + 1
result.append(n)
n = n + 1
return result
print(primes(10))
%time _ = primes(10000)
```
%% Output
[2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
CPU times: user 67.4 ms, sys: 1.22 ms, total: 68.6 ms
Wall time: 68 ms
%% Cell type:code id: tags:
``` python
%%cython -a
def cprimes(int kmax):
cdef int[1000] p
def cprimes(kmax):
p = {}
result = []
if kmax > 1000:
kmax = 1000
cdef int k, n, i
k = 0
n = 2
while k < kmax:
i = 0
while i < k and n % p[i] != 0:
i = i + 1
if i == k:
p[k] = n
k = k + 1
result.append(n)
n = n + 1
return result
```
%% Output
<IPython.core.display.HTML object>
%% Cell type:code id: tags:
``` python
print(cprimes(10))
%time _ = cprimes(10000)
```
%% Output
[2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
CPU times: user 2.2 ms, sys: 10 µs, total: 2.21 ms
Wall time: 2.22 ms
%% Cell type:markdown id: tags:
## Defining C-style function
%% Cell type:code id: tags:
``` python
#%%cython -a
def fib(n):
%%cython -a
def cfib(n):
if n <= 1:
return 1
else:
return fib(n-2) + fib(n-1)
return cfib(n-2) + cfib(n-1)
```
%% Cell type:code id: tags:
``` python
%time fib(33)
%time cfib(33)
```
%% Output
CPU times: user 1.48 s, sys: 1.79 ms, total: 1.48 s
Wall time: 1.48 s
5702887
%% Cell type:code id: tags:
``` python
%time fib_jit(33)
```
%% Output
CPU times: user 27.5 ms, sys: 73 µs, total: 27.5 ms
Wall time: 27.5 ms
5702887
%% Cell type:code id: tags:
``` python
%%cython -a
cdef int cfib(int n) nogil:
if n <= 1:
return 1
else:
return cfib(n-2) + cfib(n-1)
def fib(n):
return cfib(n)
```
%% Output
<IPython.core.display.HTML object>
%% Cell type:code id: tags:
``` python
%time fib(33)
```
%% Output
CPU times: user 10.2 ms, sys: 138 µs, total: 10.4 ms
Wall time: 10.4 ms
5702887
%% Cell type:markdown id: tags:
## Cython in production code
%% Cell type:code id: tags:
``` python
def root(a, b, c):
D = b ** 2 - 4 * a * c
if D < 0:
return NAN, NAN
x1 = (-b + sqrt(D)) / (2 * a)
x2 = (-b - sqrt(D)) / (2 * a)
return np.nan, np.nan
x1 = (-b + np.sqrt(D)) / (2 * a)
x2 = (-b - np.sqrt(D)) / (2 * a)
return x1, x2
root(1, 3, 4), root(-1, 3, 4), root(1, 0, 0)
```
%% Cell type:code id: tags:
``` python
a = np.random.randn(int(1e6))
b = np.random.randn(int(1e6))
c = np.random.randn(int(1e6))
%time np.vectorize(root)(a, b, 7)
```
%% Output
CPU times: user 236 ms, sys: 53.7 ms, total: 290 ms
Wall time: 288 ms
(array([-1.72246969, -2.6378758 , nan, ..., -2.20722389,
nan, -1.84415698]),
array([2.27571702, 2.04353046, nan, ..., 2.62522459, nan,
1.49223876]))
%% Cell type:code id: tags:
``` python
%time root_jit_vec(a, b, 7)
```
%% Output
CPU times: user 353 ms, sys: 57.8 ms, total: 411 ms
Wall time: 410 ms
(array([-1.72246968, -2.63787563, nan, ..., -2.20722406,
nan, -1.84415698]),
array([2.27571708, 2.04353033, nan, ..., 2.62522454, nan,
1.4922388 ]))
%% Cell type:code id: tags:
``` python
%%cython -a
from libc.math cimport sqrt, NAN
def root(float a, float b, float c):
cdef float D, x1, x2
D = b ** 2 - 4 * a * c
if D < 0:
return NAN, NAN
x1 = (-b + sqrt(D)) / (2 * a)
x2 = (-b - sqrt(D)) / (2 * a)
return x1, x2
print(root(1, 3, 4), root(-1, 3, 4), root(1, 0, 0))
```
%% Cell type:markdown id: tags:
## Cython in production code
## More cython
%% Cell type:markdown id: tags:
http://docs.cython.org/en/latest/src/tutorial/cython_tutorial.html
%% Cell type:markdown id: tags:
# Steps to optimizing code:
- Profile where the bottleneck is (%prun)
- You can install a [line profiler](https://github.com/rkern/line_profiler) (available as %lprun after installation)
- Is there a faster algorithm for the bottleneck?
- If the bottleneck is vectorized: can we optimize with numexpr?
- If the internal part of the loop can not be vectorized:
- numba jit if simple enough otherwise cython
- the loop itself can be sped up with numpy.vectorize
- If the bottleneck gets called too often: cache the result
- If even that fails: pycuda
- If that is too slow: learn to optimize cuda code or find an easier problem
%% Cell type:code id: tags:
``` python
```
......