* [Estimators: fitting and predicting](#estimators)
* [Transformers](#transformers)
* [Pipelines: chaining transforms and estimators](#pipelines)
* [Model evaluation](#model-evaluation)
* [Neuroimaging examples](#examples)
* [Classification](#classification)
* [Regression](#regression)
* [Clustering](#clustering)
%% Cell type:markdown id: tags:
<a class="anchor" id="getting-started"></a>
## Getting Started
Adapted from the `scikit-learn` "Getting Started" documentation: https://scikit-learn.org/stable/getting_started.html
There are three important concepts to understand when using `scikit-learn`:
1. `estimator` objects and their `fit` and `predict` methods for fitting data and making predictions
2. `transformer` objects and their `fit` and `transform` methods for pre/post-processing data
3. `pipeline` objects for chaining together `transformers` and `estimators` into a machine learning pipeline
%% Cell type:markdown id: tags:
<a class="anchor" id="estimators"></a>
### Estimators: fitting and predicting
Each machine learning model in `scikit-learn` is implemented as an [`estimator`](https://scikit-learn.org/stable/glossary.html#term-estimators) object. Here we instantiate a [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier) `estimator`:
%% Cell type:code id: tags:
``` python
# import RF estimator
from sklearn.ensemble import RandomForestClassifier
# instantiate RF estimator
clf = RandomForestClassifier(random_state=0)
print(clf)
```
%% Cell type:markdown id: tags:
After creation, the `estimator` can be **fit** to the training data using the `fit` method.
The `fit` method accepts two inputs:
1. `X` is the training input samples of shape (`n_samples`, `n_features`). Thus, rows=samples and columns=features.
2. `y` is the target values (class labels in classification, real numbers in regression) of shape (`n_samples`,) for one output, or (`n_samples`, `n_outputs`) for multiple outputs.
> Note: Both `X` and `y` are usually numpy arrays or equivalent array-like data types, including a [`pandas DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).
%% Cell type:code id: tags:
``` python
# create some toy data
X = [[1, 2, 3],    # 2 samples, 3 features
     [11, 12, 13]]
y = [0, 1]         # classes of each sample
# fit the model to the data
clf.fit(X, y)
```
%% Cell type:markdown id: tags:
Once trained, the `estimator` can make predictions on new data using the `predict` method.
%% Cell type:code id: tags:
``` python
# predict classes of new data
X_test = [[4, 5, 6],
          [14, 15, 16]]
clf.predict(X_test)
```
%% Cell type:markdown id: tags:
Importantly, this `fit` and `predict` interface is consistent across different estimators, making it very easy to change estimators within your code. For example, here we swap the Random Forest `estimator` for a Support Vector Machine `estimator`:
%% Cell type:code id: tags:
``` python
# import an SVM estimator
from sklearn.svm import SVC
# instantiate SVM estimator
clf = SVC(random_state=0)
# fit and predict
clf.fit(X, y)
clf.predict(X_test)
```
%% Cell type:markdown id: tags:
**Exercise:** `scikit-learn` implements a wide array of machine learning `estimators`. Try fitting and predicting with a different model. See the `scikit-learn` [documentation](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) for model options.
%% Cell type:code id: tags:
``` python
# YOUR CODE GOES HERE
```
%% Cell type:markdown id: tags:
<a class="anchor" id="transformers"></a>
### Transformers
The `transformer` is a special object that follows the same API as an `estimator`, and allows you to apply pre-processing and/or post-processing transforms to the data in your machine learning pipeline. A `transformer` object has a `transform` method instead of a `predict` method.
In this example we use a `transformer` to standardise the features (i.e. remove the mean and scale to unit variance). The `fit` method calculates the mean and variance parameters from the data, and the `transform` method does the scaling.
%% Cell type:code id: tags:
``` python
# import StandardScaler transformer
from sklearn.preprocessing import StandardScaler
# create some toy data
X = [[0, 15],
     [1, -10]]
# instantiate StandardScaler transformer
scaler = StandardScaler()
# fit the transform to the data
scaler.fit(X)
# apply transform
scaler.transform(X)
```
%% Cell type:markdown id: tags:
**Exercise:** Look at the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) for `StandardScaler` and see if you can extract the mean and variance estimates from the fitted `transformer`.
%% Cell type:code id: tags:
``` python
# YOUR CODE GOES HERE
```
%% Cell type:markdown id: tags:
> Note: Once you run the `fit` method on either a `transformer` or an `estimator`, you can access the parameters that have been learned.
For example, for the `transformer` used above to standardise the data:
%% Cell type:code id: tags:
``` python
# Look inside the object to see the learned parameters (and other object properties)
vars(scaler)
```
%% Cell type:markdown id: tags:
<a class="anchor" id="pipelines"></a>
### Pipelines: chaining transforms and estimators
A typical machine learning pipeline will often involve numerous pre-processing transforms and one or more estimators. The `pipeline` object can be used to combine `transformer` and `estimator` objects into a single object that has an API that is the same as a regular `estimator`.
In this example we create a very simple `pipeline` comprising a `StandardScaler` `transformer` and a Random Forest `estimator`.
> Note: in this example the `estimator` predictions are made with the same data (`X`) on which the `estimator` was trained. Typically, independent **train** and **test** data would be used for fitting and prediction. In the next section, [Model evaluation](#model-evaluation), we will see how to split data into train and test sets.
%% Cell type:code id: tags:
``` python
from sklearn.datasets import load_iris   # function to load demo dataset
from sklearn.pipeline import Pipeline    # pipeline object
# create a pipeline object
# using a list of transformers/estimators (and give them names)
pipe = Pipeline(
    [
        ('my_scaler', StandardScaler()),
        ('my_classifier', RandomForestClassifier())
    ]
)
# load the iris dataset
# See https://en.wikipedia.org/wiki/Iris_flower_data_set
X, y = load_iris(return_X_y=True)
# fit the whole pipeline
pipe.fit(X, y)
# predict classes from the training data
pipe.predict(X)
```
%% Cell type:markdown id: tags:
When running a pipeline, the different components of the pipeline are run sequentially.
You may also want to run the pipeline "up to a point", or check the parameters of a component of the pipeline. For example, in the pipeline above, we may want to see the parameters of the `StandardScaler`, or run only that part on some data.
%% Cell type:code id: tags:
``` python
# Check the parameters learned by the scaler
print(pipe['my_scaler'].mean_)
print(pipe['my_scaler'].var_)
# Run only the scaler on some data
X_t = pipe['my_scaler'].transform(X)
# Running the full pipeline is equivalent to running the `transform` method of each
# component sequentially and then running `predict` at the end
pipe['my_classifier'].predict(X_t)
```
%% Cell type:markdown id: tags:
<a class="anchor" id="model-evaluation"></a>
### Model evaluation
In machine learning we often want to train a model on some data, and then apply the model to new "unseen" data to make predictions. Thus we want our model to **generalise** well to unseen data. To assess generalisation, we need a dataset to train/fit the model AND an independent dataset to evaluate the model.
The simplest way to do this is to split the data into a train set and a test set, fit the model to the train set, and make predictions on the test set. There are many methods for splitting data implemented within `scikit-learn` in the [`model_selection`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection) module.
To evaluate a model we also need a performance metric. There are many metrics used in machine learning, and many of them are implemented within `scikit-learn` in the [`metrics`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) module.
In this example, we take the simple `pipeline` constructed earlier and extend it to include a simple train/test split using the `train_test_split` function. A major advantage of using the `pipeline` object is that it prevents leakage between your train and test data, which would invalidate your evaluation. For the performance metric we use `accuracy_score`, which is simply the fraction of correct predictions.
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# fit the pipeline to the train data
pipe.fit(X_train, y_train)
# predict classes from the test data and calculate the accuracy
accuracy_score(pipe.predict(X_test), y_test)
```
%% Cell type:markdown id: tags:
It is typical to perform the train/test splits within the folds of a [cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation) to make more efficient use of the available data.
`scikit-learn` makes cross-validation very easy with the `cross_val_score` function. Furthermore, by using the `pipeline` object we can ensure that there is no data leakage with this more complex training scheme.
In this example, we use `cross_val_score` combined with the `KFold` cross-validation iterator to evaluate the model with 5-fold cross-validation:
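Below is a minimal sketch, assuming the `pipe`, `X`, and `y` objects from the earlier pipeline example:
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import cross_val_score, KFold
# evaluate the pipeline with 5-fold cross-validation
# (shuffling so each fold contains a mix of classes)
scores = cross_val_score(pipe, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores)
```
%% Cell type:markdown id: tags: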
Thus, rather than a single accuracy score, you obtain a distribution of scores, one per fold.
**Exercise:** Another commonly used cross-validation scheme is `StratifiedKFold` which is a variation of k-fold which uses stratified folds. A stratified fold is where each set (train/test) within a fold contains approximately the same percentage of samples of each target class as the complete set. Try training the pipeline from the previous example with `StratifiedKFold` and `k=6`. The documentation for `StratifiedKFold` is [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn-model-selection-stratifiedkfold).
%% Cell type:code id: tags:
``` python
# YOUR CODE GOES HERE
```
%% Cell type:markdown id: tags:
<a class="anchor" id="examples"></a>
## Neuroimaging Examples
Here we present examples of using `scikit-learn` for classification, regression, and clustering on neuroimaging data.
%% Cell type:markdown id: tags:
<a class="anchor" id="classification"></a>
### Classification
In this example, we will train a `scikit-learn` classifier to predict from fMRI data whether the subject is looking at an image of a face or an image of a house.
The example will:
1. Download the fMRI for a single subject
2. Split the data into train and test sets
3. Fit a linear SVM to the train data
4. Evaluate the performance of the classifier on the test data
This example is adapted from: https://nilearn.github.io/stable/auto_examples/02_decoding/plot_haxby_anova_svm.html#sphx-glr-auto-examples-02-decoding-plot-haxby-anova-svm-py
%% Cell type:markdown id: tags:
First we must import the data. If this is the first time you have run this code, it will also download the data, which should take less than a minute.
> **NOTE:**
> 1. Here we use [`fetch_haxby`](https://nilearn.github.io/stable/modules/generated/nilearn.datasets.fetch_haxby.html#nilearn.datasets.fetch_haxby) from the [`nilearn`](https://nilearn.github.io/index.html) package to download the demo data
> 2. [`nb.load`](https://nipy.org/nibabel/gettingstarted.html) from the [`nibabel`](https://nipy.org/nibabel/index.html) package will load the image into a [`nibabel.Nifti1Image`](https://nipy.org/nibabel/reference/nibabel.nifti1.html) object. This will not load the actual image data, though.
%% Cell type:code id: tags:
``` python
from nilearn import datasets
import nibabel as nb
# load dataset for 1 subject (download if required, ~30s)
haxby_dataset = datasets.fetch_haxby()
# load the fMRI image (the image data itself is not loaded into memory yet)
func_img = nb.load(haxby_dataset.func[0])
```
%% Cell type:markdown id: tags:
Next, load the behavioural data, which describes which task was being performed during each volume of the fMRI. We are only interested in a binary classification, so we select the volumes where the subject is looking at either a face or a house.
> **NOTE:**
> 1. Here we use [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) from the [`pandas`](https://pandas.pydata.org/docs/index.html) package to load the behavioural data
> 2. We use [`index_img`](https://nilearn.github.io/modules/generated/nilearn.image.index_img.html) from the [`nilearn`](https://nilearn.github.io/index.html) package to select the face and house volumes from the fMRI image.
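A minimal sketch of this step, assuming the `haxby_dataset` and `func_img` objects from the previous cell (the file and column names follow the `nilearn` example):
%% Cell type:code id: tags:
``` python
import pandas as pd
from nilearn.image import index_img
# load the behavioural labels (one row per fMRI volume)
behavioral = pd.read_csv(haxby_dataset.session_target[0], sep=' ')
conditions = behavioral['labels']
# keep only the face and house volumes
condition_mask = conditions.isin(['face', 'house'])
func_img = index_img(func_img, condition_mask)
conditions = conditions[condition_mask]
```
%% Cell type:markdown id: tags: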
Now load the fMRI data (face and house conditions only), flatten into an `(n_samples, n_features)` matrix for use with `scikit-learn`, and mask out non-brain voxels. In this case, samples will be individual volumes, and features will be the brain voxels. A sketch of these steps, plus the train/test split and linear SVM fit from the example outline, is below.
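The brain mask used here (voxels with non-zero mean) is a crude assumption; the original example uses a more principled masking strategy:
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# load the 4D data and flatten into an (n_samples, n_features) matrix
func_data = func_img.get_fdata()
mask = func_data.mean(axis=-1) > 0   # crude brain mask (an assumption)
X = func_data[mask].T                # rows = volumes (samples), columns = voxels (features)
y = conditions.values
# split into train and test sets, fit a linear SVM, and evaluate
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = SVC(kernel='linear')
clf.fit(X_train, y_train)
print(accuracy_score(clf.predict(X_test), y_test))
```
%% Cell type:markdown id: tags: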
Finally, we can look at the weights of the trained classifier spatially to evaluate which voxels were most discriminatory between faces and houses. This only works because we used a linear classifier. The process involves:
1. Extract the weights from the trained classifier
2. Reshape (un-flatten) the weights into a 3D volume
3. Plot the 3D weights
> **NOTE:**
> 1. Here we use [`plot_stat_map`](https://nilearn.github.io/modules/generated/nilearn.plotting.plot_stat_map.html) from the [`nilearn`](https://nilearn.github.io/index.html) package to plot the images
> 2. [`subplots`](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplots.html) from `matplotlib.pyplot` creates a figure and axes with specified size
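A sketch of these steps, assuming the `clf`, `mask`, and `func_img` names from the cells above:
%% Cell type:code id: tags:
``` python
import numpy as np
import nibabel as nb
import matplotlib.pyplot as plt
from nilearn.plotting import plot_stat_map
# 1. extract the weights from the trained (linear) classifier
weights = clf.coef_[0]
# 2. reshape (un-flatten) the weights into a 3D volume
weight_vol = np.zeros(mask.shape)
weight_vol[mask] = weights
weight_img = nb.Nifti1Image(weight_vol, func_img.affine)
# 3. plot the 3D weights
fig, ax = plt.subplots(figsize=(12, 4))
plot_stat_map(weight_img, axes=ax, title='SVM weights')
```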
%% Cell type:markdown id: tags:
<a class="anchor" id="regression"></a>
### Regression
In this example, we will do a multivariate regression of voxel-based morphometry (VBM) against subject age.
The data come from the [OASIS](https://www.oasis-brains.org) project and have been run through a standard VBM pipeline to create VBM maps.
The example will:
1. Download the VBM maps for 150 subjects
2. Split the data into train and test sets
3. Implement a custom transformer to mask the non-GM voxels
4. Fit a `Ridge` regressor to the train data
5. Evaluate the performance of the regressor on the test data
This example is adapted from: https://nilearn.github.io/stable/auto_examples/02_decoding/plot_oasis_vbm.html#sphx-glr-auto-examples-02-decoding-plot-oasis-vbm-py
%% Cell type:markdown id: tags:
First we must import the data. If this is the first time you have run this code it will also download the data, which should take only a minute or two.
> **NOTE:**
> 1. Here we use [`fetch_oasis_vbm`](https://nilearn.github.io/stable/modules/generated/nilearn.datasets.fetch_oasis_vbm.html#nilearn.datasets.fetch_oasis_vbm) from the [`nilearn`](https://nilearn.github.io/index.html) package to download the VBM data
> 2. [`nb.load`](https://nipy.org/nibabel/gettingstarted.html) from the [`nibabel`](https://nipy.org/nibabel/index.html) package will load the image into a [`nibabel.Nifti1Image`](https://nipy.org/nibabel/reference/nibabel.nifti1.html) object. This will not load the actual image data, though.
%% Cell type:code id: tags:
``` python
import numpy as np
import nibabel as nb
from nilearn import datasets
from sklearn.model_selection import train_test_split
# fetch the VBM data from 150 subjects
n_subjects = 150
dataset_files = datasets.fetch_oasis_vbm(n_subjects=n_subjects)  # 1-2 mins depending on your connection
# load each GM map and flatten into an (n_samples, n_features) matrix
# (a sketch; the variable names follow the later cells)
imgs = np.stack([nb.load(f).get_fdata().ravel() for f in dataset_files.gray_matter_maps])
age = dataset_files.ext_vars['age'].astype(float)
# split the data into train and test sets
imgs_train, imgs_test, age_train, age_test = train_test_split(imgs, age, random_state=0)
# Sort test data by age, simply for better visualization later
perm = np.argsort(age_test)[::-1]
age_test = age_test[perm]
imgs_test = imgs_test[perm]
print(f'Number of train/test samples: {imgs_train.shape[0]}/{imgs_test.shape[0]}')
```
%% Cell type:markdown id: tags:
Each VBM map has a lot of non-GM voxels that are of no interest. Ideally we would remove these using a brain mask, but unfortunately no mask is provided with the dataset. We could create a mask by calculating a group mean image and then binarising it. However, calculating this across all subjects would cause leakage between train and test data. Ideally, the mask would be calculated on the train data and applied to the test data. This is exactly the role of a `transformer` in a `scikit-learn` pipeline.
Here we create a custom `transformer` called `MaskTransformer`. When `fit` is called, it creates a mask by taking the mean across samples (rows of `X`) and then binarising it with the specified threshold. When `transform` is called, it uses the pre-calculated mask to remove features (columns of `X`) that are outside the mask.
More information on custom transformers can be found [here](https://towardsdatascience.com/pipelines-custom-transformers-in-scikit-learn-the-step-by-step-guide-with-python-code-4a7d9b068156).
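A minimal sketch of such a `transformer` (the default threshold here is illustrative):
%% Cell type:code id: tags:
``` python
from sklearn.base import BaseEstimator, TransformerMixin

class MaskTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, threshold=0.1):
        self.threshold = threshold
    def fit(self, X, y=None):
        # create a mask by thresholding the mean across samples (rows of X)
        self.mask_ = X.mean(axis=0) > self.threshold
        return self
    def transform(self, X):
        # keep only the features (columns of X) inside the mask
        return X[:, self.mask_]
```
%% Cell type:markdown id: tags: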
Now we fit a `Ridge` regressor to the train data, and evaluate on the test data.
This example includes an additional `transformer` called `SelectPercentile`, which selects the 50% of features (voxels) with the highest ANOVA F-score with respect to the target. We include this simply to reduce the memory requirements of the regression. A sketch of the full pipeline is below.
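This sketch assumes the `MaskTransformer` defined above and the train/test split from the data-loading cell:
%% Cell type:code id: tags:
``` python
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.linear_model import Ridge
# mask non-GM voxels, keep the 50% of voxels most related to age, then fit a ridge regression
pipe = Pipeline([
    ('mask', MaskTransformer()),
    ('select', SelectPercentile(f_regression, percentile=50)),
    ('ridge', Ridge())
])
pipe.fit(imgs_train, age_train)
# evaluate on the test data (R^2 score by default)
print(pipe.score(imgs_test, age_test))
```
%% Cell type:markdown id: tags: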
Finally, we can look at the coefficients of the trained regressor spatially and see which varied most with age. The process involves:
1. Extract the regression coefficients from the trained regressor
2. Reshape (un-flatten) the weights into a 3D volume
3. Plot the 3D coefficients
> **NOTE:**
> 1. Here we use [`plot_stat_map`](https://nilearn.github.io/modules/generated/nilearn.plotting.plot_stat_map.html) from the [`nilearn`](https://nilearn.github.io/index.html) package to plot the orthographic images
> 2. [`subplots`](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplots.html) from `matplotlib.pyplot` creates a figure and axes with specified size
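A sketch of these steps, assuming the fitted `pipe` from the previous cell; `inverse_transform` undoes the feature selection, and the stored mask undoes the flattening:
%% Cell type:code id: tags:
``` python
import numpy as np
import nibabel as nb
import matplotlib.pyplot as plt
from nilearn.plotting import plot_stat_map
# 1. extract the regression coefficients and undo the feature selection
coef = pipe['select'].inverse_transform(pipe['ridge'].coef_.reshape(1, -1)).ravel()
# 2. undo the masking and reshape (un-flatten) into a 3D volume
ref_img = nb.load(dataset_files.gray_matter_maps[0])
coef_flat = np.zeros(pipe['mask'].mask_.shape)
coef_flat[pipe['mask'].mask_] = coef
coef_img = nb.Nifti1Image(coef_flat.reshape(ref_img.shape), ref_img.affine)
# 3. plot the 3D coefficients
fig, ax = plt.subplots(figsize=(12, 4))
plot_stat_map(coef_img, bg_img=ref_img, axes=ax, title='Ridge coefficients')
```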
%% Cell type:markdown id: tags:
<a class="anchor" id="clustering"></a>
### Clustering
In this example, we will cluster the intensities of an anatomical image to label different tissues. A Gaussian mixture model will be fit by expectation-maximisation. This is similar to how `FSL fast` works for tissue segmentation.
The example will:
1. Fetch an anatomical image
2. Skull-strip the image using FSL tools
3. Fit a Gaussian mixture model to the voxel intensities
%% Cell type:markdown id: tags:
First import the data. In this example we use a single anatomical image from the `haxby` dataset that was used in the classification example.
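A sketch of this step, assuming FSL is installed and its `bet` tool is available through the `fslpy` wrappers (the output filename is illustrative):
%% Cell type:code id: tags:
``` python
from nilearn import datasets
from fsl.wrappers import bet
# fetch the dataset (includes an anatomical image)
haxby_dataset = datasets.fetch_haxby()
anat = haxby_dataset.anat[0]
# skull-strip the anatomical image with FSL BET
anat_brain = 'anat_brain.nii.gz'
bet(anat, anat_brain)
```
%% Cell type:markdown id: tags: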
Now load the skull-stripped anatomical image, remove the non-brain voxels (intensity==0), and flatten into an `(n_samples, n_features)` matrix for use with `scikit-learn`. In this case, samples will be individual voxels, and there will only be a single feature.
%% Cell type:code id: tags:
``` python
import nibabel as nb
import numpy as np
# load brain image
anat_img = nb.load(anat_brain)
anat_data = anat_img.get_fdata()
# create brain mask
mask = anat_data > 0
# remove non-brain voxels and reshape to (n_samples, n_features) for scikit-learn
anat_data = anat_data[mask][:, np.newaxis]
print(f'Flattened shape: {anat_data.shape}')
```
%% Cell type:markdown id: tags:
Fit a Gaussian mixture model with 4 Gaussians to the data, then predict the class/label on the same data. Given that both `fit` and `predict` will be used on the same data, we can use the convenience method `fit_predict`, which does both. Finally, reshape the classes/labels into a 3D volume and plot.
%% Cell type:code id: tags:
``` python
from sklearn.mixture import GaussianMixture  # import gaussian mixture model
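# A sketch of the remaining steps described above, assuming the `anat_data`,
# `mask`, and `anat_img` names from the previous cell
import numpy as np
import matplotlib.pyplot as plt
# fit a GMM with 4 components and predict a tissue class for each voxel
gmm = GaussianMixture(n_components=4, random_state=0)
labels = gmm.fit_predict(anat_data)
# reshape (un-flatten) the labels into a 3D volume (offset by 1 so background stays 0)
label_vol = np.zeros(anat_img.shape)
label_vol[mask] = labels + 1
# plot a mid-axial slice of the tissue labels
plt.imshow(label_vol[:, :, label_vol.shape[2] // 2].T, origin='lower')
plt.show()
```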