Commit eacbe25e authored by Sean Fitzgibbon

Merge branch 'tweak_scikit' into 'master'

tweaks to docs

See merge request fsl/win-pytreat!30
parents 74bd46ee 1041e83e
%% Cell type:markdown id: tags:
# scikit-learn
Machine learning in Python.
- Simple and efficient tools for predictive data analysis
- Built on `numpy`, `scipy`, and `matplotlib`
- Open source, commercially usable - BSD license
- \> 2.1k contributors, 22.5k forks, 48.5k "stars"
- Current version: 1.0.2
- Powerful generic API
- Very well documented
- Core techniques, hundreds of options for:
    - Classification
    - Regression
    - Clustering
    - Dimensionality Reduction
    - Model Selection
    - Preprocessing
Some useful links are:
- [Main website](https://scikit-learn.org/stable/index.html)
- [User guide](https://scikit-learn.org/stable/user_guide.html)
- [API](https://scikit-learn.org/stable/modules/classes.html)
- [GitHub](https://github.com/scikit-learn/scikit-learn)
## Notebook Overview:
* [Getting Started](#getting-started)
* [Estimators: fitting and predicting](#estimators)
* [Transformers](#transformers)
* [Pipelines: chaining transforms and estimators](#pipelines)
* [Model evaluation](#model-evaluation)
* [Neuroimaging examples](#examples)
    * [Classification](#classification)
    * [Regression](#regression)
    * [Clustering](#clustering)
%% Cell type:markdown id: tags:
<a class="anchor" id="getting-started"></a>
## Getting Started
Adapted from the `scikit-learn` "Getting Started" documentation: https://scikit-learn.org/stable/getting_started.html
Three important concepts to understand for using `scikit-learn`:
1. `estimator` objects and their `fit` and `predict` methods for fitting data and making predictions
2. `transformer` objects and their `fit` and `transform` methods for pre/post-processing data
3. `pipeline` objects for chaining together `transformers` and `estimators` into a machine learning pipeline
%% Cell type:markdown id: tags:
<a class="anchor" id="estimators"></a>
### Estimators: fitting and predicting
Each machine learning model in `scikit-learn` is implemented as an [`estimator`](https://scikit-learn.org/stable/glossary.html#term-estimators) object. Here we instantiate a [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier) `estimator`:
%% Cell type:code id: tags:
``` python
# import RF estimator
from sklearn.ensemble import RandomForestClassifier
# instantiate RF estimator
clf = RandomForestClassifier(random_state=0)
print(clf)
```
%% Cell type:markdown id: tags:
After creation, the `estimator` can be **fit** to the training data using the `fit` method.
The `fit` method accepts 2 inputs:
1. `X` is the training input samples of shape (`n_samples`, `n_features`). Thus, rows=samples and columns=features
2. `y` is the target values (class labels in classification, real numbers in regression) of shape (`n_samples`,) for one output, or (`n_samples`, `n_outputs`) for multiple outputs.
> Note: Both `X` and `y` are usually numpy arrays or equivalent array-like data types, including a [`pandas DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).
%% Cell type:code id: tags:
``` python
# create some toy data
X = [[ 1,  2,  3],  # 2 samples, 3 features
     [11, 12, 13]]
y = [0, 1] # classes of each sample
# fit the model to the data
clf.fit(X, y)
```
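%% Cell type:markdown id: tags:
For example, the same fit with `X` and `y` held in `numpy` arrays rather than lists (a minimal sketch to illustrate the note above; the fitted model is unchanged):
%% Cell type:code id: tags:
``` python
import numpy as np
# the same toy data as above, but as numpy arrays
X_arr = np.array([[ 1,  2,  3],
                  [11, 12, 13]])
y_arr = np.array([0, 1])
# refit the classifier - equivalent to fitting with plain lists
clf.fit(X_arr, y_arr)
```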
%% Cell type:markdown id: tags:
Once trained, the `estimator` can make predictions on new data using the `predict` method.
%% Cell type:code id: tags:
``` python
# predict classes of new data
X_test = [[ 4,  5,  6],
          [14, 15, 16]]
clf.predict(X_test)
```
%% Cell type:markdown id: tags:
Importantly, this `fit` and `predict` interface is consistent across different estimators, making it very easy to change estimators within your code. For example, here we swap the Random Forest `estimator` for a Support Vector Machine `estimator`:
%% Cell type:code id: tags:
``` python
# import an SVM estimator
from sklearn.svm import SVC
# instantiate SVM estimator
clf = SVC(random_state=0)
# fit and predict
clf.fit(X, y)
clf.predict(X_test)
```
%% Cell type:markdown id: tags:
**Exercise:** `scikit-learn` implements a wide array of machine learning `estimators`. Try fitting and predicting with a different model. See the `scikit-learn` [documentation](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) for model options.
%% Cell type:code id: tags:
``` python
# YOUR CODE GOES HERE
```
%% Cell type:markdown id: tags:
<a class="anchor" id="transformers"></a>
### Transformers
The `transformer` is a special object that follows the same API as an `estimator` and allows you to apply pre-processing and/or post-processing transforms to the data in your machine learning pipeline. The `transformer` object has a `transform` method instead of a `predict` method.
In this example we use a `transformer` to standardise the features (i.e. remove the mean and scale to unit variance). The `fit` method calculates the mean and variance parameters from the data, and the `transform` method will do the scaling.
%% Cell type:code id: tags:
``` python
# import StandardScaler transformer
from sklearn.preprocessing import StandardScaler
# create some toy data
X = [[0,  15],
     [1, -10]]
# instantiate StandardScaler transformer
scaler = StandardScaler()
# fit the transform to the data
scaler.fit(X)
# apply transform
scaler.transform(X)
```
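%% Cell type:markdown id: tags:
As an aside, `transformer` objects also provide a `fit_transform` convenience method that performs the two steps above in a single call:
%% Cell type:code id: tags:
``` python
# fit and transform in one call (equivalent to fit followed by transform)
scaler.fit_transform(X)
```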
%% Cell type:markdown id: tags:
**Exercise:** Look at the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) for `StandardScaler` and see if you can extract the mean and variance estimates from the fitted `transformer`.
%% Cell type:code id: tags:
``` python
# YOUR CODE GOES HERE
```
%% Cell type:markdown id: tags:
> Note: Once you run the `fit` method on either a `transformer` or an `estimator`, you can access the parameters that have been learned.
For example, for the `transformer` used above to standardise the data:
%% Cell type:code id: tags:
``` python
# Look inside the object to see the learned parameters (and other object properties)
vars(scaler)
```
%% Cell type:markdown id: tags:
<a class="anchor" id="pipelines"></a>
### Pipelines: chaining transforms and estimators
A typical machine learning pipeline will often involve numerous pre-processing transforms and one or more estimators. The `pipeline` object can be used to combine `transformer` and `estimator` objects into a single object that has an API that is the same as a regular `estimator`.
In this example we create a very simple `pipeline` comprising a `StandardScaler` transformer and a Random Forest `estimator`.
> Note: in this example the `estimator` predictions are made with the same data (`X`) on which the `estimator` was trained. Typically, independent **train** and **test** data would be used for fitting and prediction. In the next section, [Model evaluation](#model-evaluation), we will see how to split data into train and test sets.
%% Cell type:code id: tags:
``` python
# imports
from sklearn.ensemble import RandomForestClassifier # estimator
from sklearn.preprocessing import StandardScaler # transformer
from sklearn.pipeline import Pipeline # pipeline
from sklearn.datasets import load_iris # function to load demo dataset
# create a pipeline object
# using a list of transformers/estimators (and give them names)
pipe = Pipeline(
    [
        ('my_scaler', StandardScaler()),
        ('my_classifier', RandomForestClassifier())
    ]
)
# load the iris dataset
# See https://en.wikipedia.org/wiki/Iris_flower_data_set
X, y = load_iris(return_X_y=True)
# fit the whole pipeline
pipe.fit(X, y)
# predict classes from the training data
pipe.predict(X)
```
%% Cell type:markdown id: tags:
When running a pipeline, the different components of the pipeline are run sequentially.
You may also want to run the pipeline "up to a point", or check the parameters of a component of the pipeline. For example, in the pipeline above, we may want to see the parameters of the `StandardScaler` or only run that part on some data.
%% Cell type:code id: tags:
``` python
# Check the parameters learned by the scaler
print(pipe['my_scaler'].mean_)
print(pipe['my_scaler'].var_)
# Run only the scaler on some data
X_t = pipe['my_scaler'].transform(X)
# Running the full pipeline is equivalent to running the `transform` method on the different bits
# sequentially and then running a `predict` at the end
print(pipe.predict(X))
print(pipe['my_classifier'].predict(pipe['my_scaler'].transform(X)))
```
%% Cell type:markdown id: tags:
<a class="anchor" id="model-evaluation"></a>
### Model evaluation
In machine learning we often want to train a model on some data, and then apply the model to new "unseen" data to make predictions. Thus we want our model to **generalise** well to unseen data. To assess generalisation, we need a dataset to train/fit the model AND an independent dataset to evaluate the model.
The simplest way to do this is to just split the data into a train set and a test set, then fit the model to the train set, and make predictions on the test set. There are many methods for splitting data that are implemented within `scikit-learn` in the [`model_selection`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection) module.
To evaluate a model we also need a metric of performance. There are many metrics used in machine learning, and many of them are implemented within `scikit-learn` in the [`metrics`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) module.
In this example, we use the simple `pipeline` constructed earlier and extend it to include a simple train/test split using the `train_test_split` function. A major advantage of using the `pipeline` object is that it prevents leakage between your train and test data, which would otherwise invalidate your evaluation. For the performance metric we use `accuracy_score`, which is simply the proportion of correct predictions.
%% Cell type:code id: tags:
``` python
# imports
from sklearn.ensemble import RandomForestClassifier # estimator
from sklearn.preprocessing import StandardScaler # transformer
from sklearn.pipeline import Pipeline # pipeline
from sklearn.datasets import load_iris # function to load demo dataset
from sklearn.model_selection import train_test_split # function to split the data into train and test
from sklearn.metrics import accuracy_score # performance metric
# create a pipeline object
pipe = Pipeline(
    [
        ('my_scaler', StandardScaler()),
        ('my_classifier', RandomForestClassifier())
    ]
)
# load the iris dataset and split into train and test
# See https://en.wikipedia.org/wiki/Iris_flower_data_set
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# fit the pipeline to the train data
pipe.fit(X_train, y_train)
# predict classes from the test data and calculate the accuracy
accuracy_score(y_test, pipe.predict(X_test))
```
%% Cell type:markdown id: tags:
It is typical to perform the train/test splits within the folds of a [cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation) to make more efficient use of the available data.
![cross-validation](https://scikit-learn.org/stable/_images/grid_search_cross_validation.png)
`scikit-learn` makes cross-validation very easy with the `cross_val_score` function. Furthermore, by using the `pipeline` object we can ensure that there is no data leakage with this more complex training scheme.
In this example, we use `cross_val_score` combined with the `KFold` cross-validation iterator to evaluate the model with 5-fold cross-validation:
%% Cell type:code id: tags:
``` python
# imports
from sklearn.ensemble import RandomForestClassifier # estimator
from sklearn.preprocessing import StandardScaler # transformer
from sklearn.pipeline import Pipeline # pipeline
from sklearn.datasets import load_iris # function to load demo dataset
from sklearn.model_selection import cross_val_score, KFold # cross-validation
# create a pipeline object
pipe = Pipeline(
    [
        ('my_scaler', StandardScaler()),
        ('my_classifier', RandomForestClassifier())
    ]
)
# load the iris dataset
X, y = load_iris(return_X_y=True)
# run cross-validation (CV)
cv = KFold(n_splits=5)
result = cross_val_score(pipe, X, y, cv=cv, scoring='accuracy')
for idx, acc in enumerate(result):
    print(f'Accuracy fold {idx}: {acc*100:0.2f}%')
```
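%% Cell type:markdown id: tags:
A common way to summarise the per-fold scores is their mean and standard deviation (a small addition to the example above):
%% Cell type:code id: tags:
``` python
# summarise the cross-validation scores across folds
print(f'Mean accuracy: {result.mean()*100:0.2f}% (std: {result.std()*100:0.2f}%)')
```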
%% Cell type:markdown id: tags:
Thus, rather than a single accuracy score, you obtain a distribution of scores across the folds.
**Exercise:** Another commonly used cross-validation scheme is `StratifiedKFold` which is a variation of k-fold which uses stratified folds. A stratified fold is where each set (train/test) within a fold contains approximately the same percentage of samples of each target class as the complete set. Try training the pipeline from the previous example with `StratifiedKFold` and `k=6`. The documentation for `StratifiedKFold` is [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn-model-selection-stratifiedkfold).
%% Cell type:code id: tags:
``` python
# YOUR CODE GOES HERE
```
%% Cell type:markdown id: tags:
<a class="anchor" id="examples"></a>
## Neuroimaging Examples
Here we present examples of using `scikit-learn` for classification, regression, and clustering on neuroimaging data.
%% Cell type:markdown id: tags:
<a class="anchor" id="classification"></a>
### Classification
In this example, we will train a `scikit-learn` classifier to predict from fMRI whether the subject is looking at an image of a face or a house.
The example will:
1. Download the fMRI for a single subject
2. Split the data into train and test sets
3. Fit a linear SVM to the train data
4. Evaluate the performance of the classifier on the test data
This example is adapted from: https://nilearn.github.io/stable/auto_examples/02_decoding/plot_haxby_anova_svm.html#sphx-glr-auto-examples-02-decoding-plot-haxby-anova-svm-py
%% Cell type:markdown id: tags:
First we must import the data. If this is the first time you have run this code it will also download the data, which should take less than a minute.
> **NOTE:**
> 1. Here we use [`fetch_haxby`](https://nilearn.github.io/stable/modules/generated/nilearn.datasets.fetch_haxby.html#nilearn.datasets.fetch_haxby) from the [`nilearn`](https://nilearn.github.io/index.html) package to download the demo data
> 2. [`nb.load`](https://nipy.org/nibabel/gettingstarted.html) from the [`nibabel`](https://nipy.org/nibabel/index.html) package will load the image into a [`nibabel.Nifti1Image`](https://nipy.org/nibabel/reference/nibabel.nifti1.html) object. This will not load the actual voxel data, though.
%% Cell type:code id: tags:
``` python
from nilearn import datasets
import nibabel as nb
# load dataset for 1 subject (download if required, ~30s)
# https://www.science.org/doi/10.1126/science.1063736
haxby_dataset = datasets.fetch_haxby()
# select and load images from dataset
func_img = nb.load(haxby_dataset.func[0])
mask_img = nb.load(haxby_dataset.mask)
anat_img = nb.load(haxby_dataset.anat[0])
print(f'Func dimension x,y,z: {func_img.shape[:-1]}')
print(f'Func vols/TRs: {func_img.shape[-1]}')
```
%% Cell type:markdown id: tags:
Next, load the behavioural data, which describes which task was being performed for each volume of the fMRI. We are only interested in a binary classification, so we select volumes where the subject is looking at either a face or a house.
> **NOTE:**
> 1. Here we use [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) from the [`pandas`](https://pandas.pydata.org/docs/index.html) package to load the behavioural data
> 2. We use [`index_img`](https://nilearn.github.io/modules/generated/nilearn.image.index_img.html) from the [`nilearn`](https://nilearn.github.io/index.html) package to select the face and house volumes from the fMRI image.
%% Cell type:code id: tags:
``` python
import pandas as pd
from nilearn.image import index_img
# load behavioral information
behavioral = pd.read_csv(haxby_dataset.session_target[0], sep=" ")
conditions = behavioral['labels']
# create mask for faces and houses
condition_mask = conditions.isin(['face', 'house'])
# select face/house volumes
func_img = index_img(func_img, condition_mask)
conditions = conditions[condition_mask]
# Confirm that we now have 2 conditions (face and house)
print(f'Number of selected volumes: {conditions.shape[0]}')
print(f'Unique conditions: {conditions.unique()}')
```
%% Cell type:markdown id: tags:
Now load the fMRI data (face and house conditions only), flatten it into a `(n_sample, n_feature)` matrix for use with `scikit-learn`, and mask out non-brain voxels. In this case, samples will be individual volumes, and features will be the voxels within the brain mask.
%% Cell type:code id: tags:
``` python
# index the 4D data with the 3D brain mask -> (n_brain_voxels, n_vols), then transpose to (n_vols, n_brain_voxels)
func_data = func_img.get_fdata()[mask_img.get_fdata() != 0].T
print(f'Flattened data shape (n_sample, n_feature): {func_data.shape}')
```
%% Cell type:markdown id: tags:
Split the data into train/test with a 60/40 split.
%% Cell type:code id: tags:
``` python
# Split data into training set and test set (60% train)
from sklearn.model_selection import train_test_split
func_train, func_test, condition_train, condition_test = train_test_split(func_data, conditions, train_size=0.6, random_state=0)
print(f'Number of train/test samples: {func_train.shape[0]}/{func_test.shape[0]}')
```
%% Cell type:markdown id: tags:
Now fit an SVM classifier to the train data, and evaluate it on the test data.
%% Cell type:code id: tags:
``` python
from sklearn.svm import SVC # estimator
from sklearn.preprocessing import StandardScaler # transformer
from sklearn.pipeline import Pipeline # pipeline
from sklearn.metrics import (accuracy_score,          # metrics for accuracy and the confusion matrix
                             confusion_matrix, ConfusionMatrixDisplay)
# construct classification pipeline
pipe = Pipeline(
    [
        ('scaler', StandardScaler()),
        ('clf', SVC(kernel='linear')),
    ]
)
# fit and predict
pipe.fit(func_train, condition_train)
condition_pred = pipe.predict(func_test)
# evaluate
acc = accuracy_score(condition_test, condition_pred)
print(f'Accuracy = {acc*100:.2f}%')
cm = confusion_matrix(condition_test, condition_pred, labels=pipe['clf'].classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=pipe['clf'].classes_)
disp.plot();
```
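%% Cell type:markdown id: tags:
As an optional extra (not part of the original analysis), `classification_report` from `sklearn.metrics` gives a per-class breakdown of precision, recall, and F1-score:
%% Cell type:code id: tags:
``` python
from sklearn.metrics import classification_report
# per-class precision, recall, and F1-score for the test predictions
print(classification_report(condition_test, condition_pred))
```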
%% Cell type:markdown id: tags:
Finally, we can look at the weights of the trained classifier spatially to evaluate which voxels were most discriminatory between faces and houses. This only works because we used a linear classifier. The process involves:
1. Extract the weights from the trained classifier
2. Reshape (un-flatten) the weights into a 3D volume
3. Plot the 3D weights
> **NOTE:**
> 1. Here we use [`plot_stat_map`](https://nilearn.github.io/modules/generated/nilearn.plotting.plot_stat_map.html) from the [`nilearn`](https://nilearn.github.io/index.html) package to plot the images
> 2. [`subplots`](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplots.html) from `matplotlib.pyplot` creates a figure and axes with specified size
%% Cell type:code id: tags:
``` python
import numpy as np # functions for reshaping arrays
import matplotlib.pyplot as plt
from nilearn import plotting
# get coefficients/weights from the classifier
coef = pipe['clf'].coef_.squeeze()
# reshape coefficients to 3D and create nifti
coef_3d = np.zeros(func_img.shape[:-1]).ravel()
coef_3d[(mask_img.get_fdata()!=0).ravel()] = coef
coef_3d = np.reshape(coef_3d, func_img.shape[:-1])
coef_img = nb.Nifti1Image(coef_3d, header=func_img.header, affine=func_img.affine)
# plot coefficients
fig, ax = plt.subplots(figsize=(20, 5))
plotting.plot_stat_map(coef_img, anat_img, title="discriminating weights", display_mode="z", cut_coords=7, axes=ax, threshold=0.0005, dim=-1)
```
%% Cell type:markdown id: tags:
<a class="anchor" id="regression"></a>
### Regression
In this example, we will perform a multivariate regression of voxel-based morphometry (VBM) maps against subject age.
The data come from the [OASIS](https://www.oasis-brains.org) project and have been run through a standard VBM pipeline to create the VBM maps.
The example will:
1. Download the VBM maps for 150 subjects
2. Split the data into train and test sets
3. Implement a custom transformer to mask the non-GM voxels
4. Fit a `Ridge` regressor to the train data
5. Evaluate the performance of the regressor on the test data
This example is adapted from: https://nilearn.github.io/stable/auto_examples/02_decoding/plot_oasis_vbm.html#sphx-glr-auto-examples-02-decoding-plot-oasis-vbm-py
%% Cell type:markdown id: tags:
First we must import the data. If this is the first time you have run this code it will also download the data, which should take only a minute or two.
> **NOTE:**
> 1. Here we use [`fetch_oasis_vbm`](https://nilearn.github.io/stable/modules/generated/nilearn.datasets.fetch_oasis_vbm.html#nilearn.datasets.fetch_oasis_vbm) from the [`nilearn`](https://nilearn.github.io/index.html) package to download the VBM data
> 2. [`nb.load`](https://nipy.org/nibabel/gettingstarted.html) from the [`nibabel`](https://nipy.org/nibabel/index.html) package will load the image into a [`nibabel.Nifti1Image`](https://nipy.org/nibabel/reference/nibabel.nifti1.html) object. This will not load the actual voxel data, though.
%% Cell type:code id: tags:
``` python
import numpy as np
import nibabel as nb
from nilearn import datasets
# fetch the VBM data from 150 subjects
n_subjects = 150
dataset_files = datasets.fetch_oasis_vbm(n_subjects=n_subjects) # 1-2 mins depending on your connection
# extract age from dataset
age = dataset_files.ext_vars['age'].astype(float)
age = np.array(age)
# load GM maps
imgs = [nb.load(f) for f in dataset_files.gray_matter_maps]
imgs_data = np.stack([img0.get_fdata().ravel() for img0 in imgs], axis=1).T # ~10s
print(f'Flattened data shape (n_sample, n_feature): {imgs_data.shape}')
```
%% Cell type:markdown id: tags:
Plot the distribution of the subjects age.
%% Cell type:code id: tags:
``` python
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(age, bins=20)
ax.set_xlabel('age (years)')
ax.set_ylabel('count')
```
%% Cell type:markdown id: tags:
Split the data into train/test with a 60/40 split.
%% Cell type:code id: tags:
``` python
# Split data into training set and test set (60% train)
from sklearn.model_selection import train_test_split
imgs_train, imgs_test, age_train, age_test = train_test_split(imgs_data, age, train_size=.6, random_state=0)
# Sort test data by age, simply for better visualization later
perm = np.argsort(age_test)[::-1]
age_test = age_test[perm]
imgs_test = imgs_test[perm]
print(f'Number of train/test samples: {imgs_train.shape[0]}/{imgs_test.shape[0]}')
```
%% Cell type:markdown id: tags:
Each VBM map has a lot of non-GM voxels that are of no interest. Ideally we would remove these using a brain mask, but unfortunately no mask is provided with the dataset. We could create a mask by calculating a group mean image and then binarising it, but doing this on all of the data would cause leakage between train and test sets. Instead, the mask should be calculated on the train data and then applied to the test data. This is exactly the role of a `transformer` in a `scikit-learn` pipeline.
Here we create a custom `transformer` called `MaskTransformer`. When `fit` is called, it creates a mask by taking the mean across samples (rows of `X`) and binarising it with the specified threshold. When `transform` is called, it uses the pre-calculated mask to remove features (columns of `X`, i.e. voxels) that are outside the mask.
More information on custom transformers can be found [here](https://towardsdatascience.com/pipelines-custom-transformers-in-scikit-learn-the-step-by-step-guide-with-python-code-4a7d9b068156).
%% Cell type:code id: tags:
``` python
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np
class MaskTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, threshold=0.3):
        self.threshold = threshold
    def fit(self, X, y=None):
        # create a binary mask from the mean (across samples) of the training data
        self.mask = np.mean(X, axis=0) > self.threshold
        return self
    def transform(self, X, y=None):
        # keep only the features (voxels) that are inside the mask
        X_ = X.copy()
        return X_[:, self.mask]
    def inverse_transform(self, X):
        # put masked features back into a full-size (zero-filled) array
        X_r = np.zeros((X.shape[0], self.mask.shape[0]))
        X_r[:, self.mask] = X
        return X_r
```
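%% Cell type:markdown id: tags:
Before placing it in a pipeline, the custom `transformer` can be checked on its own using the usual `fit`/`transform` interface (a minimal sketch; the mask is learned from the training images only, using the default threshold of 0.3):
%% Cell type:code id: tags:
``` python
# fit the mask on the training images and apply it to train and test data
masker = MaskTransformer()
masker.fit(imgs_train)
print(f'Voxels retained by the mask: {masker.mask.sum()}')
print(f'Masked train data shape: {masker.transform(imgs_train).shape}')
print(f'Masked test data shape:  {masker.transform(imgs_test).shape}')
```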
%% Cell type:markdown id: tags:
Now we fit a `Ridge` regressor to the train data, and evaluate on the test data.
This example includes an additional `transformer` called `SelectPercentile`, which selects the 50% of features (voxels) with the highest ANOVA F-score against the target. We include this simply to reduce the memory requirements of the regression.
%% Cell type:code id: tags:
``` python
from sklearn.linear_model import Ridge # estimator
from sklearn.preprocessing import StandardScaler # transformer
from sklearn.pipeline import make_pipeline # function to construct pipeline
from sklearn.feature_selection import SelectPercentile
# regr = Ridge(alpha=0.5) # instantiate ridge regression estimator
# masker = MaskTransformer() # instantiate mask transformer
# construct regression pipeline
# (this is a shorthand for creating a pipeline; you cannot choose the step names, they are set automatically to the lower-cased class names)
pipe = make_pipeline(
    MaskTransformer(),
    SelectPercentile(percentile=50),
    StandardScaler(),
    Ridge(alpha=0.5),
)
# fit and predict
pipe.fit(imgs_train, age_train)
age_pred = pipe.predict(imgs_test)
# evaluate
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(age_test, label='True age')
ax.plot(age_pred, label='Predicted age')
ax.set_ylabel('age (years)')
ax.set_xlabel('samples')
plt.legend()
mae = np.mean(np.abs(age_test - age_pred))
ax.set_title(f'Mean absolute error {mae:.2f} years')
```
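%% Cell type:markdown id: tags:
Alongside the mean absolute error, a simple complementary summary (an assumed extra, not part of the original example) is the correlation between true and predicted age:
%% Cell type:code id: tags:
``` python
# Pearson correlation between true and predicted age
r = np.corrcoef(age_test, age_pred)[0, 1]
print(f'Correlation between true and predicted age: r = {r:.2f}')
```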
%% Cell type:markdown id: tags:
Finally, we can look at the coefficients of the trained regressor spatially to see which voxels were most predictive of age. The process involves:
1. Extract the regression coefficients from the trained regressor
2. Reshape (un-flatten) the weights into a 3D volume
3. Plot the 3D coefficients
> **NOTE:**
> 1. Here we use [`plot_stat_map`](https://nilearn.github.io/modules/generated/nilearn.plotting.plot_stat_map.html) from the [`nilearn`](https://nilearn.github.io/index.html) package to plot the orthographic images
> 2. [`subplots`](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplots.html) from `matplotlib.pyplot` creates a figure and axes with specified size
%% Cell type:code id: tags:
``` python
from nilearn import plotting
# get regression coefficients from the estimator
coef = pipe['ridge'].coef_[np.newaxis, :]
# inverse the transforms
coef = pipe['selectpercentile'].inverse_transform(coef)
coef = pipe['masktransformer'].inverse_transform(coef)
# reshape coefficients into 3D volume
coef_3d = np.reshape(coef, imgs[0].shape)
coef_img = nb.Nifti1Image(coef_3d, header=imgs[0].header, affine=imgs[0].affine)
# plot
fig, ax = plt.subplots(figsize=(20, 5))
plotting.plot_stat_map(coef_img, imgs[0], title="regression coeffs", display_mode="z", cut_coords=5, axes=ax, threshold=0.002)
```
%% Cell type:markdown id: tags:
<a class="anchor" id="clustering"></a>
### Clustering
In this example, we will cluster the intensities of an anatomical image to label different tissues. A Gaussian mixture model will be fit by expectation-maximisation. This is similar to how FSL's `fast` tool performs tissue segmentation.
The example will:
1. Fetch an anatomical image
2. Skull-strip the image using FSL tools
3. Fit a Gaussian mixture model to the voxel intensities
%% Cell type:markdown id: tags:
First import the data. In this example we use a single anatomical image from the `haxby` dataset that was used in the classification example.
%% Cell type:code id: tags:
``` python
from nilearn import datasets
from nilearn import plotting
import matplotlib.pyplot as plt
# fetch anatomical image
# https://www.science.org/doi/10.1126/science.1063736
haxby_dataset = datasets.fetch_haxby() # 1 subject, ~30s
anat = haxby_dataset.anat[0]
# plot
fig, ax = plt.subplots(figsize=(15, 7))
plotting.plot_anat(anat, dim=-1, axes=ax)
```
%% Cell type:markdown id: tags:
Now we limit the field-of-view, and skull-strip the anatomical image using `FSL` tools, specifically `robustfov` and `bet`.
> **Note**:
> 1. Here we use the `!` operator to execute the command in the shell
> 2. The `{}` will expand the contained python variable in the shell
%% Cell type:code id: tags:
``` python
# extract brain using FSL tools
anat_brain = anat.replace('.nii.gz', '_brain.nii.gz')
! robustfov -i {anat} -r {anat_brain}
! bet {anat_brain} {anat_brain}
# plot brain extraction
fig, ax = plt.subplots(figsize=(15,7))
plotting.plot_anat(anat_brain, dim=-1, axes=ax)
```
%% Cell type:markdown id: tags:
Now load the skull-stripped anatomical image, remove the non-brain voxels (intensity==0), and flatten into a `(n_sample, n_feature)` matrix for use with `scikit-learn`. In this case, samples will be individual voxels, and there will only be a single feature.
%% Cell type:code id: tags:
``` python
import nibabel as nb
import numpy as np
# load brain image
anat_img = nb.load(anat_brain)
anat_data = anat_img.get_fdata()
# create brain mask
mask = anat_data > 0
# remove non-brain voxels and reshape to (n_sample, n_features) for scikit-learn
anat_data = anat_data[mask][:, np.newaxis]
print(f'Flattened shape: {anat_data.shape}')
```
%% Cell type:markdown id: tags:
Fit a Gaussian mixture model with 4 components to the data, then predict the class/label for the same data. Given that both `fit` and `predict` are applied to the same data, we can use the convenience method `fit_predict`, which does both. Finally, reshape the classes/labels into a 3D volume and plot.
%% Cell type:code id: tags:
``` python
from sklearn.mixture import GaussianMixture # import gaussian mixture model
# fit the mixture model and predict a label for each voxel
labels = GaussianMixture(n_components=4, random_state=0).fit_predict(anat_data)
# reshape labels into a 3D volume
labels_3d = np.zeros(anat_img.shape).ravel()
labels_3d[mask.ravel()] = labels+1
labels_3d = np.reshape(labels_3d, anat_img.shape)
labels_img = nb.Nifti1Image(labels_3d, header=anat_img.header, affine=anat_img.affine)
# plot labels
fig, ax = plt.subplots(figsize=(20,5))
disp = plotting.plot_anat(anat_brain, dim=-1, axes=ax, display_mode="z", cut_coords=7,)
disp.add_overlay(labels_img)
```
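%% Cell type:markdown id: tags:
To relate each label to a tissue type, we can inspect the mean intensity of each Gaussian component. The model above was not stored, so the sketch below refits it (which should give the same fit, since `random_state` is fixed) and prints the component means:
%% Cell type:code id: tags:
``` python
# refit the mixture model, keeping the fitted object so its parameters can be inspected
gmm = GaussianMixture(n_components=4, random_state=0).fit(anat_data)
# mean intensity of each component - brighter tissues have higher means
print(np.sort(gmm.means_.ravel()))
```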
%% Cell type:markdown id: tags:
That's all folks!