* [Estimators: fitting and predicting](#estimators)
* [Transformers](#transformers)
* [Pipelines: chaining transforms and estimators](#pipelines)
* [Model evaluation](#model-evaluation)
* [Neuroimaging examples](#examples)
* [Classification](#classification)
* [Regression](#regression)
* [Clustering](#clustering)
%% Cell type:markdown id: tags:
<a class="anchor" id="getting-started"></a>
## Getting Started
Adapted from the `scikit-learn` "Getting Started" documentation: https://scikit-learn.org/stable/getting_started.html
There are three important concepts to understand when using `scikit-learn`:
1. `estimator` objects and their `fit` and `predict` methods for fitting data and making predictions
2. `transformer` objects and their `fit` and `transform` methods for pre/post-processing data
3. `pipeline` objects for chaining together `transformers` and `estimators` into a machine learning pipeline
%% Cell type:markdown id: tags:
<a class="anchor" id="estimators"></a>
### Estimators: fitting and predicting
Each machine learning model in `scikit-learn` is implemented as an [`estimator`](https://scikit-learn.org/stable/glossary.html#term-estimators) object. Here we instantiate a [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier) `estimator`:
%% Cell type:code id: tags:
``` python
# import RF estimator
from sklearn.ensemble import RandomForestClassifier
# instantiate RF estimator
clf = RandomForestClassifier(random_state=0)
print(clf)
```
%% Cell type:markdown id: tags:
After creation, the `estimator` can be **fit** to the training data using the `fit` method.
The `fit` method accepts two inputs:
1. `X` is the training input samples of shape (`n_samples`, `n_features`). Thus, rows=samples and columns=features.
2. `y` is the target values (class labels in classification, real numbers in regression) of shape (`n_samples`,) for one output, or (`n_samples`, `n_outputs`) for multiple outputs.
> Note: Both `X` and `y` are usually numpy arrays or equivalent array-like data types, including a [`pandas DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).
%% Cell type:code id: tags:
``` python
# create some toy data
X = [[1, 2, 3],    # 2 samples, 3 features
     [11, 12, 13]]
y = [0, 1]         # classes of each sample
# fit the model to the data
clf.fit(X, y)
```
%% Cell type:markdown id: tags:
Once trained, the `estimator` can make predictions on new data using the `predict` method.
%% Cell type:code id: tags:
``` python
# predict classes of new data
X_test = [[4, 5, 6],
          [14, 15, 16]]
clf.predict(X_test)
```
%% Cell type:markdown id: tags:
Importantly, this `fit` and `predict` interface is consistent across different estimators, making it very easy to change estimators within your code. For example, here we swap the Random Forest `estimator` for a Support Vector Machine `estimator`:
%% Cell type:code id: tags:
``` python
# import an SVM estimator
from sklearn.svm import SVC
# instantiate SVM estimator
clf = SVC(random_state=0)
# fit and predict
clf.fit(X, y)
clf.predict(X_test)
```
%% Cell type:markdown id: tags:
**Exercise:** `scikit-learn` implements a wide array of machine learning `estimators`. Try fitting and predicting with a different model. See the `scikit-learn` [documentation](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) for model options.
%% Cell type:code id: tags:
``` python
# YOUR CODE GOES HERE
```
%% Cell type:markdown id: tags:
<a class="anchor" id="transformers"></a>
### Transformers
The `transformer` is a special object that follows the same API as an `estimator`, and allows you to apply pre-processing and/or post-processing transforms to the data in your machine learning pipeline. A `transformer` object has a `transform` method instead of a `predict` method.
In this example we use a `transformer` to standardise the features (i.e. remove the mean and scale to unit variance). The `fit` method calculates the mean and variance parameters from the data, and the `transform` method does the scaling.
%% Cell type:code id: tags:
``` python
# import StandardScaler transformer
from sklearn.preprocessing import StandardScaler
# create some toy data
X = [[0, 15],
     [1, -10]]
# instantiate StandardScaler transformer
scaler = StandardScaler()
# fit the transform to the data
scaler.fit(X)
# apply transform
scaler.transform(X)
```
%% Cell type:markdown id: tags:
**Exercise:** Look at the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) for `StandardScaler` and see if you can extract the mean and variance estimates from the fitted `transformer`.
%% Cell type:code id: tags:
``` python
# YOUR CODE GOES HERE
```
%% Cell type:markdown id: tags:
> Note: Once you run the `fit` method on either a `transformer` or an `estimator`, you can access the parameters that have been learned.
For example, for the `transformer` used above to standardise the data:
%% Cell type:code id: tags:
``` python
# Look inside the object to see the learned parameters (and other object properties)
vars(scaler)
```
%% Cell type:markdown id: tags:
<a class="anchor" id="pipelines"></a>
### Pipelines: chaining transforms and estimators
A typical machine learning pipeline will often involve numerous pre-processing transforms and one or more estimators. The `pipeline` object can be used to combine `transformer` and `estimator` objects into a single object that has an API that is the same as a regular `estimator`.
In this example we create a very simple `pipeline` comprising a `StandardScaler` `transformer` and a Random Forest `estimator`.
> Note: in this example the `estimator` predictions are made with the same data (`X`) on which the `estimator` was trained. Typically, independent **train** and **test** data would be used for fitting and prediction. In the next section, [Model evaluation](#model-evaluation), we will see how to split data into train and test sets.
%% Cell type:code id: tags:
``` python
from sklearn.datasets import load_iris   # function to load demo dataset
from sklearn.pipeline import Pipeline    # pipeline object
# create a pipeline object
# using a list of transformers/estimators (and give them names)
pipe = Pipeline(
    [
        ('my_scaler', StandardScaler()),
        ('my_classifier', RandomForestClassifier())
    ]
)
# load the iris dataset
# See https://en.wikipedia.org/wiki/Iris_flower_data_set
X, y = load_iris(return_X_y=True)
# fit the whole pipeline
pipe.fit(X, y)
# predict classes from the training data
pipe.predict(X)
```
%% Cell type:markdown id: tags:
When running a pipeline, the different components of the pipeline are run sequentially.
You may also want to run the pipeline "up to a point", or check the parameters of a component of the pipeline. For example, in the pipeline above, we may want to see the parameters of the `StandardScaler`, or run only that part on some data.
%% Cell type:code id: tags:
``` python
# Check the parameters learned by the scaler
print(pipe['my_scaler'].mean_)
print(pipe['my_scaler'].var_)
# Run only the scaler on some data
X_t = pipe['my_scaler'].transform(X)
# Running the full pipeline is equivalent to running the `transform` method of each
# component sequentially and then running `predict` at the end
pipe['my_classifier'].predict(X_t)
```
%% Cell type:markdown id: tags:
<a class="anchor" id="model-evaluation"></a>
### Model evaluation
In machine learning we often want to train a model on some data, and then apply the model to new "unseen" data to make predictions. Thus we want our model to **generalise** well to unseen data. To assess generalisation, we need a dataset to train/fit the model AND an independent dataset to evaluate the model.
The simplest way to do this is to split the data into a train set and a test set, fit the model to the train set, and make predictions on the test set. There are many methods for splitting data implemented within `scikit-learn` in the [`model_selection`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection) module.
To evaluate a model we also need a performance metric. There are many metrics used in machine learning, and many of them are implemented within `scikit-learn` in the [`metrics`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) module.
In this example, we take the simple `pipeline` constructed earlier and extend it to include a simple train/test split using the `train_test_split` function. A major advantage of using the `pipeline` object is that it prevents leakage between your train and test data, which would invalidate your evaluation. For the performance metric we use `accuracy_score`, which is simply the fraction of correct predictions.
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# fit the pipeline to the train data
pipe.fit(X_train, y_train)
# predict classes from the test data and calculate the accuracy
accuracy_score(pipe.predict(X_test), y_test)
```
%% Cell type:markdown id: tags:
It is typical to perform the train/test splits within the folds of a [cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation) to make more efficient use of the available data.
`scikit-learn` makes cross-validation very easy with the `cross_val_score` function. Furthermore, by using the `pipeline` object we can ensure that there is no data leakage with this more complex training scheme.
In this example, we use `cross_val_score` combined with the `KFold` cross-validation iterator to evaluate the model with 5-fold cross-validation:
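Below is a minimal sketch, assuming the `pipe`, `X`, and `y` objects from the earlier pipeline example:
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import cross_val_score, KFold
# evaluate the pipeline with 5-fold cross-validation
# (shuffling so each fold contains a mix of classes)
scores = cross_val_score(pipe, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores)
```
%% Cell type:markdown id: tags: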
Thus, rather than a single accuracy score, you obtain a distribution of scores, one per fold.
**Exercise:** Another commonly used cross-validation scheme is `StratifiedKFold` which is a variation of k-fold which uses stratified folds. A stratified fold is where each set (train/test) within a fold contains approximately the same percentage of samples of each target class as the complete set. Try training the pipeline from the previous example with `StratifiedKFold` and `k=6`. The documentation for `StratifiedKFold` is [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn-model-selection-stratifiedkfold).
%% Cell type:code id: tags:
``` python
# YOUR CODE GOES HERE
```
%% Cell type:markdown id: tags:
<a class="anchor" id="examples"></a>
## Neuroimaging Examples
Here we present examples of using `scikit-learn` for classification, regression, and clustering on neuroimaging data.
%% Cell type:markdown id: tags:
<a class="anchor" id="classification"></a>
### Classification
In this example, we will train a `scikit-learn` classifier to predict from fMRI data whether the subject is looking at an image of a face or an image of a house.
The example will:
1. Download the fMRI for a single subject
2. Split the data into train and test sets
3. Fit a linear SVM to the train data
4. Evaluate the performance of the classifier on the test data
This example is adapted from: https://nilearn.github.io/stable/auto_examples/02_decoding/plot_haxby_anova_svm.html#sphx-glr-auto-examples-02-decoding-plot-haxby-anova-svm-py
%% Cell type:markdown id: tags:
First we must import the data. If this is the first time you have run this code, it will also download the data, which should take less than a minute.
> **NOTE:**
> 1. Here we use [`fetch_haxby`](https://nilearn.github.io/stable/modules/generated/nilearn.datasets.fetch_haxby.html#nilearn.datasets.fetch_haxby) from the [`nilearn`](https://nilearn.github.io/index.html) package to download the demo data
> 2. [`nb.load`](https://nipy.org/nibabel/gettingstarted.html) from the [`nibabel`](https://nipy.org/nibabel/index.html) package will load the image into a [`nibabel.Nifti1Image`](https://nipy.org/nibabel/reference/nibabel.nifti1.html) object. This will not load the actual image data, though.
%% Cell type:code id: tags:
``` python
from nilearn import datasets
import nibabel as nb
# load dataset for 1 subject (download if required, ~30s)
haxby_dataset = datasets.fetch_haxby()
# load the fMRI image (the image data itself is not loaded into memory yet)
func_img = nb.load(haxby_dataset.func[0])
```
%% Cell type:markdown id: tags:
Next, load the behavioural data, which describes which task was being performed during each volume of the fMRI. We are only interested in a binary classification, so we select the volumes where the subject is looking at either a face or a house.
> **NOTE:**
> 1. Here we use [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) from the [`pandas`](https://pandas.pydata.org/docs/index.html) package to load the behavioural data
> 2. We use [`index_img`](https://nilearn.github.io/modules/generated/nilearn.image.index_img.html) from the [`nilearn`](https://nilearn.github.io/index.html) package to select the face and house volumes from the fMRI image.
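A minimal sketch of this step, assuming the `haxby_dataset` and `func_img` objects from the previous cell (the file and column names follow the `nilearn` example):
%% Cell type:code id: tags:
``` python
import pandas as pd
from nilearn.image import index_img
# load the behavioural labels (one row per fMRI volume)
behavioral = pd.read_csv(haxby_dataset.session_target[0], sep=' ')
conditions = behavioral['labels']
# keep only the face and house volumes
condition_mask = conditions.isin(['face', 'house'])
func_img = index_img(func_img, condition_mask)
conditions = conditions[condition_mask]
```
%% Cell type:markdown id: tags: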
Now load the fMRI data (face and house conditions only), flatten into an `(n_samples, n_features)` matrix for use with `scikit-learn`, and mask out non-brain voxels. In this case, samples will be individual volumes, and features will be the brain voxels. A sketch of these steps, plus the train/test split and linear SVM fit from the example outline, is below.
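The brain mask used here (voxels with non-zero mean) is a crude assumption; the original example uses a more principled masking strategy:
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# load the 4D data and flatten into an (n_samples, n_features) matrix
func_data = func_img.get_fdata()
mask = func_data.mean(axis=-1) > 0   # crude brain mask (an assumption)
X = func_data[mask].T                # rows = volumes (samples), columns = voxels (features)
y = conditions.values
# split into train and test sets, fit a linear SVM, and evaluate
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = SVC(kernel='linear')
clf.fit(X_train, y_train)
print(accuracy_score(clf.predict(X_test), y_test))
```
%% Cell type:markdown id: tags: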
Finally, we can look at the weights of the trained classifier spatially to evaluate which voxels were most discriminatory between faces and houses. This only works because we used a linear classifier. The process involves:
1. Extract the weights from the trained classifier
2. Reshape (un-flatten) the weights into a 3D volume
3. Plot the 3D weights
> **NOTE:**
> 1. Here we use [`plot_stat_map`](https://nilearn.github.io/modules/generated/nilearn.plotting.plot_stat_map.html) from the [`nilearn`](https://nilearn.github.io/index.html) package to plot the images
> 2. [`subplots`](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplots.html) from `matplotlib.pyplot` creates a figure and axes with specified size
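A sketch of these steps, assuming the `clf`, `mask`, and `func_img` names from the cells above:
%% Cell type:code id: tags:
``` python
import numpy as np
import nibabel as nb
import matplotlib.pyplot as plt
from nilearn.plotting import plot_stat_map
# 1. extract the weights from the trained (linear) classifier
weights = clf.coef_[0]
# 2. reshape (un-flatten) the weights into a 3D volume
weight_vol = np.zeros(mask.shape)
weight_vol[mask] = weights
weight_img = nb.Nifti1Image(weight_vol, func_img.affine)
# 3. plot the 3D weights
fig, ax = plt.subplots(figsize=(12, 4))
plot_stat_map(weight_img, axes=ax, title='SVM weights')
```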
%% Cell type:markdown id: tags:
<a class="anchor" id="regression"></a>
### Regression
In this example, we will do a multivariate regression of voxel-based morphometry (VBM) against subject age.
The data come from the [OASIS](https://www.oasis-brains.org) project and have been run through a standard VBM pipeline to create VBM maps.
The example will:
1. Download the VBM maps for 150 subjects
2. Split the data into train and test sets
3. Implement a custom transformer to mask the non-GM voxels
4. Fit a `Ridge` regressor to the train data
5. Evaluate the performance of the regressor on the test data
This example is adapted from: https://nilearn.github.io/stable/auto_examples/02_decoding/plot_oasis_vbm.html#sphx-glr-auto-examples-02-decoding-plot-oasis-vbm-py
%% Cell type:markdown id: tags:
First we must import the data. If this is the first time you have run this code it will also download the data, which should take only a minute or two.
> **NOTE:**
> 1. Here we use [`fetch_oasis_vbm`](https://nilearn.github.io/stable/modules/generated/nilearn.datasets.fetch_oasis_vbm.html#nilearn.datasets.fetch_oasis_vbm) from the [`nilearn`](https://nilearn.github.io/index.html) package to download the VBM data
> 2. [`nb.load`](https://nipy.org/nibabel/gettingstarted.html) from the [`nibabel`](https://nipy.org/nibabel/index.html) package will load the image into a [`nibabel.Nifti1Image`](https://nipy.org/nibabel/reference/nibabel.nifti1.html) object. This will not load the actual image data, though.
%% Cell type:code id: tags:
``` python
import numpy as np
import nibabel as nb
from nilearn import datasets
from sklearn.model_selection import train_test_split
# fetch the VBM data from 150 subjects
n_subjects = 150
dataset_files = datasets.fetch_oasis_vbm(n_subjects=n_subjects)  # 1-2 mins depending on your connection
# load each GM map and flatten into an (n_samples, n_features) matrix
# (a sketch; the variable names follow the later cells)
imgs = np.stack([nb.load(f).get_fdata().ravel() for f in dataset_files.gray_matter_maps])
age = dataset_files.ext_vars['age'].astype(float)
# split the data into train and test sets
imgs_train, imgs_test, age_train, age_test = train_test_split(imgs, age, random_state=0)
# Sort test data by age, simply for better visualization later
perm = np.argsort(age_test)[::-1]
age_test = age_test[perm]
imgs_test = imgs_test[perm]
print(f'Number of train/test samples: {imgs_train.shape[0]}/{imgs_test.shape[0]}')
```
%% Cell type:markdown id: tags:
Each VBM map has a lot of non-GM voxels that are of no interest. Ideally we would remove these using a brain mask, but unfortunately no mask is provided with the dataset. We could create a mask by calculating a group mean image and then binarising it. However, calculating this across all subjects would cause leakage between train and test data. Ideally, the mask would be calculated on the train data and applied to the test data. This is exactly the role of a `transformer` in a `scikit-learn` pipeline.
Here we create a custom `transformer` called `MaskTransformer`. When `fit` is called, it creates a mask by taking the mean across samples (rows of `X`) and then binarising it with the specified threshold. When `transform` is called, it uses the pre-calculated mask to remove features (columns of `X`) that are outside the mask.
More information on custom transformers can be found [here](https://towardsdatascience.com/pipelines-custom-transformers-in-scikit-learn-the-step-by-step-guide-with-python-code-4a7d9b068156).
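A minimal sketch of such a `transformer` (the default threshold here is illustrative):
%% Cell type:code id: tags:
``` python
from sklearn.base import BaseEstimator, TransformerMixin

class MaskTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, threshold=0.1):
        self.threshold = threshold
    def fit(self, X, y=None):
        # create a mask by thresholding the mean across samples (rows of X)
        self.mask_ = X.mean(axis=0) > self.threshold
        return self
    def transform(self, X):
        # keep only the features (columns of X) inside the mask
        return X[:, self.mask_]
```
%% Cell type:markdown id: tags: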
Now we fit a `Ridge` regressor to the train data, and evaluate on the test data.
This example includes an additional `transformer` called `SelectPercentile`, which selects the 50% of features (voxels) with the highest ANOVA F-score with respect to the target. We include this simply to reduce the memory requirements of the regression. A sketch of the full pipeline is below.
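This sketch assumes the `MaskTransformer` defined above and the train/test split from the data-loading cell:
%% Cell type:code id: tags:
``` python
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.linear_model import Ridge
# mask non-GM voxels, keep the 50% of voxels most related to age, then fit a ridge regression
pipe = Pipeline([
    ('mask', MaskTransformer()),
    ('select', SelectPercentile(f_regression, percentile=50)),
    ('ridge', Ridge())
])
pipe.fit(imgs_train, age_train)
# evaluate on the test data (R^2 score by default)
print(pipe.score(imgs_test, age_test))
```
%% Cell type:markdown id: tags: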
Finally, we can look at the coefficients of the trained regressor spatially and see which varied most with age. The process involves:
1. Extract the regression coefficients from the trained regressor
2. Reshape (un-flatten) the weights into a 3D volume
3. Plot the 3D coefficients
> **NOTE:**
> 1. Here we use [`plot_stat_map`](https://nilearn.github.io/modules/generated/nilearn.plotting.plot_stat_map.html) from the [`nilearn`](https://nilearn.github.io/index.html) package to plot the orthographic images
> 2. [`subplots`](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplots.html) from `matplotlib.pyplot` creates a figure and axes with specified size
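A sketch of these steps, assuming the fitted `pipe` from the previous cell; `inverse_transform` undoes the feature selection, and the stored mask undoes the flattening:
%% Cell type:code id: tags:
``` python
import numpy as np
import nibabel as nb
import matplotlib.pyplot as plt
from nilearn.plotting import plot_stat_map
# 1. extract the regression coefficients and undo the feature selection
coef = pipe['select'].inverse_transform(pipe['ridge'].coef_.reshape(1, -1)).ravel()
# 2. undo the masking and reshape (un-flatten) into a 3D volume
ref_img = nb.load(dataset_files.gray_matter_maps[0])
coef_flat = np.zeros(pipe['mask'].mask_.shape)
coef_flat[pipe['mask'].mask_] = coef
coef_img = nb.Nifti1Image(coef_flat.reshape(ref_img.shape), ref_img.affine)
# 3. plot the 3D coefficients
fig, ax = plt.subplots(figsize=(12, 4))
plot_stat_map(coef_img, bg_img=ref_img, axes=ax, title='Ridge coefficients')
```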
%% Cell type:markdown id: tags:
<a class="anchor" id="clustering"></a>
### Clustering
In this example, we will cluster the intensities of an anatomical image to label different tissues. A Gaussian mixture model will be fit by expectation-maximisation. This is similar to how `FSL fast` works for tissue segmentation.
The example will:
1. Fetch an anatomical image
2. Skull-strip the image using FSL tools
3. Fit a Gaussian mixture model to the voxel intensities
%% Cell type:markdown id: tags:
First import the data. In this example we use a single anatomical image from the `haxby` dataset that was used in the classification example.
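A sketch of this step, assuming FSL is installed and its `bet` tool is available through the `fslpy` wrappers (the output filename is illustrative):
%% Cell type:code id: tags:
``` python
from nilearn import datasets
from fsl.wrappers import bet
# fetch the dataset (includes an anatomical image)
haxby_dataset = datasets.fetch_haxby()
anat = haxby_dataset.anat[0]
# skull-strip the anatomical image with FSL BET
anat_brain = 'anat_brain.nii.gz'
bet(anat, anat_brain)
```
%% Cell type:markdown id: tags: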
Now load the skull-stripped anatomical image, remove the non-brain voxels (intensity==0), and flatten into an `(n_samples, n_features)` matrix for use with `scikit-learn`. In this case, samples will be individual voxels, and there will only be a single feature.
%% Cell type:code id: tags:
``` python
import nibabel as nb
import numpy as np
# load brain image
anat_img = nb.load(anat_brain)
anat_data = anat_img.get_fdata()
# create brain mask
mask = anat_data > 0
# remove non-brain voxels and reshape to (n_samples, n_features) for scikit-learn
anat_data = anat_data[mask][:, np.newaxis]
print(f'Flattened shape: {anat_data.shape}')
```
%% Cell type:markdown id: tags:
Fit a Gaussian mixture model with 4 Gaussians to the data, then predict the class/label on the same data. Given that both `fit` and `predict` will be used on the same data, we can use the convenience method `fit_predict`, which does both. Finally, reshape the classes/labels into a 3D volume and plot.
%% Cell type:code id: tags:
``` python
from sklearn.mixture import GaussianMixture  # import gaussian mixture model
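# A sketch of the remaining steps described above, assuming the `anat_data`,
# `mask`, and `anat_img` names from the previous cell
import numpy as np
import matplotlib.pyplot as plt
# fit a GMM with 4 components and predict a tissue class for each voxel
gmm = GaussianMixture(n_components=4, random_state=0)
labels = gmm.fit_predict(anat_data)
# reshape (un-flatten) the labels into a 3D volume (offset by 1 so background stays 0)
label_vol = np.zeros(anat_img.shape)
label_vol[mask] = labels + 1
# plot a mid-axial slice of the tissue labels
plt.imshow(label_vol[:, :, label_vol.shape[2] // 2].T, origin='lower')
plt.show()
```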