diff --git a/applications/machine_learning/scikit_learn.ipynb b/applications/machine_learning/scikit_learn.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..242c6313db5180ee2d820fbb13fc23eda50a8741 --- /dev/null +++ b/applications/machine_learning/scikit_learn.ipynb @@ -0,0 +1,1040 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# scikit-learn\n", + "\n", + "Machine learning in python.\n", + "\n", + "- Simple and efficient tools for predictive data analysis\n", + "- Built on `numpy`, `scipy`, and `matplotlib`\n", + "- Open source, commercially usable - BSD license\n", + "- \\> 2.1k contributers, 22.5k forks, 48.5k \"stars\"\n", + "- Current version: 1.0.2\n", + "- Powerful generic API\n", + "- Very well documented\n", + "- Core techniques, hundreds of options for:\n", + " - Classification \n", + " - Regression \n", + " - Clustering \n", + " - Dimensionality Reduction \n", + " - Model Selection \n", + " - Preprocessing\n", + "\n", + "Some useful links are:\n", + "- [Main website](https://scikit-learn.org/stable/index.html)\n", + "- [User guide](https://scikit-learn.org/stable/user_guide.html)\n", + "- [API](https://scikit-learn.org/stable/modules/classes.html)\n", + "- [GitHub](https://github.com/scikit-learn/scikit-learn)\n", + "\n", + "## Notebook Overview:\n", + "* [Getting Started](#getting-started)\n", + " * [Estimators: fitting and predicting](#estimators)\n", + " * [Transformers](#transformers)\n", + " * [Pipelines: chaining transforms and estimators](#pipelines)\n", + " * [Model evaluation](#model-evaluation)\n", + "* [Neuroimaging examples](#examples)\n", + " * [Classification](#classification)\n", + " * [Regression](#regression)\n", + " * [Clustering](#clustering)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "## Getting Started\n", + "\n", + "Adapted from the `scikit-learn` \"Getting Started\" documentation: https://scikit-learn.org/stable/getting_started.html\n", + "\n", + "Three important concepts to understand for using `scikit-learn`:\n", + "1. `estimator` objects and their `fit` and `predict` methods for fitting data and making predictions\n", + "2. `transformer` objects for pre/post-processing transforms\n", + "3. `pipeline` objects for chaining together `transformers` and `estimators` into a machine learning pipeline" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### Estimators: fitting and predicting\n", + "\n", + "Each machine learning model in `scikit-learn` is implemented as an [`estimator`](https://scikit-learn.org/stable/glossary.html#term-estimators) object. Here we instantiate [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier) `estimator`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# import RF estimator\n", + "from sklearn.ensemble import RandomForestClassifier\n", + "\n", + "# instantiate RF estimator\n", + "clf = RandomForestClassifier(random_state=0)\n", + "\n", + "print(clf)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "After creation, the `estimator` can be **fit** to the training data using the `fit` method.\n", + "\n", + "The fit method accepts 2 inputs:\n", + "\n", + "1. `X` is the training input samples of shape (`n_samples`, `n_features`). Thus, rows=samples and columns=features\n", + "2. 
`y` is the target values (class labels in classification, real numbers in regression) of shape (`n_samples`,) for one output, or (`n_samples`, `n_outputs`) for multiple outputs. \n", + "\n", + "> Note: Both `X` and `y` are usually numpy arrays or equivalent array-like data types, including a [`pandas DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# create some toy data\n", + "X = [[ 1, 2, 3], # 2 samples, 3 features\n", + " [11, 12, 13]]\n", + "y = [0, 1] # classes of each sample\n", + "\n", + "# fit the model to the data\n", + "clf.fit(X, y)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Once trained, the `estimator` can make predictions on new data using the `predict` method." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# predict classes of new data\n", + "X_test = [[ 4, 5, 6], \n", + " [14, 15, 16]]\n", + "\n", + "clf.predict(X_test) " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Importantly, this `fit` and `predict` interface is consistent across different estimators making it very easy to change estimators within your code. For example, here we swap the Random Forest `estimator` for a Support Vector Machine `estimator`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# import an SVM estimator\n", + "from sklearn.svm import SVC\n", + "\n", + "# instantiate SVM estimator\n", + "clf = SVC(random_state=0)\n", + "\n", + "# fit and predict\n", + "clf.fit(X, y)\n", + "clf.predict(X_test) " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Exercise:** `scikit-learn` implements a wide array of machine learning `estimators`. Try implementing fitting and predicting with a different model. See the `scikit-learn` [documentation](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) for model options." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# YOUR CODE GOES HERE" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### Transformers\n", + "\n", + "The `transformer` is a special object that follows the same API as an `estimator` and allows you to apply pre-processing and/or post-procssing transforms to the data in your machine learning pipeline. The `transformer` object has a `transform` method instead of a `predict` method.\n", + "\n", + "In this example we use a `transformer` to standardise the features (e.g. remove mean and scale to unit variance). The `fit` method calculates the mean and variance parameters from the data, and the `transform` method will do the scaling." 
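, + "\n", + "> **Note**: a `transformer` also has a `fit_transform` convenience method, which fits to the data and applies the transform in a single call. A minimal sketch with some toy data:\n", + "\n", + "```python\n", + "from sklearn.preprocessing import StandardScaler\n", + "\n", + "X = [[0, 15],\n", + "     [1, -10]]\n", + "\n", + "# fit_transform(X) is equivalent to fit(X) followed by transform(X)\n", + "X_scaled = StandardScaler().fit_transform(X)\n", + "print(X_scaled)\n", + "```\n"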
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# import StandardScaler transformer\n", + "from sklearn.preprocessing import StandardScaler\n", + "\n", + "# create some toy data\n", + "X = [[0, 15],\n", + " [1, -10]]\n", + "\n", + "# instantiate StandardScaler transformer\n", + "scaler = StandardScaler()\n", + "\n", + "# fit the transform to the data\n", + "scaler.fit(X)\n", + "\n", + "# apply transform\n", + "scaler.transform(X)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Exercise:** Look at the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) for `StandardScaler` and see if you can extract the mean and variance estimates from the fitted `transformer`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# YOUR CODE GOES HERE" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> Note: Once you run the `fit` method on either a `transformer` or an `estimator`, you can access the parameters that have been learned. \n", + "\n", + "For example, for the `transformer` used above to standardise the data:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Look inside the object to see the learned parameters (and other object properties)\n", + "vars(scaler)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### Pipelines: chaining transforms and estimators\n", + "\n", + "A typical machine learning pipeline will often involve numerous pre-processing transforms and one or more estimators. The `pipeline` object can be used to combine `transformer` and `estimator` objects into a single object that has an API that is the same as a regular `estimator`. \n", + "\n", + "In this example we create a very simple `pipeline` that is comprised of a StandardScaler `transform` and a Random Forest `estimator`.\n", + "\n", + "> Note: in this example the `estimator` predictions were made with the same data (`X`) on which the `estimator` was trained. Typically independent **train** and **test** data would be used for fitting and prediction. In the next section [Model evaluation](#model-evaluation) we will see how to split data into train and test sets." 
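, + "\n", + "> **Note**: there is also a `make_pipeline` helper (used in the regression example later in this notebook) which builds a `pipeline` without requiring a name for each step; the names are generated automatically from the class names. A minimal sketch:\n", + "\n", + "```python\n", + "from sklearn.pipeline import make_pipeline\n", + "from sklearn.preprocessing import StandardScaler\n", + "from sklearn.ensemble import RandomForestClassifier\n", + "\n", + "# step names are auto-generated: 'standardscaler' and 'randomforestclassifier'\n", + "pipe = make_pipeline(StandardScaler(), RandomForestClassifier())\n", + "print(pipe.named_steps)\n", + "```\n"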
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# imports\n", + "from sklearn.ensemble import RandomForestClassifier # estimator\n", + "from sklearn.preprocessing import StandardScaler # transformer\n", + "from sklearn.pipeline import Pipeline # pipeline\n", + "from sklearn.datasets import load_iris # function to load demo dataset\n", + "\n", + "# create a pipeline object\n", + "# using a list of transformers/estimators (and give them names)\n", + "pipe = Pipeline(\n", + " [\n", + " ('my_scaler', StandardScaler()),\n", + " ('my_classifier', RandomForestClassifier())\n", + " ]\n", + ")\n", + "\n", + "# load the iris dataset\n", + "# See https://en.wikipedia.org/wiki/Iris_flower_data_set\n", + "X, y = load_iris(return_X_y=True)\n", + "\n", + "# fit the whole pipeline\n", + "pipe.fit(X, y)\n", + "\n", + "# predict classes from the training data\n", + "pipe.predict(X)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When running a pipeline, the different components of the pipeline are run sequentially. \n", + "\n", + "You may also want to run the pipeline \"up to a point\", or check the parameters of a component of the pipeline. For example, in the pipeline above, we may want to see the parameters of the `StandardScaler` or only run that part on some data.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Check the parameters learned by the scaler\n", + "print(pipe['my_scaler'].mean_)\n", + "print(pipe['my_scaler'].var_)\n", + "\n", + "# Run only the scaler on some data\n", + "X_t = pipe['my_scaler'].transform(X)\n", + "\n", + "# Running the full pipeline is equivalent to running the `transform` method on the different bits \n", + "# sequentially and then running a `predict` at the end\n", + "print(pipe.predict(X))\n", + "print(pipe['my_classifier'].predict(pipe['my_scaler'].transform(X)))\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### Model evaluation\n", + "\n", + "In machine learning we often want to train a model on some data, and then apply the model to new \"unseen\" data to make predictions. Thus we want our model to **generalise** well to unseen data. To assess generalisation, we need a dataset to train/fit the model AND an independent dataset to evaluate the model.\n", + "\n", + "The simplest way to do this is to just split the data into a train set and a test set, then fit the model to the train set, and make predictions on the test set. There are many methods for splitting data that are implemented within `scikit-learn` in the [`model_selection`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection) module.\n", + "\n", + "To evaluate a model we also need a metric of performance. There are many metrics used in machine learning, and many of them are implemented within `scikit-learn` in the [`metrics`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) module. \n", + "\n", + "In this example, we use the simple `pipeline` constructed earlier and extend it to include a simple train/test split using the `train_test_split` function. A major advantage of using the `pipeline` object is that it will prevent leakage between your train and test data, which would invalidate your evaluation. For the performance metric we use `accuracy_score`, which is just the percentage of correct predictions."
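, + "\n", + "> **Note**: classifiers (and pipelines ending in a classifier) also expose a `score` method that returns the mean accuracy directly, so `pipe.score(X_test, y_test)` gives the same number as `accuracy_score(y_test, pipe.predict(X_test))`. A minimal sketch:\n", + "\n", + "```python\n", + "from sklearn.datasets import load_iris\n", + "from sklearn.model_selection import train_test_split\n", + "from sklearn.ensemble import RandomForestClassifier\n", + "\n", + "X, y = load_iris(return_X_y=True)\n", + "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)\n", + "\n", + "clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)\n", + "\n", + "# mean accuracy on the held-out test set\n", + "print(clf.score(X_test, y_test))\n", + "```\n"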
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# imports\n", + "from sklearn.ensemble import RandomForestClassifier # estimator\n", + "from sklearn.preprocessing import StandardScaler # transformer\n", + "from sklearn.pipeline import Pipeline # pipeline\n", + "from sklearn.datasets import load_iris # function to load demo dataset\n", + "from sklearn.model_selection import train_test_split # function to split the data into train and test\n", + "from sklearn.metrics import accuracy_score # performance metric\n", + "\n", + "# create a pipeline object\n", + "pipe = Pipeline(\n", + " [\n", + " ('my_scaler', StandardScaler()),\n", + " ('my_classifier', RandomForestClassifier())\n", + " ]\n", + ")\n", + "\n", + "# load the iris dataset and split into train and test\n", + "# See https://en.wikipedia.org/wiki/Iris_flower_data_set\n", + "X, y = load_iris(return_X_y=True)\n", + "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)\n", + "\n", + "# fit the pipeline to the train data\n", + "pipe.fit(X_train, y_train)\n", + "\n", + "# predict classes from the test data and calculate the accuracy\n", + "accuracy_score(pipe.predict(X_test), y_test)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It is typical to perform the train/test splits within the folds of a [cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation) to make more efficient use of the available data. \n", + "\n", + "![cross-validation](https://scikit-learn.org/stable/_images/grid_search_cross_validation.png)\n", + "\n", + "`scikit-learn` makes cross-validation very easy with the `cross_val_score` function. Furthermore, by using the `pipeline` object we can ensure that there is no data leakage with this more complex training scheme.\n", + "\n", + "In this example, we use `cross_val_score` combined with the `KFold` cross-validation iterator to evaluate the model with 5-fold cross-validation:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# imports\n", + "from sklearn.ensemble import RandomForestClassifier # estimator\n", + "from sklearn.preprocessing import StandardScaler # transformer\n", + "from sklearn.pipeline import Pipeline # pipeline\n", + "from sklearn.datasets import load_iris # function to load demo dataset\n", + "from sklearn.model_selection import cross_val_score, KFold # cross-validation\n", + "\n", + "# create a pipeline object\n", + "pipe = Pipeline(\n", + " [\n", + " ('my_scaler',StandardScaler()),\n", + " ('my_classifier',RandomForestClassifier())\n", + " ]\n", + ")\n", + "\n", + "# load the iris dataset and split into train and test\n", + "X, y = load_iris(return_X_y=True)\n", + "\n", + "# run cross-validation (CV)\n", + "cv = KFold(n_splits=5)\n", + "result = cross_val_score(pipe, X, y, cv=cv, scoring='accuracy')\n", + "\n", + "for idx, acc in enumerate(result):\n", + " print(f'Accuracy fold {idx}: {acc*100:0.2f}%')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Thus, rather than a single accuracy score, you have a distribution.\n", + "\n", + "\n", + "**Exercise:** Another commonly used cross-validation scheme is `StratifiedKFold` which is a variation of k-fold which uses stratified folds. A stratified fold is where each set (train/test) within a fold contains approximately the same percentage of samples of each target class as the complete set. 
Try training the pipeline from the previous example with `StratifiedKFold` and `k=6`. The documentation for `StratifiedKFold` is [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn-model-selection-stratifiedkfold)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# YOUR CODE GOES HERE" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "## Neuroimaging Examples\n", + "\n", + "Here we present examples of using `scikit-learn` for classification, regression, and clustering on neuroimaging data." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### Classification\n", + "\n", + "In this example, we will train a `scikit-learn` classifier to predict from fMRI if the subject is looking at an image of a face or a house. \n", + "\n", + "The example will:\n", + "1. Download the fMRI for a single subject\n", + "2. Split the data into train and test sets\n", + "3. Fit a linear SVM to the train data\n", + "4. Evaluate the performance of the classifier on the test data\n", + "\n", + "This example is adapted from: https://nilearn.github.io/stable/auto_examples/02_decoding/plot_haxby_anova_svm.html#sphx-glr-auto-examples-02-decoding-plot-haxby-anova-svm-py\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First we must import the data. If this is the first time you run this code it will also download the data, which should take less than a minute.\n", + "\n", + "> **NOTE:**\n", + "> 1. Here we use [`fetch_haxby`](https://nilearn.github.io/stable/modules/generated/nilearn.datasets.fetch_haxby.html#nilearn.datasets.fetch_haxby) from the [`nilearn`](https://nilearn.github.io/index.html) package to download the demo data\n", + "> 2. [`nb.load`](https://nipy.org/nibabel/gettingstarted.html) from the [`nibabel`](https://nipy.org/nibabel/index.html) package will load the image into [`nibabel.Nifti1Image`](https://nipy.org/nibabel/reference/nibabel.nifti1.html) object. This will not load the actual data though." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from nilearn import datasets\n", + "import nibabel as nb\n", + "\n", + "# load dataset for 1 subject (download if required, ~30s)\n", + "# https://www.science.org/doi/10.1126/science.1063736\n", + "haxby_dataset = datasets.fetch_haxby() \n", + "\n", + "# select and load images from dataset\n", + "func_img = nb.load(haxby_dataset.func[0])\n", + "mask_img = nb.load(haxby_dataset.mask)\n", + "anat_img = nb.load(haxby_dataset.anat[0])\n", + "\n", + "print(f'Func dimension x,y,z: {func_img.shape[:-1]}')\n", + "print(f'Func vols/TRs: {func_img.shape[-1]}')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, load the behavioural data, which describes which task was being performed for each volume of the fMRI. We are only interested in a binary classification, so we select volumes where the user is looking at either a face or a house.\n", + "\n", + "> **NOTE:**\n", + "> 1. Here we use [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) from the [`pandas`](https://pandas.pydata.org/docs/index.html) package to load the behavioural data\n", + "> 2. 
We use [`index_img`](https://nilearn.github.io/modules/generated/nilearn.image.index_img.html) from the [`nilearn`](https://nilearn.github.io/index.html) package to select the face and house volumes from the fMRI image." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "from nilearn.image import index_img\n", + "\n", + "# load behavioral information\n", + "behavioral = pd.read_csv(haxby_dataset.session_target[0], sep=\" \")\n", + "conditions = behavioral['labels']\n", + "\n", + "# create a mask for the face and house volumes\n", + "condition_mask = conditions.isin(['face', 'house'])\n", + "\n", + "# select the face/house volumes\n", + "func_img = index_img(func_img, condition_mask)\n", + "conditions = conditions[condition_mask]\n", + "\n", + "# Confirm that we now have 2 conditions (face and house)\n", + "print(f'Number of conditions: {conditions.shape[0]}')\n", + "print(f'Unique conditions: {conditions.unique()}')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now load the fMRI data (face and house conditions only), flatten into a `(n_sample, n_feature)` matrix for use with `scikit-learn`, and mask out non-brain voxels. In this case, samples will be individual volumes, and features will be the voxels within the brain mask." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "func_data = func_img.get_fdata()[mask_img.get_fdata()!=0].T\n", + "\n", + "print(f'Flattened data shape (n_sample, n_feature): {func_data.shape}')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Split the data into train/test with a 60/40 split." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Split data into training set and test set (60% train)\n", + "from sklearn.model_selection import train_test_split\n", + "\n", + "func_train, func_test, condition_train, condition_test = train_test_split(func_data, conditions, train_size=0.6, random_state=0)\n", + "\n", + "print(f'Number of train/test samples: {func_train.shape[0]}/{func_test.shape[0]}')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now fit an SVM classifier to the train data, and evaluate on the test data.
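\n", + "\n", + "> **Note**: the same pipeline could also be scored with cross-validation rather than a single train/test split, exactly as in the *Model evaluation* section earlier. A minimal sketch, assuming `func_data` and `conditions` from the cells above:\n", + "\n", + "```python\n", + "from sklearn.model_selection import cross_val_score, KFold\n", + "from sklearn.pipeline import Pipeline\n", + "from sklearn.preprocessing import StandardScaler\n", + "from sklearn.svm import SVC\n", + "\n", + "cv_pipe = Pipeline([('scaler', StandardScaler()), ('clf', SVC(kernel='linear'))])\n", + "\n", + "# accuracy for each of 5 folds on the masked fMRI data and face/house labels\n", + "scores = cross_val_score(cv_pipe, func_data, conditions, cv=KFold(n_splits=5), scoring='accuracy')\n", + "print(scores)\n", + "```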
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.svm import SVC # estimator\n", + "from sklearn.preprocessing import StandardScaler # transformer\n", + "from sklearn.pipeline import Pipeline # pipeline\n", + "from sklearn.metrics import (accuracy_score, # metrics for confusion matrix and accuracy\n", + " confusion_matrix, ConfusionMatrixDisplay) \n", + "\n", + "# construct classification pipeline\n", + "pipe = Pipeline(\n", + " [\n", + " ('scaler', StandardScaler()),\n", + " ('clf', SVC(kernel='linear')),\n", + " ]\n", + ")\n", + "\n", + "# fit and predict\n", + "pipe.fit(func_train, condition_train) \n", + "condition_pred = pipe.predict(func_test)\n", + "\n", + "# evalute\n", + "acc = accuracy_score(condition_test, condition_pred)\n", + "print(f'Accuracy = {acc*100:.2f}%')\n", + "\n", + "cm = confusion_matrix(condition_test, condition_pred, labels=pipe['clf'].classes_)\n", + "disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=pipe['clf'].classes_)\n", + "disp.plot();" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally, we can look at the weights of the trained classifier spatially to evaluate which voxels were most discriminatory between faces and houses. This only works because we used a linear classifier. The process involves:\n", + "1. Extract the weights from the trained classifier\n", + "2. Reshape (un-flatten) the weights into a 3D volume\n", + "3. Plot the 3D weights\n", + "\n", + "> **NOTE:**\n", + "> 1. Here we use [`plot_stat_map`](https://nilearn.github.io/modules/generated/nilearn.plotting.plot_stat_map.html) from the [`nilearn`](https://nilearn.github.io/index.html) package to plot the images\n", + "> 2. [`subplots`](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplots.html) from `matplotlib.pyplot` creates a figure and axes with specified size" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np # functions for reshaping arrays\n", + "import matplotlib.pyplot as plt\n", + "from nilearn import plotting\n", + "\n", + "# get coefficients/weights from the classifier\n", + "coef = pipe['clf'].coef_.squeeze()\n", + "\n", + "# reshape coefficients to 3D and create nifti\n", + "coef_3d = np.zeros(func_img.shape[:-1]).ravel()\n", + "coef_3d[(mask_img.get_fdata()!=0).ravel()] = coef\n", + "coef_3d = np.reshape(coef_3d, func_img.shape[:-1])\n", + "coef_img = nb.Nifti1Image(coef_3d, header=func_img.header, affine=func_img.affine)\n", + "\n", + "# plot coefficients\n", + "fig, ax = plt.subplots(figsize=(20, 5))\n", + "plotting.plot_stat_map(coef_img, anat_img, title=\"discriminating weights\", display_mode=\"z\", cut_coords=7, axes=ax, threshold=0.0005, dim=-1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### Regression\n", + "\n", + "In this example, we will do a multivariate regression of voxel-based-morphometry with subject age.\n", + "\n", + "The data come from the [OASIS](https://www.oasis-brains.org) project and has been run through a standard VBM pipeline to create VBM maps.\n", + "\n", + "The example will:\n", + "1. Download the VBM maps for 150 subjects\n", + "2. Split the data into train and test sets\n", + "3. Implement a custom transformer to mask the non-GM voxels\n", + "3. Fit a `Ridge` regressor to the train data\n", + "4. 
Evaluate the performance of the regressor on the test data\n", + "\n", + "This example is adapted from: https://nilearn.github.io/stable/auto_examples/02_decoding/plot_oasis_vbm.html#sphx-glr-auto-examples-02-decoding-plot-oasis-vbm-py" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First we must import the data. If this is the first time you have run this code it will also download the data, which should take only a minute or two.\n", + "\n", + "> **NOTE:**\n", + "> 1. Here we use [`fetch_oasis_vbm`](https://nilearn.github.io/stable/modules/generated/nilearn.datasets.fetch_oasis_vbm.html#nilearn.datasets.fetch_oasis_vbm) from the [`nilearn`](https://nilearn.github.io/index.html) package to download the VBM data\n", + "> 2. [`nb.load`](https://nipy.org/nibabel/gettingstarted.html) from the [`nibabel`](https://nipy.org/nibabel/index.html) package will load the image into [`nibabel.Nifti1Image`](https://nipy.org/nibabel/reference/nibabel.nifti1.html) object. This will not load the actual data though." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import nibabel as nb\n", + "from nilearn import datasets\n", + "\n", + "# fetch the VBM data from 150 subjects\n", + "n_subjects = 150 \n", + "dataset_files = datasets.fetch_oasis_vbm(n_subjects=n_subjects) # 1-2 mins depending on your connection\n", + "\n", + "# extract age from dataset\n", + "age = dataset_files.ext_vars['age'].astype(float)\n", + "age = np.array(age)\n", + "\n", + "# load GM maps\n", + "imgs = [nb.load(f) for f in dataset_files.gray_matter_maps]\n", + "imgs_data = np.stack([img0.get_fdata().ravel() for img0 in imgs], axis=1).T # ~10s\n", + "\n", + "print(f'Flattened data shape (n_sample, n_feature): {imgs_data.shape}')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Plot the distribution of the subjects age." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "\n", + "fig, ax = plt.subplots(figsize=(10, 5))\n", + "ax.hist(age, bins=20)\n", + "ax.set_xlabel('age (years)')\n", + "ax.set_ylabel('count')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Split the data into train/test with a 60/40 split." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Split data into training set and test set (60% train)\n", + "from sklearn.model_selection import train_test_split\n", + "\n", + "imgs_train, imgs_test, age_train, age_test = train_test_split(imgs_data, age, train_size=.6, random_state=0)\n", + "\n", + "# Sort test data by age, simply for better visualization later\n", + "perm = np.argsort(age_test)[::-1]\n", + "age_test = age_test[perm]\n", + "imgs_test = imgs_test[perm]\n", + "\n", + "print(f'Number of train/test samples: {imgs_train.shape[0]}/{imgs_test.shape[0]}')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Each VBM map has a lot of non-GM voxels that are of no interest. Ideally we would remove these using a brain mask, but unfortunately there is not a mask provided with the dataset. We could calculate a mask by calculating a group mean image, and then binarising it. However, this would cause leakage between train and test data. Ideally, the mask would be calculated on the train data and applied to the test data. 
This is exactly the role of a `transformer` in a `scikit-learn` pipeline.\n", + "\n", + "Here we create a custom `transformer` called `MaskTransformer`. When `fit` is called, it creates a mask by taking the mean across samples (rows of `X`) and then binarises it with the specified threshold. When `transform` is called, it uses the pre-calculated mask to remove the features (columns of `X`) that fall outside the mask.\n", + "\n", + "More information on custom transformers can be found [here](https://towardsdatascience.com/pipelines-custom-transformers-in-scikit-learn-the-step-by-step-guide-with-python-code-4a7d9b068156)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.base import BaseEstimator, TransformerMixin\n", + "import numpy as np\n", + "\n", + "class MaskTransformer(BaseEstimator, TransformerMixin):\n", + " def __init__(self, threshold=0.3):\n", + " self.threshold = threshold\n", + "\n", + " def fit(self, X, y=None):\n", + " self.mask = np.mean(X, axis=0) > self.threshold\n", + " return self\n", + "\n", + " def transform(self, X, y=None):\n", + " X_ = X.copy()\n", + " return X_[:, self.mask]\n", + "\n", + " def inverse_transform(self, X):\n", + " X_r = np.zeros((X.shape[0], self.mask.shape[0]))\n", + " X_r[:, self.mask] = X\n", + " return X_r" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we fit a `Ridge` regressor to the train data, and evaluate on the test data. \n", + "\n", + "This example includes an additional `transformer` called `SelectPercentile` which will select the 50% of the features (voxels) with the highest ANOVA F-score with respect to the target. We included this simply to reduce the memory requirements of the regression." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.linear_model import Ridge # estimator\n", + "from sklearn.preprocessing import StandardScaler # transformer\n", + "from sklearn.pipeline import make_pipeline # function to construct pipeline\n", + "from sklearn.feature_selection import SelectPercentile\n", + "\n", + "# construct regression pipeline\n", + "# (this is a shorthand for creating a pipeline but it does not allow you to give names to the bits)\n", + "pipe = make_pipeline(\n", + " MaskTransformer(),\n", + " SelectPercentile(percentile=50),\n", + " StandardScaler(),\n", + " Ridge(alpha=0.5),\n", + ")\n", + "\n", + "# fit and predict\n", + "pipe.fit(imgs_train, age_train)\n", + "age_pred = pipe.predict(imgs_test)\n", + "\n", + "# evaluate\n", + "fig, ax = plt.subplots(figsize=(10, 5))\n", + "\n", + "ax.plot(age_test, label='True age')\n", + "ax.plot(age_pred, label='Predicted age')\n", + "\n", + "ax.set_ylabel('age (years)')\n", + "ax.set_xlabel('samples')\n", + "plt.legend()\n", + "\n", + "mae = np.mean(np.abs(age_test - age_pred))\n", + "ax.set_title(f'Mean absolute error {mae:.2f} years')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally, we can look at the coefficients of the trained regressor spatially and see which voxels varied most with age. The process involves:\n", + "1. Extract the regression coefficients from the trained regressor\n", + "2. Reshape (un-flatten) the weights into a 3D volume\n", + "3. Plot the 3D coefficients\n", + "\n", + "> **NOTE:**\n", + "> 1. 
Here we use [`plot_stat_map`](https://nilearn.github.io/modules/generated/nilearn.plotting.plot_stat_map.html) from the [`nilearn`](https://nilearn.github.io/index.html) package to plot the orthographic images\n", + "> 2. [`subplots`](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplots.html) from `matplotlib.pyplot` creates a figure and axes with specified size" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from nilearn import plotting\n", + "\n", + "# get regression coefficients from the estimator\n", + "coef = pipe['ridge'].coef_[np.newaxis, :]\n", + "\n", + "# inverse the transforms\n", + "coef = pipe['selectpercentile'].inverse_transform(coef)\n", + "coef = pipe['masktransformer'].inverse_transform(coef)\n", + "\n", + "# reshape coefficients into 3D volume\n", + "coef_3d = np.reshape(coef, imgs[0].shape)\n", + "coef_img = nb.Nifti1Image(coef_3d, header=imgs[0].header, affine=imgs[0].affine)\n", + "\n", + "# plot\n", + "fig, ax = plt.subplots(figsize=(20, 5))\n", + "plotting.plot_stat_map(coef_img, imgs[0], title=\"regression coeffs\", display_mode=\"z\", cut_coords=5, axes=ax, threshold=0.002)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### Clustering\n", + "\n", + "In this example, we will cluster the intensities of an anatomical image to label different tissues. A guassian mixture model will be fit by expectation-maximisation. This is similar to how `FSL fast` works for tissue segmentation.\n", + "\n", + "The example will:\n", + "1. Fetch an anatomical image\n", + "2. Skull-strip the image using FSL tools\n", + "3. Fit a gaussian mixture model to the voxel intensities" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First import the data. In this example we use a single anatomical image from the `haxby` dataset that was used in the classification example." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from nilearn import datasets\n", + "from nilearn import plotting\n", + "import matplotlib.pyplot as plt\n", + "\n", + "# fetch anatomical image\n", + "# https://www.science.org/doi/10.1126/science.1063736\n", + "haxby_dataset = datasets.fetch_haxby() # 1 subject, ~30s\n", + "anat = haxby_dataset.anat[0]\n", + "\n", + "# plot\n", + "fig, ax = plt.subplots(figsize=(15, 7))\n", + "plotting.plot_anat(anat, dim=-1, axes=ax)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we limit the field-of-view, and skull-strip the anatomical image using `FSL` tools, specifically `robustfov` and `bet`.\n", + "\n", + "> **Note**: \n", + "> 1. Here we use the `!` operator to execute the command in the shell\n", + "> 2. The `{}` will expand the contained python variable in the shell" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# extract brain using FSL tools\n", + "\n", + "anat_brain = anat.replace('.nii.gz', '_brain.nii.gz')\n", + "\n", + "! robustfov -i {anat} -r {anat_brain}\n", + "! 
bet {anat_brain} {anat_brain}\n", + "\n", + "# plot brain extraction\n", + "\n", + "fig, ax = plt.subplots(figsize=(15,7))\n", + "plotting.plot_anat(anat_brain, dim=-1, axes=ax)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now load the skull-stripped anatomical image, remove the non-brain voxels (intensity==0), and flatten into a `(n_sample, n_feature)` matrix for use with `scikit-learn`. In this case, samples will be individual voxels, and there will only be a single feature." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import nibabel as nb\n", + "import numpy as np\n", + "\n", + "# load brain image\n", + "anat_img = nb.load(anat_brain)\n", + "anat_data = anat_img.get_fdata()\n", + "\n", + "# create brain mask\n", + "mask = anat_data > 0\n", + "\n", + "# remove non-brain voxels and reshape to (n_sample, n_features) for scikit-learn\n", + "anat_data = anat_data[mask][:, np.newaxis]\n", + "\n", + "print(f'Flattened shape: {anat_data.shape}')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Fit a Gaussian mixture model with 4 Gaussians to the data, then predict the class/label on the same data. Given that both `fit` and `predict` will be used on the same data, we can use a convenience method that does both called `fit_predict`. Finally, reshape the classes/labels into a 3D volume and plot." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.mixture import GaussianMixture # import gaussian mixture model\n", + "\n", + "# fit the mixture model and predict a cluster label for each voxel\n", + "labels = GaussianMixture(n_components=4, random_state=0).fit_predict(anat_data)\n", + "\n", + "# reshape labels into a 3D volume\n", + "labels_3d = np.zeros(anat_img.shape).ravel()\n", + "labels_3d[mask.ravel()] = labels+1\n", + "\n", + "labels_3d = np.reshape(labels_3d, anat_img.shape)\n", + "labels_img = nb.Nifti1Image(labels_3d, header=anat_img.header, affine=anat_img.affine)\n", + "\n", + "# plot labels\n", + "fig, ax = plt.subplots(figsize=(20,5))\n", + "disp = plotting.plot_anat(anat_brain, dim=-1, axes=ax, display_mode=\"z\", cut_coords=7)\n", + "disp.add_overlay(labels_img)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "That's all folks!" + ] + } + ], + "metadata": { + "interpreter": { + "hash": "8f5a176552fc0f23c8f04279340cb4c0ac62022f9ffb83054f20d62859be286b" + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.12" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}