Commit f87a698c authored by Sean Fitzgibbon, committed by Paul McCarthy

Added transformers and pipelines
%% Cell type:markdown id: tags:
# scikit-learn
Machine learning in Python.
- Simple and efficient tools for predictive data analysis
- Built on `numpy`, `scipy`, and `matplotlib`
- Open source, commercially usable - BSD license
Some useful links are:
- [Main website](https://scikit-learn.org/stable/index.html)
- [User guide](https://scikit-learn.org/stable/user_guide.html)
- [API](https://scikit-learn.org/stable/modules/classes.html)
- [GitHub](https://github.com/scikit-learn/scikit-learn)
## Project Overview:
- \> 2.1k contributors, 22.5k forks, 48.5k "stars"
- Current version: 1.0.2
- Powerful generic API
- Very well documented
- Core techniques, hundreds of options for:
- Classification
- Regression
- Clustering
- Dimensionality Reduction
- Model Selection
- Preprocessing
## Notebook Overview:
* [Getting Started](#getting-started)
* [Estimators: fitting and predicting](#estimators)
* [Transformers](#transformers)
* [Pipelines: chaining transforms and estimators](#pipelines)
%% Cell type:markdown id: tags:
<a class="anchor" id="getting-started"></a>
## Getting Started
Adapted from the `scikit-learn` "Getting Started" documentation: https://scikit-learn.org/stable/getting_started.html
Three important concepts to understand for using `scikit-learn`:
1. `estimator` objects and their `fit` and `predict` methods for fitting data and making predictions
2. `transformer` objects for pre/post-processing transforms
3. `pipeline` objects for chaining together `transformers` and `estimators` into a machine learning pipeline
%% Cell type:markdown id: tags:
<a class="anchor" id="estimators"></a>
### Estimators: fitting and predicting
Each machine learning model in `scikit-learn` is implemented as an [`estimator`](https://scikit-learn.org/stable/glossary.html#term-estimators) object. Here we instantiate a [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier) `estimator`:
%% Cell type:code id: tags:
``` python
# import RF estimator
from sklearn.ensemble import RandomForestClassifier
# instantiate RF estimator
clf = RandomForestClassifier(random_state=0)
print(clf)
```
%% Output
RandomForestClassifier(random_state=0)
%% Cell type:markdown id: tags:
After creation, the `estimator` can be **fit** to the training data using the `fit` method.
The fit method accepts 2 inputs:
1. `X`: the training input samples, of shape (`n_samples`, `n_features`); rows are samples and columns are features
2. `y`: the target values (class labels in classification, real numbers in regression), of shape (`n_samples`,) for a single output, or (`n_samples`, `n_outputs`) for multiple outputs
> Note: Both `X` and `y` are usually numpy arrays or equivalent array-like data types.
%% Cell type:code id: tags:
``` python
# create some toy data
X = [[ 1,  2,  3],  # 2 samples, 3 features
     [11, 12, 13]]
y = [0, 1] # classes of each sample
# fit the model to the data
clf.fit(X, y)
```
%% Output
RandomForestClassifier(random_state=0)
%% Cell type:markdown id: tags:
Once trained, the `estimator` can make predictions on new data using the `predict` method.
%% Cell type:code id: tags:
``` python
# predict classes of new data
clf.predict([[4, 5, 6], [14, 15, 16]])
```
%% Output
array([0, 1])
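%% Cell type:markdown id: tags:
As an aside (a small sketch, not part of the original example): the `fit` method returns the fitted estimator itself, so fitting and predicting can be chained in a single expression.
%% Cell type:code id: tags:
``` python
from sklearn.ensemble import RandomForestClassifier

# same toy data as above
X = [[ 1,  2,  3],
     [11, 12, 13]]
y = [0, 1]

# fit returns the fitted estimator, so predict can be chained directly
pred = RandomForestClassifier(random_state=0).fit(X, y).predict([[4, 5, 6], [14, 15, 16]])
print(pred)
```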
%% Cell type:markdown id: tags:
Importantly, this `fit` and `predict` interface is consistent across different estimators, making it very easy to swap estimators within your code. For example, here we swap the Random Forest `estimator` for a Support Vector Machine `estimator`:
%% Cell type:code id: tags:
``` python
# import an SVM estimator
from sklearn.svm import SVC
# instantiate SVM estimator
clf = SVC(random_state=0)
# fit and predict
clf.fit(X, y)
clf.predict([[4, 5, 6], [14, 15, 16]])
```
%% Output
array([0, 1])
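%% Cell type:markdown id: tags:
To underline the point, the shared interface lets us try several models in a single loop. This is a sketch, not part of the original example; it adds a `LogisticRegression` estimator that the notebook has not introduced.
%% Cell type:code id: tags:
``` python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X = [[ 1,  2,  3],
     [11, 12, 13]]
y = [0, 1]
X_new = [[4, 5, 6], [14, 15, 16]]

# every estimator exposes the same fit/predict methods
for clf in (RandomForestClassifier(random_state=0), SVC(random_state=0), LogisticRegression()):
    preds = clf.fit(X, y).predict(X_new)
    print(type(clf).__name__, preds)
```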
%% Cell type:markdown id: tags:
<a class="anchor" id="transformers"></a>
### Transformers
The `transformer` is a special object that follows the same API as an `estimator` and allows you to apply pre-processing and/or post-processing transforms to the data in your machine learning pipeline. The `transformer` object has a `transform` method instead of a `predict` method.
In this example we use a `transformer` to standardise the features (i.e. remove the mean and scale to unit variance). The `fit` method calculates the mean and variance parameters from the data, and the `transform` method does the scaling.
%% Cell type:code id: tags:
``` python
# import StandardScaler transformer
from sklearn.preprocessing import StandardScaler
# create some toy data
X = [[0,  15],
     [1, -10]]
# instantiate StandardScaler transformer
scaler = StandardScaler()
# fit the transform to the data
scaler.fit(X)
# apply transform
scaler.transform(X)
```
%% Output
array([[-1., 1.],
[ 1., -1.]])
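%% Cell type:markdown id: tags:
A small convenience worth knowing (a sketch, not in the original cell): `fit_transform` performs both steps in one call, equivalent to `fit` followed by `transform` on the same data.
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = [[0,  15],
     [1, -10]]

# fit_transform(X) is equivalent to fit(X) followed by transform(X)
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled)
```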
%% Cell type:markdown id: tags:
<a class="anchor" id="pipelines"></a>
### Pipelines: chaining transforms and estimators
A typical machine learning pipeline often involves numerous pre-processing transforms followed by an estimator. The `pipeline` object combines `transformer` and `estimator` objects into a single object whose API is the same as that of a regular `estimator`. A pipeline is constructed with the `make_pipeline` function.
In this example we create a very simple `pipeline` comprising a StandardScaler `transformer` and a Random Forest `estimator`.
%% Cell type:code id: tags:
``` python
# imports
from sklearn.ensemble import RandomForestClassifier # estimator
from sklearn.preprocessing import StandardScaler # transformer
from sklearn.pipeline import make_pipeline # function to construct pipeline
from sklearn.datasets import load_iris # function to load demo dataset
# create a pipeline object
pipe = make_pipeline(
    StandardScaler(),
    RandomForestClassifier()
)
# load the iris dataset
X, y = load_iris(return_X_y=True)
# fit the whole pipeline
pipe.fit(X, y)
# predict classes from the training data
pipe.predict(X)
```
%% Output
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
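%% Cell type:markdown id: tags:
Because the fitted `pipe` behaves like any other `estimator`, it can be evaluated like one. As a sketch (using `train_test_split` and `accuracy_score`, which the original cell does not use), we can hold out a test set and score predictions on unseen data rather than on the training data:
%% Cell type:code id: tags:
``` python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# hold out a test set so accuracy is measured on unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0))
pipe.fit(X_train, y_train)

acc = accuracy_score(y_test, pipe.predict(X_test))
print(acc)
```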
%% Cell type:markdown id: tags:
### Model evaluation
TBD
%% Cell type:markdown id: tags:
### Automatic parameter searches
TBD
%% Cell type:markdown id: tags:
## Classification
TBD
%% Cell type:markdown id: tags:
## Regression
TBD
%% Cell type:markdown id: tags: