Commit f87a698c authored by Sean Fitzgibbon, committed by Paul McCarthy

Added transformers and pipelines
%% Cell type:markdown id: tags:
# scikit-learn
Machine learning in Python.
- Simple and efficient tools for predictive data analysis
- Built on `numpy`, `scipy`, and `matplotlib`
- Open source, commercially usable - BSD license
Some useful links are:
- [Main website](https://scikit-learn.org/stable/index.html)
- [User guide](https://scikit-learn.org/stable/user_guide.html)
- [API](https://scikit-learn.org/stable/modules/classes.html)
- [GitHub](https://github.com/scikit-learn/scikit-learn)
## Project Overview:
- \> 2.1k contributors, 22.5k forks, 48.5k "stars"
- Current version: 1.0.2
- Powerful generic API
- Very well documented
- Core techniques, hundreds of options for:
- Classification
- Regression
- Clustering
- Dimensionality Reduction
- Model Selection
- Preprocessing
## Notebook Overview:
* [Getting Started](#getting-started)
* [Estimators: fitting and predicting](#estimators)
* [Transformers](#transformers)
* [Pipelines: chaining transforms and estimators](#pipelines)
%% Cell type:markdown id: tags:
<a class="anchor" id="getting-started"></a>
## Getting Started
Adapted from the `scikit-learn` "Getting Started" documentation: https://scikit-learn.org/stable/getting_started.html
Three important concepts to understand for using `scikit-learn`:
1. `estimator` objects and their `fit` and `predict` methods for fitting data and making predictions
2. `transformer` objects for pre/post-processing transforms
3. `pipeline` objects for chaining together `transformers` and `estimators` into a machine learning pipeline
%% Cell type:markdown id: tags:
<a class="anchor" id="estimators"></a>
### Estimators: fitting and predicting
Each machine learning model in `scikit-learn` is implemented as an [`estimator`](https://scikit-learn.org/stable/glossary.html#term-estimators) object. Here we instantiate a [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier) `estimator`:
%% Cell type:code id: tags:
``` python
# import RF estimator
from sklearn.ensemble import RandomForestClassifier
# instantiate RF estimator
clf = RandomForestClassifier(random_state=0)
print(clf)
```
%% Output
RandomForestClassifier(random_state=0)
%% Cell type:markdown id: tags:
After creation, the `estimator` can be **fit** to the training data using the `fit` method.
The fit method accepts 2 inputs:
1. `X`: the training input samples, of shape (`n_samples`, `n_features`); rows are samples and columns are features
2. `y`: the target values (class labels in classification, real numbers in regression), of shape (`n_samples`,) for a single output, or (`n_samples`, `n_outputs`) for multiple outputs
> Note: Both `X` and `y` are usually numpy arrays or equivalent array-like data types.
%% Cell type:code id: tags:
``` python
# create some toy data
X = [[ 1,  2,  3],  # 2 samples, 3 features
     [11, 12, 13]]
y = [0, 1] # classes of each sample
# fit the model to the data
clf.fit(X, y)
```
%% Output
RandomForestClassifier(random_state=0)
%% Cell type:markdown id: tags:
Once trained, the `estimator` can make predictions on new data using the `predict` method.
%% Cell type:code id: tags:
``` python
# predict classes of new data
clf.predict([[4, 5, 6], [14, 15, 16]])
```
%% Output
array([0, 1])
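%% Cell type:markdown id: tags:
As an aside (a small sketch, not part of the original example): the `fit` method returns the fitted estimator itself, so fitting and predicting can be chained in a single expression.
%% Cell type:code id: tags:
``` python
from sklearn.ensemble import RandomForestClassifier

# same toy data as above
X = [[ 1,  2,  3],
     [11, 12, 13]]
y = [0, 1]

# fit returns the fitted estimator, so predict can be chained directly
pred = RandomForestClassifier(random_state=0).fit(X, y).predict([[4, 5, 6], [14, 15, 16]])
print(pred)
```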
%% Cell type:markdown id: tags:
Importantly, this `fit` and `predict` interface is consistent across different estimators, making it very easy to swap estimators within your code. For example, here we swap the Random Forest `estimator` for a Support Vector Machine `estimator`:
%% Cell type:code id: tags:
``` python
# import an SVM estimator
from sklearn.svm import SVC
# instantiate SVM estimator
clf = SVC(random_state=0)
# fit and predict
clf.fit(X, y)
clf.predict([[4, 5, 6], [14, 15, 16]])
```
%% Output
array([0, 1])
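%% Cell type:markdown id: tags:
To underline the point, the shared interface lets us try several models in a single loop. This is a sketch, not part of the original example; it adds a `LogisticRegression` estimator that the notebook has not introduced.
%% Cell type:code id: tags:
``` python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X = [[ 1,  2,  3],
     [11, 12, 13]]
y = [0, 1]
X_new = [[4, 5, 6], [14, 15, 16]]

# every estimator exposes the same fit/predict methods
for clf in (RandomForestClassifier(random_state=0), SVC(random_state=0), LogisticRegression()):
    preds = clf.fit(X, y).predict(X_new)
    print(type(clf).__name__, preds)
```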
%% Cell type:markdown id: tags:
<a class="anchor" id="transformers"></a>
### Transformers
The `transformer` is a special object that follows the same API as an `estimator` and allows you to apply pre-processing and/or post-processing transforms to the data in your machine learning pipeline. The `transformer` object has a `transform` method instead of a `predict` method.
In this example we use a `transformer` to standardise the features (i.e. remove the mean and scale to unit variance). The `fit` method calculates the mean and variance parameters from the data, and the `transform` method does the scaling.
%% Cell type:code id: tags:
``` python
# import StandardScaler transformer
from sklearn.preprocessing import StandardScaler
# create some toy data
X = [[0,  15],
     [1, -10]]
# instantiate StandardScaler transformer
scaler = StandardScaler()
# fit the transform to the data
scaler.fit(X)
# apply transform
scaler.transform(X)
```
%% Output
array([[-1., 1.],
[ 1., -1.]])
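%% Cell type:markdown id: tags:
A small convenience worth knowing (a sketch, not in the original cell): `fit_transform` performs both steps in one call, equivalent to `fit` followed by `transform` on the same data.
%% Cell type:code id: tags:
``` python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = [[0,  15],
     [1, -10]]

# fit_transform(X) is equivalent to fit(X) followed by transform(X)
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled)
```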
%% Cell type:markdown id: tags:
<a class="anchor" id="pipelines"></a>
### Pipelines: chaining transforms and estimators
A typical machine learning pipeline often involves numerous pre-processing transforms followed by an estimator. The `pipeline` object combines `transformer` and `estimator` objects into a single object whose API is the same as that of a regular `estimator`. A pipeline is constructed with the `make_pipeline` function.
In this example we create a very simple `pipeline` comprising a StandardScaler `transformer` and a Random Forest `estimator`.
%% Cell type:code id: tags:
``` python
# imports
from sklearn.ensemble import RandomForestClassifier # estimator
from sklearn.preprocessing import StandardScaler # transformer
from sklearn.pipeline import make_pipeline # function to construct pipeline
from sklearn.datasets import load_iris # function to load demo dataset
# create a pipeline object
pipe = make_pipeline(
    StandardScaler(),
    RandomForestClassifier()
)
# load the iris dataset
X, y = load_iris(return_X_y=True)
# fit the whole pipeline
pipe.fit(X, y)
# predict classes from the training data
pipe.predict(X)
```
%% Output
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
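%% Cell type:markdown id: tags:
Because the fitted `pipe` behaves like any other `estimator`, it can be evaluated like one. As a sketch (using `train_test_split` and `accuracy_score`, which the original cell does not use), we can hold out a test set and score predictions on unseen data rather than on the training data:
%% Cell type:code id: tags:
``` python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# hold out a test set so accuracy is measured on unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0))
pipe.fit(X_train, y_train)

acc = accuracy_score(y_test, pipe.predict(X_test))
print(acc)
```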
%% Cell type:markdown id: tags:
### Model evaluation
TBD
%% Cell type:markdown id: tags:
### Automatic parameter searches
TBD
%% Cell type:markdown id: tags:
## Classification
TBD
%% Cell type:markdown id: tags:
## Regression
TBD
%% Cell type:markdown id: tags: