diff --git a/applications/machine_learning/scikit_learn.ipynb b/applications/machine_learning/scikit_learn.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..23f2e146ee5ca4db6e0a0dce1d76f19bd82300c1 --- /dev/null +++ b/applications/machine_learning/scikit_learn.ipynb @@ -0,0 +1,268 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# scikit-learn\n", + "\n", + "Machine learning in Python.\n", + "\n", + "- Simple and efficient tools for predictive data analysis\n", + "- Built on `numpy`, `scipy`, and `matplotlib`\n", + "- Open source, commercially usable - BSD license\n", + "\n", + "Some useful links are:\n", + "- [Main website](https://scikit-learn.org/stable/index.html)\n", + "- [User guide](https://scikit-learn.org/stable/user_guide.html)\n", + "- [API](https://scikit-learn.org/stable/modules/classes.html)\n", + "- [GitHub](https://github.com/scikit-learn/scikit-learn)\n", + "\n", + "## Project Overview:\n", + "\n", + "- \> 2.1k contributors, 22.5k forks, 48.5k \"stars\"\n", + "- Current version: 1.0.2\n", + "- Powerful generic API\n", + "- Very well documented\n", + "- Core techniques, hundreds of options for:\n", + " - Classification \n", + " - Regression \n", + " - Clustering \n", + " - Dimensionality Reduction \n", + " - Model Selection \n", + " - Preprocessing\n", + " " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Getting Started\n", + "\n", + "Adapted from the `scikit-learn` \"Getting Started\" documentation: https://scikit-learn.org/stable/getting_started.html" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Fitting and predicting: estimator basics\n", + "\n", + "Each machine learning model in `scikit-learn` is implemented as an [`estimator`](https://scikit-learn.org/stable/glossary.html#term-estimators) object. 
Here we instantiate a [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier) `estimator`:" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "RandomForestClassifier(random_state=0)\n" + ] + } + ], + "source": [ + "# import RF estimator\n", + "from sklearn.ensemble import RandomForestClassifier\n", + "\n", + "# instantiate RF estimator\n", + "clf = RandomForestClassifier(random_state=0)\n", + "\n", + "print(clf)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "After creation, the `estimator` can be **fit** to the training data using the `fit` method.\n", + "\n", + "The `fit` method accepts two inputs:\n", + "\n", + "1. `X` is the training input samples, of shape (`n_samples`, `n_features`). Thus, rows=samples and columns=features\n", + "2. `y` is the target values (class labels in classification, real numbers in regression) of shape (`n_samples`,) for one output, or (`n_samples`, `n_outputs`) for multiple outputs. \n", + "\n", + "\n", + "> Note: Both `X` and `y` are usually numpy arrays or equivalent array-like data types." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "RandomForestClassifier(random_state=0)" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# create some toy data\n", + "X = [[ 1, 2, 3], # 2 samples, 3 features\n", + " [11, 12, 13]]\n", + "y = [0, 1] # classes of each sample\n", + "\n", + "# fit the model to the data\n", + "clf.fit(X, y)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Once trained, the `estimator` can make predictions on new data using the `predict` method." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([0, 1])" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# predict classes of new data\n", + "clf.predict([[4, 5, 6], [14, 15, 16]]) " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "Importantly, this `fit` and `predict` interface is consistent across different estimators, making it very easy to change estimators within your code. For example, here we swap the Random Forest `estimator` for a Support Vector Machine `estimator`:" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([0, 1])" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# import an SVM estimator\n", + "from sklearn.svm import SVC\n", + "\n", + "# instantiate SVM estimator\n", + "clf = SVC(random_state=0)\n", + "\n", + "# fit and predict\n", + "clf.fit(X, y)\n", + "clf.predict([[4, 5, 6], [14, 15, 16]]) " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Transformers and pre-processors\n", + "\n", + "The `transformer` is a special object that follows the same API as an `estimator` and allows you to apply pre-processing and/or post-processing transforms to the data in your machine learning pipeline." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Pipelines: chaining pre-processors and estimators\n", + "\n", + "TBD" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Model evaluation\n", + "\n", + "TBD" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Automatic parameter searches\n", + "\n", + "TBD" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Classification\n", + "\n", + "TBD" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Regression\n", + "\n", + "TBD" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + } + ], + "metadata": { + "interpreter": { + "hash": "8f33debd3f7a540b5e7318765d3a2e4659a97370cb154e29f8abf55d879f5f56" + }, + "kernelspec": { + "display_name": "Python 3.7.6 64-bit ('fslpython604': conda)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.6" + }, + "orig_nbformat": 4 + }, + "nbformat": 4, + "nbformat_minor": 2 +}