diff --git a/applications/pandas/pandas.ipynb b/applications/pandas/pandas.ipynb index 1245aaef9b4ee333e1bea90787b8a823193e0223..0d5a968b904704735f927dcb7ab57687087ff485 100644 --- a/applications/pandas/pandas.ipynb +++ b/applications/pandas/pandas.ipynb @@ -2,6 +2,7 @@ "cells": [ { "cell_type": "markdown", + "id": "9803940b", "metadata": {}, "source": [ "# Pandas\n", @@ -23,6 +24,7 @@ { "cell_type": "code", "execution_count": null, + "id": "eb7f5417", "metadata": {}, "outputs": [], "source": [ @@ -38,6 +40,7 @@ }, { "cell_type": "markdown", + "id": "9f998231", "metadata": {}, "source": [ "> We will mostly be using `seaborn` instead of `matplotlib` for\n", @@ -54,6 +57,7 @@ { "cell_type": "code", "execution_count": null, + "id": "7020257e", "metadata": {}, "outputs": [], "source": [ @@ -63,6 +67,7 @@ }, { "cell_type": "markdown", + "id": "6a17483f", "metadata": {}, "source": [ "This loads the data into a\n", @@ -78,6 +83,7 @@ { "cell_type": "code", "execution_count": null, + "id": "4d0436b1", "metadata": {}, "outputs": [], "source": [ @@ -86,6 +92,7 @@ }, { "cell_type": "markdown", + "id": "173c767f", "metadata": {}, "source": [ "If you can not connect to the internet, you can run the command below to load\n", @@ -95,6 +102,7 @@ { "cell_type": "code", "execution_count": null, + "id": "9a047455", "metadata": {}, "outputs": [], "source": [ @@ -104,6 +112,7 @@ }, { "cell_type": "markdown", + "id": "6b801a28", "metadata": {}, "source": [ "Note that the titanic dataset was also available to us as one of the standard\n", @@ -113,6 +122,7 @@ { "cell_type": "code", "execution_count": null, + "id": "7be7954c", "metadata": {}, "outputs": [], "source": [ @@ -121,6 +131,7 @@ }, { "cell_type": "markdown", + "id": "112b9665", "metadata": {}, "source": [ "`Dataframes` can also be created from other python objects, using\n", @@ -131,6 +142,7 @@ { "cell_type": "code", "execution_count": null, + "id": "de3236a1", "metadata": {}, "outputs": [], "source": [ @@ -145,18 +157,50 @@ }, { "cell_type": "markdown", + "id": "0761c3c8", "metadata": {}, "source": [ + "## A note on types\n", + "Each column in the pandas dataframe has its own data type, which can be:\n", + "- integer or float for numbers\n", + "- boolean for True/False\n", + "- datetime for defining specific times (and timedelta for durations)\n", + "- categorical, where each element is selected from a finite list of text values\n", + "- object for anything else, used for strings or columns with mixed elements\n", + "Each element in the column must match the type of the whole column. \n", + "When reading in a dataset, pandas will try to assign the most specific type to each column.\n", + "Every pandas datatype also has support for missing data (which we will look at in more detail below).\n", + "\n", + "One can check the type of each column using:" ] }, { "cell_type": "code", "execution_count": null, + "id": "398a2240", "metadata": {}, "outputs": [], "source": [ + "titanic.dtypes" ] }, { "cell_type": "markdown", + "id": "839cfc99", "metadata": {}, "source": [ + "Note that in much of the python ecosystem, data types are referred to as dtypes.\n", "## Getting your data out\n", - "For many applications (e.g., ICA, machine learning) you might want to\n", + "For some applications you might want to\n", "extract your data as a numpy array, even though more and more projects \n", - "support pandas Dataframes directly. 
The underlying numpy array can be \n", - "accessed using the `to_numpy` method" + "support pandas Dataframes directly (including `scikit-learn`). \n", + "The underlying numpy array can be accessed using the `to_numpy` method" ] }, { "cell_type": "code", "execution_count": null, + "id": "987ee604", "metadata": {}, "outputs": [], "source": [ @@ -165,16 +209,25 @@ }, { "cell_type": "markdown", + "id": "1d210770", "metadata": {}, "source": [ - "Note that the type of the returned array is the most common type (in this case\n", - "object). If you just want the numeric parts of the table you can use\n", - "`select_dtypes`, which selects specific columns based on their dtype:" + "Similarly to the `pandas` types discussed above,\n", + "`numpy` also requires all elements to have the same type.\n", + "However, `numpy` requires all elements in the whole array,\n", + "not just a single column, to be the same type.\n", + "In this case, this means that all data had to be converted\n", + "to the generic \"object\" type, which is not particularly useful.\n", + "\n", + "For most analyses, we would only be interested in the numeric columns.\n", + "These can be extracted using `select_dtypes`, which selects specific columns \n", + "based on their data type (dtype):" ] }, { "cell_type": "code", "execution_count": null, + "id": "7836cb90", "metadata": {}, "outputs": [], "source": [ @@ -183,16 +236,26 @@ }, { "cell_type": "markdown", + "id": "85bbba82", "metadata": {}, "source": [ - "Note that the numpy array has no information on the column names or row indices.\n", - "Alternatively, when you want to include the categorical variables in your later\n", - "analysis (e.g., for machine learning), you can extract dummy variables using:" + "Now we get an array with a numeric type rather than the generic \"object\",\n", + "which is a lot more useful, as we can now run math operations on the\n", + "resulting array (e.g., PCA).\n", + "\n", + "Finally, let's have a look at extracting categorical variables.\n", + "These are columns where each element has one of a finite list of possible values\n", + "(e.g., the \"embark_town\" column being \"Southampton\", \"Cherbourg\", or \"Queenstown\",\n", + "which are the three towns where the Titanic docked to let on passengers).\n", + "As we will see below, `pandas` has extensive support for categorical values,\n", + "but many other tools do not. To support those tools, `pandas` allows you to \n", + "replace such columns with dummy variables:" ] }, { "cell_type": "code", "execution_count": null, + "id": "60c7e8fc", "metadata": {}, "outputs": [], "source": [ @@ -201,8 +264,14 @@ }, { "cell_type": "markdown", + "id": "1defde14", "metadata": {}, "source": [ + "Note that rather than having a single \"embark_town\" column with a categorical type,\n", + "we now have three columns named \"embark_town_<name>\" with a 1 for every passenger\n", + "who embarked in that town. 
These numeric columns can then be fed into a GLM or\n", + "a machine learning algorithm.\n", + "\n", "## Accessing parts of the data\n", "\n", "[Documentation on indexing](http://pandas.pydata.org/pandas-docs/stable/indexing.html)\n", @@ -215,6 +284,7 @@ { "cell_type": "code", "execution_count": null, + "id": "fa00ea38", "metadata": {}, "outputs": [], "source": [ @@ -223,6 +293,7 @@ }, { "cell_type": "markdown", + "id": "2bb923fa", "metadata": {}, "source": [ "If the column names is a valid python identifier (i.e., is a string that does not contain stuff like spaces)\n", @@ -232,6 +303,7 @@ { "cell_type": "code", "execution_count": null, + "id": "acf0cfc6", "metadata": {}, "outputs": [], "source": [ @@ -240,6 +312,7 @@ }, { "cell_type": "markdown", + "id": "1bfb79b1", "metadata": {}, "source": [ "Note that this returns a single column is represented by a pandas\n", @@ -253,6 +326,7 @@ { "cell_type": "code", "execution_count": null, + "id": "7c097dc4", "metadata": {}, "outputs": [], "source": [ @@ -261,6 +335,7 @@ }, { "cell_type": "markdown", + "id": "027848a7", "metadata": {}, "source": [ "Note that you have to provide a list here (square brackets). If you provide a\n", @@ -271,6 +346,7 @@ { "cell_type": "code", "execution_count": null, + "id": "5be86a4a", "metadata": {}, "outputs": [], "source": [ @@ -279,6 +355,7 @@ }, { "cell_type": "markdown", + "id": "bbf62da6", "metadata": {}, "source": [ "In this case there is no column called `('class', 'alive')` leading to an\n", @@ -290,6 +367,7 @@ { "cell_type": "code", "execution_count": null, + "id": "8fc35ce3", "metadata": {}, "outputs": [], "source": [ @@ -299,6 +377,7 @@ }, { "cell_type": "markdown", + "id": "766e1a41", "metadata": {}, "source": [ "We can delete a column using:" @@ -307,6 +386,7 @@ { "cell_type": "code", "execution_count": null, + "id": "61d91bdf", "metadata": {}, "outputs": [], "source": [ @@ -315,6 +395,7 @@ }, { "cell_type": "markdown", + "id": "9ea81208", "metadata": {}, "source": [ "### Indexing rows by name or integer\n", @@ -328,6 +409,7 @@ { "cell_type": "code", "execution_count": null, + "id": "e6263074", "metadata": {}, "outputs": [], "source": [ @@ -337,6 +419,7 @@ }, { "cell_type": "markdown", + "id": "0a6d1311", "metadata": {}, "source": [ "Note that the re-sorting did not change the values in the index (i.e., left-most\n", @@ -348,6 +431,7 @@ { "cell_type": "code", "execution_count": null, + "id": "1016cb0b", "metadata": {}, "outputs": [], "source": [ @@ -356,6 +440,7 @@ }, { "cell_type": "markdown", + "id": "00bfb183", "metadata": {}, "source": [ "We can select the row with the index 0 using" @@ -364,6 +449,7 @@ { "cell_type": "code", "execution_count": null, + "id": "63cb04fc", "metadata": {}, "outputs": [], "source": [ @@ -372,6 +458,7 @@ }, { "cell_type": "markdown", + "id": "b0bf4c58", "metadata": {}, "source": [ "Note that this gives the same passenger as the first row of the initial table\n", @@ -381,6 +468,7 @@ { "cell_type": "code", "execution_count": null, + "id": "ece87592", "metadata": {}, "outputs": [], "source": [ @@ -389,6 +477,7 @@ }, { "cell_type": "markdown", + "id": "738c3b6c", "metadata": {}, "source": [ "Another common way to access the first or last N rows of a table is using the\n", @@ -398,6 +487,7 @@ { "cell_type": "code", "execution_count": null, + "id": "8937c751", "metadata": {}, "outputs": [], "source": [ @@ -407,6 +497,7 @@ { "cell_type": "code", "execution_count": null, + "id": "0333e9cc", "metadata": {}, "outputs": [], "source": [ @@ -415,6 +506,7 @@ }, { "cell_type": 
"markdown", + "id": "e846457f", "metadata": {}, "source": [ "Note that nearly all methods in pandas return a new `DataFrame`, which means\n", @@ -427,6 +519,7 @@ { "cell_type": "code", "execution_count": null, + "id": "84cc7089", "metadata": {}, "outputs": [], "source": [ @@ -435,6 +528,7 @@ }, { "cell_type": "markdown", + "id": "0611a8bf", "metadata": {}, "source": [ "> This chaining is usually very efficient, because when creating a new `DataFrame`\n", @@ -446,6 +540,7 @@ { "cell_type": "code", "execution_count": null, + "id": "dbd982e7", "metadata": {}, "outputs": [], "source": [ @@ -454,6 +549,7 @@ }, { "cell_type": "markdown", + "id": "71a1e3a3", "metadata": {}, "source": [ "**Exercise**: use sorting and tail/head or indexing to find the 10 youngest\n", @@ -464,6 +560,7 @@ { "cell_type": "code", "execution_count": null, + "id": "7272cae5", "metadata": {}, "outputs": [], "source": [ @@ -472,6 +569,7 @@ }, { "cell_type": "markdown", + "id": "c836a51b", "metadata": {}, "source": [ "### Indexing rows by value\n", @@ -482,6 +580,7 @@ { "cell_type": "code", "execution_count": null, + "id": "7dbabecf", "metadata": {}, "outputs": [], "source": [ @@ -491,6 +590,7 @@ { "cell_type": "code", "execution_count": null, + "id": "e9503367", "metadata": {}, "outputs": [], "source": [ @@ -500,6 +600,7 @@ }, { "cell_type": "markdown", + "id": "edefb548", "metadata": {}, "source": [ "Note that this required typing `titanic` quite often.\n", @@ -515,6 +616,7 @@ { "cell_type": "code", "execution_count": null, + "id": "0c710cfa", "metadata": {}, "outputs": [], "source": [ @@ -523,6 +625,7 @@ }, { "cell_type": "markdown", + "id": "d8545d1a", "metadata": {}, "source": [ "When selecting a categorical multiple options from a categorical values you \n", @@ -532,6 +635,7 @@ { "cell_type": "code", "execution_count": null, + "id": "93a73be5", "metadata": {}, "outputs": [], "source": [ @@ -540,6 +644,7 @@ }, { "cell_type": "markdown", + "id": "129a27a3", "metadata": {}, "source": [ "Particularly useful when selecting data like this is the `isna` method which\n", @@ -549,6 +654,7 @@ { "cell_type": "code", "execution_count": null, + "id": "66af0870", "metadata": {}, "outputs": [], "source": [ @@ -557,6 +663,7 @@ }, { "cell_type": "markdown", + "id": "ce7d05c3", "metadata": {}, "source": [ "This removing of missing numbers is so common that it has is own method" @@ -565,6 +672,7 @@ { "cell_type": "code", "execution_count": null, + "id": "79d28611", "metadata": {}, "outputs": [], "source": [ @@ -574,6 +682,7 @@ { "cell_type": "code", "execution_count": null, + "id": "34ebdb36", "metadata": {}, "outputs": [], "source": [ @@ -582,6 +691,7 @@ }, { "cell_type": "markdown", + "id": "82ddcb59", "metadata": {}, "source": [ "**Exercise**: use sorting, indexing by value, `dropna` and `tail`/`head` or\n", @@ -592,6 +702,7 @@ { "cell_type": "code", "execution_count": null, + "id": "1fe9d398", "metadata": {}, "outputs": [], "source": [ @@ -600,6 +711,7 @@ }, { "cell_type": "markdown", + "id": "c394e5ac", "metadata": {}, "source": [ "## Plotting the data\n", @@ -611,6 +723,7 @@ { "cell_type": "code", "execution_count": null, + "id": "1d443d44", "metadata": {}, "outputs": [], "source": [ @@ -620,6 +733,7 @@ { "cell_type": "code", "execution_count": null, + "id": "8bd6d770", "metadata": {}, "outputs": [], "source": [ @@ -628,6 +742,7 @@ }, { "cell_type": "markdown", + "id": "dc8e64a9", "metadata": {}, "source": [ "To plot all variables simply call `plot` or `hist` on the full `DataFrame`\n", @@ -638,6 +753,7 @@ { "cell_type": "code", 
"execution_count": null, + "id": "ab6c3514", "metadata": {}, "outputs": [], "source": [ @@ -646,6 +762,7 @@ }, { "cell_type": "markdown", + "id": "8140217e", "metadata": {}, "source": [ "Individual `Series` are essentially 1D arrays, so we can use them as such in\n", @@ -655,6 +772,7 @@ { "cell_type": "code", "execution_count": null, + "id": "48a59b56", "metadata": {}, "outputs": [], "source": [ @@ -663,6 +781,7 @@ }, { "cell_type": "markdown", + "id": "07cf3584", "metadata": {}, "source": [ "However, for most purposes much nicer plots can be obtained using\n", @@ -679,6 +798,7 @@ { "cell_type": "code", "execution_count": null, + "id": "8563a7a1", "metadata": {}, "outputs": [], "source": [ @@ -687,6 +807,7 @@ }, { "cell_type": "markdown", + "id": "0e752be0", "metadata": {}, "source": [ "**Exercise**: check the documentation from `sns.jointplot` (hover the mouse\n", @@ -697,6 +818,7 @@ { "cell_type": "code", "execution_count": null, + "id": "4ccd4177", "metadata": {}, "outputs": [], "source": [ @@ -705,6 +827,7 @@ }, { "cell_type": "markdown", + "id": "4e513fbc", "metadata": {}, "source": [ "Here is just a brief example of how we can use multiple columns to illustrate\n", @@ -714,6 +837,7 @@ { "cell_type": "code", "execution_count": null, + "id": "f5cb00af", "metadata": {}, "outputs": [], "source": [ @@ -723,6 +847,7 @@ }, { "cell_type": "markdown", + "id": "6ec94eac", "metadata": {}, "source": [ "**Exercise**: Split the plot above into two rows with the first row including\n", @@ -734,6 +859,7 @@ { "cell_type": "code", "execution_count": null, + "id": "c6f6e763", "metadata": {}, "outputs": [], "source": [ @@ -743,6 +869,7 @@ }, { "cell_type": "markdown", + "id": "1d54e3ec", "metadata": {}, "source": [ "One of the nice thing of Seaborn is how easy it is to update how these plots\n", @@ -754,6 +881,7 @@ { "cell_type": "code", "execution_count": null, + "id": "52183332", "metadata": {}, "outputs": [], "source": [ @@ -764,6 +892,7 @@ }, { "cell_type": "markdown", + "id": "ac35e133", "metadata": {}, "source": [ "## Summarizing the data (mean, std, etc.)\n", @@ -776,6 +905,7 @@ { "cell_type": "code", "execution_count": null, + "id": "404be564", "metadata": {}, "outputs": [], "source": [ @@ -785,6 +915,7 @@ { "cell_type": "code", "execution_count": null, + "id": "bd6dd429", "metadata": {}, "outputs": [], "source": [ @@ -793,6 +924,7 @@ }, { "cell_type": "markdown", + "id": "3f0eaeb2", "metadata": {}, "source": [ "One very useful one is `describe`, which gives an overview of many common\n", @@ -802,6 +934,7 @@ { "cell_type": "code", "execution_count": null, + "id": "e52493af", "metadata": {}, "outputs": [], "source": [ @@ -810,6 +943,7 @@ }, { "cell_type": "markdown", + "id": "7fd8fba3", "metadata": {}, "source": [ "Note that non-numeric columns are ignored when summarizing data in this way.\n", @@ -822,6 +956,7 @@ { "cell_type": "code", "execution_count": null, + "id": "3fffbcb9", "metadata": {}, "outputs": [], "source": [ @@ -832,6 +967,7 @@ }, { "cell_type": "markdown", + "id": "d4c09639", "metadata": {}, "source": [ "We can also define our own functions to apply to the columns (in this case we\n", @@ -841,6 +977,7 @@ { "cell_type": "code", "execution_count": null, + "id": "e1b90c3f", "metadata": {}, "outputs": [], "source": [ @@ -858,6 +995,7 @@ }, { "cell_type": "markdown", + "id": "f869c17f", "metadata": {}, "source": [ "We can also provide multiple functions to the `apply` method (note that\n", @@ -867,6 +1005,7 @@ { "cell_type": "code", "execution_count": null, + "id": "2dd3d814", 
"metadata": {}, "outputs": [], "source": [ @@ -875,6 +1014,7 @@ }, { "cell_type": "markdown", + "id": "78e7e950", "metadata": {}, "source": [ "### Grouping by\n", @@ -890,6 +1030,7 @@ { "cell_type": "code", "execution_count": null, + "id": "c271697e", "metadata": {}, "outputs": [], "source": [ @@ -899,6 +1040,7 @@ }, { "cell_type": "markdown", + "id": "5537b1e4", "metadata": {}, "source": [ "However, it is more often combined with one of the aggregation functions\n", @@ -911,6 +1053,7 @@ { "cell_type": "code", "execution_count": null, + "id": "580a68d4", "metadata": {}, "outputs": [], "source": [ @@ -919,6 +1062,7 @@ }, { "cell_type": "markdown", + "id": "9a94b1c0", "metadata": {}, "source": [ "We can also group by multiple variables at once" @@ -927,6 +1071,7 @@ { "cell_type": "code", "execution_count": null, + "id": "4d119923", "metadata": {}, "outputs": [], "source": [ @@ -935,6 +1080,7 @@ }, { "cell_type": "markdown", + "id": "9c5c1119", "metadata": {}, "source": [ "When grouping it can help to use the `cut` method to split a continuous variable\n", @@ -944,6 +1090,7 @@ { "cell_type": "code", "execution_count": null, + "id": "e18ac0a4", "metadata": {}, "outputs": [], "source": [ @@ -952,6 +1099,7 @@ }, { "cell_type": "markdown", + "id": "0c3e2145", "metadata": {}, "source": [ "We can use the `aggregate` method to apply a different function to each series" @@ -960,6 +1108,7 @@ { "cell_type": "code", "execution_count": null, + "id": "cf6abd30", "metadata": {}, "outputs": [], "source": [ @@ -968,6 +1117,7 @@ }, { "cell_type": "markdown", + "id": "eaca0a93", "metadata": {}, "source": [ "Note that both the index (on the left) and the column names (on the top) now\n", @@ -981,6 +1131,7 @@ { "cell_type": "code", "execution_count": null, + "id": "79780e3b", "metadata": {}, "outputs": [], "source": [ @@ -990,6 +1141,7 @@ { "cell_type": "code", "execution_count": null, + "id": "fb15d602", "metadata": {}, "outputs": [], "source": [ @@ -999,6 +1151,7 @@ { "cell_type": "code", "execution_count": null, + "id": "e7ba4b48", "metadata": {}, "outputs": [], "source": [ @@ -1007,6 +1160,7 @@ }, { "cell_type": "markdown", + "id": "5414e4c5", "metadata": {}, "source": [ "Remember that indexing based on the index was done through `loc`. The rest is\n", @@ -1016,6 +1170,7 @@ { "cell_type": "code", "execution_count": null, + "id": "ed55f8ee", "metadata": {}, "outputs": [], "source": [ @@ -1025,6 +1180,7 @@ { "cell_type": "code", "execution_count": null, + "id": "1376c35c", "metadata": {}, "outputs": [], "source": [ @@ -1033,6 +1189,7 @@ }, { "cell_type": "markdown", + "id": "1289b2db", "metadata": {}, "source": [ "More advanced use of the `MultiIndex` is possible through `xs`:" @@ -1041,6 +1198,7 @@ { "cell_type": "code", "execution_count": null, + "id": "472127b8", "metadata": {}, "outputs": [], "source": [ @@ -1050,6 +1208,7 @@ { "cell_type": "code", "execution_count": null, + "id": "61e73d0b", "metadata": {}, "outputs": [], "source": [ @@ -1058,6 +1217,7 @@ }, { "cell_type": "markdown", + "id": "7fc120ae", "metadata": {}, "source": [ "## Reshaping tables\n", @@ -1069,6 +1229,7 @@ { "cell_type": "code", "execution_count": null, + "id": "6f3d1ccd", "metadata": {}, "outputs": [], "source": [ @@ -1077,6 +1238,7 @@ }, { "cell_type": "markdown", + "id": "b9a13425", "metadata": {}, "source": [ "However, this single-column table is difficult to read. 
The reason for this is\n", @@ -1091,6 +1253,7 @@ { "cell_type": "code", "execution_count": null, + "id": "c5cf521a", "metadata": {}, "outputs": [], "source": [ @@ -1099,6 +1262,7 @@ }, { "cell_type": "markdown", + "id": "55d2c5a4", "metadata": {}, "source": [ "The former table, where the different groups are defined in different rows, is\n", @@ -1117,6 +1281,7 @@ { "cell_type": "code", "execution_count": null, + "id": "69301d4e", "metadata": {}, "outputs": [], "source": [ @@ -1127,6 +1292,7 @@ }, { "cell_type": "markdown", + "id": "d81eb236", "metadata": {}, "source": [ "> There are also many ways to produce prettier tables in pandas. \n", @@ -1139,6 +1305,7 @@ { "cell_type": "code", "execution_count": null, + "id": "2ddece59", "metadata": {}, "outputs": [], "source": [ @@ -1147,6 +1314,7 @@ }, { "cell_type": "markdown", + "id": "e730a134", "metadata": {}, "source": [ "The first argument is the numeric variable that will be summarised. \n", @@ -1159,6 +1327,7 @@ { "cell_type": "code", "execution_count": null, + "id": "641a14bf", "metadata": {}, "outputs": [], "source": [ @@ -1167,6 +1336,7 @@ }, { "cell_type": "markdown", + "id": "ee37ad6b", "metadata": {}, "source": [ "We can also change the function to be used to aggregate the data (by default the mean is computed)" @@ -1175,6 +1345,7 @@ { "cell_type": "code", "execution_count": null, + "id": "a2ebc6c2", "metadata": {}, "outputs": [], "source": [ @@ -1184,6 +1355,7 @@ }, { "cell_type": "markdown", + "id": "b43f10b8", "metadata": {}, "source": [ "As in `groupby` the aggregation function can be a string of a common aggregation\n", @@ -1195,6 +1367,7 @@ { "cell_type": "code", "execution_count": null, + "id": "dc66e7c0", "metadata": {}, "outputs": [], "source": [ @@ -1204,6 +1377,7 @@ }, { "cell_type": "markdown", + "id": "25a8f2f1", "metadata": {}, "source": [ "The opposite of `pivot_table` is `melt`. 
This can be used to change a wide-form\n", @@ -1215,6 +1389,7 @@ { "cell_type": "code", "execution_count": null, + "id": "2d509083", "metadata": {}, "outputs": [], "source": [ @@ -1228,6 +1403,7 @@ }, { "cell_type": "markdown", + "id": "59b81e11", "metadata": {}, "source": [ "This wide-form table (i.e., all the information is in different columns) makes\n", @@ -1240,6 +1416,7 @@ { "cell_type": "code", "execution_count": null, + "id": "70e01ab3", "metadata": {}, "outputs": [], "source": [ @@ -1249,6 +1426,7 @@ }, { "cell_type": "markdown", + "id": "4c942e3b", "metadata": {}, "source": [ "We can see that `melt` took all the columns (we could also have specified a\n", @@ -1262,6 +1440,7 @@ { "cell_type": "code", "execution_count": null, + "id": "ee385ba6", "metadata": {}, "outputs": [], "source": [ @@ -1272,6 +1451,7 @@ }, { "cell_type": "markdown", + "id": "6145528a", "metadata": {}, "source": [ "Finally we probably do want the FA and MD variables as different columns.\n", @@ -1283,6 +1463,7 @@ { "cell_type": "code", "execution_count": null, + "id": "2fb60939", "metadata": {}, "outputs": [], "source": [ @@ -1291,6 +1472,7 @@ }, { "cell_type": "markdown", + "id": "2932f74b", "metadata": {}, "source": [ "We can now use the tools discussed above to visualize the table (`seaborn`) or\n", @@ -1300,6 +1482,7 @@ { "cell_type": "code", "execution_count": null, + "id": "621dfde3", "metadata": {}, "outputs": [], "source": [ @@ -1308,6 +1491,7 @@ }, { "cell_type": "markdown", + "id": "12aa9cb8", "metadata": {}, "source": [ "In general pandas is better at handling long-form than wide-form data, because\n", @@ -1321,6 +1505,7 @@ { "cell_type": "code", "execution_count": null, + "id": "acd7334f", "metadata": {}, "outputs": [], "source": [ @@ -1329,6 +1514,7 @@ }, { "cell_type": "markdown", + "id": "146c04ca", "metadata": {}, "source": [ "## Linear fitting (`statsmodels`)\n", @@ -1350,6 +1536,7 @@ { "cell_type": "code", "execution_count": null, + "id": "d9b6fae7", "metadata": {}, "outputs": [], "source": [ @@ -1359,6 +1546,7 @@ }, { "cell_type": "markdown", + "id": "ef53b1cd", "metadata": {}, "source": [ "Note that `statsmodels` understands categorical variables and automatically\n", @@ -1372,6 +1560,7 @@ { "cell_type": "code", "execution_count": null, + "id": "aa96dedf", "metadata": {}, "outputs": [], "source": [ @@ -1382,6 +1571,7 @@ }, { "cell_type": "markdown", + "id": "74d2b523", "metadata": {}, "source": [ "Cherbourg passengers clearly paid a lot more...\n", @@ -1400,14 +1590,13 @@ "Other useful features:\n", "- [Concatenating and merging tables](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/08_combine_dataframes.html)\n", "- [Lots of time series support](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/09_timeseries.html)\n", - "- [Rolling Window\n", - " functions](http://pandas.pydata.org/pandas-docs/stable/computation.html#window-\n", - " functions) for after you have meaningfully sorted your data\n", + "- [Rolling Window functions](http://pandas.pydata.org/pandas-docs/stable/computation.html#window-functions) \n", + " for after you have meaningfully sorted your data\n", "- and much, much more" ] } ], "metadata": {}, "nbformat": 4, - "nbformat_minor": 4 + "nbformat_minor": 5 } diff --git a/applications/pandas/pandas.md b/applications/pandas/pandas.md index 76dd30c74254105c8e62cb160bcf713a3822b89c..c627103c42fcf4a47877b6892feeb4d14eafdcad 100644 --- a/applications/pandas/pandas.md +++ b/applications/pandas/pandas.md @@ -80,31 +80,63 @@ 
pd.DataFrame.from_dict({ 'constant_value': 'same_value' }) ``` + +## A note on types +Each column in the pandas dataframe has its own data type, which can be: +- integer or float for numbers +- boolean for True/False +- datetime for defining specific times (and timedelta for durations) +- categorical, where each element is selected from a finite list of text values +- object for anything else, used for strings or columns with mixed elements +Each element in the column must match the type of the whole column. +When reading in a dataset, pandas will try to assign the most specific type to each column. +Every pandas datatype also has support for missing data (which we will look at in more detail below). + +One can check the type of each column using: +``` +titanic.dtypes +``` +Note that in much of the python ecosystem, data types are referred to as dtypes. ## Getting your data out -For many applications (e.g., ICA, machine learning) you might want to +For some applications you might want to extract your data as a numpy array, even though more and more projects -support pandas Dataframes directly. The underlying numpy array can be -accessed using the `to_numpy` method - +support pandas Dataframes directly (including `scikit-learn`). +The underlying numpy array can be accessed using the `to_numpy` method ``` titanic.to_numpy() ``` -Note that the type of the returned array is the most common type (in this case -object). If you just want the numeric parts of the table you can use -`select_dtypes`, which selects specific columns based on their dtype: +Similarly to the `pandas` types discussed above, +`numpy` also requires all elements to have the same type. +However, `numpy` requires all elements in the whole array, +not just a single column, to be the same type. +In this case, this means that all data had to be converted +to the generic "object" type, which is not particularly useful. +For most analyses, we would only be interested in the numeric columns. +These can be extracted using `select_dtypes`, which selects specific columns +based on their data type (dtype): ``` titanic.select_dtypes(include=np.number).to_numpy() ``` +Now we get an array with a numeric type rather than the generic "object", +which is a lot more useful, as we can now run math operations on the +resulting array (e.g., PCA). -Note that the numpy array has no information on the column names or row indices. -Alternatively, when you want to include the categorical variables in your later -analysis (e.g., for machine learning), you can extract dummy variables using: - +Finally, let's have a look at extracting categorical variables. +These are columns where each element has one of a finite list of possible values +(e.g., the "embark_town" column being "Southampton", "Cherbourg", or "Queenstown", +which are the three towns where the Titanic docked to let on passengers). +As we will see below, `pandas` has extensive support for categorical values, +but many other tools do not. To support those tools, `pandas` allows you to +replace such columns with dummy variables: ``` pd.get_dummies(titanic) ``` +Note that rather than having a single "embark_town" column with a categorical type, +we now have three columns named "embark_town_<name>" with a 1 for every passenger +who embarked in that town. These numeric columns can then be fed into a GLM or +a machine learning algorithm. ## Accessing parts of the data @@ -695,7 +727,6 @@ Not all data is well represented by a 2D table. 
If you want more dimensions to f Other useful features: - [Concatenating and merging tables](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/08_combine_dataframes.html) - [Lots of time series support](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/09_timeseries.html) -- [Rolling Window - functions](http://pandas.pydata.org/pandas-docs/stable/computation.html#window- - functions) for after you have meaningfully sorted your data +- [Rolling Window functions](http://pandas.pydata.org/pandas-docs/stable/computation.html#window-functions) + for after you have meaningfully sorted your data - and much, much more \ No newline at end of file