09_pandas.ipynb

    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df_full.loc['First']\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "More advanced use of the `MultiIndex` is possible through `xs`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df_full.xs(0, level='survived') # selects all the zero's from the survived index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df_full.xs('mad', axis=1, level=1) # selects mad from the second level in the columns (i.e., axis=1) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reshaping tables"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we were interested in how the survival rate depends on the class and sex of the passengers we could simply use a groupby:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "titanic.groupby(['class', 'sex']).survived.mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, this single-column table is difficult to read. The reason for this is that the indexing is multi-leveled (called `MultiIndex` in pandas), while there is only a single column. We would like to move one of the levels in the index to the columns. This can be done using `stack`/`unstack`:\n",
    "- `unstack`: Moves one levels in the index to the columns\n",
    "- `stack`: Moves one of levels in the columns to the index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "titanic.groupby(['class', 'sex']).survived.mean().unstack('sex')"
   ]
  },
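  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of the round trip, the cell below unstacks a small made-up table and stacks it back again. The names `toy`, `long_form`, `wide_form` and `round_trip`, as well as the numbers, are assumptions for illustration and are not taken from the titanic data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# hypothetical toy data, not taken from the titanic dataset\n",
    "toy = pd.DataFrame({\n",
    "    'class': ['First', 'First', 'Second', 'Second'],\n",
    "    'sex': ['female', 'male', 'female', 'male'],\n",
    "    'rate': [0.9, 0.4, 0.8, 0.2],\n",
    "})\n",
    "long_form = toy.set_index(['class', 'sex'])['rate']  # Series with a two-level MultiIndex\n",
    "wide_form = long_form.unstack('sex')                 # the 'sex' level becomes the columns\n",
    "round_trip = wide_form.stack()                       # moves 'sex' back into the index\n",
    "print(wide_form)\n",
    "print(round_trip)"
   ]
  },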
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The former table, where the different groups are defined in different rows, is often referred to as long-form. After unstacking the table is often referred to as wide-form as the different group (sex in this case) is now represented as different columns. In pandas some operations are easier on long-form tables (e.g., `groupby`) while others require wide_form tables (e.g., making scatter plots of two variables). You can go back and forth using `unstack` or `stack` as illustrated above, but as this is a crucial part of pandas there are many alternatives, such as `pivot_table`, `melt`, and `wide_to_long`, which we will discuss below.\n",
    "\n",
    "We can prettify the table further using seaborn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "ax = sns.heatmap(titanic.groupby(['class', 'sex']).survived.mean().unstack('sex'), \n",
    "                 annot=True)\n",
    "ax.set_title('survival rate')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that there are also many ways to produce prettier tables in pandas (e.g., color all the negative values). This is documented [here](http://pandas.pydata.org/pandas-docs/stable/style.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because this stacking/unstacking is fairly common after a groupby operation, there is a shortcut for it: `pivot_table`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "titanic.pivot_table('survived', 'class', 'sex')"
   ]
  },
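  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the shortcut concrete, here is a small sketch on made-up data (the table `toy` and its values are assumptions, not the titanic dataset) that builds the same wide-form table once through `groupby` plus `unstack` and once through `pivot_table`, relying on the mean being the default aggregation of `pivot_table`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# hypothetical toy data, not the titanic dataset\n",
    "toy = pd.DataFrame({\n",
    "    'class': ['First', 'First', 'First', 'Second', 'Second', 'Second'],\n",
    "    'sex': ['female', 'male', 'male', 'female', 'female', 'male'],\n",
    "    'survived': [1, 0, 1, 1, 0, 0],\n",
    "})\n",
    "# long-form group means, then move the 'sex' level to the columns\n",
    "via_groupby = toy.groupby(['class', 'sex'])['survived'].mean().unstack('sex')\n",
    "# the same table in one call: values, index, columns (aggregation defaults to the mean)\n",
    "via_pivot = toy.pivot_table('survived', 'class', 'sex')\n",
    "print(via_groupby)\n",
    "print(via_pivot)"
   ]
  },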
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As usual in pandas, where we can also provide multiple column names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "sns.heatmap(titanic.pivot_table('survived', ['class', 'embark_town'], ['sex', pd.cut(titanic.age, (0, 18, np.inf))]), annot=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also change the function to be used to aggregate the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "sns.heatmap(titanic.pivot_table('survived', ['class', 'embark_town'], ['sex', pd.cut(titanic.age, (0, 18, np.inf))], \n",
    "                                aggfunc='count'), annot=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As in `groupby` the aggregation function can be a string of a common aggregation function, or any function that should be applied.\n",
    "\n",
    "We can even apply different aggregate functions to different columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "titanic.pivot_table(index='class', columns='sex',  \n",
    "                    aggfunc={'survived': 'count', 'fare': np.mean}) # compute number of survivors and mean fare\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The opposite of `pivot_table` is `melt`. This can be used to change a wide-form table into a long-form table. This is not particularly useful on the titanic dataset, so let's create a new table where this might be useful. Let's say we have a dataset listing the FA and MD values in various WM tracts:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "tracts = ('Corpus callosum', 'Internal capsule', 'SLF', 'Arcuate fasciculus')\n",
    "df_wide = pd.DataFrame.from_dict(dict({'subject': list('ABCDEFGHIJ')}, **{\n",
    "    f'FA({tract})': np.random.rand(10) for tract in tracts }, **{\n",
    "    f'MD({tract})': np.random.rand(10) * 1e-3 for tract in tracts\n",
    "}))\n",
    "df_wide"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This wide-form table (i.e., all the information is in different columns) makes it hard to select just all the FA values or only the values associated with the SLF. For this it would be easier to lismt all the values in a single column. Most of the tools discussed above (e.g., `group_by` or `seaborn` plotting) work better with long-form data, which we can obtain from `melt`: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df_long = df_wide.melt('subject', var_name='measurement', value_name='dti_value')\n",
    "df_long.head(12)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that `melt` took all the columns (we could also have specified a specific sub-set) and returned each measurement as a seperate row. We probably want to seperate the measurement column into the measurement type (FA or MD) and the tract name. Many string manipulation function are available in the `DataFrame` object under `DataFrame.str` ([tutorial](http://pandas.pydata.org/pandas-docs/stable/text.html))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df_long['variable'] = df_long.measurement.str.slice(0, 2)  # first two letters correspond to FA or MD\n",
    "df_long['tract'] = df_long.measurement.str.slice(3, -1)  # fourth till the second-to-last letter correspond to the tract\n",
    "df_long.head(12)"
   ]
  },
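  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Slicing by position works here because every label starts with a two-letter code. As an alternative sketch that does not depend on the code length, `Series.str.extract` can split the label with a regular expression. The series `labels` and the pattern below are assumptions for illustration, written for labels of the form `TYPE(tract)`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# hypothetical labels in the same 'TYPE(tract)' layout as the measurement column\n",
    "labels = pd.Series(['FA(SLF)', 'MD(Corpus callosum)'])\n",
    "# one named group per piece: the text before '(' and the text between the brackets\n",
    "parts = labels.str.extract(r'(?P<variable>[^(]+)\\((?P<tract>.*)\\)')\n",
    "print(parts)  # a DataFrame with one column per named group"
   ]
  },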
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally we probably do want the FA and MD variables as different columns. \n",
    "\n",
    "*Exercise*: Use `pivot_table` or `stack`/`unstack` to create a column for MD and FA."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df_unstacked = df_long."
   ]
  },
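  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you get stuck, the sketch below shows one possible approach on a small stand-in table. `df_long_demo` is a hypothetical, randomly filled table that only mimics the layout of `df_long` (columns `subject`, `variable`, `tract` and `dti_value`); it is not the table created above, and other solutions are equally valid:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# rebuild a small long-form table with the same layout as df_long (made-up random values)\n",
    "rng = np.random.default_rng(0)\n",
    "rows = [(s, v, t) for s in 'AB' for v in ('FA', 'MD') for t in ('SLF', 'Arcuate fasciculus')]\n",
    "df_long_demo = pd.DataFrame(rows, columns=['subject', 'variable', 'tract'])\n",
    "df_long_demo['dti_value'] = rng.random(len(df_long_demo))\n",
    "\n",
    "# option 1: pivot_table with one row per subject and tract, one column per variable\n",
    "wide_pivot = df_long_demo.pivot_table('dti_value', ['subject', 'tract'], 'variable')\n",
    "# option 2: move everything but the values into the index, then unstack 'variable'\n",
    "wide_unstack = df_long_demo.set_index(['subject', 'tract', 'variable'])['dti_value'].unstack('variable')\n",
    "print(wide_pivot)"
   ]
  },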
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now use the tools discussed above to visualize the table (`seaborn`) or to group the table based on tract (`groupby` or `pivot_table`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# feel free to analyze this random data in more detail"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In general pandas is better at handling long-form than wide-form data, although for better visualization of the data an intermediate format is often best. One exception is calculating a covariance (`DataFrame.cov`) or correlation (`DataFrame.corr`) matrices which computes the correlation between each column:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "sns.heatmap(df_wide.corr(), cmap=sns.diverging_palette(240, 10, s=99, n=300), )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Linear fitting (statsmodels)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Linear fitting between the different columns is available through the [statsmodels](https://www.statsmodels.org/stable/index.html) library. A nice way to play around with a wide variety of possible models is to use R-style functions. The usage of the functions in stastmodels is described [here](https://www.statsmodels.org/dev/example_formulas.html). You can find a more detailed description of the R-style functions [here](https://patsy.readthedocs.io/en/latest/formulas.html#the-formula-language). \n",
    "\n",
    "In short these functions describe the linear model as a string. For example, \"y ~ x + a + x * a\" fits the variable `y` as a function of `x`, `a`, and the interaction between `x` and `a`. The intercept is included by default (you can add \"+ 0\" to remove it)."
   ]
  },
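  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see how a formula string is turned into model terms, the sketch below asks patsy (the library statsmodels uses for formulas) for the design matrix of a small made-up table. The table `toy` and its column names `x` and `a` are assumptions for illustration; only the right-hand side of the formula is used, so no `y` column is needed:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import patsy\n",
    "\n",
    "# hypothetical toy data; 'x' and 'a' are made-up column names\n",
    "toy = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0], 'a': [0.5, 0.1, 0.9, 0.3]})\n",
    "full = patsy.dmatrix('x * a', toy)            # shorthand form\n",
    "explicit = patsy.dmatrix('x + a + x:a', toy)  # main effects plus interaction, written out\n",
    "no_intercept = patsy.dmatrix('x + a + 0', toy)  # '+ 0' drops the intercept column\n",
    "print(full.design_info.column_names)\n",
    "print(explicit.design_info.column_names)\n",
    "print(no_intercept.design_info.column_names)"
   ]
  },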
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "result = smf.logit('survived ~ age + sex + age * sex', data=titanic).fit()\n",
    "print(result.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that statsmodels understands categorical variables and automatically replaces them with dummy variables.\n",
    "\n",
    "Above we used logistic regression, which is appropriate for the binary survival rate. A wide variety of linear models are available. Let's try a GLM, but assume that the fare is drawn from a Gamma distribution:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "age_dmean = titanic.age - titanic.age.mean()\n",
    "result = smf.glm('fare ~ age_dmean + embark_town', data=titanic).fit()\n",
    "print(result.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Cherbourg passengers clearly paid a lot more...\n",
    "\n",
    "\n",
    "Note that we did not actually add the age_dmean to the dataframe. Statsmodels (or more precisely the underlying [patsy](https://patsy.readthedocs.io/en/latest/) library) automatically extracted this from our environment. This can lead to confusing behaviour..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# More reading"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Other useful features\n",
    "- [Concatenating](https://jakevdp.github.io/PythonDataScienceHandbook/03.06-concat-and-append.html) and [merging](https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html) of tables\n",
    "- [Lots of](http://pandas.pydata.org/pandas-docs/stable/basics.html#dt-accessor) [time](http://pandas.pydata.org/pandas-docs/stable/timeseries.html) [series](http://pandas.pydata.org/pandas-docs/stable/timedeltas.html) support\n",
    "- [Rolling Window functions](http://pandas.pydata.org/pandas-docs/stable/computation.html#window-functions) for after you have meaningfully sorted your data\n",
    "- and much, much more"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  },
  "toc": {
   "colors": {
    "hover_highlight": "#DAA520",
    "running_highlight": "#FF0000",
    "selected_highlight": "#FFD700"
   },
   "moveMenuLeft": true,
   "nav_menu": {
    "height": "225px",
    "width": "252px"
   },
   "navigate_menu": true,
   "number_sections": true,
   "sideBar": true,
   "threshold": 4,
   "toc_cell": false,
   "toc_section_display": "block",
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}