Skip to content
Snippets Groups Projects
pandas.ipynb 35.6 KiB
Newer Older
Paul McCarthy's avatar
Paul McCarthy committed
    "as wide-form as the different group (sex in this case) is now represented as\n",
    "different columns. In pandas some operations are easier on long-form tables\n",
    "(e.g., `groupby`) while others require wide_form tables (e.g., making scatter\n",
    "plots of two variables). You can go back and forth using `unstack` or `stack` as\n",
    "illustrated above, but as this is a crucial part of pandas there are many\n",
    "alternatives, such as `pivot_table`, `melt`, and `wide_to_long`, which we will\n",
    "discuss below.\n",
    "\n",
    "We can prettify the table further using seaborn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
Paul McCarthy's avatar
Paul McCarthy committed
   "metadata": {},
Paul McCarthy's avatar
Paul McCarthy committed
    "ax = sns.heatmap(titanic.groupby(['class', 'sex']).survived.mean().unstack('sex'),\n",
    "                 annot=True)\n",
    "ax.set_title('survival rate')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Paul McCarthy's avatar
Paul McCarthy committed
    "Note that there are also many ways to produce prettier tables in pandas (e.g.,\n",
    "color all the negative values). This is documented\n",
    "[here](http://pandas.pydata.org/pandas-docs/stable/style.html).\n",
    "\n",
    "Because this stacking/unstacking is fairly common after a groupby operation,\n",
    "there is a shortcut for it: `pivot_table`"
Paul McCarthy's avatar
Paul McCarthy committed
   "metadata": {},
   "outputs": [],
   "source": [
    "titanic.pivot_table('survived', 'class', 'sex')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As usual in pandas, where we can also provide multiple column names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
Paul McCarthy's avatar
Paul McCarthy committed
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.heatmap(titanic.pivot_table('survived', ['class', 'embark_town'], ['sex', pd.cut(titanic.age, (0, 18, np.inf))]), annot=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also change the function to be used to aggregate the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
Paul McCarthy's avatar
Paul McCarthy committed
   "metadata": {},
Paul McCarthy's avatar
Paul McCarthy committed
    "sns.heatmap(titanic.pivot_table('survived', ['class', 'embark_town'], ['sex', pd.cut(titanic.age, (0, 18, np.inf))],\n",
    "                                aggfunc='count'), annot=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Paul McCarthy's avatar
Paul McCarthy committed
    "As in `groupby` the aggregation function can be a string of a common aggregation\n",
    "function, or any function that should be applied.\n",
    "\n",
    "We can even apply different aggregate functions to different columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
Paul McCarthy's avatar
Paul McCarthy committed
   "metadata": {},
Paul McCarthy's avatar
Paul McCarthy committed
    "titanic.pivot_table(index='class', columns='sex',\n",
    "                    aggfunc={'survived': 'count', 'fare': np.mean}) # compute number of survivors and mean fare\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Paul McCarthy's avatar
Paul McCarthy committed
    "The opposite of `pivot_table` is `melt`. This can be used to change a wide-form\n",
    "table into a long-form table. This is not particularly useful on the titanic\n",
    "dataset, so let's create a new table where this might be useful. Let's say we\n",
    "have a dataset listing the FA and MD values in various WM tracts:"
Paul McCarthy's avatar
Paul McCarthy committed
   "metadata": {},
   "outputs": [],
   "source": [
    "tracts = ('Corpus callosum', 'Internal capsule', 'SLF', 'Arcuate fasciculus')\n",
    "df_wide = pd.DataFrame.from_dict(dict({'subject': list('ABCDEFGHIJ')}, **{\n",
    "    f'FA({tract})': np.random.rand(10) for tract in tracts }, **{\n",
    "    f'MD({tract})': np.random.rand(10) * 1e-3 for tract in tracts\n",
    "}))\n",
    "df_wide"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Paul McCarthy's avatar
Paul McCarthy committed
    "This wide-form table (i.e., all the information is in different columns) makes\n",
    "it hard to select just all the FA values or only the values associated with the\n",
    "SLF. For this it would be easier to list all the values in a single column.\n",
    "Most of the tools discussed above (e.g., `group_by` or `seaborn` plotting) work\n",
    "better with long-form data, which we can obtain from `melt`:"
Paul McCarthy's avatar
Paul McCarthy committed
   "metadata": {},
   "outputs": [],
   "source": [
    "df_long = df_wide.melt('subject', var_name='measurement', value_name='dti_value')\n",
    "df_long.head(12)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Paul McCarthy's avatar
Paul McCarthy committed
    "We can see that `melt` took all the columns (we could also have specified a\n",
    "specific sub-set) and returned each measurement as a seperate row. We probably\n",
    "want to seperate the measurement column into the measurement type (FA or MD) and\n",
    "the tract name. Many string manipulation function are available in the\n",
    "`DataFrame` object under `DataFrame.str`\n",
    "([tutorial](http://pandas.pydata.org/pandas-docs/stable/text.html))"
Paul McCarthy's avatar
Paul McCarthy committed
   "metadata": {},
   "outputs": [],
   "source": [
    "df_long['variable'] = df_long.measurement.str.slice(0, 2)  # first two letters correspond to FA or MD\n",
    "df_long['tract'] = df_long.measurement.str.slice(3, -1)  # fourth till the second-to-last letter correspond to the tract\n",
    "df_long.head(12)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Paul McCarthy's avatar
Paul McCarthy committed
    "Finally we probably do want the FA and MD variables as different columns.\n",
Paul McCarthy's avatar
Paul McCarthy committed
    "**Exercise**: Use `pivot_table` or `stack`/`unstack` to create a column for MD\n",
    "and FA."
Paul McCarthy's avatar
Paul McCarthy committed
   "metadata": {},
   "outputs": [],
   "source": [
    "df_unstacked = df_long."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Paul McCarthy's avatar
Paul McCarthy committed
    "We can now use the tools discussed above to visualize the table (`seaborn`) or\n",
    "to group the table based on tract (`groupby` or `pivot_table`)."
Paul McCarthy's avatar
Paul McCarthy committed
   "metadata": {},
   "outputs": [],
   "source": [
    "# feel free to analyze this random data in more detail"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Paul McCarthy's avatar
Paul McCarthy committed
    "In general pandas is better at handling long-form than wide-form data, although\n",
    "for better visualization of the data an intermediate format is often best. One\n",
    "exception is calculating a covariance (`DataFrame.cov`) or correlation\n",
    "(`DataFrame.corr`) matrices which computes the correlation between each column:"
Paul McCarthy's avatar
Paul McCarthy committed
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.heatmap(df_wide.corr(), cmap=sns.diverging_palette(240, 10, s=99, n=300), )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Paul McCarthy's avatar
Paul McCarthy committed
    "## Linear fitting (`statsmodels`)\n",
    "\n",
    "Linear fitting between the different columns is available through the\n",
    "[`statsmodels`](https://www.statsmodels.org/stable/index.html) library. A nice\n",
    "way to play around with a wide variety of possible models is to use R-style\n",
    "functions. The usage of the functions in `statsmodels` is described\n",
    "[here](https://www.statsmodels.org/dev/example_formulas.html). You can find a\n",
    "more detailed description of the R-style functions\n",
    "[here](https://patsy.readthedocs.io/en/latest/formulas.html#the-formula-\n",
    "language).\n",
Paul McCarthy's avatar
Paul McCarthy committed
    "In short these functions describe the linear model as a string. For example,\n",
    "`\"y ~ x + a + x * a\"` fits the variable `y` as a function of `x`, `a`, and the\n",
    "interaction between `x` and `a`. The intercept is included by default (you can\n",
    "add `\"+ 0\"` to remove it)."
Paul McCarthy's avatar
Paul McCarthy committed
   "metadata": {},
   "outputs": [],
   "source": [
    "result = smf.logit('survived ~ age + sex + age * sex', data=titanic).fit()\n",
    "print(result.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
Paul McCarthy's avatar
Paul McCarthy committed
    "Note that `statsmodels` understands categorical variables and automatically\n",
    "replaces them with dummy variables.\n",
Paul McCarthy's avatar
Paul McCarthy committed
    "Above we used logistic regression, which is appropriate for the binary\n",
    "survival rate. A wide variety of linear models are available. Let's try a GLM,\n",
    "but assume that the fare is drawn from a Gamma distribution:"
Paul McCarthy's avatar
Paul McCarthy committed
   "metadata": {},
   "outputs": [],
   "source": [
    "age_dmean = titanic.age - titanic.age.mean()\n",
    "result = smf.glm('fare ~ age_dmean + embark_town', data=titanic).fit()\n",
    "print(result.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Cherbourg passengers clearly paid a lot more...\n",
    "\n",
    "\n",
Paul McCarthy's avatar
Paul McCarthy committed
    "Note that we did not actually add the `age_dmean` to the\n",
    "`DataFrame`. `statsmodels` (or more precisely the underlying\n",
    "[patsy](https://patsy.readthedocs.io/en/latest/) library) automatically\n",
    "extracted this from our environment. This can lead to confusing behaviour...\n",
    "\n",
    "# More reading\n",
    "\n",
Paul McCarthy's avatar
Paul McCarthy committed
    "\n",
    "- [Concatenating](https://jakevdp.github.io/PythonDataScienceHandbook/03.06-concat-and-append.html)\n",
    "  and\n",
    "  [merging](https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html)\n",
    "  of tables\n",
    "- [Lots\n",
    "  of](http://pandas.pydata.org/pandas-docs/stable/basics.html#dt-accessor)\n",
    "  [time](http://pandas.pydata.org/pandas-docs/stable/timeseries.html)\n",
    "  [series](http://pandas.pydata.org/pandas-docs/stable/timedeltas.html)\n",
    "  support\n",
    "- [Rolling Window\n",
    "  functions](http://pandas.pydata.org/pandas-docs/stable/computation.html#window-\n",
    "  functions) for after you have meaningfully sorted your data\n",
Paul McCarthy's avatar
Paul McCarthy committed
 "metadata": {},