Newer
Older
"as wide-form as the different group (sex in this case) is now represented as\n",
"different columns. In pandas some operations are easier on long-form tables\n",
"(e.g., `groupby`) while others require wide_form tables (e.g., making scatter\n",
"plots of two variables). You can go back and forth using `unstack` or `stack` as\n",
"illustrated above, but as this is a crucial part of pandas there are many\n",
"alternatives, such as `pivot_table`, `melt`, and `wide_to_long`, which we will\n",
"discuss below.\n",
"\n",
"We can prettify the table further using seaborn"
]
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"ax = sns.heatmap(titanic.groupby(['class', 'sex']).survived.mean().unstack('sex'),\n",
" annot=True)\n",
"ax.set_title('survival rate')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that there are also many ways to produce prettier tables in pandas (e.g.,\n",
"color all the negative values). This is documented\n",
"[here](http://pandas.pydata.org/pandas-docs/stable/style.html).\n",
"\n",
"Because this stacking/unstacking is fairly common after a groupby operation,\n",
"there is a shortcut for it: `pivot_table`"
]
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"titanic.pivot_table('survived', 'class', 'sex')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As usual in pandas, where we can also provide multiple column names"
]
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"sns.heatmap(titanic.pivot_table('survived', ['class', 'embark_town'], ['sex', pd.cut(titanic.age, (0, 18, np.inf))]), annot=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also change the function to be used to aggregate the data"
]
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"sns.heatmap(titanic.pivot_table('survived', ['class', 'embark_town'], ['sex', pd.cut(titanic.age, (0, 18, np.inf))],\n",
" aggfunc='count'), annot=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As in `groupby` the aggregation function can be a string of a common aggregation\n",
"function, or any function that should be applied.\n",
"\n",
"We can even apply different aggregate functions to different columns"
]
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
" aggfunc={'survived': 'count', 'fare': np.mean}) # compute number of survivors and mean fare\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The opposite of `pivot_table` is `melt`. This can be used to change a wide-form\n",
"table into a long-form table. This is not particularly useful on the titanic\n",
"dataset, so let's create a new table where this might be useful. Let's say we\n",
"have a dataset listing the FA and MD values in various WM tracts:"
]
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"tracts = ('Corpus callosum', 'Internal capsule', 'SLF', 'Arcuate fasciculus')\n",
"df_wide = pd.DataFrame.from_dict(dict({'subject': list('ABCDEFGHIJ')}, **{\n",
" f'FA({tract})': np.random.rand(10) for tract in tracts }, **{\n",
" f'MD({tract})': np.random.rand(10) * 1e-3 for tract in tracts\n",
"}))\n",
"df_wide"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This wide-form table (i.e., all the information is in different columns) makes\n",
"it hard to select just all the FA values or only the values associated with the\n",
"SLF. For this it would be easier to list all the values in a single column.\n",
"Most of the tools discussed above (e.g., `group_by` or `seaborn` plotting) work\n",
"better with long-form data, which we can obtain from `melt`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"df_long = df_wide.melt('subject', var_name='measurement', value_name='dti_value')\n",
"df_long.head(12)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that `melt` took all the columns (we could also have specified a\n",
"specific sub-set) and returned each measurement as a seperate row. We probably\n",
"want to seperate the measurement column into the measurement type (FA or MD) and\n",
"the tract name. Many string manipulation function are available in the\n",
"`DataFrame` object under `DataFrame.str`\n",
"([tutorial](http://pandas.pydata.org/pandas-docs/stable/text.html))"
]
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"df_long['variable'] = df_long.measurement.str.slice(0, 2) # first two letters correspond to FA or MD\n",
"df_long['tract'] = df_long.measurement.str.slice(3, -1) # fourth till the second-to-last letter correspond to the tract\n",
"df_long.head(12)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally we probably do want the FA and MD variables as different columns.\n",
"\n",
"**Exercise**: Use `pivot_table` or `stack`/`unstack` to create a column for MD\n",
"and FA."
]
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"df_unstacked = df_long."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now use the tools discussed above to visualize the table (`seaborn`) or\n",
"to group the table based on tract (`groupby` or `pivot_table`)."
]
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# feel free to analyze this random data in more detail"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In general pandas is better at handling long-form than wide-form data, although\n",
"for better visualization of the data an intermediate format is often best. One\n",
"exception is calculating a covariance (`DataFrame.cov`) or correlation\n",
"(`DataFrame.corr`) matrices which computes the correlation between each column:"
]
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"sns.heatmap(df_wide.corr(), cmap=sns.diverging_palette(240, 10, s=99, n=300), )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Linear fitting (`statsmodels`)\n",
"\n",
"Linear fitting between the different columns is available through the\n",
"[`statsmodels`](https://www.statsmodels.org/stable/index.html) library. A nice\n",
"way to play around with a wide variety of possible models is to use R-style\n",
"functions. The usage of the functions in `statsmodels` is described\n",
"[here](https://www.statsmodels.org/dev/example_formulas.html). You can find a\n",
"more detailed description of the R-style functions\n",
"[here](https://patsy.readthedocs.io/en/latest/formulas.html#the-formula-\n",
"language).\n",
"\n",
"In short these functions describe the linear model as a string. For example,\n",
"`\"y ~ x + a + x * a\"` fits the variable `y` as a function of `x`, `a`, and the\n",
"interaction between `x` and `a`. The intercept is included by default (you can\n",
"add `\"+ 0\"` to remove it)."
]
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"result = smf.logit('survived ~ age + sex + age * sex', data=titanic).fit()\n",
"print(result.summary())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that `statsmodels` understands categorical variables and automatically\n",
"replaces them with dummy variables.\n",
"\n",
"Above we used logistic regression, which is appropriate for the binary\n",
"survival rate. A wide variety of linear models are available. Let's try a GLM,\n",
"but assume that the fare is drawn from a Gamma distribution:"
]
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"age_dmean = titanic.age - titanic.age.mean()\n",
"result = smf.glm('fare ~ age_dmean + embark_town', data=titanic).fit()\n",
"print(result.summary())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cherbourg passengers clearly paid a lot more...\n",
"\n",
"\n",
"Note that we did not actually add the `age_dmean` to the\n",
"`DataFrame`. `statsmodels` (or more precisely the underlying\n",
"[patsy](https://patsy.readthedocs.io/en/latest/) library) automatically\n",
"extracted this from our environment. This can lead to confusing behaviour...\n",
"\n",
"# More reading\n",
"\n",
"Other useful features\n",
"\n",
"- [Concatenating](https://jakevdp.github.io/PythonDataScienceHandbook/03.06-concat-and-append.html)\n",
" and\n",
" [merging](https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html)\n",
" of tables\n",
"- [Lots\n",
" of](http://pandas.pydata.org/pandas-docs/stable/basics.html#dt-accessor)\n",
" [time](http://pandas.pydata.org/pandas-docs/stable/timeseries.html)\n",
" [series](http://pandas.pydata.org/pandas-docs/stable/timedeltas.html)\n",
" support\n",
"- [Rolling Window\n",
" functions](http://pandas.pydata.org/pandas-docs/stable/computation.html#window-\n",
" functions) for after you have meaningfully sorted your data\n",
"- and much, much more"
]
}
],
"nbformat": 4,
"nbformat_minor": 2
}