Commit 24882ffa authored by Michiel Cottaar

Merge branch 'master' into 'master'

Several minor additions to the talk (including pandas-profiling)

See merge request fsl/pytreat-practicals-2020!28
parents 04f898f4 b802870a
%% Cell type:markdown id: tags:
# Pandas
Follow along online at: https://git.fmrib.ox.ac.uk/fsl/pytreat-practicals-2020/-/blob/master/talks/pandas/pandas.ipynb
Pandas is a data analysis library focused on the cleaning and exploration of
tabular data.
Some useful links are:
- [main website](https://pandas.pydata.org)
- [documentation](http://pandas.pydata.org/pandas-docs/stable/)<sup>1</sup>
- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)<sup>1</sup> by
Jake VanderPlas
- [List of Pandas tutorials](https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html)
<sup>1</sup> This tutorial borrows heavily from the pandas documentation and
the Python Data Science Handbook.
%% Cell type:code id: tags:
```
%pylab inline
import numpy as np  # also provided by %pylab; imported explicitly because later cells use np
import pandas as pd  # pd is the usual abbreviation for pandas
import matplotlib.pyplot as plt  # matplotlib for plotting
import seaborn as sns  # seaborn is a statistical plotting library that works well with pandas
import statsmodels.api as sm  # statsmodels fits linear models to pandas data
import statsmodels.formula.api as smf
from IPython.display import Image
sns.set()  # use the prettier seaborn plotting settings rather than the default matplotlib ones
```
%% Cell type:markdown id: tags:
> We will mostly be using `seaborn` instead of `matplotlib` for
> visualisation. But `seaborn` is actually an extension to `matplotlib`, so we
> are still using the latter under the hood.
## Loading in data
Pandas supports a wide range of I/O tools to load from text files, binary files,
and SQL databases. You can find a table with all formats
[here](http://pandas.pydata.org/pandas-docs/stable/io.html).
%% Cell type:code id: tags:
```
titanic = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv')
titanic
```
%% Cell type:markdown id: tags:
This loads the data into a
[`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)
object, which is the main object we will be interacting with in pandas. It
represents a table of data. The readers for the other file formats are all named
`pd.read_{format}`. Note that we can provide the URL to the dataset, rather
than download it beforehand.
We can write out the dataset using `dataframe.to_{format}(<filename>)`:
%% Cell type:code id: tags:
```
titanic.to_csv('titanic_copy.csv', index=False)  # we set index to False to prevent pandas from storing the row names
```
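%% Cell type:markdown id: tags:
To see what `index=False` changes, here is a minimal sketch of a write/read round trip. It uses a small made-up table and in-memory buffers (`io.StringIO`) instead of files, so none of the names below come from the titanic data.

```python
import io

import pandas as pd

# Hypothetical toy table, not the titanic data
toy = pd.DataFrame({'name': ['a', 'b', 'c'], 'score': [1.5, 2.0, 3.5]})

# Write to in-memory text buffers rather than to files on disk
buffer_with_index = io.StringIO()
toy.to_csv(buffer_with_index)                  # default: the row labels are written too
buffer_without_index = io.StringIO()
toy.to_csv(buffer_without_index, index=False)  # only the columns are written

# Rewind the buffers and read them back in
buffer_with_index.seek(0)
buffer_without_index.seek(0)
columns_with_index = pd.read_csv(buffer_with_index).columns.tolist()
columns_without_index = pd.read_csv(buffer_without_index).columns.tolist()

print(columns_with_index)     # the stored row labels come back as an extra unnamed column
print(columns_without_index)  # only the original columns
```

If you do want to keep the row labels, passing `index_col=0` to `pd.read_csv` restores them as the index instead of as a regular column.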
%% Cell type:markdown id: tags:
If you cannot connect to the internet, you can run the command below to load
a locally stored copy of the titanic dataset:
%% Cell type:code id: tags:
```
titanic = pd.read_csv('titanic.csv')
titanic
```
%% Cell type:markdown id: tags:
Note that the titanic dataset is also one of the standard example datasets
that seaborn can load (seaborn fetches these from its online data repository
and caches them locally). We could load it from there using
%% Cell type:code id: tags:
```
sns.load_dataset('titanic')
```
%% Cell type:markdown id: tags:
A `DataFrame` can also be created from other python objects, using
`pd.DataFrame.from_{other type}`. The most useful of these is `from_dict`,
which converts a mapping of the columns to a pandas `DataFrame` (i.e., table).
%% Cell type:code id: tags:
```
pd.DataFrame.from_dict({
    'random numbers': np.random.rand(5),
    'sequence (int)': np.arange(5),
    'sequence (float)': np.linspace(0, 5, 5),
    'letters': list('abcde'),
    'constant_value': 'same_value'
})
```
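%% Cell type:markdown id: tags:
As a small sketch of how the mapping is interpreted (all names and values here are made up for illustration): the plain `pd.DataFrame` constructor accepts the same kind of dictionary, and `orient='index'` makes `from_dict` treat the keys as row labels rather than column names.

```python
import numpy as np
import pandas as pd

# Hypothetical column mapping: each key is a column name, each value the column data
columns = {
    'sequence': np.arange(3),
    'letters': list('abc'),
    'constant': 'same_value',  # a scalar is repeated for every row
}
from_constructor = pd.DataFrame(columns)
from_mapping = pd.DataFrame.from_dict(columns)
same_table = from_constructor.equals(from_mapping)
print(same_table)

# With orient='index' the keys become row labels instead of column names
rows = {'row_1': [1, 2], 'row_2': [3, 4]}
by_row = pd.DataFrame.from_dict(rows, orient='index', columns=['a', 'b'])
print(by_row)
```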
%% Cell type:markdown id: tags:
For many applications (e.g., ICA, machine learning input) you might want to
extract your data as a numpy array. The underlying numpy array can be accessed
using the `to_numpy` method:
%% Cell type:code id: tags:
```
titanic.to_numpy()
```
%% Cell type:markdown id: tags:
Note that the dtype of the returned array is the most general type needed to
hold every column (in this case `object`). If you just want the numeric parts
of the table you can use `select_dtypes`, which selects specific columns based
on their dtype:
%% Cell type:code id: tags:
```
titanic.select_dtypes(include=np.number).to_numpy()
```
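%% Cell type:markdown id: tags:
The effect on the array dtype can be checked on a small made-up table (the column names and values below are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical table mixing integer, float and string columns
mixed = pd.DataFrame({
    'count': [1, 2, 3],
    'weight': [0.5, 1.5, 2.5],
    'label': ['x', 'y', 'z'],
})

all_values = mixed.to_numpy()
numeric_values = mixed.select_dtypes(include=np.number).to_numpy()

print(all_values.dtype)      # object: the only dtype that can hold numbers and strings
print(numeric_values.dtype)  # float64: the integers are upcast to match the floats
```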
%% Cell type:markdown id: tags:
Note that the numpy array has no information on the column names or row indices.
Alternatively, when you want to include the categorical variables in your later
analysis (e.g., for machine learning), you can extract dummy variables using:
%% Cell type:code id: tags:
```
pd.get_dummies(titanic)
```
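%% Cell type:markdown id: tags:
A rough sketch of what `get_dummies` does, on a made-up table: numeric columns are passed through unchanged, while each string column is replaced by one indicator column per distinct value.

```python
import pandas as pd

# Hypothetical table with one numeric and one categorical column
toy = pd.DataFrame({'size': [1, 2, 3], 'colour': ['red', 'blue', 'red']})

dummies = pd.get_dummies(toy)
print(dummies.columns.tolist())  # 'size' is kept; 'colour' becomes one column per value
print(dummies)
```

The dtype of the indicator columns depends on the pandas version (boolean in recent releases), so cast them explicitly with `astype` if your downstream code needs integers or floats.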
%% Cell type:markdown id: tags:
## Accessing parts of the data
[Documentation on indexing](http://pandas.pydata.org/pandas-docs/stable/indexing.html)
### Selecting columns by name
Single columns can be selected using the normal python indexing:
%% Cell type:code id: tags:
```
titanic['embark_town']
```
%% Cell type:markdown id: tags:
If the column names are simple strings (not required) we can also access them
directly as attributes
%% Cell type:code id: tags:
```
titanic.embark_town
```
%% Cell type:markdown id: tags:
Note that this returns a pandas
[`Series`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html)
rather than a `DataFrame` object. A `Series` is simply a 1-dimensional array
representing a single column. Multiple columns can be returned by providing a
list of column names. This will return a `DataFrame`:
%% Cell type:code id: tags:
```
titanic[['class', 'alive']]
```
%% Cell type:markdown id: tags:
Note that you have to provide a list here (square brackets). If you provide a
tuple (round brackets) pandas will think you are trying to access a single
column that has that tuple as a name:
%% Cell type:code id: tags:
```
titanic[('class', 'alive')]
```
%% Cell type:markdown id: tags:
In this case there is no column called `('class', 'alive')` leading to an
error. Later on we will see some uses for having columns named like this.
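The difference between the three forms of column selection can be sketched on a made-up table (the column names `a`, `b` and `c` are placeholders):

```python
import pandas as pd

# Hypothetical table with three columns
toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

single = toy['a']         # a string selects one column as a Series
as_frame = toy[['a']]     # a one-element list still returns a DataFrame
pair = toy[['a', 'b']]    # a list selects several columns as a DataFrame
print(type(single).__name__, type(as_frame).__name__, type(pair).__name__)

# A tuple is treated as the name of one column, which does not exist here
try:
    toy[('a', 'b')]
    tuple_failed = False
except KeyError:
    tuple_failed = True
print(tuple_failed)
```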
### Indexing rows by name or integer
Individual rows can be accessed based on their name (i.e., the index) or their
integer position (i.e., which row it is in). In our current table this will give
the same results. To ensure that these are different, let's sort our titanic
dataset based on the passenger fare:
%% Cell type:code id: tags:
```
titanic_sorted = titanic.sort_values('fare')
titanic_sorted
```
%% Cell type:markdown id: tags:
Note that the re-sorting did not change the values in the index (i.e., left-most
column).
We can select the first row of this newly sorted table using `iloc`
%% Cell type:code id: tags:
```
titanic_sorted.iloc[0]
```
%% Cell type:markdown id: tags:
We can select the row with the index 0 using `loc`
%% Cell type:code id: tags:
```
titanic_sorted.loc[0]
```
%% Cell type:markdown id: tags:
Note that this gives the same passenger as the first row of the initial table
before sorting
%% Cell type:code id: tags:
```
titanic.iloc[0]
```
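%% Cell type:markdown id: tags:
The same distinction on a three-row made-up table, where it is easy to follow which row is which (the fares are arbitrary numbers, not titanic values):

```python
import pandas as pd

# Hypothetical fares; the default index is 0, 1, 2
toy = pd.DataFrame({'fare': [30.0, 10.0, 20.0]})
toy_sorted = toy.sort_values('fare')

index_after_sort = toy_sorted.index.tolist()   # labels travel with their rows
first_by_position = toy_sorted.iloc[0]['fare'] # first row of the sorted table
row_labelled_zero = toy_sorted.loc[0]['fare']  # row whose label is 0

print(index_after_sort)
print(first_by_position, row_labelled_zero)

# reset_index(drop=True) renumbers the rows, after which loc and iloc agree again
renumbered = toy_sorted.reset_index(drop=True)
print(renumbered.loc[0]['fare'])
```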
%% Cell type:markdown id: tags:
Another common way to access the first or last N rows of a table is using the
head/tail methods
%% Cell type:code id: tags:
```
titanic_sorted.head(3)
```
%% Cell type:code id: tags:
```
titanic_sorted.tail(3)
```
%% Cell type:markdown id: tags:
Note that most `DataFrame` methods return a new `DataFrame`, which means
that we can easily call another method on the result
%% Cell type:code id: tags:
```
titanic_sorted.tail(10).head(5)  # select the first 5 of the last 10 passengers in the table
```
%% Cell type:code id: tags:
```
titanic_sorted.iloc[-10:-5]  # alternative way to get the same passengers
```
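%% Cell type:markdown id: tags:
A quick way to convince yourself that the chained call and the slice select the same rows is to compare them on a made-up table (20 rows of consecutive integers, chosen only for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical table with 20 rows
toy = pd.DataFrame({'value': np.arange(20)})

chained = toy.tail(10).head(5)  # first 5 of the last 10 rows
sliced = toy.iloc[-10:-5]       # the same rows selected by position
same_rows = chained.equals(sliced)
print(same_rows)

# Each call returns a new DataFrame, so sorting and selecting can be chained too
largest_three = toy.sort_values('value', ascending=False).head(3)
print(largest_three)
```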
%% Cell type:markdown id: tags:
**Exercise**: use sorting and tail/head or indexing to find the 10 youngest
passengers on the titanic. Try to do this on a single line by chaining calls
to the titanic `DataFrame` object
%% Cell type:code id: tags:
```
titanic.sort_values...
```
%% Cell type:markdown id: tags:
### Indexing rows by value
One final way to select specific rows is by their value
%% Cell type:code id: tags:
```
titanic[titanic.sex == 'female']  # selects all females
```
%% Cell type:code id: tags:
```
# select all passengers older than 60 who departed from Southampton
titanic[(titanic.age > 60) & (titanic['embark_town'] == 'Southampton')]
```
%% Cell type:markdown id: tags:
Note that this required typing `titanic` quite often. A quicker way to get the
same result is using the `query` method, which is described in detail
[here](http://pandas.pydata.org/pandas-docs/stable/indexing.html#the-query-method)
(for large tables the `query` method can also be faster and use less
memory).
> You may have trouble using the `query` method with columns which have
> a name that cannot be used as a Python identifier.
%% Cell type:code id: tags:
```
titanic.query('(age > 60) & (embark_town == "Southampton")')
```
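%% Cell type:markdown id: tags:
The sketch below checks, on a made-up table, that a boolean mask and `query` select the same rows. It also shows two `query` features: `@name` refers to a Python variable, and backticks quote column names that are not valid identifiers (this assumes a reasonably recent pandas version). The column `ticket price` is invented for the example.

```python
import pandas as pd

# Hypothetical passengers, not taken from the titanic data
toy = pd.DataFrame({
    'age': [70, 30, 65],
    'embark_town': ['Southampton', 'Cherbourg', 'Queenstown'],
    'ticket price': [10.0, 20.0, 30.0],
})

with_mask = toy[(toy.age > 60) & (toy['embark_town'] == 'Southampton')]
with_query = toy.query('(age > 60) & (embark_town == "Southampton")')
same_selection = with_mask.equals(with_query)
print(same_selection)

# '@' looks up a variable from the surrounding Python code
min_age = 60
older = toy.query('age > @min_age')

# backticks quote a column name containing a space
expensive = toy.query('`ticket price` > 15')
print(older.index.tolist(), expensive.index.tolist())
```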
%% Cell type:markdown id: tags:
When selecting multiple options from a categorical variable you
might want to use `isin`:
%% Cell type:code id: tags:
```
titanic[titanic['class'].isin(['First', 'Second'])]
```
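%% Cell type:markdown id: tags:
As a small illustration on made-up data, `isin` gives the same mask as combining several equality tests with `|`, and `~` inverts it to select everything else:

```python
import pandas as pd

# Hypothetical passenger classes
toy = pd.DataFrame({'class': ['First', 'Third', 'Second', 'Third']})

in_first_or_second = toy['class'].isin(['First', 'Second'])
spelled_out = (toy['class'] == 'First') | (toy['class'] == 'Second')
same_mask = in_first_or_second.equals(spelled_out)

selected = toy[in_first_or_second]
everything_else = toy[~in_first_or_second]
print(same_mask)
print(selected.index.tolist(), everything_else.index.tolist())
```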
%% Cell type:markdown id: tags:
Particularly useful when selecting data like this is the `isna` method, which
finds all missing data
%% Cell type:code id: tags:
```
titanic[~titanic.age.isna()]  # select all passengers whose age is not N/A
```
%% Cell type:markdown id: tags:
Removing missing values like this is so common that it has its own method
%% Cell type:code id: tags:
```
titanic.dropna()  # drops all passengers that have some datapoint missing
```
%% Cell type:code id: tags:
```
titanic.dropna(subset=['age', 'fare'])  # Only drop passengers with missing ages or fares
```
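%% Cell type:markdown id: tags:
To make the difference between these calls concrete, here is a sketch on a three-row made-up table with gaps in different columns (the column names mimic the titanic ones, but the values are invented):

```python
import numpy as np
import pandas as pd

# Hypothetical table with missing values in different columns
toy = pd.DataFrame({
    'age': [20.0, np.nan, 40.0],
    'fare': [5.0, 6.0, np.nan],
    'deck': ['A', None, None],
})

missing_per_column = toy.isna().sum()       # count the missing entries per column
complete_rows = toy.dropna()                # keep rows without any missing value
known_age = toy.dropna(subset=['age'])      # only require 'age' to be present
known_age_by_mask = toy[toy.age.notna()]    # the same selection written as a mask

print(missing_per_column)
print(complete_rows.index.tolist(), known_age.index.tolist())
```

Note that `dropna` returns a new table and leaves `toy` untouched, like most of the methods used above.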