Michiel Cottaar
pytreat-practicals-2020
Commits
b802870a
Commit
b802870a
authored
Mar 24, 2020
by
Michiel Cottaar
Several minor additions to the talk (including pandas-profiling)
parent
04f898f4
talks/pandas/pandas.ipynb
%% Cell type:markdown id: tags:
# Pandas

Follow along online at: https://git.fmrib.ox.ac.uk/fsl/pytreat-practicals-2020/-/blob/master/talks/pandas/pandas.ipynb

Pandas is a data analysis library focused on the cleaning and exploration of
tabular data.

Some useful links are:
- [main website](https://pandas.pydata.org)
- [documentation](http://pandas.pydata.org/pandas-docs/stable/)<sup>1</sup>
- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)<sup>1</sup> by
  Jake VanderPlas
- [List of Pandas tutorials](https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html)

<sup>1</sup> This tutorial borrows heavily from the pandas documentation and
the Python Data Science Handbook
%% Cell type:code id: tags:
```
%pylab inline
import numpy as np               # also provided by %pylab, imported explicitly because np is used below
import pandas as pd              # pd is the usual abbreviation for pandas
import matplotlib.pyplot as plt  # matplotlib for plotting
import seaborn as sns            # seaborn is the main plotting library for Pandas
import statsmodels.api as sm     # statsmodels fits linear models to pandas data
import statsmodels.formula.api as smf
from IPython.display import Image
sns.set()  # use the prettier seaborn plotting settings rather than the default matplotlib one
```
%% Cell type:markdown id: tags:
> We will mostly be using `seaborn` instead of `matplotlib` for
> visualisation. But `seaborn` is actually an extension to `matplotlib`, so we
> are still using the latter under the hood.

## Loading in data

Pandas supports a wide range of I/O tools to load from text files, binary files,
and SQL databases. You can find a table with all formats
[here](http://pandas.pydata.org/pandas-docs/stable/io.html).
%% Cell type:code id: tags:
```
titanic = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv')
titanic
```
%% Cell type:markdown id: tags:
This loads the data into a
[`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)
object, which is the main object we will be interacting with in pandas. It
represents a table of data. The readers for the other file formats are all named
`pd.read_{format}`. Note that we can provide the URL to the dataset, rather
than download it beforehand.

We can write out the dataset using `dataframe.to_{format}(<filename>)`:
%% Cell type:code id: tags:
```
titanic.to_csv('titanic_copy.csv', index=False)  # we set index to False to prevent pandas from storing the row names
```
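%% Cell type:markdown id: tags:
To see what `index=False` changes, here is a minimal sketch of a write/read
round trip. It uses a small made-up table (`table`) and in-memory buffers
instead of the titanic file, so the names and values are illustrative
assumptions only.
%% Cell type:code id: tags:
```python
import io
import pandas as pd

# Hypothetical two-row table standing in for a real dataset
table = pd.DataFrame({'name': ['a', 'b'], 'value': [1, 2]})

# Without the index: reading the text back gives the same columns
no_index = io.StringIO()
table.to_csv(no_index, index=False)
no_index.seek(0)
restored = pd.read_csv(no_index)
print(restored.columns.tolist())    # ['name', 'value']

# With the index: the row names come back as an extra unnamed column
with_index = io.StringIO()
table.to_csv(with_index)
with_index.seek(0)
restored_with_index = pd.read_csv(with_index)
print(restored_with_index.columns.tolist())  # ['Unnamed: 0', 'name', 'value']
```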
%% Cell type:markdown id: tags:
If you cannot connect to the internet, you can run the command below to load
this locally stored titanic dataset:
%% Cell type:code id: tags:
```
titanic = pd.read_csv('titanic.csv')
titanic
```
%% Cell type:markdown id: tags:
Note that the titanic dataset is also available to us as one of the standard
datasets included with seaborn. We could load it from there using
%% Cell type:code id: tags:
```
sns.load_dataset('titanic')
```
%% Cell type:markdown id: tags:
`DataFrame` objects can also be created from other Python objects, using
`pd.DataFrame.from_{other type}`. The most useful of these is `from_dict`,
which converts a mapping of the columns to a pandas `DataFrame` (i.e., table).
%% Cell type:code id: tags:
```
pd.DataFrame.from_dict({
    'random numbers': np.random.rand(5),
    'sequence (int)': np.arange(5),
    'sequence (float)': np.linspace(0, 5, 5),
    'letters': list('abcde'),
    'constant_value': 'same_value'
})
```
%% Cell type:markdown id: tags:
For many applications (e.g., ICA, machine learning input) you might want to
extract your data as a numpy array. The underlying numpy array can be accessed
using the `to_numpy` method (the recommended replacement for the older `values`
attribute):
%% Cell type:code id: tags:
```
titanic.to_numpy()
```
%% Cell type:markdown id: tags:
Note that the dtype of the returned array is the most general type needed to
hold every column (in this case object). If you just want the numeric parts of
the table you can use `select_dtypes`, which selects specific columns based on
their dtype:
%% Cell type:code id: tags:
```
titanic.select_dtypes(include=np.number).to_numpy()
```
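%% Cell type:markdown id: tags:
As a small illustration of how the array dtype depends on the selected
columns, the sketch below uses a made-up three-row table (`small`) rather than
the titanic data; the column names merely mimic the real ones.
%% Cell type:code id: tags:
```python
import numpy as np
import pandas as pd

# Hypothetical miniature table with float, integer and string columns
small = pd.DataFrame({
    'age': [22.0, 38.0, 26.0],
    'pclass': [3, 1, 3],
    'sex': ['male', 'female', 'female'],
})

mixed = small.to_numpy()                                     # the string column forces dtype object
numeric = small.select_dtypes(include=np.number).to_numpy()  # only the float and integer columns

print(mixed.dtype)    # object
print(numeric.dtype)  # float64
print(numeric.shape)  # (3, 2)
```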
%% Cell type:markdown id: tags:
Note that the numpy array has no information on the column names or row indices.

Alternatively, when you want to include the categorical variables in your later
analysis (e.g., for machine learning), you can extract dummy variables using:
%% Cell type:code id: tags:
```
pd.get_dummies(titanic)
```
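%% Cell type:markdown id: tags:
To make the effect of `get_dummies` easier to see, here is a sketch on a tiny
made-up table (`small`): the numeric column is passed through unchanged, while
the string column is replaced by one indicator column per category.
%% Cell type:code id: tags:
```python
import pandas as pd

# Hypothetical table with one numeric and one categorical column
small = pd.DataFrame({
    'fare': [7.25, 71.28, 7.92],
    'sex': ['male', 'female', 'female'],
})

dummies = pd.get_dummies(small)
print(list(dummies.columns))                       # ['fare', 'sex_female', 'sex_male']
print(dummies['sex_female'].astype(int).tolist())  # [0, 1, 1]
```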
%% Cell type:markdown id: tags:
## Accessing parts of the data

[Documentation on indexing](http://pandas.pydata.org/pandas-docs/stable/indexing.html)

### Selecting columns by name

Single columns can be selected using the normal python indexing:
%% Cell type:code id: tags:
```
titanic['embark_town']
```
%% Cell type:markdown id: tags:
If the column names are simple strings (not required) we can also access a
column directly as an attribute
%% Cell type:code id: tags:
```
titanic.embark_town
```
%% Cell type:markdown id: tags:
Note that this returns a pandas
[`Series`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html)
rather than a `DataFrame` object. A `Series` is simply a 1-dimensional array
representing a single column. Multiple columns can be returned by providing a
list of column names. This will return a `DataFrame`:
%% Cell type:code id: tags:
```
titanic[['class', 'alive']]
```
%% Cell type:markdown id: tags:
Note that you have to provide a list here (square brackets). If you provide a
tuple (round brackets) pandas will think you are trying to access a single
column that has that tuple as a name:
%% Cell type:code id: tags:
```
titanic[('class', 'alive')]
```
%% Cell type:markdown id: tags:
In this case there is no column called `('class', 'alive')`, leading to an
error. Later on we will see some uses for having columns named like this.

### Indexing rows by name or integer

Individual rows can be accessed based on their name (i.e., the index) or integer
position (i.e., which row it is in). In our current table this will give the same
results. To ensure that these are different, let's sort our titanic dataset
based on the passenger fare:
%% Cell type:code id: tags:
```
titanic_sorted = titanic.sort_values('fare')
titanic_sorted
```
%% Cell type:markdown id: tags:
Note that the re-sorting did not change the values in the index (i.e., left-most
column).

We can select the first row of this newly sorted table using `iloc`
%% Cell type:code id: tags:
```
titanic_sorted.iloc[0]
```
%% Cell type:markdown id: tags:
We can select the row with the index 0 using
%% Cell type:code id: tags:
```
titanic_sorted.loc[0]
```
%% Cell type:markdown id: tags:
Note that this gives the same passenger as the first row of the initial table
before sorting
%% Cell type:code id: tags:
```
titanic.iloc[0]
```
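%% Cell type:markdown id: tags:
The difference between label-based `loc` and position-based `iloc` can be
sketched on a three-row made-up table (`fares`), where the values are chosen
so that sorting visibly reorders the index.
%% Cell type:code id: tags:
```python
import pandas as pd

# Hypothetical fares with the default index 0, 1, 2
fares = pd.DataFrame({'fare': [7.25, 71.28, 0.0]})
by_fare = fares.sort_values('fare')

print(by_fare.index.tolist())    # [2, 0, 1]: the labels travel with their rows
print(by_fare.iloc[0]['fare'])   # 0.0, the first row by position
print(by_fare.loc[0]['fare'])    # 7.25, the row labelled 0
print(fares.iloc[0]['fare'])     # 7.25, same passenger as by_fare.loc[0]
```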
%% Cell type:markdown id: tags:
Another common way to access the first or last N rows of a table is using the
head/tail methods
%% Cell type:code id: tags:
```
titanic_sorted.head(3)
```
%% Cell type:code id: tags:
```
titanic_sorted.tail(3)
```
%% Cell type:markdown id: tags:
Note that nearly all methods in pandas return a new `DataFrame`, which means
that we can easily call another method on them
%% Cell type:code id: tags:
```
titanic_sorted.tail(10).head(5)  # select the first 5 of the last 10 passengers in the database
```
%% Cell type:code id: tags:
```
titanic_sorted.iloc[-10:-5]  # alternative way to get the same passengers
```
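%% Cell type:markdown id: tags:
A quick way to convince yourself that the chained call and the slice select
the same rows is to compare them on a made-up table (`df`) whose values are
just the row numbers:
%% Cell type:code id: tags:
```python
import pandas as pd

# Hypothetical table with 20 rows holding the values 0..19
df = pd.DataFrame({'value': range(20)})

chained = df.tail(10).head(5)  # first 5 of the last 10 rows
sliced = df.iloc[-10:-5]       # the same rows selected by position

print(chained['value'].tolist())  # [10, 11, 12, 13, 14]
print(chained.equals(sliced))     # True
```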
%% Cell type:markdown id: tags:
**Exercise**: use sorting and tail/head or indexing to find the 10 youngest
passengers on the titanic. Try to do this on a single line by chaining calls
to the titanic `DataFrame` object
%% Cell type:code id: tags:
```
titanic.sort_values...
```
%% Cell type:markdown id: tags:
### Indexing rows by value

One final way to select specific rows is by their value
%% Cell type:code id: tags:
```
titanic[titanic.sex == 'female']  # selects all females
```
%% Cell type:code id: tags:
```
# select all passengers older than 60 who departed from Southampton
titanic[(titanic.age > 60) & (titanic['embark_town'] == 'Southampton')]
```
%% Cell type:markdown id: tags:
Note that this required typing `titanic` quite often. A quicker way to get the
same result is using the `query` method, which is described in detail
[here](http://pandas.pydata.org/pandas-docs/stable/indexing.html#the-query-method)
(for large tables the `query` method can also be faster and use less memory).

> You may have trouble using the `query` method with columns which have
> a name that cannot be used as a Python identifier.
%% Cell type:code id: tags:
```
titanic.query('(age > 60) & (embark_town == "Southampton")')
```
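%% Cell type:markdown id: tags:
The sketch below checks, on a made-up table (`people`), that the boolean mask
and the `query` string select the same rows. It also shows one possible way
around awkward column names: recent pandas versions (0.25 or newer, as far as
we know) accept backtick-quoted names inside the query string. The
`ticket price` column is a hypothetical example, not part of the titanic data.
%% Cell type:code id: tags:
```python
import pandas as pd

# Hypothetical table; 'ticket price' is deliberately not a valid Python identifier
people = pd.DataFrame({
    'age': [65, 30, 71],
    'embark_town': ['Southampton', 'Cherbourg', 'Queenstown'],
    'ticket price': [10.0, 20.0, 30.0],
})

mask_result = people[(people.age > 60) & (people['embark_town'] == 'Southampton')]
query_result = people.query('(age > 60) & (embark_town == "Southampton")')
print(mask_result.equals(query_result))  # True

# Backticks quote column names that contain spaces (assumes pandas >= 0.25)
expensive = people.query('`ticket price` > 15')
print(expensive.index.tolist())  # [1, 2]
```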
%% Cell type:markdown id: tags:
When selecting multiple options from a categorical variable you might want to
use `isin`:
%% Cell type:code id: tags:
```
titanic[titanic['class'].isin(['First', 'Second'])]
```
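%% Cell type:markdown id: tags:
For illustration, `isin` gives the same boolean mask as combining several
equality tests with `|`. The sketch uses a short made-up `Series` (`classes`)
in place of the real `class` column.
%% Cell type:code id: tags:
```python
import pandas as pd

# Hypothetical passenger classes
classes = pd.Series(['First', 'Third', 'Second', 'Third'])

mask = classes.isin(['First', 'Second'])
same = (classes == 'First') | (classes == 'Second')  # the longer equivalent

print(mask.tolist())      # [True, False, True, False]
print(mask.equals(same))  # True
```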
%% Cell type:markdown id: tags:
Particularly useful when selecting data like this is the `isna` method which
finds all missing data
%% Cell type:code id: tags:
```
titanic[~titanic.age.isna()]  # select all passengers whose age is not N/A
```
%% Cell type:markdown id: tags:
Removing missing values like this is so common that it has its own method
%% Cell type:code id: tags:
```
titanic.dropna()  # drops all passengers that have some datapoint missing
```
%% Cell type:code id: tags:
```
titanic.dropna(subset=['age', 'fare'])  # Only drop passengers with missing ages or fares
```
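%% Cell type:markdown id: tags:
The difference between these `dropna` variants is easiest to see on a small
made-up table (`df`) with missing values in different places. The sketch also
shows the `how='all'` option, which only drops rows in which every value is
missing.
%% Cell type:code id: tags:
```python
import numpy as np
import pandas as pd

# Hypothetical table: each row is missing something different
df = pd.DataFrame({
    'age':  [22.0, np.nan, 26.0, np.nan],
    'fare': [7.25, 71.28, np.nan, np.nan],
    'deck': [None, 'C', None, None],
})

any_missing = df.dropna()                         # drop rows with any missing value
age_or_fare = df.dropna(subset=['age', 'fare'])   # only look at the age and fare columns
all_missing = df.dropna(how='all')                # drop rows where everything is missing

print(len(any_missing))            # 0
print(age_or_fare.index.tolist())  # [0]
print(all_missing.index.tolist())  # [0, 1, 2]
```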