Introduction to pandas
We have already encountered the module pandas several times in this lecture, but only used it to render nice-looking tables. Despite its name, pandas has nothing to do with the cute bears; it stands for "panel data". And its abilities go far beyond prettifying tables. This chapter serves as an introduction to the module and covers some basic operations to get started with pandas. You can discover more functionality in its official documentation.
We shall use a table containing properties of all chemical elements as an example to begin our journey. It is recommended to use a Jupyter notebook, which renders the tables nicely.
After importing pandas with
import pandas as pd
we can create an instance of the class DataFrame directly from the csv-file
by calling
df = pd.read_csv('ptable.csv')
df
We see a table with the middle rows and columns truncated. This gives us an
overview of huge tables in the limited screen space. To have more columns
displayed, we can use the function set_option:
pd.set_option('display.max_columns', 30)
This allows us to view all columns. More options can be set using this function; for a detailed description, refer to the chapter Options and settings of the pandas documentation.
If we just want to look at the names of all columns without the data, the
attribute columns can be used:
df.columns
The methods head and tail can be used with an optional argument n to
view the first or last n rows, respectively.
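For instance, with a small stand-in DataFrame (hypothetical data, not the real ptable.csv):

```python
import pandas as pd

# Small illustrative DataFrame standing in for ptable.csv
df = pd.DataFrame({'AtomicNumber': [1, 2, 3, 4, 5],
                   'Element': ['Hydrogen', 'Helium', 'Lithium', 'Beryllium', 'Boron']})

df.head(2)  # first two rows
df.tail(2)  # last two rows
df.head()   # without an argument, n defaults to 5
```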
An important operation on a DataFrame is indexing, which allows us to get
rows, columns, a subtable or a single value. Since the columns are named,
they can be accessed in a dictionary-like style:
df['Element']
They can also be accessed as attributes:
df.Element
However, if a column name coincides with a built-in attribute or method
of DataFrame, attribute access will not work. The dictionary-like
style is therefore preferred.
Multiple columns can be selected by using a list of strings as indices:
df[['AtomicNumber', 'Element']]
If one wants to select specific rows and columns, the property loc
can be used, e.g. like
df.loc[[2, 3, 4, 5], ['AtomicNumber', 'AtomicMass']]
Omitting the column indices delivers the selected rows with all columns.
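A short sketch of row-only selection, again with hypothetical stand-in data:

```python
import pandas as pd

df = pd.DataFrame({'AtomicNumber': [1, 2, 3, 4, 5],
                   'AtomicMass': [1.008, 4.003, 6.94, 9.012, 10.81]})

df.loc[[2, 3, 4]]  # rows 2, 3 and 4, with all columns
```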
One can also use slicing in a list-like fashion. But be careful: slices of a
DataFrame include both ends! The command
df.loc[2:5, 'AtomicNumber':'AtomicMass']
will thus give us every row from index 2 to 5 and, of those, the columns
from AtomicNumber to AtomicMass. Including both ends actually makes sense:
otherwise you would have to look up the name of the next column every
time you want to slice columns.
One can also use custom labels to index rows. To do this, we first have to define a row index by calling
df.set_index('Symbol')
This returns a new instance of DataFrame with the custom row labels. If
we want to modify the current instance instead, we can call set_index with the
optional argument inplace=True.
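The difference can be sketched with a tiny hypothetical table:

```python
import pandas as pd

df = pd.DataFrame({'Symbol': ['H', 'He'],
                   'Element': ['Hydrogen', 'Helium']})

df2 = df.set_index('Symbol')          # returns a new DataFrame; df is unchanged
df.set_index('Symbol', inplace=True)  # modifies df itself

df.loc['He', 'Element']  # rows can now be indexed by symbol
```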
If we import the data from a csv file, as in this case, we can set the row labels already at import time by calling
df = pd.read_csv('ptable.csv', index_col='Symbol')
After this, we can do things like
df.loc['Mo', 'Element':'AtomicMass']
If you still want to access rows by integer positions, you can use the
property iloc like
df.iloc[2:5]
or reset the row indices by calling
df.reset_index(inplace=True)
The command
df.loc['Mo', 'Group']
delivers the group of Molybdenum. Although possible, it is not recommended to access a single value like this, since the command
df.at['Mo', 'Group']
is both clearer and faster.
Instances of DataFrame also support boolean indexing. For example, all
metals can be selected by
filt = df['Metal'] == 'yes'
df[filt]
DataFrame also makes plotting extremely easy. While calling the
plot method directly with
df.plot()
plots every numeric column against the row index, which is rarely useful, the command
df.plot.line(x='AtomicNumber', y='AtomicRadius')
plots atomic radius against atomic number with a line, which is more useful.
Although we cannot plot columns with text values, they can still be analyzed, e.g. by counting the occurrences of the different values:
loc_count = df['DiscoveryLocation'].value_counts()
loc_count
We can visualize this result using a pie chart:
loc_count.plot.pie()
The power of pandas is not limited to simple data analysis. We can perform more sophisticated statistics with it. For this, we shall use the results of the Stack Overflow Annual Developer Survey from 2021. After downloading, we unpack the obtained zip file and use survey_results_public.csv for the statistical analysis.
First, we load the csv file as an instance of DataFrame:
df = pd.read_csv('survey_results_public.csv')
df
To get an overview of all numeric data, we can call
df.describe()
This gives us some important statistics, like the mean, standard deviation, etc. of
all columns with numerical values. If you only want to compute the mean of one
specific column, you can index it and call the mean method. In this case,
we use the salary converted to USD:
df['ConvertedCompYearly'].mean()
You can even compute multiple statistical values at once by using the agg method,
which is an abbreviation of aggregation:
df['ConvertedCompYearly'].agg(['mean', 'median'])
The results of specific countries can be selected using the loc property.
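A minimal sketch of this, combining a boolean filter with loc on a hypothetical stand-in for the survey data (the real file has far more rows and columns):

```python
import pandas as pd

# Hypothetical stand-in for survey_results_public.csv
df = pd.DataFrame({'Country': ['Germany', 'France', 'Germany', 'France'],
                   'ConvertedCompYearly': [60000.0, 50000.0, 70000.0, 54000.0]})

filt = df['Country'] == 'Germany'
df.loc[filt, 'ConvertedCompYearly'].agg(['mean', 'median'])
```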
After treating the numerical data, we shall deal with the rest. Data
with many repeated values, like the column Country, can be grouped
like
grp = df.groupby(['Country'])
Calling
grp.get_group('Germany')
will give us all entries with Country == Germany, similar to the effect
of boolean indexing. Grouping is, however, far more flexible. If we
would like to, say, see the age distribution in several countries,
we could filter for these countries and look at the Age column. An easier
way is to create a group, get the Age column and count the frequency
of all possible answers:
grp['Age'].value_counts()
To get the selected countries, we can just use the loc property on the result.
Remember to use double brackets, i.e. a list of labels, when indexing multiple countries:
grp['Age'].value_counts().loc[['Germany', 'United States of America']]
Often, the absolute count of each category does not tell us much, since the
total number of answers can vary quite a bit between groups. In this
case, a fraction or percentage is more informative. This is easily achieved
with the optional argument normalize of the value_counts method:
grp['Age'].value_counts(normalize=True).loc[['Germany', 'United States of America']]
Inspecting the column LanguageHaveWorkedWith, we find that it contains
programming languages concatenated with ",". If we want to find out the
number of programmers who have used Python, we can call
df['LanguageHaveWorkedWith'].str.contains('Python').sum()
To get the number in one specific country, we can use boolean indexing:
filt = df['Country'] == 'Germany'
df.loc[filt, 'LanguageHaveWorkedWith'].str.contains('Python').sum()
If we try to apply the str methods to a group like
grp['LanguageHaveWorkedWith'].str.contains('Python').sum()
we get an error message, since str methods are not defined for instances
of the SeriesGroupBy class, which we obtain by selecting a column of a group.
These instances, however, offer a more general method apply, which can be
used like
grp['LanguageHaveWorkedWith'].apply(lambda x: x.str.contains('Python').sum())
to give us the number of Python programmers in each country. The lambda
keyword defines an anonymous function of the variable x, on which the
str methods can be applied. In this case too, a fraction carries more
information than the absolute number. For this, we divide the number
of Python programmers by the total number of people who wrote something in
the LanguageHaveWorkedWith column:
grp['LanguageHaveWorkedWith'].apply(lambda x: x.str.contains('Python').sum() / x.count())
And of course, the loc property can be used to get specific countries.
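The whole chain can be sketched end to end with hypothetical stand-in data (the answer strings and countries are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the survey DataFrame
df = pd.DataFrame({
    'Country': ['Germany', 'Germany', 'France', 'France'],
    'LanguageHaveWorkedWith': ['Python,C++', 'C++', 'Python', 'Python,Java'],
})

grp = df.groupby('Country')
# Fraction of Python programmers per country
frac = grp['LanguageHaveWorkedWith'].apply(
    lambda x: x.str.contains('Python').sum() / x.count())
frac.loc[['Germany']]  # select a specific country from the result
```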
This was just a very brief introduction to pandas to get you started. Feel free to play around with different options and discover more functionality yourself!