Introduction to pandas

We have already encountered the module pandas several times in this lecture, but only used it to render nice-looking tables. Despite its name, pandas has nothing to do with the cute bears; it stands for "panel data". And its abilities go way beyond prettifying tables. This chapter serves as an introduction to the module and covers some basic operations to get started with pandas. You can discover more functionality in its official documentation.

We shall use a table containing properties of all chemical elements as an example to begin our journey. It is recommended to use a Jupyter notebook to get nicely rendered output.

After importing pandas with

import pandas as pd

we can create an instance of the class DataFrame directly from the csv-file by calling

df = pd.read_csv('ptable.csv')
df

We can see a table with the middle rows and columns truncated. This allows us to get an overview of a huge table in the limited screen space. In order to have more columns displayed, we can use the function set_option:

pd.set_option('display.max_columns', 30)

This will allow us to view all columns. More options can be set using this function. For a detailed description, you can refer to the chapter Options and settings in the pandas documentation.
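
For example, the analogous option display.max_rows controls how many rows are displayed:

pd.set_option('display.max_rows', 100)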

If we just want to look at the names of all columns without the data, the attribute columns can be used:

df.columns

The methods head and tail can be used with an optional argument n to view the first or last n lines, respectively.
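
For example, the following shows the first five rows (the default) and the last three:

df.head()
df.tail(3)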

An important operation on a DataFrame is indexing, which allows us to get row(s), column(s), a subtable or a single value. Since the columns are named, they can be accessed in a dictionary-like style:

df['Element']

They can also be accessed as attributes:

df.Element

However, if the column name is identical to a built-in attribute or method of DataFrame, this style will not work. Therefore, the dictionary-like style is preferred.
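
As an illustration, suppose a table had a column named count; this name is hypothetical and chosen precisely because it collides with an existing method:

df.count       # the built-in DataFrame.count method, not a column
df['count']    # would select the (hypothetical) column named 'count'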

Multiple columns can be selected by using a list of strings as indices:

df[['AtomicNumber', 'Element']]

If one wants to select specific rows and columns, the property loc can be used, e.g.

df.loc[[2, 3, 4, 5], ['AtomicNumber', 'AtomicMass']]

Omitting the column indices gives us the selected rows with all columns.
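
For example, this selects the same four rows with every column:

df.loc[[2, 3, 4, 5]]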

One can also use slicing in a list-like fashion. But be careful: slices in a DataFrame include both ends! The command

df.loc[2:5, 'AtomicNumber':'AtomicMass']

will thus give us every row from row index 2 to 5, and from those rows the columns AtomicNumber to AtomicMass. It actually does make sense to include both ends; otherwise you would have to look up the name of the next column every time you want to slice columns.

One can also use custom labels to index rows. To do this, we first have to define a row index by calling

df.set_index('Symbol')

This will return a new instance of DataFrame with the custom row labels. If we want to overwrite the current instance instead, we can call set_index with the optional argument inplace=True.
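
In our case:

df.set_index('Symbol', inplace=True)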

If we import the data from a csv file, like in this case, we can set the row label already when importing by calling

df = pd.read_csv('ptable.csv', index_col='Symbol')

After this, we can do things like

df.loc['Mo', 'Element':'AtomicMass']

If you still want to access rows using integer indices, you can use the property iloc like

df.iloc[2:5]

or reset the row indices by calling

df.reset_index(inplace=True)
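
Note that iloc follows the usual Python slicing convention: in contrast to loc, the end of a slice is excluded. It also accepts a second index for the columns, e.g.

df.iloc[2:5, 0:3]

which selects the rows at positions 2 to 4 and the first three columns.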

The command

df.loc['Mo', 'Group']

gives us the group of molybdenum. Although possible, it is not recommended to access a single value like this, since the command

df.at['Mo', 'Group']

is both clearer and faster.

Instances of DataFrame also support boolean indexing. All metals, for example, can be selected by

filt = df['Metal'] == 'yes'
df[filt]
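
Multiple conditions can be combined with the operators & and |. As a minimal sketch, reusing the columns from above (the parentheses are required due to operator precedence):

filt = (df['Metal'] == 'yes') & (df['AtomicNumber'] < 20)
df[filt]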

DataFrame allows us to make plots extremely easily. While calling the plot method directly by

df.plot()

plots every column with numeric values against the row index, which is rarely useful, the command

df.plot.line(x='AtomicNumber', y='AtomicRadius')

plots atomic radius against atomic number with a line, which is more useful.

Although we cannot plot columns with text values, they can still be analyzed, e.g. by counting the occurrences of the different values:

loc_count = df['DiscoveryLocation'].value_counts()
loc_count

We can visualize this result using a pie chart:

loc_count.plot.pie()
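
If there are many distinct locations, the pie chart becomes crowded. One possible remedy is to plot only the most frequent entries:

loc_count.head(10).plot.pie()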

The power of pandas is not limited to such simple data analysis. We can perform more sophisticated statistics with pandas. For this, we shall use the results of the Stack Overflow Annual Developer Survey from 2021. After downloading, we unpack the obtained zip-file and use survey_results_public.csv for the statistical analysis.

First, we load the csv-file as an instance of DataFrame:

df = pd.read_csv('survey_results_public.csv')
df

To get an overview of all numeric data, we can call

df.describe()

This will give us some important statistical quantities, like mean, std, etc. of all columns with numerical values. If you only want to compute the mean of one specific column, you can index it and call the mean method. In this case, we use the salary converted to USD:

df['ConvertedCompYearly'].mean()

You can even compute multiple statistical values at once by using the agg method, whose name is an abbreviation of aggregation:

df['ConvertedCompYearly'].agg(['mean', 'median'])

The results for specific countries can be selected using the loc property. After treating the numerical data, we shall deal with the rest. For data with a lot of repeated values, like the column Country, we could group them like

grp = df.groupby(['Country'])

Calling

grp.get_group('Germany')

will give us all entries with Country == Germany, similar to the effect of boolean indexing. Grouping is, however, way more flexible. If we would like to, say, see the age distribution among several countries, we could filter for these countries and look at the Age column. An easier way is to create a group, get the Age column and count the frequency of all possible answers. This is done by using

grp['Age'].value_counts()

To get the selected countries, we can just use the loc property on the result. Remember to use a list (double square brackets) when indexing multiple countries:

grp['Age'].value_counts().loc[['Germany', 'United States of America']]

Often, the absolute number in each category does not tell us much, since the total number of answers can vary quite a bit between groups. In this case, a fraction or percentage tells us more. This can easily be achieved with the optional argument normalize of the value_counts method:

grp['Age'].value_counts(normalize=True).loc[['Germany', 'United States of America']]

Inspecting the column LanguageHaveWorkedWith, we can find that it contains programming languages concatenated with ";". If we want to find out the number of programmers who have used Python, we could call

df['LanguageHaveWorkedWith'].str.contains('Python').sum()

To get the number in one specific country, we can use boolean indexing:

filt = df['Country'] == 'Germany'
df.loc[filt, 'LanguageHaveWorkedWith'].str.contains('Python').sum()

If we try to apply the str method on a group like

grp['LanguageHaveWorkedWith'].str.contains('Python').sum()

we get an error message, since str methods are not defined for instances of the SeriesGroupBy class, which is what we obtain when indexing a group with a column name. These instances, however, offer a more general method apply, which can be used like

grp['LanguageHaveWorkedWith'].apply(lambda x: x.str.contains('Python').sum())

to give us the number of Python programmers in each country. The lambda keyword defines an anonymous function of the variable x, on which the str methods can be applied. Also in this case, a fraction carries more information than the absolute number. For this, we can divide the number of Python programmers by the total number of people who wrote something in the LanguageHaveWorkedWith column:

grp['LanguageHaveWorkedWith'].apply(lambda x: x.str.contains('Python').sum() / x.count())

And of course, the loc property can be used to get specific countries.

This was just a very brief introduction to pandas to get you started. Feel free to play around with different options and discover more functionality yourself!