Data manipulation with pandas


6.1 pandas basics

6.1.1 `stack()`

Pandas dataframes use an "index" to identify rows and "columns" to identify columns. Sometimes we will encounter dataframes with a MultiIndex, where each row is labelled by a tuple of values, one per index level.
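As an illustration, a small MultiIndex dataframe could be built as follows. This is only a sketch: the level labels, the level names `first` and `second`, and the columns `height` and `weight` are assumptions, chosen so that a row `("bar", "one")` and a column `height` exist.

```python
import pandas as pd

# Hypothetical labels and values, made up for illustration only.
index = pd.MultiIndex.from_product(
    [["bar", "baz", "foo"], ["one", "two"]],
    names=["first", "second"],
)
df1 = pd.DataFrame(
    {
        "height": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "weight": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    },
    index=index,
)

# Each row is identified by a (first, second) tuple.
print(df1)
```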

As usual, we can slice rows of the dataframe by position, without touching the index: `df1[:4]`. The preferred way is to use `.loc` (label-based) or `.iloc` (position-based): `df1.loc[("bar", "one"), "height"]`, `df1.iloc[1, 1]`.
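The three access patterns can be tried on a small stand-in for `df1`. The data below is hypothetical (the labels and numbers are assumptions made for this sketch), so only the access syntax should be read as the point.

```python
import pandas as pd

# Hypothetical stand-in for df1; labels and values are assumptions.
index = pd.MultiIndex.from_product(
    [["bar", "baz", "foo"], ["one", "two"]],
    names=["first", "second"],
)
df1 = pd.DataFrame(
    {
        "height": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "weight": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    },
    index=index,
)

first_four = df1[:4]                          # first four rows, by position
by_label = df1.loc[("bar", "one"), "height"]  # row label tuple, column label
by_position = df1.iloc[1, 1]                  # second row, second column

print(first_four)
print(by_label, by_position)
```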

These slicing and accessing operations do not change the index or the columns. When cleaning data, a more common task is changing the shape of the data. Thanks to pandas' index and column structure, this is easy to do.

Note: using the `unstack()` function, we can move a level of the index into the columns (`stack()` does the reverse). The positions of the data are different, but the contents are the same.
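A minimal sketch of reshaping with `unstack()` and `stack()`, again on a hypothetical dataframe (all labels and values are assumptions):

```python
import pandas as pd

# Hypothetical MultiIndex dataframe; labels and values are assumptions.
index = pd.MultiIndex.from_product(
    [["bar", "baz", "foo"], ["one", "two"]],
    names=["first", "second"],
)
df1 = pd.DataFrame(
    {
        "height": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "weight": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    },
    index=index,
)

# unstack(): the inner index level "second" becomes a column level.
wide = df1.unstack()

# stack(): the column labels become the innermost index level,
# giving a Series indexed by (first, second, column name).
stacked = df1.stack()

print(wide)
print(stacked)
```

The same twelve numbers appear in `df1`, `wide`, and `stacked`; only the labels used to reach them differ.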

6.1.2 The `dplyr` pipeline

Let's take a look at the famous iris data for classification. We want to first compute two ratios. $$x = \frac{sepal\_width}{sepal\_length}$$ and $$y=\frac{petal\_width}{petal\_length}.$$

Then we will make a plot of the two ratios.

Let's read in the data and use a pipeline operator to glue our commands together.
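The pipeline code itself is not shown here. As a rough substitute, the sketch below uses plain pandas method chaining rather than a pipe operator, which is a swapped-in technique and not necessarily what the original used. The six rows are a hand-typed sample in the iris column layout, standing in for the full dataset.

```python
import pandas as pd

# A few hand-typed rows in the iris layout (an assumption made to keep
# the example self-contained; the real dataset has 150 rows).
iris = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
        "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
        "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
        "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
        "species": [
            "setosa", "setosa",
            "versicolor", "versicolor",
            "virginica", "virginica",
        ],
    }
)

# Each step returns a new dataframe, so the steps can be chained.
ratios = (
    iris
    .assign(
        x=lambda d: d["sepal_width"] / d["sepal_length"],
        y=lambda d: d["petal_width"] / d["petal_length"],
    )
    .loc[:, ["species", "x", "y"]]
)

print(ratios)
```

With matplotlib installed, `ratios.plot.scatter(x="x", y="y")` would draw the two ratios against each other.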

6.1.3 Long-wide conversion

We now look at another kind of conversion. This time, we allow the contents of the table to be swapped with the index/columns. The difference lies in how we organize the data. If we keep the contents unchanged, the index/columns give us different angles from which to view the contents. When we allow contents to become index/columns, we are summarising the contents.

Since we are summarising the data contents, we can also pass an aggregation function through the `aggfunc` argument.
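A minimal sketch of this kind of summary, assuming `pivot_table` is the function in question (it is the pandas function that takes `aggfunc`). The `sales` table and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical long table: one row per sale.
sales = pd.DataFrame(
    {
        "city": ["A", "A", "A", "B", "B", "B"],
        "year": [2020, 2020, 2021, 2020, 2021, 2021],
        "amount": [10, 20, 30, 40, 50, 60],
    }
)

# Values of "city" and "year" become the index and the columns;
# "amount" is aggregated within each (city, year) cell.
mean_table = sales.pivot_table(index="city", columns="year", values="amount")

# The default aggregation is the mean; aggfunc selects another one.
sum_table = sales.pivot_table(
    index="city", columns="year", values="amount", aggfunc="sum"
)

print(mean_table)
print(sum_table)
```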

We can also convert wide tables to long tables. This way, we are analyzing information. In machine learning studies (and in data analysis in general), we use rows to represent observations and columns to represent the attributes of the units. When we pull columns down to rows (wide to long), we turn attributes into observations. This makes sense when the attributes are repeated observations or comparisons of the same kind.
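One way to do the wide-to-long conversion is `DataFrame.melt`. The sketch below assumes a hypothetical table in which two score columns are repeated measurements of the same kind.

```python
import pandas as pd

# Hypothetical wide table: one row per subject, one column per year.
wide = pd.DataFrame(
    {
        "subject": ["s1", "s2"],
        "score_2020": [70, 80],
        "score_2021": [75, 85],
    }
)

# Pull the two score columns down into rows: each (subject, measurement)
# pair becomes its own observation.
long_df = wide.melt(
    id_vars="subject",
    var_name="measurement",
    value_name="score",
)

print(long_df)
```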

6.1.4 Alignment

pandas aligns indexes and columns before performing operations to ensure consistency. So when the index/columns do not match, the operation will produce unwanted values, typically NaN for labels that appear on only one side.
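A small sketch of alignment in arithmetic, using two made-up Series whose indexes only partly overlap:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Values are matched by label, not by position. Labels present on only
# one side ("a" and "d") have no partner and become NaN.
total = s1 + s2

# add() with fill_value treats a missing partner as 0 instead.
filled = s1.add(s2, fill_value=0)

print(total)
print(filled)
```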

This is reasonable. We can also be sure that the data we want to manipulate appear in the right place.

What is weird, though, is the alignment-assignment precedence.
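The original example is not shown here. One case that may be what is meant, described in the pandas indexing documentation, is swapping two columns through `.loc`: the right-hand side is aligned on its labels before the assignment happens. The dataframe below is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# The right-hand side is a dataframe, so it is aligned on column labels
# first: "A" receives the old "A", "B" the old "B". Nothing is swapped.
df.loc[:, ["B", "A"]] = df[["A", "B"]]
after_aligned = df.copy()

# A raw array carries no labels, so the assignment is positional:
# "B" receives the old "A" and "A" the old "B".
df.loc[:, ["B", "A"]] = df[["A", "B"]].to_numpy()
after_raw = df.copy()

print(after_aligned)
print(after_raw)
```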

6.1.5 WRDS connection

See this post for an introduction.
