10 Pandas tips

10 Python Pandas tips to make data analysis faster


1. Styling

Have you ever complained that the table output looks boring when you call .head() in a Jupyter notebook? Is there a way not to display the index (especially when there is already an ID column)? There are ways to fix both.

A. Highlight all negative values in a dataframe. (example revised from https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html)

import pandas as pd

def color_negative_red(val):
    color = 'red' if val < 0 else 'black'
    return 'color: %s' % color

df = pd.DataFrame(dict(col_1=[1.53, -2.5, 3.53],
                       col_2=[-4.1, 5.9, 0]))
df.style.applymap(color_negative_red)

B. Hide the index. Try df.head().style.hide_index()! (In pandas 1.4+, hide_index() is deprecated in favor of .hide(axis='index').)
C. Add hovering effects. (example revised from https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.formats.style.Styler.set_table_styles.html)

import numpy as np
df = pd.DataFrame(np.random.randn(5, 3))
df.style.set_table_styles(
    [{'selector': 'tr:hover',
      'props': [('background-color', 'yellow')]}]
)

D. More CSS styles. You can use CSS to change the appearance of the table.

df = pd.DataFrame(
    dict(departure=['SFO', 'SFO', 'LAX', 'LAX', 'JFK', 'SFO'],
         arrival=['ORD', 'DFW', 'DFW', 'ATL', 'ATL', 'ORD'],
         airlines=['Delta', 'JetBlue', 'Delta', 'AA', 'SouthWest',
                   'Delta']),
    columns=['airlines', 'departure', 'arrival'])
df.style.set_table_styles(  
[{'selector': 'tr:nth-of-type(odd)',  
  'props': [('background', '#eee')]},   
 {'selector': 'tr:nth-of-type(even)',  
  'props': [('background', 'white')]},  
 {'selector': 'th',  
  'props': [('background', '#606060'),   
            ('color', 'white'),  
            ('font-family', 'verdana')]},  
 {'selector': 'td',  
  'props': [('font-family', 'verdana')]},  
]  
).hide_index()

2. Pandas options

The reader may have experienced the following issues when using .head(n) to inspect a dataframe:
(1) There are too many columns / rows, and some in the middle are omitted.
(2) Columns containing long text get truncated.
(3) Columns containing floats display too many / too few digits.

One can set

import pandas as pd   
pd.options.display.max_columns = 50  # None -> No Restrictions  
pd.options.display.max_rows = 200    # None -> Be careful with this   
pd.options.display.max_colwidth = 100  
pd.options.display.precision = 3

to solve these issues.

3. Group by with multiple aggregations

In SQL we can do aggregations like

SELECT A, B, max(col_1), avg(col_1), sum(col_2), min(col_2), count(*)  
FROM table  
GROUP BY A, B

In Pandas it can be done with .groupby() and .agg():

import pandas as pd  
import numpy as np   
df = pd.DataFrame(dict(A=['coke', 'sprite', 'coke', 'sprite',  
                          'sprite', 'coke', 'coke'],  
                       B=['alpha','gamma', 'alpha', 'beta',  
                          'gamma', 'beta', 'beta'],  
                       col_1=[1,2,3,4,5,6,7],  
                       col_2=[1,6,2,4,7,9,3]))
tbl = df.groupby(['A','B']).agg({'col_1': ['max', np.mean],  
                                 'col_2': ['sum','min','count']})
# 'count' is always the number of rows in each group.

The result has multi-indexed rows and columns. A quick way to turn it back into a dataframe without multi-indices is

tbl = tbl.reset_index()  
tbl.columns = ['A', 'B', 'col_1_max', 'col_1_mean',
               'col_2_sum', 'col_2_min', 'count']

If you would like to have the column renaming process automated, you can do tbl.columns.get_level_values(0) and tbl.columns.get_level_values(1) to extract the indices in each level and combine them.
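A hedged sketch of that automation, rebuilding the `tbl` from above so the snippet is self-contained (here the string 'mean' stands in for np.mean, which newer pandas versions prefer):

```python
import pandas as pd

df = pd.DataFrame(dict(A=['coke', 'sprite', 'coke', 'sprite',
                          'sprite', 'coke', 'coke'],
                       B=['alpha', 'gamma', 'alpha', 'beta',
                          'gamma', 'beta', 'beta'],
                       col_1=[1, 2, 3, 4, 5, 6, 7],
                       col_2=[1, 6, 2, 4, 7, 9, 3]))
tbl = df.groupby(['A', 'B']).agg({'col_1': ['max', 'mean'],
                                  'col_2': ['sum', 'min', 'count']})
# Join the two column levels with '_' to get flat names like 'col_1_max'
tbl.columns = ['_'.join(pair) for pair in
               zip(tbl.columns.get_level_values(0),
                   tbl.columns.get_level_values(1))]
tbl = tbl.reset_index()
print(tbl.columns.tolist())
```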

4. Column slicing

Some of you might be familiar with this already, but I still find it very useful when handling a dataframe with a ton of columns.

df.iloc[:, 2:5].head()            # columns at positions 2-4 (zero-based)  
df.loc[:, 'column_x':].head()     # all columns from 'column_x' onward

5. Add row ID / random row ID to each group

To add a row ID / random row ID for each group by A, B, one can first append an ID / random ID to all rows:

import numpy as np
# df: the target dataframe
np.random.seed(0)  # set random seed for reproducibility
df['random_ID_all'] = np.random.permutation(df.shape[0])
df['ID_all'] = range(1, df.shape[0] + 1)

To add a random ID to each group (by A, B), one can then do

df['ID'] = df.groupby(['A', 'B'])['ID_all']\
             .rank(method='first', ascending=True).astype(int)
df['random_ID'] = df.groupby(['A', 'B'])['random_ID_all']\
                    .rank(method='first', ascending=True).astype(int)

to get an ID / random ID that counts up within each (A, B) group.
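Since the original post's screenshot is not reproduced here, a quick sketch (reusing the drinks dataframe from tip 3) shows what the per-group IDs look like:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(A=['coke', 'sprite', 'coke', 'sprite',
                          'sprite', 'coke', 'coke'],
                       B=['alpha', 'gamma', 'alpha', 'beta',
                          'gamma', 'beta', 'beta']))
np.random.seed(0)
df['random_ID_all'] = np.random.permutation(df.shape[0])
df['ID_all'] = range(1, df.shape[0] + 1)
# rank(method='first') numbers rows 1, 2, ... within each (A, B) group
df['ID'] = df.groupby(['A', 'B'])['ID_all']\
             .rank(method='first', ascending=True).astype(int)
df['random_ID'] = df.groupby(['A', 'B'])['random_ID_all']\
                    .rank(method='first', ascending=True).astype(int)
print(df)
```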

6. List all unique values in a group

Sometimes after a group by, we'd like to aggregate the values in the target column as a list of unique values instead of a max, min, etc. This is how it's done.

df = pd.DataFrame(dict(A=['A','A','A','A','A','B','B','B','B'],  
                       B=[1,1,1,2,2,1,1,1,2],  
                       C=['CA','NY','CA','FL','FL',       
                          'WA','FL','NY','WA']))
tbl = df[['A', 'B', 'C']].drop_duplicates()\
                         .groupby(['A', 'B'])['C']\
                         .apply(list)\
                         .reset_index()
# list to string (separated by commas)
tbl['C'] = tbl.apply(lambda x: ','.join(str(s) for s in x['C']), axis=1)

If you’d like to save the result, don’t forget to change the separator to anything other than commas.
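As an alternative sketch, the dedupe-then-join steps can be collapsed into a single .agg call (same toy dataframe; '|' is an arbitrary separator chosen to be CSV-safe):

```python
import pandas as pd

df = pd.DataFrame(dict(A=['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                       B=[1, 1, 1, 2, 2, 1, 1, 1, 2],
                       C=['CA', 'NY', 'CA', 'FL', 'FL',
                          'WA', 'FL', 'NY', 'WA']))
# s.unique() deduplicates within each (A, B) group, preserving order
tbl = (df.groupby(['A', 'B'])['C']
         .agg(lambda s: '|'.join(s.unique()))
         .reset_index())
print(tbl)
```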

7. Add row total and column total to a numerical dataframe

This is another common data manipulation. All you need is .apply().

df = pd.DataFrame(dict(A=[2,6,3],  
                       B=[2,2,6],   
                       C=[3,2,3]))
df['col_total']     = df.apply(lambda x: x.sum(), axis=1)  # sum across each row  
df.loc['row_total'] = df.apply(lambda x: x.sum())          # sum down each column

8. Check memory usage

**.memory_usage(deep=True)** can be used on Pandas dataframes to see the amount of memory used (in bytes) by each column. It's useful when building machine learning models, which may require a lot of memory in training.
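A minimal sketch (the toy dataframe is assumed, not from the original post). The deep=True flag matters for string columns: without it, object columns only report the size of their pointers, not the strings themselves.

```python
import pandas as pd

df = pd.DataFrame(dict(ids=range(1000),
                       city=['San Francisco'] * 1000))
shallow = df.memory_usage()            # bytes per column, pointers only
deep = df.memory_usage(deep=True)      # bytes per column, actual strings
print(deep)
print('total MB:', deep.sum() / 1e6)
```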

9. Cumulative sum

From time to time, a cumulative sum is needed when you generate statistical outcomes. Simply do

df['cumulative_sum'] = df['target_column'].cumsum()

10. Crosstab

When you need to count the frequencies for groups formed by 3+ features, pd.crosstab() can make your life easier.
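A minimal sketch, reusing the flights dataframe from tip 1, counting how often each (departure, airline) pair flies to each arrival airport:

```python
import pandas as pd

df = pd.DataFrame(dict(
    departure=['SFO', 'SFO', 'LAX', 'LAX', 'JFK', 'SFO'],
    arrival=['ORD', 'DFW', 'DFW', 'ATL', 'ATL', 'ORD'],
    airlines=['Delta', 'JetBlue', 'Delta', 'AA', 'SouthWest', 'Delta']))
# Rows: (departure, airlines) pairs; columns: arrival; cells: counts
counts = pd.crosstab(index=[df['departure'], df['airlines']],
                     columns=df['arrival'])
print(counts)
```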

Thanks for reading! Comment below if you find bugs / better solutions.