What is the Difference Between axis=0 and axis=1 When Working with Pandas Dataframes?

Many Pandas functions ask you to specify an axis, and the documentation for that parameter can feel vague, overly technical, or both.

For instance, here’s a quote from the apply function’s documentation:

axis : {0 or ‘index’, 1 or ‘columns’}, default 0
Axis along which the function is applied:
0 or ‘index’: apply function to each column.
1 or ‘columns’: apply function to each row.

Uuuum… right.

So what’s the difference? Here’s an example…

Load a CSV file to play with

Prerequisites (if you want to practice)

Install the pandas library for your Python environment
Cells in this notebook expect the "Car Sales.csv" file to be in the same directory as your notebook

First Things First

import pandas as pd

# Read the CSV file
# This assumes "Car Sales.csv" is in the same directory as your notebook
car_sales_data = pd.read_csv("Car Sales.csv")

# Show the first 5 rows
first_five = car_sales_data.head(5)
display(first_five)

	DealershipName	RedCars	SilverCars	BlackCars	BlueCars	MonthSold	YearSold
0	Clyde's Clunkers	902.0	650.0	754.0	792.0	1.0	2018.0
1	Clyde's Clunkers	710.0	476.0	518.0	492.0	2.0	2018.0
2	Clyde's Clunkers	248.0	912.0	606.0	350.0	3.0	2018.0
3	Clyde's Clunkers	782.0	912.0	858.0	446.0	4.0	2018.0
4	Clyde's Clunkers	278.0	354.0	482.0	752.0	5.0	2018.0

Each row of the car sales data appears to summarize the total sales of each color of car for one dealership in one month of one year.

To state the “grain” of the data frame another way, the data frame contains one row per dealership, month, year combo and reports the total number of cars sold by color.
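If you want to check a grain like this rather than eyeball it, one rough approach is to look for repeated dealership/month/year combinations. The sketch below is only an illustration: `sample` is a hand-typed stand-in for a few rows of the file (not the real CSV), and `grain_columns` is a name made up for this example.

```python
import pandas as pd

# Hypothetical stand-in for the first rows of "Car Sales.csv"
sample = pd.DataFrame({
    "DealershipName": ["Clyde's Clunkers"] * 5,
    "RedCars": [902.0, 710.0, 248.0, 782.0, 278.0],
    "MonthSold": [1.0, 2.0, 3.0, 4.0, 5.0],
    "YearSold": [2018.0] * 5,
})

# The columns that are supposed to identify exactly one row each
grain_columns = ["DealershipName", "MonthSold", "YearSold"]

# duplicated() flags any row whose grain columns repeat an earlier row
has_repeats = sample.duplicated(subset=grain_columns).any()
print("Grain holds:", not has_repeats)
```

Running the same check on the full data frame would tell you whether the “one row per dealership, month, year” reading holds beyond the first five rows.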

Choose Your Scenario

Suppose that two people come to you and ask separate questions about average car sales.

Lucy asks, “Can you calculate the average number of cars sold for each color?”
Zack asks, “Can you calculate the average number of cars sold (regardless of color) for each dealership in each month & year?” (so basically the average of red, silver, black, and blue cars for each row)

Start with Lucy.

Think about what you’d do to answer Lucy’s question by hand if you didn’t have Pandas to do the work for you. Here’s what I’d do:

1. Start with the RedCars column.
2. Add up 902, 710, 248, 782, 278, and so on.
3. Divide that sum by the total number of values. Boom. RedCars average.
4. Rinse and repeat steps 1-3 for SilverCars, BlackCars, and BlueCars.

This is an axis=0 scenario in Pandas. The code here runs on first_five, so the averages cover only the five rows shown above.

first_five[['RedCars', 'SilverCars', 'BlackCars', 'BlueCars']].mean(axis=0)

RedCars       584.0
SilverCars    660.8
BlackCars     643.6
BlueCars      566.4
dtype: float64
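To connect the by-hand steps to what mean(axis=0) does, here is a small sketch that walks down each column with a plain Python loop and compares the result with Pandas. It assumes the five rows shown earlier, typed in by hand as `sample` so the block runs without the CSV; `by_hand` and `pandas_result` are names invented for this illustration.

```python
import pandas as pd

color_columns = ["RedCars", "SilverCars", "BlackCars", "BlueCars"]

# The five rows shown earlier, typed in by hand (not read from the CSV)
sample = pd.DataFrame({
    "RedCars": [902.0, 710.0, 248.0, 782.0, 278.0],
    "SilverCars": [650.0, 476.0, 912.0, 912.0, 354.0],
    "BlackCars": [754.0, 518.0, 606.0, 858.0, 482.0],
    "BlueCars": [792.0, 492.0, 350.0, 446.0, 752.0],
})

# Lucy's question by hand: go down each column, add up, divide by the count
by_hand = {}
for column in color_columns:
    values = sample[column].tolist()
    by_hand[column] = sum(values) / len(values)

# The same question asked of Pandas
pandas_result = sample[color_columns].mean(axis=0)

for column in color_columns:
    print(column, by_hand[column], pandas_result[column])
```

Notice that the result has one entry per column: with axis=0 the rows are what get collapsed.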

What about Zack?

What would you do to answer his question by hand, without Pandas? How’s this…

1. Start with the first row of data (Row 0), since his question matches the “grain” of the data frame… one row per dealership per month & year.
2. Add up the RedCars, SilverCars, BlackCars, and BlueCars values for Row 0 and divide by 4. So (902 + 650 + 754 + 792)/4
3. Rinse and repeat steps 1 & 2 for every row in the data frame. Boom. Average cars sold by dealer/month/year.

This is an axis=1 scenario.

first_five[['RedCars', 'SilverCars', 'BlackCars', 'BlueCars']].mean(axis=1)

0    774.5
1    549.0
2    529.0
3    749.5
4    466.5
dtype: float64
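The row-wise version of the by-hand steps can be sketched the same way. This is an illustration only: `sample` is the five rows from earlier typed in by hand, and iterrows() is used because it mirrors the manual procedure, not because it is the fast way to do this.

```python
import pandas as pd

color_columns = ["RedCars", "SilverCars", "BlackCars", "BlueCars"]

# The five rows shown earlier, typed in by hand (not read from the CSV)
sample = pd.DataFrame({
    "RedCars": [902.0, 710.0, 248.0, 782.0, 278.0],
    "SilverCars": [650.0, 476.0, 912.0, 912.0, 354.0],
    "BlackCars": [754.0, 518.0, 606.0, 858.0, 482.0],
    "BlueCars": [792.0, 492.0, 350.0, 446.0, 752.0],
})

# Zack's question by hand: go across each row, add up, divide by 4
by_hand = []
for _, row in sample[color_columns].iterrows():
    by_hand.append(sum(row) / len(row))

# The same question asked of Pandas
pandas_result = sample[color_columns].mean(axis=1)

for manual, from_pandas in zip(by_hand, pandas_result):
    print(manual, from_pandas)
```

This time the result has one entry per row: with axis=1 the columns are what get collapsed.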

Summarizing the Findings

When you specify an axis for a function in Pandas, you are answering one of the following questions for it:

Should I (Pandas) start with a column and make this function do its job downward on all the “cells” for that column, and then continue doing the same thing for all the rest of the columns in the data frame? (axis=0)

Should I (Pandas) start with the first row of data in the data frame and make this function do its job horizontally on all of the “cells” for that row, and then continue doing the same thing for all the rest of the rows in the data frame? (axis=1)
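Coming back to the apply documentation quoted at the top, the same two questions can be tried out with apply directly. The sketch below assumes a tiny hand-typed data frame (three rows, two colors) and a made-up helper called `spread`, which is not part of Pandas; it just returns the largest value minus the smallest in whatever apply hands it.

```python
import pandas as pd

# Three rows and two colors from the sample, typed in by hand
sample = pd.DataFrame({
    "RedCars": [902.0, 710.0, 248.0],
    "SilverCars": [650.0, 476.0, 912.0],
})

def spread(values):
    """Hypothetical helper: largest minus smallest value in a Series."""
    return values.max() - values.min()

# axis=0: spread receives one whole column at a time
per_column = sample.apply(spread, axis=0)

# axis=1: spread receives one whole row at a time
per_row = sample.apply(spread, axis=1)

print(per_column)
print(per_row)

# The string names from the documentation should behave like 0 and 1
print(per_column.equals(sample.apply(spread, axis="index")))
print(per_row.equals(sample.apply(spread, axis="columns")))
```

Read this way, the documentation's “0 or ‘index’: apply function to each column” is the first question above, and “1 or ‘columns’: apply function to each row” is the second.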