Preppin' Data in Python #3: Loops, Pivots, Group Bys

Introduction to Preppin' Data Challenges

Preppin' Data Challenges are weekly exercises designed to help you develop your data preparation skills. While originally created for Tableau Prep, these challenges can also be completed using Python, SQL, R, dbt, EasyMorph, and other tools. By tackling these challenges, you gain hands-on experience in manipulating datasets, handling missing values, formatting dates, and structuring data for analysis.

In this post, we'll walk through a Python-based solution for the third Preppin' Data Challenge of 2021, breaking down each step to help beginners understand the process.

Here is a link to the Challenge: https://preppindata.blogspot.com/2021/01/2021-week-3.html

You can also find the solution on my GitHub: https://github.com/harveyjoyce/Preppin-Data-Challenges

This challenge focuses on combining multiple sheets from an Excel file, reshaping data using pivoting and splitting, and aggregating product sales. Here’s a breakdown of how we can accomplish these tasks in Python.

Step 1: Import Required Libraries

import os
import pandas as pd
import numpy as np
from pandas import concat, ExcelFile, melt, read_csv

These libraries help with data manipulation (pandas, numpy) and file handling (os).

Step 2: Load the Excel File

file_path = r"C:\Users\HarveyJoyce\Downloads\PDC_unprepped\PD 2021 Wk 3 Input.xlsx"
xls = pd.ExcelFile(file_path)

Since the dataset is an Excel file with multiple sheets, we use pd.ExcelFile to read all available sheets.

Step 3: Combine All Sheets into a Single DataFrame

dfIn = None
for sheet in xls.sheet_names:
    dfNew = xls.parse(sheet)
    dfNew['Store'] = sheet
    dfIn = dfNew if dfIn is None else concat([dfIn, dfNew], ignore_index=True)

Why Use a Loop?

The dataset consists of multiple sheets, each representing data for a different store.
Instead of manually loading each sheet one by one, we use a loop to iterate through all available sheets automatically.
This approach ensures scalability—if new sheets (stores) are added in the future, the code will still work without modifications.

Explanation:

Loops through each sheet name in the Excel file.
Reads each sheet into a temporary DataFrame (dfNew).
Adds a Store column, using the sheet name as the store identifier.
Appends all sheets together into dfIn. This ensures all store data is combined into a single dataset.

Step 4: Reshape Data Using `melt`

o1 = dfIn.melt(id_vars=['Date', 'Store'], 
        var_name='Customer Type - Product',
        value_name='Values')

Why Use `melt`?

The original dataset stores product sales for new and existing customers in separate columns.
melt transforms these columns into rows, making it easier to analyze and aggregate sales.

Explanation:

Pivoting (Melting): This process transforms data from a wide format to a long format.
The 'New' and 'Existing' columns are converted into rows under the Customer Type - Product column.
The corresponding sales numbers are stored in a new column called Values.

This restructuring makes it easier to analyze and aggregate data.

Step 5: Rename and Split Columns

o1.rename(columns={'Values':'Products Sold'}, inplace=True)
o1[['Customer Type', 'Product']] = o1['Customer Type - Product'].str.split(' - ', expand=True)

Explanation:

Renames Values to Products Sold for clarity.
Splits Customer Type - Product into two new columns: Customer Type and Product.
This transformation makes it easier to analyze product sales separately by customer type.

Step 6: Convert Dates to Quarters

o1['Date'] = pd.to_datetime(o1['Date'], format="%Y/%m/%d")
o1['Quarter'] = o1['Date'].dt.quarter

Explanation:

Converts Date from string format to datetime format.
Extracts the quarter (Q1, Q2, etc.) from the date, which is useful for time-based analysis.

Step 7: Aggregate Product Sales Data

1. Sales by Product and Quarter

output_1 = o1.groupby(
    ['Product', 'Quarter']
    ).agg(
        Product_Sold = ('Products Sold', 'sum'),
    ).reset_index()

Groups data by Product and Quarter.
Sums the total number of products sold within each quarter.

2. Sales by Store, Customer Type, and Product

output_2 = o1.groupby(
    ['Store', 'Customer Type','Product']
    ).agg(
        Product_Sold = ('Products Sold', 'sum'),
    ).reset_index()

Groups data by Store, Customer Type, and Product.
Sums up the total products sold per group.

Step 8: Save the Outputs

output1_path = r"C:\Users\HarveyJoyce\Downloads\PDC_prepped\PD 2021 Wk 3_output1.csv"
output2_path = r"C:\Users\HarveyJoyce\Downloads\PDC_prepped\PD 2021 Wk 3_output2.csv"
output_1.to_csv(output1_path, index=False)
output_2.to_csv(output2_path, index=False)

Saves the two aggregated datasets as CSV files for further analysis.

Author:

Harvey Joyce

View Profile

Preppin' Data in Python #3: Loops, Pivots, Group Bys

Introduction to Preppin' Data Challenges

Step 1: Import Required Libraries

Step 2: Load the Excel File

Step 3: Combine All Sheets into a Single DataFrame

Why Use a Loop?

Explanation:

Step 4: Reshape Data Using melt

Why Use melt?

Explanation:

Step 5: Rename and Split Columns

Explanation:

Step 6: Convert Dates to Quarters

Explanation:

Step 7: Aggregate Product Sales Data

1. Sales by Product and Quarter

2. Sales by Store, Customer Type, and Product

Step 8: Save the Outputs

Step 4: Reshape Data Using `melt`

Why Use `melt`?