Introduction to Preppin' Data Challenges
Preppin' Data Challenges are weekly exercises designed to help you develop your data preparation skills. While originally created for Tableau Prep, these challenges can also be completed using Python, SQL, R, dbt, EasyMorph, and other tools. By tackling these challenges, you gain hands-on experience in manipulating datasets, handling missing values, formatting dates, and structuring data for analysis.
In this post, we'll walk through a Python-based solution for the third Preppin' Data Challenge of 2021, breaking down each step to help beginners understand the process.
Here is a link to the Challenge: https://preppindata.blogspot.com/2021/01/2021-week-3.html
You can also find the solution on my GitHub: https://github.com/harveyjoyce/Preppin-Data-Challenges
This challenge focuses on combining multiple sheets from an Excel file, reshaping data using pivoting and splitting, and aggregating product sales. Here’s a breakdown of how we can accomplish these tasks in Python.
Step 1: Import Required Libraries
import os
import pandas as pd
import numpy as np
from pandas import concat, ExcelFile, melt, read_csv
These libraries help with data manipulation (pandas
, numpy
) and file handling (os
).
Step 2: Load the Excel File
file_path = r"C:\Users\HarveyJoyce\Downloads\PDC_unprepped\PD 2021 Wk 3 Input.xlsx"
xls = pd.ExcelFile(file_path)
Since the dataset is an Excel file with multiple sheets, we use pd.ExcelFile
to read all available sheets.
Step 3: Combine All Sheets into a Single DataFrame
dfIn = None
for sheet in xls.sheet_names:
dfNew = xls.parse(sheet)
dfNew['Store'] = sheet
dfIn = dfNew if dfIn is None else concat([dfIn, dfNew], ignore_index=True)
Why Use a Loop?
- The dataset consists of multiple sheets, each representing data for a different store.
- Instead of manually loading each sheet one by one, we use a loop to iterate through all available sheets automatically.
- This approach ensures scalability—if new sheets (stores) are added in the future, the code will still work without modifications.
Explanation:
- Loops through each sheet name in the Excel file.
- Reads each sheet into a temporary DataFrame (
dfNew
). - Adds a
Store
column, using the sheet name as the store identifier. - Appends all sheets together into
dfIn
. This ensures all store data is combined into a single dataset.
Step 4: Reshape Data Using melt
o1 = dfIn.melt(id_vars=['Date', 'Store'],
var_name='Customer Type - Product',
value_name='Values')
Why Use melt
?
- The original dataset stores product sales for new and existing customers in separate columns.
melt
transforms these columns into rows, making it easier to analyze and aggregate sales.
Explanation:
- Pivoting (Melting): This process transforms data from a wide format to a long format.
- The 'New' and 'Existing' columns are converted into rows under the
Customer Type - Product
column. - The corresponding sales numbers are stored in a new column called
Values
.
This restructuring makes it easier to analyze and aggregate data.
Step 5: Rename and Split Columns
o1.rename(columns={'Values':'Products Sold'}, inplace=True)
o1[['Customer Type', 'Product']] = o1['Customer Type - Product'].str.split(' - ', expand=True)
Explanation:
- Renames
Values
toProducts Sold
for clarity. - Splits
Customer Type - Product
into two new columns: Customer Type and Product. - This transformation makes it easier to analyze product sales separately by customer type.
Step 6: Convert Dates to Quarters
o1['Date'] = pd.to_datetime(o1['Date'], format="%Y/%m/%d")
o1['Quarter'] = o1['Date'].dt.quarter
Explanation:
- Converts
Date
from string format to datetime format. - Extracts the quarter (Q1, Q2, etc.) from the date, which is useful for time-based analysis.
Step 7: Aggregate Product Sales Data
1. Sales by Product and Quarter
output_1 = o1.groupby(
['Product', 'Quarter']
).agg(
Product_Sold = ('Products Sold', 'sum'),
).reset_index()
- Groups data by Product and Quarter.
- Sums the total number of products sold within each quarter.
2. Sales by Store, Customer Type, and Product
output_2 = o1.groupby(
['Store', 'Customer Type','Product']
).agg(
Product_Sold = ('Products Sold', 'sum'),
).reset_index()
- Groups data by Store, Customer Type, and Product.
- Sums up the total products sold per group.
Step 8: Save the Outputs
output1_path = r"C:\Users\HarveyJoyce\Downloads\PDC_prepped\PD 2021 Wk 3_output1.csv"
output2_path = r"C:\Users\HarveyJoyce\Downloads\PDC_prepped\PD 2021 Wk 3_output2.csv"
output_1.to_csv(output1_path, index=False)
output_2.to_csv(output2_path, index=False)
- Saves the two aggregated datasets as CSV files for further analysis.