Introduction to Preppin' Data Challenges
Preppin' Data Challenges are weekly exercises designed to help you develop your data preparation skills. While originally created for Tableau Prep, these challenges can also be completed using Python, SQL, R, dbt, EasyMorph, and other tools. By tackling these challenges, you gain hands-on experience in manipulating datasets, handling missing values, formatting dates, and structuring data for analysis.
In this post, we'll walk through a Python-based solution for the second Preppin' Data Challenge of 2021, breaking down each step to help beginners understand the process.
Here is a link to the Challenge: https://preppindata.blogspot.com/2021/01/2021-week-2.html
You can also find the solution on my GitHub: https://github.com/harveyjoyce/Preppin-Data-Challenges
The goal of this challenge is to:
- Extract the brand name from the model
- Calculate order values
- Aggregate sales data by brand and bike type
- Compute shipping time statistics
- Save the transformed datasets
Step 1: Importing Required Libraries
import os
import pandas as pd
import numpy as np
import re
Step 2: Loading the Dataset
file_path = r"C:\Users\HarveyJoyce\Downloads\PDC_unprepped\PD 2021 Wk 2 Input - Bike Model Sales (3).csv"
df = pd.read_csv(file_path)
Step 3: Extracting the Brand Name
df['Brand'] = df['Model'].str.extract(r'([a-zA-Z]+)')
Explanation:
- Uses regular expressions (regex) to extract only the letters from the
Model
column, which represents the brand name.
Step 4: Calculating Order Value
df['Order Value'] = df['Quantity'] * df['Value per Bike']
Explanation:
- Multiplies
Quantity
byValue per Bike
to compute the totalOrder Value
for each order.
Step 5: Aggregating Sales Data by Brand and Bike Type
o1 = df.groupby(['Brand', 'Bike Type']).agg(
Quantity_Sold=('Quantity', 'sum'),
Order_Value=('Order Value', 'sum'),
Avg_Bike_Value_per_Brand_Type=('Value per Bike', 'mean')
).reset_index()
o1['Avg_Bike_Value_per_Brand_Type'] = o1['Avg_Bike_Value_per_Brand_Type'].round(1)
Explanation:
- Groups the data by
Brand
andBike Type
. - Sums up the
Quantity
andOrder Value
for each brand and type. - Calculates the average
Value per Bike
for each group. - Rounds the average bike value to one decimal place for better readability.
Step 6: Calculating Days to Ship
df['Order Date'] = pd.to_datetime(df['Order Date'], format="%d/%m/%Y")
df['Shipping Date'] = pd.to_datetime(df['Shipping Date'], format="%d/%m/%Y")
df['Days to Ship'] = (df['Shipping Date'] - df['Order Date']).dt.days
Explanation:
- Converts
Order Date
andShipping Date
columns from text format to datetime format. - Computes the difference between
Shipping Date
andOrder Date
to determine the number of Days to Ship.
Step 7: Aggregating Shipping Data by Brand and Store
o2 = df.groupby(['Brand', 'Store']).agg(
Total_Quantity_Sold=('Quantity', 'sum'),
Total_Order_Value=('Order Value', 'sum'),
Avg_Days_to_Ship=('Days to Ship', 'mean')
).reset_index()
o2['Avg_Days_to_Ship'] = o2['Avg_Days_to_Ship'].round(1)
Explanation:
- Groups the data by
Brand
andStore
. - Sums up the total quantity sold and total order value.
- Computes the average days to ship for each brand at each store.
- Rounds the average shipping time to one decimal place to improve clarity.
Step 8: Saving the Aggregated Data
output1_path = r"C:\Users\HarveyJoyce\Downloads\PDC_prepped\PD 2021 Wk 2_output1.csv"
output2_path = r"C:\Users\HarveyJoyce\Downloads\PDC_prepped\PD 2021 Wk 2_output2.csv"
o1.to_csv(output1_path, index=False)
o2.to_csv(output2_path, index=False)
Explanation:
- Defines file paths for the two output CSV files.
- Saves the aggregated sales data (
o1
) and aggregated shipping data (o2
) as separate CSV files.