Preppin' Data in Python #2: Aggregating, Regex and Date Diffs

Introduction to Preppin' Data Challenges

Preppin' Data Challenges are weekly exercises designed to help you develop your data preparation skills. While originally created for Tableau Prep, these challenges can also be completed using Python, SQL, R, dbt, EasyMorph, and other tools. By tackling these challenges, you gain hands-on experience in manipulating datasets, handling missing values, formatting dates, and structuring data for analysis.

In this post, we'll walk through a Python-based solution for the second Preppin' Data Challenge of 2021, breaking down each step to help beginners understand the process.

Here is a link to the Challenge: https://preppindata.blogspot.com/2021/01/2021-week-2.html

You can also find the solution on my GitHub: https://github.com/harveyjoyce/Preppin-Data-Challenges

The goal of this challenge is to:

Extract the brand name from the model
Calculate order values
Aggregate sales data by brand and bike type
Compute shipping time statistics
Save the transformed datasets

Step 1: Importing Required Libraries

import os
import pandas as pd
import numpy as np
import re

Step 2: Loading the Dataset

file_path = r"C:\Users\HarveyJoyce\Downloads\PDC_unprepped\PD 2021 Wk 2 Input - Bike Model Sales (3).csv"
df = pd.read_csv(file_path)

Step 3: Extracting the Brand Name

df['Brand'] = df['Model'].str.extract(r'([a-zA-Z]+)')

Explanation:

Uses regular expressions (regex) to extract only the letters from the Model column, which represents the brand name.

Step 4: Calculating Order Value

df['Order Value'] = df['Quantity'] * df['Value per Bike']

Explanation:

Multiplies Quantity by Value per Bike to compute the total Order Value for each order.

Step 5: Aggregating Sales Data by Brand and Bike Type

o1 = df.groupby(['Brand', 'Bike Type']).agg(
    Quantity_Sold=('Quantity', 'sum'),
    Order_Value=('Order Value', 'sum'),
    Avg_Bike_Value_per_Brand_Type=('Value per Bike', 'mean')
).reset_index()
o1['Avg_Bike_Value_per_Brand_Type'] = o1['Avg_Bike_Value_per_Brand_Type'].round(1)

Explanation:

Groups the data by Brand and Bike Type.
Sums up the Quantity and Order Value for each brand and type.
Calculates the average Value per Bike for each group.
Rounds the average bike value to one decimal place for better readability.

Step 6: Calculating Days to Ship

df['Order Date'] = pd.to_datetime(df['Order Date'], format="%d/%m/%Y")
df['Shipping Date'] = pd.to_datetime(df['Shipping Date'], format="%d/%m/%Y")
df['Days to Ship'] = (df['Shipping Date'] - df['Order Date']).dt.days

Explanation:

Converts Order Date and Shipping Date columns from text format to datetime format.
Computes the difference between Shipping Date and Order Date to determine the number of Days to Ship.

Step 7: Aggregating Shipping Data by Brand and Store

o2 = df.groupby(['Brand', 'Store']).agg(
    Total_Quantity_Sold=('Quantity', 'sum'),
    Total_Order_Value=('Order Value', 'sum'),
    Avg_Days_to_Ship=('Days to Ship', 'mean')
).reset_index()
o2['Avg_Days_to_Ship'] = o2['Avg_Days_to_Ship'].round(1)

Explanation:

Groups the data by Brand and Store.
Sums up the total quantity sold and total order value.
Computes the average days to ship for each brand at each store.
Rounds the average shipping time to one decimal place to improve clarity.

Step 8: Saving the Aggregated Data

output1_path = r"C:\Users\HarveyJoyce\Downloads\PDC_prepped\PD 2021 Wk 2_output1.csv"
output2_path = r"C:\Users\HarveyJoyce\Downloads\PDC_prepped\PD 2021 Wk 2_output2.csv"
o1.to_csv(output1_path, index=False)
o2.to_csv(output2_path, index=False)

Explanation:

Defines file paths for the two output CSV files.
Saves the aggregated sales data (o1) and aggregated shipping data (o2) as separate CSV files.

Author:

Harvey Joyce

View Profile