Introduction to Preppin' Data Challenges
Data preparation is a crucial skill for anyone working with data, whether in analytics, machine learning, or business intelligence. Raw data often needs cleaning, restructuring, and transformation before it can be analyzed effectively. This is where Preppin' Data Challenges come in.
Preppin' Data Challenges are weekly exercises designed to help you develop your data preparation skills. While originally created for Tableau Prep, these challenges can also be completed using Python, SQL, R, dbt, EasyMorph, and other tools. By tackling these challenges, you gain hands-on experience in manipulating datasets, handling missing values, formatting dates, and structuring data for analysis.
In this post, we'll walk through a Python-based solution for the first Preppin' Data Challenge of 2021, breaking down each step to help beginners understand the process.
Here is a link to the Challenge: https://preppindata.blogspot.com/2021/01/2021-week-1.html
You can also find the solution on my GitHub: https://github.com/harveyjoyce/Preppin-Data-Challenges
The Challenge: Cleaning and Transforming Bike Sales Data
The goal of this challenge is to:
- Split a combined column into separate fields
- Standardise data values
- Extract useful date components
- Remove unnecessary data
- Save the cleaned dataset for further analysis
Let's break down the Python script step by step.
Step 1: Importing Required Libraries
import os
import pandas as pd
import numpy as np
Explanation:
os
allows interaction with the operating system (e.g., file paths).pandas as pd
is used for handling tabular data (like spreadsheets).numpy as np
helps with numerical operations and conditional replacements.
Step 2: Loading the Dataset
file_path = r"C:\Users\HarveyJoyce\Downloads\PDC_unprepped\PD 2021 Wk 1 Input - Bike Sales (2).csv"
df = pd.read_csv(file_path)
Explanation:
- Defines the file path for the input dataset.
- Uses
pd.read_csv(file_path)
to load the CSV file into a Pandas DataFrame.
Step 3: Splitting the 'Store - Bike' Column
df[['Store', 'Bike']] = df['Store - Bike'].str.split(' - ', expand=True)
Explanation:
- The
Store - Bike
column contains combined values (e.g.,"London - Mountain"
). .str.split(' - ', expand=True)
separates it into two new columns:Store
andBike
.
Step 4: Cleaning and Standardizing the 'Bike' Column
df['Bike'] = df['Bike'].str.lower() # Convert to lowercase
df['Bike'] = df['Bike'].str[0] # Keep only the first letter (m, r, g)
df['Bike'] = np.where(df['Bike']=='m', 'Mountain',
np.where(df['Bike']=='r', 'Road', 'Gravel')) # Map letters to full names
Explanation:
- Converts all values in the
Bike
column to lowercase for consistency. - Extracts the first letter of each value (
m
,r
, org
). - Uses
np.where()
to replace the letters with full names (Mountain
,Road
,Gravel
).
Step 5: Extracting Date Information
df['Date'] = pd.to_datetime(df['Date']) # Convert string to datetime
df['Quarter'] = df['Date'].dt.quarter # Extract the quarter (1, 2, 3, or 4)
df['Day of Month'] = df['Date'].dt.day # Extract the day of the month
Explanation:
- Converts the
Date
column from text to datetime format. - Extracts the quarter of the year.
- Extracts the day of the month for each order.
Step 6: Removing Unnecessary Columns
df = df.drop(['Store - Bike', 'Date'], axis=1)
Explanation:
- Removes
Store - Bike
(since it's now split) andDate
(since we've extracted useful parts).
Step 7: Filtering Out the First 10 Orders
df = df[df['Order ID'] >= 11]
Explanation:
- Keeps only rows where
Order ID
is 11 or higher, removing the first 10 orders.
Step 8: Saving the Cleaned Dataset
output_path = r"C:\Users\HarveyJoyce\Downloads\PDC_prepped\PD 2021 Wk 1 output.csv"
df.to_csv(output_path, index=False)
Explanation:
- Defines a new file path to save the cleaned data.
- Uses
df.to_csv(output_path, index=False)
to write the DataFrame to a CSV file. index=False
ensures that an extra column for row numbers is not added.
Final Thoughts
By following this process, we've successfully transformed raw bike sales data into a structured format, ready for analysis. This challenge covered essential data preparation skills, including:
- Splitting columns
- Standardising values
- Extracting date information
- Filtering and cleaning data
- Saving the output for future use