Preppin' Data in Python #1: Cleaning, Filtering and Dates

Introduction to Preppin' Data Challenges

Data preparation is a crucial skill for anyone working with data, whether in analytics, machine learning, or business intelligence. Raw data often needs cleaning, restructuring, and transformation before it can be analyzed effectively. This is where Preppin' Data Challenges come in.

Preppin' Data Challenges are weekly exercises designed to help you develop your data preparation skills. While originally created for Tableau Prep, these challenges can also be completed using Python, SQL, R, dbt, EasyMorph, and other tools. By tackling these challenges, you gain hands-on experience in manipulating datasets, handling missing values, formatting dates, and structuring data for analysis.

In this post, we'll walk through a Python-based solution for the first Preppin' Data Challenge of 2021, breaking down each step to help beginners understand the process.

Here is a link to the Challenge: https://preppindata.blogspot.com/2021/01/2021-week-1.html

You can also find the solution on my GitHub: https://github.com/harveyjoyce/Preppin-Data-Challenges

The Challenge: Cleaning and Transforming Bike Sales Data

The goal of this challenge is to:

Split a combined column into separate fields
Standardise data values
Extract useful date components
Remove unnecessary data
Save the cleaned dataset for further analysis

Let's break down the Python script step by step.

Step 1: Importing Required Libraries

import os
import pandas as pd
import numpy as np

Explanation:

os allows interaction with the operating system (e.g., file paths).
pandas as pd is used for handling tabular data (like spreadsheets).
numpy as np helps with numerical operations and conditional replacements.

Step 2: Loading the Dataset

file_path = r"C:\Users\HarveyJoyce\Downloads\PDC_unprepped\PD 2021 Wk 1 Input - Bike Sales (2).csv"
df = pd.read_csv(file_path)

Explanation:

Defines the file path for the input dataset.
Uses pd.read_csv(file_path) to load the CSV file into a Pandas DataFrame.

Step 3: Splitting the 'Store - Bike' Column

df[['Store', 'Bike']] = df['Store - Bike'].str.split(' - ', expand=True)

Explanation:

The Store - Bike column contains combined values (e.g., "London - Mountain").
.str.split(' - ', expand=True) separates it into two new columns: Store and Bike.

Step 4: Cleaning and Standardizing the 'Bike' Column

df['Bike'] = df['Bike'].str.lower()  # Convert to lowercase
df['Bike'] = df['Bike'].str[0]  # Keep only the first letter (m, r, g)
df['Bike'] = np.where(df['Bike']=='m', 'Mountain',
              np.where(df['Bike']=='r', 'Road', 'Gravel'))  # Map letters to full names

Explanation:

Converts all values in the Bike column to lowercase for consistency.
Extracts the first letter of each value (m, r, or g).
Uses np.where() to replace the letters with full names (Mountain, Road, Gravel).

Step 5: Extracting Date Information

df['Date'] = pd.to_datetime(df['Date'])  # Convert string to datetime
df['Quarter'] = df['Date'].dt.quarter  # Extract the quarter (1, 2, 3, or 4)
df['Day of Month'] = df['Date'].dt.day  # Extract the day of the month

Explanation:

Converts the Date column from text to datetime format.
Extracts the quarter of the year.
Extracts the day of the month for each order.

Step 6: Removing Unnecessary Columns

df = df.drop(['Store - Bike', 'Date'], axis=1)

Explanation:

Removes Store - Bike (since it's now split) and Date (since we've extracted useful parts).

Step 7: Filtering Out the First 10 Orders

df = df[df['Order ID'] >= 11]

Explanation:

Keeps only rows where Order ID is 11 or higher, removing the first 10 orders.

Step 8: Saving the Cleaned Dataset

output_path = r"C:\Users\HarveyJoyce\Downloads\PDC_prepped\PD 2021 Wk 1 output.csv"
df.to_csv(output_path, index=False)

Explanation:

Defines a new file path to save the cleaned data.
Uses df.to_csv(output_path, index=False) to write the DataFrame to a CSV file.
index=False ensures that an extra column for row numbers is not added.

Final Thoughts

By following this process, we've successfully transformed raw bike sales data into a structured format, ready for analysis. This challenge covered essential data preparation skills, including:

Splitting columns
Standardising values
Extracting date information
Filtering and cleaning data
Saving the output for future use

Author:

Harvey Joyce

View Profile