Preppin' Data in Python #5: Dealing with Duplicates

Preppin' Data Challenges are weekly exercises designed to help you develop your data preparation skills. While originally created for Tableau Prep, these challenges can also be completed using Python, SQL, R, dbt, EasyMorph, and other tools. By tackling these challenges, you gain hands-on experience in manipulating datasets, handling missing values, formatting dates, and structuring data for analysis.

In this post, we'll walk through a Python-based solution for the fifth Preppin' Data Challenge of 2021, breaking down each step to help beginners understand the process.

Here is a link to the Challenge: https://preppindata.blogspot.com/2021/02/2021-week-5-dealing-with-duplication.html

You can also find the solution on my GitHub: https://github.com/harveyjoyce/Preppin-Data-Challenges

🧹 What Does the Script Do?

Read the CSV file containing client data.
Identify the most recent account manager for each client.
Remove duplicates and keep the latest record.
Combine the cleaned data with training information.
Export the final cleaned dataset to a CSV file.

Let's break it down!

1. Importing Libraries

import os
import pandas as pd
import numpy as np
import re

We start by importing the necessary libraries:

os: For working with file paths.
pandas: Essential for data manipulation.
numpy: Used for numerical operations (though not directly used in this script).
re: Provides support for regular expressions (not used here but useful for future data cleaning).

2. Load the CSV File

file_path = r"C:\Users\HarveyJoyce\Downloads\PDC_unprepped\PD 2021 Wk 5 Input.csv"

df = pd.read_csv(file_path)

We define the file path and use pd.read_csv() to load the data from the CSV file into a DataFrame.

3. Convert Date and Add Helper Column

df['From Date'] = pd.to_datetime(df['From Date'], format="%d/%m/%Y")
df['1'] = 1

Explanation:

Convert Date: We convert the 'From Date' column to a datetime object using pd.to_datetime() to ensure proper handling of dates.
Helper Column: We add a new column called '1' with a constant value of 1. This is a simple way to count rows during aggregation.

4. Identify the Most Recent Account Manager

am = df.groupby(
    ['Client', 'Client ID', 'Account Manager', 'From Date']
).agg(
    Count=('1', 'sum'),
).reset_index()

am = am.drop(['Count'], axis=1)

What is Happening?

Group by: We group the data by 'Client', 'Client ID', 'Account Manager', and 'From Date'.
Aggregation: We count the rows (using the helper column '1').
Reset Index: Converts the grouped object back to a DataFrame.
Drop Column: We no longer need the 'Count' column, so we remove it using drop().

This gives us a list of unique account managers and their corresponding dates.

5. Find the Most Recent Date for Each Client

mxdate = am.groupby(
    ['Client']
).agg(
    Max_Date=('From Date', 'max'),
).reset_index()

Group by Client: We want the most recent account manager, so we only group by 'Client'.
Find Maximum Date: We use .agg() to find the maximum date for each client.

This DataFrame now holds the most recent date for each client.

6. Keep Training Data Separate

tr = df.groupby(
    ['Training', 'Contact Email', 'Contact Name', 'Client']
).agg(
    Count=('1', 'sum'),
).reset_index()

tr = tr.drop(['Count'], axis=1)

We repeat a similar process to isolate the training data:

Group by Relevant Columns: We group the data by columns related to training.
Count Rows: We count the occurrences using the helper column '1'.
Drop Helper Column: Remove the temporary 'Count' column.

This keeps the training information separate from account manager data.

7. Filter to the Most Recent Account Manager

merged_am = pd.merge(am, mxdate, left_on=['Client', 'From Date'], right_on=['Client', 'Max_Date'], how='inner')

Here, we merge the account manager data (am) with the most recent date (mxdate).

left_on and right_on: These parameters align the two DataFrames on 'Client' and 'From Date'.
how='inner': Ensures we only keep rows where there is a match in both DataFrames, filtering to the most recent records.

8. Combine the Cleaned Data

output = pd.merge(tr, merged_am, left_on=['Client'], right_on=['Client'], how='inner')
output = output.drop(['Max_Date'], axis=1)

Merge Training and Account Manager Data: We join the cleaned account manager data with the training data.
Drop the Max Date Column: Since we no longer need 'Max_Date', we remove it.

The resulting DataFrame now contains:

Training details
The most recent account manager for each client

9. Export the Cleaned Data

output_path = r"C:\Users\HarveyJoyce\Downloads\PDC_prepped\PD 2021 Wk 5 output.csv"

output.to_csv(output_path, index=False)

print(output)

We save the cleaned DataFrame to a CSV file and print the output to verify our work.

to_csv(): Writes the DataFrame to a CSV file.
index=False: Excludes the index column from the output.

📊 Final Thoughts

This script tackles a common data preparation task: handling duplicate records and keeping only the most recent entries. Here's a quick recap of the workflow:

Load the raw data.
Clean and deduplicate by identifying the latest record for each client.
Join the cleaned data with other relevant information (training details).
Export the final cleaned dataset for analysis.

Author:

Harvey Joyce

View Profile