Excel is still the go-to tool for storing and sharing data in many industries. But it’s not perfect—missing values, inconsistent formatting, and duplicate rows are common headaches.


That’s where Python and Pandas come in. With just a few lines of code, you can clean and transform Excel files faster and more reliably than doing it manually.

In this guide, we’ll walk through a realistic Excel cleaning workflow using Python.


Prerequisites

To follow along, make sure you have Python installed and the following packages:


pip install pandas openpyxl


Step 1: Load the Excel File

Let’s say you have a messy Excel file like this:


import pandas as pd

# Load Excel file (first sheet by default)
df = pd.read_excel("raw_data.xlsx", engine='openpyxl')

# Preview the top rows
print(df.head())


Step 2: Remove Unwanted Columns

Often, Excel sheets contain extra columns that aren’t needed for analysis.


# Drop irrelevant columns
df.drop(columns=["Unnamed: 0", "Notes"], inplace=True)

print(df.columns)


Step 3: Rename Columns for Consistency

Use snake_case or camelCase for better readability and consistency.


df.rename(columns={
"Full Name": "full_name",
"Phone Number": "phone",
"Join Date": "join_date"
}, inplace=True)


Step 4: Handle Missing Data

Drop rows with too many missing values:


df.dropna(thresh=3, inplace=True) # keep rows with at least 3 non-null values

df["email"].fillna("not_provided@example.com", inplace=True)
df["age"].fillna(df["age"].mean(), inplace=True) # fill numeric column with average


Step 5: Remove Duplicates

Duplicate entries can mess up reporting or lead to double-counting.


df.drop_duplicates(inplace=True)

df.drop_duplicates(subset=["email"], inplace=True)


Step 6: Fix Data Types and Formats

Make sure columns like dates and numbers are properly typed.


# Convert join_date to datetime
df["join_date"] = pd.to_datetime(df["join_date"], errors='coerce')

# Convert age to integer
df["age"] = df["age"].astype("Int64")


Step 7: Apply Custom Cleaning Rules

Trim whitespace, standardize text formats, and more:


# Remove extra whitespace and lowercase names
df["full_name"] = df["full_name"].str.strip().str.title()

# Ensure emails are lowercase
df["email"] = df["email"].str.strip().str.lower()


Step 8: Export Clean Data to a New Excel File

Save the cleaned DataFrame to a new Excel file:


df.to_excel("cleaned_data.xlsx", index=False, engine='openpyxl')


Cleaning Excel files manually is slow and repetitive—but with Python and Pandas, it becomes fast, consistent, and scalable.

Tags: python
Author

Farhad Uddin

Blogger