Data Pre-processing: From Raw Data to Machine Learning Ready

by Sajjad Ahmed Niloy

Our Sample Dataset: `employees.csv`

Throughout this guide, we'll use a simple dataset of employees. Our goal is to prepare this data to predict whether an employee purchased a product based on their age, city, and salary.

1. Data Reading

Loading data into a DataFrame.

Analogy: Think of this as opening a spreadsheet in your program. Pandas reads the file and organizes it into a DataFrame—a smart table that's easy to work with.

```python
import pandas as pd
from io import StringIO

# For demonstration, we'll simulate a CSV file in memory.
csv_data = """EmployeeID,Age,City,Salary,PurchasedProduct
101,25,New York,60000,Yes
102,45,London,90000,No
103,31,Tokyo,72000,Yes
104,22,London,55000,No
105,53,New York,120000,Yes
106,38,Tokyo,85000,Yes
"""

# Read the data into a DataFrame
df = pd.read_csv(StringIO(csv_data))
print(df)
```
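If the data already lives on disk, the load is a one-liner. Here's a minimal sketch, assuming a file named `employees.csv` sits next to your script; the `info()` call is just a quick sanity check on row counts and column types:

```python
import pandas as pd

# Load the same table from a file on disk instead of an in-memory string.
df = pd.read_csv('employees.csv')

# Quick sanity check: row count, column dtypes, and non-null counts.
df.info()
```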

2. Data Type Handling

Ensuring columns have the correct format.

Analogy: This is like formatting cells in Excel. We need to make sure numbers are treated as numbers and text as text. `EmployeeID` is a unique identifier, not a mathematical quantity, so we'll treat it as text (`object`) to avoid accidental calculations with it.

```python
# Check the initial data types
print(df.dtypes)

# EmployeeID is a numeric ID, not a quantity. Let's change its type.
df['EmployeeID'] = df['EmployeeID'].astype('object')

print("\nData types after correction:")
print(df.dtypes)
```
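If you want to be even more explicit that `City` holds a fixed set of labels, pandas offers a `category` dtype. This is an optional aside, not part of the original pipeline; the sketch below leaves `df` untouched so the later steps behave exactly the same:

```python
# A pandas 'category' dtype documents that a column holds a fixed set of
# labels and saves memory on large datasets. Computed here without
# overwriting df, so the rest of the guide is unaffected.
city_as_cat = df['City'].astype('category')
print(city_as_cat.dtype)            # category
print(city_as_cat.cat.categories)   # Index(['London', 'New York', 'Tokyo'], dtype='object')
```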

3. Encoding with Scikit-learn

Translating text into numbers.

Analogy: A computer doesn't understand "Tokyo". Encoding translates categories into a numerical language it can process. We use One-Hot Encoding for the `City` column to create separate "yes/no" columns for each city, preventing the model from thinking `Tokyo` > `London`.

```python
from sklearn.preprocessing import OneHotEncoder

# Convert target 'Yes'/'No' to 1/0 first
df['PurchasedProduct'] = df['PurchasedProduct'].map({'Yes': 1, 'No': 0})

# One-Hot Encode the 'City' column
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
city_encoded = ohe.fit_transform(df[['City']])
city_df = pd.DataFrame(city_encoded, columns=ohe.get_feature_names_out())

# Join the new columns and drop the original 'City' and 'EmployeeID'
df_processed = df.drop(['City', 'EmployeeID'], axis=1).join(city_df)
print(df_processed)
```
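It's worth seeing what `handle_unknown='ignore'` actually buys us. In the sketch below, which continues from the block above, `'Paris'` is a made-up city the encoder never saw during fitting; it encodes as all zeros instead of raising an error:

```python
# The encoder was fit on London, New York, and Tokyo only.
new_rows = pd.DataFrame({'City': ['Tokyo', 'Paris']})
print(ohe.transform(new_rows))
# [[0. 0. 1.]   <- Tokyo lights up its own one-hot column
#  [0. 0. 0.]]  <- Paris is unknown, so every city column stays 0
```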

4. Scaling Numerical Data

Bringing features to the same scale.

Analogy: Imagine measuring one person's height in feet and another's in miles. The numbers are vastly different! Scaling converts `Age` and `Salary` so they are on a similar scale, ensuring the `Salary` column doesn't unfairly dominate the model's logic just because its numbers are bigger.

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
numerical_cols = ['Age', 'Salary']
df_processed[numerical_cols] = scaler.fit_transform(df_processed[numerical_cols])
print(df_processed.head())
```
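Under the hood, `StandardScaler` applies the standardization formula z = (x − μ) / σ to each column. A quick hand check on the `Age` values from our CSV reproduces the scaled column exactly:

```python
import numpy as np

# StandardScaler computes z = (x - mean) / std per column.
ages = np.array([25, 45, 31, 22, 53, 38])
mu, sigma = ages.mean(), ages.std()  # ddof=0, the population std StandardScaler uses
print((ages - mu) / sigma)           # matches df_processed['Age'] after scaling
```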

5. Splitting the Data

Preparing for training and testing.

Analogy: This is like creating a "practice exam" and a "final exam" from a textbook. We use the bigger part (the training set) to teach the model. Then, we use the smaller part it has never seen (the testing set) to give it a fair final exam and see how well it truly learned.

```python
from sklearn.model_selection import train_test_split

# 'X' contains all our feature columns (predictors)
X = df_processed.drop('PurchasedProduct', axis=1)

# 'y' is the target column we want to predict
y = df_processed['PurchasedProduct']

# Split the data: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Training Features Shape:", X_train.shape)
print("Testing Features Shape:", X_test.shape)
```
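With only six rows and an imbalanced 4-to-2 target, a purely random split can easily put both "No" examples on the same side. Passing `stratify=y` (an option the call above doesn't use) keeps the Yes/No ratio similar in both splits:

```python
# Same split, but stratified on the target so class proportions are preserved.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_train.value_counts())
print(y_test.value_counts())
```

One last note for real projects: fit the scaler on the training set only and merely transform the test set, so statistics from the test data never leak into training.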