Data Pre-processing: From Raw Data to Machine Learning Ready

by Sajjad Ahmed Niloy

Our Sample Dataset: `employees.csv`

Throughout this guide, we'll use a simple dataset of employees. Our goal is to prepare this data to predict whether an employee purchased a product based on their age, city, and salary.

1. Data Reading

Loading data into a DataFrame.

Analogy: Think of this as opening a spreadsheet in your program. Pandas reads the file and organizes it into a DataFrame—a smart table that's easy to work with.

```python
import pandas as pd
from io import StringIO

# For demonstration, we'll simulate a CSV file in memory.
csv_data = """EmployeeID,Age,City,Salary,PurchasedProduct
101,25,New York,60000,Yes
102,45,London,90000,No
103,31,Tokyo,72000,Yes
104,22,London,55000,No
105,53,New York,120000,Yes
106,38,Tokyo,85000,Yes
"""

# Read the data into a DataFrame
df = pd.read_csv(StringIO(csv_data))
print(df)
```
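If the data already lives on disk, the load is a one-liner. Here's a minimal sketch, assuming a file named `employees.csv` sits next to your script; the `info()` call is just a quick sanity check on row counts and column types:

```python
import pandas as pd

# Load the same table from a file on disk instead of an in-memory string.
df = pd.read_csv('employees.csv')

# Quick sanity check: row count, column dtypes, and non-null counts.
df.info()
```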

2. Data Type Handling

Ensuring columns have the correct format.

Analogy: This is like formatting cells in Excel. We need to make sure numbers are treated as numbers and text as text. `EmployeeID` is a unique identifier, not a mathematical quantity, so we'll treat it as text (`object`) to avoid accidental calculations with it.

```python
# Check the initial data types
print(df.dtypes)

# EmployeeID is a numeric ID, not a quantity. Let's change its type.
df['EmployeeID'] = df['EmployeeID'].astype('object')

print("\nData types after correction:")
print(df.dtypes)
```
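If you want to be even more explicit that `City` holds a fixed set of labels, pandas offers a `category` dtype. This is an optional aside, not part of the original pipeline; the sketch below leaves `df` untouched so the later steps behave exactly the same:

```python
# A pandas 'category' dtype documents that a column holds a fixed set of
# labels and saves memory on large datasets. Computed here without
# overwriting df, so the rest of the guide is unaffected.
city_as_cat = df['City'].astype('category')
print(city_as_cat.dtype)            # category
print(city_as_cat.cat.categories)   # Index(['London', 'New York', 'Tokyo'], dtype='object')
```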

3. Encoding with Scikit-learn

Translating text into numbers.

Analogy: A computer doesn't understand "Tokyo". Encoding translates categories into a numerical language it can process. We use One-Hot Encoding for the `City` column to create separate "yes/no" columns for each city, preventing the model from thinking `Tokyo` > `London`.

```python
from sklearn.preprocessing import OneHotEncoder

# Convert target 'Yes'/'No' to 1/0 first
df['PurchasedProduct'] = df['PurchasedProduct'].map({'Yes': 1, 'No': 0})

# One-Hot Encode the 'City' column
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
city_encoded = ohe.fit_transform(df[['City']])
city_df = pd.DataFrame(city_encoded, columns=ohe.get_feature_names_out())

# Join the new columns and drop the original 'City' and 'EmployeeID'
df_processed = df.drop(['City', 'EmployeeID'], axis=1).join(city_df)
print(df_processed)
```
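It's worth seeing what `handle_unknown='ignore'` actually buys us. In the sketch below, which continues from the block above, `'Paris'` is a made-up city the encoder never saw during fitting; it encodes as all zeros instead of raising an error:

```python
# The encoder was fit on London, New York, and Tokyo only.
new_rows = pd.DataFrame({'City': ['Tokyo', 'Paris']})
print(ohe.transform(new_rows))
# [[0. 0. 1.]   <- Tokyo lights up its own one-hot column
#  [0. 0. 0.]]  <- Paris is unknown, so every city column stays 0
```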

4. Scaling Numerical Data

Bringing features to the same scale.

Analogy: Imagine measuring one person's height in feet and another's in miles. The numbers are vastly different! Scaling converts `Age` and `Salary` so they are on a similar scale, ensuring the `Salary` column doesn't unfairly dominate the model's logic just because its numbers are bigger.

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
numerical_cols = ['Age', 'Salary']
df_processed[numerical_cols] = scaler.fit_transform(df_processed[numerical_cols])
print(df_processed.head())
```
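Under the hood, `StandardScaler` applies the standardization formula z = (x − μ) / σ to each column. A quick hand check on the `Age` values from our CSV reproduces the scaled column exactly:

```python
import numpy as np

# StandardScaler computes z = (x - mean) / std per column.
ages = np.array([25, 45, 31, 22, 53, 38])
mu, sigma = ages.mean(), ages.std()  # ddof=0, the population std StandardScaler uses
print((ages - mu) / sigma)           # matches df_processed['Age'] after scaling
```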

5. Splitting the Data

Preparing for training and testing.

Analogy: This is like creating a "practice exam" and a "final exam" from a textbook. We use the bigger part (the training set) to teach the model. Then, we use the smaller part it has never seen (the testing set) to give it a fair final exam and see how well it truly learned.

```python
from sklearn.model_selection import train_test_split

# 'X' contains all our feature columns (predictors)
X = df_processed.drop('PurchasedProduct', axis=1)

# 'y' is the target column we want to predict
y = df_processed['PurchasedProduct']

# Split the data: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Training Features Shape:", X_train.shape)
print("Testing Features Shape:", X_test.shape)
```
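With only six rows and an imbalanced 4-to-2 target, a purely random split can easily put both "No" examples on the same side. Passing `stratify=y` (an option the call above doesn't use) keeps the Yes/No ratio similar in both splits:

```python
# Same split, but stratified on the target so class proportions are preserved.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_train.value_counts())
print(y_test.value_counts())
```

One last note for real projects: fit the scaler on the training set only and merely transform the test set, so statistics from the test data never leak into training.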