Day 14: Data Normalization and Scaling using Python#

Objective#

Explore data normalization and feature scaling in Python, emphasizing the mathematical principles underlying these techniques, and understand their practical implications in data preprocessing for machine learning models.

Prerequisites#

  • Intermediate Python, Pandas, NumPy, and Scikit-Learn skills.

  • Basic understanding of statistical concepts and linear algebra.

  • Dataset: 100daysofml/100daysofml.github.io

Introduction to Data Normalization and Scaling#

Data normalization and scaling are essential preprocessing techniques in data analysis and machine learning. They ensure consistent data representation across features, improving the performance and interpretability of machine learning algorithms. By modifying feature scales, these techniques help in stabilizing model training and enhancing algorithm convergence speed.

Understanding Scaling Types#

  1. Z-Score Normalization (Standardization)

  2. Min-Max Scaling

Comprehensive Mathematical Foundations#

Z-Score Normalization (Standardization)#

  • Concept: Transforms data to have zero mean \((\mu = 0)\) and unit variance \((\sigma^2 = 1)\).

  • Formula: \(Z = \frac{X - \mu}{\sigma}\)

    • Where \(X\) is the original value, \(\mu\) is the mean, and \(\sigma\) is the standard deviation. (A worked NumPy example follows this list.)

  • Mathematical Rationale:

    • By subtracting the mean, the data is centered around zero.

    • Dividing by the standard deviation scales the data so that the variance is normalized.

    • Ideal for features following a Gaussian distribution or when the algorithm assumes Gaussian distribution (e.g., Linear Regression, Logistic Regression, Linear Discriminant Analysis).

    • Pros:

      • Less affected by outliers than min-max scaling.

      • Keeps useful information about outliers.

    • Cons:

      • Not bound to a specific range.

    • Use Case: Algorithms that assume data is Gaussian (e.g., Linear Regression, Gaussian Naive Bayes).
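As a quick worked example of the formula above (the sample values are illustrative, not taken from the Iris dataset), this sketch standardizes a feature by hand with NumPy and confirms the result has zero mean and unit standard deviation:

import numpy as np

x = np.array([4.6, 5.0, 5.4, 6.1, 7.2])  # illustrative sepal lengths
mu, sigma = x.mean(), x.std()             # mean and (population) standard deviation
z = (x - mu) / sigma                      # Z = (X - mu) / sigma

print(round(z.mean(), 10))  # 0.0 (up to floating-point error)
print(round(z.std(), 10))   # 1.0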

Min-Max Scaling#

  • Concept: Rescales the range of features to [0, 1].

  • Formula: \(X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}\)

    • Where \(X\) is the original value, and \(X_{min}\) and \(X_{max}\) are the minimum and maximum values of the feature. (A worked NumPy example follows this list.)

  • Mathematical Rationale:

    • Ensures that all features contribute equally to the result.

    • Useful when algorithms are sensitive to the scale of input data (e.g., K-Nearest Neighbors, Neural Networks).

    • Prevents features with larger ranges from overpowering those with smaller ranges.

    • Pros:

      • Bound to a specific range.

      • Preserves relationships in data.

    • Cons:

      • Sensitive to outliers.

    • Use Case: Algorithms sensitive to input scale (e.g., Neural Networks, KNN).
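Likewise, a minimal sketch of the min-max formula (same illustrative values), rescaling a feature to [0, 1] by hand:

import numpy as np

x = np.array([4.6, 5.0, 5.4, 6.1, 7.2])       # illustrative sepal lengths
x_norm = (x - x.min()) / (x.max() - x.min())  # X_norm = (X - X_min) / (X_max - X_min)

print(x_norm.min(), x_norm.max())  # 0.0 1.0 -- the smallest value maps to 0, the largest to 1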

Hands-On Activities with the Iris Dataset#

Implementing Z-Score Normalization#

Implement Z-Score Normalization and Min-Max Scaling using the StandardScaler and MinMaxScaler from Scikit-Learn, respectively. Tasks include examining the distribution of features before and after scaling, and comparing how scaling alters the range of values in features.

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
import pandas as pd

iris = load_iris()
iris_data = pd.DataFrame(iris.data, columns=iris.feature_names)
scaler = StandardScaler()
iris_standardized = scaler.fit_transform(iris_data) 
  • Task: Examine the distribution of any feature before and after standardization.
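One way to carry out this task, assuming the iris_data DataFrame and iris_standardized array from the snippet above:

# Summary statistics before and after standardization: means move to ~0, stds to ~1
print(iris_data.describe().loc[['mean', 'std']])
print(pd.DataFrame(iris_standardized, columns=iris.feature_names).describe().loc[['mean', 'std']])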

Implementing Min-Max Scaling#

from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler()
iris_min_max_scaled = min_max_scaler.fit_transform(iris_data) 
  • Task: Compare how min-max scaling alters the range of values in a feature.
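One way to carry out this task, assuming iris_data and iris_min_max_scaled from the snippets above:

# Feature ranges before and after min-max scaling: every column now spans [0, 1]
print(iris_data.agg(['min', 'max']))
print(pd.DataFrame(iris_min_max_scaled, columns=iris.feature_names).agg(['min', 'max']))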

Visualization and Analysis Post Data Normalization and Scaling#

Objective#

To visually and analytically assess the impact of normalization and scaling on the data’s distribution and inter-feature relationships.

Importance of Visualization#

  • Understanding Data Transformation: Visualization helps in comprehending how normalization and scaling alter the data.

  • Model Readiness Check: It ensures that the data is appropriately scaled or normalized for specific machine learning models.

Implementing Z-Score Normalization#

  1. Load the dataset:

import pandas as pd
iris_data = pd.read_csv('Iris.csv')
print(iris_data.head())
  2. Implement Z-Score Normalization:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
iris_standardized = scaler.fit_transform(iris_data.iloc[:, 1:-1])  # Drop the Id column and the Species target; column 0 of the result is SepalLengthCm
  3. Tasks:

    • Examine distribution of features before and after standardization.

  4. Implement Min-Max Scaling:

from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler()
iris_min_max_scaled = min_max_scaler.fit_transform(iris_data.iloc[:, 1:-1])  # Drop Id and Species as above
  5. Tasks:

    • Compare how min-max scaling alters the range of values in features.

Visualization Techniques#

  1. Histograms and Density Plots:

    • Show distribution of features before and after normalization or scaling.

    • Example:

import seaborn as sns
import matplotlib.pyplot as plt
        
feature = 'SepalLengthCm'
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.histplot(iris_data[feature], kde=True, color='blue')
plt.title(f'Original {feature}')
        
plt.subplot(1, 2, 2)
sns.histplot(iris_standardized[:, 0], kde=True, color='green')  # Column 0 is SepalLengthCm
plt.title(f'Standardized {feature}')
plt.show() 
  2. Box Plots:

    • Show spread and central tendency of the data.

    • Example:

plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.boxplot(data=iris_data, y=feature)
plt.title(f'Original {feature}') 
        
plt.subplot(1, 2, 2)
sns.boxplot(y=iris_standardized[:, 0])  # Column 0 is SepalLengthCm
plt.title(f'Standardized {feature}')
plt.show() 
  3. Scatter Plots:

    • Observe relationship between two features.

    • Example:

plt.scatter(iris_data['SepalLengthCm'], iris_data['SepalWidthCm'], alpha=0.5, label='Original')
plt.scatter(iris_standardized[:, 0], iris_standardized[:, 1], alpha=0.5, label='Standardized')  # Columns 0 and 1 are SepalLengthCm and SepalWidthCm
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.title('Sepal Length vs. Width: Original vs. Standardized')
plt.legend()
plt.show() 

Analytical Assessment Post-Scaling#

  1. Statistical Summary:

    • Compare key statistics (mean, standard deviation, range) before and after scaling.

    • Use the describe() function in Pandas.

  2. Correlation Analysis:

    • Assess changes in inter-feature correlations post-scaling.

    • Use the corr() function in Pandas.

  3. Model Performance Evaluation:

    • Compare accuracy, precision, and recall before and after scaling.

    • Train models on original, standardized, and min-max scaled data (a combined sketch of all three checks follows this list).
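A minimal sketch of all three checks, assuming the iris_data DataFrame (loaded from Iris.csv) and the iris_standardized array created earlier; the model and split below are illustrative choices, not prescribed by the text:

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

features = iris_data.columns[1:-1]  # measurement columns, excluding Id and Species
std_df = pd.DataFrame(iris_standardized, columns=features)

# 1. Statistical summary: means shift to ~0 and stds to ~1 after standardization
print(iris_data[features].describe().loc[['mean', 'std']])
print(std_df.describe().loc[['mean', 'std']])

# 2. Correlation analysis: per-feature linear rescaling leaves Pearson correlations unchanged
print((iris_data[features].corr() - std_df.corr()).abs().max().max())  # ~0

# 3. Model performance: same model on original vs. standardized features
# (For a leakage-free evaluation, fit the scaler on the training split only --
# see Best Practices below; this quick comparison reuses the pre-scaled array.)
y = iris_data['Species']
for name, X in [('original', iris_data[features]), ('standardized', std_df)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    model = KNeighborsClassifier().fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, model.predict(X_te)))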

Best Practices and Considerations#

  • Data Leakage: Fit scalers on the training set only to prevent information leakage (see the Pipeline sketch after this list).

  • Train/Test Splitting: Split data before preprocessing to avoid data snooping.

  • Choosing the Right Method: Consider data distribution, algorithm requirements, and presence of outliers when choosing a scaling method.
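One way to bake the first two points into your workflow, sketched here with an illustrative model (LogisticRegression is an assumption, not prescribed above): wrap the scaler and estimator in a scikit-learn Pipeline so the scaler is re-fit on training data only, even inside cross-validation:

from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Inside cross_val_score, the pipeline re-fits the scaler on each training fold,
# so no statistics from the held-out fold leak into preprocessing.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())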

Sample Output#

Running the snippets from the sections above on Iris.csv prints the following preview of the dataset:

   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa

The visualization snippets produce three figures: histograms of the original vs. standardized SepalLengthCm, box plots of the original vs. standardized SepalLengthCm, and a scatter plot of sepal length vs. sepal width comparing the original and standardized data.

Additional Resources (Train/Test Splitting | Data Leakage)#

https://machinelearningmastery.com/data-preparation-without-data-leakage/

https://medium.com/@dancerworld60/detecting-and-preventing-data-leakage-in-machine-learning-4bb910900ab7

https://towardsdatascience.com/five-hidden-causes-of-data-leakage-you-should-be-aware-of-e44df654f185

https://www.analyticsvidhya.com/blog/2023/11/train-test-validation-split/

https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/

Day 14: Activity#

Step 1: Import necessary libraries#

Step 2: Load and Explore the Dataset#

Step 3: Preprocess the Data#

Step 4: Split the Data into Training and Testing Sets#

Step 5: Apply Z-Score Normalization (Standardization)#

Step 6: Apply Min-Max Scaling#

Step 7: Visualize and Analyze the Effect of Scaling#

Step 8: Statistical Summary Comparison#

#Step 1: Import Necessary Libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import matplotlib.pyplot as plt
import seaborn as sns
#Step 2: Load and Explore the Dataset

# Load the dataset
titanic_data = pd.read_csv('titanic.csv')

# Display the first few rows of the dataset

print(titanic_data.head())

# Display summary statistics

print("\nSummary Statistics:")
print(titanic_data.describe())


# Check for missing values
print("\nMissing Values:")
print(titanic_data.isnull().sum()) 
   PassengerId  Pclass                                          Name     Sex  \
0          892       3                              Kelly, Mr. James    male   
1          893       3              Wilkes, Mrs. James (Ellen Needs)  female   
2          894       2                     Myles, Mr. Thomas Francis    male   
3          895       3                              Wirz, Mr. Albert    male   
4          896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female   

    Age  SibSp  Parch   Ticket     Fare Cabin Embarked  Survived  
0  34.5      0      0   330911   7.8292   NaN        Q         0  
1  47.0      1      0   363272   7.0000   NaN        S         1  
2  62.0      0      0   240276   9.6875   NaN        Q         0  
3  27.0      0      0   315154   8.6625   NaN        S         0  
4  22.0      1      1  3101298  12.2875   NaN        S         1  

Summary Statistics:
       PassengerId      Pclass         Age       SibSp       Parch  \
count   418.000000  418.000000  332.000000  418.000000  418.000000   
mean   1100.500000    2.265550   30.272590    0.447368    0.392344   
std     120.810458    0.841838   14.181209    0.896760    0.981429   
min     892.000000    1.000000    0.170000    0.000000    0.000000   
25%     996.250000    1.000000   21.000000    0.000000    0.000000   
50%    1100.500000    3.000000   27.000000    0.000000    0.000000   
75%    1204.750000    3.000000   39.000000    1.000000    0.000000   
max    1309.000000    3.000000   76.000000    8.000000    9.000000   

             Fare    Survived  
count  417.000000  418.000000  
mean    35.627188    0.385167  
std     55.907576    0.487218  
min      0.000000    0.000000  
25%      7.895800    0.000000  
50%     14.454200    0.000000  
75%     31.500000    1.000000  
max    512.329200    1.000000  

Missing Values:
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
Survived         0
dtype: int64
#Step 3: Preprocess the Data

# Handle missing values, encode categorical variables if necessary, and select features for scaling.

# Handling missing values (simple example: fill with median or mode)
titanic_data['Age'] = titanic_data['Age'].fillna(titanic_data['Age'].median())
titanic_data['Embarked'] = titanic_data['Embarked'].fillna(titanic_data['Embarked'].mode()[0])

# Select features for scaling (numeric features)
features = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
X = titanic_data[features]

# If 'Survived' is the target variable
y = titanic_data['Survived'] 
#Step 4: Split the Data into Training and Testing Sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 
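
# Sanity check (sketch): confirm the random split roughly preserved the class balance.
print("Train survival rate:", round(y_train.mean(), 3))
print("Test survival rate:", round(y_test.mean(), 3))
# If the two rates diverge noticeably, pass stratify=y to train_test_split.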
#Step 5: Apply Z-Score Normalization (Standardization)

scaler = StandardScaler()
X_train_standardized = scaler.fit_transform(X_train)
X_test_standardized = scaler.transform(X_test)
#Step 6: Apply Min-Max Scaling

min_max_scaler = MinMaxScaler()
X_train_min_max = min_max_scaler.fit_transform(X_train)
X_test_min_max = min_max_scaler.transform(X_test) 
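
# Optional check (sketch): because both scalers were fit on the training split
# only, transformed test values are not guaranteed to stay in the training
# range -- under min-max scaling, a test value outside the training min/max
# maps outside [0, 1]. (np.nanmin/np.nanmax skip the remaining missing values.)
print("Min-max scaled test range:", np.nanmin(X_test_min_max), "to", np.nanmax(X_test_min_max))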
#Step 7: Visualize and Analyze the Effect of Scaling

#Visualize the effect of Z-Score Normalization and Min-Max Scaling on the feature distributions.

#Visualize the distribution of features before and after Z-Score Normalization
plt.figure(figsize=(16, 6))
for i, feature in enumerate(features):
    plt.subplot(2, len(features), i+1)
    sns.histplot(X_train[feature], kde=True, color='blue', label='Original')
    plt.title(f'Original {feature}')
    plt.subplot(2, len(features), len(features)+i+1)
    sns.histplot(X_train_standardized[:, i], kde=True, color='green', label='Standardized')
    plt.title(f'Standardized {feature}')
plt.tight_layout()
plt.show()

# Visualize the distribution of features before and after Min-Max Scaling
plt.figure(figsize=(16, 6))
for i, feature in enumerate(features):
    plt.subplot(2, len(features), i+1)
    sns.histplot(X_train[feature], kde=True, color='blue', label='Original')
    plt.title(f'Original {feature}')
    plt.subplot(2, len(features), len(features)+i+1)
    sns.histplot(X_train_min_max[:, i], kde=True, color='red', label='Min-Max Scaled')
    plt.title(f'Min-Max Scaled {feature}')
plt.tight_layout()
plt.show()
[Figures: feature distributions before/after Z-Score Normalization, and before/after Min-Max Scaling]
#Step 8: Statistical Summary Comparison

#Compare the statistical summary before and after scaling.

# Statistical summary before scaling
print("Statistical Summary before scaling:")
print(X_train.describe())

# Statistical summary after Z-Score Normalization
print("\nStatistical Summary after Z-Score Normalization:")
print(pd.DataFrame(X_train_standardized, columns=features).describe())

# Statistical summary after Min-Max Scaling
print("\nStatistical Summary after Min-Max Scaling:")
print(pd.DataFrame(X_train_min_max, columns=features).describe())
           Pclass         Age       SibSp       Parch        Fare
count  334.000000  262.000000  334.000000  334.000000  333.000000
mean     2.269461   30.115763    0.470060    0.404192   36.909135
std      0.844961   14.655775    0.944719    0.937113   58.054690
min      1.000000    0.330000    0.000000    0.000000    0.000000
25%      1.000000   21.000000    0.000000    0.000000    7.887500
50%      3.000000   27.000000    0.000000    0.000000   14.454200
75%      3.000000   39.000000    1.000000    0.000000   32.500000
max      3.000000   76.000000    8.000000    9.000000  512.329200

Statistical Summary after Z-Score Normalization:
             Pclass           Age         SibSp         Parch          Fare
count  3.340000e+02  2.620000e+02  3.340000e+02  3.340000e+02  3.330000e+02
mean   2.233742e-16  2.711995e-17  2.127373e-17 -3.722904e-17 -4.267524e-17
std    1.001500e+00  1.001914e+00  1.001500e+00  1.001500e+00  1.001505e+00
min   -1.504644e+00 -2.036246e+00 -4.983123e-01 -4.319630e-01 -6.367217e-01
25%   -1.504644e+00 -6.231816e-01 -4.983123e-01 -4.319630e-01 -5.006539e-01
50%    8.658800e-01 -2.130032e-01 -4.983123e-01 -4.319630e-01 -3.873714e-01
75%    8.658800e-01  6.073537e-01  5.617915e-01 -4.319630e-01 -7.606225e-02
max    8.658800e-01  3.136788e+00  7.982518e+00  9.186412e+00  8.201500e+00

Statistical Summary after Min-Max Scaling:
           Pclass         Age       SibSp       Parch        Fare
count  334.000000  262.000000  334.000000  334.000000  333.000000
mean     0.634731    0.393627    0.058757    0.044910    0.072042
std      0.422481    0.193680    0.118090    0.104124    0.113315
min      0.000000    0.000000    0.000000    0.000000    0.000000
25%      0.000000    0.273160    0.000000    0.000000    0.015395
50%      1.000000    0.352451    0.000000    0.000000    0.028213
75%      1.000000    0.511035    0.125000    0.000000    0.063436
max      1.000000    1.000000    1.000000    1.000000    1.000000
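
# Optional follow-up (sketch): per-feature linear scaling should leave
# inter-feature Pearson correlations unchanged; verify on the training split.
orig_corr = X_train.corr()
std_corr = pd.DataFrame(X_train_standardized, columns=features).corr()
print((orig_corr - std_corr).abs().max().max())  # ~0, up to floating-point error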

Additional Resources (Data Normalization & Scaling)#

https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/

https://www.kaggle.com/code/alexisbcook/scaling-and-normalization

https://machinelearningmastery.com/how-to-improve-neural-network-stability-and-modeling-performance-with-data-scaling/

https://www.simplilearn.com/normalization-vs-standardization-article