Day 12: In-Depth Exploration of Data Splitting Techniques in Python with Cross-Validation#

Objective#

Enhance understanding of data splitting in machine learning, focusing on mathematical principles, advanced methods like cross-validation, and Python programming activities.

Prerequisites#

  • Advanced knowledge in Python, Pandas, NumPy, and Scikit-Learn.

  • Familiarity with statistical concepts and machine learning principles.

  • Configured Python environment with necessary data science libraries.

  • Dataset from: 100daysofml/100daysofml.github.io

1. Theoretical Background and Mathematical Principles#

Understanding Data Distribution

  • Statistical Sampling: Importance in understanding how well samples represent populations.

  • Central Limit Theorem (CLT):

    • Formula: \(\bar{X} \sim N(\mu, \frac{\sigma^2}{n})\) for large \(n\).

    • Explanation: \(\bar{X}\) is the sample mean; \(N\) denotes a normal distribution; \(\mu\) is the population mean; \(\sigma^2\) is the population variance; \(n\) is the sample size.

    • Importance: Foundation for statistical methods, applicable even with unknown population distribution.
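
A quick simulation makes the CLT concrete. The sketch below (NumPy only; the exponential population and the constants are arbitrary choices for illustration) checks that the mean and variance of many sample means match the predicted \(N(\mu, \frac{\sigma^2}{n})\):

import numpy as np

rng = np.random.default_rng(42)
population = rng.exponential(scale=2.0, size=100_000)  # skewed, clearly non-normal
mu, sigma2 = population.mean(), population.var()

n = 50                                                 # sample size
sample_means = rng.choice(population, size=(5_000, n)).mean(axis=1)

# CLT prediction: sample means are approximately N(mu, sigma^2 / n)
print("Mean of sample means:", sample_means.mean(), "vs mu:", mu)
print("Variance of sample means:", sample_means.var(), "vs sigma^2/n:", sigma2 / n)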

Importance of Random Sampling

  • Probability Sampling: Ensures unbiased population representation.

  • Math Behind Random Sampling: Probability of selection = \(\frac{1}{N}\), where \(N\) is the total number of observations.
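
In pandas, such a sample can be drawn with DataFrame.sample; a minimal sketch, assuming df is your DataFrame:

# Each row is equally likely to be drawn; sampling is without replacement by default
sample = df.sample(n=100, random_state=42)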

Stratified Sampling Theory

  • Principle: Divides the population into homogeneous subgroups to reduce sampling error.

  • Calculation: Proportion in stratum = \(\frac{\text{Number in Stratum}}{\text{Total Number}}\).
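
These proportions are easy to compute with pandas; a sketch, assuming df has a categorical stratum_column (the same name used in the stratified split below):

# Proportion in each stratum = count in stratum / total count
strata_proportions = df["stratum_column"].value_counts(normalize=True)
print(strata_proportions)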

2. Data Splitting Techniques#

Simple Random Split

  • Usage: Ideal for large, representative datasets.

  • Python Activity:

    • Task: Load a dataset, apply a simple random split.

from sklearn.model_selection import train_test_split
# Assuming df is your DataFrame
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)
  • Analysis: Assess how well the split represents the dataset’s characteristics (a quick check is sketched below).

Note: The random_state parameter seeds the pseudo-random number generator, letting you reproduce the same train/test split each time you run the code. People fix random_state for a number of reasons, including software testing, tutorials like this one, and talks.
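
A quick way to run the representativeness check is to compare summary statistics across the full dataset and the two splits; a sketch reusing df, train_set, and test_set from above:

# A representative split should yield similar statistics in both subsets
print(df.describe())
print(train_set.describe())
print(test_set.describe())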

Stratified Random Split

  • Usage: Crucial for datasets with significant internal variance or imbalanced classes.

  • Python Activity:

    • Task: Perform a stratified split based on a key categorical variable.

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(df, df["stratum_column"]):
    # split() yields positional indices, so index with iloc
    strat_train_set = df.iloc[train_index]
    strat_test_set = df.iloc[test_index]
  • Analysis: Evaluate the variance within each stratum compared to the overall dataset (one simple check is sketched below).
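
One simple way to run this check is to compare stratum proportions in each split against the full dataset; a sketch reusing stratum_column from above:

# Stratified splitting should preserve the strata proportions in both sets
print(df["stratum_column"].value_counts(normalize=True))
print(strat_train_set["stratum_column"].value_counts(normalize=True))
print(strat_test_set["stratum_column"].value_counts(normalize=True))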

3. Advanced Splitting Techniques#

K-Fold Cross-Validation

  • Principle: Each data point is used for both training and validation.

  • Formula: Divide the dataset into \(K\) subsets; train on \(K-1\) of them and test on the remaining one; repeat \(K\) times.

  • Best Practices:

    • Choose an appropriate \(K\) to balance bias and variance.

    • Shuffle data when splitting to avoid biased splits.

  • Python Activity:

    • Task: Implement K-Fold Cross-Validation.

from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42)
kf = KFold(n_splits=5, random_state=42, shuffle=True)
scores = cross_val_score(model, df_features, df_target, cv=kf)

print("Cross-validation scores:", scores)
  • Analysis: Explore how varying the number of splits \(K\) impacts model performance; the objective is to find a balance between bias and variance (a sketch follows below).
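
One way to explore the trade-off is to loop over several values of \(K\) and watch how the mean and spread of the scores change; a sketch reusing model, df_features, and df_target from above:

# Larger K generally lowers bias but raises variance (and compute cost)
for k in [3, 5, 10]:
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    scores = cross_val_score(model, df_features, df_target, cv=kf)
    print(f"K={k}: mean={scores.mean():.3f}, std={scores.std():.3f}")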

Stratified Cross-Validation#

  • Usage: Essential for datasets with imbalanced classes.

  • Python Activity:

    • Task: Implement Stratified K-Fold Cross-Validation.

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, random_state=42, shuffle=True)
for train_index, test_index in skf.split(df_features, df_target):
    # split() yields positional indices, so index DataFrames with iloc
    X_train, X_test = df_features.iloc[train_index], df_features.iloc[test_index]
    y_train, y_test = df_target.iloc[train_index], df_target.iloc[test_index]

Blocked Cross-Validation#

  • Usage: Ideal for grouped data, such as repeated measurements from the same subject.

  • Python Activity:

    • Task: Use Blocked Cross-Validation for grouped data.
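
Scikit-Learn's GroupKFold implements this idea: every row sharing a group label lands in the same fold, so measurements from one subject never appear in both training and test data. A minimal sketch, assuming df_features and df_target as above and a hypothetical subject_id column identifying the groups:

from sklearn.model_selection import GroupKFold

groups = df["subject_id"]  # hypothetical column identifying each subject

gkf = GroupKFold(n_splits=5)
for train_index, test_index in gkf.split(df_features, df_target, groups=groups):
    # All rows from a given subject fall entirely in train or entirely in test
    X_train, X_test = df_features.iloc[train_index], df_features.iloc[test_index]
    y_train, y_test = df_target.iloc[train_index], df_target.iloc[test_index]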

Rolling Cross-Validation#

  • Usage: Best suited for time series data.

  • Python Activity:

    • Task: Apply Rolling Cross-Validation on time series data.
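
Scikit-Learn's TimeSeriesSplit implements a rolling scheme: each training window contains only observations that precede the corresponding test window, so the model never peeks into the future. A sketch, assuming df_features and df_target are ordered chronologically:

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tscv.split(df_features):
    # Training indices always come before test indices in time
    X_train, X_test = df_features.iloc[train_index], df_features.iloc[test_index]
    y_train, y_test = df_target.iloc[train_index], df_target.iloc[test_index]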

Homework Assignment #1#

Objective#

Enhance understanding of practical data preprocessing, test set splitting, and model evaluation, focusing on Python implementation using a Kaggle dataset.

Step 1: Access and Load the Dataset#

import pandas as pd

# Load the dataset (update the path according to your download)
wine_data = pd.read_csv('winequality.csv')

Step 2: Data Preprocessing#

  • Tasks: Handle missing values, encode categorical variables, and normalize features

from sklearn.preprocessing import StandardScaler, LabelEncoder

# Handling missing values (column means are only defined for numeric columns)
wine_data.fillna(wine_data.mean(numeric_only=True), inplace=True)

# Encoding categorical variables if any
encoder = LabelEncoder()
wine_data['categorical_column'] = encoder.fit_transform(wine_data['categorical_column'])

# Normalizing features
scaler = StandardScaler()
wine_data_scaled = scaler.fit_transform(wine_data.drop('quality', axis=1))

Step 3: Splitting the Dataset#

  • Task: Split the dataset into training and test sets.

from sklearn.model_selection import train_test_split

X = wine_data_scaled
y = wine_data['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 4: Training a Model#

  • Task: Train a model using the training set.

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

Step 5: Model Evaluation Techniques#

  • Tasks: Evaluate the model using a variety of techniques.

  • Accuracy Score:

from sklearn.metrics import accuracy_score

predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))
  • Confusion Matrix:

from sklearn.metrics import confusion_matrix

print("Confusion Matrix:\n", confusion_matrix(y_test, predictions))
  • Precision, Recall, and F1-Score:

from sklearn.metrics import classification_report

print("Classification Report:\n", classification_report(y_test, predictions))
  • ROC Curve:

from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# ROC curves require a binary target; for a multiclass label such as wine
# quality, binarize it first (e.g., "good" if quality >= 6) and retrain
probabilities = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, probabilities)
roc_auc = auc(fpr, tpr)

plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
  • Cross-Validation:

from sklearn.model_selection import cross_val_score

# As the Final Notes below explain, run cross-validation on the training
# set only so the held-out test set stays untouched
cross_val_scores = cross_val_score(model, X, y, cv=5)
print("Cross-Validation Scores:", cross_val_scores)

Final Notes#

The five steps outlined in the lesson plan are generally aligned with best practices in machine learning for data preprocessing, splitting, and model evaluation. However, there are a few additional considerations and potential modifications to ensure efficiency and prevent data leakage:

  • Access and Load the Dataset: This step is straightforward and essential. Ensure data confidentiality and compliance with data usage policies.

  • Data Preprocessing: Best Practice: It’s crucial to fit preprocessing on the training set independently of the test set to prevent data leakage. This means normalizing or standardizing the training data and then applying the same transformation to the test data using parameters learned from the training set (see the sketch after this list). Modification: Split the data into training and test sets before applying any normalization or encoding.

  • Splitting the Dataset: This step is well-placed after initial data loading and handling missing values but should ideally be done before applying more complex preprocessing steps like normalization and encoding.

  • Training a Model: Best Practice: This step is well-positioned after preprocessing and splitting. Ensure that the model is trained only on the training set.

  • Model Evaluation Techniques: Best Practice: Evaluating the model on the test set is crucial to assess its performance on unseen data. Using a variety of metrics provides a comprehensive view of the model’s performance. Consideration: When dealing with imbalanced datasets, metrics like ROC-AUC and F1-score are particularly informative.
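
A leakage-free version of Steps 2 and 3 would split first and only then fit the scaler, using the training portion alone; a sketch:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split first, then fit preprocessing on the training portion only
X = wine_data.drop('quality', axis=1)
y = wine_data['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuses the training-set parameters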

Additional Considerations for Best Practices#

Cross-Validation: Should ideally be performed before finalizing the model to understand its generalizability. Cross-validation should be conducted on the training dataset only. For time-series data, use time-series specific cross-validation methods to prevent data leakage.

Feature Selection and Engineering: Should be done after splitting the data to prevent information from the test set leaking into the model training process. Any feature engineering should also be consistent across training and test sets.

Data Leakage Prevention: Be vigilant about not letting information from the test set influence the training process. This includes careful handling of any preprocessing steps like scaling, normalization, or data augmentation; a Pipeline-based approach is sketched below.
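
A convenient way to enforce this during cross-validation is to wrap the preprocessing and the model in a scikit-learn Pipeline, so the scaler is re-fit inside each training fold; a sketch reusing X_train and y_train from the split above:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

pipeline = Pipeline([
    ("scaler", StandardScaler()),                       # re-fit inside each training fold
    ("model", RandomForestClassifier(random_state=42)),
])

# Cross-validate on the training data only; the scaler never sees the fold
# held out for validation, and the final test set stays untouched
scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print("Leakage-free CV scores:", scores)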

Additional Resources (Train/Test Splitting | Data Leakage)#

https://medium.com/@datasciencewizards/a-guide-to-data-splitting-in-machine-learning-49a959c95fa1

https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/

https://towardsdatascience.com/data-leakage-in-machine-learning-how-it-can-be-detected-and-minimize-the-risk-8ef4e3a97562

https://www.analyticsvidhya.com/blog/2021/07/data-leakage-and-its-effect-on-the-performance-of-an-ml-model/

https://jfrog.com/community/data-science/be-careful-from-data-leakage-2/