Day 11: Introduction to Data Preprocessing in Python#
1. Overview of Data Preprocessing#
Importance and Goals#
Importance: Essential for converting raw data into a format suitable for analysis.
Goals: Enhance data quality, improve analysis efficiency, and prepare data for machine learning.
Data Preprocessing Workflow#
Cleaning Data: Remove duplicates, correct errors.
Handling Missing Values: Impute missing values or remove them.
Normalization: Scale data using methods like Min-Max scaling or Z-score normalization.
Feature Engineering: Create new features from existing data.
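A minimal sketch of what this workflow can look like with pandas and scikit-learn. Note that the DataFrame df and its columns below are made up for illustration and are not part of the COVID dataset used later in this lesson.
# NOTE: 'df' and its columns are hypothetical, purely to illustrate the four steps
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "age": [25, 32, 32, None, 51],
    "income": [40000, 52000, 52000, 61000, None],
})

# 1. Cleaning: drop exact duplicate rows
df = df.drop_duplicates()

# 2. Handling missing values: impute with each column's median
df = df.fillna(df.median(numeric_only=True))

# 3. Normalization: Min-Max scale numeric columns into [0, 1]
df[["age", "income"]] = MinMaxScaler().fit_transform(df[["age", "income"]])

# 4. Feature engineering: derive a new feature from an existing one
df["high_income"] = (df["income"] > 0.5).astype(int)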
2. Understanding Data Types and Scales#
Data Types#
Numeric (Quantitative): Numbers representing continuous or discrete data.
Categorical (Qualitative): Data grouped into categories.
Scales#
Nominal: Categories without order (e.g., blood types).
Ordinal: Ordered categories (e.g., class levels).
Interval: Numeric scales without true zero (e.g., temperature in Celsius).
Ratio: Numeric scales with true zero (e.g., height).
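One way these scales can be represented in pandas is sketched below; the example values are invented for illustration only.
# Nominal: categories with no inherent order
blood_type = pd.Series(["A", "O", "B", "AB"], dtype="category")

# Ordinal: ordered categories
class_level = pd.Categorical(["freshman", "senior", "junior"],
                             categories=["freshman", "sophomore", "junior", "senior"],
                             ordered=True)

# Interval: numeric with no true zero (0 °C does not mean "no temperature")
temperature_c = pd.Series([21.5, 23.0, 19.8])

# Ratio: numeric with a true zero, so ratios are meaningful
height_cm = pd.Series([162.0, 175.5, 181.2])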
3. Basic (Summary) Statistics in Python#
Setup for Activities#
Dataset: COVID data (covid_data.csv) from the 100daysofml/100daysofml.github.io repository.
Tools: Python with Pandas library.
#import relevant libraries
import numpy as np
import pandas as pd
from scipy import stats
#load data into dataframe
covid_data = pd.read_csv("covid_data.csv")
#inspect the first five and last five lines of the dataframe
covid_data.head()
  | iso_code | continent | location | date | total_cases | new_cases | new_cases_smoothed | total_deaths | new_deaths | new_deaths_smoothed | ... | female_smokers | male_smokers | handwashing_facilities | hospital_beds_per_thousand | life_expectancy | human_development_index | excess_mortality_cumulative_absolute | excess_mortality_cumulative | excess_mortality | excess_mortality_cumulative_per_million |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AFG | Asia | Afghanistan | 24/02/2020 | 5 | 5 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 37.746 | 0.5 | 64.83 | 0.511 | NaN | NaN | NaN | NaN |
1 | AFG | Asia | Afghanistan | 25/02/2020 | 5 | 0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 37.746 | 0.5 | 64.83 | 0.511 | NaN | NaN | NaN | NaN |
2 | AFG | Asia | Afghanistan | 26/02/2020 | 5 | 0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 37.746 | 0.5 | 64.83 | 0.511 | NaN | NaN | NaN | NaN |
3 | AFG | Asia | Afghanistan | 27/02/2020 | 5 | 0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 37.746 | 0.5 | 64.83 | 0.511 | NaN | NaN | NaN | NaN |
4 | AFG | Asia | Afghanistan | 28/02/2020 | 5 | 0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 37.746 | 0.5 | 64.83 | 0.511 | NaN | NaN | NaN | NaN |
5 rows × 67 columns
covid_data.tail()
  | iso_code | continent | location | date | total_cases | new_cases | new_cases_smoothed | total_deaths | new_deaths | new_deaths_smoothed | ... | female_smokers | male_smokers | handwashing_facilities | hospital_beds_per_thousand | life_expectancy | human_development_index | excess_mortality_cumulative_absolute | excess_mortality_cumulative | excess_mortality | excess_mortality_cumulative_per_million |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5813 | NGA | Africa | Nigeria | 06/10/2022 | 265741 | 236 | 51.286 | 3155.0 | 0.0 | 0.0 | ... | 0.6 | 10.8 | 41.949 | NaN | 54.69 | 0.539 | NaN | NaN | NaN | NaN |
5814 | NGA | Africa | Nigeria | 07/10/2022 | 265741 | 0 | 51.286 | 3155.0 | 0.0 | 0.0 | ... | 0.6 | 10.8 | 41.949 | NaN | 54.69 | 0.539 | NaN | NaN | NaN | NaN |
5815 | NGA | Africa | Nigeria | 08/10/2022 | 265816 | 75 | 55.000 | 3155.0 | 0.0 | 0.0 | ... | 0.6 | 10.8 | 41.949 | NaN | 54.69 | 0.539 | NaN | NaN | NaN | NaN |
5816 | NGA | Africa | Nigeria | 09/10/2022 | 265816 | 0 | 55.000 | 3155.0 | 0.0 | 0.0 | ... | 0.6 | 10.8 | 41.949 | NaN | 54.69 | 0.539 | NaN | NaN | NaN | NaN |
5817 | NGA | Africa | Nigeria | 10/10/2022 | 265816 | 0 | 55.000 | 3155.0 | 0.0 | 0.0 | ... | 0.6 | 10.8 | 41.949 | NaN | 54.69 | 0.539 | NaN | NaN | NaN | NaN |
5 rows × 67 columns
#display the column names, non-null counts, and data types
covid_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5818 entries, 0 to 5817
Data columns (total 67 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 iso_code 5818 non-null object
1 continent 5818 non-null object
2 location 5818 non-null object
3 date 5818 non-null object
4 total_cases 5818 non-null int64
5 new_cases 5818 non-null int64
6 new_cases_smoothed 5788 non-null float64
7 total_deaths 5638 non-null float64
8 new_deaths 5627 non-null float64
9 new_deaths_smoothed 5596 non-null float64
10 total_cases_per_million 5818 non-null float64
11 new_cases_per_million 5818 non-null float64
12 new_cases_smoothed_per_million 5788 non-null float64
13 total_deaths_per_million 5638 non-null float64
14 new_deaths_per_million 5627 non-null float64
15 new_deaths_smoothed_per_million 5596 non-null float64
16 reproduction_rate 5566 non-null float64
17 icu_patients 2610 non-null float64
18 icu_patients_per_million 2610 non-null float64
19 hosp_patients 2610 non-null float64
20 hosp_patients_per_million 2610 non-null float64
21 weekly_icu_admissions 0 non-null float64
22 weekly_icu_admissions_per_million 0 non-null float64
23 weekly_hosp_admissions 934 non-null float64
24 weekly_hosp_admissions_per_million 934 non-null float64
25 total_tests 3174 non-null float64
26 new_tests 2948 non-null float64
27 total_tests_per_thousand 3174 non-null float64
28 new_tests_per_thousand 2948 non-null float64
29 new_tests_smoothed 4114 non-null float64
30 new_tests_smoothed_per_thousand 4114 non-null float64
31 positive_rate 3440 non-null float64
32 tests_per_case 3440 non-null float64
33 tests_units 4156 non-null object
34 total_vaccinations 2104 non-null float64
35 people_vaccinated 2051 non-null float64
36 people_fully_vaccinated 2004 non-null float64
37 total_boosters 1170 non-null float64
38 new_vaccinations 1827 non-null float64
39 new_vaccinations_smoothed 3658 non-null float64
40 total_vaccinations_per_hundred 2104 non-null float64
41 people_vaccinated_per_hundred 2051 non-null float64
42 people_fully_vaccinated_per_hundred 2004 non-null float64
43 total_boosters_per_hundred 1170 non-null float64
44 new_vaccinations_smoothed_per_million 3658 non-null float64
45 new_people_vaccinated_smoothed 3658 non-null float64
46 new_people_vaccinated_smoothed_per_hundred 3658 non-null float64
47 stringency_index 5699 non-null float64
48 population 5818 non-null int64
49 population_density 5818 non-null float64
50 median_age 5818 non-null float64
51 aged_65_older 5818 non-null float64
52 aged_70_older 5818 non-null float64
53 gdp_per_capita 5818 non-null float64
54 extreme_poverty 2922 non-null float64
55 cardiovasc_death_rate 5818 non-null float64
56 diabetes_prevalence 5818 non-null float64
57 female_smokers 4860 non-null float64
58 male_smokers 4860 non-null float64
59 handwashing_facilities 1913 non-null float64
60 hospital_beds_per_thousand 4863 non-null float64
61 life_expectancy 5818 non-null float64
62 human_development_index 5818 non-null float64
63 excess_mortality_cumulative_absolute 421 non-null float64
64 excess_mortality_cumulative 421 non-null float64
65 excess_mortality 421 non-null float64
66 excess_mortality_cumulative_per_million 421 non-null float64
dtypes: float64(59), int64(3), object(5)
memory usage: 3.0+ MB
#select the relevant columns into a new dataframe
covid_datanew = covid_data[['iso_code','continent','location','date','total_cases','new_cases']]
#displaying the new dataframe
covid_datanew.head(5)
  | iso_code | continent | location | date | total_cases | new_cases |
---|---|---|---|---|---|---|
0 | AFG | Asia | Afghanistan | 24/02/2020 | 5 | 5 |
1 | AFG | Asia | Afghanistan | 25/02/2020 | 5 | 0 |
2 | AFG | Asia | Afghanistan | 26/02/2020 | 5 | 0 |
3 | AFG | Asia | Afghanistan | 27/02/2020 | 5 | 0 |
4 | AFG | Asia | Afghanistan | 28/02/2020 | 5 | 0 |
#investigating the data type of the dataframe
covid_datanew.dtypes
iso_code object
continent object
location object
date object
total_cases int64
new_cases int64
dtype: object
#displaying the shape of the dataframe (rows x columns)
covid_data.shape
(5818, 67)
Mean (Arithmetic Average)
Formula: \(\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i\)
Activity: Calculate the mean of ‘new_cases’ in the dataset.
#analyze the mean of the new_cases column using the np.mean() in numpy
newcase_mean = np.mean(covid_datanew["new_cases"])
print("New case (mean):", newcase_mean)
New case (mean): 8814.365761430045
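As a quick sanity check, the same value can be reproduced directly from the formula above: sum the values and divide by the number of observations.
# reproduce the mean by hand: sum of the values divided by n
n = len(covid_datanew["new_cases"])
manual_mean = covid_datanew["new_cases"].sum() / n
print("New case (mean, manual):", manual_mean)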
Median (Middle Value in Sorted Data)
Activity: Find the median of ‘new_cases’ in the dataset
newcase_median = np.median(covid_datanew["new_cases"])
print("New case (median):", newcase_median)
New case (median): 261.0
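The median is simply the middle element of the sorted values (or the average of the two middle elements when the count is even), which can also be checked by hand:
# check the median manually: middle element(s) of the sorted values
sorted_cases = np.sort(covid_datanew["new_cases"].to_numpy())
n = len(sorted_cases)
if n % 2 == 1:
    manual_median = sorted_cases[n // 2]
else:
    manual_median = (sorted_cases[n // 2 - 1] + sorted_cases[n // 2]) / 2
print("New case (median, manual):", manual_median)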
Mode (Most Frequent Value)
Activity: Determine the mode of 'new_cases'.
The stats.mode function from the SciPy library returns a ModeResult object, which contains two fields:
mode: the most frequently occurring value(s) in the dataset.
count: the number of times the mode value(s) appears in the dataset.
In older SciPy releases both fields were returned as arrays, even when there was only one mode, so the value was typically extracted with stats.mode(covid_datanew['new_cases'])[0][0]: the first [0] selects the array of mode values, and the second [0] selects its first element. The array form exists because mode can operate along an axis of a multi-dimensional array, in which case it returns one mode per slice.
In recent SciPy releases the default behaviour changed (keepdims=False), so for a one-dimensional input like ours the fields are plain scalars, as the output below shows. In that case the mode can be read directly as newcase_mode.mode (or newcase_mode[0]).
So, in the context of the code:
stats.mode(covid_datanew['new_cases']): returns a ModeResult object holding the mode and its count.
stats.mode(covid_datanew['new_cases']).mode: accesses the mode value itself.
newcase_mode = stats.mode(covid_datanew["new_cases"])
print("New case (mode):", newcase_mode)
New case (mode): ModeResult(mode=0, count=805)
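To pull the individual numbers out of the ModeResult, read its mode and count fields; pandas' own .mode() method is an alternative that returns a Series of the most frequent value(s).
# read the mode value and its count from the ModeResult fields
print("New case (mode value):", newcase_mode.mode)
print("New case (mode count):", newcase_mode.count)

# pandas alternative: .mode() returns a Series containing the most frequent value(s)
print(covid_datanew["new_cases"].mode())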
In the example above we used a numeric column to compute the mode. Could you use a non-numeric column?#
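Yes. The mode is the one measure of central tendency that also applies to nominal (categorical) data. pandas computes it directly on a non-numeric column; whether scipy.stats.mode accepts non-numeric input depends on the SciPy version, so pandas is the safer choice here. A small sketch using the continent column of this dataset:
# mode of a non-numeric (categorical) column
print(covid_datanew["continent"].mode())

# frequency table for context: how often each continent appears
print(covid_datanew["continent"].value_counts())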
Variance (σ²)
Formula: \(\sigma^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}\)
Activity: Compute the variance of 'new_cases'.
#using numpy check the variance of the new_cases column
newcase_variance = np.var(covid_datanew["new_cases"])
print("New cases (variance:numpy):", newcase_variance)
New cases (variance:numpy): 451321915.9280954
#using pandas check the variance of the new_cases column
covid_datanew["new_cases"].var()
451399502.6421969
Standard Deviation (σ)
Formula: \(\sigma = \sqrt{\frac{\sum_{i=1}^n (x_i-\bar{x})^2}{n}}\)
Activity: Calculate the standard deviation of 'new_cases'.
# Calculate the standard deviation using NumPy
# 'ddof=0' for population standard deviation; 'ddof=1' for sample standard deviation
newcase_stdev = np.std(covid_datanew["new_cases"], ddof=1)
print("New cases (stdev: numpy):", newcase_stdev)
New cases (stdev: numpy): 21246.164421895
covid_datanew["new_cases"].std()
21246.164421895
Why would there be a difference in the variance and standard deviation between NumPy and Pandas?#
The difference between the NumPy var and pandas var methods is not dependent on the range of the data but on the default degrees of freedom (ddof) set by each package: pandas uses ddof=1 (the unbiased sample estimator) while NumPy uses ddof=0 (the maximum likelihood, or population, estimator). RE: https://stackoverflow.com/questions/62938495/difference-between-numpy-var-and-pandas-var
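The two libraries can be made to agree by setting ddof explicitly on either side:
# make NumPy use the sample (unbiased) estimator, as pandas does by default
print(np.var(covid_datanew["new_cases"], ddof=1))

# make pandas use the population (MLE) estimator, as NumPy does by default
print(covid_datanew["new_cases"].var(ddof=0))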
Max and Min Range
The range plays a significant role in describing the variability of a data set, as long as there are no outliers. An outlier is an extreme high or low value that stands apart from the other values. If an outlier exists, the range by itself can be misleading.
#Calculate the range of the dataset using NumPy
covid_newcases_max = np.max(covid_datanew["new_cases"])
covid_newcases_min = np.min(covid_datanew["new_cases"])
print(covid_newcases_max, covid_newcases_min)
287149 0
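The range itself is the difference between these two values; NumPy's np.ptp ("peak to peak") computes the same thing in one call.
# range = max - min
covid_newcases_range = covid_newcases_max - covid_newcases_min
print("New cases (range):", covid_newcases_range)

# equivalent one-liner with NumPy's peak-to-peak function
print("New cases (range, np.ptp):", np.ptp(covid_datanew["new_cases"].to_numpy()))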
Why are Quartiles and Interquartile Range Important?#
Quartiles and the Interquartile Range (IQR) are essential in data analysis for several key reasons:
Measure of Spread
Quartiles divide a dataset into four equal parts, providing insight into the distribution and variability of the data.
Outlier Detection
The IQR is a robust measure of statistical dispersion and is commonly used for identifying outliers. Values that fall below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are often considered outliers (see the sketch after the quartile calculation below).
Non-parametric
Quartiles do not assume a normal distribution of data, making them non-parametric and robust measures for skewed distributions or data with outliers.
Data Segmentation and Comparison
Quartiles allow for easy segmentation of data into groups, which is useful in various applications like finance and sales.
Informative for Further Statistical Analysis
Understanding quartile positions helps in making informed decisions for further statistical analyses, especially with skewed data.
Basis for Other Statistical Measures
Quartiles are foundational for other statistical visualizations like box plots, which depict quartiles and outliers graphically.
# Calculate quartiles
Q1 = np.quantile(covid_data["new_cases"],0.25)
Q3 = np.quantile(covid_data["new_cases"],0.75)
# Calculate the Interquartile Range
IQR = Q3 - Q1
print("Q1 (25th percentile):", Q1)
print("Q3 (75th percentile):", Q3)
print("Interquartile Range:", IQR)
Q1 (25th percentile): 24.0
Q3 (75th percentile): 3666.0
Interquartile Range: 3642.0
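Applying the 1.5*IQR rule mentioned earlier, here is a short sketch that flags potential outliers in new_cases using the quartiles just computed.
# outlier fences derived from the quartiles
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# rows whose new_cases fall outside the fences are potential outliers
outliers = covid_data[(covid_data["new_cases"] < lower_bound) |
                      (covid_data["new_cases"] > upper_bound)]
print("Lower bound:", lower_bound)
print("Upper bound:", upper_bound)
print("Number of potential outliers:", len(outliers))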
Activity - Hands-On#
Use the data set located at https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho and do a summary statistical analysis using either pandas or NumPy.
Calculate basic (summary) statistics for this data set
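A possible starting point is sketched below; the CSV filename is a placeholder, since the Kaggle page offers several files, so adjust the path to whichever file you download.
import pandas as pd

# load the downloaded file; "car_data.csv" is a placeholder filename
cars = pd.read_csv("car_data.csv")

# structural overview: columns, non-null counts, dtypes
cars.info()

# summary statistics (count, mean, std, min, quartiles, max) for the numeric columns
print(cars.describe())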
Additional Resources#
https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/