Day 11: Introduction to Data Preprocessing in Python#

1. Overview of Data Preprocessing#

Importance and Goals#

  • Importance: Essential for converting raw data into a format suitable for analysis.

  • Goals: Enhance data quality, improve analysis efficiency, and prepare data for machine learning.

Data Preprocessing Workflow#

  • Cleaning Data: Remove duplicates, correct errors.

  • Handling Missing Values: Impute missing values or remove them.

  • Normalization: Scale data using methods like Min-Max scaling or Z-score normalization.

  • Feature Engineering: Create new features from existing data.
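The workflow steps above can be sketched in a few lines of pandas. This is a minimal illustration on a hypothetical toy column (`score`), not the covid dataset:

```python
import pandas as pd

# Hypothetical toy frame used only to illustrate the workflow steps
df = pd.DataFrame({"score": [10.0, 20.0, None, 40.0, 40.0, 50.0]})

df = df.drop_duplicates()                             # cleaning: remove duplicate rows
df["score"] = df["score"].fillna(df["score"].mean())  # impute missing values with the mean

# normalization: Min-Max scaling to [0, 1]
df["score_minmax"] = (df["score"] - df["score"].min()) / (df["score"].max() - df["score"].min())

# normalization: Z-score standardization (mean 0, population std 1)
df["score_z"] = (df["score"] - df["score"].mean()) / df["score"].std(ddof=0)

print(df["score_minmax"].min(), df["score_minmax"].max())  # 0.0 1.0
```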

2. Understanding Data Types and Scales#

Data Types#

  • Numeric (Quantitative): Numbers representing continuous or discrete data.

  • Categorical (Qualitative): Data grouped into categories.

Scales#

  • Nominal: Categories without order (e.g., blood types).

  • Ordinal: Ordered categories (e.g., class levels).

  • Interval: Numeric scales without true zero (e.g., temperature in Celsius).

  • Ratio: Numeric scales with true zero (e.g., height).

3. Basic (Summary) Statistics in Python#

Setup for Activities#

#import relevant libraries
import numpy as np
import pandas as pd
from scipy import stats
#load data into dataframe
covid_data = pd.read_csv("covid_data.csv")
#inspect the first five and last five lines of the dataframe
covid_data.head()
iso_code continent location date total_cases new_cases new_cases_smoothed total_deaths new_deaths new_deaths_smoothed ... female_smokers male_smokers handwashing_facilities hospital_beds_per_thousand life_expectancy human_development_index excess_mortality_cumulative_absolute excess_mortality_cumulative excess_mortality excess_mortality_cumulative_per_million
0 AFG Asia Afghanistan 24/02/2020 5 5 NaN NaN NaN NaN ... NaN NaN 37.746 0.5 64.83 0.511 NaN NaN NaN NaN
1 AFG Asia Afghanistan 25/02/2020 5 0 NaN NaN NaN NaN ... NaN NaN 37.746 0.5 64.83 0.511 NaN NaN NaN NaN
2 AFG Asia Afghanistan 26/02/2020 5 0 NaN NaN NaN NaN ... NaN NaN 37.746 0.5 64.83 0.511 NaN NaN NaN NaN
3 AFG Asia Afghanistan 27/02/2020 5 0 NaN NaN NaN NaN ... NaN NaN 37.746 0.5 64.83 0.511 NaN NaN NaN NaN
4 AFG Asia Afghanistan 28/02/2020 5 0 NaN NaN NaN NaN ... NaN NaN 37.746 0.5 64.83 0.511 NaN NaN NaN NaN

5 rows × 67 columns

covid_data.tail()
iso_code continent location date total_cases new_cases new_cases_smoothed total_deaths new_deaths new_deaths_smoothed ... female_smokers male_smokers handwashing_facilities hospital_beds_per_thousand life_expectancy human_development_index excess_mortality_cumulative_absolute excess_mortality_cumulative excess_mortality excess_mortality_cumulative_per_million
5813 NGA Africa Nigeria 06/10/2022 265741 236 51.286 3155.0 0.0 0.0 ... 0.6 10.8 41.949 NaN 54.69 0.539 NaN NaN NaN NaN
5814 NGA Africa Nigeria 07/10/2022 265741 0 51.286 3155.0 0.0 0.0 ... 0.6 10.8 41.949 NaN 54.69 0.539 NaN NaN NaN NaN
5815 NGA Africa Nigeria 08/10/2022 265816 75 55.000 3155.0 0.0 0.0 ... 0.6 10.8 41.949 NaN 54.69 0.539 NaN NaN NaN NaN
5816 NGA Africa Nigeria 09/10/2022 265816 0 55.000 3155.0 0.0 0.0 ... 0.6 10.8 41.949 NaN 54.69 0.539 NaN NaN NaN NaN
5817 NGA Africa Nigeria 10/10/2022 265816 0 55.000 3155.0 0.0 0.0 ... 0.6 10.8 41.949 NaN 54.69 0.539 NaN NaN NaN NaN

5 rows × 67 columns

#display a concise summary of the dataframe (columns, non-null counts, dtypes)
covid_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5818 entries, 0 to 5817
Data columns (total 67 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   iso_code                                    5818 non-null   object 
 1   continent                                   5818 non-null   object 
 2   location                                    5818 non-null   object 
 3   date                                        5818 non-null   object 
 4   total_cases                                 5818 non-null   int64  
 5   new_cases                                   5818 non-null   int64  
 6   new_cases_smoothed                          5788 non-null   float64
 7   total_deaths                                5638 non-null   float64
 8   new_deaths                                  5627 non-null   float64
 9   new_deaths_smoothed                         5596 non-null   float64
 10  total_cases_per_million                     5818 non-null   float64
 11  new_cases_per_million                       5818 non-null   float64
 12  new_cases_smoothed_per_million              5788 non-null   float64
 13  total_deaths_per_million                    5638 non-null   float64
 14  new_deaths_per_million                      5627 non-null   float64
 15  new_deaths_smoothed_per_million             5596 non-null   float64
 16  reproduction_rate                           5566 non-null   float64
 17  icu_patients                                2610 non-null   float64
 18  icu_patients_per_million                    2610 non-null   float64
 19  hosp_patients                               2610 non-null   float64
 20  hosp_patients_per_million                   2610 non-null   float64
 21  weekly_icu_admissions                       0 non-null      float64
 22  weekly_icu_admissions_per_million           0 non-null      float64
 23  weekly_hosp_admissions                      934 non-null    float64
 24  weekly_hosp_admissions_per_million          934 non-null    float64
 25  total_tests                                 3174 non-null   float64
 26  new_tests                                   2948 non-null   float64
 27  total_tests_per_thousand                    3174 non-null   float64
 28  new_tests_per_thousand                      2948 non-null   float64
 29  new_tests_smoothed                          4114 non-null   float64
 30  new_tests_smoothed_per_thousand             4114 non-null   float64
 31  positive_rate                               3440 non-null   float64
 32  tests_per_case                              3440 non-null   float64
 33  tests_units                                 4156 non-null   object 
 34  total_vaccinations                          2104 non-null   float64
 35  people_vaccinated                           2051 non-null   float64
 36  people_fully_vaccinated                     2004 non-null   float64
 37  total_boosters                              1170 non-null   float64
 38  new_vaccinations                            1827 non-null   float64
 39  new_vaccinations_smoothed                   3658 non-null   float64
 40  total_vaccinations_per_hundred              2104 non-null   float64
 41  people_vaccinated_per_hundred               2051 non-null   float64
 42  people_fully_vaccinated_per_hundred         2004 non-null   float64
 43  total_boosters_per_hundred                  1170 non-null   float64
 44  new_vaccinations_smoothed_per_million       3658 non-null   float64
 45  new_people_vaccinated_smoothed              3658 non-null   float64
 46  new_people_vaccinated_smoothed_per_hundred  3658 non-null   float64
 47  stringency_index                            5699 non-null   float64
 48  population                                  5818 non-null   int64  
 49  population_density                          5818 non-null   float64
 50  median_age                                  5818 non-null   float64
 51  aged_65_older                               5818 non-null   float64
 52  aged_70_older                               5818 non-null   float64
 53  gdp_per_capita                              5818 non-null   float64
 54  extreme_poverty                             2922 non-null   float64
 55  cardiovasc_death_rate                       5818 non-null   float64
 56  diabetes_prevalence                         5818 non-null   float64
 57  female_smokers                              4860 non-null   float64
 58  male_smokers                                4860 non-null   float64
 59  handwashing_facilities                      1913 non-null   float64
 60  hospital_beds_per_thousand                  4863 non-null   float64
 61  life_expectancy                             5818 non-null   float64
 62  human_development_index                     5818 non-null   float64
 63  excess_mortality_cumulative_absolute        421 non-null    float64
 64  excess_mortality_cumulative                 421 non-null    float64
 65  excess_mortality                            421 non-null    float64
 66  excess_mortality_cumulative_per_million     421 non-null    float64
dtypes: float64(59), int64(3), object(5)
memory usage: 3.0+ MB
#create a dataframe containing only the relevant columns
covid_datanew = covid_data[['iso_code','continent','location','date','total_cases','new_cases']]
#displaying the new dataframe
covid_datanew.head(5)
iso_code continent location date total_cases new_cases
0 AFG Asia Afghanistan 24/02/2020 5 5
1 AFG Asia Afghanistan 25/02/2020 5 0
2 AFG Asia Afghanistan 26/02/2020 5 0
3 AFG Asia Afghanistan 27/02/2020 5 0
4 AFG Asia Afghanistan 28/02/2020 5 0
#investigating the data type of the dataframe
covid_datanew.dtypes
iso_code       object
continent      object
location       object
date           object
total_cases     int64
new_cases       int64
dtype: object
#displaying the shape of the dataframe (rows x columns)
covid_data.shape
(5818, 67)

Mean (Arithmetic Average)

  • Formula: \(\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i\)

  • Activity: Calculate the mean of ‘new_cases’ in the dataset.

#analyze the mean of the new_cases column using the np.mean() in numpy
newcase_mean = np.mean(covid_datanew["new_cases"])

print("New case (mean):", newcase_mean)
New case (mean): 8814.365761430045

Median (Middle Value in Sorted Data)

  • Activity: Find the median of ‘new_cases’ in the dataset

newcase_median = np.median(covid_datanew["new_cases"])

print("New case (median):", newcase_median)
New case (median): 261.0

Mode (Most Frequent Value)

  • Activity: Determine the mode of ‘new_cases’ in the dataset.

The stats.mode function from the SciPy library returns a ModeResult object with two fields:

The first field (mode): the most frequently occurring value in the dataset.
The second field (count): the number of times that value appears.

In older versions of SciPy both fields were returned as arrays, even when there was only one mode, so you will often see code like stats.mode(covid_datanew['new_cases'])[0][0]: the first [0] selects the array of mode values and the second [0] selects its first element. In recent versions of SciPy (where keepdims defaults to False) the fields are plain scalars, as in the output below, and the clearest way to read the result is through the named attributes:

stats.mode(covid_datanew['new_cases']).mode: the mode value itself.
stats.mode(covid_datanew['new_cases']).count: how many times it occurs.

Note that stats.mode always reports a single mode (the smallest, if several values tie); if you want every mode of a multi-modal column, pandas’ Series.mode() returns them all.

newcase_mode = stats.mode(covid_datanew["new_cases"])

print("New case (mode):", newcase_mode)
New case (mode): ModeResult(mode=0, count=805)

In the example above we used a numeric column to display the mode. Could you use a non-numeric column?#
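Yes. A quick sketch with a toy Series (hypothetical data, not the covid dataset) shows that pandas’ .mode() works on non-numeric data:

```python
import pandas as pd

# A toy Series of strings (hypothetical data, not from the covid dataset)
continents = pd.Series(["Asia", "Africa", "Asia", "Europe", "Asia", "Africa"])

# pandas' .mode() works on any dtype, including strings,
# and returns ALL modes as a Series (here there is just one)
print(continents.mode().tolist())  # ['Asia']
```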

Variance (σ²)

  • Formula: \(\sigma^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}\)

  • Activity: Compute the variance of ‘new_cases’.

#using numpy check the variance of the new_cases column
newcase_variance = np.var(covid_datanew["new_cases"])

print("New cases (variance:numpy):", newcase_variance)
New cases (variance:numpy): 451321915.9280954
#using the pandas .var() method (note: pandas defaults to ddof=1)
covid_datanew["new_cases"].var()
451399502.6421969

Standard Deviation (σ)

  • Formula: \(\sigma = \sqrt{\frac{\sum_{i=1}^n (x_i-\bar{x})^2}{n}}\)

  • Activity: Calculate the standard deviation for ‘new_cases’.

# Calculate the standard deviation using NumPy
# 'ddof=0' for population standard deviation; 'ddof=1' for sample standard deviation
newcase_stdev = np.std(covid_datanew["new_cases"], ddof=1)

print("New cases (stdev: numpy):", newcase_stdev)
New cases (stdev: numpy): 21246.164421895
covid_datanew["new_cases"].std()
21246.164421895

Why would there be a difference in the variance and standard deviation between NumPy and Pandas?#

The difference between the NumPy and pandas var methods has nothing to do with the data itself; it comes from the default degrees of freedom (ddof) each package uses. pandas defaults to ddof=1 (the unbiased sample estimator), while NumPy defaults to ddof=0 (the maximum-likelihood population estimator). RE: https://stackoverflow.com/questions/62938495/difference-between-numpy-var-and-pandas-var
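A minimal sketch of the ddof effect on a toy list (illustrative numbers, not the covid data):

```python
import numpy as np
import pandas as pd

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # toy numbers; mean is 5.0

pop_var = np.var(data)            # NumPy default ddof=0: divide by n (MLE)
samp_var = pd.Series(data).var()  # pandas default ddof=1: divide by n - 1

print(pop_var)   # 4.0
print(samp_var)  # 4.571428571428571
print(np.isclose(np.var(data, ddof=1), samp_var))  # True -- they agree once ddof matches
```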

Max and Min Range

The range plays a significant role in describing the variability of a data set, as long as there are no outliers. An outlier is an extremely high or low value that stands apart from the other values. If an outlier exists, the range by itself can be misleading.

#Calculate the range of the dataset using NumPy
covid_newcases_max = np.max(covid_datanew["new_cases"])
covid_newcases_min = np.min(covid_datanew["new_cases"])

print(covid_newcases_max, covid_newcases_min)
287149 0
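The range itself (max − min) can also be computed in a single call with NumPy’s np.ptp (“peak to peak”); a sketch on a toy array:

```python
import numpy as np

data = np.array([3, 7, 1, 9, 4])

# np.ptp ("peak to peak") returns max - min directly
print(np.ptp(data))             # 8
print(data.max() - data.min())  # 8, same result
```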

Why are Quartiles and Interquartile Range Important?#

Quartiles and the Interquartile Range (IQR) are essential in data analysis for several key reasons:

  1. Measure of Spread

    • Quartiles divide a dataset into four equal parts, providing insight into the distribution and variability of the data.

  2. Outlier Detection

    • The IQR is a robust measure of statistical dispersion and is commonly used for identifying outliers. Values that fall below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are often considered outliers.

  3. Non-parametric

    • Quartiles do not assume a normal distribution of data, making them non-parametric and robust measures for skewed distributions or data with outliers.

  4. Data Segmentation and Comparison

    • Quartiles allow for easy segmentation of data into groups, which is useful in various applications like finance and sales.

  5. Informative for Further Statistical Analysis

    • Understanding quartile positions helps in making informed decisions for further statistical analyses, especially with skewed data.

  6. Basis for Other Statistical Measures

    • Quartiles are foundational for other statistical visualizations like box plots, which depict quartiles and outliers graphically.

# Calculate quartiles
Q1 = np.quantile(covid_data["new_cases"],0.25)
Q3 = np.quantile(covid_data["new_cases"],0.75)

# Calculate the Interquartile Range
IQR = Q3 - Q1

print("Q1 (25th percentile):", Q1)
print("Q3 (75th percentile):", Q3)
print("Interquartile Range:", IQR)
Q1 (25th percentile): 24.0
Q3 (75th percentile): 3666.0
Interquartile Range: 3642.0
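The 1.5×IQR fences described above can be turned into a small outlier filter. A sketch on toy data (not the covid column):

```python
import numpy as np

data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 100])  # 100 is an obvious outlier

q1, q3 = np.quantile(data, [0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# keep only the points that fall outside the fences
outliers = data[(data < lower) | (data > upper)]
print(outliers)  # [100]
```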

Activity - Hands-On#

Use the data set located at https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho and do a summary statistical analysis using either Pandas or NumPy.

Calculate basic (summary) statistics for this data set

Additional Resources#

https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/

https://blog.quantinsti.com/data-preprocessing/

https://training.experfy.com/courses/data-pre-processing