Day 11: Introduction to Data Preprocessing in Python#
1. Overview of Data Preprocessing#
Importance and Goals#
Importance: Essential for converting raw data into a format suitable for analysis.
Goals: Enhance data quality, improve analysis efficiency, and prepare data for machine learning.
Data Preprocessing Workflow#
Cleaning Data: Remove duplicates, correct errors.
Handling Missing Values: Impute missing values or remove them.
Normalization: Scale data using methods like Min-Max scaling or Z-score normalization.
Feature Engineering: Create new features from existing data.
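A minimal sketch of what this workflow can look like with pandas and scikit-learn. Note that the DataFrame df and its columns below are made up for illustration and are not part of the COVID dataset used later in this lesson.
# NOTE: 'df' and its columns are hypothetical, purely to illustrate the four steps
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "age": [25, 32, 32, None, 51],
    "income": [40000, 52000, 52000, 61000, None],
})

# 1. Cleaning: drop exact duplicate rows
df = df.drop_duplicates()

# 2. Handling missing values: impute with each column's median
df = df.fillna(df.median(numeric_only=True))

# 3. Normalization: Min-Max scale numeric columns into [0, 1]
df[["age", "income"]] = MinMaxScaler().fit_transform(df[["age", "income"]])

# 4. Feature engineering: derive a new feature from an existing one
df["high_income"] = (df["income"] > 0.5).astype(int)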
2. Understanding Data Types and Scales#
Data Types#
Numeric (Quantitative): Numbers representing continuous or discrete data.
Categorical (Qualitative): Data grouped into categories.
Scales#
Nominal: Categories without order (e.g., blood types).
Ordinal: Ordered categories (e.g., class levels).
Interval: Numeric scales without true zero (e.g., temperature in Celsius).
Ratio: Numeric scales with true zero (e.g., height).
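One way these scales can be represented in pandas is sketched below; the example values are invented for illustration only.
# Nominal: categories with no inherent order
blood_type = pd.Series(["A", "O", "B", "AB"], dtype="category")

# Ordinal: ordered categories
class_level = pd.Categorical(["freshman", "senior", "junior"],
                             categories=["freshman", "sophomore", "junior", "senior"],
                             ordered=True)

# Interval: numeric with no true zero (0 °C does not mean "no temperature")
temperature_c = pd.Series([21.5, 23.0, 19.8])

# Ratio: numeric with a true zero, so ratios are meaningful
height_cm = pd.Series([162.0, 175.5, 181.2])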
3. Basic (Summary) Statistics in Python#
Setup for Activities#
Dataset: COVID data (covid_data.csv) from the 100daysofml/100daysofml.github.io repository.
Tools: Python with Pandas library.
#import relevant libraries
import numpy as np
import pandas as pd
from scipy import stats
#load data into dataframe
covid_data = pd.read_csv("covid_data.csv")
#inspect the first five and last five lines of the dataframe
covid_data.head()
  | iso_code | continent | location | date | total_cases | new_cases | new_cases_smoothed | total_deaths | new_deaths | new_deaths_smoothed | ... | female_smokers | male_smokers | handwashing_facilities | hospital_beds_per_thousand | life_expectancy | human_development_index | excess_mortality_cumulative_absolute | excess_mortality_cumulative | excess_mortality | excess_mortality_cumulative_per_million |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AFG | Asia | Afghanistan | 24/02/2020 | 5 | 5 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 37.746 | 0.5 | 64.83 | 0.511 | NaN | NaN | NaN | NaN |
1 | AFG | Asia | Afghanistan | 25/02/2020 | 5 | 0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 37.746 | 0.5 | 64.83 | 0.511 | NaN | NaN | NaN | NaN |
2 | AFG | Asia | Afghanistan | 26/02/2020 | 5 | 0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 37.746 | 0.5 | 64.83 | 0.511 | NaN | NaN | NaN | NaN |
3 | AFG | Asia | Afghanistan | 27/02/2020 | 5 | 0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 37.746 | 0.5 | 64.83 | 0.511 | NaN | NaN | NaN | NaN |
4 | AFG | Asia | Afghanistan | 28/02/2020 | 5 | 0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 37.746 | 0.5 | 64.83 | 0.511 | NaN | NaN | NaN | NaN |
5 rows × 67 columns
covid_data.tail()
  | iso_code | continent | location | date | total_cases | new_cases | new_cases_smoothed | total_deaths | new_deaths | new_deaths_smoothed | ... | female_smokers | male_smokers | handwashing_facilities | hospital_beds_per_thousand | life_expectancy | human_development_index | excess_mortality_cumulative_absolute | excess_mortality_cumulative | excess_mortality | excess_mortality_cumulative_per_million |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5813 | NGA | Africa | Nigeria | 06/10/2022 | 265741 | 236 | 51.286 | 3155.0 | 0.0 | 0.0 | ... | 0.6 | 10.8 | 41.949 | NaN | 54.69 | 0.539 | NaN | NaN | NaN | NaN |
5814 | NGA | Africa | Nigeria | 07/10/2022 | 265741 | 0 | 51.286 | 3155.0 | 0.0 | 0.0 | ... | 0.6 | 10.8 | 41.949 | NaN | 54.69 | 0.539 | NaN | NaN | NaN | NaN |
5815 | NGA | Africa | Nigeria | 08/10/2022 | 265816 | 75 | 55.000 | 3155.0 | 0.0 | 0.0 | ... | 0.6 | 10.8 | 41.949 | NaN | 54.69 | 0.539 | NaN | NaN | NaN | NaN |
5816 | NGA | Africa | Nigeria | 09/10/2022 | 265816 | 0 | 55.000 | 3155.0 | 0.0 | 0.0 | ... | 0.6 | 10.8 | 41.949 | NaN | 54.69 | 0.539 | NaN | NaN | NaN | NaN |
5817 | NGA | Africa | Nigeria | 10/10/2022 | 265816 | 0 | 55.000 | 3155.0 | 0.0 | 0.0 | ... | 0.6 | 10.8 | 41.949 | NaN | 54.69 | 0.539 | NaN | NaN | NaN | NaN |
5 rows × 67 columns
#display the column names, non-null counts, and data types
covid_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5818 entries, 0 to 5817
Data columns (total 67 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 iso_code 5818 non-null object
1 continent 5818 non-null object
2 location 5818 non-null object
3 date 5818 non-null object
4 total_cases 5818 non-null int64
5 new_cases 5818 non-null int64
6 new_cases_smoothed 5788 non-null float64
7 total_deaths 5638 non-null float64
8 new_deaths 5627 non-null float64
9 new_deaths_smoothed 5596 non-null float64
10 total_cases_per_million 5818 non-null float64
11 new_cases_per_million 5818 non-null float64
12 new_cases_smoothed_per_million 5788 non-null float64
13 total_deaths_per_million 5638 non-null float64
14 new_deaths_per_million 5627 non-null float64
15 new_deaths_smoothed_per_million 5596 non-null float64
16 reproduction_rate 5566 non-null float64
17 icu_patients 2610 non-null float64
18 icu_patients_per_million 2610 non-null float64
19 hosp_patients 2610 non-null float64
20 hosp_patients_per_million 2610 non-null float64
21 weekly_icu_admissions 0 non-null float64
22 weekly_icu_admissions_per_million 0 non-null float64
23 weekly_hosp_admissions 934 non-null float64
24 weekly_hosp_admissions_per_million 934 non-null float64
25 total_tests 3174 non-null float64
26 new_tests 2948 non-null float64
27 total_tests_per_thousand 3174 non-null float64
28 new_tests_per_thousand 2948 non-null float64
29 new_tests_smoothed 4114 non-null float64
30 new_tests_smoothed_per_thousand 4114 non-null float64
31 positive_rate 3440 non-null float64
32 tests_per_case 3440 non-null float64
33 tests_units 4156 non-null object
34 total_vaccinations 2104 non-null float64
35 people_vaccinated 2051 non-null float64
36 people_fully_vaccinated 2004 non-null float64
37 total_boosters 1170 non-null float64
38 new_vaccinations 1827 non-null float64
39 new_vaccinations_smoothed 3658 non-null float64
40 total_vaccinations_per_hundred 2104 non-null float64
41 people_vaccinated_per_hundred 2051 non-null float64
42 people_fully_vaccinated_per_hundred 2004 non-null float64
43 total_boosters_per_hundred 1170 non-null float64
44 new_vaccinations_smoothed_per_million 3658 non-null float64
45 new_people_vaccinated_smoothed 3658 non-null float64
46 new_people_vaccinated_smoothed_per_hundred 3658 non-null float64
47 stringency_index 5699 non-null float64
48 population 5818 non-null int64
49 population_density 5818 non-null float64
50 median_age 5818 non-null float64
51 aged_65_older 5818 non-null float64
52 aged_70_older 5818 non-null float64
53 gdp_per_capita 5818 non-null float64
54 extreme_poverty 2922 non-null float64
55 cardiovasc_death_rate 5818 non-null float64
56 diabetes_prevalence 5818 non-null float64
57 female_smokers 4860 non-null float64
58 male_smokers 4860 non-null float64
59 handwashing_facilities 1913 non-null float64
60 hospital_beds_per_thousand 4863 non-null float64
61 life_expectancy 5818 non-null float64
62 human_development_index 5818 non-null float64
63 excess_mortality_cumulative_absolute 421 non-null float64
64 excess_mortality_cumulative 421 non-null float64
65 excess_mortality 421 non-null float64
66 excess_mortality_cumulative_per_million 421 non-null float64
dtypes: float64(59), int64(3), object(5)
memory usage: 3.0+ MB
#select the relevant columns into a new dataframe
covid_datanew = covid_data[['iso_code','continent','location','date','total_cases','new_cases']]
#displaying the new dataframe
covid_datanew.head(5)
  | iso_code | continent | location | date | total_cases | new_cases |
---|---|---|---|---|---|---|
0 | AFG | Asia | Afghanistan | 24/02/2020 | 5 | 5 |
1 | AFG | Asia | Afghanistan | 25/02/2020 | 5 | 0 |
2 | AFG | Asia | Afghanistan | 26/02/2020 | 5 | 0 |
3 | AFG | Asia | Afghanistan | 27/02/2020 | 5 | 0 |
4 | AFG | Asia | Afghanistan | 28/02/2020 | 5 | 0 |
#investigating the data type of the dataframe
covid_datanew.dtypes
iso_code object
continent object
location object
date object
total_cases int64
new_cases int64
dtype: object
#displaying the shape of the dataframe (rows x columns)
covid_data.shape
(5818, 67)
Mean (Arithmetic Average)
Formula: \(\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i\)
Activity: Calculate the mean of ‘new_cases’ in the dataset.
#analyze the mean of the new_cases column using the np.mean() in numpy
newcase_mean = np.mean(covid_datanew["new_cases"])
print("New case (mean):", newcase_mean)
New case (mean): 8814.365761430045
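As a quick sanity check, the same value can be reproduced directly from the formula above: sum the values and divide by the number of observations.
# reproduce the mean by hand: sum of the values divided by n
n = len(covid_datanew["new_cases"])
manual_mean = covid_datanew["new_cases"].sum() / n
print("New case (mean, manual):", manual_mean)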
Median (Middle Value in Sorted Data)
Activity: Find the median of ‘new_cases’ in the dataset
newcase_median = np.median(covid_datanew["new_cases"])
print("New case (median):", newcase_median)
New case (median): 261.0
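The median is simply the middle element of the sorted values (or the average of the two middle elements when the count is even), which can also be checked by hand:
# check the median manually: middle element(s) of the sorted values
sorted_cases = np.sort(covid_datanew["new_cases"].to_numpy())
n = len(sorted_cases)
if n % 2 == 1:
    manual_median = sorted_cases[n // 2]
else:
    manual_median = (sorted_cases[n // 2 - 1] + sorted_cases[n // 2]) / 2
print("New case (median, manual):", manual_median)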
Mode (Most Frequent Value)
Activity: Determine the mode of 'new_cases'.
The stats.mode function from the SciPy library returns a ModeResult object, which contains two fields:
mode: the most frequently occurring value(s) in the dataset.
count: the number of times the mode value(s) appears in the dataset.
In older SciPy releases both fields were returned as arrays, even when there was only one mode, so the value was typically extracted with stats.mode(covid_datanew['new_cases'])[0][0]: the first [0] selects the array of mode values, and the second [0] selects its first element. The array form exists because mode can operate along an axis of a multi-dimensional array, in which case it returns one mode per slice.
In recent SciPy releases the default behaviour changed (keepdims=False), so for a one-dimensional input like ours the fields are plain scalars, as the output below shows. In that case the mode can be read directly as newcase_mode.mode (or newcase_mode[0]).
So, in the context of the code:
stats.mode(covid_datanew['new_cases']): returns a ModeResult object holding the mode and its count.
stats.mode(covid_datanew['new_cases']).mode: accesses the mode value itself.
newcase_mode = stats.mode(covid_datanew["new_cases"])
print("New case (mode):", newcase_mode)
New case (mode): ModeResult(mode=0, count=805)
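To pull the individual numbers out of the ModeResult, read its mode and count fields; pandas' own .mode() method is an alternative that returns a Series of the most frequent value(s).
# read the mode value and its count from the ModeResult fields
print("New case (mode value):", newcase_mode.mode)
print("New case (mode count):", newcase_mode.count)

# pandas alternative: .mode() returns a Series containing the most frequent value(s)
print(covid_datanew["new_cases"].mode())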
In the example above we used a numeric column to compute the mode. Could you use a non-numeric column?#
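Yes. The mode is the one measure of central tendency that also applies to nominal (categorical) data. pandas computes it directly on a non-numeric column; whether scipy.stats.mode accepts non-numeric input depends on the SciPy version, so pandas is the safer choice here. A small sketch using the continent column of this dataset:
# mode of a non-numeric (categorical) column
print(covid_datanew["continent"].mode())

# frequency table for context: how often each continent appears
print(covid_datanew["continent"].value_counts())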
Variance (σ²)
Formula: \(\sigma^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}\)
Activity: Compute the variance of 'new_cases'.
#using numpy check the variance of the new_cases column
newcase_variance = np.var(covid_datanew["new_cases"])
print("New cases (variance:numpy):", newcase_variance)
New cases (variance:numpy): 451321915.9280954
#using pandas check the variance of the new_cases column
covid_datanew["new_cases"].var()
451399502.6421969
Standard Deviation (σ)
Formula: \(\sigma = \sqrt{\frac{\sum_{i=1}^n (x_i-\bar{x})^2}{n}}\)
Activity: Calculate the standard deviation of 'new_cases'.
# Calculate the standard deviation using NumPy
# 'ddof=0' for population standard deviation; 'ddof=1' for sample standard deviation
newcase_stdev = np.std(covid_datanew["new_cases"], ddof=1)
print("New cases (stdev: numpy):", newcase_stdev)
New cases (stdev: numpy): 21246.164421895
covid_datanew["new_cases"].std()
21246.164421895
Why would there be a difference in the variance and standard deviation between NumPy and Pandas?#
The difference between the NumPy var and pandas var methods is not dependent on the range of the data but on the default degrees of freedom (ddof) set by each package: pandas uses ddof=1 (the unbiased sample estimator) while NumPy uses ddof=0 (the maximum likelihood, or population, estimator). RE: https://stackoverflow.com/questions/62938495/difference-between-numpy-var-and-pandas-var
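The two libraries can be made to agree by setting ddof explicitly on either side:
# make NumPy use the sample (unbiased) estimator, as pandas does by default
print(np.var(covid_datanew["new_cases"], ddof=1))

# make pandas use the population (MLE) estimator, as NumPy does by default
print(covid_datanew["new_cases"].var(ddof=0))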
Max and Min Range
The range plays a significant role in describing the variability of a data set, as long as there are no outliers. An outlier is an extreme high or low value that stands apart from the other values. If an outlier exists, the range by itself can be misleading.
#Calculate the range of the dataset using NumPy
covid_newcases_max = np.max(covid_datanew["new_cases"])
covid_newcases_min = np.min(covid_datanew["new_cases"])
print(covid_newcases_max, covid_newcases_min)
287149 0
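The range itself is the difference between these two values; NumPy's np.ptp ("peak to peak") computes the same thing in one call.
# range = max - min
covid_newcases_range = covid_newcases_max - covid_newcases_min
print("New cases (range):", covid_newcases_range)

# equivalent one-liner with NumPy's peak-to-peak function
print("New cases (range, np.ptp):", np.ptp(covid_datanew["new_cases"].to_numpy()))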
Why are Quartiles and Interquartile Range Important?#
Quartiles and the Interquartile Range (IQR) are essential in data analysis for several key reasons:
Measure of Spread
Quartiles divide a dataset into four equal parts, providing insight into the distribution and variability of the data.
Outlier Detection
The IQR is a robust measure of statistical dispersion and is commonly used for identifying outliers. Values that fall below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are often considered outliers (see the sketch after the quartile calculation below).
Non-parametric
Quartiles do not assume a normal distribution of data, making them non-parametric and robust measures for skewed distributions or data with outliers.
Data Segmentation and Comparison
Quartiles allow for easy segmentation of data into groups, which is useful in various applications like finance and sales.
Informative for Further Statistical Analysis
Understanding quartile positions helps in making informed decisions for further statistical analyses, especially with skewed data.
Basis for Other Statistical Measures
Quartiles are foundational for other statistical visualizations like box plots, which depict quartiles and outliers graphically.
# Calculate quartiles
Q1 = np.quantile(covid_data["new_cases"],0.25)
Q3 = np.quantile(covid_data["new_cases"],0.75)
# Calculate the Interquartile Range
IQR = Q3 - Q1
print("Q1 (25th percentile):", Q1)
print("Q3 (75th percentile):", Q3)
print("Interquartile Range:", IQR)
Q1 (25th percentile): 24.0
Q3 (75th percentile): 3666.0
Interquartile Range: 3642.0
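Applying the 1.5*IQR rule mentioned earlier, here is a short sketch that flags potential outliers in new_cases using the quartiles just computed.
# outlier fences derived from the quartiles
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# rows whose new_cases fall outside the fences are potential outliers
outliers = covid_data[(covid_data["new_cases"] < lower_bound) |
                      (covid_data["new_cases"] > upper_bound)]
print("Lower bound:", lower_bound)
print("Upper bound:", upper_bound)
print("Number of potential outliers:", len(outliers))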
Activity - Hands-On#
Use the data set located at https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho and do a summary statistical analysis using either pandas or NumPy.
Calculate basic (summary) statistics for this data set
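A possible starting point is sketched below; the CSV filename is a placeholder, since the Kaggle page offers several files, so adjust the path to whichever file you download.
import pandas as pd

# load the downloaded file; "car_data.csv" is a placeholder filename
cars = pd.read_csv("car_data.csv")

# structural overview: columns, non-null counts, dtypes
cars.info()

# summary statistics (count, mean, std, min, quartiles, max) for the numeric columns
print(cars.describe())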
Additional Resources#
https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/