{ "cells": [ { "cell_type": "markdown", "id": "08cd137d-05ef-4f0c-8a4e-b67704001831", "metadata": {}, "source": [ "# Day 11: Introduction to Data Preprocessing in Python\n", "\n", "## 1. Overview of Data Preprocessing\n", "\n", "### Importance and Goals\n", "- **Importance**: Essential for converting raw data into a format suitable for analysis.\n", "- **Goals**: Enhance data quality, improve analysis efficiency, and prepare data for machine learning." ] }, { "cell_type": "markdown", "id": "20dd5790-99ff-44f5-8659-17916d092086", "metadata": {}, "source": [ "### Data Preprocessing Workflow\n", "- **Cleaning Data**: Remove duplicates, correct errors.\n", "- **Handling Missing Values**: Impute missing values or remove them.\n", "- **Normalization**: Scale data using methods like Min-Max scaling or Z-score normalization.\n", "- **Feature Engineering**: Create new features from existing data." ] }, { "cell_type": "markdown", "id": "8a233f09-4fad-446b-acbb-c6e1c6c2ac92", "metadata": { "tags": [] }, "source": [ "## 2. Understanding Data Types and Scales\n", "\n", "### Data Types\n", "- **Numeric (Quantitative)**: Numbers representing continuous or discrete data.\n", "- **Categorical (Qualitative)**: Data grouped into categories.\n", "\n", "### Scales\n", "- **Nominal**: Categories without order (e.g., blood types).\n", "- **Ordinal**: Ordered categories (e.g., class levels).\n", "- **Interval**: Numeric scales without true zero (e.g., temperature in Celsius).\n", "- **Ratio**: Numeric scales with true zero (e.g., height)." ] }, { "cell_type": "markdown", "id": "d8758428-4aa9-45b7-941c-bc7d7d6d05da", "metadata": {}, "source": [ "## 3. Basic (Summary) Statistics in Python\n", "\n", "### Setup for Activities\n", "- **Dataset**: Covid Data. \n", "https://github.com/100daysofml/100daysofml.github.io/blob/main/content/Week_03/covid_data.csv\n", "- **Tools**: Python with Pandas library." 
] }, { "cell_type": "code", "execution_count": 1, "id": "f2e59158-1f09-4f06-b096-dcdd4afd6a94", "metadata": {}, "outputs": [], "source": [ "#import relevant libraries\n", "import numpy as np\n", "import pandas as pd\n", "from scipy import stats" ] }, { "cell_type": "code", "execution_count": 2, "id": "b60b68c5-0a95-4e29-a5e7-ac07b6400133", "metadata": {}, "outputs": [], "source": [ "#load data into dataframe\n", "covid_data = pd.read_csv(\"covid_data.csv\")" ] }, { "cell_type": "code", "execution_count": 3, "id": "92ddfe37-3de9-4b56-af96-1848ed7fa99d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>iso_code</th><th>continent</th><th>location</th><th>date</th><th>total_cases</th><th>new_cases</th><th>new_cases_smoothed</th><th>total_deaths</th><th>new_deaths</th><th>new_deaths_smoothed</th><th>...</th><th>female_smokers</th><th>male_smokers</th><th>handwashing_facilities</th><th>hospital_beds_per_thousand</th><th>life_expectancy</th><th>human_development_index</th><th>excess_mortality_cumulative_absolute</th><th>excess_mortality_cumulative</th><th>excess_mortality</th><th>excess_mortality_cumulative_per_million</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>AFG</td><td>Asia</td><td>Afghanistan</td><td>24/02/2020</td><td>5</td><td>5</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>...</td><td>NaN</td><td>NaN</td><td>37.746</td><td>0.5</td><td>64.83</td><td>0.511</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"    <tr><th>1</th><td>AFG</td><td>Asia</td><td>Afghanistan</td><td>25/02/2020</td><td>5</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>...</td><td>NaN</td><td>NaN</td><td>37.746</td><td>0.5</td><td>64.83</td><td>0.511</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"    <tr><th>2</th><td>AFG</td><td>Asia</td><td>Afghanistan</td><td>26/02/2020</td><td>5</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>...</td><td>NaN</td><td>NaN</td><td>37.746</td><td>0.5</td><td>64.83</td><td>0.511</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"    <tr><th>3</th><td>AFG</td><td>Asia</td><td>Afghanistan</td><td>27/02/2020</td><td>5</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>...</td><td>NaN</td><td>NaN</td><td>37.746</td><td>0.5</td><td>64.83</td><td>0.511</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"    <tr><th>4</th><td>AFG</td><td>Asia</td><td>Afghanistan</td><td>28/02/2020</td><td>5</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>...</td><td>NaN</td><td>NaN</td><td>37.746</td><td>0.5</td><td>64.83</td><td>0.511</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>5 rows × 67 columns</p>\n",
"</div>
" ], "text/plain": [ " iso_code continent location date total_cases new_cases \\\n", "0 AFG Asia Afghanistan 24/02/2020 5 5 \n", "1 AFG Asia Afghanistan 25/02/2020 5 0 \n", "2 AFG Asia Afghanistan 26/02/2020 5 0 \n", "3 AFG Asia Afghanistan 27/02/2020 5 0 \n", "4 AFG Asia Afghanistan 28/02/2020 5 0 \n", "\n", " new_cases_smoothed total_deaths new_deaths new_deaths_smoothed ... \\\n", "0 NaN NaN NaN NaN ... \n", "1 NaN NaN NaN NaN ... \n", "2 NaN NaN NaN NaN ... \n", "3 NaN NaN NaN NaN ... \n", "4 NaN NaN NaN NaN ... \n", "\n", " female_smokers male_smokers handwashing_facilities \\\n", "0 NaN NaN 37.746 \n", "1 NaN NaN 37.746 \n", "2 NaN NaN 37.746 \n", "3 NaN NaN 37.746 \n", "4 NaN NaN 37.746 \n", "\n", " hospital_beds_per_thousand life_expectancy human_development_index \\\n", "0 0.5 64.83 0.511 \n", "1 0.5 64.83 0.511 \n", "2 0.5 64.83 0.511 \n", "3 0.5 64.83 0.511 \n", "4 0.5 64.83 0.511 \n", "\n", " excess_mortality_cumulative_absolute excess_mortality_cumulative \\\n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "3 NaN NaN \n", "4 NaN NaN \n", "\n", " excess_mortality excess_mortality_cumulative_per_million \n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "3 NaN NaN \n", "4 NaN NaN \n", "\n", "[5 rows x 67 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#inspect the first five and last five lines of the dataframe\n", "covid_data.head()" ] }, { "cell_type": "code", "execution_count": 4, "id": "24d31f82-7b50-410b-94b7-5055dbc7a963", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>iso_code</th><th>continent</th><th>location</th><th>date</th><th>total_cases</th><th>new_cases</th><th>new_cases_smoothed</th><th>total_deaths</th><th>new_deaths</th><th>new_deaths_smoothed</th><th>...</th><th>female_smokers</th><th>male_smokers</th><th>handwashing_facilities</th><th>hospital_beds_per_thousand</th><th>life_expectancy</th><th>human_development_index</th><th>excess_mortality_cumulative_absolute</th><th>excess_mortality_cumulative</th><th>excess_mortality</th><th>excess_mortality_cumulative_per_million</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>5813</th><td>NGA</td><td>Africa</td><td>Nigeria</td><td>06/10/2022</td><td>265741</td><td>236</td><td>51.286</td><td>3155.0</td><td>0.0</td><td>0.0</td><td>...</td><td>0.6</td><td>10.8</td><td>41.949</td><td>NaN</td><td>54.69</td><td>0.539</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"    <tr><th>5814</th><td>NGA</td><td>Africa</td><td>Nigeria</td><td>07/10/2022</td><td>265741</td><td>0</td><td>51.286</td><td>3155.0</td><td>0.0</td><td>0.0</td><td>...</td><td>0.6</td><td>10.8</td><td>41.949</td><td>NaN</td><td>54.69</td><td>0.539</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"    <tr><th>5815</th><td>NGA</td><td>Africa</td><td>Nigeria</td><td>08/10/2022</td><td>265816</td><td>75</td><td>55.000</td><td>3155.0</td><td>0.0</td><td>0.0</td><td>...</td><td>0.6</td><td>10.8</td><td>41.949</td><td>NaN</td><td>54.69</td><td>0.539</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"    <tr><th>5816</th><td>NGA</td><td>Africa</td><td>Nigeria</td><td>09/10/2022</td><td>265816</td><td>0</td><td>55.000</td><td>3155.0</td><td>0.0</td><td>0.0</td><td>...</td><td>0.6</td><td>10.8</td><td>41.949</td><td>NaN</td><td>54.69</td><td>0.539</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"    <tr><th>5817</th><td>NGA</td><td>Africa</td><td>Nigeria</td><td>10/10/2022</td><td>265816</td><td>0</td><td>55.000</td><td>3155.0</td><td>0.0</td><td>0.0</td><td>...</td><td>0.6</td><td>10.8</td><td>41.949</td><td>NaN</td><td>54.69</td><td>0.539</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>5 rows × 67 columns</p>\n",
"</div>
" ], "text/plain": [ " iso_code continent location date total_cases new_cases \\\n", "5813 NGA Africa Nigeria 06/10/2022 265741 236 \n", "5814 NGA Africa Nigeria 07/10/2022 265741 0 \n", "5815 NGA Africa Nigeria 08/10/2022 265816 75 \n", "5816 NGA Africa Nigeria 09/10/2022 265816 0 \n", "5817 NGA Africa Nigeria 10/10/2022 265816 0 \n", "\n", " new_cases_smoothed total_deaths new_deaths new_deaths_smoothed ... \\\n", "5813 51.286 3155.0 0.0 0.0 ... \n", "5814 51.286 3155.0 0.0 0.0 ... \n", "5815 55.000 3155.0 0.0 0.0 ... \n", "5816 55.000 3155.0 0.0 0.0 ... \n", "5817 55.000 3155.0 0.0 0.0 ... \n", "\n", " female_smokers male_smokers handwashing_facilities \\\n", "5813 0.6 10.8 41.949 \n", "5814 0.6 10.8 41.949 \n", "5815 0.6 10.8 41.949 \n", "5816 0.6 10.8 41.949 \n", "5817 0.6 10.8 41.949 \n", "\n", " hospital_beds_per_thousand life_expectancy human_development_index \\\n", "5813 NaN 54.69 0.539 \n", "5814 NaN 54.69 0.539 \n", "5815 NaN 54.69 0.539 \n", "5816 NaN 54.69 0.539 \n", "5817 NaN 54.69 0.539 \n", "\n", " excess_mortality_cumulative_absolute excess_mortality_cumulative \\\n", "5813 NaN NaN \n", "5814 NaN NaN \n", "5815 NaN NaN \n", "5816 NaN NaN \n", "5817 NaN NaN \n", "\n", " excess_mortality excess_mortality_cumulative_per_million \n", "5813 NaN NaN \n", "5814 NaN NaN \n", "5815 NaN NaN \n", "5816 NaN NaN \n", "5817 NaN NaN \n", "\n", "[5 rows x 67 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "covid_data.tail()" ] }, { "cell_type": "code", "execution_count": 5, "id": "0c66c4b6-270b-45a5-ae19-34eeadf35d33", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 5818 entries, 0 to 5817\n", "Data columns (total 67 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 iso_code 5818 non-null object \n", " 1 continent 5818 non-null object \n", " 2 location 5818 non-null object \n", " 3 date 5818 non-null 
object \n", " 4 total_cases 5818 non-null int64 \n", " 5 new_cases 5818 non-null int64 \n", " 6 new_cases_smoothed 5788 non-null float64\n", " 7 total_deaths 5638 non-null float64\n", " 8 new_deaths 5627 non-null float64\n", " 9 new_deaths_smoothed 5596 non-null float64\n", " 10 total_cases_per_million 5818 non-null float64\n", " 11 new_cases_per_million 5818 non-null float64\n", " 12 new_cases_smoothed_per_million 5788 non-null float64\n", " 13 total_deaths_per_million 5638 non-null float64\n", " 14 new_deaths_per_million 5627 non-null float64\n", " 15 new_deaths_smoothed_per_million 5596 non-null float64\n", " 16 reproduction_rate 5566 non-null float64\n", " 17 icu_patients 2610 non-null float64\n", " 18 icu_patients_per_million 2610 non-null float64\n", " 19 hosp_patients 2610 non-null float64\n", " 20 hosp_patients_per_million 2610 non-null float64\n", " 21 weekly_icu_admissions 0 non-null float64\n", " 22 weekly_icu_admissions_per_million 0 non-null float64\n", " 23 weekly_hosp_admissions 934 non-null float64\n", " 24 weekly_hosp_admissions_per_million 934 non-null float64\n", " 25 total_tests 3174 non-null float64\n", " 26 new_tests 2948 non-null float64\n", " 27 total_tests_per_thousand 3174 non-null float64\n", " 28 new_tests_per_thousand 2948 non-null float64\n", " 29 new_tests_smoothed 4114 non-null float64\n", " 30 new_tests_smoothed_per_thousand 4114 non-null float64\n", " 31 positive_rate 3440 non-null float64\n", " 32 tests_per_case 3440 non-null float64\n", " 33 tests_units 4156 non-null object \n", " 34 total_vaccinations 2104 non-null float64\n", " 35 people_vaccinated 2051 non-null float64\n", " 36 people_fully_vaccinated 2004 non-null float64\n", " 37 total_boosters 1170 non-null float64\n", " 38 new_vaccinations 1827 non-null float64\n", " 39 new_vaccinations_smoothed 3658 non-null float64\n", " 40 total_vaccinations_per_hundred 2104 non-null float64\n", " 41 people_vaccinated_per_hundred 2051 non-null float64\n", " 42 
people_fully_vaccinated_per_hundred 2004 non-null float64\n", " 43 total_boosters_per_hundred 1170 non-null float64\n", " 44 new_vaccinations_smoothed_per_million 3658 non-null float64\n", " 45 new_people_vaccinated_smoothed 3658 non-null float64\n", " 46 new_people_vaccinated_smoothed_per_hundred 3658 non-null float64\n", " 47 stringency_index 5699 non-null float64\n", " 48 population 5818 non-null int64 \n", " 49 population_density 5818 non-null float64\n", " 50 median_age 5818 non-null float64\n", " 51 aged_65_older 5818 non-null float64\n", " 52 aged_70_older 5818 non-null float64\n", " 53 gdp_per_capita 5818 non-null float64\n", " 54 extreme_poverty 2922 non-null float64\n", " 55 cardiovasc_death_rate 5818 non-null float64\n", " 56 diabetes_prevalence 5818 non-null float64\n", " 57 female_smokers 4860 non-null float64\n", " 58 male_smokers 4860 non-null float64\n", " 59 handwashing_facilities 1913 non-null float64\n", " 60 hospital_beds_per_thousand 4863 non-null float64\n", " 61 life_expectancy 5818 non-null float64\n", " 62 human_development_index 5818 non-null float64\n", " 63 excess_mortality_cumulative_absolute 421 non-null float64\n", " 64 excess_mortality_cumulative 421 non-null float64\n", " 65 excess_mortality 421 non-null float64\n", " 66 excess_mortality_cumulative_per_million 421 non-null float64\n", "dtypes: float64(59), int64(3), object(5)\n", "memory usage: 3.0+ MB\n" ] } ], "source": [ "#show all columns in pandas\n", "covid_data.info()" ] }, { "cell_type": "code", "execution_count": 6, "id": "31e50afe-36f7-4539-9875-011ac83e4f5e", "metadata": {}, "outputs": [], "source": [ "#create a dataframe that loads relevant columns\n", "covid_datanew = covid_data[['iso_code','continent','location','date','total_cases','new_cases']]" ] }, { "cell_type": "code", "execution_count": 7, "id": "e5badd24-9263-4d29-9067-83e4fc7dd5eb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>iso_code</th><th>continent</th><th>location</th><th>date</th><th>total_cases</th><th>new_cases</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>AFG</td><td>Asia</td><td>Afghanistan</td><td>24/02/2020</td><td>5</td><td>5</td></tr>\n",
"    <tr><th>1</th><td>AFG</td><td>Asia</td><td>Afghanistan</td><td>25/02/2020</td><td>5</td><td>0</td></tr>\n",
"    <tr><th>2</th><td>AFG</td><td>Asia</td><td>Afghanistan</td><td>26/02/2020</td><td>5</td><td>0</td></tr>\n",
"    <tr><th>3</th><td>AFG</td><td>Asia</td><td>Afghanistan</td><td>27/02/2020</td><td>5</td><td>0</td></tr>\n",
"    <tr><th>4</th><td>AFG</td><td>Asia</td><td>Afghanistan</td><td>28/02/2020</td><td>5</td><td>0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " iso_code continent location date total_cases new_cases\n", "0 AFG Asia Afghanistan 24/02/2020 5 5\n", "1 AFG Asia Afghanistan 25/02/2020 5 0\n", "2 AFG Asia Afghanistan 26/02/2020 5 0\n", "3 AFG Asia Afghanistan 27/02/2020 5 0\n", "4 AFG Asia Afghanistan 28/02/2020 5 0" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#displaying the new dataframe\n", "covid_datanew.head(5)" ] }, { "cell_type": "code", "execution_count": 8, "id": "07fadcb7-3e53-46f4-87f7-fb4416667bea", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "iso_code object\n", "continent object\n", "location object\n", "date object\n", "total_cases int64\n", "new_cases int64\n", "dtype: object" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#investigating the data type of the dataframe\n", "covid_datanew.dtypes" ] }, { "cell_type": "code", "execution_count": 9, "id": "4d953f37-2824-44c3-81bb-f93f3ae6daa3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(5818, 67)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#displaying the shape of the dataframe (rows x columns)\n", "covid_data.shape" ] }, { "cell_type": "markdown", "id": "cf60d23b-38e5-4b99-8e84-8dfeef026f4b", "metadata": {}, "source": [ "**Mean (Arithmetic Average)**\n", " * Formula: $(\\bar{x} = \\frac{1}{n}\\sum_{i=1}^{n}x_i$)\n", " * Activity: Calculate the mean of 'new_cases' in the dataset." 
] }, { "cell_type": "code", "execution_count": 10, "id": "ad79875b-3caa-4a0c-9519-b070582113f7", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "New case (mean): 8814.365761430045\n" ] } ], "source": [ "#calculate the mean of the new_cases column using np.mean() from NumPy\n", "newcase_mean = np.mean(covid_datanew[\"new_cases\"])\n", "\n", "print(\"New case (mean):\", newcase_mean)" ] },
{ "cell_type": "markdown", "id": "578a07f1-faaf-4733-9d44-6c8590b688e7", "metadata": {}, "source": [ "**Median (Middle Value in Sorted Data)**\n", " * Activity: Find the median of 'new_cases' in the dataset." ] },
{ "cell_type": "code", "execution_count": 11, "id": "d11fb256-7079-4f9b-8746-e20d88e1a7aa", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "New case (median): 261.0\n" ] } ], "source": [ "newcase_median = np.median(covid_datanew[\"new_cases\"])\n", "\n", "print(\"New case (median):\", newcase_median)" ] },
{ "cell_type": "markdown", "id": "769e02ca-01ef-4a96-b1f1-d5519acd6aa0", "metadata": {}, "source": [ "**Mode (Most Frequent Value)**\n", " * Activity: Determine the mode of 'new_cases'." ] },
{ "cell_type": "markdown", "id": "118bb092-609e-42ab-a1c8-73e7b094633c", "metadata": {}, "source": [ "The `stats.mode` function from the SciPy library returns a `ModeResult` object with two fields:\n", "\n", "- `mode`: the most frequently occurring value in the dataset.\n", "- `count`: the number of times that value appears in the dataset.\n", "\n", "For a one-dimensional input such as `covid_datanew['new_cases']`, the SciPy version used in this notebook returns both fields as scalars, as the output of the next cell shows (`mode=0, count=805`). So, in the context of the code:\n", "\n", "- `stats.mode(covid_datanew['new_cases'])`: returns a `ModeResult` object with the mode and its count.\n", "- `stats.mode(covid_datanew['new_cases'])[0]`, or equivalently `.mode`: gives the mode value itself.\n", "- `stats.mode(covid_datanew['new_cases'])[1]`, or equivalently `.count`: gives the number of occurrences.\n", "\n", "Older SciPy releases returned one-element arrays instead of scalars, so code written for them needed `[0][0]` to reach the actual mode value; applied to a scalar result, that extra index raises an error. Note also that `stats.mode` reports a single mode: if several values are tied for most frequent, only one of them is returned." ] },
{ "cell_type": "code", "execution_count": 12, "id": "e9f49272-b91f-4ea7-a6f6-61655852d238", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "New case (mode): ModeResult(mode=0, count=805)\n" ] } ], "source": [ "newcase_mode = stats.mode(covid_datanew[\"new_cases\"])\n", "\n", "print(\"New case (mode):\", newcase_mode)" ] },
{ "cell_type": "markdown", "id": "5383af07-d5cc-43e7-8773-71b5cd847ffa", "metadata": {}, "source": [ "### In the above example we used a numeric column to display the mode. **Could you use a non-numeric column?**" ] },
{ "cell_type": "markdown", "id": "fa299856-4900-4871-b462-0c6ea2305b5b", "metadata": {}, "source": [ "**Variance (σ²)**\n", " * Formula: $\\sigma^2 = \\frac{\\sum_{i=1}^{n}(x_i - \\bar{x})^2}{n}$\n", " * Activity: Compute the variance of 'new_cases'."
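, "\n", "\n", "A minimal sketch of the variance formula on a small made-up array (the values and the names `toy`, `manual_var`, and `numpy_var` are illustrative assumptions, not taken from the Covid dataset). It computes the population variance by hand, dividing by n as in the formula, and compares the result with `np.var`, whose default `ddof=0` uses the same denominator.

```python
import numpy as np

# Hypothetical toy data, chosen only for illustration
toy = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Population variance by hand: mean squared deviation from the mean
toy_mean = toy.sum() / toy.size
manual_var = ((toy - toy_mean) ** 2).sum() / toy.size

# np.var defaults to ddof=0, i.e. the same divide-by-n formula
numpy_var = np.var(toy)

print('manual variance:', manual_var)
print('np.var variance:', numpy_var)
```

Passing `ddof=1` to `np.var` would divide by n - 1 instead, which gives the sample variance."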
] }, { "cell_type": "code", "execution_count": 13, "id": "8b58e5b9-fe94-4dec-b5c5-c76a02b6265c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "New cases (variance:numpy): 451321915.92810047\n" ] } ], "source": [ "#using NumPy, check the variance of the new_cases column (np.var defaults to ddof=0)\n", "newcase_variance = np.var(covid_datanew[\"new_cases\"])\n", "\n", "print(\"New cases (variance:numpy):\", newcase_variance)" ] },
{ "cell_type": "code", "execution_count": 14, "id": "259e401f-d673-462d-9df4-677cc6d307ac", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "451399502.6422019" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#using pandas, check the variance of the new_cases column (Series.var defaults to ddof=1)\n", "covid_datanew[\"new_cases\"].var()" ] },
{ "cell_type": "markdown", "id": "e3c51691-23ea-477e-a8e6-2ea476251ef1", "metadata": {}, "source": [ "**Standard Deviation (σ)**\n", " * Formula: $\\sigma = \\sqrt{\\frac{\\sum_{i=1}^n (x_i-\\bar{x})^2}{n}}$\n", " * Activity: Calculate the standard deviation of 'new_cases'."
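, "\n", "\n", "A hedged sketch of the two `ddof` conventions for the standard deviation, using made-up values (the array `toy` and the variable names are assumptions for illustration only). `np.std` defaults to `ddof=0` and divides by n, while the `std()` method of a pandas Series defaults to `ddof=1` and divides by n - 1; passing `ddof` explicitly makes the two agree.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, chosen only for illustration
toy = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pop_std = np.std(toy)                    # ddof=0 -> divide by n
sample_std = pd.Series(toy).std()        # ddof=1 -> divide by n - 1
numpy_sample_std = np.std(toy, ddof=1)   # NumPy with the pandas convention

print('population std (NumPy default):', pop_std)
print('sample std (pandas default):', sample_std)
print('sample std (NumPy, ddof=1):', numpy_sample_std)
```

The sample value is slightly larger than the population value, and the gap shrinks as the number of observations grows."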
] }, { "cell_type": "code", "execution_count": 15, "id": "8344b34f-8e19-4bb3-a533-7c858a5fd403", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "New cases (stdev: numpy): 21246.16442189512\n" ] } ], "source": [ "# Calculate the standard deviation using NumPy\n", "# 'ddof=0' for population standard deviation; 'ddof=1' for sample standard deviation\n", "newcase_stdev = np.std(covid_datanew[\"new_cases\"], ddof=1)\n", "\n", "print(\"New cases (stdev: numpy):\", newcase_stdev)" ] },
{ "cell_type": "code", "execution_count": 16, "id": "e3eda326-fa1f-43a3-9aaa-009968e74c00", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "21246.16442189512" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "covid_datanew[\"new_cases\"].std()" ] },
{ "cell_type": "markdown", "id": "b5afa9fc-c862-49f2-bc56-4f283761e59d", "metadata": {}, "source": [ "### **Why would there be a difference in the variance and standard deviation between NumPy and Pandas?**" ] },
{ "cell_type": "markdown", "id": "c55d04c3-0ea6-437e-b850-ee733666c71e", "metadata": {}, "source": [ "The difference between the NumPy and pandas `var` results does not depend on the range of the data but on the default degrees of freedom (`ddof`) set by each package. pandas sets `ddof=1`, which divides by n - 1 (the unbiased sample estimator), while NumPy sets `ddof=0`, which divides by n (the population formula, which is also the maximum likelihood estimate for normally distributed data). This is why the two variance results above differ slightly. The two standard deviation results match only because `ddof=1` was passed to `np.std` explicitly; with NumPy's default the same gap would appear.\n", "\n", "Reference: https://stackoverflow.com/questions/62938495/difference-between-numpy-var-and-pandas-var" ] },
{ "cell_type": "markdown", "id": "5885d612-c9be-43bf-b81d-4009a2cecc4f", "metadata": {}, "source": [ "**Max and Min Range**\n", "\n", "The range (the maximum minus the minimum) is a useful description of the variability of a data set, as long as there are no outliers. An outlier is an extremely high or low value that stands apart from the other values. If an outlier exists, the range by itself can be misleading."
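, "\n", "\n", "The effect of a single outlier on the range can be sketched with a small made-up array (all values and names here are illustrative assumptions, not drawn from the Covid dataset). The range is computed once as maximum minus minimum and once with `np.ptp`, which returns the same quantity.

```python
import numpy as np

# Hypothetical daily counts without and with one extreme value
typical = np.array([10, 12, 11, 13, 12, 14])
with_outlier = np.append(typical, 500)

# Range = maximum - minimum; np.ptp (peak to peak) computes the same quantity
range_typical = np.max(typical) - np.min(typical)
range_outlier = np.ptp(with_outlier)

print('range without outlier:', range_typical)
print('range with outlier:', range_outlier)
```

One extreme value moves the range from 4 to 490 even though every other observation is unchanged."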
] }, { "cell_type": "code", "execution_count": 17, "id": "fea7a8e9-1a3e-401b-a4bd-eb2857a8fe8a", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "287149 0\n" ] } ], "source": [ "#Calculate the range of the dataset using NumPy\n", "covid_newcases_max = np.max(covid_datanew[\"new_cases\"])\n", "covid_newcases_min = np.min(covid_datanew[\"new_cases\"])\n", "\n", "print(covid_newcases_max, covid_newcases_min)" ] }, { "cell_type": "markdown", "id": "dfaa2b4b-8307-4ae9-8c05-eb161bc335e4", "metadata": {}, "source": [ "### Why are Quartiles and Interquartile Range Important?\n", "\n", "Quartiles and the Interquartile Range (IQR) are essential in data analysis for several key reasons:\n", "\n", "1. **Measure of Spread**\n", " * Quartiles divide a dataset into four equal parts, providing insight into the distribution and variability of the data.\n", "\n", "2. **Outlier Detection**\n", " * The IQR is a robust measure of statistical dispersion and is commonly used for identifying outliers. Values that fall below ``Q1 - 1.5*IQR`` or above ``Q3 + 1.5*IQR`` are often considered outliers.\n", "\n", "3. **Non-parametric**\n", " * Quartiles do not assume a normal distribution of data, making them non-parametric and robust measures for skewed distributions or data with outliers.\n", "\n", "4. **Data Segmentation and Comparison**\n", " * Quartiles allow for easy segmentation of data into groups, which is useful in various applications like finance and sales.\n", "\n", "5. **Informative for Further Statistical Analysis**\n", " * Understanding quartile positions helps in making informed decisions for further statistical analyses, especially with skewed data.\n", "\n", "6. 
**Basis for Other Statistical Measures**\n", " * Quartiles are foundational for other statistical visualizations like box plots, which depict quartiles and outliers graphically.\n" ] },
{ "cell_type": "code", "execution_count": 18, "id": "a0e1e47c-4911-4ab1-bba2-6f901e994580", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Q1 (25th percentile): 24.0\n", "Q3 (75th percentile): 3666.0\n", "Interquartile Range: 3642.0\n" ] } ], "source": [ "# Calculate quartiles\n", "Q1 = np.quantile(covid_data[\"new_cases\"], 0.25)\n", "Q3 = np.quantile(covid_data[\"new_cases\"], 0.75)\n", "\n", "# Calculate the Interquartile Range\n", "IQR = Q3 - Q1\n", "\n", "print(\"Q1 (25th percentile):\", Q1)\n", "print(\"Q3 (75th percentile):\", Q3)\n", "print(\"Interquartile Range:\", IQR)" ] },
{ "cell_type": "markdown", "id": "04ca4b40-c872-48ea-afe5-6e364c0dcfbb", "metadata": {}, "source": [ "### **Activity - Hands-On**\n", "Use the data set located at https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho and perform a summary statistical analysis using either Pandas or NumPy.\n", "\n", "Calculate basic (summary) statistics for this data set." ] },
{ "cell_type": "markdown", "id": "f1dc627b-759e-4fe1-9576-47fb7f157141", "metadata": {}, "source": [ "#### **Additional Resources**\n", "https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/\n", "\n", "https://blog.quantinsti.com/data-preprocessing/\n", "\n", "https://training.experfy.com/courses/data-pre-processing" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 5 }