{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "08cd137d-05ef-4f0c-8a4e-b67704001831",
   "metadata": {},
   "source": [
    "# Day 11: Introduction to Data Preprocessing in Python\n",
    "\n",
    "## 1. Overview of Data Preprocessing\n",
    "\n",
    "### Importance and Goals\n",
    "- **Importance**: Essential for converting raw data into a format suitable for analysis.\n",
    "- **Goals**: Enhance data quality, improve analysis efficiency, and prepare data for machine learning."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "20dd5790-99ff-44f5-8659-17916d092086",
   "metadata": {},
   "source": [
    "### Data Preprocessing Workflow\n",
    "- **Cleaning Data**: Remove duplicates, correct errors.\n",
    "- **Handling Missing Values**: Impute missing values or remove them.\n",
    "- **Normalization**: Scale data using methods like Min-Max scaling or Z-score normalization.\n",
    "- **Feature Engineering**: Create new features from existing data."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8a233f09-4fad-446b-acbb-c6e1c6c2ac92",
   "metadata": {
    "tags": []
   },
   "source": [
    "## 2. Understanding Data Types and Scales\n",
    "\n",
    "### Data Types\n",
    "- **Numeric (Quantitative)**: Numbers representing continuous or discrete data.\n",
    "- **Categorical (Qualitative)**: Data grouped into categories.\n",
    "\n",
    "### Scales\n",
    "- **Nominal**: Categories without order (e.g., blood types).\n",
    "- **Ordinal**: Ordered categories (e.g., class levels).\n",
    "- **Interval**: Numeric scales without true zero (e.g., temperature in Celsius).\n",
    "- **Ratio**: Numeric scales with true zero (e.g., height)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d8758428-4aa9-45b7-941c-bc7d7d6d05da",
   "metadata": {},
   "source": [
    "## 3. Basic (Summary) Statistics in Python\n",
    "\n",
    "### Setup for Activities\n",
    "- **Dataset**: Covid Data. \n",
    "https://github.com/100daysofml/100daysofml.github.io/blob/main/content/Week_03/covid_data.csv\n",
    "- **Tools**: Python with Pandas library."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "f2e59158-1f09-4f06-b096-dcdd4afd6a94",
   "metadata": {},
   "outputs": [],
   "source": [
    "#import relevant libraries\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from scipy import stats"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "b60b68c5-0a95-4e29-a5e7-ac07b6400133",
   "metadata": {},
   "outputs": [],
   "source": [
    "#load data into dataframe\n",
    "covid_data = pd.read_csv(\"covid_data.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "92ddfe37-3de9-4b56-af96-1848ed7fa99d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>iso_code</th>\n",
       "      <th>continent</th>\n",
       "      <th>location</th>\n",
       "      <th>date</th>\n",
       "      <th>total_cases</th>\n",
       "      <th>new_cases</th>\n",
       "      <th>new_cases_smoothed</th>\n",
       "      <th>total_deaths</th>\n",
       "      <th>new_deaths</th>\n",
       "      <th>new_deaths_smoothed</th>\n",
       "      <th>...</th>\n",
       "      <th>female_smokers</th>\n",
       "      <th>male_smokers</th>\n",
       "      <th>handwashing_facilities</th>\n",
       "      <th>hospital_beds_per_thousand</th>\n",
       "      <th>life_expectancy</th>\n",
       "      <th>human_development_index</th>\n",
       "      <th>excess_mortality_cumulative_absolute</th>\n",
       "      <th>excess_mortality_cumulative</th>\n",
       "      <th>excess_mortality</th>\n",
       "      <th>excess_mortality_cumulative_per_million</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AFG</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>24/02/2020</td>\n",
       "      <td>5</td>\n",
       "      <td>5</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>37.746</td>\n",
       "      <td>0.5</td>\n",
       "      <td>64.83</td>\n",
       "      <td>0.511</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>AFG</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>25/02/2020</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>37.746</td>\n",
       "      <td>0.5</td>\n",
       "      <td>64.83</td>\n",
       "      <td>0.511</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>AFG</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>26/02/2020</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>37.746</td>\n",
       "      <td>0.5</td>\n",
       "      <td>64.83</td>\n",
       "      <td>0.511</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>AFG</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>27/02/2020</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>37.746</td>\n",
       "      <td>0.5</td>\n",
       "      <td>64.83</td>\n",
       "      <td>0.511</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>AFG</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>28/02/2020</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>37.746</td>\n",
       "      <td>0.5</td>\n",
       "      <td>64.83</td>\n",
       "      <td>0.511</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 67 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "  iso_code continent     location        date  total_cases  new_cases  \\\n",
       "0      AFG      Asia  Afghanistan  24/02/2020            5          5   \n",
       "1      AFG      Asia  Afghanistan  25/02/2020            5          0   \n",
       "2      AFG      Asia  Afghanistan  26/02/2020            5          0   \n",
       "3      AFG      Asia  Afghanistan  27/02/2020            5          0   \n",
       "4      AFG      Asia  Afghanistan  28/02/2020            5          0   \n",
       "\n",
       "   new_cases_smoothed  total_deaths  new_deaths  new_deaths_smoothed  ...  \\\n",
       "0                 NaN           NaN         NaN                  NaN  ...   \n",
       "1                 NaN           NaN         NaN                  NaN  ...   \n",
       "2                 NaN           NaN         NaN                  NaN  ...   \n",
       "3                 NaN           NaN         NaN                  NaN  ...   \n",
       "4                 NaN           NaN         NaN                  NaN  ...   \n",
       "\n",
       "   female_smokers  male_smokers  handwashing_facilities  \\\n",
       "0             NaN           NaN                  37.746   \n",
       "1             NaN           NaN                  37.746   \n",
       "2             NaN           NaN                  37.746   \n",
       "3             NaN           NaN                  37.746   \n",
       "4             NaN           NaN                  37.746   \n",
       "\n",
       "   hospital_beds_per_thousand  life_expectancy  human_development_index  \\\n",
       "0                         0.5            64.83                    0.511   \n",
       "1                         0.5            64.83                    0.511   \n",
       "2                         0.5            64.83                    0.511   \n",
       "3                         0.5            64.83                    0.511   \n",
       "4                         0.5            64.83                    0.511   \n",
       "\n",
       "   excess_mortality_cumulative_absolute  excess_mortality_cumulative  \\\n",
       "0                                   NaN                          NaN   \n",
       "1                                   NaN                          NaN   \n",
       "2                                   NaN                          NaN   \n",
       "3                                   NaN                          NaN   \n",
       "4                                   NaN                          NaN   \n",
       "\n",
       "   excess_mortality  excess_mortality_cumulative_per_million  \n",
       "0               NaN                                      NaN  \n",
       "1               NaN                                      NaN  \n",
       "2               NaN                                      NaN  \n",
       "3               NaN                                      NaN  \n",
       "4               NaN                                      NaN  \n",
       "\n",
       "[5 rows x 67 columns]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#inspect the first five and last five lines of the dataframe\n",
    "covid_data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "24d31f82-7b50-410b-94b7-5055dbc7a963",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>iso_code</th>\n",
       "      <th>continent</th>\n",
       "      <th>location</th>\n",
       "      <th>date</th>\n",
       "      <th>total_cases</th>\n",
       "      <th>new_cases</th>\n",
       "      <th>new_cases_smoothed</th>\n",
       "      <th>total_deaths</th>\n",
       "      <th>new_deaths</th>\n",
       "      <th>new_deaths_smoothed</th>\n",
       "      <th>...</th>\n",
       "      <th>female_smokers</th>\n",
       "      <th>male_smokers</th>\n",
       "      <th>handwashing_facilities</th>\n",
       "      <th>hospital_beds_per_thousand</th>\n",
       "      <th>life_expectancy</th>\n",
       "      <th>human_development_index</th>\n",
       "      <th>excess_mortality_cumulative_absolute</th>\n",
       "      <th>excess_mortality_cumulative</th>\n",
       "      <th>excess_mortality</th>\n",
       "      <th>excess_mortality_cumulative_per_million</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>5813</th>\n",
       "      <td>NGA</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Nigeria</td>\n",
       "      <td>06/10/2022</td>\n",
       "      <td>265741</td>\n",
       "      <td>236</td>\n",
       "      <td>51.286</td>\n",
       "      <td>3155.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.6</td>\n",
       "      <td>10.8</td>\n",
       "      <td>41.949</td>\n",
       "      <td>NaN</td>\n",
       "      <td>54.69</td>\n",
       "      <td>0.539</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5814</th>\n",
       "      <td>NGA</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Nigeria</td>\n",
       "      <td>07/10/2022</td>\n",
       "      <td>265741</td>\n",
       "      <td>0</td>\n",
       "      <td>51.286</td>\n",
       "      <td>3155.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.6</td>\n",
       "      <td>10.8</td>\n",
       "      <td>41.949</td>\n",
       "      <td>NaN</td>\n",
       "      <td>54.69</td>\n",
       "      <td>0.539</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5815</th>\n",
       "      <td>NGA</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Nigeria</td>\n",
       "      <td>08/10/2022</td>\n",
       "      <td>265816</td>\n",
       "      <td>75</td>\n",
       "      <td>55.000</td>\n",
       "      <td>3155.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.6</td>\n",
       "      <td>10.8</td>\n",
       "      <td>41.949</td>\n",
       "      <td>NaN</td>\n",
       "      <td>54.69</td>\n",
       "      <td>0.539</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5816</th>\n",
       "      <td>NGA</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Nigeria</td>\n",
       "      <td>09/10/2022</td>\n",
       "      <td>265816</td>\n",
       "      <td>0</td>\n",
       "      <td>55.000</td>\n",
       "      <td>3155.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.6</td>\n",
       "      <td>10.8</td>\n",
       "      <td>41.949</td>\n",
       "      <td>NaN</td>\n",
       "      <td>54.69</td>\n",
       "      <td>0.539</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5817</th>\n",
       "      <td>NGA</td>\n",
       "      <td>Africa</td>\n",
       "      <td>Nigeria</td>\n",
       "      <td>10/10/2022</td>\n",
       "      <td>265816</td>\n",
       "      <td>0</td>\n",
       "      <td>55.000</td>\n",
       "      <td>3155.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.6</td>\n",
       "      <td>10.8</td>\n",
       "      <td>41.949</td>\n",
       "      <td>NaN</td>\n",
       "      <td>54.69</td>\n",
       "      <td>0.539</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 67 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     iso_code continent location        date  total_cases  new_cases  \\\n",
       "5813      NGA    Africa  Nigeria  06/10/2022       265741        236   \n",
       "5814      NGA    Africa  Nigeria  07/10/2022       265741          0   \n",
       "5815      NGA    Africa  Nigeria  08/10/2022       265816         75   \n",
       "5816      NGA    Africa  Nigeria  09/10/2022       265816          0   \n",
       "5817      NGA    Africa  Nigeria  10/10/2022       265816          0   \n",
       "\n",
       "      new_cases_smoothed  total_deaths  new_deaths  new_deaths_smoothed  ...  \\\n",
       "5813              51.286        3155.0         0.0                  0.0  ...   \n",
       "5814              51.286        3155.0         0.0                  0.0  ...   \n",
       "5815              55.000        3155.0         0.0                  0.0  ...   \n",
       "5816              55.000        3155.0         0.0                  0.0  ...   \n",
       "5817              55.000        3155.0         0.0                  0.0  ...   \n",
       "\n",
       "      female_smokers  male_smokers  handwashing_facilities  \\\n",
       "5813             0.6          10.8                  41.949   \n",
       "5814             0.6          10.8                  41.949   \n",
       "5815             0.6          10.8                  41.949   \n",
       "5816             0.6          10.8                  41.949   \n",
       "5817             0.6          10.8                  41.949   \n",
       "\n",
       "      hospital_beds_per_thousand  life_expectancy  human_development_index  \\\n",
       "5813                         NaN            54.69                    0.539   \n",
       "5814                         NaN            54.69                    0.539   \n",
       "5815                         NaN            54.69                    0.539   \n",
       "5816                         NaN            54.69                    0.539   \n",
       "5817                         NaN            54.69                    0.539   \n",
       "\n",
       "      excess_mortality_cumulative_absolute  excess_mortality_cumulative  \\\n",
       "5813                                   NaN                          NaN   \n",
       "5814                                   NaN                          NaN   \n",
       "5815                                   NaN                          NaN   \n",
       "5816                                   NaN                          NaN   \n",
       "5817                                   NaN                          NaN   \n",
       "\n",
       "      excess_mortality  excess_mortality_cumulative_per_million  \n",
       "5813               NaN                                      NaN  \n",
       "5814               NaN                                      NaN  \n",
       "5815               NaN                                      NaN  \n",
       "5816               NaN                                      NaN  \n",
       "5817               NaN                                      NaN  \n",
       "\n",
       "[5 rows x 67 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "covid_data.tail()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "0c66c4b6-270b-45a5-ae19-34eeadf35d33",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 5818 entries, 0 to 5817\n",
      "Data columns (total 67 columns):\n",
      " #   Column                                      Non-Null Count  Dtype  \n",
      "---  ------                                      --------------  -----  \n",
      " 0   iso_code                                    5818 non-null   object \n",
      " 1   continent                                   5818 non-null   object \n",
      " 2   location                                    5818 non-null   object \n",
      " 3   date                                        5818 non-null   object \n",
      " 4   total_cases                                 5818 non-null   int64  \n",
      " 5   new_cases                                   5818 non-null   int64  \n",
      " 6   new_cases_smoothed                          5788 non-null   float64\n",
      " 7   total_deaths                                5638 non-null   float64\n",
      " 8   new_deaths                                  5627 non-null   float64\n",
      " 9   new_deaths_smoothed                         5596 non-null   float64\n",
      " 10  total_cases_per_million                     5818 non-null   float64\n",
      " 11  new_cases_per_million                       5818 non-null   float64\n",
      " 12  new_cases_smoothed_per_million              5788 non-null   float64\n",
      " 13  total_deaths_per_million                    5638 non-null   float64\n",
      " 14  new_deaths_per_million                      5627 non-null   float64\n",
      " 15  new_deaths_smoothed_per_million             5596 non-null   float64\n",
      " 16  reproduction_rate                           5566 non-null   float64\n",
      " 17  icu_patients                                2610 non-null   float64\n",
      " 18  icu_patients_per_million                    2610 non-null   float64\n",
      " 19  hosp_patients                               2610 non-null   float64\n",
      " 20  hosp_patients_per_million                   2610 non-null   float64\n",
      " 21  weekly_icu_admissions                       0 non-null      float64\n",
      " 22  weekly_icu_admissions_per_million           0 non-null      float64\n",
      " 23  weekly_hosp_admissions                      934 non-null    float64\n",
      " 24  weekly_hosp_admissions_per_million          934 non-null    float64\n",
      " 25  total_tests                                 3174 non-null   float64\n",
      " 26  new_tests                                   2948 non-null   float64\n",
      " 27  total_tests_per_thousand                    3174 non-null   float64\n",
      " 28  new_tests_per_thousand                      2948 non-null   float64\n",
      " 29  new_tests_smoothed                          4114 non-null   float64\n",
      " 30  new_tests_smoothed_per_thousand             4114 non-null   float64\n",
      " 31  positive_rate                               3440 non-null   float64\n",
      " 32  tests_per_case                              3440 non-null   float64\n",
      " 33  tests_units                                 4156 non-null   object \n",
      " 34  total_vaccinations                          2104 non-null   float64\n",
      " 35  people_vaccinated                           2051 non-null   float64\n",
      " 36  people_fully_vaccinated                     2004 non-null   float64\n",
      " 37  total_boosters                              1170 non-null   float64\n",
      " 38  new_vaccinations                            1827 non-null   float64\n",
      " 39  new_vaccinations_smoothed                   3658 non-null   float64\n",
      " 40  total_vaccinations_per_hundred              2104 non-null   float64\n",
      " 41  people_vaccinated_per_hundred               2051 non-null   float64\n",
      " 42  people_fully_vaccinated_per_hundred         2004 non-null   float64\n",
      " 43  total_boosters_per_hundred                  1170 non-null   float64\n",
      " 44  new_vaccinations_smoothed_per_million       3658 non-null   float64\n",
      " 45  new_people_vaccinated_smoothed              3658 non-null   float64\n",
      " 46  new_people_vaccinated_smoothed_per_hundred  3658 non-null   float64\n",
      " 47  stringency_index                            5699 non-null   float64\n",
      " 48  population                                  5818 non-null   int64  \n",
      " 49  population_density                          5818 non-null   float64\n",
      " 50  median_age                                  5818 non-null   float64\n",
      " 51  aged_65_older                               5818 non-null   float64\n",
      " 52  aged_70_older                               5818 non-null   float64\n",
      " 53  gdp_per_capita                              5818 non-null   float64\n",
      " 54  extreme_poverty                             2922 non-null   float64\n",
      " 55  cardiovasc_death_rate                       5818 non-null   float64\n",
      " 56  diabetes_prevalence                         5818 non-null   float64\n",
      " 57  female_smokers                              4860 non-null   float64\n",
      " 58  male_smokers                                4860 non-null   float64\n",
      " 59  handwashing_facilities                      1913 non-null   float64\n",
      " 60  hospital_beds_per_thousand                  4863 non-null   float64\n",
      " 61  life_expectancy                             5818 non-null   float64\n",
      " 62  human_development_index                     5818 non-null   float64\n",
      " 63  excess_mortality_cumulative_absolute        421 non-null    float64\n",
      " 64  excess_mortality_cumulative                 421 non-null    float64\n",
      " 65  excess_mortality                            421 non-null    float64\n",
      " 66  excess_mortality_cumulative_per_million     421 non-null    float64\n",
      "dtypes: float64(59), int64(3), object(5)\n",
      "memory usage: 3.0+ MB\n"
     ]
    }
   ],
   "source": [
    "#show all columns in pandas\n",
    "covid_data.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "31e50afe-36f7-4539-9875-011ac83e4f5e",
   "metadata": {},
   "outputs": [],
   "source": [
    "#create a dataframe that loads relevant columns\n",
    "covid_datanew = covid_data[['iso_code','continent','location','date','total_cases','new_cases']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "e5badd24-9263-4d29-9067-83e4fc7dd5eb",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>iso_code</th>\n",
       "      <th>continent</th>\n",
       "      <th>location</th>\n",
       "      <th>date</th>\n",
       "      <th>total_cases</th>\n",
       "      <th>new_cases</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AFG</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>24/02/2020</td>\n",
       "      <td>5</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>AFG</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>25/02/2020</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>AFG</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>26/02/2020</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>AFG</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>27/02/2020</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>AFG</td>\n",
       "      <td>Asia</td>\n",
       "      <td>Afghanistan</td>\n",
       "      <td>28/02/2020</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  iso_code continent     location        date  total_cases  new_cases\n",
       "0      AFG      Asia  Afghanistan  24/02/2020            5          5\n",
       "1      AFG      Asia  Afghanistan  25/02/2020            5          0\n",
       "2      AFG      Asia  Afghanistan  26/02/2020            5          0\n",
       "3      AFG      Asia  Afghanistan  27/02/2020            5          0\n",
       "4      AFG      Asia  Afghanistan  28/02/2020            5          0"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#displaying the new dataframe\n",
    "covid_datanew.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "07fadcb7-3e53-46f4-87f7-fb4416667bea",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "iso_code       object\n",
       "continent      object\n",
       "location       object\n",
       "date           object\n",
       "total_cases     int64\n",
       "new_cases       int64\n",
       "dtype: object"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#investigating the data type of the dataframe\n",
    "covid_datanew.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "4d953f37-2824-44c3-81bb-f93f3ae6daa3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(5818, 67)"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#displaying the shape of the dataframe (rows x columns)\n",
    "covid_data.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cf60d23b-38e5-4b99-8e84-8dfeef026f4b",
   "metadata": {},
   "source": [
    "**Mean (Arithmetic Average)**\n",
    "  * Formula: $(\\bar{x} = \\frac{1}{n}\\sum_{i=1}^{n}x_i$)\n",
    "  * Activity: Calculate the mean of 'new_cases' in the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "ad79875b-3caa-4a0c-9519-b070582113f7",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "New case (mean): 8814.365761430045\n"
     ]
    }
   ],
   "source": [
    "#analyze the mean of the new_cases column using the np.mean() in numpy\n",
    "newcase_mean = np.mean(covid_datanew[\"new_cases\"])\n",
    "\n",
    "print(\"New case (mean):\", newcase_mean)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "578a07f1-faaf-4733-9d44-6c8590b688e7",
   "metadata": {},
   "source": [
    "**Median (Middle Value in Sorted Data)**\n",
    "  * Activity: Find the median of 'new_cases' in the dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "d11fb256-7079-4f9b-8746-e20d88e1a7aa",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "New case (median): 261.0\n"
     ]
    }
   ],
   "source": [
    "newcase_median = np.median(covid_datanew[\"new_cases\"])\n",
    "\n",
    "print(\"New case (median):\", newcase_median)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "769e02ca-01ef-4a96-b1f1-d5519acd6aa0",
   "metadata": {},
   "source": [
    "**Mode (Most Frequent Value)**\n",
    "  * Activity: Determine the mode for ''."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "118bb092-609e-42ab-a1c8-73e7b094633c",
   "metadata": {},
   "source": [
    "The stats.mode function from the SciPy library returns a ModeResult object, which contains two arrays:\n",
    "\n",
    "    The first array (mode): This contains the mode value(s), i.e., the most frequently occurring value(s) in the dataset.\n",
    "    The second array (count): This contains the number of times the mode value(s) appears in the dataset.\n",
    "\n",
    "Both of these are returned as arrays, even if there's only one mode. When you access the mode using stats.mode(covid_datanew['new_cases'])[0], it returns an array with the mode value. The [0] at the end is used to access the first (and in most cases, the only) element of this array.\n",
    "\n",
    "So, in the context of the code:\n",
    "\n",
    "    stats.mode(covid_datanew['new_cases']): Returns a ModeResult object with the mode and its count.\n",
    "    stats.mode(covid_datanew['new_cases'])[0]: Accesses the array containing the mode value(s).\n",
    "    stats.mode(covid_datanew['new_cases'])[0][0]: Accesses the first element of the array, providing the actual mode value.\n",
    "\n",
    "This is necessary because the mode function is designed to handle multi-modal datasets (datasets with more than one mode) and thus returns an array instead of a single value. In most single-mode cases, you'll need the [0][0] to access the actual mode value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "e9f49272-b91f-4ea7-a6f6-61655852d238",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "New case (mode): ModeResult(mode=0, count=805)\n"
     ]
    }
   ],
   "source": [
    "newcase_mode = stats.mode(covid_datanew[\"new_cases\"])\n",
    "\n",
    "print(\"New case (mode):\", newcase_mode)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5383af07-d5cc-43e7-8773-71b5cd847ffa",
   "metadata": {},
   "source": [
    "### In the above example we used a numeric column in order to display the mode? **Could you use a non-numeric column?**"
   ]
  },
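  {
   "cell_type": "markdown",
   "id": "added-mode-text-column-sketch",
   "metadata": {},
   "source": [
    "One possible answer, as a minimal sketch: the pandas `Series.mode()` method works on text columns and returns every tied value. The `continents` Series below is made-up illustration data, not taken from the Covid dataset.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical categorical data (not from the Covid dataset)\n",
    "continents = pd.Series(['Asia', 'Europe', 'Asia', 'Africa', 'Europe', 'Asia'])\n",
    "\n",
    "# Series.mode() returns a Series because several values can tie\n",
    "continent_mode = continents.mode()\n",
    "print('Mode:', continent_mode[0])\n",
    "\n",
    "# value_counts() shows how often each category appears\n",
    "counts = continents.value_counts()\n",
    "print(counts)\n",
    "\n",
    "# With a tie, every most-frequent value is returned\n",
    "tied = pd.Series(['a', 'b', 'a', 'b', 'c']).mode()\n",
    "print('Tied modes:', tied.tolist())\n",
    "```\n",
    "\n",
    "The same call should work on a column such as `covid_datanew['continent']`, assuming that column holds plain strings as the `dtypes` output suggests."
   ]
  },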
  {
   "cell_type": "markdown",
   "id": "fa299856-4900-4871-b462-0c6ea2305b5b",
   "metadata": {},
   "source": [
    "**Variance (σ²)**\n",
    "  * Formula: $(\\sigma^2 = \\frac{\\sum_{i=1}^{n}(x_i - \\bar{x})^2}{n}$)\n",
    "  * Activity: Compute the variance of 'quality'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "8b58e5b9-fe94-4dec-b5c5-c76a02b6265c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "New cases (variance:numpy): 451321915.92810047\n"
     ]
    }
   ],
   "source": [
    "#using numpy check the variance of the new_cases column\n",
    "newcase_variance = np.var(covid_datanew[\"new_cases\"])\n",
    "\n",
    "print(\"New cases (variance:numpy):\", newcase_variance)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "259e401f-d673-462d-9df4-677cc6d307ac",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "451399502.6422019"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#using numpy check the variance of the new_cases column\n",
    "covid_datanew[\"new_cases\"].var()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e3c51691-23ea-477e-a8e6-2ea476251ef1",
   "metadata": {},
   "source": [
    "**Standard Deviation (σ)**\n",
    "  * Formula: $\\sigma = \\sqrt{\\frac{\\sum_{i=1}^n (x_i-\\bar{x})^2}{n}}$\n",
    "  * Activity: Calculate the standard deviation for 'quality'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "8344b34f-8e19-4bb3-a533-7c858a5fd403",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "New cases (stdev: numpy): 21246.16442189512\n"
     ]
    }
   ],
   "source": [
    "# Calculate the standard deviation using NumPy\n",
    "# 'ddof=0' for population standard deviation; 'ddof=1' for sample standard deviation\n",
    "newcase_stdev = np.std(covid_datanew[\"new_cases\"], ddof=1)\n",
    "\n",
    "print(\"New cases (stdev: numpy):\", newcase_stdev)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "e3eda326-fa1f-43a3-9aaa-009968e74c00",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "21246.16442189512"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "covid_datanew[\"new_cases\"].std()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b5afa9fc-c862-49f2-bc56-4f283761e59d",
   "metadata": {},
   "source": [
    "### **Why would there be a difference in the variance and standard deviation between NumPy and Pandas?**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c55d04c3-0ea6-437e-b850-ee733666c71e",
   "metadata": {},
   "source": [
    "The difference between the numpy var and pandas var methods are not dependent on the range of the data but on the degrees of freedom (ddof) set by package. pandas sets ddof=1 (unbiased estimator) while numpy sets ddof = 0 (mle). \n",
    "RE: https://stackoverflow.com/questions/62938495/difference-between-numpy-var-and-pandas-var"
   ]
  },
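  {
   "cell_type": "markdown",
   "id": "added-ddof-comparison-sketch",
   "metadata": {},
   "source": [
    "The effect of `ddof` can be checked on a small example. This is a minimal sketch; the `values` Series is made-up illustration data, chosen so the sum of squared deviations is easy to verify by hand.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical sample (not from the Covid dataset)\n",
    "values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])\n",
    "\n",
    "# NumPy default: ddof=0, divide by n (population variance)\n",
    "var_population = np.var(values.to_numpy())\n",
    "\n",
    "# pandas default: ddof=1, divide by n - 1 (sample variance)\n",
    "var_sample = values.var()\n",
    "\n",
    "# Passing ddof explicitly makes the two libraries agree\n",
    "var_numpy_ddof1 = np.var(values.to_numpy(), ddof=1)\n",
    "\n",
    "print('np.var (ddof=0):', var_population)\n",
    "print('Series.var (ddof=1):', var_sample)\n",
    "print('np.var (ddof=1):', var_numpy_ddof1)\n",
    "```\n",
    "\n",
    "Here the mean is 5 and the squared deviations sum to 32, so the population variance is 32 / 8 and the sample variance is 32 / 7. For large datasets the gap between the two shrinks, which is why the difference in the Covid data is small relative to the variance itself."
   ]
  },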
  {
   "cell_type": "markdown",
   "id": "5885d612-c9be-43bf-b81d-4009a2cecc4f",
   "metadata": {},
   "source": [
    "**Max and Min Range**\n",
    "\n",
    "The range has a significant role in describing the variability of a data set, as long as there are no outliers. An outlier is an extreme high or low value that stands alone from the other values. If an outlier exist, the value of the range by itself can be misleading."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "fea7a8e9-1a3e-401b-a4bd-eb2857a8fe8a",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "287149 0\n"
     ]
    }
   ],
   "source": [
    "#Calculate the range of the dataset using NumPy\n",
    "covid_newcases_max = np.max(covid_datanew[\"new_cases\"])\n",
    "covid_newcases_min = np.min(covid_datanew[\"new_cases\"])\n",
    "\n",
    "print(covid_newcases_max, covid_newcases_min)"
   ]
  },
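  {
   "cell_type": "markdown",
   "id": "added-range-outlier-sketch",
   "metadata": {},
   "source": [
    "To see how a single outlier distorts the range, here is a minimal sketch on made-up counts (the arrays are hypothetical, not taken from the Covid dataset). `np.ptp` ('peak to peak') is NumPy's shortcut for max minus min.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# Hypothetical daily counts (not from the Covid dataset)\n",
    "typical = np.array([12, 15, 11, 14, 13, 16, 12])\n",
    "with_outlier = np.append(typical, 500)\n",
    "\n",
    "# Range computed by hand and with np.ptp\n",
    "range_typical = np.max(typical) - np.min(typical)\n",
    "range_outlier = np.ptp(with_outlier)\n",
    "\n",
    "print('Range without outlier:', range_typical)\n",
    "print('Range with outlier:', range_outlier)\n",
    "```\n",
    "\n",
    "One extreme value is enough to inflate the range, even though the other seven values are tightly clustered."
   ]
  },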
  {
   "cell_type": "markdown",
   "id": "dfaa2b4b-8307-4ae9-8c05-eb161bc335e4",
   "metadata": {},
   "source": [
    "### Why are Quartiles and Interquartile Range Important?\n",
    "\n",
    "Quartiles and the Interquartile Range (IQR) are essential in data analysis for several key reasons:\n",
    "\n",
    "1. **Measure of Spread**\n",
    "   * Quartiles divide a dataset into four equal parts, providing insight into the distribution and variability of the data.\n",
    "\n",
    "2. **Outlier Detection**\n",
    "   * The IQR is a robust measure of statistical dispersion and is commonly used for identifying outliers. Values that fall below ``Q1 - 1.5*IQR`` or above ``Q3 + 1.5*IQR`` are often considered outliers.\n",
    "\n",
    "3. **Non-parametric**\n",
    "   * Quartiles do not assume a normal distribution of data, making them non-parametric and robust measures for skewed distributions or data with outliers.\n",
    "\n",
    "4. **Data Segmentation and Comparison**\n",
    "   * Quartiles allow for easy segmentation of data into groups, which is useful in various applications like finance and sales.\n",
    "\n",
    "5. **Informative for Further Statistical Analysis**\n",
    "   * Understanding quartile positions helps in making informed decisions for further statistical analyses, especially with skewed data.\n",
    "\n",
    "6. **Basis for Other Statistical Measures**\n",
    "   * Quartiles are foundational for other statistical visualizations like box plots, which depict quartiles and outliers graphically.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "a0e1e47c-4911-4ab1-bba2-6f901e994580",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Q1 (25th percentile): 24.0\n",
      "Q3 (75th percentile): 3666.0\n",
      "Interquartile Range: 3642.0\n"
     ]
    }
   ],
   "source": [
    "# Calculate quartiles\n",
    "Q1 = np.quantile(covid_data[\"new_cases\"],0.25)\n",
    "Q3 = np.quantile(covid_data[\"new_cases\"],0.75)\n",
    "\n",
    "# Calculate the Interquartile Range\n",
    "IQR = Q3 - Q1\n",
    "\n",
    "print(\"Q1 (25th percentile):\", Q1)\n",
    "print(\"Q3 (75th percentile):\", Q3)\n",
    "print(\"Interquartile Range:\", IQR)"
   ]
  },
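  {
   "cell_type": "markdown",
   "id": "added-iqr-outlier-fences-sketch",
   "metadata": {},
   "source": [
    "The `1.5*IQR` rule for flagging outliers can be sketched as follows. The `data` array is made-up illustration data, and the names `lower_fence` and `upper_fence` are chosen here for readability; they are not part of NumPy.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# Hypothetical values (not from the Covid dataset)\n",
    "data = np.array([10, 12, 13, 15, 16, 18, 20, 95])\n",
    "\n",
    "q1 = np.quantile(data, 0.25)\n",
    "q3 = np.quantile(data, 0.75)\n",
    "iqr = q3 - q1\n",
    "\n",
    "# Fences from the 1.5 * IQR rule\n",
    "lower_fence = q1 - 1.5 * iqr\n",
    "upper_fence = q3 + 1.5 * iqr\n",
    "\n",
    "# Boolean mask selects the values outside the fences\n",
    "outliers = data[(data < lower_fence) | (data > upper_fence)]\n",
    "\n",
    "print('Fences:', lower_fence, upper_fence)\n",
    "print('Outliers:', outliers)\n",
    "```\n",
    "\n",
    "The same mask could be applied to the `new_cases` column using the `Q1`, `Q3`, and `IQR` values computed above. With heavily skewed data like daily case counts, expect the rule to flag many rows, so treat the result as a prompt for inspection rather than a list of errors."
   ]
  },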
  {
   "cell_type": "markdown",
   "id": "04ca4b40-c872-48ea-afe5-6e364c0dcfbb",
   "metadata": {},
   "source": [
    "### **Activity - Hands-On**\n",
    "Use the data set located at https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho and do a summary statistical analysis using either Pandas of NumPy.\n",
    "\n",
    "Calculate basic (summary) statistics for this data set"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f1dc627b-759e-4fe1-9576-47fb7f157141",
   "metadata": {},
   "source": [
    "#### **Additional Resources**\n",
    "https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/\n",
    "\n",
    "https://blog.quantinsti.com/data-preprocessing/\n",
    "\n",
    "https://training.experfy.com/courses/data-pre-processing"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}