{ "cells": [ { "cell_type": "markdown", "id": "08cd137d-05ef-4f0c-8a4e-b67704001831", "metadata": {}, "source": [ "# Day 11: Introduction to Data Preprocessing in Python\n", "\n", "## 1. Overview of Data Preprocessing\n", "\n", "### Importance and Goals\n", "- **Importance**: Essential for converting raw data into a format suitable for analysis.\n", "- **Goals**: Enhance data quality, improve analysis efficiency, and prepare data for machine learning." ] }, { "cell_type": "markdown", "id": "20dd5790-99ff-44f5-8659-17916d092086", "metadata": {}, "source": [ "### Data Preprocessing Workflow\n", "- **Cleaning Data**: Remove duplicates, correct errors.\n", "- **Handling Missing Values**: Impute missing values or remove them.\n", "- **Normalization**: Scale data using methods like Min-Max scaling or Z-score normalization.\n", "- **Feature Engineering**: Create new features from existing data." ] }, { "cell_type": "markdown", "id": "8a233f09-4fad-446b-acbb-c6e1c6c2ac92", "metadata": { "tags": [] }, "source": [ "## 2. Understanding Data Types and Scales\n", "\n", "### Data Types\n", "- **Numeric (Quantitative)**: Numbers representing continuous or discrete data.\n", "- **Categorical (Qualitative)**: Data grouped into categories.\n", "\n", "### Scales\n", "- **Nominal**: Categories without order (e.g., blood types).\n", "- **Ordinal**: Ordered categories (e.g., class levels).\n", "- **Interval**: Numeric scales without true zero (e.g., temperature in Celsius).\n", "- **Ratio**: Numeric scales with true zero (e.g., height)." ] }, { "cell_type": "markdown", "id": "d8758428-4aa9-45b7-941c-bc7d7d6d05da", "metadata": {}, "source": [ "## 3. Basic (Summary) Statistics in Python\n", "\n", "### Setup for Activities\n", "- **Dataset**: Covid Data. \n", "https://github.com/100daysofml/100daysofml.github.io/blob/main/content/Week_03/covid_data.csv\n", "- **Tools**: Python with Pandas library." 
] }, { "cell_type": "code", "execution_count": 1, "id": "f2e59158-1f09-4f06-b096-dcdd4afd6a94", "metadata": {}, "outputs": [], "source": [ "#import relevant libraries\n", "import numpy as np\n", "import pandas as pd\n", "from scipy import stats" ] }, { "cell_type": "code", "execution_count": 2, "id": "b60b68c5-0a95-4e29-a5e7-ac07b6400133", "metadata": {}, "outputs": [], "source": [ "#load data into dataframe\n", "covid_data = pd.read_csv(\"covid_data.csv\")" ] }, { "cell_type": "code", "execution_count": 3, "id": "92ddfe37-3de9-4b56-af96-1848ed7fa99d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | iso_code | continent | location | date | total_cases | new_cases | new_cases_smoothed | total_deaths | new_deaths | new_deaths_smoothed | ... | female_smokers | male_smokers | handwashing_facilities | hospital_beds_per_thousand | life_expectancy | human_development_index | excess_mortality_cumulative_absolute | excess_mortality_cumulative | excess_mortality | excess_mortality_cumulative_per_million |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AFG | Asia | Afghanistan | 24/02/2020 | 5 | 5 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 37.746 | 0.5 | 64.83 | 0.511 | NaN | NaN | NaN | NaN |
| 1 | AFG | Asia | Afghanistan | 25/02/2020 | 5 | 0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 37.746 | 0.5 | 64.83 | 0.511 | NaN | NaN | NaN | NaN |
| 2 | AFG | Asia | Afghanistan | 26/02/2020 | 5 | 0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 37.746 | 0.5 | 64.83 | 0.511 | NaN | NaN | NaN | NaN |
| 3 | AFG | Asia | Afghanistan | 27/02/2020 | 5 | 0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 37.746 | 0.5 | 64.83 | 0.511 | NaN | NaN | NaN | NaN |
| 4 | AFG | Asia | Afghanistan | 28/02/2020 | 5 | 0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 37.746 | 0.5 | 64.83 | 0.511 | NaN | NaN | NaN | NaN |

5 rows × 67 columns
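
The `date` column in the preview above holds day-first strings such as `24/02/2020`, which pandas reads as plain text by default. Below is a minimal sketch of converting it to a datetime type, assuming the `DD/MM/YYYY` pattern holds for every row. The small `sample` frame is a hypothetical stand-in built from the rows shown above, not the full dataset.

```python
import pandas as pd

# Hypothetical stand-in for the first rows of covid_data shown above.
sample = pd.DataFrame({
    "location": ["Afghanistan"] * 3,
    "date": ["24/02/2020", "25/02/2020", "26/02/2020"],
    "new_cases": [5, 0, 0],
})

# An explicit format avoids day/month ambiguity in day-first strings.
sample["date"] = pd.to_datetime(sample["date"], format="%d/%m/%Y")

print(sample.dtypes)
```

Passing an explicit `format` is a deliberate choice here: without it, a string like `06/10/2022` could be interpreted as either 6 October or 10 June.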
| | iso_code | continent | location | date | total_cases | new_cases | new_cases_smoothed | total_deaths | new_deaths | new_deaths_smoothed | ... | female_smokers | male_smokers | handwashing_facilities | hospital_beds_per_thousand | life_expectancy | human_development_index | excess_mortality_cumulative_absolute | excess_mortality_cumulative | excess_mortality | excess_mortality_cumulative_per_million |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5813 | NGA | Africa | Nigeria | 06/10/2022 | 265741 | 236 | 51.286 | 3155.0 | 0.0 | 0.0 | ... | 0.6 | 10.8 | 41.949 | NaN | 54.69 | 0.539 | NaN | NaN | NaN | NaN |
| 5814 | NGA | Africa | Nigeria | 07/10/2022 | 265741 | 0 | 51.286 | 3155.0 | 0.0 | 0.0 | ... | 0.6 | 10.8 | 41.949 | NaN | 54.69 | 0.539 | NaN | NaN | NaN | NaN |
| 5815 | NGA | Africa | Nigeria | 08/10/2022 | 265816 | 75 | 55.000 | 3155.0 | 0.0 | 0.0 | ... | 0.6 | 10.8 | 41.949 | NaN | 54.69 | 0.539 | NaN | NaN | NaN | NaN |
| 5816 | NGA | Africa | Nigeria | 09/10/2022 | 265816 | 0 | 55.000 | 3155.0 | 0.0 | 0.0 | ... | 0.6 | 10.8 | 41.949 | NaN | 54.69 | 0.539 | NaN | NaN | NaN | NaN |
| 5817 | NGA | Africa | Nigeria | 10/10/2022 | 265816 | 0 | 55.000 | 3155.0 | 0.0 | 0.0 | ... | 0.6 | 10.8 | 41.949 | NaN | 54.69 | 0.539 | NaN | NaN | NaN | NaN |

5 rows × 67 columns
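
Both previews contain many `NaN` entries, so a natural next step before imputing or dropping anything is to measure how much is missing per column. The sketch below assumes a tiny hypothetical `sample` frame assembled from values visible in the previews above. On the real data the same two calls would be applied to `covid_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in mixing rows from the head and tail previews above.
sample = pd.DataFrame({
    "location": ["Afghanistan", "Afghanistan", "Nigeria", "Nigeria"],
    "new_cases_smoothed": [np.nan, np.nan, 51.286, 55.0],
    "handwashing_facilities": [37.746, 37.746, 41.949, 41.949],
    "hospital_beds_per_thousand": [0.5, 0.5, np.nan, np.nan],
})

# Count missing values per column, then express them as a share of all rows.
missing_counts = sample.isna().sum()
missing_share = sample.isna().mean()

print(missing_counts)
print(missing_share)
```

Columns with a high missing share are candidates for removal, while columns with only a few gaps are usually better served by imputation.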
| | iso_code | continent | location | date | total_cases | new_cases |
|---|---|---|---|---|---|---|
| 0 | AFG | Asia | Afghanistan | 24/02/2020 | 5 | 5 |
| 1 | AFG | Asia | Afghanistan | 25/02/2020 | 5 | 0 |
| 2 | AFG | Asia | Afghanistan | 26/02/2020 | 5 | 0 |
| 3 | AFG | Asia | Afghanistan | 27/02/2020 | 5 | 0 |
| 4 | AFG | Asia | Afghanistan | 28/02/2020 | 5 | 0 |
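
With the columns narrowed down, basic summary statistics can be computed directly on a numeric column. This is a minimal sketch that uses only the five `new_cases` values shown in the table above as a hypothetical stand-in. It is not a result for the full dataset.

```python
import pandas as pd

# Hypothetical stand-in: the new_cases values from the five rows shown above.
new_cases = pd.Series([5, 0, 0, 0, 0], name="new_cases")

mean_cases = new_cases.mean()      # arithmetic mean
median_cases = new_cases.median()  # middle value of the sorted data
mode_cases = new_cases.mode()[0]   # most frequent value
std_cases = new_cases.std()        # sample standard deviation (ddof=1)

print(mean_cases, median_cases, mode_cases, std_cases)
```

Even on this small sample the mean (1.0) sits above the median (0.0), which illustrates how a single large value pulls the mean while leaving the median unchanged.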