Global COVID-19 & GDP Analysis

This project analyzes a 184-country dataset that combines COVID-19 burden with IMF-linked GDP and GDP-per-capita indicators. The work moves from global comparisons into correlation analysis, regression diagnostics, geographic mapping, and a Random Forest mortality model.

The clearest story is that totals and per-capita rates describe different realities. Large countries and large economies dominate raw counts, while per-capita views expose a different geography of impact and make country context much more important.

Project Focus

The notebook is built around population, total cases, cases per million, total deaths, deaths per million, GDP, and GDP per capita. That makes it possible to compare raw burden, normalized burden, and model behavior without changing datasets midway through the analysis.

The project works best as a global analytics case study with a modeling extension. The machine-learning layer is useful here because it clarifies what the visible signals explain well and where the most severe countries still break the pattern.

Dataset Snapshot

The file contains 184 country records, 9 columns, and no missing values. Most of the numerical fields are strongly right-skewed, which means a relatively small set of countries drives a large share of the visible variation.

Countries: 184

Columns: 9

Missing Values: 0

Strongest Correlation

0.90

Total cases and total deaths move together very strongly across countries.

Best Linear Fit

R² = 0.586

GDP explains total case counts far better than population alone (R² of 0.586 versus 0.135).

Best Per-Capita Fit

R² = 0.443

GDP per capita becomes useful once the target shifts to cases per million.

Random Forest Fit

R² = 0.505

The mortality model captures baseline patterns but still misses the most extreme countries.

Core Readout

The notebook supports two connected conclusions. Raw totals behave mostly like a scale story, while relative burden is more tied to reporting intensity, infection spread, and country-level context that simple size measures do not capture.

Signal	Value
Total Cases to Total Deaths	r = 0.90
GDP to Total Cases	R² = 0.586
Population to Total Cases	R² = 0.135
GDP per Capita to Cases per Million	R² = 0.443
Random Forest to Deaths per Million	R² = 0.505

The Random Forest mortality model adds value because it captures the baseline mortality pattern and the relative ordering of risk, even though it still regresses extreme outcomes back toward the middle.

Interpretation

Totals reflect scale

Large countries and large economies dominate total cases and total deaths.

Rates expose intensity

Per-capita views reveal a different cross-country pattern than the raw totals do.

Outliers reveal hidden structure

High-mortality countries are where the current feature set stops being enough.

Global Burden

The first step is separating absolute burden from relative burden. That distinction matters because the countries leading raw totals are not always the same countries carrying the heaviest burden per person.

Top Countries by Total Cases

Raw totals are concentrated in a small set of very large countries, with the United States far ahead of the rest. This is a useful baseline, but it mainly reflects scale.

Cases per Million Leaders

Figure: Top countries by COVID cases per million

Once the data is normalized, the leaderboard changes substantially. Smaller countries can show much higher relative burden even when they never dominate raw totals.
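The shift from totals to rates comes down to dividing by population and scaling to one million. Here is a minimal sketch of that step, using three made-up countries; the names and figures are illustrative and are not rows from the dataset.

```python
# Hypothetical records: (country, population, total_cases). All values are made up.
records = [
    ("Largeland", 300_000_000, 30_000_000),
    ("Midland", 50_000_000, 4_000_000),
    ("Smallland", 1_000_000, 250_000),
]


def cases_per_million(population, total_cases):
    """Scale a raw case count to a rate per one million residents."""
    return total_cases / population * 1_000_000


# Rank the same records twice: once by raw totals, once by the normalized rate.
by_total = sorted(records, key=lambda r: r[2], reverse=True)
by_rate = sorted(records, key=lambda r: cases_per_million(r[1], r[2]), reverse=True)

print([name for name, _, _ in by_total])  # ranking by raw totals
print([name for name, _, _ in by_rate])   # ranking by cases per million
```

In this toy example the smallest country has the fewest cases but the highest rate, so it moves from last place to first once the metric changes.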

Global Choropleth

The map makes that metric shift easier to read. High total counts cluster around the largest countries, while high mortality rates appear more concentrated across Europe and parts of South America.

Global Bubble Map

This view layers outbreak scale and prosperity together. It is one of the clearest visuals for why wealth and reported burden can rise together without implying a simple causal story.

Relationship Structure

Before modeling, the key job is to identify which variables move together and which only become useful once the target is framed correctly.

Correlation Heatmap

The heatmap is the organizing chart for the whole project. Total cases and total deaths are tightly linked, GDP tracks raw burden strongly, and GDP per capita lines up more with cases per million than with raw totals.

Key Relationships

`Total cases to total deaths = 0.90` is the clearest structural relationship in the dataset, which suggests that the scale of spread accounts for a large share of the variation in raw death counts.

`GDP to total cases = 0.77` and `GDP per capita to cases per million = 0.67` show that economic signals matter, but in different ways depending on whether the outcome is absolute or population-adjusted.

`Cases per million to deaths per million = 0.52` is meaningful without being complete, which is why mortality cannot be reduced to infection rates alone.
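These correlations and the R² values in the Core Readout table are two views of the same fits. For a least-squares line with a single predictor, R² equals the squared Pearson correlation, so 0.77² ≈ 0.59 and 0.67² ≈ 0.45 line up with the reported 0.586 and 0.443 once rounding of the correlations is taken into account. The sketch below checks that identity on synthetic data; the variable names are hypothetical and the numbers are not the notebook's.

```python
import numpy as np

rng = np.random.default_rng(0)
gdp = rng.lognormal(mean=0.0, sigma=1.0, size=184)        # synthetic, right-skewed predictor
total_cases = 2.0 * gdp + rng.normal(0.0, 1.5, size=184)  # synthetic linear signal plus noise

# Pearson correlation between the two series.
r = np.corrcoef(gdp, total_cases)[0, 1]

# One-predictor least-squares fit and its R^2.
slope, intercept = np.polyfit(gdp, total_cases, deg=1)
fitted = slope * gdp + intercept
ss_res = np.sum((total_cases - fitted) ** 2)
ss_tot = np.sum((total_cases - total_cases.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(round(r ** 2, 6), round(r_squared, 6))  # the two values agree
```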

Regression Read

The linear models work best as explanatory tests. They help separate strong signals from weak ones and show where the global dataset becomes too skewed for a simple clean fit.

Population vs Total Cases Residual Pattern

Figure: Trimmed residual plot for total cases versus population

Population matters, but only weakly on its own. The fan-shaped residual pattern shows that error grows as fitted case counts grow, so a simple population model becomes less reliable for larger, more complex countries.
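A fan-shaped residual pattern can be checked numerically as well as visually. The sketch below fits a one-predictor line to synthetic data whose noise grows with the predictor, then compares residual spread in the lower and upper halves of the fitted values. The data and variable names are assumptions made for illustration, not the notebook's own.

```python
import numpy as np

rng = np.random.default_rng(1)
population = np.linspace(1.0, 100.0, 200)                    # synthetic, arbitrary units
noise = rng.normal(0.0, 1.0, size=200) * 0.2 * population    # noise scale grows with size
total_cases = 0.5 * population + noise

# Ordinary least-squares line and its residuals.
slope, intercept = np.polyfit(population, total_cases, deg=1)
fitted = slope * population + intercept
residuals = total_cases - fitted

# Split residuals by fitted value and compare their spread.
order = np.argsort(fitted)
low_half, high_half = np.array_split(residuals[order], 2)
print(low_half.std(), high_half.std())  # the upper half is noticeably wider
```

A larger spread in the upper half is the numeric counterpart of the fan shape: the fit is less trustworthy exactly where fitted counts are largest.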

GDP per Capita Diagnostics for Cases per Million

Figure: Regression diagnostics for cases per million versus GDP per capita

This is one of the more useful linear relationships in the notebook. GDP per capita explains a meaningful share of per-capita case burden, but the residual and Q-Q plots still show curvature, skew, and outlier-heavy behavior.

The regression layer makes three points cleanly: population is statistically real but weak on its own, GDP is much stronger for raw totals, and GDP per capita works better for normalized burden. The diagnostics matter because they show why even the useful linear fits should be read as partial explanations.
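To make the Q-Q diagnostic concrete, the sketch below builds the points of a normal Q-Q plot by hand using only the Python standard library. The residuals are made up, with one large positive value standing in for an outlier country; the plotting-position formula `(i + 0.5) / n` is one common convention and is an assumption here, not necessarily what the notebook uses.

```python
from statistics import NormalDist, mean, stdev


def qq_points(residuals):
    """Return (theoretical, sample) quantile pairs for a normal Q-Q plot."""
    n = len(residuals)
    mu, sd = mean(residuals), stdev(residuals)
    # Standardize the residuals and sort them in ascending order.
    sample = sorted((r - mu) / sd for r in residuals)
    # Matching standard-normal quantiles at plotting positions (i + 0.5) / n.
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sample))


# Made-up residuals with one large positive outlier (right skew).
residuals = [-2, -2, -1, -1, -1, 0, 0, 1, 2, 10]
points = qq_points(residuals)
for theo, samp in points:
    print(f"{theo:6.2f} {samp:6.2f}")
```

If the residuals were normal, the pairs would sit close to the line where both values are equal. Here the largest sample quantile lies well above its theoretical counterpart and the smallest lies above its counterpart too, which is the curved, right-skewed signature described above.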

Mortality Modeling Extension

The Random Forest mortality model is a supporting layer rather than the main event. It is helpful because it captures baseline mortality structure, ranks countries by relative risk, and shows exactly where the current variables stop being enough.

Actual vs Predicted Deaths per Million

Figure: Actual versus predicted deaths per million

The Random Forest model captures the overall mortality trend with moderate predictive power. It is strongest in the middle of the distribution and regresses the most extreme countries back toward the mean.
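One mechanical reason for that pull toward the middle is that a Random Forest prediction is an average of training targets, so it can never exceed the largest value seen in training. The sketch below illustrates this with scikit-learn on synthetic data. The feature set, hyperparameters, and train/test split are assumptions made for illustration; the source does not state the notebook's actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 184
cases_per_million = rng.lognormal(10.0, 1.0, size=n)  # synthetic feature
gdp_per_capita = rng.lognormal(9.0, 1.0, size=n)      # synthetic feature
# Synthetic target: driven by infection rate, with multiplicative noise for a heavy right tail.
deaths_per_million = 0.01 * cases_per_million * rng.lognormal(0.0, 0.5, size=n)

X = np.column_stack([cases_per_million, gdp_per_capita])
y = deaths_per_million
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)

print("held-out R2:", round(r2_score(y_test, pred), 3))
print("largest prediction:", pred.max())
print("largest training target:", y_train.max())
```

Any held-out country whose true value lies above the training maximum is necessarily under-predicted, which is consistent with the extremes being the hardest cases.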

Residual Error Field

Error rises with mortality, which means the model struggles most in the highest-severity countries. That heteroscedastic pattern is one of the most useful results on the page because it tells us where the missing structure lives.

3D Mortality Prediction Space

This plot makes the interaction story easier to read. Predicted deaths per million rise most sharply in countries combining higher infection rates with stronger economic scale.

SHAP Summary

The SHAP summary shows that cases per million is the strongest driver in the Random Forest model, with GDP per capita also contributing meaningfully. That means the model is picking up infection severity plus country context, not one naive driver.
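As a rough cross-check on a feature ranking of this kind, permutation importance is a simpler technique that needs only scikit-learn. It is not SHAP and will not reproduce SHAP values; it only measures how much the model's score drops when one feature is shuffled. The sketch below uses synthetic data with hypothetical feature names, including a deliberately irrelevant `noise` column.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(7)
n = 184
feature_names = ["cases_per_million", "gdp_per_capita", "noise"]  # hypothetical names
X = rng.normal(size=(n, 3))
# Synthetic target: strong first feature, weaker second, third is irrelevant.
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0.0, 0.5, size=n)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Shuffle each feature in turn and record the average drop in R^2.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

ranking = sorted(zip(feature_names, result.importances_mean), key=lambda t: t[1], reverse=True)
for name, score in ranking:
    print(f"{name:18s} {score:.3f}")
```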

The Random Forest model explains the baseline relationship between infection intensity and mortality well enough to be useful for relative risk ranking. Its main weakness is exactly where the problem gets hardest: high-mortality countries where hidden structural variables matter most.

Data Needed To Explain The Outliers

The countries the Random Forest model misses most are the ones where broad country-scale and economic variables stop being enough. To explain those outliers better, the next dataset needs more direct measures of demographic vulnerability, health-system stress, and timing.

Population Age Structure

Median age, share of population over 65, and age dependency ratios would help explain why countries with similar infection rates can experience very different death rates.

Healthcare Capacity

Hospital beds per capita, ICU capacity, physician density, and healthcare spending would give the model a more direct read on whether severe outbreaks could be absorbed or overwhelmed.

Testing And Reporting Intensity

Testing volume, test positivity, excess mortality, and reporting quality indicators would help separate true burden from differences in detection and transparency across countries.

Vaccination And Immunity Timing

Vaccination rates, booster coverage, and rollout timing would help explain why some countries break away from the expected mortality pattern.

Policy Response And Mobility

Stringency indices, lockdown timing, mobility data, and border restrictions would capture the behavioral and policy responses that raw GDP and population cannot represent on their own.

Underlying Health Risk

Comorbidity prevalence such as obesity, diabetes, cardiovascular disease, and smoking rates would help explain why mortality can spike even when the visible outbreak variables look similar.

The strongest conclusion is that COVID mortality is not explained by infection rates alone. Cases per million provide the baseline signal, but economic and systemic context shape how that burden turns into reported mortality, especially in the highest-severity countries.

Notebook Trace

Data Profiling

The notebook starts with file loading, shape checks, missing-value checks, summary statistics, and distribution plots across the numerical variables.
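The profiling checks described here can be sketched with pandas. The source does not name the file or its columns, so the example builds a small toy table instead of loading anything; the column names and values are illustrative only.

```python
import pandas as pd

# Toy stand-in for the country file; column names and values are illustrative.
df = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "population": [1_000_000, 5_000_000, 20_000_000, 60_000_000, 330_000_000],
    "total_cases": [50_000, 400_000, 900_000, 8_000_000, 90_000_000],
})

print(df.shape)          # (rows, columns)
print(df.isna().sum())   # missing values per column

numeric = df.select_dtypes("number")
print(numeric.describe())  # summary statistics for the numeric fields
print(numeric.skew())      # positive values indicate right skew
```

On the real file, the same calls would be expected to report 184 rows, 9 columns, and zero missing values, with positive skew on most numeric fields.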

Global Comparisons

Top-country charts compare total cases, cases per million, total deaths, deaths per million, GDP, and GDP per capita to separate raw scale from adjusted burden.

Inferential Analysis

The middle section builds a correlation matrix and runs simple OLS regressions with residual and Q-Q diagnostics to test which relationships hold up statistically.

Modeling Layer

The final section adds map views plus a Random Forest regressor for deaths per million, followed by feature-importance, SHAP, and model-diagnostic visuals.

Project Highlights

Global Country Dataset

Works from a 184-country dataset combining COVID burden, population, continent, GDP, and GDP-per-capita measures with no missing values.

Heatmap & Regression Analysis

Uses correlation analysis and regression diagnostics to compare how population, GDP, and GDP per capita relate to total and per-capita pandemic outcomes.

Interactive Map Layer

Builds choropleth and bubble-map views so the same dataset can be read geographically, not just through ranked charts and tables.

Mortality Model

Adds a Random Forest regressor, SHAP-based interpretation, and error diagnostics to show what the visible signals explain well and where the extremes still break the fit.