Gasoline Price Forecast Validation

Random Forest Accuracy

Less than 1 cent

Random Forest missed the July 2024 EIA value by only $0.0089 per gallon.

SARIMA Accuracy

Within 3 cents

SARIMA came within about $0.028 per gallon of the actual May 2024 EIA value.

Best Overall Model

SARIMA

Lowest total error across the later April 2024 to March 2026 validation window.

This project started as a research paper focused on extracting insight from U.S. gasoline prices: how they move over time, how strongly fuel categories correlate, and how much of that behavior can be explained by trend versus outside economic shocks. From there, the analysis expanded into predictive modeling, where I built machine learning and time-series forecasting approaches using data available through roughly March 15, 2024 to test whether future price movement could be estimated from the historical series.

The validation setup is what makes the result credible. EIA prices published from April 2024 through March 2026 served as a post-training validation window: the forward forecasts made in March 2024 were checked against values published afterward. The 2024 workflow was not rerun on the 2026 dataset and relabeled as validation. Within that setup, Random Forest produced the closest single 2024 forecast, while SARIMA delivered the lowest total error across the broader post-training window.

I trained the models on data available in March 2024, then checked their forecasts against the EIA prices published from April 2024 through March 2026. The closest Random Forest forecast landed within 1 cent of the actual value, and the closest SARIMA forecast within 3 cents.
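The train-then-validate split described above can be sketched in a few lines. This is a minimal illustration, not the notebook's code: `split_at_cutoff` is a hypothetical helper, and the month-start dates are placeholders standing in for the EIA series. Only the March 15, 2024 cutoff comes from this page.

```python
from datetime import date

# Training cutoff stated on this page; anything later is validation-only.
CUTOFF = date(2024, 3, 15)


def split_at_cutoff(observation_dates, cutoff=CUTOFF):
    """Split dates into (training, validation) around the cutoff.

    Hypothetical helper: dates on or before the cutoff may be used for
    fitting; later dates are held back and only used to score forecasts.
    """
    training = [d for d in observation_dates if d <= cutoff]
    validation = [d for d in observation_dates if d > cutoff]
    return training, validation


# Placeholder month-start dates (assumed layout, not the actual EIA file).
months = [date(2024, m, 1) for m in range(1, 8)]
training, validation = split_at_cutoff(months)
print("last training month:", training[-1])
print("first validation month:", validation[0])
```

The point of the design is that nothing after the cutoff can leak into model fitting, so any agreement with later published prices is a genuine out-of-sample result.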

Technologies Used

Capstone Focus

The paper framed gasoline as a high-impact economic variable shaped by supply, demand, policy, and global events. The central question was whether time itself could function as a statistically meaningful predictor of U.S. gasoline price movement while also helping explain longer trend behavior and seasonality.

The written analysis used inferential analysis, predictive analytics, and time series modeling on monthly EIA gasoline data. The paper also connected the observed volatility to major external shocks, especially the 2008 financial crisis and the 2022 post-pandemic price spike.

Paper Findings

The capstone concluded that time was a significant predictor, but not a complete explanation. The different gasoline product types moved together closely, indicating strong correlation across fuel categories, while the regression analysis showed that outside events still played a major role in price behavior.

OLS R-Squared 0.645

Durbin-Watson 0.871

Paper RF MAE $0.06

2024 Forecast Check

The April to July 2024 window is where the forecast behavior becomes easy to read: April stayed close, May and June widened during a sharper price move, and July produced the standout near-exact Random Forest hit.

[Chart: forecast comparison for Apr 2024, May 2024, Jun 2024, and Jul 2024. Series: actual EIA value, Random Forest, SARIMA, Linear Regression.]

Key Takeaways

Random Forest

Closest 2024 forecast: less than 1 cent per gallon from the actual July 2024 value.

SARIMA

Closest 2024 forecast: about 3 cents per gallon from the actual May 2024 value.

Linear Regression

Closest 2024 forecast: about 6 cents per gallon from the actual July 2024 value.

The hardest move came between May and June, when actual prices dropped 14.8 cents. June is where all three forecasts sit farthest from the actual value, before the gaps narrow again in July.
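The May-to-June drop can be checked directly from the actual values in the validation table on this page. The sketch below is a small arithmetic check, with variable names chosen for illustration.

```python
# Actual EIA values for April to July 2024, in dollars per gallon,
# copied from the validation table on this page.
actual = {
    "April 2024": 3.611,
    "May 2024": 3.603,
    "June 2024": 3.455,
    "July 2024": 3.484,
}

months = list(actual)
changes = {}
for previous, current in zip(months, months[1:]):
    # Round to tenths of a cent to match the table's precision.
    changes[f"{previous} -> {current}"] = round(actual[current] - actual[previous], 3)

for step, change in changes.items():
    print(f"{step}: {change * 100:+.1f} cents")

largest_drop = min(changes, key=changes.get)
print("largest drop:", largest_drop)
```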

April to July 2024 Validation Table

Month	Actual	Random Forest	SARIMA	Linear Regression
April 2024	$3.611	$3.596	$3.473	$3.527
May 2024	$3.603	$3.726	$3.575	$3.533
June 2024	$3.455	$3.708	$3.632	$3.540
July 2024	$3.484	$3.493	$3.588	$3.547

This is the direct post-training check: forecasts generated from data ending March 15, 2024 compared against later published EIA prices. Across the longer April 2024 to March 2026 horizon, SARIMA delivered the lowest aggregate error, while Random Forest produced the single closest short-term hit.
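The per-model "closest forecast" figures quoted on this page can be reproduced from the four-month table alone. The sketch below is an illustrative check with hypothetical names, not the notebook's scoring code. It covers only the April to July 2024 rows, because the values for the longer window do not appear on this page.

```python
# Validation table from this page, in dollars per gallon.
MONTHS = ["April 2024", "May 2024", "June 2024", "July 2024"]
ACTUAL = [3.611, 3.603, 3.455, 3.484]
FORECASTS = {
    "Random Forest": [3.596, 3.726, 3.708, 3.493],
    "SARIMA": [3.473, 3.575, 3.632, 3.588],
    "Linear Regression": [3.527, 3.533, 3.540, 3.547],
}


def absolute_errors(forecast, actual):
    """Per-month absolute error, rounded to the table's three decimals."""
    return [round(abs(f - a), 3) for f, a in zip(forecast, actual)]


closest = {}
for model, forecast in FORECASTS.items():
    errors = absolute_errors(forecast, ACTUAL)
    best = min(range(len(errors)), key=errors.__getitem__)
    closest[model] = (MONTHS[best], errors[best])
    print(f"{model}: closest in {MONTHS[best]}, off by ${errors[best]:.3f}")
```

Because the table is rounded to three decimals, the Random Forest July error comes out as $0.009 here rather than the unrounded $0.0089 quoted at the top of the page.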

Model Visuals

The original analysis produced several charts that make the forecasting logic easier to read. Taken together, they show the long-run structure of the gasoline series, the strength and limits of the linear fit, the shared movement across fuel categories, and the seasonal behavior that makes SARIMA useful in this problem.

Retail Gasoline Prices Over Time

This multi-series view establishes the core problem: fuel prices move through long trend cycles, sharp shocks, and overlapping product behavior. It also makes clear why a forecasting approach has to account for both trend and volatility rather than treating the data as a simple straight-line series.

Actual vs Predicted Prices

[Figure: Actual versus predicted gasoline prices scatter plot]

This scatter plot shows that the fitted model tracks the historical relationship reasonably well through the observed range. Points stay close to the reference line overall, but the later validation work on this page still matters because a good in-sample fit does not guarantee the strongest forward accuracy.

Time-Series Decomposition

The decomposition breaks the regular gasoline series into observed behavior, underlying trend, seasonal structure, and residual noise. This is one of the clearest visual reasons SARIMA belongs in the modeling mix: the data contains recurring seasonal shape alongside longer-term price movement.
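The trend step of a classical additive decomposition can be sketched with a centered 12-month moving average. The notebook's own decomposition call is not shown on this page, so the function and the synthetic series below are assumptions made purely to illustrate how trend and seasonal shape separate.

```python
def centered_moving_average(values, period=12):
    """Trend estimate via a centered moving average.

    For an even period this is the usual 2 x period average: the two
    end points get half weight so the window is centered on a month.
    Positions without a full window are returned as None.
    """
    if period % 2 != 0:
        raise ValueError("this sketch handles even periods only")
    half = period // 2
    trend = [None] * len(values)
    for i in range(half, len(values) - half):
        window = values[i - half : i + half + 1]
        weighted = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        trend[i] = weighted / period
    return trend


# Synthetic monthly series (illustrative only, not EIA data): a straight-line
# trend plus a repeating 12-month pattern whose values sum to zero.
pattern = [-0.10, -0.05, 0.00, 0.05, 0.10, 0.15,
           0.10, 0.05, 0.00, -0.05, -0.10, -0.15]
series = [3.00 + 0.01 * t + pattern[t % 12] for t in range(36)]

trend = centered_moving_average(series)
# Subtracting the trend leaves the seasonal shape (plus noise in real data).
detrended = [s - t for s, t in zip(series, trend) if t is not None]
```

On this synthetic input the moving average recovers the straight line and the detrended values reproduce the repeating pattern. Real price data would also leave a residual component.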

Fuel Price Correlation Matrix

[Figure: Correlation matrix of gasoline and diesel price variables]

The heatmap shows how tightly the fuel categories move together. That strong shared structure supports the paper's conclusion that gasoline products are highly correlated, while also underscoring that outside economic shocks still push the whole system up or down together.
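Each cell of a heatmap like this is a pairwise Pearson correlation. The sketch below shows that calculation in plain Python. The series names and values are placeholders rather than EIA data, and the helper functions are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict of named series."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Placeholder series (illustrative only, not EIA data): "premium" is
# "regular" plus a fixed markup, so the two move in lockstep.
columns = {
    "regular": [3.10, 3.25, 3.40, 3.20, 3.55],
    "premium": [3.70, 3.85, 4.00, 3.80, 4.15],
}
matrix = correlation_matrix(columns)
print(round(matrix["regular"]["premium"], 3))
```

A fixed markup between two grades yields a correlation of essentially 1, which is the kind of lockstep movement the heatmap makes visible. It also shows why high correlation says nothing about what drives the shared level up or down.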

Where The Forecasts Were Generated

The later validation work traces back to specific numbered notebook cells inside the original March 2024 artifact. Those cells generated the forward-looking outputs that were later checked against the published EIA values.

Random Forest Forecast Cells

The clearest short-horizon forecast appears in In [43], where the notebook predicts the next three months, builds prediction_dates, and stores predicted_prices.

That same model family continues into the later forecast and plotting cells, but In [43] is the easiest place to start when tracing the original Random Forest-style forward predictions.
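As a rough picture of the date-building step, the sketch below generates the next three monthly dates after the last observation. Only the name `prediction_dates` comes from the notebook description. The helper, the month-start convention, and the starting date are assumptions, and the model call that fills `predicted_prices` is not reproduced here.

```python
from datetime import date


def next_month_starts(last_observed, count):
    """Return the first day of each of the next `count` calendar months."""
    year, month = last_observed.year, last_observed.month
    dates = []
    for _ in range(count):
        month += 1
        if month > 12:
            month = 1
            year += 1
        dates.append(date(year, month, 1))
    return dates


# Assumed last observation date, taken from the cutoff quoted on this page.
prediction_dates = next_month_starts(date(2024, 3, 15), 3)
print([d.isoformat() for d in prediction_dates])
```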

SARIMA Forecast Cell

The SARIMA forecast is generated in In [74], where the notebook fits the SARIMAX(...) model and calls results.get_forecast(steps=12).

That cell produced the twelve-month SARIMA path that was later compared against the published post-March-2024 EIA values.

Linear Regression Forecast Context

The numbered linear regression implementation appears in In [71], where the notebook converts dates to Date_ordinal, fits LinearRegression(), and prepares the regression trend output.

That time-based regression path is the basis for the validation comparison shown here alongside the Random Forest and SARIMA forecasts.
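The idea behind a date-ordinal regression can be shown without any libraries. The sketch below uses closed-form least squares as a stand-in for the notebook's `LinearRegression()` fit, and it assumes `Date_ordinal` is produced with `toordinal()`. The dates and prices are placeholders, not EIA data.

```python
from datetime import date


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x (closed form)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Placeholder observations (illustrative only, not EIA data): prices rise
# by exactly one cent per day so the fitted slope is easy to check.
dates = [date(2024, 1, 1), date(2024, 1, 11), date(2024, 1, 21), date(2024, 1, 31)]
prices = [3.00, 3.10, 3.20, 3.30]

# Proleptic Gregorian ordinal (0001-01-01 is day 1): a plain integer
# that lets a regression treat time as a numeric feature.
date_ordinal = [d.toordinal() for d in dates]
slope, intercept = fit_line(date_ordinal, prices)

# Project the fitted trend to a later date.
future = date(2024, 2, 10)
predicted = intercept + slope * future.toordinal()
print(round(slope, 4), round(predicted, 2))
```

A model of this shape can only extend a straight line in time, which fits the page's finding that the regression forecasts barely move from month to month while actual prices swing.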

Notebook Route

The embedded notebook page preserves the original artifact directly, so the source forecast cells can be reviewed in context before jumping back to the validation summary on this page.

Use the notebook route below to inspect the original output path and then return here for the post-training accuracy comparison.

Why This Project Stands Out

The original capstone established the modeling foundation through trend analysis, correlation work, OLS regression, Random Forest forecasting, residual review, and SARIMA-based time-series modeling. The stronger portfolio contribution is the added validation step: those forecast paths were later checked against future EIA releases instead of being left as untested output.

That is what makes this project more credible than a standard notebook forecast. It shows both what the models captured well and where real market movement still drifted under the influence of external shocks.

Final Read

Time clearly carries predictive signal in gasoline pricing, and the less-than-one-cent July Random Forest hit is the strongest single result on the page. At the same time, the broader validation window shows why no single modeling approach fully explains gasoline behavior once supply, demand, policy, and macro shocks start moving together.

The best interpretation is not that one model solved gasoline prices. It is that the forecasting stack captured real structure well enough to survive contact with future data, while the project stays honest about the limits of trend-based prediction in a volatile market.

Project Highlights

Multi-Series Fuel Dataset

Analyzes national retail fuel price series across gasoline and diesel categories to measure shared movement, volatility, and downstream forecasting behavior.

Forecast Validation

Compares notebook-generated forecasts against later published EIA values to test how Random Forest, SARIMA, and linear regression behaved after training ended.

Time-Series Modeling

Uses regression, Random Forest forecasting, decomposition, and SARIMA-style logic to study trend, seasonality, and post-training prediction accuracy.

Residual & Correlation Review

Includes residual diagnostics, decomposition visuals, and cross-fuel correlation analysis to show where the models fit well and where outside shocks still dominate.