
Predicting the Cause of Major U.S. Power Outages (2000–2016)

Overview

Introduction: This project examines the dataset “Major Power Outage Events in the Continental U.S. (2000–2016).” The dataset contains detailed records of large power outages, including their time of occurrence, affected states, outage duration, number of customers affected, and cause. Cause categories include severe weather, intentional attack, equipment failure, and others.

Research Question: What factors can predict the cause of a major power outage?

Predicting the cause of a power outage matters because it can help utility companies and emergency planners anticipate and respond to outages more efficiently. The analysis and predictive modeling in this project focus on determining which features (such as outage duration and number of customers affected) are most closely related to the underlying cause of the outage.


Dataset Summary

Dataset Overview:

Engineered Features:


Data Cleaning and EDA

After loading the Excel file, dropping the extra unnamed columns, and assigning easy-to-read column names, I began the actual data cleaning.

  1. First, I merged date and time into single datetime columns.

  2. Then, the outage duration in hours was computed.

  3. Missing values were then explored.

  4. Finally, univariate and bivariate plots were produced.
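
The first three cleaning steps can be sketched in pandas as follows. This is a minimal illustration on toy rows, not the project's actual code: the source column names (OUTAGE.START.DATE and so on), the string formats, and the output names OUTAGE_START, OUTAGE_RESTORATION, and DURATION_HOURS are all assumptions.

```python
import pandas as pd

# Toy rows standing in for the spreadsheet; all column names are assumed.
raw = pd.DataFrame({
    "OUTAGE.START.DATE": ["2011-07-01", "2014-05-11", None],
    "OUTAGE.START.TIME": ["17:00:00", "18:38:00", None],
    "OUTAGE.RESTORATION.DATE": ["2011-07-03", "2014-05-11", None],
    "OUTAGE.RESTORATION.TIME": ["20:00:00", "18:39:00", None],
})


def combine(df, date_col, time_col):
    """Merge a date column and a time column into one datetime column."""
    # str.cat leaves a missing value wherever either part is missing.
    text = df[date_col].str.cat(df[time_col], sep=" ")
    return pd.to_datetime(text, errors="coerce")


# Step 1: single datetime columns (hypothetical output names).
clean = pd.DataFrame({
    "OUTAGE_START": combine(raw, "OUTAGE.START.DATE", "OUTAGE.START.TIME"),
    "OUTAGE_RESTORATION": combine(
        raw, "OUTAGE.RESTORATION.DATE", "OUTAGE.RESTORATION.TIME"
    ),
})

# Step 2: outage duration in hours.
elapsed = clean["OUTAGE_RESTORATION"] - clean["OUTAGE_START"]
clean["DURATION_HOURS"] = elapsed.dt.total_seconds() / 3600

# Step 3: count missing values per column.
missing_per_column = clean.isna().sum()
print(clean)
print(missing_per_column)
```

Rows with a missing date or time end up with a missing duration instead of raising an error, which keeps them visible in the missing-value count.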

The univariate plots show the distributions of outage duration and outage cause. The first histogram shows that most outages are short, with a long tail of multi-day events. The second histogram shows the frequency of each power outage cause; severe weather and intentional attacks dominate the dataset.
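
The counts behind a cause-frequency histogram can be tabulated directly. The sketch below uses made-up labels purely for illustration; only the column name CAUSE.CATEGORY comes from this write-up.

```python
import pandas as pd

# Made-up cause labels for illustration; not the real dataset.
causes = pd.Series(
    ["severe weather"] * 5 + ["intentional attack"] * 3 + ["equipment failure"] * 2,
    name="CAUSE.CATEGORY",
)

# Absolute counts, sorted from most to least frequent.
counts = causes.value_counts()

# Relative frequencies make class imbalance easy to spot.
shares = causes.value_counts(normalize=True)

print(counts)
print(shares)
```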

Exploratory Data Analysis:

The first bivariate plot is a box plot of outage duration by year, which reveals year-to-year variability in outage lengths.

The second bivariate plot is a scatterplot of duration vs. customers affected. It shows a modest positive trend: longer outages tend to affect more customers than shorter ones, as expected.
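
One way to put a number on a trend like this is a correlation coefficient. The sketch below computes Pearson's r on toy values; the numbers and the DURATION_HOURS column name are assumptions, and the result says nothing about the real dataset.

```python
import pandas as pd

# Toy values for illustration only.
outages = pd.DataFrame({
    "DURATION_HOURS": [1.0, 5.0, 20.0, 50.0, 100.0],
    "CUSTOMERS.AFFECTED": [1000.0, 800.0, 30000.0, 20000.0, 90000.0],
})

# Series.corr defaults to the Pearson correlation coefficient.
r = outages["DURATION_HOURS"].corr(outages["CUSTOMERS.AFFECTED"])
print(f"Pearson r = {r:.2f}")
```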

The aggregate table shows the mean outage duration for each cause category.
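
An aggregate table of this kind is typically a groupby followed by a mean. A minimal sketch on toy values, assuming a hypothetical DURATION_HOURS column:

```python
import pandas as pd

# Toy rows for illustration; durations are invented.
outages = pd.DataFrame({
    "CAUSE.CATEGORY": [
        "severe weather",
        "severe weather",
        "intentional attack",
        "fuel supply emergency",
    ],
    "DURATION_HOURS": [48.0, 72.0, 2.0, 200.0],
})

# Mean duration per cause, longest first.
mean_duration = (
    outages.groupby("CAUSE.CATEGORY")["DURATION_HOURS"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_duration)
```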


Visualizations

🔸 Outage Duration Distribution

🔸 Outage Cause Frequency

🔸 Outage Duration by Year

🔸 Duration vs. Customers Affected

🔸 Average Duration by Cause

Note that fuel supply emergencies had the longest average duration (~225 hrs), while islanding and intentional attacks were shorter on average.


Modeling: Predicting the Cause of an Outage

This is a multiclass classification problem. I use both a baseline and a tuned model.

Baseline Model: Logistic Regression

The baseline model is a logistic regression. The pipeline consists of a StandardScaler followed by a multinomial LogisticRegression.

The baseline achieved an accuracy of ~82%, but recall on the minority classes was poor.
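
A pipeline of the kind described above might look like the following sketch. It runs on synthetic, well-separated data, so its score is unrelated to the figure reported here; the class names, split size, and max_iter value are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two numeric features, three placeholder classes.
rng = np.random.default_rng(0)
n_per_class = 60
centers = np.array([[0.0, 0.0], [4.0, 4.0], [0.0, 6.0]])
X = np.vstack([rng.normal(c, 1.0, size=(n_per_class, 2)) for c in centers])
y = np.repeat(["cause_a", "cause_b", "cause_c"], n_per_class)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale the features, then fit logistic regression. With the default lbfgs
# solver, recent scikit-learn versions fit a multinomial model when the
# target has more than two classes.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)

accuracy = baseline.score(X_test, y_test)
print(f"Test accuracy on synthetic data: {accuracy:.2f}")
```

Putting the scaler inside the pipeline means it is fit on the training split only, which avoids leaking test-set statistics into the model.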

Prediction Problem: I am trying to predict the CAUSE.CATEGORY of a major power outage.

Features Used:

Type of Problem: This is a multiclass classification problem because the target, CAUSE.CATEGORY, has multiple different categorical classes.

Evaluation Metrics: I use accuracy together with per-class precision, recall, and F1-score, summarized with macro and weighted averages, to assess the model’s performance.
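
The difference between these averages matters when classes are imbalanced. The toy example below (labels invented for illustration) shows a classifier that always predicts the majority class: accuracy and weighted F1 look acceptable, while macro F1 exposes the missed minority class.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels: 8 majority-class outages and 2 minority-class outages.
y_true = ["weather"] * 8 + ["attack"] * 2
# A degenerate classifier that always predicts the majority class.
y_pred = ["weather"] * 10

accuracy = accuracy_score(y_true, y_pred)

# Macro: unweighted mean of per-class F1, so each class counts equally.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

# Weighted: per-class F1 averaged by class support, so the majority dominates.
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)

print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f} weighted_f1={weighted_f1:.3f}")
```

Here the per-class F1 is 8/9 for the majority class and 0 for the minority class, so the macro average is 4/9 while the weighted average is 32/45.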

Logistic regression performed reasonably well overall but was limited by class imbalance and by the limited expressiveness of its features.


Final Model: Random Forest with Engineered Features

To improve upon the baseline, I engineered two additional features:

  1. DEMAND_LOSS_PER_CUSTOMER: This is calculated as the ratio of DEMAND.LOSS.MW to CUSTOMERS.AFFECTED. It measures the outage impact per affected customer.

  2. IS_HURRICANE: This is a binary indicator set to 1 if the HURRICANE.NAMES field shows that a hurricane was involved, and 0 otherwise.
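
Both features can be derived in a few lines of pandas. The sketch below uses toy rows, and two choices in it are assumptions rather than the project's documented behavior: a zero customer count yields a missing ratio instead of infinity, and a blank or whitespace-only hurricane name counts as no hurricane.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration; values are invented.
outages = pd.DataFrame({
    "DEMAND.LOSS.MW": [250.0, 100.0, np.nan, 30.0],
    "CUSTOMERS.AFFECTED": [50000.0, 0.0, 12000.0, 6000.0],
    "HURRICANE.NAMES": ["Irene", None, None, "  "],
})

# Ratio of demand loss to customers affected. Assumption: treat a zero
# customer count as missing so the ratio is NaN rather than infinite.
customers = outages["CUSTOMERS.AFFECTED"].replace(0, np.nan)
outages["DEMAND_LOSS_PER_CUSTOMER"] = outages["DEMAND.LOSS.MW"] / customers

# Binary hurricane flag. Assumption: missing or blank names mean no hurricane.
names = outages["HURRICANE.NAMES"].fillna("").str.strip()
outages["IS_HURRICANE"] = (names != "").astype(int)

print(outages[["DEMAND_LOSS_PER_CUSTOMER", "IS_HURRICANE"]])
```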

The final model uses these engineered features together with the original two features in a Random Forest classifier. I tuned its hyperparameters with GridSearchCV, searching over the number of estimators and the maximum depth, and used balanced class weights.

Finally, I evaluated the final model on the test set and compared its performance to the baseline.
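
The tuning and evaluation steps can be sketched as below. The data is synthetic and imbalanced by construction, and several details are assumptions not stated in this write-up: the grid values, the 3-fold cross-validation, and the use of macro F1 as the selection score.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data: four features, three placeholder classes.
rng = np.random.default_rng(0)
labels = ["common", "medium", "rare"]
sizes = [120, 50, 30]
offsets = [0.0, 3.0, 6.0]
X = np.vstack(
    [rng.normal(off, 1.0, size=(n, 4)) for off, n in zip(offsets, sizes)]
)
y = np.repeat(labels, sizes)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical grid over the two hyperparameters named in the text.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    # Balanced class weights up-weight the rare classes during training.
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid,
    scoring="f1_macro",  # assumption: select on macro F1
    cv=3,
)
search.fit(X_train, y_train)

# Evaluate the refit best model on the held-out test set.
y_pred = search.predict(X_test)
macro_f1 = f1_score(y_test, y_pred, average="macro")
print(search.best_params_, f"macro F1 = {macro_f1:.2f}")
```

Selecting on macro F1 rather than accuracy is one way to keep the search from favoring models that ignore rare classes.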

Hyperparameters tuned via GridSearchCV:

F1-Scores by Class:


Conclusion

This project demonstrates that real-time features can be used to predict the cause of a major power outage reasonably well. While common causes such as severe weather are predicted with high accuracy, rare classes remain challenging.

Key Takeaways:

