Investigating Fire Frequency in CO

J|R|B
Published in The Startup
7 min read · Nov 20, 2020

This year, I have watched the world burn.

The Cameron Peak fire, west of Fort Collins. Courtesy of Poudre Fire Authority

For some of these fires, I have had a front-row seat. Local outlets told us not to go outside as the particulates and smoke turned the sun a beautiful and eerie red.

Our air is precious. Here in Colorado, in California, in Utah, in Australia, in the Amazon: all had record fire seasons, destroying forests. The atmosphere was full of these fires’ emissions at the same time COVID-19 was spreading through the air. The pandemic, the fires, and the endless onslaught of disturbing news; amid all the anxiety, a question would come to me:

“Are these the beginning scenes of a stark and suffocating play that will roll out in our future?” My common sense and community tell me that I am not the only person wondering this.

I admit, my anxious query is projection, in more ways than one. As it happens, I have been studying data science intensively at Lambda School for the last 8 weeks. It isn’t possible to answer the morbid question above, not yet. That didn’t stop me from setting out to see what I could answer with my expanding tool belt.

Growing into my Data Scientist role

Four weeks ago, Colorado was dealing with the largest fire in its history, the Cameron Peak Fire. As that massive fire surpassed 200,000 burned acres, another wildfire was undergoing explosive growth: East Troublesome.

The scars of both the Cameron Peak fire and the East Troublesome fire, as well as Troublesome smoke. NASA image 10/22/20

“On …(October 22, 2020)…the fire was estimated at 19,086 acres and 10% containment. By Thursday around 6:30 p.m., officials announced the most updated acreage of 170,000 acres, with 5% containment. Much of this growth is due to the weather, terrain and beetle-kill lodgepole pine, according to the incident management team.” Denver 7

That’s an increase of roughly 150,000 acres in about a day! Inundated by the smoke cast off by these fires, as well as news about them (homes burning in Boulder, Estes Park being evacuated), I decided to start looking for my own answers. The Global Fire Emissions Database (GFED ← check this link for an easy-to-use tool) became a useful resource. There I found a workable dataset and moved forward, cleaning it and fitting multiple models.

First, I needed to clean up my data, turning wide data:

A snippet of my dataset which held the number of fires per month spanning 01/2003 through 10/2020.

into tidy data:

The same data changed into a single feature, fires per month, in date order.
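That wide-to-tidy reshape might be sketched like this, using a hypothetical two-year slice in place of the real GFED export (the month columns and toy values here are stand-ins):

```python
import pandas as pd

# Hypothetical wide frame: one row per year, one column per month,
# mimicking the layout of the original dataset.
wide = pd.DataFrame(
    {"Year": [2003, 2004], "Jan": [12, 8], "Feb": [30, 19], "Mar": [55, 41]}
)

# Melt the month columns down into a single fires-per-month feature.
tidy = wide.melt(id_vars="Year", var_name="Month", value_name="Mo_Fire_Count")

# Put the rows in date order.
month_order = {"Jan": 1, "Feb": 2, "Mar": 3}
tidy["Month_Num"] = tidy["Month"].map(month_order)
tidy = tidy.sort_values(["Year", "Month_Num"]).reset_index(drop=True)
```

The real dataset spans 01/2003 through 10/2020, but the reshape is the same: one observation per row, in chronological order.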

With the data rendered more useful, I plotted the fires. This is an exploratory visualization:
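A plot along these lines produced that visualization; the series below is a synthetic stand-in for the real 2003-2020 counts:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

# Synthetic monthly series standing in for the real fire counts.
dates = pd.date_range("2003-01", periods=24, freq="MS")
counts = pd.Series(range(24), index=dates, name="Mo_Fire_Count")

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(counts.index, counts.values)
ax.set_xlabel("Date")
ax.set_ylabel("Fires per month")
ax.set_title("Colorado fires per month (exploratory)")
fig.savefig("fires_per_month.png")
```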

Given that our data column (a.k.a. feature) is a continuous variable, the next step was to build regression models. Regression looks at the other features (i.e., columns) in our dataframe to predict a target variable. “Fire count per month” was going to be my target because I wanted to know whether fires had been, and would be, happening more frequently. There was an issue:

The dataframe only had one feature.

Pardon me while I get nerdy on the answer: time series.

Technical Nerd Stuff

It feels natural that if one were to predict wildfires, the first place to look would be at past wildfire data. New features were engineered using the shift method:

In this instance March is the starting point because there isn’t data from December or November 2002. “Mo_Fire_Count” is the actual count of fires in a given month. The blue ellipse encircles a given month, the previous month, and 2 months previous, from left to right. The yellow highlights that as we go down in rows (i.e., forward in time) the same fire-count value becomes the previous month’s count and then the 2-months-previous count of fires.
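That shift-based feature engineering can be sketched with a toy series (the values are made up; “Prev_Mo” and “Prev_2_Mo” are my stand-in column names):

```python
import pandas as pd

# Toy monthly counts (the real frame runs 01/2003 through 10/2020).
df = pd.DataFrame({"Mo_Fire_Count": [12, 30, 55, 41, 22]})

# Lag features: last month's count and the count two months back.
df["Prev_Mo"] = df["Mo_Fire_Count"].shift(1)
df["Prev_2_Mo"] = df["Mo_Fire_Count"].shift(2)

# The first two rows have no history, so they are dropped —
# which is why the engineered table starts in March 2003.
df = df.dropna().reset_index(drop=True)
```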

Models are trained on only a portion of the data. For this regression problem, the models were trained on 80% of the data, from 01/2003 to 04/2017. Our models aren’t allowed to “see” the rest of the data; when a model is given that “new”, unseen data, it doesn’t already “know” the answer. We call these the training and validation sets. The target is separated from the rest of the features.
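A chronological split like the one described might look like this on a toy frame; a time series should be cut in date order, never shuffled:

```python
import pandas as pd

# Hypothetical monthly frame; the real data spans 01/2003 to 10/2020.
df = pd.DataFrame({"Mo_Fire_Count": range(100)})

# 80/20 split, keeping chronological order intact.
split = int(len(df) * 0.8)
train, val = df.iloc[:split], df.iloc[split:]
```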

The target is what we are trying to predict: the number of fires in the future:

Because the data is historical, we already know what actually happened a month ahead. In January 2003, the red circle on the right is the real, observed value in the correct month. The red circle on the left is what the correct prediction would be, if our regression model made a perfect guess. I referred to the target as next month:

In this case, the model “guesses” how many fires are likely to happen in the next month. Then the guess is subtracted from the next month’s reading, i.e. the real fire count: “Next_Mo”.
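Creating the “Next_Mo” target is one more shift, this time backwards (toy values again):

```python
import pandas as pd

# Toy monthly counts standing in for the real series.
df = pd.DataFrame({"Mo_Fire_Count": [12, 30, 55, 41]})

# "Next_Mo" is next month's count, pulled back one row so it
# lines up with the features for the current month.
df["Next_Mo"] = df["Mo_Fire_Count"].shift(-1)

# The last row has no "next month" yet, so it is dropped.
df = df.dropna()
```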

When evaluating regression models, we take the predictions the model makes and find the difference between each actual value and its prediction. This is the error: how far off we were from the actual, reported value. The sign of the error is discarded:

The absolute value is taken: in the first row, index 0, |-13.27| = 13.27, meaning the prediction for that month is off by 13.27 fires. That’s only one observation, and there are 212 rows in this dataframe.

With regression problems, we take all of the actual observations and all of the predictions to produce a Mean Absolute Error (MAE). We start with a simple baseline MAE, then try to beat that error with our models, generally choosing the model with the lowest mean error.
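Computed by hand on a toy example (the numbers here are made up), the MAE looks like this:

```python
import numpy as np

# Toy actual counts and model predictions.
actual = np.array([10.0, 60.0, 25.0])
predicted = np.array([23.27, 50.0, 25.0])

# Per-row errors, their absolute values, and the mean of those.
errors = actual - predicted          # [-13.27, 10.0, 0.0]
mae = np.mean(np.abs(errors))        # (13.27 + 10.0 + 0.0) / 3
```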

The fire count baseline MAE was ~59, meaning our baseline predicted, on average, a fire count that was either 59 fires too many or 59 too few. Seems shoddy? With a fairly large baseline error, my initial thought was that this baseline was beatable. Now we pull in the regression models.
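One common baseline is simply predicting the training set’s mean every time; here is a sketch on made-up numbers (this is one possible baseline, not necessarily the exact one used above):

```python
import numpy as np

# Toy training and validation targets.
y_train = np.array([10.0, 60.0, 25.0, 105.0])
y_val = np.array([40.0, 90.0])

# Baseline: always predict the mean of the training targets (50.0).
baseline_pred = np.full_like(y_val, y_train.mean())
baseline_mae = np.mean(np.abs(y_val - baseline_pred))  # (10 + 40) / 2
```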

I fit the training data on 3 different models: a Linear Regressor, an XGBoost Regressor, and a Random Forest Regressor.

Then we use the data that the model hasn’t seen, the validation data, to see if our model works well in the (pseudo)real world.
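Fitting and validating the models might look like the sketch below. Synthetic lag features stand in for the real ones, and only the two sklearn regressors are shown; XGBRegressor from the separate xgboost package follows the same fit/predict interface:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic stand-ins for the lag features (Prev_Mo, Prev_2_Mo)
# and the Next_Mo target.
rng = np.random.default_rng(0)
X = rng.integers(0, 200, size=(100, 2)).astype(float)
y = 0.5 * X[:, 0] + rng.normal(0, 5, size=100)

# Chronological 80/20 split, as before.
split = 80
X_train, X_val = X[:split], X[split:]
y_train, y_val = y[:split], y[split:]

# Fit each model on the training set, score it on the unseen set.
models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=100, random_state=0),
}
val_mae = {
    name: mean_absolute_error(y_val, m.fit(X_train, y_train).predict(X_val))
    for name, m in models.items()
}
```

Whichever model posts the lowest validation MAE is the one worth comparing against the baseline.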

And Bust

Out of all 3 models, my lowest mean absolute error on the validation set was ~142. Not only did this not beat the baseline, it was very far off. So the answer to whether Colorado is having, and will have, more fires: inconclusive. If we are to conclude anything, it would be closer to ‘no’. Given the data we “scienced”, there is no answer to “Are fires happening more frequently in CO? Will there be more fires in the future?”

Why all this length, for a model that doesn’t tell us much? I wanted to be able to write something that made a profound statement. But that isn’t how things work. I am confident there will be polished and tight pieces in the future.

There are a handful of reasons that I decided to write and publish about my model that isn’t performing:

  • As a writer, I’ve got to start somewhere; as a data scientist, I do, too.
  • The path to success is a lot messier than most publications let on, and I value transparency; I hope this piece lets others know that not everything is perfect and polished.
  • It is important to highlight that my projections, while useful in deciding what to analyze, should not influence my outcome. The outcome is simply what it is. That is the scientific approach.
  • I believe the world, and the subgroups within our societies, would benefit from more people doing science. I think people would benefit from more of the world having access to scientific information, different ways of experimenting, and the sharing of discoveries.
  • It doesn’t make a lot of sense to me that journals are behind a gilded paywall guarded by agents of “collegiate” accreditations; the U.S. in particular has a great need for people in

In a world full of amazing people, I want others to know that things aren’t always complete. You take them for what they have to show and you continue to develop. Whether it’s data science or sculpture or relationships. As for me, I am not done with my question, and I feel it is important not to read my expectations and desired outcomes into the results. I feel very blessed to be developing tools to ask hard questions with!

If you are interested in my code and the data I used, visit this GitHub repo:

https://github.com/JRBOH/ds-21-bw-project-fire



Jacob Bohlen. Artist, data scientist, musician, deep thinker, social organizer, curator, producer, and I like to write. Create and question without bounds. ❤