Data Resampling: An Introduction¶
Often, the reason we need temperature data is to explore the relationship between weather and energy use for a particular building, home, or collection of homes. To quantify this relationship, the temperature and energy observations must be on the same time scale, whether that is hourly, half-hourly, or something else. Unfortunately, readings from weather stations aren’t always available at the time scale of your meter readings, and they aren’t necessarily at regular intervals at all. Some stations in the NOAA Integrated Surface Database (ISD), for example, sometimes emit data every two minutes and sometimes every twenty minutes.
riweather has some tools to ease the process of aligning temperature data to a certain time scale. It builds on top of the resampling functionality of Pandas. This tutorial illustrates the different options at your disposal, so that you can handle the alignment exactly how you need to.
Resampling strategies on a toy example¶
Suppose we have the following ten weather observations. They occur twenty minutes apart, at 15, 35, and 55 minutes past the hour.
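import matplotlib.dates as mdates
import matplotlib.pyplot as plt


def plot_temps(datasets, labels, raw_idx=0):
    """Plot several temperature series on one axis; datasets[raw_idx] holds the raw observations."""
    fig, ax = plt.subplots(figsize=(12, 4))
    fig.suptitle("Temperature resampling")
    # zip(..., strict=True) requires Python 3.10+ and raises if the two lists differ in length
    for i, (data, label) in enumerate(zip(datasets, labels, strict=True)):
        if i == raw_idx:
            # Raw observations: faint diamond markers with no connecting line
            ax.plot(data, "D", color="black", alpha=0.2, label=label)
        else:
            # Resampled series: x markers joined by a line
            ax.plot(data, "x-", label=label)
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%I:%M %p"))
    ax.set_ylabel("Temperature (C)")
    ax.legend()
    plt.show()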
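The ten observations described above can be stood in for with a small Pandas series. This is only a sketch: the timing (twenty minutes apart, at 15, 35, and 55 minutes past the hour) and the first three values (0, 9, and 3 degrees) come from this tutorial, while the date and the remaining seven values are made up for illustration.

```python
import pandas as pd

# Ten timestamps, twenty minutes apart, starting at 12:15 AM.
# The date itself is arbitrary (an assumption for this sketch).
index = pd.date_range("2023-01-01 00:15", periods=10, freq="20min")

# Only the first three values (0, 9, 3) are taken from the tutorial text;
# the other seven are invented so that the series has ten points.
temps = pd.Series(
    [0, 9, 3, 6, 5, 10, 8, 7, 12, 11],
    index=index,
    dtype="float64",
    name="temperature",
)

print(temps)
```

A series shaped like this (a float series with a `DatetimeIndex`) is the kind of object the Pandas-only sketches in the rest of this tutorial operate on.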
The four rollup_*() functions¶
We need to align these temperatures to a dataset of smart meter readings that occur at the top of every hour. How can we do this?
One option is to take the average of all readings over the previous hour. So, for 1 AM, the previous hour’s readings were 0, 9, and 3 degrees, and the average is 4. With riweather we can do this with rollup_ending(). The word “ending” in the name means that we are calculating “hour-ending” values.
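To make the idea of an hour-ending average concrete, here is a minimal sketch that uses Pandas directly. It is not taken from riweather's source; it only illustrates the concept, and it assumes that a reading falling exactly on the hour counts toward the hour that is ending. Apart from the first three values, the data is invented.

```python
import pandas as pd

# Stand-in observations: only the first three values come from the tutorial.
index = pd.date_range("2023-01-01 00:15", periods=10, freq="20min")
temps = pd.Series([0, 9, 3, 6, 5, 10, 8, 7, 12, 11], index=index, dtype="float64")

# closed="right": each bin is (start, end], so it includes its ending timestamp.
# label="right":  each average is stamped with the end of its bin.
hour_ending = temps.resample("1h", closed="right", label="right").mean()

print(hour_ending)
```

The first row is stamped 1 AM and holds the average of the 12:15, 12:35, and 12:55 readings, which is 4.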
Because meter readings are typically collected as hour-ending values as well, rollup_ending() is the default behavior of the all-in-one fetching and resampling method Station.fetch_temp_data().
You can get hour-starting values instead with rollup_starting(). Notice that the first three readings are still averaged together, but the timestamp assigned to the result is midnight (the start of the hour), instead of 1 AM.
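The same comparison can be sketched in plain Pandas. Again, this is an illustration of the hour-starting idea rather than riweather's own code, and it assumes that a reading exactly on the hour belongs to the hour that is starting. Values beyond the first three are invented.

```python
import pandas as pd

# Stand-in observations: only the first three values come from the tutorial.
index = pd.date_range("2023-01-01 00:15", periods=10, freq="20min")
temps = pd.Series([0, 9, 3, 6, 5, 10, 8, 7, 12, 11], index=index, dtype="float64")

# closed="left": each bin is [start, end), so it includes its starting timestamp.
# label="left":  each average is stamped with the start of its bin.
hour_starting = temps.resample("1h", closed="left", label="left").mean()

print(hour_starting)
```

The average of the first three readings is still 4, but it is now stamped midnight instead of 1 AM.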
Instead of hour-ending or hour-starting values, you might want the range over which the average is taken to extend both forward and backward. Using rollup_midpoint(), the value at 1 AM will be the average over all data between 12:30 and 1:30 AM.
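A centered window like this can be imitated in Pandas by shifting the timestamps forward by half the period and then taking an ordinary hour-starting average. The sketch below does that; it is an assumption-laden illustration, not riweather's implementation. In particular, the treatment of readings that fall exactly on 12:30 or 1:30 (here: 12:30 counts toward 1 AM, 1:30 toward 2 AM) is a choice made for this example, and all values after the first three are invented.

```python
import pandas as pd

# Stand-in observations: only the first three values come from the tutorial.
index = pd.date_range("2023-01-01 00:15", periods=10, freq="20min")
temps = pd.Series([0, 9, 3, 6, 5, 10, 8, 7, 12, 11], index=index, dtype="float64")

# Move every timestamp forward by half an hour, so that readings taken in
# [12:30, 1:30) land in [1:00, 2:00) and are binned under the 1 AM label.
shifted = temps.copy()
shifted.index = shifted.index + pd.Timedelta("30min")

midpoint = shifted.resample("1h", closed="left", label="left").mean()

print(midpoint)
```

With this made-up data, the 1 AM value is the mean of the 12:35, 12:55, and 1:15 readings (9, 3, and 6), which is 6.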
Finally, rollup_instant() does not do any averaging at all. Instead, the value for a certain hour will simply be the first actual data point during that hour.
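Picking the first observation in each hour is also easy to sketch with Pandas. The snippet below is a conceptual stand-in, not riweather's code; it assumes that "during that hour" means from the top of the hour up to (but not including) the next one. Values after the first three are invented.

```python
import pandas as pd

# Stand-in observations: only the first three values come from the tutorial.
index = pd.date_range("2023-01-01 00:15", periods=10, freq="20min")
temps = pd.Series([0, 9, 3, 6, 5, 10, 8, 7, 12, 11], index=index, dtype="float64")

# No averaging: keep the first reading found in each [start, end) hour
# and stamp it with the start of that hour.
hour_instant = temps.resample("1h").first()

print(hour_instant)
```

Here the 1 AM value is simply the 1:15 reading; the 1:35 and 1:55 readings are ignored.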
Going beyond hourly intervals¶
Every example of the rollup_*() functions so far has specified "H" for the second argument. This is a frequency string from Pandas. You can pass any frequency string to get data resampled at different intervals; see the Pandas documentation on frequency strings (offset aliases) for more information on how to use them.
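To get a feel for frequency strings, the short sketch below generates a few timestamps at different intervals with Pandas. Note that Pandas 2.2 and later deprecate the uppercase aliases "H" and "T" in favor of "h" and "min", so the lowercase spellings are used here; which spelling a particular riweather version expects is not something this sketch covers.

```python
import pandas as pd

start = "2023-01-01 00:00"

# A multiplier in front of the alias sets the interval length.
for freq in ["1h", "30min", "15min"]:
    stamps = pd.date_range(start, periods=3, freq=freq)
    print(freq, [t.strftime("%H:%M") for t in stamps])

# The half-hourly grid on its own, for reference.
half_hourly = pd.date_range(start, periods=3, freq="30min")
```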
As another example, we could resample our data to get half-hourly intervals:
# "30T" is the Pandas frequency string for 30 minutes
temps_30min_starting = riweather.rollup_starting(temps, "30T", upsample_first=False)
temps_30min_ending = riweather.rollup_ending(temps, "30T", upsample_first=False)
datasets = [
    temps,
    temps_30min_starting,
    temps_30min_ending,
]
labels = [
    "Actual observations",
    "Half-hourly resampling, hour-starting",
    "Half-hourly resampling, hour-ending",
]
plot_temps(datasets, labels)
Upsampling before resampling¶
So far, we have used all of the rollup functions with the parameter upsample_first=False. With this option set, the resampling is based on the raw values and the raw values only. For example, the hour-ending rollup value at 1 AM is precisely the average of all raw values that occurred between 12 AM and 1 AM.
This isn’t always the most accurate way to arrive at a single representative value for the entire hour (or half-hour, or whatever the desired time interval is). Consider the following temperatures:
There are two observations between 12 AM and 1 AM: 1 degree at 12:01 and 2 degrees at 12:33. If we were to simply take the average in our hour-ending resampling procedure, we would end up with a value of 1.5 degrees at 1 AM.
However, the temperature rose to 10 degrees by 1:05, so there were very likely higher temperatures in the second half of the 12 AM hour that the two available observations did not capture. The 1.5-degree figure is therefore likely an underestimate of the average temperature over that period.
We can address this problem by first upsampling the observations before resampling them. Upsampling means taking data from one time interval to a finer-grained time interval. In this case, we perform linear interpolation to take the data up to the minute level. This allows us to better capture changes in temperature.
You can perform the upsampling step on its own with riweather.upsample():
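For intuition about what minute-level upsampling produces, here is a Pandas-only sketch of linear interpolation onto a one-minute grid. It uses the three observations described above (1 degree at 12:01, 2 degrees at 12:33, 10 degrees at 1:05) on an arbitrary date, and it is an illustration of the technique, not a reproduction of riweather.upsample() itself.

```python
import pandas as pd

# The three observations described in the text; the date is arbitrary.
obs = pd.Series(
    [1.0, 2.0, 10.0],
    index=pd.to_datetime(
        ["2023-01-01 00:01", "2023-01-01 00:33", "2023-01-01 01:05"]
    ),
)

# Step 1: place the observations on a regular one-minute grid.
#         Minutes with no observation become NaN.
minute_grid = obs.resample("1min").mean()

# Step 2: fill the gaps with straight lines between neighboring observations.
#         method="linear" treats rows as equally spaced, which holds on this grid.
minute_level = minute_grid.interpolate(method="linear")

print(minute_level.head())
```

The result has one value per minute from 12:01 to 1:05. Halfway between the first two observations (12:17) the interpolated temperature is 1.5 degrees, and halfway between the last two (12:49) it is 6 degrees.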
In the rollup functions, you can perform the upsampling and resampling all in one step by passing upsample_first=True (this is the default, so you can also just omit it).
The resampled 1 AM value is now 3.3 degrees, up from 1.5 in the other approach.
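The 3.3 figure can be checked by hand. The derivation below assumes linear interpolation between neighboring observations and an hour-ending bin made up of the sixty whole minutes from 12:01 through 1:00; under those assumptions it reproduces the value quoted above. Here $m$ is the number of minutes after midnight and $T(m)$ is the interpolated temperature.

```latex
T(m) =
\begin{cases}
1 + \dfrac{m - 1}{32}, & 1 \le m \le 33 \\[6pt]
2 + \dfrac{m - 33}{4}, & 33 \le m \le 65
\end{cases}

\bar{T}_{\text{1 AM}}
  = \frac{1}{60}\left[\sum_{m=1}^{33} T(m) + \sum_{m=34}^{60} T(m)\right]
  = \frac{49.5 + 148.5}{60}
  = 3.3
```

The second sum dominates because the steep climb toward 10 degrees begins at 12:33, which is exactly the information the raw average of 1 and 2 degrees misses.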
Going back to the first toy example, let’s observe the difference in hour-ending values with and without the initial upsampling:
temps_ending_upsampled = riweather.rollup_ending(temps, "H", upsample_first=True)
temps_ending_not_upsampled = riweather.rollup_ending(temps, "H", upsample_first=False)
datasets = [
    temps,
    temps_ending_upsampled,
    temps_ending_not_upsampled,
]
labels = [
    "Actual observations",
    "Hour-ending, upsampling",
    "Hour-ending, no upsampling",
]
plot_temps(datasets, labels)
Fetching, upsampling, and resampling from a Station¶
When fetching temperature data from the ISD, you can perform all of the resampling in a single step by passing the appropriate parameters to Station.fetch_temp_data().
Customizing the resampling process¶
When used in this way, Station.fetch_temp_data() performs three steps:
- Fetch the raw data from the ISD
- Upsample to minute level (if desired)
- Resample to the given time interval
Each of these steps is exposed by the riweather API as a separate function, so if you want to manipulate the data directly with Pandas, you can store the intermediate datasets and use them however you want.
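The second and third steps can be sketched with Pandas alone. In the example below, `resample_temps` is a hypothetical helper written for this illustration (it is not part of riweather), the fetch step is replaced by three hard-coded observations because the real one needs network access, and hour-ending averaging is assumed for the final step.

```python
import pandas as pd


def resample_temps(raw: pd.Series, period: str = "1h", upsample_first: bool = True) -> pd.Series:
    """Hypothetical helper: optionally upsample to minute level, then average per period."""
    data = raw
    if upsample_first:
        # Step 2 (optional): one-minute grid, gaps filled by linear interpolation.
        data = raw.resample("1min").mean().interpolate(method="linear")
    # Step 3: period-ending average, stamped with the end of each bin.
    return data.resample(period, closed="right", label="right").mean()


# Step 1 stand-in: hard-coded observations instead of a real fetch from the ISD.
raw = pd.Series(
    [1.0, 2.0, 10.0],
    index=pd.to_datetime(
        ["2023-01-01 00:01", "2023-01-01 00:33", "2023-01-01 01:05"]
    ),
)

with_upsampling = resample_temps(raw)
without_upsampling = resample_temps(raw, upsample_first=False)
half_hourly = resample_temps(raw, period="30min")

print(with_upsampling)
print(without_upsampling)
```

Keeping the intermediate series around, rather than chaining everything, makes it easy to inspect or plot the minute-level data before it is averaged.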
To recreate the exact steps performed in fetch_temp_data():