PostgreSQL topic of the Day - aggregating timeseries data

| No Comments | No TrackBacks

postgreslogo.pngFrequently when dealing with parametric data, you need to "roll up" the data in summary fashion as it ages in order to reduce the volume kept on hand, or maybe because the summary statistics are what really interests you. There are several ways to do that, and this post highlights four different approaches.

I was reminded of this kind of "roll ups" today by a question on the pgsql-novice list. This is actually quite a large topic, so I this tip will likely just scratch the surface. The question was related to storing min, max, and avg summaries on an hourly, daily, and weekly basis. The basic idea, for example, is that you can keep raw data for maybe a week, hourly summaries for 6 months, daily summaries for 3 years, and weekly summaries forever. As I mentioned in my reply, I have done this kind of thing over the years using at least 4 approaches:

  1. Aggregate on demand
  2. Batch aggregate on a periodic basis -- e.g. run your aggregate query with a cron job which truncates and rebuilds a table (i.e. a "materialized view")
  3. Write a C based trigger that does "continuous aggregation" to a materialized table
  4. Write a C based bulk loader that aggregates as it bulk loads the raw data into a materialized table

The first approach is simply to run an aggregate query whenever you need the summarized data. Obviously this does not really satisfy the stated desire to discard aged raw data, but I mention it for completeness. In some cases you have sufficient storage given your data volume, and performance of the aggregate is "good enough".

The second is the rough equivalent of a materialized view. In other words, run a batch job via cron or something similar that TRUNCATEs and then repopulates a table used for storage of the aggregate result. Particularly for daily or weekly summary data, when the consumers of the data are 9-5 folk, this approach works pretty well. This also fits in nicely with common partitioning schemes.

The third is one where you want summary statistics to be updated live. In this case you actually want the summary data for the current hour/day/week to all be constantly updated as new raw data comes in. Otherwise you are stuck always looking at last hour's, or yesterday's, or last weeks, data. The way to do this is through a trigger. A while back I implemented a continuous aggregation trigger in C that used prepared queries to update my aggregate table for every INSERT/UPDATE/DELETE occurring on the target table. However even with the trigger written in C and using prepared queries, the performance impact of the trigger firing for every DML event was significant.

Finally, the forth method can be used when your reporting needs are such that the raw data can be collected for some period before storing in your database. Let's say the summary reports are never run against the current hour. What you can do is build up a file in suitable format for bulk loading via COPY. Then process the data as it is bulk loaded to calculate and insert the summary at the same time. Again, I had done that in the past using a C program that read in the stored files, generated the summary data while building a string buffer, and finally using libpq's PQputCopyData() to populate the tables.

More than likely some combination of the above is what you really want. Perhaps use method 2 to maintain your weekly and daily aggregate materialized views, and use method 4 to update your hourly aggregate data.

This post was a lot of discussion and no code -- perhaps tomorrow I will continue with some more concrete examples.

No TrackBacks

TrackBack URL: http://blog.credativ.com/mt-tb.cgi/180

Leave a comment