Monday, February 1, 2010

KPI part 2: materializing views

In my first KPI post, I mentioned how performance became an issue and how we solved it by pre-generating the results of the summarized data we wanted to display. That was dandy ... once. But it was obvious we'd want to do something similar for other data already stored in our database. As a huge fan of consistency and repeatability, I didn't care for the idea of a mess of tables and another mess of cron jobs and scripts to keep those different tables up to date, each in their own custom way.

A little research turned up the concept of 'Materialized Views'. The idea is simple: generate a view and use it to create and maintain the contents of a static table. A wide variety of "refesh" techniques can be applied (depending on the importance of having current data and limiting impact on either inserts to the source tables or to the end user viewing the summary. Some databases implement this natively. PostgreSQL isn't one of them.

Some great discussions of the concept are already available, so I'll just highlight how we chose to implement it.

First, we created a schema in our 'ironfist' database to contain the views as well as the tables where the materialized content is stored (an MV table) along with a master table too keep track of them:

CREATE SCHEMA matview;
GRANT ALL ON SCHEMA matview TO PUBLIC;

CREATE TABLE matview.views (
mv_name NAME NOT NULL PRIMARY KEY
, v_name NAME NOT NULL
, last_refresh TIMESTAMP WITH TIME ZONE
);

For our application, stale data wasn't a concern. All of our capacity KPIs either show monthly data or show current data. Regenerating the data weekly (and even the performance hit of regenerating all the data each time) doesn't cost us much. So the set of functions we used to create a materialized view infrastructure is fairly simple.

Before starting, we test and create a view in the 'matview' schema then use the various matview functions to create and maintain the MV table.

This first function creates an MV table based on an existing view. We tell it the name (which is used to find the view and to create the corresponding table) and the columns we wish to use as the primary key (this helps with future in-place updates). For example:

select matview.create('kpi_network_switch_ports','sector,month');

The create function looks like:

CREATE OR REPLACE FUNCTION matview.create(name text,primary_cols text) RETURNS VOID AS
$$
DECLARE
entry matview.views%ROWTYPE;
view_name text:= 'matview.' || name || '_v';
matview_name text:= 'matview.' || name || '_mv';
BEGIN
SELECT * INTO entry FROM matview.views WHERE mv_name = matview_name;
IF FOUND THEN
RAISE EXCEPTION 'Materialized view ''%'' already exists.',
matview_name;
END IF;
EXECUTE 'REVOKE ALL ON ' || view_name || ' FROM PUBLIC';
EXECUTE 'GRANT SELECT ON ' || view_name || ' TO PUBLIC';
EXECUTE 'CREATE TABLE ' || matview_name || ' AS SELECT * FROM ' || view_name;
EXECUTE 'ALTER TABLE ' || matview_name || ' ADD PRIMARY KEY ' || '(' || primary_cols || ')';
EXECUTE 'REVOKE ALL ON ' || matview_name || ' FROM PUBLIC';
EXECUTE 'GRANT SELECT ON ' || matview_name || ' TO PUBLIC';
EXECUTE 'CREATE TRIGGER ' || name || '_trigger' ||
' BEFORE INSERT ON ' || matview_name ||
' FOR EACH ROW EXECUTE PROCEDURE insert_or_replace_if_duplicate ()';

INSERT INTO matview.views (mv_name, v_name, last_refresh)
VALUES (matview_name, view_name, CURRENT_TIMESTAMP);

RETURN;
END
$$
SECURITY DEFINER
LANGUAGE plpgsql;

The final insert populates the table with the current results of a SELECT * on the view. The table can subsequently be 'refreshed' with the matview.refresh function. This function refreshes either a single table by name or 'all' (all MVs listed in the matview.views table):

CREATE OR REPLACE FUNCTION matview.refresh(name text) RETURNS VOID AS
$$
DECLARE
query text;
mview matview.views%ROWTYPE;
matview_name text:= 'matview.' || name || '_mv';
BEGIN
IF (name = 'all')
THEN
query := 'SELECT * from matview.views';
ELSE
query := 'SELECT * FROM matview.views WHERE mv_name =' || quote_literal(matview_name);
END IF;

FOR mview in EXECUTE query LOOP
EXECUTE 'INSERT INTO ' || mview.mv_name || ' SELECT * FROM ' || mview.v_name;
UPDATE matview.views
SET last_refresh=CURRENT_TIMESTAMP WHERE mv_name=mview.mv_name;
END LOOP;

IF NOT FOUND THEN
RAISE EXCEPTION 'Materialized view % does not exist.', matview_name;
END IF;

RETURN;
END;
SECURITY DEFINER
LANGUAGE plpgsql;

In the initial MV table creation (in the matview.create function), you'll notice that we added a BEFORE INSERT trigger to the table. When we refresh the table, we update all rows in the table. Other refresh approaches involve placing triggers on data source tables (we wanted to avoid that, especially since some of our tables are stored on another database server).

We could just empty the table and completely regenerate it, but some of our views take several minutes to run. During that time our table would be empty of data. And if the view failed (if, say, the dblink connection to our network modeling database failed) the table would remain empty. Stale data is better than no data.

Finally, although we have a last refresh time that last refresh time doesn't guarantee that running the view again won't change existing values for existing rows in the table. For "total number of devices active on the network in January", that number will update on January 1st, then again on January 2nd, etc. The refresh time is useful for confirming how up to date our data is, but not useful for reducing the work needed to update it.

So instead, we regenerate all data in an MV, adding a row if no row with the same primary key values exists, but otherwise replacing (not updating!) the matching row with a new one using the following function as a trigger:

CREATE FUNCTION insert_or_replace_if_duplicate() RETURNS trigger
AS $$
/* Description: Trigger function to insert a new row, first deleting if it would violate the primary key

To use this function, create a trigger on the appropriate table as follows:

CREATE TRIGGER availability_summary_insert_trigger
BEFORE INSERT ON kpi.availability_summary
FOR EACH ROW EXECUTE PROCEDURE insert_or_replace_if_duplicate();

This before trigger checks to see if the insert would violate the primary key of the table.
If it does, it will delete the entire row before returning to the insert call. This is easier
than trying to figure out which fields to update.
*/
DECLARE
table_name text := TG_TABLE_SCHEMA || '.' || TG_TABLE_NAME;
where_clause text := ' WHERE TRUE';
colname text;
BEGIN
FOR colname IN select column_name
from information_schema.key_column_usage
natural join information_schema.table_constraints itc
where itc.table_schema=quote_ident(TG_TABLE_SCHEMA)
and itc.table_name=quote_ident(TG_TABLE_NAME)
and itc.constraint_type='PRIMARY KEY'
LOOP
-- the syntax is hairy, but this casts the NEW row to the appropriate table and lets us pick
-- out the columns we want to compare.

where_clause := where_clause || ' AND ' || colname || '=' || '(' || quote_literal(new) || '::' || table_name || ').' || colname;
END LOOP;

-- now that we know what row we will conflict with, delete it before doing our insert.

EXECUTE 'DELETE FROM ' || table_name || where_clause;
RETURN NEW;
END
$$
LANGUAGE plpgsql;


This works because we are assured that updates will always generate data for all columns. The above function wouldn't be useful for cases where only some column data was provided for subsequent updates.

Finally, of course, we have a function that cleans up any MV tables if we choose to remove them:

CREATE OR REPLACE FUNCTION matview.drop(name text) RETURNS VOID
AS
$$
DECLARE
entry matview.views%ROWTYPE;
view_name text:= 'matview.' || name || '_v';
matview_name text:= 'matview.' || name || '_mv';
BEGIN
SELECT * INTO entry FROM matview.views WHERE mv_name = matview_name;

IF NOT FOUND THEN
RAISE EXCEPTION 'Materialized view % does not exist.', matview_name;
END IF;

EXECUTE 'DROP TABLE ' || matview_name;
DELETE FROM matview.views WHERE mv_name=matview_name;
RETURN;
END
$$
SECURITY DEFINER
LANGUAGE plpgsql;


The final post in this series will discuss some of what we did with the Google Visualization API to use and display the data from our Materialized Views.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.