Thursday, January 14, 2010

KPI part 1: database/XML mapping populates Key Performance Data charts

Recently we were asked to develop a page of Key Performance Indicator charts showing network availability and other network capacity statistics. We have a lot of historical data, scattered throughout flat files, RRD files and in our PostgreSQL database. In the past we've responded to similar requests by producing ad-hoc charts and tables, ending up with several web pages with disparate look and feel and levels of detail.

This time, we wanted to come up with a complete system that could be re-used; settling on a preferred (preferably javascript-based) flexible charting system, a consistent look and feel and a manageable way of storing the underlying data. Ideally some of those ad-hoc charts would even be migrated to this new system.

Just identifying how to calculate the data for the initial KPI was challenging (when you measure average network availability for a campus network, where do you measure? Do you weight the 'service points' equally? And what's the smallest meaningful time range to show?)

We determined that UW campus network availability would be measured by 'average uptime of unweighted subnet prefix' and that we would show average availability over week, quarter and year.

So, where do we get this data? We merge it from several of our production network management tools:

Database We use a PostgreSQL database to model our network. The Network Operations Center maintains tables of devices, interfaces, networks, contacts, events and all their related attributes, using a mixture of manual addition and automated device discovery systems.

Meerkat historical data Meerkat is our locally developed open source network monitoring system. It tracks all network events (writing them to the database) and generates alerts for the Network Operations Center. We have a companion web service which can provide availability information for devices or network prefixes from the detected network events.

Meerkat availability service The companion availability service reports raw availability for a given device or network prefix. Device availability is easy: if there's an event showing the device was unavailable in our database, its raw availability for the year is reduced by the length of the outage window. Determining network availability requires topology discovery and awareness of redundant paths. Only if outage events occur interrupting all paths from the monitoring system to the network will the raw availability of the network be reduced.

RT::Outages planned event tickets RT::Outages is locally written tool which interacts with the RT trouble ticket system. It provides a workbench out network operations staff can use to generate outage reports. The workbench allows selection of devices and interfaces involved in an outage and provides a date picker for consistent entry of outage window start and end times. Contact lists are automatically generated from the database device and network contact lists. And, essential for the KPI project, RT::Outages is also used to schedule planned events.

For the KPI project, some changes had to be made to enhance the Meerkat availability service. Availability isn't conventionally calculated on all network downtime; just on unplanned outages. The RT::Outages service has that information available and we use that to permit the Meerkat availability service to exclude those events. In addition, the service only provided an annual availability measurement. To get quarterly and weekly measurements we updated it to accept arbitrary time ranges.

Now that we knew where to get our raw information what would we do with it?

We started by writing a nightly tool to fetch the availability of each of the "service points" on our network (service points are network prefixes for the UW campus and UW Medical Centers networks, sites for the Washington K20 networks). To enable this, we made use of another locally developed open source initiative:

REX REX is a SQL/XML mapper that provides a framework for turning PostgreSQL database queries into XML documents available through a REST service. It leverages the XML functions available in PostgreSQL. In addition to fetching information from the database this way, we can also update the database.

The nightly process, for example, is a Perl program that connects to a REX-based REST service for the list of service points, processes each service point by sending a request to the Meerkat availability service for the availability for the previous day, then uses XML::Simple to generate an XML document to post back to a final service which updates the database with all the service point measurements. Given a hash of measurements:

$VAR1 = {
'measurement' => [
{
'owner' => 'uwcampus',
'availability' => '100.0',
'includes_planned_events' => 'false',
'service_point' => '128.95.138.0/24',
'datetime' => '2010-01-12'
}
]
};

We update the database with the following XML document:

<measurements>
<measurement availability="100.0" datetime="2010-01-12" includes_planned_events="false" owner="uwcampus" service_point="128.95.138.0/24" />
</measurements>

using the following code, using Perl, XML::Simple and LWP::UserAgent::Determined:

post_measurements( { ua => $ua,
post_document => XMLout ($measurements, RootName=>'measurements')
} );

sub post_measurements {
my ($param_ref) = @_;
my $post_document = $param_ref->{post_document} || croak 'No document to post';

my $ua = $param_ref->{ua} || LWP::UserAgent::Determined->new;
my $url = "https://$options{ws}/rex/xml/kpi_availability/ADD_DATA";
my $request = HTTP::Request->new('POST', $url);

$request->content_type('application/xml');
$request->content($post_document);
my $response = $ua->request($request);
}

Typically we send multiple measurements at a time. This produces a table full of measurements, one per day per service point.

When we display them, however, we want to summarize the daily service_point measurements to weekly, quarterly and annual measurements for an entire collection of service points (e.g. the service points in the UW campus network or in the K20 statewide network). For our prototype, we intended for the front end to process XML, so we wrote a PL/pgSQL function (see end of post) to generate an XML document. The REX-based service supplying data to the front end was configured to run this on demand. For more information on using the xml functions in PostgreSQL, see http://www.postgresql.org/docs/8.3/interactive/functions-xml.html

We soon found that running this on demand was much too slow (seven seconds!) and that the pchart system we initially used didn't provide all the features we wanted and locked us in to PHP.

Stay tuned; in my next posts I'll outline how we implemented Materialized Views to rescue us from performance problems like those experienced above and how we implemented our final Javascript-based front-end using Google Charts via the Google Visualization API.

CREATE FUNCTION rex_fn.kpi_availability_read(owner_selected text, year_selected text,
include_planned_events boolean)
RETURNS xml
AS $$
BEGIN
/* FUNCTION: rex_fn.kpi_availability_read()
Description: Summarizes weekly, quarterly and yearly availability measurements
Affects: Makes no changes to the data
Arguments: name of a supported owner, a year and whether or not the measurements should include or
exclude planned event windows
Returns: xml document like:

<availability>
<yearlyavailability>
<plannedincluded>true</plannedincluded>
<owner>
<name>k20</name>
<year>
<number>2009</number>
<averageavailability>99.8159518537013916</averageavailability>
<quarter>
<number>1</number>
<averageavailability>99.8047910271571390</averageavailability>
<complete>true</complete>
<week>
<number>01</number>
<averageavailability>99.6165324837199837</averageavailability>
<start_date>1230796800</start_date>
<end_date>1231315200</end_date>
<complete>true</complete>
</week>
</quarter>
</year>
</owner>
<yearlyavailability>
</availability>
*/
BEGIN
RETURN xmlelement (name availability,
xmlagg(availability_frag)
)
FROM ( SELECT xmlelement (name yearlyAvailability,
xmlconcat (xmlelement (name plannedIncluded, include_planned_events),
xmlagg (xmlelement (name owner, owner_frag))
)
) AS availability_frag
FROM ( SELECT xmlconcat(xmlelement (name name, owner),
xmlagg(xmlelement (name year, year_frag))
) AS owner_frag
FROM ( SELECT owner, year, xmlconcat (xmlelement (name number, year),
xmlagg (quarter_frag)
) AS year_frag
-- Clause to produce the "yearly availability"
FROM ( SELECT owner,
to_char(datetime,'YYYY') AS year,
xmlelement ( name averageAvailability,avg(availability) ) AS quarter_frag
FROM kpi.availability
WHERE owner = owner_selected
AND includes_planned_events = include_planned_events
AND to_char(datetime,'YYYY') = year_selected
GROUP BY owner, year
UNION ALL
-- Call helper function to produce the xml for the "quarterly availability"
SELECT *
FROM rex_fn.kpi_availability_read_quarter (owner_selected,
year_selected,
include_planned_events)
) AS sub4 GROUP BY owner, year
) AS sub3 GROUP BY owner
) AS sub2
) AS sub1;
END;
$$
LANGUAGE plpgsql;

CREATE FUNCTION rex_fn.kpi_availability_read_quarter(owner_selected text, year_selected text,
include_planned_events boolean)
RETURNS SETOF kpi_availability_quarter_record_type
AS $$
/* FUNCTION: rex_fn.kpi_availability_read_quarter()
Description: helper function for rex_fn.kpi_availability_read -- generates quarterly xml subtree

Affects: Makes no changes to the data
Arguments: name of a supported owner, a year and whether or not the measurements should include or
exclude planned event windows
Returns: set of xml documents
*/
BEGIN
RETURN QUERY SELECT owner, year, xmlelement(name quarter, xmlconcat (xmlelement (name number, quarter),
xmlagg (week_frag)
)
) AS quarter_frag
-- Clause to produce the "quarterly availability"
FROM ( SELECT owner, to_char(datetime,'YYYY') AS year, to_char(datetime, 'Q') as quarter,
xmlconcat (xmlelement ( name averageAvailability,avg(availability) ),
xmlelement ( name complete,
CASE WHEN to_char(max(datetime),'Q') != to_char( max(datetime) + interval '1 day','Q')
THEN TRUE
ELSE FALSE END
)
) AS week_frag
FROM kpi.availability
WHERE owner = owner_selected
AND includes_planned_events = include_planned_events
AND to_char(datetime,'YYYY') = year_selected
GROUP BY owner, year, quarter
UNION ALL
-- Call helper function to produce the xml for the "weekly availability"
SELECT * FROM rex_fn.kpi_availability_read_week (owner_selected,
year_selected,
include_planned_events)
) AS sub GROUP BY owner, year, quarter ORDER BY year, quarter ASC;
RETURN;
END;
$$
LANGUAGE plpgsql;

CREATE FUNCTION kpi_availability_read_week(owner_selected text, year_selected text, include_planned_events boolean)
RETURNS SETOF kpi_availability_week_record_type
AS $$
/* FUNCTION: rex_fn.kpi_availability_read_week()
Description: helper function for rex_fn.kpi_availability_read -- generates weekly xml subtree
Affects: Makes no changes to the data
Arguments: name of a supported owner, a year and whether or not the measurements should include or
exclude planned event windows
Returns: set of xml documents
*/
BEGIN
RETURN QUERY SELECT owner,
year,
quarter,
xmlelement(name week, xmlconcat (xmlelement (name number, week),
xmlagg ( xmlelement (name averageAvailability, averageAvailability) ),
xmlagg ( xmlelement (name start_date, extract (epoch from start_date)) ),
xmlagg ( xmlelement (name end_date, extract (epoch from end_date)) ),
xmlagg ( xmlelement (name complete,
CASE WHEN start_date + interval '1 week' = end_date + interval '1 day'
THEN TRUE
ELSE FALSE END)
) -- xml element week
) -- xmlconcat
) AS week_frag
FROM ( SELECT owner AS owner,
to_char(week_start_date(datetime,'Mon'),'YYYY') AS year,
to_char(week_start_date(datetime,'Mon'), 'Q') AS quarter,
to_char(week_start_date(datetime,'Mon'),'WW') AS week,
min(datetime) AS start_date,
max(datetime) AS end_date,
avg(availability) AS averageAvailability
FROM kpi.availability
WHERE owner = owner_selected
AND includes_planned_events = include_planned_events
AND to_char(week_start_date(datetime,'Mon'),'YYYY') = year_selected
GROUP BY owner, year, quarter, week
) AS sub
GROUP BY owner, year, quarter, week
ORDER BY quarter, week ASC;
RETURN;
END;
$$
LANGUAGE plpgsql;

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.