Friday, June 3, 2011

mDash, or Systems Monitoring Using jQuery Mobile

We've written Mobile Dash, or mDash, a mobile web app for monitoring UW-IT system alerts.


Some screenshots:

mDash home screen on iPhone 4

mDash alert detail page

(It's only for UW-IT personnel or we'd have a link for you. If you are UW-IT personnel and want a closer look, feel free to leave us a comment.)

The interface uses jQuery Mobile. Behind the scenes, mDash uses jQuery for its Ajax calls and DOM manipulation. It also uses Badgerfishjs to transform the alerts from our XML web service into JSON. And some custom CSS adds UW colors to the interface.
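For context, BadgerFish is a simple convention for mapping XML into JSON: attributes become "@"-prefixed keys, element text goes under a "$" key, and repeated child elements collapse into arrays. Here's a minimal sketch of the convention in Python (the alert markup is invented for illustration; mDash's real transform runs in the browser via Badgerfishjs):

```python
# A minimal BadgerFish-style XML-to-JSON sketch. The <alert> markup below is
# made up for illustration - it is not the actual UW-IT alert format.
import xml.etree.ElementTree as ET

def badgerfish(elem):
    """Convert an ElementTree element to a BadgerFish-style dict."""
    node = {}
    # Attributes become "@name" keys.
    for name, value in elem.attrib.items():
        node["@" + name] = value
    # Text content goes under the "$" key.
    text = (elem.text or "").strip()
    if text:
        node["$"] = text
    # Child elements become keys; repeated children become lists.
    for child in elem:
        converted = badgerfish(child)
        if child.tag in node:
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(converted)
        else:
            node[child.tag] = converted
    return node

xml = '<alert severity="3"><host>n0418</host><host>n0099</host></alert>'
doc = {"alert": badgerfish(ET.fromstring(xml))}
# doc == {"alert": {"@severity": "3",
#                   "host": [{"$": "n0418"}, {"$": "n0099"}]}}
```

The nice property of BadgerFish over ad hoc conversions is that it round-trips attributes and repeated elements predictably, so the JavaScript side can count on the shape of the JSON.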

So what is mDash? It's the mobile version of Dash. But what's Dash?

Dash is UW-IT's Java Web Start (Swing) application for monitoring system alerts. One description reads: "It is intended to be [a] simple and reliable way for system engineers, computer operations, client support groups, and management to view DSUE system status." (DSUE being Distributed Systems and Unix Engineering.) By "system" we mean all sorts of systems. Dash monitors everything from web services used by a handful of developers to student email servers.

Our team's goal with mDash was to replicate a subset of Dash's functionality in mobile web app form. We wanted to: show what's down (an abbreviation of the hostname), indicate the severity of the problem (the color-coded numbers adopted from the Java app), and allow a person to view a few more essential details via an alert detail page.

When monitoring critical systems, freshness of information matters. So, on the home page we display the time when alerts were last checked, and we provide a hard-to-miss "Refresh Alerts" button so people can retrieve the latest alerts whenever they want. In the future we may refresh the alerts asynchronously, but given the large number of alerts, that asynchronous process can slow some devices' browsers to a crawl or bring them to a standstill.

The alert detail page's fields might seem esoteric to people outside DSUE at UW-IT, but these labels are familiar to the smallish set of people who will use this app at the outset. As mDash (and Dash) moves toward wider adoption, we'll need more human-friendly field labels.

It will sound clichéd, but mDash was fun to make. And jQuery Mobile is what made it fun. We did run into some funky bugs and behaviors along the way, but never insurmountable ones and never ones that the jQuery Mobile community hadn't already solved.

To say we're looking forward to making another jQuery Mobile web app is understating it.

Friday, February 18, 2011

Following wrong assumptions

An example of following wrong assumptions when debugging:

I was porting a bit of hyak (a several hundred node compute cluster)
monitoring from ProxD (an older collector of alerts) to Injector (a newer
generation). This monitor runs a vendor diagnostic utility once a minute
that lists status for every hyak node and issues alerts for any that are
'Down'. It saves status by node, so normally it only generates traffic for
node state changes. The exception is at daemon startup, when (since it
doesn't know whether there might be existing alerts) it issues traffic for
every node (either an Upd if the node is down or a Del otherwise).
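The save-status-and-only-report-changes behavior described above can be sketched like this (the function and message names are hypothetical, not the actual ProxD/Injector code):

```python
# Sketch of the state-change filter described above. Names ('Upd'/'Del',
# process_poll) are hypothetical stand-ins, not the real Injector code.
def process_poll(statuses, known, startup=False):
    """statuses: {node: 'Up'|'Down'} from one run of the vendor utility.
    known: mutable dict of last-seen state per node.
    Returns the list of (msg_type, node) messages to send."""
    msgs = []
    for node, state in statuses.items():
        # At startup we don't know what alerts already exist, so emit a
        # message for every node; afterwards only emit on state changes.
        if startup or known.get(node) != state:
            msgs.append(('Upd' if state == 'Down' else 'Del', node))
        known[node] = state
    return msgs

known = {}
first = process_poll({'n0001': 'Up', 'n0418': 'Down'}, known, startup=True)
# first == [('Del', 'n0001'), ('Upd', 'n0418')] - traffic for every node
steady = process_poll({'n0001': 'Up', 'n0418': 'Down'}, known)
# steady == [] - no state changes, no traffic
```

This shape explains the burst: steady-state polls emit almost nothing, but the startup poll emits one message per node, all at once.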

Because determining the state of a node is just looking at the next line
of output, at startup there are going to be 500-odd alerts in a short
period of time, like 0.02 seconds.

Usually, there are no nodes down, so if you start the daemon, you won't
notice any dropped alerts (I'm sending these via UDP). Today, though,
there were a few nodes down, and one of them had a high number - n0418 or
something like that. When I fired up my newly ported code, I checked
against the list of active alerts from ProxD - and n0418 was missing.

I wrote a quick program to send 500 alerts as fast as possible, and indeed
it started dropping alerts after 150 or so. With ProxD, OTOH, you
could send blocks of 500 over and over w/ no skips. So I started picking
apart the code. The implementations aren't all that different, though they
do use different XML parsers and have some minor differences. Or could
the read thread be blocking because the write-to-ActiveMQ queue was
full? Anyway, I spent a couple of hours picking both pieces of code
apart, always looking at millisecond-per-transaction times - but those
were pretty close - on the order of 3 to 4 millisecs (I dunno if that is a
server limit or not; the client couldn't send faster).
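A flood tester along those lines might look something like this (a sketch only: the host, port, and message body are made-up placeholders, not the real Injector endpoint or alert format):

```python
# Rough sketch of a throwaway UDP flood tester like the one described above.
# The alert body and target are invented placeholders for illustration.
import socket
import time

ALERT = b'<alert><node>n0418</node><state>Down</state></alert>'

def flood(host, port, count=500):
    """Blast `count` UDP datagrams as fast as possible; return elapsed seconds."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    start = time.monotonic()
    for _ in range(count):
        # sendto never blocks waiting for the receiver - if the receiver's
        # input buffer fills, the excess datagrams are silently dropped.
        sock.sendto(ALERT, (host, port))
    sock.close()
    return time.monotonic() - start
```

Because UDP gives no delivery feedback, the only way to see the drops is to compare what the sender sent against what the daemon logged - which is exactly how the missing n0418 showed up.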

Oooops. The problem was the UDP input buffer size. The new alert
format is a fair amount longer than the old proxd one. Longer inbound
messages fill the buffer sooner - duh. I hadn't
really thought about the performance implications there, and was quite
surprised they made such a difference. I might be a little more terse next
time. Anyway, I increased the buffer size so you can send 2000-odd msgs as
fast as a single-threaded Python prog can send them w/o problems.
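Bumping a UDP socket's receive buffer looks roughly like this in Python (the 4 MB figure is my illustration, not the value actually used in Injector; the kernel may also cap the request):

```python
# Sketch of enlarging a UDP receive buffer. The 4 MB request is an assumed
# value for illustration. Note that the kernel can silently cap the request
# (e.g. Linux's net.core.rmem_max), and Linux reports back double the value
# you set to account for bookkeeping overhead.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
default = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4 * 1024 * 1024)
effective = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
sock.close()
```

The rule of thumb that bit us: the buffer holds bytes, not messages, so making each message longer shrinks the number of messages a burst can queue before drops start.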

I should have kept an open mind longer - once I focused in on transaction
times, I stopped thinking about buffer sizes; a classic case of narrowing
the suspect list too soon.

BTW, I went back and tried sending larger numbers of msgs to proxd, and
it started dropping them between 1000 and 2000 msgs, so we had been running
in a sweet spot - no other app sends traffic quite as bursty as this one,
and it was sending a little under the ceiling, so there weren't any
problems. I was also lucky that a high-numbered node was down - if it had
been n0099, I never would have noticed.