Friday, February 18, 2011

Following wrong assumptions

An example of following wrong assumptions when debugging:

I was porting a bit of hyak (a several hundred node compute cluster)
monitoring from ProxD (an older collector of alerts) to Injector (a newer
generation). This monitor runs a vendor disgnostic utility once a minute
that lists status for every hyak node and issues alerts for any that are
'Down'. It saves status by node, so normally it only generates traffic for
node state changes. The exception is at daemon startup, when (since it
doesn't know whether there might be existing alerts) it issues traffic for
every node (either an Upd if the node is down or a Del otherwise).
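The per-node dedup logic above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual monitor's code - the names (sweep, send_alert, the "node status" line format) are all assumptions:

```python
# Sketch of the state-change dedup: keep last-seen status per node,
# emit traffic only on changes - except at startup, when we don't know
# what alerts already exist, so every node generates an Upd or a Del.
def sweep(prev_state, lines, send_alert, startup=False):
    for line in lines:
        node, status = line.split()          # e.g. "n0418 Down" (assumed format)
        changed = prev_state.get(node) != status
        if startup or changed:
            if status == "Down":
                send_alert("Upd", node)      # raise/refresh an alert
            else:
                send_alert("Del", node)      # clear any existing alert
        prev_state[node] = status
    return prev_state
```

Run once a minute, the second and later sweeps are quiet unless something actually changes - which is exactly why the startup burst is the unusual case.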

Because determining the state of a node is just a matter of reading the
next line of output, at startup there are going to be 500-odd alerts in a
very short period of time, something like 0.02 seconds.

Usually, there are no nodes down, so if you start the daemon, you won't
notice any dropped alerts (I'm sending these via UDP). Today, though,
there were a few nodes down, and one of them had a high number - n0418 or
something like that. When I fired up my newly ported code, I checked
against the list of active alerts from ProxD - and n0418 was missing.

I wrote a quick program to send 500 alerts as fast as possible, and indeed
it started dropping alerts after 150 or so. With ProxD, OTOH, you
could send blocks of 500 over and over w/ no skips. So I started picking
apart the code. The implementations aren't all that different, though they
do use different XML parsers and have some other minor differences. Or
could the read thread be blocking because the write-to-ActiveMQ queue was
full? Anyway, I spent a couple of hours picking both pieces of code
apart, always looking at millisecond per transaction times - but those
were pretty close - on the order of 3 to 4 millisecs (I dunno if that is a
server limit or not; the client couldn't send faster).
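The quick burst-test program was along these lines - a minimal sketch, with the host, port, and payload as placeholders rather than the real collector's values:

```python
# Fire n UDP datagrams at the collector as fast as a single thread can,
# and report the average per-send time. Host/port/payload are placeholders.
import socket
import time

def blast(n, host="127.0.0.1", port=9999, payload=b"<alert/>"):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    start = time.time()
    for _ in range(n):
        sock.sendto(payload, (host, port))   # fire-and-forget; no ack, no retry
    elapsed = time.time() - start
    sock.close()
    return elapsed / n   # seconds per datagram
```

Since UDP gives no feedback, the only way to see drops is to compare what the sender fired against what the collector logged on the other end.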

Ooops. The problem was the UDP input buffer size. The new alert
format is a fair amount longer than the old. The ProxD one is:

The new one is:


and so on. Longer inbound messages fill the buffer sooner - duh. I hadn't
really thought about the performance implications there, and was quite
surprised they made such a difference. I might be a little more terse next
time. Anyway, I increased the buffer size so you can send 2000-odd msgs as
fast as a single-threaded Python prog can send them w/o problems.
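In socket terms, the fix boils down to one setsockopt call on the listening socket. A minimal sketch - the 4 MB figure is illustrative, not the daemon's actual setting, and the kernel may cap or (on Linux) double whatever you request:

```python
# Enlarge the kernel receive buffer on the listening UDP socket so a
# startup burst of larger messages doesn't overflow it before the
# reader thread drains them.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4 * 1024 * 1024)
sock.bind(("127.0.0.1", 0))   # port 0 = OS-picked; stands in for the daemon's real port

# See what the kernel actually granted - it may not be what you asked for
# (Linux doubles the request but caps it at net.core.rmem_max).
granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
```

Worth noting: the request is only a hint, so checking the granted size (and raising rmem_max if needed) is part of the fix, not an afterthought.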

I should have kept an open mind longer - once I focused in on transaction
times, I stopped thinking about buffer sizes; a classic case of narrowing
the suspect list too soon.

BTW, I went back and tried sending larger numbers of msgs to ProxD, and
it started dropping them between 1000 and 2000 msgs, so we had been running
in a sweet spot - no other app sends traffic quite as bursty as this one,
and it was sending a little under the ceiling, so there weren't any
problems. I was also lucky that a high-numbered node was down - if it had
been n0099, I never would have noticed.
