Friday, February 10, 2012

strace and tcpdump: or "a good time with dhcpd and iptables"

After an upgrade to our DHCP servers resulted in a failure to hand out leases under peak load, we enhanced our DHCP load testing client and started narrowing down the problem.

Testing soon sorted out the performance issue: while the new disks were much faster for reads and writes, the old disk controller had a write cache, and the ‘fsync() after every lease’ behavior of dhcpd meant that we were getting killed writing the leases file.

I wanted to quantify how many DHCP leases and requests we could serve once we’d resolved that and I noticed one odd little problem under load. If we set our lease time down to 60 seconds and had more than 600 clients, the dhcpd.log would start reporting an error when sending certain DHCP ACK messages:

send_packet: Operation not permitted

At first I thought it was happening when the client entered the ‘rebinding’ phase (i.e. after the unicast request for a lease renewal had failed, timed out and then switched back to broadcast) because it happened after the initial leases were offered and after some leases had successfully renewed. But closer investigation showed something more interesting.

The first thing to know is that we run our DHCP servers in failover mode and they are located on different subnets. None of the clients would be local in production so we test with a similar configuration. The host the client software runs on is located on a different network from both DHCP servers.

The test client itself is written in java and creates multiple threads, doling out tens or hundreds of thousands of clients requesting IP addresses in batches of 300 at a time. It tracks the clients, following up with a unicast RENEW. If the RENEW times out, it has the client thread go through the broadcast REBINDING process. If that fails, it expires the lease and logs the problem. When the test client is acting like a relay it sets fields in the DHCP packet just like a router with a ‘helper-address’ would.

We didn’t want the network engineers to have to change all the routers between the client and the two pre-production servers by adding routes for a large block of test addresses.

In the past we used iptables on the test DHCP servers to "DNAT" (destination NAT) all traffic destined for our test network over to port 68 on the DHCP test client server, with this command:

iptables -t nat -A OUTPUT -d 10.10.0.0/255.255.0.0 -p udp -j DNAT --to-destination 172.16.8.3:68

resulting in:

Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination         

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         
DNAT       udp  --  anywhere             10.10.0.0/16        to:172.16.8.3:68 

Under low load, everything worked fine. Even under high load (once I'd worked around the fsync() performance issue) I saw significantly higher DHCP ACK counts on the new servers. But I saw the send_packet complaints above intermittently appear in the log file, and I saw clients go into rebinding even under relatively low load conditions.

A little searching turned up the claim that ‘Operation not permitted’ was related to iptables, but careful review of the iptables entries on the DHCP servers and examination of the connection tracking table limits didn't show an obvious problem. The fact that most of the time traffic got through suggested the problem wasn't a bad rule. And all the connection tracking limits were greater than 600, while the problem could be demonstrated with as few as 600 client threads.
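For reference, the limits and the current entry count can be checked with something like:

sysctl net.netfilter.nf_conntrack_max net.netfilter.nf_conntrack_count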

Using the dhcpd source, I was able to add some debugging to narrow things down a bit more:

  • the problem only happened when a unicast packet was received from a client (but didn’t happen for all unicast packets)
  • the same functions and system calls were always being used to generate the responses whether there were issues or not

Using tcpdump on the DHCP servers to inspect the traffic (a capture along the lines sketched after this list), I found a few more interesting things:

  • even though a DHCP ACK was logged in the dhcpd.log file, the corresponding packet was not seen by tcpdump (so the “send_packet:” complaint really did indicate a failure)
  • the complaint never happened until after the first 254 addresses were handed out. It always seemed to start when the client with 10.10.2.0 came back to renew.
  • the unicast responses from the DHCP server weren’t coming from port 67, but were going out from a variety of other low-numbered ports…
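For reference, a capture along these lines does the job (the interface name and exact filter here are illustrative):

tcpdump -n -i eth0 udp port 67 or udp port 68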

Using strace (an invocation along the lines sketched after this list) proved most informative of all:

  • the dhcpd sendto() call WAS sourcing traffic from port 67 despite what we saw with tcpdump
  • EPERM (the error code that corresponds to the message ‘Operation not permitted’) consistently happened at 10.10.2.0 – until I changed the range in the dhcpd.conf file to start offering addresses from a different range, after which the EPERM error started happening at 10.10.2.19.
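For reference, an invocation along these lines shows the sendto() calls and their return codes (exact options are illustrative):

strace -f -tt -e trace=network -p $(pidof dhcpd)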

So, back to iptables. Although we thought the DHCP traffic should not be subject to connection tracking due to the following table entries:

Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination         
net_notrk  all  --  anywhere             anywhere            

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         

Chain net_notrk (1 references)
target     prot opt source               destination         
NOTRACK    udp  --  anywhere             anywhere            udp dpt:domain 
NOTRACK    udp  --  anywhere             anywhere            multiport dports bootp,bootpc  

I was seeing entries in the connection tracking table:

ipv4     2 udp      17 18 src=192.168.1.12 dst=10.10.0.249 sport=67 dport=68 
          [UNREPLIED] src=172.16.8.3 dst=192.168.1.12 sport=68 dport=84 mark=0 secmark=0 use=2

A couple of things jump out:

  • as mentioned, I’m seeing connection tracking entries, despite the configuration above.
  • the dport=84 in the second line indicates that iptables is choosing to rewrite the source port when it sends out the traffic.

The first thing to realize is that the configuration above won't affect packets sent out from the DHCP servers (note the empty OUTPUT chain).

Secondly, the dport=84 gives a big hint about the cause of the “EPERM” problem.

Because the outbound traffic is all destined for the same host and port, the iptables DNAT rule creates an entry in the /proc/net/nf_conntrack table to show how to map any responses. Not that there will be any for this application, of course, but the iptables rules don’t know that.
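(The entries can be watched directly with something like the following; the conntrack userspace tool works too, if it's installed.)

grep udp /proc/net/nf_conntrack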

Well, okay. So why were packets dropping? We weren't hitting connection table limits (the table never got above 500 or so entries).

Apparently, because the packets were sourced from port 67 (a reserved port), the implicit source mapping remapped them to a new, unused port – but only used ports between 1 and 512. When netfilter can't complete a NAT mapping for a locally generated packet, the packet is dropped and the sending process sees EPERM. For more details see Implicit Source Mapping.

So, how do I work around this issue? While it’s reassuring to know that this would not happen in production, I did want to finish my load testing. I thought perhaps we could DNAT w/o connection tracking, so we went ahead and fixed our connection tracking bypass rule:

Chain host_notrk (1 references)
target     prot opt source               destination         
NOTRACK    udp  --  anywhere             anywhere            multiport dports bootp,bootpc 
NOTRACK    icmp --  anywhere             anywhere            icmp echo-request 
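For reference, rules along these lines in the raw table produce a chain like the one above (hooking it from OUTPUT is my assumption here, since that's where the locally generated DHCP responses would be matched):

iptables -t raw -N host_notrk
iptables -t raw -A OUTPUT -j host_notrk
iptables -t raw -A host_notrk -p udp -m multiport --dports 67,68 -j NOTRACK
iptables -t raw -A host_notrk -p icmp --icmp-type echo-request -j NOTRACK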

I anticipate using these settings in production, bypassing connection tracking for all DHCP traffic and for the “ping” that the DHCP server does to avoid assigning an address that is being squatted on by another client.

Result? DNAT stopped working entirely. Bypassing connection tracking appears to bypass DNAT, too.

An alternative option is dropping the length of time that UDP entries live in the nf_conntrack table. I don’t want them there at all, so instructing the system to keep them for 1 second instead of 30 seems to work fine:

sysctl -w net.netfilter.nf_conntrack_udp_timeout=1

I no longer see sendto() EPERM complaints, even at 20,000 clients with a 60 second lease time, and we no longer see lease failures (well, with fsync() commented out, that is. I'm still working on the preferred permanent solution to THAT problem).

Friday, June 3, 2011

mDash, or Systems Monitoring Using jQuery Mobile

We've written Mobile Dash, or mDash, a mobile web app for monitoring UW-IT system alerts.

(If you're UW-IT staff, you can check it out here: mDash.)

Some screenshots:

mDash home screen on iPhone 4

mDash alert detail page

(It's only for UW-IT personnel or we'd have a link for you. If you are UW-IT personnel and want a closer look, feel free to leave us a comment.)

The interface uses jQuery Mobile. Behind the scenes, mDash uses jQuery for its Ajax call and DOM manipulation. It also uses Badgerfishjs to transform the alerts from our XML web service into JSON. And some custom CSS adds UW colors to the interface. 

So what is mDash? It's the mobile version of Dash. But what's Dash?

Dash is UW-IT's Java Web Start (Swing) application for monitoring system alerts. One description reads, "It is intended to be [a] simple and reliable way for system engineers, computer operations, client support groups, and management to view DSUE system status" (DSUE being Distributed Systems and Unix Engineering). And by system we mean all sorts of systems: Dash monitors everything from web services used by a handful of developers to student email servers.

Our team's goal with mDash was to replicate a subset of Dash's functionality in mobile web app form.  We wanted to: show what's down (an abbreviation of the hostname), indicate the severity of the problem (the color-coded numbers adopted from the Java app), and allow a person to view a few more essential details via an alert detail page.

When monitoring critical systems, freshness of information matters. So, on the home page we display the time when alerts were last checked, and we provide a hard-to-miss "Refresh Alerts" button so people can retrieve the latest alerts when they want. In the future we may refresh the alerts asynchronously, but with the large number of alerts involved, that asynchronous process can slow some devices' browsers to a crawl or bring them to a standstill.

The alert detail page's fields might seem esoteric to people outside DSUE at UW-IT, but these labels are  familiar to the smallish set of people who will use this app at the outset. As mDash (and Dash) moves toward wider adoption, we'll need more human-friendly field labels.

It will sound clichéd, but mDash was fun to make. And jQuery Mobile is what made it fun. We did run into some funky bugs and behaviors along the way, but never insurmountable ones and never ones that the jQuery Mobile community hadn't already solved.

To say we're looking forward to making another jQuery Mobile web app is understating it.

Friday, February 18, 2011

Following wrong assumptions

An example of following wrong assumptions when debugging:

I was porting a bit of hyak (a several hundred node compute cluster)
monitoring from ProxD (an older collector of alerts) to Injector (a newer
generation). This monitor runs a vendor diagnostic utility once a minute
that lists status for every hyak node and issues alerts for any that are
'Down'. It saves status by node, so normally it only generates traffic for
node state changes. The exception is at daemon startup, when (since it
doesn't know whether there might be existing alerts) it issues traffic for
every node (either an Upd if the node is down or a Del otherwise).

Because determining the state of a node is just looking at the next line
of output, at startup there are going to be 500 odd alerts in a short
period of time, like .02 seconds.

Usually, there are no nodes down, so if you start the daemon, you won't
notice any dropped alerts (I'm sending these via UDP). Today, though,
there were a few nodes down, and one of them had a high number - n0418 or
something like that. When I fired up my newly ported code, I checked
against the list of active alerts from ProxD - and n0418 was missing.

I wrote a quick program to send 500 alerts as fast as possible, and indeed
it started dropping alerts after 150 or so. With ProxD, OTOH, you
could send blocks of 500 over and over w/ no skips. So I started picking
apart the code. The implementations aren't all that different, though
they did use different XML parsers and had some minor differences. Or could
the read thread be blocked waiting because the write-to-ActiveMQ queue was
full? Anyway, I spent a couple of hours picking both pieces of code
apart, always looking at millisecond per transaction times - but those
were pretty close - on the order of 3 to 4 millisecs (I dunno if that is a
server limit or not; the client couldn't send faster).

Oooops. The problem was the udp input buffer size. The new alert
format is a fair amount longer than the old. The proxd one is:



The new one is:


foo.u
myProg
5
...

and so on. Longer inbound messages fill the buffer sooner - duh. I hadn't
really thought about the performance implications there, and was quite
surprised they made such a difference. I might be a little more terse next
time. Anyway, I increased the buffer size so you can send 2000 odd msgs as
fast as a singlethreaded python prog can send them w/o problems.
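One related gotcha worth noting (not something I hit here, as far as I can tell): on Linux the kernel silently caps application-requested receive buffers at net.core.rmem_max, so a larger socket buffer may also need a larger cap. The value below is just illustrative:

sysctl -w net.core.rmem_max=4194304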

I should have kept an open mind longer - once I focused in on transaction
times, I stopped thinking about buffer sizes; a classic case of narrowing
the suspect list too soon.

BTW, I went back and tried sending larger numbers of msgs to proxd, and
it started dropping them between 1000 and 2000 msgs, so we had been running
in a sweet spot - no other app sends traffic quite as bursty as this one,
and it was sending a little under the ceiling, so there weren't any
problems. I was also lucky that a high numbered node was down - if it had
been N0099, I never would have noticed.

Friday, November 5, 2010

... and ipset comes to the rescue

We require that hosts located on certain subnets register before they are allowed to use the full capabilities of our campus network. By default, all hosts on such a network are blocked until the registration is performed. While blocked, however, they need to get redirected to the registration page. For years now, we have used iptables for dynamic DNS redirection when performing these host registrations.

To achieve this, a source NAT mapping to a lying DNS server is generated when a host shows up on a network that requires registration. This determination is made by a realtime processor script acting on DHCP transitions. Once a device has been registered, the mapping is updated to point to the true DNS server and the iptables entry is updated accordingly.

Unfortunately, iptables became a performance bottleneck. The time to insert or delete a rule in a chain increases as the number of rules in the table increases. In addition, adding a rule takes two calls to iptables (one to list the chain and make sure the rule does not already exist, and a second to actually add the rule). This is necessary to prevent duplicate rules in the same chain.

As the demand for our service increased over the years, the number of rules in iptables increased along with it. Eventually the realtime processor script became "almost realtime" and then "almost never in time" during peak periods of use.

Analysis showed that the average time to add or delete an iptables rule was 0.17s (over 100k samples). At times when the DHCP transition rate could be as high as 1000/min, the iptables updates ran several minutes behind, making the registration process effectively unusable.

We researched an alternative method of storing the host IP address mappings using ipset (http://ipset.netfilter.org/). Designed to work with iptables, ipset is basically a very efficient bucket for storing network address (IP and port) information. The time to add or remove an address is nearly constant and much smaller than with iptables: 0.0065s (over 100k samples). This leaves much more headroom for capacity!
tool                    average time    stdev
iptables                0.17s           0.12
ipset (tested)          0.0065s         0.034
ipset (in production)   0.008s          0.027
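To give a flavor of the approach, here's a minimal sketch (the set name, addresses, and lying-DNS address are made up; this is the newer create/add syntax, while older ipset releases used -N/-A style flags):

# hosts that have not registered yet
ipset create unregistered hash:ip
ipset add unregistered 10.20.30.40

# send DNS queries from unregistered hosts to the lying DNS server
iptables -t nat -A PREROUTING -p udp --dport 53 -m set --match-set unregistered src -j DNAT --to-destination 192.0.2.53

# registering a host becomes a single, effectively constant-time set operation
ipset del unregistered 10.20.30.40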

Thursday, September 23, 2010

Jolly times with the postgresql optimizer and explain analyze

Suddenly yesterday, one of my REX functions (REX is our REST/SQL mapper tool) jumped off a cliff, performance-wise. With a hint from Nick (who will forevermore be known as "the database whisperer"), I was able to narrow the performance problem down to this particular query in the function that REX is calling, specifically the family(network)=4 element of the WHERE clause:

SELECT DISTINCT network, sector.sector::text
FROM nfa_k20_network
LEFT JOIN info_site ON info_site.siteid = nfa_k20_network.site_id
LEFT JOIN dev_location on dev_location.site_id = info_site.id
LEFT JOIN dev_classification on dev_classification.device_id =
dev_location.device_id
LEFT JOIN sector on sector.id = dev_classification.sector
WHERE sector.sector is not null and family(network) = 4;

But I was unable to get the optimizer to generate the same plan on netdbeval/allyourbase as I got on netdbdev. I'm assuming this is an indication that the data size in one of the tables tipped the query optimizer over into selecting a less ideal strategy, although I'm not clear
on how or why.

The good strategy uses hashes, the bad one a nested loop with a join filter.
Here's where things start to run amok:

Good (netdbdev) http://explain.depesz.com/s/ukE

Hash Join (cost=3185.37..4996.42 rows=1 width=28) (actual
time=953.056..1227.372 rows=7379 loops=1)
* Hash Cond: (public.dev_state.device_id = interface.device_id)

Bad (netdbeval) http://explain.depesz.com/s/TI6

Nested Loop (cost=1851.83..8488.08 rows=1 width=34) (actual
time=247.812..404814.965 rows=1465 loops=1)
* Join Filter: (interface.device_id = public.dev_state.device_id)

Wowsers! Check out how the actual time compares to the estimated cost!

Explicitly disabling nested loops in psql before trying this query caused it to go back from taking several minutes to taking 3 seconds, so we clearly want to get postgres to use the first strategy.

The choice seems to be driven by the estimated row count that ifc_ip will return:

ripit_2=# explain analyze select count(*) from ifc_ip where family(ip)=4;
QUERY PLAN

------------------------------------------------------------------------------------------------------------
Aggregate (cost=268.85..268.87 rows=1 width=0) (actual
time=15.433..15.434 rows=1 loops=1)
-> Seq Scan on ifc_ip (cost=0.00..268.70 rows=61 width=0) (actual
time=0.048..9.489 rows=8498 loops=1)
Filter: (family(ip) = 4)
Total runtime: 15.481 ms
(4 rows)

Note that the estimated row count is 61, while the actual row count returned is 8498.

I tinkered with a number of settings (work_mem, effective_cache_size) to see if I could drive the query planner to go back to making better choices. I also tried analyzing and reindexing the table. The basic problem seems to be how much filtering the query planner expects the "family(ip)=4" clause to result in, and I don't see how the pg_statistics table could capture that without adding an index (heck, I don't even see where that mysterious 61 number is coming from).

A simple and reasonable solution, however, seems to be to rewrite the query as "family(ip) != 6" (the goal of the query is to eliminate any IPv6 addresses from getting returned).

Doing this results in an estimate that captures something closer to the expected size of the ifc_ip result set (on the high end!) and has the desired effect on the overall query.

ripit_2=# explain analyze select count(*) from ifc_ip where family(ip)!=6;
QUERY PLAN

---------------------------------------------------------------------------------------------------------------
Aggregate (cost=299.00..299.01 rows=1 width=0) (actual
time=15.380..15.381 rows=1 loops=1)
-> Seq Scan on ifc_ip (cost=0.00..268.70 rows=12119 width=0) (actual
time=0.049..9.353 rows=8498 loops=1)
Filter: (family(ip) <> 6)
Total runtime: 15.432 ms
(4 rows)

The downside, of course, is that I've no idea what might cause the query planner to change its mind in the future about this table, or perhaps another table lurking in the query. So my sense is that it might be more robust to add an index, but I'd like to understand better how the returned-rows estimate is generated before I go adding more random indexes to tables, which, after all, aren't free...
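For what it's worth, the index I have in mind would be an expression index (the name is hypothetical); PostgreSQL gathers statistics on indexed expressions, which is what would give the planner a realistic row estimate for family(ip):

-- hypothetical expression index; re-ANALYZE so the planner picks up the new statistics
CREATE INDEX ifc_ip_family_idx ON ifc_ip (family(ip));
ANALYZE ifc_ip;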

Friday, August 6, 2010

Presenting ... UW Service Tools

You might have noticed that we've changed the blog's name from UW Network Management Tools to UW Service Tools. The name change reflects reorganization within UW Information Technology. Our group has moved from Network Systems to the Technology Management division's Application Infrastructure Engineering unit--a move we're excited about.

For the most part, what we do on a daily basis remains the same: we'll keep making good software. As for what's new, our focus is widening. We'll be working on automating a wide range of technology services to support UW-IT and UW.

We'll be posting more news here about our new work soon.

Monday, June 14, 2010

Meerkat Framework: Presentation at SASAG ...

Meerkat Framework presentation to the Seattle Area System Administrators Guild (SASAG) on June 10, 2010