Friday, February 10, 2012

strace and tcpdump: or "a good time with dhcpd and iptables"

After an upgrade to our DHCP servers resulted in a failure to hand out leases under peak load, we enhanced our DHCP load testing client and started narrowing down the problem.

Testing soon sorted out the performance issue (while the new disks were much faster for reads and writes, but the old disk controller had a write cache. The ‘fsync() after every lease’ behavior of dhcpd meant that we were getting killed writing the leases file).

I wanted to quantify how many DHCP leases and requests we could serve once we’d resolved that and I noticed one odd little problem under load. If we set our lease time down to 60 seconds and had more than 600 clients, the dhcpd.log would start reporting an error when sending certain DHCP ACK messages:

send_packet: Operation not permitted

At first I thought it was happening when the client entered the ‘rebinding’ phase (i.e. after the unicast request for a lease renewal had failed, timed out and then switched back to broadcast) because it happened after the initial leases were offered and after some leases had successfully renewed. But closer investigation showed something more interesting.

The first thing to know is that we run our DHCP servers in failover mode and they are located on different subnets. None of the clients would be local in production so we test with a similar configuration. The host the client software runs on is located on a different network from both DHCP servers.

The test client itself is written in java and creates multiple threads, doling out tens or hundreds of thousands of clients requesting IP addresses in batches of 300 at a time. It tracks the clients, following up with a unicast RENEW. If the RENEW times out, it has the client thread go through the broadcast REBINDING process. If that fails, it expires the lease and logs the problem. When the test client is acting like a relay it sets fields in the DHCP packet just like a router with a ‘helper-address’ would.

We didn’t want the network engineers to have to change all the routers between the client and the two pre-production servers by adding routes for a large block of test addresses.

In the past we used iptables on the test DHCP servers to “DNAT” (destination NAT) all traffic to our test network to port 68 on the DHCP test client server with this command:

iptables -t nat -A OUTPUT -d 10.10.0.0/255.255.0.0 -p udp -j DNAT --to-destination 172.16.8.3:68

resulting in:

Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination         

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         
DNAT       udp  --  anywhere             10.10.0.0/16        to:172.16.8.3:68 

Under low load, everything worked fine. Even under high load (once I’d worked around the issue with the fysnc() performance) I saw significantly higher DHCP ACK counts on the new servers. But I saw the send_packet complaints above intermittently appear in the log file and I saw clients go into rebinding even under relatively low load conditions.

A little searching turn up the claim that ‘Operation not permitted’ was related to iptables, but careful review of the iptables entries on the DHCP servers and examination of the connection tracking table limits didn’t show an obvious problem. The fact that most of the time traffic got through suggested the problem wasn’t a bad rule. And all the connection tracking limits were greater than 600 while the problem could be demonstrated with as few as 600 client threads.

Using the dhcpd source, I was able to add some debugging to narrow things down a bit more:

  • the problem only happened when a unicast packet was received from a client (but didn’t happen for all unicast packets)
  • the same functions and system calls were always being used to generate the responses whether there were issues or not

Using tcpdump on the DHCP servers to inspect the traffic I found a few more interesting things:

  • even though a DHCP ACK was logged in the dhcpd.log file, the corresponding packet was not seen by tcpdump (so the “send_packet:” complaint really did indicate a failure)
  • the complaint never happened until after the first 254 addresses were handed out. It always seemed to start when the client with 10.10.2.0 came back to renew.
  • the unicast responses from the DHCP server weren’t coming from port 67, but were going out from a variety of other low-numbered ports…

Using strace proved most informative of all:

  • the dhcpd sendto() call WAS sourcing traffic from port 67 despite what we saw with tcpdump
  • EPERM (the error code that corresponds to the message ‘Operation not permitted’) consistently happened at 10.10.2.0 – until I changed the range in the dhcpd.conf file to start offering addresses from a different range, after which the EPERM error started happening at 10.10.2.19.

So, back to iptables. Although we thought the DHCP traffic should not be subject to connection tracking due to the following table entries:

Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination         
net_notrk  all  --  anywhere             anywhere            

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         

Chain net_notrk (1 references)
target     prot opt source               destination         
NOTRACK    udp  --  anywhere             anywhere            udp dpt:domain 
NOTRACK    udp  --  anywhere             anywhere            multiport dports bootp,bootpc  

I was seeing entries in the connection tracking table:

ipv4     2 udp      17 18 src=192.168.1.12 dst=10.10.0.249 sport=67 dport=68 
          [UNREPLIED] src=172.16.8.3 dst=192.168.1.12 sport=68 dport=84 mark=0 secmark=0 use=2

A couple of things jump out:

  • as mentioned, I’m seeing connection tracking entries, despite the configuration above.
  • the dport=84 in the second line indicates that iptables is choosing to rewrite the source port when it sends out the traffic.

The first thing to realize is that the configuration above won’t affect packets sent out from the DHCP servers (note the empty OUTPUT clause).

Secondly, the dport=84 gives a big hint about the cause of the “EPERM” problem.

Because the outbound traffic is all destined for the same host and port, the iptables DNAT rule creates an entry in the /proc/net/nf_conntrack table to show how to map any responses. Not that there will be any for this application, of course, but the iptables rules don’t know that.

Well, okay. So why were packets dropping? We weren’t hitting connection table limits (the table never got above more than 500 or so entries).

Apparently, because the packets were sourced from port 67 (a reserved port), the implicit source mapping mapped the ports to a new unused port – but only used ports between 1 and 512. For more details see Implicit Source Mapping

So, how do I work around this issue? While it’s reassuring to know that this would not happen in production, I did want to finish my load testing. I thought perhaps we could DNAT w/o connection tracking, so we went ahead and fixed our connection tracking bypass rule:

Chain host_notrk (1 references)
target     prot opt source               destination         
NOTRACK    udp  --  anywhere             anywhere            multiport dports bootp,bootpc 
NOTRACK    icmp --  anywhere             anywhere            icmp echo-request 

I anticipate using these settings in production, bypassing connection tracking for all DHCP traffic and for the “ping” that the DHCP server does to avoid assigning an address that is being squatted on by another client.

Result? DNAT stopped working entirely. Bypassing connection tracking appears to bypass DNAT, too.

An alternative option is dropping the length of time that UDP entries live in the nf_conntrack table. I don’t want them there at all, so instructing the system to keep them for 1 second instead of 30 seems to work fine:

sysctl -w net.netfilter.nf_conntrack_udp_timeout=1

I no longer see sendto() EPERM complaints, even at 20,000 clients with a 60 second lease time and we no longer see lease failures (well, with fsync() commented out, that is. I’m still working on the permanent preferred solution to THAT problem).

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.