MTU woes again…

Written by Roland van Rijswijk in category: Resilience, Technical

We recently experienced some more MTU woes as a result of our main zone being DNSSEC-signed, and I’d like to share them in the hope that this helps others avoid the same problem. Last week I got an e-mail from our IT department about mail issues that some of my colleagues were experiencing. They had received phone calls from companies we do business with, saying that they were unable to send e-mail to my colleagues. Digging deeper into the problem revealed that all these companies had one thing in common: they were using the hosted MS Exchange servers of a large ISP. The errors they got looked something like this (with privacy-sensitive information blanked out):

Unable to deliver message to the following recipients, due to being
unable to connect successfully to the destination mail server.
Reporting-MTA: dns;********************
Received-From-MTA: dns;macpro.lan
Arrival-Date: Thu, 17 Mar 2011 14:54:18 +0100
Final-Recipient: rfc822;*************@surfnet.nl
Action: failed
Status: 4.4.7
From:  ***********************************
To:  ******************@surfnet.nl

The rather unclear status code pointing to the problem is “4.4.7”. Strictly speaking this code means that the delivery time expired; in this case the message expired because the sending mail server was unable to connect to the recipient’s mail exchanger.

I talked to the people at the ISP about this and they said that the cause of the error message was a DNS resolution problem. When we dug into it a bit deeper, it turned out that they had recently upgraded the resolvers used by their hosted Exchange environment from Windows Server 2008 to Windows Server 2008 R2. Apparently Windows Server 2008 R2 has EDNS0 enabled by default, with the DNSSEC OK (DO) bit set. With the DO bit set, our name servers include the DNSSEC signatures in their answers, which makes those answers much larger. In addition, it turned out (after some packet tracing) that their servers are behind a firewall that drops UDP fragments, and that somewhere along the route between their servers and our authoritative name servers the MTU drops to 1472 bytes. Since the largest answer we could produce for our MX set is over 2000 bytes, these answers were being fragmented. And when the fragments are dropped, the resolver never receives a proper answer.
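The arithmetic behind that fragmentation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes IPv4 with a 20-byte header and no options, takes 2000 bytes as a stand-in for the "over 2000 bytes" answer, and the function name `fragment_sizes` is made up for this example.

```python
IP_HEADER = 20   # assumption: IPv4 header without options
UDP_HEADER = 8   # fixed UDP header size


def fragment_sizes(dns_payload, mtu):
    """Return the IP payload size of each fragment of one UDP datagram."""
    remaining = UDP_HEADER + dns_payload   # bytes the IP layer has to carry
    max_payload = mtu - IP_HEADER          # room per packet after the IP header
    aligned = max_payload // 8 * 8         # non-final fragments carry multiples of 8
    sizes = []
    while remaining > max_payload:
        sizes.append(aligned)
        remaining -= aligned
    sizes.append(remaining)                # final (or only) packet
    return sizes


# A 2000-byte answer over a path with a 1472-byte MTU needs two fragments.
print(fragment_sizes(2000, 1472))
# A classic 512-byte answer fits in a single packet.
print(fragment_sizes(512, 1472))
```

A list with more than one entry means the answer leaves the name server as several IP fragments, and a firewall that drops fragments will prevent the resolver from ever reassembling it.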

The workaround was very simple: since they do not validate DNSSEC answers and have no plans to start doing so any time soon, they decided to disable EDNS0 on their resolvers. This is done by executing the following command:

C:\Windows\System32\> dnscmd /config /EnableEDNSProbes 0

Once that was done, the problem was solved. This does, however, teach us an important lesson: MTU troubles show up as very unclear problems in secondary systems…
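To check whether a given path suffers from the same problem, it can help to send the kind of query the upgraded resolvers were sending. The following is a minimal sketch, using only the Python standard library, of how a DNS query with an EDNS0 OPT record and the DO bit set is put together. The function name `build_query`, the advertised UDP size of 4096 bytes and the query ID are assumptions for illustration; I did not check what size Windows Server 2008 R2 actually advertises.

```python
import struct


def build_query(name, qtype=15, udp_size=4096, do_bit=True, query_id=0x1234):
    """Build a DNS query (default: MX) with an EDNS0 OPT record appended."""
    # Header: ID, flags (RD set), 1 question, 0 answers, 0 authority, 1 additional
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 1)
    # Question name as length-prefixed labels, terminated by the root label
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)   # QTYPE, QCLASS IN
    # OPT pseudo-record: root name, TYPE 41, CLASS carries the UDP payload size,
    # TTL carries extended RCODE, version and flags; DO is the top flag bit.
    ttl = 0x00008000 if do_bit else 0
    opt = b"\x00" + struct.pack("!HHIH", 41, udp_size, ttl, 0)
    return header + question + opt


packet = build_query("surfnet.nl")
print(len(packet), packet.hex())
```

Sending this packet over UDP (for example with `socket.sendto`) to an authoritative name server and waiting for the answer shows whether large responses make it back. If the query times out while the same query with `do_bit=False` and a small `udp_size` is answered, dropped fragments are a likely suspect.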

There was also an interesting spin-off from the search we performed to find the problem: their resolver sent back ICMP messages to indicate that fragment reassembly had failed:

11:01:59.849643 IP *.*.*.* > ns3.surfnet.nl: ICMP ip reassembly time exceeded, length 92
11:01:59.849655 IP *.*.*.* > ns3.surfnet.nl: ICMP ip reassembly time exceeded, length 92

I’m going to look into this further, because I usually receive these ICMP messages back on the authoritative server. That means I now have a way of detecting, with some probability, MTU problems on the path to certain resolvers that query our authoritative name servers. If I get some results from this I will publish that information here on the blog.
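As a first rough idea of what such detection could look like, here is a small Python sketch that counts “reassembly time exceeded” messages per source in tcpdump output of the form shown above. It is not what we run in production; the function name `count_reassembly_failures` and the addresses in the sample (taken from the documentation ranges) are made up for illustration.

```python
import re
from collections import Counter

# Matches tcpdump lines such as:
#   11:01:59.849643 IP <source> > <destination>: ICMP ip reassembly time exceeded, length 92
PATTERN = re.compile(r"IP (\S+) > (\S+): ICMP ip reassembly time exceeded")


def count_reassembly_failures(lines):
    """Count ICMP 'reassembly time exceeded' messages per sending host."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1   # group 1 is the source of the ICMP message
    return counts


# Made-up sample capture; the sources are documentation addresses.
sample = [
    "11:01:59.849643 IP 192.0.2.10 > ns3.surfnet.nl: ICMP ip reassembly time exceeded, length 92",
    "11:01:59.849655 IP 192.0.2.10 > ns3.surfnet.nl: ICMP ip reassembly time exceeded, length 92",
    "11:02:03.120001 IP 198.51.100.7 > ns3.surfnet.nl: ICMP ip reassembly time exceeded, length 92",
    "11:02:04.000000 IP 203.0.113.5 > ns3.surfnet.nl: ICMP echo request, id 1, seq 1, length 64",
]

for source, count in count_reassembly_failures(sample).most_common():
    print(source, count)
```

Sources that show up repeatedly are candidates for resolvers that sit behind a path or firewall that loses fragments. The signal is only probabilistic, since many hosts and firewalls never send these ICMP messages at all.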
