The phone started ringing today with about 10 people who suddenly couldn’t connect to JobTracker. That was strange because the servers were fine and everyone else was connecting just fine. What was special about these few?
Luckily there was a work-around: they could reach the servers by IP address, just not by the moraware.net domain name. That was strange too, because the DNS servers are hosted on the same set of servers as JobTracker, so if you can get to one, you can get to the other.
Then we found a clue: none of us could resolve the DNS for www.moraware.com. This was a different story. The DNS for moraware.com has always been hosted on a third-party service, zoneedit.com, and apparently both of the servers hosting the DNS went down at once. We’ve been using this service for 7 years, since we first set up our website, with no problems. But it’s a free service, so there’s no one to yell at when it goes down.
So I quickly rebuilt the DNS entries on our own servers and pointed the domain there, and had it fixed in about 15 minutes.
So why did the moraware.com DNS being down stop just a handful of people from accessing moraware.net addresses? It turns out the moraware.net domain was using name servers with names like ns10.moraware.com. That doesn’t usually matter, because DNS servers can get the IP addresses for the name servers directly by using what’s called “glue”, without having to do a separate DNS lookup. But apparently these 10 customers were connecting through some DNS servers that didn’t use the glue and instead tried to look up the addresses on moraware.com, which was failing.
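To make the hidden dependency concrete, here is a rough sketch of the kind of check that would have flagged it. It isn't a tool we actually ran. The function names are made up for illustration, and it naively treats the last two labels of a host name as its domain, so it would get suffixes like co.uk wrong. It does no DNS queries; it only compares names.

```python
from __future__ import annotations


def parent_domain(hostname: str) -> str:
    """Return the last two labels of a host name (naive; ignores suffixes like co.uk)."""
    labels = hostname.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])


def external_ns_dependencies(zone: str, name_servers: list[str]) -> set[str]:
    """Return the domains, other than `zone` itself, that the name server names live under.

    A resolver that ignores glue has to resolve each of these domains
    before it can even find the name servers for `zone`.
    """
    zone = zone.rstrip(".").lower()
    return {parent_domain(ns) for ns in name_servers} - {zone}


# ns10.moraware.com is the name server name mentioned above.
print(external_ns_dependencies("moraware.net", ["ns10.moraware.com"]))
```

For moraware.net with a name server called ns10.moraware.com, the result is the single domain moraware.com, which is exactly the dependency that bit us. A zone whose name servers all sit under its own name would come back with an empty set.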
Now moraware.net and moraware.com both use the same cluster of 4 DNS servers, so this kind of problem won’t happen again.