It All Falls Down

August 21, 2007 on 3:18 pm | In Insider View, Updates by Josh Jones |

My apologies.

On the off-chance (and judging by that graph of our Level 1 queue, it seems like a pretty good off-chance) that a few of you may have noticed a little problem we had last Thursday afternoon, all the way through Friday morning, I thought I might offer something in the way of an explanation to go along with that apology.

You customers really notice no DNS!

It’s funny how problems cascade.

It all started Wednesday around noon, when we had a sudden and mysterious network problem related to our core 2 router.

There seemed to be some sort of corruption with the ARP tables.. we eventually figured it out, and fixed it thanks to a gazillion sendARPs. Cisco support wasn’t helpful because we weren’t running the latest version of their IOS router operating system. Unfortunately, upgrading is scary stuff since it requires a short network outage, assuming everything works smoothly. We decided we’d do the upgrade Friday night.

Come Thursday at 2pm, exactly 24 hours after our previous outage was fixed, our network started to get wonky again. It seemed like it was most likely due to all the sendARPs from the previous day expiring at the same time. We were pretty much on top of this as soon as it happened though, and re-sent the sendARPs (staggered this time)!

In fact, it wasn’t actually due to an aging issue at all, but it was just an IOS bug on the core router. No big deal, we pretty much had things under control should the same problem pop up again on Friday at 2pm before the planned upgrade Friday night.

One pizza after another, all laid neatly on end.

A Chain of Events

Of course, little did we know, a chain of events had already been set in motion that would ruin everybody’s Friday.. FOREVER.

You see, we have a little script that runs every hour and purges old dead entries from our active nameserver database. Really, it isn’t the end of the world for us to keep that old stale stuff around, but in the name of being good dns citizens, I guess it’d been decided a while ago to remove them quickly.

Which is fine, I guess. However, the method by which we decided which entries should be removed was a bit suspect.

We first create a hash of ALL good domains “%domids” from our hosting database. Then, we go through all domains (as “$domid”) in our nameserver database and do:

unless ($domids{$domid}) {
    # no match in the hosting database, so this domain's records are treated as stale
    print "- removing stray records under non-existent domain $domid\n";
    $pdb->do("DELETE FROM records WHERE domain_id=" . $pdb->quote($domid));
}

Which works pretty well, assuming everything is working pretty well.

Well, everything was not working pretty well on Thursday. Because of the network weirdness, the connection to the hosting database apparently didn’t work, leaving %domids blank.

And, due to the excellent error handling and sanity checking of that script, it did not die at that point, or even so much as raise an eyebrow as it happily decided to delete every single domain in our dns database.
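
For the curious, here is a minimal sketch of the kind of guard that would have stopped the purge. It is an illustration, not the actual script: the `stale_domains` helper, the `$max_fraction` argument and its 10% default are all made up here, and it only works out which ids would be removed instead of touching a database.

```perl
use strict;
use warnings;

# Hypothetical helper: given the set of good domain ids (from the hosting
# database) and the list of ids in the nameserver database, return the ids
# whose records look stale. It dies instead of guessing when input is suspect.
sub stale_domains {
    my ($good, $ns_ids, $max_fraction) = @_;
    $max_fraction = 0.10 unless defined $max_fraction;    # assumed threshold

    # An empty "good" set almost certainly means the hosting query failed.
    die "refusing to purge: no good domains loaded\n" unless %$good;

    my @stale = grep { !$good->{$_} } @$ns_ids;

    # Refuse to continue if a suspiciously large share of the table is stale.
    if (@$ns_ids && scalar(@stale) / scalar(@$ns_ids) > $max_fraction) {
        die sprintf("refusing to purge: %d of %d domains look stale\n",
                    scalar(@stale), scalar(@$ns_ids));
    }
    return @stale;
}

# Normal hour: one stray id out of 21.
my %domids = map { $_ => 1 } 1 .. 20;
my @stale = stale_domains(\%domids, [1 .. 21]);
print "would remove: @stale\n";

# Simulate the failed hosting-database query: nothing loaded at all.
my $ok = eval { stale_domains({}, [1 .. 21]); 1 };
print "aborted: $@" unless $ok;
```

The point of the second check is that an hourly cleanup should normally find next to nothing, so a run that wants to remove a big slice of the table is far more likely to be looking at bad input than at real strays.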

I think I can see my site in there..

Now, for bad or good, it didn’t just hose the whole table at once. Instead, it just deleted one domain after another, in order.. which turned out to be a rather slow process on a busy dns database. In fact, 22 hours later when we finally found it STILL RUNNING (normally it finishes in under a minute since there’s nothing to delete) it had only deleted a third of the domains in the table.. about 300,000. Hooray!

It actually would have been a lot better if it’d just hosed everything at once. It would have been much easier to detect, and rectify, immediately.

Instead, things worsened gradually. It took over two hours before we even started getting reports from customers that their sites were down. At that point, it seemed like the problem was just some sort of residual effect of the network problem, and re-generating DNS for each person who wrote in fixed it right away, and for good.

As time went on, and the problems kept coming in, we realized there was a pretty major data loss in the nameserver database, and started running some scripts to regenerate it all. Those would take a couple of hours, but when they were done everything would be better, we assumed!

It wasn’t until those regeneration scripts finished and we discovered there were still lots of missing domains that it finally dawned on us .. dns records were continuously being deleted!

And THAT is when we finally found the culprit, fixed the mess, and started trying to make sure this would never happen again!

When it rains, we’re poor.

And where was DreamHost Status for all this?

DreamHost Status was down. (See, if you just read DreamHost Status you would have known that!)

Like they’ve said before, when it rains it pours.

We thought DreamHost Status was down because of the huge crush of people trying to access it due to the network problems. So, when we could finally get into it, we switched it to a static html page to try and lighten the load.

Lighten the load it did not!

Right about then we got a message from our remote data center in San Francisco (both ns2.dreamhost.com and dreamhoststatus.com are kept completely off our main network and in a different city exactly so they wouldn’t be affected by outages like this!):

Your server’s switchport has been de-rated to 10 Mb/s because your server began generating an out-bound storm of packets. This type of event usually indicates a compromise in security.

We have taken this action to mitigate the amount of bandwidth transfer charges incurred by your account related to this activity

Man, what timing! We did not need a DDoS attack right now.

But wait a second. Somehow that just seemed a little bit TOO Murphy-esque. And, indeed, when we probed them further, they told us:

According to my monitor, it appears you’re being DDoS attacked on your DNS service (UDP 53) specifically to IP 208.96.10.221. At 5a, your traffic peaked our threshold for dangerous amounts of packets going through your switch port which was when your server was de-rated.

That “Distributed Denial of Service” attack was actually just honest DNS requests!

Which was super-high because ns1.dreamhost.com was returning “I don’t have any records for that domain” for a ton of domains, due to the deletion of the DNS database entries, due to the haywire script, due to the network blip, due to the IOS bug, due to us not upgrading as quickly as possible because of the network downtime involved!

After Math is Art!

The Aftermath

Well, we did the IOS upgrade and it looks like it fixed the networking problems.

We also made our crazy script do some sanity checking. But more importantly (and in just two lines of code!), we’ve now set all our internal scripts to just DIE MISERABLY if they ever get any kind of un-good data from an SQL query. Clearly, ’tis better to not do something you were supposed to than to do something you were not supposed to!
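
The post doesn't show those two lines, so the following is only a guess at the general idea, not the actual change. With Perl's DBI, setting `RaiseError => 1` on the handle makes failed calls die on their own; the sketch below gets a similar effect with a hypothetical `rows_or_die` wrapper, where `$fetch` is a stand-in for a real database call.

```perl
use strict;
use warnings;

# Hypothetical wrapper: run a query callback and die unless it hands back
# a non-empty array reference. $fetch stands in for a real database call.
sub rows_or_die {
    my ($what, $fetch) = @_;
    my $rows;
    eval { $rows = $fetch->(); 1 }
        or die "$what: query failed: $@";
    die "$what: query returned no usable rows\n"
        unless ref($rows) eq 'ARRAY' && @$rows;
    return $rows;
}

# A healthy query passes straight through.
my $rows = rows_or_die('good domains', sub { [ [1, 'example.com'] ] });
print scalar(@$rows), " row(s) loaded\n";

# A dead connection (simulated here by returning undef) stops the script cold.
my $ok = eval { rows_or_die('good domains', sub { undef }); 1 };
print "died as intended: $@" unless $ok;
```

Treating an empty result as fatal is a deliberate choice in this sketch: for a query that is expected to return the full list of good domains, "nothing came back" and "the database is unreachable" deserve the same reaction.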

We’re also going to separate good old DreamHost Status from absolutely everything else DreamHost related.. even if that means moving it to blogger or something!

We must break the cycle!

50 Responses to “It All Falls Down”

  1. Miikka Says:

    Josh, thank you for the great post about your problems. This is exactly the reason why I want to be DreamHost customer - when something bad happens I can count on the fact that someone will come in front and explain what happened.

  2. ajoy Says:

    i can’t believe i read the entire post. good thing you got to the bottom of that problem!

  3. mike Says:

    Josh
    I’ve been hosting with DH since 2004. Why, on vital issues, do you provide such poor tech / email support? I’m happy with DH services, but the support sucks. On vital issues you have to wait with all the sites down for more than 10 hours to get an answer, and that usually doesn’t solve the problem.

    Regards,
    Mike

  4. Matt Says:

    Bring your status blog to WordPress.com! You’ll be in good company. (Second Life, Flickr, Laughing Squid, Rackspace, Layered Technologies, Server Beach, The Planet, CNN, People.) You’d still be using WordPress and can easily import your content. Not to mention I bribed Dallas with lunch yesterday.

  5. DWR Says:

    Much appreciated Josh. I have just moved back to Dreamhost after a short trial during last years “troubles”, am in the process of transferring all of my domain registrations over to you too, and these events made me wonder if I had made a good decision, but this explanation makes good sense and makes me feel a LOT better… each of these incidents helps you to make one more step towards relative invincibility!

    Yall are some alright cats… I’m very grateful that you exist and are doing things as you are… I feel like I am truly taking refuge in Dreamhost… yalls policies and manner show through in every aspect… there are wonderful little details scattered all through the Dreamhost Experience and more and more I see that those who have stuck with you all of these years through all ups and downs are some wise folks… the storms are temporary, but the fulfillment is deep and sustained.

    The dang kinda capture didnt show and now it thinks I’m a duplicate comment… maybe if I add this sentence it will let me try again.

  6. Unofficial DreamHost Blog Says:

    Josh, hope this has taught you not to commit hubris again by taunting your competitors. It makes the Hosting Gods mad ;-)

    Terrible not to have access to DreamHost Status when stuff like this happens. Luckily a lot of people submitted info from DreamHost Support / DreamHost Status (when it worked) to the Unofficial DreamHost Blog (which wasn’t affected by the outage) and we got a surge of visitors.

  7. BUGabundo Says:

    [quote]it had only deleted a third of the domains in the table.. about 300,000[/quote]
    So now DH hosts many more than 500.000 domains…
    actually 900K
    damn.

  8. Chuck Says:

    I have three Code Monster accounts with you and a support incident I placed has gone over 18 hours without even a reply.

    The only reasons I have not left DH are:

    1. already paid in advance
    2. love one click install goodies
    3. it’s a hassle to move

    your “support” is a joke. any company that won’t give you a phone number is not committed to its customers. i wish i saw that red flag before I rewarded you with my biz.

    i have been a DH customer for years, I’m NOT one of your $7/mo customer (try $45) and I’m sick of being treated like crap by your “support”.

  9. Unofficial DreamHost Blog Says:

    BUGabundo - I suspect that the 900K domains includes subdomains… Am I right Josh, or do you really host that many domains?

  10. Joe Grossberg Says:

    So are you going to audit the perl scripts that UPDATE or DELETE from databases now?

  11. Jake Says:

    Yes, sanity checking is GOOD for all database queries! My eyes skipped ahead a bit and saw that condition-less piece of code and something inside of me was like “oh snap, I know where this is going”. Remember, databases and CLI are not forgiving beasts.

    Thanks for the explanation, though it may have cost your users some downtime, the valuable lesson everyone learns from the thorough explanation is a value by itself.

  12. Heikki Says:

    Thanks for the explanation. This I like about Dreamhost: eventually getting an honest explanation. I was given a couple of non-helpful support replies, but these problems would have been over anyone’s head.

    I’ve been moved to Dreamhost PS, and this didn’t exactly help to make the transition smoother. Well, Murphy happens.

  13. Dallas Says:

    The number of sites we claim to be hosting is based on whois records and the number of domains that are pointing to our name servers. For some reason or another there are domains in our dns database that are not actually using our name servers.

    The 900k number (it’s actually around 830k, I think), does not include subdomains as those all fall under one of two domains (either dreamhosters.com or dreamhost.com).

    We don’t currently count any subdomains (including dreamhost.com/dreamhosters.com) as ’sites hosted’, though I’m sure many of them are full standalone websites.

  14. John Says:

    @chuck
    http://www.dreamhost.com/hosting-features.html#callbacks

    Also the amount of support they would require for nearly 1000000domains would require huge amounts of resources which would in turn increase costs.

    thus decreasing customers and increasing costs again.

    thanks for the info josh

    Ive been a happy customer for coming up to a year this month and my sites have only been down once from what i have been aware of.

  15. Chuck Says:

    John,

    Please note that I did not say that dreamhost did not *claim* to offer any type of support or even a callback.

    What I *did* say is that I REQUESTED a callback and the request was ignored for over nineteen hours. I finally gave up and cancelled the request. I did, however contact the Better Business Bureau. A sad commentary, but such is DreamHost.

    Companies can claim anything they want. It’s what they DO that impacts the customer - positively or adversely.

    My observation that any company that does not have a phone number for its customers doesn’t care about them stands. A company taking money from customers to service “1000000domains” clearly has sufficient profits to provide customer support.

    I pay Dreamhost about $60 a month for three L3 accounts. You would think that would warrant at least a reply to my email within nineteen hours.

  16. Heikki Says:

    The support has indeed been quite slow to answer. My questions were technical, but 14-24 hours is not fast.

    My “OMG BBQ Criticl WTF PPL iz dyin” was resolved in six hours by mail, and unless I’m mistaken, that’s during usual office hours in LA. That’s still not fast, but almost tolerable.

  17. Ginger Mayerson Says:

    Another brauva apology, Josh, well done, bravo! I understood most of it.

    Good thing I love you and DH so much.

    Don’t I get any treats for loving you and DH so much?

    Skype recently had a half day outage and they comped us paying customers an extra 7 days on all our accounts.

    And I don’t love them half as much as I love you and DH, Josh.

    Hm?

    Ginger
    Hosting with you and DH since 2002. Not leaving because I’ll never get the same webspace and data transfer anywhere else for the $ and for, y’know, emotional reasons (I’d never get over Josh and DH), but, gah! Think of how many marriage proposal emails I missed last week!

  18. ZOP Says:

    Why does the script even exist like that at all? It’s kind of absurd to do it that way. Why not delete the records when the account/domain is closed? Seems much saner, less resource intensive, and less apt to fail when you run out of RAM.

  19. Jared Says:

    You didn’t hire the person whose job I just took over did you? That individual was totally down with said coding practice. “Test that the data is valid? That the file is readable/writable? That I can connect to the database? Crazy talk! Having the system up is an operations problem.”

    The tone of this post is a little more serious and that’s appreciated. A few more “once again we’re really sorry and we’ll be at the LA farmer’s market next to the rotten tomato stand on Saturday if you want to throw things at us” might be nice too.

    Even though the programming would suck, automatically giving people a week’s credit for a day’s outage (and saying that vs saying we’re giving you two dollars back, because two dollars is not impressive) on their bills might be advisable. Others have talked about it here before and you’ve said you offer credits if people complain, but I think being proactive is a good symbolic gesture. I don’t know how many customers it would save in the long run, though. Least of all the really vitriolic blog posters.

    I’ll stay as a customer because you’re cheap without being GoDaddy (the post about their check-out process is dead on) and the one-click software upgrades keep me out of trouble. But it’ll be a while before I’d host a bread-and-butter site here.

  20. Web Hosting 2.0 { defmay } Says:

    [...] all, the “big ones” go down too, several times, and boy do they screw up in some [...]

  21. Ben Says:

    Would any of this have been averted if DreamhostPS servers were used?

  22. Maggie Says:

    Thanks for an explanation that wasn’t flippant and anger-inducing. Seriously.
    Your sense of humour is appreciated (and it was in this post - in just the right measure), but knowing where to draw the line between funny/flippant is important when things are tense. Otherwise… we just get pissed when we’re upset/stressed and are given a stand-up routine of an answer.

    So, thanks for taking that to heart this time. It’s greatly appreciated. And seeing as another issue followed on right after, I’m not nearly as stressed as I was during the last major outage when we were all ticked off by the blog post’s comedic tone.

    Thanks, and I hope everything is resolved soon.

  23. Nestor Says:

    What about the other issue where you were cutoff from western Europe due to some 3rd party issue? It would have been nice to have some sort of explanation. I mean I could see through a proxy that my sites were up, but logging into panel or blogs through some proxy I just grabbed off a google search isn’t a good idea security wise. There were dozens of european clients complaining in the dreamhoststatus blog who thought their sites were down (And as far as they’re concerned they were) and this was not acknowledged in any way.

    Support email also bounces any question that hasn’t been tagged with a support code so there was no way to safely contact you, add that to the rainstorm

  24. David Szpunar Says:

    Why not host your status blog on Media Temple’s grid? ;-) Actually I’m glad Matt invited you to WordPress because that was going to be my suggestion, but his invite might carry a bit more weight than mine!

  25. adam Says:

    thanks for the explanation, however, that code was absolutely horrible..

    no wonder you had issues, if the rest of what you guys code is along the same lines I’m not filled with faith that more errors won’t crop up every 2 weeks.. error checking + assuming something can go wrong is the first thing you do with any php, perl, etc script you use for automation, whether it be a small script or something like that.

    to not do error checking on a script DELETING things on your system… wow… what are you thinking?!

    every time I think I couldn’t do hosting myself, because you guys probably know a lot more and are using all sorts of security tricks to keep things safe that I wouldn’t know how to do.. I wonder after something like that. I’m glad you’re open about it, but maybe you shouldn’t have been.

    I guess your average non coding user that just one click installs wordpress though won’t see what a n00b disaster that little code snippet was, and how easy it would have been to avoid. I on the other hand vote you guys are not allowed anywhere near notepad ever again. no more coding for you ;)

  26. Heikki Says:

    Ben: The network and DNS were broken, so Dreamhost PS does not help. The same network and the same self-deleting DNS system.

  27. Tracie Says:

    If it makes you feel any better, I didn’t even notice. ;) And I’m glad you guys are open about it. It takes guts to be honest about mistakes.

  28. Christoph Dollis Says:

    Josh, as your first commentator, Miikka said, thanks for the post explaining the problems.

    DreamHost isn’t perfect, but during the year and a half I’ve been here, I’ve gone from complete novice to writing a valid XHTML/CSS web site, hand coded.

    And I have two, not one sites hosted here. It’s been a blast.

    I’m amazed at the capacity and features I get for around $10 a month. A little under, actually.

    IMAP is, hands down, the most important of these. Thanks for providing it.

    The communication with your company, considering I’m on email only support, is fantastic (like adding a CNAME value upon request - Merci!) and DreamHost Status adds to the mix.

    Did I mention your new control panel kicks ass over your last one?

    Signed,

    A sometimes frustrated DreamHost customer who is more often than not tickled pink,

    Christoph Dollis
    http://christophdollis.com - the place to come for more appointments, more people in the door, and more sales!(sm)

  29. Dreamhost lag plat — Michel Vuijlsteke's Weblog Says:

    [...] the long story is in any case worth reading [...]

  30. mike Says:

    @Josh, @Dreamhost

    @Christoph

    The email support is beyond words. We filed an incident support ticket 5 days ago and got the answer 9 minutes short of 48 hours later. Meanwhile, while all our websites were down and we had no way of retrieving our data, we had to file another 10 tickets during the 2 days before dreamhost support replied.

    After we got the answer from dreamhost we filed another 4 tickets. It’s been more than 2 and a half days since the first ticket we opened, and close to 2 days since the second, with no answer. Meanwhile, all our sites are down, we have no means to get to them, Dreamhost just seized our files without reason, and they refuse to give our backup files back.

    Tell me Christoph or Josh from Dreamhost,
    Is this normal for the 3rd largest hosting company in the world? It’s just a matter of 2 minutes for a tech support guy to set the rights for our files, so why does it take 5 days of downtime? I’m a paying customer and I have brought Dreamhost more than 100 clients, who pay thousands of $$$ per month. More than that, we’ve been with Dreamhost since Q1 2004, and we supported them and stood beside them through every downtime they had, and each time we understood their position.

    Regards,
    Mike

  31. James Printer Says:

    BAD CHUCK, it is against the terms of service to have multiple accounts

  32. Chuck Says:

    James,

    I have multiple business enterprises and companies. Having one account per makes sense and does not violate any TOS.

    As far as what’s relevant, I do hope that DH rights their ship. I am rooting for them and sincerely wish to remain their customer.

    Chuck

  33. Blake Says:

    Dreamhost panel still broken taking an age to load, something def up. Nothing on dream host status about this.

  34. Andy Says:

    While I appreciate the explanation it doesn’t change the fact that dreamhost’s service has been on a downward spiral during the last year. There are many things I love about DH but uptime currently… well, it sucks. Big time. It causes me issues with my clients, it causes me issues with the my business partners because hosting decisions I have made are being questioned.

    Up your prices if you need cash to fix this mess. I would be happy to pay double what we are paying to get a service I can actually rely on.

  35. Ryan Says:

    DH Panel loading lightning fast. Looks like just SMTP for some.
    Isn’t Cisco great to deal with? :)

  36. Christoph Dollis Says:

    Chuck, DreamHost Terms of Service, Item 8:

    The Customer agrees to hold only one (1) active web hosting service plan at any given time with DreamHost Web Hosting. Signing up for multiple plans is grounds for termination of all additional plans without warning.

    I have absolutely no idea why DreamHost would care if you had more than one account, it seems to me this would maximize their revenue.

    In fact, I was concerned about whether I could host a website I was then writing for my dad and emailed support. They told me this is fine and completely in accordance with the TOS.

    This seems bass-ackwards to me, but there you are.

    So you are definitely in violation of DH TOS and I think you should contact them immediately to either get an official okay for what you’re doing or make arrangements to get one higher level plan and cancel the others.

    I’d hate to see DreamHost act on their TOS and cancel your service.

    Sincerely,

  37. Chuck Says:

    Christoph,

    The “customers” here are my companies and not me personally. They are legal entities and recognized by the IRS as such.

    Thanks for your concern.

  38. Christoph Says:

    Then that makes sense.

  39. alex Says:

    Hmm… no host is perfect. A bigger hosting company I won’t name failed on us the other [s]day[/s] week (yes, all week).

    If your car goes bust, you can take it being off the road for a week, if your phone runs out of battery then it’s off for a few hours- but if your web host goes down then it’s suddenly a suicidal problem- I don’t get it.

    Web hosting goes down- like rain. It just happens. Amazon, MySpace, loads of biggies have downtime because it’s inevitable. There are so many dependencies, utilising 40 year old technology that wasn’t always designed with reliability or security in mind (go back to your textbooks and look up the 7 layers of ethernet), and so many issues- from power outages to malicious hackers… it is a wonder we have reliable internet at all.

    Your website isn’t the most important thing in the world- if people are bothered they’ll come back- if they don’t you’re not worth their time.
    Emailing support ten times a minute wont get anything fixed any faster than wearing out your F5 key.

  40. Dave Says:

    Christopher 36.
    The point of the TOS stipulation has mainly to do with usage. If you are actually filling up your storage/ eating up your bandwidth, then Dream Host (just about any other host for that matter) is losing money on you (big time). They are cool with that, you can use what they have oversold you, but if you want more you have to pay the real $’s per GB to increase it. If a person who has filled up their account signs up for another one and does the same (and no doubt uses a promo code too :) ) then they are really hurting.

    If you have an account with a couple of domains using 100MB, and you really want to pay them for another account in which you are going to take up another 100MB (or whatever), they will be doing well on you, and would be very unlikely to enforce the TOS. But they have to keep the rules simple, and still have teeth to deal with gluttons who cheat. You’re right though, lavish communication is probably best.

  41. It All Falls Down - Dreamhost [ Ectio.us ] Says:

    [...] From blog.dreamhost.com [...]

  42. dropsafe : How a techie company *should* explain when it goofs up… Says:

    [...] Dreamhost: It All Falls Down [...]

  43. Sun Says:

    Againnn?
    Woowwwwwww! D.H. can change some services….

  44. HELM, WHM/cPanel, Windows, Linux and SEO Blog » Blog Archive » Dreamhost review, 2.5 years Says:

    [...] last few weeks http://blog.dreamhost.com/2007/08/21/it-all-falls-down/ have been rough. The core router issues were back, and indirectly caused their DNS update script to [...]

  45. Technical Writing Says:

    I really like the transparency afforded by these posts. It makes anonymous errors less frustrating, and gives us a chance to see what ongoing problems get fixed.

    Having worked with computers for years, I’m not really concerned about getting phone updates regarding outages. In my experience, if a major outage occurs, it’s because something went really wrong and talking to someone about it isn’t going to help me unless that someone is a licensed psychiatric health care professional.

    I still would like Dreamhost to make its own version of “Second Life.” This would give me more ways to screw around when claiming to be maintaining web presences.

  46. Escrito dot Info » Web Server Hardware Upgrades Says:

    [...] I wasn’t aware that the host was down, it was a good thing that we were informed of the status and rectification. During the weekend, the [...]

  47. Jiminy Cricket Says:

    For the love of god, contract with some outsiders and do a full audit of your technology, network, power, HVAC, IT security, etc.

    I’m all for “doing it yourself”, but you’ve now got 100,000+ customers who depend on you (yes, none pay a lot by themselves, but collectively that’s a lot of money) and it’s clear that you are missing the knowledge of how to architect things to be appropriately reliable, scalable and secure.

    Consider it a one-time cost of doing business, like going to school. Continuing to “guess” or “hope” that you’ve fully fixed the problem despite massive multiple downtimes throughout the year is just folly. Admit you’re out of your league, get some education through these audits, and then implement the recommendations.

    Otherwise I’d say there’s a 100% chance of a similarly massive outage caused by poor engineering practices, faulty architecture, lack of true redundancy, whatever in the near future for dreamhost

  48. Christoph Dollis Says:

    A very thorough and logical answer Dave @ August 29 11:55 pm. I never thought it completely through.

    Thanks for the education.

  49. Code / Appnel Solutions Says:

    Making it look easy isn’t cheap…

    Late last week, Henry Blodget posted about an “Awesome Startup Idea” he has for a MT/WordPress as service business after his group struggle through some system issues. Being near and dear to my heart I call in to question is the realism of his reques…

  50. db Says:

    I have to say that’s one of the better posts about tech problems. I’m wondering what’s happened in the past at some hosting companies and I never found out.
