Anatomy of a Disaster, Part 2
September 19, 2006 at 3:00 pm | In Foobars, Insider View | by Dallas Kashuba
For the past several weeks many of you have been faced with slow or unusable websites and email. The original cause of that series of issues was detailed in Josh’s great Anatomy Of An Ongoing Disaster post. The network issue we were left with once the power outage problems were mostly resolved ended up being an especially nasty one. We were essentially caught with our pants down at just the wrong time and we’ve been taking our lumps for it.

The evidence we were seeing all pointed to one of the two routers as the primary troublemaker, so we focused on that one. Configurations were changed with some improvement but without resolving the main issue. Ultimately, six separate Cisco support engineers and a Cisco Certified Internet Engineer were all unable to determine the cause of the errors we were seeing on our routers. That, along with the recent power outages, eventually led everyone to believe there was a hardware fault somewhere within the router. That started our process of replacing and/or upgrading every component. Once that was done and the main problem was still there, we were finally able to pinpoint the source of the network congestion and resolve it, and that’s where we are now.
The problem ended up being the connection between the two routers. Our network was set up so one router was primarily responsible for some of our servers, and the other router was primarily responsible for the rest of the servers. Both routers are connected to outside network connections and they share those roles, providing wide-area network redundancy, but the inside of our network (our LAN) relied on both routers working together and passing bits back and forth. Some of you did not experience the problems because all of the servers your service relies on were on the same core router and were not bottlenecked by that inter-router link. Once one of the routers was fully upgraded we were able to move all traffic to that single router, thereby removing the bottleneck and restoring service completely for everyone.
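To make the inter-router bottleneck concrete, here is a minimal sketch of the idea, not a description of our actual tooling. The server names, router labels, and traffic figures are all made up for illustration. It simply adds up the traffic of every flow whose two endpoints live on different routers, which is the traffic that has to squeeze through the link between them.

```python
from collections import namedtuple

# A flow is traffic between two servers, measured in Mbps (hypothetical figures).
Flow = namedtuple("Flow", ["src", "dst", "mbps"])


def inter_router_load(home_router, flows):
    """Sum the traffic of flows whose endpoints sit on different routers."""
    return sum(f.mbps for f in flows if home_router[f.src] != home_router[f.dst])


# Hypothetical layout: servers split between core routers "A" and "B".
home_router = {"web1": "A", "web2": "B", "files1": "A", "mysql1": "B"}
flows = [
    Flow("web1", "files1", 200),  # both on A: never touches the inter-router link
    Flow("web1", "mysql1", 150),  # A <-> B: crosses the link
    Flow("web2", "files1", 300),  # B <-> A: crosses the link
]

split_load = inter_router_load(home_router, flows)

# Moving every server behind a single router removes all crossing traffic.
single_router = {name: "A" for name in home_router}
single_load = inter_router_load(single_router, flows)

print(split_load, single_load)
```

In this toy layout the customer whose site only uses web1 and files1 never touches the link, which matches why some of you saw no problems at all.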
Our routers were not redundant and that hurt us. If our routers had been redundant we could have much more easily moved all traffic to one router or the other just to eliminate some variables. Having that option would have saved us a lot of time and you a lot of painfully slow service.

Establishing Power Redundancy
In searching for this solution we wasted a lot of time uncertain about the integrity of our equipment. Whenever a piece of electrical equipment suddenly loses power there is always a chance of some component failing and when you’re dealing with a device as complex as a router that’s a lot of components to worry about! If our data center’s UPS and generator setup had worked properly and the routers had not lost power, we could have instead focused on the new evidence at hand, confident that nothing else had changed. Knowing that, and knowing the track record of our data center, we are already in the process of adding an additional layer of power redundancy for our most critical (and expensive to replace!) equipment. The DC powered equipment housed in our data center is backed by a secondary UPS system and did not lose power throughout the recent power fluctuations. To take advantage of that ourselves, we are converting to DC power at the core of our network. We have the power supplies sitting and waiting to be installed and we’re currently waiting for the power to be wired into the racks we need it in.
We are also expanding our space in our Alchemy Communications Data Center. Alchemy has set up their own UPS backed power feed and were not hit as hard by the power outage that took us down. All of our future data center expansion is going into Alchemy.

Establishing Network Redundancy
Looking back, our worst mistake of this ordeal was allowing our network hardware to end up in a state where we could not redirect all of the traffic to one router or the other. Having that option earlier on in the process would have allowed us to debug the problems more easily and ultimately we would have solved the problem faster. There’s no doubt about that.
When our two current core routers were originally deployed either one of them was able to handle the full load of the network. They were set up to share networking duties and we could have redirected traffic to one or the other if that ever became necessary. Unfortunately, the routers were not upgraded when they should have been and we ended up in a state where one of them was not able to handle the full load of the network. That situation combined with the problems beginning with the power outages led to the nasty network congestion that was difficult for us to diagnose and resolve.
Currently we are using a single router at the core of the network. Every component has been replaced and most of them have been upgraded so it is essentially brand-new and very able to handle our network traffic for the time being. We are in the process of re-establishing core router redundancy now and expect that to be done in the next few weeks. As we proceed into the future we will ensure that one of the two routers is always handling the full load of the network and the second router is standing by idle as a hot spare, should the need for it arise.
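The rule we let slip can be written down as a simple check. The sketch below is illustrative only: the router names and the capacity and load numbers are invented, and a real check would use measured peak traffic. The point is that a pair of routers is only redundant if each one, on its own, can carry the whole peak load.

```python
def can_fail_over(capacity_mbps, peak_load_mbps):
    """True only if every router alone can carry the entire peak load."""
    return all(cap >= peak_load_mbps for cap in capacity_mbps.values())


def worst_case_headroom(capacity_mbps, peak_load_mbps):
    """Spare capacity of the weakest router if it had to take all traffic."""
    return min(capacity_mbps.values()) - peak_load_mbps


# Hypothetical numbers: at deployment either router could take everything.
at_deployment = {"router_a": 1000, "router_b": 1000}
ok_then = can_fail_over(at_deployment, 600)

# Later the load grew and only one router was upgraded.
after_growth = {"router_a": 2000, "router_b": 1000}
ok_later = can_fail_over(after_growth, 1400)
shortfall = worst_case_headroom(after_growth, 1400)

print(ok_then, ok_later, shortfall)
```

A negative headroom is the warning sign: it means the weaker router is overdue for an upgrade, and the time to find that out is before a failure, not during one.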

Into the Future
While investigating this issue we have been forced to look more closely at our network than we have in a long time. That has uncovered more issues that may become larger problems for us down the road and we are already working on a large scale network reorganization to both improve overall performance and make network issues easier to detect and troubleshoot. If there’s a silver lining on this dark cloud, this may be it.
Our primary local area network setup is really two separate networks, one for traffic that never leaves our network (the private network) and the other for traffic that mostly does leave our network (the public network). When you access your website, traffic has to go over both the public and private networks (possibly multiple times) before you will see it come up in your browser. During our network problems it was primarily the private network that was responsible for the high server loads, slow website load times, and slow email access.
The first step we are taking to improve our network setup is to completely separate out our private network from the public network. That will immediately reduce the amount of traffic going through our core routers and additionally make it easier to track down problems. More equipment will be involved but network traffic will be more isolated. As part of this process we will also be rearranging network links in as close to an optimal way as possible to further isolate traffic and improve performance. Unfortunately due to limitations in our current network architecture the best we can do is about 30% optimal and it’s likely we will not even do that well.

So, the next step in the process is a complete rethinking of how we have been deploying our servers in our data center. For ease of deploying servers and efficient use of data center space we had architected our network to essentially allow any type of server (web server, email server, file server, MySQL server, etc.) to live anywhere on the network. That sort of setup has worked well for us for a while but we are now starting to see the early signs of network bottlenecks arising. For future server deployments, we will be assigning physical areas in the data center for different types of servers to facilitate a more optimal network layout between them. That will essentially localize the network traffic as much as possible and allow us to continue scaling for quite some time into the future. Overall network flow will be reduced as well, better utilizing the available throughput. This step is currently being planned and will be implemented first for the next set of servers we deploy.
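One way to reason about which server types belong near each other is to total up traffic by type pair. The following is a rough sketch under assumed data: the server names, types, and Mbps values are hypothetical, and this is not our planning tool. It aggregates flows into a small traffic matrix so the heaviest type-to-type conversations stand out.

```python
from collections import Counter


def traffic_by_type_pair(server_type, flows):
    """Aggregate (src, dst, mbps) flows into totals per unordered type pair."""
    matrix = Counter()
    for src, dst, mbps in flows:
        pair = tuple(sorted((server_type[src], server_type[dst])))
        matrix[pair] += mbps
    return matrix


# Hypothetical inventory and measured flows.
server_type = {
    "web1": "web", "web2": "web",
    "files1": "file", "files2": "file",
    "mysql1": "mysql", "mail1": "mail",
}
flows = [
    ("web1", "files1", 300),
    ("web2", "files1", 200),
    ("web1", "mysql1", 100),
    ("mail1", "files2", 400),
]

matrix = traffic_by_type_pair(server_type, flows)
heaviest_pair, heaviest_mbps = matrix.most_common(1)[0]

print(heaviest_pair, heaviest_mbps)
```

In this made-up data the web and file servers exchange the most traffic, so those are the areas that would want the shortest and fattest network path between them.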
All told we will be investing somewhere in the neighborhood of $300,000 into our network upgrades, not to mention all of the human time involved in planning and implementing these changes. Now that we have gotten this issue behind us we are fully committed and prepared to maintain network stability and do the work needed to improve network performance and continue to scale with our growth.
We are very sorry for all of the headaches this has caused everyone. Believe me, there was no one who wanted this problem resolved more than we did. Providing sub-par service is no fun and isn’t the way we like to spend our time. This problem took longer than it should have to resolve, but coming out of it we are now in a much stronger position as we look ahead.
111 Comments

This is great. The entire post was really well written, and I appreciate all the time put into detailing the history of the entire event, what you did to resolve it in the short term, and the short and longer-term steps you’re taking to prevent this kind of thing in the future.
I especially appreciate that it seems like you’ve taken a really grown-up approach (i.e. re-examining core-architectural decisions) and not just thrown some hardware at the problem.
Comment by cricket — September 19, 2006 #
Thanks.
Comment by Clay Smith — September 19, 2006 #
Good job on finally tracking it down. I guess you know that your problem is serious when seven Cisco guys can’t figure it out. Here’s to hoping that the network and overall service is very much more stable for a very long time.
Comment by Anthony DiSante — September 19, 2006 #
It’s kinda tough to decide. On the one hand, I’m glad that the irate, crazy nutbag customers bust your guys’ balls because it does provide the motivation for you guys to keep on top of your game, but at the same time, you guys are all so freakin’ likable that I hate seeing you get your balls busted.
What a conundrum.
Thanks for the update, as always, Dreamhost is teh rocks.
Comment by Nate Cavanaugh — September 19, 2006 #
Wow. Thanks for a great, clear update, Dallas.
I very nearly became one of the “nutbag customers” Nate mentions, but I’m glad I’ve hung in there with you through the EIGHT times my site has gone down in the last three weeks. (Looks like fix number eight is doing the trick so far…)
Looks like overall, Dreamhost has handled this very well. “Worst case scenario” doesn’t even begin to cover what’s been going on over there, and it sounds like you’re investing enough in infrastructure — and learning enough from previous mistakes — that we stand a decent chance of getting the great hosting we’ve become accustomed to from you folks. When everything works, DH kicks ass.
Good luck in there.
Comment by curtis — September 19, 2006 #
[...] In the post “Anatomy of a Disaster, Part 2” the folks at Dreamhost explain what all caused some of the problems seen with their service lately. The problems were obviously complex, considering seven different people from Cisco couldn’t even put their finger on it. But it seems that things have settled down for them, and are running much smoother (I posted before about the low load on the two shell accounts I have, and sure enough I’ve seen my sites loading faster and performing better than they have in quite some time, even before these problems were noticed.) [...]
Pingback by SRHuston Dot Net » Blog Archive » Oh, Was That All? — September 19, 2006 #
Python has now been upgraded to version 2.5. Dreamhost, please update.
http://python.org/
And since I know you guys use debian, here’s the link for the package
http://packages.debian.org/testing/python/python2.5
Comment by Upgrade Please — September 19, 2006 #
This was the most useful & helpful post regarding the hell over the last few weeks.
Thanks
Comment by blaize — September 19, 2006 #
GO DREAMHOST!!!!
Comment by Adam — September 19, 2006 #
“Some of you did not experience the problems because all of the servers your service relies on were on the same core router and were not bottlenecked by that inter-router link.”
I am one of the lucky ones, it seems. Apart from some minor difficulties with email, I have not noticed any significant problems at all. It goes a long way to explaining why some people in the forum were screaming their heads off, while others were baffled at all the commotion.
The mature and professional manner in which these difficulties have been dealt with has convinced me that my faith in the DreamHost team is not misplaced. The candid, well-written missives on this blog by Josh and Dallas do them much credit.
Comment by Simon Jessey — September 19, 2006 #
Dear Upgrade Please,
This is not the place to get packages upgraded. :) Also, we only install packages out of the Debian Stable repository. If they have a stable version please let us know! Otherwise, you can knock yourself out installing Python in your home directory and running it from there.
The best spot to request things is: https://panel.dreamhost.com/index.cgi?tree=home.sugg&
Then get all of your friends and neighbors to vote on it.
Comment by Kelly — September 19, 2006 #
Thank you for the frank and open statement of your difficulties and for explaining it in language even I could follow so I could momentarily feel smarter than I am.
;-)
I had experienced frustration and had even researched your competitors for a possible migration… however, it occurred to me that I may be throwing the baby out with the bathwater and this may seriously motivate you to improve your network.
Since you announced that your network difficulties are under control, my websites have been fast and webmail accessible.
Further, I had occasion to lean upon your customer service folks and ask them what became of my “disappearing comments” at DreamHostStatus.com and while they don’t know yet, Michael S. has been incredibly diligent in trying to find out and keeping in touch with me.
The service and communication (especially the communication) is way above average and I say that as an outstanding, yet modest, customer service representative with experience in multiple industries.
You should be proud of your customer service team. They saved one customer from migrating off your network — invest in “our” network and don’t let them down!
Migrating’s a bitch.
:-p
AFTERTHOUGHT: Besides, I also checked out your competitors’ CEOs’ blogs and they just had no panache.
Comment by Christoph Dollis — September 19, 2006 #
Glad to hear things are better. I’ll be setting up work related sites on my account eventually (instead of just goofing-off related sites) so I hope everything stays nice and stable for a long time (I’m sure you guys do as well).
Other than a couple of times when my sites were down I’ve had a mostly good experience with you guys and I’ve always gotten really fast replies to my support requests/questions even when they weren’t all that important overall (Even more impressive since I am in about as different a time zone as I can be from you guys).
Cheers.
Comment by ttancm — September 19, 2006 #
I can’t really bitch about your service since I was one of the lucky ones but from day one I got the feeling that your network was extremely ad hoc and unplanned. I guess you did not handle growth that well and that was the break point we just experienced the past weeks.
Now that you got over it please, please I beg you take your time and plan things out this time around and build something that scales well. We (the customers) that waited out this rough spot are bound to stay with you no matter what (you do offer amazing service after all) so don’t rush into things.
We like you and we don’t want you to fail us again.
A happy-you-pulled-thru customer
Comment by Nikos K — September 19, 2006 #
I just received an email today from InternetSeer saying my dreamhost hosted web site had 100% uptime :P
Comment by Jon — September 20, 2006 #
thumbs up guys \m/^^\m/
Just a quick note to the guy requesting the update of Python:
Look man, this is not the place to do that. Nevertheless, take into consideration that the link you sent for Python is in Debian’s “testing” branch:
http://packages.debian.org/testing/python/python
DH is using the stable branch:
http://packages.debian.org/stable/python/python
Although the stable branch is slow with updates to new versions, trust me, you don’t want to run on the “testing” branch if you are a hosting provider. :)
on the side note: this is why Subversion is still at v1.1.4 and not at 1.4. (1.1 is not even supported by developers any more)
Comment by Goran Dodig — September 20, 2006 #
During all of this I had not seen any problems in my neck of the woods, I guess I am lucky. But problems are bound to happen if people can not understand that. At least you are trying to resolve the issue and future issues that might occur.
Back up plans for the back up plans.
Comment by dj ricin4 — September 20, 2006 #
[...] recommend bookmarking both if you are a current Dreamhost customer. Posted at 9:19 am | Comments (0) Post yourcomment: [...]
Pingback by J.D. Myers » Honesty From Dreamhost — September 20, 2006 #
I knew I decided to migrate my domains to you guys for a reason. Keep up the good work!
Comment by T.Sayegh De Bellis — September 20, 2006 #
Thanks for the candid update, Dallas. I now better understand how some customers had so much better experience than others.
Luckily, Dreamhost’s growth has, while pointing out infrastructure and network architecture issues, been strong enough to fund the upgrade and reconfigurations necessary to address those issues; I’m looking forward to a return to the stability and reliability I have enjoyed at Dreamhost for years.
I really appreciate Dreamhost’s intelligent posts on these matters. It makes it a lot easier to make intelligent business decisions when one knows what is going on. Thanks!
Comment by rlparker — September 20, 2006 #
Great post, Dallas. I’ve stuck with you guys through all this stuff because I was confident you’d resolve it successfully. And speaking of growth, I just referred a new customer to you yesterday. Long live Dreamhost!
Comment by Patrick — September 20, 2006 #
@Kelly
I don’t understand this. You won’t update Python to the current version but you will update Ruby on Rails (which is nowhere near getting into Debian stable)?
What gives?
No love for us Python guys.
Comment by Upgrade Please — September 20, 2006 #
Your transparency in your operations is a refreshing change… thanks for sharing.
And to Upgrade Please: dude, WTF? Figure it out- this isn’t the venue for requesting upgrades.
Comment by Andre — September 20, 2006 #
To Upgrade Please: As has already been explained to you, there are mechanisms in place specifically for this kind of request. Posting messages in this totally unrelated forum is a complete waste of time. Use the suggestions dialog in the control panel.
Comment by Simon Jessey — September 20, 2006 #
I almost (but not quite) wish there would be another “disaster” so we could have another one of these great posts.
But maybe I can settle for just looking forward to stories on how things work to satisfy the geek side of me.
Thanks for your support DH!
Comment by A — September 20, 2006 #
What was so great about this post?
Talk is cheap. Once you fix the problem, I might then say “great post and thanks”.
-A disgruntled customer, whose customers yell and scream at me for advising them to use DreamHost.
Comment by Dan — September 20, 2006 #
Hey Dan, while I respect your opinion, I’m grateful that DreamHost fixed the problem, while communicating commendably, and keeping up relatively decent customer service (great agents, but too darn busy!) for the very reasonable amount I’ve paid.
If all goes well, I intend to renew with DreamHost.
Comment by Christoph Dollis — September 20, 2006 #
Re: Upgrade Please
Ruby on Rails was installed due to overwhelming demand in the suggestions section of the panel I linked. ;-)
And the various upgrades were due to some really bad problems we had with the unstable releases we installed, and then I think three security point releases.
Plus, you can install it yourself if you like. It will be slightly slower over NFS from your home directory, but if you work some FastCGI magic that is a moot point!
Comment by Kelly — September 20, 2006 #
Hi dreamhost, i was planning to post something really nice but by reading these comments i figured out that everything was said :)
I really find it interesting how your “behind the scene” things work. See me as a technical interested person/freak/geek.
It is indeed very nice to read that you have discovered what the bug in your network was and that you are busy avoiding that bug and others related by investing in and improving your network.
It’s also nice to read what other companies like to call inside information about the problem. They mostly just tell you: “the problem has been discovered and is being resolved” and no more info than that.
That is the most convincing reason why i stay with you, not the price but the idea that we count for you as people and not as some other customer with a number. It makes me feel as if i’m actually having a hosting dream :)
I really hope you get the strength of the network back and that you can keep it in the future! And as you communicate this well, it’s easier for us to appreciate when our server has to go down for an update, because we know it’s gonna make our server and our sites much better.
Thanks!!
Comment by Roeland — September 22, 2006 #
As somebody in the computing industry its nice to see a real report of what is happening reaching the customers. I only wish my managers were as transparent with our customers when something breaks in our infrastructure. Sometimes things happen that make you go Huh! and guess what you’ve planned for that . . And guess what something was screwed up which meant your plan didn’t work! (Having dual site resilience with 4 teamed NICs in each server means that your AD Domain controllers shouldn’t go down with an AD Corruption . . Or at least not BOTH of them . . . Unless your hardware partner didn’t configure them the way they had agreed with your network partner . . . A simple shutdown to install an additional shelf in the SAN shouldn’t have taken down our hosted environment for 24 hours, but it did)
Well done on getting to the bottom of it, and then for not resting there but taking the time to optimise hardware placement to minimise bottlenecks.
Comment by Aleman — September 22, 2006 #
When do we get anatomy of a disaster, part 3, since so many of us are down right now?
http://www.dreamhoststatus.com/2006/09/22/filer-issue-causing-a-few-service-interruptions/#comments
Read those comments… so many of us have sites that are down. And yet, when you go to the emergency status page, here’s what it says:
“Network Problems Resolved”
Come on. What’s going on?
Comment by Rob — September 22, 2006 #
a note from the future referred to in the post (like, right now)
I don’t really care why the sites are down. And truly, explaining how you’ve fixed them — which I’m reading now that they’re down again — is not how PR works.
Apropos, I’m redirecting a press inquiry into our own site (which is down) to your own press folks, Dreamhost. Who can they talk to? (seriously)
Comment by Kevin — September 22, 2006 #
What seems odd is that all this stuff happens during the middle of business hours. Who makes planned major changes to production equipment during business hours? With 100,000+ sites hosted you simply can’t be making changes during the day. It’s too risky, no matter how much you “think” service won’t be affected.
C’mon DH, you’re a business. Start acting a little more like one. I’m sorry that being a sys/network admin requires lots of after-hours work, but that’s the way it is.
Comment by cricket — September 22, 2006 #
@cricket
Think global! The internet does not have or know ‘business hours’. I usually have to do updates after midnight but luckily all of our customers are in the same timezone.
Comment by netwalker — September 22, 2006 #
No offense, while it’s great that you guys are working on it, we’ve heard that before. A lot. We reported one of our sites down around 9am Friday. We get ignored for multiple hours, then get told we’re getting our connection increased to help balance load, and our site should be back up. It’s now 1am on Saturday and it’s been more than 8 hours since we were told things would be back up in a few minutes, and 6+ hours since we last emailed support to get an update. We’ve had enough, and are moving at least our most important sites off Dreamhost to some place more reliable. It’s a good thing I can’t find anything about an uptime guarantee, or I’m sure there are hundreds of customers that would be asking for refunds for today alone.
Comment by Chance — September 22, 2006 #
WHY DOESN’T DREAMHOST SIMPLY SAY, “WE ARE A WEB HOST PROVIDER FOR INDIVIDUALS, AND NOT FOR BUSINESSES”.
BECAUSE WITH THE AMOUNT OF DOWNTIME THEY HAVE, NO BUSINESS WOULD EVER (OR SHOULD EVER) CONSIDER THEM AS A HOST.
-TED
Comment by Ted — September 23, 2006 #
One slight suggestion—change that $300,000 in network re-investment to $250,000, and add another customer service person or two. When things are blowing up and bosses are breathing down your customers’ necks, having an honest-to-gosh living person to respond to tickets makes a huge difference. It’s not the downtime that bothers my company so much as the lack of communication.
Comment by Ryan Cannon — September 23, 2006 #
[...] A recent post on their main blog goes into more detail about what went wrong recently, and what they’ve been doing to rectify it, and to prevent further problems. I suspect yesterdays outage was related to the attempts to provide a permanent fix. [...]
Pingback by Normal service has been resumed : Losing it[1] — September 23, 2006 #
Thanks, Dallas, for this post. The information is appreciated.
Comment by Kevin — September 24, 2006 #
Ted was an angry young bloke / And we see that his caps-lock is broke / What he said I don’t know / I did try to read it though / But all the yelling gave me a stroke.
Comment by ken — September 24, 2006 #
In response to Ryan: “One slight suggestion—change that $300,000 in network re-investment to $250,000, and add another customer service person or two.”
We hire regularly as our support volume increases. This month we’ve hired 4 or 5 new techs alone. This is above and beyond the $300k Dallas noted for network improvements.
DH does not run on-the-cheap, in any respect. :)
Comment by Karl — September 25, 2006 #
“We hire regularly as our support volume increases. This month we’ve hired 4 or 5 new techs alone. This is above and beyond the $300k Dallas noted for network improvements.
DH does not run on-the-cheap, in any respect. :)”
@Karl
Maybe DreamHost should start to consider hiring LESS, but hire more HIGHLY skilled employees.
This on-going, “we fixed the problem” then days later say, “whoops, we didn’t know what was wrong” indicates to me that DreamHost needs higher skilled employees who “know” what’s going on with the constant downtime.
Just my thoughts, with all due respect.
Comment by Shaun — September 25, 2006 #
This is basically why even with the issues, I don’t really contemplate changing from DH. I’ve never had a host this honest, or try this hard.
Comment by Grey Hodge — September 26, 2006 #
Still not working. Hope you guys get this sorted out soon!
Abi
Comment by Abi Titmuss — September 27, 2006 #
I was quite unhappy with the way the service was getting and how unreliable, but lately things have seemed a lot better and am very pleased that you’re taking time to put in redundancy. Cheers to dreamhost for fixing your problems and being able to assess where the problems were!
Comment by Betta — September 27, 2006 #
I second what Shaun said. I read an academic paper from amazon indicating they manage their entire online system with about a dozen systems administrators (run=operate, not design/build new stuff). They hire super-senior guys (and gals) who proactively look for ways to architect things so they don’t break and who create advanced tools to investigate and diagnose problems as they occur.
This post, which is refreshingly transparent and honest, is a good first step. But I fear that while you’re clearly admitting that you’re out of your depth when it comes to systems and network architecture, you’re going to have the same people come up with a solution to your problems.
Given that your entire business is hosting, it seems like you need to step up the level of people you’re hiring. Or at least go out and get some consultants who know what they’re doing for a short-time to get a “real” architecture (not one that just works out of luck, but one that is engineered to work). It couldn’t hurt to get some non-vendor, outside advice for something as critical as your network architecture.
Comment by cricket — September 28, 2006 #
[...] On their part DreamHost openly discussed their problems on their blog (part 1, part 2) which they were lauded for (Mike Davidson, Robert Scoble) because it, of all things, kept the dialogue open between them and customers. [...]
Pingback by Julio Garcia - Web Technology, Movies, Sports, Life in San Jose, and more - » Blog Archive » Looking to Move from DreamHost — September 28, 2006 #
Hey guys, I am impressed with your level of transparency in this matter. I am considering hosting with your service and this is the kind of information that personal customers require.
I also could not agree more with your post about being able to handle the full network load with one core router. HSRP was designed to provide redundancy, and most people load-balance the VLANs for S&G’s. When it comes down to the wire, you have to be able to use it for redundancy - redundancy first!
This is all the consulting you need:
http://www.cisco.com/en/US/netsol/ns656/networking_solutions_design_guidances_list.html#anchor3
And remember, if it’s not a hardware issue, the problem is almost always the load balancer! Never trust a load balancer.
Comment by John — September 29, 2006 #
oh yea, that’s CCIE - Cisco Certified Internetworking Expert ;)
Comment by John — September 29, 2006 #
I’ve been with you guys since 1999. I joined just after you’d been hacked into and a large proportion of your customers’ websites totally wiped out - only after the event did you discover a large % of your automatic back up systems weren’t working. Remember?
I only mention this because if I were a cynic, I’d say your ability to predict and avoid disaster hasn’t got any better as the years have gone by. Maybe you should hire a futurist - someone whose job it is to predict the worst case scenario, and then guard against it!
Anyhow, I am glad things are looking up for you. But I do agree with the observation elsewhere in these comments that maybe you need someone with more technical expertise in your support department to complement the current staff.
And finally - if I had known about the hacking event (it happened a week before I signed up, as I recall) I would never have joined. So being prepared for disaster and then avoiding it is most likely good for your business too.
Good luck,
Rod
Comment by Rod — September 29, 2006 #