Anatomy of a Disaster, Part 2
September 19, 2006 on 3:00 pm | In Foobars, Insider View by Dallas Kashuba | 111 Comments
For the past several weeks many of you have been faced with slow or unusable websites and email. The original cause of that series of issues was detailed in Josh’s great Anatomy Of An Ongoing Disaster post. The network issue we were left with once the power outage problems were mostly resolved ended up being an especially nasty one. We were essentially caught with our pants down at just the wrong time and we’ve been taking our lumps for it.

The evidence we were seeing all pointed to one of the two routers as the primary troublemaker, so we focused on that one. Configurations were changed with some improvement but without resolving the main issue. Ultimately, six separate Cisco support engineers and a Cisco Certified Internetwork Expert were all unable to determine the cause of the errors we were seeing on our routers. That, along with the recent power outages, eventually led everyone to believe there was a hardware fault somewhere within the router. That started our process of replacing and/or upgrading every component. Once that was done and the main problem was still there, we were finally able to pinpoint the source of the network congestion and resolve it, and that’s where we are now.
The problem ended up being the connection between the two routers. Our network was set up so one router was primarily responsible for some of our servers, and the other router was primarily responsible for the rest of the servers. Both routers are connected to outside network connections and they share those roles, providing wide-area network redundancy, but the inside of our network (our LAN) relied on both routers working together and passing bits back and forth. Some of you did not experience the problems because all of the servers your service relies on were on the same core router and were not bottlenecked by that inter-router link. Once one of the routers was fully upgraded we were able to move all traffic to that single router, thereby removing the bottleneck and restoring service completely for everyone.
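To make the bottleneck concrete, here is a minimal sketch in Python. The server names, router labels, flow sizes and the 1000 Mbps link capacity are all made-up numbers for illustration, not measurements from the network described above. It adds up the traffic whose two endpoints sit behind different routers, since that is the traffic that has to squeeze through the inter-router link.

```python
from typing import Dict, List, Tuple

# Hypothetical capacity of the link between the two core routers.
LINK_CAPACITY_MBPS = 1000.0


def inter_router_load(home: Dict[str, str],
                      flows: List[Tuple[str, str, float]]) -> float:
    """Sum the Mbps of every flow whose endpoints sit behind different routers."""
    return sum(mbps for src, dst, mbps in flows if home[src] != home[dst])


# Hypothetical placement: which core router each server hangs off.
home = {"web1": "A", "mail1": "A", "db1": "B", "files1": "B"}

# Hypothetical internal flows: (source, destination, Mbps).
flows = [
    ("web1", "db1", 400.0),
    ("web1", "files1", 500.0),
    ("mail1", "files1", 300.0),
    ("web1", "mail1", 50.0),  # both on router A, never touches the link
]

crossing = inter_router_load(home, flows)
congested = crossing > LINK_CAPACITY_MBPS

# Moving every server behind a single router removes the crossing traffic.
single_router = {name: "A" for name in home}
after_move = inter_router_load(single_router, flows)

print(f"crossing the link: {crossing} Mbps, congested: {congested}")
print(f"after moving everything to one router: {after_move} Mbps")
```

In this toy layout a customer whose services all live on web1 and mail1 never crosses the link, which matches why some customers saw no problems at all.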
Our routers were not redundant and that hurt us. If our routers had been redundant we could have much more easily moved all traffic to one router or the other just to eliminate some variables. Having that option would have saved us a lot of time and you a lot of painfully slow service.

Establishing Power Redundancy
In searching for this solution we wasted a lot of time uncertain about the integrity of our equipment. Whenever a piece of electrical equipment suddenly loses power there is always a chance of some component failing and when you’re dealing with a device as complex as a router that’s a lot of components to worry about! If our data center’s UPS and generator setup had worked properly and the routers had not lost power, we could have instead focused on the new evidence at hand, confident that nothing else had changed. Knowing that, and knowing the track record of our data center, we are already in the process of adding an additional layer of power redundancy for our most critical (and expensive to replace!) equipment. The DC powered equipment housed in our data center is backed by a secondary UPS system and did not lose power throughout the recent power fluctuations. To take advantage of that ourselves, we are converting to DC power at the core of our network. We have the power supplies sitting and waiting to be installed and we’re currently waiting for the power to be wired into the racks we need it in.
We are also expanding our space in our Alchemy Communications Data Center. Alchemy has set up their own UPS backed power feed and were not hit as hard by the power outage that took us down. All of our future data center expansion is going into Alchemy.

Establishing Network Redundancy
Looking back, our worst mistake of this ordeal was allowing our network hardware to end up in a state where we could not redirect all of the traffic to one router or the other. Having that option earlier on in the process would have allowed us to debug the problems more easily and ultimately we would have solved the problem faster. There’s no doubt about that.
When our two current core routers were originally deployed either one of them was able to handle the full load of the network. They were set up to share networking duties and we could have redirected traffic to one or the other if that ever became necessary. Unfortunately, the routers were not upgraded when they should have been and we ended up in a state where one of them was not able to handle the full load of the network. That situation combined with the problems beginning with the power outages led to the nasty network congestion that was difficult for us to diagnose and resolve.
Currently we are using a single router at the core of the network. Every component has been replaced and most of them have been upgraded so it is essentially brand-new and very able to handle our network traffic for the time being. We are in the process of re-establishing core router redundancy now and expect that to be done in the next few weeks. As we proceed into the future we will ensure that one of the two routers is always handling the full load of the network and the second router is standing by idle as a hot spare, should the need for it arise.
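The hot-spare rule boils down to one check: every router must be able to carry the whole network by itself. The sketch below uses hypothetical router names and capacities (none of these figures come from the real equipment) to show how the check passes at deployment and then quietly starts failing as traffic grows.

```python
from typing import Dict, List


def routers_unable_to_carry(peak_load_mbps: float,
                            capacity_mbps: Dict[str, float]) -> List[str]:
    """Return the routers that could NOT carry the whole network alone."""
    return sorted(name for name, cap in capacity_mbps.items()
                  if cap < peak_load_mbps)


# Hypothetical per-router capacities.
capacity = {"router-a": 4000.0, "router-b": 2500.0}

# At deployment either router can take the full (hypothetical) peak load.
at_deployment = routers_unable_to_carry(2000.0, capacity)

# After growth without upgrades, one router can no longer stand in alone.
after_growth = routers_unable_to_carry(3000.0, capacity)

print("cannot fail over at deployment:", at_deployment)
print("cannot fail over after growth:", after_growth)
```

Running a check like this against measured peak load on a schedule is one way to notice that an upgrade is overdue before a failure forces the question.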

Into the Future
While investigating this issue we have been forced to look more closely at our network than we have in a long time. That has uncovered more issues that may become larger problems for us down the road and we are already working on a large scale network reorganization to both improve overall performance and make network issues easier to detect and troubleshoot. If there’s a silver lining on this dark cloud, this may be it.
Our primary local area network setup is really two separate networks, one for traffic that never leaves our network (the private network) and the other for traffic that does mostly leave our network (the public network). When you access your website traffic has to go over both the public and private networks (possibly multiple times) before you will see it come up in your browser. During our network problems it was primarily the private network responsible for the high server loads and slow website load times and email access.
The first step we are taking to improve our network setup is to completely separate out our private network from the public network. That will immediately reduce the amount of traffic going through our core routers and additionally make it easier to track down problems. More equipment will be involved but network traffic will be more isolated. As part of this process we will also be rearranging network links in as close to an optimal way as possible to further isolate traffic and improve performance. Unfortunately due to limitations in our current network architecture the best we can do is about 30% optimal and it’s likely we will not even do that well.

So, the next step in the process is a complete rethinking of how we have been deploying our servers in our data center. For ease of deploying servers and efficient use of data center space we had architected our network to essentially allow any type of server (web server, email server, file server, MySQL server, etc.) to live anywhere on the network. That sort of setup has worked well for us for a while but we are now starting to see the early signs of network bottlenecks arising. For future server deployments, we will be assigning physical areas in the data center for different types of servers to facilitate a more optimal network layout between them. That will essentially localize the network traffic as much as possible and allow us to continue scaling for quite some time into the future. Overall network flow will be reduced as well, better utilizing the available throughput. This step is currently being planned and will be implemented first for the next set of servers we deploy.
All told we will be investing somewhere in the neighborhood of $300,000 into our network upgrades, not to mention all of the human time involved in planning and implementing these changes. Now that we have gotten this issue behind us we are fully committed and prepared to maintain network stability and do the work needed to improve network performance and continue to scale with our growth.
We are very sorry for all of the headaches this has caused everyone. Believe me, there was no one who wanted this problem resolved more than we did. Providing sub-par service is no fun and isn’t the way we like to spend our time. This problem took longer than it should have to resolve, but coming out of it we are now in a much stronger position as we look ahead.
DreamHost Goes North (San Francisco)
September 12, 2006 on 11:54 am | In Jobs by Dallas Kashuba | 25 Comments
DreamHost is starting up a small satellite office in lovely San Francisco and we’re hiring ONE Perl programmer with good UNIX/Linux skills. If you are in or near San Francisco and think you might fit in with our organization, check the full job description on our jobs listing page.
Don’t Forget Dreamhoststatus.com!
September 12, 2006 on 10:53 am | In Updates by Josh Jones | 18 Comments
Just a friendly reminder everybody: dreamhoststatus.com is the more serious, just-the-facts, business-first, updated about all downtime, blog.
In fact, there’s a new post there now all about the current state of the network.
Also, we decided to turn on comments there too from now on, because maybe that was a reason people liked erroneously checking this site for uptime-related things.
In summary, this blog will stay heavily focused on its original purpose.
Stroking Josh’s ego.
I Am Your Shepherd
September 8, 2006 on 6:40 pm | In Insider View, Musings by Josh Jones | 77 Comments
So, a lot of people apparently liked making fun of my wife last week.
The post got on the first page of digg, over 1000 diggs, and pushed about 14 times the typical daily traffic our way.
Cool, people must have really found it interesting.
I must have really struck a chord.
Or maybe I cheated.
You see, the DreamHost newsletter went out that day, to nearly all our subscribers.. and I put a link to that post in it. That in itself drew way more people to the blog than on average.
Then, I posted it on digg. (Actually, I tried to, but some Happy DreamHost Customer beat me to the punch, so I just used his post!)
For those of you who don’t know, digg is basically a “meta-blog” where anybody can post an interesting link with a blurb and if enough people find it interesting it gets promoted up to the front page. It’s supposedly a highly-democratic way for the “hive mind” to filter out the cruft from the crop. Supposedly.
As you might have noticed, there are a lot of sites like this these days, these user-driven sites. Flickr, YouTube, Wikipedia, yada yada yada.. there’s even a community-driven t-shirt site, threadless.com!
I guess that’s what makes the web 2.0.
Then, I included the little “Digg” icon at the bottom of the post with some HTML they provide. Unfortunately, due to a problem with WordPress and JavaScript in posts, the digg icon doesn’t seem to show up in IE 6.
Unfortunately… for nobody!
It turns out enough newsletter readers saw the digg link, were registered digg users, and thought the post was diggable to get it up to 50 or so diggs pretttty fast. And apparently somewhere around 50 is when digg decides a new post is popular enough to automatically make the front page!
Once it was on the front page, that was it. The diggers started digging it, the newsletterers kept newslettering it, the traffic kept rising, the bloggers started blogging it, and so on and so on. I was contacted by my aunt, my wife’s cousin, my old college suitemate, Yahoo’s anti-phishing department, and even the head of the IRS’s anti-online-fraud department.
Who knew my aunt read digg?

So, the post was clearly a hit.
But really, was it that great? I mean, yes, it was. Of course. I’m awesome. Thanks.
But it also seems that I have a human “bot-net” of hundreds of thousands of willing drones that I can send in any direction I choose! Mwah ha ha ha.
And it just so turns out that community “voting” sites are highly susceptible to the directed efforts of thousands of people!
If I wanted to, I could probably repost my old prediction about Apple making Video Airport Expresses (in anticipation of their announcement next Tuesday), get it on the front page of digg, the rumor sites would take it as gospel, and before you know it, I’d be paid a little visit by three large men in black turtlenecks.
Of course, anybody who’s ever run an online poll knows how easy it is for community-driven features to get abused. My way is just a slightly less automated, but a lot more “authentic”, way of generating attention and spreading my influence.
It’s really just the “web 2.0” version of special-interest groups and letter-writing campaigns.
Letters?
Ask your parents about them.
And that is where I see an inherent weakness in this brave new world we’re entering into. Sometimes, moderators are needed. Sometimes, the wisdom of crowds doesn’t work. Sometimes, the crowd is gaming you.
It’s the tyranny of the well-organized minority.
It even happens to us.. Subversion was one of the most-voted-on features in our suggestions area for a while. It seemed a teeny bit esoteric to me, but eh, whatever, I guess we’ve got a lot of fancy coding mama-yamas hosting with us. We implemented it, and it does seem popular.. but not nearly as popular as we might have expected based on the votes.
Later I discovered why.
Some sneakster had published a link right to the “Vote for official DreamHost support of Subversion” in the Subversion area of our wiki!
There’s nothing wrong with that per se, but it did mean that just about EVERYBODY who used DreamHost and was at all interested in Subversion voted for it.. whereas our other suggestions get a much more random and normal distribution of users.
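For the arithmetic-minded, here is a rough sketch of how much a single well-placed link can inflate a vote count. Every number in it is invented for illustration (the real vote totals and turnout rates are not given anywhere in this post); the point is only that the same group of fans produces wildly different tallies depending on how many of them are steered to the ballot.

```python
def votes_cast(interested_users: int, turnout_percent: int) -> int:
    """Votes produced when turnout_percent of the interested users vote."""
    return interested_users * turnout_percent // 100


# Hypothetical: number of customers who care about the feature at all.
SUBVERSION_FANS = 5000

# Hypothetical turnout when fans stumble on the suggestions area by chance.
organic = votes_cast(SUBVERSION_FANS, 2)

# Hypothetical turnout when a wiki page links straight to the vote button.
linked = votes_cast(SUBVERSION_FANS, 60)

inflation = linked // organic

print(f"organic votes: {organic}")
print(f"votes with a direct link: {linked}")
print(f"inflation factor: {inflation}x")
```

Same number of fans, same level of real interest, and a very different position on the suggestions leaderboard.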

Anyway, it’s all very interesting.
What EVIL SCHEME shall I use you, my sheep, for next?
Shall we get mandatory school vouchers for hybrid vehicles? Maybe we should all boycott McDonald’s new DRM-restricted big macs? Or do some offshore drilling into SCO’s legal team?
No matter what we decide though, I promise it’ll be in our best interest! When you’re a minority (and believe it or not, DreamHost customers are still a minority), you’ve got to organize and use the power you have efficiently and effectively. It worked for the Sunnis in Iraq, and it will work for you.
First order of business, let’s get this T-Shirt design made:
(It was made by a DH employee’s GIRLfriend!)
Please?

