At Least We Ate Lunch Beforehand
September 13, 2005 on 11:37 am | In Foobars, Updates by nate
Hey everybody! So, yeah, yesterday was exciting.
I’ll give you a little timeline/rundown on what-happened-when yesterday. But to generally describe the problems:
Beyond the obvious LA DWP screwup, and the building generator screwup (more word on exactly what that was soon), there are some non-trivial problems when all of your stuff shuts down. The biggest is that when any group of computers is unexpectedly powered off, a small percentage don’t come back up. When you’re talking about a few hundred servers, a small percentage becomes a significant number!
You do also have to be somewhat careful how quickly you power things up, because a popped breaker is just about the last thing you want.
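To put rough numbers on those two points, here is a minimal sketch. The 2% failure rate, the 300-server count, and the batch size of 4 are made-up figures for illustration, not measurements from our datacenter, and the function names are hypothetical. It only shows the arithmetic behind "a small percentage becomes a significant number" and the idea of switching cabinets on in small groups rather than all at once.

```python
def expected_failures(server_count, failure_rate):
    """Expected number of machines that stay down after an unplanned power cut."""
    return server_count * failure_rate


def power_up_batches(cabinets, batch_size):
    """Split cabinets into small groups so they can be switched on one group at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [cabinets[i:i + batch_size] for i in range(0, len(cabinets), batch_size)]


# Hypothetical figures: 300 servers, 2% of them failing to come back.
print(expected_failures(300, 0.02))

# 40 cabinets, powered up 4 at a time (the batch size is an assumption).
batches = power_up_batches(list(range(1, 41)), 4)
print(len(batches), "batches; first:", batches[0])
```

How big a batch is safe depends on the breaker rating and the startup draw of each cabinet, neither of which is given here, so treat the batch size as a knob rather than a recommendation.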
11:45 Went to lunch at sushi place on Flower St.
12:30 BLACKOUT! We’re walking back and see streetlights out, office tower generators starting up. Our building’s UPS kicks in, generator is running (we can see the smoke from our office).
12:45 My idiotic blog post.
12:46 Building evacuated, our upstream providers are down although our servers are still powered up.
1:00(ish) UPS depleted, generator fails, power is down.
1:45 Jason, our datacenter manager, bullies his way in with me while building is still evacuated.
2:00-3:30 We plan for all the fun stuff that happens when the power comes on. Upstreams come back up, although now our servers are down.
3:45 Datacenter power back on, we start powering up cabinets slowly.
4:15 Most of our public network up, firewall busted, private network down (which means our monitoring system is down, Dreamhost site down, Web Panel down. Our blog is up because it’s just on a normal shared hosting account.)
5:00 Big problems come first: File server replaced, firewall being feverishly worked on, 3 public cabinets (out of 40) still down due to switch problems.
6:00 Firewall fixed (which lets us quickly identify continuing problems via monitoring system).
6:30 All public cabinets up, individual machines/services still wonky. Web panel, etc up.
6:30-midnight: Fixing individual servers/services. Some weird Web Panel redirection loop errors fixed. Webmail login errors fixed.
I do have to say, walking around and seeing aaaaaall our servers powered down and quiet was pretty creepy. I’ve been here for a long time, and seen plenty of nasty problems, but this one was particularly freaky.
We would have liked to post something like the above live, but we were literally running and yelling and typing as fast as we physically could from 3:45 to midnight.
We are, of course, already digging into the issue of why power was ever out in this supposedly-very-prepared building in the first place. We’ve had grid outages before and never noticed a blip. There are LOTS of other internet companies in this building (including a bunch of other shared hosting folks) so yeah, we’re all pretty much going crazy about how badly the building handled things. We’ve always been told there are HOURS (not half-hours) of UPS capacity and that the generators are regularly tested and well-maintained.
Also, I bought a lottery ticket on a Red Bull run. I figure karma might balance things out. Keep your fingers crossed.
19 Responses to “At Least We Ate Lunch Beforehand”
September 13th, 2005 at 12:56 pm
In my experience the only thing customers hate more than down time is a lack of information. Taking time during a major crisis to communicate is important and doesn’t mean you are ignoring the problem. People may not be willing to admit it but timely updates (or at least acknowledgments of known issues) are worth having even if it means the problems don’t get fixed quite as fast.
It is really easy to get tunnel vision and think you can post an update later because the problem is almost fixed. That’s great if you’re right but if you aren’t right people start to think you aren’t paying attention.
I’ve also learned to not be overly optimistic in the status reports I do give my customers… it is easier to tell people that you fixed something faster than expected! I think I got that from watching Scotty on Star Trek.
September 13th, 2005 at 12:59 pm
Just curious– have you guys ever tried Hama Sushi on 2nd in Little Tokyo? Only the best Sushi in the city! Last time I was there we had an earthquake. At first I thought it was my stomach aching for their oh-so-delicious unagi. Check it out next power outage!
September 13th, 2005 at 1:12 pm
I agree with Geoff. I understand the intense focus of being almost done with a hugely important thing, but a bit more communication would have gone a long way in this situation. However, Kudos for having the status page up and running offsite, glad it was there. It’s the candid reports of “how things are going” at the dreamhost world headquarters that make me think so highly of dreamhost. You guys do better than most, but yesterday you left some room for improvement. Perhaps next time we’ll hear a bit more regularly about how things are going in the server room.
September 13th, 2005 at 1:18 pm
So you guys are in the same building as Media Temple? They reckon:
Is that true? I know you guys have only said that the generators failed, but that’s a pretty scathing indictment of your building’s engineers.
September 13th, 2005 at 1:46 pm
Yeah, we talked to MT about the stuff they posted today, but we’re hearing differing reports about why the generators failed. The building is also sort of like this weird sewing circle of rumor and gossip. There’s no official word on what went wrong yet.
And we had emergency lighting come on at around 5:30pm and the generator start again, but we went outside to see if the grid was down again and everything around us was still on. I have no idea about the 8pm thing they reference.
And as for how much info we provided on the blog and status page…other than having something like a court reporter sitting down there with us, there really wasn’t better info to report. We did get the general word out pretty quick, and I don’t know how you wouldn’t know we were working on stuff when we were doing it!
A problem of this size doesn’t really lend itself to minute by minute updates. We did an insane amount of work yesterday. It took me about half an hour just to write up that little post…we would have fixed 5 or 6 machines in that time!
September 13th, 2005 at 1:50 pm
[quote]As a matter of fact there are few buildings that rival the amount of redundancy and investment which has been put into the Garland Building’s backup power systems.[/quote]
[img]http://bigredwa.temp.powweb.com/Personal/Forums/BSMETER.gif[/img]
The level of redundancy is average. IMHO, in order to be able to tout “there are few buildings that rival the amount of redundancy and investment which has been put into … backup power systems” you need to be on three power grids. Above average is two. My home office has about as much redundancy as the description of the DH data center. Most companies I’ve worked for offer the same as described.
September 13th, 2005 at 2:05 pm
Absolutely, what Geoff said. Plus, specifics HELP. Even if you’re giving me an insanely long time frame (like, “we hope to have everything back to normal within the next 12 hours”), that makes me feel a lot better than the fuzzy and nonspecific “we’ll have it all working soon.”
We know you were all working your butts off all night long; that was a given, and we DO appreciate it. But more (and slightly better) communication would have been very helpful.
September 13th, 2005 at 2:07 pm
nate,
For myself the status page was all the info I needed.
But I think there is a lot of good info in the timeline you posted here that could have been helpful to people yesterday evening.
September 13th, 2005 at 2:27 pm
I’m happy with the way the information was given to us: basics on the status page and, after everything was almost sorted out, the blog… if they had posted constantly on the blog what they were doing it would have taken longer and, no doubt, people would have complained about them blogging instead of fixing!
September 13th, 2005 at 3:03 pm
After the incident a few years ago that generated the need for the status page, I never bookmarked it.
For some reason I bookmarked the site a few days before this outage. I’m glad I did.
I received a few phone calls yesterday from other DH customers I know. They asked if my $hit was down too.
I told them about the status page. I just wonder how many others didn’t know about the status page and didn’t know any other DH customers to call.
I agree that more information would’ve been better. But better only in a utopian way. The information provided was way more than I’ve ever been given in this situation before. It’s best to solve the problem, then debrief and analyze later.
Eventually you guys will need to do more than post to this blog. I bet there are lots of DH customers who don’t know about the blog or the status page and are still confused.
September 13th, 2005 at 4:18 pm
This whole situation reminds me so much of my good old days of helping run an ISP. Ah, what fun. Everything is down, phone ringing off the hook, and how can you stop to post status when you are literally putting out fires and fixing things as fast as you can type?
I hope you guys enjoy these days, and look back on them fondly someday when you have boring desk jobs that pay real money. =)
September 13th, 2005 at 4:54 pm
I don’t think anybody here is suggesting we wanted a minute-by-minute report on the status page, just some more specific info and an update more frequently than once every 8 hours. (like, every 4 hours?) Actually, I found it to be not a very good use of MY time to have to continually check the blog to see if there was anything new.
There’s a happy medium in there somewhere. Challenging to find, for sure, but worth looking toward in the event of (goddess forbid) future disasters.
September 14th, 2005 at 12:06 am
[...] poor dreamhost…. (an excellent webhosting company by the way, that i’d recommend!) as craig pointed out… i would not want to have been in their shoes http://blog.dreamhost.com/2005/09/13/at-least-we-ate-lunch-beforehand/ [...]
September 14th, 2005 at 4:09 am
I think the info you guys put out is great. It’s so nice to actually read what sounds like the efforts of sushi filled folks trying to get things up and running than some generic ‘apologies for the inconvenience’ crap!
September 14th, 2005 at 10:08 am
Site & Email Downtime
My site and email were both down for a few hours last night, and I think my email is still down. Apparently there were some power outages in LA yesterday that affected Dreamhost. My xtra-rant email is down, but if you need to get ahold of me use my g…
September 15th, 2005 at 11:30 am
[...] So anyway, you can read all about that here, if you’re so inclined. And not to go all poster child or anything, but one more reason Dreamhost rocks is because their blog and their newsletters are consistently entertaining. I’m just saying. They’re also letting me give you sweet discounts on stuff. Like now, for example, if you decide to sign up for a hosting package at any level, and you want to pay monthly, type in code LSWSUF and I’ll waive your $50 setup fee. How cool is that? Very cool. [...]
September 17th, 2005 at 6:35 am
[...] If you checked out the top search queries on Technorati the past week you probably saw DreamHost in the top five for a few days. It was a little weird to see them above Hurricane Katrina for a day. There was a power outage in Los Angeles that affected their data center building. My websites were down for about half a day. You can read about what happened on their blog. [...]
October 20th, 2005 at 10:16 pm
[...] The power went out in our host provider’s facility - which wasn’t their fault - way back on 9/12 and I was impressed with the grace and skill they displayed getting it all back up again. I was happy with DH before, but now I’ve got a manly / tech crush on the lot of them. [...]
December 23rd, 2005 at 2:01 am
The level of redundancy is average. IMHO, in order to be able to tout “there are few buildings that rival the amount of redundancy and investment which has been put into … backup power systems” you need to be on three power grids. Above average is two. My home office has about as much redundancy as the description of the DH data center. Most companies I’ve worked for offer the same as described.