Another Anatomy
April 7, 2008 on 12:23 pm | In Foobars, Hardware, Insider View, New Features by Josh Jones |
Okay, nothing silly this time, I promise…
Some of you may have noticed that we’ve been having a problem that, although maybe not the worst in DreamHost history, is definitely in the top 5.
There has been a DreamHost Status post about it, but it’s been going on so long, there obviously needs to be more said.

The History
The events that conspired to cause this horrible performance for everybody in our “blingy” cluster actually started to take root 19 months ago.
That was when I made this post asking our customers for some suggestions on storage. I made the mistake in that post of mentioning the name of one particular storage vendor who apparently searches for their name in the RSS feeds of all kinds of blogs. I won’t mention their name again here, to test if they REALLY read this blog, but they were the one on the list right after “Netapp”.
Anyway, immediately a sales guy from there was hounding me about how great their product was. It would have super-duper reliability, super-duper performance, and super-duper ease-of-management. It was super-duper expensive compared to our current solution (about 3x the price per GB), so in the end I declined.
But, over the next year he kept hounding me and hounding me, and eventually the price came down to something in line with our current costs, so we decided to try one unit for our new cluster, “Blingy”. After we were satisfied with our internal testing, Blingy went live with the new storage solution in December 2007.

Smooth Sailing
At first, everything was fine, performance was great, everybody was hunky and dory. But then, as usage started to go up, the new file system started acting up. Around the same time every night, the system would stop responding to NFS requests for a while, which would immediately break web and mail service for everybody in the entire cluster.. thousands of customers.
Our Bad
Now, it can be a big mistake to put live customers on any new system. But honestly, we’d tested it lots, researched it a ton, and we added people very slowly at first, and it performed great.
Our biggest mistake I believe had nothing to do with what specific vendor or hardware we went with.. it was simply putting so many eggs in one basket!
Even with our Netapps (which are pretty much awesome), there are problems from time-to-time. However, a typical hosting cluster will have a dozen or so Netapps, which means any problems are one twelfth as big.
With Blingy, EVERY customer is on this one “mega” filer, which in theory should make for better performance, reliability, and ease of management. And since we got the clustered solution (in an active-active configuration)… there really is no single point of hardware failure in this thing.
But, as it turns out, there are a lot of non-hardware failures in the world.
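To put rough numbers on the eggs-in-one-basket point, here is a minimal sketch. It assumes customers are spread evenly across filers, which is a simplification; the function name is made up for illustration, and the only figures taken from above are “a dozen or so” filers versus one.

```python
def affected_fraction(num_filers: int, failed_filers: int = 1) -> float:
    """Fraction of a cluster's customers hit when `failed_filers` filers
    misbehave, assuming customers are spread evenly across all filers
    (an illustrative assumption, not a measured distribution)."""
    if num_filers < 1:
        raise ValueError("a cluster needs at least one filer")
    if not 0 <= failed_filers <= num_filers:
        raise ValueError("failed_filers must be between 0 and num_filers")
    return failed_filers / num_filers


# A typical cluster with a dozen filers versus one "mega" filer.
typical = affected_fraction(12)
mega = affected_fraction(1)
print(f"dozen filers: {typical:.1%} of customers affected by one bad filer")
print(f"single filer: {mega:.1%} of customers affected by one bad filer")
```

With a dozen filers, one bad filer touches about one twelfth of the cluster; with a single filer, any software problem touches everybody, no matter how redundant the hardware is.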
Their Bad
Well, the techs at the vendor couldn’t figure out what was causing the NFS freezing, and so they recommended that we do a major OS upgrade to hopefully fix it.
During this whole time, the fiber channel disks were slowly filling up, and we’d been trying to move large files off to the sata pool (it’s a two-tiered solution, and there’s a feature that automatically moves less-accessed data to lower tiers).. however the thing couldn’t move the data fast enough. It couldn’t finish doing a “move job” in a single day, and every day it’d sort of “crash”, which would screw up the move job, and nothing would get moved.
Also, as the disk kept getting more full, performance kept getting worse, creating a vicious cycle. We ordered some more fiber channel disk shelves at the end of February to grow the main FC volume, since we couldn’t get things off to SATA, and they were supposed to arrive on March 10th and be installed at the same time as the major OS upgrade.
However, the disks didn’t end up getting installed until March 25th, and at that point it turned out we could NOT grow the FC volume with these disks (well, it was technically possible, but their on-site techs recommended VERY VERY heavily against it.. it would severely impact performance), which was sort of the whole point. So now we had a new FC volume which we still had to migrate users to.

Your Bad
Of course, this whole time, new customers just kept signing up, and being added to Blingy. What were you guys thinking?
By this point we knew this was a bad idea, but we didn’t have a new cluster ready (we’d expected Blingy to grow for another couple of months), and we try to never ever grow old clusters again once they’ve been “shut off” from new signups (because in time they stabilize and have very few problems).
However, the moving people off to the new FC vol, or the original SATA vol, or even the new Netapp we also added to Blingy, just wasn’t happening fast enough. So on April 2nd we bit the bullet and switched Blingy off as the “new customer” cluster and started growing good old “Postal” again. Once we did that, we were finally able to get ahead of the curve and total usage on our first fiber channel volume has been slowly dropping ever since.
We tried at that point to contact the vendor to see if we could just get more drives that WOULD allow us to grow fcvol1, but they said their manufacturers were closed for inventory for a week after the end of the quarter and we couldn’t get anything until Friday, April 11th at the absolute soonest. Later they said they could find us some they could get us by Tuesday, April 7th, and we preliminarily said we’d take them.
This whole time we had a support ticket open with the vendor about the crashes (the OS upgrade didn’t fix it), and finally on April 3rd we received notice that they’d fixed the bug that they believed was causing it! However, the patch still needed to go through their “QA”. Finally, this Sunday April 6th they said it was all ready to be deployed, so last night we did.
What Now
Well, right now, performance is still not great on fcvol1… but mail and web should be pretty much working. One thing we’ve noticed is a website that hasn’t been visited in a long time will have a big lag still upon the first visit.. but then subsequent reloads/visits seem much faster.
At least the total disk usage is coming down now, and hopefully by tomorrow it’ll be below 85% which is supposedly a magic number where performance is fine. We’re going to keep off-loading it until things are great, though. We’ve got plenty of disk space for it, the problem is just it takes so long to move it.
I guess we’ll also find out tonight if the NFS freezing bug is fixed by this new patch. Hopefully so.
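For a feel of the off-loading math, here is a rough sketch. The 85% target is the “magic number” mentioned above; the capacity, usage, and move rate in the example are invented placeholders, not Blingy’s real figures, and the function names are hypothetical.

```python
from typing import Optional


def data_to_offload(capacity_tb: float, used_tb: float,
                    target_fraction: float = 0.85) -> float:
    """Terabytes that must leave the volume to get usage down to the target."""
    target_used = capacity_tb * target_fraction
    return max(0.0, used_tb - target_used)


def days_to_target(capacity_tb: float, used_tb: float,
                   net_move_tb_per_day: float,
                   target_fraction: float = 0.85) -> Optional[float]:
    """Days until usage reaches the target, given the NET daily outflow
    (data moved off minus new data written). Returns None if the net
    outflow is zero or negative, i.e. the volume never catches up."""
    if net_move_tb_per_day <= 0:
        return None
    remaining = data_to_offload(capacity_tb, used_tb, target_fraction)
    return remaining / net_move_tb_per_day


# Placeholder numbers: a 10 TB volume that is 90% full, shrinking by
# a net 0.25 TB per day.
print(data_to_offload(10, 9))        # TB still to move
print(days_to_target(10, 9, 0.25))   # days at that net rate
```

The net rate is the part that matters: as long as new data arrives as fast as old data is moved off, the volume never gets below the target, which is why taking Blingy out of the “new customer” rotation was what finally let usage start dropping.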

It’s Too Late…
I realize this is probably too little too late for many of you, but I just wanted to sincerely apologize for this whole big Blingy cluster-f*ck. Also, if you’re on Blingy (you can tell from the panel by clicking “account status” and looking at “Your Email Server”), we’d like to offer you a month’s worth of hosting credit.
To get it, all you need to do is contact support from our panel and make the subject of your message “Blingy Account Credit”. That’s all you have to do, and we’ll credit everybody who asks (and is actually on Blingy!) next Monday (April 14th).
122 Comments »


Nice to hear that you guys are on top of it. Blingy IS a big filer, running df shows 13 TB of storage. Incredible! That FC/SATA offloading system sounds pretty cool too.
Comment by Maynard — April 7, 2008 #
I sure would like to know who that storage system vendor is.
Comment by Lorenzo — April 7, 2008 #
I am really glad you guys are looking into this issue. I am a new customer, and this experience did leave a bad taste in my mouth. I was afraid I would have to just put up with these issues, but I am glad I made the right choice by going with dreamhost!
Comment by David Castellani — April 7, 2008 #
@Lorenzo: The vendor is BlueArc.
http://blog.dreamhost.com/2006/08/25/ask-dreamhost-customers/
Comment by Chris Benard — April 7, 2008 #
Thanks for the update Josh - now when are we getting next newsletter?
Comment by Hosting Blog — April 7, 2008 #
Seriously, when’s that newsletter coming? I think I’m going to open a support ticket about it.
Comment by Maynard — April 7, 2008 #
What about other customers who were affected due to Blingy? I have experienced almost a week of “Network Timeout”s!
Comment by Humble — April 7, 2008 #
A smarter person would have read the post before asking a stupid question.
If you’re on Blingy, you get credit.
If you’re not, you weren’t affected by it.
Comment by T1 — April 7, 2008 #
Damn, I was going to give you some good-natured crap but then you have to go and offer account credits. I can’t even tease you without feeling bad now.
Here’s hoping that the next three months are better than my first three months with y’all.
Comment by Dustin — April 7, 2008 #
Why not just GIVE everyone on Blingy a credit automatically. That is the right thing to do.
Comment by Joe — April 7, 2008 #
That’s only the right thing to do if you’re located within Stupidville city limits, but that’s not where their datacenter is.
Comment by T1 — April 7, 2008 #
Do customers on spunky get anything?
Comment by Ted — April 7, 2008 #
Sounds good. My friend was on blingy, I got in before him so I wasn’t. I thought it was odd that his site was always way slower than mine, so he’ll be happy to find out.
Comment by Tim — April 7, 2008 #
Spunky was nothing compared to Blingy. Get real.
Comment by T1 — April 7, 2008 #
So, are you going back to NetApp or will now try someone new again?
Just curious
Comment by Ted — April 7, 2008 #
“and we try to never ever grow old clusters again once they’ve been “shut off” from new signups (because in time they stabilize and have very few problems).”
LOL
Comment by G M — April 7, 2008 #
@16: note “few,” not “none.”
Comment by humblefool — April 7, 2008 #
Thank you for your honesty, openness, and professionalism.
Your cock is so big.
Comment by Josh's Cock Hurts Me But I Love it Anyway — April 7, 2008 #
@18: Pics or it didn’t happen.
Comment by lulz — April 7, 2008 #
Shall I say it?
Google App Engine anyone?
Comment by Tim — April 7, 2008 #
I like the tshirts :D
good to see we’ll be getting credit back and also good to see an explanation
Comment by Ken — April 8, 2008 #
Can we get the T-shirt instead of the credit?
Comment by Lee — April 8, 2008 #
Is there any chance that all of this filer action has somehow affected spunky? I haven’t been able to access mail on that cluster for the past ~24 hours. I’ve sent in two trouble tickets, but have had no response.
Comment by Brent — April 8, 2008 #
I think I was one of the first 25,000 DH customers and stuck with them through a lot. I finally had enough about six months ago and have been slowly moving away. For the last few months, I’ve only used DH for DNS (very reliably, I might add) and thought I’d check this blog out now that I’ve finally changed all of the DNS information in my domain registrations and am deleting the last of my domains from the DH control panel.
This post gives me no regrets about the decision to move about 250 domains to my own dedicated servers, which running in my office on a T1 are faster and more reliable than DH.
But, one good thing about DH is that they are less expensive than my current backup DNS provider. I’d love to use DH for secondary DNS if I didn’t have to manually update the information in the control panel. If you guys could just pull my zone files, I’d continue to be a DH customer for another 10 years.
Comment by The Captain — April 8, 2008 #
I want that shirt — talk about turning lemons into ice cold sweet lemonade!
Comment by Jonathan — April 8, 2008 #
@24 (The Captain)
Bingo!
That’s pretty much the future of DH. With Amazon, Google (soon) and others entering the hosting market with a no-oversell-like-hell business model and offering a very good and reasonably priced service, DH is doomed to play in the minor leagues as a cheap and unreliable service… only suitable for backups and secondary services.
I’m also moving away my domains from Blingy (you guys can smoke the free month, thank you). But I’ll still use my account for svn, DNS and other services.
Regardless of what happens with Blingy (which I seriously doubt will be running smoothly anytime soon) I wouldn’t recommend DH for hosting serious stuff from now on. The rules of the game have changed…
Comment by Blingycopter — April 8, 2008 #
so after this hell of troubles you still sell accounts? and call that ‘Your (us) bad?’
the only good thing in DH, they honor their 97 days money back.
damn it
Comment by damnit — April 8, 2008 #
Yeah.. I’d rather get a t-shirt too.. :-)
Comment by Jon — April 8, 2008 #
What amazes me is that 90% of the people who bitched on the dreamhoststatus blog won’t understand 5% of what’s explained here.
Comment by Howie — April 8, 2008 #
This is a pretty poor attempt at waving off a lot of pretty serious malpractice as a hosting provider. I have since moved on to a new hosting solution because of the constant email and website failures and delays, though I doubt I will ever see any sort of refund (much less the silly tee-shirt). The newsletter was even more pathetic for trying to laugh off the miserable support and failures. (And by the way, why should I have to ‘do’ anything to get the free month? Don’t you owe at least that much for all the screw-ups? Comes off as nothing more than another veiled attempt at sincerity.)
Seriously, you should get out of this business. Or better, someone should force you out for incompetence.
Comment by Christopher Murray — April 8, 2008 #
What I can’t understand is why you guys didn’t stop new signups from going to Blingy months ago! I mean, there’s problems with it, you knew this, and you knew that adding additional stress/data to the filer would cause more problems.
However, you guys STILL left the new signups going to Blingy and caused yourselves even more problems because of it… I mean seriously, why don’t we be a little smarter next time?
Comment by Charlie — April 8, 2008 #
Umm, wow. Definitely interesting reading. I just moved to DH (don’t know if I’m on blingy though, too recent for a refund anyhow, lol). At least the attitude here is much better than the last shared hoster I was at, which would communicate what I guess was their displeasure by doing “chmod 000” commands at random files in my account and see how long it would take me to notice. Think my home page was down for a couple months…
Comment by Erik Anderson — April 8, 2008 #
I’ve got several sites on blingy, DH came highly recommended from a guy I really trust so I didn’t even research it, of course it wasn’t helping that DH’s site is a constant sale banner. Well anyway my sites don’t require much bandwidth because we only deal with a few clients, but we definitely needed a host company that dealt in hundreds of Gb not Mb, well long story short DH’s flexibility and space has kept me so far, but if my sites aren’t running smooth on my 95th day I’m gonna be disappointed but definitely have to leave. ALTHOUGH… I think one of those spring break shirts is in order compliments of DH if I stay!!
Comment by Andrew — April 8, 2008 #
Waaaaaaaaaaaaaaaaaaaaaaaaaa!
Comment by T1 — April 8, 2008 #
I learned a few things in the week that I fought to keep my sites and email online.
1.) You can host your email with Google (Gmail). For free. It’s even integrated into the Custom MX record menu item in the mail menu.
2.) I made a bad assumption that DH had their stuff on track when I signed up in Jan 2008. I read old posts on DH downtime and slow servers, but I figured they were past that. I get in just in time for the billing SNAFU. Then the blingy SNAFU. I’m sure it’s an innocent glitch, but I can’t roll the dice with my clients’ moneys. You know?
It makes me sad because I really want to stay, but I just can’t do it in good faith. I took out a 256 slice with SliceHost a few days ago, and I’m setting it up to run a handful of low-traffic sites. My Fast-CGI Rails blog won’t suck so bad either because I’m going to front it with Mongrel on my slice.
Comment by Barrett — April 8, 2008 #
To offer the amount of services that DreamHost does at their low prices, they probably can’t afford to have much extra capacity lying around. Unfortunately, that means that when there is trouble they don’t have extra servers to pick up the slack. It would be nice if they could get their hardware a few months ahead of demand, but overall they still do a good job.
Comment by bryan — April 8, 2008 #
How about concentrating on correcting the problems and spending less time on cute newsletters and smart ass comments. I’m losing face with my clients, haven’t got a single reply to the ticket I opened and going to leave your service very soon (and demand a FULL refund).
What a joke
Comment by Angelo — April 8, 2008 #
I’m still there :P I’m not paying my bill this month unless you want me to? :P However I’m still hanging in there, I have many other servers and if you are down or too slow it just goes on another server (redundancy is the key…)
However do I need to pay my bill? LOL
Comment by ETechno — April 8, 2008 #
Guy Angelo grow the f@#k up dude and stop demanding shit, you don’t deserve nothing like that.
Comment by ETechno — April 8, 2008 #
Angelo — “Losing face with clients”, you are dumb enough to host your clients site on ANY economy hosting solution? Ever see a 99.9% guarantee here?
Yeesh.
Comment by James — April 8, 2008 #
I certainly am not surprised by all this shit. I have had numerous problems over the last year. If anyone had paid any attention to the number of complaints coming from Blingy customers something would have been done last year…………
Comment by Jane — April 8, 2008 #
I would like to say that there were 2 problems here.
1. was the failure of blingy, which, although absurd in its proportions, was in some ways understandable as it was a combination of human and machine error.
2. The complete failure of adequate information via tickets, support and status updates for 3 weeks. I was out of (and still lack) stable email. I lost a client. I didn’t switch because I was left in the dark hoping. But as a strategy, this is pure human and DH error. I have used DH for 6 years, and I have never seen such poor service or response.
Truth be told, the second failure is much more offensive to me, and I can only say it deeply saddens me that a host such as DH, which is relatively cool, could not handle itself at least a little better. If I had had 1/10th the amount of information Josh just posted, I would have been much more patient and understanding.
Please consider this failure of communication both as machine and DH error, and hire more support staff.
nron
Comment by nron — April 8, 2008 #
I appreciate that you are so open about this episode. However, the most important thing will be to learn from this horrendous experience.
You can’t put so much blame on the vendor. After all, vendors will be vendors! All pushing their product, offering guarantees and great service….just like you guys to us !!!!
Yes you need to remain price competitive, but not at the expense of losing customers!
One suggestion for you is to have different levels of service - bronze, silver, gold or standard / premium, etc…. Bronze would be your current shaky offering, or specifically certain clusters. Silver or premium would then be a less risky setup with more stability. For that you could charge an extra buck or two per month, or whatever.
Good luck…but please don’t leave it to luck, do more thinking as a team. Maybe you should also set up a user group to help with say 10-20 representative users - by representative I mean not just nerds but real business people?!! :o)
Comment by cavehomme — April 8, 2008 #
Still not fixed. 12 s for a single web page. 27 s for a simple perl script. This was now. When is it fixed?
Comment by Mr J — April 8, 2008 #
I like how everyone gives pats on the back and laughs like we’re all friends. I don’t know any of you guys, I don’t know anyone at DreamHost, and they’re a business. They provide a service rather unreliably and should take some responsibility for their product. I could really give two snaps about what the problem with “Blingy” was, but I do care about why it took weeks on end to fix it. If you can’t solve your own tech problems then contract it out - this isn’t the Stone Age and things shouldn’t take more than 24 hours to be fixed. If it’s that bad throw the servers out and get new ones the same day. It’s called a ‘business expense’ to those running serious businesses.
Comment by EJH — April 8, 2008 #
I love dreamhost with all of my heart. You guys need to understand that generally you get what you pay for and dh by no means should be deployed for a mission critical application, but for personal blogs and other web fun. If you are planning on hosting digg on a any shared services, you may want to reconsider. But for everything else it gets the job done cheap!
Comment by dhfan — April 8, 2008 #
DH couldn’t organize a fuck in a whorehouse or a piss-up in a brewery.
DH tres suxor - dey is in yur servers fuckin up yur shit.
Comment by DHater — April 8, 2008 #
this is why you need a PROPER storage vendor… IBM/NetApp for everything.. eggs in a basket i know, but they do know their stuff.. I do know that EMC, have and do actually page on call engineering teams (the guys who actually CODE the guts of the various arrays they sell) no matter what time of the day it is, to get a workaround/fix implemented. microcode can be modified ‘on the fly’ to change values and non-user settings to alleviate performance issues if the code is misbehaving, till a permanent fix is released…. I’ve been there, done that, got the boxers with the skid marks due to threats coming from upper management if the problem isn’t fixed in a timely fashion.
Comment by anonymous — April 9, 2008 #
Instead of giving credit, you could speed us up by lowering the maximum number of customers on blingy and dividing what is left over among the users.
Right now we are getting more time on a slow server rather than a faster server!
Also I do not think that blaming the troubles on the vendor is completely justified here. As the sysadmin you should have known better. Stop apologizing and give us something real!
Comment by daniel — April 9, 2008 #
Been with DH for nearly 2 years, lovin it all the way! I’m on the Barqs server and I’ve had like 6 hours down time that I’ve actually noticed in nearly 2 years.
LOL’ing @ all you bitchers, whinge whinge whinge “I’m losing face with my customers” bitch bitch bitch “I’ve no email in 24hours” If you’re losing face with your customers, try running your business properly then…
To Josh and crew, keep going guys, My crew & I are with you all the way.
Comment by Knightrous — April 9, 2008 #