Another Anatomy
April 7, 2008 on 12:23 pm | In Foobars, Hardware, Insider View, New Features by Josh Jones | 122 Comments
Okay, nothing silly this time, I promise…
Some of you may have noticed that we’ve been having a problem that is, although maybe not the worst in DreamHost history, definitely in the top 5.
There has been a DreamHost Status post about it, but it’s been going on so long, there obviously needs to be more said.

The History
The events that conspired to cause this horrible performance for everybody in our “blingy” cluster actually started to take root 19 months ago.
That was when I made this post asking our customers for some suggestions on storage. I made the mistake in that post of mentioning the name of one particular storage vendor who apparently does a search for their name in rss feeds of all kinds of blogs. I won’t mention their name again here, to test if they REALLY read this blog, but they were the one on the list right after “Netapp”.
Anyway, immediately a sales guy from there was hounding me about how great their product was. It would have super-duper reliability, super-duper performance, and super-duper ease-of-management. It was super-duper expensive compared to our current solution (about 3x the price per GB), so in the end I declined.
But, over the next year he kept hounding me and hounding me, and eventually the price came down to something in line with our current costs, so we decided to try one unit for our new cluster, “Blingy”. After we were satisfied with our internal testing, Blingy went live with the new storage solution in December 2007.

Smooth Sailing
At first, everything was fine, performance was great, everybody was hunky and dory. But then, as usage started to go up, the new file system started acting up. Around the same time every night, the system would stop responding to NFS requests for a while, which would immediately break web and mail service for everybody in the entire cluster.. thousands of customers.
Our Bad
Now, it can be a big mistake to put live customers on any new system. But honestly, we’d tested it lots, researched it a ton, and we added people very slowly at first, and it performed great.
Our biggest mistake I believe had nothing to do with what specific vendor or hardware we went with.. it was simply putting so many eggs in one basket!
Even with our Netapps (which are pretty much awesome), there are problems from time-to-time. However, a typical hosting cluster will have a dozen or so Netapps, which means any problems are one twelfth as big.
With Blingy, EVERY customer is on this one “mega” filer, which in theory should make for better performance, reliability, and ease of management. And since we got the clustered solution (in an active-active configuration)… there really is no single point of hardware failure in this thing.
But, as it turns out, there are a lot of non-hardware failures in the world.
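To put the “eggs in one basket” point in numbers, here is a minimal sketch. It assumes customers are spread evenly across filers and that filers fail independently; the function name is made up for illustration.

```python
def affected_fraction(filers: int, failed: int = 1) -> float:
    """Fraction of customers hit when `failed` filers have a problem.

    Assumption (not from the post): customers are spread evenly
    across `filers` independent filers.
    """
    if filers < 1 or not 0 <= failed <= filers:
        raise ValueError("need filers >= 1 and 0 <= failed <= filers")
    return failed / filers


# A typical cluster with a dozen Netapps versus Blingy's single "mega" filer.
print(f"{affected_fraction(12):.1%}")  # 8.3%
print(f"{affected_fraction(1):.0%}")   # 100%
```

With one filer, any problem at all is a problem for everybody, no matter how redundant the hardware inside it is.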
Their Bad
Well, the techs at the vendor couldn’t figure out what was causing the NFS freezing, and so they recommended us doing a major OS upgrade to hopefully fix it.
During this whole time, the fiber channel disks were slowly filling up, and we’d been trying to move large files off to the SATA pool (it’s a two-tiered solution, and there’s a feature that automatically moves less-accessed data to lower tiers).. however the thing couldn’t move the data fast enough. It couldn’t finish doing a “move job” in a single day, and every day it’d sort of “crash”, which would screw up the move job, and nothing would get moved.
Also, as the disk kept getting more full, performance kept getting worse, creating a vicious cycle. We ordered some more fiber channel disk shelves at the end of February to grow the main FC volume, since we couldn’t get things off to SATA, and it was supposed to come on March 10th and be installed at the same time as the major OS upgrade.
However, the disks didn’t end up getting installed until March 25th, and at that point it turned out we could NOT grow the FC volume with these disks (well, it was technically possible, but their on-site techs recommended VERY VERY heavily against it.. it would severely impact performance), which was sort of the whole point. So now we had a new FC volume which we still had to migrate users to.

Your Bad
Of course, this whole time, new customers just kept signing up, and being added to Blingy. What were you guys thinking?
By this point we knew this was a bad idea, but we didn’t have a new cluster ready (we’d expected Blingy to grow for another couple of months), and we try to never ever grow old clusters again once they’ve been “shut off” from new signups (because in time they stabilize and have very few problems).
However, the moving people off to the new FC vol, or the original SATA vol, or even the new Netapp we also added to Blingy, just wasn’t happening fast enough. So on April 2nd we bit the bullet and switched Blingy off as the “new customer” cluster and started growing good old “Postal” again. Once we did that, we were finally able to get ahead of the curve and total usage on our first fiber channel volume has been slowly dropping ever since.
We tried at that point to contact the vendor to see if we could just get more drives that WOULD allow us to grow fcvol1, but they said their manufacturers were closed for inventory for a week after the end of the quarter and we couldn’t get anything until Friday, April 11th at the absolute soonest. Later they said they could find us some they could get us by Tuesday, April 7th, and we preliminarily said we’d take them.
This whole time we had a support ticket open with the vendor about the crashes (the OS upgrade didn’t fix it), and finally on April 3rd we received notice that they’d fixed the bug that they believed was causing it! However, the patch still needed to go through their “QA”. Finally, this Sunday April 6th they said it was all ready to be deployed, so last night we did.
What Now
Well, right now, performance is still not great on fcvol1… but mail and web should be pretty much working. One thing we’ve noticed is a website that hasn’t been visited in a long time will have a big lag still upon the first visit.. but then subsequent reloads/visits seem much faster.
At least the total disk usage is coming down now, and hopefully by tomorrow it’ll be below 85% which is supposedly a magic number where performance is fine. We’re going to keep off-loading it until things are great, though. We’ve got plenty of disk space for it, the problem is just it takes so long to move it.
We also I guess will find out tonight if the NFS freezing bug is fixed by this new patch. Hopefully so.

It’s Too Late…
I realize this is probably too little too late for many of you, but I just wanted to sincerely apologize for this whole big Blingy cluster-f*ck. Also, if you’re on Blingy (you can tell from the panel by clicking “account status” and looking at “Your Email Server”), we’d like to offer you a month’s worth of hosting credit.
To get it, all you need to do is contact support from our panel and make the subject of your message “Blingy Account Credit”. That’s all you have to do, and we’ll credit everybody who asks (and is actually on Blingy!) next Monday (April 14th).
Read This Now!
February 23, 2007 on 5:38 pm | In Foobars, Hardware, Insider View, Updates by Josh Jones | 32 Comments
Quick, before it’s gone!
If you enjoy all the hilarious hijinks, illuminating illustrations, and jovial jokes of the DreamHost Blog, you better suck down a local copy TODAY…
We’re having a planned power outage tomorrow night!
(Click that link for some more details.. it’ll be from 11:15pm PST (GMT -0800) tomorrow night (Saturday) to hopefully much less than 5 hours from then.)
Not planned by us though, planned by our building. It would have been very nice if they could have given us a little earlier heads up, or avoided the outage at all, but no, they just can’t. And trust me, we want this to happen even a tiny bit less than you do!
So, this site will be down then, as well as all other DreamHost services, with the exception of ns2.dreamhost.com and dreamhoststatus.com, which are kept off-site for exactly this sort of situation.

Well, I just thought I better post something about it here too.. thanks for your understanding, and we’re really really really really sorry.
P.S. Here’s the pic the building emailed us of the problem:

So, um, yeah. I think what that shows is a piece of metal is vibrating next to that wire and cutting into the rubber insulation… and if it gets much further in, KABOOM!
Some Late Night Moves!
January 25, 2007 on 6:10 pm | In Foobars, Funnyish, Hardware, Insider View, Updates by Josh Jones | 36 Comments
Last night we made some moves.
Patrick and I moved about 60 servers!
And I only dropped one! (Sorry about that, bomberman.)
It took about two hours, and here we are, wrapping things up:
Stage 2? At 12:30 in the morning, after moving 60 servers, what else could we possibly want to move for a STAGE 2?
Hmm… something about “Brea”?

We passed this car in the parking lot.. and soon, we were at the OTHER DreamHost office.
We waited.. THE CON WAS ON!
Patrick had told Pete (who lives right by the office) that he was just in the area, at 1:30am on a Wednesday, and I wanted him to pick up some WWF glasses for the downtown office. But, HAD PETE PLAYED US FOR FOUR FOOLS?!
Apparently not….
We made short work of the coveted sign.

And then I decided to go raid the kitchen…. WHAAaaaa!!!!
Yes, very funny, Brea. But who’s wearing the cool shades now?!

As long as we were there, we thought we might as well have some fun…
And some more fun…
We took our time. We even checked out the Official DreamHost Museum!
Why hello there, Señor Corona, you sure are working late tonight!

Of course, we couldn’t just leave those poor, unsuspecting Breaites bare-walled!
Around 3:30, we were back “home.” Mission complete. Tired. Satisfied. Ugly.
The neon sign was finally where it has always been destined to be. Down in our NOC. The HEART of DreamHost.

Epilogue…
New Dream Resolutions
January 3, 2007 on 6:53 pm | In Business, Hardware, Insider View, New Features, Promotions, Rants by Josh Jones | 178 Comments
Happy New Year!
The snow’s not even dry on the rooftops of LA and we here at DreamHost already have a pile of resolutions for, as the cool sports video gamers call it, the 2K7.
In 2007 we do solemnly resolve to:
#1. Never get involved in a land war in Asia.
#2. Never go in against a Sicilian when death is on the line!
#3. Become once again renowned the Web-over as a stable, reliable, robust, high-performance webhost!

As those of you who’ve been playing the home game know, we had some troubles in 2006.
But actually, the root of those troubles began WAY back in June 2005, when the building our data center is in informed everybody they were unable to provide any new UPS and generator-backed power, period.
Moving data centers wasn’t really doable back then, and so for the next year or so we were forced into “low-power mode” .. scrapping our Dedicated Servers option and squeezing every last bit of power efficiency we could from our operations, even at a fair amount of expense.

Somehow, we kept going. And going. And going. And gahhhh, you get it.
And really, our service didn’t suffer for it.
But then, exactly one year ago today, something changed that seemed to affect our reputation for the worse ever since.
We started giving away a lot more disk and bandwidth. Like 8 times. As. Much.

That’s when things went downhill.
Well, not really.
In fact, we had exactly the same amount of problems (actually less, per customer!) we’d had the last eight years, but now finally people could put their finger on a REASON for them!
We were overselling!
Clearly, every problem we had stemmed from the simple fact that we gave away too much disk and bandwidth!
Well, I’ve already covered “overselling” plenty, and ALL the quota increases really did was increase the number of new customers we got!
But still, all through 2006 our rep seemed to slowly decline.
Every time we had a server crash; “Overselling.” A network fubar, “They’re overselling.” A panel bug: “Didn’t your mama ever teach you about overselling?” A power outage? “Oh yeah, sign up for DreamHost if you happen to like a fresh bunch of OVERSELLING!!!”
Of course, the power outages didn’t help. Nor did the weird problem between our two core routers that made our entire network suck eggs for six weeks this summer.
But in a way, those power outages were perhaps a blessing in disguise. A disguise that reminded me of a big mob of angry customers.
Those outages forced us, and our building, to really DO something about the power situation… which as you may recall is the real foundation for any stability problems we’ve had in the last 12 months.

After the power outages this summer, the building started literally BLEEDING data center tenants, figuratively.
This had two effects. Ichi, it forced them to start taking their UPS and generator problems seriously, and as of now they actually seem to have things in order. In fact, believe it or not, just TODAY the building experienced a power outage from DWP… and for the first time ever we were not affected at all!
Memo
DATE: January 3, 2007
TO: All Garland Tenants
FROM: Timothy J. Moore
RE: DWP Power Outage Today
This is to advise you that at approximately 11:50am today, the Garland Building received a power outage from the Department of Water and Power. The outage lasted less than one minute and all systems worked according to design.
The Building’s loads were transferred to the Emergency Generator System. Upon stabilization of the DWP service, all ATS Switches transferred back to normal DWP power.
Should you have any questions, please do not hesitate to contact the Office of the Building.
Sincerely,
The Happy Office of the Building Team
Oh BOY was I ticked OFF when I saw how they stole our signature signature!
Ni, by bleeding those tenants, a lot of power was freed up for us! And by us I mean you! LITERALLY.
Also, we were now of a size (thanks, ironically, to our generous bandwidth and disk allocations!) that expanding to more than one data center was finally feasible.
So, this fall we expanded to two more facilities, with dark fiber connections between all three.
With all this new power available, we were finally able to spend more on hardware! So we did, and have been, and are, and will be! We’ve put many many fewer users per web server, mysql server, and mail server, added load balancers, beefed up our network equipment, and have added new targets (that we now have the power to attain) for server stability.
In fact, we spent over ONE MILLLLION DOLLARS on hardware in November alone! That’s more than we normally spend in a whole quarter! And in fact, things are quantitatively more stable now across our whole system than they’ve ever been in the past.
But our reputation as an “overselling host” seems to linger!
How can we fix it? Aren’t people just going to notice things are a lot better? And start telling their friends?
Won’t they just believe this blog post?
Probably not. It’s a stumper!
Fortunately, I pulled deep into my master-of-public-relations pouch, and pulled out this gem of wisdom:
People aren’t going to consider us a “stable” host until we offer LESS DISK AND BANDWIDTH!
But…ARGH! More disk and bw => more sign ups => more money => more resources => better service!
What to do?
Fortunately, I have a master-of-marketing pouch too (double-major).. so here’s what we’re doing:
Every day, starting tomorrow, the amount of starting disk and bandwidth we offer new customers (this does not affect existing customers at all!) will drop. You can see the amounts here.
(Don’t worry, once you sign up, your disk and bandwidth allocations will grow weekly just like before!)
And we’ll keep dropping them daily until our precious rep is restored!
(Or it cuts into our sign-ups too much.)
(Whichever comes first.)

(Reputation be damned.)
Ask DreamHost Customers
August 25, 2006 on 10:44 am | In Business, Hardware, Insider View by Josh Jones | 61 Comments
I’ve got a question.
And I thought, who better to ask, than everybody?
Here goes…
We’ve got pretty serious storage needs. Like, in the next year, we’re estimating needing about 250TB (big T, big B) of additional centralized, networked, storage.
Besides needing a lot, we also need very high performance, redundancy, and thrift.

We want it ALL!
Our absolute requirements for our system are as follows:
* RELIABILITY .. we can never ever lose any data, ever.
* PERFORMANCE .. we need something that can serve approximately 3000 NFS ops per second per TB. (See spec.org). (It needs to do NFS.)
* PRICE .. that’s the whole reason we’re looking.
We’ve got a good system of RELIABILITY and PERFORMANCE already.. but the cost per usable GB is $10. The main problem is the 300GB Fiber Channel drives we use, which are $800 each. Is there anything out there that can do the same but with SATA drives that cost more like $100? Even if we needed twice or four times as many drives for the same performance and reliability, it seems possible!
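For anyone who wants to check the math, here is a rough back-of-the-envelope sketch using only the figures quoted above. Two things are assumptions rather than anything stated in this post: 1 TB is counted as 1000 GB, and a SATA drive is taken to hold the same amount as an FC drive so that drive counts can be compared directly. The function names are made up for illustration.

```python
# Figures quoted in the post.
TB_NEEDED = 250            # additional storage wanted over the next year
OPS_PER_TB = 3000          # NFS ops per second per TB
COST_PER_USABLE_GB = 10    # dollars, current system
FC_DRIVE_PRICE = 800       # dollars per 300GB Fiber Channel drive
SATA_DRIVE_PRICE = 100     # dollars, "more like $100"

GB_PER_TB = 1000           # assumption: decimal terabytes


def required_nfs_ops(tb: int) -> int:
    """Total NFS ops/sec the whole system would need to serve."""
    return tb * OPS_PER_TB


def cost_at_current_price(tb: int) -> int:
    """Dollars to buy `tb` of usable storage at today's $/GB."""
    return tb * GB_PER_TB * COST_PER_USABLE_GB


def sata_drive_cost_ratio(drive_multiplier: int) -> float:
    """SATA drive spend relative to FC when SATA needs
    `drive_multiplier` times as many (equal-capacity) drives."""
    return drive_multiplier * SATA_DRIVE_PRICE / FC_DRIVE_PRICE


print(f"{required_nfs_ops(TB_NEEDED):,} ops/sec")   # 750,000 ops/sec
print(f"${cost_at_current_price(TB_NEEDED):,}")     # $2,500,000
print(sata_drive_cost_ratio(2))                     # 0.25
print(sata_drive_cost_ratio(4))                     # 0.5
```

So even at four times the drive count, the drives themselves would cost about half as much under these assumptions. This only compares drive prices; controllers, shelves, power, and space are not in the numbers above.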
There are also some REALLY WANT TO HAVE features, though possibly could be passed up if the top three are satisfied.
* SNAPSHOTS .. automatic versioning backups of all files by the OS. We’ve got this now, in a hidden “.snapshot” directory in every folder.. check it out!
* USER QUOTAS .. really, with the amount of space we’re giving out these days, quotas are almost a moot point. They’d be nice to have though.
* HIGH DENSITY/LOW POWER .. it’s always a plus to fit more storage in the same amount of space with the same amount of power, but it’s not really that big a deal.
* RAID 6 SUPPORT .. it’s cool.
Here are some vendors/solutions we’re considering..
* NETAPP .. what we use now.
* BLUEARC
* ONSTOR
* PANASAS
* CORAID (Open Source, ATA over Ethernet.. intereesssssting!)
* LUSTRE (Open Source, Clustering storage)
Soo.. basically, if people could post their suggestions, experience, other solutions, etc.. in the comments, it would be much appreciated. Not to mention you will be doing your patriotic duty to improve your hosting forever!

And remember, we’re talking serious NFS ops.. and we’d be willing to buy 256TB at once if our tests showed this system can do what we want and THE PRICE IS RIGHT!
Reconstruction Efforts
August 11, 2006 on 11:01 am | In Foobars, Hardware, Insider View, Updates by Josh Jones | 46 Comments
Well, things could be worse.
We’ve pretty much got our whole network under control now.. the ongoing problem mentioned last post was finally figured out by Cisco support. It turns out it was a bug (er, an “undocumented feature”) in IOS dealing with how they learn MAC addresses.
There was also another network problem we got resolved yesterday that was causing general slowness on web and mail servers. It’s complicated (i.e. I don’t understand it exactly myself), but in the end we took a distribution switch out of the network and that fixed it.
We still have one open ticket with Cisco for our core routers having some HSRP problems. It doesn’t seem like that’s having any real effect on our network, but we want it fixed!
We are also installing two new Ciscos to offload the BGP duties from the core routers so they’ll just have to handle switching. This set-up should be able to handle about 300% more traffic than our entire network now pushes at peak times!
Thanks to these network problems being resolved, we’ve also begun re-deploying in Alchemy, who at least didn’t have the second power outage.
We’re also still in the process of getting real UPS power on our network cabinet, plus our internal databases and a few internal servers. Basically, everything that keeps all the customer mail, web, database, and file servers from coming right back up quickly should there ever be another outage.

So, um.. that’s how it stands now! We hope this will all soon be nothing more than a long bad dream (that was real).
The Expert Speaks!
May 25, 2006 on 4:52 pm | In Business, Foobars, Funnyish, Hardware, Tech News by Josh Jones | 25 Comments
Last week, as I was lazily thumbing through the May 2006 issue of AmericanWay, the AWARD-WINNING in-flight magazine of American Airlines, a familiar face caught my eye.

Who WAS that cheery-eyed elf peering back at me from the upper-right hand corner of THE EXPERT SPEAKS?
I knew I’d never met the man, and yet I’d seen his face many times.. and yet I also knew he wasn’t famous. Where was he from? I couldn’t quite place it. It finally hit me when I read the introductory text… of course!
It was none other than CEO and publicity-hound extraordinaire of our favorite competitor CI HOST, C.F.! (One must never actually type his full name, lest he suddenly appear in a flash of smoke, lawsuit in hand.)
Great! So good to see other hosting guys moving up in the world!
I practically quivered in anticipation of the gleaming nuggets of insight soon to be bestowed upon me!
My practical quivering was soon rewarded as I came across this beautiful passage:

(Personally, I prefer my UMPCs with Nintendo DS.)
Ode to Destro
April 15, 2006 on 11:18 am | In Funnyish, Hardware by tavis | 12 Comments
Destro was the first server at dreamhost. One could say it’s the father of dreamhost. Here is my Ode to Destro. Pardon the production quality, I did this from my bed this morning. I used my own CRM method (Cyclic Record Mix) to accomplish this, no editing software needed.
Like hammer for sidekick.
March 21, 2006 on 6:12 pm | In Funnyish, Hardware, Insider View by tavis | 49 Comments
We go through a lot of sidekicks and some never get returned for one reason or another. We had some fun with one of those un-returned sidekicks the other night.
Sneaky Changes Afoot..
March 16, 2006 on 5:12 pm | In Hardware, Insider View, New Features by Josh Jones | 36 Comments
We’re making some changes to the way we do MySQL…
One of the most popular suggestions we have is to be able to have multiple databases per hostname. Along with that, people want to be able to use the same MySQL username to access multiple databases. To top it off, they are even so bold as to want one phpMyAdmin management area for all the databases on their account!
This hasn’t been possible with our system since very near the beginning, because our customer database servers are separate from our web servers. That’s why you can’t use “localhost” to connect to your database. It is not why your database may sometimes seem slow. The reason for that (if it happens) is that the database server itself is overloaded (or maybe DNS was messed up). And actually, if it had been the same physical machine as your web server, that would have also meant all the websites on that server would be slow too (not just the ones accessing that database server). SO THERE!
But that’s not really why we kept our database servers separate from our web servers. The main reason is we mount the file system on our web servers over NFS, and that is just noooo good for MySQL performance. The other reason is we do performance tweaks on our MySQL servers that we wouldn’t want to do on Apache servers.
Now, when you’ve got your database servers separated from your web servers, you’ve got to have some way to determine which database server new databases go on.
The simplest thing, I guess, would be to just assign database servers to particular web servers.
If we’d just done that, everything people are asking for now would have always been possible!
But, there is one teeensy drawback with doing it that way. It’s not the most efficient use of hardware.
And when you’re a self-funded, completely independent, low-price, big-feature web host, efficient use of hardware is pretty much the difference between driving a ferrari and driving a limo. Let me explain (not the “driving a limo” part)…
Here are the three essential facts for your consideration:
1. Each database server has a maximum number of databases it can support.
2. Customers continue adding databases gradually for the life of their account with us.
3. Moving databases between servers causes some downtime and is a big pain in the admin’s behind.
So our solution to these three facts was to just make a current “active database server.” Any new MySQL created by a customer, regardless of their home web server, went on it, and when it filled up, we just added a new one.

That works great, and is pretty much the maximum efficiency you can get in terms of hardware use.. at any given time you only have one non-full database server, and it’s in the process of filling up pretty darn fast!
The only thing that doesn’t work great is one person’s databases are most likely spread across multiple servers… which means you need a separate hostname for each one, and most importantly, you can’t do all those things you Happy Customers clamor so loudly for! It also means if one database server poops its chute, a huge swath of customers are affected instead of just the ones on web servers tied to that database machine.
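Here is a minimal sketch of the “active database server” idea, to show how one customer’s databases end up scattered. The class name, the tiny capacity, and the customer names are all hypothetical; this is an illustration of the scheme as described above, not our actual code.

```python
from collections import defaultdict


class ActiveServerAllocator:
    """Put every new database on the one 'active' server; when it
    fills up, add a new server and make that one active."""

    def __init__(self, capacity: int):
        self.capacity = capacity            # max databases per server (fact 1)
        self.counts = [0]                   # databases per server; last is active
        self.placement = defaultdict(set)   # customer -> servers holding their dbs

    def create_database(self, customer: str) -> int:
        if self.counts[-1] >= self.capacity:
            self.counts.append(0)           # active server is full: add another
        server = len(self.counts) - 1
        self.counts[server] += 1
        self.placement[customer].add(server)
        return server


alloc = ActiveServerAllocator(capacity=3)   # hypothetical, tiny capacity
order = ["alice", "bob", "alice", "bob", "alice"]
print([alloc.create_database(c) for c in order])  # [0, 0, 0, 1, 1]
print(sorted(alloc.placement["alice"]))           # [0, 1]
```

Only the last server is ever non-full, which is the hardware efficiency. But alice, who kept adding databases over time (fact 2), now lives on two servers and needs two hostnames.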
But still, it was the cheapest and easiest way to do things!
However.. we’re rich now and we like a challenge!
So, we’re in the process of changing our system to start assigning each new customer to a database server for life! What this means though, if we don’t want to move databases later (see fact 3 above), is we have to essentially “cut off” a database server before it’s “full” (see facts 1 + 2). And that means add another database server sooner than we would otherwise have to.
But it also means all your databases will be on the same server, which means soon you WILL be able to manage them all from one hostname (and one username). It will also mean there’ll be less chance of your databases being affected should a random mysql server have problems. In fact, it also also means if you’re a heavy MySQL user and causing problems for your server, you’ll mostly be affecting your own performance and not too many unsuspecting neighbors!
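A rough sketch of the new scheme, under the same caveats: the class name and the `capacity` and `cutoff` numbers are hypothetical, and how early to cut a server off is a tuning choice this post doesn’t specify.

```python
class PinnedAllocator:
    """Assign each new customer a home database server for life.

    A server stops taking NEW customers at `cutoff` databases, below
    its real `capacity`, so existing customers have room to grow.
    """

    def __init__(self, capacity: int, cutoff: int):
        if not 0 < cutoff <= capacity:
            raise ValueError("need 0 < cutoff <= capacity")
        self.capacity = capacity
        self.cutoff = cutoff
        self.counts = [0]   # databases per server; last one takes new customers
        self.home = {}      # customer -> home server index

    def create_database(self, customer: str) -> int:
        if customer not in self.home:
            if self.counts[-1] >= self.cutoff:
                self.counts.append(0)       # cut off early, add a server sooner
            self.home[customer] = len(self.counts) - 1
        server = self.home[customer]
        if self.counts[server] >= self.capacity:
            # Headroom ran out; avoiding this is the point of the cutoff.
            raise RuntimeError("home server full; a move would be needed")
        self.counts[server] += 1
        return server


alloc = PinnedAllocator(capacity=4, cutoff=2)   # hypothetical numbers
order = ["alice", "bob", "alice", "carol", "alice"]
print([alloc.create_database(c) for c in order])  # [0, 0, 0, 1, 0]
```

All of alice’s databases stay on server 0, so one hostname and one username can cover them. The price is that server 1 was added while server 0 still had a free slot, which is the “more expensive” part.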
It’s more expensive, but you guys are worth it!
