The Official DreamHost Blog! Tales From the Inside!

Um, Whoops.


The $7,500,000 finger.

Hello.. how’s your morning going?

I hope it’s been a little better than mine.

We had a teensy eensy weensy little billing error last night… my first clue something was up was when I saw this morning’s daily billing report (so far): $7,500,000.

It turns out due to my excessively fat fingers, nearly every one of our customers has been seriously over-billed in the last 12 hours.

I bet when you read this part of the last newsletter:

4. New Office!

Another important thing I’ve been doing instead of writing newsletters
is looking out the window of our NEW OFFICE:

http://blog.dreamhost.com/2007/12/21/were-so-high-right-now-you-dont-even-know

If your next web hosting bill from us is mysteriously tripled, now you
know why.

.. you thought it was a joke!

Ha, the joke is on you! I guess. Um, okay, no, not really, I’m sorry.

How on earth could something like this happen?

Let Me Explain

A couple of weeks ago, just around new years, we started beefing up some of our internal “controller” servers. These are the machines that run all of our “behind-the-scenes” services; things from adding a user to registering a domain to configuring apaches to rebilling customers.

I was on a little-bit-too-long vacation, but when I got back, I noticed our daily credit card payments seemed a tad low in the new year.

So, late last week I tried re-running the billing services for all the days back three weeks or so. I knew this was safe, because after 10 years, the one thing you DO get perfect is your billing system. Our biller is pretty bug-free and robust at this point, because we’d be broke and eating bugs if it weren’t.

In fact, it’s so robust you can just run it on any day you want, and it’s safe. It won’t double-charge people and it’ll even automatically find any missing charges and catch everything up to the day you said.

Anyway, I ran it, and things were fine.. and sure enough, it caught a lot of missed payments. I didn’t have time to look into it right then, but I made a note to myself to check up on it on Monday (yesterday) and see if things were fine or still messed up.

And a terminal case it is.

Come Monday

Monday came. I checked the reports and sure enough, things were still pretty low. So I looked at the logs for some of the biller services, and I noticed they were only failing on the machines that had been recently upgraded!

That explained why we were getting some money still (since not all the controllers have been upgraded yet), but not all of it.

Anyway, it turned out there was no 64 bit version of the PFProAPI module we use to interface to the credit card transaction server. No big deal, there’s a new module that interfaces with their new and preferred https interface, and it was only a couple of lines of code to change to get us switched over!

So anyway, I made the change, and it worked, and I even tested it, and things were fine!

But then… late last night, I realized: when I re-ran those biller services last week, they must not have fixed everybody then either! It’s just that by running it again I randomly got different people being charged on the working controllers who had been assigned an upgraded (and therefore broken) one before.

So why not just run it all one more time?

Sure, it should be no problem! So I did, manually running the biller (which is normally automatically scheduled) for 2008-01-14, 2008-01-13, 2008-01-12, 2008-01-11, 2008-01-10, 2008-01-09, 2008-01-08, 2008-01-07, 2008-01-06, 2008-01-05, 2008-01-04, 2008-01-03, 2008-01-02, and 2008-01-01.

I probably should have just stopped there. But then I thought better. I thought to myself, “When did we start upgrading these controllers anyway?”

I couldn’t remember. But, since the biller is super-safe and robust anyway, I went ahead and ran it for 2008-12-31, 2008-12-30, 2008-12-29, 2008-12-28, 2008-12-27, 2008-12-26, and 2008-12-25, just for the hell of it.

Notice Anything?

Don’t feel bad if you didn’t. I kind of missed it myself.

THOSE SHOULD HAVE BEEN 2007!!

Heh, uh.. um, er.. my bad?

So what happened?

Well, that super-robust and stable biller did what it was programmed to do, it ran as though today was December 31st, 2008!

And what did it see? Well, it saw a whole lot of accounts (essentially all of them) who for some unknown, mysterious reason hadn’t been charged at all for eleven and a half months!

So off it went, busily through the night, “fixing” everything up for “today”, December 31st, 2008.

Really, it’s sort of amazing this never happened before in the last ten years.

We have a NEW SUPPORT RECORD!

There IS a bug here.

I can imagine the half second or so of thought that sprinted through the programmer’s mind when he was adding the ability to allow you to pass in what day to run the biller as though today is:

Hmm.. well, I could see us POSSIBLY wanting to be able to bill for a future date.

Well guess what… NO! We will NEVER want to rebill as though today were a day that hasn’t happened yet! But instead, somebody along the line (Sage? Me? Somebody else?) figured, “What’s the harm in keeping it flexible?”

About $7,500,000 in harm, that’s what!

The serious part.

The end to this story is that of course, I’m very very sorry, we’re very very sorry, and I’m sure you’re very very sorry this happened. I really am. I understand the sort of problems that an unexpected large charge to your credit card (or worse yet, your debit card) can cause. If the tone of this blog post seemed a little light, I apologize; I don’t mean to offend, and I realize how serious an issue this is. I’ve been up since 3:50am trying to undo the damage and maybe I’m a little shell-shocked.

A new service is running right now (in parallel on all the controllers) that fixes all those future charges, re-enables your account if it was erroneously suspended, and if your credit card was automatically rebilled, refunds the payment automatically. You don’t have to contact us or your bank, and you’ll get an email when your account is finished fixing up. It’s going to take several more hours to complete. There are (or were, after this incident) a lot of you these days!

If, because of this billing mistake, you somehow incurred some fees from your bank or credit card company, please let us know after tomorrow (today we are just replying to all 10,000+ billing messages with a generic explanation) and we’ll do our best to make it right for you.

And of course, the biller no longer allows dates in the future.
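A guard like that can be tiny. Here is a minimal sketch in Python, assuming a hypothetical `parse_billing_date` helper that vets the "run as though today is" argument before the biller ever sees it. The function name and the YYYY-MM-DD format are assumptions for illustration, not the actual DreamHost biller code.

```python
from datetime import date


def parse_billing_date(text, today=None):
    """Parse a YYYY-MM-DD 'run as of' date, refusing any day that hasn't happened yet.

    Hypothetical helper for illustration; `today` is injectable so it can be tested.
    """
    today = today or date.today()
    as_of = date.fromisoformat(text)  # raises ValueError on malformed input
    if as_of > today:
        raise ValueError(f"refusing to bill as of {as_of}: it is after today ({today})")
    return as_of
```

With something like this in place, asking for 2008-12-31 on the night of 2008-01-14 would fail loudly instead of quietly billing everybody eleven and a half months ahead.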

The moral of this story is that “flexibility” is rarely desired in programming! The less a program will accept/the less a program will do/the fewer options and preferences it has, the more usable it is/the more understandable it is/the more stable it is.

Tough Love

I wouldn’t want him to compile me!

When designing a program, you’ve got to make some tough decisions .. and when you really can’t decide if this is something your users will need someday, err on the side of leaving it out.

Otherwise, your users will someday err on the side of your face.

Filed Under: Foobars, Insider View, Musings

Schadenfreude


I feel SOOOOO bad for them.

Almost exactly a year ago today, DreamHost experienced its last unplanned power outage.

Last ever?

Last ever so far! Who knows what the future holds? (Besides me.)

But for now, I’m just glad the present has been a little better for DreamHost customers than for 365 Main’s!

Because in case you hadn’t heard or noticed, power outages in San Francisco today caused downtime at Craigslist, Technorati, TypePad, LiveJournal, Yelp, RedEnvelope, and more!

San Francisco in August, 2007.

Who here is glad DreamHost is in sunny, safe, earthquake, mudslide, forest fire, riot, tsunami-free, Los Angeles now? And who here is publicly enjoying that 365 Main is not?

Here’s a big hint: he’s really good looking and wrote this post.

Of course, the real reason we had no problems is not because our data center is finally super reliable, or that Los Angeles itself never has so much as a cloudy day, or even that we’re just lucky.

It’s because I am in Chicago at HostingCon and so am temporarily unable to break anything.

Of course, that’s not really true either. I’m not in Chicago; as everyone knows, I’m a compulsive liar. In fact, this statement is a lie.

But, even if I was at hosting con (and everybody knows we don’t go to hosting conventions), my ability to break DreamHost systems knows no boundary of time or space, and strikes at any time, usually without warning and definitely without mercy.

Why were we spared this time?

The honest truth is that any data center can, at any time and for any reason, no matter what precautions they take, have an outage! You’d think making a reliable data center would be a lot easier than making a reliable software service, seeing as how it’s all just power cables, air conditioning, and gasoline.

And yet somehow, it seems like all even the best and most expensive data centers can do is make the outages a little less frequent.

"Jem" is for "Josh: Everybodys Master".

What IS a poor host to do?

Nothing, really.

I mean, the only way you can really achieve “five nines” uptime is by having an entire architecture designed around the assumption that ANYTHING can fail… and at the worst possible time. Duh.
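For a sense of how little slack “five nines” leaves, here is a quick back-of-the-envelope sketch in Python. The function name is made up for illustration, and it assumes a plain 365-day year.

```python
def downtime_minutes_per_year(nines):
    """Allowed downtime, in minutes, for an availability of `nines` nines (5 means 99.999%)."""
    minutes_per_year = 365 * 24 * 60  # 525,600, ignoring leap years
    unavailable_fraction = 10 ** -nines
    return minutes_per_year * unavailable_fraction


for n in range(2, 6):
    print(f"{n} nines: {downtime_minutes_per_year(n):,.1f} minutes/year")
```

Five nines works out to a bit over five minutes of downtime in an entire year, which is why a single botched reboot can blow the whole budget.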

However, like most Las Vegas escorts, that sort of redundancy does not come easily. Or cheap. And the truth of the matter is unless you’re Google, most likely an entire day of downtime once a year is not going to cost you as much as it would to truly prevent it.

In fact, I wish there were some low-reliability data centers out there! I bet if somebody made an ultra low-cost data center, one that provided “adequate” cooling, network, and power capacity, but no UPS, fire-suppression, generators, crazy physical security, or extra earthquake protection, they would clean UP.

They could probably charge around half of any data center I’ve ever seen, and I bet with only twice the downtime… and that would be very appealing.

I mean, think about it… how many of you could deal with an extra day of downtime per year for half the price? Heck, you’d probably be fine with FOUR days of downtime a year if it meant 75% off.. but would you pay double to save 12 hours of downtime a year? Would you pay FOUR times as much to save 18? Eight times as much to save 21?

That’s pretty much how it works, and I’m guessing not a lot of you would.
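The numbers above follow a simple rule of thumb: every doubling of price halves the downtime. A rough Python sketch of that rule follows. The 24-hour baseline is an assumption read off the figures in this post, not a measured number, and the names are invented for illustration.

```python
BASELINE_HOURS = 24.0  # assumed: one day of downtime per year at the normal price


def downtime_hours(price_multiplier):
    """Yearly downtime if every doubling of price halves it (the rule of thumb above)."""
    return BASELINE_HOURS / price_multiplier


for multiplier in (0.25, 0.5, 1, 2, 4, 8):
    hours = downtime_hours(multiplier)
    saved = BASELINE_HOURS - hours
    print(f"{multiplier}x price: {hours:g} hours down, {saved:+g} hours saved vs. baseline")
```

Under that assumption, paying double buys back 12 hours, four times buys back 18, and eight times buys back 21, while 75% off costs you four days a year in total. Each extra dollar buys less and less uptime.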

Of course, maybe I’m over-estimating the cost savings of skimping on redundancy in a data center a little, and maybe I’m under-estimating the reliability hit a tiny bit. On the other hand, my blog posts have never been wrong before.

The more wires, the more porn.

AND, if somebody did come out with a “Crap-of-the-Art” data center, it’d make it a lot more feasible for those who really need reliability to get two; thereby keeping all their company’s eggs out of one risqué basket.

In fact, what we’ve been doing over the last year is breaking our system down into smaller and smaller isolated “clusters,” and distributing them between three data centers (all in LA). The idea being, data centers will go down.. let’s at least try and keep the eggs in our other baskets un-scrambled. And since we’re not really counting on much reliability from them anyway, it sure would be nice if those data centers all charged a lot less!

Of course our network still has a single (though redundant) point of failure, but we are working towards eventually making each data center a complete stand-alone “node”… some day.

This day, however, I think I’ll just go to bed… while taking pleasure in the fact that it was somebody else this time!

Filed Under: Business, Foobars, Insider View, Musings, Tech News

Super Lame Apology


I am really bearly sorry!

We are all really bearry sorry about the extended downtime this Sunday from the planned power outage!

The power was only out for about an hour, but as it came back on, there was trouble, trouble, trouble. Our router started acting funny, some file servers were mis-configured, some web servers didn’t want to come back on, and so on, and so on, and so on.

Although most things were back up and running within the five hours, the network in general was still flakey for about 8 hours, and everything wasn’t TOTALLY fixed for about 36 hours.

We really thought things would go a lot smoother, given that for once we had some advance warning, but good old Murphy was in full effect, y’all, again.. urgh.

Anyway, to try and make up for it a little bit, we thought we’d offer something we’ve never offered before at DreamHost, something we thought we’d never need, something we always thought a little silly… an SLA!

Even GOD is sorry!

That’s right, I’m offering you a… Super Lame Apology!

HA ha ha! Oh, did you think I meant a “Service Level Agreement”?

But really, isn’t that all a typical SLA is?

“We’re sorry we broke our promise, here’s credit for the 46 minutes you were down. Sorry.”

Lame!

In web hosting, it’s usually a credit for the exact amount of time you were down, sometimes a full day’s worth, or I guess if you are really paying a lot, a month’s worth.. though an SLA like that even in the high-end business world would be a rare animal indeed.

Animal ate it! Sahhhhry!

In the case of the outage this past weekend, if you were paying $8.95 a month you were down for anywhere from 6 to 44 cents worth of service. What would you think to yourself if we automatically credited you 44 cents on your next monthly bill?

You’d probably think either:

A. Is this 44 cent credit because February only had 28 days?

or

B. My site is down for hours and all I get is 44 cents?! That barely pays for the stamp I’m going to need to mail my foot all the way up your butt, DreamA$$Host!!

In fact, even if we gave you a full month’s credit, $8.95, you’d probably think the same thing. Either A. you didn’t really care, and the money doesn’t matter, or B. you really did care, and the money doesn’t matter.
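For the curious, here is roughly where those cents come from: a minimal pro-rating sketch in Python. It assumes a 30-day billing month and uses the 5-hour and 36-hour figures from the outage description above. The function name is hypothetical, not anything from our actual billing code.

```python
HOURS_PER_MONTH = 30 * 24  # assumed 30-day billing month


def prorated_credit(monthly_price, hours_down):
    """Dollar value of the service lost during `hours_down` hours of downtime."""
    return monthly_price * hours_down / HOURS_PER_MONTH


for hours in (5, 36):
    cents = prorated_credit(8.95, hours) * 100
    print(f"{hours} hours down on an $8.95 plan: about {cents:.1f} cents")
```

That comes to roughly 6 cents at the short end and a bit under 45 cents at the long end, which lines up with the range quoted above.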

The truth is though, we do offer an “SLA”… the same “service level agreement” you’ll find at McDonald’s, Nordstrom’s, Staples, or just about any other successful business. If any customer ever comes to us with even an eighth-way legitimate gripe, we’ll do our best to fix it, even if it means giving them an account credit or their money back (even after our 97-day money-back guarantee period). Better to lose a customer on good terms than on bad, eh?

Oscars, Naked American Idols, DreamHost.

So, if we’ll happily give refunds anyway, why not go ahead and lay it all out in a “real” SLA?

I guess mostly because we feel they’re B.S. Case in point, we actually have SLAs from our data centers! Which is why I sleep so well at night, knowing our servers are safe and sound. HA!

Not only do they fail to meet the SLA, I believe we’ve never gotten a single service credit out of them for outages… and I’ve asked!

The only useful thing you can get out of an SLA is the ability to break a long-term contract without penalty. All you really want is for everything to just work. If you’re constantly having to exercise your SLA, you’d trade all the service credits in the world for a new provider!

If that’s not the case, you don’t really care about the downtime and are just complaining to get the money! Shame on you! Go back to fatwallet.com where you come from! Hissssss!

All I’m saying is, since we’re in an industry with such a low barrier to entry, and since there’s nothing stopping you from switching hosts at any time, we really already have a lot of incentive to make our service as good as we can.

I know we fubar it sometimes, and I know we fubar it a lot, and when we do, you guys are doing the right thing by bitching and moaning and even quitting us. But a service level agreement wouldn’t change a thing.

So, so-o-o-o-o-o-o-orry!

And that’s the Super-est, Lame-est, Apology-est SLA you’re going to get!




Filed Under: Business, Foobars, Insider View, Musings, Rants, Updates

Read This Now!


Just in time for the Academy Awards!

Quick, before it’s gone!

If you enjoy all the hilarious hijinks, illuminating illustrations, and jovial jokes of the DreamHost Blog, you better suck down a local copy TODAY…

We’re having a planned power outage tomorrow night!

(Click that link for some more details.. it’ll be from 11:15pm PST (GMT -0800) tomorrow night (Saturday) to hopefully much less than 5 hours from then.)

Not planned by us though, planned by our building. It would have been very nice if they could have given us a little earlier heads up, or avoided the outage at all, but no, they just can’t. And trust me, we want this to happen even a tiny bit less than you do!

So, this site will be down then, as well as all other DreamHost services, with the exception of ns2.dreamhost.com and dreamhoststatus.com, which are kept off-site for exactly this sort of situation.

Off-site. WAY off-site.

Well, I just thought I better post something about it here too.. thanks for your understanding, and we’re really really really really sorry.

P.S. Here’s the pic the building emailed us of the problem:

Well, that clears THAT up!

So, um, yeah. I think what that shows is a piece of metal is vibrating next to that wire and cutting into the rubber insulation… and if it gets much further in, KABOOM!

Filed Under: Foobars, Hardware, Insider View, Updates

Some Late Night Moves!


Leaning tower of Pizza box servers.

Last night we made some moves.

Patrick and I moved about 60 servers!

And I only dropped one! (Sorry about that, bomberman.)

It took about two hours, and here we are, wrapping things up:

Stage 2? At 12:30 in the morning, after moving 60 servers, what else could we possibly want to move for a STAGE 2?

Hmm… something about “Brea”?

I don't get it..

We passed this car in the parking lot.. and soon, we were at the OTHER DreamHost office.

We waited.. THE CON WAS ON!

Patrick had told Pete (who lives right by the office) that he was just in the area, at 1:30am on a Wednesday, and I wanted him to pick up some WWF glasses for the downtown office. But, HAD PETE PLAYED US FOR FOUR FOOLS?!

Apparently not….

We made short work of the coveted sign.

But is it art?

And then I decided to go raid the kitchen…. WHAAaaaa!!!!

Yes, very funny, Brea. But who’s wearing the cool shades now?!

This man switched my neon sign! While wearing cool shades.

As long as we were there, we thought we might as well have some fun…

And some more fun…

We took our time. We even checked out the Official DreamHost Museum!

Why hello there, Señor Corona, you sure are working late tonight!

There was one guy working late.. Mr. Corona!

Of course, we couldn’t just leave those poor, unsuspecting Breaites bare-walled!

Around 3:30, we were back “home.” Mission complete. Tired. Satisfied. Ugly.

The neon sign was finally where it has always been destined to be. Down in our NOC. The HEART of DreamHost.

Because data center power just grows on trees!

Epilogue…

Filed Under: Foobars, Funnyish, Hardware, Insider View, Updates

No Run-Of-The-Mill Week, This!


First Allen Iverson, now Martin Luther King, Jr.?!

MONDAY!

We started the week off by having a little MLK, JR. weekend sale on our web site. Apparently, some of our affiliates hate civil rights because we got some angry emails that our little MLK, JR. day stunt was stealing their referrals by putting a promo code of our own right on our website!

They have a good point. Why would somebody use the promo code they were given when they get to the website and see a better one in a pop-up window? All these affiliates are working super-hard in the hopes of some good PayPal lovin’, only to find out the very company they’ve been shilling to all their cronies has turned around and STABBED THEM IN THE FACE!

Et tu, Joshé?

Now, why so ever would we do something like that? Of course we don’t want to hurt our affiliates! So why steal their referrers in this way? Especially when our promo code was more than $97, it actually costs us more to “steal” these people about to sign up with some other promo code.

The strange, but true, but unbelievable, but really honestly true, but you-can’t-comprehend-it, but it’s for reals, truth is we’ve found that putting a pop-up promo code like that on our main site actually helps all signups, ACROSS THE BOARD.

We’re not sure why ourselves, but we get more no-promo-code signups, more affiliate-promo-code signups, and of course, more DreamHost-promo-code signups whenever we have that pop-up there! And, it seems to also have no residual effect on signups on other days afterwards.

Maybe people just feel some loyalty to their original promo code. Maybe they don’t care. Maybe they just don’t see or read the pop-up. Maybe seeing that code gets everybody in some kind of weird web-hosting frenzy and they just decide to sign up, whether they use the code or not! I know, it’s CRAZY, but that’s why I like marketing!

So basically, please please PLEASE trust us, affiliates! We try our best to only take actions in everybody’s best interests, and we’ve got nothing to gain by “stealing” your referrals!

TUESDAY!

Brett was checking the public voicemail when he came across this little gem apparently from a phone number in British Columbia.

Now, most of us didn’t really think that was the scariest bomb threat we’d ever heard. But, it was decided to notify our main data center building anyway. They then called the LAPD, and so a lot of Tuesday was spent explaining what “we do,” and how we have a “voice mail” in a “computer file”.

Also that “Brea” is a city in the LA area, different from “La Brea”, a street.

P.S. No bomb so far.

P.P.S. And really, what sort of threat gives you a deadline that’s a 48-hour window?

WEDNESDAY!

Ladies, avert your eyes!

Head Honcho Michael is out of town right now, so apparently ANONYMOUS LAZY HAPPY DREAMHOST EMPLOYEE thought it’d be fine to sneak in a little nap on his office couch. And it would have been fine, if he could have only kept his pants on.

(Notice the huge amounts of DreamHost power that permeate the office?)

THURSDAY!

Your charity dollars at work!

Aw shucks! We just got a huge shipment of World Wrestling Fund (or something?) pint glasses and travel mugs as thanks for the generous donation DreamHost and her customers gave a few months ago! Well, we’d split all the glasses in half with you, but I guess we’re just pessimists.

Consider them gifts from you to us for keeping prices down while we REDICLOUS-LY add (and subtract) bandwidth and disk space!

FRIDAY!

Whoops, did I spell something ridiculously wrong up there?

My apologies, I must have just been influenced by this SECRET PACT email I received today from an UN-NAMED WEB HOST!

I’m just trying to get some key players in the industry to agree on a few things:

a. That the current disk space/bandwidth allocations are rediclous
b. To cap them at a specific range (based on price)
c. To create a self-enforcement method for the industry

I was wondering if DreamHost would be interested in joining the discussions. So far, I have spoken to almost every major hosting provider in our segment.

I guess somebody doesn’t read our blog! Doesn’t he know our whole “lowering disk and bandwidth” thing is just a coy marketing ploy?

And in summary, what a wild, zany, not-run-of-the-mill week it’s been!

And, OH, I just remembered.

This IS a run-of-the-mill week at DreamHost!

Busted!

Filed Under: Business, Foobars, Funnyish, Insider View, Musings, Updates

Anatomy of a Disaster, Part 2


For the past several weeks many of you have been faced with slow or unusable websites and email. The original cause of that series of issues was detailed in Josh’s great Anatomy Of An Ongoing Disaster post. The network issue we were left with once the power outage problems were mostly resolved ended up being an especially nasty one. We were essentially caught with our pants down at just the wrong time and we’ve been taking our lumps for it.

Sour Face

The evidence we were seeing all pointed to one of the two routers as the primary troublemaker, so we focused on that one. Configurations were changed with some improvement but without resolving the main issue. Ultimately, 6 separate Cisco support engineers and a Cisco Certified Internetwork Expert were all unable to determine the cause of the errors we were seeing on our routers. That, along with the recent power outages, eventually led everyone to believe there was a hardware fault within the router somewhere. That started our process of replacing and/or upgrading every component. Once that was done and the main problem was still there, we were finally able to pinpoint the point of network congestion and resolve it, and that’s where we are now.

The problem ended up being the connection between the two routers. Our network was set up so one router was primarily responsible for some of our servers, and the other router was primarily responsible for the rest of the servers. Both routers are connected to outside network connections and they share those roles, providing wide-area network redundancy, but the inside of our network (our LAN) relied on both routers working together and passing bits back and forth. Some of you did not experience the problems because all of the servers your service relies on were on the same core router and were not bottlenecked by that inter-router link. Once one of the routers was fully upgraded, we were able to move all traffic to that single router, thereby removing the bottleneck and restoring service completely for everyone.

Our routers were not redundant and that hurt us. If our routers had been redundant we could have much more easily moved all traffic to one router or the other just to eliminate some variables. Having that option would have saved us a lot of time and you a lot of painfully slow service.

Ugly Dog

Establishing Power Redundancy

In searching for this solution we wasted a lot of time uncertain about the integrity of our equipment. Whenever a piece of electrical equipment suddenly loses power there is always a chance of some component failing and when you’re dealing with a device as complex as a router that’s a lot of components to worry about! If our data center’s UPS and generator setup had worked properly and the routers had not lost power, we could have instead focused on the new evidence at hand, confident that nothing else had changed. Knowing that, and knowing the track record of our data center, we are already in the process of adding an additional layer of power redundancy for our most critical (and expensive to replace!) equipment. The DC powered equipment housed in our data center is backed by a secondary UPS system and did not lose power throughout the recent power fluctuations. To take advantage of that ourselves, we are converting to DC power at the core of our network. We have the power supplies sitting and waiting to be installed and we’re currently waiting for the power to be wired into the racks we need it in.

We are also expanding our space in our Alchemy Communications Data Center. Alchemy has set up its own UPS-backed power feed and was not hit as hard by the power outage that took us down. All of our future data center expansion is going into Alchemy.

Big Batteries

Establishing Network Redundancy

Looking back, our worst mistake of this ordeal was allowing our network hardware to end up in a state where we could not redirect all of the traffic to one router or the other. Having that option earlier on in the process would have allowed us to debug the problems more easily and ultimately we would have solved the problem faster. There’s no doubt about that.

When our two current core routers were originally deployed either one of them was able to handle the full load of the network. They were set up to share networking duties and we could have redirected traffic to one or the other if that ever became necessary. Unfortunately, the routers were not upgraded when they should have been and we ended up in a state where one of them was not able to handle the full load of the network. That situation combined with the problems beginning with the power outages led to the nasty network congestion that was difficult for us to diagnose and resolve.

Currently we are using a single router at the core of the network. Every component has been replaced and most of them have been upgraded so it is essentially brand-new and very able to handle our network traffic for the time being. We are in the process of re-establishing core router redundancy now and expect that to be done in the next few weeks. As we proceed into the future we will ensure that one of the two routers is always handling the full load of the network and the second router is standing by idle as a hot spare, should the need for it arise.
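The rule we let slip can be stated as a simple check: with any one router out of service, the remaining capacity must still cover peak load. Below is a minimal sketch of that check in Python. The function name, router names, and numbers are hypothetical, chosen only to illustrate the idea, and are not taken from our monitoring tools.

```python
def survives_single_failure(capacities, peak_load):
    """True if the network can carry `peak_load` with any one router out of service.

    `capacities` maps router name -> throughput it can handle (same units as peak_load).
    """
    if len(capacities) < 2:
        return False  # a lone router has nothing to fail over to
    total = sum(capacities.values())
    return all(total - cap >= peak_load for cap in capacities.values())


# Hypothetical numbers: two equal routers, then one that was never upgraded.
print(survives_single_failure({"core1": 10, "core2": 10}, peak_load=8))
print(survives_single_failure({"core1": 10, "core2": 6}, peak_load=8))
```

Running a check like this whenever traffic grows would flag the moment a "redundant" pair quietly stops being redundant, which is the state we found ourselves in.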

Redundancy

Into the Future

While investigating this issue we have been forced to look more closely at our network than we have in a long time. That has uncovered more issues that may become larger problems for us down the road and we are already working on a large scale network reorganization to both improve overall performance and make network issues easier to detect and troubleshoot. If there’s a silver lining on this dark cloud, this may be it.

Our primary local area network setup is really two separate networks, one for traffic that never leaves our network (the private network) and the other for traffic that does mostly leave our network (the public network). When you access your website, traffic has to go over both the public and private networks (possibly multiple times) before you will see it come up in your browser. During our network problems it was primarily the private network that was responsible for the high server loads, slow website load times, and slow email access.

The first step we are taking to improve our network setup is to completely separate out our private network from the public network. That will immediately reduce the amount of traffic going through our core routers and additionally make it easier to track down problems. More equipment will be involved but network traffic will be more isolated. As part of this process we will also be rearranging network links in as close to an optimal way as possible to further isolate traffic and improve performance. Unfortunately due to limitations in our current network architecture the best we can do is about 30% optimal and it’s likely we will not even do that well.

Less Than Optimal

So, the next step in the process is a complete rethinking of how we have been deploying our servers in our data center. For ease of deploying servers and efficient use of data center space we had architected our network to essentially allow any type of server (web server, email server, file server, MySQL server, etc.) to live anywhere on the network. That sort of setup has worked well for us for a while but we are now starting to see the early signs of network bottlenecks arising. For future server deployments, we will be assigning physical areas in the data center for different types of servers to facilitate a more optimal network layout between them. That will essentially localize the network traffic as much as possible and allow us to continue scaling for quite some time into the future. Overall network flow will be reduced as well, better utilizing the available throughput. This step is currently being planned and will be implemented first for the next set of servers we deploy.

All told we will be investing somewhere in the neighborhood of $300,000 into our network upgrades, not to mention all of the human time involved in planning and implementing these changes. Now that we have gotten this issue behind us we are fully committed and prepared to maintain network stability and do the work needed to improve network performance and continue to scale with our growth.

We are very sorry for all of the headaches this has caused everyone. Believe me, there was no one who wanted this problem resolved more than we did. Providing sub-par service is no fun and isn’t the way we like to spend our time. This problem took longer than it should have to resolve, but coming out of it we are now in a much stronger position as we look ahead.

Filed Under: Foobars, Insider View

Phishing Phor Phishers


Phinding Nemo!

A funny thing happened to me on Tuesday.

Well, really it happened to my wife. But I hear being married is all about sharing.

We had just finished dinner when she casually mentioned we were getting a tax refund.

“Oh?” I responded…

“Yeah, I got an email”

“OH???????”

I immediately had a sinking feeling.. had she been PHISHED?

How aLUREing!

I asked if she’d given her credit card number out?

“Yes.”

Social Security Number?

Yes.

MY Social Security Number?

NO! Sheesh, what do you take me for?!

Which credit card?

Our Visa check card.

Oi! That’s a bad one! I’m not sure what kind of fraud protection we have on it, and it’s tied to our bank account directly!

Before even inspecting the email, I called in and had them cancel the card. Hooray, no charges had gone through yet!

Honey, didn’t I warn you before about PHISHING scams?

Well, yes.. but I forwarded it to you on Monday and you never wrote back! So I just did it.

I never saw that email! (Sure enough.. it was caught in my spam filters. Makes sense!)

Couldn’t you have called me on the phone or even asked me in person on Monday night or Tuesday morning?!

I forgot about it until I checked my email again!

Anyway.. let me see the email you got.

And here it was..

Date: Mon, 28 Aug 2006 11:58:14 -0500
To: joshswife@yahoo.com
Subject: Tax Information – joshswife@yahoo.com – (Code 7863-3843)
From: “IRS.gov”

God bless the IRS!

Account : joshswife@yahoo.com Number : 7863

After the last annual calculations of your fiscal activity we have determined that you are eligible
to receive a tax refund of $191,40. Please submit the tax refund request and allow us 5-7 days in orders to process it.

A refund can be delayed for a variety of reasons. For example submitting invalid records of applying after the deadline.

To access the form for your tax refund, please click here.

Regards,
Internal Revenue Service

Here are the immediate red flags that go off in my head when I get emails like this:

Right off the bat, any email I get from an address I’ve never received one from before has a 99% chance in my mind of being spam, a scam, or some kind of annoyance.

I never get tax refunds! Ever ever ever. It’s not fair.

The IRS and state taxing authorities don’t send notices via email.

The IRS and state taxing authorities don’t have my email address.

They DO have my name and SSN, and would probably put those in an email, IF they had my email address and IF they sent emails.

There’s a typo in the email.. it says “of” where it should have said “or”.

They used a comma instead of a period for the decimal point in the dollar amount! That may fly in Europe, but god bless the IRS, this is America!

The link takes you to thistlejack.com!

But, believe it or not, my wife is not stupid. In fact, she has a PhD from Harvard!

Not my wife.

For real.

Too bad she doesn’t run a web hosting company!

There’s no better training against phishing scams than having dozens of fraudsters a day attempting to send them from your servers!

But for the rest of you LOWLY Internet users, phishing scams work. And I think I know why:

They send a lot of phishing emails.

Just by sending a lot of messages, they’re going to catch a tiny percent of people who were specifically waiting for that email!

Even the almighty Josh nearly fell for an eBay phishing scam once, when I got the phish the very moment I had just won an auction.

And of course, a tiny percent of people are going to go for it even when they weren’t expecting an IRS refund, a PayPal payment, or an eBay auction.

They prey on people’s greed or fears.

To my wife’s credit, (she claims) there were a LOT of red flags and alarms going off in her head while she filled out that form. But the lure of the $191,40 was just too strong!

And we’re rich!

People are getting really comfortable with “e-commerce”.

My wife doesn’t really care too much about giving out her credit card info online. Really, why should she? You’re not generally liable, and we should have the replacement card in the mail tomorrow. I do wish she was a little less comfortable with giving out her SSN though…

The thing is, how often in the real world do you come across an individual or business who is really trying to scam the crap out of you? Hopefully not too often in this country at least. It just doesn’t really happen. But on the Internet, it really does happen. Millions of times per day.

Fortunately, a lot of people are still deathly afraid of this “Internets”, and won’t give out anything to anybody! Or maybe that’s not fortunate.. because really, you’re not generally liable.

People are technically naive.

Honestly, it’s pretty easy to look at a URL and know if it’s legit.

Or is it?

I was trying to explain to my sister-in-law how to know. Basically the best I could do was “If the VERY first part of the URL is the correct domain name, and only the domain name, and doesn’t have a dash or something before it, but it’s okay if it has a dot before it, as long as it doesn’t have a slash before the dot, then it’s the right site!”
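That mouthful of a rule is easier to get right in code than in words. Here’s a minimal sketch in Python, assuming all you want to know is whether a link’s hostname is the real domain or a subdomain of it. The function name and the example URLs are made up for illustration (only irs.gov and thistlejack.com come from the story), and this is just one way to do the check, not anything a browser or mail client actually runs:

```python
from urllib.parse import urlsplit


def is_official_host(url: str, domain: str) -> bool:
    """Return True if the URL's hostname is `domain` or a subdomain of it.

    Only the hostname is inspected. Anything after the first slash (the
    path) or before an "@" (user info) is ignored, which is exactly where
    phishers like to hide a legitimate-looking name.
    """
    # hostname is None when the URL has no scheme/netloc, so fall back to "".
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    domain = domain.lower()
    # "It's okay if it has a dot before it": www.irs.gov passes,
    # fake-irs.gov and irs.gov.example.com do not.
    return host == domain or host.endswith("." + domain)


if __name__ == "__main__":
    # Hypothetical example URLs for illustration only.
    for link in [
        "https://www.irs.gov/refunds",
        "http://thistlejack.com/gallery/irs.gov/form.html",
        "http://irs.gov.example.com/",
        "http://fake-irs.gov/",
        "http://irs.gov@thistlejack.com/",
    ]:
        print(link, is_official_host(link, "irs.gov"))
```

Note that a link without a scheme (just “irs.gov/refunds”) has no parseable hostname here and comes back False, which is the safe default. And of course this only tells you where a link points, not whether the page on the other end is trustworthy.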

In fact, my wife was even like:

Well, I knew thistlejack.com wasn’t irs.gov, but you know how sometimes websites link off to some other server for their payment processing? And when I clicked all the links on the site, they were legit.

Because the links WERE to irs.gov!

Even the fact the page wasn’t secure didn’t faze her!

What was I to do?

I already canceled the credit card. But I wanted more! I wanted to shut this guy down, and I wanted to make sure nothing happened to my wife’s SSN.

First, I did a whois lookup on thistlejack.com and called the owner, Mr. Robert Stirling.

I knew he wasn’t the phisher. Nobody in the US phishes, and nobody uses real contact info when registering a domain for phishing! From the URL, it looked like the phisher had exploited a hole in a photo gallery script Mr. Stirling had installed. (Which is why we have mod_security for our happy hosters!)

Fortunately, he answered the phone.. I explained the situation and he was very, very cooperative and helpful!

He logged in to his domain, took the phishing site down (it’s down now), and then at my request emailed me the source code for their web form. I wanted to see what was happening to the data.

Just as I might have guessed, it was being emailed off to two separate anonymous yahoo.com email addresses.

I immediately emailed abuse@ and postmaster@yahoo.com, got a tracking number back and started waiting. And waiting. (I’m still waiting…)

I couldn’t wait anymore!

I had to do something (besides call the credit reporting agencies and tell them what happened)!

And then it hit me!

Maybe I could fill this jerk’s mailboxes with enough BOGUS DATA that he’ll just give up on it all and not realize that my wife’s info was for reals!

Of course, it wouldn’t be too hard for him to realize all submissions after a certain time were fake.. but hey what did I have to lose?

I took the source code from that script and made up my own that sent an identical email to those two addresses, but with randomly generated info!

In this picture, are you on the left or right? I know that I'M on the left!

It was fun!

I set it up with a cron job to run every 20 minutes (but I put a random sleep of 1-20 minutes at the front so they didn’t come in too regularly).. it’s still going right now.
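For the curious, here’s roughly what that timing works out to. This is a small Python sketch, not the actual script: it assumes cron fires exactly every 20 minutes and that the sleep is a whole number of minutes from 1 to 20, and the function names are made up for illustration. With those assumptions, the gap between two consecutive messages can land anywhere from 1 to 39 minutes:

```python
import random

CRON_INTERVAL_MIN = 20  # cron fires every 20 minutes (from the post)
MIN_SLEEP_MIN = 1       # random sleep of 1-20 minutes before sending
MAX_SLEEP_MIN = 20


def send_times(runs: int, rng: random.Random) -> list[int]:
    """Minute offsets at which each cron run would actually send.

    Run i starts at i * 20 minutes, then sleeps a random 1-20 minutes.
    randint() includes both endpoints.
    """
    return [
        i * CRON_INTERVAL_MIN + rng.randint(MIN_SLEEP_MIN, MAX_SLEEP_MIN)
        for i in range(runs)
    ]


def gaps(times: list[int]) -> list[int]:
    """Minutes between consecutive sends."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    rng = random.Random()  # unseeded, so every run looks different
    times = send_times(6, rng)
    print("send at minute:", times)
    print("gaps in minutes:", gaps(times))
```

The gap is 20 minutes plus the difference between two sleeps, so the shortest case is a 20-minute sleep followed by a 1-minute one (1 minute apart) and the longest is the reverse (39 minutes apart).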

I’m going to keep it going until I hear back from Yahoo!.. and just FYI, here’s the output they were receiving from their phish:

Date: Thu, 31 Aug 2006 16:58:15 -0700 (PDT)
From: thistlej@server4.whmsecure.com
To: phisher@yahoo.com
Subject: IRS – Full

[ . . . : : : IRS FOUNDS : : : . . . ]
Social Security Number: 356 – 00 – 0258
Name On Card: Robert Rieger
Card Number: 6105341453830068
Expidation Date: 12 / 2007
CVV: 123
PIN: 5702
[ . . . : : : IRS FOUNDS : : : . . . ]

(Don’t worry, that’s a fake one I generated!)
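If you’re wondering how a fake one like that gets generated, here’s a minimal Python sketch that builds a record in the same layout as the message above (phisher’s typos and all). It is not the script I actually ran: the name lists, the year range, and the function names are all assumptions made up for illustration, the digits are purely random, and the sketch only builds the text without sending anything anywhere:

```python
import random

# Hypothetical name pools for illustration; any list of names would do.
FIRST_NAMES = ["Alex", "Jamie", "Morgan", "Taylor"]
LAST_NAMES = ["Smith", "Jones", "Nguyen", "Garcia"]

# Banner copied from the phisher's own output, misspelling included.
BANNER = "[ . . . : : : IRS FOUNDS : : : . . . ]"


def digits(rng: random.Random, count: int) -> str:
    """Return `count` random decimal digits as a string."""
    return "".join(rng.choice("0123456789") for _ in range(count))


def fake_record(rng: random.Random) -> str:
    """Build one bogus record in the same layout as the phisher's email."""
    ssn = f"{digits(rng, 3)} – {digits(rng, 2)} – {digits(rng, 4)}"
    name = f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"
    # Assumed ranges: any month, a year shortly after the post was written.
    expiry = f"{rng.randint(1, 12):02d} / {rng.randint(2007, 2010)}"
    lines = [
        BANNER,
        f"Social Security Number: {ssn}",
        f"Name On Card: {name}",
        f"Card Number: {digits(rng, 16)}",
        f"Expidation Date: {expiry}",  # spelled the way the phisher spelled it
        f"CVV: {digits(rng, 3)}",
        f"PIN: {digits(rng, 4)}",
        BANNER,
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(fake_record(random.Random()))
```

Random digits like these won’t pass even a basic card checksum, so anyone who bothered to validate them could filter them out.. which is pretty much the same weakness I admitted to above.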

In closing…

Phishing scams are pretty darn effective. They’re tricky, and they’re lucrative!

Or do!

Anyway, my wife’s pretty embarrassed about the whole thing and made me promise not to tell anyone.


Filed Under: Foobars, Funnyish, Insider View, Musings, Rants

Reconstruction Efforts


Hey, Homer came in with a very competitive bid.

Well, things could be worse.

We’ve pretty much got our whole network under control now.. the ongoing problem mentioned last post was finally figured out by Cisco support. It turns out it was a bug.. er, “undocumented feature” in IOS dealing with how they learn MAC addresses.

There was also another network problem we got resolved yesterday that was causing general slowness on web and mail servers. It’s complicated (i.e. I don’t understand it exactly myself), but in the end we took a distribution switch out of the network and that fixed it.

We still have one open ticket with Cisco for our core routers having some HSRP problems. It doesn’t seem like that’s having any real effect on our network, but we want it fixed!

We are also installing two new Ciscos to offload the BGP duties from the core routers so they’ll just have to handle switching. This set-up should be able to handle about 300% more traffic than our entire network now pushes at peak times!

Thanks to these network problems being resolved, we’ve also begun re-deploying in Alchemy, who at least didn’t have the second power outage.

We’re also still in the process of getting real UPS power on our network cabinet, plus our internal databases and a few internal servers. Basically, everything that all the customer mail, web, database, and file servers depend on to come right back up quickly should there ever be another outage.

Less like a disaster, more like a field of wildflowers.

So, um.. that’s how it stands now! We hope this will all soon be nothing more than a long bad dream (that was real).

Filed Under: Foobars, Hardware, Insider View, Updates

Anatomy of a(n ongoing) Disaster..


Hopefully not THAT bad.

What a three weeks…

As I’m sure most of you already know, we’ve had nothing but troubles, large troubles, for pretty much the last three weeks. A lot of these troubles were our fault, a couple of them were at least ostensibly beyond our control, and they all compounded each other.

Here I’ll try and go into as much detail as possible about what happened, why, and the steps we’re taking to stop this sort of thing from ever happening again. I can’t excuse what happened, just apologize and hopefully elucidate.

Ironically, all the recent disasters stem somewhat from us attempting to take some proactive steps to head off any sort of future power outages like the kind we experienced last year.

Not THAT bad either..

The Back Story

As some of you may know, we are co-located with Switch and Data in The Garland Building in downtown L.A. To say we’re co-located is a bit misleading though, since we’re now basically 95% of their data center.

Why don’t we have our own data center?

Because, believe it or not, we’re still not big enough for it to make sense. Even now, we only use about 1000 sq ft of data center space.. for it to really start to make sense to get our own space, we’d have to be using around 2500 sq ft. Mainly because when you buy a data center, you want to get one big enough to handle a lot of growth.. and although it’s cheaper per square foot than co-locating, you have to pay for all the space you’re not using yet.

And really, The Garland Building is supposed to be an excellent place for data centers. There are more than a dozen in the building. Companies like iPowerWeb, Media Temple, BroadSpire, and even MySpace (now the most popular website in the whole US!) are in there. It’s got FIVE huge generators, UPS for the whole building, on two separate power grids, and a dedicated engineering staff to make it all work flawlessly. Or so we were all assured.

Around last June though, the building informed all its data center tenants that they had essentially run out of power! Not power altogether, but the “good” power that data centers need.. i.e. UPS- and generator-backed power. Because Wells Fargo, who holds the master lease on the building, wasn’t sure if they were going to renew the lease when it is up in three years, they didn’t want to invest the millions of dollars to add more generators and UPSes to increase capacity. This is in fact the primary reason we’re still not selling any more dedicated servers .. they use too much power per dollar!

Of course, none of that was supposed to have any effect on their ability to keep the current power going in the case of an outage. On September 12th, 2005, we discovered they actually couldn’t… when two of the five generators failed!

However, since then, the building has repaired and replaced the faulty generators, and given all their tenants numerous assurances that what happened before would never ever happen again.

Not THIS horrible..

Why didn’t we move data centers right then?

That would have been a fairly massive undertaking, resulted in even more down time, been very expensive, and actually we did look around and there weren’t any really good options for moving… data center space is becoming pretty tight (in the LA area at least) and the Garland Building is still one of the best options, believe it or not. Also, this was the first time something like this had ever happened, and it seemed pretty reasonable that it wouldn’t happen again. We even asked around and none of those other tenants mentioned above were moving, so I guess it seemed like people were generally pretty confident it was a one-time freak occurrence.

Nevertheless, we started making contingency plans, searching around for another data center that had some power and would make sense for us. Eventually, we found Alchemy, just down the hall from S+D actually, and began making arrangements for getting some space from them. They had a little bit of power available because they were moving some of their clients out to El Segundo, and because they had gotten permission from the building to install their own generator. With that generator and some UPSes they were able to convert a “dirty” power feed into “clean” (i.e. good for data center use) power.

Pretty bad...

How the troubles began.

All this took a very, very, very long time. After months of searching and negotiating with Alchemy, we still had to get Switch and Data to allow us to put a cross-connect in from their data center over to their competitors down the hall. After even more months and teeth-pulling, we finally got that up and running. In fact, we finally got the first live server up in Alchemy a little less than a month ago.

All this in an attempt to head off future power problems.

Unfortunately, shortly after setting up the new footprint, we noticed something wasn’t right. Getting to Alchemy from Switch and Data we would lose huge buckets of packets. Just as we were trying to figure out the problem, we started to have problems with one of our file servers.

This resulted in a lot of problems across the board. The web servers that mounted that filer all had problems. The mail servers that mounted that filer all had problems. In fact, one of the mail servers was mis-configured and was logging thousands of errors a second to a remote logging machine… so many in fact that it was saturating its switch and clogging up a whole chunk of our network. Which in turn caused other machines to get slow and crashy because they couldn’t get to their filers, and so on and so on.

It turned out the filer problem seemed to stem from the fact that we had one shelf of 300GB disks and one shelf of 150GB disks on it. Apparently they’re not supposed to be able to support this, or at least it’s a bad idea. So, this was entirely our fault. However, we did have a number of other filers we did this on, and we’d never had problems before. Nonetheless, we will never mix disk shelf types on a file server again.

We eventually cleared all this up.

However, the Alchemy connection problems were still ongoing.

After trying all sorts of things, we eventually decided to replace one of our distribution switches that was acting strangely with a new one. This didn’t really seem to fix the problem either. This was on Friday, July 21st.

Never strikes thrice..

On Saturday, July 22nd, the building lost power.

This time, the generators actually worked, but the UPS failed! Honestly, it was much better than last year’s.. but unfortunately, even a brief power outage wreaks havoc on a data center. And this one wasn’t so brief.. here’s the building’s explanation:

At around 5:21pm, on Saturday July 22nd, a brown out occurred due to record high temperatures in downtown Los Angeles. Voltage dropped due to the high demand of electrical current along with equipment failure operated by the Department of Water and Power, City of Los Angeles. This condition caused failure of “ATS-B” switch and to UPS Module #3. Engineering crews were dispatched and began repair of this damaged equipment. A power interruption was required to replace contacts in “ATS-B”.

Repair of “ATS-B” failed contacts was completed on 7-24-06. Power was restored between 4:00am and 4:30am by the Engineering department.

Thank you,
Office of the Building

So, after all the emergency filer stuff going on the previous weekend, just about the entire admin team was back last weekend, working on getting everything back up when power came back on. Even when we had power, it was in a degraded state and so the cooling wasn’t working. As temperatures rose, file servers automatically shut themselves down rather than risk being damaged by the hostile environment. Apparently, MySpace made the decision to just keep all their servers off until cooling was restored.

Where are the engineers?!

More network troubles..

After the power outage, we decided to just yank everything back out of Alchemy (they lost power too!) until we could figure out what was going on with the network to there. Unfortunately, this didn’t seem to fix things, and our internal (“red”) network was still really fubar. When our red network isn’t working, the panel isn’t working, webmail isn’t working, and our server configuration system starts having problems (basically, anything that connects to our internal databases).

It took us just about all of Monday to figure out (and then fix) that a lot of the file servers had bad routes after being powercycled.. and so were sending ALL their traffic through the red network, saturating it. These things are generally pretty stable and a lot hadn’t been rebooted since September 12th, 2005.. and some had apparently had their networking set up by hand instead of correctly configured via our database. We’re making sure that doesn’t happen anymore either.

More network troubles..

Once that was fixed, things generally got better. Except there was STILL strange stuff going on (causing slowness and high loads around the system, but not an actual system-wide outage), even without NFS traffic going through red, and even without anything at Alchemy. It started to look like there was a problem with one of our core routers. We called our Cisco consultant and opened a trouble ticket with Cisco themselves..

Servers crashing? Not so bad.

More power problems..

On Friday, July 28th, we lost power again. The building wrote:

The Garland Building experienced a dead short which resulted in a brief power outage today, July 28, 2006. The air conditioning, elevators, and the electrical utility have all been restored.

While on generator power, a dead short occurred from one of our internal telecom users. We are investigating where the dead short occurred. A follow-up memo will be sent by the end of the business day reconfirming our transfer at 11:30pm tonight. We are currently on DWP power until further notice.

And then:

The Garland Building UPS System is back on-line supplied by DWP. Diesel generators have returned to an on-call status.

The 11:30pm transfer has been cancelled due to the dead short prematurely returning us to utility power. At 4:30pm the engineers engaged the UPS System to protect all tenants at the Garland Building.

Thank you,
Office of the Building

This time, we were able to get our entire system back up much quicker and with close to no problems. Of course, it had been less than a week since our last power outage.

Alchemy was the only data center in the building who did not lose power this time.

Could be worse.

More network troubles..

Over the weekend (this last weekend), we kept having the same ongoing weird network problems I mentioned above, and Cisco hasn’t made much progress. Yesterday, we realized the new distribution switch (an Extreme) was causing spanning tree problems with the older Ciscos. Jeremy got it all figured out, but in the process it erroneously blocked our “green” (public!) network for a few brief periods, taking down everything again.

Unfortunately, that fix STILL doesn’t seem to have fixed the ongoing core network problems. We were finally able to get our tickets escalated with Cisco yesterday. It is starting to look like something may have been damaged during the first power failure, although we’re not sure. It looks like the replacement/repair cost might be around $80,000.

Happier days..

And that’s where things stand today.

Our number one priority right now is getting this nagging network problem understood and fixed. Once that’s the case, we should be able to put things back in Alchemy, who didn’t lose power on Friday at least. Once things are going well there, we’ll be able to add new servers and transition old ones slowly with little to no downtime.

We’re also going to be buying our own UPSes, since we have learned we can’t trust our data center OR our building to do it. We’ll start by putting the core routers on them, then our internal databases and servers, then our file servers, and finally the hundreds of customer mail, web, and database servers.

The end.

Finally…

We’re very sorry for what happened. We definitely don’t want it to happen again, and we’re trying to take all the practical steps we can to prevent it. We never want to have another July 2006 again.

Ironically, some of the network problems seem to have stemmed from us trying to better protect ourselves from power failures. I also want to say for the record that none of these problems in my opinion stemmed from “overselling”. Rather, I’d say it’s the result of bad luck. And incompetence on our (and the building’s) part.

I don’t know if we’ll be able to change our luck, but hopefully we’ve at least learned something and will be able to become a tiny bit less incompetent in the future.

I hope you’ll all stay with us to find out.

Filed Under: Foobars, Insider View, Updates