The Official DreamHost Blog! Tales From the Inside!

Goin’ down…


Uptime is everything in this business.

DreamHost Level. Ladies’ purses, washing machines, web hosting.

Unfortunately, Monday was kind of a rough day for us. You may have noticed that your sites were unreachable for most of the day.

The last time something so wide-reaching and disruptive happened was less than a year ago when we discovered, quite out of the blue, that we were hosting the website for “Draw Mohammed Day.” That was…fun. And educational. Unfortunately, the technical lessons we learned during that experience simply were not applicable to what happened on Monday.

Technically, every website we host was up and running on Monday.

However, they were mostly unreachable for much of the day.

We’ve just completed our own internal review and want you to know exactly what happened, why it happened, and why we’re taking great pains to ensure that it won’t happen again.

So let’s start with…

What went wrong?

Let’s look at the timeline.

02:00 AM

Network connectivity becomes…quirky. Our core network begins to exhibit latency issues. We begin to investigate possible causes.

05:07 AM

The main Cisco switch at our datacenter locks up and becomes unresponsive.

05:30 AM

It’s clear the switch isn’t going to come back up on its own. We reboot it, and Cisco’s VSS (Virtual Switching System) fails over all of its traffic to our secondary switch. At this point partial connectivity has been restored. Sites are reachable, if somewhat slow.

06:00 AM

It becomes clear that our primary switch’s configuration has been wiped clean – spontaneously – and attempts to recover it are failing.

07:00 AM

In an attempt to recover some utility from our primary switch, we clone the config from our secondary switch with some tweaks to avoid a dual-active situation. Unfortunately that does not work.

08:30 AM

We scrap the primary switch’s configuration – on purpose this time – and begin to rebuild it from scratch.

09:00 AM

We enable port-channel VSL (Virtual Switch Link) on two channels. This causes our secondary switch to reload itself and makes our primary switch the active one – wiping out our configuration files again. Technically speaking, this should not happen.

09:30 AM

We’ve restored configs to both switches. Say a few Hail Marys, crack our necks from side to side, and then discover that the chaos of the morning took down the link to our El Segundo datacenter.

10:00 AM

Techs arrive at El Segundo and begin troubleshooting.

10:30 AM

The link between datacenters is restored.

12:00 PM

At this point things are looking up. Partial core connectivity is restored and we begin looking for smaller fires to put out.

01:00 PM

Outage continues. The config on our primary switch appears to be corrupted.

03:30 PM

Still more connectivity issues are reported. Config files are continually found to be mangled. Restoring from backups is not helpful as they’re discovered to be mostly out of date.

05:30 PM

Not out of the woods yet, but the end is in sight. All network interfaces are audited and restored. Routing and switching are repaired. The majority of issues have been resolved.

08:00 PM

Out of the woods. Full functionality is restored.

What does all that mean?

It means that a combination of factors, set into motion by what we believe to be a hardware failure in a key part of our network, caused many customers’ sites to be unreachable for much of Monday. Those that were reachable worked intermittently and slowly.

The fact that our core switch supervisor wiped its own configuration spontaneously – and continued to do so even after we restored and rebuilt that config manually – told us that the switch was not operating to spec and is a prime candidate for replacement.

We also witnessed other anomalous switch behavior during the recovery process that, according to Cisco, should just not be possible. We’re going through the RMA process with Cisco now.

Don’t you have some kind of backup system in place?

We do. We rely heavily on Cisco’s Virtual Switching System (VSS) architecture to provide fault-tolerant network redundancy for situations just like this one. On paper, and based on our specific network environment, Monday’s problems should not have happened.

VSS should have stepped up to route traffic around the troubled hardware. It didn’t. We believe our network configuration is solid – and that VSS did not behave as it should have. In fact the VSS behavior we saw was unexpected and inconsistent with what VSS claims to be. We’ll be working with Cisco to determine the nature of the failure.

Why didn’t you call me?

We would have loved to reach out to every customer individually, but with over one million domains hosted, that could – quite literally – have taken all year. We’d have loved to email you too, but well, we had this little network problem blocking emails.

Please bookmark and continue to check http://www.dreamhoststatus.com/ at the first sign of trouble. That domain is hosted offsite and we use it as our primary means of communication in cases of any and all planned – and unplanned – service interruptions.

You may also want to follow @dhstatus on Twitter, as it syndicates the post titles from dreamhoststatus.com.

Why weren’t you more responsive?

Unfortunately, there really wasn’t much news to share as we constantly created, installed, and reloaded switch configuration files only to have them crumble and disappear before our eyes.

Our network admins were, quite understandably, wound pretty tightly on Monday and under the gun to get things back to normal again as quickly as possible. When they did stick their heads out of the bunker long enough to pass on status updates, we piped that out to dreamhoststatus.com immediately. You knew what we knew – as soon as we knew it.

Why did it take three days for you to write this?

We wanted to provide as much context and detail as we could to support the events already posted on the status blog. That meant doing some serious research and analysis.

We were busy doing our own internal assessment of the situation: figuring out exactly what happened and working on a detailed accounting of what failed and why it took so long to get things back to normal. That report is now complete, and that’s why we’re able to share our findings with you now.

How will events like this be handled in the future?

We learned some key things on Monday:

1. First and foremost, you want to be kept more in the loop. You want to feel as if you’re staring over our network admins’ sweaty shoulders, watching pages of text scroll by on the console. We get that. And you certainly deserve it.

We provided as much information as we could during the outage on our status blog throughout the day. That’s not enough. We’ll be posting more frequently, even if there’s really nothing to report, during future large-scale service disruptions. The nature of Monday’s problems meant that we really couldn’t even guess at an ETA for a resolution, and we didn’t want to make promises that we weren’t prepared to keep.

2. We’ll work to refine our network topology so that our datacenters aren’t so dependent on each other and operate more like freestanding units and less like an interconnected web of services.

3. We’re going to be better about keeping our network config backups up to date. We’ll keep several versions on file so that we can roll back to a last-known good configuration if need be.

4. Finally, we’ll be beefing up our network monitoring situation and implementing a centralized ‘Network Health’ page that any employee can use to get a bird’s-eye view of our networking situation at any time.
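To make item 3 a little more concrete, here is a minimal sketch in Python of what "several versions on file" with a last-known-good rollback could look like. It is an illustration only: `ConfigArchive`, `max_versions`, the timestamps, and the config text are all made up for this example and don't describe our actual tooling or any Cisco feature.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ConfigVersion:
    taken_at: str            # when the backup was taken (hypothetical timestamps)
    text: str                # full contents of the switch config
    known_good: bool = False  # set once the config has been verified in service


@dataclass
class ConfigArchive:
    max_versions: int = 5
    versions: List[ConfigVersion] = field(default_factory=list)

    def save(self, taken_at: str, text: str, known_good: bool = False) -> None:
        """Append a new backup, then trim the archive to max_versions."""
        self.versions.append(ConfigVersion(taken_at, text, known_good))
        self._prune()

    def last_known_good(self) -> Optional[ConfigVersion]:
        """Return the newest backup marked known-good, or None."""
        for version in reversed(self.versions):
            if version.known_good:
                return version
        return None

    def _prune(self) -> None:
        """Drop the oldest backups, but never the newest known-good one."""
        keep = self.last_known_good()
        while len(self.versions) > self.max_versions:
            for i, version in enumerate(self.versions):
                if version is not keep:
                    del self.versions[i]
                    break
            else:
                break  # nothing left that is safe to delete


archive = ConfigArchive(max_versions=3)
archive.save("2011-03-01T02:00", "hostname core1\n! v1", known_good=True)
archive.save("2011-03-05T02:00", "hostname core1\n! v2")
archive.save("2011-03-10T02:00", "hostname core1\n! v3")
archive.save("2011-03-14T06:00", "")  # a wiped config gets archived too
rollback = archive.last_known_good()
```

The one design choice worth noting is that pruning skips the newest known-good version, so a run of bad or empty backups can't push the only usable config out of the archive.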

I’d like compensation.

You’ve earned it! You pay for 365 days of service – not 364.375. Contact our technical support team and we’ll do what we can to make it right.
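For the curious, the 364.375 figure is consistent with about 15 hours of downtime, roughly the stretch from the 5 AM switch lockup to the 8 PM all-clear in the timeline. A quick check in Python (rounding the lockup to 5:00 AM is our assumption; the calendar date is arbitrary):

```python
from datetime import datetime

# Times taken from the timeline, rounded to the hour; the date is a placeholder.
lockup = datetime(2000, 1, 1, 5, 0)      # ~05:07 AM, main switch locks up
all_clear = datetime(2000, 1, 1, 20, 0)  # 08:00 PM, full functionality restored

outage_hours = (all_clear - lockup).total_seconds() / 3600
days_delivered = 365 - outage_hours / 24
print(outage_hours, days_delivered)
```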

Aren’t you forgetting something?

We were saving the most important part for the end.

We’re sorry.

We let you down, and truly, each and every one of us here behind the blue curtain gets a knot in their stomach thinking about what happened on Monday.

Some of you weren’t able to post to your blogs.
Some of you weren’t able to work on (or submit) projects for classes.
Some of you had nothing to show at SXSW.
Some of you weren’t able to accept online orders.
Some of you got yelled at by your boss.
Some of you weren’t able to get any business done.

All of that, and perhaps more, is our fault.

We appreciate your business and value the relationship that we’ve worked hard to build with each and every one of you.

It’s our hope that you’ll stick with us as we work to regain your trust by once again providing solid, dependable hosting services.

-Brett

Filed Under: Business, Foobars, Insider View, Updates

Wren Jones


Hi, it’s Josh.. some of you old timers may remember me as the guy who used to write the newsletters and most of the blog posts around here.

You may have noticed that it was about a year ago (exactly), that I stopped.

The reason was that my wife and I had our first child, a baby boy we named Wren on that day. March 9th, 2010. He was 3 weeks early, 7 pounds, 20.5 inches, and delivered at 12:12pm. It was honestly the best day of my life. It was also the worst.

About 11 hours after his birth, Wren stopped breathing. We were at home by ourselves in Santa Monica (we’d had a home birth, and the midwives had left about 3 hours after the birth), so we called 911. They arrived within three minutes and rushed him off to the hospital just one mile from our house, but after about three hours of nothing working they had to pull the plug.

I won’t get into all the details here. You can see everything over at wrenjones.com or Group B Strep International or Hurt By Homebirth. Since then, it’s been a pretty shitty year for me and my wife, and our families and friends. There’s been a lot of crying. A lot of looking for answers. A lot of trying again (no luck so far).

When we got the autopsy back and found out for sure that Wren had died of a Group B Strep infection, it seemed like none of our friends or family members knew anything about it. I was like “people need to know about this!” But after doing a little bit of research I realized that although most parents and lay people have never heard of GBS, everybody in the medical world already has… and it’s basically been solved. Since the 90s there’s been a straightforward protocol on how to prevent GBS, that is over 99.8% effective.

What Then?

Which left me floundering. What happened? Why us? Were we really just that unlucky?

Finally, it dawned on me that the GBS infection was really just the symptom of the deeper “disease.” The home birth itself.

When we had decided to do a home birth, I was skeptical at first. It just intuitively seemed like a risky proposition.

But… after visiting a couple different home birth providers around LA, as well as our HMO-provided OBs, I developed an analogy I could accept. “Home births are to hospital births what Whole Foods is to Safeway.” (A rich people place that probably isn’t actually any better, but at least isn’t any worse.)

I’d long ago given in to shopping at Whole Foods even though I gag at the sight of Nature’s Path Organic Love Crunch.

This epiphany struck me when I saw that home births were actually more expensive than hospital births… ours was $5,200 (and they don’t take insurance), compared to basically free with our HMO. The home birth specialists stated that as long as these three key components held true, home births were actually safer than hospitals:

1. You’re low risk. No complications of any kind; no medical conditions, no twins, no premature labor, no breech, no nada.

2. You have highly trained professional midwives assisting you.

3. You have pre-arranged a backup hospital that is very close by, just in case.

I didn’t buy that it was safeR, but it did seem somewhat reasonable that if you carefully followed these rules it could be as safe. And if the experience was nicer than the HMO (and the checkups definitely were), the $5,200 seemed worth it.

I now know the flaws in each of those three components:

1. You’re low risk.

Even if you’re low risk, that doesn’t mean you’re no risk.

The math basically works out like this.. let’s say a “high risk” person has an 80 in 10,000 chance of a life-threatening emergency during childbirth and a “low risk” person has an 8 in 10,000 chance. Let’s say the survival rate of such emergencies is 25% at home and 50% in a hospital.

If that’s the case, when you’re “high risk,” you’d be adding a 20 in 10,000 chance that your baby will die. And when you’re “low risk” you’d be adding a 2 in 10,000 chance! It’s better than if you’d been “high risk”, but why add any extra chance your baby will die?
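Here is that arithmetic spelled out as a short Python sketch. The 80 and 8 in 10,000 emergency rates and the 25% and 50% survival rates are the illustrative numbers from the paragraphs above, not measured statistics, and the function name is made up for this example.

```python
from fractions import Fraction


def added_deaths_per_10k(emergencies_per_10k, survival_home, survival_hospital):
    """Extra deaths per 10,000 births from being at home instead of in a hospital."""
    deaths_home = emergencies_per_10k * (1 - survival_home)
    deaths_hospital = emergencies_per_10k * (1 - survival_hospital)
    return deaths_home - deaths_hospital


# Survival rates: 25% at home, 50% in a hospital (illustrative only).
high = added_deaths_per_10k(80, Fraction(1, 4), Fraction(1, 2))
low = added_deaths_per_10k(8, Fraction(1, 4), Fraction(1, 2))
print(high, low)  # 20 2
```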

Secondly, what is “low risk”? Early on, our OBs detected GBS in my wife’s urine. They dealt with it fine (although they could have told us about the higher risk of infecting your child during birth when you’re heavily colonized!).

To them, we were still low risk because GBS is so easy to treat… the mother just gets an antibiotic IV when she goes into labor… except they forgot we were planning a home birth. For our midwives we were also considered “low risk” … mostly because they held a certain complacency about GBS, I guess because they had never experienced it personally.

You never really know if you’re low risk (especially with your first pregnancy!) until after the fact. Plus, once you’ve decided to go the home birth route, there is suddenly this (typically) unspoken pressure to go through with it, even if “high risk” warning signs start to appear, because to deliver at the hospital would be some kind of a failure.

2. You have highly trained professional midwives assisting you.

In the U.S., there are basically two types of certified midwives: CPMs and CNMs. What you want is a CNM: Certified Nurse Midwife.

Everything else (CPM, LM, MPH, LLC, direct-entry, state licensed, etc..) is a Professional Midwife. The differences between the two are quite large.

A Nurse Midwife is required to graduate from nursing school, and works in the health care system with real medical doctors.

A Professional Midwife needs only a high school degree and to get certified by a midwifery association.

To go back to my analogy theme, a CPM is to a CNM as a real estate agent is to a district attorney.

It is currently illegal in 23 states for CPMs to deliver babies. Unfortunately it is legal in California.

In fact, there are some studies that show that births attended by CNMs have survival rates even slightly higher than those attended by MDs. However, almost no CNMs will do a home birth… they all deliver in hospitals.

I can only assume something they learned in medical school scared them.

3. You have a hospital very close by.

That almost all Certified Nurse Midwives will only deliver in a hospital says a lot.

Being close to a hospital is not the same as being in a hospital. Believe it or not, babies can die very suddenly during labor, delivery, or even the first few days afterwards. You’re never completely in the clear of course, but the most likely day for any human to die is the day they’re born.

Our story alone should prove that being close (we live literally one mile from the new UCLA medical center NICU, one of the best in the world) is not always good enough.

Clearly, being close to a hospital is better than being far from a hospital.

So it seems pretty logical that being in a hospital is even better than being close.

And again, why add any extra chance that your baby would die?

The Sad Thing

There seems to be a teensy bit of the beginning of a trend towards home births right now; maybe it goes with the green/local/organic/global warming craze. It may seem harmless, but the problem with the whole culture of home birth is its intense focus on the process of childbirth rather than the result.

I wish I could somehow get everybody laser focused on the most important, nay, the only important thing in childbirth. Getting a healthy baby out of a healthy mommy. I wish I could impart this to people without them having to go through what we’ve been through.

I know it’s near impossible to change somebody’s mind once it’s been made up. I also know that the vast majority of home births are always going to go fine; the numbers we’re talking about are all pretty “small”.

The sad thing is, many people will still choose to have a home birth with a CPM even if they know that they are adding a 1 in 1000 chance that their baby will die.

(That’s the actual odds! For comparison, there are an estimated 85.5 million drunken drives a month and about 11,000 fatalities a year in the U.S. That implies that in America having a home birth with a CPM is 93 times more dangerous than driving drunk.)
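The 93x figure can be reproduced from the numbers in that parenthetical. A minimal sketch, assuming (as the comparison seems to) that the yearly fatalities are spread evenly over the estimated number of drunken drives:

```python
drunk_drives_per_year = 85.5e6 * 12   # 85.5 million a month
fatalities_per_year = 11_000
risk_per_drive = fatalities_per_year / drunk_drives_per_year

added_risk_home_birth = 1 / 1000      # the 1 in 1000 figure quoted above
ratio = added_risk_home_birth / risk_per_drive
print(round(ratio))  # 93
```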

I’m okay with that. I just want people to make an educated decision with the best possible information.

(Personally, my advice would be to not.)

Addendum

If you’re considering having a home birth, please… you owe it to yourself, your spouse, your friends, your family, and your unborn child to consider the “unthinkable.”

Before you decide, try checking out The Skeptical OB blog by Dr. Amy Tuteur. She’s been doing this way longer than me and is much more qualified than I am to talk about this stuff.

And if you do still decide to have a home birth, please, find a CNM! (And if you’re GBS positive, get the antibiotic IV for crying out loud!)

Finally, have you ever heard (or can you even imagine hearing) somebody say, “If only I’d had a home birth, my baby would be alive.”?

Because if only I hadn’t.

Filed Under: Foobars, Rants, Updates

Yeah…about that downtime…


You may or may not have noticed that yesterday, www.dreamhost.com was offline and unreachable for the better part of 6 hours. We can’t let something like that go without an explanation.

I should note that during this time no customer sites were affected, other than one (which I’ll get to). Only that site and the main “www.dreamhost.com” domain went down. Customer sites were up, our web panel was up, everything was up…including the ire of some tech-savvy Muslims!

We’ve got a fairly liberal free-speech policy here which we’re quite proud of. Speech that is protected by the United States Constitution’s First Amendment is protected by DreamHost. While we don’t always agree with the content of the sites we host, we do support their right to host it in America!

Yesterday was Draw Mohammad Day.

This did not sit well with roughly 21% of the world’s population.

We happened to be hosting drawmuhammadday.com, a site that encouraged people to draw images of Mohammed. That’s kind of a no-no in the Muslim world.

Incidentally, did you know there’s like a million different ways to spell Mohammed?

In the spirit of yesterday’s event, but without the offensive parts, I’ve drawn some pictures to show you what you might have missed!

Some people weren’t too keen on the idea of the Draw Mohammad Day website and suddenly we were the target of the largest Distributed Denial of Service attack (DDoS) we’ve ever seen. drawmuhammadday.com was the first to fall. It was the main target and it didn’t take long…based on our stats it looked like almost the entire country of Pakistan was attacking us! Well not really. But nobody in Pakistan could reach YouTube, Facebook, or Twitter yesterday, so what else were they gonna do?

These weren’t just random attacks from here and there. We saw several Pakistani groups targeting us on their blogs, often providing step-by-step directions and automated tools for launching e-assaults on dreamhost.com and drawmuhammadday.com.

They did not let up once the site was down. At one point dreamhost.com (the site itself) was handling around 20,000 requests per second. To put that number in perspective, when our customers’ sites have traffic surges a busy day might see that number get up to ten or even twenty.

Our load balancers, as great as they are, typically handle about 4,000 connections at any given moment. During the attack they made it up to 400,000 before they seized up and crapped out. We believe that even the most top-shelf battle-hardened load balancing options would not have been able to withstand an attack of this scale – a quick jump in traffic about 100x larger than normal traffic patterns we see on any given day.

Our fault-tolerant setup relied on those load balancers and they proved to be our undoing. Luckily only some services were affected by this for a very short time (webmail being one of them) before we got them going again a few minutes later.

To restore services we had to take the site down altogether while we moved it to newer, stronger hardware, beyond the reach of our load balancers. We tuned the Linux kernel on this new machine aggressively to use less memory for TCP connections. We also abandoned Apache, favoring a specialized nginx installation.

When we flipped the switch to get dreamhost.com up and running again at around 2PM PDT, the attack load had dropped to 130,000 simultaneous connections with over 20,000 requests per second. The new setup took it like a champ and continues to perform well today – even while we’re still seeing elevated traffic as a result of lingering attacks.
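To see why trimming per-connection TCP memory matters at this scale, here is a rough back-of-the-envelope sketch in Python. The connection counts come from this post; the buffer sizes are hypothetical placeholders, not the settings from the actual machine, and real kernels size socket buffers dynamically, so treat the results as a worst-case illustration only.

```python
def tcp_buffer_memory_gib(connections, recv_buf_bytes, send_buf_bytes):
    """Worst-case memory if every connection filled both of its socket buffers."""
    return connections * (recv_buf_bytes + send_buf_bytes) / 2**30


peak = 400_000         # connections at which the load balancers seized up
at_relaunch = 130_000  # simultaneous connections when the site came back

# Hypothetical per-socket (receive, send) buffer sizes in bytes.
roomy = (87_380, 16_384)
trimmed = (4_096, 4_096)

for label, conns in (("peak", peak), ("relaunch", at_relaunch)):
    print(label,
          round(tcp_buffer_memory_gib(conns, *roomy), 1), "GiB roomy vs",
          round(tcp_buffer_memory_gib(conns, *trimmed), 1), "GiB trimmed")
```

Under these assumed sizes the gap is more than tenfold, which is the general idea behind tuning a single box to ride out that many connections.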

We’re proud to say (and repeat!) that customer sites were not affected and our control panel was still reachable during this entire debacle. And of course if you ever suspect server problems with your DreamHost account be sure to check dreamhoststatus.com!

We learned some lessons yesterday and, moving forward, we’re going to put them into practice. Thanks for hangin’ in there.


(not Mohammed)


Filed Under: Foobars, Funnyish, Insider View, Updates

Speaking of scheming…


Sucks Sites.

I’m sure you’ve seen them. Wikipedia calls them gripe sites. They’re usually set up by disgruntled customers and then typically disappear a few weeks later once the creator has had time to cool down.

Sucks to be whoever's on the receiving end of this thing!

Oh yeah, they’re out there. NoDaddy.com, for example…but in their case it turns out they may actually be on to something!

Thanks to some great investigative journalism by Andrew Allemann over at Domain Name Wire, you can now read in great detail the lengths that GoDaddy has gone to to conceal its involvement in its own domain name warehousing operation.

Standard Tactics, LLC: How GoDaddy Profits from Expired Domains

The Go Daddy Group allegedly uses a complicated web of subsidiaries and anonymized whois records to hide its involvement in its domain warehousing/auctioning scheme.

Check it out. It’s a great read to get you into the Christmas spirit. If you’re the Grinch.

I guess when you’ve got a $2 million Christmas party to throw and a $3 million Super Bowl commercial to put on, that money’s gotta come from somewhere!

Filed Under: Business, Foobars, Rants, Tech News

DreamQuake!


Red Square of SHAKE!

We just felt a BIG earthquake! Okay, just a 5.8, but it sure made our 50th story office BOUNCE!

Don’t worry, all data centers are okay!

Filed Under: Foobars, Funnyish, Insider View

Another Anatomy


X-Rays are used to explain a lot of things at DreamHost.

Okay, nothing silly this time, I promise…

Some of you may have noticed that we’ve been having a problem that is, although maybe not the worst in DreamHost history, definitely in the top 5.

There has been a DreamHost Status post about it, but it’s been going on so long, there obviously needs to be more said.

This wasn't the first disaster.

The History

The events that conspired to cause this horrible performance for everybody in our “blingy” cluster actually started to take root 19 months ago.

That was when I made this post asking our customers for some suggestions on storage. I made the mistake in that post of mentioning the name of one particular storage vendor who apparently does a search for their name in RSS feeds of all kinds of blogs. I won’t mention their name again here, to test if they REALLY read this blog, but they were the one on the list right after “Netapp”.

Anyway, immediately a sales guy from there was hounding me about how great their product was. It would have super-duper reliability, super-duper performance, and super-duper ease-of-management. It was super-duper expensive compared to our current solution (about 3x the price per GB), so in the end I declined.

But, over the next year he kept hounding me and hounding me, and eventually the price came down to something in line with our current costs, so we decided to try one unit for our new cluster, “Blingy”. After we were satisfied with our internal testing, Blingy went live with the new storage solution in December 2007.

No need for life boats!

Smooth Sailing

At first, everything was fine, performance was great, everybody was hunky and dory. But then, as usage started to go up, the new file system started acting up. Around the same time every night, the system would stop responding to NFS requests for a while, which would immediately break web and mail service for everybody in the entire cluster.. thousands of customers.

Our Bad

Now, it can be a big mistake to put live customers on any new system. But honestly, we’d tested it lots, researched it a ton, and we added people very slowly at first, and it performed great.

Our biggest mistake I believe had nothing to do with what specific vendor or hardware we went with.. it was simply putting so many eggs in one basket!

Even with our Netapps (which are pretty much awesome), there are problems from time-to-time. However, a typical hosting cluster will have a dozen or so Netapps, which means any problems are one twelfth as big.

With Blingy, EVERY customer is on this one “mega” filer, which in theory should make for better performance, reliability, and ease of management. And since we got the clustered solution (in an active-active configuration)… there really is no single point of hardware failure in this thing.

But, as it turns out, there are a lot of non-hardware failures in the world.

Their Bad

Well, the techs at the vendor couldn’t figure out what was causing the NFS freezing, and so they recommended that we do a major OS upgrade to hopefully fix it.

During this whole time, the fiber channel disks were slowly filling up, and we’d been trying to move large files off to the SATA pool (it’s a two-tiered solution, and there’s a feature that automatically moves less-accessed data to lower tiers).. however the thing couldn’t move the data fast enough. It couldn’t finish doing a “move job” in a single day, and every day it’d sort of “crash”, which would screw up the move job, and nothing would get moved.

Also, as the disk kept getting more full, performance kept getting worse, creating a vicious cycle. We ordered some more fiber channel disk shelves at the end of February to grow the main FC volume, since we couldn’t get things off to SATA. They were supposed to arrive on March 10th and be installed at the same time as the major OS upgrade.

However, the disks didn’t end up getting installed until March 25th, and at that point it turned out we could NOT grow the FC volume with these disks (well, it was technically possible, but their on-site techs recommended VERY VERY heavily against it.. it would severely impact performance), which was sort of the whole point. So now we had a new FC volume which we still had to migrate users to.

The Exxon Valdez ain't got NOTHING on us!

Your Bad

Of course, this whole time, new customers just kept signing up, and being added to Blingy. What were you guys thinking?

By this point we knew this was a bad idea, but we didn’t have a new cluster ready (we’d expected Blingy to grow for another couple of months), and we try to never ever grow old clusters again once they’ve been “shut off” from new signups (because in time they stabilize and have very few problems).

However, moving people off to the new FC vol, or the original SATA vol, or even the new Netapp we also added to Blingy, just wasn’t happening fast enough. So on April 2nd we bit the bullet and switched Blingy off as the “new customer” cluster and started growing good old “Postal” again. Once we did that, we were finally able to get ahead of the curve and total usage on our first fiber channel volume has been slowly dropping ever since.

We tried at that point to contact the vendor to see if we could just get more drives that WOULD allow us to grow fcvol1, but they said their manufacturers were closed for inventory for a week after the end of the quarter and we couldn’t get anything until Friday, April 11th at the absolute soonest. Later they said they could find us some that they could get to us by Tuesday, April 8th, and we preliminarily said we’d take them.

This whole time we had a support ticket open with the vendor about the crashes (the OS upgrade didn’t fix it), and finally on April 3rd we received notice that they’d fixed the bug that they believed was causing it! However, the patch still needed to go through their “QA”. Finally, this Sunday April 6th they said it was all ready to be deployed, so last night we did.

What Now

Well, right now, performance is still not great on fcvol1… but mail and web should be pretty much working. One thing we’ve noticed is a website that hasn’t been visited in a long time will have a big lag still upon the first visit.. but then subsequent reloads/visits seem much faster.

At least the total disk usage is coming down now, and hopefully by tomorrow it’ll be below 85% which is supposedly a magic number where performance is fine. We’re going to keep off-loading it until things are great, though. We’ve got plenty of disk space for it, the problem is just it takes so long to move it.
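As a rough illustration of why this takes so long, here is a small Python sketch that estimates how many days of off-loading it takes to get a volume down to the 85% mark. Every number in the usage lines (capacity, usage, daily move rate, daily growth) is made up for the example; only the 85% threshold comes from the paragraph above.

```python
import math


def days_until_below(capacity_tb, used_tb, moved_per_day_tb,
                     growth_per_day_tb=0.0, threshold=0.85):
    """Days of migration needed before usage is at or below `threshold`.

    Returns 0 if the volume is already there, and None if data arrives
    at least as fast as it can be moved off (the vicious-cycle case).
    """
    target_tb = capacity_tb * threshold
    excess_tb = used_tb - target_tb
    if excess_tb <= 0:
        return 0
    net_per_day = moved_per_day_tb - growth_per_day_tb
    if net_per_day <= 0:
        return None
    return math.ceil(excess_tb / net_per_day)


# Hypothetical 100 TB volume that is 95% full.
while_growing = days_until_below(100, 95, moved_per_day_tb=2.0, growth_per_day_tb=0.5)
signups_closed = days_until_below(100, 95, moved_per_day_tb=2.0)
stuck = days_until_below(100, 95, moved_per_day_tb=0.5, growth_per_day_tb=0.5)
print(while_growing, signups_closed, stuck)  # 7 5 None
```

The `None` case is the situation described earlier, where new signups kept filling the volume as fast as the move jobs could drain it.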

I guess we’ll also find out tonight if the NFS freezing bug is fixed by this new patch. Hopefully so.

Apologize this kung-fu kick!

It’s Too Late…

I realize this is probably too little too late for many of you, but I just wanted to sincerely apologize for this whole big Blingy cluster-f*ck. Also, if you’re on Blingy (you can tell from the panel by clicking “account status” and looking at “Your Email Server”), we’d like to offer you a month’s worth of hosting credit.

To get it, all you need to do is contact support from our panel and make the subject of your message “Blingy Account Credit”. That’s all you have to do, and we’ll credit everybody who asks (and is actually on Blingy!) next Monday (April 14th).

Very funny, Mr. Happy Blingy Customer.

Filed Under: Foobars, Hardware, Insider View, New Features

Good Reminiscing Friday


Those were the days!

Well, it was a little over two months ago that we had what I think is pretty safe to call the worst disaster in DreamHost history.

In retrospect, it’s kind of funny to me that the worst disaster didn’t turn out to be due to a security breach, a power outage, a loss of data, or actually anything related to our actual hosting service. I guess it shouldn’t be a surprise that people care a lot more about their bank accounts than they do their websites.

I have realized that billing is the one issue where how important we feel it is is completely at odds with how important you guys feel it is.

What I’m trying to say is, we’ve always been ultra-flexible and lax about how people pay, when people pay, or even about giving credits, discounts, or refunds. We figure, whatever, pay us when you’re ready, we’re not sending anybody to collections or ruining anybody’s credit over some measly bandwidth bill.

If everybody had just been paying by check!

What we’ve always tried to focus on more (even though it might not seem like it at times!) is our hosting system’s stability, performance, and features.

I guess I’ve always figured that any billing-related error can be easily undone (worst case scenario, it costs us a little money); there is no lasting harm done to the customer. Whereas having a website or email problem could potentially cause permanent damage to somebody’s business or personal life or something?

Well then, let’s go back and see just how little money a worst case scenario actually costs, shall we?

Credits and refunds to cover people’s bank fees: $52,000.

Sigh, if only everybody kept a big cushion of cash in their account! The main damage that can be caused by a billing snafu is for people who get their account overdrawn, and because of that aren’t able to make a critical purchase, or have a check bounce, causing hassles and incurring bank fees. We offered to pay people any amount their bank charged them for going negative, and in the end that total looks like it came to about $52,000.

Discover how much money I lost DreamHost!

Accidental refunds: $170,000.

The worst part of this whole process (for us) turned out to be just after the accidental billing, ironically when we were trying to make things right!

If you recall, our system was not actually charging about 75% of the time we thought it did.. and so we refunded thousands of people who were never charged (but, 75% of the refunds didn’t work either). Well, out of all that, and after two months, there are still about 600 accounts who were credited a total of $170,000 in excess of what we charged them that we haven’t been able to get back from them or their bank.

It is slightly annoying when the same guy who complains to the high heavens when he thought he’d been over-charged $9,000 by accident conveniently disappears when we realize that actually, he’s been over-refunded $9,000 by accident.

Extra credit card fees: $82,000.

Another slightly annoying thing is that credit card processors don’t credit you back any fees when you refund a transaction. Overall, the extra credit card processing we did resulted in extra fees of about $350,000! Fortunately, after a whole lot of groveling and explaining the situation (and waiting two months), we finally got all but $82,000 of that back from First Data, American Express, and Discover Card.

Apparently our snafu didn't screw up Visa's IPO too badly.

Extra support messages: 20,000.

As you may have surmised, people wrote to us about this thing. About 20,000 times… and it would have been tens of thousands more if we hadn’t put up an “emergency block” against new messages for a little while in there.

How much this extra support actually cost (in terms of your wasted time, tech support overtime pay, and other questions taking longer to answer) is hard to say, but normally we only get about 45,000 messages in a whole month!

Accounts canceled: 1000.

It’s also kind of hard to say how many people actually closed their account because of the incident, but in January we did have about 1,000 more accounts closed than average. Assuming each of those accounts would have stayed for maybe another year, that’s another $120,000 down the Intertubes. It’s crazy… from all our power problems back in 2006, we hardly lost any accounts at all.


Goodwill lost: Priceless.

Yeah, it turns out this whole blog post is nothing more than another clichéd MasterCard commercial parody.

P.S. I guess it’s nice to know, less than two hours away from our biggest data center move ever, that we’ll cause a tiny fraction of the disruption to our customers that one unexpected fat finger did!

P.P.S. Thanks RIM, for scheduling a BlackBerry outage at exactly the same time. It makes us look better. And, maybe some of our Happy Customers will blame their lack of email tonight on you!

Filed Under: Foobars, Insider View, Updates

The Final Update


Okay, the number of people who still had not gotten their refunds was starting to seem a little weird, so after further investigation yesterday, I think we’ve finally got things completely fixed.

It turns out there was a glitch in our new PayflowPro.pm that resulted in only the first transaction in any given second actually going through! According to PayPal’s site, PayflowPro.pm should be just a drop-in replacement for the old PFProAPI.pm… and it did seem to be: after changing two lines, everything seemed okay.

However, there was one little difference. The new HTTPS interface requires you to pass a unique id for each transaction, and PayflowPro.pm generated that unique id as follows:

my $request_id=substr(time . $data->{TRXTYPE} . $data->{INVNUM},0,32);

The problem was, we never passed in the (optional) “INVNUM” field.. we had an invoice number, but we passed it in as the (also optional) “COMMENT1”. So, our “unique” request_id was pretty much just the current time (plus whether it was a sale or a credit)!

In my testing this didn’t fail, because I didn’t run multiple transactions in the same second. Also, they apparently still return the same old success code we test for when this happens! But when multiple biller services run in parallel on all our controllers, lots of transactions end up happening on the same second.
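To see why this collides, here is a minimal sketch in Python (not the Perl from PayflowPro.pm) that mimics the quoted line. The function name make_request_id, the epoch value, and the invoice numbers are illustrative assumptions; only the recipe (time, then TRXTYPE, then INVNUM, cut to 32 characters) comes from the line above.

```python
def make_request_id(now, trxtype, invnum=""):
    """Mimic the quoted recipe: epoch seconds + TRXTYPE + INVNUM, cut to 32 chars."""
    return (str(int(now)) + trxtype + invnum)[:32]

# Two different sales in the same second, with INVNUM left empty:
now = 1200400000  # an arbitrary epoch second, for illustration only
id_a = make_request_id(now, "S")
id_b = make_request_id(now, "S")
collides_without_invnum = (id_a == id_b)

# The same two sales with (hypothetical) invoice numbers passed as INVNUM:
id_c = make_request_id(now, "S", "inv-1001")
id_d = make_request_id(now, "S", "inv-1002")
collides_with_invnum = (id_c == id_d)

print(collides_without_invnum, collides_with_invnum)  # prints: True False
```

With INVNUM empty, every sale in the same second gets the identical id, which fits the "only the first transaction in a single second goes through" behavior described above.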

The Upside

It turns out that of the (actually closer to) $9,600,000 we thought we had mistakenly charged, only about 1/4 ever _actually_ hit people’s credit cards. Our system thought we charged them, and they received an email receipt, but that was where it ended. It turns out we actually billed “only” about $2,100,000 incorrectly.

The Downside

This bug still existed until late last night (around 4am).. so when we ran our super-refunder script, the same thing was happening. Only about 1/4 of the refunds successfully went through. This resulted in the following situation:

About 9/16th of our customers: weren’t actually billed OR actually refunded.
About 1/16th of our customers: were billed AND were refunded.
About 3/16th of our customers: were billed BUT WEREN’T refunded.
About 3/16th of our customers: weren’t billed BUT WERE refunded. (of course, nobody wrote in about it!)
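Those fractions fall out of simple arithmetic if you assume each charge and each refund independently had about a 1/4 chance of going through. A quick check in Python (the independence assumption and the variable names are illustrative, not something taken from the billing system):

```python
from fractions import Fraction

p = Fraction(1, 4)   # chance a charge or a refund actually went through
q = 1 - p            # chance it silently did not

buckets = {
    "not billed, not refunded": q * q,
    "billed and refunded": p * p,
    "billed, not refunded": p * q,
    "not billed, refunded": q * p,
}

for label, share in buckets.items():
    print(f"{label}: {share}")
```

The four shares come out to 9/16, 1/16, 3/16, and 3/16, and they add up to 1.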

Anyway, last night we fixed the bug (by passing our invoice in as INVNUM) and re-ran another fixer that took an actual log of successful transactions downloaded from our processor and cross-referenced everything with our system. This is what it did:

About 9/16th of our customers: marked their bill and refund as $0 amount.
About 1/16th of our customers: left everything alone.
About 3/16th of our customers: redid the refund.
About 3/16th of our customers: redid the charge.
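In rough terms, the fixer's decision for each account looks like the sketch below. This is Python for illustration, not the actual fixer script; the function name reconcile and its two boolean inputs are hypothetical, standing in for "does the processor's log show that the charge (or the refund) really went through?"

```python
def reconcile(charged_at_processor, refunded_at_processor):
    """Pick the fix-up action from what the processor's log says really happened."""
    if not charged_at_processor and not refunded_at_processor:
        return "mark bill and refund as $0"
    if charged_at_processor and refunded_at_processor:
        return "leave alone"
    if charged_at_processor and not refunded_at_processor:
        return "redo the refund"
    return "redo the charge"  # refunded but never charged

for charged in (False, True):
    for refunded in (False, True):
        print(charged, refunded, "->", reconcile(charged, refunded))
```

The point of driving this from the processor's log rather than our own records is that our own records were exactly what had been wrong.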

Double checking now, there were no more of those glitches from before, so everything seems okay.

Once again, all the stuff mentioned in the last post still holds true (you may not see the correction on your statement yet, but if you call your processor they should see it coming, for REALs this time), and once again, I’m very sorry about this whole fiasco.

Sincerely,
Josh Jones

P.S. For people wondering how the “robust and stable” rebiller could have created multiple future charges for the same date… I guess I meant “robust and stable” in regards to normal use over the last ten years. It looks like in this case, when multiple instances were running in parallel on a future date, race conditions allowed some multiple charges for the same period to be created. That too should never happen again now that we don’t allow future bill dates.

Filed Under: Foobars, Updates

The Aftermath


It seems like it’s about time for a follow-up on things from yesterday.

First, I just want to apologize for the regular-style blog post about it yesterday. Hopefully this will be the (picture, bold, and italics-free) blog post many of you would have liked to have seen yesterday.

The current status: we believe we have refunded everybody who was incorrectly billed at this point. This was pretty much finished yesterday at 3pm, but there were a few stragglers who we got today. If you were charged and haven’t seen the refund show up on your credit card / bank statement yet, try calling your bank. Lots of places take a day or two or three or even four to update their statements even if the money’s already back in, but they should see it (by tomorrow for sure) if you call them.

If this/these erroneous charge(s) by us resulted in you having any sort of overdraft/bounced check/nsf fee from your financial institution, please contact our support team from the web panel. We’d just like to request that you include a copy of your statement with the necessary info showing the fees. It can be either a paper statement or a print out of your online statement, or even a screenshot of your online statement and it can be scanned and attached to your support message via our support form or faxed to us at 714-990-2600. If you fax it, please be sure to write your domain name or DreamHost account number on the fax. When we get this, we will put money on your credit card equal to the amount your bank charged you, as well as give you a DreamHost account credit for the same amount on top of that.

Another thing… if you’ve decided because of this fiasco you’d like to cancel hosting with us, we will allow you to get a full credit card refund of any unused portion of your pre-paid contract, even if you’re past our standard 97-day money-back guarantee. To do so, just close your account as normal from our web panel (“Billing > Manage Account” area). Then, after it’s done, write in to support and let them know you’d like to get your remaining account credit refunded to your credit card due to the billing snafu of January 15th and we’ll be happy to comply.

Checks to Protect Your Balances

Finally, here are the precautions we’ve now added to our billing system to make sure nothing like this ever happens again:

1. Our biller service will no longer accept a date in the future.
2. This whole time, we did have an option to specify “never automatically bill me more than $X in a day” on our web panel. Of course, not too many people had this set, and why would they have to? Nevertheless, we’ve made a change so that even if you don’t have a specific daily limit set, our system will not bill you, in a single day, more than 50% above the most you’ve ever authorized in the past.
3. Our rebiller does an automatic filling-in of old charges when it finds some missing. This should never actually happen anyway, but we’ve added a new check that if it ever finds itself filling in more than 3 missing charges on any account it stops immediately and notifies our financial team.
4. We’ve also added an overall check: if the total number of payments in a day is more than double the average number of payments we’ve gotten on that calendar day for the last seven months, it fails and notifies our financial team.
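For the programmers in the audience, the four checks above can be sketched roughly as follows. This is illustrative Python, not the real biller: every name here (BillingGuardError, check_run_date, and so on) is hypothetical, amounts are in cents, and how an explicit daily limit combines with the 50% rule when both exist is an assumption.

```python
from datetime import date

class BillingGuardError(Exception):
    """Raised when a safety check refuses to proceed."""

def check_run_date(run_date, today):
    # 1. Never run the biller as though today were a future date.
    if run_date > today:
        raise BillingGuardError(f"refusing future bill date {run_date}")

def daily_limit_cents(explicit_limit_cents, max_ever_authorized_cents):
    # 2. Use the customer's own limit if set (assumption); otherwise cap the
    #    day at 50% more than the largest amount they have ever authorized.
    if explicit_limit_cents is not None:
        return explicit_limit_cents
    return max_ever_authorized_cents * 3 // 2

def check_missing_charges(missing_count, max_fill=3):
    # 3. Stop if the rebiller would fill in more than 3 missing charges.
    if missing_count > max_fill:
        raise BillingGuardError(f"{missing_count} missing charges on one account")

def check_daily_volume(payments_today, average_for_day):
    # 4. Fail if today's payment count is more than double the average.
    if payments_today > 2 * average_for_day:
        raise BillingGuardError("payment volume more than double the average")

def is_blocked(check, *args):
    """Return True if the given check refuses, False if it lets the run proceed."""
    try:
        check(*args)
        return False
    except BillingGuardError:
        return True

today = date(2008, 1, 15)
print(is_blocked(check_run_date, date(2008, 12, 31), today))  # future date
print(is_blocked(check_run_date, date(2007, 12, 31), today))  # past date
print(daily_limit_cents(None, 10000))
```

Any one of these checks on its own would have stopped the run; stacking them means a mistake has to get past all four.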

And that’s it.. I hope this puts things more or less behind us. And remember, if you have any specific issues, our support team is always there!

And of course, my sincere apologies for all of this.

Thanks,
Josh Jones

P.S. I apologize for that joke about the triple billing in the newsletter thing too, but you have to admit, it was kind of ironic that I actually did screw up billing less than a week later.

P.P.S. Some of you have attempted to email us directly with information about unresolved issues stemming from this billing fiasco and have received autoresponders telling you you can’t email us directly. That restriction was unintentional and has now been removed, so please re-send us your email if you have not already contacted us through other means.

Filed Under: Foobars, Updates

Um, Whoops.


The $7,500,000 finger.

Hello.. how’s your morning going?

I hope it’s been a little better than mine.

We had a teensy eensy weensy little billing error last night… my first clue something was up was when I saw this morning’s daily billing report (so far): $7,500,000.

It turns out due to my excessively fat fingers, nearly every one of our customers has been seriously over-billed in the last 12 hours.

I bet when you read this part of the last newsletter:

4. New Office!

Another important thing I’ve been doing instead of writing newsletters
is looking out the window of our NEW OFFICE:

http://blog.dreamhost.com/2007/12/21/were-so-high-right-now-you-dont-even-know

If your next web hosting bill from us is mysteriously tripled, now you
know why.

.. you thought it was a joke!

Ha, the joke is on you! I guess. Um, okay, no, not really, I’m sorry.

How on earth could something like this happen?

Let Me Explain

A couple of weeks ago, just around New Year’s, we started beefing up some of our internal “controller” servers. These are the machines that run all of our “behind-the-scenes” services; things from adding a user to registering a domain to configuring apaches to rebilling customers.

I was on a little-bit-too-long vacation, but when I got back, I noticed our daily credit card payments seemed a tad low in the new year.

So, late last week I tried re-running the billing services for all the days back three weeks or so. I knew this was safe, because after 10 years, the one thing you DO get perfect is your billing system. Our biller is pretty bug-free and robust at this point, because we’d be broke and eating bugs if it weren’t.

In fact, it’s so robust you can just run it on any day you want, and it’s safe. It won’t double-charge people and it’ll even automatically find any missing charges and catch everything up to the day you said.

Anyway, I ran it, and things were fine.. and sure enough, it caught a lot of missed payments. I didn’t have time to look into it right then, but I made a note to myself to check up on it on Monday (yesterday) and see if things were fine or still messed up.

And a terminal case it is.

Come Monday

Monday came. I checked the reports and sure enough, things were still pretty low. So I looked at the logs for some of the biller services, and I noticed they were only failing on the machines that had been recently upgraded!

That explained why we were getting some money still (since not all the controllers have been upgraded yet), but not all of it.

Anyway, it turned out there was no 64-bit version of the PFProAPI module we use to interface to the credit card transaction server. No big deal, there’s a new module that interfaces with their new and preferred HTTPS interface, and it was only a couple of lines of code to change to get us switched over!

So anyway, I made the change, and it worked, and I even tested it, and things were fine!

But then… late last night, I realized: when I re-ran those biller services last week, they must not have fixed everybody then either! It’s just that by running it again I randomly got different people being charged on the working controllers who had been assigned an upgraded (and therefore broken) one before.

So why not just run it all one more time?

Sure, it should be no problem! So I did, manually running the biller (which is normally automatically scheduled) for 2008-01-14, 2008-01-13, 2008-01-12, 2008-01-11, 2008-01-10, 2008-01-09, 2008-01-08, 2008-01-07, 2008-01-06, 2008-01-05, 2008-01-04, 2008-01-03, 2008-01-02, and 2008-01-01.

I probably should have just stopped there. But then I thought better. I thought to myself, “When did we start upgrading these controllers anyway?”

I couldn’t remember. But, since the biller is super-safe and robust anyway, I went ahead and ran it for 2008-12-31, 2008-12-30, 2008-12-29, 2008-12-28, 2008-12-27, 2008-12-26, and 2008-12-25, just for the hell of it.

Notice Anything?

Don’t feel bad if you didn’t. I kind of missed it myself.

THOSE SHOULD HAVE BEEN 2007!!

Heh, uh.. um, er.. my bad?

So what happened?

Well, that super-robust and stable biller did what it was programmed to do: it ran as though today was December 31st, 2008!

And what did it see? Well, it saw a whole lot of accounts (essentially all of them) who for some unknown, mysterious reason hadn’t been charged at all for eleven and a half months!

So off it went, busily through the night, “fixing” everything up for “today”, December 31st, 2008.
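A toy version of that catch-up logic shows the effect. This Python sketch assumes simple monthly billing and a hypothetical missing_charges helper; the real rebiller is more involved, and none of these names or dates come from it.

```python
from datetime import date

def missing_charges(last_billed, as_of):
    """Number of monthly charges due after last_billed, up to and including as_of."""
    months = (as_of.year - last_billed.year) * 12 + (as_of.month - last_billed.month)
    if as_of.day < last_billed.day:
        months -= 1  # this month's bill date has not arrived yet
    return max(months, 0)

# A hypothetical account, billed monthly on the 10th and fully paid up.
last_billed = date(2008, 1, 10)

intended = missing_charges(last_billed, date(2007, 12, 31))  # the date that was meant
actual = missing_charges(last_billed, date(2008, 12, 31))    # the date that was typed

print(intended, actual)  # prints: 0 11
```

Run for the intended 2007 date, a paid-up account has nothing to catch up on. Run for the 2008 date, the same account suddenly looks eleven charges behind, and the biller dutifully fills them all in.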

Really, it’s sort of amazing this never happened before in the last ten years.

We have a NEW SUPPORT RECORD!

There IS a bug here.

I can imagine the half second or so of thought that sprinted through the programmer’s mind when he was adding the ability to allow you to pass in what day to run the biller as though today is:

Hmm.. well, I could see us POSSIBLY wanting to be able to bill for a future date.

Well guess what… NO! We will NEVER want to rebill as though today were a day that hasn’t happened yet! But instead, somebody along the line (Sage? Me? Somebody else?) figured, “What’s the harm in keeping it flexible?”

About $7,500,000 in harm, that’s what!

The serious part.

The end to this story is that of course, I’m very very sorry, we’re very very sorry, and I’m sure you’re very very sorry this happened. I really am. I understand the sort of problems that an unexpected large charge to your credit card (or worse yet, your debit card) can cause. If the tone of this blog post seemed a little light, I apologize; I don’t mean to offend, and I realize how serious an issue this is. I’ve been up since 3:50am trying to undo the damage and maybe I’m a little shell-shocked.

A new service is running right now (in parallel on all the controllers) that fixes all those future charges, re-enables your account if it was erroneously suspended, and if your credit card was automatically rebilled, refunds the payment automatically. You don’t have to contact us or your bank, and you’ll get an email when your account is finished fixing up. It’s going to take several more hours to complete. There are (or were, after this incident) a lot of you these days!

If, because of this billing mistake, you somehow incurred some fees from your bank or credit card company, please let us know after tomorrow (today we are just replying to all 10,000+ billing messages with a generic explanation) and we’ll do our best to make it right for you.

And of course, the biller no longer allows dates in the future.

The moral of this story is that “flexibility” is rarely desired in programming! The less a program will accept/the less a program will do/the fewer options and preferences it has, the more usable it is/the more understandable it is/the more stable it is.

Tough Love

I wouldn’t want him to compile me!

When designing a program, you’ve got to make some tough decisions .. and when you really can’t decide if this is something your users will need someday, err on the side of leaving it out.

Otherwise, your users will someday err on the side of your face.

Filed Under: Foobars, Insider View, Musings