The fight for stable Private Servers

March 8, 2010 on 4:05 pm | In Insider View, Musings, Rants, Tech News by jeremy | 19 Comments

As I’m sure some of you have noticed, the stability of some of our PS servers has been spotty at best since roughly the end of November.  What started out as an emergency kernel upgrade to fix some pretty serious newly-released exploits turned into months of non-stop bug hunting that turned up not one bug, as we’d originally thought, but 4!  To make matters even worse, these 4 bugs were spread across 4 separately distributed pieces of the kernel, which meant there was hardly anyone outside DreamHost likely to have run into our particular combination of issues.

The first symptom we noticed was that some hosts (ok, a lot of hosts…on the order of 30/day) were simply rebooting themselves.  The problem here was they were rebooting so quickly that most of the time they hadn’t even stored any logs related to what was going on!  After closer inspection and a bit of luck, we found the dreaded “PANIC” string in their kernel logs, and it pointed at an out-of-memory condition.  Here’s the thing: normally when a server runs out of memory, it’s a Really Bad Thing.  When you’re talking about a virtual server, however, things are a bit less “doomsday scenario”, since only that one guest should suffer.  It turns out that the Linux-Vserver patch we were using was failing to check which part of the system had just run out of memory, so if any single guest ran out, BOOM.  Down went the host (we have them set to automatically reboot in such cases to speed their recovery).

Incidentally, the semi-panic caused by the lack of logging for such an immediate crash prompted us to write a new system that lets us remotely log all sorts of debugging activity, so we can always be sure it’ll be available for later use.  With any luck, we’ll never again be delayed in fixing a stability issue for lack of information.
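
For the more technically inclined, here’s roughly what “get the log off the box before it dies” can look like.  This is only a minimal Python sketch of the general idea, not a description of the system mentioned above: it assumes a syslog collector listening on UDP somewhere else on the network, and the function name, logger name and port are all made up for illustration.

```python
import logging
import logging.handlers


def make_remote_logger(host, port=514, name="ps-host-debug"):
    """Return a logger that ships every record to a remote syslog collector.

    `host`, `port` and `name` are hypothetical; point them at whatever
    collector you actually run.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep records out of any local root handlers

    # With an (address, port) tuple SysLogHandler sends over UDP by default,
    # so each record leaves the machine as soon as it is emitted.
    handler = logging.handlers.SysLogHandler(address=(host, port))
    handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

Something like `make_remote_logger("logs.example.com").warning("guest 42 is low on memory")` would then leave a copy of the message on the collector even if the sending host reboots a second later.  UDP is fire-and-forget, so a sketch like this trades guaranteed delivery for never blocking a machine that is already in trouble.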

So after fixing the suicidal servers we’d been dealing with (that first bug took about a week to track down and roll out fixes for), we were feeling pretty relieved.  Then we noticed that while we were no longer having 30 machines crash every day, we still had 20!  CRAP, we thought, what else could be wrong here?  Thankfully (and thanks to the new-fangled remote logging system!) it didn’t take long to see that it was a bug in one of the security-related patches we use.  So off we went to upgrade to the latest release, which had already fixed the bug (how lucky was that???).

And that’s where bug #3 comes in.  On one of our average PS hosts, we almost always see around 30,000 file handles in use at any given time (a file handle is basically what an application uses to read from or write to anything, be it a regular file, the network, or whatever the case may be).  After upgrading we noticed something weird: after just a couple of hours, file handle usage was TEN TIMES the usual.  In order to ease some aspects of management, we decided a while back to boot some of our servers off of network storage.  One of the kernel patches that makes that possible is called AUFS (Advanced Unification File System), and that turned out to be where the handles were leaking.  After much back and forth with its developer, we finally got a patch back that fixed the problem.  That took a couple more weeks (and yes, we’re moving away from that system entirely).
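
If you want to keep an eye on file handle usage yourself, Linux exposes the system-wide counters in `/proc/sys/fs/file-nr` as three numbers: handles allocated, handles allocated but unused, and the maximum.  Here’s a minimal Python sketch that reads them and flags the kind of tenfold jump described above.  The 30,000 baseline and the factor of 10 come from this post; the function and class names are invented for the example, and this isn’t the monitoring we actually run.

```python
from typing import NamedTuple


class FileHandleStats(NamedTuple):
    allocated: int  # handles the kernel has allocated
    unused: int     # allocated but not currently in use
    maximum: int    # system-wide limit (fs.file-max)


def parse_file_nr(text: str) -> FileHandleStats:
    """Parse the three whitespace-separated fields of /proc/sys/fs/file-nr."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return FileHandleStats(allocated, unused, maximum)


def read_file_handle_stats(path: str = "/proc/sys/fs/file-nr") -> FileHandleStats:
    with open(path) as f:
        return parse_file_nr(f.read())


def looks_like_leak(stats: FileHandleStats, baseline: int = 30_000,
                    factor: int = 10) -> bool:
    """True when handles in use reach `factor` times the usual baseline."""
    in_use = stats.allocated - stats.unused
    return in_use >= baseline * factor
```

Calling `looks_like_leak(read_file_handle_stats())` from a cron job every few minutes would be enough to catch a leak that takes hours to build up.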

Phew, 3 kernel bugs.  What are the chances, right?  After all, we didn’t make THAT big a jump in order to fix the security holes.  We were feeling pretty unlucky, but at least the problems were finally behind us.

That’s when we noticed that we were still having about 10 hosts crash every day (before the upgrade we’d maybe see 2-3 crashes per WEEK).  Unlike the old crashes, we no longer saw any real pattern between the machines that were crashing and the ones that were stable.  Some used the AUFS code we thought might still be buggy, but some didn’t (the split was actually almost perfectly 50/50 every day).  All we knew for sure was that some trigger was spontaneously causing an entire machine to cease being able to process anything at all, requiring a heavy-handed reboot to fix.  We spent weeks talking with the Vserver developers, with our own in-house kernel developers (the guys working on the Ceph filesystem), and with anyone else who would listen.

The funny thing about bugs in other people’s software is that no matter how much proof you give them that YOU can trigger the bug, they’re rarely willing to put too much effort into fixing it unless you can show THEM how to trigger it themselves.  After a week of late nights and little sleep, we finally came up with a reproducible method of triggering the bug (for the more technically inclined, it involved a malloc() of just a bit more memory than was available to the PS environment, followed by an fread() to fill it up and trigger an OOM).  Even with code in hand that proved the bug was, in fact, to be found in the Vserver kernel patch (or potentially the main kernel, though we weren’t able to trigger it there), it was still another week before anyone was able to figure out exactly what was going on.  What made the bug so hard to find, and at the same time made it so obvious that it lived in either the mainline kernel or the Vserver patch, was the near-complete rewrite of a lot of the code that handles what happens when the server runs out of memory.
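
To give a feel for the malloc()-then-fread() trick mentioned above, here is a rough Python analogue (the real reproducer was C, and this is not it).  It assumes Linux, reserves an anonymous private mapping, which is what a large malloc() boils down to, and then fills it by reading from a file so that every page actually gets touched.  The function name, the `/dev/zero` source and the chunk size are our own choices for the sketch.

```python
import mmap


def reserve_then_fill(size_bytes: int, source: str = "/dev/zero",
                      chunk: int = 1 << 20) -> int:
    """Reserve `size_bytes` of anonymous memory, then fill it from a file.

    Returns the number of bytes actually written into the buffer.
    """
    # Like a large malloc(): the kernel hands out address space now and only
    # commits real pages when they are first written to.
    buf = mmap.mmap(-1, size_bytes,
                    flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    view = memoryview(buf)
    filled = 0
    try:
        with open(source, "rb", buffering=0) as src:
            while filled < size_bytes:
                end = min(filled + chunk, size_bytes)
                # readinto() writes straight into the mapping, the way
                # fread() writes into a malloc()ed buffer.
                n = src.readinto(view[filled:end])
                if not n:
                    break  # source ran dry before the buffer was full
                filled += n
    finally:
        view.release()
        buf.close()
    return filled
```

With a small size, say `reserve_then_fill(4 * 1024 * 1024)`, this is harmless.  Asking for slightly more than a guest’s memory limit is what pushes it into an OOM, so only try that inside a disposable, memory-limited environment you don’t mind losing.
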
As it turns out, when the Linux kernel kills a process in order to free up memory, it gives that process the highest priority it can and (this is the important part) a little bit of extra memory.  Yes, when a Linux server triggers its “OMG I’m totally out of memory!” routine, it’s not actually out of memory.  And this is where the Vserver patch comes in.  The way the patch is designed, it is impossible for a process to get that little extra bit of memory that’s sometimes required for it to die gracefully.  What happens in that case is you suddenly have a process with access to 100% of one CPU core that simply doesn’t have anywhere to go.  Once that happens, you can pretty much say goodbye to your server (and all the Private Servers it hosts).

The solution from the patch developers?  “Get rid of all our memory management and use the kernel’s built-in Cgroup support.”  And this is why we really like these guys.  A lot of software developers out there would let their egos get in the way and insist on coming up with their own solution.  These guys were happy to say “You know what?  The kernel already has a pretty complete mechanism for just this thing and we’d hate to duplicate all the functionality.”  And in case you were wondering, Cgroups are pretty new and didn’t exist when the first Vserver patches were developed.
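
For anyone who hasn’t played with Cgroups, the memory controller is driven by writing to plain files.  The Python sketch below shows the basic idea using the original (v1) file names, `memory.limit_in_bytes` and `tasks`; the directory you pass in is hypothetical and depends on where the memory controller is mounted on your system.  It illustrates the mechanism only, not how the Vserver developers or we wire it up.

```python
import os


def set_memory_limit(cgroup_dir: str, limit_bytes: int) -> None:
    """Cap the combined memory of every task in the group (cgroup v1 layout)."""
    with open(os.path.join(cgroup_dir, "memory.limit_in_bytes"), "w") as f:
        f.write(str(limit_bytes))


def add_task(cgroup_dir: str, pid: int) -> None:
    """Move a task into the group by writing its ID to the `tasks` file."""
    with open(os.path.join(cgroup_dir, "tasks"), "w") as f:
        f.write(str(pid))
```

A guest capped at 300 MB would be something like `set_memory_limit(group_dir, 300 * 1024 * 1024)` followed by `add_task(group_dir, pid)` for its processes, where `group_dir` is a directory you created under the memory controller’s mount point.  The point of the design is that the kernel does the accounting and the out-of-memory handling for the group itself, rather than a separate patch duplicating that logic.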

We’re still rolling out upgrades to some hosts on an as-needed basis, but the results are extremely promising.

19 Responses to “The fight for stable Private Servers”

  1. Warll Says:

    You guys should write more posts like this, they make good reads.

    “(for the more technically inclined, it involved a malloc() of just a bit more memory than was available to the PS environment, followed by an fread() to fill it up and trigger an OOM)”

    Wow, I understood all that and I have yet to touch a piece of C code.

  2. Rhett Soveran Says:

    Agreed. It’s much easier not to be angry when we get a human side and a thorough explanation.

  3. Greg Clute Says:

    w00t! I see a stable PS on the horizon!

  4. neb Says:

    Wonderful article. DH support has been very attentive to this issue, and I’m glad to hear there is a resolution.

    I’ve been evaluating DH PS for my (beta) game server, but these reboots kept killing my server. I’ve been tweeting my PS uptime and have yet to report over 15 days.

    Anyone else have experience running a (MMORPG) game server at DreamHost?

  5. Gosherm Says:

    While I’m not a PS customer (3 years Dreamhost customer though) as I just don’t have the need for it at this point, I want to thank you for your openness on this issue (and all others!). As stated above, this was a great read, and my heart (bits?) go out to you guys for this mess you’ve been dealing with.

    Congrats on getting it figured out!

  6. yonkeltron Says:

    No chance of Xen hosting instead? This sounds like such a pain. Thanks for being so great about everything!

  7. vicm3 Says:

    Or kvm? as I understand Xen is also a PITA and not in the main kernel tree. Anyway Xen would be nice… I lost track of openvz…

  8. Fred Says:

    I think Xen is the best one.

  9. dertyp Says:

    This level of openness is exactly why I love being hosted with you guys. Keep up the good work!

  10. Andrew F Says:

    @yonkeltron, @vicm3, @Fred: Xen and KVM are both definitely nice, but they have some significantly different properties from Linux VServer that make them not quite as great for what we’re doing. For instance, there’s a significantly higher memory overhead for each running guest under Xen and KVM (you really can’t get by with any less than 64 MB wired down for each guest), whereas the memory used by an idle Vserver is often under 16 MB. Similarly, it’s difficult to adjust the memory allocated to a Xen/KVM guest while it’s running, which is an important feature for many of our customers!

    So yeah, there’s reasons we aren’t using Xen or KVM for our Private Server product right now! We’ve continued to follow the development of both projects with interest, though… while they aren’t perfect for DreamHost PS, they definitely still have their own benefits.

  11. Dallas Kashuba Says:

    Yes, as Andrew F mentioned, we have been spending some time with both Xen and KVM. We actually use Xen for our internal package building system, which automatically builds our custom Debian packages across two releases and two architectures (etch and lenny, 32bit and 64bit)… it’s kinda neat and may deserve its own blog post.

    Anyway, though, we have been investigating using either KVM or Xen to supplement our use of Linux-VServer. We love how lightweight Linux-VServer is, but it’s not the best fit for every situation.

  12. Charlie Schluting Says:

    Epic story. Thanks for sharing it! Do more like this (but don’t introduce bugs just to have something interesting to write about, kay?) :)

  13. yonkeltron Says:

    @andrew but what if I want a proper VPS? I would rather buy from you!

  14. Alex Says:

    Nice! That was a really good read!
    This kind of explanation should come more frequently!
    It is really impressive how a few bugs can make such a mess!
    More interesting is to follow the daily fight of DH developers in order to keep things up and running.
    Thank you guys.
    DH is the better choice for hosting! :)

  15. Chad Says:

    Just wanted to thank you guys! I’ve seen a huge improvement in VPS reliability since you have worked on those 4 fixes.

  16. Pete Carapetyan Says:

    Can’t believe I read the whole thing, didn’t intend to. But that’s the power of a good mystery and honest disclosure. Go dreamhost.

  17. nike jordan shoes Says:

    This is just a theory but I think it’s pretty sound.Join us to start sharing your reviews, news about jordans in killersneakers.com.

  18. Stefan Says:

    While I do appreciate you guys being open I have to say my $0.02

    I’m a bit surprised it took you a week to do a malloc(), so you can repro the issue. I’m also surprised that after more than 45 days from this blog post, it’s not yet solved. And last but not least, I’m surprised it actually started in November, and it’s still not over.

    From my point of view, Dreamhost PS has been very bad. Constant crashing, sites down daily, etc. Overall a very bad experience. I don’t feel I should be happy and thankful, like some of the other guys here, that it’s not crashing as often, when we’re paying extra for it.

    Or perhaps I was wrong in expecting it to work more than it does (which isn’t very often).

    Anywho, time for yet another change.

  19. Mike Says:

    Wow! It is great to see an actual explanation – for me it is too late though. The worst thing about Dreamhost is the lack of communication when things go wrong. After failures got more and more frequent, and I kept asking “what happened”, I got NO EXPLANATION – just, “your site is back up now”. This led me to migrate most of my sites away from Dreamhost over the past few months. Hopefully they can start communicating better and not lose the rest of their customers.
