Announcement Listory

August 30, 2005 on 11:06 am | In Foobars, Insider View by Josh Jones | 17 Comments

The story of how a bad random number generator can result in 3 hour announcement list delays.

Some of you may use our Announcement List feature to send e-mail to a happy group of subscribers (happy because they all opted-in)!

Some of you may have noticed that recently it was taking a long time to send out announcements.. they’d generally go two to three hours after the time you’d scheduled them for.

Some of you may have been getting angry about this..

Please be happy, it’s fixed now!

The first few times people reported this we thought it was just a temporary problem, like maybe there were just a lot of messages going out and the servers couldn’t keep up. We weren’t able to reproduce the problem and generally if you can’t reproduce the problem it’s going to be too hard to find and fix to make it worth the effort.

Finally yesterday, we were able to reproduce the problem.. right in front of our eyes the mailing lists were going out two hours behind schedule! Hooray! Well, actually BOO! that’s bad … but also, Hooray! now we can maybe fix it!

It turned out the root of the problem was not the sending of the mailing list itself, but actually another thing the same mailing list sender script did.. send out confirmation emails to addresses people have manually added to the list from our web panel.

But first let me give you a little “Announcement List History”…

It used to be whenever somebody wanted to subscribe a bunch of people to their list from our panel, our panel would immediately attempt to send out the confirmation emails. This was fine when people were subscribing fewer than, say, 20 addresses.. but if you tried subscribing hundreds of email addresses right there in real time it would take so long that the panel would usually time out in the listmaster’s web browser!

This was no good because A. it looked bad, B. not all emails would always go out, and C. people would generally get scared and re-submit their confirmation list, thereby possibly “spamming” the very people we’re trying to make sure don’t get spammed!

Soon we implemented a better way. Instead of sending all the emails immediately, we’d just INSERT them into a database and then when our script ran to send announcements it’d also send the confirmation emails based on that table!

And everything was great for a few months!

Then, strangely we started getting reports the panel was timing out AGAIN! Why, God, why?! Well, it turned out even INSERTing thousands of emails into the database was too slow for the panel (which is a bit strange, but I guess not unreasonable).

So, to fix it this time, we created a temporary table that would just immediately store the whole list of addresses in just one INSERT. Then later (during the sending) we’d break that list into its thousands of individual components and INSERT them into the main table. (The reason we need that table at all is to track the unique “goop” … something like 005lcw1grDw5jA … for when subscribers verify their email by clicking the link in the email they get).

Back to the present.

As it turns out, the sending of actual announcements has been getting held up by the thousands of INSERTs the script was doing to send confirmations to people being added from the web panel!

Well, the first thing we did was separate these two scripts.. there’s no reason announcements need to wait on confirmation emails! That’s just dumb. So that fixed it.. but why were these INSERTs taking so long?

After some poking around, it turned out the problem was actually that “goop” stuff! You see, we need them all to be unique, and so this is the code we were using:

do {
    ## create the goop!
    srand(time() ^ ($$ + ($$ << 15)) ); #gets a nice random seed.
    my $p = rand();

    my @chars = ('a'..'z', 'A'..'Z', '0'..'9');
    my ($salt) = $chars[rand($#chars)] . $chars[rand($#chars)];
    ($goop) = crypt($p, $salt);
} until ($db->Insert('mailinglist_approve',
                     ['goop','address','name','list','domain','sub_date'],
                     [$goop,$address,$name,$self->address,$self->domain,sref('now()')]));

Basically, we’d get some random goop, try and INSERT it into the table, and if that goop was already in there, it’d fail. Then we’d just create a new random goop and try again. Given the number of potential goops and the number of entries in the table at any given time, we should basically NEVER have to INSERT more than once. This is nice and good except for one part:

srand(time() ^ ($$ + ($$ << 15)) ); #gets a nice random seed.
my $p = rand();

It turns out that because the seed is based on the current time, we were not getting a "nice random seed" every time we ran it. time() only changes once a second and the process ID ($$) stays the same for the whole run, so the seed, and therefore our goop, would only change once a second! That meant that after the first address in any given second, we would do dozens of failing INSERTs over and over and over until the goop finally changed. And those dozens (hundreds?) of INSERTs were making the table slowww...
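To see the effect in isolation, here is a minimal sketch rather than the production script. `draw_after_reseed` is a made-up helper that reseeds and then draws once, the way the loop did on every pass, and the seed is computed a single time up front to stand in for several passes that all land inside the same second.

```perl
use strict;
use warnings;

# Hypothetical helper: reseed the generator, then draw one number,
# mirroring what the loop did on every pass.
sub draw_after_reseed {
    my ($seed) = @_;
    srand($seed);
    return rand();
}

# Same seed expression as the original code. It is computed once here to
# simulate several passes that all happen within the same clock second.
my $seed = time() ^ ($$ + ($$ << 15));

my @draws    = map { draw_after_reseed($seed) } 1 .. 5;
my %distinct = map { $_ => 1 } @draws;

print scalar(keys %distinct), " distinct value(s) from 5 reseeded draws\n";
# prints "1 distinct value(s) from 5 reseeded draws"
```

Because the salt characters in the original code are also drawn from rand() after the same reseed, the whole goop repeats, not just $p.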

After a little bit of research on better random number generating techniques, we changed that code to be:

my $p = rand(`head -1 /dev/urandom`);

Which actually gives you a good random seed ALL the time. Immediately the number of INSERTs we were doing dropped dramatically and everything is now fast and happy!
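For anyone who would rather not spawn a shell for every goop, one alternative is to read bytes straight from /dev/urandom inside Perl. This is only a sketch under assumptions: `random_goop` and the 14-character length are made up for illustration, it assumes a system that provides /dev/urandom, and it is not the code running on the list servers.

```perl
use strict;
use warnings;

my @chars = ('a'..'z', 'A'..'Z', '0'..'9');

# Hypothetical helper: build a goop of $len characters from kernel random bytes.
sub random_goop {
    my ($len) = @_;
    open(my $fh, '<:raw', '/dev/urandom')
        or die "cannot open /dev/urandom: $!";
    my $got = read($fh, my $bytes, $len);
    close($fh);
    die "short read from /dev/urandom" unless defined $got && $got == $len;

    # Map each byte onto the 62-character alphabet. The modulo introduces a
    # slight bias (256 is not a multiple of 62), acceptable for a sketch.
    return join '', map { $chars[ ord($_) % @chars ] } split //, $bytes;
}

print random_goop(14), "\n";
```

Each call reads fresh bytes, so nothing depends on the clock or on when the generator was last seeded.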

And that's how a bad random number generator can result in 3 hour announcement list delays.

17 Responses to “Announcement Listory”

  1. Daniel Says:

    Josh… There’s one word that would probably best describe how you guys felt when you found out the problem MONTHS after it began… “PWNED!!!!!!”

    hahahaha…

    – Daniel

  2. Martey Says:

    I have never used announcement lists, but was impressed enough by this entry that I felt obligated to comment.

    What I really liked, was the depth of this entry. While other web hosts might have released short clinical posts saying something to the effect that “There was a problem with the announcement lists; it is fixed now,” it was far more interesting to see snippets of the actual problem code. Here’s to more explanation!

  3. Matt Says:

    Ahhh, sucks when this happens, but don’t you love that warm happy feeling when you figure out why something is erroring, and when you fix it? I wish you could buy it bottled..

  4. Steven Says:

    I agree with Martey. Thanks for the story!

  5. Mark Says:

    Great story, Josh! The bug is sneaky because the code seems to make goop just fine if you run it manually, outside the loop…

  6. Martin Says:

    Tricky thing, easy to miss. I guess, you will be slapping your head in just a moment :-)

    The actual bug is not the srand() call itself or its seed, but rather using it in the loop. srand() sets the seed for a *sequence* of random values. So you just set it *once*, and use rand() to request the next value.

    So I would change the script to use the old seed generator (it is good enough), but place it outside the loop. This will save you spawning a shell and a head command and a read from urandom on every iteration – this should make the script even faster.

    The random number generators from CPAN are also nice :-)

    And thumbs-up from me for not just posting a short “we fixed it” message.

  7. josh Says:

    Hey Martin,

    Good point about the looping! We didn’t really like having to do a system call there, but it seemed better than the alternative…

    I’ll check if moving seed outside the loop fixes things up even mo bettah!

  8. Chris Says:

    See, this is why I love Dreamhost. Like Martey said, you folk actually tell us what the hell is going on rather than treating us like mushrooms.

  9. Brian Says:

    a) thanks for fixing it
    b) thanks for describing the problem in verbose detail.

  10. shazow Says:

    Just thought I’d point out a couple of things:

    Then, strangely we started getting reports the panel was timing out AGAIN! Why, God, why?! Well, it turned out even INSERTing thousands of emails into the database was too slow for the panel
    [...]
    So, to fix it this time, we created a temporary table that would just immediately store the whole list of addresses in just one INSERT.

    Most databases support “INSERT DELAYED” for this very reason. What this does is it sends the query to the database, and instantly returns to the client without waiting for the database to finish processing the query.

    Also, instead of using randomly-generated “goop”, would an auto-incrementing integer id not suffice?

    :)

    - shazow

  11. riki Says:

    One feature I’d love to see on the Announcement Lists, would be the ability to give subscribers the option to specify areas of interest when they subscribe. That way we wouldn’t have to use multiple Announcement lists for different areas of our company and Users would only get info that they were interested in. Which could be good for DH as well.

    As my Mum used to say “Pay me don’t thank me!” :)

  12. Ben Says:

    Blech – don’t do that;

    As Martin pointed out, don’t call srand each time through the loop. It’s not necessary and can even be bad, depending on the arguments.

    More to the point, you a) likely don’t need to srand() at all — Perl 5.004 onwards calls it implicitly the first time rand is used — and b) if you call srand() without arguments, it’s going to do the right thing anyway:

    “…the generally acceptable default, which is based on time of day, process ID, and memory allocation, or the /dev/urandom device, if available.”

    ‘perldoc -f srand’ is your friend.

  13. Kelly Says:

    To reply to shazow, the problem with an autoincrementing ID is then people can reasonably guess what our “goop” is going to be. This means they could subscribe someone to the list just by subscribing themselves and someone else in rapid succession, and just trying a bunch of sequential numbers.

    We also couldn’t really call it goop at that point, and where is the fun in that? ;-)

  14. shazow Says:

    RE: Kelly
    Good point, just wouldn’t be nearly as fun without the goop :-)

  15. Louis Says:

    So, is anybody at DreamHost secretly working on a Rails version of the control panel, like how Apple was compiling OS X for Intel for years? ;)

  16. Daniel Says:

    POST MORE!!!! I’ve made this blog one of the websites I repeatedly check throughout the day while working (to look busy) and am disappointed every time I don’t find new posts!

  17. David Strauss Says:

    You guys need to fundamentally change the way your primary key and authentication mechanism work, not just improve the random number generator. You should be using a simple incrementing counter as the primary key. The “goop” code should be the primary key (or email) HMACed against a key you have somewhere on your server. Whenever someone clicks the link, your server HMACs the primary key against your private key and checks it against the link’s submission. A good hash makes it near impossible for someone to figure out your key and generate the codes themselves.

    Another way to fix your system is using a normal incrementing primary key and a random number in a separate field. When someone clicks the link to verify, look up the record and check if the separate random number field matches.
