July 27 at 1:16 am
When we looked closely at the dumps, we saw that not only
were all the updates on the queue from IMP 50, but they all had
one of three sequence numbers (either 8, 40, or 44), and were
ordered in the queue as follows:
8, 40, 44, 8, 40, 44, 8, 40, 44, ... Note that by the definition
of LATER, 44 is LATER than 40 (44 > 40 and 44 - 40 <= 32), 40 is
LATER than 8 (40 > 8 and 40 - 8 <= 32), and 8 is LATER than 44
(8 < 44 and 44 - 8 > 32). Given the presence of three updates
from the same IMP with these three sequence numbers, this is what
would be expected. Since each update is LATER than one of the
others, a cycle is formed which keeps the three updates floating
around the network indefinitely. Thus the IMPs spend most of
their CPU time and buffer space in processing these updates. - Paul Buchheit
via http://friendfeed.com/e/b8d29f... - Paul Buchheit
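As a rough sketch of the LATER comparison quoted above, the three sequence numbers can be checked in Python. The function name `is_later` and the `beats` table are made up for illustration and are not taken from the IMP code; the rule itself is only inferred from the inequalities in the quote.

```python
def is_later(n: int, m: int) -> bool:
    """True if sequence number n is LATER than m, using the rule as quoted above.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    return (n > m and n - m <= 32) or (n < m and m - n > 32)


# The three sequence numbers found on the queue, all from IMP 50.
seqs = [8, 40, 44]

# For each update, list the other updates it is LATER than (and so would supersede).
beats = {n: [m for m in seqs if m != n and is_later(n, m)] for n in seqs}
print(beats)  # {8: [44], 40: [8], 44: [40]}
```

Each number is LATER than exactly one of the others, which is the cycle the quote describes: no update ever counts as the newest of the three, so none of them is ever retired.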
It turns out the real lesson here isn't that you should put error detection on even the most inconsequential protocols, but that you must not allow any one process to take over all the resources on your system. - Gabe
gossip - Peter Dawson
Gabe: Error detection at every step is not bad either. Hard drives occasionally flip bits. Network errors slip through error-correction algorithms. When our 1960s IBM mainframe would stop unexpectedly, we'd open up the computer to look for dead insects. It wasn't bad code, just an insect. This carried over into my programming, where I ALWAYS expect bad things at every step. Been hit by lightning twice while on a keyboard. Sh-t happens! We got many awards http://fmsinc.com for extra code-checking. - Mitchell Tsai
"When our 1960s IBM mainframe would stop unexpectedly, we'd open up the computer to look for dead insects. It wasn't bad code, just an insect." Yeah, that's how "bugs" became the phrase for s/w flaws :) - Peter Dawson
People tried to save programming memory & time in the 1960s-70s by not validity-checking (in those days we EQUIVALENCEd memory to reuse it for multiple sets of variables, and programmers used 1-letter variables for compactness), and thus began the endless series of buffer-overflow leaks in Unix. Sigh... Like the don't-check-e-mail-source-address issues of TCP/IP routing which allow today's spam e-mails, simple decisions can have 40+ years of bad consequences. Much worse than the Year 2000 problem. :-( - Mitchell Tsai
Mitchell, you're right. Error detection is important, but that was just the secondary lesson of that story. The primary problem was that it was possible for one process to take over the machine. The bad bits wouldn't have caused as much of a problem if packets kept getting routed as usual. - Gabe
I think you all missed the pun I threw in: "On Sunday, we saw a large number of servers that were spending almost all of their time gossiping and a disproportionate amount of servers that had failed while gossiping. With a large number of servers gossiping and failing while gossiping, Amazon S3 wasn't able to successfully process many customer " - yeah, while everyone keeps "gossiping", when does anything really get done, eh? :) - Peter Dawson
Gabe: I'm suffering runaway processes on my new MacBook Pro (usually the Window Server). It used to be that sometimes 2-3x a day I would have to reboot (with 156% CPU usage and such nonsense; dual-core Intel, maybe?). It's much less stable & thrashes more than my 14" iBook (thinking of switching back). The process-takes-over-machine problem is still around today. - Mitchell Tsai

