I noticed this talk by my former colleague (at Netscape and AOL) Jim Roskind, who now works at Amazon.com. He gives a great introduction to the phenomenon of congestion collapse in complex queueing systems. His examples include familiar scenarios such as busy highways and phone call routing. He then shows how the same abstract pattern (congestion collapse under heavy load) can occur in the kinds of distributed computing systems that serve web sites and API endpoints.
We’ve seen this kind of behavior over and over again in our consulting engagements and have employed the mitigation strategies Jim outlines, as well as some more advanced techniques we’ve developed in-house.