Last week was busy… making travel arrangements for this week’s trip to New York (technically Jersey), doing some light analysis of AWR reports from Exadata RAT runs, and some heavy troubleshooting of a Solaris x86 RAC cluster with random node reboots. (I think I finally traced the reboots to a kernel CPU-scheduling problem.) I thoroughly enjoyed my time in Africa despite being nowhere near Oracle software – but it feels good to be working on challenging cluster problems again!
Before I completely forget the details from my work in Africa, I want to wrap up my article from earlier this week about high-level lessons learned. By the way, I’m not just stretching obscure aspects of my work in Africa to get material that sounds good. I view these cultural lessons, which we learned together, as the most central and important technical aspect of my work at the hospital. It might be surprising, but it’s true: the same cultural adjustments are important and oft-missed here in corporate America.
The first two lessons were to (1) understand the fundamentals and (2) avoid unjustifiable complexity. The remaining two lessons I want to talk about are slightly less technical but equally important.
- People first, Technology second
I’m using the word “people” here to sum up three major components of our accomplishments at the hospital: organizational policy, user education and technical training.
First, a fun technical story:
On day 1 after our arrival at the hospital, several issues required immediate attention – I discussed one example in the previous article. A second issue involved the network links to the outside world; in particular, sites like Gmail were often completely unreachable. I worked hard on this one. Eventually I got a reproducible test case by connecting to an AWS instance very close to the other end of their link, then compared low-level packet traces from both sides to see what was happening.
I could initiate a very large download and throttle the connection on our end, causing TCP window-full messages to cascade all the way back to the AWS source server – normal behavior for a throttled connection. But I did notice that the source server sent a very large chunk of data before it settled down to the rate I was demanding.
Next I opened a second connection with a different protocol on a different port. The throttled connection wasn’t using even a fraction of our contracted bandwidth – yet every single packet on the second port took a consistent 60 seconds or so to get through. If I killed the download, everything zoomed along at normal speed again.
The killer was that the TLS handshake in an HTTPS connection needs a response to its initial packet within 30 seconds in order to continue. Small file downloads had no impact – but if a download was large enough, then SSL quit working completely, although HTTP would still (slowly) connect.
My best guess was that this network provider (not an African company, BTW) was running some enormous cache in the middle combined with a single packet queue per customer – no fair queuing by TCP connection. It doesn’t make complete sense, but I haven’t yet thought of anything better. Maybe those other packets were just stuck in line in a huge buffer, throttled by the download packets queued in front of them? I actually didn’t think it was possible to configure network equipment to be this stupid… but maybe? Whatever the cause, the end result was that a large download – even a throttled one – generally trashed our whole network connection.
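The single-queue hypothesis is easy to illustrate with a back-of-the-envelope model. This is only a sketch with made-up numbers (a 50 ms per-packet service time on the throttled link and a 1200-packet buffered backlog from the download, neither measured): with one FIFO queue per customer, a small probe packet waits behind the entire backlog, while per-flow fair queuing would serve it after at most one packet time.

```python
TRANSMIT_TIME = 0.05   # seconds to send one packet on the throttled link (assumed)
BULK_BACKLOG = 1200    # packets already buffered from the large download (assumed)

def fifo_probe_delay(backlog, transmit_time):
    """Single shared FIFO queue: the probe packet waits behind every
    buffered download packet before it gets transmitted."""
    return backlog * transmit_time

def fair_queue_probe_delay(backlog, transmit_time):
    """Per-flow round-robin (fair queuing): the probe's flow gets the next
    service slot, so it waits at most one packet time regardless of the
    other flow's backlog."""
    return transmit_time

fifo = fifo_probe_delay(BULK_BACKLOG, TRANSMIT_TIME)
fair = fair_queue_probe_delay(BULK_BACKLOG, TRANSMIT_TIME)
print(f"FIFO probe delay: {fifo:.0f}s")       # scales with the backlog
print(f"Fair-queue probe delay: {fair:.2f}s")  # independent of the backlog
```

The numbers were picked only so the FIFO case lands near the observed 60 seconds; the real point is the shape of the two formulas. FIFO delay grows linearly with whatever the middlebox has buffered, which also explains the TLS symptom: once the queued delay exceeds the handshake’s 30-second patience, SSL dies while stateless HTTP retries limp through.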