Lessons from Africa, Part 2

August 13th, 2012

Last week was busy: making travel arrangements for this week’s trip to New York (technically Jersey), some light analysis of AWR reports from Exadata RAT runs, and some heavy troubleshooting of a Solaris x86 RAC cluster with random node reboots. (I think I finally traced the node reboots to a kernel CPU/scheduling problem.) I really did thoroughly enjoy my time in Africa despite being nowhere near Oracle software – but it feels good to be working on challenging cluster problems again!

Before I completely forget the details from my work in Africa, I want to wrap up my article about high-level lessons learned earlier this week. By the way, I’m not just stretching obscure aspects of my work in Africa to get stuff that sounds good. I view these cultural lessons, which we learned together, as the most central and most important technical aspect of my work at the hospital. And it might be surprising, but it’s true: the same cultural adjustments are important and often missed here in corporate America.

The first two lessons were to (1) understand the fundamentals and (2) avoid unjustifiable complexity. The remaining two lessons I want to talk about are slightly less technical but equally important.

  1. People first, Technology second

    I’m using the word “people” here to sum up three major components of our accomplishments at the hospital: organizational policy, user education and technical training.

    First, a fun technical story:


    On day 1 after our arrival at the hospital, several issues required immediate attention – I discussed one example in the previous article. A second issue involved the network links to the outside world. In particular, sites like Gmail were often completely unreachable. I worked hard on this one. Eventually I got a reproducible test case by connecting to an AWS instance very close to the other end of their link. I compared low-level packet traces from both sides to see what was happening.

    I could initiate a very large download and throttle the connection on our end, causing TCP window full messages to cascade all the way up to the AWS source server – normal behavior for throttling connections. But I did notice that the source server sent a very large chunk of data before it started throttling to the same rate that I was demanding.

    Next I opened a second connection with a different protocol on a different port. The throttled connection was using only a small fraction of our contracted bandwidth – but every single packet on the second port took a consistent 60 seconds or so to get through. If I killed the download then things would zoom at normal high speeds again.

    The killer was that the TLS handshake in HTTPS connections needed a response to the initial packet within 30 seconds in order to continue. With every packet delayed by about 60 seconds, that deadline was never met. Small file downloads had no impact – but if a download was large enough, then HTTPS quit working completely – although plain HTTP would still (slowly) connect.

    My best guess was that this network provider (not an African company BTW) was running some enormous cache in the middle combined with a single packet queue per customer – no fair queuing by TCP connection. It doesn’t make complete sense but I haven’t yet thought of anything better. Maybe those other packets were just stuck in line on a huge cache throttled by packets in front of them? I actually didn’t think it was possible to configure network equipment to be this stupid… but maybe? Whatever the cause, the end result was that a large download – even throttled – generally trashed our whole network connection.
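The guess above can be sketched in a few lines of Python. This is only a toy model, not a reconstruction of the provider's equipment: the function names, the 1 Mbit/s drain rate and the 7.5 MB backlog are hypothetical numbers picked to produce a 60-second wait, and simple round robin stands in for whatever fair-queuing scheme a real device might use.

```python
from collections import deque


def fifo_wait_seconds(backlog_bytes, drain_rate_bits_per_s):
    """Seconds a newly arrived packet waits behind a backlog in one shared FIFO."""
    return backlog_bytes * 8 / drain_rate_bits_per_s


def fifo_order(packets):
    """Single queue per customer: packets leave in arrival order."""
    return list(packets)


def fair_order(packets):
    """Per-flow fair queuing (simple round robin): one packet per flow per turn."""
    flows = {}
    for flow, seq in packets:
        flows.setdefault(flow, deque()).append((flow, seq))
    order = []
    while flows:
        # Iterate over a copy of the keys so drained flows can be removed.
        for flow in list(flows):
            order.append(flows[flow].popleft())
            if not flows[flow]:
                del flows[flow]
    return order


# Hypothetical numbers: 7.5 MB already buffered, drained at a throttled 1 Mbit/s.
wait = fifo_wait_seconds(7_500_000, 1_000_000)
handshake_timeout = 30  # seconds, the handshake limit described in the story

# Five download packets are queued before one TLS handshake packet arrives.
arrivals = [("download", i) for i in range(5)] + [("tls", 0)]

print(f"wait behind backlog: {wait:.0f}s, "
      f"handshake survives: {wait <= handshake_timeout}")
print("FIFO position of TLS packet:", fifo_order(arrivals).index(("tls", 0)))
print("fair-queue position of TLS packet:", fair_order(arrivals).index(("tls", 0)))
```

In the shared FIFO the handshake packet leaves only after every download packet queued ahead of it, so its delay grows with the size of the backlog. With per-flow queuing it leaves right after the first download packet, no matter how large the download is.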


Lessons From Rural Africa

August 8th, 2012

It has been nine months since I’ve written here. Needless to say, a lot has happened! First, my family was living in Africa for three months earlier this year while I did some tech work at an NGO hospital. Second, upon our return I decided to join the good people at Pythian. I’m not moving [...]

Making Simple Performance Charts

November 28th, 2011

Before I dive into this blog post, quick heads up for anyone attending UKOUG: on Tuesday only, I’ll be hanging out with some very smart people from the IOUG RAC Special Interest Group in the “gallery” above the exhibition hall. We’re ready to help anyone run a RAC cluster in a virtual environment on their [...]

RAC Attack in OTN Lounge at OOW11

September 21st, 2011

Want to get your hands on a key technology in both the Exadata Database Machine and the newly announced Oracle Database Appliance? If you’ll be at OpenWorld – in just 11 days – then the IOUG RAC SIG is putting together a special event for you!  (You might have already heard about this on Twitter [...]

Performance Tuning for Oracle Developers

August 30th, 2011

One of my recent customers was a company with a somewhat large warehouse (around 60TB) on Oracle 10gR2.  The system was using RAC, though it was a fairly simple setup: two nodes, very large AIX LPARs, workload manually partitioned between them and somewhat evenly balanced.  The most important demand of their business is a large [...]

Developer Access To 10046 Trace Files

August 19th, 2011

Let’s suppose you are a DBA at a large company. You have some great developers, and they’re learning all about how to turn on full logging of their code through the 10046 database trace. They just learned how to use this data in summary form to find out – at a very detailed level – [...]

Finally on Twitter

May 24th, 2011

Overheard in an IRC chat room (Freenode#oracle) this morning…
[24 May 11 09:30] * cheboygan: glad to see that someone read the blog post though. at least i know one person read it.
[24 May 11 09:30] * rizzo: it got re-tweeted a lot last week
[24 May 11 09:30] * rizzo: get yourself on [...]