Before I started consulting, I was an Oracle engineer in a very large software development organization. The company had a number of major products and the one I worked with was used by hospitals and radiology offices worldwide. (These guys are one of the biggest companies worldwide in the field.) Our product included the hardware and software; you could order it with Sun or HP hardware (Solaris or Linux). It had an Oracle backend and a web-based middle tier built with lots of C++ and Java code.

Any software engineer who has worked on large projects - industry or community - can tell you the importance of solid change control processes. So… since everything for the build had to be checked into ClearCase… yes, we checked Oracle 10g into the repo. A 2G tarball. And whenever there was a 1K patch to the Oracle install? A brand new 2G tarball. The ClearCase guys loved me.

And that was how I started thinking about an automated build process. I don’t need to check B24792-01_1of5.zip into the repository because it’s straight off eDelivery. I can skip p7150622_10203_Linux-x86-64.zip since it’s direct from MetaLink. The only thing missing was a solid, simple, flexible program for automating the Oracle install, patchsets, CPUs and one-offs - taking Oracle’s official bits as input.

Anyone else out there who could use a program like this? How about for rapid provisioning of servers? (Like all that grid buzz.) Grid Control does some basic stuff (if you can make it work) and there are advanced kits for the big data centers - but there have to be more people than just me who would love to have this program.

Proposing orainstall

That’s why I’m proposing to write a bit of software called orainstall. It would be script-based, cross-platform and intended for community use. I have some ideas for a design but I’m really interested to hear how you might use something like this and what features might be useful to you. Do you think it’s a good design? I’d also like to hear whether you’d want to use it or help test it.

The program would be launched by running…

Is asmlib obsolete on a modern Linux system? I’m still undecided but starting to lean toward “yes”.

Everybody knows that asmlib was very useful when it was first introduced with Oracle 10.1 to simplify a host of issues on Linux: direct async device access without raw devices, file permissions & ownership without custom code, and persistent device naming without devlabel.

But I’m now involved in setting some standards to be used across a large organization for Oracle 10.2 on RHEL5 and I’m wondering if there’s still a case for using asmlib. So I did a little trolling for info - which was surprisingly sparse. Had a hard time finding much, but after a lot of digging I think I’ve compiled a useful bit of information about benefits and drawbacks.

Benefits

It seems to me that the ASMLIB API was originally introduced to do more than just simplify file permissions - sounds like it was an alternative I/O API to the standard unix one, allowing ASM to access the underlying storage more efficiently and completely. I don’t think it’s just an “extra layer” - it’s an alternative code path to the std unix I/O libs. Like an ODM for block devices - and the idea was that there could be additional vendor implementations. And Oracle released an initial generic implementation on Linux under the GPL.

So Collaborate is over and I’m back in Chicago… home sweet home. I thoroughly enjoyed the week in Denver, in spite of the snow! Thursday, the last day, was especially fun.

First was a panel debate “To RAC or Not To RAC: What’s Best for HA.” Dan Norris invited me to participate in this panel along with Alex Gorbachev (Pythian), Neil Greene (Predictive Technologies) and Matt Zito (GridApp) - I certainly felt privileged to take part. Here are a few of the things that I took away from the debate… feel free to discuss:

Just a quick post to say that I’ve uploaded the slides from my services presentation at Collaborate and you can find them over on the publications page. Thanks to everyone who attended!! Great questions and comments throughout the session. Next time I’ll try to get through everything faster so that there’s more time for Q&A!

Next week sometime I expect to upload the instructions from today’s Hands-on Lab (11g RAC on VMware with ASM and OEL5). I want to clean it up a bit first. For anyone who didn’t hear the story, I heard on Sunday evening that they had a room with about 50 computers which was going to sit empty during a few timeslots. And to me that seemed like a tragedy - how often do I wish I could have a chance to try something new with a bit of guidance and without worrying about hosing my laptop?! So I decided to write a hands-on lab for 11gRAC/VMware. I pretty much spent all day yesterday putting it together but it came out great!

Also I’m thinking about repeating that RAC/VMware lab around Chicago sometime… is there anyone around the Chicago area who might be interested in something like this?

Guess I should also mention that I’ve been rather enjoying myself at Collaborate so far too! (Even though I spent pretty much all day yesterday making that Hands-on Lab…) Got a chance to meet Peter Scott Monday night - that was fun! Somehow he spotted my name badge and then I got to finally put a face to someone I’d only known as a blogger. :)

About a month ago I wrote an overview of Linux Caching and I/O Queues as they pertain to Oracle. I was working on a project to architect, install and configure the beginnings of an 8-node cluster consisting of either one or two RAC databases. During the project, while I was waiting for the OS guys to resolve some networking issues, I ran a bunch of benchmarks on the storage subsystem. Specifically, I experimented with the size of the HBA Queue Depth to see if it would make a difference in performance.

But before getting into the results, a quick overview of our configuration: it was 11g RAC on Red Hat Enterprise Linux 5; Dell servers with four dual-core Opteron chips each. The RAC cluster initially had four nodes but will grow to at least eight as the data is migrated. The system has 4G QLogic cards, a McData switch and a 3Par SAN (which is blazing fast). ASM (no CFS) and dedicated Oracle Homes. The first spec had an InfiniBand interconnect but after a teleconference with Alex from Pythian discussing the project’s specific requirements, the spec was updated to use redundant Gigabit Ethernet.

Picking up where I left off: the default limit set by the Linux qla2xxx driver for concurrent I/O requests on QLogic cards (32 per LUN) is conservative. So can I increase performance by increasing this limit? The best way to answer a question like this is simply to try it.
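Before changing anything, it helps to see what queue depth each LUN is currently running with. The following is a minimal sketch, assuming a 2.6 kernel that exposes the per-device value at /sys/block/<dev>/device/queue_depth; the function name read_queue_depths is made up for illustration and is not part of any tool mentioned in this post.

```python
from pathlib import Path


def read_queue_depths(sysfs_block="/sys/block"):
    """Return {device_name: queue_depth} for block devices that expose one.

    Assumes the sysfs layout /sys/block/<dev>/device/queue_depth used by
    SCSI devices; devices without that file (loop, ram, ...) are skipped.
    """
    root = Path(sysfs_block)
    depths = {}
    if not root.is_dir():
        # Not Linux, or sysfs is not mounted where expected.
        return depths
    for dev in sorted(root.iterdir()):
        qd_file = dev / "device" / "queue_depth"
        if qd_file.is_file():
            depths[dev.name] = int(qd_file.read_text().strip())
    return depths
```

Calling read_queue_depths() on a host and printing the result shows the effective depth per device, so you can confirm that a change actually took effect before running a benchmark. The driver-wide default is normally controlled through a qla2xxx module option (ql2xmaxqdepth in the driver versions I am aware of), but check the documentation for the driver version you have installed before relying on that name.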

The trouble with Linux? No… the trouble with computers in general - is that they keep changing! Solaris 10 comes out, Oracle 11g, Red Hat 5… and everything works differently!! It’s a full-time job just trying to keep up with everything.

Almost exactly one year ago I wrote about using udev on 2.6 kernels to set the proper permissions for Oracle RAC. Two weeks after that post (March 14) Red Hat Enterprise Linux 5 was released and changed everything.

In my original post, I demonstrated how to create a PERMISSIONS file that udev would use when creating the device nodes. This worked on RHEL4 and SLES9. However this week I’ve been helping a client deploy 11g RAC on a RHEL5-based cluster - and I remembered that the PERMISSIONS facility was removed from udev in RHEL5. Seems like I remember reading something about having a single source of configuration for udev, which makes sense… so maybe they picked the RULES. (You’ll remember from my previous post that RULES are processed right before PERMISSIONS.) This is just as well since RULES are actually quite a bit more powerful than PERMISSIONS.

So on RHEL5 and OEL5 - in order to conform to Linux Best Practices - we now have to set correct RAC file permissions using udev RULES. To get started, we need to review how RULES work. The udev manual page gives a good overview of rules processing. But of course there are plenty of great tutorials that go deeper if you’re looking for more.
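As a rough illustration of the RULES approach (a sketch, not the rule set from the full post), a single rule line that matches a device node and assigns its owner, group and mode can be generated like this. The device name sdb1, the oracle/dba owner and group, and the helper name udev_permission_rule are all assumptions chosen for the example; KERNEL, OWNER, GROUP and MODE are the standard udev match and assignment keys.

```python
def udev_permission_rule(kernel_pattern, owner, group, mode):
    """Build one udev rule line that sets ownership and mode on matching nodes.

    kernel_pattern is matched against the kernel device name (e.g. "sdb1").
    mode must be a four-digit octal string such as "0660".
    """
    if not (len(mode) == 4 and all(c in "01234567" for c in mode)):
        raise ValueError("mode must be a four-digit octal string, e.g. '0660'")
    return (
        f'KERNEL=="{kernel_pattern}", '
        f'OWNER="{owner}", GROUP="{group}", MODE="{mode}"'
    )


# Hypothetical device and account names, for illustration only.
print(udev_permission_rule("sdb1", "oracle", "dba", "0660"))
```

The printed line would go into a file under /etc/udev/rules.d/ (the file name is up to you). Matching on the kernel name alone only works if device names are stable across reboots, so a real rule set will likely need a more specific match.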

Well it’s been a while since I’ve written anything for the blog - during the past four months I went on a trip to Asia, celebrated Thanksgiving and Christmas with both my family and my girlfriend’s family and then in January I got engaged! I’ve also been working on some continuing educational goals - so needless to say I’ve been keeping busy. And I will continue to be very busy over the next few months of wedding planning so I probably won’t be writing too much.

This post is just another interesting case study from a customer I’m working with right now. We were looking at various queues in the I/O stream and wondering how we might be able to tweak them. I was mainly investigating the host HBAs - but in the process of digging into this I also learned about a few other queues and Linux internals in general as they relate to Oracle.

Two days until I leave for Asia. (You have no idea how busy I’ve been for the past month!) If anyone’s curious about trip details I’ve posted about it over on the non-technical side of my blog.

http://www.ardentperf.com/2007/09/27/two-days-until-asia-trip-update/

Last night I posted a case study where I used the AWR (a blessed new feature) to investigate “gc buffer busy” wait events in a RAC environment. I concluded the write-up by theorizing that the single freelist was pointing all nodes of the cluster to the same small group of blocks for inserts and thereby causing the blocks on the freelist to always be subject to unearthly contention across the cluster.

One common piece of advice for gc buffer busy waits is to treat them like regular buffer busy waits. Because essentially that’s what they are - a buffer busy wait on a remote instance. So another avenue of investigation is to look at what might be causing buffer busy waits across the cluster.

Some people may remember that back in the days before YAPP and the wait interface, latches were usually where the purported “experts” looked when you had performance problems. Particularly those two infamous latches cache buffers chains and library cache. And of course today these are still an important part of any in-depth investigation and V$LATCH even includes wait time so you can take a time-based approach to analysis. I spent some time yesterday having a look at the latching in this RAC system and it yielded some results that I thought might be interesting to post. So here goes…

Well I don’t have a lot of time to write anything up… sheesh - it’s like 10pm and I’m still messing with this. I should be in bed. But before I quit for the night I thought I’d just do a quick post with some queries that might be useful for anyone working on a RAC system who sees a lot of the event “gc buffer busy”.

Now you’ll recall that this event simply means that we’re waiting for another instance that has the block. But generally if you see lots of these then it’s an indication of contention across the cluster. So here’s how I got to the bottom of a problem on a pretty active 6-node cluster here in NYC.

Using the ASH

I’ll show two different ways here to arrive at the same conclusion. First, we’ll look at the ASH to see what the sampled sessions today were waiting on. Second, we’ll look at the segment statistics captured by the AWR.

First of all some setup. I already knew what the wait events looked like from looking at dbconsole but here’s a quick snapshot using the ASH data from today:
