Oracle – Ardent Performance Computing

Data Safety on a Budget

Jeremy — Sun, 05 Oct 2025 05:39:31 +0000

Many experienced DBAs joke that you can boil down the entire job to a single rule of thumb: Don’t lose your data. It’s simple, memorable, and absolutely true – albeit a little oversimplified.

Mark Porter’s Cultural Hint “The Onion of our Requirements” conveys the same idea with a lot more accuracy:

We need to always make sure we prioritize our requirements correctly. In order, we think about Security, Durability, Correctness, Availability, Scalability via Scale-out, Operability, Features, Performance via Scaleup, and Efficiency. What this means is that for each item on the left side, it is more important than the items on the right side.

But this does not tell the whole story. If we’re honest, there is one critical principle of equal importance to everything on this list: Don’t lose all your money.

Every adult who’s managed their own finances knows we don’t have infinite money. Yes, we want to keep the data safe. We also want to be smart about spending our money.

Relational databases are one of the most powerful and versatile places to store your data – and they are also one of the most expensive places to store your data. Just look at the per-GB pricing of block storage with provisioned IOPS and low latency, then compare with the pricing of object storage. No contest. Any time a SQL database is beginning to approach the TB range, we definitely should be looking at the largest tables and asking whether significant portions of that data can be moved to cheaper storage – for example parquet files on S3. (Or F3 files?)

Of course, sometimes we need fast powerful SQL and joins and transactions. So relational databases also should run as efficiently as possible. This has direct implications around how we keep the data safe.

From personal photos to enterprise databases, the core of all data safety is copies of the data. Logs and row-store/column-store files (and indexes) are data copies in different formats. You could almost parse the entire database industry through a lens that compares how each technology is just a unique way to replicate data between different formats and places. The revered and time-honored “3-2-1 Backup Rule” is all about copies of the data. From an information theory standpoint, it can be argued that even RAID5 parity, checksums, CRCs, and hashes are a shadow or fingerprint “copy” of the original – even though they aren’t literal full copies of the data.

One of my favorite cultural hints from Mark is: Don’t Let Entropy Win.

In the absence of people making things better, they will get worse. It’s just a fact.

This isn’t Mark’s point, but I think it’s a related concept: at every business that’s successful enough to grow large, there is a natural gravitation toward forming silos of technology. I think of this as a kind of entropy that we need to actively counteract in every large business. Let’s look at an example where an enterprise business team building a public API needs a 600GB write-intensive database. Suppose we can buy enterprise grade high-endurance NVMe SSDs (handling write-intensive database workloads) for $1000 each. How much will the storage cost to “keep the data safe” for this public API?

The business team provisions three environments: one for production and two more for development and testing.
For business continuity in case of regional problems, the database team creates primary and replica CloudNativePG clusters, so that we are able to run from either of our two regions.
To maintain high availability, the database team configures CloudNativePG with three instances within each region and they configure preferred anti-affinity so that Kubernetes will attempt to schedule the three instances in different buildings or availability zones.
Persistent storage is provided by the storage team who configures ceph volumes backed by two mirror copies.
Object storage for backups uses two mirror copies.
Servers are built by the infrastructure team who configure RAID 1 (mirroring).

In the worst case, we can easily end up spending $96,000 on disks alone – for a database that can fit on a single $1000 enterprise drive! Now that is some crazy storage amplification.
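The $96,000 figure is what you get by multiplying the layers above. Here is one way to reconstruct it as a quick sketch. The assumptions are mine: each environment keeps one backup set in object storage per region, each backup set fits on a single drive, and the object storage servers also sit on RAID 1. The variable names are purely illustrative.

```python
# Worst-case disk count for one 600GB database that fits on a single $1000 drive.
DRIVE_COST_USD = 1000

environments = 3     # production, development, testing
regions = 2          # primary and replica CloudNativePG clusters
instances = 3        # CloudNativePG instances per region
ceph_mirrors = 2     # ceph volumes backed by two mirror copies
backup_mirrors = 2   # object storage for backups uses two mirror copies
raid1 = 2            # RAID 1 (mirroring) on every server

# Disks holding live database volumes.
database_disks = environments * regions * instances * ceph_mirrors * raid1

# Disks holding backups (assumption: one backup set per environment per region).
backup_disks = environments * regions * backup_mirrors * raid1

# Physical copies of a single database inside one data center.
copies_per_data_center = instances * ceph_mirrors * raid1

total_cost = (database_disks + backup_disks) * DRIVE_COST_USD

print(f"database disks: {database_disks}")                  # 72
print(f"backup disks: {backup_disks}")                      # 24
print(f"copies per data center: {copies_per_data_center}")  # 12
print(f"total: ${total_cost:,}")                            # $96,000
```

Every layer looks reasonable to the team that owns it; the amplification only shows up when the factors are multiplied together.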

In order to take a smarter approach, let’s work backwards from the problems we’re solving. When we say “keep the data safe” – what are some specific situations we want to protect the data from?

Unavailability during maintenance & deployments at all levels of the stack
Operational mistakes
Software bugs at all levels of the stack, from business app to firmware
Hardware failures of disks
Hardware failures of servers/compute which can make good disks temporarily inaccessible
External threats from direct attacks, malware, social engineering, supply chain attacks, etc
Insider threats arising from situations like personal grievances or personal financial pressures
Natural disasters (and perhaps political disasters…)

Armed with a list, we can now ask ourselves: what is an economical solution that addresses everything here? There isn’t one right answer but we probably don’t need 12 physical copies of each database per data center. A few ideas:

Three CNPG instances that use local SSD storage directly (no hardware RAID), for a total of three copies in the primary data center.
Two or three CNPG instances that use either ceph block storage or local SSD with hardware RAID (but not both) for a total of four or six copies in the primary data center.
A single CNPG instance in the second data center, with the capability to dynamically add instances on switchovers/failovers.
Slower, less expensive disks for development databases.
No CNPG instance for immediate switchover/failover of development databases in second data center.
Testing tier that matches production config but can be provisioned on demand from backups for load testing, and deprovisioned when unused for some period of time. Development tier also provisioned on demand and deprovisioned when unused for some period of time.
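The copy counts in the first two ideas can be checked the same way. This is only a sketch: the helper `copies_in_primary` is a made-up name, and it counts only the copies held by CNPG instances in the primary data center (not backups or the second data center).

```python
def copies_in_primary(instances: int, copies_per_instance: int) -> int:
    """Physical copies of the data held by CNPG instances in one data center."""
    return instances * copies_per_instance

# Worst case from the earlier example: three instances, ceph mirroring on top of RAID 1.
worst_case = copies_in_primary(3, 2 * 2)

# Three instances on local SSD with no hardware RAID.
local_ssd = copies_in_primary(3, 1)

# Two or three instances with ceph mirroring OR RAID 1, but not both.
one_redundancy_layer = [copies_in_primary(n, 2) for n in (2, 3)]

print(worst_case, local_ssd, one_redundancy_layer)  # 12 3 [4, 6]
```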

There are many ways to keep data safe on a reasonable budget – these are just a few ideas.

Default Sort Order in Db2, SQL Server, Oracle & Postgres 17

Jeremy — Wed, 22 May 2024 17:28:38 +0000

TLDR: I was starting to think that the best choice of default DB collation (for sort order, comparison, etc) in Postgres might be ICU. But after spending some time reviewing the landscape, I now think that code-point order is the best default DB collation – mirroring Db2 and Oracle – and linguistic sorting can be used via SQL when it’s actually needed for the application logic. In existing versions of Postgres, this would be something like C or C.UTF-8 and Postgres 17 will add the builtin collation provider (more details at the bottom of this article). This ensures that the system catalogs always use code-point collation, and it is a similar conclusion to what Daniel Vérité seems to propose in his March 13 blog, “Using binary-sorted indexes”. I like the suggestion he closed his blog with: SELECT ... FROM ... ORDER BY colname COLLATE "unicode" – when you need natural language sort order.

I spent some time reading documentation, experimenting, and talking to others in order to learn more about the general landscape of collation and SQL databases. It’s safe to say that every SQL database that’s been around for more than a hot minute has fun collation quirks. (Another reason you shouldn’t write your own database… rediscovering all of this for yourself.)

Next week at PGConf.dev in Vancouver, Jeff Davis (and I) will be talking about collation and Postgres. If you’ll be at the conference then be sure to stop by and listen!

Wednesday May 29 at 2:30pm in the Canfor room (1600) – “Collations from A to Z” – https://www.pgevents.ca/events/pgconfdev2024/schedule/session/95-collations-from-a-to-z/

Db2

I asked Josh Tiefenbach – a friend who previously worked in Db2 development – and he’s helped me better understand the picture here. First off: Db2 will format your dates and numbers according to the client’s localization environment. I heard a funny story about an IBM engineer whose programs were randomly breaking because of comparison mismatches on dates. It wasn’t critical enough to warrant an immediate deep dive, and after a few months of annoyance they realized it was because their laptop was set to Canadian English while some other systems were set to American English, and the client locale information would be passed to the database and change the string output format.

However, while date formatting can be influenced by client locale, SQL statements with ORDER BY will always sort your results according to server settings. If your server is in Switzerland, I don’t think you can have one terminal or client in France automatically ordering strings with French rules and another terminal in Germany automatically ordering strings with German rules. You can write explicit SQL syntax to accomplish this (and use different SQL statements in France and Germany), you just can’t do it with an implicit client locale setting. Finally: if I understand correctly, today new Db2 installations default to Unicode encoding with code-point order collation, which is called IDENTITY in Db2 nomenclature.

A good entry point to Db2 docs on collation is https://www.ibm.com/docs/en/db2/11.5?topic=collation-collating-sequences
Docs for CREATE DATABASE sql and Create-database api
It looks to me like in some cases, Db2 can automatically set up a database to maintain compatible sort order with an older territory that is configured at the client when creating the database. In this case, only code points that directly map to the old code page are linguistically sorted, and all other code points get IDENTITY collation. ( Fun Collation Quirk) https://www.ibm.com/docs/en/db2/11.5?topic=collation-language-aware-collations-unicode-data and https://www.ibm.com/docs/en/db2/11.5?topic=support-supported-territory-codes-code-pages
UCA (Unicode Collation Algorithm, same as ICU) is not default but can be configured in Db2. It looks like IBM “deprecated” Unicode 4.0; however, as of Db2 v11.5 there are still four versions of UCA built into the database: Unicode 4.0 (CLDR 1.2), Unicode 5.0 (CLDR 1.5.1), Unicode 5.2 (CLDR 1.8.1) and Unicode 7.0 (CLDR 2.7.0.1). It’s pretty hard to deprecate a collation in the database world. https://www.ibm.com/docs/en/db2/11.5?topic=collation-unicode-algorithm-based-collations
For unicode databases, system catalogs always use IDENTITY (code-point order) collation, even when the database default is linguistic collation.
After the database is created, the default collation cannot be changed.

SQL Server

I did some searching to figure out what SQL Server does. My best read is that – still today – SQL Server defaults to 8-bit encodings (like Windows-1252) and an associated ISO-8859-ish collation based on what language you picked for your Windows Server install.

They’ve supported UTF-16 in the database forever via the NVARCHAR data type. (Mandatory 2x storage overhead… Fun Encoding Quirk) An option for UTF-8 encoding was added around 2019 but it seems to me this is not widely used.

I asked Brent Ozar if he had any insights he could share around this. Brent confirmed that the vast majority of SQL Servers use default collation. And this often leads to an interesting challenge:

You’ve got a database that was created on another server, with a different default collation, and the database inherited that
You create temp tables without specifying collation (so they inherit the server’s default collation)
You load data from the user database into the temp table
You join the user table & temp tables together on a column, and that column has two different collations (one in the temp table, one in the user db) and you get errors

Because that’s been a widely known issue for decades, the practical solution is pretty simple: when vendors hand out databases to their clients, like for packaged applications, they just say as part of the requirements, “Your SQL Server has to use ____ collation,” and they give installation instructions for when the user’s setting up their SQL Server. OR, the vendor learns the problem early on, and stops joining on strings, and uses numbers for pk/fk fields instead. ( Fun Collation Quirk)

Thank you Brent for this!

A good entry point to SQL Server docs on collation is https://learn.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-ver16
The best explanation I found of SQL Server was Solomon Rutzky’s Stack Overflow answer at https://stackoverflow.com/questions/5182164/sql-server-default-character-encoding
An interesting deep-dive on what exactly these ISO-8859-ish collations are is at https://sqlquantumleap.com/2021/05/21/sql-server-collations-what-does-cp1-mean-in-sql_latin1_general_cp1_ci_as/ (and this is why I say “ish” … Fun Collation Quirk)
Collation cannot be changed for a server, however it can be changed for a database (make sure to read the doc page on data conversion with collation changes).
From my read of the Azure SQL Database docs, it seems like that version might allow control of system catalog collation (separate from database collation). I don’t see the CATALOG_COLLATION clause of CREATE DATABASE in the docs for SQL Server 2022 (Windows or Linux). On those installations this is likely controlled by the Server Collation (which is separate from the Database Collation).

Oracle

I myself have an Oracle background, so I had the joy of discovering Oracle’s rather noticeable idiosyncrasies here.

Oracle takes a fundamentally different approach to default collation: it is a property of the client connection rather than a property of the server. Similar to Db2, Oracle defaults to unicode encoding and code-point order collation which is called BINARY in Oracle nomenclature …UNLESS you live in Europe, the Middle East, Quebec, or a few other unlucky countries. China and Japan and Korea and India – lucky. Thailand and Vietnam and Pakistan – unlucky. (I have no idea what reasoning was behind this seemingly arbitrary list!)

It gets fun in the unlucky countries. If your client environment is set to one of these languages, then by default Oracle makes ORDER-BY and a few functions (like regex) sort words with a collation reflecting your client locale. But operators like greater-than and less-than, group-by, and indexes all still use code-point order (BINARY) collation.

As a result, this is the default behavior of Oracle in “unlucky” countries:

NLS_LANG=french_france.al32utf8 sqlplus / as sysdba

SQL> select * from test_table order by field1;
FIELD1
--------------------------------------------------
baño
banqueta
Baptisto
chorizo
como

SQL> select * from test_table where field1 <= 'banqueta' order by field1;
FIELD1
--------------------------------------------------
banqueta
Baptisto

( Fun Collation Quirk)
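The mismatch is easy to reproduce outside the database. The sketch below is only an emulation in Python, not what Oracle actually runs: `linguistic_key` is a hypothetical stand-in that ignores accents and case, which is far cruder than a real NLS_SORT or ICU collation, while Python’s built-in string comparison plays the role of BINARY (code-point) comparison.

```python
import unicodedata

words = ["baño", "banqueta", "Baptisto", "chorizo", "como"]

def linguistic_key(s: str) -> str:
    """Rough stand-in for a linguistic sort key: ignore accents and case."""
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()

# ORDER BY field1 -> linguistic ordering
ordered = sorted(words, key=linguistic_key)

# WHERE field1 <= 'banqueta' -> code-point comparison (Python's default for str),
# followed by the same linguistic ORDER BY
filtered = sorted((w for w in words if w <= "banqueta"), key=linguistic_key)

print(ordered)   # ['baño', 'banqueta', 'Baptisto', 'chorizo', 'como']
print(filtered)  # ['banqueta', 'Baptisto']
```

In the ordered output, “baño” sorts before “banqueta” while “Baptisto” sorts after it, yet the filter keeps “Baptisto” and drops “baño”, because the comparison and the ordering use two different rules.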

Sometimes I wonder how the rest of the world can live with us Americans. Part of me thinks a bunch of them probably run Oracle with NLS_LANG=AMERICAN out of exasperation (and conditioning). One long-time Oracle person suggested that in the days before databases were relational, one might have looked at this behavior through the lens of “projection” versus “presentation” layers?

But on the plus side, Oracle seems to be the most flexible with collation, it being a property of the client connection rather than the server. By simply changing your client environment or setting a session variable, you can switch to all-binary mode (NLS_SORT setting), or switch to all-linguistic mode (NLS_COMP setting – perf caveats notwithstanding, eg. need to create linguistic indexes).

All-binary mode is straightforward, and is default for many languages (including “ENGLISH” and the default value of “AMERICAN” – which I assume is named this way because Americans don’t speak proper English). But evidently users have hit a few strange bugs in all-linguistic mode and found themselves waiting for Oracle to create one-off patches.

Oracle has published a detailed support note 227335.1 about Linguistic Sorting, including a list of known bugs and an explicit recommendation against setting linguistic sorting at the database/instance level.

“This is actually a rather bad idea, linguistic sorting is much more cpu consuming than the binary sorting (and needs linguistic indexes) hence it should be set (by using alter session for example) by the application and only when actually needed for the application logic (and the tables involved have linguistic indexes defined).”
— support note 227335.1 —

And similar to SQL Server, at the end of the day the vendors of packaged applications are often going to dictate how the database is configured. Let’s just take one simple example: Oracle eBusiness Suite.

Support doc 396009.1 includes these four mandatory Database Initialization Parameters for E-Business Suite Release 12:

# Mandatory parameters are denoted with the #MP symbol as a
# comment. This includes parameters such as NLS and optimizer
# related parameters.

nls_comp = binary #MP
nls_sort = binary #MP
nls_date_format = DD-MON-RR #MP
nls_length_semantics = BYTE #MP

If using Oracle Applications in global organizations, you would refer to support doc 393861.1 for more details. Oracle Apps releases starting with 12.2.2 (2013) support linguistic sort order. And how is it implemented? The NLS_SORT session variable is set by the application – not set at the database level. And NLS_COMP is still required to be in BINARY mode (WHERE clause, “=”, like etc).

( Fun Collation Quirk)

It’s worth mentioning that in Oracle you can also write explicit SQL syntax to specify a collation, you can bind a collation to data, and you can define default collation at multiple levels. But interestingly, Postgres added support for data-bound collation about 5-6 years before Oracle! (PG 9.1 in 2011 versus Oracle 12.2 in 2016/2017)

A good entry point to Oracle docs on collation is https://docs.oracle.com/en/database/oracle/oracle-database/23/nlspg/linguistic-sorting-and-matching.html#GUID-84F2A594-E641-4436-A903-D5D5B4D7FFA9
It’s worth noting that like Db2 and SQL Server and Postgres, Oracle also supports single-byte encodings and associated collations that predate Unicode. Details can be found in the documentation.
The “BINARY” column of table 5-3 contains the complete list of SQL syntax & functions, showing which ones default to BINARY collation and which ones default to linguistic collation (NLS_SORT) when you’re in an “unlucky” locale. https://docs.oracle.com/en/database/oracle/oracle-database/19/nlspg/linguistic-sorting-and-matching.html#GUID-85328A92-CB08-4467-9F02-42E9EF9D41B7__CIHBFIID (Summary – linguistic collation is used for ORDER BY, Analytic Functions, REGEXP_* functions and NLS* functions. Binary collation is used for GROUP BY, LIKE & IN & BETWEEN, operators like < > =, UNION & INTERSECT & MINUS, and much more.)
The “Default Sort” column of Table A-1 shows exactly which lucky locales default to BINARY sorting for ORDER-BY and which unlucky locales default to a linguistic sort for ORDER-BY (but not other operators – see previous bullet point) https://docs.oracle.com/en/database/oracle/oracle-database/23/nlspg/appendix-A-locale-data.html#GUID-D2FCFD55-EDC3-473F-9832-AAB564457830
That default sort column was first added to the Oracle Globalization manual in version 10.1 (2003); however, all the way back to v8.1.6 (late 90’s) I see the default value of NLS_SORT is “Default character sort sequence for a particular language” …I didn’t check older docs but it sounds to me like this behavior has existed for a very long time in Oracle. I guess after 30-40 years, why change now?
Per support note 227335.1 – even when Oracle is in all-linguistic mode (NLS_COMP=LINGUISTIC), indexes are still created as BINARY by default (you need explicit SQL syntax to create linguistic indexes), and I suspect the catalog always works in BINARY mode under the covers.
UCA (Unicode Collation Algorithm, same as ICU) is not default but can be configured in Oracle. It looks like Oracle “deprecated” Unicode 6.1 in 21c; however, as of today there are still four versions of UCA built into the database: Unicode 6.1, Unicode 6.2, Unicode 7.0 and Unicode 12.1. It’s pretty hard to deprecate a collation in the database world.
The database does not have a default collation. The default collation of a table can be changed, but this only impacts newly added columns – existing columns are not changed. A client can change the locale settings it uses to connect at any time, but this may impact performance (eg. indexes may not be used).

Postgres

Historically, Postgres has relied on external libraries for collation (originally the operating system libc, with the later addition of an option to use ICU from unicode.org if it’s separately installed on the system). This external dependency is unlike every other relational database that exists.‡ This means that the onus is on the administrators who install and manage software to not mess up their database by changing external libraries in a way that breaks things. When you install the next major version of RHEL, if you physically detach/attach your storage volume with 10TB of datafiles to avoid spending two weeks on a logical dump-and-load, then it’s your job to go find the old version of ICU and get it compiled, installed, and working on your new server. Let’s hope you’re not using linguistic collation from the OS libc (like en-US). Code-point order collation is a fine choice though.

‡Interestingly, Postgres does historically have a built-in collation called “C”; however, it’s not used by default and it has some usability issues in non-English languages. For example, accented characters and other non-ASCII characters don’t work with upper/lower, regex, etc.

As I mentioned above, Postgres 17 will most likely include a builtin stable collation which is very fast, comparing characters by code point and addressing problems with the older “C” collation. (Hooray! Scroll Down for More Details!) But even in version 17, Postgres defaults to the locale settings from the OS environment (eg. LANG/LC_COLLATE) when the initdb program creates a new database cluster. https://www.postgresql.org/docs/current/app-initdb.html

In practice, this means that – similar to SQL Server – Postgres defaults are determined by Operating System defaults. I remember running Linux installers and choosing locales many times over the years. These days it’s just as common to use a virtual machine image. I pulled up a few official versions of Ubuntu on EC2 to see what the defaults looked like, and I got different results!

Operating System	Default LANG (locale)	Collation of default database	PostgreSQL Version	AMI
Ubuntu 16.04.7 LTS	LANG = en_US.UTF-8	en_US.UTF-8	apt install postgresql-9.5	ami-0b0ea68c435eb488d
Ubuntu 18.04.6 LTS	LANG = C.UTF-8	C.UTF-8	apt install postgresql-10	ami-0279c3b3186e54acd
Ubuntu 20.04.3 LTS	LANG = C.UTF-8	C.UTF-8	apt install postgresql-16	ami-083654bd07b5da81d
Ubuntu 22.04 LTS	LANG = C.UTF-8	C.UTF-8	apt install postgresql-16	ami-0ba8e031ca32ab37f

So it turns out that a number of Postgres installations already use libc C.UTF-8 collation by default, which is a safer choice than a libc linguistic collation like en_US.UTF-8.

A good entry point to Postgres docs on collation is https://www.postgresql.org/docs/current/collation.html
The official PostgreSQL wiki page for collations has some info from 2019-2021, but doesn’t have the latest updates yet (as of publishing date of this blog). https://wiki.postgresql.org/wiki/Collations
Peter Eisentraut’s blog is an amazing resource on collation in PostgreSQL. http://peter.eisentraut.org/
Daniel Vérité’s blog also has several great articles about collation in PostgreSQL. https://postgresql.verite.pro/blog/
About three and a half years ago, Thomas Munro published an article on the Azure database blog about index corruption from collation changes. Another one of my favorites that stands the test of time – especially since Thomas includes so many great links back into the official Unicode specs as he explains things. https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/don-t-let-collation-versions-corrupt-your-postgresql-indexes/ba-p/1978394
Related article here on my blog: Did Postgres Lose My Data?
UCA (Unicode Collation Algorithm, via ICU) is not default but can be configured in Postgres. It needs to be installed separately by the administrator before it can be used by Postgres. It is the responsibility of administrators to ensure that the same version of ICU continues to be available if the database is physically migrated to new servers or operating systems, and that the same version of ICU is available on hot standbys. Even a small ICU update can be risky. Postgres docs say it works with ICU 4.2 (CLDR 1.7.1) and newer. However there was a mailing list thread suggesting that ICU 54 and earlier should be avoided.
A Postgres client can detect encoding based on the client environment (LC_CTYPE for libpq) but – similar to Db2 – sort order is determined only by server settings and SQL syntax.

The Difference Between C and C.UTF-8 Collation in PostgreSQL – and What’s New in 17

In PostgreSQL, you first choose a provider (libc or icu) and then you choose a collation. The collation doesn’t just control ordering, it also controls character semantics (upper/lower/initcap, regex classes, equality, etc).

PostgreSQL has only two collations with code-point ordering: libc C collation and libc C.UTF-8 collation. Before Postgres 17, you have to choose between the good character semantics of C.UTF-8 collation and the performance and reliability of C collation. Thanks to Jeff Davis’s work, Postgres 17 will soon offer the best of both worlds via the new provider called “builtin” with C.UTF-8 collation (same collation name, different provider). The already-existing C collation (libc provider) can also be accessed through the new “builtin” provider, which makes sense because this collation was built into Postgres all along.

Builtin C.UTF-8 collation (aka pg_c_utf8) has the same blazing fast performance on sorts and comparisons as C collation. C collation might slightly edge out builtin C.UTF-8 performance on character semantics like uppercase/lowercase operations. But personally I’ll happily trade that for broadly useful character semantics and rock-solid reliability.

libc & pg17 builtin provider C collation	libc provider C.UTF-8 collation	pg17 builtin provider C.UTF-8 collation
implemented internally; does not call libc (the PG provider name of “libc” is misleading)	calls libc	implemented internally; does not call libc
stable & safe; does not change	changes should be uncommon (less than icu and libc linguistic locales), but history shows that both character semantics and sort order have changed over time, for example in Debian/Ubuntu (cf. mailing list thread)	stable & safe; does not change
poor semantics for non-ASCII characters; eg. accented characters break upper/lower, regex, etc	good semantics for non-ASCII characters (upper/lower, regex, etc); implementation specifics can vary by operating system and libc version	good semantics for non-ASCII characters (upper/lower, regex, etc); same on all platforms: regex classes based on “POSIX Compatible” semantics spec, case mappings are “simple” variant
fast	slower than C and builtin, faster than libc linguistic and icu	fast ‡
not default	default for some operating system installations, not default for others	not default

‡builtin C.UTF-8 has the same performance as C for collation/sorting, but it is a little slower than C for character semantic operations like upper/lower, etc. It is faster than libc and ICU for character semantic operations.
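To make the “poor semantics for non-ASCII characters” row a bit more concrete, here is a rough Python emulation of the difference as I understand it. Nothing here calls Postgres: `ascii_only_upper` is a hypothetical helper that imitates C collation by mapping only a-z, and Python’s `str.upper()` stands in for Unicode-aware case mapping (it applies full case mappings, which is close to but not identical to the “simple” variant used by builtin C.UTF-8).

```python
def ascii_only_upper(s: str) -> str:
    """Imitates upper() under C collation: only ASCII a-z are mapped."""
    return "".join(chr(ord(ch) - 32) if "a" <= ch <= "z" else ch for ch in s)

word = "baño"

print(ascii_only_upper(word))  # BAñO  (the accented character is left alone)
print(word.upper())            # BAÑO  (Unicode-aware case mapping)

# The sort order is code-point order either way; only the character
# semantics differ between the two.
print(sorted(["baño", "banqueta", "Baptisto"]))  # ['Baptisto', 'banqueta', 'baño']
```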

As I said at the beginning – I now lean toward always setting the Postgres database default to code-point order collation (carefully considering the differences between C and C.UTF-8) and specifying linguistic sorting with SQL when it’s actually needed for the application logic. This approach is used for many existing enterprise applications & databases around the world today and it ensures that system catalogs use code-point collation. There are speed & performance benefits, and this provides the best stability and safety.

Final Notes:

I probably know a little more about Postgres collation than the average Postgres user, but I haven’t kept up with all the mailing list threads. Folks like Jeff, Peter, Daniel, Thomas and many other PG hackers know WAY more than I do. It’s very possible I’ve made some mistakes here. Please let me know if you spot anything I got wrong!
Lukas Fittl recently reviewed the new pg17 builtin collator in “5mins of Postgres” episode 107. Well worth the listen! https://pganalyze.com/blog/5mins-postgres-17-builtin-c-utf8-locale
Just for fun… on a related note, but more about encodings than collations, this is a fun blog post. “Falsehoods Programmers Believe About Plain Text” https://jeremyhussell.blogspot.com/2017/11/falsehoods-programmers-believe-about.html

Understanding CPU on AIX Power SMT Systems

Jeremy — Fri, 01 Jul 2016 08:30:42 +0000

This month I worked with a Chicagoland company to improve performance for eBusiness Suite on AIX. I’ve worked with databases running on AIX a number of times over the years now. Nevertheless, I got thrown for a loop this week.

TLDR: In the end, it came down to a fundamental change in resource accounting that IBM introduced with the POWER7 processor in 2010. The bottom line is twofold:

If SMT is enabled then the meaning of CPU utilization numbers is changed. The CPU utilization numbers for individual processes mean something completely new.
Oracle Database 11.2.0.3 (I haven’t tested newer versions but they might also be affected) is not aware of this change. As a result, all CPU time values captured in AWR reports and extended SQL traces are wrong and misleading if it’s running on AIX/POWER7/SMT. (I haven’t tested CPU time values at other places in the database but they might also be wrong.)

On other Unix operating systems (for example Linux with Intel Hyper-Threading), the CPU numbers for an individual process reflect the time that the process spent on the CPU. It’s pretty straightforward: 100% means that the process is spending 100% of its time on the logical CPU (a.k.a. thread – each hardware thread context on a hyper-threaded core appears as a CPU in Linux). However, AIX with SMT is different. On AIX, when you look at an individual process, the CPU utilization numbers reflect IBM’s opinion about what percentage of physical capacity is being used.

Why did IBM do this? I think that their overall goal was to help us in system-wide monitoring and capacity planning – however it came at the expense of tuning individual processes. They are trying to address real shortcomings inherent to SMT – but as someone who does a lot of performance optimization, I find that their changes made my life a lot more difficult!

History

Ls Cheng started a conversation in November 2012 on the Oracle-L mailing list about his database on AIX with SMT enabled, where the CPU numbers in the AWR report didn’t even come close to adding up correctly. Jonathan Lewis argued that double-counting was the simplest explanation while Karl Arao made the case for time in the CPU run queue. A final resolution was never posted to the list – but in hindsight it was almost certainly the same problem I’m investigating in this article. It fooled all of us. CPU intensive workloads on AIX/POWER7/SMT will frequently mislead performance experts into thinking there is a CPU run queue problem at the OS level. In fact, after researching for this article I went back and looked at my own final report from a consulting engagement with an AIX/SMT client back in August 2011 and realized that I made this mistake myself!

As far as I’m aware, Marcin Przepiorowski was the first person to really “crack” the case when he researched and published a detailed explanation back in February 2013 with his article Oracle on AIX – where’s my cpu time?. Marcin was tipped off by Steve Pittman’s detailed explanation published in a December 2012 article Understanding Processor Utilization on Power Systems – AIX. Karl Arao was also researching it back in 2013 and published a lot of information on his tricky cpu aix stuff tiddlywiki page. Finally, Graham Wood was digging into it at the same time and contributed to several conversations amongst oak table members. Just to be clear: I’m not posting any kind of new discovery! :)

However – despite the fact that it’s been in the public for a few years – most people don’t understand just how significant this is, or even understand exactly what the problem is in technical terms. So this is where I think I can make a contribution: by giving a few simple demonstrations of the behavior which Steve, Marcin and Karl have documented.

CPU and Multitasking

I recently spent a few years leading database operations for a cloud/SaaS company. Perhaps one of the most striking aspects of this job was that I had crossed over… from being one of the “young guys” to being one of the “old guys”! I certainly wasn’t the oldest guy at the company but more than half my co-workers were younger than me!

Well my generation might be the last one to remember owning personal computers that didn’t multitask. Ok… I know that I’m still working alongside plenty of folks who learned to program on punch-cards. But at the other end of the spectrum, I think that DOS was already obsolete when many of my younger coworkers started using technology! Some of you younger devs started with Windows 95. You’ve always had computers that could run two programs in different windows at the same time.

Sometimes you take a little more notice of tech advancements you personally experience and appreciate. I remember it being a big deal when my family got our first computer that could do more than one thing at a time! Multitasking (or time sharing) is not a complicated concept. But it’s important and foundational.

Figure: CPU Time on Single CPU Multi Tasking System (the colored segments mark CPU time for program P1)

So obviously (I hope), if there are multiple processes and only a single CPU then the processes will take turns running. There are some nuances around if, when and how the operating system might force a process to get off the CPU but the most important thing to understand is just the timeline pictured above. Because for the rest of this blog post we will be talking about performance and time.

Here is a concrete example of the illustration above: one core in my laptop CPU can copy 13GB of data through memory in about 4-5 seconds:

$ time -p taskset 2 dd if=/dev/zero of=/dev/null bs=64k count=200k
204800+0 records in
204800+0 records out
13421772800 bytes (13 GB) copied, 4.73811 s, 2.8 GB/s
real 4.74
user 0.13
sys 4.54

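The “taskset” command on linux pins a command to a specific CPU. The argument is a bitmask, so 2 (binary 10) selects the second logical CPU, which Linux numbers as CPU 1 – and “dd” is only allowed to execute on that CPU. This way, my example runs exactly like the illustration above, with just a single CPU.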

What happens if we run two jobs at the same time on that CPU?

$ time -p taskset 2 dd if=/dev/zero of=/dev/null bs=64k count=200k &
[1] 18740

$ time -p taskset 2 dd if=/dev/zero of=/dev/null bs=64k count=200k &
[2] 18742

204800+0 records in
204800+0 records out
13421772800 bytes (13 GB) copied, 9.25034 s, 1.5 GB/s
real 9.25
user 0.09
sys 4.57
204800+0 records in
204800+0 records out
13421772800 bytes (13 GB) copied, 9.22493 s, 1.5 GB/s
real 9.24
user 0.12
sys 4.54

[1]-  Done                    time -p taskset 2 dd if=/dev/zero of=/dev/null bs=64k count=200k
[2]+  Done                    time -p taskset 2 dd if=/dev/zero of=/dev/null bs=64k count=200k

Naturally, it takes twice as long – 9-10 seconds. I ran these commands on my linux laptop but the same results could be observed on any platform. By the way, notice that the “sys” number was still 4-5 seconds. This means that each process was actually executing on the CPU for 4-5 seconds even though it took 9-10 seconds of wall clock time.
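The arithmetic behind that result can be sketched with a toy round-robin scheduler. This is only an illustration: the function name `round_robin`, the 100 ms time slice and the 4500 ms of CPU demand per job are assumptions picked to resemble the dd runs above, not a model of the real Linux scheduler.

```python
from collections import deque


def round_robin(cpu_demand_ms, quantum_ms=100):
    """Simulate jobs taking turns on ONE CPU.

    cpu_demand_ms: CPU time each job needs, in milliseconds.
    Returns the wall-clock time (ms) at which each job finishes.
    """
    remaining = list(cpu_demand_ms)
    finish = [0] * len(remaining)
    queue = deque(range(len(remaining)))
    clock = 0
    while queue:
        job = queue.popleft()
        run = min(quantum_ms, remaining[job])  # run for one slice at most
        clock += run                            # wall clock advances for everyone
        remaining[job] -= run                   # CPU time is charged to this job only
        if remaining[job] > 0:
            queue.append(job)                   # back of the line
        else:
            finish[job] = clock
    return finish


if __name__ == "__main__":
    demand = [4500, 4500]  # hypothetical: two jobs needing ~4.5 s of CPU each
    for job, done in enumerate(round_robin(demand)):
        print(f"job {job}: cpu {demand[job] / 1000:.2f}s, wall {done / 1000:.2f}s")
```

With two jobs sharing the CPU, each one still consumes 4.5 seconds of CPU time but does not finish until roughly the 9 second mark, which is the same shape as the “real” and “sys” numbers in the output above.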

The “time” command above provides a summary of how much real (wall-clock) time has elapsed and how much time the process was executing on the CPU in both user and system modes. This time is tracked and accounted for by the operating system kernel. The linux “time” command uses the wait4() system call to retrieve the CPU accounting information. This can be verified with strace:

$ strace -t time -p dd if=/dev/zero of=/dev/null bs=64k count=200k
10:07:06 execve("/usr/bin/time", ["time", "-p", "dd", "if=/dev/zero", "of=/dev/null", \
        "bs=64k", "count=200k"], [/* 48 vars */]) = 0
...
10:07:06 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, \
        child_tidptr=0x7f8f841589d0) = 12851
10:07:06 rt_sigaction(SIGINT, {SIG_IGN, [INT], SA_RESTORER|SA_RESTART, 0x7f8f83be90e0}, \
        {SIG_DFL, [], 0}, 8) = 0
10:07:06 rt_sigaction(SIGQUIT, {SIG_IGN, [QUIT], SA_RESTORER|SA_RESTART, 0x7f8f83be90e0}, \
        {SIG_IGN, [], 0}, 8) = 0
10:07:06 wait4(-1, 

204800+0 records in
204800+0 records out
13421772800 bytes (13 GB) copied, 4.66168 s, 2.9 GB/s

[{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, {ru_utime={0, 108000}, \
        ru_stime={4, 524000}, ...}) = 12851
10:07:11 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=12851, si_uid=1000, \
        si_status=0, si_utime=10, si_stime=454} ---
10:07:11 rt_sigaction(SIGINT, {SIG_DFL, [INT], SA_RESTORER|SA_RESTART, 0x7f8f83be90e0}, \
        {SIG_IGN, [INT], SA_RESTORER|SA_RESTART, 0x7f8f83be90e0}, 8) = 0
10:07:11 rt_sigaction(SIGQUIT, {SIG_IGN, [QUIT], SA_RESTORER|SA_RESTART, 0x7f8f83be90e0}, \
        {SIG_IGN, [QUIT], SA_RESTORER|SA_RESTART, 0x7f8f83be90e0}, 8) = 0
10:07:11 write(2, "r", 1r)                        = 1
10:07:11 ...

Two notes about this. First, you’ll see from the timestamps that there’s a 5 second pause during the wait4() syscall, and the output from “dd” lands in the middle of the strace output for that call. Clearly this is when “dd” is running. Second, you’ll see that the wait4() call is returning two values called ru_utime and ru_stime. The man page on wait4() clarifies that this return parameter is the rusage struct which is defined in the POSIX spec. The structure is declared in sys/resource.h and is the same structure filled in by getrusage(); the times() call reports the same user and system CPU accounting through the smaller tms struct. This is how the operating system kernel returns the timing information to “time” for display on the output.
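For readers who want to try the same mechanism without strace, here is a minimal sketch of what a “time”-like wrapper does, using Python’s `os.wait4()` binding. The helper name `run_and_account` and the demo command are my own assumptions for illustration; this is not how the real “time” binary is written.

```python
import os
import sys
import time


def run_and_account(argv):
    """Run argv as a child process; return (wall, user, system) in seconds."""
    start = time.monotonic()
    pid = os.fork()
    if pid == 0:
        # Child: replace this process image with the requested command.
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if the exec failed
    # Parent: wait4() blocks until the child exits and returns its rusage,
    # which carries the kernel's ru_utime and ru_stime accounting.
    _pid, _status, rusage = os.wait4(pid, 0)
    wall = time.monotonic() - start
    return wall, rusage.ru_utime, rusage.ru_stime


if __name__ == "__main__":
    # Hypothetical CPU-bound demo command.
    cmd = [sys.executable, "-c", "sum(i * i for i in range(10**6))"]
    wall, user, system = run_and_account(cmd)
    print(f"real {wall:.2f}")
    print(f"user {user:.2f}")
    print(f"sys {system:.2f}")
```

The wall-clock number is measured by the wrapper itself, while the user and system numbers come straight from the kernel. That split is the reason a change in kernel accounting shows up in every tool built this way.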

CPU Utilization on Linux with Intel SMT (Hyper-Threading)

Since many people are familiar with Linux, it will be helpful to provide a side-by-side comparison of Linux/Intel/Hyper-Threading with AIX/Power7/SMT. This will also help clarify exactly what AIX is doing that’s so unusual.

For this comparison, we will switch to Amos Waterland’s useful stress utility for CPU load generation. This program is readily available for all major unix flavors and cleanly loads a CPU by spinning on the sqrt() function. I found a copy at perzl.org already ported and packaged for AIX on POWER.

For our comparison, we will load a single idle CPU for 100 seconds of wall-clock time. We know that the process will spin on the CPU for all 100 seconds, but let’s see how the operating system kernel reports it.

First, let’s verify that we have SMT (Hyper-Threading):

user@debian:~$ lscpu | egrep '(per|name)'
Thread(s) per core:    2
Core(s) per socket:    2
Model name:            Intel(R) Core(TM) i3-4005U CPU @ 1.70GHz

Next let’s run our stress test (pinned to a single CPU) and see what the kernel reports for CPU usage:

user@debian:~$ time -p taskset 2 stress -c 1 -t 100
stress: info: [20875] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
stress: info: [20875] successful run completed in 100s
real 100.00
user 100.03
sys 0.00

Just what we would expect – the system is idle, and the process was on the CPU for all 100 seconds.

Now let’s use mpstat to look at the utilization of CPU #2 in a second window:

user@debian:~$ mpstat -P 1 10 12
Linux 3.16.0-4-amd64 (debian) 	06/30/2016 	_x86_64_	(4 CPU)

01:58:07 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
01:58:17 AM    1    0.00    0.00    0.10    0.00    0.00    0.00    0.00    0.00    0.00   99.90
01:58:27 AM    1   17.44    0.00    0.10    0.00    0.00    0.00    0.00    0.00    0.00   82.45
01:58:37 AM    1  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
01:58:47 AM    1  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
01:58:57 AM    1  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
01:59:07 AM    1  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
01:59:17 AM    1  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
01:59:27 AM    1  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
01:59:37 AM    1  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
01:59:47 AM    1  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
01:59:57 AM    1  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
02:00:07 AM    1   82.88    0.00    0.10    0.00    0.00    0.00    0.00    0.00    0.00   17.02
Average:       1   83.52    0.00    0.03    0.00    0.00    0.00    0.00    0.00    0.00   16.45

Again, no surprises here. We see that the CPU was running at 100% for the duration of our stress test.

Next let’s check the system-wide view. On linux, most people use the top command to see what’s happening system-wide. Top shows a list of processes and estimates how much time each spends on the CPU. Note that the “top” utility is using the /proc/<pid>/stat file to get kernel-tracked CPU time rather than libc calls – but this still returns the same data as the “time” command. It then divides by wall-clock time to express the CPU time as a percentage. If two processes are running on one CPU, then each process will report 50% CPU utilization (in the default Irix mode).
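The calculation described here can be sketched in a few lines. This is an assumption-laden illustration rather than top’s actual source: the function names are mine, and the only facts relied on are that utime and stime are fields 14 and 15 of /proc/<pid>/stat and are counted in clock ticks.

```python
import os


def cpu_seconds_from_stat(stat_line, ticks_per_second):
    """Return (user, system) CPU seconds from one /proc/<pid>/stat line."""
    # The command name (field 2) is wrapped in parentheses and may contain
    # spaces, so cut at the LAST ')' before splitting on whitespace.
    after_comm = stat_line.rsplit(")", 1)[1].split()
    # after_comm[0] is field 3 (state), so fields 14 and 15 are at 11 and 12.
    utime_ticks = int(after_comm[11])
    stime_ticks = int(after_comm[12])
    return utime_ticks / ticks_per_second, stime_ticks / ticks_per_second


def cpu_percent(cpu_before, cpu_after, wall_seconds):
    """CPU seconds used during an interval, as a percentage of wall time."""
    return 100.0 * (cpu_after - cpu_before) / wall_seconds


if __name__ == "__main__":
    ticks = os.sysconf("SC_CLK_TCK")
    with open("/proc/self/stat") as f:
        user, system = cpu_seconds_from_stat(f.read(), ticks)
    print(f"user {user:.2f}s sys {system:.2f}s")
```

Sampling a process twice and passing the two CPU totals to `cpu_percent` gives the kind of percentage top displays: a process that gained 5 seconds of CPU over a 10 second interval shows as 50%.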

We will run top in a third window while the stress and mpstat programs are running to get the system-wide view:

Linux top (in Irix mode) reports that the “stress” program is using 100% of a single CPU and that 26.3% of my total CPU capacity is used by the system.

This is wrong. Did you spot the problem with my statement above? If you have any linux servers with hyper-threading enabled then I really hope you understand this!

The problem is with the second statement – that 26% of my total CPU capacity is used. In reality, a “hardware thread” is nothing like a “real core”. (For Oracle specific details about Hyper-Threading and CPU Capacity, Karl Arao might be one of the best sources of information.) Linux kernel developers represent each hardware thread as a logical CPU. As a result (and this is counter-intuitive) it’s very misleading to look at that “total CPU utilization” number as something related to total CPU capacity.

What does this mean for you? You must set your CPU monitoring thresholds on Linux/Hyper-Threading very low. You might set your critical threshold for paging at 70%. Personally, I like to keep utilization on transactional systems under 50%. If your hyper-threaded linux system has 70% CPU utilization, then you are going to run out of CPU very soon!

Why is this important? This is exactly the problem that IBM’s AIX team aimed to solve with SMT on POWER. But there is a catch: the source data used by standard tools to calculate system-level CPU usage is the POSIX-defined “rusage” process accounting information. IBM tweaked the meaning of rusage to fix our system-level CPU reporting problem – and they introduced a new problem at the individual process level. Let’s take a look.

CPU Utilization on AIX with Power SMT

First, as we did on Linux, let’s verify that we have SMT (Hyper-Threading):

# prtconf|grep Processor
Processor Type: PowerPC_POWER7
Processor Implementation Mode: POWER 7
Processor Version: PV_7_Compat
Number Of Processors: 4
Processor Clock Speed: 3000 MHz
  Model Implementation: Multiple Processor, PCI bus
+ proc0                                                                         Processor
+ proc4                                                                         Processor
+ proc8                                                                         Processor
+ proc12                                                                        Processor

# lparstat -i|egrep '(Type|Capacity  )'
Type                                       : Shared-SMT-4
Entitled Capacity                          : 2.00
Minimum Capacity                           : 2.00
Maximum Capacity                           : 4.00

So you can see that we’re working with 2 to 4 POWER7 processors in SMT4 mode, which will appear as 8 to 16 logical processors.

Now let’s run the exact same stress test, again pinned to a single CPU.

# ps -o THREAD
    USER      PID     PPID       TID ST  CP PRI SC    WCHAN        F     TT BND COMMAND
jschneid 13238466 28704946         - A    0  60  1        -   240001  pts/0   - -ksh
jschneid  9044322 13238466         - A    3  61  1        -   200001  pts/0   - ps -o THREAD

# bindprocessor 13238466 4

# /usr/bin/time -p ./stress -c 1 -t 100
stress: info: [19398818] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
stress: info: [19398818] successful run completed in 100s

Real   100.00
User   65.01
System 0.00

Wait… where did my CPU time go?! (This is one of the first things Marcin noticed too!) The AIX kernel reported that my process ran for 100 seconds of wall-clock time, but it was only running on the CPU for 65 seconds of that time!

On unix flavors such as Linux, this means that the operating system was not trying to put the process on the CPU during the missing time. Maybe the process was waiting for a disk operation or a signal from another process. But our stress test only executes the sqrt() function – so we know that it was not waiting for anything.

When you know the process was not waiting, there is only one other reason the operating system wouldn’t put the process on the CPU. Look again at our very first demo in this article: two (or more) processes needed to share the CPU. And notice that the user+system time was lower than wall-clock time, exactly like our output here on AIX!

So let’s take a look at the system-wide view with the “nmon” utility in a second window. (topas reports CPU usage solaris-style while nmon reports irix-style, so nmon will be more suitable for this test. They are actually the same binary anyway, just invoked differently.)

Wait… this doesn’t seem right! Our “stress” process is the only process running on the system, and we know that it is just spinning CPU with the sqrt() call. The “nmon” tool seems to verify the output of the time command – that the process is only on the CPU for 65% of the time! Why isn’t AIX letting my process run on the CPU?!

Let’s take a look at the output of the mpstat command, which we are running in our third window:

# mpstat 10 12|egrep '(cpu|^  4)'
System configuration: lcpu=16 ent=2.0 mode=Uncapped 
cpu  min  maj  mpc  int   cs  ics   rq  mig lpa sysc us sy wa id   pc  %ec  lcs
  4    0    0    0    2    0    0    0    1 100    0  0 49  0 51 0.00  0.0    1
  4   19    0   40  143    7    7    1    1 100   19 100  0  0  0 0.61 30.7    7
  4    0    0    0  117    2    2    1    1 100    0 100  0  0  0 0.65 32.6    4
  4    0    0    0   99    1    1    1    1 100    0 100  0  0  0 0.65 32.6    3
  4    0    0    0  107    3    3    1    3 100    0 100  0  0  0 0.65 32.6    6
  4    0    0    0  145    5    5    1    3 100    0 100  0  0  0 0.65 32.6    9
  4    0    0    0  113    2    2    1    1 100    0 100  0  0  0 0.65 32.6    3
  4    0    0    0  115    1    1    1    1 100    0 100  0  0  0 0.65 32.6    7
  4    0    0    0  106    1    1    1    1 100    0 100  0  0  0 0.65 32.6    2
  4    0    0    0  113    1    1    1    1 100    0 100  0  0  0 0.65 32.6    5
  4    0    0   41  152    2    2    1    1 100    0 100  0  0  0 0.65 32.6    3
  4    5    0    0    6    0    0    0    1 100    4 100  0  0  0 0.04  1.8    1

Processor 4 is running at 100%. Right away you should realize something is wrong with how we are interpreting the nmon output – because our “stress” process is the only thing running on this processor. The mpstat utility is not using the kernel’s rusage process accounting data and it shows that our process is running on the CPU for the full time.

So… what in the world did IBM do? The answer – which Steve and Marcin published a few years ago – starts with the little mpstat column called “pc”. This stands for “physical consumption”. (It’s called “physc” in sar -P output and in topas/nmon.) This leads us to the heart of IBM’s solution to the system-wide CPU reporting problem.

IBM is thinking about everything in terms of capacity rather than time. The pc number is a fraction that scales down utilization numbers to reflect utilization of the core (physical cpu) rather than the hardware thread (logical cpu). And in doing this, they don’t just divide by four on an SMT-4 chip. The fraction is dynamically computed by the POWER processor hardware in real time and exposed through a new register called PURR. IBM did a lot of testing and then – starting with POWER7 – they built the intelligence into the POWER processor hardware.

In our example, we are using one SMT hardware thread at 100% in SMT-4 mode. The POWER processor reports through the PURR register that this represents 65% of the processor’s capacity, exposed to us through the pc scale-down factor of 0.65 in mpstat. My POWER7 processor claims it is only 65% busy when one of its four threads is running at 100%.

I also ran the test using two SMT hardware threads at 100% on the same processor in SMT-4 mode. The processor scaled both threads down to 45% so that when you add them together, the processor is claiming that it’s 90% busy – though nmon & topas will show each of the two processes running at only 45% of a CPU! When all four threads are being used at 100% in SMT-4 mode then of course the processor will scale all four processes down to 25% – and the processor will finally show that it is 100% busy.

On a side note, the %ec column is showing the physical consumption as a percentage of entitled capacity (2 processors). My supposed 65% utilization of a processor equates to 32.6% of my system-wide entitled capacity. Not coincidentally, topas shows the “stress” process running at 32.6% (like I said, solaris-style).
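To make the scaling concrete, here is a small sketch that replays the numbers from my tests. The scale factors in the table are only the values observed in this article (one, two and four busy threads); three busy threads was not measured, so it is left out. The function names are hypothetical, and the real factor is computed dynamically by the hardware, so treat this as arithmetic on observed values, not as IBM’s algorithm.

```python
# Per-thread scale factors observed in this article for one POWER7 core in
# SMT-4 mode. Three busy threads was not measured, so it is omitted.
OBSERVED_SCALE = {1: 0.65, 2: 0.45, 4: 0.25}


def reported_cpu_seconds(wall_seconds, busy_threads):
    """CPU seconds AIX would report for ONE fully busy process."""
    return wall_seconds * OBSERVED_SCALE[busy_threads]


def core_busy_fraction(busy_threads):
    """Fraction of the physical core reported as consumed (sum over threads)."""
    return busy_threads * OBSERVED_SCALE[busy_threads]


def percent_of_entitlement(physc, entitled_capacity):
    """The %ec idea: physical consumption as a share of entitled capacity."""
    return 100.0 * physc / entitled_capacity


if __name__ == "__main__":
    print(f"{reported_cpu_seconds(100, 1):.2f}")       # one hog for 100 s of wall time
    print(f"{core_busy_fraction(2):.2f}")              # two hogs on one core
    print(f"{percent_of_entitlement(0.65, 2.0):.1f}")  # pc=0.65, ent=2.0
```

One busy thread for 100 seconds comes out as 65 reported CPU seconds, matching the “time” output earlier. The entitlement calculation gives 32.5, which is close to the 32.6 in the mpstat output; my guess is that the small difference comes from pc being rounded to two decimals for display.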

So AIX is factoring in the PURR ratio when it populates the POSIX rusage process accounting structure. What is the benefit? Topas and other monitoring tools calculate system load by adding up the processor and/or process utilization numbers. By changing the meaning from time to capacity at such a low level, it helps us to very easily get an accurate view of total system utilization – taking into account the real life performance characteristics of SMT.

The big win for us is that on AIX, we can use our normal paging thresholds and we have better visibility into how utilized our system is.

The Big Problem With AIX/POWER7/SMT CPU Accounting Changes

But there is also a big problem. Even if it’s not a formal standard, it has been a widely accepted convention for decades that the POSIX rusage process accounting numbers represent time. Even on AIX with POWER7/SMT, the “time” command baked into both ksh and bash still uses the old default output format:

# time ./stress -c 1 -t 66  
stress: info: [34537674] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd
stress: info: [34537674] successful run completed in 66s

real    1m6.00s
user    0m41.14s
sys     0m0.00s

It’s obvious from the output here that everybody expects the rusage information to describe time. And the real problem is that many software packages use the rusage information based on this assumption. By changing how resource accounting works, IBM has essentially made itself incompatible with all of that code.
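Here is a minimal sketch of the kind of code that makes this assumption: it spins on the CPU for a fixed wall-clock interval and then asks getrusage() how much CPU time it used. The function name `spin_and_measure` is my own. On an otherwise idle Linux system I would expect the two numbers to be nearly equal; based on the tests in this article, the same logic on AIX/POWER7/SMT should report noticeably less CPU time than wall time, though I have not run this exact script there.

```python
import resource
import time


def spin_and_measure(wall_seconds):
    """Busy-loop for wall_seconds; return (wall, cpu) as seen via getrusage()."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    start = time.monotonic()
    while time.monotonic() - start < wall_seconds:
        pass  # burn CPU without sleeping or doing I/O
    wall = time.monotonic() - start
    after = resource.getrusage(resource.RUSAGE_SELF)
    # Typical application logic: treat ru_utime + ru_stime as seconds on CPU.
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return wall, cpu


if __name__ == "__main__":
    wall, cpu = spin_and_measure(2.0)
    print(f"wall {wall:.2f}s cpu {cpu:.2f}s ratio {cpu / wall:.2f}")
```

Any program that subtracts this CPU figure from elapsed time and calls the remainder “waiting” inherits the problem described here.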

Of course, the specific software that’s most relevant to me is the Oracle database.

I did do a basic truss on a dedicated server process; truss logged a call to appgetrusage() which I couldn’t identify but I think it’s most likely calling getrusage() under the hood.

# truss -p 15728860
kread(0, 0x09001000A035EAC8, 1152921504606799456) (sleeping...)
kread(0, 0x09001000A035EAC8, 1152921504606799456) = 207
kwrite(6, "\n * * *   2 0 1 6 - 0 6".., 29)     = 29
lseek(6, 0, 1)                                  = 100316
...
kwrite(6, " C L O S E   # 4 5 7 3 8".., 59)     = 59
kwrite(6, "\n", 1)                              = 1
appgetrusage(0, 0x0FFFFFFFFFFF89C8)             = 0
kwrite(6, " = = = = = = = = = = = =".., 21)     = 21
kwrite(6, "\n", 1)                              = 1

For what it’s worth, I also checked the /usr/bin/time command on AIX – it is using the times() system call, in the same library as getrusage().

# truss time sleep 5
execve("/usr/bin/time", 0x2FF22C48, 0x200130A8)  argc: 3
sbrk(0x00000000)                                = 0x20001C34
vmgetinfo(0x2FF21E10, 7, 16)                    = 0
...
kwaitpid(0x2FF22B70, -1, 4, 0x00000000, 0x00000000) (sleeping...)
kwaitpid(0x2FF22B70, -1, 4, 0x00000000, 0x00000000) = 26017858
times(0x2FF22B78)                               = 600548912
kwrite(2, "\n", 1)                              = 1
kopen("/usr/lib/nls/msg/en_US/time.cat", O_RDONLY) = 3

Problems For Oracle Databases

The fundamental problem for Oracle databases is that Oracle relies on getrusage() for nearly all of its CPU metrics. DB Time and DB CPU in the AWR report… V$SQLSTATS.CPU_TIME… extended sql trace sql execution statistics… as far as I know, all of these rely on the assumption that the POSIX rusage data represents time – and none of them are aware of the physc scaling factor on AIX/POWER7/SMT.

To quickly give an example, here is what I saw in one extended SQL trace file:

FETCH #4578129832:c=13561,e=37669,p=2,cr=527,...

I can’t list all the WAIT lines from this trace file – but the CPU time reported here is significantly lower than the elapsed time after removing all the wait time from it. Typically this would mean we need to check if the CPU is oversaturated or if there is a bug in Oracle’s code. But I suspect that now Oracle is just passing along the rusage information it received from the AIX kernel, assuming that ru_utime and ru_stime are both representing time.
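The subtraction I’m describing can be sketched like this. The parser only handles the `c=` (CPU) and `e=` (elapsed) style key/value pairs on PARSE/EXEC/FETCH lines, both in microseconds. The wait total passed in is a made-up placeholder, since the real WAIT lines aren’t shown here, and it assumes those waits happened inside the call.

```python
import re

CALL_RE = re.compile(r"^(PARSE|EXEC|FETCH) #(\d+):(.*)$")


def parse_dbcall(line):
    """Return (call_name, stats) for a PARSE/EXEC/FETCH trace line, else None."""
    match = CALL_RE.match(line.strip())
    if not match:
        return None
    stats = {}
    for pair in match.group(3).split(","):
        key, sep, value = pair.partition("=")
        if sep and value.lstrip("-").isdigit():
            stats[key] = int(value)  # skip truncated or non-numeric fields
    return match.group(1), stats


def unaccounted_us(stats, wait_us):
    """Elapsed time explained neither by reported CPU nor by waits (microseconds)."""
    return stats["e"] - stats["c"] - wait_us


if __name__ == "__main__":
    name, stats = parse_dbcall("FETCH #4578129832:c=13561,e=37669,p=2,cr=527,...")
    hypothetical_wait_us = 20000  # placeholder, NOT taken from the real trace file
    print(name, stats["c"], stats["e"], unaccounted_us(stats, hypothetical_wait_us))
```

Whatever is left after the subtraction is the unaccounted-for time, and on AIX/SMT some unknown share of it is really CPU time that the kernel scaled away.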

If you use a profiler for analyzing trace files then you might see something like this:

The key is “unaccounted-for time within dbcalls” – this is what I’ve seen associated with the AIX/Power7/SMT change. It’s worth scrolling down to the next section of this profile too:

There was at least a little unaccounted-for time in every single one of the 81,000 dbcalls and it was the FETCH calls that account for 99% of the missing time. The FETCH calls also account for 99% of the CPU time.

What This Means For You

The problem with this unaccounted-for time on AIX/SMT is that you have far less visibility than usual into what it means. You can rest assured that CPU time will always be under-counted. On AIX/SMT, a percentage of CPU time will always be reported incorrectly as unaccounted-for time – but there’s no way to know how much of the unaccounted-for time was actually CPU. Anywhere from 25% to 75% of the CPU time could be incorrectly rolled into unaccounted-for time. At any point in time each processor and process has a different percentage.

I’ve heard one person say that they always double the CPU numbers in AWR reports for AIX/SMT systems. It’s a stab in the dark but perhaps useful to remember. Also, I’m not sure whether anyone has opened a bug with Oracle yet – but that should get done. If you’re an Oracle customer on AIX then open a ticket and let Oracle know that you need their code to correctly report CPU time on POWER7/SMT!
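If you take the 25% to 75% range at face value, you can at least put rough bounds around a reported CPU number. This sketch assumes that “missing” means reported = actual × (1 − missing fraction), which is my reading of the behavior rather than anything documented, and the function names are hypothetical.

```python
def true_cpu_bounds(reported_cpu, min_missing=0.25, max_missing=0.75):
    """Return (low, high) estimates of actual CPU time for a reported value.

    Assumes reported = actual * (1 - missing_fraction), with the missing
    fraction somewhere between min_missing and max_missing.
    """
    low = reported_cpu / (1.0 - min_missing)
    high = reported_cpu / (1.0 - max_missing)
    return low, high


def doubled(reported_cpu):
    """The rule of thumb: equivalent to assuming half the CPU time is missing."""
    return 2.0 * reported_cpu


if __name__ == "__main__":
    low, high = true_cpu_bounds(30.0)
    print(f"reported 30s: actual between {low:.0f}s and {high:.0f}s, "
          f"rule of thumb {doubled(30.0):.0f}s")
```

Under that assumption the doubling rule lands inside the range, but the range itself is wide (roughly 1.3x to 4x), which is exactly why it’s only a stab in the dark.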

In the meantime we need to keep doing what we can. The most important point to remember is that when you see unaccounted-for time, some or all of it is normal CPU time which was not correctly accounted. As Karl Arao and others have said: when you’re on AIX/SMT, investigate operating system load/capacity and runqueue health with ONLY the psize, physc and app columns from lparstat (and corresponding values in nmon/topas).

If you’re running Oracle on AIX, I’d love to hear your feedback. Please feel welcome to leave comments on this article and share your thoughts, additions and corrections!

Patching Time

Jeremy — Wed, 15 Oct 2014 15:17:14 +0000

Just a quick note to point out that the October PSU was just released. The database has a few more vulnerabilities than usual (31), but they are mostly related to Java and the high CVSS score of 9 only applies to people running Oracle on Windows. (On other operating systems, the highest score is 6.5.)

I did happen to glance at the announcement on the security blog, and I thought this short blurb was worth repeating:

In today’s Critical Patch Update Advisory, you will see a stronger than previously-used statement about the importance of applying security patches. Even though Oracle has consistently tried to encourage customers to apply Critical Patch Updates on a timely basis and recommended customers remain on actively-supported versions, Oracle continues to receive credible reports of attempts to exploit vulnerabilities for which fixes have been already published by Oracle. In many instances, these fixes were published by Oracle years ago, but their non-application by customers, particularly against Internet-facing systems, results in dangerous exposure for these customers. Keeping up with security releases is a good security practice and good IT governance.

The Oracle Database was first released in a different age than we live in today. Ordering physical parts involved navigating paper catalogs and faxing order sheets to the supplier. Physical inventory management relied heavily on notebooks and clipboards. Mainframes were processing data but manufacturing and supply chain had not yet been revolutionized by technology. Likewise, software base installs and upgrades were shipped on CDs through the mail and installed via physical consoles. The feedback cycle incorporating customer requests into software features took years.

Today, manufacturing is lean and the supply chain is digitized. Inventory is managed with the help of scanners and real-time analytics. Customer communication is more streamlined than ever before and developers respond quickly to the market. Bugs are exploited maliciously as soon as they’re discovered and the software development and delivery process has been optimized for fast response and rapid digital delivery of fixes.

Here’s the puzzle: Cell phones, web browsers and laptop operating systems all get security updates installed frequently. Even the linux OS running on your servers is easy to update with security patches. Oracle is no exception – they have streamlined delivery of database patches through the quarterly PSU program. Why do so many people simply ignore the whole area of Oracle database patches? Are we stuck in the old age of infrequent patching activity even though Oracle themselves have moved on?

Repetition

For many, it just seems overwhelming to think about patching. And honestly – it is. At first. The key is actually a little counter-intuitive: it’s painful, so you should in fact do it a lot! Believe it or not, it will actually become very easy once you get over the initial hump.

In my experience working at one small org (two DBAs), the key is doing it regularly. Lots of practice. You keep decent notes and set up scripts/tools where it makes sense and then you start to get a lot faster after several times around. By the way, my thinking has been influenced quite a bit here by the devops movement (like Jez Humble’s ’12 berlin talk and John Allspaw’s ’09 velocity talk). I think they have a nice articulation of this basic repetition principle. And it is very relevant to people who have Oracle databases.

So with all that said, happy patching! I know that I’ll be working with these PSUs over the next week or two. I hope that you’ll be working with them too!

Grid/CRS AddNode or runInstaller fails with NullPointerException

Jeremy — Fri, 08 Aug 2014 18:43:43 +0000

Posting this here mostly to archive it, so I can find it later if I ever see this problem again.

Today I was repeatedly getting this error while trying to add a node to a cluster:

(grid)$ $ORACLE_HOME/oui/bin/addNode.sh -silent -noCopy CRS_ADDNODE=true CRS_DHCP_ENABLED=false INVENTORY_LOCATION=/u01/oraInventory ORACLE_HOME=$ORACLE_HOME "CLUSTER_NEW_NODES={new-node}" "CLUSTER_NEW_VIRTUAL_HOSTNAMES={new-node-vip}"
Starting Oracle Universal Installer...

Checking swap space: must be greater than 500 MB.   Actual 24575 MB    Passed
Oracle Universal Installer, Version 11.2.0.3.0 Production
Copyright (C) 1999, 2011, Oracle. All rights reserved.

Exception java.lang.NullPointerException occurred..
java.lang.NullPointerException
        at oracle.sysman.oii.oiic.OiicAddNodeSession.initialize(OiicAddNodeSession.java:524)
        at oracle.sysman.oii.oiic.OiicAddNodeSession.<init>(OiicAddNodeSession.java:133)
        at oracle.sysman.oii.oiic.OiicSessionWrapper.createNewSession(OiicSessionWrapper.java:884)
        at oracle.sysman.oii.oiic.OiicSessionWrapper.<init>(OiicSessionWrapper.java:191)
        at oracle.sysman.oii.oiic.OiicInstaller.init(OiicInstaller.java:512)
        at oracle.sysman.oii.oiic.OiicInstaller.runInstaller(OiicInstaller.java:968)
        at oracle.sysman.oii.oiic.OiicInstaller.main(OiicInstaller.java:906)
SEVERE:Abnormal program termination. An internal error has occured. Please provide the following files to Oracle Support :

"Unknown"
"Unknown"
"Unknown"

There were two notes on MOS related to NullPointerExceptions from runInstaller (which is used behind the scenes for addNode in 11.2.0.3 on which I had this problem). Note 1073878.1 describes addNode failing in 10gR2, and the root cause was that the home containing CRS binaries was not registered in the central inventory. Note 1511859.1 describes attachHome failing, presumably on 11.2.0.1 – and the root cause was file permissions that blocked reading of oraInst.loc.

Based on these two notes, I had a suspicion that my problem had something to do with the inventory. Note that you can get runInstaller options by running “runInstaller -help” and on 11.2.0.3 you can debug with “-debug -logLevel finest” at the end of your addNode command line. The log file is produced in a logs directory under your inventory. However in this case, it produces absolutely nothing helpful at all…

After quite a bit of work (even running strace and ltrace on the runInstaller, which didn’t help one bit)… I finally figured it out:

(grid)$ grep oraInst $ORACLE_HOME/oui/bin/addNode.sh
INVPTRLOC=$OHOME/oraInst.loc

The addNode script was hardcoded to look only in the ORACLE_HOME for the oraInst.loc file. It would not read the file from /etc or /var/opt/oracle because of this parameter.

On this particular server, there was not an oraInst.loc file in the grid ORACLE_HOME. Usually the file is there when you do a normal cluster installation. In our case, its absence was an artifact of the specific cloning process we use to rapidly provision clusters. As soon as I copied the file from /etc into the grid ORACLE_HOME, the addNode process continued as normal.
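A minimal sketch of that check-and-copy step, in case you want to fold it into a cloning script. The function name and the example paths are my own assumptions for illustration; only the idea (addNode wants oraInst.loc inside the grid home) comes from the troubleshooting above.

```shell
# Hypothetical helper: make sure an oraInst.loc exists in the given home,
# copying it from a central location (for example /etc/oraInst.loc) if missing.
#   $1 = Oracle home to check, $2 = central oraInst.loc to copy from
ensure_inventory_pointer() {
    home=$1
    central=$2
    if [ -f "$home/oraInst.loc" ]; then
        echo "present: $home/oraInst.loc"
    elif [ -f "$central" ]; then
        # -p keeps the original mode and timestamps where possible
        cp -p "$central" "$home/oraInst.loc" &&
            echo "copied: $central -> $home/oraInst.loc"
    else
        echo "missing: no oraInst.loc in $home or at $central" >&2
        return 1
    fi
}

# Example call (paths are assumptions):
#   ensure_inventory_pointer "$ORACLE_HOME" /etc/oraInst.loc
```

Run it as the grid software owner before calling addNode.sh; it changes nothing when the file is already in place.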

Sometimes it would be nice if runInstaller could give more informative error messages or tracing info!

OSP: Overview

Jeremy — Thu, 31 Oct 2013 15:11:22 +0000

This is the second of twelve articles in a series called Operationally Scalable Practices. You can read the introduction in the first article. In short, this series offers helpful suggestions for younger organizations and newer DBAs to best position them for very large-scale growth.

Before getting into specifics, we will lay out a general overview of the content. I expect this overview to be revised the most as the series is refined over time – so check periodically to see if there have been updates!

First, one quick note: my single overarching principle is simplicity. Your specific implementation of a guideline offered here will depend on your company size and will change as your company grows. A small team can get a lot of mileage out of online collaborative spreadsheets for inventory or deployment checklists. I recommend a bias toward simpler platforms like this instead of big enterprise software solutions – the key is that you’re *doing* the inventory or checklist itself. If you keep growing then sophisticated tools will eventually become necessary but they’re complicated and costly and add overhead to your team activities. Always try to find the simplest possible techniques and tools for your team as you implement the guidelines herein!

And now, here’s the outline of the next ten articles. The outline in itself is actually a great checklist as you think about your operations. It’s ambitious – hopefully I’ll be able to follow through with getting this written!

The Foundation

Keep documentation & processes somewhere with change history
Use checklists for general tasks, maintenance and deployments
- Start simple, grow slowly into more sophisticated systems (e.g. ticketing or release management)
Make sure your basics are solid: monitoring (both email & paging), backups, inventory
- Monitor what matters to your business from end-to-end (networks, applications, databases, storage)
- Actively manage paging events and thresholds, no “expected & non-critical” pages

Build a Standard Platform from the Bottom-Up (Part 1) (Part 2) (Part 3)

Define storage
- Start small & simple (local RAID)
- Minimum two independent volumes per server (recovery area)
- Set expectations for capacity, IOPS and throughput
- Define a growth path for the next year or so
- Morle’s SANE SAN is still worth reading for storage appliances
- Tread carefully with newer, less-understood, non-traditional storage (dedupe, serialized, etc)
Calculate or choose a rough core-to-memory ratio
- Need extra memory for consolidation
- This standard core count becomes minimum unit of licensing; balance cost implications with technical considerations
Discuss network communication for inter-DB traffic and backup traffic (DG, GG, DP, environment refreshes, etc)
Suggestions for defining slots per server
Suggestions for defining workload per slot

Build a Standard Cluster Platform (Part 1)

When to consider clusters and when not to consider clusters
- Parallel or distributed processing, fault tolerance, incremental growth, pooled resources for better utilization
- Expensive software licenses, require application modifications, complexity drives very expensive training and hiring
There are many ways to cluster (not just RAC)
Clustering with Oracle will require shared storage of some kind (DAS,NAS,NFS,SAN)
Individual services or applications can be active-active or active-passive independently
Keep it simple – for example, with slots and pools

Suggestions on Naming

Servers
Virtual Machines
DNS, NIS, Domains
Database, SID, etc
Users (NIS, OS, DB)
Cluster, DG config
Services
SQLNET (tnsnames)

Design Small yet Ready for MAA

Consolidate high in the stack
Scale up before out
Security topics: minimize privileges and roles
Maximize what you already bought, minimize expensive extras
- Express Edition, Personal Edition, Standard Edition
- Standby for fault tolerance (data guard, dbvisit)
- Other tools – especially for standard edition (MOS, sash, etc)
ASM vs FS, OMF
- Suggestions on DB file paths, names, extensions for all file types
Standardize on RMAN with a backup agent, hot backups, BCT
- Use storage snapshots and standby DBs for backups when justified
Create databases ready for adding features later with minimal change
- Physical replication (standby)
- Logical replication (goldengate, etc)
- Encryption
- RAC
Justify non-default config
Strongly justify hidden config

Managing Software (install/patch/upgrade)

Plan for patching
Suggestions for users and groups
- OS users & groups (ASM and CRS, individual accounts)
- DB users (including sysdba, sysasm, sysoper)
- Account privileges
- Records and auditing
Suggestions for directory layouts
- OFA
- Using (or not using) symbolic links
- Installing cluster software on local storage
Server time zones and time synchronization
Repackaging oracle binaries
- Versioning your packages

Managing Configuration & Scripts

Centralize
Use version control (a.k.a. revision control, source control or change control)
Configuration management and automation
Configuration that needs managing
- Database configuration (e.g. tnsnames, wallets, init files, pwfiles)
- OS configuration (e.g. hugepages, udev/asmlib)
Files and directories that need cleanup/housekeeping
Consistent management of jobs and schedules (cron, DB, oem, backup software, etc)
- Built-in DB maintenance windows and jobs
Management & administration GUIs

Operations Team Processes

Using automation and scripting (deployment, db create, backups, schema/user creation, config)
Utilize backups to move data
Trust Oracle not Google (when it comes to Oracle software)
- Cite references in process documentation
- Official documentation should be the source for 99% of processes
- MOS is secondary source; notes can have problems but may also be more detailed and up-to-date than the official documentation
- Heavy internal testing and documentation is only acceptable tertiary source
- Blogs, forums, wikis and other public websites are never an acceptable source for processes

Operations Team Calendar

Assess quarterly patches
Schedule recovery exercises
Schedule failover exercises
The importance of vacations

Attitude and Culture

Vendor support services are your ally not your enemy – even if you’re a small company
Value soft skills (personality insights, negotiation, other interpersonal communication skills)
Value learning & education
- Encourage a culture of experiments, tests, evidence and proof
- Always cite references
Understand business priorities: customer relationships, vendor relationships, cost, time, budgeting, negotiation, complexity, manpower…
Don’t get too cynical or too comfortable; expect business first
Importance of breaks and hobbies

This covers a lot of ground – but I still see a few potential gaps. In particular, I’m still getting up to speed on some of the latest updates to Oracle’s database. For those of you who are working with 12c – is there anything we should do differently to be prepared for upcoming changes? For example, I’m still contemplating whether this blueprint should be tweaked for multitenant or far sync readiness. If you have any thoughts then please share them!

Voting Disk Lies (CRS-4000)

Jeremy — Mon, 07 Oct 2013 16:33:56 +0000

Add this to the category of annoyingly unhelpful error messages.

I’m working on a mostly-automated process to create a new cluster by cloning another existing cluster. After running OUI (Oracle Universal Installer – called by config.sh to just run config assistants) there is a single ASM diskgroup which contains both the OCR and Voting Disk; however I wanted to switch the voting disks over to some different physical devices.

Upon which I received these errors:


(root)# /oracle/11203/grid/bin/crsctl replace votedisk +CLST3_VOTING
Failed to create voting files on disk group CLST3_VOTING.
Change to configuration failed, but was successfully rolled back.
CRS-4000: Command Replace failed, or completed with errors.

Lovely… so very informative. And just to be clear, it didn’t complete with errors, it completely failed. Thanks Oracle.

Maybe I used the wrong command to switch the voting disk? Maybe I need to start CRS in the special bootstrap mode? (Remember that the 11g docs apply to 11.2.0.1 only; for 11.2.0.2 and newer you need to use Oracle Support Notes like 1062983.1.) Maybe I should search oracle support for this error message? (Nada.)

Actually the Oracle Support knowledge base was a good idea. Note 1526096.1 didn’t solve my problem but it mentioned this error (CRS-4000) and it gave me a good lead: looking in the ASM alert log.

Aaaaand there’s our answer. :)


Mon Oct 07 10:04:24 2013
NOTE: [crsctl.bin@server1 (TNS V1-V3) 29941] opening OCR file

Mon Oct 07 10:04:24 2013
NOTE: updated gpnp profile ASM diskstring: /dev/mapper/*

Mon Oct 07 10:04:24 2013
NOTE: Creating voting files in diskgroup CLST3_VOTING

Mon Oct 07 10:04:24 2013
NOTE: Voting File refresh pending for group 2/0xb1dd29a2 (CLST3_VOTING)
NOTE: Attempting voting file creation in diskgroup CLST3_VOTING
ERROR: Voting file allocation failed for group CLST3_VOTING
Errors in file /oracle/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_29953.trc:
ORA-15221: ASM operation requires compatible.asm of 11.2.0.0.0 or higher
NOTE: Attempting voting file refresh on diskgroup BMCLST3_VOTING

Would’ve been nice to put that message on the console instead of only dropping it in the alert log – but at least now I’m more likely to remember checking the alert log if the console doesn’t make sense!
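For next time, here is a small sketch of the "check the alert log" habit as a shell function. The function name is made up, and the alert log location in the usage comment is an assumption based on the trace directory shown in the output above; adjust it for your own diag destination.

```shell
# Print only the ERROR: and ORA-nnnnn lines from an alert log.
#   $1 = path to the alert log
alert_errors() {
    grep -E '^(ERROR:|ORA-[0-9]+)' "$1"
}

# Example call (path is an assumption, derived from the trace directory above):
#   alert_errors /oracle/grid/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log | tail
```

Against the excerpt above this would print just the "Voting file allocation failed" line and the ORA-15221 line, which is all you need to see.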

How To Troubleshoot OEM 12c Cloud Control Auto-Discovery

Jeremy — Thu, 14 Feb 2013 14:41:05 +0000

I was recently involved with an upgrade project to go from 11.2.0.2 to 11.2.0.3 on an Exadata V2. We hit some snags during the upgrade specifically related to OEM 12c Cloud Control. We performed an out-of-place upgrade and OEM 12.1.0.1.0 had some difficulty in dealing with this.

12c Cloud Control is supposed to run a daily check which looks for new targets on each server. When it finds something new, it places this in a queue to wait for admin approval. With a single click you can promote the newly discovered target into an OEM managed object.

You can manually trigger this process. Go to Setup -> Add Target -> Configure Auto-Discovery -> Multiple Target-Type Discovery: Configure/Wrench-Icon -> Run Discovery Now
You can view the queue and promote targets by navigating to Setup -> Add Target -> Auto Discovery Results

It looks like a very nice process – except that it’s not working at all on our system. The stuff currently in our queue hasn’t existed on the server for years, and the new stuff is not getting detected and added to the queue. I haven’t completely solved this yet, but I have learned a few things in the process of working on it.

Few Useful References

First off, a few useful references about auto discovery. OEM is very extensible and you can create your own target types and write your own auto-discovery scripts.

Good overview with lots of pictures: http://docs.oracle.com/cd/E24628_01/doc.121/e24473/discovery.htm
Details about how discovery scripts are implemented and how they work: http://docs.oracle.com/cd/E24628_01/doc.121/e25161/discover.htm

Key takeaway: auto discovery is driven by a script which is sitting on the server. Generally written in Perl. This script does not change anything when it runs – it only generates output: a list of all targets that it currently sees. OEM sorts out which are new. This means that it’s safe to manually run the auto-discovery script on the server and see if it’s working as expected.

Additional note: some scripts will also write debug information (info/debug/trace) to emagent_perl.trc in the Management Agent log directory.

Manually Running Auto-Discovery

Took me a few tries to get all the library references right, but you can now just copy and paste the commands below to run the Oracle Home auto-discovery yourself. Should only be minor tweaks to run other auto-discovery scripts.

[oracle@server:~]$ cd $AGENT_HOME
[oracle@server:/u01/app/oracle/product/12.1]$ export LD_LIBRARY_PATH=$AGENT_HOME/core/12.1.0.1.0/lib
[oracle@server:/u01/app/oracle/product/12.1]$ core/12.1.0.1.0/perl/bin/perl -Icore/12.1.0.1.0/sysman/admin/scripts -Icore/12.1.0.1.0/perl/lib/site_perl/5.10.0/x86_64-linux-thread-multi plugins/oracle.sysman.oh.discovery.plugin_12.1.0.1.0/OracleHomeDiscovery.pl

Update Feb 18: You can get all four nodes at once by copying the following dcli command:

dcli -l oracle -g ~/dbs_group LD_LIBRARY_PATH=$AGENT_HOME/core/12.1.0.1.0/lib $AGENT_HOME/core/12.1.0.1.0/perl/bin/perl -I $AGENT_HOME/core/12.1.0.1.0/sysman/admin/scripts -I $AGENT_HOME/core/12.1.0.1.0/perl/lib/site_perl/5.10.0/x86_64-linux-thread-multi $AGENT_HOME/plugins/oracle.sysman.oh.discovery.plugin_12.1.0.1.0/OracleHomeDiscovery.pl

Manually Adding Targets

After you’ve run that script, you can manually add the targets using the exact outputs from the auto-discovery script. In most situations you probably don’t need to do this, but it’s neat that there’s a one-to-one correlation between the manual add parameters and the script output.

Manually Adding Oracle Home

I’m curious – has anyone else ever had to troubleshoot auto-discovery? Care to share about your experience? Any additional tips you can add?

Set Up Exadata for Cloud Control 12.1.0.2

Jeremy — Wed, 09 Jan 2013 14:39:04 +0000

I recently helped set up an Exadata X2-8 Database Machine with the latest version of OEM Cloud Control (12.1.0.2). A few documents do exist for this process – the most useful of which are the Exadata Discovery Cookbook and the Setup Automation Kit. However I found a few inconsistencies and problems; I think the existing documents I found were written on older versions of OEM and older versions of the tools. Also there are some additional steps for older Exadatas which didn’t apply to my case.

I’m publishing my final procedure here with hopes that it helps you, but as always please cross-reference this with the appropriate documentation before doing anything in your own environment.

At this customer’s request we also configured SNMP to integrate alerting with another system – in their case, Exadata-related alerts will be raised in BMC Event Manager. I’ve also included the steps I followed to enable this; it should be easy enough to tweak my procedure here for any SNMP-compatible monitoring system.

Steps to set up Exadata with Cloud Control 12c

IMPORTANT NOTE: the steps provided here are NOT a substitute for reviewing relevant documentation. Future environments might be different at any level (exadata, patches, agents, etc) and require a fresh review of documentation for changes.

Download Exadata Discovery Cookbook: http://www.oracle.com/technetwork/oem/exa-mgmt/em12c-exadata-discovery-cookbook-1662643.pdf
Download OEM Setup Automation Kit (Note 1440951.1 -> Patch 14628061)
Verify the Exadata and OEM Pre-Requisites. I found three good pre-requisite lists: (1) the Cookbook itself, (2) the Setup Automation Kit README and (3) Oracle Support Note 1437434.1. There’s a lot of overlap in these three lists but it’s worth checking all three documents because some new updates have only gotten into one or two of them. For this client, there were a handful of issues that we had to get resolved:
- Major Issue – OEM Self-Update: If your OEM server is not 64-bit Linux then it doesn’t have the 64-bit Linux agent in its library by default. This procedure relies on OEM’s agent deployment capabilities, so you need to add the 64-bit Linux agent to the OEM software library using OEM’s “self-update” feature.
  
  This client’s OEM server was hosted on AIX. Furthermore, they have somewhat restrictive network security policies. OEM’s self-update feature relies on the OEM server directly accessing some of Oracle Corp’s internet services; we submitted organizational security requests for this access.
  
  Offline Update: While waiting for network access, I gave offline update [OEM manual] a try. OEM gave me a URL and I was able to download this file. Then, there seem to be two choices in 11.2.0.2 for getting the file into the OEM repo. I think the manuals are invalid for this section; I never got it to work.
  
  1) The web interface asked me to upload the file directly, which isn’t mentioned in the manual. The file uploaded alright, then it claimed that a job was submitted to process the file. However I could not find any evidence that a job ever ran. I did notice that the local agent seemed to be unreachable.
  
  2) I also tried following the command-line process outlined in the manual. I received the error: “Specified file is not a valid Self Update catalog file. Please check and try again with a valid file.” Searched for info about this error in both metalink and google… no meaningful results.
  
  Online Update: This work was spread over a period of time. Before I had a chance to finish troubleshooting the offline update, our network access request came through. Following guidance from Support Note 1457376.1 (which Alex G found) we requested access for these hostnames:
  
  aru-akam.oracle.com
  ccr.oracle.com
  login.oracle.com
  support.oracle.com
  updates.oracle.com
  
  After network access was granted, online update worked flawlessly – although you do need to remember that it’s a two-step process: (1) download the file and (2) “apply” the file to the repository.
- Minor Issues – acquire all needed passwords, create missing accounts/passwords, ssh ciphers: First, it’s useful to remember that the default password for nearly every account and device in an exadata database machine is the same password. This is terribly useful – I’m so glad Oracle took the time to make this consistent. If nobody knows the password for some obscure device, try the default one. (Which I’m sure you already know!)
  
  I only had to create two new passwords: (1) a new account on the ILOM, which is easy and well-documented in pre-req notes and (2) a new password for the DBSNMP account in one existing database on the system.
  
  The pre-req notes also say you should make sure sshd explicitly lists certain ssh ciphers; I suspect this is mainly for some older exadata database machines. On my machine, the cells all matched the docs exactly but the compute servers didn’t have a cipher line in the sshd config. All the required ciphers are included by default but I went ahead and added an explicit line anyway.
Update file /opt/oracle.SupportTools/onecommand/em.param – the automation kit will use values here. I had to update the values for OMS_HOST, OMS_PORT and EM_USER.
Extract kit to exadata node 1 and run the kit as root. Ignore the cookbook instructions because they’re out-of-date; use the README from the kit. I just typed “perl setupem.pl” to run the kit. The kit is fantastic and I highly recommend it. I can also vouch for its rollback capabilities, which are excellent. I had to roll back the entire process after I’d finished it the first time in order to change the user it was installed under; it worked flawlessly. Just make sure to pay attention and follow all the manual instructions carefully whenever they are given! A few notes from this particular client:
- I skipped the exachk run – we had a very old version installed, I had problems running it even manually and didn’t have time/scope allocated to update it. In this case I manually verified everything related to the OEM setup and spent a lot of time making sure I had been thorough on this. In general though it’s definitely best practice to have an up-to-date version of exachk installed and use it.
- I received a DNS error “script can’t get domain” which was safe to ignore because I had already verified the correct domain manually.
- I received an infiniband error “script can’t SSH into switches” which was safe to ignore because I had already verified correct IB firmware versions manually.
The kit/script automates the entire process of getting agents installed properly onto exadata, including the exadata-specific extensions. After you finish this script, you’re 75% finished with the whole thing!
Follow directions and screenshots from the cookbook to perform exadata, cluster and database discovery through the guided processes. I found the cookbook to be reliable for this part. It’s pretty simple and the screenshots are quite nice. One note: during exadata discovery, you are presented with a list of which hosts act as primary monitors for various devices. Take note of which host you accept as default for the IP (Cisco) switch. After guided discovery, the cookbook will also walk you through configuring the switch to forward SNMP traps. You’ll need to remember which host you configured as the primary monitor!
For this particular client we had one additional bump: I wasn’t able to acquire the password for the unix user who owned the agent, and it couldn’t be changed either. (!!) The agent password is not needed for agent deployment since the kit runs as the root user. However it is needed for guided discovery. To get around this, I backed up /etc/shadow and changed the password just for the short time I was running discovery. I restored the shadow file and thus the original password after I finished.

At this point the Exadata is configured with OEM Cloud Control! Really it’s not that complicated, the tough part is just knowing where to start and then getting all the right pieces in the right places.

Steps to set up SNMP Integration for 3rd Party Alerting

Version 12c of OEM Cloud Control introduces some key new features around “Incident Management”. Before configuring SNMP notifications, it is very important to understand the underlying concepts around this feature.

Good starting points are chapters 3 and 4 of the Cloud Control Administrator’s Guide:
http://docs.oracle.com/cd/E24628_01/doc.121/e24473/toc.htm

After you understand OEM Incident Management, continue with these instructions to configure SNMP traps and create a test event:

Find your MIB by logging into the OEM server and getting the file $OMS_HOME/network/doc/omstrap.v1. The MIB and Events are documented in Appendices A, B and C of the Administrator’s Guide. Your team that manages your 3rd party alerting system will probably want to import this MIB. Note: our client did have one issue here. It seems that their current BEM tool allowed a maximum of 30 slots for storing variables on each event class. Unfortunately the MIB from OEM has one event class with 70 variables. (!!) The Event Class in question is “SNMP_oraEMNGEvent” and can be found starting at line 1314 of our 12.1.0.2 MIB file. We asked Oracle and of course the response was: “there’s important info after field 30 and there’s no workaround.” Nice. However we are still moving forward with this client (ignoring fields after 30 for that event class) and we’re going to see whether it causes any major problems in practice.
Login to the OEM web console
Navigate to Setup -> Notifications -> Notification Methods
Choose to Add SNMP trap targets using the guided wizard. You will need to know the IP addresses or hostnames of your SNMP hosts. Also at this point input the community if it’s not the default value of “public”.
Navigate to Setup -> Incidents -> Incident Rules
Create a new Rule Set for testing. My test ruleset:
- named “test tablespace low”
- applies only to exadata cluster database BIP
- one rule: “exadata tablespace free”
  - applies only to 1 metric: “Tablespace Space Used (%)”
  - one action: no conditions, call both SNMP trap targets

Trigger the testing ruleset which you just created:

SQL> create tablespace testalert1 datafile '+data' size 1m;

Tablespace created.

SQL> create table test_obj tablespace testalert1 as select * from dba_objects where 1=0;

Table created.

SQL> insert into test_obj select * from dba_objects where rownum<10000;
insert into test_obj select * from dba_objects where rownum<10000

SQL> select bytes from dba_segments where segment_name='TEST_OBJ';

     BYTES
----------
    983040

SQL> select user_bytes, bytes from dba_data_files where tablespace_name='TESTALERT1';

USER_BYTES      BYTES
---------- ----------
    983040    1048576

SQL> select bytes, user_bytes, blocks, user_blocks from dba_data_files where tablespace_name='TESTALERT1';

     BYTES USER_BYTES     BLOCKS USER_BLOCKS
---------- ---------- ---------- -----------
   1048576     983040        128         120

Wait several minutes, then navigate in OEM to Enterprise -> Monitoring -> Incident Manager
The standard view is selected by default. Look for the incident “Tablespace [TESTALERT1] is [100 percent] full” in the right pane and click on this incident to select it.
In the bottom pane, click on the EVENTS tab. Click on the Latest Event to open its details.
The General tab is open by default. In the Last Comment field, verify that the SNMP trap was sent.
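A quick bit of arithmetic on the query output above shows why a 1 MB datafile reports as completely full. This is only a sketch using the numbers printed by the queries; the helper name is made up and I'm not claiming this is the exact formula OEM uses for “Tablespace Space Used (%)”.

```shell
# Integer percentage of "used" relative to "total" (rounded down).
#   $1 = used bytes, $2 = total bytes
pct_used() {
    echo $(( $1 * 100 / $2 ))
}

pct_used 983040 983040     # TEST_OBJ segment bytes vs. USER_BYTES of the datafile
pct_used 983040 1048576    # TEST_OBJ segment bytes vs. raw BYTES of the datafile
```

The segment has consumed every usable byte (120 of the 128 blocks; the other 8 are file overhead), so the first ratio is the one that matches the incident text, while the second comes out a little lower.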

And that’s it. This worked for us – let us know if it worked for you or if you did anything differently!

Adaptive Log File Sync: Oracle, Please Don’t Do That Again

Jeremy — Fri, 19 Oct 2012 13:37:22 +0000

The Summary

Underscore parameter _use_adaptive_log_file_sync
- Default value changed in 11.2.0.3 from FALSE to TRUE
- Dynamic parameter
Enables a new method of communication for LGWR to notify foreground processes of commit
- Old method used semaphores, LGWR had to explicitly “post” every waiting process
- New method has the FG processes sleep and “poll” to see if commit is complete
- Advantage is to free LGWR from CPU work required to inform lots of processes about commits
LGWR dynamically switches between old and new method based on load and responsiveness
- Method can switch frequently at runtime, max frequency is 3 switches per minute (configurable)
- Switch is logged in LGWR tracefile, we have seen several switches per day
Few problems in general, possible issues seem to be in RAC and/or the switching process itself

The Story

We’re working with a customer now who had a very successful Exadata go-live last weekend. They moved a large application mostly unchanged from 9i to Exadata – and we’ve generally had very positive feedback. (!!) It might be that just getting on modern hardware and software accounts for this… but nonetheless it gives everyone a good feeling.

Nevertheless there are always a few little adventures. One of these was a mysterious orange glob on the Cloud Control radar Monday at 1am. Orange is the “commit” class – so this was a little surprising! (And reminiscent of Karl Arao’s famous halloween monster!) Few drill-down clicks and we’re looking at “log file sync” events. To make a long story short, it wasn’t any of the usual suspects and the necessary diagnostics info wasn’t available after the fact to say what happened with certainty. We’ve done a fairly comprehensive incident report and it hasn’t recurred… so I’m happy for now.

Anyway, I haven’t gotten to the really fun part yet. The really fun part is the #everydaylearning as Yury would say. In the process of analyzing the orange glob I ended up investigating whether it might be related to a new feature called Adaptive Log File Sync. I’d never heard of this before so I had quite a bit of fun learning, and it seemed worthwhile to share a few things I found.

Before I dive in, one other important thing: I have to mention Christo Kutrovsky. We were both on-site and we were both digging into this feature at the same time. There was a constant exchange of competing theories and different ideas about how things might work. Together we assembled a strong, in-depth working theory very, very quickly – I love working on a great team.

Log File Sync

First off, there are a few basic starting points for anyone troubleshooting time spent in the log file sync event.

For past problems, use AWR reports and ASH data
For live problems, use sql trace and Tanel’s snapper
Read Riyaj Shamsudeen’s excellent article about Tuning ‘log file sync’ Event Waits
Read Kevin Closson’s excellent article about LGWR processing.
Note 34592.1 is a generic reference note explaining the wait event
Note 137696.1 is a useful troubleshooting guide
Note 1064487.1 is a useful script to collect some diagnostic info for LFS troubleshooting

In my specific case, I discovered quickly from the ASH data that this was a very unusual situation. This led to some more creative searching in the Oracle Support KnowledgeBase – at which point I discovered something very interesting:

Note 1462942.1

Adaptive Log File Sync

Note 1462942.1 describes a feature whereby LGWR can switch between log write methods. This feature is enabled through an underscore parameter called _use_adaptive_log_file_sync and the description of this parameter is: adaptively switch between post/wait and polling.

Now I’ll be honest: searching on the internet and searching in Oracle Support’s KnowledgeBase for “adaptive log file sync” yields almost nothing. I do believe in the principle of BAAG – but this is a case where some guesswork might be useful, especially to guide experimentation that could nail down more concrete answers. Hence the disclaimer at the top of this article.

The words “post” and “wait” indicate that we’re talking about semaphores. For some general background on semaphores check out the wikipedia article and for more detail about the unix post() and wait() calls a good resource is chapter 30 from Andrea Arpaci-Dusseau’s (UW Madison) textbook on Operating Systems.

Oracle uses semaphores extensively. In fact LGWR and commits are specifically mentioned as an example in the Performance Tuning Guide – and Oracle even has an API to replace semaphore use with a third-party driver for lightweight post-wait implementations. (Like an ODM for your OS kernel.)

If you look for references to “commit” in the Oracle docs, you’ll find the word “post” everywhere when they talk about communication between the foreground processes and LGWR. Now, remember that a COMMIT has two options: first, IMMEDIATE or BATCH and second, WAIT or NOWAIT. It looks to me like this:

Immediate: FG process will post to LGWR, triggering I/O (default)
Batch: FG process will not post LGWR
Wait: LGWR will post FG process when I/O is complete (default)
NoWait: LGWR will not post FG process when I/O is complete

Why change a good thing? My theory is that it ties back to a rare problem Riyaj mentioned in that blog post over 4 years ago. I’ll quote him:

LGWR is unable to post the processes fast enough, due to excessive commits. It is quite possible that there is no starvation for CPU or memory, and that I/O performance is decent enough. Still, if there are excessive commits, LGWR has to perform many writes/semctl calls, and this can increase ‘log file sync’ waits. This can also result in sharp increase in redo wastage statistics.

Maybe Oracle is reading Riyaj’s blog? It appears that they came up with a new algorithm where LGWR doesn’t post. My guess: foreground processes can probably still post LGWR – but LGWR never posts back. Instead, foreground processes in WAIT mode “poll” either a memory structure or LGWR itself. It could be an in-house implementation by Oracle, it could still use semaphores, or maybe it uses message queues somehow [seems like a long shot but the unix poll() call is found there].

There’s one interesting challenge that I can think of in implementing this. With the semaphore approach, all the active commits are *sleeping* (off the CPU) while LGWR flushes the log buffer. There could be dozens or hundreds of foreground processes simultaneously committing on a very busy system. If we switch to a polling method, how do we ensure that these hundreds of processes don’t start spinning and steal CPU from LGWR, making the whole system even worse than it was in the beginning?

The answer might lie in a quick search for “adaptive_log_file_sync” from the underscore parameters. There are five more hidden parameters with that string:

select a.ksppinm name, b.ksppstvl value, a.ksppdesc description
from sys.x$ksppi a, sys.x$ksppcv b
where a.indx = b.indx and a.ksppinm like '\_%adaptive\_log%' escape '\'
order by name

Name	Default	Description
_adaptive_log_file_sync_high_switch_freq_threshold	3	Threshold for frequent log file sync mode switches (per minute)
_adaptive_log_file_sync_poll_aggressiveness	0	Polling interval selection bias (conservative=0, aggressive=100)
_adaptive_log_file_sync_sched_delay_window	60	Window (in seconds) for measuring average scheduling delay
_adaptive_log_file_sync_use_polling_threshold	200	Ratio of redo synch time to expected poll time as a percentage
_adaptive_log_file_sync_use_postwait_threshold	50	Percentage of foreground load from when post/wait was last used

It appears that there are a number of knobs to turn with this new algorithm – and it looks like Oracle is somehow dynamically calculating a “polling interval”. Furthermore, it seems to be taking “scheduling delay” into account.

It also seems that by default we’re limited to switching modes every 20 seconds (3 per minute). This switching seems to be controlled by “thresholds” – and the threshold for enabling the new polling mode seems to be based on time. From the default percentage, it looks to me like Oracle won’t switch until it thinks the poll time will be less than half of the current post time. For switching back it also seems to be another “half” percentage (50), though I’m not sure what “foreground load” might mean.
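The arithmetic behind that reading can be written out as a small sketch. This only encodes my interpretation of the parameter descriptions, not Oracle's actual logic: the function names are hypothetical, and whether the real comparison is "greater than" or "greater than or equal to" is a guess.

```python
def min_seconds_between_switches(high_switch_freq_threshold=3):
    # 3 switches per minute works out to one switch every 20 seconds.
    return 60.0 / high_switch_freq_threshold


def should_switch_to_polling(sync_time_us, expected_poll_time_us,
                             use_polling_threshold_pct=200):
    # Ratio of redo synch time to expected poll time, as a percentage.
    ratio_pct = 100.0 * sync_time_us / expected_poll_time_us
    # Assumption: switch once the ratio reaches the threshold, i.e. once the
    # expected poll time is at most half of the current sync time.
    return ratio_pct >= use_polling_threshold_pct
```

With the default of 200, a sync time of 1000 usec against an expected poll time of 400 usec gives a ratio of 250 percent and would switch, while an expected poll time of 600 usec (about 167 percent) would not.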

Now there’s one other underscore parameter which I noticed while looking for things related to log file sync:

Name	Default	Description
_fg_sync_sleep_usecs	0	Log file sync via usleep

This parameter is interesting for one reason: the adaptive sync parameters were introduced in 11g – but this parameter was introduced in 10g. And if you think about the name, it actually sounds very similar to the “polling” strategy and it doesn’t sound like a semaphore strategy at all! Foreground processes call usleep() during a log sync – if you’re calling wait() then you don’t need to sleep. But if you’re polling then you definitely need to sleep. Maybe Oracle has been working on this idea since 10g? I wouldn’t put it past them. :)

So how do you know if you’re using this feature? The most obvious sign will be in the LGWR trace file.

There will be messages each time that a switch happens, looking something like this:

*** 2012-10-16 01:47:50.289
kcrfw_update_adaptive_sync_mode: poll->post current_sched_delay=0 switch_sched_delay=1 current_sync_count_delta=8 switch_sync_count_delta=59

*** 2012-10-16 01:47:50.289
Log file sync switching to post/wait
Current approximate redo synch write rate is 2 per sec

*** 2012-10-16 02:51:19.285
kcrfw_update_adaptive_sync_mode: post->poll long#=51 sync#=352 sync=4600 poll=1061 rw=500 rw+=500 ack=7 min_sleep=1061

*** 2012-10-16 02:51:19.285
Log file sync switching to polling
Current scheduling delay is 1 usec
Current approximate redo synch write rate is 117 per sec
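If the trace file is large, a few lines of Python can pull out just the switches. This is a minimal sketch that assumes the messages look exactly like the sample above (a `***` timestamp line followed by a "Log file sync switching to ..." line). The function name is made up, and other versions may format the trace differently.

```python
import re

# Patterns based only on the sample trace lines shown above.
TIMESTAMP = re.compile(r"^\*\*\* (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)")
SWITCH = re.compile(r"^Log file sync switching to (post/wait|polling)")


def find_mode_switches(lines):
    """Return (timestamp, new_mode) for each switch message found."""
    switches = []
    last_timestamp = None
    for line in lines:
        line = line.strip()
        m = TIMESTAMP.match(line)
        if m:
            last_timestamp = m.group(1)  # remember the most recent timestamp
            continue
        m = SWITCH.match(line)
        if m:
            switches.append((last_timestamp, m.group(1)))
    return switches
```

You could feed it an open file handle, for example `find_mode_switches(open(path))`, where `path` points at your own LGWR trace file.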

I mentioned that the underscore parameters were introduced in 11g – however my current understanding is that only in 11.2.0.3 did the default value of _use_adaptive_log_file_sync change to true!! Is it comforting that they waited so long, or is it scary that they’d make this change with no available documentation or troubleshooting information? I guess it depends on your perspective. Here’s mine:

Explanation Please?

This is a change of critical, core code that impacted every single database installation and upgrade.
I think it is a strange departure from Oracle’s normal practices to change such important code with absolutely no announcement, documentation or technical troubleshooting information even in their customer-only support database.
In my opinion this requires explanation; they can do (and have done) much better.

Regardless, it’s now been 12 months since 11.2.0.3 was released. At present the feature is only mentioned in a single support article. This offers at least some evidence that adaptive log file sync hasn’t caused widespread panic. For the past year, many of you have been using this feature without even knowing – and there weren’t enough problems to merit even two Oracle Support notes.

I sent a few emails out to friends who also tend to work on large, loaded systems. I wasn’t the first person to stumble across this – James Morle mentioned it in a tweet a few months ago, and (not surprisingly) Riyaj has stumbled across the feature as well. It seems to me that the issues that have been encountered intersect with RAC and/or the switching process itself. But my informal survey was hardly scientific.

Overall, I think Oracle dodged the bullet this time. But they still made an important change without supporting documentation. To my friends inside the big red mothership: please don’t do that again! We like it better the other way.

Have you ever heard of this new 11.2.0.3 feature? Have you heard of the underscore parameter? I’d love to hear your stories!

Update September 2015: Frits Hoogland has published an article with a lot more detailed technical info about this – well worth the read!