
TOD – Most cluster issues June 5, 2009

Posted by grumpydba in RAC, Tip 'o the day.

As DBAs, we have all heard the old axiom that 80-90% of database performance issues are query related. I have a similar axiom about Oracle RAC: 90% of all cluster startup issues are either disk (voting/OCR) or interconnect related.

Today I forgot the second part of that axiom when I could not get two nodes of a three-node cluster started. I ignored the interconnect part for two reasons. First, I had checked it with ifconfig and the NIC was up on all three nodes. Second, the error I would get in the ocssd.log file went on and on about:

clssnmReadDskHeartbeat: node(1) is down. rcfg(2) wrtcnt(4715) LATS(1135926) Disk lastSeqNo(4715)

incrementing and rapidly enlarging the log. After changing several settings on the multipathing and in /etc/udev/rules.d, I tried the old test:

ping -b 1.1.1.255

That is, perform a broadcast ping on the full range of the interconnect. Nodes 2 and 3 could see each other and node 1 could see only itself. Once I fixed the private VLAN issue, all was well and the cluster came straight up.
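The result of that broadcast ping can be turned into a quick partition check. The following is only a minimal sketch in Python: the node names and the `replies` data are hypothetical, entered by hand to mirror the story above, and `find_partitions` is an illustrative helper, not part of any Oracle tool.

```python
def find_partitions(replies):
    """Group nodes by the set of peers that answered their broadcast ping.

    `replies` maps each node name to the set of nodes that replied when
    that node ran the ping. A node is always counted as seeing itself.
    """
    partitions = []
    for node in sorted(replies):
        visible = frozenset(replies[node] | {node})
        if visible not in partitions:
            partitions.append(visible)
    return [sorted(p) for p in partitions]


# Hypothetical results matching the story: nodes 2 and 3 see each other,
# node 1 sees only itself.
replies = {
    "node1": {"node1"},
    "node2": {"node2", "node3"},
    "node3": {"node2", "node3"},
}

partitions = find_partitions(replies)
for members in partitions:
    print(", ".join(members))
if len(partitions) > 1:
    print("Interconnect is split: check the private VLAN and switch ports")
```

More than one group means the private network is split, no matter what the disk heartbeat messages in ocssd.log suggest.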

The moral of this story is that just because it walks like a disk problem, talks like a disk problem and acts like a disk problem, in clusterware, it might just be a network issue.

Attributes of a Great DBA #1 – Humility May 22, 2009

Posted by grumpydba in general.

Humility and a decided lack of ego. It is one thing to be confident in your skills; however, most of the best Oracle people I know are also the most humble. Case in point is Mike Ault. I have known of his work, books and appreciation for all things Oracle for many years. I met him in person about two years ago at RMOUG Training Days in Denver and again at IOUG Collaborate 09 in Orlando. Instead of regaling me with Oracle knowledge, he talked with me of diving, kids and tiramisu. Why is this so important? Because while you can learn sitting at the feet of a guru, you can also learn something new from even the lowest, newest and greenest of people if you are open to it. I have also known engineers who believe the world should revolve around them, and you know what? They are usually constantly involved in a perceived crisis that someone, anyone, else caused.

Humility will gain you the world. There are a few Oracle bigwigs out there who set up shop at a conference and look more like P.T. Barnum than a credible source, all in the name of hawking their books, services or advice. (Those that know me know of whom I speak!) Get a perspective from others whose talents differ from yours, whether lesser or greater. Listen to your colleagues and don’t rush to judgment on ideas just because they are from a newbie. You never know where the spark will come from that will solve a problem now or in the future.

Attributes of a Great DBA #2 – Integrity May 15, 2009

Posted by grumpydba in general.

Integrity – be honest in all you do; it is easier than trying to remember what you lied about! There is not a whole lot that is new under the sun. Ideas, scripts and processes are products of hashing and rehashing old ideas with new to create something that fits your needs. My favorite ASM scripts are based on Jeff Hunter's scripts. He certainly writes better formatted SQL than I do, and the scripts I based my ASM scripts on are very useful to me. Often plagiarism goes under the guise of “code reuse”. Reuse is fine, but give credit where credit is due.

I have solved many problems for customers over the years, but I try never to leave them without an understanding of what went on and how it was resolved, if at all possible and time permits. This is vital to your client relationship and your own sense of self worth. There are times when a root cause analysis does not bear anything out, and you have to explain to a client or manager that the cause could be found, but finding it may be cost and time prohibitive. Be honest in all things you do; covering something up almost always involves digging your own grave. Having been on more than one forensic analysis team, I have seen first hand what happens when someone either maliciously damages a system or damages it accidentally, and then tries to cover it up. It rarely works and the damage to your reputation can be permanent.

Attributes of a Great DBA #3 – Imagination May 13, 2009

Posted by grumpydba in general.

Imagination – Above intelligence? You bet. The ability to think outside the box is critical, and much of it comes from experience. I am not a huge Star Trek fan, but I remember a scene from The Wrath of Khan when the crew was trying to force their enemy’s shields down and Kirk said “You have to know why things work.” That is an excellent point: so many things can affect an Oracle instance, database, cluster, etc. It is often like the large mixing boards that music producers use in a studio. If you move one slider up, 12 others may move down. It partially comes back to #4: if you amass knowledge in many different areas, it will mix in such a way that your imagination can find solutions for which there is no (apparent) logic.

Attributes of a Great DBA #4 – Intelligence May 6, 2009

Posted by grumpydba in general.

Intelligence – This had to be in the list, right? Not number one, however. There is a difference between “book learnin’” and intelligence. Intelligence is more the process of working a problem through to the point of resolution. It can be coupled directly with wisdom. A DBA must have the ability to go from point A to B to C to solve a problem. With experience, that process may go from A to D to Z because of an intuition born of experience, even if you have not seen a similar problem before.

Where do you get this intelligence? As noted above, time, in the guise of experience, is a large part. Absorbing information from multiple sources is the majority of the rest. Oracle is the type of software that you learn by doing, not reading. Don’t just troll OTN forums and blogs, participate! Get a dialog going; you would be amazed at what you can learn from people in similar and dissimilar circumstances. If you are a RAC person, join the RAC SIG (www.oracleracsig.org). Most important, however, is to learn about more than just your area of expertise by getting outside your comfort zone. While I am a specialist at RAC, I try to be a generalist in as many IT engineering areas as I can. For example, don’t just subscribe to Oracle Magazine or Select Journal, get Network World or Storage. You may not understand all the topics and concepts, but with time you will absorb them, and when the time comes to make an architecture, support, design or down-time decision, you might find that some important data from outside your comfort zone has helped you make a better informed decision.

Monday at IOUG Collaborate 2009 May 4, 2009

Posted by grumpydba in general.

I am on site at IOUG Collaborate ‘09 this week in sunny Orlando, Florida. I will be speaking as part of a customer panel on RAC on Virtual Machines at 4:30 Wednesday. I have already had a great conversation on Oracle RAC and Streams with Arup Nanda and talked with Mike Ault, and I am looking forward to the conference sessions and chatting with the many friends I have made over the years.

Top 5 attributes of a great DBA April 23, 2009

Posted by grumpydba in general.

In the next few days I am posting my top 5 list of what it takes to be a great DBA.  While many may not agree, it is not all about knowledge and insight.   Am I a great DBA?  Time and my clients will tell, but these attributes will help to ensure a long and enjoyable career in the Oracle world.

We will start with #5 -

5. Sense of humor & grace under pressure – Jean Kerr may have been right when she said, “If you can keep your head about you when all about you are losing theirs, it’s just possible you haven’t grasped the situation”. Most of us have been in one type of crisis or another over the years involving Oracle, the systems it lies upon and users, managers and clients beating at the door with torches and pitchforks. Many of us have wondered if we would survive the onslaught. When you are at the end of your rope, tie a knot, make a joke and hang on; there is little left to lose. I have been in this type of situation many times. Most of the time I have been called into the middle of a disaster (or what is believed to be one) with the simple command of “fix it, it be broke!”, and have had a cube or conference call full of people staring over my shoulder, waiting on every keystroke. As bad as my typing is, that is never a good thing to watch.

The first truly memorable situation I had like this was back in the 1997-1998 timeframe. The web was just taking off and Amazon.com was the darling of the internet boom. I was working for Oracle Support at the time as a technical specialist, which is supposed to mean that I knew more than most about how the database kernel worked, when I got called in because Amazon was down. The call between myself and their DBA became the two of us and about 30 other people, all on the conference call. I was not allowed to hang up, transfer or call the client back. Everyone, including a senior Oracle VP and two VPs from Amazon, was on the phone and expecting me to articulate every move I made while I used a hex editor to manually rebuild several datafile headers that had become corrupted due to a bug. Someone noted that Amazon had just made the national news because they were down. It was not pretty. But after being on the phone for 422 minutes (our phones had counters on them), everyone signed off and the problem was fixed. The point of this anecdote is that I had to, with politeness and humor, be able to tell everyone on the line two things: one, that I was not going to repeat every keystroke I made to the audience, and two, to please shut up and let me do my work. If they had to make business or political decisions, they could do it on another call. I was tired, grouchy and more than once had to hit the mute button, but I kept my cool with the customer.

Now, I have been in the situation where I screwed up and the only non-expletive thing I could say was to quote that great American philosopher Urkel – “Did I do that?”. Oh boy, believe me, if you have been in the business long enough, you will break something, and break it bad. I have overwritten datafiles, dropped the wrong table and killed the wrong node, just to mention a few.

For those that know me, I fully admit, that when the mess is over (and sometimes when I push “mute” on the phone before it is over) I can get grouchy, grumpy and generally be a joy to be around.

However, a sense of humor does not always work, and you have to know when to pick your battles, as it were.  There is one hospital client I was working with when I first started working with McKesson. I was on a conference call with them and let out a couple of my humorous observations at which point I would swear I heard crickets in the background. I realized very quickly that they had no sense of humor and dropped it at that point. The best part of it was that I have not been on a conference call with that client since!

No one wants a comedian during a crisis. But establishing a good rapport with a client or group, whether they are experiencing a problem or just in general, is often easier with a little good humor and a lot of empathy and grace.

The Sun also Rises . . . April 20, 2009

Posted by grumpydba in general.

Today Oracle announced it was buying Sun Microsystems in a deal tentatively valued at $7.4 billion.  Why does Oracle want Sun?  One reason and one reason only, Java.  Java is a key part of Oracle’s Fusion middleware strategy and Oracle’s ownership will drive changes in what is supposed to be an “open source” programming language to meet their own needs.  

The bigger question is, what will Oracle do with the rest of Sun? Sun is really a hardware company that happens to “own” Java. Larry Ellison has said several times in recent years that he does not want Oracle to become a hardware company (remember the Network Computer?). While this may be true, their partnership with HP on the fast selling (and big margin) Exadata project shows that this statement is rather flexible. I would imagine that Oracle will sell off the parts of Sun it has no need for, probably to IBM. Sun itself has been cutting staff consistently for eight years now. I worked a consulting engagement with Sun’s storage division back in 2005, right at the time Sun bought StorageTek. The result? A large part of the division was scuttled almost immediately, as it was all but replaced by StorageTek hardware. On a personal level, I am glad it killed my contract, as a few days later I began my great relationship with McKesson. However, quite a few people on that campus were cut.

This may not be the case, however. Oracle has been going to great lengths to own the software stack. Currently, a company can start with Oracle Enterprise Linux, add the database, application server, Fusion middleware and a whole host of Oracle applications derived from Oracle apps, PeopleSoft, JD Edwards, Siebel and others. It is no longer a best of breed scenario. The next logical step would be to own the hardware that runs the stack. You might think that this would put Oracle at odds with HP, but I don’t think so. They will more likely begin a transition to the same love/hate relationship Oracle has with Microsoft.

Another unanswered question from the conference call was the fate of MySQL. I never thought it was a good idea for Sun to buy the open source database in the first place, but now Oracle owns it. I seriously doubt it will become the new Oracle Lite. More likely, any functionality deemed worthy will just be absorbed into the Oracle DB kernel, much like TimesTen and Sleepycat were.

It makes me wonder, however, if Oracle still has the cash to make the multi-billion dollar purchases, what will be next?  If Bill, Steve and the crew in Redmond get worried, I guess with all the cash Microsoft has in the bank, they could just buy IBM…or Oracle.

ASM and the "Vampire" database March 5, 2009

Posted by grumpydba in ASM.

Almost all of the 10g and above databases on which I work use ASM. One of the most common requests I receive on development systems is to add another LUN to give ASM more space. This is one of the great features of ASM: adding or removing disks without affecting the current databases. Before adding a LUN, however, I always check to see if the space is really needed. Sometimes there will be a database which has not been backed up in weeks, and it will have literally thousands of archive logs that are filling up ASM. That remedy is easy: either back up the database and archive logs, or delete the archive logs and run a new full backup.

Then there are the vampire databases, those that suck up disk space but are not active and not being used. Often they are databases that developers or DBAs shut down, removed from oratab and backups, and then forgot about, since no one notices the space they take up in ASM. In dealing with space issues, I needed a quick way to determine what databases were in ASM, whether or not they were actually running. Here is what I came up with:

REM asm_db_size.sql
REM Author - Jay Caviness - Grumpy-dba.com
REM 5 March 2009
REM ------------------------------------------------------------
set pages 999
set heading on
set feedback off
set lines 80
col "Size in MB" for 999,999,999
col "Database" for a25

select database_name "Database",
         sum(space)/1024/1024 "Size in MB" FROM (
  SELECT
      CONNECT_BY_ROOT db_name as database_name, space
  FROM
      ( SELECT
            a.parent_index       pindex
          , a.name               db_name
          , a.reference_index    rindex
          , f.bytes              bytes
          , f.space              space
          , f.type               type
        FROM
            v$asm_file f RIGHT OUTER JOIN v$asm_alias a
                         USING (group_number, file_number)
      )
  WHERE type IS NOT NULL
  START WITH (MOD(pindex, POWER(2, 24))) = 0
      CONNECT BY PRIOR rindex = pindex)
group by database_name
order by database_name
/

When run, it gives the following output:

Database                    Size in MB
------------------------- ------------
DB_UNKNOWN                          10
QA                              29,667
QACA20A                          9,011
QARA20A                         13,258
QCONFIG                         17,653
QHC1011                          8,874
QHR1011                         12,068
QICA11A                         17,247
QIRA11A                         27,041
TESTC02                          8,908
TESTR02                         11,791

In this case I know that TESTC02 and TESTR02 are not running and, upon checking with the owners of this system, I received permission to drive a stake through the heart of…um..er..remove these databases, saving 20G of space.
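The comparison between what sits in ASM and what is actually running can also be scripted. This is a rough sketch in Python that reuses the report rows shown above. The `running` set is an assumption built from the text (everything except TESTC02 and TESTR02), the helper names are made up for illustration, and skipping DB_UNKNOWN assumes it is a catch-all entry, not a real database.

```python
# Data rows copied from the report output above.
REPORT = """\
DB_UNKNOWN                          10
QA                              29,667
QACA20A                          9,011
QARA20A                         13,258
QCONFIG                         17,653
QHC1011                          8,874
QHR1011                         12,068
QICA11A                         17,247
QIRA11A                         27,041
TESTC02                          8,908
TESTR02                         11,791
"""


def parse_report(text):
    """Return {database_name: size_in_mb} from 'NAME   1,234' lines."""
    sizes = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank lines and the multi-word header
        name, size = parts[0], parts[1].replace(",", "")
        if size.isdigit():  # skips the dashed separator line
            sizes[name] = int(size)
    return sizes


def find_vampires(sizes, running, ignore=("DB_UNKNOWN",)):
    """Databases that hold ASM space but are not in the running set."""
    return {
        name: mb
        for name, mb in sizes.items()
        if name not in running and name not in ignore
    }


# Assumption: on a real host this set would come from oratab or the
# process list; here it is simply typed in.
running = {"QA", "QACA20A", "QARA20A", "QCONFIG",
           "QHC1011", "QHR1011", "QICA11A", "QIRA11A"}

vampires = find_vampires(parse_report(REPORT), running)
reclaimable_mb = sum(vampires.values())
for name, mb in sorted(vampires.items()):
    print(f"{name:<12}{mb:>10,} MB")
print(f"Reclaimable: {reclaimable_mb:,} MB ({reclaimable_mb / 1024:.1f} GB)")
```

A name on this list is only a candidate. Check with the owners before dropping anything.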

For a quick and dirty delete script for ASM, see the link page and get the “drop_asm_db.sql” script. Happy vampire hunting!

Why crossover cables are not supported in RAC February 27, 2009

Posted by grumpydba in RAC.

Many Oracle shops in this world use crossover cables, literally a network cable run directly between nodes, as the interconnect between two RAC nodes. Does this work? Yep, you bet. Is it supported? No. Why? Well, it all has to do with how a node reacts when its sister fails in a two-node cluster.

Each node in the cluster constantly checks on the other nodes in the cluster through both the network (interconnect) and storage (voting disks). If one or both are lost, the cluster node is instructed to commit suicide and reboot itself in hopes of rejoining the cluster healthy and happy.

If a crossover cable is used and one of the nodes drops, the remaining node will have to wait for the TCP timeout, generally 60-300 seconds, before it realizes that the lost node is gone. At that point, the cluster will remove the lost node. Two things can happen during that time: the surviving node can lock up, literally freezing while it waits for the timeout, and/or the cluster can become very confused if the dead node restarts and attempts to join at a point when the cluster still thinks it is there. Strange things have been known to happen: many errors are thrown, and at times both nodes will evict and restart.

Having a switch between the nodes allows a signal to be sent immediately if a node quits responding, at which time the surviving node will check for 60 seconds then evict the failing node, allowing it to rejoin (upon reboot) a clean cluster without any problems.
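The 60-second check described above can be pictured as a simple heartbeat timeout. What follows is a toy model in Python only, not how Oracle Clusterware is actually implemented: `peers_to_evict`, the node names and the timestamps are all hypothetical, and the 60-second default is taken from the paragraph above.

```python
def peers_to_evict(last_heartbeat, now, timeout_seconds=60):
    """Return the peers whose last heartbeat is older than the timeout.

    `last_heartbeat` maps a peer name to the time (in seconds) at which
    its most recent heartbeat arrived; `now` is the current time on the
    same clock.
    """
    return sorted(
        peer
        for peer, seen in last_heartbeat.items()
        if now - seen >= timeout_seconds
    )


# Hypothetical timestamps: node2 went quiet 65 seconds ago, node3 was
# heard from 7 seconds ago.
last_heartbeat = {"node2": 100.0, "node3": 158.0}
print(peers_to_evict(last_heartbeat, now=165.0))
```

The point of the model is that the surviving node can evict on a short, predictable timer instead of sitting through a much longer network timeout.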

In short, crossover cables are fine in an emergency or in development, any situation where failover is not critical. But for production, spend the money on a good switch, two in fact if you can bond your NICs (that’s for another post), for the best scenario to survive a failover with as few issues as possible.
