Tuesday, June 30, 2009

Monitoring Scripts vs. Tools

If you have been monitoring databases for awhile, you probably have a set of scripts that you have to run against the database to provide you valuable information. The scripts might tell you if a tablespace is getting full, what indexes might need to be rebuilt, if there are any errors in the alert logs and other health checks against the database. If the monitoring provides good information in a timely manor, the DBA is able to be more proactive, like adding datafiles to tablespaces before they run out, or even reacting quickly to an issue that might arise in the alert logs and contacting the application team before they have a chance to pick up the phone.
So, are monitoring scripts being replaced by tools? Tools such as HP Openview or Oracle Enterprise Manager will provide alerts and notifications about several issues as well. Just configure a couple of thresholds and away you go. But what if the configuration takes more work then the quick kornshell script? For example, monitor tablespaces and let me know when they get under 20% free, but if it is a large tablespace such as 4TB use 80GB as a threshold instead of percent. I'm sure that this can be done with tools, but still haven't figured out quite how to do it yet. Where my script has and can still provide this list very easily.
So, how do we let go of these monitoring scripts that have been around since Oracle 7? Something that we have depended on for all of these years to do our checks of the database, and use a tool to do this for us. Well, I'm sure that maintaining the scripts does take time, and learning new things is fun as well. I think that they both have a place in our environments. Setting up a tool out of the box, might even provide a quick report much faster which might have been something you wish you had.
When looking at the tools be grateful for having them, because some of these scripts were developed because the budget didn't always allow for tools in the environments. But consider what is important to monitor, consider the ease of the tool to configure and then change if needed. Let them run in parallel for a little bit to confirm the same alerts and information is being sent. Then if there are those one or two little things that the scripts have been able to do better, keep the scripts around (maybe even let a tool company know of an enhancement idea). Also, keep an eye on the tool upgrades, for new things that they monitor that you might not have thought of. Enjoy getting health checks and proactive monitoring from whatever is available to you in the environment, because isn't it really about being able to address a problem very quickly or prevent one from happening in the database anyway!

Wednesday, June 17, 2009

Something is wrong with the database

So, the emails start flying, something is wrong, the database has a problem. That is a very typical situation, and instead of defending the database right away, take some time to do a quick check of a couple of things.
Check number one might just be too obvious, but check the alert for errors. Validate that there is nothing goofy going on. And while you are checking out the bdump directory, a quick glance at udump for any trace files that could also be out there might show some information.
Check number two, any invalid objects or unusable indexes? Make sure that all procedures, views, triggers have a status of valid, but before recompiling, make sure you grab that last_modified date, because it might be needed later. Also, unusable indexes that might need to be rebuilt should be noted for what tables they are on and see if they are part of the issue.
Check number three, validate that statistics are up to date on indexes and tables.
And then check to make sure that there are no objects that were recently changed. Check that modified date on all of the objects. Even a modification to a data type can cause a join that was previously working to fail.
Maybe you use the checks in a different order, but with just this four, any obvious errors on the server have been found, anything that has changed has been validated and noted as changes made to the database and statistics have been checked, which can either show that this regular type of maintenance is not running or things are looking good and up to date on the datebase.
So, something wrong with the database, possibly, but now after these quick checks you can pull out more details about what they are seeing and what can be wrong. There is also supporting information if things have been changed or modified and help drilldown to more of the issue at hand.