Wednesday, September 2, 2009

Restore that just wasn't wanting to happen

So, maybe I should read my own blog postings, or get more sleep, but I recently caused myself plenty of problems while attempting a simple restore.
I had backups, check. I had a point in time that I wanted to restore to, check. I had a good reason to restore the whole database, check. What I didn't have was the undivided attention that it needed, or a proper plan for the things that went wrong.
So, I started off on my adventure. Knowing that there was no activity on the database, I just chose any time around the point of failure, without checking for fuzzy issues. (I'll come back to fuzzy.)
Opened up RMAN, connected to target, ran the script to allocate channels for tape, restore database until time, recover database until time. Restore started, and I thought in an hour I would be good to go again. Check back, still running; check back, still running. OK, that is strange, nothing showing issues, it just looks like it is hanging. Wonder if it is waiting for tapes. The simple thing to do, and what I should have done, was to call the backup team and ask about the tapes I was trying to access. Instead, I thought, well, let's try again for a different time, because I just need it around this time, and maybe I will be able to hit different tapes.
Started it up again, and this time my computer crashed in the middle of it. So several hours later, restore still not complete, and now I really have a database that is not usable. Fun stuff.
Cleared out all of the processes that might have been left over from the crash, picked my point in time. Contacted the backup team to make sure I didn't have locks on the tapes and that they were available. Restore, recover. Open database - file 1 needs media recovery. And this is where FUZZY comes in. The point in time I had randomly picked, without doing my homework, had restored a datafile with a different SCN than the others. So, at this point of course I am wishing that I had done my homework, and that I had treated this restore as a production restore instead of thinking, it is just a test system, so no big deal.
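
A minimal sketch of what that homework could look like, in Python. The query reads V$DATAFILE_HEADER, which reports FUZZY and CHECKPOINT_CHANGE# for each datafile; the function name, the tuple layout, and the sample SCNs are made up for illustration, and how the rows get fetched is left to whatever Oracle client you use. It is not the check from the original restore, just one way to spot the odd file out before trying to open the database.

```python
# Query against Oracle's V$DATAFILE_HEADER view. Running it is left to
# whatever client you use; nothing below connects to a database.
HEADER_QUERY = """
SELECT file#, fuzzy, checkpoint_change#
FROM v$datafile_header
ORDER BY file#
"""


def files_needing_recovery(rows):
    """Return file numbers that are fuzzy or behind the highest checkpoint SCN.

    rows: iterable of (file_no, fuzzy, checkpoint_scn) tuples, shaped like
    the result of HEADER_QUERY (hypothetical layout for this sketch).
    """
    rows = list(rows)
    if not rows:
        return []
    # Assumption: every file should reach the same SCN as the most
    # advanced one before the database can be opened.
    target_scn = max(scn for _, _, scn in rows)
    return [
        file_no
        for file_no, fuzzy, scn in rows
        if fuzzy == "YES" or scn != target_scn
    ]


# Made-up header rows: file 1 was restored at an older SCN than the rest.
sample = [(1, "NO", 1049872), (2, "NO", 1050311), (3, "NO", 1050311)]
print(files_needing_recovery(sample))  # prints [1]
```

Read-only or offline datafiles can legitimately sit at a different SCN, so treat the result as a list of files to look at, not a verdict.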
I would like to say that after all of this, I was able to restore with the next attempt, but I ran into one more issue. Since I was trying to duplicate production into test, I was using duplicate, and the restore was using the flash recovery area, and guess what... all of these attempts had filled up that destination. Of course! A simple query to find out the space available, clear this area out, and I was ready for another attempt.
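
The space check can be sketched the same way. V$RECOVERY_FILE_DEST reports SPACE_LIMIT, SPACE_USED and SPACE_RECLAIMABLE in bytes; the helper names and the numbers below are hypothetical, and this is only a rough pre-flight estimate, not the exact query used at the time.

```python
# Query against Oracle's V$RECOVERY_FILE_DEST view (all sizes in bytes).
SPACE_QUERY = """
SELECT name, space_limit, space_used, space_reclaimable
FROM v$recovery_file_dest
"""

GIB = 2 ** 30


def fra_available_bytes(space_limit, space_used, space_reclaimable):
    """Free space in the recovery area plus what Oracle reports as reclaimable."""
    return space_limit - space_used + space_reclaimable


def fra_has_room(space_limit, space_used, space_reclaimable, needed_bytes):
    """True if the estimated available space covers the planned restore."""
    available = fra_available_bytes(space_limit, space_used, space_reclaimable)
    return available >= needed_bytes


# Made-up figures: a 100 GiB area, 96 GiB used, 10 GiB reclaimable,
# and a restore expected to need about 60 GiB.
print(fra_available_bytes(100 * GIB, 96 * GIB, 10 * GIB) // GIB)  # prints 14
print(fra_has_room(100 * GIB, 96 * GIB, 10 * GIB, 60 * GIB))      # prints False
```

Running something like this before each attempt would have shown the leftovers from the failed restores piling up.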
I am sure at this point you are either crying or laughing with me or at me. But I share this because there were several things I could have done along the way to make this restore simple to begin with. Even the simple tasks we perform can cause issues with the database or the things that we touch. By not treating this at the same level as a production restore or issue, I wasn't as prepared as I should have been. Did I create some great documentation for the problems and how to fix them, to prevent this in the future? I sure did! But that really shouldn't be the point of doing a restore. I am hoping to save others from going through the same process and trouble, and it has already been documented ;-)