This is a post for T-SQL Tuesday #98 and it is about a time when you faced a technical challenge that you overcame- for me this is a story about database corruption. #TSQL2sday was created by Adam Machanic. It is being hosted by Arun Sirpal – Thanks Arun!
One of my favorite movies of all time is Dr. Strangelove Or: How I Learned to Stop Worrying and Love the Bomb. It is a dark comedy. A crazy General exploits a limitation in a nuclear attack plan and issues orders for his bombers to execute “Wing Attack Plan R”. Not to be confused with “Wing Attack Plan Python“, ha. It starts a sequence of events which should have never have happened with fail-safes, human in the loop systems, recall of bombers, arming systems- but some nuclear bombs on one lone bomber were on their way to getting dropped and thus starting a nuclear war of epic scale. And I love Stanley Kubrick films too.
Despite current rhetoric to the contrary, the POTUS does NOT have a button to push to launch nuclear weapons. He can order their use but he doesn’t have a big red button on his desk to bang away on like an arcade-style video game.
I am no longer with the company who hired me for this client and the client company closed the main office which I used to work at here in Colorado. ALL of the consultants, managers, directors and other employees I worked with and knew are also no longer with this company.
It was a strange situation. I have too many stories, Good, Bad, Ugly and Fugly about this client. I was with them for several months. They had been through many full-time employees as database administrators and they also had a large developer team out in California. The company was bleeding money and filing for bankruptcy. In the consulting business, sunsets can be long and yes there is money to be made riding out into the sunset. Eliminating pain is attractive to investors. Another competitor was in the similar state of affairs and they decided to merge the two bankrupt companies together as one. When I was there, people were leaving on their own as it is easier to find a job when you still have a job than if one waits until layoffs or firings occur. I did some quick research for this post and found out that in May of 2016, they filed for their THIRD bankruptcy in seven years and their stock is no longer publicly traded- it is a private company. If you were the CEO, how would you attract the best and the brightest to come work for you in a dying industry? Hm.
How To Save Money – Get Rid of Temps, Vendors and Consultants
Yes you read that right. And so the story begins. I was brought in for performance tuning, disaster recovery, migrations and general database administration work. The main project was to migrate a single SQL2005 mission-critical instance to SQL2012 and all that it offered including failover clustering and Availability Groups (AG). Off-loading reports to the AG was going to be yuge as well as all of those jacked up ad-hoc queries Business Intelligence and Sales people like to run against up-to-the-second production database stuff. NOT hitting the main Production instance was going to make MANY executives extremely happy provided of course their queries used existing indexes. But I digress…
One of the cost-cutting measures was getting rid of another consulting company which performed many tasks including managing a messaging system that had a SQL2008 backend. They were also in the midst of migrating that system from on-prem to the cloud.
So I got pulled into a meeting and was told I was now responsible for system XYZ. No documentation, no hand-off, just here is the server name and go manage it. They were also using software to perform the migration and they were not going to spend another dime with that software vendor as it had gotten ugly with them financially.
I did a quick sanity check and look-see. Under-powered and running on an old server with 8GB of RAM hosting three instances and the messaging system migration was using the instance with a ~2TB database on it. Disk storage was on an old HP SAN. Average CPU load was in the high 70’s. SQL 2008 Standard Edition and no backup compression software either. Very few SQL jobs, just indexing and backups. No jobs to run CHECKDB and no SQL Alerts either…
Mr. Murphy Walks In Monday Waving Wing Attack Plan R
I respect Murphy’s Law but if I ever meet him in person, I might be compelled to beat him. Repeatedly, if necessary. No, I don’t have advanced alien weapons like these but sometimes I wish I did (contains foul language, NSFW): The Doctor is gonna bust a cap in yo ass
With limited disk space for backups and with management unwilling to spend money on backup compression software like we were using on the main system I was brought on to migrate, the previous consulting company had to make due with what they had and what I had just now inherited. Which was: full database backup to a share on Monday, Wednesday, Friday and Sunday. Backup to a different share on Tuesday, Thursday and Saturday. And then there was a job to delete the previous two backups for each share. I was told verbally by management that they had these contractual database backup requirements. And that if I wanted to change things, I had to prove that they worked before implementing them. But they didn’t have a dev/qa/uat/test system and what was worse, they didn’t have ANY disk space to even test a restore on. So I could not implement something like a once a week full, differential every day, and transaction log backups running on a 15 minute basis. NOTE!!! You need a restore strategy, not a blind backup strategy. So they had never tested a backup restore of the 2TB database due to lack of disk space.
First thing Monday morning someone from the messaging team asks me to look at the database as it looks like their on-prem to cloud migration process suddenly slowed WAY down over the weekend.
I login, launch SSMS and my heart sank. All of the SQL Logs showed today’s date on them. Not a good sign. I open the most recent one and saw Stack Dump on Page ID X. And Y. And many more. Log said data loss can occur and to run DBCC CHECKDB immediately. We had a meeting and after looking at msdb.dbo.suspect_pages table, I knew the system was in trouble. Management was HESITANT TO OPEN A TICKET WITH MICROSOFT BECAUSE THAT WOULD COST MONEY! I knew my consulting hourly rate and just shook my head. Instead of calling for help right away they wanted me to mess around with it for hours and/or days. I still pressed to open a ticket because I was rightfully nervous about the backups.
Guess what? I determined the corruption happened sometime Saturday. But did we have a backup from last Friday? Nope- because there was a job to delete the previous two days’ backups. So I was unable to restore a backup. The backups from Saturday and Sunday contained the corruption so they were not going to sadly be of use.
This was serious. I had just inherited a system that shortly went belly up and they were looking at me to not panic, fix it, and move on. I had dealt with database corruption before but never this bad. I called other consultants on my team, reached out to others, and began wielding my Google-Fu skills. In the meantime the storage team ran diagnostics on the SAN and couldn’t find anything wrong and no changes had been made.
Here was the scope of the problem. Not all corrupted pages were listed in the SQL log or were listed in msdb.dbo.suspect_pages. DBCC CHECKDB would itself issue a stack dump when it encountered the first of many corrupted pages. Same for running WITH REPAIR_REBUILD. Consulted with management again about data loss and if running CHECKDB WITH REPAIR_ALLOW_DATA_LOSS would be acceptable. Got the go-ahead but no joy.
I could re-produce the stack dump by running a SELECT command and giving it a specific value for a clustered index on the largest table. Corruption appeared to be isolated to just this one table. Also, intentionally making a system do a stack dump is not a good idea either as I found out that if I did that repeatedly sometimes the instance would re-start itself too… I had to figure out how to 1) save the data 2) see if its possible to save the corrupted data and 3) get rid of the database corruption.
Well, not exactly but I did have the awesome SQL Server 2008 Internals book written by her and many other smart people including Paul. Signed copy even! So here are the challenges. How do you write scripts to pull out the good data into another new table without causing a stack dump? And don’t do anything that isn’t Microsoft supported.
First thing I did was test the clustered index ranges. I learned SQL Server does NOT keep sequential Page IDs with a table. Meaning I somehow needed to find all of the Page IDs belonging to a table. Because by using DBCC PAGE I could “see” the clustered ID I could then run a SELECT and insert that row data into a new table. So I made a script to run through all the pages and figure out which ones were bad. From Kalen’s book to all of the posts by Paul Randal on sqlskills.com I was able to migrate as much data as I could. We ended up loosing about 20,000 rows out of 10’s of millions.
In Rides Cowboy Matt Dillon
My pleas to upper management were finally heard and they agreed to let me open a ticket with Microsoft using a manager’s credit card. Microsoft verified I had done all the right things in the right order which was nice to hear. I got an earful about NOT being able to restore from a backup and that we were NOT on the latest Service Pack for SQL2008. It was determined that the overloaded system had probably encountered a memory scribbler problem and had corrupted the metadata of several pages in memory vs on-disk which prevented reading any data from them.
In the end I had exported all the data to a new table, re-named the old, re-named the new table to the old name and re-created all of the non-clustered indexes. Double checked all of the foreign key relationships too. And management relented and I changed the backup schedule as well. Still didn’t have disk space to test a restore, even after all of that 🙂
After I was done at this particular client and successfully migrated that other system from SQL2005 to SQL2012, I took a Pluralsight course by Paul Randal called SQL Server: Advanced Corruption Recovery Techniques. In it, he shows how to take a peak at metadata and also even how to extract data OUT of non-clustered indexes! I wish I had watched this course before I encountered this database corruption. It could have saved me from a lot of pain, stress, and wasting time.
I hope there is NOT a next time for database corruption like this but at some point in your data career you will encounter things way beyond your level of expertise and you will need to figure things out for either your employer or for your client. Thanks again to Kalen Delaney and Paul Randal, some Rockstar authors!
Maj. T.J. “King” Kong: “…Survival kit contents check. In them you’ll find: one forty-five caliber automatic; two boxes of ammunition; four days’ concentrated emergency rations; one drug issue containing antibiotics, morphine, vitamin pills, pep pills, sleeping pills, tranquilizer pills; one miniature combination Russian phrase book and Bible; one hundred dollars in rubles; one hundred dollars in gold; nine packs of chewing gum; one issue of prophylactics; three lipsticks; three pair of nylon stockings.
Shoot, a fella’ could have a pretty good weekend in Vegas with all that stuff.” <- What does YOUR database administrator bug-out bag look like? Hehe.