Revisiting Lady MacBeth and Her Torturous Lies

July 20, 2010

A while back, I wrote up a genius piece of code that would automatically shrink my log files whenever they grew. Kendra Little (blog | twitter) completely called me out for my horrible, sneaky, developer ways. Ostensibly, I had found a solution for my rampant log growth problem. Unfortunately, I had cured the symptom and not the underlying issue.

After growing tired of her savage abuse and criticism via gtalk, I looked for the source of the problem. No, not me. The other source of the problem. I set up monitoring on the server in question, waited for the appropriate log death window, and then read my report. Before you think I’m using fancy tools that nobody can afford, I set up Profiler and Perfmon and then merged the results together.

The reports from the single server showed me… nothing, really. There was a lot of I/O, and a backup job overlapped with a re-index by about 2 minutes. The logs also didn’t fill up. To be on the safe side, I adjusted the jobs and then sat around making frowny faces for a few minutes.

Then I remembered that all of the servers are connected to the same SAN, so an I/O issue on one server could cause problems on all of the others. I set up monitoring on the remaining production servers. This time around, the logs filled up, I received a ton of emails, and I also found out something important: all of my backups and re-indexing operations were running at the same time. The SAN’s I/O throughput was saturated, which was causing the backup and re-indexing jobs to run slowly.

To solve the problem, I looked at the average job run times and arranged the jobs so that they had much more downtime between them (to account for other issues that could slow down the jobs). This took a bit more effort than I expected because of SLAs within the company. I also re-wrote the jobs so that the backups and re-indexes could never run at the same time and would, instead, run in series. Once I had this change in place, I waited and watched.

Sure enough, the incredible ever-growing log file problem stopped happening (unless I do something dumb like move 30,000,000 rows of data).

Moral of the story: make sure that you’re addressing the cause of the problem and not the symptoms.
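
For the curious, the scheduling fix described above can be roughly sketched in Python. This is only an illustration, not the actual job code: the job names, the run times, the 50% padding, and the helper names `find_overlaps` and `serialize` are all hypothetical. The idea is simply to check a set of job windows for overlaps, then lay the jobs end to end with slack proportional to each job's average run time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Job:
    name: str
    avg_minutes: int  # average observed run time (made-up figures below)


def find_overlaps(windows):
    """Return pairs of job names whose (name, start, end) windows overlap."""
    ordered = sorted(windows, key=lambda w: w[1])  # sort by start time
    overlaps = []
    for i, (name_a, _start_a, end_a) in enumerate(ordered):
        for name_b, start_b, _end_b in ordered[i + 1:]:
            if start_b >= end_a:
                break  # sorted by start, so no later job can overlap job A
            overlaps.append((name_a, name_b))
    return overlaps


def serialize(jobs, first_start, padding=0.5):
    """Run jobs in series, leaving padding * avg run time of slack after each."""
    schedule = []
    start = first_start
    for job in jobs:
        end = start + timedelta(minutes=job.avg_minutes)
        schedule.append((job.name, start, end))
        # The slack absorbs nights when a job runs slower than its average.
        start = end + timedelta(minutes=job.avg_minutes * padding)
    return schedule


if __name__ == "__main__":
    midnight = datetime(2010, 7, 20, 0, 0)
    # Hypothetical "before" picture: the re-index starts 2 minutes early.
    before = [
        ("backup", midnight, midnight + timedelta(minutes=60)),
        ("reindex", midnight + timedelta(minutes=58),
         midnight + timedelta(minutes=120)),
    ]
    print(find_overlaps(before))  # [('backup', 'reindex')]

    after = serialize([Job("backup", 60), Job("reindex", 62)], midnight)
    print(find_overlaps(after))   # []
```

Padding by a fraction of the average run time is just one way to pick the gap. With several servers sharing one SAN, the same check would need to run over the combined job windows from every server, since that is where the overlap actually showed up.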