A Bit of Troubleshooting

A client recently asked me for help with their SQL Server environment. Replication was running slowly and getting further and further behind; it had been turned off during a period of heavy data modification and only turned back on several days later.

Protip: This is why it’s important to have a full checklist for everything that you do on a server.

Check Everyone’s Health

When you have a complicated system you want to take a look at everything, not just the symptoms of the problem. This happens in medicine, economics, and manufacturing. Why shouldn’t we do it in the datacenter? The very first thing I did was take a look at the health of the publisher. That server was running well within normal parameters: there were no readily apparent disk I/O, memory, or CPU problems. Since the distributor lives on the publisher, that was covered as well. For good measure, I checked all of the other subscribers. They were also functioning normally. I did this to make sure that we weren’t seeing glaring performance problems on one subscriber that were really a symptom of a problem with the replication setup. Everything was healthy… except one subscription.
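My first pass at checking a server’s health is usually a couple of DMV queries rather than anything fancy. Here’s a rough sketch of the kind of thing I mean: top waits and per-file I/O latency. The list of harmless wait types to filter out is illustrative, not exhaustive, and nothing here is specific to this client’s environment.

-- Top waits since the last restart (or since the wait stats were cleared).
SELECT TOP (10)
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP', N'SQLTRACE_BUFFER_FLUSH',
                        N'BROKER_TASK_STOP', N'WAITFOR')  -- illustrative exclusions only
ORDER BY wait_time_ms DESC;

-- Average read/write latency per database file; big numbers point at disk trouble.
SELECT
    DB_NAME(vfs.database_id) AS database_name,
    mf.physical_name,
    CASE WHEN vfs.num_of_reads  = 0 THEN 0
         ELSE vfs.io_stall_read_ms  / vfs.num_of_reads  END AS avg_read_ms,
    CASE WHEN vfs.num_of_writes = 0 THEN 0
         ELSE vfs.io_stall_write_ms / vfs.num_of_writes END AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON mf.database_id = vfs.database_id
   AND mf.file_id = vfs.file_id
ORDER BY avg_read_ms DESC;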

The Problem Child

Having ruled out an unknown problem on the other servers, I took a look at the rest of the issues on the problem server. I found a few underlying issues and was quickly able to figure out that the poorly performing replication was only a symptom of the problem.

Oh I/O

When I dug deeper and looked at the wait stats and I/O activity, I was in for a huge shock: there were queries that had been running for close to a day! Two queries in particular were causing major performance problems.

The first was a daily bulk data load. It read from the replicated tables, so if there was heavy contention on those tables, this might be part of the problem. Luckily, the bulk load had been re-written long ago to use small batches so that the transaction log didn’t grow out of control. Rampant transaction log growth had been a huge problem when the server had tiny log drives, so the longer-running jobs were re-written to use a WHILE loop that reads blocks of data and produces smaller, explicit transactions (there’s a sketch of the pattern at the end of this section). This design also makes it possible to stop and restart the job whenever you want. I immediately killed the bulk load job and looked into the second query.

The second query was the nightly index maintenance script. It had been happily chugging away for over 24 hours and was chewing through more disk than I thought possible (probably because I was never awake at two in the morning to watch the job run). Figuring that fragmented indexes were a better option than thrashing disks, I killed the index defragmentation query and moved on to the next problem.
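For anyone who hasn’t seen that batching pattern, here’s a minimal sketch of the idea. The table and column names (dbo.ReplicatedOrders, dbo.OrdersStage, OrderId, and so on) are made up for illustration; the real job obviously looks different.

DECLARE @BatchSize  int = 5000;
DECLARE @RowsCopied int = 1;
DECLARE @LastId     int;

-- Pick up where the last run left off, so the job can be stopped and restarted.
SELECT @LastId = ISNULL(MAX(OrderId), 0) FROM dbo.OrdersStage;

WHILE @RowsCopied > 0
BEGIN
    BEGIN TRANSACTION;

    -- Copy one bounded chunk; the small explicit transaction keeps log growth in check.
    INSERT INTO dbo.OrdersStage (OrderId, OrderDate, Amount)
    SELECT TOP (@BatchSize) r.OrderId, r.OrderDate, r.Amount
    FROM dbo.ReplicatedOrders AS r
    WHERE r.OrderId > @LastId
    ORDER BY r.OrderId;

    SET @RowsCopied = @@ROWCOUNT;

    COMMIT TRANSACTION;

    -- Advance the high-water mark for the next pass.
    IF @RowsCopied > 0
        SELECT @LastId = MAX(OrderId) FROM dbo.OrdersStage;
END;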

My Memory’s Not What It Used To Be

Turns out the server was running low on memory. The server has two purposes: it handles ad hoc reporting and runs the regular scheduled reports. As a result, SQL Server Reporting Services was installed and SQL Server had been configured with a max memory setting of 4GB out of the 8GB available. When I dug deeper into the memory, I discovered that over half of SQL Server’s memory was being used to manage locking. The rest was going to the plan cache and a few other internal structures; at no point was memory being used to cache data. The server’s page life expectancy was effectively zero, so every read was going to disk. My immediate recommendation was to double the RAM in the server and increase SQL Server’s max memory setting from 4GB to 12GB. As a longer-term recommendation, I cautioned my client that they should invest in a new server, since this reporting server was four years old and well past its expected lifespan.
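If you want to see that kind of breakdown for yourself, the memory clerk and performance counter DMVs give a rough picture. A couple of caveats: the clerk columns below are the older single_pages_kb/multi_pages_kb ones (they were merged into pages_kb in SQL Server 2012 and later), and the sp_configure call is just the 12GB recommendation expressed in MB.

-- Where is SQL Server's memory going? The lock manager shows up as
-- OBJECTSTORE_LOCK_MANAGER, the buffer pool as MEMORYCLERK_SQLBUFFERPOOL.
SELECT TOP (10)
    type,
    SUM(single_pages_kb + multi_pages_kb) / 1024 AS memory_mb
FROM sys.dm_os_memory_clerks
GROUP BY type
ORDER BY memory_mb DESC;

-- Page life expectancy in seconds; anything near zero means reads come straight from disk.
SELECT counter_name, cntr_value AS seconds
FROM sys.dm_os_performance_counters
WHERE counter_name = N'Page life expectancy'
  AND object_name LIKE N'%Buffer Manager%';

-- Raising max server memory after the extra RAM goes in (values are in MB).
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 12288;
RECONFIGURE;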

Back to the I/O Again

As I was wrapping up, the other production servers started having I/O problems. This was right around the time that business normally picks up for this client. On a lark I said, “Wouldn’t it be great if this was a hardware problem?” Five minutes later we had great news: it was a hardware problem. One of the power supplies in the SAN had died. Although the SAN had four power supplies, losing a single one caused the SAN to power down its battery-backed cache and serve all reads and writes straight from disk. That more than explained the strange I/O we had been seeing on the reporting server. A new power supply was immediately ordered from EMC and the problem was eventually solved.

Wrapping Up

Have a set of canned scripts ready to help you figure out what kind of performance problems you might have on your systems. I started with Glenn Berry’s diagnostic scripts and customized them over time to give me the information that I want to see. If I weren’t so lazy, I would probably make this into something that I could throw into Management Studio’s canned reports with pretty colors to tell me when there was a problem. I’ve also gotten used to scanning over the output and looking for potential problems. Learn which problems are really just symptoms of a bigger issue. It doesn’t do you any good to troubleshoot slow queries only to find out that the SAN is experiencing horrible performance issues.
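As one example of the kind of canned check I mean (this isn’t from Glenn Berry’s scripts, just a quick sketch), here’s a query that lists anything that has been running for a suspiciously long time, along with its wait type and batch text. A day-old index rebuild jumps straight out of output like this.

-- Currently executing requests, oldest first, with their wait type and SQL text.
SELECT
    r.session_id,
    r.start_time,
    DATEDIFF(MINUTE, r.start_time, GETDATE()) AS running_minutes,
    r.status,
    r.wait_type,
    r.blocking_session_id,
    t.text AS batch_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID
ORDER BY r.start_time;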