Identify IO Wait Performance Issues

by Alan Da Costa

ruby ruby-on-rails

In the previous blog post about how to measure Ruby on Rails optimizations, we analyzed general performance using timings from Rails server logs covering a large time span (several hours to a day). How would you determine the source of intermittent performance issues using these logs? What would you look for?

When we’re looking at an old degraded web application with lots of errors and poor performance, sometimes the best thing to do is identify and fix one bottleneck at a time. As part of my work on the previous post, I moved a MySQL database server with a very full slow query log to a separate host. The fix was good. The new MySQL server suddenly had an empty slow query log and it stayed that way.

Soon enough, more user complaints about poor performance came trickling in. Ok, it was more than a trickle, but how could that be? The MySQL slow query log still looked great but the Rails log showed high database times. This didn’t make sense. After verifying the slow query log was enabled and the threshold was set low enough, I dug in to figure out the cause.
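Those settings can also be checked from a Rails console instead of a MySQL client. A minimal sketch, assuming an ActiveRecord connection to a reasonably recent MySQL server (variable names differ on very old versions):

conn = ActiveRecord::Base.connection

# Confirm the slow query log is on, where it writes, and its threshold.
%w[slow_query_log slow_query_log_file long_query_time].each do |name|
  row = conn.select_one("SHOW VARIABLES LIKE '#{name}'")
  puts "#{row['Variable_name']} = #{row['Value']}" if row
end

# Lowering the threshold (in seconds) requires a privileged MySQL user:
# conn.execute("SET GLOBAL long_query_time = 2")

If slow_query_log is ON and long_query_time is low enough, anything genuinely slow on the server should be landing in the log.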

What is IO Wait?

For a given CPU, the I/O wait time is the time during which that CPU was idle (i.e. didn’t execute any tasks) and there was at least one outstanding disk I/O operation requested by a task scheduled on that CPU (at the time it generated that I/O request). - source

The result of IO wait is that a CPU with work to complete that is blocked on IO is also blocked from handling any other work.
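On Linux, the counters behind this metric live in /proc/stat. Here is a quick-and-dirty way to see how much time a box is spending in IO wait, sketched in Ruby; it assumes the standard Linux /proc/stat layout and is no substitute for tools like iostat or vmstat:

# Rough IO wait percentage over a short interval.
# The aggregate "cpu" line in /proc/stat holds cumulative jiffies:
#   user nice system idle iowait irq softirq steal ...
def cpu_times
  fields = File.read("/proc/stat")[/^cpu\s+(.+)$/, 1].split.map(&:to_i)
  { :iowait => fields[4], :total => fields.inject(:+) }
end

before = cpu_times
sleep 5
after = cpu_times

iowait_pct = 100.0 * (after[:iowait] - before[:iowait]) /
             (after[:total] - before[:total])
puts format("iowait over the last 5s: %.1f%%", iowait_pct)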

When DB time isn’t DB time

Processing AssessmentsController#save (for xx.xx.xx.xx at 2016-01-01 19:00:00) [POST]
  Session ID: 222f372e282f93a35d363f496cb1b1ae
  Parameters: {"choice"=>"3", "action"=>"save", "qn"=>"70", "controller"=>"assessments", "direction"=>"x"}
Rendering assessments/save
Completed in 7.77144 (0 reqs/sec) | Rendering: 0.00667 (0%) | DB: 7.75504 (99%) | 200 OK [https://<domain>/assessments/save]

Look at the last line in the above request. Rails says over 7 seconds was spent in the database, while the slow query log was set to catch everything over 2 seconds and was empty.

As it turns out, Rails logs all of the time spent waiting on a query, not just the execution time on the database server. This means time the Rails process spends blocked on IO counts toward the DB field. I didn’t completely understand this at first.
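One way to see this from the application side is to time queries at the Rails layer and compare against the slow query log. On Rails 3 and later that can be done with ActiveSupport::Notifications; the app in this post predates that API, so treat the following as an illustrative sketch rather than what was actually deployed:

# In an initializer: log any query whose wall-clock time, as seen by the
# Rails process, exceeds a threshold. If these show up while MySQL's slow
# query log stays empty, the time is being lost waiting (e.g. on IO),
# not executing on the database server.
THRESHOLD_MS = 2000

ActiveSupport::Notifications.subscribe("sql.active_record") do |_name, start, finish, _id, payload|
  elapsed_ms = (finish - start) * 1000.0
  if elapsed_ms > THRESHOLD_MS
    Rails.logger.warn("[app-side slow query] #{elapsed_ms.round}ms #{payload[:sql]}")
  end
end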

Avoid making blind changes

I think we’ve all been guilty of making changes based on speculation. Without enough data to understand what’s wrong, desperation drives us to trial and error. Why not fix that by collecting more data instead?

New Relic to the Rescue

There are a few command line utilities (iostat, vmstat, iotop) that can help determine the source of IO wait, but they have to catch the problem in the act. New Relic mostly makes this easy for us. After installing the server agent and allowing time to collect data, we can see the following graphs.

High IO Wait

The green sections in the CPU chart represent IO Wait. A healthy chart should be mostly free of green.

Green section process list

Zooming in on the time covered by a green section shows the following in the process list.

High Java CPU

This application stack utilizes a handful of Java processes to generate reports. Notice the java processes at 40.5% CPU use?

White section process list

If we look at a chunk of time outside of the green bars, this is what our process list looks like.

High Java CPU

Comparing more examples, it quickly becomes obvious something in the Java reporting pipeline is the cause of our problems.

If we’re using New Relic without the server agent, we’ll only see the APM (Application Performance Monitoring) page. For the same time period, we’ll see a transactions graph that looks like this.

Bumpy Transactions

This chart doesn’t look too pretty; however, it mostly just tells us what users already felt: the application was slow. Because the process list and CPU utilization are on the server page, there wouldn’t be a way to determine the source of our problem from the APM page alone.
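In that situation, a rough substitute is to capture the process list ourselves whenever IO wait spikes, reusing the /proc/stat reader from earlier. A throwaway sketch (the threshold is arbitrary and the ps flags assume GNU ps on Linux):

# Snapshot the top CPU consumers whenever iowait jumps between samples.
def iowait_jiffies
  File.read("/proc/stat")[/^cpu\s+(.+)$/, 1].split.map(&:to_i)[4]
end

last = iowait_jiffies
loop do
  sleep 10
  current = iowait_jiffies
  if current - last > 100   # arbitrary threshold; tune for the host
    puts Time.now
    puts `ps -eo pid,pcpu,pmem,comm --sort=-pcpu | head -n 10`
  end
  last = current
end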

Fix

To restore performance, we either have to make the server faster or make it do less work. Making the server faster can be expensive over the long term and might not allow reporting to scale well. An optimal solution might be to separate the Java reporting pipeline into its own service and deploy it on a new host. Ultimately, how this problem is fixed will be determined by business goals.
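To make that concrete, the Rails side of such a split might look roughly like the following. The reports.internal host, port, and /reports path are hypothetical, purely to illustrate swapping a local Java invocation for a call to a reporting service on its own host:

require "net/http"
require "uri"
require "json"

# Hypothetical client for a reporting service moved to a separate host,
# so report generation no longer competes with the web app and database
# for CPU and disk.
class ReportClient
  ENDPOINT = URI.parse("http://reports.internal:8080/reports")

  def self.request_report(assessment_id)
    http = Net::HTTP.new(ENDPOINT.host, ENDPOINT.port)
    request = Net::HTTP::Post.new(ENDPOINT.path, "Content-Type" => "application/json")
    request.body = { "assessment_id" => assessment_id }.to_json
    response = http.request(request)
    JSON.parse(response.body)
  end
end

With the reporting pipeline off the box, the IO wait it generates goes with it, and the web application’s DB times should start reflecting actual database work again.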