Performance troubleshooting – CPU READY time
"My application is running slow. Hey IT guy, can you check it and make sure it gets fixed!" How many times have you heard such a complaint? I bet it is quite common.
In this series I would like to talk about common performance issues you might see within your environment. Today we will start with one of the most common: CPU-related performance issues.
First, let me explain one of the key metrics that is related to such performance issues – READY time.
What is READY time
READY time is a key metric, expressed as the percentage of time during which a virtual machine was ready to run but could not be scheduled on a physical CPU. CPU READY time depends on many factors, such as the number of physical CPUs and cores versus the number of virtual machines and the vCPUs assigned to them. It is also usually connected to oversubscription of your physical resources.
Oversubscription means that you assign more virtual resources than you have physical resources available. Let’s imagine a situation where you have only one physical CPU with 6 cores and you have created dozens of virtual machines with 2 vCPUs each. If the virtual machines are just sitting idle, there will probably be no performance impact, but once they start to actually use CPU resources, performance will go down.
According to the official documentation, the guidance for the vCPU-to-pCPU ratio is as follows:
- 1:1 to 3:1 is no problem
- 3:1 to 5:1 may begin to cause performance degradation
- 6:1 or greater is often going to cause a problem
This is only a rule of thumb. Without knowing your infrastructure and workloads, it is hard to determine an appropriate oversubscription ratio, which is why you should monitor READY time closely.
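As a rough illustration of the rule of thumb above, the ratio can be computed and mapped to those bands in a few lines of Python. This is only a sketch: the function names are made up for this example, "dozens" is taken to mean 24 VMs, and because the list leaves ratios between 5:1 and 6:1 unspecified, the code folds them into the middle band, which is my own assumption.

```python
def overcommit_ratio(total_vcpus: int, physical_cores: int) -> float:
    """Return the vCPU:pCPU ratio as a single number (8.0 means 8:1)."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return total_vcpus / physical_cores


def classify_ratio(ratio: float) -> str:
    """Map a ratio to the rule-of-thumb bands quoted in the text."""
    if ratio <= 3:
        return "no problem"
    if ratio < 6:
        # Assumption: the 5:1 to 6:1 gap is treated as part of this band.
        return "may begin to cause performance degradation"
    return "often going to cause a problem"


# The example from the text: one 6-core CPU, VMs with 2 vCPUs each.
# 24 VMs is an assumed stand-in for "dozens".
ratio = overcommit_ratio(total_vcpus=24 * 2, physical_cores=6)
print(f"{ratio:.1f}:1 -> {classify_ratio(ratio)}")
```

The bands only tell you where to look first; the READY time you actually measure is what decides whether there is a problem.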
You can monitor READY time from the ESXi / vCenter web client, but people usually use the esxtop command. The reason is that the metrics displayed in the web client are aggregated, so you might not notice some peaks, whereas esxtop gives you real-time values.
What I usually do is this: once I notice a spike in the web client, I start investigating on the particular ESXi server with esxtop to get a better understanding of what is going on.
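One practical detail when comparing the two tools: the performance charts in the web client report CPU ready as a summation in milliseconds per sample interval, not as a percentage like esxtop's %RDY. The sketch below converts one to the other. The function name is hypothetical, and the 20-second default assumes the real-time chart; longer chart ranges use longer intervals, so check the interval of the chart you are reading.

```python
def ready_ms_to_percent(ready_ms: float, interval_seconds: float = 20.0) -> float:
    """Convert a CPU ready summation (ms) into a percentage of the sample interval.

    interval_seconds defaults to 20, the sample length assumed here for the
    real-time chart; pass the actual interval for other chart ranges.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # Share of the interval (in ms) that was spent waiting to be scheduled.
    return ready_ms * 100.0 / (interval_seconds * 1000.0)


# 1000 ms of ready time within a 20 s sample is 5% ready.
print(ready_ms_to_percent(1000))
```

This makes it easier to compare a spike seen on a chart with the percentages esxtop shows on the host.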
Let’s look at what esxtop shows when everything is OK and there is no congestion:
As you can see, the %RDY metric, which represents READY time in esxtop, is around 0.x. This is normal behavior; as a recommendation, this metric should not exceed 5%.
If you see %RDY climbing higher than that, you need to start investigating:
As you can see here, the virtual machines are waiting for physical resources 30-50% of the time, which is a clear sign of congestion.
As we have already said, one of the reasons is CPU oversubscription, as it is in this case, but there might be other factors involved:
- CPU resource limits on a particular VM (this is the case when one of the VMs reports high READY time and your oversubscription is low)
- Resource pool limits (you see a group of VMs residing in a single resource pool with high READY time, while other VMs run OK and your oversubscription is low)
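The reasoning behind the causes listed above can be written down as a small triage helper. This is a sketch under stated assumptions, not a diagnostic tool: the `VmSample` fields and the `suspects` function are invented for this example, the readings are assumed to be collected by hand (for instance from esxtop and the VM settings), and the 3:1 cut-off for "low" oversubscription is borrowed from the rule of thumb earlier in this article. Only the 5% threshold is the article's actual recommendation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

READY_THRESHOLD = 5.0  # percent; the recommendation used in this article
LOW_RATIO = 3.0        # assumption: up to 3:1 counts as "low" oversubscription


@dataclass
class VmSample:
    """Hypothetical record of what you noted down for one VM."""
    name: str
    ready_percent: float
    has_cpu_limit: bool = False        # CPU limit set on the VM itself
    resource_pool: Optional[str] = None
    pool_has_cpu_limit: bool = False   # CPU limit set on its resource pool


def suspects(samples: List[VmSample], overcommit_ratio: float) -> Dict[str, str]:
    """Return a likely first thing to check for each VM above the threshold."""
    findings: Dict[str, str] = {}
    for vm in samples:
        if vm.ready_percent <= READY_THRESHOLD:
            continue  # healthy, nothing to report
        if vm.has_cpu_limit:
            findings[vm.name] = "check the CPU limit on the VM"
        elif vm.pool_has_cpu_limit:
            findings[vm.name] = f"check limits on resource pool '{vm.resource_pool}'"
        elif overcommit_ratio > LOW_RATIO:
            findings[vm.name] = "possible CPU oversubscription"
        else:
            findings[vm.name] = "no obvious cause, investigate further"
    return findings


samples = [
    VmSample("db01", 42.0, has_cpu_limit=True),
    VmSample("web01", 0.4),
    VmSample("app01", 35.0, resource_pool="batch", pool_has_cpu_limit=True),
    VmSample("app02", 31.0),
]
for name, hint in suspects(samples, overcommit_ratio=8.0).items():
    print(f"{name}: {hint}")
```

The order of the checks is a simplification: limits are looked at first because they are quick to rule out, and oversubscription is only blamed when no limit explains the reading.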
So, what can you do?
Sometimes lowering the number of vCPUs on a particular machine will work. This is connected to how symmetric multiprocessing in ESXi works. Let’s say that you have a VM with 4 vCPUs assigned, and this virtual machine runs a single-threaded application that drives one CPU to 100%. The machine as a whole is then only about 25% utilized across its 4 vCPUs, yet it still has to wait until all 4 vCPUs can be scheduled on physical CPUs. Lowering the number of vCPUs makes the VM easier to schedule and can bring its READY time down.
If there is general congestion across all your VMs, it indicates that your physical infrastructure cannot satisfy the virtual resources you have assigned, and the only way to fix that is to add more physical ESXi hosts.
And lastly, it might be connected to resource limitations on a particular VM or on the Resource Pool that the VMs are part of.
If you start to see READY time higher than 5%, you should start investigating immediately, because at this stage you will probably already see a performance impact, especially on high-performance applications such as database servers.
Dig through your performance graphs to identify the bottleneck, and investigate with the esxtop command or with vROps if you have that option.