Thursday, November 12, 2009

Load Test

Past few days i encountered one serious production issue, which the web service timeout before the backend process sent back the message to the SOAP response node in WMB flow. This was due to the high load in production that caused the delay in the backend process.

The whole process is: WMB flow extracts SOAP request message, turns it into a MQ Header message, puts the message to Java adaptor's request queue, then Java adaptor processes the content and puts a response message to the response queue, then WMB flow will read the message in the response queue and will turn it into a SOAP response.

In the progress of investigation, we decided to move one database query out from the java adaptor program to the WMB flow. The reason being is sometimes the db connection will just drop and the java adaptor program has to spend some time reinitiating the db connection. In fact according to the time recorded in the adaptor's log the process was still quite fast, it's even within miliseconds. Well we just gave it a try.

After making all the necessary changes we deployed the works to the testing environment. I then performed some load tests to the web service. The result was negative. The tps was around 3.5 but the average time taken was > 10 seconds and we had to achieve < 10 seconds! So still plenty of work to do :(

We performed a few cycles of investigation, trial and error & testing until we found the solution - increasing the WMB flow instance. This is a setting in the WMB Toolkit. So we increased the number of instances to 3. This time we managed to achieve an average time of 3 seconds! We were so happy and excited!

What about the cause? See below:

There were many messages piling at the response queue which is supposed to be read by the WMB flow. These messages were sent in by the Java Adaptor after it finished processing the data. Apparently these messages were not picked up by the WMB flow in time hence causing the timeout error. By increasing the WMB flow instance, concurrently there are 3 identical WMB flows reading the content of this queue, hence problem resolved!

I was really relieved!