In a recent project that I was heavily involved in and whose design I led, the end state was the Netscaler Unified Gateway providing access to enterprise resources. Along the way we uncovered an issue, unknown until then to me, my team, and my Citrix consulting partners, that brought the Netscaler platform to its knees.
This design covered three primary use cases for the Unified Gateway platform. The first is VPN through the web interface of the Gateway rather than through the Gateway plugin. The second is clientless VPN using bookmarks. The last is Citrix Storefront, delivered through the clientless VPN web interface. The Netscaler itself (version 11.1.57.x for this architecture) ships with a stock configuration that is not suited to peak loads on the platform, for example at the start of a work shift. During these login storms, the Netscaler would run out of processes, and users would see white screens, missing clientless VPN bookmarks, and hanging sessions. Over a period of minutes, the appliance would eventually crash.
The Netscaler firmware is built on FreeBSD and includes an integrated Apache web server. When you use a custom theme on your gateway, all user traffic bypasses the cache and hits the Apache server directly. The Apache configuration is pretty stock, with httpd.conf (located at /etc/httpd.conf) containing the following values:
MinSpareServers: 1
MaxSpareServers: 10
StartServers: 5
Timeout: 120
MaxRequestsPerChild: 10000
KeepAliveTimeout: 15 seconds
MaxClients: 30
What this means is that at startup the Netscaler starts 5 Apache processes, allows a maximum of 30 processes, and keeps between 1 and 10 spare processes idle and waiting. A process holds an idle keep-alive connection open for up to 15 seconds, waiting for the next request from the same client, before moving on, and a request times out after 120 seconds of waiting between receives and sends. Each process is recycled after serving 10000 requests.
While observing the Netscaler as clients were having issues, I saw the process count jump to the maximum (30), causing system instability as the appliance tried to keep up with user demand. At 2000 connections, against a theoretical appliance limit of 10000 connections, the whole system would go down.
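To check these values on a copy of the file rather than by eye, something like the following could work. It is a minimal sketch, not a Citrix tool: it assumes the file uses ordinary Apache "Directive value" lines, the function name parse_tunables is made up for illustration, and the sample text is a hypothetical fragment built from the stock values listed above, not a dump of the real /etc/httpd.conf.

```python
# Directive names of interest, in their canonical Apache spelling.
TUNABLES = (
    "MinSpareServers", "MaxSpareServers", "StartServers", "Timeout",
    "MaxRequestsPerChild", "KeepAliveTimeout", "MaxClients",
)

def parse_tunables(text):
    """Extract the integer-valued tuning directives from httpd.conf text."""
    # Apache directive names are case-insensitive, so match on lowercase.
    wanted = {name.lower(): name for name in TUNABLES}
    found = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        parts = line.split()
        key = wanted.get(parts[0].lower())
        if key and len(parts) >= 2 and parts[1].isdigit():
            found[key] = int(parts[1])
    return found

# Hypothetical fragment using the stock values from this post.
SAMPLE_CONF = """\
# process pool
StartServers 5
MinSpareServers 1
MaxSpareServers 10
maxclients 30
MaxRequestsPerChild 10000
KeepAlive On
KeepAliveTimeout 15
Timeout 120
"""

if __name__ == "__main__":
    for name, value in sorted(parse_tunables(SAMPLE_CONF).items()):
        print(f"{name}: {value}")
```

Copying the file off the appliance and running this against it before and after a change gives a quick record of what was actually in effect.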
CLI command for counting Apache processes: ps -aux | grep httpd -wc
One may ask: should I just buy a bigger appliance? What about upgrading to Netscaler 12? The answer is no and no. While a bigger appliance would handle more Gateway connections, the Apache web server and the amount of RAM dedicated to it are the same regardless of which model you have. Netscaler 11 and 12 also use the same Apache configuration, so the firmware version does not matter.
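If you capture the ps output during a login storm and want to count the workers afterwards, a sketch like this one could help. It is meant to run on a workstation against saved output, not on the appliance. The function name count_httpd, the sample listing, and the /bin/httpd path in it are all hypothetical. It assumes the usual 11-column "ps aux" layout with the command in the last column. Unlike the one-liner above, it looks only at the executable name, so a grep process that happens to mention httpd is not counted.

```python
def count_httpd(ps_output):
    """Count processes whose executable is named httpd in 'ps aux' output."""
    count = 0
    for line in ps_output.splitlines()[1:]:  # first line is the column header
        fields = line.split(None, 10)        # 11th field is the full command
        if len(fields) < 11:
            continue                         # malformed or truncated line
        executable = fields[10].split()[0]
        if executable.rsplit("/", 1)[-1] == "httpd":
            count += 1
    return count

# Hypothetical capture: three Apache workers plus the grep that searched for them.
SAMPLE_PS = """\
USER    PID %CPU %MEM   VSZ  RSS TT STAT STARTED    TIME COMMAND
root    101  0.0  0.1 12000 6000  - Ss   9:00AM  0:01.00 /bin/httpd -f /etc/httpd.conf
nobody  102  1.5  0.2 12500 7000  - S    9:00AM  0:03.10 /bin/httpd -f /etc/httpd.conf
nobody  103  2.0  0.2 12500 7100  - S    9:01AM  0:02.40 /bin/httpd -f /etc/httpd.conf
root    200  0.0  0.0  1000  500  0 S+   9:05AM  0:00.00 grep httpd -wc
"""

if __name__ == "__main__":
    print(count_httpd(SAMPLE_PS))  # prints 3
```

Sampling this every few seconds during a shift change and comparing the count with MaxClients shows how close the pool is to its ceiling.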
The Initial Citrix “Fix”
While I was working with Citrix support, the recommended course of action was to increase MaxClients to 50, as this would give the Netscaler a higher process ceiling. The end result was that the appliance itself ran out of memory trying to manage 50 processes at the same time.
The second recommendation was to scale horizontally to accommodate the logins. Why would I scale horizontally to handle burst logins of 2000 to 3000 when I have a box with a theoretical limit of 10000 connections?
The Real Fix
The actual fix was a collaborative effort between me, my team, and Citrix consulting services, with each bringing a different piece of the puzzle:
My Contribution: Optimize Apache.
The stock configuration is subpar and does not mirror reality. While it works fine in isolated use cases, it cannot keep up with demand when all of the bells and whistles are activated. Here are the values of the new Apache configuration:
MinSpareServers: 5
MaxSpareServers: 10
StartServers: 5
Timeout: 100
MaxRequestsPerChild: 1000
KeepAliveTimeout: 5 seconds
MaxClients: 50
The appliance still starts with 5 processes, but the spares are now kept between 5 and 10 processes. This lets the appliance focus on serving requests instead of starting up processes. With the max requests per process lowered to 1000, a misbehaving process is recycled after 1000 requests instead of lingering until it has served 10000. The processes also wait only 5 seconds for the next request from a client before moving on, instead of the stock 15, and the overall timeout drops from 120 to 100 seconds. I kept MaxClients at 50 in good faith toward the Citrix recommendation, but with these optimizations the processes cycle too fast to ever hit that limit.
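To see the changes side by side, here is a small sketch that diffs the stock and tuned values from this post. The helper names are made up for illustration. The last function is only a crude back-of-the-envelope model, assuming the worst case where every process sits out its full keep-alive window before taking the next client. It is not a measurement from the appliance.

```python
# Values as listed in this post.
STOCK = {
    "MinSpareServers": 1, "MaxSpareServers": 10, "StartServers": 5,
    "Timeout": 120, "MaxRequestsPerChild": 10000,
    "KeepAliveTimeout": 15, "MaxClients": 30,
}
TUNED = {
    "MinSpareServers": 5, "MaxSpareServers": 10, "StartServers": 5,
    "Timeout": 100, "MaxRequestsPerChild": 1000,
    "KeepAliveTimeout": 5, "MaxClients": 50,
}

def changed(before, after):
    """Return {directive: (old, new)} for every value that differs."""
    return {k: (before[k], after[k]) for k in before if before[k] != after[k]}

def worst_case_slots_freed_per_second(conf):
    """Assumed model: each process idles for the full KeepAliveTimeout."""
    return conf["MaxClients"] / conf["KeepAliveTimeout"]

if __name__ == "__main__":
    for name, (old, new) in changed(STOCK, TUNED).items():
        print(f"{name}: {old} -> {new}")
    print(worst_case_slots_freed_per_second(STOCK))  # 2.0
    print(worst_case_slots_freed_per_second(TUNED))  # 10.0
```

Under that assumption the tuned pool frees idle slots about five times faster than the stock one, which fits with the process count staying well under the ceiling.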
Citrix Contribution: Enable Integrated Caching
Citrix consulting recommended adding integrated caching to the mix as well. Citrix engineering stated that integrated caching will not help with the login storms, as custom themes cannot and will not cache by design. Where it does help is after login: bookmark and resource requests can be cached, taking load off the Apache web server.
set cache contentGroup loginstaticobjects -memlimit 768
set cache contentGroup OcvpnLoginstaticobjects -memlimit 768
save config
add cache selector en_config_xml_cache_selector http.req.url.path http.req.method http.req.hostname
add cache contentGroup en_config_xml_cache_group -relExpiry 120 -maxResSize 16000 -memLimit 28 -hitSelector en_config_xml_cache_selector
add cache policy en_config_xml_cache_pol -rule "HTTP.REQ.URL.PATH_AND_QUERY.STARTSWITH_ANY(\"vpn_cache_dirs\") && (HTTP.REQ.URL.CONTAINS(\"/resources/config.xml\") || HTTP.REQ.URL.CONTAINS(\"/resources/en.xml\"))" -action CACHE -storeInGroup en_config_xml_cache_group
bind vpn vserver VSERVER -policy en_config_xml_cache_pol -priority 5 -gotoPriorityExpression END -type REQUEST
save config
Success! Not only did the issue go away, but I still haven’t been able to find the limits of this configuration. Process counts hover around 30 at peak logins, where before they climbed to 50 and then crashed the appliance. I will say that this collaborative effort between me, my team, and Citrix as a whole (one fantastic consultant in particular) has been very successful. Citrix has been made aware of the issue, and I am hopeful that a knowledge base article will be the next step, with eventual incorporation into the stock configuration. Until then, if you come across this issue, don’t scale out, just fix the problem.