
Showing posts with label cluster. Show all posts

Wednesday, July 8, 2015

OHS : Periodic OHS/Web Server/WebGate Crash due to cron job incorrectly deleting *.lck files/httpd.pid files

Issue
We faced an unusual issue in one of our OAM Single Sign-On implementations: all the OHS nodes in a cluster setup crashed every 7th day, generating core dumps that ran into dozens of GBs. The core dumps were large enough to potentially crash OHS themselves, on top of the downtime on production systems.

Stack Trace on OHS/Webgates :

Loaded symbols for /u01/app/orasec/middleware/Oracle_OAMWebGate1/webgate/ohs/lib/libxmlengine.so
Core was generated by `/u01/app/orasec/middleware/Oracle_WT1/ohs/bin/httpd.worker -DSSL'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007fd717f1161f in ObLockFileRelease(void*, bool) ()
  from /u01/app/orasec/middleware/Oracle_OAMWebGate1/webgate/ohs/lib/webgate.so
(gdb) (gdb)

Detailed Analysis of Root Cause 
After detailed debugging and some guidance from Oracle Support, we discovered that the crashes were caused by a cron job written to delete the oam_server.out and oblog.log files every 7 days. The job existed because Oracle doesn't provide log retention policies for these files OOTB.
The path used by the cron job was <MiddlewareHome>/<Oracle_WebTier>/instances/<instance_name>/diagnostics/logs/OHS/ohs1, which incidentally also hosts the important .lck files (polltracking.lck, oblog.log.lck, ObAccessClient.xml.lck) and the httpd.pid file [Why Oracle, Why ??!!]

Remember: Removing PID and *.lck files causes instability and is not supported by OAM or OHS.

Solution 
Removing httpd.pid, *.lck, or log files created by a running instance while it is running is not supported. Instead:


1.  If a cron job or similar mechanism is used to remove log files, set up logging to another location that does not contain the lock files, httpd.pid, or other process files. In our case, we explicitly listed the files to be deleted instead of running the cron job against the whole folder.
2.  Use documented log rotation methods as much as possible (the files in question, though, don't have OOTB options).
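The "explicitly list the files" approach from step 1 can be sketched in Python as follows. This is a minimal illustration, not the script we actually ran: the function name `purge_old_logs`, the `LOG_PATTERNS` allow-list, and the 7-day default are assumptions made for the example. Note that a pattern such as `oblog.log*` also matches `oblog.log.lck`, which is exactly the trap described above, so the sketch refuses to touch anything ending in `.lck` or `.pid`.

```python
import time
from pathlib import Path

# Hypothetical allow-list: only files matching these patterns are candidates.
# Adjust to the log files that actually need cleaning in your environment.
LOG_PATTERNS = ("oblog.log*", "oam_server.out*")

# Files with these suffixes are never deleted, even if a pattern matches them
# (for example, "oblog.log*" also matches "oblog.log.lck").
PROTECTED_SUFFIXES = (".lck", ".pid")


def purge_old_logs(log_dir, max_age_days=7, now=None):
    """Delete allow-listed log files older than max_age_days.

    Returns the sorted names of the files that were removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for pattern in LOG_PATTERNS:
        for path in Path(log_dir).glob(pattern):
            if path.name.endswith(PROTECTED_SUFFIXES):
                continue  # lock and PID files belong to the running instance
            if path.is_file() and path.stat().st_mtime < cutoff:
                path.unlink()
                removed.append(path.name)
    return sorted(removed)
```

A cron entry would then call `purge_old_logs(...)` with the log directory instead of deleting everything in the folder. Because the check is based on modification time, a log file that the instance has written to recently is left alone.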

References  - 
OHS Segfault 11 Core Dumps ObLockFileRelease Webgate.so 5-7 Days (Doc ID 1985491.1)
http://oracleoam.blogspot.com/2014/07/lock-files-in-oam-11g-r2ps2_5.html

Saturday, August 23, 2014

OAM 11gR2/WebLogic : The importance of parameters in mod_wl_ohs.conf (Web Server plug-ins)

The configuration of various parameters in the web server plug-ins plays a major part in ensuring that Single Sign-On works correctly with OAM.

Oracle Documentation -
http://docs.oracle.com/cd/E23943_01/web.1111/e14395/plugin_params.htm

This post is intended to share my experiences with certain parameters and the repercussions if you don't include them :)

WLProxySSLPassThrough
WLProxySSL works well when the web server itself is doing the SSL work. But if SSL is terminated by a load balancer, the request reaches OHS over HTTP and mod_wl removes any incoming WL-Proxy-SSL header. The WebLogic server therefore never gets that header, and request.isSecure() always returns false. If you add the WLProxySSLPassThrough directive and set it to ON, the WebLogic plug-in will not remove an incoming WL-Proxy-SSL header. This lets WebLogic Server know that the original request was initiated over SSL. The WL-Proxy-SSL header should not be sent if the inbound traffic to the load balancer was not SSL (HTTPS).

Error Scenario

Once I added this parameter under the <IfModule weblogic_module> tag and set it to ON, the issue no longer recurred.
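The header handling described above can be pictured with a small toy model in Python. This is not the plug-in's implementation: the function names and the `pass_through` flag are hypothetical, and `is_secure` is a deliberately simplified stand-in for what WebLogic's request.isSecure() ends up reporting behind a proxy.

```python
def headers_forwarded_to_weblogic(incoming, pass_through=False):
    """Toy model of the plug-in: return the headers sent on to WebLogic.

    Without pass-through, any incoming WL-Proxy-SSL header is dropped.
    With pass-through, the header set by the load balancer survives.
    """
    if pass_through:
        return dict(incoming)
    return {k: v for k, v in incoming.items() if k.lower() != "wl-proxy-ssl"}


def is_secure(headers):
    """Simplified stand-in: secure only if WL-Proxy-SSL arrives as 'true'."""
    for name, value in headers.items():
        if name.lower() == "wl-proxy-ssl":
            return value.strip().lower() == "true"
    return False


# Request as it arrives at OHS after the load balancer terminated SSL.
from_load_balancer = {"Host": "example.com", "WL-Proxy-SSL": "true"}

without_directive = headers_forwarded_to_weblogic(from_load_balancer)
with_directive = headers_forwarded_to_weblogic(from_load_balancer, pass_through=True)
```

In this model, `is_secure(without_directive)` is false because the header was stripped, while `is_secure(with_directive)` is true because the header set by the load balancer reaches the application server.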




WLCookieName
If you change the name of the WebLogic Server session cookie in the WebLogic Server Web application, you need to change the WLCookieName parameter in the plug-in to the same value. The name of the WebLogic session cookie is set in the WebLogic-specific deployment descriptor, in the <session-descriptor> element.

Error Scenario :
The WebCenter Portal application for which I was implementing SSO using OAM had, for some reason, changed the WebLogic session cookie name to a value other than JSESSIONID.
This did not cause any issues until I configured the "WebLogicCluster" value (instead of "WebLogicHost") in the OHS layer, pointing to the WebCenter managed servers. Once I did so, the WebCenter Portal page would not load and instead gave me a flickering page with constantly changing values of adf_ctrl.state.
The issue was resolved once I added WLCookieName <cookieName> under the context root tag for the WebCenter Portal app in mod_wl_ohs.conf.

This post is also relevant in this regard.