Wednesday, July 8, 2015

OHS : Periodic OHS/Web Server/WebGate Crash due to cron job incorrectly deleting *.lck files/httpd.pid files

We faced an unusual issue in one of our OAM Single Sign On implementations: every 7th day, all the OHS nodes in a clustered setup crashed and generated core dumps running into dozens of GBs. Dumps of that size could potentially crash OHS again, on top of the downtime already caused on Production systems.

Stack Trace on OHS/Webgates :

Loaded symbols for /u01/app/orasec/middleware/Oracle_OAMWebGate1/webgate/ohs/lib/libxmlengine.so
Core was generated by `/u01/app/orasec/middleware/Oracle_WT1/ohs/bin/httpd.worker -DSSL'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007fd717f1161f in ObLockFileRelease(void*, bool) ()
  from /u01/app/orasec/middleware/Oracle_OAMWebGate1/webgate/ohs/lib/webgate.so
(gdb) (gdb)

Detailed Analysis of Root Cause 
After detailed debugging and some guidance from Oracle Support, we discovered that the crashes were caused by a cron job written to delete oam_server.out and oblog.log files every 7 days. The job existed because Oracle doesn't provide log retention policies for these files OOTB.
The path used by the cron job was <MiddlewareHome>/<Oracle_WebTier>/instances/<instance_name>/diagnostics/logs/OHS/ohs1, which incidentally also hosts the important .lck files (polltracking.lck, oblog.log.lck, ObAccessClient.xml.lck) and the httpd.pid file [Why Oracle, Why ??!!]. Because the job cleaned the whole folder, it removed these files along with the logs.

Remember : Removing PID and *.lck files causes instability and is not supported by OAM or OHS.

Removing httpd.pid, *.lck, or log files created by a running instance while that instance is running is not supported. To avoid it:

1.  If a cron job (or anything else) removes old files, set up logging to another location that does not contain the lock files, httpd.pid, or other process files. In our case we explicitly listed the files that needed to be deleted instead of running the cron job against the whole folder.
2.  Use documented log rotation methods as much as possible (the files in question, though, don't have OOTB options).
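
The first point can be sketched as a small cleanup script that a cron job could call instead of wiping the folder. This is only an illustration, not the script we ran: the file patterns, the function names and the 7-day threshold are assumptions, and it only touches files matching an explicit allow-list while always skipping *.lck and *.pid files.

```python
import fnmatch
import os
import time

# Assumption: illustrative patterns for rotated log copies; adjust to your setup.
DELETABLE_PATTERNS = ["oblog.log.*", "oam_server.out.*"]
# Runtime files that must never be removed while the instance is running.
PROTECTED_PATTERNS = ["*.lck", "*.pid"]


def matches_any(name, patterns):
    """Return True if the file name matches at least one glob pattern."""
    return any(fnmatch.fnmatch(name, pattern) for pattern in patterns)


def select_files_to_delete(directory, max_age_days=7, now=None):
    """List allow-listed files older than max_age_days, never lock/PID files."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    selected = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        # The protected check runs first, so oblog.log.lck is skipped
        # even though it also matches the "oblog.log.*" pattern.
        if matches_any(name, PROTECTED_PATTERNS):
            continue
        if not matches_any(name, DELETABLE_PATTERNS):
            continue
        if os.path.getmtime(path) < cutoff:
            selected.append(path)
    return selected


def cleanup(directory, max_age_days=7, dry_run=True):
    """Delete the selected files, or only print them when dry_run is True."""
    files = select_files_to_delete(directory, max_age_days)
    for path in files:
        if dry_run:
            print("would delete", path)
        else:
            os.remove(path)
    return files
```

Running cleanup() with dry_run=True first shows exactly what would be removed before the job is allowed to delete anything. Anything not named in the allow-list is left alone, so a new runtime file appearing in the folder cannot be deleted by accident.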

References  - 
OHS Segfault 11 Core Dumps ObLockFileRelease Webgate.so 5-7 Days (Doc ID 1985491.1)