
deadman user guide v0.8

2017-12-05 SHL Version 0.1
2022-04-03 SHL Version 0.3
2022-06-21 SHL Version 0.5
2022-06-23 SHL Version 0.6
2022-06-23 SHL Version 0.7
2025-10-16 SHL Version 0.8

== Introduction ==

  Deadman attempts to detect when one or more of a known set of problems
  that might occur on a running system has occurred and to take appropriate
  recovery actions when one of these problems is detected.

  Deadman is an evolving application.  New features are added when new
  failure modes and/or new recovery modes are discovered.

  Deadman was originally written to keep apache httpd servers that I
  maintain up and running with minimal human interaction, so some of
  deadman's features are specific to httpd servers.  Other features are more
  generic and may be useful with other applications.

  Deadman logs its actions to the deadman.log log file.  The log file will be
  written to the %LOGFILES% directory if defined.  Otherwise it will be
  written to the %TEMP% directory.

  Deadman also logs its actions to STDOUT, unless it is running detached.

  Deadman writes its PID to deadman.pid in the %TEMP% directory.  This allows
  other processes to check and/or control deadman.
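
  The file locations described above can be illustrated with a short Python
  sketch.  This is not deadman's actual code.  The function names are
  hypothetical, and raising an error when neither variable is defined is an
  assumption, since this guide does not say what deadman does in that case.

```python
import os

def deadman_log_path(env=None, base_name="deadman"):
    """Return the log path: %LOGFILES% if defined, otherwise %TEMP%."""
    env = os.environ if env is None else env
    log_dir = env.get("LOGFILES") or env.get("TEMP")
    if not log_dir:
        # Assumption: the guide does not describe this case.
        raise RuntimeError("neither LOGFILES nor TEMP is defined")
    return log_dir.rstrip("\\/") + "\\" + base_name + ".log"

def deadman_pid_path(env=None, base_name="deadman"):
    """Return the PID file path, which always lives in %TEMP%."""
    env = os.environ if env is None else env
    return env["TEMP"].rstrip("\\/") + "\\" + base_name + ".pid"
```

  A monitoring script could use the second function to locate the PID file
  before checking on the running instance.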

== Usage ==

  Deadman is a VIO command line application which is typically run detached.
  Output is written to the standard output, if deadman is not running
  detached, and to the log file (%LOGFILES%\deadman.log). The log file
  entries are timestamped so that they can be correlated with information
  from other timestamped logs.

  Each log file entry includes a message id of the form (#number).  The id
  number can be used to locate the code that generated the message, if
  needed.

  To display the help screen, enter

     deadman.exe -?

  at the command line.  The help screen currently displays as:

    The deadman daemon checks system health based on the configuration file settings.
    See deadman.txt for a detailed description of operation and options.

    deadman [-c] [-h] [-s] [-t] [-v] [-V] [-?] [cfgfile]

      -c       Check daemon status
      -h -?    Display this message
      -s       Stop daemon
      -t       Run in TEST mode
      -v       Display verbose status
      -V       Display version

      cfgfile  Configuration file to process

    Copyright (c) 2008-2022 Steven Levine and Associates, Inc.
    All rights reserved.

== Theory of operation ==

  Deadman attempts to monitor system health by watching the state of
  user-selected files, as defined in the configuration file.  Deadman can:

    - monitor one or more transaction log files for activity
    - monitor one or more error log files for certain errors
    - reboot the system on request

  When monitoring a transaction log file for activity, deadman expects the
  file size to increase over time.  If the file size fails to increase for
  longer than the configured interval, deadman will attempt to reboot the
  system after the reboot delay expires.  The check interval and the reboot
  delay interval are both configurable.  Deadman contains logic to handle log
  rotation which causes the log file size to be reduced.
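
  The growth check can be sketched in Python as follows.  This is a minimal
  illustration, not deadman's implementation.  The class and method names are
  hypothetical, and treating a smaller file size as a log rotation that
  restarts the stall timer is an assumption about how the rotation logic
  behaves.

```python
class TransLogMonitor:
    """Track one transaction log and report when it stops growing."""

    def __init__(self, check_interval_sec, now):
        self.check_interval_sec = check_interval_sec
        self.last_size = None
        self.last_activity = now

    def check(self, size, now):
        """Return True if the log has stalled longer than the interval."""
        if self.last_size is None or size > self.last_size:
            self.last_activity = now      # first sample, or the file grew
        elif size < self.last_size:
            self.last_activity = now      # assumed: log was rotated
        self.last_size = size
        return (now - self.last_activity) > self.check_interval_sec
```

  A True result corresponds to the point where deadman would schedule a
  reboot, which then happens only after the reboot delay expires.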

  When monitoring an error log file for errors, deadman will check the log
  file for known errors at configurable intervals.  The set of known errors
  is currently:

   - httpd cannot create child process

  As deadman evolves, additional checks may be implemented.

  When one of the known errors is detected, deadman will perform error
  specific recovery actions.  If the recovery actions fail, deadman will
  attempt to reboot the system after the reboot delay expires.  The check
  interval and the reboot delay interval are both configurable.

  When monitoring for reboot requests, deadman checks if the reboot request
  file has been created.  When the file is created, deadman will attempt to
  reboot the system.  If the reboot request file is not empty, deadman will
  write the first line of the file to the deadman log file to record the
  reason for the reboot.
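
  The reboot request check can be sketched in Python.  The function name is
  hypothetical and this is only an illustration of the behavior described
  above: no file means no request, and the first line of the file, if any, is
  the reason to log.

```python
import os

def read_reboot_request(path):
    """Return None if no request is pending, else the first line of the file.

    An empty request file yields an empty string: a reboot was requested
    but no reason was given.
    """
    if not os.path.exists(path):
        return None
    with open(path, "r", errors="replace") as request_file:
        return request_file.readline().strip()
```

  A caller would treat any result other than None as a reboot request and
  write a non-empty result to the log as the reason.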

== Sample configuration file ==

  The Configuration File section describes the configuration file in more
  detail.

  ; hostname: steven, domain: www.scoug.com
  ; checks error log for child process start failures
  ; checks transaction log for lack of activity

  ; 2022-04-03 SHL Baseline - steven

  translogfile = d:\logs\apache\scoug-combined_log
  processname = httpd
  TransLogCheckIntervalSec = 60   ; 1 minute

  errlogfile = d:\logs\apache\scoug-error_log
  ErrorLogCheckIntervalSec = 30

  rebootfile = d:\apps\apache24\reboot-me-now

  SleepSec = 10
  RebootDelaySec = 3600           ; 1 hour, 0 suppresses reboots
  ForceStatusSec = 21600          ; 6 hours

== Sample command lines ==

  To start deadman in VIO mode:

    start "deadman" deadman d:\apps\bin\deadman.cfg

  To start deadman detached:

    detach "deadman" deadman d:\apps\bin\deadman.cfg

  To check if deadman daemon is running:

    deadman -c

  To stop the running instance of deadman:

    deadman -s

== The Configuration File ==

  Deadman is controlled by the settings provided in the configuration file.
  The configuration file contains one statement per line. Each statement
  consists of a keyword and a value.  The configuration file may contain
  comments and blank lines.

  All keywords are optional.  If a keyword enables a feature, the feature
  will not be enabled if the keyword is omitted.  If the keyword sets a time
  interval, a default interval will be set if the keyword is omitted.

  The translogfile keyword names a transaction log file and enables the
  transaction log file monitor feature.  Deadman monitors this file for
  growth.  If the file stops growing for longer than the configured
  interval, deadman will schedule a reboot.  There is no default for the
  transaction log file.  If this keyword is omitted, transaction log
  monitoring will not be enabled.  To monitor multiple transaction log
  files, specify each log file in a separate translogfile statement.

  The translogcheckintervalsec keyword defines how often the
  transaction log monitor feature will check the transaction log file.  If
  this keyword is omitted, the default check interval is 600 seconds (i.e. 10
  minutes).

  The processname keyword names the process that is responsible for writing
  to the configured transaction log file.  If a process name is defined,
  deadman monitors the processes with this name.  If there are no instances
  with this process name running, deadman assumes that the user has stopped
  the processes for maintenance and suspends transaction log file monitoring
  until one or more instances of the process are restarted.  This prevents
  deadman from rebooting during planned shutdowns of these processes.
  There is no default for the process name.  If this keyword is omitted,
  process monitoring will not be enabled.

  The errlogfile keyword names an apache httpd error log file and enables the
  httpd error log monitoring feature.  Deadman will monitor the error log
  file for httpd child create failures.  There is no default for the error
  log file.  If this keyword is omitted, error log monitoring will not be
  enabled.  To monitor multiple error log files, specify each log file in a
  separate errlogfile statement.

  The errorlogcheckintervalsec keyword defines how often deadman will check
  the configured error log file.  If this keyword is omitted, the default
  check interval is 600 seconds (i.e. 10 minutes).

  The rebootfile keyword names the reboot request file and enables the
  reboot request feature.  If this file exists, deadman will reboot the
  system.  If this file exists when deadman is started, it will be deleted to
  prevent a stale reboot request file from triggering a reboot.  There is no
  default for the reboot request file.  If the keyword is omitted, reboot
  request monitoring will not be enabled.

  The sleepsec keyword defines how long deadman sleeps between check cycles.
  If this keyword is omitted, the default interval is 30 seconds.

  The rebootdelaysec keyword defines how long deadman waits after scheduling
  a reboot before performing the reboot.  This allows intermittent errors to
  be reported without forcing an unneeded reboot.  If this keyword is
  omitted, the default delay interval is 30 seconds.  If set to zero,
  scheduled reboots will not occur.
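
  The delayed reboot can be sketched in Python.  This is an illustration
  under stated assumptions, not deadman's code.  The class and method names
  are hypothetical, and cancelling a scheduled reboot when the problem clears
  before the delay expires is inferred from the wording above.

```python
class RebootScheduler:
    """Decide when a scheduled reboot is due, honoring rebootdelaysec."""

    def __init__(self, reboot_delay_sec):
        self.reboot_delay_sec = reboot_delay_sec
        self.scheduled_at = None

    def update(self, problem_present, now):
        """Return True when the reboot should be performed now."""
        if not problem_present:
            self.scheduled_at = None      # assumed: problem cleared, cancel
            return False
        if self.scheduled_at is None:
            self.scheduled_at = now       # schedule on first detection
        if self.reboot_delay_sec == 0:
            return False                  # zero suppresses reboots
        return (now - self.scheduled_at) >= self.reboot_delay_sec
```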

  The forcestatussec keyword defines how long deadman will wait before
  writing a proof of life message to the deadman log file.  If this keyword
  is omitted, the default reporting interval is 21,600 seconds (i.e. 6 hours).
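
  The keyword rules in this section can be summarized with a small Python
  parser sketch.  It is not deadman's parser.  The names are hypothetical,
  and three behaviors are assumptions inferred from the sample configuration
  file: a semicolon starts a comment anywhere on a line, keywords are not
  case sensitive, and unknown keywords are rejected.

```python
# Default intervals in seconds, as documented in this section.
DEFAULTS = {
    "translogcheckintervalsec": 600,
    "errorlogcheckintervalsec": 600,
    "sleepsec": 30,
    "rebootdelaysec": 30,
    "forcestatussec": 21600,
}
MULTI_VALUED = ("translogfile", "errlogfile")   # may appear more than once
SINGLE_VALUED = ("processname", "rebootfile")   # feature off when omitted

def parse_config(text):
    """Parse configuration text into a dict, applying documented defaults."""
    cfg = dict(DEFAULTS)
    for key in MULTI_VALUED:
        cfg[key] = []
    for key in SINGLE_VALUED:
        cfg[key] = None
    for raw_line in text.splitlines():
        line = raw_line.split(";", 1)[0].strip()   # assumed comment syntax
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError("expected keyword = value: " + raw_line)
        key = key.strip().lower()                  # assumed case-insensitive
        value = value.strip()
        if key in MULTI_VALUED:
            cfg[key].append(value)
        elif key in SINGLE_VALUED:
            cfg[key] = value
        elif key in DEFAULTS:
            cfg[key] = int(value)
        else:
            raise ValueError("unknown keyword: " + key)
    return cfg
```

  An empty translogfile or errlogfile list, or a processname or rebootfile of
  None, means the corresponding feature is not enabled.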

== Tuning deadman ==

  Every system is different.  The goal of tuning the deadman timing
  parameters is to check often enough so that problems can be detected and
  effectively handled, while at the same time minimizing false positives and
  not checking so often as to waste system resources that could be better
  used elsewhere.

  When tuning deadman, it is recommended that deadman be run in test mode
  (the -t option).  Test mode suppresses reboots and reduces the
  forcestatussec check interval, which makes the tuning process more
  efficient.

  When tuning deadman, the deadman log file can be helpful.  Look for
  spurious reports that can be avoided by optimizing the timing parameters.

  Sleepsec defines the minimum reasonable value for all the other checking
  intervals.

  Translogcheckintervalsec should be set large enough to avoid most false
  positives, but small enough so that any reboot attempt occurs before the
  system has become so unstable that the reboot attempt will fail.

  Errorlogcheckintervalsec should be set large enough to avoid wasting system
  resources, but small enough so that the recovery attempt has a high
  probability of success.

  Rebootdelaysec should be set large enough to allow intermittent reboot
  requests to clear,
  but small enough so that the reboot attempt occurs before the system has
  become so unstable that the reboot request will fail.
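
  The rule that sleepsec is the minimum reasonable value for the other
  intervals can be checked mechanically.  The Python sketch below is a
  hypothetical helper, not part of deadman.  It takes a dict of settings
  keyed by lowercase keyword name and skips a rebootdelaysec of zero, since
  that value suppresses reboots rather than setting a delay.

```python
def intervals_below_sleepsec(cfg):
    """Return the names of interval keywords set lower than sleepsec."""
    sleep_sec = cfg["sleepsec"]
    flagged = []
    for name in ("translogcheckintervalsec", "errorlogcheckintervalsec",
                 "rebootdelaysec", "forcestatussec"):
        value = cfg.get(name)
        if value is None:
            continue                      # keyword not present in this dict
        if name == "rebootdelaysec" and value == 0:
            continue                      # zero suppresses reboots
        if value < sleep_sec:
            flagged.append(name)
    return flagged
```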

== Running multiple deadman instances ==

  If needed, you can run multiple instances of deadman.

  To do this, make a copy of deadman.exe, giving it a unique name (e.g.
  deadman2.exe), and run the copy with a unique configuration file.

  The deadman log file name, the deadman pid file name and the default
  configuration file name are determined by the deadman executable's name so
  there will be no conflict with other running deadman instances.
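
  The naming scheme can be illustrated in Python.  This sketch assumes that
  each file name is the executable's base name with a different extension,
  and that the default configuration file uses the .cfg extension.  Both are
  assumptions for illustration, and the function name is hypothetical.

```python
import ntpath

def instance_file_names(exe_path):
    """Derive per-instance file names from the executable's name."""
    stem = ntpath.splitext(ntpath.basename(exe_path))[0]
    return {
        "log": stem + ".log",
        "pid": stem + ".pid",
        "cfg": stem + ".cfg",   # assumed default configuration extension
    }
```

  For a copy named deadman2.exe, this yields deadman2.log, deadman2.pid and
  deadman2.cfg, none of which collide with the files of the original.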

== Requirements ==

  The dos.sys driver must be installed.  This driver provides application
  level access to the DosReboot DevHlp API.

== Known issues ==

  None

== Ideas for the future ==

 - Enhance the error log monitor feature to detect more types of errors and
   provide recovery support.

 - Support units of measure for numeric values.

 - Support deadmanlogfile keyword.

 - Support deadmanpidfile keyword.

== Copyright and License ==

  COVERED CODE IS PROVIDED UNDER THIS LICENSE ON AN "AS IS" BASIS,
  WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING,
  WITHOUT LIMITATION, WARRANTIES THAT THE COVERED CODE IS FREE OF
  DEFECTS, MERCHANTABLE, FIT FOR A PARTICULAR PURPOSE OR NON-INFRINGING.
  THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE COVERED CODE
  IS WITH YOU. SHOULD ANY COVERED CODE PROVE DEFECTIVE IN ANY RESPECT,
  YOU (NOT THE INITIAL DEVELOPER OR ANY OTHER CONTRIBUTOR) ASSUME THE
  COST OF ANY NECESSARY SERVICING, REPAIR OR CORRECTION. THIS DISCLAIMER
  OF WARRANTY CONSTITUTES AN ESSENTIAL PART OF THIS LICENSE. NO USE OF
  ANY COVERED CODE IS AUTHORIZED HEREUNDER EXCEPT UNDER THIS DISCLAIMER.

  Copyright (c) 2008-2022 Steven Levine and Associates, Inc.
  All rights reserved.

  Deadman is provided AS-IS, WITHOUT ANY WARRANTY OF ANY KIND, EITHER
  EXPRESS, IMPLIED OR STATUTORY, not even any implied warranty of
  MERCHANTABILITY.

  YOUR USE OF THIS PRODUCT IS CONDITIONED UPON YOUR ACCEPTANCE OF THIS
  LICENSE AGREEMENT. INSTALLING AND/OR USING THE PRODUCT INDICATES YOUR
  ACCEPTANCE OF THESE TERMS AND CONDITIONS. IF YOU DO NOT AGREE TO THESE
  TERMS AND CONDITIONS PROMPTLY DELETE THIS PRODUCT.

  You are granted a non-exclusive, non-assignable, non-transferable
  right to use deadman.exe.

== eof ==
