
Monitoring

Best practices, recommendations and operational guidelines.

  • Try to monitor values that are directly interpretable as failure/success (i.e. prefer status values over values that still need computation)
    • For HTTP endpoints and the like, fetch only once and extract/compute the values from that single response (one poll, multiple values); see the sketch after this list
    • Keep the number of collected metric values to a minimum (they are costly)
    • Avoid agents if possible and use the service itself (i.e. don't develop a script, ask the service directly from the poller)
    • Avoid high resolution (interval ≤ 1m); it is costly in storage space
    • Don't store values for longer than a week if they will not make sense after a week
  • Use one template per service and include them where needed.
  • Generalize and parametrize templates so hosts can provide their own specifics via MACROS
  • Be descriptive (units, descriptions, …), as you will forget the details later and they provide useful information afterwards.
  • Informational events should not trigger alerts; however, don't auto-close them, so they still get attended/seen at some point.
  • Reserve Critical for truly critical events that require immediate attention
  • To avoid storms, a trigger should require the condition to persist/recur over a time window before alerting
  • Use a simple action to log/record every event
  • Reserve e-mail actions for really relevant issues
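
As an illustration of the "one poll, multiple values" point above, a minimal shell sketch; the endpoint URL and the JSON fields are made up for the example, and jq is assumed to be available:

  #!/bin/bash
  # Hypothetical status endpoint: fetch it once and derive several item values
  # from that single response instead of polling once per value.
  response=$(curl -fsS --max-time 10 "http://service.example/status") || exit 1

  status=$(echo "$response" | jq -r '.status')           # e.g. "ok" / "degraded"
  queue=$(echo "$response"  | jq -r '.queue_length')     # numeric metric
  uptime=$(echo "$response" | jq -r '.uptime_seconds')

  echo "status=$status queue=$queue uptime=$uptime"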

Logging

We try to stick to syslog as much as possible, since it is an easy, straightforward, categorized and widely adopted standard. We don't need any advanced analysis; our main need is to be able to review events in a centralized way and to do some basic stats/filtering.

Apart from the standard logging facilities, we use some local ones for specific topics:

  • news for general maintenance scripts and logging
  • uucp for any file transfer/copy logging (rsync, NFS, torrent)
  • local0 for network devices (this is hardcoded in some of our devices)
  • local4 for domain services (DNS, LDAP, DHCP, …)
  • local6 for web services
  • local7 is also directed to syslog
Note that there is an ftp facility, which would be more suitable than uucp, so we should change this.

So local1 to local3 and local5 are currently free.
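
For example, following the facility convention above, scripts would send their messages like this (tags and messages are only illustrative):

  # Illustrative only: pick the local facility that matches the topic.
  logger -t rsync-backup -p uucp.info      "backup of /srv finished (rc=0)"
  logger -t cleanup-tmp  -p news.notice    "removed 132 stale files from /tmp"
  logger -t dhcpd-watch  -p local4.warning "lease pool above 90% usage"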

Additionally, as many of our scripts also record events/messages to syslog through logger, we try to follow some practices:

  • Try to use a script identifier as tag, e.g. $(basename $0)
  • The shell library provides handy functions to log in an easy way; see the sketch below
  • logger can read the message from STDIN; use the -e flag to avoid logging empty messages
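
The library functions are roughly along these lines; this is a simplified sketch, not the actual library code:

  # Simplified sketch of a logging helper; not the actual shell library code.
  log() {
    # Tag with the script name and also echo the message to STDERR.
    logger -s -t "$(basename "$0")" -p user.notice "$*"
  }

  log "starting maintenance run"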

A short summary of useful flags:

  • -e to skip empty messages
  • -t to tag them
  • -s to also send the message to STDERR
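
Putting these flags together, a typical pattern in our scripts looks roughly like this (the command being piped is just a placeholder):

  # Illustrative pipeline: run_backup is a placeholder command.
  run_backup 2>&1 | logger -e -s -t "$(basename "$0")" -p uucp.info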

We configure our syslog daemons to log locally and to forward to a central log server using the above facility convention. In turn, this central syslog server applies some filtering to fix some oddly behaving logging devices, and stores the events both on the filesystem and in a database for further review.
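
A minimal sketch of the forwarding part on a client, assuming rsyslog and a made-up central host name (the real setup also keeps the usual local rules):

  # Hypothetical rsyslog drop-in: keep logging locally as before and also
  # forward everything to the central server over TCP (the @@ prefix).
  printf '*.*  @@loghost.example.org:514\n' > /etc/rsyslog.d/90-forward.conf
  systemctl restart rsyslog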

The database schema is a simplified version of the one commonly recommended for rsyslog, LogAnalyzer or syslog-ng. Its main characteristic is that it is partitioned on a per-day basis, so we can easily drop partitions on log rotation when they expire, and it gives better performance when searching through specific time windows.
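
For instance, expiring a day then only requires dropping its partition; a hedged sketch, assuming MySQL and made-up database/table/partition names:

  # Illustrative only: database, table and partition naming are assumptions.
  day=$(date -d '8 days ago' +%Y%m%d)
  mysql --database=syslog -e "ALTER TABLE events DROP PARTITION p${day};"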

I don't consider myself a skilled developer, but due to the lack of a tool that I liked and that fulfilled my needs in an easy/light way, I developed binnacle. Although it is not as good as other tools at managing the data, I designed it with an easy to use and practical interface; however, we plan to evaluate other tools to get rid of the burden of maintaining it.
