Package: alinex-monitor

This application will make server management easy and fast. It will check the whole IT landscape from the host to the application.

While most monitoring tools focus on the server, this one also covers the application side.

remote daemon-less analysis
lots of sensors
alerting and reporting
data store for time analysis
interactive analyzing and exploring

The monitor analyzes your whole environment in depth by connecting to the different systems in parallel and checking them. If a problem occurs, an additional analysis step may be run to get more information. The result values are stored in the storage database and a detailed report is created. Based on additional action rules, the report may be sent by email or a web request may be made. Time reports may be created from the stored values.

It is one of the modules of the Alinex Universe following the code standards defined in the General Docs.

Install

Install the package globally using npm on a central server. From there all your machines may be checked:

sudo npm install -g alinex-monitor --production

After global installation you may directly call monitor from anywhere.

monitor --help

Because this application works agentless, you don't have to do anything special on your clients, but some simple changes can often make the reports more powerful. If so, you will get a hint in the report.

Always have a look at the latest changes.

Usage

After the monitor and its controllers are fully configured, it may be run by simply calling:

> monitor

Initializing...
Analyzing systems...
Finished.

This will start the monitor on the command line and check all controllers. Those that report problems will be printed on the console.

Global options:

-C, --nocolors  turn off color output
-v, --verbose   run in verbose mode
-h, --help      Show help

Check controller once

You may run the monitor to check the defined controllers once and get their status:

> monitor my-develop    # run only this controller
> monitor my-*          # run controllers with the given name prefix
> monitor               # run all controllers
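The name selection shown above can be pictured as a simple prefix match. The following is only a sketch in JavaScript: the helper selectControllers is hypothetical, and it assumes that `*` appears only as a trailing wildcard. The monitor's real matching may differ.

```javascript
// Hypothetical helper, not part of the monitor's API.
// Selects controller names by exact name or by a trailing-* prefix pattern.
function selectControllers(all, pattern) {
  if (!pattern) return all.slice();            // no argument: run everything
  if (pattern.endsWith('*')) {
    const prefix = pattern.slice(0, -1);
    return all.filter((name) => name.startsWith(prefix));
  }
  return all.filter((name) => name === pattern);
}

// Example controller names (made up for illustration).
const names = ['my-develop', 'my-backup', 'web1'];
console.log(selectControllers(names, 'my-*'));
```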

You may start a try run in which nothing is stored in the storage and no action is taken.

-t --try        do a try run

The verbose mode works here in multiple steps:

-v    show the controller status on console
-vv   show also the sensor status on console
-vvv  also display the result values

If no verbose mode is set, only controllers in warning or error state will be reported on the console.

The output may look like:

> monitor -v

Initializing...
Analyzing systems...
2015-12-15 12:55:32 Controller dvb-develop => ok
Finished.

> monitor -vvv

Initializing...
Analyzing systems...
2015-12-15 12:55:51 Check load:dvb-develop => ok
2015-12-15 12:55:51 Check http:->http://192.168.200.106/svn/ => ok
2015-12-15 12:55:51 Check memory:dvb-develop => ok
2015-12-15 12:55:51 Check diskfree:dvb-develop:/ => ok
2015-12-15 12:55:51 Check diskfree:dvb-develop:/var => ok
2015-12-15 12:55:52 Check ping:localhost->192.168.200.106 => ok
2015-12-15 12:55:52 Check socket:tcp localhost->192.168.200.106:80 => ok
2015-12-15 12:55:52 Check time:dvb-develop => ok
2015-12-15 12:55:52 Check user:dvb-develop => ok
2015-12-15 12:56:01 Check cpu:dvb-develop => ok
2015-12-15 12:56:01 Check diskio:dvb-develop:sda1 => ok
2015-12-15 12:56:02 Check net:dvb-develop:eth0 => ok
2015-12-15 12:56:02 Controller dvb-develop => ok
Finished.

Run as a service

To run the controllers continuously, use the daemon option and start the monitor in the background.

> monitor -d -C > /var/log/monitor.log 2>&1 &

This will check all the controllers at their defined intervals, collect measurement values and send alerts. You may also specify some controllers to run instead of all.

As seen above, you may send the normal output to a log file, but it is better to configure a log destination through the config files (see below).

For production use you may start it using pm2.

Additional commands

You may run some other commands through the interactive console or directly by giving everything on the command line call.

> monitor -c list controller

See the Interactive Console section and its integrated help system for more details on the possible commands.

Setup

To use the controller you have to set up the whole process using some configuration files. A storage database may also be used.

Exit Codes

The exit codes are arranged alongside the UNIX default:

Code	Description
0	OK - no error in controller
1	General error which should not occur.
2	Fail - controller run failed
3	Warn - warning in controller run
129	SIGHUP (Signal 1)
130	SIGINT like through Ctrl + C (Signal 2)
131	SIGQUIT (Signal 3)
134	SIGABRT or SIGIOT (Signal 6)
143	SIGTERM (Signal 15)
255	Exit status out of range
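As an illustration of the table above, the mapping can be sketched in JavaScript. The helper names are hypothetical, and the rule that the most severe controller status decides the exit code is an assumption, not something confirmed by the monitor's source. The signal codes follow the usual shell convention of 128 plus the signal number.

```javascript
// Exit codes taken from the table above; the helper names are hypothetical.
const EXIT = { ok: 0, fail: 2, warn: 3 };
const SEVERITY = { ok: 0, warn: 1, fail: 2 };

// Assumption: the most severe controller status decides the exit code.
function exitCodeFor(statuses) {
  const worst = statuses.reduce(
    (a, b) => (SEVERITY[b] > SEVERITY[a] ? b : a), 'ok');
  return EXIT[worst];
}

// Signal exits follow the shell convention 128 + signal number.
function exitCodeForSignal(signalNumber) {
  return 128 + signalNumber;
}

console.log(exitCodeFor(['ok', 'warn', 'ok'])); // 3
console.log(exitCodeForSignal(15));             // 143
```

A wrapper script could use such a code to decide whether to page someone or only log the run.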

Interactive Console

You may start the interactive console by using the -i option. Additionally, you can use --json to preset values that serve as defaults for the later interactive calls. After that you will be greeted and may enter commands:

> monitor -i # or --interactive

                           __   ____     __
           ######  #####  |  | |    \   |  |   ########### #####       #####
          ######## #####  |  | |     \  |  |  ############  #####     #####
         ######### #####  |  | |  |\  \ |  |  #####          #####   #####
        ########## #####  |  | |  | \  \|  |  #####           ##### #####
       ##### ##### #####  |  | |  |__\     |  ############     #########
      #####  ##### #####  |  | |     \\    |  ############     #########
     #####   ##### #####  |__| |______\\___|  #####           ##### #####
    #####    ##### #####                      #####          #####   #####
   ##### ######### ########################## ############  #####     #####
  ##### ##########  ########################   ########### #####       #####
  ___________________________________________________________________________

                  M O N I T O R I N G   A P P L I C A T I O N
  ___________________________________________________________________________

Initializing...

Welcome to the interactive monitor console in which you can get more
information about special tools, run individual tests and explore systems.

To get help call the command help and close with exit!

monitor>

The following commands are possible here:

help - show a help page with all these commands
set - change general or specific settings
exit - close the interactive run (or press Ctrl-C)

Commands possible for controller, sensor, actor and explorer:

list <type> - list all possible elements of given type
show <type> <element> - show meta information for this element
run <type> <element> - run this element (maybe ask for decisions)

Examples:

list controller
show controller my_machine
set verbose 3
run controller my_machine
show sensor cpu
run sensor cpu
run explorer database

Everything the controller/sensor/actor/explorer needs is asked for before or while running the process.

The interactive console uses a file-based history, so use the arrow keys to step through the history. Auto completion is also often available; try hitting the TAB key.

Using Parameters

If you want to run the same command as on the interactive console but call it directly, you can send it as a command using the options:

-c --command
-j --json

As an example you may run the cleanup:

> monitor -c cleanup

Or get the list of controllers:

> monitor -c 'list controller'

If you run a command which needs additional parameters while running, you have to give all of them on the call as a JSON data object. Take the names from the interactive run, where they are displayed in front of each question.

> monitor -c run sensor cpu -j '{"remote":"localhost"}'

This will run the cpu sensor on the localhost server.

Definitions

Monitor - is the main program
Controller - is the configurable automatic monitoring element holding some checks
Check - will run the sensor and hold its contents
Sensor - will analyze some metrics and collect their values
Status - the current state of the element (ok, warn, fail)
Action - will run the actor and hold the protocol
Actor - will do some active changes on a system
Analysis - will start an explorative check
Analyzer - collect explorative information from a system
Storage - will hold the results and data in a persistence layer
Console - command line interface
Interactive Console - command line interface that works interactively
Daemon - service mode in which the monitor runs the controller checks continuously

Components

Configuration

This describes the base setup. The controller configuration, described in the Controller section, is also needed. Most of the configuration belongs to the base setup, which is used from within the controllers and sensors.

Contacts

The contacts are referenced from the controllers and are defined here in a central file under /monitor/contacts. Array entries are groups and object entries are addresses. Both may be used within a controller.

# Contacts for Monitoring
# =================================================
# This file holds a list of contacts to be used from within the rules and
# specific controllers.

# Groups
# -------------------------------------------------
operations: [aschi]

# Staff
# -------------------------------------------------
aschi:
  name: Alexander Schilling
  position: Developer
  company: Alinex Project
  email: info@alinex.de
  phone: 07129/922545

Multiple phone numbers as array are possible.

The contact named monitor is already defined and is used as the from address in emails. You may override it by defining it yourself.
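How groups and address entries relate can be sketched as follows. This is only an illustration in JavaScript: the resolver resolveContacts is hypothetical, the contacts object mirrors the YAML example above, and nested groups are an assumption.

```javascript
// Contacts as they might look after loading the YAML example above.
const contacts = {
  operations: ['aschi'],
  aschi: { name: 'Alexander Schilling', email: 'info@alinex.de' },
};

// Hypothetical resolver: arrays are groups, objects are address entries.
function resolveContacts(name, seen = new Set()) {
  if (seen.has(name)) return [];               // guard against cyclic groups
  seen.add(name);
  const entry = contacts[name];
  if (entry === undefined) return [];          // unknown names resolve to nothing
  if (Array.isArray(entry)) {
    return entry.flatMap((member) => resolveContacts(member, seen));
  }
  return [entry];
}

console.log(resolveContacts('operations').map((c) => c.email));
```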

Email Templates

These templates are used for sending out emails. A default template is already defined and only needs the 'to' address, but you may define more templates under /monitor/email:

# Email Report Configuration
# =================================================

# Default (extended)
# -------------------------------------------------

default:
  # already defined, so only set the 'To' address here.
  to: operations

# Own Templates
# -------------------------------------------------

fail:
  subject: >
    Failed {{alias}}
  body: >
    {{name}}\n
    ==========================================================================\n
    {{description}}\n
    \n
    This test failed at {{date}}!\n
    \n
    {{hint}}\n

warn:
  subject: >
    Warning for {{alias}}
  body: >
    {{name}}\n
    ==========================================================================\n
    {{description}}\n
    \n
    This test failed at {{date}}!\n
    \n
    {{hint}}\n

ok:
  subject: >
    OK for {{alias}}
  body: >
    {{name}}\n
    ==========================================================================\n
    {{description}}\n
    \n
    This test failed at {{date}}!\n
    \n
    {{hint}}\n
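To show how the double-brace placeholders in these templates get filled, here is a minimal stand-in in JavaScript. It is not the template engine the monitor uses (the hint fields in this document are described as handlebars text); the render function and the sample values are assumptions for illustration.

```javascript
// Stand-in for a real template engine: replaces {{key}} with context values.
// Unknown placeholders are left untouched.
function render(template, context) {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, key) =>
    (key in context ? String(context[key]) : match));
}

const subject = render('Failed {{alias}}', { alias: 'dvb-develop' });
console.log(subject); // Failed dvb-develop
```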

Rules

The rules specify what to do in specific situations under /monitor/rule:

# Rule Definition
# =================================================

# ### Set templates for default rules
fail:
  email:
    base: fail
warn:
  email:
    base: warn
ok:
  email:
    base: ok

# ### specific check
specific:
  # Only work on specific status.
  status: fail
  # Number of minimum attempts before informing.
  attempt: 3
  # Time (in seconds) to wait before informing.
  latency: 60
  # Only inform if dependent jobs have not failed. This prevents hundreds of
  # messages if a central system failed.
  dependskip: true
  # Type of actor to run with its configuration
  email:
    base: fail # template to use defined under monitor/email
    to: aschi # but send to myself
  # Timeout (in seconds) without status change before informing again.
  redo: 3h
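The interplay of status, attempt, latency, dependskip and redo can be sketched as a single decision. This JavaScript helper is hypothetical: the state fields (attempts, since, dependencyFailed, lastNotified) are made-up names, all durations are assumed to be already converted to seconds, and the exact semantics in the monitor may differ.

```javascript
// Hypothetical decision helper for one rule; all durations are in seconds.
// `state.since` is the time the current status was first seen.
function shouldNotify(rule, state, now) {
  if (rule.status && state.status !== rule.status) return false;
  if (state.attempts < (rule.attempt || 1)) return false;
  if (now - state.since < (rule.latency || 0)) return false;
  if (rule.dependskip && state.dependencyFailed) return false;
  if (state.lastNotified === null) return true;          // first notification
  // Inform again only after `redo` seconds without a status change.
  return rule.redo !== undefined && now - state.lastNotified >= rule.redo;
}

// The "specific" rule from above, with redo: 3h written as seconds.
const rule = { status: 'fail', attempt: 3, latency: 60, dependskip: true, redo: 3 * 3600 };
const state = { status: 'fail', attempts: 3, since: 0, dependencyFailed: false, lastNotified: null };
console.log(shouldNotify(rule, state, 60)); // true
```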

Storage

If you want to store the measurement values, you need the following setup under /monitor/storage:

# Storage settings
# =================================================
# Where to store the results of the monitoring.

database: monitor
prefix: mon_

# When to cleanup entries from storage
# -------------------------------------------------
# The values are the number of max. entries of given interval.
storage:
  cleanup:
    minute: 360 # 6 hours
    hour: 96    # 4 days
    day: 90     # 3 months
    week: 104   # two years
    month: 60   # 5 years

The referenced database has to be a PostgreSQL database, and the data structure will be built automatically on startup. The concrete connection settings are defined in the /database configuration, see below.

The cleanup defines how many time units to keep before removing them. Keep in mind that your database will grow if you set high values here.
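The retention that results from the cleanup values can be checked with a little arithmetic. A minimal sketch in JavaScript, assuming a month is counted as 30 days; the helper retentionDays is hypothetical and only illustrates how the numbers relate to the comments in the example.

```javascript
// Cleanup values from the example configuration above.
const cleanup = { minute: 360, hour: 96, day: 90, week: 104, month: 60 };

// Approximate length of each interval in seconds (month taken as 30 days).
const SECONDS = { minute: 60, hour: 3600, day: 86400, week: 604800, month: 2592000 };

// Rough retention window per interval, in days.
function retentionDays(interval) {
  return (cleanup[interval] * SECONDS[interval]) / 86400;
}

console.log(retentionDays('minute')); // 0.25 (6 hours)
console.log(retentionDays('hour'));   // 4
console.log(retentionDays('week'));   // 728 (about two years)
```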

Exec and Database

You also need the setup under /exec and /database as described in Exec and Database. The different sensors use it through references to the setup stored there.

Controller

A controller is an individual part to be checked. It contains some sensors to check the system and may also depend on other controllers. Each controller is defined by a specific configuration file containing meta information.

See the following example for a full controller configuration:

# Monitoring controller configuration
# =================================================
# This is an example of a complete controller configuration.

name: Development Center
description: Server containing miscellaneous tools to help in the development process.

# Monitor runtime configuration
# -------------------------------------------------
# Within the validity the same values will be used without rechecking them and
# after the interval an automatic new run will be started in daemon mode.

# Time (in seconds) in which the value is seen as valid and should not be rechecked.
validity: 1m
# Time (in seconds) to rerun the check in daemon mode.
interval: 5m

# Sensors to run
# -------------------------------------------------
# The list of dependencies are sensors which have to work to make this controller
# fully work.
check:
  - sensor: diskfree

    # ### Name and dependency
    # The name is used for identifying and also to be referred to in other checks
    # as a dependency, meaning that a check can only run if all its dependent
    # checks are done and don't fail.
    #name: mytest-3
    #depend: mytest-1, mytest-2

    # ### Specific setup
    config:
      remote: my-develop
      share: /

    # ### Weight setting
    # Specific to value of the following 'combine' setting.
    # With the `weight` settings on the different entries single group entries may
    # be rated specific not like the others. Use a number in `average` to make the
    # weight higher (1 is normal). Also the weight 'up' and 'down' changes the error
    # level for one step before using in calculation on all combine methods.
    #weight: down

    # ### Hint
    # Specific hint as handlebars text which may include the current results. Use
    # the following variables:
    #
    #     name: Name of the sensor
    #     meta: Meta Information of the sensor
    #     config: Sensor configuration
    #     results: Results
    #hint: |+

# ### Max Parallel checks
# This goes from 1 = serial to n parallel checks running. It is wise not to use
# too high values here to not make a high load on the server by the monitor itself.
parallel: 5

# ### Combine values
# For multiple dependencies this value defines how the individual sensors are
# combined to calculate the overall status:
#
# - max - the one with the highest failure value is used
# - min - the lowest failure value is used
# - average - the average status (arithmetic round) is used
combine: max

# Rules to process
# -------------------------------------------------
# The following rules will be processed after the controller is run. They will
# decide which actions to run and how to do it.
#
# The following list references the active rules for this controller:
rule:
  - fail
  - warn
  - ok

# Information Text
# -------------------------------------------------
# This is a general and unspecific information text for that controller.
info: |+
  This system is used for software development, building and deployment. An
  outage will have direct effects to the developers so that they can't submit,
  test and deploy their code.

# ### Specific Hint
# In contrast to the `info` the `hint` will be more specific to the concrete
# results. Within this handlebar text you may use some specific variables:
#
#     name: controller name
#     config: this config
#     sensor: sensor results
hint: |+
  All necessary parts are on the same machine, so that you only have to bring
  this machine to work. Backups of the data are made on my-backup.

  Keep in mind that the machine is in the test net and you have to use a valid
  VPN connection for accessing.

# Additional Help
# -------------------------------------------------
contact:
  operations: alex

ref:
  # system access
  subversion: http://192.168.1.6/svn
  nexus: http://192.168.1.6:8081/nexus
  Jenkins: http://192.168.1.6:8080/
  sonarqube: http://192.168.1.6:9000/
  # user/developer help
  doc: https://my-docs/confluence/pages/viewpage.action?pageId=48398554
  #issues:
  #api:
  #code:
  #other:

The controller will call the sensors and collect the data. It may also generate reports or trigger specific actions.
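The combine and weight settings from the configuration above can be sketched as follows. This JavaScript is an illustration under assumptions: statuses are mapped to the levels ok = 0, warn = 1, fail = 2, 'up' is read as one step more severe and 'down' as one step less severe, and a controller without checks is treated as ok. The function name combine is hypothetical.

```javascript
const LEVEL = { ok: 0, warn: 1, fail: 2 };
const NAME = ['ok', 'warn', 'fail'];

// Combine the check statuses into one controller status.
function combine(method, checks) {
  if (checks.length === 0) return 'ok';        // assumption: nothing to check is ok
  const levels = checks.map((c) => {
    let level = LEVEL[c.status];
    if (c.weight === 'up') level = Math.min(level + 1, 2);
    if (c.weight === 'down') level = Math.max(level - 1, 0);
    // Numeric weights only matter for the average method (1 is normal).
    return { level, weight: typeof c.weight === 'number' ? c.weight : 1 };
  });
  if (method === 'max') return NAME[Math.max(...levels.map((l) => l.level))];
  if (method === 'min') return NAME[Math.min(...levels.map((l) => l.level))];
  if (method !== 'average') throw new Error(`unknown combine method: ${method}`);
  const total = levels.reduce((sum, l) => sum + l.weight, 0);
  const weighted = levels.reduce((sum, l) => sum + l.level * l.weight, 0);
  return NAME[Math.round(weighted / total)];
}

const checks = [{ status: 'ok' }, { status: 'warn' }, { status: 'fail' }];
console.log(combine('max', checks));     // fail
console.log(combine('average', checks)); // warn
```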

Structure

A controller may hold several sensors, but not too many. You should only group related sensors within it. Dependent parts may be put in another controller, one for each level of dependency.

Each controller should have a unique and memorable name. A good structure of controllers may be:

one for each server: name it like your machine names, e.g. vs1626, ma77234
one for each application part, e.g. web, web1, web2, web3, ftp
one for each end user application, e.g. login, browse, buy
one overall check, e.g. all

Status

The monitor uses the following statuses:

running if the sensor is already analyzing, you have to wait

disabled if this controller is currently not checked - this will be used like ok for further processing

ok if everything is perfect, nothing has to be done - exit code 0

warn if the sensor reached the warning level, now you have to keep an eye on it - exit code 1

fail if the sensor failed and there is a problem - exit code 2

Sensor

A sensor is a code module which allows checking specific parts of the system. It will analyze the system and return some measurement values.

Each use of a sensor in a controller with specific setup data is called a check.

A check consists of the following setup:

sensor - the name of the sensor to use
name - an alias name for referencing (optional)
depend - other checks which should run before (optional)
config - the configuration to run it
weight - adjusts the rating of this check according to the controller's combine setting (optional)
hint - a technical hint to find or resolve the problem (optional)
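The depend setting implies an execution order for the checks. One way to picture it is a depth-first ordering, sketched here in JavaScript. The helper orderChecks is hypothetical, depend is assumed to be already split into an array of names, and the monitor may schedule checks differently (for example in parallel).

```javascript
// Hypothetical ordering helper: returns check names so that every check
// comes after the checks listed in its `depend` array.
function orderChecks(checks) {
  const byName = new Map(checks.map((c) => [c.name, c]));
  const done = new Set();
  const active = new Set();
  const order = [];

  function visit(name) {
    if (done.has(name)) return;
    if (active.has(name)) throw new Error(`cyclic dependency at ${name}`);
    const check = byName.get(name);
    if (!check) throw new Error(`unknown check: ${name}`);
    active.add(name);
    (check.depend || []).forEach((dep) => visit(dep));
    active.delete(name);
    done.add(name);
    order.push(name);
  }

  checks.forEach((c) => visit(c.name));
  return order;
}

const example = [
  { name: 'mytest-3', depend: ['mytest-1', 'mytest-2'] },
  { name: 'mytest-1' },
  { name: 'mytest-2', depend: ['mytest-1'] },
];
console.log(orderChecks(example));
```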

Config

Each sensor has its own configuration settings, as seen above in the controller configuration. The common keys are:

warn - the JavaScript code to check if status should be set to warn
fail - the JavaScript code to check if status should be set to fail
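How such warn and fail conditions could turn result values into a status is sketched below. This is an assumption-heavy illustration in JavaScript: it treats warn and fail as plain expressions over the result values, the value name free is made up, and the helper evaluate is hypothetical. Because it builds functions from strings, it must only be fed trusted configuration.

```javascript
// Assumption: warn/fail are JavaScript expressions over the result values.
// The value names must be valid identifiers for this sketch to work.
function evaluate(config, values) {
  const names = Object.keys(values);
  const args = names.map((n) => values[n]);
  const test = (expr) =>
    Boolean(expr) && Boolean(new Function(...names, `return (${expr});`)(...args));
  if (test(config.fail)) return 'fail';        // fail takes precedence over warn
  if (test(config.warn)) return 'warn';
  return 'ok';
}

const config = { warn: 'free < 20', fail: 'free < 5' };
console.log(evaluate(config, { free: 12 })); // warn
```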

Meta Data

The following meta data are available:

title
description
category - one of 'sys', 'net', 'srv'
hint - additional help for problems

Result

After running a sensor you get a result object containing:

date - array with start and end date of run
status - one of: 'ok', 'warn', 'fail' ('running')
message - optional, explaining the status
values - object containing specific values
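A result object with these fields might look like the following. The concrete values and the names inside values are made up for illustration, and the helper duration is hypothetical.

```javascript
// Hypothetical result of a diskfree check; the value names are illustrative.
const result = {
  date: [new Date('2015-12-15T12:55:51Z'), new Date('2015-12-15T12:55:52Z')],
  status: 'ok',
  message: undefined,                          // optional, only set to explain a status
  values: { total: 100, free: 42 },
};

// Duration of the run in milliseconds, derived from the start/end date pair.
function duration(res) {
  return res.date[1] - res.date[0];
}

console.log(duration(result)); // 1000
```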