This application will make server management easy and fast. It will check the whole IT landscape from the host to the application.
While most monitoring tools has it's focus on the server here the focus lies on the application side, too.
The monitor will analyze your whole environment in deep by connecting to the different systems in parallel and check them deeply. If a problem occurs an additional analysis step may be made to get more information. The result values will be stored in the storage database and a detailed report will be created. Based on additional action rules the report may be send by email or a web request is made. Out of the stored values time reports may be created.
It is one of the modules of the Alinex Universe following the code standards defined in the General Docs.
Install the package globally using npm on a central server. From there all your machines may be checked:
sudo npm install -g alinex-monitor --production
After global installation you may directly call monitor
from anywhere.
monitor --help
Because this application works agentless, you don't have to do something special on your clients but often some simple changes can make the reports more powerful. If so you will get a hint in the report.
Always have a look at the latest changes.
After the monitor and it's controllers are fully configured it may be run by only calling:
> monitor
Initializing...
Analyzing systems...
Finished.
This will start the monitor on the command line and check all controllers. All of them which make some problems will be printed on the console.
Global options:
-C, --nocolors turn of color output
-v, --verbose run in verbose mode
-h, --help Show help
You may run the monitor to check the defined controllers once and get their status:
> monitor my-develop # run only this controller
> monitor my-* # run controllers with the given name prefix
> monitor # run all controllers
You may start a try run in which nothing is stored in the storage and no action is taken.
-t --try do a try run
The verbose mode works here in multiple steps:
-v show the controller status on console
-vv show also the sensor status on console
-vvv also display the result values
If no verbose mode is set only warning and error state of controller will be reported on console.
The output may look like:
> monitor -v
Initializing...
Analyzing systems...
2015-12-15 12:55:32 Controller dvb-develop => ok
Finished.
> monitor -vvv
Initializing...
Analyzing systems...
2015-12-15 12:55:51 Check load:dvb-develop => ok
2015-12-15 12:55:51 Check http:->http://192.168.200.106/svn/ => ok
2015-12-15 12:55:51 Check memory:dvb-develop => ok
2015-12-15 12:55:51 Check diskfree:dvb-develop:/ => ok
2015-12-15 12:55:51 Check diskfree:dvb-develop:/var => ok
2015-12-15 12:55:52 Check ping:localhost->192.168.200.106 => ok
2015-12-15 12:55:52 Check socket:tcp localhost->192.168.200.106:80 => ok
2015-12-15 12:55:52 Check time:dvb-develop => ok
2015-12-15 12:55:52 Check user:dvb-develop => ok
2015-12-15 12:56:01 Check cpu:dvb-develop => ok
2015-12-15 12:56:01 Check diskio:dvb-develop:sda1 => ok
2015-12-15 12:56:02 Check net:dvb-develop:eth0 => ok
2015-12-15 12:56:02 Controller dvb-develop => ok
Finished.
To run the controller continuously use the daemon
option and start it in the
background.
> monitor -d -C > /var/log/monitor.log 2>&1 &
This will check all the controllers in the defined timerange, collect measurement values and send alerts. You may also specify some controllers to run instead of all.
Like seen above you may send the normal output to a log file but better configure a log destination through the config files (see below).
For production use you may start it using pm2.
You may run some other commands through the interactive console or directly by giving everything on the command line call.
> monitor -c list controller
See the next section for the interactive console and their integrated help system for more details on the possible commands..
To use the controller you have to setup the whole process using some configuration files. And maybe a storage database will be used.
The exit codes are arranged alongside the UNIX default:
Code | Description |
---|---|
0 | OK - no error in controller |
1 | General error which should not occur. |
2 | Fail - controller run failed |
3 | Warn - warning in controller run |
129 | SIGHUP (Signal 1) |
130 | SIGINT like through Ctrl + C (Signal 2) |
131 | SIGQUIT (Signal 3) |
134 | SIGABRT or SIGIOT (Signal 6) |
143 | SIGTERM (Signal 15) |
255 | Exit status out of range |
You may start the interactive console by using the -i
option. Additionally you
can use --json
to set some presettings with default for the later interactive
call. After that you will be greeted and may give the commands:
> monitor -i # or --interactive
__ ____ __
###### ##### | | | \ | | ########### ##### #####
######## ##### | | | \ | | ############ ##### #####
######### ##### | | | |\ \ | | ##### ##### #####
########## ##### | | | | \ \| | ##### ##### #####
##### ##### ##### | | | |__\ | ############ #########
##### ##### ##### | | | \\ | ############ #########
##### ##### ##### |__| |______\\___| ##### ##### #####
##### ##### ##### ##### ##### #####
##### ######### ########################## ############ ##### #####
##### ########## ######################## ########### ##### #####
___________________________________________________________________________
M O N I T O R I N G A P P L I C A T I O N
___________________________________________________________________________
Initializing...
Welcome to the interactive monitor console in which you can get more
information about special tools, run individual tests and explore systems.
To get help call the command help and close with exit!
monitor>
The following commands are possible here:
help - show a help page with all this commands
set - change general or specific settings
exit - this will close the interactive run or send Ctrl-C
Commands possible for controller, sensor, actor and explorer:
list <type> - list all possible elements of given type
show <type> <element> - show meta information for this element
run <type> <element> - run this element (maybe ask for decisions)
Examples:
list controller
show controller my_machine
set verbose 3
run controller my_machine
show sensor cpu
run sensor cpu
run explorer database
Everything the controller/sensor/actor/explorer need is asked within or before starting the process.
The interactive console uses a file based history so use the arrow keys to step in history. Also auto completion is often available try hitting the
TAB
key.
If you want to run the same command as on the interactive console but call it directly you can send it as command using the options:
-c --command
-j --json
As an example you may run the cleanup:
> monitor -c cleanup
Or get the list of controllers:
> monitor -c 'list controller'
If you run a command which needs optional parameters while running you have to give all of them on call as a json data object. Take the names from the interactive run displayed in front of the question.
> monitor -c run sensor cpu -j '{"remote":"localhost"}'
This will run the cpu sensor on localhost server.
This will describe the base setup. Also needed is the controller configuration which is described in the next section. Most parts of the configuration is the base setup which is used from within the controller and sensors.
The contacts are referenced from the controllers and are defined here in a central
file under /monitor/contacts
. Entries with array are groups and objects are
address entries. Within the controller both may be used.
# Contacts for Monitoring
# =================================================
# This file holds a list of contacts to be used from within the rules and
# specific controllers.
# Groups
# -------------------------------------------------
operations: [aschi]
# Staff
# -------------------------------------------------
aschi:
name: Alexander Schilling
position: Developer
company: Alinex Project
email: info@alinex.de
phone: 07129/922545
Multiple phone numbers as array are possible.
The contact monitor
is already defined and used as from address in emails. You
may overwrite it by defining it yourself.
This templates are used for sending emails out. A default
template is already
defined and only needs the 'to' address. But you may define more templates under
/monitor/email
:
# Email Report Configuration
# =================================================
# Default (extended)
# -------------------------------------------------
default:
# already defined, so only set the 'To' address here.
to: operations
# Own Templates
# -------------------------------------------------
fail:
subject: >
Failed {{alias}}
body: >
{{name}}\n
==========================================================================\n
{{description}}\n
\n
This test failed at {{date}}!\n
\n
{{hint}}\n
warn:
subject: >
Warning for {{alias}}
body: >
{{name}}\n
==========================================================================\n
{{description}}\n
\n
This test failed at {{date}}!\n
\n
{{hint}}\n
ok:
subject: >
OK for {{alias}}
body: >
{{name}}\n
==========================================================================\n
{{description}}\n
\n
This test failed at {{date}}!\n
\n
{{hint}}\n
The rules specify what to do in specific situations under /monitor/rule
:
# Rule Definition
# =================================================
# ### Set templates for default rules
fail:
email:
base: fail
warn:
email:
base: warn
ok:
email:
base: ok
# ### specific check
specific:
# Only work on specific status.
status: fail
# Number of minimum attempts before informing.
attempt: 3
# Time (in seconds) to wait before informing.
latency: 60
# Only inform if dependent jobs not failed. This prevents of hundred of
# messages if a central system failed.
dependskip: true
# Type of actor to run with it's configuration
email:
base: fail # template to use defined under monitor/email
to: aschi # but send to myself
# Timeout (in seconds) without status change before informing again.
redo: 3h
If you want to store the measurement values, you need the following setup under
/monitor/storage
:
# Storage settings
# =================================================
# There to store the results of the monitoring.
database: monitor
prefix: mon_
# When to cleanup entries from storage
# -------------------------------------------------
# The values are the number of max. entries of given interval.
storage:
cleanup:
minute: 360 # 6 hours
hour: 96 # 4 days
day: 90 # 3 months
week: 104 # two years
month: 60 # 5 years
The referenced database have to be a postgresql database here and the data
structure will be build on startup automatically. The concrete connection
settings are defined in the /database
configuration, see below.
The cleanup defines how much time units to keep before removing them. Keep in mind that your database will grow if you set high values here.
Also you need the setup under /exec
and /database
like described in
Exec and Database.
This is used in the different sensors by references to the setup stored there.
A controller is an individual part to be checked. It contains some sensors to check the system and may also depend on other controllers. Each controller is made by a specific configuration files containing meta information.
See the following example for a full controller configuration:
# Monitoring controller configuration
# =================================================
# This is an example of a complete controller configuration.
name: Development Center
description: Server containing miscellaneous tools to help in the development process.
# Monitor runtime configuration
# -------------------------------------------------
# Within the validity the same values will be used without rechecking them and
# after the interval an automatic new run will be started in daemon mode.
# Time (in seconds) in which the value is seen as valid and should not be rechecked.
validity: 1m
# Time (in seconds) to rerun the check in daemon mode.
interval: 5m
# Sensors to run
# -------------------------------------------------
# The list of dependencies are sensors which have to work to make this controller
# fully work.
check:
- sensor: diskfree
# ### Name and dependency
# The name is used for identifying and also to be referred in other checks as
# dependency meaning that a check only can run if all it's dependant checks are
# done and don't fail.
#name: mytest-3
#depend: mytest-1, mytest-2
# ### Specific setup
config:
remote: my-develop
share: /
# ### Weight setting
# Specific to value of the following 'combine' setting.
# With the `weight` settings on the different entries single group entries may
# be rated specific not like the others. Use a number in `average` to make the
# weight higher (1 is normal). Also the weight 'up' and 'down' changes the error
# level for one step before using in calculation on all combine methods.
#weight: down
# ### Hint
# Specific hint as handlebars text which may include the current results. Use
# the following variables:
#
# name: Name of the sensor
# meta: Meta Information of the sensor
# config: Sensor configuration
# results: Results
#hint: |+
# ### Max Parallel checks
# This goes from 1 = serial to n parallel checks running. It is wise not to use
# too high values here to not make a high load on the server by the monitor itself.
parallel: 5
# ### Combine values
# For multiple dependencies this value defines how the individual sensors are
# combined to calculate the overall status:
#
# - max - the one with the highest failure value is used
# - min - the lowest failure value is used
# - average - the average status (arithmetic round) is used
combine: max
# Rules to process
# -------------------------------------------------
# The following rules will be processed after the controller is run. They will
# decide which actions to run and how to do it.
#
# The following list references the active rules for this controller:
rule:
- fail
- warn
- ok
# Information Text
# -------------------------------------------------
# This is a general and unspecific information text for that controller.
info: |+
This system is used for software development, building and deployment. An
outage will have direct effects to the developers so that they can't submit,
test and deploy their code.
# ### Specific Hint
# In contrast to the `info` the `hint` will be more specific to the concrete
# results. Within this handlebar text you may use some specific variables:
#
# name: controller name
# config: this config
# sensor: sensor results
hint: |+
All necessary parts are on the same machine, so that you only have to bring
this machine to work. Backups of the data are made on my-backup.
Keep in mind that the machine is in the test net and you have to use a valid
VPN connection for accessing.
# Additional Help
# -------------------------------------------------
contact:
operations: alex
ref:
# system access
subversion: http://192.168.1.6/svn
nexus: http://192.168.1.6:8081/nexus
Jenkins: http://192.168.1.6:8080/
sonarqube: http://192.168.1.6:9000/
# user/developer help
doc: https://my-docs/confluence/pages/viewpage.action?pageId=48398554
#issues:
#api:
#code:
#other:
The controller will call the sensors and collect the data. It may also generate reports or trigger specific actions.
A controller may hold some sensor but not to much. You should only group corresponding sensors within it. Dependent parts may be put in another controller, one for each level of dependency.
Each controller should have an unique and memorable name. A good structure of controllers may be:
vs1626
, ma77234
web
, web1
, web2
, web3
, ftp
login
, browse
, buy
all
The monitor uses the following status:
running if the sensor is already analyzing, you have to wait
disabled if this controller is currently not checked - this will be used like ok for further processing
ok if everything is perfect, there nothing have to be done - exit code 0
warn if the sensor reached the warning level, know you have to keep an eye on it - exit code 1
fail if the sensor failed and there is a problem - exit code 2
An sensor is a code module which allows to check specific parts of the system. It will analyze the system and get some measurement values back.
Each use of a sensor in an controller with specific setup data is further called a check.
A check consists of the following setup:
Each sensor has its own configuration settings like seen above in the controller configuration. The common keys are:
The following meta data are available:
After running a sensor you get a result object containing:
Checking the CPU utilization of all cores together. With the configuration values:
The resulting report part may look like:
CPU localhost
--------------------------------------------------------------------------------
Check the current activity in average percent of all cores.
::: info
- __STATUS: ok__ at Wed Dec 23 2015 16:04:31 GMT+0100 (CET)
:::
Last check results from Wed Dec 23 2015 16:04:31 GMT+0100 (CET) are:
| NAME | LABEL | VALUE |
|:------ |:---------------- | --------:|
| cpus | CPU Cores | 8 |
| speed | CPU Speed | 3.61 GHz |
| user | User Time | 3.7 % |
| system | System Time | 0.68 % |
| idle | Idle Time | 95 % |
| active | Activity | 5.1 % |
| wait | I/O Wait Time | 0.71 % |
| low | Lowest CPU Core | 1.9 % |
| high | Highest CPU Core | 8.2 % |
> A high CPU usage means that the server may not start another task immediately.
> If the load is also very high the system is overloaded, check if any application
> goes evil.
### Configuration
The cpu sensor is configured with:
| CONFIGURATION SETTING | VALUE |
|:--------------------- |:-------------- |
| time | 10 |
| warn | active >= 100% |
Check the system load in the last time ranges. With the configuration values:
The resulting report part may look like:
Load localhost
--------------------------------------------------------------------------------
Check the local processor activity over the last minute to 15 minutes.
::: info
- __STATUS: ok__ at Wed Dec 23 2015 16:14:11 GMT+0100 (CET)
:::
Last check results from Wed Dec 23 2015 16:14:11 GMT+0100 (CET) are:
| NAME | LABEL | VALUE |
|:------ |:---------- | -----:|
| short | 1min Load | 4.8 % |
| medium | 5min Load | 5.1 % |
| long | 15min Load | 8.4 % |
> A very high system load makes the system irresponsible or really slow. Mostly
> this is CPU-bound load, load caused by out of memory issues or I/O-bound load
> problems.
### Configuration
The load sensor is configured with:
| CONFIGURATION SETTING | VALUE |
|:--------------------- |:------------ |
| warn | short > 500% |
Check the memory usage on the system
The resulting report part may look like:
Memory localhost
--------------------------------------------------------------------------------
Check the free and used memory.
::: info
- __STATUS: ok__ at Wed Dec 23 2015 16:14:11 GMT+0100 (CET)
:::
Last check results from Wed Dec 23 2015 16:14:11 GMT+0100 (CET) are:
| NAME | LABEL | VALUE |
|:--------------- |:----------------- | --------:|
| total | Total | 8.05 GB |
| used | Used | 7.71 GB |
| free | Free | 318 MiB |
| shared | Shared | 0.899 GB |
| buffers | Buffers | 300 MiB |
| cached | Cached | 2.34 GB |
| swapTotal | Swap Total | 8.26 GB |
| swapUsed | Swap Used | 1.84 GB |
| swapFree | Swap Free | 6.42 GB |
| actualFree | Actual Free | 2.99 GB |
| percentFree | Percent Free | 37 % |
| swapPercentFree | Swap Percent Free | 78 % |
> Check which process consumes how much memory, maybe some processes have a
> memory leak.
### Configuration
The memory sensor is configured with:
| CONFIGURATION SETTING | VALUE |
|:--------------------- |:--------- |
| warn | free < 1% |
This sensor will check the disk usage on a specific block device. The configuration allows:
The resulting report part may look like:
Diskfree localhost:/
--------------------------------------------------------------------------------
Test the free diskspace of one share.
::: info
- __STATUS: ok__ at Wed Dec 23 2015 16:04:55 GMT+0100 (CET)
:::
Last check results from Wed Dec 23 2015 16:04:55 GMT+0100 (CET) are:
| NAME | LABEL | VALUE |
|:----------- |:---------- | ------------------------------------------------------:|
| share | Share | /dev/disk/by-uuid/c9492a0b-b280-43de-9882-c845cbecdadf |
| type | Type | ext4 |
| mount | Mountpoint | / |
| total | Available | 0.926 TB |
| used | Used | 383 GiB |
| usedPercent | % Used | 44 % |
| free | Free | 479 GiB |
| freePercent | % Free | 56 % |
> If a share is full it will make I/O problems in the system or applications in
> case of the root partition it may also neither be possible to log errors. Maybe
> some old files like temp or logs can be removed or compressed.
### Configuration
The diskfree sensor is configured with:
| CONFIGURATION SETTING | VALUE |
|:--------------------- |:----------------- |
| share | / |
| timeout | 5000 |
| warn | freePercent < 10% |
| fail | free is 0 |
This sensor will check the disk io traffic:
The resulting report part may look like:
Disk IO localhost:sda
--------------------------------------------------------------------------------
Check the disk io traffic.
::: info
- __STATUS: ok__ at Wed Dec 23 2015 16:13:46 GMT+0100 (CET)
:::
Last check results from Wed Dec 23 2015 16:13:46 GMT+0100 (CET) are:
| NAME | LABEL | VALUE |
|:---------- |:----------------- | --------:|
| write | Write operation/s | 11.6 |
| writeSize | Write/s | 482 KiB |
| readTotal | Total Read | 57.4 GiB |
| writeTotal | Total Write | 207 GiB |
| writeTime | Write Time/s | 43 ms |
> If there are any problems here check the device for hardware or network
> problems.
### Configuration
The diskio sensor is configured with:
| CONFIGURATION SETTING | VALUE |
|:--------------------- |:----- |
| device | sda |
| time | 10 |
This sensor will check the network traffic on a specified interface:
The resulting report part may look like:
Network Traffic localhost:eth0
--------------------------------------------------------------------------------
Check the network traffic.
::: info
- __STATUS: ok__ at Wed Dec 23 2015 16:14:11 GMT+0100 (CET)
:::
Last check results from Wed Dec 23 2015 16:14:11 GMT+0100 (CET) are:
| NAME | LABEL | VALUE |
|:--------------- |:------------------- | -------------------------:|
| receivedBytes | Receive Transfer | 22 KiB |
| receivedPackets | Received Packets | 68 |
| transmitBytes | Transmit Transfer | 16.5 KiB |
| transmitPackets | Transmitted Packets | 63 |
| bytes | Total Transfer | 38.5 KiB |
| packets | Total Packets | 131 |
| state | Interface State | UP |
| mac | Mac Address | 40:a8:f0:46:e5:76 |
| ipv4 | IP Address | 192.168.5.113 |
| ipv6 | IPv6 Address | fe80::42a8:f0ff:fe46:e576 |
> If you see a high volume it may be overloaded or a attack is running.
### Configuration
The net sensor is configured with:
| CONFIGURATION SETTING | VALUE |
|:--------------------- |:------------ |
| interface | eth0 |
| time | 10 |
| warn | errors > 50% |
| fail | errors > 99% |
This sensor will check the network traffic on a specified interface:
The resulting report part may look like:
Time Check localhost
--------------------------------------------------------------------------------
Check the system time against the Internet.
::: info
- __STATUS: ok__ at Wed Dec 23 2015 16:14:50 GMT+0100 (CET)
:::
Last check results from Wed Dec 23 2015 16:14:50 GMT+0100 (CET) are:
| NAME | LABEL | VALUE |
|:---- |:---------- | ------:|
| diff | Difference | 0.62 s |
> If the time is not correct it may influence some processes which goes over
> multiple hosts. Therefore install and configure `ntpd` on the machine.
### Configuration
The time sensor is configured with:
| CONFIGURATION SETTING | VALUE |
|:--------------------- |:------------ |
| host | pool.ntp.org |
| port | 123 |
| timeout | 10000 |
| warn | diff > 10000 |
This sensor will analyse processes started from a specific user:
The resulting report part may look like:
Active User localhost
--------------------------------------------------------------------------------
Check what an active user do.
::: info
- __STATUS: ok__ at Wed Dec 23 2015 16:14:50 GMT+0100 (CET)
:::
Last check results from Wed Dec 23 2015 16:14:50 GMT+0100 (CET) are:
| NAME | LABEL | VALUE |
|:------ |:--------------- | -------:|
| num | Processes | 96 |
| cpu | % CPU | 3.9 % |
| memory | % Memory | 58 % |
| rss | Physical Memory | 118 MiB |
| vss | Virtual Memory | 4.88 MB |
> This check will give an overview of the activities of an (logged in) user. If
> you look at the processes you may find out that some other warnings like high
> load are user made and you may contact this person directly.
### Configuration
The user sensor is configured with:
| CONFIGURATION SETTING | VALUE |
|:--------------------- |:----- |
| user | alex |
Although simple, but important to check if a host is responding to ICMP ping packets. Thus, it is possible to measure the availability of a server, as well as the response time and packet loss:
Ping localhost->193.99.144.80
--------------------------------------------------------------------------------
Test the reachability of a host in an IP network and measure the round-trip time
for the messages send.
::: info
- __STATUS: ok__ at Thu Dec 24 2015 08:59:40 GMT+0100 (CET)
:::
Last check results from Thu Dec 24 2015 08:59:40 GMT+0100 (CET) are:
| NAME | LABEL | VALUE |
|:------------ |:------------------ | -------:|
| responseTime | Avg. Response Time | 15.3 ms |
| responseMin | Min. Respons Time | 15.3 ms |
| responseMax | Max. Response Time | 15.3 ms |
| quality | Quality | 100 % |
__Result:__
PING 193.99.144.80 (193.99.144.80) 56(84) bytes of data.
64 bytes from 193.99.144.80: icmp_seq=1 ttl=244 time=15.3 ms
--- 193.99.144.80 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 15.354/15.354/15.354/0.000 ms
Check the network card configuration if local ping won't work or the network connection for external pings. Problems can also be that the firewall will block the ping port.
The ping sensor is configured with:
CONFIGURATION SETTING | VALUE |
---|---|
host | 193.99.144.80 |
count | 1 |
interval | 1000 |
size | 56 |
timeout | 1000 |
warn | quality < 100% |
fail | quality is 0 |
### Socket
This sensor will ping another host:
- remote - the remote server, there to run the sensor
- host - the server hostname or ip address to be called for ping
- port - the port number used to connect to
- transport - the protocol used for internet transport layer (default:tcp)
- warn - the javascript code to check for warn status (default: 'quality < 100%')
- fail - the javascript code to check for fail status (default: 'quality is 0')
``` text
Socket tcp localhost->193.99.144.80:80
--------------------------------------------------------------------------------
Use TCP sockets to check for the availability of a service behind a given port.
::: info
- __STATUS: ok__ at Wed Dec 23 2015 16:14:45 GMT+0100 (CET)
:::
Last check results from Wed Dec 23 2015 16:14:45 GMT+0100 (CET) are:
| NAME | LABEL | VALUE |
|:------------ |:------------- | -----:|
| responseTime | Response Time | 1 s |
> On problems the service may not run or a network problem exists.
### Configuration
The socket sensor is configured with:
| CONFIGURATION SETTING | VALUE |
|:--------------------- |:------------- |
| host | 193.99.144.80 |
| port | 80 |
| transport | tcp |
| timeout | 2000 |
This sensor will ping another host:
If a remote server reference is given this will be used for tunneling over ssh as proxy.
Within the match parameter you may use named-captures in the form (:
regexp). If you do so you will get a named object instead of an array as result.
HTTP Request ->http://heise.de
--------------------------------------------------------------------------------
Connect to an HTTP or HTTPS server and check the response.
::: info
- __STATUS: ok__ at Wed Dec 23 2015 21:26:14 GMT+0100 (CET)
:::
Last check results from Wed Dec 23 2015 21:26:14 GMT+0100 (CET) are:
| NAME | LABEL | VALUE |
|:------------- |:-------------- | ------------------------:|
| responseTime | Response Time | 320 ms |
| statusCode | Status Code | 200 |
| statusMessage | Status Message | OK |
| server | Server | nginx |
| contentType | Content Type | text/html; charset=utf-8 |
| length | Content Length | 177 KiB |
__Request:__
``` text
GET http://heise.de
User-Agent: Alinex Monitor through request.js
Response:
server: nginx
content-type: text/html; charset=utf-8
x-cobbler: octo09.heise.de
x-clacks-overhead: GNU Terry Pratchett
last-modified: Wed, 23 Dec 2015 20:26:07 GMT
expires: Wed, 23 Dec 2015 20:26:39 GMT
cache-control: public, max-age=32
transfer-encoding: chunked
date: Wed, 23 Dec 2015 20:26:15 GMT
age: 8
connection: keep-alive
vary: User-Agent,Accept-Encoding,X-Forwarded-Proto,X-Export-Format,X-Export-Agent
Body:
<!DOCTYPE html>
<html lang="de">
<head>
<title>heise online - IT-News, Nachrichten und Hintergründe
</title>
<meta name="description" content="News und Foren zu Computer, IT, Wissenschaft, Medien und Politik. Preisvergleich von Hardware und Software sowie Downloads bei Heise Medien.">
<meta name="keywords" content="heise online, c't, iX, Technology Review"...
If the server didn't respond it also may be a network problem.
The http sensor is configured with:
CONFIGURATION SETTING | VALUE |
---|---|
url | http://heise.de |
timeout | 10000 |
fail | statusCode < 200 or statusCode >= 400 |
### Ftp
To be written...
- put
- read
- move
- delete
- responseTime
### SFtp
To be written...
- put
- read
- move
- delete
- responseTime
### SNMP
To be written...
Makes monitoring any network device like printer... possible.
### IPMI
### Certificate
To be written...
- validTime
- keySize
Daemon Sensors
-------------------------------------------------
### PID
To be written...
cat /proc/PID/cmdline
cat /proc/PID/status
### PostgreSQL
To be written...
https://wiki.postgresql.org/wiki/Monitoring
analysis:
- pid sensor
### Apache
To be written...
mod_status:
- The number of worker serving requests
- The number of idle worker
- A total number of accesses and byte count served (*)
- The time the server was started/restarted and the time it has been running for
Averages giving the number of requests per second, the number of bytes served per second and the average number of bytes per request (*)
analysis:
- The current hosts and requests being processed (*)
- pid sensor
### Tomcat
To be written...
Maybe use JMX... https://www.npmjs.com/package/jmx
analysis:
- pid sensor
### Wowza
To be written...
Parsing stats page
analysis:
- pid sensor
### VMWare
To be written...
http://searchitchannel.techtarget.com/feature/Monitoring-vSphere-performance-with-command-line-tools
Data Sensors
-------------------------------------------------
### Log
To be written...
- filter
- timerange
- num lines
- filtered lines
- comment
- list of lines
### Database
This sensor will get some measurement values from the database:
- database - alias name of the database to use
- query - the query to run
- timeout - maximum time to run the query (default: 10s)
``` text
Database test-postgresql:SELECT 100 as num, 'just a...
--------------------------------------------------------------------------------
Run a query on the database to chech a value like count of entries in the
database.
::: info
- __STATUS: ok__ at Wed Dec 23 2015 16:04:54 GMT+0100 (CET)
:::
Last check results from Wed Dec 23 2015 16:04:54 GMT+0100 (CET) are:
| NAME | LABEL | VALUE |
|:------------ |:------------- | -------------:|
| data | Data.num | 100 |
| data | Data.comment | just a number |
| responseTime | Response Time | 1.23 s |
### Configuration
The database sensor is configured with:
| CONFIGURATION SETTING | VALUE |
|:--------------------- |:--------------------------------------------- |
| database | test-postgresql |
| query | SELECT 100 as num, 'just a number' as comment |
| timeout | 10000 |
To be written...
To be written...
http://casperjs.org/
To be written...
The controllers will hold some information in memory but store all values also in a database for long time analysis. This includes the following information:
The database structure therefore looks like:
To get easy to view report diagrams out of this you may use the dbVisualizer tool. First make a script to select the values you want to display:
SELECT * FROM controller
LEFT JOIN check USING (controller_id)
RIGHT JOIN sensor_cpu USING (check_id)
WHERE controller.name='my-develop' AND interval='hour'
ORDER BY period
In the result you may switch to graphical view and select the rows to display.
To keep the data volume low old values will be removed.
The controller may do some actions:
All these are triggered using the configuration rules described above.
The explorer modules are used for exploratory analyzation with the given data or by interactively requesting for answers. They are a scriptable tool to gather information and analyze them.
In contrast to the sensor it will retrieve also a lot of more static or seldom changing information and may take a lot more time to execute. It will also run with lower priority on the server. Multiple parallel analysis is also possible on some of the explorers leading to a comparison or overview report.
To run them use the interactive console or give all details to the cli as a json object.
If you want to have the report as email give -m <address>
as parameter on the
monitor startup.
run by step only if before failed and add to result.hint
cat /proc/version
running daemons (exclude system ones)
/proc/sys/kernel/domainname /proc/sys/kernel/hostname ? firewall
config date from-to (default today) time from-to
crontab -l sort by time split alternatives into multiple entries split from to into multiple entries
To extend the monitor with your own sensors, analyzers and actors you have the possibility to create your own package with them.
To include your own plugins you have to install them using npm
and then add
them in the /monitor/plugins
list of your configuration file.
npm install my-plugin
...
plugins:
- my-plugin
...
After that you can use them in your controller or analyzation run as if they are internal ones by name.
Each plugin may have the following methods in the main module:
This methods should return a map of elements with name: require path
within the
module like cpu: 'sensor/cpu'
.
The elements returned need the same API as in the main package (see below).
A sensor should be a module exporting the following objects:
Coming soon...
Like the sensor it needs a schema definition for all values needed in the process. In interactive mode the explorer it will ask for them then needed. Therefore the structure is nearly the same as for a sensor:
I will try to release a first stable version in December 2015. This will have:
The next version coming early in 2016 will have:
After that a lot of bug fixes, smaller improvements and more sensors, actors and explorers will follow.
Copyright 2015-2016 Alexander Schilling
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.