Introduction
The monitoring and alerting feature allows you to receive notifications when monitoring measurements fall below or exceed a specified threshold. For instance, you can receive an email notification if a device’s available disk space runs out.
Monitoring data, also called measurements, are evaluated per client at the moment data arrives at the server. Rule processing is not a periodic job. If your clients send monitoring data once a minute, rules are evaluated once a minute too.
Rule evaluation is limited to the data sent by the client and the last 10 measurements, which are taken from an in-memory cache. That means a rule like “Fire an alert when both client A and client B have high CPU load” is currently not possible.
Please refer to the server sizing documentation regarding the memory required for the measurement cache.
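The evaluation model described above can be sketched in a few lines of Python. This is only an illustration, not RPort's implementation: the class and function names are hypothetical, and evaluating a rule before the full window is available is an assumption.

```python
from collections import defaultdict, deque
from statistics import mean

CACHE_SIZE = 10  # the text states that only the last 10 measurements are kept


class MeasurementCache:
    """Per-client ring buffer (hypothetical name, not an RPort internal)."""

    def __init__(self, size=CACHE_SIZE):
        self._by_client = defaultdict(lambda: deque(maxlen=size))

    def add(self, client_id, cpu_usage_percent):
        # Older values fall out automatically once the buffer is full.
        self._by_client[client_id].append(cpu_usage_percent)

    def last(self, client_id, n):
        # Return at most the n newest values of a single client.
        return list(self._by_client[client_id])[-n:]


def on_measurement(cache, client_id, cpu_usage_percent, window=3, threshold=80.0):
    """Runs once per incoming measurement, not on a timer."""
    cache.add(client_id, cpu_usage_percent)
    values = cache.last(client_id, window)
    # Assumption: the rule is evaluated even if fewer than `window` values exist.
    return mean(values) > threshold


cache = MeasurementCache()
results = [on_measurement(cache, "client-a", v) for v in (70, 85, 95, 99)]
print(results)  # [False, False, True, True]
```

Because each client has its own buffer and a rule only sees the buffer of the client that just reported, a rule spanning two clients cannot be expressed in this model.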
To activate the alerting system, you need to configure three settings:
- Configure an SMTP server: An SMTP server is required to send email notifications.
- Create a template: Templates define who will receive notifications, the method of notification (e.g., email), and the content of the message (subject and body).
- Create a rule or rule set: Rule sets determine which measurement is being evaluated, the threshold, and the action to be taken. If the action involves notification, the rule references a template.
To receive your first notification, follow these steps:
Ensure you have a valid SMTP configuration in your rportd.conf file. The config file contains all explanations needed. Also look at the recommendation for an optimal SMTP setup at the end of the document.
Create the template below using the UI or API:
HTTP POST /api/v1/monitoring/notification-templates
{
  "id": "t1",
  "transport": "smtp",
  "subject": "{{.Outcome}} {{.Rule.Severity}} for {{.Rule.ID}} on {{.Client.Name}}",
  "body": "The client {{.Client.Name}} ({{.Client.ID}}) <br>\nhas triggered rule ID: <b>{{.Rule.ID}} </b><br>\nwith severity: {{.Rule.Severity}} <br>\n<small>The client runs kernel {{.Client.OSKernel}}</small>",
  "html": true,
  "recipients": [
    "user@example.com"
  ]
}
ID: t1
Transport: SMTP
Subject: {{.Outcome}} {{.Rule.Severity}} for {{.Rule.ID}} on {{.Client.Name}}
Body:
The client {{.Client.Name}} ({{.Client.ID}}) <br>
has triggered rule ID: <b>{{.Rule.ID}} </b><br>
with severity: {{.Rule.Severity}} <br>
<small>The client runs kernel {{.Client.OSKernel}}</small>
HTML: enabled
Recipients: user@example.com
Create the rule set below using the UI or by submitting it via HTTP PUT to the /api/v1/monitoring/rules endpoint:
{
  "rules": [
    {
      "id": "high-cpu-usage",
      "severity": "warning",
      "expr": "Avg(CPUUsagePercent(3)) > 80.0",
      "actions": [
        {
          "notify": [
            "t1"
          ]
        }
      ]
    }
  ]
}
Severity: warning
ID: high-cpu-usage
Expression: Avg(CPUUsagePercent(3)) > 80.0
Actions: Trigger notification template t1
This rule will trigger the notification template t1 if the average CPU usage of the last three measurements is greater than 80 percent.
Test the rule by executing nohup stress -t 600 -c 1 -q >/dev/null 2>&1 & on a Linux system.
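If you prefer scripting the two API calls from the steps above, the following Python sketch builds both requests without sending them. The server address is a placeholder, the JSON content-type header is an assumption, and authentication is left out because it is not covered in this document.

```python
import json
import urllib.request

BASE_URL = "https://rport.example.com"  # hypothetical server address

template = {
    "id": "t1",
    "transport": "smtp",
    "subject": "{{.Outcome}} {{.Rule.Severity}} for {{.Rule.ID}} on {{.Client.Name}}",
    "body": "The client {{.Client.Name}} ({{.Client.ID}}) <br>\n"
            "has triggered rule ID: <b>{{.Rule.ID}} </b><br>\n"
            "with severity: {{.Rule.Severity}} <br>\n"
            "<small>The client runs kernel {{.Client.OSKernel}}</small>",
    "html": True,
    "recipients": ["user@example.com"],
}

rule_set = {
    "rules": [
        {
            "id": "high-cpu-usage",
            "severity": "warning",
            "expr": "Avg(CPUUsagePercent(3)) > 80.0",
            "actions": [{"notify": ["t1"]}],
        }
    ]
}


def build_request(method, path, payload):
    """Prepare a JSON request object; nothing is sent over the network."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        method=method,
        headers={"Content-Type": "application/json"},  # assumed header
    )


template_req = build_request("POST", "/api/v1/monitoring/notification-templates", template)
rules_req = build_request("PUT", "/api/v1/monitoring/rules", rule_set)
print(template_req.get_method(), template_req.full_url)
print(rules_req.get_method(), rules_req.full_url)
```

Passing a prepared request to urllib.request.urlopen() would perform the call once the credentials your server requires have been added.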
The following rules cover most common use cases:
CPU Monitoring:
Avg(CPUUsagePercent(X)) > Y
If the average of the last X CPU usage (percent) measurements is above Y, the expression becomes true, and actions are triggered.
Memory Monitoring:
Avg(MemoryUsagePercent(5)) > N
If the average of the last 5 memory usage (percent) measurements is above N, the expression becomes true, and actions are triggered.
Disk Monitoring:
Avg(MountPoints("*", "FreeBytes", X)) < (Y * GB)
If the average of free bytes on any mount point or drive letter over the last X measurements is below Y GB, the expression becomes true, and actions are triggered.
Process Monitoring:
not Match("X", ProcessCmdLines(Y))
or
not Match("X", ProcessNames(Y))
If the list of processes does not contain X during the last Y measurements, the expression fires. X supports wildcards. ProcessCmdLines() matches against the full command line (e.g., /usr/sbin/sshd -D), while ProcessNames() matches against the process name only (e.g., sshd).
The average and process match functions can evaluate at most the last 10 measurements.
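To make the process rule more concrete, here is a rough Python model of not Match("X", ProcessNames(Y)). It is a sketch under assumptions: shell-style wildcards via fnmatch stand in for whatever wildcard syntax RPort implements, and a pattern counts as found if it appears in any of the inspected measurements. The sample data is made up.

```python
from fnmatch import fnmatchcase


def match(pattern, process_lists):
    """True if any process in any of the given measurements matches pattern."""
    return any(
        fnmatchcase(name, pattern)
        for processes in process_lists
        for name in processes
    )


# Process names from the two most recent measurements (made-up sample data).
recent_names = [["systemd", "sshd", "cron"], ["systemd", "cron"]]

sshd_missing = not match("sshd", recent_names)    # sshd was seen once
nginx_missing = not match("ngin*", recent_names)  # nothing matches ngin*
print(sshd_missing, nginx_missing)  # False True
```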
By default, notifications are also sent once a problem has recovered. Recovery means that the rule expression evaluates to false.
Make sure to have the placeholder {{.Outcome}} in your template. It will be replaced by either ALERTING or RESOLVED.
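The relationship between the rule expression and the outcome can be pictured as a small state machine. The Python sketch below assumes that a message is sent only when the state changes. The text does not say whether alerts repeat while a rule stays true, so treat that as an assumption. The function name is hypothetical.

```python
def outcome_for(previous_firing, now_firing):
    """Return the value for the Outcome placeholder, or None if nothing changed."""
    if now_firing and not previous_firing:
        return "ALERTING"
    if previous_firing and not now_firing:
        return "RESOLVED"
    return None


firing = False
sent = []
# Results of four consecutive evaluations of the same rule expression.
for expression_result in (False, True, True, False):
    outcome = outcome_for(firing, expression_result)
    if outcome is not None:
        sent.append(outcome)
    firing = expression_result

print(sent)  # ['ALERTING', 'RESOLVED']
```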
Because both the subject and the body of a message are rendered by the Go template engine, you can customize them further with an if-else block.
{{ if eq .Outcome "ALERTING" }} 📣 Heads up {{ else }} 🧘♂️ Relax {{ end }}
Please note that only if eq .<VAR> "<COMPARISON VALUE>" works. Using if .<VAR> eq "<COMPARISON VALUE>" will not work, and messages will not be sent. Also, the comparison value must be in double quotation marks. Single quotation marks are not supported.
Multiple rule expressions can be combined with the logical operators "and" and "or" (lowercase). Parentheses can be used to group conditions inside complex expressions.
Example:
client.OSKernel == "linux" and Avg(CPUUsagePercent(3)) > 80
This rule will trigger an action only if both conditions are met.
Example:
any(client.Tags, # == "Berlin") and ...
adds a condition to an expression so that it only triggers if Berlin is included in the list of tags.
When looking for an item in a list, the hash sign # is the internal placeholder for the current item while looping over the list.
While client.ConnectionState is available for evaluation, it is not suitable for rules that notify you about disconnected clients. On server stop, all clients are properly disconnected. Because it takes some time for clients to connect, on server start the majority of clients will have client.ConnectionState = disconnected, resulting in many false notifications.
Monitoring the connection state of clients is currently not supported. A proper rule expression for that will be integrated soon.
If you want to activate rules based on the time or weekday, you can use the following conditions:
Date(measurement.Timestamp).Weekday in ["Monday", "Tuesday"]
or
Date(measurement.Timestamp).Hour > 9
Example:
any(client.Tags, # == "Berlin") and
any(measurement.Processes, .CmdLine contains "tetris") and
Date(measurement.Timestamp).Weekday in ["Monday", "Tuesday", "Wednesday", "Thursday","Friday"] and
Date(measurement.Timestamp).Hour >= 9 and
Date(measurement.Timestamp).Hour < 18
This expression evaluates to true if a process tetris is running, but only Monday to Friday between 9am and 6pm.
The expression must be entered all on a single line. The example has line breaks just for better readability.
All time data refers to UTC.
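For readers more familiar with a general-purpose language, the combined example can be read as the Python function below. It is only an illustration of the logic, not how RPort evaluates rules. The function and parameter names are hypothetical, and the window is taken from the description (09:00 inclusive to 18:00 exclusive, in UTC), which is an assumption about the intended boundaries.

```python
from datetime import datetime, timedelta, timezone

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
WORKDAYS = set(DAY_NAMES[:5])


def tetris_during_office_hours(client_tags, process_cmdlines, timestamp):
    """Mirror of the example expression; timestamp must be timezone-aware."""
    ts = timestamp.astimezone(timezone.utc)  # all time data refers to UTC
    return (
        any(tag == "Berlin" for tag in client_tags)            # any(client.Tags, ...)
        and any("tetris" in cmd for cmd in process_cmdlines)   # .CmdLine contains ...
        and DAY_NAMES[ts.weekday()] in WORKDAYS                # Weekday in [...]
        and 9 <= ts.hour < 18                                  # Hour conditions
    )


thursday_10 = datetime(2023, 9, 7, 10, 0, tzinfo=timezone.utc)
saturday_10 = datetime(2023, 9, 9, 10, 0, tzinfo=timezone.utc)
print(tetris_during_office_hours(["Berlin"], ["/usr/games/tetris"], thursday_10))  # True
print(tetris_during_office_hours(["Berlin"], ["/usr/games/tetris"], saturday_10))  # False
```

The conversion to UTC matters for clients in other time zones: 19:00 at UTC+2 is 17:00 UTC and still falls inside the window.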
Field | Type | Description | Example values |
---|---|---|---|
client.ID | string | client ID | b952c2d814bd44088236bd83502dad13 |
client.Name | string | client name as given by RPort | Abby-Castro |
client.Address | string | public IPv4 Address | 138.201.17.238 |
client.Hostname | string | name as given by OS hostname | Abby-Castro |
client.OS | string | Kernel string | Linux Abby-Castro 5.15.0-83-generic #92-Ubuntu SMP Mon Aug 14 09:30:42 UTC 2023 x86_64 GNU/Linux, Microsoft Windows Server 2022 Standard 10.0.20348.1850 Build 20348.1850 Server |
client.OSFullName | string | OS full name | Debian 11.6, Microsoft Windows Server 2022 Standard 10.0.20348.1850 Build 20348.1850 |
client.OSVersion | string | OS version | 11.6, 10.0.20348.1850 Build 20348.1850 |
client.OSArch | string | OS architecture | amd64 |
client.OSFamily | string | OS family | debian, Server |
client.OSKernel | string | OS running kernel | linux, windows |
client.OSVirtualizationSystem | string | Type of virtualization | KVM |
client.OSVirtualizationRole | string | Virtualization role | guest |
client.NumCPUs | int | Number of CPUs | 6 |
client.Tags | list | list of tags | Berlin, London, Datacenter 1 |
client.ipv4 | list | list of IPv4 addresses bound to all network cards | 192.168.7.1 |
client.ipv6 | list | list of IPv6 addresses bound to all network cards | fe80::216:3eff:fec3:3681 |
client.ConnectionState | string | State of client connection | connected, disconnected |
client.LastHeartbeatAt | string | Date and time of last heartbeat | 2023-09-07T12:42:50.802594749Z |
client.UpdatesAvailable | int | Number of updates available | 5 |
client.SecurityUpdatesAvailable | int | Number of security updates available | 5 |
When using client metadata in rule expressions, the field must be written in camel case with a leading lowercase client, as indicated in the table above. Example: client.OSKernel == "linux" and Avg(CPUUsagePercent(3)) > 80
Matching is case-sensitive.
When using client metadata in notification templates, the field must be written in camel case with a leading uppercase .Client, prefixed by a dot and enclosed in curly braces. Example: {{.Client.OSKernel}}
Besides triggering a notification via a template, you can also use the ignore action.
Example:
Rule expression: Date(measurement.Timestamp).Weekday in ["Saturday", "Sunday"]
Action: ignore: [*]
Make sure you put this rule as the first rule of the set. The order matters.
Now you will not get any notifications during the weekend.
The ignore action specifies a list of rule IDs to be ignored. Because wildcards are supported, you can disable all rules based on a condition.
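Why the order matters can be illustrated with a simplified Python model of a rule set. This is a sketch under assumptions, not RPort's implementation: rules are processed top to bottom, an ignore rule only affects rules that come after it, and fnmatch stands in for the wildcard matching. The dictionary layout and the fact names are invented for the example.

```python
from fnmatch import fnmatchcase


def active_rule_ids(rules, is_true):
    """Walk the rule set in order and return the IDs of rules allowed to fire."""
    ignored_patterns = []
    fired = []
    for rule in rules:
        # Skip rules that an earlier ignore action has switched off.
        if any(fnmatchcase(rule["id"], p) for p in ignored_patterns):
            continue
        if not is_true(rule["expr"]):
            continue
        if "ignore" in rule:
            ignored_patterns.extend(rule["ignore"])
        else:
            fired.append(rule["id"])
    return fired


rules = [
    {"id": "weekend", "expr": "is_weekend", "ignore": ["*"]},
    {"id": "high-cpu-usage", "expr": "cpu_high"},
]

saturday = {"is_weekend": True, "cpu_high": True}
monday = {"is_weekend": False, "cpu_high": True}
print(active_rule_ids(rules, saturday.get))  # []
print(active_rule_ids(rules, monday.get))    # ['high-cpu-usage']
```

If the weekend rule were placed last in this model, high-cpu-usage would already have fired before the ignore action took effect.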
Measurements can be accessed by average functions (see above) or directly using the measurement object.
Example: measurement.CPUUsagePercent > 95
or measurement.MemoryUsagePercent > 50.0
Measurement Key | Type |
---|---|
CPUUsagePercent | int |
MemoryUsagePercent | int |
NetLan | bytes, int |
NetWan | bytes, int |
ClientID | string |
The RPort server comes with a built-in SMTP client that dispatches email messages. The SMTP configuration is done through the rportd.conf configuration file.
Slow SMTP servers can fill up internal queues leading to higher memory consumption.
It’s highly recommended to install a local SMTP server such as postfix on the RPort server box. RPort should hand over messages to the local SMTP server, which will then queue them and take care of asynchronous sending, re-sending, and network error handling.
Sending notifications via other media, such as messenger apps, MS Teams, or Slack, is supported. The documentation will be published shortly.