Service Management¶
Assure1 Service Management provides a top-down view into service performance via dynamic, real-time, multi-tenant dashboards. It supports an infinitely tiered service hierarchy in which event-based filters and metrics-based values can be configured for monitoring and reporting of compliance and performance.
If a breach has been detected and the Assure1 Fault Management module is installed, service impacting meta-events can be generated. This, when paired with the highly capable reporting engine, enables real-time service management as well as historical trend reporting. With applications that automate service discovery with integrations to third-party CRM, CMDB and provisioning systems, services can be automatically configured in Assure1 to allow constant monitoring.
What is a Service?¶
An SLM service is a collection of event filters and/or metrics with custom thresholds that monitors an environment and alerts when an issue is detected. The Metric SLM Collector calculates service level objective/agreement compliance and reports this statistic over time. The Event SLM Connector determines the real-time service status. These two applications are typically used in concert.
The Event SLM Connector leverages service hierarchies to determine real-time service impact, so that operators know when a service is currently down and what events are causing the service to be considered as "down". This component works in concert with the Orient Database and Topology engine so that service hierarchies can be created in an automated fashion.
The Metric SLM Collector leverages the Orient Database and looks at service level criteria and, through a weighting system, generates an overall service availability as a new calculated metric. This information can be displayed on a dashboard, in SLM portlets, or on a graph, or used by a thresholding engine.
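The weighting idea can be pictured as a weighted average. The following is a minimal sketch, assuming each criterion reports a compliance percentage and that the overall availability is the weight-normalized mean. The actual formula used by the Metric SLM Collector is not given here, and the names `Criterion` and `weighted_availability` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One service level criterion (hypothetical structure, not an Assure1 type)."""
    name: str
    compliance_pct: float  # assumed: share of time within threshold, 0-100
    weight: float = 1.0


def weighted_availability(criteria):
    """Return the weight-normalized mean compliance, or None if there is no weight."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        return None
    return sum(c.compliance_pct * c.weight for c in criteria) / total_weight


criteria = [
    Criterion("CPU", 100.0, weight=1),
    Criterion("Memory", 90.0, weight=1),
    Criterion("Latency", 50.0, weight=2),
]
availability = weighted_availability(criteria)
print(f"{availability:.1f}")  # prints 72.5
```

Under this assumption, doubling the weight of a poorly performing criterion pulls the overall value down faster than an unweighted mean would.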
SLM Service Management UI¶
The Service Management UI (Configuration -> Service Management) is used for adding, editing and removing the Service Level Monitoring (SLM) Services and the Event Filters and/or Metric Thresholds that are part of the service. Services can be nested under one another to create a hierarchy of services and their respective filters/metrics.
Event-based Services and Filters are used to monitor for certain types of Events and send a custom Meta Event if a threshold is breached. These Services and Filters are also viewable from a hierarchical standpoint and the SLM Engine uses the Orient Database to create representative vertices and the visual links between them. The services defined are used by the Event SLM Engine to determine compliance.
Metric-based Services and Metrics can be composed of multiple different metrics. Each Service and Metric has a customizable threshold and weight, along with custom Service Level Actions (SLAs) that occur when the threshold is crossed. The services defined are used by the Metric SLM Collector to determine compliance.
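Each threshold pairs a comparison operator with a value. The sketch below shows one way such a pair could be evaluated, assuming the comparison symbols map directly onto ordinary numeric comparisons. The function name `is_breached` is hypothetical, and only `<=`, `>=` and `!=` appear in the forms in this guide; the remaining operators are included as an assumption.

```python
import operator

# Comparison symbols mapped to Python operators.
# "<=", ">=" and "!=" appear in this guide; the others are assumed.
COMPARISONS = {
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
    ">=": operator.ge,
    ">": operator.gt,
}


def is_breached(value, comparison, threshold):
    """Return True when `value <comparison> threshold` holds."""
    try:
        compare = COMPARISONS[comparison]
    except KeyError:
        raise ValueError(f"unsupported comparison: {comparison!r}") from None
    return compare(value, threshold)


print(is_breached(85, ">=", 80))  # metric-style threshold: True
print(is_breached(0, "!=", 0))    # event count of zero: False
```

Read this way, a threshold describes the failing condition: the breach occurs when the comparison evaluates to true.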
Creating and Viewing a Service¶
Objectives¶
-
Create a new Self Monitoring Service.
-
Create several SLM Event Filters for the Service.
-
Create several SLM Metrics for the Service.
-
Enable and start the Metric SLM Collector and Event SLM Connector services.
-
View the newly created Service.
Example¶
In this example, a Self Monitoring service is created to monitor the Assure1 server for service compliance and to ensure the server is performing adequately.
Creating the SLM Service, Filters and Metrics¶
-
Navigate to the Service Management UI
-
Click Add -> Service to add a new service.
-
The Service (new) form will appear to the right of the grid. Fill in the form, using the following values:
-
Name: Self Monitoring
-
Parent Service: Root
-
Weight: 1
-
Service User Owner: [Public to All Users In Group]
-
Service Group Owner: [Public to All Groups]
-
Status: Enabled
-
Realtime Properties
-
Threshold Comparison: <=
-
Threshold Value (% of Valid Children): 90
-
Meta Event: Default Service Event
-
-
Compliance Properties
-
Threshold: [Manual]
-
Warning Threshold Comparison: <=
-
Warning Threshold Value (%): 90
-
Critical Threshold Comparison: <=
-
Critical Threshold Value (%): 75
-
Gauge Axis Type: Linear
-
Threshold Poll Time: 300
-
-
-
Click Submit to save the new Service.
- If the Threshold (defined in the Event Properties section of the form) is crossed, a Meta Event is created. The severity of the Meta Event that is sent controls the color of the services in the Service Tree (shown later in this section).
-
Still within the SLM Service Management UI, select the new Self Monitoring Service, then click Add -> SLM Filter.
-
Fill in the form using the following values:
-
Name: Self Monitor - CPU Usage
-
Parent Service: Self Monitoring
-
Threshold
-
Field: EventID - bigint(20) unsigned
-
Metric Function: COUNT
-
Comparison: !=
-
Threshold Value: 0
-
-
Weight: 1
-
Where Clause: Severity=4 AND EventType='CPU High' AND Node='[FQDN of your Assure1 server]'
-
SLM Filter User Owner: [Public to All Users In Group]
-
SLM Filter Group Owner: [Public to All Groups]
-
Status: Enabled
-
-
Click Submit to save the new SLM Filter.
-
The filter is now attached to the Self Monitoring Service, and the Service Threshold.
-
If the Threshold condition defined above evaluates to true, the threshold has been breached.
-
-
Repeat steps 5-7 above, creating 5 new SLM Filters with the following values:
-
Inbound Bandwidth Usage
-
Name: Self Monitor - Inbound Bandwidth Usage
-
Parent Service: Self Monitoring
-
Threshold
-
Field: EventID - bigint(20) unsigned
-
Metric Function: COUNT
-
Comparison: !=
-
Threshold Value: 0
-
-
Weight: 1
-
Where Clause: Severity=4 AND EventType='Inbound Bandwidth High' AND Node='[FQDN of your Assure1 server]'
-
SLM Filter User Owner: [Public to All Users In Group]
-
SLM Filter Group Owner: [Public to All Groups]
-
Status: Enabled
-
-
Outbound Bandwidth Usage
-
Name: Self Monitor - Outbound Bandwidth Usage
-
Parent Service: Self Monitoring
-
Threshold
-
Field: EventID - bigint(20) unsigned
-
Metric Function: COUNT
-
Comparison: !=
-
Threshold Value: 0
-
-
Weight: 1
-
Where Clause: Severity=4 AND EventType='Outbound Bandwidth High' AND Node='[FQDN of your Assure1 server]'
-
SLM Filter User Owner: [Public to All Users In Group]
-
SLM Filter Group Owner: [Public to All Groups]
-
Status: Enabled
-
-
Memory Usage
-
Name: Self Monitor - Memory Usage
-
Parent Service: Self Monitoring
-
Threshold
-
Field: EventID - bigint(20) unsigned
-
Metric Function: COUNT
-
Comparison: !=
-
Threshold Value: 0
-
-
Weight: 1
-
Where Clause: Severity=4 AND EventType='Memory High' AND Node='[FQDN of your Assure1 server]'
-
SLM Filter User Owner: [Public to All Users In Group]
-
SLM Filter Group Owner: [Public to All Groups]
-
Status: Enabled
-
-
Disk Usage
-
Name: Self Monitor - Disk Usage
-
Parent Service: Self Monitoring
-
Threshold
-
Field: EventID - bigint(20) unsigned
-
Metric Function: COUNT
-
Comparison: !=
-
Threshold Value: 0
-
-
Weight: 1
-
Where Clause: Severity=4 AND EventType='Disk High' AND Node='[FQDN of your Assure1 server]'
-
SLM Filter User Owner: [Public to All Users In Group]
-
SLM Filter Group Owner: [Public to All Groups]
-
Status: Enabled
-
-
Latency
-
Name: Self Monitor - Latency
-
Parent Service: Self Monitoring
-
Threshold
-
Field: EventID - bigint(20) unsigned
-
Metric Function: COUNT
-
Comparison: !=
-
Threshold Value: 0
-
-
Weight: 1
-
Where Clause: Severity=4 AND EventType='Latency High' AND Node='[FQDN of your Assure1 server]'
-
SLM Filter User Owner: [Public to All Users In Group]
-
SLM Filter Group Owner: [Public to All Groups]
-
Status: Enabled
-
-
-
With the SLM Filters added, it is now time to create some SLM Metrics for the Service. Select the new Self Monitoring Service, then click Add -> SLM Metric.
-
Fill in the form using the following values:
-
Name: Self Monitor - CPU Usage Metric
-
Parent Service: Self Monitoring
-
Metric
-
Device: [Your Primary Presentation Server]
-
Metric Type: CPU Utilization
-
Metric Instance: Device
-
-
Threshold
-
Value Type: Utilization (%)
-
Threshold: [Manual]
-
Comparison: >=
-
Threshold Value: 80
-
-
Weight: 1
-
SLM Metric User Owner: [Public to All Users In Group]
-
SLM Metric Group Owner: [Public to All Groups]
-
Status: Enabled
-
-
Click Submit to save the new SLM Metric.
Note
SLM Metrics are calculated based on the parent service polltime. The value that determines the status of the service is the average value of the metric over the underlying metric's or service's polltime period, whichever is greater.
-
Repeat steps 9 - 11 above, creating 5 new SLM Metrics with the following values:
-
Inbound Bandwidth
-
Name: Self Monitor - Inbound Bandwidth Usage Metric
-
Parent Service: Self Monitoring
-
Metric
-
Device: [Your Primary Presentation Server]
-
Metric Type: Interface Inbound Bandwidth
-
Metric Instance: eth0
-
-
Threshold
-
Value Type: Utilization (%)
-
Threshold: [Manual]
-
Comparison: >=
-
Threshold Value: 80
-
-
Weight: 1
-
SLM Metric User Owner: [Public to All Users In Group]
-
SLM Metric Group Owner: [Public to All Groups]
-
Status: Enabled
-
-
Outbound Bandwidth
-
Name: Self Monitor - Outbound Bandwidth Usage Metric
-
Parent Service: Self Monitoring
-
Metric
-
Device: [Your Primary Presentation Server]
-
Metric Type: Interface Outbound Bandwidth
-
Metric Instance: eth0
-
-
Threshold
-
Value Type: Utilization (%)
-
Threshold: [Manual]
-
Comparison: >=
-
Threshold Value: 80
-
-
Weight: 1
-
SLM Metric User Owner: [Public to All Users In Group]
-
SLM Metric Group Owner: [Public to All Groups]
-
Status: Enabled
-
-
Memory Usage
-
Name: Self Monitor - Memory Usage Metric
-
Parent Service: Self Monitoring
-
Metric
-
Device: [Your Primary Presentation Server]
-
Metric Type: Memory Used
-
Metric Instance: Device
-
-
Threshold
-
Value Type: Utilization (%)
-
Threshold: [Manual]
-
Comparison: >=
-
Threshold Value: 80
-
-
Weight: 1
-
SLM Metric User Owner: [Public to All Users In Group]
-
SLM Metric Group Owner: [Public to All Groups]
-
Status: Enabled
-
-
Disk Usage
-
Name: Self Monitor - Disk Usage Metric
-
Parent Service: Self Monitoring
-
Metric
-
Device: [Your Primary Presentation Server]
-
Metric Type: Disk Used
-
Metric Instance: /
-
-
Threshold
-
Value Type: Utilization (%)
-
Threshold: [Manual]
-
Comparison: >=
-
Threshold Value: 80
-
-
Weight: 1
-
SLM Metric User Owner: [Public to All Users In Group]
-
SLM Metric Group Owner: [Public to All Groups]
-
Status: Enabled
-
-
Latency
-
Name: Self Monitor - Latency Metric
-
Parent Service: Self Monitoring
-
Metric
-
Device: [Your Primary Presentation Server]
-
Metric Type: Latency
-
Metric Instance: Device
-
-
Threshold
-
Value Type: Value
-
Threshold: [Manual]
-
Comparison: >=
-
Threshold Value: 0.2
-
-
Weight: 1
-
SLM Metric User Owner: [Public to All Users In Group]
-
SLM Metric Group Owner: [Public to All Groups]
-
Status: Enabled
-
-
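The six SLM Filters in this example differ only in their Name and the EventType in the Where Clause. As a cross-check while filling in the forms, the values can be generated from a small table. This is a minimal sketch that only builds dictionaries of form values; it does not call any Assure1 API, the function name `build_filter` is hypothetical, and `assure1.example.com` is a placeholder for the FQDN of your server.

```python
# (Name suffix, EventType) pairs taken from the example filters.
FILTERS = [
    ("CPU Usage", "CPU High"),
    ("Inbound Bandwidth Usage", "Inbound Bandwidth High"),
    ("Outbound Bandwidth Usage", "Outbound Bandwidth High"),
    ("Memory Usage", "Memory High"),
    ("Disk Usage", "Disk High"),
    ("Latency", "Latency High"),
]


def build_filter(suffix, event_type, node):
    """Return the form values for one SLM Filter as a plain dictionary."""
    where = f"Severity=4 AND EventType='{event_type}' AND Node='{node}'"
    return {
        "Name": f"Self Monitor - {suffix}",
        "Parent Service": "Self Monitoring",
        "Metric Function": "COUNT",
        "Comparison": "!=",
        "Threshold Value": 0,
        "Weight": 1,
        "Where Clause": where,
        "Status": "Enabled",
    }


node = "assure1.example.com"  # placeholder FQDN, replace with your own
filters = [build_filter(suffix, event_type, node) for suffix, event_type in FILTERS]
for entry in filters:
    print(entry["Name"], "->", entry["Where Clause"])
```

Keeping the varying parts in one table makes it easier to spot a mistyped EventType before it silently produces a filter that never matches.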
Starting the SLM Collector Services¶
With the new Self Monitoring SLM Service created, it is now time to enable and start both SLM services from the Broker Control -> Services UI.
-
Navigate to the Services UI.
-
Enable and start the Metric SLM Collector and Event SLM Connector Services.
Viewing your Service (Services Portal)¶
-
Navigate to the Services navigation menu and click on the Self Monitoring Service.
-
This will open the Services Portal for the Service.
-
You should see something like the image above.
-
The Time Period buttons[1] allow you to view service data from the last day, last week, last month and last year respectively.
-
The Service Tree Portlet[2] displays your parent service, and any child services, SLM Filters and SLM Metrics associated with the service, along with real-time event data and Metric Compliance data.
-
Clicking the wrench icon[3] on any portlet will open the Configure Portlet form to the right of the grid (see image below), allowing you to configure that individual portlet from within the portal.
-
The Service Events portlet[4] displays any Meta Events triggered by this service as a result of a service threshold breach.
-
NOTE: The Metric SLM Collector calculation is based on the Service polltime. Therefore it may take 5 minutes or more before all of the data becomes visible from the Services Portal.
-
With each 30-minute poll cycle, the data in the compliance graph fills out further (see image below):
-
Parent and child services can be created from the Services UI, with their own SLM Filters and Metrics, creating a tiered hierarchy of services with one overall parent service at the top. The Event SLM Connector leverages service hierarchies to determine real-time service impact.
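The earlier note states that an SLM Metric is evaluated on the average value over the metric's or the service's polltime period, whichever is greater. The sketch below illustrates that rule under stated assumptions: polltimes and timestamps are in seconds, samples are `(timestamp, value)` pairs, and the function names are hypothetical rather than part of Assure1.

```python
def evaluation_window(metric_polltime, service_polltime):
    """Return the averaging window: the greater of the two polltime periods."""
    return max(metric_polltime, service_polltime)


def windowed_average(samples, now, window):
    """Average the sample values whose timestamp lies in (now - window, now]."""
    recent = [value for timestamp, value in samples if now - window < timestamp <= now]
    if not recent:
        return None  # no data yet, for example right after the collector starts
    return sum(recent) / len(recent)


# CPU utilization polled every 60 seconds, service poll time of 300 seconds.
samples = [(0, 10.0), (60, 70.0), (120, 80.0), (180, 90.0), (240, 85.0), (300, 75.0)]
window = evaluation_window(metric_polltime=60, service_polltime=300)
average = windowed_average(samples, now=300, window=window)
print(window, average)  # prints 300 80.0
```

With the example threshold of `>= 80`, this averaged value would count as a breach even though the most recent single sample (75.0) is below the threshold, which is why a short spike or dip does not immediately change the service status.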
Creating a Service¶
Following the above example, create your own SLM Service within Assure1, along with a number of SLM Filters and Metrics for that Service.
Once the service has been created, you can simulate an outage by blocking pings with the iptables firewall. Make this change on the primary presentation server.
-
Via the command line, edit the /etc/sysconfig/iptables configuration file using your favorite editor.
-
To block pings, add the following line:
-A INPUT -p icmp --icmp-type echo-request -j DROP
-
Restart iptables with this command:
/etc/init.d/iptables restart
-
Verify the server can no longer be pinged.
-
Wait several poll cycles for the ping poller to report the server is no longer pinging. There will be two processes that then occur at the same time:
-
The device no longer pinging will be seen by the Default Thresholding Engine, which will create a Packet Loss threshold violation event in the event list. This event will be caught by the Event SLM Connector, and will decrease the Service Status for the event side of the SLM Service. As the status will then be below the SLM Service threshold, the selected Meta Event will be used to create a violation event.
-
While the above is happening on the event side, the metric side is also being processed. After a few poll cycles, when the Metric SLM Collector polls the database for the metric data needed, the collector will determine that the packet loss metrics are in violation of the SLM metric configuration. This will decrease the calculated Service Status for the metric side of the SLM Service, and will start affecting the overall SLM Service compliance value.
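When editing the iptables configuration file for the steps above, rule order matters: iptables evaluates rules top to bottom, so the DROP rule must come before any rule that accepts ICMP traffic. The sketch below shows one way to place the line, assuming the file uses the usual `iptables-save` layout with a `*filter` table terminated by `COMMIT`. The sample content and the function name `add_block_ping_rule` are illustrative assumptions, and the function only transforms text; it does not write to `/etc/sysconfig/iptables`.

```python
BLOCK_PING_RULE = "-A INPUT -p icmp --icmp-type echo-request -j DROP"

# Assumed example of a typical configuration file; yours will differ.
SAMPLE = """\
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -j REJECT --reject-with icmp-host-prohibited
COMMIT
"""


def add_block_ping_rule(config_text):
    """Return config_text with the ping-blocking rule ahead of other INPUT rules."""
    lines = config_text.splitlines()
    if BLOCK_PING_RULE in (line.strip() for line in lines):
        return config_text  # already present, nothing to do
    in_filter = False
    for index, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("*"):
            in_filter = stripped == "*filter"  # only touch the filter table
        elif in_filter and (stripped.startswith("-A INPUT") or stripped == "COMMIT"):
            lines.insert(index, BLOCK_PING_RULE)
            break
    else:
        raise ValueError("no *filter table with an INPUT rule or COMMIT line found")
    return "\n".join(lines) + "\n"


updated = add_block_ping_rule(SAMPLE)
print(updated)
```

To end the simulated outage, remove the added line and restart iptables again.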