Monitoring
DDC supports different monitoring technologies and generates a rich set of metrics. You can use the default monitoring capabilities provided by Datumize Zentral or integrate with other monitoring and alerting systems.
Overview
Each running product and component generates out-of-the-box metrics: DDC creates metrics for the engine itself and for each relevant subsystem, including performance metrics and metrics for each deployed pipeline.
The Datumize Agent installed on each machine is also responsible for providing metrics about the underlying operating system and host machine. Metrics are provided as a time series, marked with a UTC timestamp and a set of key-value pairs representing each metric and its value. The default monitoring system is provided by Datumize Zentral and includes not only time-series storage for all metrics, but also customizable dashboards and alerts.
Third-party monitoring systems can be integrated as well: metrics can be stored in a wide range of storage systems for further analysis with many open-source and enterprise tools, and are also available on premises through the JMX monitoring interface.
Types of Metrics
DDC provides monitoring metrics to measure different parameters: for example, the number of records processed, the execution time of a component, or the number of records held in a pipeline stream. Metrics support tags to categorize measurements. Metrics are classified into the types described in the table below.
Metric Type | Description | When to use | Example |
---|---|---|---|
Gauge | The simplest metric type: it just returns the last updated value. Internally, the value may not be synchronized between the thread that reports it and the consumer thread, but the gauge guarantees there will be no data corruption and eventual (though not instant) synchronization. The value reported is the last updated value. | When you only need the most recent value of a measurement. | Free space on a disk. You don't need to store and send the exact value every second; sending the last updated value when the reporter wakes up is fine. |
Meter | Measures the rate of events over time (e.g., "requests per second"). In addition to the mean rate, meters also track 1-, 5-, and 15-minute moving averages. The value reported depends on the frequency of the reporter; by default, reporters run every 60 s. | When you want to track the number of requests, records, etc. per time period. Internally, meters measure the event rate in several different ways (mean rate plus moving averages). | Requests per second. You are interested in the current peak of requests, and in past trends to analyze potential issues. |
Counter | A gauge for an atomic long value that you can increment or decrement; for example, a more efficient way of measuring the pending jobs in a queue. The value reported is the current state of the count. | Keeping track of the current capacity of a queue, or the pcaps pending processing. | Records in a stream. Every time this counter is measured, it returns the number of records pending processing in a pipeline stream. |
Histogram | Measures the statistical distribution of values in a stream of data: minimum, maximum, mean, plus the median and the 75th, 90th, 95th, 98th, 99th, and 99.9th percentiles. Internally, it uses an Exponentially Decaying Reservoir, which produces quantiles representative of (roughly) the last five minutes of data by using a forward-decaying priority reservoir with an exponential weighting towards newer data. Unlike a uniform reservoir, it represents recent data, allowing you to know very quickly if the distribution of the data has changed. The value reported is the mean of the distribution. | Measuring the size of requests/responses over time, response time over time, etc. | Response size. This histogram measures the size of responses, in bytes, for web services processed in a pipeline. |
Timer | Measures both the rate at which a particular piece of code is called and the distribution of its duration: the time it takes to process each request in nanoseconds, plus a rate in requests per second. The value reported is the mean of the distribution. | Monitoring the execution time of a piece of code. | Pipeline component execution. Measures, in great detail, the execution time of each component of a pipeline. |
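The descriptions above closely mirror the metric types found in the Dropwizard Metrics library. As a rough illustration of how the simpler types behave, here is a self-contained sketch in plain Java (no DDC or third-party APIs; all names are hypothetical):

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

public class MetricTypesSketch {
    // Counter: an atomic long you can increment or decrement; reports its last state.
    static final AtomicLong pendingRecords = new AtomicLong();

    // Gauge: simply returns the most recent value when the reporter asks for it.
    static final Supplier<Long> freeDiskBytes = () -> new java.io.File(".").getFreeSpace();

    // Meter (simplified): mean rate = events / elapsed seconds. Real meters also
    // keep 1-, 5- and 15-minute exponentially weighted moving averages.
    static long meterCount = 0;
    static final long meterStart = System.nanoTime();

    static void markRequest() { meterCount++; }

    static double meanRatePerSecond() {
        double elapsedSeconds = (System.nanoTime() - meterStart) / 1e9;
        return elapsedSeconds > 0 ? meterCount / elapsedSeconds : 0.0;
    }

    public static void main(String[] args) {
        pendingRecords.incrementAndGet();
        pendingRecords.incrementAndGet();
        pendingRecords.decrementAndGet();
        markRequest();
        System.out.println("counter=" + pendingRecords.get());
        System.out.println("gauge (free bytes >= 0): " + (freeDiskBytes.get() >= 0));
        System.out.println("meter rate >= 0: " + (meanRatePerSecond() >= 0));
    }
}
```

Note how the counter reports a state while the meter reports a rate; histograms and timers add a reservoir of samples on top of this idea.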
Reporters
A Reporter is responsible for capturing metrics and publishing them. DDC supports different types of reporters:
- JMXReporter: reports metrics through JMX interface.
- RemoteReporter: reports metrics to external storage systems such as InfluxDB.
The JMX Reporter can be enabled, and fine-tuning of the JMX configuration is done through the JVM Policy, following these guidelines.
The Remote Reporter is quite flexible. Datumize Zentral provides an out-of-the-box time-series database to store reported metrics. The Remote Reporter supports the following configuration parameters.
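With the JMXReporter enabled, metrics are exposed as MBeans and can be inspected with any JMX client (jconsole, VisualVM) or programmatically. The sketch below just lists whatever is registered on the platform MBean server; the domain under which DDC registers its metrics is deployment-specific and not assumed here:

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class JmxMetricsProbe {
    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // List every MBean currently registered. With the JMXReporter enabled,
        // DDC metrics would appear here alongside the standard java.lang MBeans.
        Set<ObjectName> names = server.queryNames(null, null);
        for (ObjectName name : names) {
            System.out.println(name);
        }
    }
}
```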
Parameter | Description | Type | Default value |
---|---|---|---|
enabled | Enable remote reporting. | Boolean | false |
system-metrics-frequency | Frequency to collect (not send) system metrics. | Duration | 1 minute |
channels | List of metric channels. Must not be empty. | List<RemoteChannel> | |
Multiple reporters can be configured. Each reporter channel supports the following configuration.
Parameter | Description | Type | Default value |
---|---|---|---|
enabled | Enable this channel. | Boolean | true |
reporter-class | Class used to report metrics. Must extend RemoteReporter. | RemoteReporter | InfluxDBReporter |
endpoint | Endpoint for the remote monitoring system. | String | |
database | Database for the remote monitoring system. | String | |
user | User for the remote monitoring system. | String | |
password | Password for the remote monitoring system. | String | |
frequency | Monitoring frequency. | Duration | |
timeout | Timeout for the remote monitoring system. | Duration | |
retries | Max number of retries before aborting the request. | Integer (>= 0) | |
tag-key | Key of the tag this channel will send. | String | |
tag-value | Value of the tag this channel will send. | String | |
§ Preconfigured by default to send authenticated metrics to the Datumize monitoring platform.
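Putting the parameters above together, a remote reporter channel might be declared along these lines. This is a hypothetical YAML-style sketch: the actual configuration file format, key names, and endpoint depend on your DDC deployment.

```yaml
remote-reporting:
  enabled: true
  system-metrics-frequency: 1m
  channels:
    - enabled: true
      reporter-class: InfluxDBReporter
      endpoint: http://influxdb.example.com:8086   # hypothetical endpoint
      database: ddc_metrics                        # hypothetical database name
      user: metrics_user
      password: secret
      frequency: 60s
      timeout: 10s
      retries: 3
      tag-key: env
      tag-value: production
```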
Tagged Metrics
A metric can be tagged with multiple tags at creation time, and also with dynamic tags at execution time. These tags are used to route the metric to different channels and, with some reporters, to send extra information (for example, InfluxDBReporter sets an InfluxDB tag for each tag of the metric). Tags set at creation time are ALWAYS sent.
Dynamic tags create a new metric for each combination of dynamic tag values passed; use them, for example, to measure the number of requests PER SERVER. Internally, DDC sets some implicit tags on all metrics, such as the DDC instance (instance name), pipeline (flow id), cmp (component or stream id), and others. All system metrics are tagged with type = systemMetric.
All metrics defined within DDC will have the following implied tags:
- prod: product name = ddc
- cust: customer name
- inst: instance name
In addition, all metrics declared within a component (steps and streams) will have implicit the following tags:
- flow: id of the flow to which the component belongs
- cmp: id of the component
Finally, system metrics are tagged with:
- type = systemMetric
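As an illustration of how these tags reach a storage backend, a tagged metric sent by an InfluxDB-style reporter could be serialized into InfluxDB line protocol roughly as below. This is a sketch: the metric name and tag values are hypothetical, and DDC's actual serialization may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class LineProtocolSketch {
    /** Builds an InfluxDB line-protocol point: measurement,tags field=value timestamp. */
    static String toLineProtocol(String measurement, Map<String, String> tags,
                                 double value, long epochNanos) {
        String tagPart = tags.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(","));
        return measurement + "," + tagPart + " value=" + value + " " + epochNanos;
    }

    public static void main(String[] args) {
        // Implied tags as described above: product, customer, instance, flow, component.
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("prod", "ddc");
        tags.put("cust", "acme");        // hypothetical customer name
        tags.put("inst", "ddc-node-1");  // hypothetical instance name
        tags.put("flow", "capture-flow");       // hypothetical flow id
        tags.put("cmp", "http-processor");      // hypothetical component id
        // "records_rate" is a hypothetical metric name used only for illustration.
        System.out.println(toLineProtocol("records_rate", tags, 125.0, 1700000000000000000L));
    }
}
```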
Pipeline Metrics
The table below describes the supported metrics for pipeline components.
Component Type | Component Name | Metric Name | Metric Type | Description |
---|---|---|---|---|
All | All | | timer | Execution time of a single iteration of the component. |
| | | timer | Execution time per record. |
| | | meter | Rate of records processed. |
| | records_error | meter | Rate of records processed with error. |
| | | meter | Rate of records that have been retried. |
| | | meter | Rate of records that have been discarded. |
Source | All | read_rate | meter | Rate of useful bytes read. |
| | raw_read_rate | meter | Rate of raw bytes read (read_rate reports the number of useful bytes read). |
| | read_time | timer | Read time per batch. |
| TCP Processor | | meter | Rate of communications that could not be parsed in the transport layer. |
| HTTP Processor | | meter | Rate of communications that do not match the filter. |
| | | meter | Rate of communications that could not be parsed in the application layer. |
| | | meter | Rate of lost responses (requests without responses). |
Sink | All | | meter | Rate of useful bytes written. |
| | raw_write_rate | meter | Rate of raw bytes written. |
| | write_time | timer | Write time per batch. |
| S3 Sink | async_records_error | meter | In asynchronous mode, rate of sent records that have been rejected by Kafka. For synchronous mode, this value is represented in the records_error common metric. |
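The per-component timing and rate metrics above can be pictured with a small sketch of how a pipeline step might be instrumented. This is plain Java for illustration only; DDC's internal instrumentation API is not shown, and the names here are hypothetical:

```java
public class ComponentInstrumentationSketch {
    static long iterations = 0;
    static long recordsProcessed = 0;
    static long totalExecNanos = 0;

    /** Runs one component iteration over a batch, recording timing and record counts. */
    static void runIteration(int batchSize) {
        long start = System.nanoTime();
        for (int i = 0; i < batchSize; i++) {
            recordsProcessed++;                      // would feed the records-processed meter
        }
        totalExecNanos += System.nanoTime() - start; // would feed the iteration timer
        iterations++;
    }

    /** Execution time of a single iteration (mean, in nanoseconds). */
    static double meanIterationNanos() {
        return iterations == 0 ? 0.0 : (double) totalExecNanos / iterations;
    }

    /** Execution time per record (mean, in nanoseconds). */
    static double meanNanosPerRecord() {
        return recordsProcessed == 0 ? 0.0 : (double) totalExecNanos / recordsProcessed;
    }

    public static void main(String[] args) {
        runIteration(100);
        runIteration(50);
        System.out.println("iterations=" + iterations);
        System.out.println("records=" + recordsProcessed);
        System.out.println("mean iteration time (ns) >= 0: " + (meanIterationNanos() >= 0));
    }
}
```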
The table below describes the supported metrics for pipeline streams.
Stream Type | Metric Name | Metric Type | Description |
---|---|---|---|
Any | | counter | Usage of the stream: number of records currently stored in the stream. |
Dashboards and Alerts
Datumize Dashboards and Alerts are based on Grafana. You may refer to its documentation for advanced dashboard configuration.
Datumize provides several out-of-the-box dashboards for visualizing the health of the host machines, the data streams, and the processing performed by DDC. Dashboards are important for understanding performance: they can help identify issues, inspire optimizations, and serve as a visual representation of the project's health.
Predefined Dashboards
Zentral works with Grafana and supports several metrics out of the box, briefly explained below. These dashboards can be edited, cloned, or complemented with new dashboards through the Dashboard resource editor.
Machine Health
CPU and RAM load are displayed as a percentage of total utilization; Java memory is displayed as both free and utilized metrics.
Disk Utilization
Disk usage is important for tracking the capacity of available processing storage. It is displayed as NFS and total used space.
Records
Records can be displayed as the total volume of near-real-time processed data records, or as the total of aggregated OK records over time.
Pcap Records
If Pcap records are used, they are also displayed as a time series of processed pcaps and as a daily total.
Time
Time metrics show records waiting to be written and block time. These are important for understanding whether a bottleneck is forming between record ingestion and processing.
Total Errors
Tracking the number of processing errors over time is also included, both as error records and as DDC processing errors.