DDC supports different monitoring technologies and generates a useful set of metrics. You can use the default monitoring capabilities provided in Datumize Zentral or integrate with other monitoring and alerting systems.

Overview

Each running product and component generates out-of-the-box metrics: DDC creates metrics for the engine itself and for each relevant subsystem, including performance metrics and metrics for each deployed pipeline.

The Datumize Agent installed on each machine is also responsible for providing metrics about the underlying operating system and host machine. Metrics are provided as a time series, marked with a UTC timestamp and a set of key-value pairs representing each metric and its value. The default monitoring system is provided by Datumize Zentral and includes not only time-series storage for all metrics, but also customizable dashboards and alerts.

Other third-party monitoring systems can be integrated as well: metrics can be stored in a wide range of storage systems for further analysis with many open-source and enterprise tools, and they are also available on-premises through the JMX monitoring system.

Type of Metrics

DDC provides monitoring metrics to measure different parameters, e.g. the number of records processed, the execution time of a component, or the number of records contained in a pipeline stream. Metrics support tags to categorize the measurements. Metrics are classified into different types, as described below.

Gauge

Description: A gauge is the simplest metric type: it just returns the last updated value. Internally, the value may not be synchronized between the thread that reports the value and the consumer thread; the gauge guarantees there will be no data corruption and eventual synchronization, but it does not guarantee instant synchronization.

When to use: When you don't need to store and send the exact value every second: sending the last updated value when the reporter wakes up is fine.

Example: Free space on a disk.

Meter

Description: A meter measures the rate of events over time (e.g., “requests per second”). In addition to the mean rate, meters also track 1-, 5-, and 15-minute moving averages. The value reported depends on the frequency of the reporter:

  • If frequency < 60 s: 1-minute rate
  • If frequency < 5 min: 5-minute rate
  • Otherwise: 15-minute rate

By default, the frequency of the reporters is 60 s.

When to use: When you want to track the number of requests per time period, the number of records, etc. Internally, meters measure the rate of events in a few different ways:

  • The mean rate is the average rate of events over your application’s entire lifetime: the total number of events divided by the number of seconds the process has been running.
  • The exponentially-weighted moving average rates (1, 5, and 15 minutes) offer a sense of recency.

Example: Requests per second. Useful when you are interested in the current peak of requests, and in past trends to analyze potential issues.
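The frequency thresholds above can be sketched as follows. This is a minimal illustration only; `rateFor` and its return strings are hypothetical names, not the DDC API:

```java
import java.time.Duration;

// Sketch of how a reporter could pick which meter rate to publish,
// following the frequency thresholds described above.
public class RateSelection {

    static String rateFor(Duration reporterFrequency) {
        if (reporterFrequency.compareTo(Duration.ofSeconds(60)) < 0) {
            return "1-minute rate";      // reporter runs faster than 60 s
        } else if (reporterFrequency.compareTo(Duration.ofMinutes(5)) < 0) {
            return "5-minute rate";      // between 60 s and 5 min
        }
        return "15-minute rate";         // 5 min or slower
    }

    public static void main(String[] args) {
        System.out.println(rateFor(Duration.ofSeconds(30)));  // 1-minute rate
        System.out.println(rateFor(Duration.ofMinutes(2)));   // 5-minute rate
        System.out.println(rateFor(Duration.ofMinutes(10)));  // 15-minute rate
    }
}
```

Note that with the default 60 s reporter frequency, the 5-minute rate is the one reported, since 60 s is not strictly less than 60 s.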

Counter

Description: A counter is essentially a gauge over an atomic long: you can increment or decrement its value. It offers, for example, a more efficient way of measuring the pending jobs in a queue. The value reported is the current state of the count.

When to use: To keep track of the current capacity of a queue, or the pcaps that are pending processing.

Example: Records in a stream. Every time this counter is measured, it returns the number of records pending processing in a pipeline stream.

Histogram

Description: A histogram measures the statistical distribution of values in a stream of data. In addition to minimum, maximum, mean, etc., it also measures the median and the 75th, 90th, 95th, 98th, 99th, and 99.9th percentiles. Internally, it uses an exponentially decaying reservoir, which produces quantiles representative of (roughly) the last five minutes of data. It does so by using a forward-decaying priority reservoir with an exponential weighting towards newer data. Unlike a uniform reservoir, an exponentially decaying reservoir represents recent data, allowing you to know very quickly whether the distribution of the data has changed. The value reported is the mean of the distribution.

When to use: To measure the size of requests/responses over time, response time over time, etc.

Example: Response size. This histogram measures the size of responses in bytes, for web services being processed in a pipeline.


Timer

Description: A timer measures both the rate at which a particular piece of code is called and the distribution of its duration. For example, a timer can measure the time it takes to process each request in nanoseconds and provide a rate of requests per second. The value reported is the mean of the distribution.

When to use: To monitor the execution time of a piece of code.

Example: Pipeline component execution. Measures in great detail the execution time at each component of a pipeline.
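To make the gauge and counter semantics concrete, here is a minimal self-contained sketch. Class and method names are illustrative only, not the actual DDC API:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

public class MetricSketch {

    // A gauge wraps a value supplier; the reporter reads the latest value lazily.
    static class Gauge<T> {
        private final Supplier<T> supplier;
        Gauge(Supplier<T> supplier) { this.supplier = supplier; }
        T getValue() { return supplier.get(); }
    }

    // A counter is a gauge over an atomic long that supports increment/decrement.
    static class Counter {
        private final AtomicLong count = new AtomicLong();
        void inc() { count.incrementAndGet(); }
        void dec() { count.decrementAndGet(); }
        long getCount() { return count.get(); }
    }

    public static void main(String[] args) {
        // Gauge example: free space on a disk, read only when the reporter asks.
        Gauge<Long> freeDisk = new Gauge<>(() -> new java.io.File("/").getFreeSpace());

        // Counter example: records pending in a pipeline stream.
        Counter streamUsedRecords = new Counter();
        streamUsedRecords.inc();
        streamUsedRecords.inc();
        streamUsedRecords.dec();

        System.out.println("free bytes: " + freeDisk.getValue());
        System.out.println("records in stream: " + streamUsedRecords.getCount()); // 1
    }
}
```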


Reporters

A Reporter is responsible for capturing and publishing metrics. DDC supports different types of reporters:

  • JMXReporter: reports metrics through the JMX interface.
  • RemoteReporter: reports metrics to external storage systems such as InfluxDB.

The JMX Reporter can be enabled, and fine-tuning of the JMX configuration is actually done through the JVM Policy, following these guidelines.
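For reference, remote JMX access on a JVM is typically enabled with the standard com.sun.management system properties; the exact values belong in the JVM Policy mentioned above, and the port below is only an example:

```
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=9010
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
```

Disabling authentication and SSL, as shown here, is suitable only for local testing.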

The Remote Reporter is quite flexible. Datumize Zentral provides an out-of-the-box time-series database to store reported metrics.

| Parameter | Description | Type | Default value |
|---|---|---|---|
| enabled | Enable remote reporting. | Boolean | false |
| system-metrics-frequency | Frequency to collect (not send) system metrics. | Duration | 1 minute |
| channels | List of metric channels. Must not be empty. | List<RemoteChannel> | InfluxDBReporter |

Multiple reporters can be configured. Each reporter supports the following configuration.

| Parameter | Description | Type | Default value |
|---|---|---|---|
| enabled | Enable this channel. | Boolean | true |
| reporter-class | Class used to report metrics. Must extend RemoteReporter. | RemoteReporter | InfluxDBReporter |
| endpoint | Endpoint for remote monitoring system. | String | § |
| database | Database for remote monitoring system. | String | § |
| user | User for remote monitoring system. | String | § |
| password | Password for remote monitoring system. | String | § |
| frequency | Monitoring frequency. | Duration | 1 minute |
| timeout | Timeout of remote monitoring system. | Duration | 1 second |
| retries | Max number of retries before aborting the request. | Integer (>=0) | 5 |
| tag-key | Key of the tag this channel will send. | String | null |
| tag-value | Value of the tag this channel will send. | String | null |

§ Preconfigured by default to send authenticated metrics to the Datumize monitoring platform.
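As an illustration only, a remote reporter channel using the parameters above might look like the following YAML-style sketch. The parameter names come from the tables above, but the top-level key, endpoint, and database values are assumptions and may differ in your deployment:

```yaml
monitoring:                    # hypothetical top-level key
  enabled: true
  system-metrics-frequency: 1m
  channels:
    - enabled: true
      reporter-class: InfluxDBReporter
      endpoint: https://influx.example.com:8086   # hypothetical endpoint
      database: ddc_metrics                        # hypothetical database name
      user: ddc
      password: "********"
      frequency: 1m
      timeout: 1s
      retries: 5
      tag-key: env             # optional: tag sent by this channel
      tag-value: production
```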


Tagged Metrics

A metric can be tagged with multiple tags at creation time, and also with dynamic tags at execution time. These tags are used to route the metric to different channels and, with some reporters, to send extra information (for example, InfluxDBReporter sets an InfluxDB tag for each tag of the metric). Tags set at creation time are ALWAYS sent.

Dynamic tags create a new metric for each combination of dynamic tag values passed: use them, for example, to measure the number of requests PER SERVER. Internally, DDC also sets implicit tags on all metrics, as follows.

All metrics defined within DDC will have the following implied tags:

  • prod: product name = ddc
  • cust: customer name
  • inst: instance name

In addition, all metrics declared within a component (steps and streams) will implicitly have the following tags:

  • flow: id of the flow to which the component belongs
  • cmp: id of the component

Finally, system metrics are tagged with:

  • type = systemMetric
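The fan-out behavior of dynamic tags can be sketched as follows: each distinct combination of tag values yields its own metric series. This is a toy illustration, not the DDC implementation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class TaggedMetrics {

    // One counter per metric name + tag combination, created on demand.
    private final Map<String, AtomicLong> series = new ConcurrentHashMap<>();

    void mark(String metric, Map<String, String> tags) {
        String key = metric + tags.toString();
        series.computeIfAbsent(key, k -> new AtomicLong()).incrementAndGet();
    }

    int seriesCount() { return series.size(); }

    public static void main(String[] args) {
        TaggedMetrics m = new TaggedMetrics();
        // Measuring requests per server: two servers, so two distinct series.
        m.mark("requests", Map.of("server", "a"));
        m.mark("requests", Map.of("server", "b"));
        m.mark("requests", Map.of("server", "a"));
        System.out.println(m.seriesCount()); // 2
    }
}
```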


Pipeline Metrics

The table below describes the supported metrics for pipeline components.

| Component Type | Component Name | Metric Name | Metric Type | Description |
|---|---|---|---|---|
| All | All | execution_time | timer | Execution time of a single iteration of the component. |
| All | All | record_execution_time | histogram | Execution time per record. |
| All | All | records_processed | meter | Rate of records processed. |
| All | All | records_error | meter | Rate of records processed with error. |
| All | All | records_retries | meter | Rate of records whose processing has been retried. |
| All | All | records_discards | meter | Rate of records that have been discarded. |
| Source | All | read_rate | meter | Rate of useful bytes read. |
| Source | All | raw_read_rate | meter | Rate of raw bytes read (read_rate reports the number of useful bytes read). |
| Source | All | read_time | timer | Read time per batch. |
| Source | TCP Processor | tcp_no_assembled | meter | Rate of communications that could not be parsed in the transport layer. |
| Source | HTTP Processor | http_filter_discarded | meter | Rate of communications that do not match the filter. |
| Source | HTTP Processor | http_not_build | meter | Rate of communications that could not be parsed in the application layer. |
| Source | HTTP Processor | http_lost_response | meter | Rate of lost responses (requests without responses). |
| Sink | All | write_rate | meter | Rate of useful bytes written. |
| Sink | All | raw_write_rate | meter | Rate of raw bytes written. |
| Sink | All | write_time | timer | Write time per batch. |
| Sink | S3 Sink | async_records_error | meter | In asynchronous mode, rate of sent records that have been rejected by Kafka. In synchronous mode, this value is reported in the common records_error metric. |


The table below describes the supported metrics for pipeline streams.

| Stream Type | Metric Name | Metric Type | Description |
|---|---|---|---|
| Any | stream_used_records | counter | Usage of the stream: number of records currently stored in the stream. |

Dashboards and Alerts

Datumize Dashboards and Alerts are based on Grafana. You can refer to its documentation for advanced dashboard configuration.

Datumize provides several out-of-the-box dashboards for visualizing the health of host machines, data streams, and DDC processing. Dashboards are important for understanding performance: they can help identify issues, inspire optimization, and serve as a visual representation of the project's health.

Predefined Dashboards

Zentral works with Grafana and includes several out-of-the-box dashboards, briefly described below. These dashboards can be edited or cloned, and new dashboards can be added through the Dashboard resource editor.

Machine Health

CPU and RAM utilization are displayed as a percentage of the total; Java memory is displayed as both free and utilized.

Disk Utilization

Disk usage is important for tracking the capacity of available processing storage. It is displayed as NFS and total used.

Records

Records can be displayed as the total volume of near-real-time processed data records, or as the total of aggregated OK records over time.

Pcap Records

If pcap records are used, they are also displayed as a total of processed pcaps, both as a time series and as a daily total.

Time

Time metrics show records waiting to be written and block time. These are important for understanding whether a bottleneck is forming between record intake and processing.

Total Errors

Additionally, the number of processing errors over time is also tracked, both as error records and as DDC processing errors.