Overview for Datumize Data Collector (DDC). If you are new to the product, you are more than encouraged to read it through.

Overview

Data is at the center of every digital transformation initiative nowadays. This fact has increased the need to have reliable technologies that ensure companies are capturing the data they need, timely and in the right format and quality. Datumize Data Collector (DDC) is a lightweight, high-performance, streaming enterprise data integration software focused on dark data collection from sophisticated and complex sources that most vendors don’t cope with. DDC allows you to integrate data from various sources, such as in-transit network transactions, Internet of Things (IoT), mobility and industrial protocols, and enrich customer, business and operational intelligence with the resulting new data insights.


What is a Pipeline

Datumize Data Collector is a software engine for running data integration applications made out of these pieces: 

  • Pipeline: a data integration flow, represented as a directed acyclic graph. You will create pipelines as the core data applications to model your data integration. Pipelines are made out of components, aimed at dealing with data integration activities, and streams that connect the different components to form a complete flow, wrapped as records. In a pipeline, data flows left to right, and multiple combinations of components and streams are supported to allow extreme and modular flexibility.
  • Source Component: the component that captures data in real time from any type of data source. You must have at least one source component to create a pipeline. Each source component deals with a certain protocol (i.e. HTTP, SNMP, ModBus) and capture method (i.e. network sniffing, polling). Temporary data (data flowing on a network) and closed/proprietary data (data contained inside machines) are preferred as these data yield higher value and Datumize Data Collector offers unique value.
  • Processor Component: the component that crunches captured data in real time, correlate individual events, apply advanced algorithms, extract valuable information, compute additional metrics, and standardize or transform into a new data format. Processors become your programming language to transform data at will.
  • Sink Component: the component store the resulting information/events in any storage, either local to the current machine or cloud, integrate with out of the box analytics, or raise business/technical alerts.
  • Stream: the buffer that connects two different components. There are different kind of streams implementing specific funcionality; the basic stream is an buffer the decouples the pipeline execution and supports in-memory temporary storage of records.
  • Record: is the envelope that wraps the data being transformed. The record is modeled as a Java object of any type with a label (topic) to be used for routing. By default, an input record is not copied as-is to an output record.

A Record is the structure used to store data within Datumize Data Collector.

  • A record is defined by the following Java abstract class excerpt. 
abstract class Record<T> {
protected T data;
protected String topic;
}
  • Data stored in a record can be any Java object. 
  • Records are created at source components.
  • Records are not reused. Processor component consume input records from a stream, process data and generate a new output record, placed in another stream. 
  • Sink components consume records and do not generate any other record.
  • A topic is a label applied to a record that is used for advanced partitioning and concurrent processing. All components support defining a topic, that will be applied to each record managed by that component.


 There are multiple pipeline topologies supported, to allow for a flexible data integration. Split and Join pipelines are of special interest. All source and processor components provide a topic property to annotate automatically records that come out of these components.

  • Joining is achieved by using processors that admit more than one stream and mix records coming from all input streams.
  • Splitting involves generating records of different types. Each different type of record is to be defined by using a different topic. Records will be routed through the correct stream according to the policy of the stream; those streams labeled with a topic will only admit records labeled with that specific topic.


Pipeline Components

Source components represent the entrance gate for data into DDC. The following diagram represents how data capture models work, always trying to be as much non-intrusive as possible and adaptative to the actual data sources.

  • Passive data sources use techniques such as network sniffing and deep packet inspection to obtain data passively; that represents minimal overhead, for example, by configuring network replication (port mirroring) in a router to obtain a copy of certain communications.
  • Polling data sources use a more classical request-response approach in which the source component actively inquiries an external resource (i.e. a directory, queue or API) and retrieves a response. A wise combination of these approaches combined with the right runtime policy (polling frequency, batch size, and others) will ensure that the overall data capture process has minimal overhead. Check available DDC Source Components.

Processor components are aimed at transforming data, temporarily stored in a record, to a different data - very likely not only changing the content but also the type. There are plenty of out of the box processors that serve a wide variety of purposes. At the end of the day, as represented in the figure below, the mission of a processor is to handle some input data (in the example, XML containing a dialog made out of a request, a response and some extra meta-information) and transform into an output data that has a meaningful usage within the pipeline (in the example, some business metrics calculated out of parsing the complete XML dialog). The behaviour of all components is determined by component properties that can be configured to substantially change the functionality provided; the most versatile example of configurable processor component would be the Cook Processor, that allows a customizable script. Most of the data type conversions are implemented using component converters, their mission being to transform data from type A into type B (i.e. from a byte array into a Java object), and sometimes changing content as well (i.e. compression). Check available DDC Processor Components and DDC Converters.

Sink components provide the opposite functionality of source components. A sink component receives input data and writes some other data into an external resource such as a local file, queue, or database. The external resource might determine if the data transfer happens synchronously or asynchronously. Sinking to a decoupling resource, such as a queue or local file, might help to break the complexity of the system and create simpler pipelines. Check available DDC Sink Components.



Streams and Records

A pipeline is a highly scalable asset that acts as a data application. Pipelines always run inside DDC memory, aimed at minimal latency for streaming data capture, processing and sink. Components are connected using streams, that should be seen as a highly-efficient buffered queue that temporarily holds records using a First-In First-Out (FIFO) policy.

  • Currently, there is a Single Memory Stream aimed at in-order processing of records (R1, R2) using minimum concurrency: thread T1 will be running component A while same thread T1 (or a different Thread T2, but not many) will run component B. This approach is valid for most situations and is the default option.
  • For high-concurrency, ultra-fast processing, where concurrency matters to avoid queue overflows, a Partitioned Memory Stream supports multi-threading by introducing the concept of hashed (different) buffers within the stream. The stream is configured with a hashing function that will select the buffer to assign the record to. Component B should be configured with multiple threads running in parallel (ideally, DDC running in multi-core CPU), each thread picking-up a buffer. This partitioning strategy comes in very handy to assemble network dialogs, where multiple independent clients are talking to one server, and network packets are received in sequential order (even with repeated packets) but must be assembled in-order and rapidly. 
  • You can label a stream to only accept records of a certain topic. 
  • Streams are continuously evolving with DDC. Check available DDC Streams and Records.



Runtime Policies

As a naive data architect, pipelines tend to be designed assuming almost a linear data flow, no concurrency and infinite execution resources. Crude reality is that real systems are limited by network bandwidth, available CPU or memory, or pricing for each read or write operation. These real-world limitations require that the data integration pipelines are configured with runtime policies that define the several crucial aspects.

These are component runtime aspects to be considered:

  • Concurrency defines the count and assignment of threads to a certain component or set of components in a pipeline. In natural language, you'd say "I want two threads running this component at the same time". High-performance, streamed data usually requires the pipeline to execute as fast as possible, and one of the most effective concurrency systems is available out of the box within the Java Virtual Machine: multiple threads running the same code. In order to exploit the full potential of concurrent threads running the same component (i.e. Cook processor executing some custom data transformation, or a TCP network packet assembler) you also need parallelism in the underlying hardware. Simply put, the most efficient approach is to run DDC in a multi-core CPU, and "glue"  threads to core to maximize throughput at low latency. Defining 100 threads while you have only 4 cores available won't help. 
  • Scheduling defines when to execute a component. You could configure the scheduling to execute a file source every 15 seconds independently of how much time it takes to read each run; you could also want to wait (sleep) 15 seconds between runs, that is slightly different behaviour but crucial as it states the amount of time where you will force that component the breathe.
  • Execution sets additional restrictions to execute a component, considering that the scheduling policy has marked the component to be run. If selected to run, you can restrict the component to run only if there are records queued up in the input streams, or execute anyway, or even define a custom condition. 
  • Batching is about how many records you want to tackle in a single run. Whenever a component is run, it might be helpful to limit the amount of work to do by defining batches of records; for example, for a database sink, a batch of 20 means that at most 20 records will be retrieved from the input stream and eventually written to the database.
  • Errors always happen, so an error policy is needed to define what to do. Basically, there are two options: discard the record (and obviously log the problem) or retry the record (limiting the number of retries to avoid deadlocks). 

These are stream runtime aspects to be considered:

  • Capacity defines the maximum size of the queue that holds records. Any stream must limited by capacity, otherwise out of memory errors will be reported. Eventually, some stream implementations might provide a paging or persistence mechanism for records in the queue, but in-memory capacity will have to be defined as well.  
  • Read and Write define the way records are read and written from/to streams. Depending on your level of impatience, records can be read immediately (and if there isn't any the component gets nothing), you can wait until a record is available (assuming you might need to wait for a long time), or you could prefer to wait with a timeout (in which case, when a component is executed, records might be available or not). You can infer how that policy affects to writing records to an output stream.

Please refer to DDC Policies for more advanced information.


Deployment

Datumize Data Collector software is based on Java language and runs within the boundaries of a Java Virtual Machine (JVM). That means, among other benefits, that DDC is multi-platform and compatible with any operating system supported by by the Java Runtime Environment for running serious stuff. 

Edge computing is made possible because DDC is lightweight and runs on minimal hardware, allowing acquired data to be enhanced, filtered or joined with other sources prior to delivering it for further processing or storage. You can even install DDC on a RaspberryPi. Obviously, the underlying hardware and operating system must be sufficient for running your pipelines: don't expect deep packet inspection to run smoothly on a RaspberryPI.

When you deploy DDC, you must consider several concepts, as represented in the figure below:

  • Machines are physical or virtual servers running a supported operating system.
  • Agents are background daemons installed to control Datumize software, including DDC.
  • Instances of DDC are OS processes running DDC. As DDC runs on Java, each DDC instance will be running in a separate JVM.
  • Pipelines are assigned to DDC instances. You can group instances to facilitate the assignment.

All deployment is managed through Datumize Zentral (DZ) - check Infrastructure Deployment and Pipeline Deployment.


Privacy and Security

Data integration at DDC always happens within your premises and computational resources; this principle complies with enterprise security blueprints, respects data privacy and General Data Protection Regulation (GDPR). When designing a new data integration solution with Datumize, you must consider the customer's security blueprints to define the right architecture. The most important concept to keep in mind is that Datumize, by default, never stores customer's data or on behalf of the customer.

Some relevant points to keep in mind:

  • DDC does not store any Personal Identifiable Information (PII) by default.
  • DDC stores internal technical metrics about product health in the Datumize cloud; these metrics are not PII.
  • The pipelines you create must comply with security and privacy restrictions, as with any other enterprise middleware software.
  • The source and sink components that make data flow between systems and resources must also comply to customer's security and privacy restrictions.

Check Obfuscators components that provide advanced privacy behaviour for data processing.


Management

Datumize Zentral (DZ) is the place to manage Datumize Data Collector and any other Datumize product. All configuration, deployment and monitoring activities must be done through Datumize Zentral. 


Best Practices

Datumize Data Collector is based on very granular components that do one very specific task. This is a design principle that is always enforced to keep things simple and flexible. This potentially requires introducing additional components in the pipeline to obtain the desired result: this is expected and components are highly optimized and all data processing happens in memory.

Although Records don't force the kind of data to be processed, maps are the preferred format to store data flowing in the pipeline. Maps are structured based on keys, and still support complex data structures. Data being processed should be converted into a map as soon as possible in the pipeline. Some data sources such as API or SNMP already produce maps as an output; others data sources, however, like network traffic sniffing coming from PCAP files, produce slightly more complex data, including request and response payloads and network metadata. Whatever the data format is, that is expected, depending on your project the pipeline will handle maps at the right time, if needed. The pipeline should convert the "unstructured" data to a map "structured" data a soon as possible.

The DDC Extensibility (SDK) allows for extending the out of the box DDC capabilities. Always try to use first the available components and converters in the catalog; the SDK is intended mainly to support business logic.