Streams are buffering queues that connect together pipeline components and allow for the transmission of data packaged in Records. A stream buffers records from a publishing component while the receiving component is busy or not scheduled to execute. You always need a stream to connect two components. Streams can be labeled with a topic to accept only matching records. By default, streams store records in memory in a limited size queue. A record is the envelope that wraps the data being forwarded over a stream.

Common Stream Properties

All streams share these common properties.

PROPERTYIDDESCRIPTION

REQUIRED

TYPE

DEFAULT

EXAMPLES

Common
IdentifierIDStream unique identifier within the pipeline, read only, Only useful for advanced mode.YesStringAuto generated

SingleMemoryStream_212

This identifier is automatically generated by the system and you can't change it. Might be helpful for advanced pipeline configuration.

DescriptiondescriptionA short description for the stream, aimed at providing additional information for the pipeline designer.NoString

Processing in parallel 2 availability services.

Short and sweet description.

Topictopic

All streams support a topic to filter the accepted records. Following conditions apply when a stream is receiving a record:

  • The record is not labeled with a topic: record accepted.
  • The record is labeled with a topic
    • Topics match: record accepted
    • Topics don't match: record not accepted 
NoString

foo

All records labeled with "foo" will be accepted, and those not labeled too.

Single Memory Stream

Single Memory Stream is the simplest stream available. It is a buffered connection between two components, limited in capacity and temporarily stores the records in memory. This is stream is not fault tolerant: in case of critical error, not processed records will be lost. 

Partitioned Memory Stream

Partitioned Memory Stream is an in-memory stream aimed at concurrent high-performance processing. It is a buffered connection between two components, limited in capacity and temporarily stores the records in memory. The buffered records are grouped according to a hashing function; the number of groups is determined by the number of consumer threads. This stream is not fault tolerant: in case of critical error, not processed records will be lost. 

The Partitioned Memory Stream should be used when destination component is stateful and more than one thread is executing it, so that related records must be processed by the same thread on the destination component. To increase concurrency and performance, the status in stateful components is not shared between threads, so each thread has his own individual state. If not using a partitioned stream, then two related records that should be processed as a group, in reality they could be picked up by different threads and will never produce a group.

if two records are "related" or not it is decided by the hash function.

This is the list of properties available.

PROPERTYIDDESCRIPTIONREQUIREDTYPE
Default
Hashing functionhash-function

Hash function that determines the identification of the group to store a new record.

The total number of groups is determined by considering the number of threads that will be consuming records from this stream. The idea to maximize concurrency is to have one group per thread.

Every time a new record is to be stored in the stream, the group identification is calculated as the hash value modus number of threads. This ensures that consumer threads always read from the same group.


YesSee Hashing Functions section.

Hashing functions

The IP Hashing function is specially designed for IP packets and high-performance environments where concurrency is needed. Every IP dialog is a group of packets: to reconstruct the whole dialog, the same thread must receive all dialog packets. Otherwise, packets are split into different threads and dialog is impossible to rebuild.

IP packets are hashed according to these rules.

IP4: 4bytes ip source + 4bytes ip dest + 2bytes port src + 2bytes port dest = 12
IP6: 16bytes ip source + 16bytes ip dest + 2bytes port src + 2bytes port dest = 36

WLAN hashing is also possible, in this case the MAC address is hashed.

Record

A record is the structure used to store data within Datumize Data Collector. The data stored by a record can be any Java object.

Records are used to encapsulate data that will be processed throughout the pipeline. 

DDC Record

abstract class Record<T> {
    protected T data;
    protected String topic;
 
    public T getData();
    public String getTopic();
    public boolean hasTopic();
    void clean();
}
JAVA

All data that goes into the pipeline needs to be stored and represented using in a record, that becomes the envelope for any data transformation.

The lifecycle of a record starts in one component and ends in the next component in the pipeline. Record are never reused to prevent race conditions and random behavior due to split/join processing in the pipeline. Effectively, records span only for the temporal storage in a stream and are immutable.