Glossary
These definitions disambiguate the terms used across the Datumize documentation.
Term | Description |
---|---|
Dark Data | As excerpted from Wikipedia, dark data is data acquired through various computer network operations but not used in any manner to derive insights or for decision making. Datumize products are mostly aimed at the integration of Dark Data. |
Datumize Product | Any of the products currently offered in the Datumize portfolio, as explained in the introduction to Datumize: namely, Datumize Data Collector (DDC), Datumize Data Dumper (DDD), and Datumize Zentral (DZ). |
Datumize Data Collector (DDC) | Datumize Data Collector (DDC) is the Datumize product that runs data pipelines aimed at capturing and processing streamed data, typically on the edge, although it is not limited to edge deployments. It was the first product released by Datumize. |
Datumize Data Dumper (DDD) | Datumize Data Dumper (DDD) is the Datumize product that supports capturing and dumping network sniffed packets. It captures live packets from a network interface and dumps them into PCAP file format, minimizing packet loss through various optimizations. |
Datumize Zentral (DZ) | Datumize Zentral (DZ) is the graphical interface that acts as a hub for using Datumize products. The idea behind its name is to have a centralized place to manage the complete product lifecycle, such as discovering data, creating and testing pipelines, deploying instances and pipelines on real infrastructure, and monitoring the system behaviour. Datumize Zentral is a graphical web application made available in the cloud at https://zentral.datumize.tech |
Directed Acyclic Graph (DAG) | A Directed Acyclic Graph is a model commonly used in mathematics and computing to describe a set of states and the flow between them. A DAG is made of vertices and arcs. In data integration, a DAG is commonly used to describe the data processing flow, encompassing data capture (start vertex), data processing (intermediate vertices) and data sinking (end vertex). In Datumize, a Pipeline is represented as a DAG: each vertex is a Component, and each arc is a Stream. |
Pipeline | A Pipeline is a data processing flow that captures, processes and sinks data. It is represented as a Directed Acyclic Graph (DAG) and is the main asset for Datumize products. You design and test a Pipeline using Datumize Zentral, and you deploy it to a product such as DDC or DDD (see the pipeline sketch after this table). |
Component | A Component represents an operation in a Pipeline, or a vertex in a DAG. You select components from a palette to compose Pipelines. Components can be classified according to their function in the Pipeline, such as capturing data (Source Component), processing data (Processor Component) or sinking data (Sink Component). Each component has properties that define its runtime behaviour. |
Source Component | A Source Component is a specialized Component aimed at capturing data and injecting those data into the next component in the Pipeline. The simplest source component is one that reads a file and creates a Record for each line. |
Processor Component | A Processor Component is a specialized Component aimed at processing (or transforming) data, by reading input data, applying a certain function or logic, and writing data to its output. The simplest processor component does nothing but copy input to output without any modification. |
Sink Component | A Sink Component is a specialized Component aimed at receiving input data and writing those data into an external entity outside the scope of the running Datumize product. The simplest sink component writes each received Record into an output file in the local machine. |
Stream | A Stream represents a link between components in a Pipeline, connecting the output of a component to the input of the connected component. In a DAG, a Stream is an arc. The simplest stream is an in-memory queue that communicates a Record between the output of one component and the input of the next. |
Machine | A Machine is a computational unit, either physical or virtual, providing a supported operating system, computing power, memory, disk and network connectivity. Machines are very often associated with servers located in a data center, but are not limited to them, as the Internet of Things (IoT) and cloud environments have opened many alternatives to conventional computing. A Machine must be configured with a proper user and credentials to allow the installation of Datumize products. |
Instance | An Instance is a process running in the underlying operating system that is executing a Datumize Product. You can have many instances of the same product configured in a complex production infrastructure. An Instance is always attached to a unique Machine, but a Machine can run many different Instances. An instance is ultimately aimed at executing a Pipeline or configuration that produces the desired result. |
Deployment | A Deployment represents the process of configuring the infrastructure (made of Machines and Instances) and assigning Pipelines to that infrastructure; see the deployment sketch after this table. The expected result is that the Pipelines execute within the configured Datumize products and produce the desired output. All in all, a successful deployment fulfills the data integration needs of your project. |
Alert | An Alert represents an event that is relevant to a certain stakeholder. Alerts are usually associated with error situations such as a failure, but are not limited to them. An alert is based on rules that evaluate monitoring metrics on a continuous basis; whenever a rule is triggered, it raises a notification that eventually reaches a human being (SMS, email) or a machine (queue); see the alerting sketch after this table. |
Metric | A Metric is a tagged value that represents a runtime measurement, usually relevant for monitoring purposes. |
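
To make the Pipeline, Component and Stream definitions concrete, the sketch below wires the simplest components described above (a file-reading source, a pass-through processor, and a file-writing sink) into a three-vertex DAG connected by in-memory queues. This is a minimal illustration in Python; all class and function names are assumptions for this example and are not the Datumize API.

```python
from queue import Queue

# Create a small input file so the sketch is self-contained.
with open("in.txt", "w") as f:
    f.write("first record\nsecond record\n")

class SourceComponent:
    """Start vertex: reads a file and emits one Record per line."""
    def __init__(self, path, output: Queue):
        self.path, self.output = path, output

    def run(self):
        with open(self.path) as f:
            for line in f:
                self.output.put(line.rstrip("\n"))  # each line becomes a Record
        self.output.put(None)  # sentinel marking end of the stream

class ProcessorComponent:
    """Intermediate vertex: the simplest processor copies input to output unchanged."""
    def __init__(self, input: Queue, output: Queue):
        self.input, self.output = input, output

    def run(self):
        while (record := self.input.get()) is not None:
            self.output.put(record)  # identity transform
        self.output.put(None)

class SinkComponent:
    """End vertex: writes each received Record to a local output file."""
    def __init__(self, input: Queue, path):
        self.input, self.path = input, path

    def run(self):
        with open(self.path, "w") as f:
            while (record := self.input.get()) is not None:
                f.write(record + "\n")

# Streams are the arcs of the DAG: in-memory queues between components.
s1, s2 = Queue(), Queue()
pipeline = [SourceComponent("in.txt", s1),
            ProcessorComponent(s1, s2),
            SinkComponent(s2, "out.txt")]
for component in pipeline:
    component.run()
```

Running the components sequentially works here because each queue is fully drained before the next component starts; a real pipeline runtime would execute the vertices concurrently.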
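The relationship between Machines, Instances, Pipelines and a Deployment can likewise be sketched as plain data. The structure below is a hypothetical illustration of the mapping only, not an actual Datumize configuration format; all hostnames and identifiers are invented.

```python
# Hypothetical deployment description: two DDC Instances on one Machine,
# each executing its own Pipeline. An Instance is attached to exactly one
# Machine, while a Machine may host many Instances.
deployment = {
    "machines": [
        {"host": "edge-01.example.com", "user": "datumize"},
    ],
    "instances": [
        {"id": "ddc-1", "product": "DDC", "machine": "edge-01.example.com",
         "pipeline": "capture-http-traffic"},
        {"id": "ddc-2", "product": "DDC", "machine": "edge-01.example.com",
         "pipeline": "parse-application-logs"},
    ],
}
```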
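Finally, the interplay between Metrics and Alerts can be sketched as a rule that evaluates tagged measurements and raises a notification when triggered. Everything below is an illustrative assumption (the metric name, threshold and tags are invented), not Datumize's monitoring API.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Metric:
    """A tagged value representing a runtime measurement."""
    name: str
    value: float
    tags: dict = field(default_factory=dict)

def evaluate_rule(metric: Metric) -> Optional[str]:
    """Hypothetical alert rule: notify when packet loss exceeds 1%."""
    if metric.name == "packet_loss_ratio" and metric.value > 0.01:
        return f"ALERT: packet loss {metric.value:.1%} on {metric.tags.get('instance')}"
    return None

sample = Metric("packet_loss_ratio", 0.03, {"instance": "ddc-1"})
notification = evaluate_rule(sample)
if notification:
    print(notification)  # in production this would reach email, SMS or a queue
```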