This section introduces key concepts around Dark Data, data integration, data architecture and the Datumize products. It provides both a quick overview to understand the broader picture and, more specifically, helps you understand and position each Datumize product and its function in an enterprise architecture.

About Datumize

Datumize is a software vendor established in 2014 in Barcelona, working on data integration and edge computing. We operate in the data integration space, unlocking data for digital transformation.

We offer low-cost, non-intrusive technology for data capture, featuring polling (request-response), network sniffing and deep packet inspection. Our product is lightweight, real-time and multi-platform, and provides edge computing, endpoint management and third-party software bundling. It enables data integration and edge use cases for complex and legacy data.

Datumize has created middleware for capturing data from heterogeneous and sophisticated data sources, doing edge computation, and ingesting into any target destination, while preserving full data privacy.

You can read more at www.datumize.com.



Enterprise Data and Datumize

The concept of Enterprise Data has enough breadth and depth to fill several books. Still, it is useful to sketch a high-level picture of what enterprise data actually means to most companies and what comprises its life cycle. There is no single definition of enterprise data, or even of an enterprise data architecture, so take the classification here as one that is broadly accepted in the market. Depending on a vendor's alignment the classification might look different, but you should get the point.

These are well-accepted areas of expertise (or capabilities) that any enterprise data architecture should define at some point, summarized below.

  • Data Strategy: According to the Dataversity definition, “Data Strategy describes a set of choices and decisions that together, chart a high-level course of action to achieve high-level goals. This includes business plans to use information to a competitive advantage and support enterprise goals.” Keywords: Reference Architecture, Maturity Model, Organization, Process, People.

  • Data Governance: Informatica defines, “Data governance encompasses the strategies and technologies used to make sure business data stays in compliance with regulations and corporate policies.” Keywords: data lineage, ownership, distribution, compliance.

  • Master Data Management (MDM): The Wikipedia business definition explains that “master data management (MDM) is a method used to define and manage the critical data of an organization to provide, with data integration, a single point of reference. The data that is mastered may include reference data - the set of permissible values, and the analytical data that supports decision making.” Keywords: source of truth, normalization, duplicate detection, master, linking.

  • Data Warehousing (DWH): Talend elaborates on this capability by stating that “a data warehouse is a large collection of business data used to help an organization make decisions. The concept of the data warehouse has existed since the 1980s, when it was developed to help transition data from merely powering operations to fueling decision support systems that reveal business intelligence.” Keywords: data lake, data mining, OLTP, OLAP.

  • Business Intelligence (BI): Gartner states that “analytics and business intelligence (ABI) is an umbrella term that includes the applications, infrastructure and tools, and best practices that enable access to and analysis of information to improve and optimize decisions and performance.” Keywords: aggregation, reporting, dashboard, KPI.

  • Data Analytics: The definition given at Investopedia clarifies that “data analytics is the science of analyzing raw data in order to make conclusions about that information. Many of the techniques and processes of data analytics have been automated into mechanical processes and algorithms that work over raw data for human consumption.” Keywords: machine learning, correlation, regression, descriptive, predictive.

  • Big Data Analytics: Although closely related to the broader concept of data analytics, big data has gained enough interest and traction that IBM defines it separately: “Big data analytics is the use of advanced analytic techniques against very large, diverse data sets that include structured, semi-structured and unstructured data, from different sources, and in different sizes from terabytes to zettabytes.” Keywords: petabyte, zettabyte, Hadoop, NoSQL, R.

  • Data Quality: Techopedia defines this capability as “an intricate way of measuring data properties from different perspectives. It is a comprehensive examination of the application efficiency, reliability and fitness of data, especially data residing in a data warehouse.” Keywords: cleansing, standardization, masking.

  • Data Architecture: Wikipedia discusses that “in information technology, data architecture is composed of models, policies, rules or standards that govern which data is collected, and how it is stored, arranged, integrated, and put to use in data systems and in organizations. Data is usually one of several architecture domains that form the pillars of an enterprise architecture or solution architecture.” Keywords: TOGAF, conceptual, logical, physical.

  • Data Integration: One of the most complete definitions is given by Gartner: “The discipline of data integration comprises the practices, architectural techniques and tools for achieving the consistent access and delivery of data across the spectrum of data subject areas and data structure types in the enterprise to meet the data consumption requirements of all applications and business processes.” Datumize products are to be placed in this category. Keywords: Extract-Transform-Load (ETL), streaming, real-time, source, sink.

  • Metadata Management: The international association of data management professionals (DAMA) defines metadata as “information about the physical data, technical and business processes, data rules and constraints, and logical and physical structures of the data, as used by an organisation.” Keywords: catalog, dictionary, taxonomy, semantics, registry.


These are the usual activities in the enterprise data life cycle:

  • Data Identification: this is the first and probably one of the most neglected activities, aimed at inventorying the data in your organization. Your data inventory should cover what the data describes and what type it is, how the data is generated and in what estimated volumes, and the technical formats involved (protocol, file type, etc.). Data inventories are usually set up when there is some sort of data governance in place; an inventory, however, can be as simple as a spreadsheet describing the different data sources available in the organization. Often, though, data sources are not documented at all and you need to “read the source code” - that is, look into the actual data.

  • Data Collection: immediately after identifying your relevant data sources, often heterogeneous and siloed, some collection (or integration) is needed to bring these relevant pieces of data together. There are multiple techniques to collect data, including virtual collection, which creates a pointer to the data in its original source and status, and physical collection (extract-transform-load et al.), in which data is read at the source and ingested into a different place. Often, data collection includes additional activities such as cleansing, to avoid unnecessary data flow; mapping and format conversion, to adapt the data to the needs of the target location; and support for multiple protocols, to understand the myriad of data formats that exist in the real world.

  • Data Aggregation: collected data tends to be verbose and fine-grained, which is fine, and mostly mandatory, for detailed automated or algorithmic analysis; human interpretation, however, needs summarization to understand complex concepts, and that means aggregation. The “best sellers” for aggregated data are count, average, maximum and minimum; most business metrics can be defined in terms of these operations (and do not forget the median). The other usual concern about data aggregation is volume and performance: aggregating one million data points every 10 seconds is a hard problem.

  • Data Analysis: analysis tends to be considered the exciting phase, when advanced mathematics, statistics and business modeling (combined in what is known as data science) unleash all the expressiveness of data into unexpected (or not) insights and conclusions. Most of the time, analysis starts on a blank sheet of paper and requires a combination of classical and modern techniques to create meaningful results. It is well accepted in the market that analytical models can be classified along a timeline: descriptive models (describing the past and present), predictive models (projecting the future based on past and present data), and prescriptive models (which involve some sort of knowledge or learning, usually related to Artificial Intelligence, to recommend actions for the future).

  • Data Visualization: sight is the dominant sense for most human beings. Complex information therefore tends to be better transmitted using a visual framework, and data is no exception to this rule. Data visualization refers to the techniques and tooling used to communicate the metrics, insights and conclusions produced during analysis. Infographics, charts, reports and dashboards are examples of techniques used to represent data with varying degrees of detail and interactivity. Drill-down is one major advantage of visualization tools: it allows humans to cope with summarized views of data and eventually dig into the details for further understanding.
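The Data Collection activity above follows the classic extract-transform-load pattern. As a minimal sketch, with an in-memory CSV standing in for a real source and a plain list standing in for a real sink (all names here are illustrative):

```python
import csv
import io

# Hypothetical source: raw CSV exported from a siloed application.
RAW = "id,amount,currency\n1, 10.5 ,eur\n2,7.25,EUR\n"

def extract(raw):
    """Read records from the source format (CSV here)."""
    return list(csv.DictReader(io.StringIO(raw)))

def transform(rows):
    """Cleanse and map: trim whitespace, normalize currency, cast types."""
    return [
        {"id": int(r["id"]), "amount": float(r["amount"].strip()),
         "currency": r["currency"].strip().upper()}
        for r in rows
    ]

def load(rows, sink):
    """Ingest into the target location (a list standing in for a sink)."""
    sink.extend(rows)
    return sink

sink = []
load(transform(extract(RAW)), sink)
print(sink[0])  # {'id': 1, 'amount': 10.5, 'currency': 'EUR'}
```

The same three stages appear in every physical collection, whatever the protocols involved.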
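The Data Aggregation activity can likewise be sketched as a tumbling-window computation of the “best seller” aggregates (count, average, minimum, maximum); the window size and sample points below are invented:

```python
from collections import defaultdict

def aggregate(points, window_seconds=10):
    """Tumbling-window aggregation: count, average, min and max per window.

    `points` is an iterable of (timestamp_seconds, value) pairs.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts // window_seconds].append(value)
    return {
        w: {"count": len(v), "avg": sum(v) / len(v), "min": min(v), "max": max(v)}
        for w, v in buckets.items()
    }

points = [(0, 2.0), (3, 4.0), (9, 6.0), (12, 10.0)]
summary = aggregate(points)
print(summary[0])  # {'count': 3, 'avg': 4.0, 'min': 2.0, 'max': 6.0}
```

At one million points every 10 seconds the hard part is not the arithmetic but doing it incrementally and in bounded memory, which is what makes aggregation at scale a systems problem.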

The following diagram represents how Datumize is positioned in the data journey. Dealing with Dark Data actually means coping with the space between traditional data integration for Information Technology (IT) and industrial data integration for Operational Technology (OT). Like any other data integration technology or product, Datumize's main mission is to ingest data into the enterprise data and analytics platform, but it considers special and sophisticated Dark Data sources: network transactions, the Internet of Things (IoT), mobility devices (Wi-Fi, Bluetooth), industrial machines (OPC, Modbus) and highly distributed or disconnected environments.


Dark Data and other types of Data

Data is the raw material for almost every innovation project, but also the reason why many of these projects fail. According to McKinsey, almost 40% of large corporations are running smart data projects, and another 30% are planning to start one. IBM has reported that one third of business leaders don’t trust the information they use to make decisions. Most surprisingly, 63% of the companies using Big Data have not been successful at extracting data-driven insights, according to Dataversity.

One possible classification for data is made according to two factors:

  1. Availability: whether the data are stored and made available for further usage. One could argue whether data collected and transmitted in streaming mode is stored or not; for the sake of simplicity, it is assumed that any collected data is stored, even streaming data where some sort of temporary storage (e.g. Kafka) is used.

  2. Usage: whether the data is actually used for any additional purpose. A typical use would be visualization in a chart.

The classification below covers the four scenarios that result from combining these two factors. This method of data classification is not novel; it is mostly inspired by the Johari Window analysis technique, widely used in intelligence, engineering and astrophysics. You can read more about the known knowns and related concepts here.


  • Transactional data (collected and used): stored and used, typically needed for business and operational purposes. The most common and first data considered in analytics. 1-5% of overall company data. Examples: customer database (CRM), invoices database (ERP).

  • Operational data (collected but not used): stored but not used, or seldom used. Usually found in departmental databases or siloed applications; these data have intrinsic value but usually take considerable effort to integrate. 5-10% of overall company data. Examples: maintenance orders (siloed ERP).

  • Application data (not collected but used): data only managed and considered within the boundaries of the application/device that creates and uses it. The data are not stored, so no further (e.g. analytical) usage is feasible. >50% of overall company data. Examples: availability inquiry (booking system), temperature sensor connected to an oven (IoT).

  • Unknown data (not collected and not used): collateral data produced within the boundaries of the application/device, not used even for business or operational purposes and not stored either. >50% of overall company data. Examples: Wi-Fi access point noise in decibels (warehouse Wi-Fi network).


Gartner defines Dark Data as “information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes.” At least 80% of all data is dark data: the great hidden resource that flows untapped through major organizations. At Datumize, we define Dark Data as “data not collected and therefore not used.” According to the previous classification, the two categories of data not collected (application data and unknown data) are considered dark data.


Dark Data Sources

Where are dark data to be found? According to the data classification proposed previously, which considers data against the availability and usage factors, as a rule of thumb “dark data is found whenever data is not collected.” Datumize products are aimed at data integration and work whenever the data are not being collected; for other cases, such as neglected operational data, there are other data governance and discovery products that should be considered.

The following dark data sources are usually found in common scenarios. Please note that the list is not exhaustive.

Network Transactions

Tons of in-transit data remain hidden and unleveraged due to the difficulty of collecting and processing transient transactions flowing over the network (API and XML integrations, etc.) without the hassle of modifying the backend systems.

  • Data Source: data center (on-premises or cloud); backend systems (CRM, ERP, web, mobile, mainframe, middleware); integrations (REST API, WS, EDI, etc.).

  • Type of Data: ephemeral or unstored transactions, such as searches, lookups, and any transaction not fully stored in a database or log.

  • Ingestion Technique: network sniffing and Deep Packet Inspection (DPI); real-time, software-based, physical or virtual, data-drift aware.

  • Preparation Steps: protocol assembly (TCP, UDP, others); content parsing (XML, JSON, text, others); scriptable business logic (calculations).

  • Usage and Benefits: understand what you are not doing: missing sales, next best product, actual omnichannel customer behaviour, detecting customer churn conditions.
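As an illustration of the preparation steps above, once protocol assembly has reassembled a transaction, content parsing reduces to decoding the payload. A minimal sketch for an HTTP/JSON request (the captured payload below is invented):

```python
import json

# Hypothetical reassembled payload of a request captured off the wire
# (e.g. an availability search that is never stored in any database).
REQUEST = (
    b"POST /search HTTP/1.1\r\n"
    b"Content-Type: application/json\r\n"
    b"\r\n"
    b'{"destination": "BCN", "nights": 3}'
)

def parse_http_payload(payload):
    """Split an HTTP message into (request line, headers, JSON body)."""
    head, _, body = payload.partition(b"\r\n\r\n")
    lines = head.decode("ascii").split("\r\n")
    request_line = lines[0]
    headers = dict(line.split(": ", 1) for line in lines[1:])
    return request_line, headers, json.loads(body)

request_line, headers, body = parse_http_payload(REQUEST)
print(request_line, body["destination"])  # POST /search HTTP/1.1 BCN
```

A real DPI pipeline additionally handles fragmentation, pipelining and many more content types, but the parsing step has this shape.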

Wi-Fi Technology

Wi-Fi, a ubiquitous technology present in almost every facility, generates a huge variety of data that remains poorly explored but can be used to deliver motion intelligence to companies.

  • Data Source: Wi-Fi enabled premises; mission-critical Wi-Fi (e.g. warehouse); customer-experience Wi-Fi (e.g. hospitality, retail).

  • Type of Data: mobility events; position (X, Y); movement A-->B; distance.

  • Ingestion Technique: Wi-Fi active polling; SNMP; RTLS-like; REST API; access point or controller.

  • Preparation Steps: trilateration and fingerprint positioning (static environments); heuristics (speed, materials, no-route); artificial intelligence (path finding).

  • Usage and Benefits: optimize operations (minimize distance, measure and compare); customer experience (cross-selling, minimize queues).
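For the trilateration preparation step, here is a standard closed-form sketch (not necessarily the exact algorithm Datumize uses): given distances from three access points at known coordinates, subtracting the circle equations pairwise yields a 2x2 linear system.

```python
def trilaterate(p1, d1, p2, d2, p3, d3):
    """Position (x, y) from distances d1..d3 to three anchors p1..p3.

    Subtracting the circle equations (x-xi)^2 + (y-yi)^2 = di^2
    pairwise removes the quadratic terms, leaving a 2x2 linear system.
    """
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    a1, b1 = 2 * (x2 - x1), 2 * (y2 - y1)
    c1 = d1**2 - d2**2 + x2**2 - x1**2 + y2**2 - y1**2
    a2, b2 = 2 * (x3 - x1), 2 * (y3 - y1)
    c2 = d1**2 - d3**2 + x3**2 - x1**2 + y3**2 - y1**2
    det = a1 * b2 - a2 * b1
    return (c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det

# Anchors at (0,0), (10,0) and (0,10); a device at (3,4) is at
# distances 5, sqrt(65) and sqrt(45) from them respectively.
x, y = trilaterate((0, 0), 5.0, (10, 0), 65 ** 0.5, (0, 10), 45 ** 0.5)
print(round(x), round(y))  # 3 4
```

Real Wi-Fi positioning is noisier: distances are estimated from signal strength, so the heuristics and fingerprinting mentioned above are needed on top of this geometry.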

Industrial Networks

Highly valuable operational data is trapped inside machines, devices and sensors. Vendor lock-in, proprietary protocols and lack of interoperability have prevented machine data from being shared and used to govern and unlock efficiencies.

  • Data Source: factories and remote equipment; industrial machines; control and automation systems.

  • Type of Data: machine control operations and internal status; runtime metrics for internal variables; control operations.

  • Ingestion Technique: network sniffing and Deep Packet Inspection (DPI) when traffic exists, with no overhead on the machine; active polling when traffic does not exist or machines are disconnected.

  • Preparation Steps: protocol assembly (OPC-DA, OPC-UA, Modbus, others); message reformatting (XML, JSON); message upload to the enterprise data platform (MQTT, Kafka).

  • Usage and Benefits: efficient and convenient machine data acquisition; no overhead; custom protocols; leverage modern data platforms and data scientists.
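Active polling, mentioned above for disconnected machines, is essentially a timed read loop. In this sketch, `read_register` is a hypothetical stand-in for a real protocol client (a Modbus or OPC library call would go there), and the register addresses and interval are invented:

```python
import time

def read_register(address):
    """Hypothetical stand-in for a protocol read (Modbus, OPC, ...)."""
    return 21.5  # e.g. a temperature value reported by the machine

def poll(addresses, cycles, interval_s=0.01):
    """Poll each register every `interval_s` seconds and collect samples."""
    samples = []
    for _ in range(cycles):
        start = time.monotonic()
        samples.append({addr: read_register(addr) for addr in addresses})
        # Sleep only for the remainder of the cycle, keeping a steady rate.
        time.sleep(max(0.0, interval_s - (time.monotonic() - start)))
    return samples

samples = poll([40001, 40002], cycles=3)
print(len(samples), samples[0][40001])  # 3 21.5
```

The steady-rate sleep matters in practice: polling at a fixed cadence keeps the load on the machine predictable, which is the "no overhead" property the table refers to.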

Distributed Locations

Unconsolidated IT landscapes with legacy systems and/or heterogeneous technologies are the main reason why companies still have data silos, which remain a big challenge for corporate intelligence.

  • Data Source: multiple remote and heterogeneous premises (retail shop, branch office, franchise); different countries and regulations.

  • Type of Data: siloed data in multiple formats: Point of Sale, sensors, Wi-Fi, ad-hoc hardware.

  • Ingestion Technique: active polling of databases (POS), REST APIs (sensors) and SNMP (Wi-Fi).

  • Preparation Steps: protocol details (e.g. Microsoft Access); message reformatting (XML, JSON, CSV); upload to the enterprise data platform (file, SFTP, queue).

  • Usage and Benefits: understand your siloed information: real-time sales, lost sales, customer visits, operational inefficiencies.


Datumize Products

The Datumize product portfolio comprises several products, scoped to the data integration space, aimed at making dark data available to the enterprise.

Datumize Data Collector (DDC) is the core Datumize product. DDC is a lightweight, high-performance, edge, streaming enterprise data integration software focused on data collection from hidden, complex and disparate data sources, most of them unexplored. DDC captures the right data stream from dark data sources and processes it on the edge in real time to transform it, extract valuable information, and store actionable results in the next system. DDC is modular: every functionality is a pluggable component organized in a data pipeline that visually represents the data integration.
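Conceptually, a pipeline of pluggable components chains a source through processors to a sink. The sketch below is a generic illustration of that pattern, not DDC's actual API; every name in it is invented:

```python
def run_pipeline(source, components):
    """Push every record from the source through each component in order."""
    stream = source
    for component in components:
        stream = component(stream)
    return list(stream)

# Invented components: parse raw lines, filter, and enrich.
def parse(stream):
    return (line.split(",") for line in stream)

def keep_errors(stream):
    return (rec for rec in stream if rec[1] == "ERROR")

def enrich(stream):
    return ({"ts": rec[0], "level": rec[1], "msg": rec[2]} for rec in stream)

raw = ["10:00,INFO,started", "10:01,ERROR,timeout"]
out = run_pipeline(raw, [parse, keep_errors, enrich])
print(out)  # [{'ts': '10:01', 'level': 'ERROR', 'msg': 'timeout'}]
```

Because each stage is a generator, records flow through the chain one at a time, which is what makes this shape suitable for streaming rather than batch processing.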

Datumize Data Dumper (DDD) is an ultra-efficient network packet collector aimed at capturing network traffic with minimal or no packet loss. DDD leverages native executables, such as tcpdump, and operating system kernel libraries dealing with the network stack, such as libpcap, for an efficient packet capture. DDD usually works in combination with DDC: DDD does the network sniffing, while DDC does the Deep Packet Inspection and further data processing.
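A rough approximation of what a tcpdump-based collector runs is a ring-buffer capture, where tcpdump rotates pcap files so disk usage stays bounded. The sketch below only builds the command line (the interface, sizes and filter are illustrative); it does not execute it:

```python
def tcpdump_ring_command(interface, outfile, filter_expr,
                         file_size_mb=100, file_count=10):
    """Build a tcpdump command that rotates pcap files in a ring buffer.

    -C rotates the output file every `file_size_mb` millions of bytes,
    -W keeps at most `file_count` files, and -s 0 captures full packets.
    """
    return ["tcpdump", "-i", interface, "-s", "0",
            "-w", outfile, "-C", str(file_size_mb), "-W", str(file_count),
            filter_expr]

cmd = tcpdump_ring_command("eth0", "capture.pcap", "tcp port 80")
print(" ".join(cmd))
# tcpdump -i eth0 -s 0 -w capture.pcap -C 100 -W 10 tcp port 80
```

Rotated files can then be handed to the next stage (DDC, in the Datumize setup) for Deep Packet Inspection.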

Datumize Zentral (DZ) is a cloud management platform that covers the complete lifecycle of using Datumize products. Use DZ to create new data projects and resources, design and test data pipelines, configure and deploy infrastructure, and monitor your environments. You must use DZ to set up and configure any Datumize product.