Learn some concepts on Dark Data, data integration, data architecture and Datumize products. This section gives you both a quick overview to understand the broader picture and, more specifically, helps you understand and position each Datumize product and its function in an enterprise architecture.
Datumize is a software vendor in the data integration and edge computing space, established in 2014 in Barcelona, that unlocks data for digital transformation.
We offer low-cost and non-intrusive technology for data capture, featuring polling (request-response), network sniffing and deep packet inspection. Our product is lightweight, real-time and multi-platform, and provides edge computing, endpoint management and third-party software bundling. It enables data integration and edge use cases for complex and legacy data.
Datumize has created middleware for capturing data from heterogeneous and sophisticated data sources, doing edge computation, and ingesting into any target destination, while preserving full data privacy.
You can read more at www.datumize.com
Enterprise Data and Datumize
The concept of Enterprise Data has enough breadth and depth to fill several books. Still, it is useful to define the high-level picture of what enterprise data actually means to most companies and what comprises its life cycle. There is no single definition of enterprise data, or even of an enterprise data architecture, so take the classification here as one that is mostly accepted in the market. Depending on a vendor's alignment, the classification might look different, but you should get the point.
These are well-accepted areas of expertise (or capabilities) that any enterprise data architecture should define at some point, as summarized in the following table.
Data Strategy
According to the Dataversity definition, “Data Strategy describes a set of choices and decisions that together, chart a high-level course of action to achieve high-level goals. This includes business plans to use information to a competitive advantage and support enterprise goals.”
Reference Architecture, Maturity Model, Organization, Process, People
Data Governance
Informatica states that “Data governance encompasses the strategies and technologies used to make sure business data stays in compliance with regulations and corporate policies.”
Data lineage, ownership, distribution, compliance
Master Data Management (MDM)
The Wikipedia business definition explains that “master data management (MDM) is a method used to define and manage the critical data of an organization to provide, with data integration, a single point of reference. The data that is mastered may include reference data - the set of permissible values, and the analytical data that supports decision making.”
Source of truth, normalization, duplicate detection, master, linking
Data Warehousing (DWH)
Talend elaborates on this capability by stating that “a data warehouse is a large collection of business data used to help an organization make decisions. The concept of the data warehouse has existed since the 1980s, when it was developed to help transition data from merely powering operations to fueling decision support systems that reveal business intelligence.”
Data lake, data mining, OLTP, OLAP
Business Intelligence (BI)
Gartner states that “analytics and business intelligence (ABI) is an umbrella term that includes the applications, infrastructure and tools, and best practices that enable access to and analysis of information to improve and optimize decisions and performance.”
Aggregation, reporting, dashboard, KPI
Data Analytics
The definition given at Investopedia clarifies that “data analytics is the science of analyzing raw data in order to make conclusions about that information. Many of the techniques and processes of data analytics have been automated into mechanical processes and algorithms that work over raw data for human consumption.”
Machine Learning, correlation, regression, descriptive, predictive
Big Data Analytics
Although much related to the broader concept of data analytics, big data has gained enough interest and traction that IBM defines it as “Big data analytics is the use of advanced analytic techniques against very large, diverse data sets that include structured, semi-structured and unstructured data, from different sources, and in different sizes from terabytes to zettabytes.”
Petabyte, Zettabyte, Hadoop, NoSQL, R
Data Quality
Techopedia defines this capability as “Data quality is an intricate way of measuring data properties from different perspectives. It is a comprehensive examination of the application efficiency, reliability and fitness of data, especially data residing in a data warehouse.”
Cleansing, standardization, masking
Data Architecture
Wikipedia discusses that “in information technology, data architecture is composed of models, policies, rules or standards that govern which data is collected, and how it is stored, arranged, integrated, and put to use in data systems and in organizations. Data is usually one of several architecture domains that form the pillars of an enterprise architecture or solution architecture.”
TOGAF, conceptual, logical, physical
Data Integration
One of the most complete definitions is given by Gartner and states that “The discipline of data integration comprises the practices, architectural techniques and tools for achieving the consistent access and delivery of data across the spectrum of data subject areas and data structure types in the enterprise to meet the data consumption requirements of all applications and business processes.”
Datumize products are to be placed in this category.
Extract-Transform-Load (ETL), streaming, real-time, source, sink
Metadata Management
The international association of data management professionals (DAMA) defines metadata as “information about the physical data, technical and business processes, data rules and constraints, and logical and physical structures of the data, as used by an organisation.”
Catalog, dictionary, taxonomy, semantics, registry
These are the usual activities in the enterprise data life cycle:
Data Identification: this is the first and probably one of the most neglected activities, aimed at inventorying the data in your organization. Your data inventory should cover what the data describes and what type it is, how the data is generated and in what estimated volumes, and the technical formats involved (protocol, file type, etc.). Data inventories are usually found where some sort of data governance is in place; an inventory, however, can be as simple as a spreadsheet describing the different data sources available in the organization. Often, though, data sources are not documented and you need to “read the source code” - that is, have a look at actual real data.
Data Collection: immediately after identifying your relevant data sources, often heterogeneous and siloed, some collection (or integration) is needed to put the relevant pieces of data together. There are multiple techniques to collect data, including virtual collection, which creates a pointer to the data in their original source and state, and physical collection (extract-transform-load et al.), in which data is read at the source and ingested/sunk into a different place. Often, data collection includes additional activities such as cleansing, to avoid unnecessary data flow; mapping and format change, to adapt the data to the target location's needs; and support for multiple protocols, to understand the myriad of data formats that exist in the real world.
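The physical collection pattern described above can be sketched as a tiny extract-transform-load pipeline. This is a generic illustration, not Datumize code; the record fields and the cleansing rule are invented for the example.

```python
# Minimal extract-transform-load sketch: read records from a source,
# cleanse and reformat them, and ingest them into a target sink.

def extract(source):
    """Read raw records from a source (here, an in-memory list)."""
    yield from source

def transform(records):
    """Cleanse and map records into the format the target expects."""
    for rec in records:
        if rec.get("status") is None:   # cleansing: drop incomplete rows
            continue
        yield {"id": rec["id"], "status": rec["status"].upper()}

def load(records, sink):
    """Ingest the transformed records into the target (here, a list)."""
    sink.extend(records)

raw = [{"id": 1, "status": "ok"}, {"id": 2, "status": None}]
target = []
load(transform(extract(raw)), target)
# target now holds only the cleansed, reformatted record for id 1
```

In a real deployment the source would be a database, API or network capture, and the sink a queue, file or data platform; the three-stage shape stays the same.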
Data Aggregation: collected data tends to be verbose and fine-grained, which is mostly mandatory for detailed, automated/algorithmic analysis; human interpretation, however, needs summarization to understand complex concepts, and that means aggregation. The “best sellers” for aggregated data are counting, average, maximum and minimum; most business metrics can be defined in terms of these operations (and do not forget the median). The other usual concern about data aggregation is volume and performance: aggregating 1 million data points every 10 seconds is a hard problem.
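The four "best seller" aggregates can be maintained incrementally, in one pass and without storing the raw points, which is how high-volume streams are usually summarized. A minimal sketch (the sample values are invented):

```python
# Incremental aggregator for count, average, minimum and maximum.
# Each data point updates constant-size state, so memory use does not
# grow with the number of points - the key to aggregating large streams.

class RunningAggregate:
    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.minimum = float("inf")
        self.maximum = float("-inf")

    def add(self, value):
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def average(self):
        return self.total / self.count if self.count else 0.0

agg = RunningAggregate()
for point in [12.0, 7.5, 9.0, 21.5]:
    agg.add(point)
```

Note that the median is the exception: it cannot be maintained in constant space exactly, which is one reason it is so often forgotten in streaming metrics.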
Data Analysis: analysis tends to be considered the exciting phase, when advanced mathematics, statistics and business modeling (combined in what is known as data science) unleash all the expressiveness of data into unexpected (or not) insights and conclusions. Most of the time, analysis starts on a blank sheet of paper and requires a combination of techniques (classical and modern) to create meaningful results. It is well accepted in the market that analytical models can be classified along a timeline, defining descriptive models (describing the past and the present), predictive models (projecting the future based on past and present data), and prescriptive models (which involve some sort of knowledge/learning, usually related to Artificial Intelligence, to recommend actions for the future).
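The step from descriptive to predictive can be illustrated with the simplest possible model, a least-squares line fitted to past data and extrapolated one period ahead. The sales figures are invented for the example:

```python
# Toy descriptive-to-predictive step: fit a least-squares line to past
# observations, then extrapolate it to forecast the next period.

def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Descriptive: sales over four past periods. Predictive: period 5.
slope, intercept = fit_line([1, 2, 3, 4], [10.0, 12.0, 14.0, 16.0])
forecast = slope * 5 + intercept
```

Real predictive models are of course richer (seasonality, multiple features, machine learning), but they all share this shape: learn structure from past data, then project it forward.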
Data Visualization: sight is the dominant sense for most human beings. Complex information is therefore better transmitted using a visual framework, and data is no exception to this rule. Data visualization refers to the techniques and tooling used to transmit the metrics, insights and conclusions calculated during the analysis. Infographics, charts, reports and dashboards are examples of techniques used to represent data with varying degrees of detail and interactivity. The concept of drill-down is one major advantage of visualization tools: it allows humans to cope with summarized views of data and eventually dig into the details for further understanding.
The following diagram represents how Datumize is positioned in terms of the data journey. Dealing with Dark Data actually means coping with the space between traditional data integration for Information Technology (IT) and industrial data integration for Operational Technology (OT). Like any other data integration technology or product, Datumize's main mission is to ingest data into the enterprise data and analytics platform, but it targets special and sophisticated Dark Data sources, coming from network transactions, Internet of Things (IoT) devices, mobility devices (Wi-Fi, Bluetooth), industrial machines (OPC, Modbus) or highly distributed and disconnected environments.
Dark Data and other types of Data
Data is the raw material for almost every innovation project, but also the reason why many of these projects fail. According to McKinsey, almost 40% of large corporations are running smart data projects, and another 30% are planning to start one. IBM has reported that one third of business leaders don't trust the information they use to make decisions. And most surprisingly, 63% of the companies using Big Data have not been successful with data-driven insights, according to Dataversity.
One possible classification for data is made according to two factors:
Availability: whether the data are stored and made available for further usage. One could argue whether data collected and transmitted in streaming mode is stored or not; for the sake of simplicity, it is assumed that any collected data is stored, even streaming data, where some sort of temporal storage (e.g. Kafka) is used.
Usage: whether the data is actually used for any additional purpose. A typical use would include visualization in a chart.
The table below explains the scenarios that arise when these two factors are combined. This method of data classification is not novel; it is mostly inspired by the Johari Window analysis technique, widely used in intelligence, engineering and astrophysics. You can read more about the known knowns and related concepts here.
DATA COLLECTED, DATA USED
Transactional data - stored and used, typically needed for business and operational purposes. The most common and first data considered in analytics.
1-5% of overall company's data.
Examples: customer database (CRM), invoices database (ERP).
DATA COLLECTED, DATA NOT USED
Operational data - stored but not used, or seldom used. Usually found in departmental databases or siloed applications, these data have intrinsic value but usually take considerable effort to integrate.
5-10% of overall company's data.
Examples: maintenance orders (siloed ERP).
DATA NOT COLLECTED, DATA USED
Application data - data only managed and considered within the boundaries of the application/device that creates and uses the data. The data are not stored, and therefore no further usage (e.g. analytical) is feasible.
>50% of overall company's data.
Examples: availability inquiry (booking system), temperature sensor connected to an oven (IoT).
DATA NOT COLLECTED, DATA NOT USED
Unknown data - collateral data produced within the boundaries of the application/device, not used even for business or operational purposes, and not stored either.
>50% of overall company's data.
Examples: Wi-Fi Access Point noise in decibels (warehouse Wi-Fi network).
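The two-factor classification above amounts to a simple lookup over the availability and usage flags; a minimal sketch, with quadrant names taken from the table:

```python
# The availability/usage classification as a 2x2 lookup table.

def classify(collected: bool, used: bool) -> str:
    quadrants = {
        (True, True): "transactional",   # stored and used
        (True, False): "operational",    # stored but seldom used
        (False, True): "application",    # used only inside the app, not stored
        (False, False): "unknown",       # neither stored nor used
    }
    return quadrants[(collected, used)]
```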
Gartner defines Dark Data as “information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes.” At least 80% of data is dark data: the great hidden resource that flows untapped through major organizations. At Datumize, we define Dark Data as “data not collected and therefore not used.” According to the previous data classification, data not collected are considered to be dark data.
Dark Data Sources
Where are dark data to be found? According to the data classification proposed previously, which considers data against the availability and usage factors, as a rule of thumb “dark data is found whenever data is not collected.” Datumize products are aimed at data integration and work whenever the data are not being collected; for other cases, such as neglected operational data, other data governance and discovery products should be considered.
The following dark data sources are usually found in common scenarios. Please note that the list is not exhaustive.
Tons of in-transit data remain hidden and unleveraged due to the difficulty of collecting and processing temporary transactions flowing over the network (API, XML integrations, etc.) without the hassle of modifying the backend systems.
Environment
On-premises or cloud.
Backend systems (CRM, ERP, web, mobile, mainframe, middleware).
Integrations (API REST, WS, EDI, etc.).
Type of Data
Ephemeral or not stored transactions.
Any transaction not fully stored in a database/log.
Capture Technique
Network Sniffing and Deep Packet Inspection (DPI).
Protocol assembly (TCP, UDP, others).
Content parsing (XML, JSON, text, others).
Scriptable business logic (calculations).
Usage and Benefits
Understand what you are not doing.
Next best product.
Actual omnichannel customer behaviour.
Detect customer churn conditions.
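The content-parsing step of deep packet inspection can be illustrated with a toy example: given the reassembled bytes of a captured exchange, pull the business fields out of the XML payload. The payload shape (an availability inquiry) and its field names are invented for illustration; a real pipeline would sit on top of a sniffer such as tcpdump/libpcap and a protocol reassembly stage.

```python
# Toy DPI content parser: extract business fields from the XML payload
# of a captured, already-reassembled network transaction.
import xml.etree.ElementTree as ET

def parse_availability(payload: bytes) -> dict:
    """Turn a raw XML availability inquiry into a structured record."""
    root = ET.fromstring(payload.decode("utf-8"))
    return {
        "hotel": root.findtext("hotel"),
        "checkin": root.findtext("checkin"),
        "rooms": int(root.findtext("rooms")),
    }

# Bytes as they might appear in a captured request:
captured = (b"<inquiry><hotel>H123</hotel>"
            b"<checkin>2024-06-01</checkin>"
            b"<rooms>2</rooms></inquiry>")
record = parse_availability(captured)
```

The point of the technique is that the backend systems exchanging these messages are never modified: the transaction is observed on the wire, parsed, and turned into an analytical record.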
Wi-Fi, a ubiquitous technology present in almost every facility, generates a huge variety of data that remains poorly explored but can be used to deliver motion intelligence to companies.
Environment
Wi-Fi enabled premises.
Mission-critical Wi-Fi (e.g. warehouse).
Customer-experience Wi-Fi (e.g. hospitality, retail).
Type of Data
Position (X, Y).
Capture Technique
Wi-Fi active polling.
Access Point or controller.
Trilateration and fingerprint positioning (static environments).
Heuristics (speed, materials, no-route).
Artificial Intelligence (path finding).
Usage and Benefits
Measure & compare.
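The trilateration technique mentioned above can be sketched in two dimensions: given a device's distances to three access points at known coordinates, the position (X, Y) falls out of a small linear system. In practice the distances are estimated from signal strength and are noisy; here they are given directly, and all coordinates are invented for the example.

```python
# 2-D trilateration: subtracting the three circle equations pairwise
# cancels the quadratic terms and leaves a 2x2 linear system in (x, y).

def trilaterate(p1, p2, p3, d1, d2, d3):
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    a1, b1 = 2 * (x2 - x1), 2 * (y2 - y1)
    c1 = d1**2 - d2**2 + x2**2 - x1**2 + y2**2 - y1**2
    a2, b2 = 2 * (x3 - x2), 2 * (y3 - y2)
    c2 = d2**2 - d3**2 + x3**2 - x2**2 + y3**2 - y2**2
    det = a1 * b2 - a2 * b1          # solve by Cramer's rule
    return ((c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det)

# Device actually at (3, 4); access points at (0,0), (10,0) and (0,10).
x, y = trilaterate((0, 0), (10, 0), (0, 10), 5.0, 65 ** 0.5, 45 ** 0.5)
```

With noisy, signal-strength-derived distances, production systems combine this geometric core with the fingerprinting, heuristics and AI techniques listed above.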
Highly valuable operational data is trapped inside machines, devices and sensors. Vendor lock-in, proprietary protocols and lack of interoperability have prevented machine data from being shared and used to govern and unlock efficiencies.
Environment
Factories and remote equipment.
Control and automation systems.
Type of Data
Machine control operations and internal status.
Runtime metrics for internal variables.
Capture Technique
Network Sniffing and Deep Packet Inspection (DPI).
Polling, when no traffic exists (for disconnected machines).
Protocol assembly (OPC-DA, OPC-UA, Modbus, others).
Message reformat (XML, JSON).
Enterprise data platform message upload (MQTT, Kafka).
Usage and Benefits
Efficient and convenient machine data acquisition.
Leverage modern data platforms and data scientists.
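The decode-and-reformat steps in the machine data path can be sketched as follows: a raw register read (Modbus-style big-endian 16-bit words) is decoded into engineering units and reformatted as a JSON message ready for an MQTT or Kafka upload. The register layout, scaling and field names are invented for illustration; real layouts come from the machine vendor's register map.

```python
# Decode a raw two-register read and reformat it as a JSON message.
import json
import struct

def registers_to_message(raw: bytes, machine_id: str) -> str:
    # Assumed layout: register 0 = temperature in tenths of a degree C,
    # register 1 = spindle speed in RPM; both big-endian unsigned 16-bit.
    temp_raw, rpm = struct.unpack(">HH", raw)
    return json.dumps({
        "machine": machine_id,
        "temperature_c": temp_raw / 10.0,
        "rpm": rpm,
    })

# Simulate the bytes a register read would return (753 -> 75.3 C, 1450 RPM):
msg = registers_to_message(struct.pack(">HH", 753, 1450), "press-01")
```

The resulting JSON string is what would be published to the enterprise data platform; the machine itself only ever speaks its native register protocol.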
Unconsolidated IT landscapes with legacy systems and/or heterogeneous technologies are the main reason why companies still have data silos, which remain a big challenge for corporate intelligence.
Environment
Multiple remote and heterogeneous premises.
Different countries and regulations.
Type of Data
Siloed data in multiple formats.
Point of Sale.
REST API (sensors).
Capture Technique
Protocol details (e.g. Microsoft Access).
Message reformat (XML, JSON, CSV).
Upload to enterprise data platform (file, SFTP, queue).
Usage and Benefits
Understand your siloed information.
Real-time sales.
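The silo-consolidation reformat step can be sketched as follows: a Point of Sale export line in CSV is normalized into a JSON record before upload to the enterprise data platform. The column layout and field names are assumptions made for the example, not a real POS format.

```python
# Normalize one CSV line from a siloed Point of Sale export into JSON.
import csv
import io
import json

def pos_csv_to_json(line: str, country: str) -> str:
    # Assumed columns: store code, SKU, quantity, amount.
    store, sku, qty, amount = next(csv.reader(io.StringIO(line)))
    return json.dumps({
        "country": country,   # enrichment: tag the originating silo
        "store": store,
        "sku": sku,
        "quantity": int(qty),
        "amount": float(amount),
    })

record = pos_csv_to_json("BCN-01,SKU-9,2,19.90", country="ES")
```

Running the same normalization at every remote premise is what turns incompatible local formats into one consolidated, real-time sales feed.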
The Datumize product portfolio comprises several products, scoped in the data integration space, aimed at making dark data available to the enterprise.
Datumize Data Collector (DDC) is the core Datumize product. DDC is lightweight, high-performance, streaming, edge-based enterprise data integration software focused on data collection from hidden, complex and disparate data sources, most of the time unexplored. DDC captures the right data stream from dark data sources and processes it at the edge in real time to transform it, extract valuable information, and store actionable results in the next system. DDC is modular: every piece of functionality is a pluggable component, organized in a data pipeline that visually represents the data integration.
Datumize Data Dumper (DDD) is an ultra-efficient network packet collector aimed at capturing network traffic with minimal or no packet loss. DDD leverages native executables, such as tcpdump, and operating system kernel libraries dealing with the network stack, such as libpcap, for an efficient packet capture. DDD usually works in combination with DDC: DDD does the network sniffing, while DDC does the Deep Packet Inspection and further data processing.
Datumize Zentral (DZ) is a cloud management platform that covers the complete lifecycle of using Datumize products. Use DZ to create new data projects and resources, design and test data pipelines, configure and deploy infrastructure, and monitor your environments. You must use DZ to set up and configure any Datumize product.