Datumize Overview
Learn basic concepts about data, the edge and Datumize products: a quick overview of the broader picture and of how Datumize products fit into an enterprise architecture.
About Datumize
Datumize is a technology vendor, established in 2014 in Barcelona, working in the data integration and edge computing space to unlock data for digital transformation. Datumize captures data from sophisticated data sources using our unique middleware, optionally embedded in our low-cost industrial hardware, computes on the edge, and ingests into any destination while preserving full data privacy.
We take a novel approach to data capture (passive network sniffing, legacy/proprietary protocols) and edge computing (multi-platform agent, custom business logic, edge hardware). Datumize enables advanced use cases for machine data (legacy equipment to cloud, industrial IoT) and edge computing (remote control, AI on the edge), backed by the credibility of world-class clients that lead their industries.
You can read more at www.datumize.com.
Enterprise Data, Edge Computing and Datumize
The concept of Enterprise Data has enough breadth and depth to fill several books. Still, it is useful to sketch the high-level picture of what enterprise data actually means to most companies and what the enterprise data life cycle comprises. There is no single definition of enterprise data, or even of an enterprise data architecture, so take the classification here as one that is mostly accepted in the market. Depending on a vendor's alignment the classification might look different, but you should get the point.
These are well-accepted areas of expertise (or capabilities) that any enterprise data architecture should define at some point, as summarized in the following table.
Capability | Description | Keywords |
---|---|---|
Data Strategy | According to the Dataversity definition, “Data Strategy describes a set of choices and decisions that together, chart a high-level course of action to achieve high-level goals. This includes business plans to use information to a competitive advantage and support enterprise goals.” | Reference Architecture, Maturity Model, Organization, Process, People |
Data Governance | Informatica defines it as follows: “Data governance encompasses the strategies and technologies used to make sure business data stays in compliance with regulations and corporate policies.” | Data lineage, ownership, distribution, compliance |
Master Data Management (MDM) | The Wikipedia business definition explains that “master data management (MDM) is a method used to define and manage the critical data of an organization to provide, with data integration, a single point of reference. The data that is mastered may include reference data - the set of permissible values, and the analytical data that supports decision making.” | Source of truth, normalization, duplicate detection, master, linking |
Data Warehousing (DWH) | Talend elaborates on this capability by stating that “a data warehouse is a large collection of business data used to help an organization make decisions. The concept of the data warehouse has existed since the 1980s, when it was developed to help transition data from merely powering operations to fueling decision support systems that reveal business intelligence.” | Data lake, data mining, OLTP, OLAP |
Business Intelligence (BI) | Gartner states that “analytics and business intelligence (ABI) is an umbrella term that includes the applications, infrastructure and tools, and best practices that enable access to and analysis of information to improve and optimize decisions and performance.” | Aggregation, reporting, dashboard, KPI |
Data Analytics | The definition given at Investopedia clarifies that “data analytics is the science of analyzing raw data in order to make conclusions about that information. Many of the techniques and processes of data analytics have been automated into mechanical processes and algorithms that work over raw data for human consumption.” | Machine Learning, correlation, regression, descriptive, predictive |
Big Data Analytics | Although much related to the broader concept of data analytics, big data has gained enough interest and traction that IBM defines it as “Big data analytics is the use of advanced analytic techniques against very large, diverse data sets that include structured, semi-structured and unstructured data, from different sources, and in different sizes from terabytes to zettabytes.” | Petabyte, Zettabyte, Hadoop, NoSQL, R |
Data Quality | Techopedia defines this capability as “Data quality is an intricate way of measuring data properties from different perspectives. It is a comprehensive examination of the application efficiency, reliability and fitness of data, especially data residing in a data warehouse.” | Cleansing, standardization, masking |
Data Architecture | Wikipedia discusses that “in information technology, data architecture is composed of models, policies, rules or standards that govern which data is collected, and how it is stored, arranged, integrated, and put to use in data systems and in organizations. Data is usually one of several architecture domains that form the pillars of an enterprise architecture or solution architecture.” | TOGAF, conceptual, logical, physical |
Data Integration | One of the most complete definitions is given by Gartner, which states that “The discipline of data integration comprises the practices, architectural techniques and tools for achieving the consistent access and delivery of data across the spectrum of data subject areas and data structure types in the enterprise to meet the data consumption requirements of all applications and business processes.” Datumize products are to be placed in this category. | Extract-Transform-Load (ETL), streaming, real-time, source, sink |
Metadata Management | The international association of data management professionals (DAMA) defines metadata as “information about the physical data, technical and business processes, data rules and constraints, and logical and physical structures of the data, as used by an organisation.” | Catalog, dictionary, taxonomy, semantics, registry |
These are the usual activities in the enterprise data life cycle:
Data Identification: this is the first and probably one of the most neglected activities, aimed at inventorying the data in your organization. Your data inventory should cover what the data describes and what type it is, how the data is generated and in what estimated volumes, and the technical formats involved (protocol, file type, etc.). Data inventories are usually found where some sort of data governance is in place; an inventory, however, can be as simple as a spreadsheet describing the different data sources available in the organization. Often, though, data sources are not documented and you need to “read the source code” - that is, look at the actual data.
Data Collection: immediately after identifying your relevant data sources, which are often heterogeneous and siloed, some collection (or integration) is needed to bring the relevant pieces of data together. There are multiple techniques to collect data, including virtual collection, which creates a pointer to the data in its original source and status, and physical collection (extract-transform-load et al.), in which data is read at the source and ingested/sunk into a different place. Data collection often includes additional activities such as cleansing (to avoid unnecessary data flow), mapping and format change (to adapt the data to the target location's needs), and support for multiple protocols (to understand the myriad of data formats that exist in the real world).
Data Aggregation: collected data tends to be verbose and fine grained, which is mostly what a detailed, automated/algorithmic analysis requires; human interpretation, however, needs summarization to grasp complex concepts, and that means aggregation. The “best sellers” for aggregated data are count, average, maximum and minimum; most business metrics can be defined in terms of these operations (and do not forget the median). The other usual concern about data aggregation is volume and performance: aggregating one million data points every 10 seconds is a hard problem (see the sketch after this list).
Data Analysis: analysis tends to be considered the exciting phase, when advanced mathematics, statistics and business modeling (combined in what is known as data science) unleash all the expressiveness of data into unexpected (or not) insights and conclusions. Most of the time, analysis starts on a blank sheet of paper and requires a combination of techniques (classical and modern) to produce meaningful results. It is well accepted in the market that analytical models can be classified along a timeline: descriptive models (describing the past and current time), predictive models (projecting the future based on past and present data), and prescriptive models (which involve some sort of knowledge or learning, usually related to Artificial Intelligence, and recommend actions for the future).
Data Visualization: sight is the dominant sense for most human beings. Complex information therefore tends to be better transmitted using a visual framework, and data is no exception to this rule. Data visualization refers to the techniques and tooling used to communicate the metrics, insights and conclusions calculated during analysis. Infographics, charts, reports and dashboards are examples of techniques used to represent data with varying degrees of detail and interactivity. The drill-down is one major advantage of visualization tools: it allows humans to work with summarized views of data and eventually dig into the details for further understanding.
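As a minimal illustration of the aggregation activity, the sketch below (plain Python; the timestamps, values and 10-second window are illustrative) maintains count, average, minimum and maximum over tumbling windows with constant work per data point:

```python
from dataclasses import dataclass

@dataclass
class WindowAggregate:
    """Running count/avg/min/max over one tumbling window."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def add(self, value: float) -> None:
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def average(self) -> float:
        return self.total / self.count if self.count else 0.0

def aggregate(points, window_seconds=10):
    """Group (unix_ts, value) points into tumbling windows.

    Each point does O(1) work, which is what makes a million points
    every 10 seconds tractable on modest hardware.
    """
    windows = {}
    for ts, value in points:
        key = int(ts // window_seconds)          # window start bucket
        windows.setdefault(key, WindowAggregate()).add(value)
    return windows

# Hypothetical readings spanning two 10-second windows.
sample = [(0.5, 10.0), (3.2, 14.0), (12.9, 7.0)]
for key, agg in sorted(aggregate(sample).items()):
    print(key * 10, agg.count, agg.average, agg.minimum, agg.maximum)
```

Because each incoming point only updates four numbers per window, the same pattern scales towards the million-points-per-window case, especially once the stream is sharded by key.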
Usually, Edge Computing is not considered part of the Enterprise Data journey. However, modern architectures and trends in computation are gearing (again) towards the edge. “Again” because the history of computing reminds us that data and computation have been moving back and forth between central and distributed locations; edge computing is not new at all. What is new, however, is the availability of low-cost computing devices that make this type of distributed processing affordable for many organizations. The edge contains lots of untapped opportunities for data and computation; that is why Datumize provides features to combine data integration and edge computing, including the ability to manage the endpoint, run 3rd party software (e.g. containers or AI models) or even run custom business logic in scripted languages, as sketched below.
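As a rough sketch of that last capability (this is not Datumize's actual scripting interface, just the general pattern), the snippet below compiles a user-supplied rule and evaluates it against readings on the edge device, forwarding only the exceptions so most data never leaves the site:

```python
# Minimal sketch of custom business logic at the edge: evaluate a
# user-supplied rule per reading and forward only the exceptions.
# The rule syntax and the forward() target are illustrative only.

RULE = "temperature > 80 and pressure < 1.0"   # hypothetical scripted rule

def compile_rule(expression):
    """Compile the rule once; evaluated against each reading's fields."""
    code = compile(expression, "<rule>", "eval")
    return lambda reading: bool(eval(code, {"__builtins__": {}}, reading))

def forward(reading):
    # Stand-in for the real upload channel (MQTT, Kafka, HTTPS...).
    print("forwarding:", reading)

def run_edge_agent(readings, rule=RULE):
    matches = compile_rule(rule)
    for reading in readings:
        if matches(reading):       # only exceptions leave the edge
            forward(reading)

# Hypothetical readings from a local sensor bus.
run_edge_agent([
    {"temperature": 85.2, "pressure": 0.7},   # forwarded
    {"temperature": 20.1, "pressure": 1.1},   # filtered out on the edge
])
```

The design point is bandwidth: the edge device sees every reading, but only the interesting ones travel to the central platform.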
The following diagram represents how Datumize is positioned along the data journey. Dealing with data actually means operating in the space between traditional data integration for Information Technology (IT) and industrial data integration for Operational Technology (OT).
As a product in the data integration space, Datumize's main mission is to ingest data into the enterprise data & analytics platform. However, the special and sophisticated data sources needed for modern transformational projects usually require coping with network transactions, Internet of Things (IoT), mobility devices (Wi-Fi, Bluetooth), industrial machines (OPC, Modbus) or highly distributed and disconnected environments.
Data and other types of Data
Data is the raw material for almost every innovation project, but also the reason why many of these projects fail. According to McKinsey, almost 40% of large corporations are running smart data projects, and another 30% are planning to start one. IBM has reported that one third of business leaders don’t trust the information they use to make decisions. And most surprisingly, 63% of the companies using Big Data have not been successful at producing data-driven insights, according to Dataversity.
One possible classification for data is based on two factors:
Availability, whether the data is stored and made available for further usage. One could argue whether data collected and transmitted in streaming mode is stored or not; for the sake of simplicity, it is assumed here that any collected data is stored, even streaming data for which some sort of temporary storage (e.g. Kafka) is used.
Usage, indicating whether the data is actually used for any additional purpose. A typical use would be visualization in a chart.
The table below explains the scenarios that arise when these two factors are combined. This method of data classification is not novel: it is mostly inspired by the Johari Window analysis technique, widely used in intelligence, engineering and astrophysics. You can read more about the known knowns and related concepts.
 | DATA USED | DATA NOT USED |
---|---|---|
DATA COLLECTED | Transactional data - stored and used, typically needed for business and operational purposes. The most common and first data considered in analytics. 1-5% of overall company’s data. Examples: customer database (CRM), invoices database (ERP). | Operational data - stored but not used, or seldom used. Usually found in departmental databases or siloed applications, these data have intrinsic value but usually take considerable effort to integrate. 5-10% of overall company’s data. Examples: maintenance orders (siloed ERP). |
DATA NOT COLLECTED | Application data - data only managed and considered within the boundaries of the application/device that creates and uses it. The data is not stored, so no further usage (e.g. analytical) is feasible. >50% of overall company’s data. Examples: availability inquiry (booking system), temperature sensor connected to an oven (IoT). | Unknown data - collateral data produced within the boundaries of the application/device, not used even for business or operational purposes and not stored either. >50% of overall company’s data. Examples: Wi-Fi Access Point noise in decibels (warehouse Wi-Fi network). |
Gartner defines Dark Data as “information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes.” At least 80% of all data is dark data: a great hidden resource that flows untapped through major organizations. At Datumize, we define Dark Data as “data not collected and therefore not used”; according to the previous classification, data not collected is considered dark data.
Data Sources
Where are the most valuable data to be found? According to the data classification proposed previously, which considers data against the availability and usage factors, Datumize concentrates on gathering data in the "not collected" areas, because these areas usually contain sophisticated data that is not straightforward to capture. Datumize products are aimed at data integration and work wherever data is not being collected; for other cases, such as neglected operational data, other data governance and discovery products should be considered.
The following data sources are found in common scenarios. Please note that the list is not exhaustive.
Network Transactions
Tons of in-transit data remain hidden and unleveraged due to the difficulty of collecting and processing temporary transactions flowing over the network (API, XML integrations, etc.) without the hassle of modifying the backend systems.
Data Source | Type of Data | Ingestion Technique | Preparation Steps | Usage and Benefits |
---|---|---|---|---|
Data center (on-premises or cloud); backend systems (CRM, ERP, web, mobile, mainframe, middleware); integrations (REST API, WS, EDI, etc.) | Ephemeral or non-stored transactions: searches, lookups, any transaction not fully stored in a database/log | Network sniffing and Deep Packet Inspection (DPI): real-time, software-based, physical/virtual, data-drift aware | Protocol assembly (TCP, UDP, others); content parsing (XML, JSON, text, others); scriptable business logic (calculations) | Understand what you are not doing: missing sales, next best product, actual omnichannel customer behaviour, customer churn conditions |
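To make the ingestion technique concrete, here is a minimal passive-capture sketch built on the open-source scapy library. It is not Datumize's DPI engine; the port, capture filter and payload heuristics are assumptions for illustration:

```python
# Minimal passive-capture sketch with scapy (pip install scapy).
# Not Datumize's DPI engine -- just the general technique: observe
# transactions on the wire without modifying the backend systems.
# Requires root/administrator privileges to open the interface.
from scapy.all import sniff, TCP, Raw

def handle(packet):
    """Inspect each TCP payload. A real DPI pipeline would reassemble
    the TCP stream and parse whole XML/JSON transactions out of it."""
    if packet.haslayer(TCP) and packet.haslayer(Raw):
        payload = bytes(packet[Raw].load)
        if payload.startswith((b"GET ", b"POST ")):
            request_line = payload.split(b"\r\n", 1)[0]
            print(packet[TCP].sport, "->", packet[TCP].dport,
                  request_line.decode(errors="replace"))

# Port 8080 is an assumption for an internal API integration.
sniff(filter="tcp port 8080", prn=handle, store=False)
```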
Wi-Fi Technology
Wi-Fi, a ubiquitous technology present in almost every facility, generates a huge variety of data that remains poorly explored but can be used to deliver motion intelligence to companies.
Data Source | Type of Data | Ingestion Technique | Preparation Steps | Usage and Benefits |
---|---|---|---|---|
Wi-Fi enabled premises: mission-critical Wi-Fi (e.g. warehouse), customer-experience Wi-Fi (e.g. hospitality, retail) | Mobility events: position (X, Y), movement A→B, distance | Wi-Fi active polling (SNMP, RTLS-like, REST API) against the access point or controller | Trilateration and fingerprint positioning (static environments); heuristics (speed, materials, no-route); Artificial Intelligence (path finding) | Optimize operations (minimize distance, measure & compare); customer experience (cross-selling, minimize queues) |
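As a toy example of the trilateration step named above, the sketch below solves for a device position given distances (e.g. derived from RSSI) to three access points at known coordinates; all coordinates and distances are made up:

```python
# Toy trilateration sketch: estimate a device's (x, y) position from
# distances to three Wi-Fi access points at known coordinates.
# In practice the distances come from noisy RSSI estimates, hence the
# least-squares solve rather than an exact intersection.
import numpy as np

def trilaterate(anchors, distances):
    """Linearize (x-xi)^2 + (y-yi)^2 = di^2 by subtracting the first
    equation from the others, then solve the resulting linear system."""
    (x0, y0), d0 = anchors[0], distances[0]
    A, b = [], []
    for (xi, yi), di in zip(anchors[1:], distances[1:]):
        A.append([2 * (xi - x0), 2 * (yi - y0)])
        b.append(d0**2 - di**2 + xi**2 - x0**2 + yi**2 - y0**2)
    pos, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return pos

aps = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]   # AP coordinates (metres)
dists = [7.07, 7.07, 7.07]                      # device roughly at (5, 5)
print(trilaterate(aps, dists))                  # ~ [5. 5.]
```

In a real deployment the distances are noisy, which is why the heuristics and fingerprinting steps listed in the table matter.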
Industrial Networks
Highly valuable operational data is trapped inside machines, devices and sensors. Vendor lock-in, proprietary protocols and lack of interoperability have prevented machine data from being shared and used to govern and unlock efficiencies.
Data Source | Type of Data | Ingestion Technique | Preparation Steps | Usage and Benefits |
---|---|---|---|---|
Factories and remote equipment: industrial machines, control and automation systems | Machine control operations and internal status; runtime metrics for internal variables | Network sniffing and Deep Packet Inspection (DPI); active polling where traffic doesn’t exist or machines are disconnected | Protocol assembly (OPC-DA, OPC-UA, Modbus, others); message reformat (XML, JSON); upload to the enterprise data platform (MQTT, Kafka) | Efficient and convenient machine data acquisition (no overhead, custom protocols); leverage modern data platforms and data scientists |
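A minimal sketch of the active-polling path, assuming the open-source pymodbus and paho-mqtt libraries and a hypothetical register map, could look like this:

```python
# Sketch of the active-polling path for a machine whose traffic cannot
# be sniffed: read holding registers over Modbus/TCP and publish them
# as JSON to an MQTT broker (pip install pymodbus paho-mqtt). The PLC
# address, register map and topic are hypothetical.
import json
import time

import paho.mqtt.client as mqtt
from pymodbus.client import ModbusTcpClient

MACHINE = "192.0.2.10"          # hypothetical PLC address
BROKER = "broker.example.com"   # hypothetical MQTT broker
TOPIC = "factory/line1/plc"

plc = ModbusTcpClient(MACHINE, port=502)
plc.connect()

bus = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)  # paho-mqtt 2.x
bus.connect(BROKER, 1883)
bus.loop_start()                # background network thread

while True:
    # Registers 0-1 are assumed to hold temperature and speed; the
    # `slave` keyword name varies across pymodbus versions.
    result = plc.read_holding_registers(address=0, count=2, slave=1)
    if not result.isError():
        temperature, speed = result.registers
        bus.publish(TOPIC, json.dumps({
            "ts": time.time(),
            "temperature": temperature,
            "speed": speed,
        }))
    time.sleep(10)              # poll every 10 seconds
```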
Distributed Locations
Unconsolidated IT landscapes with legacy systems and/or heterogeneous technologies are the main reason why companies still have data silos, which remain a big challenge for corporate intelligence.
Data Source | Type of Data | Ingestion Technique | Preparation Steps | Usage and Benefits |
---|---|---|---|---|
Multiple remote and heterogeneous premises (retail shop, branch office, franchise), often across different countries and regulations | Siloed data in multiple formats: Point of Sale, sensors, Wi-Fi, ad-hoc hardware | Active polling: databases (POS), REST API (sensors), SNMP (Wi-Fi) | Protocol details (e.g. Microsoft Access); message reformat (XML, JSON, CSV); upload to the enterprise data platform (file, SFTP, queue) | Understand your siloed information: real-time sales, lost sales, operational inefficiencies |
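A minimal sketch of this pattern, with sqlite3 standing in for the real POS database and the open-source paramiko library handling the SFTP upload (host, credentials and schema are placeholders):

```python
# Sketch of the distributed-location pattern: poll the local POS
# database, reformat rows to CSV, and upload the file to a central
# SFTP drop (pip install paramiko). All names are placeholders.
import csv
import sqlite3

import paramiko

def export_sales(db_path, csv_path):
    """Dump today's sales rows from the local POS database to CSV."""
    with sqlite3.connect(db_path) as db:
        rows = db.execute(
            "SELECT ticket_id, item, amount FROM sales "
            "WHERE date(sold_at) = date('now')"
        ).fetchall()
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["ticket_id", "item", "amount"])
        writer.writerows(rows)

def upload(csv_path, remote_path):
    """Push the CSV to the central data platform over SFTP."""
    transport = paramiko.Transport(("sftp.example.com", 22))
    transport.connect(username="shop042", password="***")
    sftp = paramiko.SFTPClient.from_transport(transport)
    sftp.put(csv_path, remote_path)
    sftp.close()
    transport.close()

export_sales("pos.db", "sales.csv")
upload("sales.csv", "/ingest/shop042/sales.csv")
```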