Overview for Datumize Data Dumper (DDD). If you are new to the product, you are more than encouraged to read it through.

Background

Dark Data was first located in the wire (literally) of an Ethernet network. It was the year 2004, and a consultant (we will omit the names in this story) was called in because of a performance issue in a healthcare backend system. One of the main functions of this system was to schedule primary visits to doctors in a health center; even with three employees at the front desk, the system was so slow that patients had to queue for hours and a security service had to be brought in to prevent physical violence against those poor clerks. Everything was apparently fine in the data center - database, communications, application server, application, web server - and the customer was really anxious because these problems were being repeatedly highlighted in the press.

At some point, however, the consultant dared to ask "how about the connection between the health center and the data center?". As nobody knew the answer, the consultant went to the health center and, accompanied by the local IT manager, entered the communications room and did something that would prove inspiring years later: recorded the traffic between the front-desk computers and the server application in the remote data center. Once a version of Wireshark was used to analyze the HTTP traffic back and forth, the problem became clear: the web page for scheduling a new doctor's appointment weighed 4 MB, while the ADSL Internet line provided 4 Mbps (roughly 0.5 MB per second). Loading each page took 8 seconds, and loading it on the 4 desktops at nearly the same time took 32 seconds. The line was upgraded to a more decent bandwidth, and the performance issue was gone - patients were happy again and the security service was withdrawn.
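The arithmetic behind those figures can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the function name is made up for illustration, and it assumes an ideal link with no latency or protocol overhead, shared evenly by the four desktops.

```python
def transfer_seconds(payload_megabytes, link_megabits_per_second):
    """Ideal time to move a payload over a link (no latency, no overhead)."""
    # A link rated in megabits per second moves one eighth of that in megabytes.
    megabytes_per_second = link_megabits_per_second / 8
    return payload_megabytes / megabytes_per_second


# Figures taken from the story: a 4 MB page over a 4 Mbps ADSL line.
page_seconds = transfer_seconds(4, 4)

# Four desktops loading the page at once share the same line.
all_desks_seconds = 4 * page_seconds

print(page_seconds, all_desks_seconds)  # prints "8.0 32.0"
```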

Overview

The Dark Data in the story came from network transactions. Only through careful examination of the network traffic and reconstruction of the conversations did the hidden metrics surface and become evident. In this specific case, the detailed network timestamps for each packet demonstrated that responses were queued up due to bandwidth congestion. This simple yet powerful idea remained in the shadows until 2014, when packet inspection inspired Datumize to create its first product: Datumize Data Collector (DDC).

Datumize Data Dumper (DDD) is a Datumize product aimed at capturing network packets very efficiently at a deep operating system level with minimal-to-no packet loss. It usually works in combination with DDC: DDD manages the segmentation, filtering and temporary persistence of network packets in PCAP files, while DDC efficiently picks up the segmented files for further processing. The diagram below represents DDD in action, receiving network packets from the operating system, then capturing, filtering and structuring the output into binary files that will later be processed by Datumize Data Collector (DDC).


Technology

Datumize Data Dumper is a software component that uses libpcap and tcpdump to capture network packets while they are in the memory of the operating system, applies filters to select just the traffic needed, and stores the packets in PCAP files while minimizing the overall packet loss. Some important concepts to keep in mind:

  • libpcap: open source library used to intercept packets at operating system level. It works in user space and performs very well from a software perspective. If you need to capture extreme bandwidth, you either go for dedicated hardware (appliance) or use different libraries working in kernel space; Datumize uses libpcap because it runs on multiple standard operating systems.
  • tcpdump: a very handy open source program to capture, filter and store packets. DDD wraps it and adds extra goodies.
  • BPF filter: the network filtering syntax, extremely flexible and powerful. Filters are written in the Berkeley Packet Filter (BPF) syntax.
  • Operating system: although libpcap is portable and there are Unix and Windows versions, Linux tends to be more robust and minimizes packet loss.
  • PCAP: this binary format supports the storage of network packets for further analysis. DDD supports this format for persisting the selected packets.
  • Storage: the output files are organized in a directory, and pcap output files have different partitioning options.
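To make these concepts more concrete, the sketch below assembles a plain tcpdump command line that captures from an interface, applies a BPF filter and writes time-partitioned PCAP files into a directory. It is not how DDD invokes tcpdump internally (the source does not describe that); the function name, interface, filter expression, directory and file-name pattern are all hypothetical, and only standard tcpdump options (-i, -n, -s, -G, -w) are used. The script only builds and prints the command; it does not run it.

```python
import shlex


def build_tcpdump_command(interface, bpf_filter, output_dir,
                          rotate_seconds=60, snaplen=0):
    """Return an argv list for a rotating tcpdump capture (illustrative only)."""
    # The file name carries a strftime pattern: with -G, tcpdump expands it on
    # every rotation, so each time slice lands in its own PCAP file.
    output_pattern = f"{output_dir.rstrip('/')}/capture-%Y%m%d-%H%M%S.pcap"
    return [
        "tcpdump",
        "-i", interface,            # network interface to capture from
        "-n",                       # do not resolve addresses to names
        "-s", str(snaplen),         # snapshot length; 0 selects tcpdump's default
        "-G", str(rotate_seconds),  # start a new output file every N seconds
        "-w", output_pattern,       # write raw packets to PCAP files
        bpf_filter,                 # BPF expression selecting the traffic to keep
    ]


if __name__ == "__main__":
    argv = build_tcpdump_command(
        interface="eth0",                            # hypothetical interface
        bpf_filter="tcp port 80 and host 10.0.0.5",  # hypothetical BPF filter
        output_dir="/var/tmp/ddd-example",           # hypothetical directory
    )
    # Print a shell-safe version of the command instead of executing it,
    # since capturing usually requires elevated privileges.
    print(shlex.join(argv))
```

Passing the command as an argument list rather than a single string keeps the BPF expression as one argument, so spaces and parentheses in the filter need no extra shell quoting.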

Management

Datumize Zentral (DZ) is the place to manage Datumize Data Dumper and any other Datumize product. All configuration, deployment and monitoring activities must be done through Datumize Zentral.