A step-by-step tutorial for capturing live web service traffic with Datumize Data Dumper (DDD), processing each web service call with Datumize Data Collector (DDC) to extract metrics, and producing a file with the persisted data ready to be visualized. This tutorial introduces the basic concepts you need to understand how to use Datumize products to exploit network traffic through network sniffing and deep packet inspection.
Overview

This tutorial is currently supported only on Java 11.

The Sniffing Web Services Sandbox contains a single source: the files generated from the traffic capture. The capture contains Web Service traffic produced by a client-server API; keep in mind that this traffic comes from a generator and not from real web services. From these capture files the DDC is then able to parse the traffic into something understandable for human beings.




Objectives

In this tutorial you will learn how to: 

  • Use Datumize Data Dumper (DDD) to capture network traffic.
  • Design a pipeline that processes network traffic captured in PCAP format.
  • Parse data from Web Services.
  • Sink the data to a file in CSV format. 




Requirements

Make sure to comply with the following requirements.

List of requirements

Understand Datumize Sandbox

If you are not familiar with it, please check the Datumize Sandbox section.

Download and Start Datumize Sandbox

Download the Datumize Traffic Generator traffic-generator.zip

  1. Unzip the traffic-generator.zip
  2. Change into the generator folder and run the start script:

    bash start-traffic-generator.sh
    BASH
  3. The traffic generator is now up and running. You can verify this as shown below.
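
Before moving on, you can optionally confirm that the generator is running. This is a minimal sketch; the process name matched below and port 8090 are assumptions based on the generator script name and on the Host header shown later in this tutorial:

# list any running process whose command line mentions the generator script
pgrep -fl traffic-generator

# check that something is listening on the port used by the sample API (8090)
sudo ss -tlnp | grep 8090
CODE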

Check Web Service traffic generation

Check the traffic present on this machine. In this scenario, real Web Service traffic is simulated from one port on localhost to another port on localhost. To check the current traffic, execute tcpdump:

sudo tcpdump -i any -A -vvv | grep POST
CODE

The output will look something like this:

.	...	..POST /api/call HTTP/1.1
.	.v.	.vPOST /api/call HTTP/1.1
.	...	..POST /api/call HTTP/1.1
.	.v.	.vPOST /api/call HTTP/1.1
.	...	..POST /api/call HTTP/1.1
CODE

This shows that there is traffic on the machine; you will be able to capture it with the DDD and process it with the DDC.
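
Optionally, instead of grepping for POST lines you can restrict tcpdump to the port the generator uses, which makes the full request and response dialogs easier to follow. Port 8090 is an assumption taken from the Host header of the sample capture shown later in this tutorial:

sudo tcpdump -i any -A 'tcp port 8090'
CODE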


Configure and Deploy DDD

In order to configure a DDD you must first create a machine.

First, create a new Deployment plan, then go to the Infrastructure tab and click Add machine. Once your machine is created and bootstrapped, you can proceed with configuring the DDD.

Bootstrap the machine as shown in the Bootstrap Datumize software installation.

Click the Add instance button:

Select DDD and give your DDD instance a name.


Other fields can be left at their default values.

The Deploy machine button should now be activated.

Once deployed, and after clicking Apply Changes, your DDD instance should look like this:




Check output of DDD

The main Datumize folder is always /opt/datumize/, so everything can be found under it. Check the generated pcaps:

ls -alt /opt/datumize/pcap

-rw-r--r-- 1 datumize datumize  778070 May 22 15:18 2020-05-22_15-18-08.pcap
-rw-r--r-- 1 datumize datumize  811149 May 22 15:18 2020-05-22_15-17-48.pcap
-rw-r--r-- 1 datumize datumize  766058 May 22 15:17 2020-05-22_15-17-28.pcap
-rw-r--r-- 1 datumize datumize  812000 May 22 15:17 2020-05-22_15-17-08.pcap
-rw-r--r-- 1 datumize datumize  774565 May 22 15:17 2020-05-22_15-16-48.pcap
-rw-r--r-- 1 datumize datumize 1042549 May 22 15:16 2020-05-22_15-16-28.pcap
CODE
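
To confirm that these files really contain the web service dialogs, you can read one of them back offline with tcpdump. Pick any of the file names listed above; the one below is just an example taken from that listing:

# read a capture file offline and show the first POST request lines
tcpdump -nn -A -r /opt/datumize/pcap/2020-05-22_15-18-08.pcap | grep POST | head
CODE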


If your DDC pipeline is not deployed and running, the DDD will keep filling the output folder with pcaps; you can stop the DDD instance and restart it later.


If you want to fully understand how the traffic capture works under the hood, the following optional section shows how to start the DDD manually.

  1. This step is normally done by Chef through the Zentral deployment step and is optional; it is recommended to skip ahead to Configure and Deploy DDD. As an optional step, you can start the DDD manually by executing this:

    # /opt/datumize/dtzdump/dtzdump.sh start sandbox
    
    DTZ_DUMP_HOME defined: /opt/datumize/dtzdump
    VARIABLES
    Instance: 
    NIC: any
    User: root:root
    Filter: 
    Rotate: 20
    Size: 5120
    Extra: 
    Buffer size: 8192
    RAM file: /opt/datumize/pcap
    Sleep: 5
    Log: /opt/datumize/dtzdump/log/dtzdump.log
    Output: /opt/datumize/pcap
    Backup: /opt/datumize/dtzdump/backup
    Percent Limit: 50
    Capture started [116]
    Processing started [123]
    
    
    CODE
  2. After that, check that the pcaps are being captured: 

    # ls -alt /opt/datumize/pcap/in
    
    -rw-r--r-- 1 root root 498309 Feb  6 11:26 2020-02-06_11-26-34.pcap
    drwxr-xr-x 2 root root   4096 Feb  6 11:26 .
    -rw-r--r-- 1 root root 782681 Feb  6 11:26 2020-02-06_11-26-14.pcap
    -rw-r--r-- 1 root root 774934 Feb  6 11:26 2020-02-06_11-25-54.pcap
    drwxr-xr-x 1 root root   4096 Feb  6 11:25 ..
    CODE
  3. If you open one of the capture files with less, you can see the traffic inside. Execute this command: 

    less 2020-02-06_11-26-34.pcap
    CODE
  4. You will see something similar to the following example, in which you can identify the request and response headers of the web service traffic. The body of this request is not readable because it is gzipped, which demonstrates the kind of traffic the DDC is able to process: 

    POST /api/call HTTP/1.1
    id: d789550d4c37-2-1580988734948-ovwc
    Content-Length: 1176
    Content-Encoding: gzip
    Host: localhost:8090
    Connection: Keep-Alive
    User-Agent: Apache-HttpClient/4.5.2 (Java/1.8.0_242)
    Accept-Encoding: gzip,deflate
    
    ^_<8B>^H^@^@^@^@^@^@^@<AD>Z]n<9B>G^L<BC><8A><A1><F7>&˿<E5>^R<90><9D><B7><9E> =<80>^Z<A9><85>^AG*,<DB>ho<9F>5
    '^AJ<BA><FC><88>}<93>^M<D9><C4>r<C9><E1><CC>p<F7><9F><FE><FE><FA>p<F3>rz<BC><DE>_η;<F8><D0>v7<A7><F3><97><CB><F1><FE><FC><E7><ED><EE><B7>Ͽ<FE>2v7ק<C3><F9>xx<B8><9C>O<B7><BB>^?N<D7>ݧ<BB><FD><E5><AF><D3><E3><E1>i<FE><CD><F5><A7><CF>7^?<<9F><BF><DC><EE><8E><F7>/<BB><B7><DF>^^^<E1>n<FF>rxx>݁<89><8C>^Om<FF><F1><DF>^_<F7>^_^?<FA><C6><DB>G|<FB>2     ^M<F4><BE><8C><DF>?π<FF><8D>}8^^<BD><D8>,<DD><FD>o^h^X<82><AD>^P<FA><EB><F3><C3>ӽ^W|^L<E5>l<F0><A1><C3>^V^^ESC^Yܣ<B8>)g4)ľ><FF><EE>^7^K
    <E7>c<A3>X%<E7><C1><B9>ɨ<F7>ll^C<A9><84>^N<8E><8D>2<BA><9B>E/4^Sh<A5><CA><C3>RÙ<F4>|<A1>#",<y^G<D3>|l^P<B7>)j<B1><B9>^CS<BA>ج<E9><C2>&Ccq<D3><E8>Ş)^W<B7>2<AB>
    <B1><D1>6<C4>^F^W^?<FF>'t^D<E7>̚<BE><AD><82>-a<A1>3J<FE><C2>aaƩ[K^G^^̕^F^K2<CE>&<96>F5^\^B<EE><F5>T3^N<DC><F2>cL^L<DC>$կ<9B>Z<BA>΁+e^^!^Kvʏ1U^_~k<B5>^F^Dy4^W<ED>K<EF><9B>x<96>z<BA><BF>[<EB>^K<B1>^E<9B>居FǕIs^B,ux^D<E7><D4>4^M.(8*#4B^Wᑏ^MmT<EE>;<E6><A9>><DB>wk^M^A+<E3>;<AA><B5><81><98>GU<98>,gi<97>u^^<F9>J^W<B1><85>7<8E<8D>9}r@Ņ<B1><A7><U+0600>t<97><A9><94>^P<FD><9D>QF<9A><E6><E8><E8>+<C7>*<A2>c<BE>ɸ^E<A4><B6>8ɔ0]j*<B8>2t׼(<C2>h<F0><94><C9><C3><D8><E0>;<A0><D0>BMD^L<90>^F6<9A>ڠB<92><83>Z<DB>0A<A5><97><D0><<CE>8^OKs^G<9A><92>u<AD><DF>^B<F9>^A<FE>J<B1>*<EA> <C6>^V˻^^<Z<A9><C1><DF>^A6<B1><BC><FC>7+Iѐ?4JǶ^
    ^]<C2>^K<FA>^V<8E>^O/<C6>+<9D>^G<D0><F4>^L<ED><83>^V<92>&^R<A2>th<92>V"<C9>ѩ^GPz<8E><99><F2>J<BE>^Fݷ)}<AB><A7><FB><D7>S<9F>%<9A>g<AA>H<9D>^V<96><9A>j<9E>*2P<C9>H~^Gқ<A6>{^LL<DB><D2><E8>h<E2>k^]_^X<A9>O襁^K<FB>m<EB><9E>\<AD>$<84>C<B7>I}<CF><CE>_\<F4><C5>^V^@^M<DF>(<F5>^M<A7>Z<9B>G<EB>^C1<9F>^L<B9>m޻-%08<C4><EF>ܕ>z<9C>wm
    <F9>j^G<E6><92>:<8A>l]"<FF>߹<B1><E7><F0>[赡v<CC>3<F5>Q^Z<E3><91>0ESC<DA><F3>K:<9F>蔯<DB>l<C3>D#<90><A5><9C>^QA<FD><EA><F5>}|)AL<94>u^L|b^_<D6><C9>7j<8A>^<9B><F4>
    ^X^EU^F{^@^@^A^A^H
    ^@^A:ESC^@^A:^ZHTTP/1.1 200 
    vary: accept-encoding
    Content-Encoding: gzip
    Content-Type: text/plain;charset=UTF-8
    Transfer-Encoding: chunked
    Date: Thu, 06 Feb 2020 11:32:14 GMT
    
    a
    ^_<8B>^H^@^@^@^@^@^@^@
    200
    <AC>[<DB>n\G^N<FC>^UC<EF>:n<DE>I@v<9E><BC>_<90>|<80>6<D2>.^L8R`<D9>F<F2><F7>˃@<B9>h<D8>^6Ԁ^^<C6><C6>H<C3><E9>&<8B>UE<9E><9B>^_~<FB><E5>ӛo<F7><9F><9F>>>><BC><BB><82>c\<BD><B9>^?<F8><F9><F1><EE><E
    CODE
  5. Once you have checked this, you know that the capture is working.

Build the Web Services Pipeline

If you feel lost about using Datumize Zentral, please refer to the available resources in our Getting Started section, including manuals and videos.

The table below summarizes the components used in the pipeline.

Component Type | Name | Description
Source | FilePcapSource | Reads the captured traffic from PCAP files in a directory.
Processor | HTTPDialogGroupProcessor | Groups the HTTP dialogs.
Processor | HTTPAssemblerProcessor | Assembles all the packets after grouping.
Processor | WSParserProcessor | Parses the Web Service dialog into objects that are easy to manipulate.
Processor | CookProcessor | Executes an operation on the input data to obtain an output (called a dish).
Processor | SerializerProcessor | Translates data structures into a format that can be stored. There are multiple formats available.
Processor | ComposerProcessor | Pre-prepares the data before it is finally sunk into some file format.
Sink | FileSink | Stores records into files.

Drag the required components from the Palette to the Workbench and join them with a Single Memory Stream.

The tables below summarize the properties to configure for each component. Fields marked with (*) are required; fields listed under Default should be left with their default values.

FilePcapSource component:


Modify:

  • Directory Base: /opt/datumize/pcap (*)

Default:

  • Filter: not set (*)
  • File Pattern: *.pcap (*)
  • File Suffix On Success: not set (*)
  • File Suffix On Error: not set (*)
  • File Sort: NAME (*)
  • File Age: 0s (*)

HTTPDialogGroupProcessor component:


Modify:

  • Server Pattern: 8090 (*)

Default:

  • Timeout: 2s (*)
  • Precision: 0ms (*)
  • Partitions: 1 (*)
  • Client Pattern: not set (*)

HTTPAssemblerProcessor component:


Modify:

  • Filter: http.rq.resource contains '/api/call' (*)

WSParserProcessor component:


Modify:

  • Parsers (add element): Xml map incomplete deserializer, content type xml (*)
  • Parsers (add element): Xml map incomplete deserializer, content type plain (*)

CookProcessor component:


Please note that the code in the CookProcessor is essential for this tutorial to work correctly.


Modify:

  • Operation Parameters: not set (*)
  • Operation: the following Groovy script (*)
if (input.containsKey('response')) {
    // request side: count the operations contained in the call
    rq = input.get('request')
    ops = rq.get('operations').asList()
    output.count = ops.size()

    // take the function name of the operation at index 1 (stored as firstop)
    if (ops.size() > 1 && ops.get(1) != null && ops.get(1).get('operation') != null) {
        output.firstop = ops.get(1).get('operation').get('func').value()
    }

    // response side: take the corresponding result value (stored as firstresult)
    rs = input.get('response')
    rsops = rs.get('operations').asList()
    if (rsops.size() > 1 && rsops.get(1) != null && rsops.get(1).get('operation') != null && rsops.get(1).get('operation').get('result') && rsops.get(1).get('operation').get('result').get('value') != null) {
        output.firstresult = rsops.get(1).get('operation').get('result').get('value').value()
    }
}
CODE
Default:

  • Language: groovy (*)

SerializerProcessor component:


Modify:

  • Serializer: Map json serializer (*)

ComposerProcessor component:


Modify:

  • Rules: Rule records, Max records: 20 (*)

Default:

  • Key Extractor: Fixed extractor, Key: k (*)
  • Window: not set
  • Time Rate: not set
  • Header: not set
  • Footer: not set
  • Combiner: not set

FileSink component:


Modify:

  • Directory Base: /opt/datumize/output (*)

Default:

  • File Pattern: %{uuid} (*)
  • Directory Pattern: %{Year}%{Month}%{Day}/%{Hour}%{Minute} (*)
  • Closed File Suffix: not set (*)
  • Serializer: No serializer (*)




Test the Pipeline

To test the pipeline you have created, please see the Testing a Pipeline guide for a more in-depth overview of the necessary steps.

For this tutorial, in the pipeline editor page you can select Run from: Beginning. If the test is successful, the Input and Output panels will show your input and output results. 




Deploy the Pipeline to DDC Instance

In Zentral you will only need one machine, one instance and one pipeline. For more on using the Infrastructure management tool, please see the Zentral guide to Infrastructure Deployment.

This pipeline will be deployed with the default DDC Runtime Policy.



Check the Expected Output

  1. Since the output is configured as CSV, check the output folder that contains the resulting CSV files:

    ls -alt /opt/datumize/resources/
    CODE
    # ls -alt /opt/datumize/resources/csv/20200206/1127/
    total 28
    drwxr-xr-x 4 root root  4096 Feb  6 11:29 ..
    drwxr-xr-x 2 root root  4096 Feb  6 11:28 .
    -rw-r--r-- 1 root root   234 Feb  6 11:28 73d96893-7915-49c9-836d-2cf1198d0c30
    -rw-r--r-- 1 root root 15426 Feb  6 11:27 26d430d9-be5a-4b3d-8f95-602fe9f9bfa0
    CODE
  2. Check inside the csv by using less:

    # less /opt/datumize/resources/csv/20200206/1127/73d96893-7915-49c9-836d-2cf1198d0c30
    CODE

    The file contains the following, which is the result of the web service dialogs captured by the DDD and processed by the DDC (a quick way to aggregate these values is shown after this list):

    100,sub,21985.0
    100,div,0.9670221843003413
    100,sub,10669.0
    100,div,0.7495874063729846
    100,add,54708.0
    100,div,4.801460823373174
    100,sub,34699.0
    100,div,0.7584804068202213
    100,div,0.39869337979094077
    100,add,58887.0
    100,multi,234696.0
    CODE
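
As a quick sanity check before visualizing the data, you can aggregate the CSV files directly from the shell. The sketch below simply counts how many dialogs used each operation; it assumes the three-column layout shown above (operation count, first operation, first result) and the output path used in this example:

# count dialogs per operation (second CSV column)
awk -F, '{ print $2 }' /opt/datumize/resources/csv/20200206/1127/* | sort | uniq -c
CODE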



Stop All

Automatically

  1. Using Zentral, you can stop the deployed DDD and DDC instances at any time from your deployment plan.

Manually

  1. Now, to finish, you can stop all the services that have run on the machine. First stop the DDD and then the DDC:

    /opt/datumize/dtzdump/dtzdump.sh stop sandbox
    
    /opt/datumize/ddc/bin/ddc stop ws
    CODE
  2. To stop the traffic generator, just execute this in its folder (a final check is shown below):

    bash stop-traffic-generator.sh
    CODE
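
Finally, as an optional check, make sure nothing is still capturing or generating traffic. The process names below are assumptions based on the scripts used in this tutorial; the commands print nothing once everything is stopped:

# these should return no processes after everything has been stopped
pgrep -af dtzdump
pgrep -af traffic-generator
CODE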