Designing a Pipeline
How to create and configure a pipeline composed of components and streams, and get to know the component palette and the property inspector.
Objectives
In this section you will learn:
- How to create a DDC Pipeline Resource.
- How to navigate the DDC Pipeline editor and change the docking position.
- How to add Components (Sources, Processors and Sinks) and Streams.
- How to save, export, and modify DDC Pipelines.
- How to work with dependencies.
Overview
A pipeline is a sophisticated data application, so be careful when working with sensitive or production data. It is highly recommended that all users start with the Datumize Tutorials before working with the Pipeline editor for the first time.
The DDC Pipeline Editor is one of the most powerful parts of the Datumize Zentral toolkit. It is highly recommended to become familiar with this space, and with the deployment policies, by working through one of our Sandbox tutorials (such as the Hello World tutorial).
The DDC Pipeline editor is the primary space for editing the configuration of a DDC pipeline. It supports simple or complex configurations with one or more sources, processors, and data sinks. While this guide is introductory, the limits of the DDC pipeline are anything but. It is here that you will define the data sources, get to understand the data, create breakpoints and test your pipeline, and ultimately publish your pipeline for inclusion in a deployment strategy.
Working with the Pipeline Editor
Once a project has been created, you can create a new DDC Pipeline resource, which automatically opens the DDC Pipeline Editor; clicking an existing resource of type DDC Pipeline also opens the editor. The editor auto-saves by default. If you have not already named the pipeline, double-clicking the name allows you to rename it; if you are happy with its name, simply leave it.
Adding Components to the Pipeline
Pipeline creation usually starts by adding a Source component that will read data. On a blank editor, the Add Component button in the upper right only allows a Source component to be selected, as this is the logical beginning of any pipeline.
Selecting Add a Source brings up the current list of sources and protocols. For a complete list, including required and optional components, please read more on DDC Source Components.
Once you have identified the data source you intend to use for the part of the project you are working on, simply select the item from the list and drag it anywhere onto the editor canvas. Don't worry about keeping everything aligned or tidy; you can always rearrange later by dragging items where you want them, or align them by selecting More actions > Layout > Fit view to pipeline and tidying up from there.
Drag your source component to the canvas and continue working on your pipeline.
A minimal pipeline will usually require at least one Processor; please read Processor Components for further information. Select a Processor from the list of components and drop it onto the editor canvas. Keep in mind that all components need to be connected by Streams, so placing it close to its source(s) is ideal. If you select components that cannot operate together, you will receive a warning message: incompatible node.
Drop the selected Processor onto the canvas. You can continue to add any kind of sources, processors, and sinks based on the design of your pipeline and the capture needs of your project. Incompatible components, however, will not allow you to add a stream, so it is best to start simple and add complexity as you learn which components work together for your use case.
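Conceptually, the compatibility check behaves like matching the record type a component produces against the type the next component consumes. The sketch below is plain Python illustrating the idea only, not the Zentral implementation; the component names and type labels are invented for the example.

```python
class Component:
    """A toy stand-in for a pipeline component with typed input/output."""
    def __init__(self, name, consumes=None, produces=None):
        self.name = name
        self.consumes = consumes   # record type this component accepts
        self.produces = produces   # record type this component emits

def can_connect(upstream, downstream):
    """A stream is only allowed when the record types line up."""
    return upstream.produces is not None and upstream.produces == downstream.consumes

http_source = Component("HTTP Source", produces="raw")
parser = Component("Parser Processor", consumes="raw", produces="parsed")
file_sink = Component("File Sink", consumes="parsed")
```

Under this model, source-to-parser is a valid stream, while connecting the source directly to the sink would raise the "incompatible node" situation described above.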
Add a Sink component from the list of DDC Sink Components and drop it onto the editor canvas, just as you would a source or processor.
In the example below, all three steps of the process are on the canvas. However, at this stage it is not possible to continue without setting up the Streams.
Holding CTRL (or Command on Mac) and clicking a component lets you select multiple elements. From there you can rearrange the layout, or, if you want to delete multiple components, click the words Remove Selected that appear in red at the top of the screen to remove the highlighted components.
Adding Streams to the Pipeline
Each component in the pipeline has a white bubble to the left or right of the editable component name (keep in mind that the component property editor, described later, lets you give each component a more memorable name). Clicking this bubble and dragging towards another component activates the Stream connector. This Stream indicator appears with a green dot and a plus sign if the stream works with the desired component; if so, drag the connector to the other white bubble to connect them. If the components are not compatible, you will receive an incompatible node error message.
Clicking the Stream indicator brings up an editor window for configuring each stream. It is important to understand how the data will move through each component. Read more on Streams and Records here.
You can select a single stream or a partitioned stream. You can set a topic and, for partitioned streams, choose a Hash Function from the list. Once everything is configured as it should be, press Save Component to commit the settings. If you are happy with the whole pipeline, you can Publish it to make it available to a deployment plan at a later stage.
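To illustrate what a hash function buys you in a partitioned stream, the plain-Python sketch below (not the Datumize API; the `device_id` field name and partition count are assumptions) shows the typical scheme: records that share a key are always routed to the same partition.

```python
def partition_for(record, num_partitions, key="device_id"):
    """Map a record to a partition index by hashing its key field."""
    return hash(record[key]) % num_partitions

records = [
    {"device_id": "A", "value": 1},
    {"device_id": "B", "value": 2},
    {"device_id": "A", "value": 3},
]

# Records sharing a key land on the same partition, preserving per-key order.
parts = [partition_for(r, 4) for r in records]
```

This per-key stickiness is what lets a partitioned stream parallelize work without reordering records that belong together.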
Configuring Pipeline Components and Streams
The figure below represents a simple pipeline consisting of a Database Source, a Cook processor (for custom processing), and a Logger Sink for writing to log files. The components are joined by two single memory streams.
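To make the data flow concrete, here is a minimal sketch of the same three-stage shape in plain Python. This models the concept only, not the Datumize runtime; an in-memory queue stands in for each single memory stream, and the sample rows and field names are invented.

```python
from queue import Queue

def database_source(out_stream):
    """Source: emit rows (here, hard-coded sample data)."""
    for row in [{"id": 1, "temp": 21.5}, {"id": 2, "temp": 19.0}]:
        out_stream.put(row)
    out_stream.put(None)  # end-of-stream marker

def cook_processor(in_stream, out_stream):
    """Processor: apply a custom transformation to each record."""
    while (row := in_stream.get()) is not None:
        row["temp_f"] = row["temp"] * 9 / 5 + 32
        out_stream.put(row)
    out_stream.put(None)

def logger_sink(in_stream, log):
    """Sink: write each processed record out (here, to a list)."""
    while (row := in_stream.get()) is not None:
        log.append(str(row))

stream1, stream2, log = Queue(), Queue(), []
database_source(stream1)
cook_processor(stream1, stream2)
logger_sink(stream2, log)
```

Each stage only sees the stream it is connected to, which is why the editor insists that every pair of components be joined by a compatible stream before the pipeline can run.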
To edit any component or stream in the pipeline, just click on it and the inspector will be shown. The inspector is a very powerful editor that lets you configure every editable aspect of the pipeline; the available settings are usually unique to the type of component you are working with. Use the inspector to modify any component property, manage breakpoints and stubs, save changes, or delete the component. Please check the documentation on DDC Components and on Streams and Records for specific details on configurable properties, such as partitioned stream hash functions or the individual settings available for each component. The inspector panel groups editable properties into up to three sections: the basic component values, the required default fields, and the advanced tuning settings that are optional but useful.
Clicking the expanding v button drops down the editable fields in each group (component, default, and advanced fields, where available). Fields marked with a red dot are required; failing to provide a value will prevent the pipeline from saving until all required values are present.
Some components, such as the Cook processor, also allow code to be added to the component. This option is found in the default operations section: hover over the + sign and click the popup window icon on the right side of the operation line to enable the code editor display.
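The kind of per-record logic such an operation typically expresses can be sketched as a small function. The snippet below is plain Python for illustration, not the Cook processor's actual expression language, and the field names are assumptions.

```python
def cook_operation(record):
    """Example custom operation: normalize a field and derive a flag from it."""
    record["status"] = record.get("status", "").strip().lower()
    record["is_error"] = record["status"] == "error"
    return record

out = cook_operation({"status": " ERROR "})
```

Keeping each operation small and single-purpose, as here, makes it easier to test the pipeline step by step with breakpoints and stubs.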
Additionally, should you want an advanced editor, one is also provided. Select More actions > Manage > Edit Pipeline. In this editor view, all aspects of the pipeline are editable inline. This mode requires you to use the Save button after committing changes in order to continue.
Once a basic pipeline is visually laid out, note that all required fields must be edited and saved in the component they belong to. Do this by clicking each component on the canvas and tuning its settings as necessary, based on the advanced documentation and the use case for which the required and optional properties are needed. Once the pipeline is set, it is time to Test your pipeline. Please remember to use the Save button to save your work if you have not received a save notification, or Publish to make the pipeline available to a deployment workflow. For additional information, see Versioning and History for more options related to saving and retrieving previous configurations.
Working with Dependencies
Some configuration settings require dependencies from previous steps. While some dependencies between steps are known, and the system may take these into account, for most dependencies the system assumes that the user is the best judge of what needs to be provided. This means that if a field is marked as Dependencies, the user must supply the dependencies there; the system will not add them automatically.