Denodo Data Virtualization

Data Virtualization

Data Virtualization is a logical layer which

  • Delivers business data in real time to consuming applications or business users
  • Integrates data from disparate sources, locations, and formats, without replicating the data
  • Enables faster access to all data, less replication and cost, and more agility to change
  1. Connects: to disparate data sources
  2. Combines: related data into views
  3. Consumes: in business applications

Pillars of Data Virtualization

  1. Universal Data Access: hides the complexity of underlying data sources
  2. Unified Virtual Data Layer: Common virtual canonical business model
  3. Universal Data Publishing: multiple publishing interfaces and access patterns
  4. Unified Data Governance: Central point for data governance
  5. Agile High Performance: Multiple delivery processes to meet different SLAs (Service Level Agreements)

Data Integration Strategies

Data silos make it challenging for business users to access and analyze all of the available data within an organization. To bring the data together, companies typically use

  1. Extract, Transform, and Load (ETL) processes
  • copy the data from different silos and move it to a central location (e.g. a Data Warehouse)
  2. Enterprise Service Buses (ESB)
  • establish a communication system for applications, enabling them to share information
  3. Data Virtualization (DV)
  • create real-time, integrated views of the data in data silos, and make them available to applications, analysts, and business users

ETL

  • In an ETL process the data is extracted from a source, transformed and loaded into another data system
  • Pros and Cons
    • (Pro) ETL processes are efficient and effective at moving data in bulk
    • (Con) Moving data to another system means that a new repository must be maintained
    • (Con) ETL processes are not collaborative, the end-users must wait until the data is ready

ESB

  • Pros
    • Applications are decoupled
    • They can be used to orchestrate business logic using message flows
  • Cons
    • ESBs are suitable only for operational use cases that involve small result sets
    • Queries are static: they can only be scheduled, and are restricted to one source at a time

Data Virtualization

Data virtualization supports a wide variety of sources and targets, which makes it an ideal data integration strategy to complement ETL processes and ESBs

  • for ETL
    • Seamlessly connecting on-premises and cloud components
    • real-time integration of disparate data sources
  • for ESB
    • The ESB can connect to the data virtualization layer to access external sources that cannot be easily added to the ESB

Denodo Platform Architecture

The Denodo Platform includes the following built-in components:

  • Virtual DataPort: the core component, the data virtualization server
  • Data Catalog: a self-service data discovery tool
  • Scheduler: used for scheduling and executing batch jobs
  • Solution Manager: provides centralized management of all the servers

Virtual Dataport

Virtual Dataport enables business applications to process a series of distributed and heterogeneous data sources, as if the data were contained in a large virtual database

  • it acts as a mediator that provides a structured and unified view of the data contained in all the data sources included in the system

Data Catalog

Helps business users and data scientists discover the data assets that are available through the Denodo Platform

  • it promotes self-service and discovery capabilities for business users, enabling them to explore both data and metadata in a single web front-end tool
  • with this tool, the end user can access a graphical representation of the business entities and associations, as well as data lineage and tree view information
  • it includes reporting capabilities and export options (CSV, Excel, Tableau, etc.)

Data catalog is designed to provide organizations with three core benefits

  • Enterprise-wide directory of datasets available for consumption from a business-friendly web interface
    • faster business decisions, by enabling a much more rapid and comprehensive understanding of, and access to, all enterprise data and metadata
  • Governed self service BI/Analytics and data consumption that is still owned and maintained by IT
  • Significant IT time, resource and cost reduction

Denodo Scheduler

Denodo Scheduler automates recurring tasks such as:

  • Exporting data to a database or to files such as Tableau data extracts, delimited files, CSVs, flat files, and MS Excel
  • Indexing data to enable Google-like keyword searches on content
  • Cache loading
  • Periodic management tasks such as gathering data statistics, looking for changes in sources, etc.

Denodo Scheduler enables a hybrid, DV to ETL pattern

  • Jobs can be scheduled for a specific time and can have dependencies on other jobs
  • Detailed execution reports that can be sent by email
  • Support for data extraction from sources with limited query capabilities
  • Persistent jobs and Transparent retries in the event of failure
  • Possibility of parallel execution of the different queries involved in the same job

Denodo Solution Manager

  • easy promotion of new metadata to different environments in multi-location architectures with different clusters, from a centralized web UI
  • it provides automated lifecycle management capabilities that greatly simplify DevOps tasks and improve agility
  • its centralized web UI allows simpler and more efficient management of deployments, streamlining continuous delivery strategies

Denodo platform tools and applications

Denodo Platform Control Center

Allows starting and stopping all the Denodo Platform servers and their tools

  • it allows configuring some additional functions
  • it can be launched either by
    • selecting the shortcuts created during the installation
    • Or by executing the denodo_platform.sh script from the bin folder of the Denodo platform installation

Web Design Studio

A web-based tool, aimed at developers, for developing Virtual DataPort elements

  • Web design studio can be started
    • from the Denodo platform control center
    • by executing the designstudio_startup.sh script from the bin folder
  • it can be accessed using the URL: http://localhost:9090/denodo-design-studio
  • currently, it does not support configuring the administration options

Data Catalog

A web-based self-service tool, aimed at business users, to query and search the organization's data

  • with it, they can generate new knowledge and make better decisions
  • data catalog can be started, either from
    • the Denodo platform control center
    • by executing the datacatalog_startup.sh script from the bin folder of the Denodo platform installation
  • the default URL is: http://localhost:9090/denodo-data-catalog
  • this tool is not included in Denodo Standard 8.0 version

Scheduler Administration Tool

A web tool used to connect to the Scheduler server for creating and managing Scheduler jobs

  • the Scheduler Administration Tool can be started either using
    • the Denodo platform control center
    • executing the scheduler_webadmin_startup.sh script from the bin folder of the Denodo platform installation
  • the default URL to launch the tool is: http://localhost:9090/webadmin/denodo-scheduler-admin

Solution Manager Administration Tool

A web tool, used to connect to the Solution Manager and License Manager Servers

  • it provides a unified access point to all the web applications of the Denodo Platform
    • Data Catalog
    • Web Design Studio
    • Diagnostics and Monitoring tool
    • Scheduler web administration tool
  • The Solution Manager Administration Tool can be started, either by
    • Using the Denodo Solution Manager Control center
    • Using the solutionmanagerwebtool_startup.sh script from the bin folder of the Solution Manager installation
  • the default URL to launch the Solution Manager Administration tool

Diagnostics and Monitoring tool

A web tool aimed at both developers and administrators who want to

  • monitor the current state of a server, a cluster or an environment
  • diagnose the past state of a server to identify the cause of a problem
  • the Diagnostics and Monitoring Tool can be started either using
    • the Denodo Platform Control Center
    • or by executing the diagnosticmonitoringtool_startup.sh script
  • default URL: http://localhost:9090/diagnostic-monitoring-tool

Southbound Connectors

Specialized connectors to access specific data repositories or applications to retrieve their schema and data

  • these connectors are configurable using a visual editor or the SQL-like commands that are part of the Denodo Platform
  • they are used to create data sources that retrieve data from a data repository into the Denodo Platform
    • E.g. a relational database, a REST or SOAP web service, an MS Excel file, etc.

Data Model

Derived views are created over base views to combine the data from different sources or to apply any kind of transformation

  • the Denodo Platform offers a graphical drag-and-drop tool for modeling derived views
  • it uses extended relational algebra for data combination and transformation

Northbound Connectors

One of the essential capabilities of Data Virtualization is the variety of access mechanisms provided to consumers; these access mechanisms are called Northbound Connectors

  • the Denodo Platform offers flexible delivery options to suit all types of users and applications at almost no cost
  • it facilitates the creation of a single point of access and interaction with the underlying data sources and abstracted views in a standard way

Denodo Platform Security

Data in Transit

When users access data through the Denodo Platform, the data is in motion

  • Data moves through the network
    • From the data sources to Denodo
    • From Denodo to the final users and consuming applications
  • This is referred to as data in transit
  • To ensure that the data is well protected and secure, the connections can be encrypted

Data in transit refers to all communications between the Denodo Platform and the data sources, and between data consumers and the Denodo Platform

  • it can be secured through TLS at the connection level
    • Denodo supports TLS 1.2 or 1.3
    • TLS configuration is automated via the Denodo TLS configurator script

Any encryption algorithm supported by the default Java Cryptography Providers of the Denodo Platform JRE (Java 11) can be used

  • Stronger ciphers can be enabled by installing the Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy Files
  • when full encryption at the transport level is not required
    • Denodo's built-in encryption/decryption functions can be selectively applied to sensitive fields to prevent unauthorized access

Data at Rest

The Denodo Platform stores some operational data in a relational database and on the hard drive; this data is referred to as data at rest

  • Metadata of the various Denodo Platform components and elements
    • Data sources, views, stored procedures, web services
    • sensitive information on the configuration of the server
  • Data cached in a relational database
  • data swapped to disk during query execution

The Data at Rest can be secured by Denodo as follows

  • Metadata: sensitive information is stored encrypted/hashed in the metadata repository
    • Additionally, you can enable Transparent Data Encryption to encrypt the database
  • Cached Data: Denodo will transparently leverage any encryption mechanism available in the selected cache system
    • For example, Oracle Transparent Data Encryption
  • Swapped Data: encrypt the swap folder by using OS file system encryption

Authentication Protocols

Used to publish data to consuming applications

  • standard username/password security mechanism
  • Kerberos support
  • HTTP-based authentication pass-through for the OData interface
  • Web service security with HTTPS, HTTP basic/Digest, SAML 2.0, OAuth 2.0, HTTP SPNEGO and WS-Security protocols
  • OAuth 2.0 in JDBC, ODBC connections is also available

Denodo also offers different alternatives to integrate with identity, authentication and authorization services

  • Denodo built-in security
  • Integration with external entitlement services (LDAP/AD)
  • Single sign on using OAuth, and SAML
  • Integration with external custom entitlement service with specific security policies

Performance

Static Optimization

The Denodo optimizer analyzes each query and simplifies it by rewriting it to improve performance

  • it focuses on maximizing query delegation to the data sources
  • it applies by default for all the queries
  • it minimizes the network traffic
  • it reduces the data volume that needs to be processed in the virtual layer
  • it applies one or more simplifications to the same query
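The payoff of query delegation can be seen in a minimal sketch (plain Python with SQLite standing in for a remote source; table and column names are illustrative, not Denodo code): pushing a WHERE condition down to the source yields the same result while moving far less data into the virtual layer.

```python
# Illustrative sketch: delegating a condition to the source vs.
# filtering in the "virtual layer". SQLite plays the remote source.
import sqlite3

src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE sales (region TEXT, amount INT)")
src.executemany("INSERT INTO sales VALUES (?, ?)",
                [("EU", 10), ("US", 20), ("EU", 30), ("APAC", 5)])

# Without delegation: pull everything, then filter locally.
all_rows = src.execute("SELECT region, amount FROM sales").fetchall()
filtered_in_vdp = [r for r in all_rows if r[0] == "EU"]

# With delegation: the condition is pushed down, so the source
# returns only the matching rows and less data crosses the network.
pushed_down = src.execute(
    "SELECT region, amount FROM sales WHERE region = ?", ("EU",)).fetchall()

assert filtered_in_vdp == pushed_down  # same result, less data moved
```

Here 4 rows shrink to 2 before transfer; with real sources the reduction is what cuts network traffic and virtual-layer processing.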

Connecting and combining Data with Denodo

The Denodo Platform allows adding remote databases as data sources

  • the recommended way to connect to databases is always the JDBC data source
    • JDBC stands for Java Database Connectivity
    • JDBC defines an API for the Java programming language to access any database using a suitable driver

The Denodo Platform supports a wide variety of databases out of the box

  • relational databases
    • Oracle, MS SQL Server, PostgreSQL, MySQL, Amazon Aurora, Azure SQL, Apache Derby
  • Data Warehouses
    • Snowflake, Amazon Redshift, Azure SQL Data Warehouse, Google BigQuery, SAP HANA, Teradata, Yellowbrick
  • Interactive Query Services
    • Hive, Impala, Amazon Athena, Presto, Spark SQL

By default, the Denodo Platform does not bundle the JDBC drivers of some supported databases, although it includes the adapters

  • When a driver is not provided by Denodo, the .jar files of the driver must be uploaded to the Virtual DataPort server using the Administration Tool
  • Import option in File > Extension Management > Libraries

When Denodo does not include a specific adapter for a data source

  • if possible, select a similar adapter
    • for example, to connect to MySQL 6.0, the MySQL 5 adapter can be selected
  • if there is no similar adapter, the generic adapter can be used
  • in both cases the driver must be provided and uploaded to the server

Import a SQL Query

Instead of using a table as the source for a base view, it is possible to generate a view from a SQL query

  • the query will be sent to the database without modifications
    • but the Denodo Platform may delegate to the source a larger query built over the given statement, for example, when delegating conditions
  • this new view will inherit its internal schema from the metadata associated with the query results
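The "schema inherited from the query results" idea has a close analogue in any database API (a sketch, not Denodo internals; SQLite and the column names are illustrative): the result-set metadata of the arbitrary query becomes the view schema.

```python
# Sketch: a "base view" over an arbitrary SQL query inherits its schema
# from the result metadata (DB-API's cursor.description here; Denodo
# uses the source's own metadata the same way).
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, total REAL)")
db.execute("INSERT INTO orders VALUES (1, 9.5)")

# The query text is sent to the database as-is.
cur = db.execute("SELECT id, total * 1.21 AS total_with_tax FROM orders")

# The view schema is derived from the query's result metadata,
# including computed columns and their aliases.
schema = [col[0] for col in cur.description]
rows = cur.fetchall()
```

Note how the alias `total_with_tax`, which exists nowhere in the source tables, still ends up in the inferred schema.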

This is useful in several scenarios

  • using syntax specific to the accessed database that cannot be replicated in Virtual DataPort
    • for example, using a function that is not available in Virtual DataPort
  • situations where an exact query is needed
    • for example, reusing complex queries already created instead of duplicating them in Denodo
    • using queries that have been optimized for the specific database being accessed

Import Stored Procedures

Base views can be created graphically over stored procedures of the following databases

  • IBM DB2, MS SQL Server and Oracle

If the procedure has one or more cursors, one of the following options can be selected

  • Stream output at the specified cursor
    • the data that is returned by the cursor is flattened
  • do not stream cursor parameters
    • the data returned by the cursor will be in an array
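The difference between the two cursor options can be sketched in plain Python (the procedure output and field names are hypothetical, chosen only to illustrate the shapes):

```python
# Sketch: two ways to expose a stored-procedure cursor in a base view.
# "Stream" flattens the cursor into one tuple per row; "do not stream"
# keeps the cursor rows inside a single array-valued field.
proc_output = {"customer": "ACME",
               "orders_cursor": [(1, 9.5), (2, 20.0)]}  # cursor rows

# Option 1: stream output at the cursor — flattened, one tuple per row.
streamed = [(proc_output["customer"], oid, total)
            for oid, total in proc_output["orders_cursor"]]

# Option 2: do not stream — cursor rows stay in one array field.
as_array = (proc_output["customer"], proc_output["orders_cursor"])
```

Streaming yields two output tuples sharing the scalar fields; the non-streamed form yields a single tuple whose second field is the whole array.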

Querying views from Execution Panel

Virtual DataPort is able to import data coming from different systems, with different structures and formats

  • to retrieve data from a source, some elements are created in Denodo
    • a data source representing a repository of data
    • base views representing entities in the source
    • these base views can be executed in Virtual DataPort

Basic Derived Views

Derived Views are new views created over other Denodo Views

  • A derived view is another element in the Virtual DataPort catalog
    • it may then appear in the FROM clause of any VQL query and may be used as a base for constructing new views or queries
  • The cache may be used on a derived view, and user privileges can be defined on it

Functions

Functions are used to generate new attributes in the schema of a view

  • a function is written as an identifier and a list of arguments that can be constants, fields, or other functions
    • e.g. len('this function gets the length of this text')

All Virtual DataPort functions can be browsed by pressing CTRL + SPACE in

  • the Specify WHERE expression tab of the execution panel
  • the field expression dialog available when defining new fields for views

Custom functions allow users to extend the set of functions available in Virtual DataPort

  • they are used in the same way as every other function

Custom functions are implemented as Java classes packaged in a JAR file that is added to Virtual DataPort

  • functions have to be stateless
  • each function must have a unique name and its own Java class
  • different functions can be grouped in a single JAR file

Union Views

A union view allows the tuples from various input views to be output as a single view

  • in standard relational algebra, the union operation requires
    • all the relations to have the same schema
  • an extended UNION ALL is also supported
    • whenever any of the input relations has an attribute that is not present in the others, it is added to the resulting view
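The extended-union behavior can be sketched in a few lines of Python (the view contents and field names are made up for illustration): the result schema is the union of both input schemas, and rows lacking an attribute are padded.

```python
# Sketch of an extended UNION ALL: attributes missing from one input
# view appear in the result anyway, padded with None.
view_a = [{"id": 1, "name": "Ana"}]
view_b = [{"id": 2, "name": "Bo", "country": "US"}]

# Result schema = union of the input schemas, in order of appearance.
schema = []
for row in view_a + view_b:
    for field in row:
        if field not in schema:
            schema.append(field)

# Every output row carries the full schema; missing values become None.
extended_union = [{f: row.get(f) for f in schema} for row in view_a + view_b]
```

A strict relational union would reject these inputs because their schemas differ; the extended form widens the result instead.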

When creating a view, more than just a UNION operation is performed

  • some of the operations are performed through intermediate projection operations automatically added by Virtual DataPort
    • e.g. changing the name of the view, deleting fields
  • other intermediate operations can be added manually
    • the Where Conditions tab
    • the Group By tab
  • all the operations performed are shown in the tree view screen

JSON/XML data source

Virtual DataPort allows importing data from XML or JSON documents

  • these formats are often used for transmitting structured data over a network connection
  • these data sources allow the Denodo Platform to connect to RESTful web services
    • connecting to a resource URI that returns a response with a payload formatted in XML or JSON (Google APIs, Twitter, etc.)

For JSON data source

  • a tuple root can be selected
    • /JSONFile is the default option
    • modify it when it is needed to access only a part of the document rather than the entirety of it.
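The effect of moving the tuple root can be sketched with plain JSON parsing (the payload and the path are hypothetical, and the Python indexing only approximates Denodo's path syntax): pointing at an inner array yields one tuple per array element instead of one tuple for the whole document.

```python
# Sketch of the "tuple root" idea: selecting a sub-path of a JSON
# document so each element of an inner array becomes one tuple.
import json

payload = json.loads("""
{"meta": {"count": 2},
 "results": [{"id": 1, "name": "Ana"},
             {"id": 2, "name": "Bo"}]}
""")

# With the default root ("/JSONFile") the whole document is one tuple;
# descending to the "results" array yields one tuple per element.
tuple_root = payload["results"]
tuples = [(item["id"], item["name"]) for item in tuple_root]
```

Here the `meta` wrapper is skipped entirely and only the part of the document of interest is turned into rows.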

For XML data sources

  • there are two options to create base views
    • do not stream output: the base view will return a single compound value aggregating all the document information
    • stream output at the specified level: it does not require the entire document to be loaded in memory before processing it
      • the contents will be split into different tuples
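The streaming option maps naturally onto an event-driven XML parser; a minimal sketch (the document and element names are invented) using Python's `iterparse`:

```python
# Sketch of "stream output at the specified level": an event-driven
# parser emits one tuple per element at the chosen level, without
# materializing the whole document in memory first.
import io
import xml.etree.ElementTree as ET

doc = io.StringIO(
    "<orders><order id='1'><total>9.5</total></order>"
    "<order id='2'><total>20</total></order></orders>")

tuples = []
for event, elem in ET.iterparse(doc, events=("end",)):
    if elem.tag == "order":  # the specified streaming level
        tuples.append((elem.get("id"), float(elem.findtext("total"))))
        elem.clear()  # release the processed subtree
```

Each `<order>` becomes its own tuple and its subtree is discarded immediately, which is what keeps memory use flat for large documents.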

Web service data source

A web service is a software system designed to support interoperable machine-to-machine interaction over a network

  • Virtual DataPort can use a SOAP web service as a data source and perform queries over it by invoking its operations
    • in a web service source, the fields of the query are the input parameters of the web service operations

Join Types

A join view allows the relational algebra operation with the same name to be executed on a series of input views

  • it combines records from two or more views by using values common to each
  • INNER JOIN (default)
  • LEFT OUTER JOIN
  • RIGHT OUTER JOIN
  • FULL OUTER JOIN
  • CROSS JOIN
    • each row from the first relation is combined with each row from the second one, this join returns the Cartesian product of the sets of rows from the joined tables
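Two of these types can be sketched on tiny in-memory views (illustrative Python, not VQL): a cross join is the full Cartesian product, while an inner join keeps only key-matching pairs.

```python
# Sketch: CROSS JOIN vs INNER JOIN on two tiny "views".
from itertools import product

left = [("A", 1), ("B", 2)]    # (name, key)
right = [(1, "x"), (3, "y")]   # (key, value)

# CROSS JOIN: every left row paired with every right row.
cross = list(product(left, right))  # 2 x 2 = 4 combinations

# INNER JOIN on left.key = right.key: only matching pairs survive.
inner = [(l, r) for l in left for r in right if l[1] == r[0]]
```

The outer-join variants would additionally keep the unmatched rows from the left side, right side, or both, padding the missing side with nulls.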

Join Execution Methods

Depending on the join type and the selected order, the user can define different join execution methods

  • Merge
  • Hash
  • Nested
  • Nested Parallel

If a join execution method is not set, Design Studio will try to calculate the best query plan, and all the queries against the join view will follow this plan

Merge

This join method requires the input data to be ordered by the join attributes, which can be achieved as follows

  • when the data source supports sorting operations, Virtual DataPort will automatically request the data ordered
    • if it does not support sorting operations, the ORDER BY clause can be used in the underlying views to sort the data in the required order
  • the collation of both sources must be the same
  • if the data source directly returns the data in the appropriate order
    • the base view options allow administrators to specify it using the 'Fields by which the data is sorted in the source' field
  • the data sources are queried in parallel, ordered by the join condition, and non-matching data is discarded
  • when applicable, this strategy is usually the most efficient one and uses less memory than the others
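The single coordinated pass over two ordered inputs can be sketched as follows (a simplified educational implementation, not Denodo's engine; it assumes both inputs are already sorted on the join key):

```python
# Sketch of a merge join: both inputs arrive ordered by the join key,
# so one coordinated pass combines them with minimal memory.
def merge_join(left, right, key=lambda r: r[0]):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = key(left[i]), key(right[j])
        if lk < rk:
            i += 1                      # discard non-matching left row
        elif lk > rk:
            j += 1                      # discard non-matching right row
        else:
            j0 = j                      # emit all pairings for this key
            while j < len(right) and key(right[j]) == lk:
                out.append(left[i] + right[j][1:])
                j += 1
            i += 1
            if i < len(left) and key(left[i]) == lk:
                j = j0                  # duplicate left key: re-scan block
    return out

rows = merge_join([(1, "Ana"), (2, "Bo")], [(1, "EU"), (3, "US")])
```

Only a constant number of cursor positions is held in memory, which is why the merge strategy is usually the cheapest when the ordering requirement can be met.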

Hash

Creates an in-memory hash table:

  • for each distinct join attribute value on the right side, it inserts a key-value pair into the hash table
  • the left-side tuples that match these values will be in the final join view output

This is the most efficient join method when

  • the input views cannot be ordered, are large, or the query latency times of the data sources are high
    • it minimizes the sub-queries sent to the sources
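The build/probe structure behind this method can be sketched in a few lines (a simplified educational version, not Denodo's implementation):

```python
# Sketch of a hash join: the right side is loaded into an in-memory
# hash table keyed by the join attribute, then the left side is probed
# in a single pass — one sub-query per input view at most.
def hash_join(left, right, key=lambda r: r[0]):
    table = {}
    for r in right:                     # build phase
        table.setdefault(key(r), []).append(r)
    out = []
    for l in left:                      # probe phase
        for r in table.get(key(l), []):
            out.append(l + r[1:])
    return out

rows = hash_join([(1, "Ana"), (2, "Bo")], [(2, "US"), (2, "EU")])
```

Because each source is read exactly once regardless of how the keys interleave, high per-query latency is paid only twice, at the cost of holding the right side in memory.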

Nested

Obtains data from one input view, then for each record obtained a subquery is executed against the other view using the join conditions

  • if the second view comes from a relational data source, Virtual DataPort will optimize the process by running a single subquery that retrieves all the matching data from the second source
    • if the number of values obtained from the left-side view of the join exceeds a certain value (200 by default), the server will group the queries to obtain the data from the right side
  • a nested join is a good option when the right side returns many tuples and the left side returns only a few
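The batching optimization can be sketched with SQLite standing in for the right-side relational source (table names are illustrative, and a batch size of 2 stands in for the default of 200):

```python
# Sketch of a batched nested join: for each batch of left-side keys,
# one sub-query with an IN clause fetches the matching right-side rows.
import sqlite3

right_db = sqlite3.connect(":memory:")
right_db.execute("CREATE TABLE country (id INT, name TEXT)")
right_db.executemany("INSERT INTO country VALUES (?, ?)",
                     [(1, "EU"), (2, "US"), (3, "APAC")])

left = [(1, "Ana"), (2, "Bo"), (3, "Cy")]
batch_size, rows = 2, []               # 2 stands in for the default 200
for start in range(0, len(left), batch_size):
    batch = left[start:start + batch_size]
    keys = [l[0] for l in batch]
    placeholders = ",".join("?" * len(keys))
    matches = dict(right_db.execute(
        f"SELECT id, name FROM country WHERE id IN ({placeholders})", keys))
    rows += [l + (matches[l[0]],) for l in batch if l[0] in matches]
```

Three left rows cost two sub-queries instead of three; with the real batch size, hundreds of per-row lookups collapse into one.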

Nested Parallel

Same as Nested, but this method executes the subqueries in parallel

  • an additional parameter allows specifying the maximum number of subqueries launched in parallel
  • if the second data source is of JDBC/ODBC type, NESTED PARALLEL is usually unnecessary and less efficient; the NESTED option should be used instead
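The parallel-subquery idea can be sketched with a thread pool (the slow lookup function and data are hypothetical; the pool size plays the role of the maximum-parallelism parameter):

```python
# Sketch of a nested parallel join: one sub-query per left-side row,
# executed concurrently up to a configurable limit.
import time
from concurrent.futures import ThreadPoolExecutor

COUNTRIES = {1: "EU", 2: "US", 3: "APAC"}

def subquery(key):
    time.sleep(0.01)          # stands in for data-source latency
    return COUNTRIES.get(key)

left = [(1, "Ana"), (2, "Bo"), (3, "Cy")]
max_parallel = 2              # the max-concurrent-subqueries parameter

with ThreadPoolExecutor(max_workers=max_parallel) as pool:
    names = list(pool.map(subquery, (l[0] for l in left)))

rows = [l + (n,) for l, n in zip(left, names) if n is not None]
```

Latency overlaps instead of accumulating; against a JDBC/ODBC source, though, the single grouped sub-query of the plain Nested method already avoids the per-row round trips, which is why parallelism buys little there.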

Compound Types and Flatten Operation

Denodo Virtual DataPort supports modeling data types with a complex structure using the types register and array

  • an element of type array can be considered a subview, and an array type always has an associated register type that acts as the schema of the subview it models

Sometimes it is desirable to flatten a compound field that contains an array of registers

  • typical in XML and web service data sources

Flattening a register

  • every register element will be a new field in the derived view output
  • every array element will be a new row

It is possible to perform the inverse operation of flatten, creating array or register elements from several fields

  • this can be done using the NEST or REGISTER functions
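Both directions can be sketched in plain Python (the order/line data and field names are invented; this illustrates the shapes, not VQL syntax):

```python
# Sketch of FLATTEN and its inverse (NEST): an array of registers is
# expanded into one row per element, and rows are grouped back into
# an array-valued field.
orders = [{"customer": "ACME",
           "lines": [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 1}]}]

# FLATTEN: each register field becomes a column, each array element a row.
flat = [(o["customer"], line["sku"], line["qty"])
        for o in orders for line in o["lines"]]

# NEST (inverse): group the flat rows back into an array per customer.
nested = {}
for customer, sku, qty in flat:
    nested.setdefault(customer, []).append({"sku": sku, "qty": qty})
```

The flattened form duplicates the scalar `customer` value on every row, which is exactly what the NEST step folds back into a single array field.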