Denodo Data Virtualization

Data Virtualization

Data Virtualization is a logical layer which

  • Delivers business data in real time to consuming applications or business users
  • Integrates data from disparate sources, locations, and formats, without replicating the data
  • Enables faster access to all data, less replication and cost, and more agility to change
  1. Connects: to disparate data sources
  2. Combines: related data into views
  3. Consumes: in business applications

Pillars of Data Virtualization

  1. Universal Data Access: hides the complexity of underlying data sources
  2. Unified Virtual Data Layer: Common virtual canonical business model
  3. Universal Data Publishing: multiple publishing interfaces and access patterns
  4. Unified Data Governance: Central point for data governance
  5. Agile High Performance: Multiple delivery processes to meet different SLAs (Service Level Agreements)

Data Integration Strategies

Data silos make it challenging for business users to access and analyze all of the available data within an organization. To bring the data together, companies typically use

  1. Extract, Transform, and Load (ETL) processes
  • copy the data from different silos and move it to a central location (e.g. a Data Warehouse)
  2. Enterprise Service Buses (ESB)
  • establish a communication system for applications, enabling them to share information
  3. Data Virtualization (DV)
  • create real-time, integrated views of the data in data silos, and make them available to applications, analysts, and business users

ETL

  • In an ETL process the data is extracted from a source, transformed and loaded into another data system
  • Pros and Cons
    • (Pro) ETL processes are efficient and effective at moving data in bulk
    • (Con) Moving data to another system means that a new repository must be maintained
    • (Con) ETL processes are not collaborative, the end-users must wait until the data is ready

ESB

  • Pros
    • Applications are decoupled
    • They can be used to orchestrate business logic using message flows
  • Cons
    • ESBs are suitable only for operational use cases that involve small result sets
    • Queries are static: they can only be scheduled, and are restricted to one source at a time

Data Virtualization

Data virtualization supports a wide variety of sources and targets, which makes it an ideal data integration strategy to complement ETL processes and ESBs

  • for ETL
    • Seamlessly connecting on-premises and cloud components
    • real-time integration of disparate data sources
  • for ESB
    • The ESB can connect to the data virtualization layer to access external sources that cannot be easily added to the ESB

Denodo Platform Architecture

The Denodo Platform includes the following built-in components:

  • Virtual DataPort: the core component, the data virtualization server
  • Data Catalog: a self-service data discovery tool
  • Scheduler: used for scheduling and executing batch jobs
  • Solution Manager: provides centralized management of all the servers

Virtual Dataport

Virtual Dataport enables business applications to process a series of distributed and heterogeneous data sources, as if the data were contained in a large virtual database

  • it acts as a mediator that provides a structured and unified view of the data contained in all the data sources included in the system

Data Catalog

Helps business users and data scientists discover the data assets that are available through the Denodo Platform

  • it promotes self-service and discovery capabilities for business users, enabling them to explore both data and metadata in a single web front-end tool
  • with this tool, the end user can access a graphical representation of the business entities and associations, as well as data lineage and tree view information
  • it includes reporting capabilities and export options (CSV, Excel, Tableau, etc.)

Data catalog is designed to provide organizations with three core benefits

  • Enterprise-wide directory of datasets available for consumption from a business-friendly web interface
    • faster business decisions, by enabling a much more rapid and comprehensive understanding of, and access to, all enterprise data and metadata
  • Governed self service BI/Analytics and data consumption that is still owned and maintained by IT
  • Significant IT time, resource and cost reduction

Denodo Scheduler

Denodo Scheduler automates recurring tasks such as:

  • Exporting data to a database or to files such as Tableau data extracts, delimited files, CSVs, flat files, and MS Excel
  • Indexing data to enable Google-like keyword searches on content
  • Cache loading
  • Periodic management tasks such as gathering data statistics, looking for changes in sources, etc.

Denodo Scheduler enables a hybrid, DV to ETL pattern

  • Jobs can be scheduled for a specific time and can have dependencies on other jobs
  • Detailed execution reports that can be sent by email
  • Support for data extraction from sources with limited query capabilities
  • Persistent jobs and Transparent retries in the event of failure
  • Possibility of parallel execution of the different queries involved in the same job

Denodo Solution Manager

  • easy promotion of new metadata to different environments in multi-location architectures with different clusters, from a centralized web UI
  • it provides automated lifecycle management capabilities that greatly simplify DevOps tasks and improve agility
  • its centralized web UI allows simpler and more efficient management of deployments, streamlining continuous delivery strategies

Denodo platform tools and applications

Denodo Platform Control Center

Allows starting and stopping all the Denodo Platform servers and their tools

  • it allows configuring some additional functions
  • it can be launched either by
    • selecting the shortcuts created during the installation
    • Or by executing the denodo_platform.sh script from the bin folder of the Denodo platform installation

Web Design Studio

A web-based tool, aimed at developers, for developing Virtual DataPort elements

  • Web design studio can be started
    • from the Denodo platform control center
    • by executing the designstudio_startup.sh script from the bin folder
  • it can be accessed using the URL: http://localhost:9090/denodo-design-studio
  • currently, it does not support configuring the administration options

Data Catalog

A web-based self-service tool, aimed at business users, to query and search the organization's data

  • with it, they can generate new knowledge and make better decisions
  • data catalog can be started, either from
    • the Denodo platform control center
    • by executing the datacatalog_startup.sh script from the bin folder of the Denodo platform installation
  • the default URL is: http://localhost:9090/denodo-data-catalog
  • this tool is not included in Denodo Standard 8.0 version

Scheduler Administration Tool

A web tool used to connect to the Scheduler server for creating and managing Scheduler jobs

  • the Scheduler Administration Tool can be started either using
    • the Denodo platform control center
    • executing the scheduler_webadmin_startup.sh script from the bin folder of the Denodo platform installation
  • the default URL to launch the tool is: http://localhost:9090/webadmin/denodo-scheduler-admin

Solution Manager Administration Tool

A web tool, used to connect to the Solution Manager and License Manager Servers

  • it provides a unified access point to all the web applications of the Denodo Platform
    • Data Catalog
    • Web Design Studio
    • Diagnostics and Monitoring tool
    • Scheduler web administration tool
  • The Solution Manager Administration Tool can be started, either by
    • Using the Denodo Solution Manager Control center
    • Using the solutionmanagerwebtool_startup.sh script from the bin folder of the Solution Manager installation
  • the default URL to launch the Solution Manager Administration tool

Diagnostics and Monitoring tool

A web tool aimed at both developers and administrators who want to

  • monitor the current state of a server, a cluster or an environment
  • diagnose the past state of a server to identify the cause of a problem
  • the Diagnostics and Monitoring Tool can be started either using
    • the Denodo Platform Control Center
    • or by executing the diagnosticmonitoringtool_startup.sh script
  • default URL: http://localhost:9090/diagnostic-monitoring-tool

Southbound Connectors

Specialized connectors to access specific data repositories or applications to retrieve their schema and data

  • these connectors are configurable using a visual editor or the SQL-like commands that are part of the Denodo Platform
  • they are used to create data sources that retrieve data from a data repository into the Denodo Platform
    • E.g. a relational database, a REST or SOAP web service, an MS Excel file, etc.

Data Model

Derived views are created over base views to combine the data from different sources or to apply any kind of transformation

  • the Denodo Platform offers a graphical drag-and-drop tool for modeling derived views
  • it uses extended relational algebra for data combination and transformation

Northbound Connectors

One of the essential capabilities of Data Virtualization is the variety of access mechanisms provided to consumers; these access mechanisms are called Northbound Connectors

  • the Denodo Platform offers flexible delivery options to suit all types of users and applications at almost no cost
  • it facilitates the creation of a single point of access and interaction with the underlying data sources and abstracted views in a standard way

Denodo Platform Security

Data in Transit

When users access data through the Denodo Platform, the data is in motion

  • Data moves through the network
    • From the data sources to Denodo
    • From Denodo to the final users and consuming applications
  • This is referred to as data in transit
  • To ensure that the data is well protected and secure, the connections can be encrypted

Data in transit refers to all communications between the Denodo Platform and the data sources, and between data consumers and the Denodo Platform

  • it can be secured through TLS at the connection level
    • Denodo supports TLS 1.2 or 1.3
    • TLS configuration is automated via the Denodo TLS configurator script

Any encryption algorithm supported by the default Java Cryptography Providers of the Denodo Platform JRE (Java 11) can be used

  • Stronger ciphers can be enabled by installing the Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy Files
  • when full encryption at the transport level is not required
    • Denodo's built-in encryption/decryption functions can be selectively applied to sensitive fields to prevent unauthorized access

Data at Rest

The Denodo Platform stores some operational data in a relational database and on the hard drive; this data is referred to as data at rest

  • Metadata of the various Denodo Platform components and elements
    • Data sources, views, stored procedures, web services
    • sensitive information on the configuration of the server
  • Data cached in a relational database
  • data swapped to disk during query execution

The Data at Rest can be secured by Denodo as follows

  • Metadata: sensitive information is stored encrypted/hashed in the metadata repository
    • Additionally, you can enable Transparent Data Encryption to encrypt the database
  • Cached Data: Denodo will transparently leverage any encryption mechanism available in the selected cache system
    • For example, Oracle Transparent Data Encryption
  • Swapped Data: encrypt the swap folder by using OS file system encryption

Authentication Protocols

Used to publish data to consuming applications

  • standard username/password security mechanism
  • Kerberos support
  • HTTP-based authentication pass-through for the OData interface
  • Web service security with HTTPS, HTTP basic/Digest, SAML 2.0, OAuth 2.0, HTTP SPNEGO and WS-Security protocols
  • OAuth 2.0 in JDBC, ODBC connections is also available

Denodo also offers different alternatives to integrate with identity, authentication and authorization services

  • Denodo built-in security
  • Integration with external entitlement services (LDAP/AD)
  • Single sign on using OAuth, and SAML
  • Integration with external custom entitlement service with specific security policies

Performance

Static Optimization

The Denodo optimizer analyzes each query and simplifies it by rewriting it to improve performance

  • it focuses on maximizing query delegation to the data sources
  • it applies by default for all the queries
  • it minimizes the network traffic
  • it reduces the data volume that needs to be processed in the virtual layer
  • it applies one or more simplifications to the same query
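The payoff of query delegation can be seen in a minimal sketch (plain Python with SQLite standing in for a remote source; table and column names are illustrative, not Denodo code): pushing a WHERE condition down to the source yields the same result while moving far less data into the virtual layer.

```python
# Illustrative sketch: delegating a condition to the source vs.
# filtering in the "virtual layer". SQLite plays the remote source.
import sqlite3

src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE sales (region TEXT, amount INT)")
src.executemany("INSERT INTO sales VALUES (?, ?)",
                [("EU", 10), ("US", 20), ("EU", 30), ("APAC", 5)])

# Without delegation: pull everything, then filter locally.
all_rows = src.execute("SELECT region, amount FROM sales").fetchall()
filtered_in_vdp = [r for r in all_rows if r[0] == "EU"]

# With delegation: the condition is pushed down, so the source
# returns only the matching rows and less data crosses the network.
pushed_down = src.execute(
    "SELECT region, amount FROM sales WHERE region = ?", ("EU",)).fetchall()

assert filtered_in_vdp == pushed_down  # same result, less data moved
```

Here 4 rows shrink to 2 before transfer; with real sources the reduction is what cuts network traffic and virtual-layer processing.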

Connecting and combining Data with Denodo

The Denodo Platform allows adding remote databases as data sources

  • the recommended way to connect to databases is always the JDBC data source
    • JDBC stands for Java Database Connectivity
    • JDBC defines an API for the Java programming language to access any database using a suitable driver

The Denodo Platform supports a wide variety of databases out of the box

  • relational databases
    • Oracle, MS SQL Server, PostgreSQL, MySQL, Amazon Aurora, Azure SQL, Apache Derby
  • Data Warehouses
    • Snowflake, Amazon Redshift, Azure SQL Data Warehouse, Google BigQuery, SAP HANA, Teradata, Yellowbrick
  • Interactive Query Services
    • Hive, Impala, Amazon Athena, Presto, Spark SQL

By default, the Denodo Platform does not bundle the JDBC drivers of some supported databases, although it includes the adapters

  • When a driver is not provided by Denodo, the .jar files of the driver must be uploaded to the Virtual DataPort server using the Administration Tool
  • Import option in File > Extension Management > Libraries

When Denodo does not include a specific adapter for a data source

  • if possible, select a similar adapter
    • for example, to connect to MySQL 6.0, the MySQL 5 adapter can be selected
  • if there is no similar adapter, the generic adapter can be used
  • in both cases the driver must be provided and uploaded to the server

Import a SQL Query

Instead of using a table as the source for a base view, it is possible to generate a view from a SQL query

  • the query will be sent to the database without modifications
    • but the Denodo Platform may delegate to the source a larger query built over the given statement, for example, when delegating conditions
  • this new view will inherit its internal schema from the metadata associated with the query results
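The "schema inherited from the query results" idea has a close analogue in any database API (a sketch, not Denodo internals; SQLite and the column names are illustrative): the result-set metadata of the arbitrary query becomes the view schema.

```python
# Sketch: a "base view" over an arbitrary SQL query inherits its schema
# from the result metadata (DB-API's cursor.description here; Denodo
# uses the source's own metadata the same way).
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, total REAL)")
db.execute("INSERT INTO orders VALUES (1, 9.5)")

# The query text is sent to the database as-is.
cur = db.execute("SELECT id, total * 1.21 AS total_with_tax FROM orders")

# The view schema is derived from the query's result metadata,
# including computed columns and their aliases.
schema = [col[0] for col in cur.description]
rows = cur.fetchall()
```

Note how the alias `total_with_tax`, which exists nowhere in the source tables, still ends up in the inferred schema.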

This is useful in several scenarios

  • using syntax specific to the accessed database that cannot be replicated in Virtual DataPort
    • for example, using a function that is not available in Virtual DataPort
  • situations where an exact query is needed
    • for example, reusing complex queries already created instead of duplicating them in Denodo
    • using queries that have been optimized for the specific database being accessed

Import Stored Procedures

Base views can be created graphically over stored procedures of the following databases

  • IBM DB2, MS SQL Server and Oracle

If the procedure has one or more cursors, one of the following options can be selected

  • Stream output at the specified cursor
    • the data that is returned by the cursor is flattened
  • do not stream cursor parameters
    • the data returned by the cursor will be in an array
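The difference between the two cursor options can be sketched in plain Python (the procedure output and field names are hypothetical, chosen only to illustrate the shapes):

```python
# Sketch: two ways to expose a stored-procedure cursor in a base view.
# "Stream" flattens the cursor into one tuple per row; "do not stream"
# keeps the cursor rows inside a single array-valued field.
proc_output = {"customer": "ACME",
               "orders_cursor": [(1, 9.5), (2, 20.0)]}  # cursor rows

# Option 1: stream output at the cursor — flattened, one tuple per row.
streamed = [(proc_output["customer"], oid, total)
            for oid, total in proc_output["orders_cursor"]]

# Option 2: do not stream — cursor rows stay in one array field.
as_array = (proc_output["customer"], proc_output["orders_cursor"])
```

Streaming yields two output tuples sharing the scalar fields; the non-streamed form yields a single tuple whose second field is the whole array.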

Querying views from Execution Panel

Virtual DataPort is able to import data coming from different systems, with different structures and formats

  • to retrieve data from a source, some elements are created in Denodo
    • a data source representing a repository of data
    • base views representing entities in the source
    • these base views can be executed in Virtual DataPort

Basic Derived Views

Derived Views are new views created over other Denodo Views

  • A derived view is another element in the Virtual DataPort catalog
    • it may then appear in the FROM clause of any VQL query and may be used as a base for constructing new views or queries
  • The cache may be used on a derived view, and user privileges can be defined on it

Functions

Functions are used to generate new attributes in the schema of a view

  • a function is written as an identifier and a list of arguments that can be constants, fields, or other functions
    • e.g. len('this function gets the length of this text')

All Virtual DataPort functions can be browsed by pressing CTRL + SPACE in

  • the Specify WHERE expression tab of the execution panel
  • the field expression dialog available when defining new fields for views

Custom functions allow users to extend the set of functions available in Virtual DataPort

  • they are used in the same way as every other function

Custom functions are implemented as Java classes packaged in a JAR file that is added to Virtual DataPort

  • functions have to be stateless
  • each function must have a unique name and its own Java class
  • different functions can be grouped in a single JAR file

Union Views

A union view allows the tuples from various input views to be output as a single view

  • in standard relational algebra, the union operation requires
    • all the relations to have the same schema
  • an extended UNION ALL is also supported
    • whenever any of the input relations has an attribute that is not present in the others, it is added to the resulting view
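The extended-union behavior can be sketched in a few lines of Python (the view contents and field names are made up for illustration): the result schema is the union of both input schemas, and rows lacking an attribute are padded.

```python
# Sketch of an extended UNION ALL: attributes missing from one input
# view appear in the result anyway, padded with None.
view_a = [{"id": 1, "name": "Ana"}]
view_b = [{"id": 2, "name": "Bo", "country": "US"}]

# Result schema = union of the input schemas, in order of appearance.
schema = []
for row in view_a + view_b:
    for field in row:
        if field not in schema:
            schema.append(field)

# Every output row carries the full schema; missing values become None.
extended_union = [{f: row.get(f) for f in schema} for row in view_a + view_b]
```

A strict relational union would reject these inputs because their schemas differ; the extended form widens the result instead.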

When creating a view, more than just a UNION operation is performed

  • some of the operations are performed through intermediate projection operations automatically added by Virtual DataPort
    • e.g. changing the name of the view, deleting fields
  • other intermediate operations can be added manually
    • the Where Conditions tab
    • the Group By tab
  • all the operations performed are shown in the tree view screen

JSON/XML data source

Virtual DataPort allows importing data from XML or JSON documents

  • these formats are often used for transmitting structured data over a network connection
  • these data sources allow the Denodo Platform to connect to RESTful web services
    • connecting to a resource URI that returns a response with a payload formatted in XML or JSON (Google APIs, Twitter, etc.)

For JSON data source

  • a tuple root can be selected
    • /JSONFile is the default option
    • modify it when it is needed to access only a part of the document rather than the entirety of it.
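The effect of moving the tuple root can be sketched with plain JSON parsing (the payload and the path are hypothetical, and the Python indexing only approximates Denodo's path syntax): pointing at an inner array yields one tuple per array element instead of one tuple for the whole document.

```python
# Sketch of the "tuple root" idea: selecting a sub-path of a JSON
# document so each element of an inner array becomes one tuple.
import json

payload = json.loads("""
{"meta": {"count": 2},
 "results": [{"id": 1, "name": "Ana"},
             {"id": 2, "name": "Bo"}]}
""")

# With the default root ("/JSONFile") the whole document is one tuple;
# descending to the "results" array yields one tuple per element.
tuple_root = payload["results"]
tuples = [(item["id"], item["name"]) for item in tuple_root]
```

Here the `meta` wrapper is skipped entirely and only the part of the document of interest is turned into rows.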

For XML data sources

  • there are two options to create base views
    • do not stream output: the base view will return a single compound value aggregating all the document information
    • stream output at the specified level: it does not require the entire document to be loaded in memory before processing it
      • the contents will be split into different tuples
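The streaming option maps naturally onto an event-driven XML parser; a minimal sketch (the document and element names are invented) using Python's `iterparse`:

```python
# Sketch of "stream output at the specified level": an event-driven
# parser emits one tuple per element at the chosen level, without
# materializing the whole document in memory first.
import io
import xml.etree.ElementTree as ET

doc = io.StringIO(
    "<orders><order id='1'><total>9.5</total></order>"
    "<order id='2'><total>20</total></order></orders>")

tuples = []
for event, elem in ET.iterparse(doc, events=("end",)):
    if elem.tag == "order":  # the specified streaming level
        tuples.append((elem.get("id"), float(elem.findtext("total"))))
        elem.clear()  # release the processed subtree
```

Each `<order>` becomes its own tuple and its subtree is discarded immediately, which is what keeps memory use flat for large documents.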

Web service data source

A web service is a software system designed to support interoperable machine-to-machine interaction over a network

  • Virtual DataPort can use a SOAP web service as a data source and perform queries over it by invoking its operations
    • in a web service source, the fields of the query are the input parameters of the web service operations

Join Types

A join view allows the relational algebra operation with the same name to be executed on a series of input views

  • it combines records from two or more views by using values common to each
  • INNER JOIN (default)
  • LEFT OUTER JOIN
  • RIGHT OUTER JOIN
  • FULL OUTER JOIN
  • CROSS JOIN
    • each row from the first relation is combined with each row from the second one, this join returns the Cartesian product of the sets of rows from the joined tables
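Two of these types can be sketched on tiny in-memory views (illustrative Python, not VQL): a cross join is the full Cartesian product, while an inner join keeps only key-matching pairs.

```python
# Sketch: CROSS JOIN vs INNER JOIN on two tiny "views".
from itertools import product

left = [("A", 1), ("B", 2)]    # (name, key)
right = [(1, "x"), (3, "y")]   # (key, value)

# CROSS JOIN: every left row paired with every right row.
cross = list(product(left, right))  # 2 x 2 = 4 combinations

# INNER JOIN on left.key = right.key: only matching pairs survive.
inner = [(l, r) for l in left for r in right if l[1] == r[0]]
```

The outer-join variants would additionally keep the unmatched rows from the left side, right side, or both, padding the missing side with nulls.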

Join Execution Methods

Depending on the join type and the selected order, the user can define different join execution methods

  • Merge
  • Hash
  • Nested
  • Nested Parallel

If a join execution method is not set, Design Studio will try to calculate the best query plan, and all the queries against the join view will follow this plan

Merge

This join method requires the input data to be ordered by the join attributes, which can be achieved as follows

  • when the data source supports sorting operations, Virtual DataPort will automatically request the data ordered
    • if it does not support sorting operations, the ORDER BY clause can be used in the underlying views to sort the data in the required order
  • the collation of both sources must be the same
  • if the data source directly returns the data in the appropriate order
    • the base view options allow administrators to specify it using the 'Fields by which the data is sorted in the source' field
  • the data sources are queried in parallel, ordered by the join condition, and non-matching data is discarded
  • when applicable, this strategy is usually the most efficient one and uses less memory than the others
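The single coordinated pass over two ordered inputs can be sketched as follows (a simplified educational implementation, not Denodo's engine; it assumes both inputs are already sorted on the join key):

```python
# Sketch of a merge join: both inputs arrive ordered by the join key,
# so one coordinated pass combines them with minimal memory.
def merge_join(left, right, key=lambda r: r[0]):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = key(left[i]), key(right[j])
        if lk < rk:
            i += 1                      # discard non-matching left row
        elif lk > rk:
            j += 1                      # discard non-matching right row
        else:
            j0 = j                      # emit all pairings for this key
            while j < len(right) and key(right[j]) == lk:
                out.append(left[i] + right[j][1:])
                j += 1
            i += 1
            if i < len(left) and key(left[i]) == lk:
                j = j0                  # duplicate left key: re-scan block
    return out

rows = merge_join([(1, "Ana"), (2, "Bo")], [(1, "EU"), (3, "US")])
```

Only a constant number of cursor positions is held in memory, which is why the merge strategy is usually the cheapest when the ordering requirement can be met.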

Hash

Creates an in-memory hash table:

  • for each distinct join attribute value on the right side, it inserts a key-value pair into the hash table
  • the left-side tuples that match these values will be in the final join view output

This is the most efficient join method when

  • the input views cannot be ordered, are large, or the query latency times of the data sources are high
    • it minimizes the sub-queries sent to the sources
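The build/probe structure behind this method can be sketched in a few lines (a simplified educational version, not Denodo's implementation):

```python
# Sketch of a hash join: the right side is loaded into an in-memory
# hash table keyed by the join attribute, then the left side is probed
# in a single pass — one sub-query per input view at most.
def hash_join(left, right, key=lambda r: r[0]):
    table = {}
    for r in right:                     # build phase
        table.setdefault(key(r), []).append(r)
    out = []
    for l in left:                      # probe phase
        for r in table.get(key(l), []):
            out.append(l + r[1:])
    return out

rows = hash_join([(1, "Ana"), (2, "Bo")], [(2, "US"), (2, "EU")])
```

Because each source is read exactly once regardless of how the keys interleave, high per-query latency is paid only twice, at the cost of holding the right side in memory.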

Nested

Obtains data from one input view, then for each record obtained a subquery is executed against the other view using the join conditions

  • if the second view comes from a relational data source, Virtual DataPort will optimize the process by running a single subquery that retrieves all the matching data from the second source
    • if the number of values obtained from the left-side view of the join exceeds a certain value (200 by default), the server will group the queries to obtain the data from the right side
  • a nested join is a good option when the right side returns many tuples and the left side returns only a few
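The batching optimization can be sketched with SQLite standing in for the right-side relational source (table names are illustrative, and a batch size of 2 stands in for the default of 200):

```python
# Sketch of a batched nested join: for each batch of left-side keys,
# one sub-query with an IN clause fetches the matching right-side rows.
import sqlite3

right_db = sqlite3.connect(":memory:")
right_db.execute("CREATE TABLE country (id INT, name TEXT)")
right_db.executemany("INSERT INTO country VALUES (?, ?)",
                     [(1, "EU"), (2, "US"), (3, "APAC")])

left = [(1, "Ana"), (2, "Bo"), (3, "Cy")]
batch_size, rows = 2, []               # 2 stands in for the default 200
for start in range(0, len(left), batch_size):
    batch = left[start:start + batch_size]
    keys = [l[0] for l in batch]
    placeholders = ",".join("?" * len(keys))
    matches = dict(right_db.execute(
        f"SELECT id, name FROM country WHERE id IN ({placeholders})", keys))
    rows += [l + (matches[l[0]],) for l in batch if l[0] in matches]
```

Three left rows cost two sub-queries instead of three; with the real batch size, hundreds of per-row lookups collapse into one.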

Nested Parallel

Same as Nested, but this method executes the subqueries in parallel

  • an additional parameter allows specifying the maximum number of subqueries launched in parallel
  • if the second data source is of JDBC/ODBC type, NESTED PARALLEL is usually unnecessary and less efficient; the NESTED option should be used instead
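The parallel-subquery idea can be sketched with a thread pool (the slow lookup function and data are hypothetical; the pool size plays the role of the maximum-parallelism parameter):

```python
# Sketch of a nested parallel join: one sub-query per left-side row,
# executed concurrently up to a configurable limit.
import time
from concurrent.futures import ThreadPoolExecutor

COUNTRIES = {1: "EU", 2: "US", 3: "APAC"}

def subquery(key):
    time.sleep(0.01)          # stands in for data-source latency
    return COUNTRIES.get(key)

left = [(1, "Ana"), (2, "Bo"), (3, "Cy")]
max_parallel = 2              # the max-concurrent-subqueries parameter

with ThreadPoolExecutor(max_workers=max_parallel) as pool:
    names = list(pool.map(subquery, (l[0] for l in left)))

rows = [l + (n,) for l, n in zip(left, names) if n is not None]
```

Latency overlaps instead of accumulating; against a JDBC/ODBC source, though, the single grouped sub-query of the plain Nested method already avoids the per-row round trips, which is why parallelism buys little there.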

Compound Types and Flatten Operation

Denodo Virtual DataPort supports modeling data types with a complex structure using the types register and array

  • an element of type array can be considered a subview, and an array type always has an associated register type that acts as the schema of the subview it models

Sometimes it is desirable to flatten a compound field that contains an array of registers

  • typical in XML and web service data sources

Flattening a register

  • every register element will be a new field in the derived view output
  • every array element will be a new row

It is possible to perform the inverse operation of flatten, creating array or register elements from several fields

  • this can be done using the NEST or REGISTER functions
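Both directions can be sketched in plain Python (the order/line data and field names are invented; this illustrates the shapes, not VQL syntax):

```python
# Sketch of FLATTEN and its inverse (NEST): an array of registers is
# expanded into one row per element, and rows are grouped back into
# an array-valued field.
orders = [{"customer": "ACME",
           "lines": [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 1}]}]

# FLATTEN: each register field becomes a column, each array element a row.
flat = [(o["customer"], line["sku"], line["qty"])
        for o in orders for line in o["lines"]]

# NEST (inverse): group the flat rows back into an array per customer.
nested = {}
for customer, sku, qty in flat:
    nested.setdefault(customer, []).append({"sku": sku, "qty": qty})
```

The flattened form duplicates the scalar `customer` value on every row, which is exactly what the NEST step folds back into a single array field.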