Data integration involves combining data residing in different sources and providing users with a unified view of that data. This process matters in both commercial settings (for example, when two similar companies need to merge their databases) and scientific ones (for example, combining research results from different bioinformatics repositories). Data integration has grown in importance as the volume of existing data and the need to share it have exploded. It has become the focus of extensive theoretical work, and numerous problems remain unsolved.
Traditionally, data integration has meant compromise. However quickly data architects and developers completed a project ahead of its deadline, speed came at the expense of quality. If they focused instead on delivering a quality project, it would generally drag on for months and exceed its deadline. And if the teams concentrated on both quality and rapid delivery, the costs would invariably exceed the budget. Whichever path you chose, the end result was less than desirable. This led some experts to revisit the scope of data integration, and that is the focus of this write-up.
Why is data integration so difficult?
Even after years of building data warehouses and data infrastructures, IT teams continue to struggle with the high costs, delays and suboptimal results of traditional data integration. With new data sources and data types constantly being introduced, data management professionals are not making concrete progress. Rising business demands are another factor. The following are some of the main causes of inefficiency in data integration:
- Data integration is a time-consuming process owing to the lack of reusable data patterns.
- Integration costs are tremendous.
- Low data quality makes the integrated results untrustworthy.
- Scalability is the biggest issue in traditional data systems, as data volumes keep increasing each day.
- A lack of real-time technologies means data is not updated regularly.
These problems and challenges all stem from the reality that data has become more fragmented while data integration has grown more complex, costly and inflexible. Depending on the data integration requirements, the data must first be extracted from the various sources; it is then generally filtered, aggregated, summarized or transformed in some way, and finally delivered to the destination user or system.
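The extract, transform and deliver flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sample sources, their field names and the function names `extract`, `transform` and `deliver` are all hypothetical and not taken from any particular tool.

```python
import csv
import io
import json
from collections import defaultdict

# Hypothetical sample sources: a CSV export and a JSON feed that describe
# the same kind of record using different field names.
CSV_SOURCE = """customer,amount,status
alice,120.50,complete
bob,80.00,cancelled
alice,30.00,complete
"""

JSON_SOURCE = (
    '[{"cust": "bob", "total": 45.25, "state": "complete"},'
    ' {"cust": "carol", "total": 10.0, "state": "complete"}]'
)


def extract():
    """Read both sources and yield records in one common shape."""
    for row in csv.DictReader(io.StringIO(CSV_SOURCE)):
        yield {
            "customer": row["customer"],
            "amount": float(row["amount"]),
            "status": row["status"],
        }
    for item in json.loads(JSON_SOURCE):
        yield {
            "customer": item["cust"],
            "amount": float(item["total"]),
            "status": item["state"],
        }


def transform(records):
    """Filter out orders that are not complete, then sum amounts per customer."""
    totals = defaultdict(float)
    for rec in records:
        if rec["status"] == "complete":
            totals[rec["customer"]] += rec["amount"]
    return dict(totals)


def deliver(totals):
    """Deliver the result; here the 'destination' is just a sorted list of rows."""
    return sorted(totals.items())


result = deliver(transform(extract()))
print(result)
```

In a real project each stage would talk to actual systems (databases, files, APIs), but the shape of the flow stays the same: normalise on extraction, filter and aggregate in the middle, and hand a clean result to the destination.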
Dramatic changes in data volume, variety and velocity make the traditional approach to data integration inadequate and require a move to next-generation techniques in order to unlock the potential of data.
For any data integration project one has to have a good understanding of the following:
- Data models within each data source
- Mappings between the source and destination data models
- Data contracts mandated by the destination system
- Data transformations required
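To make the four items above concrete, here is a minimal sketch that assumes a hypothetical CRM export as the source and a hypothetical `Customer` record as the destination model. The field names, the `FIELD_MAP` table and the contract rules are invented for illustration; a real destination system would dictate its own.

```python
from dataclasses import dataclass
from datetime import date

# Source data model: a raw row as it might arrive from a (hypothetical) CRM export.
source_row = {"cust_nm": "  Alice Smith ", "dob": "1990-04-12", "bal_cents": "125050"}

# Mapping between the models: destination field -> (source field, transformation).
FIELD_MAP = {
    "name": ("cust_nm", str.strip),                    # formatting
    "birth_date": ("dob", date.fromisoformat),         # type conversion
    "balance": ("bal_cents", lambda v: int(v) / 100),  # unit conversion
}


@dataclass
class Customer:
    """Destination data model."""
    name: str
    birth_date: date
    balance: float


def map_row(row):
    """Apply the mapping and its transformations to one source row."""
    values = {dest: fn(row[src]) for dest, (src, fn) in FIELD_MAP.items()}
    return Customer(**values)


def check_contract(customer):
    """Return the list of contract violations (empty when the record is valid)."""
    errors = []
    if not customer.name:
        errors.append("name must not be empty")
    if customer.balance < 0:
        errors.append("balance must not be negative")
    return errors


customer = map_row(source_row)
print(customer, check_contract(customer))
```

Keeping the mapping in a data structure rather than scattering it through the code is one way to make it reviewable by data stewards and reusable across projects.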
Next Generation Data Integration
Next-generation data integration technologies are designed mostly to support an extended team that includes data, integration and enterprise architects, as well as data analysts and data stewards, while better aligning with business users.
Most companies use a mixture of systems developed in-house, third-party “off the shelf” products and cloud-hosted systems. To get the most value from the data within these systems, it must be consolidated in some way. For example, the data within a cloud-hosted system could be combined with the data in an in-house system to deliver a new product or service. Mergers and acquisitions are also big drivers of data integration. Data from multiple systems must be consolidated so that the companies involved in the merger can work together effectively.
Some rules for next generation data integration are as follows:
- Data Integration (“DI”) is a family of techniques. Some data management professionals still think of DI as merely ETL tools for data warehousing or data replication utilities for database administration. Those use cases are still prominent. Yet, DI practices and tools have broadened into a dozen or more techniques and use cases.
- DI techniques may be hand coded, based on a vendor’s tool, or both. TDWI survey data shows that migrating from hand coding to using a vendor DI tool is one of the strongest trends as organizations move into the next generation. A common best practice is to use a DI tool for most solutions, but augment it with hand coding for functions missing from the tool.
- DI practices reach across both analytics and operations. DI is not just for Data Warehousing (“DW”). Nor is it just for operational Database Administration (“DBA”). It now has many use cases spanning many analytic and operational contexts, and expanding beyond DW and DBA work is one of the most prominent generational changes for DI.
- DI is an autonomous discipline. Nowadays, there’s so much DI work to be done that DI teams with 13 or more specialists are the norm; some teams have more than 100. The diversity of DI work has broadened, too. Due to this growth, a prominent generational decision is whether to staff and fund DI as is, or to set up an independent team or competency center for DI.
- DI is absorbing other data management disciplines. The obvious example is DI and Data Quality (“DQ”), which many users staff with one team and implement on one unified vendor platform. A generational decision is whether the same team and platform should also support master data management, replication, data sync, event processing and data federation.
- DI has become broadly collaborative. The larger number of DI specialists requires local collaboration among DI team members, as well as global collaboration with other data management disciplines, including those mentioned in the previous rule, plus teams for message/service buses, database administration and operational applications.
- DI needs diverse development methodologies. A number of pressures are driving generational changes in DI development strategies, including increased team size, operational versus analytic DI projects, greater interoperability with other data management technologies and the need to produce solutions in a more lean and agile manner.
- DI requires a wide range of interfaces. That’s because DI can access a wide range of source and target IT systems in a variety of information delivery speeds and frequencies. This includes traditional interfaces (native database connectors, ODBC, JDBC, FTP, APIs, bulk loaders) and newer ones (Web services, SOA and data services). The new ones are critical to next generation requirements for real time and services. Furthermore, as many organizations extend their DI infrastructure, DI interfaces need to access data on-premises, in public and private clouds and at partner and customer sites.
- DI must scale. Architectures designed by users and servers built by vendors need to scale up and scale out to both burgeoning data volumes and increasingly complex processing, while still providing high performance at scale. With volume and complexity exploding, scalability is a critical success factor for future generations. Make it a top priority in your plans.
- DI requires architecture. It’s true that some DI tools impose an architecture (usually hub and spoke), but DI developers still need to take control and design the details. DI architecture is important because it strongly enables or inhibits other next generation requirements for scalability, real time, high availability, server interoperability and data services.
Some of the methods employed for next generation of data integration are as follows:
- Extract, Transform & Load (“ETL”) Design: ETL involves extracting data from a source, transforming it into the desired data model and finally loading it into the destination data store. In short:
  - The extraction process copies the source data into a staging data store, which is usually a separate schema within the destination database or a separate staging database.
  - The transform process generally involves re-modelling the data, normalizing its structure, unit conversion, formatting and the resolution of primary keys.
  - The load process is quite simple in the case of a one-off data migration but can be quite complex in the case of continuous data integration, where the data must be merged (inserts, updates & deletes), possibly over a defined date range. There may also be a requirement to maintain a history of changes.
- Enterprise Application Integration Design: While the ETL process is about consolidating data from many sources into one, Enterprise Application Integration, or EAI, is about distributing data between two or more systems. Data exchanged using EAI is often transactional and related to an event in a business process or the distribution of master data. EAI approaches include message brokers and the Enterprise Service Bus (“ESB”).
- Data Virtualization & Data Federation: Data virtualization is the process of offering data consumers a data access interface which hides the technical aspects of stored data, such as location, storage structure, access language and storage technology. Data federation is a form of data virtualization where the data stored in a heterogeneous set of autonomous data stores is made accessible to data consumers as one integrated data store by using on-demand data integration.
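The ETL steps in the list above (staging, transformation and a merging load) can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not how any particular product works: the table names `staging_product` and `product`, the sample rows and the pound-to-kilogram conversion are all assumptions made for the example, and it does not keep a history of changes.

```python
import sqlite3

LB_TO_KG = 0.45359237  # exact definition of the international pound

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Staging store: holds source data exactly as extracted.
    CREATE TABLE staging_product (sku TEXT, name TEXT, weight_lb REAL);
    -- Destination store, with some rows from an earlier load.
    CREATE TABLE product (sku TEXT PRIMARY KEY, name TEXT, weight_kg REAL);
    INSERT INTO product VALUES ('A1', 'Widget', 0.5), ('B2', 'Discontinued gadget', 2.0);
""")


def extract(source_rows):
    """Copy the source rows into the staging table without changing them."""
    conn.execute("DELETE FROM staging_product")
    conn.executemany("INSERT INTO staging_product VALUES (?, ?, ?)", source_rows)


def transform():
    """Normalise keys, tidy formatting and convert units."""
    rows = conn.execute("SELECT sku, name, weight_lb FROM staging_product").fetchall()
    return [
        (sku.strip().upper(), name.strip(), round(weight_lb * LB_TO_KG, 3))
        for sku, name, weight_lb in rows
    ]


def load(rows):
    """Merge into the destination: update matches, insert new rows, delete missing ones."""
    incoming = set()
    for sku, name, weight_kg in rows:
        incoming.add(sku)
        cur = conn.execute(
            "UPDATE product SET name = ?, weight_kg = ? WHERE sku = ?",
            (name, weight_kg, sku),
        )
        if cur.rowcount == 0:  # no existing row was updated, so insert
            conn.execute(
                "INSERT INTO product (sku, name, weight_kg) VALUES (?, ?, ?)",
                (sku, name, weight_kg),
            )
    existing = {row[0] for row in conn.execute("SELECT sku FROM product")}
    for sku in existing - incoming:  # rows no longer present in the source
        conn.execute("DELETE FROM product WHERE sku = ?", (sku,))
    conn.commit()


# Hypothetical extract from a source system.
extract([(" a1 ", "Widget v2 ", 2.0), ("c3", "Gizmo", 10.0)])
load(transform())
final = conn.execute("SELECT sku, name, weight_kg FROM product ORDER BY sku").fetchall()
print(final)
```

After the run, A1 has been updated, C3 inserted and B2 deleted. Treating a key that is absent from the source as a delete is only safe when each extract is a full snapshot; an incremental or date-ranged feed would need explicit delete markers instead.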
Hexanika: Implementation of Data Integration in the form of ETL Process
Hexanika is a FinTech Big Data software company that has developed a revolutionary software product named Hexanika Solutions, which helps financial institutions address data sourcing and reporting challenges for regulatory compliance. In essence, it covers uploading the data, sanitizing it and then validating it. Hexanika Solutions can join ‘N’ different tables or data sources. Data checks are applied to the standardized data as per customer needs. Scalability isn’t an issue, as the platform used is Hadoop.
The ELT Process
Hexanika leverages the power of ELT using distributed parallel processing, Big Data/Hadoop technology with a secure data cloud (IBM Cloud). Understanding the high implementation costs of new systems and the complexities involved in redesigning existing solutions, Hexanika offers a unique build that adapts to existing architectures. This makes our solution cost-effective, efficient, simple and smart!
Read more about our solution and architecture at: https://hexanika.com/big-data-solution-architecture/
Also, do visit our website: www.hexanika.com
Contributor: Akash Marathe
Feature image link: https://i.ytimg.com/vi/7TpG9w46i_A/maxresdefault.jpg
 Image Credits: IBM