Earlier in my career, at the height of the business intelligence (BI) boom, I worked with an internal reporting team to expose data for extract, transform, and load (ETL) processes built on Ralph Kimball-inspired data structures. Learning how to shape data for reporting and analysis was new and exciting for me, and coming from transaction-driven architectures, I initially thought the schema was backward.
In the end, there were many moving parts, including a dependency on a flat file just to make everything function properly. The reports came out quickly, but one thing bothered me: I was always looking at yesterday's data.
With the growing volume, velocity, and variety of data produced by modern tech stacks, real-time data has become the holy grail for up-to-date, proactive analysis and data-driven business decisions. With the traditional batch approach to data ingestion, the data is always stale and cannot answer time-sensitive questions.
To support these real-time use cases, the data must be available in the storage and processing layers that an organization now employs.
Transactional (OLTP) systems like Oracle, Postgres, and SQL Server have been at the heart of most enterprises for many years, and they remain important for downstream analytics and processing. Typically, a team of data engineers spends months building pipelines that pull and transform data from OLTP systems into analytical (OLAP) data lakes and warehouses.
Organizations of all sizes have spent billions of dollars over the last decade trying to improve their ability to handle data and extract economic value from it, yet only 26.5 percent of companies report having achieved their goal of becoming a data-driven business. One of the biggest obstacles to reaching that "Data Utopia" state is building dependable pipelines that extract data from OLTP systems while preserving data integrity and scaling easily as the data grows.
This made me wonder whether there was a better way to achieve the same outcomes without such a lengthy overnight process. My first instinct was to look into the change data capture (CDC) pattern.
Why Change Data Capture (CDC) Matters
Data extraction from OLTP systems can be difficult for a variety of reasons, including:
- Enterprise databases, unlike SaaS technologies, lack flexible APIs that can be used to react to data changes.
- Building and maintaining pipelines that extract data from OLTP systems while preserving data consistency is complex and costly, especially at scale.
CDC, by contrast, is a software design pattern that detects and records database changes in real time. This lets third-party consumers react to database changes, greatly simplifying data retrieval from the OLTP system. Changes can be captured in several ways:
- Time-based: changes are identified by querying a timestamp column on each table.
- Checksum: checksums are computed on source and target tables, and the differences identify changed data.
- Log-based: the database's transaction logs are used as the data source.
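To make the trade-off concrete, here is a minimal Python sketch of the time-based approach, using an in-memory list as a hypothetical stand-in for a database table (in SQL this would be roughly `SELECT * FROM orders WHERE updated_at > :last_sync`). The table and column names are invented for illustration.

```python
from datetime import datetime

# Hypothetical in-memory table standing in for a real database table.
rows = [
    {"id": 1, "updated_at": datetime(2023, 1, 1, 9, 0)},
    {"id": 2, "updated_at": datetime(2023, 1, 1, 10, 30)},
    {"id": 3, "updated_at": datetime(2023, 1, 1, 11, 15)},
]

def time_based_changes(table, last_sync):
    """Time-based CDC: return rows modified after the last sync timestamp."""
    return [r for r in table if r["updated_at"] > last_sync]

# Everything after the 10:00 sync point counts as changed.
changed = time_based_changes(rows, datetime(2023, 1, 1, 10, 0))
```

Note that each polling cycle rescans the table with a query like this, which is exactly the overhead log-based CDC avoids.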
We'll concentrate on the log-based technique in this article. Because it captures data changes in a lightweight way, log-based CDC is widely regarded as the best option for real-time use cases. Time-based and checksum techniques add CPU overhead because they rely on database queries to locate the most recent changes to a dataset. Log-based CDC saves time and money by skipping these expensive, time-consuming queries and going straight to the source: the logs.
In the log-based technique, database writes are recorded in transaction logs, often known as changelogs. CDC solutions monitor these logs and generate events when certain conditions are met. Rather than waiting for a transaction to commit and then querying the database for the new data, the CDC service reads the same data directly from the logs.
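The log-based flow can be sketched in Python as follows. This is a simplified model, not any vendor's implementation: real transaction logs (Oracle redo logs, Postgres WAL) are binary and far richer, but the core idea is the same — a consumer tracks its position in the log (here a hypothetical log sequence number, or LSN) and receives only the entries past it.

```python
# Simplified changelog: each entry carries an LSN, an operation,
# a table name, and the row payload. All values are illustrative.
changelog = [
    {"lsn": 101, "op": "INSERT", "table": "TRANSACTIONS", "row": {"id": 1, "amount": 50.0}},
    {"lsn": 102, "op": "UPDATE", "table": "TRANSACTIONS", "row": {"id": 1, "amount": 75.0}},
    {"lsn": 103, "op": "INSERT", "table": "ACCOUNTS", "row": {"id": 9}},
]

def stream_changes(log, last_lsn, table):
    """Emit change events for one table past the consumer's last position."""
    for entry in log:
        if entry["lsn"] > last_lsn and entry["table"] == table:
            yield entry

# A consumer that last saw LSN 101 picks up only the newer TRANSACTIONS change.
events = list(stream_changes(changelog, last_lsn=101, table="TRANSACTIONS"))
```

Notice that the consumer never queries the table itself — it only reads the log, which is why log-based CDC puts so little load on the production database.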
Third-party data consumers gain faster access to database changes by using a CDC solution. Furthermore, because consumers only care about the changes recorded in the changelogs, they never need access to the database itself. CDC thus enables near-instant data replication with only a modest impact on production database performance, and IT teams like it because consumers read only the logs, reducing security exposure. Because of these qualities, CDC has been embraced by tech leaders who see real-time use cases as critical to their business models (e.g., Shopify, Uber, Capital One).
Detecting Fraud Using CDC
The financial industry is deeply concerned about fraudulent transactions. Credit card fraud surged 44.7 percent in 2020 over the previous year, according to one industry report.
Suppose WorldWide Bank, a hypothetical financial institution, wants to speed up its fraud detection. Its current approach relies on an ETL process that queries and generates reports from an Oracle database. The bank wants to see whether it can use Arcion CDC to process Oracle database changelogs and produce events for its Databricks Delta Lake prototype. Like many businesses, WorldWide Bank runs a number of transactional systems, including an Oracle database that houses a substantial portion of its data. It has chosen Databricks for its data lake because it needs more machine learning (ML) capabilities and an easy-to-use environment for combining multiple datasets for analysis.
Oracle's relational model, the difficulty of bringing in additional ancillary data sources, and the lack of support for running machine learning directly on Oracle make Databricks an obvious choice. However, this raises another question: how to source Oracle transactions into Databricks quickly. To detect fraud faster, the bank needs to swiftly capture CDC events from Oracle and deposit them in its Databricks data lake for processing and detection.
Why did WorldWide Bank choose Arcion CDC over another option, even an open-source one? The answer is straightforward. Debezium, for example, is free and open source, but it requires installing and operating ZooKeeper and Kafka (via Kafka Connect) before you can use it. Installing and maintaining that stack alone would require significant engineering effort.
While other solutions, such as Fivetran's HVR, also offer free trials, Arcion provides two free options: a 30-day download trial or a 14-day cloud trial. That convenience for testing made Arcion an obvious choice when evaluating CDC solutions.
Prototyping Arcion CDC
To represent the WorldWide Bank scenario, assume inbound transactions are written to an Oracle database table called TRANSACTIONS.
To use the combined power of Oracle and Databricks for fraud protection, we need to set up a CDC service as middleware between the Oracle database and Databricks Delta Lake. Within Delta Lake, the TRANSACTIONS table schema and existing data must first land in a Bronze (ingestion) tier. Once the existing data has been ingested, the Arcion solution's CDC stream handles all future transactions.
The first step is to create a free Arcion Cloud account and gather the connection and login details for the Oracle database and the Databricks workspace.
The ingested data can then be refined into Silver (refined tables) and Gold (feature/aggregate datastore) layers, letting us use Delta Lake's fraud detection and mitigation capabilities. Going forward, all data from the Oracle TRANSACTIONS table automatically finds its way into Databricks.
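As a rough illustration of the Bronze → Silver → Gold flow, here is a plain-Python sketch; in Databricks this would be PySpark over Delta tables, and the field names are invented for the example. Bronze holds raw CDC events as ingested, Silver deduplicates them, and Gold aggregates a per-account feature that a fraud model might consume.

```python
# Bronze: raw CDC events, which may include duplicate deliveries.
bronze = [
    {"txn_id": 1, "account": "A", "amount": 50.0},
    {"txn_id": 1, "account": "A", "amount": 50.0},  # duplicate delivery
    {"txn_id": 2, "account": "A", "amount": 900.0},
    {"txn_id": 3, "account": "B", "amount": 20.0},
]

# Silver: deduplicate on the transaction key (later copies win).
silver = list({r["txn_id"]: r for r in bronze}.values())

# Gold: total spend per account, a simple feature for fraud scoring.
gold = {}
for r in silver:
    gold[r["account"]] = gold.get(r["account"], 0.0) + r["amount"]
```

The same shape scales up directly: each layer is a table, and each arrow is a repeatable transformation rather than a one-off report query.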
Our CDC intake into the data lake has one more step. Before running the ML fraud model for prediction output, we would run pre-built dbt tests or Great Expectations checkpoints to ensure we have received new records since the last fraud detection run. This reassures WorldWide Bank that the whole end-to-end CDC flow from Oracle into Databricks is working as expected.
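A freshness gate of that kind can be sketched in a few lines of Python. This is a hypothetical helper, not a dbt or Great Expectations API: it simply compares a high-watermark from the current run against the last processed position and fails fast when nothing new has arrived.

```python
def records_since_last_run(current_max_id: int, last_processed_id: int) -> int:
    """Freshness gate: fail fast if no new TRANSACTIONS rows arrived."""
    new_rows = current_max_id - last_processed_id
    if new_rows <= 0:
        raise RuntimeError("No new TRANSACTIONS rows since the last fraud run")
    return new_rows

# e.g., the previous run stopped at id 100 and the table now tops out at 120.
new_count = records_since_last_run(current_max_id=120, last_processed_id=100)
```

Failing loudly here is the point: a silently stalled CDC stream would otherwise let the fraud model keep scoring yesterday's data.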
I find it amazing that I can achieve this design simply by completing a series of steps in the Arcion Cloud user interface, without writing a single line of code. Establishing equivalent replication for WorldWide Bank's other databases should be just as simple, allowing all fraud detection to happen in one place.
Arcion provides CDC functionality without reinventing the wheel. Its technology lets feature teams configure source and destination data sources with minimal code, along with everything needed to route inbound database transactions to a secondary destination as they are written.
Furthermore, Arcion can be used in a variety of ways. For example, financial or healthcare businesses may have security or compliance needs. Self-hosted Arcion can be installed on-premises or on a virtual private cloud. Arcion Cloud, which is fully managed, is a viable alternative for enterprises lacking the DevOps capabilities to handle their own installations.
Additionally, because snapshot and CDC streaming replications are simple to set up, other data sources can reuse the same processing logic as the Databricks Delta Lake implementation. This keeps the fraud detection logic in one place, a "don't repeat yourself" (DRY) strategy.
The Arcion team is definitely following my personal mission statement and allowing their customers to focus on accomplishing corporate goals.
If I were working with the same BI team mentioned in the introduction, I would strongly suggest they investigate Arcion Cloud. It would let them make decisions on real-time data rather than data from the previous day.