Capturing Changed Data In An Oracle Data Warehouse

May 17, 2004
[Note: This is the first of two articles by Kehinde O Eseyin, an OCP DBA working with the Nigerian Ports Authority, Delta Ports, Warri, Nigeria. Kehinde can be contacted at pen4kenny@yahoo.ca]
INTRODUCTION
Data change is inevitable in a data warehouse. A data warehouse is a central repository containing enterprise-wide information used for strategic decision-making. The importance of the information in a data warehouse for decision support and operational business processes cannot be overemphasized. Since the decisions to be taken depend on the information currently available in the warehouse, data currency, correctness and accuracy must be ensured. Data currency, in turn, depends on the ETL process: how often data extraction occurs and the latency (delay) of the load. To help guarantee data currency, Oracle provides several built-in features targeted at this goal.
Features such as PL/SQL procedures, materialized views, external tables and the MERGE command are all useful for capturing change. This article focuses on the Oracle features that support the capture of changed data, particularly the Oracle Change Data Capture framework, an Oracle9i Release 1 feature, and Oracle Streams, an Oracle9i Release 2 feature. It then briefly explores the major enhancements in Oracle 10g that make capturing change easier, and concludes by pointing out issues that should be addressed in future releases of Oracle.
WHAT IS AN ETL PROCESS?
The ETL process, an acronym for the extraction, transformation and loading operations, is fundamental to a data warehouse. Whenever DML (data manipulation language) operations such as INSERT, UPDATE or DELETE are issued on the source database, changed data becomes available for extraction.
After data extraction and transformation have taken place, the data are loaded into the data warehouse. Incremental loading is beneficial in the sense that only the rows that have changed since the last extraction and transformation are loaded.
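As an illustration, an incremental load is often expressed with the MERGE command mentioned earlier. The following is a minimal sketch, assuming hypothetical warehouse and staging tables SALES_DW and SALES_STAGE joined on a SALE_ID key; the table and column names are illustrative only.

-- Incremental load: update rows that already exist in the warehouse,
-- insert the rows that are new (SALES_DW and SALES_STAGE are hypothetical)
MERGE INTO sales_dw d
USING sales_stage s
ON (d.sale_id = s.sale_id)
WHEN MATCHED THEN
  UPDATE SET d.quantity = s.quantity,
             d.amount   = s.amount
WHEN NOT MATCHED THEN
  INSERT (sale_id, quantity, amount)
  VALUES (s.sale_id, s.quantity, s.amount);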
ORACLE CHANGE DATA CAPTURE FRAMEWORK
The Change Data Capture framework is designed to capture only INSERT, UPDATE and DELETE operations on the Oracle database; that is to say, it is 'DML sensitive'. Below is the architecture illustrating the flow of information in an Oracle Change Data Capture framework.

Figure 1. Change Data Capture framework architecture.
Implementing Oracle Change Data Capture is straightforward. The following steps guide you through the whole implementation process; a minimal PL/SQL sketch of the complete cycle follows the list.
1. Source table identification: First, identify the source tables.
2. Choose a publisher: The publisher is responsible for creating and managing the change tables. The publisher must be granted SELECT_CATALOG_ROLE, which allows selecting from any SYS-owned dictionary table or view, and EXECUTE_CATALOG_ROLE, which grants execute privileges on any SYS-owned package. The publisher also needs SELECT privilege on the source tables.
3. Change table creation: When data extraction occurs, change data are stored in the change tables, along with system metadata needed for the smooth functioning of the change tables. To create a change table, execute the DBMS_LOGMNR_CDC_PUBLISH.CREATE_CHANGE_TABLE procedure. Note that each source table must have its own change table.
4. Choose the subscribers: The publisher must grant SELECT privilege on the change tables and source tables to the subscriber. There may be more than one subscriber, as the case may be.
5. Subscription handle creation: The subscription handle uniquely identifies a particular subscription. Irrespective of the number of tables subscribed to, exactly one subscription handle must be created. To create one, first define a variable, then execute the DBMS_LOGMNR_CDC_SUBSCRIBE.GET_SUBSCRIPTION_HANDLE procedure.
6. Subscribe to the change tables: The data in the change tables are usually enormous, so subscribe only to the data of interest by executing the DBMS_LOGMNR_CDC_SUBSCRIBE.SUBSCRIBE procedure.
7. Subscription activation: A subscription is activated only once, and after activation it cannot be modified. Activate the subscription using the DBMS_LOGMNR_CDC_SUBSCRIBE.ACTIVATE_SUBSCRIPTION procedure.
8. Subscription window creation: Since subscribing to the change tables does not stop data extraction from the source tables, a window is set up using the DBMS_LOGMNR_CDC_SUBSCRIBE.EXTEND_WINDOW procedure. Note, however, that changes made on the source system after this procedure is executed will not be available until the window is purged and re-extended.
9. Subscriber view creation: To view and query the change data, a subscriber view is prepared for each source table the subscriber subscribes to, using the DBMS_LOGMNR_CDC_SUBSCRIBE.PREPARE_SUBSCRIBER_VIEW procedure. Define a variable in which the subscriber view name will be returned; the procedure also takes the subscription handle, source schema name and source table name.
10. Query the change tables: The subscriber view holds not only the change data needed but also metadata fundamental to using the change data effectively, such as OPERATION$, CSCN$ and USERNAME$. Since you already know the view name, you can describe the view and then query it with a conventional SELECT statement.
11. Drop the subscriber view: Drop a subscriber view only when you are sure the data in it have been viewed and extracted and are no longer needed. Each subscriber view must be dropped individually using the DBMS_LOGMNR_CDC_SUBSCRIBE.DROP_SUBSCRIBER_VIEW procedure.
12. Purge the subscription window: To allow change data to be extracted again, the subscription window must be purged using the DBMS_LOGMNR_CDC_SUBSCRIBE.PURGE_WINDOW procedure.
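The following is a minimal PL/SQL sketch of this publish-and-subscribe cycle, assuming Oracle9i Release 1 synchronous capture (change set SYNC_SET), a hypothetical publisher schema CDCPUB and a hypothetical source table SCOTT.EMP; the column lists, grants and names would be adapted to your own environment.

-- Publisher: create a change table for the hypothetical source table SCOTT.EMP
BEGIN
  DBMS_LOGMNR_CDC_PUBLISH.CREATE_CHANGE_TABLE(
    owner             => 'CDCPUB',
    change_table_name => 'EMP_CT',
    change_set_name   => 'SYNC_SET',
    source_schema     => 'SCOTT',
    source_table      => 'EMP',
    column_type_list  => 'EMPNO NUMBER(4), SAL NUMBER(7,2)',
    capture_values    => 'both',
    rs_id             => 'y',
    row_id            => 'n',
    user_id           => 'n',
    timestamp         => 'n',
    object_id         => 'n',
    source_colmap     => 'y',
    target_colmap     => 'y',
    options_string    => NULL);
END;
/

-- Subscriber: obtain a handle, subscribe, activate, extend the window
-- and prepare the subscriber view
VARIABLE subhandle NUMBER
VARIABLE viewname  VARCHAR2(30)
BEGIN
  DBMS_LOGMNR_CDC_SUBSCRIBE.GET_SUBSCRIPTION_HANDLE(
    change_set          => 'SYNC_SET',
    description         => 'EMP salary changes',
    subscription_handle => :subhandle);
  DBMS_LOGMNR_CDC_SUBSCRIBE.SUBSCRIBE(
    subscription_handle => :subhandle,
    source_schema       => 'SCOTT',
    source_table        => 'EMP',
    column_list         => 'EMPNO, SAL');
  DBMS_LOGMNR_CDC_SUBSCRIBE.ACTIVATE_SUBSCRIPTION(subscription_handle => :subhandle);
  DBMS_LOGMNR_CDC_SUBSCRIBE.EXTEND_WINDOW(subscription_handle => :subhandle);
  DBMS_LOGMNR_CDC_SUBSCRIBE.PREPARE_SUBSCRIBER_VIEW(
    subscription_handle => :subhandle,
    source_schema       => 'SCOTT',
    source_table        => 'EMP',
    view_name           => :viewname);
END;
/

-- Query the change data through the returned view name, for example:
--   SELECT operation$, cscn$, empno, sal FROM <value of :viewname>;
-- then drop the view and purge the window so new changes can be extracted
BEGIN
  DBMS_LOGMNR_CDC_SUBSCRIBE.DROP_SUBSCRIBER_VIEW(
    subscription_handle => :subhandle,
    source_schema       => 'SCOTT',
    source_table        => 'EMP');
  DBMS_LOGMNR_CDC_SUBSCRIBE.PURGE_WINDOW(subscription_handle => :subhandle);
END;
/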
ORACLE STREAMS
Oracle Streams, a new feature in Oracle9i Release 2, is an information-sharing technology that allows the propagation and management of data, events and transactions within an Oracle database, between Oracle databases, or between Oracle and non-Oracle databases. Oracle Streams is flexible in the sense that it allows user intervention: users can specify what information goes into the stream, the route it flows along, what happens to events in the stream and how the stream terminates. It can be used to capture events such as DML (INSERT, UPDATE, DELETE) and DDL (ALTER, DROP, RENAME) operations.
Fundamental to the working of Oracle Streams are three components, namely:
Capture
Staging
Apply
Figure 2. Architecture of Oracle Streams.
The capture process
The capture process is responsible for identifying the data to capture, such as database changes (DDL and DML) and application-generated messages. Capture can be either implicit, in which the server captures DML and DDL events at a source database using Oracle's default rules, or explicit, in which a customized configuration enqueues data using procedures.
Furthermore, the capture process formats the retrieved changes into events called logical change records (LCRs), which are then placed in a queue, the staging environment, for onward processing. Logical change records are of two types: DDL LCRs and row LCRs. A DDL LCR describes a change made to a database object by an ALTER, RENAME, CREATE or DROP command. A row LCR, on the other hand, describes the modification of a single row of a table by a single DML statement. This implies that updating 10 rows in a table using a single DML statement generates 10 row LCRs.
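A minimal sketch of configuring implicit capture with DBMS_STREAMS_ADM follows; the table SCOTT.EMP, the capture name and the queue STRMADMIN.STREAMS_QUEUE are hypothetical, and the queue is assumed to exist already (its creation is sketched in the staging section below).

-- Configure implicit capture of DML changes to the hypothetical table SCOTT.EMP;
-- LCRs produced by this capture process are enqueued into STRMADMIN.STREAMS_QUEUE
BEGIN
  DBMS_STREAMS_ADM.ADD_TABLE_RULES(
    table_name   => 'scott.emp',
    streams_type => 'capture',
    streams_name => 'capture_emp',
    queue_name   => 'strmadmin.streams_queue',
    include_dml  => TRUE,
    include_ddl  => FALSE);
END;
/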
The staging process
The staging area is a queue that acts as a temporary repository for logical change records (LCRs) until they are consumed. The subscriber (a user application, another staging area or the default apply process) has control over the contents of the staging area, so the subscriber can decide which records are propagated or consumed from the queue. For events to be propagated from a queue, a user must own the queue and hold the appropriate privileges, not only on the source queue but also on the target queue. Moreover, a particular target queue can accept events from more than one source queue.
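A staging queue is typically created with the DBMS_STREAMS_ADM package. The sketch below assumes a hypothetical Streams administrator schema STRMADMIN; queue and queue table names are illustrative.

-- Create the queue table and queue that will stage the LCRs
BEGIN
  DBMS_STREAMS_ADM.SET_UP_QUEUE(
    queue_table => 'strmadmin.streams_queue_table',
    queue_name  => 'strmadmin.streams_queue',
    queue_user  => 'strmadmin');
END;
/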
The apply process
The apply process is responsible for applying the changes at the target database. This can happen in two ways, namely default (implicit) consumption and customized (explicit) consumption. In default consumption, the apply engine applies the changes to the database directly. If a conflict occurs, the apply engine resolves it by invoking conflict resolution routines.
In customized consumption, the logical change records are passed as arguments to a user-defined procedure for processing. Customized procedures that process DML LCRs, DDL LCRs and enqueued messages are called DML handlers, DDL handlers and message handlers respectively.
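As a sketch of customized consumption, a DML handler can be registered with DBMS_APPLY_ADM; here EMP_UPDATE_HANDLER is a hypothetical user-defined procedure that accepts a single SYS.AnyData argument containing the row LCR, and SCOTT.EMP is again a hypothetical table.

-- Register a user-defined DML handler for UPDATE operations on SCOTT.EMP
BEGIN
  DBMS_APPLY_ADM.SET_DML_HANDLER(
    object_name    => 'scott.emp',
    object_type    => 'TABLE',
    operation_name => 'UPDATE',
    error_handler  => FALSE,
    user_procedure => 'strmadmin.emp_update_handler');
END;
/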
ORACLE 10G AND DATA CHANGE CAPTURE
As Oracle data warehouses become increasingly complex, powerful features that simplify the ETL process have been added to the latest release, Oracle 10g. Essentially, much has been done to facilitate the extraction and movement of large volumes of data between Oracle databases. The emergence of heterogeneous transportable tablespaces and Oracle Data Pump are testimony to this fact, and both are described later in this paper.
From the data structure perspective, change data capture can occur either at the data level or at the application level. At the data level, a table in the target database is regarded as a remote snapshot of a table in the source database. At whichever level capture and propagation take place, there is always some increase in the workload on the source database. With Oracle 10g, however, this additional overhead is largely eliminated: asynchronous CDC is now available, in which change data is extracted from the redo logs without a negative performance impact on the source database. Asynchronous CDC can be described as a lightweight technology for change extraction and propagation in a data warehousing system, in which changes to the source tables are exposed as relational data for onward consumption by subscribers.
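In Oracle 10g the publish interface is exposed as DBMS_CDC_PUBLISH. The following is a minimal sketch of defining an asynchronous HotLog change set; it assumes the source database already runs in ARCHIVELOG mode with supplemental logging enabled, and the change set name is hypothetical.

-- Create an asynchronous change set fed from the redo stream (HOTLOG_SOURCE)
BEGIN
  DBMS_CDC_PUBLISH.CREATE_CHANGE_SET(
    change_set_name    => 'EMP_DAILY_SET',
    description        => 'Asynchronous HotLog change set for EMP',
    change_source_name => 'HOTLOG_SOURCE');

  -- Change tables are then created in this set with
  -- DBMS_CDC_PUBLISH.CREATE_CHANGE_TABLE, much as in the 9i example above
  -- (for asynchronous sets, source_colmap must be 'n').

  -- For asynchronous change sets, capture starts only once the set is enabled.
  DBMS_CDC_PUBLISH.ALTER_CHANGE_SET(
    change_set_name => 'EMP_DAILY_SET',
    enable_capture  => 'y');
END;
/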
There is therefore no gainsaying that asynchronous CDC has greatly enhanced parallel log file processing and data transformation.
Heterogeneous Transportable Tablespaces
Transportable tablespaces were introduced in Oracle 8i. Moving data with transportable tablespaces is much faster than exporting and importing the same data, because transporting a tablespace involves only copying the datafiles and integrating the tablespace's schema information into the data dictionary.
Transportable tablespaces have proved useful in diverse ways, including:
Data loading from OLTP applications to data warehouse systems.
Feeding data marts from central data warehouses.
Updating data warehouses and data marts from staging systems.
Much as the benefits are enormous, transportable tablespaces have suffered some limitations over the years. These setbacks include:
The source and target databases must be on the same operating system; it is impossible, for example, to transport a tablespace from an NT Oracle database to an HP-UX Oracle database.
Only a self-contained set of tablespaces can be transported, that is, a set with no references from within the set pointing to objects outside the set.
The source and target databases cannot use different character sets; the character set of both databases must be the same.
Transportable tablespaces do not support function-based indexes, which are used primarily to improve query performance when the WHERE clause of a SQL statement contains operations on columns.
Transportable tablespaces do not support materialized views.
Scoped REFs are also not supported by transportable tablespaces.
With the advent of Oracle 10g, however, the restriction in former releases that the source and target databases run on the same operating system has been removed. Transportable tablespaces are now platform-friendly: you can transport tablespaces between databases on different platforms. When transporting tablespaces across platforms with different byte ordering, the RMAN utility and its CONVERT command are used to convert the datafiles to the target byte order; alternatively, the conversion can be performed at the destination after the tablespace datafiles have been transported.
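As an illustration, the endian format of each platform can be checked from SQL and the byte-order conversion performed with RMAN; the tablespace name USERS_TS, the target platform string and the file paths below are hypothetical.

-- Check the endian format of the platforms involved (run on either database)
SELECT platform_name, endian_format
  FROM v$transportable_platform;

-- Make the tablespace read-only before copying its datafiles
ALTER TABLESPACE users_ts READ ONLY;

-- RMAN (not SQL): convert the datafiles to the target platform's byte order
-- before shipping them, for example:
--   RMAN> CONVERT TABLESPACE users_ts
--         TO PLATFORM 'HP-UX (64-bit)'
--         FORMAT '/tmp/transport_%U';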
Oracle Data Pump
One of the new features of Oracle 10g is Oracle Data Pump. It is indeed an exciting server-side infrastructure suitable for fast movement of bulk data and metadata from one Oracle database to another. Oracle Data Pump is highly flexible in that it not only allows you to build customized data movement utilities, but also allows you to monitor status and to cancel, suspend and resume a load. A load can also be restarted after failure without loss of data integrity.
When unloading data physically, Oracle Data Pump uses external tables as the unload mechanism: the external table Data Pump unload driver writes the data into platform-independent, Oracle-proprietary files. The new external table unload mechanism can be used on its own, without the new export/import utilities (expdp/impdp), which are client-side utilities that make application programming interface (API) calls into Oracle Data Pump.
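For instance, a table's data can be unloaded through the ORACLE_DATAPUMP access driver with a single CREATE TABLE ... ORGANIZATION EXTERNAL statement. The directory object DUMP_DIR, the dump file name and the source table SCOTT.EMP below are hypothetical.

-- Unload SCOTT.EMP into a Data Pump format file via the external table mechanism
-- (DUMP_DIR must already exist as a writable directory object)
CREATE TABLE emp_unload
  ORGANIZATION EXTERNAL (
    TYPE ORACLE_DATAPUMP
    DEFAULT DIRECTORY dump_dir
    LOCATION ('emp_unload.dmp'))
AS SELECT empno, ename, sal FROM scott.emp;

-- The resulting file can later be read on another database by creating an
-- external table over 'emp_unload.dmp' with the same ORACLE_DATAPUMP driver.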
CONCLUSION
This article has given insight into how changed data can be captured and how the ETL process can be managed using Oracle technologies. Oracle has remained committed to making its manageability features more flexible. This is shown by the chronological analysis in this paper of how successive releases of Oracle, from Oracle9i Release 1 through Oracle9i Release 2 to the latest release, Oracle 10g, handle changed data. The dynamic nature of ETL process management in Oracle is highly commendable: from the conventional transform-then-load and load-then-transform methodologies to the transform-while-loading methodology, and from synchronous CDC to asynchronous CDC.
By and large, as Oracle pursues better ETL management functionality, it should look into the homogeneity requirements that remain in its latest release: asynchronous change data capture and transportable tablespaces, to mention a few, still require that the source and destination databases both be 10g. Furthermore, the other drawbacks of transportable tablespaces that have yet to be addressed should be looked into in the next release.
Good luck!