At this point, we have a populated set of temporary tables that match the destination tables in the data warehouse. These tables have been populated with the source data, but still without any of the complex transformations that need to take place. The data may not be consistent with itself or with other data in the data warehouse. Before we can reconcile detailed transactions, we have to ensure that the basis of comparison is the same: that is, the reference information is identical. If this is not the case, we may find that a can of beans today has a different stock-keeping unit (SKU) from the one used yesterday.
In order to avoid this situation, we have to reconcile data in the order in which it makes sense. In practice, this means we start by reconciling the key reference items, and then work our way outwards to related reference items, before addressing the base transaction. This is harder than it sounds, because many of the entities are interrelated to a point where it becomes difficult to decide how to prise them apart. It will be necessary to keep changing the summaries that are produced to match the query profiles at each point intime. If we had to modify the warehouse manager every time we wished to add a new summary or change an existing one, the system would be perpetually in flux. Metadata can be used to address this issue, by data-driving the generation of summaries. Within the database itself, we store descriptions of the summary tables we require in terms of facts and dimensions.
August 20th, 2008 | Posted in General | No Comments
At this point, the warehouse manager has to create indexes against the information in the fact or dimension tables. One would expect the index creation time to be significant, even if we need only to create an index against a fact table partition. Because of this, most relational technologies have facilities to create indexes in parallel, distributing the load across the hardware and significantly reducing the elapsed time.
The overhead of inserting a row into a table and indexes can be far higher with a large number of rows than the overhead of re-creating the indexes once the rows have been inserted. Therefore it is often more effective to drop all indexes against tables prior to inserting large numbers of rows. Fact data that tends to have a large amount of data inserted on a daily basis is a prime candidate for dropping indexes before the data load. This method will be more effective unless the partition that is being loaded to already contains a substantial amount of data. Dimension data tends to change or be added to in far smaller volumes than fact data, so unless the dimension is being changed wholesale, retain existing indexes on the dimension tables. In order to fool user access tools into thinking that the fact table partitions are one large table, at this point in the process the warehouse manager creates views that combine a number of partitions into a single fact table. We would suggest that you create a few views, corresponding to meaningful periods of time within the business.
August 11th, 2008 | Posted in General | No Comments
The warehouse manager has a number of areas of responsibility. One of these areas is maintaining the responsiveness of the system even when the query profiles change over time. This can be achieved by gathering statistics on the queries being physically executed against the data warehouse, and deducing the appropriate summaries to speed up common queries. Query statistics can be collected by the query manager as it intercepts any query hitting the database. In fact, some relational databases provide facilities to do this as a matter of course. If this information is available, the warehouse manager can be extended to various functions. The query manager is the system component that performs all the operations necessary to support the query management process. This system is typically constructed using a combination of user access tools, specialist data warehousing monitoring tools, native database facilities, bespoke coding, C programs, and shell scripts.
As with the load manager, the size and complexity of the query manager will vary between specific solutions. Unlike the load manager, the complexity of the query manager is driven by the extent to which the facilities are provided by user access tools or native database facilities. Practically, all the query manager is built in later development phases. Typically, the query manager is designed in the first build phase, once the database and user access tool technologies have been determined.
July 18th, 2008 | Posted in General | No Comments
Once the data is in temporary tables, we can begin performing the simple conversions to start getting it into shape. The next step is to create a set of tables identical to the destination tables in the data warehouse. For example, if there exists a highly partitioned sales transactions table in the data warehouse, we should create one of those partitions at this point. This allows us to start populating the database in a form that fits the data warehouse requirement. Because we are about to execute a substantial number of consistency checks at this point, data should not be loaded into the warehouse until it has been cleaned up. If a consistency check fails, then that particular item of data needs to be set aside in a temporary table till later. Although all relational databases provide some form of rollback, in practice it is easier to load data into a temporary area, clean it up, and then publish it to the data warehouse.
On the basis that we have used temporary tables identical to the destination tables, it becomes a simple exercise to publish data to the data warehouse. This can be achieved either by a simple copy operation or, even better, by reassigning the temporary table to become a partition of the destination table. Once source data is in the temporary data store and has been cleaned up, the warehouse manager can then transform it into a form suitable for decision support queries. The data is transformed into a form in which the bulk of the factual data lies in the center, surrounded by the reference (dimension) data. Three variations on this theme are commonly used: star schemas, snowflake schemas, and starflake schemas.
July 2nd, 2008 | Posted in General | No Comments
Data should be loaded into the warehouse in the fastest possible time, in order to minimize the total load window. This becomes critical as the number of data sources increases, and the time window shrinks. The speed at which the data is processed into the warehouse is affected by the kind of transformations that are taking place. In practice, it is mare effective to load the data into a relational database prior to applying transformations and checks. This is because most database vendors have very performant facilities to fast-load data from ASCII flat files.
Gateway technology tends to be inappropriate, because gateways tend not to be performant when large data volumes are involved. This technology may be appropriate where the data volumes to be loaded are small, but in practice we have never seen a situation where they have fitted the bill. Utilities to fast load data into databases tend to have limited facilities to transform the data at the same time. These facilities usually bypass the SQL layer of the database software, and thus SQL transformations are not possible. In archiving, older data is removed from the system in a format that allows it to be quickly restored if required. For example, in a retail sales analysis data warehouse there may be a requirement to keep data for 3 years, with the latest 6 months being kept online. In this sort of scenario there is often a requirement to be able to do month-on-month comparisons for this year and last year. This will require some months of data to be temporarily restored from archive.
It is common to archive data as a flat file extract, where the file is in a format that allows the data to be fast-loaded directly into the relevant fact and dimension tables. One issue that needs to be addressed is the fact that as the data warehouse evolves, the reference data, the structure of the fact data and of any related information may change. To ensure that a restored archive is valid, you may need to extract all related data and structures as well.
June 24th, 2008 | Posted in General | No Comments
The load manager is the system component that performs all the operations necessary to support the extract and load process. This system may be constructed using a combination of off-the-shelf tools, bespoke coding, C programs and shell scripts. The size and complexity of the load manager will vary between specific solutions from data warehouse to data warehouse but, as an indication, the larger the degree of overlap between source systems, the larger the load manager will be. However, it is worth noting that third-party tools will probably contribute a maximum of 20 to 25% of the total system functionality.
The bulk of the effort to develop a load manager should be planned within the first production phase. Although the entire system does not have to be built, a significant proportion of the functionality should be provided in the first phase. Each of these functions has to operate automatically, and recover from any errors it encounters, to a very large extent with no human intervention. This is because this process tends to run overnight at the close of the business day. In a situation where the load manager encounters a major error, it may be necessary for it to flag that human intervention is required. A system that automatically pages system operators DBAs is fairly typical.
June 20th, 2008 | Posted in General | No Comments
As users become used to the facilities provided by a data warehouse, they will change the kinds of queries they ask. This is inevitable and should be encouraged, because it indicates that users are exploiting the information content of the data warehouse. This implies that query profiles change on a regular basis over the life of a data warehouse, and the original user query requirements may be nothing more than a starting point. Bearing in mind that summary tables are structured around a defined query profile, if the profile changes, the summary tables need to change as well.
In order to accurately monitor and understand what the new query profiles are, it can be very effective to capture the physical queries that are being executed. At various points in time, such as the end of the week, these queries can be analyzed to determine the new query profiles, and the resulting impact on summary tables. This analysis can be manual or automatic. The mechanism is less significant than the fact that the process has to occur. Query capture is typically part of the query management process.
June 1st, 2008 | Posted in General | No Comments
As in operational systems, the data within the data warehouse is backed up regularly in order to ensure that the data warehouse can always be recovered from data loss, software failure or hardware failure. Backup and recovery strategies need to be put in place. In archiving, older data is removed from the system in a format that allows it to be quickly restored if required. For example, in a retail sales analysis data warehouse there may be a requirement to keep data for 3 years, with the latest 6 months being kept online. In this sort of scenario there is often a requirement to be able to do month-on-month comparisons for this year and last year. This will require some months of data to be temporarily restored from archive.
It is common to archive data as a flat file extract, where the file is in a format that allows the data to be fast-loaded directly into the relevant fact and dimension tables. One issue that needs to be addressed is the fact that as the data warehouse evolves, the reference data, the structure of the fact data and of any related information may change. To ensure that a restored archive is valid, you may need to extract all related data and structures as well. Data warehouses that contain summary data potentially provide a number of distinct data sources to respond to a specific query. These are the detailed information itself, and any number of aggregations that satisfy the query’s information need.
May 28th, 2008 | Posted in General | No Comments
This is the area of the data warehouse that stores all the detailed information in the starflake schema. In many cases, all the detailed information is not held online the whole time, but aggregated to the next level of detail, and the detailed information is then offloaded into tape archive. On a rolling basis, detailed information is loaded into the warehouse to supplement the aggregated data.
For example, it is very common to store transactions at the level of product by store by day for retail sales analysis data warehouses. This information is perfectly appropriate for analyzing actual sales against projected sales. What this degree of detail does not provide is basket analysis: that is, because all transactions for a particular product in a particular store are aggregated to the whole of the day, it is not possible to examine a specific customer basket transaction, or analyze components within the basket.
In order to determine what degree of detail is required, ask yourself the question - What activities within the business process require detailed, individual transactions? This should indicate what user requirements exist to store base transactions. If the business requirement for detailed information is weak or very specific, it may be possible to satisfy it by storing a rolling three-month detailed history. This could satisfy the need to have some detailed information, without too much impact on the storage requirements. This style of operation is assisted if all historical changes to dimension data are stored online - that is, dimension tables.
May 16th, 2008 | Posted in General | No Comments
This is when we ensure that any data being loaded does not contradict the information already within the data warehouse. In some cases, the changes could be valid - for example, updating information about the product hierarchy - but the changes need to be controlled carefully, so as not to render meaningless any of the existing information already in the data warehouse. For example, if we compared the existing customer list with a previous version already in the data warehouse, we would expect inconsistencies to be rationalized by specific customer events. If the events don’t exist, we need to determine whether event or customer transactions are missing.
Once the data has been cleaned, the next task within the c1ean-and-transform process is to convert the source data in the temporary data store into a structure that is designed to balance query performance and operational cost. The techniques applied to the data to convert it into a form that is suitable for long-term storage need to be applied. Every data warehouse solution will use these techniques to varying degrees; the precise mix will vary depending on the specific business requirements. Information in the data warehouse must be structured to support the performance requirement from the business, and also to minimize the ongoing operational cost. These requirements essentially pull the design in opposite directions, prompting you either to structure to improve query performance, or to structure to minimize operational management.
May 12th, 2008 | Posted in General | No Comments