Actually, this idea should rather have started another "BI Musings" blog. But for the sake of simplicity ...
Any advanced data warehouse/BI landscape will necessarily accumulate a bunch of (mostly dimensional) data residing in the twilight zone between operational input and analytical output.
Examples from our experience are localization resources, seasonal and other analytical taggings, organisational mappings, plausibility rules, technical annotations such as versions, modes, runs, and so on and on ... You see, the appetite comes with eating.
Most often, it is too expensive and/or inconvenient to put these data and the associated maintenance processes into the (legacy) operational systems where they should have been modelled in the first place.
On the other hand, the relational techie in charge of the DWH platform is seldom willing to care about the business meaning of those things either.
We have tried various mechanisms to construct and maintain these beasts, ranging from "Direct-Editing Relational Tables" and "Pre-Pivoting in Excel and Mounting the Sheet as an External Table" to "Replaying SQL Scripts". None of those solutions has been very satisfying so far.
It was actually our customer who gave us the hint to look at a tool that we had abandoned long ago but that is still installed, along with Excel, on most BI users' computers: MS Access!
Using Access to build up and maintain a normalized part of the dimensional model (e.g., we use individual tables for each hierarchy/attribute, holding technical and business keys, labels, descriptions, ranges, localizations, ...), optionally with forms on top, makes it a pleasure for business users.
By (graphically!) defining views in Access, it is quite easy to construct, pre-pivot, debug and even pre-chart the flattened dimensional outcome of the partial model. Outer joins (caution: mostly RIGHT JOINs in a typical star-like structure) help you to identify integrity problems early when doing batch input. After that, referential integrity constraints help you and your users to arrive at consistent intermediate versions of the data.
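To make this concrete, here is a minimal sketch of such a flattening view; the table and column names (product_family, product_group, and their keys and labels) are made up for illustration, assuming each hierarchy level has its own table holding a technical key, a business key and a label:

select f.family_bkey, f.family_label, g.group_bkey, g.group_label
from product_family as f
right join product_group as g on g.family_id = f.family_id;

Rows where the family columns come back empty point directly to groups whose parent is still missing from the batch input.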
Finally, you can use the more-or-less two-button ODBC export facility to dump the complete result view into the data warehouse stage/information schema of your choice. Depending on the sophistication of your driver (believe me, SQL Server is well supported ;-), type conversion can sometimes go astray, but SSAS will remap whatever junk comes in to reasonable target types most of the time.
Wednesday, December 9, 2009
Transformation of parent-child-hierarchies into a multi-column representation
It's not rocket science, but here is a query to transform a balanced parent-child hierarchy into a multi-column hierarchy for a dimension table:
-- subquery factoring for the relevant part of the source table
with groups as (
  select groupid
       , groupname
       , parentid
    from ext_groups
)
-- definition of the levels of the hierarchy
select ig0.groupid   groupid_level0
     , ig0.groupname groupname_level0
     , ig1.groupid   groupid_level1
     , ig1.groupname groupname_level1
     , ig2.groupid   groupid_level2
     , ig2.groupname groupname_level2
     , ig3.groupid   groupid_level3
     , ig3.groupname groupname_level3
     , ig4.groupid   groupid_level4
     , ig4.groupname groupname_level4
  from
       -- root level with the restriction to nodes without a parent
       -- (in this case: parentid = 0)
       (select * from groups where parentid = 0) ig0
       -- level 1
     , (select * from groups) ig1
       -- level 2
     , (select * from groups) ig2
       -- level 3
     , (select * from groups) ig3
       -- level 4 (= branch level)
     , (select * from groups) ig4
 where ig1.parentid = ig0.groupid
   and ig2.parentid = ig1.groupid
   and ig3.parentid = ig2.groupid
   and ig4.parentid = ig3.groupid
 order by ig0.groupid
        , ig1.groupid
        , ig2.groupid
        , ig3.groupid
        , ig4.groupid;
The first subquery, ig0, selects all nodes of the root level; these rows are then joined to ig1 via ig1.parentid = ig0.groupid, and every following level is connected in the same way. Of course, you have to know the depth of the hierarchy to choose the fitting number of subqueries.
The approach should also work for unbalanced (= ragged) hierarchies, but then you have to use outer joins (and maybe replace NULL values with a known parent entry).
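As a minimal two-level sketch of that variant (reusing the groups factoring from above), a left outer join keeps root nodes that have no children, and NVL repeats the parent where the child is missing:

with groups as (
  select groupid, groupname, parentid from ext_groups
)
select ig0.groupid                       groupid_level0
     , ig0.groupname                     groupname_level0
     , nvl(ig1.groupid,   ig0.groupid)   groupid_level1
     , nvl(ig1.groupname, ig0.groupname) groupname_level1
  from (select * from groups where parentid = 0) ig0
  left outer join (select * from groups) ig1
    on ig1.parentid = ig0.groupid
 order by ig0.groupid, ig1.groupid;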
Somewhat related: Oracle 11g Release 2 has introduced recursive subquery factoring, so you can avoid the connect by syntax if you like. Rob van Wijk explains some pros and cons of both options in his blog.
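For completeness, a minimal sketch of how the same ext_groups table could be walked with recursive subquery factoring; note that this yields one row per node with its level, not the multi-column layout from above:

-- anchor member: root nodes; recursive member: children of the previous level
with groups_rec (groupid, groupname, parentid, lvl) as (
  select groupid, groupname, parentid, 0
    from ext_groups
   where parentid = 0
  union all
  select g.groupid, g.groupname, g.parentid, r.lvl + 1
    from ext_groups g
    join groups_rec r on g.parentid = r.groupid
)
select *
  from groups_rec
 order by lvl, groupid;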
Wednesday, December 2, 2009
Never Ever!
Never ever forget to add the nonrecoverable keyword when doing a load into a tablespace/database without active backup/recovery!
Especially when talking about information schema tablespaces, this could result in a lot of developers waiting for a several-hours backup job to finish, hopefully before the nightly ETL jobs start creating views, dropping old tables, running statistics, etc. ...
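For reference, a minimal sketch of where the keyword goes in a load from a cursor (schema and table names are made up):

-- hypothetical staging load; nonrecoverable avoids leaving the tablespace in backup-pending state
declare load_cursor cursor for select * from source_schema.source_table with ur;
load from load_cursor of cursor insert into stage.target_table nonrecoverable;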
Not happy anymore. That lasted for just half an hour. Think I need a beer or two.
Staging Pain Minimized: DB2 Create Table Like
A quite usual task when building a staging area is to more or less "stupidly" replicate federated source data into the target data warehouse.
This is to minimize the influence of ongoing transactions within the operational systems on your following ETL processes. This is also to allow for "batch performance" treatments on the source data that you normally would not apply to your source systems. And finally, having the data locally gives you more expressivity when combining the data even with other data from the same source.
In an ongoing customer's project, we were faced with exactly that situation and ended up building a lot of scripts of the form:
drop table stage.replicated_table;
create table stage.replicated_table(replicated_column1 replicated_data_type1 replicated_constraint1, replicated_column2 ...);
declare load_cursor cursor for select column1, column2,... from source_link_schema.source_table for read only with ur;
load from load_cursor of cursor method p(1,2,...) insert into stage.replicated_table(replicated_column1, replicated_column2, ...) nonrecoverable;
runstats on table stage.replicated_table ...;
each taking about 20 minutes to write, because you have to describe the original table and reformat the output into create table syntax. Pffff.
For God's sake, some people at IBM do not only look good in their blue suits (still too few; I think I next have to blog my hate away about the ugly and time-consuming create view restriction when selecting non-unique column names) and invented the create table like syntax:
drop table stage.replicated_table;
create table stage.replicated_table like source_link_schema.source_table;
declare load_cursor cursor for select * from source_link_schema.source_table for read only with ur;
load from load_cursor of cursor method p(1,2,...) insert into stage.replicated_table nonrecoverable;
runstats on table stage.replicated_table ...;
1 minute. Runs fine. I'm happy.
Tuesday, December 1, 2009
DB2 Database/Tablespace Size Debugging
It is one of the most fundamental laws of nature that, no matter what you computed with whatever nifty usage statistics in your upset mind at the beginning of a project, database space will become scarce.
In order to find out what the actual settings of the relevant tablespaces are, you can issue a
list tablespaces show detail
command once you have sufficient privileges. Note that, even if your tablespace was created using "autosize", there may be a maxsize limit which you had better turn off with
alter tablespace blablabla maxsize none
once again under the precondition that you have sufficient privileges.