Wednesday, December 9, 2009

Maintaining Analyto-Operative Data using MS Access

Actually, this idea should rather have started another "BI Musings" blog. But for the sake of simplicity ...

Any advanced data warehouse/BI landscape will necessarily develop a bunch of (mostly: dimensional) data residing in the twilight zone between operational input and analytical output.
Examples are - from our experience - localization resources, seasonal and other analytical tags, organisational mappings, plausibility rules, technical annotations such as versions, modes, runs, etc. ... See, the appetite comes with eating.

Most often, it is too expensive and/or inconvenient to put these data and the associated maintenance processes into the (legacy) operational systems where they should have been modelled in the first place.

On the other hand, the relational techie in charge of the DWH platform is seldom willing to take care of the business meaning of those things either.

We have tried various mechanisms to construct and maintain these beasts, ranging from "Direct-Editing Relational Tables" and "Pre-Pivoting in Excel and Mounting the Sheet as an External Table" to "Replaying SQL Scripts". None of these solutions has been very satisfying so far.

Our customer actually gave us the hint to look at a tool that we had abandoned long ago but that is still installed on most BI users' machines along with Excel: MS Access!

Using Access to build up and maintain a normalized part of the dimensional model (e.g., we use an individual table for each hierarchy/attribute, hosting technical and business keys, labels, descriptions, ranges, localizations, ...), optionally fronted by forms, makes it a pleasure for business users.
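
To make the table layout concrete, here is a minimal sketch of one such normalized hierarchy/attribute table (names and types are invented for illustration; in Access you would typically click this together in the table designer rather than run DDL):

-- one small, normalized table per hierarchy/attribute (illustrative names only)
create table season
( season_tk        integer      not null primary key   -- technical key
, season_bk        varchar(20)  not null               -- business key from the source
, season_label     varchar(50)  not null               -- default label
, season_label_de  varchar(50)                         -- localization
, valid_from       date
, valid_to         date
);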

By (graphically!) defining views in Access, it is quite easy to construct, pre-pivot, debug and even pre-chart the flattened dimensional outcome of the partial model. Outer joins (caution: mostly RIGHT JOINs in a typical star-like structure) help you to initially identify integrity problems when doing batch input. After that, using referential integrity constraints, Access will help you and your users arrive at consistent intermediate versions of the data.
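
As a sketch of such an integrity check (table and column names invented): a RIGHT JOIN from the dimension table to the freshly batch-loaded rows surfaces business keys that are referenced but not yet defined:

-- loaded rows referencing a season that is missing from the season table
select loaded_rows.season_bk, count(*) as orphan_count
  from season
 right join loaded_rows
    on season.season_bk = loaded_rows.season_bk
 where season.season_bk is null
 group by loaded_rows.season_bk;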

Finally, you can use the more-or-less two-button ODBC export facility to dump the complete result view into the data warehouse stage/information schema of your choice. Depending on the sophistication of your driver (believe me, SQL Server is well supported ;-), type conversion can sometimes go astray, but most of the time SSAS will remap whatever junk comes in to reasonable target types.

Transformation of parent-child hierarchies into a multi-column representation

It's not rocket science, but here is a query to transform a balanced parent-child hierarchy into a multi-column hierarchy for a dimension table:
-- subquery factoring for the relevant part of the source table
with
groups as
(select groupid 
      , groupname
      , parentid
   from ext_groups)
-- definition of the levels of the hierarchy
select ig0.groupid groupid_level0
     , ig0.groupname groupname_level0
     , ig1.groupid groupid_level1
     , ig1.groupname groupname_level1
     , ig2.groupid groupid_level2
     , ig2.groupname groupname_level2
     , ig3.groupid groupid_level3
     , ig3.groupname groupname_level3
     , ig4.groupid groupid_level4
     , ig4.groupname groupname_level4
  from 
       -- root-level with the restriction to nodes without a parent
       -- (in this case: parentid = 0)
       (select *
          from groups
         where parentid = 0
        ) ig0
       -- level 1
     , (select *
          from groups
        ) ig1
       -- level 2
     , (select *
          from groups
        ) ig2
       -- level 3
     , (select *
          from groups
        ) ig3
       -- level 4 (= branch-level)
     , (select *
          from groups
        ) ig4
 where ig1.parentid = ig0.groupid
   and ig2.parentid = ig1.groupid
   and ig3.parentid = ig2.groupid
   and ig4.parentid = ig3.groupid
 order by ig0.groupid
        , ig1.groupid
        , ig2.groupid
        , ig3.groupid
        , ig4.groupid;

The first subquery ig0 gets all the nodes from the root level; then these rows are joined to ig1 via ig1.parentid = ig0.groupid, and every following level is connected in the same way. Of course, you have to know the depth of the hierarchy to choose the right number of subqueries.

The approach should also work for unbalanced (= ragged) hierarchies, but then you have to use outer joins (and maybe replace NULL values with a known parent entry).
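
A sketch of the outer-join variant for the lowest level (reusing the factored subquery groups from above; COALESCE simply repeats the parent entry where a branch ends early):

-- level 3 to level 4, now as an outer join so that short branches survive
select ig3.groupid                             groupid_level3
     , ig3.groupname                           groupname_level3
     , coalesce(ig4.groupid, ig3.groupid)      groupid_level4
     , coalesce(ig4.groupname, ig3.groupname)  groupname_level4
  from groups ig3
  left outer join groups ig4
    on ig4.parentid = ig3.groupid;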

Somewhat related: Oracle 11g Release 2 has introduced recursive subquery factoring, so you can avoid the connect by syntax if you like. Rob van Wijk explains some pros and cons of both options in his blog.
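
A minimal sketch of that recursive style against the same source table (it just walks the tree and tracks the level; turning the result into the multi-column layout above would still require a pivot or the self-joins shown before):

with group_tree (groupid, groupname, parentid, lvl) as
( select groupid, groupname, parentid, 0 as lvl
    from ext_groups
   where parentid = 0
  union all
  select c.groupid, c.groupname, c.parentid, p.lvl + 1
    from ext_groups c
    join group_tree p
      on c.parentid = p.groupid
)
select *
  from group_tree
 order by lvl, groupid;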

Wednesday, December 2, 2009

Never Ever!

Never ever forget to add the nonrecoverable keyword when doing a load into a tablespace/database without active backup/recovery!

Especially when talking about information schema tablespaces, this can result in a lot of developers waiting, hopefully, for a several-hour backup job to finish before the nightly ETL jobs start creating views, dropping old tables, running statistics, etc. ...
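
For the record, the keyword simply goes at the end of the load statement (names invented; see also the staging scripts in the post below):

declare load_cursor cursor for select * from source_link_schema.source_table for read only with ur;
load from load_cursor of cursor insert into stage.replicated_table nonrecoverable;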

No longer happy. It lasted just half an hour. I think I need a beer or two.

Staging Pain Minimized: DB2 Create Table Like

A quite usual task when building a staging area is to more or less "stupidly" replicate federated source data into the target data warehouse.

This is to minimize the influence of ongoing transactions within the operational systems on your subsequent ETL processes. It also allows for "batch performance" treatments of the source data that you normally would not apply to your source systems. And finally, having the data locally gives you more expressive power when combining it with other data, even data from the same source.

In an ongoing customer project, we were faced with exactly that situation and came to build a lot of scripts of the form:


drop table stage.replicated_table;

-- spell out every column of the source table by hand
create table stage.replicated_table (replicated_column1 replicated_data_type1 replicated_constraint1, replicated_column2 ...);

-- cursor over the federated source table; dirty reads are fine for staging
declare load_cursor cursor for select column1, column2, ... from source_link_schema.source_table for read only with ur;

-- bulk load without logging
load from load_cursor of cursor method p(1,2,...) insert into stage.replicated_table (replicated_column1, replicated_column2, ...) nonrecoverable;

runstats on table stage.replicated_table ...;


each taking about 20 minutes to write, just to describe the original table and reformat the output into the create table syntax. Pffff.

For God's sake, some people at IBM do not only look good in their blue suits (still too few; I think I next have to blog my hate away about the ugly and time-consuming create view restriction when selecting non-unique column names) and invented the create table ... like syntax:


drop table stage.replicated_table;

-- clone the structure of the federated source table in one statement
create table stage.replicated_table like source_link_schema.source_table;

declare load_cursor cursor for select * from source_link_schema.source_table for read only with ur;

load from load_cursor of cursor method p(1,2,...) insert into stage.replicated_table nonrecoverable;

runstats on table stage.replicated_table ...;


1 minute. Runs fine. I'm happy.

Tuesday, December 1, 2009

DB2 Database/Tablespace Size Debugging

It is one of the most fundamental laws of nature that, no matter what you computed with whatever nifty usage statistics in your mind at the beginning of a project, database space will become scarce.

In order to find out what the actual settings of the relevant tablespaces are, you can issue a

list tablespaces show detail

command, provided you have sufficient privileges. Note that, even if your tablespace was created with "autoresize", there may be a maxsize limit which you had better turn off with

alter tablespace blablabla maxsize none

once again under the precondition that you have sufficient privileges.
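
If the SYSIBMADM administrative views are available on your instance (they should be from DB2 9.5 on, but treat the view and column names here as from memory), a quick ranking of how full the tablespaces really are can be had with something like:

select tbsp_name
     , tbsp_utilization_percent
  from sysibmadm.tbsp_utilization
 order by tbsp_utilization_percent desc;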

Wednesday, November 25, 2009

DB2 Tablespace State "Backup Pending. Load in Progress."

The load utility of DB2 is a strange thing: it does certain things much faster than going through the command line processor, but on the other hand it takes different transactional paths to complete its work.

One outcome of a load process gone wild (or rather: interrupted before completing its work) can be that it leaves the tablespace in the state "Backup Pending. Load in Progress." (0x0820), by which DB2 signals that it still expects the backup, recovery or rollforward that usually concludes a load.

You can still read from the tablespace, but any structural/transactional change will be denied due to the state of the tablespace (note that the state of all the associated tables may well be "normal").
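
What eventually gets you out of that state (sketched with invented names; double-check the LOAD and BACKUP documentation for your release): terminate the interrupted load, then satisfy the pending backup.

-- clear the "Load in Progress" part by terminating the interrupted load
load from /dev/null of del terminate into stage.broken_table;

-- clear the "Backup Pending" part; with circular logging this has to be a
-- full (offline) database backup rather than a tablespace-level one
backup database mydb to /backup/path;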

For a concise overview of the backup/recovery levels and strategies of DB2 you could have a look at this page.

DB2 buffer pool "4096" has no more pages

Yesterday, we found a customer's database in a strange state. Whenever we ran some of the more bandwidth-demanding ETL operations, we were presented with the annoying error message:

Operation could not be performed, because buffer pool has no more pages left.

After some internet research, we found out that these IDs belong to so-called hidden buffer pools (under-dimensioned for most serious cases - a kind of spare tire) that are not officially assigned to the relevant tablespaces, but that get used once the memory manager cannot start the actually assigned pools, e.g., due to memory shortage.
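
To compare against the buffer pools that are officially defined (and their intended sizes), the catalog view SYSCAT.BUFFERPOOLS helps; a sketch:

select bpname, bufferpoolid, pagesize, npages
  from syscat.bufferpools
 order by bufferpoolid;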

The database finally had to be restarted (online activation of the buffer pools did not work) in order to get the state back to normal.

Date Type Conversion When Federating Oracle into DB2

Federation (or a DB link, depending on the target DWH platform) is a most convenient way to connect your staging area to your data sources.

There is the golden rule not to stretch the tempting, modern capabilities of that feature too much, hence to first replicate, then transform the data.

There are nevertheless a few issues left that may still hurt you, mostly around the built-in data type conversions, especially when formulating the (mostly temporally constrained) replication conditions.

So, DB2 9.5 represents Oracle's DATE as a TIMESTAMP - hence be sure to use CURRENT_TIMESTAMP instead of CURRENT_DATE!
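
For instance, a typical delta-replication predicate against a federated Oracle table could look like this sketch (nickname and column names are invented; last_update is an Oracle DATE that the wrapper exposes as a TIMESTAMP):

-- compare against a timestamp value, not against current_date
select *
  from source_link_schema.source_table
 where last_update >= current_timestamp - 1 day
 for read only with ur;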