ETL Tools

ETL is short for extract, transform, load, three database functions that are combined into one tool to pull data out of one database and place it into another database.

Extract is the process of reading data from a database.

Transform is the process of converting the extracted data from its previous form into the form it needs to be in so that it can be placed into another database. Transformation occurs by using rules or lookup tables or by combining the data with other data.

Load is the process of writing the data into the target database.

ETL is used to migrate data from one database to another, to form data marts and data warehouses and also to convert databases from one format or type to another.

 

Most common ETL tools are:

  • SQL Services Integration Services (SSIS)
  • Talend
  • Oracle Wahouse Builder
  • CloverETL
Advertisements

Fact and dimension tables

There are two types of tables that distinguish a data warehouse from other databases: fact and dimensions.

Fact Tables

A fact table typically has two types of columns: foreign keys to dimension tables and measures those that contain numeric facts. A fact table can contain fact’s data on detail or aggregated level.

Dimension Tables

A dimension is a structure usually composed of one or more hierarchies that categorizes data. If a dimension hasn’t got a hierarchies and levels it is called flat dimension or list. The primary keys of each of the dimension tables are part of the composite primary key of the fact table. Dimensional attributes help to describe the dimensional value. They are normally descriptive, textual values. Dimension tables are generally small in size then fact table.

Data warehouse structure

A data warehouse can have a star schema architecture or a snow flake architecture. You have to choose the one tht best fits your data, after reading their characteristics.

Star schema

The star schema architecture is the simplest data warehouse schema. It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star consists of fact table and the points of the star are the dimension tables. Usually the fact tables in a star schema are in third normal form(3NF) whereas dimensional tables are de-normalized. Despite the fact that the star schema is the simplest architecture, it is most commonly used nowadays.

star schema

Snowflake schema

The snowflake schema architecture is a more complex variation of the star schema used in a data warehouse, because the tables which describe the dimensions are normalized, so they have a typical relational database design. Snowflake schemas are generally used when a dimensional table becomes very big and when a star schema can’t represent the complexity of a data structure. For example if a PRODUCT dimension table contains millions of rows, the use of snowflake schemas should significantly improve performance by moving out some data to other table (with BRANDS for instance). The problem is that the more normalized the dimension table is, the more complicated SQL joins must be issued to query them. This is because in order for a query to be answered, many tables need to be joined and aggregates generated.

snowflake schema

Choose database model

The first step in working with a data warehouse is to create the database design, based on the information you need to store for the company.

The most complex and recommended model is the Physical Data Model, which contains all the tables and columns from database, the primary and foreign keys and each column’s data type and length.