What is Data Warehousing ?
Compiled corporate information which has been obtained from operational system and external data sources is referred to as the data warehouse.
Meaning of Data Warehouse
The data warehouse acts like a storage place of information which can be accessed by various applications and shared by them. Though it is a warehouse, then also it contains information summary as shown in figure.
Figure : A Data Warehousing System (Data warehousing concepts)
Definition of Data Warehouse
According to W.H. Inmon :
"A data warehousing is subject oriented, integrated non-volatile, time varying collection of data in support of its decision making process".
According to Ralph Kimball :
"Data warehouse is a copy of transaction data specifically structured for query and analysis".
Therefore, data warehouse can be thought of as a collection of pre-packaged or summarized data which is developed as per particular business rules. It is also designed in such a way that it helps in taking management decisions.
Types of Information Store in Data Warehouse
The following types of information are stored in a data warehouse :
1) Business Information :
It stores business information which has been sourced from throughout the organisation and encompassing all aspects of a company's products, processes and customers.
2) Standard Reports and Queries :
Most of the users of the date warehouse require a set of standard reports and queries. Thus, it becomes very useful when these standard reports are periodically, automatically produced. In such case, a user just views the report whenever required instead of taking the time to run the report every time. This saves time as well as increases manpower productivity.
Features of Data Warehouse
The Characteristics of data warehouse are as follows :
1) Subject Oriented :
Here, data is categorized according to subject instead of application.
For example, in an insurance company, data warehousing organizes data on the basis of premium paid by customers and corresponding claim in place of type of insurance policy bought.
2) Integrity :
Encoding of data becomes inconsistent when data is present in more than one application.
For example, male and female coding may vary by 0 or 1 among applications.
3) Time Variant :
Data warehouse also stores old data which are not updated and might be 8 to 10 years old. These data are used for comparisons, trends and forecasting.
4) Non-Volatile :
Once data is uploaded onto data warehouse, it is not updated or changed but only loaded and accessed.
Data Warehouse Architecture
The data warehouse architecture depends on the relational database management system (RDBMS) and comprises three tiers which are the bottom tier, middle tier and top tier as shown in figure :
1) Bottom Tier :
The bottom tier is a warehouse database server which is a relational database system. It comprises the following three components :
i) Data Sources :
A large organisation will gather data from various sources, which comprise operational databases and external sources like web, purchased data, etc. These external data source have different schemes as they have been developed independently by different groups. When companies need such data then they first require a unified schema to consolidate these for making business decisions so that queries get executed efficiently.
ii) ETL (Extract, Transform and Load) Process :
Post designing of schema, warehouse looks for data acquisition in order to achieve its objectives. This involves the following steps :
- Extraction of data is done from multiple, diverse sources.
- The data acquired might have some errors and inconsistencies. For example, names of street, area or city are miss spelt. These incorrect data requires cleaning to reduce errors and fill in missing information when required. This process is referred to as 'data cleansing'. The mistakes in data are corrected by way of looking up the correct street names and corresponding zip codes of each city. This process is referred to as fuzzy lookup. When the erroneous data is cleansed with updated data, then that process is referred to as back-flushing. Semantic mismatches are accommodated by transformation of this data.
- The data which has been cleaned and transformed are finally loaded onto the warehouse where data is partitioned and indexed for quick retrieval. The loading process may be slow depending on volume of data. For example, loading a terabyte of data may take weeks whereas gigabyte takes hours. Because of this, parallelism is required where some data is stored in a summarized format which may involve additional activities like sorting.
ETL process is carried out to transfer the data into the data warehouse. To keep the relationship of the data sources and data warehouse updated, the data warehouse should be updated periodically and the old data should also be examined.
iii) Metadata Repository :
This is the most significant part of the data warehouse which monitors presently stored data. This comprises the description of data such as its schema definition and includes both technical and business metadata. Technical metadata contains details of storage structures, data description. warehouse operations, etc. whereas business metadata contains details of business rules which guide an organisation.
2) Middle Tier :
The middle tier comprises an OLAP server which es either relational OLAP (ROLAP) or multidimensional OLAP (MOLAP). The ROLAP works in conjunction with the relational database and is an extension of the relational DBMS. This is used for mapping multidimensional data into standard relational operations Conversely, the MOLAP model directly performs the operation on the multidimensional data. The data cubes are stored in memory using the multidimensional arrays.
3) Top Tier :
The top tier is a client layer which comprises query and reporting, analysis and/or data mining tools like trend analysis, prediction, etc.
Components of Data Warehouse
Figure depicts the basic components of data warehouse :
1) Data Warehouse Database :
The central data warehouse database forms the basis of the data warehouse environment. The drivers of different technological approaches towards data warehouse database include features like very large database size, ad hoc query processing and flexible user view creation comprising aggregates, multi-table joins and drill down. The approaches adopt the following :
- Parallel relational database design.
- New indexed structure is used for bypassing relational table.
- Multidimensional database.
The above approach is used in conjunction with the online analytical processing tools.
2) Sourcing, Acquisition, Cleanup and Transformation Tools :
Data sourcing, cleanup,. transformation and migration tools help in the conversion, summarisation, structural changes and condensation of data. This transformation of data is essential so that the information can be used by the decision support tools. This results in the program and control statement comprising the COBOL program, job control language (JCL), UNIX scripts and SQL data definition languages. The function comprises the following :
- Removal of unnecessary data,
- Conversion to common data name and identification,
- Calculation of summarized and drive data,
- Establishment default for lost data, and
- Accommodation of source data definition changes.
Significance is used as stated below :
- Data heterogeneity
- Database heterogeneity
- Meta Data
3) Meta Data :
It can be said to be the data about data describing the data warehouse. Meta data are used for building, maintaining, managing and using the data warehouse.
4) Access Tools :
Front end tools are used by users to interact with data warehouse. The tools require information specialists but most end-users develop their own expertise for using such tools. It can be divided into five categories :
i) Query and Reporting Tools :
This category is further divided into two groups :
a) Reporting Tools :
Production reporting tools and desktop report writers are part of reporting tools. The former lets the computer generate regular operational reports whereas the latter are designed for the end user.
b) Managed Query Tools :
SQL and database structure uses the managed query tool for the insertion of a meta layer between user and database.
ii) Applications :
Applications development environment comprises power builder, visual basic, business objects, etc.
iii) Online Analytical Processing (OLAP) :
This tool is developed on the basis of multidimensional databases. This enables the sophisticated user to analyze the data by the use of multidimensional database.
iv) Data Mining :
Effective use of information is an important factor in the success of business. This is done by strategically using hidden, previously un-deleted & frequently valuable facts about the consumer.
5) Data Marts :
A data mart is like a mini data storehouse which stores information required for a particular subject area. This means that it caters to the requirement of a particular group of users. It has a shorter implementation curve, less data and fewer users. It has far less data usage of 50 GB and does not require high-end hardware and is therefore cost friendly.
6) Data Warehouse Administration & Management :
- Security and priority management,
- Checking for updates from multiple sources,
- Checking data quality,
- Management and updating of metadata,
- Reporting and editing of data,
- Removal or deletion of unnecessary data,
- Replicating, sub-setting and distributing data,
- Data backup and recovery, and
- Storage and management of data warehouse.
7) Information Delivery System :
The work of this system is to distribute warehouse stored data to other data warehouses or end user products like spreadsheets or local databases. The completion of the work depends on the time of the day as well as an external event completion. The idea behind this system is that the users do not have to be concerned with the location and maintenance of data warehouse.
Advantages of Data Warehouse
There are various importance of data warehouse :
1) Provide Cost-Effective Decision-Making :
Data warehouse helps in cost cutting by assisting in reduction of the staff and computer resources which are needed to cater the queries and report generation against the operational and production databases.
2) Better Enterprise Intelligence :
Due to the multi-storied data structures, the quality and flexibility is enhanced in enterprise analysis. It also supports a wide range of data starting from detailed transactional level to high-level summarized information.
3) Enhanced Customer Service :
A single Data Warehouse architecture helps an enterprise in achieving better customer relationships by correlation of all customer data.
4) Business Re-engineering :
Unlimited analysis of enterprise information is allowed here which assists in providing an insight into enterprise processes. This helps in coming up with ideas that help re-engineer such processes.
5) Information System Re-engineering :
An enterprise-wide data requirement based Data Warehouse results in cost effectiveness. This is achieved by data standardization and operational system interoperability.
Disadvantages of Data Warehouse
There are various limitations of the data warehouse :
1) High Installation Cost :
The initial cost for setting up data warehouse is expensive which eventually goes down when it covers only maintenance and modification cost. High cost is also involved in getting data translated and copied onto existing databases for end user.
2) Time-Taking :
Building a data warehouse is time consuming. Adequate time should be given for its development so that difficulties faced in making the system up and running are addressed properly.
3) Change Resistance :
It requires re-education of programmers which proves to be a hindrance as everybody may not be willing to take the charge easily. It takes time to become familiar with the new approach.
4) Specific Skills Required :
Time and cost must also be allowed for helping warehouse developers and end users to acquire new skill set.
5) Complex :
A data warehouse is a complex system which is developed as per organisation. requirement. For this, the choice of hardware, software and structure must be done very carefully as all these must work in co-ordination for optimal performance.
6) Management Acceptance :
Management will also be required even in data warehousing but not to the same extent as it is required currently. Data warehousing approach considerably reduces the time devoted to management of an organisation.
7) Security Issues :
When data is held in one place, then it is more prone to external security threats. Advanced security must be adopted to prevent unwanted users from competition or from accessing critical company data.
Data Warehouse Applications
The applications of data warehouse are described below :
1) Standard Reports and Queries :
The users of the data warehouse require a standard set of queries and reports at any given point of time. Thus, it becomes useful if such reports are prepared automatically at regular intervals. This ensures that users can view the reports which have been run by the Data Warehouse system automatically and save the time taken to run the time consuming reports.
2) Queries against Summarized Data :
A majority of data analysis is carried out on the basis of summarized data in a Data Warehouse. They contain filtered, summarize views with predefined standard business analysis.
3) Data Mining :
Among whole Data Warehouse activity, data mining comprises a small percentage. However, the detailed data forms the basis of the most significant data analysis.
4) Interface with Other Data Warehouses :
Data Warehouse interacts with many applications which use it as a source for operational system data. It may also supply data to other same size or smaller Data Warehouses referred to as data marts.