Building a DW


Building data warehouses using open source technologies

Michel Jansen (mjansen@betterbe.com)
Draft version 197

1 Table of Contents

1 Table of Contents
2 Introduction
  2.1 Intended audience
  2.2 Why build a data warehouse
3 Building a data warehouse
  3.1 Designing a dimensional data warehouse
    3.1.1 Asking questions
    3.1.2 Modeling structures
    3.1.3 Picking a fact grain
    3.1.4 Adding dimensions
    Adding ...
  3.2 Constructing the data warehouse
    3.2.1 Designing transformations using Spoon
      Some preparations
      Updating "type 1" dimensions
      Updating "type 2" dimensions and the fact table
      Aggregating the ...
    3.2.2 Putting it all together
4 Using the data warehouse
  4.1 Preparing for online analytical processing
    4.1.1 From relational to dimensional
    4.1.2 Doing it in Mondrian
  4.2 Asking multidimensional queries
  4.3 Visualization and presentation
5 References
Appendix A: Technology overview
  Mondrian
Appendix B: Generating a date dimension using JavaScript
  JavaScript code
Appendix C: Example Mondrian XML schema

2 Introduction

This article is about building data warehouses. A data warehouse is a computer database that collects, integrates and stores an organization's data with the aim of producing accurate and timely management information and supporting data analysis[1].

It explains the importance of a good data warehouse and covers the process of building such a specialized database using open source technologies.

2.1 Intended audience

This article is meant for software developers, database administrators, integrated software vendors or other people who face the challenge of making it possible to analyze the large amounts of data generated by a business or an information system. It assumes a basic understanding of the concepts it covers, such as databases, information systems and business transactions.

2.2 Why build a data warehouse

There are many reasons to justify building a data warehouse, but almost all of them boil down to the same basic wish: to provide means for analysis of data to support management decisions. You are probably already providing management with data like site usage statistics, referrer trends and the number of registered users of their information system. This is basic information, which can be retrieved directly from the system itself. However, [...] which is often a problem if, for instance, the system's database tables get locked for retrieval.

A data warehouse can be completely detached from the information system, even running on a different system. Secondly, an OLTP[2] system's data model is rarely optimized for analysis. We all learn to develop systems that use a normalized database modeled after our entity relationships. This is a good thing for the information system, since the underlying model is a close reflection of the system itself, but it makes querying for large sums of aggregated data a costly operation. Furthermore, redundancy is rarely part of this database design, because redundancy is hard to maintain, often causing data inconsistencies or worse.

For data analysis, redundancy can be great, because it speeds up the process. Moreover, data in an OLTP system might change over time. A customer might move to another country, leaving you, the data analysis provider, with an impossible choice: either you update the customer's record, discarding his previous state and invalidating previous analysis, or you somehow create a new record for the online system and change all references to it.

[1] http://en.wikipedia.org/wiki/Data_warehouse
[2] On Line Transaction Processing

Neither option is desirable, but in an offline data warehouse you can keep both the old and the new state of the customer and specify at what point in time it changed. Finally, there are probably a lot of data sources your information system isn't using that could be useful in data analysis. A data warehouse can provide central storage for all this data, so all the collected information can be queried in one step.

3 Building a data warehouse

In this chapter, I will guide you through the different phases in building a data warehouse.

I will illustrate this by using a simplified web-based information system as an example for creating a data warehouse. This system contains familiar entities such as "requests", "users" and "pages". The data model of this system is shown in figure [...].

3.1 Designing a dimensional data warehouse

3.1.1 Asking questions

The most important step in building a data warehouse is designing it. You have to ask yourself: "What does the management want to know?" First, you'll have to figure out which questions need to be answered by analysis of the data warehouse to be[3]. For example, is there a correlation between users of a web-based system and the pages it provides? Do certain groups of users visit other pages than others? Where do the visitors of different pages come from? My belief is that at least [...] of all management questions about the data generated by an information system can be answered by a decent data warehouse. Since you are reading this, I assume your company's management has already [...] this kind of [...]

3.1.2 Modeling structures

[...] ask the data warehouse later on, you have to determine the different data arrangements that come with these questions.

For example, a question about sales will have to operate on a different structure than one about employment. These structures are called "cubes" in the world of OLAP[4], because they are essentially an extension of the two-dimensional array of data that is stored in a conventional (for example SQL) database. Depending on the data, a cube may have more than two or even three dimensions. We'll model these cubes in a relational database as a star schema, containing a single fact table linking together multiple dimension tables.

[3] This is not a typo.
[4] On Line Analytical Processing

Illustration 1: Request cube containing only a time dimension

3.1.3 Picking a fact grain

The art in making up this structure is finding a good basis for a fact table. The more aggregated the chosen "fact" element for each line in the fact table is, the harder it becomes to add useful dimensions. The ideal basis for a fact table is the lowest possible atomic grain of data. This is often a single transaction, payment or, as in my example, a page request.

3.1.4 Adding dimensions

Now that we have a central [...] for our data [...] chosen the wrong grain to begin with.

Illustration 2: A cube modelled as an OLAP star in a relational database

The most trivial of all dimensions is that of time. Every request takes place at a certain point in time, so we create a dimension table "time" and have every request entry in the fact table link to the entry in this table corresponding to the time it took place. If two requests took place at the same moment, they will link to the same entry in the time dimension's table.
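The way the fact table links to the time dimension can be sketched in a few lines of Python using the standard sqlite3 module. This is only an illustration under assumptions: the table names (dim_time, fact_request), the column names and the sample timestamps are made up here and do not come from the example system. The sketch shows one fact row per page request, and a "get or create" lookup that makes two requests at the same moment share one time entry.

```python
import sqlite3

# Hypothetical table and column names; the article does not give a schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_time (
        time_tk INTEGER PRIMARY KEY,   -- key local to the warehouse
        moment  TEXT NOT NULL UNIQUE   -- one row per distinct moment
    );
    CREATE TABLE fact_request (
        request_tk INTEGER PRIMARY KEY,
        time_tk    INTEGER NOT NULL REFERENCES dim_time(time_tk)
    );
""")


def time_key(conn, moment):
    """Return the dim_time key for a moment, creating the row if needed."""
    row = conn.execute(
        "SELECT time_tk FROM dim_time WHERE moment = ?", (moment,)
    ).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute("INSERT INTO dim_time (moment) VALUES (?)", (moment,))
    return cur.lastrowid


def load_request(conn, moment):
    """Insert one page request (the fact grain) and return its key."""
    cur = conn.execute(
        "INSERT INTO fact_request (time_tk) VALUES (?)",
        (time_key(conn, moment),),
    )
    return cur.lastrowid


# Made-up sample data: two requests share the same moment.
for moment in ["2006-03-01 10:15:00", "2006-03-01 10:15:00", "2006-03-01 10:16:00"]:
    load_request(conn, moment)

fact_rows = conn.execute("SELECT COUNT(*) FROM fact_request").fetchone()[0]
time_rows = conn.execute("SELECT COUNT(*) FROM dim_time").fetchone()[0]
requests_per_moment = conn.execute("""
    SELECT t.moment, COUNT(*)
    FROM fact_request f JOIN dim_time t ON t.time_tk = f.time_tk
    GROUP BY t.moment ORDER BY t.moment
""").fetchall()
```

Because the fact grain is a single request, counting requests per moment is a plain join and GROUP BY over the star; a coarser grain would have made that question impossible to answer.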

Another dimension we want to add is that of pages. For each distinct page associated with a request in the fact table, an entry will exist in the page dimension table. The same goes for referrers. Finally, we'll want to link requests to the users that made them. Every request in the fact table was made either by a known user, or by no user at all, which will be a special entry in the user dimension's table. As we create and link dimensions, an important rule is to never use the existing keys from the online system, as we have no control over how they might change or even disappear.

Instead, every dimension will get its own surrogate key, called the "technical key", which is unique to the data warehouse. This also goes for the fact table.

3.2 Constructing the data warehouse

Now that we know which cubes we want to express as a star join schema, it's time to get to work. In this step, we'll populate a data warehouse with data from the OLTP system. This phase of the process is known as ETL, which stands for Extract, Transform, Load. This is exactly what needs to be done.

Extract the data needed for the fact and dimension tables from all the different data sources, transform it to fit our needs and load it into the data warehouse so it can be queried.

We'll perform these actions with Kettle, an open source ETL tool. Kettle is roughly made up of two packages: Spoon and Pan, which are used to create and execute transformations, and Chef and Kitchen, which let you define jobs from transformations and schedule and execute them.

3.2.1 Designing transformations using Spoon

We start by using Spoon to make the transformations that will populate our data warehouse.

In order to be able to fill the central fact table, the keys to all of the dimensions must be known. Therefore we make a distinction between two types of dimensions (not to be confused with Ralph Kimball's slowly changing dimension types):

1. Dimensions consisting of data already known to the online system.
2. Dimensions that are to be generated from the fact data and surrounding sources.

We generate or update the dimensions of type two while filling our fact table, and thus know their keys at that time, but we cannot do this for the dimensions of the first kind.

In our example, the time and page dimensions are of the second kind, meaning they are generated when updating our fact table. The user dimension has to be known before then, because it is based on some independent tables in the online system. Because of this, we will fill our data warehouse in two separate transformations, ensuring first the existence of our type 1 user dimension and later the other dimensions and the fact table itself.

Some preparations

Before we can begin using Spoon, we have to define the data sources. In our case, we have only one data source: the database of our online system.

We need to add this database as a "Connection" to Spoon, as described in its documentation about "Database connections". We also define the connection to our target database this way.

Updating "type 1" dimensions

In our example, the user dimension is the only one that cannot be updated while filling the fact table. The data already exists in the source system as a single table, so all we have to do is read it and run it through Spoon's great "Dimension lookup/update" step. This step is capable of creating, updating and looking up "slowly changing dimensions" as described by Ralph Kimball.

For sources that contain changing data, such as the customer records mentioned earlier, it can do this in two ways:

I. Overwriting the existing record with the updated one.
II. Creating another dimension record, maintaining multiple copies of changed records over time.

This second type is implemented by adding a "version", "date_from" and "date_to" field to the dimension's table and it is almost always the most useful one. In our example, we want the "user" dimension to be of type two, so the transformation in Spoon will look like Illustration 3. First, the da [...] the online [...]
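The two ways of handling changed records can be sketched in plain Python with the standard sqlite3 module. This is not how Kettle's "Dimension lookup/update" step is implemented; it is a minimal illustration of the idea under assumptions. Only the "version", "date_from" and "date_to" fields come from the text. The table name dim_user, the other columns, the far-future end date marking the open version, and the sample data are all made up for this sketch.

```python
import sqlite3

# Hypothetical schema: only "version", "date_from" and "date_to" come from
# the article; every other name here is made up for illustration.
FAR_FUTURE = "9999-12-31"  # assumed marker for the currently valid version

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_user (
        user_tk   INTEGER PRIMARY KEY,  -- surrogate ("technical") key
        user_id   INTEGER NOT NULL,     -- key in the online system
        country   TEXT NOT NULL,
        version   INTEGER NOT NULL,
        date_from TEXT NOT NULL,
        date_to   TEXT NOT NULL
    )
""")


def current_row(conn, user_id):
    """Return (user_tk, country, version) of the open version, or None."""
    return conn.execute(
        "SELECT user_tk, country, version FROM dim_user "
        "WHERE user_id = ? AND date_to = ?",
        (user_id, FAR_FUTURE),
    ).fetchone()


def update_type1(conn, user_id, country):
    """Way I: overwrite the existing record, losing the old state."""
    conn.execute(
        "UPDATE dim_user SET country = ? WHERE user_id = ? AND date_to = ?",
        (country, user_id, FAR_FUTURE),
    )


def update_type2(conn, user_id, country, today):
    """Way II: close the open version and add a new one; return its key."""
    row = current_row(conn, user_id)
    if row is not None and row[1] == country:
        return row[0]  # nothing changed, keep the open version
    version = 1
    if row is not None:
        conn.execute(
            "UPDATE dim_user SET date_to = ? WHERE user_tk = ?", (today, row[0])
        )
        version = row[2] + 1
    cur = conn.execute(
        "INSERT INTO dim_user (user_id, country, version, date_from, date_to) "
        "VALUES (?, ?, ?, ?, ?)",
        (user_id, country, version, today, FAR_FUTURE),
    )
    return cur.lastrowid


# Made-up sample data: a user who moves to another country.
first_tk = update_type2(conn, 42, "NL", "2006-01-01")
same_tk = update_type2(conn, 42, "NL", "2006-02-01")    # unchanged
second_tk = update_type2(conn, 42, "BR", "2006-03-01")  # user moved
history = conn.execute(
    "SELECT country, version, date_from, date_to FROM dim_user "
    "WHERE user_id = 42 ORDER BY version"
).fetchall()
```

With the second way, facts loaded before the move keep pointing at the old technical key, so earlier analysis stays valid, while new facts link to the new version.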
