Implementing Star Schema Design

On September 12, 2010, in Web Development, SQL, by Anuj Gakhar

I am working on a project that has a requirement to report some statistics historically. e.g. how many views a product had this week, this month, this year etc. how many comments were made on a product in the last quarter….and so on… although doing something like this does not sound overly complex when you hear it first, but it can be quite challenging if the DB design for this is not done right. In the past, almost every project I have worked on, involved some kind of statistics but they were like “all time” statistics….not based on a time frame.

So, I did a quick Google search for this and figured out that a Star Schema is the best schema design for doing such a thing. A Star Schema consists of a few Dimension tables and one or more Fact tables. A Fact table is one that holds the statistics (e.g. number of views, total sales amount, number of ratings etc). A Dimension table is one that holds the dimensions e.g. a Product table that holds a list of Products. While my search, I actually could not find a complete implementation of this schema design anywhere. I could only find bits and pieces of information so I thought I will write up a blog post on this. Might be useful for me as well to summarize what I have learned and might help someone else as well.

Lets look at an example (for this example, I will be using MySQL database but with slight modifications – the scripts would work on any database) :-

Consider an application you have a lot of users registered on the site. These users login to the site and write articles for the site. And these articles then get reviewed and published on the site. Ideally, you’d like to do reporting for these users in terms of how many articles they have published, how many views each article gets, how many comments each article gets and then you’d also like to do this based on a time frame. So views/comments today, this week, this month, this year , all time etc. So, in this case, we will have a table that holds all the Users, a table that holds all the articles, a table that holds the statistics and a table that would hold the dates. In terms of terminology, the User, article and the dates table would all be called Dimension tables and the statistics table would be our Facts table. Here is how it looks like :-

The Dates table is a table that holds all the possible dates in their different formats. The idea is that we will pre-populate this table with the dates that we want. eg.we can populate it with 10 years worth of dates (which is 4018 rows). Each row will store the date in different formats. These formats would be the ones that we want to report against. I found this useful script to populate this table. Here is a little sample of how this table will look after its populated. :-

Once we have the dates table sorted, we can now get down to the real business and start logging data into our stats table. The basic logic behind this in layman terms is that from our list of 10 years worth of dates, we get the dateId that belongs to todays’ date and the articleid and update the stats table based on that. So, if I was to record a ‘view’ against an article – my SQL query would look like this :-

 INSERT INTO tblstats( articleid, dateid, views )
SELECT 1, date_id, 1
FROM tbldates
WHERE date = DATE_FORMAT( curdate( ) , '%Y-%m-%d' )
ON DUPLICATE KEY
UPDATE views = views +1

One thing to note here is that the stats table has a composite Primary key made up of all the Primary keys from Dimension tables. In the above case, it has got articleid and dateid as a composite primary key. As we want the combination of all the dimensions to be unique. The above script insert a record into the stats table that has todays’ date, articleId 1 and it would increment the views count everytime this article is viewed. We could do the same query for comments and ratings as well but I wont be showing those queries here as they are pretty much the same as above except the column name.

Once you start logging stats in the above mentioned way, you will quickly end up with some stats that look like this :-

Now, the final task is to query this data and do some reporting based on this.

To find out the total views on articleId 1 for the current year, the query will look like this :-

 SELECT sum( s.views ) AS totalviews
FROM tblstats s
INNER JOIN tbldates d ON d.date_id = s.dateid
INNER JOIN tblarticles a ON a.id = s.articleid
WHERE d.year = DATE_FORMAT( curdate(), "%Y")
and s.articleid = 1

To find out the total views for articleid 1 in current week, the query will look like this :-

SELECT sum( s.views ) AS totalviews
FROM tblstats s
INNER JOIN tbldates d ON d.date_id = s.dateid
INNER JOIN tblarticles a ON a.id = s.articleid
WHERE d.week_of_year = DATE_FORMAT( curdate(), "%V")
and s.articleid = 1

To find out the total views for articleid 1 in current month, the query will look like this :-

SELECT sum( s.views ) AS totalviews
FROM tblstats s
INNER JOIN tbldates d ON d.date_id = s.dateid
INNER JOIN tblarticles a ON a.id = s.articleid
WHERE d.month_of_year = DATE_FORMAT( curdate(), "%m")
and s.articleid = 1

That should give you an idea on how easy it becomes to query data historically with this DB Design. The options are endless, you could use any number of dimensions and any number of fact tables to do the reporting that suits you best. The queries will never get any more complex than they are right now and you could extract all the reporting data you need.

Tagged with:  

Leave a Reply

Subscribe to Blog via Email

Enter your email address to subscribe to this blog and receive notifications of new posts by email.

Join 445 other subscribers

© 2011 Anuj Gakhar
%d bloggers like this: