Thoughts on Data

Data Warehousing Evolution

Has there been a revolution in data analytics? Has the relational database been overthrown? I believe there have been a lot of overstatements about what modern analytics looks like in the real world. Things have changed a lot in the last five or so years, as they always do, but the change is more of an evolution than the revolution many software vendors claim.

Store everything
One major evolution is that many companies are now storing all the data they can get their hands on. Data is often stored indefinitely in both raw and processed forms so that a data scientist can pick what they need to provide meaningful answers to the company’s decision makers. This evolution is a combination of three key things.
1. Storage can be cheap and scalable by using the cloud or distributed systems.
2. Systems to leverage affordable storage can be implemented, scaled, and managed by standard data teams.
3. Data scientists/analysts can run analysis on large, unprocessed data sets in a reasonable time frame.
Many companies can take advantage of storing data in various forms and for very long periods of time by using new storage technologies or clustered systems such as Hadoop. The relational database will typically still have a place, but it will not be the one tool that the data warehouse team applies to every new request.

Analytic Insights Don’t Come In A Box
The analysts in the organization will need data that the data warehousing team won’t predict to be useful. I’ve built out data warehouses and confidently ignored fields that wouldn’t be useful for analytics, and eventually someone came around asking for a few of those discarded fields. Beyond the difficulty of an internal team predicting what is needed, it’s also becoming more obvious that the “out of the box” data warehouses typically packaged with business systems will not make analytics as easy as it should be. You need your data available in a variety of forms to meet your needs, and an “out of the box” data model will make assumptions and apply rules that don’t work for all the analysis you need to do to keep up with the competition. Since we know that we can’t predict every field that analysts will require, it is becoming more common to design systems so that adding a new field takes little effort, and so that analyst teams have workarounds available until the new field is added.
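As a rough illustration of that last point, here is a minimal Python sketch of one way to keep new fields cheap: store the raw records untouched and build the curated view as a projection over them. The sample events, the field names, and the project helper are all hypothetical and made up for this example; they are not taken from any particular system.

```python
import json

# Hypothetical raw events, stored exactly as they arrived from the source system.
RAW_EVENTS = [
    '{"order_id": 1, "amount": 25.0, "channel": "web", "coupon": "SPRING"}',
    '{"order_id": 2, "amount": 40.0, "channel": "store"}',
]

# Fields the warehouse team currently exposes in the curated view (an assumption).
CURATED_FIELDS = ["order_id", "amount"]


def project(raw_lines, fields):
    """Build rows containing only the requested fields; missing values become None."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        rows.append({field: record.get(field) for field in fields})
    return rows


# The curated view as it exists today.
curated = project(RAW_EVENTS, CURATED_FIELDS)

# Analyst workaround: read a not-yet-curated field straight from the raw data.
workaround = project(RAW_EVENTS, ["order_id", "coupon"])

# Adding the field to the curated view later is a one-item change,
# and nothing has to be re-extracted from the source system.
curated_v2 = project(RAW_EVENTS, CURATED_FIELDS + ["coupon"])
```

Because the raw records are never thrown away, a field that was ignored at design time is still there when someone asks for it, and the analyst does not have to wait for the warehouse team to use it.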

Analytics Can Be Agile
Data teams have proven that agile development techniques can be applied to data warehouse and report development. That means systems which require less data modeling, maintenance, and indexing are appealing to small teams that need to deliver results quickly. Some of the new systems which are popular these days meet these low-maintenance requirements better than a relational database. With small amounts of data, relational database systems are usually the easiest for developers and analysts to work with, but large data sets require a lot more attention and increased hardware costs to support fast results. Working in an agile way lets us spend less time trying to determine the perfect solution and instead follow lean development practices, making data available in stages even though it isn’t perfect. It also means we gravitate towards technologies that reduce maintenance time while still delivering results that are fast enough for end users. An example of this change is using a scalable system that can hold all the data rather than building specialized OLAP cubes for a small amount of the data.

To tie these three areas of evolution back to the real world, let me share a bit about how our organization has evolved. The biggest step is that we opted to migrate the data warehouse and analytic data sets off of SQL Server and onto a Hadoop ecosystem. We are also shifting away from SSIS as an ETL tool and using more Python and custom-built APIs to load and transform data. I will write more about the reasons for these changes in the future, but the summary is that we want custom control to handle all types of data and respond to changes automatically. We are also shifting towards streaming data in throughout the day and decreasing the amount of batch processing that happens in our nightly jobs.

We will still have a relational database in the mix to help with certain scenarios and play a part in the batch processes we run, but most analysts will go to the Hadoop platform to get data. Those who write SQL will still be able to use it to access their data by leveraging Impala or Hive. In the end, performance is fast enough for most cases even on small data sets, and large data sets perform much better than before without much effort spent on indexing and partitioning.

We are still leveraging the knowledge and experience we gained from following the advice of Ralph Kimball and using a relational database, but things have definitely evolved, as they have in many other organizations. I believe our evolution is similar to what many other data warehouse teams are going through (or have already been through). Although the technology and practices of data warehousing have evolved greatly, the changes have not reversed everything we learned along the way and are hardly revolutionary.

 

Related references:

Many thoughts in this post are inspired by the talk “Big Data, Bad Analogies” by Mark Madsen. The rest of the inspiration is compiled from years of attending conferences and hearing how others are evolving their data analytics practices.
