Month: January 2016

3 Roles in Analytics

From time to time I’ve been asked about the different roles within a data team. For anyone wondering, here are the three roles I hear about most often, my take on what is expected of each, and the technology I most commonly hear associated with it. All of these roles play a big part in a good analytics team and do much more than I have taken the time to cover in this post, but hopefully this is a good overview.
Data Engineer (or Data/ETL Developer)
Role: Build out data systems, get data from various sources (often web APIs, flat files, or databases), transform and integrate it, and make it available for analysts to use. This role builds the foundation that the other roles constantly pull from, so there is always work to do. Building out the technology platform and base data structures is a very important step in the analytic process and may involve the most technical programming challenges. Usually this team picks which type of data system to work with, such as Hadoop, SQL Server, PostgreSQL, or Oracle.
Technology: SQL, Python, Hadoop -> Hive, Spark (using Python or Scala)
Data Scientist
Role: Expected to do a lot of data analysis, though the role varies by company. We think of this as someone who has a high level of statistics and math training, is able to build analytic and predictive models, and is able to test an idea against the data and come back knowing whether the hypothesis holds true and with what likelihood of error. This role is usually focused on analytic projects with significant impact on the company. Some projects can take quite a while, and a lot of data processing, quality evaluation, cleansing, and normalization takes place along the way. One of the hardest things to learn in an academic setting is how to know whether your model or other results are accurate before the company invests money acting on them, but within a company that is an important characteristic of a data scientist.
Technology: SQL, R, Python (with Pandas or other libraries), Spark MLlib, Mahout, or another machine learning library
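As a rough illustration of testing an idea against the data and reporting a likelihood of error, here is a minimal sketch of a permutation test using only the Python standard library. The function name, the sample numbers, and the checkout-page scenario are all made up for the example; a real project would likely use a statistics library and far more data.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate how often a difference in means at least as large as the
    observed one shows up when group labels are assigned at random."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical order values for two versions of a checkout page.
control = [20.1, 22.4, 19.8, 21.0, 20.5, 23.1, 19.9, 21.7]
variant = [24.3, 25.1, 23.8, 26.0, 24.9, 25.5, 23.2, 24.7]

p_value = permutation_p_value(control, variant)
print(f"p-value: {p_value:.4f}")
```

A small p-value says a gap this large would rarely appear by chance, which is the kind of "likelihood of error" a data scientist is expected to bring back along with the answer.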
Data Analyst (sometimes called Data Scientist now, especially in the Bay Area)
Role: Expected to analyze data with less focus on statistics, and often focused on building reports and dashboards for others to use on a recurring basis. Often partners with business users to help them come up with meaningful metrics or reports, and should be able to quality check data and find anomalies that would be misleading to management if not explained or cleaned up. This role may play a part in deciding which reporting and data visualization tools the company uses, and often tries to get answers to short-term questions.
Technology: SQL, Tableau, D3.js, Excel
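The quality check described above can be sketched in a few lines of Python. This is only one possible approach, assuming a simple list of daily totals: it flags values far from the median using the modified z-score (the 0.6745 constant and the 3.5 cutoff are a commonly cited convention, not a rule). The function name and the sample numbers are hypothetical.

```python
from statistics import median

def flag_anomalies(values, threshold=3.5):
    """Return the positions of values that sit far from the median,
    measured by the modified z-score (based on median absolute deviation)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread in the data, so nothing to measure against
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]

# Hypothetical daily order counts; one day failed to load, one was double counted.
daily_orders = [102, 98, 105, 99, 101, 0, 103, 97, 100, 410]
suspect_days = flag_anomalies(daily_orders)
print(suspect_days)
```

The flagged days are candidates to explain or clean up before the numbers reach a dashboard, not automatic deletions.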

Data Warehousing Evolution

Has there been a revolution in data analytics? Has the relational database been overthrown? I believe there have been a lot of overstatements about what modern analytics looks like in the real world. Things have changed a lot in the last 5 or so years, as they always do, but it is more of an evolution and not as revolutionary as many software vendors claim.

Store everything
One major evolution is that many companies are now storing all the data they can get their hands on. Often data is stored indefinitely in the raw forms and also in processed forms so that a data scientist can pick what they need to provide significant answers to the company’s decision makers. This evolution is a combination of three key things.
1. Storage can be cheap and scalable by using the cloud or distributed systems.
2. Systems to leverage affordable storage can be implemented, scaled, and managed by standard data teams.
3. Data scientists/analysts can run analysis on large, unprocessed data sets in a reasonable time frame.
Many companies can take advantage of storing data in various forms and for very long periods of time by using new storage technologies or clustered systems such as Hadoop.  Typically the relational database will have a place, but it will not be the one tool that the data warehouse team applies to every new request.

Analytic Insights Don’t Come In A Box
The analysts in the organization will need data that the data warehousing team won’t predict to be useful. I’ve built out data warehouses and confidently ignored fields that wouldn’t be useful for analytics, and eventually someone came around asking for a few of those discarded fields. Beyond the difficulty of an internal team predicting what is needed, it’s also becoming more obvious that “out of the box” data warehouses that are typically packaged with business systems will not make analytics as easy as it should be. You need your data available in a variety of forms to meet your needs, and an “out of the box” data model will make assumptions and apply rules that don’t work for all the analysis you need to do to keep up with the competition. Since we know that we can’t predict every field that analysts will require, it is becoming more common to design systems so that adding new fields takes little effort, and so that analyst teams have workarounds until the new field is added.
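One way to picture a design where adding a field takes little effort is to keep every raw record whole and select fields only when a report is built. The sketch below assumes the source data arrives as JSON lines; the function names, field names, and sample records are invented for illustration and are not a description of any particular warehouse.

```python
import json

def load_raw(lines):
    """Parse each JSON line and keep the complete record,
    rather than a pre-selected subset of fields."""
    return [json.loads(line) for line in lines if line.strip()]

def project(records, fields):
    """Return only the requested fields; a field missing from a
    record comes back as None instead of raising an error."""
    return [{f: r.get(f) for f in fields} for r in records]

# Hypothetical source data: the second order has no coupon code.
raw_lines = [
    '{"order_id": 1, "amount": 25.0, "coupon_code": "NEW10"}',
    '{"order_id": 2, "amount": 40.0}',
]
records = load_raw(raw_lines)

# The first version of the report ignores coupon_code.
report_v1 = project(records, ["order_id", "amount"])

# When an analyst later asks for it, only the field list changes.
report_v2 = project(records, ["order_id", "amount", "coupon_code"])
print(report_v2)
```

Because nothing was discarded at load time, the analyst can also read the raw records directly as a workaround until the field is officially added.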

Analytics Can Be Agile
Data teams have proven that agile development techniques can be applied to data warehouse and report development. That means systems which require less data modeling, maintenance, and indexing are appealing to teams that need to deliver results quickly with a small team. Some of the new systems which are popular these days meet these low maintenance requirements better than a relational database. With small amounts of data, relational database systems are usually easiest for developers and analysts to work with, but large data sets require a lot more attention and increased hardware costs to support fast results. Working in an agile way also lets us spend less time trying to determine the perfect solution and instead follow lean development practices, making data available in stages even though it isn’t perfect. It also means we gravitate towards technologies that reduce maintenance time while still delivering results that are fast enough for end users. An example of this change is using a scalable system that can hold all the data rather than building specialized OLAP cubes for a small amount of the data.

To tie these three areas of evolution back to the real world, let me share a bit about how our organization has evolved. The biggest step is that we opted to migrate the data warehouse and analytic data sets off of SQL Server and onto a Hadoop ecosystem. We are also shifting away from SSIS as an ETL tool and using more Python and custom-built APIs to load and transform data. I will write more about the reasons for these changes in the future, but the summary is that we want custom control to handle all types of data and respond to changes automatically. We are also shifting more towards streaming data in throughout the day and decreasing the amount of batch processing that happens in our nightly jobs.

We will still have a relational database in the mix to help with certain scenarios and play a part in the batch processes we run, but most analysts will go to the Hadoop platform to get data. Those who write SQL will still be able to use it to access their data by leveraging Impala or Hive. In the end, the performance is fast enough for most cases even on small data sets, and large data sets perform much better than before without much effort in indexing and partitioning.

We are still leveraging the knowledge and experience we gained following the advice of Ralph Kimball and using a relational database, but things have definitely evolved, as they have in many other organizations. I believe our evolution is similar to what many other data warehouse teams are going through (or have already been through). Although the technology and practices of data warehousing have evolved greatly, the changes have not reversed everything we learned along the way and are hardly revolutionary.


Related references:

Many thoughts in this blog are inspired by the “Big Data, Bad Analogies” talk by Mark Madsen. The rest of the inspiration is compiled from years of attending conferences and hearing how others are evolving their data analytics practices.

About this blog

I started this blog to talk about the evolving world of data analytics and data management. The goal is to share opinions, tutorials, and best practices (or at times worst practices) in both data technology and managing development teams. The theme will be new technologies and new approaches that are changing how data analytics and data engineering are done within modern organizations. I have been working in the data warehousing field for a while now, and the approach across the industry has changed drastically since I started, or at least it should have changed drastically. Data has grown, modern systems have been built to deal with today’s problems, and there is a new generation of employees who want access to the data and will learn how to use it.

My goal has always been to make important data available for driving insight and decision making. I learned from many experts in the industry the importance of making data clean and easy to use so that the managers and analysts for the business do not incorrectly interpret the data. This is important to consider because you do not want people to make poor decisions because of confusion in the metrics. This concern leads to carefully collecting important data and only exposing it to others in the organization once it is defined, cleansed, and documented, on the theory that they then cannot draw incorrect conclusions. The goal is spot on, and I agree with the approach in theory. The problem: it takes too long to make the data perfect, and you must predict how it will be interpreted by the people who see the data or there will still be confusion (unless they miraculously read the definitions and documentation you provide to them).

The solution: free the data from the vault you kept it locked away in and make all of it available to those who can get value out of it! I have many thoughts on how to do this responsibly, but the underlying message is to make data easily accessible. If you keep the data locked away you will miss out on the value many other organizations are getting from their data.