Thoughts on Data

Big Data Kickstart

Big Data is everywhere. Here is how to manage it…

Managing big data is critical for many organizations. Analytics can improve products and inform critical business decisions. Using data can provide distinct advantages, and it’s likely that an organization’s competitors are already leveraging their data. But if you have not started down the path yet, it can be a challenge to get your big data initiative off the ground.

Often the world of big data sounds like a magical place where smart data geeks just take data and come back with the answers to all of life’s questions. As you may suspect, it doesn’t actually work this way. There is hard technical work and even harder shifts in how organizational leaders think about the role of data. But the hard work is worth it if it makes the company better and customers happier, which we know is possible based on all the success stories that have been shared publicly. If you are wondering how to manage and leverage data to improve your organization, the steps outlined below provide a way to approach data and analytics with a focus on building a foundation. This foundation will help an organization deal with big data (think many terabytes or petabytes). These steps are meant to drive a long-term data focus, so you may approach some of the steps in parallel, and you will definitely iterate on the steps in the long run.

1. Collect

The first step in getting value from data is to collect it, big or small. Rather than spending days, weeks, or months discussing which data to collect, I urge you to start with the mindset that all data related to your organization should be collected. I prefer collecting the data into one data ecosystem to minimize the challenge of finding the data when it’s time to do analysis, but even if you choose a decentralized approach to collection and storage it is important to set standards across departments. You should provide guidance across your organization on how data should be captured and history stored for all processes. You should aim to capture both local data and data from any third-party services you use. You would expect to capture all the business-side data from CRMs (such as Salesforce), your company website, accounting systems, HR systems, online marketing tools, and so on. There is also a wealth of data to collect on the product or service you provide (for example, a web application). This could include data captured by the system, error and application logs, web analytics (such as Adobe Analytics or Google Analytics), and customer feedback through surveys. Analysts in your organization will inevitably choose to tie together data from each source system at some point, so storing it in one big data ecosystem (such as Hadoop) can decrease the effort needed to join the data together.
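To make the point about joining concrete, here is a minimal sketch in Python of tying CRM records to web analytics events once both sit in the same place. The record layouts, the `customer_id` key, and the `join_on_customer` helper are all assumptions made up for illustration; real source systems rarely share a clean key, and that mismatch is often where the effort goes.

```python
from collections import defaultdict

# Hypothetical extracts from two source systems, keyed by a shared customer ID.
crm_accounts = [
    {"customer_id": "c1", "plan": "pro"},
    {"customer_id": "c2", "plan": "free"},
]
web_events = [
    {"customer_id": "c1", "page": "/pricing"},
    {"customer_id": "c1", "page": "/docs"},
    {"customer_id": "c3", "page": "/home"},
]


def join_on_customer(accounts, events):
    """Left-join accounts to the pages they visited, matching on customer_id."""
    pages_by_customer = defaultdict(list)
    for event in events:
        pages_by_customer[event["customer_id"]].append(event["page"])
    # Accounts with no events keep an empty page list instead of being dropped.
    return [
        {**account, "pages": pages_by_customer.get(account["customer_id"], [])}
        for account in accounts
    ]


joined = join_on_customer(crm_accounts, web_events)
```

When the two extracts live in separate systems with separate conventions, the matching step above is usually preceded by a lot of cleanup, which is the effort a shared ecosystem and shared standards aim to reduce.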

2. Make Available

The next focus is making sure those who want to get value from the data can access it. The data should be usable with the tools they already know. For some teams this will be a spreadsheet and for others it will be more sophisticated reporting software. You will want to focus more on self-service when there is a lot of data to choose from. For big data environments you will want to eliminate roadblocks. A common roadblock is a “one-off” data request process that depends on an overloaded engineering team to create or modify views specific to each analytics request. Ideally all of the data collected will automatically be available for analysts to pull into the tool of their choice without having to request it. Realistically, however, you will likely want to pick some high-value data sets to start. This will allow you to think about how to make all the data available in the future without getting caught up in fully automating steps that might need to change once you have a little more experience.

Once you make your first data set available you should market it to the internal teams and offer support for them to get started. This part of the process is a great place to gather feedback on the value, utility and ease of use for others. With this feedback and the experience gained by collecting and making it available in your big data environment, you should be able to decide how to improve the process to be scalable for many more data sets.

3. Train

Once your first data set is available, you should begin putting significant energy into training. You will need to educate analysts on retrieving data and your organization’s decision makers on how they can get the information they need from the data. You will need to cater to varying levels of technical proficiency ranging from those with formal technology training to those in business units who became analysts along the way, so several levels of training will be most effective. One of the goals of training is that department leaders should know what capabilities exist, in particular who they can talk to regarding their department’s data goals as well as any requests for analytics. You will need to provide the support for those individuals handling departmental analytics to be successful, especially in the early stages of your big data journey. A lot of the training time will be focused on making sure the data analysts/scientists can access the data and transform it in a way that fits their project, but you must not neglect making sure the department leaders know how involved they should be in driving the projects and understanding what was learned with each one.

4. Trust

To unlock the potential of big data, there is a vital need to trust others in your organization with the hard work of making sense of it. You can’t rely entirely on a central data team for all of your reporting and analytics. The reason you need to trust others in the organization is simple: there is too much data with too much potential to improve your organization to keep it locked up where only a small group of privileged data experts can access it. An exception to this is that you will still keep financial and personally identifiable information (PII) locked up tight. One example of trust is when an analyst has an idea to take product usage data and feed it into a model to determine what type of emails should be sent to improve conversions (more future purchases, longer retention, etc.). Senior data scientists may have concerns about the analyst attempting this project on their own while no data experts are available to partner on the project. When you have trust as a component of your data strategy, you should have an open communication channel so that the analyst’s work can happen independently of a central data team and be reviewed if needed. You have a mindset that the analyst can be trusted to be honest with their department leaders about their limitations, and that the department leaders will be careful to weigh the risks before rolling out customer-impacting changes based on any analysis that hasn’t been vetted by the central data team. In practice, department leaders are unlikely to change strategy based on analysis that contradicts their experience and instinct, so the use of data is a step forward and carries a low risk of steering the organization in the wrong direction without proper due diligence.

5. Improve

There is a strong need in big data initiatives to focus on improvement as a consistent part of the work. Despite what you may have heard, big data systems are not cheap and the amount of data that can be collected is growing every year. Effective data initiatives are less about setting up a perfect process to organize people and more about building systems to automate the steps to collect data and make it available. For most organizations, taking an iterative approach to these steps is the best chance for success since results will be evident in a reasonable timeframe. You should focus on a single use case initially with a plan to refactor the process as you expand to more use cases. It is critical that you make time for the effort to refactor after the first project is deployed to production. This will allow your engineers to spend less time trying to develop perfect automated solutions for the ‘Collect’ and ‘Make Available’ steps and provide a chance for feedback and adjustments in the ‘Train’ and ‘Trust’ steps. Data engineers will find solutions to the collect and make available steps, but these alone will not provide the impact you desire without strong organizational effort to train and trust. All of the prior steps are opportunities to learn and most organizations will need to make adjustments to be successful regardless of how much time they spend on the planning and design.


Data Warehousing Evolution

Has there been a revolution in data analytics? Has the relational database been overthrown? I believe there have been a lot of overstatements about what modern analytics looks like in the real world. Things have changed a lot in the last 5 or so years, as they always do, but it is more of an evolution and not as revolutionary as many software vendors claim.

Store everything
One major evolution is that many companies are now storing all the data they can get their hands on. Often data is stored indefinitely in the raw forms and also in processed forms so that a data scientist can pick what they need to provide significant answers to the company’s decision makers. This evolution is a combination of three key things.
1. Storage can be cheap and scalable by using the cloud or distributed systems.
2. Systems to leverage affordable storage can be implemented, scaled, and managed by standard data teams.
3. Data scientists/analysts can run analysis on large, unprocessed data sets in a reasonable time frame.
Many companies can take advantage of storing data in various forms and for very long periods of time by using new storage technologies or clustered systems such as Hadoop.  Typically the relational database will have a place, but it will not be the one tool that the data warehouse team applies to every new request.

Analytic Insights Don’t Come In A Box
The analysts in the organization will need data that the data warehousing team won’t predict to be useful. I’ve built out data warehouses and confidently ignored fields that wouldn’t be useful for analytics, and eventually someone came around asking for a few of those discarded fields. Beyond the difficulty of an internal team predicting what is needed, it’s also becoming more obvious that “out of the box” data warehouses that are typically packaged with business systems will not make analytics as easy as it should be. You need your data available in a variety of forms to meet your needs, and an “out of the box” data model will make assumptions and apply rules that don’t work for all the analysis you need to do to keep up with the competition. Since we know that we can’t predict every field that analysts will require, it is more common to change our designs so that adding new fields requires little effort, and to allow for workarounds by the analyst teams until the new field is added.
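One way to picture a design where adding a field is cheap is to keep every raw record and build the curated view from a list of field names. The sketch below is only an illustration under that assumption; the order records, the field names, and the `build_view` helper are hypothetical and not taken from any particular warehouse.

```python
import json

# Raw records are kept verbatim, so no field is discarded at load time.
raw_store = [
    json.dumps({"order_id": 1, "total": 40.0, "coupon": "SPRING", "referrer": "email"}),
    json.dumps({"order_id": 2, "total": 15.5, "referrer": "search"}),
]


def build_view(raw_records, fields):
    """Project the requested fields out of each raw record; absent fields become None."""
    view = []
    for line in raw_records:
        record = json.loads(line)
        view.append({field: record.get(field) for field in fields})
    return view


# Initial curated view, built before anyone asked for coupon data.
published_fields = ["order_id", "total"]
orders_v1 = build_view(raw_store, published_fields)

# Later an analyst needs "coupon": extend the list and rebuild from the raw copy.
orders_v2 = build_view(raw_store, published_fields + ["coupon"])
```

Because the raw copy still holds the field, the new column is also filled in for historical records, and in the meantime an analyst can read the raw records directly as a workaround.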

Analytics Can Be Agile
Data teams have proven that agile development techniques can be applied to data warehouse and report development. That means systems which require less data modeling, maintenance, and indexing are appealing to teams that need to deliver results quickly with a small team. Some of the new systems which are popular these days meet these low-maintenance requirements better than a relational database. With small amounts of data, relational database systems are usually easiest to work with for developers and analysts, but with large data sets they require a lot more attention and increased hardware costs to support fast results. Working in an agile way also allows us to spend less time trying to determine the perfect solution and instead follow lean development practices to make data available in stages even though it isn’t perfect. It also means we gravitate towards technologies that reduce maintenance time while still delivering results that are fast enough for end users. An example of this change is using a scalable system that can hold all the data rather than building specialized OLAP cubes for a small amount of the data.
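The OLAP cube example can be sketched in a few lines. Instead of precomputing totals for a fixed set of dimensions, the detail rows are kept and grouped at query time by whatever dimensions the analyst asks for. The rows and the `total_by` helper below are hypothetical, and an in-memory list only stands in for a scalable store that would do the same grouping across many machines.

```python
from collections import defaultdict

# Hypothetical detail rows; a real system would hold far more of them.
sales = [
    {"region": "east", "product": "a", "amount": 10},
    {"region": "east", "product": "b", "amount": 5},
    {"region": "west", "product": "a", "amount": 7},
]


def total_by(rows, *dimensions):
    """Sum amount grouped by any combination of dimensions, computed at query time."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row["amount"]
    return dict(totals)


by_region = total_by(sales, "region")
by_region_product = total_by(sales, "region", "product")
```

The trade-off is that every question scans the detail rows, which is only practical when the underlying system is fast enough without hand-built aggregates.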

To tie these three areas of evolution back to the real world, let me share a bit about how our organization has evolved. The biggest step is that we opted to migrate the data warehouse and analytic data sets off of SQL Server and onto a Hadoop ecosystem. We are also shifting away from SSIS as an ETL tool and using more Python and custom-built APIs to load and transform data. I will write more about the reasons for these changes in the future, but the summary is that we want custom control to handle all types of data and respond to changes automatically. We are also shifting towards streaming data in throughout the day and decreasing the amount of batch processing that happens in our nightly jobs. We will still have a relational database in the mix to help with certain scenarios and play a part in the batch processes we run, but most analysts will go to the Hadoop platform to get data. Those who write SQL will still be able to use it to access their data by leveraging Impala or Hive. In the end, the performance is fast enough for most cases even on small data sets, and large data sets perform much better than before without much effort in indexing and partitioning. We are still leveraging the knowledge and experience we gained following the advice of Ralph Kimball and using a relational database, but things have definitely evolved as they have in many other organizations. I believe our evolution is similar to what many other data warehouse teams are going through (or have already been through). Although the technology and practices of data warehousing have evolved greatly, the changes have not reversed everything we learned along the way and are hardly revolutionary.


Related references:

Many thoughts in this blog are inspired by the “Big Data, Bad Analogies” talk by Mark Madsen. The rest of the inspiration is compiled from years of attending conferences and hearing how others are evolving their data analytics practices.