Data Development Environment – Mac

A first step to developing with modern technologies such as Big Data and NoSQL systems is getting your development environment set up. I like to have many of the tools available locally on my laptop so I can experiment freely without breaking a shared server or running up a large bill on a cloud platform. Check out my previous post on setting up on Windows to read a little about what I like about Python and Sublime Text. For this post, let's walk through the tools I found myself installing on my Mac for Python data development (this excludes Scala setup).

  • Python installed by default – using Python 2.7.10
  • install Homebrew (see brew.sh for the install command)
  • install Developer Tools – on the command line, type git and follow the prompts to install the developer tools
  • install pip – sudo easy_install pip
  • install Sublime Text 2
  • install PyCharm
  • install several things using Homebrew (type at command line):
    • brew update
    • brew install wget
    • brew install gcc
    • brew install apache-spark
  • pip install virtualenv
    • (then use virtualenv and virtualenvwrapper for most things python)
  • create a virtual environment for data-eng and install
    • pip install pandas
    • pip install requests
    • … (a lot more; I might add some of the heavily used ones to this list later)
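
Once the virtual environment is created and activated, a quick way to confirm the installs worked is a small sanity-check script. This is just a sketch; the package names checked below are the ones from the list above, and the script only reports what is importable in the current environment.

```python
# Sanity check to run inside the new virtual environment: reports which
# of the packages from the list above are importable.
import sys
from importlib.util import find_spec

def installed(name):
    """Return True if `name` can be imported in this environment."""
    return find_spec(name) is not None

if __name__ == "__main__":
    print("Python %d.%d.%d" % sys.version_info[:3])
    for pkg in ("pandas", "requests", "virtualenv"):
        print("%-12s %s" % (pkg, "ok" if installed(pkg) else "MISSING"))
```

If a package shows MISSING, double-check that you activated the right virtualenv before running pip install.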

Hopefully that helps you get started. Feel free to leave comments if you hit errors along the way; if I've dealt with similar errors I will give you some tips.


Data Development Environment – Windows

A first step to developing with modern technologies such as Big Data and NoSQL systems is getting your development environment set up. I like to have many of the tools available locally on my laptop so I can experiment freely without breaking a shared server or running up a large bill on a cloud platform. Originally I started on a Windows machine and, being somewhat new to these technologies, tried a few paths that didn't go very well. Here is some guidance if you are trying to get going with Python and Hadoop or other open source data platforms on a Windows laptop.

Python 2.7

A very popular programming language for processing data is Python, and much of the ETL we write uses Python for its flexibility (compared to SSIS, which relies heavily on knowing the data model). It is simpler than Java and much easier to read, so if maintainability is important (which it should be) then it is a great option. You can use Python 2 or Python 3, but some third-party modules are not compatible with Python 3. Python can be installed directly on Windows, Mac, or Linux. I prefer using it on Linux or Mac because of the other command line features and its popularity in the developer community.
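
If you want code that runs under both versions while the module situation settles, the `__future__` imports let Python 2.7 adopt Python 3 behavior for the most common differences. A small sketch:

```python
# Writing code that behaves the same on Python 2.7 and Python 3:
# these __future__ imports give 2.7 the print function and true division.
from __future__ import division, print_function

def average(values):
    """True division even on Python 2: 7 / 2 is 3.5, not 3."""
    return sum(values) / len(values)

print(average([1, 2, 4, 7]))  # prints 3.5
```

Without the `division` import, Python 2 would truncate the result of dividing two integers, which is a classic source of subtle data bugs.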

Sublime Text 2

This is the text editor I use to write Python modules and edit JSON files, as well as any other type of text file. It is what others on our data team use, and when I first saw it I was impressed with how readable it makes the code. It is not an IDE; it is an awesome text editor. There are other options for text editors (gedit, emacs, vim), but Sublime Text 2 works well. If you want an IDE instead, PyCharm is one that was recommended in the Python Fundamentals course on Pluralsight.

Oracle VM VirtualBox

After trying to get things working properly with Cygwin Terminal and setting up a Linux VM with VMware Player, I saw VirtualBox was a highly ranked virtual machine option, and I had the easiest time setting it up. One big requirement I had was being able to copy and paste between my local machine and the virtual machine, and I had trouble getting that capability set up on my VMware machine.

Linux CentOS 7

Linux works well for developing and running Python code, plus you can install many open source projects on it, such as Apache Hadoop. I chose CentOS because of its similarity to Red Hat, which most databases and open source projects support. I found many examples for installing Python modules and client libraries on Linux, as well as plenty of information on installing Hadoop as a single-node instance. I did not face as many barriers as Cygwin presented, so once I made the jump to Linux I was finally able to focus on the programming instead of the system setup.

So try out this setup, check out my Resources page for ideas of what to learn once the environment is ready, and hit me up with questions if you get stuck.

3 Roles in Analytics

From time to time I've been asked about the different roles within a data team. So for anyone wondering, here are the three roles I hear most often and my take on what is expected of each, including the technology I hear most commonly associated with the role. All these roles play a big part in a good analytics team and do much more than I have taken the time to indicate in this post, but hopefully this is a good overview.
Data Engineer (or Data/ETL Developer)
Role: Build out data systems, get data from various sources (often web APIs, flat files, or databases), transform data, integrate data, and make it available for analysts to use. This role is about building the foundation the other roles pull from all the time, so there is always work to do. Building out the technology platform and base data structures is a very important step in the analytic process and may involve the most technical programming challenges. Usually this team picks which type of data system to work with, such as Hadoop, SQL Server, PostgreSQL, or Oracle.
Technology: SQL, Python, Hadoop -> Hive, Spark (using Python or Scala)
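
To make the extract-transform-load pattern concrete, here is a minimal sketch using only the standard library. The payload, field names, and pipeline shape are invented for illustration; real pipelines pull from live APIs or databases rather than an inline string.

```python
# A toy extract-transform-load pipeline: parse a JSON payload (standing in
# for an API response), cast types, and write analyst-friendly CSV.
import csv
import io
import json

RAW = '[{"user": "a", "spend": "12.50"}, {"user": "b", "spend": "3.25"}]'

def extract(raw):
    """Extract: parse the raw source (here, a JSON string)."""
    return json.loads(raw)

def transform(rows):
    """Transform: cast the spend field from string to float."""
    return [{"user": r["user"], "spend": float(r["spend"])} for r in rows]

def load(rows, fh):
    """Load: write the cleaned rows as CSV for analysts to consume."""
    writer = csv.DictWriter(fh, fieldnames=["user", "spend"])
    writer.writeheader()
    writer.writerows(rows)

out = io.StringIO()
load(transform(extract(RAW)), out)
```

Each stage stays a separate function so sources and destinations can be swapped without rewriting the transformation logic.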
Data Scientist
Role: Expected to do a lot with analyzing data, and the role varies by company. We think of this as someone who has a high level of statistics and math training, is able to build analytic and predictive models, and is able to test an idea against the data and come back knowing whether the hypothesis holds true and with what likelihood of error. This role is usually focused on analytic projects with significant impact on the company. Some projects can take quite a while, and a lot of data processing, quality evaluation, cleansing, and normalization takes place along the way. One of the hardest things to learn in an academic setting is how to know whether your model or other results are accurate before the company invests money acting on them, but within a company that is an important characteristic of a data scientist.
Technology: SQL, R, Python (with Pandas or other libraries), Spark MLlib, Mahout, or another machine learning library
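
As a tiny taste of the statistical side of the role, here is a sketch using only the standard library: estimating a mean with a rough 95% confidence interval, which is the kind of error quantification described above. The sample values are made up.

```python
# Estimate a sample mean and a rough 95% confidence interval using the
# normal approximation (z = 1.96); real work would use richer libraries.
from math import sqrt
from statistics import mean, stdev

def mean_ci(sample, z=1.96):
    """Return (mean, (low, high)) for an approximate 95% interval."""
    m = mean(sample)
    se = stdev(sample) / sqrt(len(sample))  # standard error of the mean
    return m, (m - z * se, m + z * se)

m, (low, high) = mean_ci([4.1, 3.9, 4.3, 4.0, 4.2])
```

Reporting the interval rather than just the point estimate is what lets the company know the likelihood of error before acting on a result.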
Data Analyst (sometimes called Data Scientist now, especially in the Bay Area)
Role: Expected to analyze data with less focus on statistics, often building reports and dashboards for others to use on a recurring basis. Analysts in this role often partner with business users to help them come up with meaningful metrics or reports, and should be able to quality-check data and find anomalies that would mislead management if not explained or cleaned up. This role may play a part in deciding which reporting and data visualization tools the company uses, and often focuses on answering short-term questions.
Technology: SQL, Tableau, D3.js, Excel
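
The SQL side of this role is mostly aggregating metrics for recurring reports. A sketch using Python's built-in sqlite3 module (the table and metric are invented for illustration; in practice the query would run against the warehouse):

```python
# A typical recurring-report query: revenue by region. sqlite3 ships with
# Python, so this runs anywhere as a stand-in for the real warehouse.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("west", 10.0), ("west", 5.0), ("east", 7.0)])

rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders "
    "GROUP BY region ORDER BY region"
).fetchall()
```

The same GROUP BY pattern is what a Tableau workbook or Excel pivot table generates under the hood.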

Data Warehousing Evolution

Has there been a revolution in data analytics? Has the relational database been overthrown? I believe there have been a lot of overstatements about what modern analytics looks like in the real world. Things have changed a lot in the last 5 or so years, as they always do, but it is more of an evolution and not as revolutionary as many software vendors claim.

Store everything
One major evolution is that many companies are now storing all the data they can get their hands on. Often data is stored indefinitely in the raw forms and also in processed forms so that a data scientist can pick what they need to provide significant answers to the company’s decision makers. This evolution is a combination of three key things.
1. Storage can be cheap and scalable by using the cloud or distributed systems.
2. Systems to leverage affordable storage can be implemented, scaled, and managed by standard data teams.
3. Data scientists/analysts can run analysis on large, unprocessed data sets in a reasonable time frame.
Many companies can take advantage of storing data in various forms and for very long periods of time by using new storage technologies or clustered systems such as Hadoop.  Typically the relational database will have a place, but it will not be the one tool that the data warehouse team applies to every new request.

Analytic Insights Don’t Come In A Box
The analysts in the organization will need data that the data warehousing team won't predict to be useful. I've built out data warehouses and confidently ignored fields that wouldn't be useful for analytics, and eventually someone came around asking for a few of those discarded fields. Beyond the difficulty of an internal team predicting what is needed, it's also becoming more obvious that "out of the box" data warehouses, typically packaged with business systems, will not make analytics as easy as it should be. You need your data available in a variety of forms to meet your needs, and an "out of the box" data model makes assumptions and applies rules that don't work for all the analysis you need to do to keep up with the competition. Since we know we can't predict every field analysts will require, it is more common to design so that adding new fields takes little effort, and to allow for workarounds by the analyst teams until the new field is added.
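
One concrete way a design stays cheap to extend is treating new fields as additive changes. A sketch with sqlite3 (the table and columns are hypothetical; the same ALTER TABLE pattern applies in most relational systems):

```python
# Adding an unpredicted field after the fact as an additive schema change,
# without rebuilding the table or touching existing rows' other columns.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
conn.execute("INSERT INTO sales VALUES (1, 9.99)")

# Later, analysts ask for a field we originally left out:
conn.execute("ALTER TABLE sales ADD COLUMN channel TEXT")
conn.execute("UPDATE sales SET channel = 'web' WHERE id = 1")
row = conn.execute("SELECT id, amount, channel FROM sales").fetchone()
```

Existing rows simply get a NULL in the new column until a backfill runs, so the change ships without a disruptive migration.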

Analytics Can Be Agile
Data teams have proven that agile development techniques can be applied to data warehouse and report development. That means systems which require less data modeling, maintenance, and indexing are appealing to teams that need to deliver results quickly with a small team. Some of the new systems popular these days meet these low-maintenance requirements better than a relational database. With small amounts of data, relational database systems are usually easiest for developers and analysts to work with, but large data sets require a lot more attention and increased hardware costs to support fast results. Agility also lets us spend less time trying to determine the perfect solution and instead follow lean development practices, making data available in stages even though it isn't perfect. It also means we gravitate toward technologies that reduce maintenance time while still delivering results that are fast enough for end users. An example of this change is using a scalable system that can hold all the data rather than building specialized OLAP cubes for a small amount of the data.

To tie these three areas of evolution back to the real world, let me share a bit about how our organization has evolved. The biggest step is that we opted to migrate the data warehouse and analytic data sets off of SQL Server and onto a Hadoop ecosystem. We are also shifting away from SSIS as an ETL tool and using more Python and custom-built APIs to load and transform data. I will write more about the reasons for these changes in the future, but the summary is that we want custom control to handle all types of data and respond to changes automatically. We are also shifting toward streaming data in throughout the day and decreasing the amount of batch processing that happens in our nightly jobs. We will still have a relational database in the mix to help with certain scenarios and play a part in the batch processes we run, but most analysts will go to the Hadoop platform to get data. Those that write SQL will still be able to use it to access their data by leveraging Impala or Hive.

In the end, the performance is fast enough for most cases even on small data sets, and large data sets perform much better than before without much effort in indexing and partitioning. We are still leveraging the knowledge and experience we gained following the advice of Ralph Kimball and using a relational database, but things have definitely evolved, as they have in many other organizations. I believe our evolution is similar to what many other data warehouse teams are going through (or have already been through). Although the technology and practices of data warehousing have evolved greatly, the changes have not reversed everything we learned along the way and are hardly revolutionary.


Related references:

Many thoughts in this blog are inspired by the Big Data, Bad Analogies talk by Mark Madsen. The rest of the inspiration comes from years of attending conferences and hearing how others are evolving their data analytics practices.

About this blog

I started this blog to talk about the evolving world of data analytics and data management. The goal is to share opinions, tutorials, and best practices (or at times worst practices) in both data technology and managing development teams. The theme will be new technologies and new approaches that are changing how data analytics and data engineering are done within modern organizations. I have been working in the data warehousing field for a while now, and the approach across the industry has changed drastically since I started – or at least it should have changed drastically. Data has grown, modern systems have been built to deal with today's problems, and there is a new generation of employees who want access to the data and will learn how to use it.

My goal has always been to make important data available for driving insight and decision making. I learned from many experts in the industry the importance of making data clean and easy to use so that the business's managers and analysts do not misinterpret it. This matters because you do not want people making poor decisions because of confusion in the metrics. This concern leads to carefully collecting important data and only exposing it to others in the organization once it is defined, cleansed, and documented, because then they cannot draw incorrect conclusions. The goal is spot on, and in theory I agree with the approach. The problem: it takes too long to make the data perfect, and you must predict how the people who see the data will interpret it or there will still be confusion (unless they miraculously read the definitions and documentation you provide).

The solution: free the data from the vault you kept it locked away in and make all of it available to those who can get value out of it! I have many thoughts on how to do this responsibly, but the underlying message is: make data easily accessible. If you keep the data locked away, you will miss out on the value many other organizations are getting from their data.