Thoughts on Analytics

ETL tool vs custom code

I used to help sell an ETL tool that had a graphical drag and drop interface. I really did like the tool because with a little training you could quickly build a basic ETL job. I still like these types of tools if pulling data from a database that has a static or slow changing data model. However, at my current company we do not use an ETL tool because I suggested we are better off without one. While it is possible we will use an ETL tool one day for certain tasks, we currently prefer Python and SQL to move and process our data. The primary reasons we went down this path is for increased flexibility, portability, and maintainability.

One of my top regrets leading a Data Warehousing team that used an ETL tool is that we felt limited by what the tool was capable of doing. Elements of ETL that were not as important when the team started were not easily supported by the tool. The best example of this is reading from a RESTful API. Another was working with JSON data as a source. With these examples we could easily find a tool that can do this for us, but what else will we encounter in the future? At my current company we are consuming RabbitMQ messages and using Kafka for data streaming, and we would not have known to plan for a tool that works well for these use cases. Since we are using Python (and Spark and Scala) we have no limits on what is possible for us to build. There are a lot of libraries that are already built which we can leverage, and we can modify our libraries as new ideas come up rather than being stuck with what a tool provides out of the box. We choose to focus on building a data flow engine in many cases over having one script per table or source. This amount of control over the code that moves data allows us to build up an engine that supports many configurations while keeping the base code backward compatible for data sets already flowing through the system. In many cases we trade off having a longer ramp up period to get our first build working in order to have more flexibility and control down the road, but it cuts down the amount of frustrating rework when systems change.

Another benefit of coding your own ETL is that you can change databases, servers, and data formats without applying changes all over your code base. We have already taken our library for reading SQL Server data and written a similar version that works for Postgres. With how our ETL jobs are set up we just switched out the library import on the relevant scripts and didn’t have to dig in to the logic that was running. I think this leads to better maintainability as well since if you find something is taking up a lot of your time you just build it into the overall system. I remember having alerts at 2 am because the metadata of a table had changed and our ETL tool couldn’t load the data without us refreshing the metadata in the job. With most of our python code we can handle new columns added to the source data and either add that column to the destination table or just ignore it until we decide to modify the destination. This really has decreased time spent getting mad at the system administrators who disrupted our morning by adding a new custom field, though there is still plenty of work to do to ease the pain of changed datatypes and renamed columns.

I am sure there are plenty of different tools out there that do every thing you could want to do (at least according to their sales team), but I love the flexibility, control, and maintainability of writing our own applications to move data. It was worked out well for us as we have transitioned to building out a data platform rather than focusing on just tools to load an analytic data warehouse (but that is a topic for another time).

About this blog

I started this blog to talk about the evolving world of data analytics and data management.  The goal is to share opinions, tutorials, and best practices (or at times worst practices) in both data technology and managing development teams.  The theme will be new technologies and new approaches that are changing how data analytics and data engineering are done within modern organizations.  I have been working in the data warehousing field for a while now and the approach across the industry has changed drastically since I started – or at least it should have changed drastically.  Data has grown, modern systems have been built to deal with today’s problems, and there is a new generation of employees that want access to the data and will learn how to use it.

My goal has always been to make important data available for driving insight and decision making.  I learned from many experts in the industry the importance of making data clean and easy to use so that the managers and analysts for the business do not incorrectly interpret the data.  This is important to consider because you do not want people to make poor decisions because of confusion in the metrics.  This concern leads to carefully collecting important data and only exposing it to others in the organization once it is defined, cleansed, and documented — because then they cannot draw incorrect conclusions.  The goal is spot on, and the approach I theoretically agree with.  The problem: it takes too long to make the data perfect and you must predict how it will be interpreted by the people who see the data or there will still be confusion (unless they miraculously read the definitions and documentation you provide to them).

The solution: free the data from the vault you kept it locked away in and make all of it available to those that can get value out of it!  I have many thoughts on how to do this responsibly, but the underlying message is make data easily accessible.  If you keep the data locked away you will miss out on the value many other organizations are getting from their data.