Month: April 2017

ETL tool vs custom code

I used to help sell an ETL tool that had a graphical drag and drop interface. I really did like the tool because with a little training you could quickly build a basic ETL job. I still like these types of tools if pulling data from a database that has a static or slow changing data model. However, at my current company we do not use an ETL tool because I suggested we are better off without one. While it is possible we will use an ETL tool one day for certain tasks, we currently prefer Python and SQL to move and process our data. The primary reasons we went down this path is for increased flexibility, portability, and maintainability.

One of my top regrets leading a Data Warehousing team that used an ETL tool is that we felt limited by what the tool was capable of doing. Elements of ETL that were not as important when the team started were not easily supported by the tool. The best example of this is reading from a RESTful API. Another was working with JSON data as a source. With these examples we could easily find a tool that can do this for us, but what else will we encounter in the future? At my current company we are consuming RabbitMQ messages and using Kafka for data streaming, and we would not have known to plan for a tool that works well for these use cases. Since we are using Python (and Spark and Scala) we have no limits on what is possible for us to build. There are a lot of libraries that are already built which we can leverage, and we can modify our libraries as new ideas come up rather than being stuck with what a tool provides out of the box. We choose to focus on building a data flow engine in many cases over having one script per table or source. This amount of control over the code that moves data allows us to build up an engine that supports many configurations while keeping the base code backward compatible for data sets already flowing through the system. In many cases we trade off having a longer ramp up period to get our first build working in order to have more flexibility and control down the road, but it cuts down the amount of frustrating rework when systems change.

Another benefit of coding your own ETL is that you can change databases, servers, and data formats without applying changes all over your code base. We have already taken our library for reading SQL Server data and written a similar version that works for Postgres. With how our ETL jobs are set up we just switched out the library import on the relevant scripts and didn’t have to dig in to the logic that was running. I think this leads to better maintainability as well since if you find something is taking up a lot of your time you just build it into the overall system. I remember having alerts at 2 am because the metadata of a table had changed and our ETL tool couldn’t load the data without us refreshing the metadata in the job. With most of our python code we can handle new columns added to the source data and either add that column to the destination table or just ignore it until we decide to modify the destination. This really has decreased time spent getting mad at the system administrators who disrupted our morning by adding a new custom field, though there is still plenty of work to do to ease the pain of changed datatypes and renamed columns.

I am sure there are plenty of different tools out there that do every thing you could want to do (at least according to their sales team), but I love the flexibility, control, and maintainability of writing our own applications to move data. It was worked out well for us as we have transitioned to building out a data platform rather than focusing on just tools to load an analytic data warehouse (but that is a topic for another time).