How To: Python Logging

I shared previously in my post ETL Tool vs Custom Code that I use Python for developing data flows.  When I started writing production data flows and applications with Python, I dove in head first and didn’t pick up some of the most useful practices until a year or two in.  Lucky for you, I am going to share some of these foundational Python concepts in various “How To” posts, starting today.  Logging is a topic I procrastinated on, but it is something any newcomer should learn in the first week.  To help anyone getting started (or anyone who is great at Python but still uses print every other line), let’s look at both a basic logging example (the minimum I expect) and a real-world example (what I recommend).

To start, the basic example involves adding these lines at the beginning of your code:

import logging
logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)

Once this is set up, you can add log calls like this:

log.info("Logging is turned on")

This is the minimum amount of logging that I recommend, even for the ad hoc scripts we often write in data engineering and data science. With the setup above, messages still print to your screen, but if you later deploy that ad hoc script as a job or module you now have the option to turn them off. The first thing you will notice is that statements like print(“Opened connection”), which are really meant for logging and debugging, become log.info(“Opened connection”). You should play with the different logging levels as well. Normally I set level=logging.DEBUG while I’m developing and add statements like log.debug(“Load table using query: %s”, query) to keep an eye on whether my code is doing what I meant it to do.
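To make that concrete, here is a minimal sketch of what the switch from print to log calls looks like; the query string is just a placeholder for this example:

import logging
logging.basicConfig(level=logging.DEBUG)  # switch back to logging.INFO once the script is stable
log = logging.getLogger(__name__)

query = "SELECT * FROM sample_table"  # placeholder query for this example
log.info("Opened connection")  # replaces print("Opened connection")
log.debug("Load table using query: %s", query)  # only emitted when the level is DEBUG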

Now, if you want to take it up a notch and apply your own format consistently across your Python applications, you should take the next step of creating a log configuration file. For starters, save a file named logging.conf in your project directory and populate it with this text:

[loggers]
keys=root

[handlers]
keys=logfile

[formatters]
keys=logfileformatter

[logger_root]
handlers=logfile

[formatter_logfileformatter]
format=%(levelno)s:%(asctime)s:%(name)s:sample-etl:%(message)s

[handler_logfile]
class=handlers.RotatingFileHandler
args=('/var/log/sample_etl/sample_etl.log', 'a', 2000000, 10)
formatter=logfileformatter

Now we have a file that declares a root logger, a custom format in the [formatter_logfileformatter] section, and a handler in the [handler_logfile] section that writes to a file and rotates it.  Once the current file reaches roughly 2 MB (the 2000000 in args), the handler rolls over to a new file, and once ten backups exist the oldest one is deleted.
The next step is to tell your Python code to use this configuration, which you can do by replacing the first piece of log setup code I shared with this:

import logging.config
logging.config.fileConfig('logging.conf')
log = logging.getLogger(__name__)
log.setLevel(logging.DEBUG) # Use debug to see the most log entries

Now you should try adding a StreamHandler that uses sys.stderr so that you can also see the logging in your console. This is something I add to my development environments to save me the effort of tailing the actual files as I develop. To get you started, check out the documentation. If you get stuck, leave a comment.
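If you want a head start, here is one way it could look: a minimal sketch layered on top of the config-file setup above (the console format string is just an example, so tweak it to taste):

import sys
import logging
import logging.config

logging.config.fileConfig('logging.conf')
log = logging.getLogger(__name__)
log.setLevel(logging.DEBUG)

# Echo log records to the console in addition to the rotating log file
console = logging.StreamHandler(sys.stderr)
console.setFormatter(logging.Formatter('%(levelname)s:%(asctime)s:%(name)s:%(message)s'))
logging.getLogger().addHandler(console)  # attach to the root logger so every module's logs show up

log.debug("This shows up in both the log file and the console")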

In closing, logging is important and you should use it right now and all the time, unless you have a good reason to ignore this advice. Saving time is not a good reason…this stuff is easy. Granted, this message is coming from the guy who wrote a lot of production Python code before taking the time to get his head around proper logging. But you can be better than me :).

ETL tool vs custom code

I used to help sell an ETL tool that had a graphical drag-and-drop interface. I really did like the tool, because with a little training you could quickly build a basic ETL job. I still like these types of tools if you are pulling data from a database that has a static or slow-changing data model. However, at my current company we do not use an ETL tool because I suggested we are better off without one. While it is possible we will use an ETL tool one day for certain tasks, we currently prefer Python and SQL to move and process our data. The primary reasons we went down this path are increased flexibility, portability, and maintainability.

One of my top regrets from leading a Data Warehousing team that used an ETL tool is that we felt limited by what the tool was capable of doing. Elements of ETL that were not as important when the team started were not easily supported by the tool. The best example is reading from a RESTful API; another is working with JSON data as a source. For those specific cases we could go find a tool that handles them, but what else will we encounter in the future? At my current company we are consuming RabbitMQ messages and using Kafka for data streaming, and we would not have known to plan for a tool that works well for these use cases. Since we are using Python (and Spark and Scala), there are no limits on what we can build. There are a lot of libraries already out there that we can leverage, and we can modify our own libraries as new ideas come up rather than being stuck with what a tool provides out of the box.

In many cases we choose to build a data flow engine rather than having one script per table or source. That amount of control over the code that moves data lets us build up an engine that supports many configurations while keeping the base code backward compatible for data sets already flowing through the system. We trade a longer ramp-up period to get our first build working for more flexibility and control down the road, and it cuts down on the frustrating rework when source systems change.

Another benefit of coding your own ETL is that you can change databases, servers, and data formats without applying changes all over your code base. We have already taken our library for reading SQL Server data and written a similar version that works for Postgres. With how our ETL jobs are set up, we just switched out the library import on the relevant scripts and didn’t have to dig into the logic that was running. I think this leads to better maintainability as well: when something keeps eating your time, you build the fix into the overall system once. I remember getting alerts at 2 a.m. because the metadata of a table had changed and our ETL tool couldn’t load the data without us refreshing the metadata in the job. With most of our Python code we can handle new columns added to the source data and either add that column to the destination table or just ignore it until we decide to modify the destination. This really has decreased time spent getting mad at the system administrators who disrupted our morning by adding a new custom field, though there is still plenty of work to do to ease the pain of changed data types and renamed columns.
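As a simplified illustration of the schema-drift handling described above, here is a toy sketch (not our actual library; the function, table, and column names are made up for the example):

def reconcile_columns(source_columns, destination_columns, table_name):
    """Build ALTER TABLE statements for columns that exist in the source but not the destination."""
    new_columns = [col for col in source_columns if col not in destination_columns]
    statements = []
    for col in new_columns:
        # Default new columns to text; a real library would map the source data types properly
        statements.append("ALTER TABLE %s ADD COLUMN %s text" % (table_name, col))
    return statements

# Example: a new custom field showed up in the source overnight
print(reconcile_columns(["id", "name", "custom_field_1"], ["id", "name"], "dim_customer"))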

I am sure there are plenty of different tools out there that do everything you could want (at least according to their sales teams), but I love the flexibility, control, and maintainability of writing our own applications to move data. It has worked out well for us as we have transitioned to building out a data platform rather than focusing only on tools that load an analytic data warehouse (but that is a topic for another time).

How To: Kafka 0.9 on Mac

Kafka is a distributed messaging system used for streaming data.  It works as a distributed commit log, and if you want to really understand why you should use Kafka it’s worth the time to read this article by Jay Kreps.  If you just want to get hands-on with Kafka on your laptop, follow these steps from the quick start guide (which should also work on Linux for a sandbox environment).  I didn’t hit errors along the way, so this is pretty similar to what is in the documentation, but I thought it’s worth sharing as a reference to the actual commands I used and a place to point back to when I post more articles about working with Kafka.

  1. Go to http://kafka.apache.org/downloads.html and download the version you want.  I chose kafka_2.11-0.9.0.0.tgz.
  2. Follow instructions here for initial setup: http://kafka.apache.org/documentation.html#quickstart
    1. unzip: tar -xzf kafka_2.11-0.9.0.0.tgz
    2. go to folder: cd kafka_2.11-0.9.0.0
    3. start zookeeper: bin/zookeeper-server-start.sh config/zookeeper.properties
    4. open new terminal window and go to folder
    5. start kafka: bin/kafka-server-start.sh config/server.properties
    6. test creating topic: bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
    7. test listing a topic: bin/kafka-topics.sh --list --zookeeper localhost:2181
  3. Follow the additional steps to run multiple brokers, since you would never use a single-broker setup for a real environment (though a production cluster would involve some different steps and a server per broker, of course)
    1. copy config: cp config/server.properties config/server-1.properties
    2. edit config/server-1.properties:
      broker.id=1
      listeners=PLAINTEXT://:9093
      log.dir=/tmp/kafka-logs-1
    3. copy config again: cp config/server.properties config/server-2.properties
    4. edit config/server-2.properties:
      broker.id=2
      listeners=PLAINTEXT://:9094
      log.dir=/tmp/kafka-logs-2
    5. keep ZooKeeper running but stop Kafka (Ctrl+C in the terminal it is running under)
    6. run all 3 brokers as background processes:
      bin/kafka-server-start.sh config/server.properties &
      bin/kafka-server-start.sh config/server-1.properties &
      bin/kafka-server-start.sh config/server-2.properties &
    7. test creating topic with replication factor of 3: bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic my-replicated-topic
    8. might as well publish a message to the test topic: bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
      {"value": "Test message 1"}
    9. then test out the consumer: bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning

The quick start guide and additional documentation have a lot more info that is worth exploring, but if things went well you now have a local instance to test with.  Congrats!
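If you would rather poke at the local cluster from Python instead of the console scripts, one option (not part of the quick start guide) is the kafka-python package (pip install kafka-python). Here is a rough sketch against the test topic created above:

# assumes kafka-python is installed and the local brokers from the steps above are running
from kafka import KafkaProducer, KafkaConsumer

# Publish a message to the test topic on the local broker
producer = KafkaProducer(bootstrap_servers='localhost:9092')
producer.send('test', b'{"value": "Test message 2"}')
producer.flush()

# Read everything currently on the topic, then time out instead of blocking forever
consumer = KafkaConsumer('test',
                         bootstrap_servers='localhost:9092',
                         auto_offset_reset='earliest',
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.value)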

How To: VirtualBox Shared Folders

When using virtual machines, you will likely want to set up a mapping from a local folder on your computer to a folder on the virtual machine. This is a good way to move files from your machine onto the VM and vice versa.  Here are the steps to set that up with VirtualBox using a CentOS image (in this case the Cloudera Sandbox VM).

  1. From the VM, go to the VirtualBox menu and choose Devices -> Insert Guest Additions CD Image…
  2. If the CD image does not start automatically, select the drive from the file browser and run “autorun.sh”.  This will install the needed additions.
  3. Then go to Devices -> Shared Folders -> Shared Folders Settings and set up your folder.  For this example we’ll use a local folder called “installs”.
  4. Restart the virtual machine.
  5. You can now find your folder under /media/sf_<foldername>, and you’ll probably need elevated permissions.  For my example, the command “sudo ls -l /media/sf_installs” can be used to view files and “sudo cp /media/sf_installs/<filename> ~/” can be used to copy files to a folder local to the VM.

Bonus info: once Guest Additions are installed you can also set up clipboard sharing, which lets you copy and paste between your machine and the VM. To do this, go to Devices -> Shared Clipboard and choose your option (such as Bidirectional).

Data Development Environment – Mac

A first step to developing with modern technologies such as Big Data and NoSQL systems is getting your development environment set up. I like to have many of the tools available locally on my laptop so I can feel free to experiment without breaking a shared server or running up a large bill on the cloud platform hosting the machine.  Check out my previous post on setting up on Windows to read a little about what I like about Python and Sublime Text.  For this post, let’s walk through the tools I found myself installing on my Mac to do Python data development (this excludes Scala setup).

  • Python installed by default – using Python 2.7.10
  • install Homebrew – http://brew.sh/
  • install Developer Tools – on command line type git and follow prompts to install developer tools (http://www.cnet.com/how-to/install-command-line-developer-tools-in-os-x/)
  • install pip – sudo easy_install pip
  • install Sublime Text 2
  • install PyCharm
  • install several things using Homebrew (type at command line):
    • brew update
    • brew install wget
    • brew install gcc
    • brew install apache-spark
  • pip install virtualenv
    • (then use virtualenv and virtualenvwrapper for most things Python)
  • create a virtual environment for data-eng and install
    • pip install pandas
    • pip install requests
    • …..(a lot more, might add to this list later with some of the heavily used ones)
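
Once the data-eng virtual environment is set up and those packages are installed, a quick sanity check like this will confirm the core packages import and work (just a sketch; the URL is arbitrary and pandas and requests are assumed to be installed in the active environment):

# run inside the activated data-eng virtualenv
import pandas as pd
import requests

# Build a tiny DataFrame to confirm pandas is working
df = pd.DataFrame({"name": ["alpha", "beta"], "value": [1, 2]})
print(df.describe())

# Confirm requests can make an outbound call (any reachable URL works here)
response = requests.get("http://www.example.com")
print(response.status_code)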

Hopefully that helps you get started. Feel free to leave comments if you hit errors along the way, and if I’ve dealt with similar errors I will give you some tips.

Data Development Environment – Windows

A first step to developing with modern technologies such as Big Data and NoSQL systems is getting your development environment set up. I like to have many of the tools available locally on my laptop so I can feel free to experiment without breaking a shared server or running up a large bill on the cloud platform hosting the machine.  Originally I started on a Windows machine and, being somewhat new to these technologies, tried a few paths that didn’t go very well.  Here is some guidance if you are trying to get going with Python and Hadoop or other open source data platforms on a Windows laptop.

Python 2.7

Python is a very popular programming language for processing data, and much of the ETL we write uses Python for its flexibility (compared to SSIS, which relies heavily on knowing the data model). It is simpler than Java and much easier to read, so if maintainability is important (which it should be) then it is a great option. You can use Python 2 or Python 3, but some third-party modules are not compatible with Python 3. Python can be installed directly on Windows, Mac, or Linux. I prefer using it on Linux or Mac because of the other command-line features and the popularity in the developer community.
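If you do end up moving between Python 2 and Python 3, one small habit that helps is writing print as a function from the start; a minimal example:

from __future__ import print_function  # makes print a function on Python 2.7 as well

import sys

print("Running on Python %d.%d" % (sys.version_info[0], sys.version_info[1]))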

Sublime Text 2

This is the text editor I use to write Python modules and edit JSON files, as well as any other type of text file. It is what others on our data team use, and when I first saw it I was impressed by how readable it makes the code. It is not an IDE; it is an awesome text editor. There are other options for text editors (gedit, emacs, vim), but Sublime Text 2 works well. If you want an IDE instead, PyCharm is one that was recommended in the Python Fundamentals course on Pluralsight.

Oracle VM VirtualBox

After trying to get things working with the Cygwin terminal and then setting up a Linux VM with VMware Player, I found VirtualBox was a highly ranked virtualization option and had the easiest time setting it up. One big requirement I had was being able to copy and paste from my local machine to the virtual machine, and I had trouble getting that working on my VMware machine.

Linux CentOS 7

Linux works well for developing and running Python code, plus you can install many open source projects on it, such as Apache Hadoop. I chose CentOS because of its similarity to Red Hat, which is supported by most databases and open source projects.  I found many examples for installing Python modules and client libraries on Linux, as well as plenty of information on installing Hadoop as a single-node instance. I did not face as many barriers as Cygwin presented, so once I made the jump to Linux I was finally able to focus on the programming instead of the system setup.

So try out this setup, check out my Resources page for ideas on what to learn once the environment is ready, and hit me up with questions if you get stuck.