I shared previously in my post ETL Tool vs Custom Code that I use Python for developing data flows. In my journey of writing production data flows and applications with Python I dove in head first and didn’t learn some of the useful practices until a year or two in to my journey. But lucky for you, I am going to share some of these foundational Python concepts in various “How To” posts, starting today. Logging is a major component I procrastinated on learning but think any new comer should learn in the first week. To help anyone else getting started (or who is awesome at Python but still is using print every other line) let’s look at both a basic logging example (minimum expected) plus a real world example (recommended).
To start, the basic example involves adding these lines at the beginning of your code:
import logging logging.basicConfig(level=logging.INFO) log = logging.getLogger(__name__)
Once this is set up, you can add log calls like this:
log.info("Logging is turned on")
This is the minimum amount of logging that I recommend, even if you are doing adhoc scripts like we often do in data engineering and data science. In the code recommended above, it will still print to your screen but if you decide to deploy your adhoc script as a job or module you now have the option to turn off those messages. The first thing you will notice is that most of your statements like print(“Opened connection”) which are really meant for logging and debugging start to become log.info(“Opened connection”). You should play with different levels of logging as well. Normally I set level=logging.DEBUG while I’m developing and then have statements like log.debug(“Load table using query: %s”, query) in order to keep an eye on if my code is doing what I meant for it to do.
Now, if you want to take it up a notch and apply your own format consistently across your Python applications, you should take the next step of creating a log configuration file. For starters, just save a file name logging.conf into your project directory and populate it with this text:
[formatters] keys=logfileformatter [logger_root] handlers=logfile [formatter_logfileformatter] format=%(levelno)s:%(asctime)s:%(name)s:sample-etl:%(message)s [handler_logfile] class=handlers.RotatingFileHandler args=('/var/log/sample_etl/sample_etl.log', 'a', 2000000, 10) formatter=logfileformatter
Now we have a file that specifies a custom format in the [formatter_logfileformatter] section. We set up a handler in the [handler_logfile] section that will write to a file and rotate files. After enough entries happen, it will truncate the oldest log file.
The next step is to tell your Python code to use this format, which you can do by replacing the first piece of log setup code I shared with this:
import logging.config logging.config.fileConfig('logging.conf') log = logging.getLogger(__name__) log.setLevel(logging.DEBUG) # Use debug to see the most log entries
Now you should try adding a StreamHandler that uses sys.stderror so that you can see the logging in your console also. This is something I add to my development environments to save me effort of tailing the actual files as I develop. To get you started, check out the documentation. If you get stuck, leave a comment.
In closing, logging is important and you should use it right now and all the time unless you have a good reason to ignore this advice. Saving time is not a good reason…this stuff is easy. Granted, this message is coming from the guy that wrote a lot of production Python code before taking the time to get his head around how to properly log. But you can be better than me :).