From time to time I’ve been asked about the different roles within a data team. For anyone wondering, here are the three roles I hear about most often, my take on what is expected of each, and the technology I most commonly hear associated with it. All of these roles play a big part in a good analytics team and do much more than I have taken the time to describe in this post, but hopefully this is a good overview.
Data Engineer (or Data/ETL Developer)
Role: Build out data systems, get data from various sources (often web APIs, flat files, or databases), transform data, integrate data, and make it available for analysts to use. This role is about building the foundation that the other roles constantly pull from, so there is always work to do. Building out the technology platform and base data structures is a very important step in the analytic process and may involve the most technical programming challenges. Usually this team picks which type of data system to work with, such as Hadoop, SQL Server, PostgreSQL, or Oracle.
Technology: SQL, Python, Hadoop -> Hive, Spark (using Python or Scala)
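To make the extract, transform, and load steps of this role concrete, here is a minimal sketch using only the Python standard library. The CSV contents, the orders table, and the function names are all hypothetical, and an in-memory SQLite database stands in for whatever data system a real team would pick.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file extract; a real source might be a web API or another database.
RAW_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,3.25
3,alice,
"""


def extract(text):
    """Read rows from CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Normalize customer names and drop rows with a missing amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        customer = row["customer"].strip().lower()
        cleaned.append((int(row["order_id"]), customer, float(amount)))
    return cleaned


def load(rows, conn):
    """Write cleaned rows to a table that analysts can query with SQL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # assumption: stand-in for the team's real data store
load(transform(extract(RAW_CSV)), conn)

totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

Keeping the three steps as separate functions is only one possible design, but it makes it easy to swap the source or the destination without touching the cleaning logic.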
Data Scientist
Role: Expected to spend a lot of time analyzing data, though the role varies by company. We think of this as someone who has a high level of statistics and math training, is able to build analytic and predictive models, and is able to test an idea against the data and come back knowing whether the hypothesis holds true and with what likelihood of error. This role is usually focused on analytic projects with significant impact on the company. Some projects can take quite a while, and a lot of data processing, quality evaluation, cleansing, and normalization takes place along the way. One of the hardest things to learn in an academic setting is how to know whether your model or other results are accurate before the company invests money acting on them, but within a company that is an important characteristic of a data scientist.
Technology: SQL, R, Python (with Pandas or other libraries), Spark MLlib, Mahout, or another machine learning library
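As one illustration of testing a hypothesis against the data and reporting a likelihood of error, here is a rough sketch of a two-sided permutation test in plain Python. The permutation test is my choice for the example rather than a required method, and the control and variant measurements are made-up numbers.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Estimate a two-sided p-value for the difference in group means.

    The p-value approximates how often a difference at least this large
    would appear if the group labels were assigned at random.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical measurements, e.g. average order value under two page designs.
control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]
variant = [13.0, 12.9, 13.4, 12.8, 13.1, 13.3, 12.7, 13.2]

p_value = permutation_test(control, variant)
print(f"estimated p-value: {p_value:.4f}")
```

A small p-value suggests the difference is unlikely to be a product of chance alone, but it says nothing about data quality, which is why the cleansing and evaluation work mentioned above still matters.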
Data Analyst (sometimes called Data Scientist now, especially in the Bay Area)
Role: Expected to analyze data with less focus on statistics, often building reports and dashboards for others to use on a recurring basis. Analysts often partner with business users to help them come up with meaningful metrics or reports, and they should be able to quality-check data and find anomalies that would mislead management if not explained or cleaned up. This role may play a part in deciding which reporting and data visualization tools the company uses, and it often focuses on answering short-term questions.
Technology: SQL, Tableau, D3.js, Excel
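The kind of quality check described for this role can be sketched in a few lines of Python. This version flags values that sit far from the median, measured in median absolute deviations; the threshold of 5 and the daily signup counts are assumptions for illustration, and in practice the same check might just as well live in SQL or Excel.

```python
from statistics import median


def find_anomalies(values, threshold=5.0):
    """Return (index, value) pairs that lie far from the median.

    Distance is measured in median absolute deviations (MAD), which is
    less sensitive to the outliers themselves than a standard deviation.
    """
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []  # no spread to compare against, so nothing is flagged
    return [
        (i, v)
        for i, v in enumerate(values)
        if abs(v - center) / mad > threshold
    ]


# Hypothetical daily signup counts with one suspicious spike.
daily_signups = [102, 98, 105, 99, 101, 970, 100, 97]

anomalies = find_anomalies(daily_signups)
print(anomalies)
```

A flagged value is not necessarily wrong. It is a prompt to find out whether the spike reflects a real event or a data problem before the number reaches a dashboard.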