STORIES ON DATA SCIENCE — A PLACE TO LEARN ABOUT THIS DISCIPLINE, WITH A PERSONAL TWIST.

Full Stack Data Scientist… or not

From hero to zero during a one-hour data science interview

Juan Carlos Vázquez

After doing animal breeding and human genomics for a couple of years, and having worked in an administrative role for twelve years in the Information Systems Engineering degree program (which gave me direct access to many books and other knowledge resources), I finally felt ready to submit my CV for data science positions. I knew I was going to be out of my comfort zone but, hey, that is the only way to learn. For a couple of months or so, I applied to every job opening I could find, both local and remote. I knew my profile was being seen (thanks, LinkedIn), but no one called. I asked a couple of recruiters about this, and they were kind enough to tell me that the distance between my academic studies and the job made it difficult to recommend me for such positions. I was very frustrated.

Nevertheless, I started to plan how to close the gap, and decided to go for a PhD in the area. I thought this was an industry issue, and that the academic world would not have such nonsense. Yeah, I was naive. I worked with my research director and wrote a thesis proposal for a PhD in Information Systems Engineering. Long story short, they said my proposal was fantastic and a perfect fit… but I was a veterinarian. I was even more frustrated.

Photo by Tim Gouw on Unsplash. Hey! This could have been me at that time!

I stopped trying to get into the field, at least for the time being. I focused on my PhD (now in Biochemistry), started teaching at the university and tried to let my frustration go. I kept on studying, taking postgraduate courses on parallel computing, neural networks and so on.

One year later, someone contacted me on LinkedIn, asking if I was open to new job opportunities. Tatiana had seen my profile, kept on reading even after she saw I was a vet, and thought I could apply for the position open at the company. I still don’t know if I was able to hide my anxiety and happiness. We had a couple of calls and then she arranged a meeting with the Leader of Innovation at the company.

I guess many of you can relate to this kind of situation. The thing is, I was not as nervous as I should have been. I had decided I was just going to show who I was. Whatever happened would be for the best. So I went to the company to meet Marcelo. He is quite a remarkable person, with a contagious energy. The kind of guy you’d want to work with, or at least I did.

We talked about many things. He told me about the company, about how they managed Data Science projects, which aspects they valued in a team, and what the roles of the team members were. And I want to stop here, because I didn’t fully understand, at that time, what every role did. So, in this article, I will explain what my thoughts were and how wrong they were.

The Data Engineer:
During the interview: This was one of the roles I discovered during the interview. Having worked on solo projects most of my time, I did all the data wrangling and cleansing. So when I found that there actually existed a role for that, I was kind of puzzled. On one hand, I would have much more time to just play with models, features and parameters, but on the other hand, I felt like I lost control over raw data, since the preprocessing was not in my hands.
Real Life: These are the guys that know, they really know. It’s not just data wrangling (hey, everyone can do that). They really understand the best way to serve the data, in relation to both the infrastructure and the data itself. They know when to use indexes, primary keys, secondary keys and sorting, and they are a walking SQL dictionary. Plus they can get data from any data source available: relational databases, non-relational databases, files, images, sound, IoT, APIs, and any other acronym you can think of. They use technologies with weird names like NiFi, Glue, Lambda and so on. They can answer pretty much everything about the data, and also know what kind of data is needed for machine learning, deep learning or plain business rules. They are the builders of Big Data, and are one of the most important roles, if not the most important, needed to digitalize a business. They rule, period.
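To make the "they know when to use indexes" part a bit more concrete, here is a minimal sketch using SQLite from Python's standard library. The table, column and index names (readings, sensor_id, idx_readings_sensor) are made up for illustration, and this is just a toy, not how any particular data team works. It asks SQLite for its query plan before and after an index is created on the filtered column.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return SQLite's query plan for `sql` as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row is the human-readable plan step.
    return " | ".join(row[-1] for row in rows)

# Hypothetical table of sensor readings, kept in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (id INTEGER PRIMARY KEY, sensor_id INTEGER, value REAL)"
)
conn.executemany(
    "INSERT INTO readings (sensor_id, value) VALUES (?, ?)",
    [(i % 100, i * 0.5) for i in range(10_000)],
)

sql = "SELECT * FROM readings WHERE sensor_id = ?"

# Without an index, SQLite has to scan the whole table.
plan_before = query_plan(conn, sql, (42,))

# With an index on the filtered column, it can jump straight to the matching rows.
conn.execute("CREATE INDEX idx_readings_sensor ON readings (sensor_id)")
plan_after = query_plan(conn, sql, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan varies between SQLite versions, but the second plan should mention the index while the first one does not. Knowing when that trade-off is worth it (indexes cost storage and slow down writes) is exactly the kind of thing data engineers carry in their heads.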

The Data Scientist:
During the interview: Hey, I know this, I thought. This is, by far, the best-known role in the area. It is the guy who makes the computer “learn”. He is in charge of, well, from my point of view at that time, everything. But I discovered during the interview that I would not preprocess data. So my job was even simpler than I thought. Just know my math and my models, and optimize everything to get the best possible results.

“Big Data is not about the Data” — Gary King, Harvard University. Explaining how collecting data is easy, but the real value is in the analytics.

Real Life: Data engineers are also called data plumbers. Their role is vital, as I said before, but they often go unnoticed. Data scientists, on the other hand, are the ones that define whether a project will be successful or not. There are no grey areas here. A project can go incredibly well, or it can go extremely wrong. And the one blamed is the data scientist. This is the role that has to provide value to the business. Whether it is by supporting decision making through predictive models, or through the use of modern technologies to optimize or automate processes, this is the stage where it can really be determined if the project will reach something useful or not. With regard to coding, this role may not be the most proficient, or the cleanest. Jupyter notebooks are often a mess, pretty much going back and forth doing tests. We only clean the code when it is going to production, and that is, in many ways, really wrong. There is a real need for “Software Engineering Good Practices for Data Scientists”, and it is slowly starting to happen. Data scientists, however, have to understand what happens behind the “magic” libraries. What is the math behind it? When and why should I use this or that model? How should I handle NaNs, drop ’em or impute ’em? Scientists also have to understand the business. They have to get inside the business domain. This is a must. Otherwise, they won’t be able to understand how to empower the business and what can be done to deliver these powerful management tools, and all the effort made will go to waste.
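The "drop ’em or impute ’em" question is easier to see with a tiny example. This is only a sketch in plain Python, with made-up numbers and helper names (drop_missing, impute_mean, weights_kg) that I am assuming for illustration; in real work you would more likely reach for a data frame library, and mean imputation is just one of many possible strategies.

```python
import math
from statistics import mean

def drop_missing(values):
    """Return only the observed values, discarding NaNs."""
    return [v for v in values if not math.isnan(v)]

def impute_mean(values):
    """Replace each NaN with the mean of the observed values."""
    observed = drop_missing(values)
    if not observed:
        raise ValueError("cannot impute: every value is missing")
    fill = mean(observed)
    return [fill if math.isnan(v) else v for v in values]

# Hypothetical measurements with two missing entries.
weights_kg = [410.0, float("nan"), 390.0, 400.0, float("nan")]

dropped = drop_missing(weights_kg)   # fewer rows, only real observations
imputed = impute_mean(weights_kg)    # same number of rows, gaps filled in

print(len(dropped), "rows after dropping")
print(len(imputed), "rows after imputing")
```

Dropping keeps only what was actually measured but throws away rows (and whatever other columns came with them). Imputing with the mean keeps every row, but it pulls the filled values toward the center and so shrinks the variance of the column. Which one hurts less depends on the data and on the model, and that is precisely why the scientist has to understand what is going on behind the library call.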

The Dev Ops / Data Ops:
During the interview: Another new role. I was familiar with the term dev ops, even if I didn’t exactly know what they did. But the little knowledge I had was not wrong. They were in charge of setting up the infrastructure environment and solving any issues regarding hardware and communication. Again, I had participated in many academic research projects managing the hardware and software configuration, so I felt pretty confident in this area. Hah!
Real Life: Along with the data engineer, this is a role that often goes unnoticed. It may also be the most difficult role to describe, and I will surely fail at it. DataOps obviously derives from DevOps, but it has evolved beyond just applying the same methodology and philosophy to data. The Data Ops is like the glue of Data Science, and also serves, many times, as the orchestrator of the ins and outs between roles. He is responsible for creating an environment that data scientists feel comfortable in, with all the languages and libraries that we cherish. He also knows how to communicate all the work done by the data engineers to the scientists. Finally, he designs the best way to implement models in production, and establishes the environment to set them up. Thus, he has to know every role, in order to facilitate the tasks carried out.

“Sometimes, the most ordinary things could be made extraordinary, simply by doing them with the right people” — Nicholas Sparks

The Developer:
During the interview: Ok, this is pretty clear. It is the guy who gets the model and carries it to production. Maybe with a web app, an API or some other software development. Personally, I had developed the backend in Python and the frontend in Shiny for a couple of apps that were (and still are) being used. Development has a big place in my heart, even though I still have issues with JavaScript and, therefore, am not so proficient with front ends.
Real Life: This was the clearest role from the very beginning. The developer is the one that develops (duh!) the application that will be used by the business analyst or management, using models trained by the scientist and data cleaned by the engineer. Even though it is an important role, it is usually carried out by third parties. The job is not inherently related to data science, in the sense that the developer does not have to know the data or the models. However, if the role is taken by someone inside the data team, the final solution will be much more robust and more appealing for the business. The reason behind this is that having domain knowledge about the data and the business will allow the developer to understand the friendliest and easiest way to present the most useful and valuable data to the clients.

The Business Analyst:
During the interview: Well, I really had not even thought about this role. Coming from an academic background, I did visualizations only for presentations and papers, and even then, they were pretty basic. No interactivity, no auto updating… Just plain graphics. However, it made sense when working with businesses. Again, I thought this would make my job a lot easier, one less thing to think about.
Real Life: These guys are information artists. They get a data table and transform it into something beautiful. It is no easy task: they have to know how to summarize all the important information provided, which requires a deep knowledge of the client’s business. And they have to choose the best ways to show that information, not only choosing the appropriate chart or table, but also knowing how to “make it sell”. Business Intelligence people are, in my opinion, the sales makers of a data company. They can also help us shape the process output, since they will consume it. They are the ones that will create the tools to conquer those difficult meetings with clients, making kick-ass graphics in real time to show off.

The bosses

The Project Manager:
During the interview: Boss number 1, or that was what I thought. Although my leader explained it in a friendlier way, I understood it was the guy who would put pressure on me, so that deadlines were met. I supposed he would have clear estimates of how much time I should spend on each task, and would remind me when I was behind schedule. This, however, was one of the roles I was most mistaken about.

Real Life: This is a difficult one, because I know this will be read, lol. To be honest, the project manager has a really big responsibility. He is in charge of keeping the projects on schedule. That is a really complex job in any software discipline, but it is extremely hard for data science. The scientific method is, inherently, timeless. It is a creative process and, as such, it is hard to structure and evaluate. How do you assess productivity on a team that can’t give you an estimated date until the process has gone through the initial phases, and how do you rate a model that may not be useful, even if the development process was carried out brilliantly? Moreover, the Project Manager has another responsibility, although it often goes unseen. Workplace wellness, team motivation and “happiness” depend, in my opinion, almost two thirds on the Project Manager. A good Project Manager can create a great workplace and keep spirits up even when a project fails. He considers himself an equal collaborator with the other team members, but stands as a shield for the team against unreasonable or abusive clients. On the contrary, a bad Project Manager can reduce teamwork to the worst experience, making everyone work extra hours, managing neither projects nor clients efficiently, and trying to take personal advantage of the work done by others. I have lived both experiences and, in my opinion, the Project Manager role, at least in data science, must be filled with extreme caution, taking special consideration of the soft skills and the type of personality a candidate has.

The Product Owner:
During the interview: Boss number 2. Again, I was sooo wrong with this one. I imagined this as the client, the guy who always wants the job done in the shortest time possible and with the smallest budget. Maybe he would not be putting pressure on the time, but this was the one that would judge the results. I never thought of the real value of this role during the interview. Instead, I believed it would be more of an obstacle.
Real Life: This has to be the most useful external role for a data science team. The Product Owner is familiar with the data science process, but he comes from the business side. He can detect the client’s needs even before the need exists. And he is a great communicator, making it easy to translate the business language to the scientists and vice versa. I have been in projects with no product owner, projects where the product owner was not fully committed and projects with a really good and committed product owner, and I can tell you that there is a huge difference in the data science process. Every deliverable is much better when the product owner exists and is easily reachable, which improves the final product enormously.

The Golden Unicorn — Full Stack Data Scientist

So I left my first interview thinking that I was a full stack data scientist. For those not familiar with that term, it comes from web development, and refers to a developer that can do both backend and frontend. I thought I could take part in any of the roles in the data science process without much difficulty. I was just missing a couple of technologies, but I understood every step of the process and had carried it out in my research. Hah! Silly me. There are no unicorns in real life, and golden ones are even rarer. Every role has its own particularities and each member masters a whole different set of knowledge and abilities. Even so, I’m almost certain that we will start to see more and more full stack data scientist job offers. It has happened before, and it will surely happen again.

Still, I don’t think this search is completely wrong. I will keep trying to reach that mystical knowledge. I’ll probably have a deeper knowledge of one or two roles, but people with holistic knowledge of the data science process are extremely necessary. A person that can take on any role, even if they are not the best at that role, has a lot of value in a dynamically changing field such as this one. Moreover, this capability speaks of a person’s adaptability and resilience which are, in my opinion, among the most important soft skills in our field. But I will write about soft skills in another article.

I’d like to thank Tatiana, Marcelo, Leandro and Mauricio for giving me the opportunity to show what I’m made of, and every member of my team and of the company, for teaching me so much and keeping a great workplace.

Juan Carlos Vázquez

Lead Data Scientist at CoreBI S.A. PhD Student, Vet, teacher, researcher, developer, fell in love with Data Science. My academic research is in bioinformatics.