What makes a great data scientist?

“Data scientist” is an umbrella term for people whose main responsibility is leveraging data to help other people (or machines) make more informed decisions. The spectrum of data scientist roles is so broad that I will keep that discussion for my next post. What I really want to focus on here are the distinctive characteristics of a great data scientist.

Over the years that I have worked with data and analytics, I have found that this has almost nothing to do with technical skills. Yes, you read that right. Technical knowledge is a must-have if you want to get hired, but that’s just the bare minimum. The features that make someone a great data scientist are mostly non-technical. So what are the 3 key things that distinguish a great data scientist?

1. A great data scientist is obsessed with solving problems, not with new tools.

This one is so fundamental that it is hard to believe it’s so simple. Every occupation has this curse – people tend to focus on tools and processes or, more generally, to emphasize form over content. A very good example is the ongoing debate over whether R or Python is better for data science and which one will win the beauty contest. Or another one – frequentist vs. Bayesian statistics and why one will become obsolete. Or my favorite – SQL is dead, and all data will be stored in NoSQL databases.

These are just instruments that are used to solve problems. The American philosopher Abraham Kaplan coined the concept of the law of the instrument, which he described this way: “I call it the law of the instrument, and it may be formulated as follows: Give a small boy a hammer, and he will find that everything he encounters needs pounding.” It was popularized by the psychologist Abraham Maslow, whose version is usually quoted as “if all you have is a hammer, everything looks like a nail”.

The core function of any data-driven role is solving problems by extracting knowledge from data. A great data scientist first strives to understand the problem at hand, then defines the requirements for the solution, and only then decides which tools and techniques best fit the task. In most business cases, the stakeholders you interact with do not care about the tools – they only care about answering tough questions and solving problems. Knowing how to select, learn and use tools and techniques is a minimum requirement for becoming a data scientist. A great data scientist knows that understanding the underpinnings of the business case is paramount to the success of a data science project.

2. A great data scientist wants to find a solution and knows it won’t be perfect.

A very dangerous state for any data scientist is being stuck in an infinite loop of analytic iterations – drilling in, finding insights, zooming out, looking at the macro level, redefining hypotheses, zooming in again, looking at the most granular details, then re-thinking, and round and round. This is called analysis paralysis, which is basically over-thinking the process by trying to find the “perfect” solution.

A great data scientist understands that there’s almost never a perfect solution, and that a simple, imperfect solution delivered on time is much better than a hypothetically perfect one delivered late. In fact, the Agile software development methodology seeks to prevent analysis paralysis by employing adaptive evolutionary planning, early delivery and continuous improvement. The mindset of a great data scientist works in the same way – they think about solving their stakeholders’ problems and know that those problems need to be redefined when new insights are uncovered.

The main piece of advice here – don’t overthink and over-analyze the problem. Instead, break your analysis or modelling process into stages and get feedback from the problem owners after each one. This way you ensure that the learning process is continuous and that the decisions improve with each iteration.

3. A great data scientist is the ultimate communicator.

As you can see, there’s a lot of communication involved in understanding the problem and delivering constant feedback to the stakeholders. But this only scratches the surface of why communication matters – a much more important element is asking the right questions. Sounds easy, right? It’s not, actually. Data scientists are much more likely than people in any other occupation to fall into the trap of the curse of knowledge, a cognitive bias that “occurs when an individual, communicating with other individuals, unknowingly assumes that the others have the background to understand.”

When a data scientist is scoping out a problem together with the stakeholders or presenting the first findings, it is vital to be as explicit and detailed as possible and not to assume that the stakeholders know as much as you do. This is very hard, as the assumptions and underlying methodologies that a data scientist relies on can number in the dozens, even hundreds.

The biggest risk arises when the stakeholder briefly describes the problem and the data scientist doesn’t ask enough questions and simply assumes what the problem is. The data scientist then builds a solution that seems to solve the described problem. Too few questions and too many assumptions lead to a final solution that actually solves a different problem from the original one and gives the opposite recommendation or result.

Great data scientists never assume they know something without in-depth analysis. They think in hypotheses that need to be either rejected or confirmed, and they ask a lot of questions, even if they are 99.9% sure they know the answer.

Wait – what about programming, statistics, math, hacking?

The fact of the matter is that you must have the technical skills and a strong basic foundation to be hired as a data scientist – you can read more about the basic requirements in my previous blog post, “How to think like a data scientist to become one”. This is what’s expected and required from you as the bare minimum.

But to become a truly great data scientist you have to be an ultimate problem solver who is obsessed with understanding the ins and outs of the business case you’re handed.

How to think like a data scientist to become one

We have all read the punchlines – data scientist is the sexiest job, there aren’t enough of them and the salaries are very high. The role has been sold so well that the number of data science courses and college programs is growing like crazy. After my previous blog post I received questions from people asking how to become a data scientist – which courses are the best, what steps to take, what is the fastest way to land a data science job?

I tried to really think it through and I reflected on my personal experience – how did I get here? How did I become a data scientist? Am I a data scientist? My experience has been very mixed – I started out as a securities analyst in an investment house, using mainly Excel, then slowly shifted towards business intelligence in the banking industry and then in consulting, and eventually ended up doing the actual so-called “data science” – building predictive models, working with Big Data, crunching tons of numbers and writing code to do data analysis and machine learning – though in the earlier days it was called “data mining”.

When the data science hype started, I tried to understand how it was different from what I had been doing so far. Maybe I should learn new skills and become a data scientist instead of someone working “in analytics”?

Like everybody obsessed with it, I started taking multiple courses, reading data books, doing data science specializations (and not finishing all of them...) and coding a lot – I wanted to become THE one in the middle cross-section of the (in)famous data science Venn diagram. What I did learn is that these unicorns (yes, the people in the middle “Data Science” bucket are called unicorns) rarely exist, and even if they do, they are typically generalists who have knowledge in all of these areas but are “master of none”.

Although I now consider myself a data scientist – I lead a fantastically talented data science team at Amazon, build machine learning models, work with “Big Data” – I still think there’s too much chaos around the craft and too little clarity, especially for people new to the industry or trying to get in. Don’t get me wrong – there are a lot of very complex branches of data science, like AI, robotics, computer vision and voice recognition, which require very deep technical and mathematical expertise, and potentially a PhD… or two. But if you are interested in getting into a data science role that was called business or data analyst just a few years ago, here are the four rules that have helped me survive in the data science world.

Rule 1 – Get your priorities and motivations straight. Be very realistic about what skills you have right now and where you want to arrive – there are so many different roles in data science that it’s important to understand and assess your current knowledge base. Let’s say you’re working in HR and want to change careers – learn about HR analytics! If you’re a lawyer – understand the data applications in the legal industry. The fact is that the hunger for insights is so big that all industries and business functions have started using data. If you already have a job, try to understand what can be optimized or solved by using data and learn how to do it yourself. It’s going to be a long and gradual shift, but you will still have a job and you will learn by doing it in the real world. If you are a recent graduate or a student, you have a perfect chance to figure out what you are passionate about – maybe movies, maybe music, or maybe cars? You wouldn’t imagine the number of data scientists these industries employ – and they are all crazy about the fields they’re working in.

Rule 2 – Learn the basics very well. Although the specifics of each data science field are very different, the basics are the same. There are three areas where you should develop very strong foundations – basic data analysis, introductory statistics and coding.

Data analysis. You should understand and practice (a lot!) the basic data analysis techniques – what a data table is, how to join tables, what the main techniques are for analyzing data organized in such a way, how to build summary views on your dataset and draw initial conclusions from them, what exploratory data analysis is, and which visualizations can help you understand and learn from data. This is very basic, but believe me – master this and you’ll have the fundamental skill that is absolutely mandatory for the job.
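To make the “join tables, then build a summary view” idea concrete, here is a minimal sketch in plain Python. Everything in it is made up for illustration: the two toy tables (customers and orders), their columns, and the helper names join_orders_to_customers and summarize_by_segment. In practice you would more likely do this in SQL or with a data analysis library, but the logic is the same.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy tables: one row per customer, one row per order.
customers = [
    {"customer_id": 1, "segment": "retail"},
    {"customer_id": 2, "segment": "business"},
    {"customer_id": 3, "segment": "retail"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 20.0},
    {"order_id": 11, "customer_id": 1, "amount": 35.0},
    {"order_id": 12, "customer_id": 2, "amount": 400.0},
    {"order_id": 13, "customer_id": 3, "amount": 15.0},
]


def join_orders_to_customers(orders, customers):
    """Inner join: attach the customer's attributes to each order via customer_id."""
    by_id = {c["customer_id"]: c for c in customers}
    return [
        {**order, **by_id[order["customer_id"]]}
        for order in orders
        if order["customer_id"] in by_id
    ]


def summarize_by_segment(rows):
    """Summary view: order count, total and average amount per customer segment."""
    amounts = defaultdict(list)
    for row in rows:
        amounts[row["segment"]].append(row["amount"])
    return {
        segment: {"orders": len(vals), "total": sum(vals), "average": mean(vals)}
        for segment, vals in amounts.items()
    }


joined = join_orders_to_customers(orders, customers)
summary = summarize_by_segment(joined)
for segment, stats in sorted(summary.items()):
    print(segment, stats)
```

Even on a toy dataset like this, the summary already raises a question worth asking: a single large order is driving the whole “business” segment, so is its average meaningful?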

Statistics. Also, get a very good grasp of introductory statistics – what the mean and the median are and when to use one over the other, what standard deviation is and when it doesn’t make any sense to use it, why averages “lie” but are still the most used aggregate value everywhere. And when I say “introductory” I really mean “introductory”. If you are a mathematician and plan to become an econometrician who applies advanced statistical and econometric models to explain complex phenomena, then yes, learn advanced statistics. If you don’t have a PhD in mathematics, just take your time, be patient and get a really good grasp of basic statistics and probability.
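Here is a small illustration of why averages “lie”, using only Python’s standard statistics module. The salary figures are invented for the example: nine typical employees and one executive.

```python
from statistics import mean, median, stdev

# Hypothetical monthly salaries: nine typical employees and one executive.
salaries = [2000, 2100, 2200, 2300, 2400, 2500, 2600, 2700, 2800, 50000]

avg = mean(salaries)      # pulled far upwards by the single outlier
mid = median(salaries)    # the middle of the sorted values, barely affected
spread = stdev(salaries)  # sample standard deviation, also dominated by the outlier

below_average = sum(1 for s in salaries if s < avg)

print(f"mean:   {avg}")
print(f"median: {mid}")
print(f"stdev:  {spread:.0f}")
print(f"{below_average} of {len(salaries)} people earn less than the 'average' salary")
```

With one extreme value in the data, the mean describes nobody in the group, while the median still describes a typical employee. The standard deviation ends up larger than the mean itself, which is a hint that it is not a sensible summary for data this skewed.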

Coding. And of course – learn how to code. This is the most over-used cliché advice, but it’s actually sound. You should start by learning how to query a database with SQL – believe it or not, most of the time data science teams spend goes into pulling and preparing data, and a lot of that is done with SQL. So get your basics in place – build your own small database, write some “select * from my_table” lines and get a good grasp of the SQL fundamentals. You should also learn one (start with just one) data analysis language – be it R or Python, both are great. Knowing one does make a difference, and many positions require it, although not all. First learn the basics of the language you chose, with a focus on how to do data analysis with it. You don’t have to become a programmer to succeed in the field; it’s all about knowing how to use the language to analyze and visualize data.
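One low-friction way to “build your own small database” is SQLite, which ships with Python as the sqlite3 module. The sketch below is just one possible starting point: the table layout and the spending rows are invented, and only the table name my_table comes from the text above.

```python
import sqlite3

# An in-memory database: nothing is written to disk, so it is safe to experiment.
conn = sqlite3.connect(":memory:")

# Hypothetical table of personal spending records.
conn.execute(
    "CREATE TABLE my_table (id INTEGER PRIMARY KEY, category TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO my_table (category, amount) VALUES (?, ?)",
    [
        ("groceries", 54.5),
        ("rent", 900.0),
        ("groceries", 23.5),
        ("transport", 40.0),
    ],
)

# The classic first query: look at everything in the table.
all_rows = conn.execute("SELECT * FROM my_table").fetchall()

# A first "real" question: how much was spent per category, largest first?
totals = conn.execute(
    "SELECT category, SUM(amount) AS total "
    "FROM my_table GROUP BY category ORDER BY total DESC"
).fetchall()

conn.close()

for row in all_rows:
    print(row)
for category, total in totals:
    print(category, total)
```

From here, the natural next steps are WHERE filters, a second table and a JOIN, which covers a large share of everyday data pulling.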

Rule 3 – Data science is about solving problems – find and solve one. The thing I have learned over the years is that one of the fundamental requirements for a data scientist is to always be asking questions and looking for problems. Now, I don’t advise doing it 24/7, as you will definitely go insane, but be prepared to be the problem solver and to keep looking for problems. Start small and find areas in your own life that can benefit from some analysis – you will be amazed how much data is available out there. Maybe you will analyze your spending patterns, identify sentiment patterns in your emails, or just build nice charts to track your city’s finances. The data scientist is responsible for questioning everything – is this marketing campaign effective, are there any concerning trends in the business, do some products under-perform and should they be taken off the market, does the discount the company gives make sense or is it too big – and these questions become hypotheses that are then validated or rejected by the data scientist. Hypotheses are the raw material of a data scientist: the more of them you test and explain, the better you’ll be at your job.

Rule 4 – Start doing instead of planning what you will do “when”. This is applicable to any learning behavior, but it’s especially true in data science. Be sure you start “doing” from the very first day you start learning. It’s very easy to put off the actual learning by just reading “about” data science and how it “should” be done, or by copy-pasting data analysis code from a book and running it on very simple datasets of a kind you will never ever get in the real world.

With everything you learn, be sure you start applying it to the field you’re passionate about. That’s where the magic happens – writing your first line of code and seeing it fail, being stuck and not knowing what to do next, looking for the answer, finding a lot of different solutions none of which work, struggling to build your own and finally passing a milestone – the “aha!” moment. This is how you will actually learn. Learning by doing is the only way to learn data science – you don’t learn how to ride a bike by reading about it, right? The same rule applies here – whatever you learn, be sure you apply it immediately and solve actual problems with real data.

“If you spend too much time thinking about a thing, you’ll never get it done.” This quote from Bruce Lee, one of the most famous martial artists, captures the essence of this post. You have to apply what you learn and make sure you make your own mistakes – this is the only way you will learn and improve. And if Bruce Lee doesn’t convince you, maybe Shia LaBeouf will.

Thanks for reading! Subscribe to my blog www.cyborgus.com and get the latest updates. You can also follow me on social networks:

Follow my blog updates on Facebook – https://www.facebook.com/cyborguscom/

Look me up on LinkedIn – https://www.linkedin.com/in/karolisurbonas/

Hello, World!

My name is Karolis and I love everything about data – machine learning, getting the “aha” insightful moments, AI, automating the boring stuff and every engineering puzzle I have to solve in my job.

I have been working with data and its applications for more than 10 years now. Currently I lead a brilliant data science team at Amazon as the Head of Business Intelligence and Data Science for Amazon Devices.

The goal of this blog is to share practical, applied methods for building data-driven applications, analyses and data products in the real world. A sort of “survival tutorial for data science practitioners”.

If you have any suggestions or ideas or just want to chat – please feel free to reach out to me via LinkedIn or Facebook – the links are on the top right.

I hope you will enjoy reading my blog!