How to Lie with Data

We expect data scientists and analysts to be objective and to base their conclusions on data. While the name of the job implies that “data” is the fundamental material of their work, it is entirely possible to lie with it. Quite the opposite of objective, in fact – the data scientist is affected by unconscious biases, peer pressure and urgency, and if that’s not enough, there are risks inherent in the process of data analysis and interpretation itself. It happens all the time, even when the intentions are truly honest – though we all know the saying “The road to Hell is paved with good intentions”.

As every industry in every country is affected by the data revolution, we need to make sure we are aware of the dangerous mechanisms that can distort the output of any data project.

Averages, averages everywhere

The average is the most over-used aggregation metric, and it creates lies everywhere. Unless the underlying data is normally distributed (and it almost never is), a reported average tells you very little about reality. When the distribution is skewed, the average is pulled toward the tail and stops representing a typical value. The average is not a robust metric, which means it is very sensitive to outliers and to any deviation from the normal distribution.

And while statisticians have known this for decades, the average is still used in businesses, institutions and governments as a core statistic that drives billions, even trillions of dollars’ worth of decisions. What’s the solution? Don’t use it! Stop this instant and start thinking consciously about data distributions before reporting a statistic that only works in rare cases. As a first step, summarize your data with the median and with percentile metrics such as the 1st and 99th percentiles.
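To make this concrete, here is a minimal sketch in Python (the salary figures are made up for illustration) of how a single outlier drags the average away from anything representative while the median and percentiles barely move:

```python
import numpy as np

# Hypothetical salaries: most people earn 30-60k, one executive earns 5M.
salaries = np.array([32_000, 38_000, 41_000, 45_000, 52_000,
                     55_000, 58_000, 60_000, 5_000_000])

print(f"mean:     {salaries.mean():,.0f}")      # ~597,889 - represents nobody
print(f"median:   {np.median(salaries):,.0f}")  # 52,000 - a typical salary
print(f"1st pct:  {np.percentile(salaries, 1):,.0f}")
print(f"99th pct: {np.percentile(salaries, 99):,.0f}")
```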

“The Average” has been standing on the data science – hell, any science – pedestal for far too long. It has so many blind followers who don’t question it that we can almost consider it a religion. Why? Because the normal distribution assumptions made in the natural sciences a long time ago spilled over to other fields, especially business analytics and other corporate data applications. This has poisoned generations of analysts who to this day still lie with averages.

Fitting data to hypothesis – confirmation bias

Now this is a classic. It starts even before you are handed the problem to solve with data – and how the problem is framed feeds the bias too. The way a data scientist views the case or problem to be solved can fundamentally change a process that is supposed to be objective. This bias intensifies when there are strong emotions – either expressed or implied – about the matter in question. It is typically very hard to identify, and spotting it is what separates truly exceptional data scientists from the average ones (pun intended).

A typical situation is a rushed analysis: there’s pressure to deliver the outcome fast because an important decision is pending on it. A lot of biases kick in, but confirmation bias is the one that offers data scientists the easiest “way out”. The data scientist rushes to answer the question or solve the problem as soon as possible, which means the first spurious correlation discovered can become the answer. In these situations evidence is sought only to confirm the hypothesis – hence “fitting data to the hypothesis”.

This happens when preconceived notions about the “right” solution steer the data scientist in the wrong direction, where they start looking for proof. Objective data exploration doesn’t take place – instead the data is tweaked and squeezed until it reaches the conclusion that was already defined. A very important thing to do here is to define robust requirements from the very beginning and to collect evidence for conflicting hypotheses: the evidence that supports the hypothesis, the evidence that rejects it, and the evidence that does neither. The last one matters just as much – because of the “itch” to find a pattern or explanation (see more about it in the next item), the data scientist might miss the fact that there is simply not enough data to answer the question. That’s also fine, and maybe the question needs to be redefined.
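A small simulation shows how easy it is to find “confirming” evidence in pure noise – this sketch (random data, no real relationship anywhere) scans 1,000 unrelated series for the one that best correlates with the metric we want to explain:

```python
import numpy as np

rng = np.random.default_rng(42)
target = rng.normal(size=50)              # the metric we want to "explain"
candidates = rng.normal(size=(1000, 50))  # 1,000 completely unrelated series

# Correlate every candidate with the target and keep the strongest one.
corrs = np.array([np.corrcoef(target, c)[0, 1] for c in candidates])
best = np.abs(corrs).max()
print(f"strongest |correlation| found in pure noise: {best:.2f}")
# Typically around 0.4-0.5: search hard enough and the data will
# "confirm" almost any hypothesis you started with.
```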

Finding “patterns” – a.k.a. clustering illusion

The human brain is so good at identifying patterns that it starts seeing them where they don’t exist. This is a lethal trap for the data scientist. Many data scientists are hired to “find” patterns, so the more patterns they find, the better they are presumed to be at their job. This false success metric focuses a lot of work on the search for patterns, segments and “something peculiar”. Yet far more often than is normally expected, there’s just a lot of noise and everything’s normal (pun intended, but normality not assumed).

This leads to tricky situations where the business gets patterns that don’t exist, makes decisions on them, and eventually influences the actual population enough to force those patterns to emerge. Amazing. A very simple example – finding customer segments and trying to get customers to “convert” from one segment to another. When one “segment” is targeted and pushed towards another “segment”, the magic happens and there’s an actual impact. But this is very dangerous and can lead to many wrong and costly decisions.
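You can see the illusion in action with a few lines of code – a sketch assuming scikit-learn is available, run on data with no structure at all:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
noise = rng.uniform(size=(500, 2))   # pure uniform noise - no real segments

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(noise)
print(np.bincount(km.labels_))       # four tidy, similarly sized "segments"
# The algorithm always returns k clusters - it cannot tell you whether
# they reflect real structure or are just a partition of noise.
```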

Don’t be a data liar

This is definitely not an exhaustive list, and you should read about other cognitive biases that can affect your judgement and the quality of your insights. But these are very common traps that I have seen data scientists fall into and then unintentionally make up lies instead of searching for the truth. Objectivity is not an easily achievable goal, and it requires a lot of discipline. With all of this data out there, the role of the data scientist will only become more and more important.

The most successful data scientists will put enormous focus on staying aware of the biases they are prone to and the lies those biases can lead to.

Top mistakes data scientists make

The rise of the data scientist continues, and social media is filled with success stories – but what about those who fail? There are no cover articles examining the failures of the many data scientists who don’t live up to the hype and don’t meet the needs of their stakeholders.

The job of the data scientist is solving problems. And some data scientists can’t solve them. They either don’t know how to, or they are so obsessed with the technology part of the craft that they forget what the job is all about. Some get frustrated that “those business people” are asking them to do “simple trivial data tasks” while they’re working on something “really important and complex”. There are many ways a data scientist can fail – here’s a summary of the top three mistakes that form a straight path towards failure.

Mistake #1 – Less communication is better

What I have seen in great data scientists is that they are communicators first and data geeks second. A very common mistake data scientists make is avoiding business people at all costs, maintaining a minimal amount of interaction with them in order to get back to the “cool geek stuff”. Now, I really like the geeky part of the work, I do. That’s why I got into the field in the first place. But we are hired to solve problems, and without communication those problems won’t be solved.

Data scientists must follow up on the progress of their analysis and collect feedback from their peers all the time, especially when they don’t find anything peculiar – maybe that’s good news? Collecting feedback is important, but so is adjusting the analysis and assumptions based on it. This is the “science” in “data science” – the scientific method is founded on refining hypotheses based on new data. And the only way to collect and interpret new data is by communicating with the stakeholders who defined the hypothesis in the first place!

Mistake #2 – Delaying simple data requests from business teams

This is a golden one – simple data requests drive data scientists crazy (“it’s just 30 lines of SQL code, yuck!”). And this is where they fail. While the request might be very simple for a data scientist, the data might have only just become available, and it might solve a years-old problem. But the data scientist tends to think like an engineer (“trust me, I’m an engineer”) and tries to build scalable architectures that support long-term solutions. The business doesn’t care about architectures, scale or engineering – they care about insights, actionable insights. If you’re not providing them, you fail in their eyes. And, well, they do the sales, so their decisions matter. If you don’t help improve those decisions, you’re just a sunk cost, and finance theory has some pretty rough advice on how to deal with those. Don’t ignore the simple requests. First make sure a request supports a decision, and that the decision will improve the business once it has the data – and when it does, swallow your pride and run those trivial 30 lines of SQL. You’ll turn into a high-ROI unit instead of a sunk cost.

Mistake #3 – Preference for complex solution over easy one

A very costly mistake. A whole mantra has been built around the data scientist occupation. The depiction of data scientists as ultimate geniuses who can code, do math and statistics, and understand business better than most has done the field a big disservice. The expectation becomes a perverse one – data scientists come to think that they need to solve every problem by applying top-of-the-line statistical and computer science methods. Ultimately you get junior data scientists who think everything can be solved with deep learning and don’t know how to explore data, because the industry sold them the complexity obsession. Basic data exploration and visualization are the main tools of a data scientist, and you will spend most of your time exploring data. Not building machine learning models – unless you’re hired to do exclusively that. Not building back-end architectures that scale. Not writing a 10-page in-depth hypothesis-testing research paper for a simple business question – unless you’re hired for that or were specifically asked to do it. Your main role is discovering actionable insights and sharing them as recommendations with your stakeholders.

Don’t over-complicate an already overly complex field with too many superstitions. The most typical situation showcasing this mistake is when data scientists want to apply machine learning everywhere – for every use case, every project. This not only slows down the delivery of the desired output, but in many cases a machine learning model is not required at all! As I explained earlier, the core work of the data scientist is to solve problems, not to apply every shiny new tool that’s out there.
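More often than not, the “trivial” route answers the question. Here is a hypothetical sketch (pandas assumed; the file and column names are made up) of the kind of exploration that often beats a model:

```python
import pandas as pd

# Hypothetical orders extract: order_id, region, discount, revenue.
orders = pd.read_csv("orders.csv")

# A groupby and a sort frequently answer the business question
# before anyone mentions machine learning.
summary = (orders
           .groupby("region")
           .agg(orders=("order_id", "count"),
                revenue=("revenue", "sum"),
                avg_discount=("discount", "mean"))
           .sort_values("revenue", ascending=False))
print(summary)
```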


So how do I succeed as a data scientist?

As with every field, there are many ways to succeed and to fail – and many mistakes need to be made to understand which are which – but the fundamental lessons can be learned without trial and error. What matters most is being passionate about the problems and building solutions for your stakeholders instead of obsessing over tools and geeky stuff. Unless your role is an engineering one where you are not required to interact with other human beings, you will have to deal with human-to-human communication and run very simple – trivial, in your mind! – code that delivers an unattractive 3×3 data table. Sometimes the simple thing is better, and it’s all that is needed – “everything should be made as simple as possible, but not simpler”, as the saying often attributed to Albert Einstein goes.

How to stay out of analytic rabbit holes: avoiding investigation loops and their traps

“What if we add these variables?” is a deadly type of question that can ruin your analytic project. Now, while curiosity is a data scientist’s best friend, there’s a curse that comes with it – some call it analysis paralysis, others just over-analysis, but I call these situations “analytic rabbit holes”. As you start any data science project – be it in-depth statistical research, a machine learning model, or a simple business analysis – certain steps are always involved. Some sources make them more granular, some more general, but the following view makes the most sense from a real-world business perspective.

The process goes as follows: a data scientist defines a hypothesis, then explores the data and gains insights that help explain the hypothesis better. After this step the loop begins – the new information allows the data scientist to refine the hypothesis and start “digging deeper”, repeating the data exploration, the insight generation and… the re-refining of the hypothesis. This is where the loop starts, and it’s important to be conscious of it from the very beginning. Falling into an analytic rabbit hole starts here if one thing isn’t defined – the supported decision.

If the decision is not defined, or it’s not the main goal of the analytic investigation, the project will go down the drain and into the rabbit hole. Why? Because over-analysis begins when the data scientist starts focusing on the hypothesis instead of the decision. While the two might look very similar, in reality this makes the fundamental difference between a successful data science project and an “analytic rabbit hole”. Let me describe the two approaches and how one leads to success while the other is doomed to fail.

Hypothesis-focused. As the data exploration goes on, the hypothesis is constantly refined and new insights are discovered. The curse of this process is that, since the goal is to find the perfect answer to the hypothesis, the data scientist will fall for many traps, such as spurious correlations, where relationships between unrelated but correlated variables are “discovered”. Eventually the sheer breadth of ways of analyzing and cutting through the data has its side effect – the hypothesis breaks out into sub-segments, each with a series of data points, assumptions and conflicting conclusions of its own. A typical end for this project is a happy data scientist presenting these immense findings to a non-technical team that gets lost in the details before the second bullet point. A single question knocks the effort down: “can we do something about it?” That’s it. Weeks spent, and one question derails the whole effort.

Decision-focused. The focus of this exploration is to find ways to influence and improve a decision, and to test whether it moves the needle as soon as possible. Then and only then is the hypothesis refined. This doesn’t close the analytic loop, but it ensures that the data scientist’s focus is on discovering insights that improve the impact of the underlying decision. The focus is on how the project’s output affects the environment, and both the data scientist and the business can learn from the environment’s response to the data-refined actions. Hypothesis testing without any actual intervention that uses the generated insights is a perfect example of an analytic rabbit hole.
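One way to keep the loop honest is to make the supported decision an explicit precondition of the analysis. A schematic sketch – every function here is a placeholder stub, not a real API – of what the decision-focused loop looks like:

```python
MAX_ITERATIONS = 5  # a hard cap keeps the loop finite

# Placeholder stubs; in a real project these are the actual analysis steps.
def explore_data(hypothesis):        return {"signal": len(hypothesis)}
def can_improve_decision(ins, dec):  return ins["signal"] > 20
def refine(hypothesis, insights):    return hypothesis + " (refined)"
def recommend(insights, decision):   return f"Recommendation for: {decision}"

def run_analysis(hypothesis, decision):
    if decision is None:
        raise ValueError("No supported decision defined - rabbit hole ahead.")
    for _ in range(MAX_ITERATIONS):
        insights = explore_data(hypothesis)
        if can_improve_decision(insights, decision):
            return recommend(insights, decision)  # act, then learn from the response
        hypothesis = refine(hypothesis, insights)
    return "Not enough evidence to act - redefine the question."

print(run_analysis("discounts reduce churn", "set next quarter's discount rate"))
```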

So what?

While this may sound very trivial, the amount of time data scientists waste on hypothesis-focused projects is incredibly high. Left unchallenged, this hypothesis-focused philosophy can ruin careers, and it can destroy the trust placed in the data science department. And believe me – it’s very tempting to wake up your inner geek and fall into the analytic rabbit hole every time you are handed a very cool and interesting hypothesis.

A data scientist’s gut feeling says that the main task of the job is to answer complex questions and gain in-depth insights. In reality it’s all about solving problems – and the only way to solve a problem is to act on it. Our goal as data scientists is to support tough and complex decisions with actionable, data-based recommendations. We are the ultimate internal consultants who drive actions through insights. And action with some insights is always better than no action with all the insights that could ever be discovered. So never forget to ask yourself the question – “what is the decision that this analysis supports?” It might save the project, and maybe even your career as a data scientist.

What makes a great data scientist?

“Data scientist” is an umbrella term for people whose main responsibility is leveraging data to help other people (or machines) make more informed decisions. The spectrum of data scientist roles is so broad that I will keep that discussion for my next post. What I really want to focus on here is the distinctive characteristics of a great data scientist.

Over the years I have worked with data and analytics, I have found that this has almost nothing to do with technical skills. Yes, you read that right. Technical knowledge is a must-have if you want to get hired, but that’s just the absolute minimum requirement. The features that make someone a great data scientist are mostly non-technical. So what are the three key things that distinguish a great data scientist?

1. A great data scientist is obsessed with solving problems, not with new tools.

This one is so fundamental, it is hard to believe it’s so simple. Every occupation has this curse – people tend to focus on tools and processes or, more generally, emphasize form over content. A very good example is the ongoing discussion about whether R or Python is better for data science and which one will win the beauty contest. Or another one – frequentist vs. Bayesian statistics and why one will become obsolete. Or my favorite – SQL is dead, all data will be stored in NoSQL databases.

These are just instruments used to solve problems. The American philosopher Abraham Kaplan coined a concept called the law of the instrument, which he described as follows: “I call it the law of the instrument, and it may be formulated as follows: Give a small boy a hammer, and he will find that everything he encounters needs pounding.” It was popularized by the psychologist Abraham Maslow in the famous phrase “if all you have is a hammer, everything looks like a nail”.

The core function of any data-driven role is solving problems by extracting knowledge from data. A great data scientist first strives to understand the problem at hand, then defines the requirements for its solution, and only then decides which tools and techniques are the best fit for the task. In most business cases, the stakeholders you interact with do not care about the tools – they only care about answering tough questions and solving problems. Knowing how to select, use and learn tools and techniques is a minimum requirement for becoming a data scientist. A great data scientist knows that understanding the underpinnings of the business case is paramount to the success of a data science project.

2. A great data scientist wants to find the solution and knows it won’t be perfect.

A very dangerous state for any data scientist is being stuck in an infinite loop of analytic iterations – drilling in, finding insights, zooming out, looking at the macro level, redefining the hypothesis, zooming in again, looking at the most granular details, then re-thinking, round and round. This is called analysis paralysis – essentially over-thinking the process by trying to find the “perfect” solution.

A great data scientist understands that there’s almost never a perfect solution, and that a simple, imperfect solution delivered on time is much better than a hypothetically perfect one delivered late. In fact, the Agile software development methodology seeks to prevent analysis paralysis through adaptive evolutionary planning, early delivery and continuous improvement. The mindset of a great data scientist works the same way – they think about solving their stakeholders’ problems and know that solutions need to be redefined as new insights are uncovered.

The main piece of advice here – don’t overthink and over-analyze the problem. Instead, break your analysis or modelling process into stages and get feedback from the problem owners at each one. This way you ensure that the learning process is continuous and that it improves the decisions with each iteration.

3. A great data scientist is the ultimate communicator.

As you can see, there’s a lot of communication involved in understanding the problem and delivering constant feedback to the stakeholders. But this is just the surface of the importance of communication – a much more important element is asking the right questions. Sounds easy, right? It’s not, actually. Data scientists are more likely than almost any other occupation to fall into the trap of the curse-of-knowledge cognitive bias. This bias “occurs when an individual, communicating with other individuals, unknowingly assumes that the others have the background to understand”.

When the data scientist is scoping out a problem together with the stakeholders or presenting the first findings, it is vital to be as explicit and detailed as possible and not assume that stakeholders know as much as you do. This is very hard, as the assumptions and underlying methodologies a data scientist relies on can be counted in dozens, even hundreds.

The biggest risk is when the stakeholder briefly describes the problem and the data scientist doesn’t ask enough questions, assuming they know what the problem is. The data scientist then builds a solution that seems to solve the described problem. The lack of questions and the pile of assumptions result in a final solution that actually solves a different problem than the original one – and sometimes gives the opposite recommendation.

Great data scientists never assume they know something without in-depth analysis; they think in hypotheses that need to be either rejected or confirmed, and they ask a lot of questions, even when they are 99.9% sure they know the answer.

Wait – what about programming, statistics, math, hacking?

The fact of the matter is that you must have the technical skills and a strong foundation to be hired as a data scientist – you can read more about the basic requirements in my previous blog post, How to think like a data scientist to become one. This is the bare minimum expected and required of you.

But to become a truly great data scientist you have to be an ultimate problem solver who is obsessed with understanding the ins and outs of the business case they’re handed.

How to think like a data scientist to become one

We have all read the punchlines – data scientist is the sexiest job, there aren’t enough of them, and the salaries are very high. The role has been sold so well that the number of data science courses and college programs is growing like crazy. After my previous blog post I received questions from people asking how to become a data scientist – which courses are the best, what steps to take, what is the fastest way to land a data science job?

I tried to really think it through, and I reflected on my personal experience – how did I get here? How did I become a data scientist? Am I a data scientist? My path has been very mixed – I started out as a securities analyst in an investment house, using mainly Excel, then slowly shifted towards business intelligence in the banking industry and later in consulting, eventually doing the actual so-called “data science” – building predictive models, working with Big Data, crunching tons of numbers and writing code for data analysis and machine learning – though in the earlier days it was called “data mining”.

When the data science hype started, I tried to understand how it was different from what I had been doing so far – maybe I should learn new skills and become a data scientist instead of someone working “in analytics”?

Like everybody obsessed with it, I started taking multiple courses, reading data books, doing data science specializations (and not finishing all of them…) and coding a lot – I wanted to become THE one in the middle cross-section of the (in)famous data science Venn diagram. What I did learn is that these unicorns (yes, the people in the middle “Data Science” bucket are called unicorns) rarely exist, and even when they do, they are typically generalists who have knowledge in all of these areas but are masters of none.

Although I now consider myself a data scientist – I lead a fantastically talented data science team at Amazon, build machine learning models and work with “Big Data” – I still think there’s too much chaos around the craft and far too little clarity, especially for people new to the industry or trying to get in. Don’t get me wrong – there are a lot of very complex branches of data science – AI, robotics, computer vision, voice recognition and so on – which require very deep technical and mathematical expertise, and potentially a PhD… or two. But if you are interested in getting into the kind of data science role that was called a business or data analyst just a few years ago, here are the four rules that have helped me survive in the data science world.

Rule 1 – Get your priorities and motivations straight. Be very realistic about what skills you have right now and where you want to arrive – there are so many different roles in data science that it’s important to understand and assess your current knowledge base. Let’s say you’re working in HR and want to change careers – learn about HR analytics! If you’re a lawyer – understand the data applications in the legal industry. The hunger for insights is so big that all industries and business functions have started acting on it. If you already have a job, try to understand what could be optimized or solved using data, and learn how to do it yourself. It will be a gradual and long shift, but you will still have a job and will learn by doing it in the real world. If you are a recent graduate or a student, you have a perfect chance to figure out what you are passionate about – maybe movies, maybe music, or maybe cars? You wouldn’t believe the number of data scientists these industries employ – and they are all crazy about the fields they’re working in.

Rule 2 – Learn the basics very well. Although the specifics of each data science field are very different, the basics are the same. There are three areas where you should develop very strong foundations – basic data analysis, introductory statistics and coding.

Data analysis. You should understand and practice (a lot!) the basic data analysis techniques – what a data table is, how to join tables, what the main techniques are for analyzing data organized this way, how to build summary views of your dataset and draw initial conclusions from it, what exploratory data analysis is, and which visualizations can help you understand and learn from data. This is very basic, but believe me – master this and you’ll have the fundamental skill that is absolutely mandatory for the job.
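For a taste of what “the basics” means in practice, here is a minimal sketch (pandas assumed; the data is made up) of joining two tables and building a summary view:

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "country": ["LT", "US", "US"]})
orders = pd.DataFrame({"order_id": [10, 11, 12, 13],
                       "customer_id": [1, 1, 2, 3],
                       "amount": [20.0, 35.0, 15.0, 50.0]})

# Join the two tables, then summarize order value per country.
joined = orders.merge(customers, on="customer_id", how="left")
summary = joined.groupby("country")["amount"].agg(["count", "sum", "median"])
print(summary)
```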

Statistics. Also get a very good grasp of introductory statistics – what the mean and the median are, when to use one over the other, what standard deviation is and when it doesn’t make any sense to use it, and why averages “lie” yet remain the most used aggregate value everywhere. And when I say “introductory”, I really mean “introductory”. Unless you are a mathematician planning to become an econometrician who applies advanced statistical and econometric models to explain complex phenomena – then yes, learn advanced statistics. If you don’t have a PhD in mathematics, take your time, be patient, and get a really good grasp of basic statistics and probability.
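As a quick illustration of when the standard deviation stops making sense, here is a sketch on made-up, heavily skewed data:

```python
import numpy as np

# Hypothetical session durations in seconds - heavily right-skewed.
rng = np.random.default_rng(1)
durations = rng.lognormal(mean=3.0, sigma=1.2, size=10_000)

mean, std = durations.mean(), durations.std()
print(f"mean ± std: {mean:.0f} ± {std:.0f}")  # the lower end dips below zero,
                                              # meaningless for a duration
print(f"median: {np.median(durations):.0f}")
inside = ((durations > mean - std) & (durations < mean + std)).mean()
print(f"share within mean ± std: {inside:.0%}")  # far from the ~68% a
                                                 # normal distribution would give
```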

Coding. And of course – learn how to code. This is the most over-used cliché of advice, but it’s actually sound. Start by learning how to query a database with SQL – believe it or not, most of the time data science teams spend goes on pulling and preparing data, and a lot of that is done with SQL. So get your basics in place – build your own small database, write some “select * from my_table” lines and get a good grasp of the SQL fundamentals. You should also learn one (start with just one) data analysis language – be it R or Python, both are great – it does make a difference, and many positions require it, although not all. First learn the basics of the language you chose, with a focus on how to do data analysis with it. You don’t have to become a programmer to succeed in the field – it’s all about knowing how to use the language to analyze and visualize data.
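You don’t need any infrastructure to practice this – for instance, Python’s built-in sqlite3 module gives you a small throwaway database (the table and values below are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # an in-memory database, gone when you close it
con.execute("CREATE TABLE my_table (id INTEGER, product TEXT, price REAL)")
con.executemany("INSERT INTO my_table VALUES (?, ?, ?)",
                [(1, "book", 12.5), (2, "pen", 1.2), (3, "lamp", 30.0)])

# The classic first query, then a small aggregation.
print(con.execute("SELECT * FROM my_table").fetchall())
print(con.execute("SELECT COUNT(*), AVG(price) FROM my_table").fetchone())
con.close()
```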

Rule 3 – Data science is about solving problems – find one and solve it. The thing I have learned over the years is that one of the fundamental requirements for a data scientist is to be always asking questions and looking for problems. I don’t advise doing it 24/7, as you will definitely go insane, but be prepared to be the problem solver who is looking for them non-stop. Start small and find areas of your own life that could benefit from some analysis – you will be amazed how much data is available out there. Maybe you will analyze your spending patterns, identify sentiment patterns in your emails, or just build nice charts to track your city’s finances. The data scientist is responsible for questioning everything – is this marketing campaign effective, are there any concerning trends in the business, do some products under-perform and should they be taken off the market, does the discount the company gives make sense or is it too big? These questions become hypotheses that are then validated or rejected by the data scientist. Hypotheses are the raw material of a data scientist: the more of them you solve and explain, the better you’ll be at your job.
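To show what validating one such hypothesis can look like, here is a sketch (scipy assumed; the conversion numbers are made up) that checks whether a marketing campaign moved conversions:

```python
from scipy.stats import fisher_exact

# Hypothetical results: [converted, not converted] per group.
campaign_group = [120, 880]  # visitors who saw the campaign
control_group  = [ 90, 910]  # visitors who did not

_, p_value = fisher_exact([campaign_group, control_group])
print(f"p-value: {p_value:.3f}")
# A small p-value (say < 0.05) suggests the campaign really moved
# conversions; a large one means the data cannot reject "no effect".
```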

Rule 4 – Start doing instead of planning what you will do “when”. This applies to any kind of learning, but it’s especially true in data science. Be sure you start “doing” from the very first day you start learning. It’s very easy to put off the actual learning by just reading “about” data science and how it “should” be done, copy-pasting data analysis code from a book and running it on very simple datasets you will never ever get in the real world.

With everything you learn, be sure you start applying it to the field you’re passionate about. That’s where the magic happens – writing your first line of code and seeing it fail, being stuck and not knowing what to do next, looking for the answer, finding a lot of different solutions none of which work, struggling to build your own, and finally passing a milestone – the “aha!” moment. This is how you will actually learn. Learning by doing is the only way to learn data science – you don’t learn how to ride a bike by reading about it, right? The same rule applies here – whatever you learn, be sure you apply it immediately and solve actual problems with real data.

“If you spend too much time thinking about a thing, you’ll never get it done.” – this quote from one of the most famous martial artists, Bruce Lee, captures the essence of this post. You have to apply what you learn and make sure you make your own mistakes – this is the only way you will learn and improve. And if Bruce Lee doesn’t convince you, maybe Shia LaBeouf will.

Thanks for reading! Subscribe to my blog www.cyborgus.com to get the latest updates. You can also follow me on social networks:

Follow my blog updates on Facebook – https://www.facebook.com/cyborguscom/

Look me up on LinkedIn – https://www.linkedin.com/in/karolisurbonas/

Hello, World!

My name is Karolis and I love everything about data – machine learning, those insightful “aha” moments, AI, automating the boring stuff, and every engineering puzzle I have to solve in my job.

I have been working with data and its applications for more than 10 years now. Currently I lead a brilliant data science team at Amazon as the Head of Business Intelligence and Data Science for Amazon Devices.

The goal of this blog is to share practical, applied methods for building data-driven applications, analyses and data products in the real world – a “survival tutorial for data science practitioners” of sorts.

If you have any suggestions or ideas or just want to chat – please feel free to reach out to me via LinkedIn or Facebook – the links are on the top right.

I hope you will enjoy reading my blog!