Practical data science tips & tricks – writing wrapper functions for dplyr

This is the first post in a practical data science series of tips and tricks from my experience. I have been doing data science for over 10 years, and along the way I have picked up many practical hacks. Most of them are pretty simple, but they can really improve your productivity when writing code or doing analysis. This one is about using dplyr to explore data distributions: aggregating and grouping the data to understand what it tells us.

Dplyr is a great data exploration tool. It has become an irreplaceable library for any R user in the business.

But when it comes to data exploration there’s actually a lot of redundant copy-pasting a data scientist has to do with dplyr.

Let’s take a look at an example. We’re going to use the Bank Marketing Data set from the UCI ML repository – https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip

Let’s read in the bank-additional-full.csv file.

library(dplyr)  # provides %>%, group_by() and summarise() used below
bank <- read.csv('bank-additional-full.csv', header = TRUE, sep = ';')  # semicolon-separated file
head(bank)


Let’s try some simple group_by and summarise calls to get a sense of how the quantitative variables differ across the levels of the qualitative ones. Grouping, summarizing, drawing conclusions, regrouping, adding more data, bucketing quantitative variables into qualitative groups, and then running this loop all over again: this is what the exploratory data analysis phase is all about.

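# Each statistic gets its own output name. summarise() evaluates its
# arguments in order, so naming the mean `duration` would make the later
# median, min and max calls operate on the mean rather than the raw column.
bank %>% group_by(job) %>% 
  summarise(duration_mean = mean(duration),
            age_mean = mean(age),
            pdays_mean = mean(pdays),
            duration_median = median(duration),
            age_median = median(age),
            pdays_median = median(pdays),
            duration_min = min(duration),
            age_min = min(age),
            pdays_min = min(pdays),
            duration_max = max(duration),
            age_max = max(age),
            pdays_max = max(pdays)
  )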

The code above gives us summary statistics for several quantitative variables, grouped by the job variable. What if we wanted to group by a different variable? Copy-paste and change the variable in group_by():

# Same summary, grouped by education. The means get their own *_mean names
# so the later median, min and max calls still see the raw columns.
bank %>% group_by(education) %>% 
  summarise(duration_mean = mean(duration),
            age_mean = mean(age),
            pdays_mean = mean(pdays),
            duration_median = median(duration),
            age_median = median(age),
            pdays_median = median(pdays),
            duration_min = min(duration),
            age_min = min(age),
            pdays_min = min(pdays),
            duration_max = max(duration),
            age_max = max(age),
            pdays_max = max(pdays)
  )
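One detail worth keeping in mind with long summarise() calls like these: dplyr evaluates the arguments in order, and a later expression sees any column created by an earlier one. Here is a minimal sketch of what that means in practice. The toy data frame and its column names g and v are made up for illustration and have nothing to do with the bank data.

```r
library(dplyr)

# Toy data, invented for this illustration
toy <- data.frame(g = c("a", "a", "a"), v = c(1, 2, 9))

# Reusing the input name: `v` is replaced by its mean first, so median(v)
# is then computed on that single mean value, not on the original values.
reused <- toy %>%
  group_by(g) %>%
  summarise(v = mean(v), v_median = median(v))

# Distinct output names leave the input column untouched for later expressions.
distinct_names <- toy %>%
  group_by(g) %>%
  summarise(v_mean = mean(v), v_median = median(v))

print(reused)
print(distinct_names)
```

In the first result the "median" is just the mean again, while the second result reports the true median of the group. Giving every statistic its own output name avoids the problem.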

But imagine copy-pasting this block tens or even hundreds of times: the code becomes very long and repetitive. Rather than repeating ourselves, we will write our own wrapper function around the dplyr calls.

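bank_fx <- function(x, type) {
  # x:    name of the grouping column, passed as a string
  # type: which statistics to return ('all', 'mean', 'median', 'min' or 'max')
  # group_by(across(all_of(x))) groups by a column name held in a string;
  # it replaces the deprecated group_by_() and needs dplyr 1.0.0 or later.
  if(type == 'all') {
    bank %>% 
      group_by(across(all_of(x))) %>% 
      # distinct output names, so later expressions still see the raw columns
      summarise(duration_mean = mean(duration),
                age_mean = mean(age),
                pdays_mean = mean(pdays),
                duration_median = median(duration),
                age_median = median(age),
                pdays_median = median(pdays),
                duration_min = min(duration),
                age_min = min(age),
                pdays_min = min(pdays),
                duration_max = max(duration),
                age_max = max(age),
                pdays_max = max(pdays)
      )
  } else if(type == 'mean') {
    bank %>% 
      group_by(across(all_of(x))) %>% 
      summarise(duration_mean = mean(duration),
                age_mean = mean(age),
                pdays_mean = mean(pdays)
      )
  } else if(type == 'median') {
    bank %>% 
      group_by(across(all_of(x))) %>% 
      summarise(duration_median = median(duration),
                age_median = median(age),
                pdays_median = median(pdays)
      )
  } else if(type == 'min') {
    bank %>% 
      group_by(across(all_of(x))) %>% 
      summarise(duration_min = min(duration),
                age_min = min(age),
                pdays_min = min(pdays)
      )
  } else if(type == 'max') {
    bank %>% 
      group_by(across(all_of(x))) %>% 
      summarise(duration_max = max(duration),
                age_max = max(age),
                pdays_max = max(pdays)
      )
  } else {
    # fail loudly instead of printing a generic message and carrying on
    stop("Unknown type: ", type)
  }
}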

What does it do? It takes two arguments: x, the grouping variable (a column name passed as a string), and type, the aggregation to apply ('all', 'mean', 'median', 'min' or 'max').
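Because type is a free-form string, a typo is only caught once the function has fallen through every branch. Here is a minimal sketch of checking it up front with base R's match.arg(). The helper name check_type is hypothetical and not part of the function above.

```r
# Hypothetical helper: validate the aggregation type before doing any work.
check_type <- function(type = c("all", "mean", "median", "min", "max")) {
  # match.arg() returns the matching choice, or raises an error that
  # lists the valid options when nothing matches
  match.arg(type)
}

ok <- check_type("min")                        # a valid choice is returned as-is
bad <- try(check_type("meen"), silent = TRUE)  # a typo raises an error
```

One thing to be aware of: match.arg() also accepts unambiguous abbreviations, so "med" would resolve to "median".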

Let’s see how much space we save with it:

bank_fx('job','min')
bank_fx('education','all')
bank_fx('marital','mean')

These three lines give you more than the long, repetitive blocks at the beginning of this post. The function is reusable and very easy to adapt to your own use cases. It has saved me a lot of time in many different exploratory projects in the past.

You can add further if-else branches to cover different variables and layers of aggregation. Investing time in writing a function like this, either before you start your analysis or early on once you have spotted what will be repeated over and over, is an absolute revelation and a real time saver.
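If the list of variables or statistics keeps growing, one possible way to avoid adding a branch for every combination is to pass them in as arguments. The sketch below is only an illustration of that idea. The function name summarise_by and its arguments are my own assumptions, it relies on across() from dplyr 1.0.0 or later, and it uses the built-in mtcars data so it runs without any download.

```r
library(dplyr)

# Hypothetical generalisation: group by any column (given as a string) and
# apply a named list of summary functions to any set of numeric columns.
summarise_by <- function(data, group_var, cols,
                         fns = list(mean = mean, median = median,
                                    min = min, max = max)) {
  data %>%
    group_by(across(all_of(group_var))) %>%
    # one output column per (column, function) pair, e.g. mpg_mean, hp_max
    summarise(across(all_of(cols), fns, .names = "{.col}_{.fn}"),
              .groups = "drop")
}

# Example on built-in data: summarise mpg and hp for each number of cylinders
by_cyl <- summarise_by(mtcars, "cyl", c("mpg", "hp"))
print(by_cyl)
```

On the bank data the equivalent call would be summarise_by(bank, "job", c("duration", "age", "pdays")), and passing fns = list(min = min) would cover the 'min' case only.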

Next in the series: the iterative data analysis loop and how to ensure you don’t get lost in it. Subscribe to get instant updates on new posts and my upcoming courses.

Data description – https://archive.ics.uci.edu/ml/datasets/Bank+Marketing 
