Introduction

This project is the capstone assignment for the Google Data Analytics Professional Certificate program. The program prepares participants for a career in data analytics with training focused on key analytical skills (data cleaning, analysis, and visualization) and tools (Excel, SQL, R Programming, Tableau).

We’ll be investigating data sets of FitBit usage to make high-level marketing recommendations for Bellabeat—a high-tech company that manufactures health-focused smart products.

Summary

Bellabeat was founded in 2013 and has grown to become a tech-driven wellness company for women. Since 2016, they have opened offices around the globe and launched multiple products. Their apps and devices collect data on activity, sleep, stress, and reproductive health, allowing Bellabeat to empower women with knowledge about their own health and habits.

Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals. Their products include:

Guiding Questions

The guiding questions for this case study are as follows:

  • What are some trends in smart device usage?
  • How could these trends apply to Bellabeat customers?
  • How could these trends help influence Bellabeat marketing strategy?

Tools and Techniques

This analysis uses the following tools and techniques:

  • R programming language and libraries: ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr, lubridate and forcats
  • Data transformations: joins, visualizations, summary statistics
  • Data inspection: removal of duplicate/unnecessary data, change format/datatype, verify unique values

Recommendations

Based on the analysis, the following are three recommendations (detailed below):

  • Double the Points Earned on Sundays
  • Reminder Notifications Following Active Days
  • Encourage Rest Days

Double the Points Earned on Sundays

Based on the observations from the data, the largest trend that stands out is the increase in activity on Tuesday, Wednesday, and Thursday. Participants were most active on these days (both in terms of step count and calories burned). These also coincided with the days participants got the most sleep and spent the most time in bed. Further, we also know that Sunday is the least active day.
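A weekday comparison like the one described above could be computed along the following lines. This is a minimal sketch on made-up data, not the code behind the reported figures; the `toy_activity` data frame, its values, and the month/day/year text format are illustrative assumptions.

```r
# Toy stand-in for daily_activity; the values are invented for illustration.
toy_activity <- data.frame(
  ActivityDate = c("4/12/2016", "4/13/2016", "4/17/2016", "4/19/2016", "4/24/2016"),
  TotalSteps   = c(9000, 11000, 4000, 10000, 6000)
)

# Assumed date format: month/day/year stored as text.
dates <- as.Date(toy_activity$ActivityDate, format = "%m/%d/%Y")

# "%u" gives the ISO weekday number (1 = Monday, ..., 7 = Sunday),
# which avoids locale-dependent weekday names.
day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")
toy_activity$Weekday <- factor(day_names[as.integer(format(dates, "%u"))],
                               levels = day_names)

# Mean steps per weekday, sorted from most to least active.
# Weekdays with no rows in the toy data are dropped.
mean_steps <- tapply(toy_activity$TotalSteps, toy_activity$Weekday, mean)
mean_steps <- sort(mean_steps[!is.na(mean_steps)], decreasing = TRUE)
print(mean_steps)
```

The same grouping could be applied to calories or sleep minutes by swapping the column that is averaged.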

To help Bellabeat customers reach their wellness goals, we recommend that the product team implement the ability to add reminders when customers have lower levels of activity. In particular, since Sunday is the least active day (and Friday and Saturday to some degree), the product team could let customers increase their wellness score by more points when they are active on Sunday. For example, activities on Sundays could count for 1.5 or 2 times the regular number of points when calculating the Wellness Score. This is intended to incentivize customers to be more active on this day.
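The multiplier idea can be illustrated with a small calculation. The sketch below is purely hypothetical: the `bonus_points` function, its parameters, and the point values are invented for illustration and do not reflect how the Bellabeat app calculates its Wellness Score.

```r
# Hypothetical scoring helper: scales activity points earned on chosen days.
# Neither the function nor the point values come from the Bellabeat app.
bonus_points <- function(points, weekday, bonus_days = "Sunday", multiplier = 2) {
  ifelse(weekday %in% bonus_days, points * multiplier, points)
}

base_points <- c(10, 10, 10)
weekdays_in <- c("Saturday", "Sunday", "Monday")

# Double points on Sunday only.
doubled <- bonus_points(base_points, weekdays_in)

# 1.5 times the points on Sunday only.
one_and_half <- bonus_points(base_points, weekdays_in, multiplier = 1.5)

print(doubled)
print(one_and_half)
```

Passing a vector such as `c("Friday", "Saturday", "Sunday")` as `bonus_days` would extend the bonus to the other less active days.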

Reminder Notifications Following Active Days

We know that Tuesday, Wednesday, and Thursday are the most active days. The product team could implement notifications on the following days. These notifications could include reminders about how active the individual has been the past day (or previous days) and offer encouragement to keep staying active.

Encourage Rest Days

The Wellness Score in the Bellabeat app is a number calculated from several different factors, including activity. We know from the data that Sunday is the least active day. Perhaps, instead of encouraging individuals to be active on this day, rest should be encouraged. Bellabeat takes a holistic approach to wellness, and while it is important to be active when working toward fitness goals, rest is also an important factor. In this case, perhaps activities on Sundays (or on another rest day specified by the customer) would count for slightly less, so that the overall Wellness Score isn’t adversely affected.

Continue reading for the full details of this analysis that led to these recommendations.

Guiding Questions

We want to analyze smart device usage data to gain insight into how people already use their smart devices and to make high-level recommendations for how these trends can inform Bellabeat marketing strategy. We’ll focus on the three guiding questions listed in the Summary above.

Prepare Data

Data Description

The data set contains FitBit Fitness Tracker data from 30 individuals over a month-long period from April 12, 2016 to May 12, 2016. It includes metrics on daily activities, calories, intensities, steps, heart rate, sleep, weight, and METs.

License

The data is available as a public data set by Mobius on Kaggle. It was generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016. This data is available for use under the CC0: Public Domain License.

Limitations

The data was collected from only 30 FitBit users, a small sample, so any trends found might not hold for larger groups. Further, the data set contains no demographic information, so the gender, ethnicity, age, etc. of these users are unknown. Bellabeat focuses on smart devices specifically for women, and this data might not match the demographics of Bellabeat users.
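Because the sample size determines how far the findings can be generalized, it may be worth confirming the participant count directly from each table rather than relying on the data set description alone. A minimal sketch, using an invented `toy_log` data frame in place of the real tables:

```r
# Toy stand-in for a tracker table; the Ids and values are invented.
toy_log <- data.frame(
  Id    = c(101, 101, 102, 103, 103, 103),
  Steps = c(5000, 7000, 3000, 8000, 9000, 4000)
)

# Number of distinct participants.
n_participants <- length(unique(toy_log$Id))

# Number of logged rows per participant, to spot sparse contributors.
rows_per_participant <- table(toy_log$Id)

print(n_participants)
print(rows_per_participant)
```

Running the same two checks on each table would show whether every metric is backed by the same set of users.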

Process Data

Install/Open Libraries

library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("dplyr")
library("ggplot2")
library("lubridate")
library("corrplot")
## corrplot 0.92 loaded
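The `library()` calls above assume the packages are already installed. A minimal sketch of how missing packages could be detected first; the variable names `needed`, `is_installed`, and `missing_pkgs` are illustrative, and the install step is left commented out:

```r
# Packages this analysis loads.
needed <- c("tidyverse", "corrplot")

# requireNamespace() returns FALSE (rather than raising an error)
# when a package is not installed.
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
}
```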

Set up Data Frames

daily_activity <- read.csv("datasets/dailyActivity_merged.csv")
daily_calories <- read.csv("datasets/dailyCalories_merged.csv")
hourly_calories <- read.csv("datasets/hourlyCalories_merged.csv")
daily_steps <- read.csv("datasets/dailySteps_merged.csv")
hourly_steps <- read.csv("datasets/hourlySteps_merged.csv")
sleep_day <- read.csv("datasets/sleepDay_merged.csv")
weight <- read.csv("datasets/weightLogInfo_merged.csv")
hr <- read.csv("datasets/heartrate_seconds_merged.csv")

Inspect Data Frames

For the FitBit Fitness Tracker data set we’ll focus on the following metrics for our analysis: daily activity, calories, steps, sleep, weight, and heart rate.

Daily Activity

head(daily_activity)
colnames(daily_activity)
##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"
sapply(daily_activity, class)
##                       Id             ActivityDate               TotalSteps 
##                "numeric"              "character"                "integer" 
##            TotalDistance          TrackerDistance LoggedActivitiesDistance 
##                "numeric"                "numeric"                "numeric" 
##       VeryActiveDistance ModeratelyActiveDistance      LightActiveDistance 
##                "numeric"                "numeric"                "numeric" 
##  SedentaryActiveDistance        VeryActiveMinutes      FairlyActiveMinutes 
##                "numeric"                "integer"                "integer" 
##     LightlyActiveMinutes         SedentaryMinutes                 Calories 
##                "integer"                "integer"                "integer"
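The class listing shows that `ActivityDate` is stored as character, so it would need converting before any date-based grouping. A minimal sketch of that conversion; the month/day/year text format and the sample values are assumptions, since the raw rows are not shown here:

```r
# Toy stand-in for daily_activity$ActivityDate; month/day/year text is an
# assumption about how the dates are stored.
activity_date_chr <- c("4/12/2016", "4/13/2016", "5/12/2016")

# Convert to Date; entries that do not match the format become NA.
activity_date <- as.Date(activity_date_chr, format = "%m/%d/%Y")

class(activity_date)   # class after conversion
range(activity_date)   # first and last day covered
anyNA(activity_date)   # TRUE would signal a format mismatch
```

Checking `anyNA()` after the conversion is a cheap way to catch a wrong format string before it silently drops rows from later summaries.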

Summary Statistics

daily_activity %>%
  select(TotalSteps, TotalDistance, Calories, SedentaryMinutes) %>%
  summary()
##    TotalSteps    TotalDistance       Calories    SedentaryMinutes
##  Min.   :    0   Min.   : 0.000   Min.   :   0   Min.   :   0.0  
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.:1828   1st Qu.: 729.8  
##  Median : 7406   Median : 5.245   Median :2134   Median :1057.5  
##  Mean   : 7638   Mean   : 5.490   Mean   :2304   Mean   : 991.2  
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:2793   3rd Qu.:1229.5  
##  Max.   :36019   Max.   :28.030   Max.   :4900   Max.   :1440.0
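The minimum of 0 for steps and calories suggests that some rows may be days on which the device was not worn; that is an interpretation, not something the data confirms. A minimal sketch of how such rows could be counted, using an invented `toy_activity` data frame:

```r
# Toy stand-in for daily_activity; the values are invented for illustration.
toy_activity <- data.frame(
  TotalSteps       = c(0, 8500, 12000, 0, 3000),
  SedentaryMinutes = c(1440, 700, 650, 1440, 900)
)

# Rows with no steps and a full day (1440 minutes) of sedentary time.
likely_not_worn <- toy_activity$TotalSteps == 0 &
  toy_activity$SedentaryMinutes == 1440

n_not_worn     <- sum(likely_not_worn)   # number of such rows
share_not_worn <- mean(likely_not_worn)  # share of all rows

print(n_not_worn)
print(share_not_worn)
```

Whether to exclude such rows is a judgment call; keeping them pulls the daily averages down.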

Calories

Daily (Calories)

head(daily_calories)
colnames(daily_calories)
## [1] "Id"          "ActivityDay" "Calories"
sapply(daily_calories, class)
##          Id ActivityDay    Calories 
##   "numeric" "character"   "integer"

Summary Statistics

daily_calories %>%
  select(Calories) %>%
  summary()
##     Calories   
##  Min.   :   0  
##  1st Qu.:1828  
##  Median :2134  
##  Mean   :2304  
##  3rd Qu.:2793  
##  Max.   :4900

Hourly (Calories)

head(hourly_calories)
colnames(hourly_calories)
## [1] "Id"           "ActivityHour" "Calories"
sapply(hourly_calories, class)
##           Id ActivityHour     Calories 
##    "numeric"  "character"    "integer"

Summary Statistics

hourly_calories %>%
  select(Calories) %>%
  summary()
##     Calories     
##  Min.   : 42.00  
##  1st Qu.: 63.00  
##  Median : 83.00  
##  Mean   : 97.39  
##  3rd Qu.:108.00  
##  Max.   :948.00

Steps

Daily (Steps)

head(daily_steps)
colnames(daily_steps)
## [1] "Id"          "ActivityDay" "StepTotal"
sapply(daily_steps, class)
##          Id ActivityDay   StepTotal 
##   "numeric" "character"   "integer"

Summary Statistics

daily_steps %>%
  select(StepTotal) %>%
  summary()
##    StepTotal    
##  Min.   :    0  
##  1st Qu.: 3790  
##  Median : 7406  
##  Mean   : 7638  
##  3rd Qu.:10727  
##  Max.   :36019

Hourly (Steps)

head(hourly_steps)
colnames(hourly_steps)
## [1] "Id"           "ActivityHour" "StepTotal"
sapply(hourly_steps, class)
##           Id ActivityHour    StepTotal 
##    "numeric"  "character"    "integer"

Summary Statistics

hourly_steps %>%
  select(StepTotal) %>%
  summary()
##    StepTotal      
##  Min.   :    0.0  
##  1st Qu.:    0.0  
##  Median :   40.0  
##  Mean   :  320.2  
##  3rd Qu.:  357.0  
##  Max.   :10554.0
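One way to cross-check the hourly and daily tables is to sum the hourly steps per participant and day and compare the result with the daily totals. A minimal sketch on invented data; the `toy_hourly` data frame and the "month/day/year time" text format of `ActivityHour` are assumptions:

```r
# Toy stand-in for hourly_steps; the text format and values are assumptions.
toy_hourly <- data.frame(
  Id           = c(1, 1, 1, 2, 2),
  ActivityHour = c("4/12/2016 8:00:00 AM", "4/12/2016 9:00:00 AM",
                   "4/13/2016 8:00:00 AM", "4/12/2016 7:00:00 PM",
                   "4/12/2016 8:00:00 PM"),
  StepTotal    = c(500, 1200, 300, 2000, 100)
)

# Keep only the date part (everything before the first space).
toy_hourly$Day <- as.Date(sub(" .*$", "", toy_hourly$ActivityHour),
                          format = "%m/%d/%Y")

# Sum the hourly steps for each participant and day.
daily_from_hourly <- aggregate(StepTotal ~ Id + Day, data = toy_hourly, FUN = sum)
print(daily_from_hourly)
```

Extracting only the date part sidesteps AM/PM parsing, which can depend on the system locale.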

Daily Sleep

head(sleep_day)
colnames(sleep_day)
## [1] "Id"                 "SleepDay"           "TotalSleepRecords" 
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
sapply(sleep_day, class)
##                 Id           SleepDay  TotalSleepRecords TotalMinutesAsleep 
##          "numeric"        "character"          "integer"          "integer" 
##     TotalTimeInBed 
##          "integer"

Summary Statistics

sleep_day %>%
  select(TotalMinutesAsleep, TotalTimeInBed) %>%
  summary()
##  TotalMinutesAsleep TotalTimeInBed 
##  Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:361.0      1st Qu.:403.0  
##  Median :433.0      Median :463.0  
##  Mean   :419.5      Mean   :458.6  
##  3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :796.0      Max.   :961.0

Weight

head(weight)
colnames(weight)
## [1] "Id"             "Date"           "WeightKg"       "WeightPounds"  
## [5] "Fat"            "BMI"            "IsManualReport" "LogId"
sapply(weight, class)
##             Id           Date       WeightKg   WeightPounds            Fat 
##      "numeric"    "character"      "numeric"      "numeric"      "integer" 
##            BMI IsManualReport          LogId 
##      "numeric"    "character"      "numeric"

Summary Statistics

weight %>%
  select(WeightKg, WeightPounds, Fat, BMI) %>%
  summary()
##     WeightKg       WeightPounds        Fat             BMI       
##  Min.   : 52.60   Min.   :116.0   Min.   :22.00   Min.   :21.45  
##  1st Qu.: 61.40   1st Qu.:135.4   1st Qu.:22.75   1st Qu.:23.96  
##  Median : 62.50   Median :137.8   Median :23.50   Median :24.39  
##  Mean   : 72.04   Mean   :158.8   Mean   :23.50   Mean   :25.19  
##  3rd Qu.: 85.05   3rd Qu.:187.5   3rd Qu.:24.25   3rd Qu.:25.56  
##  Max.   :133.50   Max.   :294.3   Max.   :25.00   Max.   :47.54  
##                                   NA's   :65
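The `NA's :65` entry shows that `Fat` is missing for 65 rows. A per-column count of missing values makes this kind of gap easier to see across a whole table. A minimal sketch with an invented `toy_weight` data frame:

```r
# Toy stand-in for the weight table; values are invented for illustration.
toy_weight <- data.frame(
  WeightKg = c(52.6, 61.4, 85.1, 62.5),
  Fat      = c(22, NA, NA, 25),
  BMI      = c(21.5, 24.0, 25.6, 24.4)
)

# Count missing values per column.
na_counts <- colSums(is.na(toy_weight))
print(na_counts)

# Share of rows in which Fat is missing.
fat_missing_share <- mean(is.na(toy_weight$Fat))
print(fat_missing_share)
```

A column that is mostly missing, as `Fat` appears to be here, is usually a poor basis for trend analysis.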

Heart rate

head(hr)
colnames(hr)
## [1] "Id"    "Time"  "Value"
sapply(hr, class)
##          Id        Time       Value 
##   "numeric" "character"   "integer"

Summary Statistics

hr %>%
  select(Value) %>%
  summary()
##      Value       
##  Min.   : 36.00  
##  1st Qu.: 63.00  
##  Median : 73.00  
##  Mean   : 77.33  
##  3rd Qu.: 88.00  
##  Max.   :203.00

Duplicate Data

Let’s inspect our data to see if there is any duplicate information.

colnames(daily_activity)
##  [1] "Id"                       "ActivityDate"            
##  [3] "TotalSteps"               "TotalDistance"           
##  [5] "TrackerDistance"          "LoggedActivitiesDistance"
##  [7] "VeryActiveDistance"       "ModeratelyActiveDistance"
##  [9] "LightActiveDistance"      "SedentaryActiveDistance" 
## [11] "VeryActiveMinutes"        "FairlyActiveMinutes"     
## [13] "LightlyActiveMinutes"     "SedentaryMinutes"        
## [15] "Calories"
colnames(daily_calories)
## [1] "Id"          "ActivityDay" "Calories"
colnames(daily_steps)
## [1] "Id"          "ActivityDay" "StepTotal"
colnames(hourly_calories)
## [1] "Id"           "ActivityHour" "Calories"
colnames(hourly_steps)
## [1] "Id"           "ActivityHour" "StepTotal"
colnames(hr)
## [1] "Id"    "Time"  "Value"
colnames(sleep_day)
## [1] "Id"                 "SleepDay"           "TotalSleepRecords" 
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
colnames(weight)
## [1] "Id"             "Date"           "WeightKg"       "WeightPounds"  
## [5] "Fat"            "BMI"            "IsManualReport" "LogId"
identical(daily_activity["Calories"], daily_calories["Calories"])
## [1] TRUE
identical(daily_activity["TotalSteps"], daily_steps["StepTotal"])
## [1] FALSE

A TRUE value is returned for the ‘Calories’ columns of ‘daily_activity’ and ‘daily_calories’, thus both columns in the data frames have identical values.

For ‘daily_activity’ and ‘daily_steps’, comparing the step columns returns FALSE. Note that identical() compares the column names as well as the values, and the names differ (‘TotalSteps’ vs. ‘StepTotal’), so this result alone does not show that the values differ. When we view the data frames (see below), the values look identical at first glance, and both data frames have the same number of rows.

head(daily_activity)
nrow(daily_activity)
## [1] 940
head(daily_steps)
nrow(daily_steps)
## [1] 940

This seems strange, so let’s compare the summary statistics and plot the data to see if they are indeed the same.

Summary Statistics: Daily Activity and Daily Steps

daily_activity %>%
  select(TotalSteps) %>%
  summary()
##    TotalSteps   
##  Min.   :    0  
##  1st Qu.: 3790  
##  Median : 7406  
##  Mean   : 7638  
##  3rd Qu.:10727  
##  Max.   :36019
daily_steps %>%
  select(StepTotal) %>%
  summary()
##    StepTotal    
##  Min.   :    0  
##  1st Qu.: 3790  
##  Median : 7406  
##  Mean   : 7638  
##  3rd Qu.:10727  
##  Max.   :36019

Observations

Summary statistics are identical.
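Matching summaries do not prove that the two columns agree row by row, and the earlier `FALSE` may simply reflect the differing column names (`TotalSteps` vs. `StepTotal`), since `identical()` on single-column data frames compares names as well as values. A minimal sketch with invented data frames `a` and `b`:

```r
# Two toy single-column data frames with the same values but different names.
a <- data.frame(TotalSteps = c(100L, 200L, 300L))
b <- data.frame(StepTotal  = c(100L, 200L, 300L))

# Single-bracket indexing keeps the data frame, including the column name,
# so the differing names alone make the result FALSE.
same_as_frames <- identical(a["TotalSteps"], b["StepTotal"])

# Double-bracket indexing extracts the bare vectors, so only values are compared.
same_as_vectors <- identical(a[["TotalSteps"]], b[["StepTotal"]])

# Element-wise check that also reports how many rows differ.
n_different <- sum(a[["TotalSteps"]] != b[["StepTotal"]])

print(same_as_frames)
print(same_as_vectors)
print(n_different)
```

Applied to the real tables, `identical(daily_activity[["TotalSteps"]], daily_steps[["StepTotal"]])` would compare the values alone, assuming the rows are in the same order in both tables.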

Comparison Plots

‘TotalSteps’ from ‘daily_activity’

ggplot(data = daily_activity) +
  geom_point(mapping = aes(x = Id, y = TotalSteps)) +
  xlab("Participant ID") +
  ylab("Total Number of Steps") +
  ggtitle("Total Steps for 'daily_activity'")

‘StepTotal’ from ‘daily_steps’

ggplot(data = daily_steps) +
  geom_point(mapping = aes(x = Id, y = StepTotal)) +
  xlab("Participant ID") +
  ylab("Total Number of Steps") +
  ggtitle("StepTotal from 'daily_steps'")