This project is the capstone assignment for the Google Data Analytics Professional Certificate program. The program prepares participants for a career in data analytics with training focused on key analytical skills (data cleaning, analysis, and visualization) and tools (Excel, SQL, R Programming, Tableau).
We’ll be investigating data sets of FitBit usage to make high-level marketing recommendations for Bellabeat—a high-tech company that manufactures health-focused smart products.
Bellabeat was founded in 2013 and has grown to become a tech-driven wellness company for women. Since 2016, they have opened offices around the globe and launched multiple products. Their apps and devices collect data on activity, sleep, stress, and reproductive health, allowing Bellabeat to empower women with knowledge about their own health and habits.
Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals. Their products include:
The guiding questions for this case study are as follows:
This analysis makes use of the following tools and techniques:
ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr, lubridate, and forcats
Based on the analysis, the following are three recommendations (detailed below):
Based on the data, the trend that stands out most is the increase in activity on Tuesday, Wednesday, and Thursday. Participants were most active on these days, both in step count and in calories burned. These were also the days on which participants got the most sleep and spent the most time in bed. Sunday, by contrast, is the least active day.
To help Bellabeat customers reach their wellness goals, it is recommended that the product team implement the ability to add reminders when customers have lower levels of activity. In particular, since we know that Sunday (and, to some degree, Friday and Saturday) is the least active day, the product team could let customers increase their Wellness Score by more points when they are active on Sunday. For example, activities on Sundays could count for 1.5 or 2 times the regular number of points awarded when calculating the Wellness Score. This is intended to incentivize customers to be more active on this day.
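As a rough illustration of the weighting idea, the sketch below applies a day-specific multiplier to activity points. The ‘activity_points’ values, the weighted_points() helper, and the 1.5 multiplier are all hypothetical; nothing here reflects how Bellabeat actually calculates its Wellness Score.

```r
# Hypothetical sketch: weight activity points by day of week.
# This is NOT Bellabeat's real Wellness Score formula.
weighted_points <- function(points, weekday,
                            boost_days = "Sunday", multiplier = 1.5) {
  # Points earned on a boosted day are scaled; other days are unchanged.
  ifelse(weekday %in% boost_days, points * multiplier, points)
}

# Made-up example data for one customer
activity <- data.frame(
  weekday = c("Saturday", "Sunday", "Monday"),
  activity_points = c(10, 10, 10),
  stringsAsFactors = FALSE
)

activity$weighted <- weighted_points(activity$activity_points, activity$weekday)
print(activity)
```

Because the multiplier is a parameter, a value above 1 rewards activity on the chosen day, while a value below 1 would down-weight it.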
We know that Tuesday, Wednesday, and Thursday are the most active days. The product team could implement notifications on the following days. These notifications could include reminders about how active the individual has been the past day (or previous days) and offer encouragement to keep staying active.
The Wellness Score in the Bellabeat app is a number calculated from several factors, including activity. We know from the data that Sunday is the least active day. Perhaps, instead of encouraging individuals to be active on this day, rest should be encouraged. Bellabeat takes a holistic approach to wellness, and while being active is important for achieving fitness goals, rest is also an important factor. In this case, activities on Sundays (or on another rest day specified by the customer) could count for slightly less so that the overall Wellness Score isn’t adversely affected.
Continue reading for the full details of this analysis that led to these recommendations.
We want to analyze the smart device usage data to gain insights into how people are already using their smart devices and make high-level recommendations for how these trends can inform Bellabeat’s marketing strategy. We’ll focus on the following guiding questions:
The following data includes FitBit Fitness Tracker data from 30 individuals over a month-long period from April 12, 2016 to May 12, 2016. The data includes metrics on daily activities, calories, intensities, steps, heart rate, sleep, weight, and METs.
The data is available as a public data set by Mobius on Kaggle. It was generated by respondents to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016. This data is available for use under the CC0: Public Domain License.
The data was collected from only 30 FitBit users; a sample this small is limiting, and any trends found might not hold for larger groups. Further, the data contains no demographic information, so the gender, ethnicity, age, and other characteristics of these users are unknown. Bellabeat focuses on smart devices specifically for women, so this data might not match the demographics of Bellabeat users.
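One way to check the sample-size concern directly is to count distinct participants and the number of days each one logged. The sketch below does this in base R on a made-up data frame; ‘toy_activity’ and its values are placeholders, and the same calls could be pointed at a real table with an ‘Id’ column.

```r
# Toy stand-in for a table such as daily_activity (values are made up)
toy_activity <- data.frame(
  Id = c(1001, 1001, 1002, 1003, 1003, 1003),
  TotalSteps = c(5000, 7200, 300, 9100, 10400, 8800)
)

# Number of distinct participants in the table
n_participants <- length(unique(toy_activity$Id))

# Number of logged rows (days) per participant, to spot sparsely covered users
days_per_participant <- table(toy_activity$Id)

print(n_participants)
print(days_per_participant)
```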
library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("dplyr")
library("ggplot2")
library("lubridate")
library("corrplot")
## corrplot 0.92 loaded
daily_activity <- read.csv("datasets/dailyActivity_merged.csv")
daily_calories <- read.csv("datasets/dailyCalories_merged.csv")
hourly_calories <- read.csv("datasets/hourlyCalories_merged.csv")
daily_steps <- read.csv("datasets/dailySteps_merged.csv")
hourly_steps <- read.csv("datasets/hourlySteps_merged.csv")
sleep_day <- read.csv("datasets/sleepDay_merged.csv")
weight <- read.csv("datasets/weightLogInfo_merged.csv")
hr <- read.csv("datasets/heartrate_seconds_merged.csv")
For the FitBit Fitness Tracker data set we’ll focus on the following metrics for our analysis: daily activity, calories, steps, sleep, weight, and heart rate.
head(daily_activity)
colnames(daily_activity)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
sapply(daily_activity, class)
## Id ActivityDate TotalSteps
## "numeric" "character" "integer"
## TotalDistance TrackerDistance LoggedActivitiesDistance
## "numeric" "numeric" "numeric"
## VeryActiveDistance ModeratelyActiveDistance LightActiveDistance
## "numeric" "numeric" "numeric"
## SedentaryActiveDistance VeryActiveMinutes FairlyActiveMinutes
## "numeric" "integer" "integer"
## LightlyActiveMinutes SedentaryMinutes Calories
## "integer" "integer" "integer"
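‘ActivityDate’ is read in as character, so any day-of-week comparison needs it converted to a date first. The sketch below shows one way to do that with lubridate, assuming the values are in month/day/year form (for example 4/12/2016). That format is an assumption, since the raw values are not shown here, and the toy data frame is made up.

```r
library(lubridate)

# Made-up rows shaped like daily_activity; the date format is an assumption
toy_activity <- data.frame(
  Id = c(1001, 1001, 1001),
  ActivityDate = c("4/12/2016", "4/16/2016", "4/17/2016"),
  TotalSteps = c(9000, 4000, 2500),
  stringsAsFactors = FALSE
)

# Convert the character column to Date, then derive the weekday
toy_activity$ActivityDate <- mdy(toy_activity$ActivityDate)
toy_activity$Weekday <- wday(toy_activity$ActivityDate, label = TRUE, abbr = FALSE)

str(toy_activity)
```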
Summary Statistics
daily_activity %>%
  select(TotalSteps, TotalDistance, Calories, SedentaryMinutes) %>%
  summary()
## TotalSteps TotalDistance Calories SedentaryMinutes
## Min. : 0 Min. : 0.000 Min. : 0 Min. : 0.0
## 1st Qu.: 3790 1st Qu.: 2.620 1st Qu.:1828 1st Qu.: 729.8
## Median : 7406 Median : 5.245 Median :2134 Median :1057.5
## Mean : 7638 Mean : 5.490 Mean :2304 Mean : 991.2
## 3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.:2793 3rd Qu.:1229.5
## Max. :36019 Max. :28.030 Max. :4900 Max. :1440.0
head(daily_calories)
colnames(daily_calories)
## [1] "Id" "ActivityDay" "Calories"
sapply(daily_calories, class)
## Id ActivityDay Calories
## "numeric" "character" "integer"
Summary Statistics
daily_calories %>%
  select(Calories) %>%
  summary()
## Calories
## Min. : 0
## 1st Qu.:1828
## Median :2134
## Mean :2304
## 3rd Qu.:2793
## Max. :4900
head(hourly_calories)
colnames(hourly_calories)
## [1] "Id" "ActivityHour" "Calories"
sapply(hourly_calories, class)
## Id ActivityHour Calories
## "numeric" "character" "integer"
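‘ActivityHour’ is also stored as character. For hour-of-day comparisons it could be parsed into a date-time and the hour extracted, as sketched below with lubridate. The 12-hour timestamp format used in the toy data (for example 4/12/2016 1:00:00 PM) is an assumption, and the values are made up.

```r
library(lubridate)

# Made-up rows shaped like hourly_calories; the timestamp format is an assumption
toy_hourly <- data.frame(
  Id = c(1001, 1001, 1001),
  ActivityHour = c("4/12/2016 8:00:00 AM",
                   "4/12/2016 1:00:00 PM",
                   "4/12/2016 11:00:00 PM"),
  Calories = c(60, 110, 75),
  stringsAsFactors = FALSE
)

# Parse the character column into POSIXct, then pull out the hour (0-23)
toy_hourly$ActivityHour <- mdy_hms(toy_hourly$ActivityHour)
toy_hourly$Hour <- hour(toy_hourly$ActivityHour)

print(toy_hourly)
```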
Summary Statistics
hourly_calories %>%
  select(Calories) %>%
  summary()
## Calories
## Min. : 42.00
## 1st Qu.: 63.00
## Median : 83.00
## Mean : 97.39
## 3rd Qu.:108.00
## Max. :948.00
head(daily_steps)
colnames(daily_steps)
## [1] "Id" "ActivityDay" "StepTotal"
sapply(daily_steps, class)
## Id ActivityDay StepTotal
## "numeric" "character" "integer"
Summary Statistics
daily_steps %>%
  select(StepTotal) %>%
  summary()
## StepTotal
## Min. : 0
## 1st Qu.: 3790
## Median : 7406
## Mean : 7638
## 3rd Qu.:10727
## Max. :36019
head(hourly_steps)
colnames(hourly_steps)
## [1] "Id" "ActivityHour" "StepTotal"
sapply(hourly_steps, class)
## Id ActivityHour StepTotal
## "numeric" "character" "integer"
Summary Statistics
hourly_steps %>%
  select(StepTotal) %>%
  summary()
## StepTotal
## Min. : 0.0
## 1st Qu.: 0.0
## Median : 40.0
## Mean : 320.2
## 3rd Qu.: 357.0
## Max. :10554.0
head(sleep_day)
colnames(sleep_day)
## [1] "Id" "SleepDay" "TotalSleepRecords"
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
sapply(sleep_day, class)
## Id SleepDay TotalSleepRecords TotalMinutesAsleep
## "numeric" "character" "integer" "integer"
## TotalTimeInBed
## "integer"
Summary Statistics
sleep_day %>%
  select(TotalMinutesAsleep, TotalTimeInBed) %>%
  summary()
## TotalMinutesAsleep TotalTimeInBed
## Min. : 58.0 Min. : 61.0
## 1st Qu.:361.0 1st Qu.:403.0
## Median :433.0 Median :463.0
## Mean :419.5 Mean :458.6
## 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :796.0 Max. :961.0
head(weight)
colnames(weight)
## [1] "Id" "Date" "WeightKg" "WeightPounds"
## [5] "Fat" "BMI" "IsManualReport" "LogId"
sapply(weight, class)
## Id Date WeightKg WeightPounds Fat
## "numeric" "character" "numeric" "numeric" "integer"
## BMI IsManualReport LogId
## "numeric" "character" "numeric"
Summary Statistics
weight %>%
  select(WeightKg, WeightPounds, Fat, BMI) %>%
  summary()
## WeightKg WeightPounds Fat BMI
## Min. : 52.60 Min. :116.0 Min. :22.00 Min. :21.45
## 1st Qu.: 61.40 1st Qu.:135.4 1st Qu.:22.75 1st Qu.:23.96
## Median : 62.50 Median :137.8 Median :23.50 Median :24.39
## Mean : 72.04 Mean :158.8 Mean :23.50 Mean :25.19
## 3rd Qu.: 85.05 3rd Qu.:187.5 3rd Qu.:24.25 3rd Qu.:25.56
## Max. :133.50 Max. :294.3 Max. :25.00 Max. :47.54
## NA's :65
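The summary above reports 65 missing values in ‘Fat’. A quick way to see how much of each column is missing is to count NA values per column, as in the base R sketch below. The ‘toy_weight’ data frame is made up for illustration; the same two calls could be applied to ‘weight’.

```r
# Made-up rows shaped like a subset of the weight table
toy_weight <- data.frame(
  WeightKg = c(52.6, 61.4, 85.1, 62.5),
  Fat = c(22L, NA, NA, 25L),
  BMI = c(22.6, 24.0, 25.6, 24.4)
)

# Count of missing values in each column
na_counts <- colSums(is.na(toy_weight))

# Share of rows that are missing in each column
na_share <- colMeans(is.na(toy_weight))

print(na_counts)
print(na_share)
```

A column that is mostly NA is usually a poor basis for any trend, so a check like this helps decide which metrics to carry forward.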
head(hr)
colnames(hr)
## [1] "Id" "Time" "Value"
sapply(hr, class)
## Id Time Value
## "numeric" "character" "integer"
Summary Statistics
hr %>%
  select(Value) %>%
  summary()
## Value
## Min. : 36.00
## 1st Qu.: 63.00
## Median : 73.00
## Mean : 77.33
## 3rd Qu.: 88.00
## Max. :203.00
Let’s inspect our data to see if there is any duplicate information.
colnames(daily_activity)
## [1] "Id" "ActivityDate"
## [3] "TotalSteps" "TotalDistance"
## [5] "TrackerDistance" "LoggedActivitiesDistance"
## [7] "VeryActiveDistance" "ModeratelyActiveDistance"
## [9] "LightActiveDistance" "SedentaryActiveDistance"
## [11] "VeryActiveMinutes" "FairlyActiveMinutes"
## [13] "LightlyActiveMinutes" "SedentaryMinutes"
## [15] "Calories"
colnames(daily_calories)
## [1] "Id" "ActivityDay" "Calories"
colnames(daily_steps)
## [1] "Id" "ActivityDay" "StepTotal"
colnames(hourly_calories)
## [1] "Id" "ActivityHour" "Calories"
colnames(hourly_steps)
## [1] "Id" "ActivityHour" "StepTotal"
colnames(hr)
## [1] "Id" "Time" "Value"
colnames(sleep_day)
## [1] "Id" "SleepDay" "TotalSleepRecords"
## [4] "TotalMinutesAsleep" "TotalTimeInBed"
colnames(weight)
## [1] "Id" "Date" "WeightKg" "WeightPounds"
## [5] "Fat" "BMI" "IsManualReport" "LogId"
identical(daily_activity["Calories"], daily_calories["Calories"])
## [1] TRUE
identical(daily_activity["TotalSteps"], daily_steps["StepTotal"])
## [1] FALSE
A TRUE value is returned for the ‘Calories’ columns of ‘daily_activity’ and ‘daily_calories’, so the two columns hold identical values. For ‘daily_activity’ and ‘daily_steps’, comparing the step-count columns returns FALSE, which suggests the values in these two columns are not the same. However, when we view the data frames (see below), the values look identical at first glance, and both data frames have the same number of rows.
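One possible explanation for the FALSE result is worth keeping in mind: identical() on single-bracket subsets compares whole data frames, including their column names, so ‘TotalSteps’ and ‘StepTotal’ can never match this way even if every value is equal. The sketch below illustrates this with made-up data; ‘a’ and ‘b’ are placeholders, not the real tables.

```r
# Two one-column data frames with the same values but different column names
a <- data.frame(TotalSteps = c(100L, 200L, 300L))
b <- data.frame(StepTotal  = c(100L, 200L, 300L))

# Single brackets keep the data frame (and its column name),
# so the differing names alone make this FALSE.
same_as_df <- identical(a["TotalSteps"], b["StepTotal"])

# Double brackets extract the bare vectors, so only the values are compared.
same_as_vec <- identical(a[["TotalSteps"]], b[["StepTotal"]])

# Element-wise check: TRUE only if every row matches, in order.
all_rows_equal <- all(a$TotalSteps == b$StepTotal)

print(c(same_as_df = same_as_df,
        same_as_vec = same_as_vec,
        all_rows_equal = all_rows_equal))
```

Comparing the extracted vectors, for example with [[ ]] or $, tests the values alone and gives a more direct answer than comparing summaries or plots.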
head(daily_activity)
nrow(daily_activity)
## [1] 940
head(daily_steps)
nrow(daily_steps)
## [1] 940
This seems strange, so let’s compare the summary statistics and plot the data to see if they are indeed the same.
daily_activity %>%
  select(TotalSteps) %>%
  summary()
## TotalSteps
## Min. : 0
## 1st Qu.: 3790
## Median : 7406
## Mean : 7638
## 3rd Qu.:10727
## Max. :36019
daily_steps %>%
  select(StepTotal) %>%
  summary()
## StepTotal
## Min. : 0
## 1st Qu.: 3790
## Median : 7406
## Mean : 7638
## 3rd Qu.:10727
## Max. :36019
Observations
Summary statistics are identical.
ggplot(data = daily_activity) +
  geom_point(mapping = aes(x = Id, y = TotalSteps)) +
  xlab("Participant ID") +
  ylab("Total Number of Steps") +
  ggtitle("Total Steps for 'daily_activity'")
ggplot(data = daily_steps) +
  geom_point(mapping = aes(x = Id, y = StepTotal)) +
  xlab("Participant ID") +
  ylab("Total Number of Steps") +
  ggtitle("StepTotal from 'daily_steps'")