This project is the capstone assignment for the Google Data Analytics Professional Certificate program. The program prepares participants for a career in data analytics with training focused on key analytical skills (data cleaning, analysis, and visualization) and tools (Excel, SQL, R Programming, Tableau).
This project will analyze publicly available data sets, provided by the course, for a bike share program based in Chicago.
Cyclistic is a successful bike-share program launched in 2016. It has since grown to a fleet of 5,824 bicycles that are geo-tracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.
Cyclistic offers a variety of pricing plans including: single-ride passes, full-day passes, and annual memberships. Customers who purchase a single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
For this case study, our guiding question is:
How do annual members and casual riders use Cyclistic bikes differently?
Theis analysis made use of the following tools and techniques:
ggplot2
,
tibble
, tidyr
, readr
,
purrr
, dplyr
, stringr
,
lubridate
and forcats
The analysis yielded some key differences between member and casual riders of the Cyclistic bike share service. Most notably were the peak times and durations when casual riders use the service versus when member riders use the service. This led to the following recommendations outlined below.
We know that casual riders take rides of longer duration during the weekends. To encourage casual riders to become members, Cyclistic could partner with special events that occur on weekends. For example, they could partner with concert venues or sporting events.
Casual riders have their peak ride times during the months of April, May, and June. To encourage them to continue using the bike share service, special promotions for membership could be offer outside of peak times in effort to encourage them to continue using the service. This could also be applied during the work-week (M-F) when casual riders are less-likely to use the service.
Lastly, Cyclistic could partner with restaurants and recreational facilities near the Top 10 Start/End locations. As an example a discounted bike share membership could be offered when they purchase an item at a local business near these locations.
Continue reading for the full details of this analysis that led to these recommendations.
The goal of the assigned business task is to investigate the bike share data for differences in usage between annual members and casual riders. The insights gained will be shared with the Cyclistic marketing team. The Director of Marketing is interested in expanding annual memberships by designing targeted ad campaigns for converting casual riders to annual members.
The data for the project is provided by Motivate International, Inc. under this license.
NOTE: The data sets have a different name because Cyclistic is a fictional company created for the purposes of the capstone project as determined by the Google Data Analytics Professional Certificate program.
The data has been organized in monthly and quarterly periods. Since this project was started in April/May of 2022, the data from March 2021 to March 2022 has been selected.
The data contains the following columns:
The data has been de-personalized to safeguard the privacy of users. In particular, this means it is not possible to connect past purchases to credit card numbers and determine if casual riders live in the service area or purchased multiple single passes.
Data processing and analyzing will primary occur in RStudio using the R programming language with supplement visualizations done via Tableau.
Throughout this analysis we’ll make use of the following libraries. If they are not already installed, please do so before opening them.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The data is separated by month; one CSV file per month. In order to order analyze all 12-months worth of data it’s necessary to combine the files as follows. NOTE: The directory path will vary depending on where the original files were downloaded/saved.
# setting up files
file_names <- dir(path = "Desktop/Code/Courses/GoogleDataAnalytics/8_CaseStudy/Case_Study_01_Cyclistic/Datasets",
pattern = NULL, all.files = FALSE, full.names = FALSE,
recursive = FALSE, ignore.case = FALSE, include.dirs = FALSE,
no.. = FALSE)
# creating df for last 12 months of data
cyclistic_df <- do.call(rbind, lapply(file_names, read.csv))
# export df to CSV file
write.csv(cyclistic_df, "cyclistic_202103-202203.csv", row.names = TRUE) # completed 2022-04-21
Once we’ve combined the necessary data into a single CSV file, we can read it and store in a data frame. This will allow for ease of use during the cleaning process and the analysis process.
bikeshare_df <- read_csv("Datasets/cyclistic_202103-202203.csv")
## New names:
## Rows: 5952028 Columns: 14
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (7): ride_id, rideable_type, start_station_name, start_station_id, end_... dbl
## (5): ...1, start_lat, start_lng, end_lat, end_lng dttm (2): started_at,
## ended_at
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
Quick view using head()
and summary()
head(bikeshare_df)
summary(bikeshare_df)
## ...1 ride_id rideable_type
## Min. : 1 Length:5952028 Length:5952028
## 1st Qu.:1488008 Class :character Class :character
## Median :2976014 Mode :character Mode :character
## Mean :2976014
## 3rd Qu.:4464021
## Max. :5952028
##
## started_at ended_at
## Min. :2021-03-01 00:01:09.00 Min. :2021-03-01 00:06:28.00
## 1st Qu.:2021-06-15 21:08:34.00 1st Qu.:2021-06-15 21:35:02.25
## Median :2021-08-13 20:17:38.50 Median :2021-08-13 20:38:45.50
## Mean :2021-08-20 17:11:57.70 Mean :2021-08-20 17:33:33.29
## 3rd Qu.:2021-10-12 08:05:40.25 3rd Qu.:2021-10-12 08:19:54.00
## Max. :2022-03-31 23:59:47.00 Max. :2022-04-01 22:10:12.00
##
## start_station_name start_station_id end_station_name end_station_id
## Length:5952028 Length:5952028 Length:5952028 Length:5952028
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## start_lat start_lng end_lat end_lng
## Min. :41.64 Min. :-87.84 Min. :41.39 Min. :-88.97
## 1st Qu.:41.88 1st Qu.:-87.66 1st Qu.:41.88 1st Qu.:-87.66
## Median :41.90 Median :-87.64 Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65 Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :45.64 Max. :-73.80 Max. :42.17 Max. :-87.49
## NA's :4883 NA's :4883
## member_casual
## Length:5952028
## Class :character
## Mode :character
##
##
##
##
Inspect column names:
## [1] "...1" "ride_id" "rideable_type"
## [4] "started_at" "ended_at" "start_station_name"
## [7] "start_station_id" "end_station_name" "end_station_id"
## [10] "start_lat" "start_lng" "end_lat"
## [13] "end_lng" "member_casual"
After inspection, there are is an extra column “… 1” which is not needed since it only contains a list of numbers which correspond to the row number - this was created when the the 12-months worth of data was created. This will need to be removed in the cleaning process.
Check the total number of rows is 5,952,028 after combining the 12-month data:
## [1] 5952028
Prior to 2020, Cyclistic used different names for casual and member riders. This data is from 2021, but we still want to confirm that the categorical terms for casual riders (“casual”) and annual members (“members”) are still being used.
##
## casual member
## 2630575 3321453
In order to analyze ride usage based on the month, day, and year, we need to add columns for each.
bikeshare_df$date <- as.Date(bikeshare_df$started_at)
bikeshare_df$month <- format(as.Date(bikeshare_df$date), "%m")
bikeshare_df$day <- format(as.Date(bikeshare_df$date), "%d")
bikeshare_df$year <- format(as.Date(bikeshare_df$date), "%Y")
bikeshare_df$day_of_week <- format(as.Date(bikeshare_df$date), "%A")
bikeshare_df$ride_length <- difftime(bikeshare_df$ended_at,bikeshare_df$started_at)
is.factor(bikeshare_df$ride_length)
## [1] FALSE
bikeshare_df$ride_length <- as.numeric(as.character(bikeshare_df$ride_length))
is.numeric(bikeshare_df$ride_length)
## [1] TRUE
The data frame includes a few hundred entries when bikes were taken out of docks and checked for quality by Divvy or ride_length was negative. We will create a new version of the data frame (v2) since data is being removed.
bikeshare_df_v2 <- bikeshare_df[!(bikeshare_df$ride_length<0),] # removes neg values
bikeshare_df_v2 <- mutate(bikeshare_df_v2, ...1 = NULL) # removes extra col
head(bikeshare_df_v2)
colnames(bikeshare_df_v2)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual" "date" "month"
## [16] "day" "year" "day_of_week"
## [19] "ride_length"
View(bikeshare_df_v2)
nrow(bikeshare_df_v2) # check num of rows
## [1] 5951881
any(bikeshare_df_v2$start_station_name == "HQ QR") # doesn't exist in data set
## [1] NA
any(bikeshare_df_v2$ride_length < 0) # checking for negative values
## [1] FALSE
After cleaning, a CSV file will be exported to preserve the clean data and another file will be created for use in Tableau.
write_csv(bikeshare_df_v2, "2022-04-26_cyclistic_clean_data.csv")
Since the data frame containing the clean data is too large to upload to Tableau Public (file limit of 1 GB) a subset of the data frame will be created and exported.
# selection of desired columns to keep for export
myvars <- c("ride_id", "rideable_type", "member_casual", "date", "month",
"day", "year", "day_of_week", "ride_length", "start_station_name",
"end_station_name")
# store selected columns in a data frame
bikeshare_subset <- bikeshare_df_v2[myvars]
# write subset data frame to CSV file
write_csv(bikeshare_subset, "2022-04-30_cyclistic_subset.csv")
Summary Statistics of ride length (in seconds) for both casual and member riders:
summary(bikeshare_df_v2$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 395 704 1296 1283 3356649
Comparison of the mean, median, max, and min.
aggregate(bikeshare_df_v2$ride_length ~ bikeshare_df_v2$member_casual, FUN = mean)
aggregate(bikeshare_df_v2$ride_length ~ bikeshare_df_v2$member_casual, FUN = median)
aggregate(bikeshare_df_v2$ride_length ~ bikeshare_df_v2$member_casual, FUN = max)
aggregate(bikeshare_df_v2$ride_length ~ bikeshare_df_v2$member_casual, FUN = min)
Observations
We can see that for both the mean and median, casual riders have trips of longer duration than member riders.
# first order by day of the week
bikeshare_df_v2$day_of_week <- ordered(bikeshare_df_v2$day_of_week,
levels=c("Sunday", "Monday", "Tuesday",
"Wednesday", "Thursday", "Friday",
"Saturday"))
# calculate average
aggregate(bikeshare_df_v2$ride_length ~ bikeshare_df_v2$member_casual +
bikeshare_df_v2$day_of_week, FUN = mean)
bikeshare_df_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>% #creates weekday field using wday()
group_by(member_casual, weekday) %>% #groups by user type and weekday
summarise(number_of_rides = n() #calculates the number of rides and average duration
,average_duration = mean(ride_length)) %>% # calculates the average duration
arrange(member_casual, weekday) # sorts
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
Observations
On average, - casual riders ride longer than member riders - member riders take more rides during the work week (M-F) than casual riders - Casual riders take more rides during the weekend
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
Observations
Casual riders take more rides on the weekend than member riders.
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
Observations
On average, casual riders take rides of longer duration than member riders.
Observations
Throughout the year, casual riders have longer average ride lengths than member riders. However, ride length of casual riders peaks during the months of April, May, and June.
Observations
With few exceptions, the top 10 start and end stations for casual riders are nearly identical. In particular, the top 3 stations (listed below) are the same start and ending points for casual riders. This could be useful for making recommendations for a targeted ad campaign to convert casual riders to member riders.
Top 3 Stations for casual riders: - Streeter Dr & Grand Ave - Millennium Park - Michigan Ave & Oak St