Cyclistic is a bike-share company in Chicago, founded in 2016 that features more than 5,800 bicycles and 600 docking stations. It sets itself apart by making bike-share more inclusive to people with disabilities There are two types of customers: annual members and casual customers. They have a flexibility pricing plans: single-ride passes, full-day passes, and annual memberships. Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders.
The Cyclistic Case Study is one of the capstone projects of Google Data Analytics, a professional certificate program offered on Coursera. I roleplayed as a data analyst working on the marketing analyst team at Cyclistic. Lily Moreno, The director of marketing and my manager assigned me to explore the behavior of casual riders and annual members to understand how they use Cyclistic bikes differently.
Data source link: Cyclistic’s Data
Data information: Public Data, dataset made available through Motivate International Inc. under a license from divvybikes.com.
The Cyclistic’s dataset has the historical trip data from 2022, separeted in csv. files to analyze and identify trends.
install.packages("tidyverse")
install.packages("readr")
install.packages("tidyr")
install.packages("janitor")
install.packages("skimr")
install.packages("dplyr")
install.packages("lubridate")
library(tidyverse)
library(readr)
library(tidyr)
library(janitor)
library(skimr)
library(dplyr)
library(lubridate)
read_csv()
functioncyclistic_01 <- read_csv("202201-divvy-tripdata.csv")
cyclistic_02 <- read_csv("202202-divvy-tripdata.csv")
cyclistic_03 <- read_csv("202203-divvy-tripdata.csv")
cyclistic_04 <- read_csv("202204-divvy-tripdata.csv")
cyclistic_05 <- read_csv("202205-divvy-tripdata.csv")
cyclistic_06 <- read_csv("202206-divvy-tripdata.csv")
cyclistic_07 <- read_csv("202207-divvy-tripdata.csv")
cyclistic_08 <- read_csv("202208-divvy-tripdata.csv")
cyclistic_09 <- read_csv("202209-divvy-publictripdata.csv")
cyclistic_10 <- read_csv("202210-divvy-tripdata.csv")
cyclistic_11 <- read_csv("202211-divvy-tripdata.csv")
cyclistic_12 <- read_csv("202212-divvy-tripdata.csv")
In order to process the data I also used the R programing language.
all_trips_raw1 <- rbind(cyclistic_01, cyclistic_02, cyclistic_03, cyclistic_04, cyclistic_05, cyclistic_06, cyclistic_07, cyclistic_08, cyclistic_09, cyclistic_10, cyclistic_11, cyclistic_12)
all_trips_raw1 <-all_trips_raw1 %>%
rename(bike_type=rideable_type,
start_time=started_at,
end_time=ended_at)
all_trips_raw <- all_trips_raw1 %>%
select(-c(start_lat, start_lng, end_lat, end_lng))
colnames(all_trips_raw)
all_trips_raw <- all_trips_raw [!duplicated(all_trips_raw), ]
The variable ride_id has no duplicates, it is also a primary key. Each Id has 16 characters and each row represents a unique bike ride. It’s important to understand if ride_id is the user id or if a new ride generates a new ride id.
all_trips_raw$date <- as.Date(all_trips_raw$start_time)
all_trips_raw$weekday <- format(as.Date(all_trips_raw$date), "%A")
all_trips_raw$month <- format(as.Date(all_trips_raw$date), "%B")
all_trips_raw$hour <- hour(all_trips_raw$start_time)
all_trips_raw<-all_trips_raw %>%
mutate (ride_length=difftime(end_time, start_time, unit="mins"))
After verifying the previews dataset, I spotted trips with zero minutes and negative time. I decided to remove them for the analysis and created a new dataset, as I assumed these were registered incorrectly.
all_trips_raw[all_trips_raw<=0] <- NA
all_trips <- na.omit(all_trips_raw)
head(all_trips)
To analyse my dataframe I’m combining R programming language with SQL and create data visualization using Tableau.
install.packages("sqldf")
library(sqldf)
So fist we want to know how many annual members vs casual users we have. What can we tell from the findings?
ride_user <- sqldf("select count (distinct ride_id), count (ride_id) from all_trips ")
The number of unique ride_id (4369052) is the same as the count of all ride_id (4369052), which means that ride_id can never be a user identifier. If it was each member only used this service once a year.
Cyclistic destinguish customers on their price plan. Those who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
members_vs_casual <- sqldf("select member_casual, count (distinct ride_id) from all_trips
group by member_casual")
The data shows that we have more rides from annual members (59,03%) than casual users(40.9%) in 2022. Even though we have more usage from members we still have a lot activity from casual users. More analysis is needed.
rides_month <- sqldf("select member_casual, month, count (distinct ride_id) from all_trips
group by member_casual, month")
We can see by the results that both type of customers are more active during the warmer months. So this is not a distinguishable factor. Nonetheless it is clear that casual users ride more than member during spring and summer. This way we can conclude that seasons affect the amount of membership rides.
To explore more the if there was a difference between this two customer groups I decided to analyse their ride behavior throughout the day and week.
rides_day <- sqldf("select member_casual, hour, weekday, count(distinct ride_id) from all_trips group by member_casual, hour, weekday")
From the visualization it seems that annual members ride during during rush hours. This means that they may be mainly using Cyclistic to commuting to work. Casual riders use this service more sporadically, but mostly from 10-19h.
Once again the assumption that annual members use Cyclistic services to ride to work is even more clear. They ride especially from monday to friday which are normal work days. In the other hand casual users are more active during the weekend. Also during saturday, casual rides almost double meaning that customers use this service especially during freetime and for leisure purposes.
mean(all_trips$ride_length)
median(all_trips$ride_length)
max(all_trips$ride_length)
min(all_trips$ride_length)
Right away I spotted something interesting in the max ride length. It seems that we have some outliers with trips of more than 24 hours (1440 min). After some analysis I decided to leave this data in the dataset. The starting and ending ride time were in different days, which means that a customer had the bike for more than one day.
Ride_time <- sqldf("select month, weekday, member_casual, avg(ride_length) from all_trips
group by month, weekday, member_casual")
Casual users ride longer than members. This can mean that if there is a price per minute, casual users spend more on their rides. Also casual users have access to full-day passes. It would be interesting to look further into this: daily passes might be discouraging new annual memberships and or preventing potential revenue from single rides passes.
Members have a more consistent rides throughout week, with a slight increase during the weekend, during these times they might have more free time. Again this reinforces the idea that members may be communting to work.
We know that Cyclistic offer more than 5,800 bicycles and 600 docking station. It would be interesting to know if customers have a bike preference.
rides_bike <- sqldf("select member_casual, bike_type, count (distinct ride_id) from all_trips group by member_casual, bike_type ")
Both groups prefer to ride in classic bikes, which doen’t show much diffence between the two groups. However only casual riders use docked bikes. This might indicate that this type of customers don’t care about their starting and ending point. This evidence points, again, to the idea of casual users being tourists. In the other hand Members don’t go for docked bikes because they want the freedom to park their bikes when they go to work.
Since Cyclistic has a network of 692 stations across Chicago, I decided to further analysis on customers ride differences in starting and ending point.
station_users <- sqldf("select start_station_name, member_casual, count(ride_id) from all_trips group by start_station_name, member_casual")
When analysing the start station, we can see through the charts that casual riders tend to start their trips more frequently downtown than annual members. In contrast, we can observe that members start their trips further from downtown. When looking at the ending station there isn’t much different between customers. Both tend to end their rides closer to downtown.
The analysis results seem to show that annual members and casual users are two different customer groups.
More than half (59,76%) of rides are made by Cyclistic annual members.
The average ride time for members is considerably lower than casual riders, which seems to indicate that members use Cyclistic bikes for different purposes comparing to casual riders. Due to this, there may be opportunities to review the company pricing plan and passes for casual users.
The members are probably locals who use the bikes daily to commute to and from their workplaces, while casual customers are most likely tourists and use the bikes for leisure and sightseeing around Chicago.
Cyclistic can try to build a campaign to turn casual customers to members, especially during spring and summer but it will most likely be difficult, since they have such diferent purposes.
The amount of casual rides (more then 1.7 million) is still considerable, and we might have some oportunities to convert them to members to get more value.