Google-Data-Analytics---Cyclistic-Case-Study

The Cyclistic Case Study - Google Data Analytics Capstone


Author: Catarina Periquito

Date: 2023-01-16


Cyclistic bike-share


Cyclistic is a bike-share company in Chicago, founded in 2016 that features more than 5,800 bicycles and 600 docking stations. It sets itself apart by making bike-share more inclusive to people with disabilities There are two types of customers: annual members and casual customers. They have a flexibility pricing plans: single-ride passes, full-day passes, and annual memberships. Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders.

Scenario


The Cyclistic Case Study is one of the capstone projects of Google Data Analytics, a professional certificate program offered on Coursera. I roleplayed as a data analyst working on the marketing analyst team at Cyclistic. Lily Moreno, The director of marketing and my manager assigned me to explore the behavior of casual riders and annual members to understand how they use Cyclistic bikes differently.

1. Ask


1.1 Identify the business task:

1.2 Questions

1.3 Consider key stakeholders:

Primary stakeholders

Secundary stakeholders

2. Prepare


2.1 Data source used:

2.2 Data credibility:

2.3 Dataset analysed:

The Cyclistic’s dataset has the historical trip data from 2022, separeted in csv. files to analyze and identify trends.

Loaded packages in R programing language

cyclistic_01 <- read_csv("202201-divvy-tripdata.csv")
cyclistic_02 <- read_csv("202202-divvy-tripdata.csv")
cyclistic_03 <- read_csv("202203-divvy-tripdata.csv")
cyclistic_04 <- read_csv("202204-divvy-tripdata.csv")
cyclistic_05 <- read_csv("202205-divvy-tripdata.csv")
cyclistic_06 <- read_csv("202206-divvy-tripdata.csv")
cyclistic_07 <- read_csv("202207-divvy-tripdata.csv")
cyclistic_08 <- read_csv("202208-divvy-tripdata.csv")
cyclistic_09 <- read_csv("202209-divvy-publictripdata.csv")
cyclistic_10 <- read_csv("202210-divvy-tripdata.csv")
cyclistic_11 <- read_csv("202211-divvy-tripdata.csv")
cyclistic_12 <- read_csv("202212-divvy-tripdata.csv")

Process


In order to process the data I also used the R programing language.

Combining and Modifying data

all_trips_raw1 <- rbind(cyclistic_01, cyclistic_02, cyclistic_03, cyclistic_04, cyclistic_05, cyclistic_06, cyclistic_07, cyclistic_08, cyclistic_09, cyclistic_10, cyclistic_11, cyclistic_12)

Changing columns name to be more legible

all_trips_raw1 <-all_trips_raw1 %>% 
                  rename(bike_type=rideable_type,
                   start_time=started_at,
                   end_time=ended_at)

Remove the latitute and longitude fields

all_trips_raw <- all_trips_raw1 %>%
  select(-c(start_lat, start_lng, end_lat, end_lng))

Confirm final columns of dataset

colnames(all_trips_raw)

Data Cleaning

all_trips_raw <- all_trips_raw [!duplicated(all_trips_raw), ] 

The variable ride_id has no duplicates, it is also a primary key. Each Id has 16 characters and each row represents a unique bike ride. It’s important to understand if ride_id is the user id or if a new ride generates a new ride id.

Adding columns for date, day of the week, month and time

all_trips_raw$date <- as.Date(all_trips_raw$start_time) 
all_trips_raw$weekday <- format(as.Date(all_trips_raw$date), "%A") 
all_trips_raw$month <- format(as.Date(all_trips_raw$date), "%B")
all_trips_raw$hour <- hour(all_trips_raw$start_time)

Add calculated field for ride_length in minutes

all_trips_raw<-all_trips_raw %>%
mutate (ride_length=difftime(end_time, start_time, unit="mins"))

After verifying the previews dataset, I spotted trips with zero minutes and negative time. I decided to remove them for the analysis and created a new dataset, as I assumed these were registered incorrectly.

all_trips_raw[all_trips_raw<=0] <- NA
all_trips <- na.omit(all_trips_raw)

Confirm final dataset

head(all_trips)

Analyse/share


To analyse my dataframe I’m combining R programming language with SQL and create data visualization using Tableau.

install.packages("sqldf")
library(sqldf)

So fist we want to know how many annual members vs casual users we have. What can we tell from the findings?

Is Rider_id the user id or is it the id of a new ride?

ride_user <- sqldf("select count (distinct ride_id), count (ride_id) from all_trips ")

The number of unique ride_id (4369052) is the same as the count of all ride_id (4369052), which means that ride_id can never be a user identifier. If it was each member only used this service once a year.

How many casual vs membership rides take place?

Cyclistic destinguish customers on their price plan. Those who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

members_vs_casual <- sqldf("select member_casual, count (distinct ride_id) from all_trips 
group by member_casual")

The data shows that we have more rides from annual members (59,03%) than casual users(40.9%) in 2022. Even though we have more usage from members we still have a lot activity from casual users. More analysis is needed.

Is the number of rides seasonable?

rides_month <- sqldf("select member_casual, month, count (distinct ride_id) from all_trips 
group by member_casual, month")

We can see by the results that both type of customers are more active during the warmer months. So this is not a distinguishable factor. Nonetheless it is clear that casual users ride more than member during spring and summer. This way we can conclude that seasons affect the amount of membership rides.

When do different users ride throughout the day? and week?

To explore more the if there was a difference between this two customer groups I decided to analyse their ride behavior throughout the day and week.

rides_day <- sqldf("select member_casual, hour, weekday, count(distinct ride_id) from all_trips group by member_casual, hour, weekday")

From the visualization it seems that annual members ride during during rush hours. This means that they may be mainly using Cyclistic to commuting to work. Casual riders use this service more sporadically, but mostly from 10-19h.

Once again the assumption that annual members use Cyclistic services to ride to work is even more clear. They ride especially from monday to friday which are normal work days. In the other hand casual users are more active during the weekend. Also during saturday, casual rides almost double meaning that customers use this service especially during freetime and for leisure purposes.

What is the duration of each user type rides? Does it vary per week day?

mean(all_trips$ride_length)
median(all_trips$ride_length)
max(all_trips$ride_length)
min(all_trips$ride_length)

Right away I spotted something interesting in the max ride length. It seems that we have some outliers with trips of more than 24 hours (1440 min). After some analysis I decided to leave this data in the dataset. The starting and ending ride time were in different days, which means that a customer had the bike for more than one day.

Ride_time <- sqldf("select month, weekday, member_casual, avg(ride_length) from all_trips 
group by month, weekday, member_casual")

Casual users ride longer than members. This can mean that if there is a price per minute, casual users spend more on their rides. Also casual users have access to full-day passes. It would be interesting to look further into this: daily passes might be discouraging new annual memberships and or preventing potential revenue from single rides passes.

Members have a more consistent rides throughout week, with a slight increase during the weekend, during these times they might have more free time. Again this reinforces the idea that members may be communting to work.

Do members and casual users have different bike type preferences?

We know that Cyclistic offer more than 5,800 bicycles and 600 docking station. It would be interesting to know if customers have a bike preference.

rides_bike <- sqldf("select member_casual, bike_type, count (distinct ride_id) from all_trips group by member_casual, bike_type ")

Both groups prefer to ride in classic bikes, which doen’t show much diffence between the two groups. However only casual riders use docked bikes. This might indicate that this type of customers don’t care about their starting and ending point. This evidence points, again, to the idea of casual users being tourists. In the other hand Members don’t go for docked bikes because they want the freedom to park their bikes when they go to work.

Where does each type of user start and end their trip?

Since Cyclistic has a network of 692 stations across Chicago, I decided to further analysis on customers ride differences in starting and ending point.

station_users <-  sqldf("select start_station_name, member_casual, count(ride_id) from all_trips group by  start_station_name, member_casual")

When analysing the start station, we can see through the charts that casual riders tend to start their trips more frequently downtown than annual members. In contrast, we can observe that members start their trips further from downtown. When looking at the ending station there isn’t much different between customers. Both tend to end their rides closer to downtown.

Conclusion


Suggestions and recommendations


Presentation


Click here