Introduction

Citibike lets New Yorkers hop on a bike when they want, to go wherever they want. No more worrying about where to lock up a bike - Citibike’s stations have dedicated slots to securely return your bike. And no more having to schlep your bike home - bike share means that you can take one-way trips and not have to worry about your bike when you’re done. It’s convenient, easy, and cheap.

This convenience of on-demand bikes creates a puzzle for Citibike’s logistics: how to ensure that bikes are available where and when they’re needed. If bikes aren’t available at the right place and the right time, ridership and revenue will fall. It’s therefore very important to rebalance bikes in the system: i.e. move bikes from areas of low demand to areas of high demand, before the high demand begins.

To help Citibike rebalance bikes, I’ve created a model to forecast demand in Brooklyn. My rebalancing plan will work as follows:

  • Pre-peak incentives: my program would provide incentives for riders to park their bikes at stations that are forecast to have high demand, in the one- to two-hour window before the higher demand occurs. These incentives would include a $2* discount on a day pass or monthly membership, and a reserved slot at the receiving station that ensures riders will definitely have a place to park - a perk otherwise not available in the system. (*Note: this is an example discount rate. It may be adjusted based on further research into cost and benefit thresholds.)

  • Overnight trucking: Morning demand is spread across residential areas, making it both difficult and expensive to incentivize riders to return their bikes across such a wide area. Since traffic is also lower during the night, this would be the ideal time to deploy trucks to gather bikes from stations with lower forecast morning demand and redock them at stations predicted to have high morning demand. This process would require us to predict conditions at least four hours in advance, and ideally 24 hours in advance, so that we can arrange labor schedules accordingly. (A sketch of how a forecast could flag stations for each intervention follows this list.)
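Below is a minimal sketch of that selection logic, assuming a hypothetical station-level forecast data frame (station_forecast, with columns station_id, forecast_hour, and predicted_trips). The column names and thresholds are placeholders for illustration, not outputs of the model built later in this analysis.

library(dplyr)
library(lubridate)

# Sketch only: `station_forecast` is a hypothetical data frame with one row per
# station per hour and columns station_id, forecast_hour (POSIXct), predicted_trips.
flag_interventions <- function(station_forecast, now,
                               incentive_threshold = 10,   # placeholder: trips per hour
                               trucking_threshold  = 20) { # placeholder: trips over the AM rush
  station_forecast %>%
    mutate(hours_ahead = as.numeric(difftime(forecast_hour, now, units = "hours"))) %>%
    group_by(station_id) %>%
    summarize(
      # Pre-peak incentives: high forecast demand one to two hours from now
      incentive_target = any(hours_ahead >= 1 & hours_ahead <= 2 &
                               predicted_trips >= incentive_threshold),
      # Overnight trucking: high cumulative demand forecast for tomorrow's AM rush (7-10 am)
      truck_target = sum(predicted_trips[hours_ahead >= 4 & hours_ahead <= 24 &
                                           hour(forecast_hour) %in% 7:9]) >= trucking_threshold,
      .groups = "drop")
}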

library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(RSocrata)
library(gganimate)
library(gifski)
library(caret)
library(pscl)

plotTheme <- theme(
  plot.title =element_text(size=12),
  plot.subtitle = element_text(size=8),
  plot.caption = element_text(size = 6),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title.y = element_text(size = 10),
  # Set the entire chart region to blank
  panel.background=element_blank(),
  plot.background=element_blank(),
  #panel.border=element_rect(colour="#F0F0F0"),
  # Format the grid
  panel.grid.major=element_line(colour="#D0D0D0",size=.2),
  axis.ticks=element_blank())

mapTheme <- theme(plot.title =element_text(size=12),
                  plot.subtitle = element_text(size=8),
                  plot.caption = element_text(size = 6),
                  axis.line=element_blank(),
                  axis.text.x=element_blank(),
                  axis.text.y=element_blank(),
                  axis.ticks=element_blank(),
                  axis.title.x=element_blank(),
                  axis.title.y=element_blank(),
                  panel.background=element_blank(),
                  panel.border=element_blank(),
                  panel.grid.major=element_line(colour = 'transparent'),
                  panel.grid.minor=element_blank(),
                  legend.direction = "vertical", 
                  legend.position = "right",
                  plot.margin = margin(1, 1, 1, 1, 'cm'),
                  legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))

palette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
palette2 <- c("#6baed6","#08519c")

palette_plasma = c("#0d0887", "#7e03a8", "#cc4778", "#f89540", "#f0f921")

Data

I gathered Citibike’s ridership data from September and October 2022, and sourced demographic data from the US Census Bureau’s 2020 American Community Survey. Finally, I relied upon NYC’s open data service for borough and neighborhood boundaries.

brooklyn <- st_read("https://data.cityofnewyork.us/resource/7t3b-ywvw.geojson") %>%
  st_transform("EPSG:4326") %>%
  filter(boro_name == "Brooklyn")

bk_neighborhoods <-
  st_read("https://data.cityofnewyork.us/resource/q2z5-ai38.geojson") %>%
  st_transform("EPSG:4326") %>%
  filter(boro_name == "Brooklyn") %>%
  dplyr::select(-ntacode, -shape_area, -county_fips, -shape_leng, -boro_name, -boro_code) %>%
  rename(neighborhood = ntaname)
citibike_sep22 <- read_csv("data/202209-citibike-tripdata.csv")#%>%
  #filter(ended_at > "2022-09-16 23:59:59")

citibike_oct22 <- read_csv("data/202210-citibike-tripdata.csv")%>%
  filter(ended_at < "2022-10-17 00:00:00")

citibike <- rbind(citibike_sep22, citibike_oct22)

# Filter out approximate lat/long of Brooklyn
approx_bk_trips <- citibike %>%
  filter(start_lat < 40.740 & start_lat > 40.563 & start_lng < -73.852 & start_lng > -74.046)

# Identify singular stations in approximately Brooklyn
approx_stations_list <- approx_bk_trips %>%
  group_by(start_station_id, start_station_name) %>%
  summarize(from_lat = mean(start_lat),
            from_lng = mean(start_lng)) %>%
  as.data.frame()

# Convert to geometry
approx_stations_geom <- approx_stations_list %>%
  st_as_sf(coords = c("from_lng", "from_lat"), crs = 4326)

# Create intersect logic
stations_in_bk <- st_intersects(approx_stations_geom, brooklyn, sparse = FALSE)

# Add column to identify stations that are in BK, with geometry
bk_stations_geom <- cbind(approx_stations_geom, stations_in_bk) %>%
  filter(stations_in_bk == TRUE)

# Add column to identify stations that are in BK, without geometry
bk_stations_list <- cbind(approx_stations_list, stations_in_bk) %>%
  filter(stations_in_bk == TRUE)

# Inner join our stations list with the approximated trips, in order to get trips that only begin in Brooklyn.
bk_trips <- inner_join(approx_bk_trips, bk_stations_list, by = "start_station_id") %>%
  dplyr::select(-start_station_name.y) %>%
  rename(start_station_name = start_station_name.x,
         station_lat = from_lat,
         station_lng = from_lng)

I used date parsing to bin the historical ridership data into 15- and 60-minute intervals.

trips <- bk_trips %>%
  mutate(interval60 = floor_date(ymd_hms(started_at), unit = "hour"),
         interval15 = floor_date(ymd_hms(ended_at), unit = "15 mins"),
         week = week(interval60),
         dotw = wday(interval60, label=TRUE))

#glimpse(trips)
bk_census <- 
  get_acs(geography = "tract", 
          variables = c("B01003_001", "B19013_001", 
                        "B02001_002", "B08013_001",
                        "B08012_001", "B08301_001", 
                        "B08301_010", "B01002_001"), 
          year = 2020, 
          state = "NY", 
          geometry = TRUE, 
          county=c("Kings"),
          output = "wide") %>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         Med_Age = B01002_001E,
         White_Pop = B02001_002E,
         Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E) %>%
  select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
         Means_of_Transport, Total_Public_Trans,
         Med_Age,
         GEOID, geometry) %>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Mean_Commute_Time = Travel_Time / Total_Public_Trans,
         Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)
bk_tracts <- 
  bk_census %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  select(GEOID, geometry) %>% 
  st_sf() %>%
  st_transform("EPSG:4326")

I joined census tracts to bike stations so that aggregated demographic information would be available for each bike station.

station_census <- st_join(trips %>% 
          filter(is.na(start_lng) == FALSE &
                   is.na(start_lat) == FALSE &
                   is.na(end_lat) == FALSE &
                   is.na(end_lng) == FALSE) %>%
          st_as_sf(., coords = c("start_lng", "start_lat"), crs = 4326),
        bk_tracts %>%
          st_transform(crs=4326),
        join=st_intersects,
              left = TRUE) %>%
  rename(Origin.Tract = GEOID) %>%
  mutate(start_lng = unlist(map(geometry, 1)),
         start_lat = unlist(map(geometry, 2)))%>%
  st_join(., bk_neighborhoods, join = st_intersects, left = TRUE) %>%
  rename(start_nbrhood = neighborhood) %>%
  as.data.frame() %>%
  select(-geometry)%>%
  st_as_sf(., coords = c("end_lng", "end_lat"), crs = 4326) %>%
  st_join(., bk_tracts %>%
            st_transform(crs=4326),
          join=st_intersects,
          left = TRUE) %>%
  rename(Destination.Tract = GEOID)  %>%
  mutate(end_lng = unlist(map(geometry, 1)),
         end_lat = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)

Engineer features

Weather

I used JFK airport’s weather data to identify the temperature, wind speed, and precipitation in the New York City area on an hourly basis. I then plotted the temperature, wind speed, and precipitation trends over our study period.

This chart shows us that the first week of October saw unusually cold, windy, and rainy weather - this is likely to have had an effect on ridership.

weather.Panel <- 
  riem_measures(station = "JFK", date_start = "2022-09-01", date_end = "2022-10-17") %>%
  dplyr::select(valid, tmpf, p01i, sknt)%>%
  replace(is.na(.), 0) %>%
    mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
    mutate(week = week(interval60),
           dotw = wday(interval60, label=TRUE)) %>%
    group_by(interval60) %>%
    summarize(Temperature = max(tmpf),
              Precipitation = sum(p01i),
              Wind_Speed = max(sknt)) %>%
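    # Hourly temperature readings of exactly 0 are treated as missing and imputed with a constant 42°F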
    mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))

glimpse(weather.Panel)
grid.arrange(
  ggplot(weather.Panel, aes(interval60,Precipitation, color = Precipitation)) + geom_line() + scale_color_gradient(low = "#d0d1e6", high = "#0570b0")+
  labs(title="Percipitation", x="Hour", y="Perecipitation") + plotTheme,
  
  ggplot(weather.Panel, aes(interval60,Wind_Speed, color = Wind_Speed)) + geom_line() + 
    scale_color_gradient(low = "#9ebcda", high = "#4d004b")+
    labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme,
  
  ggplot(weather.Panel, aes(interval60,Temperature, color = Temperature)) + geom_line() + 
    scale_color_gradient(low = "#74add1", high = "#f46d43")+
    labs(title="Temperature", x="Hour", y="Temperature") + plotTheme,
  
  top="Weather Data - New York JFK - Sep-Oct, 2022")

Time Effects

To include time effects in my model, I began by creating a panel data set with a row for every unique station for each hour of the study period, and combined this with our census data and weather data.

#length(unique(station_census$interval60)) * length(unique(station_census$start_station_id))


study.panel <- 
  expand.grid(interval60=unique(station_census$interval60), 
              start_station_id = unique(station_census$start_station_id)) %>%
  left_join(., station_census %>%
              select(start_station_id, start_station_name, Origin.Tract, start_lng, start_lat, start_nbrhood)%>%
              distinct() %>%
              group_by(start_station_id) %>%
              slice(1))

#nrow(study.panel)      
ride.panel <- 
  station_census %>%
  mutate(Trip_Counter = 1) %>%
  right_join(study.panel) %>% 
  group_by(interval60, start_station_id, start_station_name, Origin.Tract, start_lng, start_lat, start_nbrhood) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
  left_join(weather.Panel) %>%
  ungroup() %>%
  filter(is.na(start_station_id) == FALSE) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  filter(is.na(Origin.Tract) == FALSE)
ride.panel <- 
  left_join(ride.panel, bk_census %>%
              as.data.frame() %>%
              select(-geometry), by = c("Origin.Tract" = "GEOID"))

Create time lags

I created time lag variables to capture demand at a station 1, 2, 3, 4, 12, and 24 hours earlier. If these lags are correlated with current demand (they are, with the exception of the 12-hour lag), they can be used to predict future demand. The shorter time lags (e.g. the 1- and 2-hour lags) let us forecast trip demand on short notice, which would be very useful for the pre-peak rebalancing incentives.

The longer time lags (such as the 4- and 24-hour lags) will be more useful for the overnight trucking. The 24-hour lag can help predict much further in advance, letting us set labor schedules, while the 4-hour lag can help us fine-tune where to place the rebalanced bikes.

I also created a variable for the United Nations General Assembly (UNGA) meeting, held in New York every September (in 2022, it was held September 21-23). This event often sees an uptick in traffic, which appears to have led to higher bikeshare demand.

I also created a variable to control for the effect of holidays and associated holiday weekends - specifically, for the Labor Day (September 5) and Columbus Day (October 10) holidays.

ride.panel <- 
  ride.panel %>% 
  arrange(start_station_id, interval60) %>% 
  mutate(lagHour = dplyr::lag(Trip_Count,1),
         lag2Hours = dplyr::lag(Trip_Count,2),
         lag3Hours = dplyr::lag(Trip_Count,3),
         lag4Hours = dplyr::lag(Trip_Count,4),
         lag12Hours = dplyr::lag(Trip_Count,12),
         lag1day = dplyr::lag(Trip_Count,24),
         holiday = ifelse(yday(interval60) %in% c(248, 283),1,0), #Labor Day, Columbus Day
         unga = ifelse(yday(interval60) %in% c(264, 265, 266), 1, 0)) %>% #NYC UNGA 
   mutate(day = yday(interval60)) %>%
   mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
                                 dplyr::lag(holiday, 2) == 1 ~ "PlusTwoDays",
                                 dplyr::lag(holiday, 3) == 1 ~ "PlusThreeDays",
                                 dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
                                 dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
                                 dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
         holidayLag = ifelse(is.na(holidayLag) == TRUE, 0, holidayLag),
         ungaLag = case_when(dplyr::lag(unga, 1) == 1 ~ "PlusOneDay",
                                 dplyr::lag(unga, 2) == 1 ~ "PlusTwoDays",
                                 dplyr::lag(unga, 3) == 1 ~ "PlusThreeDays",
                                 dplyr::lead(unga, 1) == 1 ~ "MinusOneDay",
                                 dplyr::lead(unga, 2) == 1 ~ "MinusTwoDays",
                                 dplyr::lead(unga, 3) == 1 ~ "MinusThreeDays"),
         ungaLag = ifelse(is.na(ungaLag) == TRUE, 0, ungaLag))
as.data.frame(ride.panel) %>%
    group_by(interval60) %>% 
    summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
    gather(Variable, Value, -interval60, -Trip_Count) %>%
    mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours", "lag12Hours","lag1day")))%>%
    group_by(Variable) %>%  
    summarize(correlation = round(cor(Value, Trip_Count),2))

Exploratory Analysis

The overall time pattern of trips in Brooklyn is plotted below. There is a clear daily cycle of high ridership during weekdays and lull periods on the weekends. There is also a lull at the beginning of October - the same week in which we earlier saw an increase in precipitation in the region.

There appears to be a high point of ridership around September 21 - the first day of UNGA meetings.

There is also a lull around the Labor Day holiday (September 5), and a smaller, shallower lull around Columbus Day (October 10).

ggplot(station_census %>%
         group_by(interval60) %>%
         tally())+
  geom_line(aes(x = interval60, y = n))+
  labs(title="Bike share trips per hr. Brooklyn, Sep-Oct, 2022",
       x="Date", 
       y="Number of trips")+
  plotTheme

Below is a plot of the distribution of trip volume by station for different times of the day. Demand is spread out across stations in the morning and evening rushes as well as midday, but concentrated in a select set of stations for overnight trips.

station_census %>%
        mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
         group_by(interval60, start_station_name, time_of_day) %>%
         tally()%>%
  group_by(start_station_name, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips), fill = "#cc4778", binwidth = 1)+
  labs(title="Mean Number of Hourly Trips Per Station. Brooklyn, Sep-Oct 2022",
       x="Number of trips", 
       y="Frequency")+
  facet_wrap(~time_of_day)+
  plotTheme

Comparing weekend and weekday trip counts: Below are plots comparing the number of bike trips by hour of the day. The first plot is broken out by day of the week, and we can see a pattern of weekday peaks in the morning around 8 am and in the evening around 5 pm. Weekends have a different pattern, with ridership peaking around 3 pm.

The second plot groups weekdays and weekend days together to compare the two more directly, and demonstrates these ridership peaks and lulls for the different time periods.

ggplot(station_census %>% mutate(hour = hour(started_at)))+
     geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
  scale_color_viridis(option = "plasma", discrete = TRUE)+
  labs(title="Bike share trips in Brooklyn, by day of the week, Sep-Oct, 2022",
       x="Hour", 
       y="Trip Counts",
       color = "Day of\nthe week")+
     theme_bw()

ggplot(station_census %>% 
         mutate(hour = hour(started_at),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
     geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
   scale_color_manual(values = c("#cc4778", "#0d0887"))+
  labs(title="Bike share trips in Brooklyn - weekend vs weekday, Sep-Oct, 2022",
       x="Hour", 
       y="Trip Counts",
       color = "Weekend")+
     theme_bw()

Trips by station, by time of day: this map shows where most trips begin at different times of day during the week and weekend. Most trips appear to cluster in Williamsburg, DUMBO, and Downtown Brooklyn, particularly on weekday evenings and overnight. Demand is generally spread more evenly on the weekends.

ggplot()+
  #geom_sf(data = boroughs)+
  geom_sf(data = bk_tracts %>%
          st_transform(crs=4326), fill = "gray", color = "white")+
  geom_point(data = station_census %>% 
            mutate(hour = hour(started_at),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
                time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
              group_by(start_station_id, station_lat, station_lng, weekend, time_of_day) %>%
              tally(),
            aes(x=station_lng, y = station_lat, color = n),
            fill = "transparent", alpha = 0.8, size = 0.3)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "C")+
  ylim(min(station_census$station_lat), max(station_census$station_lat))+
  xlim(min(station_census$station_lng), max(station_census$station_lng))+
  facet_grid(weekend ~ time_of_day)+
  labs(title="Bike share trips per hr by station. Brooklyn, Sep-Oct, 2022")+
  mapTheme

What does bike demand look like across the borough over the course of the day? Below is an animated map of bikeshare demand over one Sunday, in 15-minute intervals. DUMBO, Williamsburg, and Prospect Park appear to be especially popular places to start a trip, which means that weekend incentives should target these areas just before their peak demand periods.

sep_day <- filter(station_census, yday(interval60) == 261) %>% #Sep 18
  filter(started_at < "2022-09-19 00:00:00") %>%
  mutate(Trip_Counter = 1) %>%
  dplyr::select(-rideable_type, -started_at, -ended_at, -end_station_name,   -end_station_id, -member_casual, -stations_in_bk, -interval60, -week, -dotw, -Destination.Tract, -end_lng, -end_lat)

sep_day.panel <- expand.grid(interval15 = unique(sep_day$interval15),
                            start_station_id = unique(sep_day$start_station_id))

bk_animation_data_stations <-
  left_join(sep_day.panel, sep_day, by = c("interval15", "start_station_id")) %>%
  group_by(interval15, start_station_id) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm = T)) %>%
  ungroup() %>%
  full_join(bk_stations_geom, by="start_station_id") %>%
  st_sf() %>%
    mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
                             Trip_Count > 0 & Trip_Count <= 3 ~ "1-3 trips",
                             Trip_Count > 3 & Trip_Count <= 6 ~ "4-6 trips",
                             Trip_Count > 6 & Trip_Count <= 10 ~ "7-10 trips",
                             Trip_Count > 10 ~ "11+ trips")) %>%
    mutate(Trips  = fct_relevel(Trips, "0 trips","1-3 trips","4-6 trips",
                                       "7-10 trips","11+ trips"))


bk_animation_stations <-
  ggplot()+
  geom_sf(data = bk_tracts %>%
          st_transform(crs=4326), fill = "gray", color = "white")+
  geom_sf(data = bk_animation_data_stations, aes(color = Trips))+
  scale_color_manual(values = palette_plasma)+
  labs(title = "Citibike Trips for one Sunday in September",
       subtitle = "Brooklyn, September 2022\n15 minute intervals: {current_frame}")+
  ylim(min(station_census$station_lat), max(station_census$station_lat))+
  xlim(min(station_census$station_lng), max(station_census$station_lng))+
  transition_manual(interval15)+
  mapTheme

animate(bk_animation_stations, duration = 20, renderer = gifski_renderer())

Models and Predictions

I split the data into a training and a test set and created five linear models, each with a different set of key variables. The training set was made up of September, while the test set was the first two weeks of October.

The first model includes two temporal controls (time of day, day of the week), and temperature.

The second model includes the station itself (which therefore includes the fixed effect of nearby attributes, such as distance to parks or other amenities), the day of the week and temperature.

The third model includes the station itself, the time of day, the day of the week, temperature, and precipitation.

The fourth model includes the station itself, the time of day, the day of the week, temperature, precipitation, and the time lag variables.

The fifth and final model includes the station itself, the time of day, the day of the week, temperature, precipitation, the time lag variables, and the holiday and UNGA events and their temporal lags.

ride.Train <- dplyr::filter(ride.panel, week >= 36 & week < 40)
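# Weeks 36-39 fall in September; weeks 40-41 are the first two weeks of October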
ride.Test <- dplyr::filter(ride.panel, week > 39 & week < 42)
reg1 <- 
  lm(Trip_Count ~  hour(interval60) + dotw + Temperature,  data=ride.Train)

reg2 <- 
  lm(Trip_Count ~  start_station_name + dotw + Temperature,  data=ride.Train)

reg3 <- 
  lm(Trip_Count ~  start_station_name + hour(interval60) + dotw + Temperature + Precipitation, 
     data=ride.Train)

reg4 <- 
  lm(Trip_Count ~ start_station_name +  hour(interval60) + dotw + Temperature + Precipitation + lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day, 
     data=ride.Train)

reg5 <- 
  lm(Trip_Count ~  start_station_name + hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday + unga + ungaLag, 
     data=ride.Train)

Predictions

I created a nested data frame of the test data by week. A nested data frame is essentially a table whose cells contain other tables, which is a compact way to store and work with grouped data. I then used purrr’s “map” function to generate predictions of trip demand for each week in the test set, with a separate set of predictions for each of the five regression models.
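To make the idea of nesting concrete, here is a toy illustration (separate from the analysis pipeline) of how tidyr’s nest() tucks each group’s rows into a list-column of smaller data frames; mtcars and the grouping by cyl are purely illustrative.

library(tidyr)
library(dplyr)

# Toy example: nest mtcars by cylinder count. Each row of `nested` holds,
# in the list-column `data`, a data frame of all cars with that cyl value.
nested <- mtcars %>%
  nest(data = -cyl)

nested
# A 3-row tibble with columns `cyl` and `data`; purrr::map() can then be
# applied to the `data` column, as is done with the weekly test data below.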

ride.Test.weekNest <- ride.Test %>%
  nest(-week) 
model_pred <- function(bk_trips, fit){
   pred <- predict(fit, newdata = bk_trips)}
week_predictions <- 
  ride.Test.weekNest %>% 
    mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
           BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
           CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
           DTime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
           ETime_Space_FE_timeLags_holidayLags = map(.x = data, fit = reg5, .f = model_pred)) %>% 
    gather(Regression, Prediction, -data, -week) %>%
    mutate(Observed = map(data, pull, Trip_Count),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
           sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

week_predictions

Error Metrics

Below is a plot of the Mean Absolute Errors (MAE) by model and week. It appears that model 4 (which includes time, space, and time lags as predictors) performs slightly better than the model with holiday lags, and much better than the models without any time lags.

It is slightly surprising that Model 4 did better than the model with holiday lags - however, this is likely due to the types of holidays in the training and test sets. The training set includes the Labor Day holiday, a major holiday that most employers recognize, so the model learned the impact of a large holiday on the trip count. In the test set, the only holiday was Columbus Day - a minor holiday which few employers observe. Since the model was trained on holiday behavior for Labor Day, it likely mispredicted bike demand for the Columbus Day holiday.

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    #scale_fill_manual(values = palette5) +
  scale_fill_viridis(option = "plasma", discrete = TRUE)+
    labs(title = "Mean Absolute Errors by model specification and week") +
  plotTheme

Below is the bike share time series of predicted and observed bike demand for the test set. As seen in the previous plot, the first three models don’t do a very good job of predicting bike demand, while the fourth and fifth models, which both include time lags, are almost indistinguishable.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id)) %>%
    dplyr::select(interval60, start_station_id, Observed, Prediction, Regression) %>%
    unnest() %>%
    gather(Variable, Value, -Regression, -interval60, -start_station_id) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = sum(Value)) %>%
    ggplot(aes(interval60, Value, colour=Variable)) + 
  scale_color_manual(values = c("#cc4778", "#0d0887"))+
      geom_line(size = 1.1) + 
      facet_wrap(~Regression, ncol=1) +
      labs(title = "Predicted/Observed bike share time series", subtitle = "Brooklyn; A test set of 2 weeks",  x = "Hour", y= "Station Trips") +
      plotTheme

For the remaining analysis, I selected the fourth model, which has the best overall goodness of fit.

Below is a plot of observed and predicted bike demand for different times of the day and week. It appears that our model performs best at predicting demand overnight, in the evening rush, and on weekdays at midday. It does a poorer job at predicting the morning rush and weekend midday trips.

This indicates that the model may need additional information pertaining to morning rush behavior and weekend midday behavior. Perhaps including locally relevant events, such as weekend festivals, might improve this model.
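As one example, an event flag could be added following the same pattern as the holiday and UNGA indicators above. The sketch below (using the packages already loaded) assumes a hypothetical vector of local event dates; the dates shown are placeholders, not verified events.

# Hypothetical local event dates (placeholders, not verified events)
event_dates <- as.Date(c("2022-09-25", "2022-10-02"))

ride.panel <- ride.panel %>%
  mutate(local_event = ifelse(yday(interval60) %in% yday(event_dates), 1, 0))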

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           start_lat = map(data, pull, start_lat), 
           start_lng = map(data, pull, start_lng),
           dotw = map(data, pull, dotw)) %>%
    select(interval60, start_station_id, start_lng, 
           start_lat, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "DTime_Space_FE_timeLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction), color = "gray")+
    geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "#cc4778")+
    geom_abline(slope = 1, intercept = 0, color = "#0d0887")+
  facet_grid(time_of_day~weekend)+
  labs(title="Observed vs Predicted",
       x="Observed trips", 
       y="Predicted trips")+
  plotTheme

Below is a map of Mean Absolute Error (MAE) by station. It appears that our errors are clustered in Williamsburg, DUMBO, and Downtown Brooklyn, all of which are very busy areas with many tourists.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           start_lat = map(data, pull, start_lat), 
           start_lng = map(data, pull, start_lng)) %>%
    select(interval60, start_station_id, start_lng, start_lat, Observed, Prediction, Regression) %>%
    unnest() %>%
  filter(Regression == "DTime_Space_FE_timeLags") %>%
  group_by(start_station_id, start_lng, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
  geom_sf(data = bk_census, fill = "grey", color = "white")+
  geom_point(aes(x = start_lng, y = start_lat, color = MAE), 
             fill = "transparent", alpha = 0.4)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "plasma")+
  ylim(min(bk_trips$start_lat), max(bk_trips$start_lat))+
  xlim(min(bk_trips$start_lng), max(bk_trips$start_lng))+
  labs(title="Mean Abs Error, Test Set, Model 4")+
  mapTheme

How do these errors in space look over time? Below, the errors are mapped by station by weekend/weekday and time of day.

Our errors appear to be concentrated on weekends, particularly in the midday and evening, especially by Prospect Park and near the water in DUMBO and Williamsburg - areas that are popular for leisure activities. On weekdays, errors are more spread out across residential areas, particularly in the morning and evening rush hours.

These errors in predictions have important implications for my rebalancing plan:

  • First and foremost, these errors need to be taken in the context of the station: if we incorrectly predicted demand by five bicycles, that may matter a lot at a station with 10 bikes, but may not matter as much at a station with 20 bikes. These errors may need to be normalized so that we prioritize filling stations whose bike stock is especially depleted compared to surrounding stations (see the sketch after this list).

  • Pre-Peak incentives will need to especially target the weekday evening rush - particularly areas where our forecast errors cluster (Williamsburg, downtown Brooklyn).

  • Overnight rebalancing via trucks is likely to work best to prepare for the weekday morning rush, since the errors in demand forecasting for this time are spread out.

  • Weekends would be the most difficult to rebalance for. Pre-peak incentives may help by targeting key clusters in the mid-day and evening rush, especially near Prospect Park and in Williamsburg.
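To sketch the normalization idea from the first bullet: assuming a hypothetical station_capacity table with a docks column (dock counts are not part of the trip data used here), and a per-station MAE summary like the one computed for the error maps above (called station_mae below), errors could be expressed relative to station size so that rebalancing targets the stations where a miss of a few bikes matters most.

# Sketch only: `station_capacity` (dock counts per station) is hypothetical, and
# `station_mae` is assumed to hold one row per station with an MAE column.
prioritized_stations <- station_mae %>%
  left_join(station_capacity, by = "start_station_id") %>%
  mutate(error_share_of_docks = MAE / docks) %>%  # error relative to station size
  arrange(desc(error_share_of_docks))             # rebalance these stations first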

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           start_lat = map(data, pull, start_lat), 
           start_lng = map(data, pull, start_lng),
           dotw = map(data, pull, dotw) ) %>%
    select(interval60, start_station_id, start_lng, 
           start_lat, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "DTime_Space_FE_timeLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start_station_id, weekend, time_of_day, start_lng, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = bk_census, fill = "grey", color = "white")+
  geom_point(aes(x = start_lng, y = start_lat, color = MAE), 
             fill = "transparent", size = 0.5, alpha = 0.4)+
  scale_colour_viridis(option = "plasma", direction = -1,
  discrete = FALSE)+
  ylim(min(bk_trips$start_lat), max(bk_trips$start_lat))+
  xlim(min(bk_trips$start_lng), max(bk_trips$start_lng))+
  facet_grid(weekend~time_of_day)+
  labs(title="Mean Absolute Errors, Test Set")+
  mapTheme

How generalizable is the model across groups? Below, our model’s errors for weekday morning trips are plotted against each area’s median income, the percent of residents who take public transit, and the percent of white residents. We can see that the model has larger errors in wealthier and whiter areas, and only slightly larger errors in areas where a larger proportion of residents take public transit. However, this bias can be considered “baked in” to the data, since Citibike stations are predominantly located in whiter and wealthier areas of Brooklyn.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           start_lat = map(data, pull, start_lat), 
           start_lng = map(data, pull, start_lng),
           dotw = map(data, pull, dotw),
           Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
           Med_Inc = map(data, pull, Med_Inc),
           Percent_White = map(data, pull, Percent_White)) %>%
    select(interval60, start_station_id, start_lng, 
           start_lat, Observed, Prediction, Regression,
           dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
    unnest() %>%
  filter(Regression == "DTime_Space_FE_timeLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  filter(time_of_day == "AM Rush") %>%
  group_by(start_station_id, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  gather(-start_station_id, -MAE, key = "variable", value = "value")%>%
  ggplot(.)+
  geom_point(aes(x = value, y = MAE), alpha = 0.4)+
  geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE, color = "#cc4778")+
  facet_wrap(~variable, scales = "free")+
  labs(title="Errors as a function of socio-economic variables",
       y="Mean Absolute Error (Trips)")+
  plotTheme

Cross-Validation

I performed a Leave-One-Group-Out (LOGO) cross-validation on our rides dataset, grouping by the census tract where each trip starts. A LOGO cross-validation provided a computational advantage over k-fold cross-validation, which would have taken approximately 100 hours to complete on a dataset of this size. The cross-validation used Model 4’s predictors, with the exception of the station itself, since that is collinear with the census tract of the trip’s start.

crossValidate <- function(dataset, id, dependentVariable, indVariables){ #factor_var) {
  
  allPredictions <- data.frame()
  cvID_list <- unique(dataset[[id]])
  
  for (i in cvID_list) {
    
    thisFold <- i
    cat("This hold out fold is", thisFold, "\n")
    
    fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, #geometry, 
                    all_of(indVariables),
                    all_of(dependentVariable))
    fold.test  <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, #geometry, 
                    all_of(indVariables),
                    all_of(dependentVariable))
    #MDH added
      #fold.test[,factor_var] <- factor(fold.test[,factor_var], levels = dataset[,factor_var])

    form_parts <- paste0(dependentVariable, " ~ ", paste0(indVariables, collapse = "+"))
    form <- as.formula(form_parts)
    regression <- lm(form,
                      data = fold.train %>%
                        dplyr::select(#-geometry, 
                          -id))
    
    thisPrediction <-
      mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
    
    allPredictions <-
      rbind(allPredictions, thisPrediction)
    
  }
  return(#st_sf(
    allPredictions)#)
}
rides.CV <- crossValidate(
  dataset = (ride.panel %>% mutate(hour = hour(interval60),
                                   start_station_id = factor(start_station_id))), 
  id = "Origin.Tract",
  dependentVariable = "Trip_Count",
  indVariables = c(#"start_station_name", 
    "hour", "dotw", "Temperature", "Precipitation", "lagHour", "lag2Hours", "lag3Hours", "lag12Hours", "lag1day")) %>%
  dplyr::select(cvID = Origin.Tract, Trip_Count, Prediction)
error_by_reg_and_fold <- 
  rides.CV %>%
    group_by(cvID) %>% 
    summarize(Mean_Error = mean(Prediction - Trip_Count, na.rm = T),
              MAE = mean(abs(Mean_Error), na.rm = T),
              SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>%
  ungroup()

error_by_reg_and_fold %>% 
  arrange(desc(MAE))
error_by_reg_and_fold %>% 
  arrange(MAE)

Below is a histogram of the Mean Absolute Error from our LOGO cross-validation. The distribution of these errors is not normal, suggesting that there may be a spatial or temporal pattern within the errors that the model has missed and could improve upon. The range of our MAE is encouragingly small - between 0 and just below 0.7 - indicating that our model is quite accurate.

## plot histogram of OOF (out of fold) errors
error_by_reg_and_fold %>%
  ggplot(aes(MAE)) + 
    geom_histogram(bins = 30, colour="black", fill = "#cc4778") +
  scale_x_continuous(breaks = seq(0, 11, by = 0.1)) + 
  geom_vline(xintercept = mean(error_by_reg_and_fold$MAE), linetype = "dashed", color = "#0d0887", size = 1)+
    labs(title="Distribution of MAE by Tract", subtitle = "LOGO-CV, mean MAE indicated with dashed line",
         x="Mean Absolute Error", y="Count")+
  #annotate(x = mean(error_by_reg_and_fold$MAE)-0.03, y = 48, label = "Mean MAE", vjust = 0.5, geom = "label")+
  theme_bw()

This map of our MAE by census tract illustrates what we saw with our original prediction errors - errors are not spatially generalizable, and are instead concentrated in certain areas: Williamsburg and Downtown Brooklyn. However, since this is a spatial analysis and not a temporal analysis, it does not tell us when these errors may cluster in time, only where they cluster in space.

error_geoms <- error_by_reg_and_fold %>% 
  left_join(., bk_census, by = c("cvID" = "GEOID")) %>%
  st_as_sf()%>%
  st_transform("EPSG:4326")

ggplot()+
  geom_sf(data = bk_census, fill = "gray", color = "black")+
  geom_sf(data = error_geoms, aes(fill = MAE), color = "black")+
  scale_fill_viridis(option = "plasma", name="Mean Absolute Error") +
  #geom_sf(data = brooklyn, fill = "NA", color = "black")+
  labs(title="LOGO-CV MAE by Census Tract, Brooklyn") +
  ylim(min(station_census$station_lat), max(station_census$station_lat))+
  xlim(min(station_census$station_lng), max(station_census$station_lng))+
  mapTheme

How does our cross-validation error generalize across groups? Below is a table comparing the mean MAE for tracts above and below the NYC poverty line ($36,262 in 2019). Stations in tracts above the poverty line have larger errors than those in tracts below the poverty line. While it is encouraging that our model’s errors are not biased towards poorer tracts, this may be a function of the design of the Citibike system itself: Citibike stations are currently concentrated in wealthier areas of the city, which means there is more opportunity to predict incorrectly in these areas than in poorer areas with fewer stations. As the Citibike network continues to expand across the city, it will be important to ensure that prediction errors do not become biased against poorer tracts.

income_compare <- error_by_reg_and_fold %>%
  left_join(., bk_census, by = c("cvID" = "GEOID")) %>%
  dplyr::select(cvID, Mean_Error, MAE, SD_MAE, Med_Inc) %>%
  st_drop_geometry() %>%
  na.omit(.) %>%
  mutate(Tract_Income_Level = ifelse(Med_Inc > 36262, "Above NYC Poverty Line", "Below NYC Poverty Line")) %>% # NYC 2019 poverty threshold
  dplyr::select(cvID, Tract_Income_Level, Mean_Error, MAE) %>%
  group_by(Tract_Income_Level) %>%
  summarize(mean.MAE = mean(MAE, na.rm = T)) %>%
  spread(Tract_Income_Level, mean.MAE) %>%
  kbl(caption = "Mean MAE by Tract Income Level, LOGO-CV",
      footnote = "Based on ") %>%
  kable_classic_2(full_width = F)

income_compare
Mean MAE by Tract Income Level, LOGO-CV

  Above NYC Poverty Line: 0.1174928
  Below NYC Poverty Line: 0.0717118

Conclusion

Rebalancing is a tricky puzzle for bikeshare companies like Citibike to solve. It requires forecasting changes in demand across time and space. The forecasting model I’ve created here would be very useful to the rebalancing process I’ve proposed: it predicts demand quite accurately, with minimal errors across the system. The model could therefore be used to predict bike demand several hours in advance, to effectively organize manual redistribution via scheduled truck teams, and to predict demand on a shorter time frame, which would then be used to incentivize riders to return bikes to areas of high demand before that demand occurs.