Although a bit reluctant first about availability and number of stations, this summer I decided to purchase a seasonal pass (costs 30 €) of Helsinki City bikes. As I have been living in a town outside Helsinki, the bikes have proved to be a great means of transportation for short distances where I would have otherwise needed to buy a rather costly transport ticket.

HSL webpage offers the possibility to see the basic information of the travels done with the city bikes.

The page also shows some basic data about the different trips. However due to a lack of possibility to download the data, an easy option in order to get the data into a tabular form is web scraping.

So let’s dive into web scraping with R!

Web scrapping step by step

In order to simplify the scrapping since we need to sign in, first step is to go to the HSL city bike webpage and downloading the page locally as HTML.

Then we need to go to the page where all travel segments are shown by clicking Katso Kaikki (show all)

First step is to click right mouse button inside the webpage and click save as

Save as

Just save the complete webpage into your project folder


Next step is to Inspect the page elements (this can be done in the actual webpage, or also in the downloaded page).

Now we can see some of the code of the webpage, good! But we only need certain parts. So next step is to click the button to select elements, or use the shorcut (Ctrl + Shift + C).

Now every time we hover something in the webpage, it shows parts of its code. As an example, the departure station

Final step is to get the selector for that specific element. We can do that by right mouse click button in the code part related to that element, then we choose Copy -> Copy selector

From there we get the following selector for that specific element:

‘#activity-feed > div > div:nth-child(31) > div.station-info > span.return-station’

However we are interested in getting all the similar pieces of text, in this case departure-station. For that we can delete the part :nth-child(31), thus getting the whole list.

We need to do this for all the relevant points of data we are interested on scrapping. Once we have those we can start with getting the info we are interested in.

We will use read_html() from xml2 package and html_nodes() and html_text() from rvest package. Here is the list of all the packages used:

#Data transformation
#API calls to HSL
library(ghql) #devtools::install_github("ropensci/ghql")
#Spatial data and maps
#Nice table formatting
#personal library

First step is reading the html page.

webpage <- read_html(here::here("static","data","citybikes","bike_routes.html"))

Then we will create a named vector with the element selectors.Note that here the selectors are shorter, since in the next step we will be pasting the selector initial part that is repeated in all of the selectors.

selectors <- c(departure_station = "station-info > span.departure-station",
  return_station = "station-info > span.return-station",
  departure_time = "date-info > span.departure-date",
  return_time = "date-info >  span.return-date",
  duration = "trip-info > span.duration",
  distance = "trip-info > span.covered-distance")

Finally, we will loop over all the selectors to read those specific nodes and convert them into text. Column names are defined by the names of the vector.

df_raw <- map_dfc(selectors,~html_nodes(webpage,paste0( "#activity-feed > div > div > div.",.x)) %>% 
  .id = names(.x))

Data transformation

When data scraping is over, now we need to clean the data into something usable for the analysis including station ID numbers and getting numbers for duration and distance. Trips starting and ending in same station are excluded since they are “false rides”.

df <- df_raw %>% 
  mutate(departure_time = dmy_hm(departure_time),
    arrival_time = dmy_hm(return_time),
    duration_min = parse_number(duration),
    distance_km = parse_number(distance),
    dep_station_id = str_sub(departure_station,1,3),
    arr_station_id = str_sub(return_station,1,3)) %>% 
  filter(dep_station_id != arr_station_id)

In order to get the coordinates of the stops we will make a data_frame with the unique departure-arrival combinations.

stops_unique <- df %>% 
  select(dep_station_id,arr_station_id) %>%

Coordinates of bike stations can be found in Helsinki and Vantaa are separated, so rbind will be used to get data of all stations.

all_stations_raw <- read_csv(here::here("static","data","citybikes","HKI_bikestations.csv")) %>% 

all_stations <- all_stations_raw %>%

Finally, we join the departure station ID with the coordinates, followed by doing the same with the arrival station. This provides us a data.frame with the travels and coordinates for arrival and departure. Which will be used to feed the API and get the route and estimated times.

df_with_coord <- df %>% 
  left_join(all_stations,by = c("dep_station_id" = "ID")) %>% 
  left_join(all_stations,by = c("arr_station_id" = "ID"),suffix = c("_dep_","_arr_")) 

Talking API

Next step is to use an API to get estimated times between directions as well as route details.

HSL offers a graphql API environment to fetch data from. ropensci offers a package to do API calls from R, this are the steps needed to follow:

Initialize client

path <- ""
client <- GraphqlClient$new(
  url = path

Make a Query class object

qry <- Query$new()

Query function

route_query <- function(from_lat,from_lon,to_lat,to_lon){
  #Generate random ID for the query
 query_hash <- stringi::stri_rand_strings(1,5) 
    from: {lat: <<from_lat>>, lon: <<from_lon>>}
    to: {lat: <<to_lat>>, lon: <<to_lon>>}
    numItineraries: 1
    transportModes: [{mode: BICYCLE}]
    itineraries {
      legs {
        legGeometry {
  ',.open = "<<", 
    .close = ">>"))
#Always ghet the last response from the queries
response <- client$exec(qry$queries[[length(qry$queries)]])
#Convert the response to more readable format
result <- fromJSON(response,flatten = TRUE) %>% 
#Filter routes, to take just the longest single trip. (Avoids considering walking few meters)
result[[1]][[1]] %>% 
  filter(mode == "BICYCLE") %>% 
  filter(distance == max(distance))
#Example query

Once query function is tested and working, we will again use the map family to pass coordinated arguments into the query and arrange them into a data frame.

routes_df <- pmap_dfr(list(df_with_coord$y_dep_,df_with_coord$x_dep_,


Saving the output, to avoid repeating the API calls in the future.

Data wrangling and geometries

The API returns a handy googlePolylines text string to define the route. This needs to be converted into points.

routes_df <- routes_df %>% 
  mutate(duration_min = duration/60,
    distance_km = distance/1000) %>% 
  select(duration_min_HSL = duration_min,
    distance_km_HSL = distance_km,
    polyline = legGeometry.points)

points_all <- map_dfr(routes_df$polyline,~decode(.x) %>%,.id = "trip")

For handling easier, SF will be used and a table with each route and the containing LINESTRING, which is very easy to plot with geom_sf()

linestrings <- points_all %>% 
  mutate(point_order = row_number()) %>% 
  st_as_sf(coords = c("lon","lat")) %>% 
  sf::st_set_crs(4326) %>% 
  group_by(trip) %>% 
  summarize(n = n(),do_union = FALSE) %>% 

Final wranglings are combining the data from the webpage and API, together with the newly created linestring table.

all_data_df <- cbind(df_with_coord,routes_df) %>% 
  mutate(time_diff = duration_min_HSL - duration_min,
    dist_diff = distance_km_HSL - distance_km,
    trip_num = as.character(row_number()) )

all_data_sf <- linestrings %>% 
  left_join(all_data_df,by = c("trip" = "trip_num"))

Simple data exploration

With the data ready and some simple dplyr, it is easy to find the longest bike ride. According to the API, it took 27 minutes more than needed to complete the ride. This was due to a distraction on the way, there was a flea market in the area.

all_data_df %>% 
  filter(duration_min == max(duration_min)) %>% 
  select(departure_station,return_station,departure_time,time_diff,dist_diff) %>% 
  kable() %>%
141 Intiankatu137 Arabian kauppakeskus2019-05-25 11:52:00-27.05-1.353054

In general I have been riding faster than the time defined by the HSL API, which might mean I was taking shortcuts from roads without designated biking lane, or then the speed defined in the API is a safe assumption and rather slow.

ggplot(all_data_sf) +
  geom_jitter(mapping = aes(duration_min,duration_min_HSL)) +
  geom_abline(intercept = 0, slope = 1,color = "red",linetype = "dashed") +
   geom_text(aes(x = 6.5,y = 18,label = "Same route",vjust = 0.5),
    family = "Open Sans",color = "#6d6d6d",size = 3.5)  +
    geom_curve( mapping = aes(x = 11,y = 14.5,xend = 8,yend = 18),
    arrow = arrow(length = unit(0.07, "inch"),ends = "first"), 
    size = 0.4, color = "#6d6d6d",curvature = 0.3) +
  theme_jechave() +
  labs(x = "Trip duration (min)",
    y = "Trip estimation HSL (min)",
    title = "Do I ride fast?",
    subtitle = "Above red dotted line, trips riding faster than estimation")

In the plot it can be seen that there is a group of points gathered horizontally, these correspond to the same route done in different days. From Viikki to Arabianranta. Let’s calculate some simple statistics for these trips.

all_data_df %>% 
  filter(dep_station_id %in% c("139","239"),
    arr_station_id %in% c("139","239")) %>% 
  mutate(direction = case_when(dep_station_id == "139" ~ "Arabianranta -> Viikki",
    TRUE ~ "Viikki -> Arabianranta")) %>% 
  group_by(direction) %>% 
  summarize(trips = n(),
    mean_duration = mean(duration_min),
    max_duration = max(duration_min),
    min_duration = min(duration_min)) %>% 
  kable() %>%
Arabianranta -> Viikki1510.60000000138
Viikki -> Arabianranta911.444444441410

The trip from Arabianranta to Viikki was mainly to catch the bus, which could explain the shorter durations (fear of missing the bus).

Visualizing all the routes

To finalize, a map visualization where all the routes done with bike are shown. I have used snapbox package to work on top of mapbox and use my own map styling.

Plotting everything together for final result

 linestrings_3857 <- st_transform(linestrings, 3857)
area <- st_bbox(
  c(xmin = 24.85, ymin = 60.149, xmax = 25.05, ymax = 60.23),
  crs = 4326

map_base <- ggplot() +
  layer_mapbox(area, scale_ratio = 0.5,
    map_style = "mapbox://styles/kutteb/ck9h39xng0png1iqyyvmqchpb",
    mapbox_api_access_token = keyring::key_get("mapbox_token",keyring = "jechaveR"))

map_base +
  coord_sf(crs = st_crs(3857)) +
  geom_sf(data = linestrings_3857,color = alpha("#FED100",0.5),size = 0.75, inherit.aes = FALSE) +
theme_jechave() %+replace%
  theme(panel.grid.major = element_blank(),
    axis.text = element_blank(),
    axis.title = element_blank()) +
  labs(title = "Itineraries in Helsinki City bikes",
    subtitle = "Summer 2019, data from HSL")
ggsave("citybike_2019_itineraries.png",dpi = 900)

Conclusion: In this post we have done some web scrapping, created a personalized function that does API calls and put everything on a map to visualize the usage of the city bikes during the summer season in Helsinki. This has been my story of how I fell in love with those bikes . Cannot wait to see those yellow bikes again on the city.

Helsinki city bikes: a summer love story