Cycling data

Data are a recurring topic on this blog and the necessity to consider data with regard to availability, accessibility, quality and suitability is still increasing. This is simply because of the massive amount of mobility-related data that is constantly generated by mobile devices and stationary sensors.
Michael Batty pointed out that it was not the automation of data capture but the miniaturization of sensors that has led to the ever-growing data stream. Batty stated that the amount of data was scaling up to a level that is not manageable by conventional means. Interestingly, this observation was made seven years ago, in 2013. Technology has advanced significantly since then, and there is no indication of a slowdown in these dynamics. Since then, the number of scientific papers suggesting that ‘big data’ would facilitate a whole new era of research, planning and management has grown substantially. In the broader field of mobility, I’m thinking of seminal papers by Kitchin (2014) on smart urbanism, Miller & Shaw (2015) in the context of GIS-T, or Anda et al. (2017) with regard to transport modelling.

In fact, the ubiquity of internet-connected sensors has led to a plethora of new applications and business opportunities, from automated driving to MaaS platforms and many more. Moreover, the paradigm of theory-based research is challenged by data-driven approaches, which rely heavily on machine learning and AI.
Against this backdrop, what is the situation when it comes to cycling data? If we followed the overall trend towards massive data streams and ‘big data lakes’, we would expect emerging (business) opportunities and new insights into the complex system of cycling mobility.

An excellent, up-to-date review of data sources and applications for pedestrian and bicycle monitoring comes from Lee & Sener (2020):

Classification of pedestrian and bicycle data sources (Lee & Sener 2020). Figure published as Open Access (CC BY-NC-ND 4.0).

Rightly, they point out that cyclists (as well as pedestrians) have specific characteristics and that the sensed data can therefore be fundamentally different from motorized transport data. Trips are more sensitive to the environment (infrastructure, weather, topography, …), more variable and commonly shorter. Because of these particularities, and given current data capturing technologies, Lee & Sener identified the following challenges with regard to data from cyclists:

  • Mode detection
  • Data validity in terms of representativeness
  • Sampling bias
  • Privacy
  • Lack of detailed contextual data
  • Cost of obtaining and utilizing data

According to this list, there is still much research to do. Although numerous voices have proclaimed that data would help to better understand and manage the entire transport system, things are not that easy, at least with regard to cyclists and pedestrians.

A similar overview can be found in a 2017 report by Steenberghen et al. There, the authors also investigated the availability of walking and cycling data in all member states of the European Union plus Norway and Switzerland. For this purpose, they interviewed representatives of the responsible governmental bodies and found that 18 out of 30 had difficulties collecting data on cycling and walking. Even the countries with structured data acquisition strategies reported major problems with under-reporting and data completeness. 60% of all investigated countries were not able to calculate the annual average distance cycled per person at the national level, let alone such key performance indicators at the city level.
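
To make the indicator mentioned above concrete, here is a minimal sketch of how the annual average distance cycled per person could be estimated from a travel survey. It is not the method used in the report: the function name, the expansion weights and the assumption that each respondent's survey day is representative of the whole year are all hypothetical, and the numbers are made up.

```python
def annual_km_per_person(respondents, days_per_year=365):
    """Estimate the annual average cycling distance per person.

    `respondents` is a list of (weight, daily_km) pairs, where `weight` is
    the number of people a respondent stands for (a hypothetical expansion
    weight) and `daily_km` is the distance cycled on the survey day.
    """
    total_weight = sum(weight for weight, _ in respondents)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    # Weighted mean of the daily distance over the whole population,
    # including people who did not cycle at all.
    daily_mean = sum(weight * km for weight, km in respondents) / total_weight
    # Assumption: the survey day is representative of every day of the year.
    return daily_mean * days_per_year


# Made-up sample: half of the population does not cycle on the survey day.
sample = [(1000, 0.0), (500, 4.0), (500, 2.0)]
estimate = annual_km_per_person(sample)
print(f"{estimate:.1f} km per person and year")
```

The arithmetic is trivial; the hard part, as the interviews suggest, is obtaining complete data that does not suffer from under-reporting in the first place.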

However, cities and regions require detailed data for providing adequate infrastructure and promoting cycling efficiently. Interest in cycling data increasingly comes from the health and environmental sectors as well, where the need to quantify physical activity and emission reductions, respectively, is a major driver.
Parallel to the technological advances that Batty and others have described extensively, a new cultural phenomenon emerged: the quantified self (Swan 2013). The readiness to track personal mobility together with several physiological parameters further boosted the production of (cycling) data. Romanillos et al. (2016) see huge potential in these data sets, especially when they are linked to other data sources, such as stationary counters. Among the many fitness and tracking applications, Strava seems to be the data source used most often for cycling-related research. Indeed, Google Scholar returns more than 4,700 references for the search term ‘strava data cycle*’ today.

Of course, the suitability and quality of data such as Strava's need to be critically assessed. Too often, data from such sources are used in a somewhat naive manner. Leao et al. (2017) point to the conceptual difficulty of transforming raw data generated by individuals into robust databases of collective activity. Griffin et al. (2020) investigated biases in big data for transportation and propose mitigation strategies. Mitigation is particularly difficult when it comes to data from cyclists and pedestrians. According to the authors, the datasets, which are primarily generated by fitness and tracking apps, are heavily biased towards specific user groups. Thus, they suggest combining various data sources and being careful with conclusions. With regard to combining traditional and emerging data sources, Conrow et al. (2018) state, “As a step toward developing a method for conflating conventional and crowdsourced bicycling data, we seek to explore the as yet understudied area of understanding how crowdsourced and conventional data correspond in representing activity.” This is a more than valid point. To date, there is no standardized framework for integrating different data sets.
What we know so far is that the correlation between permanent, stationary counters and crowdsourced tracking data varies depending on time, location, temporal sampling and spatial tolerance (see, for example, Boss et al. (2018) for the time-dependent correlation). Moreover, the prevalence of app usage varies among regions and even neighbourhoods, according to Heesch & Langdon (2016). Consequently, comparisons over time and across regions need to be done with great care.
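
As an illustration of what a time-dependent correlation means in practice, the following sketch groups hourly counts from a stationary counter and from a tracking app by time period and computes the Pearson correlation for each group. It is not taken from any of the cited studies: the record layout, the peak/off-peak split and all numbers are hypothetical and chosen only to show the mechanics.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for constant data
    return cov / sqrt(var_x * var_y)


def correlation_by_period(records, period_of):
    """Correlate counter and app counts separately for each time period.

    `records` holds (hour, counter_count, app_count) tuples; `period_of`
    maps an hour to a period label.
    """
    groups = defaultdict(lambda: ([], []))
    for hour, counter_count, app_count in records:
        counter_values, app_values = groups[period_of(hour)]
        counter_values.append(counter_count)
        app_values.append(app_count)
    return {period: pearson(xs, ys) for period, (xs, ys) in groups.items()}


def peak_or_offpeak(hour):
    # Hypothetical split into commuting peaks and the rest of the day.
    return "peak" if hour in (7, 8, 16, 17) else "off-peak"


# Made-up counts at one location: (hour, stationary counter, tracking app).
records = [
    (7, 120, 12), (8, 180, 18), (16, 150, 15), (17, 200, 20),
    (10, 60, 2), (11, 70, 9), (13, 80, 3), (14, 65, 8),
]
result = correlation_by_period(records, peak_or_offpeak)
for period, r in result.items():
    print(f"{period}: r = {r:.2f}")
```

In this toy data set the app tracks the counter closely during the peaks and hardly at all off-peak. If real data behave similarly, a single scaling factor from app counts to total volumes would be misleading.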

We learn from the current state of research that, despite the common enthusiasm about vast amounts of data, sound cycling data that represent the totality of cycling mobility are not yet available. Perhaps this is less a question of data availability than of data integration. For this, not only technical but above all conceptual research is desperately needed.
In an ongoing research project, we are continuously harvesting cycling-related data from many data sources. Together with an agent-based bicycle flow model, we are aiming to generate an integrated dataset that adequately reflects cycling mobility at the local level. If you are interested in this research, visit our Bicycle Observatory website or drop me a line.

