A word on data quality

While writing my conference paper for the AGIT symposium about using crowd sourced data for modeling and analysis purposes, I came across a brilliant blog post by Muki Haklay. His contribution basically points to the fact that all data are more or less biased (in terms of completeness and consistency), but that data providers are not equally honest in admitting it. That’s nothing really new (at least everyone with a professional GI background should be aware of this), but it was nicely condensed.

Interestingly, most studies dealing with the quality of crowd sourced data (in a GIS context this is OpenStreetMap data most of the time) use authoritative or commercial data sets as references. For some questions this approach is definitely useful; e.g. one can monitor the coverage of OSM data over time and see the project expanding geographically. But it has its limitations when one wants to deduce information about the quality of the data in terms of their spatial and especially their attribute characteristics. Comparing a potentially biased data set with another potentially biased data set is definitely a tricky thing, even more so when you draw global conclusions about the quality of the data.

In a project related to the routing platform www.radlkarte.eu I’ve used authoritative and crowd sourced road data sets for the same purpose: assessing the road network’s quality in terms of bicycle safety. In such a task the quality of the respective data sets becomes immediately obvious. Of course, OSM data partially suffer from attribute gaps, wrong classifications, simple mapping errors or heterogeneous attributes. But on the other hand they are (at least in my project area) spatially more accurate, more complete and above all more up to date.
Generally, the quality of authoritative (and commercial) road data tends to decrease in areas with low-level roads or roads with limited access for motorized vehicles. In the context of bicycle traffic this is a major drawback, because these are exactly the roads bicyclists prefer! Now, if you use authoritative or commercial data sets as a reference in such a context, you won’t necessarily be able to say anything about the quality of the crowd sourced data set! Determining the quality of data sets, no matter whether they are crowd sourced, commercial or authoritative, heavily depends on the purpose the data sets are used for. Imagine you build a routing service for bicyclists on your data set. No bicyclist will use your service if a major link is missing. Perhaps the road is not traversable for cars and that’s why it is regarded as dispensable. But for bicyclists this connection is an important shortcut and might be the reason for using the bike instead of the car for their travel to work …
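
To make the dependence on purpose a bit more tangible, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the road classes, the lengths of a hypothetical field survey, the mapped lengths and the relevance weights per purpose. It is not the assessment method used in the project, just one simple way to show that the same data set can score very differently depending on what it is used for.

```python
# Toy illustration only: all classes, lengths and weights are invented.

# Road length (km) per class according to a hypothetical field survey.
surveyed_km = {"primary": 40.0, "residential": 80.0, "track": 30.0, "cycleway": 20.0}

# Road length (km) per class found in the data set under assessment.
mapped_km = {"primary": 40.0, "residential": 76.0, "track": 9.0, "cycleway": 8.0}

# Assumed relevance of each road class for a purpose (0 = irrelevant).
purpose_weights = {
    "car_routing": {"primary": 1.0, "residential": 0.6, "track": 0.0, "cycleway": 0.0},
    "bicycle_routing": {"primary": 0.2, "residential": 0.6, "track": 0.8, "cycleway": 1.0},
}


def weighted_completeness(mapped, surveyed, weights):
    """Share of the surveyed length that is mapped, weighted by relevance."""
    covered = sum(weights[c] * min(mapped[c], surveyed[c]) for c in surveyed)
    total = sum(weights[c] * surveyed[c] for c in surveyed)
    return covered / total


car_score = weighted_completeness(mapped_km, surveyed_km, purpose_weights["car_routing"])
bike_score = weighted_completeness(mapped_km, surveyed_km, purpose_weights["bicycle_routing"])

print(f"completeness for car routing:     {car_score:.1%}")
print(f"completeness for bicycle routing: {bike_score:.1%}")
```

With these invented numbers the data set looks nearly complete for car routing but clearly incomplete for bicycle routing, although nothing about the data itself has changed.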

A nice webpage for comparing authoritative and crowd sourced road data sets is www.basemap.at. It’s a web map tile service which is fed by authoritative data from Austria’s federal states and city administrations. The service claims to be up to date and most accurate. Well, this might be true for the high-level road network. But “unfortunately” an OpenStreetMap rendering is provided as an alternative base map on the same webpage. And comparing these two base maps proves the provider’s claim wrong, at least when it comes to links which are essential for bicyclists. Here are two nice examples (screenshots from today):

Along the Saalach river runs a cycle way of supra-regional importance. It’s a kind of bicycle highway. The neighboring residential area is de facto connected to this cycle way via several small, mostly informal tracks. For everyday (bicycle and pedestrian) mobility these links are of enormous importance. Nevertheless, they are missing in the official map. Imagine the detour that a routing service relying exclusively on the authoritative data would generate!
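
The effect of such a missing link on routing can be sketched with a tiny example. The network below is purely hypothetical (node names and lengths in metres are invented and do not describe the real situation along the Saalach), and the router is a plain Dijkstra search, not the algorithm of any particular routing service.

```python
import heapq


def shortest_path_length(edges, source, target):
    """Dijkstra over undirected edges given as (node_a, node_b, length_m)."""
    graph = {}
    for a, b, length in edges:
        graph.setdefault(a, []).append((b, length))
        graph.setdefault(b, []).append((a, length))

    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # outdated queue entry
        for neighbour, length in graph.get(node, []):
            candidate = dist + length
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return float("inf")  # target not reachable


# Hypothetical "official" network: the residential area reaches the
# cycle way only via the formal road network.
official_edges = [
    ("home", "junction", 400),
    ("junction", "bridge_access", 900),
    ("bridge_access", "cycleway", 300),
]

# Hypothetical informal track that exists only in the crowd sourced data.
informal_track = ("home", "cycleway", 150)

with_official_only = shortest_path_length(official_edges, "home", "cycleway")
with_informal_track = shortest_path_length(
    official_edges + [informal_track], "home", "cycleway"
)
detour_m = with_official_only - with_informal_track

print(f"official data only:  {with_official_only:.0f} m")
print(f"with informal track: {with_informal_track:.0f} m")
print(f"detour:              {detour_m:.0f} m")
```

In this toy network a single missing 150 m track turns a short hop into a 1600 m trip. The numbers are arbitrary, but the mechanism is the same for any router that only knows the official links.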

No debate on data timeliness. The hydro power plant with an additional bicycle and pedestrian bridge was opened in summer 2013.

Don’t get me wrong. The basemap.at project is to be appreciated. It’s a first and very important step towards opening the administrations’ data treasure to a wider audience. Nevertheless, the project (involuntarily, I guess) helps to stimulate the discussion about data quality and reference data sets for quality assessment. Authoritative data sets are good, most often very good. But for several special purposes, such as bicycle traffic, crowd sourced data sets might be even better.

TL;DR

A quality assessment of digital road network data sets can only be plausible when 1) the purpose of the data is defined and 2) potential biases in the reference data sets are considered.
