How to deal with attribute gaps in OSM data?

Since the start of the OpenStreetMap (OSM) project, numerous studies have dealt with the “quality” of this crowdsourced data set. In a previous post I showed how relative the “quality” of a data set can be. Interestingly, and only as a side note, that post has received by far the most views.
Most studies that deal with the quality of OSM data focus on geometric characteristics. Haklay & Ellul (2010) investigate the completeness of OSM (compared to Ordnance Survey data) in the UK. Helbich et al. (2012) compare the spatial accuracy of OSM and TomTom data. And, to name a third example, Jackson et al. (2013) analyze both completeness and accuracy for OSM data in Colorado.
Only very few studies deal with the attributive quality of OSM data. Ludwig et al. (2011) and Graser et al. (2013), for example, evaluate the attributive completeness of selected attributes. But to my knowledge, there is little more …

I must confess that I’m neither an OSM geek nor a heavy mapper. But I’ve worked with the data in several projects and got to know (and love) it, and of course I’ve contributed to OSM more than once. In a recent project my task was to model across two data sets with different data models and attribute structures and to use the modeling results as inputs for a network analysis. One of these data sets was an OSM extract for 5 municipalities in the Austrian-Bavarian border region. During this work I learned to deal with at least three issues concerning the attributive quality of OSM data:

  • Attribute gaps
  • Inconsistencies and errors
  • Heterogeneous attribute structure

Some of my lessons learned will be presented at this year’s AGIT conference (an explicit invitation to all German-speaking readers to attend this nice conference!). Since the conference language is German, I’ll publish an excerpt of my paper here. Today I want to focus on the first issue, attribute gaps, and how to deal with them in spatial analyses such as routing.

The road network I’ve worked with has a total length of roughly 1,000 km, with slightly more than 5,000 ways in OSM (I preprocessed the data for the analysis, but that is irrelevant for the following considerations). Compared to other data sets (commercial and authoritative) of this region, OSM can be seen as the most up-to-date and the most complete in terms of existing ways. But when it comes to attributes (tags), the OSM data set has several downsides.
Here is an overview of the completeness of several tags that were important for my modeling:


Completeness of selected attributes in OpenStreetMap. The road network (~1,000 km) is shown in light grey; ways with values for the respective keys are shown in dark grey.
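
As a rough illustration of how such completeness figures could be computed, here is a minimal Python sketch. It is not the workflow used for the figure: the way records, the `length_m` field and the function name `tag_completeness` are assumptions made up for this example. It reports, per key, the share of ways and the share of network length that carry a value.

```python
def tag_completeness(ways, keys):
    """Return, per key, the share of ways and of network length that carry a value."""
    total_count = len(ways)
    total_length = sum(w["length_m"] for w in ways)
    if total_count == 0 or total_length == 0:
        raise ValueError("need at least one way with a positive length")

    result = {}
    for key in keys:
        # A way counts as "tagged" if the key exists and is not an empty string.
        tagged = [w for w in ways if w["tags"].get(key) not in (None, "")]
        result[key] = {
            "share_of_ways": len(tagged) / total_count,
            "share_of_length": sum(w["length_m"] for w in tagged) / total_length,
        }
    return result


# Hypothetical sample data; real input would come from an OSM extract.
ways = [
    {"length_m": 100.0, "tags": {"highway": "residential", "maxspeed": "30"}},
    {"length_m": 300.0, "tags": {"highway": "track"}},
    {"length_m": 600.0, "tags": {"highway": "primary", "maxspeed": "100", "surface": "asphalt"}},
    {"length_m": 1000.0, "tags": {"highway": "residential"}},
]

completeness = tag_completeness(ways, ["maxspeed", "surface"])
for key, shares in completeness.items():
    print(key, shares)
```

Reporting both shares is a deliberate choice: a few long, well-tagged main roads can make the length-based figure look much better (or worse) than the count-based one.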

The question for any modeling and/or analysis that builds on such data is how to deal with the attribute gaps.
Take for example the key “maxspeed”. This attribute is needed to calculate the mean driving time for every way. If it is missing, you can still calculate a route, but not its total travel time. No maxspeed, no travel time? Not necessarily!
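
To make the dependency concrete, here is a minimal sketch of the travel time calculation. The function names and the representation of a route as (length in metres, speed in km/h) pairs are assumptions for illustration only. It uses the maximum speed as a stand-in for the driving speed, so the result is a lower bound rather than a realistic driving time.

```python
def travel_time_s(length_m, maxspeed_kmh):
    """Travel time in seconds for one way, or None if the speed is unknown."""
    if maxspeed_kmh is None or maxspeed_kmh <= 0:
        return None
    # km/h divided by 3.6 gives m/s.
    return length_m / (maxspeed_kmh / 3.6)


def route_travel_time_s(segments):
    """Sum the travel times of all segments; a single gap makes the total unknown."""
    times = [travel_time_s(length_m, speed) for length_m, speed in segments]
    if any(t is None for t in times):
        return None
    return sum(times)


# Hypothetical routes: the second one contains a way without maxspeed.
complete_route = [(500.0, 50), (1000.0, 100)]
route_with_gap = [(500.0, 50), (200.0, None)]

print(route_travel_time_s(complete_route))  # roughly 72 seconds
print(route_travel_time_s(route_with_gap))  # None: one missing value blocks the total
```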

The OSM data set offers a wealth of attributive information, and many of these attributes are functionally related. The road category, for instance, determines the maximum speed to a certain degree. To illustrate such a functional relation: if a way is tagged highway = residential, it is very likely that the maximum speed is no higher than 50 km/h because of traffic regulations. Similar relations hold for other road categories.
In general, such functional relations allow missing attribute values to be estimated. And for many analyses estimated values are sufficient. The total travel time of a route, for example, is an estimate anyway.

So how can missing values be estimated for a whole data set? Here is an extract of the Python script we used to estimate the maximum speed:

Python script for the estimation of missing maxspeed values based on functionally related attributes.
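
What follows is not the original script but a minimal sketch of the same idea under stated assumptions. The default speeds per highway class, the cap for unpaved surfaces and all function names are hypothetical placeholders and would have to be adapted to local traffic regulations. Only the tag keys (`highway`, `maxspeed`, `surface`) and the gap marker 999 are taken from the text.

```python
UNKNOWN = 999  # marker for gaps that cannot be closed

# Hypothetical default speeds in km/h per highway class (placeholders, not legal limits).
DEFAULT_SPEED_BY_HIGHWAY = {
    "motorway": 130,
    "primary": 100,
    "secondary": 100,
    "tertiary": 70,
    "residential": 50,
    "track": 30,
}

# Hypothetical set of surface values treated as unpaved.
UNPAVED_SURFACES = {"gravel", "ground", "dirt", "grass", "unpaved"}
UNPAVED_SPEED_CAP = 30  # assumed cap in km/h on unpaved ways


def parse_maxspeed(value):
    """Return a plain numeric maxspeed in km/h, or None for missing or non-numeric values."""
    if isinstance(value, str) and value.strip().isdigit():
        return int(value.strip())
    return None


def estimate_maxspeed(tags):
    """Use the tagged maxspeed if present, otherwise estimate it from related tags."""
    tagged = parse_maxspeed(tags.get("maxspeed"))
    if tagged is not None:
        return tagged

    # First related attribute: the road category.
    speed = DEFAULT_SPEED_BY_HIGHWAY.get(tags.get("highway"), UNKNOWN)

    # Second related attribute: an unpaved surface lowers the estimate.
    if speed != UNKNOWN and tags.get("surface") in UNPAVED_SURFACES:
        speed = min(speed, UNPAVED_SPEED_CAP)
    return speed


print(estimate_maxspeed({"highway": "residential"}))
print(estimate_maxspeed({"highway": "tertiary", "surface": "gravel"}))
print(estimate_maxspeed({"highway": "proposed"}))
```

Values such as "walk" or "30 mph" are deliberately treated as missing here to keep the sketch short; a production script would need to parse them explicitly.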

By analogy, nearly all attribute gaps (width, surface, tracktype etc.) can be closed, as long as enough functionally related attributes are in the database. Of course, such an approach introduces some errors, and in the worst case gaps in the attributes might remain (expressed here by the value 999). But in general this produces quite plausible results.
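
One way to get a feeling for the errors and the remaining gaps is to test an estimation rule on ways where the value is already known. The sketch below is an assumption-laden illustration, not part of the original analysis: `evaluate_estimator`, the toy rule `estimate_surface` and its highway-to-surface mapping are all hypothetical. Each known value is hidden, re-estimated from the remaining tags and compared with the original.

```python
UNKNOWN = 999  # marker for gaps that cannot be closed

# Hypothetical rule: guess the surface from the road category alone.
SURFACE_BY_HIGHWAY = {
    "motorway": "asphalt",
    "primary": "asphalt",
    "residential": "asphalt",
    "track": "gravel",
}


def estimate_surface(tags):
    """Toy estimator for the surface key based on the highway tag."""
    return SURFACE_BY_HIGHWAY.get(tags.get("highway"), UNKNOWN)


def evaluate_estimator(tag_sets, key, estimate):
    """Hide each known value of `key`, re-estimate it, and tally hits, misses and gaps."""
    tally = {"hits": 0, "misses": 0, "gaps": 0}
    for tags in tag_sets:
        if key not in tags:
            continue  # nothing to compare against
        masked = {k: v for k, v in tags.items() if k != key}
        guess = estimate(masked)
        if guess == UNKNOWN:
            tally["gaps"] += 1
        elif guess == tags[key]:
            tally["hits"] += 1
        else:
            tally["misses"] += 1
    return tally


# Hypothetical sample: three ways with a known surface, one without.
sample = [
    {"highway": "residential", "surface": "asphalt"},
    {"highway": "track", "surface": "grass"},
    {"highway": "path", "surface": "ground"},
    {"highway": "primary"},
]

report = evaluate_estimator(sample, "surface", estimate_surface)
print(report)
```

The same check works for any other key and estimator, since the estimator is passed in as a function.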
