Data inconsistencies and errors

Last week I’ve started a lessons-learned-series on how to deal with imperfect data in the context of spatial modeling and analyses. In a first part I’ve presented how functionally related attributes can be used to bridge gaps in attribute data.
Today I want to focus on a second category of data quality implications on an attribute level: attribute consistency as a specific form of attribute accuracy. Attribute consistency means, that there are no contratictions in the attributes of an object. Accuracy, as a more general category, simply means, that the attribute values are correct.

Generally inconsistencies and errors can occur in any data set; these are no OSM specific implications. But of course, the very informal mode to edit attributes in OSM (no relations or dependencies, no default values etc.) leads to inconsistencies and/or wrong attribute values. On the other hand, the community approach is a very, very effective way to deal with such shortcomings – in many cases the only feasible!

In the figure below a typical (and real!) example of inconsistent attribute values is illustrated:

Example for data inconsistency on an attribute level.

Example for data inconsistency on an attribute level.

The mapped way has, among others, the tags highway = track, tracktype = grade5 and surface = gravel. Here the tracktype, according to the wiki entry, indicates a track with a soft, uncompacted surface, whereas the value for the surface key is “gravel”. Consequently, either the tracktype or the surface value must be incorrect. A brief check in Google Earth brings clarity. The surface is, indeed, compacted and made of gravel. Thus, the value for the tracktype is wrong; the correct value would be grade2, according to the wiki.

The question now is, how to deal with such implications in the context of data modeling and analyses. Is there a routine to detect attribute inconsistencies automatically?
For the project, mentioned in my last blog, I’ve made use of combined queries to detect such inconsistencies. This approach works well, if enough attribute values exist. To formulate the right query statements, a well-founded overview of the data and a clear thematic focus are necessary. Conceptually such query looks like the following:

SELECT * FROM dataset WHERE
Key = Value AND ( functionally related Key1 <> Value OR functionally related Key2 <> Value OR functionally related Keyn <> Value)

If more than two functionally related attributes are attached to an object, potential inconsistencies can be corrected based on the data. But if only two conflicting attributes exist, it’s impossible to judge which one is wrong. Imagine a way with the tags highway = residential and maxspeed = 130. Obviously, this attribute combination is flawed. But which one is wrong? If no other attributes (such as the keys width, lanes, oneway etc.) are available for this segment one can either estimate the value based on the adjacent segments (if both are attributed with highway = motorway it’s very likely that the value for highway was wrong) or additional sources of information need to be consulted. For the latter local experts within the community are of high value!

Local expert’s knowledge is even more valuable when the attributes are not inconsistent but simply wrong. In such cases functionally related attributes are missing and thus errors are not detectable with query routines. If wrong attributes are not corrected by community members – which is most often the case! – they doze in the data set and become only obvious when they bias analysis results. This was, for example, the case with the way shown in the figure below:

The highway tag for this way is obviously wrong. Instead of cycleway the correct (according to the OSM wiki) value would be track with a further specification by the tracktype key.

The highway tag for this way is obviously wrong. Instead of cycleway the correct (according to the OSM wiki) value would be track with a further specification by the tracktype key.

We built a model and consequently a routing application on OSM data. The model rated ways with the tag highway = cycleway higher than other road types. As the respective way is tagged as cycleway it was prefered in the routing. But it was only after we received user feedbacks about implausible routing recommendations that we  realized that the data set contains a serious error. With more tags the error might have become obvious erlier.

TL;DR

The more (functionally related) attributes are attached to an object, the higher is the chance to detect inconsistencies in the data.
Query routines can help to detect errors. Local community members are mostly indispensable to correct them!

Advertisements

2 comments

  1. Pingback: Data quality: topology | gicycle
  2. Pingback: One object – different attributes | gicycle

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s