“Analysis results can only be as good as the underlying data are.” – Is this really the whole truth?
I have used this argument numerous times, especially when clients and data providers were dissatisfied with analysis results. And I’m totally convinced that in most cases the argument is valid.
During the last few years I’ve been working a lot with digital transportation network data, from crowd-sourced as well as authoritative sources. In all cases I’ve been using the data as input for spatial modelling and analysis, above all for network assessment and routing purposes.
Since last year the federal state of Salzburg has been running an intermodal routing web service (VAO), which is built upon authoritative data from Austria’s harmonized transportation network dataset (GIP). These are the same data I’ve been using since the early days of the GIP. A couple of weeks ago OpenStreetMap added routing functionality to its main website. Again, I’ve been working with OSM data in this context for quite a while.
The nice thing about these two web services is that they use the very same data sources I do, and both provide special routing services for bicyclists. Again, this is what I’ve been working on since my Master’s thesis. This situation allows for testing the hypothesis of a direct relation between data quality and the quality (plausibility or validity) of analysis results. Before I go into more detail, I’d like to briefly illustrate how, in general, data sources and analysis results might be related to each other (if you have better ideas on how to categorize the relations, please leave me a note!):
1) GIGO: garbage in, garbage out
Most of the time the old mantra of information science holds true – at least as long as no additional intelligence is added to the data or to the interpretation of the data.
2) GIIGO: garbage in, improved garbage out
With the help of intelligent data modelling and the implementation of heuristics, it is possible to make the best out of poor data. Last year I presented some routines for data improvement by means of spatial modelling at the AGIT conference (although I have to mention that the quality of the data used in that case study was not bad at all). The paper (in German, sorry) can be accessed here.
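To give a flavour of what such a heuristic might look like (a minimal sketch only – not the actual AGIT routines; the attribute name, segment IDs and sample data are invented for illustration), one could infer a missing attribute of a network segment from the values of its topological neighbours:

```python
# Hypothetical sketch: fill missing 'surface' attributes in a network
# dataset by majority vote among topologically adjacent segments.
from collections import Counter

def infer_missing_surface(segments, adjacency):
    """segments: {segment_id: surface or None}
    adjacency: {segment_id: [neighbouring segment_ids]}
    Returns a copy with missing values filled from the most common
    value among a segment's neighbours (where any are known)."""
    repaired = dict(segments)
    for seg_id, surface in segments.items():
        if surface is None:
            neighbour_vals = [segments[n] for n in adjacency.get(seg_id, [])
                              if segments.get(n) is not None]
            if neighbour_vals:
                repaired[seg_id] = Counter(neighbour_vals).most_common(1)[0][0]
    return repaired

# Toy example: segment s2 lacks a surface value; two of its three
# neighbours are asphalt, so asphalt is the best guess.
segments = {"s1": "asphalt", "s2": None, "s3": "asphalt", "s4": "gravel"}
adjacency = {"s2": ["s1", "s3", "s4"]}
print(infer_missing_surface(segments, adjacency)["s2"])  # → asphalt
```

Of course, a guessed value is still a guess – such heuristics improve the garbage, they don’t turn it into ground truth.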
3) QIGO: quality in, garbage out
This sounds a bit weird, but try to imagine the following: you want to prepare a perfect dinner for your sweetheart. The recipe is pinned on your fridge – you have a plan. You (assume you) know what you need and how to do it. To make sure that all ingredients are fresh and of perfect quality, you spend quite a lot of money at the local farmer’s market. At home you try to follow the recipe, but somehow you miss an important step and in the end you even scorch the meat.
Who is to blame? Well, as the ingredients were flawless and the recipe correct, it was the cook who messed it up. Exactly the same can happen with data. A perfect data set without any errors, inaccuracies or inconsistencies is no guarantee for good analysis results. The data can be misused or misinterpreted, or the analysis design and application might simply be crap.
4) QIQO: quality in, quality out
This is the ideal relation (no wonder – the acronym sounds like an alternative Chinese treatment; maybe I should protect it). Logically it mirrors the concept of GIGO. Realistically, it is clear that data quality is never 100%. But provided the data are of good quality, the right interpretation (modelling and analysis design) ensures results of fairly high quality.
Let’s turn to the above-mentioned applications now.
The authoritative data used in the VAO web service are of overall high quality. We’ve investigated the geometrical and topological quality as well as the attribute consistency. Nevertheless, some of the resulting routing recommendations for bicyclists are anything but plausible. Here are two examples:
In the left example the recommended route runs along one of the most frequented roads in town. There is hardly any appropriate bicycle infrastructure, apart from a few segments with painted on-road bicycle lanes. Local bicyclists would never choose this route, although it is the shortest connection between origin and destination.
Compared to this, the right example is much better. But again, this is not the route locals would ride; the roads in the northern part of the route (around the main station) are crowded and completely bicycle-unfriendly. This recommendation is all the more surprising as it is possible to cycle from A to B almost exclusively along high-capacity cycleways along the Alterbach and the Salzach rivers.
I would hesitate to call these two examples a showcase for what I’ve called QIGO, as the routing application was mainly intended for motorized individual transport (MIT) and public transport (PT). Nevertheless, it can be shown that with the exact same data much better (in this case more plausible) recommendations can be generated:
The reason for the different analysis results is an additional modelling step in the data preparation process (for further details, this presentation might be interesting). The model makes use of the various descriptive attributes in the dataset: they are used to identify the most bicycle-friendly segments, which are then weighted accordingly in the wayfinding algorithm.
The nice thing about this model is that it can be implemented independently of the data source. Hence it is possible to use it for a bicyclist-specific interpretation of virtually any transportation network dataset. The effect can be demonstrated, for example, with OpenStreetMap data. As stated above, the OSM main website has integrated different routing functions, two of which are designed for bicyclists. Obviously, these routing engines are parametrized quite differently; using the same data set, they generate diverging results. Let’s have a look at two examples:
The left example demonstrates how different the results can be, depending on the parameters of the routing engine. Although the recommendation from MapQuest could be judged as more or less bicycle-friendly (cycleways), the detour is quite long. The result from GraphHopper is plausible, although it could be further optimized (avoiding the primary road).
Whereas the left example is comprehensible in both versions, the right example shows, independently of the routing engine, a rather dangerous route: two thirds of it run on a primary road without any bicycle infrastructure.
The question again arises whether the results are a direct function of data availability (a critical point in the context of crowd-sourced data) and data quality. The implementation of the aforementioned model gives at least some indication of the potential for further improvement. These are the corresponding results – again, based on the identical data set:
How is data quality related to the quality of analysis results? Is it really as simple as “garbage in, garbage out”?
Well, the examples given here suggest that the relation is not strictly linear. Obviously there is potential to improve results through the implementation of additional intelligence. With this, analysis results can be of better quality than would be expected from the data quality alone.
This is anything but a carte blanche for handling data carelessly! In fact, the contrary is true: every investment in data quality pays off multiple times, because all modelling and data processing routines can then focus on the last few percent towards an optimal solution or result. This means that although GIIGO might give us some hope, we should go for QIQO – and watch out that we don’t “scorch” the best data and end up with QIGO!
Any thoughts, additional ideas or comments? Feel free to start a discussion or get in touch via the contact form. I’m looking forward to hearing from you!