OpenStreetMap is much more than a free map of the world. It’s a huge geo-database, which is still growing and improving in quality. OpenStreetMap is a great project in many respects!
But because it is a community project, where basically everyone can contribute, it has some particularities, which are rather uncommon in authoritative data sets. There, data is generated according to a pre-fixed data standard. Thus, (in an ideal world) the data are consistent in terms of attribute structure and values. In contrast, attribute data in OpenStreetMap can exhibit a certain degree of (semantic) heterogeneity, misclassifications and errors. The OSM wiki helps a lot, but it is not binding.
Another particularity of OpenStreetMap is the data model. Coming from a GIS background I was taught to represent spatial networks as a (planar) graph with edges and nodes. In the case of transportation networks, junctions are commonly represented by nodes and the segments between them as edges. OpenStreetMap is not designed this way. Without going into details, the effect of OSM’s data model is that nodes are not necessarily introduced at junctions. This doesn’t matter for mapping, but it does for network analysis, such as routing!
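To make the difference between OSM ways and graph edges more tangible, here is a minimal Python sketch that splits ways into edges at junctions. It assumes that ways are given as plain lists of node ids and that crossing ways share a node id at the junction; crossings without a shared node would need a geometric intersection step, which is not shown. The function name and the input structure are made up for illustration.

```python
from collections import Counter

def split_ways_at_junctions(ways):
    """Split OSM-style ways (lists of node ids) into edges between junctions.

    A node is treated as a junction if it is referenced more than once
    across all ways. The last node of a way always closes an edge.
    """
    usage = Counter(node for nodes in ways.values() for node in nodes)
    edges = []
    for way_id, nodes in ways.items():
        start = 0
        for i in range(1, len(nodes)):
            if usage[nodes[i]] > 1 or i == len(nodes) - 1:
                edges.append((way_id, nodes[start:i + 1]))
                start = i
    return edges

# Hypothetical example: way "B" meets way "A" at node 3
ways = {"A": [1, 2, 3, 4], "B": [5, 3, 6]}
edges = split_ways_at_junctions(ways)
```

In this example both ways are cut at node 3, so the two ways become four edges that all start or end at a junction or at the end of a way.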
In 2014 I presented and published an approach that deals with attributive heterogeneity in OSM data. Later I joined forces with Stefan Keller from the University of Applied Sciences in Rapperswil, Switzerland and presented our work at the AAG annual meeting 2015 in Chicago.
Since then Stefan and his team have lifted our initial ideas of harmonized attribute data to an entirely different level. They formalized data cleaning routines, introduced subordinate attribute categories and developed an OSM export service, which generates real network graphs from OSM data. The result is just brilliant!
The service can be accessed via osmaxx.hsr.ch. There, a login with an OSM account is required. Users can then choose whether they go with an existing excerpt or define an individual area of interest. In the latter case the area can be clipped on a map and the export format (from Shapefiles to GeoPackage to SQLite DB) and spatial reference system can be chosen. The excerpt is then processed and published on a download server. At this stage I came across the only shortcoming of the service: you are not told that processing the excerpt can take up to several hours (see here ).
However, the rest of the service is just perfect. After “Hollywood has called” the processed data set can be downloaded from a web server.
The downloaded *.zip file contains three folders: data, static and symbology. The first contains the data in the chosen format. In the static folder all licence files and metadata can be found. The latter is especially valuable, because it contains the entire OSMaxx schema documentation. This excellent piece of work, which is the “brain” of the service, is also available on GitHub . Those who are interested in data models and attribute structure should definitely have a look at this!
The symbology folder contains three QGIS map documents and a folder packed full with SVG map symbols. The QGIS map documents are optimized for three different scale levels. They can be used for the visualization of the data. I’ve tried them for a rather small dataset (500 MB ESRI File Geodatabase), but QGIS (2.16.3) always crashed. However, I think there is hardly any application context where the entire content of an OSM dataset needs to be visualized at once.
Of course, OSMaxx is not the first OSM export service. But besides the ease of use and the rich functionality (export format, coordinate system and level of detail), the attribute data cleaning and clustering are real assets. With this it is easy, for example, to map all shops in a town or all roads where motorized vehicles are banned. Using the native OSM data can make such a job quite cumbersome.
I have also tried to use the data as input for network analysis. Although the original OSM road data are transformed into a network dataset (ways are split into segments at junctions), the topology (connectivity) is invalid at several locations in the network. Before the data are used for routing etc., I would recommend a thorough data validation. For the detection of topological errors in a network see this post . Maybe a topology validation and correction routine can be implemented in a future version of OSMaxx.
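A very basic connectivity check can be sketched in a few lines of Python. The example below only finds parts of the network that are disconnected from the rest. It is not a full topology validation and not the method from the linked post. The edge list format (pairs of node ids) and the function name are assumptions for illustration.

```python
from collections import defaultdict

def connected_components(edges):
    """Group nodes into connected components; edges are (from_node, to_node)."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from 'start'
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    # Largest component first; small ones are candidates for topology errors
    return sorted(components, key=len, reverse=True)

# Hypothetical network: edge (4, 5) is not connected to the rest
components = connected_components([(1, 2), (2, 3), (4, 5)])
```

Small components that should be part of the main network usually point to missing or unsnapped junction nodes and are a good place to start a manual inspection.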
In the current version the OSMaxx service is especially valuable for the design of maps that go beyond standard OSM renderings. But the pre-processed data are also suitable for all kinds of spatial analyses, as long as (network) topology doesn’t play a central role. Again, mapping and spatial analysis on the basis of OSM data was possible long before OSMaxx, but with this service it isn’t necessary to be an OSM expert and thus, I see a big potential (from mapping to teaching ) for this “intelligent” export service.
The number of available data sets published as Open Data (OD) and Open Government Data (OGD) is constantly growing . That’s incredibly cool, because you can do analyses that were impossible a few years ago. Today I’d like to show you how you can use building footprints from OpenStreetMap and census data from an OGD portal to generate a population grid with any spatial resolution.
Here is why it’s worth going through a few analysis steps instead of using what’s available anyway. At least in Austria (I know, the situation is quite different in the US) nationwide census data are only freely available on the level of municipalities. Now, everyone is aware of the fact that the population is commonly not equally distributed within rather arbitrarily defined administrative units, especially in the case of large, rural municipalities. Instead, the population is more or less spatially clustered.
For many analyses population data on the level of municipalities are way too coarse. Take for example the calculation of service areas for central facilities in order to estimate the potential coverage (“How many people live within 5 driving minutes?” etc.). Until recently you were forced to buy expensive statistical data from the federal bureau of statistics, Statistik Austria , in order to answer such questions. What you get there are aggregated census data in 250, 500 or 1000 meter raster grids.
Fortunately, enough data are published today as OD and OGD to bypass this limitation. Of course, the resulting population raster from the approach presented below is only an approximation (similar to dasymetric maps ). But for a first estimation it’s enough and it is free!
Here is how you can generate disaggregated population grids based on OSM data and demographic OGD:
- Download administrative boundaries, including available census data. For Austria you’ll find everything via the national OGD portal .
- Download building footprints from OpenStreetMap. I prefer QGIS and the QuickOSM plugin for this task, because OSM data are immediately converted to a geospatial dataset (e.g. Shapefile).
- Transfer all datasets into a projected coordinate system; the calculation of areas is more convenient this way.
- Select (building = *) all building footprints that are not used for residential purposes and remove them from your analysis layer.
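- Calculate the share (r) of the total building footprint area for each building, i.e. the footprint area of the building divided by the summed footprint area of all remaining buildings in the same administrative unit.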
- Select all buildings within the respective administrative unit and multiply the population data with the share of each building.
- Generate a regular grid, which covers the entire area (MMQGIS plugin for QGIS, hexgrid script for ArcGIS).
- Assign the estimated population data of the building footprints to each grid cell.
- Done. What you have is a rough estimation of the population distribution.
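The core of these steps can be sketched in plain Python. This is a minimal illustration under several assumptions: buildings are already filtered to residential ones and come with a footprint area and a centroid in a projected coordinate system, each building is assigned to the grid cell that contains its centroid, and a square grid is used instead of a hexagonal one. All names and the input structure are made up for this example.

```python
from collections import defaultdict

def population_grid(buildings, population, cell_size):
    """Distribute municipal population to grid cells via building footprint area.

    buildings:  list of dicts with 'municipality', 'area' (m^2) and
                'centroid' (x, y) in a projected coordinate system
    population: dict mapping municipality -> census population
    cell_size:  edge length of the square grid cells in metres
    """
    # Total footprint area per administrative unit
    total_area = defaultdict(float)
    for b in buildings:
        total_area[b["municipality"]] += b["area"]

    grid = defaultdict(float)
    for b in buildings:
        share = b["area"] / total_area[b["municipality"]]  # the share r
        estimate = population[b["municipality"]] * share
        x, y = b["centroid"]
        cell = (int(x // cell_size), int(y // cell_size))  # column, row index
        grid[cell] += estimate
    return dict(grid)

# Hypothetical municipality with 400 inhabitants and two buildings
buildings = [
    {"municipality": "M1", "area": 100.0, "centroid": (10.0, 10.0)},
    {"municipality": "M1", "area": 300.0, "centroid": (260.0, 20.0)},
]
grid = population_grid(buildings, {"M1": 400}, cell_size=250)
```

Here the smaller building receives a quarter of the population and the larger one three quarters, and each estimate ends up in the 250 m cell that contains the building’s centroid.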
Although the results are fairly reliable, at least two issues negatively affect the result. First, building footprints don’t account for multi-storey buildings. Theoretically the number of storeys can be tagged in OpenStreetMap, but this is hardly ever done. Second, data inaccuracies bias the result. In OSM many buildings are not adequately tagged (e.g. commercial buildings should be tagged as such) and even worse, some buildings are not mapped yet. Nevertheless for many questions the approximation is sufficient.
This simple piece of GIS analysis demonstrates the power of GIS on the one hand and the large benefit of Open (Government) Data on the other. Try it yourself – I’m looking forward to reading about your experience!
While Open Government Data are currently a big deal in the German-speaking countries, the OpenStreetMap project celebrates its 10th anniversary . How these different data sources can be dealt with in spatial modelling approaches and how they can even be used in combination were the two major topics of a presentation I gave last Friday at a UNIGIS workshop in Salzburg.
Spatial modelling allows for interpreting and relating data for specific applications, without necessarily manipulating them. Neglecting this option and building applications directly on databases can result in rather weird and/or useless results. The reason for this is simple: generally data are captured for a certain purpose. Naturally, this purpose decides on the data model, the attribute structure or the data maintenance. And these determining factors might diverge from the requirements of the intended application.
In the case of OGD the published data are made available by different public agencies. For example, the responsible department is obliged by law to monitor air quality and, if necessary, intervene efficiently. Thus different parameters are sensed for this very purpose. When these data are published as OGD one can, for example, use them for building a “health map”. But in such a case the direct visualization of micrograms and PPMs of the sensed pollutants wouldn’t make much sense. The data need to be interpreted, aggregated, classified, related – in short – modelled in order to fit the intended purpose of the map.
A similar mechanism holds true for data from the OpenStreetMap project. Originally the data were mapped for the purpose of building a free world map. Meanwhile the extent of the database has grown enormously and the data can be used for much more sophisticated applications than a “simple” world map. But again, if the data – and especially the attributes – which were originally collected for a specific purpose are being used in any other context, they have to be processed and modelled.
When applications are built on not only one dataset which was originally created for a different purpose, but on several datasets (e.g. because the data availability ends at the border of an administrative unit), the necessity of modelling is given anyway. As an example I’ve referred to our current work in the context of the web application Radlkarte .
Here it was necessary to combine authoritative data (mainly published as OGD) with crowd-sourced data. Because of the fundamental differences between these data sources – concerning the data model, attribute structure, data quality and the competence for data management – evaluation and correction routines, as well as an extensive modelling workflow, had to be implemented. But, as was demonstrated in the presentation, this effort pays off significantly when the validity and plausibility of the results are examined.
Geographical information systems (GIS) are intuitive and high-performing environments for the implementation of such multi-stage workflows. They allow for data storage and management in spatial databases, provide modelling interfaces and facilitate immediate analysis and visualization.
After dealing with attribute gaps and data inconsistencies, I want to focus on my favourite implication when it comes to using OpenStreetMap data for spatial modeling and analysis: attribute heterogeneity. Complicated term, easy-to-understand concept …
Somehow different from the two previous implications, which can basically occur in any data set, this one is quite characteristic of OSM data. As the individual mapper is relatively free in attributing objects, attribute heterogeneity is an inevitable consequence. In these cases it is not about right and wrong, but about different views on one and the same object. The mapper’s perception of reality is directly mirrored in how he or she assigns attributes to objects.
You can find this phenomenon frequently if you dig deep into the data set. Take for example a physically separated, mixed cycle- and footway along a primary road, as it is shown in the picture on the right. How can it be tagged in OSM? Basically there are several options. Here are just a few:
- highway = cycleway, foot = designated
- highway = cycleway, foot = yes
- highway = footway, bicycle = designated
- highway = path, foot = designated, bicycle = designated
None of these tags would be wrong. They are completely in accordance with the wiki’s recommendations. But a walking enthusiast might tend to tag the way as highway = footway with the corresponding bicycle tags. And a regular cyclist might want to emphasize the cycleway. And a third mapper prefers the general approach and simply tags the way as highway = path.
What seems to be irrelevant for some applications, can cause serious problems in the process of spatial modeling and analyses. Here, the attribute heterogeneity needs to be considered, if gaps and inconsistent analysis results should be effectively avoided! How can this be done?
In a first step it is necessary to check whether the tags are correct. For this combined queries (see last post ) are a feasible option.
If different tag combinations are admissible and in accordance with the OSM wiki, the definition of derived attributes is a suitable approach. Such derived attributes are “virtual” attributes which can be “fed” by several, different tag combinations. Take for example the aforementioned example of a physically separated, mixed cycle- and footway. This type of road infrastructure can be defined as a derived attribute. To simplify matters, let’s introduce a new key sep.mixed for this attribute. Now we can define:
IF (highway = cycleway AND foot = designated) OR (highway = cycleway AND foot = yes) OR (highway = footway AND bicycle = designated) OR (highway = path AND foot = designated AND bicycle = designated) etc. THEN sep.mixed = yes
For modeling and analysis purposes this derived attribute is now considered when one wants to consider different types of bicycle infrastructure. Very plain approach, but with huge effects in models and analysis routines which are based on OSM data sets.
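The rule above translates almost one-to-one into code. The following Python sketch assumes that the tags of a way are available as a dictionary. The function names are invented for illustration, and the “etc.” of the rule is not covered: only the four combinations listed above are checked.

```python
def is_separated_mixed_way(tags):
    """Return True if the tag combination matches one of the four listed options."""
    highway = tags.get("highway")
    foot = tags.get("foot")
    bicycle = tags.get("bicycle")
    return (
        (highway == "cycleway" and foot in ("designated", "yes"))
        or (highway == "footway" and bicycle == "designated")
        or (highway == "path" and foot == "designated" and bicycle == "designated")
    )

def add_derived_attributes(tags):
    """Return a copy of the tags, extended by the derived key sep.mixed."""
    derived = dict(tags)
    if is_separated_mixed_way(tags):
        derived["sep.mixed"] = "yes"
    return derived

# Three differently tagged ways that describe the same kind of infrastructure
way_a = add_derived_attributes({"highway": "cycleway", "foot": "yes"})
way_b = add_derived_attributes({"highway": "footway", "bicycle": "designated"})
way_c = add_derived_attributes({"highway": "path", "foot": "designated",
                                "bicycle": "designated"})
```

All three ways now carry sep.mixed = yes, so a model only needs to query this one derived attribute instead of every admissible tag combination. The original tags are left untouched.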
Of course, this approach is not restricted to bicycle infrastructure. It can be employed in any case where objects are potentially heterogeneously tagged. In order to find these heterogeneous attributes in a data set we found a simple visualization approach and a plausibility check very useful as a starting point.
Concluding the last three posts, I hope it became clear why it is of such great importance to not simply build applications on data sets, but to check the data sets’ quality and introduce modeling routines if necessary*. This of course requires an extra effort, but with an eye on analysis results and user satisfaction, the return on investment is striking.
* There is a very fine piece of work by Anita Graser et al. on this. It’s published in “Transactions in GIS” and can be accessed here .
Last week I started a lessons-learned series on how to deal with imperfect data in the context of spatial modeling and analyses. In the first part I presented how functionally related attributes can be used to bridge gaps in attribute data.
Today I want to focus on a second category of data quality implications on an attribute level: attribute consistency as a specific form of attribute accuracy. Attribute consistency means that there are no contradictions in the attributes of an object. Accuracy, as a more general category, simply means that the attribute values are correct.
Generally, inconsistencies and errors can occur in any data set; they are not specific to OSM. But of course, the very informal mode of editing attributes in OSM (no relations or dependencies, no default values etc.) leads to inconsistencies and/or wrong attribute values. On the other hand, the community approach is a very, very effective way to deal with such shortcomings – in many cases the only feasible one!
In the figure below a typical (and real!) example of inconsistent attribute values is illustrated:
The mapped way has, among others, the tags highway = track, tracktype = grade5 and surface = gravel. Here the tracktype, according to the wiki entry, indicates a track with a soft, uncompacted surface, whereas the value for the surface key is “gravel”. Consequently, either the tracktype or the surface value must be incorrect. A brief check in Google Earth brings clarity. The surface is, indeed, compacted and made of gravel. Thus, the value for the tracktype is wrong; the correct value would be grade2, according to the wiki.
The question now is, how to deal with such implications in the context of data modeling and analyses. Is there a routine to detect attribute inconsistencies automatically?
For the project mentioned in my last blog post, I’ve made use of combined queries to detect such inconsistencies. This approach works well if enough attribute values exist. To formulate the right query statements, a well-founded overview of the data and a clear thematic focus are necessary. Conceptually such a query looks like the following:
SELECT * FROM dataset WHERE
Key = Value AND ( functionally related Key1 <> Value OR functionally related Key2 <> Value OR functionally related Keyn <> Value)
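The conceptual query can also be expressed as a small rule table in Python. The sketch below is an illustration, not the routine used in the project: the rule structure, the function name and above all the sets of “plausible” values are assumptions that would have to be adapted to the wiki definitions and to local regulations. Missing related tags are not flagged, because a gap is not an inconsistency.

```python
# (key, value) -> functionally related keys with the values considered plausible.
# The value sets are illustrative assumptions, not an authoritative list.
RULES = {
    ("tracktype", "grade5"): {"surface": {"ground", "earth", "grass", "sand", "mud"}},
    ("highway", "residential"): {"maxspeed": {"30", "50"}},
}

def find_inconsistencies(tags, rules=RULES):
    """Return (key, value, related_key, actual_value) for every conflicting tag pair."""
    conflicts = []
    for (key, value), related in rules.items():
        if tags.get(key) != value:
            continue
        for related_key, plausible in related.items():
            actual = tags.get(related_key)
            if actual is not None and actual not in plausible:
                conflicts.append((key, value, related_key, actual))
    return conflicts

# The track from the example above
track = {"highway": "track", "tracktype": "grade5", "surface": "gravel"}
conflicts = find_inconsistencies(track)
```

For the example track the check reports the conflict between tracktype = grade5 and surface = gravel. Which of the two values is wrong still has to be decided with additional information.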
If more than two functionally related attributes are attached to an object, potential inconsistencies can be corrected based on the data. But if only two conflicting attributes exist, it’s impossible to judge which one is wrong. Imagine a way with the tags highway = residential and maxspeed = 130. Obviously, this attribute combination is flawed. But which one is wrong? If no other attributes (such as the keys width, lanes, oneway etc.) are available for this segment one can either estimate the value based on the adjacent segments (if both are attributed with highway = motorway it’s very likely that the value for highway was wrong) or additional sources of information need to be consulted. For the latter local experts within the community are of high value!
Local experts’ knowledge is even more valuable when the attributes are not inconsistent but simply wrong. In such cases functionally related attributes are missing and thus errors are not detectable with query routines. If wrong attributes are not corrected by community members – which is most often the case! – they doze in the data set and only become obvious when they bias analysis results. This was, for example, the case with the way shown in the figure below:
We built a model and consequently a routing application on OSM data. The model rated ways with the tag highway = cycleway higher than other road types. As the respective way is tagged as cycleway it was preferred in the routing. But it was only after we received user feedback about implausible routing recommendations that we realized that the data set contains a serious error. With more tags the error might have become obvious earlier.
The more (functionally related) attributes are attached to an object, the higher is the chance to detect inconsistencies in the data.
Query routines can help to detect errors. Local community members are mostly indispensable to correct them!
Since the start of the OpenStreetMap project numerous studies have dealt with the “quality” of this crowdsourced data set. In a previous post I’ve shown how relative the “quality” of a data set can be. Interestingly, this post got by far the most views – only as a side note.
Anyway, most studies that are dealing with the quality of OSM data focus on the geometric characteristics. Haklay & Ellul (2010) investigate the completeness of OSM (compared to Ordnance Survey data) in the UK. Helbich et al. (2012) compare the spatial accuracy of OSM and TomTom data. And, just to name a third example, Jackson et al. (2013) analyze both, completeness and accuracy, for OSM data in Colorado.
Only very few studies deal with the attributive quality of OSM data. Ludwig et al. (2011) and Graser et al. (2013), for example, evaluate the attributive completeness of selected attributes. But to my current knowledge, there is little more …
I must confess that I’m not an OSM geek, not a heavy mapper. But I’ve worked in several projects with the data and got to know (and love) them; and of course I’ve contributed to OSM more than once. In a recent project my task was to model across two data sets with different data models and attribute structures and use the modeling results as inputs in a network analysis. One of these data sets was an OSM extract for 5 municipalities in the Austrian-Bavarian border region. During my work I learned to deal with at least three issues concerning the attributive quality of OSM data:
- Attribute gaps
- Inconsistencies and errors
- Heterogeneous attribute structure
Some of my lessons learned will be presented at this year’s AGIT conference (this is an explicit invitation to all German speaking readers for this nice conference!!!). Since the conference language will be German, I’ll publish an excerpt of my paper here. Today I want to focus on the first implication, attribute gaps, and how to deal with it in spatial analyses, such as routing.
The road network I’ve worked with has a total length of roughly 1,000 km with slightly more than 5,000 ways in OSM (for analysis purposes I processed the data, which is irrelevant for the following considerations). Compared to other data sets (commercial and authoritative) of this region, OSM can be seen as the most up-to-date and the most complete in terms of existing ways. But when it comes to attributes (tags), the OSM data set has several downsides.
Here is an overview of the completeness of several tags which were important for my modeling:
The question for any modeling and/or analysis that builds on such data is how to deal with the attribute gaps.
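Before closing gaps it helps to quantify them. A completeness overview of this kind can be computed by counting how many ways carry a given key. The following Python sketch assumes that every way is represented by a dictionary of its tags. The function name and the sample data are invented and do not reproduce the figures from the project.

```python
def tag_completeness(ways, keys):
    """Return the share of ways (0..1) that carry each of the given keys."""
    total = len(ways)
    return {
        key: (sum(1 for tags in ways if key in tags) / total) if total else 0.0
        for key in keys
    }

# Hypothetical sample of four ways
ways = [
    {"highway": "residential", "maxspeed": "50"},
    {"highway": "track"},
    {"highway": "path", "surface": "gravel"},
    {"highway": "primary", "maxspeed": "100"},
]
completeness = tag_completeness(ways, ["highway", "maxspeed", "surface"])
```

In this sample every way has a highway tag, half of the ways have a maxspeed and a quarter have a surface value.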
Take for example the key “maxspeed”. This attribute is necessary for the calculation of the mean driving times for every way. If it’s not there, you can for example calculate a route, but not the total travel time. No maxspeed, no travel time? Not necessarily!
The OSM data set offers a whole bunch of different attributive information. And many of these attributes are functionally related. The road category determines the maximum speed to a certain degree etc. To illustrate such a functional relation imagine the following: If you have a way with highway = residential, the probability is very high that the maximum speed is not higher than 50 km/h due to traffic regulations. And so on.
Such functional relations in general allow for an estimation of missing attributive values. And for several analysis estimated values are sufficient. For example, if you calculate the total travel time for a route, it’s an estimation anyway.
So, how to estimate missing values for a whole data set? Here is an extract of the Python script we used to estimate the maximum speed:
In analogy to this approach, nearly all attribute gaps (width, surface, tracktype etc.) can be closed, as long as enough functionally related attributes are in the database. Of course such an approach generates some errors and in the worst case gaps in the attributes might remain (here expressed by the value 999). But in general quite plausible results can be produced this way.
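To illustrate the principle, here is a minimal Python sketch of such an estimation. It is not the script used in the project. The default speeds per road category are illustrative assumptions (only the 50 km/h for residential roads is taken from the text above) and must be adapted to local traffic regulations. The function names are made up as well. Remaining gaps are expressed by the value 999, as described above.

```python
# Assumed default speeds in km/h per highway value (illustrative, not legal limits)
DEFAULT_MAXSPEED = {
    "motorway": 130,
    "primary": 100,
    "secondary": 100,
    "tertiary": 70,
    "residential": 50,
    "living_street": 20,
}
UNKNOWN = 999  # marks gaps that could not be closed

def estimate_maxspeed(tags):
    """Use the mapped maxspeed if it is a plain number, else estimate it."""
    raw = tags.get("maxspeed")
    if raw is not None:
        try:
            return int(raw)
        except ValueError:
            pass  # values such as "walk" or "50 mph" fall through to the estimate
    return DEFAULT_MAXSPEED.get(tags.get("highway"), UNKNOWN)

def travel_time_minutes(length_m, tags):
    """Estimated travel time for one way, or None if no speed is available."""
    speed = estimate_maxspeed(tags)
    if speed == UNKNOWN:
        return None
    return length_m / 1000 / speed * 60

# A 1 km residential street without a maxspeed tag
minutes = travel_time_minutes(1000, {"highway": "residential"})
```

A mapped maxspeed always takes precedence over the estimate, so the functional relation is only used where the attribute is actually missing or not numeric.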