“Again!” you might say … why is he always dwelling on data? Well, because data are essential for the quality of our analysis and the conclusions we can draw from it.
At the moment I’m analyzing bicycle accident data. I have over 3,000 geo-located accident reports, each involving at least one bicyclist. The details are really amazing, and interesting discoveries can be made in this data set. It is also a nice stimulus for hypothesis generation, such as, “Why is the average age of female victims above the average age of male victims?” (see figure).
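The group comparison behind that hypothesis can be sketched in a few lines of Python. This is a minimal illustration with invented toy records; the field names (`sex`, `age`) are my assumptions, not the schema of the actual accident reports:

```python
from statistics import mean

# Toy stand-in for the geo-located accident reports; the real data set
# has over 3,000 records and many more attributes per report.
reports = [
    {"sex": "f", "age": 54}, {"sex": "f", "age": 61}, {"sex": "f", "age": 38},
    {"sex": "m", "age": 29}, {"sex": "m", "age": 45}, {"sex": "m", "age": 33},
]

def mean_age(records, sex):
    """Average victim age for one group of records."""
    return mean(r["age"] for r in records if r["sex"] == sex)

print(mean_age(reports, "f"))  # 51
print(mean_age(reports, "m"))  # ≈ 35.7
```

Of course, an observed gap in group means is only a starting point for hypothesis generation; it says nothing yet about why the gap exists.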
My research goal is to detect and characterise spatial and temporal patterns (e.g. moving hotspots) of different accident types and/or accident variables. GIS helps to put single incidents in a spatial context and to observe potential changes over time. A major problem in doing so is the absence of any sound statistical population, which makes it nearly impossible to calculate risk exposure. Consequently, it’s hard to judge whether spatial or temporal clusters of accidents are significant or not. [If you have any idea how to calculate significances under these circumstances, please get in touch with me!]
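Why the missing population matters can be shown with a toy rate calculation. All numbers below are invented for illustration: a location with more accidents is not necessarily more dangerous once exposure is taken into account, which is exactly the judgment that cannot be made without a population:

```python
# Invented example: raw accident counts vs. accident rates.
# Without an exposure measure (e.g. bicycle-kilometres ridden per
# segment), only the raw counts are available, and they can mislead.
segments = {
    "main_street": {"accidents": 30, "bike_km_per_year": 600_000},
    "quiet_lane":  {"accidents": 5,  "bike_km_per_year": 20_000},
}

for name, s in segments.items():
    # Accidents per 100,000 bicycle-kilometres ridden on the segment.
    rate = s["accidents"] / s["bike_km_per_year"] * 100_000
    print(f"{name}: {s['accidents']} accidents, {rate:.1f} per 100,000 bike-km")
```

Here the busy street accumulates six times as many accidents (30 vs. 5) but has the far lower rate (5.0 vs. 25.0 per 100,000 bike-km), so a hotspot of raw counts would flag the wrong segment.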
I know that I’m not the only one with this problem. Scanning the literature was an exciting endeavor. Those studies which relate the number of bicycle accidents to a statistical population mostly use aggregated statistics, such as inhabitants or kilometers per year from national to census block level.
As there seems to be no traffic model for bicycles (at least I didn’t find any) or anything similar, it is de facto impossible to calculate risk exposure at the level of road segments, although this would be the interesting thing to do if you have geo-located accident reports!
Anyway, what I find really intriguing is how often vague statistics are used just to have some exposure variable (the problem is that we are attracted by figures and not so much by the quality of the underlying data). Even worse, these variables are perpetuated from publication to publication. Here is my example (I hesitate to cite the publications in detail because I find them good and helpful and don’t want to disparage the work at all; besides, the principal mechanism/problem is by no means limited to this single showcase):
- A study from 2012 about road safety for bicyclists refers to data used in a study from 2008.
- The study from 2008 about bicycle promotion in several European countries refers to data from the European Commission published in 2002.
- The publication of the European Commission from 2002 is the annual report “EU Energy and Transport in Figures – Statistical Pocket Book 2000”.
- In the preface of the EU report it is stated that the data mainly cover the period from 1970 to 1997; the data were collected by Eurostat from several national agencies and institutions.
- Finally, having dug through to the data the study from 2012 (!) refers to, you will land here:
Isn’t this amazing? Twenty-year-old data of questionable quality (see my recent post on modal split data) are used as an essential variable in a current study … I don’t want to judge others’ work. It’s hard enough to get one’s own papers published, and those who are successful with their work are without any doubt experts in their respective fields.
But when we read fancy papers or – above all – run our own analyses, we should never forget to ask critically where the data actually come from. The conclusions drawn in the aforementioned papers are plausible. But the question is whether the conclusions are really backed by the data. Maybe sometimes it’s better to be honest (or humble?) and admit that the data basis might be weak. Maybe we have to reduce our ambitions and do somewhat less spectacular work. Nevertheless, I’m totally convinced that we still have enough to say.
Take some time to question your own analysis and recap where your data come from.