# Evaluating the significance of bicycle accidents

In one of my last posts I presented a method to define spatial reference units in a network. These units form the basis for evaluating the significance of bicycle accident occurrences: the actual number of accidents per reference unit is compared to a random distribution. The underlying assumption is that the bicycle volume (or the risk of an accident) is homogeneous, which of course is not the case in reality. Here are the first results of a global analysis …

**1. Generate random points**

Bicycle accidents are not necessarily equally distributed in a network. To reflect this, I calculated the number of accidents per road category, which can be done by a spatial join of the accidents to the road network. Surprisingly, by far the most accidents happen on municipal roads (why this is the case should be subject to further research!). When generating the random points, the 3,048 points (the total number of reported accidents over 10 years in my research area) were distributed according to the respective percentage per road category in each iteration.
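To make the allocation step concrete, here is a minimal Python sketch of how a fixed number of points can be split across road categories by percentage. The category names and shares are hypothetical placeholders, not the values from my data, and the largest-remainder rounding is just one reasonable way to make the counts add up exactly; the actual processing was done in ArcGIS.

```python
TOTAL_ACCIDENTS = 3048  # total number of reported accidents (from the post)

# Hypothetical shares per road category (placeholders, not the study's values)
category_share = {"municipal": 0.70, "state": 0.20, "other": 0.10}


def allocate_points(total, shares):
    """Split `total` points across categories in proportion to `shares`.

    Uses largest-remainder rounding so the counts sum exactly to `total`.
    """
    raw = {cat: total * share for cat, share in shares.items()}
    counts = {cat: int(value) for cat, value in raw.items()}
    leftover = total - sum(counts.values())
    # Give the remaining points to the categories with the largest fractions
    by_remainder = sorted(raw, key=lambda cat: raw[cat] - counts[cat], reverse=True)
    for cat in by_remainder[:leftover]:
        counts[cat] += 1
    return counts


points_per_category = allocate_points(TOTAL_ACCIDENTS, category_share)
print(points_per_category)
```

In each iteration, the resulting number of points per category would then be placed randomly on the roads of that category.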

For my first prototypical analysis I ran 100 iterations. The whole processing is done in ArcGIS using Esri's ModelBuilder and a few lines of Python code.

**2. Calculate mean value and standard deviation**

In a subsequent step the random points were assigned to the a priori defined reference units, again simply using a spatial join. This allowed for the calculation of the mean number of accidents per reference unit (over the 100 iterations of random point generation) that can be expected under a random distribution of accidents. Together with the mean value, the standard deviation (STD) was calculated for each reference unit.
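The summary step can be sketched in a few lines of plain Python. This is a toy illustration only: the number of units and points is made up, the "spatial join" is replaced by a uniform random draw of a unit index, and I use the sample standard deviation here, which is an assumption about the exact formula.

```python
import random
import statistics

N_ITERATIONS = 100  # as in the post
N_UNITS = 5         # hypothetical toy network
N_POINTS = 200      # hypothetical, kept small for the sketch

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# counts[i][u] = number of random points falling into unit u in iteration i
counts = []
for _ in range(N_ITERATIONS):
    per_unit = [0] * N_UNITS
    for _ in range(N_POINTS):
        # Stand-in for the spatial join: pick a unit uniformly at random
        per_unit[rng.randrange(N_UNITS)] += 1
    counts.append(per_unit)

# Mean and (sample) standard deviation per unit over all iterations
unit_mean = [statistics.mean([run[u] for run in counts]) for u in range(N_UNITS)]
unit_std = [statistics.stdev([run[u] for run in counts]) for u in range(N_UNITS)]

for u in range(N_UNITS):
    print(f"unit {u}: mean={unit_mean[u]:.2f}, std={unit_std[u]:.2f}")
```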

**3. Evaluate the significance of bicycle accidents**

In order to determine whether the number of accidents per reference unit is random or not, the actual number of accidents was compared to the mean value of the random distribution. A significance interval of two standard deviations was used. This means that every value within +/- two standard deviations of the mean value is insignificant; in other words, that number of accidents can be expected from the random distribution. Reference units whose accident count lies more than 2 STDs above the mean exhibit significantly more accidents than expected; units more than 2 STDs below the mean exhibit significantly fewer. Additionally, a z-score (the number of STDs by which the actual count deviates from the simulated mean) was calculated for each unit.
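As a small illustration of this rule, the following Python sketch computes the z-score for one reference unit and assigns it to one of the three classes. The function name, the labels and the example numbers are made up for illustration; a unit with a standard deviation of zero is treated as undefined, which is my assumption for the sketch.

```python
def classify_unit(observed, sim_mean, sim_std, threshold=2.0):
    """Return (z_score, label) for one reference unit.

    observed: actual number of accidents in the unit
    sim_mean, sim_std: mean and standard deviation from the random iterations
    threshold: width of the significance interval in STDs (2 in the post)
    """
    if sim_std == 0:
        # No variation in the simulation: the z-score is undefined
        return None, "undefined"
    z = (observed - sim_mean) / sim_std
    if z > threshold:
        label = "significantly high"
    elif z < -threshold:
        label = "significantly low"
    else:
        label = "not significant"
    return z, label


# Example with made-up numbers: simulated mean 20, STD 4
print(classify_unit(30, 20.0, 4.0))  # 2.5 STDs above the mean
print(classify_unit(22, 20.0, 4.0))  # inside the +/- 2 STD interval
print(classify_unit(10, 20.0, 4.0))  # 2.5 STDs below the mean
```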

**4. Visualize results**

One of the most exciting questions in this context is of course where the number of bicycle accidents is significantly high. No medium is better suited to answer such a question than a map. The color coding is kept simple and straightforward. Blue lines indicate roads (or reference units) with significantly fewer accidents than expected. Red lines represent roads with a significantly high number of accidents. The grey lines are roads with an insignificant number of accidents (within the interval of +/- 2 STDs from the simulated mean number of accidents per reference unit). In order to get a better impression of the significance, I've extruded the roads (per reference unit) by their z-score, resulting in a 2.5D visualization. The light layer indicates the zero level.

Based on this visualization one can draw several conclusions or formulate hypotheses (“reasoning”). Apparently most accidents occur in the city center, probably because the number of accidents is, at least to a certain degree, a function of the number of bicyclists.

This brings me to a decisive point: this analysis does NOT allow for any risk assessment. For this, a statistical population (e.g. the number of bicyclists per reference unit) would be necessary. Based on these results one can only state whether the number of accidents is significantly lower or higher than would be expected from a random distribution.

**5. Next steps**

The results shown here are only an initial point for a series of additional analyses. What I’m planning to do next:

- Investigate the effect of different reference units (I’ve used 100 here). The question is to what degree the definition of reference units influences the overall result.
- Consider the seasonality in the data set. This means that I will split up the data set into 4 subsets reflecting the different seasons. I expect the spatial pattern of the winter months to be different from the rest.
- Consider different accident types. I’m interested in whether there are differences between the spatial patterns of male vs. female drivers or whether the injury severity differs depending on the location.
- Check whether reference units with a significantly high or low number of accidents move over time. Here interesting spatio-temporal patterns might become evident.

A general question I need to work on (my colleague Christoph is after this) is the minimum number of accidents per reference unit which is necessary for a robust analysis result. If you have any ideas on this, do not hesitate to contact me!

## One comment