This page provides additional and accompanying material for our submission to the Special Issue: Geo-Information Fostering Innovative Solution for Smart Cities.
Paper abstract:
The ever-increasing availability of linked open geospatial data provides an unprecedented source of geo-information to describe urban environments. This wealth of data should be turned into actionable knowledge: for example, open data could be used as a proxy or substitute for closed or expensive information. The successful employment of linked open geospatial data can pave the way for innovative solutions to smart city problems.
In this paper, we illustrate a set of experiments that, starting from linked open geospatial data, execute a knowledge discovery process to predict urban semantics. More specifically, we leverage geo-information about points of interest as input in a classification model of land use at a fine-grained spatial resolution (250 meters) over wide urban areas in Europe.
We replicate our experiments in different European cities - Milano, München, Barcelona and Brussels - to ensure the repeatability and generality of our approach, and we explain the experimental conditions as well as the employed datasets to guarantee reproducibility. We extensively report on quantitative and qualitative evaluation results, to judge the validity as well as the limitations of our proposed approach.
City land use distribution
The grid cells of the selected cities are mapped to 5 classes by taking into consideration their prevalent land use: cluster 1 identifies the dense residential areas (corresponding to CORINE category 111), cluster 2 the sparse residential areas (category 112), cluster 3 the industrial and commercial areas (categories 12x and 13x), cluster 4 the agricultural areas (categories 2xx) and cluster 5 the parks and the natural areas (categories 14x, 3xx, 4xx and 5xx).
The distribution of cells across the 5 classes for Milano, Brussels, München and Barcelona.
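The class mapping described above can be sketched in a few lines of Python. This is an illustrative sketch, not the code used for the paper: the function names are hypothetical, and the helper that picks the prevalent land use assumes each cell comes with the area share of every CORINE code it contains, which the page does not state.

```python
def corine_to_class(code):
    """Map a 3-digit CORINE code to one of the 5 land use classes."""
    code = str(code)
    if len(code) != 3 or not code.isdigit():
        raise ValueError(f"not a 3-digit CORINE code: {code!r}")
    if code == "111":
        return 1  # dense residential areas
    if code == "112":
        return 2  # sparse residential areas
    if code[:2] in ("12", "13"):
        return 3  # industrial and commercial areas
    if code[0] == "2":
        return 4  # agricultural areas
    if code[:2] == "14" or code[0] in "345":
        return 5  # parks and natural areas
    raise ValueError(f"CORINE code outside the listed categories: {code}")


def prevalent_class(area_shares):
    """Class of the CORINE code with the largest area share in a cell.

    area_shares: {corine_code: share of the cell area}. Assumed input
    format, used here only for illustration.
    """
    prevalent_code = max(area_shares, key=area_shares.get)
    return corine_to_class(prevalent_code)


# Hypothetical cell: 60% sparse residential, 40% arable land.
print(prevalent_class({"112": 0.6, "211": 0.4}))
```

An alternative reading of "prevalent land use" would sum the shares per class before taking the maximum; the sketch above takes the simpler route of mapping the single dominant code.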
Points of Interest (POIs): distances and distributions
Because of its very wide geographic coverage and its fine-grained spatial resolution, we select OpenStreetMap/LinkedGeoData as the source of the geo-information variables used as predictors in the classification experiments.
We first select a set of 50 POI categories that can characterize the urban landscape. As we need to describe each grid cell in terms of its surrounding environment, we extract, for each POI category, the distance from the cell to the closest POI of that category. Each cell is therefore described by 50 quantitative dimensions, one per POI category.
CSV files with the distances for the four cities: Barcelona, Brussels, Milano, München.
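As an illustration of how such a distance table can be derived, the sketch below computes the distance from each cell to the closest POI of every category. It rests on assumptions not stated on this page: distances are measured from a single reference point per cell (for instance the centroid), coordinates are already projected to metres, and the category names are invented for the example.

```python
import math


def nearest_poi_distances(cells, pois, categories):
    """Distance from each cell to the closest POI of every category.

    cells: {cell_id: (x, y)} reference point of each cell, in metres.
    pois: iterable of (category, x, y) tuples, in metres.
    Returns {cell_id: {category: distance}}; a category with no POI
    at all yields math.inf.
    """
    # Group POI coordinates by category once, up front.
    by_category = {category: [] for category in categories}
    for category, x, y in pois:
        if category in by_category:
            by_category[category].append((x, y))

    distances = {}
    for cell_id, (cx, cy) in cells.items():
        distances[cell_id] = {
            category: min(
                (math.hypot(cx - x, cy - y) for x, y in points),
                default=math.inf,
            )
            for category, points in by_category.items()
        }
    return distances


# Two hypothetical cells 250 m apart and two hypothetical POIs.
cells = {"a": (0.0, 0.0), "b": (250.0, 0.0)}
pois = [("school", 300.0, 400.0), ("pharmacy", 250.0, 0.0)]
table = nearest_poi_distances(cells, pois, ["school", "pharmacy"])
print(table["a"])  # one distance per POI category
```

The brute-force scan is quadratic in cells and POIs; for city-wide grids a spatial index would be the usual choice, but the resulting table has the same shape: one row per cell, one column per POI category.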
To compare the POI distances across cities, we plotted the overlaid distributions of POI distances for the four cities.
All the results obtained in the 4 experiments (city-specific model selection, cross-city model selection with and without background knowledge, binomial classification).
City-specific model selection
The goal of our qualitative error analysis is to identify possible patterns in the spatial distribution of errors. We verified that all cities exhibit the same behaviour, in terms of both the spatial distribution of errors and the type of misclassification.
On the left, errors are shown as black cells on the entire city area, which is otherwise coloured according to the correctly predicted land use classes (1=red, 2=orange, 3=yellow, 4=green and 5=blue). It is evident that all the errors lie on the “boundaries” between the areas with homogeneous land use.
The right side of the figure zooms into a portion of the map to better visualize the error types: the colour of the small square at the centre of each misclassified cell represents the (mis)predicted class and can thus be compared to the background colour, which represents the correct class. We find that the classifier always mistakes a cell’s class for the class of one of its adjacent cells.
Barcelona, Brussels, Milano, München
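The observation that each misclassified cell receives the class of an adjacent cell can also be checked numerically. The following is a minimal sketch under stated assumptions: cells are indexed by (row, column) on a regular grid, "adjacent" is taken to mean the 8 surrounding cells, and the function name is hypothetical.

```python
def errors_matching_a_neighbour(true_classes, predicted_classes):
    """Count errors whose predicted class equals a neighbour's true class.

    true_classes, predicted_classes: {(row, col): class}.
    Returns (matching_errors, total_errors).
    """
    # Offsets of the 8 surrounding cells (assumed notion of adjacency).
    offsets = [
        (dr, dc)
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    ]
    matching = 0
    total = 0
    for (row, col), truth in true_classes.items():
        predicted = predicted_classes[(row, col)]
        if predicted == truth:
            continue  # correctly classified cell
        total += 1
        neighbour_classes = {
            true_classes[(row + dr, col + dc)]
            for dr, dc in offsets
            if (row + dr, col + dc) in true_classes
        }
        if predicted in neighbour_classes:
            matching += 1
    return matching, total


# Hypothetical 2x2 grid with two misclassified cells.
truth = {(0, 0): 1, (0, 1): 1, (1, 0): 2, (1, 1): 2}
predicted = {(0, 0): 1, (0, 1): 2, (1, 0): 2, (1, 1): 4}
print(errors_matching_a_neighbour(truth, predicted))
```

If every error sits on a boundary between homogeneous areas, the two returned counts coincide.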
Cross-city model selection without any background knowledge
We rank all the predictors in terms of their information gain, calculated according to the Shannon entropy, which measures the heterogeneity of the data with respect to the land use classes. We select the top 5 and the top 11 variables, based on thresholds on the information gain values. This procedure aims at selecting only the most informative predictors, to avoid overfitting the model.
The list of all the POIs ranked according to the information gain.
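For readers unfamiliar with the measure, the sketch below shows one common way to compute information gain from the Shannon entropy and to rank predictors by it. It is not the procedure used for the ranking above: the predictors here are assumed to be already discretised (the page does not say how the continuous distances were binned), and the predictor names and values are invented for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus their entropy conditioned on the predictor.

    values: discretised predictor values, one per cell.
    labels: land use class of each cell, in the same order.
    """
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    n = len(labels)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_predictors(predictors, labels):
    """Return (name, information gain) pairs, most informative first."""
    gains = {name: information_gain(v, labels) for name, v in predictors.items()}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: four cells, two classes, two binned predictors.
labels = [1, 1, 3, 3]
predictors = {
    "school": ["near", "near", "far", "far"],    # separates the classes
    "pharmacy": ["near", "far", "near", "far"],  # carries no information
}
for name, gain in rank_predictors(predictors, labels):
    print(name, round(gain, 3))
```

A threshold on the gain values, or a fixed top-k cut, then selects the subset of predictors passed to the classifier.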
SVM tuning to determine the best configuration parameters
R script: SVM Tuning
City Comparative Analysis
R script: Hypothesis testing for proportion equality