This blog post explains an alternative way to figure out how similar cities are. After you read it, you will realize why I think Madison and Reykjavik are very similar cities.
Teleport Cities allows you to research the most interesting cities in the world using a sophisticated scoring system that compares and contrasts them. Using Santiago as an example, this is what you see when you first visit its Teleport City page:
Each score is backed by carefully considered data science, which helps us rank all 134 cities that are currently part of Teleport Cities. Note that these scores are not absolute; they are relative to the other cities we feature.
A natural question is how to compare different cities. There are two main ways of doing this in Teleport Cities. The first lets you compare two cities side by side. For example, if we want to compare Santiago with Oslo, we get this view:
The other way is to rank all cities by logging in and setting your personal preferences. We then calculate a score for each city, giving you a ranked view of all of them. This is the world through your personal lens, if you will:
However, there is one slight problem with just adding up all the scores. If two cities have the same total score of 75, does that mean they are similar? A simple counter-example shows that a single number does not capture similarity. Say that we have only two categories: commute and safety. If city A scores 50 for commute and 25 for safety, while city B scores 25 for commute and 50 for safety, both total 75, yet they are clearly different cities. The same reasoning extends to all our categories (we have 15 categories in our public view, but 20 categories for logged-in users).
Fortunately, there are many mathematical techniques for estimating how close or far apart two sets of values are. One of the best-known metrics is the Euclidean Distance. You have probably used it already: in its most basic form, it gives the length of the hypotenuse of a right triangle (remember the Pythagorean theorem?).
The Euclidean Distance can be calculated between points in any number of dimensions. We can treat each score category in a Teleport City as a dimension, so the public view of our cities gives us fifteen dimensions. The formula is very simple: take the difference between the two cities' scores in one category and square it, repeat for every category, add up all the results, and then take the square root of the sum.
This gives us a single number that tells you whether two cities are close or far apart. More importantly, for any city you pick, you can now find which city in the whole set is closest to it and which is farthest from it.
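As a minimal sketch of the formula just described, the snippet below computes the Euclidean Distance with NumPy. The function name `euclidean_distance` and the two-category score vectors are illustrative assumptions (they reuse the commute and safety counter-example from earlier), not Teleport's actual code or data.

```python
import numpy as np

def euclidean_distance(a, b):
    """Square root of the summed squared per-category differences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Hypothetical scores in the order (commute, safety).
city_a = [50, 25]
city_b = [25, 50]

# Both cities add up to 75, yet the distance between them is far from zero.
print(sum(city_a), sum(city_b))
print(euclidean_distance(city_a, city_b))
```

The same function works unchanged for fifteen or twenty categories, as long as both vectors list the categories in the same order.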
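To illustrate how the closest and farthest cities can be looked up, here is a small sketch that builds a matrix of pairwise distances and queries it. The city names, the scores and the helper names `pairwise_distances` and `closest_and_farthest` are all made up for the example.

```python
import numpy as np

# Hypothetical cities with two hypothetical category scores each.
names = ["A", "B", "C", "D"]
scores = np.array([[50, 25], [25, 50], [48, 27], [10, 90]], dtype=float)

def pairwise_distances(scores):
    """Euclidean distance between every pair of rows (cities)."""
    diff = scores[:, None, :] - scores[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def closest_and_farthest(names, scores, city):
    """Return the names of the closest and farthest city from `city`."""
    i = names.index(city)
    d = pairwise_distances(scores)[i].copy()
    farthest = names[int(np.argmax(d))]
    d[i] = np.inf  # a city should not be reported as its own neighbour
    closest = names[int(np.argmin(d))]
    return closest, farthest

print(closest_and_farthest(names, scores, "A"))
```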
Plotting distances in a two-dimensional chart
Human beings can easily represent two dimensions on any surface (or three with a bit of imagination). Comprehending more than three dimensions is nearly impossible for most people. This means that if we want to visualize the relationship between cities, we should do it in two dimensions preferably.
There are several methods to “reduce” the number of dimensions; one family of them is known as multidimensional scaling (MDS). We could take our 15 city dimensions and represent them in two. One technique that has stood the test of time, and a favorite of mine, is Sammon’s Mapping, developed by John W. Sammon in 1969.
In summary, this algorithm starts with a random or pre-configured set of two-dimensional points and then iterates over all data points (our cities). With every pass, it tries to reduce the error (called stress by Sammon) between the distances in the two-dimensional set and the actual distances in the original fifteen-dimensional set. Once the stress conditions are met, the algorithm stops and we are left with a set of points that can easily be drawn on any flat surface. The x- and y-axes are not meaningful in these charts, but the distances between points are.
Here is an example of running this algorithm on the original fifteen categories for all 134 Teleport Cities and plotting the result in a two-dimensional chart. You can expand the image to see it at full resolution. Two cities that are close in the plot are very likely to have similar category scores, and vice versa.
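The idea can be sketched in a few lines of NumPy. This is not the code used for the charts in this post, and it simplifies Sammon's original update rule to plain gradient descent with an adaptive step size. The function names, the default parameters and the assumption that no two rows of the input are identical are all choices made for this example.

```python
import numpy as np

def pairwise(X):
    """Euclidean distance between every pair of rows."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def sammon_stress(D_high, Y, eps=1e-12):
    """Sammon's stress between original distances and the 2-D layout Y."""
    iu = np.triu_indices(len(Y), k=1)
    d_high = D_high[iu]
    d_low = pairwise(Y)[iu]
    return float(((d_high - d_low) ** 2 / (d_high + eps)).sum() / d_high.sum())

def sammon(X, n_iter=200, step=1.0, seed=0, eps=1e-12):
    """Map rows of X to 2-D points by gradient descent on Sammon's stress.

    Assumes no duplicate rows in X (their original distance would be zero).
    """
    X = np.asarray(X, dtype=float)
    D_high = pairwise(X)
    c = D_high[np.triu_indices(len(X), k=1)].sum()
    Y = np.random.default_rng(seed).normal(size=(len(X), 2))  # random start
    stress = sammon_stress(D_high, Y)
    t = step
    for _ in range(n_iter):
        D_low = pairwise(Y)
        # Per-pair weight (d* - d) / (d* d); the diagonal contributes nothing.
        W = (D_high - D_low) / (D_high * D_low + eps)
        np.fill_diagonal(W, 0.0)
        # Gradient of the stress with respect to every 2-D point.
        grad = (-2.0 / c) * (W.sum(axis=1, keepdims=True) * Y - W @ Y)
        # Accept a step only if it lowers the stress; otherwise halve it.
        while t > 1e-12:
            Y_new = Y - t * grad
            new_stress = sammon_stress(D_high, Y_new)
            if new_stress < stress:
                Y, stress = Y_new, new_stress
                t *= 2.0  # be more ambitious on the next pass
                break
            t *= 0.5
        else:
            break  # no step size improves the layout any further
    return Y, stress

# Hypothetical data: 20 "cities" with 5 category scores each.
X = np.random.default_rng(42).uniform(0, 100, size=(20, 5))
Y, final_stress = sammon(X)
print(Y.shape, final_stress)
```

Because a step is only accepted when it lowers the stress, the stress never increases from one pass to the next. The layout it reaches still depends on the random starting points.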
Sammon’s mapping applied to all 134 Teleport Cities
Let’s zoom in on the plot and check a pair of cities that sit close together in the chart above, but that we wouldn’t have expected to be similar:
Now, if the mapping is right, the category scores for Madison and Reykjavik should be relatively similar. Let’s pull up the comparison view in Teleport Cities to verify:
Although the categories are not exactly the same, the categories with a high score are mostly the same in both cities. Naturally, the scores are not identical; if they were, both dots would land in exactly the same location in Sammon’s mapping. Who knew that Wisconsin and Iceland had something in common?
Improving Sammon’s Mapping with a Voronoi Tessellation
As we saw above, Sammon’s mapping is pretty good at representing city similarity in a two-dimensional plot. It is difficult, though, to see the Teleport score of each city. I tried to show it above by making the dot size proportional to the city score, but I am fairly sure most people didn’t even notice.
Voronoi diagrams are named after Georgy Voronoy, who worked on the mathematics behind these tessellations in the early 20th century. However, the diagrams predate his research: Descartes himself used them as early as 1644. The basic idea is to draw a polygon (called a cell) around each point. Each cell contains every location that is closer to its own point than to any other point.
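The rule behind a Voronoi cell can be shown without any plotting: a location belongs to the cell of whichever point is nearest to it. The sketch below applies that rule with NumPy. The function name `voronoi_labels` and the coordinates are illustrative assumptions; the charts in this post were drawn as polygons rather than by labeling individual locations.

```python
import numpy as np

def voronoi_labels(seeds, query):
    """For each query location, return the index of the nearest seed point.

    All locations that share an index lie in the same Voronoi cell.
    """
    seeds = np.asarray(seeds, dtype=float)
    query = np.asarray(query, dtype=float)
    diff = query[:, None, :] - seeds[None, :, :]
    distances = np.sqrt((diff ** 2).sum(axis=-1))
    return distances.argmin(axis=1)

# Hypothetical 2-D positions of three cities after the mapping step.
seeds = [[0, 0], [10, 0], [0, 10]]
# Hypothetical locations on the chart whose cell we want to know.
query = [[1, 1], [9, 1], [2, 8]]

print(voronoi_labels(seeds, query))
```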
Each cell was colored based on the score for the city. Reds are for low scores and greens are for high scores. To increase the differentiation between cells, scores were normalized so strong greens were given to the highest scores and strong reds to the lowest.
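One way to get this kind of coloring is sketched below. The exact normalization used for the chart is not spelled out here, so min-max scaling and the straight red-to-green blend are assumptions, as are the function names and the sample scores.

```python
import numpy as np

def normalize_scores(scores):
    """Min-max scale scores so the lowest maps to 0 and the highest to 1."""
    s = np.asarray(scores, dtype=float)
    span = s.max() - s.min()
    if span == 0:
        return np.full_like(s, 0.5)  # all scores equal: use the midpoint
    return (s - s.min()) / span

def red_green(normalized):
    """Blend linearly from pure red (0.0) to pure green (1.0) as RGB triples."""
    n = np.asarray(normalized, dtype=float)
    return np.stack([1.0 - n, n, np.zeros_like(n)], axis=-1)

# Hypothetical city scores.
scores = [60, 70, 80]
colors = red_green(normalize_scores(scores))
print(colors)
```

Scaling to the observed minimum and maximum, rather than to the full 0 to 100 range, is what spreads the colors out when most cities score within a narrow band.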
The resulting diagram with the same mapping as the previous chart:
Teleport City similarity using Sammon’s Mapping and Voronoi Cells
As you can see in the chart above, cities that sit close together can still differ noticeably in score. Returning to the Madison and Reykjavik example, the differences are now much clearer. Madison seems to score slightly higher than Reykjavik, and both score clearly higher than Milwaukee:
Customized Visualizations of City Similarity
The charts above use the default scores provided by Teleport Cities. Even though we update our data frequently, default scores are pretty much stable. For example, a city that is expensive today will not get cheaper in just a few days.
Where it gets interesting is when users actually choose their preferences and generate a personal set of scores for all cities. To give you an idea, anyone can create an account and log in to set their preferences. Some examples of our preference dialogs are:
By selecting a few preferences, the resulting scores become a unique view of what a person is looking for. Moreover, cities that are similar under one combination of preferences are not necessarily similar under another. In this way, the similarity map is truly personal. Below are three example maps from hypothetical users with different priorities who have activated various preferences.
Preferences (example 1):
- Climate: similar to Denver
- Internet access: very important
- Culture: cinemas, comedy clubs, concerts
Preferences (example 2):
- Low crime rate: very important
- Healthcare: best possible care
- Education: schools for my kids
Preferences (example 3):
- Startup scene vibrance: very important
- Venture capital ecosystem: somewhat important
- Travel connectivity: medium hub
- Ease of starting a business: very important
- Low corporate taxes: somewhat important
- Fast internet connectivity: very important
Do you have any comments or corrections? Just leave a comment below or get in touch with us.
All charts in this post were generated using IPython Notebook, NumPy, scikit-learn, Matplotlib, D3 and mpld3.