FOR HIS GIS MASTER’S DISSERTATION at the University of Edinburgh, Alex Mackie chose to mix two subjects close to his heart: books and maps. There is undoubtedly commercial value in mapping books, but this has been largely ignored by the industry. If books are properly georeferenced, then location-aware e-readers and tablets can use their user’s location to recommend locally relevant books, or provide the option to search for books relating to intended holiday destinations or places of interest. This extends a principle that physical bookstores already recognize: there is demand for locally relevant books, and Waterstones and other retailers stock their shop windows with books linked to the shop’s location. In addition to the commercial value, there is also a humanistic argument that the best way to really get a feel for a place is via its literature.
How: part I
Rather than having to source, store and manage the text of thousands of books, the initial method involved extracting place names from online book reviews, totalling around 80 million words, on the basis that reviewers tend to discuss the places that books are about or set in. Reviews represent a book in a highly condensed form but, despite the reduction in size, processing them is still a ‘Big Data’ problem, and this accounts for much of the challenge of this work. It may seem strange to be mining metadata for further, more specific metadata, but the vast amount of review text is a potentially rich source of information, and book catalogue data is sparse when it comes to the settings of fiction. The Unlock Text Geoparser was used to extract the place names. A geoparser is a tool which attempts to find words that refer to specific places and, using a gazetteer, assigns geographical coordinates to these place names. The particular challenge is the disambiguation of place names, for example distinguishing the London which is the capital of the UK from the smaller city located in south-western Ontario (Canada).
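To make the disambiguation problem concrete, here is a minimal sketch in Python of the gazetteer lookup step. It is not how the Unlock Text Geoparser works internally; the tiny `GAZETTEER` dictionary, the `find_toponyms` function and the approximate coordinates are assumptions made purely for illustration.

```python
import re

# Hypothetical toy gazetteer: lower-cased name -> candidate (label, lat, lon) entries.
# Coordinates are approximate and for illustration only.
GAZETTEER = {
    "london": [
        ("London, UK", 51.51, -0.13),
        ("London, Ontario, Canada", 42.98, -81.25),
    ],
    "edinburgh": [
        ("Edinburgh, UK", 55.95, -3.19),
    ],
}


def find_toponyms(text):
    """Return (word, candidates) for each capitalised word found in the gazetteer."""
    hits = []
    for match in re.finditer(r"\b[A-Z][a-z]+\b", text):
        word = match.group(0)
        candidates = GAZETTEER.get(word.lower())
        if candidates:
            hits.append((word, candidates))
    return hits


review = "A dark thriller set in London and Edinburgh."
for word, candidates in find_toponyms(review):
    status = "ambiguous" if len(candidates) > 1 else "unique"
    print(word, status, candidates)
```

In this sketch "Edinburgh" resolves to a single candidate, while "London" returns two, which is exactly the case a geoparser has to resolve from context.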
The Unlock Geoparser was developed by the Language and Technology Group within the School of Informatics in Edinburgh and has been successfully used for geoparsing historical texts. It therefore seemed an appropriate tool to use for this application. Research revealed that the reviews do indeed contain sufficient place names to geolocate books effectively; however, this particular use case of mapping books requires a very high level of accuracy in toponym identification and disambiguation. Despite considerable efforts, it was felt that this level of accuracy could not quite be achieved. This demand for near-perfect accuracy is higher than in typical applications of geoparsing, such as identifying trends or the gist of texts. Errors, such as author names, Dundee cakes and Yorkshire terriers being misidentified as places, meant the data was not ideal for powering location-based book recommendations and would lead to the application being rejected by potential users.
How: part II
Thus an alternative approach was taken, searching existing metadata for less fine-grained but more accurate locations. By taking subject metadata from existing book catalogues (the Open Library and LibraryThing) and using a custom algorithm to disambiguate these toponyms to real-world coordinates, an interactive “global book map” of 60,000 books has been built at http://www.mappit.net/bookmap. The algorithm works by examining the place names and determining whether they can be reliably disambiguated. Using third-party APIs like GeoNames, the algorithm takes the place name that is top-most in the administrative hierarchy (e.g. United Kingdom), checks the other place names for containment within it, and continues down the place hierarchy as far as necessary. If it can be confident that it has found a unique and definite place name match, then the book is added to the map. This algorithm is simpler than the Unlock Geoparser, but has the great advantage of running quickly, which is vital for processing the data volumes involved effectively. The map is growing as fast as the APIs it uses allow; the third-party providers rate-limit them to prevent abuse.
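The containment check described above can be sketched roughly as follows. This is an illustrative reconstruction, not the project's actual code: the in-memory `GAZETTEER`, the `Place` record and the function names are hypothetical stand-ins for lookups that the real system makes against services such as GeoNames, and the coordinates are approximate.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Place:
    id: int
    name: str
    level: int             # 0 = country, 1 = region, 2 = city (assumed scheme)
    parent: Optional[int]  # id of the containing place, None for countries
    lat: float
    lon: float


# Hypothetical toy gazetteer standing in for third-party API lookups.
GAZETTEER = [
    Place(1, "United Kingdom", 0, None, 54.0, -2.0),
    Place(2, "Canada", 0, None, 60.0, -95.0),
    Place(3, "England", 1, 1, 52.5, -1.5),
    Place(4, "Ontario", 1, 2, 50.0, -85.0),
    Place(5, "London", 2, 3, 51.51, -0.13),
    Place(6, "London", 2, 4, 42.98, -81.25),
]
BY_ID = {p.id: p for p in GAZETTEER}


def candidates(name):
    """All gazetteer entries whose name matches, ignoring case."""
    return [p for p in GAZETTEER if p.name.lower() == name.lower()]


def is_within(place, ancestor):
    """True if `ancestor` appears somewhere above `place` in the hierarchy."""
    current = place
    while current.parent is not None:
        current = BY_ID[current.parent]
        if current.id == ancestor.id:
            return True
    return False


def disambiguate(names):
    """Return the most specific unambiguous Place, or None if not confident."""
    options = [candidates(n) for n in names]
    if not options or any(not c for c in options):
        return None  # no names, or a name the gazetteer does not know
    # Start with the name that sits highest in the administrative hierarchy.
    options.sort(key=lambda c: min(p.level for p in c))
    anchor = None
    for cands in options:
        if anchor is not None:
            # Keep only candidates contained in the place resolved so far.
            cands = [p for p in cands if is_within(p, anchor)]
        if len(cands) != 1:
            return None  # ambiguous or contradictory: leave the book off the map
        anchor = cands[0]
    return anchor


print(disambiguate(["London", "United Kingdom"]))
print(disambiguate(["London"]))
```

With this toy data, "London" together with "United Kingdom" resolves to the English city, whereas "London" on its own is rejected because two candidates remain. Refusing to guess in the ambiguous case reflects the preference for accuracy over coverage described above.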