There is a series of maps at the back of the booklet to help locate the pubs, but they are not very detailed, and organising a walk that takes in several pubs can be a little tedious. So in 2009 I decided to convert the locations into a Google Map. This was purely for my own use, but I put it on a publicly accessible site so that others could share it. That original map was generated with a quick and dirty conversion from the Ale Trail PDF, using a bit of XSLT and a lot of manual effort.
So, for this year's Ale Trail I decided that the conversion needed to be a little quicker and easier to generate, with a lot less manual effort. The source would be the available PDF of the Ale Trail. The age-old problem with PDF data is its lack of structure when you try to extract it to other formats. The tool I used for the conversion is an online utility provided by Xerox called Rossinante. Its purpose is to produce an EPUB, but it also lets you grab the XML from each of the stages used to create the EPUB. I grabbed the content XML that had been generated from the PDF extraction. This was not pretty...
<PARAGRAPH id="psgmt_427" sgm_info="interlign" x="14.4062" y="209.629" width="232.0" height="18.99"> <LINE x="14.4062" y="209.629" width="232.0" height="10.0" id="l21_19" base="217.287" font-size="9" font-name="myriad" bold="?" italic="no"> <TEXT width="231.91" height="9.909" x="14.4062" y="209.629" id="p21_t19"> <TOKEN sid="p21_s9175" id="p21_w148" font-name="myriad" bold="yes" italic="no" font-size="9" font-color="#d10018" rotation="0" angle="0" x="14.4062" y="209.629" base="217.287" width="17.325" height="9.909">186.</TOKEN> <TOKEN sid="p21_s9177" id="p21_w149" font-name="myriad" bold="yes" italic="no" font-size="9" font-color="#d10018" rotation="0" angle="0" x="33.4052" y="209.629" base="217.287" width="48.177" height="9.909">Marlingford</TOKEN> <TOKEN sid="p21_s9178" id="p21_w150" font-name="myriad" bold="yes" italic="no" font-size="9" font-color="#d10018" rotation="0" angle="0" x="83.4002" y="209.629" base="217.287" width="15.183" height="9.909">Bell</TOKEN> <TOKEN sid="p21_s9179" id="p21_w151" font-name="myriad" bold="no" italic="no" font-size="9" font-color="#000000" rotation="0" angle="0" x="100.493" y="209.808" base="217.287" width="38.763" height="9.729">Bawburgh</TOKEN> <TOKEN sid="p21_s9180" id="p21_w152" font-name="myriad" bold="no" italic="no" font-size="9" font-color="#000000" rotation="0" angle="0" x="141.163" y="209.808" base="217.287" width="20.862" height="9.729">Road,</TOKEN> <TOKEN sid="p21_s9181" id="p21_w153" font-name="myriad" bold="yes" italic="no" font-size="9" font-color="#000000" rotation="0" angle="0" x="163.316" y="209.629" base="217.287" width="48.177" height="9.909">Marlingford</TOKEN> <TOKEN sid="p21_s9182" id="p21_w154" font-name="myriad" bold="no" italic="no" font-size="9" font-color="#000000" rotation="0" angle="0" x="213.403" y="209.808" base="217.287" width="15.381" height="9.729">NR9</TOKEN> <TOKEN sid="p21_s9183" id="p21_w155" font-name="myriad" bold="no" italic="no" font-size="9" font-color="#000000" rotation="0" angle="0" 
x="230.692" y="209.808" base="217.287" width="15.624" height="9.729">5HX</TOKEN> </TEXT> </LINE>.......</PARAGRAPH>
As expected, the structure was severely lacking, although there was enough styling information to infer the structure I needed. So first, an XSLT transformation simplified the data and generated some basic elements from the styling data, from which the structure could then be inferred.
<para> <bold> <id>186. </id> <t>Marlingford </t> <t>Bell </t> </bold> <normal> <t>Bawburgh </t> <t>Road, </t> </normal> <bold> <t>Marlingford </t> </bold> <normal> <postcode>NR9 </postcode> <postcode>5HX </postcode> </normal> <normal> <tel>01603 </tel> <tel>880263 </tel> </normal> <bold> <t>1 </t> <t>2 </t> <t>4 </t> </bold> </para>
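For readers who do not use XSLT, the idea behind this first stage can be sketched in Python. This is only an illustration, not the stylesheet I actually ran: the function name `simplify`, the regular expression for the pub number and the use of the red font colour to spot it are all assumptions based on the single sample above, and the postcode and telephone tagging is left out.

```python
import re
import xml.etree.ElementTree as ET

# Assumed rules, inferred from the one sample paragraph shown above.
ID_RE = re.compile(r"^\d+\.$")   # a pub number such as "186."
HEADING_COLOUR = "#d10018"       # the red used for the number and pub name


def simplify(paragraph_xml: str) -> ET.Element:
    """Collapse a PARAGRAPH into <para> with <bold>/<normal> runs of tokens."""
    para = ET.Element("para")
    run = None
    for tok in ET.fromstring(paragraph_xml).iter("TOKEN"):
        weight = "bold" if tok.get("bold") == "yes" else "normal"
        # Start a new run whenever the weight changes.
        if run is None or run.tag != weight:
            run = ET.SubElement(para, weight)
        text = (tok.text or "").strip()
        is_id = (
            weight == "bold"
            and tok.get("font-color") == HEADING_COLOUR
            and ID_RE.match(text) is not None
        )
        child = ET.SubElement(run, "id" if is_id else "t")
        child.text = text + " "  # keep a trailing space, as in the sample output
    return para
```

Calling `ET.tostring(simplify(xml), encoding="unicode")` on an extracted paragraph gives the simplified markup as a string. All the positional attributes are thrown away; only the bold/normal distinction and the token text survive.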
This wasn't perfect: there were a few instances where an entry was split across paragraphs, or several entries ended up in a single paragraph. As this was a quick and dirty method, I sorted these out manually, but I may look into getting the XSLT to detect them and take out the manual intervention. As it happened, the manual effort took less than an hour, which was not bad for 680 or so pubs!
A second-stage XSLT transformation was then used to create the structure and add the geolocation data, which was inserted using a call to the Google Maps API. The issue here was that the API limited access to 20 requests per minute, so I needed to slow the transformation down. This was done by giving the transformation engine (Saxon) a simple task to undertake:
<xsl:message><xsl:value-of select="for $i in 1 to 100000 return $i*2.5"/></xsl:message>
The delay took a little experimentation before reaching the desired result, but in the end it worked a dream. The geocoding response also provided the county information, which would otherwise have had to be inferred from the identity number of each pub item. The result was a lot cleaner:
<pub> <name id="186.">Marlingford Bell</name> <address>Bawburgh Road,</address> <address type="place">Marlingford</address> <postcode>NR9 5HX</postcode> <coords worked="60000"> <GeocodeResponse> <status>OK</status> <result> <type>postal_code</type> <formatted_address>Marlingford, Norfolk NR9 5HX, UK</formatted_address> <address_component> <long_name>NR9 5HX</long_name> <short_name>NR9 5HX</short_name> <type>postal_code</type> </address_component> <address_component> <long_name>Marlingford</long_name> <short_name>Marlingford</short_name> <type>locality</type> <type>political</type> </address_component> <address_component> <long_name>Norfolk</long_name> <short_name>Norfk</short_name> <type>administrative_area_level_2</type> <type>political</type> </address_component> <address_component> <long_name>United Kingdom</long_name> <short_name>GB</short_name> <type>country</type> <type>political</type> </address_component> <address_component> <long_name>Norwich</long_name> <short_name>Norwich</short_name> <type>postal_town</type> </address_component> <geometry> <location> <lat>52.6377664</lat> <lng>1.1488654</lng> </location> <location_type>APPROXIMATE</location_type> <viewport> <southwest> <lat>52.6364174</lat> <lng>1.1472765</lng> </southwest> <northeast> <lat>52.6391153</lat> <lng>1.1504542</lng> </northeast> </viewport> <bounds> <southwest> <lat>52.6370836</lat> <lng>1.1472765</lng> </southwest> <northeast> <lat>52.6384491</lat> <lng>1.1504542</lng> </northeast> </bounds> </geometry> </result> </GeocodeResponse> </coords> <tel>01603 880263 </tel> <key> <keyitem>1</keyitem> <keyitem>2</keyitem> <keyitem>4</keyitem> </key> </pub>
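The two ideas in this second stage, pacing the requests and picking the useful values out of the response, can also be sketched in Python. Again this is an illustration under assumptions rather than what I ran: `throttled` and `extract_location` are made-up names, the sketch sleeps between items instead of keeping the processor busy as the Saxon trick does, and the actual HTTP request to the geocoder is deliberately left out.

```python
import time
import xml.etree.ElementTree as ET


def throttled(items, requests_per_minute=20, sleep=time.sleep, clock=time.monotonic):
    """Yield items no faster than the given rate (20 per minute = one every 3 s)."""
    interval = 60.0 / requests_per_minute
    last = None
    for item in items:
        if last is not None:
            # Wait only for whatever part of the interval has not already passed.
            wait = interval - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        yield item


def extract_location(geocode_xml: str):
    """Pull lat, lng and county out of a GeocodeResponse; None if status is not OK."""
    root = ET.fromstring(geocode_xml)
    if root.findtext("status") != "OK":
        return None
    result = root.find("result")
    county = None
    for comp in result.findall("address_component"):
        types = [t.text for t in comp.findall("type")]
        # Assumption: the county is the administrative_area_level_2 component,
        # as it is for Norfolk in the sample response above.
        if "administrative_area_level_2" in types:
            county = comp.findtext("long_name")
    return {
        "lat": result.findtext("geometry/location/lat"),
        "lng": result.findtext("geometry/location/lng"),
        "county": county,
    }
```

A loop such as `for pub in throttled(pubs): ...` would then make one geocoding request per pub and pass the response text to `extract_location`. The coordinates are kept as strings so that they can be written out later exactly as they were received.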
The final stage was a transform to create the JSON-style array entries that were inserted into the HTML file for the Google Map.
['Marlingford Bell, Marlingford',52.6377664,1.1488654,'','Bawburgh Road, Marlingford, Norfolk NR9 5HX. Tel: 01603 880263 ',0,1,1,0,1,0]
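As a rough Python equivalent of that last transform, the sketch below builds one such entry. The names `js_string` and `marker_entry` are made up for illustration. Two things are assumptions read off the single example above: that the six trailing flags are indexed from zero, so key items 1, 2 and 4 give `0,1,1,0,1,0`, and that the fourth field is simply left empty.

```python
def js_string(value: str) -> str:
    """Quote a string as a single-quoted JavaScript literal."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def marker_entry(title, lat, lng, address, keys, flag_count=6):
    """Build one array entry for the map page.

    lat and lng are passed as strings so they are written out unchanged.
    keys is the set of key item numbers for the pub, e.g. {1, 2, 4}.
    """
    # Assumption: flag position i (counting from zero) is 1 when key item i applies.
    flags = ["1" if i in keys else "0" for i in range(flag_count)]
    fields = [js_string(title), lat, lng, js_string(""), js_string(address)] + flags
    return "[" + ",".join(fields) + "]"


entry = marker_entry(
    "Marlingford Bell, Marlingford",
    "52.6377664",
    "1.1488654",
    "Bawburgh Road, Marlingford, Norfolk NR9 5HX. Tel: 01603 880263 ",
    keys={1, 2, 4},
)
print(entry)
```

The escaping in `js_string` matters for pub names containing an apostrophe, which would otherwise end the string early and break the page's script.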
As stated, this was quick and dirty, but it achieved a result. Hopefully I can make this more seamless in the future. The result can be found at http://vulcanarms.freehostia.com/woodfordes/2013.htm