Thursday 24 May 2018

Remove illegal XML characters

A recent job required text extracted by PDFBox to be transformed with XSLT, which exposed problems with illegal characters in the content. The transformation ran as part of an Ant build, so a simple replaceregexp task solved the problem. The matter was complicated by characters that had not been correctly converted to UTF-8, notably the Greek π character, so two regular expressions were needed. The first matches every character outside the ranges permitted by the XML specification; the second is specific to the π character.

<target name="remove-illegal">
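 <!-- Strip everything outside the XML 1.0 character ranges. Code points above U+FFFF use \x{...} because \u takes exactly four hex digits. -->
 <replaceregexp match="[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\x{10000}-\x{10FFFF}]" replace="" flags="g" encoding="utf-8">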
  <fileset dir="PDFbox-text" includes="*.txt **/*.txt"/>
 </replaceregexp>
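 <!-- Strip U+00CF, assumed here to be the stray character left by a mis-decoded π (UTF-8 bytes CF 80). In an XML attribute the backslash is not doubled: [\\xcf] would match a backslash and the letters x, c and f. -->
 <replaceregexp match="\xCF" replace="" flags="gs" encoding="utf-8">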
  <fileset dir="PDFbox-text" includes="*.txt **/*.txt"/>
 </replaceregexp>
</target>
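
To check the two expressions outside the build, they can be run directly in Java. This is a minimal sketch and assumes Ant is using its default java.util.regex engine. The class name XmlCharFilter and the method clean are hypothetical names chosen for illustration, not part of Ant or PDFBox. The supplementary range is written as \x{10000}-\x{10FFFF} because Java's \u escape takes exactly four hex digits.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration: applies the same two regexes as the Ant target.
public class XmlCharFilter {

    // Anything NOT in the XML 1.0 Char production:
    // tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
    private static final Pattern ILLEGAL_XML = Pattern.compile(
        "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    // U+00CF, assumed to be the residue of a mis-decoded pi (UTF-8 bytes CF 80).
    private static final Pattern STRAY_CF = Pattern.compile("\\xCF");

    public static String clean(String text) {
        String legal = ILLEGAL_XML.matcher(text).replaceAll("");
        return STRAY_CF.matcher(legal).replaceAll("");
    }

    public static void main(String[] args) {
        // U+0001 and form feed are illegal in XML 1.0; U+00CF is the stray character.
        String sample = "a\u0001b\fc\u00CFd";
        System.out.println(clean(sample));
    }
}
```

Ordinary letters such as x, c and f, a correctly encoded π (U+03C0), and characters above U+FFFF all pass through clean unchanged, which is a quick way to confirm that neither pattern removes more than intended.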