QA/test/report tool for ePub/ebooks

The automatic filter and e-book quality control error reporting facility that you built for us has been an extremely useful tool. We have reduced time of production per title while keeping the quality of our product line very high. We have used it for over 300 e-books to date.
— M.A., E-book Editor, Ignatius Press

The QA/test/report tool is a web-based ebook filter and “proofreader.” My client, a book publisher, is converting an extensive backlist of print books for release as ebooks. For most titles, the process is to scan and OCR the book, save the result as HTML, then proofread and hand-edit it to match the printed book as closely as possible. A production person then uploads that file to the QA tool, which

  1. filters the HTML file, converting things like typographic quotes and dashes, adding CSS, and rewriting HTML (much like Markdown, but recognizing more constructs).
  2. reads the filtered output and reports probable typos in the file. Words not found in the whitelist are printed in red. Unlikely events are listed (e.g., numerals in alpha strings and vice versa; italics left on; 3+ contiguous identical alphas; unbalanced parens, braces, brackets; questionable elisions; incorrect dashes; broken numeric sequences; unexpected formatting): 416 possible errors that may be found by counting and regex matching. These findings are not filtered out, only reported: "iii" may be a typo or a roman numeral, etc.
  3. runs the output of (1) through third-party tools to check for valid XHTML/CSS, etc. (via API or shell).
  4. splits the file into chapters and packages the directory as an ePub after generating the framework, OPF, META-INF, UUID, and toc.ncx; compiling and hyperlinking a Kindle-only table of contents; and hyperlinking footnotes and moving them to the end of their respective chapters. It then tests the ePub against the tools recommended by iBooks, Kindle, etc. for ePub compliance. Two or more ePubs (or DAISY) may be forked here to accommodate different products for different readers.
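
The reporting pass in step 2 can be illustrated with a minimal Python sketch. It is not the tool's actual code: the whitelist, the three regex rules, and the names `WHITELIST`, `CHECKS`, `unbalanced`, and `report` are all hypothetical stand-ins for the 416 real checks. It flags suspicious strings by regex matching and counting, and reports everything rather than deciding what is an error.

```python
import re

# Hypothetical whitelist; the real tool's word list is not shown here.
WHITELIST = {"the", "cat", "sat", "on", "mat"}

# Illustrative rules only, each a (label, compiled regex) pair.
CHECKS = [
    ("numeral inside letters", re.compile(r"\b[A-Za-z]+\d+[A-Za-z]*\b")),
    ("letter inside numerals", re.compile(r"\b\d+[A-Za-z]+\d+\b")),
    ("3+ identical letters", re.compile(r"([A-Za-z])\1\1+")),
]

PAIRS = {"(": ")", "[": "]", "{": "}"}


def unbalanced(text):
    """Return bracket pairs whose open and close counts differ."""
    return [o + c for o, c in PAIRS.items() if text.count(o) != text.count(c)]


def report(text):
    """List every suspicious item; nothing is corrected or suppressed."""
    findings = []
    for label, pattern in CHECKS:
        for match in pattern.finditer(text):
            findings.append((label, match.group(0)))
    for pair in unbalanced(text):
        findings.append(("unbalanced", pair))
    # Purely alphabetic words that are missing from the whitelist.
    for word in re.findall(r"\b[A-Za-z]+\b", text):
        if word.lower() not in WHITELIST:
            findings.append(("unknown word", word))
    return findings


if __name__ == "__main__":
    for label, item in report("The cat sat on the mat (iii."):
        print(f"{label}: {item}")
```

Note that "iii" is reported twice (as a repeated letter and as an unknown word) and is left for the proofreader to judge, since it may be a roman numeral.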

Sample input text:

#chap CHAPTER 1


Can be expanded / filtered to suit the needs of the current project:

<a id="c1" /><h1>I<br />
An Introduction to the Topic at Hand</h1>

<a id="s1-1" /><h2>Historical Overview and
a Brief Description of the Ground Rules</h2>
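
As a rough illustration of how a directive such as `#chap` might be expanded, here is a minimal Python sketch. The actual expansion rules of the tool are not shown above, so the regex, the sequential `c1`, `c2`, ... anchor ids, and the function name `expand_chapters` are assumptions made for the example.

```python
import html
import re

# Assumed directive syntax: "#chap" followed by the heading text.
CHAP = re.compile(r"^#chap\s+(.+)$")


def expand_chapters(lines):
    """Replace each #chap line with an anchored <h1>; pass other lines through."""
    out = []
    chapter = 0
    for line in lines:
        match = CHAP.match(line)
        if match:
            chapter += 1
            title = html.escape(match.group(1).strip())
            out.append(f'<a id="c{chapter}" /><h1>{title}</h1>')
        else:
            out.append(line)
    return out


if __name__ == "__main__":
    for line in expand_chapters(["#chap CHAPTER 1", "Body text."]):
        print(line)
```

A per-project rule set could swap in different markup for the same directive, which is one way the expansion could be adapted to each title.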

Input can be plain text, HTML, or Markdown.
Output can be HTML (any), TeX/LaTeX, or InDesign or Quark input language.