Optimizing Certain EPUB Books

I recently downloaded a bunch of EPUB books. Alas, they loaded extremely slowly in my favorite Android ebook readers: it could take minutes before the first word of the text appeared!

An EPUB book is pretty simple to decipher: it is a ZIP file containing metadata and HTML content. Here are the contents of one of these EPUB files after unpacking:

[Image: contents of the unpacked EPUB before fixing]
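
The contents can also be listed without unpacking, because Python's standard zipfile module reads an EPUB directly. A minimal sketch; the function name list_epub_entries is made up for this example:

```python
import zipfile


def list_epub_entries(epub_path):
    """Return (name, uncompressed size in bytes) for every file in the EPUB."""
    with zipfile.ZipFile(epub_path) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


# Example use (the file name is hypothetical):
#     for name, size in list_epub_entries("book.epub"):
#         print(f"{size:>10}  {name}")
```

A single very large HTML entry in this listing is the kind of thing worth looking for in a slow-loading book.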

I won't go into the details of the EPUB standards. My investigation turned up the following issues with the above book:

According to the book metadata, the files were created by a PHP library, identified as EPub (2.1) by A. Grandt, http://www.phpclasses.org/package/6115. Note: I'm not criticizing the author of said package, but rather the website that used this old version of it!

A Fix

A decent epub processor might be able to fix some of these problems. For my own purposes, I wrote a brute-force C# program which fixes the most painful problems. This program performs basic text processing on the ebook contents. The result is an ebook which loads on my Android tablet in seconds, with minimal disruption in the text flow!

The program can be found in my epub_fixer repository.

Features:

  1. Breaks the single HTML file into many small chunks, each under an arbitrarily chosen limit of 25K. The reader handles each small chunk quickly.
  2. Replaces each <br> tag with </p><p>. The reader can parse this easily and display text quickly.
  3. Deletes the empty Log.Html file.
  4. Updates the epub metadata to reference the changed content files.
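
Features 1 and 2 amount to simple text processing. The sketch below illustrates the idea in Python; it is not the C# code from the repository, and the names replace_breaks, split_into_chunks and MAX_CHUNK_BYTES are made up for this example. It assumes the body text is already wrapped in <p> elements and that "25K" means 25,000 bytes of UTF-8.

```python
import re

MAX_CHUNK_BYTES = 25_000  # assumption: one reading of "25K"

# Matches <br>, <br/> and <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)


def replace_breaks(body_html):
    """Turn each <br> tag into a paragraph boundary."""
    return BR_TAG.sub("</p><p>", body_html)


def split_into_chunks(body_html, max_bytes=MAX_CHUNK_BYTES):
    """Group whole paragraphs into chunks no larger than max_bytes each.

    A single paragraph that exceeds max_bytes is kept whole in its own chunk.
    """
    # Split just after every closing </p>, so each piece ends with one
    # (splitting on an empty match needs Python 3.7 or later).
    paragraphs = [p for p in re.split(r"(?<=</p>)", body_html) if p.strip()]
    chunks, current, current_size = [], [], 0
    for paragraph in paragraphs:
        size = len(paragraph.encode("utf-8"))
        if current and current_size + size > max_bytes:
            # Adding this paragraph would overflow, so close the chunk.
            chunks.append("".join(current))
            current, current_size = [], 0
        current.append(paragraph)
        current_size += size
    if current:
        chunks.append("".join(current))
    return chunks
```

Each chunk would still need to be wrapped in a complete HTML document and registered in the epub metadata (feature 4); that part is not shown here.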

Known Limitations:

Suggested usage:

The following DOS batch script processes a single epub file. It takes the filename, without the extension, as its command-line parameter. The batch file may need to be tweaked to change the path to the "fixer" program.

unzip %1.epub -d temp
cd temp
..\..\fixer
zip -r -9 ..\%1_v2.epub *
cd ..
rmdir /S /Q temp

The zip and unzip commands are the Info-ZIP programs, currently available on SourceForge.

The batch file unpacks the epub into a "temp" sub-folder, runs the "fixer" program, packs the "fixed" epub into a file with the same name and "_v2" appended, and finally cleans up the "temp" sub-folder.
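
One caveat on the packing step: the EPUB container format expects the "mimetype" file to be the first entry in the ZIP archive and to be stored uncompressed, and a plain recursive zip does not guarantee either. Many readers tolerate this, but a stricter one may not. As a hedged alternative, here is a minimal Python sketch of the packing step that enforces the ordering; the function name pack_epub is made up for this example, and epub_path is assumed to lie outside source_dir.

```python
import os
import zipfile


def pack_epub(source_dir, epub_path):
    """Zip source_dir into epub_path, writing 'mimetype' first and uncompressed."""
    with zipfile.ZipFile(epub_path, "w") as archive:
        mimetype_path = os.path.join(source_dir, "mimetype")
        if os.path.exists(mimetype_path):
            archive.write(mimetype_path, "mimetype",
                          compress_type=zipfile.ZIP_STORED)
        for folder, _subfolders, files in os.walk(source_dir):
            for file_name in sorted(files):
                full_path = os.path.join(folder, file_name)
                # Entry names are relative to source_dir, with forward slashes.
                arc_name = os.path.relpath(full_path, source_dir).replace(os.sep, "/")
                if arc_name == "mimetype":
                    continue  # already written above
                archive.write(full_path, arc_name,
                              compress_type=zipfile.ZIP_DEFLATED)
```

Calling pack_epub("temp", "book_v2.epub") from the folder that contains "temp" would stand in for the zip line of the batch file.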

I used the batch file to "fix" multiple epub files in a folder as follows (assuming the above script is called "fix_cmd.bat"):

for %i in (*.epub) do fix_cmd "%~ni"