I recently downloaded a bunch of EPUB books. Alas, I found they loaded extremely slowly in my favorite Android ebook readers. I.e. it could take minutes before the first word of the text would appear!
An EPUB book is pretty simple to decipher: it is a ZIP file, with metadata and HTML content. Unpacking one of these EPUB files, here are the contents:
I'm not going into the details of the EPUB standards. As a result of my investigations, here are the issues with the above book:
- This is an EPUB2 book. The EPUB3 standard supersedes EPUB2 as of October 11, 2011!
- The book content consists of a single, large HTML file. Some ebook readers, especially on a low-memory device running Android, may have problems loading the file in one go.
- Looking at the HTML in the content file, it has been packaged as a single <P> (paragraph) tag, with the book paragraphs split using the <br> (break) tag. A simple ebook reader (possibly relying on the Android webkit library) will need to parse the entire HTML file before any text can be displayed!
- The file has not been packaged according to the recommended EPUB2 layout, with distinct META-INF and OEBPS sub-folders.
- A minor issue: the file contains a "Log.html" file which has no real content.
According to the book metadata, the files were created by a PHP
library, identified as
EPub (2.1) by A. Grandt, http://www.phpclasses.org/package/6115.
NOTE I'm not criticizing the author of said package, but the website
which used this old version of the package!
A decent epub processor might be able to fix some of these problems. For my own purposes, I wrote a brute-force C# program which fixes the most painful problems. This program performs basic text processing on the ebook contents. The result is an ebook which loads on my Android tablet in seconds, with minimal disruption in the text flow!
The program can be found in my epub_fixer repository.
- Breaks the single HTML file into many small chunks, arbitrarily less than 25K in size. The reader quickly handles each small chunk.
- Replaces each
</p><p>. The reader can parse this easily and display text quickly.
- The empty Log.Html file is deleted.
- The epub metadata is updated to use the changed content files.
- Only these "problem" books can be handled right now. If there isn't
Chapter01.htmlfile, the program will quit.
- Should a "problem" book have a table of contents, it probably won't work with the "fixed" book.
- Depending on the reader program, there may be "arbitrary" page breaks. These occur at "chunk" boundaries.
- The program does not unpack / pack the epub file. See the suggested usage below for details.
- The resulting "fixed" epub file is a bit larger. This is because each "chunk" needs a reference in the .NCX and .OPF files, and each "chunk" needs a copy of the full HTML header and trailer.
- The "fixed" book is still in the same EPUB2 format with no change to the packaging.
- The program is currently .NET framework based. It could be ported to .NET CORE / other platforms.
The following DOS script will process a single epub file. It requires the filename as a command line parameter without the extension. The batch file may need to be tweaked to change the path to the "fixer" program.
unzip %1.epub -d temp cd temp ..\..\fixer zip -r -9 ..\%1_v2.epub * cd .. rmdir /S /Q temp
The zip and unzip commands are the Info-ZIP programs, currently to be found on SourceForge.
The batch file unpacks the epub into a "temp" sub-folder, runs the "fixer" program, packs the "fixed" epub into a file with the same name and "_v2" appended, and finally cleans up the "temp" sub-folder.
I used the batch file to "fix" multiple epub files in a folder as follows (assuming the above script is called "fix_cmd.bat"):
for %i in (*.epub) do fix_cmd "%~nI"