Fun with ePub

I’ve been reading a bunch of H.P. Lovecraft short stories recently, mostly online. I wanted an ebook version that I could read offline, and I wanted to be sure it held the complete set of stories, not some curated selection of the “best” ones. That got me thinking about ebooks, so I decided to figure out how to build an ebook of Lovecraft’s stories to read on my Android phone.

I quickly settled on the ePub format since it is moderately popular, has good device support, and the format is open. I don’t know if it is the best format, but it worked for this project.

This is not meant as an exhaustive guide for creating ebooks nor does it stretch the limits and dark recesses of the ePub format. This is just the result of me trying to munge together a rather simple no-frills ebook. I documented the process so that hopefully someone else might learn from my experience as well.

About ePub Format

The ePub format is actually a collection of specs that together form a standard for ebooks. The official page for these specs is here. The basic container for an ePub is a ZIP file; it just needs to be laid out in a certain way and contain the right parts for an ebook reader to read it correctly.

Parts of an ePub Book

mimetype

application/epub+zip

This is always a 20-byte text file containing exactly the text “application/epub+zip”, with no trailing newline. There are special restrictions on this file within the zip archive:

  1. This file must be the first one in the zip archive. You can achieve this by adding it first to a new zip file and then adding the other files afterwards; the -g flag of the Linux zip command adds files to an existing archive.
  2. This file must not be compressed. The -0 flag of the Linux zip command stores the added file without compression.
  3. This file’s entry must not carry an extra field in its zip header. By default the Linux zip command saves extra file attributes for each file in the archive (on Unix, file times and uid/gid); the -X flag leaves them out.

Detailed information about the mimetype file is here.
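
As a rough sketch, the three restrictions above can be checked together by looking at the first bytes of the finished archive. The function name mimetype_layout_ok is made up for this illustration. It relies on the zip local file header being 30 bytes long, followed by the file name, then any extra field, then the file data.

```python
MIMETYPE = b"application/epub+zip"  # 20 bytes, no trailing newline


def mimetype_layout_ok(raw: bytes) -> bool:
    """Check the start of an archive against the three mimetype rules.

    If "mimetype" is the first entry, has no extra field, and is stored
    uncompressed, then its name sits at byte 30 and its text at byte 38.
    """
    return (
        raw[0:4] == b"PK\x03\x04"        # archive starts with a local file header
        and raw[30:38] == b"mimetype"    # rule 1: first entry is named mimetype
        and raw[38:58] == MIMETYPE       # rules 2 and 3: stored, no extra field
    )
```

Reading the first 58 bytes of an .epub file and passing them to this function is enough; a compressed entry or one with an extra field would shift or alter the bytes at position 38.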

META-INF/container.xml

<?xml version='1.0' encoding='utf-8'?>
<container xmlns="urn:oasis:names:tc:opendocument:xmlns:container" version="1.0">
  <rootfiles>
    <rootfile media-type="application/oebps-package+xml" full-path="OEBPS/content.opf"/>
  </rootfiles>
</container>

The schema for this is described here. This is the second necessary component of the ePub: your zip archive needs a directory named META-INF containing a file named container.xml. This XML file’s only purpose is to point at the OPF file described below.
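
To illustrate the pointer role of container.xml, here is a small Python sketch of how a reader might locate the OPF file. The helper name find_opf_path is hypothetical, and real reading software may well do this differently.

```python
import xml.etree.ElementTree as ET

# Namespace used by META-INF/container.xml
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def find_opf_path(container_xml: bytes) -> str:
    """Return the full-path attribute of the first rootfile element."""
    root = ET.fromstring(container_xml)
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None:
        raise ValueError("container.xml has no <rootfile> element")
    return rootfile.attrib["full-path"]
```

For the container.xml shown above, this would return the path OEBPS/content.opf.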

OEBPS/content.opf

<?xml version='1.0' encoding='UTF-8'?>
<package xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:opf="http://www.idpf.org/2007/opf"
    xmlns="http://www.idpf.org/2007/opf" unique-identifier="id" version="2.0">
  <metadata>
    <dc:identifier opf:scheme="URI" id="id">lovecraft123456789</dc:identifier>
    <dc:creator opf:file-as="Lovecraft, H. P.">H. P. Lovecraft</dc:creator>
    <dc:title>H.P. Lovecraft: Complete Works</dc:title>
    <dc:language xsi:type="dcterms:RFC4646">en</dc:language>
  </metadata>
  <manifest>
    <item href="cy.html" media-type="application/xhtml+xml" id="item2"/>
    <item href="ee.html" media-type="application/xhtml+xml" id="item3"/>
    <item href="toc.ncx" media-type="application/x-dtbncx+xml" id="ncx"/>
  </manifest>
  <spine toc="ncx">
    <itemref idref="item2" linear="yes"/>
    <itemref idref="item3" linear="yes"/>
  </spine>
</package>

The OPF format is described here. There must be a “package” element with three child elements: “metadata”, “manifest”, and “spine”. The “package” element must have the “unique-identifier” and “version” attributes. The “metadata” element must have at least the “dc:title”, “dc:identifier”, and “dc:language” children. The “manifest” element lists every content resource that is part of the ebook, each with its media type. The “spine” element has one or more “itemref” child elements, each pointing at a manifest item by id, which together define the reading order.
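
The requirements in this paragraph are easy to get wrong by hand, so a quick self-check can help. The following Python sketch covers only the rules listed above and is no substitute for a real validator; the function name check_opf and its messages are invented for this example.

```python
import xml.etree.ElementTree as ET

NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def check_opf(opf_xml: bytes) -> list:
    """Return a list of problems found in a content.opf (empty if none)."""
    problems = []
    package = ET.fromstring(opf_xml)

    # The package element needs both attributes.
    for attr in ("unique-identifier", "version"):
        if attr not in package.attrib:
            problems.append(f"package is missing the {attr} attribute")

    # The metadata element needs these three Dublin Core children.
    for name in ("title", "identifier", "language"):
        if package.find(f"opf:metadata/dc:{name}", NS) is None:
            problems.append(f"metadata is missing dc:{name}")

    # Every spine itemref must point at an id declared in the manifest.
    manifest_ids = {
        item.get("id") for item in package.findall("opf:manifest/opf:item", NS)
    }
    for itemref in package.findall("opf:spine/opf:itemref", NS):
        idref = itemref.get("idref")
        if idref not in manifest_ids:
            problems.append(f"spine references unknown manifest id {idref!r}")
    return problems
```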

OEBPS/*

toc.ncx

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE ncx PUBLIC '-//NISO//DTD ncx 2005-1//EN'
    'http://www.daisy.org/z3986/2005/ncx-2005-1.dtd'>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" xml:lang="en"
    version="2005-1">
  <head>
    <meta name="dtb:uid" content="lovecraft123456789"/>
    <meta name="dtb:depth" content="1"/>
    <meta name="dtb:totalPageCount" content="0"/>
    <meta name="dtb:maxPageNumber" content="0"/>
  </head>
  <docTitle>
    <text>H.P. Lovecraft: Complete Works</text>
  </docTitle>
  <navMap>
    <navPoint playOrder="1" id="cy">
      <navLabel>
        <text>The Curse of Yig</text>
      </navLabel>
      <content src="cy.html"/>
    </navPoint>
    <navPoint playOrder="2" id="ee">
      <navLabel>
        <text>The Electric Executioner</text>
      </navLabel>
      <content src="ee.html"/>
    </navPoint>
  </navMap>
</ncx>

*.html

<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.1//EN'
    'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd'>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Story Title Goes Here...</title>
    <meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8"/>
  </head>
  <body>
    Content Goes Here...
  </body>
</html>

The rest of the OEBPS directory contains the items listed in the content.opf manifest section. You can read about all the fancy stuff that is supported in the spec here. In this example we include an NCX file for navigation. ePub 2.0, which the content.opf above declares as its version, requires it; ePub 3 replaced it with a navigation document, so newer books may not need it. The rest of the content is just basic XHTML documents. I stuck to paragraph (<p>) tags in my HTML, though more complex formatting is supported.

Complete Example

Download the Index of Stories

The first step was to get the data I wanted to bundle into the ebook. Luckily, a single page contained links to all the stories I wanted. I downloaded this index with wget and then used a series of regex-replace commands in emacs to turn it into a flat text file with one story URL per line.
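
The same link extraction could also be scripted rather than done with editor regexes. The Python sketch below is one possible approach, not the method used for this project; the names LinkCollector and extract_links are hypothetical, and it assumes every story is linked with an ordinary <a href="..."> tag.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Turn relative links into absolute URLs.
            self.links.append(urljoin(self.base_url, href))


def extract_links(index_html: str, base_url: str) -> list:
    """Return all link targets in index_html, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(index_html)
    collector.close()
    return collector.links
```

Writing the returned list to a file, one URL per line, gives the flat text file described above. A real index page would probably need some filtering to drop navigation links.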

Download All the Stories

I wrote a quick bash script that would download all the stories from the text file:

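# Read one URL per line; -r keeps backslashes literal
while IFS= read -r p; do
  echo "Downloading $p"
  wget "$p"   # quoted so the shell does not split or glob-expand the URL
  sleep 1s    # pause between requests to go easy on the server
done <lovecraft2.txt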

At this point I had all the html files for the stories.

Prepare the Stories

The files downloaded were not valid xhtml. I fixed these files up and simplified the HTML so that it would be more easily restylable by the ebook reading software. Right now I am doing this manually, which is not great.
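
One way part of this step might be automated is sketched below in Python. It assumes each story has already been reduced to a title and a list of plain-text paragraphs, which is the hard part and is not shown. The name story_to_xhtml is made up for this example, and the template mirrors the *.html skeleton shown earlier.

```python
from xml.sax.saxutils import escape

XHTML_TEMPLATE = """<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.1//EN'
    'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd'>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>{title}</title>
    <meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8"/>
  </head>
  <body>
{body}
  </body>
</html>
"""


def story_to_xhtml(title: str, paragraphs: list) -> str:
    """Wrap plain-text paragraphs in <p> tags inside the XHTML skeleton."""
    # escape() replaces &, < and > so the text cannot break the markup.
    body = "\n".join(f"    <p>{escape(p)}</p>" for p in paragraphs)
    return XHTML_TEMPLATE.format(title=escape(title), body=body)
```

Keeping the markup down to bare paragraphs leaves the styling decisions to the reading software.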

Prepare the content.opf and toc.ncx

Entries for each story need to be added to the content.opf and toc.ncx. These modifications can all be generated automatically from the list of story URLs produced in the “Download the Index of Stories” step above.
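
As a sketch of what that generation could look like, the Python function below produces the three repeated fragments from a list of (filename, title) pairs. The function name build_entries and the item1/nav1 id scheme are arbitrary choices for this illustration; any ids work as long as they are unique and the spine idrefs match the manifest.

```python
from xml.sax.saxutils import escape, quoteattr


def build_entries(stories: list) -> tuple:
    """Build manifest, spine, and navMap fragments.

    stories is a list of (filename, title) pairs in reading order.
    """
    manifest, spine, nav = [], [], []
    for order, (filename, title) in enumerate(stories, start=1):
        item_id = f"item{order}"
        # quoteattr() returns the value already wrapped in quotes.
        manifest.append(
            f"    <item href={quoteattr(filename)} "
            f'media-type="application/xhtml+xml" id="{item_id}"/>'
        )
        spine.append(f'    <itemref idref="{item_id}" linear="yes"/>')
        nav.append(
            f'    <navPoint playOrder="{order}" id="nav{order}">\n'
            f"      <navLabel>\n"
            f"        <text>{escape(title)}</text>\n"
            f"      </navLabel>\n"
            f"      <content src={quoteattr(filename)}/>\n"
            f"    </navPoint>"
        )
    return "\n".join(manifest), "\n".join(spine), "\n".join(nav)
```

The first two strings belong inside the manifest and spine elements of content.opf, and the third inside the navMap element of toc.ncx.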

Packaging Everything Together

Finally, here is the set of Linux commands that I ran to package my ePub (naturally “lc” stands for “Lovecraft”):

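# Add mimetype first: stored uncompressed (-0), without extra file attributes (-X)
zip -0 -X lc mimetype
# Grow (-g) the existing lc.zip with the two directories, recursively (-r)
zip -g -r lc META-INF/
zip -g -r lc OEBPS/
# The .epub file is the same archive under a different extension
cp lc.zip lc.epub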
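
For systems without the zip command, roughly the same packaging can be sketched with Python's standard zipfile module. This is an alternative under the same directory layout, not what was run for this project, and build_epub is a hypothetical name.

```python
import zipfile
from pathlib import Path


def build_epub(source_dir: Path, epub_path: Path) -> None:
    """Package source_dir (mimetype, META-INF/, OEBPS/) into an ePub file."""
    with zipfile.ZipFile(epub_path, "w", zipfile.ZIP_DEFLATED) as zf:
        # mimetype must come first and must be stored without compression.
        zf.write(source_dir / "mimetype", "mimetype",
                 compress_type=zipfile.ZIP_STORED)
        # Everything else can be compressed normally.
        for folder in ("META-INF", "OEBPS"):
            for path in sorted((source_dir / folder).rglob("*")):
                if path.is_file():
                    zf.write(path, path.relative_to(source_dir).as_posix())
```

Calling build_epub(Path("."), Path("lc.epub")) from the directory that holds mimetype, META-INF, and OEBPS would write lc.epub directly, with no intermediate lc.zip.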

Tips

If you are having trouble with your ePub, you can use this official online validator to check its formatting: http://validator.idpf.org/. I was able to debug any errors I didn’t understand by Googling them.

To get your ePub onto your device, you can either host it somewhere online and download it from there or, as I did, upload it directly by dragging and dropping it onto https://play.google.com/books. If your ePub is invalid, the upload will spend a long time “processing” the file and then fail.

You can also start with an existing ePub file and modify it. A good source of free ePub books is Project Gutenberg, where you might, for example, download something like Metamorphosis by Franz Kafka to see how their ebooks are formatted.

By the way, here is an ePub of this document ;-)