I’ve been reading a bunch of H.P. Lovecraft short stories recently, mostly online. The problem was that I wanted an ebook version I could read offline, and I wanted to know that I had the complete set of stories, not some curated selection of the “best” ones. This got me thinking about ebooks, so I decided to figure out how to make myself an ebook of Lovecraft's stories to read on my Android phone.
I quickly settled on the ePub format since it is moderately popular, has good device support, and the format is open. I don’t know if it is the best format, but it worked for this project.
This is not meant as an exhaustive guide for creating ebooks nor does it stretch the limits and dark recesses of the ePub format. This is just the result of me trying to munge together a rather simple no-frills ebook. I documented the process so that hopefully someone else might learn from my experience as well.
The ePub format is actually a collection of specs that together form a standard for ebooks. The official page for these specs is here. The basic container for an ePub is a ZIP file; that ZIP file just needs to be constructed in a certain way and contain the right parts for an ebook reader to read it correctly.
application/epub+zip
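This is always a 20-byte text file containing exactly the text “application/epub+zip”, with no trailing newline. There are special restrictions on this file within the zip archive: it must be the first file in the archive, it must be stored uncompressed, and it must not have any extra fields in its zip header.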
Detailed information about the mimetype file is here.
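As a rough way to check the mimetype restrictions on an existing file, here is a minimal Python sketch using only the standard zipfile module. The function name check_mimetype_entry is made up for illustration, and it only inspects what the zip central directory reports, so treat it as a sanity check rather than a substitute for a real validator.

```python
import zipfile

EPUB_MIMETYPE = b"application/epub+zip"


def check_mimetype_entry(epub_path):
    """Return a list of problems found with the mimetype entry (empty if none)."""
    problems = []
    with zipfile.ZipFile(epub_path) as zf:
        try:
            info = zf.getinfo("mimetype")
        except KeyError:
            return ["no mimetype entry in the archive"]
        # An entry whose local header starts at byte 0 is physically first.
        if info.header_offset != 0:
            problems.append("mimetype is not the first entry")
        if info.compress_type != zipfile.ZIP_STORED:
            problems.append("mimetype is compressed")
        # This is the extra field as recorded in the central directory.
        if info.extra:
            problems.append("mimetype has an extra field")
        if zf.read(info) != EPUB_MIMETYPE:
            problems.append("mimetype content is not exactly application/epub+zip")
    return problems
```

Calling check_mimetype_entry("lc.epub") on a correctly packaged file should return an empty list.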
<?xml version='1.0' encoding='utf-8'?>
<container xmlns="urn:oasis:names:tc:opendocument:xmlns:container" version="1.0">
  <rootfiles>
    <rootfile media-type="application/oebps-package+xml" full-path="OEBPS/content.opf"/>
  </rootfiles>
</container>
The schema for this is described here. This is the second necessary component of the ePub: your zip archive needs a directory named META-INF, and it must contain a file named container.xml. This XML file’s only purpose is to point at the OPF file described below.
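To make the "pointer" role of container.xml concrete, here is a small Python sketch of how a reader might follow it to the OPF file. It uses the standard xml.etree.ElementTree module; the function name find_opf_path is hypothetical, and real reading software will do considerably more checking than this.

```python
import xml.etree.ElementTree as ET

# Namespace used by META-INF/container.xml
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def find_opf_path(container_xml):
    """Return the full-path of the first rootfile listed in container.xml."""
    root = ET.fromstring(container_xml)
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None:
        raise ValueError("container.xml has no rootfile element")
    return rootfile.get("full-path")
```

Fed the container.xml shown above (as bytes), this returns the path OEBPS/content.opf.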
<?xml version='1.0' encoding='UTF-8'?>
<package xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:opf="http://www.idpf.org/2007/opf"
         xmlns="http://www.idpf.org/2007/opf"
         unique-identifier="id" version="2.0">
  <metadata>
    <dc:identifier opf:scheme="URI" id="id">lovecraft123456789</dc:identifier>
    <dc:creator opf:file-as="Lovecraft, H. P.">H. P. Lovecraft</dc:creator>
    <dc:title>H.P. Lovecraft: Complete Works</dc:title>
    <dc:language xsi:type="dcterms:RFC4646">en</dc:language>
  </metadata>
  <manifest>
    <item href="cy.html" media-type="application/xhtml+xml" id="item2"/>
    <item href="ee.html" media-type="application/xhtml+xml" id="item3"/>
    <item href="toc.ncx" media-type="application/x-dtbncx+xml" id="ncx"/>
  </manifest>
  <spine toc="ncx">
    <itemref idref="item2" linear="yes"/>
    <itemref idref="item3" linear="yes"/>
  </spine>
</package>
The OPF format is described here. There must be a “package” element with child elements: “metadata”, “manifest”, and “spine”. The “package” element must have the “unique-identifier” attribute and “version” attribute. The “metadata” element must have at least the “dc:title”, “dc:identifier”, and “dc:language” children. The “manifest” element defines all of the content resources (with the appropriate mimetype) that are part of the ebook. The “spine” element has multiple “itemref” child elements defining the reading order for the items in the manifest.
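The requirements in that paragraph can be sketched as a small Python check that also resolves the spine into a reading order. This is only an illustration using the standard library; the function name reading_order is hypothetical, and the check covers just the elements and attributes named above, not the full OPF spec.

```python
import xml.etree.ElementTree as ET

NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def reading_order(opf_xml):
    """Check the required OPF parts and return manifest hrefs in spine order."""
    package = ET.fromstring(opf_xml)
    for attr in ("unique-identifier", "version"):
        if package.get(attr) is None:
            raise ValueError(f"package is missing the {attr} attribute")
    parts = {}
    for name in ("metadata", "manifest", "spine"):
        parts[name] = package.find(f"opf:{name}", NS)
        if parts[name] is None:
            raise ValueError(f"package is missing the {name} element")
    for tag in ("dc:title", "dc:identifier", "dc:language"):
        if parts["metadata"].find(tag, NS) is None:
            raise ValueError(f"metadata is missing {tag}")
    # Map manifest ids to hrefs, then follow the itemrefs in the spine.
    hrefs = {i.get("id"): i.get("href") for i in parts["manifest"].findall("opf:item", NS)}
    return [hrefs[ref.get("idref")] for ref in parts["spine"].findall("opf:itemref", NS)]
```

For the content.opf above, the spine lists item2 and then item3, so the reading order is cy.html followed by ee.html.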
toc.ncx
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE ncx PUBLIC '-//NISO//DTD ncx 2005-1//EN' 'http://www.daisy.org/z3986/2005/ncx-2005-1.dtd'>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" xml:lang="en" version="2005-1">
  <head>
    <meta name="dtb:uid" content="lovecraft123456789"/>
    <meta name="dtb:depth" content="1"/>
    <meta name="dtb:totalPageCount" content="0"/>
    <meta name="dtb:maxPageNumber" content="0"/>
  </head>
  <docTitle>
    <text>H.P. Lovecraft: Complete Works</text>
  </docTitle>
  <navMap>
    <navPoint playOrder="1" id="cy">
      <navLabel><text>The Curse of Yig</text></navLabel>
      <content src="cy.html"/>
    </navPoint>
    <navPoint playOrder="2" id="ee">
      <navLabel><text>The Electric Executioner</text></navLabel>
      <content src="ee.html"/>
    </navPoint>
  </navMap>
</ncx>
*.html
<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.1//EN' 'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd'>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Story Title Goes Here...</title>
    <meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8"/>
  </head>
  <body>Content Goes Here...</body>
</html>
The rest of the OEBPS directory contains the items listed in the content.opf manifest section. You can read about all the fancy stuff that is supported in the spec here. In this example we include an NCX file for navigation (this is the ePub 2.0 mechanism; ePub 3 replaced it with an XHTML navigation document, so it probably isn’t necessary for newer books). The rest of the content is just basic XHTML documents, and I stuck to paragraph (p) tags in my HTML, though more complex formatting is supported.
The first step was to get the data that I wanted to bundle into the ebook. Luckily there was a single page that contained all the links to the stories I wanted to download. I downloaded this index using wget and then used a series of regex-replace commands in emacs to convert this file into a flat text file where each line in the text file was a URL for one of the stories.
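The same link extraction could also be scripted instead of done with editor regexes. Below is a minimal Python sketch using the standard html.parser module. The names LinkCollector and story_urls are made up for illustration, and it collects every link on the page, so in practice the result would still need filtering down to just the story links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.urls.append(urljoin(self.base_url, href))


def story_urls(index_html, base_url):
    """Return absolute URLs for all links found in the index page."""
    collector = LinkCollector(base_url)
    collector.feed(index_html)
    collector.close()
    return collector.urls
```

Writing the returned list to a text file, one URL per line, gives the same kind of flat file described above.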
I wrote a quick bash script that would download all the stories from the text file:
# Download every URL listed in lovecraft2.txt, pausing a second between requests
while read -r p; do
    echo "Downloading $p"
    wget "$p"
    sleep 1s
done < lovecraft2.txt
At this point I had all the html files for the stories.
The files downloaded were not valid xhtml. I fixed these files up and simplified the HTML so that it would be more easily restylable by the ebook reading software. Right now I am doing this manually, which is not great.
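One possible way to automate part of this cleanup is sketched below in Python, using only the standard library. It is not the process described above: it assumes the story text lives in p tags, keeps only the title and the paragraph text, and throws away all other formatting. The names ParagraphExtractor and to_simple_xhtml are hypothetical.

```python
from html import escape
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the page title and the plain text of each <p> element."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.paragraphs = []
        self._current = None  # text pieces of the paragraph being read
        self._in_title = False

    def flush(self):
        if self._current is not None:
            text = " ".join("".join(self._current).split())
            if text:
                self.paragraphs.append(text)
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self.flush()  # tolerate <p> tags that were never closed
            self._current = []
        elif tag == "br" and self._current is not None:
            self._current.append(" ")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self.flush()

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._current is not None:
            self._current.append(data)


def to_simple_xhtml(raw_html):
    """Rebuild a page as minimal XHTML containing only a title and paragraphs."""
    parser = ParagraphExtractor()
    parser.feed(raw_html)
    parser.close()
    parser.flush()
    title = escape(" ".join("".join(parser.title_parts).split()), quote=False)
    body = "\n".join(f"<p>{escape(p, quote=False)}</p>" for p in parser.paragraphs)
    return (
        "<?xml version='1.0' encoding='utf-8'?>\n"
        "<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.1//EN' "
        "'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd'>\n"
        '<html xmlns="http://www.w3.org/1999/xhtml">\n'
        f"<head><title>{title}</title></head>\n"
        f"<body>\n{body}\n</body>\n</html>\n"
    )
```

Because italics, headings, and anything outside a p tag are dropped, the output would still need a manual look before going into the ebook.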
Entries for each story need to be added to the content.opf and toc.ncx. These modifications can all be generated automatically from the list of files to download from the “Download the Index of Stories” step above.
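As a sketch of what that generation could look like, the Python function below builds the manifest, spine, and navMap entries from a list of (filename, title) pairs. The function name, the item1/nav1 id scheme, and the assumption that titles are already known for each file are all illustrative choices, not something the ePub format dictates.

```python
from xml.sax.saxutils import escape, quoteattr


def opf_and_ncx_entries(stories):
    """Build manifest, spine, and navPoint XML from (filename, title) pairs.

    The pairs are assumed to be in reading order already.
    """
    manifest, spine, nav = [], [], []
    for order, (filename, title) in enumerate(stories, start=1):
        item_id = f"item{order}"
        manifest.append(
            f"<item href={quoteattr(filename)} "
            f'media-type="application/xhtml+xml" id="{item_id}"/>'
        )
        spine.append(f'<itemref idref="{item_id}" linear="yes"/>')
        nav.append(
            f'<navPoint playOrder="{order}" id="nav{order}">'
            f"<navLabel><text>{escape(title)}</text></navLabel>"
            f"<content src={quoteattr(filename)}/></navPoint>"
        )
    return "\n".join(manifest), "\n".join(spine), "\n".join(nav)
```

The three returned strings are meant to be pasted into the manifest and spine elements of content.opf and the navMap element of toc.ncx, alongside the existing toc.ncx manifest item.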
Finally, here is the set of Linux commands that I ran to package my ePub (naturally “lc” stands for “Lovecraft”):
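# mimetype goes in first, uncompressed (-0) and without extra fields (-X)
zip -0 -X lc mimetype
# then append (-g) the two directories recursively (-r)
zip -g -r lc META-INF/
zip -g -r lc OEBPS/
cp lc.zip lc.epub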
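For anyone without the zip command handy, roughly the same packaging can be sketched with Python's standard zipfile module. This is an equivalent under assumptions, not the method used above: the function name build_epub is hypothetical, and it expects a source directory containing mimetype, META-INF, and OEBPS laid out as described earlier.

```python
import os
import zipfile


def build_epub(source_dir, epub_path):
    """Zip source_dir into an ePub, writing the mimetype entry first and uncompressed."""
    with zipfile.ZipFile(epub_path, "w") as zf:
        # The mimetype file must come first and must be stored, not deflated.
        zf.write(
            os.path.join(source_dir, "mimetype"),
            "mimetype",
            compress_type=zipfile.ZIP_STORED,
        )
        for top in ("META-INF", "OEBPS"):
            for folder, _dirs, files in os.walk(os.path.join(source_dir, top)):
                for name in sorted(files):
                    full = os.path.join(folder, name)
                    # Zip entry names always use forward slashes.
                    arcname = os.path.relpath(full, source_dir).replace(os.sep, "/")
                    zf.write(full, arcname, compress_type=zipfile.ZIP_DEFLATED)
```

Calling build_epub(".", "lc.epub") from the directory that holds the three pieces would produce the ePub directly, without the intermediate lc.zip.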
If you are having trouble with your ePub you can use this official online validator to verify the formatting of your ePub: http://validator.idpf.org/. I was able to just Google any errors I didn’t understand in order to debug.
To get your ePub onto your device you can either host it somewhere online and download it or, as I did, go to https://play.google.com/books and drag and drop the ePub there. If your ePub is invalid, it will spend a long time “processing” the file and then fail.
You can start with an existing ePub file and modify it if you want. A good resource for free ePub books is Project Gutenberg. For example, you might download something like Metamorphosis by Franz Kafka to see how their ePub ebooks are formatted.
By the way, here is an ePub of this document ;-)