Word to ePub – 2: Preparing a Word document

For ePub or Kindle production, the ideal Word document has simple styles, consistency in their usage, and no formatting that disappears after conversion, such as tabs, Word’s text boxes, and running headers and footers. Why? Because the source files for an ePub are formatted with XHTML (extensible hypertext markup language), a strict XML version of HTML 4 that defines the structure of documents displayed in ePub readers as well as Web browsers. You can usually change the appearance of your book using CSS.

Pushing against limitations

Support for the ePub standard, XHTML, and CSS by eReader devices and apps is limited. For example, you can embed fonts in an ePub document, but few devices support embedded fonts. CSS styling is also restricted, partly because ePub readers include their own default styling and partly to give the user more control over the appearance of documents and make them work on small screens.

Unlike reading a printed book, the experience of reading an eBook varies wildly, depending on the device or app in which it is viewed and the user’s preferences. The precise control over layout that a printed book allows is not possible in an eBook.

For the best looking, most usable eBook, give up striving for precise control in favor of usability. Keep formatting simple!

I find it counter-productive to invest huge amounts of time cleaning up Word styles because I replace them with a simpler CSS style sheet that will work with ePub and Kindle books. I do take time to check and fix things known to create problems before saving the document as HTML Filtered. After that, I work only with the HTML, delete Word’s styles with regular expressions, and use the Word document as a record of the author’s intent.
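As a rough illustration of deleting Word’s styles with regular expressions, here is a minimal Python sketch. The sample markup and the function name are assumptions made for the example, not actual output from any particular book, and the patterns would need adjusting for a real file.

```python
import re

# Hypothetical sample of the kind of markup found in Word's filtered HTML.
SAMPLE = (
    "<p class=MsoNormal style='margin-left:.5in'>"
    '<span style="font-family:Arial">Hello</span> world</p>'
)

def strip_word_styling(html: str) -> str:
    """Remove inline style attributes, Mso* classes, and bare <span> wrappers."""
    # Drop style="..." or style='...' attributes.
    html = re.sub(r'\s+style=("[^"]*"|\'[^\']*\')', "", html, flags=re.I)
    # Drop Word-generated classes such as class=MsoNormal (quoted or unquoted).
    html = re.sub(r'\s+class=["\']?Mso\w+["\']?', "", html, flags=re.I)
    # Unwrap spans left without attributes (nested spans are not handled).
    html = re.sub(r"<span>(.*?)</span>", r"\1", html, flags=re.I | re.S)
    return html

print(strip_word_styling(SAMPLE))
```

Always run this kind of cleanup on a copy, and compare the result against the Word document to confirm that no intended emphasis was lost.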

If you rarely look at the HTML source code or don’t feel comfortable editing CSS, you will be better off cleaning up and simplifying Word styles before conversion. There are other tutorials for doing that. Overall, I recommend Elizabeth Castro’s book, EPUB Straight to the Point, which is available everywhere. The book is iPad-centric, but it’s easy to generalize from her instructions.

Nuking Hairy Messes

If you convert books for other people, you can’t escape having to untangle hairy messes of Word styling, such as 15 versions of H3 styles. (Yes, they do exist; I got one.) Untangling them takes patience and time, because you have to write simplified CSS styles that reproduce a similar appearance without losing the tone and clarity of the presentation.

If a document has all inline styles or is hopelessly messed up, the best strategy is to nuke all formatting by making the entire document Normal style and add back what’s needed, one bit at a time. To do that in Word, press CTRL-A to select all content, then select Remove Formatting. All non-default styles are now gone. Headers and inline italics or bold should remain, but you’ll need to review the results carefully.

Even meticulous authors make the common mistake of using tabs or spaces to indent paragraphs in Word. Create an indented paragraph style. There are no tabs in HTML!
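To find fake indents quickly, one option is to scan a plain-text copy of the manuscript for lines that begin with tabs or spaces. This is only a sketch: it assumes the manuscript has been saved as plain text, and the function name is made up for the example.

```python
import re

# A line that starts with tabs, spaces, or non-breaking spaces before real text.
FAKE_INDENT = re.compile(r"^[ \t\u00a0]+\S")

def fake_indents(text: str) -> list[int]:
    """Return the 1-based numbers of lines that start with manual indentation."""
    return [
        lineno
        for lineno, line in enumerate(text.splitlines(), start=1)
        if FAKE_INDENT.match(line)
    ]

sample = "\tTabbed paragraph.\nClean paragraph.\n    Spaced paragraph.\n"
print(fake_indents(sample))
```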

Learn to use Word styles instead of inline formatting for easier maintenance and consistency. For example, if you have H2 chapter headings defined at a certain size, centered with a 30% top margin, and decide to modify their look, edit the style to change all of them at once instead of finding and editing each header.

BASIC STYLES IN THE WORD DOCUMENT

A subset of the default Word styles is enough for most narrative books, including novels, short fiction, and non-fiction that is not mathematics or full of technical code and illustrations:

  • Normal style – Parent of other Word styles. Some people like to use the Body Text style as their base style instead of Normal style. You may still do that, but you’ll need to unhide the Body Text style in Word 2007+, and check the option to NOT base it on Normal style. I don’t do this because I replace references to Normal and other Word styles with an external CSS style sheet.
  • Header styles – Use h1, h2, h3, h4 as needed. Always use header styles instead of big bold text for part, chapter, or section headers. Separate header styles from Normal Style to avoid problems with indents. To do that, base h1 on No Style. Then, for consistency in Word, make the headers hierarchical. That is, base h2 on h1, h3 on h2, and h4 on h3 so that headers inherit the same font and other characteristics. Adjust the font-size for each header style.
  • Paragraph styles – Several may be needed, such as indented, no-indent block style, and hanging indent. Most narrative works use an indented first line in paragraphs without blank lines between them, with exceptions for the first paragraph in a chapter, a paragraph following an image or list, and so on. Non-fiction or technical books often use block paragraphs without first-line indent and some blank space separating paragraphs. The front and back matter in a book may use block style paragraphs, in contrast to the narrative body of the book. Just be consistent and don’t mix up block and indented styles in the body of the book. Use a real indent style, not tabs or spaces! Tabs disappear when the document is converted from Word to HTML; there are no tabs in HTML.
  • Emphasized or italic text – Define and use a paragraph class when an entire paragraph is italic, e.g., <p class="i">. If you don’t, Sigil will generate a class and add it to every page when you split your book file into chapters. It’s okay to use <em></em> or <i></i> to italicize a few words.
  • Bold text – Define and use a paragraph class when an entire paragraph is bold, e.g., <p class="b">. If you don’t, Sigil will generate a class and add it to every page when you split your book file into chapters. It’s okay to use <strong></strong> or <b></b> to bold a few words.
  • Blockquotes – Note that strict XHTML requires block elements (such as paragraphs) within a blockquote, and validation will fail if they are missing. In contrast, Kindle books will IGNORE the blockquote tag if it includes other block elements. For example, Kindlegen converts a blockquote containing a paragraph tag to a plain paragraph, ignoring the blockquote tag altogether.
  • Simple lists – Avoid deeply nested lists, which become unreadable on small screens. Stick to one or two levels of indentation.
  • Centered text – Useful for quotations, dedication, copyright page, and headers. When used for headers, include in the header style.

Later, when you start formatting the converted Word-to-XHTML document as an ePub, you will need to add several more styles to your CSS style sheet to handle special sections of the book. Additional styles may include CSS for areas of the book where the author has inserted from 1 to 3 blank lines, larger centered italic text, a scene-change separator, images, captions, and callouts. Never use more than the equivalent of four blank lines in the body of your book to indicate a change of scene. Doing so will result in totally blank pages on small-screen devices.

Be aware that Kindlegen ignores multiple CSS classes when converting HTML or ePub to Mobi (Kindle) format. For this reason, you may be forced to include redundant CSS declarations in your style sheet to combine several styling variations for which you would normally use multiple styles.
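One way to see where this limitation would bite is to list every element that carries more than one class. Here is a minimal Python sketch using only the standard library; it assumes the book’s XHTML is available as a string, and the class and function names are made up for illustration.

```python
from html.parser import HTMLParser

class MultiClassFinder(HTMLParser):
    """Collect (tag, classes) for elements whose class attribute lists 2+ names."""

    def __init__(self):
        super().__init__()
        self.hits = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if len(classes) > 1:
            self.hits.append((tag, classes))

def find_multi_class(html: str):
    finder = MultiClassFinder()
    finder.feed(html)
    return finder.hits

sample = '<p class="center italic">Dedication</p><p class="indent">Body</p>'
print(find_multi_class(sample))
```

Each hit is a candidate for a single combined class in the style sheet (a hypothetical "center-italic", say) that repeats the declarations of both.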

Even if you know CSS well, for occasional questions or to learn more about CSS, the free tutorials at www.w3schools.com are excellent.

Important Word Settings

In Word settings, select Advanced and find Web Settings. Choose to Always save HTML as UTF-8 encoding. UTF-8 is the encoding standard for ePubs, XHTML, and most web documents in Western languages. With UTF-8, common special characters and punctuation will carry over into your HTML document without the additional effort of finding and replacing them with XHTML entities. Also, as you work on polishing up your XHTML on Windows, you can type special characters by holding down ALT and typing a four-digit code on the numeric keypad, and they’ll display as they should. For example:

Character          Alt code
en dash            0150
em dash            0151
ellipsis           0133
curly open quote   0147
curly close quote  0148
curly apostrophe   0146
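On Western-language Windows systems, Alt codes typed with a leading zero are byte values in the Windows-1252 code page, so a code can be checked by decoding that byte. A small Python sketch (the dictionary and function names are just for illustration):

```python
# Alt codes with a leading zero are Windows-1252 byte values.
ALT_CODES = {
    "en dash": 150,
    "em dash": 151,
    "ellipsis": 133,
    "curly open quote": 147,
    "curly close quote": 148,
    "curly apostrophe": 146,
}

def alt_code_to_char(code: int) -> str:
    """Decode a single Windows-1252 byte value to its Unicode character."""
    return bytes([code]).decode("cp1252")

for name, code in ALT_CODES.items():
    ch = alt_code_to_char(code)
    print(f"{name:18} Alt+0{code}  U+{ord(ch):04X}  {ch}")
```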

Examine Word’s AutoCorrect/AutoFormat options. Deselect most options, especially if you’re working on someone else’s manuscript. It’s wise to prevent Word from second-guessing their intent and clobbering dialogue. I keep the following AutoFormat replacements:

  • “Straight quotes” with “smart quotes”
  • Hyphens (--) with dash (—)
  • *Bold* and _italic_ with real formatting (Unlikely to encounter this legacy practice!)

Find stray or bad formatting with Search and Replace

Use Word’s search and replace to find and change three adjacent periods to a real ellipsis, two hyphens to an em dash, quotes to curly quotes and apostrophes to curly apostrophes. These formatting niceties display just fine in both Kindle and ePub books.

Note that if you have smart quotes turned on in AutoFormat, you can type the straight quote or apostrophe as both the search and the replacement and Word will convert the straight ones to curly versions.

Also use search and replace to make sure there aren’t any double hyphens or double periods at the end of a sentence.

Although I don’t clean up all Word styles, I take some time to examine the document in order to understand what the author intended it to look like and to check for consistent use of headers and punctuation. At a minimum, there should be no stray straight quotes or apostrophes, double hyphens, or fake ellipses. The proper use of ellipses is not straightforward. It’s possible to have an ellipsis followed by a period, or an ellipsis without a space following it in certain contexts. Check with the author when you think there’s a problem.
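The same checks can be run outside Word on a plain-text copy of the manuscript. The following Python sketch reports each suspect by line number; it assumes plain text rather than HTML (where straight quotes are legitimate inside tags), and the names are invented for the example.

```python
import re

CHECKS = {
    "straight quote or apostrophe": re.compile(r"[\"']"),
    "double hyphen": re.compile(r"--"),
    "fake ellipsis": re.compile(r"\.{3}"),
    # Exactly two periods, not part of a longer run of periods.
    "double period": re.compile(r"(?<!\.)\.\.(?!\.)"),
}

def find_problems(text: str):
    """Return (line_number, label, matched_text) for each suspect found."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in CHECKS.items():
            for match in pattern.finditer(line):
                problems.append((lineno, label, match.group()))
    return problems

sample = 'He said "wait"...\nIt\u2019s fine -- really..'
for problem in find_problems(sample):
    print(problem)
```

A report like this only flags candidates. Whether an ellipsis followed by a period is an error is still a question for the author.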

Remove Word attributes that cause big trouble

Remove hidden text fields, comments, and reviewer markup. It’s especially important to remove hidden bookmarks because they cause problems if used to create an HTML Table of Contents. In the last book I worked on, there were up to six hidden bookmarks in the same spot! Apparently Word doesn’t discard old bookmarks when you create a new one in the same position.
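To confirm that no hidden bookmarks survived the cleanup, the exported HTML can be checked for leftover anchors. This sketch rests on two assumptions: that bookmarks appear in the filtered HTML as <a name="..."> anchors, and that Word’s hidden bookmarks have names beginning with an underscore (such as _Toc or _Ref entries). The class and function names are made up for the example.

```python
from html.parser import HTMLParser

class BookmarkFinder(HTMLParser):
    """Collect the name attribute of every <a name="..."> anchor."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            name = dict(attrs).get("name")
            if name:
                self.names.append(name)

def hidden_bookmarks(html: str) -> list[str]:
    """Return anchor names that start with an underscore (assumed hidden)."""
    finder = BookmarkFinder()
    finder.feed(html)
    return [name for name in finder.names if name.startswith("_")]

sample = (
    '<h1><a name="_Toc101"></a><a name="_Toc102"></a>'
    '<a name="chapter1">One</a></h1>'
)
print(hidden_bookmarks(sample))
```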

Save a new (third) copy of the Word doc as your final cleaned-up file, and then use Save As and choose the filtered HTML format (Web Page, Filtered). The original and cleaned-up Word docs will be helpful as visual reference points when you review the ePub drafts. The filtered HTML is also handy if you need to check back to see what you started with in the conversion.

In Word, close the saved HTML document. The rest of the cleanup and formatting will be done in a text editor or Sigil, an ePub editor and converter. Sigil is the greatest thing since sliced bread. If you use it, remember to contribute to the authors to keep it going. Above all, don’t reopen the saved HTML document in Word, which will happily mess it up.

Proceed to Word to ePub – 3: Clean up Filtered HTML