Most authors write their documents in Microsoft Word or an equivalent word processor. Word processors add formatting characters to the document that are not visible on the screen. We will eventually need to remove all of these hidden characters, since they are not valid in HTML. First, however, we need to make some preparations so that the formatting we want to keep is preserved in the HTML file. Here are the preliminary steps that I recommend:
- Turn on the options that replace straight quotes with smart quotes and double hyphens (--) with em dashes. In Word 2010 these options are check boxes under File > Options > Proofing > AutoCorrect Options > AutoFormat and under File > Options > Proofing > AutoCorrect Options > AutoFormat As You Type. They should be checked in both places. In Word 2003 they are under Tools > AutoCorrect Options, on the AutoFormat and AutoFormat As You Type tabs.
- Using Find and Replace, replace ' by ' and replace " by ". This looks as if it would do nothing, but it triggers the smart quote replacement for every quote in the document. Similarly, replace -- by — (em dash) and space-hyphen-space ( - ) by – (en dash).
- Mark the sections of text that are italic, bold, or underlined so that we will know where they are after the Word formatting is removed. We will mark italic text by putting III at its beginning and end. Similarly, we will surround bold text with BBB and underlined text with UUU. We will later replace these markers with the appropriate HTML tags. To mark the italic text, click in the Find box and press Ctrl+I so that it shows Font: Italic, then put III^&III in the Replace box (^& is Word's code for the found text). Similarly, use Ctrl+B with BBB^&BBB and Ctrl+U with UUU^&UUU.
- Copy the text to the clipboard and paste it into a text editor such as Notepad++. This is the most important step: it removes all the hidden formatting information. Each paragraph is now a single line of text. Save this file with an .html extension.
Adding HTML tags to the text file we have created involves a lot of searching and replacing of character strings. This process is much simpler if we make use of regular expressions. We will describe these regular expressions in the next section.
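As a preview of the kind of replacement involved, here is a minimal Python sketch that turns the III, BBB, and UUU markers into HTML tags. The function name `markers_to_tags` and the choice of `<i>`, `<b>`, and `<u>` as the target tags are assumptions made for illustration; the same replacements could equally be done in a text editor's regular-expression search.

```python
import re

# Markers from the steps above, mapped to HTML tag names.
# The specific tags chosen here are an assumption for this sketch.
MARKER_TAGS = {"III": "i", "BBB": "b", "UUU": "u"}


def markers_to_tags(line: str) -> str:
    """Replace each MARKER...MARKER pair in a line with <tag>...</tag>."""
    for marker, tag in MARKER_TAGS.items():
        # (.+?) is non-greedy, so each pair of markers is matched separately
        # instead of spanning from the first marker to the last one.
        pattern = re.escape(marker) + r"(.+?)" + re.escape(marker)
        line = re.sub(pattern, rf"<{tag}>\1</{tag}>", line)
    return line


if __name__ == "__main__":
    sample = "This is IIIveryIII important, BBBreallyBBB."
    print(markers_to_tags(sample))
```

Because each paragraph is a single line after the paste step, the function can be applied line by line. Note that a line containing the marker letters as ordinary text (for example a name such as "Henry III") would confuse this simple pattern, so it is worth searching for such cases before adding the markers in Word.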