Adding HTML Tags
We will now use regular expressions to clean up our text file and add HTML tags. The following are the steps we will follow:
- To remove all tab characters, place \t in the find box and nothing in the replace box. Make sure that Regular expression is selected under Search Mode, then click Replace All.
- To remove all blank lines, switch to Extended mode, place \n\r in the find box and nothing in the replace box, and click Replace All. With Windows line endings (\r\n), the sequence \n\r occurs only where one line break is immediately followed by another, that is, at a blank line. Then switch back to Regular expression mode.
- To remove white space from the beginning of each line, place ^\s+ in the find box and nothing in the replace box. Click Replace All.
- To remove white space from the end of each line, place \s+$ in the find box and nothing in the replace box. Click Replace All.
- Search for double spaces and replace them with single spaces. Repeat Replace All until no double spaces remain, since a single pass only halves longer runs of spaces.
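For readers who would rather script this cleanup than click through the dialog, the steps above can be sketched in Python. This is only an illustration under stated assumptions: the function name clean_whitespace is hypothetical, the whole document is assumed to fit in one string, and plain string methods stand in for the editor's find-and-replace boxes.

```python
import re


def clean_whitespace(text: str) -> str:
    """Apply the whitespace cleanup steps to a document held in one string."""
    # Assumption: normalise Windows line endings so lines can be split on "\n".
    text = text.replace("\r\n", "\n")
    # Remove all tab characters (replaced by nothing, as in the steps above).
    text = text.replace("\t", "")
    # Strip white space from the beginning and end of every line.
    lines = [line.strip() for line in text.split("\n")]
    # Drop the lines that are empty, i.e. the blank lines.
    lines = [line for line in lines if line]
    # Collapse any run of two or more spaces into a single space.
    lines = [re.sub(r" {2,}", " ", line) for line in lines]
    return "\n".join(lines)
```

Unlike a single Replace All on double spaces, the pattern " {2,}" collapses a run of any length in one pass.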
We now want to see whether the document contains any special characters such as ï or è. You can find them by searching for
[^<>A-Za-z0-9\.,'"?\\\^\|\-\[\]:!;()/$#@&%*_+{}=~\s“”‘’—–]
i.e., any character that is not one of the standard keyboard characters. We need to either remove these or replace them with their HTML equivalents. You can find HTML equivalents at a number of sites on the internet, such as Common HTML Tags. Use the HTML entity names, not the entity numbers.
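The same search can be sketched in Python, which also makes it easy to look up an entity name for each character found. The function name find_special_characters is hypothetical; the entity names come from the standard library table html.entities.codepoint2name, and characters missing from that table are reported as None so that you can decide whether to remove them or look them up by hand.

```python
import re
from html.entities import codepoint2name
from typing import Dict, Optional

# The character class from the text, written as a Python raw string.
NON_STANDARD = re.compile(
    r"""[^<>A-Za-z0-9\.,'"?\\\^\|\-\[\]:!;()/$#@&%*_+{}=~\s“”‘’—–]"""
)


def find_special_characters(text: str) -> Dict[str, Optional[str]]:
    """Map each non-standard character in text to a named entity, if one exists."""
    suggestions: Dict[str, Optional[str]] = {}
    for char in set(NON_STANDARD.findall(text)):
        name = codepoint2name.get(ord(char))
        # None means the standard table has no entity name for this character.
        suggestions[char] = f"&{name};" if name else None
    return suggestions
```

Calling find_special_characters on the cleaned text returns a dictionary of the characters that still need attention, which can then drive a series of replacements.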
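There are some symbols that either have no keyboard representation or have a special meaning in XHTML. We will replace these by their HTML entity names. Replace & first so that the ampersands in the other entity names are not escaped again, and take care not to re-escape the ampersands of any entities you inserted in the previous step. In particular:
- Replace & by &amp;
- Replace > by &gt;
- Replace < by &lt;
- Replace " by &quot;
- Replace “ by &ldquo;
- Replace ” by &rdquo;
- Replace ‘ by &lsquo;
- Replace ’ by &rsquo;
- Replace … by &hellip;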
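As a rough sketch, the replacements in this list can be applied in Python with an ordered table. It assumes the intended targets are the standard entity names &amp;, &gt;, &lt;, &quot;, &ldquo;, &rdquo;, &lsquo;, &rsquo; and &hellip;, and that the text does not yet contain any entities (otherwise their ampersands would be escaped a second time). The names REPLACEMENTS and replace_with_entities are hypothetical.

```python
# Ordered list of (character, entity) pairs; the ampersand must come first,
# otherwise the "&" of the entities inserted later would itself be replaced.
REPLACEMENTS = [
    ("&", "&amp;"),
    (">", "&gt;"),
    ("<", "&lt;"),
    ('"', "&quot;"),
    ("“", "&ldquo;"),
    ("”", "&rdquo;"),
    ("‘", "&lsquo;"),
    ("’", "&rsquo;"),
    ("…", "&hellip;"),
]


def replace_with_entities(text: str) -> str:
    """Replace each listed character with its HTML entity name, in order."""
    for char, entity in REPLACEMENTS:
        text = text.replace(char, entity)
    return text
```

Keeping the pairs in a list rather than a dictionary makes the required order explicit.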
- Since paragraphs are now lines, we want to wrap paragraph tags around each line. To do this we place ^(.+)$ in the find box and <p>\1</p> in the replace box. Now hit Replace All.
- You can cut and paste the text into the following HTML template:
    <?xml version="1.0" encoding="UTF-8" ?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
    <head>
    <meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8" />
    <title>Your Title Here</title>
    <link type="text/css" rel="stylesheet" href="style.css" />
    </head>
    <body>
    <!--Insert Content here-->
    </body>
    </html>
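The last two steps, wrapping each line in paragraph tags and pasting the result into the template, can also be sketched in Python. The names TEMPLATE, wrap_paragraphs and build_page are hypothetical, the {title} and {content} placeholders stand in for "Your Title Here" and the insert-content comment, and the text is assumed to use \n line endings and to have had its special characters replaced by entities already.

```python
import re

# The template from the text, with placeholders for the title and the content.
TEMPLATE = """<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8" />
<title>{title}</title>
<link type="text/css" rel="stylesheet" href="style.css" />
</head>
<body>
{content}
</body>
</html>
"""


def wrap_paragraphs(text: str) -> str:
    """Wrap every non-empty line in <p>...</p>, like ^(.+)$ -> <p>\\1</p>."""
    return re.sub(r"^(.+)$", r"<p>\1</p>", text, flags=re.MULTILINE)


def build_page(title: str, content: str) -> str:
    """Insert the title and the wrapped paragraphs into the template."""
    return TEMPLATE.format(title=title, content=content)
```

A call such as build_page("My Book", wrap_paragraphs(text)) then yields the complete page as one string, ready to be saved with UTF-8 encoding.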
You can now add any links, headings, etc. you desire and create a CSS file to style the elements the way you want them. It is considered good style to have a page break before each chapter. You can do this by adding page-break-before: always; to the style for chapter headings. Any heading that you want to link to from an HTML table of contents should be surrounded by div tags with an id, e.g., <div id="c1">Chapter 1</div>. In the next section we will show how to build an EPUB document.