Regular Expressions

Now that we have the text from the Word file in our text editor, we want to start adding HTML tags. To do this we will use the editor's find-and-replace capability along with regular expressions. Regular expressions allow us to find and replace many different strings of characters in one operation. They are like a more powerful version of wildcards. You are probably aware that you can search for all JPEG files in a directory by typing the wildcard expression *.jpg in the Find box. Regular expressions extend this capability to character strings within a file. We will describe some of the regular expressions that can be used in Notepad++, but other text editors have similar capabilities.

. (period) matches any character.
[…] matches any one of the characters within the brackets. For example, [a5g] matches a, 5, or g. You can also use ranges of characters. For example, [0-9] matches any digit from zero through nine, and [a-z] matches any lowercase letter.
* matches the preceding item zero or more times. For example, A[0-9]* would match A, A5, A23, etc.
+ matches the preceding item one or more times.
[^…]  matches any character except those within the bracket.
(?!) negative lookahead. For example, q(?!u) would match a q not followed by a u.
(?=) positive lookahead. For example, q(?=u) would match a q that was followed by a u.
(?<!) negative lookbehind. For example, (?<!u)q would match a q not preceded by a u.
(?<=) positive lookbehind. For example, (?<=u)q would match a q that was preceded by a u.
^ anchors the next regular expression to the start of a line (when not included in brackets).
$ anchors the previous regular expression to the end of a line (doesn't include new line character).
\< anchors the next regular expression to the start of a word.
\> anchors the previous regular expression to the end of a word.
(…) assigns a tag to the characters selected by the expression inside the parentheses. The first set of tagged characters is assigned the tag \1, the second \2, and so on up to \9. These tags \1–\9 can be used as a substitute for the characters they represent in a replace operation.
\n matches a new line character (line feed)
\r matches a carriage return
\t matches a tab character
\s matches any whitespace character (a space or a tab and, in current versions of Notepad++, line-break characters as well)
\f matches a page break
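
The expressions in this list can be tried outside the editor as well. The following is a minimal sketch that uses Python's re module as a stand-in for the Notepad++ search engine; that substitution, the variable names, and the sample strings are assumptions made for illustration. The two engines are not identical (Python, for example, does not support \< and \>), so treat the results as a guide rather than a guarantee of what Notepad++ will do.

```python
import re

# "A" followed by zero or more digits.
digits_after_a = re.findall(r"A[0-9]*", "A A5 A23 B7")

# Runs of one or more characters that are not digits.
non_digits = re.findall(r"[^0-9]+", "ab12cd")

# Lookahead: the test is written after the character it qualifies.
not_followed = re.sub(r"q(?!u)", "Q", "quit Iraq qat")
followed = re.sub(r"q(?=u)", "Q", "quit Iraq qat")

# Lookbehind: the test is written before the character it qualifies.
not_preceded = re.sub(r"(?<!u)q", "Q", "quit tuque")
preceded = re.sub(r"(?<=u)q", "Q", "quit tuque")

# Tagged groups: \1 and \2 reuse the matched text in the replacement.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")

print(digits_after_a)   # ['A', 'A5', 'A23']
print(non_digits)       # ['ab', 'cd']
print(not_followed)     # quit IraQ Qat
print(followed)         # Quit Iraq qat
print(not_preceded)     # Quit tuque
print(preceded)         # quit tuQue
print(swapped)          # world hello
```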

Sometimes * and + extend the match farther than we would like. For example, ^.*e applied to “The old man is dead” would select “The old man is de”, i.e., it extends as far as possible. Adding ? after * or + limits the selection to the smallest matching string. In our example, ^.*?e would select “The”.
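
The difference between the longest and the shortest match can be checked with a short sketch. It assumes Python's re module as a stand-in for the editor, and it writes the pattern as ^.*e, with the period outside brackets so that it matches any character.

```python
import re

text = "The old man is dead"

# Greedy: .* runs as far as it can, then backs up to the last "e".
greedy = re.match(r"^.*e", text).group()

# Lazy: .*? stops at the first "e" it reaches.
lazy = re.match(r"^.*?e", text).group()

print(greedy)  # The old man is de
print(lazy)    # The
```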

If you need to use a character that has a special meaning in regular expressions (for example a period) in its normal sense, then precede it by a backslash. Let's look at some simple examples.
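
As a quick illustration of the backslash rule, the sketch below searches for a file name with and without escaping the period. Python's re module is used here as an assumed stand-in for the editor, and the sample text is made up.

```python
import re

text = "file.txt filextxt"

# Unescaped: the period matches any character, so both words match.
unescaped = re.findall(r"file.txt", text)

# Escaped: the period matches only a literal period.
escaped = re.findall(r"file\.txt", text)

print(unescaped)  # ['file.txt', 'filextxt']
print(escaped)    # ['file.txt']
```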

Example 1  To remove blank lines we search (in extended mode) for

\n\r

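and then replace with a blank entry. This works because each line of a Windows text file ends with \r\n, so \n\r occurs only where one line ending is immediately followed by another. If you also want to remove lines containing only space and tab characters, search (in regular expression mode) for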

\r\n\s*\r\n

and replace with \r\n.
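
Both searches in Example 1 can be reproduced in a few lines. This is a sketch under stated assumptions: Python stands in for the editor, the sample text uses Windows line endings (\r\n), and a plain string replacement plays the role of extended mode. Note that Python's \s also matches line-break characters.

```python
import re

# Three lines of text separated by an empty line and a line of blanks.
text = "first\r\n\r\nsecond\r\n \t\r\nthird\r\n"

# Extended-mode style: delete every literal "\n\r" pair.
# Only the truly empty line disappears.
no_empty = text.replace("\n\r", "")

# Regular-expression style: collapse a line ending, optional whitespace,
# and another line ending into a single line ending.
no_blank = re.sub(r"\r\n\s*\r\n", "\r\n", text)

print(repr(no_empty))  # 'first\r\nsecond\r\n \t\r\nthird\r\n'
print(repr(no_blank))  # 'first\r\nsecond\r\nthird\r\n'
```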

Example 2  Paragraphs in the Word file are now single lines of text. We want to put HTML paragraph tags around the text in each line. To do this we can search for

^(.+)$

and then replace with

<p>\1</p>
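
The same search and replace can be sketched in Python, again only as an assumed stand-in for the editor. The sample text uses plain \n line endings, and the MULTILINE flag is needed so that ^ and $ apply to every line, which Notepad++ does by default.

```python
import re

text = "First paragraph.\n\nSecond paragraph.\n"

# Wrap each non-empty line in paragraph tags; \1 is the tagged line text.
html = re.sub(r"^(.+)$", r"<p>\1</p>", text, flags=re.MULTILINE)

print(html)
```

Because .+ needs at least one character, the empty line in the sample is left untouched rather than being turned into an empty paragraph.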

In the next section we will show how to use regular expressions to add HTML tags to our document.
