2009-05-18

building regular expressions

This whole entry could be summarized as 'use M-x re-builder' to build your regular expressions. But let's see if I can stretch that wisdom over a couple of lines…

For searching and replacing, regular expressions ('regexps') are a very useful tool. For example, see the entry about getting your IP number. I am not going to explain regexps here - there are plenty of good references about them. Of course, Emacs supports regexps, but they are not always as easy to use as in, e.g., Perl. I am only providing some trivial examples here; please see Steve Yegge's post on the regexp tricks possible with the then-new Emacs 22 (I can't remember ever needing that kind of regexp-pr0n in real life though…)

Back to regexps - one of the issues with regexps in Elisp is that they need extra quoting, that is, lots of \-escape characters; regexps can be hard to comprehend, and this does not help… Why the extra quoting? Let's look at a simple example. Suppose we want to search for the word cat, and not category or concatenate. The regular expression would then be \bcat\b.

In Perl you could write this as /\bcat\b/ (in Perl you specify regexps by putting them between /-characters).

Not so in Emacs-Lisp. On the Lisp-level, there are no regexps; there are only strings, and only the regexp functions understand their true nature. But before the strings ever get to those functions, the Lisp interpreter does what it does best: interpreting. And when it sees \b in a string, it interprets it as the backspace-character.

To make it not do that, you'll need to pay the 'slash-tax' and write something like:

(re-search-forward "\\bcat\\b")

Things can get ugly quickly from there - think of when you need to search for something with a backslash in it, like our regexp \bcat\b itself; you'd need to do:

(re-search-forward "\\\\bcat\\\\b")
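
To see the slash tax at work, here's a small sketch you can evaluate in the *scratch* buffer (e.g. with C-x C-e after each form). It uses string-match and a temporary buffer, which are my choices for illustration and not something the examples above depend on; the sample texts are made up.

```elisp
;; Inside a Lisp string, "\b" is read as the backspace character,
;; so the untaxed string is not the regexp it looks like.
(length "\bcat\b")      ; => 5  (backspace c a t backspace)
(length "\\bcat\\b")    ; => 7  (\ b c a t \ b)

;; With the backslashes doubled, \b reaches the regexp engine as a
;; word boundary.
(string-match "\\bcat\\b" "a cat sat")      ; => 2
(string-match "\\bcat\\b" "concatenate")    ; => nil

;; Four backslashes in the source = one literal backslash to match.
(string-match "\\\\bcat\\\\b" "regexp: \\bcat\\b")  ; => 8

;; The same regexp with re-search-forward, in a throwaway buffer.
(with-temp-buffer
  (insert "category cat concatenate")
  (goto-char (point-min))
  (when (re-search-forward "\\bcat\\b" nil t)
    (match-beginning 0)))                   ; => 10
```

Note that string-match counts from 0, while buffer positions start at 1.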

slash tax break

To make things even more interesting, in different contexts, different rules apply. The above is all about regexps in strings in Emacs-Lisp. However, things are different when you provide a string interactively.

Suppose you search through your buffer (with M-x isearch-forward-regexp or C-M-s). Now, your input is not interpreted by the Lisp interpreter (after all, it's just user input). So, you're exempt from the slash tax, and you can type \bcat\b as it is to match, well, the word cat.
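
If you have a regexp that works at the interactive prompt and want it in your Lisp code, the conversion is just the string's printed representation. Here's a minimal sketch; the function name my-regexp-to-read-syntax is hypothetical (it is not part of Emacs), and it simply wraps the standard prin1-to-string.

```elisp
;; Hypothetical helper, not part of Emacs: turn a regexp as typed at an
;; interactive prompt into the quoted form needed in Lisp source.
(defun my-regexp-to-read-syntax (typed)
  "Return TYPED as a string literal in Lisp read syntax."
  (prin1-to-string typed))

;; "\\bcat\\b" below is the 7-character string \bcat\b, i.e. what you
;; would type after C-M-s.
(message "%s" (my-regexp-to-read-syntax "\\bcat\\b"))
;; shows "\\bcat\\b" (quotes included), ready to paste into code
```

Going the other way, reading that literal back with read gives you the original 7-character regexp again.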

re-builder

So, regexps can be hard, and Emacs-Lisp makes it somewhat harder. A natural way to come up with the regular expression you need is trial-and-error, and this is exactly what isearch-forward-regexp and friends let you do. But what about the slash-taxed regexps that you need in your Lisp code?

The answer is M-x re-builder. I am sure many people are already using it, but even if there were only one person that finds out about this through this blog-post, it'd be worth it! And this is the whole trick here: whenever you need a regexp in your code, put the kind of string it should match in a buffer, and enter M-x re-builder.

re-builder will put some quotes in a small *RE-Builder* window at the bottom of the frame. You type your regexp there, between the quotes, and it will show the matches in the buffer as you type. It even supports different regexp syntaxes. By default, re-builder will help you with the strings-in-Emacs-Lisp kind of regexps; this is called the read syntax. But you can switch to the user-input regexps with C-c TAB string RET (yes, these are called string here). There are some other possible syntaxes as well.

One final trick for re-builder is the subexpression mode, which you activate with C-c C-e (and leave with q). You can then see what the subexpressions match; e.g., we can match cat, cut, cot etc. with \\bc\\(.\\)t\\b, and the subexpression would then contain the middle letter. re-builder automatically converts between the syntaxes it supports, so you could use 'string-mode' as well: \bc\(.\)t\b.
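
Once re-builder has helped you find the regexp, the subexpression is available in code through match-string. A minimal sketch, assuming you want to collect the middle letters from a string; the function name my-middle-letters and the sample text are made up for illustration.

```elisp
;; Hypothetical helper, not part of Emacs: collect what subexpression 1
;; of the cat/cut/cot regexp matched, for every match in TEXT.
(defun my-middle-letters (text)
  "Return a list of the middle letters of all c?t words in TEXT."
  (let ((start 0)
        (letters '()))
    ;; \\(.\\) is subexpression 1: the single character between c and t.
    (while (string-match "\\bc\\(.\\)t\\b" text start)
      (push (match-string 1 text) letters)
      (setq start (match-end 0)))
    (nreverse letters)))

(my-middle-letters "cat cut cot coat")   ; => ("a" "u" "o")
```

"coat" is skipped because the regexp allows exactly one character between the c and the t.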

11 comments:

Seneca the Younger said...

"even if there were only one person that finds out about this through this blog-post, it'd be worth it!"

score.

sajith said...

One person that found re-builder from this blog post says thank you.

Rotem said...

Thanks a lot! Are there any flavors of this that build PCRE regexps?

awils1 said...

Thanks a bunch! I'd adore a more introductory tutorial to regex syntax in Emacs, but this is fantastic nonetheless.

rudi said...

For nicer regexp syntax, there's also rx:

(require 'rx)
(rx word-boundary (or "cat" "dog") word-boundary)

rx rocks.

Alberto said...

In the fourth paragraph it is said:

In Perl you could write this as /\bcat\b\/ (...)

I think the correct expression would be /\bcat\b/, without the last backslash.

BTW, very nice article. Also the first time that I read about re-builder.

djcb said...

@alberto: you are very right -- you see how all those \-chars get confusing quickly :-)

won't update the post now, as it would show up as a new article in planet.emacsen as well...

Anonymous said...

oh this tip made my day. i've been futzing around with backslash after backslash without a good trial-and-error system until i ran into this. THANKS!!

Drew said...

See also this. WYSIWYG search that highlights regexp groups can be a great way to test out a regexp.

http://emacswiki.org/emacs/RegularExpressionHelp#toc3

Justin said...

re-builder! Awesomeness! Thank you!