# what happens when computer

One of the biggest tragedies of computer science is how we so often forget our field has a history. It's not at all hard to read papers from a decade or two ago and find amazing ideas that never got popular, or even ideas which are currently being popularized in blog posts and tweets without any idea that there's prior art. As Ron Minnich once said:
> You want to make your way in the CS field? Simple. Calculate rough time of amnesia (hell, 10 years is plenty, probably 10 months is plenty), go to the dusty archives, dig out something fun, and go for it. It's worked for many people, and it can work for you.
A good source of mostly-forgotten CS ideas is the Plan 9 from Bell Labs operating system, which is a lost utopia among computer systems. It was an operating system that took the same ideals that Unix paid lip-service to and realized them more fully than anyone else had. A lot has been written about the amazing networking portions of Plan 9, and the way that it considers the Unix adage everything is a file and takes it a hundred times further than Unix ever did. That's not what I'm going to go into here, but you should absolutely investigate it.
Instead, I'm going to talk about a minor but very interesting feature of the Plan 9 user-space utilities that has been undeservedly forgotten by many programmers: something called Structural Regular Expressions. Plan 9's user-land utilities are available for Unix-like operating systems: they are known as Plan 9 from User Space, or plan9port for short, and you can easily download them and experiment with the features I describe here. You can read about them in Rob Pike's paper describing them or by reading the documentation to the ed-inspired editor sam, but I'm going to take my own approach in describing them that's at least partly inspired by a now-deleted blog post that originally informed me.
I suspect at least part of the reason structural regular expressions aren't well-remembered is that the name is both misleading and thoroughly boring. It makes it sound like they're an alternative to regular expressions, which isn't the case: they're an alternative to the command language found in tools like ed or sed. Their implementation of regular expressions is (with a small but important exception) the same one you can find everywhere, but structural regular expressions build a more powerful set of actions on top of regular expressions.

## ed and sed

The most well-known of ed's regex commands is almost certainly regex substitution, which takes the form
(ed) s/<regex>/<replacement>/<flags>
For example, the command s/this/that/g will replace every instance of the string this with the string that. (Omitting the g results in a command that only replaces the first instance.) Another ed command is g/<regex>/<command>, which loops over every line in the file, and if a line matches the supplied <regex>, runs <command> on that line. The inverse command, which runs <command> if the line doesn't match, is v/<regex>/<command>. For example, a command like
(ed) g/cake/s/this/that/g
can be read as find every line that contains cake, then only on those lines, replace every instance of this with that. This syntax is a bit obtuse: you can imagine it as piping the result of the g command into the s command, but it might be more accurate to think of the s command as a function or continuation passed to g: it's a command that represents what to do next, which is the essence of continuations. A classic simple command is finding and printing lines that match a pattern: using re to stand in for an arbitrary regular expression, the command would be written g/re/p. This command is useful enough that it inspired a tool specifically for the purpose of printing lines matched by a regex, creatively named after that equivalent ed command: grep.
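To make those semantics concrete, here's a minimal Python sketch of what g/cake/s/this/that/g does, treating the buffer as a list of lines (the helper name g_sub is mine, not ed's):

```python
import re

def g_sub(lines, guard, pattern, replacement):
    """Emulate ed's g/<guard>/s/<pattern>/<replacement>/g:
    on every line matching guard, replace all occurrences of pattern."""
    return [
        re.sub(pattern, replacement, line) if re.search(guard, line) else line
        for line in lines
    ]

buffer = ["this cake is nice\n", "this pie is nice\n"]
print(g_sub(buffer, "cake", "this", "that"))
# ['that cake is nice\n', 'this pie is nice\n']
```

Note how the substitution only fires on lines that pass the guard: the pie line keeps its this untouched.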

## Structural regular expressions

Structural regular expressions build on a similar but non-identical command language. The first deficiency identified in traditional Unix regexp-ey tools was that they were necessarily line-oriented. This isn't a feature of the theory of regular languages, but rather a practical API choice for Unix programs, which often deal with newline-delimited text files. While practical for some applications, it creates a weird edge case where some hopefully-straightforward uses of regular expressions don't suffice. For example, I might want to write a short script to search my prose for accidentally repeated instances of common words like the: a regex like /the +the/ would handle most cases, but would completely fail to match the string "the\nthe".
Structural regular expressions begin by tossing out line-orientedness: a regular expression like .* would match the entire file, newlines and all. Structural regular expressions use the escape sequence \n to represent a newline, so if I wanted to match a single line, I could use the regular expression .*\n to describe it. Consequently, I can handle the "the\nthe" case by writing /the[ \n]+the/, and replace all instances of repeated the, even across newlines, with the following command. (I'm marking these snippets with sam, the ed-inspired editor that appeared in Plan 9; there's a bit more complexity to actually using sam which I'm eliding for the sake of explanation.)
(sam) s/the[ \n]+the/the/g
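You can see both the line-oriented failure and the fix with Python's re module, which shares the relevant regex syntax:

```python
import re

text = "I saw the\nthe cat."

# The space-only pattern can't see across the newline boundary:
print(re.search(r"the +the", text))          # None

# Treating \n as just another character catches the repetition:
print(re.sub(r"the[ \n]+the", "the", text))  # I saw the cat.
```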
That said, line-oriented commands are often very useful, and it'd be a shame if we lost out on the ability to do things on a per-line basis! Luckily, structural regular expressions have a trick up their sleeves: the x command, which can be thought of as a sort of for-each over every place where a regular expression matches the input. It takes the form x/<regex>/<command>, and will find every instance of <regex> in the input and then run <command> over only the portion of the input that matched the regex. We can trivially combine this with commands like p for printing:
(sam) x/cake/p 
That command was pretty boring: all it will do is print out every instance of the word cake, and only that: none of the characters before, none of the surrounding lines, just a bunch of cakes, as many as appear in the input. If, instead, we wanted to print out every line that contained the word cake, we could write something like this:
(sam) x/.*cake.*\n/p 
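In Python terms, x behaves roughly like re.finditer: it yields every match, and the command then runs on each match in turn. A sketch of the two commands above, not sam's actual implementation:

```python
import re

text = "eat more pie\nthe cake is ready\nno dessert here\n"

# x/cake/p prints just the matched text, once per occurrence
for m in re.finditer(r"cake", text):
    print(m.group())          # cake

# x/.*cake.*\n/p widens each match to the whole containing line
for m in re.finditer(r".*cake.*\n", text):
    print(m.group(), end="")  # the cake is ready
```

(Conveniently, Python's . also refuses to match a newline by default, mirroring sam's behavior, which is why .*cake.*\n stays within one line.)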
But there's a more elegant way: in the sam command language, the g command does something different than in ed: it looks like g/<regex>/<command> and acts like a filter, running a supplied command over the entire input (not just the part it matched) if its regex matches some part of the input. So, if we want to print every line that contains cake, we could first focus on each line of input with x, use g to filter down to just the lines that contain the string cake, and then print those lines:
(sam) x/.*\n/g/cake/p 
There's a corresponding negative match command v, which runs the command if it doesn't find a match, so the following command prints every line which doesn't contain cake:
(sam) x/.*\n/v/cake/p 
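Sketched in Python, g and v are just a filter and its negation applied to the pieces that x produced:

```python
import re

text = "eat more pie\nthe cake is ready\nno dessert here\n"
lines = re.findall(r".*\n", text)  # x/.*\n/ focuses on each line

# g/cake/p keeps the lines where the regex matches anywhere...
print([l for l in lines if re.search("cake", l)])
# ['the cake is ready\n']

# ...and v/cake/p keeps the lines where it doesn't
print([l for l in lines if not re.search("cake", l)])
# ['eat more pie\n', 'no dessert here\n']
```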
And there are commands for prepending, appending, modifying, or deleting text matched by previous commands. We can still use our old friend the s command to replace a regular expression, so the structural equivalent of our replace-this-with-that-on-lines-containing-cake example above could be written by focusing on lines, filtering by cake, and then running a traditional s command:
(sam) x/.*\n/g/cake/s/this/that/g
But we could also write this a slightly different way: by first focusing on lines, then filtering by cake, then focusing on instances of the string this, and using the c command to change the focused text to that:
(sam) x/.*\n/g/cake/x/this/c/that
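Both spellings describe the same edit. In rough Python terms the pipeline is focus, filter, rewrite; this is a sketch of the semantics, not of how sam implements them:

```python
import re

text = "this cake, this tea\nthis pie\n"

# x/.*\n/ focuses on lines, g/cake/ filters them, and
# s/this/that/g (or x/this/c/that/) rewrites each focused piece
out = "".join(
    re.sub("this", "that", line) if re.search("cake", line) else line
    for line in re.findall(r".*\n", text)
)
print(out)
# that cake, that tea
# this pie
```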

## A High-Level View

A major advantage of structural regular expressions is that they're more compositional than the traditional ed-like command language. In ed, we might be tempted to find lines that contain both this and that by writing a command like
(ed) g/this/g/that/p # bad!
but we can't do this: g commands aren't allowed to invoke other g commands, only simpler commands like printing or deletion. But in structural regular expressions, the primitive components are designed to be used in a recursive way, composing complicated commands out of regular expressions and sets of commands, which has the wonderful side-effect that the regular expressions you actually write are much simpler. To borrow a few examples from Rob Pike's paper: with structural regular expressions, if we wanted to print every line that contained rob but not robot, we could write a command to focus on lines, keep only those that contain rob, filter out those that contain robot, and print them:
(sam) x/.*\n/g/rob/v/robot/p
The same thing in an ed-like command language would require making the regular expression much more complicated, to filter by the string rob when not followed by o, or when followed by o but not by t, and so forth:
(ed) g/rob($|[^o]|o[^t])/p
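The two approaches really do agree. Here's a quick check of both filters with Python's re over some sample lines:

```python
import re

lines = ["rob is here\n", "the robot is here\n", "robbers too\n"]

# x/.*\n/g/rob/v/robot/p: two simple, composable filters
structural = [l for l in lines
              if re.search("rob", l) and not re.search("robot", l)]

# g/rob($|[^o]|o[^t])/p: one hairier regex doing the same job
classic = [l for l in lines if re.search(r"rob($|[^o]|o[^t])", l)]

print(structural)
# ['rob is here\n', 'robbers too\n']
assert structural == classic
```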
Another advantage of structural regular expressions is that they alleviate a common problem with submatches. I neglected to mention it above, but the s command also allows you to use parens within the supplied regex to select some subpart of the matched input, and use that within the output, e.g.
(sam) s/([A-Za-z]+) ([A-Za-z]+)/\2 \1/g
is a command which will match two words and swap their order, with \1 referring to the item matched within the first set of parens, and \2 to the item within the second. One problem with submatches is that they are no longer valid when you put them underneath a *:
(ed) s/words: ([A-Za-z]+ )*/got: \1/g # also bad!
What does this invocation mean? Well, nothing: \1 can't unambiguously refer to any submatch, because the parenthesized group may match zero or more times.
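The unambiguous version, by contrast, works the same way in any backreference-capable engine; for instance, Python's re spells submatches identically:

```python
import re

# Each match has exactly one \1 and one \2, so the swap is well-defined
print(re.sub(r"([A-Za-z]+) ([A-Za-z]+)", r"\2 \1", "hello world"))
# world hello
```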
Because structural regular expressions contain for-each-like constructs, we can start to articulate commands that perform the same repetitions, but with no ambiguity about what replaces what:
(sam) x/words: [A-Za-z ]+\n/x/[A-Za-z]+ /i/got: 
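A rough Python rendering of that nested-x idea uses one re.sub inside another, so each repetition is handled individually instead of through an ambiguous backreference. (I've adjusted the inner word pattern slightly, using a lookahead, so the line's last word is caught too.)

```python
import re

def insert_got(text):
    # Outer x: focus on the "words: ..." lines
    def per_line(m):
        # Inner x: visit each word in the focused region;
        # i/got: / prefixes the insertion onto every match
        return re.sub(r"[A-Za-z]+(?= |\n)",
                      lambda w: "got: " + w.group(), m.group())
    return re.sub(r"words: [A-Za-z ]+\n", per_line, text)

print(insert_got("words: apple banana\nother line\n"))
# words: got: apple got: banana
# other line
```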
Structural regular expressions are a powerful and expressive tool: separating the ability to focus, filter, and edit into different but complementary commands provides a surprising amount of power while also simplifying the regular expressions needed to perform these operations. They are sadly not as popular as they could be (as far as I know, the only editors that support them are part of the Plan 9 tools), but implementing them in other tools would be a wonderful and easy way of making those tools more powerful. So, editor-writers and tool-writers, keep them in mind!