sed — Stream Editor

Sometimes it is better to use regular expressions to manipulate content rather than patching sources. This can be used for small changes, especially those which are likely to create patch conflicts across versions. The canonical way of doing this is via sed:

    # This plugin is mapped to the 'h' key by default, which conflicts with some
    # other mappings. Change it to use 'H' instead.
    sed -i 's/\(noremap <buffer> \)h/\1H/' info.vim \
        || die 'sed failed'

Another common example is appending a -gentoo-blah version string (some upstreams like us to do this so that they can tell exactly which package they're dealing with). Again, we can use sed. Note that the ${PR} variable will be set to r0 if we don't have a -r component in our version.

    # Add in the Gentoo -r number to fluxbox -version output. We need to look
    # for the line in version.h.in which contains "__fluxbox_version" and append
    # our content to it.
    if [[ "${PR}" == "r0" ]] ; then
        suffix="gentoo"
    else
        suffix="gentoo-${PR}"
    fi
    sed -i \
        -e "s~\(__fluxbox_version .@VERSION@\)~\1-${suffix}~" \
        version.h.in || die "version sed failed"

It is also possible to extract content from existing files to create new files this way. Many app-vim ebuilds use this technique to extract documentation from the plugin files and convert it to Vim help format.

    # This plugin uses an 'automatic HelpExtractor' variant. This causes
    # problems for us during the unmerge. Fortunately, sed can fix this
    # for us. First, we extract the documentation:
    sed -e '1,/^" HelpExtractorDoc:$/d' \
        "${S}"/plugin/ZoomWin.vim > "${S}"/doc/ZoomWin.txt \
        || die "help extraction failed"

    # Then we remove the help extraction code from the plugin file:
    sed -i -e '/^" HelpExtractor:$/,$d' "${S}"/plugin/ZoomWin.vim \
        || die "help extract remove failed"

A summary of the more common ways of using sed and a description of commonly used address and token patterns follows. Note that some of these constructs are specific to GNU sed 4. On non-GNU userland archs, the sed command must be aliased to GNU sed. Also note that GNU sed 4 is guaranteed to be installed as part of @system. This was not always the case, which is why some packages, particularly those which use sed -i, have a DEPEND upon >=sys-apps/sed-4.

Basic <c>sed</c> Invocation

The basic form of a call is:

    sed [ option flags ] \
        -e 'first command' \
        -e 'second command' \
        -e 'and so on' \
        input-file > output-file \
        || die "Oops, sed didn't work!"

For cases where the input and output files are the same, the in-place option should be used. This is done by passing -i as one of the option flags.

By default, sed prints every line of the resulting content. To obtain only explicitly printed lines, the -n flag should be used.

The term pattern refers to the description of text being matched.
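As a minimal sketch of the invocation form described above, the following runs two -e commands in sequence and then uses -n to print only selected lines. The sample input and variable names are made up for illustration, and die is left out because it is only available inside an ebuild.

```shell
# Hypothetical three-line input used only for this illustration.
input='alpha
beta
gamma'

# Commands given with -e run in order on each line: the second command
# sees the result of the first.
both=$(printf '%s\n' "$input" | sed -e 's/alpha/ALPHA/' -e 's/ALPHA/first/')

# With -n, nothing is printed unless a command such as p asks for it.
only_beta=$(printf '%s\n' "$input" | sed -n -e '/beta/p')

printf '%s\n' "$both" "---" "$only_beta"
```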

Simple Text Substitution using <c>sed</c>

The most common form of sed is to replace all instances of some text with different content. This is done as follows:

    # replace all instances of "some text" with "different content" in
    # somefile.txt
    sed -i -e 's/some text/different content/g' somefile.txt || \
        die "Sed broke!"

Note: The /g flag is required to replace all occurrences. Without this flag, only the first match on each line is replaced.

Warning: The above will replace irksome texting with irkdifferent contenting, which may not be desired.
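The effect of the g flag, and the partial-word pitfall, can be seen on a single made-up line. This is only a sketch: the last command assumes the GNU word-boundary anchors \< and \>, which this document does not otherwise cover, as one possible way to restrict the match to whole words.

```shell
# Hypothetical sample line.
line='some text here, some text there, irksome texting'

# No g flag: only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed -e 's/some text/different content/')

# g flag: every match is replaced, including the one inside "irksome texting".
all=$(printf '%s\n' "$line" | sed -e 's/some text/different content/g')

# Assumption: GNU sed's \< and \> anchors are available. They make the
# pattern match only where "some text" starts and ends on a word boundary.
whole_words=$(printf '%s\n' "$line" | sed -e 's/\<some text\>/different content/g')

printf '%s\n' "$first_only" "$all" "$whole_words"
```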

If the pattern or the replacement string contains the forward slash character, it is usually easiest to use a different delimiter. Most punctuation characters are allowed, although backslash and any form of brackets should be avoided. You should choose your delimiter with care to ensure it cannot appear in any strings involved in the pattern or the replacement. For example, using sed with CFLAGS is hazardous because it is user-supplied data and so may contain any character; the colon in particular should be avoided as a delimiter here.

    # replace all instances of "/usr/local" with "/usr"
    sed -i -e 's~/usr/local~/usr~g' somefile.txt || \
        die "sed broke"
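The delimiter hazard can be reproduced with a small experiment. The flags value and the @CFLAGS@ placeholder below are hypothetical; the point of the sketch is only that a delimiter occurring inside substituted data makes sed reject the command.

```shell
# Hypothetical user-supplied value that happens to contain a colon.
flags='-O2 -DPATHS=/opt/a:/opt/b'

# With ':' as the delimiter, the colon inside the value ends the
# replacement early and sed rejects the command.
if printf 'CFLAGS=@CFLAGS@\n' | sed -e "s:@CFLAGS@:${flags}:" >/dev/null 2>&1; then
    colon_delimiter=ok
else
    colon_delimiter=failed
fi

# A delimiter that does not occur in the value works for this input, but
# nothing guarantees that for arbitrary user data.
tilde=$(printf 'CFLAGS=@CFLAGS@\n' | sed -e "s~@CFLAGS@~${flags}~")

printf '%s\n' "$colon_delimiter" "$tilde"
```

Even with a safe delimiter, an ampersand or backslash in the value would still be interpreted by sed, so user-supplied data in a replacement remains risky.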

Patterns can be made to match only at the start or end of a line by using the ^ and $ metacharacters. A ^ means "match at the start of a line only", and $ means "match at the end of a line only". By using both in a single statement, it is possible to match exact lines.

    # Replace any "hello"s which occur at the start of a line with "howdy".
    sed -i -e 's!^hello!howdy!' data.in || die "sed failed"

Note: There is no need for a !g suffix here, since a pattern anchored with ^ can match at most once per line.

    # Replace any "bye"s which occur at the end of a line with "cheerio!".
    sed -i -e 's,bye$,cheerio!,' data.in || die "sed failed"

    # Replace any lines which are exactly "change this line" with "have a
    # cookie".
    sed -i -e 's-^change this line$-have a cookie-' data.in || die "Oops"

To ignore case in the pattern, add the /i flag.

    # Replace any "emacs" instances (ignoring case) with "Vim"
    sed -i -e 's/emacs/Vim/gi' editors.txt || die "Ouch"

Warning: Case insensitive matching doesn't work correctly when backreferences are used.

Regular Expression Substitution using <c>sed</c>

It is also possible to do more complex matches with sed. Some examples could be:

Match any three digits
Match either "foo" or "bar"
Match any of the letters "a", "e", "i", "o" or "u"

These types of pattern can be chained together, leading to things like "match any vowel followed by two digits followed by either foo or bar".

To match any of a set of characters, a character class can be used. These come in three forms.

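A backslash followed by a letter. \w, for example, matches a single "word" character (a letter, digit or underscore). \s matches a single whitespace character. Note that GNU sed does not treat \d as a digit class; use [0-9] or [[:digit:]] instead. A table of the more useful classes is provided later in this document.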
A group of characters inside square brackets. [aeiou], for example, matches any one of 'a', 'e', 'i', 'o' or 'u'. Ranges are allowed, such as [0-9A-Fa-fxX], which could be used to match any hexadecimal digit or the characters 'x' and 'X'. Inverted character classes, such as [^aeiou], match any single character except those listed.
A POSIX character class is a special named group of characters that are locale-aware. For example, [[:alpha:]] matches any 'alphabet' character in the current locale. A table of the more useful classes is provided later in this document.
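The three forms of character class listed above can be tried on short made-up strings. This is a sketch assuming GNU sed, since the \s shortcut is a GNU extension.

```shell
# Bracket expression: replace every vowel with '*'.
vowels=$(printf 'gentoo\n' | sed -e 's/[aeiou]/*/g')

# Inverted bracket expression: replace every character that is not a vowel.
non_vowels=$(printf 'gentoo\n' | sed -e 's/[^aeiou]/-/g')

# POSIX character class: replace every digit.
digits=$(printf 'sed-4.9\n' | sed -e 's/[[:digit:]]/N/g')

# Backslash shortcut: \s matches the space and the tab alike.
spaces=$(printf 'a b\tc\n' | sed -e 's/\s/_/g')

printf '%s\n' "$vowels" "$non_vowels" "$digits" "$spaces"
```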

Warning: The regex a[^b] does not mean "match a, so long as it does not have a 'b' after it". It means "match a followed by exactly one character which is not a 'b'". This is important when one considers a line ending in the character 'a'.

Note: At the time of writing, the sed documentation (man sed and sed.info) does not mention that POSIX character classes are supported. Consult IEEE Std 1003.1-2017, section 9.3 for full details of how these should work, and the sed source code for full details of how these actually work.
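The a[^b] pitfall is easiest to see on a word that ends in 'a'. The inputs below are made up for illustration.

```shell
# "banana" ends in 'a'. The final 'a' has no character after it, so
# a[^b] cannot match there and that 'a' is left untouched.
ends_in_a=$(printf 'banana\n' | sed -e 's/a[^b]/X/g')

# In "cab" the only 'a' is followed by 'b', so nothing matches.
followed_by_b=$(printf 'cab\n' | sed -e 's/a[^b]/X/g')

printf '%s\n' "$ends_in_a" "$followed_by_b"
```

Note also that each match consumes the character after the 'a', which is why the 'n' characters disappear from the first result.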

To match any one of multiple options, alternation can be used. The basic form is first\|second\|third.

To group items to avoid ambiguity, the \(parentheses\) construct may be used. To match "iniquity" or "infinity", one could use in\(iqui\|fini\)ty.
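A short sketch of alternation and grouping, assuming GNU sed (the \| operator is a GNU extension to basic regular expressions). The word lists are made up.

```shell
# Hypothetical word list, one word per line.
words='iniquity
infinity
inanity'

# Grouping limits the alternation to the middle of the word.
matched=$(printf '%s\n' "$words" | sed -n -e '/^in\(iqui\|fini\)ty$/p')

# Plain alternation: either word is replaced.
pets=$(printf 'cat dog bird\n' | sed -e 's/cat\|dog/pet/g')

printf '%s\n' "$matched" "$pets"
```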

To optionally match an item, add a \? after it. For example, colou\?r matches both "colour" and "color". This can also be applied to character classes and groups in parentheses, for example \(in\)\?finite\(ly\)\?. Further atoms are available for matching "one or more", "zero or more", "at least n", "between n and m" and so on. These are summarised later in this document.
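The optional-match operator can be checked on a few made-up words. This sketch assumes GNU sed, where \? is available in basic regular expressions.

```shell
# The 'u' is optional, so both spellings match; "colouur" does not.
spelling=$(printf 'color colour colouur\n' | sed -e 's/colou\?r/X/g')

# Optional groups: the "in" prefix and the "ly" suffix may each be absent.
adverbs=$(printf 'finite infinite finitely infinitely\n' \
    | sed -e 's/\(in\)\?finite\(ly\)\?/Y/g')

printf '%s\n' "$spelling" "$adverbs"
```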

There are also some special constructs which can be used in the replacement part of a substitution command. To insert the contents of the pattern's first matched bracket group, use \1, for the second use \2 and so on up to \9. An unescaped ampersand & character can be used to insert the entire contents of the match. These and other replace atoms are summarised later in this document.
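The replacement constructs described above can be sketched as follows. The input strings are invented for the example.

```shell
# \1 and \2 insert the first and second groups: swap the two fields.
swapped=$(printf 'key=value\n' | sed -e 's/\(.*\)=\(.*\)/\2=\1/')

# An unescaped & inserts the whole match: wrap each number in brackets.
wrapped=$(printf 'r1 and r22\n' | sed -e 's/[0-9][0-9]*/[&]/g')

# An escaped \& is a literal ampersand.
literal=$(printf 'a and b\n' | sed -e 's/and/\&/')

printf '%s\n' "$swapped" "$wrapped" "$literal"
```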

Addresses in <c>sed</c>

Many sed commands can be applied only to a certain line or range of lines. This could be useful if one wishes to operate only on the first ten lines of a document, for example.

The simplest form of address is a single positive integer. This will cause the following command to be applied only to the line in question. Line numbering starts from 1. GNU sed also accepts 0 as the first address of a range of the form 0,/pattern/, which allows the pattern to match on the first line as well. If the address 100 is used on a 50 line document, the associated command will never be executed.

To match the last line in a document, the $ address may be used.

To match any lines that match a given regular expression, the form /pattern/ is allowed. This can be useful for finding a particular line and then making certain changes to it. Sometimes it is simpler to handle this in two stages rather than using one big scary s/// command. When used in ranges, it can be useful for finding all text between two given markers or between a given marker and the end of the document.

To match a range of addresses, addr1,addr2 can be used. Most address constructs are allowed for both the start and the end addresses.

Addresses may be inverted with an exclamation mark. To match all lines except the last, $! may be used.

Finally, if no address is given for a command, the command is applied to every line in the input.

GNU sed does not support the % address forms found in some other implementations. It also doesn't support /addr/+offset; that's an ex thing...

Other more complex options involving chaining addresses are available. These are not discussed in this document.
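The address forms covered in this section can be compared side by side on a small made-up input. The -n flag and the p command are used here simply to show which lines each address selects.

```shell
# Hypothetical five-line input.
input='one
two
three
four
five'

# A line number selects exactly that line.
second=$(printf '%s\n' "$input" | sed -n -e '2p')

# $ selects the last line.
last=$(printf '%s\n' "$input" | sed -n -e '$p')

# A range may mix address types: from line 2 to the next line matching /four/.
range=$(printf '%s\n' "$input" | sed -n -e '2,/four/p')

# An inverted address: append a comma to every line except the last.
not_last=$(printf '%s\n' "$input" | sed -e '$!s/$/,/')

printf '%s\n' "$second" "$last" "$range" "$not_last"
```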

Content Deletion using <c>sed</c>

Lines may be deleted from a file using the d command, prefixed with an address. To delete the third line of a file, one could use 3d, and to filter out all lines containing "fred", /fred/d.

sed -e /fred/d is not the same as s/.*fred.*//. The former will delete the lines including the newline, whereas the latter will delete the lines' contents but not the newline.
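The difference between deleting a line and merely emptying it shows up as a leftover blank line. A minimal sketch with made-up input:

```shell
# Hypothetical three-line input.
input='alice
fred was here
bob'

# d removes the whole line, newline included.
deleted=$(printf '%s\n' "$input" | sed -e '/fred/d')

# The substitution empties the line but leaves its newline behind.
emptied=$(printf '%s\n' "$input" | sed -e 's/.*fred.*//')

# A numeric address deletes by position instead of by content.
third_removed=$(printf '%s\n' "$input" | sed -e '3d')

printf '%s\n' "$deleted" "---" "$emptied" "---" "$third_removed"
```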

Content Extraction using <c>sed</c>

When the -n option is passed to sed, no output is printed by default. The p command can be used to display content. For example, to print lines containing "infra monkey", the command sed -n -e '/infra monkey/p' could be used. Ranges may also be printed; sed -n -e '/^START$/,/^END$/p' is sometimes useful.
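As a sketch of range extraction, the following pulls a marked block out of made-up input, and also shows what happens when p is used without -n.

```shell
# Hypothetical input with a marked block in the middle.
input='preamble
START
payload 1
payload 2
END
trailer'

# Print only the lines from START to END, markers included.
block=$(printf '%s\n' "$input" | sed -n -e '/^START$/,/^END$/p')

# Without -n, p prints the line in addition to the default output,
# so every selected line appears twice.
doubled=$(printf 'a\n' | sed -e 'p')

printf '%s\n' "$block" "---" "$doubled"
```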

Inserting Content using <c>sed</c>

To insert text with sed, use the a or i command prefixed with an address. The a command appends the text on the line following the match, while the i command inserts it on the line before the match.

As usual, an address can be either a line number or a regular expression: a line number command will only be executed once and a regular expression insert/append will be executed for each match.

    # Add 'Bob' after the 'To:' line:
    sed -i -e '/^To: $/a Bob' data.in || die "Oops"

    # Add 'From: Alice' before the 'To:' line:
    sed -i -e '/^To: $/i From: Alice' data.in || die "Oops"

    # Note that the spacing between the 'i' or 'a' and 'Bob' or 'From: Alice'
    # is simply ignored.

    # Add 'From: Alice' indented by two spaces (you only need to escape the
    # first space):
    sed -i -e '/^To: $/i\  From: Alice' data.in || die "Oops"

Note that you should use a match instead of a line number wherever possible. This reduces problems if a line is added at the beginning of the file, for example, causing your sed script to break.
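The advice about preferring a match over a line number can be illustrated with a made-up file to which a new first line has been added. The sketch assumes GNU sed's one-line form of the a command.

```shell
# Hypothetical file contents after an extra first line was added; the
# line-number script was written when 'To:' was still on line 2.
new='X-Extra: added later
Subject: hello
To:
body'

# The line-number address now appends in the wrong place.
by_number=$(printf '%s\n' "$new" | sed -e '2a Bob')

# The regular expression address still finds the 'To:' line.
by_match=$(printf '%s\n' "$new" | sed -e '/^To:$/a Bob')

printf '%s\n' "$by_number" "---" "$by_match"
```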

Regular Expression Atoms in <c>sed</c>

Basic Atoms

Atom                          Purpose
text                          Literal text
\( \)                         Grouping
\|                            Alternation, a or b
* \? \+ \{\}                  Repeats, see below
.                             Any single character
^                             Start of line
$                             End of line
[abc0-9]                      Any one of
[^abc0-9]                     Any one character except
[[:alpha:]]                   POSIX character class, see below
\1 .. \9                      Backreference
\x (any special character)    Match character literally
\x (normal characters)        Shortcut, see below

Character Class Shortcuts

Atom    Description
\a      "BEL" character
\f      "Form Feed" character
\t      "Tab" character
\w      "Word" (a letter, digit or underscore) character
\W      "Non-word" character

POSIX Character Classes

Read the source, it's the only place these're documented properly...

Class          Description
[[:alpha:]]    Alphabetic characters
[[:upper:]]    Uppercase alphabetics
[[:lower:]]    Lowercase alphabetics
[[:digit:]]    Digits
[[:alnum:]]    Alphabetic and numeric characters
[[:xdigit:]]   Digits allowed in a hexadecimal number
[[:space:]]    Whitespace characters
[[:print:]]    Printable characters
[[:punct:]]    Punctuation characters
[[:graph:]]    Non-blank characters
[[:cntrl:]]    Control characters

Count Specifiers

Atom       Description
*          Zero or more (greedy)
\+         One or more (greedy)
\?         Zero or one (greedy)
\{N\}      Exactly N
\{N,M\}    At least N and no more than M (greedy)
\{N,\}     At least N (greedy)
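The count specifiers can be compared on one made-up line containing runs of one to four digits. This is a sketch assuming GNU sed (\+ is a GNU extension).

```shell
# Hypothetical input with digit runs of length 1, 2, 3 and 4.
line='a1 b22 c333 d4444'

# Exactly three digits; the fourth digit of "4444" is left over.
exact=$(printf '%s\n' "$line" | sed -e 's/[0-9]\{3\}/N/g')

# Two or three digits, taking as many as allowed.
between=$(printf '%s\n' "$line" | sed -e 's/[0-9]\{2,3\}/N/g')

# Two or more digits.
at_least=$(printf '%s\n' "$line" | sed -e 's/[0-9]\{2,\}/N/g')

# One or more digits.
one_or_more=$(printf '%s\n' "$line" | sed -e 's/[0-9]\+/N/g')

printf '%s\n' "$exact" "$between" "$at_least" "$one_or_more"
```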

Replacement Atoms in <c>sed</c>

Atom        Description
\1 .. \9    Captured \( \) contents
&           The entire matched text
\L          All subsequent characters are converted to lowercase
\l          The following character is converted to lowercase
\U          All subsequent characters are converted to uppercase
\u          The following character is converted to uppercase
\E          Cancel the most recent \L or \U
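The case-conversion atoms are GNU extensions, so the following sketch assumes GNU sed. The input strings are invented for the example.

```shell
# \U converts everything that follows to uppercase.
upper=$(printf 'hello world\n' | sed -e 's/.*/\U&/')

# \L converts everything that follows to lowercase.
lower=$(printf 'HELLO World\n' | sed -e 's/.*/\L&/')

# \u converts only the next character: capitalise each word.
capital=$(printf 'hello world\n' | sed -e 's/\w\+/\u&/g')

# \E ends the conversion, so only the first group is uppercased.
mixed=$(printf 'foo-bar\n' | sed -e 's/\(.*\)-\(.*\)/\U\1\E-\2/')

printf '%s\n' "$upper" "$lower" "$capital" "$mixed"
```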

Details of <c>sed</c> Match Mechanics

GNU sed uses a traditional (non-POSIX) nondeterministic finite automaton with extensions to support capturing to do its matching. This means that in all cases, the match with the leftmost starting position will be favoured. Of all the leftmost possible matches, favour will be given to leftmost alternation options. Finally, all other things being equal favour will be given to the longest of the leftmost counting options.

Most of this is in violation of strict POSIX compliance, so it's best not to rely upon it. It is safe to assume that sed will always pick the leftmost match, and that it will match greedily with priority given to items earlier in the pattern.
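The two behaviours described as safe to rely on, leftmost matching and greedy repetition, can be observed directly. The sketch below uses made-up input and deliberately avoids depending on alternation order.

```shell
# Leftmost: the earliest possible match wins, even though a longer run
# of 'a' characters appears later on the line.
leftmost=$(printf 'xaaxaaaa\n' | sed -e 's/a\+/[&]/')

# Greedy: .* extends as far as it can, so the match runs from the first
# '<' to the last '>'.
greedy=$(printf '<b>bold</b>\n' | sed -e 's/<.*>/X/')

# A negated character class is the usual way to stop at the first '>'.
bounded=$(printf '<b>bold</b>\n' | sed -e 's/<[^>]*>/X/')

printf '%s\n' "$leftmost" "$greedy" "$bounded"
```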

Notes on Performance with <c>sed</c>

TODO: write this

Recommended Further Reading for Regular Expressions

The author recommends Mastering Regular Expressions by Jeffrey E. F. Friedl for those who wish to learn more about regexes. This text is remarkably devoid of phrases like "let t be a finite contiguous sequence such that t[n] ∈ ∑ ∀ n", and was not written by someone whose pay cheque depended upon them being able to express simple concepts with pages upon pages of mathematical and Greek symbols.