Recipe 6.7 Reading Records with a Pattern Separator

6.7.1 Problem

You want to read records separated by a pattern, but Perl doesn't allow its input record separator variable to be a regular expression.

Many problems, most obviously those involving parsing complex file formats, become simpler when you can extract records separated by different strings.

6.7.2 Solution

Read the whole file and use split:

undef $/;
@chunks = split(/pattern/, <FILEHANDLE>);

6.7.3 Discussion

Perl's official record separator, the $/ variable, must be a fixed string, not a pattern. To sidestep this limitation, undefine the input record separator entirely so that the next readline operation reads the rest of the file. This is sometimes called slurp mode, because it slurps in the whole file as one big string. Then split that huge string, passing the record-separating pattern as the first argument to split.
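The slurp-then-split steps just described can be sketched as follows. This is only an illustration: the sample data, the separator pattern (a line made up of dashes or equals signs), and the in-memory filehandle standing in for a real file are all assumptions, not part of the recipe.

```perl
use strict;
use warnings;

# Hypothetical sample data; a line of dashes or equals signs separates records.
my $data = "alpha\n---\nbeta\n=====\ngamma\n";

# An in-memory filehandle stands in for a real file here.
open(my $fh, '<', \$data) or die "can't open in-memory handle: $!";

my @chunks;
{
    local $/;                    # undefined: the next read returns all remaining input
    my $contents = <$fh>;        # slurp the whole "file" as one string
    @chunks = split(/^[-=]+\n/m, $contents);
}
close($fh);

print "I read ", scalar(@chunks), " chunks.\n";
```

With this sample data the split yields three chunks, one each for "alpha", "beta", and "gamma", with their trailing newlines intact.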

Here's an example where the input stream is a text file that includes lines consisting of ".Se", ".Ch", and ".Ss", which are special codes in the troff macro set that this book was developed under. These strings are the separators, and we want to find text that falls between them.

# .Ch, .Se and .Ss divide chunks of STDIN
{
    local $/ = undef;
    @chunks = split(/^\.(Ch|Se|Ss)$/m, <>);
}
print "I read ", scalar(@chunks), " chunks.\n";

We create a localized version of $/ so its previous value is restored once the block finishes. Because the pattern contains capturing parentheses, split also returns the captured separators. This way data elements in the return list alternate with elements containing "Se", "Ch", or "Ss".
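To see the alternation concretely, here is a small sketch that splits a string instead of reading from a filehandle. The sample text is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample input standing in for a troff source file.
my $text = "preface\n.Ch\nchapter text\n.Se\nsection text\n";

# Capturing parentheses make split return each matched code as its own element.
my @chunks = split(/^\.(Ch|Se|Ss)$/m, $text);

# Even indices hold text; odd indices hold the captured separator codes.
for my $i (0 .. $#chunks) {
    my $kind = $i % 2 ? "separator" : "text";
    (my $shown = $chunks[$i]) =~ s/\n/\\n/g;   # make newlines visible
    print "$i $kind: $shown\n";
}
```

Note that the pattern's $ matches just before the newline that ends the separator line, so each text chunk that follows a separator begins with that newline.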

If you don't want separators returned, but still need parentheses, use non-capturing parentheses in the pattern: /^\.(?:Ch|Se|Ss)$/m.
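As a rough illustration, with made-up sample text, the non-capturing form returns only the text between separators:

```perl
use strict;
use warnings;

# Hypothetical sample input standing in for a troff source file.
my $text = "preface\n.Ch\nchapter text\n.Se\nsection text\n";

# (?:...) groups the alternatives without capturing, so split
# returns only the text between separators.
my @chunks = split(/^\.(?:Ch|Se|Ss)$/m, $text);

print "I read ", scalar(@chunks), " chunks.\n";
```

Here the list has three elements, and none of them is a separator code.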

To split before a pattern but keep the matched text in the returned chunks, use a lookahead assertion: /^(?=\.(?:Ch|Se|Ss))/m. That way each chunk except the first starts with the pattern.
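A short sketch of the lookahead form, again using made-up sample text, shows where the separators end up:

```perl
use strict;
use warnings;

# Hypothetical sample input standing in for a troff source file.
my $text = "preface\n.Ch\nchapter text\n.Se\nsection text\n";

# The lookahead matches the empty string just before each code,
# so nothing is consumed and the code stays with the chunk it introduces.
my @chunks = split(/^(?=\.(?:Ch|Se|Ss))/m, $text);

for my $chunk (@chunks) {
    my ($first_line) = $chunk =~ /^(.*)$/m;    # first line of this chunk
    print "chunk begins with: $first_line\n";
}
```

With this input the first chunk is the text before any code, and the second and third begin with ".Ch" and ".Se" respectively.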

Be aware that slurping the whole file uses a lot of memory when the file is large. However, with today's machines and typical text files, this is less often an issue now than it once was. Just don't try it on a 200 MB logfile unless you have plenty of virtual memory for swapping out to disk! Even if you do have enough swap space, you'll likely end up thrashing.

6.7.4 See Also

The $/ variable in perlvar(1) and in the "Per-Filehandle Variables" section of Chapter 28 of Programming Perl; the split function in perlfunc(1) and Chapter 29 of Programming Perl; we talk more about the special variable $/ in Chapter 8.

