Regular Expressions

We have seen some elementary capability for searching strings when we discussed the String class back in Chapter 4. From Java 1.4, we have had much more sophisticated facilities for analyzing strings by searching for patterns known as regular expressions. Regular expressions are not unique to Java. Perl is perhaps better known for its support of regular expressions, many word processors support them, especially on Unix, and there are specific utilities for regular expressions too.

So what is a regular expression? A regular expression is simply a string that describes a pattern that is to be used to search for matches within some other string. It's not simply a passive sequence of characters to be matched, though. A regular expression is essentially a mini-program for a specialized kind of computer called a state-machine. This isn't a real machine but a piece of software specifically designed to interpret a regular expression and analyze a given string based on that.

The regular expression capability in Java is implemented through two classes in the java.util.regex package: the Pattern class that defines objects that encapsulate regular expressions, and the Matcher class that defines an object that encapsulates a state-machine that can search a particular string using a given Pattern object. The java.util.regex package also defines the PatternSyntaxException class that defines exception objects thrown when a syntax error is found when compiling a regular expression to create a Pattern object.

Using regular expressions in Java is basically very simple:

  1. You create a Pattern object by passing a string containing a regular expression to the static compile() method in the Pattern class.

  2. You then obtain a Matcher object, which can search a given string for the pattern, by calling the matcher() method for the Pattern object with the string that is to be searched as the argument.

  3. You call the find() method (or some other methods as we shall see) for the Matcher object to search the string.

  4. If the pattern is found, you query the matcher object to discover the whereabouts of the pattern in the string and other information relating to the match.
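The four steps above can be sketched in a few lines. This is a minimal illustration rather than a complete program; the class name FourStepsSketch and the helper method firstMatchStart() are invented for this sketch and are not part of the java.util.regex package.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FourStepsSketch {
  // Returns the index of the first match of regex in text, or -1 if there is none
  static int firstMatchStart(String regex, String text) {
    Pattern pattern = Pattern.compile(regex);   // Step 1: compile the expression
    Matcher matcher = pattern.matcher(text);    // Step 2: get a Matcher for the string
    if (matcher.find()) {                       // Step 3: search for the pattern
      return matcher.start();                   // Step 4: ask where it was found
    }
    return -1;
  }

  public static void main(String[] args) {
    System.out.println(firstMatchStart("had", "Jones had had 'had'"));   // prints 6
  }
}
```

Returning -1 for "not found" is just a convention chosen here, borrowed from the indexOf() method in the String class.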

While this is a straightforward process that is easy to code, the hard work is in defining the pattern to achieve the result that you want. This is an extensive topic since in their full glory regular expressions are immensely powerful and can get very complicated. There are books devoted entirely to this so our aim will be to get enough of a bare bones understanding of how regular expressions work so you will be in a good position to look into the subject in more depth if you need to. Although regular expressions can look quite fearsome, don't be put off. They are always built step-by-step, so although the end result may look complicated and obscure, they are not at all difficult to put together. Regular expressions are a lot of fun and a sure way to impress your friends and maybe confound your enemies.

Defining Regular Expressions

You may not have heard of regular expressions before reading this book and therefore may think you have never used them. If so, you are almost certainly wrong. Whenever you search a directory for files of a particular type, "*.java" for instance, you are using a form of regular expression. However, to say that regular expressions can do much more than this is something of an understatement. To get an understanding of what we can do with regular expressions, we will start at the bottom with the simplest kind of operation and work our way up to some of the more complex problems they can solve.

Creating a Pattern

In its most elementary form, a regular expression just does a simple search for a substring. For example, if we want to search a string for the word had, the regular expression is exactly that. So the string defining this particular regular expression is "had". Let's use this as a vehicle for understanding the programming mechanism for using regular expressions. We can create a Pattern object for our expression "had" with the statement:

Pattern had = Pattern.compile("had");

The static compile() method in the Pattern class returns a reference to a Pattern object that contains the compiled regular expression. The method will throw an exception of type PatternSyntaxException if the regular expression passed as the argument is invalid. However, you don't have to catch this exception as it is a subclass of RuntimeException and therefore is unchecked. The compilation process stores the regular expression in a Pattern object in a form that is ready to be processed by a Matcher state-machine.
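Although you are not obliged to catch PatternSyntaxException, it can be worth doing when the expression comes from outside your program. The following sketch shows one way this might look; the class CompileCheckSketch and the method isValidRegex() are hypothetical names used only for this illustration, while getDescription() and getIndex() are methods of PatternSyntaxException.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class CompileCheckSketch {
  // Returns true if regex compiles, false if it contains a syntax error
  static boolean isValidRegex(String regex) {
    try {
      Pattern.compile(regex);
      return true;
    } catch (PatternSyntaxException e) {
      // Report what went wrong and where in the expression
      System.out.println("Bad pattern: " + e.getDescription() + " at index " + e.getIndex());
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(isValidRegex("had"));    // a valid expression
    System.out.println(isValidRegex("h[ad"));   // the [ is never closed
  }
}
```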

There's a further version of the compile() method that enables you to control more closely how the pattern will be applied when looking for a match. The second argument is a value of type int that specifies one or more of the following flags that are defined in the Pattern class:

CASE_INSENSITIVE
Matches ignoring case, but assumes only US-ASCII characters are being matched.

MULTILINE
Enables the beginning or end of lines to be matched anywhere. Without this flag only the beginning and end of the entire sequence will be matched.

UNICODE_CASE
When this is specified in addition to CASE_INSENSITIVE, case insensitive matching will be consistent with the Unicode standard.

DOTALL
Makes the expression . (which we will see shortly) match any character, including line terminators.

CANON_EQ
Matches taking account of canonical equivalence of combined characters. For instance, some characters that have diacritics may be represented as a single character or as a base character followed by a diacritic character. This flag will treat these as a match.

COMMENTS
Allows whitespace and comments in a pattern. Comments in a pattern start with # so everything from the first # to the end of the line will be ignored.

UNIX_LINES
Enables Unix lines mode where only '\n' is recognized as a line terminator.

All these flags are single bit values so you can combine them by ORing them together or by simple addition. For instance, you can specify the CASE_INSENSITIVE and the UNICODE_CASE flags with the expression:

Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE

Or you can write this as:

Pattern.CASE_INSENSITIVE + Pattern.UNICODE_CASE
If we wanted to match "had" ignoring case, we could create the pattern with the statement:

Pattern had = Pattern.compile("had", Pattern.CASE_INSENSITIVE);

In addition to the exception thrown by the first version of the compile() method, this version will throw an exception of type IllegalArgumentException if the second argument has bit values set that do not correspond to one of the flag constants defined in the Pattern class.
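To see the effect of the flags, here is a small sketch that combines CASE_INSENSITIVE and UNICODE_CASE with the bitwise OR operator and compares the result with a default, case-sensitive search. The class FlagsSketch and its two helper methods are names made up for this illustration.

```java
import java.util.regex.Pattern;

class FlagsSketch {
  // True if regex occurs in text when case is ignored using Unicode-aware rules
  static boolean foundIgnoringCase(String regex, String text) {
    int flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;   // two single-bit flags combined
    return Pattern.compile(regex, flags).matcher(text).find();
  }

  // True if regex occurs in text with the default, case-sensitive matching
  static boolean foundExactCase(String regex, String text) {
    return Pattern.compile(regex).matcher(text).find();
  }

  public static void main(String[] args) {
    String text = "HAD I known";
    System.out.println(foundExactCase("had", text));      // false: the case differs
    System.out.println(foundIgnoringCase("had", text));   // true: case is ignored
  }
}
```

Using | rather than + is the safer habit, because ORing a flag that is already set leaves the value unchanged, whereas adding it a second time would not.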

Creating a Matcher

Once we have a Pattern object, we can create a Matcher object that can search a particular string, like this:

String sentence = "Smith, where Jones had had 'had', had had 'had had'.";
Matcher matchHad = had.matcher(sentence);

The first statement defines the string, sentence, that we want to search. To create the Matcher object, we call the matcher() method for the Pattern object with the string to be analyzed as the argument. This will return a Matcher object that can analyze the string that was passed to it. The parameter for the matcher() method is actually of type CharSequence. This is an interface that is implemented by both the String and StringBuffer classes so you can pass either type of reference to the method. The java.nio.CharBuffer class also implements CharSequence so you can pass the contents of a CharBuffer to the method too. This means that if you use a CharBuffer to hold character data you have read from a file, you can pass the data directly to the matcher() method to be searched.

An advantage of Java's implementation of regular expressions is that you can reuse a Pattern object to create Matcher objects to search for the pattern in a variety of strings. To use the same pattern to search another string, you just call the matcher() method for the Pattern object with the new string as the argument. You then have a new Matcher object that you can use to search the new string.

You can also change the string that a Matcher object is to search by calling its reset() method with a new string as the argument. For example:

matchHad.reset("Had I known, I would not have eaten the haddock.");

This will replace the previous string, sentence, in the Matcher object so it is now capable of searching the new string. Like the matcher() method in the Pattern class, the parameter type for the reset() method is CharSequence so you can pass a reference of type String, StringBuffer, or java.nio.CharBuffer to it.
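The two ways of applying one Pattern object to several strings can be sketched as follows. The class ReuseSketch and the methods countMatches() and countInEach() are hypothetical helpers written for this illustration; the loop that calls find() repeatedly is explained in the next section.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReuseSketch {
  // Counts the matches from the matcher's current position to the end of its input
  static int countMatches(Matcher m) {
    int count = 0;
    while (m.find()) {
      count++;
    }
    return count;
  }

  // Returns the number of matches in each of three strings using a single Pattern
  static int[] countInEach(String regex, String first, String second, String third) {
    Pattern pattern = Pattern.compile(regex);   // compiled once
    Matcher m1 = pattern.matcher(first);
    int c1 = countMatches(m1);
    Matcher m2 = pattern.matcher(second);       // same Pattern, new Matcher
    int c2 = countMatches(m2);
    m1.reset(third);                            // same Matcher, new string to search
    int c3 = countMatches(m1);
    return new int[] {c1, c2, c3};
  }

  public static void main(String[] args) {
    int[] counts = countInEach("had",
        "Smith, where Jones had had 'had', had had 'had had'.",
        "She had a hat.",
        "Had I known, I would not have eaten the haddock.");
    System.out.println(counts[0] + " " + counts[1] + " " + counts[2]);   // prints 7 1 1
  }
}
```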

Searching a String

Now we have a Matcher object, we can use it to search the string. Calling the find() method for the Matcher object will search the string for the next occurrence of the pattern. If it is found, the method stores information about where it was found in the Matcher object and returns true. If it is not found it returns false. When the pattern has been found, calling the start() method for the Matcher object returns the index position in the string where the first character in the pattern was found. Calling the end() method returns the index position following the last character in the pattern. Both index values are returned as type int. You could therefore search for the first occurrence of the pattern like this:

if(m.find())
  System.out.println("Pattern found. Start: "+m.start()+" End: "+m.end());
else
  System.out.println("Pattern not found.");

Note that you must not call start() or end() for the Matcher object before you have succeeded in finding the pattern. Until a pattern has been matched, the Matcher object is in an undefined state and calling either of these methods will result in an exception of type IllegalStateException being thrown.
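The following sketch demonstrates the point about calling start() too early. The class MatchStateSketch and the method startFailsBeforeFind() are illustrative names only; the sketch simply catches the IllegalStateException that the text describes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchStateSketch {
  // Returns true if calling start() before any successful find() throws IllegalStateException
  static boolean startFailsBeforeFind(String regex, String text) {
    Matcher m = Pattern.compile(regex).matcher(text);
    try {
      m.start();          // no match has been attempted yet
      return false;
    } catch (IllegalStateException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println(startFailsBeforeFind("had", "Jones had had 'had'"));   // prints true
  }
}
```

In ordinary code you would not catch this exception; you would simply make sure that find() has returned true before asking for the position of the match.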

You will usually want to find all occurrences of a pattern in a string. When you call the find() method, searching starts at an index position in the string called the append position and stops either when the pattern is found and the value true is returned, or when the end of the string is reached, in which case the return value is false. The append position is initially zero, corresponding to the beginning of the string, but it gets updated if the pattern is found. Each time the pattern is found, the new append position will be the index position of the character immediately following the last character in the text that matched the pattern. The next call to find() will start searching at this new append position. Thus you can easily find all occurrences of the pattern by searching in a loop like this:

while(m.find())
  System.out.println(" Start: "+m.start()+" End: "+m.end());

At the end of this loop the append position will be at the index position of the character following the last occurrence of the pattern in the string. If you want to reset the append position back to zero, you just call an overloaded version of reset() for the Matcher object that has no arguments:

m.reset();     //Reset this matcher

This resets the Matcher object to its original state before any search operations were carried out.

To make sure we understand the searching process, let's put it all together in an example.

Try It Out – Searching for a Substring

Here's a complete example to search a string for a pattern:

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.Arrays;

class TryRegex {
  public static void main(String args[]) {
    //  A regex and a string in which to search are specified
    String regEx = "had";
    String str = "Smith, where Jones had had 'had', had had 'had had'.";

    // The matches in the output will be marked (fixed-width font required)
    char[]  marker = new char[str.length()];
    Arrays.fill(marker,' ');          // So we can later replace spaces with marker characters

    // Obtain the required matcher
    Pattern pattern = Pattern.compile(regEx);
    Matcher m = pattern.matcher(str);

    // Find every match and mark it
    while( m.find() ){
      System.out.println("Pattern found at Start: "+m.start()+" End: "+m.end());
      Arrays.fill(marker,m.start(),m.end(),'^');
    }

    // Show the object string with matches marked under it
    System.out.println(str);
    System.out.println(new String(marker));
  }
}

This will produce the output:

Pattern found at Start: 19 End: 22
Pattern found at Start: 23 End: 26
Pattern found at Start: 28 End: 31
Pattern found at Start: 34 End: 37
Pattern found at Start: 38 End: 41
Pattern found at Start: 43 End: 46
Pattern found at Start: 47 End: 50
Smith, where Jones had had 'had', had had 'had had'.
                   ^^^ ^^^  ^^^   ^^^ ^^^  ^^^ ^^^  

How It Works

We first define a string, regEx, containing the regular expression, and a string, str, that we will search. We also create an array of type char[] that we use to indicate where the pattern is found in the string. We fill the elements of this array with spaces initially using the static fill() method from the Arrays class that we discussed earlier. Later we will replace some of these spaces with '^' to indicate where the pattern has been found.

Once we have compiled the regular expression regEx into a Pattern object, pattern, we create a Matcher object, m, from pattern that applies to the string str. We then call the find() method for m in the while loop condition. This loop will continue as long as the find() method returns true. On each iteration we output the index values returned by the start() and end() methods that reflect the index position where the first character of the pattern was found and the index position following the last character. We also insert the '^' character in the marker array at the index positions where the pattern was found – again using the fill() method.

When the loop ends we have found all occurrences of the pattern in the string so we output the string str, with the contents of the marker array immediately below it on the next line. As long as we are using a fixed width font for output to the command line, the '^' characters will mark the positions where the pattern appears in the string.

We will reuse this example as we delve into further options for regular expressions by plugging in different definitions for regEx and the string that is searched, str. The output will be more economical if you delete or comment out the statement in the while loop that outputs the start and end index positions.

Matching an Entire String

There are occasions when you want to try to match a pattern against an entire string, in other words when you want to establish that the complete string that you are searching is a match for the pattern. Suppose you read an input value into your program as a string. This might be from the keyboard or possibly through a dialog box managing the data entry. You might want to be sure that the input string is an integer for example. If input should be of a particular form, you can use a regular expression to determine whether it is correct or not.

The matches() method for a Matcher object tries to match the entire input string with the pattern and returns true only if there is a match. We can illustrate how this works with the following code fragment:

String input = null;    
// Read into input from some source...

Pattern yes = Pattern.compile("yes");
Matcher m = yes.matcher(input);

if(m.matches())                            // Check if input matches "yes"
  System.out.println("Input is yes.");
else
  System.out.println("Input is not yes.");

Of course, this illustration is trivial, but later we will see how to define more sophisticated patterns that can check for a range of possible input forms.
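The difference between matches() and find() is easy to see side by side. In this sketch the class MatchesVersusFind and its two methods are names invented for the illustration.

```java
import java.util.regex.Pattern;

class MatchesVersusFind {
  // True only if the whole of input matches regex
  static boolean wholeInputMatches(String regex, String input) {
    return Pattern.compile(regex).matcher(input).matches();
  }

  // True if regex occurs anywhere within input
  static boolean inputContains(String regex, String input) {
    return Pattern.compile(regex).matcher(input).find();
  }

  public static void main(String[] args) {
    System.out.println(wholeInputMatches("yes", "yes"));          // true
    System.out.println(wholeInputMatches("yes", "yes please"));   // false: extra text follows
    System.out.println(inputContains("yes", "yes please"));       // true: found as a substring
  }
}
```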

Defining Sets of Characters

A regular expression can be made up of ordinary characters, which are upper and lower case letters and digits, plus sequences of meta-characters that have a special meaning. The pattern in the previous example was just the word "had", but what if we wanted to search a string for occurrences of "hid" or "hod" as well as "had", or even any three letter word beginning with 'h' and ending with 'd'?

You can deal with any of these possibilities with regular expressions. One option is to specify the middle character as a wildcard by using a period here, which is one example of a meta-character. This meta-character matches any character except end-of-line, so the regular expression "h.d" represents any sequence of three characters that starts with 'h' and ends with 'd'. Try changing the definitions of regEx and str in the previous example to:

  String regEx = "h.d";
  String str = "Ted and Ned Hodge hid their hod and huddled in the hedge.";

If you recompile and run the example again, the last two lines of output will be:

Ted and Ned Hodge hid their hod and huddled in the hedge.
                  ^^^       ^^^     ^^^            ^^^ 

You can see that we didn't find "Hod" in Hodge because of the capital 'H' but we found all the other sequences beginning with 'h' and ending with 'd'.

Of course, the regular expression "h.d" would also have found "hzd" or "hNd" if they had been present, which is not what we want. We can limit the possibilities by replacing the period with just the collection of characters we are looking for between square brackets, like this:

  String regEx = "h[aio]d";

The [aio] sequence of meta-characters defines what is called a simple class of characters consisting in this case of 'a', 'i', or 'o'. Here the term 'class' is used in the sense of a set of characters, not a class that defines a type. If you try this version of the regular expression in the previous example, the last two lines of output will be:

Ted and Ned Hodge hid their hod and huddled in the hedge.
                  ^^^       ^^^

This now finds all sequences that begin with 'h' and end with 'd' and have a middle letter as 'a' or 'i' or 'o'.

There are a variety of ways in which you can define character classes in a regular expression. Here are some examples of the more useful forms:

[aeiou]
This is a simple class that any of the characters between the square brackets will match: in this example, any vowel. We used this form in the code fragment above to search for variations on "had".

[^aeiou]
This represents any character except those appearing to the right of the ^ character between the square brackets. Thus here we have specified any character that is not a vowel. Note this is any character, not any letter, so the expression "h[^aeiou]d" will look for "h!d" or "h9d" as well as "hxd" or "hWd". Of course, it will reject "had" or "hid" or any other form with a vowel as the middle letter.

[a-e]
This defines an inclusive range: any of the letters 'a' to 'e' in this case. You can also specify multiple ranges, for example:

[a-cs-zA-E]
This corresponds to any of the characters from 'a' to 'c', from 's' to 'z', or from 'A' to 'E'.

If you want to specify that a position must contain a digit you could use [0-9]. To specify that a position can be a letter or a digit you could express it as [a-zA-Z0-9].

Any of these can be used in combination with ordinary characters to form a regular expression. For example, suppose we wanted to search some text for any sequence beginning with 'b', 'c', or 'd', with 'a' as the middle letter, and ending with 'd' or 't'. The regular expression to do this could be defined as:

String regEx = "[b-d]a[dt]";

This will search for any occurrence of "bad", "cad", "dad", "bat", "cat", or "dat".
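A quick way to check an expression like this is to collect every substring it matches. The sketch below does that for "[b-d]a[dt]"; the class CharClassSketch, the method findAll(), and the sample sentence are all made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharClassSketch {
  // Returns every substring of text that matches regex, in order of appearance
  static List<String> findAll(String regex, String text) {
    List<String> found = new ArrayList<>();
    Matcher m = Pattern.compile(regex).matcher(text);
    while (m.find()) {
      found.add(text.substring(m.start(), m.end()));
    }
    return found;
  }

  public static void main(String[] args) {
    // "sat" and "hat" are rejected because 's' and 'h' are outside the range b to d
    System.out.println(findAll("[b-d]a[dt]", "The bad cat sat on dad's hat."));   // prints [bad, cat, dad]
  }
}
```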

Logical Operators in Regular Expressions

You can use the && operator to combine classes that define sets of characters. This is particularly useful when it is combined with the negation operator, ^, that appears in the second line of the table above. For instance, if you want to specify that any lower case consonant is acceptable, you could write it as:

[b-df-hj-np-tv-z]

However, it can much more conveniently be expressed as:

[a-z&&[^aeiou]]
This produces the intersection (in other words the characters common to both sets) of the set of characters 'a' through 'z' with the set that is not a lower case vowel. To put it another way, the lower case vowels are subtracted from the set 'a' through 'z' so we are left with just the consonants.
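You can convince yourself that the intersection really is the set of lower case consonants by testing every ASCII character against both forms. In this sketch the explicit list of ranges, [b-df-hj-np-tv-z], is one way of spelling out the consonants by hand; the class IntersectionSketch and the method sameAsciiMatches() are names invented for the illustration.

```java
import java.util.regex.Pattern;

class IntersectionSketch {
  // True if the two single-character expressions accept exactly the same ASCII characters
  static boolean sameAsciiMatches(String classA, String classB) {
    Pattern a = Pattern.compile(classA);
    Pattern b = Pattern.compile(classB);
    for (char c = 0; c < 128; c++) {
      String s = String.valueOf(c);
      if (a.matcher(s).matches() != b.matcher(s).matches()) {
        return false;   // one form accepts this character and the other does not
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(sameAsciiMatches("[b-df-hj-np-tv-z]", "[a-z&&[^aeiou]]"));   // prints true
    System.out.println(sameAsciiMatches("[a-z]", "[a-z&&[^aeiou]]"));               // prints false
  }
}
```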

The | operator is a logical OR that you use to specify alternatives. A regular expression to find "hid", "had", or "hod" could be written as "hid|had|hod". You can try this in the previous example by changing the definition of regEx to:

    String regEx = "hid|had|hod";

Note that the | operation means either the whole expression to the left of the operator or the whole expression to the right, not just the characters on either side as alternatives.

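You could also use the | operator to define an expression to find sequences beginning with an upper case or lower case 'h', followed by a vowel, and ending in 'd'. The alternatives have to be grouped with parentheses rather than square brackets, because between square brackets | is just an ordinary character that becomes a member of the set. With square brackets you would simply write [hH]. Using the | operator, the expression looks like this:

String regEx = "(h|H)[aeiou]d";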

With this as the regular expression in the example, the "Hod" in Hodge will be found as well as the other variations.

Predefined Character Sets

There are also a number of predefined character classes that provide you with a shorthand notation for commonly used sets of characters. Here are some that are particularly useful:

.
This represents any character except a line terminator (unless the DOTALL flag is set), as we have already seen.

\d
This represents any digit and is therefore shorthand for [0-9].

\D
This represents any character that is not a digit. It is therefore equivalent to [^0-9].

\s
This represents any whitespace character.

\S
This represents any non-whitespace character and is therefore equivalent to [^\s].

\w
This represents a word character, which corresponds to an upper or lower case letter or a digit or an underscore. It is therefore equivalent to [a-zA-Z_0-9].

\W
This represents any character that is not a word character so it is equivalent to [^\w].

Note that when you are including any of the sequences that start with a backslash in a regular expression, you need to keep in mind that Java treats a backslash as the beginning of an escape sequence. You must therefore specify the backslash in the regular expression as \\. For instance, to find a sequence of three digits, the regular expression would be "\\d\\d\\d". This is peculiar to Java because of the significance of the backslash in Java strings, so it doesn't apply to other environments that support regular expressions, such as Perl.

Obviously you may well want to include a period, or any of the other meta-characters, as part of the character sequence you are looking for. To do this you can use an escape sequence starting with a backslash in the expression to define such characters. Since Java strings interpret a backslash as the start of a Java escape sequence, the backslash itself has to be represented as \\, the same as when using the predefined characters sets that begin with a backslash. Thus the regular expression to find the sequence "had." would be "had\\.".
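The effect of the doubled backslash is easiest to see by counting matches with and without the escape. This is a small sketch; the class EscapeSketch, the method countMatches(), and the sample strings are invented for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeSketch {
  // Returns the number of times regex is found in text
  static int countMatches(String regex, String text) {
    Matcher m = Pattern.compile(regex).matcher(text);
    int count = 0;
    while (m.find()) {
      count++;
    }
    return count;
  }

  public static void main(String[] args) {
    String text = "He had. She had it.";
    System.out.println(countMatches("had.", text));     // 2: the period matches any character
    System.out.println(countMatches("had\\.", text));   // 1: only "had" followed by a real period
    System.out.println(countMatches("\\d\\d\\d", "Room 101, floor 3"));   // 1: the three digits 101
  }
}
```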

Our earlier search with the expression "h.d" found embedded sequences such as "hud" in the word huddled. We could use the \s set that corresponds to any whitespace character to prevent this by defining regEx like this:

String regEx = "\\sh.d\\s";

This searches for a five-character sequence that starts and ends with any whitespace character. The output from the example will now be:

Ted and Ned Hodge hid their hod and huddled in the hedge.
                 ^^^^^     ^^^^^                         

You can see that the marker array shows the five-character sequences that were found. The embedded sequences are now no longer included, as they don't begin and end with a whitespace character.

To take another example, suppose we want to find hedge or Hodge as words in the sentence, bearing in mind that there's a period at the end. We could do this by defining the regular expression as:

    String regEx = "\\s[hH][eo]dge[\\s.]";

The first character is defined as any whitespace by \\s. The next character is defined as either 'h' or 'H' by [hH]. This can be followed by either 'e' or 'o' specified by [eo]. This is followed by plain text dge with either a whitespace character or a period at the end, specified by [\\s.]. Note that no | is needed between square brackets, where it would simply be one more character in the set, and that a period between square brackets stands for itself so it needs no escape. This doesn't cater for all possibilities. Sequences at the beginning of the string will not be found, for instance, nor will sequences followed by a comma. We'll see how to deal with these next.

Matching Boundaries

So far we have tried to find the occurrence of a pattern anywhere in a string. In many situations you will want to be more specific. You may want to look for a pattern that appears at the beginning of a line in a string but not anywhere else, or maybe just at the end of any line. As we saw in the previous example you may want to look for a word that is not embedded: you want to find the word "cat" but not the "cat" in "cattle" or in "Popocatepetl" for instance. The previous example worked for the string we were searching but would not produce the right result if the word we were looking for was followed by a comma or appeared at the end of the text. However, we have other options. There are a number of special sequences you can use in a regular expression when you want to match a particular boundary. For instance, these are especially useful:

^
Specifies the beginning of a line. For example, to find the word Java at the beginning of any line you could use the expression "^Java". Unless the MULTILINE flag is set, the only line beginning that is recognized is the start of the entire input.

$
Specifies the end of a line. For example, to find the word Java at the end of any line you could use the expression "Java$". Of course, if you were expecting a period at the end of a line the expression would be "Java\\.$". Unless the MULTILINE flag is set, the only line end that is recognized is the end of the entire input.

\b
Specifies a word boundary. To find three-letter words beginning with 'h' and ending with 'd' we could use the expression "\\bh.d\\b".

\B
A non-word boundary, the complement of \b above.

\A
Specifies the beginning of the string being searched. To find the word The at the very beginning of the string being searched you could use the expression "\\AThe\\b". The \\b at the end of the regular expression is necessary to avoid finding Then or There at the beginning of the input.

\z
Specifies the end of the string being searched. To find the word hedge followed by a period at the end of a string you could use the expression "\\bhedge\\.\\z".

\Z
The end of input except for the final terminator. A final terminator will be a newline character ('\n') if Pattern.UNIX_LINES is set. Otherwise it can also be a carriage return ('\r'), a carriage return followed by a newline, a next-line character ('\u0085'), a line separator ('\u2028'), or a paragraph separator ('\u2029').
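The boundary sequences can be tried out on the sentence we used earlier. In the sketch below the class BoundarySketch and the methods findAll() and found() are hypothetical helpers written for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundarySketch {
  // Returns every substring of text that matches regex
  static List<String> findAll(String regex, String text) {
    List<String> result = new ArrayList<>();
    Matcher m = Pattern.compile(regex).matcher(text);
    while (m.find()) {
      result.add(text.substring(m.start(), m.end()));
    }
    return result;
  }

  // True if regex occurs at least once in text
  static boolean found(String regex, String text) {
    return Pattern.compile(regex).matcher(text).find();
  }

  public static void main(String[] args) {
    String str = "Ted and Ned Hodge hid their hod and huddled in the hedge.";
    // "hud" and "hed" are skipped because a word character follows them
    System.out.println(findAll("\\bh.d\\b", str));               // prints [hid, hod]
    System.out.println(found("\\AThe\\b", "The hedge"));         // true
    System.out.println(found("\\AThe\\b", "Then the hedge"));    // false: no boundary after "The"
    System.out.println(found("\\bhedge\\.\\z", str));            // true
  }
}
```

Unlike the earlier "\\sh.d\\s" expression, the word boundaries take up no characters, so only the three letters of each word are reported.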

While we have moved quite a way from the simple search for a fixed substring offered by the String class methods, we still can't search for sequences that may vary in length. If you wanted to find all the numerical values in a string, which might be sequences such as 1234 or 23.45 or 999.998 for instance, we don't yet have the ability to do that. Let's fix that now by taking a look at quantifiers in a regular expression, and what they can do for us.

Using Quantifiers

A quantifier following a subsequence of a pattern determines the possibilities for how that subsequence of a pattern can repeat. Let's take an example. Suppose we want to find any numerical values in a string. If we take the simplest case we can say an integer is an arbitrary sequence of one or more digits. The quantifier for one or more is the meta-character +. We have also seen that we can use \d as shorthand for any digit (remembering of course that it becomes \\d in a Java String literal), so we could express any sequence of digits as the regular expression:

"\\d+"

Of course, a number may also include a decimal point and may be optionally followed by further digits. To indicate something can occur just once or not at all, as is the case with a decimal point, we can use the quantifier ?. We can write the pattern for a sequence of digits followed by a decimal point as:

"\\d+\\.?"

To add the possibility of further digits we can append \\d+ to what we have so far to produce the expression:

"\\d+\\.?\\d+"

This is a bit untidy. We can rewrite this as an integral part followed by an optional fractional part by putting parentheses around the bit for the fractional part and adding the ? operator:

"\\d+(\\.\\d+)?"

However, this isn't quite right. We can have 2. as a valid numerical value for instance, so we want to specify zero or more appearances of digits in the fractional part. The * quantifier expresses that, so maybe we should use:

"\\d+(\\.\\d*)?"

We are still missing something though. What about the value .25 or the value -3? The optional sign in front of a number is easy so let's deal with that first. To express the possibility that - or + can appear we can use [+-], and since this either appears or it doesn't, we can extend it to [+-]?. So to add the possibility of a sign we can write the expression as:

"[+-]?\\d+(\\.\\d*)?"

We have to be careful how we allow for numbers beginning with a decimal point. We can't allow a sign followed by a decimal point or just a decimal point by itself to be interpreted as a number so we can't say a number starts with zero or more digits or that the leading digits are optional. We could define a separate expression for numbers without leading digits like this:

"[+-]?\\.\\d+"

Here there is an optional sign followed by a decimal point and at least one digit. With the other expression there is also an optional sign so we can combine these into a single expression to recognize either form, like this:

"[+-]?(\\d+(\\.\\d*)?|\\.\\d+)"
This regular expression identifies substrings with an optional plus or minus sign followed by either a substring defined by "\\d+(\\.\\d*)?" or a substring defined by "\\.\\d+". You might be tempted to use square brackets instead of parentheses here, but this would be quite wrong as square brackets define a set of characters, so any single character from the set is a match.

That was probably a bit more work than you anticipated but it's often the case that things that look simple at first sight can turn out to be a little tricky. Let's try that out in an example.
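Before searching a sentence, it can help to check the expression against individual values, including the awkward cases discussed above. The sketch below assumes the combined expression is "[+-]?(\\d+(\\.\\d*)?|\\.\\d+)", which is one reading of the steps just described; the class NumberPatternSketch and the method isNumber() are names invented here.

```java
import java.util.regex.Pattern;

class NumberPatternSketch {
  // Assumed form of the combined expression: optional sign, then digits with an
  // optional fractional part, or a decimal point followed by at least one digit
  static final Pattern NUMBER = Pattern.compile("[+-]?(\\d+(\\.\\d*)?|\\.\\d+)");

  // True if the whole of s is a number according to the expression above
  static boolean isNumber(String s) {
    return NUMBER.matcher(s).matches();
  }

  public static void main(String[] args) {
    String[] samples = {"256", "-2.5", "2.", "-.243", "+.5", ".", "-.", "+", "1.2.3"};
    for (String s : samples) {
      System.out.println("\"" + s + "\" -> " + isNumber(s));
    }
  }
}
```

The first five samples are accepted. The last four are rejected: a lone decimal point or sign has no digits, and "1.2.3" has a second decimal point that the expression cannot account for.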

Try It Out – Finding Integers

This is similar to the code we have used in previous examples except that here we will just list each substring that is found to correspond to the pattern:

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class FindingIntegers {
  public static void main(String args[]) {
    String regEx = "[+-]?(\\d+(\\.\\d*)?|\\.\\d+)";
    String str = "256 is the square of 16 and -2.5 squared is 6.25 " +
                                            "and -.243 is less than 0.1234.";
    Pattern pattern = Pattern.compile(regEx);
    Matcher m = pattern.matcher(str);
    while(m.find())
      System.out.println(m.group());              // Output the substring matched
  }
}

This will produce the output:

256
16
-2.5
6.25
-.243
0.1234
How It Works

Well, we found all the numbers in the string so our regular expression works well, doesn't it? You can't do that with the methods in the String class. The only new code item here is the method, group(), that we call in the while loop for the Matcher object, m. This method returns a reference to a String object containing the subsequence corresponding to the last match of the entire pattern. Calling the group() method for the Matcher object, m, is equivalent to the expression str.substring(m.start(), m.end()).
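The equivalence between group() and substring() can be checked directly. In this sketch the class GroupSketch and the method groupEqualsSubstring() are illustrative names; the method returns true only if at least one match is found and every match satisfies the equivalence.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupSketch {
  // True if regex matches at least once in text and, for every match,
  // group() equals the substring between start() and end()
  static boolean groupEqualsSubstring(String regex, String text) {
    Matcher m = Pattern.compile(regex).matcher(text);
    boolean anyMatch = false;
    while (m.find()) {
      anyMatch = true;
      if (!m.group().equals(text.substring(m.start(), m.end()))) {
        return false;
      }
    }
    return anyMatch;
  }

  public static void main(String[] args) {
    System.out.println(groupEqualsSubstring("\\d+", "256 is the square of 16"));   // prints true
  }
}
```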

Search and Replace Operations

You can implement a search and replace operation very easily using regular expressions. Whenever a call to the find() method for a Matcher object returns true, you can call the appendReplacement() method to replace the subsequence that was matched. You create a revised version of the original string in a new StringBuffer object. There are two arguments to the appendReplacement() method. The first is a reference to the StringBuffer object that is to contain the new string, and the second is the replacement string for the matched text. We can see how this works by considering a specific example.

Suppose we define a string to be searched as:

String joke = "My dog hasn't got any nose.\n"
             +"How does your dog smell then?\n"
             +"My dog smells horrible.\n";

We now want to replace each occurrence of "dog" in the string by "goat". We first need a regular expression to find "dog":

String regEx = "dog";

We can compile this into a pattern and create a Matcher object for the string joke:

Pattern doggone = Pattern.compile(regEx);
Matcher m = doggone.matcher(joke);

We are going to assemble a new version of joke in a StringBuffer object that we can create like this:

StringBuffer newJoke = new StringBuffer();

This is an empty StringBuffer object ready to receive the revised text. We can now search for and replace instances of "dog" in joke by calling the find() method for m, and calling appendReplacement() each time it returns true:

while(m.find())
  m.appendReplacement(newJoke, "goat");

Each call of appendReplacement() copies characters from joke to newJoke, starting at the position where the preceding find() operation began its search (the start of joke for the first match, and the end of the previous match after that) and ending at the character preceding the first character matched, in other words at m.start()-1. The method then appends the string specified by the second argument to newJoke. This process is illustrated below.

[Diagram: the three find() and appendReplacement() steps that build up newJoke]

The find() method will return true three times, once for each occurrence of "dog" in joke. When the three steps shown in the diagram have been completed, the find() method returns false on the next loop iteration, terminating the loop. This leaves newJoke in the state shown in the last box above. All we now need to complete newJoke is a way to copy the text from joke that comes after the last subsequence that was found. The appendTail() method for the Matcher object does that:

m.appendTail(newJoke);

This will copy the text starting with the m.end() index position from the last successful match through to the end of the string. Thus this statement copies the segment " smells horrible.\n" from joke to newJoke. We can put all that together and run it.

Try It Out – Search and Replace

Here's the code we have just discussed assembled into a complete program:

import java.util.regex.Pattern;
import java.util.regex.Matcher;

class SearchAndReplace {
  public static void main(String args[]) {
    String joke = "My dog hasn't got any nose.\n"
                 +"How does your dog smell then?\n"
                 +"My dog smells horrible.\n";
    String regEx = "dog";

    Pattern doggone = Pattern.compile(regEx);
    Matcher m = doggone.matcher(joke);

    StringBuffer newJoke = new StringBuffer();
    while(m.find())
      m.appendReplacement(newJoke, "goat");
    m.appendTail(newJoke);
    System.out.println(newJoke);
  }
}

When you compile and execute this you should get the output:

My goat hasn't got any nose.
How does your goat smell then?
My goat smells horrible.

How It Works

Each time the find() method returns true in the while loop condition, we call the appendReplacement() method for the Matcher object, m. This copies characters from joke to newJoke, starting with the index position where the find() method started searching, and ending at the character preceding the first character in the match, which will be at m.start()-1. The method then appends the replacement string, "goat", to the contents of newJoke. Once the loop finishes, the appendTail() method copies characters from joke to newJoke, starting with the character following the last match at m.end(), through to the end of joke. Thus we end up with a new string similar to the original, but which has each instance of "dog" replaced by "goat".
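When every match gets the same replacement, the Matcher class also offers the replaceAll() method, which performs the whole find, appendReplacement(), and appendTail() sequence in one call. The sketch below compares the two approaches; the class and method names are invented for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceAllSketch {
  // The explicit loop, as in the SearchAndReplace example.
  static String replaceWithLoop(String text, String regEx, String replacement) {
    Matcher m = Pattern.compile(regEx).matcher(text);
    StringBuffer result = new StringBuffer();
    while(m.find())
      m.appendReplacement(result, replacement);
    m.appendTail(result);
    return result.toString();
  }

  // The same operation using Matcher.replaceAll().
  static String replaceWithReplaceAll(String text, String regEx, String replacement) {
    return Pattern.compile(regEx).matcher(text).replaceAll(replacement);
  }

  public static void main(String[] args) {
    String joke = "My dog hasn't got any nose.\n"
                 +"How does your dog smell then?\n"
                 +"My dog smells horrible.\n";
    String viaLoop = replaceWithLoop(joke, "dog", "goat");
    String viaReplaceAll = replaceWithReplaceAll(joke, "dog", "goat");
    System.out.print(viaReplaceAll);
    System.out.println("Same result: " + viaLoop.equals(viaReplaceAll));
  }
}
```

The explicit loop is still worth knowing, because it lets you decide on a different replacement for each individual match.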

The search and replace capability is also handy for simple text clean-up tasks. For example, if you want to make sure that any sequence of one or more whitespace characters is replaced by a single space, you can define the regular expression as "\\s+" and the replacement string as a single space, " ". To eliminate all whitespace at the beginning of each line, you can use the expression "^\\s+", compiled with the Pattern.MULTILINE flag so that ^ matches at the start of every line rather than only at the start of the input, and define the replacement string as empty, "".
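Here is a minimal sketch of both clean-up operations. The class name WhitespaceCleanup and its two methods are made up for this illustration, and the Pattern.MULTILINE flag is used on the assumption that we want ^ to match at the start of every line.

```java
import java.util.regex.Pattern;

class WhitespaceCleanup {
  // Replace each run of one or more whitespace characters by a single space.
  static String collapseWhitespace(String text) {
    return Pattern.compile("\\s+").matcher(text).replaceAll(" ");
  }

  // Remove whitespace at the start of every line.
  // MULTILINE makes ^ match after each line terminator, not just at index 0.
  static String stripLeadingWhitespace(String text) {
    return Pattern.compile("^\\s+", Pattern.MULTILINE).matcher(text).replaceAll("");
  }

  public static void main(String[] args) {
    System.out.println(collapseWhitespace("too    many \t spaces"));
    System.out.println(stripLeadingWhitespace("  indented\n\talso indented\nnot indented"));
  }
}
```

Bear in mind that \s matches line terminators too, so "^\\s+" will also swallow any completely blank lines. If you want to keep them, a character class such as [ \\t] in place of \\s is one alternative.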

Using Capturing Groups

Earlier we used the group() method for a Matcher object to retrieve the subsequence matched by the entire pattern defined by the regular expression. The entire pattern represents what is called a capturing group because the Matcher object captures the subsequence corresponding to the pattern match. Regular expressions can also define other capturing groups that correspond to parts of the pattern. Each pair of parentheses in a regular expression defines a separate capturing group in addition to the group that the whole expression defines. In the earlier example, we defined the regular expression by the statement:

    String regEx = "[+|-]?(\\d+(\\.\\d*)?)|(\\.\\d+)";

This defines three capturing groups other than the whole expression: one for the subexpression (\\d+(\\.\\d*)?), one for the subexpression (\\.\\d*), and one for the subexpression (\\.\\d+). The Matcher object stores the subsequence that matches the pattern defined by each capturing group, and what's more, you can retrieve them.

To retrieve the text matching a particular capturing group, you need a way to identify the capturing group that you are interested in. To this end, capturing groups are numbered. The capturing group for the whole regular expression is always number 0. The other groups are numbered by counting their opening parentheses from the left in the regular expression. Thus the first opening parenthesis from the left corresponds to capturing group 1, the second opening parenthesis corresponds to capturing group 2, and so on for as many opening parentheses as there are in the whole expression. The diagram below illustrates how the groups are numbered in an arbitrary regular expression.

[Diagram: capturing group numbering in an arbitrary regular expression]

As you see, it is easy to number the capturing groups as long as you can count left parentheses. Group 1 is the same as Group 0 because the whole regular expression is parenthesized. The other capturing groups in sequence are defined by (B), (C(D)), (D), and (E).

To retrieve the text matching a particular capturing group after the find() method returns true, you call the group() method for the Matcher object with the group number as the argument. The groupCount() method for the Matcher object returns a value of type int that is the number of capturing groups within the pattern – that is, excluding group 0, which corresponds to the whole pattern. You therefore have all you need to access the text corresponding to any or all of the capturing groups in a regular expression.
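To see the numbering at work, here is a small sketch that collects the text of every group after one successful find(). The pattern "(a(b)(c(d))(e))" is an assumption chosen to mirror the nesting described above, with single letters standing in for the subexpressions; the class and method names are likewise invented.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumbering {
  // Returns the text of groups 0 to groupCount() for the first match,
  // or an empty array if the pattern is not found.
  static String[] groupsOf(String regEx, String input) {
    Matcher m = Pattern.compile(regEx).matcher(input);
    if(!m.find())
      return new String[0];
    String[] groups = new String[m.groupCount() + 1];   // +1 for group 0
    for(int i = 0; i <= m.groupCount(); i++)
      groups[i] = m.group(i);
    return groups;
  }

  public static void main(String[] args) {
    String[] groups = groupsOf("(a(b)(c(d))(e))", "xabcdex");
    for(int i = 0; i < groups.length; i++)
      System.out.println("Group " + i + ": " + groups[i]);
  }
}
```

Groups 0 and 1 both hold "abcde" here because the whole expression is parenthesized, and group 3 holds "cd" because it encloses group 4.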

Try It Out – Capturing Groups

Let's modify our earlier example to output the text matching each group:

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class TryCapturingGroups {
  public static void main(String args[]) {
    String regEx = "[+|-]?(\\d+(\\.\\d*)?)|(\\.\\d+)";
    String str = "256 is the square of 16 and -2.5 squared is 6.25 " +
                                            "and -.243 is less than 0.1234.";
    Pattern pattern = Pattern.compile(regEx);
    Matcher m = pattern.matcher(str);
    while(m.find())
      for(int i = 0; i<=m.groupCount() ; i++)
        System.out.println("Group " + i + ": " + m.group(i)); // Group i substring
  }
}

This produces the output:

Group 0: 256
Group 1: 256
Group 2: null
Group 3: null
Group 0: 16
Group 1: 16
Group 2: null
Group 3: null
Group 0: -2.5
Group 1: 2.5
Group 2: .5
Group 3: null
Group 0: 6.25
Group 1: 6.25
Group 2: .25
Group 3: null
Group 0: .243
Group 1: null
Group 2: null
Group 3: .243
Group 0: 0.1234
Group 1: 0.1234
Group 2: .1234
Group 3: null

How It Works

The regular expression here defines four capturing groups:

  • Group 0: The whole expression.

  • Group 1: The subexpression "(\\d+(\\.\\d*)?)"

  • Group 2: The subexpression "(\\.\\d*)"

  • Group 3: The subexpression "(\\.\\d+)"

After each successful call of the find() method for the Matcher object, m, we output the text captured by each group in turn by passing the index value for the group to the group() method. Note that because we want to output group 0 as well as the other groups, we start the loop index from 0 and allow it to equal the value returned by groupCount() so as to index over all the groups.

You can see from the output that group 1 corresponds to numbers beginning with a digit and group 3 corresponds to numbers starting with a decimal point, so either one or the other of these is always null. Group 2 corresponds to the sub-pattern within group 1 that matches the fractional part of a number that begins with a digit, so the text for this can only be non-null when the text for group 1 is non-null and the number has a decimal point.
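Because a group that took no part in a match yields null, you can test the groups to find out which alternative matched. The following sketch classifies each number on that basis; the class name NumberKinds, the method names, and the descriptive labels are all made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberKinds {
  static final String REG_EX = "[+|-]?(\\d+(\\.\\d*)?)|(\\.\\d+)";

  // Describes the current match by checking which groups captured text.
  static String kindOf(Matcher m) {
    if(m.group(3) != null)
      return "starts with a decimal point";     // Second alternative matched
    if(m.group(2) != null)
      return "digits with a decimal point";     // Group 1 plus a fractional part
    return "integer";                           // Group 1 alone
  }

  // Returns one "number: kind" entry for each match in str.
  static List<String> classify(String str) {
    Matcher m = Pattern.compile(REG_EX).matcher(str);
    List<String> results = new ArrayList<String>();
    while(m.find())
      results.add(m.group() + ": " + kindOf(m));
    return results;
  }

  public static void main(String[] args) {
    for(String line : classify("256 is the square of 16 and -2.5 squared is 6.25 and -.243"))
      System.out.println(line);
  }
}
```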

Juggling Captured Text

Since we can get access to the text corresponding to each capturing group in a regular expression, we can move the pieces around. The appendReplacement() method has special provision for recognizing references to capturing groups in the replacement text string. If $n, where n is an integer, appears in the replacement string, it will be interpreted as the text corresponding to group n. You can therefore replace the text matched by a complete pattern with any sequence you choose of the subsequences corresponding to the capturing groups in the pattern. That's hard to describe in words, so let's demonstrate it with an example.
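As a warm-up, here is a minimal sketch of $n references with a much simpler pattern than the one we are about to build. It assumes the input consists of "first last" name pairs separated by a single space; the class name SwapNames and the method lastNameFirst() are invented for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SwapNames {
  // Rewrites each "first last" pair as "last, first".
  static String lastNameFirst(String text) {
    Matcher m = Pattern.compile("(\\w+) (\\w+)").matcher(text);
    StringBuffer result = new StringBuffer();
    while(m.find())
      m.appendReplacement(result, "$2, $1");   // Group 2, then group 1
    m.appendTail(result);
    return result.toString();
  }

  public static void main(String[] args) {
    System.out.println(lastNameFirst("Ada Lovelace\nAlan Turing"));
  }
}
```

Group 1 captures the first word and group 2 the second, so "$2, $1" emits them in the opposite order with a comma in between.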

Try It Out – Rearranging Captured Group Text

I'm sure you remember that the Math.pow() method requires two arguments; the second argument is the power to which the first argument must be raised. Thus to calculate 16³ you can write:

double result = Math.pow(16.0, 3.0);

Let's suppose we have written a Java program where we have mistakenly switched the two arguments, so in trying to compute 16³ we have written:

double result = Math.pow(3.0, 16.0);

Of course, this computes 3¹⁶, which is not quite the same thing. Let's suppose further that this sort of error is strewn throughout the source code and in every case we have the arguments the wrong way round. We would need a month of Sundays to go through manually and switch the argument values, so let's see if regular expressions can rescue the situation.

What we need to do is find each occurrence of Math.pow() and switch the arguments around. The intention here is to understand how we can switch things around so we will keep it simple and assume that the argument values to Math.pow() are always a numerical value or a variable name.

The key to the whole problem is to devise a regular expression with capturing groups for the bits we want to switch – the two arguments. Be warned – this is going to get a little messy; not difficult though – just messy.

We can define the first part of the regular expression that will find the sequence "Math.pow(". At any point where we want to allow an arbitrary number of whitespace characters we can use the sequence \\s*. You will recall that \\s in a Java string specifies the predefined character class \s, which is whitespace. The * quantifier specifies zero or more of them. If we allow for whitespace between Math.pow and the opening parenthesis for the arguments, and some more whitespace after the opening parenthesis, the regular expression will be:

"(Math.pow)\\s*\\(\\s*"

We have to specify the opening parenthesis by "\\(" since an opening parenthesis is a meta-character so we have to escape it.

This is followed by the first argument, which we said could be a number or a variable name. We created the expression to identify a number earlier. It was:

"[+|-]?(\\d+(\\.\\d*)?)|(\\.\\d+)"

To keep things simple we will assume that a variable name is just any sequence of letters, digits, or underscores that begins with a letter or an underscore, so we won't get involved with qualified names. We can match a variable name with the expression:

"[a-zA-Z_]\\w*"

We can therefore match either a variable name or a number with the pattern:

"(([a-zA-Z_]\\w*)|([+|-]?(\\d+(\\.\\d*)?)|(\\.\\d+)))"

This just ORs the two possibilities together and parenthesizes the whole thing so it will be a capturing group.

The first argument is followed by a comma that may have zero or more whitespace characters on either side. We can match that with the pattern:

"\\s*,\\s*"

The pattern to match the second argument will be exactly the same as the first:

"(([a-zA-Z_]\\w*)|([+|-]?(\\d+(\\.\\d*)?)|(\\.\\d+)))"

Finally this must be followed by a closing parenthesis that may or may not be preceded by whitespace:

"\\s*\\)"

We can put all this together to define the entire regular expression as a String variable:

String regEx = "(Math.pow)"                                     // Math.pow
    + "\\s*\\(\\s*"                                             // Opening (
    + "(([a-zA-Z_]\\w*)|([+|-]?(\\d+(\\.\\d*)?)|(\\.\\d+)))"    // First argument
    + "\\s*,\\s*"                                               // Comma
    + "(([a-zA-Z_]\\w*)|([+|-]?(\\d+(\\.\\d*)?)|(\\.\\d+)))"    // Second argument
    + "\\s*\\)";                                                // Closing )

Here we assemble the string literal for the regular expression by concatenating six separate string literals. Each of these corresponds to an easily identified part of the method call. If you count the left parentheses, excluding the escaped parenthesis of course, you can also see that capturing group 1 corresponds with the method name, group 2 is the first method argument, and group 8 is the second method argument.
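If counting parentheses by eye feels error-prone, you can let the Matcher object confirm the numbering. The sketch below applies the expression to a single method call and pulls out groups 1, 2, and 8. The class name PowGroups, its methods, and the ARGUMENT constant are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PowGroups {
  // The argument subexpression appears twice, so it is defined once here.
  static final String ARGUMENT =
      "(([a-zA-Z_]\\w*)|([+|-]?(\\d+(\\.\\d*)?)|(\\.\\d+)))";

  static final String REG_EX = "(Math.pow)"       // Group 1: method name
      + "\\s*\\(\\s*"                             // Opening (
      + ARGUMENT                                  // Groups 2 to 7
      + "\\s*,\\s*"                               // Comma
      + ARGUMENT                                  // Groups 8 to 13
      + "\\s*\\)";                                // Closing )

  // Returns the method name and both arguments of the first call found, or null.
  static String[] nameAndArguments(String code) {
    Matcher m = Pattern.compile(REG_EX).matcher(code);
    if(!m.find())
      return null;
    return new String[] { m.group(1), m.group(2), m.group(8) };
  }

  // Number of capturing groups in the pattern, excluding group 0.
  static int groupCount() {
    return Pattern.compile(REG_EX).matcher("").groupCount();
  }

  public static void main(String[] args) {
    String[] parts = nameAndArguments("double result = Math.pow( 3.0, 16.0);");
    System.out.println("Group 1: " + parts[0]);
    System.out.println("Group 2: " + parts[1]);
    System.out.println("Group 8: " + parts[2]);
    System.out.println("Capturing groups: " + groupCount());
  }
}
```

Each argument subexpression contributes six groups, which is why the second argument starts at group 8 rather than group 3.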

We can put this in the example:

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class TryCapturingGroups {
  public static void main(String args[]) {
    String regEx = "(Math.pow)"                                 // Math.pow
    + "\\s*\\(\\s*"                                             // Opening (
    + "(([a-zA-Z_]\\w*)|([+|-]?(\\d+(\\.\\d*)?)|(\\.\\d+)))"    // First argument
    + "\\s*,\\s*"                                               // Comma
    + "(([a-zA-Z_]\\w*)|([+|-]?(\\d+(\\.\\d*)?)|(\\.\\d+)))"    // Second argument
    + "\\s*\\)";                                                // Closing )

    String oldCode = 
       "double result = Math.pow( 3.0, 16.0);\n"
     + "double resultSquared = Math.pow(2 ,result );\n"
     + "double hypotenuse = Math.sqrt(Math.pow(2.0, 30.0)+Math.pow(2 , 40.0));\n";
    Pattern pattern = Pattern.compile(regEx);
    Matcher m = pattern.matcher(oldCode);

    StringBuffer newCode = new StringBuffer();
    while(m.find())
      m.appendReplacement(newCode, "$1\\($8,$2\\)");
    m.appendTail(newCode);

    System.out.println("Original Code:\n"+oldCode.toString());
    System.out.println("New Code:\n"+newCode.toString());
  }
}

You should get the output:

Original Code:
double result = Math.pow( 3.0, 16.0);
double resultSquared = Math.pow(2 ,result );
double hypotenuse = Math.sqrt(Math.pow(2.0, 30.0)+Math.pow(2 , 40.0));

New Code:
double result = Math.pow(16.0,3.0);
double resultSquared = Math.pow(result,2);
double hypotenuse = Math.sqrt(Math.pow(30.0,2.0)+Math.pow(40.0,2));

How It Works

We have defined the regular expression so that separate capturing groups identify the method name and both arguments. As we saw earlier, the method name corresponds to group 1, the first argument to group 2, and the second argument to group 8. We therefore define the replacement string to the appendReplacement() method as "$1\\($8,$2\\)". The effect of this is to replace the text for each method call that is matched by the following, in sequence:


  1. The text matching capturing group 1, which will be the method name.

  2. A left parenthesis.

  3. The text matching capturing group 8, which will be the second argument.

  4. A comma.

  5. The text matching capturing group 2, which will be the first argument.

  6. A right parenthesis.

The call to appendTail() is necessary to ensure that any text left at the end of oldCode following the last match for regEx gets copied to newCode.
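To see what goes missing without appendTail(), here is a small sketch that runs the same replacement loop with and without the final call. It uses a deliberately simplified pattern, "(\\w+),(\\w+)", as a stand-in for regEx, and the class name AppendTailSketch and method swapPairs() are invented for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AppendTailSketch {
  // Swaps each "a,b" pair, optionally copying the text after the last match.
  static String swapPairs(String text, boolean callAppendTail) {
    Matcher m = Pattern.compile("(\\w+),(\\w+)").matcher(text);
    StringBuffer result = new StringBuffer();
    while(m.find())
      m.appendReplacement(result, "$2,$1");
    if(callAppendTail)
      m.appendTail(result);                  // Copy the text after the last match
    return result.toString();
  }

  public static void main(String[] args) {
    String code = "f(x,y); // end";
    System.out.println("Without appendTail(): " + swapPairs(code, false));
    System.out.println("With appendTail():    " + swapPairs(code, true));
  }
}
```

Without the call, everything after the last match, here the closing parenthesis and the comment, never reaches the StringBuffer object.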

In the process we have eliminated any superfluous whitespace that was lying around in the original text.
