
Regular expressions in Java

Regular expressions are a topic that programmers, even experienced ones, often put off. But sooner or later, most Java developers have to process textual information, which usually means searching and editing text. Without regular expressions, compact and effective text-processing code is hard to write. So stop procrastinating and let's tackle regular expressions right now. They're not that difficult.

What is a regular expression (regex)?

A regular expression is a pattern for finding a string in text. In Java, the original representation of this pattern is always a string, i.e. an object of the String class. However, not every string can be compiled into a regular expression, only strings that conform to the rules for creating regular expressions. The syntax is described in the documentation of the java.util.regex.Pattern class. Regular expressions are written using letters and digits, as well as metacharacters, which are characters that have special meaning in regular expression syntax. For example:
String regex = "java"; // The pattern is "java"
String regex = "\\d{3}"; // The pattern is three digits

Creating regular expressions in Java

Creating a regular expression in Java involves two simple steps:
  1. write it as a string that complies with regular expression syntax;
  2. compile the string into a regular expression;
In any Java program, we start working with regular expressions by creating a Pattern object. To do this, we call one of the class's two static compile methods. The first takes one argument, a string containing the regular expression, while the second takes an additional argument that determines the pattern-matching settings:
public static Pattern compile(String regex)
public static Pattern compile(String regex, int flags)
The possible values of the flags parameter are defined in the Pattern class as static constants, which can be combined with the bitwise OR operator. For example:
Pattern pattern = Pattern.compile("java", Pattern.CASE_INSENSITIVE); // Pattern-matching will be case insensitive.
Basically, the Pattern class is a factory for compiled regular expressions. Under the hood, the compile method calls the Pattern class's private constructor to create a compiled representation. Objects are created this way so that they are immutable. When a regular expression is created, its syntax is checked. If the string contains errors, a PatternSyntaxException is thrown.
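As a minimal sketch of the two points above (compile flags and syntax checking), the following snippet compiles a case-insensitive pattern and shows what happens with a malformed one. The class name CompileDemo and the helper isValidRegex are made up for illustration; only Pattern, its compile methods, and PatternSyntaxException come from java.util.regex.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class CompileDemo {
    // Hypothetical helper: returns true if the string compiles as a regular expression.
    public static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // getDescription() and getIndex() report what went wrong and where
            System.out.println("Invalid regex: " + e.getDescription() + " (index " + e.getIndex() + ")");
            return false;
        }
    }

    public static void main(String[] args) {
        Pattern insensitive = Pattern.compile("java", Pattern.CASE_INSENSITIVE);
        System.out.println(insensitive.matcher("I like JAVA").find()); // true

        System.out.println(isValidRegex("\\d{3}")); // true
        System.out.println(isValidRegex("[abc"));   // false: the character class is never closed
    }
}
```

PatternSyntaxException is unchecked, so catching it is optional. Doing so is mainly useful when the pattern comes from user input rather than from a literal in the source code.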

Regular expression syntax

Regular expression syntax relies on the <([{\^-=$!|]})?*+.> characters, which can be combined with letters. Depending on their role, they can be divided into several groups:
1. Metacharacters for matching the boundaries of lines or text
Metacharacter   Description
^               beginning of a line
$               end of a line
\b              word boundary
\B              non-word boundary
\A              beginning of the input
\G              end of the previous match
\Z              end of the input, except for the final line terminator, if any
\z              end of the input
2. Metacharacters for matching predefined character classes
Metacharacter   Description
\d              digit
\D              non-digit
\s              whitespace character
\S              non-whitespace character
\w              alphanumeric character or underscore
\W              any character except letters, digits, and underscore
.               any character (line terminators only when the DOTALL flag is set)
3. Metacharacters for matching control characters
Metacharacter   Description
\t              tab character
\n              newline (line feed) character
\r              carriage return
\f              form feed character
\u0085          next line character
\u2028          line separator
\u2029          paragraph separator
4. Metacharacters for character classes
Metacharacter   Description
[abc]           any of the listed characters (a, b, or c)
[^abc]          any character other than those listed (not a, b, or c)
[a-zA-Z]        merged ranges (Latin letters from a to z, lowercase or uppercase)
[a-d[m-p]]      union of characters (from a to d and from m to p)
[a-z&&[def]]    intersection of characters (d, e, f)
[a-z&&[^bc]]    subtraction of characters (a, d-z)
5. Metacharacters for indicating the number of characters (quantifiers). A quantifier is always preceded by a character or character group.
Metacharacter   Description
?               zero or one time
*               zero or more times
+               one or more times
{n}             n times
{n,}            n or more times
{n,m}           at least n times and no more than m times
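To see a few of these groups working together, here is a small sketch that collects all matches of a pattern in a sample string. The class name SyntaxDemo, the helper findAll, and the sample text are assumptions made for this illustration, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SyntaxDemo {
    // Hypothetical helper: collects every match of regex in text, in order.
    public static List<String> findAll(String regex, String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Room 101, floor 7, code 2024";
        // Predefined class plus quantifier: runs of one or more digits
        System.out.println(findAll("\\d+", text));            // [101, 7, 2024]
        // Word boundaries plus exact count: numbers of exactly three digits
        System.out.println(findAll("\\b\\d{3}\\b", text));    // [101]
        // Boundary matcher: the word at the very beginning of the input
        System.out.println(findAll("^\\w+", text));           // [Room]
        // Character class subtraction: a to c, except b
        System.out.println(findAll("[a-c&&[^b]]", "abcabc")); // [a, c, a, c]
    }
}
```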

Greedy quantifiers

One thing you should know about quantifiers is that they come in three different varieties: greedy, possessive, and reluctant. You make a quantifier possessive by adding a "+" character after the quantifier. You make it reluctant by adding "?". For example:
"A.+a" // greedy
"A.++a" // possessive
"A.+?a" // reluctant
Let's use this pattern to understand how the different types of quantifiers work. By default, quantifiers are greedy. This means that they look for the longest match in the string. If we run the following code:
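import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyExample {
    public static void main(String[] args) {
        String text = "Fred Anna Alexander";
        Pattern pattern = Pattern.compile("A.+a"); // greedy quantifier
        Matcher matcher = pattern.matcher(text);
        // Print every match found in the text
        while (matcher.find()) {
            System.out.println(text.substring(matcher.start(), matcher.end()));
        }
    }
}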
we get this output:
Anna Alexa
For the regular expression "A.+a", pattern-matching is performed as follows:
  1. The first character in the specified pattern is the Latin letter A. Matcher compares it with each character of the text, starting from index zero. The character F is at index zero in our text, so Matcher iterates through the characters until it finds one that matches the pattern. In our example, this character is found at index 5.
  2. Once a match with the pattern's first character is found, Matcher looks for a match with its second character. In our case, it is the "." character, which stands for any character. The character n is at index 6. It certainly qualifies as a match for "any character".
  3. Matcher proceeds to the next element of the pattern, the "+" quantifier, which applies to the preceding "." character. Because "any character" may be repeated one or more times, Matcher keeps taking the next character from the string for as long as it matches "any character". In our example, that continues to the end of the string (from index 7 to index 18). Basically, Matcher gobbles up the string to the end, which is precisely what is meant by "greedy".
  4. After Matcher reaches the end of the text and finishes the check for the "A.+" part of the pattern, it starts checking for the rest of the pattern: a. There's no more text going forward, so the check proceeds by "backing off", starting from the last character.
  5. Matcher "remembers" the number of repetitions in the ".+" part of the pattern. At this point, it reduces the number of repetitions by one and checks the whole pattern against the text again, repeating this until a match is found. In our example, that happens when the final "a" of the pattern lines up with the character at index 14, which gives the match "Anna Alexa".

Possessive quantifiers

Possessive quantifiers are a lot like greedy ones. The difference is that once text has been captured to the end of the string, there is no "backing off". In other words, the first three stages are the same as for greedy quantifiers. After ".++" has captured the text to the end of the string, the matcher tries to match the rest of the pattern (the character "a"), but no text is left, and a possessive quantifier never gives back what it has captured. In our example, using the regular expression "A.++a", the main method finds no match.
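The behavior described above can be checked with a short sketch. The class name PossessiveDemo and the helper found are hypothetical. The second pattern, "A[^a]++a", is an extra illustration that is not part of the original example: it shows that a possessive quantifier can still succeed when the repeated part cannot swallow the character that must follow it.

```java
import java.util.regex.Pattern;

public class PossessiveDemo {
    // Hypothetical helper: true if regex matches somewhere in text.
    public static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "Fred Anna Alexander";
        // ".++" captures everything up to the end of the text and never backs off,
        // so nothing is left for the final "a"
        System.out.println(found("A.++a", text));    // false
        // "[^a]++" stops by itself in front of the first "a", so the final "a" still matches
        System.out.println(found("A[^a]++a", text)); // true
    }
}
```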

Reluctant quantifiers

  1. For these quantifiers, as with the greedy variety, the code looks for a match based on the first character of the pattern.
  2. Then it looks for a match with the pattern's next character (any character).
  3. Unlike greedy pattern-matching, reluctant pattern-matching searches for the shortest match. This means that after finding a match with the pattern's second character (a period, which corresponds to the character at index 6 in the text), Matcher checks whether the text matches the rest of the pattern, the character "a".
  4. The text does not match the pattern (it contains the character "n" at index 7), so Matcher adds one more "any character", because the quantifier indicates one or more. Then it again compares the pattern with the text at indices 5 through 8.
  5. In our case, a match is found, but we haven't reached the end of the text yet. Therefore, pattern-matching restarts from index 9, i.e. the pattern's first character is looked for using the same algorithm, and this repeats until the end of the text.
Accordingly, the main method obtains the following result when using the pattern "A.+?a":
Anna
Alexa
As you can see from our example, different types of quantifiers produce different results for the same pattern. So keep this in mind and choose the right variety based on what you're looking for.
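To compare the three varieties side by side, here is a minimal sketch that runs each pattern against the same text. The class name QuantifierComparison and the helper findAll are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierComparison {
    // Hypothetical helper: collects every match of regex in text, in order.
    public static List<String> findAll(String regex, String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Fred Anna Alexander";
        System.out.println("greedy:     " + findAll("A.+a", text));  // [Anna Alexa]
        System.out.println("possessive: " + findAll("A.++a", text)); // []
        System.out.println("reluctant:  " + findAll("A.+?a", text)); // [Anna, Alexa]
    }
}
```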

Escaping characters in regular expressions

Because a regular expression in Java, or rather, its original representation, is a string literal, we need to account for Java rules regarding string literals. In particular, the backslash character "\" in string literals in Java source code is interpreted as an escape character that tells the compiler that the next character is special and must be interpreted in a special way. For example:
String s = "The root directory is \nWindows"; // Move "Windows" to a new line
String s = "The root directory is \u00A7Windows"; // Insert a section sign (§) before "Windows"
This means that string literals that describe regular expressions and use "\" characters (e.g. to indicate metacharacters) must double the backslashes so that the Java compiler doesn't misinterpret the string. A character that is special only to Java, such as a double quote, needs just the single Java-level backslash. For example:
String regex = "\\s"; // Pattern for matching a whitespace character
String regex = "\"Windows\""; // Pattern for matching "Windows", quotation marks included
Double backslashes must also be used to escape special characters that we want to use as "normal" characters. For example:
String regex="How\\?";  // Pattern for matching "How?"
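The difference between an escaped and an unescaped "?" can be seen in a short sketch. The class name EscapeDemo and the helper matchesLiterally are hypothetical. Pattern.quote is a standard static method of the Pattern class that is not covered elsewhere in this article; it wraps a string so that every character in it is treated literally, which saves escaping each metacharacter by hand.

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    // Hypothetical helper: treats the first argument as plain text, not as a regex.
    public static boolean matchesLiterally(String literal, String input) {
        return Pattern.matches(Pattern.quote(literal), input);
    }

    public static void main(String[] args) {
        // Unescaped, "?" is a quantifier: "How?" means "Ho" followed by an optional "w"
        System.out.println(Pattern.matches("How?", "Ho"));     // true
        System.out.println(Pattern.matches("How?", "How?"));   // false
        // Escaped, "\\?" matches a literal question mark
        System.out.println(Pattern.matches("How\\?", "How?")); // true
        // Pattern.quote escapes the whole string at once
        System.out.println(matchesLiterally("How?", "How?"));  // true
        System.out.println(matchesLiterally("How?", "Ho"));    // false
    }
}
```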

Methods of the Pattern class

The Pattern class has other methods for working with regular expressions:
  • String pattern() ‒ returns the regular expression's original string representation used to create the Pattern object:
    Pattern pattern = Pattern.compile("abc");
    System.out.println(pattern.pattern()); // "abc"
  • static boolean matches(String regex, CharSequence input) – lets you check the regular expression passed as regex against the text passed as input. Returns:
    true – if the entire text matches the pattern;
    false – if it does not;
    For example:
    System.out.println(Pattern.matches("A.+a","Anna")); // true
    System.out.println(Pattern.matches("A.+a","Fred Anna Alexander")); // false
  • int flags() ‒ returns the value of the pattern's flags parameter set when the pattern was created or 0 if the parameter was not set. For example:
    Pattern pattern = Pattern.compile("abc");
    System.out.println(pattern.flags()); // 0
    pattern = Pattern.compile("abc", Pattern.CASE_INSENSITIVE);
    System.out.println(pattern.flags()); // 2
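  • String[] split(CharSequence text, int limit) – splits the passed text into a String array around matches of the pattern. The limit parameter controls how many times the pattern is applied:
    • if limit > 0 ‒ at most limit-1 matches are used, so the array has at most limit elements;
    • if limit < 0 ‒ all matches in the text are used, and empty strings at the end of the array are kept;
    • if limit = 0 ‒ all matches in the text are used, and empty strings at the end of the array are discarded. The one-argument split(CharSequence text) method behaves the same way.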
    For example:
    import java.util.regex.Pattern;

    public class SplitExample {
        public static void main(String[] args) {
            String text = "Fred Anna Alexa";
            Pattern pattern = Pattern.compile("\\s");
            // limit = 2: the pattern is applied at most once
            String[] strings = pattern.split(text, 2);
            for (String s : strings) {
                System.out.println(s);
            }
            System.out.println("---------");
            // no limit argument: all matches are used
            String[] strings1 = pattern.split(text);
            for (String s : strings1) {
                System.out.println(s);
            }
        }
    }
    Console output:
    Fred
    Anna Alexa
    ---------
    Fred
    Anna
    Alexa
    The Pattern class has one more important method, which creates a Matcher object. We'll consider it below.
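The effect of the limit parameter of split is easiest to see on text that ends with separators. The following is a minimal sketch; the class name SplitLimitDemo, the helper splitByComma, and the sample text "a,b,," are assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitLimitDemo {
    // Hypothetical helper: splits text around commas with the given limit.
    public static String[] splitByComma(String text, int limit) {
        return Pattern.compile(",").split(text, limit);
    }

    public static void main(String[] args) {
        String text = "a,b,,";
        // limit = 2: the pattern is applied once, the rest stays in the last element
        System.out.println(Arrays.toString(splitByComma(text, 2)));  // [a, b,,]
        // limit < 0: every match is used and trailing empty strings are kept
        System.out.println(Arrays.toString(splitByComma(text, -1))); // [a, b, , ]
        // limit = 0: every match is used and trailing empty strings are discarded
        System.out.println(Arrays.toString(splitByComma(text, 0)));  // [a, b]
    }
}
```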

Methods of the Matcher class

Instances of the Matcher class are created to perform pattern-matching. Matcher is the "search engine" for regular expressions. To perform a search, we need to give it two things: a pattern and the text to search. To create a Matcher object, the Pattern class provides the following method:
public Matcher matcher(CharSequence input)
The method takes the character sequence that will be searched. This is an instance of a class that implements the CharSequence interface. You can pass not only a String, but also a StringBuffer, StringBuilder, Segment, or CharBuffer. The pattern is the Pattern object on which the matcher method is called. Example of creating a matcher:
Pattern p = Pattern.compile("a*b"); // Create a compiled representation of the regular expression
Matcher m = p.matcher("aaaaab"); // Create a "search engine" to search the text "aaaaab" for the pattern "a*b"
Now we can use our "search engine" to search for matches, get the position of a match in the text, and replace text using the class's methods. The boolean find() method looks for the next match in the text. We can use this method and a loop statement to analyze an entire text as part of an event model. In other words, we can perform necessary operations when an event occurs, i.e. when we find a match in the text. For example, we can use this class's int start() and int end() methods to determine a match's position in the text. And we can use the String replaceFirst(String replacement) and String replaceAll(String replacement) methods to replace matches with the value of the replacement parameter. For example:
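import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherExample {
    public static void main(String[] args) {
        String text = "Fred Anna Alexa";
        Pattern pattern = Pattern.compile("A.+?a");

        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            int start = matcher.start();
            int end = matcher.end(); // index just past the last matched character
            System.out.println("Match found: " + text.substring(start, end) + " from index " + start + " through " + (end - 1));
        }
        // Both replace methods reset the matcher and return a new String
        System.out.println(matcher.replaceFirst("Ira"));
        System.out.println(matcher.replaceAll("Mary"));
        System.out.println(text); // the original text is unchanged
    }
}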
Output:
Match found: Anna from index 5 through 8
Match found: Alexa from index 10 through 14
Fred Ira Alexa
Fred Mary Mary
Fred Anna Alexa
The example makes it clear that the replaceFirst and replaceAll methods create a new String object: a string in which pattern matches in the original text are replaced by the text passed to the method as an argument. The replaceFirst method replaces only the first match, while the replaceAll method replaces all the matches in the text. The original text remains unchanged.
The most frequently used regex operations of the Pattern and Matcher classes are built right into the String class. These are methods such as split, matches, replaceFirst, and replaceAll. But under the hood, these methods use the Pattern and Matcher classes. So if you want to replace text or compare strings in a program without writing any extra code, use the methods of the String class. If you need more advanced features, remember the Pattern and Matcher classes.
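For comparison, here is a sketch of the same operations written with the String class's built-in methods instead of explicit Pattern and Matcher objects. The class name StringRegexDemo and the helper replaceNames are hypothetical; the String methods themselves (replaceFirst, replaceAll, split, matches) are the standard ones mentioned above.

```java
import java.util.Arrays;

public class StringRegexDemo {
    // Hypothetical helper: replaces every match of "A.+?a" with the given text.
    public static String replaceNames(String text, String replacement) {
        return text.replaceAll("A.+?a", replacement);
    }

    public static void main(String[] args) {
        String text = "Fred Anna Alexa";
        System.out.println(text.replaceFirst("A.+?a", "Ira"));  // Fred Ira Alexa
        System.out.println(replaceNames(text, "Mary"));         // Fred Mary Mary
        System.out.println(Arrays.toString(text.split("\\s"))); // [Fred, Anna, Alexa]
        System.out.println("Anna".matches("A.+a"));             // true
        System.out.println(text.matches("A.+a"));               // false: the whole string must match
    }
}
```

Each of these calls works from the pattern string on every invocation, so when the same pattern is applied many times, compiling it once with Pattern.compile and reusing the Pattern object is usually the better choice.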

Conclusion

In a Java program, a regular expression is defined by a string that obeys specific pattern-matching rules. When executing code, the Java machine compiles this string into a Pattern object and uses a Matcher object to find matches in the text. As I said at the beginning, people often put off regular expressions for later, considering them to be a difficult topic. But if you understand the basic syntax, metacharacters, and character escaping, and study examples of regular expressions, then you'll find they are much simpler than they appear at first glance.
