Best regex questions in September 2010

How to parse HTML with PHP?

38 votes

Suggestion for a reference question. Stack Overflow has dozens of "How to parse HTML" questions coming in every day. However, it is very difficult to close as a duplicate because most questions deal with the specific scenario presented by the asker. This question is an attempt to build a generic "reference question" that covers all aspects of the issue.

This is an experiment. If such a reference question already exists, let me know and I'll happily remove this one.

My ideal vision is that each of the three questions gets answered separately, and the best answers to each bubble up to the top.

I will be awarding a 200 bounty to the best answer in each of the three categories two weeks from now, pending discussion of this question on Meta.

Each of these questions have already been answered brilliantly elsewhere, so copy+pasting your own answer to a different question is fine with me.

How do I parse HTML with PHP?

  1. What libraries are there? Which ones use PHP's native DOM, which ones come with their own parsing engine? (Hint: SimpleHTMLDOM)

    1a. I need to find a specific element, but I find it hard to get used to the XPath syntax. Are there any DOM-based libraries that make parsing HTML easier? Please consider making generic real world examples.

  2. Is there a PHP library that enables me to query the DOM using CSS[2/3] selectors, like jQuery does? (Hint: phpQuery) Please consider making generic real world examples.

  3. Bonus question: Why shouldn't I use regular expressions? Please provide a very short answer in layman's terms.

phpQuery and QueryPath are extremely similar in replicating the fluent jQuery API. That's also why they're one of the easiest approaches to properly parse HTML in PHP.

Examples for QueryPath

Basically you first create a queryable DOM tree from a HTML string:

 $qp = qp("<html><body><h1>title</h1>..."); // or give filename or URL

The resulting object contains a complete tree representation of the HTML document. It can be traversed using DOM methods. But the common approach is to use CSS selectors like in jQuery:

 $qp->find("div.classname")->children()->...;

 foreach ($qp->find("p img") as $img) {
     print qp($img)->attr("src");
 }

Mostly you want to use simple #id and .class or DIV tag selectors for ->find(). But you can also use xpath statements, which sometimes are faster. Also typical jQuery methods like ->children() and ->text() and particularily ->attr() simplify extracting the right HTML snippets. (And already have their SGML entities decoded.)

 $qp->xpath("//div/p[1]");  // get first paragraph in a div

QueryPath also allows injecting new tags into the stream (->append), and later output and prettify an updated document (->writeHTML). It can not only parse malformed HTML, but also various XML dialects (with namespaces), and even extract data from HTML microformats (XFN, vCard).

 $qp->find("a[target=_blank]")->toggleClass("usability-blunder");

.

phpQuery or QueryPath?

Generally QueryPath is better suited for manipulation of documents. While phpQuery also implements some pseudo AJAX methods (just HTTP requests) to more closely resemble jQuery. It is said that phpQuery is often faster than QueryPath (because overall less features).
For further informations on the differences see this comparison: http://www.tagbytag.org/articles/phpquery-vs-querypath

And here's a comprehensive QueryPath introduction: http://www.ibm.com/developerworks/opensource/library/os-php-querypath/index.html?S_TACT=105AGX01&S_CMP=HP

Advantages

  • Simplicity and Reliability
  • Simple to use alternatives ->find("a img, a object, div a")
  • Proper data unescaping (in comparison to regular expression greping)

How does this regex find triangular numbers?

24 votes

Part of a series of educational regex articles, this is a gentle introduction to the concept of nested references.

The first few triangular numbers are:

 1 = 1
 3 = 1 + 2
 6 = 1 + 2 + 3
10 = 1 + 2 + 3 + 4
15 = 1 + 2 + 3 + 4 + 5

There are many ways to check if a number is triangular. There's this interesting technique that uses regular expressions as follows:

  • Given n, we first create a string of length n filled with the same character
  • We then match this string against the pattern ^(\1.|^.)+$
    • n is triangular if and only if this pattern matches the string

Here are some snippets to show that this works in several languages:

PHP (on ideone.com)

$r = '/^(\1.|^.)+$/';

foreach (range(0,50) as $n) {
  if (preg_match($r, str_repeat('o', $n))) {
     print("$n ");
  }
}

Java (on ideone.com)

for (int n = 0; n <= 50; n++) {
    String s = new String(new char[n]);
    if (s.matches("(\\1.|^.)+")) {
        System.out.print(n + " ");
    }
}

C# (on ideone.com)

Regex r = new Regex(@"^(\1.|^.)+$");

for (int n = 0; n <= 50; n++) {
    if (r.IsMatch("".PadLeft(n))) {
       Console.Write("{0} ", n);
    }
}

So this regex seems to work, but can someone explain how?

Similar questions

Explanation

Here's a schematic breakdown of the pattern:

from beginning…
|         …to end
|         |
^(\1.|^.)+$
 \______/|___match
  group 1    one-or-more times

The (…) brackets define capturing group 1, and this group is matched repeatedly with +. This subpattern is anchored with ^ and $ to see if it can match the entire string.

Group 1 tries to match this|that alternates:

  • \1., that is, what group 1 matched (self reference!), plus one of "any" character,
  • or ^., that is, just "any" one character at the beginning

Note that in group 1, we have a reference to what group 1 matched! This is a nested/self reference, and is the main idea introduced in this example. Keep in mind that when a capturing group is repeated, generally it only keeps the last capture, so the self reference in this case essentially says:

"Try to match what I matched last time, plus one more. That's what I'll match this time."

Similar to a recursion, there has to be a "base case" with self references. At the first iteration of the +, group 1 had not captured anything yet (which is NOT the same as saying that it starts off with an empty string). Hence the second alternation is introduced, as a way to "initialize" group 1, which is that it's allowed to capture one character when it's at the beginning of the string.

So as it is repeated with +, group 1 first tries to match 1 character, then 2, then 3, then 4, etc. The sum of these numbers is a triangular number.


Further explorations

Note that for simplification, we used strings that consists of the same repeating character as our input. Now that we know how this pattern works, we can see that this pattern can also match strings like "1121231234", "aababc", etc.

Note also that if we find that n is a triangular number, i.e. n = 1 + 2 + … + k, the length of the string captured by group 1 at the end will be k.

Both of these points are shown in the following C# snippet (also seen on ideone.com):

Regex r = new Regex(@"^(\1.|^.)+$");

Console.WriteLine(r.IsMatch("aababc"));     // True
Console.WriteLine(r.IsMatch("1121231234")); // True
Console.WriteLine(r.IsMatch("iLoveRegEx")); // False

for (int n = 0; n <= 50; n++) {
    Match m = r.Match("".PadLeft(n));
    if (m.Success) {
       Console.WriteLine("{0} = sum(1..{1})", n, m.Groups[1].Length);
    }
}
// 1 = sum(1..1)
// 3 = sum(1..2)
// 6 = sum(1..3)
// 10 = sum(1..4)
// 15 = sum(1..5)
// 21 = sum(1..6)
// 28 = sum(1..7)
// 36 = sum(1..8)
// 45 = sum(1..9)

Flavor notes

Not all flavors support nested references. Always familiarize yourself with the quirks of the flavor that you're working with (and consequently, it almost always helps to provide this information whenever you're asking regex-related questions).

In most flavors, the standard regex matching mechanism tries to see if a pattern can match any part of the input string (possibly, but not necessarily, the entire input). This means that you should remember to always anchor your pattern with ^ and $ whenever necessary.

Java is slightly different in that String.matches, Pattern.matches and Matcher.matches attempt to match a pattern against the entire input string. This is why the anchors can be omitted in the above snippet.

Note that in other contexts, you may need to use \A and \Z anchors instead. For example, in multiline mode, ^ and $ match the beginning and end of each line in the input.

One last thing is that in .NET regex, you CAN actually get all the intermediate captures made by a repeated capturing group. In most flavors, you can't: all intermediate captures are lost and you only get to keep the last.

Related questions


Bonus material: Using regex to find power of twos!!!

With very slight modification, you can use the same techniques presented here to find power of twos.

Here's the basic mathematical property that you want to take advantage of:

  • 1 = 1
  • 2 = (1) + 1
  • 4 = (1+2) + 1
  • 8 = (1+2+4) + 1
  • 16 = (1+2+4+8) + 1
  • 32 = (1+2+4+8+16) + 1

The solution is given below (but do try to solve it yourself first!!!!)

(see on ideone.com in PHP, Java, and C#):

^(\1\1|^.)*.$

How can we match a^n b^n with Java regex?

17 votes

This is the second part of a series of educational regex articles. It shows how lookaheads and nested references can be used to match the non-regular languge anbn. Nested references are first introduced in: How does this regex find triangular numbers?

One of the archetypal non-regular languages is:

L = { anbn: n > 0 }

This is the language of all non-empty strings consisting of some number of a's followed by an equal number of b's. Examples of strings in this language are ab, aabb, aaabbb.

This language can be show to be non-regular by the pumping lemma. It is in fact an archetypal context-free language, which can be generated by the context-free grammar S → aSb | ab.

Nonetheless, modern day regex implementations clearly recognize more than just regular languages. That is, they are not "regular" by formal language theory definition. PCRE and Perl supports recursive regex, and .NET supports balancing groups definition. Even less "fancy" features, e.g. backreference matching, means that regex is not regular.

But just how powerful is this "basic" features? Can we recognize L with Java regex, for example? Can we perhaps combine lookarounds and nested references and have a pattern that works with e.g. String.matches to match strings like ab, aabb, aaabbb, etc?

References

Linked questions

The answer is, needless to say, YES! You can most certainly write a Java regex pattern to match anbn. It uses a positive lookahead for assertion, and one nested reference for "counting".

Rather than immediately giving out the pattern, this answer will guide readers through the process of deriving it. Various hints are given as the solution is slowly constructed. In this aspect, hopefully this answer will contain much more than just another neat regex pattern. Hopefully readers will also learn how to "think in regex", and how to put various constructs harmoniously together, so they can derive more patterns on their own in the future.

The language used to develop the solution will be PHP for its conciseness. The final test once the pattern is finalized will be done in Java.


Step 1: Lookahead for assertion

Let's start with a simpler problem: we want to match a+ at the beginning of a string, but only if it's followed immediately by b+. We can use ^ to anchor our match, and since we only want to match the a+ without the b+, we can use lookahead assertion (?=…).

Here is our pattern with a simple test harness:

function testAll($r, $tests) {
   foreach ($tests as $test) {
      $isMatch = preg_match($r, $test, $groups);
      $groupsJoined = join('|', $groups);
      print("$test $isMatch $groupsJoined\n");
   }
}

$tests = array('aaa', 'aaab', 'aaaxb', 'xaaab', 'b', 'abbb');

$r1 = '/^a+(?=b+)/';
#          └────┘
#         lookahead

testAll($r1, $tests);

The output is (as seen on ideone.com):

aaa 0
aaab 1 aaa
aaaxb 0
xaaab 0
b 0
abbb 1 a

This is exactly the output we want: we match a+, only if it's at the beginning of the string, and only if it's immediately followed by b+.

Lesson: You can use patterns in lookarounds to make assertions.


Step 2: Capturing in a lookahead (and f r e e - s p a c i n g mode)

Now let's say that even though we don't want the b+ to be part of the match, we do want to capture it anyway into group 1. Also, as we anticipate having a more complicated pattern, let's use x modifier for free-spacing so we can make our regex more readable.

Building on our previous PHP snippet, we now have the following pattern:

$r2 = '/ ^ a+ (?= (b+) ) /x';
#             │   └──┘ │
#             │     1  │
#             └────────┘
#              lookahead

testAll($r2, $tests);

The output is now (as seen on ideone.com):

aaa 0
aaab 1 aaa|b
aaaxb 0
xaaab 0
b 0
abbb 1 a|bbb

Note that e.g. aaa|b is the result of join-ing what each group captured with '|'. In this case, group 0 (i.e. what the pattern matched) captured aaa, and group 1 captured b.

Lesson: You can capture inside a lookaround. You can use free-spacing to enhance readability.


Step 3: Refactoring the lookahead into the "loop"

Before we can introduce our counting mechanism, we need to do one modification to our pattern. Currently, the lookahead is outside of the + repetition "loop". This is fine so far because we just wanted to assert that there's a b+ following our a+, but what we really want to do eventually is assert that for each a that we match inside the "loop", there's a corresponding b to go with it.

Let's not worry about the counting mechanism for now and just do the refactoring as follows:

  • First refactor a+ to (?: a )+ (note that (?:…) is a non-capturing group)
  • Then move the lookahead inside this non-capturing group
    • Note that we must now "skip" a* before we can "see" the b+, so modify the pattern accordingly

So we now have the following:

$r3 = '/ ^ (?: a (?= a* (b+) ) )+ /x';
#          │     │      └──┘ │ │
#          │     │        1  │ │
#          │     └───────────┘ │
#          │       lookahead   │
#          └───────────────────┘
#           non-capturing group

The output is the same as before (as seen on ideone.com), so there's no change in that regard. The important thing is that now we are making the assertion at every iteration of the + "loop". With our current pattern, this is not necessary, but next we'll make group 1 "count" for us using self-reference.

Lesson: You can capture inside a non-capturing group. Lookarounds can be repeated.


Step 4: This is the step where we start counting

Here's what we're going to do: we'll rewrite group 1 such that:

  • At the end of the first iteration of the +, when the first a is matched, it should capture b
  • At the end of the second iteration, when another a is matched, it should capture bb
  • At the end of the third iteration, it should capture bbb
  • ...
  • At the end of the n-th iteration, group 1 should capture bn
  • If there aren't enough b to capture into group 1 then the assertion simply fails

So group 1, which is now (b+), will have to be rewritten to something like (\1 b). That is, we try to "add" a b to what group 1 captured in the previous iteration.

There's a slight problem here in that this pattern is missing the "base case", i.e. the case where it can match without the self-reference. A base case is required because group 1 starts "uninitialized"; it hasn't captured anything yet (not even an empty string), so a self-reference attempt will always fail.

There are many ways around this, but for now let's just make the self-reference matching optional, i.e. \1?. This may or may not work perfectly, but let's just see what that does, and if there's any problem then we'll cross that bridge when we come to it. Also, we'll add some more test cases while we're at it.

$tests = array(
  'aaa', 'aaab', 'aaaxb', 'xaaab', 'b', 'abbb', 'aabb', 'aaabbbbb', 'aaaaabbb'
);

$r4 = '/ ^ (?: a (?= a* (\1? b) ) )+ /x';
#          │     │      └─────┘ | │
#          │     │         1    | │
#          │     └──────────────┘ │
#          │         lookahead    │
#          └──────────────────────┘
#             non-capturing group

The output is now (as seen on ideone.com):

aaa 0
aaab 1 aaa|b        # (*gasp!*)
aaaxb 0
xaaab 0
b 0
abbb 1 a|b          # yes!
aabb 1 aa|bb        # YES!!
aaabbbbb 1 aaa|bbb  # YESS!!!
aaaaabbb 1 aaaaa|bb # NOOOOOoooooo....

A-ha! It looks like we're really close to the solution now! We managed to get group 1 to "count" using self-reference! But wait... something is wrong with the second and the last test cases!! There aren't enough bs, and somehow it counted wrong! We'll examine why this happened in the next step.

Lesson: One way to "initialize" a self-referencing group is to make the self-reference matching optional.


Step 4½: Understanding what went wrong

The problem is that since we made the self-reference matching optional, the "counter" can "reset" back to 0 when there aren't enough b's. Let's closely examine what happens at every iteration of our pattern with aaaaabbb as input.

 a a a a a b b b
↑
# Initial state: Group 1 is "uninitialized".
           _
 a a a a a b b b
  ↑
  # 1st iteration: Group 1 couldn't match \1 since it was "uninitialized",
  #                  so it matched and captured just b
           ___
 a a a a a b b b
    ↑
    # 2nd iteration: Group 1 matched \1b and captured bb
           _____
 a a a a a b b b
      ↑
      # 3rd iteration: Group 1 matched \1b and captured bbb
           _
 a a a a a b b b
        ↑
        # 4th iteration: Group 1 could still match \1, but not \1b,
        #  (!!!)           so it matched and captured just b
           ___
 a a a a a b b b
          ↑
          # 5th iteration: Group 1 matched \1b and captured bb
          #
          # No more a, + "loop" terminates

A-ha! On our 4th iteration, we could still match \1, but we couldn't match \1b! Since we allow the self-reference matching to be optional with \1?, the engine backtracks and took the "no thanks" option, which then allows us to match and capture just b!

Do note, however, that except on the very first iteration, you could always match just the self-reference \1. This is obvious, of course, since it's what we just captured on our previous iteration, and in our setup we can always match it again (e.g. if we captured bbb last time, we're guaranteed that there will still be bbb, but there may or may not be bbbb this time).

Lesson: Beware of backtracking. The regex engine will do as much backtracking as you allow until the given pattern matches. This may impact performance (i.e. catastrophic backtracking) and/or correctness.


Step 5: Self-possession to the rescue!

The "fix" should now be obvious: combine optional repetition with possessive quantifier. That is, instead of simply ?, use ?+ instead (remember that a repetition that is quantified as possessive does not backtrack, even if such "cooperation" may result in a match of the overall pattern).

In very informal terms, this is what ?+, ? and ?? says:

?+

  • (optional) "It doesn't have to be there,"
    • (possessive) "but if it is there, you must take it and not let go!"

?

  • (optional) "It doesn't have to be there,"
    • (greedy) "but if it is you can take it for now,"
      • (backtracking) "but you may be asked to let it go later!"

??

  • (optional) "It doesn't have to be there,"
    • (reluctant) "and even if it is you don't have to take it just yet,"
      • (backtracking) "but you may be asked to take it later!"

In our setup, \1 will not be there the very first time, but it will always be there any time after that, and we always want to match it then. Thus, \1?+ would accomplish exactly what we want.

$r5 = '/ ^ (?: a (?= a* (\1?+ b) ) )+ /x';
#          │     │      └──────┘ │ │
#          │     │          1    │ │
#          │     └───────────────┘ │
#          │         lookahead     │
#          └───────────────────────┘
#             non-capturing group

Now the output is (as seen on ideone.com):

aaa 0
aaab 1 a|b          # Yay! Fixed!
aaaxb 0
xaaab 0
b 0
abbb 1 a|b
aabb 1 aa|bb
aaabbbbb 1 aaa|bbb
aaaaabbb 1 aaa|bbb  # Hurrahh!!!

Voilà!!! Problem solved!!! We are now counting properly, exactly the way we want it to!

Lesson: Learn the difference between greedy, reluctant, and possessive repetition. Optional-possessive can be a powerful combination.


Step 6: Finishing touches

So what we have right now is a pattern that matches a repeatedly, and for every a that was matched, there is a corresponding b captured in group 1. The + terminates when there are no more a, or if the assertion failed because there isn't a corresponding b for an a.

To finish the job, we simply need to append to our pattern \1 $. This is now a back reference to what group 1 matched, followed by the end of the line anchor. The anchor ensures that there aren't any extra b's in the string; in other words, that in fact we have anbn.

Here's the finalized pattern, with additional test cases, including one that's 10,000 characters long:

$tests = array(
  'aaa', 'aaab', 'aaaxb', 'xaaab', 'b', 'abbb', 'aabb', 'aaabbbbb', 'aaaaabbb',
  '', 'ab', 'abb', 'aab', 'aaaabb', 'aaabbb', 'bbbaaa', 'ababab', 'abc',
  str_repeat('a', 5000).str_repeat('b', 5000)
);

$r6 = '/ ^ (?: a (?= a* (\1?+ b) ) )+ \1 $ /x';
#          │     │      └──────┘ │ │
#          │     │          1    │ │
#          │     └───────────────┘ │
#          │         lookahead     │
#          └───────────────────────┘
#             non-capturing group

It finds 4 matches: ab, aabb, aaabbb, and the a5000b5000. It takes only 0.06s to run on ideone.com.


Step 7: The Java test

So the pattern works in PHP, but the ultimate goal is to write a pattern that works in Java.

public static void main(String[] args) {

        String aNbN = "(?x) (?:  a  (?= a* (\\1?+ b))  )+ \\1";
        String[] tests = {
                "",      // false
                "ab",    // true
                "abb",   // false
                "aab",   // false
                "aabb",  // true
                "abab",  // false
                "abc",   // false
                repeat('a', 5000) + repeat('b', 4999), // false
                repeat('a', 5000) + repeat('b', 5000), // true
                repeat('a', 5000) + repeat('b', 5001), // false
        };
        for (String test : tests) {
                System.out.printf("[%s]%n  %s%n%n", test, test.matches(aNbN));
        }

}

static String repeat(char ch, int n) {
        return new String(new char[n]).replace('\0', ch);
}

The pattern works as expected (as seen on ideone.com).


And now we come to the conclusion...

It needs to be said that the a* in the lookahead, and indeed the "main + loop", both permit backtracking. Readers are encouraged to confirm why this is not a problem in terms of correctness, and why at the same time making both possessive would also work (though perhaps mixing mandatory and non-mandatory possessive quantifier in the same pattern may lead to misperceptions).

It should also be said that while it's neat that there's a regex pattern that will match anbn, this is in not always the "best" solution in practice. A much better solution is to simply match ^(a+)(b+)$, and then compare the length of the strings captured by groups 1 and 2 in the hosting programming language.

In PHP, it may look something like this (as seen in ideone.com):

function is_anbn($s) {
   return (preg_match('/^(a+)(b+)$/', $s, $groups)) &&
      (strlen($groups[1]) == strlen($groups[2]));
}

The purpose of this article is NOT to convince readers that regex can do almost anything; it clearly can't, and even for the things it can do, at least partial delegation to the hosting language should be considered if it leads to a simpler solution.

As mentioned at the top, while this article is necessarily tagged [regex] for stackoverflow, it is perhaps about more than that. While certainly there's value in learning about assertions, nested references, possessive quantifier, etc, perhaps the bigger lesson here is the creative process by which one can try to solve problems, the determination and hard work that it often requires when you're subjected to various constraints, the systematic composition from various parts to build a working solution, etc.


Bonus material! PCRE recursive pattern!

Since we did bring up PHP, it needs to be said that PCRE supports recursive pattern and subroutines. Thus, following pattern works for preg_match (as seen on ideone.com):

$rRecursive = '/ ^ (a (?1)? b) $ /x';

Currently Java's regex does not support recursive pattern.


Even more bonus material! Matching anbncn !!

So we've seen how to match anbn which is non-regular, but still context-free, but can we also match anbncn, which isn't even context-free?

The answer is, of course, YES! Readers are encouraged to try to solve this on their own, but the solution is provided below (with implementation in Java on ideone.com).

^ (?: a (?= a* (\1?+ b) b* (\2?+ c) ) )+ \1 \2 $

Why aren't modern regular expression dialects regular?

16 votes

I've seen a few comments here that mention that modern regular expressions go beyond what can be represented in a regular language. How is this so? What features of modern regular expressions are not regular? Examples would be helpful.

The first thing that comes to mind is backreferences:

(\w*)\s\1

(matches a group of word characters, followed by a space character and then the same group previously matched) eg: hello hello matches, hello world doesn't.

This construct is not regular (ie: can't be generated by a regular grammar).


Another feature supported by Perl Compatible RegExp (PCRE) that is not regular are recursive patterns:

\((a*|(?R))*\)

This can be used to match any combination of balanced parentheses and "a"s (from wikipedia)

Writing a parser for regular expressions

15 votes

Even after years of programming, I'm ashamed to say that I've never really fully grasped regular expressions. In general, when a problem calls for a regex, I can usually (after a bunch of referring to syntax) come up with an appropriate one, but it's a technique that I find myself using increasingly often.

So, to teach myself and understand regular expressions properly, I've decided to do what I always do when trying to learn something; i.e., try to write something ambitious that I'll probably abandon as soon as I feel I've learnt enough.

To this end, I want to write a regular expression parser in Python. In this case, "learn enough" means that I want to implement a parser that can understand Perl's extended regex syntax completely. However, it doesn't have to be the most efficient parser or even necessarily usable in the real-world. It merely has to correctly match or fail to match a pattern in a string.

The question is, where do I start? I know almost nothing about how regexes are parsed and interpreted apart from the fact that it involves a finite state automaton in some way. Any suggestions for how to approach this rather daunting problem would be much appreciated.

EDIT: I should clarify that while I'm going to implement the regex parser in Python, I'm not overly fussed about what programming language the examples or articles are written in. As long as it's not in Brainfuck, I will probably understand enough of it to make it worth my while.

Writing an implementation of a regular expression engine is indeed a quite complex task.

But if you are interested in how to do it, even if you can't understand enough of the details to actually implement it, I would recommend that you at least look at this article:

Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...)

It explains how many of the popular programming languages implement regular expressions in a way that can be very slow for some regular expressions, and explains a slightly different method that is faster. The article includes some details of how the proposed implementation works, including some source code in C. It may be a bit heavy reading if you are just starting to learn regular expressions, but I think it is well worth knowing about the difference between the two approaches.

string replace using a List<string>

13 votes

I have a List of words I want to ignore like this one :

public List<String> ignoreList = new List<String>()
        {
            "North",
            "South",
            "East",
            "West"
        };

For a given string, say "14th Avenue North" I want to be able to remove the "North" part, so basically a function that would return "14th Avenue " when called.

I feel like there is something I should be able to do with a mix of LINQ, regex and replace, but I just can't figure it out.

The bigger picture is, I'm trying to write an address matching algorithm. I want to filter out words like "Street", "North", "Boulevard", etc. before I use the Levenshtein algorithm to evaluate the similarity.

How about this:

string.Join(" ", text.Split().Where(w => !ignoreList.Contains(w)));

or for .Net 3:

string.Join(" ", text.Split().Where(w => !ignoreList.Contains(w)).ToArray());

Note that this method splits the string up into individual words so it only removes whole words. That way it will work properly with addresses like Northampton Way #123 that string.Replace can't handle.

How does this PCRE pattern detect palindromes?

10 votes

This question is an educational demonstration of the usage of lookahead, nested reference, and conditionals in a PCRE pattern to match ALL palindromes, including the ones that can't be matched by the recursive pattern given in the PCRE man page.

Examine this PCRE pattern in PHP snippet:

$palindrome = '/(?x)
^
  (?:
      (.) (?=
              .*
              (
                \1
                (?(2) \2 | )
              )
              $
          )
  )*
  .?
  \2?
$


/';

This pattern seems to detect palindromes, as seen in this test cases (see also on ideone.com):

$tests = array(
  # palindromes
  '',
  'a',
  'aa',
  'aaa',
  'aba',
  'aaaa',
  'abba',
  'aaaaa',
  'abcba',
  'ababa',

  # non-palindromes
  'aab',
  'abab',
  'xyz',
);

foreach ($tests as $test) {
  echo sprintf("%s '%s'\n", preg_match($palindrome, $test), $test);  
}

So how does this pattern work?


Notes

This pattern uses a nested reference, which is a similar technique used in How does this Java regex detect palindromes?, but unlike that Java pattern, there's no lookbehind (but it does use a conditional).

Also, note that the PCRE man page presents a recursive pattern to match some palindromes:

# the recursive pattern to detect some palindromes from PCRE man page
^(?:((.)(?1)\2|)|((.)(?3)\4|.))$

The man page warns that this recursive pattern can NOT detect all palindromes (see: Why will this recursive regex only match when a character repeats 2n - 1 times? and also on ideone.com), but the nested reference/positive lookahead pattern presented in this question can.

Let's try to understand the regex by constructing it. Firstly, a palindrome must start and end with the same sequence of character in the opposite direction:

^(.)(.)(.) ... \3\2\1$

we want to rewrite this such that the ... is only followed by a finite length of patterns, so that it could be possible for us to convert it into a *. This is possible with a lookahead:

^(.)(?=.*\1$)
 (.)(?=.*\2\1$)
 (.)(?=.*\3\2\1$) ...

but there are still uncommon parts. What if we can "record" the previously captured groups? If it is possible we could rewrite it as:

^(.)(?=.*(?<record>\1\k<record>)$)   # \1     = \1 + (empty)
 (.)(?=.*(?<record>\2\k<record>)$)   # \2\1   = \2 + \1
 (.)(?=.*(?<record>\3\k<record>)$)   # \3\2\1 = \3 + \2\1
 ...

which could be converted into

^(?: 
    (.)(?=.*(\1\2)$)
 )*

Almost good, except that \2 (the recorded capture) is not empty initially. It will just fail to match anything. We need it to match empty if the recorded capture doesn't exist. This is how the conditional expression creeps in.

(?(2)\2|)   # matches \2 if it exist, empty otherwise.

so our expression becomes

^(?: 
    (.)(?=.*(\1(?(2)\2|))$)
 )*

Now it matches the first half of the palindrome. How about the 2nd half? Well, after the 1st half is matched, the recorded capture \2 will contain the 2nd half. So let's just put it in the end.

^(?: 
    (.)(?=.*(\1(?(2)\2|))$)
 )*\2$

We want to take care of odd-length palindrome as well. There would be a free character between the 1st and 2nd half.

^(?: 
    (.)(?=.*(\1(?(2)\2|))$)
 )*.?\2$

This works good except in one case — when there is only 1 character. This is again due to \2 matches nothing. So

^(?: 
    (.)(?=.*(\1(?(2)\2|))$)
 )*.?\2?$
#      ^ since \2 must be at the end in the look-ahead anyway.

Regex to check if a number is even

9 votes

Can we have a regex to detect if a number is even ?

I was wondering if we can have a regex to do this instead of usual % or bit operations.

Thanks for replies :)

Since the correct answer has already been given, I'll argue that regex would not be my first choice for this.

  • if the number fits the long range, use %
  • if it does not, you can use BigInteger.remainder(..), but perhaps checking whether the last char represents an even digit would be more efficient.

Split string to equal length substrings in Java

9 votes

How to split the string "Thequickbrownfoxjumps" to substrings of equal size in Java. Eg. "Thequickbrownfoxjumps" of 4 equal size should give the output.

["Theq","uick","brow","nfox","jump","s"]

Similar Question:

Split string into equal-length substrings in Scala

Here's the regex one-liner version:

System.out.println(Arrays.toString(
    "Thequickbrownfoxjumps".split("(?<=\\G.{4})")
));

\G is a zero-width assertion that matches the position where the previous match ended. If there was no previous match, it matches the beginning of the input, the same as \A. The enclosing lookbehind matches the position that's four characters along from the end of the last match.

Both lookbehind and \G are advanced regex features, not supported by all flavors. Furthermore, \G is not implemented consistently across the flavors that do support it. This trick will work (for example) in Java, Perl, .NET and JGSoft, but not in PHP (PCRE), Ruby 1.9+ or TextMate (both Oniguruma).

EDIT: I should mention that I don't necessarily recommend this solution if you have other options. The non-regex solutions in the other answers may be longer, but they're also self-documenting; this one's just about the opposite of that. ;)

What is the power of regular expressions?

9 votes

As the name suggests we may think that regular expressions can match regular languages only. But regular expressions we use in practice contain stuff that I am not sure it's possible to implement with their theoretical counterparts. How for example would you simulate a back-reference? So the question arises: what is the theoretical power of the regular expressions we use in practice? Can you think of a way to match {(a^n)(b^n)|n>=0}? What about {(a^n)(b^n)(c^n)|n>=0}?

The answer to your question is, "regular expression" languages that allow back-references are neither regular nor context-free. (In other words, as you pointed out, you cannot simulate back-reference with a regular language, nor with a CFL.) In fact, Wikipedia says many of the "regular expression" languages we use in practice are NP-Complete:

Pattern matching with an unbounded number of back references, as supported by numerous modern tools, is NP-complete (see,[11] Theorem 6.2).

As others have suggested, the regular expression languages commonly supported in computer languages and libraries are a different animal from regular expressions in formal language theory. Larry Wall wrote in regard to Perl "regexes,"

'Regular expressions' [...] are only marginally related to real regular expressions. Nevertheless, the term has grown with the capabilities of our pattern matching engines, so I'm not going to try to fight linguistic necessity here. I will, however, generally call them "regexes"

You asked,

Can you think of a way to match {(a^n)(b^n)|n>=0}? What about {(a^n)(b^n)(c^n)|n>=0}?

I'm not sure here if you're trying to test whether theoretical regular expression languages can match the "language of squares", or whether you're looking for an implementation in a (practical) regex language. Here's the proof why the former is not possible; and here's a long explanation and implementation of the latter for java regexes.

Does an algorithm exist which can determine whether one regular language matches any input another regular language matches?

7 votes

Let's say we have regular expressions:

  • Hello W.*rld
  • Hello World
  • .* World
  • .* W.*

I would like to minimize the number of regexes required to match arbitrary input.

To do that, I need to find if one regular expression matches any input matched by another expression. Is that possible?

Billy3

Any regular expression can be linked to a DFA - you can minimize the DFA and since the minimal form is unique, you can decide whether two expressions are equivalent. Dani Cricco pointed out the Hopcroft O(n log n) algorithm. There is another improved algorithm by Hopcroft and Craft which tests the equivalence of two DFAs in O(n).

For a good survey on the matter and an interesting approach to this, I reccomend the paper Testing the Equivalence of Regular Languages, from arXiv.

Later edit: if you are interested in inclusion rather than equivalence for regular expressions, I have come across a paper that might be of interest: Inclusion Problem for Regular Expressions - I have only skimmed through it but it seems to contain a polynomial time algorithm to the problem.

What's the difference between () and [] in regular expression patterns?

7 votes

What is the difference between encasing part of a regular expression in () (parentheses) and doing it in [] (square brackets)?

How does this:

[a-z0-9]

differ from this:

(a-z0-9)

?

[] denotes a character class. () denotes a capturing group.

[a-z0-9] -- One character that is in the range of a-z OR 0-9

(a-z0-9) -- Explicit capture of a-z0-9. No ranges.

a -- Can be captured by [a-z0-9].

a-z0-9 -- Can be captured by (a-z0-9) and then can be referenced in a replacement and/or later in the expression.