Best regex questions in October 2011

What does [^*] mean?

18 votes

So [^x] means don't match "x", and x* means match "x" 0 or more times, but what does [^*] mean?

It means "match a character that isn't a literal asterisk character."

The thing to keep in mind is that within a character class metacharacters don't need to be escaped, so [^*] is the same as [^\*]. Similarly, you could use [.] to refer to a literal dot rather than the metacharacter referring to any character. Outside of a character class you would need to escape it: \..

Pattern.matches() gives StackOverflowError

15 votes

I'm using java's Pattern.matches to match a block of data to a regex. The block of data can be a single line or multiple lines. The problem is that once my data becomes more than 15 lines (typically more than 17-18 lines), i start getting stackoverflowerror. For data less than 15 lines the regex works fine.

The Regex is of this format:
domainname -> space -> , -> space -> number -> space -> , -> space -> number -> newline

String regex = "^(([a-zA-Z0-9][a-zA-Z0-9\\-]*\\.)+([a-zA-Z]{2,})\\s*,\\s*\\d+\\s*,\\s*\\d+(\\r?\\n)?)+$";

The data block i use to test against this regex is this

abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456
abc.com, 123, 456

This is the code:

String regex = "^(([a-zA-Z0-9][a-zA-Z0-9\\-]*\\.)+([a-zA-Z]{2,})\\s*,\\s*\\d+\\s*,\\s*\\d+(\\r?\\n)?)+$";
boolean valid = Pattern.matches(regex, data); //fails here

I can't tell you the reason for this error; the regex itself is fine and not subject to catastrophic backtracking or any other obvious error.

Perhaps you can reduce the number of backtracking positions the regex engine saves by using possessive quantifiers (++ instead of +, *+ instead of *, {2,}+ instead of {2,} etc.). Also, you don't need the capturing groups (thanks Thomas), so I've changed them into non-capturing ones:

"(?:(?:[a-zA-Z0-9][a-zA-Z0-9-]*+\\.)++([a-zA-Z]{2,}+)\\s*+,\\s*+\\d++\\s*+,\\s*+\\d++(\r?+\n)?+)++"

This won't change the behaviour of the regex (except for the removal of the unnecessary anchors since you're using Pattern.matches()), but perhaps it helps avoid StackOverflows. I don't have a Java SDK installed, so I can't test it myself, though.

Regular Expressions in C# running slowly

13 votes

I have been doing a little work with regex over the past week and managed to make a lot of progress, however, I'm still fairly n00b. I have a regex written in C#:

string isMethodRegex = 
    @"\b(public|private|internal|protected)?\s*(static|virtual|abstract)?"+
    @"\s*(?<returnType>[a-zA-Z\<\>_1-9]*)\s(?<method>[a-zA-Z\<\>_1-9]+)\s*\"+
    @"((?<parameters>(([a-zA-Z\[\]\<\>_1-9]*\s*[a-zA-Z_1-9]*\s*)[,]?\s*)+)\)";
IsMethodRegex = new Regex(isMethodRegex);

For some reason, when calling the regular expression IsMethodRegex.IsMatch() it hangs for 30+ seconds on the following string:

"\t * Returns collection of active STOP transactions (transaction type 30) "

Does anyone how the internals of Regex works and why this would be so slow on matching this string and not others. I have had a play with it and found that if I take out the * and the parenthesis then it runs fine. Perhaps the regular expression is poorly written?

Any help would be so greatly appreciated.

EDIT: I think the performance issue comes in the way <parameters> matching group is done. I have rearranged to match a first parameter, then any number of successive parameters, or optionally none at all. Also I have changed the \s* between parameter type and name to \s+ (I think this was responsible for a LOT of backtracking because it allows no spaces, so that object could match as obj and ect with \s* matching no spaces) and it seems to run a lot faster:

string isMethodRegex = 
    @"\b(public|private|internal|protected)?\s*(static|virtual|abstract)?"+
    @"\s*(?<returnType>[a-zA-Z\<\>_1-9]*)\s*(?<method>[a-zA-Z\<\>_1-9]+)\s*\"+
    @"((?<parameters>((\s*[a-zA-Z\[\]\<\>_1-9]*\s+[a-zA-Z_1-9]*\s*)"+
    @"(\s*,\s*[a-zA-Z\[\]\<\>_1-9]*\s+[a-zA-Z_1-9]*\s*)*\s*))?\)";

EDIT: As duly pointed out by @Dan, the following is simply because the Regex can exit early.

This is indeed a really bizarre situation, but if I remove the two optional matching at the beginning (for public/private/internal/protected and static/virtual/abstract) then it starts to run almost instantaneously again:

string isMethodRegex = 
    @"\b(public|private|internal|protected)\s*(static|virtual|abstract)"+
    @"(?<returnType>[a-zA-Z\<\>_1-9]*)\s(?<method>[a-zA-Z\<\>_1-9]+)\s*\"+
    @"((?<parameters>(([a-zA-Z\[\]\<\>_1-9]*\s*[a-zA-Z_1-9]*\s*)[,]?\s*)+)\)";
var IsMethodRegex = new Regex(isMethodRegex);

string s = "\t * Returns collection of active STOP transactions (transaction type 30) ";

Console.WriteLine(IsMethodRegex.IsMatch(s));

Technically you could split into four separate Regex's for each possibility to deal with this particular situation. However, as you attempt to deal with more and more complicated scenarios, you will likely run into this performance issue again and again, so this is probably not the ideal approach.

Java pattern matching going to infinite loop

11 votes

A friend gave me this piece of code and said there is a bug. And yes, this code runs for ever.

The answer I got is:

It runs for >10^15 years before printing anything.

public class Match {
     public static void main(String[] args) {
         Pattern p = Pattern.compile("(aa|aab?)+");
         int count = 0;
         for(String s = ""; s.length() < 200; s += "a")
             if (p.matcher(s).matches())
                 count++;
         System.out.println(count);
     }
}

I didn't really understand why am I seeing this behavior, I am new to java, do you have any suggestions?

The pattern you are using is known as an evil regex according to OWASP (they know what they're talking about most of the time):

https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS

It basically matches aa OR aa or aab (since the b is optional by addition of ?)

A Regex like this is vulnerable to a ReDoS or Regex Denial of Service Attack.

So yes, sort out what you want to match. I suggest in the above example you should simply match aa, no need for groups, repitition or alternation:

Pattern p = Pattern.compile("aa");

Also as someone pointed out, who now deleted his post, you should not use += to append to strings. You should use a StringBuffer instead:

public class Match {
  public static void main(String[] args) {
    Pattern p = Pattern.compile("aa");
    StringBuffer buffy = new StringBuffer(200);
    int count = 0;
    for(int i = 0; i < 200; i++){
      buffy.append("a")
      if (p.matcher(buffy.toString()).matches()){
        count++;
      }
    }
    System.out.println(count);
  }
}

What does the python re.template function do?

8 votes

While using the re module in ipython I noticed an undocumented template function:

In [420]: re.template?
Type:           function
Base Class:     <type 'function'>
String Form:    <function template at 0xb7eb8e64>
Namespace:      Interactive
File:           /usr/tideway/lib/python2.7/re.py
Definition:     re.template(pattern, flags=0)
Docstring:
    Compile a template pattern, returning a pattern object

there is also a flag re.TEMPLATE and its alias re.T.

None of this is mentioned in the docs for either 2.7 or 3.2. What do they do? Are they obsolete hangovers from an earlier version of Python, or an experimental feature that may be officially added in the future?

In CPython 2.7.1, re.template() is defined as:

def template(pattern, flags=0):
    "Compile a template pattern, returning a pattern object"
    return _compile(pattern, flags|T)

_compile calls _compile_typed which calls sre_compile.compile. The only place in the code where the T (aka SRE_FLAG_TEMPLATE) flag is checked is in that function:

    elif op in REPEATING_CODES:
        if flags & SRE_FLAG_TEMPLATE:
            raise error, "internal: unsupported template operator"
            emit(OPCODES[REPEAT])
            skip = _len(code); emit(0)
            emit(av[0])
            emit(av[1])
            _compile(code, av[2], flags)
            emit(OPCODES[SUCCESS])
            code[skip] = _len(code) - skip
    ...

This would have the effect of disabling all repetition operators (*, +, ?, {} etc):

In [10]: re.template('a?')
---------------------------------------------------------------------------
.....
error: internal: unsupported template operator

The way the code is structured (the unconditional raise preceding a bunch of dead code) makes me think that the feature was either never fully implemented, or has been turned off due to some problems. I can only guess what the intended semantics may have been.

The end result is that the function does nothing useful.

non-greedy matching in Scala RegexParsers

8 votes

Suppose I'm writing a rudimentary SQL parser in Scala. I have the following:

class Arith extends RegexParsers {
    def selectstatement: Parser[Any] = selectclause ~ fromclause
    def selectclause: Parser[Any] = "(?i)SELECT".r ~ tokens
    def fromclause: Parser[Any] = "(?i)FROM".r ~ tokens
    def tokens: Parser[Any] = rep(token) //how to make this non-greedy?
    def token: Parser[Any] = "(\\s*)\\w+(\\s*)".r
}

When trying to match selectstatement against SELECT foo FROM bar, how do I prevent the selectclause from gobbling up the entire phrase due to the rep(token) in ~ tokens?

In other words, how do I specify non-greedy matching in Scala?

To clarify, I'm fully aware that I can use standard non-greedy syntax (*?) or (+?) within the String pattern itself, but I wondered if there's a way to specify it at a higher level inside def tokens. For example, if I had defined token like this:

def token: Parser[Any] = stringliteral | numericliteral | columnname

Then how can I specify non-greedy matching for the rep(token) inside def tokens?

Not easily, because a successful match is not retried. Consider, for example:

object X extends RegexParsers {
  def p = ("a" | "aa" | "aaa" | "aaaa") ~ "ab"
}

scala> X.parseAll(X.p, "aaaab")
res1: X.ParseResult[X.~[String,String]] = 
[1.2] failure: `ab' expected but `a' found

aaaab
 ^

The first match was successful, in parser inside parenthesis, so it proceeded to the next one. That one failed, so p failed. If p was part of alternative matches, the alternative would be tried, so the trick is to produce something that can handle that sort of thing.

Let's say we have this:

def nonGreedy[T](rep: => Parser[T], terminal: => Parser[T]) = Parser { in =>
  def recurse(in: Input, elems: List[T]): ParseResult[List[T] ~ T] =
    terminal(in) match {
      case Success(x, rest) => Success(new ~(elems.reverse, x), rest)
      case _ => 
        rep(in) match {
          case Success(x, rest) => recurse(rest, x :: elems)
          case ns: NoSuccess    => ns
        }
    }

  recurse(in, Nil)
}  

You can then use it like this:

def p = nonGreedy("a", "ab")

By the way,I always found that looking at how other things are defined is helpful in trying to come up with stuff like nonGreedy above. In particular, look at how rep1 is defined, and how it was changed to avoid re-evaluating its repetition parameter -- the same thing would probably be useful on nonGreedy.

Here's a full solution, with a little change to avoid consuming the "terminal".

trait NonGreedy extends Parsers {
    def nonGreedy[T, U](rep: => Parser[T], terminal: => Parser[U]) = Parser { in =>
      def recurse(in: Input, elems: List[T]): ParseResult[List[T]] =
        terminal(in) match {
          case _: Success[_] => Success(elems.reverse, in)
          case _ => 
            rep(in) match {
              case Success(x, rest) => recurse(rest, x :: elems)
              case ns: NoSuccess    => ns
            }
        }

      recurse(in, Nil)
    }  
}

class Arith extends RegexParsers with NonGreedy {
    // Just to avoid recompiling the pattern each time
    val select: Parser[String] = "(?i)SELECT".r
    val from: Parser[String] = "(?i)FROM".r
    val token: Parser[String] = "(\\s*)\\w+(\\s*)".r
    val eof: Parser[String] = """\z""".r

    def selectstatement: Parser[Any] = selectclause(from) ~ fromclause(eof)
    def selectclause(terminal: Parser[Any]): Parser[Any] = 
      select ~ tokens(terminal)
    def fromclause(terminal: Parser[Any]): Parser[Any] = 
      from ~ tokens(terminal)
    def tokens(terminal: Parser[Any]): Parser[Any] = 
      nonGreedy(token, terminal)
}

Can I use $1 in regex_replace?

7 votes

From reading the FCD for regex_replace (28.11.4) I can only guess that the function can also use parts of the original string for replacing? I can not test it with my gcc, is this correct?

using namespace std;
regex rx{ R"((\d+)-(\d+))" }; // regex: (\d+)-(\d+)
cout << regex_replace("123-456", rx, "b: $2, a:$1");
// "b: 456, a:123"

As you can see, I assume $1 and $2 refer to the "()" capturing groups (and not \1 and \2 like elsewhere).

Update. So, I guess this is a two-part question

  • Is this use of capturing groups in the replacement text supported at all?
  • Is the default ECMAScript syntax using $n? Or \n?

Table 139 in the C++ 2011 FDIS lists two constants that can be used to affect the rules used for the format string in regex_replace, format_default and format_sed. format_default is described as using "the rules used by the ECMAScript replace function in ECMA-262, part 15.5.4.11 String.prototype.replace." This standard does indicate the use of $ for backreferences. See: ECMA-262

Using the format_sed flag instead uses the rules for the sed utility in POSIX. Sed doesn't appear to support $ backreferences.

Calculate if two infinite regex solution sets don't intersect

7 votes

In calculate if two arbitrary regular expressions have any overlapping solutions (assuming it's possible).

For example these two regular expressions can be shown to have no intersections by brute force because the two solution sets are calculable because it's finite.

^1(11){0,1000}$ ∩     ^(11){0,1000}$        = {}
{1,111, ..., ..111} ∩ {11,1111, ..., ...11} = {}
{}                                          = {}

But replacing the {0,1000} by * remove the possibility for a brute force solution, so a smarter algorithm must be created.

^1(11)*$ ∩ ^(11)*$ = {}
{1,^1(11)*$} ∩ {^(11)*$} = {}
{1,^1(11)*$} ∩ {11,^11(11)*$} = {}
{1,111,^111(11)*$} ∩ {11,^(11)*$} = {}
.....

In another similar question one answer was to calculate the intersection regex. Is that possible to do? If so how would one write an algorithm to do such a thing?

I think this problem might be domain of the halting problem.

EDIT:

I've used the accepted solution to create the DFAs for the example problem. It's fairly easy to see how you can use a BFS or DFS on the graph of states for M_3 to determine if a final state from M_3 is reachable.

DFA solution

It is not in the domain of the halting problem; deciding whether the intersection of regular languages is empty or not can be solved as follows:

  1. Construct a DFA M1 for the first language.
  2. Construct a DFA M2 for the second language. Hint: Kleene's Theorem and Power Set machine construction
  3. Construct a DFA M3 for M1 intersect M2. Hint: Cartesian Product Machine construction
  4. Determine whether L(M3) is empty. Hint: If M3 has n states, and M3 doesn't accept any strings of length no greater than n, then L(M3) is empty... why?

Each of those things can be algorithmically done and/or checked. Also, naturally, once you have a DFA recognizing the intersection of your languages, you can construct a regex to match the language. And if you start out with a regex, you can make a DFA. This is definitely computable.

EDIT:

So to build a Cartesian Product Machine, you need two DFAs. Let M1 = (E, q0, Q1, A1, f1) and M2 = (E, q0', Q2, A2, f2). In both cases, E is the input alphabet, q0 is the start state, Q is the set of all states, A is the set of accepting states, and f is the transition function. Construct M3 where...

  1. E3 = E
  2. Q3 = Q1 x Q2 (ordered pairs)
  3. q0'' = (q0, q0')
  4. A3 = {(x, y) | x in A1 and y in A2}
  5. f3(s, (x, y)) = (f1(s, x), f2(s, y))

Provided I didn't make any mistakes, L(M3) = L(M1) intersect L(M2). Neat, huh?

Why does std::sub_match<T> publicly inherit from std::pair<T, T>?

7 votes

I was reading the documentation of std::sub_match<BidirectionalIterator> and saw that it publicly inherits from std::pair<BidirectionalIterator, BidirectionalIterator>. Since a sub_match is simply a pair of iterators into a sequence of characters, with some additional functions, I can understand that it is implemented with a pair, but why use public inheritance?

The problem with inheriting publicly from std::pair<T,U> is the same as inheriting publicly from most other standard classes: they are not meant to be manipulated polymorphically (notably they do not define a virtual destructor). Other members will also fail to work properly, namely the assignment operator and the swap member function (they will not copy the matched member of sub_match).

Why did Boost developers and then the committee decided to implement sub_match by inheriting publicly from pair instead of using composition (or private inheritance with using declarations if they wanted to keep member access through first and second)?

It's an interesting question. Presumably, they considered it safe because no one would ever dynamically allocate one anyway. About the only way you're going to get sub_match objects is as a return value from some of the functions of basic_regex, or as copies of other sub_match, and all of these will be either temporaries or local variables.

Note that it's not safe to keep sub_match objects around anyway, since they contain iterators whose lifetime... doesn't seem to be specified in the standard. Until the match_results object is reused? Until the string operand to the function which filled in the match_results object is destructed? Or?

I'd still have avoided the public inheritence. But in this case, it's not as dangerous as it looks, because there's really no reason you'd ever want to dynamically allocate a sub_match.

boost::regex vs std::regex - can't find empty() method?

6 votes

Replacing boost::regex with std::regex since we are using gcc 4.6 in the company I ran into an issue with empty () method of that class - it basically didn't make it from boost::regex into std::regex class. I am not sure whether this is a gcc's issue or this method didn't make it into C++11 standard at all, but that piece of code was heavily depending on this feature. So the question is - is there a way in C++11 std::regex to check if expression was ever set or I should stick to boost::regex for the rest of my life?

empty() got removed from std::regex a long time ago. N1507 (2003-09-16) was the original paper to suggest its removal (search for "What is an invalid/empty regular expression?"). This issue was directed at what then was std::tr1:regex. It was recorded in the LWG tr1 issues lists as issue 7.28 and contained the following resolution:

Discussed at Kona. The LWG agrees that the default constructor should be equivalent to construction from an empty string. Leaving this open for now partly because we need wording expressing that, and partly because it’s not clear that there’s any point to having theempty() member function in the first place.

N1711 (2004-11-04) was the first TR1 draft to lack basic_regex::empty(). From there it got imported from TR1 into C++11 with no further discussion.

sed join lines together

6 votes

what would be the sed (or other tool) command to join lines together in a file that do not end w/ the character '0'?

I'll have lines like this

412|n|Leader Building Material||||||||||d|d|20||0

which need to be left alone, and then I'll have lines like this for example (which is 3 lines, but only one ends w/ 0)

107|n|Knot Tying Tools|||||Knot Tying Tools

|||||d|d|0||0

which need to be joined/combined into one line

107|n|Knot Tying Tools|||||Knot Tying Tools|||||d|d|0||0

 sed ':a;/0$/{N;s/\n//;ba}'

In a loop (branch ba to label :a), if the current line ends in 0 (/0$/) append next line (N) and remove inner newline (s/\n//).

awk:

awk '{while(/0$/) { getline a; $0=$0 a; sub(/\n/,_) }; print}'

Perl:

perl -pe '$_.=<>,s/\n// while /0$/'

bash:

while read line; do 
    if [ ${line: -1:1} != "0" ] ; then 
        echo $line
    else echo -n $line
fi
done 

Remove <p><strong><br /> &nbsp;</strong></p> with XPATH

6 votes

I use xpath to remove <p>&nbsp;</p>

    $nodeList = $xpath->query("//p[text()=\"\xC2\xA0\"]"); # &nbsp;
    foreach($nodeList as $node) 
    {
        $node->parentNode->removeChild($node);
    }

but it does not remove this,

<p><strong><br /> &nbsp;</strong></p>

or this kind,

<p><strong>&nbsp;</strong></p>

How can I remove them?

Or maybe a regex that I should use?

Try with

$nodeList = $xpath->query("//p[normalize-space(.)=\"\xC2\xA0\"]"); # &nbsp;
foreach($nodeList as $node) 
{
    $node->parentNode->removeChild($node);
}

Quoting from the docs

The normalize-space function returns the argument string with whitespace normalized by stripping leading and trailing whitespace and replacing sequences of whitespace characters by a single space.

Returning overlapping regular expressions

5 votes

Is there a regular expression that will capture all instances of an expression, regardless of whether or not they overlap?

E.g. in /abc/def/ghi if I want to capture all strings beginning with /. The regex (/.*) only returns the entire string, but I'd want it to match on /def/ghi and /ghi as well.

Sure, match an empty string and place a look-ahead after it that captures /.* in a capturing group:

Matcher m = Pattern.compile("(?=(/.*))").matcher("/abc/def/ghi");
while(m.find()) {
  System.out.println(m.group(1));
}

would print:

/abc/def/ghi
/def/ghi
/ghi

How preg_match_all() processes strings?

5 votes

I'm still learning a lot about PHP and string alteration is something that is of interest to me. I've used preg_match before for things like validating an email address or just searching for inquiries.

I just came from this post What's wrong in my regular expression? and was curious as to why the preg_match_all function produces 2 strings, 1 w/ some of the characters stripped and then the other w/ the desired output.

From what I understand about the function is that it goes over the string character by character using the RegEx to evaluate what to do with it. Could this RegEx have been structured in such a way as to bypass the first array entry and just produce the desired result?

and so you don't have to go to the other thread

$str = 'text^name1^Jony~text^secondname1^Smith~text^email1^example-
        free@wpdevelop.com~';

preg_match_all('/\^([^^]*?)\~/', $str, $newStr);

for($i=0;$i<count($newStr[0]);$i++)
{
    echo $newStr[0][$i].'<br>';
}

echo '<br><br><br>';

for($i=0;$i<count($newStr[1]);$i++)
{
    echo $newStr[1][$i].'<br>';
} 

This will output

^Jony~
^Smith~
^example-free@wpdevelop.com~


Jony
Smith
example-free@wpdevelop.com

I'm curious if the reason for 2 array entries was due to the original sytax of the string or if it is the normal processing response of the function. Sorry if this shouldn't be here, but I'm really curious as to how this works.

thanks, Brodie

It's standard behavior for preg_match and preg_match_all - the first string in the "matched values" array is the FULL string that was caught by the regex pattern. The subsequent array values are the 'capture groups', whose existence depends on the placement/position of () pairs in the regex pattern.

In your regex's case, /\^([^^]*?)\~/, the full matching string would be

^   Jony    ~
|     |     |
^  ([^^]*?) ~   -> $newstr[0] = ^Jony~
                -> $newstr[1] = Jony (due to the `()` capture group).

How to split a string with whitespace chars at the beginning?

5 votes

Quick example:

public class Test {
    public static void main(String[] args) {
        String str = "   a b";
        String[] arr = str.split("\\s+");
        for (String s : arr)
            System.out.println(s);
    }
}

I want the array arr to contain 2 elements: "a" and "b", but in the result there are 3 elements: "" (empty string), "a" and "b". What should I do to get it right?

Kind of a cheat, but replace:

String str = "   a b";

with

String str = "   a b".trim();