Best regex questions in June 2012

Regex with -, ::, ( and )

28 votes

I need to split the string

(age-is-25::OR::last_name-is-qa6)::AND::(age-is-20::OR::first_name-contains-test)

into

string[0] = (age-is-25::OR::last_name-is-qa6)

string[1] = AND

string[2] = (age-is-20::OR::first_name-contains-test)

I tried writing so many regex expressions, but nothing works as expected.

With the following regex, Matcher.groupCount() returns 2, but when I add the groups to an ArrayList the elements are all null.

Pattern pattern = Pattern.compile("(\\)::)?|(::\\()?");

I tried to split it using ):: or ::(.

I know the regex looks too stupid, but being a beginner this is the best I could write.

You can use positive lookahead and lookbehind to split on :: only where it directly follows a closing parenthesis or directly precedes an opening one.

String str = "(age-is-25::OR::last_name-is-qa6)::AND::(age-is-20::OR::first_name-contains-test)";

for (String s : str.split("(?<=\\))::|::(?=\\()"))
    System.out.println(s);

Outputs:

(age-is-25::OR::last_name-is-qa6)
AND
(age-is-20::OR::first_name-contains-test)

Just a note, however: it seems like you are parsing some kind of recursive language. Regular expressions are not good at this. If you are doing advanced parsing, I would recommend looking at other parsing methods.
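If the nesting ever goes deeper than one level, a small scanner that tracks parenthesis depth may be easier to trust than a regex. The sketch below is in Python rather than Java, and split_top_level is a hypothetical helper name, not part of any library. It splits on :: only where the parenthesis depth is zero.

```python
def split_top_level(text, sep="::"):
    """Split text on sep, ignoring separators nested inside parentheses."""
    parts, depth, start, i = [], 0, 0, 0
    while i < len(text):
        ch = text[i]
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif depth == 0 and text.startswith(sep, i):
            # Separator at the top level: close the current part and skip it.
            parts.append(text[start:i])
            i += len(sep)
            start = i
            continue
        i += 1
    parts.append(text[start:])
    return parts


expr = "(age-is-25::OR::last_name-is-qa6)::AND::(age-is-20::OR::first_name-contains-test)"
for part in split_top_level(expr):
    print(part)
```

Unlike the lookaround split, this keeps working when a group itself contains parenthesised sub-groups.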

Find longest repetitive sequence in a string

19 votes

I need to find the longest sequence in a string with the caveat that the sequence must be repeated three or more times. So, for example, if my string is:

fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld

then I would like the value "helloworld" to be returned.

I know of a few ways of accomplishing this but the problem I'm facing is that the actual string is absurdly large so I'm really looking for a method that can do it in a timely fashion.

This problem is a variant of the longest repeated substring problem, and there is an O(n)-time algorithm for solving it that uses suffix trees. The idea (as suggested by Wikipedia) is to construct a suffix tree (time O(n)), annotate every node with the number of leaves below it (time O(n) using a DFS), and then find the internal node with the greatest string depth that has at least three leaf descendants (time O(n) using a DFS). Each leaf below a node is one occurrence of the substring that node spells out, so that node gives the answer. The overall algorithm takes time O(n).

That said, suffix trees are notoriously hard to construct, so you would probably want to find a Python library that implements suffix trees for you before attempting this implementation. A quick Google search turns up this library, though I'm not sure whether this is a good implementation.

Hope this helps!
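For inputs that are not enormous, something much simpler than a suffix tree may serve as a baseline. The sketch below is not the O(n) algorithm described above: it binary-searches the answer's length and counts substrings in a dictionary, so time and memory can approach quadratic on bad inputs. The function name longest_repeated is made up for illustration. The binary search is valid because if a substring occurs three times, so does every prefix of it.

```python
from collections import defaultdict


def longest_repeated(text, min_count=3):
    """Return a longest substring occurring at least min_count times ('' if none)."""

    def find(length):
        # Count every substring of this length; stop at the first frequent one.
        counts = defaultdict(int)
        for i in range(len(text) - length + 1):
            piece = text[i:i + length]
            counts[piece] += 1
            if counts[piece] >= min_count:
                return piece
        return None

    best = ""
    lo, hi = 1, len(text)
    while lo <= hi:
        mid = (lo + hi) // 2
        hit = find(mid)
        if hit is not None:
            best = hit       # a repeat of this length exists; try longer
            lo = mid + 1
        else:
            hi = mid - 1     # nothing this long repeats; try shorter
    return best


sample = ("fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1a"
          "helloworldgafgfa4564534321fadghelloworld")
print(longest_repeated(sample))
```

Occurrences are allowed to overlap in this sketch; whether that is acceptable depends on the actual data.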

Why does the same RegExp behave differently?

12 votes

Possible Duplicate:
Interesting test of Javascript RegExp
Regular expression test can't decide between true and false (JavaScript)

Example of the issue: when run inline, the results are as I would expect, but when the RegExp is stored in a variable it skips the middle span element.

// Inline RegExp
function getToggleClasses() {
  var toggler = [],
      elements = document.getElementsByTagName("*"),
      i=0,
      len = elements.length;

  for (i; i < len; i++) {
    if (/toggler/g.test(elements[i].className)) {
      toggler.push(elements[i]);
    }
  }

  document.getElementById('results').innerHTML += "<br />Inline: " + toggler.length;
}

// Variable
function getToggleClasses2() {
  var toggler = [],
      elements = document.getElementsByTagName("*"),
      tester = /toggler/g,
      i=0,
      len = elements.length;

  for (i; i < len; i++) {
    if (tester.test(elements[i].className)) {
      toggler.push(elements[i]);
    }
  }

  document.getElementById('results').innerHTML += "<br />Variable: " + toggler.length;
}

Markup:

<span class="toggler">A</span>
<span class="toggler">B</span>
<span class="toggler">C</span>

Given: I understand there is no reason to use a RegExp to do this comparison and I also understand how great libraries such as jQuery are. I also know that the g is not needed in this case.

I can't understand why these two methods should ever return different results.

RegExp instances are stateful, so reusing them can cause unexpected behavior. In this particular case, it's because the instance is global, meaning:

that the regular expression should be tested against all possible matches in a string.

That's not the only difference caused by using g, however. From RegExp.test @ MDN:

As with exec (or in combination with it), test called multiple times on the same global regular expression instance will advance past the previous match.


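Here the first test() call succeeds and leaves lastIndex at 7, the end of "toggler". The second call therefore starts searching at index 7 of the next className, finds nothing, returns false and resets lastIndex to 0, so the third call succeeds again. The inline literal does not show this because (in ES5 and later) it produces a fresh RegExp object on every iteration.

To fix it, remove the g flag, or set lastIndex to 0 before each test (thanks, @zzzzBov).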

Using regular expressions to find a word with the five letters abcde, each letter appearing exactly once, in any order, with no breaks in between

12 votes

For example, the word debacle would work because of debac, but seabed would not work because: 1. there is no c in any 5-character sequence that can be formed, and 2. the letter e appears twice. As another example, feedback would work because of edbac. And remember, the solution must be done using only regular expressions.

A strategy I attempted to implement was: match the first letter if it's inside [a-e], and remember it. Then find the next letter in [a-e] but not the first letter. And so on. I wasn't sure what the syntax was (or even if some syntax existed) so my code didn't work:

open(DICT, "dictionary.txt");
@words = <DICT>;

foreach my $word(@words){

if ($word =~ /([a-e])([a-e^\1])([a-e^\1^\2])([a-e^\1^\2^\3])([a-e^\1^\2^\3^\4])/
){
    print $word;
}
}

I was also thinking of using (?=regex) and \G but I wasn't sure how it would work out.

/
   (?= .{0,4}a )
   (?= .{0,4}b )
   (?= .{0,4}c )
   (?= .{0,4}d )
   (?= .{0,4}e )
/xs

It probably results in faster matching to generate a pattern from all the permutations.

use Algorithm::Loops qw( NextPermute );
my @pats;
my @chars = 'a'..'e';
do { push @pats, quotemeta join '', @chars; } while NextPermute(@chars);
my $re = join '|', @pats;

abcde|abced|abdce|abdec|abecd|abedc|acbde|acbed|acdbe|acdeb|acebd|acedb|adbce|adbec|adcbe|adceb|adebc|adecb|aebcd|aebdc|aecbd|aecdb|aedbc|aedcb|bacde|baced|badce|badec|baecd|baedc|bcade|bcaed|bcdae|bcdea|bcead|bceda|bdace|bdaec|bdcae|bdcea|bdeac|bdeca|beacd|beadc|becad|becda|bedac|bedca|cabde|cabed|cadbe|cadeb|caebd|caedb|cbade|cbaed|cbdae|cbdea|cbead|cbeda|cdabe|cdaeb|cdbae|cdbea|cdeab|cdeba|ceabd|ceadb|cebad|cebda|cedab|cedba|dabce|dabec|dacbe|daceb|daebc|daecb|dbace|dbaec|dbcae|dbcea|dbeac|dbeca|dcabe|dcaeb|dcbae|dcbea|dceab|dceba|deabc|deacb|debac|debca|decab|decba|eabcd|eabdc|eacbd|eacdb|eadbc|eadcb|ebacd|ebadc|ebcad|ebcda|ebdac|ebdca|ecabd|ecadb|ecbad|ecbda|ecdab|ecdba|edabc|edacb|edbac|edbca|edcab|edcba

(This will get optimised into a trie in Perl 5.10+. Before 5.10, use Regexp::List.)
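Both forms can be tried against the words from the question. This is a Python sketch rather than Perl, using itertools.permutations in place of Algorithm::Loops; the constant and function names are illustrative only.

```python
import re
from itertools import permutations

# Lookahead form: each of a..e must occur somewhere in the next five characters,
# which forces those five characters to be a permutation of "abcde".
LOOKAHEADS = re.compile("".join(f"(?=.{{0,4}}{c})" for c in "abcde"), re.S)

# Alternation form: all 120 orderings joined with "|".
ALTERNATION = re.compile("|".join("".join(p) for p in permutations("abcde")))


def has_run(word):
    """True if word contains a, b, c, d, e consecutively in some order."""
    return ALTERNATION.search(word) is not None


for word in ("debacle", "seabed", "feedback"):
    print(word, bool(LOOKAHEADS.search(word)), has_run(word))
```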

what does the regular expression (?<!-) mean

10 votes

I'm trying to understand a piece of code and came across this regular expression used in PHP's preg_replace function.

'/(?<!-)color[^{:]*:[^{#]*$/i'

This bit... (?<!-) doesn't appear in any of my regexp manuals. Does anyone know what it means, please? (Google doesn't return anything; I don't think symbols work in Google.)

The ?<! at the start of a parenthetical group is a negative lookbehind. It asserts that the word color (strictly, the position where its c begins) is not immediately preceded by a - character.

So, for a more concrete example, it would match color in the strings:

color
+color
someTextColor

But it will fail on something like -color or background-color. Also note that the engine will not technically "match" whatever precedes the c, it simply asserts that it is not a hyphen. This can be an important distinction depending on the context (illustrated on Rubular with a trivial example; note that only the b in the last string is matched, not the preceding letter).
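The same assertion can be checked quickly outside PHP. The sketch below uses Python's re module, whose lookbehind syntax is the same for this case, and only the (?<!-)color part of the original pattern.

```python
import re

# Case-insensitive, like the /i flag on the PHP pattern.
pattern = re.compile(r"(?<!-)color", re.IGNORECASE)

for text in ["color", "+color", "someTextColor", "-color", "background-color"]:
    m = pattern.search(text)
    # The lookbehind consumes nothing: a match, if any, starts at the "c".
    print(f"{text!r}: {m.group(0) if m else None}")
```

Note that for "+color" the match is just "color": the + is inspected but is not part of the result.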

C++ regex not understanding

8 votes

The following outputs ">Hut" where I expect it to output "Hut". I know that .* is greedy but > must be matched and it is outside of the capture group so why is it in my submatch?

#include <string>
#include <regex>
#include <iostream>

using namespace std;

int main() {
        regex my_r(".*>(.*)");
        string temp(R"~(cols="64">Hut)~");
        smatch m;
        if (regex_match(temp, m, my_r)) {
                cout << m[1] << endl;
        }
}

This is a bug in libstdc++'s implementation. Compare these two:

#include <string>
#include <regex>
#include <boost/regex.hpp>
#include <iostream>

int main() {
    {
        using namespace std;
        regex my_r("(.*)(6)(.*)");
        smatch m;
        if (regex_match(std::string{"123456789"}, m, my_r)) {
            std::cout << m.length(1) << ", "
                      << m.length(2) << ", "
                      << m.length(3) << std::endl;
        }
    }

    {
        using namespace boost;
        regex my_r("(.*)(6)(.*)");
        smatch m;
        if (regex_match(std::string{"123456789"}, m, my_r)) {
            std::cout << m.length(1) << ", "
                      << m.length(2) << ", "
                      << m.length(3) << std::endl;

        }
    }

    return 0;
}

If you compile with gcc, the first one (libstdc++) returns the totally wrong result 9, -2, 4 and the second one (boost's implementation) returns 5, 1, 3 as expected.

If you compile with clang + libc++, your code works fine.

(Note that libstdc++'s regex implementation is only "partially supported", as described in http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52719.)
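As an independent reference point, a backtracking engine outside C++ gives the same group lengths as Boost. The sketch below uses Python's re module purely as a cross-check; it says nothing about any particular libstdc++ version.

```python
import re

# Same pattern and input as the C++ comparison; fullmatch mirrors regex_match.
m = re.fullmatch(r"(.*)(6)(.*)", "123456789")
lengths = [len(g) for g in m.groups()]
print(lengths)

# The pattern from the question: the ">" sits outside the capture group.
hut = re.fullmatch(r".*>(.*)", 'cols="64">Hut').group(1)
print(hut)
```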

String's replaceAll() method and escape characters

7 votes

The line

System.out.println("\\");

prints a single back-slash (\). And

System.out.println("\\\\");

prints double back-slashes (\\). Understood!

But why in the following code:

class ReplaceTest
{
    public static void main(String[] args)
    {
        String s = "hello.world";
        s = s.replaceAll("\\.", "\\\\");
        System.out.println(s);
    }
}

is the output:

hello\world

instead of

hello\\world

After all, the replaceAll() method is replacing a dot (\\.) with (\\\\).

Can someone please explain this?

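When replacing with regular expressions, the replacement string is not taken literally: it may refer to captured groups (in Java, $1, $2 and so on), and a backslash escapes the character that follows it.

This, however, means that the backslash is a special character in the replacement, so if you actually want to use a literal backslash it needs to be escaped.

Which means it effectively needs to be escaped twice in Java source: first for the string literal parser, then for the replacement parser. The literal "\\\\" is the two-character string \\, which replaceAll() reads as one escaped backslash. To produce two backslashes, write "\\\\\\\\" or pass the replacement through Matcher.quoteReplacement().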
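The same two-level escaping shows up in other regex APIs. As an analogy only (this is Python's re.sub, not Java's replaceAll), the sketch below shows a replacement template collapsing doubled backslashes, and a function replacement being used verbatim.

```python
import re

s = "hello.world"

# The raw string r"\\" is two characters; the template parser turns it into one.
one = re.sub(r"\.", r"\\", s)

# Four characters in the template become two backslashes in the output.
two = re.sub(r"\.", r"\\\\", s)

# A function's return value is inserted as-is, with no template processing.
literal = re.sub(r"\.", lambda m: "\\\\", s)

print(one)
print(two)
print(literal)
```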

Is there a regular expression way to replace a set of characters with another set (like shell tr command)?

7 votes

The shell tr command supports replacing one set of characters with another set. For example, echo hello | tr [a-z] [A-Z] will translate hello to HELLO.

In Java, however, I must replace each character individually, like the following:

"10 Dogs Are Racing"
    .replaceAll ("0", "0")
    .replaceAll ("1", "1")
    .replaceAll ("2", "2")
    // ...
    .replaceAll ("9", "9")
    .replaceAll ("A", "A")
    // ...
;

The apache-commons-lang library provides a convenient replaceChars method to do such replacement.

// half-width to full-width
System.out.println
(
    org.apache.commons.lang.StringUtils.replaceChars
    (
        "10 Dogs Are Racing",
        "0123456789ABCDEFEGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz",
        "0123456789ABCDEFEGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    )
);
// Result:
// １０ Ｄｏｇｓ Ａｒｅ Ｒａｃｉｎｇ

But as you can see, the searchChars/replaceChars arguments are sometimes too long (and too tedious: try to find the duplicated character in them), even though they could be expressed by a simple pair of character classes, [0-9A-Za-z]/[０-９Ａ-Ｚａ-ｚ]. Is there a regular expression way to achieve that?

While there is no direct way to do this, constructing your own utility function to use in combination with replaceChars is relatively simple. The version below accepts simple character classes, without [ or ]; it does not do class negation ([^a-z]).

For your use case, you could do:

StringUtils.replaceChars(str, charRange("0-9A-Za-z"), charRange("０-９Ａ-Ｚａ-ｚ"))

Code:

import java.util.regex.PatternSyntaxException;

public static String charRange(String str) {
    StringBuilder ret = new StringBuilder();
    char ch;
    for(int index = 0; index < str.length(); index++) {
        ch = str.charAt(index);
        if(ch == '\\') {
            if(index + 1 >= str.length()) {
                throw new PatternSyntaxException(
                    "Malformed escape sequence.", str, index
                );
            }
            // special case for escape character, consume next char:
            index++;
            ch = str.charAt(index);
        }
        if(index + 1 >= str.length() || str.charAt(index + 1) != '-') {
            // this was a single char, or the last char in the string
            ret.append(ch);
        } else {
            if(index + 2 >= str.length()) {
                throw new PatternSyntaxException(
                    "Malformed character range.", str, index + 1
                );
            }
            // this char was the beginning of a range
            for(char r = ch; r <= str.charAt(index + 2); r++) {
                ret.append(r);
            }
            index = index + 2;
        }
    }
    return ret.toString();
}

Produces:

0-9A-Za-z : 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
０-９Ａ-Ｚａ-ｚ : ０１２３４５６７８９ＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚ
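For comparison, the same idea in Python needs very little code, because str.maketrans and str.translate already behave like tr. The sketch below is not a port of the Java method: char_range is a hypothetical helper that handles plain a-z style ranges only (no escapes, no negation), and the full-width set is derived from the fixed 0xFEE0 offset between ASCII and the full-width forms.

```python
def char_range(spec):
    """Expand a simple class body such as "0-9A-Za-z" into its characters."""
    out, i = [], 0
    while i < len(spec):
        if i + 2 < len(spec) and spec[i + 1] == "-":
            # "x-y": append every character from x to y inclusive.
            out.extend(chr(c) for c in range(ord(spec[i]), ord(spec[i + 2]) + 1))
            i += 3
        else:
            out.append(spec[i])
            i += 1
    return "".join(out)


half = char_range("0-9A-Za-z")
# Full-width forms sit 0xFEE0 code points above their ASCII counterparts.
full = "".join(chr(ord(c) + 0xFEE0) for c in half)
table = str.maketrans(half, full)

print("10 Dogs Are Racing".translate(table))
```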

Are these regex patterns different?

7 votes

A website I've been working on will not match data using a PHP (preg_match) regex pattern that seems to work everywhere else I've tested it. That pattern is:

<channel.*?>(.*?)</channel>

It is matched against an RSS feed that has a channel tag.

Now the server I am working on will only produce the correct result if I change it to:

<channel.*?>(.*)?</channel>

My regex isn't the best in the world so I'm wondering if anyone can tell me if there is any significant difference between the two patterns.

Small note: I realize it would probably be better to use SimpleXML etc, but this regex is from a previous application and for various reasons I am not allowed to change it.

Thanks in advance for any insights.

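The group (.*) says "capture zero or more characters", and the trailing ? makes that whole group optional. The star itself stays greedy, so it grabs as much as it can and backtracks only as far as needed to find </channel>. By contrast, (.*?) uses a "lazy star" ( *? ), which first tries to match nothing at all and then expands one character at a time until the rest of the pattern matches. Check this for more information.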

To understand the difference between a normal (greedy) star and a lazy star, look at the following example in PHP and notice that the greedy star makes the largest match it can with the pattern it is given, while the lazy star "gives up" as soon as it has satisfied the match pattern:

$inputs = array( 'axb' , 'axxxb' , 'axbxb' , 'axbxxxb' );

// GREEDY STAR (NORMAL)
foreach( $inputs as $input )
{
  preg_match( '/a.*b/' , $input , $greedy );
  $greedy_matches[] = $greedy[0];
}

print "<pre>";
print_r( $greedy_matches );
print "</pre>";
/* 
Array
(
    [0] => axb
    [1] => axxxb
    [2] => axbxb
    [3] => axbxxxb
)
*/



// LAZY STAR
foreach( $inputs as $input )
{
  preg_match( '/a.*?b/' , $input , $lazy );
  $lazy_matches[] = $lazy[0];
}

print "<pre>";
print_r( $lazy_matches );
print "</pre>";
/* 
Array
(
    [0] => axb
    [1] => axxxb
    [2] => axb
    [3] => axb
)
*/
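Applied to the two patterns from the question, the difference becomes visible as soon as the input contains more than one closing tag. The sketch below uses Python's re module and a made-up two-channel string; a real RSS feed normally has a single channel element, so this only illustrates the quantifier behaviour.

```python
import re

# Hypothetical input with two channel elements on one line.
feed = "<channel>a</channel><channel>b</channel>"

# Lazy star: stops at the first </channel>.
lazy = re.search(r"<channel.*?>(.*?)</channel>", feed).group(1)

# Optional greedy group: runs to the last </channel>.
greedy_optional = re.search(r"<channel.*?>(.*)?</channel>", feed).group(1)

print(lazy)
print(greedy_optional)
```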

IE8 parses this simple regex differently from all other browsers

7 votes

I am trying to use this function to create 2 results from value

function split(val){
  return val.split( /,\s*/ );
};
value = "Jim, ";
var terms = split( value );

terms;

All other browsers including IE9, will produce terms = ["Jim", ""]

However, IE8 and probably IE7 produces this : terms = ["Jim"]

Does anyone have any suggestions or alternatives that could possibly work for IE8 ?

You might be better off going with:

val.split(',')

This seems to work consistently in all browsers.

Any trailing whitespace after the commas still has to be stripped off afterwards. Something along the lines of:

for (var i = 0; i < terms.length; i++) {
    terms[i] = terms[i].replace(/^\s\s*/, '').replace(/\s\s*$/, '');
}

Apparently, in IE8 and earlier, empty-string matches are ignored by split() when a regex parameter is used. A string parameter works fine:

'axx'.split('x')    // All browsers: ["a", "", ""]
'axx'.split(/x/)    // IE6/7/8: ["a"], all other browsers: ["a", "", ""]

is there any compiler that can convert regexp to fsm? or could convert to human words?

7 votes

Something that can convert

r"a+|(?:ab+c)"

to

{
    (1, 'a') : [2, 3],
    (2, 'a') : [2],
    (3, 'b') : [4, 3],
    (4, 'c') : [5]
}

or something similar

and accepting in 2 or 5

i have some code that will do this. it's not well documented and it's not supported, but if you're interested you're welcome to look at it.

the library is called rxpy and the repository is http://code.google.com/p/rxpy

the routine that does parsing is parse_pattern at http://code.google.com/p/rxpy/source/browse/rxpy/src/rxpy/parser/pattern.py#871

if you call repr(...) on the result from that you get a graph in the "dot language" - https://en.wikipedia.org/wiki/DOT_language

for example, see the tests at http://code.google.com/p/rxpy/source/browse/rxpy/src/rxpy/parser/_test/parser.py#47

to show what i mean, let's look at the test at http://code.google.com/p/rxpy/source/browse/rxpy/src/rxpy/parser/_test/parser.py#234 which is for 'ab*c':

"""digraph {
 0 [label="a"]
 1 [label="...*"]
 2 [label="b"]
 3 [label="c"]
 4 [label="Match"]
 0 -> 1
 1 -> 2
 1 -> 3
 3 -> 4
 2 -> 1
}"""

that starts at 0 which can match an "a" to go to state 1. from there you can match a "b" to go to state 2 or a "c" to go to state 3. state 2 then has a transition back to 1 that can consume another "b", etc etc. it's a bit ugly to read by hand, but when the test fails you get a little graph displayed on the screen.

the library also has various "engines" which will match strings against this graph (and so do regular expression matching). but it is much slower than the python library (because it is pure python).

this is not supported and may not be very clear - sorry - but i think it's close to what you want and you're welcome to use it if it's useful (MPL or LGPL licence).
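A table in the shape the question asks for can be executed directly as a nondeterministic automaton. The sketch below is independent of rxpy: it copies the question's transition table by hand and tracks the set of states reachable after each character. The names TRANSITIONS, ACCEPTING and accepts are illustrative.

```python
# Transition table for a+|(?:ab+c), copied from the question.
TRANSITIONS = {
    (1, "a"): [2, 3],
    (2, "a"): [2],
    (3, "b"): [4, 3],
    (4, "c"): [5],
}
ACCEPTING = {2, 5}


def accepts(text, start=1):
    """Run the NFA over text by tracking every state it could be in."""
    states = {start}
    for ch in text:
        states = {nxt for s in states for nxt in TRANSITIONS.get((s, ch), [])}
        if not states:
            return False  # no transition possible: reject early
    return bool(states & ACCEPTING)


for word in ("a", "aaa", "abc", "abbbc", "ab", "abca", ""):
    print(repr(word), accepts(word))
```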

Replace string inside tags?

6 votes

I want to replace the content inside some tags, e.g.:

<p>this it to be replaced</p>

I could extract the content between with groups like this, but can i actually replace the group?

str = str.replaceAll("<p>([^<]*)</p>", "replacement");

Change the regex to this:

(?<=<p>).*?(?=</p>)

ie

str = str.replaceAll("(?<=<p>).*?(?=</p>)", "replacement");

This uses a "look behind" and a "look ahead" to assert, but not capture, the input before and after the (non-greedy) match, so only the text between the tags is replaced.

Just in case anyone is wondering, this answer is different to dacwe's: His uses unnecessary brackets. This answer is the more elegant :)
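Both styles, lookarounds and captured groups, give the same result on the sample input. The sketch below uses Python's re.sub instead of Java's replaceAll; the \g<1> group syntax is Python-specific (Java would use $1).

```python
import re

html = "<p>this it to be replaced</p>"

# Lookaround version: the tags are asserted but never consumed.
with_lookarounds = re.sub(r"(?<=<p>).*?(?=</p>)", "replacement", html)

# Group version: the tags are consumed, then written back via backreferences.
with_groups = re.sub(r"(<p>)[^<]*(</p>)", r"\g<1>replacement\g<2>", html)

print(with_lookarounds)
print(with_groups)
```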

Regular expression to allow alphanumeric, only one space and then alphanumeric

6 votes

I've searched and searched and tried many different ways, but I can't seem to figure this out. I'm looking for a way to only allow alphanumeric characters, then only one space, then alphanumeric characters. I'm sure it's easy, but I don't know it.

Examples of what I want:

    First Last     Allowed
    First La-st    Not Allowed
    FirstLast      Not Allowed
    First  Last    Not Allowed
    First La'st    Not allowed

I'd then like to remove the invalid characters from the string.

Please let me know if you need more information. Thanks a lot!

^[a-zA-Z0-9]+ [a-zA-Z0-9]+$

… should do it.
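The pattern can be checked against the examples from the question. The sketch below is in Python; the cleanup step is an assumption about what "remove the invalid characters" should mean (drop anything that is not alphanumeric or a space, then collapse runs of spaces), and both function names are made up.

```python
import re

# fullmatch makes the ^ and $ anchors unnecessary.
NAME = re.compile(r"[a-zA-Z0-9]+ [a-zA-Z0-9]+")


def is_valid(name):
    return NAME.fullmatch(name) is not None


def strip_invalid(name):
    """Assumed cleanup policy: drop disallowed characters, collapse space runs."""
    cleaned = re.sub(r"[^a-zA-Z0-9 ]", "", name)
    return re.sub(r" {2,}", " ", cleaned).strip()


for name in ["First Last", "First La-st", "FirstLast", "First  Last", "First La'st"]:
    print(f"{name!r}: valid={is_valid(name)}, cleaned={strip_invalid(name)!r}")
```

A cleaned string is not guaranteed to be valid ("FirstLast" still has no space), so it may be worth re-running is_valid after the cleanup.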