Best regex questions in April 2012

PHP Regex Check if two strings share two common characters

12 votes

I'm just getting to know regular expressions, but after doing quite a bit of reading (and learning quite a lot), I still have not been able to figure out a good solution to this problem.

Let me be clear, I understand that this particular problem might be better solved not using regular expressions, but for the sake of brevity let me just say that I need to use regular expressions (trust me, I know there are better ways to solve this).

Here's the problem. I'm given a big file, each line of which is exactly 4 characters long.

This is a regex that defines "valid" lines:

"/^[AB][CD][EF][GH]$/m" 

In English: each line has either A or B at position 0, either C or D at position 1, either E or F at position 2, and either G or H at position 3. I can assume that each line will be exactly 4 characters long.

What I'm trying to do is given one of those lines, match all other lines that contain 2 or more common characters.

The below example assumes the following:

  1. $line is always a valid format
  2. BigFileOfLines.txt contains only valid lines

Example:

// Matches all other lines in string that share 2 or more characters in common
// with "$line"
function findMatchingLines($line, $subject) {
    $regex = "magic regex I'm looking for here";
    $matchingLines = array();
    preg_match_all($regex, $subject, $matchingLines);
    return $matchingLines;
}

// Example Usage
$fileContents = file_get_contents("BigFileOfLines.txt");
$matchingLines = findMatchingLines("ACFG", $fileContents);

/*
 * Desired return value (Note: this is an example set, there 
 * could be more or less than this)
 * 
 * BCEG
 * ADFG
 * BCFG
 * BDFG
*/

One way I know will work is to have a regex like the following (this regex would only work for "ACFG"):

"/^(?:AC.{2}|.CF.|.{2}FG|A.F.|A.{2}G|.C.G)$/m"

This works alright, and performance is acceptable. What bothers me about it, though, is that I have to generate it based on $line, where I'd rather have it be ignorant of what the specific parameter is. Also, this solution doesn't scale terribly well if the code is later modified to match, say, 3 or more characters, or if the size of each line grows from 4 to 16.

It just feels like there's something remarkably simple that I'm overlooking. Also seems like this could be a duplicate question, but none of the other questions I've looked at really seem to address this particular problem.

Thanks in advance!

Update:

It seems that the norm with Regex answers is for SO users to simply post a regular expression and say "This should work for you."

I think that's kind of a halfway answer. I really want to understand the regular expression, so if you can include in your answer a thorough (within reason) explanation of why that regular expression:

  • A. Works
  • B. Is the most efficient (I feel there are a sufficient number of assumptions that can be made about the subject string that a fair amount of optimization can be done).

Of course, if you give an answer that works, and nobody else posts the answer *with* a solution, I'll mark it as the answer :)

Update 2:

Thank you all for the great responses, a lot of helpful information, and a lot of you had valid solutions. I chose the answer I did because after running performance tests, it was the best solution, averaging equal runtimes with the other solutions.

The reasons I favor this answer:

  1. The regular expression given provides excellent scalability for longer lines
  2. The regular expression looks a lot cleaner, and is easier for mere mortals such as myself to interpret.

However, a lot of credit goes to the below answers as well for being very thorough in explaining why their solution is the best. If you've come across this question because it's something you're trying to figure out, please give them all a read, helped me tremendously.

Why don't you just use this regex $regex = "/.*[$line].*[$line].*/m";?

For your example, that translates to $regex = "/.*[ACFG].*[ACFG].*/m";
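
A minimal Python sketch of the same idea, for readers who want to try it outside PHP. The names find_matching_lines and min_common are illustrative and not part of the answer above. It relies on the question's assumption that every line is valid: because each position draws from its own pair of letters, a line containing two characters from the query line necessarily has them at the same positions. Note that the query line itself also matches if it appears in the subject.

```python
import re

def find_matching_lines(line, subject, min_common=2):
    # Character class built from the query line, e.g. [ACFG].
    char_class = "[" + re.escape(line) + "]"
    # Require at least `min_common` characters from the class on a single line.
    pattern = re.compile(rf"^(?:.*{char_class}){{{min_common}}}.*$", re.MULTILINE)
    return pattern.findall(subject)

subject = "BCEG\nADFG\nBDEH\nBCFG\nBDEG\nBDFG"
print(find_matching_lines("ACFG", subject))  # ['BCEG', 'ADFG', 'BCFG', 'BDFG']
```

Raising min_common to 3 is one way to handle the "3 or more characters" case the question worries about, without rewriting the pattern by hand.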

How to check for repeating sequence in an integer

11 votes

I have an alpha-numeric string and I want to check for pattern repetition in it just for the integers. And they should be continuous.

Example

  1. 12341234qwe should tell me 1234 is repeated.
  2. 1234qwe1234 should NOT tell me that 1234 is repeated since it's not continuous.
  3. 12121212 should be treated as 12 being repeated as that is the first set which would be found being repeated. But if there is an algorithm which would find 1212 as the repeated set before 12 then I guess it has to perform the steps again on 1212.

My first thought was to extract the digits by iterating over the string, checking each character with (c >= '0' && c <= '9'), and appending the matches to a separate StringBuilder. Then I read that performing an FFT on the string reveals repeated patterns, but I have no idea how to perform an FFT in Java and interpret the results; I was also hoping to do this without going into signal processing. I read about KMP pattern matching, but that only works when the pattern is given. Is there any other way to do this?

I think a regex can solve this. Consider code like this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String[] arr = {"12341234abc", "1234foo1234", "12121212"};
String regex = "(\\d+?)\\1";  // shortest run of digits immediately followed by itself
Pattern p = Pattern.compile(regex);
for (String elem : arr) {
   Matcher matcher = p.matcher(elem);
   if (matcher.find())
      System.out.println(elem + " got repeated: " + matcher.group(1));
   else
      System.out.println(elem + " has no repetition");
}

OUTPUT:

12341234abc got repeated: 1234
1234foo1234 has no repetition
12121212 got repeated: 12

Explanation:

Regex being used is (\\d+?)\\1 where

\\d        - means a numerical digit
\\d+       - means 1 or more occurrences of a digit
\\d+?      - means reluctant (non-greedy) match of 1 OR more digits
( and )    - to group the above regex into group # 1
\\1        - means back reference to group # 1
(\\d+?)\\1 - repeat the group # 1 immediately after group # 1
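
The same pattern carries over to other backtracking engines. As a rough sketch in Python (the helper name first_repeated_digits is made up for illustration), the reluctant quantifier is what makes 12121212 report 12 rather than 1212:

```python
import re

# Shortest run of digits that is immediately followed by a copy of itself.
REPEAT = re.compile(r"(\d+?)\1")

def first_repeated_digits(text):
    match = REPEAT.search(text)
    return match.group(1) if match else None

for sample in ["12341234abc", "1234foo1234", "12121212"]:
    print(sample, "->", first_repeated_digits(sample))
```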

Why does /^(.+)+Q$/.test("XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX") take so long?

10 votes

When I run

/^(.+)+Q$/.test("XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX")

in Chrome or IE, it takes ~10 seconds to complete. (Firefox is able to evaluate it almost instantly.)

Why does it take so long? (And why/how is Firefox able to do it so quickly?)

(Of course, I'd never run this particular regex, but I'm hitting a similar issue with the URL regex at http://daringfireball.net/2010/07/improved_regex_for_matching_urls and it seems to boil down to this, i.e. there are certain URLs which will cause the browser to lock up)

For example:

var re = /\b((?:https?:\/\/|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/i;
re.test("http://google.com/?q=(AAAAAAAAAAAAAAAAAAAAAAAAAAAAA")

As indicated by thg435, it sounds like catastrophic back-tracking. There's an excellent article on this, Regular Expression Matching Can Be Simple And Fast.

It describes an efficient approach known as Thompson NFA. As noted, though, this does not support all features of modern regexes. For instance, it can't do backreferences. However, as suggested in the article:

"Even so, it would be reasonable to use Thompson's NFA simulation for most regular expressions, and only bring out backtracking when it is needed. A particularly clever implementation could combine the two, resorting to backtracking only to accommodate the backreferences."

I suspect Firefox may be doing this.
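
To make the cause concrete: the nested quantifier in (.+)+ gives a backtracking engine exponentially many ways to split the run of X characters before it can conclude that no Q follows. The Python sketch below (Python's re module is also a backtracking engine) only checks that the un-nested pattern ^.+Q$ accepts the same strings on a few short inputs; the inputs are kept short on purpose, and it says nothing about what Firefox actually does internally.

```python
import re

nested = re.compile(r"^(.+)+Q$")  # nested quantifier: prone to catastrophic backtracking
flat = re.compile(r"^.+Q$")       # same accepted strings, only one quantifier

# Keep the failing input short: every extra X roughly doubles the work for `nested`.
samples = ["XQ", "XXXQ", "Q", "X" * 12, "XQX"]
results = [(s, bool(nested.search(s)), bool(flat.search(s))) for s in samples]
for s, a, b in results:
    print(repr(s), a, b)
```

When a pattern can be rewritten without a quantified group that itself contains a quantifier over the same characters, the blow-up on non-matching input goes away.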

Impossible lookbehind with a backreference

10 votes

From my understanding,

(.)(?<!\1)

should never match. In fact, PHP's preg_replace refuses to compile this, and so does Ruby's gsub. The Python re module seems to have a different opinion, though:

import re
test = 'xAAAAAyBBBBz'
print (re.sub(r'(.)(?<!\1)', r'(\g<0>)', test))

Result:

(x)AAAA(A)(y)BBB(B)(z)

Can anyone provide a reasonable explanation for this behavior?

Update

This behavior appears to be a limitation in the re module. The alternative regex module seems to handle groups in assertions correctly:

import regex

test = 'xAAAAAyBBBBz'

print (regex.sub(r'(.)(?<!\1)', r'(\g<0>)', test))
## xAAAAAyBBBBz

print (regex.sub(r'(.)(.)(?<!\1)', r'(\g<0>)', test))
## (xA)AAA(Ay)BBB(Bz)

Note that unlike PCRE, regex also allows variable-width lookbehinds:

print (regex.sub(r'(.)(?<![A-Z]+)', r'(\g<0>)', test))
## (x)AAAAA(y)BBBB(z)

Eventually, regex is going to be included in the standard library, as mentioned in PEP 411.

This does look like a limitation (nice way of saying "bug", as I learned from a support call with Microsoft) in the Python re module.

I guess it has to do with the fact that Python does not support variable-length lookbehind assertions, but it's not clever enough to figure out that \1 will always be fixed-length. Why it doesn't complain about this when compiling the regex, I can't say.

Funnily enough:

>>> print (re.sub(r'.(?<!\0)', r'(\g<0>)', test))
(x)(A)(A)(A)(A)(A)(y)(B)(B)(B)(B)(z)
>>>
>>> re.compile(r'(.*)(?<!\1)') # This should trigger an error but doesn't!
<_sre.SRE_Pattern object at 0x00000000026A89C0>

So it's better not to use backreferences in lookbehind assertions in Python. Positive lookbehind isn't much better (it also matches here as if it were a positive lookahead):

>>> print (re.sub(r'(.)(?<=\1)', r'(\g<0>)', test))
x(A)(A)(A)(A)Ay(B)(B)(B)Bz

And I can't even guess what's going on here:

>>> print (re.sub(r'(.+)(?<=\1)', r'(\g<0>)', test))
x(AA)(A)(A)Ay(BB)(B)Bz

Need RegExp help for Linux Bash grep command to filter out lines containing square brackets

7 votes

Using the following example, I need to filter out the line containing 'ABC' only, while skipping the lines matching 'ABC' that contain square brackets:

2012-04-04 04:13:48,760~sample1~ABC[TLE 5332.233 2/13/2032 3320392]:CAST
2012-04-04 04:13:48,761~sample2~ABC
2012-04-04 04:13:48,761~sample3~XYZ[BAC.CAD.ABC.CLONE 232511]:TEST

Here is what I have, but so far I'm unable to successfully filter out the lines with square brackets:

bash-3.00$ cat Metrics.log | grep -e '[^\[\]]' | grep -i 'ABC'

Please help?

Edited based on comments:

Try grep -i 'ABC' Metrics.log | grep -v "[[]" | grep -v "ABC\w"

Input:

2012-04-04 04:13:48,760~sample1~ABC[TLE 5332.233 2/13/2032 3320392]:CAST
2012-04-04 04:13:48,761~sample2~ABC
2012-04-04 04:13:48,761~sample3~XYZ[BAC.CAD.ABC.CLONE 232511]:TEST
2012-04-04 04:13:48,761~sample4~XYZ
2012-04-04 04:13:48,761~sample5~ABCD
2012-04-04 04:13:48,761~sample6~ABC:TEST

Output:

2012-04-04 04:13:48,761~sample2~ABC
2012-04-04 04:13:48,761~sample6~ABC:TEST
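
For comparison, here is a rough Python equivalent of that three-stage pipeline. The function name filter_lines is hypothetical; each condition mirrors one grep stage, including the fact that only the first stage is case-insensitive.

```python
import re

def filter_lines(lines):
    kept = []
    for line in lines:
        if not re.search(r"abc", line, re.IGNORECASE):  # grep -i 'ABC'
            continue
        if "[" in line:                                 # grep -v "[[]"
            continue
        if re.search(r"ABC\w", line):                   # grep -v "ABC\w"
            continue
        kept.append(line)
    return kept

log = [
    "2012-04-04 04:13:48,760~sample1~ABC[TLE 5332.233 2/13/2032 3320392]:CAST",
    "2012-04-04 04:13:48,761~sample2~ABC",
    "2012-04-04 04:13:48,761~sample3~XYZ[BAC.CAD.ABC.CLONE 232511]:TEST",
    "2012-04-04 04:13:48,761~sample4~XYZ",
    "2012-04-04 04:13:48,761~sample5~ABCD",
    "2012-04-04 04:13:48,761~sample6~ABC:TEST",
]
print("\n".join(filter_lines(log)))
```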

Reg Ex for even number of 0s and 1s

7 votes

I am trying to create a regular expression that determines if a string (of any length) matches a regex pattern such that the number of 0s in the string is even, and the number of 1s in the string is even. Can anyone help me determine a regex statement that I could try and use to check the string for this pattern?

So, I have come up with a solution to the problem:

(11+00+(10+01)(11+00)*(10+01))*
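
The expression above appears to use formal-language notation, where + means "or". Under that assumption it translates to | in most regex engines. The Python sketch below performs the translation and brute-force checks the result against a direct count for every binary string up to length 8; it is a sanity check, not a proof.

```python
import re
from itertools import product

# (11+00+(10+01)(11+00)*(10+01))* with + rewritten as alternation.
EVEN_EVEN = re.compile(r"(?:11|00|(?:10|01)(?:11|00)*(?:10|01))*")

def has_even_counts(bits):
    return bits.count("0") % 2 == 0 and bits.count("1") % 2 == 0

mismatches = []
for length in range(0, 9):
    for combo in product("01", repeat=length):
        bits = "".join(combo)
        if bool(EVEN_EVEN.fullmatch(bits)) != has_even_counts(bits):
            mismatches.append(bits)

print("mismatches:", mismatches)
```

The pattern works pair by pair: 00 and 11 leave both parities unchanged, while 10 and 01 flip both, so they must come in twos with any number of 00/11 pairs in between.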

Which non-empty string does /^$/ match?

7 votes

In a Perl SO answer, a poster used this code to match empty strings:

$userword =~ /^$/; #start of string, followed immediately by end of string

To which brian d foy commented:

You can't really say that because that will match one particular non-empty string.

Question: Which non-empty string is matched by this? Is it a string consisting of "\r" only?

Let's check the docs, why don't we? Quote perlre,

$: Match the end of the line (or before newline at the end)

Given

\z: Match only at end of string

That means /^$/ is equivalent to /^\n?\z/.

$ perl -E'$_ = "";    say /^$/ ||0, /^\n?\z/ ||0, /^\z/ ||0;'
111

$ perl -E'$_ = "\n";  say /^$/ ||0, /^\n?\z/ ||0, /^\z/ ||0;'
110
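
The same trap exists outside Perl. As an illustrative sketch in Python (helper names are invented here): without re.MULTILINE, Python's $ also matches just before a trailing newline, while Python's \Z plays the role of Perl's \z and matches only at the very end of the string.

```python
import re

def matches_caret_dollar(text):
    # $ matches at the end of the string or just before a final newline.
    return re.search(r"^$", text) is not None

def matches_strict_empty(text):
    # \Z matches only at the very end of the string.
    return re.search(r"^\Z", text) is not None

for text in ("", "\n"):
    print(repr(text), matches_caret_dollar(text), matches_strict_empty(text))
```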

Note that /m changes what ^ and $ match. Under /m, ^ matches at the start of any "line", and $ matches before any newline and at the end of the string.

$ perl -E'$_ = "abc\ndef\n";  say "matched at $-[0]" while  /^/g'
matched at 0

$ perl -E'$_ = "abc\ndef\n";  say "matched at $-[0]" while  /$/g'
matched at 7
matched at 8

And using /m:

$ perl -E'$_ = "abc\ndef\n";  say "matched at $-[0]" while  /^/mg'
matched at 0
matched at 4   <-- new

$ perl -E'$_ = "abc\ndef\n";  say "matched at $-[0]" while  /$/mg'
matched at 3   <-- new
matched at 7
matched at 8

\A, \Z and \z aren't affected by /m:

$ perl -E'$_ = "abc\ndef\n";  say "matched at $-[0]" while  /\A/g'
matched at 0

$ perl -E'$_ = "abc\ndef\n";  say "matched at $-[0]" while  /\z/g'
matched at 8

$ perl -E'$_ = "abc\ndef\n";  say "matched at $-[0]" while  /\Z/g'
matched at 7
matched at 8

Perl Regex "Not" (negative lookahead)

6 votes

I'm not terribly certain what the correct wording for this type of regex would be, but basically what I'm trying to do is match any string that starts with "/" but is not followed by "bob/", as an example.

So these would match:

/tom/
/tim/
/steve

But these would not

tom
tim
/bob/

I'm sure the answer is terribly simple, but I had a difficult time searching for "regex not" anywhere. I'm sure there is a fancier word for what I want that would pull good results, but I'm not sure what it would be.

Edit: I've changed the title to indicate the correct name for what I was looking for

You can use a negative lookahead (documented under "Extended Patterns" in perlre):

/^\/(?!bob\/)/
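
Negative lookahead works the same way in most modern engines. A small Python sketch run against the examples from the question (the function name is made up for illustration):

```python
import re

# A leading slash that is not immediately followed by "bob/".
NOT_BOB = re.compile(r"^/(?!bob/)")

def starts_with_slash_not_bob(path):
    return NOT_BOB.search(path) is not None

for path in ["/tom/", "/tim/", "/steve", "tom", "tim", "/bob/"]:
    print(path, starts_with_slash_not_bob(path))
```

The lookahead consumes nothing: it only checks what follows the slash, so the match itself is just the single "/" character.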

Replace only up to N matches on a line

6 votes

In Perl, how to write a regular expression that replaces only up to N matches per string?

I.e., I'm looking for a middle ground between s/aa/bb/; and s/aa/bb/g;. I want to allow multiple substitutions, but only up to N times.

I can think of three reliable ways. The first is to replace everything after the Nth match with itself.

my $max = 5;
$s =~ s/(aa)/ $max-- > 0 ? 'bb' : $1 /eg;

That's not very efficient if there are far more than N matches. For that, we need to move the loop out of the regex engine. The next two methods are ways of doing that.

my $max = 5;
my $out = '';
$out .= $1 . 'bb' while $max-- && $in =~ /\G(.*?)aa/gcs;
$out .= $1 if $in =~ /\G(.*)/gcs;

And this time, in-place:

my $max = 5;
my $replace = 'bb';
while ($max-- && $s =~ s/\G.*?\Kaa/$replace/s) {
   pos($s) = $-[0] + length($replace);
}

You might be tempted to do something like

my $max = 5;
$s =~ s/aa/bb/ for 1..$max;

but that approach will fail for other patterns and/or replacement expressions.

my $max = 5;
$s =~ s/aa/ba/ for 1..$max;  # XXX Turns 'aaaaaaaa'
                             #     into 'bbbbbaaa'
                             #     instead of 'babababa'
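
For contrast, some languages expose the limit directly. In Python, for example, re.sub takes a count argument; the sketch below shows it next to the naive repeated-substitution loop, which goes wrong in the same way as the Perl loop above because each pass rescans text that earlier passes produced.

```python
import re

s = "aaaaaaaa"

# One left-to-right pass, at most 3 replacements.
limited = re.sub(r"aa", "ba", s, count=3)

# Naive version: restart from the beginning each time.
naive = s
for _ in range(3):
    naive = re.sub(r"aa", "ba", naive, count=1)

print(limited)  # bababaaa
print(naive)    # bbbaaaaa
```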

Bug in JavaScript V8 regex engine when matching beginning-of-line?

6 votes

I have a pretty nifty tool, underscore-cli, that's getting the strangest behavior when printing out the help / usage information.

In the usage() function, I do this to indent blocks of text (eg, the options):

str.replace(/^/gm, "    ");

This regex, in addition to being pretty obvious, comes straight out of TJ Hollowaychuk's commander.js code. The regex is correct.

Yet I get bizarre spaces inserted into the middle of my usage text, like this:

  Commands:
...
     values              Retrieve all the values of an object's properties.
     extend <object>     Override properties in the input data.
     defaults <object>   Fill in missing properties in the input data.
     any <exp>           Return 'true' if any of the values in the input make the expression true.  Expression args: (value, key, list)
         all <exp>           Return 'true' if all values in the input make the expression true.  Expression args: (value, key, list)
     isObject            Return 'true' if the input data is an object with named properties
     isArray             Return 'true' if the input data is an array
     isString            Return 'true' if the input data is a string
...

99% chance, this HAS to be a bug in V8.

Anyone know why this happens, or what the easiest work-around would be?

Yup, turns out this IS a V8 bug, 1748 to be exact. Here's the workaround I used in the tool:

str.replace(/(^|\n)/g, "$1    ");

This is a bug in V8 (bug 1748):

http://code.google.com/p/v8/source/browse/branches/bleeding_edge/test/mjsunit/regress/regress-1748.js?spec=svn9504&r=9504

Here is a test for the bug:

function assertEquals(a, b, msg) { if(a !== b) { console.log("'%s' != '%s'  %s", a, b, msg); } }

var str = Array(10000).join("X");
str.replace(/^|X/g, function(m, i, s) {
  if (i > 0) assertEquals("X", m, "at position 0x" + i.toString(16));
});

On my box, it prints:

'X' != ''.  at position 0x100
'X' != ''.  at position 0x200
'X' != ''.  at position 0x300
'X' != ''.  at position 0x400
'X' != ''.  at position 0x500
'X' != ''.  at position 0x600
...

On jsfiddle, it prints nothing (the version of V8 in my Chrome browser doesn't have the bug):

http://jsfiddle.net/PqDHk/


Bug History:

From the V8 changelog, the bug was fixed in V8-3.6.5 (2011-10-05).

From the Node.js changelog, Node-0.6.5 should be using V8-3.6.6.11 !?!!?. Node.js updated from V8-3.6.4 to V8-3.7.0 (Node-0.5.10) and then downgraded to V8-3.6.6 for Node-0.6.0. So theoretically, this bug should have been fixed before Node V0.6.0. Why does it still repro on Node-0.6.5??? Odd.

Can someone with the latest (Node-0.6.15) run the test snippet above and report if it generates errors? Or I'll get around to it eventually.

Thanks to ZachB for confirming this bug on Node-0.6.15. I filed an issue (issue #3168) against node, and a fix (5d69bbf) has been applied and should be included in Node-0.6.16. :) :) :)

Until then, the workaround is to replace:

str.replace(/^/gm, indent);

With:

str.replace(/(^|\n)/g, "$1" + indent);

Real Time Morse code converter in Javascript

6 votes

After seeing Google's April Fools' joke of Morse code Gmail, I thought I'd try to create a real-time Morse code converter in JavaScript.

I'm using regex and replace to change the Morse code into characters. For example:

.replace(/.- /g, "a").replace(/.-. /g, "r")

The issue I'm having is that when I type .-. for "r" it gives me an "a", because it sees .- first. How can I make it replace only exact matches?

Updated and working!! Thanks to every one that helped me

http://jsfiddle.net/EnigmaMaster/sPDHL/32/ - My Original code

http://jsfiddle.net/EnigmaMaster/LDKKE/6/ - Rewritten by Shawn Chin

http://jsfiddle.net/EnigmaMaster/y9A4Y/2/ - Rewritten by Matthias Tylkowski

If anyone has other ways of writing this program, please post a JsFiddle.

I'd love to see how else this can be done.

The other answers have already covered the reasons why your example was not working so I'll refrain from repeating them.

However, may I suggest that since you're already using spaces to delimit each code, a straight-forward solution would be to do a simple .split() to segment the input text into individual units then simply do a one-to-one mapping of code to chars. This will be a lot more efficient than repeated regex replacements and less prone to errors.

For example:

var morse = {  // use object as a map
    '.-': 'a', 
    '-...': 'b', 
    '-.-.': 'c', 
    // .... the rest ...
};

function translate_morse(code) {  // given code, return matching char
    return (typeof morse[code] === "undefined") ? "" : morse[code];
    // if the var is not found, the code is unknown/invalid. Here we 
    // simply ignore it but you could print out the code verbatim and use
    // different font styles to indicate an erroneous code
}

// example usage
translated = code_text.split(" ").map(translate_morse).join("");

Here's a working example: http://jsfiddle.net/KGVAm/1/

p.s. I've taken the liberty of tweaking the code and the behaviour a little, i.e. disabling the input of other chars but allowing backspace to allow corrections.
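
The split-and-map approach is not specific to JavaScript. A minimal sketch of the same idea in Python, with a deliberately partial table (the MORSE dictionary and the translate function are illustrative names):

```python
# Partial lookup table; a real converter would list every code.
MORSE = {
    ".-": "a",
    "-...": "b",
    "-.-.": "c",
    ".-.": "r",
}

def translate(code_text):
    # Split on whitespace so each code is looked up as a whole unit;
    # unknown codes are silently dropped.
    return "".join(MORSE.get(code, "") for code in code_text.split())

print(translate(".- .-. -..."))  # arb
```

Because each code is looked up whole, ".-." can never be mistaken for ".-" followed by something else, which is the problem the chained replacements ran into.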

Use regular expression to handle nested parenthesis in math equation?

6 votes

If I have:

statement = "(2*(3+1))*2"

I want to be able to handle multiple parentheses within parentheses for a math reader I'm writing. Perhaps I'm going about this the wrong way, but my goal was to recursively go deeper into the parentheses until there were none, and then I would perform the math operations. Thus, I would first want to focus on

"(2*(3+1))" 

then focus on

"(3+1)"

I hoped to do this by assigning the focus value to the start index and the end index of the regex match. I have yet to figure out how to find the end index, but first I'm more interested in why the regex

r"\(.+\)" 

failed to match. I wanted it to read as "any one or more characters contained within a set of parentheses". Could someone explain why the above expression will not match to the above statement in python?

I love regular expressions. I use them all the time.

Don't use regular expressions for this.

You want an actual parser that will actually parse your math expressions. You might want to read this:

http://effbot.org/zone/simple-top-down-parsing.htm

Once you have actually parsed the expression, it's trivial to walk the parse tree and compute the result.

EDIT: @Lattyware suggested pyparsing, which should also be a good way to go, and might be easier than the EFFBot solution posted above.

http://pyparsing.wikispaces.com

Here's a direct link to the pyparsing sample code for a four-function algebraic expression evaluator:

http://pyparsing.wikispaces.com/file/view/fourFn.py
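
To give a feel for what "an actual parser" means here, below is a small recursive-descent sketch. It is neither the effbot top-down parser nor the pyparsing example linked above, just a hand-rolled illustration that evaluates while parsing instead of building a tree; the names tokenize and evaluate are invented, and it only handles numbers, + - * /, unary minus, and parentheses.

```python
import re

TOKEN = re.compile(r"\s*(\d+\.?\d*|[-+*/()])")

def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def evaluate(text):
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def factor():  # number, parenthesised expression, or unary minus
        tok = take()
        if tok == "(":
            value = expr()  # nested parentheses are handled by recursion
            if take() != ")":
                raise ValueError("expected ')'")
            return value
        if tok == "-":
            return -factor()
        if tok is None or tok in {"+", "*", "/", ")"}:
            raise ValueError("expected a number or '('")
        return float(tok)

    def term():  # factor (('*' | '/') factor)*
        value = factor()
        while peek() in ("*", "/"):
            if take() == "*":
                value *= factor()
            else:
                value /= factor()
        return value

    def expr():  # term (('+' | '-') term)*
        value = term()
        while peek() in ("+", "-"):
            if take() == "+":
                value += term()
            else:
                value -= term()
        return value

    result = expr()
    if peek() is not None:
        raise ValueError("unexpected trailing input")
    return result

print(evaluate("(2*(3+1))*2"))  # 16.0
```

The regex is used only to chop the input into tokens; the nesting is handled by factor() calling expr(), which is exactly the part a single regular expression cannot express for arbitrary depth.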

Split on first comma in string

6 votes

How can I efficiently split the following string on the first comma using base R?

x <- "I want to split here, though I don't want to split elsewhere, even here."
strsplit(x, ???)

Desired outcome (2 strings):

[[1]]
[1] "I want to split here"   "though I don't want to split elsewhere, even here."

Thank you in advance.

EDIT: Didn't think to mention this. This needs to be able to generalize to a column, vector of strings like this, as in:

y <- c("Here's comma 1, and 2, see?", "Here's 2nd sting, like it, not a lot.")

The outcome can be two columns, one long vector (that I can take every other element of), or a list of strings with each index ([[n]]) holding two strings.

Apologies for the lack of clarity.

Here's what I'd probably do. It may seem hacky, but since sub() and strsplit() are both vectorized, it will also work smoothly when handed multiple strings.

XX <- "SoMeThInGrIdIcUlOuS"
strsplit(sub(",\\s*", XX, x), XX)
# [[1]]
# [1] "I want to split here"                               
# [2] "though I don't want to split elsewhere, even here."
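
For readers coming from other languages: the placeholder trick is needed because strsplit() has no "maximum number of splits" argument. Where such an argument exists the problem is a one-liner. A Python sketch, purely for comparison (split_first_comma is an illustrative name):

```python
import re

def split_first_comma(text):
    # maxsplit=1 stops after the first comma; \s* also swallows the space after it.
    return re.split(r",\s*", text, maxsplit=1)

x = "I want to split here, though I don't want to split elsewhere, even here."
y = ["Here's comma 1, and 2, see?", "Here's 2nd sting, like it, not a lot."]

print(split_first_comma(x))
print([split_first_comma(s) for s in y])
```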

Negative lookahead assertion with the * modifier in Perl

6 votes

I have what I believe to be a negative lookahead assertion, <@> *(?!QQQ), which I expect to match if the tested string is a <@> followed by any number of spaces (including zero) and then not followed by QQQ.

Yet, if the tested string is <@> QQQ the regular expression matches.

I fail to see why this is the case and would appreciate any help on this matter.

Here's a test script

use warnings;
use strict;

my @strings = ('something <@> QQQ',
               'something <@> RRR',
               'something <@>QQQ' ,
               'something <@>RRR' );


print "$_\n" for map {$_ . " --> " . rep($_) } (@strings);



sub rep {

  my $string = shift;

  $string  =~ s,<@> *(?!QQQ),at w/o ,;
  $string  =~ s,<@> *QQQ,at w/  QQQ,;

  return $string;
}

This prints

something <@> QQQ --> something at w/o  QQQ
something <@> RRR --> something at w/o RRR
something <@>QQQ --> something at w/  QQQ
something <@>RRR --> something at w/o RRR

And I'd have expected the first line to be something <@> QQQ --> something at w/ QQQ.

It matches because zero is included in "any number". So no spaces, followed by a space, matches "any number of spaces not followed by a Q".

You should add another lookahead assertion that the first thing after your spaces is not itself a space. Try this (untested):

 <@> *(?!QQQ)(?! )

ETA Side note: changing the quantifier to + would have helped only when there's exactly one space; in the general case, the regex can always grab one less space and therefore succeed. Regexes want to match, and will bend over backwards to do so in any way possible. All other considerations (leftmost, longest, etc) take a back seat - if it can match more than one way, they determine which way is chosen. But matching always wins over not matching.
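
Since the suggestion above is marked untested, here is a quick check. Lookahead semantics are the same in Python's re module, so this sketch runs both patterns against the four strings from the question; it is a Python stand-in, not the original Perl script.

```python
import re

original = re.compile(r"<@> *(?!QQQ)")
fixed = re.compile(r"<@> *(?!QQQ)(?! )")  # extra lookahead: the spaces must be fully consumed

samples = [
    "something <@> QQQ",
    "something <@> RRR",
    "something <@>QQQ",
    "something <@>RRR",
]

results = [(s, bool(original.search(s)), bool(fixed.search(s))) for s in samples]
for s, before, after in results:
    print(f"{s!r}: original={before} fixed={after}")
```

With the original pattern the first string matches because the engine backtracks to zero spaces; the second lookahead closes that escape route.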

jquery find exact string with no more or less

6 votes

I have some text from a breadcrumb which I am using to open menu items on a page. For example, say the bctext = 'pasta'.

I want to target the word "pasta", but not say "yadda yadda yadda pasta". Only an instance of the single word "pasta" should match, or if bctext were a phrase, then it would only find the exact phrase.

This is what I have so far:

$('ul#accordion a:contains(' + bctext + ')')

But this finds "yadda yadda pasta", of course.

I get the bctext with the following:

var bctext = $('#CategoryBreadcrumb ul li:last-child').prev().children().text();

Then, I edit the menu with the following:

$('ul#accordion a:contains(' + bctext + ')').parent()
                                            .addClass('special')
                                            .children('ul')
                                            .css('display','block'); 

Is what I'm going for possible?

$('ul#accordion a').filter(function() {
    return $(this).text() == bctext;
}).parent().addClass('special').children('ul').css('display','block');

:contains() is not a native selector anyway so using .filter() with a custom callback won't have any performance drawbacks.

Retaining the pattern characters while splitting via Regex, Ruby

5 votes

I have the following string

str="HelloWorld How areYou I AmFine"

I want this string into the following array

["Hello","World How are","You I Am", "Fine"]

I have been using the following regex. It splits at the right places, but it also drops the characters matched by the pattern, and I want to retain them. What I get is

str.split(/[a-z][A-Z]/)
 => ["Hell", "orld How ar", "ou I A", "ine"] 

It omits the matched characters.

Can anyone help me retain these characters in the resulting array?

Three answers so far, each with a limitation: one is rails-only and breaks with underscore in original string, another is ruby 1.9 only, the third always has a potential error with its special character. I really liked the split on zero-width assertion answer from @Alex Kliuchnikau, but the OP needs ruby 1.8 which doesn't support lookbehind. There's an answer that uses only zero-width lookahead and works fine in 1.8 and 1.9 using String#scan instead of #split.

str.scan /.*?[a-z](?=[A-Z]|$)/
=> ["Hello", "World How are", "You I Am", "Fine"]
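
The scan-with-lookahead idea is portable. As a sketch, the same pattern with Python's re.findall, which plays the role of String#scan here:

```python
import re

text = "HelloWorld How areYou I AmFine"

# Lazily take characters up to a lowercase letter that is followed by an
# uppercase letter or the end of the string. The lookahead consumes nothing,
# so the uppercase letter stays available for the next piece.
parts = re.findall(r".*?[a-z](?=[A-Z]|$)", text)
print(parts)  # ['Hello', 'World How are', 'You I Am', 'Fine']
```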

"preg_match(): Compilation failed: unmatched parentheses" in PHP for valid pattern

5 votes

Wondering if anyone out there can shed some light on why the following regular expression is failing when used in PHP's preg_match function:-

<?php
$str = '\tmp\phpDC1C.tmp';

preg_match('|\\tmp\\([A-Za-z0-9]+)|', $str, $matches);

print_r($matches);
?>

This results in the error message "preg_match(): Compilation failed: unmatched parentheses" despite the fact that the pattern appears to be valid. I've tested it with an online PHP Regular Expression tester and the Linux tool Kiki. Seems like PHP is escaping the opening parenthesis rather than the backslash.

I've got round the issue by using str_replace to swap the backslashes for forward ones. This works for my situation but it would be nice to know why this regular expression is failing.

To encode a literal backslash, you need to escape it twice: Once for the string, and once for the regex engine:

preg_match('|\\\\tmp\\\\([A-Za-z0-9]+)|', $str, $matches);

In PHP (when using single-quoted strings), this is only relevant for actual backslashes; other regex escapes are OK with a single backslash:

preg_match('/\bhello\b/', $subject)

This is covered in the manual (see the box labeled "Note:" at the top of the page).
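
The two layers of escaping are not unique to PHP. The Python sketch below shows the same effect: the regex engine needs \\ to match one backslash, and a normal (non-raw) string literal needs each of those backslashes doubled again. With only one level of escaping the engine sees \( as a literal parenthesis, and the closing ) is left unmatched, which is the same kind of error the question reports.

```python
import re

path = r"\tmp\phpDC1C.tmp"  # raw string: the backslashes are literal characters

# Raw string: the pattern text is exactly what the regex engine sees.
raw_match = re.search(r"\\tmp\\([A-Za-z0-9]+)", path)

# Normal string: every backslash meant for the regex engine is doubled again.
plain_match = re.search("\\\\tmp\\\\([A-Za-z0-9]+)", path)

# Escaped only once: the engine receives \t (a tab), then \( (a literal paren),
# leaving the final ")" without a partner.
try:
    re.compile("\\tmp\\([A-Za-z0-9]+)")
    compiled = True
except re.error:
    compiled = False

print(raw_match.group(1), plain_match.group(1), compiled)
```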