Best regex questions in May 2012

Why does preg_replace with /(.*)/ repeat part of string?

14 votes

Why does the following code:

<?php echo preg_replace("/(.*)/", "$1.def", "abc");

Output abc.def.def instead of abc.def?

I'm interested in understanding why the repetition occurs.

Using /(.+)/ or /^(.*)$/ works as expected, but I'm not looking for a solution, just asking a question (although these patterns may be related to the answer).

Tinker with a live version here.

Because .* matches the empty substring at the end of the string. It means there are two matches to the string abc:

  1. The whole string abc → abc.def
  2. The empty string → .def

which gives abc.def.def.


Edit: Detail of why it happens is explained in Java regex anomaly?.
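
The same two-match behaviour can be reproduced outside PHP. The following is a minimal sketch in Python rather than the asker's PHP, assuming Python 3.7 or later (earlier versions handled the trailing empty match differently in re.sub); the variable names are illustrative only.

```python
import re

# findall reports both matches: the whole string, then an empty match at the end.
matches = re.findall(r"(.*)", "abc")

# Each match is replaced in turn, so ".def" is appended twice (Python 3.7+).
doubled = re.sub(r"(.*)", r"\g<1>.def", "abc")

# Requiring at least one character, or anchoring both ends, leaves a single match.
plus = re.sub(r"(.+)", r"\g<1>.def", "abc")
anchored = re.sub(r"^(.*)$", r"\g<1>.def", "abc")

print(matches, doubled, plus, anchored)
```

This also shows why /(.+)/ and /^(.*)$/ behave as the asker expected: neither can match the empty string left over at the end.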

What exactly does $ match in Perl?

10 votes

Until a few minutes ago, I believed that Perl's $ matches any kind of end of line. Unfortunately, my assumption turns out to be wrong.

The following script removes the word end only for $string3.

use warnings;
use strict;

my $string1 = " match to the end" . chr(13);
my $string2 = " match to the end" . chr(13) . chr(10);
my $string3 = " match to the end" .           chr(10);

$string1 =~ s/ end$//;
$string2 =~ s/ end$//;
$string3 =~ s/ end$//;

print "$string1\n";
print "$string2\n";
print "$string3\n";

But I am almost 75% sure that I have seen cases where $ matched at least chr(13).chr(10).

So, what exactly (and under what circumstances) does the $ atom match?

$ matches only at the very end of the string or at the position before a final \n/chr(10); it does not match before \r/chr(13).

It's very often misinterpreted to match before a newline character (in a lot of cases it's not causing problems), but to be strict it matches before a "linefeed" character but not before a carriage return character!

See Regex Tutorial - Start and End of String or Line Anchors.
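
Python's re module treats $ in a comparable way (end of string, or just before a final line feed), so the three cases from the question can be sketched there. This is an illustration in Python, not the asker's Perl, and the tolerant pattern with an optional \r is one possible workaround rather than something the answer proposes.

```python
import re

pattern = re.compile(r" end$")
base = " match to the end"

cr_only = pattern.search(base + "\r")    # carriage return only
crlf = pattern.search(base + "\r\n")     # carriage return + line feed
lf_only = pattern.search(base + "\n")    # line feed only
bare = pattern.search(base)              # nothing after "end"

# Allowing an optional carriage return before $ covers all four inputs.
tolerant = re.compile(r" end\r?$")
tolerant_hits = [
    tolerant.search(base + tail) is not None
    for tail in ("\r", "\r\n", "\n", "")
]

print(cr_only, crlf, lf_only, bare, tolerant_hits)
```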

R: Capitalizing everything after a certain character

10 votes

I would like to capitalize everything in a character vector that comes after the first _. For example the following vector:

x <- c("NYC_23df", "BOS_3_rb", "mgh_3_3_f") 

Should come out like this:

"NYC_23DF" "BOS_3_RB" "mgh_3_3_F"

I have been trying to play with regular expressions, but am not able to do this. Any suggestions would be appreciated.

You were very close:

gsub("(_.*)","\\U\\1",x,perl=TRUE)

seems to work. You just needed to use _.* (underscore followed by zero or more other characters) rather than _* (zero or more underscores) ...

To take this apart a bit more:

  • _.* gives a regular expression pattern that matches an underscore _ followed by any number (including 0) of additional characters; . denotes "any character" and * denotes "zero or more repeats of the previous element"
  • surrounding this regular expression with parentheses () denotes that it is a pattern we want to store
  • \\1 in the replacement string says "insert the contents of the first matched pattern", i.e. whatever matched _.*
  • \\U, in conjunction with perl=TRUE, says "put what follows in upper case" (uppercasing _ has no effect; if we wanted to capitalize everything after (for example) a lower-case g, we would need to exclude the g from the stored pattern and include it in the replacement pattern: gsub("g(.*)","g\\U\\1",x,perl=TRUE))

For more details, search for "replacement" and "capitalizing" in ?gsub (and ?regexp for general information about regular expressions)
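
For comparison, here is a rough Python sketch of the same idea. Python's re has no \U replacement escape, so this version passes a function as the replacement instead; that is a swapped-in technique, not what gsub does internally.

```python
import re

x = ["NYC_23df", "BOS_3_rb", "mgh_3_3_f"]

# "_.*" matches from the first underscore to the end of the string;
# the replacement function upper-cases whatever was matched.
result = [re.sub(r"_.*", lambda m: m.group(0).upper(), s) for s in x]

print(result)
```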

Are regex atomic groups distributive?

10 votes

Are regex atomic groups distributive?

I.e. is (?>A?B?) always equivalent to (?>A?)(?>B?)?

If not please provide a counter example.

Atomic groups in general

  1. The atomic group (?>regex1|regex2|regex3) takes only the first successful match within it. In other words, it doesn't allow backtracking.

  2. Regexes are evaluated left-to-right, so you express the order you intend things to match. The engine starts at the first position, trying to make a successful match, backtracking if necessary. If any path through the expression would lead to a successful match, then it will match at that position.

  3. Atomic groups are not distributive. Consider these patterns evaluated over ABC: (?>(AB?))(?>(BC)) (no match) and (?>(AB?)(BC)) (matches ABC).

Atomic Groups with all optional components

But, your scenario where both parts are optional may be different.

Consider an atomic group with 2 greedy optional parts A and B ((A)? and (B)?). At any position, if A matches, it can move on to evaluate the optional B. Otherwise, if A doesn't match, that's fine too, because it's optional. Therefore, (A)? matches at any position. The same logic applies for the optional B. The question remaining is whether there can be any difference in backtracking.

In the case of all optional parts ((?>A?B?)), since each part always matches, there's no reason to backtrack within the atomic group, so it will always match. Then, since it is in an atomic group, it is prohibited from backtracking.

In the case of separate atomic groups ((?>A?)(?>B?)), each part always matches, and the engine is prohibited from backtracking in either case. This means the results will be the same.

To reiterate, the engine can only use the first possible match in (?>A?)(?>B?), which will always be the same match as the first possible match in (?>A?B?). Thus, if my reasoning is correct, for this special case the matches will be the same for multiple optional atomic groups as for a single atomic group with both optional components.
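
Both claims can be checked experimentally. The sketch below is in Python and, so that it runs on versions without native (?>...) support, emulates an atomic group with the common lookahead-plus-backreference idiom (?=(X))\1; the atomic helper is hypothetical and only valid for subpatterns without capturing groups of their own.

```python
import re

def atomic(subpattern, group_number):
    """Emulate (?>subpattern) as a lookahead capture plus a backreference."""
    return r"(?=({}))\{}".format(subpattern, group_number)

# Counter-example from point 3, evaluated over "ABC".
separate = atomic("AB?", 1) + "BC"      # like (?>AB?)BC
combined = atomic("AB?BC", 1)           # like (?>AB?BC)
separate_match = re.search(separate, "ABC")
combined_match = re.search(combined, "ABC")

# The all-optional case from the question: compare match spans on a few inputs.
opt_combined = re.compile(atomic("A?B?", 1))
opt_separate = re.compile(atomic("A?", 1) + atomic("B?", 2))
inputs = ["", "A", "B", "AB", "BA", "CAB", "ABAB"]
same_spans = all(
    opt_combined.search(s).span() == opt_separate.search(s).span()
    for s in inputs
)

print(separate_match, combined_match.group(0), same_spans)
```

The sample inputs are only a spot check, not a proof, but they agree with the reasoning above.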

pattern() vs toString()

9 votes

What is the difference between the pattern() method and the toString() method in the Pattern class? The doc says:

public String pattern()

Returns the regular expression from which this pattern was compiled.

public String toString()

Returns the string representation of this pattern. This is the regular expression from which this pattern was compiled.

Even their implementations return the same result:

import java.util.regex.*;

class Test {
  public static void main(String[] args) {
    Pattern p = Pattern.compile("[a-zA-Z]+\\.?");
    String s = p.pattern();
    String d = p.toString();
    System.out.println(s);
    System.out.println(d);
  }
}

I see no difference, so why are there two methods? Or am I missing something?

Thanks in advance!

Because each class has a toString() method which was inherited from Object. The toString() method is supposed to return a string which represents the object the best way it can, if it is even possible to create some kind of string representation. The name toString() is pretty vague, so they added a method pattern() which is more straightforward.

And because they wanted toString() to return something clever they used the pattern of the regex, which is a good string representation for the Pattern class.

What is meaning of [^] in Javascript regexps?

9 votes

[^a] means any character other than a, we know, but what does [^] (with no following characters) mean? Just as - loses its meaning of character range in cases such as [-], I assumed that [^] would match the caret. I spent way too long debugging this problem, only to find out that at least in Chrome 19 it appears to match anything--in other words, be equivalent to .. Is there a spec applicable here or what is the expected behavior?

Yes, I'm aware that I can and probably should use [\^]. This question is more in the nature of morbid curiosity.

According to the JavaScript specification (ES3 and ES5), [^] matches any single code unit, the same as [\s\S], [\0-\uffff], (.|\s) (don't use that; unlike the others, it relies on backtracking), etc. The difference from . is that the dot doesn't match the four newline code points (\r, \n, \u2028, and \u2029).

I don't recommend using [^] or [], because they don't work consistently cross-browser, and they prevent your regexes from working in other programming languages. IE <= 8 and older versions of Safari use the traditional (non-JavaScript) regex behavior for empty character classes. Older versions of Opera reverse the correct JavaScript behavior, so that [] matches any code unit and [^] never matches. The traditional regex behavior is that a leading, unescaped ] within a character class is treated as a literal character and does not end the character class.

If you use the XRegExp library, [] and [^] work correctly and consistently cross-browser. XRegExp also adds the s (aka dotall or singleline) flag that makes a dot match any code unit (the same as [^] in a browser that correctly follows the JavaScript spec).
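
To see the "traditional" behaviour mentioned above, here is a small sketch using Python's re module, which is one engine that follows it: a leading ] is literal, so [] and [^] on their own are unterminated classes. The compiles helper is a hypothetical name used only for this illustration.

```python
import re

# A leading "]" is literal in Python, so "[]]" is a class containing only "]".
literal_bracket = re.fullmatch(r"[]]", "]") is not None
# "[^]]" therefore means "any character except ]".
not_bracket = re.fullmatch(r"[^]]", "a") is not None

def compiles(pattern):
    """Return True if the pattern compiles, False if re rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

empty_class_ok = compiles("[]")
negated_empty_ok = compiles("[^]")

# The portable "match anything" class still differs from the dot on newlines.
dot_matches_newline = re.fullmatch(r".", "\n") is not None
any_matches_newline = re.fullmatch(r"[\s\S]", "\n") is not None

print(literal_bracket, not_bracket, empty_class_ok, negated_empty_ok)
```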

Regex to ignore accents? PHP

8 votes

Is there any way to make a regex that ignores accents?

For example:

preg_replace("/$word/i", "<b>$word</b>", $str);

The "i" in the regex makes the match case-insensitive, but is there any way to match, for example,
java with Jávã?

I did try to make a copy of the $str, change the content to a no accent string and find the index of all the occurrences. But the index of the 2 strings seems to be different, even though it's just with no accents.

(I did some research, but all I could find is how to remove accents from a string)

I don't think there is such a way. It would be locale-dependent, and you probably want the "/u" switch first to enable UTF-8 in pattern strings.

I would probably do something like this.

function prepare($pattern)
{
   // Each class also lists the base letter, so unaccented text still matches.
   $replacements = Array("a" => "[aáàäâ]",
                         "e" => "[eéèëê]" ...);
   return str_replace(array_keys($replacements), $replacements, $pattern);  
}

preg_replace("/(" . prepare($word) . ")/ui", "<b>\\1</b>", $str);

In your case, the indexes differed because, unless you used the mbstring functions, you were probably dealing with UTF-8, which uses more than one byte per accented character.
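
The same approach can be sketched in Python. The table below is deliberately incomplete, covering only a and e (plus ã so that the question's example matches); ACCENT_CLASSES, prepare and highlight are hypothetical names, and a real table would need every letter and accent you care about.

```python
import re

# Partial, illustrative table: each base letter maps to a class of its variants.
ACCENT_CLASSES = {
    "a": "[aáàäâã]",
    "e": "[eéèëê]",
}

def prepare(word):
    """Expand each known base letter into a class; escape everything else."""
    return "".join(ACCENT_CLASSES.get(ch, re.escape(ch)) for ch in word.lower())

def highlight(word, text):
    """Wrap accent-insensitive, case-insensitive matches of word in <b> tags."""
    pattern = "(" + prepare(word) + ")"
    return re.sub(pattern, r"<b>\1</b>", text, flags=re.IGNORECASE)

highlighted = highlight("java", "I like Jávã and java")
print(highlighted)
```

Because the match is done on the original string, there is no second, accent-stripped copy whose indexes could drift.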

How to match the sentence that start with "أقول " by this code?

7 votes

How to match the sentence that start with "أقول " by this code?

Regex.Matches(Content, "أقول " );

This is an Arabic word: "أقول ". What exactly is the regular expression?

Regarding your comment, you want to find any match that starts with "أقول " and ends with "أقول". If this is true, then this is the way:

Regex.Matches(Content, "أقول .*أقول");

For example, if the Content is:

أقول ولكنك لا تسمع ما أقول بسبب صوتك العالي

Then it will match:

أقول ولكنك لا تسمع ما أقول

There is no problem with Arabic being RTL. That only affects how the text is displayed; the characters are not stored in reverse!
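
A quick way to convince yourself of the storage order is to run the same pattern in another engine. This sketch uses Python instead of the asker's .NET Regex.Matches, and builds the sample sentence from the answer out of a word variable (a name chosen here for illustration).

```python
import re

word = "أقول"
# The sample content from the answer: the word appears twice, in logical order.
content = word + " ولكنك لا تسمع ما " + word + " بسبب صوتك العالي"

# Same pattern as above: the word, a space, anything, then the word again.
match = re.search(word + " .*" + word, content)
matched_text = match.group(0)

print(match.start(), matched_text)
```

The match starts at index 0, which confirms that the first word in reading order is also first in memory.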

How do I make Python's negative lookbehind less greedy?

7 votes

I've read all related posts and scoured the internet but this is really beating me.

I have some text containing a date.
I would like to capture the date, but not if it's preceded by a certain phrase.

A straightforward solution is to add a negative lookbehind to my RegEx.

Here are some examples (using findall).
I only want to capture the date if it isn't preceded by the phrase "as of".

19-2-11
something something 15-4-11
such and such as of 29-5-11

Here is my regular expression:

(?<!as of )(\d{1,2}-\d{1,2}-\d{2})

Expected results:

['19-2-11']
['15-4-11']
[]

Actual results:

['19-2-11']
['15-4-11']
['9-5-11']

Notice that's 9 not 29. If I change \d{1,2} to something solid like \d{2} on the first pattern:

bad regex for testing: (?<!as of )(\d{2}-\d{1,2}-\d{2})

Then I get my expected results. Of course this is no good because I'd like to match 2-digit days as well as single-digit days.

Apparently my negative lookbehind is quite greedy, more so than my date capture, so it's stealing a digit from it and failing. I've tried every means of correcting the greed I can think of, but I just don't know how to fix this.

I'd like my date capture to match with the utmost greed, and then my negative lookbehind be applied. Is this possible? My problem seemed like a good use of negative lookbehinds and not overly complicated. I'm sure I could accomplish it another way if I must but I'd like to learn how to do this.

How do I make Python's negative lookbehind less greedy?

The reason is not that the lookbehind is greedy. This happens because the regex engine tries to match the pattern at every position it can.

It advances through the phrase such and such as of 29-5-11 successfully matching (?<!as of ) at first, but failing to match \d{1,2}.

Then the engine reaches the position such and such as of !29-5-11 (marked with !), where (?<!as of ) fails to match.

So it advances to the next position, such and such as of 2!9-5-11, where it successfully matches (?<!as of ) and then \d{1,2}.

How to avoid it?

The general solution is to formulate the pattern as clearly as possible.

In this very case I would prepend the digit with the necessary space or the beginning of the string.

(?<!as of)(?:^|\s+)(\d{1,2}-\d{1,2}-\d{2})

The solution of Mark Byers is also very good.

I think it's very important to understand the reason why regex engine behaves this way and gives unwanted results.

By the way, the solution I gave above doesn't work if there are two or more spaces, because the pattern matches starting at the position marked with ! here: such and such as of ! 29-5-11.

What can be done to avoid it?

Unfortunately, lookbehind in the Python regex engine doesn't support the quantifiers + or *.

I think the simplest solution is to make sure there are no spaces before (?:^|\s+). That way all the spaces are consumed by (?:^|\s+) immediately after the preceding non-space text, and if that text is as of, the engine gives up at this position and starts the search over at the next position in the text.

re.search(r'(?<!as of)(?<!\s)(?:^|\s+)(\d{1,2}-\d{1,2}-\d{2})','such and such as of  29-5-11')  # returns None: the date after "as of" is not captured
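
To make the walkthrough concrete, the sketch below runs the question's original pattern and the final pattern from this answer over the sample strings, plus the two-space variant. The names naive, fixed and samples are illustrative.

```python
import re

DATE = r"(\d{1,2}-\d{1,2}-\d{2})"

# Original pattern from the question: the engine can start matching at "9".
naive = re.compile(r"(?<!as of )" + DATE)

# Final pattern from the answer: whitespace must directly follow non-space text.
fixed = re.compile(r"(?<!as of)(?<!\s)(?:^|\s+)" + DATE)

samples = [
    "19-2-11",
    "something something 15-4-11",
    "such and such as of 29-5-11",
    "such and such as of  29-5-11",   # two spaces before the date
]

naive_results = [naive.findall(s) for s in samples]
fixed_results = [fixed.findall(s) for s in samples]

print(naive_results)
print(fixed_results)
```

Note that with two spaces the original pattern captures the full 29-5-11, because the six characters before the date are no longer exactly "as of ".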

Unicode, regular expressions and PyPy

6 votes

I wrote a program to add (limited) Unicode support to Python regexes, and while it's working fine on CPython 2.5.2 it's not working on PyPy (1.8.0, implementing Python 2.7.2), both running on Windows XP (Edit: as seen in the comments, @dbaupp could run it fine on Linux). I have no idea why, but I suspect it has something to do with my uses of u" and ur". The full source is here, and the relevant bits are:

# -*- coding:utf-8 -*-
import re

# Regexps to match characters in the BMP according to their Unicode category.
# Extracted from Unicode specification, version 5.0.0, source:
# http://unicode.org/versions/Unicode5.0.0/
unicode_categories = {
    ur'Pi':ur'[\u00ab\u2018\u201b\u201c\u201f\u2039\u2e02\u2e04\u2e09\u2e0c\u2e1c]',
    ur'Sk':ur'[\u005e\u0060\u00a8\u00af\u00b4\u00b8\u02c2-\u02c5\u02d2-\u02df\u02...',
    ur'Sm':ur'[\u002b\u003c-\u003e\u007c\u007e\u00ac\u00b1\u00d7\u00f7\u03f6\u204...',
    ...
    ur'Pf':ur'[\u00bb\u2019\u201d\u203a\u2e03\u2e05\u2e0a\u2e0d\u2e1d]',
    ur'Me':ur'[\u0488\u0489\u06de\u20dd-\u20e0\u20e2-\u20e4]',
    ur'Mc':ur'[\u0903\u093e-\u0940\u0949-\u094c\u0982\u0983\u09be-\u09c0\u09c7\u0...',
}

def hack_regexp(regexp_string):
    for (k,v) in unicode_categories.items():
        regexp_string = regexp_string.replace((ur'\p{%s}' % k),v)
    return regexp_string

def regex(regexp_string,flags=0):
    """Shortcut for re.compile that also translates and add the UNICODE flag

    Example usage:
        >>> from unicode_hack import regex
        >>> result = regex(ur'^\p{Ll}\p{L}*').match(u'áÇñ123')
        >>> print result.group(0)
        áÇñ
        >>> 
    """
    return re.compile(hack_regexp(regexp_string), flags | re.UNICODE)

(on PyPy there is no match in the "Example usage", so result is None)

Reiterating, the program works fine (on CPython): the Unicode data seems correct, the replace works as intended, the usage example runs ok (both via doctest and directly typing it in the command line). The source file encoding is also correct, and the coding directive in the header seems to be recognized by Python.

Any ideas of what PyPy does "different" that is breaking my code? Many things came to my head (unrecognized coding header, different encodings in the command line, different interpretations of r and u) but as far as my tests go, both CPython and PyPy seem to behave identically, so I'm clueless about what to try next.

Seems PyPy has some encoding problems, both when reading a source file (unrecognized coding header, maybe) and when inputting/outputting in the command line. I replaced my example code with the following:

>>> from unicode_hack import regex
>>> result = regex(ur'^\p{Ll}\p{L}*').match(u'áÇñ123')
>>> print result.group(0) == u'áÇñ'
True
>>>

And it kept working on CPython and failing on PyPy. Replacing the "áÇñ" for its escaped characters - u'\xe1\xc7\xf1' - OTOH did the trick:

>>> from unicode_hack import regex
>>> result = regex(ur'^\p{Ll}\p{L}*').match(u'\xe1\xc7\xf1123')
>>> print result.group(0) == u'\xe1\xc7\xf1'
True
>>>

That worked fine on both. I believe the problem is restricted to these two scenarios (source loading and command line), since trying to open an UTF-8 file using codecs.open works fine. When I try to input the string "áÇñ" in the command line, or when I load the source code of "unicode_hack.py" using codecs, I get the same result on CPython:

>>> u'áÇñ'
u'\xe1\xc7\xf1'
>>> import codecs
>>> codecs.open('unicode_hack.py','r','utf8').read()[19171:19174]
u'\xe1\xc7\xf1'

but different results on PyPy:

>>>> u'áÇñ'
u'\xa0\u20ac\xa4'
>>>> import codecs
>>>> codecs.open('unicode_hack.py','r','utf8').read()[19171:19174]
u'\xe1\xc7\xf1'

Update: Issue1139 submitted on PyPy bug tracking system, let's see how that turns out...
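
One possible reading of the garbled PyPy output, offered as an assumption rather than a confirmed diagnosis: the three code points PyPy reported are exactly what you get if the console sent the text as code page 850 bytes and they were decoded as Windows-1252. The sketch below (Python 3 syntax, unlike the Python 2 code above) only demonstrates that the byte values line up; it says nothing about what PyPy actually did internally.

```python
# The intended text, written with escapes so the source encoding is irrelevant.
text = "\xe1\xc7\xf1"  # áÇñ

# Assumption: console bytes in cp850 (an OEM code page) read back as cp1252.
console_bytes = text.encode("cp850")
garbled = console_bytes.decode("cp1252")

print(repr(console_bytes), repr(garbled))
```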

Regex: Help in allowing only some letters, banning special characters ($% etc.), except others (' -). Multi word string

6 votes

I need a Regex for PHP to do the following:

I want to allow [a-zα-ωá-źа-яա-ֆა-ჰא-ת] and chinese, japanese (more utf-8) letters; I want to ban [^٩٨٧٦٥٤٣٢١٠۰۱۲۳۴۵۶۷۸۹] (arabic numbers);

This is what i've done:

function isValidFirstName($first_name) {
    return preg_match("/^(?=[a-zα-ωá-źа-яա-ֆა-ჰא-ת]+([a-zα-ωá-źа-яա-ֆა-ჰא-ת' -]+)?\z)[a-zα-ωá-źа-яա-ֆა-ჰא-ת' -]+$/i", $first_name);
}

It looks like it works, but if I type letters of more than 1 language, it doesn't validate.

Examples: Авпа Вапапва á-ź John - doesn't validate. John Gger - validates, á-ź á-ź - validates.

I would like all of these to validate.

Or, if there's a way, to echo a message if the user entered a multilingual string.

I can't reproduce the failure cases here (Авпа Вапапва á-ź John validates just fine), but you can simplify the regex a lot - you don't need that lookahead assertion:

preg_match('/^[a-zα-ωá-źа-яա-ֆა-ჰא-ת][a-zα-ωá-źа-яա-ֆა-ჰא-ת\' -]*$/i', $first_name)

As far as I can tell from the character ranges you've given, you don't need to exclude the digits because anything outside these character classes will already cause the regex to fail.

Another consideration: If your goal is to allow any letter from any language/script (plus some punctuation and space) you can (if you're using Unicode strings) further simplify this to:

preg_match('/^\pL[\pL\' -]*$/iu', $first_name)

But generally, I wouldn't try to validate a name by regular expressions (or any other means): Falsehoods programmers believe about names.
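
For readers working outside PHP, here is a rough Python equivalent of the \pL version. Python's built-in re does not support \pL, so this sketch substitutes [^\W\d_] (word characters minus digits and underscore) as an approximation of "any letter"; is_valid_first_name is a hypothetical helper mirroring the question's function.

```python
import re

# Approximation of \pL: a word character that is neither a digit nor "_".
LETTER = r"[^\W\d_]"

# A letter first, then any mix of letters, apostrophes, spaces and hyphens.
NAME = re.compile(LETTER + r"(?:" + LETTER + r"|[' -])*")

def is_valid_first_name(name):
    """Return True if the whole string fits the pattern above."""
    return NAME.fullmatch(name) is not None

checks = {
    "Авпа Вапапва John": is_valid_first_name("Авпа Вапапва John"),
    "John Gger": is_valid_first_name("John Gger"),
    "John3": is_valid_first_name("John3"),
    "٣John": is_valid_first_name("٣John"),
    "-John": is_valid_first_name("-John"),
}
print(checks)
```

Mixed-script names pass, while both ASCII and Arabic-Indic digits are rejected without listing them explicitly.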

perl regex for extracting multiline blocks

6 votes

I have text like this:

00:00 stuff
00:01 more stuff
multi line
  and going
00:02 still 
    have

So, I don't have a block end, just a new block start.

I want to recursively get all blocks:

1 = 00:00 stuff
2 = 00:01 more stuff
multi line
  and going

etc

The code below only gives me this:

$VAR1 = '00:00';
$VAR2 = '';
$VAR3 = '00:01';
$VAR4 = '';
$VAR5 = '00:02';
$VAR6 = '';

What am I doing wrong?

my $text = '00:00 stuff
00:01 more stuff
multi line
 and going
00:02 still 
have
    ';
my @array = $text =~ m/^([0-9]{2}:[0-9]{2})(.*?)/gms;
print Dumper(@array);

This should do the trick. Beginning of next \d\d:\d\d is treated as block end.

$Str = '00:00 stuff
00:01 more stuff
multi line
  and going
00:02 still 
    have
00:03 still 
    have' ;

@Blocks = ($Str =~ m#(\d\d:\d\d.+?(?:(?=\d\d:\d\d)|$))#gs);

print join "--\n", @Blocks;
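
The same idea, sketched in Python for comparison. It also shows why the question's attempt returned empty strings: a lazy (.*?) with nothing after it is satisfied by matching nothing. Unlike the Perl above, this variant anchors the timestamp at a line start with ^ and MULTILINE, which is my own tightening and not part of the original answer.

```python
import re

text = (
    "00:00 stuff\n"
    "00:01 more stuff\n"
    "multi line\n"
    "  and going\n"
    "00:02 still \n"
    "    have\n"
)

flags = re.MULTILINE | re.DOTALL

# The question's pattern: the lazy group stops immediately, capturing "".
lazy_attempt = re.findall(r"^(\d\d:\d\d)(.*?)", text, flags)

# A lookahead for the next timestamp (or the end of the text) ends each block.
blocks = re.findall(r"^\d\d:\d\d.*?(?=^\d\d:\d\d|\Z)", text, flags)

print(lazy_attempt)
print(blocks)
```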

How to effectively search and replace in Vim by first "testing" or "preview" the search part?

6 votes

Sometimes I want to search and replace in Vim using the s/search_for/replace_with/options format, but the search_for part becomes a complicated regex that I can't get right the first time.

I have set incsearch hlsearch in my .vimrc so Vim will start highlighting as I type when I am searching using the /search_for format. This is useful to first "test"/"preview" my regex. Then once I get the regex I want, I apply to the s/ to search and replace.

But there are two big limitations to this approach:

  1. It's a hassle to copy and paste the regex I created in / mode to s/ mode.
  2. I can't preview with matched groups in regex (ie ( and )) or use the magic mode \v while in /.

So how do you guys on SO try to do complicated regex search and replace in Vim?

Test your regex in search mode with /, then use s//new_value/. When you pass nothing to the search portion of s, it takes the most recent search.

As @Sam Brink also says, you can use <C-r>/ to paste the contents of the search register, so s/<C-r>//new_value/ works too. This may be more convenient when you have a complicated search expression.

Regular expression pattern matching for number,alphabetcic blocks

6 votes

I have some strings like these:

aa11b2s
abc1sff3
a1b1sdd2

etc. I need to change these strings to these:

aa 11 b 2 s
abc 1 sff 3
a 1 b 1 sdd 2

Simply put, I need to add a space between each block of digits and each block of letters.

var str = 'aa11b2s';
console.log(str.replace(/([\d.]+)/g, ' $1 ').replace(/^ +| +$/g, ''));
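
An alternative way to think about it is to extract the blocks and join them, rather than padding the digits and trimming. A minimal sketch in Python (not the JavaScript above), assuming the strings contain only ASCII letters and digits as in the examples; space_blocks is a hypothetical name.

```python
import re

def space_blocks(s):
    """Split s into runs of digits or letters and join them with spaces."""
    return " ".join(re.findall(r"\d+|[A-Za-z]+", s))

spaced = [space_blocks(s) for s in ["aa11b2s", "abc1sff3", "a1b1sdd2"]]
print(spaced)
```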

split on last occurrence of digit, take 2nd part

6 votes

If I have a string and want to split on the last digit and keep the last part of the split, how can I do that?

x <- c("ID", paste0("X", 1:10, state.name[1:10]))

I'd like

 [1] NA            "Alabama"     "Alaska"      "Arizona"     "Arkansas"   
 [6] "California"  "Colorado"    "Connecticut" "Delaware"    "Florida"    
[11] "Georgia"    

But would settle for:

 [1] "ID"          "Alabama"     "Alaska"      "Arizona"     "Arkansas"   
 [6] "California"  "Colorado"    "Connecticut" "Delaware"    "Florida"    
[11] "Georgia"    

I can get the first part by:

unlist(strsplit(x, "[^0-9]*$"))

But want the second part.

Thank you in advance.

library(stringr)
unlist(lapply(str_split(x, "[0-9]"), tail,n=1))

gives

[1] "ID"          "Alabama"     "Alaska"      "Arizona"     "Arkansas"    "California"  "Colorado"    "Connecticut" "Delaware"   
[10] "Florida"     "Georgia"

I would look at the stringr documentation for a (quite possibly) even better approach.
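
For comparison, both desired outputs from the question can be sketched in Python. The states list is typed out by hand to stand in for R's state.name[1:10], and the variable names are illustrative.

```python
import re

states = [
    "Alabama", "Alaska", "Arizona", "Arkansas", "California",
    "Colorado", "Connecticut", "Delaware", "Florida", "Georgia",
]
x = ["ID"] + ["X{}{}".format(i, name) for i, name in enumerate(states, start=1)]

# Split on every digit and keep the last piece; "ID" has no digit and is kept.
tails = [re.split(r"\d", s)[-1] for s in x]

# Variant that yields None when there is no digit at all (the NA case).
def after_last_digit(s):
    m = re.search(r"\d(\D*)$", s)
    return m.group(1) if m else None

tails_or_none = [after_last_digit(s) for s in x]

print(tails)
print(tails_or_none)
```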