Best regex questions in September 2011

Match a^n b^n c^n (e.g. "aaabbbccc") using regular expressions (PCRE)

30 votes

It is a well known fact that modern regular expression implementations (most notably PCRE) have little in common with the original notion of regular grammars. For example you can parse the classical example of a context-free grammar {anbn; n>0} (e.g. aaabbb) using this regex (demo):

~^(a(?1)?b)$~

My question is: How far can you go? Is it also possible to parse the context-sensitive grammar {anbncn;n>0} (e.g. aaabbbccc) using PCRE?

Inspired by NullUserExceptions answer (which he already deleted as it failed for one case) I think I have found a solution myself:

$regex = '~^
    (?=(a(?-1)?b)c)
     a+(b(?-1)?c)
$~x';

var_dump(preg_match($regex, 'aabbcc'));    // 1
var_dump(preg_match($regex, 'aaabbbccc')); // 1
var_dump(preg_match($regex, 'aaabbbcc'));  // 0
var_dump(preg_match($regex, 'aaaccc'));    // 0
var_dump(preg_match($regex, 'aabcc'));     // 0
var_dump(preg_match($regex, 'abbcc'));     // 0

Try it yourself: http://codepad.viper-7.com/1erq9v


Explanation

If you consider the regex without the positive lookahead assertion (the (?=...) part), you have this:

~^a+(b(?-1)?c)$~

This does nothing more than check that there's an arbitrary number of as, followed by an equal number of bs and cs.

This doesn't yet satisfy our grammar, because the number of as must be the same, too. We can ensure that by checking that the number of as equals the number of bs. And this is what the expression in the lookahead assertion does: (a(?-1)?b)c. The c is necessary so we don't only match a part of the bs.


Conclusion

I think this impressively shows that modern regex is not only capable of parsing non-regular grammars, but can even parse non-context-free grammars. Hopefully this will lay to rest the endless parroting of "you can't do X with regex because X isn't regular"

How to match string with diacritic in modern perl?

15 votes

For example, match "Nation" in ""Îñţérñåţîöñåļîžåţîöñ" without extra modules. Is it possible in new Perl versions (5.14, 5.15 etc)?

I found an answer! Thanks to tchrist

Rigth solution with UCA match (thnx to http://stackoverflow.com/users/471272/tchrist).

# found start/end offsets for matched utf-substring (without intersections)
use 5.014;
use strict; 
use warnings;
use utf8;
use Unicode::Collate;
binmode STDOUT, ':encoding(UTF-8)';
my $str  = "Îñţérñåţîöñåļîžåţîöñ" x 2;
my $look = "Nation";
my $Collator = Unicode::Collate->new(
    normalization => undef, level => 1
   );

my @match = $Collator->match($str, $look);
if (@match) {
    my $found = $match[0];
    my $f_len  = length($found);
    say "match result: $found (length is $f_len)"; 
    my $offset = 0;
    while ((my $start = index($str, $found, $offset)) != -1) {                                                  
        my $end   = $start + $f_len;
        say sprintf("found at: %s,%s", $start, $end);
        $offset = $end + 1;
    }
}

Wrong (but working) solution from http://www.perlmonks.org/?node_id=485681

Magic piece of code is:

    $str = Unicode::Normalize::NFD($str); $str =~ s/\pM//g;

code example:

    use 5.014;
    use utf8;
    use Unicode::Normalize;

    binmode STDOUT, ':encoding(UTF-8)';
    my $str  = "Îñţérñåţîöñåļîžåţîöñ";
    my $look = "Nation";
    say "before: $str\n";
    $str = NFD($str);
    # M is short alias for \p{Mark} (http://perldoc.perl.org/perluniprops.html)
    $str =~ s/\pM//og; # remove "marks"
    say "after: $str";¬
    say "is_match: ", $str =~ /$look/i || 0;

Right solution with UCA (thnx to tchrist):

# found start/end offsets for matched s
use 5.014;
use utf8;
use Unicode::Collate;
binmode STDOUT, ':encoding(UTF-8)';
my $str  = "Îñţérñåţîöñåļîžåţîöñ" x 2;
my $look = "Nation";
my $Collator = Unicode::Collate->new(
    normalization => undef, level => 1
   );

my @match = $Collator->match($str, $look);
say "match ok!" if @match;

P.S. "Code that assumes you can remove diacritics to get at base ASCII letters is evil, still, broken, brain-damaged, wrong, and justification for capital punishment." © tchrist Why does modern Perl avoid UTF-8 by default?

Regular expression for string of digits with no repeated digits?

8 votes

I'm reading through the dragon book and trying to solve an exercise that is stated as follows

Write regular definitions for the following languages:

  • All strings of digits with no repeated digits. Hint: Try this problem first with a few digits, such as { 0, 1, 2 }.

Despite having tried to solve it for hours, I can't imagine a solution, beside the extremely wordy

d0 -> 0?
d1 -> 1?
d2 -> 2?
d3 -> 3?
d4 -> 4?
d5 -> 5?
d6 -> 6?
d7 -> 7?
d8 -> 8?
d9 -> 9?
d10 -> d0d1d2d3d4d5d6d7d8d9 | d0d1d2d3d4d5d6d7d9d8 | ...

Hence having to write 10! alternatives in d10. Since we shall write this regular definition, I doubt that this is a proper solution. Can you help me please?

So the question didn't necessarily ask you to write a regular expression, it asked you to provide a regular definition, which I interpret to include NFA's. It turns out it doesn't matter which you use, as all NFA's can be shown to be mathematically equivalent to regular expressions.

Using the digits 0, 1, and 2, a valid NFA would be the following (sorry for the crummy diagram):

enter image description here

Each state represents the last digit scanned in the input, and there are no loops on any of the nodes, therefore this is an accurate representation of a string with no repeated digits from the set {0,1,2}. Extending this is trivial (although it requires a large whiteboard :) ).

NOTE: I am making the assumption that the string "0102" IS valid, but the string "0012" is not.

This can be converted to a regular expression (although it will be painful) by using the algorithm described here.

Javascript regular expression for punctuation (international)?

8 votes

I need a regular expression to match against all punctuation marks, such as the standard [,!@#$%^&*()], but including international marks like the upside-down Spanish question mark, Chinese periods, etc. My google-fu is coming up short. Does anyone have such a regular expression on hand that's compatible with Javascript?

If its possible for you to use a plugin, then

There is a plugin for javascript: XRegExp Unicode plugins. That adds support for Unicode categories, scripts, and blocks (I personally have only read about it, I never used it).

With this plugin it should be possible to use Unicode categories like \p{P} as explained here on regular-expressions.info

Update: OK, I tested it now, seems to work fine

you need to get the lib from XRegExp and additionally the plugins (linked above)

<script src="xregexp.js"></script>
<script src="xregexp-unicode-base.js"></script>
<script src="xregexp-unicode-categories.js"></script>
</script><script type="text/javascript">
    var unicodeWord = XRegExp("^\\p{P}+$");

    document.write(unicodeWord.test("?.,;!¡¿。、·")); // true
</script>

returns true and I included some spanish and chinese punctuation in my teststring ".,;!¡¿。、·".

Escaping apostrophes in regex?

7 votes

I'm trying to validate a form using a regular expression found here http://regexlib.com/. What I am trying to do is filter out all characters except a-z, commas and apostrophes. If I use this code:

<cfinput name="FirstName" type="text" class="fieldwidth" maxlength="90" required="yes"    validateat="onsubmit,onserver" message="Please ensure you give your First Name and it does not contain any special characters except hyphens or apostrophes." validate="regular_expression" pattern="^([a-zA-Z'-]+)$" />

I get the following error: Unmatched [] in expression. I figured out this relates to the apostrophe because it works if I use this code(but does not allow apostrophes):

<cfinput name="FirstName" type="text" class="fieldwidth" maxlength="90" required="yes"    validateat="onsubmit,onserver" message="Please ensure you give your First Name and it does not contain any special characters except hyphens or apostrophes." validate="regular_expression" pattern="^([a-zA-Z-]+)$" />

So I'm wondering is there some special way to escape apostrophes when using regular expressions?

EDIT

I think I've found where the problem is being caused (thanks to xanatos), not sure how to fix it. Basically CF is generating a hidden field to validate the field as follows:

<input type='hidden' name='FirstName_CFFORMREGEX' value='^([a-zA-Z'-]+)$'>

Because it is using single apostrophes rather than speech marks round the value, it is interpreting the apostrophe as the end of the value.

I think there is a bug in the cfinput implementation. It probably uses the string you pass in pattern in a Javascript Regex but it uses the ' to quote it. So it converts it in:

new Regex('^([a-zA-Z'-]+)$')

Try replacing the quote with \x27 (it's the code for the single quote)

RegEx to split camelCase or TitleCase (advanced)

7 votes

I found a brilliant RegEx to extract the part of a camelCase or TitleCase expression.

 (?<!^)(?=[A-Z])

It works as expected:

  • value -> value
  • camelValue -> camel / Value
  • TitleValue -> Title / Value

For example with Java:

String s = "loremIpsum";
words = s.split("(?<!^)(?=[A-Z])");
//words equals words = new String[]{"lorem","Ipsum"}

My problem is that it does not work in some cases:

  • Case 1: VALUE -> V / A / L / U / E
  • Case 2: eclipseRCPExt -> eclipse / R / C / P / Ext

To my mind, the result shoud be:

  • Case 1: VALUE
  • Case 2: eclipse / RCP / Ext

In other words, given n uppercase chars:

  • if the n chars are followed by lower case chars, the groups should be: (n-1 chars) / (n-th char + lower chars)
  • if the n chars are at the end, the group should be: (n chars).

Any idea on how to improve this regex?

The following regex works for all of the above examples:

public static void main(String[] args)
{
    for (String w : "camelValue".split("(?<!(^|[A-Z]))(?=[A-Z])|(?<!^)(?=[A-Z][a-z])")) {
        System.out.println(w);
    }
}   

It works by forcing the negative lookbehind to not only ignore matches at the start of the string, but to also ignore matches where a capital letter is preceded by another capital letter. This handles cases like "VALUE".

The first part of the regex on its own fails on "eclipseRCPExt" by failing to split between "RPC" and "Ext". This is the purpose of the second clause: (?<!^)(?=[A-Z][a-z]. This clause allows a split before every capital letter that is followed by a lowercase letter, except at the start of the string.

Python regex: how to replace each instance of an occurrence with a different value?

6 votes

Suppose I have this string: s = "blah blah blah"

Using Python regex, how can I replace each instance of "blah" with a different value (e.g. I have a list of values v = ("1", "2", "3")

You could use a re.sub callback:

import re
def callback(match):
    return next(callback.v)
callback.v=iter(('1','2','3'))

s = "blah blah blah"
print(re.sub(r'blah',callback,s))

yields

1 2 3

Strange behavior of Javascript regex test function

6 votes
url_regex = /\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:(?:[^\s()<>.]+[.]?)+|\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/gi;

This is the result when test function is executed consecutively

url_regex.test('http://t.co') returns true
url_regex.test('http://t.co') returns false
url_regex.test('http://t.co') returns true
url_regex.test('http://t.co') returns false

and so on.

But if i do not store the regex in a variable and directly call the test function, it returns true always

/\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:
(?:[^\s()<>.]+[.]?)+|\((?:[^\s()<>]+|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]+|(?:\
([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))/gi

test('http://t.co') always returns true

I am using latest version of chrome for testing this.

The doc on MDN explains, what happens:

As with exec (or in combination with it), test called multiple times on the same global regular expression instance will advance past the previous match.

You cannot advance further, since the regexp consumed all chars, and will test against the empty string. In the next iteration, it starts again from the first char, since it tested until the end now.

Infinite while-loop in perl

6 votes

Is there a way to do this without getting an infinite loop?

while((my $var) = $string =~ /regexline(.+?)end/g) {
    print $var;
}

This results in an infinite loop, probably because the assigning of a var directly from a regex inside the while returns "true" every time?

I know I can do this:

while($string =~ /regexline(.+?)end/g) {
     my $var = $1;      
     print $var;
}

But I was hoping I could save a line. Is there a regex modifier I can use or something like that?

(Also, what is this notation/trick actually called, if I want to search for it:

(my $var) = $string =~ /regex/;

Thanks!!

Is there a way to do this without getting an infinite loop?

Yes. Use a foreach() instead of a while() loop:

foreach my $var ($string =~ /regexline(.+?)end/g) {

what is this notation/trick actually called, if I want to search for it

It is called a match in list context. It is described in "perldoc perlop":

The g modifier specifies global pattern matching--that is, matching as many times as possible within the string. How it behaves depends on the context. In list context ...

Matching 2 regular expressions in Python

6 votes

Is it possible to match 2 regular expressions in Python?

For instance, I have a use-case wherein I need to compare 2 expressions like this:

re.match('google\.com\/maps', 'google\.com\/maps2', re.IGNORECASE)

I would expect to be returned a RE object.

But obviously, Python expects a string as the second parameter. Is there a way to achieve this, or is it a limitation of the way regex matching works?


Background: I have a list of regular expressions [r1, r2, r3, ...] that match a string and I need to find out which expression is the most specific match of the given string. The way I assumed I could make it work was by:
(1) matching r1 with r2.
(2) then match r2 with r1.
If both match, we have a 'tie'. If only (1) worked, r1 is a 'better' match than r2 and vice-versa.
I'd loop (1) and (2) over the entire list.

I admit it's a bit to wrap one's head around (mostly because my description is probably incoherent), but I'd really appreciate it if somebody could give me some insight into how I can achieve this. Thanks!

Outside of the syntax clarification on re.match, I think I am understanding that you are struggling with taking two or more unknown (user input) regex expressions and classifying which is a more 'specific' match against a string.

Recall for a moment that a Python regex really is a type of computer program. Most modern forms, including Python's regex, are based on Perl. Perl's regex's have recursion, backtracking, and other forms that defy trivial inspection. Indeed a rogue regex can be used as a form of denial of service attack.

To see of this on your own computer, try:

>>> re.match(r'^(a+)+$','a'*24+'!')

That takes about 1 second on my computer. Now increase the 24 in 'a'*24 to a bit larger number, say 28. That take a lot longer. Try 48... You will probably need to CTRL+C now. The time increase as the number of a's increase is, in fact, exponential.

You can read more about this issue in Russ Cox's wonderful paper on 'Regular Expression Matching Can Be Simple And Fast'. Russ Cox is the Goggle engineer that built Google Code Search in 2006. As Cox observes, consider matching the regex 'a?'*33 + 'a'*33 against the string of 'a'*99 with awk and Perl (or Python or PCRE or Java or PHP or ...) Awk matches in 200 microseconds but Perl would require 1015 years because of exponential back tracking.

So the conclusion is: it depends! What do you mean by a more specific match? Look at some of Cox's regex simplification techniques in RE2. If your project is big enough to write your own libraries (or use RE2) and you are willing to restrict the regex grammar used (i.e., no backtracking or recursive forms), I think the answer is that you would classify 'a better match' in a variety of ways.

If you are looking for a simple way to state that (regex_3 < regex_1 < regex_2) when matched against some string using Python or Perl's regex language, I think that the answer is it is very very hard (i.e., this problem is NP Complete)

Edit

Everything I said above is true! However, here is a stab at sorting matching regular expressions based on one form of 'specific': How many edits to get from the regex to the string. The greater number of edits (or the higher the Levenshtein distance) the less 'specific' the regex is.

You be the judge if this works (I don't know what 'specific' means to you for your application):

import re

def ld(a,b):
    "Calculates the Levenshtein distance between a and b."
    n, m = len(a), len(b)
    if n > m:
        # Make sure n <= m, to use O(min(n,m)) space
        a,b = b,a
        n,m = m,n

    current = range(n+1)
    for i in range(1,m+1):
        previous, current = current, [i]+[0]*n
        for j in range(1,n+1):
            add, delete = previous[j]+1, current[j-1]+1
            change = previous[j-1]
            if a[j-1] != b[i-1]:
                change = change + 1
            current[j] = min(add, delete, change)      
    return current[n]

s='Mary had a little lamb'    
d={}
regs=[r'.*', r'Mary', r'lamb', r'little lamb', r'.*little lamb',r'\b\w+mb',
        r'Mary.*little lamb',r'.*[lL]ittle [Ll]amb',r'\blittle\b',s,r'little']

for reg in regs:
    m=re.search(reg,s)
    if m:
        print "'%s' matches '%s' with sub group '%s'" % (reg, s, m.group(0))
        ld1=ld(reg,m.group(0))
        ld2=ld(m.group(0),s)
        score=max(ld1,ld2)
        print "  %i edits regex->match(0), %i edits match(0)->s" % (ld1,ld2)
        print "  score: ", score
        d[reg]=score
        print
    else:
        print "'%s' does not match '%s'" % (reg, s)   

print "   ===== %s =====    === %s ===" % ('RegEx'.center(10),'Score'.center(10))

for key, value in sorted(d.iteritems(), key=lambda (k,v): (v,k)):
    print "   %22s        %5s" % (key, value) 

The program is taking a list of regex's and matching against the string Mary had a little lamb.

Here is the sorted ranking from "most specific" to "least specific":

   =====   RegEx    =====    ===   Score    ===
   Mary had a little lamb            0
        Mary.*little lamb            7
            .*little lamb           11
              little lamb           11
      .*[lL]ittle [Ll]amb           15
               \blittle\b           16
                   little           16
                     Mary           18
                  \b\w+mb           18
                     lamb           18
                       .*           22

This based on the (perhaps simplistic) assumption that: a) the number of edits (the Levenshtein distance) to get from the regex itself to the matching substring is the result of wildcard expansions or replacements; b) the edits to get from the matching substring to the initial string. (just take one)

As two simple examples:

  1. .* (or .*.* or .*?.* etc) against any sting is a large number of edits to get to the string, in fact equal to the string length. This is the max possible edits, the highest score, and the least 'specific' regex.
  2. The regex of the string itself against the string is as specific as possible. No edits to change one to the other resulting in a 0 or lowest score.

As stated, this is simplistic. Anchors should increase specificity but they do not in this case. Very short stings don't work because the wild-card may be longer than the string.

Edit 2

I got anchor parsing to work pretty darn well using the undocumented sre_parse module in Python. Type >>> help(sre_parse) if you want to read more...

This is the goto worker module underlying the re module. It has been in every Python distribution since 2001 including all the P3k versions. It may go away, but I don't think it is likely...

Here is the revised listing:

import re
import sre_parse

def ld(a,b):
    "Calculates the Levenshtein distance between a and b."
    n, m = len(a), len(b)
    if n > m:
        # Make sure n <= m, to use O(min(n,m)) space
        a,b = b,a
        n,m = m,n

    current = range(n+1)
    for i in range(1,m+1):
        previous, current = current, [i]+[0]*n
        for j in range(1,n+1):
            add, delete = previous[j]+1, current[j-1]+1
            change = previous[j-1]
            if a[j-1] != b[i-1]:
                change = change + 1
            current[j] = min(add, delete, change)      
    return current[n]

s='Mary had a little lamb'    
d={}
regs=[r'.*', r'Mary', r'lamb', r'little lamb', r'.*little lamb',r'\b\w+mb',
        r'Mary.*little lamb',r'.*[lL]ittle [Ll]amb',r'\blittle\b',s,r'little',
        r'^.*lamb',r'.*.*.*b',r'.*?.*',r'.*\b[lL]ittle\b \b[Ll]amb',
        r'.*\blittle\b \blamb$','^'+s+'$']

for reg in regs:
    m=re.search(reg,s)
    if m:
        ld1=ld(reg,m.group(0))
        ld2=ld(m.group(0),s)
        score=max(ld1,ld2)
        for t, v in sre_parse.parse(reg):
            if t=='at':      # anchor...
                if v=='at_beginning' or 'at_end':
                    score-=1   # ^ or $, adj 1 edit

                if v=='at_boundary': # all other anchors are 2 char
                    score-=2

        d[reg]=score
    else:
        print "'%s' does not match '%s'" % (reg, s)   

print
print "   ===== %s =====    === %s ===" % ('RegEx'.center(15),'Score'.center(10))

for key, value in sorted(d.iteritems(), key=lambda (k,v): (v,k)):
    print "   %27s        %5s" % (key, value) 

And soted RegEx's:

   =====      RegEx      =====    ===   Score    ===
        Mary had a little lamb            0
      ^Mary had a little lamb$            0
          .*\blittle\b \blamb$            6
             Mary.*little lamb            7
     .*\b[lL]ittle\b \b[Ll]amb           10
                    \blittle\b           10
                 .*little lamb           11
                   little lamb           11
           .*[lL]ittle [Ll]amb           15
                       \b\w+mb           15
                        little           16
                       ^.*lamb           17
                          Mary           18
                          lamb           18
                       .*.*.*b           21
                            .*           22
                         .*?.*           22

Unicode characters having asymmetric upper/lower case. Why?

5 votes

Why do the following three characters have not symmetric toLower, toUpper results

/**
  * Written in the Scala programming language, typed into the Scala REPL.
  * Results commented accordingly.
  */
/* Unicode Character 'LATIN CAPITAL LETTER SHARP S' (U+1E9E) */
'\u1e9e'.toHexString == "1e9e" // true
'\u1e9e'.toLower.toHexString == "df" // "df" == "df"
'\u1e9e'.toHexString == '\u1e9e'.toLower.toUpper.toHexString // "1e9e" != "df"
/* Unicode Character 'KELVIN SIGN' (U+212A) */
'\u212a'.toHexString == "212a" // "212a" == "212a"
'\u212a'.toLower.toHexString == "6b" // "6b" == "6b"
'\u212a'.toHexString == '\u212a'.toLower.toUpper.toHexString // "212a" != "4b"
/* Unicode Character 'LATIN CAPITAL LETTER I WITH DOT ABOVE' (U+0130) */
'\u0130'.toHexString == "130" // "130" == "130"
'\u0130'.toLower.toHexString == "69" // "69" == "69"
'\u0130'.toHexString == '\u0130'.toLower.toUpper.toHexString // "130" != "49"

For the first one, there is this explanation:

In the German language, the Sharp S ("ß" or U+00df) is a lowercase letter, and it capitalizes to the letters "SS".

In other words, U+1E9E lower-cases to U+00DF, but the upper-case of U+00DF is not U+1E9E.

For the second one, U+212A (KELVIN SIGN) lower-cases to U+0068 (LATIN SMALL LETTER K). The upper-case of U+0068 is U+004B (LATIN CAPITAL LETTER K). This one seems to make sense to me.

For the third case, U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) is a Turkish/Azerbaijani character that lower-cases to U+0069 (LATIN SMALL LETTER I). I would imagine that if you were somehow in a Turkish/Azerbaijani locale you'd get the proper upper-case version of U+0069, but that might not necessarily be universal.

Characters need not necessarily have symmetric upper- and lower-case transformations.

Edit: To respond to PhiLho's comment below, the Unicode 6.0 spec has this to say about U+212A (KELVIN SIGN):

Three letterlike symbols have been given canonical equivalence to regular letters: U+2126 OHM SIGN, U+212A KELVIN SIGN, and U+212B ANGSTROM SIGN. In all three instances, the regular letter should be used. If text is normalized according to Unicode Standard Annex #15, “Unicode Normalization Forms,” these three characters will be replaced by their regular equivalents.

In other words, you shouldn't really be using U+212A, you should be using U+004B (LATIN CAPITAL LETTER K) instead, and if you normalize your Unicode text, U+212A should be replaced with U+004B.

Postgres regex: Behavior of \s and \S and character class seems wrong

5 votes

The documentation says that \s is whitespace and \S is not whitespace. So far, nothing new to regex users.

But let's check some return values:

SELECT SUBSTRING('abc a c' FROM 'a\\sc');
'a c'

SELECT SUBSTRING('abc a c' FROM 'a[\\s]c'); -- Note the character class
'a c'

SELECT SUBSTRING('abc a c' FROM 'a\\Sc');
'abc'

SELECT SUBSTRING('abc a c' FROM 'a[\\S]c'); -- Note the character class
ERROR:  invalid regular expression: invalid escape \ sequence

So it seems, \s can be used in a character class and \S cannot. Why?

From the manual:

Within bracket expressions, \d, \s, and \w lose their outer brackets, and \D, \S, and \W are illegal.

In any case, the brackets seem redundant since \s and \S themselves are character classes.

The following syntax works for me as an alternative to a[\\S]c:

SELECT SUBSTRING('abc a c' FROM 'a[^[:space:]]c');

Is there a theorical expression size limit for "or" operator on Regex.Replace

5 votes

Is there a theorical expression size limit for "or" operator on Regex.Replace such as Regex.Replace("abc","(a|c|d|e...continue say 500000 elements here)","zzz") ?

Any stackoverflowException on .NET's implementation ?

Thanks

There is no theoretical limit, though each regular expression engine will have its own implementation limits. In this case, since you are using .NET the limit is due to the amount of memory the .NET runtime can use.

A regular expression with one million alernations works fine for me:

string input = "a<142>c";
var options = Enumerable.Range(0, 1000000).Select(x => "<" + x + ">");
string pattern = string.Join("|", options);
string result = Regex.Replace(input, pattern, "zzz");

Result:

azzzc

It's very slow though. Increasing the number of options to 10 million gives me an OutOfMemoryException.

You probably would benefit from looking at another approach.

Replace spaces but not when between parentheses

5 votes

I guess I can do this with multiple regexs fairly easily, but I want to replace all the spaces in a string, but not when those spaces are between parentheses.

For example:

Here is a string (that I want to) replace spaces in.

After the regex I want the string to be

Hereisastring(that I want to)replacespacesin.

Is there an easy way to do this with lookahead or lookbehing operators?

I'm a little confused on how they work, and not real sure they would work in this situation.

Try this:

replace(/\s+(?=[^()]*(\(|$))/g, '')

A quick explanation:

\s+          # one or more white-space chars
(?=          # start positive look ahead
  [^()]*     #   zero or more chars other than '(' and ')'
  (          #   start group 1
    \(       #     a '('
    |        #     OR
    $        #     the end of input
  )          #   end group 1
)            # end positive look ahead

In plain English: it matches one or more white space chars if either a ( or the end-of-input can be seen ahead without encountering any parenthesis in between.

An online Ideone demo: http://ideone.com/jaljw

The above will not work if:

  • there are nested parenthesis
  • parenthesis can be escaped

Perl iterate through each match

5 votes

Let's say I'm scanning through a page of raw html looking for this regex. (The quote mark on the end is intentional).

m/(https?:\/\/.*?(?:'|"))/

This pattern is likely to match ~ 100 times. What is a common perl idiom/a quick way to iterate through a list of all capture group matches?

From the perlretut (a very fine tutorial)

while ($x =~ /(\w+)/g) {
        print "Word is $1, ends at position ", pos $x, "\n";
    }

You can use while together with the g modifier to iterate over all matches, with $1 you get the content of your capturing group 1, and in this example from the tutorial you get also the position with pos.

Parse Apache log in PHP using preg_match

5 votes

I need to save data in a table (for reporting, stats etc...) so a user can search by time, user agent etc. I have a script that runs every day that reads the Apache Log and then insert it in the database.

Log format:

10.1.1.150 - - [29/September/2011:14:21:49 -0400] "GET /info/ HTTP/1.1" 200 9955 "http://www.domain.com/download/" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; de-at) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1"

My regex:

preg_match('/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] \"(\S+) (.*?) (\S+)\" (\S+) (\S+) (\".*?\") (\".*?\")$/',$log, $matches);

Now when I print:

print_r($matches);

Array
(
    [0] => 10.1.1.150 - - [29/September/2011:14:21:49 -0400] "GET /info/ HTTP/1.1" 200 9955 "http://www.domain.com/download/" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; de-at) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1"
    [1] => 10.1.1.150
    [2] => -
    [3] => -
    [4] => 29/September/2011
    [5] => 14:21:49
    [6] => -0400
    [7] => GET
    [8] => /info/
    [9] => HTTP/1.1
    [10] => 200
    [11] => 9955
    [12] => "http://www.domain.com/download/"
    [13] => "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; de-at) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1"
)

I get: "http://www.domain.com/download/" and same for user agent. How can I get rid of these " in the regex? Bonus (Is there any quick way to insert the date/time easily)?

Thanks

You could change the regex to:

preg_match('/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] \"(\S+) (.*?) (\S+)\" (\S+) (\S+) "([^"]*)" "([^"]*)"$/',$log, $matches);

I'm not sure what you meant by 'quick way to insert the date/time'; can you clarify that more for me?