Best regex questions in November 2011

Efficient strings containing each other

12 votes

I have two sets of strings (A and B), and I want to know all pairs of strings a in A and b in B where a is a substring of b.

The first step of coding this was the following:

for a in A:
    for b in B:
        if a in b:
            print (a,b)

However, I wanted to know: is there a more efficient way to do this with regular expressions (e.g. instead of checking if a in b:, check whether the regexp '.*' + a + '.*' matches b)? I thought that maybe using something like this would let me cache the Knuth-Morris-Pratt failure function for all a. Also, using a list comprehension for the inner for b in B: loop will likely give a pretty big speedup (and a nested list comprehension may be even better).

I'm not very interested in making a giant leap in the asymptotic runtime of the algorithm (e.g. using a suffix tree or anything else complex and clever). I'm more concerned with the constant (I just need to do this for several pairs of A and B sets, and I don't want it to run all week).

Do you know any tricks or have any generic advice to do this more quickly? Thanks a lot for any insight you can share!


Edit:

Using the advice of @ninjagecko and @Sven Marnach, I built a quick prefix table of 10-mers:

    import collections
    prefix_table = collections.defaultdict(set)
    for k, b in enumerate(B):
        # index every 10-character window of b
        for i in xrange(len(b) - 10 + 1):
            j = i + 10
            prefix_table[b[i:j]].add(k)

    for a in A:
        if len(a) >= 10:
            for k in prefix_table[a[:10]]:
                # check if a is in b
                # (a table hit is necessary, but not sufficient)
                if a in B[k]:
                    print (a, B[k])
        else:
            for k in xrange(len(B)):
                # a is too small to use the table; check if
                # a is in any b
                if a in B[k]:
                    print (a, B[k])

Of course you can easily write this as a list comprehension:

[(a, b) for a in A for b in B if a in b]

This might slightly speed up the loop, but don't expect too much. I doubt using regular expressions will help in any way with this one.

Edit: Here are some timings:

import itertools
import timeit
import re
import collections

with open("/usr/share/dict/british-english") as f:
    A = [s.strip() for s in itertools.islice(f, 28000, 30000)]
    B = [s.strip() for s in itertools.islice(f, 23000, 25000)]

def f():
    result = []
    for a in A:
        for b in B:
            if a in b:
                result.append((a, b))
    return result

def g():
    return [(a, b) for a in A for b in B if a in b]

def h():
    res = [re.compile(re.escape(a)) for a in A]
    return [(a, b) for a in res for b in B if a.search(b)]

def ninjagecko():
    d = collections.defaultdict(set)
    for k, b in enumerate(B):
        for i, j in itertools.combinations(range(len(b) + 1), 2):
            d[b[i:j]].add(k)
    return [(a, B[k]) for a in A for k in d[a]]

print "Nested loop", timeit.repeat(f, number=1)
print "List comprehension", timeit.repeat(g, number=1)
print "Regular expressions", timeit.repeat(h, number=1)
print "ninjagecko", timeit.repeat(ninjagecko, number=1)

Results:

Nested loop [0.3641810417175293, 0.36279606819152832, 0.36295199394226074]
List comprehension [0.362030029296875, 0.36148500442504883, 0.36158299446105957]
Regular expressions [1.6498990058898926, 1.6494300365447998, 1.6480278968811035]
ninjagecko [0.06402897834777832, 0.063711881637573242, 0.06389307975769043]

Edit 2: Added a variant of the algorithm suggested by ninjagecko to the timings. You can see it is much faster than all the brute-force approaches.

Edit 3: Used sets instead of lists to eliminate the duplicates. (I did not update the timings -- they remained essentially unchanged.)
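
For readers on Python 3, here is a minimal sketch of the k-mer table idea from the question's edit. The function name substring_pairs, the parameter k and the sample lists are assumptions made for illustration; they are not from the original posts, and no timings are claimed for it.

```python
import collections

def substring_pairs(A, B, k=3):
    """Return all (a, b) with a in A, b in B and a a substring of b."""
    # Index every k-character window of every b by the position of b in B.
    table = collections.defaultdict(set)
    for idx, b in enumerate(B):
        for i in range(len(b) - k + 1):
            table[b[i:i + k]].add(idx)

    pairs = set()
    for a in A:
        if len(a) >= k:
            # A table hit is necessary but not sufficient, so confirm below.
            candidates = table.get(a[:k], ())
        else:
            # a is shorter than k: fall back to scanning every b.
            candidates = range(len(B))
        for idx in candidates:
            if a in B[idx]:
                pairs.add((a, B[idx]))
    return pairs

# Hypothetical sample data.
A = ["cat", "at", "dog"]
B = ["concatenate", "hotdog", "bird"]
print(sorted(substring_pairs(A, B)))
```

The table only narrows down the candidates; the final a in b test is still needed because only the first k characters of a are looked up.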

Is this C++11 regex error me or the compiler?

11 votes

OK, this isn't the original program I had this problem in, but I duplicated it in a much smaller one. Very simple problem.

main.cpp:

#include <cstdio>   // printf
#include <iostream>
#include <regex>
using namespace std;

int main()
{
    regex r1("S");
    printf("S works.\n");
    regex r2(".");
    printf(". works.\n");
    regex r3(".+");
    printf(".+ works.\n");
    regex r4("[0-9]");
    printf("[0-9] works.\n");
    return 0;
}

Compiled successfully with this command, no error messages:

$ g++ -std=c++0x main.cpp

The last line of g++ -v, by the way, is:

gcc version 4.6.1 (Ubuntu/Linaro 4.6.1-9ubuntu3)

And the result when I try to run it:

$ ./a.out 
S works.
. works.
.+ works.
terminate called after throwing an instance of 'std::regex_error'
  what():  regex_error
Aborted

It happens the same way if I change r4 to \\s, \\w, or [a-z]. Is this a problem with the compiler? I might be able to believe that C++11's regex engine has different ways of saying "whitespace" or "word character," but square brackets not working is a stretch. Is it something that's been fixed in 4.6.2?

EDIT:

Joachim Pileborg has supplied a partial solution, using an extra regex_constants parameter to enable a syntax that supports square brackets, but neither basic, extended, awk, nor ECMAScript seem to support backslash-escaped terms like \\s, \\w, or \\t.

EDIT 2:

Using raw strings (R"(\w)" instead of "\\w") doesn't seem to work either.

ECMAScript syntax accepts [0-9], \s, \w, etc, see ECMA-262 (15.10). Here's an example with boost::regex that also uses the ECMAScript syntax by default:

#include <boost/regex.hpp>

int main(int argc, char* argv[]) {
  using namespace boost;
  regex e("[0-9]");
  return argc > 1 ? !regex_match(argv[1], e) : 2;
}

It works:

$ g++ -std=c++0x *.cc -lboost_regex && ./a.out 1

According to the C++11 standard (28.8.2) basic_regex() uses regex_constants::ECMAScript flag by default so it must understand this syntax.

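gcc 4.6.1 doesn't support C++11 regular expressions (28.13). Its libstdc++ ships the <regex> header, so the program compiles, but the implementation is incomplete, which is why constructing the pattern throws at run time.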

Splitting a string on the first space

11 votes

I'd like to split a vector of character strings (people's names) into two columns (vectors), one for first names and one for last names. The problem is that some people have a two-word last name. I can split out the first names using the code below, but the last name eludes me. (Look at obs 29 in the sample set below to get an idea: the Ford has a "last name" of Pantera L that must be kept together.)

What I have attempted so far:

x<-rownames(mtcars)
unlist(strsplit(x, " .*"))

What I'd like it to look like:

            MANUF       MAKE
27          Porsche     914-2
28          Lotus       Europa
29          Ford        Pantera L
30          Ferrari     Dino
31          Maserati    Bora
32          Volvo       142E

The regular expression rexp matches the word at the start of the string, an optional space, then the rest of the string. The parentheses delimit subexpressions, which are accessed as backreferences \\1 and \\2.

rexp <- "^(\\w+)\\s?(.*)$"
y <- data.frame(MANUF=sub(rexp,"\\1",x), MAKE=sub(rexp,"\\2",x))
tail(y)
#       MANUF      MAKE
# 27  Porsche     914-2
# 28    Lotus    Europa
# 29     Ford Pantera L
# 30  Ferrari      Dino
# 31 Maserati      Bora
# 32    Volvo      142E
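
For comparison, the same pattern can be tried outside R. The following is a minimal Python sketch, not part of the original answer; the helper name split_first_space and the short list of names are assumptions for illustration.

```python
import re

# Same pattern as rexp above: first word, an optional space, then the rest.
REXP = re.compile(r"^(\w+)\s?(.*)$")

def split_first_space(name):
    """Split 'Ford Pantera L' into ('Ford', 'Pantera L')."""
    match = REXP.match(name)
    # Fall back to (name, '') if the string does not start with a word.
    return match.groups() if match else (name, "")

names = ["Porsche 914-2", "Lotus Europa", "Ford Pantera L", "Volvo 142E"]
for manuf, make in map(split_first_space, names):
    print(manuf, "|", make)
```

Because only the first group is limited to word characters, everything after the first space ends up in the second group, however many words it contains.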

How do I use regex to search ignoring certain characters with NSPredicate?

10 votes

In Hebrew, there are certain vowels that NSPredicate fails to ignore even when using the 'd' (diacritic insensitive) modifier in the predicate. I was told that the solution is to use regular expressions to do the search.

How do I take a search string and "use regex" to search hebrew text that contains vowels, ignoring those vowels?

Edit:

In other words, If I wanted to search the following text, disregarding dashes and asterisks, how would I do so using regex?

Example Text:

I w-en*t t-o the st*o*r*-e yes-ster*day.

Edit 2:

Essentially, I want to:

  1. Take an input string from a user
  2. Take a string to search
  3. Use a regex based on the user's search string to search for "contains" matches in the larger block of text. The regex should ignore vowels as shown above.

Edit 3:

Here's how I'm implementing my search:

//
//  The user updated the search text
//

- (BOOL)searchDisplayController:(UISearchDisplayController *)controller 
shouldReloadTableForSearchString:(NSString *)searchString{

    NSMutableArray *unfilteredResults = [[[[self.fetchedResultsController sections] objectAtIndex:0] objects] mutableCopy];

    if (self.filteredArray == nil) {
        self.filteredArray = [[[NSMutableArray alloc ] init] autorelease];
    }

    [filteredArray removeAllObjects];

    NSPredicate *predicate;

    if (controller.searchBar.selectedScopeButtonIndex == 0) {
        predicate = [NSPredicate predicateWithFormat:@"articleTitle CONTAINS[cd] %@", searchString];
    }else if (controller.searchBar.selectedScopeButtonIndex == 1) {
        predicate = [NSPredicate predicateWithFormat:@"articleContent CONTAINS[cd] %@", searchString];            
    }else if (controller.searchBar.selectedScopeButtonIndex == 2){
        predicate = [NSPredicate predicateWithFormat:@"ANY tags.tagText CONTAINS[cd] %@", searchString];
    }else{
        predicate = [NSPredicate predicateWithFormat:@"(ANY tags.tagText CONTAINS[cd] %@) OR (dvarTorahTitle CONTAINS[cd] %@) OR (dvarTorahContent CONTAINS[cd] %@)", searchString,searchString,searchString];
    }

    for (Article *article in unfilteredResults) {

        if ([predicate evaluateWithObject:article]) {
            [self.filteredArray addObject:article];
        }

    }

    [unfilteredResults release];


    return YES;
}

Edit 4:

I am not required to use regex for this, was just advised to do so. If you have another way that works, go for it!

Edit 5:

I've modified my search to look like this:

NSInteger length = [searchString length];

NSString *vowelsAsRegex = @"[\\u5B0-\\u55C4]*";

NSMutableString *modifiedSearchString = [searchString mutableCopy];

for (int i = length; i > 0; i--) {
    [modifiedSearchString insertString:vowelsAsRegex atIndex:i];
}

if (controller.searchBar.selectedScopeButtonIndex == 0) {
            predicate = [NSPredicate predicateWithFormat:@"articleTitle CONTAINS[cd] %@", modifiedSearchString];
        }else if (controller.searchBar.selectedScopeButtonIndex == 1) {
            predicate = [NSPredicate predicateWithFormat:@"articleContent CONTAINS[cd] %@", modifiedSearchString];            
        }else if (controller.searchBar.selectedScopeButtonIndex == 2){
            predicate = [NSPredicate predicateWithFormat:@"ANY tags.tagText CONTAINS[cd] %@", modifiedSearchString];
        }else{
            predicate = [NSPredicate predicateWithFormat:@"(ANY tags.tagText CONTAINS[cd] %@) OR (dvarTorahTitle CONTAINS[cd] %@) OR (dvarTorahContent CONTAINS[cd] %@)", modifiedSearchString,modifiedSearchString,modifiedSearchString];
        }

for (Article *article in unfilteredResults) {
  if ([predicate evaluateWithObject:article]) {
    [self.filteredArray addObject:article];
  }          
 }

I'm still missing something here, what do I need to do to make this work?

Edit 6:

Okay, almost there. I need to make two more changes to be finished with this.

I need to be able to add other ranges of characters to the regex, which might appear instead of, or in addition to, the characters in the other set. I've tried changing the first range to this:

[\u05b0-\u05c, \u0591-\u05AF]?

Something tells me that this is incorrect.

Also, I need the rest of the regex to be case insensitive. What modifier do I need to use with the .* regex to make it case insensitive?

This answer picks up where the question left off. Please read that for context.

As it turns out, iOS can make regular expressions case insensitive using an Objective-C modifier to NSPredicate. All that's left is to combine the two ranges. I realized that they are actually two consecutive ranges. My final code looks like this:

NSInteger length = [searchString length];

NSString *vowelsAsRegex = @"[\u0591-\u05c4]?[\u0591-\u05c4]?"; //Cantillation: \u0591-\u05AF Vowels: \u05b0-\u05c

NSMutableString *modifiedSearchString = [searchString mutableCopy];

for (int i = length; i > 0; i--) {
    [modifiedSearchString insertString:vowelsAsRegex atIndex:i];
}

if (controller.searchBar.selectedScopeButtonIndex == 0) {
  predicate = [NSPredicate predicateWithFormat:@"articleTitle CONTAINS[cd] %@", modifiedSearchString];
}else if (controller.searchBar.selectedScopeButtonIndex == 1) {
    predicate = [NSPredicate predicateWithFormat:@"articleContent CONTAINS[c] %@", modifiedSearchString];            
}else if (controller.searchBar.selectedScopeButtonIndex == 2){
    predicate = [NSPredicate predicateWithFormat:@"ANY tags.tagText CONTAINS[c] %@", modifiedSearchString];
}else{
    predicate = [NSPredicate predicateWithFormat:@"(ANY tags.tagText CONTAINS[c] %@) OR (dvarTorahTitle CONTAINS[c] %@) OR (dvarTorahContent CONTAINS[c] %@)", modifiedSearchString,modifiedSearchString,modifiedSearchString];
}

[modifiedSearchString release];

for (Article *article in unfilteredResults) {
  if ([predicate evaluateWithObject:article]) {
    [self.filteredArray addObject:article];
  }          
}

Note that the range portion of the regular expression repeats itself. This is because there can be both a cantillation mark and a vowel on a single letter. Now, I can search uppercase and lowercase English, and Hebrew with or without vowels and cantillation marks.
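
The idea of allowing ignorable characters after every letter of the search string can be sketched independently of NSPredicate. The Python below is only an illustration under assumptions: build_pattern and its ignorable parameter are hypothetical names, and it uses * after the character class instead of repeating the class twice with ?.

```python
import re

# Hebrew cantillation marks and vowel points, the range used in the answer.
HEBREW_MARKS = "[\u0591-\u05c4]*"

def build_pattern(query, ignorable=HEBREW_MARKS):
    """Allow any run of ignorable characters after each query character."""
    body = "".join(re.escape(ch) + ignorable for ch in query)
    return re.compile(body, re.IGNORECASE)

pointed = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"   # "shalom" with vowel points
plain = "\u05e9\u05dc\u05d5\u05dd"                        # "shalom" without them
print(bool(build_pattern(plain).search(pointed)))

# The dash/asterisk example from the question, searched case-insensitively.
text = "I w-en*t t-o the st*o*r*-e yes-ster*day."
print(bool(build_pattern("STORE", ignorable="[-*]*").search(text)))
```

Escaping each query character with re.escape keeps user input such as . or ( from being interpreted as regex syntax.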

Awesome!

RegEx: Remove non-letters UTF-8 Safe, Quickly

9 votes

I'm trying to remove everything except valid letters (from any language) in PHP. I've been using this:

$content=preg_replace('/[^\pL\p{Zs}]/u', '', $content);

But it's painfully slow. Takes about 30x longer than:

$content=preg_replace('/[^a-z\s]/', '', $content);

I'm dealing with large amounts of data, so it really isn't feasible to use a slow method.

Is there a faster way of doing this?

Well, it's a wonder it's only 30 times slower, seeing that it needs to take about 1000 times more characters than just a-z into account when checking if a certain code point is a letter or not.

That said, you can improve your regex a bit:

$content=preg_replace('/[^\pL\p{Zs}]+/u', '', $content);

should speed it up by combining adjacent non-letters/space separators into one single replace operation.
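
The effect of the added + can be observed by counting replacements. This is a small Python sketch rather than PHP, and since Python's built-in re module has no \pL, it uses the ASCII class from the question purely as an illustration; the sample text is made up.

```python
import re

text = "ab12!!cd"

# Without +, every unwanted character is a separate replacement.
single = re.subn(r"[^a-z\s]", "", text)
# With +, each run of unwanted characters is replaced in one go.
runs = re.subn(r"[^a-z\s]+", "", text)

print(single)
print(runs)
```

re.subn returns the new string together with the number of substitutions, so the two results differ only in that count.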

How to match regex at start index?

9 votes

How do I create a regex that begins matching where it starts searching?

In other words:

What is the equivalent of \A which says, "match at the start of the search, even if it's not in the beginning of the main string"?

new Regex(@"\A\n").IsMatch("!\n", 1);    // Should be true, but is false

What you're looking for is \G:

new Regex(@"\G\n").IsMatch("!\n", 1);    // It's twue, it's twue!

This was a surprise to me, actually. I knew about \G, but it's usually described as an anchor that matches the beginning of the input or the end of the most recent successful match, neither of which applies here. If this is a .NET innovation, they should make more noise about it; it looks like it could be very handy.

EDIT: Come to think of it, Java's find(int) does work the same way--I've even used it extensively. But then they added the "regions" API in Java 5, which offers much finer control, and I forgot about this idiom. I never thought to look for it in .NET.
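
As a further point of comparison (an aside, not from the original answer): Python's compiled patterns take a start index as well, and match() is anchored at that index, which gives a similar effect to \G in the .NET example. A minimal sketch:

```python
import re

newline = re.compile(r"\n")
text = "!\n"

# match() is anchored at the given start index rather than at index 0.
print(newline.match(text, 1) is not None)
# At index 0 the character is "!", so the anchored attempt fails.
print(newline.match(text, 0) is not None)
```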

RegEx, StringBuilder and Large Object Heap Fragmentation

7 votes

How can I run lots of RegExes (to find matches) in big strings without causing LOH fragmentation?

It's .NET Framework 4.0, so I'm using StringBuilder to keep the data out of the LOH. However, as soon as I need to run a RegEx on it I have to call StringBuilder.ToString(), which means the resulting string will be in the LOH.

Is there any solution to this problem? It's virtually impossible to have a long running application that deals with big strings and RegExes like this.

An Idea to Solve this problem:

While thinking about this problem, I think I found a dirty solution.

At a given time I only have 5 strings and these 5 strings (bigger than 85KB) will be passed to RegEx.Match.

Since the fragmentation occurs because new objects won't fit to empty spaces in LOH, this should solve the problem:

  1. PadRight all strings to a max. accepted size, let's say 1024KB (I might need to do this with StringBuilder)
  2. By doing so, every new string will fit into already freed memory, as the previous string is already out of scope
  3. There won't be any fragmentation because the object size is always the same, hence I'll only allocate 1024*5 KB at a given time, and this space in the LOH will be shared between these strings.

I suppose the biggest problem with this design is what happens if other big objects are allocated in this location in the LOH, which would cause the application to allocate lots of 1024 KB strings, maybe with even worse fragmentation. A fixed statement might help; however, how can I send a fixed string to RegEx without actually creating a new string that is not located at a fixed memory address?

Any ideas about this theory? (Unfortunately I can't reproduce the problem easily, I'm generally trying to use a memory profiler to observe the changes and not sure what kind of isolated test case I can write for this)

OK, here is my attempt to solve this problem in a fairly generic way, but with some obvious limitations. Since I haven't seen this advice anywhere and everyone is whining about LOH fragmentation, I wanted to share the code to confirm that my design and assumptions are correct.

Theory:

  1. Create a shared massive StringBuilder (this is to store the big strings that we read from streams) - new StringBuilder(ChunkSize * 5);
  2. Create a massive String (it has to be bigger than the max. accepted size), initialized with spaces - new string(' ', ChunkSize * 10);
  3. Pin the string object in memory so the GC will not mess with it - GCHandle.Alloc(pinnedText, GCHandleType.Pinned). Even though LOH objects are normally pinned, this seems to improve the performance, maybe because of the unsafe code.
  4. Read the stream into the shared StringBuilder and then unsafe-copy it to pinnedText by using indexers
  5. Pass the pinnedText to RegEx

With this implementation the code below behaves as if there were no LOH allocation. If I switch to new string(' ') allocations instead of using a static StringBuilder, or use StringBuilder.ToString(), the code can allocate 300% less memory before crashing with an OutOfMemoryException.

I also confirmed the results with a memory profiler, that there is no LOH fragmentation in this implementation. I still don't understand why RegEx doesn't cause any unexpected problems. I also tested with different and expensive RegEx patterns and results are same, no fragmentation.

Code:

http://pastebin.com/ZuuBUXk3

using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;
using System.Text;
using System.Text.RegularExpressions;

namespace LOH_RegEx
{
    internal class Program
    {
        private static List<string> storage = new List<string>();
        private const int ChunkSize = 100000;
        private static StringBuilder _sb = new StringBuilder(ChunkSize * 5);


        private static void Main(string[] args)
        {
            var pinnedText = new string(' ', ChunkSize * 10);
            var sourceCodePin = GCHandle.Alloc(pinnedText, GCHandleType.Pinned);

            var rgx = new Regex("A", RegexOptions.CultureInvariant | RegexOptions.Compiled);

            try
            {

                for (var i = 0; i < 30000; i++)
                {                   
                    //Simulate that we read data from stream to SB
                    UpdateSB(i);
                    CopyInto(pinnedText);                   
                    var rgxMatch = rgx.Match(pinnedText);

                    if (!rgxMatch.Success)
                    {
                        Console.WriteLine("RegEx failed!");
                        Console.ReadLine();
                    }

                    //Extra buffer to fragment LoH
                    storage.Add(new string('z', 50000));
                    if ((i%100) == 0)
                    {
                        Console.Write(i + ",");
                    }
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.ToString());
                Console.WriteLine("OOM Crash!");
                Console.ReadLine();
            }
        }


        private static unsafe void CopyInto(string text)
        {
            fixed (char* pChar = text)
            {
                int i;
                for (i = 0; i < _sb.Length; i++)
                {
                    pChar[i] = _sb[i];
                }

                pChar[i] = '\0';  // i == _sb.Length here, so this lands right after the copied text
            }
        }

        private static void UpdateSB(int extraSize)
        {
            _sb.Remove(0,_sb.Length);

            var rnd = new Random();
            for (var i = 0; i < ChunkSize + extraSize; i++)
            {
                _sb.Append((char)rnd.Next(60, 80));
            }
        }
    }
}

align string to a pattern in perl?

7 votes

I have chunks of strings within square brackets, like this:

[p1 text1/label1] [p2 text2/label2] [p3 text3/label3] [...

and so on.

What's inside each chunk isn't important. But sometimes there are stray chunks of text that are NOT surrounded by square brackets. For example:

[p1 text1/label1] [p2 text2/label2] textX/labelX  [p3 text3/label3] [...] textY/labelY textZ/labelZ [...]

I thought I had this solved fine with regex in perl until I realized that I have only catered to the cases where there is a single stray text at the beginning, the middle, or the end of the text, but not where we might have two stray cases together. (like the Y and Z chunks above).

So I realized that regular expressions in perl only catch the first matching pattern? How could the above problem be solved then?

Edit:

The problem is to ensure that every chunk is surrounded by brackets. Square brackets are never nested. When surrounding a phrase with brackets, the p-value depends on the "label" value. For example, if a stray unbracketed phrase is

li/IN

then it should turn into:

[PP li/IN]

I guess it is a mix but the only way I can think of solving the bigger problem I'm working on is to turn all of them into bracketed phrases, so the handling is easier. So I've got it working if an unbracketed phrase happens at the beginning, middle and end, but not if two or more happen together.

I basically used a different regex for each position (beginning, middle and end). The one that catches an unbracketed phrase in the middle looks like this:

$data =~ s/\] (text)#\/label \[/\] \[selected-p-value $1#\/label\] \[/g;

So what I'm doing is just noticing that if a ] comes before and after the text/label pattern, then this one doesn't have brackets. I do something similar for the others too. But I guess this is incredibly un-generic. My regex isn't great!

Actually you can solve this using "only" regex :

#!/usr/bin/perl

use strict;
use warnings;

$_ = "[p1 text1/label1] [p2 text2/label2] textX/labelX  [p3 text3/label3] [...] textY/labelY textZ/labelZ [...]";

s{ ([^\s[]+)|(\[(?:[^[]*)\])     }
 { if( defined $2){ $2 } elsif(defined $1)
    { 
       if($1 =~ m!(.*(?<=/)(.*))!)
       {
         if($2 eq 'labelX')
         {
            "[PP $1]";
         }
         elsif($2 eq 'labelY')
         {
            "[BLA $1]";
         }
         elsif($2 eq 'labelZ')
         {
            "[FOO $1]";
         }
       }
    }
 }xge;

 print;

Output :

[p1 text1/label1] [p2 text2/label2] [PP textX/labelX]  [p3 text3/label3] [...] [BLA textY/labelY] [FOO textZ/labelZ] [...]
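
The same single-pass technique (match either a bracketed chunk or a stray chunk, and decide in a callback) can be sketched in Python. The label-to-tag table is copied from the Perl example; the names TAGS, CHUNK, wrap and bracket_all are assumptions for illustration, as is the choice to leave chunks with unknown labels untouched.

```python
import re

# Tags per label, taken from the Perl example above.
TAGS = {"labelX": "PP", "labelY": "BLA", "labelZ": "FOO"}

# Either a complete bracketed chunk or a run of non-space, non-bracket text.
CHUNK = re.compile(r"(\[[^\[\]]*\])|([^\s\[\]]+)")

def wrap(match):
    if match.group(1) is not None:
        return match.group(1)          # already bracketed: keep as is
    stray = match.group(2)
    label = stray.rsplit("/", 1)[-1]   # text after the last slash
    tag = TAGS.get(label)
    # Unknown labels are left untouched (an assumption of this sketch).
    return "[%s %s]" % (tag, stray) if tag else stray

def bracket_all(data):
    return CHUNK.sub(wrap, data)

data = ("[p1 text1/label1] [p2 text2/label2] textX/labelX  "
        "[p3 text3/label3] [...] textY/labelY textZ/labelZ [...]")
print(bracket_all(data))
```

Because the substitution visits every chunk in turn, it does not matter whether stray chunks appear at the start, in the middle, at the end, or next to each other.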

php regular expression for video swf

7 votes

I want to get the video URL from an object/embed HTML source. I read that I can use a regular expression to get it, but regular expressions and I are no friends.

So here's what I have:

<?php 

function src($text) {
    $text = str_replace('"', '', $text);
    $text = str_replace('src=', '', $text);
    $temporary = explode('<embed', $text);
    $temporary = $temporary[1];
    $temporary = explode(' ', trim($temporary));
    return $temporary[0];
} 

$html = '
<object width="180" height="220">
    <param name="movie" value="http://www.domain.com/video/video1.swf"></param>
    <embed src="http://www.domain.com/video/video1.swf" type="application/x-shockwave-flash" width="180" height="220"></embed>
</object>
'; 

echo src($html);

This works, but would a regular expression be better?

I am using LAMP.

A regular expression is better for this case because src might not always be the first attribute, in which case your approach won't work.

Here's what I recommend:

function src($html) {
 if(preg_match('#<embed[^>]*?src=["\'](.*?)["\'](.*?)></embed>#si', stripslashes($html), $src)) {
  return $src[1];
 }
 return ''; // or any other error if you need
}

echo src($html);

will output: http://www.domain.com/video/video1.swf

[^>] matches any single character other than >; [^>]*? matches as few of them as possible.

["\'] matches either a double or a single quote, so both src=" and src=' are accepted.

(.*?) Dot (.) means match any character. Star (*) means zero or more times. The question mark (?) after the star makes the match lazy (non-greedy): it stops as soon as the rest of the pattern can match. Put together, it means match any character, zero or more times, but as few as possible, so the capture ends at the first closing quote.

/s makes the dot match newlines as well, and /i makes the match case insensitive.

Here's more info:

http://en.wikipedia.org/wiki/Regular_expression

http://www.regular-expressions.info/reference.html
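
To see the difference between a greedy and a lazy quantifier, and the attribute-order point, here is a small Python sketch. It is not the PHP from the answer: the pattern is shortened to stop at the closing quote, and the names EMBED_SRC and src are used for illustration only.

```python
import re

# Lazy (.*?) stops at the first closing quote; re.S lets . cross newlines.
EMBED_SRC = re.compile(r'<embed[^>]*?\bsrc=["\'](.*?)["\']', re.I | re.S)

def src(html):
    """Return the src of the first <embed> tag, or '' if there is none."""
    match = EMBED_SRC.search(html)
    return match.group(1) if match else ""

# src is deliberately not the first attribute here.
html = ('<embed type="application/x-shockwave-flash" '
        'src="http://www.domain.com/video/video1.swf" width="180"></embed>')
print(src(html))

# Greedy versus lazy on the same input.
attrs = 'a="1" b="2"'
print(re.search(r'"(.*)"', attrs).group(1))    # greedy: runs to the last quote
print(re.search(r'"(.*?)"', attrs).group(1))   # lazy: stops at the first quote
```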

Replace for entire line produces duplicate occurrence of replacement text

7 votes

Simple question: why does

"x" -replace ".*", "y"

produce "yy" ?

"x" -replace ".*", "y"

is the equivalent of

[Regex]::replace("x",".*","y")

The result yy that you see is based on how this works, as per MSDN:

Within a specified input string, replaces all strings that match a specified regular expression with a specified replacement string.

http://msdn.microsoft.com/en-us/library/e7f5w83z.aspx

The replace finds every substring that matches the regular expression and replaces it with the given replacement. Hence x is replaced with y, then the empty string left at the end of the input is also matched and replaced with y, and you get yy.

This can be verified by doing [Regex]::matches("x",".*") - it gives two matches: one for x and one for the empty string.

In terms of other regular expression engines, this happens because of the g or the global flag.

This can also be verified in Python as follows ( just to show that this is not limited to Powershell / .Net ):

>>> re.findall(".*","x")
['x', '']
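
The positions of the two matches make the behaviour easier to see. A short Python sketch (an illustration added here, with .+ shown as one possible way to avoid the empty match):

```python
import re

# Spans of every match of .* in "x": the whole string, then an empty
# match at the very end of the input.
spans = [m.span() for m in re.finditer(".*", "x")]
print(spans)

# Requiring at least one character leaves no empty match to replace.
print(re.sub(".+", "y", "x"))
```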

How does JavaScript detect regular expressions?

7 votes

I am writing a JS parser, and am wondering how to differentiate between a regular expression (/lookup/g) and simple division (bar/baz/g). What are the rules that JavaScript uses to identify regular expressions?

You want to check out Section 7.8.5 in the ECMA spec (the annotated version is up-to-date currently, but always check the latest PDF from the ECMA).

Remember too that a JavaScript regex literal cannot be empty. // is always the start of a single-line comment.

Tangentially, an empty JavaScript regex looks like /(?:)/.

Further discussion.

python - regex search and findall

6 votes

I need to find all matches in a string for a given regex. I've been using findall() to do that until I came across a case where it wasn't doing what I expected. For example:

regex = re.compile('(\d+,?)+')
s = 'There are 9,000,000 bicycles in Beijing.'

print re.search(regex, s).group(0)
> 9,000,000

print re.findall(regex, s)
> ['000']

In this case search() returns what I need (the longest match) but findall() behaves differently, although the docs imply it should be the same:

findall() matches all occurrences of a pattern, not just the first one as search() does.

  • Why is the behaviour different?

  • How can I achieve the result of search() with findall() (or something else)?

Ok, I see what's going on... from the docs:

If one or more groups are present in the pattern, return a list of groups; 
this will be a list of tuples if the pattern has more than one group.

As it turns out, you do have a group, "(\d+,?)"... so, what it's returning is the last occurrence of this group, or 000.

One solution is to surround the entire regex by a group, like this

regex = re.compile('((\d+,?)+)')

Then it will return [('9,000,000', '000')], which is a tuple containing both matched groups. Of course, you only care about the first one.

Personally, I would use the following regex

regex = re.compile('((\d+,)*\d+)')

to avoid matching the trailing comma in input like "this is a bad number 9,123,".

Edit.

Here's a way to avoid having to surround the expression with parentheses or deal with tuples

s = "..."
regex = re.compile('(\d+,?)+')
it = re.finditer(regex, s)

for match in it:
  print match.group(0)

finditer returns an iterator that you can use to access all the matches found. These match objects are the same kind that re.search returns, so group(0) returns the result you expect.
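
Another option, not mentioned above, is a non-capturing group: with no capturing groups in the pattern, findall returns the whole match. A minimal Python 3 sketch:

```python
import re

s = 'There are 9,000,000 bicycles in Beijing.'

# Capturing group: findall returns only the last repetition of the group.
print(re.findall(r'(\d+,?)+', s))

# Non-capturing group (?:...): findall returns the full match instead.
print(re.findall(r'(?:\d+,?)+', s))

# Stricter variant that does not swallow a trailing comma.
print(re.findall(r'(?:\d+,)*\d+', 'this is a bad number 9,123,'))
```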

How to split string to 2D array with Regex?

6 votes

I've got a problem that seems simple on the face of it but has defeated my meager regex skills. I have a string that I need to convert to an array and then process the values accordingly, which is simple enough, but the format of the string cannot be changed (it is generated elsewhere) and the logic of it has me baffled.

The string is:

[6] [2] [3] 12.00; [5] [4]

It's basically a set of ids and decimal values (in this case id 3 == 12.00). The quantity of ids could change at any moment and decimal values could be in any or all of the ids.

In an ideal world I would have the following array:

Array (
   [0] => Array (
             [id]  => 6
             [num] => 
          )
   [1] => Array (
             [id]  => 2
             [num] => 
          ) 
   [2] => Array (
             [id]  => 3
             [num] => 12.00 
          )
   Etc...

Do any of you regex wizards know how this can be accomplished with less swearing than I've been able to achieve?

I have thus far been able to extract the id's using:

preg_match_all('@\[(.*?)\]@s', $string, $array);

and the decimals using:

preg_match_all('/([0-9]+[,\.]{1}[0-9]{2})/', $string, $array);

but lose the correlation between id's and values.

Example:

<?php

$string = '[6] [2] [3] 12.00; [5] [4]';

preg_match_all('/\[(?P<id>\d+)\](?: (?P<num>[\d\.]+);)?/', $string, $matches, PREG_SET_ORDER);

var_dump($matches);

Output:

array(5) {
  [0]=>
  array(3) {
    [0]=>
    string(3) "[6]"
    ["id"]=>
    string(1) "6"
    [1]=>
    string(1) "6"
  }
  [1]=>
  array(3) {
    [0]=>
    string(3) "[2]"
    ["id"]=>
    string(1) "2"
    [1]=>
    string(1) "2"
  }
  [2]=>
  array(5) {
    [0]=>
    string(10) "[3] 12.00;"
    ["id"]=>
    string(1) "3"
    [1]=>
    string(1) "3"
    ["num"]=>
    string(5) "12.00"
    [2]=>
    string(5) "12.00"
  }
  [3]=>
  array(3) {
    [0]=>
    string(3) "[5]"
    ["id"]=>
    string(1) "5"
    [1]=>
    string(1) "5"
  }
  [4]=>
  array(3) {
    [0]=>
    string(3) "[4]"
    ["id"]=>
    string(1) "4"
    [1]=>
    string(1) "4"
  }
}
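
The same named-group pattern carries over to other regex flavours. As an illustration only (the answer itself is PHP), here is a Python sketch; PAIR and parse are hypothetical names, and ids without a value get None for num.

```python
import re

# An id in brackets, optionally followed by " <number>;".
PAIR = re.compile(r"\[(?P<id>\d+)\](?: (?P<num>[\d.]+);)?")

def parse(string):
    """Return one {'id': ..., 'num': ...} dict per bracketed id."""
    return [m.groupdict() for m in PAIR.finditer(string)]

for row in parse('[6] [2] [3] 12.00; [5] [4]'):
    print(row)
```

Keeping the optional value inside the same match as its id is what preserves the correlation that two separate preg_match_all calls lose.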

Why does this regex take so long to find email addresses in certain files?

6 votes

I have a regular expression that looks for email addresses ( this was taken from another SO post that I can't find and has been tested on all kinds of email configurations ... changing this is not exactly my question ... but understand if that is the root cause ):

/[a-z0-9_\-\+]+@[a-z0-9\-]+\.([a-z]{2,3})(?:\.[a-z]{2})?/i

I'm using preg_match_all() in PHP.

This works great for 99.99...% of files I'm looking in and takes around 5ms, but occasionally takes a couple minutes. These files are larger than the average webpage at around 300k, but much larger files generally process fine. The only thing I can find in the file contents that stands out is strings of thousands of consecutive "random" alphanumeric characters like this:

wEPDwUKMTk0ODI3Nzk5MQ9kFgICAw9kFgYCAQ8WAh4H...

Here are two pages causing the problem. View source to see the long strings.

Any thoughts on what is causing this?

--FINAL SOLUTION--

I tested various regexes suggested in the answers. @hakre's answer solved the problem and reduced processing time to a few hundred milliseconds. @FailedDev's answer helped and dropped this from a few minutes to a few seconds. Below is the final regex I used. It's @hakre's second suggestion.

/[a-z0-9_\-\+]{1,256}+@[a-z0-9\-]{1,256}+\.([a-z]{2,3})(?:\.[a-z]{2})?/i

You already know that your regex is causing an issue for large files. So maybe you can make it a bit smarter?

For example, you're using + to match one or more characters. Say you have a run of 10 000 matching characters: the regex may have to try up to 10 000 lengths to find the largest match. Then you combine it with a second, similar quantifier. Say you have a string of 20 000 characters and two + groups. In how many ways could they match in the file? Probably 10 000 x 10 000 possibilities. And so on and so forth.

If you can limit the number of characters (it looks like you're searching for email addresses), try limiting both the domain name and the part before the @ to 256 characters each. That leaves "only" 256 x 256 possibilities to test:

/[a-z0-9_\-\+]{1,256}@[a-z0-9\-]{1,256}\.([a-z]{2,3})(?:\.[a-z]{2})?/i

That's probably already much faster. Then making those quantifiers possessive will reduce backtracking for PCRE:

/[a-z0-9_\-\+]{1,256}+@[a-z0-9\-]{1,256}+\.([a-z]{2,3})(?:\.[a-z]{2})?/i

Which should speed it up again.
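
As a rough illustration, here is the bounded pattern (the first suggestion above) in Python, run over text padded with long alphanumeric runs like the ones in the question. This is only a sketch: EMAIL_RE, find_emails and the sample text are made up for the example, and the possessive form is left out because Python's re module only accepts possessive quantifiers from version 3.11 onwards.

```python
import re

# Bounded version of the pattern: each + is replaced by {1,256}, so the
# engine never backtracks through more than 256 characters per group.
EMAIL_RE = re.compile(
    r"[a-z0-9_\-\+]{1,256}@[a-z0-9\-]{1,256}\.([a-z]{2,3})(?:\.[a-z]{2})?",
    re.IGNORECASE,
)


def find_emails(text):
    """Return every full match of EMAIL_RE in text."""
    return [m.group(0) for m in EMAIL_RE.finditer(text)]


# Long runs of alphanumeric noise around two addresses.
noise = "wEPDwUKMTk0ODI3Nzk5MQ9k" * 200
sample = noise + " mail jane_doe@example.com or bob@example.co.uk " + noise
emails = find_emails(sample)
print(emails)
```

The upper bound caps how far each group can grow from a given starting position, which is what keeps the work per position small on the long runs.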

Extracting only characters from a string in Python

6 votes

In Python, I want to extract only the characters from a string.

Consider I have the following string,

input = "{('players',): 24, ('year',): 28, ('money',): 19, ('ipod',): 36, ('case',): 23, ('mini',): 46}"

I want the result as,

output =  "players year money ipod case mini"

I tried to split considering only the alphabets,

word1 = st.split("[a-zA-Z]+")

But the split is not happening.

You could do it with re, but the string split method doesn't take a regex; it takes a literal string.

Here's one way to do it with re:

import re
word1 = " ".join(re.findall("[a-zA-Z]+", st))
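
To make the difference concrete, here is a small sketch using the string from the question (stored in st, as in the snippet above). It shows that str.split looks for the literal text "[a-zA-Z]+", while re.findall treats it as a pattern.

```python
import re

st = "{('players',): 24, ('year',): 28, ('money',): 19, ('ipod',): 36, ('case',): 23, ('mini',): 46}"

# str.split treats its argument as a literal separator. The text
# "[a-zA-Z]+" never occurs in st, so nothing is split.
unsplit = st.split("[a-zA-Z]+")

# re.findall interprets the pattern and returns every run of letters.
words = re.findall("[a-zA-Z]+", st)
output = " ".join(words)
print(output)
```

Here unsplit is a one-element list holding the original string, which is why "the split is not happening".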

Filtering a diff with a regular expression

6 votes

It seems that it would be extremely handy to be able to filter a diff so that trivial changes are not displayed. I would like to write a regular expression which would be run on the line and then pass it another string that uses the captured arguments to generate a canonical form. If the lines before and after produce the same output, then they would be removed from the diff.

For example, I am working on a PHP code base where a significant number of array accesses are written as my_array[my_key] when they should be my_array["my_key"] to prevent issues if a my_key constant is defined. It would be useful to generate a diff that leaves out lines where the only change was adding those quotes.

I can't change them all at once, as we don't have the resources to test the entire code base, so I am fixing this whenever I make a change to a function. How can I achieve this? Is there anything else I can use to achieve a similar result? For example, a simpler method might be to skip the canonical form and just see if the input is transformed into the output. BTW, I am using Git.

There do not seem to be any options to Git's diff command that support what you want to do. However, you could use the GIT_EXTERNAL_DIFF environment variable and a custom script (or any executable created using your preferred scripting or programming language) to manipulate a patch.

I'll assume you are on Linux; if not, you could tweak this concept to suit your environment. Let's say you have a Git repo where HEAD has a file file05 that contains:

line 26662: $my_array[my_key]

And a file file06 that contains:

line 19768: $my_array[my_key]
line 19769: $my_array[my_key]
line 19770: $my_array[my_key]
line 19771: $my_array[my_key]
line 19772: $my_array[my_key]
line 19773: $my_array[my_key]
line 19775: $my_array[my_key]
line 19776: $my_array[my_key]

You change file05 to:

line 26662: $my_array["my_key"]

And you change file06 to:

line 19768: $my_array[my_key]
line 19769: $my_array["my_key"]
line 19770: $my_array[my_key]
line 19771: $my_array[my_key]
line 19772: $my_array[my_key]
line 19773: $my_array[my_key]
line 19775: $my_array[my_key2]
line 19776: $my_array[my_key]

Using the following shell script, let's call it mydiff.sh and place it somewhere that's in our PATH:

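#!/bin/bash
# Git calls this with: path old-file old-hex old-mode new-file new-hex new-mode.
# Print those arguments so each file's section is labelled.
echo "$@"
# ${5} is new-file; diff it word by word and filter the result through awk.
git diff-files --patch --word-diff=porcelain "${5}" | awk '
# Removed text: remember it and where it was seen, but do not print it yet.
/^-./ {rec = FNR; prev = substr($0, 2);}
# Added text directly after removed text: strip the quotes and compare.
FNR == rec + 1 && /^\+./ {
    ln = substr($0, 2);
    gsub("\\[\"", "[", ln);
    gsub("\"\\]", "]", ln);
    if (prev == ln) {
        print " " ln;
    } else {
        print "-" prev;
        print "+" ln;
    }
}
# Lines that are neither a removed line nor the line right after one pass through.
FNR != rec && FNR != rec + 1 {print;}
'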

Executing the command:

GIT_EXTERNAL_DIFF=mydiff.sh git --no-pager diff

Will output:

file05 /tmp/r2aBca_file05 d86525edcf5ec0157366ea6c41bc6e4965b3be1e 100644 file05 0000000000000000000000000000000000000000 100644
index d86525e..c2180dc 100644
--- a/file05
+++ b/file05
@@ -1 +1 @@
 line 26662: 
 $my_array[my_key]
~
file06 /tmp/2lgz7J_file06 d84a44f9a9aac6fb82e6ffb94db0eec5c575787d 100644 file06 0000000000000000000000000000000000000000 100644
index d84a44f..bc27446 100644
--- a/file06
+++ b/file06
@@ -1,8 +1,8 @@
 line 19768: $my_array[my_key]
~
 line 19769: 
 $my_array[my_key]
~
 line 19770: $my_array[my_key]
~
 line 19771: $my_array[my_key]
~
 line 19772: $my_array[my_key]
~
 line 19773: $my_array[my_key]
~
 line 19775: 
-$my_array[my_key]
+$my_array[my_key2]
~
 line 19776: $my_array[my_key]
~

This output does not show changes for the added quotes in file05 and file06. The external diff script basically uses the Git diff-files command to create the patch and filters the output through a GNU awk script to manipulate it. This sample script does not handle all the different combinations of old and new files mentioned for GIT_EXTERNAL_DIFF nor does it output a valid patch, but it should be enough to get you started.

You could use Perl regular expressions, Python difflib or whatever you're comfortable with to implement an external diff tool that suits your needs.
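
As one possible starting point for the difflib route, here is a minimal sketch. It assumes the canonical form is "quotes around a bare-word array key are dropped"; the names QUOTED_KEY_RE, canonical and filtered_diff are hypothetical, and the code works on lists of lines rather than on the arguments Git passes to an external diff program.

```python
import difflib
import re

# Hypothetical canonical form: drop quotes around a bare-word array key,
# so $a["my_key"] and $a['my_key'] both become $a[my_key].
QUOTED_KEY_RE = re.compile(r"""\[(["'])(\w+)\1\]""")


def canonical(line):
    """Rewrite a line into its canonical form."""
    return QUOTED_KEY_RE.sub(r"[\2]", line)


def filtered_diff(old_lines, new_lines):
    """Unified diff of the canonical forms; quote-only changes vanish."""
    return list(difflib.unified_diff(
        [canonical(line) for line in old_lines],
        [canonical(line) for line in new_lines],
        fromfile="old", tofile="new", lineterm="",
    ))


old = ["$my_array[my_key]", "$my_array[my_key]"]
new = ['$my_array["my_key"]', "$my_array[my_key2]"]
for line in filtered_diff(old, new):
    print(line)
```

Note that this diff shows the canonical text rather than the real file contents, so like the shell script it is not a valid patch; it is only meant for reviewing which lines changed in a non-trivial way.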