Best regex questions in October 2010

haskell regex substitution

12 votes

Despite the ridiculously large number of regex matching engines for Haskell, the only one I can find that will substitute is Text.Regex, which, while decent, is missing a few thing I like from pcre. Are there any pcre-based packages which will do substitution, or am I stuck with this?

The regular expression API in regex-base is generic to the container of characters to match. Doing some kind of splicing generically to implements substitution would be very hard to make efficient. I did not want to provide a crappy generic routine.

Writing a small function to do the substitution exactly how you want is just a better idea, and it can be written to match your container.

How can pattern search make faster ?

9 votes

I am working on about 1GB incremental file and I want to search for a particular pattern. Currently I am using Java Regular expressions, do you have any idea how can I do this faster?

Basically what you need is a state machine that can process a stream. This stream being bounded to the file... Each time the file grow, you read what has been appended to it (like the tail linux command that append to standard output the lines added to the file).

If you need to stop/restart your analyser, you can either just store somewhere the start position (that can depend of the window you need for your pattern matching) and restart from that. Or you can restart from scratch.

That is for the "increasing file" part of the problem.

For the best way to process the content, it depend of what you really need, what kind of data and pattern you want to apply. Regular expression are maybe the best solution: flexible, fast and relatively convenient.

From my understanding, Lucene would be good if you wanted to do document search matching for some natural language content. This would be a poor choice to match all dates or all line with a specific property. Also because Lucene first make an index of the document... This would help only for really heavy processing as indexing in the first place take time.

Regex Problem in Java

7 votes

I am working on some regex and I wonder why this regex

"(?<=(.*?id(( *)=)\\s[\"\']))g"

does not match the string

<input id = "g" />

in java??

Thanks

Not only does Java not allow unbounded lookbehind, it's supposed to throw an exception if you try. The fact that you're not seeing that exception is itself a bug: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6695369

You shouldn't be using lookbehind for that anyway. If you want to match the value of a certain attribute, the easiest, least troublesome approach is to match the whole attribute and use a capturing group to extract the value. For example:

String source = "<input id = \"g\" />"; 
Pattern p = Pattern.compile("\\bid\\s*=\\s*\"([^\"]*)\"");
Matcher m = p.matcher(source);
if (m.find())
{
  System.out.printf("Found 'id' attribute '%s' at position %d%n",
                    m.group(1), m.start());
}

output:

Found 'id' attribute 'g' at position 7

Do yourself a favor and forget about lookbehinds for a while. They're tricky even when they're not buggy, and they're really not as useful as you might expect.

Java string search

7 votes

If I am looking for a particular word inside a string, for example, in the string "how are you" I am looking for "are". Would a regular indexOf() work faster and better or a Regex match()

String testStr = "how are you";
String lookUp = "are";

//METHOD1
if (testStr.indexOf(lookUp) != -1)
{
 System.out.println("Found!");
}

//OR
//METHOD 2
if (testStr.match(".*"+lookUp+".*"))
{
 System.out.println("Found!");
}

Which of the two methods above is a better way of looking for a string inside another string? Or is there a much better alternative?

  • Ivard

If you don't care whether it's actually the entire word you're matching, then indexOf() will be a lot faster.

If, on the other hand, you need to be able to differentiate between are, harebrained, aren't etc., then you need a regex: \bare\b will only match are as an entire word (\\bare\\b in Java).

\b is a word boundary anchor, and it matches the empty space between an alphanumeric character (letter, digit, or underscore) and a non-alphanumeric character.

Caveat: This also means that if your search term isn't actually a word (let's say you're looking for ###), then these word boundary anchors will only match in a string like aaa###zzz, but not in +++###+++.

elegant way to match two wildcarded strings

7 votes

I'm OCRing some text from two different sources. They can each make mistakes in different places, where they won't recognize a letter/group of letters. If they don't recognize something, it's replaced with a ?. For example, if the word is Roflcopter, one source might return Ro?copter, while another, Roflcop?er. I'd like a function that returns whether two matches might be equivalent, allowing for multiple ?s. Example:

match("Ro?copter", "Roflcop?er") --> True
match("Ro?copter", "Roflcopter") --> True
match("Roflcopter", "Roflcop?er") --> True
match("Ro?co?er", "Roflcop?er") --> True

So far I can match one OCR with a perfect one by using regular expressions:

>>> def match(tn1, tn2):
    tn1re = tn1.replace("?", ".{0,4}")
    tn2re = tn2.replace("?", ".{0,4}")

    return bool(re.match(tn1re, tn2) or re.match(tn2re, tn1))

>>> match("Roflcopter", "Roflcop?er")
True
>>> match("R??lcopter", "Roflcopter")
True

But this doesn't work when they both have ?s in different places:

>>> match("R??lcopter", "Roflcop?er")
False

Thanks to Hamish Grubijan for this idea. Every ? in my ocr'd names can be anywhere from 0 to 3 letters. What I do is expand each string to a list of possible expansions:

>>> list(expQuestions("?flcopt?"))
['flcopt', 'flcopt@', 'flcopt@@', 'flcopt@@@', '@flcopt', '@flcopt@', '@flcopt@@', '@flcopt@@@', '@@flcopt', '@@flcopt@', '@@flcopt@@', '@@flcopt@@@', '@@@flcopt', '@@@flcopt@', '@@@flcopt@@', '@@@flcopt@@@']

then I expand both and use his matching function, which I called matchats:

def matchOCR(l, r):
    for expl in expQuestions(l):
        for expr in expQuestions(r):
            if matchats(expl, expr):
                return True
    return False

Works as desired:

>>> matchOCR("Ro?co?er", "?flcopt?")
True
>>> matchOCR("Ro?co?er", "?flcopt?z")
False
>>> matchOCR("Ro?co?er", "?flc?pt?")
True
>>> matchOCR("Ro?co?e?", "?flc?pt?")
True


The matching function:

def matchats(l, r):
    """Match two strings with @ representing exactly 1 char"""
    if len(l) != len(r): return False
    for i, c1 in enumerate(l):
        c2 = r[i]
        if c1 == "@" or c2 == "@": continue
        if c1 != c2: return False
    return True

and the expanding function, where cartesian_product does just that:

def expQuestions(s):
    """For OCR w/ a questionmark in them, expand questions with
    @s for all possibilities"""
    numqs = s.count("?")

    blah = list(s)
    for expqs in cartesian_product([(0,1,2,3)]*numqs):
        newblah = blah[:]
        qi = 0
        for i,c in enumerate(newblah):
            if newblah[i] == '?':
                newblah[i] = '@'*expqs[qi]
                qi += 1
        yield "".join(newblah)

Javascript regex compared to Perl regex

7 votes

I'm just a noob when it comes to regexp... I know Perl is amazing with regexp... and I dont know much Perl. Recently started learning JavaScript and came across regex for validating user inputs... haven't used them much.

How does JavaScript regexp compare with Perl regexp? Similarities and differences? Can all regexp(s) wriiten in JS be used in Perl and vice-versa? Similar syntax?

The most important difference you will encounter in real life is JavaScript's lack of lookbehind assertions.

Then, JavaScript doesn't have a way to prevent backtracking by making matches final (using possessive quantifiers ++/*+/?+ or atomic groups (?>...)).

Furthermore, JavaScript doesn't know Unicode properties/scripts/blocks.

Other than that, the basic regex syntax is very similar in both flavors.

One other (cosmetic) thing is that JavaScript doesn't know verbose regexes, which might make them harder to read.

Out of curiosity, how many people here know how regular expressions are compiled?

7 votes

I'm going over this in my theory class, and I'm curious as to how many people here know what regular expression compilation actually is. I've looked online, and it seems to me that this is a more archaic topic that I thought it was.

So yeah, who here knew before reading this question that a regular expression compile is performed by converting the regex to an epsilon-nondeterministic finite automaton? Who has no clue what that is?

There's a very simple and elegant little regular expression compiler in C that Rob Pike wrote and Brian Kernighan describes in Chapter 1 of O'Reilly's Beautiful Code. It's pretty easy to learn from. Also compiler courses cover it: token types can be defined with regular expressions. So I imagine this knowledge isn't terribly rare.

Regex question ^[a-zA-Z0-9]{5,10}$

7 votes

The above regular expression (in Java) matches a string of alpha numeric characters of length between 5 and 10.

How can I modify the above regular expression to match with the above requirements as well as matching with an empty string?

Make it optional (match exactly one or zero times)

^([a-zA-Z0-9]{5,10})?$

Simple Java regex not working.

7 votes

I have this regex which is supposed to remove sentence delimiters(. and ?):

sentence = sentence.replaceAll("\\.|\\?$","");

It works fine it converts

"I am Java developer." to "I am Java developer"

"Am I a Java developer?" to "Am I a Java developer"

But after deployment we found that it also replaces any other dots in the sentence as

"Hi.Am I a Java developer?" becomes "HiAm I a Java developer"

Why is this happening?

The pipe (|) has the lowest precedence of all operators. So your regex:

\\.|\\?$

is being treated as:

(\\.)|(\\?$)

which matches a . anywhere in the string and matches a ? at the end of the string.

To fix this you need to group the . and ? together as:

(?:\\.|\\?)$

You could also use:

[.?]$

Within a character class . and ? are treated literally so you need not escape them.

Regex: Match any punctuation character except . and _

6 votes

Is there an easy way to match all punctuation except period and underscore, in a C# regex? Hoping to do it without enumerating every single punctuation mark.

Use Regex Subtraction

[\p{P}-[._]]

Here's the link for .NET Regex documentation (I'm not sure if other flavors support it)... http://msdn.microsoft.com/en-us/library/ms994330.aspx

Here's a C# example

string pattern = @"[\p{P}\p{S}-[._]]"; // added \p{S} to get ^,~ and ` (among others)
string test = @"_""'a:;%^&*~`bc!@#.,?";
MatchCollection mx = Regex.Matches(test, pattern);
foreach (Match m in mx)
{
    Console.WriteLine("{0}: {1} {2}", m.Value, m.Index, m.Length);
}

Explanation The pattern is a Character Class Subtraction. It starts with a standard character class like [\p{P}] and then adds a Subtraction Character Class like -[._] which says to remove the . and _. The subtraction is placed inside the [ ] after the standard class guts.

Regex C# problem

6 votes

I'm sure there is a simple solution, but I just seem to be missing it.

I need a regex to do the following:

asdf.txt;qwer should match asdf.txt

"as;df.txt";qwer should match as;df.txt

As you can see, I need to match up to the semi-colon, but if quotes exist(when there is a semi-colon in the value), I need to match inside the quotes. Since I am looking for a file name, there will never be a quote in the value.

My flavor of regex is C#.

Thanks for your help!

"[^"]+"(?=;)|[^;]+(?=;)

This matches text within double quotes followed by a semicolon OR text followed by a semicolon. The semicolon is NOT included in the match.

EDIT: realized my first attempt will match the quotes. The following expression will exclude the quotes, but uses subexpressions.

"([^"]+)";|([^;]+);

String pattern matching problem

6 votes

Imagine we have a long string containing the substrings 'cat' and 'dog' as well as other random characters, eg.

cat x dog cat x cat x dog x dog x cat x dog x cat

Here 'x' represents any random sequence of characters (but not 'cat' or 'dog').

What I want to do is find every 'cat' that is followed by any characters except 'dog' and then by 'cat'. I want to remove that first instance of 'cat' in each case.

In this case, I would want to remove the bracketed [cat] because there is no 'dog' after it before the next 'cat':

cat x dog [cat] x cat x dog x dog x cat x dog x cat

To end up with:

cat x dog x cat x dog x dog x cat x dog x cat

How can this be done?

I thought of somehow using a regular expression like (n)(?=(n)) as VonC recommended here

(cat)(?=(.*cat))

to match all of the pairs of 'cat' in the string. But I am still not sure how I could use this to remove each cat that is not followed by 'dog' before 'cat'.


The real problem I am tackling is in Java. But I am really just looking for a general pseudocode/regex solution.

Is there any particular reason you want to do this with just one RE call? I'm not sure if that's actually possible in one RE.

If I had to do this, I'd probably go in two passes. First mark each instance of 'cat' and 'dog' in the string, then write some code to identify which cats need to be removed, and do that in another pass.

Pseudocode follows:

// Find all the cats and dogs
int[] catLocations = string.findIndex(/cat/);
int[] dogLocations = string.findIndex(/dog/);
int [] idsToRemove = doLogic(catLocations, dogLocations);

// Remove each identified cat, from the end to the front
for (int id : idsToRemove.reverse())
  string.removeSubstring(id, "cat".length());

Question on this JavaScript Syntax ("What Does This Do?")

5 votes

John Resig wrote a nifty Class function, swanky. I'm trying to figure out what is going on, and have pretty much everything figured out except a single line:

fnTest = /xyz/.test(function () {xyz;}) ? /\b_super\b/ : /.*/;

A couple things immediately jump to mind, first xyz is never initialized as a variable; so why then does this work? Second, why is it testing /xyz/ against something that is not returning anything (no return statement). Unless there is some nifty properties of javascript I'm unaware of (which is possible, I fancy myself rather good at JS and can interpret most the code I come across it doesn't, however, mean I'm eve on the same Mt. Everest sized mountain that John Resig calls home).

For those curious, here is the full unedited code from john resigs site John Resig Simple Javascript Inheritance:

(function () {
  var initializing = false, fnTest = /xyz/.test(function(){xyz;}) ? /\b_super\b/ : /.*/;

  // The base Class implementation (does nothing)
  this.Class = function(){};

  // Create a new Class that inherits from this class
  Class.extend = function(prop) {
    var _super = this.prototype;

    // Instantiate a base class (but only create the instance,
    // don't run the init constructor)
    initializing = true;
    var prototype = new this();
    initializing = false;

    // Copy the properties over onto the new prototype
    for (var name in prop) {
      // Check if we're overwriting an existing function
      prototype[name] = typeof prop[name] == "function" &&
        typeof _super[name] == "function" && fnTest.test(prop[name]) ?
        (function(name, fn){
          return function() {
            var tmp = this._super;

            // Add a new ._super() method that is the same method
            // but on the super-class
            this._super = _super[name];

            // The method only need to be bound temporarily, so we
            // remove it when we're done executing
            var ret = fn.apply(this, arguments);       
            this._super = tmp;

            return ret;
          };
        })(name, prop[name]) :
        prop[name];
    }

    // The dummy class constructor
    function Class() {
      // All construction is actually done in the init method
      if ( !initializing && this.init )
        this.init.apply(this, arguments);
    }

    // Populate our constructed prototype object
    Class.prototype = prototype;

    // Enforce the constructor to be what we expect
    Class.constructor = Class;

    // And make this class extendable
    Class.extend = arguments.callee;

    return Class;
  };

})();

It is just a quick & dirty way to check if "function decompilation" works.

The RegExp.prototype.test method will take the argument and it will convert it to String, the xyz reference inside the function is never evaluated.

Why would you have to check this?

Because the Function.prototype.toString method returns an implementation-dependent representation of a function, and in some implementation, such older Safari versions, Mobile Opera, and some Blackberry browsers, they don't actually return anything useful.