Best regex questions in June 2011

How can I use back references with `grep` in R?

11 votes

I am looking for an elegant way of returning back references using regular expressions in R. Let me explain:

Let's say I want to find strings that start with a month name:

x <- c("May, 1, 2011", "30 June 2011")
grep("^May|^June", x, value=TRUE)
[1] "May, 1, 2011"

This works, but I really want to isolate the month (i.e. "May"), not the entire matched string.

So, one can use gsub to return the back reference through its replacement argument. But this has two problems:

  1. You have to wrap the pattern inside ".*(pattern).*" so that the substitution occurs on the entire string.
  2. Rather than returning NA for non-matched strings, gsub returns the original string. This is clearly not what I desire.

The code and results:

gsub(".*(^May|^June).*", "\\1", x) 
[1] "May"          "30 June 2011"

I could probably code a workaround by doing all kinds of additional checks, but this quickly becomes very messy.

To be crystal clear, the desired results should be:

[1] "May"          NA

Is there an easy way of achieving this?

The stringr package has a function exactly for this purpose:

library(stringr)
x <- c("May, 1, 2011", "30 June 2011", "June 2012")
str_extract(x, "May|^June")
# [1] "May"  NA     "June"

It's a fairly thin wrapper around regexpr, but stringr generally makes string handling easier by being more consistent than base R functions.
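For comparison outside R, the same "first match, or a missing value" behaviour can be sketched in Python. This is only an illustration of the idea, not how stringr is implemented; extract_first is a hypothetical helper name, and None stands in for R's NA.

```python
import re

def extract_first(strings, pattern):
    """Return the first match of pattern in each string, or None when there is none."""
    results = []
    for s in strings:
        m = re.search(pattern, s)
        results.append(m.group(0) if m else None)
    return results

x = ["May, 1, 2011", "30 June 2011", "June 2012"]
months = extract_first(x, r"^May|^June")
print(months)  # ['May', None, 'June']
```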

When should I prefer regex over built-in string functions?

9 votes

Some say I should use regex whenever possible, others say I should use it as little as possible. Is there something like a "Perl Etiquette" about that matter or just TIMTOWTDI?

I think a lot of the answers you got already are good. I want to address the etiquette part because I think there is some.

Summed up: if there is a robust parser available, use it instead of regular expressions; 100% of the time. Never recommend anything else to a novice. So–

Don'ts

Dos

  • Do use substr, index, and rindex where appropriate but recognize they can come off "unperly" so they are best used when benchmarking shows them superior to regular expressions; regexes can be surprisingly fast in many cases.
  • Do use regular expressions when there is no good parser available and writing a Parse::RecDescent grammar is overkill, too much work, or will be too slow.
  • Do use regular expressions for throw-away code like one-liners on well-known/predictable data including the HTML/CSV previously banned from regular expression use.
  • Do be aware of alternatives for bigger problems like P::RecD, Parse::Yapp, and Marpa.
  • Do keep your own counsel. Perl is supposed to be fun. Do whatever you like; just be prepared to get bashed if you complain when not following advice and it goes sideways. :P

Are ^$ and $^ in PHP regex the same?

8 votes

I know it's a stupid question, but I'm learning how the engine works and I want to know if someone can explain it.

if(preg_match_all('/$^/m',"",$array))
echo "Match";

if(preg_match_all('/$^\n$/m',"\n",$array))
echo "Match";

Both match!

$ and ^ are zero-width meta-characters. Unlike other meta-characters like . which match one character at a time (unless used with quantifiers), they do not actually match literal characters. This is why ^$ matches an empty string "", even though the regex (sans delimiters) contains two characters while the empty string contains zero.

It doesn't matter that an empty string contains no characters. It still has a starting point and an ending point, and since it's an empty string both are at the same location. Therefore no matter the order or number of ^ and $ you use, all of their permutations should match the empty string.


Your second case is slightly trickier but the same principles apply.

The m modifier (PCRE_MULTILINE) does not change how the string is fed to the PCRE engine; the entire string is still processed in one go, newlines included. What changes is the meaning of the anchors: ^ and $ now mean "the start of a line" and "the end of a line" respectively, rather than only the start and end of the whole string.

The string "\n" is essentially logically split into three parts: "", "\n" and "" (because the newline is surrounded by emptiness... sounds poetic).

Then these matches follow:

  1. The first empty string is matched by the starting $^ (as I explained above).
  2. The \n is matched by the same \n in your regex.
  3. The second empty string is matched by the last $.

And that's how your second case results in a match.
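The same reasoning can be checked quickly in another engine. The sketch below uses Python's re module rather than PHP's PCRE, on the assumption (true for these particular inputs) that its multi-line anchors behave the same way.

```python
import re

# Zero-width anchors consume no characters, so any ordering of them matches "".
empty_matches = [bool(re.search(p, "", re.M)) for p in (r"^$", r"$^", r"^^$$")]

# For "\n": $^ matches at offset 0, \n consumes the newline, the last $ matches at the end.
m = re.search(r"$^\n$", "\n", re.M)
span = m.span() if m else None

print(empty_matches, span)
```

The span covers exactly one character, the newline, which shows that the anchors themselves contributed nothing to the length of the match.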

Perl regex replace count

8 votes

Is it possible to specify the maximum number of matches to replace? For instance, if matching 'l' in "Hello World", would it be possible to replace the first 2 'l' characters, but not the third, without looping?

$str = "Hello world!";
$str =~ s/l/r/ for (1,2);
print $str;

I don't see what's so bad about looping.

Actually, here's a way:

$str="Hello world!"; 
$str =~ s/l/$i++ >= 2 ? "l": "r"/eg; 
print $str;

It's a loop, of sorts, since s///g works in a loopy way when you do this. But not a traditional loop.
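For comparison, some regex APIs take the replacement limit as an argument. A minimal sketch in Python (not Perl), where re.sub accepts a count parameter:

```python
import re

text = "Hello world!"
# count=2 stops after the first two replacements; the third "l" is left alone.
limited = re.sub("l", "r", text, count=2)
print(limited)  # Herro world!
```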

Perl regex replace in same case

7 votes

If you have a simple regex replace in perl as follows:

$line =~ s/JAM/AAA/g;

How would I modify it so that it looks at the match and makes the replacement the same case as the match? For example:

'JAM' would become 'AAA' and 'jam' would become 'aaa'

$line =~ s/JAM/{$& eq 'jam' ? 'aaa' : 'AAA'}/gie;
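The idea of computing the replacement from the matched text carries over to other languages. Below is a rough Python sketch of it, not a translation of the Perl above; same_case is a hypothetical helper, and it only distinguishes all-upper from all-lower matches.

```python
import re

def same_case(replacement):
    """Build a callback that returns replacement in the case of the matched text."""
    def repl(match):
        found = match.group(0)
        if found.isupper():
            return replacement.upper()
        if found.islower():
            return replacement.lower()
        return replacement  # mixed case: leave the replacement as given
    return repl

line = "JAM on toast, jam on bread"
converted = re.sub(r"jam", same_case("aaa"), line, flags=re.IGNORECASE)
print(converted)  # AAA on toast, aaa on bread
```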

RegExp match repeated characters

7 votes

For example I have string:

 aacbbbqq

As the result I want to have following matches:

 (aa, c, bbb, qq)  

I know that I can write something like this:

 ([a]+)|([b]+)|([c]+)|...  

But I think it's ugly and I'm looking for a better solution. I'm looking for a regular expression solution, not a self-written finite-state machine.

You can match that with: (\w)\1*
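A quick way to try the pattern is sketched here in Python (the question does not name a language). The back reference \1 repeats whatever character the group captured, so each match is one run of a single character. Note that finditer is used because findall would return only the captured first character of each run.

```python
import re

# Each match is a captured character followed by zero or more copies of it.
runs = [m.group(0) for m in re.finditer(r"(\w)\1*", "aacbbbqq")]
print(runs)  # ['aa', 'c', 'bbb', 'qq']
```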

Why does "hello\\s*world" not match "hello world"?

7 votes

Why does this code throw an InputMismatchException?

Scanner scanner = new Scanner("hello world");
System.out.println(scanner.next("hello\\s*world"));

The same regex matches in http://regexpal.com/ (with \s instead of \\s)

A Scanner, as opposed to a Matcher, has built-in tokenization of the string, and the default delimiter is whitespace. So your "hello world" is getting tokenized into "hello" and "world" before the match runs. It would match if you changed the delimiter before scanning to something not in the string, e.g.:

Scanner scanner = new Scanner("hello world");
scanner.useDelimiter(":");
System.out.println(scanner.next("hello\\s*world"));

But for your case it seems you should really just be using a Matcher.

This is an example of using a Scanner "as intended":

   Scanner scanner = new Scanner("hello,world,goodnight,moon");
   scanner.useDelimiter(",");
   while (scanner.hasNext()) {
     System.out.println(scanner.next("\\w*"));
   }

output would be

hello
world
goodnight
moon
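The effect of tokenizing first can be mimicked in a few lines. This is a rough Python analogy, not how Scanner is implemented: split() stands in for the default whitespace delimiter, and fullmatch stands in for matching a whole token.

```python
import re

text = "hello world"
pattern = r"hello\s*world"

# Tokenize on whitespace first, as a Scanner does by default, then match each token.
tokens = text.split()
token_hits = [re.fullmatch(pattern, t) is not None for t in tokens]

# Matching against the untokenized string succeeds.
whole_hit = re.fullmatch(pattern, text) is not None

print(tokens, token_hits, whole_hit)
```

Neither token can match on its own, because the whitespace the pattern is looking for was consumed as a delimiter.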

What does a character class with only a lone caret do?

7 votes

In trying to answer the question Writing text into new line when a particular character is found, I have employed Regexp::Grammars. It has long interested me and finally I had reason to learn. I noticed that in the description section the author has a LaTeX parser (I am an avid LaTeX user, so this interested me), but it has one odd construct seen here:

    <rule: Option>     [^][\$&%#_{}~^\s,]+

    <rule: Literal>    [^][\$&%#_{}~^\s]+

What do the [^] character classes accomplish?

[^][…] is not two character classes but just one character class containing any character except ], [, and … (see Special Characters Inside a Bracketed Character Class):

However, if the ] is the first (or the second if the first character is a caret) character of a bracketed character class, it does not denote the end of the class (as you cannot have an empty class) and is considered part of the set of characters that can be matched without escaping.

Examples:

"+"   =~ /[+?*]/     #  Match, "+" in a character class is not special.
"\cH" =~ /[\b]/      #  Match, \b inside in a character class
                     #  is equivalent to a backspace.
"]"   =~ /[][]/      #  Match, as the character class contains.
                     #  both [ and ].
"[]"  =~ /[[]]/      #  Match, the pattern contains a character class
                     #  containing just [, and the character class is
                     #  followed by a ].
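Python's re module follows the same convention for a ] placed first in a class (or right after the caret), so the behaviour can be tried there. The sketch below uses a shortened version of the rule from the question; the sample input is made up.

```python
import re

# One negated class: anything except "]", "[", whitespace and ",".
literal_re = re.compile(r"[^][\s,]+")
pieces = literal_re.findall("a[b],c d")

# One class containing exactly "]" and "[".
brackets = re.findall(r"[][]", "x[y]z")

print(pieces, brackets)
```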

Regex for matching quotes and single quotes

7 votes

I'm currently writing a parser for ColdFusion code. I'm using a regex (in C#) to extract the name in the datasource attribute of the cfquery tag.

For the time being the regex is the following: <cfquery\s.*datasource\s*=\s*(?:'|")(.*)(?:'|")

It works well for strings like <cfquery datasource="myDS" or <cfquery datasource='myDS'

But it gets crazy when parsing strings like <cfquery datasource="#GetSourceName('myDS')#"

Obviously the part of the regex (?:'|") is the cause. Is there a way to only match single quote when the first match was a single quote? And only match the double quote when the first match was a double quote?

Thanks in advance!

Edit: I think this should work in C#; you just need to use a back reference:

datasource\s*=\s*('|")(.*)(?:\1)

(In .NET, $1 is substitution syntax for replacement strings only; inside a pattern the back reference is written \1.)

This matches datasource="#GetSourceName('myDS')#" by referring back to the first capture group with \1.

Of course, the first group has to stay a capturing group for this to work; you cannot make it non-capturing with ?:. Also, you may want to make the quantifier lazy, (.*?), so that it does not run on to a later quote.
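Here is a small sketch of the back reference idea, written in Python rather than C# so it can be run as is. The \1 syntax is the same in both; datasource_name is a hypothetical helper name, and the lazy (.*?) is my own choice.

```python
import re

# Group 1 captures the opening quote; \1 requires the same quote to close the value.
DATASOURCE = re.compile(r"""datasource\s*=\s*(['"])(.*?)\1""")

def datasource_name(tag):
    """Return the quoted datasource value, or None if the attribute is absent."""
    m = DATASOURCE.search(tag)
    return m.group(2) if m else None

tags = [
    '<cfquery datasource="myDS"',
    "<cfquery datasource='myDS'",
    '''<cfquery datasource="#GetSourceName('myDS')#"''',
]
names = [datasource_name(t) for t in tags]
print(names)
```

In the third tag the single quotes inside the value are skipped over, because only a double quote can satisfy \1 there.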

How to grab numbers in the middle of a string? (Python)

7 votes
random string
this is 34 the string 3 that, i need 234
random string
random string
random string
random string

random string
this is 1 the string 34 that, i need 22
random string
random string
random string
random string

random string
this is 35 the string 55 that, i need 12
random string
random string
random string
random string

Within one string there are multiple lines. One of the lines is repeated but with different numbers each time. I was wondering how can I store the numbers in those lines. The numbers will always be in the same position in the line, but can be any number of digits.

Edit: The random strings could have numbers in them as well.

Use regular expressions:

>>> import re
>>> comp_re = re.compile(r'this is (\d+) the string (\d+) that, i need (\d+)')
>>> s = """random string
this is 34 the string 3 that, i need 234
random string
random string
random string
random string

random string
this is 1 the string 34 that, i need 22
random string
random string
random string
random string

random string
this is 35 the string 55 that, i need 12
random string
random string
random string
random string
"""
>>> comp_re.findall(s)
[('34', '3', '234'), ('1', '34', '22'), ('35', '55', '12')]

What does \D do in Perl regular expressions?

7 votes

In some code I am maintaining, I have found the expression:

$r->{DISPLAY} =~ s/\Device//s;

What surprises me is that it matches both device and Device!

I have not found any mention of \D in the documentation, only \d.

Can someone clarify please...

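\D is the negation of \d, i.e. it matches any character that is not a digit. So \Device is read as \D followed by evice: any non-digit character followed by the literal text evice, which is why it matches both device and Device.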
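This is easy to confirm with a few test strings. The sketch below uses Python, where \D has the same meaning as in Perl; the test strings are made up.

```python
import re

# \D consumes one non-digit character; "evice" must follow it literally.
pattern = re.compile(r"\Device")
results = {s: bool(pattern.search(s)) for s in ("device", "Device", "1evice", "evice")}
print(results)
```

The last two fail for different reasons: in "1evice" the character before evice is a digit, and in "evice" there is no character for \D to consume at all.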

Open a file and filter it using a regular expression

7 votes

I have a large logfile and I want to extract (write to a new file) certain rows. The problem is I need a certain row and the row before. So the regex should be applied on more than one row. Notepad++ is not able to do that and I don't want to write a script for that.

I assume I can do that with PowerShell and a one-liner, but I don't know where to start ...

The regular expression is not the problem; it will be something like ^#\d+.*?\n.*?Failed.*?$

So, how can I open a file in PowerShell, pass in the regex, and get back the rows that fit my expression?

Look at Select-String and its -Context parameter.

If you only need to display the matching line and the line before it, use the following (for the test I use my own log file and my own regex, a date that appears in it):

Get-Content c:\Windows\System32\LogFiles\HTTPERR\httperr2.log  |
    Select-String '2011-05-13 06:16:10' -context 1,0

If you need to manipulate it further, store the result in a variable and use the properties:

$line = Get-Content c:\Windows\System32\LogFiles\HTTPERR\httperr2.log  |
        Select-String '2011-05-13 06:16:10' -context 1

# for all the members try this:
$line | Get-Member

# line that matches the regex:
$line.Line
# line(s) before the match:
$line.Context.PreContext

If there are more lines that match the regex, access them with brackets:

$line = Get-Content c:\Windows\System32\LogFiles\HTTPERR\httperr2.log  |
        Select-String '2011-05-13 06:16:10' -context 1
$line[0] # first match
$line[1] # second match
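The underlying idea is simply to remember the previous line while scanning. A minimal Python sketch of that idea follows, in case PowerShell is not at hand; with_previous_line is a hypothetical helper, and the log lines are invented for the example.

```python
import re

def with_previous_line(lines, pattern):
    """Yield (previous_line, matching_line); previous_line is None for the first line."""
    rx = re.compile(pattern)
    previous = None
    for line in lines:
        if rx.search(line):
            yield previous, line
        previous = line

# Hypothetical log content, made up for this example.
log = [
    "#101 job started",
    "step 1 ok",
    "#102 job started",
    "step 1 Failed: timeout",
]
pairs = list(with_previous_line(log, r"Failed"))
print(pairs)
```

Because the regex is applied to one line at a time, it no longer needs the \n in the middle; the pairing with the preceding row happens in the loop.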

Pattern.matches doesn't work, while replaceAll does

7 votes

The regular expression seems to be OK, since the first line correctly replaces the substring with "helloworld", but the same expression won't match in the second line, since I cannot see "whynothelloworld?" on the console.

System.out.println(current_tag.replaceAll("^[01][r]\\s", "helloworld"));

if (Pattern.matches("^[01][r]\\s", current_tag)) { System.out.println("whynothelloworld?");}

Pattern.matches() expects the entire string to match, not just a substring.

Use the .find() method of the regex matcher object instead:

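// Requires java.util.regex.Pattern and java.util.regex.Matcher.
Pattern regex = Pattern.compile("^[01]r\\s");
Matcher regexMatcher = regex.matcher(current_tag);
boolean foundMatch = regexMatcher.find(); // true if the pattern matches anywhere in current_tag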
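The same whole-string versus substring distinction exists in other regex APIs. As a rough analogy in Python (not Java), re.fullmatch plays the role of Pattern.matches() and re.search the role of find(); the sample value of current_tag is hypothetical.

```python
import re

current_tag = "0r some text"  # hypothetical sample value

# Whole-string match: fails, because text remains after the "0r " prefix.
whole = re.fullmatch(r"[01]r\s", current_tag) is not None

# Substring match anchored at the start: succeeds on the "0r " prefix.
part = re.search(r"^[01]r\s", current_tag) is not None

print(whole, part)
```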

Dealing with commas in CSV

7 votes

I get CSV data from a SOAP call in PHP. Unfortunately, the data may have commas in it. It is formatted correctly, as in

1,name,2,lariat,3,"first, last",5,NMEA,...

I need to parse it into individual values in either PHP or JavaScript. I have browsed through threads on Stack Overflow and elsewhere but have not found a specific solution in PHP / JavaScript.

The approach I am currently using is

$subject = '123,name,456,lryyrt,123213,"first,last",8585,namea3';
$pattern = '/,|,"/';
$t2=preg_replace ('/,|(".*")/','$0*',$subject);
$t2=str_replace(',','*',$t2);
$t2=str_replace('*',',',$t2);

Where * is the delimiter, but the preg_replace generates an extra *. I have tried a couple of other approaches involving preg_match and other preg_ functions but did not succeed in having any kind of a clean split.

Any suggestion on how to split up CSV data that contains commas in it?

Don't attempt to do this with a regular expression. Just use str_getcsv()! The third parameter tells str_getcsv() which character encloses quoted fields.

$subject = '123,name,456,lryyrt,123213,"first,last",8585,namea3';
$array = str_getcsv($subject, ",", '"');

print_r($array);

Prints the following:

Array
(
    [0] => 123
    [1] => name
    [2] => 456
    [3] => lryyrt
    [4] => 123213
    [5] => first,last
    [6] => 8585
    [7] => namea3
)
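The same advice applies in other languages: reach for the CSV parser that ships with the language. As an illustration only (the question asks about PHP and JavaScript), here is the equivalent with Python's standard csv module:

```python
import csv

subject = '123,name,456,lryyrt,123213,"first,last",8585,namea3'

# csv.reader takes an iterable of lines; the default quote character is the double quote.
fields = next(csv.reader([subject]))
print(fields)
```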

Why is it that regex cannot match an XML element?

6 votes

This article argues that regular expressions cannot match nested structures because regexes are finite automata.

He then offers a list of problems that, he states, cannot be solved using regexes:

  1. matching an XML element
  2. matching a C/VB/C# math expression
  3. matching a valid regex

Since 2 and 3 can conceivably contain brackets, this nesting is unsolvable for regexes. But why is it impossible to match an XML element? (He didn't provide examples.)

You can match a limited subset of HTML tags, if you know in advance the tags to be matched.

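But you can't (reliably or nicely) parse arbitrary HTML or XML. Elements can nest to arbitrary depth, so the language is not regular: pairing each opening tag with its own closing tag requires counting that a finite automaton cannot do.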
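A small experiment makes the problem concrete. The sketch below is in Python with made-up inputs; it shows that neither a lazy nor a greedy pattern pairs tags correctly for both nested and sibling elements.

```python
import re

nested = "<b>outer <b>inner</b> tail</b>"
siblings = "<b>one</b> and <b>two</b>"

lazy = re.compile(r"<b>(.*?)</b>")
greedy = re.compile(r"<b>(.*)</b>")

# Lazy stops at the first closing tag: right for siblings, wrong for nesting.
lazy_nested = lazy.search(nested).group(1)
lazy_siblings = lazy.search(siblings).group(1)

# Greedy runs to the last closing tag: right for nesting, wrong for siblings.
greedy_nested = greedy.search(nested).group(1)
greedy_siblings = greedy.search(siblings).group(1)

print(lazy_nested, "|", greedy_siblings)
```

Whichever quantifier is chosen, one of the two inputs is paired with the wrong closing tag, because the pattern has no way to count how many elements are currently open.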

When a string is being matched against a regular expression, what's going on behind the scenes?

6 votes

I'd be interested to know what kind of algorithms are used for matching it, and how they are optimised, because I imagine that some regexes could produce a vast number of possible matches that could cause serious problems on a poorly written regex parser.

Also, I recently discovered the concept of ReDoS. Why do regexes such as (a|aa)+ or (a|a?)+ cause problems?

EDIT: I have used them most in C# and Python, so that's what was in my mind when I was considering the question. I assume Python's is written in C like the rest of the interpreter, but I have no idea about C#.

There are two kinds of regular expression engine: NFA and DFA. I am quite rusty so I don't dare go into specifics from memory. Here is a page that goes through the algorithms, though. Some parsers will perform better with poorly-written expressions. A good book on the subject (that is sitting on my shelf) is Mastering Regular Expressions.
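To give a feel for why a pattern like (a|aa)+ is dangerous, it helps to count how many different ways it can carve up a run of a's. The Python sketch below only does that counting; it is an illustration of the combinatorics, not a description of how any particular engine works. count_splits is a hypothetical helper name.

```python
def count_splits(n):
    """Count the ways to write 'a' * n as a sequence of 'a' and 'aa' pieces."""
    ways = [1, 1]  # one way for the empty string, one way for a single 'a'
    for i in range(2, n + 1):
        # The last piece is either 'a' (leaving i - 1) or 'aa' (leaving i - 2).
        ways.append(ways[i - 1] + ways[i - 2])
    return ways[n]

for n in (10, 20, 30):
    print(n, count_splits(n))
```

The count grows like the Fibonacci numbers, i.e. exponentially. A backtracking engine that is asked to match something like ^(a|aa)+$ against a long run of a's followed by a character that makes the match fail may end up trying a large share of these splits before giving up, which is the behaviour ReDoS exploits.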

Help building a regex

6 votes

I need to build a regular expression that finds the word "int" only if it's not part of some string.

I want to find whether int is used in the code. (not in some string, only in regular code)

Example:

int i;  // the regex should find this one.
String example = "int i"; // the regex should ignore this line.
logger.i("int"); // the regex should ignore this line. 
logger.i("int") + int.toString(); // the regex should find this one (because of the second int)

thanks!

It's not going to be bullet-proof, but this works for all your test cases:

(?<=^([^"]*|[^"]*"[^"]*"[^"]*))\bint\b(?=([^"]*|[^"]*"[^"]*"[^"]*)$)

It does a look behind and look ahead to assert that there's either none or two preceding/following quotes "

Here's the code in Java with the output:

    String regex = "(?<=^([^\"]*|[^\"]*\"[^\"]*\"[^\"]*))\\bint\\b(?=([^\"]*|[^\"]*\"[^\"]*\"[^\"]*)$)";
    System.out.println(regex);
    String[] tests = new String[] { 
            "int i;", 
            "String example = \"int i\";", 
            "logger.i(\"int\");", 
            "logger.i(\"int\") + int.toString();" };

    for (String test : tests) {
        System.out.println(test.matches("^.*" + regex + ".*$") + ": " + test);
    }

Output (included regex so you can read it without all those \ escapes):

(?<=^([^"]*|[^"]*"[^"]*"[^"]*))\bint\b(?=([^"]*|[^"]*"[^"]*"[^"]*)$)
true: int i;
false: String example = "int i";
false: logger.i("int");
true: logger.i("int") + int.toString();

Using a regex is never going to be 100% accurate - you need a language parser. Consider escaped quotes in Strings "foo\"bar", in-line comments /* foo " bar */, etc.
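A different approach, swapped in here because it copes with escaped quotes, is to blank out string literals first and only then look for the word. This is a Python sketch of that alternative, not the lookaround regex above; uses_int is a hypothetical helper, and comments are still not handled.

```python
import re

# A double-quoted literal, allowing backslash-escaped characters inside it.
STRING_LITERAL = re.compile(r'"(?:\\.|[^"\\])*"')

def uses_int(line):
    """True if the word int appears outside double-quoted string literals."""
    code_only = STRING_LITERAL.sub('""', line)
    return re.search(r"\bint\b", code_only) is not None

tests = [
    'int i;',
    'String example = "int i";',
    'logger.i("int");',
    'logger.i("int") + int.toString();',
    'String s = "a \\" int";',  # escaped quote inside the literal
]
results = [uses_int(t) for t in tests]
print(results)
```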

How to neatly match "x" and "[x]" with a regex without repeating?

6 votes

I'm writing a Perl regex to match both the strings x bla and [x] bla. One alternative is /(?:x|\[x\]) bla/. This isn't desirable, because in the real world, x is more complicated, so I want to avoid repeating it.

The best solution so far is putting x in a variable and pre-compiling the regex:

my $x = 'x';
my $re = qr/(?:$x|\[$x\]) bla/o;

Is there a neater solution? In this case, readability is more important than performance.

It's possible, but not all that clean. You can use the fact that conditional subpatterns support tests such as (?(N)) to check that the Nth capturing subpattern successfully matched. So you can use an expression such as /(\[)?X(?(1)\])/ to match '[X]' or 'X'.
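The conditional construct is not specific to Perl. Python's re module supports the same (?(1)...) syntax, so the behaviour can be sketched there; the test strings are made up, and x stands in for the more complicated real pattern.

```python
import re

# (\[)? optionally captures "["; (?(1)\]) demands "]" only if group 1 took part in the match.
bracket_re = re.compile(r"(\[)?x(?(1)\]) bla")

checks = {
    s: bracket_re.fullmatch(s) is not None
    for s in ("x bla", "[x] bla", "[x bla", "x] bla")
}
print(checks)
```

The two unbalanced forms are rejected, which the simpler \[?x\]? would not do.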

Is there a better way to match two different repetitions of the same character class in a regex?

6 votes

I had been using [0-9]{9,12} all along to signify that the numeric string has a length of 9 or 12 characters. However, I now realize that it will match input strings of length 10 or 11 as well. So I came up with the naive:

( [0-9]{9} | [0-9]{12} )

Is there a more succinct regex to represent this?

You could save one character by using

[0-9]{9}([0-9]{3})?

but in my opinion your way is better because it conveys your intention more clearly. Regexes are hard enough to read already.

Of course you could use \d instead of [0-9].

(Edit: I first thought you could drop the parens around [0-9]{3}, but you can't; without them the question mark applies to the {3} quantifier (making it lazy) rather than making the three digits optional. So you only save one character, not three.)

(Edit 2: You will also need to anchor the regex with ^ and $ (or \b) or re.match() will also match 123456789 within 1234567890.)
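The effect of the optional group and of anchoring can be checked by trying each length in turn. A minimal Python sketch, where fullmatch supplies the anchoring and the non-capturing group is my own choice:

```python
import re

# Nine digits, optionally followed by exactly three more.
nine_or_twelve = re.compile(r"[0-9]{9}(?:[0-9]{3})?")

lengths = {n: nine_or_twelve.fullmatch("1" * n) is not None for n in range(8, 14)}
print(lengths)

# Without anchoring, a 10-digit string still contains a 9-digit match.
unanchored = nine_or_twelve.search("1234567890").group(0)
print(unanchored)  # 123456789
```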

Visual Basic Regular Expression Question

5 votes

I have a list of strings. When the user types in characters, the program should display all possible strings from the list in a textbox.

Dim fruit as new List(Of String) 'contains apple,orange,pear,banana
Dim rx as New Regex(fruit)

For example, if the user enters a,p,l,e,r, then the program would display apple and pear. It should match any entry for which all letters have been entered, regardless of order and regardless of additional letters. What should I add to rx? If it's not possible with Regular Expressions, then please specify any other ways to do this.

LINQ Approach:

Dim fruits As New List(Of String) From { "apple", "orange", "pear", "banana" }
Dim input As String = "a,p,l,e,r"
Dim letters As String = input.Replace(",", "")
Dim result = fruits.Where(Function(fruit) Not fruit.Except(letters).Any())

Regex Approach:

A regex pattern to match the results would resemble something like:

"^[apler]+$"

This can be built up as:

Dim fruits As New List(Of String) From { "apple", "orange", "pear", "banana" }
Dim input As String = "n,a,b,r,o,n,g,e"
Dim letters As String = input.Replace(",", "")
Dim pattern As String = "^[" + letters + "]+$"
Dim query = fruits.Where(Function(fruit) Regex.IsMatch(fruit, pattern))
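Both approaches come down to a subset test: every letter of the word must be among the letters entered. Here is a sketch of the two in Python rather than VB, for comparison; by_set and by_regex are hypothetical helper names, and re.escape is my own addition to keep unusual input from breaking the character class.

```python
import re

fruits = ["apple", "orange", "pear", "banana"]

def by_set(words, raw_input):
    """Keep words whose letters are all contained in the entered letters."""
    letters = set(raw_input.replace(",", ""))
    return [w for w in words if set(w) <= letters]

def by_regex(words, raw_input):
    """Same test, expressed as a character class built from the entered letters."""
    letters = raw_input.replace(",", "")
    if not letters:
        return []  # an empty class would be an invalid pattern
    pattern = re.compile("[" + re.escape(letters) + "]+")
    return [w for w in words if pattern.fullmatch(w)]

print(by_set(fruits, "a,p,l,e,r"))
print(by_regex(fruits, "n,a,b,r,o,n,g,e"))
```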