Best regex questions in August 2010

49 votes

The goal

Today's Code Golf challenge is to create a regex parser in as few characters as possible.

The syntax

No, I'm not asking you to match Perl-style regular expressions. There's already a very reliable interpreter for those, after all! :-)

Here's all you need to know about regex syntax for this challenge:

  • A term is defined as a single literal character, or a regular expression within grouping parentheses ().
  • The * (asterisk) character represents a Kleene star operation on the previous TERM. This means zero or more of the previous term, concatenated together.
  • The + (plus) character represents a convenient shortcut: a+ is equivalent to aa*, meaning one or more of the previous term.
  • The ? (question mark) character represents zero or one of the previous term.
  • The | (pipe) character represents an alternation, meaning that the REGULAR EXPRESSIONS on either side can be used in the match.
  • All other characters are assumed to be literal. You may assume that all other characters are within [0-9A-Za-z] (i.e., all English alphanumerics).

Or, put another way: */+/? have highest precedence, then concatenation, then alternation. Since alternation has lower precedence than concatenation, its use within a regex without parentheses causes it to be bound to the full regex on each side. * and + and ?, on the other hand, would just apply to the immediately preceding term.

The challenge

Your challenge is to write a program that will compile or interpret a regular expression (as defined above) and then test a number of strings against it.

I'm leaving input up to you. My recommendation would be that the regex should probably come first, and then any number of strings to be tested against it; but if you want to make it last, that's fine. If you want to put everything in command-line arguments or into stdin, or the regex in command-line and the strings in stdin, or whatever, that's fine. Just show a usage example or two.

Output should be true or false, one per line, to reflect whether or not the regex matches.

Notes:

  • I shouldn't need to say this... but don't use any regex libraries in your language! You need to compile or interpret the pattern yourself. (Edit: You may use regex if it's required for splitting or joining strings. You just can't use it to directly solve the problem, e.g., converting the input regex into a language regex and using that.)
  • The regular expression must COMPLETELY match the input string for this challenge. (Equivalently, if you're familiar with Perl-like regex, assume that start- and end-of-string anchoring is in place for all matches)
  • For this challenge, all of the special characters ()*+?| are not expected to occur literally. If one comes up in the input, it is safe to assume that no pattern can match the string in question.
  • Input strings to test should be evaluated in a case-sensitive manner.

The examples

For the examples, I'm assuming everything is done in command-line arguments, regex first. (As I said above, input is up to you.) myregex here represents your invocation of the program.

> myregex easy easy Easy hard
true
false
false

> myregex ab*a aa abba abab b
true
true
false
false

> myregex 0*1|10 1 10 0110 00001
true
true
false
true

> myregex 0*(1|1+0) 1 10 0110 00001
true
true
true
true

> myregex a?b+|(a+b|b+a?)+ abb babab aaa aabba a b
true
true
false
true
false
true

NOTE: Sorry, forgot to make community wiki! :-(

GolfScript - 254 chars

n%([]:B:$:_"()"@*{:I"()*+|?"[{}/]?[{[[0B$,+:B))\;)]_]+}{B)):ß;:B;qß(:ß;}{8q}{[[0ß0$,)]]+}:8{[[0B-1=:ß)]]+:$q}{ß>$ß<\([0+$,+]\++}:q{[[I$,:ß)]]+}]=~:$}/;{n+[0]:3\{:c;;3:1_:3;{,}{)[$=]_*2/{~\.{c={3|:3}*;}{;.1|1,\:1,<{+0}*;}if}/}/;}/;1$,?)"true""false"if n}%

Somewhat straightforwardly, the first loop converts the regex into an NFA, which the second loop runs.
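
For readers who don't read GolfScript, here is an ungolfed Python sketch of the same two-stage idea: parse the pattern, build an NFA, then simulate it on each input. It is not a translation of the GolfScript above. The names parse, build_nfa and matches are made up for this illustration, and the sketch assumes a well-formed pattern in the challenge's syntax.

```python
def parse(pattern):
    """Recursive-descent parser: returns a nested-tuple syntax tree."""
    pos = 0

    def alternation():
        nonlocal pos
        branches = [concatenation()]
        while pos < len(pattern) and pattern[pos] == '|':
            pos += 1
            branches.append(concatenation())
        return ('alt', branches) if len(branches) > 1 else branches[0]

    def concatenation():
        items = []
        while pos < len(pattern) and pattern[pos] not in '|)':
            items.append(repetition())
        return ('cat', items)

    def repetition():
        nonlocal pos
        node = term()
        while pos < len(pattern) and pattern[pos] in '*+?':
            node = (pattern[pos], node)
            pos += 1
        return node

    def term():
        nonlocal pos
        ch = pattern[pos]
        pos += 1
        if ch == '(':
            node = alternation()
            pos += 1  # skip the closing ')'
            return node
        return ('lit', ch)

    return alternation()


def build_nfa(tree):
    """Thompson-style construction. eps[s] is the set of states reachable
    from s without input; step[s] is a (char, target) pair."""
    eps, step = {}, {}

    def new_state():
        state = len(eps)
        eps[state] = set()
        return state

    def fragment(node):
        kind = node[0]
        start, end = new_state(), new_state()
        if kind == 'lit':
            step[start] = (node[1], end)
        elif kind == 'cat':
            current = start
            for child in node[1]:
                child_start, child_end = fragment(child)
                eps[current].add(child_start)
                current = child_end
            eps[current].add(end)
        elif kind == 'alt':
            for child in node[1]:
                child_start, child_end = fragment(child)
                eps[start].add(child_start)
                eps[child_end].add(end)
        else:  # one of '*', '+', '?'
            child_start, child_end = fragment(node[1])
            eps[start].add(child_start)
            eps[child_end].add(end)
            if kind in '*?':
                eps[start].add(end)              # allow zero occurrences
            if kind in '*+':
                eps[child_end].add(child_start)  # allow repetition
        return start, end

    start, accept = fragment(tree)
    return eps, step, start, accept


def closure(states, eps):
    """All states reachable from `states` via epsilon transitions."""
    stack, seen = list(states), set(states)
    while stack:
        for target in eps[stack.pop()]:
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def matches(pattern, text):
    """True if the whole of `text` matches `pattern`."""
    eps, step, start, accept = build_nfa(parse(pattern))
    current = closure({start}, eps)
    for ch in text:
        moved = {step[s][1] for s in current if s in step and step[s][0] == ch}
        current = closure(moved, eps)
    return accept in current
```

For example, matches('ab*a', 'abba') is True and matches('ab*a', 'abab') is False, in line with the challenge's examples.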

Sun Aug 22 00:58:24 EST 2010 (271→266) changed variable names to remove spaces
Sun Aug 22 01:07:11 EST 2010 (266→265) made [] a variable
Sun Aug 22 07:05:50 EST 2010 (265→259) made null state transitions inline
Sun Aug 22 07:19:21 EST 2010 (259→256) final state made implicit
Mon Feb 7 19:24:19 EST 2011 (256→254) using "()""str"*

$ echo "ab*a aa abba abab b"|tr " " "\n"|golfscript regex.gs
true
true
false
false

$ echo "0*1|10 1 10 0110 00001"|tr " " "\n"|golfscript regex.gs
true
true
false
true

$ echo "0*(1|1+0) 1 10 0110 00001"|tr " " "\n"|golfscript regex.gs
true
true
true
true

$ echo "a?b+|(a+b|b+a?)+ abb babab aaa aabba a b"|tr " " "\n"|golfscript regex.gs
true
true
false
true
false
true

$ echo "((A|B|C)+(a|(bbbbb)|bb|c)+)+ ABCABCaccabbbbbaACBbbb ABCABCaccabbbbbaACBbbbb"|tr " " "\n"|golfscript regex.gs
false
true

Is there a need for a more declarative way of expressing regular expressions? :)

14 votes

I am trying to create a Python function that can take a plain English description of a regular expression and return the regular expression to the caller.

Currently I am thinking of the description in YAML format. So, we can store the description as a raw string variable, which is passed on to another function, and the output of that function is then passed to the 're' module. Following is a rather simplistic example:

# a(b|c)d+e*
re1 = """
- literal: 'a'
- one_of: 'b,c'
- one_or_more_of: 'd'
- zero_or_more_of: 'e'
"""
myre = re.compile(getRegex(re1))
myre.search(...)

etc.

Does anyone think something of this sort would be of wider use? Do you know already existing packages that can do it? What are the limitations that you see to this approach? Does anyone think, having the declarative string in code, would make it more maintainable?

For developers trying to write regular expressions that are easy to grok and maintain, I wonder whether this sort of approach would offer anything that re.VERBOSE does not provide already.
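
For comparison, here is roughly what the question's a(b|c)d+e* example looks like with re.VERBOSE. This is only a sketch to make the comparison concrete; the variable name pattern is arbitrary.

```python
import re

# With re.VERBOSE, whitespace in the pattern is ignored and '#' starts a comment.
pattern = re.compile(r"""
    a        # literal 'a'
    (b|c)    # one of 'b' or 'c'
    d+       # one or more 'd'
    e*       # zero or more 'e'
""", re.VERBOSE)

match = pattern.search("xx acddee yy")
```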

For beginners, your idea might have some appeal. However, before you go down this path, you might try to mock up what your declarative syntax would look like for more complicated regular expressions using capturing groups, anchors, look-ahead assertions, and so forth. One challenge is that you might end up with a declarative syntax that is just as difficult to remember as the regex language itself.

You might also think about alternative ways to express things. For example, the first thought that occurred to me was to express a regex using functions with short, easy-to-remember names. For example:

from refunc import *

pattern = Compile(
    'a',
    Capture(
        Choices('b', 'c'),
        N_of( 'd', 1, Infin() ),
        N_of( 'e', 0, Infin() ),
    ),
    Look_ahead('foo'),
)

But when I see that in action, it looks like a pain to me. There are many aspects of regex that are quite intuitive -- for example, + to mean "one or more". One option would be a hybrid approach, allowing your user to mix those parts of regex that are already simple with functions for the more esoteric bits.

pattern = Compile(
    'a',
    Capture(
        '[bc]',
        'd+',
        'e*',
    ),
    Look_ahead('foo'),
)
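
A minimal sketch of how such helpers could be written, assuming they simply build pattern strings. The refunc module is hypothetical, and the three functions below are guesses at what it might contain, not an existing library.

```python
import re

def Capture(*parts):
    """Join the parts and wrap them in a capturing group."""
    return "(" + "".join(parts) + ")"

def Look_ahead(*parts):
    """Join the parts and wrap them in a positive look-ahead assertion."""
    return "(?=" + "".join(parts) + ")"

def Compile(*parts):
    """Join the parts and hand the result to the re module."""
    return re.compile("".join(parts))

pattern = Compile(
    'a',
    Capture(
        '[bc]',
        'd+',
        'e*',
    ),
    Look_ahead('foo'),
)
match = pattern.search("xxacddfoo")
```

Here pattern.pattern is a([bc]d+e*)(?=foo), so the helpers add nothing the regex syntax cannot already say; the open question is whether they read better.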

I would add that in my experience, regular expressions are about learning a thought process. Getting comfortable with the syntax is the easy part.

Are regular expressions used to build parsers?

10 votes

This is just a question out of curiosity, since I have been needing to get more and more into parsing and using regex lately. For the questions I come across in my searches regarding parsing of some sort, whenever something relating to regex is asked, someone always ends up saying "regex isn't good for that, use such and such parser instead". As I have come to better understand regex, I think most stuff is possible; it's just rather complex and time-consuming, since you have to account for many different possibilities, and of course it has to be combined with conditional statements and loops to build any sort of parser. So I'm wondering whether regex is what is used to build most parsers, or whether some other method is being used. I ask because I may need to build some fairly complex custom parsers soon, where there isn't necessarily an existing one to use.

Thanks for any info, as I can't seem to find a direct answer to this.

Typically, you'll use two (at least) types of tools in building your parser.

The first part is lexical analysis -- separating characters into tokens and filtering out comments and whitespace. That part is typically done with regular expressions. Well, it's even more typically done using a scanner generator that converts a collection of pairs of regular expressions and code into a program that executes the corresponding code when it recognizes the regular expressions. This turns out to be more efficient than testing each regular expression each time, and it also works better for various other reasons. FLEX is a common tool for this in C.
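
As a rough illustration of the "pairs of regular expressions and code" idea, here is a tiny hand-rolled scanner in Python. It is not what FLEX generates. The token names and patterns are invented for this sketch, and the only "code" attached to each pattern is the choice to keep or drop the token.

```python
import re

# (token name, pattern) pairs; earlier entries win when several could match.
TOKEN_SPEC = [
    ("NUMBER",  r"\d+"),
    ("IDENT",   r"[A-Za-z_]\w*"),
    ("OP",      r"[+\-*/=()]"),
    ("SKIP",    r"\s+"),
    ("COMMENT", r"#.*"),
]

# One combined pattern, so each step needs a single match attempt.
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Split `source` into (kind, text) tokens, dropping whitespace and comments."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        if m.lastgroup not in ("SKIP", "COMMENT"):
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

tokenize("x = 42 + y # sum") yields the tokens for x, =, 42, + and y, with the comment and spaces filtered out.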

The second part of your parser is the grammar. The most typical tool for this is another program-generator that accepts a context-free grammar (CFG) annotated with rules for interpreting the component "parts of speech", as it were. A CFG is able to express things like balanced parentheses, which a regular expression cannot (unless it's been extended with CF features, making it not strictly "regular" in the mathematical sense). But a CFG with rules is very nice because you can attach a semantic meaning to the phrase structure of your language. BISON is a common tool for this part in C.

But I actually told you a little lie. You see, every real programming language has parts that cannot be expressed within a context-free framework. For example, you need to connect the definition of a variable with the use of it so that you know what instructions to generate, and also if an operation on it is legal. That's typically considered outside the scope of parsing, but there are such things as "attribute grammars" which are like CFGs extended with features that can make even these context-dependencies much easier to code up and work with.

Now, there's no rule that says you HAVE to use such tools. Many simple grammars are easy enough to process with hand-written tools. For example, LISP's S-expressions can be simply scanned as:

If it starts with a digit, read a number. If it starts with a letter, read a symbol. If it's a space, skip it. If it's an open-paren, then skip it, recurse this routine for a value, and expect a close paren.
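
That routine can be sketched in a few lines of Python. This is an illustration under simplifying assumptions: numbers are unsigned integers, symbols are alphanumeric, and a parenthesised form may hold any number of values (a small extension of the one-value description above). The function name read is arbitrary.

```python
def read(text, pos=0):
    """Read one value starting at `pos`; return (value, next_position)."""
    while pos < len(text) and text[pos].isspace():
        pos += 1                      # a space: skip it
    if pos >= len(text):
        raise ValueError("unexpected end of input")
    ch = text[pos]
    if ch.isdigit():                  # starts with a digit: read a number
        end = pos
        while end < len(text) and text[end].isdigit():
            end += 1
        return int(text[pos:end]), end
    if ch.isalpha():                  # starts with a letter: read a symbol
        end = pos
        while end < len(text) and text[end].isalnum():
            end += 1
        return text[pos:end], end
    if ch == "(":                     # open paren: recurse until the close paren
        pos += 1
        items = []
        while True:
            while pos < len(text) and text[pos].isspace():
                pos += 1
            if pos >= len(text):
                raise ValueError("missing close paren")
            if text[pos] == ")":
                return items, pos + 1
            value, pos = read(text, pos)
            items.append(value)
    raise ValueError(f"unexpected character {ch!r}")
```

For instance, read("(add 1 (mul 2 3))") returns the nested list ['add', 1, ['mul', 2, 3]] together with the position just past the final paren.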

Well, there are a few more complications for strings and what-have-you, but that's the basic idea. Parsing FORTH is even simpler, because it doesn't build a recursive data structure.

Anyway, that should get you going on whatever your project is.

Regex: Determine if two regular expressions could match for the same input?

Asked on Wed, 04 Aug 2010 by Tom regex
9 votes

I want to find out if there could ever be conflicts between two known regular expressions, in order to allow the user to construct a list of mutually exclusive regular expressions.

For example, we know that the regular expressions below are quite different but they both match xy12:

'^xy1\d'
'[^\d]\d2$'
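
A quick way to confirm that a particular string is a conflict is to test it against both patterns. This Python sketch checks the single candidate xy12; it only verifies one string and does not decide the general question of whether two patterns can ever overlap.

```python
import re

first = re.compile(r'^xy1\d')
second = re.compile(r'[^\d]\d2$')

candidate = "xy12"
# search() is used because neither pattern is anchored at both ends.
both_match = bool(first.search(candidate)) and bool(second.search(candidate))
```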

Is it possible to determine, using a computer algorithm, if two regular expressions can have such a conflict? How?

There's no halting problem involved here. All you need is to compute whether the intersection of ^xy1\d and [^\d]\d2$ is non-empty.

I can't give you an algorithm here, but here are two discussions of a method to generate the intersection without resorting to the construction of a DFA:

And then there's RAGEL

which can compute the intersection of regular expressions too.

UPDATE: I just tried out Ragel with OP's regexp. Ragel can generate a "dot" file for graphviz from the resulting state machine, which is terrific. The intersection of the OP's regexp looks like this in Ragel syntax:

('xy1' digit any*) & (any* ^digit digit '2') 

and has the following state machine:

[image: state machine diagram]

While the empty intersection:

('xy1' digit any*) & ('q' any* ^digit digit '2')

looks like this:

[image: state machine diagram]

So if all else fails, then you can still have Ragel compute the intersection and check if it outputs the empty state machine, by comparing the generated "dot" file.

Regex Group in Perl: how to capture elements into array from regex group that matches unknown number of/multiple/variable occurrences from a string?

9 votes

In Perl, how can I use one regex grouping to capture more than one occurrence that matches it, into several array elements?

For example, for a string:

var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello

to process this with code:

   $string = "var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";

   my @array = $string =~ <regular expression here>

   for ( my $i = 0; $i < scalar( @array ); $i++ )
   {
     print $i.": ".$array[$i]."\n";
   }

I would like to see as output:

0: var1=100
1: var2=90
2: var5=hello
3: var3="a, b, c"
4: var7=test
5: var3=hello

What would I use as a regex?

The commonality between things I want to match here is an assignment string pattern, so something like:

my @array = $string =~ m/(\w+=[\w\"\,\s]+)*/;

Where the * indicates zero or more occurrences matching the group.

(I discounted using a split() as some matches contain spaces within themselves (i.e. var3...) and would therefore not give desired results.)

With the above regex, I only get:

0: var1=100 var2

Is it possible in a regex? Or is additional code required?

Looked at existing answers already, when searching for "perl regex multiple group" but not enough clues:

my $string = "var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";

while($string =~ /(?:^|\s+)(\S+)\s*=\s*("[^"]*"|\S*)/g) {
        print "<$1> => <$2>\n";
}

Prints:

<var1> => <100>
<var2> => <90>
<var5> => <hello>
<var3> => <"a, b, c">
<var7> => <test>
<var3> => <hello>

Explanation:

Last piece first: the g flag at the end means that you can apply the regex to the string multiple times. The second time it will continue matching where the last match ended in the string.

Now for the regex: (?:^|\s+) matches either the beginning of the string or a group of one or more spaces. This is needed so that when the regex is applied the next time, we skip the spaces between the key/value pairs. The ?: means that the content of the parentheses won't be captured as a group (we don't need the spaces, only key and value). \S+ matches the variable name. Then we skip any amount of spaces and an equal sign in between. Finally, ("[^"]*"|\S*) matches either two quotes with any amount of characters in between, or any amount of non-space characters for the value. Note that the quote matching is pretty fragile and won't handle escaped quotes properly, e.g. "\"quoted\"" would result in "\".

EDIT:

Since you really want to get the whole assignment, and not the single keys/values, here's a one-liner that extracts those:

my @list = $string =~ /(?:^|\s+)((?:\S+)\s*=\s*(?:"[^"]*"|\S*))/g;
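
For readers who want to try the pattern outside Perl, the one-liner carries over to Python more or less unchanged, since re.findall plays the role of a list-context match with the g flag. This is a sketch for comparison only; the variable names are arbitrary.

```python
import re

string = 'var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello'

# With exactly one capturing group, findall returns that group for every match.
assignments = re.findall(r'(?:^|\s+)(\S+\s*=\s*(?:"[^"]*"|\S*))', string)

for i, item in enumerate(assignments):
    print(f"{i}: {item}")
```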

Non capturing group?

8 votes

After reading some tutorials I still don't get it.

Could someone explain how ?: is used and what it's good for?

Let me try to explain this with an example:

Consider the following text:

http://stackoverflow.com/
http://stackoverflow.com/questions/tagged/regex

Now, if I apply the regex below over it...

(http|ftp)://([^/\r\n]+)(/[^\r\n]*)?

... I would get the following result:

Match "http://stackoverflow.com/"
     Group 1: "http"
     Group 2: "stackoverflow.com"
     Group 3: "/"

Match "http://stackoverflow.com/questions/tagged/regex"
     Group 1: "http"
     Group 2: "stackoverflow.com"
     Group 3: "/questions/tagged/regex"

But I don't care about the protocol. I just want the host and path of the URL. So, I change the regex to include the non-capturing group (?:).

(?:http|ftp)://([^/\r\n]+)(/[^\r\n]*)?

Now, my result looks like this:

Match "http://stackoverflow.com/"
     Group 1: "stackoverflow.com"
     Group 2: "/"

Match "http://stackoverflow.com/questions/tagged/regex"
     Group 1: "stackoverflow.com"
     Group 2: "/questions/tagged/regex"

See? The first group has not been captured. The parser uses it to match the text, but ignores it later, in the final result.
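
The same comparison can be reproduced in Python, whose re module uses the same (?:...) syntax. This is a small sketch using the second URL from above; the variable names are arbitrary.

```python
import re

url = "http://stackoverflow.com/questions/tagged/regex"

capturing = re.match(r"(http|ftp)://([^/\r\n]+)(/[^\r\n]*)?", url)
non_capturing = re.match(r"(?:http|ftp)://([^/\r\n]+)(/[^\r\n]*)?", url)

print(capturing.groups())      # protocol, host and path
print(non_capturing.groups())  # host and path only
```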

Hope it helps. Sorry for my lousy English.


EDIT:

As requested, let me try to explain groups too.

Well, groups serve many purposes. They help you extract exact information from a bigger match (which can also be named), let you rematch a previously matched group, and can be used for substitutions. Let's try some examples, shall we?

Ok, imagine you have some kind of XML or HTML (be aware that regex may not be the best tool for the job, but it is nice as an example). You want to parse the tags, so you could use something like this (I have added spaces to make it easier to understand):

   \<(?<TAG>.+?)\> [^<]*? \</\k<TAG>\>
or
   \<(.+?)\> [^<]*? \</\1\>

The first regex has a named group (TAG), while the second one uses a common group. Both regexes do the same thing: they use the value from the first group (the name of the tag) to match the closing tag. The difference is that the first one uses the name to refer to the value, and the second one uses the group index (which starts at 1).

Let's try some substitutions now. Consider the following text:

Lorem ipsum dolor sit amet consectetuer feugiat fames malesuada pretium egestas.

Now, let's use this dumb regex over it:

\b(\S)(\S)(\S)(\S*)\b

This regex matches words with at least 3 characters, and uses groups to separate the first three letters. The result is this:

Match "Lorem"
     Group 1: "L"
     Group 2: "o"
     Group 3: "r"
     Group 4: "em"
Match "ipsum"
     Group 1: "i"
     Group 2: "p"
     Group 3: "s"
     Group 4: "um"
...

Match "consectetuer"
     Group 1: "c"
     Group 2: "o"
     Group 3: "n"
     Group 4: "sectetuer"
...

So, if we apply the substitution string...

$1_$3$2_$4

... over it, we are trying to use the first group, add an underscore, use the third group, then the second group, add another underscore, and then the fourth group. The resulting string would be like the one below.

L_ro_em i_sp_um d_lo_or s_ti_ a_em_t c_no_sectetuer f_ue_giat f_ma_es m_la_esuada p_er_tium e_eg_stas.
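
To try this substitution in Python, note that its replacement strings write group references as \1 rather than $1. The regex itself is unchanged in this sketch.

```python
import re

text = ("Lorem ipsum dolor sit amet consectetuer feugiat fames "
        "malesuada pretium egestas.")

# \1_\3\2_\4 is the Python spelling of $1_$3$2_$4.
result = re.sub(r"\b(\S)(\S)(\S)(\S*)\b", r"\1_\3\2_\4", text)
print(result)
```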

You can use named groups for substitutions too, using ${name}.

To play around with regexes, I recommend Rad Software Regular Expression Designer, which has a nice "Language Elements" tab with quick access to some basic instructions. It's based on .NET's regex engine.

Hope I've helped.

Why does my JavaScript regex not work?

8 votes

I don't understand why, but this code gives me a JavaScript error:

<script type="text/javascript">

String.prototype.format = function(values) {
    var result = this;
    for (var i = 0, len = values.length; i < len; i++) {
        result = result.replace(new RegExp("{" + i + "}", "g"), values[i]);
    }
    return result;
};

alert("Hi {0}, I'm {1}. Are you, {0}?".format(["Chris", "swell"]));

</script>

Error

Exception thrown: invalid quantifier

What's wrong with it?

I believe you have to escape the { and }.

String.prototype.format = function(values) {
    var result = this;
    for (var i = 0, len = values.length; i < len; i++) {
        result = result.replace(new RegExp("\\{" + i + "\\}", "g"), values[i]);
    }
    return result;
};
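
Escaping by hand works when you know which characters are special. When a pattern is assembled from literal text, a quoting helper avoids having to remember the list. As an illustration in Python (not the asker's JavaScript), re.escape does this; format_indexed is a made-up name for this sketch.

```python
import re

def format_indexed(template, values):
    """Replace each {0}, {1}, ... in `template` with the matching value."""
    result = template
    for i, value in enumerate(values):
        placeholder = re.escape("{%d}" % i)   # quotes { and } for the regex engine
        # A function replacement keeps backslashes in `value` from being interpreted.
        result = re.sub(placeholder, lambda m, v=value: v, result)
    return result

print(format_indexed("Hi {0}, I'm {1}. Are you, {0}?", ["Chris", "swell"]))
```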

Good free tools for learning/testing Regular Expressions

7 votes

Are there any free tools/resources for testing / learning regular expressions, like RegexBuddy?

Regulator

Expresso

RegexDesigner.NET

Regex-Coach

larsolavtorvik online tool

Regex Pal

Regular Expression Workbench

Rubular

Reggy

RegExr

How to get the last part of a string?

6 votes

Given this string:

http://s.opencalais.com/1/pred/BusinessRelationType

I want to get the last part of it: "BusinessRelationType"

I have been thinking about reversing the whole string then looking for the first "/", take everything to the left of that and reverse that. However, I'm hoping there is a better/more concise method. Thoughts?

Thanks, Paul

one-liner with Linq:

string lastPart = text.Split('/').Last();

Regex headache...

6 votes

I want to validate some C# source code for a scripting engine. I want to make sure that only System.Math class members may be referenced. I am trying to create a regular expression that will match a dot, followed by a capital letter, followed by any number of word characters, ending at a word boundary, that is NOT preceded by System.Math.

I started with this:

(?<!Math)\.[A-Z]+[\w]*

Which works fine for:

return Math.Max(466.89/83.449 * 5.5);  // won’t flag this
return Xath.Max(466.89/83.449 * 5.5);  // will flag this

It correctly matches .Max when it is not preceded by Math. However, now that I'm trying to expand the regular expression to include System, I can't get it to work.

I've tried these permutations of the regular expression and more:

((?<!System\.Math)\.[A-Z]+[\w]*)
((?<!(?<!System)\.Math)\.[A-Z]+[\w]*)
((?<!System)\.(?<!Math)\.[A-Z]+[\w]*)
((?<!System)|(?<!Math)\.[A-Z]+[\w]*)
((?<!System\.Math)|(?<!Math)\.[A-Z]+[\w]*)

Using these statements:

return System.Math.Max(466.89/83.449 * 5.5);
return System.Xath.Max(466.89/83.449 * 5.5);
return Xystem.Math.Max(466.89/83.449 * 5.5);

I've tried everything that I could think of, but it either ALWAYS matches the second element (.Math or .Xath above) or it DOESN'T match ANYTHING.

If anyone would have mercy on me and point out what I'm doing wrong, I would greatly appreciate it.

Thanks in advance, Welton

The trick is to make sure you never start matching a member name anywhere but at the beginning. Then it's a simple matter of using a lookahead to find out if whatever you're looking at starts with System.Math.. Try this regex:

(?<![\w.])(?!(?:System\.)?Math\.)(?:[A-Z]\w*\.)+[A-Z]\w*\b

The lookbehind ensures that the match doesn't start in the middle of a word (\w) or the middle of a qualified member name (.). Now, if the lookahead fails, the regex can't just jump to the beginning of the next component (e.g., the Math. in System.Math.) and try again. It's all or nothing.

However, this will match Math.Max if it's not preceded by System.. Do you really need that, or was that just an intermediate step in developing a regex for the full name?

EDIT: I went ahead and made the System. part optional.
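
Here is a quick way to exercise the final regex (with the optional System. part) against the question's test statements. It is sketched in Python rather than C#, on the assumption that the two engines agree on the constructs used here; flagged is a made-up helper name.

```python
import re

# The answer's pattern, unchanged.
DISALLOWED = re.compile(
    r"(?<![\w.])(?!(?:System\.)?Math\.)(?:[A-Z]\w*\.)+[A-Z]\w*\b"
)

def flagged(line):
    """Return every qualified name in `line` that is not a Math member."""
    return DISALLOWED.findall(line)

for statement in [
    "return System.Math.Max(466.89/83.449 * 5.5);",
    "return System.Xath.Max(466.89/83.449 * 5.5);",
    "return Xystem.Math.Max(466.89/83.449 * 5.5);",
]:
    print(statement, "->", flagged(statement))
```

Only the second and third statements are flagged, as System.Xath.Max and Xystem.Math.Max respectively.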

Find and kill a process in one line using bash and regex.

6 votes

I often need to kill a process during programming.

The way I do it now is:

[~]$ ps aux | grep 'python csp_build.py'
user    5124  1.0  0.3 214588 13852 pts/4    Sl+  11:19   0:00 python csp_build.py
user    5373  0.0  0.0   8096   960 pts/6    S+   11:20   0:00 grep python csp_build.py
[~]$ kill 5124

How can I extract the process id automatically and kill it in the same line?

Like this:

[~]$ ps aux | grep 'python csp_build.py' | kill <regex that returns the pid>

In bash, you should be able to do:

kill $(ps aux | grep '[p]ython csp_build.py' | awk '{print $2}')

Details on its workings are as follows:

  • The ps gives you the list of all the processes.
  • The grep filters that based on your search string, [p] is a trick to stop you picking up the actual grep process itself.
  • The awk just gives you the second field of each line, which is the PID.
  • The $(x) construct means to execute x then take its output and put it on the command line. The output of that ps pipeline inside that construct above is the list of process IDs so you end up with a command like kill 1234 1122 7654.

Here's a transcript showing it in action:

pax> sleep 3600 &
[1] 2225
pax> sleep 3600 &
[2] 2226
pax> sleep 3600 &
[3] 2227
pax> sleep 3600 &
[4] 2228
pax> sleep 3600 &
[5] 2229
pax> kill $(ps aux | grep '[s]leep' | awk '{print $2}')
[5]+  Terminated              sleep 3600
[1]   Terminated              sleep 3600
[2]   Terminated              sleep 3600
[3]-  Terminated              sleep 3600
[4]+  Terminated              sleep 3600
pax> _

and you can see it terminating all the sleepers.


Explaining the grep '[p]ython csp_build.py' bit in a bit more detail:

When you do sleep 3600 & followed by ps -ef | grep sleep, you tend to get two processes with sleep in it, the sleep 3600 and the grep sleep (because they both have sleep in them, that's not rocket science).

However, ps -ef | grep '[s]leep' won't create a process with sleep in it; it instead creates grep '[s]leep', and here's the tricky bit: the grep doesn't find it because it's looking for the regular expression "any character from the character class [s] (which is s) followed by leep".

In other words, it's looking for sleep but the grep process is grep '[s]leep' which doesn't have sleep in it.
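
The effect is easy to reproduce without ps at all. In this Python sketch the two strings stand in for the command lines that ps would list; they are illustrative values, not real ps output.

```python
import re

# Stand-ins for the command column of the ps output.
command_lines = ["sleep 3600", "grep [s]leep"]

# The pattern [s]leep matches the text "sleep", which the grep command line
# itself does not contain: there, 's' is followed by ']'.
matching = [line for line in command_lines if re.search(r"[s]leep", line)]
print(matching)
```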

When I was shown this (by someone here on SO), I immediately started using it because

  • it's one less process than adding | grep -v grep; and
  • it's elegant and sneaky, a rare combination :-)

Java Regular Expression

6 votes
{
Main Block
     {
     Nested Block
     }
}
{
Main Block 
     {
     Nested Block
     }
     {
     Nested Block
     }
}

I want to get data within Main Blocks including its Nested Blocks with Java Regex. Is it possible?

Thanks in Advance

IF there can only be at most 1 level of nesting, and the braces characters can not be escaped, then in fact the regex pattern for this is quite simple.

Essentially the structure we have, in some abstract notation, is:

{…(?:{…}…)*…}

Here's a visual breakdown:

  ___top___
 /   nest  \
/    / \    \
{…(?:{…}…)*…}
| \______/| |
|         | |
open      | close
          |
     zero or more

This is not quite regex, of course, because:

  • In "real" regex, we must escape the { and }, since they're metacharacters
  • In "real" regex, we need to replace … with the actual pattern for content
    • [^{}]*+ would be a fine pattern. The […] is a character class. [^…] is a negated character class. The * is zero-or-more repetition. The + following the repetition specifier is the possessive quantifier.

So, a meta-regexing technique is used to programmatically transform this abstract pattern (which is readable) into a valid regex pattern (which can be ugly at times like this). Here's an example (also see on ideone.com):

    import java.util.*;
    import java.util.regex.*;
    //...

    Pattern block = Pattern.compile(
        "{…(?:{…}…)*…}"
            .replaceAll("[{}]", "\\\\$0")
            .replace("…", "[^{}]*+")
    );
    System.out.println(block.pattern());
    // \{[^{}]*+(?:\{[^{}]*+\}[^{}]*+)*[^{}]*+\}

    String text
        = "{ main1 { sub1a } { sub1b } { sub1c } }\n"
        + "{ main2\n"
        + "   { sub2a }\n"
        + "       { sub2c }\n"
        + "}"
        + "   { last one, promise }    ";

    Matcher m = block.matcher(text);
    while (m.find()) {
        System.out.printf(">>> %s <<<%n", m.group());
    }
    // >>> { main1 { sub1a } { sub1b } { sub1c } } <<<
    // >>> { main2
    //    { sub2a }
    //        { sub2c }
    // } <<<
    // >>> { last one, promise } <<<        

As you can see, the actual regex pattern is therefore:

\{[^{}]*+(?:\{[^{}]*+\}[^{}]*+)*[^{}]*+\}

Which, as a Java string literal, is:

"\\{[^{}]*+(?:\\{[^{}]*+\\}[^{}]*+)*[^{}]*+\\}"

Variations

If the nesting level can be deeper (but is still bounded by a fixed maximum), then regex can still be used. You can also allow the { and } to be "escaped" (i.e. used in the content part but not as block delimiter).

The final regex pattern will be quite complicated, but depending on how comfortable you are with meta-regexing (which requires you to be comfortable with regex itself), the code can be quite readable and manageable.

If the nesting level can be arbitrarily deep, then some flavors (e.g. .NET or Perl) can still handle it, but Java regex is not powerful enough to handle it.
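
The meta-regexing step for the one-level case is not specific to Java. As a rough Python sketch of the same transformation (an illustration, not the code above): it substitutes a plain greedy [^{}]* for the possessive [^{}]*+, because older Python versions do not support possessive quantifiers, and the variable names are arbitrary.

```python
import re

abstract = "{…(?:{…}…)*…}"

# Step 1: escape the braces. Step 2: expand … into the content pattern.
concrete = re.sub(r"[{}]", lambda m: "\\" + m.group(0), abstract)
concrete = concrete.replace("…", "[^{}]*")

block = re.compile(concrete)
text = "{ main1 { sub1a } { sub1b } } { last }"
found = block.findall(text)
print(concrete)
print(found)
```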

Can I shorten this regular expression?

6 votes

I have the need to check whether strings adhere to a particular ID format.

The format of the ID is as follows:

aBcDe-fghIj-KLmno-pQRsT-uVWxy

A sequence of five blocks of five letters upper case or lower case, separated by one dash.

I have the following regular expression that works:

string idFormat = "[a-zA-Z]{5}[-]{1}[a-zA-Z]{5}[-]{1}[a-zA-Z]{5}[-]{1}[a-zA-Z]{5}[-]{1}[a-zA-Z]{5}";

Note that there is no trailing dash, but all of the blocks within the ID follow the same format. Therefore, I would like to be able to represent this sequence of four blocks with a trailing dash inside the regular expression and avoid the duplication.

I tried the following, but it doesn't work:

string idFormat = "[[a-zA-Z]{5}[-]{1}]{4}[a-zA-Z]{5}";

How do I shorten this regular expression and get rid of the duplicated parts?

What is the best way to ensure that each block does also not contain any numbers?


Edit:

Thanks for the replies, I now understand the grouping in regular expressions.

I'm running a few tests against the regular expression, the following are relevant:

Test 1: aBcDe-fghIj-KLmno-pQRsT-uVWxy
Test 2: abcde-fghij-klmno-pqrst-uvwxy

With the following regular expression, both tests pass:

^([a-zA-Z]{5}-){4}[a-zA-Z]{5}$

With the next regular expression, test 1 fails:

^([a-z]{5}-){4}[a-z]{5}$

Several answers have said that it is OK to omit the A-Z when using a-z, but in this case it doesn't seem to be working.

If you can set regex options to be case insensitive, you could replace all [a-zA-Z] with just plain [a-z]. Furthermore, [-]{1} can be written as -.

Your grouping should be done with (, ), not with [, ] (although you're correctly using the latter in specifying character sets).

Depending on context, you probably want to throw in ^...$ which matches start and end of string, respectively, to verify that the entire string is a match (i.e. that there are no extra characters).

In JavaScript, something like this:

/^([a-z]{5}-){4}[a-z]{5}$/i
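
This also explains the edit in the question: [a-z] alone only covers upper case when the case-insensitive option is actually switched on. Here is a sketch of the same check in Python, where the flag is re.IGNORECASE and fullmatch stands in for the ^...$ anchors; is_valid_id is a made-up name.

```python
import re

# Four "five letters plus dash" blocks, then a final block of five letters.
ID_FORMAT = re.compile(r"([a-z]{5}-){4}[a-z]{5}", re.IGNORECASE)

def is_valid_id(candidate):
    """True if the whole string has the five-blocks-of-five-letters shape."""
    return ID_FORMAT.fullmatch(candidate) is not None

print(is_valid_id("aBcDe-fghIj-KLmno-pQRsT-uVWxy"))
print(is_valid_id("abcde-fghij-klmno-pqrst-uvwxy"))
```

Since digits are not in [a-z], a block containing a number is rejected without any extra work.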

How can I exclude some characters from a class?

6 votes

Say I want to match a "word" character (\w), but exclude "_", or match a whitespace character (\s), but exclude "\t". How can I do this?

Use a negated class including \W or \S.

/[^\W_]/  # anything that's not a non-word character and not _
/[^\S\t]/ # anything that's not a non-space character and not \t
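
The double negative is easy to check with a couple of sample strings. This is a small sketch in Python, whose re module treats \W and \S inside a character class the same way; the sample strings are arbitrary.

```python
import re

# Word characters except the underscore.
word_no_underscore = re.findall(r"[^\W_]", "a_b c1")

# Whitespace characters except the tab.
space_no_tab = re.findall(r"[^\S\t]", "x \ty\n")

print(word_no_underscore)
print(space_no_tab)
```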

How do I make an arbitrary Perl regex wholly non-capturing? (Answer: You Can't)

6 votes

How can I remove capturing from arbitrarily nested sub-groups in a Perl regex string? I'd like to nest any regex into an enveloping expression that captures the sub-regex as a whole entity as well as statically known subsequent groups. Do I need to transform the regex string manually into using all non-capturing (?:) groups (and hope I don't mess up), or is there a Perl regex or library mechanism that provides this?

# How do I 'flatten' $regex to protect $2 and $3?
# Searching 'ABCfooDE' for 'foo' OK, but '((B|(C))fo(o)?(?:D|d)?)', etc., breaks.
# I.E., how would I turn it effectively into '(?:(?:B|(?:C))fo(?:o)?(?:D|d)?)'?
sub check {
  my($line, $regex) = @_;
  if ($line =~ /(^.*)($regex)(.*$)/) {
    print "<", $1, "><", $2, "><", $3, ">\n";
  }
}

Addendum: I am vaguely aware of $&, $`, and $' and have been advised to avoid them if possible, and I don't have access to ${^PREMATCH}, ${^MATCH} and ${^POSTMATCH} in my Perl 5.8 environment. The example above can be partitioned into 2/3 chunks using methods like these, and more complex real cases could manually iterate this, but I think I'd like a general solution if possible.

Accepted Answer: What I wish existed and surprisingly (to me at least) does not, is an encapsulating group that makes its contents opaque, such that subsequent positional backreferences see the contents as a single entity and named references are de-scoped. gbacon has a potentially useful workaround for Perl 5.10+, and FM shows a manual iterative mechanism for any version that can accomplish the same effect in specific cases, but j_random_hacker correctly points out that there is no real language mechanism to encapsulate subexpressions.

In general, you can't.

Even if you could transform all (...)s into (?:...)s, this would not work in the general case because the pattern might require backreferences: e.g. /(.)X\1/, which matches any character, followed by an X, followed by the originally matched character.

So, absent a Perl mechanism for discarding captured results "after the fact", there is no way to solve your problem for all regexes. The best you can do (or could do if you had Perl 5.10) is to use gbacon's suggestion and hope to generate a unique name for the capture buffer.
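The named-capture idea can be sketched in Python, whose re module has named groups comparable to those in Perl 5.10. This is only an illustration under assumptions: the function name split_around and the group names pre, hit and post are made up here, and the names are assumed not to collide with any names used inside the nested regex.

```python
import re

def split_around(line, inner):
    """Return (before, match, after) for `inner`, however many groups it has."""
    # The outer groups are read back by name, so groups inside `inner`
    # can shift the numbering without affecting how we read the result.
    # The names pre/hit/post are arbitrary and assumed absent from `inner`.
    wrapped = r"(?P<pre>^.*)(?P<hit>" + inner + r")(?P<post>.*$)"
    m = re.search(wrapped, line)
    if m is None:
        return None
    return m.group("pre"), m.group("hit"), m.group("post")

print(split_around("ABCfooDE", "foo"))
# ('ABC', 'foo', 'DE')
print(split_around("ABCfooDE", "((B|(C))fo(o)?(?:D|d)?)"))
# ('AB', 'CfooD', 'E')
```

This only protects the outer groups. A numbered backreference inside the nested regex would still be renumbered by the wrapper, which is the general limitation described above.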

How do you understand regular expressions that are written in one line?

6 votes

This is a neat, well-documented regular expression that is easy to understand, maintain and modify.

    text = text.replace(/
    (                               // Wrap whole match in $1
        (
            ^[ \t]*>[ \t]?          // '>' at the start of a line
            .+\n                    // rest of the first line
            (.+\n)*                 // subsequent consecutive lines
            \n*                     // blanks
        )+
    )
    /gm,

But how do you go about working with these?

text = text.replace(/((^[ \t]*>[ \t]?.+\n(.+\n)*\n*)+)/gm,

Is there a beautifier of some sort that makes sense of it and describes its functionality?

RegexBuddy will "translate" any regex for you. When fed your example regex, it outputs:

((^[ \t]*>[ \t]?.+\n(.+\n)*\n*)+)

Options: ^ and $ match at line breaks

Match the regular expression below and capture its match into backreference number 1 «((^[ \t]*>[ \t]?.+\n(.+\n)*\n*)+)»
   Match the regular expression below and capture its match into backreference number 2 «(^[ \t]*>[ \t]?.+\n(.+\n)*\n*)+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
      Note: You repeated the capturing group itself.  The group will capture only the last iteration.  
          Put a capturing group around the repeated group to capture all iterations. «+»
      Assert position at the beginning of a line (at beginning of the string or after a line break character) «^»
      Match a single character present in the list below «[ \t]*»
         Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
         The character “ ” « »
         A tab character «\t»
      Match the character “>” literally «>»
      Match a single character present in the list below «[ \t]?»
         Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
         The character “ ” « »
         A tab character «\t»
      Match any single character that is not a line break character «.+»
         Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
      Match a line feed character «\n»
      Match the regular expression below and capture its match into backreference number 3 «(.+\n)*»
         Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
         Note: You repeated the capturing group itself.  The group will capture only the last iteration.  
             Put a capturing group around the repeated group to capture all iterations. «*»
         Match any single character that is not a line break character «.+»
            Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
         Match a line feed character «\n»
      Match a line feed character «\n*»
         Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»

This does look rather intimidating in text form, but it's much more readable in HTML form (which can't be reproduced here) or in RegexBuddy itself. It also points out common gotchas (such as repeated capturing groups, which are probably not wanted here).
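Some regex engines also accept the commented layout directly. The following is a sketch in Python, where the re.VERBOSE flag ignores whitespace outside character classes and allows # comments. The sample text is invented for illustration. It also shows the repeated-group gotcha that RegexBuddy flags.

```python
import re

ONE_LINE = re.compile(r"((^[ \t]*>[ \t]?.+\n(.+\n)*\n*)+)", re.MULTILINE)

COMMENTED = re.compile(r"""
    (                       # wrap whole match in group 1
        (
            ^[ \t]*>[ \t]?  # '>' at the start of a line
            .+\n            # rest of the first line
            (.+\n)*         # subsequent consecutive lines
            \n*             # blanks
        )+
    )
""", re.MULTILINE | re.VERBOSE)

# Invented sample: two quoted blocks separated by a blank line.
text = "intro\n\n> quoted one\ncontinued\n\n> quoted two\n\nafter\n"

a = ONE_LINE.search(text)
b = COMMENTED.search(text)

print(a.groups() == b.groups())  # True: both spellings behave the same
print(repr(a.group(1)))  # '> quoted one\ncontinued\n\n> quoted two\n\n'
print(repr(a.group(2)))  # '> quoted two\n\n' (only the last iteration)
```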

String manipulation vs Regexps

6 votes

We are often told that Regexps are slow and should be avoided whenever possible.

However, taking into account the overhead of doing the string manipulation oneself (algorithmic mistakes aside, which are a different matter), especially in PHP or Perl (maybe Java), where is the limit? In which cases can we consider string manipulation to be a better alternative? Which regexps are particularly CPU-greedy?

For instance, for the following, in C++, Java, PHP or Perl, what would you recommend?

The regexps would probably be faster:

  • s/abc/def/g or a ... while (($i = index($x, "abc")) >= 0) ... $y .= substr() ... based solution?
  • s/(\d)+/N/g or a scanning algorithm

But what about

  • an email validation regexp?
  • s/((0|\w)+?[xy]*[^xy]){2,7}/u/g

wouldn't a handmade and specific algorithm be faster (while longer to write)?

edit

The point of the question is to determine which kinds of regexp would be better rewritten, for a given problem, as specific string manipulation.

edit2

A common implementation is Perl's regexps. In Perl, for instance (and this requires knowing how they are implemented), what kind of regexp should be avoided because the implementation makes matching lengthy and inefficient? It may not even be a complex regexp...

A nice feature of manipulating text with regular expressions is that patterns are high-level and declarative. This leaves the implementation considerable room for optimization such as factoring out the longest common prefix or using Boyer-Moore for static strings. Concise notation makes for quicker reading by experts. I understand immediately what

if (s/^(.)//) {
  ...
}

is doing, and substr($_, 0, 1) = "" looks noisy in comparison.

Rather than the lower bound, the important consideration for regular expressions is the upper bound. It's a powerful tool, so people believe it's capable of correctly extracting tokens from XML, email addresses, or C++ programs and don't realize that an even more powerful tool such as a parser is necessary.
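The question's first comparison (s/abc/def/g against an index/substr loop) can be sketched as follows. Python is used here only for illustration, and the function names are made up. No timing is claimed, since which version is faster depends on the implementation and the input.

```python
import re

def replace_with_regex(text):
    # Counterpart of s/abc/def/g.
    return re.sub(r"abc", "def", text)

def replace_by_scanning(text, old="abc", new="def"):
    # Hand-written loop: find each occurrence and copy the pieces across.
    pieces = []
    start = 0
    while True:
        i = text.find(old, start)
        if i < 0:
            break
        pieces.append(text[start:i])
        pieces.append(new)
        start = i + len(old)
    pieces.append(text[start:])
    return "".join(pieces)

sample = "xxabcabcyyabzabc"
print(replace_with_regex(sample))   # xxdefdefyyabzdef
print(replace_by_scanning(sample))  # xxdefdefyyabzdef
```

The one-line regex states the intent. The loop spells out bookkeeping that the regex engine is free to optimize on its own.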

Closing open XML tags with regex

5 votes

Basically I want to do the same as here, which is done in Python. I'd like to replace all self-closed elements with the long syntax.

Example

    <iframe src="http://example.com/thing"/>

becomes

    <iframe src="http://example.com/thing"></iframe>

Full example:

 <html>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <link rel="stylesheet" type="text/css" href="/sample.css">
  <title></title>
  <script type="text/javascript" src="/swfobject.js">
                //void
          </script>
  <script type="text/javascript" language="JavaScript" src="/generate.js">
//void
  </script>
  <script type="text/javascript" language="JavaScript" src="/prototype.js">
//void
  </script>
</head>
<body id="mediaPlayer" style="margin:0;padding:0;">
<script type="text/javascript">
                                swfobject.registerObject('id_G12564763');       


                function getFlashObject() {
                        var object;
                        if (navigator.appName == 'Microsoft Internet Explorer' || navigator.userAgent.indexOf("Chrome")!=-1)
                        {
                                object = document.getElementById('id_G12564763');
                        } 
                        else 
                        {
                                object = document['flash_id_G12564763'];
                        }
                        return object;
                }

        </script>
</body>
</html>

OK guys, I found a workaround. I hooked the output method to xml where this html comes from and the XSLT engine takes care of closing those open tags for me. Thanks for the answers, but if you happen to have a solution for the problem, please leave your answer and I will mark it as an answer. This could be useful for others.
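For completeness, a regex-only approach might look like the sketch below (Python, with a made-up function name). It rests on simplifying assumptions: tag names consist of word characters only, attribute values never contain "<" or ">", and comments and CDATA sections are not handled. It also expands every self-closed element, including ones such as br.

```python
import re

# Tag name, then attributes containing neither "<" nor ">", then "/>".
SELF_CLOSED = re.compile(r"<(\w+)([^<>]*?)\s*/>")

def expand_self_closed(markup):
    # Rewrite <name attrs/> as <name attrs></name>.
    return SELF_CLOSED.sub(r"<\1\2></\1>", markup)

print(expand_self_closed('<iframe src="http://example.com/thing"/>'))
# <iframe src="http://example.com/thing"></iframe>
print(expand_self_closed('<title></title><br />'))
# <title></title><br></br>
```

Anything beyond these assumptions calls for a real XML or HTML parser, or the XSLT route described above.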

Remove accents without using iconv

5 votes

What is the best way to remove accents, e.g.

"ÈâuÑ" becomes "EauN"

without using iconv?

Complete working code. I know this is long, but it's a sure-shot way used by WordPress.

<?php

function seems_utf8($str) 
{
    $length = strlen($str);
    for ($i=0; $i < $length; $i++) {
        $c = ord($str[$i]);
        if ($c < 0x80) $n = 0; # 0bbbbbbb
        elseif (($c & 0xE0) == 0xC0) $n=1; # 110bbbbb
        elseif (($c & 0xF0) == 0xE0) $n=2; # 1110bbbb
        elseif (($c & 0xF8) == 0xF0) $n=3; # 11110bbb
        elseif (($c & 0xFC) == 0xF8) $n=4; # 111110bb
        elseif (($c & 0xFE) == 0xFC) $n=5; # 1111110b
        else return false; # Does not match any model
        for ($j=0; $j<$n; $j++) { # n bytes matching 10bbbbbb follow ?
            if ((++$i == $length) || ((ord($str[$i]) & 0xC0) != 0x80))
                return false;
        }
    }
    return true;
}

/**
 * Converts all accent characters to ASCII characters.
 *
 * If there are no accent characters, then the string given is just returned.
 *
 * @param string $string Text that might have accent characters
 * @return string Filtered string with replaced "nice" characters.
 */
function remove_accents($string) {
    if ( !preg_match('/[\x80-\xff]/', $string) )
        return $string;

    if (seems_utf8($string)) {
        $chars = array(
        // Decompositions for Latin-1 Supplement
        chr(195).chr(128) => 'A', chr(195).chr(129) => 'A',
        chr(195).chr(130) => 'A', chr(195).chr(131) => 'A',
        chr(195).chr(132) => 'A', chr(195).chr(133) => 'A',
        chr(195).chr(135) => 'C', chr(195).chr(136) => 'E',
        chr(195).chr(137) => 'E', chr(195).chr(138) => 'E',
        chr(195).chr(139) => 'E', chr(195).chr(140) => 'I',
        chr(195).chr(141) => 'I', chr(195).chr(142) => 'I',
        chr(195).chr(143) => 'I', chr(195).chr(145) => 'N',
        chr(195).chr(146) => 'O', chr(195).chr(147) => 'O',
        chr(195).chr(148) => 'O', chr(195).chr(149) => 'O',
        chr(195).chr(150) => 'O', chr(195).chr(153) => 'U',
        chr(195).chr(154) => 'U', chr(195).chr(155) => 'U',
        chr(195).chr(156) => 'U', chr(195).chr(157) => 'Y',
        chr(195).chr(159) => 's', chr(195).chr(160) => 'a',
        chr(195).chr(161) => 'a', chr(195).chr(162) => 'a',
        chr(195).chr(163) => 'a', chr(195).chr(164) => 'a',
        chr(195).chr(165) => 'a', chr(195).chr(167) => 'c',
        chr(195).chr(168) => 'e', chr(195).chr(169) => 'e',
        chr(195).chr(170) => 'e', chr(195).chr(171) => 'e',
        chr(195).chr(172) => 'i', chr(195).chr(173) => 'i',
        chr(195).chr(174) => 'i', chr(195).chr(175) => 'i',
        chr(195).chr(177) => 'n', chr(195).chr(178) => 'o',
        chr(195).chr(179) => 'o', chr(195).chr(180) => 'o',
        chr(195).chr(181) => 'o', chr(195).chr(182) => 'o',
        chr(195).chr(184) => 'o', chr(195).chr(185) => 'u',
        chr(195).chr(186) => 'u', chr(195).chr(187) => 'u',
        chr(195).chr(188) => 'u', chr(195).chr(189) => 'y',
        chr(195).chr(191) => 'y',
        // Decompositions for Latin Extended-A
        chr(196).chr(128) => 'A', chr(196).chr(129) => 'a',
        chr(196).chr(130) => 'A', chr(196).chr(131) => 'a',
        chr(196).chr(132) => 'A', chr(196).chr(133) => 'a',
        chr(196).chr(134) => 'C', chr(196).chr(135) => 'c',
        chr(196).chr(136) => 'C', chr(196).chr(137) => 'c',
        chr(196).chr(138) => 'C', chr(196).chr(139) => 'c',
        chr(196).chr(140) => 'C', chr(196).chr(141) => 'c',
        chr(196).chr(142) => 'D', chr(196).chr(143) => 'd',
        chr(196).chr(144) => 'D', chr(196).chr(145) => 'd',
        chr(196).chr(146) => 'E', chr(196).chr(147) => 'e',
        chr(196).chr(148) => 'E', chr(196).chr(149) => 'e',
        chr(196).chr(150) => 'E', chr(196).chr(151) => 'e',
        chr(196).chr(152) => 'E', chr(196).chr(153) => 'e',
        chr(196).chr(154) => 'E', chr(196).chr(155) => 'e',
        chr(196).chr(156) => 'G', chr(196).chr(157) => 'g',
        chr(196).chr(158) => 'G', chr(196).chr(159) => 'g',
        chr(196).chr(160) => 'G', chr(196).chr(161) => 'g',
        chr(196).chr(162) => 'G', chr(196).chr(163) => 'g',
        chr(196).chr(164) => 'H', chr(196).chr(165) => 'h',
        chr(196).chr(166) => 'H', chr(196).chr(167) => 'h',
        chr(196).chr(168) => 'I', chr(196).chr(169) => 'i',
        chr(196).chr(170) => 'I', chr(196).chr(171) => 'i',
        chr(196).chr(172) => 'I', chr(196).chr(173) => 'i',
        chr(196).chr(174) => 'I', chr(196).chr(175) => 'i',
        chr(196).chr(176) => 'I', chr(196).chr(177) => 'i',
        chr(196).chr(178) => 'IJ',chr(196).chr(179) => 'ij',
        chr(196).chr(180) => 'J', chr(196).chr(181) => 'j',
        chr(196).chr(182) => 'K', chr(196).chr(183) => 'k',
        chr(196).chr(184) => 'k', chr(196).chr(185) => 'L',
        chr(196).chr(186) => 'l', chr(196).chr(187) => 'L',
        chr(196).chr(188) => 'l', chr(196).chr(189) => 'L',
        chr(196).chr(190) => 'l', chr(196).chr(191) => 'L',
        chr(197).chr(128) => 'l', chr(197).chr(129) => 'L',
        chr(197).chr(130) => 'l', chr(197).chr(131) => 'N',
        chr(197).chr(132) => 'n', chr(197).chr(133) => 'N',
        chr(197).chr(134) => 'n', chr(197).chr(135) => 'N',
        chr(197).chr(136) => 'n', chr(197).chr(137) => 'n',
        chr(197).chr(138) => 'N', chr(197).chr(139) => 'n',
        chr(197).chr(140) => 'O', chr(197).chr(141) => 'o',
        chr(197).chr(142) => 'O', chr(197).chr(143) => 'o',
        chr(197).chr(144) => 'O', chr(197).chr(145) => 'o',
        chr(197).chr(146) => 'OE',chr(197).chr(147) => 'oe',
        chr(197).chr(148) => 'R',chr(197).chr(149) => 'r',
        chr(197).chr(150) => 'R',chr(197).chr(151) => 'r',
        chr(197).chr(152) => 'R',chr(197).chr(153) => 'r',
        chr(197).chr(154) => 'S',chr(197).chr(155) => 's',
        chr(197).chr(156) => 'S',chr(197).chr(157) => 's',
        chr(197).chr(158) => 'S',chr(197).chr(159) => 's',
        chr(197).chr(160) => 'S', chr(197).chr(161) => 's',
        chr(197).chr(162) => 'T', chr(197).chr(163) => 't',
        chr(197).chr(164) => 'T', chr(197).chr(165) => 't',
        chr(197).chr(166) => 'T', chr(197).chr(167) => 't',
        chr(197).chr(168) => 'U', chr(197).chr(169) => 'u',
        chr(197).chr(170) => 'U', chr(197).chr(171) => 'u',
        chr(197).chr(172) => 'U', chr(197).chr(173) => 'u',
        chr(197).chr(174) => 'U', chr(197).chr(175) => 'u',
        chr(197).chr(176) => 'U', chr(197).chr(177) => 'u',
        chr(197).chr(178) => 'U', chr(197).chr(179) => 'u',
        chr(197).chr(180) => 'W', chr(197).chr(181) => 'w',
        chr(197).chr(182) => 'Y', chr(197).chr(183) => 'y',
        chr(197).chr(184) => 'Y', chr(197).chr(185) => 'Z',
        chr(197).chr(186) => 'z', chr(197).chr(187) => 'Z',
        chr(197).chr(188) => 'z', chr(197).chr(189) => 'Z',
        chr(197).chr(190) => 'z', chr(197).chr(191) => 's',
        // Euro Sign
        chr(226).chr(130).chr(172) => 'E',
        // GBP (Pound) Sign
        chr(194).chr(163) => '');

        $string = strtr($string, $chars);
    } else {
        // Assume ISO-8859-1 if not UTF-8
        $chars['in'] = chr(128).chr(131).chr(138).chr(142).chr(154).chr(158)
            .chr(159).chr(162).chr(165).chr(181).chr(192).chr(193).chr(194)
            .chr(195).chr(196).chr(197).chr(199).chr(200).chr(201).chr(202)
            .chr(203).chr(204).chr(205).chr(206).chr(207).chr(209).chr(210)
            .chr(211).chr(212).chr(213).chr(214).chr(216).chr(217).chr(218)
            .chr(219).chr(220).chr(221).chr(224).chr(225).chr(226).chr(227)
            .chr(228).chr(229).chr(231).chr(232).chr(233).chr(234).chr(235)
            .chr(236).chr(237).chr(238).chr(239).chr(241).chr(242).chr(243)
            .chr(244).chr(245).chr(246).chr(248).chr(249).chr(250).chr(251)
            .chr(252).chr(253).chr(255);

        $chars['out'] = "EfSZszYcYuAAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyy";

        $string = strtr($string, $chars['in'], $chars['out']);
        $double_chars['in'] = array(chr(140), chr(156), chr(198), chr(208), chr(222), chr(223), chr(230), chr(240), chr(254));
        $double_chars['out'] = array('OE', 'oe', 'AE', 'DH', 'TH', 'ss', 'ae', 'dh', 'th');
        $string = str_replace($double_chars['in'], $double_chars['out'], $string);
    }

    return $string;
}


$str = "ÈâuÑ";
echo remove_accents($str); // Output: EauN
?>
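For comparison, here is a different technique from the lookup table above, sketched in Python: decompose each character with Unicode normalization (NFD) and drop the combining marks. The function name is made up. This only handles letters that have a canonical decomposition, so characters such as ø or ß pass through unchanged and would still need a table like the one above.

```python
import unicodedata

def strip_accents(text):
    # NFD splits "È" into "E" plus a combining grave accent, and so on.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep everything that is not a combining mark.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("ÈâuÑ"))  # EauN
```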