Best regex questions in February 2011

Count regex replaces (C#)

9 votes

Is there a way to count the number of replacements a Regex.Replace call makes?

E.g. for Regex.Replace("aaa", "a", "b"); I want to get the number 3 out (result is "bbb"); for Regex.Replace("aaa", "(?<test>aa?)", "${test}b"); I want to get the number 2 out (result is "aabab").

Ways I can think to do this:

  1. Use a MatchEvaluator that increments a captured variable, doing the replacement manually
  2. Get a MatchCollection and iterate it, doing the replacement manually and keeping a count
  3. Search first and get a MatchCollection, get the count from that, then do a separate replace

Methods 1 and 2 require manual parsing of $ replacements, method 3 requires regex matching the string twice. Is there a better way?

Thanks to both Chevex and Guffa. I started looking for a better way to get the results and found that there is a Result method on the Match class that does the substitution. That's the missing piece of the jigsaw. Example code below:

using System.Text.RegularExpressions;

namespace regexrep
{
    class Program
    {
        static int Main(string[] args)
        {
            string fileText = System.IO.File.ReadAllText(args[0]);
            int matchCount = 0;
            string newText = Regex.Replace(fileText, args[1],
                (match) =>
                {
                    matchCount++;
                    return match.Result(args[2]);
                });
            System.IO.File.WriteAllText(args[0], newText);
            return matchCount;
        }
    }
}

With a file test.txt containing aaa, the command line regexrep test.txt "(?<test>aa?)" ${test}b will set %errorlevel% to 2 and change the text to aabab.
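For comparison, here is a Python sketch of the same idea. It is an analog, not the C# API above: `re.subn` reports the replacement count directly, and a callback that calls `Match.expand` plays the role of the `Match.Result` approach. The helper name `replace_and_count` is made up for illustration.

```python
import re

# re.subn returns (new_string, number_of_replacements) in a single pass.
print(re.subn(r"a", "b", "aaa"))                       # ('bbb', 3)
print(re.subn(r"(?P<test>aa?)", r"\g<test>b", "aaa"))  # ('aabab', 2)


def replace_and_count(pattern, template, text):
    """Hypothetical helper: count matches in a callback while expanding the template."""
    count = 0

    def evaluator(match):
        nonlocal count
        count += 1
        # expand() applies the \g<name> and \1 style substitutions for this match
        return match.expand(template)

    return re.sub(pattern, evaluator, text), count


print(replace_and_count(r"(?P<test>aa?)", r"\g<test>b", "aaa"))  # ('aabab', 2)
```

The callback form is only needed when the count must be combined with extra per-match logic; otherwise `re.subn` alone is enough.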

How to build a regular expression (C#) to identify a string of eight 1's & 0's

8 votes

I'm trying to build a regex to determine if a string contains a byte of binary digits, ex. 10010011.

I believe that [0-1][0-1][0-1][0-1][0-1][0-1][0-1][0-1] would work, but I'm sure there's a more efficient way of doing it, and being new to regular expressions, I'm not sure what that is.

If it needs to be exactly 8 (no more/less), use this:

@"(?<![01])([01]{8})(?![01])"

If you don't want to match something like "abc01010101xyz", use this:

@"\b[01]{8}\b"

If you want to match all 8-bit strings anywhere in the input, use this:

@"[01]{8}"

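Be aware that the last pattern also matches inside longer runs of binary digits. A normal, non-overlapping search of an input like 1111111100000000 returns just 11111111 and 00000000, but if you test every starting position (overlapping matches), you're going to get a result set like: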

11111111
11111110
11111100
11111000
...
00000000
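A quick way to compare the three patterns is to run them side by side. The sketch below uses Python's `re` module rather than C#, so treat it as an illustration of the patterns and not of the .NET API; the variable names are arbitrary.

```python
import re

data = "1111111100000000"

# Ordinary searching is non-overlapping, so a 16-digit run yields two matches.
print(re.findall(r"[01]{8}", data))  # ['11111111', '00000000']

# A capturing group inside a lookahead reports every overlapping window.
windows = re.findall(r"(?=([01]{8}))", data)
print(len(windows), windows[:3])  # 9 ['11111111', '11111110', '11111100']

# Lookarounds require exactly eight digits but tolerate neighbouring letters.
exact = re.compile(r"(?<![01])([01]{8})(?![01])")
print(exact.search("abc01010101xyz").group(1))  # 01010101
print(exact.search(data))  # None

# Word boundaries additionally reject neighbouring letters.
print(re.search(r"\b[01]{8}\b", "abc01010101xyz"))  # None
```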

Removing all whitespace characters except for " "

7 votes

I consider myself pretty good with Regular Expressions, but this one is appearing to be surprisingly tricky: I want to trim all whitespace, except the space character: ' '.

In Java, the RegEx I have tried is: [\s-[ ]], but this one also strips out ' '.

UPDATE:

Here is the particular string that I am attempting to strip spaces from:

project team                manage key

Note: it would be the characters between "team" and "manage". They appear as a long space when editing this post but view as a single space in view mode.

Try using this regular expression:

[^\S ]+

It's a bit confusing to read because of the double negative. The regular expression [\S ] matches the characters you want to keep, i.e. either a space or anything that isn't a whitespace. The negated character class [^\S ] therefore must match all the characters you want to remove.
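The same character class works unchanged in Python, which makes it easy to try out. This is a minimal sketch with made-up sample text (a tab, a no-break space and a newline standing in for the unknown characters in the question):

```python
import re

# Tab, no-break space and newline sit between "team" and " manage".
text = "project team\t\u00a0\n manage key"

# [^\S ] = "not (non-whitespace or space)" = whitespace other than ' '
cleaned = re.sub(r"[^\S ]+", "", text)
print(repr(cleaned))  # 'project team manage key'
```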

Simplify this regular expression

7 votes

I'm doing some pre-exam exercises for my compilers class, and needed to simplify this regular expression.

(a U b)*(a U e)b* U (a U b)*(b U e)a*

Quite obviously, the e is the empty string, and the U stands for union.

So far, I think one of the (a U b)* can be removed, as the union of a U a = a. However, I can't find any other simplifications, and am not doing so well with the other problems thus far. :(

Any help is appreciated, thanks very much!

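Little rusty on regex, but if * still represents "zero or more occurrences", then (a U e)b* only generates strings that (a U b)* already generates, and it includes the empty string. Placed after (a U b)*, it therefore adds nothing, so you can replace:

(a U e)b* with (a U b)*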

which leaves the first part with:

(a U b)*(a U b)* = (a U b)*

On the right side, you have that

(b U e)a* = (b U a)*

Now, since a U b = b U a, you get:

(a U b)*(a U b)*

on the right hand side, which leaves just

(a U b)* U (a U b)* = (a U b)*

I think that's it...
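The simplification can be sanity-checked by brute force. The sketch below translates both expressions into Python syntax (U becomes |, e becomes an empty alternative) and compares them on every string up to a small length; the extra letter c is included only to confirm that neither expression accepts anything outside {a, b}. This is a finite check, not a proof.

```python
import re
from itertools import product

original = re.compile(r"(?:(?:a|b)*(?:a|)b*)|(?:(?:a|b)*(?:b|)a*)")
simplified = re.compile(r"(?:a|b)*")


def agree(max_len, alphabet="abc"):
    """Return True if both expressions accept exactly the same strings up to max_len."""
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            s = "".join(chars)
            if bool(original.fullmatch(s)) != bool(simplified.fullmatch(s)):
                return False
    return True


print(agree(6))  # True
```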

Java Regular Expression running very slow

7 votes

I'm trying to use the Daring Fireball Regular Expression for matching URLs in Java, and I've found a URL that causes the evaluation to take forever. I've modified the original regex to work with Java syntax.

private final static String pattern = 
"\\b" + 
"(" +                            // Capture 1: entire matched URL
  "(?:" +
    "[a-z][\\w-]+:" +                // URL protocol and colon
    "(?:" +
      "/{1,3}" +                        // 1-3 slashes
      "|" +                             //   or
      "[a-z0-9%]" +                     // Single letter or digit or '%'
                                        // (Trying not to match e.g. "URI::Escape")
    ")" +
    "|" +                            //   or
    "www\\d{0,3}[.]" +               // "www.", "www1.", "www2." … "www999."
    "|" +                            //   or
    "[a-z0-9.\\-]+[.][a-z]{2,4}/" +  // looks like domain name followed by a slash
  ")" +
  "(?:" +                           // One or more:
    "[^\\s()<>]+" +                      // Run of non-space, non-()<>
    "|" +                               //   or
    "\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" +  // balanced parens, up to 2 levels
  ")+" +
  "(?:" +                           // End with:
    "\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" +  // balanced parens, up to 2 levels
    "|" +                                   //   or
    "[^\\s`!\\-()\\[\\]{};:'\".,<>?«»“”‘’]" +        // not a space or one of these punct chars (updated to add a 'dash'
  ")" +
")";

// @see http://daringfireball.net/2010/07/improved_regex_for_matching_urls
private static final Pattern DARING_FIREBALL_PATTERN = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

If I attempt to run the following, it takes forever. I've narrowed it down to the matching of balanced parens (I think). If you change the text within the parens, it works fine, but at about 15 characters, it starts to slow down exponentially.

final Matcher matcher = DARING_FIREBALL_PATTERN.matcher("https://goo.gl/a(something_really_long_in_balanced_parens)");
boolean found = matcher.find();

Is there a way to improve this regex so that the lines above don't take forever? I have about 100 different URLs in a JUnit test class, and I need those to continue to work as well.

The problem is here:

"(?:" +                           // One or more:
"[^\\s()<>]+" +                      // Run of non-space, non-()<>
"|" +                               //   or
"\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" +  // balanced parens, up to 2 levels
")+"

What you've got here is nested quantifiers. This plays havoc with any backtracking algorithm - as an example, consider the regex /^(a+)+$/ matching against the string

aaaaaaaaaab

As a first attempt, the inner quantifier will match all of the a's. Then the regex fails, so it backs off one. Then the outer quantifier tries to match again, swallowing up the last a, then the regex fails once more. We basically get exponential behaviour as the quantifiers try all sorts of ways of splitting up the run of a's, without actually making any progress.

The solution is possessive quantifiers (which we denote by tacking a + onto the end of a quantifier) - we set up the inner quantifiers so that once they have a match, they don't let it go - they'll hold onto that until the match fails or an earlier quantifier backs off and they have to rematch starting somewhere else in the string. If we instead used /^(a++)+$/ as our regex, we would fail immediately on the non-matching string above, rather than going exponential trying to match it.

Try making those inner quantifiers possessive and see if it helps.
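The effect is easy to observe on the small /^(a+)+$/ example. The sketch below is in Python, not Java, and it swaps in a different technique: since Python's `re` only accepts possessive quantifiers from version 3.11 on, the "guarded" pattern imitates `a++` with a lookahead plus backreference, which works on older versions too. The function name and the chosen lengths are arbitrary.

```python
import re
import time

naive = re.compile(r"^(a+)+$")
# (?=(a+))\1 grabs the longest run of a's and cannot hand any of it back,
# which imitates a possessive a++ on Python versions before 3.11.
guarded = re.compile(r"^(?:(?=(a+))\1)+$")


def seconds_to_fail(regex, n):
    """Time a failing match against n a's followed by a single b."""
    start = time.perf_counter()
    assert regex.match("a" * n + "b") is None
    return time.perf_counter() - start


# Kept short on purpose: the naive time roughly doubles with every extra 'a'.
for n in (14, 16, 18):
    print(n, seconds_to_fail(naive, n), seconds_to_fail(guarded, n))
```

On a typical machine the first timing column grows by about a factor of four per row while the second stays flat, but the exact numbers depend on the interpreter and hardware.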

Better regex syntax ideas

6 votes

I need some help to complete my idea about regexes.

Introduction

There was a question about better syntax for regexes on SE, but I don't think I'd use the fluent syntax. It's surely nice for newbies, but in case of a complicated regex, you replace a line of gibberish by a whole page of slightly better gibberish. I like the approach by Martin Fowler, where a regex gets composed of smaller pieces. His solution is readable, but hand-made; he proposes a smart way to build a complicated regex instead of a class supporting it.

I'm trying to make it into a class using something like (see his example first)

final MyPattern pattern = MyPattern.builder()
.caseInsensitive()
.define("numberOfPoints", "\\d+")
.define("numberOfNights", "\\d+")
.define("hotelName", ".*")
.define(' ', "\\s+")
.build("score `numberOfPoints` for `numberOfNights` nights? at `hotelName`");

MyMatcher m = pattern.matcher("Score 400 FOR 2 nights at Minas Tirith Airport");
System.out.println(m.group("numberOfPoints")); // prints 400

where fluent syntax is used for combining regexes extended as follows:

  • define named patterns and use them by enclosing in backticks
    • name creates a named group
      • mnemonics: shell captures the result of the command enclosed in backticks
    • :name creates a non-capturing group
      • mnemonics: similar to (?:...)
    • -name creates a backreference
      • mnemonics: the dash connects it to the previous occurrence
  • redefine individual characters and use it everywhere unless quoted
    • here only some characters (e.g., ~ @#%") are allowed
      • redefining + or ( would be extremely confusing, so it's not allowed
      • redefining space to mean any spacing is very natural in the example above
      • redefining a character could make the pattern more compact, which is good unless overused
      • e.g., using something like define('#', "\\\\") for matching backslashes could make the pattern much more readable
  • redefine some quoted sequences like \s or \w
    • the standard definitions are not Unicode-conformant
    • sometimes you might have your own idea what a word or space is

The named patterns serve as a sort of local variables, helping to decompose a complicated expression into small and easy-to-understand pieces. A well-chosen pattern name often makes a comment unnecessary.
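To make the proposal concrete, here is a rough Python sketch of the core expansion step. It is not the author's Java class: the function `build` and its parameters are hypothetical, and it covers only the plain backtick form (named group) plus the redefined space, not the :name or -name variants.

```python
import re


def build(template, definitions, space=r"\s+", flags=0):
    """Hypothetical sketch: expand `name` into a named group and ' ' into `space`."""
    pieces = []
    # re.split with a capturing group puts the captured names at odd indexes.
    for index, part in enumerate(re.split(r"`(\w+)`", template)):
        if index % 2:
            pieces.append("(?P<%s>%s)" % (part, definitions[part]))
        else:
            pieces.append(part.replace(" ", space))
    return re.compile("".join(pieces), flags)


pattern = build(
    "score `numberOfPoints` for `numberOfNights` nights? at `hotelName`",
    {"numberOfPoints": r"\d+", "numberOfNights": r"\d+", "hotelName": ".*"},
    flags=re.IGNORECASE,
)
m = pattern.match("Score 400 FOR 2 nights at Minas Tirith Airport")
print(m.group("numberOfPoints"))  # 400
print(m.group("hotelName"))       # Minas Tirith Airport
```

A real implementation would also have to decide what happens when the same name is used twice, since most engines reject duplicate group names.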

Questions

The above shouldn't be hard to implement (I have already done most of it) and could be really useful, I hope. Do you think so?

However, I'm not sure how it should behave inside of brackets, sometimes it's meaningful to use the definitions and sometimes not, e.g. in

.define(' ', "\\s")            // a blank character
.define('~', "/\\*[^*]+\\*/")  // an inline comment (simplified)
.define("something", "[ ~\\d]")

expanding the space to \s makes sense, but expanding the tilde doesn't. Maybe there should be a separate syntax to define own character classes somehow?

Can you think of some examples where the named patterns are very useful or not useful at all? I'd need some border cases and some ideas for improvement.

Reaction to tchrist's answer

Comments to his objections

  1. Lack of multiline pattern strings.
    • There are no multiline strings in Java, which I'd like to change, but can not.
  2. Freedom from insanely onerous and error-prone double-backslashing...
    • This is again something I can't do; I can only offer a workaround, see below.
  3. Lack of compile-time exceptions on invalid regex literals, and lack of compile-time caching of correctly compiled regex literals.
    • As regexes are just a part of the standard library and not of the language itself, there's nothing that can be done here.
  4. No debugging or profiling facilities.
    • I can do nothing here.
  5. Lack of compliance with UTS#18.
    • This can be easily solved by redefining the corresponding patterns as I proposed. It's not perfect, since in the debugger you'll see the blown-up replacements.

It looks like you don't like Java. I'd be happy to see some syntax improvements there, but there's nothing I can do about it. I'm looking for something working with current Java.

RFC 5322

Your example can be easily written using my syntax:

final MyPattern pattern = MyPattern.builder()
.define(" ", "") // ignore spaces
.useForBackslash('#') // (1): see (2)
.define("address",         "`mailbox` | `group`")
.define("WSP",             "[\u0020\u0009]")
.define("DQUOTE",          "\"")
.define("CRLF",            "\r\n")
.define("DIGIT",           "[0-9]")
.define("ALPHA",           "[A-Za-z]")
.define("NO_WS_CTL",       "[\u0001-\u0008\u000b\u000c\u000e-\u001f\u007f]") // No whitespace control
...
.define("domain_literal",  "`CFWS`? #[ (?: `FWS`? `dcontent`)* `FWS`? #] `CFWS`?") // (2): see (1)
...
.define("group",           "`display_name` : (?:`mailbox_list` | `CFWS`)? ; `CFWS`?")
.define("angle_addr",      "`CFWS`? < `addr_spec` > `CFWS`?")
.define("name_addr",       "`display_name`? `angle_addr`")
.define("mailbox",         "`name_addr` | `addr_spec`")
.define("address",         "`mailbox` | `group`")
.build("`address`");

Disadvantages

While rewriting your example I encountered the following issues:

  • As there are no \xdd escape sequences, \udddd must be used
  • Using another character instead of backslash is a bit strange
  • As I prefer to write it bottom-up, I had to take your lines in reverse order
  • Without much idea what it does, I expect I have made some errors

On the bright side:

  • Ignoring spaces is no problem
  • Comments are no problem
  • The readability is good

And most important: It's plain Java and uses the existing regex-engine as is.

Named Capture Examples

Can you think of some examples where the named patterns are very useful or not useful at all?

In answer to your question, here is an example where named patterns are especially useful. It’s a Perl or PCRE pattern for parsing an RFC 5322 mail address. First, it’s in /x mode by virtue of (?x). Second, it separates out the definitions from the invocation; the named group address is the thing that does the full recursive-descent parse. Its definition follows it in the non-executing (?(DEFINE)…) block.

   (?x)              # allow whitespace and comments

   (?&address)       # this is the capture we call as a "regex subroutine"

   # the rest is all definitions, in a nicely BNF-style
   (?(DEFINE)

     (?<address>         (?&mailbox) | (?&group))
     (?<mailbox>         (?&name_addr) | (?&addr_spec))
     (?<name_addr>       (?&display_name)? (?&angle_addr))
     (?<angle_addr>      (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
     (?<group>           (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ; (?&CFWS)?)
     (?<display_name>    (?&phrase))
     (?<mailbox_list>    (?&mailbox) (?: , (?&mailbox))*)

     (?<addr_spec>       (?&local_part) \@ (?&domain))
     (?<local_part>      (?&dot_atom) | (?&quoted_string))
     (?<domain>          (?&dot_atom) | (?&domain_literal))
     (?<domain_literal>  (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)?
                                   \] (?&CFWS)?)
     (?<dcontent>        (?&dtext) | (?&quoted_pair))
     (?<dtext>           (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])

     (?<atext>           (?&ALPHA) | (?&DIGIT) | [!#\$%&'*+\-/=?^_`{|}~])
     (?<atom>            (?&CFWS)? (?&atext)+ (?&CFWS)?)
     (?<dot_atom>        (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
     (?<dot_atom_text>   (?&atext)+ (?: \. (?&atext)+)*)

     (?<text>            [\x01-\x09\x0b\x0c\x0e-\x7f])
     (?<quoted_pair>     \\ (?&text))

     (?<qtext>           (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
     (?<qcontent>        (?&qtext) | (?&quoted_pair))
     (?<quoted_string>   (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
                          (?&FWS)? (?&DQUOTE) (?&CFWS)?)

     (?<word>            (?&atom) | (?&quoted_string))
     (?<phrase>          (?&word)+)

     # Folding white space
     (?<FWS>             (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
     (?<ctext>           (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
     (?<ccontent>        (?&ctext) | (?&quoted_pair) | (?&comment))
     (?<comment>         \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
     (?<CFWS>            (?: (?&FWS)? (?&comment))*
                         (?: (?:(?&FWS)? (?&comment)) | (?&FWS)))

     # No whitespace control
     (?<NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])

     (?<ALPHA>           [A-Za-z])
     (?<DIGIT>           [0-9])
     (?<CRLF>            \x0d \x0a)
     (?<DQUOTE>          ")
     (?<WSP>             [\x20\x09])
   )

I strongly suggest not reïnventing a perfectly good wheel. Start with becoming PCRE-compatible. If you wish to go beyond basic Perl5 patterns like the RFC5322-parser above, there’s always Perl6 patterns to draw upon.

It really, really pays to do research into existing practice and literature before haring off on an open-ended R&D mission. These problems have all long ago been solved, sometimes quite elegantly.

Improving Java Regex Syntax

If you truly want better regex syntax ideas for Java, you must first address these particular flaws in Java’s regexes:

  1. Lack of multiline pattern strings, as demonstrated above.
  2. Freedom from insanely onerous and error-prone double-backslashing, as also demonstrated above.
  3. Lack of compile-time exceptions on invalid regex literals, and lack of compile-time caching of correctly compiled regex literals.
  4. Impossible to change something like "foo".matches(pattern) to use a better pattern library, partly but not solely because of final classes that are not overridable.
  5. No debugging or profiling facilities.
  6. Lack of compliance with UTS#18: Basic Regular Expression support, the very most elementary steps necessary to make Java regexes useful for Unicode. They currently are not. They don’t even support Unicode 3.1 properties from a decade ago, which means you cannot use Java patterns for Unicode in any reasonable fashion; the basic building blocks are absent.

Of these, the first 3 have been addressed in several JVM languages, including both Groovy and Scala; even Clojure goes part-way there.

The second set of 3 steps will be tougher, but are absolutely mandatory. The last one, the absence of even the most basic Unicode support in regexes, simply kills Java for Unicode work. This is completely inexcusable this late in the game. I can provide plenty of examples if need be, but you should trust me, because I really do know what I’m talking about here.

Only once you have accomplished all these should you be worried about fixing up Java’s regexes so they can catch up with the current state of the art in pattern matching. Until and unless you take care of these past oversights, you can’t begin to look to the present, let alone to the future.

Difference between regex [A-z] and [a-zA-Z]

6 votes

I am using a regex to program an input validator for a text box where I only want alphabetical characters. I was wondering if [A-z] and [a-zA-Z] were equivalent or if there were differences performance wise.

I keep reading [a-zA-Z] on my searches and no mention of [A-z].

I am using java's String.matches(regex).

[A-z] will match ASCII characters in the range from A to z, while [a-zA-Z] will match ASCII characters in the range from A to Z and in the range from a to z. At first glance, this might seem equivalent -- however, if you look at this table of ASCII characters, you'll see that A-z includes several other characters. Specifically, they are [, \, ], ^, _, and ` (which you clearly don't want).
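The difference can be listed mechanically. This short Python sketch (the question is about Java, but ASCII ranges behave the same way) collects every ASCII character that [A-z] accepts and [a-zA-Z] rejects:

```python
import re

extras = [
    chr(code)
    for code in range(128)
    if re.fullmatch(r"[A-z]", chr(code)) and not re.fullmatch(r"[a-zA-Z]", chr(code))
]
print(extras)  # ['[', '\\', ']', '^', '_', '`']
```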

Rebuild regex string based on match keywords in python

6 votes

Example regular expression

regex = re.compile('^page/(?P<slug>[-\w]+)/(?P<page_id>[0-9]+)/$')
matches = regex.match('page/slug-name/5/')
>> matches.groupdict()
{'slug': 'slug-name', 'page_id': '5'}

Is there an easy way to pass a dict back to the regex to rebuild a string?

i.e. {'slug': 'new-slug', 'page_id': '6'} would yield page/new-slug/6/

Here's a solution using sre_parse

import re
from sre_parse import parse

pattern = r'^page/(?P<slug>[-\w]+)/(?P<page_id>[0-9]+)/$'
regex = re.compile(pattern)
matches = regex.match('page/slug-name/5/')
params = matches.groupdict()
print params
>> {'page_id': '5', 'slug': 'slug-name'}

lookup = dict((v,k) for k, v in regex.groupindex.iteritems())
frags = [chr(i[1]) if i[0] == 'literal' else str(params[lookup[i[1][0]]]) \
    for i in parse(pattern) if i[0] != 'at']
print ''.join(frags)
>> page/slug-name/5/

This works by grabbing the raw opcodes via parse(), dumping the positional opcodes (they have 'at' for a first param), replacing the named groups, and concatenating the frags when it's done.
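The code above is written for Python 2 and relies on the internal sre_parse module. As an alternative for Python 3, here is a sketch that uses a different, much cruder technique: it edits the pattern text directly, replacing each named group with the supplied value. It assumes the groups contain no nested parentheses and that the rest of the pattern is literal text apart from the ^ and $ anchors; the name `rebuild` is made up.

```python
import re

# Matches a named group whose body contains no parentheses.
NAMED_GROUP = re.compile(r"\(\?P<(\w+)>[^()]*\)")


def rebuild(pattern, params):
    """Hypothetical helper: substitute params into the named groups of a simple pattern."""
    body = pattern
    if body.startswith("^"):
        body = body[1:]
    if body.endswith("$"):
        body = body[:-1]
    return NAMED_GROUP.sub(lambda m: str(params[m.group(1)]), body)


pattern = r"^page/(?P<slug>[-\w]+)/(?P<page_id>[0-9]+)/$"
print(rebuild(pattern, {"slug": "new-slug", "page_id": "6"}))  # page/new-slug/6/
```

Nothing here verifies that the new values would actually match their groups; matching the rebuilt string against the original pattern is a cheap way to check that.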

Shouldn't "static" patterns always be static?

6 votes

I just found a bug in some code I didn't write and I'm a bit surprised:

Pattern pattern = Pattern.compile("\\d{1,2}.\\d{1,2}.\\d{4}");
Matcher matcher = pattern.matcher(s);

Despite the fact that this code fails badly on the input data we get (it tries to find dates in the 17.01.2011 format, gets back things like 10396/2011, and then crashes because it can't parse the date, but that really ain't the point of this question ; ) I wonder:

  • isn't one of the points of Pattern.compile to be a speed optimization (by pre-compiling regexps)?

  • shouldn't all "static" patterns always be compiled into static Pattern fields?

There are so many examples, all around the web, where the same pattern is always recompiled using Pattern.compile that I begin to wonder if I'm seeing things or not.

Isn't (assuming that the string is static and hence not dynamically constructed):

static Pattern pattern = Pattern.compile("\\d{1,2}.\\d{1,2}.\\d{4}");

always preferable over a non-static pattern reference?

  1. Yes, the whole point of pre-compiling a Pattern is to only do it once.
  2. It really depends on how you're going to use it, but in general, pre-compiled patterns stored in static fields should be fine. (Unlike Matchers, which aren't threadsafe and therefore shouldn't really be stored in fields at all, static or not.)

The only caveat with compiling patterns in static initializers is that if the pattern doesn't compile and the static initializer throws an exception, the source of the error can be quite annoying to track down. It's a minor maintainability problem but it might be worth mentioning.
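The same "compile once, reuse everywhere" habit looks like this in Python, where a module-level constant plays the role of the static field. This is an analog of the Java code, not a translation of it; the names `DATE`, `LOOSE` and `find_dates` are made up, and the example also shows why the unescaped dots in the question's pattern let 10396/2011 through.

```python
import re

# Compiled once at import time and shared by every caller.
# The dots are escaped so they match only a literal "."
DATE = re.compile(r"\d{1,2}\.\d{1,2}\.\d{4}")

# The pattern from the question: an unescaped "." matches any character.
LOOSE = re.compile(r"\d{1,2}.\d{1,2}.\d{4}")


def find_dates(text):
    """Return every dd.mm.yyyy-looking substring in text."""
    return DATE.findall(text)


print(LOOSE.search("10396/2011").group())        # 10396/2011
print(DATE.search("10396/2011"))                 # None
print(find_dates("due 17.01.2011 or 3.2.2012"))  # ['17.01.2011', '3.2.2012']
```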

Regex to match on capital letter, digit or capital, lowercase, and digit

5 votes

I'm working on an application which will calculate molecular weight and I need to separate a string into the different molecules. I've been using a regex to do this but I haven't quite gotten it to work. I need the regex to match on patterns like H2OCl4 and Na2H2O where it would break it up into matches like:

  1. H2
  2. O
  3. Cl4

  1. Na2
  2. H2
  3. O

The regex I've been working on is this:

([A-Z]\d*|[A-Z]*[a-z]\d*)

It's really close but it currently breaks the matches into this:

  1. H2
  2. O
  3. C
  4. l4

I need the Cl4 to be considered one match. Can anyone help me with the last part I'm missing in this? I'm pretty new to regular expressions. Thanks.

I think what you want is "[A-Z][a-z]?\d*"

That is, a capital letter, followed by an optional small letter, followed by an optional string of digits.

If you want to match 0, 1, or 2 lower-case letters, then you can write:

"[A-Z][a-z]{0,2}\d*"

Note, however, that both of these regular expressions assume that the input data is valid. Given bad data, they will silently skip over it. For example, if the input string is "H2ClxxzSO4", with the second pattern you're going to get:

  1. H2
  2. Clx
  3. S
  4. O4

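If you want to detect bad data, you'll need to check the Index property of each returned Match object to ensure that it is equal to the index where the previous match ended (or 0 for the first match), and that the last match ends at the end of the string.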
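The gap check can be sketched as follows. This is Python rather than C#, so `m.start()` stands in for the Index property; the function name `split_formula` is hypothetical.

```python
import re

ELEMENT = re.compile(r"[A-Z][a-z]?\d*")


def split_formula(formula):
    """Return the element tokens, raising ValueError if any character is skipped."""
    tokens = []
    position = 0
    for m in ELEMENT.finditer(formula):
        if m.start() != position:
            # A gap between matches means some text was not recognised.
            raise ValueError("unexpected text at index %d" % position)
        tokens.append(m.group())
        position = m.end()
    if position != len(formula):
        raise ValueError("unexpected text at index %d" % position)
    return tokens


print(split_formula("H2OCl4"))  # ['H2', 'O', 'Cl4']
print(split_formula("Na2H2O"))  # ['Na2', 'H2', 'O']
```

With the input "H2ClxxzSO4" this version raises ValueError at index 4 instead of quietly dropping "xxz".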

Why my regex with r'string' matches but not 'string' using Python?

5 votes

The way regex works in Python is so intensely puzzling that it makes me more furious with each passing second. Here's my problem:

I understand that this gives a result:

re.search(r'\bmi\b', 'grand rapids, mi 49505')

while this doesn't:

re.search('\bmi\b', 'grand rapids, mi 49505')

And that's okay. I get that much of it. Now, I have a regular expression that's being generated like this:

regex = '|'.join(['\b' + str(state) + '\b' for state in states])

If I now do re.search(regex, 'grand rapids, mi 49505'), it fails for the same reason my second search() example fails.

My question: Is there any way to do what I'm trying to do?

The answer itself

regex = '|'.join([r'\b' + str(state) + r'\b' for state in states])

The reason behind this is that the 'r' prefix tells Python not to process escape sequences in the string literal. If you don't put an 'r' before the string, Python will turn any character preceded by '\' into a special character where it can, to allow you to enter line breaks (\n), tabs (\t) and such easily.

When you write '\b', you tell Python to create a string, analyse it, and transform '\b' into 'backspace', while when you write r'\b', Python just stores '\' then 'b', and this is what you want for a regex. Always use 'r' for strings used as regex patterns.

The 'r' notation is called 'raw string', but that's misleading, as there is no such thing as a raw string in Python internals. Just think about it as a way to tell Python to avoid being too smart.

There is another notation in Python < 3.0, u'string', that tells Python to store the string as unicode. You can combine both: ur"é\n" will store "é", then a backslash, then "n" as unicode, while u"é\n" will store "é" then a line break.

Some ways to improve your code:

regex = '|'.join(r'\b' + str(state) + r'\b' for state in states)

Removed the extra []. This turns the list comprehension into a generator expression, which tells Python not to store the list of values you are generating in memory. We can do it here because we don't plan to reuse the list: it is passed directly to join() and used nowhere else.

regex = '|'.join(r'\b%s\b' % state for state in states)

This will take care of the string conversion automatically and is shorter and cleaner. When you format strings in Python, think about the % operator.

If states contains a list of state zip codes, then they should be stored as strings, not as ints. In that case, you can skip the type casting and shorten it even more:

regex = r'\b%s\b' % r'\b|\b'.join(states)

Finally, you may not need a regex at all. If all you care about is checking whether one of the zip codes is in the given string, you can just use in (which checks whether an item is in an iterable, or whether a string is contained in another string):

matches = [s for s in states if s in 'grand rapids, mi 49505']
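The difference between the two literals is easy to see by checking their lengths. The sketch below assumes a small made-up `states` list of lowercase abbreviations; everything else is standard `re`.

```python
import re

states = ["mi", "oh", "in"]  # made-up sample data

# '\b' is one character (backspace); r'\b' is two (backslash, then b).
print(len("\b"), len(r"\b"))  # 1 2

broken = "|".join("\b%s\b" % s for s in states)  # backspaces, not word boundaries
regex = "|".join(r"\b%s\b" % s for s in states)  # real word boundaries

print(re.search(broken, "grand rapids, mi 49505"))         # None
print(re.search(regex, "grand rapids, mi 49505").group())  # mi
```

If the items could ever contain regex metacharacters, wrapping each one in `re.escape()` before joining keeps them literal.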

Last word

I understand you may be frustrated when learning a new language, but take the time to give a proper title to your question. In this website, the title should end with a question mark and give specific details about the problem.

Regular expressions negative lookahead

5 votes

I'm doing some regular expression gymnastics. I set myself the task of trying to search for C# code where there is a usage of the as-operator not followed by a null-check within a reasonable amount of space. Now I don't want to parse the C# code. E.g. I want to capture code snippets such as

    var x1 = x as SimpleRes;
    var y1 = y as SimpleRes;
    if(x1.a == y1.a)

however, not capture

    var x1 = x as SimpleRes;
    var y1 = y as SimpleRes;
    if(x1 == null)

nor for that matter

    var x1 = x as SimpleRes;
    var y1 = y as SimpleRes;
    if(somethingunrelated == null) {...}
    if(x1.a == y1.a)

Thus any random null-check will count as a "good check" and hence not be found.

The question is: How do I match something while ensuring something else is not found in its surroundings?

I've tried the naive approach, looking for 'as' then doing a negative lookahead within 150 characters.

\bas\b.{1,150}(?!\b==\s*null\b)

The above regular expression matches all of the above examples, unfortunately. My gut tells me the problem is that looking ahead and then doing a negative lookahead can find many positions where the lookahead does not find the '== null'.

If I try negating the whole expression, then that doesn't help either, as that would match most C# code around.

I love regex gymnastics! Here is a commented PHP regex:

$re = '/# Find all AS, (but not preceding a XX == null).
    \bas\b               # Match "as"
    (?=                  # But only if...
      (?:                # there exist from 1-150
        [\S\s]           # chars, each of which
        (?!==\s*null)    # is NOT followed by "== null"
      ){1,150}?          # (and do this lazily)
      (?:                # We are done when either
        (?=              # we have reached
          ==\s*(?!null)  # a non NULL conditional
        )                #
      | $                # or the end of string.
      )
    )/ix'

And here it is in Javascript style:

re = /\bas\b(?=(?:[\S\s](?!==\s*null)){1,150}?(?:(?===\s*(?!null))|$))/ig;

This one did make my head hurt a little...

Here is the test data I am using:

text = r"""    var x1 = x as SimpleRes;
    var y1 = y as SimpleRes;
    if(x1.a == y1.a)

however, not capture
    var x1 = x as SimpleRes;
    var y1 = y as SimpleRes;
    if(x1 == null)

nor for that matter
    var x1 = x as SimpleRes;
    var y1 = y as SimpleRes;
    if(somethingunrelated == null) {...}
    if(x1.a == y1.a)"""
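The Javascript-style pattern also runs unchanged under Python's `re`, which gives a quick way to try it on the three snippets one at a time. This is only a sketch: the variable names are made up and the snippets are shortened copies of the ones above.

```python
import re

AS_WITHOUT_NULL_CHECK = re.compile(
    r"\bas\b(?=(?:[\S\s](?!==\s*null)){1,150}?(?:(?===\s*(?!null))|$))",
    re.IGNORECASE,
)

unchecked = "var x1 = x as SimpleRes;\nvar y1 = y as SimpleRes;\nif(x1.a == y1.a)"
checked = "var x1 = x as SimpleRes;\nvar y1 = y as SimpleRes;\nif(x1 == null)"
unrelated = (
    "var x1 = x as SimpleRes;\nvar y1 = y as SimpleRes;\n"
    "if(somethingunrelated == null) {...}\nif(x1.a == y1.a)"
)

for name, code in [("unchecked", unchecked), ("checked", checked), ("unrelated", unrelated)]:
    # Number of "as" usages flagged in each snippet
    print(name, len(AS_WITHOUT_NULL_CHECK.findall(code)))
# unchecked 2
# checked 0
# unrelated 0
```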

Does Java regex optimize this specific case?

5 votes

I wonder how regex engines work. My particular regex has an element that looks like this:

(word1|word2|wordn......)

The number of words is large: several hundred.
I wonder whether the regex engine just tests the words one by one, or whether it optimizes the search, and in what way.
Any pointer to good documentation would be welcome.

If you have several hundred words, you need to beware of the ordering of the words in the regex. The regex engine looks for the words from left to right.
If you test the word setValue against the alternation set|setValue, it will match only the 3 letters comprising "set", and not the whole string.

See this link (from www.regular-expressions.info) for the full explanation.

I don't think that the regex engine truly optimizes alternations (i.e., analyzing common prefixes and building the NFA accordingly). Therefore, even with so many words, I don't think you will get any such optimization.

Aside from re-ordering the words, you can also try adding a word or line boundary after the alternation, e.g. (set|setValue)$, but I suspect that the regex engine will do a lot of backtracking so it may not be worth the effort.
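The ordering effect can be reproduced in a few lines. The sketch uses Python's `re`, which tries alternatives left to right in the same way as described above for Java; sorting longest-first is one simple re-ordering strategy, not the only one.

```python
import re

words = ["set", "setValue"]

# Alternatives are tried left to right, so "set" wins before "setValue" is tried.
naive = re.compile("|".join(re.escape(w) for w in words))
print(naive.match("setValue").group())  # set

# Putting longer words first lets the longer alternative match.
longest_first = sorted(words, key=len, reverse=True)
ordered = re.compile("|".join(re.escape(w) for w in longest_first))
print(ordered.match("setValue").group())  # setValue

# An end anchor forces backtracking into the second alternative instead.
anchored = re.compile(r"(?:set|setValue)$")
print(anchored.match("setValue").group())  # setValue
```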

Escape comma when using String.split

5 votes

I'm trying to perform some super simple parsing of log files, so I'm using the String.split method like this:

String [] parts = input.split(",");

And it works great for input like:

a,b,c

Or

type=simple, output=Hello, repeat=true 

Just to say something.

How can I escape the comma, so it doesn't match intermediate commas?

For instance, if I want to include a comma in one of the parts:

type=simple, output=Hello, world, repeate=true

I was thinking of something like:

type=simple, output=Hello\, world, repeate=true

But I don't know how to create the split to avoid matching the comma.

I've tried:

String [] parts = input.split("[^\,],");

But, well, is not working.

You can solve it using a negative look behind.

String[] parts = str.split("(?<!\\\\), ");

Basically it says, split on each ", " that is not preceded by a backslash.

String str = "type=simple, output=Hello\\, world, repeate=true";
String[] parts = str.split("(?<!\\\\), ");
for (String s : parts)
    System.out.println(s);

Output:

type=simple
output=Hello\, world
repeate=true

(ideone.com link)


If you happen to be stuck with the non-escaped comma-separated values, you could do the following (similar) hack:

String[] parts = str.split(", (?=\\w+=)");

Which says split on each ", " which is followed by some word-characters and an =

(ideone.com link)
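Both split expressions carry over directly to other regex flavours. Here is a Python sketch of the two variants using `re.split`; raw strings remove the need for the doubled backslashes that the Java string literals require.

```python
import re

escaped = r"type=simple, output=Hello\, world, repeate=true"
unescaped = "type=simple, output=Hello, world, repeate=true"

# Split on ", " unless it is preceded by a backslash.
print(re.split(r"(?<!\\), ", escaped))
# ['type=simple', 'output=Hello\\, world', 'repeate=true']

# Split on ", " only when a key= follows.
print(re.split(r", (?=\w+=)", unescaped))
# ['type=simple', 'output=Hello, world', 'repeate=true']
```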

Regular Expression to replace " {" with "(newline){" in xcode

4 votes

I need to change my coding style from putting opening braces on the same line to putting them on a new line. I need to find and replace the (space){ with (newline){. I heard that using regular expression find and replace, it's pretty simple.

Could anyone help me on this?

You could try the following:

  • In the Find box, type space \ { $
  • In the Replace box, type control+q return {

control+q is needed to quote the return key. There’s no visual feedback for typing control+q return, so the only visible character in the replace box is the opening curly brace:

Screenshot Find & Replace

Although this answers your question, there’s (at least!) one problem: it won’t indent the opening curly brace, so something like

- (void)method {
    for (obj in collection) {
        NSLog(@"%@", obj);
    }
}

is converted to

- (void)method
{
    for (obj in collection)
{
        NSLog(@"%@", obj);
    }
}

The menu item Edit > Format > Re-Indent will place the opening curly braces in the correct indentation tab but there might be non-desired side effects to your code style.


Edit: as commented in the other answer, you might want a regular expression that matches an arbitrary amount of whitespace surrounding the curly brace, e.g. \s*{\s*$
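If the indentation problem matters, one option is to run the replacement outside Xcode and capture each line's leading whitespace so it can be reused for the brace. The following Python sketch shows that idea; it is not an Xcode feature, the function name is made up, and it does not try to skip braces inside strings or comments.

```python
import re

# Group 1: the line's indentation. Group 2: the code before the trailing brace.
BRACE_AT_EOL = re.compile(r"^([ \t]*)(\S.*?)[ \t]*\{[ \t]*$", re.MULTILINE)


def braces_to_new_line(source):
    """Move a trailing '{' onto its own line, indented like the line it came from."""
    return BRACE_AT_EOL.sub(r"\1\2\n\1{", source)


code = (
    "- (void)method {\n"
    "    for (obj in collection) {\n"
    '        NSLog(@"%@", obj);\n'
    "    }\n"
    "}\n"
)
print(braces_to_new_line(code))
```

Lines that consist only of a brace are left alone, because the pattern requires some code before the brace.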

Regular Expression in C++

4 votes

Hello, I want to write a C++ library for regular expressions. I know there are many libraries available, but I want to learn the theory behind regular expressions and implement it myself.

Can anybody please guide me on what I should start with?

http://swtch.com/~rsc/regexp/regexp1.html has a good explanation of the two major approaches to regular expressions, their trade-offs, and how to make the faster one (DFAs) usable in a lot of cases that most implementations fail to use them for.

How to match two strings with integers greater than zero using regex?

4 votes

I'm looking for a simple regex to match this:

int.int"

where each integer is greater than 0.

matches:

1.1"
1.5"
5.1"
40.30"
1.29"

mismatches:

1.1
0.4"
4.0"
0.30"
39.0"

You can use the following regex:

^[1-9][0-9]*\.[1-9][0-9]*"$

Rubular Link

^     : Start anchor
[1-9] : Non zero digit
[0-9]*: Zero or more of any digit 0-9
\.    : A literal period
"     : A literal "
$     : End anchor

The anchors are essential. Without them you'll match any string that has the pattern you want anywhere, say foo11.22"bar. With the anchors the regex will try to match the entire string, not just some substring of it.

. is a regex meta character which matches any character (other than newline).
To match a literal . you need to escape it as \..
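The pattern can be checked against the sample lists from the question. This is a small Python sketch; the extra string foo11.22"bar is added to show what the anchors rule out.

```python
import re

INCHES = re.compile(r'^[1-9][0-9]*\.[1-9][0-9]*"$')

should_match = ['1.1"', '1.5"', '5.1"', '40.30"', '1.29"']
should_fail = ['1.1', '0.4"', '4.0"', '0.30"', '39.0"', 'foo11.22"bar']

print(all(INCHES.search(s) for s in should_match))     # True
print(any(INCHES.search(s) for s in should_fail))      # False

# Without the anchors, the embedded 11.22" is found inside the longer string.
print(bool(re.search(r'[1-9][0-9]*\.[1-9][0-9]*"', 'foo11.22"bar')))  # True
```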