Best regex questions in April 2011

regex matching irreducible fraction

14 votes

How can I match an irreducible fraction with a regex?

For example, 23/25, 3/4, 5/2, 100/101, etc.

First of all, I have no idea how to implement a gcd algorithm in a regex.

UPD: for all answers like "You are using the wrong tool".

Yeah, guys, I realise what regex is used for. It's ok. But that is the point of this weird question.

UPD_2: The goal is to find a regex that would work in a situation like:

$> echo "1/2" | grep -P regex
1/2
$> echo "2/4" | grep -P regex

So, the regex should be a single string, without using any scripts or variables. Only regex.

Actually, I already know a regex which matches a reducible fraction written in the unary number system.

$> echo "11/1111" | grep -P '^1/1+$|(11+)+\1+/\1+$'
11/1111

So the task is to convert from the decimal to the unary number system inside the regex, but I don't know how.

UPDATE

Since the poster requested a single regex that matches against strings like "36/270", but says it doesn’t matter how legible it is, that regex is:

my $reducible_rx = qr{^(\d+)/(\d+)$(?(?{(1x$1."/".1x$2)=~m{^(?|1+/(1)|(11+)\1*/\1+)$}})|^)};

But, if like me, you believe that an illegible regex is absolutely unacceptable, you will write that more legibly as:

my $reducible_rx = qr{
  # first match a fraction:
    ^ ( \d+ ) / ( \d+ ) $
  # now for the hard part:
    (?(?{ ( 1 x $1 . "/" . 1 x $2 ) =~ m{
                ^
                (?|    1+      / (1)  # trivial case: GCD=1
                  |  (11+) \1* / \1+  # find the GCD
                )
                 $
            }x
        })
          # more portable version of (*PASS)
     | ^  # more portable version of (*FAIL)
     )
}x;

You can improve maintainability by splitting out the version that matches the unary version from the one that matches the decimal version like this:

# this one assumes unary notation
my $unary_rx = qr{
    ^ 
    (?|   1+       / (1)
      | (11+)  \1* / \1+ 
    ) 
    $
}x;

# this one assumes decimal notation and converts internally
my $decimal_rx = qr{
  # first match a fraction:
    ^ ( \d+ ) / ( \d+ ) $ 
  # now for the hard part:
    (?(?{( 1 x $1 . "/" . 1 x $2 ) =~ $unary_rx})
          # more portable version of (*PASS)
     | ^  # more portable version of (*FAIL) 
     )
}x;

Isn’t that much easier by separating it into two named regexes? That would now make $reducible_rx the same as $decimal_rx, but the unary version is its own thing. That’s how I would do it, but the original poster wanted a single regex, so you’d have to interpolate the nested one for that as I first present above.

Either way, you can plug into the test harness below using:

    if ($frac =~ $reducible_rx) {
        cmp_ok($frac, "ne", reduce($i, $j), "$i/$j is $test");
    } else {
        cmp_ok($frac, "eq", reduce($i, $j), "$i/$j is $test");
    }

And you will see that it is a correct regex that passes all tests, and does so moreover using a single regex, wherefore having now passed all requirements of the original question, I declare Qᴜᴏᴅ ᴇʀᴀᴛ ᴅᴇᴍᴏɴsᴛʀᴀɴᴅᴜᴍ: “Quit, enough done.” 😇

And you’re welcome.


The answer is to match the regex ^(?|1+/(1)|(11+)\1*/\1+)$ against the fraction once it has been converted from decimal to unary notation, at which point the greatest common factor will be found in $1 on a match; otherwise they are coprimes. If you are using Perl 5.14 or better, you can even do this in one step:

use 5.014;
my $reg  = qr{^(?|1+/(1)|(11+)\1*/\1+)$};
my $frac = "36/270";  # for example
if ($frac =~ s/(\d+)/1 x $1/reg =~ /$reg/) { 
    say "$frac can be reduced by ", length $1;
} else {
    say "$frac is irreducible";
}

Which will correctly report that:

36/270 can be reduced by 18

(And of course, reducing by 1 means there is no longer a denominator.)
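For readers without Perl at hand, the same unary trick can be sketched in Python. This is only an illustration under a few assumptions: Python's re module has no branch-reset group (?|...), so the two alternatives capture into separate groups (1 and 2), and the helper name common_factor is made up for this example.

```python
import re

# Unary pattern: either the denominator is exactly 1 (group 1), or a run of
# two or more 1s (group 2) divides both the numerator and the denominator.
# Because (11+) is greedy, the first run that works is the greatest one.
UNARY_RX = re.compile(r"^(?:1+/(1)|(11+)\2*/\2+)$")

def common_factor(frac):
    """Return the factor a decimal fraction can be reduced by, or None."""
    num, den = (int(part) for part in frac.split("/"))
    m = UNARY_RX.match("1" * num + "/" + "1" * den)
    if m is None:
        return None  # numerator and denominator are coprime
    return len(m.group(1) or m.group(2))

print(common_factor("36/270"))  # 18
print(common_factor("3/4"))     # None
```

As in the Perl version, the decimal-to-unary conversion happens outside the pattern, so this does not meet the "only regex" requirement of the question.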

If you wanted to have a bit of punning fun with your readers, you could even do it this way:

use 5.014;
my $regex = qr{^(?|1+/(1)|(11+)\1*/\1+)$};
my $frac  = "36/270";  # for example
if ($frac =~ s/(\d+)/"1 x $1"/regex =~ /$regex/) {
    say "$frac can be reduced by ", length $1;
} else {
    say "$frac is irreducible";
}

Here is the code that demonstrates how to do this. Furthermore, it constructs a test suite that tests its algorithm using all (positive) numerators and denominators up to its argument, or 30 by default. To run it under a test harness, put it in a file named coprimes and do this:

$ perl -MTest::Harness -e 'runtests("coprimes")'
coprimes .. ok       
All tests successful.
Files=1, Tests=900,  1 wallclock secs ( 0.13 usr  0.02 sys +  0.33 cusr  0.02 csys =  0.50 CPU)
Result: PASS

Here is an example of its output when run without the test harness:

$ perl coprimes 10
1..100
ok 1 - 1/1 is 1
ok 2 - 1/2 is 1/2
ok 3 - 1/3 is 1/3
ok 4 - 1/4 is 1/4
ok 5 - 1/5 is 1/5
ok 6 - 1/6 is 1/6
ok 7 - 1/7 is 1/7
ok 8 - 1/8 is 1/8
ok 9 - 1/9 is 1/9
ok 10 - 1/10 is 1/10
ok 11 - 2/1 is 2
ok 12 - 2/2 is 1
ok 13 - 2/3 is 2/3
ok 14 - 2/4 is 1/2
ok 15 - 2/5 is 2/5
ok 16 - 2/6 is 1/3
ok 17 - 2/7 is 2/7
ok 18 - 2/8 is 1/4
ok 19 - 2/9 is 2/9
ok 20 - 2/10 is 1/5
ok 21 - 3/1 is 3
ok 22 - 3/2 is 3/2
ok 23 - 3/3 is 1
ok 24 - 3/4 is 3/4
ok 25 - 3/5 is 3/5
ok 26 - 3/6 is 1/2
ok 27 - 3/7 is 3/7
ok 28 - 3/8 is 3/8
ok 29 - 3/9 is 1/3
ok 30 - 3/10 is 3/10
ok 31 - 4/1 is 4
ok 32 - 4/2 is 2
ok 33 - 4/3 is 4/3
ok 34 - 4/4 is 1
ok 35 - 4/5 is 4/5
ok 36 - 4/6 is 2/3
ok 37 - 4/7 is 4/7
ok 38 - 4/8 is 1/2
ok 39 - 4/9 is 4/9
ok 40 - 4/10 is 2/5
ok 41 - 5/1 is 5
ok 42 - 5/2 is 5/2
ok 43 - 5/3 is 5/3
ok 44 - 5/4 is 5/4
ok 45 - 5/5 is 1
ok 46 - 5/6 is 5/6
ok 47 - 5/7 is 5/7
ok 48 - 5/8 is 5/8
ok 49 - 5/9 is 5/9
ok 50 - 5/10 is 1/2
ok 51 - 6/1 is 6
ok 52 - 6/2 is 3
ok 53 - 6/3 is 2
ok 54 - 6/4 is 3/2
ok 55 - 6/5 is 6/5
ok 56 - 6/6 is 1
ok 57 - 6/7 is 6/7
ok 58 - 6/8 is 3/4
ok 59 - 6/9 is 2/3
ok 60 - 6/10 is 3/5
ok 61 - 7/1 is 7
ok 62 - 7/2 is 7/2
ok 63 - 7/3 is 7/3
ok 64 - 7/4 is 7/4
ok 65 - 7/5 is 7/5
ok 66 - 7/6 is 7/6
ok 67 - 7/7 is 1
ok 68 - 7/8 is 7/8
ok 69 - 7/9 is 7/9
ok 70 - 7/10 is 7/10
ok 71 - 8/1 is 8
ok 72 - 8/2 is 4
ok 73 - 8/3 is 8/3
ok 74 - 8/4 is 2
ok 75 - 8/5 is 8/5
ok 76 - 8/6 is 4/3
ok 77 - 8/7 is 8/7
ok 78 - 8/8 is 1
ok 79 - 8/9 is 8/9
ok 80 - 8/10 is 4/5
ok 81 - 9/1 is 9
ok 82 - 9/2 is 9/2
ok 83 - 9/3 is 3
ok 84 - 9/4 is 9/4
ok 85 - 9/5 is 9/5
ok 86 - 9/6 is 3/2
ok 87 - 9/7 is 9/7
ok 88 - 9/8 is 9/8
ok 89 - 9/9 is 1
ok 90 - 9/10 is 9/10
ok 91 - 10/1 is 10
ok 92 - 10/2 is 5
ok 93 - 10/3 is 10/3
ok 94 - 10/4 is 5/2
ok 95 - 10/5 is 2
ok 96 - 10/6 is 5/3
ok 97 - 10/7 is 10/7
ok 98 - 10/8 is 5/4
ok 99 - 10/9 is 10/9
ok 100 - 10/10 is 1

And here is the program:

#!/usr/bin/env perl
#
# coprimes - test suite to use unary coprimality algorithm
# 
# Tom Christiansen <tchrist@perl.com>
# Sun Apr 17 12:18:19 MDT 2011

use strict;
use warnings;

my $DEFAULT = 2*3*5;
my $max = @ARGV ? shift : $DEFAULT;

use Test::More;
plan tests => $max ** 2;

my $rx = qr{
    ^
    (?|   1+       / (1)
      | (11+)  \1* / \1+
    )
    $
}x;

for my $i ( 1 .. $max ) {
    for my $j ( 1 .. $max ) {
        my $test;
        if (((1 x $i) . "/" . (1 x $j)) =~ /$rx/) {
            my $cf = length($1);
            $test = $i / $cf;
            $test .= "/" . $j/$cf unless $j/$cf == 1;
        } else {
            $test = "$i/$j";
        }
        cmp_ok($test, "eq", reduce($i, $j), "$i/$j is $test");
    }
}

sub reduce {
    my ($num, $den) = @_;   # avoid $a and $b, which sort() treats specially
    use Math::BigRat;
    my $f = Math::BigRat->new("$num/$den");   # stringifies in lowest terms
    return "$f";
}

Finding strings that differ with at most one letter from a given string in SAS with PROC SQL

10 votes

First some context. I am using proc sql in SAS, and need to fetch all the entries in a data set (with a couple of million entries) that have variable "Name" equal to (let's say) "Massachusetts". Of course, since the data was once manually entered by humans, close to all conceivable spelling errors occur ("Amssachusetts", "Kassachusetts" etc.).

I have found that few entries get more than two characters wrong, so the code

Name like "__ssachusetts" OR Name like "_a_sachusetts" OR ... OR Name like "Massachuset__"

would select the entries I am looking for. However, I am hoping that there must be a more convenient way to write

Name that differs by at most 2 characters from "Massachusetts";

Is there? Or is there some other strategy for fetching these entries? I tried searching both stackoverflow and the web but was unsuccessful. I am also a relative beginner with both SQL and SAS.

Some additional information: The database is not in English (and the actual string is not "Massachusetts") so using SOUNDEX is not really feasible (if it ever were).

Thanks in advance.

(Edit: Improved the title)

SAS has built-in functions COMPGED and COMPLEV to compute distances between strings. Here is an example that shows how to select just those with a Levenshtein edit distance of less than or equal to 2.

data typo;
input name $20.;
datalines;
massachusetts
masachusets
mssachusetts
nassachusets
nassachussets
massachusett
;

proc sql;
  select name from typo
  where complev(name, "massachusetts") <= 2;
quit;
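To make clear what a Levenshtein distance of at most 2 selects, here is a minimal Python sketch of the textbook dynamic-programming algorithm. It is not SAS's implementation of COMPLEV; the function name levenshtein is chosen for this example, and the sample names are the ones from the data step above.

```python
def levenshtein(a, b):
    """Edit distance counting single-character inserts, deletes, substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]                   # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

names = ["massachusetts", "masachusets", "mssachusetts",
         "nassachusets", "nassachussets", "massachusett"]
close = [n for n in names if levenshtein(n, "massachusetts") <= 2]
print(close)  # every name except "nassachussets", which is 3 edits away
```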

Java: How do I determine why a regular expression pattern match fails?

8 votes

I am using a regular expression to test whether a string matches a pattern, but when the match fails I also want to know why it failed.

For example, say I have a pattern of "N{1,3}Y". I match it against string "NNNNY". I would like to know that it failed because there were too many Ns. Or if I match it against string "XNNY", I would like to know that it failed because an invalid character "X" was in the string.

From looking at the Java regular expression package API (java.util.regex), additional information only seems to be available from the Matcher class when the match succeeds.

Is there a way to resolve this issue? Or is regular expression even an option in this scenario?

I guess you should use a parser, rather than simple regular expressions.

Regular expressions are good at providing matches for strings, but not so good at providing NON-matches, let alone explaining why a match failed.
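As a rough idea of what such a parser could look like for the pattern N{1,3}Y from the question, here is a small hand-written checker, sketched in Python rather than Java. It assumes the whole string has to match, and the function name and the wording of the messages are invented for this example.

```python
def explain_mismatch(text, max_n=3):
    """Check text against N{1,max_n}Y; return None on success, else a reason."""
    n_count = 0
    for pos, ch in enumerate(text):
        if ch == "N":
            n_count += 1
            if n_count > max_n:
                return f"too many Ns: more than {max_n} (position {pos})"
        elif ch == "Y":
            if n_count == 0:
                return f"'Y' at position {pos} is not preceded by any N"
            if pos != len(text) - 1:
                return f"unexpected text after 'Y' at position {pos + 1}"
            return None  # exactly N{1,max_n}Y
        else:
            return f"invalid character {ch!r} at position {pos}"
    return "input ended before the closing 'Y'"

print(explain_mismatch("NNY"))    # None
print(explain_mismatch("NNNNY"))  # too many Ns: more than 3 (position 3)
print(explain_mismatch("XNNY"))   # invalid character 'X' at position 0
```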

Is there a common/standard subset of Regular Expressions?

8 votes

Do the "control characters" used in regular expressions differ a lot among different implementations of regex parsers (e.g. regex in Ruby, Java, C#, sed etc.)?

For example, in Ruby, the \D means not a digit; does it mean the same in Java, C# and sed? I guess what I'm asking is, is there a "standard" for regex'es that all regex parsers support?

If not, is there some common subset that should be learned and mastered (and then learn the parser-specific ones as they're encountered) ?

See the list of basic syntax on regular-expressions.info.

And a comparison of the different "flavors".

Simple integer regular expression

7 votes

I have ValidationRegularExpression="[0-9]" which only allows a single character. How do I make it allow between (and including) 1 and 7 digits? I tried [0-9]{1-7} but it didn't work.

You got the syntax almost correct: [0-9]{1,7}.

You can make your solution a bit more elegant (and culture-sensitive) by replacing [0-9] with the generic character group "decimal digit": \d (remember that other languages might use different characters for digits than 0-9).
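The difference between the two spellings can be checked quickly. The sketch below uses Python instead of ASP.NET, so it only illustrates the idea: in Python 3 string patterns, \d matches any Unicode decimal digit, and fullmatch is used so that the whole input has to match.

```python
import re

ASCII_DIGITS = re.compile(r"[0-9]{1,7}")
ANY_DIGITS = re.compile(r"\d{1,7}")

arabic_indic = "\u0661\u0662\u0663"  # the digits 1, 2, 3 in Arabic-Indic script

print(bool(ASCII_DIGITS.fullmatch("1234567")))     # True
print(bool(ASCII_DIGITS.fullmatch("12345678")))    # False: eight digits
print(bool(ASCII_DIGITS.fullmatch(arabic_indic)))  # False: not 0-9
print(bool(ANY_DIGITS.fullmatch(arabic_indic)))    # True
```

Whether accepting non-ASCII digits is desirable depends on what the validated value is used for afterwards.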

And here's the documentation for future reference:

Japanese/chinese email addresses?

7 votes

I'm making a site which must be fully Unicode. The database etc. are working; I only have one small logic error. I'm testing via Ajax whether the fields of my registration form are valid, and I check the email field with regular expressions.

However, if a user has an email address like 日本人@日人日本人.com, it isn't coming through.

  1. Do email addresses of this type exist?

  2. Are email addresses always like this? (a-z A-Z 0-9) @ (a-z A-Z 0-9).(a-z A-Z 0-9)

As per RFC 5322 ("Internet Message Format"), section 3.4.1 ("Addr-Spec Specification") you can't use non-US-ASCII characters such as those you've listed. However, characters such as...

! # $ % & ' * + - / = ? ^ _ ` { | } ~

...are legal, as well as the full stop/period character, as long as it is not the first or last character and never appears twice in a row.

For more information see the above RFC and indeed the Wikipedia article on email addresses, specifically the "syntax" section.
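A minimal Python sketch of that rule for the part before the @ could look as follows. It covers only the unquoted "dot-atom" form of RFC 5322 (quoted strings and comments are ignored), and the names ATEXT, DOT_ATOM and is_dot_atom are chosen for this example.

```python
import re

# atext characters from RFC 5322, section 3.2.3 (ASCII only).
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"

# One or more atext runs separated by single dots: no leading, trailing
# or doubled dot.
DOT_ATOM = re.compile(rf"{ATEXT}+(?:\.{ATEXT}+)*")

def is_dot_atom(local_part):
    return DOT_ATOM.fullmatch(local_part) is not None

print(is_dot_atom("john.doe"))   # True
print(is_dot_atom("john..doe"))  # False: two dots in a row
print(is_dot_atom("日本人"))      # False: not US-ASCII
```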

UPDATE

There's also a newer (albeit experimental) RFC 5336 which handles the now legitimate international domains containing UTF-8 characters, etc.

Declaration to make PHP script completely Unicode-friendly

7 votes

Remembering to do all the stuff you need to do in PHP to get it to work properly with Unicode is far too tricky, tedious, and error-prone, so I'm looking for the trick to get PHP to magically upgrade absolutely everything it possibly can from musty old ASCII byte mode into modern Unicode character mode, all at once and by using just one simple declaration.

The idea is to modernize PHP scripts to work with Unicode without having to clutter up the source code with a bunch of confusing alternate function calls and special regexes. Everything should just “Do The Right Thing” with Unicode, no questions asked.

Given that the goal is maximum Unicodeness with minimal fuss, this declaration must at least do these things (plus anything else I’ve forgotten that furthers the overall goal):

  • The PHP script source is itself considered to be in UTF‑8 (eg, strings and regexes).

  • All input and output is automatically converted to/from UTF‑8 as needed, and with a normalization option (eg, all input normalized to NFD and all output normalized to NFC).

  • All functions with Unicode versions use those instead (eg, Collator::sort for sort).

  • All byte functions (eg, strlen, strstr, strpos, and substr) work like the corresponding character functions (eg, mb_strlen, mb_strstr, mb_strpos, and mb_substr).

  • All regexes and regexy functions transparently work on Unicode (ie, like all the preggers have /u tacked on implicitly, and things like \w and \b and \s all work on Unicode the way The Unicode Standard requires them to work, etc).

For extra credit :), I'd like there to be a way to “upgrade” this declaration to full grapheme mode. That way the byte or character functions become grapheme functions (eg, grapheme_strlen, grapheme_strstr, grapheme_strpos, and grapheme_substr), and the regex stuff works on proper graphemes (ie, . — or even [^abc] — matches a Unicode grapheme cluster no matter how many code points it contains, etc).

That full-Unicode thing was precisely the idea of PHP 6 -- which was canceled more than a year ago.

So, no, there is no way of getting all that -- except by using the right functions, and remembering that characters are not the same as bytes.


One thing that might help with your fourth point, though, is the Function Overloading Feature of the mbstring extension (quoting):

mbstring supports a 'function overloading' feature which enables you to add multibyte awareness to such an application without code modification by overloading multibyte counterparts on the standard string functions.
For example, mb_substr() is called instead of substr() if function overloading is enabled.
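The distinction between bytes and characters that the answer stresses is not specific to PHP. The following Python lines are only meant to show the three different "lengths" one string can have; the sample word is arbitrary.

```python
import unicodedata

word = "na\u00efve"               # "naïve" with a precomposed ï (U+00EF)
print(len(word))                  # 5 code points
print(len(word.encode("utf-8")))  # 6 bytes, because ï takes two bytes in UTF-8

decomposed = unicodedata.normalize("NFD", word)
print(len(decomposed))            # 6 code points: i followed by a combining diaeresis
print(unicodedata.normalize("NFC", decomposed) == word)  # True
```

A byte-oriented function sees 6 units, a code-point-oriented one sees 5 or 6 depending on normalization, and a reader sees 5 letters either way, which is the grapheme count the question asks about for extra credit.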

What is the reason behind the advice that the substrings in regex should be ordered based on length?

7 votes

longest first

>>> p = re.compile('supermanutd|supermanu|superman|superm|super')

shortest first

>>> p = re.compile('super|superm|superman|supermanu|supermanutd')

Why is the longest first regex preferred?

Alternatives in a regex are tested in the order you provide them, so if the first branch matches, the regex doesn't check the other branches. This doesn't matter if you only need to test for a match, but if you want to extract text based on the match, then it matters.

You only need to sort by length when your shorter strings are substrings of longer ones. For example when you have text:

supermanutd
supermanu
superman
superm

then with your first Rx you'll get:

>>> regex.findall(string)
[u'supermanutd', u'supermanu', u'superman', u'superm']

but with second Rx:

>>> regex.findall(string)
[u'super', u'super', u'super', u'super']
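The same comparison as a self-contained script, written for Python 3 (so the results print without the u prefix). The variable names are chosen for this example.

```python
import re

text = "supermanutd\nsupermanu\nsuperman\nsuperm"

longest_first = re.compile("supermanutd|supermanu|superman|superm|super")
shortest_first = re.compile("super|superm|superman|supermanu|supermanutd")

print(longest_first.findall(text))
# ['supermanutd', 'supermanu', 'superman', 'superm']
print(shortest_first.findall(text))
# ['super', 'super', 'super', 'super']
```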

Test your regexes with http://www.pythonregex.com/

string.match(regex) vs regex.match(string)

6 votes

What's the difference between string.match(regex) and regex.match(string) in Ruby? What's the justification for having both those constructs in the language?

I think that, intuitively, match, or the related method =~, expresses some kind of equality, as reflected in the fact that =~ combines the equality sign = and the equivalence sign ~ (in mathematics, not in Ruby). It is not really an equivalence relation, but of the three axioms of equality (reflexivity, symmetry, transitivity), symmetry in particular seems reasonable to maintain in this relation: it is natural for a programmer to expect that string.match(regex) or string =~ regex would mean the same thing as regex.match(string) or regex =~ string. I myself would have trouble remembering which form is defined if only one of them were. In fact, some people find it strange that the method ===, which also suggests some kind of equality, is not symmetric, and have raised questions about it.

Scala regex replace with anonymous function

6 votes

In Ruby I can replace characters in a string in the following way:

a = "one1two2three"
a.gsub(/\d+/) {|e| e.to_i + 1}
=> "one2two3three"

The result of evaluating the block from the second line, will replace what was matched in the pattern. Can we do something equivalent in Scala? Replace something in a regex with the results of a function/anonymous function?

Yes, Regex#replaceAllIn has an overloaded version that takes a function Match => String. The equivalent Scala version of your code would be:

"""\d+""".r.replaceAllIn("one1two2three", m => (m.group(0).toInt + 1).toString)

How to get sentence number from input?

6 votes

Hi there,

It seems hard to detect a sentence boundary in a text. Punctuation marks like . ! ? may be used to delimit sentences, but that is not very accurate, since there may be ambiguous words and abbreviations such as U.S.A or Prof. or Dr. I am studying the Tperlregex library and Regular Expression Cookbook by Jan Goyvaerts, but I do not know how to write an expression that detects a sentence.

What would be a comparatively accurate expression using Tperlregex in Delphi?

Thanks

First, you probably need to arrive at your own definition of what a "sentence" is, then implement that definition. For example, how about:

He said: "It's OK!"

Is it one sentence or two? A general answer is irrelevant. Decide whether you want to interpret it as one or two sentences, and proceed accordingly.

Second, I don't think I'd be using regular expressions for this. Instead, I would scan each character and try to detect sequences. A period by itself may not be enough to delimit a sentence, but a period followed by whitespace or carriage return (or end of string) probably does. This immediately lets you weed out U.S.A (periods not followed by whitespace).

For common abbreviations like Prof. and Dr. it may be a good idea to create a dictionary - perhaps editable by your users, since each language will have its own set of common abbreviations.

Each language will have its own set of punctuation rules too, which may affect how you interpret punctuation characters. For example, English tends to put a period inside the parentheses (like this.) while Polish does the opposite (like this). The same difference will apply to double quotes, single quotes (some languages don't use them at all, sometimes they are indistinguishable from apostrophes etc.). Your rules may well have to be language-specific, at least in part.

In the end, you may approximate the human way of delimiting sentences, but there will always be cases that can throw the analysis off. For example, assuming that you have a dictionary that recognizes "Prof." as an abbreviation, what are you going to do about

Most people called him Professor Jones, but to me he was simply The Prof.

Even if you have another sentence that follows and starts with a capital letter, that still won't help you know where the sentence ends, because it might as well be

Most people called him Professor Jones, but to me he was simply Prof. Bill.
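The character-scanning approach described above can be sketched roughly as follows. This is Python rather than Delphi, and only a rough illustration under stated assumptions: the abbreviation set is a made-up stand-in for the user-editable dictionary, and the function name split_sentences is invented here.

```python
ABBREVIATIONS = {"prof.", "dr."}  # hypothetical, user-editable dictionary

def split_sentences(text):
    sentences, start = [], 0
    for i, ch in enumerate(text):
        if ch not in ".!?":
            continue
        at_end = i == len(text) - 1
        if not at_end and not text[i + 1].isspace():
            continue  # not followed by whitespace, e.g. inside "U.S.A"
        words = text[start:i + 1].split()
        if ch == "." and not at_end and words[-1].lower() in ABBREVIATIONS:
            continue  # "Prof." or "Dr." followed by more text
        sentences.append(text[start:i + 1].strip())
        start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)  # trailing text without closing punctuation
    return sentences

print(split_sentences("Dr. Smith lives in the U.S.A and works hard. He said hi!"))
# ['Dr. Smith lives in the U.S.A and works hard.', 'He said hi!']
```

The sketch shares the weakness discussed above: in "He was simply The Prof. He left." it treats "Prof." as an abbreviation and returns a single sentence.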

What happens when you compile regular expressions?

6 votes

We all know that you can compile your much-used regular expressions into something that performs very well. But what is this witchcraft happening behind the curtains?

I guess that a finite state automaton gets built there, but you must know better than me.

The details of regular expression compilation vary by implementation. For example, compilation in Python or re2 simply creates an instance of a regular expression object. This object's state machine may be modeled as a graph or virtual machine. Without compilation (example: RE.match(expression, input)), a regular expression object has to be created (or, in implementations that keep a cache, looked up) behind the scenes each time match is called. This is needless work if you're going to use an expression more than once.
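In Python, compiling once and reusing the object looks like the short sketch below. The pattern and the sample lines are arbitrary. CPython's re module also keeps an internal cache of recently compiled patterns, so the module-level functions usually cost a lookup rather than a full recompilation.

```python
import re

# Compile once, outside the loop, and reuse the resulting pattern object.
WORD_RX = re.compile(r"\w+")

lines = ["compile once", "use it many times", ""]
counts = [len(WORD_RX.findall(line)) for line in lines]
print(counts)  # [2, 4, 0]
```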

In C#, one of three things can happen when you compile:

  1. A regular expression object is created (implemented as a virtual machine) similar to Python and re2.
  2. A regular expression object is created and its virtual machine opcodes are compiled to in-memory IL instructions on-the-fly.
  3. A regular expression object is created and its virtual machine opcodes are compiled to disk as IL instructions.

You mention an interest in algorithms. Take a look at Russ Cox's excellent articles for two approaches:

How to determine if a non-English string is in upper case?

6 votes

I'm using the following code to check for a string where all the characters are upper-case letters:

        if (preg_match('/^[\p{Lu}]+$/', $word)) {

This works great for English, but fails to detect letters with accents, Russian letters, etc. Is \p{Lu} supposed to work for all languages? Is there a better approach?

A special option is /u, which turns on the Unicode matching mode instead of the default 8-bit matching mode. You should specify /u for regular expressions that use \x{FFFF}, \X or \p{L} to match Unicode characters, graphemes, properties or scripts. PHP will interpret '/regex/u' as a UTF-8 string rather than as an ASCII string.

http://www.regular-expressions.info/php.html

Applied to the pattern in the question, that means writing '/^[\p{Lu}]+$/u'.

C# regular expression to match ANY character?

6 votes

In C#, I write the following string to a string variable, carriage return and all:

asdfasdfasdf
asdfas<test>asdfasdf

asdfasdf<test>asdfasdf

In Notepad2, I use this regular expression:

<test>.*<test>

It selects this text as expected:

<test>asdfasdf

asdfasdf<test>

However, when I do this in C#:

System.Text.RegularExpressions.Regex.Replace(s, "<test>.*<test>", string.Empty);

It doesn't remove the string. However, when I run this code on a string without any carriage returns, it does work.

So what I am looking for is a regex that will match ANY character, regardless whether or not it is a control code or a regular character.

You forgot to specify that the Regex operation (specifically, the . operator) should match all characters (not all characters except \n):

System.Text.RegularExpressions.Regex.Replace(s, "<test>.*<test>", string.Empty, RegexOptions.Singleline);

All you needed to add was RegexOptions.Singleline.

Search/Replace in Vim

6 votes

I want to delete all occurrences of square brackets that conform to this regex: \[.*\].*{, but I only want to delete the brackets, not what follows - i.e., I want to delete the brackets and what's inside them, only when they are followed by an opening curly brace.

How do I do that with Vim's search/replace?

You can use \zs and \ze to set the beginning and the end of the match.

:%s/\zs\[.*\]\ze.*{//g should work.

You are telling Vim to replace what is between \zs and \ze by an empty string.

(Note that you need the +syntax option compiled in your Vim binary)

For more information, see :help /\zs or :help pattern

Edit : Actually \zs is not necessary in this case but I leave it for educational purpose. :)

Changing the RegExp flags

6 votes

So basically I wrote myself this function so as to be able to count the number of occurrences of a Substring in a String:

String.prototype.numberOf = function(needle) {
  var num = 0,
      lastIndex = 0;
  if(typeof needle === "string" || needle instanceof String) {
    while((lastIndex = this.indexOf(needle, lastIndex) + 1) > 0)
      {num++;} return num;
  } else if(needle instanceof RegExp) {
    // needle.global = true;
    return this.match(needle).length;
  } return 0;
};

The method itself performs rather well, and the RegExp and String based searches are quite comparable in execution time (both ~2ms on the entire text of Ray Bradbury's "Fahrenheit 451", searching for all the "the"s).

What sort of bothers me, though, is the impossibility of changing the flags of the supplied RegExp instance. There is no point in calling String.prototype.match in this function without the global flag of the supplied regular expression set to true, as it would then only note the first occurrence. You could certainly set the flag manually on each RegExp passed to the function; I'd however prefer being able to clone the supplied regular expression and then manipulate its flags.

Astonishingly enough, I'm not permitted to do so, as the RegExp.prototype.global flag (more precisely, all flags) appears to be read-only. Hence the commented-out line 8.

So my question is: Is there a nice way of changing the flags of a RegExp object?

I don't really wanna do stuff like this:

if(!expression.global)
  expression = eval(expression.toString() + "g");

Some implementations might not even support RegExp.prototype.toString and simply inherit it from Object.prototype, or it could use a different format entirely. And it just seems like bad coding practice to begin with.

First, your current code does not work correctly when needle is a regex which does not match. The problem is the following line:

return this.match(needle).length;

The match method returns null when there is no match. A JavaScript error is then generated when the length property of null is (unsuccessfully) accessed. This is easily fixed like so:

var m = this.match(needle);
return m ? m.length : 0;

Now to the problem at hand. You are correct when you say that global, ignoreCase and multiline are read only properties. The only option is to create a new RegExp. This is easily done since the regex source string is stored in the re.source property. Here is a tested modified version of your function which corrects the problem above and creates a new RegExp object when needle does not already have its global flag set:

String.prototype.numberOf = function(needle) {
    var num = 0,
    lastIndex = 0;
    if (typeof needle === "string" || needle instanceof String) {
        while((lastIndex = this.indexOf(needle, lastIndex) + 1) > 0)
            {num++;} return num;
    } else if(needle instanceof RegExp) {
        if (!needle.global) {
            // If global flag not set, create new one.
            var flags = "g";
            if (needle.ignoreCase) flags += "i";
            if (needle.multiline) flags += "m";
            needle = RegExp(needle.source, flags);
        }
        var m = this.match(needle);
        return m ? m.length : 0;
    }
    return 0;
};

Emacs regular expression: what \< and \> can do that \b cannot do?

6 votes

Regexp Backslash - GNU Emacs Manual says that \< matches at the beginning of a word, \> matches at the end of a word, and \b matches a word boundary. \b is just as in other non-Emacs regular expressions. But it seems that \< and \> are particular to Emacs regular expressions. Are there cases where \< and \> are needed instead of \b? For instance, \bword\b would match the same as \<word\> would, and the only difference is that the latter is more readable.

You can get unexpected results if you assume they behave the same.
What can \< and \> do that \b cannot do?
The answer is that \< and \> are explicit: this end of a word, and only this end!
\b is general: either end of a word will match.

GNU Operators * Word Operators

line="cat dog sky"  
echo "$line" |sed -n "s/\(.*\)\b\(.*\)/# |\1|\2|/p"
echo "$line" |sed -n "s/\(.*\)\>\(.*\)/# |\1|\2|/p"
echo "$line" |sed -n "s/\(.*\)\<\(.*\)/# |\1|\2|/p"
echo
line="cat  dog  sky"  
echo "$line" |sed -n "s/\(.*\)\b\(.*\)/# |\1|\2|/p"
echo "$line" |sed -n "s/\(.*\)\>\(.*\)/# |\1|\2|/p"
echo "$line" |sed -n "s/\(.*\)\<\(.*\)/# |\1|\2|/p"
echo
line="cat  dog  sky  "  
echo "$line" |sed -n "s/\(.*\)\b\(.*\)/# |\1|\2|/p"
echo "$line" |sed -n "s/\(.*\)\>\(.*\)/# |\1|\2|/p"
echo "$line" |sed -n "s/\(.*\)\<\(.*\)/# |\1|\2|/p"
echo

output

# |cat dog |sky|
# |cat dog| sky|
# |cat dog |sky|

# |cat  dog  |sky|
# |cat  dog|  sky|
# |cat  dog  |sky|

# |cat  dog  sky|  |
# |cat  dog  sky|  |
# |cat  dog  |sky  |