Best regex questions in January 2012

Why can regular expressions have an exponential running time?

18 votes

It is possible to write a regex that needs exponential running time in some cases. One such example is (aa|aa)*: given an input consisting of an odd number of a's, matching takes exponential time.

It is easy to test this. If the input contains only a's and has length 51, the regex needs some seconds to compute (on my machine). If the input length is 52 instead, its computing time is not noticeable (I tested this with the built-in regex parser of the Java RE).

I have written a regex parser to find the reason for this behavior, but I didn't find it. My parser can build an AST or an NFA based on a regex. After that it can translate the NFA to a DFA, using the powerset construction algorithm.

When I parse the regex mentioned above, the parser creates an NFA with 7 states; after conversion there are only 3 states left in the DFA. The DFA represents the more sensible regex (aa)*, which can be matched very quickly.

Thus, I don't understand why there are parsers which can be so slow. What is the reason for this? Do they not translate the NFA to a DFA? If they don't, why not? And what are the technical reasons they compute so slowly?

Russ Cox has a very detailed article about why this is and the history of regexes (part 2, part 3).

Regular expression matching can be simple and fast, using finite automata-based techniques that have been known for decades. In contrast, Perl, PCRE, Python, Ruby, Java, and many other languages have regular expression implementations based on recursive backtracking that are simple but can be excruciatingly slow. With the exception of backreferences, the features provided by the slow backtracking implementations can be provided by the automata-based implementations at dramatically faster, more consistent speeds.

Largely, it comes down to proliferation of non-regular features in "regular" expressions such as backreferences, and the (continued) ignorance of most programmers that there are better alternatives for regexes that do not contain such features (which is many of them).

While writing the text editor sam in the early 1980s, Rob Pike wrote a new regular expression implementation, which Dave Presotto extracted into a library that appeared in the Eighth Edition. Pike's implementation incorporated submatch tracking into an efficient NFA simulation but, like the rest of the Eighth Edition source, was not widely distributed. Pike himself did not realize that his technique was anything new. Henry Spencer reimplemented the Eighth Edition library interface from scratch, but using backtracking, and released his implementation into the public domain. It became very widely used, eventually serving as the basis for the slow regular expression implementations mentioned earlier: Perl, PCRE, Python, and so on. (In his defense, Spencer knew the routines could be slow, and he didn't know that a more efficient algorithm existed. He even warned in the documentation, “Many users have found the speed perfectly adequate, although replacing the insides of egrep with this code would be a mistake.”) Pike's regular expression implementation, extended to support Unicode, was made freely available with sam in late 1992, but the particularly efficient regular expression search algorithm went unnoticed.
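
To make the difference concrete, here is a minimal Python sketch (a toy model, not how any real engine is implemented) that counts the steps a naive backtracking matcher takes for (aa|aa)* against a string of a's, next to a memoized variant that never explores the same input position twice, which is roughly the idea behind the automata-based approach. The function names are made up for this illustration.

```python
from functools import lru_cache


def backtracking_steps(text):
    """Count the calls a naive backtracking matcher makes for (aa|aa)* on text."""
    steps = 0

    def match_from(i):
        nonlocal steps
        steps += 1
        if i == len(text):
            return True  # consumed the whole input: full match
        # Try both (identical) alternatives, backing up when one fails.
        for alternative in ("aa", "aa"):
            if text.startswith(alternative, i) and match_from(i + len(alternative)):
                return True
        return False

    return match_from(0), steps


def memoized_steps(text):
    """Same matcher, but each input position is explored at most once."""
    steps = 0

    @lru_cache(maxsize=None)
    def match_from(i):
        nonlocal steps
        steps += 1
        if i == len(text):
            return True
        return any(
            text.startswith(alternative, i) and match_from(i + len(alternative))
            for alternative in ("aa", "aa")
        )

    return match_from(0), steps


for n in (19, 21, 23):
    print(n, backtracking_steps("a" * n)[1], memoized_steps("a" * n)[1])
```

For odd lengths the backtracking count roughly doubles with every two extra characters (1023, 2047, 4095 steps for lengths 19, 21, 23), while the memoized count grows by one. For even lengths both succeed on the first path they try.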

Regular expression doesn't match empty string in multiline mode (Java)

11 votes

I just observed this behavior;

Pattern p1 = Pattern.compile("^$");
Matcher m1 = p1.matcher("");
System.out.println(m1.matches()); /* true */

Pattern p2 = Pattern.compile("^$", Pattern.MULTILINE);
Matcher m2 = p2.matcher("");
System.out.println(m2.matches()); /* false */

It strikes me as odd that the last statement is false. This is what the docs say;

By default, the regular expressions ^ and $ ignore line terminators and only match at the beginning and the end, respectively, of the entire input sequence. If MULTILINE mode is activated then ^ matches at the beginning of input and after any line terminator except at the end of input. When in MULTILINE mode $ matches just before a line terminator or the end of the input sequence. http://docs.oracle.com/javase/1.4.2...

From what I get from this, it should match? The following makes things even more confusing;

Pattern p3 = Pattern.compile("^test$");
Matcher m3 = p3.matcher("test");
System.out.println(m3.matches()); /* true */

Pattern p4 = Pattern.compile("^test$", Pattern.MULTILINE);
Matcher m4 = p4.matcher("test");
System.out.println(m4.matches()); /* true */

So what is this? How do I make sense of this? I hope someone can shed some light on this, would be really appreciated.

If MULTILINE mode is activated then ^ matches at the beginning of input and after any line terminator except at the end of input.

Since you are at the end of input, ^ can't match in multiline mode.

This is surprising, even disgusting, but nevertheless in accordance with the documentation.

How often does JavaScript recompile regex literals in functions?

10 votes

Given this function:

function doThing(values,things){
  var thatRegex = /^http:\/\//i; // is this created once or on every execution?
  if (values.match(thatRegex)) return values;
  return things;
}

How often does the JavaScript engine have to create the regex? Once per execution or once per page load/script parse?

To prevent needless answers or comments, I personally favor putting the regex outside the function, not inside. The question is about the behavior of the language, because I'm not sure where to look this up, or if this is an engine issue.


EDIT:

I was reminded I didn't mention that this was going to be used in a loop. My apologies:

var newList = [];
ListOfItems1.forEach(function (item1) {
  ListOfItems2.forEach(function (item2) {
    newList.push(doThing(item1, item2));
  });
});

So, given that it's going to be used many times in a loop, it makes sense to define the regex outside the function, but that's the idea.

Also note that the script is rather genericized for the purpose of examining only the behavior and cost of the regex creation.

There are two "regular expression" type objects in javascript. Regular expression instances and the RegExp object.

Also, there are two ways to create regular expression instances:

  1. using the /regex/ syntax and
  2. using new RegExp('regex');

Each of these will create a new regular expression instance each time.

However there is only ONE global RegExp object.

var input = 'abcdef';
var r1 = /(abc)/;
var r2 = /(def)/;
r1.exec(input);
alert(RegExp.$1); //outputs 'abc'
r2.exec(input);
alert(RegExp.$1); //outputs 'def'

The actual pattern is compiled as the script is loaded when you use Syntax 1

The pattern argument is compiled into an internal format before use. For Syntax 1, pattern is compiled as the script is loaded. For Syntax 2, pattern is compiled just before use, or when the compile method is called.

But you could still get a different regular expression instance on each call. Test in Chrome vs. Firefox:

function testregex() {
    var localreg = /abc/;
    if (testregex.reg != null){
        alert(localreg === testregex.reg);
    };
    testregex.reg = localreg;
}
testregex();
testregex();

It's VERY little overhead, but if you want exactly one regex, it's safest to create a single instance outside of your function.

Why does the "g" modifier give different results when test() is called twice?

10 votes

Given this code:

var reg = /a/g;
console.log(reg.test("a"));
console.log(reg.test("a"));

I get this result:

true
false

I have no idea how this could happen. I have tested in both Node.js (v8) and Firefox browser.

To work around the problem, you can remove the g flag or reset lastIndex, as in

var reg = /a/g;
console.log(reg.test("a"));
reg.lastIndex = 0;
console.log(reg.test("a"));

The problem arises because test is based around exec which looks for more matches after the first if passed the same string and the g flag is present.

15.10.6.3 RegExp.prototype.test(string)

The following steps are taken:

  1. Let match be the result of evaluating the RegExp.prototype.exec (15.10.6.2) algorithm upon this RegExp object using string as the argument.
  2. If match is not null, then return true; else return false.

The key part of exec is step 6 of 15.10.6.2:

6. Let global be the result of calling the [[Get]] internal method of R with argument "global".
7. If global is false, then let i = 0.

When i is not reset to 0, then exec (and therefore test) does not start looking at the beginning of the string.

This is useful for exec because you can loop to handle each match:

 var myRegex = /o/g;
 var myString = "fooo";
 for (var match; match = myRegex.exec(myString);) {
   alert(match + " at " + myRegex.lastIndex);
 }

but obviously it isn't so useful for test.
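
The stateful behaviour described above can be modelled in a few lines of Python. This is only a sketch: GlobalRegex, exec_ and last_index are hypothetical names standing in for a JavaScript RegExp with the g flag, its exec method and its lastIndex property, and Python's re module does the actual matching.

```python
import re


class GlobalRegex:
    """Toy model of a JavaScript RegExp object (hypothetical helper)."""

    def __init__(self, pattern, global_flag=True):
        self.regex = re.compile(pattern)
        self.global_flag = global_flag
        self.last_index = 0

    def exec_(self, text):
        # Without the g flag the search always starts at 0.
        start = self.last_index if self.global_flag else 0
        match = self.regex.search(text, start)
        if match is None:
            self.last_index = 0  # a failed search resets the position
            return None
        if self.global_flag:
            self.last_index = match.end()  # remember where to continue
        return match

    def test(self, text):
        # test is defined purely in terms of exec, so it inherits the state.
        return self.exec_(text) is not None


reg = GlobalRegex("a")
print(reg.test("a"))  # True: found at index 0, last_index becomes 1
print(reg.test("a"))  # False: search starts at index 1, last_index resets to 0
print(reg.test("a"))  # True again
```

With global_flag=False the model always starts at index 0, which matches the first workaround of dropping the g flag.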

How to portably parse the (Unicode) degree symbol with regular expressions?

8 votes

I'm writing a simple regular expression parser for the output of the sensors utility on Ubuntu. Here's an example of a line of text I'm parsing:

temp1:        +31.0°C  (crit = +107.0°C)

And here's the regex I'm using to match that (in Python):

temp_re = re.compile(r'(temp1:)\s+(\+|-)(\d+\.\d+)\W\WC\s+' 
                     r'\(crit\s+=\s+(\+|-)(\d+\.\d+)\W\WC\).*')

This code works as expected and matches the example text I've given above. The only bits I'm really interested in are the numbers, so this bit:

(\+|-)(\d+\.\d+)\W\WC

which starts by matching the + or - sign and ends by matching the °C.

My question is, why does it take two \W (non-alphanumeric) characters to match ° rather than one? Will the code break on systems where Unicode is represented differently to mine? If so, how can I make it portable?

Possible portable solution:

Convert input data to unicode, and use re.UNICODE flag in regular expressions.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re


data = u'temp1:        +31.0°C  (crit = +107.0°C)'
temp_re = re.compile(ur'(temp1:)\s+(\+|-)(\d+\.\d+)°C\s+' 
                     ur'\(crit\s+=\s+(\+|-)(\d+\.\d+)°C\).*', flags=re.UNICODE)

print temp_re.findall(data)

Output

[(u'temp1:', u'+', u'31.0', u'+', u'107.0')]

EDIT

@netvope already pointed this out in the comments on the question.

Update

Notes from J.F. Sebastian comments about input encoding:

check_output() returns binary data that sometimes can be text (that should have a known character encoding in this case and you can convert it to Unicode). Anyway ord(u'°') == 176 so it can not be encoded using ASCII encoding.

So, to decode input data to unicode, basically* you should use encoding from system locale using locale.getpreferredencoding() e.g.:

data = subprocess.check_output(...).decode(locale.getpreferredencoding())

With the data decoded correctly, you'll get the same output without re.UNICODE in this case.


Why basically? Because on Russian Win7 with cp1251 as the preferred encoding, suppose we have, for example, a script.py which encodes its output as UTF-8:

#!/usr/bin/env python
# -*- coding: utf8 -*-

print u'temp1: +31.0°C  (crit = +107.0°C)'.encode('utf-8')

And we need to parse its output:

subprocess.check_output(['python', 
                         'script.py']).decode(locale.getpreferredencoding())

will produce wrong results: 'В°' instead of °.

So, in some cases, you need to know the encoding of the input data.
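
For readers on Python 3, here is a small sketch of the same idea. It also shows why the original byte-oriented pattern needed two \W: in UTF-8 the degree sign is two bytes. The sample bytes and the parse_temperature helper are assumptions made for this illustration; real input would come from the sensors output.

```python
import re

# Bytes as a subprocess would deliver them, assuming the tool emits UTF-8.
raw = "temp1:        +31.0°C  (crit = +107.0°C)".encode("utf-8")

# In UTF-8 the degree sign occupies two bytes, so a pattern applied to the
# undecoded bytes sees two non-word "characters" where text has only one.
degree_bytes = "°".encode("utf-8")
print(len(degree_bytes))  # 2

TEMP_RE = re.compile(
    r"(temp1:)\s+([+-])(\d+\.\d+)°C\s+"
    r"\(crit\s+=\s+([+-])(\d+\.\d+)°C\)"
)


def parse_temperature(raw_bytes, encoding="utf-8"):
    """Decode first, then match against text (hypothetical helper)."""
    text = raw_bytes.decode(encoding)
    match = TEMP_RE.search(text)
    return match.groups() if match else None


print(parse_temperature(raw))            # ('temp1:', '+', '31.0', '+', '107.0')
print(parse_temperature(raw, "cp1251"))  # None: wrong encoding, 'В°' appears
```

Decoding with the wrong encoding does not raise an error here; it silently yields different text, so the pattern just stops matching.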

Detect repetitions in string

8 votes

I have a simple problem, but can't come up with a simple solution :)

Let's say I have a string. I want to detect if there is a repetition in it.

I'd like:

"blablabla" # => (bla, 3)

"rablabla"  # => (bla, 2)

The thing is I don't know what pattern I am searching for (I don't have "bla" as input).

Any idea?

EDIT:
Seeing the comments, I think I should be a bit more precise about what I have in mind:

  • In a string, there is either a pattern that is repeated or not.
  • The repeated pattern can be of any length.

If there is a pattern, it would be repeated over and over again until the end. But the string can end in the middle of the pattern.

Example:

"testblblblblb" # => ("bl",4) 

import re
def repetitions(s):
   r = re.compile(r"(.+?)\1+")
   for match in r.finditer(s):
       yield (match.group(1), len(match.group(0)) // len(match.group(1)))

finds all non-overlapping repeating matches, using the shortest possible unit of repetition:

>>> list(repetitions("blablabla"))
[('bla', 3)]
>>> list(repetitions("rablabla"))
[('abl', 2)]
>>> list(repetitions("aaaaa"))
[('a', 5)]
>>> list(repetitions("aaaaablablabla"))
[('a', 5), ('bla', 3)]
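
As a quick check against the example from the question's edit, here is the same approach as a self-contained Python 3 sketch (integer division keeps the count an int; REPEAT_RE is just a name chosen here):

```python
import re

# A lazily matched unit followed by one or more copies of itself.
REPEAT_RE = re.compile(r"(.+?)\1+")


def repetitions(s):
    """Yield (unit, count) for each non-overlapping repeated run in s."""
    for match in REPEAT_RE.finditer(s):
        unit = match.group(1)
        yield unit, len(match.group(0)) // len(unit)


print(list(repetitions("testblblblblb")))  # [('bl', 4)]
```

The trailing partial repetition (the final "b") is simply left unmatched.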

Is there a simple way of parsing this text into a Map

7 votes

I receive the response from a service as below. How to parse this into a Map? I first thought of split at whitespace but it doesn't work as the value might contain spaces e.g. look at the value of SA key in the below response.

One option I thought of is to split at whitespace provided the previous character is a double quote. Not sure how to write the regex for this though.

TX="0000000000108000001830001" FI="" OS="8" CI="QU01SF1S2032" AW="SSS" SA="1525 Windward Concourse"

Parse at quotes. You could even use a regular expression to find each key/value pair, assuming each value is in quotes. My only question would be, what are the rules for if a value contains embedded quotes? (Are they escaped using '\' or such? Regardless, this is not currently accounted for in the below...)

For example:

(\w+)="([^"]*)"

This will even give you groups #1 and #2 that can be used to provide the key and the value, respectively.

Run this in a loop, using Java's Matcher.find() method, until you find all of the pairs.

Sample code:

String input = "TX=\"0000000000108000001830001\" FI=\"\" OS=\"8\" CI=\"QU01SF1S2032\" AW=\"SSS\" SA=\"1525 Windward Concourse\"";

Pattern p = Pattern.compile("\\s*(\\w+)=\"([^\"]*)\"\\s*");

Matcher m = p.matcher(input);
while(m.find()){
    System.out.println(m.group(1));
    System.out.println(m.group(2));
}

Output:

TX
0000000000108000001830001
FI

OS
8
CI
QU01SF1S2032
AW
SSS
SA
1525 Windward Concourse
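
The same pattern, sketched in Python with the pairs collected into a dictionary (the Python counterpart of a Map). Like the Java code above, it assumes values never contain embedded quotes; parse_pairs is a name made up for this example.

```python
import re

PAIR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_pairs(response):
    """Return a dict of KEY="value" pairs found in response."""
    # findall returns (key, value) tuples, which dict() accepts directly.
    return dict(PAIR_RE.findall(response))


response = ('TX="0000000000108000001830001" FI="" OS="8" '
            'CI="QU01SF1S2032" AW="SSS" SA="1525 Windward Concourse"')
pairs = parse_pairs(response)
print(pairs["SA"])  # 1525 Windward Concourse
```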

Python regex search AND split

7 votes

In PHP one can use the function preg_match with the flag PREG_OFFSET_CAPTURE in order to search for a regex pattern within a string and know what comes before and after it. For example, given the string aaa bbb ccc ddd eee fff, I'd like to match-split r'ddd' and have:

before = 'aaa bbb ccc '
match = 'ddd'
after = ' eee fff'

How to do this in python? Thanks

You can use re.split() but you need to put parentheses around the pattern so as to save the match:

>>> re.split('(ddd)', 'aaa bbb ccc ddd eee fff', 1)
['aaa bbb ccc ', 'ddd', ' eee fff']

but in this case you don't need a regex at all:

>>> 'aaa bbb ccc ddd eee fff'.partition('ddd')
('aaa bbb ccc ', 'ddd', ' eee fff')

Edit: I should probably also mention that with re.split you will get all of the matching groups, so you need to be prepared for that or use non-capturing groups everywhere you would otherwise use parentheses for precedence:

>>> re.split('(d(d)d)', 'aaa bbb ccc ddd eee fff', 1)
['aaa bbb ccc ', 'ddd', 'd', ' eee fff']
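
If the extra groups are a nuisance, another option (a sketch, closer in spirit to PHP's offset capture) is to use the match offsets directly. split_around is a hypothetical helper name.

```python
import re


def split_around(pattern, text):
    """Return (before, match, after) for the first match, or None."""
    match = re.search(pattern, text)
    if match is None:
        return None
    # start()/end() give the offsets of the whole match, groups or not.
    return text[:match.start()], match.group(0), text[match.end():]


print(split_around(r"d(d)d", "aaa bbb ccc ddd eee fff"))
# ('aaa bbb ccc ', 'ddd', ' eee fff')
```

Capturing groups inside the pattern do not affect the result here, because only the span of the whole match is used.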

Regex: Require that quotes are escaped in a string

6 votes

thanks for looking,

I've had a terrible time trying to get the right search terms for this regex question. I need to ensure that quotes are already escaped in a string, otherwise the match should fail. (Most search results for this kind of question are just pages saying you need to escape quotes or how to escape quotes.)

Valid:

This is valid
This \"is Valid
This is al\"so Valid\"

Invalid:

This i"s invalid
This i"s inv"alid

The only thing I've managed to find so far is

((?:\\"|[^"])*)

This seems to match the first part of the following, but nothing after the escaped quote

This is a \"test

Again, this should fail:

This is a \"test of " the emergency broadcast system

Thanks for any help, I hope this is even possible.

In C#, this appears to work as you want:

string pattern = "^([^\"\\\\]*(\\\\.)?)*$";

Stripping out the escaping leaves you with:

^([^"\\]*(\\.)?)*$

which roughly translates into: start-of-string, (multi-chars-excluding-quote-or-backslash, optional-backslash-anychar)-repeated, end-of-string

It's the start-of-string and end-of-string markers which forces the match over the complete text.
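
Here is a quick Python check against the examples from the question. Note that this sketch uses a slightly different but, I believe, equivalent formulation: (?:[^"\\]|\\.)* with fullmatch instead of the anchors. That avoids a star nested inside a star, which can backtrack heavily on long inputs that fail. That substitution is my own choice, not part of the C# answer above.

```python
import re

# Each item is either an ordinary character (no quote, no backslash)
# or a backslash followed by any character.
ESCAPED_ONLY = re.compile(r'(?:[^"\\]|\\.)*')


def quotes_are_escaped(text):
    """True if every double quote in text is preceded by a backslash escape."""
    return ESCAPED_ONLY.fullmatch(text) is not None


valid = ['This is valid', r'This \"is Valid', r'This is al\"so Valid\"']
invalid = ['This i"s invalid', 'This i"s inv"alid',
           r'This is a \"test of " the emergency broadcast system']

print([quotes_are_escaped(s) for s in valid])    # [True, True, True]
print([quotes_are_escaped(s) for s in invalid])  # [False, False, False]
```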

Faster replacement for Regex

6 votes

I have around 100 Regex calls in a class. Each call covers a different type of data in a text protocol, but I have many files, and according to profiling the regexes take 88% of my code's execution time.

There is a lot of code of this type:

{
  Match m_said = Regex.Match(line, @"(.*) said,", RegexOptions.IgnoreCase);
  if (m_said.Success)
  {
    string playername = m_said.Groups[1].Value;
    // some action
    return true;
  }
}

{
  Match ma = Regex.Match(line, @"(.*) is connected", RegexOptions.IgnoreCase);
  if (ma.Success)
  {
    string playername = ma.Groups[1].Value;
    // some action
    return true;
  }
}
{
  Match ma = Regex.Match(line, @"(.*): brings in for (.*)", RegexOptions.IgnoreCase);
  if (ma.Success)
  {
    string playername = ma.Groups[1].Value;
    long amount = Detect_Value(ma.Groups[2].Value, line);
    // some action
    return true;
  }
}

Is there any way to replace Regex with some other, faster solution?

For regexes that are tested in a loop, it is often faster to precompile them outside of the loop and only test them inside the loop.

You need to declare the different regexes first with their respective patterns, and only call Match() with the text to test in a second step.
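
The two-step structure might look like this, sketched in Python rather than C# for brevity. The patterns are the ones from the question; the labels and the classify helper are invented for the example. (Python's re module also keeps an internal cache of compiled patterns, so the gain there is smaller than it may be in .NET.)

```python
import re

# Step 1: compile every pattern once, up front.
PATTERNS = [
    ("said", re.compile(r"(.*) said,", re.IGNORECASE)),
    ("connected", re.compile(r"(.*) is connected", re.IGNORECASE)),
    ("brings_in", re.compile(r"(.*): brings in for (.*)", re.IGNORECASE)),
]


def classify(line):
    """Step 2: only run the precompiled patterns against each line."""
    for label, pattern in PATTERNS:
        match = pattern.search(line)
        if match:
            return label, match.groups()
    return None


print(classify("Alice is connected"))      # ('connected', ('Alice',))
print(classify("Bob: brings in for 500"))  # ('brings_in', ('Bob', '500'))
```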

Are Ruby 1.9 regular expressions equally powerful to a context free grammar?

6 votes

I have this regular expression:

regex = %r{\A(?<foo> a\g<foo>a | b\g<foo>b | c)\Z}x

When I test it against several strings, it appears to be as powerful as a context free grammar because it handles the recursion properly.

regex.match("aaacaaa")
# => #<MatchData "aaacaaa" foo:"aaacaaa">
regex.match("aacaa")
# => #<MatchData "aacaa" foo:"aacaa">
regex.match("aabcbaa")
# => #<MatchData "aabcbaa" foo:"aabcbaa">
regex.match("aaacaa")
# => nil

"Fun with Ruby 1.9 Regular Expressions" has an example where he actually arranges all the parts of a regex so that it looks like a context-free grammar as follows:

sentence = %r{ 
    (?<subject>   cat   | dog   | gerbil    ){0} 
    (?<verb>      eats  | drinks| generates ){0} 
    (?<object>    water | bones | PDFs      ){0} 
    (?<adjective> big   | small | smelly    ){0} 

    (?<opt_adj>   (\g<adjective>\s)?     ){0} 

    The\s\g<opt_adj>\g<subject>\s\g<verb>\s\g<opt_adj>\g<object> 
}x

Between his technique for rearranging the parts of the regex, and my example of recursive named capturing groups, does this mean Ruby 1.9 regular expressions have the power equivalent to a context-free grammar?

This is one of the awesome things about the Oniguruma regexp engine used in Ruby 1.9 -- it has the power of a parser, and is not restricted to recognizing regular languages. It has positive and negative lookahead/lookbehind, which can even be used to recognize some languages which are not context-free! Take the following as an example:

regexp = /\A(?<AB>a\g<AB>b|){0}(?=\g<AB>c)a*(?<BC>b\g<BC>c|){1}\Z/

This regexp recognizes strings like "abc", "aabbcc", "aaabbbccc", and so on -- the number of "a", "b", and "c" must be equal, or it will not match.

(One limitation: you can't use named groups in the lookahead and lookbehind.)

Although I haven't peeked under the hood, Oniguruma seems to deal with named groups by simple recursive descent, backing up when something doesn't match. I've observed that it can't deal with left recursion. For example:

irb(main):013:0> regexp = /(?<A>\g<A>a|)/
SyntaxError: (irb):13: never ending recursion: /(?<A>\g<A>a|)/
    from C:/Ruby192/bin/irb:12:in `<main>'

I don't remember my parsing theory very clearly, but I think that a non-deterministic top-down parser like this should be able to parse any context-free language. ("language", not "grammar"; if your grammar has left recursion, you will have to convert it to right recursion.) If that is incorrect, please edit this post.
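
For comparison, here is what a hand-written recursive-descent recognizer for the grammar from the question (foo -> a foo a | b foo b | c) looks like in Python. This is only an illustration of the parsing style described above, not a claim about how Oniguruma works internally; this particular grammar happens to need no backtracking.

```python
def match_foo(s, i=0):
    """Return the index just past a foo starting at i, or None if none matches."""
    if i >= len(s):
        return None
    if s[i] == "c":
        return i + 1  # base case: foo -> c
    if s[i] in "ab":
        # foo -> a foo a | b foo b: recurse, then demand the same closing letter
        j = match_foo(s, i + 1)
        if j is not None and j < len(s) and s[j] == s[i]:
            return j + 1
    return None


def recognizes(s):
    """True if the whole string is a foo (the \\A ... \\Z part of the regex)."""
    return match_foo(s) == len(s)


for sample in ("aaacaaa", "aacaa", "aabcbaa", "aaacaa"):
    print(sample, recognizes(sample))
```

The first three samples are accepted and the last is rejected, matching the Ruby results shown in the question.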

RegEx: Compare two strings to find Alliteration and Assonance

6 votes

Would it be possible to compare two strings to find alliteration and assonance?

I mainly use JavaScript or PHP.

I'm not sure that a regex would be the best way of building a robust comparison tool. A simple regex might be part of a larger solution that used more sophisticated algorithms for non-exact matching.

There are a variety of readily-available options for English, some of which could be extended fairly simply to languages that use the Latin alphabet. Most of these algorithms have been around for years or even decades and are well-documented, though they all have limits.

I imagine that there are similar algorithms for non-Latin alphabets but I can't comment on their availability firsthand.

Phonetic Algorithms

The Soundex algorithm is nearly 100 years old and has been implemented in multiple programming languages. It is used to determine a numeric value based on the pronunciation of a string. It is not precise but it may be useful for identifying similar sounding words/syllables. I've experimented with it in MS SQL Server and it is available in PHP.

http://php.net/manual/en/function.soundex.php

I have not worked with the metaphone algorithm, but Wikipedia and the PHP docs claim it is more accurate than Soundex when dealing with the English language. There are numerous implementations available (Wikipedia has a long list at the end of the article) and it is included in PHP.

http://www.php.net/manual/en/function.metaphone.php

Word Deconstruction

Levenshtein can be used to suggest alternate spellings (for example, to normalize user input) and might be useful as part of a more granular algorithm for alliteration and assonance.

http://www.php.net/manual/en/function.levenshtein.php

Logically, it would help to understand the syllabication of the words in the string so that each word could be deconstructed. The syllable break could resolve ambiguity as to how two adjacent letters should be pronounced. This thread has a few links:

PHP Syllable Detection

Sample with PHP Snippets

Lastly, here's a simple alliteration analyzer that turned up while doing a little reading on the subject:

http://coding.pressbin.com/99/How-my-Alliteration-Analyzer-works
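
To show where a simple regex could fit into such a solution, here is a deliberately naive, spelling-based sketch in Python. It compares letters, not sounds, so it will be wrong wherever English spelling and pronunciation diverge; all function names are invented for this example, and a phonetic algorithm like the ones above would be needed for anything robust.

```python
import re


def onset(word):
    """Leading consonant letters of a word, e.g. 'str' for 'string'."""
    match = re.match(r"[b-df-hj-np-tv-z]+", word.lower())
    return match.group(0) if match else ""


def vowel_groups(word):
    """Runs of vowel letters in a word, e.g. ['a', 'e'] for 'lake'."""
    return re.findall(r"[aeiou]+", word.lower())


def alliterates(a, b):
    # Naive rule: both words start with the same consonant letters.
    return bool(onset(a)) and onset(a) == onset(b)


def assonates(a, b):
    # Naive rule: the words share at least one vowel group.
    return bool(set(vowel_groups(a)) & set(vowel_groups(b)))


print(alliterates("Peter", "piper"), alliterates("big", "dog"))  # True False
print(assonates("lake", "fate"), assonates("cat", "dog"))        # True False
```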

All matches of regex in Haskell

6 votes

According to a number of tutorials (including Real World Haskell) one can, using ghci do the following

ghci > :m Text.Regex.Posix
ghci > "foo foo foo" =~ "foo" :: [String]
["foo","foo","foo"]

Yet, when I attempt this, it yields

No instance for (RegexContext Regex [Char] [String])
  arising from a use of `=~'
Possible fix:
  add an instance declaration for
  (RegexContext Regex [Char] [String])
In the expression: "abc" =~ "ab" :: [String]
In an equation for `it': it = "abc" =~ "ab" :: [String]

What is the correct way of obtaining a list of all matches in haskell?

The regex libraries can be somewhat confusing with their overloaded return types, but to get all the matches you just need to ensure that the return type is AllTextMatches, for example:

Prelude> :m + Text.Regex.Posix
Prelude Text.Regex.Posix> getAllTextMatches $ "foo foo foo" =~ "foo" :: [String]
["foo","foo","foo"]

Finding string literals in my code

5 votes

At my company we recently noticed that one developer was not using language files but putting text directly in the code.

My idea was to search for words between quotes that contain at least one whitespace character. But I got kind of stuck with

("|')(\w|\s{1,})*('|")

This does match text, but it does not require at least one word and at least one whitespace character (so it matches just about anything between quotes). Can anyone help me out?

The language I want to use for this is PHP (or I could do a notepad++ search)

If you want to match single or double quoted strings (without escapes) that contain a "word" and a space you could use:

"(?=[^"\n]*\w)(?=[^"\n]*\s)[^"\n]+"|'(?=[^'\n]*\w)(?=[^'\n]*\s)[^'\n]+'

In PHP it would look like:

preg_match_all("/\"(?=[^\"\n]*\\w)(?=[^\"\n]*\\s)[^\"\n]+\"|'(?=[^'\n]*\\w)(?=[^'\n]*\\s)[^'\n]+'/", $string, $matches);
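
Here is the same pattern tried out in Python on a made-up snippet of PHP source (the sample text is invented for the test). Literals without whitespace, such as 'id', are skipped.

```python
import re

# Double-quoted or single-quoted literal that contains both a word
# character and whitespace, and does not span lines.
LITERAL_RE = re.compile(
    r'''"(?=[^"\n]*\w)(?=[^"\n]*\s)[^"\n]+"'''
    r"""|'(?=[^'\n]*\w)(?=[^'\n]*\s)[^'\n]+'"""
)

sample = 'echo "Hello world";\n$a = \'id\';\n$b = \'Save changes\';\n'

print(LITERAL_RE.findall(sample))  # ['"Hello world"', "'Save changes'"]
```

One caveat I noticed while testing: since the pattern has no notion of which quote opens and which closes a string, two short literals on the same line can be paired up wrongly (the closing quote of one with the opening quote of the next), so treat the hits as candidates to review.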

Sed regex and substring negation

5 votes

What is the correct syntax for finding a substring (a string which is preceded and followed by specific strings) which does not match a specific pattern?

For example, I want to take all substrings which start with BEGIN_, end with _END and the substring in between is not equal to FOO; and replace the whole substring with the format "(inner substring)". The following would match:

  • BEGIN_bar_END -> (bar)
  • BEGIN_buz_END -> (buz)
  • BEGIN_ihfd8f398IHFf9f39_END -> (ihfd8f398IHFf9f39)

But BEGIN_FOO_END would not match.

I have played around with the following, but cannot seem to find the correct syntax:

sed -e 's/BEGIN_(^FOO)_END/($1)/g'
sed -e 's/BEGIN_([^FOO])_END/($1)/g'
sed -e 's/BEGIN_(?!FOO)_END/($1)/g'
sed -e 's/BEGIN_(!FOO)_END/($1)/g'
sed -e 's/BEGIN_(FOO)!_END/($1)/g'
sed -e 's/BEGIN_!(FOO)_END/($1)/g'

There is no general negation operator in Sed, IIRC because compilation of regexes with negation to DFAs takes exponential time. You can work around this with

'/BEGIN_FOO_END/b; s/BEGIN_\(.*\)_END/(\1)/g'

where /BEGIN_FOO_END/b means: if we find BEGIN_FOO_END, then branch (jump) to the end of the Sed script.
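
For comparison, in a regex flavour that does have negative lookahead, the exclusion can be written directly. Below is a Python sketch (a different technique from the sed workaround above, and with one behavioural difference: it skips only the BEGIN_FOO_END substring itself, whereas the sed script skips the entire line containing it). The rewrite helper is a name chosen for this example.

```python
import re

# (?!FOO_END) rejects the match when the inner part is exactly FOO.
WRAPPED_RE = re.compile(r"BEGIN_(?!FOO_END)(.*?)_END")


def rewrite(text):
    """Replace BEGIN_x_END with (x) unless x is FOO."""
    return WRAPPED_RE.sub(r"(\1)", text)


print(rewrite("BEGIN_bar_END"))                # (bar)
print(rewrite("BEGIN_FOO_END"))                # BEGIN_FOO_END
print(rewrite("BEGIN_FOO_END BEGIN_buz_END"))  # BEGIN_FOO_END (buz)
```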

Memory usage and known issues with RegEx and different Framework versions

5 votes

We have a Windows Service created in .Net 4.0. The service parses large text files made up of lines of comma-separated values (several million lines, each with 5-10 values). No problem here: we can read the lines, split them into a Key/Value collection and process the values. To validate the values we are using Data Parallelism to pass the values, which are basically an array of values in specific formats, to a method that performs RegEx validation on individual values.

Up until now we have used static Regular Expressions, not the static RegEx.IsMatch method but a static RegEx property with the RegexOption defined as RegexOptions.Compiled, as detailed below.

private static Regex clientIdentityRegEx = new Regex("^[0-9]{4,9}$", RegexOptions.Compiled);

Using this method we had a pretty standard memory footprint, the memory increased marginally with the greater number of values in each line, the time taken was more or less linear to the total number of lines.

To allow the Regular Expression to be used in other projects, of varying Framework versions, we recently moved the static RegEx properties to a common utilities project that is now compiled using the .Net 2.0 CLR (the actual Regular Expressions have not changed), the number of RegEx properties exposed has increased to about 60, from 25 or so. Since doing this we have started running into memory issues, an increase in memory 3 or more times that of the original project. When we profile the running service we can see the memory appears to be "leaking" from the RegEx.IsMatch, not any specific RegEx but various depending on which are called.

I found the following comment on a old MSDN blog post from one of the BCL team relating to .Net 1.0/1.1 RegEx.

There are even more costs for compilation that should be mentioned, however. Emitting IL with Reflection.Emit loads a lot of code and uses a lot of memory, and that's not memory that you'll ever get back. In addition, in v1.0 and v1.1, we couldn't ever free the IL we generated, meaning you leaked memory by using this mode. We've fixed that problem in Whidbey. But the bottom line is that you should only use this mode for a finite set of expressions which you know will be used repeatedly.

I will add we have profiled "most" of the common RegEx calls and cannot replicate the issue individually.

Is this a known issue with the .Net 2.0 CLR?

In the article the writer states "But the bottom line is that you should only use this mode for a finite set of expressions which you know will be used repeatedly". What is likely to be the finite number of expressions that can be used in this manner, and is this likely to be a cause?

Update: In line with the answer from @Henk Holterman, are there any best practices for benchmark testing Regular Expressions, specifically RegEx.IsMatch, other than using sheer brute force by volume and parameter format?

Answer: Henk's answer of "The scenario calls for a limited, fixed number of RegEx objects" was pretty much spot on. We added the static RegEx'es to the class until we isolated the expressions with a notable increase in memory usage; these were migrated to separate static classes, which seems to have solved some of the memory issues.

It appears, although I cannot confirm this, that there is a difference in compiled RegEx usage between the .Net 2.0 CLR and the .Net 4.0 CLR, as the memory issues do not occur when compiled solely for the .Net 4.0 framework. (Any confirmations?)

The scenario calls for a limited, fixed number of RegEx objects. That shouldn't leak. You should verify that in the new situation the RegEx objects are still being reused.

The other possibility is the increased number of expressions (60, up from 25). Could just one of them maybe be a little more complex, leading to excessive backtracking?

Search and replace multiple lines in xml/text files using python

5 votes

---Update 3: I have got the script to update the required data into the xml files completed but the following code is being dropped from the written file. Why is this? how can I replace it?

<?xml version="1.0" encoding="utf-8"?><?xml-stylesheet type='text/xsl' href='ANZMeta.xsl'?>

Current working code (except for issue mentioned above).

import os, xml, arcpy, shutil
from xml.etree import ElementTree as et 

path=os.getcwd()
arcpy.env.workspace = path

FileList = arcpy.ListFeatureClasses()
FileCount = len(FileList)
zone="_Zone"

for File in FileList:
    FileDesc_obj = arcpy.Describe(File)
    FileNm=FileDesc_obj.file
    newMetaFile=FileNm+"_BaseMetadata.xml"

    check_meta=os.listdir(path)
    if FileNm+'.xml' in check_meta:
        shutil.copy2(FileNm+'.xml', newMetaFile)
    else:
        shutil.copy2(r'L:\Data_Admin\QA\Metadata_python_toolset\Master_Metadata.xml', newMetaFile)
    tree=et.parse(newMetaFile)

    print "Processing: "+str(File)

    for node in tree.findall('.//title'):
        node.text = str(FileNm)
    for node in tree.findall('.//northbc'):
        node.text = str(FileDesc_obj.extent.YMax)
    for node in tree.findall('.//southbc'):
        node.text = str(FileDesc_obj.extent.YMin)
    for node in tree.findall('.//westbc'):
        node.text = str(FileDesc_obj.extent.XMin)
    for node in tree.findall('.//eastbc'):
        node.text = str(FileDesc_obj.extent.XMax)        
    for node in tree.findall('.//native/nondig/formname'):
        node.text = str(os.getcwd()+"\\"+File)
    for node in tree.findall('.//native/digform/formname'):
        node.text = str(FileDesc_obj.featureType)
    for node in tree.findall('.//avlform/nondig/formname'):
        node.text = str(FileDesc_obj.extension)
    for node in tree.findall('.//avlform/digform/formname'):
        node.text = str(float(os.path.getsize(File))/int(1024))+" KB"
    for node in tree.findall('.//theme'):
        node.text = str(FileDesc_obj.spatialReference.name +" ; EPSG: "+str(FileDesc_obj.spatialReference.factoryCode))
    print node.text
    projection_info=[]
    Zone=FileDesc_obj.spatialReference.name

    if "GCS" in str(FileDesc_obj.spatialReference.name):
        projection_info=[FileDesc_obj.spatialReference.GCSName, FileDesc_obj.spatialReference.angularUnitName, FileDesc_obj.spatialReference.datumName, FileDesc_obj.spatialReference.spheroidName]
        print "Geographic Coordinate system"
    else:
        projection_info=[FileDesc_obj.spatialReference.datumName, FileDesc_obj.spatialReference.spheroidName, FileDesc_obj.spatialReference.angularUnitName, Zone[Zone.rfind(zone)-3:]]
        print "Projected Coordinate system"
    x=0
    for node in tree.findall('.//spdom'):
        for node2 in node.findall('.//keyword'):
            print node2.text
            node2.text = str(projection_info[x])
            print node2.text
            x=x+1


    tree.write(newMetaFile)

---Update 1&2: Thanks to Aleyna I have the following basic code that works

import os, xml, arcpy, shutil
from xml.etree import ElementTree as et 

CodeString=['northbc','southbc', '<nondig><formname>']

nondig='nondigital'
path=os.getcwd()
arcpy.env.workspace = path
xmlfile = path+"\\test.xml"

FileList = arcpy.ListFeatureClasses()
FileCount = len(FileList)

for File in FileList:
    FileDesc_obj = arcpy.Describe(File)
    FileNm=FileDesc_obj.file
    newMetaFile=FileNm+"_Metadata.xml"
    shutil.copy2(r'L:\Data_Admin\QA\Metadata_python_toolset\Master_Metadata.xml', newMetaFile)
    tree=et.parse(newMetaFile)

    for node in tree.findall('.//northbc'):
        node.text = str(FileDesc_obj.extent.YMax)
    for node in tree.findall('.//southbc'):
        node.text = str(FileDesc_obj.extent.YMin)
    for node in tree.findall('.//westbc'):
        node.text = str(FileDesc_obj.extent.XMin)
    for node in tree.findall('.//eastbc'):
        node.text = str(FileDesc_obj.extent.XMax)        
    for node in tree.findall('.//native/nondig/formname'):
        node.text = nondig

    tree.write(newMetaFile)

The issue is with dealing with xml code like

<spdom>
  <keyword thesaurus="">GDA94</keyword>
  <keyword thesaurus="">GRS80</keyword>
  <keyword thesaurus="">Transverse Mercator</keyword>
  <keyword thesaurus="">Zone 55 (144E - 150E)</keyword>
</spdom>

As keyword thes...is not unique within the <spdom> can we update these in a order from the values coming from

FileDesc_obj.spatialReference.name

u'GCS_GDA_1994'

---ORIGINAL POST---

I am building up a program to generate xml metadata files from spatial files in our library. I have already created the scripts to extract the required spatial and attrib data from the files and create a shp and text file based index of the files but now I want to write this info to base metadata xml file that is written to anzlic standards by replacing the values held by common/static elements...

So for example I want to replace the following xml code

<northbc>8097970</northbc>
<southbc>8078568</southbc>

with

<northbc> GeneratedValue_[desc.extent.XMax] </northbc>
<southbc> GeneratedValue_[desc.extent.XMax] </southbc>

The issue is that obviously the number/value between the opening and closing tags will not be the same.

Similarly for xml tags like <title>, <nondig><formname> etc...in the latter example both tags must be searched for together as formname appears multiple times (is not unique).

I am using the Python Regular Expression manual [here][1],

Using the given tag(s) above:

import os
import xml
from xml.etree import ElementTree as et 
path = r"/your/path/to/xml.file" 
tree = et.parse(path)
for node in tree.findall('.//northbc'):
    node.text = "New Value"
tree.write(path)

Here, XPATH .//northbc returns all the 'northbc' nodes in the XML doc. You can tailor the code for your need easily.
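
Regarding the header lines dropped in Update 3: as far as I know, ElementTree keeps only the element tree itself, so the XML declaration and the xml-stylesheet processing instruction that come before the root element are not written back by tree.write. One possible workaround, sketched here for Python 3, is to write those two lines yourself and then let ElementTree append the tree. The update_and_serialize helper and the tiny sample document are assumptions for the example; HEADER is the text quoted in the question.

```python
import io
from xml.etree import ElementTree as et

# The two header lines from the question, written back verbatim.
HEADER = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b"<?xml-stylesheet type='text/xsl' href='ANZMeta.xsl'?>"
)


def update_and_serialize(xml_bytes, new_values):
    """Set the text of the given tags, then serialize with the header restored."""
    tree = et.parse(io.BytesIO(xml_bytes))
    for tag, value in new_values.items():
        for node in tree.findall(".//" + tag):
            node.text = value
    out = io.BytesIO()
    out.write(HEADER + b"\n")
    # The declaration is already in HEADER, so don't emit a second one.
    tree.write(out, encoding="utf-8", xml_declaration=False)
    return out.getvalue()


doc = HEADER + b"<metadata><northbc>8097970</northbc></metadata>"
result = update_and_serialize(doc, {"northbc": "New Value"})
print(result.decode("utf-8"))
```

When writing to a real file, the same two steps apply: open the file in binary mode, write the header, then pass the file object to tree.write.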