Best regex questions in July 2011

Regular expression /(ab)?use/: Is a more complex expression worth it?

13 votes

I'm writing a simple Perl script that translates assembly instruction strings to 32-bit binary code.

I decided to handle translation by grouping instructions by type (ADD and SUB are R-type instructions, and so on), so in my code I'm doing something like this:

my $bin = &r_type($instruction) if $instruction =~ /^(?:add|s(?:ub|lt|gt))\s/;

because I want to handle add, sub, slt and sgt in the same way.

I realized, however, that this regular expression might be overkill for the task I'm supposed to do. Could the pattern

/^(?:add|sub|slt|sgt)\s/

represent a better use of regular expressions in this case?

Thanks a lot.

Unless you are using a Perl older than 5.10, the simple alternation will perform better anyway, because since 5.10 the regex engine optimizes alternations of literal strings (it builds a trie from them), so there is no reason to optimize it by hand.

Python pattern-matching. Match 'c[any number of consecutive a's, b's, or c's or b's, c's, or a's etc.]t'

12 votes

Sorry about the title, I couldn't come up with a clean way to ask my question.

In Python I would like to match an expression 'c[some stuff]t', where [some stuff] could be any number of consecutive a's, b's, or c's and in any order.

For example, these work: 'ct', 'cat', 'cbbt', 'caaabbct', 'cbbccaat'

but these don't: 'cbcbbaat', 'caaccbabbt'

Edit: a's, b's, and c's are just an example but I would really like to be able to extend this to more letters. I'm interested in regex and non-regex solutions.

Not sure how attached you are to regex, but here is a solution using a different method:

from itertools import groupby

words = ['ct', 'cat', 'cbbt', 'caaabbct', 'cbbccaat',  'cbcbbaat', 'caaccbabbt']
for w in words:
    match = False
    if w.startswith('c') and w.endswith('t'):
        temp = w[1:-1]
        s = set(temp)
        match = s <= set('abc') and len(s) == len(list(groupby(temp)))
    print w, "matches" if match else "doesn't match"

The string matches if a set of the middle characters is a subset of set('abc') and the number of groups returned by groupby() is the same as the number of elements in the set.
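
Since the question also asks about a regex solution, here is one possible sketch in Python 3. The helper names make_pattern and matches are made up for this illustration, and the sketch assumes the closing letter (t) is not one of the middle letters. Each run captures one letter, consumes its repeats, and a negative lookahead then checks that the same letter never shows up again:

```python
import re

def make_pattern(letters="abc"):
    # One run: capture a letter, consume its repeats with \1*, then require
    # (negative lookahead) that the same letter does not occur again later.
    letter_class = "[" + re.escape(letters) + "]"
    return re.compile("c(?:(" + letter_class + r")\1*(?!.*\1))*t")

def matches(word, letters="abc"):
    # fullmatch anchors the pattern at both ends of the word.
    return make_pattern(letters).fullmatch(word) is not None

for w in ['ct', 'cat', 'cbbt', 'caaabbct', 'cbbccaat', 'cbcbbaat', 'caaccbabbt']:
    print(w, "matches" if matches(w) else "doesn't match")
```

Extending it to more letters should only require passing a longer letters string.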

Consolidate repeating pattern

8 votes

I am working on a script that develops certain strings of alphanumeric characters, separated by a dash -. I need to test the string to see if there are any sets of characters (the characters that lie in between the dashes) that are the same. If they are, I need to consolidate them. The repeating chars would always occur at the front in my case.

Examples:

KRS-KRS-454-L
would become:
KRS-454-L

DERP-DERP-545-P
would become:
DERP-545-P

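<?php
$s = 'KRS-KRS-454-L';
// Drop the leading group when the next group is identical to it.
// The (?:-|$) part stops a longer group such as KRSX from counting as a repeat.
echo preg_replace('/^(\w+)-(?=\1(?:-|$))/', '', $s); // KRS-454-L
?>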

This uses a positive lookahead (?=...) to check for repeated strings.

Note that \w also contains the underscore. If you want to limit to alphanumeric characters only, use [a-zA-Z0-9].

Also, I've anchored with ^ as you've mentioned: "The repeating chars would always occur at the front [...]"
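
For comparison, a rough Python 3 sketch of the same lookahead idea. The function name consolidate is invented here. The lookahead also checks that the repeated group is followed by a dash or the end of the string, which is an assumption about the desired behaviour (so that AB-ABC-1 is left alone):

```python
import re

def consolidate(s):
    # Remove the first group and its dash when the next group is identical.
    # The lookahead (?=...) checks the second copy without consuming it.
    return re.sub(r'^(\w+)-(?=\1(?:-|$))', '', s)

print(consolidate('KRS-KRS-454-L'))    # KRS-454-L
print(consolidate('DERP-DERP-545-P'))  # DERP-545-P
```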

JavaScript RegExp.test() returns false, even though it should return true

7 votes

The problem: I get an AJAX response (JSON or plain text with line breaks). Each item of the response should be checked with a regex to find out whether it matches a user-defined pattern.

Example:

Ajax Response (plain-text)

"Aldor
Aleph
Algae
Algo
Algol
Alma-0
Alphard
Altran"

User-pattern:

/^Alg/ig.test(responseItem)

RegExp Results should look like:

Aldor   // false
Aleph   // false
Algae   // true
Algo    // true
Algol   // true
Alma-0  // false
Alphard // false
Altran  // false

But each time I get different (and kind of weird) results, e.g. (/^alg/ig.test("Algo") => false)

My code:

HTML

...
<form>
  <input id="in" />
</form>
<div id="x">
  Aldor
  Aleph
  Algae
  Algo
  Algol
  Alma-0
  Alphard
  Altran
</div><button id="checker">check!</button>
...

JavaScript (jQuery 1.6.2)

$(function(){
    var $checker = $('#checker');

    $checker.click(function(ev){
        var inputFieldVal = $.trim($('#in').val());
        console.log(inputFieldVal); // Alg
        var regExpPattern = '^'+inputFieldVal,
            re = new RegExp(regExpPattern, 'igm');
        console.log(re); // /^Alg/gim
        // Get text out of div#x
        var text = $('#x').text();
        // Trim and 'convert' to an array...
        text = $.trim(text).split('\n');
        console.log(text); // ["Aldor", "Aleph", "Algae", "Algo", "Algol", "Alma-0", "Alphard", "Altran"]

        for (var index=0, upper=text.length;index<upper;++index) {
            console.log(
               re.test(text[index]),
               text[index]
             );
        }
    });
})

Console OUTPUT:

/^Alg/ig => should match each item which starts with Alg

false "Aldor"
false "Aleph"
true "Algae"
false "Algo" //Why ? O.o
true "Algol"
false "Alma-0"
false "Alphard"
false "Altran"

/^Al/ig => should match each item because every item starts with Al

true "Aldor"
false "Aleph" //Why ? O.o
true "Algae"
false "Algo" //Why ? O.o
true "Algol"
false "Alma-0" //Why ? O.o
true "Alphard"
false "Altran" //Why ? O.o

Any suggestions?

This is the usual behavior of the exec and test methods when the pattern has the global g flag.

The RegExp object will keep track of the lastIndex where a match was found, and then on subsequent matches it will start from that lastIndex instead of starting from 0.

For example:

var re = /^a/g;
console.log(re.test("ab")); // true, lastIndex was 0
console.log(re.test("ab")); // false, lastIndex was 1

Remove the g flag from your pattern, since you are looking just for a single match (you are testing each line separately).
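
To make the mechanism concrete, here is a toy model of it in Python 3. The class GlobalRegExp and its last_index attribute are inventions for this sketch, not a real API, and the model ignores corner cases such as empty matches. It only mimics the bookkeeping described above: search from the stored position, remember where the match ended, and reset to 0 on failure.

```python
import re

class GlobalRegExp:
    """Toy model of a JavaScript RegExp with the g flag (not a real API)."""

    def __init__(self, pattern, flags=0):
        self._re = re.compile(pattern, flags)
        self.last_index = 0  # plays the role of JavaScript's lastIndex

    def test(self, text):
        # Start searching at last_index instead of 0, as the g flag does.
        match = None
        if self.last_index <= len(text):
            match = self._re.search(text, self.last_index)
        if match is None:
            self.last_index = 0  # a failed match resets the position
            return False
        self.last_index = match.end()
        return True

re_al = GlobalRegExp(r'^Al', re.IGNORECASE)
for item in ["Aldor", "Aleph", "Algae", "Algo"]:
    print(re_al.test(item), item)
# True Aldor
# False Aleph
# True Algae
# False Algo
```

The alternating true/false pattern from the question falls out of the stored position: every successful test leaves the position at 2, so the next test starts past the beginning of the string and ^ cannot match.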

Is there a JavaScript regex equivalent to the intersection (&&) operator in Java regexes?

6 votes

In Java regexes, you can use the intersection operator && in character classes to define them succinctly, e.g.

[a-z&&[def]]    // d, e, or f
[a-z&&[^bc]]    // a through z, except for b and c

Is there an equivalent in JavaScript?

Simple answer: no, there's not. It is Java-specific syntax.

See: Regular Expressions Cookbook by Jan Goyvaerts and Steven Levithan. Here's a sneak-peek to the relevant section.

Probably needless to say, but the following JavaScript code:

if(s.match(/^[a-z]$/) && s.match(/[^bc]/)) { ... }

would do the same as the Java code:

if(s.matches("[a-z&&[^bc]]")) { ... }
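
In flavours without &&, an intersection can often be emulated by putting a lookahead in front of the character class. A minimal sketch in Python 3, whose re module has no && either; the names A_TO_Z_EXCEPT_BC and is_allowed are made up for the example. The pattern text itself uses only a negative lookahead and a character class, both of which JavaScript also supports:

```python
import re

# [a-z&&[^bc]] emulated as: "not b or c" (negative lookahead), then one of a-z.
A_TO_Z_EXCEPT_BC = re.compile(r'(?![bc])[a-z]')

def is_allowed(ch):
    # fullmatch requires the whole string to be exactly one allowed character.
    return A_TO_Z_EXCEPT_BC.fullmatch(ch) is not None

print(is_allowed('a'), is_allowed('b'), is_allowed('z'))  # True False True
```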

Replacing each match with a different word

6 votes

I have a regular expression like this:

findthe = re.compile(r" the ")
replacement = ["firstthe", "secondthe"]
sentence = "This is the first sentence in the whole universe!"

What I am trying to do is to replace each occurrence with an associated replacement word from a list so that the end sentence would look like this:

>>> print sentence
This is firstthe first sentence in secondthe whole universe

I tried using re.sub inside a for loop enumerating over replacement but it looks like re.sub returns all occurrences. Can someone tell me how to do this efficiently?

If you are not required to use a regex, then you can try the following code:

replacement = ["firstthe", "secondthe"]
sentence = "This is the first sentence in the whole universe!"

words = sentence.split()

counter = 0
for i,word in enumerate(words):
    if word == 'the':
        words[i] = replacement[counter]
        counter += 1

sentence = ' '.join(words)

Or something like this will work too:

import re
findthe = re.compile(r"\b(the)\b")
print re.sub(findthe, replacement[1],re.sub(findthe, replacement[0],sentence, 1), 1)

And finally (note that pop(0) empties the replacement list as it goes):

re.sub(findthe, lambda matchObj: replacement.pop(0),sentence)
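
A variation on the callback approach, sketched in Python 3. The function name replace_in_order is hypothetical, and keeping the original word once the replacements run out is an assumption about the desired behaviour. An iterator hands out the replacements, so the caller's list is left untouched:

```python
import re

def replace_in_order(sentence, replacements, pattern=r"\bthe\b"):
    # An iterator hands out one replacement per match, left to right,
    # without modifying the caller's list (unlike list.pop(0)).
    remaining = iter(replacements)
    # When the replacements run out, keep the matched text unchanged.
    return re.sub(pattern, lambda m: next(remaining, m.group(0)), sentence)

sentence = "This is the first sentence in the whole universe!"
print(replace_in_order(sentence, ["firstthe", "secondthe"]))
# This is firstthe first sentence in secondthe whole universe!
```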

Mod Rewrite Regex - Multiple Negative Lookaheads

6 votes

I currently have the working Mod Rewrite Regex:

RewriteEngine On
RewriteCond %{QUERY_STRING} ^(.*)$
RewriteRule ^(.*/)?((?:cmd)[^/]*)/((?!(?:cmd)[.+]*)(.+)) $1?$2=$3&%1 [L]

That regex takes the following URL:

www.site.com/cmd1/param/cmd2/param2/stillparam2

and turns it into:

www.site.com/index.php?cmd1=param&cmd2=param2/stillparam2

That works fine, but I would also like to create another negative lookahead assertion to ensure that a URL block - ie a /texthere/ param - doesn't include an underscore. An invalid string might look like: www.test.com/cmd/thing/getparam_valuehere; the regex should parse the cmd/thing as a key and value pair and ignore the rest of the string. I would then also write another RewriteRule to have the block of the URL with the underscore in it added as another URL parameter. The following URL translation would occur:

www.test.com/cmd/param1/cmd2/directory/param2/sortorder_5
www.test.com?cmd=param1&cmd2=directory/param2&sortorder=5

Please let me know if I have not been clear enough. Any help would be great.

NB: I have tried using a negative lookahead nested inside the one already present - (?!(?!)) - and have tried using an | on two negative lookaheads, but neither solution worked. I thought that perhaps something else was more fundamentally wrong?

Thanks all.

Edit: I have also tried the following - which I really thought would work (but obviously, didn't!)

RewriteRule ^(.*/)?((?:cmd)[^/]*)/((?!(?:cmd)[.+]*)(?![.+]*(?:_)[.+]*)(.+)) $1?$2=$3&%1 [L]

That does the following:

www.test.com/cmd/param1/sortorder_1/ translates to www.test.com?cmd=param1/sortorder_1/

when it should instead become www.test.com?cmd=param1&sortorder=1/. (The rule to translate /sortorder_1/ into &sortorder=1 has not yet been created, but you can hopefully see what I mean.)

After about four days of experimenting, I ended up with a somewhat different solution than I had originally expected to find. I simply removed all the actual URL manipulation to my index.php file and routed all requests through there. Here is my (much cleaner) .htaccess file:

Options +FollowSymlinks
RewriteEngine On
RewriteCond %{QUERY_STRING} (.*)
RewriteRule (.*) index.php?path=$1 [QSA,L]

and here is the block of code I used to parse the entered URL:

preg_match_all('|/([A-Za-z0-9]+)((?!/)[A-Za-z0-9-.]*)|', $_GET['path'], $matches);

// Remove all '$_GET' parameters from the actual $_GET superglobal:
foreach ($matches[0] as $k => $v) {
    $search = '/' . substr($v, 1);
    $_GET['path'] = str_replace($search, '', $_GET['path'], $count);
}

// Add $_GET params to URL args
for ($i = 0; $i < count($matches[1]); $i++) {
    self::$get_arguments[$matches[1][$i]] = $matches[2][$i];
}

// Retrieve all 'cmd' properties from the URL and create an array with them:
preg_match_all('~(cmd[0-9]*)/(.+?)(?=(?:cmd)|(?:\z))~', $_GET['path'], $matches);

if (isset($matches[1][0])) {
    return self::$url_arguments = array_combine($matches[1], $matches[2]);
}

On a URL like this:

http://localhost/frame_with_cms/frame/www/cmd/one/cmd2/two/cmd3/three/cmd4/four/getparam_valuepart1_valuepart2/cmd5/five/

It successfully produces these separate arrays which I then use to handle requests:

Array
(
    [getparam] => valuepart1_valuepart2
)
Array
(
    [cmd] => one/
    [cmd2] => two/
    [cmd3] => three/
    [cmd4] => four/
    [cmd5] => five/
)

Thanks to all who took the time to read and reply.

python regex to split on certain patterns with skip patterns

6 votes

I want to split a Python string on certain patterns but not others. For example, I have the string

Joe, Dave, Professional, Ph.D. and Someone else

I want to split on \sand\s and ,, but not , Ph.D.

How can this be accomplished in Python regex?

You can use:

re.split(r'\s+and\s+|,(?!\s*Ph\.D\.)\s*', 'Joe, Dave, Professional, Ph.D. and Someone else')

Result:

['Joe', 'Dave', 'Professional, Ph.D.', 'Someone else']

R- regexp question

5 votes

I need to re-shape my data frame using regexp and, in particular, this kind of line

X21_GS04.A.mzdata

must become:

GS04.A

I tried

pluto <- sub('^X[0-90_]+','', my.data.frame$File.Name, perl=TRUE)

and it works; then I tried

pluto <- sub('.mzdata$','', my.data.frame$File.Name, perl=TRUE)

and it works too.

The problem is that I have no idea how to combine the two calls into one. I tried something like this:

pluto <- sub('^X[0-90_]+ | .mzdata$','', my.data.frame$File.Name, perl=TRUE)

but nothing happens. Can someone tell me where I went wrong?

Best Riccardo

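Remove the spaces from your regex and escape the . character as \. (an unescaped . matches any character). Note also that sub() replaces only the first match, so use gsub() to strip both the prefix and the suffix in one call. The pattern: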

^X[0-9]+_|\.mzdata$
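
The same alternation, sketched in Python 3 for illustration; the function name strip_name is made up. Python's re.sub replaces every match by default, so a single call handles both the prefix and the suffix:

```python
import re

def strip_name(file_name):
    # Either alternative may match; re.sub replaces every match by default.
    return re.sub(r'^X[0-9]+_|\.mzdata$', '', file_name)

print(strip_name('X21_GS04.A.mzdata'))  # GS04.A
```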

How to switch/rotate every two lines with sed/awk?

5 votes

I have been doing this by hand and I just can't do it anymore-- I have thousands of lines and I think this is a job for sed or awk.

Essentially, we have a file like this:

A sentence X
A matching sentence Y
A sentence Z
A matching sentence N

This pattern continues for the entire file. I want to flip every sentence and matching sentence so the entire file will end up like:

A matching sentence Y
A sentence X
A matching sentence N
A sentence Z

Any tips?

edit: extending the initial problem

Dimitre Radoulov provided a great answer for the initial problem. This is an extension of the main problem-- some more details:

Let's say we have an organized file (due to the sed line Dimitre gave, the file is organized). However, now I want to organize the file alphabetically but only using the language (English) of the second line.

watashi 
me
annyonghaseyo
hello
dobroye utro!
Good morning!

I would like to organize alphabetically via the English sentences (every 2nd sentence). Given the above input, this should be the output:

dobroye utro!
Good morning!
annyonghaseyo
hello
watashi
me 

sed 'N; 
s/\(.*\)\n\(.*\)/\2\
\1/' infile

N - append the next line of input to the pattern space
\(.*\)\n\(.*\) - capture the part of the pattern space before the newline and the part after it
\2, escaped newline, \1 - exchange the two lines (\1 is the first captured part, \2 the second); an escaped literal newline is used for portability

With some sed implementations you could use the escape sequence \n: \2\n\1 instead.
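
The extended question (ordering the pairs by their English line) is not covered by the sed command above. One way it might be approached, sketched in Python 3 with made-up function names; it assumes the lines have already been read into a list, that they come in pairs, and that a case-insensitive comparison is what "alphabetically" means here:

```python
def swap_pairs(lines):
    # Exchange each line with the one after it (a trailing odd line is kept).
    out = []
    for i in range(0, len(lines) - 1, 2):
        out += [lines[i + 1], lines[i]]
    if len(lines) % 2:
        out.append(lines[-1])
    return out

def sort_pairs_by_second(lines):
    # Group the lines two by two, then order the pairs by their second line.
    pairs = [lines[i:i + 2] for i in range(0, len(lines), 2)]
    pairs.sort(key=lambda pair: pair[-1].strip().casefold())
    return [line for pair in pairs for line in pair]

lines = ["watashi", "me", "annyonghaseyo", "hello", "dobroye utro!", "Good morning!"]
print("\n".join(sort_pairs_by_second(lines)))
```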

Characters classes in ranges - vim

5 votes

Given I have the following string:

This is a test {{ string.string.string }}.

And try to perform the following substitution:

%s/{{ [\w\.]\+ }}/substitute/g

It will not work with the error: Pattern not found.

When I use:

%s/{{ [a-zA-Z\.]\+ }}/substitute/g

It works.

Is there a way to use the meta-character-classes in ranges in VIM?

You can use:

  • A non capturing sub-expression, see :help E53 (you can use a capturing sub-expression as well, \(\), but the overhead of capturing is useless)

    %s/{{ \%(\w\|\.\)\+ }}/substitute/g
    
  • A sequence of optionally matched atoms - \%[], see :help E70

    %s/{{ \%[\w\.]\+ }}/substitute/g
    

Is there a group of groups in Regex in javascript or any other language

5 votes

I was just playing with the native replace method for strings in JavaScript. Is there anything like groups of groups? If not, how are groups ordered in a pattern where one group encloses other open and closed parentheses (potential groups)? For example,

var string = "my name is name that is named man".replace(/((name)|(is)|(man))/g, "$1");

What will the group references $1, $2, $3, and $4 be? I already tried it on my local computer (in Firebug), but it gives me results that I can't readily understand. A clear explanation would be appreciated!

In some languages you can specify a flag to say the order you want the groups to be in. In JavaScript you can't. Groups are numbered in the order in which their opening parentheses occur. So in your example the groups will be, in order:

1) ((name)|(is)|(man))

2) (name)

3) (is)

4) (man)

To see the output more clearly from your above string, execute:

"my name is name that is named man".replace(/((name)|(is)|(man))/g, '1($1) 2($2) 3($3) 4($4)\n');

Then you can clearly see what's in each group when each match is reached:

"my 1(name) 2(name) 3() 4()
 1(is) 2() 3(is) 4()
 1(name) 2(name) 3() 4()
 that 1(is) 2() 3(is) 4()
 1(name) 2(name) 3() 4()
d 1(man) 2() 3() 4(man)"

When the first match is reached, you can see that the string which matched group 2 (name) matched group 1 as well. Groups 3 and 4 didn't match anything. The same goes for each match. In this case, since group 1 wraps everything, it will always contain the whole match, and since the inner part is an alternation, only one of the three inner groups will contain any text on each match.
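
The same numbering rule (groups are counted by their opening parenthesis) holds in Python's re module, which makes for a compact way to inspect every match. A small Python 3 sketch, where a group that did not take part in the match shows up as None:

```python
import re

pattern = re.compile(r'((name)|(is)|(man))')
text = "my name is name that is named man"

for m in pattern.finditer(text):
    # groups() lists groups 1..4 in order of their opening parenthesis;
    # a group that did not take part in the match is None.
    print(m.group(0), m.groups())
```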

Replace Property Definitions in VB.Net Code

4 votes

In VB 2010 you can use auto-implemented properties, as in C#, which turns this

Private _SONo As String

Public Property SONo() As String
    Get
        Return _SONo
    End Get
    Set(ByVal value As String)
        _SONo = value
    End Set
End Property

Into

Public Property SONo() As String

What I want to do is replace the old style with the new style in a few files. Since Visual Studio's find and replace tool supports regular expressions, I assume there must be an expression I can use to do this conversion.

What would the regular expression be to do this conversion?

This could be dangerous as you might have logic in the property setters/getters, but if they don't have logic you could say:

Regular Expression:

Private\s_(\w+)\sAs\s(\w+).*?(^\w+).*?Property.*?End\sProperty

Replace:

${3} Property ${1} As ${2}

I've tested this with RegexBuddy targeting the .NET regex variant. Note that this may or may not work in the Visual Studio Find/Replace prompt, as that is yet another variant.

UPDATE: VS's variant (Dot can't match newlines so we need to add that functionality, also converted: \w = :a, \s = :b, {} for tags, and *? = @):

Private:b_{:a+}:bAs:b{:a+}(.|\n)@{:a+}(.|\n)@Property(.|\n)@End:bProperty

\3 Property \1 As \2

The Regex does the following:

Options: dot matches newline; case insensitive; ^ and $ match at line breaks

Match the characters “Private” literally «Private»
Match a single character that is a “whitespace character” (spaces, tabs, and line breaks) «\s»
Match the character “_” literally «_»
Match the regular expression below and capture its match into backreference number 1 «(\w+)»
   Match a single character that is a “word character” (letters, digits, and underscores) «\w+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match a single character that is a “whitespace character” (spaces, tabs, and line breaks) «\s»
Match the characters “As” literally «As»
Match a single character that is a “whitespace character” (spaces, tabs, and line breaks) «\s»
Match the regular expression below and capture its match into backreference number 2 «(\w+)»
   Match a single character that is a “word character” (letters, digits, and underscores) «\w+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match any single character «.*?»
   Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the regular expression below and capture its match into backreference number 3 «(\w+)»
   Match a single character that is a “word character” (letters, digits, and underscores) «\w+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match any single character «.*?»
   Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the characters “Property” literally «Property»
Match any single character «.*?»
   Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the characters “End” literally «End»
Match a single character that is a “whitespace character” (spaces, tabs, and line breaks) «\s»
Match the characters “Property” literally «Property»
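
To try the expression outside Visual Studio, it can be run through another engine. The following Python 3 sketch is only an illustration: the function name to_auto_property is made up, and the three flags are chosen to mirror the options listed above. The same caveat applies as before: it is only safe for properties whose getter and setter contain no logic.

```python
import re

OLD_STYLE = re.compile(
    r'Private\s_(\w+)\sAs\s(\w+).*?(^\w+).*?Property.*?End\sProperty',
    re.DOTALL | re.IGNORECASE | re.MULTILINE,
)

def to_auto_property(source):
    # \3 = access modifier, \1 = property name, \2 = type.
    return OLD_STYLE.sub(r'\3 Property \1 As \2', source)

vb = """Private _SONo As String

Public Property SONo() As String
    Get
        Return _SONo
    End Get
    Set(ByVal value As String)
        _SONo = value
    End Set
End Property
"""
print(to_auto_property(vb))  # Public Property SONo As String
```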

Query Android's SQLiteDatabase using Regex

3 votes

I am trying to fetch entries from a SQLiteDatabase in an Android program using the query function using the selection parameter. I have had success with simple pattern matching using the SQLite's LIKE and the % wildcard. Now I want to do more complex pattern matching using regular expressions.

According to the SQLite website, the REGEXP operator only works if a user-defined function backs it. Has anyone had any success creating user-defined SQLite functions for Android's SQLiteDatabase? Or has anyone found another way to use regular expressions when searching through strings in a database?

As described here

The REGEXP operator is a special syntax for the regexp() user function. No regexp() user function is defined by default and so use of the REGEXP operator will normally result in an error message. If an application-defined SQL function named "regexp" is added at run-time, that function will be called in order to implement the REGEXP operator.

But there is GLOB, which is a little more advanced than LIKE.
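
To show what "adding an application-defined regexp function" looks like in practice, here is a sketch using Python 3's sqlite3 module rather than Android's SQLiteDatabase; whether and how the Android API lets you register such a function is not covered here. The table name entries and the sample rows are invented for the example.

```python
import re
import sqlite3

def regexp(pattern, value):
    # SQLite evaluates "X REGEXP Y" as regexp(Y, X): pattern first, value second.
    return value is not None and re.search(pattern, value) is not None

conn = sqlite3.connect(":memory:")
conn.create_function("regexp", 2, regexp)  # register the user function
conn.execute("CREATE TABLE entries (name TEXT)")
conn.executemany("INSERT INTO entries VALUES (?)",
                 [("Algol",), ("Alma-0",), ("Altran",)])
rows = conn.execute(
    "SELECT name FROM entries WHERE name REGEXP ? ORDER BY name", ("^Al[gt]",)
).fetchall()
print(rows)  # [('Algol',), ('Altran',)]
```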

Overlapping source and destination blocks in memcpy with boost

2 votes

Can anyone explain to me why valgrind reports this for that simple C++ code?

The first problem is with boost::regex. When I use a subpattern with a question mark (for optional matching), valgrind reports:

Source and destination overlap in memcpy (line 8)

The second problem is with std::string::erase.

I have no idea what I am doing wrong.

Seems like the library code is using memcpy when, to be strictly portable, it should be using memmove.

For the compiler's library, like std::string, this is probably ok as that code doesn't have to be portable to other compilers, and can use knowledge about how the specific implementation works.

With the boost library, you will probably have to trust that they also know what they are doing. The library has a lot of configurations for different compilers and might also use a specific g++ extension.

Find words with specified first letter (Regex)

2 votes

I need a regex to find words starting with, for example, the letters "B" or "b". In the sentence Bword abword bword I need to find Bword and bword. My current regex is: [Bb]\w+ (the first character is a space), but it doesn't find Bword.

Thanks in advance.

Try using the following regex: (?i)\bB\w*\b

It means:

  1. (?i) - turn on the ignore-case option
  2. \b - a word boundary (the position at the start or end of a word)
  3. B - the letter itself
  4. \w* - any number of word characters (letters, digits, underscore)
  5. \b - a word boundary

So it will find Bword and bword.
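
A quick way to check the pattern, sketched in Python 3. The helper name words_starting_with is made up, and the ignore-case option is passed as a flag instead of the inline (?i):

```python
import re

def words_starting_with(letter, text):
    # \b anchors the match at a word boundary; \w* takes the rest of the word.
    pattern = r'\b' + re.escape(letter) + r'\w*\b'
    return re.findall(pattern, text, flags=re.IGNORECASE)

print(words_starting_with('B', 'Bword abword bword'))  # ['Bword', 'bword']
```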

JAVA: replaceAll reg pattern

1 votes

Considering the following string:

String s = "/static/201105-3805-somerandom/images/optional-folder/filename.gif";

How can I remove the "static/201105-3805-somerandom/" part? The "201105-3805-somerandom" part is completely random but is always composed of:

  • 6 digits
  • the "-" char
  • {1, n} digit chars
  • the "-" char
  • {1, n} digit and letter chars

If I use "/static/[0-9]*-[0-9]*-*/";, it replaces everything to the last / instead of the one just after the "{1, n} digit and letter chars", what am I missing?

Thanks

s = s.replaceAll("^/static/\\d{6}-\\d{1,}-.*?/", "/"); // leaves "/images/optional-folder/filename.gif"

Validate phone number by custom validator and javascript

1 votes

I have a phoneTextBox control, which contains 4 TextBoxes:

country code (1-3 digits), city code (1-7 digits), local number (1-7 digits) and extra phone number (1-5 digits).

The extra phone number is not required.

The code below doesn't work.

    <script type="text/javascript">
    function ValidatePhoneNumber(source, args) 

    {
        if (     $('#<%=txtCountryCode.ClientID %>').val().match(/^\d{1,3}$) ||
                 $('#<%=txtCityCode.ClientID %>').val().match(/^\d{1,7}$) ||
                 $('#<%=txtMainPhoneNumber.ClientID %>').val().match(/^\d{1,7}$)
           )


        {
            if ($('#<%=txtExtraPhoneNumber.ClientID %>').val().length<=0)
            {
               args.IsValid = true;
               return;
            }
            else 
            {
                if ($('#<%=txtExtraPhoneNumber.ClientID %>').val().match(/^\d{1,5}$)
                {
                    args.IsValid = true;
                   return;

                }
                else 
                {
                    args.IsValid = false;

                }

            }
        }
        else 
                {
                    args.IsValid = false;

                }

}
</script>
    <div style="display: inline">
        <asp:CustomValidator runat="server" ForeColor="Red" ErrorMessage="Invalid format" ClientValidationFunction="ValidatePhoneNumber" />
        <div>
            <b>+</b><asp:TextBox ID="txtCountryCode" runat="server" Width="30px" MaxLength="3"></asp:TextBox>
            <asp:TextBox ID="txtCityCode" runat="server" Width="60px" MaxLength="7"></asp:TextBox>
            <asp:TextBox ID="txtMainPhoneNumber" runat="server" Width="60px" MaxLength="7"></asp:TextBox>
            <asp:TextBox ID="txtExtraPhoneNumber" runat="server" Width="50px" MaxLength="5"></asp:TextBox>
        </div>
    </div>

    args.IsValid = $('#<%=txtCountryCode.ClientID %>').val().match(/^\d{1,3}$/) &&
                   $('#<%=txtCityCode.ClientID %>').val().match(/^\d{1,7}$/) &&
                   $('#<%=txtMainPhoneNumber.ClientID %>').val().match(/^\d{1,7}$/) &&
                   $('#<%=txtExtraPhoneNumber.ClientID %>').val().match(/^\d{0,5}$/);

Replace everything from beginning of line to equal sign

1 votes

For a content with the format:

KEY=VALUE

like:

LISTEN=I am listening.

I need to do some replacing using regex. I want this regular expression to replace anything before the = with $key. The match has to start at the beginning of the line, so that a key like 'EN' won't replace part of a key like "TOKEN".

Here's what I'm using, but it doesn't seem to work:

$content = preg_replace('~^'.$key.'\s?=[^\n$]+~iu',$newKey,$content);

$content = "foo=one\n"
         . "bar=two\n"
         . "baz=three\n";

$keys = array(
    'foo' => 'newFoo',
    'bar' => 'newBar',
    'baz' => 'newBaz',
);
foreach ( $keys as $oldKey => $newKey ) {
    $oldKey = preg_quote($oldKey, '#');
    $content = preg_replace("#^{$oldKey}( ?=)#m", "{$newKey}\\1", $content);
}

echo $content;

Output:

newFoo=one
newBar=two
newBaz=three

How do I match latin unicode characters in ColdFusion or Java regex?

1 votes

I'm looking for a ColdFusion or Java regex (to use in a replace function) that will match only numbers [0-9] and letters [a-z], but also include non-ASCII Portuguese letters (Unicode Latin, like ç and ã).

Something like this:

str = reReplaceNoCase(str, "match none number/letter but keep unicode latin chars", "", "ALL");

Input string: "informação 123 ?:#$%"
Desired outcome: "informação 123"

I know I can match letters and numbers with [a-z][0-9], but this doesn't match letters such as ç and ã.

Try the word character class \w; it matches letters, digits, and underscores. Be aware, though, that in Java \w covers only ASCII by default, so it may not match ç or ã there.

You can also use the named Unicode class \p{L}, which matches any letter (Java's java.util.regex supports it as well). In C# the task can be done with the following code:

var input = "informação 123 ?:#$%";
var result = Regex.Replace(input, @"[^\p{L}\s0-9]", string.Empty);

Regex [^\p{L}\s0-9] means: any character not in this class (all letters, white space, digits). In your example it therefore matches ?:#$%, and we can replace these characters with an empty string.
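
The same idea in Python 3, as a rough sketch. Python's built-in re module does not support \p{L}, so this version relies on \w being Unicode-aware for str patterns. The function name is made up, and the final strip() is an addition to drop the space left behind where the punctuation was.

```python
import re

def keep_letters_digits_spaces(text):
    # For str patterns, \w is Unicode-aware in Python 3, so ç and ã count as
    # word characters. [^\w\s] removes punctuation; "_" is listed separately
    # because \w would otherwise keep it.
    return re.sub(r'[^\w\s]|_', '', text).strip()

print(keep_letters_digits_spaces("informação 123 ?:#$%"))
```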