Best regex questions in March 2011

Regular expression to search for Gadaffi

145 votes

I'm trying to search for the word Gadaffi. What's the best regular expression to search for this?

My best attempt so far is:

\b[KG]h?add?af?fi$\b

But I still seem to be missing some journals. Any suggestions?

Update: I found a pretty extensive list here: http://www.express.be/joker/nl/platdujour/gaddafi-khadaffi-el-qadafi-kadhafy/141157.htm

The answer below matches all the 30 variants:

Gadaffi
Gadafi
Gadafy
Gaddafi
Gaddafy
Gaddhafi
Gadhafi
Gathafi
Ghadaffi
Ghadafi
Ghaddafi
Ghaddafy
Gheddafi
Kadaffi
Kadafi
Kaddafi
Kadhafi
Kazzafi
Khadaffy
Khadafy
Khaddafi
Qadafi
Qaddafi
Qadhafi
Qadhdhafi
Qadthafi
Qathafi
Quathafi
Qudhafi
Kad'afi

\b[KGQ]h?add?h?af?fi\b

Arabic transcription is (Wiki says) "Qaḏḏāfī", so maybe adding a Q. And one H ("Gadhafi", as the article (see below) mentions).

Btw, why is there a $ at the end of the regex?


Btw, nice article on the topic:

Gaddafi, Kadafi, or Qaddafi? Why is the Libyan leader’s name spelled so many different ways?.


EDIT

To match all the names in the article you've mentioned later, this should match them all. Let's just hope it won't match a lot of other stuff :D

\b(Kh?|Gh?|Qu?)[aeu](d['dt]?|t|zz|dhd)h?aff?[iy]\b
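
As a quick sanity check, the pattern can be run against the 30 spellings listed in the question. This is only a sketch in Python (the thread does not say which regex engine is in use), and the names VARIANTS, PATTERN and missed are made up for the example:

```python
import re

# The 30 spellings listed in the question.
VARIANTS = """
Gadaffi Gadafi Gadafy Gaddafi Gaddafy Gaddhafi Gadhafi Gathafi
Ghadaffi Ghadafi Ghaddafi Ghaddafy Gheddafi Kadaffi Kadafi Kaddafi
Kadhafi Kazzafi Khadaffy Khadafy Khaddafi Qadafi Qaddafi Qadhafi
Qadhdhafi Qadthafi Qathafi Quathafi Qudhafi Kad'afi
""".split()

# The pattern from the EDIT above, unchanged.
PATTERN = re.compile(r"\b(Kh?|Gh?|Qu?)[aeu](d['dt]?|t|zz|dhd)h?aff?[iy]\b")

# Any spelling the pattern fails to find ends up in this list.
missed = [name for name in VARIANTS if not PATTERN.search(name)]
print(len(VARIANTS), "variants checked; missed:", missed)
```

A check like this says nothing about false positives, so it is still worth running the pattern over real text before trusting it.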

C# library for human readable pattern matching?

11 votes

Does anybody know a C# library for matching human readable patterns? Similar to regex, but friendlier?

Given a string value, I want to be able to match it against a pattern along the lines of:

(this AND that) OR "theother"

where "this" and "that" are LIKE expressions, and "theother" is an exact match due to the quotes.

UPDATE: Ok, just to be a little bit clearer. The reason I want this is to allow end users to enter in their own patterns, as string values. So I'm after something that works in a similar way to regex, but uses human readable strings that my users will easily understand

var pattern = "(this AND that) OR \"theother\""; // Could be fetched from textbox
var match = SomeLib.IsMatch(myString, pattern);

Well, after a lot of searching, I wasn't able to find exactly what I was after. Since I needed to get something working quickly, and the system I'm using already has the relevant DLLs, I ended up using Lucene.NET to create a temporary index containing a single document, with the fields I need to search added to it. I can then run the type of query I'm after against it and check for any matches. By using the RAMDirectory class I was able to create the index in memory and dispose of it after the lookup, so no index files have to be written to disk.

I'm sure there are probably less intensive ways to achieve this, but as I say, it's the best I could come up with in the time I had.

Thanks to everyone for their suggestions. I would still like to know if there is a better way of doing this.

Regex for a string up to 20 chars long with a comma

6 votes

I need to define a regex for a string with the following requirements:

  • Maximum 20 characters
  • Must be in the form Name,Surname
  • No numbers and special characters allowed (again, it's a name&surname)

I already tried something like ^[^1-9\?\*\.\?\$\^\_]{1,20}[,][^1-9\?\*\.\?\$\^\_\-]{1,20}$ but as you can find, it also matches a 40 chars long string.

How can I check for the whole string's maximum length and at the same time impose 1 comma inside of it and obviously not at the borders?

Thank you

Try the regex:

^(?=[^,]+,[^,]+$)[a-zA-Z,]{1,20}$

Rubular Link

Explanation:

^                : Start anchor
(?=[^,]+,[^,]+$) : Positive lookahead to ensure string has exactly one comma
                   surrounded by at least one non-comma character on both sides.
[a-zA-Z,]{1,20}  : Ensure entire string is of length max 20 and has only 
                   letters and comma
$                : End anchor
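
To see the lookahead and the length limit working together, here is a small Python sketch using the same pattern. The helper name is_valid_name and the sample inputs are invented for illustration:

```python
import re

# Same pattern as above: the lookahead enforces exactly one comma with
# text on both sides, and the character class enforces letters/comma
# only, with a total length of at most 20.
NAME_RE = re.compile(r"^(?=[^,]+,[^,]+$)[a-zA-Z,]{1,20}$")

def is_valid_name(value):
    """Return True if value looks like Name,Surname and is at most 20 chars."""
    return NAME_RE.match(value) is not None

for sample in ["John,Smith", "John", ",Smith", "John,Smith,Jr",
               "John,Sm1th", "Abcdefghij,Klmnopqrstu"]:
    print(repr(sample), is_valid_name(sample))
```

Note that the pattern only allows unaccented ASCII letters, so names such as "José,García" would be rejected as written.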

Is there a way to create a string that matches a given C# regex?

6 votes

My application has a feature that parses text using a regular expression to extract special values. I find myself also needing to create strings that follow the same format. Is there a way to use the already defined regular expression to create those strings?

For example, assume my regex looks something like this:

public static Regex MyRegex = new Regex( @"sometext_(?<group1>\d*)" );

I'd like to be able to use MyRegex to create a new string, something like:

var created = MyRegex.ToString( new Dictionary<string, string>() {{ "group1", "data1" }} );

Such that created would then have the value "sometext_data1".

Update: Judging from some of the answers below, I didn't make myself clear enough. I don't want to generate random strings matching the criteria, I want to be able to create specific strings matching the criteria. In the example above, I provided "data1" to fill "group1". Basically, I have a regex that I want to use in a manner similar to format strings instead of also defining a separate format string.

You'll need a tool called Rex. Well you don't 'need' it, but it's what I use :-)

http://research.microsoft.com/en-us/projects/rex/

You can (although it's not ideal) add the exe as a reference to your project and use the classes that have been made public.

It works quite well.

Regex named capture groups in Delphi XE

6 votes

I have built a match pattern in RegexBuddy which behaves exactly as I expect. But I cannot transfer this to Delphi XE, at least when using the latest built in TRegEx or TPerlRegEx.

My real-world code has 6 capture groups, but I can illustrate the problem with a simpler example. This code shows "3" in the first dialog and then raises an exception (-7 index out of bounds) when executing the second dialog.

var
  Regex: TRegEx;
  M: TMatch;
begin
  Regex := TRegEx.Create('(?P<time>\d{1,2}:\d{1,2})(?P<judge>.{1,3})');
  M := Regex.Match('00:00  X1 90  55KENNY BENNY');
  ShowMessage(IntToStr(M.Groups.Count));
  ShowMessage(M.Groups['time'].Value);
end;

But if I use only one capture group

Regex := TRegEx.Create('(?P<time>\d{1,2}:\d{1,2})');

The first dialog shows "2" and the second dialog will show the time "00:00" as expected.

It would be rather limiting if only one named capture group were allowed, but that's not the case. If I change the capture group name to, for example, "atime":

var
  Regex: TRegEx;
  M: TMatch;
begin
  Regex := TRegEx.Create('(?P<atime>\d{1,2}:\d{1,2})(?P<judge>.{1,3})');
  M := Regex.Match('00:00  X1 90  55KENNY BENNY');
  ShowMessage(IntToStr(M.Groups.Count));
  ShowMessage(M.Groups['atime'].Value);
end;

I get "3" and "00:00", just as expected. Are there reserved words I cannot use? I don't think so, because in my real example I've tried completely random names. I just cannot figure out what causes this behaviour.

When pcre_get_stringnumber does not find the name, PCRE_ERROR_NOSUBSTRING is returned.

PCRE_ERROR_NOSUBSTRING is defined in RegularExpressionsAPI as PCRE_ERROR_NOSUBSTRING = -7.

Some testing shows that pcre_get_stringnumber returns PCRE_ERROR_NOSUBSTRING for every name that has the first letter in the range of k to z and that range is dependent of the first letter in judge. Changing judge to something else changes the range.

As I see it, there are at least two bugs involved here: one in pcre_get_stringnumber, and one in TGroupCollection.GetItem, which needs to raise a proper exception instead of SRegExIndexOutOfBounds.

Remove all exclusive Latin characters using regex

6 votes

I'm developing software in Portuguese, so many of my entities have names like 'maça' or 'lição', and I want to use the entity name as a resource key. So I want to keep every character except 'ç', 'ã', 'õ' and so on.

Is there an optimal solution using regex? My current regex is (as Remove characters using Regex suggests):

Regex regex = new Regex(@"[\W_]+");
string cleanText = regex.Replace(messyText, "").ToUpper();

Just to emphasize, I'm only concerned with Latin characters.

A simple option is to white-list the accepted characters:

string clean = Regex.Replace(messy, @"[^a-zA-Z0-9!@#]+", "");

If you want to remove all non-ASCII letters but keep all other characters, you can use character class subtraction:

string clean = Regex.Replace(messy, @"[\p{L}-[a-zA-Z]]+", "");

It can also be written as the more standard and complicated [^\P{L}a-zA-Z]+, which reads "select all characters that are not (non-letters or ASCII letters)", which ends up with the letters we're looking for. The shorter [^\Wa-zA-Z] is close, but it also matches digits and underscores, since those are word characters too.

You may also consider the following approach more useful: How do I remove diacritics (accents) from a string in .NET?
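
The diacritics idea can be sketched in a few lines. The thread is about .NET, so the Python below only illustrates the general technique (decompose, then drop the combining marks) and is not the code from the linked answer. The function name strip_diacritics is made up:

```python
import unicodedata

def strip_diacritics(text):
    """Replace accented letters with their base letters (e.g. 'ç' -> 'c')."""
    # NFD splits each accented character into a base character followed
    # by one or more combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns 0 for ordinary characters and non-zero for marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("lição").upper())
```

Unlike the regex approaches above, this keeps a readable key ("LICAO" rather than "LIO"), which may or may not be what a resource key needs.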

How to replace pairs of tokens in a string?

6 votes

Hi all,

New to python, competent in a few languages, but can't see a 'snazzy' way of doing the following. I'm sure it's screaming out for a regex, but any solution I can come up with (using regex groups and what not) becomes insane quite quickly.

So, I have a string with html-like tags that I want to replace with actual html tags.

For example:

Hello, my name is /bJane/b.

Should become:

Hello, my name is <b>Jane</b>.

It might be combo'd with [i]talic and [u]nderline as well:

/iHello/i, my /uname/u is /b/i/uJane/b/i/u.

Should become:

<i>Hello</i>, my <u>name</u> is <b><i><u>Jane</b></i></u>.

Obviously a straight str.replace won't work because every second token needs to be preceded by the forward slash.

For clarity, if tokens are being combo'd, it's always first opened, first closed.

Many thanks!

PS: Before anybody gets excited, I know that this sort of thing should be done with CSS, blah, blah, blah, but I didn't write the software, I'm just reversing its output!

Maybe something like this can help :

import re


def text2html(text):
    """ Convert a text in a certain format to html.

    Examples:
    >>> text2html('Hello, my name is /bJane/b')
    'Hello, my name is <b>Jane</b>'
    >>> text2html('/iHello/i, my /uname/u is /b/i/uJane/u/i/b')
    '<i>Hello</i>, my <u>name</u> is <b><i><u>Jane</u></i></b>'

    """

    elem = []

    def to_tag(match_obj):
        match = match_obj.group(0)
        if match in elem:
            elem.pop(elem.index(match))
            return "</{0}>".format(match[1])
        else:
            elem.append(match)
            return "<{0}>".format(match[1])

    return re.sub(r'/.', to_tag, text)

if __name__ == "__main__":
    import doctest
    doctest.testmod()

Why the difference between .NET regular expressions and Visual Studio's regular expressions?

5 votes

I've finally found references to both Visual Studio's regular expressions for Find and Replace, and .NET's regular expression package, and now out of morbid curiosity I want to know: why the difference!?

I'm sure there's a technical, historical, or usability reason, but it confused the bajeepers [sp? ;-) ] out of me at first.

I'd speculate that the VS regexes are designed to match code well, having defined lots of handy shortcuts like :w for an entire word, or :i for a C++ identifier, or :q for a quoted string.

They usually don't need to handle arbitrary data that you'd need lookaround assertions and stuff like that for. Or at least that was lower on the priorities list.

What's a Rails plugin, or Ruby gem, to automatically fix English grammar?

5 votes

Facebook just re-launched Comments, with an automatic grammar fixing feature.

What does the grammar filter do?

Adds punctuation (e.g. periods at the end of sentences)
Trims extra whitespace
Auto cases words (e.g. capitalize the first word of a sentence)
Expands slang words (e.g. plz becomes please)
Adds a space after punctuation (e.g. Hi,Cat would become Hi, Cat)
Fixes common grammar mistakes (e.g. converts ‘dont’ to ‘don’t’)

What is an equivalent plugin or gem?

I don't know of anything with those particular features.

However, you might look at Ruby LinkParser, which is a Ruby wrapper for the Link Grammar parser developed by academics and used by the Abiword project for grammar checking. (Note that "link" in Link Grammar parser doesn't refer to HTML links, but rather to a structure that describes English syntax as a set of links between words.)

Here's another interesting checker, written in Ruby, which is designed to check LaTeX files for some of the problems you mention (plus others).

Replace newlines, but keep the blank lines

5 votes

I want to replace newlines (\r\n) with space, but I want to keep the blank lines. In other words, I want to replace \r\n with ' ', if \r\n is not preceded by another \r\n. For example:

line 1

line 2
line 3
line 4

Should end up as...

line 1

line 2 line 3 line 4

But not as "line 1 line 2 line 3 line 4", which is what I'm doing right now with this

preg_replace("/\r\n/", " ", $string);

Try this:

(?<!\n)\n(?!\n)

Of course, you can change \n to whatever you need.

Working example: http://ideone.com/dF5L9
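
For the \r\n case from the question, the same lookaround idea could be written as follows. This is a sketch in Python rather than PHP, and the names SINGLE_BREAK and join_lines are invented for the example:

```python
import re

# A \r\n that is neither preceded nor followed by another \r\n,
# i.e. a line break that is not part of a blank line.
SINGLE_BREAK = re.compile(r"(?<!\r\n)\r\n(?!\r\n)")

def join_lines(text):
    """Replace single line breaks with a space, leaving blank lines alone."""
    return SINGLE_BREAK.sub(" ", text)

sample = "line 1\r\n\r\nline 2\r\nline 3\r\nline 4"
print(repr(join_lines(sample)))
```

The pattern itself uses only fixed-width lookarounds, so it should carry over to preg_replace as long as the PHP string quoting is handled correctly.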

python: how to interrupt a regex match

5 votes

I iterate over the lines in a large number of downloaded text files and do a regex match on each line. Usually, the match takes less than a second. However, at times a match takes several minutes, sometimes the match does not finish at all and the code just hangs (waited an hour a couple of times, then gave up). Therefore, I need to introduce some kind of timeout and tell the regex match code in some way to stop after 10 seconds or so. I can live with the fact that I will lose the data the regex was supposed to return.

I tried the following (which of course is already 2 different, thread-based solutions shown in one code sample):

def timeout_handler():
    print 'timeout_handler called'

if __name__ == '__main__':
    timer_thread = Timer(8.0, timeout_handler)
    parse_thread = Thread(target=parse_data_files, args=(my_args))
    timer_thread.start()
    parse_thread.start()
    parse_thread.join(12.0)
    print 'do we ever get here ?'

but I do neither get the timeout_handler called nor the do we ever get here ? line in the output, the code is just stuck in parse_data_files.

Even worse, I can't even stop the program with CTRL-C, instead I need to look up the python process number and kill that process. Some research showed that the Python guys are aware of regex C code running away: http://bugs.python.org/issue846388

I did achieve some success using signals:

signal(SIGALRM, timeout_handler)
alarm(8)
data_sets = parse_data_files(config(), data_provider)
alarm(0)

this gets me the timeout_handler called line in the output - and I can still stop my script using CTRL-C. If I now modify the timeout_handler like this:

class TimeoutException(Exception): 
    pass 

def timeout_handler(signum, frame):
    raise TimeoutException()

and enclose the actual call to re.match(...) in a try ... except TimeoutException clause, the regex match actually does get interrupted. Unfortunately, this only works in my simple, single-threaded sandbox script I'm using to try out stuff. There are a few things wrong with this solution:

  • the signal triggers only once, if there is more than one problematic line, I'm stuck on the second one
  • the timer starts counting right there, not when the actual parsing starts
  • because of the GIL, I have to do all the signal setup in the main thread and signals are only received in the main thread; this clashes with the fact that multiple files are meant to be parsed simultaneously in separate threads - there is also only one global timeout exception raised and I don't see how to know in which thread I need to react to it
  • I've read several times now that threads and signals do not mix very well

I have also considered doing the regex match in a separate process, but before I get into that, I thought I'd better check here if anyone has come across this problem before and could give me some hints on how to solve it.

Update

the regex looks like this (well, one of them anyway, the problem occurs with other regexes, too; this is the simplest one):

'^(\d{5}), .+?, (\d{8}), (\d{4}), .+?, .+?,' + 37 * ' (.*?),' + ' (.*?)$'

sample data:

95756, "KURN ", 20110311, 2130, -34.00, 151.21, 260, 06.0, -9999.0, -9999.0, -9999.0, -9999.0, -9999.0, -9999, -9999, 07.0, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -

As said, the regex usually performs ok - I can parse several hundreds of files with several hundreds of lines in less than a minute. That's when the files are complete, though - the code seems to hang with files that have incomplete lines, such as e.g.

`95142, "YMGD ", 20110311, 1700, -12.06, 134.23, 310, 05.0, 25.8, 23.7, 1004.7, 20.6, 0.0, -9999, -9999, 07.0, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999, -9999

I do also get cases where the regex seems to return right away and reports a non-match.

Update 2

I have only quickly read through the catastrophic article, but as far as I can tell so far, that's not the cause - I do not nest any repetition operators.

I'm on Mac OSX, so I can't use RegexBuddy to analyze my regex. I tried RegExhibit (which apparently uses a Perl RegEx engine internally) - and that runs away, too.

You are running into catastrophic backtracking; not because of nested quantifiers but because your quantified characters also can match the separators, and since there are a lot of them, you'll get exponential time in certain cases.

Aside from the fact that it looks more like a job for a CSV parser, try the following:

r'^(\d{5}), [^,]+, (\d{8}), (\d{4}), [^,]+, [^,]+,' + 37 * r' ([^,]+),' + r' ([^,]+)$'

By explicitly disallowing the comma to match between separators, you'll speed up the regex enormously.

If commas may be present inside quoted strings, for example, then just exchange [^,]+ (in places where you'd expect this) with

(?:"[^"]*"|[^,]+)

To illustrate:

Using your regex against the first example, RegexBuddy reports a successful match after 793 steps of the regex engine. For the second (incomplete line) example, it reports a match failure after 1,000,000 steps of the regex engine (this is where RegexBuddy gives up; Python will keep on churning).

Using my regex, the successful match happens in 173 steps, the failure in 174.
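
The behaviour is easy to reproduce without RegexBuddy. In the sketch below the two input lines are synthetic, built to have the right and the wrong number of fields, because the sample data in the question is cut off. The names PATTERN, HEADER, complete and incomplete are made up:

```python
import re

# The rewritten pattern from above: no field is allowed to contain a comma.
PATTERN = re.compile(
    r"^(\d{5}), [^,]+, (\d{8}), (\d{4}), [^,]+, [^,]+,"
    + 37 * r" ([^,]+),"
    + r" ([^,]+)$"
)

# Synthetic test lines (assumption: every trailing field is -9999).
HEADER = '95756, "KURN ", 20110311, 2130, -34.00, 151.21,'
complete = HEADER + 37 * " -9999," + " -9999"    # 44 fields in total
incomplete = HEADER + 30 * " -9999," + " -9999"  # several fields missing

match = PATTERN.match(complete)
print("complete line matched:", match is not None)
print("incomplete line matched:", PATTERN.match(incomplete) is not None)
```

With [^,]+ each field can only end at a comma, so on the short line the engine runs out of input and gives up almost immediately instead of trying every way of distributing commas across the lazy groups.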

I need a regular expression to convert a string into an anchor tag to be used as a hyperlink

5 votes

Hi, I'm looking for a regular expression that will change this:

[check out this URL!](http://www.reallycoolURL.com)

into this:

<a href="http://www.reallycoolURL.com">check out this URL</a>

i.e. a user can input a URL using my format and my C# application will convert this into a hyperlink. I'm looking to make use of the Regex.Replace function within C#, any help would be appreciated!

Use the Regex.Replace method to specify a replacement string that will allow you to format the captured groups. An example would be:

string input = "[check out this URL!](http://www.reallycoolURL.com)";
string pattern = @"\[(?<Text>[^]]+)]\((?<Url>[^)]+)\)";
string replacement = @"<a href=""${Url}"">${Text}</a>";
string result = Regex.Replace(input, pattern, replacement);
Console.WriteLine(result);

Notice that I am using named capture groups in the pattern, which allows me to refer to them as ${Name} in the replacement string. You can structure the replacement easily with this format.

The pattern breakdown is:

  • \[(?<Text>[^]]+)]: match an opening square bracket, and capture everything that is not a closing square bracket into the named captured group Text. Then match the closing square bracket. Notice that the closing square bracket need not be escaped within the character class group. It is important to escape the opening square bracket though.
  • \((?<Url>[^)]+)\): same idea but with parentheses and capturing into the named Url group.

Named groups help with clarity, and regex patterns can benefit from all the clarity they can get. For the sake of completeness here is the same approach without using named groups, in which case they are numbered:

string input = "[check out this URL!](http://www.reallycoolURL.com)";
string pattern = @"\[([^]]+)]\(([^)]+)\)";
string replacement = @"<a href=""$2"">$1</a>";
string result = Regex.Replace(input, pattern, replacement);
Console.WriteLine(result);

In this case ([^]]+) is the first group, referred to via $1 in the replacement pattern, and the second group is ([^)]+), referred to by $2.
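
For readers outside .NET, the named-group version translates fairly directly. Here is a rough Python equivalent, where named groups are written (?P<name>...) and referenced as \g<name> in the replacement. The function name to_anchor is made up:

```python
import re

# [text](url) with both parts captured into named groups.
LINK_RE = re.compile(r"\[(?P<text>[^\]]+)\]\((?P<url>[^)]+)\)")

def to_anchor(text):
    """Convert every [text](url) occurrence into an HTML anchor tag."""
    return LINK_RE.sub(r'<a href="\g<url>">\g<text></a>', text)

print(to_anchor("[check out this URL!](http://www.reallycoolURL.com)"))
```

Neither version escapes the captured text, so user input containing quotes or angle brackets would need HTML encoding before being written to a page.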

natural language processing fix for combined words

5 votes

I have some text that was generated by another system. It combined some words together in what I assume was some sort of word-wrap by-product. So something simple like 'the dog' is combined into 'thedog'.

I checked the ASCII and Unicode strings to see if there was some unseen character in there, but there wasn't. A confounding problem is that this is medical text, and a corpus to check against isn't readily available. So, a real example is '...test to rule out SARS versus pneumonia', which ends up as '... versuspneumonia.'

Anyone have a suggestion for finding and separating these?

Here is what I did. I combined a couple of ideas and using a general bootstrapping methodology came up with a pretty good solution. I used Python for all of this.

  1. took a sample of reports, tokenized all the words and created a frequency table.
  2. For words with a frequency of 3 or under (frequency of 4 or more was deemed common enough to be correct), I spell checked them using PyEnchant package (enchant library)
  3. built a medical dictionary from the 'misspelled' words, in step 2, that were clinical.
  4. for all the reports, created a frequency table
  5. for words with a frequency under 4, I spell checked each using PyEnchant and my medical dictionary
  6. Took each misspelled word and split it in all possible ways. Each split was tested to see whether it produced 2 correctly spelled words; any successful split was kept.
  7. For each word, the highest-weighted potential solution was used.
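
Step 6 is the core of the approach and can be sketched in a few lines. To keep the example self-contained, a plain Python set stands in for the PyEnchant and medical-dictionary lookup. That substitution, and the names split_candidates and KNOWN_WORDS, are assumptions made for illustration:

```python
def split_candidates(word, dictionary):
    """Return every (left, right) split of word where both halves are known words."""
    candidates = []
    # Try every split position, never leaving either half empty.
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in dictionary and right in dictionary:
            candidates.append((left, right))
    return candidates

# Stand-in for the spell checker plus medical dictionary.
KNOWN_WORDS = {"the", "dog", "versus", "pneumonia"}

print(split_candidates("versuspneumonia", KNOWN_WORDS))
print(split_candidates("thedog", KNOWN_WORDS))
```

When a word has more than one valid split, step 7 would pick among the candidates, for example by preferring the split whose halves are most frequent in the reports.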

Regex failing when pattern involves dollar sign ($)

5 votes

I'm running into a bit of an issue when it comes to matching subpatterns that involve the dollar sign. For example, consider the following chunk of text:

Regular Price: $20.50       Final Price: $15.20
Regular Price: $18.99       Final Price: $2.25
Regular Price: $11.22       Final Price: $33.44
Regular Price: $55.66       Final Price: $77.88

I was attempting to match the Regular/Final price sets with the following regex, but it simply wasn't working (no matches at all):
preg_match_all("/Regular Price: \$(\d+\.\d{2}).*Final Price: \$(\d+\.\d{2})/U", $data, $matches);

I escaped the dollar sign, so what gives?

Inside a double quoted string the backslash is treated as an escape character for the $. The backslash is removed by the PHP parser even before the preg_match_all function sees it:

$r = "/Regular Price: \$(\d+\.\d{2}).*Final Price: \$(\d+\.\d{2})/U";
var_dump($r);

Output (ideone):

"/Regular Price: $(\d+\.\d{2}).*Final Price: $(\d+\.\d{2})/U"
                 ^                           ^
              the backslashes are no longer there

To fix this use a single quoted string instead of a double quoted string:

preg_match_all('/Regular Price: \$(\d+\.\d{2}).*Final Price: \$(\d+\.\d{2})/U',
               $data,
               $matches);

See it working online: ideone
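
For comparison only, here is the same extraction in Python, where a raw string literal plays the role that the single-quoted string plays in PHP. Python's re module has no equivalent of the /U modifier, so the lazy quantifier is written out as .*? instead. The names PRICE_RE, data and pairs are made up:

```python
import re

data = """Regular Price: $20.50       Final Price: $15.20
Regular Price: $18.99       Final Price: $2.25
Regular Price: $11.22       Final Price: $33.44
Regular Price: $55.66       Final Price: $77.88"""

# Raw string: the backslash before $ reaches the regex engine untouched.
PRICE_RE = re.compile(r"Regular Price: \$(\d+\.\d{2}).*?Final Price: \$(\d+\.\d{2})")

# One (regular, final) tuple per line of input.
pairs = PRICE_RE.findall(data)
print(pairs)
```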

java - split string using regular expression

5 votes

I need to split a string where there's a comma, but it depends where the comma is placed.

As an example

consider the following:

C=75,user_is_active(A,B),user_is_using_app(A,B),D=78

I'd like the String.split() function to separate them like this:

C=75 

user_is_active(A,B) 

user_is_using_app(A,B)

D=78

I can only think of one thing but I'm not sure how it'd be expressed in regex.

The characters/words within the brackets are always capital. In other words, there won't be a situation where I will have user_is_active(a,b).

Is there a way to do this?

If you don't have more than one level of parentheses, you could do a split on a comma that isn't followed by a closing ) before an opening (:

String[] splitArray = subjectString.split(
    "(?x),   # Verbose regex: Match a comma\n" +
    "(?!     # unless it's followed by...\n" +
    " [^(]*  # any number of characters except (\n" +
    " \\)    # and a )\n" +
    ")       # end of lookahead assertion");

Your proposed rule would translate as

String[] splitArray = subjectString.split(
    "(?x)         # Verbose regex\n" +
    "(?<!\\p{Lu}) # Unless it's preceded by an uppercase letter...\n" +
    ",            # ...match a comma...\n" +
    "(?!\\p{Lu})  # ...that is not followed by an uppercase letter");

but then you would miss a split in a text like

Org=NASA,Craft=Shuttle
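
The first (lookahead-based) split is easy to try outside Java. Here is a compact Python sketch of the same regex without the verbose-mode comments. The names SPLIT_RE and split_top_level are made up:

```python
import re

# A comma that is NOT followed by a ")" before the next "(",
# i.e. a comma that sits outside any single-level parentheses.
SPLIT_RE = re.compile(r",(?![^(]*\))")

def split_top_level(text):
    """Split on commas outside parentheses (one nesting level only)."""
    return SPLIT_RE.split(text)

print(split_top_level("C=75,user_is_active(A,B),user_is_using_app(A,B),D=78"))
print(split_top_level("Org=NASA,Craft=Shuttle"))
```

As the answer notes, this relies on there being at most one level of parentheses; nested calls would need a real parser or a counting loop.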

What is the equivalent of "?|" operator found in php(pcre) in C# ?

5 votes

Hi,

The following regular expression will match "Saturday" or "Sunday" : (?:(Sat)ur|(Sun))day

But in one case backreference 1 is filled while backreference 2 is empty and in the other case vice-versa.

PHP (pcre) provides a nice operator "?|" that circumvents this problem. The previous regex would become (?|(Sat)ur|(Sun))day. So there will not be empty backreferences.

Is there an equivalent in C# or some workaround ?

.NET doesn't support the branch-reset operator, but it does support named groups, and it lets you reuse group names without restriction (something no other flavor does, AFAIK). So you could use this:

(?:(?<abbr>Sat)ur|(?<abbr>Sun))day

...and the abbreviated name will be stored in Match.Groups["abbr"].
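
In flavors that support neither branch reset nor repeated group names (Python's re module is one), a possible workaround is to ask the match object which group actually participated. A small sketch, with DAY_RE and abbreviation as made-up names:

```python
import re

# The original pattern from the question, with two numbered groups.
DAY_RE = re.compile(r"(?:(Sat)ur|(Sun))day")

def abbreviation(text):
    """Return 'Sat' or 'Sun' regardless of which alternative matched."""
    m = DAY_RE.search(text)
    if m is None:
        return None
    # lastindex is the number of the last group that took part in the match.
    return m.group(m.lastindex)

print(abbreviation("Saturday"), abbreviation("Sunday"))
```

This only works cleanly when exactly one capturing group matches per alternative, as is the case here.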

Find important text in arbitrary HTML using PHP?

4 votes

I have some random HTML layouts that contain important text I would like to extract. I cannot just strip_tags() as that will leave a bunch of extra junk from the sidebar/footer/header/etc.

I found a method built in Python and I was wondering if there is anything like this in PHP.

The concept is rather simple: use information about the density of text vs. HTML code to work out if a line of text is worth outputting. (This isn’t a novel idea, but it works!) The basic process works as follows:

  1. Parse the HTML code and keep track of the number of bytes processed.
  2. Store the text output on a per-line, or per-paragraph basis.
  3. Associate with each text line the number of bytes of HTML required to describe it.
  4. Compute the text density of each line by calculating the ratio of text to bytes.
  5. Then decide if the line is part of the content by using a neural network.

You can get pretty good results just by checking if the line’s density is above a fixed threshold (or the average), but the system makes fewer mistakes if you use machine learning - not to mention that it’s easier to implement!
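
The fixed-threshold variant of this idea can be sketched quite briefly. The Python below is a deliberately simplified illustration: it works per source line, strips tags with a regex instead of a real HTML parser, counts characters rather than bytes, and uses an arbitrary threshold of 0.5 in place of the neural network. All names are made up:

```python
import re

TAG_RE = re.compile(r"<[^>]*>")

def text_density(html_line):
    """Ratio of visible text characters to total characters in one line of HTML."""
    line = html_line.strip()
    if not line:
        return 0.0
    text = TAG_RE.sub("", line).strip()
    return len(text) / len(line)

def extract_dense_lines(html, threshold=0.5):
    """Keep the text of every line whose density reaches the threshold."""
    kept = []
    for line in html.splitlines():
        if text_density(line) >= threshold:
            kept.append(TAG_RE.sub("", line).strip())
    return kept

page = ('<div class="nav"><a href="/">Home</a></div>\n'
        '<p>This is the main article text, long enough to dominate.</p>')
print(extract_dense_lines(page))
```

Markup-heavy lines such as navigation links score low and are dropped, while paragraph text scores high. Real pages with minified or single-line HTML would need the per-paragraph bookkeeping described in steps 1 to 3.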

Update: I started a bounty for an answer that could pull main content from a random HTML template. Since I can't share the documents I will be using - just pick any random blog sites and try to extract the body text from the layout. Remember that the header, sidebar(s), and footer may contain text also. See the link above for ideas.

  • phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on jQuery JavaScript Library.

UPDATE 2

  1. many blogs make use of CMS;
  2. a blog's HTML structure is the same almost all the time.
  3. avoid common selectors like #sidebar, #header, #footer, #comments, etc..
  4. avoid any widget by tag name script, iframe
  5. clear well-known content like:
    1. /\d+\scomment(?:[s])/im
    2. /(read the rest|read more).*/im
    3. /(?:.*(?:by|post|submitt?)(?:ed)?.*\s(at|am|pm))/im
    4. /[^a-z0-9]+/im

search for well-known classes and ids:

  • typepad.com .entry-content
  • wordpress.org .post-entry .entry .post
  • movabletype.com .post
  • blogger.com .post-body .entry-content
  • drupal.com .content
  • tumblr.com .post
  • squarespace.com .journal-entry-text
  • expressionengine.com .entry
  • gawker.com .post-body

  • Ref: The blog platforms of choice among the top 100 blogs


$selectors = array('.post-body','.post','.journal-entry-text','.entry-content','.content');
$doc = phpQuery::newDocumentFile('http://blog.com')->find($selectors)->children('p,div');

search based on a common HTML structure that looks like this:

<div>
<h1|h2|h3|h4|a />
<p|div />
</div>

$doc = phpQuery::newDocumentFile('http://blog.com')->find('h1,h2,h3,h4')->parent()->children('p,div');

explode string into tokens keeping quoted substr intact

4 votes

I don't know where I've seen it, but can anyone tell me how to accomplish this using PHP and regex?

'this is a string "that has quoted text" inside.'

I want to be able to explode it like this

[0]this
[1]is
[2]a
[3]string
[4]"that has quoted text"
[5]inside

keeping the quotes intact.

Can you please try following code:

$str = 'this is a string  "that has quoted text" inside.';
var_dump ( preg_split('#\s*("[^"]*")\s*|\s+#', $str, -1 , PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY) );

Output: 
array(6) {
  [0]=>
  string(4) "this"
  [1]=>
  string(2) "is"
  [2]=>
  string(1) "a"
  [3]=>
  string(6) "string"
  [4]=>
  string(22) ""that has quoted text""
  [5]=>
  string(7) "inside."
}

Here is the link for above working code on dialpad

Update: For escaping support please try:

preg_split('#\s*((?<!\\\\)"[^"]*")\s*|\s+#', $str, -1 , PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
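
An alternative way to think about the problem is to match the tokens rather than the separators. The Python sketch below uses that different technique (findall instead of a split) and does not handle escaped quotes. The names TOKEN_RE and tokenize are made up:

```python
import re

# Either a complete double-quoted chunk, or a run of non-whitespace.
# The quoted alternative comes first so it wins when a token starts with ".
TOKEN_RE = re.compile(r'"[^"]*"|\S+')

def tokenize(text):
    """Split text on whitespace, keeping double-quoted substrings intact."""
    return TOKEN_RE.findall(text)

print(tokenize('this is a string  "that has quoted text" inside.'))
```

PHP's preg_match_all could be used with the same pattern if matching tokens turns out to be easier to maintain than the split expression.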