Best dom questions in March 2011

How can I Observe the contents of an 'a' tag - jquery

13 votes

I have an empty <a> tag whose content is loaded by an external piece of JavaScript. I want to observe the <a> and perform another task when its content changes. The content will only ever change once.

Can this be done?

I am also using jQuery.

Thanks in advance

You can use a mixture of jQuery and DOM Level 3 events (see browser support below).

If you want to check for any changes within the content, you could do this:

var $a = $('a');

$a.one('DOMNodeInserted', function(e) {
    console.log('content changed!: ', e);    

    console.log('new content: ', $(this).html());   
});

$a.one('DOMAttrModified', function(e) {
    console.log('attribute changed!: ');        

    console.log('attribute that was changed: ', e.attrName);
});

See this code in action: http://jsfiddle.net/wJbMj/1/

Reference: DOMNodeInserted, DOMAttrModified


While the above solution is pretty convenient, it only works in browsers that support those events. For a more generic solution, you can hook into jQuery's setter methods. The downside of this approach is that you only catch changes that were made through jQuery.

var _oldAttr = $.fn.attr;
$.fn.attr = function() {
    console.log('changed attr: ', arguments[0]);
    console.log('new value: ', arguments[1]);
    return _oldAttr.apply(this, arguments);
};

You could hook into .text() and .html() in exactly the same way. You would need to check whether the this value inside the overridden methods represents the correct DOM node.
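
As a rough sketch of that check, the wrapper below only reports calls that pass a value (setter calls) and whose this value passes a caller-supplied test. The helper name hookSetter, its parameters, and the stand-in fakeFn object are made up for illustration; with jQuery you would pass $.fn and a test such as function (set) { return set.is('a'); }. The argument-count test suits .html() and .text(); .attr() would need a different one, because a single argument there is still a getter call.

```javascript
// Hypothetical helper: wrap proto[name] so onChange fires for matching setter calls.
function hookSetter(proto, name, isTarget, onChange) {
    var original = proto[name];
    proto[name] = function () {
        // A call with no arguments is a getter and changes nothing.
        if (arguments.length > 0 && isTarget(this)) {
            onChange(name, arguments[0]);
        }
        // Always delegate, so the wrapped method keeps working as before.
        return original.apply(this, arguments);
    };
}

// Stand-in for $.fn so the sketch runs without jQuery or a browser.
var fakeFn = {
    html: function (value) {
        if (arguments.length === 0) { return this.content; }
        this.content = value;
        return this;
    }
};

var seen = [];
hookSetter(fakeFn, 'html',
    function (set) { return set.tag === 'a'; },
    function (method, value) { seen.push(method + ': ' + value); });

// Objects that inherit from fakeFn play the role of jQuery sets.
var link = Object.create(fakeFn);
link.tag = 'a';
var other = Object.create(fakeFn);
other.tag = 'div';

link.html('loaded');   // reported: setter call on an 'a'
other.html('ignored'); // not reported: wrong element
link.html();           // not reported: getter call
console.log(seen);
```

As with the .attr() hook above, this only sees changes made through the wrapped method, not changes made directly on the DOM.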

What are the pros and cons of adding <script> and <link> elements using JavaScript?

9 votes

Recently I saw some HTML with only a single <script> element in its <head>...

<head>
    <title>Example</title>
    <script src="script.js" type="text/javascript"></script>
    <link href="plain.css" type="text/css" rel="stylesheet" />
</head>

This script.js then adds any other necessary <script> and <link> elements to the document using document.write(...) (or it could use document.createElement(...), etc.):

document.write("<link href=\"javascript-enabled.css\" type=\"text/css\" rel=\"stylesheet\" />");
document.write("<script src=\"https://ajax.googleapis.com/ajax/libs/jquery/1.5.1/jquery.min.js\" type=\"text/javascript\"></script>");
document.write("<script src=\"https://ajax.googleapis.com/ajax/libs/jqueryui/1.8.10/jquery-ui.min.js\" type=\"text/javascript\"></script>");
document.write("<link href=\"http://ajax.googleapis.com/ajax/libs/jqueryui/1.7.0/themes/trontastic/jquery-ui.css\" type=\"text/css\" rel=\"stylesheet\" />");
document.write("<script src=\"validation.js\" type=\"text/javascript\"></script>");

Note that there is a plain.css CSS file in the document <head> and script.js just adds any and all CSS and JavaScript which would be used by a JS-enabled user agent.

What are some of the pros and cons of this technique?

The blocking nature of document.write

document.write pauses everything the browser is doing on the page (including parsing). Avoiding it is highly recommended because of this blocking behavior. The browser has no way of knowing what you're going to stuff into the HTML text stream at that point, or whether the write will totally trash everything in the DOM tree, so it has to stop until you're finished.

Essentially, loading scripts this way forces the browser to stop parsing HTML. If your script is inline, the browser will also execute it before moving on. As a side note, it is therefore always recommended to defer loading scripts until after your page is parsed and you've shown a reasonable UI to the user.

If your scripts are loaded from separate files via the "src" attribute, they may not be executed consistently across browsers.

Losing browser speed optimizations and predictability

This way, you lose a lot of the performance optimizations made by modern browsers. Also, the point at which your scripts execute may be unpredictable.

For example, some browsers execute the scripts right after you "write" them. In such cases, you lose parallel downloads of scripts, because the browser doesn't see the second script tag until it has downloaded and executed the first. You also lose parallel downloads of stylesheets and other resources (many browsers can download resources, stylesheets and scripts all at the same time).

Other browsers defer executing the written scripts until the end.

Because of the blocking behavior of document.write, the browser cannot continue to parse the HTML while the write is going on or, in certain cases, while the written scripts are being executed, so your page shows up much more slowly.

In other words, your site has just become as slow as if it were loading in a decades-old browser with no optimizations.

Why would somebody do it like this?

The reason you may want to use something like this is usually for maintainability. For instance, you may have a huge site with thousands of pages, each loading the same set of scripts and stylesheets. However, when you add a script file, you don't want to edit thousands of HTML files to add the script tags. This is particularly troublesome when loading JavaScript libraries (e.g. Dojo or jQuery) -- you have to change each HTML page when you upgrade to the next version.

The problem is that JavaScript doesn't have an @include or @import statement for you to include other files.

Some solutions

The solution is probably not to inject scripts via document.write, but one of the following:

  1. Use @import directives in stylesheets.
  2. Use a server-side scripting language (e.g. PHP) to manage your "master page" and generate all other pages (if you can't do this and must maintain many HTML pages individually, this is not a solution).
  3. Avoid document.write, load the JavaScript files via XHR, and then eval() them (this may raise security concerns, though).
  4. Use a JavaScript library (e.g. Dojo) that has module-loading features, so that you can keep a master JS file which loads the other files. You won't be able to avoid updating the version number of the library file itself, though.
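
For a small hand-rolled "master file", scripts can also be added with document.createElement (which the question mentions) rather than document.write. The following is only a sketch under assumptions: the function name loadScriptsInOrder, its doc parameter, and the file names in the usage comment are hypothetical. It loads one file at a time to keep execution order predictable, which trades away parallel downloads. It also has no error handling, and older Internet Explorer versions need onreadystatechange instead of onload, which is omitted here.

```javascript
// Hypothetical helper: append <script> elements one at a time so they
// execute in the listed order. `doc` is normally the global `document`.
function loadScriptsInOrder(doc, urls, done) {
    var index = 0;
    function next() {
        if (index >= urls.length) {
            // Every script has loaded; notify the caller if it asked to be told.
            if (done) { done(); }
            return;
        }
        var script = doc.createElement('script');
        script.src = urls[index];
        index += 1;
        // Start the next download only after this script has loaded.
        script.onload = next;
        doc.getElementsByTagName('head')[0].appendChild(script);
    }
    next();
}

// Usage in a browser (file names are placeholders):
// loadScriptsInOrder(document, ['library.js', 'validation.js'], function () {
//     console.log('all scripts loaded');
// });
```

Because the elements are appended rather than written into the HTML stream, the parser is not stopped while the files download.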

"Editing" user text on the fly?

6 votes

Hi guys,

I would like to implement something like what vBulletin (or Stack Overflow) does. When the user clicks "edit", the HTML text is converted to plain text in a <textarea></textarea>, ready for editing.

How would you implement something like that? Note that I can use jQuery.

I would especially like to know about the authentication part (if a user clicks "edit" on someone else's comment, there is a warning).

Thanks

Check out this jQuery plugin for inline editing

http://www.appelsiini.net/projects/jeditable

Find important text in arbitrary HTML using PHP?

4 votes

I have some random HTML layouts that contain important text I would like to extract. I cannot just strip_tags() as that will leave a bunch of extra junk from the sidebar/footer/header/etc.

I found a method built in Python and I was wondering if there is anything like this in PHP.

The concept is rather simple: use information about the density of text vs. HTML code to work out if a line of text is worth outputting. (This isn’t a novel idea, but it works!) The basic process works as follows:

  1. Parse the HTML code and keep track of the number of bytes processed.
  2. Store the text output on a per-line, or per-paragraph basis.
  3. Associate with each text line the number of bytes of HTML required to describe it.
  4. Compute the text density of each line by calculating the ratio of text to bytes.
  5. Then decide if the line is part of the content by using a neural network.

You can get pretty good results just by checking if the line’s density is above a fixed threshold (or the average), but the system makes fewer mistakes if you use machine learning - not to mention that it’s easier to implement!
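
The fixed-threshold variant mentioned above can be sketched in a few lines. This is not the Python implementation the question refers to and it uses no neural network: it treats each source line as a unit, strips tags with a naive regular expression, and keeps lines whose ratio of text to total characters exceeds a cutoff. The function names, the sample markup, and the 0.5 threshold are assumptions for illustration. The sketch is in JavaScript, but the logic should carry over to PHP largely unchanged.

```javascript
// Naive tag stripping; a real HTML parser would be more robust.
function stripTags(line) {
    return line.replace(/<[^>]*>/g, '').replace(/\s+/g, ' ').trim();
}

// Ratio of visible text length to total line length (0 for empty lines).
// Character counts stand in for the byte counts described in the steps.
function textDensity(line) {
    if (line.length === 0) { return 0; }
    return stripTags(line).length / line.length;
}

// Keep the text of every line whose density is above the threshold.
function extractDenseText(html, threshold) {
    var kept = [];
    html.split('\n').forEach(function (line) {
        if (textDensity(line) > threshold) {
            kept.push(stripTags(line));
        }
    });
    return kept;
}

var sample = [
    '<div id="nav"><a href="/">Home</a> <a href="/about">About</a></div>',
    '<p>This paragraph is the main content of the post and is mostly text.</p>',
    '<div class="footer"><span class="c">Copyright</span></div>'
].join('\n');

// Only the paragraph line has a density above 0.5.
console.log(extractDenseText(sample, 0.5));
```

Navigation and footer lines carry more markup than text, so they fall below the cutoff, while the paragraph passes. Real pages would need the threshold tuned, or replaced by the average density as the quoted text suggests.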

Update: I started a bounty for an answer that could pull main content from a random HTML template. Since I can't share the documents I will be using - just pick any random blog sites and try to extract the body text from the layout. Remember that the header, sidebar(s), and footer may contain text also. See the link above for ideas.

  • phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on jQuery JavaScript Library.

UPDATE 2

  1. Many blogs use a CMS.
  2. The HTML structure of blogs is almost always the same.
  3. Avoid common selectors like #sidebar, #header, #footer, #comments, etc.
  4. Avoid any widget by tag name: script, iframe.
  5. Clear well-known content like:
    1. /\d+\scomments?/im
    2. /(read the rest|read more).*/im
    3. /(?:.*(?:by|post|submitt?)(?:ed)?.*\s(at|am|pm))/im
    4. /[^a-z0-9]+/im

Search for well-known classes and IDs:

  • typepad.com .entry-content
  • wordpress.org .post-entry .entry .post
  • movabletype.com .post
  • blogger.com .post-body .entry-content
  • drupal.com .content
  • tumblr.com .post
  • squarespace.com .journal-entry-text
  • expressionengine.com .entry
  • gawker.com .post-body

  • Ref: The blog platforms of choice among the top 100 blogs


$selectors = array('.post-body','.post','.journal-entry-text','.entry-content','.content');
// find() expects one selector string, so join the list with commas
$doc = phpQuery::newDocumentFile('http://blog.com')->find(implode(',', $selectors))->children('p,div');

Search based on a common HTML structure that looks like this:

<div>
<h1|h2|h3|h4|a />
<p|div />
</div>

$doc = phpQuery::newDocumentFile('http://blog.com')->find('h1,h2,h3,h4')->parent()->children('p,div');

Parsing an XML inside a zip in-memory

4 votes

I have a Zip that contains two files: an XML and a thumbnail. I would like to open the XML file and parse it WITHOUT having to extract on disk.

One of DocumentBuilder's parse methods takes an InputStream. Is there a way to get the InputStream of the XML in the zip file? I kinda got lost. I'm pretty sure ZipInputStream or ZipFile has something to offer, but I can't figure it out :/

Thank you in advance!

I believe you're looking for something like this:

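import java.io.FileInputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
// The statements below belong inside a method that declares "throws Exception".
try (ZipInputStream zin = new ZipInputStream(new FileInputStream("your.zip"))) {
    ZipEntry ze;
    while ((ze = zin.getNextEntry()) != null) {
        if (ze.getName().equals("your.xml")) {
            // zin now reads only this entry's bytes, so it can be parsed directly
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(zin);
            break;
        }
    }
}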