Best questions in October 2010

What do people find difficult about C pointers?

102 votes

From the number of questions posted here, it's clear that people have some pretty fundemental issues when getting their heads around pointers and pointer arithmetic.

I'm curious to know why. They've never really caused me major problems (although I first learned about them back in the Neolithic). In order to write better answers to these questions, I'd like to know what people find difficult.

So, if you're struggling with pointers, or you recently were but suddenly "got it", what were the aspects of pointers that caused you problems?

(I've tried to word this so that it's not argumentative or subjective. I'm genuinely interested in what people have problems with here)

I suspect people are going a bit too deep in their answers. An understanding of scheduling, actual CPU operations, or assembly-level memory management isn't really required.

When I was teaching, I found the following holes in students' understanding to be the most common source of problems:

  1. Heap vs Stack storage. It is simply stunning how many people do not understand this, even in a general sense.
  2. Stack frames. Just the general concept of a dedicated section of the stack for local variables, along with the reason it's a 'stack'... details such as stashing the return location, exception handler details, and previous registers can safely be left till someone tries to build a compiler.
  3. "Memory is memory is memory" Casting just changes which versions of operators or how much room the compiler gives for a particular chunk of memory. You know you're dealing with this problem when people talk about "what (primative) variable X really is".

Most of my students were able to understand a simplified drawing of a chunk of memory, generally the local variables section of the stack at the current scope. Generally giving explicit fictional addresses to the various locations helped.

I guess in summary, I'm saying that if you want to understand pointers, you have to understand variables, and what they actually are in modern architectures.

How to normalize HTML in Javascript or jQuery?

71 votes

Tags can have multiple attributes. The order in which attributes appear in the code does not matter. For example:

<a href="#" title="#">
<a title="#" href="#">

How can I "normalize" the HTML in Javascript, so the order of the attributes is always the same? I don't care which order is chosen, as long as it is always the same.

UPDATE: my original goal was to make it easier to diff (in JavaScript) 2 HTML pages with slight differences. Because users could use different software to edit the code, the order of the attributes could change. This make the diff too verbose.

ANSWER: Well, first thanks for all the answers. And YES, it is possible. Here is how I've managed to do it. This is a proof of concept, it can certainly be optimized:

function sort_attributes(a, b) {
  if( a.name == b.name) {
    return 0;
  }

  return (a.name < b.name) ? -1 : 1;
}

$("#original").find('*').each(function() {
  if (this.attributes.length > 1) {
    var attributes = this.attributes;
    var list = [];

    for(var i =0; i < attributes.length; i++) {
      list.push(attributes[i]);
    }

    list.sort(sort_attributes);

    for(var i = 0; i < list.length; i++) {
      this.removeAttribute(list[i].name, list[i].value);
    }

    for(var i = 0; i < list.length; i++) {
      this.setAttribute(list[i].name, list[i].value);
    }
  }
});

Same thing for the second element of the diff, $('#different'). Now $('#original').html() and $('#different').html() show HTML code with attributes in the same order.

This is a proof of concept, it can certainly be optimized:

function sort_attributes(a, b) {
  if( a.name == b.name) {
    return 0;
  }

  return (a.name < b.name) ? -1 : 1;
 }

$("#original").find('*').each(function() {
  if (this.attributes.length > 1) {
    var attributes = this.attributes;
    var list = [];

    for(var i =0; i < attributes.length; i++) {
      list.push(attributes[i]);
    }

     list.sort(sort_attributes);

    for(var i = 0; i < list.length; i++) {
      this.removeAttribute(list[i].name, list[i].value);
    }

     for(var i = 0; i < list.length; i++) {
       this.setAttribute(list[i].name, list[i].value);
    }
  }
 });

Same thing for the second element of the diff, $('#different'). Now $('#original').html() and $('#different').html() show HTML code with attributes in the same order.

Is LINQ banned in your company?

60 votes

I work for a large company who develop enterprise applications which are performance oriented. Virtually every line of code is closely scrutinised and optimized as much as possible to ensure the best performance.

Company policy dictates that LINQ is strictly banned. This is because it is believed that LINQ has a negative performance impact compared to more traditional coding practices.

Is LINQ banned in your company? If yes, what are the reason given for this? If not, what applications does your company develop and why is performance not so crucial?

I, personally, do not ban LINQ usage in my company - but rather encourage its use whenever appropriate. LINQ is very expressive, and I find that it produces code that is often much more maintainable, and higher quality, than "traditional" methods of development.

In addition, I often find that many developers tend to write more performant code using LINQ than "traditional" looping methods. The streaming nature of LINQ can dramatically cut down on the runtimes if used correctly.

Personally, I find that any company that "bans" the usage of a specific language or framework feature outright has higher level problems. Every technology exists for a reason - and every feature has its place. Whether it's appropriate in a specific scenario is another issue - but banning it completely is a sign of poor management of developers, in my opinion.

If not, what applications does your company develop and why is performance not so crucial?

Performance is absolutely critical to me. My company develops scientific software, and many of our routines have runtimes in the hours (or even days), so every ounce of performance we can squeeze out is very useful.

That being said, performance doesn't come from micro-optimizing in most cases - it comes from designing a better architecture and "large scale" algorithms. The key for this to be truly successful is having a good design, and keeping the code as simple as possible. Simplicity helps in profiling tremendously - which in turn can pin point the real performance issues in the application.

Occasionally this will be due to a LINQ statement, but very rarely. More often this is due to a poor design decision. I find that LINQ actually reduces the frequency of poor design decisions. It's much easier to refactor a LINQ statement into a more performant tight loop when necessary than it is to try to profile an application that's "chattier" in terms of code than necessary.

In addition, LINQ has some huge performance opportunities, even on a small scale. For example, it's much simpler to parallelize a LINQ query via PLINQ than to try to parallelize many other constructs, for example. Granted, it's not magic, and care still needs to be taken, but LINQ tends to parallelize with fewer issues than loops. If performance is the goal, simple, clean code should be the target, and LINQ helps achieve that more quickly.

Moving from C++ to C

57 votes

After a few years coding in C++, I was recently offered a job coding in C, in the embedded field.

Putting aside the question of whether it's right or wrong to dismiss C++ in the embedded field, there are some features/idioms in C++ I would miss a lot. Just to name a few:

  • Generic, type-safe data structures (using templates).
  • RAII. Especially in functions with multiple return points, e.g. not having to remember to release the mutex on each return point.
  • Destructors in general. I.e. you write a d'tor once for MyClass, then if a MyClass instance is a member of MyOtherClass, MyOtherClass doesn't have to explicitly deinitialize the MyClass instance - its d'tor is called automatically.
  • Namespaces.

What are your experiences moving from C++ to C?
What C substitutes did you find for your favorite C++ features/idioms? Did you discover any C features you wish C++ had?

Working on an embedded project, I tried working in all C once, and just couldn't stand it. It was just so verbose that it made it hard to read anything. Also, I liked the optimized-for-embedded containers I had written, which had to turn into much less safe and harder to fix #define blocks.

Code that in C++ looked like:

if(uart[0]->Send(pktQueue.Top(), sizeof(Packet)))
    pktQueue.Dequeue(1);

turns into:

if(UART_uchar_SendBlock(uart[0], Queue_Packet_Top(pktQueue), sizeof(Packet)))
    Queue_Packet_Dequeue(pktQueue, 1);

which many people will probably say is fine but gets ridiculous if you have to do more than a couple "method" calls in a line. Two lines of C++ would turn into five of C (due to 80-char line length limits). Both would generate the same code, so it's not like the target processor cared!

One time (back in 1995), I tried writing a lot of C for a multiprocessor data-processing program. The kind where each processor has its own memory and program. The vendor-supplied compiler was a C compiler (some kind of HighC derivative), their libraries were closed source so I couldn't use GCC to build, and their APIs were designed with the mindset that your programs would primarily be the initialize/process/terminate variety, so inter-processor communication was rudimentary at best.

I got about a month in before I gave up, found a copy of cfront, and hacked it into the makefiles so I could use C++. Cfront didn't even support templates, but the C++ code was much, much clearer.

Generic, type-safe data structures (using templates).

The closest thing C has to templates is to declare a header file with a lot of code that looks like:

TYPE * Queue_##TYPE##_Top(Queue_##TYPE##* const this)
{ /* ... */ }

then pull it in with something like:

#define TYPE Packet
#include "Queue.h"
#undef TYPE

Note that this won't work for compound types (e.g. no queues of unsigned char) unless you make a typedef first.

Oh, and remember, if this code isn't actually used anywhere, then you don't even know if it's syntactically correct.

EDIT: One more thing: you'll need to manually manage instantiation of code. If your "template" code isn't all inline functions, then you'll have to put in some control to make sure that things get instantiated only once so your linker doesn't spit out a pile of "multiple instances of Foo" errors.

To do this, you'll have to put the non-inlined stuff in an "implementation" section in your header file:

#ifdef implementation_##TYPE

/* Non-inlines, "static members", global definitions, etc. go here. */

#endif

And then, in one place in all your code per template variant, you have to:

#define TYPE Packet
#define implementation_Packet
#include "Queue.h"
#undef TYPE

Also, this implementation section needs to be outside the standard #ifndef/#define/#endif litany, because you may include the template header file in another header file, but need to instantiate afterward in a .c file.

Yep, it gets ugly fast. Which is why most C programmers don't even try.

RAII.

Especially in functions with multiple return points, e.g. not having to remember to release the mutex on each return point.

Well, forget your pretty code and get used to all your return points (except the end of the function) being gotos:

TYPE * Queue_##TYPE##_Top(Queue_##TYPE##* const this)
{
    TYPE * result;
    Mutex_Lock(this->lock);
    if(this->head == this->tail)
    {
        result = 0;
        goto Queue_##TYPE##_Top_exit:;
    }

    /* Figure out `result` for real, then fall through to... */

Queue_##TYPE##_Top_exit:
    Mutex_Lock(this->lock);
    return result;
}

Destructors in general.

I.e. you write a d'tor once for MyClass, then if a MyClass instance is a member of MyOtherClass, MyOtherClass doesn't have to explicitly deinitialize the MyClass instance - its d'tor is called automatically.

Object construction has to be explicitly handled the same way.

Namespaces.

That's actually a simple one to fix: just tack a prefix onto every symbol. This is the primary cause of the source bloat that I talked about earlier (since classes are implicit namespaces). The C folks have been living this, well, forever, and probably won't see what the big deal is.

YMMV

How can I compare two sets of 1000 numbers against each other?

53 votes

I must check approximately 1000 numbers against 1000 other numbers.

I loaded both and compared them server-side:

foreach( $numbers1 as $n1 ) {
  foreach( $numbers2 as $n2 ) {
    if( $n1 == $n2 ) {
      doBla();
    }
  }
}

This took a long time, so I tried to do the same comparison client side using two hidden div elements. Then compared them using JavaScript. It still takes 45 seconds to load the page (using hidden div elements).

I do not need to load the numbers that are not the same.

Is there a faster algorithm? I am thinking of comparing them database side and just load the error numbers, then do an Ajax call for the remaining non-error numbers. But is a MySQL database fast enough?

Sort the lists first. Then you can walk up both lists from the start, comparing as you go.

The loop would look something like this:

var A = getFirstArray().sort(), B = getSecondArray().sort();

var i = 0, j = 0;
while (i < A.length && j < B.length) {
    if (A[i] === B[j]) {
        doBla(A[i]);
        i++; j++;
    }
    else if (A[i] < B[j]) {
        i++;
    }
    else
        j++;
}

(That's JavaScript; you could do it server-side too, but I don't know PHP.)

Edit — just to be fair to all the hashtable fans (whom I respect of course), it's pretty easy to do that in JavaScript:

var map = {};
for (var i = 0; i < B.length; ++i) map[B[i]] = true; // Assume integers.
for (var i = 0; i < A.length; ++i) if (map[A[i]]) doBla(A[i]);

Or if the numbers are or might be floats:

var map = {};
for (var i = 0; i < B.length; ++i) map['' + B[i]] = true; // Assume integers.
for (var i = 0; i < A.length; ++i) if (map['' + A[i]]) doBla(A[i]);

Since numbers are pretty cheap to hash (even in JavaScript, converting to string before hashing is surprisingly cheap), this would be pretty fast.

Fast and Lean PDF Viewer for iPhone / iPad / iOs - tips and hints?

48 votes

There has been many Questions recently about drawing PDF's.

Yes, you can render PDF's very easily with a UIWebView but this cant give the performance and functionality that you would expect from a good PDF viewer.

You can draw a PDF page to a CALayer or to a UIImage. Apple even have sample code to show how draw a large PDF in a Zoomable UIScrollview

But the same issues keep cropping up.

UIImage Method:

  1. PDF's in a UIImage don't optically scale as well as a Layer approach.
  2. The CPU and memory hit on generating the UIImages from a PDFcontext limits/prevents using it to create a real-time render of new zoom-levels.

CATiledLayer Method:

  1. Theres a significant Overhead (time) drawing a full PDF page to a CALayer: individual tiles can be seen rendering (even with a tileSize tweak)
  2. CALayers cant be prepared ahead of time (rendered off-screen).

Generally PDF viewers are pretty heavy on memory too. Even monitor the memory usage of apple's zoomable PDF example.

In my current project, I'm developing a PDF viewer and am rendering a UIImage of a page in a separate thread (issues here too!) and presenting it while the scale is x1. CATiledLayer rendering kicks in once the scale is >1. iBooks takes a similar double take approach as if you scroll the pages you can see a lower res version of the page for just less than a second before a crisp version appears.

Im rendering 2 pages each side of the page in focus so that the PDF image is ready to mask the layer before it starts drawing.Pages are destroyed again when they are +2 pages away from the focused page.

Does anyone have any insights, no matter how small or obvious to improve the performance/ memory handling of Drawing PDF's? or any other issues discussed here?

EDIT: Some Tips (Credit- Luke Mcneice,VdesmedT,Matt Gallagher,Johann):

  • Save any media to disk when you can.

  • Use larger tileSizes if rendering on TiledLayers

  • init frequently used arrays with placeholder objects, alternitively another design approach is this one

  • Note that images will render faster than a CGPDFPageRef

  • Use NSOperations or GCD & Blocks to prepare pages ahead of time.

  • call CGContextSetInterpolationQuality(ctx, kCGInterpolationHigh); CGContextSetRenderingIntent(ctx, kCGRenderingIntentDefault); before CGContextDrawPDFPage to reduce memory usage while drawing

  • init'ing your NSOperations with a docRef is a bad idea (memory), wrap the docRef into a singleton.

  • Cancel needless NSOperations When you can, especially if they will be using memory, beware of leaving contexts open though!

  • Recycle page objects by doing pointer swaps or destroy unused views

  • Close any open Contexts as soon as you don't need them

  • on receiving memory warnings release and reload the DocRef and any page Caches

Other PDF Features:

Documentation

I have build such kind of application using approximatively the same approach except :

  • I cache the generated image on the disk and always generate two to three images in advance in a separate thread.
  • I don't overlay with a UIImage but instead draw the image in the layer when zooming is 1. Those tiles will be released automatically when memory warnings are issued.

Whenever the user start zooming, I acquire the CGPDFPage and render it using the appropriate CTM. The code in - (void)drawLayer: (CALayer*)layer inContext: (CGContextRef) context is like :

CGAffineTransform currentCTM = CGContextGetCTM(context);    
if (currentCTM.a == 1.0 && baseImage) {
    //Calculate ideal scale
    CGFloat scaleForWidth = baseImage.size.width/self.bounds.size.width;
    CGFloat scaleForHeight = baseImage.size.height/self.bounds.size.height; 
    CGFloat imageScaleFactor = MAX(scaleForWidth, scaleForHeight);

    CGSize imageSize = CGSizeMake(baseImage.size.width/imageScaleFactor, baseImage.size.height/imageScaleFactor);
    CGRect imageRect = CGRectMake((self.bounds.size.width-imageSize.width)/2, (self.bounds.size.height-imageSize.height)/2, imageSize.width, imageSize.height);
    CGContextDrawImage(context, imageRect, [baseImage CGImage]);
} else {
    @synchronized(issue) { 
        CGPDFPageRef pdfPage = CGPDFDocumentGetPage(issue.pdfDoc, pageIndex+1);
        pdfToPageTransform = CGPDFPageGetDrawingTransform(pdfPage, kCGPDFMediaBox, layer.bounds, 0, true);
        CGContextConcatCTM(context, pdfToPageTransform);    
        CGContextDrawPDFPage(context, pdfPage);
    }
}

issue is the object containg the CGPDFDocumentRef. I synchronize the part where I access the pdfDoc property because I release it and recreate it when receiving memoryWarnings. It seems that the CGPDFDocumentRef object do some internal caching that I did not find how to get rid of.

How does Facebook achieve good performance?

43 votes

Almost everyone has a Facebook account, even people who are not familiar with the Internet. With millions people actively using Facebook, updating their status, replying to messages, uploading photos and so on, how is Facebook's page still loading very fast?

I was told that Facebook was built using only PHP and MySQL, so how can Facebook's performance be so good?

Note: This needs to be updated.

  1. Facebook uses HipHop, which converts PHP into C++ code (which is then compiled into much more efficient machine code than actual PHP).

  2. Facebook has data distributed across many, many servers. For example, they also use Hadoop clusters for some of their data storage.

  3. memcached :)

Why do most C developers use define instead of const?

43 votes

In many programs a #define serves the same purpose as a constant. For example.

#define FIELD_WIDTH 10
const int fieldWidth = 10;

I commonly see the first form preferred over the other, relying on the pre-processor to handle what is basically an application decision. Is there a reason for this tradition?

There is a very solid reason for this: const in C does not mean something is constant. It just means a variable is read-only.

In places where the compiler requires a true constant (such as for array sizes for non-VLA arrays), using a const variable, such as fieldWidth is just not possible.

When NOT to use yield (return)

40 votes

There are several useful questions here on SO about the benefits of yield return. For example,

I'm looking for thoughts on when NOT to use yield return. For example, if I expect to need to return all items in a collection, it doesn't seem like yield would be useful, right?

What are the cases where use of yield will be limiting, unnecessary, get me into trouble, or otherwise should be avoided?

What are the cases where use of yield will be limiting, unnecessary, get me into trouble, or otherwise should be avoided?

It's a good idea to think carefully about your use of "yield return" when dealing with recursively defined structures. For example, I often see this:

public static IEnumerable<T> PreorderTraversal<T>(Tree<T> root)
{
    if (root == null) yield break;
    yield return root.Value;
    foreach(T item in PreorderTraversal(root.Left))
        yield return item;
    foreach(T item in PreorderTraversal(root.Right))
        yield return item;
}

Perfectly sensible-looking code, but it has performance problems. Suppose the tree is h deep. Then there will at most points be O(h) nested iterators built. Calling "MoveNext" on the outer iterator will then make O(h) nested calls to MoveNext. Since it does this O(n) times for a tree with n items, that makes the algorithm O(hn). And since the height of a binary tree is lg n <= h <= n, that means that the algorithm is at best O(n lg n) and at worst O(n^2) in time, and best case O(lg n) and worse case O(n) in stack space. It is O(h) in heap space because each enumerator is allocated on the heap. (On implementations of C# I'm aware of; a conforming implementation might have other stack or heap space characteristics.)

But iterating a tree can be O(n) in time and O(1) in stack space. You can write this instead like:

public static IEnumerable<T> PreorderTraversal<T>(Tree<T> root)
{
    var stack = new Stack<Tree<T>>();
    stack.Push(root);
    while (stack.Count != 0)
    {
        var current = stack.Pop();
        if (current == null) continue;
        yield return current.Value;
        stack.Push(current.Left);
        stack.Push(current.Right);
    }
}

which still uses yield return, but is much smarter about it. Now we are O(n) in time and O(h) in heap space, and O(1) in stack space.

Further reading: see Wes Dyer's article on the subject:

http://blogs.msdn.com/b/wesdyer/archive/2007/03/23/all-about-iterators.aspx

How does facebook rewrite the source URL of a page in the browser address bar?

38 votes

Go to http://www.facebook.com/facebook?v=wall, then click on the info tab. The content will be loaded, and the address bar now becomes http://www.facebook.com/facebook?v=info but the webpage didn't reload.

At first I think it is Ajax, but my question is, how do you change the address bar without reloading? I know I can change anchor (#wall) using JS but querystring (?v=wall), how?

It's using HTML5's new history.pushState() feature to allow the page to masquerade as being at a different URL to that from which it was originally fetched.

This seems only to be supported by WebKit at the moment, which is why the rest of us are seeing ?v=wall#!/facebook?v=info instead of ?v=info.

The feature allows dynamically-loaded pages to be properly bookmarked, exchanged etc between JS-supporting and non-JS-supporting user agents. Because if you as a JS user linked someone to ?v=wall#!/facebook?v=info and their browser didn't support JS and XMLHttpRequest, the page wouldn't work for them. The #! is also used as a tip to search engines to download the non-AJAX version.

How do search engines find relevant content?

38 votes

How does Google find relevant content when it's parsing the web?

Let's say, for instance, Google uses the PHP native DOM Library to parse content. What methods would they be for it to find the most relevant content on a web page?

My thoughts would be that it would search for all paragraphs, order by the length of each paragraph and then from possible search strings and query params work out the percentage of relevance each paragraph is.

Let's say we had this URL:

http://domain.tld/posts/stackoverflow-dominates-the-world-wide-web.html

Now from that URL I would work out that the HTML file name would be of high relevance so then I would see how close that string compares with all the paragraphs in the page!

A really good example of this would be Facebook share, when you share a page. Facebook quickly bots the link and brings back images, content, etc., etc.

I was thinking that some sort of calculative method would be best, to work out the % of relevancy depending on surrounding elements and meta data.

Are there any books / information on the best practices of content parsing that covers how to get the best content from a site, any algorithms that may be talked about or any in-depth reply?


Some ideas that I have in mind are:

  • Find all paragraphs and order by plain text length
  • Somehow find the Width and Height of div containers and order by (W+H) - @Benoit
  • Check meta keywords, title, description and check relevancy within the paragraphs
  • Find all image tags and order by largest, and length of nodes away from main paragraph
  • Check for object data, such as videos and count the nodes from the largest paragraph / content div
  • Work out resemblances from previous pages parsed

The reason why I need this information:

I'm building a website where webmasters send us links and then we list their pages, but I want the webmaster to submit a link, then I go and crawl that page finding the following information.

  • An image (if applicable)
  • A < 255 paragraph from the best slice of text
  • Keywords that would be used for our search engine, (Stack Overflow style)
  • Meta data Keywords, Description, all images, change-log (for moderation and administration purposes)

Hope you guys can understand that this is not for a search engine but the way search engines tackle content discovery is in the same context as what I need it for.

I'm not asking for trade secrets, I'm asking what your personal approach to this would be.

This is a very general question but a very nice topic! Definitely upvoted :) However I am not satisfied with the answers provided so far, so I decided to write a rather lengthy answer on this.

The reason I am not satisfied is that the answers are basically all true (I especially like the answer of kovshenin (+1), which is very graph theory related...), but the all are either too specific on certain factors or too general.

It's like asking how to bake a cake and you get the following answers:

  • You make a cake and you put it in the oven.
  • You definitely need sugar in it!
  • What is a cake?
  • The cake is a lie!

You won't be satisfied because you wan't to know what makes a good cake. And of course there are a lot or recipies.

Of course Google is the most important player, but, depending on the use case, a search engine might include very different factors or weight them differently.

For example a search engine for discovering new independent music artists may put a malus on artists websites with a lots of external links in.

A mainstream search engine will probably do the exact opposite to provide you with "relevant results".

There are (as already said) over 200 factors that are published by Google. So webmasters know how to optimize their websites. There are very likely many many more that the public is not aware of (in Google's case).

But in the very borad and abstract term SEO optimazation you can generally break the important ones apart into two groups:

  1. How well does the answer match the question? Or: How well does the pages content match the search terms?

  2. How popular/good is the answer? Or: What's the pagerank?

In both cases the important thing is that I am not talking about whole websites or domains, I am talking about single pages with a unique URL.

It's also important that pagerank doesn't represent all factors, only the ones that Google categorizes as Popularity. And by good I mean other factors that just have nothing to do with popularity.

In case of Google the official statement is that they want to give relevant results to the user. Meaning that all algorithms will be optimized towards what the user wants.

So after this long introduction (glad you are still with me...) I will give you a list of factors that I consider to be very important (at the moment):

Category 1 (how good does the answer match the question?

You will notice that a lot comes down to the structure of the document!

  • The page primarily deals with the exact question.

Meaning: the question words appear in the pages title text or in heading paragraphs paragraphs. The same goes for the position of theese keywords. The earlier in the page the better. Repeated often as well (if not too much which goes under the name of keywords stuffing).

  • The whole website deals with the topic (keywords appear in the domain/subdomain)

  • The words are an important topic in this page (internal links anchor texts jump to positions of the keyword or anchor texts / link texts contain the keyword).

  • The same goes if external links use the keywords in link text to link to this page

Category 2 (how important/popular is the page?)

You will notice that not all factors point towards this exact goal. Some are included (especially by Google) just to give pages a boost, that... well... that just deserved/earned it.

  • Content is king!

The existence of unique content that can't be found or only very little in the rest of the web gives a boost. This is mostly measured by unordered combinations of words on a website that are generally used very little (important words). But there are much more sophisticated methods as well.

  • Recency - newer is better

  • Historical change (how often the page has updated in the past. Changing is good.)

  • External link popularity (how many links in?)

If a page links another page the link is worth more if the page itself has a high pagerank.

  • External link diversity

basically links from different root domains, but other factors play a role too. Factors like even how seperated are the webservers of linking sites geographically (according to their ip address).

  • Trust Rank

For example if big, trusted, established sites with redactional content link to you, you get a trust rank. That's why a link from The New York Times is worth much more than some strange new website, even if it's PageRank is higher!

  • Domain trust

Your whole website gives a boost to your content if your domain is trusted. Well different factors count here. Of course links from trusted sties to your domain, but it will even do good if you are in the same datacenter as important websites.

  • Topic specific links in.

If websites that can be resolved to a topic link to you and the query can be resolved to this topic as well, it's good.

  • Distribution of links in over time.

If you earned a lot of links in in a short period of time, this will do you good at this time and the near future afterwards. But not so good later in time. If you slow and steady earn links it will do you good for content that is "timeless".

  • Links from restrited domains

A link from a .gov domain is worth a lot.

  • User click behaviour

Whats the clickrate of your search result?

  • Time spent on site

Google analytics tracking, etc. It's also tracked if the user clicks back or clicks another result after opening yours.

  • Collected user data

Votes, rating, etc., references in Gmail, etc.

Now I will introduce a third category, and one or two points from above would go into this category, but I haven't thought of that... The category is:

** How important/good is your website in general **

All your pages will be ranked up a bit depending on the quality of your websites

Factors include:

  • Good site architecture (easy to navgite, structured. Sitemaps, etc...)

  • How established (long existing domains are worth more).

  • Hoster information (what other websites are hosted near you?

  • Search frequency of your exact name.

Last, but not least, I want to say that a lot of these theese factors can be enriched by semantic technology and new ones can be introduced.

For example someone may search for Titanic and you have a website about icebergs ... that can be set into correlation which may be reflected.

Newly introduced semantic identifiers. For example OWL tags may have a huge impact in the future.

For example a blog about the movie Titanic could put a sign on this page that it's the same content as on the Wikipedia article about the same movie.

This kind of linking is currently under heavy development and establishment and nobody knows how it will be used.

Maybe duplicate content is filtered, and only the most important of same content is displayed? Or maybe the other way round? That you get presented a lot of pages that match your query. Even if they dont contain your keywords?

Google even applies factors in different relevance depending on the topic of your search query!

How to determine if a number is positive or negative in Java?

36 votes

I was asked this question in Amazon Chennai(India) interview , to determine whether an number is positive or negative. The rules are that , we should not use conditional operators such as <, and >, built in java functions (like substring, indexOf, charAt, and startsWith), no regex, or API's. I did some homework on this and the code is given below, but it only works for integer type. But they asked me to write a generic code that works for float, double, and long.

 // This might not be better way!!

 S.O.P ((( number >> 31 ) & 1) == 1 ? "- ve number " : "+ve number );

any ideas from your side.Thanks.

Updates:

OMG!!! looking at the answers, i think the interviewer thought me i was upto something.

Thank you all for providing an excellent answers(Especially Nanda,Itzwarky,Nabb and Strilanc) and spending your precious time of yours.

The integer cases are easy. The double case is trickier, until you remember about infinities.

Note: If you consider the double constants "part of the api", you can replace them with overflowing expressions like 1E308 * 2.

int sign(int i) {
    if (i == 0) return 0;
    if (i >> 31 != 0) return -1
    return +1;
}
int sign(long i) {
    if (i == 0) return 0;
    if (i >> 63 != 0) return -1
    return +1;
}
public static int sign(double f) {
    if (f != f) throw new IllegalArgumentException("NaN");
    if (f == 0) return 0;
    f *= Double.POSITIVE_INFINITY;
    if (f == Double.POSITIVE_INFINITY) return +1;
    if (f == Double.NEGATIVE_INFINITY) return -1;

    //this should never be reached, but I've been wrong before...
    throw new IllegalArgumentException("Unfathomed double");
}

Explicit assignment of null

34 votes
string s1;
string s2 = null;

if (s1 == null) // compile error
if (s2 == null) // ok

I don't really understand why the explicit assignment is needed. Whats the difference between a null variable and an unassigned variable? I always assumed that unassigned variables were simply assigned as null by the runtime/compiler anyway. If they're not null, then what are they?

Unassigned members are automatically initialized to their default values (which is the null reference in the case for string).

Unassigned local variables are not assigned any value and trying to access a possibly unassigned variable will give a compile error.

Does Razor syntax provide a compelling advantage in UI markup?

34 votes

I notice Scott Guthrie is starting to mention Razor a fair bit on his blog but I'm just not that sure that it's a good fit for my style.

Granted it's a fairly unfamiliar style for someone who's pretty used to a "standard" sort of ASP.Net markup (content place holders and inline code), but it just feels like a lot of additional pages to manage and less clear markup to me.

What are other peoples' feelings on it? Is it something that you believe should be seriously considered when scaffolding new MVC pages or is it just trying to solve a problem that doesn't exist?

[Disclaimer: I'm one of the Microsoft developers on MVC and Razor, so I might be a bit biased :)]

We designed Razor to be a concise templating language that uses only the minimal necessary amount of control characters. I would say that large parts of your views can be expressed with fewer characters than the same code using the "traditional" WebForms syntax.

For example the following code snippet in ASPX syntax:

<% if(someCondition) { %>
  <ol>
  <% foreach(var item in Model) { %>
     <li><%: item.ToString() %></li>
  <% } %>
  </ol>
<% } %>

Can be expressed as follows in Razor:

@if(someCondition) {
   <ol>
   @foreach(var item in Model) {
      <li>@item.ToString()</li>
   }
   </ol>
}

While the ASPX version has 21 transition characters (the <% and %>), the Razor version has only three (@)

I would say that the advantages of Razor are as follows:

  1. Concise syntax, which is very similar to the way you write regular C# code (check out the following recent blog post by Phil Haack comparing Asxp with Razor syntax: http://haacked.com/archive/2011/01/06/razor-syntax-quick-reference.aspx)
  2. Automatic HTML encoding of output (which helps protect you from html injection attacks)
  3. Built in (though not 100%) validation of your markup which helps you avoid unbalanced tags

The page-related concepts also map easily from what you have in ASPX

  • As you can see inline code is still allowed
  • Sections (which can be optional) are equivalent to content placeholders
  • Layout pages instead of Master pages
  • The concepts of full and partial views are the same
  • @functions { ... } blocks instead of <script runat="server"> ... </script>

In addition Razor has a number of useful concepts that I would say are better than what is available in ASPX:

  • @helper functions for really easy creation of functions that emit markup
  • @model keyword for specifying your view's model type without having to write a <%@ Page ... directive with the full class name

I would like to think that we have tackled a real problem, which is to allow you to more easily write concise and standards-compliant views while at the same time providing you with ways to refactor common code.

Of course, not everyone will prefer the syntax which is why we are also fully supporting the ASPX view engine. In addition you can check out Spark and NHaml, which are two 3rd-party view engines that enjoy significant community following. The following blog post has a good comparison of the different offerings: http://blogs.msdn.com/b/coding4fun/archive/2010/10/04/10070953.aspx

In C99, is f()+g() undefined or merely unspecified?

31 votes

I used to think that in C99, even if the side-effects of functions f and g interfered, and although the expression f() + g() does not contain a sequence point, f and g would contain some, so the behavior would be unspecified: either f() would be called before g(), or g() before f().

I am no longer so sure. What if the compiler inlines the functions (which the compiler may decide to do even if the functions are not declared inline) and then reorders instructions? May one get a result different of the above two? In other words, is this undefined behavior?

This is not because I intend to write this kind of thing, this is to choose the best label for such a statement in a static analyzer.

The expression f() + g() contains a minimum of 4 sequence points; one before the call to f() (after all zero of its arguments are evaluated); one before the call to g() (after all zero of its arguments are evaluated); one as the call to f() returns; and one as the call to g() returns. Further, the two sequence points associated with f() occur either both before or both after the two sequence points associated with g(). What you cannot tell is which order the sequence points will occur in - whether the f-points occur before the g-points or vice versa.

Even if the compiler inlined the code, it has to obey the 'as if' rule - the code must behave the same as if the functions were not interleaved. That limits the scope for damage (assuming a non-buggy compiler).

So the sequence in which f() and g() are evaluated is unspecified. But everything else is pretty clean.


In a comment, supercat asks:

I would expect function calls in the source code remain as sequence points even if a compiler decides on its own to inline them. Does that remain true of functions declared "inline", or does the compiler get extra latitude?

I believe the 'as if' rule applies and the compiler doesn't get extra latitude to omit sequence points because it uses an explicitly inline function. The main reason for thinking that (being too lazy to look for the exact wording in the standard) is that the compiler is allowed to inline or not inline a function according to its rules, but the behaviour of the program should not change (except for performance).

Also, what can be said about the sequencing of (a(),b()) + (c(),d())? Is it possible for c() and/or d() to execute between a() and b(), or for a() or b() to execute between c() and d()?

  • Clearly, a executes before b, and c executes before d. I believe it is possible for c and d to be executed between a and b, though it is fairly unlikely that it the compiler would generate the code like that; similarly, a and b could be executed between c and d. And although I used 'and' in 'c and d', that could be an 'or' - that is, any of these sequences of operation meet the constraints:

    • abcd
    • acbd
    • acdb
    • cadb
    • cdab
    • cabd

    I'm not certain that's an exhaustive listing, but it covers most of the variants.

If such a thing would be possible, that would imply a significant difference between inline functions and macros.

There are significant differences between inline functions and macros, but I don't think the ordering in the expression is one of them. That is, any of the functions a, b, c or d could be replaced with a macro, and the same sequencing of the macro bodies could occur. The primary difference, it seems to me, is that with the inline functions, there are guaranteed sequence points at the function calls - as outlined in the main answer - as well as at the comma operators. With macros, you lose the function-related sequence points. (So, maybe that is a significant difference...) However, in so many ways the issue is rather like questions about how many angels can dance on the head of a pin - it isn't very important in practice. If someone presented me with the expression (a(),b()) + (c(),d()) in a code review, I would tell them to rewrite the code to make it clear:

a();
c();
x = b() + d();

And that assumes there is no critical sequencing requirement on b() vs d().

Do global variables mean faster code?

30 votes

I read recently, in an article on game programming written in 1996, that using global variables is faster than passing parameters.

Was this ever true, and if so, is this still true today?

Short answer - No, good programmers make code go faster by knowing and using the appropriate tools for the job, and then optimizing in a methodical way where their code does not meet their requirements.

Longer answer - This article, which in my opinion is not especially well-written, is not in any case general advice on program speedup but '15 ways to do faster blits'. Extrapolating this to the general case is missing the writer's point, whatever you think of the merits of the article.

If I was looking for performance advice, I would place zero credence in an article that does not identify or show a single concrete code change to support the assertions in the sample code, and without suggesting that measuring the code might be a good idea. If you are not going to show how to make the code better, why include it?

Some of the advice is years out of date - FAR pointers stopped being an issue on the PC a long time ago.

A serious game developer (or any other professional programmer, for that matter) would have a good laugh about advice like this:

You can either take out the assert's completely, or you can just add a #define NDEBUG when you compile the final version.

My advice to you, if you really wish to evaluate the merit of any of these 15 tips, and since the article is 14 years old, would be to compile the code in a modern compiler (Visual C++ 10 say) and try to identify any area where using a global variable (or any of the other tips) would make it faster.

[Just joking - my real advice would be to ignore this article completely and ask specific performance questions on Stack Overflow as you hit issues in your work that you cannot resolve. That way the answers you get will be peer reviewed, supported by example code or good external evidence, and current.]

Why is inlining considered faster than a function call?

30 votes

Now, I know it's because there's not the overhead of calling a function, but is the overhead of calling a function really that heavy (and worth the bloat of having it inlined) ?

From what I can remember, when a function is called, say f(x,y), x and y are pushed onto the stack, and the stack pointer jumps to an empty block, and begins execution. I know this is a bit of an oversimplification, but am I missing something? A few pushes and a jump to call a function, is there really that much overhead?

Let me know if I'm forgetting something, thanks!

Aside from the fact that there's no call (and therefore no associated expenses, like parameter preparation before the call and cleanup after the call), there's another significant advantage of inlining. When the function body is inlined, it's body can be re-interpreted in the specific context of the caller. This might immediately allow the compiler to further reduce and optimize the code.

For one simple example, this function

void foo(bool b) {
  if (b) {
    // something
  }
  else {
    // something else
  }
}

will require actual branching if called as a non-inlined function

foo(true);
...
foo(false);

However, if the above calls are inlined, the compiler will immediately be able to eliminate the branching. Essentially, in the above case inlining allows the compiler to interpret the function argument as a compile-time constant (if the parameter is a compile-time constant) - something that is generally not possible with non-inlined functions.

However, it is not even remotely limited to that. In general, the optimization opportunities enabled of inlining are significantly more far-reaching. For another example, when the function body is inlined into the specific caller's context, the compiler in general case will be able to propagate the known aliasing-related relationships present in the calling code into the inlined function code, thus making it possible to optimize the function's code better.

Again, the possible examples are numerous, all of them stemming from the basic fact that inlined calls are immersed into the specific caller's context, thus enabling various inter-context optimizations, which would not be possible with non-inlined calles. With inlining you basically get many individual versions of your original function, each version is tailored and optimized individually for each specific caller context. The price of that is, obviously, the potential danger of code bloat, but if used correctly, it can provide noticeable performance benefits.

What are the Options for Storing Hierarchical Data in a Relational Database?

28 votes

Good Overviews

Generally speaking you're making a decision between fast read times (e.g. nested set) or fast write times (adjacency list). Usually you end up with a combination of the options below that best fit your needs. The following provides some in depth reading:

Options

Ones I am aware of and general features:

  1. Adjacency List:
    • Columns: ID, ParentID
    • Easy to implement.
    • Cheap node moves, inserts, and deletes.
    • Expensive to find level (can store as a computed column), ancestry & descendants (Bridge Hierarchy combined with level column can solve), path (Lineage Column can solve).
    • Use Common Table Expressions in those databases that support them to traverse.
  2. Nested Set (a.k.a Modified Preorder Tree Traversal)
    • First described by Joe Celko - covered in depth in his book Trees and Hierarchies in SQL for Smarties
    • Columns: Left, Right
    • Cheap level, ancestry, descendants
    • Compared to Adjacency List, moves, inserts, deletes more expensive.
    • Requires a specific sort order (e.g. created). So sorting all descendants in a different order requires additional work.
  3. Nested Intervals
    • Combination of Nested Sets and Materialized Path where left/right columns are floating point decimals instead of integers and encode the path information.
  4. Bridge Table (a.k.a. Closure Table: some good ideas about how to use triggers for maintaining this approach)
    • Columns: ancestor, descendant
    • Stands apart from table it describes.
    • Can include some nodes in more than one hierarchy.
    • Cheap ancestry and descendants (albeit not in what order)
    • For complete knowledge of a hierarchy needs to be combined with another option.
  5. Flat Table
    • A modification of the Adjacency List that adds a Level and Rank (e.g. ordering) column to each record.
    • Expensive move and delete
    • Cheap ancestry and descendants
    • Good Use: threaded discussion - forums / blog comments
  6. Lineage Column (a.k.a. Materialized Path, Path Enumeration)
    • Column: lineage (e.g. /parent/child/grandchild/etc...)
    • Limit to how deep the hierarchy can be.
    • Descendants cheap (e.g. LEFT(lineage, #) = '/enumerated/path')
    • Ancestry tricky (database specific queries)

Database Specific Notes

MySQL

Oracle

PostgreSQL

SQL Server

  • General summary
  • 2008 offers HierarchyId data type appears to help with Lineage Column approach and expand the depth that can be represented.

This is kind of a question that is still interesting even after all big 3 vendors implemented Recursive WITH clause. I'd suggest that different readers would be pleased with different answers.

  1. Comprehensive list of references by Troels Arvin (although it seems to be missing many recent fine articles mentioned in similar stackoverflow threads).
  2. For the lack of competition, introductory textbook by Joe Celko "Trees and Hierarchies in SQL for Smarties" can indeed be considered a classics.
  3. For mathematical sophistry and connections between various methods look up Tropashko publications.

27 votes

I have the following loop:

for (byte i = 0 ; i < 128; i++) {
    System.out.println(i + 1 + " " + name);
}

When I execute my programm it prints all numbers from -128 to 127 in an infinite loop. Why does this happen?

byte is a 1-byte type so can vary between -128...127, so condition i < 128 is always true. When you add 1 to 127 it overflows and becomes -128 and so on in a (infinite) loop...

Why is printing to stdout so slow? Can it be sped up?

26 votes

I've always been amazed/frustrated with how long it takes to simply output to the terminal with a print statement. After some recent painfully slow logging I decided to look into it and was quite surprised to find that almost all the time spent is waiting for the terminal to process the results.

Can writing to stdout be sped up somehow?

I wrote a script ('print_timer.py' at the bottom of this question) to compare timing when writing 100k lines to stdout, to file, and with stdout redirected to /dev/null. Here is the timing result:

$python print_timer.py
this is a test
this is a test
<snipped 99997 lines>
this is a test
-----
timing summary (100k lines each)
-----
print                         :11.950 s
write to file (+ fsync)       : 0.122 s
print with stdout = /dev/null : 0.050 s

Wow. To make sure python isn't doing something behind the scenes like recognizing that I reassigned stdout to /dev/null or something, I did the redirection outside the script...

$ python print_timer.py > /dev/null
-----
timing summary (100k lines each)
-----
print                         : 0.053 s
write to file (+fsync)        : 0.108 s
print with stdout = /dev/null : 0.045 s

So it isn't a python trick, it is just the terminal. I always knew dumping output to /dev/null sped things up, but never figured it was that significant!

It amazes me how slow the tty is. How can it be that writing to physical disk is WAY faster than writing to the "screen" (presumably an all-RAM op), and is effectively as fast as simply dumping to the garbage with /dev/null?

This link talks about how the terminal will block I/O so it can "parse [the input], update its frame buffer, communicate with the X server in order to scroll the window and so on"... but I don't fully get it. What can be taking so long?

I expect there is no way out (short of a faster tty implementation?) but figure I'd ask anyway.


UPDATE: after reading some comments I wondered how much impact my screen size actually has on the print time, and it does have some significance. The really slow numbers above are with my Gnome terminal blown up to 1920x1200. If I reduce it very small I get...

-----
timing summary (100k lines each)
-----
print                         : 2.920 s
write to file (+fsync)        : 0.121 s
print with stdout = /dev/null : 0.048 s

That is certainly better (~4x), but doesn't change my question. It only adds to my question as I don't understand why the terminal screen rendering should slow down an application writing to stdout. Why does my program need to wait for screen rendering to continue?

Are all terminal/tty apps not created equal? I have yet to experiment. It really seems to me like a terminal should be able to buffer all incoming data, parse/render it invisibly, and only render the most recent chunk that is visible in the current screen configuration at a sensible frame rate. So if I can write+fsync to disk in ~0.1 seconds, a terminal should be able to complete the same operation in something of that order (with maybe a few screen updates while it did it).

I'm still kind of hoping there is a tty setting that can be changed from the application side to make this behaviour better for programmer. If this is strictly a terminal application issue, then this maybe doesn't even belong on StackOverflow?

What am I missing?


Here is the python program used to generate the timing:

import time, sys, tty
import os

lineCount = 100000
line = "this is a test"
summary = ""

cmd = "print"
startTime_s = time.time()
for x in range(lineCount):
    print line
t = time.time() - startTime_s
summary += "%-30s:%6.3f s\n" % (cmd, t)

#Add a newline to match line outputs above...
line += "\n"

cmd = "write to file (+fsync)"
fp = file("out.txt", "w")
startTime_s = time.time()
for x in range(lineCount):
    fp.write(line)
os.fsync(fp.fileno())
t = time.time() - startTime_s
summary += "%-30s:%6.3f s\n" % (cmd, t)

cmd = "print with stdout = /dev/null"
sys.stdout = file(os.devnull, "w")
startTime_s = time.time()
for x in range(lineCount):
    fp.write(line)
t = time.time() - startTime_s
summary += "%-30s:%6.3f s\n" % (cmd, t)

print >> sys.stderr, "-----"
print >> sys.stderr, "timing summary (100k lines each)"
print >> sys.stderr, "-----"
print >> sys.stderr, summary

Thanks for all the comments! I've ended up answering it myself with your help. It feels dirty answering your own question, though.

Question 1: Why is printing to stdout slow?

Answer: Printing to stdout is not inherently slow. It is the terminal you work with that is slow. And it has pretty much zero to do with I/O buffering on the application side (eg: python file buffering). See below.

Question 2: Can it be sped up?

Answer: Yes it can, but seemingly not from the program side (the side doing the 'printing' to stdout). To speed it up, use a faster different terminal emulator.

Explanation...

I tried a self-described 'lightweight' terminal program called wterm and got significantly better results. Below is the output of my test script (at the bottom of the question) when running in wterm at 1920x1200 in on the same system where the basic print option took 12s using gnome-terminal:

-----
timing summary (100k lines each)
-----
print                         : 0.261 s
write to file (+fsync)        : 0.110 s
print with stdout = /dev/null : 0.050 s

0.26s is MUCH better than 12s! I don't know whether wterm is more intelligent about how it renders to screen along the lines of how I was suggesting (render the 'visible' tail at a reasonable frame rate), or whether it just "does less" than gnome-terminal. For the purposes of my question I've got the answer, though. gnome-terminal is slow.

So - If you have a long running script that you feel is slow and it spews massive amounts of text to stdout... try a different terminal and see if it is any better!

Note that I pretty much randomly pulled wterm from the ubuntu/debian repositories. This link might be the same terminal, but I'm not sure. I did not test any other terminal emulators.


Update: Because I had to scratch the itch, I tested a whole pile of other terminal emulators with the same script and full screen (1920x1200). My manually collected stats are here:

wterm           0.3s
aterm           0.3s
rxvt            0.3s
mrxvt           0.4s
konsole         0.6s
yakuake         0.7s
lxterminal        7s
xterm             9s
gnome-terminal   12s
xfce4-terminal   12s
vala-terminal    18s
xvt              48s

The recorded times are manually collected, but they were pretty consistent. I recorded the best(ish) value. YMMV, obviously.

As a bonus, it was an interesting tour of some of the various terminal emulators available out there! I'm amazed my first 'alternate' test turned out to be the best of the bunch.