Best performance questions in February 2011

How fast is D compared to C++ ?

36 votes

I like some features of D, but would be interested if they come with a runtime penalty?

To compare, I implemented a simple program that computes scalar products of many short vectors both in C++ and in D. The result is surprising:

  • D: 18.9 s [see below for final runtime]
  • C++: 3.8 s

Is C++ really almost five times as fast or did I make a mistake in the D program (it is my first one, please forgive me)?


Summary of answers: D is as fast as C++ (for numerical computations) if

  1. one uses the correct optimization options (-O -release -inline -m64)
  2. uses gdc/gdmd instead of dmd
  3. uses correct D calling conventions (in vector_t x instead of const ref vector_t x in the arguments to function scalar_product below; --- while replacing the for loops by foreach looks nicer, but does not affect runtimes).

Here all the figures on my machine:

  • D (const ref arguments, dmd compiler, correct optimization): 9.6s
  • D (in arguments, dmd compiler, correct optimization): 6.1s
  • D (in arguments, gdmd compiler, correct optimization): 3.9s
  • C++: 3.8s

I used gdc/gdmd 4.3. The combination D(const ref arguments, gdmd compiler) did not compile and thus has not been tested. Aside: gdc/gdmd did not have the datetime library yet and compiling it from sources failed for me, so it seems not as easy to handle as dmd right now.

Thanks CyberShadow and GMan and all others!


Continuation of original question:

I compiled C++ with g++ -O3 (gcc-snapshot 2011-02-19) and D with dmd -O (dmd 2.052) on a moderate recent linux desktop. The results are reproducible over several runs and standard deviations negligible.

Here the C++ program:

#include <iostream>
#include <random>
#include <chrono>
#include <string>

#include <vector>
#include <array>

typedef std::chrono::duration<long, std::ratio<1, 1000>> millisecs;
template <typename _T>
long time_since(std::chrono::time_point<_T>& time) {
      long tm = std::chrono::duration_cast<millisecs>( std::chrono::system_clock::now() - time).count();
  time = std::chrono::system_clock::now();
  return tm;
}

const long N = 20000;
const int size = 10;

typedef int value_type;
typedef long long result_type;
typedef std::vector<value_type> vector_t;
typedef typename vector_t::size_type size_type;

inline value_type scalar_product(const vector_t& x, const vector_t& y) {
  value_type res = 0;
  size_type siz = x.size();
  for (size_type i = 0; i < siz; ++i)
    res += x[i] * y[i];
  return res;
}

int main() {
  auto tm_before = std::chrono::system_clock::now();

  // 1. allocate and fill randomly many short vectors
  vector_t* xs = new vector_t [N];
  for (int i = 0; i < N; ++i) {
    xs[i] = vector_t(size);
      }
  std::cerr << "allocation: " << time_since(tm_before) << " ms" << std::endl;

  std::mt19937 rnd_engine;
  std::uniform_int_distribution<value_type> runif_gen(-1000, 1000);
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < size; ++j)
      xs[i][j] = runif_gen(rnd_engine);
  std::cerr << "random generation: " << time_since(tm_before) << " ms" << std::endl;

  // 2. compute all pairwise scalar products:
  time_since(tm_before);
  result_type avg = 0;
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j) 
      avg += scalar_product(xs[i], xs[j]);
  avg = avg / N*N;
  auto time = time_since(tm_before);
  std::cout << "result: " << avg << std::endl;
  std::cout << "time: " << time << " ms" << std::endl;
}

And here the D version:

import std.stdio;
import std.datetime;
import std.random;

const long N = 20000;
const int size = 10;

alias int value_type;
alias long result_type;
alias value_type[] vector_t;
alias uint size_type;

value_type scalar_product(const ref vector_t x, const ref vector_t y) {
  value_type res = 0;
  size_type siz = x.length;
  for (size_type i = 0; i < siz; ++i)
    res += x[i] * y[i];
  return res;
}

int main() {   
  auto tm_before = Clock.currTime();

  // 1. allocate and fill randomly many short vectors
  vector_t[] xs;
  xs.length = N;
  for (int i = 0; i < N; ++i) {
    xs[i].length = size;
  }
  writefln("allocation: %i ", (Clock.currTime() - tm_before));
  tm_before = Clock.currTime();

  for (int i = 0; i < N; ++i)
    for (int j = 0; j < size; ++j)
      xs[i][j] = uniform(-1000, 1000);
  writefln("random: %i ", (Clock.currTime() - tm_before));
  tm_before = Clock.currTime();

  // 2. compute all pairwise scalar products:
  result_type avg = cast(result_type) 0;
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j) 
      avg += scalar_product(xs[i], xs[j]);
  avg = avg / N*N;
  writefln("result: %d", avg);
  auto time = Clock.currTime() - tm_before;
  writefln("scalar products: %i ", time);

  return 0;
}

To enable all optimizations and disable all safety checks, compile your D program with the following DMD flags:

-O -inline -release -noboundscheck

EDIT: I've tried your programs with g++, dmd and gdc. dmd does lag behind, but gdc achieves performance very close to g++. The commandline I used was gdmd -O -release -inline (gdmd is a wrapper around gdc which accepts dmd options).

Looking at the assembler listing, it looks like neither dmd nor gdc inlined scalar_product, but g++/gdc did emit MMX instructions, so they might be auto-vectorizing the loop.

PHP array performance

15 votes

Hi, this is my first question on Stackoverflow, please bear with me.

I'm testing an algorithm for 2d bin packing and I've chosen PHP to mock it up as it's my bread-and-butter language nowadays.

As you can see on http://themworks.com/pack_v0.2/oopack.php?ol=1 it works pretty well, but you need to wait around 10-20 seconds for 100 rectangles to pack. For some hard to handle sets it would hit the php's 30s runtime limit.

I did some profiling and it shows that most of the time my script goes through different parts of a small 2d array with 0's and 1's in it. It either checks if certain cell equals to 0/1 or sets it to 0/1. It can do such operations million times and each times it takes few microseconds.

I guess I could use an array of booleans in a statically typed language and things would be faster. Or even make an array of 1 bit values. I'm thinking of converting the whole thing to some compiled language. Is PHP just not good for it?

If I do need to convert it to let's say C++, how good are the automatic converters? My script is just a lot of for loops with basic arrays and objects manipulations.

Thank you!

Edit. This function gets called more than any other. It reads few properties of a very simple object, and goes through a very small part of a smallish array to check if there's any element not equal to 0.

function fits($bin, $w, $h, $x, $y) {

    $w += $x;
    $h += $y;

    for ($i = $x; $i < $w; $i++) {

        for ($j = $y; $j < $h; $j++) {

            if ($bin[$i][$j] !== 0) {
                return false;
            }
        }
    }

    return true;    
}

Update: I've tried using 1d array instead of 2d as one of the answers suggested. Since I needed to always have current bin width accessible, I decided to wrap everything in the object. Also, now in every loop the index needs to be calculated. Now the script takes even more time to run. Other techniques didn't bring much performance boost, rather made code less readable. Time for HipHop I guess.

Update: since hiphop php only runs on linux, and I don't have one, I've decided to rewrite the whole thing in C++. It's nice to freshen up the old skills. Also, if I do find a way to use hiphop, it'll be interesting to compare hand-written C++ code and the one hiphop would generate.

Update: I rewrote this thing in c++, on average it works 20 times faster and uses much less memory. Let me see if I can make it even faster.

Array access in PHP can certainly be slow. PHP uses hash tables to implement arrays, i.e. in order to access an element in an array it has to calculate a hash and traverse a linked list. Using a compiled language with real arrays will definitely improve performance, because there a direct memory access is made. For the interested: Code for hash access with string and with integer.

Concerning your code, there are several points I would optimize:

  • return directly, don't break twice.
  • put $file->get_width() and $file->get_height into simple variables. I assume that the height or width doesn't change throughout the process. Remember: Functions in PHP are slow.
  • Use a one-dimensional array, instead of nested arrays. You save one hash lookup per iteration that way. Actually a one-dimensional array is only marginally faster or even slightly slower. Comparison of several ways of saving the data concerning performance and memory usage.

.

function fits($bin, $x, $y, $w, $h) {
    $w += $x;
    $h += $y;

    for ($i = $x; $i < $w; ++$i) {
        for ($j = $y; $j < $h; ++$j) {
            if ($bin[$i][$j] !== 0) {
                return false;
            }
        } 
    }

    return true;   
}

Though I'm not sure, why you add $x to the $width / $y to the $height. Don't you want to iterate from the current coordinates to the image boundaries?

Publishing an ASP.NET MVC2 site with Web Deploy

11 votes

I currently use Web Deploy, http://learn.iis.net/page.aspx/346/web-deploy/ to publish my MVC2 app. It used to work well, but now it is got to the point where I can't continue using it:

When the MVC app was small and had only a few users it was easy to publish. Just right click the project in Visual Studio and choose "Publish". And because there were only a few users it was easy to find a time when no one was using the site to do a quick update.

Then the app got bigger and had a few more users. The "Publish" action started taking longer and longer and occasionally timing out. Even when I recycled the app pool before deploy it still took a long time.

Also it became harder to find a time when no one was using the site so the update could be done without affecting anyone.

Then the "Publish" action started timing out every single time, and I had to switch to manual deployment as per this earlier unanswered question: Visual Studio 2010 - web deploy times out - what to do?

Now the manual deploy is taking longer and longer, from 5 to 20 minutes. And the number of users has grown significantly, so the deployment always affects someone (slow response times, timeouts, site unavailable, etc)

So what can I do? Is there a better alternative to using web deploy?

Edit:

Today's deployment took 18 minutes to publish just 49 changed files. The situation is just ridiculous and is one of the biggest weaknesses of our site right now. So I'm starting a decent sized bounty in the hopes of solving this.

Some more questions that may lead to a solution:

  • Why would it take so long when only a few files have been changed?
  • Why does the web deploy zip always include the entire codebase and not just changed files?
  • Why don't I just manually copy the changed files myself and skip the whole web deploy? But it is hard to manually work out what files have changed. I use SVN - does it have a way to output only files that have changed between two branches?
  • What other questions should I be asking but haven't thought of yet?

In reply to answers:

Re: http://www.troyhunt.com/2010/11/you-deploying-it-wrong-teamcity_24.html This is exactly how I was doing the deploy, and would be an ideal method. Web deploy does correctly identify which files have changed, however it times out and no publish occurs. There are around 2500 files in the solution, perhaps it is taking too long to identify which ones are changed? Or it could be that publish has a short timeout value and that just uploading the 15mb zip file uses all that time up.

I do have full control over the server, and it does support web deploy. There are actually 2 servers: the primary live server, and a redundant server that we keep ready in case the first falls over. So any solution has to be easy to deploy to more than one server (web deploy was ideal until it stopped working).

The suggestion of creating a new folder for each release and then just changing IIS to point to that new folder sounds like it will result in lower downtime/slowtime when during the publish. But it is a very manual process and I would prefer something more automated.

Edit #2

I have managed to narrow it down, and found exactly where it is slow - but not why. This is from the deploy log:

[9/02/2011 12:11:56 a.m.] Performing synchronization pass #1.
[9/02/2011 12:11:56 a.m.] Parameter entry 'IIS Web Application Name/1' is applicable to 'iisApp/C:\src\Site.2010\Site.UI\obj\Release\Package\PackageTmp' because of its scope.
[9/02/2011 12:11:56 a.m.] Parameter entry 'IIS Web Application Name/2' is applicable to 'setAcl/C:\src\Site.2010\Site.UI\obj\Release\Package\PackageTmp' because of its scope.
[9/02/2011 12:11:56 a.m.] Parameter entry 'IIS Web Application Name/2' is applicable to 'setAcl/C:\src\Site.2010\Site.UI\obj\Release\Package\PackageTmp' because of its scope.
[9/02/2011 12:11:56 a.m.] Parameter entry 'Add write permission to App_Data Folder/1' is applicable to 'setAcl/C:\src\Site.2010\Site.UI\obj\Release\Package\PackageTmp\App_Data' because of its scope.
[9/02/2011 12:11:56 a.m.] Source createApp (C:\src\Site.2010\Site.UI\obj\Release\Package\PackageTmp) does not match destination (Default Web Site/virtual-dir/) differing in attributes (isDest['False','True']). Update pending.
[9/02/2011 12:11:56 a.m.] Update operation on createApp (C:\src\Site.2010\Site.UI\obj\Release\Package\PackageTmp) skipped because of rule CreateApplicationRule.
[9/02/2011 12:11:56 a.m.] Source filePath (C:\src\Site.2010\Site.UI\obj\Release\Package\PackageTmp\App_Data\Create.sql) does not match destination (Default Web Site/virtual-dir/App_Data\Create.sql) differing in attributes (size['259691','259697'],lastWriteTime['02/08/2011 10:45:20','02/06/2011 03:48:16']). Update pending.

[400 lines of file updates skipped, time expired 2 seconds ....]

[9/02/2011 12:11:58 a.m.] Delete operation on filePath (Default Web Site/v2/zzz_app_offline.htm) skipped because of rule DoNotDeleteRule.
[9/02/2011 12:11:58 a.m.] Source setAcl (C:\src\Site.2010\Site.UI\obj\Release\Package\PackageTmp) does not match destination (Default Web Site/virtual-dir/) differing in attributes (isDest['False','True'],setAclUser,setAclAccess). Update pending.
[9/02/2011 12:11:58 a.m.] Updating setAcl (Default Web Site/virtual-dir/).
[9/02/2011 12:13:47 a.m.] Source setAcl (C:\src\Site.2010\Site.UI\obj\Release\Package\PackageTmp) does not match destination (Default Web Site/virtual-dir/) differing in attributes (isDest['False','True'],setAclUser,setAclAccess). Update pending.
[9/02/2011 12:13:47 a.m.] Updating setAcl (Default Web Site/virtual-dir/).
[9/02/2011 12:17:11 a.m.] Source setAcl (C:\src\Site.2010\Site.UI\obj\Release\Package\PackageTmp\App_Data) does not match destination (Default Web Site/virtual-dir//App_Data) differing in attributes (isDest['False','True'],setAclUser,setAclAccess). Update pending.
[9/02/2011 12:17:11 a.m.] Updating setAcl (Default Web Site/virtual-dir//App_Data).
[9/02/2011 12:17:11 a.m.] The dependency check 'DependencyCheckInUse' found no issues.
[9/02/2011 12:17:11 a.m.] The synchronization completed in 1 pass(es).

The cause of the slowness is the "Updating setAcl" component. I am examining the ACLs of the development box and server box to see what is different. However it seems like an extremely bad idea to copy the ACL from a dev box to a server box! I already had the ACL set up just fine on the server.

I'd start by trying to isolate where the timeout is happening. You've mentioned a 15MB zip with 2,500 files which doesn't strike me as particularly large. Have you tried creating a deployment package in Visual Studio then running it directly on the server? This will take network latency out of the picture which is a pretty fundamental variable when it comes to timeouts.

As for why a zip with the entire application needs to be uploaded, you need to remember the actual identification of what has changed and subsequent deployment into IIS all happens on the server. It's not Visual Studio or msdeploy on your local machine calling the shots on this.

As for why you don't just manually copy the changed files over, it's summarised in my blog post you've referenced but in short, it's laborious and error prone. It means you need to consciously work through the thought process of "which of my 2,500 files just changed" rather than simply saying "make my target site match my development version". You haven't mentioned if you're publishing the web.config or not but obviously config transforms is another important reason why the simple CTRL-C then CTRL-V approach is cumbersome.

Trying to just take a change directly from SVN is also risky. Your first problem is you need to have complete confidence in the integrity and accuracy of the revision you're updating from if you're to get the appropriate changes published. You're then left with trying to sync these to the target and you're back at the same issues raised in the previous paragraph. The other big problem is versioning object code is always nasty; you'll be in a perpetual state of conflict with anyone else on the project and VCS is simply not intended to function this way.

My advice would be to focus on solving the root cause of the problem - Web Deploy is timing out - rather than simply trying to work around the symptoms. Manually publishing changes only or messing around with IIS bindings is only going to create more trouble for you in the long run and a lot more work in the immediate term. See how you go sharing the results of creating a package, copying it to the server then executing it locally and we'll take it from there. Once you have it working as designed, you should be seeing deployments no more than a few minutes and site outage measured in seconds.

BTW - You might also like to add what sort of latency you have between your PC and the server and how long it would normally take to transfer a 15MB file over HTTP.

ViewBag vs ViewData performance difference in MVC?

9 votes

I know that ViewData and ViewBag both use the same backing data and that neither are as good as using strongly typed models in most cases. However when choosing between the two is the dynamic nature of ViewBag slower than using ViewData?

Okay - my initial answer basically said 'no' - time for a bit of a u-turn.

It should be 'no' in a perfect dynamic world - but upon closer inspection it would appear that there will either be no difference (accounting for JIT magic) or it might be ever-so-slightly slower, although not enough to warrant not using it (I certainly am).

In theory if properly implemented, the ViewBag would ultimately outperform the use of the ViewData dictionary because the binding of the expressions (e.g. ViewBag.Foo) is very well cached across the different CallSites that the compiler will generate (reflect a method that does a read or write to the ViewBag and you'll see what I mean).

The caching layers of the DLR are well documented (if a little difficult to understand once you get in depth) but basically the runtime does its best to 'remember' where a given value instance is once its bound it - for example via a Set or Get statement.

BUT The caching, its use and effectiveness, is entirely dependent upon the underlying implementations of classes/interfaces such as DynamicObject, IDynamicMetaObjectProvider etc; as well as the end-result of the Get/Set expression binding.

In the case of the MVC internal DynamicViewDataDictionary class - it ultimately ends up binding to this:

public override bool TryGetMember(GetMemberBinder binder, out object result)
{
  result = this.ViewData[binder.Name];
  return true;
}

For var a = ViewBag.Foo

And

public override bool TrySetMember(SetMemberBinder binder, object value)
{
  this.ViewData[binder.Name] = value;
  return true;
}

For ViewBag.Foo = Bar;

In other words - the statements are effectively being rewritten to wrappers around the dictionary indexer.

Because of that, there's certainly no way it could be faster than doing it yourself.

Were ViewData to feed off of ViewBag, instead of the other way around, and had ViewBag then been implemented with even something like ExpandoObject, then it might be a different story - as the dynamic implementation of ExpandoObject is much more intelligent and the caching rules it employs allow for some pretty cool runtime optimisations.

In Conclusion

(thanks to Shawn McLean for suggesting one was needed!)

ViewBag will be slower than ViewData; but probably not enough to warrant concern.

Java Regular Expression running very slow

7 votes

I'm trying to use the Daring Fireball Regular Expression for matching URLs in Java, and I've found a URL that causes the evaluation to take forever. I've modified the original regex to work with Java syntax.

private final static String pattern = 
"\\b" + 
"(" +                            // Capture 1: entire matched URL
  "(?:" +
    "[a-z][\\w-]+:" +                // URL protocol and colon
    "(?:" +
      "/{1,3}" +                        // 1-3 slashes
      "|" +                             //   or
      "[a-z0-9%]" +                     // Single letter or digit or '%'
                                        // (Trying not to match e.g. "URI::Escape")
    ")" +
    "|" +                            //   or
    "www\\d{0,3}[.]" +               // "www.", "www1.", "www2." … "www999."
    "|" +                            //   or
    "[a-z0-9.\\-]+[.][a-z]{2,4}/" +  // looks like domain name followed by a slash
  ")" +
  "(?:" +                           // One or more:
    "[^\\s()<>]+" +                      // Run of non-space, non-()<>
    "|" +                               //   or
    "\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" +  // balanced parens, up to 2 levels
  ")+" +
  "(?:" +                           // End with:
    "\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" +  // balanced parens, up to 2 levels
    "|" +                                   //   or
    "[^\\s`!\\-()\\[\\]{};:'\".,<>?«»“”‘’]" +        // not a space or one of these punct chars (updated to add a 'dash'
  ")" +
")";

// @see http://daringfireball.net/2010/07/improved_regex_for_matching_urls
private static final Pattern DARING_FIREBALL_PATTERN = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

If I attempt to run the following, it takes forever. I've narrowed it down to the matching of balanced parens (I think). If you change the text within the parens, it works fine, but at about 15 characters, it starts to slow down exponentially.

final Matcher matcher = pattern.matcher("https://goo.gl/a(something_really_long_in_balanced_parens)");
boolean found = matcher.find();

Is there a way to improve this regex so that the lines about don't take forever? I have about 100 different URLs in a JUnit test class, and I need those to continue to work as well.

The problem is here:

"(?:" +                           // One or more:
"[^\\s()<>]+" +                      // Run of non-space, non-()<>
"|" +                               //   or
"\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" +  // balanced parens, up to 2 levels
")+"

What you've got here is nested quantifiers. This plays havoc with any backtracking algorithm - as an example, consider the regex /^(a+)+$/ matching against the string

aaaaaaaaaab

As a first attempt, the inner quantifier will match all of the as. Then the regex fails, so it backs off one. Then the outer quantifier tries to match again, swallowing up the last a, then the regex fails once more. We basically get exponential behaviour as the quantifiers try all sorts of ways of splitting up the run of as, without actually making any progress.

The solution is possessive quantifiers (which we denote by tacking a + onto the end of a quantifier) - we set up the inner quantifiers so that once they have a match, they don't let it go - they'll hold onto that until the match fails or an earlier quantifier backs off and they have to rematch starting somewhere else in the string. If we instead used /^(a++)+$/ as our regex, we would fail immediately on the non-matching string above, rather than going exponential trying to match it.

Try making those inner quantifiers possessive and see if it helps.

I'm looking for any optimizations I can make in a graphics editing program

6 votes

Heyo, this is my first time asking a question here so do forgive me if I mess somethin up >~<

I'm working on a program similar to openCanvas, the earlier ones that allowed multiple people to draw on the same canvas in real time over the internet. OC's really buggy and has a lot of limitations, which is why I wanted to write this.

I have it set up so the canvas extends "indefinitely" in all directions and is made up of 512x512 blocks of pixels that don't become active until they're drawn on, which should be really easy to make, and I was thinking about using Direct3D to make it hardware accelerated, thus the 512 square blocks.

My problem comes when I want to use layers, I'm not quite sure how I can compose layers quickly and without using a ton of memory, since my target is DirectX9 compatible video cards with 128m of memory, and a system with about 3.2 ghz of CPU power and between 2 and 8 gigs of ram. I had a few different approaches I was thinking of using and was wondering which would probably be the best, and if there was anything I could look into to make it run better.

My first idea was to make the gfx hardware do as much work as possible by having all the layers on all the blocks serve as textures, and they'd be updated by locking the changed area, updating them on the cpu, and unlocking them. Blocks that aren't currently being changed are flattened into one texture and the individual layers themselves are kept in system memory, which would reduce the gfx memory used, but could significantly increase bandwidth usage between system and gfx memory. I can see the constant locking and unlocking potentially slowing down the system pretty bad as well. Another possible issue is that I've heard some people using up to 200 layers, and I can't think of any good ways to optimize that given the above.

My other idea was to compose the textures -completely- in system memory, write them into a texture, and copy that texture to gfx memory to be rendered in each block. This seems to eliminate a lot of the issues with the other method, but at the same time I'm moving all the work into the CPU, instead of balancing it. This isn't a big deal as long as it still runs quickly, though. Again, however, there's the issue of having a couple hundred layers. In this case though, I could probably only update the final pixels that are actually changing, which is what I think the bigger name programs like Sai and Photoshop do.

I'm mostly looking for recommendations, suggestions that might improve the above, better methods, or links to articles that might be related to such a project. While I'm writing it in C++, I have no trouble translating from other languages. Thanks for your time~

Data Structure
You should definitely use a quadtree (or another hierarchical data structure) to store your canvas and its nodes should contain much smaller blocks than 512x512 pixels. Maybe not as small as 1x1 pixels, because then the hierarchical overhead would kill you - you'll find a good balance through testing.

Drawing
Let your users only draw on one (the highest) resolution. Imagine a infinitely large uniform grid (two dimensional array). Since you know the position of the mouse and the amount which your users have scrolled from the origin, you can derive absolute coordinates. Traverse the quadtree into that region (eventually adding new nodes) and insert the blocks (for example 32x32) as the user draws them into the quadtree. I would buffer what the user draws in a 2D array (for example as big as his screen resolution) and use a separate thread to traverse/alter the quadtree and copy the data from the buffer to circumvent any delays.

Rendering
Traversing the quadtree and copying all tiles to one texture and send it to the GPU? No! You see, sending over one texture which is as big as the screen resolution is not the problem (bandwidth wise). But traversing the quadtree and assembling the final image is (at least if you want many fps). The answer is to store the quadtree in system memory and stream it from the GPU. Means: Asynchronously another thread does the traversal and copies currently viewed data to the GPU in chunks as fast as it can. If your user doesn't view the canvas in full resolution you don't have to traverse the tree to leaf level, which gives you automatic level of detail (LOD).

Some random thoughts regarding the proposed strategy

  • The quadtree approach is great because it is very memory efficient.
  • The streaming idea can be extended to the HDD...SeaDragon
  • A sophisticated implementation would require something like CUDA.
  • If your GPU doesn't offer the necessary performance/programmability, just implement the traversal on the CPU - a little longer delay until the image is fully displayed but should be acceptable. Don't forget to program asynchronously using multiple threads so that the screen doesn't freeze while waiting on the CPU. You can play with different effects: Showing the whole image at once, blurry at first and slowly increasing detail (breadth first search (BFS)) or render it tile by tile (depth first search (DFS)) - maybe mixed with some cool effects.
  • Software implementation should be pretty easy, when only allowing viewing of the canvas at full resolution. If one can zoom out in steps that's a minor change to the traversal. If one can zoom seamlessly, that'll require linear interpolation between neighboring quadtree nodes' tiles - not trivial anymore, but doable.
  • Layers: The quadtree should give you low enough memory consumption allowing you to simply store one quadtree per layer. But when you've many layers you'll need some optimizations to stay in realtime: You can't assemble 200 textures per frame and send them to the GPU. Maybe (not totally sure whether that's the best solution) for each layer deleting all nodes of quadtrees beneath that layer whose tile's pixels are completely covered by the layer above. That would need to be done at runtime while drawing and a depth-buffer is required. If you offer an eraser tool, you can't delete nodes but have to mark them as "invisible" so that they can be omitted during traversal.

..off the top of my head. If you have further questions, let me know!