Best linq questions in March 2011

Speeding up the rate that IIS/.NET/LINQ retrieves data from the Network Buffers

12 votes

When doing a TCP analysis of the traffic between my web servers and database servers I see the network buffers (TCP Window) filling up frequently. The web servers then send TCP messages to the database server telling it that its buffers are full an not to send more data until given an update.

For example, this is the size of the network buffer in bytes for one of the more long lived connections to the database server over time:

Network Buffer Graph

The web servers are running a .NET 4.0 application running in IIS integrated mode on Windows 2008 R2 web servers. The SQL server is a 2008 R2 server.

My interpretation of this is that the SQL server is returning data to the web servers faster then the application on the web server can collect the data from the buffers. I have tried tuning just about everything I can in the network drivers to work around this issue. In particular increasing the RSS queues, disabling interrupt moderation, and setting Windows 2008 R2 server to increase the buffer size more aggressively.

So if my interpretation is correct that leaves me wondering about two possibilities:

  1. Is there any way in .NET to tell it to increase the size of the network buffers? The "enhanced 2008 R2 TCP stack" is rarely deciding to enable window scaling (Making the buffer bigger than 65 kBytes) for this connection (probably due to the low latency). It looks like the ability to manually set this system wide is gone in Windows server 2008 r2 (There used to be registry entries that are now ignored). So is there a way I can force this in the code?
  2. Is there anything that can be tuned that would speed up the rate at which the application reads information for the network buffers, in particular for the SQL connections?

Edit:
Requested DMV Query cutting off at ASYNC_NETWORK_IO:

SELECT * FROM sys.dm_os_wait_stats ORDER BY waiting_tasks_count desc;
wait_type  waiting_tasks_count  wait_time_ms  max_wait_time_ms  signal_wait_time_ms
CXPACKET            1436226309    2772827343             39259           354295135
SLEEP_TASK           231661274     337253925             10808            71665032
LATCH_EX             214958564     894509148             11855            84816450
SOS_SCHEDULER_YIELD  176997645     227440530              2997           227332659
ASYNC_NETWORK_IO     112914243      84132232             16707            16250951

1) What makes you think that this is TCP flow control, as opposed to SQL Server not producing data in the intervals where there is no traffic? Check if sys.dm_exec_requests look at wait_type. The wait types are described in Waits and Queues. If is indeed the client applying TCP flow control, then you'll see the wait type ASYNC_NETWORK_IO.

2) If the issue is indeed network wait type, then the solution is not to increase the bandwidth, but obviously to reduce the traffic. The client has no business requesting so much data from the server as to cause TCP flow control. This would be caused by doing horribly wrong things in the client, like counting rows or client-side paging. Move the processing on the server, and just get small results sets with the data you need.

Edit

Consuming the DB call result set ultimately boils down to one form or another of this:

FetchNextRow
while (not EnfOfResults)
{
  ProcessRow;
  FetchNextRow;
}

What this could mean, in real terms, it could be foreach row in IQueryable or SqlDataReader.Read(). But the basic idea is the same, that the client fetches rows from the result, process them, then gets some more rows. If the client code does anything in that ProcessRow that blocks, then the client code will not reach the point where it fetches the next row again, and thus will eventually trigger TCP flow control which in turn will cause SQL Server to suspend the query (since it has no place to write the results into). There is nothing you can do in terms of TCP to make this better. Increasing the window size can actually make maters worse, as now all those results that were previously suppressed at source (DB) are going to be created and have to be stored somewhere, which will ultimately mean live memory allocated to storage and may make things far worse than they are now.

If I would be in your shoes right now I'd focus on identifying where does that ProcessRow blocking occur. An hypothesis I put forward was that that processing would be an MVC View writing into the response buffer and being blocked in turn by TCP flow control resulting from the user agent not consuming the HTTP response (eg. Ajax call completed but the browser is not running the completion code to consume the response because the main thread is looping busy on something else). As always, the best approach is to methodically measure. Some possible tools:

Working with Cross Context Joins in LINQ-to-SQL

9 votes

Initially I had written this query using LINQ-to-SQL

var result = from w in PatternDataContext.Windows
    join cf in PatternDataContext.ControlFocus on w.WindowId equals cf.WindowId
    join p in PatternDataContext.Patterns on cf.CFId equals p.CFId
    join r in ResultDataContext.Results on p.PatternId equals r.PatternId
    join fi in ResultDataContext.IclFileInfos on r.IclFileId equals fi.IclFileId
    join sp in sessionProfileDataContext.ServerProfiles on fi.ServerProfileId equals sp.ProfileId
    join u in infrastructure.Users on sp.UserId equals u.Id
    where w.Process.Equals(processName)
    select u.DistributedAppId;

And when I executed it, and saw result in the QuickWatch.., it showed this message:

the query contains references to items defined on a different data context

On googling, I found this topic at Stackoverflow itself, where I learned simulating cross context joins and as suggested there, I changed my query a bit to this:

var result = from w in PatternDataContext.Windows
    join cf in PatternDataContext.ControlFocus on w.WindowId equals cf.WindowId
    join p in PatternDataContext.Patterns on cf.CFId equals p.CFId
    join r in SimulateJoinResults() on p.PatternId equals r.PatternId
    join fi in SimulateJoinIclFileInfos() on r.IclFileId equals fi.IclFileId
    join sp in SimulateJoinServerProfiles() on fi.ServerProfileId equals sp.ProfileId
    join u in SimulateJoinUsers() on sp.UserId equals u.Id
    where w.Process.Equals(processName)
    select u.DistributedAppId;

This query is using these SimulateXyz methods:

private static IQueryable<Result> SimulateJoinResults()
{
  return from r in SessionDataProvider.Instance.ResultDataContext.Results select r;
}
private static IQueryable<IclFileInfo> SimulateJoinIclFileInfos()
{
  return from f in SessionDataProvider.Instance.ResultDataContext.IclFileInfos select f;
}
private static IQueryable<ServerProfile> SimulateJoinServerProfiles()
{
  return from sp in sessionProfileDataContext.ServerProfiles select sp;
}
private static IQueryable<User> SimulateJoinUsers()
{
  return from u in infrastructureDataContext.Users select u;
}

But even this approach didn't solve the problem. I'm still getting this message in QuickWatch...:

the query contains references to items defined on a different data context

Any solution for this problem? Along with the solution, I would also want to know why the problem still exists, and how exactly the new solution removes it, so that from next time I could solve such problems myself. I'm new to LINQ, by the way.

I've had to do this before, and there are two ways to do it.

The first is to move all the servers into a single context. You do this by pointing LINQ-to-SQL to a single server, then, in that server, create linked servers to all the other servers. Then you just create views for any tables you're interested from the other servers, and add those views to your context.

The second is to manually do the joins yourself, by pulling in data from one context, and using just the properties you need to join into another context. For example,

int[] patternIds = SessionDataProvider.Instance.ResultDataContext.Results.Select(o => o.patternId).ToArray();
var results = from p in PatternDataContext.Patterns
              where patternIds.Contains(p.PatternId)
              select p;

Though the first is easier to work with, it does have its share of problems. The problem is that you're relying on SQL Server to be performant with linked servers, something it is notoriously bad at. For example, consider this query:

var results = from p in DataContext.Patterns
              join r in DataContext.LinkedServerResults on p.PatternId equals r.PatternId
              where r.userId = 10;

When you enumerate this query, the following will occur (let's call the normal and linked servers MyServer and MyLinkedServer, respectively)

  1. MyServer asks MyLinkedServer for the Results
  2. MyLinkedServer sends the Results back to MyServer
  3. MyServer takes those Results, joins them on the Patterns table, and returns only the ones with Results.userId = 10

So now the question is: When is the filtering done - on MyServer or MyLinkedServer? In my experience, for such a simple query, it will usually be done on MyLinkedServer. However, once the query gets more complicated, you'll suddenly find that MyServer is requesting the entire Results table from MyLinkedServer and doing the filtering after the join! This wastes bandwidth, and, if the Results tables is large enough, could turn a 50ms query into a 50 second query!

You could fix unperformant cross-server joins using stored procedures, but if you do a lot of complex cross-server joins, you may end up writing stored procedures for most of your queries, which is a lot of work and defeats part of the purpose of using L2SQL in the first place (not having to write a lot of SQL).

In comparison, the following code would always perform the filtering on the server containing the Results table:

int[] patternIds = (from r in SessionDataProvider.Instance.ResultDataContext.Results
                    where r.userId = 10
                    select r.PatternId).ToArray();
var results = from p in PatternDataContext.Patterns
              where patternIds.Contains(p.PatternId)
              select p;

Which is best for your situation is up to your best judgement.


Note that there is a third potential solution which I did not mention, as it is not really a programmer-solution: you could ask your server admins to set up a replication task to copy the necessary data from MyLinkedServer to MyServer once a day/week/month. This is only an option if:

  • Your program can work with slightly stale data from MyLinkedServer
  • You only need to read, never write, to MyLinkedServer
  • The tables you need from MyLinkedServers are not exorbitantly huge
  • You have the space/bandwidth available
  • Your database admins are not stingy/lazy

Architecture Question - One Central Database and Many Different Programs Accessing It

6 votes

I am designing a program that will build and maintain a database, and act as a central server. This is the 'first stage' of a grander plan. Coming later will be 3-5 remote programs built around the information put into this database.

The requirements are:

  1. The remote programs must be able to access the information in the database.
  2. The remote programs must be able to set alerts when information in the database changes.
  3. The remote programs must be able to request the central server to go out and fetch new / different data.

So, the question is this: how do I expose this data and events to the outside world? My two choices are:

  1. Have them communicate directly with my 'server' application. This seems easier to:
    • do event notifications (although I suppose I'm probably missing something in SQL).
    • It also seems like this is more 'upgradeable' - that is I don't need to worry about the database updating and crashing all my remote programs because something changed. I can account for this and transform it the data to a version the child program will understand.
  2. Just go ahead and let them connect directly to the database.
    • This nice thing about this is that it's solved. I can use LINQ for SQL. The only thing the main server application needs to do is let the remote programs know where the database is.
    • I'm unsure how to trigger / relay 'events' for field changes in a database over different programs that may or may not be on the same computer.

Forgive my ignorance on this question. I feel woefully unprepared to ask it, but I'm having a hard time figuring out where to get started with this. It is my first real DB project :-/

Thanks!

If the other programs are going to need to know about updates to the database, then the best solution is to manage all db updates through your server application so it can alert clients of the changes. Otherwise it will be tough for the clients to be aware of changes to the db. This also has the advantage of hiding the implementation details of your storage solution from the clients, so you are free to change databases, etc...

LINQ to XML - accessing descendants with a prefix

5 votes

I have a sample xml file like this

<vs:BioData>
<vs:Name>Name</vs:Name>
<vs:Address>address</vs:Address>
<vs:Zip>Zip</vs:zip>
</vs:BioData>

All the nodes have a prefix value as vs and can anyone tell me how would I go about parsing this file to read Name and Address Information? I am very new to LINQ. Any help on this would be greatly appreciated.

Thanks!

You need to know what the namespace is. That will have been declared earlier, with something like:

xmlns:vs="http://some_url_here"

You can query using XNamespace:

XNamespace vs = "http://some_url_here";
var names = doc.Descendants(vs + "Name")
               .Select(x => (string) x)
               .ToList();

The + here is actually converting an XNamespace and a string to an XName.