Best linq questions in November 2011

What is the easiest way to find the LINQ statement for a SQL statement

31 votes

I am a SQL Server DBA for a company that sells an ASP.NET MVC3 application that uses LINQ and Entity Framework 4 for all database access. When I find an inefficient query in my SQL Server's plan cache that was generated by LINQ, I would like to be able to find that LINQ statement in the source code so that I can optimize it. What is the best way to find the LINQ that generated a given SQL statement?

For example, is there any way to put an entry in a config file or decorate the code somehow so that the class and method name or the LINQ statement itself are included as comments in the generated SQL?

The commercial tools ORM Profiler, Entity Framework Profiler or Hugati Query Profiler will both give you a stack trace for the methods which generated the SQL. That makes it fairly easy to find the LINQ in code, though it isn't displayed directly.

These tools also have the advantage that they make it easy to find inefficient queries amongst the many other SQL statements executed by the app.

foreach + break vs linq FirstOrDefault performance difference

23 votes

I have two classes that perform date date range data fetching for particular days.

public class IterationLookup<TItem>
{
    private IList<Item> items = null;

    public IterationLookup(IEnumerable<TItem> items, Func<TItem, TKey> keySelector)
    {
        this.items = items.OrderByDescending(keySelector).ToList();
    }

    public TItem GetItem(DateTime day)
    {
        foreach(TItem i in this.items)
        {
           if (i.IsWithinRange(day))
           {
               return i;
           }
        }
        return null;
    }
}


public class LinqLookup<TItem>
{
    private IList<Item> items = null;

    public IterationLookup(IEnumerable<TItem> items, Func<TItem, TKey> keySelector)
    {
        this.items = items.OrderByDescending(keySelector).ToList();
    }

    public TItem GetItem(DateTime day)
    {
        return this.items.FirstOrDefault(i => i.IsWithinRange(day));
    }
}

Then I do speed tests which show that Linq version is about 5 times slower. This would make sense when I would store items locally without enumerating them using ToList. This would make it much slower, because with every call to FirstOrDefault, linq would also perform OrderByDescending. But that's not the case so I don't really know what's going on. Linq should perform very similar to iteration.

This is the code excerpt that measures my timings

IList<RangeItem> ranges = GenerateRanges(); // returns List<T>

var iterLookup = new IterationLookup<RangeItems>(ranges, r => r.Id);
var linqLookup = new LinqLookup<RangeItems>(ranges, r => r.Id);

Stopwatch timer = new Stopwatch();

timer.Start();
for(int i = 0; i < 1000000; i++)
{
    iterLookup.GetItem(GetRandomDay());
}
timer.Stop();
// display elapsed time

timer.Restart();
for(int i = 0; i < 1000000; i++)
{
    linqLookup.GetItem(GetRandomDay());
}
timer.Stop();
// display elapsed time

Why do I know it should perform better? Because when I write a very similar code without using these lookup classes, Linq performs very similar to foreach iterations...

// continue from previous code block

// items used by both order as they do in classes as well
IList<RangeItem> items = ranges.OrderByDescending(r => r.Id).ToList();

timer.Restart();
for(int i = 0; i < 1000000; i++)
{
    DateTime day = GetRandomDay();
    foreach(RangeItem r in items)
    {
        if (r.IsWithinRange(day))
        {
            // RangeItem result = r;
            break;
        }
    }
}    
timer.Stop();
// display elapsed time

timer.Restart();
for(int i = 0; i < 1000000; i++)
{
   DateTime day = GetRandomDay();
   items.FirstOrDefault(i => i.IsWithinRange(day));
}
timer.Stop();
// display elapsed time

This is by my opinion highly similar code. FirstOrDefault as much as I know also iterate for only as long until it gets to a valid item or until it reaches the end. And this is somehow the same as foreach with break.

But even iteration class performs worse than my simple foreach iteration loop which is also a mystery because all the overhead it has is the call to a method within a class compared to direct access.

Question

What am I doing wrong in my LINQ class that it performs so very slow?
What am I doing wrong in my Iteration class so it performs twice as slow as direct foreach loop?

Which times are being measured?

I do these steps:

  1. Generate ranges (as seen below in results)
  2. Create object instances for IterationLookup, LinqLookup (and my optimized date range class BitCountLookup which is not part of discussion here)
  3. Start timer and execute 1 million lookups on random days within maximum date range (as seen in results) by using previously instantiated IterationLookup class.
  4. Start timer and execute 1 million lookups on random days within maximum date range (as seen in results) by using previously instantiated LinqLookup class.
  5. Start timer and execute 1 million lookups (6 times) using manual foreach+break loops and Linq calls.

As you can see object instantiation is not measured.

Appendix I: Results over million lookups

Ranges displayed in these results don't overlap, which should make both approaches even more similar in case LINQ version doesn't break loop on successful match (which it highly likely does).

Generated Ranges:

ID Range        000000000111111111122222222223300000000011111111112222222222
                123456789012345678901234567890112345678901234567890123456789
09 22.01.-30.01.                     |-------|
08 14.01.-16.01.             |-|
07 16.02.-19.02.                                              |--|
06 15.01.-17.01.              |-|
05 19.02.-23.02.                                                 |---|
04 01.01.-07.01.|-----|
03 02.01.-10.01. |-------|
02 11.01.-13.01.          |-|
01 16.01.-20.01.               |---|
00 29.01.-06.02.                            |-------|

Lookup classes...

- Iteration: 1028ms
- Linq: 4517ms   !!! THIS IS THE PROBLEM !!!
- BitCounter: 401ms

Manual loops...

- Iter: 786ms
- Linq: 981ms
- Iter: 787ms
- Linq: 996ms
- Iter: 787ms
- Linq: 977ms
- Iter: 783ms
- Linq: 979ms

Appendix II: GitHub:Gist code to test for yourself

I've put up a Gist so you can get the full code yourself and see what's going on. Create a Console application and copy Program.cs into it an add other files that are part of this gist.

Grab it here.

Appendix III: Final thoughts and measurement tests

The most problematic thing was of course LINQ implementatino that was awfully slow. Turns out that this has all to do with delegate compiler optimization. LukeH provided the best and most usable solution that actually made me try different approaches to this. I've tried various different approaches in the GetItem method (or GetPointData as it's called in Gist):

  1. the usual way that most of developers would do (and is implemented in Gist as well and wasn't updated after results revealed this is not the best way of doing it):

    return this.items.FirstOrDefault(item => item.IsWithinRange(day));
    
  2. by defining a local predicate variable:

    Func<TItem, bool> predicate = item => item.IsWithinRange(day);
    return this.items.FirstOrDefault(predicate);
    
  3. local predicate builder:

    Func<DateTime, Func<TItem, bool>> builder = d => item => item.IsWithinRange(d);
    return this.items.FirstOrDefault(builder(day));
    
  4. local predicate builder and local predicate variable:

    Func<DateTime, Func<TItem, bool>> builder = d => item => item.IsWithinRange(d);
    Func<TItem, bool> predicate = builder(day);
    return this.items.FirstOrDefault(predicate);
    
  5. class level (static or instance) predicate builder:

    return this.items.FirstOrDefault(classLevelBuilder(day));
    
  6. externally defined predicate and provided as method parameter

    public TItem GetItem(Func<TItem, bool> predicate)
    {
        return this.items.FirstOrDefault(predicate);
    }
    

    when executing this method I also took two approaches:

    1. predicate provided directly at method call within for loop:

      for (int i = 0; i < 1000000; i++)
      {
          linqLookup.GetItem(item => item.IsWithinRange(GetRandomDay()));
      }
      
    2. predicate builder defined outside for loop:

      Func<DateTime, Func<Ranger, bool>> builder = d => r => r.IsWithinRange(d);
      for (int i = 0; i < 1000000; i++)
      {
          linqLookup.GetItem(builder(GetRandomDay()));
      }
      

Results - what performs best

For comparison when using iteration class, it takes it approx. 770ms to execute 1 million lookups on randomly generated ranges.

  1. #3 local predicate builder turns out to be best compiler optimized so it performs almost as fast as usual iterations. 800ms.
  2. #6.2 predicate builder defined outside for loop: 885ms
  3. #6.1 predicate defined within for loop: 1525ms
  4. all others took somewhere between 4200ms - 4360ms and are thus considered unusable.

So whenever you use a predicate in externally frequently callable method, define a builder and execute that. This will yield best results.

The biggest surprise to me about this is that delegates (or predicates) may be this much time consuming.

Further to Gabe's answer, I can confirm that the difference appears to be caused by the cost of re-constructing the delegate for every call to GetPointData.

If I add a single line to the GetPointData method in your IterationRangeLookupSingle class then it slows right down to the same crawling pace as LinqRangeLookupSingle. Try it:

// in IterationRangeLookupSingle<TItem, TKey>
public TItem GetPointData(DateTime point)
{
    // just a single line, this delegate is never used
    Func<TItem, bool> dummy = i => i.IsWithinRange(point);

    // the rest of the method remains exactly the same as before
    // ...
}

(I'm not sure why the compiler and/or jitter can't just ignore the superfluous delegate that I added above. Obviously, the delegate is necessary in your LinqRangeLookupSingle class.)

One possible workaround is to compose the predicate in LinqRangeLookupSingle so that point is passed to it as an argument. This means that the delegate only needs to be constructed once, not every time the GetPointData method is called. For example, the following change will speed up the LINQ version so that it's pretty much comparable with the foreach version:

// in LinqRangeLookupSingle<TItem, TKey>
public TItem GetPointData(DateTime point)
{
    Func<DateTime, Func<TItem, bool>> builder = x => y => y.IsWithinRange(x);
    Func<TItem, bool> predicate = builder(point);

    return this.items.FirstOrDefault(predicate);
}

linq ambiguity on where and select

5 votes

Today I ran into an issue with LINQ to objects (not SQL) that popped up due to a typo. I had a .Select one place and a .Where in another place. I was expecting same result but they are showing different numbers. Assume somelist has 10 elements with all elements having qty = 0

//returns 10 - basically count of all rows. I am expecting 0
 somelist.Select(p => p.qty > 0).Count() 

//returns 0 - the correct count
 somelist.Where(p => p.qty > 0).Count() 

if both select and where return IEnumerable<T> then why the ambiguity? Thank you.

Select is a projection, so what you get is the expression p.qty > 0 evaluated for each element in somelist. i.e. lots of true/false values (the same number as your original list). So when you do Count on it, you get the same number. If you look the select will return IEnumerable<bool> (because the type of p.qty > 0 is a bool).

Where filters the results so count runs on the filtered list, and give you the expected results. The type of this is an IEnumerable<TypeOfElementInOriginalList>.

Note you can also do: somelist.Count(p => p.qty > 0) because Count has an overload that accepts a predicate to filter by.

Overriding or ignoring undeclared entities in C# using LINQ

5 votes

I have a little utility that runs through looking for certain things in XML files using LINQ. It processes a MASSIVE collection of them rather quickly and nicely. However, about 20% of a certain batch of files fail to be read and are skipped, failing because of the degree symbol's presence as &deg; in the files. This is the "Reference to undeclared entity 'deg'." a previous question was about.

The solutions offered in the previous question cannot be directly applied here. I am not at liberty to go around modifying the files, and making copies of them and replacing instances or inserting tags in the copies seems inefficient. What would be the best way to go about getting LINQ to ignore the undeclared entities, which have absolutely no bearing on what my program does anyway? Or is there perhaps a good way of getting an XDocument.Load to be fed some entity declarations beforehand?

Unfortunately entities form part of the well-formedness rules for XML (2.1 Well-Formed XML Documents). It seems like you're saying you want the XDocument.Load to load what is notionally an XML file, but does not in fact conform to the rules, which it won't do, quite reasonably.

If your users are passing you what are supposed to be XML files, but that have undefined entities, then either you have to get them to provide the files in a valid format, or manage the incorrectness youself at load-time, in the ways that have been suggested.

It seems to me, from your restrictions, that the neatest approach would be to follow the example linked-to and create some settings to pass into the XMLReader along the lines of (Validating an XML Document in the DOM).

If there are entities which aren't defined and aren't listed in public schemas, you'll need to create your own schema which defines all the entities you need. So, create a generic settings for the XMLReader which references your own, custom schema. Add the necessary entities to this schema as certain files fail to load and then you'll build up a list of all the entites that you need to define in order that the XML files are valid.

Then, for each document you try to load, create an XMLReader for the file using the settings above and call the XDocument(XMLReader) overload.

Linq to return records that don't have a follow up record

5 votes

I have this query that counts the total number of +1's a user has made on our website:

return db.tblGPlusOneClicks
    .Where(c => 
        c.UserID == UserID
        && c.IsOn
        )
    .Select(c=>c.URLID)
    .Distinct()
    .Count();

Data originates from this table:

enter image description here

A simple count of distinct URLs where IsOn = true will show the count of pages they have +1'd. However, the table also stores when they un-plus1 something, by storing the value in IsOn as false.

If I:

  • Plus one my homepage
  • Unplus one my homepage

It shouldn't count this as a plus for that user in our query as the last action for this URL for this user was to un-plus 1 it. Similarly, if I:

  • Plus one my homepage
  • Unplus one my homepage
  • Plus one my homepage
  • Unplus one my homepage
  • Plus one my homepage

It should count this in the original query as the last action for that URL was to plus 1 it.

How can I modify my query to count the instances where IsOn is true and that was the last known action for that user for that URL? I'm struggling to write a query that does this.

Try this:

return db.tblGPlusOneClicks
    .Where(c => c.UserID == UserID)
    .GroupBy(c => c.URLID)
    .Count(g => g.OrderByDescending(c => c.Date).First().IsOn);