Best sql-server questions in September 2011

Does SQL Server read all of a COALESCE function even if the first argument is not NULL?

15 votes

I'm using a T-SQL COALESCE function where the first argument will not be null on about 95% of the times it is ran. If the first argument is NULL, the second argument is quite a lengthy process:

SELECT COALESCE(c.FirstName
                ,(SELECT TOP 1 b.FirstName
                  FROM TableA a 
                  JOIN TableB b ON .....)
                )

If, for example, c.FirstName = 'John', would SQL Server still run the sub-query?

I know with the VB.NET IIF() function, if the second argument is True, the code still reads the third argument (even though it won't be used).

Nope. Simple test:

SELECT COALESCE(1, (SELECT 1/0)) -- runs fine
SELECT COALESCE(NULL, (SELECT 1/0)) -- throws error

If the second condition is evaluated, an exception is thrown for divide-by-zero.

Optimal performing query for latest record for each N

9 votes

Here is the scenario I find myself in.

I have a reasonably big table that I need to query the latest records from. Here is the create for the essential columns for the query:

CREATE TABLE [dbo].[ChannelValue](
   [ID] [bigint] IDENTITY(1,1) NOT NULL,
   [UpdateRecord] [bit] NOT NULL,
   [VehicleID] [int] NOT NULL,
   [UnitID] [int] NOT NULL,
   [RecordInsert] [datetime] NOT NULL,
   [TimeStamp] [datetime] NOT NULL
   ) ON [PRIMARY]
GO

The ID column is a Primary Key and there is a non-Clustered index on VehicleID and TimeStamp

CREATE NONCLUSTERED INDEX [IX_ChannelValue_TimeStamp_VehicleID] ON [dbo].[ChannelValue] 
(
    [TimeStamp] ASC,
    [VehicleID] ASC
)ON [PRIMARY]
GO

The table I'm working on to optimise my query is a little over 23 million rows and is only a 10th of the sizes the query needs to operate against.

I need to return the latest row for each VehicleID.

I've been looking through the responses to this question here on StackOverflow and I've done a fair bit of Googling and there seem to be 3 or 4 common ways of doing this on SQL Server 2005 and upwards.

So far the fastest method I've found is the following query:

SELECT cv.*
FROM ChannelValue cv
WHERE cv.TimeStamp = (
SELECT
    MAX(TimeStamp)
FROM ChannelValue
WHERE ChannelValue.VehicleID = cv.VehicleID
)

With the current amount of data in the table it takes about 6s to execute which is within reasonable limits but with the amount of data the table will contain in the live environment the query begins to perform too slow.

Looking at the execution plan my concern is around what SQL Server is doing to return the rows.

I cannot post the execution plan image because my Reputation isn't high enough but the index scan is parsing every single row within the table which is slowing the query down so much.

Execution Plan

I've tried rewriting the query with several different methods including using the SQL 2005 Partition method like this:

WITH cte
AS (
    SELECT *,
    ROW_NUMBER() OVER(PARTITION BY VehicleID ORDER BY TimeStamp DESC) AS seq
     FROM ChannelValue
)

SELECT
   VehicleID,
   TimeStamp,
   Col1
FROM cte
WHERE seq = 1

But the performance of that query is even worse by quite a large magnitude.

I've tried re-structuring the query like this but the result speed and query execution plan is nearly identical:

SELECT cv.*
FROM (
   SELECT VehicleID
    ,MAX(TimeStamp) AS [TimeStamp]
   FROM ChannelValue
   GROUP BY VehicleID
) AS [q]
INNER JOIN ChannelValue cv
   ON cv.VehicleID = q.VehicleID
   AND cv.TimeStamp = q.TimeStamp

I have some flexibility available to me around the table structure (although to a limited degree) so I can add indexes, indexed views and so forth or even additional tables to the database.

I would greatly appreciate any help at all here.

Edit Added the link to the execution plan image.

Depends on your data (how many rows are there per group?) and your indexes.

See Optimizing TOP N Per Group Queries for some performance comparisons of 3 approaches.

In your case with millions of rows for only a small number of Vehicles I would add an index on VehicleID, Timestamp and do

SELECT CA.*
FROM   Vehicles V
       CROSS APPLY (SELECT TOP 1 *
                    FROM   ChannelValue CV
                    WHERE  CV.VehicleID = V.VehicleID
                    ORDER  BY TimeStamp DESC) CA  

Curly braces in T-SQL

8 votes

I've come across the following t-sql:

SELECT {d'9999-12-31'}

Which returns 9999-12-31 00:00:00.000.

This seems to be converting the type of the string literal to a DATETIME. I can't find any documentation on this syntax and I'm wondering if there are any variations, for example if I have a literal 1 but want to represent this in a BIGINT without using CONVERT()/CAST().

Can anyone provide any further information on this syntax? Thanks.

These are ODBC escape sequences. See Date, Time, and Timestamp Escape Sequences for more details.

There is also similar syntax for uniqueidentifiers

SELECT {guid '00000000-0000-0000-0000-000000000000'},

as well as procedure calls and some other constructs detailed off that link.

With regard to the rest of your question I'm not aware of any way of having an integer literal treated as a bigint or of any particular resource that lists all the ways of influencing how literals are assigned datatypes by SQL Server. Some ways are below.

;WITH cte(thing) AS
(
SELECT CAST(1 AS SQL_VARIANT) UNION ALL
SELECT $1 UNION ALL
SELECT 1e0 UNION ALL
SELECT 1.0000 UNION ALL
SELECT 2147483648 UNION ALL 
SELECT {ts '2011-09-15 01:23:56.123'}  UNION ALL
SELECT {d '2011-09-15'} UNION ALL
SELECT { t '13:33:41' }  UNION ALL
SELECT {guid '00000000-0000-0000-0000-000000000000'} UNION ALL
SELECT 'Foo' UNION ALL
SELECT N'Foo'
)
SELECT thing, 
       sql_variant_property(thing,'basetype') AS basetype,
       sql_variant_property(thing,'precision') AS precision, 
       sql_variant_property(thing,'scale') AS scale, 
       sql_variant_property(thing,'maxlength') AS maxlength
FROM cte

Returns

thing                          basetype            precision   scale  maxlength
------------------------------ ------------------- ----------- ------ ---------
1                              int                 10          0      4
1.00                           money               19          4      8
1                              float               53          0      8
1.0000                         numeric             5           4      5
2147483648                     numeric             10          0      5
2011-09-15 01:23:56.123        datetime            23          3      8
2011-09-15 00:00:00.000        datetime            23          3      8
2011-09-15 13:33:41.000        datetime            23          3      8
00000000-0000-0000-0000-000000 uniqueidentifier    0           0      16
Foo                            varchar             0           0      3
Foo                            nvarchar            0           0      6

How can I structure a query to give me only the rows that match ALL values in a CSV list of IDs in T-SQL

8 votes

I've created a simple example (hopefully much more fun than my actual data) to better express my question:

CREATE TABLE SUPER_HERO
(   ID INT,
    NAME VARCHAR(50)
)

INSERT INTO SUPER_HERO VALUES (1, 'Storm')
INSERT INTO SUPER_HERO VALUES (2, 'Silver Surfer')
INSERT INTO SUPER_HERO VALUES (3, 'Spider Man')

CREATE TABLE SKILL
(   ID INT,
    NAME VARCHAR(50)
)

INSERT INTO SKILL VALUES (1, 'Flight')
INSERT INTO SKILL VALUES (2, 'Weather Control')
INSERT INTO SKILL VALUES (3, 'Super Speed')

CREATE TABLE SUPER_HERO_SKILL
(   SUPER_HERO_ID INT,
    SKILL_ID INT
)

INSERT INTO SUPER_HERO_SKILL VALUES (1, 1) --Storm has Flight
INSERT INTO SUPER_HERO_SKILL VALUES (1, 2) --Storm has Weather Control
INSERT INTO SUPER_HERO_SKILL VALUES (2, 1) --Silver Surfer has Flight
INSERT INTO SUPER_HERO_SKILL VALUES (2, 3) --Silver Surfer has Super Speed
INSERT INTO SUPER_HERO_SKILL VALUES (3, 3) --Spider Man has Super Speed

Example of bad query (not showing desired results):

DECLARE @DELIMITER CHAR = ','
DECLARE @CSV_STRING VARCHAR(20) = '1,3'

SELECT 
    SUPER_HERO_NAME =   SUPER_HERO.NAME,
    SKILL_NAME      =   SKILL.NAME
FROM 
    SUPER_HERO
    JOIN SUPER_HERO_SKILL ON  SUPER_HERO_SKILL.SUPER_HERO_ID = SUPER_HERO.ID
    JOIN SKILL ON SUPER_HERO_SKILL.SKILL_ID = SKILL.ID
    JOIN dbo.Split(@CSV_STRING, @DELIMITER) SPLIT  ON SPLIT.ITEMS = SKILL.ID

What I would like to see:
When DECLARE @CSV_STRING VARCHAR(20) = '1,3' I should only see "Silver Surfer" since he is the only one with both skills 1 and 3 which correlate to Flight and Super Speed.

When DECLARE @CSV_STRING VARCHAR(20) = '1,2,3' I should not see any heroes in my universe since there are none defined to have all three skills listed.


There must be something simple that I am missing. I have tried structuring the query many different ways. I have presented the simplest form of it here as not to complicate the presentation of problem.

Note: I use a function that acts as a Split based on delimiter passed in.

I've tested below query and it works.

Use the below spliter function which returns an int column. So it's easy to check the count in the HAVING clause.

CREATE FUNCTION [dbo].[DelimitedParamParser]( @DelimitedIds VARCHAR(MAX), @Delimiter CHAR(1)) 
RETURNS @IdsTable 
TABLE ( Id INT ) 
AS BEGIN

DECLARE @Length INT,
        @Index INT,
        @NextIndex INT

SET @Length = DATALENGTH(@DelimitedIds)
SET @Index = 0
SET @NextIndex = 0


WHILE (@Length > @Index )
BEGIN
    SET @NextIndex = CHARINDEX(@Delimiter, @DelimitedIds, @Index)
    IF (@NextIndex = 0 ) SET @NextIndex = @Length + 2
        INSERT @IdsTable SELECT SUBSTRING( @DelimitedIds, @Index, @NextIndex - @Index )
    SET @index = @nextindex + 1
END
 RETURN
END

This works, keep in mind to give an extra comma at the end.

DECLARE @DELIMITER CHAR = ','
DECLARE @CSV_STRING VARCHAR(20) = '1,3,'

SELECT Distinct SUPER_HERO.NAME, SKILL.NAME
FROM 
    SUPER_HERO
    INNER JOIN SUPER_HERO_SKILL ON  SUPER_HERO_SKILL.SUPER_HERO_ID = SUPER_HERO.ID
    INNER JOIN SKILL ON SUPER_HERO_SKILL.SKILL_ID = SKILL.ID
    WHERE SUPER_HERO.ID IN
    (
    SELECT SUPER_HERO_SKILL.SUPER_HERO_ID   
    FROM 
        SUPER_HERO
        INNER JOIN SUPER_HERO_SKILL ON  SUPER_HERO_SKILL.SUPER_HERO_ID = SUPER_HERO.ID
        INNER JOIN SKILL ON SUPER_HERO_SKILL.SKILL_ID = SKILL.ID
        INNER JOIN DelimitedParamParser(@CSV_STRING, @DELIMITER) SPLIT  ON SPLIT.ID = SUPER_HERO_SKILL.SKILL_ID
    GROUP BY SUPER_HERO_SKILL.SUPER_HERO_ID
    HAVING COUNT(DISTINCT(SUPER_HERO_SKILL.SKILL_ID)) = (SELECT COUNT(DISTINCT(Id)) FROM DelimitedParamParser(@CSV_STRING, @DELIMITER))
    )

Using parameters in SQL statements

8 votes

I am very new to working with databases. Now I can write SELECT, UPDATE, DELETE, and INSERT commands. But I have seen many forums where we prefer to write:

SELECT empSalary from employee where salary = @salary

...instead of:

SELECT empSalary from employee where salary = txtSalary.Text

Why do we always prefer to use parameters and how would I use them?

I wanted to know the use and benefits of the first method. I have even heard of SQL injection but I don't fully understand it. I don't even know if SQL injection is related to my question.

Using parameters helps prevent SQL Injection attacks when the database is used in conjunction with a program interface such as a desktop program or web site.

In your example, a user can directly run SQL code on your database by crafting statements in txtSalary.

For example, if they were to write 0 OR 1=1, the executed SQL would be

 SELECT empSalary from employee where salary = 0 or 1=1

whereby all empSalaries would be returned.

Further, a user could perform far worse commands against your database, including deleting it If they wrote 0; Drop Table employee:

SELECT empSalary from employee where salary = 0; Drop Table employee

The table employee would then be deleted.


In your case, it looks like you're using .NET. Using parameters is as easy as:

C#

string sql = "SELECT empSalary from employee where salary = @salary";
SqlConnection connection = New SqlConnection(/* connection info */);
SqlCommand command = SqlCommand(sql, connection);

command.Parameters.AddWithValue("salary", txtSalary.Text);

VB.NET

Dim sql As String = "SELECT empSalary from employee where salary = @salary"
Dim connection As New SqlConnection(connectionInfo)
Dim command As SqlCommand(sql, connection)

With command.Parameters
    .AddWithValue("salary", txtSalary.Text)
End With

Full outer join on one criteria, inner join on another

7 votes

i need to join Header and Detail rows into one resultset:

(sample DDL and inserts to follow):

Orders:

OrderID  OrderDate                CurrencyID  BuyAmount  BuyRate 
=======  =======================  ==========  =========  ========
1        2011-09-01 15:57:00.000  7           12173.60   1.243893
1        2011-09-01 15:57:00.000  9           69.48      1

OrderDetails:

OrderID  CurrencyID  SellAmount  SellRate
=======  ==========  ==========  ========
1        7           10000       1
1        8           12384       0.9638

i want them joined on OrderID and CurrencyID:

OrderID  CurrencyID  BuyAmount  BuyRate   SellAmount  SellRate
=======  ==========  =========  ========  ==========  ========
1        7           12173.60   1.243893  10000       1
1        8           NULL       NULL      12384       0.9638
1        9           69.48      1         NULL        NULL

Sample scripts

--USE Scratch

--Create a temporary `Orders` and, `OrderDetails` tables:
IF OBJECT_ID('tempdb..#Orders') > 0 DROP TABLE #Orders
CREATE TABLE #Orders
(
    OrderID int NOT NULL,
    OrderDate datetime NOT NULL,
    CurrencyID int NOT NULL,
    BuyAmount money NOT NULL,
    BuyRate real NOT NULL
)

IF OBJECT_ID('tempdb..#OrderDetails') > 0 DROP TABLE #OrderDetails
CREATE TABLE #OrderDetails
(
    OrderID int NOT NULL,
    CurrencyID int NOT NULL,
    SellAmount money NOT NULL,
    SellRate real NOT NULL
)

-- **Insert sample data:**

INSERT INTO #Orders (OrderID, OrderDate, CurrencyID, BuyAmount, BuyRate) 
VALUES (1, '20110901 15:57:00', 7, 12173.60, 1.2438933)
INSERT INTO #Orders (OrderID, OrderDate, CurrencyID, BuyAmount, BuyRate) 
VALUES (1, '20110901 15:57:00', 9, 69.48, 1)


INSERT INTO #OrderDetails (OrderID, CurrencyID, SellAmount, SellRate)
VALUES (1, 7, 10000, 1)
INSERT INTO #OrderDetails (OrderID, CurrencyID, SellAmount, SellRate)
VALUES (1, 8, 12384, 0.9638)


/*Desired Output:
OrderID  CurrencyID  BuyAmount  BuyRate   SellAmount  SellRate
=======  ==========  =========  ========  ==========  ========
1        7           12173.60   1.243893  10000       1
1        8           NULL       NULL      12384       0.9638
1        9           69.48      1         NULL        NULL

*/

i can't find a combination of RIGHT OUTER JOIN, FULL OUTER JOIN, COALESCE that can produce my desired output.


Update:

It's also possible that OrderDetails doesn't contain a matching CurrencyID from the Orders table:

Orders:

OrderID  CurrencyID  BuyAmount  BuyRate 
=======  ==========  =========  ========
1        7           12173.60   1.243893
1        9           69.48      1

OrderDetails:

OrderID  CurrencyID  SellAmount  SellRate
=======  ==========  ==========  ========
1        8           12384       0.9638

So, you've tried this?

SELECT
  COALESCE(o.OrderID, od.OrderID) AS OrderID,
  COALESCE(o.CurrencyID, od.CurrencyID) AS CurrencyID,
  o.BuyAmount,
  o.BuyRate,
  od.SellAmount,
  od.SellRate
FROM
  #Orders AS o
  FULL OUTER JOIN #OrderDetails AS od ON o.OrderID = od.OrderID AND o.CurrencyID = od.CurrencyID

use recursive common table expressions to find consecutive no.s from two tables

7 votes

i have the following tables:

Actual         Optional
------         --------
4                 3
13                6
20                7
26                14
                  19
                  21
                  27
                  28

What i have to do is select :

1) all the values from "Actual" Table.

2) select values from "Optional" table if they form a consecutive series with "actual" table values

The expected result is:

Answer
------
4
13
20
26
3    --because it is consecutive to 4 (i.e 3=4-1) 
14   --14=13+1
19   --19=20-1
21   --21=20+1
27   --27=26+1
28   --this is the important case.28 is not consecutive to 26 but 27 
     --is consecutive to 26 and 26,27,28 together form a series.

I wrote a query using recursive cte but it is looping forever and fails after recursion reaches 100 levels. The problem i faced is 27 matches with 26, 28 matches with 27 and 27 with 28.again 28 with 27...(forever)

Here is the query i wrote:

with recurcte as
        (
        select num as one,num as two from actual
        union all
         select opt.num as one,cte.two as two 
         from recurcte cte join optional opt 
         on opt.num+1=cte.one or opt.num-1=cte.one
        )select * from recurcte

;WITH Combined
     AS (SELECT 1 AS Actual, N
         FROM   (VALUES(4),
                       (13),
                       (20),
                       (26)) Actual(N)
         UNION ALL
         SELECT 0 AS Actual, N
         FROM   (VALUES(3),
                       (6),
                       (7),
                       (14),
                       (19),
                       (21),
                       (27),
                       (28)) Optional (N)),
     T1
     AS (SELECT *,
                N - DENSE_RANK() OVER (ORDER BY N) AS Grp
         FROM   Combined),
     T2
     AS (SELECT *,
                MAX(Actual) OVER (PARTITION BY Grp) AS HasActual
         FROM   T1)
SELECT DISTINCT N
FROM   T2
WHERE  HasActual = 1  

SQL why is SELECT COUNT(*) , MIN(col), MAX(col) faster then SELECT MIN(col), MAX(col)

7 votes

We're seeing a huge difference between these queries.

The slow query

SELECT MIN(col) AS Firstdate, MAX(col) AS Lastdate 
FROM table WHERE status = 'OK' AND fk = 4193

Table 'table'. Scan count 2, logical reads 2458969, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times: CPU time = 1966 ms, elapsed time = 1955 ms.

The fast query

SELECT count(*), MIN(col) AS Firstdate, MAX(col) AS Lastdate 
FROM table WHERE status = 'OK' AND fk = 4193

Table 'table'. Scan count 1, logical reads 5803, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times: CPU time = 0 ms, elapsed time = 9 ms.

Question

What is the reason between the huge performance difference between the queries?

Update A little update based on questions given as comments:

The order of execution or repeated execution changes nothing performance wise. There are no extra parameters used and the (test)database is not doing anything else during execution.

Slow query

|--Nested Loops(Inner Join)
 |--Stream Aggregate(DEFINE:([Expr1003]=MIN([DBTest].[dbo].[table].[startdate])))
   |    |--Top(TOP EXPRESSION:((1)))
   |         |--Nested Loops(Inner Join, OUTER REFERENCES:([DBTest].[dbo].[table].[id], [Expr1008]) WITH ORDERED PREFETCH)
   |              |--Index Scan(OBJECT:([DBTest].[dbo].[table].[startdate]), ORDERED FORWARD)
   |              |--Clustered Index Seek(OBJECT:([DBTest].[dbo].[table].[PK_table]), SEEK:([DBTest].[dbo].[table].[id]=[DBTest].[dbo].[table].[id]),  WHERE:([DBTest].[dbo].[table].[FK]=(5806) AND [DBTest].[dbo].[table].[status]<>'A') LOOKUP ORDERED FORWARD)
   |--Stream Aggregate(DEFINE:([Expr1004]=MAX([DBTest].[dbo].[table].[startdate])))
        |--Top(TOP EXPRESSION:((1)))
             |--Nested Loops(Inner Join, OUTER REFERENCES:([DBTest].[dbo].[table].[id], [Expr1009]) WITH ORDERED PREFETCH)
                  |--Index Scan(OBJECT:([DBTest].[dbo].[table].[startdate]), ORDERED BACKWARD)
                  |--Clustered Index Seek(OBJECT:([DBTest].[dbo].[table].[PK_table]), SEEK:([DBTest].[dbo].[table].[id]=[DBTest].[dbo].[table].[id]),  WHERE:([DBTest].[dbo].[table].[FK]=(5806) AND [DBTest].[dbo].[table].[status]<>'A') LOOKUP ORDERED FORWARD)

Fast query

 |--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[Expr1012],0)))
   |--Stream Aggregate(DEFINE:([Expr1012]=Count(*), [Expr1004]=MIN([DBTest].[dbo].[table].[startdate]), [Expr1005]=MAX([DBTest].[dbo].[table].[startdate])))
        |--Nested Loops(Inner Join, OUTER REFERENCES:([DBTest].[dbo].[table].[id], [Expr1011]) WITH UNORDERED PREFETCH)
             |--Index Seek(OBJECT:([DBTest].[dbo].[table].[FK]), SEEK:([DBTest].[dbo].[table].[FK]=(5806)) ORDERED FORWARD)
             |--Clustered Index Seek(OBJECT:([DBTest].[dbo].[table].[PK_table]), SEEK:([DBTest].[dbo].[table].[id]=[DBTest].[dbo].[table].[id]),  WHERE:([DBTest].[dbo].[table].[status]<'A' OR [DBTest].[dbo].[table].[status]>'A') LOOKUP ORDERED FORWARD)

The execution plan from SSMS

Answer

The answer given below by Martin Smith seems to explain the problem. The super short version is that the MS-SQL query-analyser wrongly uses a query plan in the slow query which causes a complete table scan.

Adding a Count(*), the query hint with(FORCESCAN) or a combined index on the startdate,FK and status columns fixes the performance issue.

There are 810,064 rows in the table.

You have the query

SELECT COUNT(*),
       MIN(startdate) AS Firstdate,
       MAX(startdate) AS Lastdate
FROM   table
WHERE  status <> 'A'
       AND fk = 4193 

1,893 (0.23%) rows meet the fk = 4193 predicate, and of those two fail the status <> 'A' part so overall 1,891 match and need to be aggregated.

You also have two indexes neither of which cover the whole query.

For your fast query it uses an index on fk to directly find rows where fk = 4193 then needs to do 1,893 key lookups to find each row in the clustered index to check the status predicate and retrieve the startdate for aggregation.

When you remove the COUNT(*) from the SELECT list SQL Server no longer has to process every qualifying row. As a result it considers another option.

You have an index on startdate so it could start scanning that from the beginning, doing key lookups back to the base table and as soon as it finds the first matching row stop as it has found the MIN(startdate), Similarly the MAX can be found with another scan starting the other end of the index and working backwards.

SQL Server estimates that each of these scans will end up processing 590 rows before they hit upon one that matches the predicate. Giving 1,180 total lookups vs 1,893 so it chooses this plan.

I'm not sure exactly what formula it uses to arrive at the 590 figure but from earlier numbers it can be seen that a row plucked at random has a 99.77% chance of not matching the where condition, so if you pick 590 random rows the chances of all of them not matching is approx 0.9977 ^ 590 or only around 25%

Unfortunately the 1,891 rows that meet the predicate are not randomly distributed with respect to startdate. In fact they are all condensed into a single 8,205 row segment towards the end of the index meaning that the scan to get to the MIN(startdate) ends up doing 801,859 key lookups before it can stop.

This can be reproduced below.

CREATE TABLE T
(
id int identity(1,1) primary key,
startdate datetime,
fk int,
[status] char(1),
Filler char(2000)
)

CREATE NONCLUSTERED INDEX ix ON T(startdate)

INSERT INTO T
SELECT TOP 810064 Getdate() - 1,
                  4192,
                  'B',
                  ''
FROM   sys.all_columns c1,
       sys.all_columns c2  


UPDATE T 
SET fk = 4193, startdate = GETDATE()
WHERE id BETWEEN 801859 and 803748 or id = 810064

UPDATE T 
SET  startdate = GETDATE() + 1
WHERE id > 810064


/*Both queries give the same plan. 
UPDATE STATISTICS T WITH FULLSCAN
makes no difference*/

SELECT MIN(startdate) AS Firstdate, 
       MAX(startdate) AS Lastdate 
FROM T
WHERE status <> 'A' AND fk = 4192


SELECT MIN(startdate) AS Firstdate, 
       MAX(startdate) AS Lastdate 
FROM T
WHERE status <> 'A' AND fk = 4193

You could consider using query hints to force the plan to use the index on fk rather than startdate or add the suggested missing index highlighted in the execution plan on (fk,status) INCLUDE (startdate) to avoid this issue.

Addition

Regarding the estimated rows and the 590 figure I've been doing some experiments where SQL Server has perfect statistics and it seems the formula it uses is actually very straight forward and is simply table_size / estimated_number_of_rows_that_match.

My repro above has estimated 428.379 rows and this is simply 810064 / 1891.

The one exception to the above seems to be that it will never come up with an estimated rows figure greater than 90% of the table even if the statistics indicate that no rows will match at all.

Your plan's exact figure was 589.643 which indicates then that it only expected 1,374 rows overall to match so again there might be a slight correlation where rows with fk = 4193 are somewhat more likely to also meet the status <> 'A' criteria than the average.

Oracle MIN as analytic function - odd behavior with ORDER BY?

6 votes

This particular case was distilled from an example where the programmer assumed that for two shipments into a tank car, line #1 would be loaded first. I corrected this to allow for the loading to be performed in any order - however, I discovered that MIN() OVER (PARTITION BY) allows an ORDER BY in Oracle (this is not allowed in SQL Server), and additionally, it alters the behavior of the function, causing the ORDER BY to apparently be added to the PARTITION BY.

WITH data AS (
SELECT 1 AS SHIPMENT_ID, 1 AS LINE_NUMBER, 2 AS TARE, 3 AS GROSS FROM DUAL
UNION ALL
SELECT 1 AS SHIPMENT_ID, 2 AS LINE_NUMBER, 1 AS TARE, 2 AS GROSS FROM DUAL
)
SELECT MIN(tare) OVER (PARTITION BY shipment_id) first_tare
,MAX(gross) OVER (PARTITION BY shipment_id) last_gross
,FIRST_VALUE(tare) OVER (PARTITION BY shipment_id ORDER BY LINE_NUMBER) first_tare_incorrect
,FIRST_VALUE(gross) OVER (PARTITION BY shipment_id ORDER BY LINE_NUMBER DESC) last_gross_incorrect
,MIN(tare) OVER (PARTITION BY shipment_id ORDER BY LINE_NUMBER) first_tare_incorrect_still
,MAX(gross) OVER (PARTITION BY shipment_id ORDER BY LINE_NUMBER DESC) last_gross_incorrect_still
,MIN(tare) OVER (PARTITION BY shipment_id, LINE_NUMBER) first_tare_incorrect_still2
,MAX(gross) OVER (PARTITION BY shipment_id, LINE_NUMBER) last_gross_incorrect_still2
 FROM data

A SQL Server example (with non-applicable code commented out):

WITH data AS (
SELECT 1 AS SHIPMENT_ID, 1 AS LINE_NUMBER, 2 AS TARE, 3 AS GROSS -- FROM DUAL
UNION ALL
SELECT 1 AS SHIPMENT_ID, 2 AS LINE_NUMBER, 1 AS TARE, 2 AS GROSS -- FROM DUAL
)
SELECT MIN(tare) OVER (PARTITION BY shipment_id) first_tare
,MAX(gross) OVER (PARTITION BY shipment_id) last_gross
-- ,FIRST_VALUE(tare) OVER (PARTITION BY shipment_id ORDER BY LINE_NUMBER) first_tare_incorrect
-- ,FIRST_VALUE(gross) OVER (PARTITION BY shipment_id ORDER BY LINE_NUMBER DESC) last_gross_incorrect
-- ,MIN(tare) OVER (PARTITION BY shipment_id ORDER BY LINE_NUMBER) first_tare_incorrect_still
-- ,MAX(gross) OVER (PARTITION BY shipment_id ORDER BY LINE_NUMBER DESC) last_gross_incorrect_still
,MIN(tare) OVER (PARTITION BY shipment_id, LINE_NUMBER) first_tare_incorrect_still2
,MAX(gross) OVER (PARTITION BY shipment_id, LINE_NUMBER) last_gross_incorrect_still2
 FROM data

So question: What is Oracle doing and why and is it right?

If you add an ORDER BY to the MIN analytic function, you turn it into a "min so far" function rather than an overall minimum. For the final row for whatever you're partitioning by, the results will be the same. But the prior rows may have a different "min so far" than the overall minimum.

Using the EMP table as an example, you can see that the minimum salary so far for the department eventually converges on the overall minimum for the department. And you can see that the "min so far" value for any given department decreases as lower values are encountered.

SQL> ed
Wrote file afiedt.buf

  1  select ename,
  2         deptno,
  3         sal,
  4         min(sal) over (partition by deptno order by ename) min_so_far,
  5         min(sal) over (partition by deptno) min_overall
  6    from emp
  7*  order by deptno, ename
SQL> /

ENAME          DEPTNO        SAL MIN_SO_FAR MIN_OVERALL
---------- ---------- ---------- ---------- -----------
CLARK              10       2450       2450        1300
KING               10       5000       2450        1300
MILLER             10       1300       1300        1300
ADAMS              20       1110       1110         800
FORD               20       3000       1110         800
JONES              20       2975       1110         800
SCOTT              20       3000       1110         800
smith              20        800        800         800
ALLEN              30       1600       1600         950
BLAKE              30       2850       1600         950
MARTIN             30       1250       1250         950
SM0                30        950        950         950
TURNER             30       1500        950         950
WARD               30       1250        950         950
BAR
PAV

16 rows selected.

Of course, it would make more sense to use this form of the analytic function when you're trying to do something like compute a personal best that you can use as a comparison in future periods. If you're tracking an individual's decreasing golf scores, mile times, or weight, displaying personal bests can be a form of motivation.

SQL> ed
Wrote file afiedt.buf

  1  with golf_scores as
  2  (  select 1 golfer_id, 80 score, sysdate dt from dual union all
  3     select 1, 82, sysdate+1 dt from dual union all
  4     select 1, 72, sysdate+2 dt from dual union all
  5     select 1, 75, sysdate+3 dt from dual union all
  6     select 1, 71, sysdate+4 dt from dual union all
  7     select 2, 74, sysdate from dual )
  8  select golfer_id,
  9         score,
 10         dt,
 11         (case when score=personal_best
 12               then 'New personal best'
 13               else null
 14           end) msg
 15    from (
 16  select golfer_id,
 17         score,
 18         dt,
 19         min(score) over (partition by golfer_id
 20                              order by dt) personal_best
 21    from golf_scores
 22*        )
SQL> /

 GOLFER_ID      SCORE DT        MSG
---------- ---------- --------- -----------------
         1         80 12-SEP-11 New personal best
         1         82 13-SEP-11
         1         72 14-SEP-11 New personal best
         1         75 15-SEP-11
         1         71 16-SEP-11 New personal best
         2         74 12-SEP-11 New personal best

6 rows selected.

Why does ISNUMERIC('.') return 1?

6 votes

Recently I was working with ISNUMERIC in SQL Server, when I encountered a problem, which led to finding this snippet of code.

SELECT ISNUMERIC('.')

This returns 1, as in true, shouldn't this return 0 as in false?

See IsNumeric() Broken? Only up to a point.

select cast('.' as money) 

returns 0.00 (though the cast fails for int and float)

ISNUMERIC just checks that the value can be cast to any one of the numeric datatypes which is generally useless. Usually you want to know whether it can be cast to a specific type.

Some useful helper functions for that are here IsNumeric, IsInt, IsNumber.

Denali will have TRY_PARSE and TRY_CONVERT that will help with this greatly

SQL Server execution plans: filter to show bottlenecks?

6 votes

Is there a way to only show items that are greater than 1% of the batch?

I am trying to find bottlenecks in a proc, it contains loops and other logic and 99% of the resulting executing plan I don't care about, but it's hard to scroll through in Management Studio and actually find the parts that slow it down.

Have you looked at SQL Sentry Plan Explorer? This is a 100% free tool that can integrate with Management Studio and provides much more clarity when analyzing plans. A very quick example shows how you can quickly focus on the most expensive node(s) in any graphical plan:

enter image description here

Download:

http://sqlsentry.net/plan-explorer/sql-server-query-view.asp

A recent blog post I wrote, by popular demand, to help others illustrate how much easier execution plans can be with Plan Explorer:

http://sqlblog.com/blogs/aaron_bertrand/archive/2011/09/13/a-demo-kit-for-sql-sentry-plan-explorer.aspx

Per your actual question, it doesn't hide any low- or zero-cost operations from the plan, but it does highlight the most expensive nodes in red, orange and yellow; we also highlight lookups and scans, and you can also display cost as a % by CPU+I/O, CPU alone, or I/O alone (in SSMS you can only use the combination). Many of the performance issues you're looking for are front and center or on one of the detailed tabs, rather than hidden away in Properties panels, quirky tooltips or not presented at all. In addition to the enhancements to the graphical plan you can also focus on specific statements that happen in a loop (with lots of related but lower-cost statements) by sorting in the Statements Tree node. Differences in estimated vs. actual rows (for actual plans) can highlight potential statistics issues; columns required by a key lookup can identify missing index columns quickly (or a SELECT * query that shouldn't be); and runtime vs. compiled parameters can point out parameter sniffing issues where you may want to use RECOMPILE, OPTIMIZE FOR or turn on "Optimize for ad hoc workloads." Finally, unlike with Management Studio, you can very easily generate runtime metrics (duration, CPU, reads, etc.) by generating an actual plan within the tool - with native tools you need to mess with SET options or use profiler to get these metrics - and even with those you have to discount the work involved with getting the results to the client (Plan Explorer runs the query on the server but discards the results).

Is a SqlConnection automatically closed when an application is closed?

6 votes

I am querying the db in a separate thread.

If I close the application while the query is being executed, will the SqlConnection automatically close or will it remain open?

If the process is terminated, all the OS resources, including network connections, will be released. In other words - that's fine.

Fastest Way to Count Distinct Values in a Column, Including NULL Values

5 votes

The Transact-Sql Count Distinct operation counts all non-null values in a column. I need to count the number of distinct values per column in a set of tables, including null values (so if there is a null in the column, the result should be (Select Count(Distinct COLNAME) From TABLE) + 1.

This is going to be repeated over every column in every table in the DB. Includes hundreds of tables, some of which have over 1M rows. Because this needs to be done over every single column, adding Indexes for every column is not a good option.

This will be done as part of an ASP.net site, so integration with code logic is also ok (i.e.: this doesn't have to be completed as part of one query, though if that can be done with good performance, then even better).

What is the most efficient way to do this?


Update After Testing

I tested the different methods from the answers given on a good representative table. The table has 3.2 million records, dozens of columns (a few with indexes, most without). One column has 3.2 million unique values. Other columns range from all Null (one value) to a max of 40K unique values. For each method I performed four tests (with multiple attempts at each, averaging the results): 20 columns at one time, 5 columns at one time, 1 column with many values (3.2M) and 1 column with a small number of values (167). Here are the results, in order of fastest to slowest

  1. Count/GroupBy (Cheran)
  2. CountDistinct+SubQuery (Ellis)
  3. dense_rank (Eriksson)
  4. Count+Max (Andriy)

Testing Results (in seconds):

   Method          20_Columns   5_Columns   1_Column (Large)   1_Column (Small)
1) Count/GroupBy      10.8          4.8            2.8               0.14       
2) CountDistinct      12.4          4.8            3                 0.7         
3) dense_rank        226           30              6                 4.33 
4) Count+Max          98.5         44             16                12.5        

Notes:

  • Interestingly enough, the two methods that were fastest (by far, with only a small difference in between then) were both methods that submitted separate queries for each column (and in the case of result #2, the query included a subquery, so there were really two queries submitted per column). Perhaps because the gains that would be achieved by limiting the number of table scans is small in comparison to the performance hit taken in terms of memory requirements (just a guess).
  • Though the dense_rank method is definitely the most elegant, it seems that it doesn't scale well (see the result for 20 columns, which is by far the worst of the four methods), and even on a small scale just cannot compete with the performance of Count.

Thanks for the help and suggestions!

SELECT COUNT(*)
FROM (SELECT ColumnName
      FROM TableName
      GROUP BY ColumnName) AS s;

GROUP BY selects distinct values including NULL. COUNT(*) will include NULLs, as opposed to COUNT(ColumnName), which ignores NULLs.

How do I select TOP 5 PERCENT from each group?

5 votes

I have a sample table like this:

CREATE TABLE #TEMP(Category VARCHAR(100), Name VARCHAR(100))

INSERT INTO #TEMP VALUES('A', 'John')
INSERT INTO #TEMP VALUES('A', 'John')
INSERT INTO #TEMP VALUES('A', 'John')
INSERT INTO #TEMP VALUES('A', 'John')
INSERT INTO #TEMP VALUES('A', 'John')
INSERT INTO #TEMP VALUES('A', 'John')
INSERT INTO #TEMP VALUES('A', 'Adam')
INSERT INTO #TEMP VALUES('A', 'Adam')
INSERT INTO #TEMP VALUES('A', 'Adam')
INSERT INTO #TEMP VALUES('A', 'Adam')
INSERT INTO #TEMP VALUES('A', 'Lisa')
INSERT INTO #TEMP VALUES('A', 'Lisa')
INSERT INTO #TEMP VALUES('A', 'Bucky')
INSERT INTO #TEMP VALUES('B', 'Lily')
INSERT INTO #TEMP VALUES('B', 'Lily')
INSERT INTO #TEMP VALUES('B', 'Lily')
INSERT INTO #TEMP VALUES('B', 'Lily')
INSERT INTO #TEMP VALUES('B', 'Lily')
INSERT INTO #TEMP VALUES('B', 'Tom')
INSERT INTO #TEMP VALUES('B', 'Tom')
INSERT INTO #TEMP VALUES('B', 'Tom')
INSERT INTO #TEMP VALUES('B', 'Tom')
INSERT INTO #TEMP VALUES('B', 'Ross')
INSERT INTO #TEMP VALUES('B', 'Ross')
INSERT INTO #TEMP VALUES('B', 'Ross')

SELECT Category, Name, COUNT(Name) Total
FROM #TEMP
GROUP BY Category, Name
ORDER BY Category, Total DESC

DROP TABLE #TEMP

Gives me the following:

A   John    6
A   Adam    4
A   Lisa    2
A   Bucky   1
B   Lily    5
B   Tom     4
B   Ross    3

Now, how do I select the TOP 5 PERCENT records from each category assuming each category has more than 100 records (did not show in sample table here)? For instance, in my actual table, it should remove the John record from A and Lily record from B as appropriate (again, I did not show the full table here) to get:

A   Adam    4
A   Lisa    2
A   Bucky   1
B   Tom     4
B   Ross    3

I have been trying to use CTEs and PARTITION BY clauses but cannot seem to achieve what I want. It removes the TOP 5 PERCENT from the overall result but not from each category. Any suggestions?

You could use a CTE (Common Table Expression) paired with the NTILE windowing function - this will slice up your data into as many slices as you need, e.g. in your case, into 20 slices (each 5%).

;WITH SlicedData AS
(
   SELECT Category, Name, COUNT(Name) Total,
            NTILE(20) OVER(PARTITION BY Category ORDER BY COUNT(Name) DESC) AS  'NTile'
   FROM #TEMP
   GROUP BY Category, Name
)
SELECT *
FROM SlicedData
WHERE NTile > 1

This basically groups your data by Category,Name, orders by something else (not sure if COUNT(Name) is really the thing you want here), and then slices it up into 20 pieces, each representing 5% of your data partition. The slice with NTile = 1 is the top 5% slice - just ignore that when selecting from the CTE.

See:

for more info

Maximum size of a varchar(max) variable

5 votes

At any time in the past, if someone had asked me the maximum size for a varchar(max), I'd have said 2GB, or looked up a more exact figure (2^31-1, or 2147483647).

However, in some recent testing, I discovered that varchar(max) variables can apparently exceed this size:

create table T (
    Val1 varchar(max) not null
)
go
declare @KMsg varchar(max) = REPLICATE('a',1024);
declare @MMsg varchar(max) = REPLICATE(@KMsg,1024);
declare @GMsg varchar(max) = REPLICATE(@MMsg,1024);
declare @GGMMsg varchar(max) = @GMsg + @GMsg + @MMsg;
select LEN(@GGMMsg)
insert into T(Val1) select @GGMMsg
select LEN(Val1) from T

Results:

(no column name)
2148532224
(1 row(s) affected)
Msg 7119, Level 16, State 1, Line 6
Attempting to grow LOB beyond maximum allowed size of 2147483647 bytes.
The statement has been terminated.

(no column name)
(0 row(s) affected)

So, given that I now know that a variable can exceed the 2GB barrier - does anyone know what the actual limit is for a varchar(max) variable?


(Above test completed on SQL Server 2008 (not R2). I'd be interested to know whether it applies to other versions)

As far as I can tell there is no upper limit in 2008.

In SQL Server 2005 the code in your question fails on the assignment to the @GGMMsg variable with

Attempting to grow LOB beyond maximum allowed size of 2,147,483,647 bytes.

the code below fails with

REPLICATE: The length of the result exceeds the length limit (2GB) of the target large type.

However it appears these limitations have quietly been lifted. On 2008

DECLARE @y VARCHAR(MAX) = REPLICATE(CAST('X' AS VARCHAR(MAX)),92681); 

SET @y = REPLICATE(@y,92681);

SELECT LEN(@y) 

Returns

8589767761

I ran this on my 32 bit desktop machine so this 8GB string is way in excess of addressable memory

Running

select internal_objects_alloc_page_count
from sys.dm_db_task_space_usage
WHERE session_id = @@spid

Returned

internal_objects_alloc_page_co 
------------------------------ 
2144456    

so I presume this all just gets stored in LOB pages in tempdb with no validation on length. The page count growth was all associated with the SET @y = REPLICATE(@y,92681); statement. The initial variable assignment to @y and the LEN calculation did not increase this.

The reason for mentioning this is because the page count is hugely more than I was expecting. Assuming an 8KB page then this works out at 16.36 GB which is obviously more or less double what would seem to be necessary.

Edit: I speculate that this is likely due to the inefficiency of the string concatenation operation needing to copy the entire huge string and append a chunk on to the end rather than being able to add to the end of the existing string. Unfortunately at the moment the .WRITE method isn't supported for varchar(max) variables.