Best sql questions in July 2012

SQL CONCAT - Funny characters - but I'm not laughing

19 votes

I am using this query to produce a blog archive menu, like so:

SELECT CONCAT( MONTHNAME(published) , ' ' , YEAR(published)) AS monthyear, 
COUNT(*) AS total
FROM blog_articles
WHERE status = 'Online' AND Year(published) = 2012
GROUP BY CONCAT( MONTHNAME(published) , ' ' , YEAR(published))

Expected result:

January 2012        103
February 2012       87
March 2012          23
April 2012          99

Actual result:

enter image description here

The count is correct but instead of the mental oriental, it should read:

May 2012       1

How can I get rid of the funny characters and have English instead?

When I run this query in phpMyAdmin, the characters are fine.

UPDATE

I have just tried running conn.Execute("SET lc_time_names = 'en_US';") and didn't solve the problem. To check, I used conn.Execute("SELECT @@lc_time_names;") and it returned "en_US".

ANOTHER UPDATE

Searching on MySQL forums for related issues, I found another guy whining about the same issue. A MySQL guru said:

"This is the famous (infamous) server returns wrong types for CONCAT of a number and a string. One has to use the (cast) form for this to work, the driver can't tell what the real type is."

I'm not sure what cast is or how to fix this issue.

Silly server, math is for numbers.

CAST() CAST(enum_col AS CHAR) - cast a value as a certain type

 SELECT CONCAT( CAST(MONTHNAME(published) as char) , ' ' , CAST(YEAR(published)) as char) AS monthyear, 
 COUNT(*) AS total
 FROM blog_articles
 WHERE status = 'Online' AND Year(published) = 2012
 GROUP BY CONCAT( MONTHNAME(published) , ' ' , YEAR(published))

DATE_FORMAT() You might also try DATE_FORMAT(YEAR(published), '%Y') function instead of CAST.

How to return default value from SQL query

11 votes

Is there any easy way to return single scalar or default value if query doesn't return any row?

At this moment I have something like this code example:

IF (EXISTS (SELECT * FROM Users WHERE Id = @UserId))  
    SELECT Name FROM Users WHERE Id = @UserId  
ELSE
    --default value
    SELECT 'John Doe'

How to do that in better way without using IF-ELSE?

Assuming the name is not nullable and that Id is unique so can match at most one row.

 SELECT ISNULL(MAX(Name),'John Doe')
 FROM Users 
 WHERE Id = @UserId  

SQL select group query

9 votes

Below is my Table

Table1

+--------+----------+---------+  
| amount | make     | product |  
+--------+----------+---------+  
|    100 | Nokia    | Mobiles |   
|    300 | Samesung | Mobiles |   
|    700 | Micromax | Mobiles |   
|   1000 | Karbonn  | Mobiles |   
|    500 | Lava     | Mobiles |   
|    100 | Floyer   | Gift    |   
|    500 | Arichies | Gift    |   
|    300 | Feeling  | Gift    |   
+--------+----------+---------+  

Now I want to display the two highest amount for each product...

So I want to build single SQL query which gives me result as below..

+--------+----------+---------+  
| amount | make     | product |  
+--------+----------+---------+  
|   1000 | Karbonn  | Mobiles |   
|    700 | Micromax | Mobiles |   
|    500 | Arichies | Gift    |   
|    300 | Feeling  | Gift    |   
+--------+----------+---------+  

Kindly help me to build such query..

You can use this solution to retrieve the "group-wise maximum" based on the amount:

SELECT a.*
FROM Table1 a
INNER JOIN Table1 b ON a.product = b.product AND a.amount <= b.amount
GROUP BY a.amount, a.product
HAVING COUNT(*) <= 2

Simply change the 2 to however many of the top rows you want to retrieve per product.

If you wanted to retrieve the lowest two rows per product, you can simply change the <= sign in the INNER JOIN to a >=.

You can fiddle around with this solution here: SQL-Fiddle Demo

Does order of where clauses matter in SQL

9 votes

Let's say I have a table called PEOPLE having 3 columns ID, LastName, FirstName, none of these columns are indexed.
LastName is more unique, and FirstName is less unique.

If I do 2 searches:

select * from PEOPLE where FirstName="F" and LastName="L" 
select * from PEOPLE where LastName="L" and FirstName="F"

My belief is the second one is faster because the more unique criterion (LastName) comes first in the where clause, and records will get eliminated more efficiently. I don't think the optimizer is smart enough to optimize the first sql.

Is my understanding correct?

No, that order doesn't matter (or at least: shouldn't matter).

Any decent query optimizer will look at all the parts of the WHERE clause and figure out the most efficient way to satisfy that query.

I know the SQL Server query optimizer will pick a suitable index - no matter which order you have your two conditions in. I assume other RDBMS will have similar strategies.

What does matter is whether or not you have a suitable index for this!

In the case of SQL Server, it will likely use an index if you have:

  • an index on (LastName, FirstName)
  • an index on (FirstName, LastName)
  • an index on just (LastName), or just (FirstName) (or both)

On the other hand - again for SQL Server - if you use SELECT * to grab all columns from a table, and the table is rather small, then there's a good chance the query optimizer will just do a table (or clustered index) scan instead of using an index (because the lookup into the full data page to get all other columns just gets too expensive very quickly).

Why are BOOLEAN type columns problematic in relational database design?

9 votes

I've been working mostly with Oracle for the past few years, and am quite used to seeing single character varchar columns used as boolean values.

I can also see (per stack overflow answers), that suggested type for MySQL is TINYINT.

Now I've taken on my little side project - using DerbyDB, and it supports BOOLEAN columns, but not until after version 10 or so.

So, the question is, why is it so hard to incorporate a BOOLEAN column while designing a relational database? Am I missing something, or is it just pushed down the to-do list as unimportant, since you can use another column type meanwhile?

Tom Kyte pretty much echoes your last sentence in this blog entry:

"It just isn't a type we have -- I can say no more and no less. ANSI doesn't have it -- many databases don't have it (we are certainly not alone). In the grand scheme of things -- I would say the priotization of this is pretty "low" (thats my opinion there)."

He's speaking from the Oracle perspective, but it applies to any relational RDBMS.

Recursive SQL statement (PostgreSQL 9.1.4)

9 votes

PostgreSQL 9.1

Business situation

Every month, there is a new batch of accounts given to a specific process. Every batch can be described by month, number of accounts and total balance of accounts. The goal of the process is to recover some of the balance back from customers. Each batch is than tracked separately on a monthly basis (amount recovered on each month since batch was transferred to the process).

Goal

My goal is to predict what amount will be recovered in the future.

Data definition

create table vintage_data (
    granularity date,       /* Month when account entered process*/
    distance_in_months integer, /* Distance in months from date when accounts entered process*/
    entry_accounts integer,     /* Number of accounts that entered process in a given month*/
    entry_amount numeric,       /* Total amount for account that entered process in a given month*/
    recovery_amount numeric     /* Amount recovered in Nth month on accounts that entered process in a given month */
);

Sample data

insert into vintage_data values('2012-01-31',1,200,100000,1000);
insert into vintage_data values('2012-01-31',2,200,100000,2000);
insert into vintage_data values('2012-01-31',3,200,100000,3000);
insert into vintage_data values('2012-01-31',4,200,100000,3500);
insert into vintage_data values('2012-01-31',5,200,100000,3400);
insert into vintage_data values('2012-01-31',6,200,100000,3300);
insert into vintage_data values('2012-02-28',1,250,150000,1200);
insert into vintage_data values('2012-02-28',2,250,150000,1600);
insert into vintage_data values('2012-02-28',3,250,150000,1800);
insert into vintage_data values('2012-02-28',4,250,150000,1200);
insert into vintage_data values('2012-02-28',5,250,150000,1600);
insert into vintage_data values('2012-03-31',1,200,90000,1300);
insert into vintage_data values('2012-03-31',2,200,90000,1200);
insert into vintage_data values('2012-03-31',3,200,90000,1400);
insert into vintage_data values('2012-03-31',4,200,90000,1000);
insert into vintage_data values('2012-04-30',1,300,180000,1600);
insert into vintage_data values('2012-04-30',2,300,180000,1500);
insert into vintage_data values('2012-04-30',3,300,180000,4000);
insert into vintage_data values('2012-05-31',1,400,225000,2200);
insert into vintage_data values('2012-05-31',2,400,225000,6000);
insert into vintage_data values('2012-06-30',1,100,60000,1000);

Calculation Process

You can imagine the data as a triangular matrix (X values are to be forecasted):

distance_in_months                       1      2     3       4      5      6
granularity entry_accounts  entry_amount
2012-01-31  200             100000       1000   2000   3000   3500   3400   3300
2012-02-28  250             150000       1200   1600   1800   1200   1600   (X-1)
2012-03-31  200              90000       1300   1200   1400   1000   (X0)   (X4)
2012-04-30  300             180000       1600   1500   4000   (X1)   (X5)   (X8)
2012-05-31  400             225000       2200   6000   (X2)   (X6)   (X9)   (X11)
2012-06-30  100              60000       1000   (X3)   (X7)   (X10)  (X12   (X13)

Algorithm

The goal I have is to forecast all the missing points (future). To illustrate the process, this is the calculation for the point X1

1) Get row totals for previous three months using distance up to 4:

2012-01-31  1000+2000+3000+3500=9500 (d4m3)
2012-02-28  1200+1600+1800+1200=5800 (d4m2)
2012-03-31  1300+1200+1400+1000=4900 (d4m1)

2) Get row totals for previous three months using distance up to 3:

2012-01-31  1000+2000+3000=6000 (d3m3)
2012-02-28  1200+1600+1800=4600 (d3m2)
2012-03-31  1300+1200+1400=3800 (d3m1)

3) Calculate weighted average running rate for distance 3 and distance 4 (weighted by entry_amount):

(d4m3+d4m2+d4m1)/(100000+150000+90000) = (9500+5800+4900)/(100000+150000+90000) = 20200/340000 = 0.0594
(d3m3+d3m2+d3m1)/(100000+150000+90000) = (6000+4600+3800)/(100000+150000+90000) = 14400/340000 = 0.0424

4) Calculate the change between distance 3 and distance 4

((d4m3+d4m2+d4m1)/(100000+150000+90000))/((d3m3+d3m2+d3m1)/(100000+150000+90000)) =
= (20200/340000)/(14400/340000) =
= 0.0594/0.0424 = 1.403 (PredictionRateForX1)

5) Calculate row totals for predicted month using distance up to 3:

2012-04-30  1600+1500+4000=7100

6) Calculate rate using entry_amount for predicted month

7100/180000 = 0.0394

7) Calculate rate predicted for X1

0.0394 * PredictionRateForX1 = 0.05534

8) Calculate amount for X1

(0.05534-0.0394)*180000 = 2869.2

Problem

The problem is how to calculate the rest of the matrix (from x-1 to x13) using SQL statement. It is obvious that this will require some sort of recursive algorithm.

It's a big task, split it up to make it more manageable. I would put that in a plpgsql function with RETURN TABLE:

  1. Create a temporary table for your "Calculation Process" matrix using a crosstab query You need the tablefunc module installed for that. Run (once per database):

    CREATE EXTENSION tablefunc;
    
  2. Update the temp table field by field.

  3. Return table.

The following demo is fully functional and tested with PostgreSQL 9.1.4. Building on the table definition provided in the question:

-- DROP FUNCTION f_forcast();

CREATE OR REPLACE FUNCTION f_forcast()
  RETURNS TABLE (
  granularity date
 ,entry_accounts numeric
 ,entry_amount numeric
 ,d1 numeric
 ,d2 numeric
 ,d3 numeric
 ,d4 numeric
 ,d5 numeric
 ,d6 numeric) AS
$BODY$
BEGIN

--== Create temp table with result of crosstab() ==--

CREATE TEMP TABLE matrix ON COMMIT DROP AS
SELECT *
FROM   crosstab (
        'SELECT granularity, entry_accounts, entry_amount
               ,distance_in_months, recovery_amount
         FROM   vintage_data
         ORDER  BY 1, 2',

        'SELECT DISTINCT distance_in_months
         FROM   vintage_data
         ORDER  BY 1')
AS tbl (
  granularity date
 ,entry_accounts numeric
 ,entry_amount numeric
 ,d1 numeric
 ,d2 numeric
 ,d3 numeric
 ,d4 numeric
 ,d5 numeric
 ,d6 numeric
 );

ANALYZE matrix; -- update statistics to help calculations


--== Calculations ==--

-- I implemented the first calculation for X1 and leave the rest to you.
-- Can probably be generalized in a loop or even a single statement.

UPDATE matrix m
SET    d4 = (
    SELECT (sum(x.d1) + sum(x.d2) + sum(x.d3) + sum(x.d4))
            /(sum(x.d1) + sum(x.d2) + sum(x.d3)) - 1
            -- removed redundant sum(entry_amount) from equation
    FROM  (
        SELECT *
        FROM   matrix a
        WHERE  a.granularity < m.granularity
        ORDER  BY a.granularity DESC
        LIMIT  3
        ) x
    ) * (m.d1 + m.d2 + m.d3)
WHERE m.granularity = '2012-04-30';

--- Next update X2 ..


--== Return results ==--

RETURN QUERY
TABLE  matrix
ORDER  BY 1;

END;
$BODY$ LANGUAGE plpgsql;

Call:

SELECT * FROM f_forcast();

I have simplified quite a bit, removing some redundant steps in the calculation.
The solution employs a variety of advanced techniques. You need to know your way around PostgreSQL to work with this.

Tricky SQL query involving consecutive values

8 votes

I need to perform a relatively easy to explain but (given my somewhat limited skills) hard to write SQL query.

Assume we have a table similar to this one:

 exam_no | name | surname | result | date
---------+------+---------+--------+------------
 1       | John | Doe     | PASS   | 2012-01-01
 1       | Ryan | Smith   | FAIL   | 2012-01-02 <--
 1       | Ann  | Evans   | PASS   | 2012-01-03
 1       | Mary | Lee     | FAIL   | 2012-01-04
 ...     | ...  | ...     | ...    | ...
 2       | John | Doe     | FAIL   | 2012-02-01 <--
 2       | Ryan | Smith   | FAIL   | 2012-02-02
 2       | Ann  | Evans   | FAIL   | 2012-02-03
 2       | Mary | Lee     | PASS   | 2012-02-04
 ...     | ...  | ...     | ...    | ...
 3       | John | Doe     | FAIL   | 2012-03-01
 3       | Ryan | Smith   | FAIL   | 2012-03-02
 3       | Ann  | Evans   | PASS   | 2012-03-03
 3       | Mary | Lee     | FAIL   | 2012-03-04 <--

Note that exam_no and date aren't necessarily related as one might expect from the kind of example I chose.

Now, the query that I need to do is as follows:

  • From the latest exam (exam_no = 3) find all the students that have failed (John Doe, Ryan Smith and Mary Lee).
  • For each of these students find the date of the first of the batch of consecutively failing exams. Another way to put it would be: for each of these students find the date of the first failing exam that comes after their last passing exam. (Look at the arrows in the table).

The resulting table should be something like this:

 name | surname | date_since_failing
------+---------+--------------------
 John | Doe     | 2012-02-01
 Ryan | Smith   | 2012-01-02
 Mary | Lee     | 2012-03-04

How can I perform such a query?

Thank you for your time.

You can take advantage of the fact that if someone passed the most recent exam, then they have not failed any exams since their most recent pass: therefore the problem reduces to finding the first exam failed since the most recent pass:

SELECT   name, surname, MIN(date) date_since_fail
FROM     results NATURAL LEFT JOIN (
  SELECT   name, surname, MAX(date) lastpass
  FROM     results
  WHERE    result = 'PASS'
  GROUP BY name, surname
) t
WHERE    result = 'FAIL' AND date > IFNULL(lastpass,0)
GROUP BY name, surname

See it on sqlfiddle.

SQL update where in set of data

8 votes
+------------------+
| id1 | id2 | bool |
+------------------+
|  1  |  1  |  F   |
|  1  |  2  |  F   |
|  2  |  1  |  F   |
+------------------+

UPDATE table_name
SET bool = T
WHERE (id1, id2) IN ((1,1),(2,1)) --Need work here

So basically I want to select where the conditions of (id1, id2) = (value1, value2). Similar to the statement below:

WHERE id1 = value1 AND id2 = value2

however in set of values in an array. Is this possible?

Thanks in advance

EDIT: I'm using SQL server 2008. I'm sorry if it wasn't too clear. I'm trying to put this as a stored procedure and call it from a service. The input would be some sort of an array (variable size), and find a match with the two IDs in a row.

Here is the way to do it in MSSql. All you need is to make one value (in this example VARCHAR) from Id1 and Id2. In this case you can use IN statement with the values set. Also you should think about NULLs in id1 and id2 if they are allowed in these fields (just add: and id1 is not null and id2 is not null).

UPDATE table_name
SET bool = T
WHERE convert(varchar(20),id1)+','+convert(varchar(20),id2) in ('1,1','2,1')

MySQL: Display one random result which a user has not voted on

7 votes

I need help on displaying one random result in which the current user has not voted on.

Currently my database setup and the last query I have tried can be found on http://sqlfiddle.com/#!2/2f91b/1

Basically I can isolate each individual item using this query:

SELECT a.img_url, a.item_id, a.user_id, a.img_status, b.item_question, c.user_name, c.user_fbid, d.voter_id, count(d.img_id) AS totalVotes
FROM os_photos a 
LEFT JOIN os_items b ON a.item_id = b.item_id
LEFT JOIN os_users c ON a.user_id = c.user_id
LEFT JOIN os_votes d ON a.img_id = d.img_id
GROUP BY a.img_id
ORDER BY RAND()
LIMIT 1

My Problem is: With the SQL knowledge that I have, I am unable to isolate the results to show only the rows in which user #2 has not voted on. I realize the problem is when I use group by, it combines the voter_id and therefore I am unable to check if user #2 has had any input for the item.

Example:

Item #  |  voter_id
1       |      2
1       |      3
2       |      2
3       |      1
3       |      4
4       |      3
4       |      1
5       |      1
5       |      2

With the above sample set, the resulting item should be either item #3, #4 or any other items which have not been voted on.

Your help, advise and knowledge is greatly appreciated.

To get the items that dont exist you need a LEFT JOIN with condition that would otherwise make a positive match, and then add a WHERE clause matching one of the resulting columns to NULL:

SELECT a.img_url, a.item_id, a.user_id, a.img_status, b.item_question, c.user_name,c.user_fbid, d.voter_id, count(d.img_id) AS totalVotes
FROM os_photos a 
LEFT JOIN os_items b ON a.item_id = b.item_id
LEFT JOIN os_users c ON a.user_id = c.user_id
LEFT JOIN os_votes d ON a.img_id = d.img_id
LEFT JOIN os_votes d2 ON a.img_id = d2.img_id AND d2.voter_id=2
WHERE d2.voter_id IS NULL
GROUP BY a.img_id
ORDER BY RAND()
LIMIT 1

How can I supply a List<int> to a SQL parameter?

7 votes

I have a SQL statement like the following:

...
const string sql = @"UPDATE PLATYPUS
SET DUCKBILLID = :NEWDUCKBILLID
WHERE PLATYPUSID IN (:ListOfInts)";
...
ocmd.Parameters.Add("ListOfInts", ??WhatNow??);

How can I provide the comma separated list of ints, which could be any (reasonable*) number of values

  • By "reasonable" in this case I mean between one and a couple dozen.

Try something like this:

ocmd.Parameters.Add("ListOfInts", String.Join(",",myList.ToArray());

The String.Join method takes all elements in an array and places the specified character between them.

Sorting Username

7 votes

How to sort Username correctly?

For example I do this query:

SELECT * FROM  `members` WHERE username LIKE  'bx%' ORDER BY username ASC 

Result:

bx1
bx10
bx11
bx12
bx13
bx14
bx15
bx16
bx17
bx18
bx19
bx2
bx20
bx21
bx3
bx4
bx5

I want to return like this:

bx1
bx2
bx3
bx4
bx5
...
bx15
bx16

and so on

SELECT *
FROM `members`
WHERE username LIKE 'bx%'
ORDER BY LENGTH(username), username

The fact that you need to do this tells me your schema is denormalized. If possible, store the integer part of the username in a separate column if you need to do operations on it.

SQL Fiddle Example

How to test for a series of equal X paired rows in a row referring to a specific column?

6 votes

I'm using SQL Server 2008 R2.

Consider this table @t (TOP 20 of ORDER BY PK DESC):

PK  SK  VC  APP     M   C
==  ==  ==  ====    ==  ==================
21  7   79  NULL    0   NULL
20  9   74  1       3   20=14, 18=13, 15=2
19  6   79  1       2   19=11, 17=7
18  9   77  1       0   NULL
17  6   74  1       0   NULL
16  7   79  1       0   NULL
15  9   74  1       0   NULL
14  9   74  1       0   NULL
13  9   77  1       0   NULL
12  7   77  1       0   NULL
11  6   79  1       0   NULL
10  7   79  1       0   NULL
9   7   74  1       0   NULL
8   7   79  1       0   NULL
7   6   74  1       0   NULL
6   6   74  1       0   NULL
5   7   79  1       0   NULL
4   7   77  1       0   NULL
3   6   79  1       0   NULL
2   9   74  1       0   NULL

Created with this:

DECLARE @t TABLE(PK INT NOT NULL IDENTITY(1,1) PRIMARY KEY CLUSTERED, SK INT NOT NULL, VC INT NULL,  APP INT NULL, M INT NOT NULL,  C NVARCHAR(111) NULL);

INSERT @t (SK,VC,APP,M,C) VALUES
(7,77,1,0,NULL),
(9,74,1,0,NULL),
(6,79,1,0,NULL),
(7,77,1,0,NULL),
(7,79,1,0,NULL),
(6,74,1,0,NULL),
(6,74,1,0,NULL),
(7,79,1,0,NULL),
(7,74,1,0,NULL),
(7,79,1,0,NULL),
(6,79,1,0,NULL),
(7,77,1,0,NULL),
(9,77,1,0,NULL),
(9,74,1,0,NULL),
(9,74,1,0,NULL),
(7,79,1,0,NULL),
(6,74,1,0,NULL),
(9,77,1,0,NULL),
(6,79,1,2,'19=11, 17=7'),
(9,74,1,3,'20=14, 18=13, 15=2'),
(7,79,NULL,0,NULL)

My task is to return true for a match if the latest row (where APP IS NOT NULL) complete a series of X matching pairs or latest rows of the same group (same current SK).

For example, when testing for only 2 pairs, given the current required test is on SK=6, as soon as getting to PK = 19 there is a match.

The match is VC(19)=VC(11)=79 AND VC(17)=VC(7)=74

See that by executing the following:

DECLARE @PairsToTest int = 2
DECLARE @SK int = 6
SELECT 
    TOP (2*@PairsToTest) 
    * 
    FROM @t 
    WHERE 
            APP IS NOT NULL 
        AND SK = @SK 
    ORDER BY SK, PK DESC

results:

PK  SK  VC  APP M   C
19  6   79  1   2   19=11, 17=7
17  6   74  1   0   NULL
11  6   79  1   0   NULL
7   6   74  1   0   NULL

Another example:

When testing for 3 pairs, a match is found on PK=20 when looking in SK=9 (Although it is interesting question by itself, for my task there is no need to test for all SKs. A result for a given SK is sufficient for me.

To see the match execute this:

DECLARE @PairsToTest int = 3
DECLARE @SK int = 9
SELECT 
    TOP (2*@PairsToTest) 
    * 
    FROM @t 
    WHERE 
            APP IS NOT NULL 
        AND SK = @SK 
    ORDER BY SK, PK DESC

which results:

PK  SK  VC  APP M   C
20  9   74  1   3   20=14, 18=13, 15=2
18  9   77  1   0   NULL
15  9   74  1   0   NULL
14  9   74  1   0   NULL
13  9   77  1   0   NULL
2   9   74  1   0   NULL

as you can see: VC(20)=VC(14)=74, VC(18)=VC(13)=74 and VC(15)=VC(2)

I thought of selecting the required sets of rows in the correct order, and count the equal rows in VC. If the count is the same as the @PairsToTest this is the sign to raise a flag.

I tried:

DECLARE @PairsToTest int = 3
DECLARE @SK int = 9
;with t0 as
(
SELECT 
    TOP (2*@PairsToTest) 
    * 
    FROM @t 
    WHERE 
            APP IS NOT NULL 
        AND SK = @SK 
    ORDER BY SK, PK DESC
),
t1 AS
(
SELECT TOP (@PairsToTest) * FROM t0
),
t2 AS
(
SELECT TOP (@PairsToTest) * FROM t0 ORDER BY PK ASC 
)
,t3 AS
(
SELECT TOP 99999999 * FROM t2 ORDER BY PK DESC
)

IF (SELECT COUNT(*) FROM t1 LEFT OUTER JOIN t3 ON t1.VC = t3.VC) = @PairsToTest 
    SELECT 1
ELSE
    SELECT 0

but there are too may flaws in this:

  1. VC does not contain unique data (only by chance)
  2. the IF is not allowed
  3. I should get rid of the TOP 99999999 in t3 (although I can live with that)

What are the required changes I should take in order to solve this?

DECLARE @t TABLE(PK INT NOT NULL IDENTITY(1,1) PRIMARY KEY CLUSTERED, SK INT NOT NULL, VC INT NULL,  APP INT NULL, M INT NOT NULL,  C NVARCHAR(111) NULL);

INSERT @t (SK,VC,APP,M,C) VALUES
(7,77,1,0,NULL),
(9,74,1,0,NULL),
(6,79,1,0,NULL),
(7,77,1,0,NULL),
(7,79,1,0,NULL),
(6,74,1,0,NULL),
(6,74,1,0,NULL),
(7,79,1,0,NULL),
(7,74,1,0,NULL),
(7,79,1,0,NULL),
(6,79,1,0,NULL),
(7,77,1,0,NULL),
(9,77,1,0,NULL),
(9,74,1,0,NULL),
(9,74,1,0,NULL),
(7,79,1,0,NULL),
(6,74,1,0,NULL),
(9,77,1,0,NULL),
(6,79,1,2,'19=11, 17=7'),
(9,74,1,3,'20=14, 18=13, 15=2'),
(7,79,NULL,0,NULL)


DECLARE @PairsToTest int = 3
DECLARE @SK int = 9

IF ((SELECT COUNT(*) FROM @t WHERE APP IS NOT NULL AND SK = @SK)-@PairsToTest) >=0
    BEGIN
        DECLARE @swapData  TABLE(PK1 INT NOT NULL IDENTITY(1,1) PRIMARY KEY CLUSTERED, PK INT NOT NULL, SK INT NOT NULL, VC INT NULL,  APP INT NULL, M INT NOT NULL,  C NVARCHAR(111) NULL);
        DECLARE @olderData TABLE(PK2 INT NOT NULL IDENTITY(1,1) PRIMARY KEY CLUSTERED, PK INT NOT NULL, SK INT NOT NULL, VC2 INT NULL,  APP INT NULL, M INT NOT NULL,  C NVARCHAR(111) NULL);
        DECLARE @newerData TABLE(PK3 INT NOT NULL IDENTITY(1,1) PRIMARY KEY CLUSTERED, PK INT NOT NULL, SK INT NOT NULL, VC3 INT NULL,  APP INT NULL, M INT NOT NULL,  C NVARCHAR(111) NULL);


        INSERT @swapData  SELECT TOP ((SELECT COUNT(*) FROM @t WHERE APP IS NOT NULL AND SK = @SK)-@PairsToTest) * FROM @t WHERE APP IS NOT NULL AND SK = @SK ORDER BY PK
        INSERT @olderData SELECT TOP (@PairsToTest) PK,SK,VC,APP,M,C FROM @swapData ORDER BY PK1 DESC
        INSERT @newerData SELECT TOP (@PairsToTest) * FROM @t WHERE APP IS NOT NULL AND SK = @SK ORDER BY SK, PK DESC



        DECLARE @Matches int = (SELECT COUNT(*)FROM @newerData INNER JOIN @olderData ON PK2 = PK3 WHERE VC2=VC3)

        IF @Matches = @PairsToTest 
            SELECT 1 AS Match 
        ELSE 
            SELECT 0 AS Match
    END
ELSE
    SELECT 0 AS Match

/*
SELECT TOP (2*@PairsToTest) * FROM @t WHERE APP IS NOT NULL AND SK = @SK ORDER BY SK, PK DESC
SELECT * FROM @olderData
SELECT * FROM @newerData
*/

Can Indices actually decrease SELECT performance?

6 votes

Possible Duplicate:
Degraded performance of a query after adding Index

after reading some stuff about indices on SQL Server and their performance advantages for selects and disadvantages for updates / inserts, i was wondering if badly used indices could actually also hurt performance for selects. What conditions would have to be fulfilled to have an index decrease performance of a pure select query? Do such situations exist?

Thanks!

(although I always try to include code examples, i can't think of anything that would support this question...)

Yes, albeit very slightly - so slightly that it would be justified to also answer "No".

If you have an index which might be considered for a query, but is not useable, the optimizer will waste a short time pondering whether and how to use it (in rare cases with REALLY complicated indexes and views, and more frequently when index performance hints are wrong, you might end up choosing a suboptimal query plan).

Some cases would be:

  • a table without indexes
  • a table with a badly chosen index, which get discarded
  • a table where TWO indexes exist, and for some reason (e.g. obsolete statistics), the existence of the second index makes the optimizer choose it, while it would have been more convenient to use the first.

In the first two cases the query time is the same (and entails a full scan), but in the second, you also have to analyze and discard the index.

Where an index hurts you - where ALL indexes hurt you - is in inserts, deletes and updates. Then, any index not used by the update query, yet affected by same, will require a write to the index itself.

So you will want to have indexes, but as few as you can without sacrificing SELECT performances. Actually, you might decide against indexing for a rarely used SELECT query in order to avoid having the needed index constantly updated by all other UPDATE queries.

Edit: after reading Heinzi's answer, I'd also like to add that most DB servers have maintenance tools which analyze the tables and indexes (and sometimes query performance counters too), and properly update the hints of which Heinzi spoke. So it's also important to periodically "maintain" the database to keep the optimizer supplied with up-to-date information on which indexes to choose from.

Check for x consecutive days - given timestamps in database

6 votes

Could anybody give me an idea or hint how you could check for X consecutive days in a database table (MySQL) where logins (user id, timestamp) are stored?

Stackoverflow does it (e.g. badges like Enthusiast - if you log in for 30 consecutive days or so...). What functions would you have to use or what is the idea of how to do it?

Something like SELECT 1 FROM login_dates WHERE ...?

You can accomplish this using a shifted self-outer-join in conjunction with a variable. See this solution:

SELECT IF(COUNT(1) > 0, 1, 0) AS has_consec
FROM
(
    SELECT *
    FROM
    (
        SELECT IF(b.login_date IS NULL, @val:=@val+1, @val) AS consec_set
        FROM tbl a
        CROSS JOIN (SELECT @val:=0) var_init
        LEFT JOIN tbl b ON 
            a.user_id = b.user_id AND
            a.login_date = b.login_date + INTERVAL 1 DAY
        WHERE a.user_id = 1
    ) a
    GROUP BY a.consec_set
    HAVING COUNT(1) >= 30
) a

This will return either a 1 or a 0 based on if a user has logged in for 30 consecutive days or more at ANYTIME in the past.

The brunt of this query is really in the first subselect. Let's take a closer look so we can better understand how this works:

With the following example data set:

CREATE TABLE tbl (
  user_id INT,
  login_date DATE
);

INSERT INTO tbl VALUES
(1, '2012-04-01'),  (2, '2012-04-02'),
(1, '2012-04-25'),  (2, '2012-04-03'),
(1, '2012-05-03'),  (2, '2012-04-04'),
(1, '2012-05-04'),  (2, '2012-05-04'),
(1, '2012-05-05'),  (2, '2012-05-06'),
(1, '2012-05-06'),  (2, '2012-05-08'),
(1, '2012-05-07'),  (2, '2012-05-09'),
(1, '2012-05-09'),  (2, '2012-05-11'),
(1, '2012-05-10'),  (2, '2012-05-17'),
(1, '2012-05-11'),  (2, '2012-05-18'),
(1, '2012-05-12'),  (2, '2012-05-19'),
(1, '2012-05-16'),  (2, '2012-05-20'),
(1, '2012-05-19'),  (2, '2012-05-21'),
(1, '2012-05-20'),  (2, '2012-05-22'),
(1, '2012-05-21'),  (2, '2012-05-25'),
(1, '2012-05-22'),  (2, '2012-05-26'),
(1, '2012-05-25'),  (2, '2012-05-27'),
                    (2, '2012-05-28'),
                    (2, '2012-05-29'),
                    (2, '2012-05-30'),
                    (2, '2012-05-31'),
                    (2, '2012-06-01'),
                    (2, '2012-06-02');

This query:

SELECT a.*, b.*, IF(b.login_date IS NULL, @val:=@val+1, @val) AS consec_set
FROM tbl a
CROSS JOIN (SELECT @val:=0) var_init
LEFT JOIN tbl b ON 
    a.user_id = b.user_id AND
    a.login_date = b.login_date + INTERVAL 1 DAY
WHERE a.user_id = 1

Will produce:

Example Result

As you can see, what we are doing is shifting the joined table by +1 day. For each day that is not consecutive with the prior day, a NULL value is generated by the LEFT JOIN.

Now that we know where the non-consecutive days are, we can use a variable to differentiate each set of consecutive days by detecting whether or not the shifted table's rows are NULL. If they are NULL, the days are not consecutive, so just increment the variable. If they are NOT NULL, then don't increment the variable:

Result With Highlighted Groupings

After we've differentiated each set of consecutive days with the incrementing variable, it's then just a simple matter of grouping by each "set" (as defined in the consec_set column) and using HAVING to filter out any set that has less than the specified consecutive days (30 in your example):

Result With Grouping By The consec_set column

Then finally, we wrap THAT query and simply count the number of sets that had 30 or more consecutive days. If there was one or more of these sets, then return 1, otherwise return 0.


See a SQLFiddle step-by-step demo

t-SQL to update table to remove overlapping time frames

6 votes

I was wondering if someone could help me with this SQL statement?

Say, I have an SQL Server 2008 table like this:

id -- INT PRIMARY KEY
dtIn -- DATETIME2
dtOut -- DATETIME2
type -- INT

id  dtIn    dtOut    type
1   05:00   10:00    1
2   08:00   16:00    2
3   02:00   08:00    1
4   07:30   11:00    1
5   07:00   12:00    2

I need to remove any time overlaps in the table above. This can be illustrated with this diagram: enter image description here

So I came up with this SQL:

UPDATE [table] AS t 
SET dtOut = (SELECT MIN(dtIn) FROM [table] WHERE type = t.type AND t.dtIn >= dtIn AND t.dtIn < dtOut) 
WHERE type = t.type AND t.dtIn >= dtIn AND t.dtIn < dtOut

But it doesn't work. Any idea what am I doing wrong here?

*EDIT*

OK, it took me awhile to get to this. Seems to be a working SQL for what I need it for:

--BEGIN TRANSACTION;

--delete identical dtIn
DELETE dT1
FROM tbl dT1
WHERE EXISTS
(
    SELECT *
    FROM tbl dT2
    WHERE dT1.Type = dT2.Type
    AND dT1.dtIn = dT2.dtIn
    AND (
            dT1.dtOut < dT2.dtOut
            OR (dT1.dtOut = dT2.dtOut AND dT1.id < dT2.id)
        )
);

--adjust dtOuts to the max dates for overlapping section
UPDATE tbl
SET dtOut = COALESCE((
    SELECT MAX(dtOut)
    FROM tbl as t1
    WHERE t1.type = tbl.type
    AND t1.dtIn < tbl.dtOut 
AND t1.dtOut > tbl.dtIn
    ), dtOut);

-- Do the actual updates of dtOut
UPDATE tbl
SET dtOut = COALESCE((
    SELECT MIN(dtIn)
    FROM tbl as t2
    WHERE t2.type = tbl.type AND
          t2.id <> tbl.id AND
          t2.dtIn >= tbl.dtIn AND t2.dtIn < tbl.dtOut
    ), dtOut);

--COMMIT TRANSACTION;

Just off the top of my head I believe that one of Joe Celko's books had this as an example problem. You might find the excerpt available on Google.

This might be closer. I think you weren't really doing the subquery the right way.

UPDATE table
SET dtOut = (
    SELECT MIN(t2.dtIn)
    FROM [table] as t2
    WHERE t2.id <> table.id AND t2.type = table.type
        AND table.dtIn < t2.dtIn AND t2.dtIn < table.dtOut
        AND table.dtOut <= t2.dtOut
    )
WHERE EXISTS (
    SELECT 1
    FROM [table] as t3
    WHERE
            t3.type = table.type
        AND t3.id <> table.id
        AND table.dtIn < t3.dtIn AND t3.dtIn < table.dtOut
        AND table.dtOut <= t3.dtOut
    )

EDIT I overlooked the id column at the top of the page so obviously that's a better check than making sure the endpoints don't match up. The solution is probably easier if you can assume that no two rows of identical type have the dtIn.

Btw, there's no reason to use a CROSS APPLY when a subquery will do exactly the same job.

EDIT 2 I did some quick testing and I think my query handles the scenario in your diagram. There's one case where it might not do what you want.

For a given type, think about the final two segments S1 and S2 in order of start time. S2 starts after S1 but also imagine it ends before S1 does. S2 is fully contained in the interval of S1, so it's either ignorable or the information for the two segments needs to be split into a third segment and that's where the problem gets trickier.

So this solution just assumes they can be ignored.


EDIT 3 based on the comment about combining updates

SQLFiddle posted by OP

-- eliminate redundant rows
DELETE dT1 /* FROM tbl dT1 -- unnecessary */
WHERE EXISTS
(
    SELECT *
    FROM tbl dT2
    WHERE dT1.Type = dT2.Type AND dT1.dtIn = dT2.dtIn
    AND (
        dT1.dtOut < dT2.dtOut
        OR (dT1.dtOut = dT2.dtOut AND dT1.id < dT2.id)
    )
);

--adjust dtOuts to the max dates
UPDATE tbl
SET dtOut = COALESCE((
    SELECT MAX(dtOut)
    FROM tbl as t1
    WHERE t1.type = tbl.type
    ), dtOut);

-- Do the actual updates of dtOut
UPDATE tbl
SET dtOut = COALESCE((
    SELECT MIN(dtIn)
    FROM tbl as t2
    WHERE t2.type = tbl.type AND
          t2.id <> tbl.id AND
          t2.dtIn >= tbl.dtIn AND t2.dtIn < tbl.dtOut
    ), dtOut);

Either one of the two updates below should replace the two updates above.

UPDATE tbl
SET dtOut = (
    SELECT
        COALESCE(
            MIN(dtIn),
            /* as long as there's no GROUP BY, there's always one row */
            (SELECT MAX(dtOut) FROM tbl as tmax WHERE tmax.type = tbl.type)
        )
    FROM tbl as tmin
    WHERE tmin.type = tbl.type

        AND tmin.dtIn > tbl.dtIn
        /*  
        regarding the original condition in the second update:
            t2.dtIn >= tbl.dtIn AND t2.dtIn < tbl.dtOut

        dtIns can't be equal because you already deleted those
        and if dtIn was guaranteed to be less than dtOut it's
        also automatically always less than max(dtOut)
        */
);

UPDATE tbl
SET dtOut = COALESCE(
  (
    SELECT MIN(dtIn) FROM tbl as tmin
    WHERE tmin.type = tbl.type AND tmin.dtIn > tbl.dtIn
  ),
  (        
    SELECT MAX(dtOut) FROM tbl as tmax
    WHERE tmax.type = tbl.type
  )
);

INNER JOIN vs INNER JOIN (SELECT . FROM)

6 votes

Is there any difference in terms of performance between these two versions of the same query?

--Version 1
SELECT p.Name, s.OrderQty
FROM Product p
INNER JOIN SalesOrderDetail s on p.ProductID = s.ProductID

--Version 2
SELECT p.Name, s.OrderQty
FROM Product p
INNER JOIN (SELECT ProductID, OrderQty FROM SalesOrderDetail) s on p.ProductID = s.ProductID

I've heard it said (by a DBA) that Version 2 is faster because it fetches, within the inner SELECT statement, only the columns that are required for the query. But that doesn't seem to make sense, since query performance (as I know) is based on number of rows affected and final list of columns returned.

The query plans for both are identical, so I'm guessing there isn't any difference between the two.

Am I correct?

You are correct. You did exactly the right thing, checking the query plan rather than trying to second-guess the optimiser. :-)

Finding subfolders of a file path from table of file paths using SQL Server "LIKE" function

5 votes

A column in one my SQL Server database tables contains paths (not file paths, just the directory path).

I currently have a query which returns all sub-folders of a given file path, using the LIKE function;

WHERE Path LIKE @FilePath + '%'

On a side note, I am aware that there is an issue with this approach as the escape characters used by LIKE are all valid file path characters. I'm going to make proper use of ESCAPE in the finished query.

Anyway, back to my question. I would like to be able to modify the query so that it returns only direct sub-folders of the specified path. This isn't as easy as just putting a trailing slash on the end of the string, as obviously the % wildcard gobbles everything up as it parses the right portion of the string.

I was looking at using the [^] escape sequence, to perhaps do a search which excludes the \ character, but I've had no luck with that. I tried this;

WHERE Path LIKE @FilePath + '[^\]'

But it doesn't produce the intended result, so I'm not even sure if I'm using it correctly.

Do not use LIKE for this. LEFT and SUBSTRING will give better performance as you are not doing any wildcard searches, which you should try and avoid in all instances.

The query below will do this and exclude an exact match, so if there is a directory 'C:\Temp\ it will not show up in the results, but all the sub directories will.

What we are doing is throwing away all results that possibly have more sub directories ('\') and just using the ones with a single sub directory.

Hope this helps!

DECLARE @directory VARCHAR(150);
SET @directory = 'C:\Temp\';

SELECT 
    * 
FROM 
    tblDir
WHERE 
    LEFT(Dir, LEN(@directory)) = @directory
    AND CHARINDEX('\', SUBSTRING(Dir, LEN(@directory) ,LEN(Dir) - LEN(@directory)), 2) = 0
    AND Dir <> @directory

SQL Order By Clause Execution

5 votes

This question isn't about order of executions. It's about just the ORDER BY.

In standard execution is:

  • FROM
  • WHERE
  • GROUP BY
  • HAVING
  • SELECT
  • ORDER BY
  • TOP

EDIT: This question has been more or less the issue of "Does SQL Server apply short circuit evaluation when executing ORDER BY expressions?" The answer is SOMETIMES! I just haven't found a reasonable reason as to why. See Edit #4.

Now supposed I have a statement like this:

DECLARE @dt18YearsAgo AS DATETIME = DATEADD(YEAR,-18,GETDATE());
SELECT
  Customers.Name
FROM
  Customers
WHERE
  Customers.DateOfBirth > @dt18YearsAgo
ORDER BY
  Contacts.LastName ASC, --STATEMENT1
  Contacts.FirstName ASC, --STATEMENT2
  (
   SELECT 
     MAX(PurchaseDateTime) 
   FROM 
     Purchases 
   WHERE 
     Purchases.CustomerID = Customers.CustomerID
  ) DESC --STATEMENT3

This isn't the real statement I'm trying to execute, but just an example. There are three ORDER BY statements. The third statement is only used for rare cases where the last name and first name match. My question is, if there are no duplicate last names, does SQL Server not execute ORDER BY statements #2 and #3? And, logically, if there are no duplicate last name and first name, does SQL Server note execute statement #3.

This is really for optimization. Reading from the Purchases table should only be a last resort. In the case of my application, it wouldn't be efficient to read every single "PurchaseDateTime" from "Purchases" grouping by "CustomerID".

Please keep the answer related to my question and not a suggestion like building an index for CustomerID, PurchaseDateTime in Purchases. The real question is, does SQL Server skip unnecessary ORDER BY statements?

Edit: Apparently, SQL Server will always execute every statement as long as there is one row. Even with one row, this will give you a divide by zero error:

DECLARE @dt18YearsAgo AS DATETIME = DATEADD(YEAR,-18,GETDATE());
SELECT
  Customers.Name
FROM
  Customers
WHERE
  Customers.DateOfBirth > @dt18YearsAgo
ORDER BY
  Contacts.LastName ASC, --STATEMENT1
  Contacts.FirstName ASC, --STATEMENT2
  1/(Contacts.ContactID - Contacts.ContactID) --STATEMENT3

Edit2: Apparently, this doesn't give divide by zero:

DECLARE @dt18YearsAgo AS DATETIME = DATEADD(YEAR,-18,GETDATE());
SELECT
  Customers.Name
FROM
  Customers
WHERE
  Customers.DateOfBirth > @dt18YearsAgo
ORDER BY
  Contacts.LastName ASC, --STATEMENT1
  Contacts.FirstName ASC, --STATEMENT2
  CASE WHEN 1=0 
    THEN Contacts.ContactID 
    ELSE 1/(Contacts.ContactID - Contacts.ContactID)
  END --STATEMENT3

Well, the original answer to my question is YES, it does execute, but what's nice is that I can stop execute with a proper CASE WHEN

Edit3: We can stop execution of an ORDER BY statement with a proper CASE WHEN. The trick, I guess, is to figure out how to use it properly. CASE WHEN will give me what I want, which a short circuit execution in an ORDER BY statement. I compared the Execution Plan in SSMS and depending on the CASE WHEN statement, the Purchases table isn't scanned at all EVEN THOUGH it's a clearly visible SELECT/FROM statement:

DECLARE @dt18YearsAgo AS DATETIME = DATEADD(YEAR,-18,GETDATE());
SELECT
  Customers.Name
FROM
  Customers
WHERE
  Customers.DateOfBirth > @dt18YearsAgo
ORDER BY
  Contacts.LastName ASC, --STATEMENT1
  Contacts.FirstName ASC, --STATEMENT2
  CASE WHEN 1=0 
    THEN
    (
     SELECT 
       MAX(PurchaseDateTime) 
     FROM 
       Purchases 
     WHERE 
       Purchases.CustomerID = Customers.CustomerID
    )
    ELSE Customers.DateOfBirth 
  END DESC

Edit4: Now I'm completely confused. Here's an example by @Lieven

WITH Test (name, ID) AS 
(SELECT 'Lieven1', 1 UNION ALL SELECT 'Lieven2', 2) 

SELECT * FROM Test ORDER BY name, 1/ (ID - ID) 

This yields no divide by zero, which means SQL Server does in fact, do short circuit evaluation on SOME tables, specifically those created with the WITH command.

Trying this with a TABLE variable:

DECLARE @Test TABLE
(
    NAME nvarchar(30),
    ID int
);
INSERT INTO @Test (Name,ID) VALUES('Lieven1',1);
INSERT INTO @Test (Name,ID) VALUES('Lieven2',2);
SELECT * FROM @Test ORDER BY name, 1/ (ID - ID)

will yield a divide by zero error.

First of all what you are calling "Statements" are no such thing. They are sub-clauses of the ORDER BY (major) clause. The difference is important, because "Statement" implies something separable, ordered and procedural, and SQL sub-clauses are none of those things.

Specifically, SQL sub-clauses (that is, the individual items of a SQL major clause (SELECT, FROM, WHERE, ORDER BY, etc.)) have no implicit (nor explicit) execution order of their own. SQL will re-order them in anyway that it finds convenient and will almost always execute all of them if it execute any of them. In short, SQL Server does not do that kind of "short-circuit" optimizations because they are trivially effective and seriously get in the way of the very different kind of optimizations that it does do (i.e., Statistical Data Access/Operator Optimizations).

So the correct answer to your original question (which you should not have changed) is NO, not reliably. You cannot rely on SQL Server to not use some sub-clause of the ORDER BY, simply because it looks like it does not need to.

The only common exception to this is that the CASE function can (in most circumstances) be used to short-circuit execution paths (within the CASE function though, not outside of it), but only because it is specifically designed for this. I cannot think of anything else in SQL that you can rely on to act like this.

Need some clarification on SELECT within UPDATE t-SQL statement

5 votes

Say, if I have the following t-SQL statement (designed to run on SQL Server 2008):

UPDATE tbl
SET col1 = (
    SELECT MAX(col1) FROM tbl AS t1 WHERE t1.type = tbl.type
);

How exactly does SELECT work in this case:

  1. It selects taking into account the results of each UPDATE, or

  2. It selects from the tbl as it was before the UPDATE began updating records.

Can someone clarify this for me?

2! Your subquery SELECT pulls the value as determined before the UPDATE makes any changes.

Matching a string against a column full of regular expressions

5 votes

I have a column in a table that is full of regular expressions.

I have a string in my code, and I want to find out which regular expressions in that column would match that string and return those rows.

Aside from pulling each row and matching the regular expression (which is costly, checking against potentially thousands of records for a single page load) is there a way I can do this in SQL instead with one (or a couple) queries?

Example input: W12ABC

Example column data

1   ^[W11][\w]+$  
2   ^[W12][\w]+$  
3   ^[W13][\w]+$ 
4   ^[W1][\w]+[A][\w]+$  
5   ^[W1][\w]+[B][\w]+$  
6   ^[W1][\w]+[C][\w]+$  

Should return rows 2 and 4.

If you're open to wildcards you can store patterns that work for LIKE syntax: "W12%" or "W%ABC". Your query would look like this:

SELECT * FROM Table Where 'W12ABC' LIKE Column