Best SQL questions in September 2010

Is email address a bad primary key?

73 votes

Is an email address a bad candidate for a primary key when compared to auto-incrementing numbers? Our web application needs the email address to be unique in the system, so I thought of using the email address as the primary key. But my colleague suggests that string comparison will be slower than integer comparison. Is that a valid reason not to use email addresses as the primary key?

We are using Postgres. Thanks.

String comparison is slower than int comparison. However, this does not matter if you simply retrieve a user from the database using the e-mail address. It does matter if you have complex queries with multiple joins.

If you store information about users in multiple tables, the foreign keys to the users table will be the e-mail address. That means that you store the e-mail address multiple times.
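A common middle ground is to keep an integer surrogate key for joins and foreign keys while still letting the database enforce uniqueness of the e-mail address. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for Postgres, and the table and column names (users, orders, user_id) are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id    INTEGER PRIMARY KEY,   -- surrogate key, used by joins and foreign keys
        email TEXT NOT NULL UNIQUE   -- uniqueness is still enforced by the database
    );
    CREATE TABLE orders (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id)
    );
""")

def add_user(email):
    """Insert a user; return the new id, or None if the address is already taken."""
    try:
        return conn.execute("INSERT INTO users (email) VALUES (?)", (email,)).lastrowid
    except sqlite3.IntegrityError:
        return None

first_id = add_user("a@example.com")
duplicate_id = add_user("a@example.com")   # rejected by the UNIQUE constraint
conn.execute("INSERT INTO orders (user_id) VALUES (?)", (first_id,))

# Changing the address touches one row; orders keeps pointing at the same id.
conn.execute("UPDATE users SET email = ? WHERE id = ?", ("new@example.com", first_id))
owner_email = conn.execute(
    "SELECT u.email FROM orders o JOIN users u ON u.id = o.user_id"
).fetchone()[0]
```

With this layout the e-mail address is stored once, and a user changing their address does not ripple through every referencing table.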

Can select * usage ever be justified?

45 votes

I've always preached to my developers that SELECT * is evil and should be avoided like the plague.

Are there any cases where it can be justified?

I'm not talking about COUNT(*) - which most optimizers can figure out.

Edit

I'm talking about production code.

And one great example I saw of this bad practice was a legacy asp application that used select * in a stored procedure, and used ADO to loop through the returned records, but got the columns by index. You can imagine what happened when a new field was added somewhere other than the end of the field list.
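The failure mode described above is easy to reproduce. The following is a small sketch in Python with the built-in sqlite3 module (not the original ASP/ADO code); the two tables, people_v1 and people_v2, are hypothetical and stand for the same table before and after a column was added in the middle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows allow access by index and by column name
conn.executescript("""
    CREATE TABLE people_v1 (id INTEGER, name TEXT);
    CREATE TABLE people_v2 (id INTEGER, title TEXT, name TEXT);  -- new column in the middle
    INSERT INTO people_v1 VALUES (1, 'Ada');
    INSERT INTO people_v2 VALUES (1, 'Dr', 'Ada');
""")

def name_by_index(table):
    """Fragile: SELECT * plus a positional lookup."""
    row = conn.execute(f"SELECT * FROM {table}").fetchone()
    return row[1]

def name_by_column(table):
    """Robust: explicit column list plus a lookup by name."""
    row = conn.execute(f"SELECT name FROM {table}").fetchone()
    return row["name"]

before = name_by_index("people_v1")   # the name, as intended
after = name_by_index("people_v2")    # silently returns the title instead
```

Nothing raises an error in the fragile version; the code just starts reading the wrong column.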

I'm quite happy using * in audit triggers.

In that case it can actually be a benefit: if additional columns are added to the base table, the trigger raises an error, so nobody can forget to deal with them in the audit trigger and/or the audit table structure.

Why is SELECT * considered harmful?

37 votes

Why is SELECT * bad practice? Wouldn't it mean less code to change if you added a new column you wanted?

I understand that SELECT COUNT(*) is a performance problem on some DBs, but what if you really wanted every column?

The asterisk character, "*", in the SELECT statement is shorthand for all the columns in the table(s) involved in the query.

Performance

The * shorthand can be slower because:

  • Not all the fields are indexed, forcing a full table scan - less efficient
  • The few characters you save by sending SELECT * over the wire are not worth the risk of a full table scan
  • Returning more data than is needed
  • Returning trailing columns that use variable-length data types can result in search overhead

Maintenance

When using SELECT *:

  • Someone unfamiliar with the codebase would be forced to consult documentation to know what columns are being returned before being able to make competent changes. Making code more readable, minimizing the ambiguity and work necessary for people unfamiliar with the code saves more time and effort in the long run.
  • If code depends on column order, SELECT * will hide an error waiting to happen if a table had its column order changed.
  • Even if you need every column at the time the query is written, that might not be the case in the future.
  • The usage complicates profiling.

Design

SELECT * is an anti-pattern:

  • The purpose of the query is less obvious; the columns used by the application are opaque
  • It breaks the modularity rule about using strict typing whenever possible. Explicit is almost universally better.

When Should "SELECT *" Be Used?

It's acceptable to use SELECT * when there's the explicit need for every column in the table(s) involved, as opposed to every column that existed when the query was written. The database will internally expand the * into the complete list of columns - there's no performance difference.

Otherwise, explicitly list every column that is to be used in the query - preferably while using a table alias.

Is there anything faster than SqlDataReader in .NET?

18 votes

I need to load one column of strings from a table on SQL Server into an array in memory using C#. Is there a faster way than opening a SqlDataReader and looping through it? The table is large and time is critical.

EDIT: I am trying to build a .dll and use it on the server for some operations on the database, but it is too slow for now. If this is the fastest way, then I have to redesign the database. I thought there might be some solution to speed things up.

Data Reader

About the fastest access you will get to SQL is with the SqlDataReader.

Profile it

It's worth actually profiling where your performance issue is. Often the place where you think the performance issue is turns out to be totally wrong once you've profiled it.

For example it could be:

  1. The time... the query takes to run
  2. The time... the data takes to copy across the network/process boundary
  3. The time... .Net takes to load the data into memory
  4. The time... your code takes to do something with it

Profiling each of these in isolation will give you a better idea of where your bottleneck is. For profiling your code, there is a great article from Microsoft
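As a rough illustration of timing the stages separately, here is a minimal sketch in Python rather than .NET. It uses an in-memory SQLite table, so there is no network step to measure, and the helper name timed and the stage labels are made up for the example. The same idea carries over to a Stopwatch around each stage in C#.

```python
import sqlite3
import time

def timed(label, fn, timings):
    """Run fn(), record how long it took under the given label, return its result."""
    start = time.perf_counter()
    result = fn()
    timings[label] = time.perf_counter() - start
    return result

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (s TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", ((f"row {i}",) for i in range(10000)))

timings = {}
cursor = timed("query", lambda: conn.execute("SELECT s FROM t"), timings)
rows = timed("fetch", cursor.fetchall, timings)                      # load into memory
values = timed("process", lambda: [r[0].upper() for r in rows], timings)

slowest = max(timings, key=timings.get)  # the stage worth looking at first
```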

Cache it

The thing to look at to improve performance is to work out if you need to load all that data every time. Can the list (or part of it) be cached? Take a look at the new System.Runtime.Caching namespace.

Rewrite as T-SQL

If you are doing purely data operations (as your question suggests), you could rewrite the code that uses the data as T-SQL and run it natively on SQL Server. This has the potential to be much faster, as you will be working with the data directly and not shifting it about.

If your code has a lot of necessary procedural logic, you could try mixing T-SQL with CLR Integration, giving you the benefits of both worlds.

This very much comes down to the complexity (or more procedural nature) of your logic.

If all else fails

If all areas are optimal (or as near as makes no difference) and your design is without fault, I wouldn't even get into micro-optimisation; I'd just throw hardware at it.

What hardware? Try the Reliability and Performance Monitor to find out where the bottleneck is. The most likely place for the problem you describe is HDD or RAM.

Why does this simple mysql insert query take occasionally so long?

15 votes

Ok, I've got a real head scratcher... I'm going bald!

This is a pretty simple problem. Inserting data into the table normally works fine, except that a few times the insert query takes a few seconds. This isn't very good, so I set up a simulation of the insert process. I am NOT trying to bulk insert data. I am trying to find out why the insert query occasionally takes more than 2 seconds to run. Joshua suggested that the index file may be being adjusted; I have removed the id (primary key field), but the delay still happens.

I have a MyISAM table: daniel_test_insert (this table starts COMPLETELY empty):

create table if not exists daniel_test_insert ( 
    id int unsigned auto_increment not null, 
    value_str varchar(255) not null default '', 
    value_int int unsigned default 0 not null, 
    primary key (id) 
)

I insert data into it, and sometimes an insert query takes > 2 seconds to run. THERE ARE NO READS on this table. All writes, in serial, by a single-threaded program.

I insert the same row 100,000 times. I run the exact same query 100,000 times because once in a while the query takes a long time, and I'm trying to find out why. It appears to be a random occurrence so far, though.

This query for example took 4.194 seconds (a very long time for an insert)

Query: INSERT INTO daniel_test_insert SET value_int=12345, value_str='afjdaldjsf aljsdfl ajsdfljadfjalsdj fajd as f' - ran for 4.194 seconds
status               | duration | cpu_user  | cpu_system | context_voluntary | context_involuntary | page_faults_minor
starting             | 0.000042 | 0.000000  | 0.000000   | 0                 | 0                   | 0                
checking permissions | 0.000024 | 0.000000  | 0.000000   | 0                 | 0                   | 0                
Opening tables       | 0.000024 | 0.001000  | 0.000000   | 0                 | 0                   | 0                
System lock          | 0.000022 | 0.000000  | 0.000000   | 0                 | 0                   | 0                
Table lock           | 0.000020 | 0.000000  | 0.000000   | 0                 | 0                   | 0                
init                 | 0.000029 | 0.000000  | 0.000000   | 1                 | 0                   | 0                
update               | 4.067331 | 12.151152 | 5.298194   | 204894            | 18806               | 477995           
end                  | 0.000094 | 0.000000  | 0.000000   | 8                 | 0                   | 0                
query end            | 0.000033 | 0.000000  | 0.000000   | 1                 | 0                   | 0                
freeing items        | 0.000030 | 0.000000  | 0.000000   | 1                 | 0                   | 0                
closing tables       | 0.125736 | 0.278958  | 0.072989   | 4294              | 604                 | 2301             
logging slow query   | 0.000099 | 0.000000  | 0.000000   | 1                 | 0                   | 0                
logging slow query   | 0.000102 | 0.000000  | 0.000000   | 7                 | 0                   | 0                
cleaning up          | 0.000035 | 0.000000  | 0.000000   | 7                 | 0                   | 0

This is an abbreviated version of the SHOW PROFILE output; I threw out the columns that were all zero.

Now the update has an incredible number of context switches and minor page faults.

Opened_Tables increases about 1 per 10 seconds on this database (not running out of table_cache space)

Stats:

MySQL 5.0.89

Hardware: 32 Gigs of RAM / 8 cores @ 2.66GHz; RAID 10 SCSI hard disks (SCSI II???). I have had the hard drives and RAID controller queried: no errors are being reported. CPUs are about 50% idle.

iostat -x 5 reports less than 10% utilization for the hard disks; top reports a 1-minute load average of about 10 (normal for our db machine).

Swap space has 156k used (32 gigs of ram :)

I'm at a loss to find out what is causing this performance lag! Does anyone have any suggestions?

This does NOT happen on our low-load slaves, only on our high-load master. This also happens with MEMORY and InnoDB tables.

Warning: This is a production system, so nothing exotic!

-daniel (I'm going to have to use my dog's hair for a toupee!!!)

Updated: Sept 20th, 2010: I'm going bald!

I have noticed the same phenomenon on my systems. Queries which normally take a millisecond will suddenly take 1-2 seconds. All of my cases are simple, single table INSERT/UPDATE/REPLACE statements --- not on any SELECTs. No load, locking, or thread build up is evident.

I had suspected that it's due to clearing out dirty pages, flushing changes to disk, or some hidden mutex, but I have yet to narrow it down.

Also Ruled Out

  • Server load -- no correlation with high load
  • Engine -- happens with InnoDB/MyISAM/Memory
  • MySQL Query Cache -- happens whether it's on or off
  • Log rotations -- no correlation in events

The only other observation I have at this point is derived from the fact I'm running the same db on multiple machines. I have a heavy read application so I'm using an environment with replication -- most of the load is on the slaves. I've noticed that even though there is minimal load on the master, the phenomenon occurs more there. Even though I see no locking issues, maybe it's Innodb/Mysql having trouble with (thread) concurrency? Recall that the updates on the slave will be single threaded.

MySQL Version 5.1.48

Update

I think I have a lead for the problem in my case. On some of my servers, I noticed this phenomenon more than on the others. Seeing what was different between the servers, and tweaking things around, I was led to the MySQL InnoDB system variable innodb_flush_log_at_trx_commit.

I found the doc a bit awkward to read, but innodb_flush_log_at_trx_commit can take the values of 1,2,0:

  • For 1, the log buffer is flushed to the log file for every commit, and the log file is flushed to disk for every commit.
  • For 2, the log buffer is flushed to the log file for every commit, and the log file is flushed to disk approximately every 1-2 seconds.
  • For 0, the log buffer is flushed to the log file every second, and the log file is flushed to disk every second.

Effectively, in the order (1, 2, 0), as reported and documented, you're supposed to get increasing performance in exchange for increased risk.

Having said that, I found that the servers with innodb_flush_log_at_trx_commit=0 were performing worse (i.e. having 10-100 times more "long updates") than the servers with innodb_flush_log_at_trx_commit=2. Moreover, things immediately improved on the bad instances when I switched it to 2 (note you can change it on the fly).

So, my question is, what is yours set to? Note that I'm not blaming this parameter, but rather highlighting that its context is related to this issue.

Is there any difference between IS NULL and =NULL

10 votes

I am surprised to see that IS NULL and =NULL yield different results in a select query. What is the difference between them, and when should I use which? I would be glad if you could explain it to me in detail.

= NULL always evaluates to unknown (this is part of three-valued logic), but a WHERE clause treats unknown as false and drops the row from the result set. So to test for NULL you should use IS NULL.

Reasons are described here: http://stackoverflow.com/questions/1843451/why-does-null-null-evaluate-to-false-in-sql-server
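The difference is easy to see on a tiny table. This is a sketch using Python's built-in sqlite3 module as a convenient stand-in; the table t and its columns are invented for the example, and engines with non-standard NULL settings may behave differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER, note TEXT);
    INSERT INTO t VALUES (1, 'x');
    INSERT INTO t VALUES (2, NULL);
    INSERT INTO t VALUES (3, NULL);
""")

def count_where(condition):
    """Count the rows of t for which the WHERE condition is true."""
    return conn.execute(f"SELECT COUNT(*) FROM t WHERE {condition}").fetchone()[0]

equals_null = count_where("note = NULL")       # unknown for every row, so nothing matches
not_equals_null = count_where("note <> NULL")  # also unknown for every row
is_null = count_where("note IS NULL")          # the two rows with a missing note
null_eq_null = conn.execute("SELECT NULL = NULL").fetchone()[0]  # NULL, not true
```

Note that = NULL and <> NULL both return no rows: the comparison is neither true nor false, so neither the condition nor its negation selects anything.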

In Sql server, when should you use GO and when should you use semi-colon ; ?

8 votes

I've always been confused with when I should use the GO keyword after commands and whether a semi-colon is required at the end of commands.

When I run the Generate-script in sql server management studio, it seems to use GO all over the place, but not the semi-colon.

Please can someone explain to me the differences and why/when I should use them.

thanks

GO is not actual Transact-SQL; it is a batch separator recognized by client tools such as SSMS. It just tells the tool to send the SQL statements between each GO to the server as individual batches, sequentially.

The ; is a SQL statement delimiter, but for the most part the engine can work out where your statements are broken up.

The main exception, and the place where the ; is used most often, is before a Common Table Expression: the statement preceding WITH must be terminated with a semicolon.

What's the difference: Windows Authentication, Passport Authentication and Form Authentication?

8 votes

Just going to start making a web application and was wondering which was better, or at least what are the main differences between them (as it probably matters what I am using them for)?

  • Windows Authentication
  • Passport Authentication
  • Form Authentication

I would say it greatly depends on what your web app will be doing, as each one has its place. Here are some brief details about each one.

Windows authentication enables you to identify users without creating a custom page. Credentials are stored in the Web server's local user database or an Active Directory domain. Once identified, you can use the user's credentials to gain access to resources that are protected by Windows authorization.

Forms authentication enables you to identify users with a custom database such as an ASP.NET membership database. Alternatively you can implement your own custom database. Once authenticated you can reference the roles the user is in to restrict access to portions of your Web site.

Passport authentication relies on a centralized service provided by Microsoft. Passport authentication identifies a user using his or her e-mail address and a password, and a single Passport account can be used with many different Web sites. Passport authentication is primarily used for public Web sites with thousands of users.

Anonymous authentication does not require the user to provide credentials.

http://msdn.microsoft.com/en-us/library/eeyk640h.aspx - ASP.NET Authentication: further details on forms and Windows authentication

Edit Rushyo link is better: http://msdn.microsoft.com/en-us/library/ee817643.aspx

Generate all combinations in SQL

7 votes

I need to generate all combinations of size @k in a given set of size @n. Can someone please review the following SQL and determine, first, whether the logic is returning the expected results and, second, whether there is a better way?

/*CREATE FUNCTION dbo.Factorial ( @x int ) 
RETURNS int 
AS
BEGIN
    DECLARE @value int

    IF @x <= 1
        SET @value = 1
    ELSE
        SET @value = @x * dbo.Factorial( @x - 1 )

    RETURN @value
END
GO*/
SET NOCOUNT ON;
DECLARE @k int = 5, @n int;
DECLARE @set table ( [value] varchar(24) );
DECLARE @com table ( [index] int );

INSERT @set VALUES ('1'),('2'),('3'),('4'),('5'),('6');

SELECT @n = COUNT(*) FROM @set;

DECLARE @combinations int = dbo.Factorial(@n) / (dbo.Factorial(@k) * dbo.Factorial(@n - @k));

PRINT CAST(@combinations as varchar(max)) + ' combinations';

DECLARE @index int = 1;

WHILE @index <= @combinations
BEGIN
    INSERT @com VALUES (@index)
    SET @index = @index + 1
END;

WITH [set] as (
    SELECT 
        [value], 
        ROW_NUMBER() OVER ( ORDER BY [value] ) as [index]
    FROM @set
)
SELECT 
    [values].[value], 
    [index].[index] as [combination]
FROM [set] [values]
CROSS JOIN @com [index]
WHERE ([index].[index] + [values].[index] - 1) % (@n) BETWEEN 1 AND @k
ORDER BY
    [index].[index];

Using a numbers table or number-generating CTE, select 0 through 2^n - 1. Using the 1-bit positions in these numbers to indicate the presence or absence of the relative members in the combination, and eliminating those that don't have the correct number of values, you should be able to return a result set with all the combinations you desire.

WITH Nums (Num) AS (
   SELECT Num
   FROM Numbers
   WHERE Num BETWEEN 0 AND POWER(2, @n) - 1
), BaseSet AS (
   SELECT ind = Power(2, Row_Number() OVER (ORDER BY Value) - 1), *
   FROM @set
), Combos AS (
   SELECT
      ComboID = N.Num,
      S.Value,
      Cnt = Count(*) OVER (PARTITION BY N.Num)
   FROM
      Nums N
      INNER JOIN BaseSet S ON N.Num & S.ind <> 0
)
SELECT
   ComboID,
   Value
FROM Combos
WHERE Cnt = @k
ORDER BY ComboID, Value;
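For readers who find the bit trick easier to follow outside SQL, here is the same idea as a short Python sketch: scan every number from 0 to 2^n - 1, treat bit i as "member i is present", and keep the numbers with exactly k bits set. The function name combos_by_bitmask is invented for the example, and the seven-letter set is just sample data.

```python
from itertools import combinations

def combos_by_bitmask(items, k):
    """Return (ComboID, members) for every k-member subset, scanning all n-bit numbers."""
    n = len(items)
    result = []
    for num in range(2 ** n):
        # Bit i of num decides whether items[i] belongs to this combination.
        members = [items[i] for i in range(n) if num & (1 << i)]
        if len(members) == k:
            result.append((num, members))
    return result

found = combos_by_bitmask(list("ABCDEFG"), 5)
same_as_itertools = (
    sorted(tuple(m) for _, m in found) == sorted(combinations("ABCDEFG", 5))
)
```

The ComboID is simply the number whose bits selected the members, which is why the IDs in the result are not consecutive.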

Update

Ok, I tweaked the query to give correct results. I had my @n and @k mixed up in the first CTE. Other than that and a missing *, the query is unchanged. It performs pretty well, but I thought of a way to optimize it, cribbing from the Nifty Parallel Bit Count to get the right number of items taken at a time ahead of time. This performs 3 to 3.5 times faster (both CPU and time):

WITH Nums AS (
   SELECT Num, P1 = (Num & 0x55555555) + ((Num / 2) & 0x55555555)
   FROM Numbers
   WHERE Num BETWEEN 0 AND POWER(2, @n) - 1
), Nums2 AS (
   SELECT Num, P2 = (P1 & 0x33333333) + ((P1 / 4) & 0x33333333)
   FROM Nums
), Nums3 AS (
   SELECT Num, P3 = (P2 & 0x0f0f0f0f) + ((P2 / 16) & 0x0f0f0f0f)
   FROM Nums2
), BaseSet AS (
   SELECT ind = Power(2, Row_Number() OVER (ORDER BY Value) - 1), *
   FROM @set
)
SELECT
   ComboID = N.Num,
   S.Value
FROM
   Nums3 N
   INNER JOIN BaseSet S ON N.Num & S.ind <> 0
WHERE P3 % 255 = @k
ORDER BY ComboID, Value;

I went and read the bit-counting page and think that this could perform better if I don't do the % 255 but go all the way with bit arithmetic. When I get a chance I'll try that and see how it stacks up.

My performance claims are based on the queries run without the ORDER BY clause. For clarity, what this code is doing is counting the number of set 1-bits in Num from the Numbers table. That's because the number is being used as a sort of indexer to choose which elements of the set are in the current combination, so the number of 1-bits will be the same.
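The three CTE steps can be checked in isolation. Below is a Python transcription of the same arithmetic (a sketch, valid for non-negative values below 2^32); the shifts correspond to the integer divisions by 2, 4 and 16 in the SQL, and the function name is made up for the example.

```python
def bit_count_parallel(num):
    """Count the 1-bits of a 32-bit non-negative integer, as the Nums/Nums2/Nums3 CTEs do."""
    p1 = (num & 0x55555555) + ((num >> 1) & 0x55555555)   # sums of adjacent bit pairs
    p2 = (p1 & 0x33333333) + ((p1 >> 2) & 0x33333333)     # sums per 4-bit group
    p3 = (p2 & 0x0F0F0F0F) + ((p2 >> 4) & 0x0F0F0F0F)     # sums per byte
    # Each byte now holds its own bit count; since 256 leaves remainder 1 when
    # divided by 255, taking % 255 adds the four bytes together.
    return p3 % 255

matches_naive = all(bit_count_parallel(n) == bin(n).count("1") for n in range(2 ** 17))
```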

I hope you like it!

May I also suggest this change to your Factorial UDF:

ALTER FUNCTION dbo.Factorial (
   @x bigint
)
RETURNS bigint
AS
BEGIN
   IF @x <= 1 RETURN 1
   RETURN @x * dbo.Factorial(@x - 1)
END

Now you can calculate much larger sets of combinations, plus it's more efficient.
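To put a number on "much larger": T-SQL int is a 32-bit signed type and bigint a 64-bit signed type, so the factorial itself stops fitting fairly early in both cases. The Python sketch below (helper names are invented for the example) finds the largest argument whose factorial still fits, and evaluates the same n! / (k! * (n - k)!) formula the script uses.

```python
import math

INT_MAX = 2 ** 31 - 1      # largest T-SQL int
BIGINT_MAX = 2 ** 63 - 1   # largest T-SQL bigint

def largest_factorial_arg(limit):
    """Largest x such that x! still fits below the given limit."""
    x = 1
    while math.factorial(x + 1) <= limit:
        x += 1
    return x

def n_choose_k(n, k):
    """Number of combinations, using the same factorial formula as the script."""
    return math.factorial(n) // (math.factorial(k) * math.factorial(n - k))

int_limit = largest_factorial_arg(INT_MAX)
bigint_limit = largest_factorial_arg(BIGINT_MAX)
```

Python integers do not overflow, so this only reports where the SQL types would; it says nothing about the recursion depth SQL Server allows.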

Just for fun here's what I used for my performance testing with big sets:

DECLARE
   @k int,
   @n int;

DECLARE @set TABLE (value varchar(24));
INSERT @set VALUES ('A'),('B'),('C'),('D'),('E'),('F'),('G'),('H'),('I'),('J'),('K'),('L'),('M'),('N'),('O'),('P'),('Q'); --,('R'),('S');
SET @n = @@RowCount;
SET @k = 5;

DECLARE @combinations bigint = dbo.Factorial(@n) / (dbo.Factorial(@k) * dbo.Factorial(@n - @k));
SELECT CAST(@combinations as varchar(max)) + ' combinations', MaxNumUsedFromNumbersTable = POWER(2, @n);

Note that you could use an on-the-fly numbers table, but it may not perform as well. I haven't tried it.

Update 2

I looked at your query and found that it is not correct. For example, using my test data, set 1 is the same as set 18. It looks like your query takes a sliding stripe that wraps around: each set is always 5 adjacent members, looking something like this:

 1 ABCDE            
 2 ABCD            Q
 3 ABC            PQ
 4 AB            OPQ
 5 A            NOPQ
 6             MNOPQ
 7            LMNOP 
 8           KLMNO  
 9          JKLMN   
10         IJKLM    
11        HIJKL     
12       GHIJK      
13      FGHIJ       
14     EFGHI        
15    DEFGH         
16   CDEFG          
17  BCDEF           
18 ABCDE            
19 ABCD            Q
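The wrap-around can be confirmed by replaying the WHERE clause of the original query outside the database. This Python sketch applies the same test, (combination + position - 1) % n BETWEEN 1 AND k, to the 17-letter test set; the function name members is made up for the example.

```python
from math import comb  # Python 3.8+

def members(combination, items, k):
    """Members selected by the original query's WHERE clause for one combination index."""
    n = len(items)
    return tuple(
        item
        for position, item in enumerate(items, start=1)
        if 1 <= (combination + position - 1) % n <= k
    )

items = list("ABCDEFGHIJKLMNOPQ")
k = 5
expected_count = comb(len(items), k)                  # how many combinations there should be
sets = [members(c, items, k) for c in range(1, expected_count + 1)]
distinct_count = len(set(sets))                       # how many different sets come out
```

Only as many distinct sets come out as there are items, because the pattern repeats every n combination indexes.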

Comparing the pattern from my queries:

 31 ABCDE  
 47 ABCD F 
 55 ABC EF 
 59 AB DEF 
 61 A CDEF 
 62  BCDEF 
 79 ABCD  G
 87 ABC E G
 91 AB DE G
 93 A CDE G
 94  BCDE G
103 ABC  FG
107 AB D FG
109 A CD FG
110  BCD FG
115 AB  EFG
117 A C EFG
118  BC EFG
121 A  DEFG

Just to drive the bit-pattern -> index of combination thing home for anyone interested, notice that 31 in binary = 11111 and the pattern is ABCDE. 121 in binary is 1111001 and the pattern is A__DEFG (backwards mapped).

For the record, this technique of using the bit pattern of integers to select members of a set is what I've coined the "Vertical Cross Join." It effectively results in the cross join of multiple sets of data, where the number of sets & cross joins is arbitrary. Here, the number of sets is the number of items taken at a time.

That is, actually cross joining would look something like this:

SELECT
   A.Value,
   B.Value,
   C.Value
FROM
   @Set A
   CROSS JOIN @Set B
   CROSS JOIN @Set C
WHERE
   A.Value = 'A'
   AND B.Value = 'B'
   AND C.Value = 'C'

But the queries above cross join as many times as necessary with only one join. The results are unpivoted compared to actual cross joins, sure, but that's a minor matter.

Update 3

Peter showed that my "vertical cross join" doesn't perform as well as simply writing dynamic SQL to actually do the CROSS JOINs it avoids. At the trivial cost of a few more reads, his solution has metrics between 10 and 17 times better. The performance of his query decreases faster than mine as the amount of work increases, but not fast enough to stop anyone from using it.

The second set of numbers below is the factor as divided by the first row in the table, just to show how it scales.

Erik

Items  CPU   Writes  Reads  Duration |  CPU  Writes  Reads Duration
----- ------ ------ ------- -------- | ----- ------ ------ --------
17•5    7344     0     3861    8531  |
18•9   17141     0     7748   18536  |   2.3          2.0      2.2
20•10  76657     0    34078   84614  |  10.4          8.8      9.9
21•11 163859     0    73426  176969  |  22.3         19.0     20.7
21•20 142172     0    71198  154441  |  19.4         18.4     18.1

Peter

Items  CPU   Writes  Reads  Duration |  CPU  Writes  Reads Duration
----- ------ ------ ------- -------- | ----- ------ ------ --------
17•5     422    70    10263     794  | 
18•9    6046   980   219180   11148  |  14.3   14.0   21.4    14.0
20•10  24422  4126   901172   46106  |  57.9   58.9   87.8    58.1
21•11  58266  8560  2295116  104210  | 138.1  122.3  223.6   131.3
21•20  51391     5  6291273   55169  | 121.8    0.1  613.0    69.5

Extrapolating, eventually my query will be cheaper (though it is from the start in reads), but not for a long time. To use 21 items in the set already requires a numbers table going up to 2097152...

I love single-query solutions to problems like this, but if you're looking for the best performance, an actual cross join is best, unless you start dealing with seriously huge numbers of combinations. But what does anyone want with hundreds of thousands or even millions of rows? Even the growing number of reads doesn't seem too much of a problem, though 6 million is a lot and it's getting bigger fast...

Anyway. Dynamic SQL wins. I still had a beautiful query. :)

Extended placeholders for SQL, e.g. WHERE id IN (??)

7 votes

Bounty update: Already got a very good answer from Mark. Adapted := into :, below. However, I'm still looking for similar schemes besides DBIx. I'm just interested in being compatible to anything.


I need advice on the syntax I've picked for "extended" placeholders in parameterized SQL statements. Because building some constructs (IN clauses) was bugging me, I decided on a few syntax shortcuts that automatically expand into ordinary ? placeholders.
I like them. But I want to package it up for distribution, and am asking myself if they are easily understandable.

Basically my new placeholders are ?? and :? (enumerated params) and :& and :, and :| and :: (for named placeholders) with following use cases:

-> db("  SELECT * FROM all WHERE id IN (??)  ", [$a, $b, $c, $d, $e])

The ?? expands into ?,?,?,?,?,... depending on the number of $args to my db() func. This one is pretty clear, and its syntax is already sort of standardized. Perls DBIx::Simple uses it too. So I'm pretty certain this is an acceptable idea.
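To make the expansion concrete, here is a minimal sketch of the ?? rule in Python (the question's db() function is PHP and its internals are not shown, so this is not its implementation). The function name expand_list_placeholder and the items table are hypothetical, and the naive string replacement would also rewrite a ?? that appears inside a quoted literal.

```python
import sqlite3

def expand_list_placeholder(sql, args):
    """Replace the single ?? in sql with one ? per argument."""
    if sql.count("??") != 1:
        raise ValueError("expected exactly one ?? placeholder")
    if not args:
        raise ValueError("an empty IN () list is not valid SQL")
    return sql.replace("??", ",".join("?" * len(args)))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c"), (4, "d")])

wanted = [1, 3, 4]
sql = expand_list_placeholder("SELECT name FROM items WHERE id IN (??) ORDER BY id", wanted)
names = [row[0] for row in conn.execute(sql, wanted)]  # values are still bound, not inlined
```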

-> db("  SELECT :? FROM any WHERE id>0   ",  ["title", "frog", "id"]);
// Note: not actually parameterized attr, needs cleanup regex

Admit it. I just liked the smiley. Basically this :? placeholder expands an associative $args into plain column names. It throws away any $args values in fact. It's actually useful for INSERTs in conjunction with ??, and sometimes for IN clauses. But here I'm already wondering if this new syntax is sensible, or not just a misnomer because it mixes : and ? characters. But somehow it seems to match the syntax scheme well.

-> db("  UPDATE some SET :, WHERE :& AND (:|)   ", $row, $keys, $or);

Here the mnemonic :, expands into a list of name=:name pairs separated by , commas. Whereas the :& is a column=:column list joined by ANDs. For parity I've added :|. The :& has other use cases out of UPDATE commands, though.
But my question is not about the usefulness, but if :, and :& appear to be rememberable?
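One possible reading of the :, and :& rules, sketched in Python with sqlite3 named parameters. Everything here is an assumption made for illustration: the function names, the posts table, the identifier check (column names cannot be bound, so they are validated instead), and the decision to reject a column that appears in both the SET list and the WHERE list.

```python
import re
import sqlite3

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def pairs(names, joiner):
    """Build 'col=:col' pairs; column names are checked because they cannot be bound."""
    for name in names:
        if not IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe column name: {name!r}")
    return joiner.join(f"{name}=:{name}" for name in names)

def expand_named(sql, row, keys):
    """Expand ':,' from row (SET list) and ':&' from keys (AND-ed conditions)."""
    if set(row) & set(keys):
        raise ValueError("a column cannot be both updated and used as a key here")
    sql = sql.replace(":,", pairs(row, ", "))
    sql = sql.replace(":&", pairs(keys, " AND "))
    return sql, {**row, **keys}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, views INTEGER)")
conn.execute("INSERT INTO posts VALUES (1, 'Old', 0)")

sql, params = expand_named(
    "UPDATE posts SET :, WHERE :&", {"title": "New", "views": 7}, {"id": 1}
)
conn.execute(sql, params)
updated = conn.execute("SELECT title, views FROM posts WHERE id = 1").fetchone()
```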

 -> db("  SELECT * FROM all WHERE name IN (::)  ", $assoc);

After some thought I also added :: to interpolate a :named,:value,:list very much like ?? expands to ?,?,?. Similar use cases, and sensible to have for uniformity.

Anyway, has anybody else implemented a scheme like that? Different placeholders? Or which would you recommend for simplicity? Update: I know that the PHP Oracle OCI interface can also bind array parameters, but doesn't use specific placeholders for it. And I'm looking for comparable placeholder syntaxes.

You might want to avoid using := as a placeholder because it already has a usage in for example MySQL.

See for example this answer for a real world usage.

What does "WHERE 1" mean in SQL?

7 votes

Sometimes phpMyAdmin generates queries like:

SELECT * 
FROM  `items` 
WHERE 1 
LIMIT 0 , 30

I wonder if WHERE 1 has any meaning in a query like that.

It doesn't. It means ALWAYS TRUE, so it won't have any filtering impact on your query. The query planner will probably ignore that clause.

It's usually used when you build a client side query by concatenating filtering conditions.

So, if your base query is stored in a string like this (example is in PHP, but it certainly applies to many other languages):

$sql = "select * from foo where 1 ";

Then you can just concatenate a lot of filtering conditions with an AND clause regardless of it being the first condition you are using or not:

// pseudo php follows...
if ($filter_by_name) {
    $sql = $sql . " and name = ? ";
}
if ($filter_by_number) {
    $sql = $sql . " and number = ? ";
}
// so on, and so forth.
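The same pattern as a runnable sketch in Python with the built-in sqlite3 module, including the parameter list that has to grow alongside the SQL string. The foo table comes from the snippet above; its name and number columns, and the helper names, are assumptions for the example.

```python
import sqlite3

def build_query(name=None, number=None):
    """Start from WHERE 1 so that every optional filter can be appended with AND."""
    sql = "SELECT name, number FROM foo WHERE 1"
    params = []
    if name is not None:
        sql += " AND name = ?"
        params.append(name)
    if number is not None:
        sql += " AND number = ?"
        params.append(number)
    return sql, params

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (name TEXT, number INTEGER)")
conn.executemany("INSERT INTO foo VALUES (?, ?)", [("a", 1), ("a", 2), ("b", 2)])

def run(**filters):
    """Build the query for the given filters and return all matching rows."""
    sql, params = build_query(**filters)
    return conn.execute(sql, params).fetchall()
```

Without the leading WHERE 1, the code would have to track whether a given filter is the first one in order to choose between WHERE and AND.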

SQL Server Enterprise Manager - Mass Delete of Tables and Changing Ownership of Tables

7 votes

I have pretty much no experience with SQL Server's Enterprise Manager so I am not sure if this is even possible (or hopefully laughably simple!)

During an import into a database something has happened where each table has duplicated itself with two important differences.

The first is that the Owner on both tables is different, the second is that only the structure has copied across on one of the copies.

Sod's law indicated that of course the data was stored on the tables owned by the wrong person, so my question is can I quickly delete all tables owned by one user and can I quickly change the ownership of all other tables to bring them in line.

There are enough tables that automation is going to be my preferred option by a LONG way!

Any help would be greatly appreciated, I am running SQL Server 2000

-- Assumes the empty copies belong to @emptyOwner and the populated ones to @wrongOwner.
declare @emptyOwner varchar(20)
declare @wrongOwner varchar(20)
declare @emptyOwnerID bigint
declare @wrongOwnerID bigint
declare @tableName nvarchar(255)
declare @dynSQL nvarchar(4000)  -- nvarchar(MAX) is not available in SQL Server 2000

set @emptyOwner = 'dbo'
set @wrongOwner = 'guest'

select @emptyOwnerID = (select uid from sysusers where name = @emptyOwner)
select @wrongOwnerID = (select uid from sysusers where name = @wrongOwner)

-- User tables that exist under both owners (sysobjects is the SQL Server 2000 catalog)
select o.name as tableName
into #tempTable
from sysobjects o
where o.type = 'U'
and o.uid = @wrongOwnerID
and exists (select 1 from sysobjects e
            where e.type = 'U' and e.uid = @emptyOwnerID and e.name = o.name)

declare ownme cursor for
  select tableName from #tempTable

open ownme
fetch next from ownme into @tableName

while @@FETCH_STATUS = 0
begin
    -- Drop the empty copy, then hand the populated copy over to the same owner
    set @dynSQL = 'DROP TABLE [' + @emptyOwner + '].[' + @tableName + ']'
    exec(@dynSQL)

    set @dynSQL = 'exec sp_changeobjectowner ''[' + @wrongOwner + '].[' + @tableName + ']'',''' + @emptyOwner + ''''
    exec(@dynSQL)

    fetch next from ownme into @tableName
end

close ownme
deallocate ownme
drop table #tempTable

TSQL Finding Order that occurred in 3 consecutive months

7 votes

Please help me to generate the following query. Say I have customer table and order table.

Customer Table

CustID CustName

1      AA     
2      BB
3      CC
4      DD  

Order Table

OrderID  OrderDate          CustID
100      01-JAN-2000        1  
101      05-FEB-2000        1     
102      10-MAR-2000        1 
103      01-NOV-2000        2    
104      05-APR-2001        2 
105      07-MAR-2002        2
106      01-JUL-2003        1
107      01-SEP-2004        4
108      01-APR-2005        4
109      01-MAY-2006        3 
110      05-MAY-2007        1  
111      07-JUN-2007        1
112      06-JUL-2007        1 

I want to find the customers who have placed orders in three successive months. (A query using SQL Server 2005 or 2008 features is allowed.)

The desired output is:

CustName      Year   OrderDate   

    AA        2000  01-JAN-2000       
    AA        2000  05-FEB-2000
    AA        2000  10-MAR-2000

    AA        2007  05-MAY-2007        
    AA        2007  07-JUN-2007        
    AA        2007  06-JUL-2007         

Edit: Got rid of the MAX() OVER (PARTITION BY ...) as that seemed to kill performance.

;WITH cte AS (
    SELECT CustID,
           OrderDate,
           DATEPART(YEAR, OrderDate)*12 + DATEPART(MONTH, OrderDate) AS YM
    FROM   Orders
),
cte1 AS (
    SELECT CustID,
           OrderDate,
           YM,
           YM - DENSE_RANK() OVER (PARTITION BY CustID ORDER BY YM) AS G
    FROM   cte
),
cte2 AS (
    SELECT CustID,
           MIN(OrderDate) AS Mn,
           MAX(OrderDate) AS Mx
    FROM   cte1
    GROUP BY CustID, G
    HAVING MAX(YM) - MIN(YM) >= 2
)
SELECT c.CustName, o.OrderDate, YEAR(o.OrderDate) AS YEAR
FROM   Customers AS c
INNER JOIN Orders AS o ON c.CustID = o.CustID
INNER JOIN cte2 AS c2 ON c2.CustID = o.CustID AND o.OrderDate BETWEEN Mn AND Mx
ORDER BY c.CustName, o.OrderDate
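
The answer does not spell out why subtracting DENSE_RANK from the month number works. Within one customer, each consecutive month raises both YM and the rank by one, so the difference G stays constant across a run and jumps at every gap. Below is a minimal sketch of that intermediate step, run through Python's sqlite3 module rather than SQL Server. That substitution is an assumption made only so the example is self-contained: strftime stands in for DATEPART, and SQLite 3.25 or later is needed for window functions. The rows are customer 1's orders from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (OrderID INTEGER, OrderDate TEXT, CustID INTEGER)")
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?)",
    [
        (100, "2000-01-01", 1), (101, "2000-02-05", 1), (102, "2000-03-10", 1),
        (106, "2003-07-01", 1), (110, "2007-05-05", 1), (111, "2007-06-07", 1),
        (112, "2007-07-06", 1),
    ],
)

# Same arithmetic as cte/cte1 in the answer; strftime replaces DATEPART (SQLite stand-in).
rows = conn.execute("""
    WITH cte AS (
        SELECT CustID, OrderDate,
               CAST(strftime('%Y', OrderDate) AS INTEGER) * 12
             + CAST(strftime('%m', OrderDate) AS INTEGER) AS YM
        FROM Orders
    )
    SELECT OrderDate, YM,
           YM - DENSE_RANK() OVER (PARTITION BY CustID ORDER BY YM) AS G
    FROM cte
    ORDER BY OrderDate
""").fetchall()

for order_date, ym, g in rows:
    print(order_date, ym, g)

group_keys = [g for _, _, g in rows]
```

The three orders from early 2000 share G = 24000 and the three from mid-2007 share G = 24084, so those are the groups that survive the HAVING clause. The lone July 2003 order gets its own G (24039) and is filtered out.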

adding a column description

7 votes

Does anyone know how to add a description to a SQL Server column by running a script? I know you can add a description when you create the column using SQL Server Management Studio.

How can I script this so when my SQL scripts create the column, a description for the column is also added?

I'd say you will probably want to do it using the sp_addextendedproperty stored proc.

Microsoft has some good documentation on it but you can also look at this link:

http://www.eggheadcafe.com/software/aspnet/32895758/how-to-set-description-property-with-alter-table-add-column.aspx

Try this:

-- SSMS reads the column's Description field from the extended property named MS_Description
EXEC sp_addextendedproperty 
@name = N'MS_Description', @value = N'Hey, here is my description!',
@level0type = N'Schema', @level0name = N'yourschema',
@level1type = N'Table',  @level1name = N'YourTable',
@level2type = N'Column', @level2name = N'yourColumn';
GO

Copy one column to another for over a billion rows in SQL Server database

7 votes

Database : SQL Server 2005

Problem : Copy values from one column to another column in the same table with a billion+ rows.

test_table (int id, bigint bigid)

Things tried 1: update query

update test_table set bigid = id 

fills up the transaction log and rolls back due to lack of transaction log space.

Tried 2 - a procedure on following lines

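set nocount on
declare @rowcount int, @rowsupdated bigint
select @rowcount = 1, @rowsupdated = 0
set rowcount 500000 -- limit each update to 500000 rows
while @rowcount > 0
begin
 update test_table set bigid = id where bigid is null
 set @rowcount = @@rowcount
 set @rowsupdated = @rowsupdated + @rowcount
end
set rowcount 0 -- remove the limit again
print @rowsupdated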

The above procedure starts slowing down as it proceeds.

Tried 3 - Creating a cursor for update.

Cursors are generally discouraged in the SQL Server documentation, and this approach updates one row at a time, which is too time-consuming.

Is there an approach that can speed up the copying of values from one column to another? Basically I am looking for some 'magic' keyword or logic that will allow the update query to rip through the billion rows half a million at a time, sequentially.

Any hints, pointers will be much appreciated.

I'm going to guess that you are closing in on the 2.1 billion limit of the INT datatype on an artificial key column. Yes, that's a pain. Much easier to fix before the fact than after you've actually hit that limit and production is shut down while you are trying to fix it :)

Anyway, several of the ideas here will work. Let's talk about speed, efficiency, indexes, and log size, though.

Log Growth

The log blew up originally because it was trying to commit all 2b rows at once. The suggestions in other posts for "chunking it up" will work, but that may not totally resolve the log issue.

If the database is in SIMPLE mode, you'll be fine (the log will re-use itself after each batch). If the database is in FULL or BULK_LOGGED recovery mode, you'll have to run log backups frequently during the running of your operation so that SQL can re-use the log space. This might mean increasing the frequency of the backups during this time, or just monitoring the log usage while running.

Indexes and Speed

ALL of the "where bigid is null" answers will slow down as the table is populated, because there is (presumably) no index on the new BIGID field, so each batch has to scan past the rows already filled in. You could, of course, just add an index on BIGID, but I'm not convinced that is the right answer.

The key (pun intended) is my assumption that the original ID field is probably the primary key, or the clustered index, or both. In that case, let's take advantage of that fact, and do a variation of Jess' idea:

declare @counter bigint -- bigint so the range arithmetic cannot overflow
set @counter = 1
while @counter < 2000000000 --or whatever
begin
  update test_table set bigid = id 
  where id between @counter and (@counter + 499999) --BETWEEN is inclusive
  set @counter = @counter + 500000
end

This should be extremely fast, because of the existing indexes on ID.

The IS NULL check really wasn't necessary anyway, and neither is my (-1) on the interval (the 499999 rather than 500000). If consecutive calls update a few of the same rows twice, that's not a big deal.
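
The same range-walking loop can be sketched in Python against an in-memory SQLite table so that it runs anywhere. SQLite, the 10,000-row table, and the small batch size are stand-ins chosen for illustration, not the SQL Server setup from the question. What carries over is that each batch is located through the primary key and committed on its own, so no batch has to scan for NULLs.

```python
import sqlite3

BATCH = 1000  # illustrative only; the answer above uses 500000

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_table (id INTEGER PRIMARY KEY, bigid INTEGER)")
conn.executemany("INSERT INTO test_table (id) VALUES (?)",
                 [(i,) for i in range(1, 10001)])
conn.commit()

max_id = conn.execute("SELECT MAX(id) FROM test_table").fetchone()[0]
counter = 1
batches = 0
while counter <= max_id:
    # BETWEEN is inclusive, so each batch covers exactly BATCH consecutive ids.
    conn.execute(
        "UPDATE test_table SET bigid = id WHERE id BETWEEN ? AND ?",
        (counter, counter + BATCH - 1),
    )
    conn.commit()  # one short transaction per batch keeps the log small
    counter += BATCH
    batches += 1

# Count rows that were missed or copied incorrectly.
remaining = conn.execute(
    "SELECT COUNT(*) FROM test_table WHERE bigid IS NULL OR bigid <> id"
).fetchone()[0]
print(batches, remaining)
```

Stopping at MAX(id) instead of a hard-coded upper bound avoids looping over empty key ranges once the real data has been covered.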

SQL interview question

7 votes

I got the following question in an interview: Given a table of natural numbers with some missing, output two tables, with the beginning of each number gap in the first table and the end of each gap in the second. Example:

 ____    ________
|    |   |   |   |
| 1  |   | 3 | 3 |
| 2  |   | 6 | 7 |
| 4  |   | 10| 12|
| 5  |   |___|___|
| 8  |
| 9  |
| 13 |
|____|

While this is pretty much the same as Phil Sandler's answer, this should return two separate tables (and I think it looks cleaner) (it works in SQL Server, at least):

DECLARE @temp TABLE (num int)
INSERT INTO @temp VALUES (1),(2),(4),(5),(8),(9),(13)

DECLARE @min INT, @max INT
SELECT @min = MIN(num), @max = MAX(num) FROM @temp

SELECT t.num + 1 AS range_start
    FROM @temp t
    LEFT JOIN @temp t2 ON t.num + 1 = t2.num
    WHERE t.num < @max AND t2.num IS NULL

SELECT t.num - 1 AS range_end
    FROM @temp t
    LEFT JOIN @temp t2 ON t.num - 1 = t2.num
    WHERE t.num > @min AND t2.num IS NULL
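
To check the two queries against the example, here is a small sketch that runs them through Python's sqlite3 module. SQLite is an assumption standing in for SQL Server, so the table variable becomes an ordinary table named nums and @min/@max become subqueries. A gap starts right after any number whose successor is missing, and ends right before any number whose predecessor is missing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nums (num INTEGER)")
conn.executemany("INSERT INTO nums VALUES (?)",
                 [(n,) for n in (1, 2, 4, 5, 8, 9, 13)])

# Gap starts: num + 1 is absent, excluding the largest number (nothing follows it).
range_starts = [r[0] for r in conn.execute("""
    SELECT t.num + 1
    FROM nums t
    LEFT JOIN nums t2 ON t.num + 1 = t2.num
    WHERE t.num < (SELECT MAX(num) FROM nums) AND t2.num IS NULL
    ORDER BY 1
""")]

# Gap ends: num - 1 is absent, excluding the smallest number (nothing precedes it).
range_ends = [r[0] for r in conn.execute("""
    SELECT t.num - 1
    FROM nums t
    LEFT JOIN nums t2 ON t.num - 1 = t2.num
    WHERE t.num > (SELECT MIN(num) FROM nums) AND t2.num IS NULL
    ORDER BY 1
""")]

gaps = list(zip(range_starts, range_ends))
print(gaps)
```

Pairing the two sorted lists gives the gaps 3 to 3, 6 to 7, and 10 to 12, which matches the expected output in the question.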

SQL Server NULL constraint

7 votes

Is it possible in SQL Server 2008 to create a constraint that prevents two columns from both being NULL at the same time? So that:

Column1 Column2
NULL    NULL   -- not allowed
1       NULL   -- allowed
NULL    2      -- allowed
2       3      -- allowed

ALTER TABLE MyTable WITH CHECK 
   ADD CONSTRAINT CK_MyTable_ColumNulls CHECK (Column1 IS NOT NULL OR Column2 IS NOT NULL)

Or, as part of the CREATE TABLE:

CREATE TABLE MyTable (
 Column1 int NULL,
 Column2 int NULL,

 CONSTRAINT CK_MyTable_ColumNulls CHECK (Column1 IS NOT NULL OR Column2 IS NOT NULL)
)
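
A quick way to convince yourself that the constraint behaves as the question's table requires is to try all four cases. The sketch below does that with Python's sqlite3 module, which is an assumption standing in for SQL Server (the CHECK syntax is the same in both). Because IS NOT NULL always evaluates to true or false, never to unknown, the usual rule that a CHECK accepts an unknown result does not get in the way here. The helper try_insert is hypothetical and exists only for this demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE MyTable (
        Column1 INTEGER,
        Column2 INTEGER,
        CONSTRAINT CK_MyTable_ColumNulls
            CHECK (Column1 IS NOT NULL OR Column2 IS NOT NULL)
    )
""")

def try_insert(c1, c2):
    """Return True if the row is accepted, False if the CHECK rejects it."""
    try:
        conn.execute("INSERT INTO MyTable VALUES (?, ?)", (c1, c2))
        return True
    except sqlite3.IntegrityError:
        return False

cases = [(None, None), (1, None), (None, 2), (2, 3)]
results = {pair: try_insert(*pair) for pair in cases}
for pair, accepted in results.items():
    print(pair, "allowed" if accepted else "not allowed")
```

Only the (NULL, NULL) row is rejected; the other three are accepted.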

SQL Joins: Future of the SQL ANSI Standard (where vs join)?

6 votes

We are developing ETL jobs and our consultant has been using "old style" SQL when joining tables:

select a.attr1, b.attr1
from table1 a, table2 b
where a.attr2 = b.attr2

instead of using the inner join clause:

select a.attr1, b.attr1
from table1 as a inner join table2 as b
   on a.attr2 = b.attr2

My question: in the long run, is there a risk in using the old "where join"? How long will this kind of join be supported and kept in the ANSI standard? Our platform is SQL Server, and my primary concern is that these "where joins" will no longer be supported in the future. If that happens, we will have to modify all our ETL jobs to use the "inner join" style.

I doubt that "where joins" would ever be unsupported. It's just not possible to not support them, because they are based on Cartesian products and simple filtering. They actually aren't joins.

But there are many reasons to use the newer join syntax. Among others:

  • Readability
  • Maintainability
  • Easier change to outer joins
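
The last point can be illustrated with a small sketch. It uses Python's sqlite3 module as a stand-in for SQL Server (an assumption), and the sample rows are made up. The comma-style query and the INNER JOIN query return the same rows, while turning the explicit join into an outer join is a one-keyword change that leaves the ON condition untouched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (attr1 TEXT, attr2 INTEGER)")
conn.execute("CREATE TABLE table2 (attr1 TEXT, attr2 INTEGER)")
conn.executemany("INSERT INTO table1 VALUES (?, ?)", [("a1", 1), ("a2", 2), ("a3", 3)])
conn.executemany("INSERT INTO table2 VALUES (?, ?)", [("b1", 1), ("b2", 2)])

# "Where join": Cartesian product filtered in the WHERE clause.
old_style = conn.execute("""
    SELECT a.attr1, b.attr1
    FROM table1 a, table2 b
    WHERE a.attr2 = b.attr2
    ORDER BY a.attr1
""").fetchall()

# Explicit join syntax: same result, join condition kept in ON.
inner = conn.execute("""
    SELECT a.attr1, b.attr1
    FROM table1 AS a INNER JOIN table2 AS b ON a.attr2 = b.attr2
    ORDER BY a.attr1
""").fetchall()

# Switching to an outer join only changes the join keyword.
left = conn.execute("""
    SELECT a.attr1, b.attr1
    FROM table1 AS a LEFT JOIN table2 AS b ON a.attr2 = b.attr2
    ORDER BY a.attr1
""").fetchall()

print(old_style == inner, left)
```

The unmatched row a3 appears only in the LEFT JOIN result, paired with NULL.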

Transform SQL table to XML with column as parent node

6 votes

I'm trying to transform a table to an XML structure, and I want one of the columns in my table to represent a parent node and the other column to represent a child node.

I have got part of the way but I don't have the complete solution. I need the TABLE_NAME column to become an XML parent node and the COLUMN_NAME column to become its child nodes. If I execute the following I get the nesting, but I also get multiple parent nodes.

select
 TABLE_NAME AS 'tn',
 COLUMN_NAME AS 'tn/cn'
from (
 select 'TABLE_A' AS TABLE_NAME, 'COLUMN_1' AS COLUMN_NAME
 UNION ALL
 select 'TABLE_A' AS TABLE_NAME, 'COLUMN_2' AS COLUMN_NAME
 UNION ALL
 select 'TABLE_B' AS TABLE_NAME, 'COLUMN_1' AS COLUMN_NAME
 UNION ALL
 select 'TABLE_B' AS TABLE_NAME, 'COLUMN_2' AS COLUMN_NAME
) x
for xml path(''), ROOT('datatable')

OUTPUT >>>

<datatable>
  <tn>TABLE_A<cn>COLUMN_1</cn></tn>
  <tn>TABLE_A<cn>COLUMN_2</cn></tn>
  <tn>TABLE_B<cn>COLUMN_1</cn></tn>
  <tn>TABLE_B<cn>COLUMN_2</cn></tn>
</datatable>

DESIRED OUTPUT >>>

<datatable>
  <TABLE_A>
   <cn>COLUMN_1</cn>
   <cn>COLUMN_2</cn>
  </TABLE_A>
  <TABLE_B>
    <cn>COLUMN_1</cn>
    <cn>COLUMN_2</cn>
  </TABLE_B>
</datatable>

Is this possible or am I dreaming? And is it possible without XML EXPLICIT, or is this the kind of thing EXPLICIT is there for?

The other possibility I've been trying is to stuff the xml and then apply an xquery, but no joy with that yet.

Thanks,

Gary

As others have mentioned, FOR XML doesn't allow you to dynamically name nodes. The node names have to be constants by the time the query itself is compiled. You can work around this with dynamic sql but then you end up with code that gets harder and harder to read.

An alternative would be to manually generate the table name nodes and CAST into XML:

Setup:

CREATE TABLE a (table_name VARCHAR(20), column_name VARCHAR(20))
INSERT INTO a VALUES ('TABLE_A', 'COLUMN_1')
INSERT INTO a VALUES ('TABLE_A', 'COLUMN_2')
INSERT INTO a VALUES ('TABLE_B', 'COLUMN_1')
INSERT INTO a VALUES ('TABLE_B', 'COLUMN_2')

Execute:

SELECT CAST(
      '<' + table_name + '>'
    + (SELECT c.column_name as 'CN'
         FROM a c
        WHERE c.table_name = p.table_name
       FOR XML PATH('')) 
    + '</' + table_name + '>'
    AS XML)
  FROM a p
GROUP BY p.table_name
FOR XML PATH(''), ROOT('datatable')

Produces:

<datatable>
  <TABLE_A>
    <CN>COLUMN_1</CN>
    <CN>COLUMN_2</CN>
  </TABLE_A>
  <TABLE_B>
    <CN>COLUMN_1</CN>
    <CN>COLUMN_2</CN>
  </TABLE_B>
</datatable>