Best sql questions in October 2010

How can I compare two sets of 1000 numbers against each other?

53 votes

I must check approximately 1000 numbers against 1000 other numbers.

I loaded both and compared them server-side:

foreach( $numbers1 as $n1 ) {
  foreach( $numbers2 as $n2 ) {
    if( $n1 == $n2 ) {
      doBla();
    }
  }
}

This took a long time, so I tried to do the same comparison client side using two hidden div elements. Then compared them using JavaScript. It still takes 45 seconds to load the page (using hidden div elements).

I do not need to load the numbers that are not the same.

Is there a faster algorithm? I am thinking of comparing them database side and just load the error numbers, then do an Ajax call for the remaining non-error numbers. But is a MySQL database fast enough?

Sort the lists first. Then you can walk up both lists from the start, comparing as you go.

The loop would look something like this:

var A = getFirstArray().sort(), B = getSecondArray().sort();

var i = 0, j = 0;
while (i < A.length && j < B.length) {
    if (A[i] === B[j]) {
        doBla(A[i]);
        i++; j++;
    }
    else if (A[i] < B[j]) {
        i++;
    }
    else
        j++;
}

(That's JavaScript; you could do it server-side too, but I don't know PHP.)

Edit — just to be fair to all the hashtable fans (whom I respect of course), it's pretty easy to do that in JavaScript:

var map = {};
for (var i = 0; i < B.length; ++i) map[B[i]] = true; // Assume integers.
for (var i = 0; i < A.length; ++i) if (map[A[i]]) doBla(A[i]);

Or if the numbers are or might be floats:

var map = {};
for (var i = 0; i < B.length; ++i) map['' + B[i]] = true; // Assume integers.
for (var i = 0; i < A.length; ++i) if (map['' + A[i]]) doBla(A[i]);

Since numbers are pretty cheap to hash (even in JavaScript, converting to string before hashing is surprisingly cheap), this would be pretty fast.

What are the Options for Storing Hierarchical Data in a Relational Database?

28 votes

Good Overviews

Generally speaking you're making a decision between fast read times (e.g. nested set) or fast write times (adjacency list). Usually you end up with a combination of the options below that best fit your needs. The following provides some in depth reading:

Options

Ones I am aware of and general features:

  1. Adjacency List:
    • Columns: ID, ParentID
    • Easy to implement.
    • Cheap node moves, inserts, and deletes.
    • Expensive to find level (can store as a computed column), ancestry & descendants (Bridge Hierarchy combined with level column can solve), path (Lineage Column can solve).
    • Use Common Table Expressions in those databases that support them to traverse.
  2. Nested Set (a.k.a Modified Preorder Tree Traversal)
    • First described by Joe Celko - covered in depth in his book Trees and Hierarchies in SQL for Smarties
    • Columns: Left, Right
    • Cheap level, ancestry, descendants
    • Compared to Adjacency List, moves, inserts, deletes more expensive.
    • Requires a specific sort order (e.g. created). So sorting all descendants in a different order requires additional work.
  3. Nested Intervals
    • Combination of Nested Sets and Materialized Path where left/right columns are floating point decimals instead of integers and encode the path information.
  4. Bridge Table (a.k.a. Closure Table: some good ideas about how to use triggers for maintaining this approach)
    • Columns: ancestor, descendant
    • Stands apart from table it describes.
    • Can include some nodes in more than one hierarchy.
    • Cheap ancestry and descendants (albeit not in what order)
    • For complete knowledge of a hierarchy needs to be combined with another option.
  5. Flat Table
    • A modification of the Adjacency List that adds a Level and Rank (e.g. ordering) column to each record.
    • Expensive move and delete
    • Cheap ancestry and descendants
    • Good Use: threaded discussion - forums / blog comments
  6. Lineage Column (a.k.a. Materialized Path, Path Enumeration)
    • Column: lineage (e.g. /parent/child/grandchild/etc...)
    • Limit to how deep the hierarchy can be.
    • Descendants cheap (e.g. LEFT(lineage, #) = '/enumerated/path')
    • Ancestry tricky (database specific queries)

Database Specific Notes

MySQL

Oracle

PostgreSQL

SQL Server

  • General summary
  • 2008 offers HierarchyId data type appears to help with Lineage Column approach and expand the depth that can be represented.

This is kind of a question that is still interesting even after all big 3 vendors implemented Recursive WITH clause. I'd suggest that different readers would be pleased with different answers.

  1. Comprehensive list of references by Troels Arvin (although it seems to be missing many recent fine articles mentioned in similar stackoverflow threads).
  2. For the lack of competition, introductory textbook by Joe Celko "Trees and Hierarchies in SQL for Smarties" can indeed be considered a classics.
  3. For mathematical sophistry and connections between various methods look up Tropashko publications.

The riddle of the working broken query.

13 votes

I was going through some old code that was written in years past by another developer at my organization. Whilst trying to improve this code, I discovered that the query it uses had a very bad problem.

  OdbcDataAdapter financialAidDocsQuery =
            new OdbcDataAdapter(
                @"SELECT   a.RRRAREQ_TREQ_CODE, 
                           b.RTVTREQ_SHORT_DESC, 
                           a.RRRAREQ_TRST_DESC, 
                           RRRAREQ_STAT_DATE,
                           RRRAREQ_EST_DATE,
                           a.RRRAREQ_SAT_IND, 
                           a.RRRAREQ_SBGI_CODE, 
                           b.RTVTREQ_PERK_MPN_FLAG, 
                           b.RTVTREQ_PCKG_IND, 
                           a.RRRAREQ_MEMO_IND,
                           a.RRRAREQ_TRK_LTR_IND, 
                           a.RRRAREQ_DISB_IND, 
                           a.RRRAREQ_FUND_CODE, 
                           a.RRRAREQ_SYS_IND
                  FROM     FAISMGR.RRRAREQ a, FAISMGR.RTVTREQ b
                  WHERE    a.RRRAREQ_TREQ_CODE = b.RTVTREQ_CODE
                           and a.RRRAREQ_PIDM = :PIDM
                           AND a.RRRAREQ_AIDY_CODE = :AidYear ",
                this.bannerOracle);
        financialAidDocsQuery.SelectCommand.Parameters.Add(":PIDM", OdbcType.Int, 32).Value = this.pidm;
        financialAidDocsQuery.SelectCommand.Parameters.Add(":AidYear", OdbcType.Int, 32).Value = this.aidYear;
        DataTable financialAidDocsResults = new DataTable();
        financialAidDocsQuery.Fill(financialAidDocsResults);
        FADocsGridView.DataSource = financialAidDocsResults;
        FADocsGridView.DataBind();

The problem is that the column a.RRRAREQ_TRST_DESC does not exist. A fact you learn very quickly when running it in Oracle SQL Developer.

The strange thing?

This code works.

The gridview binds successfully. (It doesn't try to bind to that field.) And it's been in production for years.

So, my question is...why? I've never seen a bad query work. I've never seen Oracle allow it or a data provider hack around it.

Does anyone have any idea what's going on here?

Hmmm...A few things to check:

  1. Does this code actually run? It may seem silly to suggest this, but there may be a newer file that replaced this one.

  2. Is an exception being squelched by your code? (Anyone who would name columns like that is definitely capable of squelching those pesky exceptions)

  3. Is the exception being squelched by 3rd party code? (Not as likely, but sometimes 3rd party code prefers to use annoying error codes instead of exceptions).

Past those suggestions, I'm not sure.

EDIT:

Revisiting the 2nd point, if you are working in ASP.NET, check that there is no global-level exception handler that is squelching exceptions. I ran into that problem on one site that I worked on and found dozens of exceptions in a single day.

How do I help a person who wants relational database data in a CSV format?

12 votes

The non-technical person who asks, "Can you just put the database in an Excel spreadsheet for me?" is so common it's almost a trope. (I could've sworn there was an xkcd about this, but I can't find it.) We regularly get customers asking for all their data in a CSV.

To say this person is non-technical is redundant, so I need to explain to them in a non-technical, friendly, non-condescending way that Excel is not designed to represent one-to-many relationships (without making them grok the idea of one-to-many).

If you have had personal experience with a particular strategy that has worked for you with non-technical people, I'd definitely like to hear it.

EDIT: It seems like most of the answers are leaning toward questioning the intent of the requestor. That, in itself, is a difficult thing to do nicely. Some of the answers are leaning toward just throwing data at a person and telling them to go away. That's not what I'm going for here. I'm looking to help this person, not make them go away. That's what I'm trying to get to here.

So, two part question: What do I tell a customer (i.e., someone whose needs I want to satisfy) who requests a CSV of their data for a.) backup, or b.) getting information out of the system.

Since there's no exact right answer here, substantiated answers are welcome.

Ask them to send you a blank spreadsheet with the column headings for what they want, and tell them you'll fill it in.

When the blank spreadsheet comes one of two things will be true:

  1. You will pretty easily be able to meet the requirements of the project (eg, no puzzling how to figure out which of 5 phone numbers is Phone1 and Phone2).

  2. You will have some questions (such as "I have anywhere from 0 to 25 phone numbers per person. How do you want me to choose which is Phone1 and which Phone2? And what do you want me to do with any extras?).

If the second is true, the client will either be able to provide you with answers or they will realize that there is a genuine problem representing the data they want in spreadsheet form and, hopefully, ask for your assistance in planning the next step forward.

Does every table really need an auto-incrementing artificial primary key?

8 votes

Almost every table in every database I've seen in my 7 years of development experience has an auto-incrementing primary key. Why is this? If I have a table of U.S. states where each state where each state must have a unique name, what's the use of an auto-incrementing primary key? Why not just use the state name as the primary key? Seems to me like an excuse to allow duplicates disguised as unique rows.

This seems plainly obvious to me, but then again, no one else seems to be arriving at and acting on the same logical conclusion as me, so I must assume there's a good chance I'm wrong.

Is there any real, practical reason we need to use auto-incrementing keys?

No.

In most cases, having a surrogate INT IDENTITY key is an easy option: it can be guaranteed to be NOT NULL and 100% unique, something a lot of "natural" keys don't offer - names can change, so can SSN's and other items of information.

In the case of state abbreviations and names - if anything, I'd use the two-letter state abbreviation as a key.

A primary key must be:

  • unique (100% guaranteed! Not just "almost" unique)
  • NON NULL

A primary key should be:

  • stable if ever possible (not change - or at least not too frequently)

State two-letter codes definitely would offer this - that might be a candidate for a natural key. A key should also be small - an INT of 4 bytes is perfect, a two-letter CHAR(2) column just the same. I would not ever use a VARCHAR(100) field or something like that as a key - it's just too clunky, most likely will change all the time - not a good key candidate.

So while you don't have to have an auto-incrementing "artificial" (surrogate) primary key, it's often quite a good choice, since no naturally occuring data is really up to the task of being a primary key, and you want to avoid having huge primary keys with several columns - those are just too clunky and inefficient.

Does the order of fields in a WHERE clause affect performance in MySQL?

8 votes

I have two indexed fields in a table - type and userid (individual indexes, not a composite).

types field values are very limited (let's say it is only 0 or 1), so 50% of table records have the same type. userid values, on the other hand, come from a much larger set, so the amount of records with the same userid is small.

Will any of these queries run faster than the other:

select * from table where type=1 and userid=5
select * from table where userid=5 and type=1

Also if both fields were not indexed, would it change the behavior?

SQL was designed to be a declarative language, not a procedural one. So the query optimizer should not consider the order of the where clause predicates in determining how to apply them.

I'm probably going to waaaay over-simplify the following discussion of an SQL query optimizer. I wrote one years ago, along these lines (it was tons of fun!). If you really want to dig into modern query optimization, see Dan Tow's SQL Tuning, from O'Reilly.

In a simple SQL query optimizer, the SQL statement first gets compiled into a tree of relational algebra operations. These operations each take one or more tables as input and produce another table as output. Scan is a sequential scan that reads a table in from the database. Sort produces a sorted table. Select produces a table whose rows are selected from another table according to some selection condition. Project produces a table with only certain columns of another table. Cross Product takes two tables and produces an output table composed of every conceivable pairing of their rows.

Confusingly, the SQL SELECT clause is compiled into a relational algebra Project, while the WHERE clause turns into a relational algebra Select. The FROM clause turns into one or more Joins, each taking two tables in and producing one table out. There are other relational algebra operations involving set union, intersection, difference, and membership, but let's keep this simple.

This tree really needs to be optimized. For example, if you have:

select E.name, D.name 
from Employee E, Department D 
where E.id = 123456 and E.dept_id = D.dept_id

with 5,000 employees in 500 departments, executing an unoptimized tree will blindly produce all possible combinations of one Employee and one Department (a Cross Product) and then Select out just the one combination that was needed. The Scan of Employee will produce a 5,000 record table, the Scan of Department will produce a 500 record table, the Cross Product of those two tables will produce a 2,500,000 record table, and the Select on E.id will take that 2,500,000 record table and discard all but one, the record that was wanted.

[Real query processors will try not to materialize all of these intermediate tables in memory of course.]

So the query optimizer walks the tree and applies various optimizations. One is to break up each Select into a chain of Selects, one for each of the original Select's top level conditions, the ones and-ed together. (This is called "conjunctive normal form".) Then the individual smaller Selects are moved around in the tree and merged with other relational algebra operations to form more efficient ones.

In the above example, the optimizer first pushes the Select on E.id = 123456 down below the expensive Cross Product operation. This means the Cross Product just produces 500 rows (one for each combination of that employee and one department). Then the top level Select for E.dept_id = D.dept_id filters out the 499 unwanted rows. Not bad.

If there's an an index on Employee's id field, then the optimizer can combine the Scan of Employee with the Select on E.id = 123456 to form a fast index Lookup. This means that only one Employee row is read into memory from disk instead of 5,000. Things are looking up.

The final major optimization is to take the Select on E.dept_id = D.dept_id and combine it with the Cross Product. This turns it into a relational algebra Equijoin operation. This doesn't do much by itself. But if there's an index on Department.dept_id, then the lower level sequential Scan of Department feeding the Equijoin can be turned into a very fast index Lookup of our one employee's Department record.

Lesser optimizations involve pushing Project operations down. If the top level of your query just needs E.name and D.name, and the conditions need E.id, E.dept_id, and D.dept_id, then the Scan operations don't have to build intermediate tables with all the other columns, saving space during the query execution. We've turned a horribly slow query into two index lookups and not much else.

Getting more towards the original question, let's say you've got:

select E.name 
from Employee E 
where E.age > 21 and E.state = 'Delaware'

The unoptimized relational algebra tree, when executed, would Scan in the 5,000 employees and produce, say, the 126 ones in Delaware who are older than 21. The query optimizer also has some rough idea of the values in the database. It might know that the E.state column has the 14 states that the company has locations in, and something about the E.age distributions. So first it sees if either field is indexed. If E.state is, it makes sense to use that index to just pick out the small number of employees the query processor suspects are in Delaware based on its last computed statistics. If only E.age is, the query processor likely decides that it's not worth it, since 96% of all employees are 22 and older. So if E.state is indexes, our query processor breaks the Select and merges the E.state = 'Delaware' with the Scan to turn it into a much more efficient Index Scan.

Let's say in this example that there are no indexes on E.state and E.age. The combined Select operation takes place after the sequential "Scan" of Employee. Does it make a difference which condition in the Select is done first? Probably not a great deal. The query processor might leave them in the original order in the SQL statement, or it might be a bit more sophisticated and look at the expected expense. From the statistics, it would again find that the E.state = 'Delaware' condition should be much more highly selective, so it would reverse the conditions and do that first, so that there are only 126 E.age > 21 comparisons instead of 5,000. Or it might realize that string equality comparisons are much more expensive than integer compares and leave the order alone.

At any rate, all this is very complex and your syntactic condition order is very unlikely to make a difference. I wouldn't worry about it unless you have a real performance problem and your database vendor uses the condition order as a hint.

8 votes

Is their any GUI-interface for MySQL, like Microsoft SQL Server 2005 Server Management Studio provides? Because basically I am operating MySQL with command prompt interface.

Mysql query browser:

http://dev.mysql.com/downloads/gui-tools/5.0.html

Database Design: replace a boolean column with a timestamp column?

7 votes

Earlier I have created tables this way:

create table workflow (
    id number primary key,
    name varchar2(100 char) not null,
    is_finished number(1) default 0 not null,
    date_finished date
);

Column is_finished indicates whether the workflow finished or not. Column date_finished is when the workflow was finished.

Then I had the idea "I don't need is_finished as I can just say: where data_finished is not null", and I designed without is_finished column:

create table workflow (
    id number primary key,
    name varchar2(100 char) not null,
    date_finished date
);

(We use Oracle 10)

Is it a good or bad idea? I've heard you can not have an index on a column with NULL values, so where data_finished is not null will be very slow on big tables.

There is a right way to index null values, and it doesn't use a FBI. Oracle will index null values, but it will NOT index null LEAF values in the tree. So, you could eliminate the column is_finished and create the index like this.

CREATE INDEX ON workflow (date_finished, 1);

Then, if you check the explain plan on this query:

SELECT count(*) FROM workflow WHERE date_finished is null;

You might see the index being used (if the optimizer is happy).

Back to the original question: looking at the variety of answers here, I think there is no right answer. I may have a personal preference to eliminate a column if it is unnecessary, but I also don't like overloading the meaning of columns either. There are two concepts here:

  1. The record has finished. is_finished
  2. The record finished on a particular date. date_finished

Maybe you need to keep these separate, maybe you don't. When I think about eliminating the is_finished column, it bothers me. Down the road, the situation may arise where the record finished, but you don't know precisely when. Perhaps you have to import data from another source and the date is unknown. Sure, that's not in the business requirements now, but things change. What do you do then? Well, you have to put some dummy value in the date_finished column, and now you've compromised the data a bit. Not horribly, but there is a rub there. The little voice in my head is shouting YOU'RE DOING IT WRONG when I do things like that.

My advice, keep it separate. You're talking about a tiny column and a very skinny index. Storage should not be an issue here.

Rule of Representation: Fold knowledge into data so program logic can be stupid and robust.

-Eric S. Raymond

How do I use OData Expand like a SQL join?

7 votes

I'm trying to figure out how to accomplish the equivalent of:

select *
from Users u
inner join Comments c on c.UserId = u.Id
where Id = 1569

(table aliases for better sql readability)

...on the StackOverflow OData endpoint. How would this url be constructed? I'm looking at the documentation for Expand at OData.org and I would have thought it'd look something like:

https://odata.sqlazurelabs.com/OData.svc/v0.1/rp1uiewita/StackOverflow/Users?$Expand=Comments&$filter=UserId eq 1569 but isn't right.

In Linq, it would be this (I think), but Join isn't supported:

Users.Where(u=>u.Id==1569).Join(Comments, u=>u.Id, c=>c.UserId, (a,b)=>a.Id==b.UserId)

I don't need to figure this out in Linq strictly, I'm just trying to figure out how to construct the query url. Basically, how can I translate the SQL join predicate to an OData url and do this in one call?

The right way to do this would be something like: http://odata.stackexchange.com/stackoverflow/atom/Users(1569)?$expand=Comments The problem is that there seem to be no users in the data source (don't know why). So the above query will return a 404. But it is the right syntax. The idea is that if you want information about just one user you "navigate" to it by using the /Users(1569) (the stuff in parethesis is the primary key of the entity set). Then if you also want to include all the comments, you simply add $expand=Comments. If you want just the comments and not the information about the user you can do /Users(1569)/Comments.

Note that the service you used doesn't define navigation properties, so the above won't work as "joins" are not really supported. But the stackexchange odata endpoint does have the navigation properties defined. Basically the joins are defined on the server/service so that the client doesn't have to know which column is a foreign key to wich primary key. It also helps with data sources which don't use relational databases as their storage, as it doesn't force them to create fake foreign keys.

Explanation of particular sql injection

7 votes

Browsing through the more dubious parts of the web, I happened to come across this particular SQL injection:

http://server/path/page.php?id=1+union+select+0,1,concat_ws(user(),0x3a,database(),0x3a,version()),3,4,5,6--

My knowledge of SQL - which I thought was half decent - seems very limiting as I read this.

Since I develop extensively for the web, I was curious to see what this code actually does and more importantly how it works.

It replaces an improperly written parametrized query like this:

$sql = '
SELECT  *
FROM    products
WHERE   id = ' . $_GET['id'];

with this query:

SELECT  *
FROM    products
WHERE   id = 1
UNION ALL
select 0,1,concat_ws(user(),0x3A,database(),0x3A,version()),3,4,5,6

, which gives you information about the database name, version and username connected.

SQL statement to get all customers with no orders

6 votes

I have a typical Persons table and an Orders table defined in such a way that I can do JOIN query as the following to return Orders for all Persons.

SELECT Persons.LastName, Persons.FirstName, Orders.OrderNo
FROM Persons
INNER JOIN Orders
ON Persons.id=Orders.Person_id

The question is, how do I write a statement that would return all Persons with NO Orders?

I'm using mysql.

Thank all in advance.

You may want to use LEFT JOIN and IS NULL:

SELECT     Persons.LastName, Persons.FirstName
FROM       Persons
LEFT JOIN  Orders ON Persons.id = Orders.Person_id
WHERE      Orders.Person_id IS NULL;

The result of a left join always contains all records of the "left" table (Persons), even if the join-condition does not find any matching record in the "right" table (Orders). When there is no match, the columns of the "right" table will NULL in the result set.

SQL And NULL Values in where clause

6 votes

So I have a simple query that returns a listing of products

SELECT     Model, CategoryID
FROM         Products
WHERE     (Model = '010-00749-01') 

This returns

010-00749-01    00000000-0000-0000-0000-000000000000
010-00749-01    NULL

Whitch is correct, so I wanted only get the products whose CategoryID is not '00000000-0000-0000-0000-000000000000' so I have

SELECT     Model, CategoryID
FROM         Products
WHERE     (Model = '010-00749-01') 
AND (CategoryID <> '00000000-0000-0000-0000-000000000000') 

But this returns no results. So I changed the query to

SELECT     Model, CategoryID
FROM         Products
WHERE     (Model = '010-00749-01') 
AND ((CategoryID <> '00000000-0000-0000-0000-000000000000') OR  (CategoryID  IS NULL))

Which returns expected result

010-00749-01    NULL

Can someone explain this behavior to me? MS SQL Server 2008

Check out the full reference on Books Online - by default ANSI_NULLS is on meaning you'd need to use the approach you have done. Otherwise, you could switch that setting OFF at the start of the query to switch the behaviour round.

When SET ANSI_NULLS is ON, a SELECT statement that uses WHERE column_name = NULL returns zero rows even if there are null values in column_name. A SELECT statement that uses WHERE column_name <> NULL returns zero rows even if there are nonnull values in column_name.
...
When SET ANSI_NULLS is ON, all comparisons against a null value evaluate to UNKNOWN. When SET ANSI_NULLS is OFF, comparisons of all data against a null value evaluate to TRUE if the data value is NULL.

Here's a simple example to demonstrate the behaviour with regard to comparisons against NULL:

-- This will print TRUE
SET ANSI_NULLS OFF;
IF NULL <> 'A'
    PRINT 'TRUE'
ELSE
    PRINT 'FALSE'

-- This will print FALSE
SET ANSI_NULLS ON;
IF NULL <> 'A'
    PRINT 'TRUE'
ELSE
    PRINT 'FALSE'

Excluding matches on JOIN fields that are NULL

6 votes

If you do a join that looks like this

SELECT T1.KeyField1, T1.KeyField2, T2.Field3
FROM T1 JOIN T2 ON T1.KeyField1 = T2.KeyField1 AND T1.KeyField2 = T2.KeyField2

Is there a way to not allow NULLS to match similar to the results this query would return

SELECT T1.KeyField1, T1.KeyField2, T2.Field3
FROM T1 JOIN T2 ON T1.KeyField1 = T2.KeyField1 AND T1.KeyField2 = T2.KeyField2
               AND T1.KeyField2 IS NOT NULL AND T2.KeyField2 IS NOT NULL

EDIT

I actually asked the question wrong.... Let me try again.

We are comparing an new data to old data and looking for records where the rows are exactly the same.

So both tables defined:

CREATE TABLE [Table](
    [Identifier] [int] IDENTITY(1,1) NOT NULL,
    [Key1] [varchar](50) NOT NULL,
    [Data1] [varchar](50) NULL,
    [Data2] [varchar](50) NULL

If I do the query:

DELETE
FROM T1 JOIN T2 ON T1.Key1 = T2.Key1 
               AND T1.Data1 = T2.Data2 AND T1.Data2 = T2.Data2

Give

T1 & T2

| Key1 | Data1       | Data2   |
| 1000 | 123 Main St | <NULL>  |
| 1001 | 456 High St | FLOOR 2 |

This would not remove the duplicate record 1000 from T1 since Data2 is NULL.

Outside of making use of a magic value in the join, is there any other way to compare these?

I understand that I should make the consultants rewrite the code to insert all NULLS as '', but this is a huge undertaking at this point. I am also looking at hashing the row to look for differences.

try using this:

SET ANSI_NULLS ON

http://msdn.microsoft.com/en-us/library/aa259229(SQL.80).aspx

EDIT

joining with "magic numbers" like:

ISNULL(T1.Field1, '-9999') = ISNULL(T2.Field2, '-9999') 

is the best you can do in your situation, and will most likely hurt the query performance significantly. I'd say the real issue is a design one, joining on NULLs is just plain strange to me.

SQL Server - pull X random records per state

6 votes

I have a table with records for each zip code in the united states. For the purposes of displaying on a map, I need to select X random records per state. How would I go about doing this?

Use:

WITH sample AS (
 SELECT t.*,
        ROW_NUMBER() OVER (PARTITION BY t.state
                               ORDER BY NEWID()) AS rank
   FROM ZIPCODES t)
SELECT s.*
  FROM sample s
 WHERE s.rank <= 5

Exclusive access could not be obtained because the database is in use

6 votes

I'm using following code to restore databases,

void Restore(string ConnectionString, string DatabaseFullPath, string backUpPath)
{
    string sRestore =
        "USE [master] RESTORE DATABASE [" + DatabaseFullPath + "] FROM DISK = N'" + backUpPath + "' WITH  FILE = 1,  NOUNLOAD,  STATS = 10";

    using (SqlConnection con = new SqlConnection(ConnectionString))
    {
        con.Open();
        SqlCommand cmdBackUp = new SqlCommand(sRestore, con);
        cmdBackUp.ExecuteNonQuery();
    }
}

but I receive below exception

"Exclusive access could not be obtained because the database is in use.
RESTORE DATABASE is terminating abnormally.
Changed database context to 'master'."

How can I fix it ?

A restore can only happen if the database does not have any connections to it. The easy way on a MS SQL Server to kick all users off is:

ALTER DATABASE [MyDB] SET Single_User WITH Rollback Immediate
GO

Now, you can perform your restore with impunity. Make sure you set it back to Multi-user mode when you're done with the restore:

ALTER DATABASE [MyDB] SET Multi_User
GO

Evaluating the mean absolute deviation of a set of numbers in Oracle

6 votes

I'm trying to implement a procedure to evaluate the median absolute deviation of a set of numbers (usually obtained via a GROUP BY clause).

An example of a query where I'd like to use this is:

select id, mad(values) from mytable group by id;

I'm going by the aggregate function example but am a little confused since the function needs to know the median of all the numbers before all the iterations are done.

Any pointers to how such a function could be implemented would be much appreciated.

In Oracle 10g+:

SELECT  MEDIAN(ABS(value - med))
FROM    (
        SELECT  value, MEDIAN(value) OVER() AS med
        FROM    mytable
        )

, or the same with the GROUP BY:

SELECT  id, MEDIAN(ABS(value - med))
FROM    (
        SELECT  id, value, MEDIAN(value) OVER(PARTITION BY id) AS med
        FROM    mytable
        )
GROUP BY
        id

Preventing SQL Injection in ASP.Net VB.Net

5 votes

I have this code

UPDATE OPENQUERY (db,'SELECT * FROM table WHERE ref = ''"+ Ref +"'' AND bookno = ''"+ Session("number") +"'' ') 

How would I prevent SQL Injections on this?

Thanks

Jamie

UPDATE

Here's what i'm trying

SqlCommand cmd = new SqlCommand("Select * from Table where ref=@ref", con); 
cmd.Parameters.AddWithValue("@ref", 34);

For some reason everything I try and add it doesn't seem to work I keep getting SQL Command mentioned below.

The error is this

'SqlCommand' is a type and cannot be used as an expression

I'm taking over someone else's work so this is all new to me and I would like do things the right way so if anyone can provide any more help on how to make my query above safe from SQL injections then please do.

Thanks

Jamie

UPDATE NO 2

I added in the code as VasilP said like this

Dim dbQuery As [String] = "SELECT * FROM table WHERE ref = '" & Tools.SQLSafeString(Ref) & "' AND bookno = '" & Tools.SQLSafeString(Session("number")) & "'"

But I get an error Tools is not declared do I need to specify a certain namespace for it to work?

Thanks

Jamie

UPDATE

Has anyone got any ideas on the best of getting my query safe from SQL injection without the errors that i'm experiencing?

Thanks

Jamie

UPDATE

I now have it so it work without the parameters bit here's my updated souce code any idea why it won't add the parameter value?

Dim conn As SqlConnection = New SqlConnection("server='server1'; user id='w'; password='w'; database='w'; pooling='false'")
   conn.Open()


Dim query As New SqlCommand("Select * from openquery (db, 'Select * from table where investor = @investor ') ", conn)
query.Parameters.AddWithValue("@investor", 69836)

dgBookings.DataSource = query.ExecuteReader
dgBookings.DataBind()

It works like this

Dim conn As SqlConnection = New SqlConnection("server='server1'; user id='w'; password='w'; database='w'; pooling='false'")
   conn.Open()


Dim query As New SqlCommand("Select * from openquery (db, 'Select * from table where investor = 69836') ", conn)

dgBookings.DataSource = query.ExecuteReader
dgBookings.DataBind()

The error i'm getting is this

An error occurred while preparing a query for execution against OLE DB provider 'MSDASQL'. 

And it's because it isn't replacing the @investor with the 69836

Any ideas?

SOLUTION

Here is how I solved my problem

Dim conn As SqlConnection = New SqlConnection("server='h'; user id='w'; password='w'; database='w'; pooling='false'")

conn.Open()

Dim query As New SqlCommand("DECLARE @investor varchar(10), @sql varchar(1000) Select @investor = 69836 select @sql = 'SELECT * FROM OPENQUERY(db,''SELECT * FROM table WHERE investor = ''''' + @investor + ''''''')' EXEC(@sql)", conn)

dgBookings.DataSource = query.ExecuteReader
dgBookings.DataBind()

Now I can write queries without the worry of SQL injection

Thanks everyone

Try using a parameterized query here is a link http://www.aspnet101.com/2007/03/parameterized-queries-in-asp-net/

Also, do not use OpenQuery... use the this to run the select

SELECT * FROM db...table WHERE ref = @ref AND bookno = @bookno

More articles describing some of your options:

http://support.microsoft.com/kb/314520

http://stackoverflow.com/questions/125457/what-is-the-t-sql-syntax-to-connect-to-another-sql-server


Edited

Note your original question was asking about distributed queries and Linked servers. This new statement does not reference a distributed query. I can only assume you are directly connecting to the database now. Here is an example that should work. Here is another reference site for using SqlCommand.Parameters

SqlCommand cmd = new SqlCommand("Select * from Table where ref=@ref", con); 
cmd.Parameters.Add("@ref", SqlDbType.Int);
cmd.Parameters["@ref"] = 34;

Edited:

Ok Jamie taylor I will try to answer your question again.

You are using OpenQuery becuase you are probably using a linked DB

Basically the problem is the OpenQuery Method takes a string you cannot pass a variable as part of the string you sent to OpenQuery.

You can format your query like this instead. The notation follows servername.databasename.schemaname.tablename. If you are using a linked server via odbc then omit databasename and schemaname, as illustrated below

    Dim conn As SqlConnection = New SqlConnection("your SQL Connection String")
    Dim cmd As SqlCommand = conn.CreateCommand()
    cmd.CommandText = "Select * db...table where investor = @investor"
    Dim parameter As SqlParameter = cmd.CreateParameter()
    parameter.DbType = SqlDbType.Int
    parameter.ParameterName = "@investor"
    parameter.Direction = ParameterDirection.Input
    parameter.Value = 34