Best sql questions in June 2012

Re-indexing huge database (the English Wikipedia) efficiently

11 votes

THE GIST

Before performing a massive 40+ GB import of the English Wikipedia, I had to temporarily remove indexes and auto-increment fields from three tables ('page', 'revision', and 'text') to handle the load. Now I have finally successfully imported the English Wikipedia to my local machine and created a local mirror (MediaWiki API). Yay!

However, I now need to re-create the indexes and auto-increment fields in less than a decade. Luckily, (1) I took plenty of screen-shots of the relevant tables in phpmyadmin before I removed the indexes and fields; (2) I can explain with extreme precision the steps I took before the import; and (3) this shouldn't be too difficult for anyone fluent in MySQL. Unfortunately, I have no expertise in MySQL whatsoever, so "baby steps" explanations would be extremely helpful.

PRECISELY WHAT I DID (PREPARING FOR THE IMPORT):

Steps 1, 2, 3: This image depicts the table page before I modified the field page_id by clicking 'Change' and un-checking 'Auto-Increment' (in preparation for importing). I performed precisely the same modifications for the field rev_id in table revision and old_id in table text but omitted the screen-shots to avoid redundancy.

table 'page' before modification of 'page_id'

Step 4: This image depicts the indexes for the table page before I dropped all of them.

indexes for table 'page' before I dropped them

Step 5: This image depicts the indexes for the table revision before I dropped all of them.

indexes for table 'revision' before I dropped them

Step 6: This image depicts the indexes for the table text before I dropped all of them.

indexes for table 'text' before I dropped them

WHAT I NEED NOW (RESTORING AFTER THE IMPORT):

I just need to restore the original indexes and auto-increment fields without waiting a hundred years.

Set-up details: PHP 5.3.8 (apache2handler), MySQL 5.5.16 (InnoDB), Apache 2.2.21, Ubuntu 12.04 LTS, MediaWiki 1.19.0 (private wiki)

I really like Wikipedia so I'll try to help.

You need to use a lot of

ALTER TABLE

Add primary keys

ALTER TABLE page ADD PRIMARY KEY (page_id);
ALTER TABLE revision ADD PRIMARY KEY (rev_id);
ALTER TABLE text ADD PRIMARY KEY (old_id);

Add auto increments back

ALTER TABLE page MODIFY COLUMN page_id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT;

I need the table descriptions for all tables before continuing. If rev_id and old_id are same definitions as page_id then:

ALTER TABLE revision MODIFY COLUMN rev_id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT;
ALTER TABLE text MODIFY COLUMN old_id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT;

Add unique keys

ALTER TABLE page ADD UNIQUE name_title(page_namespace, page_title);
ALTER TABLE revision ADD UNIQUE rev_page_id(rev_page, rev_id);

Other indexes

ALTER TABLE page ADD INDEX page_random(page_random);
ALTER TABLE page ADD INDEX page_len(page_len);
ALTER TABLE page ADD INDEX page_redirect_namespace(page_is_redirect, page_namespace, page_len);
ALTER TABLE revision ADD INDEX rev_timestamp(rev_timestamp);
ALTER TABLE revision ADD INDEX page_timestamp(rev_page, rev_timestamp);
ALTER TABLE revision ADD INDEX user_timestamp(rev_user, rev_timestamp);
ALTER TABLE revision ADD INDEX user_text_timestamp(rev_user_text, rev_timestamp);

Again, there may be column definitions that change this stuff. You need to provide the CREATE TABLE info.

SQL: Specified cast is not valid

11 votes

Note: I am looking for why this is happening and how to fix it, I am not looking for a workaround. This appears to be a server(SQL Server or Connection string) issue.

I have a program that is connected to a sql 2008 database (Database A) and I have inline sql that runs that has ints and strings returned and it works fine. But I have been asked to switch to another 2008 database (Database B) and now everything is coming back as a string and I am getting a specified cast is not valid from C# where when I am connected to the sql 2008 (Database A) it does not say this. This is a inline sql statement so the sql statement is not changing and the table schema of the database is the same. Its doing this on int primary keys Anyone have any ideas?

I originally thought the was a 2000 to 2008 issue but I now have the some problem on 2008 as well. Both databases are on the same instance of sql server these are the connection strings

Connection Strings

  Server=Server01\instance;Database=Fraud_Micah; Trusted_Connection=yes <- Server 2008 (this one does not)
  Server=Server02\instance;Database=Fraud; Trusted_Connection=yes <- Server 2008 (this one works)

Both databases are at DB compatibility_level of 100

The select Statement

select *, delimeter, file_filetype.LocalPath, ArchiveDir, EmailList
from file_importtable 
join file_filetype on file_importtable.FileTypeID = file_filetype.ID
where importsuccessdate is null and transferdate is not null
and remotediscoverdate is not null 
and OriginalFileName in ('Test987.xml.pgp')

fileTypeID is where its breaking -> InvalidCastException: Specified cast is not valid.

C# Code (Note reader is type SQLDataReader)

if (!(reader.IsDBNull(reader.GetOrdinal("FileTypeID"))))
{
    file.FileTypeID = reader.GetInt32(reader.GetOrdinal("FileTypeID"));
}

Here is the column definition: [FileTypeID] [int] NULL, there is no null values in the table.

I don't think the C# code comes from this, its a int? public int? FileTypeID { get; set; }

In debug mode: reader["FileTypeID"] -> "1" it is in fact a string but why when I connect to a 2008 database would it return a 1 instaed of a "1"

2008 Table A Def

[ProcessSuccessDate] [datetime] NULL,
[ProcessSuccessUser] [datetime] NULL,
[FileTypeID] [int] NULL,
[HoldDate] [datetime] NULL,

2008 Table B Def

ProcessSuccessDate] [datetime] NULL,
[ProcessSuccessUser] [datetime] NULL,
[FileTypeID] [int] NULL,
[HoldDate] [datetime] NULL,

file.FileTypeID = (int)reader["FileTypeID"]; yields the same result.

Doing a

     file.FileTypeID (int)reader.GetInt32(reader.GetOrdinal("FileTypeID"));

does work but I don't want to do that for every column that already should be coming back as a int also writing sql like this

     select Convert(int, FileTypeID) as FileTypeId, delimeter, file_filetype.LocalPath, ArchiveDir, EmailList

can get around the issue as well, however I want to know why I have to do this if I already set the type as a int in the table. I might as well put all the types as strings in the table. At this point I am not looking for a workaround I want to understand why its not working like it should be.

In the scheme of things I desided to delete the table and recreate it using

[script table as] -> [create]

from the same table that was giving me issues, then a reinserted all the same data. This resolved the issue. So I don't believe the DML of the Table changed and the data did not change either but this is what resolved my issue.

SQL selecting people you may know

11 votes

The question you're asking appears subjective and is likely to be closed.

I wasn't surprised when I saw above horrible warning while I was filling in title field.

I read almost every thread talking about friends of friends or mutual friends but I'm not sure I found the right solution that I want to do.

I'm sorry I'm not good at English nor SQL.

How can I find the right answer while being not good at both of languages?

I decided I have to ask. I won't let myself down for down-votes or any duplication warnings.

As I want the answer, I'll write down as sincerely as possible for any further similar problems can be helped.

I have a table for friend relations.

FRIEND (TABLE)
-----------------------------------
PLAYER_ID(PK,FK)   FRIEND_ID(PK,FK)
-----------------------------------
1                  2                 // 1 knows 2
2                  1                 // 2 knows 1
1                  3                 // 1 knows 3
2                  3                 // 2 knows 3
2                  4                 // 2 knows 4
2                  5                 // 2 knows 5 // updated
3                  5                 // 3 knows 5 // updated
1                  100
1                  200
1                  300
100                400
200                400
300                400

Both composite primary keys are also foreign keys from PLAYER table.

I asked and got answered from such nice people for "people know each other".

SQL view for acquaintance from table.

And I have a view like this.

ACQUAINTANCE (VIEW)
-----------------------------------
PLAYER_ID(PK,FK)   FRIEND_ID(PK,FK)
-----------------------------------
1                  2                 // 1 knows 2
2                  1                 // 2 knows 1

As you might be noticed, this relationships' business logic has following two purposes.

  1. One player can say he or she knows someone else.
  2. When both people say they know each other, they can be said as acquaintance.

And, now, I want to know is there any good way for

  1. Selecting other PLAYER_IDs
  2. With given PLAYER(PLAYER_ID) (say 1)
  3. Which each is one of `friends of given PLAYER's direct friends'
  4. Which each is not the PLAYER himself (excluding 1 -> 2 -> 1)
  5. Which each is not the PLAYER's direct friends (excluding 3 from 1 -> 2 -> 3 by 1 -> 3)
  6. Order by number of mutual friends if possible.

I think the Justin Niessner's answer in "people you may know" sql query is the closest path that I must follow.

Thanks in advance.

I'll close the thread if this subject is really duplicated and not necessary.

UPDATE --------------------------------------------------------------

for Raphaël Althaus's comment whose name is same with my future daughter (is it boy's name?),

3 is a candidate for friends of friends of 1 because

1 knows 2
2 knows 3

but excluded because

1 already knows 3

Basically I want to serve for the given player the

people he or she may know
which is not himself or herself // this is nothing but obvious
which each is not already known to himself

With above table

by 1 -> 2 -> 4 and 1 -> 3 -> 5

4 and 5 can be suggested for 1 as 'people you may know'

order by number of mutual friends will be perfect
but I don't think I can understand even if someone show me how. sorry.

Thank you.

UPDATE ---------------------------------------------------------------------

I think I must try step by step by myself from what I've learned FROM HERE WITH VARIOUS PEOPLE even if it's not the right answer. Please let me know if I'm doing anything wrong.

First of all, let me self join the FRIEND table itself.

SELECT *
FROM FRIEND F1 INNER JOIN FRIEND F2 ON F1.FRIEND_ID = F2.PLAYER_ID

prints

+-----------+-----------+-----------+-----------+
| PLAYER_ID | FRIEND_ID | PLAYER_ID | FRIEND_ID |
+-----------+-----------+-----------+-----------+
|         1 |         2 |         2 |         1 |
|         1 |         2 |         2 |         3 |
|         1 |         2 |         2 |         4 |
|         1 |         2 |         2 |         5 |
|         1 |         3 |         3 |         5 |
|         2 |         1 |         1 |         2 |
|         2 |         1 |         1 |         3 |
|         2 |         3 |         3 |         5 |
+-----------+-----------+-----------+-----------+

F2.FRIEND_ID only

SELECT F2.FRIEND_ID
FROM FRIEND F1 INNER JOIN FRIEND F2 ON F1.FRIEND_ID = F2.PLAYER_ID

prints

+-----------+
| FRIEND_ID |
+-----------+
|         1 |
|         3 |
|         4 |
|         5 |
|         5 |
|         2 |
|         3 |
|         5 |
+-----------+

for 1 only

SELECT F2.FRIEND_ID
FROM FRIEND F1 INNER JOIN FRIEND F2 ON F1.FRIEND_ID = F2.PLAYER_ID
WHERE F1.PLAYER_ID = 1;

prints

+-----------+
| FRIEND_ID |
+-----------+
|         1 |
|         3 |
|         4 |
|         5 |
|         5 |
+-----------+

not 1

SELECT F2.FRIEND_ID
FROM FRIEND F1 INNER JOIN FRIEND F2 ON F1.FRIEND_ID = F2.PLAYER_ID
WHERE F1.PLAYER_ID = 1 
AND F2.FRIEND_ID != 1;

prints

+-----------+
| FRIEND_ID |
+-----------+
|         3 |
|         4 |
|         5 |
|         5 |
+-----------+

not 1's direct knowns

SELECT F2.FRIEND_ID
FROM FRIEND F1 INNER JOIN FRIEND F2 ON F1.FRIEND_ID = F2.PLAYER_ID
WHERE F1.PLAYER_ID = 1
AND F2.FRIEND_ID != 1
AND F2.FRIEND_ID NOT IN (SELECT FRIEND_ID FROM FRIEND WHERE PLAYER_ID = 1);

prints

+-----------+
| FRIEND_ID |
+-----------+
|         4 |
|         5 |
|         5 |
+-----------+

I think I'm getting there.

UPDATE -----------------------------------------------------------------

Following paths added

1 -> 100 -> 400
1 -> 200 -> 400
1 -> 300 -> 400

And the last query prints (again)

+-----------+
| FRIEND_ID |
+-----------+
|         4 |
|         5 |
|         5 |
|       400 |
|       400 |
|       400 |
+-----------+

at last, I got the candidates: 4, 5, 400

Putting distinct surely work for the primary goal

SELECT DISTINCT F2.FRIEND_ID
FROM FRIEND F1 INNER JOIN FRIEND F2 ON F1.FRIEND_ID = F2.PLAYER_ID
WHERE F1.PLAYER_ID = 1
AND F2.FRIEND_ID != 1
AND F2.FRIEND_ID NOT IN (SELECT FRIEND_ID FROM FRIEND WHERE PLAYER_ID = 1);

prints

+-----------+
| FRIEND_ID |
+-----------+
|         4 |
|         5 |
|       400 |
+-----------+

And, now, ordering by mutual counts needed.

Here comes the number of mutual friends for each candidates.

+-----------+
| FRIEND_ID |
+-----------+
|         4 | 1 (1 -> 2 -> 4)
|         5 | 2 (1 -> 2 -> 5, 1 -> 3 -> 5)
|       400 | 3 (1 -> 100 -> 400, 1 -> 200 -> 400, 1 -> 300 -> 400)
+-----------+

How can I calculate and order by those number of mutual friends?

SELECT F2.FRIEND_ID, COUNT(*)
FROM FRIEND F1 INNER JOIN FRIEND F2 ON F1.FRIEND_ID = F2.PLAYER_ID
WHERE F1.PLAYER_ID = 1
AND F2.FRIEND_ID != 1
AND F2.FRIEND_ID NOT IN (SELECT FRIEND_ID FROM FRIEND WHERE PLAYER_ID = 1)
GROUP BY F2.FRIEND_ID;

prints

+-----------+----------+
| FRIEND_ID | COUNT(*) |
+-----------+----------+
|         4 |        1 |
|         5 |        2 |
|       400 |        3 |
+-----------+----------+

I got it!

SELECT F2.FRIEND_ID, COUNT(*) AS MFC
FROM FRIEND F1 INNER JOIN FRIEND F2 ON F1.FRIEND_ID = F2.PLAYER_ID
WHERE F1.PLAYER_ID = 1
AND F2.FRIEND_ID != 1
AND F2.FRIEND_ID NOT IN (SELECT FRIEND_ID FROM FRIEND WHERE PLAYER_ID = 1)
GROUP BY F2.FRIEND_ID
ORDER BY MFC DESC;

prints

+-----------+-----+
| FRIEND_ID | MFC |
+-----------+-----+
|       400 |   3 |
|         5 |   2 |
|         4 |   1 |
+-----------+-----+

Can anybody please confirm this? Is that query optimal? Any possible performance problem when make it as a view?

Thank you.

UPDATE --------------------------------------------------------------------------------------------

I created a view as

CREATE VIEW FOLLOWABLE AS
    SELECT F1.PlAYER_ID, F2.FRIEND_ID AS FOLLOWABLE_ID, COUNT(*) AS MFC
    FROM FRIEND F1 INNER JOIN FRIEND F2 ON F1.FRIEND_ID = F2.PLAYER_ID
    WHERE F2.FRIEND_ID != F1.PLAYER_ID
    AND F2.FRIEND_ID NOT IN (SELECT FRIEND_ID FROM FRIEND WHERE PLAYER_ID = F1.PLAYER_ID)
    GROUP BY F2.FRIEND_ID
    ORDER BY MFC DESC;

and tested.

mysql> select * from FOLLOWABLE;
+-----------+---------------+-----+
| PlAYER_ID | FOLLOWABLE_ID | MFC |
+-----------+---------------+-----+
|         1 |           400 |   3 |
|         1 |             5 |   2 |
|         2 |           100 |   1 |
|         2 |           200 |   1 |
|         2 |           300 |   1 |
|         1 |             4 |   1 |
+-----------+---------------+-----+
6 rows in set (0.01 sec)

mysql> select * from FOLLOWABLE WHERE PLAYER_ID = 1;
+-----------+---------------+-----+
| PlAYER_ID | FOLLOWABLE_ID | MFC |
+-----------+---------------+-----+
|         1 |           400 |   3 |
|         1 |             5 |   2 |
|         1 |             4 |   1 |
+-----------+---------------+-----+
3 rows in set (0.00 sec)

use this EDIT

SELECT `friend_id` AS `possible_friend_id`
FROM `friends`
WHERE `player_id` IN (        --selecting those who are known
    SELECT `friend_id`        --by freinds of #1
    FROM `friends`
    WHERE `player_id` = 1) 
AND `friend_id` NOT IN (      --but not those who are known by #1
    SELECT `friend_id`
    FROM `friends`
    WHERE `player_id` = 1)
AND NOT `friend_id` = 1       --and are not #1 himself
                              --if one is known by multiple people
                              --he'll be multiple time in the list
GROUP BY `possible_friend_id` --so we group
ORDER BY COUNT(*) DESC        --and order by amount of repeatings

Performance difference: condition placed at INNER JOIN vs WHERE clause

10 votes

Say I have a table order as

id | clientid | type | amount | itemid | date
---|----------|------|--------|--------|-----------
23 | 258      | B    | 150    | 14     | 2012-04-03
24 | 258      | S    | 69     | 14     | 2012-04-03
25 | 301      | S    | 10     | 20     | 2012-04-03
26 | 327      | B    | 54     | 156    | 2012-04-04
  • clientid is a foreign-key back to the client table
  • itemid is a foreign key back to an item table
  • type is only B or S
  • amount is an integer

and a table processed as

id | orderid | processed | date
---|---------|-----------|---------
41 | 23      | true      | 2012-04-03
42 | 24      | true      | 2012-04-03
43 | 25      | false     | <NULL>
44 | 26      | true      | 2012-04-05     

I need to get all the rows from order that for the same clientid on the same date have opposing type values. Keep in mind type can only have one of two values - B or S. In the example above this would be rows 23 and 24.

The other constraint is that the corresponding row in processed must be true for the orderid.

My query so far

SELECT c1.clientid,
       c1.date,
       c1.type,
       c1.itemid,
       c1.amount,
       c2.date,
       c2.type,
       c2.itemid,
       c2.amount

FROM   order c1
INNER JOIN order c2 ON c1.itemid    =  c2.itemid AND
                       c1.date      =  c2.date   AND
                       c1.clientid  =  c2.clientid AND
                       c1.type     <>  c2.type AND
                       c1.id        <  c2.id

INNER JOIN processed p1 ON p1.orderid   =  c1.id AND
                         p1.processed =  true
INNER JOIN processed p2 ON p2.orderid   =  c2.id AND
                         p2.processed =  true

QUESTION: Keeping the processed = true as part of the join clause is slowing the query down. If I move it to the WHERE clause then the performance is much better. This has piqued my interest and I'd like to know why.

The primary keys and respective foreign key columns are indexed while the value columns (value, processed etc) aren't.

Disclaimer: I have inherited this DB structure and the performance difference is roughly 6 seconds.

The reason that you're seeing a difference is due to the execution plan that the planner is putting together, this is obviously different depending on the query (arguably, it should be optimising the 2 queries to be the same and this may be a bug). This means that the planner thinks it has to work in a particular way to get to the result in each statement.

When you do it within the JOIN, the planner will probably have to select from the table, filter by the "True" part, then join the result sets. I would imagine this is a large table, and therefore a lot of data to look through, and it can't use the indexes as efficiently.

I suspect that if you do it in a WHERE clause, the planner is choosing a route that is more efficient (ie. either index based, or pre filtered dataset).

You could probably make the join work as fast (if not faster) by adding an index on the two columns (not sure if included columns and multiple column indexes are supported on Postgres yet).

In short, the planner is the problem it is choosing 2 different routes to get to the result sets, and one of those is not as efficient as the other. It's impossible for us to know what the reasons are without the full table information and the EXPLAIN ANALYZE information.

If you want specifics on why your specific query is doing this, you'll need to provide more information. However the reason is the planner choosing different routes.

Additional Reading Material:

http://www.postgresql.org/docs/current/static/explicit-joins.html

Just skimmed, seems that the postgres planner doesn't re-order joins to optimise it. try changing the order of the joins in your statement to see if you then get the same performance... just a thought.

How To perform a SQL Query to DataTable Operation That Can Be Cancelled

10 votes

I tried to make the title as specific as possible. Basically what I have running inside a backgroundworker thread now is some code that looks like:

 SqlConnection conn = new SqlConnection(connstring);
                    SqlCommand cmd = new SqlCommand(query, conn);
                    conn.Open();
                    SqlDataAdapter sda = new SqlDataAdapter(cmd);
                    sda.Fill(Results);
                    conn.Close();
                    sda.Dispose();

Where query is a string representing a large, time consuming query, and conn is the connection object.

My problem now is I need a stop button. I've come to realize killing the backgroundworker would be worthless because I still want to keep what results are left over after the query is canceled. Plus it wouldn't be able to check the canceled state until after the query.

What I've come up with so far:

I've been trying to conceptualize how to handle this efficiently without taking too big of a performance hit.

My idea was to use a SqlDataReader to read the data from the query piece at a time so that I had a "loop" to check a flag I could set from the GUI via a button. The problem is as far as I know I can't use the Load() method of a datatable and still be able to cancel the sqlcommand. If I'm wrong please let me know because that would make cancelling slightly easier.

In light of what I discovered I came to the realization I may only be able to cancel the sqlcommand mid-query if I did something like the below (pseudo-code):

while(reader.Read())
{
 //check flag status
 //if it is set to 'kill' fire off the kill thread

 //otherwise populate the datatable with what was read
}

However, it would seem to me this would be highly ineffective and possibly costly. Is this the only way to kill a sqlcommand in progress that absolutely needs to be in a datatable? Any help would be appreciated!

There are really two stages where cancelling matters:

  1. Cancelling the initial query execution before the first rows are returned
  2. Aborting the process of reading the rows as they are served

Depending on the nature of the actual sql statement, either of these steps could be 99% of the time, so they both should be considered. For example, calling SELECT * on some table with a billion rows will take essentionally no time to execute but will take a very long time read. Conversely, requesting a super complicated join on poorly tuned tables and then wrapping it all in some aggregating clauses may take minutes to execute but negligible time to read the handful of rows once they are actually returned.

Well-tuned advanced database engines will also cache chunks of rows at a time for complicated queries, so you will see alternating pauses where the engine is executing the query on the next batch of rows and then fast bursts of data as it returns the next batch of results.

Cancelling the query execution

In order to be able to cancel a query while it is executing you can use one of the overloads of SqlCommand.BeginExecuteReader to start the query, and call SqlCommand.Cancel to abort it. Alternatively you can call ExecuteReader() syncronously in one thread and still call Cancel() from another. I'm not including code examples because there are plenty of them in the documentation.

Aborting the read operation

Here using a simple boolean flag is probably the easiest way. And remember it's really easy to fill a data table row using the Rows.Add() overload that takes an array of object, that is:

object[] buffer = new object[reader.FieldCount]
while(reader.Read()) {
    if(cancelFlag) break;
    reader.GetValues(buffer);
    dataTable.Rows.Add(buffer);
}

Cancelling blocking calls to Read()

A sort of mixed case occurs when, as mentioned earlier, a call to reader.Read() causes the database engine to do another batch of intensive processing. As noted in the MSDN documentation, calls to Read() can be blocking in this case even if the original query was executed with BeginExecuteReader. You can still get around this by calling Read() in one thread that's handling all the reading but calling Cancel() in another thread. The way you know if you reader is in a blocking Read call is to have another flag the the reader thread updates while the monitoring thread reads:

...
inRead = true
while(reader.Read()) {
    inRead = false
    ...
    inRead = true
}

// Somewhere else:
private void foo_onUITimerTick(...) {
   status.Text = inRead ? "Waiting for server" : "Reading";
}

Regarding performance of Reader vs Adapter

A DataReader is usually faster than using DataAdapter.Fill(). The whole point of a DataReader is to be really, really fast and responsive for reading. Checking some boolean flag once per row would not add a measurable difference in time even over millions of rows.

The limiting factor for a big database query is not the local CPU processing time but the size of the I/O pipe (your network connection for a remote database or your disk speed for a local one) or a combination of the db server's own disk speed and CPU processing time for a complex query. Both a DataAdapter and a DataReader will spend time (perhaps the majority of the time) just waiting for a few nanoseconds at a time for the next row to be served.

One convenience of DataAdapter.Fill() is that it does the magic of dynamically generating the DataTable columns to match the query results, but that's not difficult to do yourself (see SqlDataReader.GetSchemaTable()).

Complex SQL query (newbie)

8 votes

I have a MySQL table containing the points (x/y coordinates) of tracks. Each row contains the TrackID, a Timestamp, and the X and Y Positions for that track at that given point in time.

What I want is a list of all TrackIDs that were active during a given time interval (tmin...tmax), sorted by their start-time, even if that start time is outside the interval.

A little illustration might help:

Illustration

As an example: Track 1 is active from t11 till t12, which means I have many rows in my table with ID=1 and with timestamps ranging from t11 to t12.

The desired output would be:

TrackID | StartTime
--------+-----------
    7   |    t71
    1   |    t11
    2   |    t21
    6   |    t61

I tried something like this:

SELECT TrackID, MIN(Timestamp) AS StartTime FROM Tracks WHERE Timestamp BETWEEN tmin AND tmax GROUP BY TrackID ORDER BY StartTime;

However, in the example above I don't get the real start times for tracks 1 and 7, since all rows with timestamps less than tmin are not considered at all.

Of course I could in a first step just get all active TrackIDs with

SELECT TrackID FROM Tracks WHERE Timestamp BETWEEN tmin AND tmax GROUP BY TrackID;

and then with separate queries find the start times of all these tracks and then sort them in my application code.

But I'm sure there is a way to do this with one SQL query. My table contains millions of rows, so efficiency is an issue here.

I hope someone can show me a solution (and explain it, since I'd like to learn SQL better...)!

One way to think about it is to construct the logic to handle your four special cases in your diagram. These two rules should suffice.

  1. tend > tmin AND
  2. tstart < tmax

If any of these two conditions are true, then the track should be included. You will need a list of tracks as in your second query with their min and max values, and then perform the comparisons:

SELECT T.TrackID
  FROM (SELECT TrackID, MIN(Timestamp) AS StartTime, MAX(Timestamp) AS EndTime
        FROM Tracks GROUP BY TrackID) T
 WHERE T.EndTime > tmin AND T.StartTime < tmax

Why an union is faster than a group by

8 votes

Well, maybe I am too old school and I would like to understand the following.

query 1.

select count(*), gender from customer
group by gender

query 2.

select count(*), 'M' from customer
where gender ='M'
union
select count(*), 'F' from customer
where gender ='F'

the 1st query is simpler, but for some reason in the profiler,when I execute both at the same time, it says that query 2 uses 39% of the time, and query 1, 61%.

I would like to understand the reason, maybe I have to rewrite all my queries.

Your query 2 is actually a nice trick. It works like this: You have an index on gender. The DBMS can seek into that index two times to get two ranges of rows (one for M and one for F). It doesn't need to read anything from these rows, just that they exist. It can count the number of rows that exist in the two ranges.

In the first query the DBMS needs to decode the rows to read the gender, then it needs to either sort the rows or build a hashtable to aggregate them. That is more expensive than just counting rows.

Parameter doesn't perform as well as hard coding the value

8 votes

I have a stored procedure that performs terribly. When I declare a variable, set its value and then use it in the where clause the statement takes over an hour to run. When I hard code the variables in the where clause it runs in less than a second.

I started to look into what was wrong with it through execution plans. It looks like when I try and pass it some declared variables the execution plan crates some Hash Match because it selects values from a view that uses a UNION and a common table expression.

/*************   Begin of Stored Procedure ***************/
CREATE PROCEDURE GetFruit
  @ColorId bigint,
  @SeasionId bigint
WITH RECOMPILE
AS
BEGIN

SELECT
    A.Name
FROM
    [Apple_View] A   /* This is the view down below */
    INNER JOIN [Fruit] F
        ON ( F.ColorId = @ColorId
            AND A.FruitId = F.FruitId)          
WHERE
    (A.ColorId = @ColorId
    AND 
    A.SeasonId = @SeasonId)

END
/************* End of Stored Procedure   ***************/

/************* Begin of View   ***************/
WITH Fruits (FruitId, ColorId, SeasonId) AS
(
    -- Anchor member
    SELECT
        F.FruitId
        ,F.ColorId
        ,F.SeasonId
    FROM
        ((  
            SELECT DISTINCT
                EF.FruitId
                ,EF.ColorId
                ,EF.SeasonId
                ,EF.ParentFruitId
            FROM
                ExoticFruit EF
                INNER JOIN Fruit FR
                    ON FR.FruitId = EF.FruitId
        UNION
            SELECT DISTINCT
                SF.FruitId
                ,SF.ColorId
                ,SF.SeasonId
                ,SF.ParentFruitId               
            FROM
                StinkyFruit SF
                INNER JOIN Fruit FR
                    ON FR.FruitId = SF.FruitId
        UNION
            SELECT DISTINCT
                CF.FruitId
                ,CF.ColorId
                ,CF.SeasonId
                ,CF.ParentFruitId
            FROM
                CrazyFruit CF
                INNER JOIN Fruit FR
                    ON FR.FruitId = CF.FruitId

            )) f

    UNION ALL

    -- Recursive Parent Fruit
    SELECT 
        FS.FruitId
        ,FS.ColorId
        ,FS.SeasonId
        ,FS.ParentFruitId
    FROM
        Fruits FS
        INNER JOIN MasterFruit MF
            ON  MF.[ParentFruitId] = fs.[FruitId]
)

SELECT DISTINCT
    FS.FruitId
    ,FS.ColorId
    ,FS.SeasonId
    FROM
        Fruits FS

/************* End of View   ***************/


/* To Execute */
EXEC GetFruit 1,3

If I run the Stored Procedure using the set values it takes over an hour and here is the execution plan. With Variables

If I run the Stored Procedure removing the DECLARE and SET values and just set the Where clause to the following statement it runs in less than a second and here is the execution plan:

WHERE(A.ColorId = 1 AND  A.SeasonId = 3)

hard coded where clause

Notice how the hard coded variables uses indexing while the first uses a hash set. Why is that? Why are hard coded values in the where clause working different from the declared variables?

-------this is what finally performed with the help of @user1166147------

I changed the stored procedure to use sp_executesql.

CREATE PROCEDURE GetFruit
  @ColorId bigint,
  @SeasionId bigint
WITH RECOMPILE
AS
BEGIN

DECLARE @SelectString nvarchar(max)

SET @SelectString = N'SELECT
    A.Name
FROM
    [Apple_View] A   /* This is the view down below */
    INNER JOIN [Fruit] F
        ON ( F.ColorId = @ColorId
            AND A.FruitId = F.FruitId)          
WHERE
    (A.ColorId = ' + CONVERT(NVARCHAR(MAX), @ColorId) + '
    AND 
    A.SeasonId = ' + CONVERT(NVARCHAR(MAX), @SeasonId) + ')'

EXEC sp_executesql @SelectString

END

EDIT SUMMARY Per a request from Damien_The_Unbeliever

The goal is to get best/most information about the variable value to SQL BEFORE the plan is created, generally parameter sniffing does this. There may be a reason that parameter sniffing was 'disabled' in this case. Without seeing a better representation of the actual code we can't really say what the solution is or why the problem exists. Try the things below to force the affected areas to generate plans using actual values.

*LONG VERSION WITH MORE DETAIL *

Is this your actual stored proc? Do you have default values for your parameters? If so, what are they?

Parameter sniffing can help - but it has to have typical parameters values to create the plan well, and if not, won't really help or will create a bad plan based off of the non typical parameter value. So if a variable has a default value of null or a value that is not a typical value the first time it is run and the plan compiled - it creates a bad plan.

If someone else wrote this sproc - they may have intentionally 'disabled' parameter sniffing with the local variables for a reason. Business rules may require these variable structures.

The goal is to get best/most information about the variable value to SQL BEFORE the plan is created, and generally Parameter Sniffing does this. But there are things that can make it affect performance negatively, and that may be why it is 'disabled'. It still seems like the plan is being created with atypical values for the parameters or not enough info still - using parameter sniffing or not.

Try calling the query inside the sproc with Use sp_executesql to execute the affected queries, forcing it to generate a plan for that area with the actual variables, and see if it's better. This may be your solution if you have to have this sort of irregular parameter value - create stored procs that run the affected parts and call them later from within the stored procedure - after the variable has received a typical value.

Without seeing a better representation of the actual code, it is hard to see what the problem is. Hopefully this info will help -

Mysql return more than one row

7 votes

I have this query.

SELECT tickets.id,source.name as 'source_name',flow_stage.title as 'flow status',tickets.user_name as user_created,tickets.created,tickets.rel_client_id,td_doc_nr.value_string as 'document number',source.source_code,
    IF(!ISNULL(td_doc_nr.value_string),
     (SELECT  GROUP_CONCAT(product_name SEPARATOR ',') from documents d 
        join document_bundles b on b.document_id = d.id
        join document_products p on p.doc_bundle_id = b.id
        join document_product_cstm_fields f on f.doc_product_id = p.id
        join document_product_cstm_field_data fd on fd.cstm_field_id = f.id
        where d.doc_nr = td_doc_nr.value_string
        and value_string ='auto')
,NULL) as test
FROM tickets tickets
JOIN tickets_flow_stages flow_stage ON  flow_stage.id  = tickets.flow_stage_id 
JOIN tickets_sources source ON source.id = tickets.source_id
    LEFT JOIN tickets_custom_fields tf_doc_nr ON tf_doc_nr.name = 'document_number' AND tf_doc_nr.rel_entity_id = source.id
    LEFT JOIN tickets_custom_field_data td_doc_nr ON td_doc_nr.rel_entity_id = tickets.id AND
 td_doc_nr.field_instance_id = tf_doc_nr.id 
WHERE tickets.source_id in  (114,122,125,129,131) AND tickets.status = 1  

enter image description here

I added this subquery to the 'if statement'

 SELECT  GROUP_CONCAT(product_name SEPARATOR ',') from documents d 
    join document_bundles b on b.document_id = d.id
    join document_products p on p.doc_bundle_id = b.id
    join document_product_cstm_fields f on f.doc_product_id = p.id
    join document_product_cstm_field_data fd on fd.cstm_field_id = f.id
    where d.doc_nr = 'C9000100781'
    and value_string ='auto'

enter image description here After this I added a new column.

 SELECT  GROUP_CONCAT(product_name SEPARATOR ','),GROUP_CONCAT(DISTINCT b.msisdn SEPARATOR ',') from documents d 
    join document_bundles b on b.document_id = d.id
    join document_products p on p.doc_bundle_id = b.id
    join document_product_cstm_fields f on f.doc_product_id = p.id
    join document_product_cstm_field_data fd on fd.cstm_field_id = f.id
    where d.doc_nr = 'C9000100781'
    and value_string ='auto'

enter image description here It returns two columns. How can I return two columns?Is it possible? :) Thanks

Final result I got an error 'Operand should contain 1 column(s)' enter image description here

A subquery inside an IF statement can't return multiple columns. You will need to join the subquery into the results, and pull out the two separate columns individually:

SELECT ...
    IF(!ISNULL(td_doc_nr.value_string), sub.one, NULL) as one,
    IF(!ISNULL(td_doc_nr.value_string), sub.two, NULL) as two
FROM ...
LEFT JOIN (
    SELECT  d.doc_nr, GROUP_CONCAT(product_name SEPARATOR ','),GROUP_CONCAT(DISTINCT b.msisdn SEPARATOR ',') from documents d 
    join document_bundles b on b.document_id = d.id
    join document_products p on p.doc_bundle_id = b.id
    join document_product_cstm_fields f on f.doc_product_id = p.id
    join document_product_cstm_field_data fd on fd.cstm_field_id = f.id
    where value_string ='auto'
    group by d.doc_nr
) sub on sub.doc_nr = td_doc_nr.value_string

What exactly is "pinning" in relation to indexes in DBMS?

7 votes

As part of a test question, students were asked to define and describe pinning in relation to indexes. When one talks about "pinning" in indexes, what exactly is this? Is there some other word/term I can search for, as google has not provided any solutions.

When a table or index is pinned, it means it is kept in memory.

The database has a certain amount of memory allocated to work with. Typically, the database caches recently used data.

When an index is pinned, it means the index is kept in memory permanently, rather than being aged out of the cache.

LISTAGG equivalent with windowing clause

7 votes

In oracle, the LISTAGG function allows me to use it analytically with a OVER (PARTITION BY column..) clause. However, it does not support use of windowing with the ROWS or RANGE keywords.

I have a data set from a store register (simplified for the question). Note that the register table's quantity is always 1 - one item, one transaction line.

TranID TranLine ItemId OrderID Dollars Quantity
------ -------- ------ ------- ------- --------
1      101      23845  23      2.99    1
1      102      23845  23      2.99    1
1      103      23845  23      2.99    1
1      104      23845  23      2.99    1
1      105      23845  23      2.99    1

I have to "match" this data to a table in an special order system where items are grouped by quantity. Note that the system can have the same item ID on multiple lines (components ordered may be different even if the item is the same).

ItemId OrderID Order Line Dollars Quantity
------ ------- ---------- ------- --------
23845  23      1          8.97    3
23845  23      2          5.98    2

The only way I can match this data is by order id, item id and dollar amount.

Essentially I need to get to the following result.

ItemId OrderID Order Line Dollars Quantity Tran ID  Tran Lines
------ ------- ---------- ------- -------- -------  ----------
23845  23      1          8.97    3        1        101;102;103
23845  23      2          5.98    2        1        104;105

I don't specifically care if the tran lines are ordered in any way, all I care is that the dollar amounts match and that I don't "re-use" a line from the register in computing the total on the special order. I don't need the tran lines broken out into a table - this is for reporting purposes and the granularity never goes back down to the register transaction line level.

My initial thinking was that I can do this with analytic functions to do a "best match" to identify the the first set of rows that match to the dollar amount and quantity in the ordering system, giving me a result set like:

TranID TranLine ItemId OrderID Dollars Quantity CumDollar  CumQty
------ -------- ------ ------- ------- -------- --------   ------
1      101      23845  23      2.99    1        2.99       1
1      102      23845  23      2.99    1        5.98       2
1      103      23845  23      2.99    1        8.97       3
1      104      23845  23      2.99    1        11.96      4
1      105      23845  23      2.99    1        14.95      5

So far so good. But I then try to add LISTAGG to my query:

SELECT tranid, tranline, itemid, orderid, dollars, quantity, 
       SUM(dollars) OVER (partition by tranid, itemid, orderid order by tranline) cumdollar,
       SUM(quantity) OVER (partition by tranid, itemid, orderid order by tranline) cumqty
       LISTAGG (tranline) within group (order by tranid, itemid, orderid, tranline) OVER (partition by tranid, itemid, orderid)
FROM table

I discover that it always returns a full agg instead of a cumulative agg:

TranID TranLine ItemId OrderID Dollars Quantity CumDollar  CumQty ListAgg
------ -------- ------ ------- ------- -------- --------   ------ -------
1      101      23845  23      2.99    1        2.99       1      101;102;103;104;105
1      102      23845  23      2.99    1        5.98       2      101;102;103;104;105
1      103      23845  23      2.99    1        8.97       3      101;102;103;104;105
1      104      23845  23      2.99    1        11.96      4      101;102;103;104;105
1      105      23845  23      2.99    1        14.95      5      101;102;103;104;105

So this isn't useful.

I would much prefer to do this in SQL if at all possible. I am aware that I can do this with cursors & procedural logic.

Is there any way to do windowing with the LISTAGG analytic function, or perhaps another analytic function which would support this?

I'm on 11gR2.

The only way I can think of to achieve this is with a correlated subquery:

WITH CTE AS
(   SELECT  TranID, 
            TranLine, 
            ItemID, 
            OrderID, 
            Dollars, 
            Quantity, 
            SUM(dollars) OVER (PARTITION BY TranID, ItemID, OrderID ORDER BY TranLine) AS CumDollar, 
            SUM(Quantity) OVER (PARTITION BY TranID, ItemID, OrderID ORDER BY TranLine) AS CumQuantity
    FROM    T
)
SELECT  TranID, 
        TranLine, 
        ItemID, 
        OrderID, 
        Dollars, 
        Quantity, 
        CumDollar, 
        CumQuantity, 
        (   SELECT  LISTAGG(Tranline, ';') WITHIN GROUP(ORDER BY CumQuantity)
            FROM    CTE T2
            WHERE   T1.CumQuantity >= T2.CumQuantity
            AND     T1.ItemID = T2.ItemID
            AND     T1.OrderID = T2.OrderID
            AND     T1.TranID = T2.TranID
            GROUP BY tranid, itemid, orderid
        ) AS ListAgg
FROM    CTE T1;

I realise this doesn't give the exact output you were asking for, but hopefully it is enough to overcome the problem of the cumulative LISTAGG and get you on your way.

I've set up an SQL Fiddle to demonstrate the solution.

Consider the following two EXPLAINs:

EXPLAIN SELECT * FROM sales WHERE title != 'The'

id  select_type table   type    possible_keys   key key_len ref rows    Extra
1   SIMPLE      sales   ALL      title        NULL  NULL    NULL    41707   Using where

And -

EXPLAIN SELECT * FROM sales WHERE title = 'The'
id  select_type table   type    possible_keys   key key_len ref rows    Extra
1   SIMPLE      sales   ref      title         title    767 const   1   Using where 

Why does the != query have a NULL key? Why doesn't it use title? What causes a = statement to be able to utilize an index but not a !=?

There is no point on using the index unless title is exactly 'The' very frequently.

Since almost every row needs to be selected you don't gain anything from using an index. It can actually be costly to use an index, which is probably what your MySQL engine is determining, so it is opting not to use the index.

Compare the amount of work done in these two situations:

Using the index:

1) Read the entire index tree into memory.
2) Search the index tree for the value 'The' and filter out those entries.
3) Read every row except for the few exceptions (which probably are in the same blocks on the disk as rows that do need to be read, so really the whole table is likely to be read in) from the table into memory.

Without the index:

1) Read every row into memory and while reading them filter out any where title = 'The' from the result set

SQL Inner Join. ON condition vs WHERE clause

7 votes

I am busy converting a query using the old style syntax to the new join syntax. The essence of my query is as follows :

Original Query

SELECT i.*  
FROM 
  InterestRunDailySum i, 
  InterestRunDetail ird, 
  InterestPayments p
WHERE 
   p.IntrPayCode = 187
   AND i.IntRunCode = p.IntRunCode AND i.ClientCode = p.ClientCode
   AND ird.IntRunCode = p.IntRunCode AND ird.ClientCode = p.ClientCode

New Query

SELECT  i.*
  FROM InterestPayments p
    INNER JOIN InterestRunDailySum i 
      ON (i.IntRunCode = p.IntRunCode AND i.ClientCode = p.ClientCode)
    INNER JOIN InterestRunDetail ird 
      ON (ird.IntRunCode = p.IntRunCode AND ird.IntRunCode = p.IntRunCode)
  WHERE 
    p.IntrPayCode = 187

In this example, "Original Query" returns 46 rows, where "New Query" returns over 800

Can someone explain the difference to me? I would have assumed that these queries are identical.

The problem is with your join to InterestRunDetail. You are joining on IntRunCode twice.

The correct query should be:

SELECT  i.*
  FROM InterestPayments p
    INNER JOIN InterestRunDailySum i 
      ON (i.IntRunCode = p.IntRunCode AND i.ClientCode = p.ClientCode)
    INNER JOIN InterestRunDetail ird 
      ON (ird.IntRunCode = p.IntRunCode AND ird.ClientCode = p.ClientCode)
  WHERE 
    p.IntrPayCode = 187

How to remove duplicating rows from union statement

7 votes

OK - I have looked and looked and found a lot of examples but nothing quite meeting my need. Maybe I used the wrong words to search with, but I could use your help. I will provide as much detail as I can.

I need to produce a report that merges fields from two tables, or rather a view and a table, into one table for a report. Here is the statement I am using:

SELECT A.ConfInt, A.Conference, 
        NULL as Ordered,
        NULL as Approved,
        NULL as PickedUp,
        SUM(dbo.Case_Visit_Payments.Qty) AS Qty
FROM         dbo.Conferences as A INNER JOIN
                      dbo.Case_Table ON A.ConfInt = dbo.Case_Table.Conference_ID INNER JOIN
                      dbo.Case_Visit_Payments ON dbo.Case_Table.Case_ID = dbo.Case_Visit_Payments.Case_ID
WHERE     (dbo.Case_Visit_Payments.Item_ID = 15 AND A.ProjectCool = 1)
GROUP BY A.Conference, A.ConfInt
UNION
SELECT  B.ConfInt, 
        B.Conference, 
        SUM(dbo.Cool_Fan_Order.NumberOfFansRequested) AS Ordered, 
        SUM(dbo.Cool_Fan_Order.Qty_Fans_Approved) AS Approved, 
        SUM(dbo.Cool_Fan_Order.Qty_Fans_PickedUp) AS PickedUp, 
        NULL AS Qty
FROM         dbo.Conferences as B LEFT OUTER JOIN
                      dbo.Cool_Fan_Order ON B.ConfInt = dbo.Cool_Fan_Order.Conference_ID
where B.ProjectCool = 1
GROUP BY B.Conference, B.ConfInt

And here are the results:

4   Our Lady        NULL    NULL    NULL    11
4   Our Lady        40      40      40      NULL
7   Holy Rosary     20      20      20      NULL
11  Little Flower   NULL    NULL    NULL    21
11  Little Flower   5       5       20      NULL
19  Perpetual Help  NULL    NULL    NULL    2
19  Perpetual Help  20      20      20      NULL

What I would strongly prefer is to not have the duplicating rows, such as:

4   Our Lady        40      40      40      11
7   Holy Rosary     20      20      20      NULL
11  Little Flower   5       5       20      21
19  Perpetual Help  20      20      20      2

I hope this question was clear enough. Any Suggestions would be greatly appreciated. And I do mark as answered. :)

Gregory

The quick answer is to wrap your query inside another one,

SELECT ConfInt
     , Conference
     , SUM(Ordered) AS Ordered
     , SUM(Approved) As Approved
     , SUM(PickedUp) AS PickedUp
     , SUM(Qty) AS Qty
  FROM (

       <your UNION query here>  

      )
 GROUP BY ConfInt, Conference

This is not the only way to achieve the result set, but its the quickest fix to meet the specified requirements.

As an alternative, I believe these queries will return equivalent results:

We could use a correlated subquery in the SELECT list to get Qty:

;WITH q AS
      ( SELECT B.ConfInt
             , B.Conference
             , SUM(o.NumberOfFansRequested) AS Ordered
             , SUM(o.Qty_Fans_Approved) AS Approved
             , SUM(o.Qty_Fans_PickedUp) AS PickedUp
          FROM dbo.Conferences as B
          LEFT
          JOIN dbo.Cool_Fan_Order o ON o.Conference_ID = B.ConfInt
         WHERE B.ProjectCool = 1
         GROUP BY B.ConfInt, B.Conference
      )
 SELECT q.ConfInt
      , q.Conference
      , q.Ordered
      , q.Approved
      , q.PickedUp
      , ( SELECT SUM(v.Qty)
            FROM dbo.Case_Table t
            JOIN dbo.Case_Visit_Payments v ON v.Case_ID = t.Case_ID
           WHERE t.Conference_ID = q.ConfInt
             AND v.Item_ID = 15
        ) AS Qty
   FROM q
  ORDER BY q.ConfInt, q.Conference

Or, we could use LEFT JOIN operation on the two queries, rather than UNION. (We know that the query referencing Cool_Fan_Order can be the LEFT side of the outer join, because we know that it returns at least as many rows as the other query. (Basically, we know that the other query can't return values of ConfInt and Conference that aren't in the Cool_Fan_Order query.)

;WITH p AS 
      ( SELECT A.ConfInt
             , A.Conference
             , SUM(v.Qty) AS Qty
          FROM dbo.Conferences as A
          JOIN dbo.Case_Table t ON t.Conference_ID = A.ConfInt
          JOIN dbo.Case_Visit_Payments v ON v.Case_ID = t.Case_ID
         WHERE A.ProjectCool = 1
           AND v.Item_ID = 15
         GROUP BY A.ConfInt, A.Conference
      )
    , q AS
      ( SELECT B.ConfInt
             , B.Conference
             , SUM(o.NumberOfFansRequested) AS Ordered
             , SUM(o.Qty_Fans_Approved) AS Approved
             , SUM(o.Qty_Fans_PickedUp) AS PickedUp
          FROM dbo.Conferences as B
          LEFT
          JOIN dbo.Cool_Fan_Order o ON B.ConfInt = o.Conference_ID
         WHERE B.ProjectCool = 1
         GROUP BY B.ConfInt, B.Conference
      )
 SELECT q.ConfInt
      , q.Conference
      , q.Ordered
      , q.Approved
      , q.PickedUp
      , p.Qty
   FROM q
   LEFT
   JOIN p ON p.ConfInt = q.ConfInt AND p.Conference = q.Conference
  ORDER BY q.ConfInt, q.Conference

The choice between those three (they all return an equivalent resultset under all conditons), boils down to readability and maintainability, and performance. On large enough rowsets, there may be some observable performance differences between the three statements.

How To Slow Down A SQL Query?

7 votes

As strange as it sounds I need to slow down a SQL query. Currently I'm using Microsoft SQL Server 2008 R2 on an in-house development server with the AdventureWorks database. I'm in the process of testing some code and the queries that I'm running are too fast no matter what I try!

Basically I'm testing a cut-off feature and need a sufficiently long query to be able to cut it off before it completes.

Unfortunately as it is a local installation there isn't a single query or large enough table in the AdventureWorks database to actually give me good data to work with. I've tried

WAITFOR DELAY '01:00'

Which worked great to just test to make sure it was working, but now I need to test to see if I can cut the data stream off mid-read. The WAITFOR statement doesn't do me justice in that respect because I need it to actively be retrieving data back from the server. My first intuition was to use convoluted calculations to slow it down, however even having SQL server multiply all the numerical values in the query by themselves 37 times only slowed down the query by milliseconds. The second thing I tried was embedding the WAITFOR statement in a sub-query but it appears you can't do that. Finally, the only thing I haven't tried is to execute multiple stored procedures and WAITFOR in between them, but I don't think that would work for what I need.

I have to say, I'm impressed at how hard it is to make an absolutely terrible query when you're this close to the server.

Is there any way I can slow down a query easily?

Thank you!

Just do a load of cross joins.

SELECT T1.*
FROM SomeTable T1,  
     SomeTable T2,  
     SomeTable T3,  
     SomeTable T4

For a 1,000 row table that will generate 1,000 billion rows which should be plenty slow enough.

Will SQL update a record if the new values are the same?

6 votes

Will SQL update a record if there is no change to the record?

For examople, is it more efficient to replace

UPDATE TABLE myTable 
Set Col1 = ISNULL(Col1,'')
...
Set Col100 = ISNULL(Col30,'')

with

UPDATE TABLE myTable 
Set Col1 = ISNULL(Col1,'')
...
Set Col100 = ISNULL(Col30,'')
WHERE 
Col1 IS NULL OR
...
Col30 IS NULL

Yes, it will attempt overwrite.

Why same query giving two different result?

6 votes

I have created two tables & inserted values as shown below .

Table 1

create table maxID (myID varchar(4));

insert into maxID values ('A001');
insert into maxID values ('A002');
insert into maxID values ('A004');
insert into maxID values ('A003');

Table 2

create table maxID2 (myID varchar(4) PRIMARY KEY);

insert into maxID2 values ('A001');
insert into maxID2 values ('A002');
insert into maxID2 values ('A004');
insert into maxID2 values ('A003');

When I execute query

SELECT myId, @rowid:=@rowid+1 as myrow 
FROM maxID, (SELECT @rowid:=0) as init
ORDER BY myrow desc
LIMIT 1;

I get output as

+++++++++++++
myid + myrow
+++++++++++++
A003 + 4
+++++++++++++

AND

When I execute query

SELECT myId, @rowid:=@rowid+1 as myrow 
FROM maxID2, (SELECT @rowid:=0) as init
ORDER BY myrow desc
LIMIT 1;

I get output as

+++++++++++++
myid + myrow
+++++++++++++
A004 + 4
+++++++++++++

The difference between two table is that in second table I have myID as PRIMARY KEY.

You can view above data/ result at www.sqlfiddle.com.

My Question is

Why I am getting two different results when query is same?

NOTE : This question is bit related to my old question Getting last record from mysql, where I almost got the answer and Yak informed me that the order of rows are not guaranteed. :(

This is because when the selected fieldset is totally included into a given index fieldset, this index is used to retrieve the data instead of the fullscan result.

Since indexes have a default sorting order, when raw tables datas don't, the extracted data using the index is therefore appearing in a different order than if it were from a full table scan.

In your case the 4th row is indeed the 4th one when you are using the primary key because internally mysql (oracle, sql server...) organized it this way to seek the data faster.

Please note that by chance, you might have obtained the same result in both queries, just because the default selected result order is indeed not proven to be related to the inserted order.

At last, let me warn you that if you planned to add an index with a specific order in mysql (such as described here) so the rows are retrieved in, let say, the DESC order, you could not do it since it is not an allowed feature yet in mysql:

An index_col_name specification can end with ASC or DESC. These keywords are permitted for future extensions for specifying ascending or descending index value storage. Currently, they are parsed but ignored; index values are always stored in ascending order.

Use of GROUP BY twice in MySQL

6 votes

My table looks like this.

Location    Head    Id  IntTime
1           AMD     1   1
2           INTC    3   3
3           AMD     2   2
4           INTC    4   4
5           AMD2    1   0
6           ARMH    5   1
7           ARMH    5   0
8           ARMH    6   1
9           AAPL    7   0
10          AAPL    7   1

Location is the primary key. I need to GROUP BY Head and by Id and when I use GROUP BY, I need to keep the row with the smallest IntTime.

After the first GROUP BY Id, I should get (I keep the smallest IntTime)

Location    Head    Id  IntTime
2           INTC    3   3
3           AMD     2   2
4           INTC    4   4
5           AMD2    1   0
7           ARMH    5   0
8           ARMH    6   1
9           AAPL    7   0

After the second GROUP BY Head, I should get (I keep the smallest IntTime)

Location    Head    Id  IntTime
2           INTC    3   3
3           AMD     2   2
5           AMD2    1   0
7           ARMH    5   0
9           AAPL    7   0

When I run the following command, I keep the smallest IntTime but the rows are not conserved.

SELECT Location, Head, Id, MIN(IntTime) FROM test 
GROUP BY Id

Also, to run the second GROUP BY, I save this table and do again

SELECT Location, Head, Id, MIN(IntTime) FROM test2 
GROUP BY Head

Is there a way to combine both commands?

[Edit: clarification] The result should not contain two Head with the same value or two Id with the same value. When deleting those duplicates, the row with the smallest IntTime should be kept.

This query returns the exact final results you're looking for (example):

SELECT `final`.*
FROM `tableName` AS `final`
JOIN (
    SELECT `thead`.`Id`, `Head`, MIN(`intTime`) AS `min_intTime`
    FROM `tableName` AS `thead`
    JOIN (
        SELECT `Id`, MIN(intTime) as min_intTime
        FROM `tableName` AS `tid`
        GROUP BY `Id`
    ) `tid`
    ON `tid`.`Id` = `thead`.`Id`
    AND `tid`.`min_intTime` = `thead`.`intTime`
    GROUP BY `Head`
) `thead`
ON `thead`.`Head` = `final`.`Head`
AND `thead`.`min_intTime` = `final`.`intTime`
AND `thead`.`Id` = `final`.`Id`

How it works

The innermost query groups by Id, and returns the Id and corresponding MIN(intTime). Then, the middle query groups by Head, and returns the Head and corresponding Id and MIN(intTime). The final query returns all rows, after being narrowed down. You can think of the final (outermost) query as a query on a table with only the rows you want, so you can do additional comparisons (e.g. WHERE final.intTime > 3).

Finding repeated occurrences with ranking functions

6 votes

Please help me generate the following query i've been struggling with for some time now. Lets' say I have a simple table with month number and information whether there were any failed events in this particular month

Below a script to generate sample data:

WITH DATA(Month, Success) AS
(
    SELECT  1, 0 UNION ALL
    SELECT  2, 0 UNION ALL
    SELECT  3, 0 UNION ALL
    SELECT  4, 1 UNION ALL
    SELECT  5, 1 UNION ALL
    SELECT  6, 0 UNION ALL
    SELECT  7, 0 UNION ALL
    SELECT  8, 1 UNION ALL
    SELECT  9, 0 UNION ALL
    SELECT 10, 1 UNION ALL
    SELECT 11, 0 UNION ALL
    SELECT 12, 1 UNION ALL
    SELECT 13, 0 UNION ALL
    SELECT 14, 1 UNION ALL
    SELECT 15, 0 UNION ALL
    SELECT 16, 1 UNION ALL
    SELECT 17, 0 UNION ALL
    SELECT 18, 0
)

Given the definition of a "repeated failure ":

When event failure occurs during at least 4 months in any 6 months period then the last month with such failure is a "repeated failure" my query should return the following output

Month   Success RepeatedFailure
1       0   
2       0   
3       0   
4       1   
5       1   
6       0       R1
7       0       R2
8       1   
9       0   
10      1   
11      0       R3
12      1   
13      0   
14      1   
15      0   
16      1   
17      0
18      0       R1

where:

  • R1 -1st repeated failure in month no 6 (4 failures in last 6 months).
  • R2 -2nd repeated failure in month no 7 (4 failures in last 6 months).
  • R3 -3rd repeated failure in month no 11 (4 failures in last 6 months).

R1 -again 1st repeated failure in month no 18 because Repeated Failures should be again numbered from the beginning when new Repeated Failure occurs for the first time in last 6 reporting periods

Repeated Failures are numerated consecutively because based on its number i must apply appropriate multiplier:

  • 1st repated failure - X2
  • 2nd repeated failure - X4
  • 3rd and more repeated failure -X5.

I'm sure this can be improved, but it works. We essentially do two passes - the first to establish repeated failures, the second to establish what kind of repeated failure each is. Note that Intermediate2 can definitely be done away with, I've only separated it out for clarity. All the code is one statement, my explanation is interleaved:

;WITH DATA(Month, Success) AS
-- assuming your data  as defined (with my edit)
,Intermediate AS 
(
SELECT
    Month,
    Success,
    -- next column for illustration only
    (SELECT SUM(Success) 
     FROM DATA hist 
     WHERE curr.Month - hist.Month BETWEEN 0 AND 5) 
        AS SuccessesInLastSixMonths,
    -- next column for illustration only
    6 - (SELECT SUM(Success) 
     FROM DATA hist 
     WHERE curr.Month - hist.Month BETWEEN 0 AND 5) 
        AS FailuresInLastSixMonths,
    CASE WHEN 
            (6 - (SELECT SUM(Success) 
                    FROM DATA hist 
                    WHERE curr.Month - hist.Month BETWEEN 0 AND 5)) 
            >= 4 
            THEN 1
            ELSE 0 
    END AS IsRepeatedFailure
FROM DATA curr 
-- No real data until month 6
WHERE curr.Month > 5
)

At this point we have established, for each month, whether it's a repeated failure, by counting the failures in the six months up to and including it.

,Intermediate2 AS
(
SELECT 
    Month,
    Success,
    IsRepeatedFailure,
    (SELECT SUM(IsRepeatedFailure) 
        FROM Intermediate hist 
        WHERE curr.Month - hist.Month BETWEEN 0 AND 5) 
        AS RepeatedFailuresInLastSixMonths
FROM Intermediate curr
)

Now we have counted the number of repeated failures in the six months leading up to now

SELECT
    Month,
    Success,
    CASE IsRepeatedFailure 
        WHEN 1 THEN 'R' + CONVERT(varchar, RepeatedFailuresInLastSixMonths) 
        ELSE '' END
    AS RepeatedFailureText
FROM Intermediate2

so we can say, if this month is a repeated failure, what cardinality of repeated failure it is.

Result:

Month       Success     RepeatedFailureText
----------- ----------- -------------------------------
6           0           R1
7           0           R2
8           1           
9           0           
10          1           
11          0           R3
12          1           
13          0           
14          1           
15          0           
16          1           
17          0           
18          0           R1

(13 row(s) affected)

Performance considerations will depend on on how much data you actually have.

How to get Max id from table of database in java code

5 votes

I want to write code which give max id from the table but it is throwing error.

code:

Class.forName("oracle.jdbc.driver.OracleDriver");
Connection con = DriverManager.getConnection("XXXXX", "XXXX", "XXX");
Statement st2 = con.createStatement();
ResultSet idMax = st2.executeQuery("select nvl(max(work_id),0) from workdetails");
int id2 = idMax.getInt(0);  // throw error: Invalid column index

System.out.println(id2);

// ****************************
int id2 = idMax.getInt("work_id");
System.out.println(id2);   // throw error: ResultSet.next was not called

A result set starts at a dummy record and should be advanced to the real first record by calling the next method :

ResultSet idMax = st2.executeQuery("select nvl(max(work_id),0) max_id from workdetails");
int id2 = -1;
if (idMax.next()) {
   id2 = idMax.getInt("max_id");  
}