Best sql questions in January 2011

I'm not sure if I have the correct indexes or if I can improve the speed of my query in MySQL?

13 votes

My query has a join, and it looks like it's using two indexes which makes it more complicated. I'm not sure if I can improve on this, but I thought I'd ask.

The query produces a list of records with keywords similar to those of the record being queried.

Here's my query.

SELECT match_keywords.padid,
       COUNT(match_keywords.word) AS matching_words
FROM   keywords current_program_keywords
       INNER JOIN keywords match_keywords
         ON match_keywords.word = current_program_keywords.word
WHERE  match_keywords.word IS NOT NULL
       AND current_program_keywords.padid = 25695
GROUP  BY match_keywords.padid
ORDER  BY matching_words DESC
LIMIT  0, 11  

The EXPLAIN output (image not available).

Word is varchar(40).

You can start by trying to remove the IS NOT NULL test, which is implicitly removed by COUNT on the field. It also looks like you would want to omit 25695 from match_keywords, otherwise 25695 (or other) would surely show up as the "best" match within your 11 row limit?

SELECT     match_keywords.padid,
           COUNT(match_keywords.word) AS matching_words
FROM       keywords current_program_keywords
INNER JOIN keywords match_keywords
        ON match_keywords.word = current_program_keywords.word
WHERE      current_program_keywords.padid = 25695
GROUP BY   match_keywords.padid
ORDER BY   matching_words DESC
LIMIT      0, 11

Next, consider how you would do it as a person.

  • You would start with a padid (25695) and retrieve all the words for that padid
  • From that list of words, go back into the table again and, for each matching word, get their padid's (assumed to have no duplicate on padid + word)
  • group the padid's together and count them
  • order the counts and return the highest 11

With your list of 3 separate single-column indexes, the first two steps (both involve only 2 columns) will always have to jump from index back to data to get the other column. Covering indexes may help here - create two composite indexes to test

create index ix_keyword_pw on keywords(padid, word);
create index ix_keyword_wp on keywords(word, padid);

With these composite indexes in place, you can remove the single-column indexes on padid and word since they are covered by these two.

Note: You always have to temper SELECT performance against

  • size of indexes (the more you create the more to store)
  • insert/update performance (the more indexes, the longer it takes to commit since it has to update the data, then update all indexes)
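
As a rough way to experiment with the composite-index suggestion, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL. The sample rows are made up, the extra padid <> filter is the "omit 25695" idea from the answer, and the query plan SQLite prints will not match MySQL's EXPLAIN output.

```python
import sqlite3

# Hypothetical sample data: (padid, word) pairs; 25695 is the record being queried.
rows = [
    (25695, "backup"), (25695, "sync"), (25695, "cloud"),
    (100, "backup"), (100, "sync"), (100, "cloud"),
    (200, "backup"), (200, "photo"),
    (300, "music"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE keywords (padid INTEGER NOT NULL, word VARCHAR(40) NOT NULL)")
# The two composite (covering) indexes suggested in the answer.
conn.execute("CREATE INDEX ix_keyword_pw ON keywords (padid, word)")
conn.execute("CREATE INDEX ix_keyword_wp ON keywords (word, padid)")
conn.executemany("INSERT INTO keywords VALUES (?, ?)", rows)

query = """
SELECT   match_keywords.padid,
         COUNT(match_keywords.word) AS matching_words
FROM     keywords current_program_keywords
JOIN     keywords match_keywords
      ON match_keywords.word = current_program_keywords.word
WHERE    current_program_keywords.padid = ?
  AND    match_keywords.padid <> ?   -- leave out the record itself
GROUP BY match_keywords.padid
ORDER BY matching_words DESC, match_keywords.padid
LIMIT    11
"""
matches = conn.execute(query, (25695, 25695)).fetchall()
print(matches)

# Show which indexes SQLite chooses for each side of the self-join.
for step in conn.execute("EXPLAIN QUERY PLAN " + query, (25695, 25695)):
    print(step[-1])
```

On a real MySQL server the equivalent check is to rerun EXPLAIN after creating the indexes and compare the plan against the original one.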

SQL LIMIT returns no results where no LIMIT returns results

11 votes
SELECT * FROM mm_tfs 
WHERE product_slug LIKE '%football%' 
    AND schoolid = '8' AND category_id ='21'
LIMIT 4

Returns 4 values like I'm asking, but the following statement returns 0 - is there a rule about using the OR statement that I'm not familiar with? My assumption is that it should return all of the values in 1 (or more if it weren't for the limit).

SELECT * FROM mm_tfs 
WHERE (product_slug LIKE '%football%' 
   AND schoolid = '8' AND category_id ='21') 
OR (product_description LIKE '%football%' 
   AND schoolid = '8' AND category_id ='21') 
LIMIT 4

NOTE by cyberkiwi: The first OR portion of Q2 is exactly the same as the WHERE clause of Q1.

product_description LIKE '%football%' AND schoolid = '8' AND category_id ='21'

This second condition by itself, without the OR, does produce the desired result as long as the query has no LIMIT. When the OR statement is used with the LIMIT as well, 0 rows are returned.

SELECT * FROM mm_tfs 
WHERE product_description LIKE '%football%' 
    AND schoolid = '8' AND category_id ='21'
LIMIT 4

^-- That produces 0 results

SELECT * FROM mm_tfs 
WHERE product_description LIKE '%football%' 
    AND schoolid = '8' AND category_id ='21'

^-- That produces results

The strangest part of this is that all of these queries produce the right result in my phpMyAdmin SQL query window, but not in the application itself.

You are repeating some of the conditions which is not necessary. Try this instead:

SELECT * FROM mm_tfs 
WHERE 
   (product_slug LIKE '%football%' 
 OR product_description LIKE '%football%')
   AND schoolid = '8' 
   AND category_id ='21'
LIMIT 4

UPDATE:

I have created the following table:

create table mm_tfs2 (schoolid varchar(2), categoryid varchar(2), 
                      description varchar(20), slug varchar(20));

Then I ran this insert 5 times:

insert into mm_tfs2 values (8, 21, '', 'football');

And finally the query:

select * from mm_tfs2 where 
(slug like '%football%' and schoolid = 8 and categoryid = 21) 
or (description like '%football%' and schoolid = 8 and categoryid = 21) 
limit 4;
+----------+------------+-------------+----------+
| schoolid | categoryid | description | slug     |
+----------+------------+-------------+----------+
| 8        | 21         |             | football |
| 8        | 21         |             | football |
| 8        | 21         |             | football |
| 8        | 21         |             | football |
+----------+------------+-------------+----------+
4 rows in set (0.00 sec)

So I'm sorry to say that I'm not able to recreate the problem.

SQL Server silently truncates varchar's in stored procedures

10 votes

According to this forum discussion, SQL Server (I'm using 2005 but I gather this also applies to 2000 and 2008) silently truncates any varchars you specify as stored procedure parameters to the length of the varchar, even if inserting that string directly using an INSERT would actually cause an error. For example, if I create this table:

CREATE TABLE testTable(
    [testStringField] [nvarchar](5) NOT NULL
)

then when I execute the following:

INSERT INTO testTable(testStringField) VALUES(N'string which is too long')

I get an error:

String or binary data would be truncated.
The statement has been terminated.

Great. Data integrity preserved, and the caller knows about it. Now let's define a stored procedure to insert that:

CREATE PROCEDURE spTestTableInsert
    @testStringField [nvarchar](5)
AS
    INSERT INTO testTable(testStringField) VALUES(@testStringField)
GO

and execute it:

EXEC spTestTableInsert @testStringField = N'string which is too long'

No errors, 1 row affected. A row is inserted into the table, with testStringField as 'strin'. SQL Server silently truncated the stored procedure's varchar parameter.

Now, this behaviour might be convenient at times but I gather there is NO WAY to turn it off. This is extremely annoying, as I want the thing to error if I pass too long a string to the stored procedure. There seem to be 2 ways to deal with this.

First, declare the stored proc's @testStringField parameter as size 6, and check whether its length is over 5. This seems like a bit of a hack and involves irritating amounts of boilerplate code.

Second, just declare ALL stored procedure varchar parameters to be varchar(max), and then let the INSERT statement within the stored procedure fail.

The latter seems to work fine, so my question is: is it a good idea to use varchar(max) ALWAYS for strings in SQL Server stored procedures, if I actually want the stored proc to fail when too long a string is passed? Could it even be best practice? The silent truncation that can't be disabled seems stupid to me.

It just is.

I've never noticed a problem though because one of my checks would be to ensure my parameters match my table column lengths. In the client code too. Personally, I'd expect SQL to never see data that is too long. If I did see truncated data, it'd be bleeding obvious what caused it.
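
The client-side check mentioned above could look something like the following sketch. It is written in Python only for illustration; the helper name check_nvarchar_length is hypothetical, and the length is counted in UTF-16 code units on the assumption that the target column is nvarchar(n).

```python
def check_nvarchar_length(value: str, max_len: int, name: str = "value") -> str:
    """Raise ValueError if value would not fit in an nvarchar(max_len) column."""
    # nvarchar(n) sizes are counted in UTF-16 code units (byte pairs),
    # so characters outside the Basic Multilingual Plane count twice.
    units = len(value.encode("utf-16-le")) // 2
    if units > max_len:
        raise ValueError(f"{name} is {units} characters long; the limit is {max_len}")
    return value


check_nvarchar_length("strin", 5, "testStringField")  # fits, returned unchanged

try:
    check_nvarchar_length("string which is too long", 5, "testStringField")
except ValueError as exc:
    # The caller finds out before the stored procedure can truncate anything.
    print(exc)
```

Running a check like this before calling the stored procedure means the procedure's parameters can keep their real column lengths.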

If you do feel the need for varchar(max) beware a massive performance issue because of datatype precedence. varchar(max) has higher precedence than varchar(n) (longest is highest). So in this type of query you'll get a scan not a seek and every varchar(100) value is CAST to varchar(max)

UPDATE ...WHERE varchar100column = @varcharmaxvalue

Edit:

One could always raise a Connect issue for MS. At least they may explain this behaviour.

And it's probably worthy of inclusion in Erland Sommarskog's Strict settings

Edit 2, after Martin's comment

DECLARE @sql VARCHAR(MAX), @nsql nVARCHAR(MAX)
SELECT @sql = 'B', @nsql = 'B'; 
select LEN(@sql), LEN(@nsql), DATALENGTH(@sql), DATALENGTH(@nsql)

declare @t table(c varchar(8000)) insert into @t values (replicate('A', 7500))

select LEN(c) from @t
select LEN(@sql + c), LEN(@nsql + c), DATALENGTH(@sql + c), DATALENGTH(@nsql + c) from @t

How do I average the difference between specific values in TSQL?

9 votes

Hey folks, sorry this is a bit of a longer question...

I have a table with the following columns:

[ChatID] [User] [LogID] [CreatedOn] [Text]

What I need to find is the average response time for a given user id, to another specific user id. So, if my data looks like:

[1] [john] [20] [1/1/11 3:00:00] [Hello]
[1] [john] [21] [1/1/11 3:00:23] [Anyone there?]
[1] [susan] [22] [1/1/11 3:00:43] [Hello!]
[1] [susan] [23] [1/1/11 3:00:53] [What's up?]
[1] [john] [24] [1/1/11 3:01:02] [Not much]
[1] [susan] [25] [1/1/11 3:01:08] [Cool]

...then I need to see that Susan has an average response time of (20 + 6) / 2 => 13 seconds to John, and John has an average of (9 / 1) => 9 seconds to Susan.

I'm not even sure this can be done in set-based logic, but if anyone has any ideas, they'd be much appreciated!

I don't have a PC to verify syntax or anything, but I think this should give you a starting place:

WITH ChatWithRownum AS (
    -- User is a reserved word in T-SQL, so it has to be bracketed
    SELECT ChatID, [User], LogID, CreatedOn, ROW_NUMBER() OVER(ORDER BY ChatID, CreatedOn) AS rownum
    FROM ChatLog
)
SELECT First.ChatID, Second.[User],
    -- the datepart is "second", not "seconds"
    AVG(DATEDIFF(second, First.CreatedOn, Second.CreatedOn)) AS AvgElapsedTime
FROM ChatWithRownum First
    JOIN ChatWithRownum Second ON First.ChatID = Second.ChatID
        AND First.rownum = Second.rownum - 1
WHERE First.[User] != Second.[User]
GROUP BY First.ChatID, Second.[User]

Essentially, the idea is to add row numbers to the data so you can join one row to the next row (so you have a statement followed by its immediate response). Once you have the rows joined, you can get the time elapsed between the two entries, and then group the data by the ChatID (I'm assuming that times between separate chats aren't relevant) and the two users. Like I said though, this is just a starting place as I'm sure there may be some additional criteria and/or bugs in my query :)
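
To check the row-number join against the sample data from the question, here is a small sketch using Python's sqlite3 module in place of SQL Server. It assumes an SQLite build with window functions (3.25 or later), uses strftime('%s', ...) instead of DATEDIFF, and adds LogID as a tie-breaker in the ordering, which is my addition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ChatLog (
        ChatID INTEGER, "User" TEXT, LogID INTEGER, CreatedOn TEXT, "Text" TEXT)
""")
conn.executemany("INSERT INTO ChatLog VALUES (?, ?, ?, ?, ?)", [
    (1, "john", 20, "2011-01-01 03:00:00", "Hello"),
    (1, "john", 21, "2011-01-01 03:00:23", "Anyone there?"),
    (1, "susan", 22, "2011-01-01 03:00:43", "Hello!"),
    (1, "susan", 23, "2011-01-01 03:00:53", "What's up?"),
    (1, "john", 24, "2011-01-01 03:01:02", "Not much"),
    (1, "susan", 25, "2011-01-01 03:01:08", "Cool"),
])

query = """
WITH ChatWithRownum AS (
    SELECT ChatID, "User", CreatedOn,
           ROW_NUMBER() OVER (ORDER BY ChatID, CreatedOn, LogID) AS rownum
    FROM ChatLog
)
SELECT prev.ChatID, cur."User",
       AVG(CAST(strftime('%s', cur.CreatedOn) AS INTEGER)
         - CAST(strftime('%s', prev.CreatedOn) AS INTEGER)) AS AvgElapsedSeconds
FROM ChatWithRownum prev
JOIN ChatWithRownum cur
  ON prev.ChatID = cur.ChatID
 AND prev.rownum = cur.rownum - 1      -- each row paired with the one before it
WHERE prev."User" <> cur."User"        -- keep only replies to the other user
GROUP BY prev.ChatID, cur."User"
ORDER BY cur."User"
"""
averages = conn.execute(query).fetchall()
print(averages)  # [(1, 'john', 9.0), (1, 'susan', 13.0)]
```

The averages match the 9 and 13 seconds worked out by hand in the question.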

Pass a NULL value to DateTime Field in LINQ

7 votes

My database table is like this

CREATE TABLE MYBUDGET.tbl_CurrentProperty
(
    [PropID]            INT             NOT NULL  IDENTITY(1,1),
    [UpdatedOn]         DATETIME        NOT NULL,
    [Amount]            MONEY           NOT NULL,
    [Remarks]           VARCHAR(100)    NOT NULL,
)
ALTER TABLE MYBUDGET.tbl_CurrentProperty ADD CONSTRAINT PK_CurrentProperty_PropID PRIMARY KEY ([PropID])
ALTER TABLE MYBUDGET.tbl_CurrentProperty ADD CONSTRAINT DF_CurrentProperty_UpdatedOn DEFAULT (DATEADD(MINUTE,30,DATEADD(HOUR, 5, GETUTCDATE()))) FOR [UpdatedOn]
ALTER TABLE MYBUDGET.tbl_CurrentProperty ADD CONSTRAINT CK_CurrentProperty_Amount CHECK([Amount] > -1)
GO

I'm using LINQ to SQL. In C# I need to pass only the [Amount] and [Remarks] fields; the other fields ([PropID] and [UpdatedOn]) must use their default values.

In C# I create a tbl_CurrentProperties object like below,

tbl_CurrentProperties currentProperties = new tbl_CurrentProperties();
currentProperties.Amount = 50.00M;
currentProperties.Remarks = "remarks";

and then submit the object to the data context. But here, LINQ assigns '1/1/0001 12:00:00 AM' to the UpdatedOn field. This violates the SQL datetime range of 1/1/1753 12:00:00 AM to 12/31/9999 11:59:59 PM and throws an exception. Also, I can't manually assign a NULL value to a DateTime field since it's a non-nullable type. Either way, I need to make this use its DEFAULT constraint. How can I do this?


PS: I want to do it this way because my database is online and the users are in different locations. If I used DateTime.Now, the time on the user's machine might be wrong, and it would insert a wrong value into the DB. I always need to use the SQL Server time.

Andrey's answer is partly right. I just tested this and here's what I found.

In your dbml designer, on your UpdatedOn column set the following:

Auto Generated Value = True 
Nullable = False

Then, on an INSERT, if you use SQL Server Profiler to look at the generated SQL, you'll see that UpdatedOn is not included in the INSERT. Not even a null value. This is important: for SQL Server to use a default value for that column, the column must be omitted from the INSERT. If you set Nullable = True on UpdatedOn, LINQ to SQL might include the column in the INSERT with a null value.

FYI, immediately after the INSERT there should be a SELECT where LINQ to SQL is retrieving the auto-generated value, so your entity object has the latest value.
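
The "omit the column so the default applies" rule is not specific to SQL Server. A minimal sketch with Python's sqlite3 module shows the same behaviour; the table and column names are simplified stand-ins for the ones in the question, and CURRENT_TIMESTAMP replaces the original DATEADD expression.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE current_property (
        prop_id    INTEGER PRIMARY KEY AUTOINCREMENT,
        updated_on TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP,
        amount     NUMERIC NOT NULL CHECK (amount > -1),
        remarks    TEXT    NOT NULL
    )
""")

# updated_on is left out of the column list, so the DEFAULT is applied.
conn.execute(
    "INSERT INTO current_property (amount, remarks) VALUES (?, ?)",
    (50.00, "remarks"),
)

# Passing NULL explicitly does not trigger the default; NOT NULL rejects it.
try:
    conn.execute(
        "INSERT INTO current_property (updated_on, amount, remarks) VALUES (NULL, ?, ?)",
        (50.00, "remarks"),
    )
    explicit_null_rejected = False
except sqlite3.IntegrityError:
    explicit_null_rejected = True

row = conn.execute("SELECT prop_id, updated_on FROM current_property").fetchone()
print(row, explicit_null_rejected)
```

The first insert gets a server-side timestamp, while the second one fails, which is why the generated INSERT has to leave the column out entirely.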

Hamming distance on binary strings in SQL

7 votes

I have a table in my DB where I store SHA256 hashes in a BINARY(32) column. I'm looking for a way to compute the Hamming distance of the entries in the column to a supplied value, i.e. something like:

SELECT * FROM table 
  ORDER BY HAMMINGDISTANCE(hash, UNHEX(<insert supplied sha256 hash here>)) ASC 
  LIMIT 10

(in case you're wondering, the Hamming distance of strings A and B is defined as BIT_COUNT(A^B), where ^ is the bitwise XOR operator and BIT_COUNT returns the number of 1s in the binary string).

Now, I know that both the ^ operator and BIT_COUNT function only work on INTEGERs and so I'd say that probably the only way to do it would be to break up the binary strings into substrings, cast each binary substring to integer, compute the Hamming distance substring-wise and then add them. The problem with this is that it sounds terribly complicated, inefficient and definitely not elegant. My question therefore is: could you suggest any better way? (please note that I'm on shared hosting and therefore I can't modify the DB server or load libraries)

edit(1): Obviously loading the whole table in PHP and doing the computations there would be possible but I'd rather avoid it because this table will probably grow quite large.

edit(2): The DB server is MySQL 5.1

edit(3): My answer below contains the code that I just described above.

edit(4): I just found out that using 4 BIGINTs to store the hash instead of a BINARY(32) yields massive speed improvements (more than 100 times faster). See the comments to my answer below.

FWIW, this is the code I was hinting at while explaining the problem. Better ways to accomplish the same thing are welcome (I especially don't like the binary > hex > decimal conversions):

CREATE FUNCTION HAMMINGDISTANCE(A BINARY(32), B BINARY(32))
RETURNS INT DETERMINISTIC
RETURN 
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 1,  8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 1,  8)), 16, 10)
  ) +
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 9,  8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 9,  8)), 16, 10)
  ) +
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 17, 8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 17, 8)), 16, 10)
  ) +
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 25, 8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 25, 8)), 16, 10)
  );
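
For cross-checking the SQL function, the same chunked computation can be sketched in Python. This is an illustration only, not a replacement for doing the work in the database: it splits each 32-byte hash into four 8-byte pieces, XORs the pieces as 64-bit integers, and adds up the bit counts, mirroring the four BIT_COUNT terms above.

```python
import hashlib


def hamming_distance(a: bytes, b: bytes) -> int:
    """Count differing bits between two 32-byte hashes, 8 bytes at a time."""
    if len(a) != 32 or len(b) != 32:
        raise ValueError("both hashes must be exactly 32 bytes")
    total = 0
    for offset in range(0, 32, 8):
        # Same idea as SUBSTRING(A, n, 8) converted to a 64-bit integer.
        x = int.from_bytes(a[offset:offset + 8], "big")
        y = int.from_bytes(b[offset:offset + 8], "big")
        total += bin(x ^ y).count("1")  # equivalent of BIT_COUNT(x ^ y)
    return total


h1 = hashlib.sha256(b"hello").digest()
h2 = hashlib.sha256(b"world").digest()
print(hamming_distance(h1, h2))
```

The four 8-byte pieces are also the natural split if the hash is stored as 4 BIGINT columns, as mentioned in edit(4) of the question.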

When to use SQL sub-queries versus a standard join?

7 votes

I am working on rewriting some poorly written SQL queries and they are over-utilizing sub-queries. I am looking for best-practices regarding the use of sub-queries.

Any help would be appreciated.

Subqueries are usually fine unless they are dependent subqueries (also known as correlated subqueries). If you are only using independent subqueries and they are using appropriate indexes then they should run quickly. If you have a dependent subquery you might run into performance problems because a dependent subquery typically needs to be run once for each row in the outer query. So if your outer query has 1000 rows, the subquery will be run 1000 times. On the other hand an independent subquery typically only needs to be evaluated once.

If you're not sure what is meant by a subquery being dependent or independent, here's a rule of thumb: if you can take the subquery, remove it from its context, run it, and get a result set, then it's an independent subquery. If you get an error because it refers to some tables outside of the subquery, then it's a dependent subquery.
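
The rule of thumb can be tried out directly. The sketch below uses Python's sqlite3 module with two made-up tables (customers and orders are hypothetical names): the independent subquery runs on its own, while the dependent one fails on its own because it refers to the outer alias c.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 30.0), (3, 2, 5.0);
""")

independent = "SELECT AVG(total) FROM orders"
dependent = "SELECT MAX(total) FROM orders o WHERE o.customer_id = c.id"


def runs_alone(subquery: str) -> bool:
    """Return True if the subquery can be executed outside its outer query."""
    try:
        conn.execute(subquery).fetchall()
        return True
    except sqlite3.OperationalError:
        return False


print(runs_alone(independent))  # True: no reference to an outer query
print(runs_alone(dependent))    # False: c.id only exists in the outer query

# Inside its outer query the dependent subquery works, once per customer row.
rows = conn.execute(
    f"SELECT c.name, ({dependent}) FROM customers c ORDER BY c.id"
).fetchall()
print(rows)  # [('ann', 30.0), ('bob', 5.0)]
```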

The general rule of course has a few exceptions. For example:

  • Many optimizers can take a dependent subquery and find a way to run it efficiently as a JOIN. For example, a NOT EXISTS query might result in an ANTI JOIN query plan, so it will not necessarily be any slower than writing the query with a JOIN.
  • MySQL has a bug where an independent subquery inside an IN expression is incorrectly identified as a dependent subquery and so a suboptimal query plan is used. This is apparently fixed in the very newest versions of MySQL.

If performance is an issue then measure your specific queries and see what works best for you.

Mysql Slow Insert

7 votes

Hi,

I have the following InnoDB table:

+-----------+-----------+------+-----+-------------------+----------------+
| Field     | Type      | Null | Key | Default           | Extra          |
+-----------+-----------+------+-----+-------------------+----------------+
| id        | int(11)   | NO   | PRI | NULL              | auto_increment |
| doc_id    | char(32)  | NO   |     | NULL              |                |
| staff     | char(18)  | NO   |     | NULL              |                |
| timestamp | timestamp | NO   | MUL | CURRENT_TIMESTAMP |                |
+-----------+-----------+------+-----+-------------------+----------------+

With these keys:

+--------------+------------+-----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table        | Non_unique | Key_name        | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+--------------+------------+-----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| staff_online |          0 | PRIMARY         |            1 | id          | A         |      277350 |     NULL | NULL   |      | BTREE      |         |
| staff_online |          1 | timestamp       |            1 | timestamp   | A         |      277350 |     NULL | NULL   |      | BTREE      |         |
| staff_online |          1 | staff_timestamp |            1 | timestamp   | A         |      277350 |     NULL | NULL   |      | BTREE      |         |
| staff_online |          1 | staff_timestamp |            2 | staff       | A         |      277350 |     NULL | NULL   |      | BTREE      |         |
+--------------+------------+-----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+

I just noticed that in mysql-slow.log I sometimes have an INSERT query on this table which takes > 1 second

INSERT INTO `staff_online` (`doc_id`, `staff`, `timestamp`) VALUES ('150b60a0ab8c5888bdbbb80bd8b7f8a2', 'asia', '2011-01-29 16:52:54')

I'm really puzzled as to why it takes so long. How can I speed it up?

BTW: Each day there are ~80 slow INSERTs and 40 slow UPDATEs like this.

Sometimes it is not the query itself which causes a slowdown - another query operating on the table can easily cause inserts to slow down due to transactional isolation and locking. Your slow queries might simply have been waiting for other transactions to complete. This is fairly common on a busy table, or if your server is executing long/complex transactions.

Another significant factor will be the overall performance of your database: how your my.cnf file is tuned, how the server itself is tuned, what else the server has running on it, and of course, what hardware the server is running on.

The Linux tool mytop and the query SHOW ENGINE INNODB STATUS\G can be helpful to see possible trouble spots. General Linux performance tools can also show how busy your disks are, etc.

Given the nature of this table, have you considered an alternative way to keep track of who is online? In MySQL, I have used a MEMORY table for such purposes in the past. A NoSQL data store might also be good for this type of information. Redis could store this as sorted set with much success (score == timestamp).


INT vs VARCHAR in search

6 votes

Which one of the following queries will be faster and more optimal (and why):

  1. SELECT * FROM items WHERE w = 320 AND h = 200 (w and h are INT)

  2. SELECT * FROM items WHERE dimensions = '320x200' (dimensions is VARCHAR)

Here are some actual measurements. (Using SQLite; may try it with MySQL later.)

Data = All 1,000,000 combinations of w, h ∈ {1...1000}, in randomized order.

CREATE TABLE items (id INTEGER PRIMARY KEY, w INTEGER, h INTEGER)

Average time (of 20 runs) to execute SELECT * FROM items WHERE w = 320 and h = 200 was 5.39±0.29 µs.

CREATE TABLE items (id INTEGER PRIMARY KEY, dimensions TEXT)

Average time to execute SELECT * FROM items WHERE dimensions = '320x200' was 5.69±0.23 µs.

There is no significant difference, efficiency-wise.

But

There is a huge difference in terms of usability. For example, if you want to calculate the area and perimeter of the rectangles, the two-column approach is easy:

SELECT w * h, 2 * (w + h) FROM items

Try to write the corresponding query for the other way.
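
For comparison, here is roughly what the single-column version ends up looking like. This is a sketch using Python's sqlite3 module and SQLite's instr and substr functions; it assumes every value has exactly the form '<width>x<height>'.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, dimensions TEXT)")
conn.executemany(
    "INSERT INTO items (dimensions) VALUES (?)",
    [("320x200",), ("640x480",)],
)

# Every use of w or h needs the string to be split and cast again.
query = """
SELECT CAST(substr(dimensions, 1, instr(dimensions, 'x') - 1) AS INTEGER)
     * CAST(substr(dimensions, instr(dimensions, 'x') + 1) AS INTEGER) AS area,
       2 * (CAST(substr(dimensions, 1, instr(dimensions, 'x') - 1) AS INTEGER)
          + CAST(substr(dimensions, instr(dimensions, 'x') + 1) AS INTEGER)) AS perimeter
FROM items
ORDER BY id
"""
results = conn.execute(query).fetchall()
print(results)  # [(64000, 1040), (307200, 2240)]
```

The parsing expression is repeated four times for two simple formulas, and it cannot use an index on the individual width or height.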

How do I get 5 latest comments (SQL query for SQL Server ) for each user?

6 votes

I have a table that looks like this: comment_id, user_id, comment, last_updated.

Comment_id is a key here. Each user may have multiple comments.

How do I get 5 latest comments (SQL query for SQL Server ) for each user?

Output should be similar to the original table, just limit user's comments to 5 most recent for every user.

Assuming at least SQL Server 2005 so you can use the window function (row_number) and the CTE:

;with cteRowNumber as (
    select comment_id, user_id, comment, last_updated, ROW_NUMBER() over (partition by user_id order by last_updated desc) as RowNum
        from comments
)
select comment_id, user_id, comment, last_updated
    from cteRowNumber
    where RowNum <= 5
    order by user_id, last_updated desc
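
The same ROW_NUMBER pattern can be tried without a SQL Server instance. The sketch below runs it through Python's sqlite3 module, assuming an SQLite build with window functions (3.25 or later); the sample comments are generated and purely illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE comments (
        comment_id INTEGER PRIMARY KEY, user_id INTEGER,
        comment TEXT, last_updated TEXT)
""")
# User 1 has seven comments, user 2 has only two.
rows = [(1, f"u1 comment {d}", f"2011-01-{d:02d}") for d in range(1, 8)]
rows += [(2, f"u2 comment {d}", f"2011-01-{d:02d}") for d in range(1, 3)]
conn.executemany(
    "INSERT INTO comments (user_id, comment, last_updated) VALUES (?, ?, ?)", rows
)

query = """
WITH cteRowNumber AS (
    SELECT comment_id, user_id, comment, last_updated,
           ROW_NUMBER() OVER (PARTITION BY user_id
                              ORDER BY last_updated DESC) AS RowNum
    FROM comments
)
SELECT user_id, last_updated
FROM cteRowNumber
WHERE RowNum <= 5
ORDER BY user_id, last_updated DESC
"""
latest = conn.execute(query).fetchall()
for user_id, last_updated in latest:
    print(user_id, last_updated)
```

User 1 is cut down to the five most recent comments, while user 2 keeps both, since the numbering restarts for each user_id partition.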

SQL query to find all tables in a database that have a column with a specific name

6 votes

What query can I run on a database that will tell me which tables in that database have a column named "RCPTNMBR"?

Most databases support this:

select table_name from information_schema.columns where column_name = 'RCPTNMBR'

MySQL query - possible to include this clause?

6 votes

I have the following query, which retrieves 4 adverts from certain categories in a random order.

At the moment, if a user has more than 1 advert, then potentially all of those ads might be retrieved - I need to limit it so that only 1 ad per user is displayed.

Is this possible to achieve in the same query?

SELECT      a.advert_id, a.title, a.url, a.user_id, 
            FLOOR(1 + RAND() * x.m_id) 'rand_ind' 

FROM        adverts AS a
INNER JOIN  advert_categories AS ac
ON          a.advert_id = ac.advert_id,
(
            SELECT MAX(t.advert_id) - 1 'm_id' 
            FROM adverts t
)           x

WHERE       ac.category_id IN 
(
            SELECT category_id
            FROM website_categories
            WHERE website_id = '8'
)
AND         a.advert_type = 'text'

GROUP BY    a.advert_id
ORDER BY    rand_ind 
LIMIT       4

Note: The solution is the last query at the bottom of this answer.

Test Schema and Data

create table adverts (
    advert_id int primary key, title varchar(20), url varchar(20), user_id int, advert_type varchar(10))
;
create table advert_categories (
    advert_id int, category_id int, primary key(category_id, advert_id))
;
create table website_categories (
    website_id int, category_id int, primary key(website_id, category_id))
;
insert website_categories values
    (8,1),(8,3),(8,5),
    (1,1),(2,3),(4,5)
;
insert adverts (advert_id, title, user_id) values
    (1, 'StackExchange', 1),
    (2, 'StackOverflow', 1),
    (3, 'SuperUser', 1),
    (4, 'ServerFault', 1),
    (5, 'Programming', 1),
    (6, 'C#', 2),
    (7, 'Java', 2),
    (8, 'Python', 2),
    (9, 'Perl', 2),
   (10, 'Google', 3)
;
update adverts set advert_type = 'text'
;
insert advert_categories values
    (1,1),(1,3),
    (2,3),(2,4),
    (3,1),(3,2),(3,3),(3,4),
    (4,1),
    (5,4),
    (6,1),(6,4),
    (7,2),
    (8,1),
    (9,3),
   (10,3),(10,5)
;

Data properties

  • each website can belong to multiple categories
  • for simplicity, all adverts are of type 'text'
  • each advert can belong to multiple categories. If a website has multiple categories that are matched multiple times in advert_categories for the same user_id, this causes the advert_id's to show twice when using a straight join between 3 tables in the next query.

This query joins the 3 tables together (notice that ids 1, 3 and 10 each appear twice)

select *
from website_categories wc
inner join advert_categories ac on wc.category_id = ac.category_id
inner join adverts a on a.advert_id = ac.advert_id and  a.advert_type = 'text'
where wc.website_id='8'
order by a.advert_id

To make each advert show only once, this is the core query to show all eligible ads, each only once

        select *
        from adverts a
        where a.advert_type = 'text'
          and exists (
            select *
            from website_categories wc
            inner join advert_categories ac on wc.category_id = ac.category_id
            where wc.website_id='8'
              and a.advert_id = ac.advert_id)

The next query retrieves all the advert_id's to be shown

select advert_id, user_id
from (
    select
        advert_id, user_id,
        @r := @r + 1 r
    from (select @r:=0) r
    cross join 
    (
        # core query -- vvv
        select a.advert_id, a.user_id
        from adverts a
        where a.advert_type = 'text'
          and exists (
            select *
            from website_categories wc
            inner join advert_categories ac on wc.category_id = ac.category_id
            where wc.website_id='8'
              and a.advert_id = ac.advert_id)
        # core query -- ^^^
        order by rand()
    ) EligibleAdsAndUserIDs
) RowNumbered
group by user_id
order by r
limit 2

There are 3 levels to this query

  1. aliased EligibleAdsAndUserIDs: core query, sorted randomly using order by rand()
  2. aliased RowNumbered: row number added to core query, using MySQL side-effecting @variables
  3. the outermost query forces mysql to collect rows as numbered randomly in the inner queries, and group by user_id causes it to retain only the first row for each user_id. limit 2 causes the query to stop as soon as two distinct user_id's have been encountered.

This is the final query, which takes the advert_id's from the previous query and joins them back to the adverts table to retrieve the required columns. The result

  1. shows only one ad per user_id
  2. features users with more ads proportionally (statistically) to the number of eligible ads they have

Note: Point (2) works because the more ads you have, the more likely you will hit the top placings in the row numbering subquery

select a.advert_id, a.title, a.url, a.user_id
from
(
    select advert_id
    from (
        select
            advert_id, user_id,
            @r := @r + 1 r
        from (select @r:=0) r
        cross join 
        (
            # core query -- vvv
            select a.advert_id, a.user_id
            from adverts a
            where a.advert_type = 'text'
              and exists (
                select *
                from website_categories wc
                inner join advert_categories ac on wc.category_id = ac.category_id
                where wc.website_id='8'
                  and a.advert_id = ac.advert_id)
            # core query -- ^^^
            order by rand()
        ) EligibleAdsAndUserIDs
    ) RowNumbered
    group by user_id
    order by r
    limit 2
) Top2
inner join adverts a on a.advert_id = Top2.advert_id;

Get total hours worked in a day mysql

6 votes

Hi all,

I have a MySQL table where employee login and logout times are recorded. In the in_out column, 1 represents login and 0 represents logout.

  [id]   [User_id]           [Date_time]                 [in_out]
    1       1          2011-01-20 09:30:03                  1
    2       1          2011-01-20 11:30:43                  0
    3       1          2011-01-20 11:45:12                  1
    4       1          2011-01-20 12:59:56                  0
    5       1          2011-01-20 13:33:11                  1
    6       1          2011-01-20 15:38:16                  0
    7       1          2011-01-20 15:46:23                  1
    8       1          2011-01-20 17:42:45                  0

Is it possible to retrieve the total hours worked in a day by a user using a single query? I tried a lot, but all in vain. I can do this in PHP using an array, but I'm unable to do so using a single query.

Thanks in advance...

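SELECT `User_id`,
       SEC_TO_TIME(SUM(UNIX_TIMESTAMP(`Date_time`) * (1 - 2 * `in_out`)))
  FROM `whatever_table` GROUP BY `User_id`;

The (1-2*`in_out`) term gives every login event a -1 factor and every logout event a +1 factor. UNIX_TIMESTAMP converts each Date_time to seconds first, because summing the raw datetime values treats them as plain numbers such as 20110120174245 and gives wrong minutes and seconds whenever a subtraction has to borrow. The SUM is therefore the total number of seconds between each login and its logout, SEC_TO_TIME formats it as hh:mm:ss, and GROUP BY `User_id` produces one total per user. To get one total per day, also group by DATE(`Date_time`), assuming no session spans midnight.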
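
The signed-sum idea can be checked against the sample rows from the question. Below is a minimal sketch with Python's sqlite3 module, where strftime('%s', ...) converts each timestamp to epoch seconds before it is summed; the table name attendance is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE attendance (
        id INTEGER PRIMARY KEY, user_id INTEGER, date_time TEXT, in_out INTEGER)
""")
conn.executemany("INSERT INTO attendance VALUES (?, ?, ?, ?)", [
    (1, 1, "2011-01-20 09:30:03", 1),
    (2, 1, "2011-01-20 11:30:43", 0),
    (3, 1, "2011-01-20 11:45:12", 1),
    (4, 1, "2011-01-20 12:59:56", 0),
    (5, 1, "2011-01-20 13:33:11", 1),
    (6, 1, "2011-01-20 15:38:16", 0),
    (7, 1, "2011-01-20 15:46:23", 1),
    (8, 1, "2011-01-20 17:42:45", 0),
])

# Logins (in_out = 1) count as -1, logouts (in_out = 0) as +1.
query = """
SELECT user_id,
       SUM(CAST(strftime('%s', date_time) AS INTEGER) * (1 - 2 * in_out)) AS seconds_worked
FROM attendance
GROUP BY user_id
"""
user_id, seconds = conn.execute(query).fetchone()

hours, remainder = divmod(seconds, 3600)
minutes, secs = divmod(remainder, 60)
worked = f"{hours:02d}:{minutes:02d}:{secs:02d}"
print(user_id, worked)  # 1 07:16:51
```

The four sessions are 2:00:40, 1:14:44, 2:05:05 and 1:56:22 long, which adds up to the 7:16:51 printed above.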

How to append data from SQL to an existing file

6 votes

SQL has the option to dump data into a file, using the INTO OUTFILE option, for example

SELECT * from FishReport INTO OUTFILE './FishyFile'

The problem is, this command is only allowed if the file didn't exist before it. It creates the file and then enters the data. So, is there any way to append data to a file this way?

As the MySQL page on SELECT syntax suggests:

http://dev.mysql.com/doc/refman/5.0/en/select.html

the alternative to this is to issue the SELECT from the MySQL client:

However, if the MySQL client software is installed on the remote machine,
you can instead use a client command such as mysql -e "SELECT ..." > file_name 
to generate the file on the client host. 

which, in your case, would be modified to be:

mysql -e "SELECT * from FishReport" >> file_name

so that you simply append to the file.

From your Tcl script, you could simply issue this as an exec command:

http://www.tcl.tk/man/tcl/tutorial/Tcl26.html

Is it ever a good idea to not have an 'id' primary key for a table?

6 votes

It seems to me like it's always a good idea, but is there ever a case where you are better off not having this in a table?

In my experience, almost never. (For a "speed matters, I'm just inserting and don't really care about retrieval at this point" style of application, perhaps.)

Whilst you might conceivably never use the ID field, it's nearly always wise to have one happily AUTO_INCREMENTing away, because one day you might need one. (You could of course simply do an 'ALTER..' to add one, but that's beside the point.)

Oracle SQL: Detecting breaks in continual spans

6 votes

I have the following table and I'm trying to detect products that have a break in their spans.

Product     | unit_Cost | price start date | price end date
------------------------------------------------------------
product 1   | 15.00     | 01/01/2011       | 03/31/2011
product 1   | 15.00     | 04/01/2011       | 06/30/2011
product 1   | 15.00     | 07/01/2011       | 09/30/2011
product 1   | 15.00     | 10/01/2011       | 12/31/2011

product 2   | 10.00     | 01/01/2011       | 12/31/2011

product 3   | 25.00     | 01/01/2011       | 06/30/2011
product 3   | 25.00     | 10/01/2011       | 12/31/2011

So here I want it to report back product 3 because we are missing the span

07/01/2011 - 09/30/2011

Any ideas on how I can do this?

EDIT: Oracle Ver: 10g

Create Table Statement

CREATE TABLE Sandbox.TBL_PRODUCT
(
  PRODUCT_ID        VARCHAR2(13 BYTE),   
  PRODUCT           VARCHAR2(64 BYTE),
  UNIT_COST         NUMBER,
  PRICE_START_DATE  DATE,
  PRICE_END_DATE    DATE
)

EDIT 2: Start dates and end dates cannot overlap.

EDIT 3: A span can be any two dates as long as price_end_date >= price_start_date. Equality is allowed since a product can be on sale for one day.

Try this (using LEAD analytic function):

SELECT *
  FROM (
         SELECT a.*,
                LEAD(price_start_date, 1, NULL)
                  OVER (PARTITION BY product ORDER BY price_end_date) next_start_date
           FROM Product a
       )
 WHERE (price_end_date + 1) <> next_start_date

Example with Setup

CREATE TABLE PRODUCT
  (
    PRODUCT    VARCHAR2(100 BYTE),
    UNIT_COST  NUMBER,
    START_DATE DATE,
    END_DATE   DATE
  );

-- UNIT_COST is a NUMBER, so pass numeric literals rather than strings
INSERT INTO Product VALUES('product 1',15.00,TO_DATE('01/01/2011','MM/DD/RRRR'),TO_DATE('03/31/2011','MM/DD/RRRR'));
INSERT INTO Product VALUES('product 1',15.00,TO_DATE('04/01/2011','MM/DD/RRRR'),TO_DATE('06/30/2011','MM/DD/RRRR'));
INSERT INTO Product VALUES('product 1',15.00,TO_DATE('07/01/2011','MM/DD/RRRR'),TO_DATE('09/30/2011','MM/DD/RRRR'));
INSERT INTO Product VALUES('product 1',15.00,TO_DATE('10/01/2011','MM/DD/RRRR'),TO_DATE('12/31/2011','MM/DD/RRRR'));
INSERT INTO Product VALUES('product 2',10.00,TO_DATE('01/01/2011','MM/DD/RRRR'),TO_DATE('12/31/2011','MM/DD/RRRR'));
INSERT INTO Product VALUES('product 3',25.00,TO_DATE('01/01/2011','MM/DD/RRRR'),TO_DATE('06/30/2011','MM/DD/RRRR'));
INSERT INTO Product VALUES('product 3',25.00,TO_DATE('10/01/2011','MM/DD/RRRR'),TO_DATE('12/31/2011','MM/DD/RRRR'));

SELECT *
  FROM (
         SELECT a.*,
                LEAD(start_date, 1, NULL)
                  OVER (PARTITION BY product ORDER BY start_date) next_start_date
           FROM Product a
       )
 WHERE (end_date + 1) <> next_start_date

EDIT: Updated the query to consider the next start_date and current end_date to avoid issues with the distribution of the data.
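
For readers without an Oracle instance at hand, the LEAD comparison above can be sketched in plain Python. This is only an illustration of the same logic, not the answerer's code: the `find_breaks` helper is a hypothetical name, and the rows are copied from the example setup. Each row is paired with the next row of the same product (ordered by start date), and a break is reported when end date + 1 day differs from the next start date.

```python
from datetime import date, timedelta
from itertools import groupby

# Rows from the example setup above: (product, start_date, end_date).
rows = [
    ("product 1", date(2011, 1, 1), date(2011, 3, 31)),
    ("product 1", date(2011, 4, 1), date(2011, 6, 30)),
    ("product 1", date(2011, 7, 1), date(2011, 9, 30)),
    ("product 1", date(2011, 10, 1), date(2011, 12, 31)),
    ("product 2", date(2011, 1, 1), date(2011, 12, 31)),
    ("product 3", date(2011, 1, 1), date(2011, 6, 30)),
    ("product 3", date(2011, 10, 1), date(2011, 12, 31)),
]


def find_breaks(rows):
    """Return (product, end_date, next_start_date) for each row followed by a gap."""
    breaks = []
    # Mirrors PARTITION BY product ORDER BY start_date.
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for product, group in groupby(ordered, key=lambda r: r[0]):
        spans = list(group)
        # Pair each row with the next one in its partition, like LEAD(start_date, 1).
        # The last row of a partition has no successor and is skipped, just as
        # the SQL comparison against a NULL next_start_date filters it out.
        for (_, _, end), (_, next_start, _) in zip(spans, spans[1:]):
            if end + timedelta(days=1) != next_start:
                breaks.append((product, end, next_start))
    return breaks


breaks = find_breaks(rows)
for product, end, next_start in breaks:
    print(product, "has a break between", end, "and", next_start)
```

With the sample rows, only product 3 is reported, since its first span ends on 06/30/2011 and the next one starts on 10/01/2011.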

Why does EF generate this SQL when querying a reference property

6 votes

When using the AdventureWorks database and issuing this query:

   AdventureWorksEntities entities = new AdventureWorksEntities();
   entities.Contacts.Where(x => x.FirstName == "A" 
                               || x.FirstName == "B" 
                               || x.FirstName == "C")
                     .ToList();

it is translated to this SQL, which is the best it can do:

-- omitted for brevity
FROM [Person].[Contact] AS [Extent1]
WHERE [Extent1].[FirstName] IN (N'A',N'B',N'C')

However when I issue this query:

entities.Employee.Where(x => x.Contact.FirstName == "A" 
                             || x.Contact.FirstName == "B" 
                             || x.Contact.FirstName == "C")
                .ToList();

I get this SQL:

-- omitted for brevity
FROM   [HumanResources].[Employee] AS [Extent1]
INNER JOIN [Person].[Contact] AS [Extent2] ON [Extent1].[ContactID] = [Extent2].[ContactID]
LEFT OUTER JOIN [Person].[Contact] AS [Extent3] ON [Extent1].[ContactID] = [Extent3].[ContactID]
WHERE [Extent2].[FirstName] = N'A' OR [Extent3].[FirstName] IN (N'B',N'C')

Why am I getting both an inner and an outer join, and why is EF splitting the WHERE clause across them?

Of note: using Contains creates the same SQL:

var names = new List<string>{"A", "B", "C"};
entities.Employee.Where(x => names.Contains(x.Contact.FirstName)).ToList();

EDIT: So it appears to be an EF bug. I've accepted the answer which provided a workaround.

EDIT: Opened connect issue, it's located here

Write a stored procedure which accepts a TVP (table-valued parameter) as an input parameter and let EF materialize the results from the SP :)
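
As a side check, the odd-looking SQL in the question should at least return the same rows as a single-join IN query, provided every employee has a non-null ContactID. The sketch below tests that on a tiny made-up schema using Python's sqlite3 module. Everything here is an assumption for illustration: the table layout is heavily simplified, SQLite stands in for SQL Server, and the check says nothing about query plans or performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, heavily simplified stand-ins for Person.Contact and
# HumanResources.Employee.
conn.executescript("""
    CREATE TABLE Contact  (ContactID INTEGER PRIMARY KEY, FirstName TEXT);
    CREATE TABLE Employee (EmployeeID INTEGER PRIMARY KEY,
                           ContactID INTEGER NOT NULL REFERENCES Contact(ContactID));
    INSERT INTO Contact  VALUES (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D');
    INSERT INTO Employee VALUES (10, 1), (11, 2), (12, 3), (13, 4);
""")

# Same shape as the SQL EF generated: one inner and one outer join to Contact.
generated = """
    SELECT e.EmployeeID
    FROM Employee AS e
    INNER JOIN Contact AS c2 ON e.ContactID = c2.ContactID
    LEFT OUTER JOIN Contact AS c3 ON e.ContactID = c3.ContactID
    WHERE c2.FirstName = 'A' OR c3.FirstName IN ('B', 'C')
    ORDER BY e.EmployeeID
"""

# The shape one would expect: a single join and a single IN list.
single_join = """
    SELECT e.EmployeeID
    FROM Employee AS e
    INNER JOIN Contact AS c ON e.ContactID = c.ContactID
    WHERE c.FirstName IN ('A', 'B', 'C')
    ORDER BY e.EmployeeID
"""

generated_ids = [row[0] for row in conn.execute(generated)]
single_join_ids = [row[0] for row in conn.execute(single_join)]
print(generated_ids, single_join_ids)
```

Both lists contain the employees whose contact is named A, B or C, so on this data the extra join is redundant rather than wrong.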

Difference between a statement and a query in SQL

5 votes

I still live in this ambiguity: conceptually, what's the difference between a statement and a query in SQL? Can anybody give a definition for each of them? It would be useful, for example, when choosing variable names inside programs in a way that will be clear to everybody. Thanks!

ADDITIONALLY: What do I call a chunk of SQL code made up of more than one statement, where the statements are separated by a semicolon (;)? Anyone who has already replied can edit their answer. Many thanks!

A statement is any text that the database engine recognizes as a valid command. As of SQL-92:

An SQL-statement is a string of characters that conforms to the format and syntax rules specified in this international standard.

A query is a statement that returns a recordset (possibly empty).

What do I call a chunk of SQL code made up of more than one statement, where the statements are separated by a semicolon (;)?

A series of SQL statements sent to the server at once is called a batch.

Not all SQL engines require the statements in a batch to be semicolon-delimited. SQL Server, for instance, generally does not, and splits the statements based on context. A notable exception is a CTE starting with WITH: the statement before it must be terminated with a semicolon.
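
To make the three terms concrete, here is a small sketch using Python's built-in sqlite3 module. SQLite is only a stand-in chosen because it runs anywhere, and the table `t` is made up. The batch is sent with `executescript`, the UPDATE is a statement that returns no recordset (the cursor's `description` is `None`), and the SELECTs are queries, one of which returns an empty recordset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A batch: several statements sent to the engine at once, separated by semicolons.
conn.executescript("""
    CREATE TABLE t (x INTEGER);
    INSERT INTO t VALUES (1);
    INSERT INTO t VALUES (2);
""")

# A statement that is not a query: it changes data but returns no recordset.
cur = conn.execute("UPDATE t SET x = x + 1")
update_is_query = cur.description is not None

# A query: a statement that returns a recordset.
cur = conn.execute("SELECT x FROM t ORDER BY x")
select_is_query = cur.description is not None
rows = cur.fetchall()

# Still a query, even though the recordset it returns is empty.
empty_rows = conn.execute("SELECT x FROM t WHERE x > 100").fetchall()

print(update_is_query, select_is_query, rows, empty_rows)
```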

What is a good way to find gaps in a set of datespans?

5 votes

What is a way to find gaps in a set of date spans?

For example, I have these date spans:

1/ 1/11 - 1/10/11  
1/13/11 - 1/15/11  
1/20/11 - 1/30/11

Then I have a start and end date of 1/7/11 and 1/14/11.

I want to be able to tell that between 1/10/11 and 1/13/11 there is a gap so the start and end date is not possible. Or I want to return only the datespans up to the first gap encountered.

If this can be done in SQL server that would be good.

I was thinking to go through each date to find out if it lands in a datespan... if it does not then there's a gap on that day.

  • Jump to the 2nd last code block for: *I want to be able to tell that between 1/10/11 and 1/13/11 there is a gap so the start and end date is not possible.*
  • Jump to the last code block for: *I want to return only the datespans up to the first gap encountered.*

First of all, here's a virtual table to discuss

create table spans (date1 datetime, date2 datetime);
insert into spans select '20110101', '20110110';
insert into spans select '20110113', '20110115';
insert into spans select '20110120', '20110130';

This is a query that will list, individually, all the dates covered by the spans that intersect our range

declare @startdate datetime, @enddate datetime
select @startdate = '20110107', @enddate = '20110114'
select distinct a.date1+v.number
from spans A
inner join master..spt_values v
  on v.type='P' and v.number between 0 and datediff(d, a.date1, a.date2)
-- we don't care about spans that don't intersect with our range
where A.date1 <= @enddate
  and @startdate <= A.date2

Armed with this query, we can now test to see if there are any gaps, by counting the days in the calendar against the expected number of days

declare @startdate datetime, @enddate datetime
select @startdate = '20110107', @enddate = '20110114'

select case when count(distinct a.date1+v.number)
    = datediff(d,@startdate, @enddate) + 1
    then 'No gaps' else 'Gap' end
from spans A
inner join master..spt_values v
  on v.type='P' and v.number between 0 and datediff(d, a.date1, a.date2)
-- we don't care about spans that don't intersect with our range
where A.date1 <= @enddate
  and @startdate <= A.date2
-- count only those dates within our range
   and a.date1 + v.number between @startdate and @enddate
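
The counting idea is easy to check outside SQL Server. Below is a rough Python sketch of the same test: collect the distinct covered days inside the range and compare the count with the number of days expected. The `has_gap` helper is a hypothetical name, and the spans are the three from the virtual table above.

```python
from datetime import date, timedelta

spans = [
    (date(2011, 1, 1), date(2011, 1, 10)),
    (date(2011, 1, 13), date(2011, 1, 15)),
    (date(2011, 1, 20), date(2011, 1, 30)),
]


def has_gap(spans, start, end):
    """True if some day in [start, end] is not covered by any span."""
    covered = set()  # plays the role of COUNT(DISTINCT ...)
    for d1, d2 in spans:
        # Clip the span to the requested range, then add its days one by one.
        day = max(d1, start)
        last = min(d2, end)
        while day <= last:
            covered.add(day)
            day += timedelta(days=1)
    expected = (end - start).days + 1  # datediff(d, start, end) + 1
    return len(covered) != expected


gap_found = has_gap(spans, date(2011, 1, 7), date(2011, 1, 14))
no_gap = has_gap(spans, date(2011, 1, 2), date(2011, 1, 9))
print(gap_found, no_gap)
```

For 1/7/11 to 1/14/11 only six of the eight days are covered, so a gap is reported. A range such as 1/2/11 to 1/9/11 sits entirely inside the first span and passes.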

Another way to do this is to just build the calendar from @startdate to @enddate up front and check whether any span contains each date

declare @startdate datetime, @enddate datetime
select @startdate = '20110107', @enddate = '20110114'
-- startdate+v.number is a day on the calendar
select @startdate + v.number
from master..spt_values v
where v.type='P' and v.number between 0
  and datediff(d, @startdate, @enddate)

-- run the part above this line alone to see the calendar
-- the condition checks for dates that are not in any span (gap)
  and not exists (
    select *
    from spans
    where @startdate + v.number between date1 and date2)

The query returns ALL dates that are gaps in the date range @startdate - @enddate. A TOP 1 can be added to just see whether there are any gaps.
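
The calendar-first approach translates almost line for line into Python. This is a sketch only, with a hypothetical `gap_days` helper and the same three sample spans: walk every day of the requested range and keep the days that fall in no span, which is what the NOT EXISTS condition does.

```python
from datetime import date, timedelta

spans = [
    (date(2011, 1, 1), date(2011, 1, 10)),
    (date(2011, 1, 13), date(2011, 1, 15)),
    (date(2011, 1, 20), date(2011, 1, 30)),
]


def gap_days(spans, start, end):
    """Return every day in [start, end] that no span contains."""
    days = []
    day = start
    while day <= end:
        # Equivalent of: not exists (... where day between date1 and date2)
        if not any(d1 <= day <= d2 for d1, d2 in spans):
            days.append(day)
        day += timedelta(days=1)
    return days


missing = gap_days(spans, date(2011, 1, 7), date(2011, 1, 14))
print(missing)
```

For the sample range this yields 1/11/11 and 1/12/11, the two days between the first and second spans.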

To return all records that are before the gap, use the query as a subquery in a larger query

declare @startdate datetime, @enddate datetime
select @startdate = '20110107', @enddate = '20110114'
select *
from spans
where date1 <= @enddate and @startdate <= date2 -- overlaps
  and date2 < ( -- before the gap
    select top 1 @startdate + v.number
    from master..spt_values v
    where v.type='P' and v.number between 0
      and datediff(d, @startdate, @enddate)
      and not exists (
        select *
        from spans
        where @startdate + v.number between date1 and date2)
    order by 1 ASC
)
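
The same "spans before the first gap" logic can be sketched in Python. The helper name `spans_before_first_gap` is hypothetical, and the data is the sample from the virtual table. One deliberate difference, stated as an assumption about what is wanted: when the range has no gap at all, this sketch returns every overlapping span, whereas the SQL above compares date2 against a NULL subquery result and so returns no rows.

```python
from datetime import date, timedelta

spans = [
    (date(2011, 1, 1), date(2011, 1, 10)),
    (date(2011, 1, 13), date(2011, 1, 15)),
    (date(2011, 1, 20), date(2011, 1, 30)),
]


def spans_before_first_gap(spans, start, end):
    """Return the spans overlapping [start, end] that end before the first gap."""
    # Overlap test, same as: date1 <= @enddate and @startdate <= date2
    overlapping = [(d1, d2) for d1, d2 in spans if d1 <= end and start <= d2]

    # Find the first uncovered day (the TOP 1 ... ORDER BY 1 ASC subquery).
    first_gap = None
    day = start
    while day <= end:
        if not any(d1 <= day <= d2 for d1, d2 in spans):
            first_gap = day
            break
        day += timedelta(days=1)

    if first_gap is None:
        # Assumption: with no gap, every overlapping span qualifies.
        return overlapping
    return [(d1, d2) for d1, d2 in overlapping if d2 < first_gap]


usable = spans_before_first_gap(spans, date(2011, 1, 7), date(2011, 1, 14))
print(usable)
```

For 1/7/11 to 1/14/11 the first gap day is 1/11/11, so only the 1/1/11 - 1/10/11 span is returned.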