Best SQL questions in July 2011

Standard use of 'Z' instead of NULL to represent missing data?

59 votes

Outside of the argument of whether or not NULLs should ever be used: I am responsible for an existing database that uses NULL to mean "missing or never entered" data. It is different from empty string, which means "a user set this value, and they selected 'empty'."

Another contractor on the project is firmly on the "NULLs do not exist for me; I never use NULL and nobody else should, either" side of the argument. However, what confuses me is that since the contractor's team DOES acknowledge the difference between "missing/never entered" and "intentionally empty or indicated by the user as unknown," they use a single character 'Z' throughout their code and stored procedures to represent "missing/never entered" with the same meaning as NULL throughout the rest of the database.

Although our shared customer has asked for this to be changed, and I have supported this request, the team cites this as "standard practice" among DBAs far more advanced than I; they are reluctant to change to use NULLs based on my ignorant request alone. So, can anyone help me overcome my ignorance? Is there any standard, or small group of individuals, or even a single loud voice among SQL experts which advocates the use of 'Z' in place of NULL?

Update

I have a response from the contractor to add. Here's what he said when the customer asked for the special values to be removed to allow NULL in columns with no data:

Basically, I designed the database to avoid NULLs whenever possible. Here is the rationale:

A NULL in a string [VARCHAR] field is never necessary because an empty (zero-length) string furnishes exactly the same information.

A NULL in an integer field (e.g., an ID value) can be handled by using a value that would never occur in the data (e.g, -1 for an integer IDENTITY field).

A NULL in a date field can easily cause complications in date calculations. For example, in logic that computes date differences, such as the difference in days between a [RecoveryDate] and an [OnsetDate], the logic will blow up if one or both dates are NULL -- unless an explicit allowance is made for both dates being NULL. That's extra work and extra handling. If "default" or "placeholder" dates are used for [RecoveryDate] and [OnsetDate] (e.g., "1/1/1900") , mathematical calculations might show "unusual" values -- but date logic will not blow up.

NULL handling has traditionally been an area where developers make mistakes in stored procedures.

In my 15 years as a DBA, I've found it best to avoid NULLs wherever possible.

This seems to validate the mostly negative reaction to this question. Instead of applying an accepted 6NF approach to designing out NULLs, special values are used to "avoid NULLs wherever possible." I posted this question with an open mind, and I am glad I learned more about the "NULLs are useful / NULLs are evil" debate, but I am now quite comfortable labeling the 'special values' approach to be complete nonsense.

an empty (zero-length) string furnishes exactly the same information.

No, it doesn't; in the existing database we are modifying, NULL means "never entered" and empty string means "entered as empty".

NULL handling has traditionally been an area where developers make mistakes in stored procedures.

Yes, but those mistakes have been made thousands of times by thousands of developers, and the lessons and caveats for avoiding those mistakes are known and documented. As has been mentioned here: whether you accept or reject NULLs, representation of missing values is a solved problem. There is no need to invent a new solution just because developers continue to make easy-to-overcome (and easy-to-identify) mistakes.


As a footnote: I have been a DBE and developer for more than 20 years (which is certainly enough time for me to know the difference between a database engineer and a database administrator). Throughout my career I have always been in the "NULLs are useful" camp, though I was aware that several very smart people disagreed. I was extremely skeptical about the "special values" approach, but not well-versed enough in the academics of "How To Avoid NULL the Right Way" to make a firm stand. I always love learning new things, and I still have lots to learn after 20 years. Thanks to all who contributed to make this a useful discussion.

Sack your contractor.

Okay, seriously, this isn't standard practice. This can be seen simply because all RDBMS that I have ever worked with implement NULL, logic for NULL, take account of NULL in foreign keys, have different behaviour for NULL in COUNT, etc, etc.

I would actually contend that using 'Z' or any other place holder is worse. You still require code to check for 'Z'. But you also need to document that 'Z' doesn't mean 'Z', it means something else. And you have to ensure that such documentation is read. And then what happens if 'Z' ever becomes a valid piece of data? (Such as a field for an initial?)
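To make the placeholder problem concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `person` table and its columns are hypothetical and only illustrate the point; they are not taken from the database in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: the same "middle initial" stored two ways.
conn.execute("CREATE TABLE person (name TEXT, initial_null TEXT, initial_z TEXT)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [
        ("Ann", "M", "M"),    # initial was entered
        ("Bob", None, "Z"),   # never entered: NULL vs. the 'Z' placeholder
        ("Zed", "Z", "Z"),    # a genuine initial that happens to be 'Z'
    ],
)

# COUNT(column) skips NULLs, so it counts only rows that hold a real value.
known_with_null = conn.execute(
    "SELECT COUNT(initial_null) FROM person"
).fetchone()[0]

# With a placeholder, every query has to remember to exclude 'Z' explicitly,
# and a genuine 'Z' can no longer be told apart from "missing".
known_with_z = conn.execute(
    "SELECT COUNT(*) FROM person WHERE initial_z <> 'Z'"
).fetchone()[0]

print(known_with_null)  # 2
print(known_with_z)     # 1
```

The NULL column reports two known initials, which is correct; the placeholder column reports one, because Zed's real 'Z' is swallowed by the convention.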

At a basic level, even without debating the validity of NULL vs 'Z', I would insist that the contractor conform to the standard practices that exist within your company, not his. Instituting his standard practice in an environment with an alternative standard practice will cause confusion, maintenance overheads, misunderstanding, and in the end increased costs and mistakes.


EDIT

There are cases where using an alternative to NULL is valid in my opinion. But only where doing so reduces code, rather than creating special cases which require accounting for.

I've used that for date bound data, for example. If data is valid between a start-date and an end-date, code can be simplified by not having NULL values. Instead a NULL start-date could be replaced with '01 Jan 1900' and a NULL end-date could be replaced with '31 Dec 2079'.

This still can change behaviour from what may be expected, and so should be used with care:

  • WHERE end-date IS NULL no longer gives you the data that is still valid
  • You just created your own millennium bug
  • etc.

This is equivalent to reforming abstractions such that all properties can always have valid values. It is markedly different from implicitly encoding specific meaning into arbitrarily chosen values.
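As a rough illustration of the date-range case, the sketch below compares the two styles with Python's sqlite3 module. The table names, the `widget` row and the ISO date strings are assumptions made for the example; only the two boundary dates come from the text above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables: the same open-ended validity range stored two ways.
conn.execute("CREATE TABLE price_null (item TEXT, start_date TEXT, end_date TEXT)")
conn.execute("CREATE TABLE price_sentinel (item TEXT, start_date TEXT, end_date TEXT)")
conn.execute("INSERT INTO price_null VALUES ('widget', NULL, NULL)")
conn.execute("INSERT INTO price_sentinel VALUES ('widget', '1900-01-01', '2079-12-31')")

day = "2011-07-15"

# NULL bounds: each side of the range needs its own IS NULL branch.
with_nulls = conn.execute(
    """SELECT item FROM price_null
       WHERE (start_date IS NULL OR start_date <= ?)
         AND (end_date   IS NULL OR end_date   >= ?)""",
    (day, day),
).fetchall()

# Boundary dates: a plain BETWEEN is enough.
with_sentinels = conn.execute(
    "SELECT item FROM price_sentinel WHERE ? BETWEEN start_date AND end_date",
    (day,),
).fetchall()

# The caveat listed above: IS NULL no longer finds the open-ended rows.
open_ended = conn.execute(
    "SELECT item FROM price_sentinel WHERE end_date IS NULL"
).fetchall()

print(with_nulls)      # [('widget',)]
print(with_sentinels)  # [('widget',)]
print(open_ended)      # []
```

Both queries find the row, but the second predicate is shorter. The price is that any code relying on `IS NULL` to find open-ended rows silently stops working.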

Still, sack the contractor.

When should I use C++ instead of SQL?

10 votes

I am a C++ programmer who occasionally uses MySQL to work with databases, but my SQL knowledge is rather limited. However I am surely willing to change that.

At the moment I am trying to do analysis(!) on the data I have in my database solely with SQL queries. But I am about to give up, and instead import the data to C++ and do the analysis with C++ code.

I have discussed this with my colleagues, and they also push me to use C++, saying that SQL is not meant for complex analysis but mainly for importing (from the existing tables) and exporting (to new tables) data, and a little bit more such as merging data to - e.g. - joined tables.

Can somebody help me draw a line, so I know when to switch to C++? Of course performance is also an issue.

What are the indications that things are getting too complex in SQL? Or maybe I am just taking the wrong approach to designing the queries. If so, where can I find tutorials, books, ... to take a better approach?

I hope this is not too vague. I am really a bit lost.

SQL excels at analyzing large sets of relational data.

The place to draw the line is the scale of your analysis.

If you analyze individual records one at a time, do it in your application.

If you analyze large sets of records as a unit, SQL is definitely the best tool for that job.

Row-by-row analysis is not something SQL is designed or optimized for very well. But, if you want to know something about a million-row group of data, do it in the database.
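The difference between the two styles can be sketched as follows. Python with an in-memory SQLite database stands in here for the asker's C++ and MySQL setup, and the `sale` table is a made-up example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data set used only for illustration.
conn.execute("CREATE TABLE sale (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sale VALUES (?, ?)",
    [("north", 10), ("north", 30), ("south", 5), ("south", 15), ("south", 40)],
)

# Set-based: one statement describes the result and the engine does the work.
in_database = dict(
    conn.execute("SELECT region, SUM(amount) FROM sale GROUP BY region")
)

# Row-by-row: every record is shipped to the application and handled one at a time.
in_application = {}
for region, amount in conn.execute("SELECT region, amount FROM sale"):
    in_application[region] = in_application.get(region, 0) + amount

print(in_database)  # totals: north 40, south 60
```

Both approaches give the same totals. With five rows the choice is irrelevant; with millions of rows, the set-based query avoids moving every record out of the database just to add it up.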

How to reduce remote SQL Server loads?

8 votes

I want to create an application in C# with client and server sides. It will work over a local network. The clients must check for updates on a remote SQL Server. Let's say we've set the update interval to 2 seconds. If I have 20 client applications, they will each query the remote SQL Server every 2 seconds, which will load the server quite a lot. Is there any way to reduce the server load, or is polling the only way to check for updates?

Sorry for my bad english. Thx in advance

From my point of view, there is no need to allow clients to connect to the DB server directly. There should be one more tier here, which alone connects to the server and caches information about the updates. Your clients should connect to this additional tier and work with the cached info.

UPDATE As far as I understand, the problem appears because all your clients ping your DB server every two seconds. The solution to this problem is to create a special module which will only have access to the DB server and asks it for the update. For example, every two seconds. If the update is ready, it should be able to fetch it from the DB and store. This is what I meant under the additional tier.

Now, let's return to your clients. They should be able to communicate with this module and get information from it about a ready update (this information is cached and thus it is really fast to obtain it. Also you needn't ping the server at every client request). If update is ready, fetch it to the client side and work on client side.
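The idea of the additional tier can be sketched in a few lines. This is only an illustration in Python, not the WCF-based design suggested here: `UpdateCache` and `FakeDatabase` are hypothetical names, and a real service would also need locking if several threads call it at once.

```python
import time


class UpdateCache:
    """Middle tier: asks the database at most once per interval,
    no matter how many clients ask it for the latest update."""

    def __init__(self, fetch_latest, interval_seconds=2.0, clock=time.monotonic):
        self._fetch_latest = fetch_latest  # callable that really queries the DB
        self._interval = interval_seconds
        self._clock = clock
        self._cached = None
        self._fetched_at = None

    def latest(self):
        now = self._clock()
        stale = self._fetched_at is None or now - self._fetched_at >= self._interval
        if stale:
            self._cached = self._fetch_latest()
            self._fetched_at = now
        return self._cached


class FakeDatabase:
    """Stand-in for the remote SQL Server; it only counts how often it is hit."""

    def __init__(self):
        self.hits = 0

    def latest_update(self):
        self.hits += 1
        return "update #%d" % self.hits


db = FakeDatabase()
cache = UpdateCache(db.latest_update, interval_seconds=2.0)

answers = []
for _ in range(20):  # 20 clients asking within the same interval
    answers.append(cache.latest())

print(db.hits)  # 1
```

Twenty client requests inside one interval cost the database a single query; the other nineteen are answered from the cached value.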

As for the communication between this additional tier and the clients: since you are working with .NET, I would suggest that you take a look at WCF, which, from my point of view, is becoming the standard approach to inter-process communication in .NET. There is a lot of information about it online; I will post the links shortly.

Here is my favorite WCF book:

Programming WCF Services

MSDN entry:

Windows Communication Foundation

Advanced SQL Select Query

8 votes
week      cookie
1         a
1         b
1         c
1         d
2         a 
2         b
3         a
3         c
3         d

This table represents visits to a website by week. Each cookie represents an individual person, and each entry means that person visited the site in that week. For example, the last entry means 'd' came to the site in week 3.

I want to find out how many of the same people keep coming back in the following weeks, given a start week to look at.

For example, if I look at week 1, I will get a result like:

1 | 4
2 | 2
3 | 1

Because 4 users came in week 1. Only 2 of them (a, b) came back in week 2. Only 1 of them (a) came in all 3 weeks.

How can I do a select query to find out? The table will be big: there might be 100 weeks, so I want to find the right way to do it.

Thank you so much. I really need to get this done.

This query uses variables to track adjacent weeks and work out if they are consecutive:

set @start_week = 2, @week := 0, @conseq := 0, @cookie:='';
select conseq_weeks, count(*)
from (
select 
  cookie,
  if (cookie != @cookie or week != @week + 1, @conseq := 0, @conseq := @conseq + 1) + 1 as conseq_weeks,
  (cookie != @cookie and week <= @start_week) or (cookie = @cookie and week = @week + 1) as conseq,
  @cookie := cookie as lastcookie,
  @week := week as lastweek
from (select week, cookie from webhist where week >= @start_week order by 2, 1) x
) y
where conseq
group by 1;

This is for week 2. For another week, change the start_week variable at the top.

Here's the test:

create table webhist(week int, cookie char);
insert into webhist values (1, 'a'), (1, 'b'), (1, 'c'), (1, 'd'), (2, 'a'), (2, 'b'), (3, 'a'), (3, 'c'), (3, 'd');

Output of above query with where week >= 1:

+--------------+----------+
| conseq_weeks | count(*) |
+--------------+----------+
|            1 |        4 |
|            2 |        2 |
|            3 |        1 |
+--------------+----------+

Output of above query with where week >= 2:

+--------------+----------+
| conseq_weeks | count(*) |
+--------------+----------+
|            1 |        2 |
|            2 |        1 |
+--------------+----------+
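For comparison, here is a sketch of the same count written without session variables. It is a different technique from the query above (a correlated subquery), run through Python's sqlite3 module, and it assumes each (week, cookie) pair appears at most once in `webhist`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE webhist (week INTEGER, cookie TEXT)")
conn.executemany(
    "INSERT INTO webhist VALUES (?, ?)",
    [(1, "a"), (1, "b"), (1, "c"), (1, "d"),
     (2, "a"), (2, "b"),
     (3, "a"), (3, "c"), (3, "d")],
)

RETENTION_SQL = """
SELECT w.week - :start + 1 AS conseq_weeks, COUNT(*) AS visitors
FROM webhist AS w
WHERE w.week >= :start
  -- keep the row only if this cookie was seen in every week from :start to w.week
  AND (SELECT COUNT(*)
       FROM webhist AS p
       WHERE p.cookie = w.cookie
         AND p.week BETWEEN :start AND w.week) = w.week - :start + 1
GROUP BY w.week
ORDER BY w.week
"""

from_week_1 = conn.execute(RETENTION_SQL, {"start": 1}).fetchall()
from_week_2 = conn.execute(RETENTION_SQL, {"start": 2}).fetchall()

print(from_week_1)  # [(1, 4), (2, 2), (3, 1)]
print(from_week_2)  # [(1, 2), (2, 1)]
```

The results match the two outputs shown above. The correlated subquery runs once per row, so on a large table an index on (cookie, week) would likely matter; that has not been measured here.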

p.s. Good question, but a bit of a ball-breaker

Should I TRUNCATE TABLE before I DROP TABLE to avoid logging overhead?

7 votes

I understand that using TRUNCATE is a minimally logged operation and does not log the deletion of each record while DROP logs delete operations.

So, is it safe to assume that if I want to get rid of a relatively large table and I want this to happen as QUICKLY and with as LITTLE logging overhead as possible I should TRUNCATE TABLE before I DROP TABLE? Does doing this in RECOVERY SIMPLE make any difference?

I should note that this needs to happen in an automated fashion (within pre-written scripts) because this will be deployed to client databases where both downtime and log file growth could be a problem.

While TRUNCATE doesn't log individual rows, it does log for the page/extent. This is why you can rollback a truncate (which not a lot of people know). My guess is that if you just truncate then drop it will actually be slower than a drop on its own. If you commit in between, maybe not, but it would also depend on the log activity, recovery model, when you hit a checkpoint, etc.

Why is the speed important here? It's not like users are using the table if you're about to drop it...

Why don't you test it? Unless someone has run an extensive study about this covering several different variables, I doubt you're going to get much more than quasi-educated guesses.

PHP - Expose own size to client (so the client knows how much it is downloading)

7 votes

My PHP script is outputting the contents of a .sql file, after it has been called by a POST request from my Delphi Desktop Client.

Here is what is happening:

  1. My Desktop Client sends a POST request to my PHP Script.
  2. The Script then calls mysqldump and generates a file - xdb_backup.sql
  3. The Script then include "xdb_backup.sql"; which will print and return it to the Desktop Client, whereafter it deletes the SQL file.

The problem is that the size of the SQL file can vary (for testing, I generated one that is 6 MB). I would like my desktop client to be able to show the progress; however, the PHP script does not expose its size, so I have no Progressbar.Max value to assign.

How can I make my PHP script let the client know how big the output is before the whole thing is over?

Note: Downloading the SQL file is not an option, as the script has to destroy it. :)

You would do

$fsize = filesize($file_path); 

where $file_path is the path to the generated file xdb_backup.sql, to get the file size on the server.

Then send that size in the response headers with the following line (it has to run before any output is sent):

header("Content-Length: " . $fsize);

Take a look at http://www.hotscripts.com/forums/php/47774-download-script-not-sending-file-size-header-corrupt-files-since-using-remote-file-server.html which explains a download php script.

php sql injection

7 votes

I have been surfing these days and got to know about SQL INJECTION ATTACK. i have tried to implement on my local machine to know how this can be done so that i can prevent it in my system...

i have written code like this

PHP Code :

if(count($_POST) > 0){

       $con = mysql_connect("localhost","root","") or die(mysql_error());
    mysql_select_db('acelera',$con) or die(mysql_error()); //
    echo $sql = 'SELECT * FROM acl_user WHERE user_email = "'.$_POST['email'].'" AND user_password = "'.$_POST['pass'].'"';
    $res_src = mysql_query($sql);
    while($row = mysql_fetch_array($res_src)){
        echo "<pre>";print_r($row);echo "</pre>";
    }
}

HTML CODE :

<html>
<head></head>
<body>

 EMAIL : <input type="text" name="email" id="email" /><br />
    PASWD : <input type="text" name="pass" id="pass" /><br />
    <input type="submit" name="btn_submit" value="submit email pass" />
        </body>
</html>

With this code, if I give the input " OR ""=" then SQL injection should happen, but it is not working properly. The POST data contains additional slashes when I give the above input in the password field.

Can anyone show me how a SQL INJECTION ATTACK can actually be done? (Code would be much appreciated.)

You probably have magic quotes enabled. Check the return value of get_magic_quotes_gpc.

"Magic quotes" is an antique attempt from PHP to auto-magically prevent SQL injection, but in current versions it has been deprecated and you are encouraged to use prepared statements to avoid SQL injection.

See here how to disable them so you can experiment with SQL injection.
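For readers who want to see the attack and the fix side by side, here is a small sketch using Python's sqlite3 module rather than PHP and MySQL. The table mirrors `acl_user` from the question; the row in it is made up, and single quotes are used in place of the question's double quotes so the example does not depend on how the engine treats double-quoted strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE acl_user (user_email TEXT, user_password TEXT)")
conn.execute("INSERT INTO acl_user VALUES ('alice@example.com', 's3cret')")

email = "alice@example.com"
password = "' OR ''='"  # single-quote version of the input from the question

# Vulnerable: the input is pasted into the SQL text, so its quotes become SQL
# syntax and the WHERE clause ends with: ... user_password = '' OR ''=''
unsafe_sql = (
    "SELECT * FROM acl_user WHERE user_email = '" + email +
    "' AND user_password = '" + password + "'"
)
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safe: placeholders keep the input as data, whatever characters it contains.
safe_rows = conn.execute(
    "SELECT * FROM acl_user WHERE user_email = ? AND user_password = ?",
    (email, password),
).fetchall()

print(len(unsafe_rows))  # 1 -- logged in without knowing the password
print(len(safe_rows))    # 0
```

The concatenated query returns Alice's row even though the password is wrong, because the trailing `OR ''=''` is always true. The parameterized query compares the whole input against the stored password and finds nothing.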

Is there a reason not to use views in Oracle?

7 votes

I have recently noticed that nobody uses views in my company (and it's a big company).

I want to create a few views largely because they make my queries simpler to the eye, and these views are on somewhat big tables that don't get very frequent updates (once a day).

My alternative is to create a type table of type record and populate it each time an SP is called. Is this better than using a view? (my guess is no)

PS: the database is Oracle 10g.

EDIT:

  • Yes, I have asked around, but no one could give me a reason.
  • Both the views and the queries that will be using them are heavy on joins.

Aesthetics doesn't have a place in SQL, or coding in general, when there are performance implications.

If the optimizer determines that predicate pushing can occur, a view will be as good as directly querying the table(s) the view represents. And as Justin mentions, because a view is a macro that expands into the underlying query the view represents -- a soft parse (re-use of the query from cache) is very likely because the cache check needs to match queries exactly.

But be aware of the following:

  • layering views (one view based on another) is a bad practice -- errors won't be encountered until the view is run
  • joining views to other tables and or views is highly suspect -- the optimizer might not see things as well if the underlying query is in place of the view reference. I've had such experiences, because the views joined to were doing more than what the query needed -- sometimes, the queries from all the views used were condensed into a single query that ran much better.

I recommend creating your views, and comparing the EXPLAIN plans to make sure that you are at least getting identical performance. I'd need to see your code for populating a TYPE before commenting on the approach, but it sounds like a derived table in essence...

It's possible you would benefit from using materialized views, but they are notoriously restricted in what they support.

SQL - order by list order

6 votes

I have the following query that returns rows based on a comma-separated list

Select * from Table where RecordID in (22,15,105,1,65,32)

I would like the results of this query to be returned in the order of the IDs in the list. Is that possible with SQL?

Thanks in advance

If you need the output to appear in a particular order, then you need to specify that order, using something the server can sort. Not knowing which engine you're working against, the general scheme would be to create a temp table or use rowset constructors to pair each record ID with its desired sort order.

E.g. (SQL Server)

declare @T table (RecordID int,Position int)
insert into @T (RecordID,Position)
select 22,1 union all
select 15,2 union all
select 105,3 union all
select 1,4 union all
select 65,5 union all
select 32,6

select * from Table t inner join @T t2 on t.RecordID = t2.RecordID order by t2.Position
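The same scheme can be driven from application code when the list of IDs arrives at run time. The sketch below uses Python's sqlite3 module; the `record` table, its contents and the `wanted_order` temp table are hypothetical stand-ins for `Table` and `@T`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table standing in for "Table" in the question.
conn.execute("CREATE TABLE record (RecordID INTEGER PRIMARY KEY, label TEXT)")
conn.executemany(
    "INSERT INTO record VALUES (?, ?)",
    [(i, "row %d" % i) for i in (1, 15, 22, 32, 65, 105)],
)

wanted = [22, 15, 105, 1, 65, 32]

# Pair each ID with its position in the list, like the @T table variable above.
conn.execute("CREATE TEMP TABLE wanted_order (RecordID INTEGER, Position INTEGER)")
conn.executemany(
    "INSERT INTO wanted_order VALUES (?, ?)",
    [(record_id, position) for position, record_id in enumerate(wanted, start=1)],
)

ordered_ids = []
for row in conn.execute(
    """SELECT r.RecordID
       FROM record AS r
       JOIN wanted_order AS o ON o.RecordID = r.RecordID
       ORDER BY o.Position"""
):
    ordered_ids.append(row[0])

print(ordered_ids)  # [22, 15, 105, 1, 65, 32]
```

The join replaces the IN list entirely: only the IDs present in `wanted_order` come back, and `Position` gives the server something it can sort on.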

Impact of Package Size to Performance in Oracle 10g

6 votes

Using Oracle 10g. The original Oracle designer for this project has moved on, and those of us remaining are reasonable developers for Oracle, but we need some tuning and planning assistance.

We have compartmentalized procedures in 'like' packages, some (many) of which have grown in size to include many (50 ish) procedures of varying complexities.

At this point several small procedures (select ID from Contract where Item = 'xyz') in these larger packages take much longer 'than expected' to execute from inside these packages (from TOAD, SQL Developer, or from the .NET Oracle Provider) than they do if the proc is compiled on its own or into a smaller package. (Tables are indexed)

Should there be a performance cost for using such large packages even when calling relatively simple individual procedures or is there some 'other' factor we should be looking for?

(note: upgrade to Oracle 11 is planned but not 'imminent')

The first time any method in a package is invoked, the entire package needs to be read into memory. In general, that is supposed to be an advantage on the assumption that if you are calling one method in a package, it's likely that many related methods will be called either by the one procedure you called or by subsequent application calls. But it does mean that the first execution is potentially slowed by loading much more code than may strictly be necessary for the simple function. That penalty should disappear, however, once the package has been loaded into memory. It doesn't sound like you're talking about a problem with the performance of the first call of a procedure, though, which would tend to rule this out.

Is there any code in the package's initialization block that would run before the small procedure was executed that might be skewing the results?

How are you determining how long it takes to call these small procedures and what does "much longer" mean? Are you calling them a handful of times and measuring some small number of elapsed milliseconds and seeing, say, a 30% increase in execution times? Or are you calling them thousands of times and seeing a 1000% increase in execution times?

How to structure an extremely large table

6 votes

This is more of a conceptual question. It's inspired by using some extremely large tables where even a simple query takes a long time (properly indexed). I was wondering if there is a better structure than just letting the table grow continually.

By large I mean 10,000,000+ records, growing by something like 10,000 a day. A table like that would gain 10,000,000 additional records every 2.7 years. Let's say that the more recent records are accessed the most, but the older ones need to remain available. I have two conceptual ideas to speed it up.

1) Maintain a master table that holds all the data, indexed by date in reverse order. Create a separate view for each year that holds only the data for that year. Then when querying, and let's say the query is expected to pull only a few records from a three-year span, I could use a union to combine the three views and select from those.

2) The other option would be to create a separate table for every year. Then, again using a union to combine them when querying.

Does anyone else have any other ideas or concepts? I know this is a problem Facebook has faced, so how do you think they handled it? I doubt they have a single table (status_updates) that contains 100,000,000,000 records.

The main RDBMS providers all have similar concepts in terms of partitioned tables and partitioned views (as well as combinations of the two)

There is one immediate benefit, in that the data is now split across multiple conceptual tables, so any query that includes the partition key within the query can automatically ignore any partition that the key would not be in.

From an RDBMS management perspective, having the data divided into separate partitions allows operations such as backup, restore and indexing to be performed at the partition level. This helps reduce downtime and allows far faster archiving, by just removing an entire partition at a time.

There are also non relational storage mechanisms such as nosql, map reduce etc, but ultimately how it is used, loaded and data is archived become a driving factor in the decision of the structure to use.

10 million rows is not that large on the scale of large systems; partitioned systems can and will hold billions of rows.

Changing a SQL column title via query

6 votes

I have the following query:

SELECT product_description.name, product.quantity,product.price,product_option_value_description.name,product_option_value.quantity
FROM product
INNER JOIN product_description
ON product.product_id=product_description.product_id
INNER JOIN product_option_value_description
ON product.product_id=product_option_value_description.product_id
INNER JOIN product_option_value
ON product.product_id=product_option_value.product_id
ORDER BY product_description.name 

How could I change the title for product_option_value_description.name as I would like to name this option.

Use an alias like so:

product_option_value_description.name AS `Option`

If you want to change the column's name, not only for this query but in general use ALTER TABLE

ALTER TABLE product_option_value_description CHANGE name newname DATATYPE;

6 votes

So I was writing a query in Visual Studio 2010 (by which I mean I opened the server explorer, right clicked the server and chose New Query). The query includes the condition

A AND B AND C AND D AND E AND F AND (G OR H)

which is conjunctive normal form (CNF). When I ran the query (attached to MSSQL Server 2008), it changed the text into

A AND B AND C AND D AND E AND F AND G OR
A AND B AND C AND D AND E AND F AND H

which is disjunctive normal form (DNF).

From the little I found on-line, it seems like DNF allows SQL to run the conjunctives separately and union them at the end.

However, for something like this, with so many repeated conditions, does DNF actually provide an advantage over CNF? If it doesn't, how can I force the optimizer to take the condition as is? If it does, should I write the query in my application code in CNF form because it's shorter and neater or in DNF form because it saves time for the optimizer?

I don't know about the relative advantages of DNF/CNF in this situation, or even how to force the optimizer in this fashion.

Generally speaking, you don't want to force the optimizer to take your 'perceived', 'current', optimization over the one it will generate (there are exceptions to this, but these are usually rare). This largely has to do with the fact that the 'best' optimization may change over time, as a side effect of other actions (like adding an index). If you're forcing the optimizer to adopt a particular optimization, you're locking it into that path, even if a new one may perform better.

Given that, you should write the query in the form that is easiest to read and maintain (CNF), and let the optimizer change it if necessary - this is the whole point of SQL being a declarative language, to allow the optimizer to muck with things as necessary.
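That the two forms are interchangeable can be checked mechanically. The sketch below is a plain truth-table comparison in Python, with `a` standing for the whole conjunction A AND B AND C AND D AND E AND F. It uses two-valued logic only and ignores SQL's NULL/UNKNOWN.

```python
from itertools import product


def cnf(a, g, h):
    # A AND ... AND F AND (G OR H), with `a` standing for A..F combined
    return a and (g or h)


def dnf(a, g, h):
    # (A AND ... AND F AND G) OR (A AND ... AND F AND H)
    return (a and g) or (a and h)


# Exhaustively compare the two forms over every truth assignment.
mismatches = []
for values in product([False, True], repeat=3):
    if cnf(*values) != dnf(*values):
        mismatches.append(values)

print(mismatches)  # []
```

An empty list means the rewrite done by the tool does not change which rows qualify; it only changes how the condition is spelled.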

Split a string and return greatest in mssql

6 votes

I need to find a way to get the data with the highest versionNumber.

Here is my database design:

VERSIONNUMBER - varchar(15)
DOWNLOADPATH - varchar(100)

Lets say I have records like:

VERSIONNUMBER -------- DOWNLOADPATH
1.1.2                  a.com
1.1.3                  b.com
2.1.4                  c.com
2.1.5                  d.com
2.2.1                  e.com

I need to get the record with the versionnumber 2.2.1. Need some help with the sql though :)

Thank you for any help

Try this:

with a as
(
    select * from (values
    ('1.1.2'),('1.1.3'),('2.1.4 '), ('2.1.5'), ('2.2.1') ) as b(c)
)
select c, PARSENAME(c,1),PARSENAME(c,2), PARSENAME(c,3)
from a
order by 
convert(int,PARSENAME(c,3)),
convert(int,PARSENAME(c,2)),
convert(int,PARSENAME(c,1))

Inspired from: http://www.sql-server-helper.com/tips/sort-ip-address.aspx

with a as
(
    select * from (values
    ('1.1.2'),('1.1.3'),('2.1.4 '), ('2.1.5'), ('2.2.1') ) as b(c)
),
x as 
(
    select c, 
       convert(int,PARSENAME(c,3)) * 100 
       + convert(int,PARSENAME(c,2)) * 10 
       + convert(int,PARSENAME(c,1)) * 1 as the_value
    from a
)
select c from x where the_value = (select MAX(the_value) from x)

In software development, it is common for a minor version number to have two digits. The parts of a version number are compared as whole numbers, not as decimal digits, so version 1.12 is greater than 1.5. To compensate for that, you must pad each part adequately:

    -- Use this, the query above is not future-proof :-)
with a as
(
    select * from (values
    ('2.1.4 '), ('2.1.12'), ('2.1.5'), ('2.2.1') ) as b(c)
),
x as 
(
    select c, 
       convert(int,PARSENAME(c,3)) * 100*100*100 
       + convert(int,PARSENAME(c,2)) * 100*100 
       + convert(int,PARSENAME(c,1)) * 100 as the_value
    from a
)
select c, the_value from x   
order by the_value

Output:

2.1.4   2010400
2.1.5   2010500
2.1.12  2011200
2.2.1   2020100

If you don't take that into consideration (as with the following query):

with a as
(
    select * from (values
    ('2.1.4 '), ('2.1.12'), ('2.1.5'), ('2.2.1') ) as b(c)
),
x as 
(
    select c, 
       convert(int,PARSENAME(c,3)) * 100
       + convert(int,PARSENAME(c,2)) * 10
       + convert(int,PARSENAME(c,1)) * 1 as the_value
    from a
)
select c, the_value from x   
order by the_value;


    -- KorsG's answer has a bug too
with a as
(
    select * from (values
    ('2.1.4 '), ('2.1.12'), ('2.1.5'), ('2.2.1') ) as b(c)
),
x as 
(
    select c, 
       CAST(REPLACE(c, '.', '') AS int) as the_value
    from a
)
select c, the_value from x   
order by the_value      

Those two queries sort the versions in the same (incorrect) order; the first one yields:

c           the_value
2.1.4   214
2.1.5   215
2.2.1   221
2.1.12  222

The value for 2.1.12 (222) overtakes the value for 2.2.1 (221). The same thing happens when you merely remove the dots and convert the resulting string to int: 2.1.12 becomes two thousand one hundred twelve and 2.2.1 becomes two hundred twenty-one. Yet 2.2.1 is greater than 2.1.12, not less.
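The comparison rule the padded query approximates is "compare part by part, as integers". If the sorting can happen in application code, that rule is short to express. This is a sketch in Python, not a replacement for the SQL above, and `version_key` is a name chosen for the example.

```python
versions = ["2.1.4 ", "2.1.12", "2.1.5", "2.2.1"]  # same test data, stray space included


def version_key(version):
    """Turn '2.1.12' into (2, 1, 12) so parts compare as numbers, left to right."""
    return tuple(int(part) for part in version.strip().split("."))


ordered = sorted(versions, key=version_key)
highest = max(versions, key=version_key)

print([v.strip() for v in ordered])  # ['2.1.4', '2.1.5', '2.1.12', '2.2.1']
print(highest)                       # 2.2.1
```

Tuple comparison has no padding width to outgrow, whereas the multiply-by-100 scheme starts to misorder versions as soon as any part reaches 100.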

Query that counts different combinations

6 votes

First of all, my apologies for the title of this question, I don't have a better idea for the moment. Make a good suggestion and I will fix the title. (If I have permission to do that, I actually don't know.)

The situation:

I am having a tough time getting the correct SQL query done. I have a setup where people can place orders, with products etc, and they get discounts in certain circumstances.

Consider the following schema:

Product:
  [...]

Price:
  product_id: integer
  description: string
  [...]

Order:
  [...]

OrderItem:
  order_id: integer
  price_id: integer
  amount: integer

And consider the following rules:

  • There are 9 different products.
  • All Products have 2 Prices, one with description PriceA and one described PriceB.
  • All these prices are the same per type for every product. (That is, all PriceA prices are the same, and all PriceB prices are the same.)

The problem:

For every set of 5 different products with the same price level (i.e. PriceA vs. PriceB), the total price of the order is lowered by a certain amount. I am trying to write a query that tells me how many times that happens.

Examples:

Example 1:
A user places an order:

  • 5 times product1,
  • 5 times product2,
  • 5 times product3,
  • 3 times product4,
  • 3 times product5.

All at PriceA, the customer receives 3 times the discount, since there are 3 complete sets of 5

Example 2: A user places an order:

  • 5 times product1,
  • 5 times product2,
  • 5 times product3,
  • 5 times product4,
  • 2 times product5,
  • 2 times product6,
  • 2 times product7

All at the PriceA price. Now the customer receives 5 times the discount, since there are 5 sets of 5: two involving product5, two involving product6 and one involving product7.

The struggle:

I tried this SQL:

SELECT min(amount) as amount from
    (SELECT oi.amount from `order` o
        inner join orderitem oi on oi.order_id = o.id
        inner join price p on oi.price_id = p.id AND p.description = "PriceA"
        inner join product pr on p.product_id = pr.id
        order by oi.amount desc
        limit 5) as z
    having count(amount) = 5;

This beautifully works for Example 1, but in example 2, it will give the wrong result, as it will select the first set of 5 items, and then disregard the

The question is: Is this solvable in SQL? Or would I be better off broadening my selection and doing the math by scripting? (My web application is written in PHP, so I do have room for some server-side mathematics.)

The solution:

I implemented Neil's solution; it now looks like this:

/** @var $oid integer The order ID. */

/* Select all amounts ordered per item, only for a given price title. */
$sql = <<<SQL
SELECT oi.amount as amount FROM orderitems oi
    INNER JOIN orders   o  ON oi.order_id  = o.id AND o.id = $oid
    INNER JOIN prices   p  ON oi.price_id  = p.id AND p.title = '%s'
    INNER JOIN products pr ON p.product_id = pr.id
    ORDER BY oi.amount DESC
SQL;
$discountInfo = array(
    array(
        'price'     => 'PriceA',
        'discounts' => array(
            9 => 49.50, /* Key is 'size of set', value is 'discount'. */
            5 => 23.50
        ),
    ),
    array(
        'price' => 'PriceB',
        'discounts'  => array(
            9 => 22,
            5 => 10,
        ),
    ),
);

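/* Running discount total; assumed to start at zero (it is not initialised in this snippet). */
$discount = 0;

foreach($discountInfo as $info)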
{
    /* Store all ordered amounts per item in Array. */
    $arr = array();
    $result = $this->dbQuery(sprintf($sql,$info['price']));
    while ($row = mysql_fetch_assoc($result)) $arr[] = $row['amount'];

    foreach ($info['discounts'] as $am => $di)
    {
        /* For each highest set of $am items, fetch the smallest member. */
        while (count($arr) >= $am)
        {
            /* Add discount for each complete set */
            $discount += $di * $arr[$am - 1];

            /* Subtract the amounts from all members of the set */
            for ($i=0; $i<=$am - 1; $i++) $arr[$i] -= $arr[$am - 1];

            /* Remove all elements that are 0 */
            foreach ($arr as $k=>$v) if ($v == 0) unset ($arr[$k]);

            /* Array is messed up now, re-sort and re-index it. */
            rsort($arr);
        } 
    } 
}

This is how I would do it in code: Split the items into two arrays, one for each price level. For each array of products, while there are at least five products in the array:

  1. Sort the array in descending order by the number of items of product
  2. Add the number of items of the fifth product in the array to the total number of discounts
  3. Subtract the number of items of the fifth product in the array from the first five products in the array
  4. Delete any zero elements from the array
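
The four steps above can be sketched as follows. This is a minimal Python sketch rather than the poster's PHP; the function name count_discount_sets and the set_size parameter are assumptions made for illustration, and the input is taken to be the list of ordered amounts for a single price level.

```python
def count_discount_sets(amounts, set_size=5):
    """Count complete sets of `set_size` different products (greedy method).

    `amounts` holds the ordered quantity of each product at one price level.
    The name and signature are illustrative, not taken from the original post.
    """
    remaining = [a for a in amounts if a > 0]
    sets = 0
    while len(remaining) >= set_size:
        # Step 1: sort in descending order by number of items.
        remaining.sort(reverse=True)
        # Step 2: the amount of the fifth product is the number of sets
        # that can be formed from the current top five products.
        smallest = remaining[set_size - 1]
        sets += smallest
        # Step 3: subtract that amount from the first five products.
        for i in range(set_size):
            remaining[i] -= smallest
        # Step 4: delete any zero elements.
        remaining = [a for a in remaining if a > 0]
    return sets
```

With the amounts from the two examples in the question, this returns 3 for [5, 5, 5, 3, 3] and 5 for [5, 5, 5, 5, 2, 2, 2].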

Fastest way to find string by substring in SQL?

5 votes

I have a huge table with 2 columns: Id and Title. Id is bigint, and I'm free to choose the type of the Title column: varchar, char, text, whatever. The Title column contains random text strings like "abcdefg", "q", "allyourbasebelongtous", with a maximum of 255 chars.

My task is to get strings by a given substring. Substrings also have random length and can be at the start, middle or end of the strings. The most obvious way to do it:

SELECT * FROM t WHERE Title LIKE '%abc%'

I don't care about INSERT performance; I only need fast selects. What can I do to make the search as fast as possible?

I use MS SQL Server 2008 R2; full-text search will be useless, as far as I can see.

If you want to use less space than Randy's answer and there is considerable repetition in your data, you can create an N-Ary tree data structure where each edge is the next character and hang each string and trailing substring in your data on it. You number the nodes in depth first order. Then you can create a table with up to 255 rows for each of your records, with the Id of your record, and the node id in your tree that matches the string or trailing substring. Then when you do a search, you find the node id that represents the string you are searching for (and all trailing substrings) and do a range search.
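
A rough in-memory sketch of that idea, in Python rather than SQL, might look like the following. All names here (Node, build_index, search) are hypothetical, and the sorted rows list only stands in for the (node id, record Id) table the answer describes; a real implementation would store those rows in SQL Server and run the range search there.

```python
from bisect import bisect_left, bisect_right


class Node:
    """One trie node; each edge to a child is the next character."""

    def __init__(self):
        self.children = {}
        self.first = 0  # depth-first (preorder) number of this node
        self.last = 0   # largest preorder number inside this node's subtree


def build_index(records):
    """Build the trie and the (node_id, record_id) rows.

    `records` maps record Id -> Title. Every trailing substring (suffix)
    of every title is hung on the trie, as the answer suggests.
    """
    root = Node()
    ends = []  # (node where a suffix ends, record id)
    for rec_id, title in records.items():
        for start in range(len(title)):
            node = root
            for ch in title[start:]:
                node = node.children.setdefault(ch, Node())
            ends.append((node, rec_id))

    counter = 0

    def number(node):
        # Number nodes in depth-first order; a subtree is then a contiguous range.
        nonlocal counter
        node.first = counter
        counter += 1
        for ch in sorted(node.children):
            number(node.children[ch])
        node.last = counter - 1

    number(root)
    rows = sorted((node.first, rec_id) for node, rec_id in ends)
    return root, rows


def search(root, rows, needle):
    """Return the ids of all records whose title contains `needle`."""
    node = root
    for ch in needle:
        node = node.children.get(ch)
        if node is None:
            return set()
    node_ids = [node_id for node_id, _ in rows]
    lo = bisect_left(node_ids, node.first)
    hi = bisect_right(node_ids, node.last)
    # Range search: every suffix starting with `needle` ends inside this subtree.
    return {rec_id for _, rec_id in rows[lo:hi]}
```

The key property is that depth-first numbering turns "all suffixes that start with the search string" into one contiguous range of node ids, which is what makes an indexed range search possible.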

SQL Query - Need to improve performance

5 votes

I have a data load scenario where I create a dynamic SQL query to pull data and cache it in our service. There is one table that contains all product data: ProductHistory (47 columns, 200,000+ records, and it will keep growing).

What I need: Get the latest products by using the maximum id, maximum version and maximum changeid.

First Attempt:

SELECT distinct Product.* FROM ProductHistory product 
WHERE  product.version = 
(SELECT max(version) from ProductHistory p2 where product.Id = p2.Id 
  and product.changeId = 
(SELECT max(changeid) from ProductHistory p3 where p2.changeId = p3.changeId))

This took more than 2.51 minutes.

Other Failed Attempt:

select distinct product.* from ProductHistory product 
where CAST(CAST(id as nvarchar)+'0'+CAST(Version as nvarchar)+'0'+CAST(changeid as nvarchar) as decimal) = 
(select MAX(CAST(CAST(id as nvarchar)+'0'+CAST(Version as nvarchar)+'0'+CAST(changeid as nvarchar) as decimal)) from ProductHistory p2 
where product.Id = p2.Id)

It basically uses the same principle as when you order dates, concatenating the numbers ordered by relevance.

For example 11 Jun 2007 = 20070611
And in our case: Id = 4 , version = 127, changeid = 32   => 40127032
The zeros are there not to mix up the 3 different ids

But this one takes 3.10 minutes !!! :(

So, I basically need a way to make my first-attempt query faster. I was also wondering whether, with this amount of data, this is the best retrieval speed I should expect.

  1. I ran sp_helpindex ProductHistory and found the following index:

    PK_ProductHistoryNew - clustered, unique, primary key located on PRIMARY - Id, Version

  2. I wrapped the first query in a stored procedure, but it made no difference.

So, by what other means can we improve the performance of this operation?

Thanks, Mani

P.S.: I am just running these queries in SQL Server Management Studio to see the time.

Run the query from SQL Server Management Studio and look at the query plan to see where the bottleneck is. Any place you see a "table scan" or "index scan", it has to go through all the data to find what it is looking for. If you create appropriate indexes that can be used for these operations, performance should improve.
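
To illustrate the scan-versus-index point, here is a small sketch using Python's built-in sqlite3 module. SQLite is only a stand-in here, not SQL Server: its EXPLAIN QUERY PLAN output and planner differ from what Management Studio shows. The cut-down ProductHistory columns, the filter on ChangeId and the index name are assumptions made for the example.

```python
import sqlite3


def plan_details(conn, sql):
    """Return the human-readable 'detail' column of SQLite's query plan."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


def compare_plans():
    conn = sqlite3.connect(":memory:")
    # Cut-down, hypothetical version of the table from the question.
    conn.execute(
        "CREATE TABLE ProductHistory ("
        " Id INTEGER, Version INTEGER, ChangeId INTEGER, Name TEXT,"
        " PRIMARY KEY (Id, Version))"
    )
    query = "SELECT * FROM ProductHistory WHERE ChangeId = 32"

    # No index covers ChangeId yet, so the planner has to scan the table.
    before = plan_details(conn, query)

    # Hypothetical index name; any index leading with ChangeId would do.
    conn.execute(
        "CREATE INDEX ix_producthistory_changeid ON ProductHistory (ChangeId)"
    )
    after = plan_details(conn, query)
    conn.close()
    return before, after


if __name__ == "__main__":
    before, after = compare_plans()
    print("before:", before)
    print("after: ", after)
```

The plan reports a full scan before the index exists and an index search afterwards; the exact wording of the plan lines varies between SQLite versions.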

SQL Server single query memory usage

5 votes

I would like to find out, or at least estimate, how much memory a single (specific) query uses while executing. There is no point in posting the query here, as I would like to do this on multiple queries and see if there is a change over different databases. Is there any way to get this info?

Using SQL Server 2008 R2

thanks

Gilad.

You might want to take a look at the DMVs (Dynamic Management Views), specifically sys.dm_exec_query_memory_grants. See for example this query (taken from here):

DECLARE @mgcounter INT
SET @mgcounter = 1
WHILE @mgcounter <= 5 -- return data from dmv 5 times when there is data
BEGIN
    IF (SELECT COUNT(*)
        FROM sys.dm_exec_query_memory_grants) > 0
    BEGIN
        SELECT *
        FROM sys.dm_exec_query_memory_grants mg
            CROSS APPLY sys.dm_exec_sql_text(mg.sql_handle) -- shows query text
        -- WAITFOR DELAY '00:00:01' -- add a delay if you see the exact same query in results
        SET @mgcounter = @mgcounter + 1
    END
END

The loop above keeps polling until some query holds a memory grant, and then collects the memory data. To use it, start the loop first, and then run the query you want to monitor in a separate session.

Reasons for objection to SQL triggers that insert data into other tables?

5 votes

I'm being told by a person with some authority in our company that it's a "database no-no" to create triggers in a database that change rows in another table.

I've used this technique to create default initial configuration, auto-maintaining audit logs, and various other things that would have been a nightmare to consistently maintain inside the heterogeneous applications that connect to that database. For over a decade, I've read that this is an appropriate way to centralize relationship constraint maintenance and get the responsibility out of the applications interacting with the data.

As such, my BS meter is pegging with this. Am I missing something fundamentally wrong with that technique that makes it a bad practice in general?

If you are careful with your trigger code, there is nothing inherently bad about it. Some people get bitten by bad trigger code and then decide that triggers are bad (even though it was the bad trigger code that was the problem). They then generalize this as, "never use triggers".

The other problem is....

Using the audit tables as an example, suppose you have a stored procedure that updates a table AND puts data into an audit table. Now suppose you write trigger code that also puts data into the audit table. You could end up with duplicate audit data.
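
The duplicate-audit scenario can be reproduced in a few lines. The sketch below uses Python's built-in sqlite3 module as a stand-in for the real database; the account and account_audit tables, the trigger name, and the update_balance function (which plays the role of the stored procedure) are all hypothetical.

```python
import sqlite3


def run_demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER);
        CREATE TABLE account_audit (
            account_id INTEGER, new_balance INTEGER, source TEXT);
        INSERT INTO account VALUES (1, 100);

        -- Trigger added later: writes an audit row on every update.
        CREATE TRIGGER trg_account_audit AFTER UPDATE ON account
        BEGIN
            INSERT INTO account_audit
            VALUES (NEW.id, NEW.balance, 'trigger');
        END;
    """)

    def update_balance(account_id, balance):
        # Stands in for the existing stored procedure, which updates the
        # table AND writes its own audit row.
        conn.execute(
            "UPDATE account SET balance = ? WHERE id = ?",
            (balance, account_id),
        )
        conn.execute(
            "INSERT INTO account_audit VALUES (?, ?, 'procedure')",
            (account_id, balance),
        )

    update_balance(1, 250)
    rows = conn.execute(
        "SELECT account_id, new_balance, source "
        "FROM account_audit ORDER BY source"
    ).fetchall()
    conn.close()
    return rows
```

A single logical update leaves two audit rows behind, one written by the procedure and one by the trigger, which is the duplication described above.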

How to optimize a TSQL query?

5 votes

"activity" is a bit field. I need to set it to true if any of the rows with this client_id has the value true.

SELECT c.client_id, u.branch_id, a.account_id, activity
FROM Clients c INNER JOIN 
      accounts a ON c.id=a.client_id INNER JOIN uso u ON a.uso_id = u.uso_id,
     (SELECT MAX(CONVERT(int,accounts.activity)) as activity, client_id
       FROM accounts GROUP BY client_id) activ
WHERE activ.client_id = c.id

This query takes about 2 minutes to execute. Please help me to optimize it.

It seems the activity field is a BIT, and you cannot apply MIN or MAX to it.

Instead, use TOP:

SELECT  c.client_id, u.branch_id, a.account_id,
        (
        SELECT  TOP 1 activity
        FROM    accounts ai
        WHERE   ai.client_id = c.id
        ORDER BY
                activity DESC
        ) AS activity
FROM    clients c
JOIN    accounts a
ON      c.id = a.client_id
JOIN    uso u
ON      a.uso_id = u.uso_id

Create an index on accounts (client_id, activity) for this to work fast.
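
To see what the correlated subquery returns, here is a small sketch using Python's built-in sqlite3 module. SQLite is only a stand-in: it has no TOP, so LIMIT 1 is used instead, and the column types and sample rows are assumptions made for the example.

```python
import sqlite3


def activity_per_account():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE clients  (id INTEGER PRIMARY KEY, client_id TEXT);
        CREATE TABLE uso      (uso_id INTEGER PRIMARY KEY, branch_id INTEGER);
        CREATE TABLE accounts (account_id INTEGER PRIMARY KEY,
                               client_id INTEGER, uso_id INTEGER,
                               activity INTEGER);
        -- The index suggested in the answer.
        CREATE INDEX ix_accounts_client_activity
            ON accounts (client_id, activity);

        INSERT INTO clients VALUES (1, 'C1'), (2, 'C2');
        INSERT INTO uso VALUES (10, 100);
        -- Client 1 has one active account; client 2 has none.
        INSERT INTO accounts VALUES (1, 1, 10, 0), (2, 1, 10, 1), (3, 2, 10, 0);
    """)
    rows = conn.execute("""
        SELECT c.client_id, u.branch_id, a.account_id,
               (SELECT ai.activity
                FROM accounts ai
                WHERE ai.client_id = c.id
                ORDER BY ai.activity DESC
                LIMIT 1) AS activity   -- SQLite equivalent of TOP 1
        FROM clients c
        JOIN accounts a ON c.id = a.client_id
        JOIN uso u ON a.uso_id = u.uso_id
        ORDER BY a.account_id
    """).fetchall()
    conn.close()
    return rows
```

Both accounts of client C1 come back with activity 1, because one of them is active, while the single account of client C2 comes back with 0.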

You may want to read this article: