Best SQL questions in March 2011

Speeding up the rate that IIS/.NET/LINQ retrieves data from the Network Buffers

12 votes

When doing a TCP analysis of the traffic between my web servers and database servers, I see the network buffers (TCP Window) filling up frequently. The web servers then send TCP messages to the database server telling it that their buffers are full and not to send more data until given an update.

For example, this is the size of the network buffer in bytes for one of the more long lived connections to the database server over time:

Network Buffer Graph

The web servers are running a .NET 4.0 application running in IIS integrated mode on Windows 2008 R2 web servers. The SQL server is a 2008 R2 server.

My interpretation of this is that the SQL server is returning data to the web servers faster than the application on the web server can collect the data from the buffers. I have tried tuning just about everything I can in the network drivers to work around this issue. In particular, I have tried increasing the RSS queues, disabling interrupt moderation, and setting Windows 2008 R2 server to increase the buffer size more aggressively.

So if my interpretation is correct that leaves me wondering about two possibilities:

  1. Is there any way in .NET to tell it to increase the size of the network buffers? The "enhanced 2008 R2 TCP stack" rarely decides to enable window scaling (making the buffer bigger than 65 kB) for this connection, probably due to the low latency. It looks like the ability to set this manually system-wide is gone in Windows Server 2008 R2 (there used to be registry entries that are now ignored). So is there a way I can force this in code?
  2. Is there anything that can be tuned to speed up the rate at which the application reads information from the network buffers, in particular for the SQL connections?

Edit:
Requested DMV Query cutting off at ASYNC_NETWORK_IO:

SELECT * FROM sys.dm_os_wait_stats ORDER BY waiting_tasks_count desc;
wait_type  waiting_tasks_count  wait_time_ms  max_wait_time_ms  signal_wait_time_ms
CXPACKET            1436226309    2772827343             39259           354295135
SLEEP_TASK           231661274     337253925             10808            71665032
LATCH_EX             214958564     894509148             11855            84816450
SOS_SCHEDULER_YIELD  176997645     227440530              2997           227332659
ASYNC_NETWORK_IO     112914243      84132232             16707            16250951

1) What makes you think that this is TCP flow control, as opposed to SQL Server not producing data in the intervals where there is no traffic? Check sys.dm_exec_requests and look at wait_type. The wait types are described in Waits and Queues. If it is indeed the client applying TCP flow control, then you'll see the wait type ASYNC_NETWORK_IO.

2) If the issue is indeed a network wait type, then the solution is not to increase the bandwidth but to reduce the traffic. The client has no business requesting so much data from the server as to cause TCP flow control. This would be caused by doing horribly wrong things in the client, like counting rows or paging client-side. Move the processing to the server, and fetch only small result sets with the data you need.

Edit

Consuming the DB call result set ultimately boils down to one form or another of this:

FetchNextRow
while (not EndOfResults)
{
  ProcessRow;
  FetchNextRow;
}

In real terms this could be a foreach over an IQueryable or a loop around SqlDataReader.Read(), but the basic idea is the same: the client fetches rows from the result, processes them, then fetches some more. If the client code does anything in ProcessRow that blocks, it will not reach the point where it fetches the next row, and this will eventually trigger TCP flow control, which in turn causes SQL Server to suspend the query (since it has no place to write the results into). There is nothing you can do in terms of TCP to make this better. Increasing the window size can actually make matters worse: all those results that were previously suppressed at the source (the DB) are now going to be produced and have to be stored somewhere, which ultimately means live memory allocated to storage, and that may make things far worse than they are now.
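One way to find out whether ProcessRow is where the time goes is to time the two halves of the loop separately. The following is only a rough Python sketch of that idea, not .NET code: consume_with_timing and slow_process are hypothetical names, and the sleep stands in for whatever blocking work the real client does.

```python
import time
from typing import Callable, Iterable


def consume_with_timing(rows: Iterable, process_row: Callable[[object], None]) -> dict:
    """Run the fetch/process loop and record how long each half takes."""
    fetch_seconds = 0.0
    process_seconds = 0.0
    count = 0
    iterator = iter(rows)
    while True:
        started = time.perf_counter()
        try:
            row = next(iterator)              # FetchNextRow
        except StopIteration:
            break                             # EndOfResults
        fetch_seconds += time.perf_counter() - started

        started = time.perf_counter()
        process_row(row)                      # ProcessRow
        process_seconds += time.perf_counter() - started
        count += 1
    return {"rows": count,
            "fetch_seconds": fetch_seconds,
            "process_seconds": process_seconds}


def slow_process(row: object) -> None:
    # Hypothetical stand-in for blocking work, e.g. writing to a slow response stream.
    time.sleep(0.001)


stats = consume_with_timing(range(20), slow_process)
print(stats)
```

If process_seconds dominates, the client is the bottleneck and the database is merely waiting for it.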

If I were in your shoes right now, I'd focus on identifying where that ProcessRow blocking occurs. One hypothesis I put forward was that the processing is an MVC View writing into the response buffer and being blocked in turn by TCP flow control, resulting from the user agent not consuming the HTTP response (e.g. the Ajax call completed but the browser is not running the completion code to consume the response because the main thread is busy looping on something else). As always, the best approach is to measure methodically. Some possible tools:

Clear and readable SQL code

10 votes

I'm writing some SQL code with several subqueries and lots of joins everywhere, both inside the subquery and the resulting table from the subquery. We're not using views so that's out of the question.

After writing it I'm looking at it and scratching my head wondering what it's even doing cause I can't follow it.

What kind of formatting do you use to make an attempt to clean up such a mess? Indents perhaps?

With large queries I tend to rely a lot on named result sets using WITH. This lets you define the result set beforehand, which makes the main query simpler. Named result sets may also help make the query plan more efficient, e.g. Postgres stores the result set in a temporary table.

Example:

WITH 
  cubed_data AS (
     SELECT 
        dimension1_id,
        dimension2_id,
        dimension3_id,
        measure_id,
        SUM(value) AS value
     FROM
        source_data
     GROUP BY
        CUBE(dimension1_id, dimension2_id, dimension3_id),
        measure_id
  ), 
  dimension1_label AS(
     SELECT 
        dimension1_id,
        dimension1_label
     FROM 
        labels 
     WHERE 
        object = 'dimension1'
  ), ...
SELECT 
  *
FROM  
  cubed_data
  JOIN dimension1_label USING (dimension1_id)
  JOIN dimension2_label USING (dimension2_id)
  JOIN dimension3_label USING (dimension3_id)
  JOIN measure_label USING (measure_id)

The example is a bit contrived, but I hope it shows the increase in clarity compared to inline subqueries. Named result sets have been a great help for me when I've been preparing data for OLAP use. Named result sets are also a must if you have/want to create recursive queries.

WITH works at least on current versions of Postgres, Oracle and SQL Server.
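To try the pattern without a server, here is a small sketch that runs a cut-down version of the query in SQLite through Python's sqlite3 module. The table contents are made up for illustration, the label columns (id, label) are assumptions about the labels table, and a plain GROUP BY replaces CUBE because SQLite does not support CUBE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_data (dimension1_id INTEGER, measure_id INTEGER, value INTEGER);
    CREATE TABLE labels (object TEXT, id INTEGER, label TEXT);
    INSERT INTO source_data VALUES (1, 10, 5), (1, 10, 7), (2, 10, 3);
    INSERT INTO labels VALUES ('dimension1', 1, 'north'), ('dimension1', 2, 'south');
""")

QUERY = """
WITH
  summed_data AS (              -- named result set 1: the aggregation
     SELECT dimension1_id, measure_id, SUM(value) AS value
     FROM source_data
     GROUP BY dimension1_id, measure_id
  ),
  dimension1_label AS (         -- named result set 2: the labels for one dimension
     SELECT id AS dimension1_id, label
     FROM labels
     WHERE object = 'dimension1'
  )
SELECT label, measure_id, value
FROM summed_data
  JOIN dimension1_label USING (dimension1_id)
ORDER BY label
"""

rows = conn.execute(QUERY).fetchall()
print(rows)  # [('north', 10, 12), ('south', 10, 3)]
```

The main SELECT stays short because each building block has a name and can be read on its own.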

Create a summary row for data across multiple tables

9 votes

I'm trying to write a SQL query to generate a summary row for the actions performed by a given user in a given period. I have the following relevant table structure:

users

  • id
  • team

audit_periods (can be processing, shipping, break, etc)

  • user_id
  • period_type (can be "processing", "shipping", etc -- not currently normalized)
  • started_at
  • finished_at (can be null for the current period, hence the logic around times below)

audit_tasks

  • audit_period_id
  • audit_task_type_id
  • created_at
  • score

audit_task_types

  • name ("scan", "place_in_pallet", etc)
  • score (seems redundant, but we need to maintain the score that the audit_task received at the time it was performed, as the audit_task_type score can change later)

ER Diagram

For each user for a given period, I'd like to create something like the following row of data:

users.id users.email time_spent_processing time_spent_shipping ... number_of_scans number_of_pallets

which would be calculated by figuring out for each user:

  • What audit_periods fall at least partially in the desired window? (Uses started_at and finished_at.)
  • How long did a user spend in each type of audit_period? (Should involve group by audit_periods.period_type, I'd imagine.)
  • What audit_tasks fall within the desired window? (Uses created_at -- not in the code below yet.)
  • How many of each type of audit_task did a user accomplish during the window? (Joins out to audit_task_type, and likely involves a group by on audit_task_types.name.)
  • How many points were earned during the time period? (Sums the scores of all the audit_tasks in the window.)

I've exhausted all of the SQL tricks I know (not many) and came up with something like the following:

select 
    u.id as user_id,
    u.email as email,
    u.team as team,
    ap.period_type as period_type,
    att.name,
    time_to_sec(
      timediff(least("2011-03-17 00:00:00", ifnull(ap.finished_at, utc_timestamp())), greatest("2011-03-16 00:00:00", ap.started_at))
    ) as period_duration,
    sum(at.score) as period_score
  from audit_periods as ap
  inner join users as u on ap.user_id = u.id
  left join audit_tasks as at on at.audit_period_id = ap.id
  left join audit_task_types as att on at.audit_task_type_id = att.id
  where (ap.started_at >= "2011-03-16 00:00:00" or (ap.finished_at >= "2011-03-17 00:00:00" and ap.finished_at <= "2011-03-17 00:00:00"))
    and (ap.finished_at <= "2011-03-17 00:00:00" or (ap.started_at >= "2011-03-16 00:00:00" and ap.started_at <= "2011-03-16 00:00:00"))
    and u.team in ("Foo", "Bar")
  group by u.id, ap.id, at.id

but this seems to be functionally equivalent to just selecting all of the audit tasks in the end. I've tried some subqueries as well, but to little avail. More directly, this generates something like (skipping less important columns):

user_id   |   period_type   |   period_duration  |  name            |   score
1             processing        1800s               scan                200
1             shipping          1000s               place_in_pallet     100
1             shipping          1000s               place_in_pallet     100
1             break             500s                null                null

when I want:

user_id   |   processing    |   shipping  |  break  |  scan  |  place_in_pallet  |  score
1             1800s             1000s        500s      1        2                   400

I can easily fetch all of the audit_tasks for a given user and roll them up in code, but I might be fetching hundreds of thousands of audit_tasks over a given period, so it needs to be done in SQL.

Just to be clear -- I'm looking for a query to generate one row per user, containing summary data collected across the other 3 tables. So, for each user, I want to know how much time he spent in each type of audit_period (3600 seconds processing, 3200 seconds shipping, etc), as well as how many of each audit_task he performed (5 scans, 10 items placed in pallet, etc).

I think I have the elements of a solution, I'm just having trouble piecing them together. I know exactly how I would accomplish this in Ruby/Java/etc, but I don't think I understand SQL well enough to know which tool I'm missing. Do I need a temp table? A union? Some other construct entirely?

Any help is greatly appreciated, and I can clarify if the above is complete nonsense.

You will need to break this up into two crosstab queries, one that gives you the audit_periods information by user and another that gives you the audit_task information by user, and then join both to the users table. It isn't clear how you want to roll up the information in each case. For example, if a given user has 10 audit_period rows, how should the query roll up those durations? I assumed a sum of the durations here, but you might want a min or max or perhaps even an overall delta.

Select U.id As user_id
    , AuditPeriodByUser.TotalDuration_Processing As processing
    , AuditPeriodByUser.TotalDuration_Shipping As shipping
    , AuditPeriodByUser.TotalDuration_Break As break
    , AuditTasksByUser.TotalCount_Scan As scan
    , AuditTasksByUser.TotalCount_Place_In_Pallet As place_in_pallet
    , AuditTasksByUser.TotalScore As score
From users As U
    Left Join   (
                -- Seconds per period type; an open period (finished_at is null) runs until now
                Select AP.user_id
                    , Sum( Case When AP.period_type = 'processing' 
                                Then Time_To_Sec( 
                                        TimeDiff( 
                                            Coalesce(AP.finished_at, UTC_TIMESTAMP()), AP.started_at ) ) End ) 
                        As TotalDuration_Processing
                    , Sum( Case When AP.period_type = 'shipping' 
                                Then Time_To_Sec( 
                                        TimeDiff( 
                                            Coalesce(AP.finished_at, UTC_TIMESTAMP()), AP.started_at ) ) End ) 
                        As TotalDuration_Shipping
                    , Sum( Case When AP.period_type = 'break' 
                                Then Time_To_Sec( 
                                        TimeDiff( 
                                            Coalesce(AP.finished_at, UTC_TIMESTAMP()), AP.started_at ) ) End ) 
                        As TotalDuration_Break
                From audit_periods As AP
                -- Note: this filter drops periods whose finished_at is null
                Where AP.started_at >= @StartDate 
                    And AP.finished_at <= @EndDate
                Group by AP.user_id
                ) As AuditPeriodByUser
            On AuditPeriodByUser.user_id = U.id
    Left Join   (
                -- Task counts per type and total score
                Select AP.user_id
                    , Sum( Case When ATT.name = 'scan' Then 1 Else 0 End ) As TotalCount_Scan
                    , Sum( Case When ATT.name = 'place_in_pallet' Then 1 Else 0 End ) As TotalCount_Place_In_Pallet
                    , Sum( AT.score ) As TotalScore
                From audit_tasks As AT
                    Join audit_task_types As ATT
                        On ATT.id = AT.audit_task_type_id
                    Join audit_periods As AP
                        On AP.id = AT.audit_period_id
                Where AP.started_at >= @StartDate 
                    And AP.finished_at <= @EndDate
                Group By AP.user_id
                ) As AuditTasksByUser
        On AuditTasksByUser.user_id = U.id
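The two-crosstab approach can be tried end to end on the sample figures from the question. This is a sketch in SQLite via Python's sqlite3, so strftime('%s', ...) stands in for MySQL's Time_To_Sec(TimeDiff(...)); the timestamps, the alias tk and the column name break_time are assumptions chosen to reproduce the 1800s/1000s/500s example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, team TEXT);
    CREATE TABLE audit_periods (id INTEGER PRIMARY KEY, user_id INTEGER,
                                period_type TEXT, started_at TEXT, finished_at TEXT);
    CREATE TABLE audit_task_types (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE audit_tasks (id INTEGER PRIMARY KEY, audit_period_id INTEGER,
                              audit_task_type_id INTEGER, score INTEGER);

    INSERT INTO users VALUES (1, 'Foo');
    INSERT INTO audit_periods VALUES
        (1, 1, 'processing', '2011-03-16 08:00:00', '2011-03-16 08:30:00'),
        (2, 1, 'shipping',   '2011-03-16 09:00:00', '2011-03-16 09:16:40'),
        (3, 1, 'break',      '2011-03-16 10:00:00', '2011-03-16 10:08:20');
    INSERT INTO audit_task_types VALUES (1, 'scan'), (2, 'place_in_pallet');
    INSERT INTO audit_tasks VALUES (1, 1, 1, 200), (2, 2, 2, 100), (3, 2, 2, 100);
""")

# Seconds between started_at and finished_at (SQLite spelling of the duration).
DURATION = ("CAST(strftime('%s', ap.finished_at) AS INTEGER)"
            " - CAST(strftime('%s', ap.started_at) AS INTEGER)")

SUMMARY_SQL = f"""
SELECT u.id, p.processing, p.shipping, p.break_time,
       t.scan, t.place_in_pallet, t.score
FROM users AS u
LEFT JOIN (                                   -- crosstab 1: time per period type
    SELECT ap.user_id,
           SUM(CASE WHEN ap.period_type = 'processing' THEN {DURATION} END) AS processing,
           SUM(CASE WHEN ap.period_type = 'shipping'   THEN {DURATION} END) AS shipping,
           SUM(CASE WHEN ap.period_type = 'break'      THEN {DURATION} END) AS break_time
    FROM audit_periods AS ap
    WHERE ap.started_at >= :window_start AND ap.finished_at <= :window_end
    GROUP BY ap.user_id
) AS p ON p.user_id = u.id
LEFT JOIN (                                   -- crosstab 2: task counts and score
    SELECT ap.user_id,
           SUM(CASE WHEN att.name = 'scan' THEN 1 ELSE 0 END) AS scan,
           SUM(CASE WHEN att.name = 'place_in_pallet' THEN 1 ELSE 0 END) AS place_in_pallet,
           SUM(tk.score) AS score
    FROM audit_tasks AS tk
    JOIN audit_task_types AS att ON att.id = tk.audit_task_type_id
    JOIN audit_periods AS ap ON ap.id = tk.audit_period_id
    WHERE ap.started_at >= :window_start AND ap.finished_at <= :window_end
    GROUP BY ap.user_id
) AS t ON t.user_id = u.id
"""

summary = conn.execute(
    SUMMARY_SQL,
    {"window_start": "2011-03-16 00:00:00", "window_end": "2011-03-17 00:00:00"},
).fetchall()
print(summary)  # [(1, 1800, 1000, 500, 1, 2, 400)]
```

Aggregating each side in its own subquery before joining avoids multiplying period durations by the number of tasks in the period, which is what a single flat join would do.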

What type of Join to use?

7 votes

I've got a core table and 3 tables that extend the 'core' table in different ways.

I'm working with MLS data and I have a 'common' table that contains information common to all MLS listings, then a table with specifically "residential" information, one for "commercial", etc. I have been using the MLS number to join to a single table when the property type is known, but for searching I want to join all of them and have the special fields available as search criteria (not simply search the common table).

What type of join will give me a dataset that will contain all listings (including the extended fields in the idx tables) ?

For each Common table record there is a single corresponding record in ONLY ONE of the idx tables.

                     ___________
                    |           |
                    |  COMMON   |
                    |           |
                    |___________|
                         _|_   
                          |
       ___________________|_____________________
     _|_                 _|_                   _|_
 _____|_____         _____|______           ____|______ 
|           |       |            |         |           |
|   IDX1    |       |   IDX2     |         |   IDX3    |
|           |       |            |         |           |
|___________|       |____________|         |___________|

If you want everything in one row, you can use something like this format. Basically it gives you all the "Common" fields, then the other fields if there is a match otherwise NULL:

SELECT  Common.*,
        Idx1.*,
        Idx2.*,
        Idx3.*
FROM Common
LEFT JOIN Idx1
    ON Idx1.MLSKey = Common.MLSKey
LEFT JOIN Idx2
    ON Idx2.MLSKey = Common.MLSKey  
LEFT JOIN Idx3
    ON Idx3.MLSKey = Common.MLSKey

Bear in mind it's better to list out fields than to use the SELECT * whenever possible...

Also I'm assuming MySQL syntax is the same as SQL Server, which is what I use.
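For a quick check of what the LEFT JOIN shape returns, here is a sketch in SQLite via Python's sqlite3 with two of the extension tables. The columns Price, Bedrooms and Zoning and all the data are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Common (MLSKey INTEGER PRIMARY KEY, Price INTEGER);
    CREATE TABLE Idx1 (MLSKey INTEGER PRIMARY KEY, Bedrooms INTEGER);  -- e.g. residential
    CREATE TABLE Idx2 (MLSKey INTEGER PRIMARY KEY, Zoning TEXT);       -- e.g. commercial
    INSERT INTO Common VALUES (100, 250000), (200, 900000);
    INSERT INTO Idx1 VALUES (100, 3);
    INSERT INTO Idx2 VALUES (200, 'retail');
""")

# Every Common row appears once; columns from the non-matching tables are NULL (None).
all_listings = conn.execute("""
    SELECT Common.MLSKey, Common.Price, Idx1.Bedrooms, Idx2.Zoning
    FROM Common
    LEFT JOIN Idx1 ON Idx1.MLSKey = Common.MLSKey
    LEFT JOIN Idx2 ON Idx2.MLSKey = Common.MLSKey
    ORDER BY Common.MLSKey
""").fetchall()

# The extended fields can then be used as search criteria.
three_bedrooms = conn.execute("""
    SELECT Common.MLSKey
    FROM Common
    LEFT JOIN Idx1 ON Idx1.MLSKey = Common.MLSKey
    LEFT JOIN Idx2 ON Idx2.MLSKey = Common.MLSKey
    WHERE Idx1.Bedrooms >= 3
""").fetchall()

print(all_listings)    # [(100, 250000, 3, None), (200, 900000, None, 'retail')]
print(three_bedrooms)  # [(100,)]
```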

Can someone explain this SQL query to me?

7 votes

I'm reading this article and I'm trying to understand this SQL statement but I am still somewhat new to SQL.

I'm not sure what comment and c refer to.
I think one of them is the table name but I am not sure of the other. Also, apparently there is a subquery within it which I have not had any experience with:

  SELECT c.id, c.user_id, c.body, c.deep, c.lineage, c.parent_id,
         (SELECT COUNT(*) 
            FROM comment 
           WHERE comment.lineage LIKE (CONCAT(c.lineage,'%')) 
             AND comment.lineage != c.lineage) AS replies
    FROM comment as c
ORDER BY c.lineage

SELECT c.id,
       c.user_id,
       c.body, 
       c.deep, 
       c.lineage, 
       c.parent_id, (
       SELECT COUNT(*)
         FROM comment
        where comment.lineage LIKE (CONCAT(c.lineage,'%'))
          AND comment.lineage!=c.lineage)
       as replies
       FROM comment as c 
       order by c.lineage

The first list contains all the fields to be selected, prefixed with c, which is the alias given to the comment table later in the query.

The query within the query is a subquery. It runs a LIKE comparison against c.lineage concatenated with % (the wildcard), so it counts the other comments whose lineage starts with the current comment's lineage. The subquery result is returned in the replies column.

The results are ordered by lineage.
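To see the subquery at work, the sketch below runs the same query shape in SQLite via Python's sqlite3. The zero-padded lineage strings are an assumed format (the article's actual format is not shown here), and SQLite's || operator replaces CONCAT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE comment (id INTEGER PRIMARY KEY, parent_id INTEGER, lineage TEXT);
    INSERT INTO comment VALUES
        (1, NULL, '0001'),
        (2, 1,    '0001-0002'),
        (3, 2,    '0001-0002-0003'),
        (4, 1,    '0001-0004'),
        (5, NULL, '0005');
""")

# "comment" inside the subquery is the inner scan of the table;
# "c" is the row of the outer query that the subquery is evaluated for.
replies = conn.execute("""
    SELECT c.id,
           (SELECT COUNT(*)
              FROM comment
             WHERE comment.lineage LIKE (c.lineage || '%')
               AND comment.lineage != c.lineage) AS replies
      FROM comment AS c
     ORDER BY c.lineage
""").fetchall()

print(replies)  # [(1, 3), (2, 1), (3, 0), (4, 0), (5, 0)]
```

Comment 1 has three descendants (2, 3 and 4), so its replies value is 3; ordering by lineage lists each thread depth-first.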

Avoid creating a clustered index based on an incrementing key

7 votes

I got this hint from mssqlcity.com. However, I cannot understand its explanation.

Avoid creating a clustered index based on an incrementing key

For example, if a table has surrogate integer primary key declared as IDENTITY and the clustered index was created on this column, then every time data is inserted into this table, the rows will be added to the end of the table. When many rows will be added a "hot spot" can occur. A "hot spot" occurs when many queries try to read or write data in the same area at the same time. A "hot spot" results in I/O bottleneck. Note. By default, SQL Server creates clustered index for the primary key constraint. So, in this case, you should explicitly specify NONCLUSTERED keyword to indicate that a nonclustered index is created for the primary key constraint.

Before I read that, I thought that picking a column that is random in nature would be wrong, because it causes unnecessary page relocation when adding a new row. So I thought using a sorted column was preferable.

After reading this hint, I think it's trying to say we don't really want to use a strictly increasing column as our clustered index either, because there is going to be an I/O bottleneck for write-intensive applications.

I don't really understand the cause of the I/O bottleneck that they are talking about. Are they saying too many operations sharing the same page is going to slow down the disk operations? How does this happen? Can somebody explain this to me?

The Hot Spot they are referring to is not an issue in SQL Server 2005 and newer.

What USED to happen is that all your data was being written to the same area of the clustered index and the same sector(s) on the disk which caused a lot of dirty pages to be created at once (dirty pages being data pages that have been altered but not committed to disk), and when a flush or checkpoint ran this could cause issues.

Newer versions do not experience this behavior due to changes in the IO architecture (from what I understand).

Is it bad practice to use temporary tables in SQL?

Asked on Fri, 18 Mar 2011 by Gabe sql
7 votes

That is, using temporary tables with some initial unique data and then populating it one or several fields at a time. Sometimes it makes code seem more readable but it also leads to procedural type thinking. And it's also slower than using derived tables or other methods. Is it discouraged in industry?

It would be a bad practice if all set-based operations were a) implemented and b) efficient in all engines.

However, for some tasks (like emulating LAG and LEAD in SQL Server, long insert chains on cascading auto-generated ids in several tables, etc.), temp tables or table variables are a nice solution.

You should note that temporary tables are very often created and dropped by the engine itself: for operations marked Using temporary in MySQL, spools in SQL Server, etc.

So each time you create a temp table you should ask yourself a question:

  • Do I create a temp table because I don't know a set-based way, or because I know a set-based way but the server (or optimizer) does not?

If the answer is "I know but the optimizer does not", then create the table. The optimizer would do the same if it could.

Alphabetically ordering records with "The", "A", "An" etc at the beginning of varchar field

7 votes

I'm looking for both MySQL and PostgreSQL solutions for this kind of problem.

Say I have a number of records with a title field. The titles are book or movie titles, like "The Cat in the Hat" and "Robin Hood". But while the titles must be displayed in their original form, they ought to be sorted in the way that libraries sort them, which is by moving any article, like "The" or "An" to the end of the title.

So "The Cat in the Hat" is sorted as if it were "Cat in the Hat, The".

What's the best way either to design the schema or write the query so that these records are sorted by title in the same way that libraries sort the title? (I also wish I knew the technical term for this type of ordering by title.) Also, what performance considerations should I be aware of and what indexes should I create?

Create a custom function (sortableTitle, perhaps?) that modifies strings starting with your unwanted words. Finish your query with order by sortableTitle(title). This will incur an extra CPU cost, though you'll have to benchmark to know how much.

You could create an extra column (sortTitle) that is populated by a trigger. This will take up some space, but then your server will be able to sort rows by an index.

Excepting the above, you cannot (without modifying the database server code) directly create an index that is in the order you want. As far as I can tell, that applies to both MySQL and PostgreSQL.
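As an illustration of the custom-function idea, here is a sketch that registers a Python function with SQLite and sorts by it. It is not MySQL or PostgreSQL code; sortable_title, the books table and the article list (the, an, a) are assumptions, and a real implementation would be written in the database's own function language.

```python
import re
import sqlite3

# Leading article followed by whitespace, then the rest of the title.
LEADING_ARTICLE = re.compile(r"^(the|an|a)\s+(.*)$", re.IGNORECASE)


def sortable_title(title: str) -> str:
    """Move a leading article to the end: 'The Cat' -> 'Cat, The'."""
    match = LEADING_ARTICLE.match(title)
    if match is None:
        return title
    article, rest = match.groups()
    return f"{rest}, {article}"


conn = sqlite3.connect(":memory:")
conn.create_function("sortable_title", 1, sortable_title)
conn.execute("CREATE TABLE books (title TEXT)")
conn.executemany(
    "INSERT INTO books VALUES (?)",
    [("The Cat in the Hat",), ("Robin Hood",),
     ("A Tale of Two Cities",), ("Anne of Green Gables",)],
)

# Titles are displayed unchanged but ordered by the transformed key.
ordered = [row[0] for row in
           conn.execute("SELECT title FROM books ORDER BY sortable_title(title)")]
print(ordered)
```

The same function could populate the sortTitle column from the second suggestion, which trades storage for the ability to sort by an index.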

Multithreaded Delphi database application failing with large amounts of data

6 votes

Overview of the application:
I have a Delphi application that allows a user to define a number of queries, and run them concurrently over multiple MySQL databases. There is a limit on the number of threads that can be run at once (which the user can set). The user selects the queries to run, and the systems to run the queries on. Each thread runs the specified query on the specified system using a TADOQuery component.

Description of the problem:
When the queries retrieve a low number of records, the application works fine, even when lots of threads (up to about 100) are submitted. The application can also handle larger numbers of records (150,000+) as long as only a few threads (up to about 8) are running at once. However, when the user is running more than around 10 queries at once (i.e. 10+ threads), and each thread is retrieving around 150,000+ records, we start getting errors. Here are the specific error messages that we have encountered so far:

a: Not enough storage is available to complete this operation
b: OLE error 80040E05
c: Unspecified error
d: Thread creation error: Not enough storage is available to process this command
e: Object was open
f: ODBC Driver does not support the requested properties

Evidently, the errors are due to a combination of factors: number of threads, amount of data retrieved per thread, and possibly the MySQL server configuration.

The main question really is why are the errors occurring? I appreciate that it appears to be in some way related to resources, but given the different errors that are being returned, I'd like to get my head around exactly why the errors are cropping up. Is it down to resources on the PC, or something to do with the configuration of the server, for example.

The follow-up question is what can we do to avoid getting the problems? We're currently throttling down the application by lowering the number of threads that can be run concurrently. We can't force the user to retrieve fewer records, as the queries are totally user defined and if they want to retrieve 200,000 records, then that's up to them, so there's not much that we can do about that side of things. Realistically, we don't want to throttle down the speed of the application, because most users will be retrieving small amounts of data and we don't want to make the application too slow for them to use. Although the number of threads can be changed by the user, we'd rather get to the root of the problem and try to fix it without having to rely on tweaking the configuration all the time.

It looks like you're loading a lot of data client-side. The rows may need to be cached in client memory (especially if you use bidirectional cursors), and in a 32-bit application there may not be enough of it, depending on the average row size and how efficiently the library stores rows. Usually the best way to accomplish database work is to perform it directly on the server, without retrieving data to the client. Databases usually have an efficient cache system and can write data out to disk when it doesn't fit in memory. Why do you retrieve 150,000 rows at once? You could use a mechanism that transfers data only when the user actually accesses it (a sort of paging through the data), to avoid large chunks of "wasted" memory.
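The paging idea can be sketched as follows. This is Python with SQLite rather than Delphi with TADOQuery, so it only illustrates the principle; iter_batches and the records table are hypothetical, and fetchmany is the standard DB-API call for reading a limited number of rows at a time.

```python
import sqlite3
from typing import Iterator, List, Tuple


def iter_batches(cursor: sqlite3.Cursor, batch_size: int) -> Iterator[List[Tuple]]:
    """Yield rows in fixed-size batches instead of loading the whole result."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            return
        yield batch


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO records (payload) VALUES (?)",
                 [(f"row {i}",) for i in range(1000)])

cursor = conn.execute("SELECT id, payload FROM records ORDER BY id")
# Only one batch is held in memory at a time.
batch_sizes = [len(batch) for batch in iter_batches(cursor, 300)]
print(batch_sizes)  # [300, 300, 300, 100]
```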

Confusion about 1:1 relationship

6 votes

I've been learning database design and I'm confused about 1:1 relationships. From what I understand, you can simply add columns to the appropriate table. Can someone provide a real world example of where a 1:1 relationship was either necessary or provided some significant benefit? I.e., where would I use a 1:1 relationship and what would it look like?

I'll give you a real practical example.

In the medical billing world, doctors who want to get paid by medicare handle billing by creating a dictation report for each visit with a patient. This might actually be a recorded audio dictation transcribed by a secretary, but more often it's just a written description of what they did and talked about with the patient, along with history, impressions, and so forth. A licensed medical coder will then read this dictation and decide what the doctor is allowed to bill.

Separate from the dictation, there is demographic information about the patient involved: name, age, billing address, etc. This information must be strictly separate from information about the dictation, to prevent coders from allowing bias to cloud their billing judgements or violating patients' privacy.

This data is often kept well-normalized with a 1:many relationship in the data systems at the point of origin, and only the right parts are displayed to the right people at the right times. However, a significant number of offices out-source their billing function to a third party. This way a small clinic, for example, doesn't have to keep a licensed medical coder on staff; one coder at the billing office can handle the needs of many clinics. When the data is sent from the clinic to the billing office, the patient demographic information and the dictations need to come over as separate pieces, possibly at separate times. At this point, they'll likely be stored in completely separate tables with a 1:1 relationship and a shared ID field to match them up later.

In this case, the 1:1 relationship has very little to do with the data model. You could probably match up the records at the time of import, and as a bill moves through the system, eventually the provisional patient information received in the clinic's demographic record will be matched to a real person so the 1:many relationship can be restored. Otherwise you'd get a separate statement on a separate account for each visit to the doctor.

Instead, it has almost everything to do with the systems design. There are likely entirely different people building and using the billing part versus the coding part at our imaginary billing service. This way, each side can have full control of its own fiefdom, and you are sure that no one, not even a developer, is breaking any privacy rules.
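A minimal sketch of what such a pair of tables might look like, using SQLite via Python's sqlite3. The table and column names (demographics, dictations, visit_id) and the sample rows are invented; the point is only that a primary key shared by both tables allows at most one row per id on each side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The two halves arrive separately and live in separate tables.
    CREATE TABLE demographics (visit_id INTEGER PRIMARY KEY, patient_name TEXT);
    CREATE TABLE dictations   (visit_id INTEGER PRIMARY KEY, body TEXT);
    INSERT INTO dictations   VALUES (42, 'Patient seen for follow-up.');
    INSERT INTO demographics VALUES (42, 'Jane Doe');
""")


def add_dictation(visit_id: int, body: str) -> bool:
    """Return False when a dictation already exists for this id (1:1 enforced by the key)."""
    try:
        conn.execute("INSERT INTO dictations VALUES (?, ?)", (visit_id, body))
        return True
    except sqlite3.IntegrityError:
        return False


# The shared id matches the two pieces up later.
matched = conn.execute("""
    SELECT d.visit_id, g.patient_name, d.body
    FROM dictations AS d
    JOIN demographics AS g ON g.visit_id = d.visit_id
""").fetchall()

print(add_dictation(42, "second dictation for the same visit"))  # False
print(matched)
```

Access to each table can then be granted to a different group of people, which is the systems-design benefit described above.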

SQL query to lambda expression or linq

6 votes

How can I convert the following query to a lambda expression?

select * 
from Invoice_main 
where id not in 
    (select invoice_main_id from Invoice_payment_by_pay_method)

I could not find an alternative for 'not in'.

Assuming you are using LINQ-to-SQL:

from inv in Invoice_main
where !(from m in Invoice_payment_by_pay_method select m.invoice_main_id).Contains(inv.id)
select inv

The !(...).Contains(...) is automatically converted by LINQ-to-SQL to a NOT EXISTS clause (note: this is more efficient than the NOT IN clause).

Other providers (i.e. not LINQ-to-SQL) may not support this rewrite of .Contains into EXISTS so this may not work for everything.
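Apart from efficiency, the two SQL forms also behave differently when the subquery can return NULL. The sketch below shows this in SQLite via Python's sqlite3; it says nothing about what LINQ-to-SQL actually generates, and the NULL row is an assumption (it requires invoice_main_id to be nullable).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Invoice_main (id INTEGER PRIMARY KEY);
    CREATE TABLE Invoice_payment_by_pay_method (invoice_main_id INTEGER);
    INSERT INTO Invoice_main VALUES (1), (2), (3);
    INSERT INTO Invoice_payment_by_pay_method VALUES (1), (NULL);
""")

# NOT IN: comparing against a NULL yields "unknown", so no row qualifies.
not_in = conn.execute("""
    SELECT id FROM Invoice_main
    WHERE id NOT IN (SELECT invoice_main_id FROM Invoice_payment_by_pay_method)
    ORDER BY id
""").fetchall()

# NOT EXISTS: only asks whether a matching row exists, so NULLs are harmless.
not_exists = conn.execute("""
    SELECT inv.id FROM Invoice_main AS inv
    WHERE NOT EXISTS (SELECT 1 FROM Invoice_payment_by_pay_method AS m
                      WHERE m.invoice_main_id = inv.id)
    ORDER BY inv.id
""").fetchall()

print(not_in)      # []
print(not_exists)  # [(2,), (3,)]
```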

Which is better database design?

6 votes

Given a site like StackOverflow, would it be better to create num_comments column to store how many comments a submission has and then update it when a comment is made or just query the number of rows with the COUNT function? It seems like the latter would be more readable and elegant but the former would be more efficient. What does SO think?

Definitely use COUNT. Storing the number of comments is a classic de-normalization that produces headaches. It's slightly more efficient for retrieval but makes inserts much more expensive: each new comment requires not only an insert into the comments table, but a write lock on the row containing the comment count.
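A small sketch of the COUNT approach, run in SQLite via Python's sqlite3. The submissions and comments tables and their columns are assumed names; the LEFT JOIN keeps submissions that have no comments yet.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE submissions (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE comments (id INTEGER PRIMARY KEY, submission_id INTEGER, body TEXT);
    INSERT INTO submissions VALUES (1, 'first'), (2, 'second');
    INSERT INTO comments (submission_id, body) VALUES (1, 'a'), (1, 'b');
""")

COUNT_SQL = """
    SELECT s.id, COUNT(c.id) AS num_comments
    FROM submissions AS s
    LEFT JOIN comments AS c ON c.submission_id = s.id
    GROUP BY s.id
    ORDER BY s.id
"""

before = conn.execute(COUNT_SQL).fetchall()

# Adding a comment is a single insert; there is no stored counter to keep in sync.
conn.execute("INSERT INTO comments (submission_id, body) VALUES (2, 'c')")
after = conn.execute(COUNT_SQL).fetchall()

print(before)  # [(1, 2), (2, 0)]
print(after)   # [(1, 2), (2, 1)]
```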

Is there ever a case in SQL where a subquery is more efficient than a join?

6 votes

I've seen people hypothetically say that there are cases when a subquery can be more efficient than a join, but I have never actually seen a good example of this.

What would be a case when you would want to use a subquery over a join?

The classic example is searching for rows in a table that do not have corresponding rows in another.

SELECT a.*
    FROM TableA a
    WHERE NOT EXISTS(SELECT NULL FROM TableB b WHERE b.parent_id = a.id)

is generally better than

SELECT a.* 
    FROM TableA a
        LEFT JOIN TableB b
            ON a.id = b.parent_id
    WHERE b.parent_id IS NULL

See also: Left outer join vs NOT EXISTS
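Both forms return the same rows, which can be checked with a few lines of Python and SQLite; which one performs better depends on the engine and its optimizer, so this sketch only demonstrates equivalence. The sample data is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (id INTEGER PRIMARY KEY);
    CREATE TABLE TableB (id INTEGER PRIMARY KEY, parent_id INTEGER);
    INSERT INTO TableA VALUES (1), (2), (3);
    INSERT INTO TableB (parent_id) VALUES (1), (1), (3);
""")

# Subquery form: stop looking as soon as one child row is found.
with_not_exists = conn.execute("""
    SELECT a.id
    FROM TableA a
    WHERE NOT EXISTS (SELECT NULL FROM TableB b WHERE b.parent_id = a.id)
    ORDER BY a.id
""").fetchall()

# Join form: build all parent/child pairs, then keep the parents with no child.
with_left_join = conn.execute("""
    SELECT a.id
    FROM TableA a
    LEFT JOIN TableB b ON a.id = b.parent_id
    WHERE b.parent_id IS NULL
    ORDER BY a.id
""").fetchall()

print(with_not_exists, with_left_join)  # [(2,)] [(2,)]
```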

Inventory Average Cost Calculation in SQL

6 votes

I want to compute inventory costs using average value, and I'm somewhat stuck here...

Consider a simple transaction table tr: (ids are autoincrement, negative volume indicates a sell transaction)

order_id | volume | price | type
       1 |   1000 |   100 | B
       2 |   -500 |   110 | S
       3 |   1500 |    80 | B
       4 |   -100 |   150 | S
       5 |   -600 |   110 | S
       6 |    700 |   105 | B

Now I want to know the total volume and total costs after each transaction. The difficulty is getting the sells right. Sells are always valued at the average cost at this point (ie the sell price is actually not relevant here), so the transaction order does matter here.

Optimally, the result would look like this:

order_id | volume | price | total_vol | total_costs | unit_costs
       1 |   1000 |   100 |      1000 |      100000 |        100
       2 |   -500 |   110 |       500 |       50000 |        100
       3 |   1500 |    80 |      2000 |      170000 |         85
       4 |   -100 |   150 |      1900 |      161500 |         85
       5 |   -600 |   110 |      1300 |      110500 |         85
       6 |    700 |   105 |      2000 |      184000 |         92

Now, total_vol is easy with a sum(volume) over (...); total costs, on the other hand, are not. I've played around with window functions, but unless I'm missing something totally obvious (or very clever), I don't think it can be done with window functions alone...

Any help would be appreciated. :)

UPDATE:

This is the code I finally used, a combination of both answers (the data model is a bit more complex than my simplified example above, but you get the idea):

select ser_num
  , tr_id
  , tr_date
  , action_typ
  , volume
  , price
  , total_vol
  , trunc(total_costs,0) total_costs
  , trunc(unit_costs,4) unit_costs
from itt
  model
    partition by (ser_num)
    dimension by (row_number() over (partition by ser_num order by tr_date, tr_id) rn)
    measures (tr_id, tr_date, volume, price, action_typ, 0 total_vol, 0 total_costs, 0 unit_costs)
    rules automatic order 
    ( total_vol[ANY] order by rn
      = nvl(total_vol[cv()-1],0) + 
        decode(action_typ[cv()], 'Buy', 1,  'Sell', -1) * volume[cv()]
    , total_costs[ANY] order by rn
      = case action_typ[cv()]
          when 'Buy' then volume[cv()] * price[cv()] + nvl(total_costs[cv()-1],0)
          when 'Sell' then total_vol[cv()] * nvl(unit_costs[cv()-1],price[cv()])
        end
    , unit_costs[ANY] order by rn
      = decode(total_vol[cv()], 0, unit_costs[cv()-1], 
        total_costs[cv()] / total_vol[cv()])
    )
order by ser_num, tr_date, tr_id 

Some observations:

  • When using partitions and references to the previous cell (cv()-1), the dimension has to be partitioned in the same way as the whole model clause (this is also why using iteration_number can be tricky).
  • No iteration is needed here as long as you specify the correct execution order on the rules (order by rn; edit: automatic order works this out for you).
  • Automatic order is probably not necessary here, but it can't hurt.

You can use the MODEL clause to do this recursive calculation

Create sample table and insert data

create table costs (order_id int, volume int, price numeric(16,4), type char(1));

insert into costs (order_id, volume, price) values (1,1000,100);
insert into costs (order_id, volume, price) values (2,-500,110);
insert into costs (order_id, volume, price) values (3,1500,80);
insert into costs (order_id, volume, price) values (4,-100,150);
insert into costs (order_id, volume, price) values (5,-600,110);
insert into costs (order_id, volume, price) values (6,700,105);

The query (EDITED: changing rules iterate(1000) to rules automatic order makes the MODEL clause work as it is intended to, i.e. top to bottom sequentially. It also took the query from 0.44s to 0.01s!)

select order_id, volume, price, total_vol, total_costs, unit_costs
    from (select order_id, volume, price,
                 volume total_vol,
                 0.0 total_costs,
                 0.0 unit_costs,
                 row_number() over (order by order_id) rn
          from costs order by order_id)
   model
         dimension by (order_id)
         measures (volume, price, total_vol, total_costs, unit_costs)
         rules automatic order -- iterate(1000)
         ( total_vol[any] = volume[cv()] + nvl(total_vol[cv()-1],0.0),
           total_costs[any] =
                    case SIGN(volume[cv()])
                    when -1 then total_vol[cv()] * nvl(unit_costs[cv()-1],0.0)
                    else volume[cv()] * price[cv()] + nvl(total_costs[cv()-1],0.0)
                    end,
           unit_costs[any] = total_costs[cv()] / total_vol[cv()]
         )
   order by order_id

Output

ORDER_ID  VOLUME  PRICE  TOTAL_VOL  TOTAL_COSTS  UNIT_COSTS
       1    1000    100       1000       100000         100
       2    -500    110        500        50000         100
       3    1500     80       2000       170000          85
       4    -100    150       1900       161500          85
       5    -600    110       1300       110500          85
       6     700    105       2000       184000          92

This site has a good tutorial on the MODEL clause


The EXCEL sheet for the data above would look like this, with the formula extended downwards

    B         C       D      E          F                         G
 ---------------------------------------------------------------------------
2|  order_id  volume  price  total_vol  total_costs               unit_costs
3|                                   0                         0           0
4|  1           1000    100  =C4+E3     =IF(C4<0,G3*E4,F3+C4*D4)  =F4/E4
5|  2           -500    110  =C5+E4     =IF(C5<0,G4*E5,F4+C5*D5)  =F5/E5
6|  3           1500     80  =C6+E5     =IF(C6<0,G5*E6,F5+C6*D6)  =F6/E6
7|  4           -100    150  =C7+E6     =IF(C7<0,G6*E7,F6+C7*D7)  =F7/E7
8|  5           -600    110  =C8+E7     =IF(C8<0,G7*E8,F7+C8*D8)  =F8/E8
9|  6            700    105  =C9+E8     =IF(C9<0,G8*E9,F8+C9*D9)  =F9/E9
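
For readers who want to check the numbers outside the database, the recurrence behind the MODEL rules and the spreadsheet formulas can be sketched in a few lines of Python. The function name `running_average_cost` and the tuple layout are assumptions made for this illustration; when the stock drops to zero the sketch keeps the previous unit cost, mirroring the decode(...) guard in the asker's final query.

```python
def running_average_cost(transactions):
    """Walk (order_id, volume, price) rows in order; negative volume is a sell.

    Returns a list of (order_id, total_vol, total_costs, unit_costs) tuples.
    """
    total_vol = 0
    total_costs = 0.0
    unit_costs = 0.0
    rows = []
    for order_id, volume, price in transactions:
        total_vol += volume
        if volume < 0:
            # Sell: remaining stock is valued at the previous average cost,
            # so the sell price itself is ignored.
            total_costs = total_vol * unit_costs
        else:
            # Buy: add the purchase value to the running cost.
            total_costs += volume * price
        if total_vol != 0:
            unit_costs = total_costs / total_vol
        rows.append((order_id, total_vol, total_costs, unit_costs))
    return rows

if __name__ == "__main__":
    sample = [(1, 1000, 100), (2, -500, 110), (3, 1500, 80),
              (4, -100, 150), (5, -600, 110), (6, 700, 105)]
    for row in running_average_cost(sample):
        print(row)
```

Each row depends on the unit cost of the row before it, which is why a plain windowed sum is not enough and an ordered, row-by-row evaluation is needed.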

Architecture Question - One Central Database and Many Different Programs Accessing It

6 votes

I am designing a program that will build and maintain a database, and act as a central server. This is the 'first stage' of a grander plan. Coming later will be 3-5 remote programs built around the information put into this database.

The requirements are:

  1. The remote programs must be able to access the information in the database.
  2. The remote programs must be able to set alerts when information in the database changes.
  3. The remote programs must be able to request the central server to go out and fetch new / different data.

So, the question is this: how do I expose this data and events to the outside world? My two choices are:

  1. Have them communicate directly with my 'server' application.
    • This seems easier for event notifications (although I suppose I'm probably missing something in SQL).
    • It also seems more 'upgradeable': I don't need to worry about a database change crashing all my remote programs. I can account for the change and transform the data into a version the child program will understand.
  2. Just go ahead and let them connect directly to the database.
    • The nice thing about this is that it's solved. I can use LINQ to SQL. The only thing the main server application needs to do is let the remote programs know where the database is.
    • I'm unsure how to trigger / relay 'events' for field changes in a database over different programs that may or may not be on the same computer.

Forgive my ignorance on this question. I feel woefully unprepared to ask it, but I'm having a hard time figuring out where to get started with this. It is my first real DB project :-/

Thanks!

If the other programs are going to need to know about updates to the database, then the best solution is to manage all db updates through your server application so it can alert clients of the changes. Otherwise it will be tough for the clients to be aware of changes to the db. This also has the advantage of hiding the implementation details of your storage solution from the clients, so you are free to change databases, etc...
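
As a rough illustration of routing every update through the server so it can alert clients, here is a minimal in-process sketch. The `CentralServer` class, its method names, and the callback signature are all hypothetical; a real system would put a network transport (sockets, a web service, a message queue) between the server and the remote programs, and a real database behind the dictionary.

```python
class CentralServer:
    """Owns the data store and notifies subscribers when a value changes."""

    def __init__(self):
        self._data = {}          # stands in for the real database
        self._subscribers = {}   # key -> list of callbacks

    def subscribe(self, key, callback):
        """Register callback(key, old_value, new_value) for changes to key."""
        self._subscribers.setdefault(key, []).append(callback)

    def get(self, key):
        return self._data.get(key)

    def update(self, key, value):
        """The only write path, so the server always sees every change."""
        old = self._data.get(key)
        self._data[key] = value
        if old != value:
            for callback in self._subscribers.get(key, []):
                callback(key, old, value)

if __name__ == "__main__":
    server = CentralServer()
    server.subscribe("price", lambda k, old, new: print(k, old, "->", new))
    server.update("price", 100)
    server.update("price", 105)
```

Because clients only ever see `get`, `update` and `subscribe`, the storage behind them can change without the clients noticing, which is the second advantage the answer mentions.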

DELETE*FROM table

6 votes

Is there a way (similar to the code below) to delete all the rows in a specified table using C#?

SqlConnection con = new SqlConnection(conString);
con.Open();

string sql = @"DELETE*FROM compsTickers;";
SqlCommand cmd = new SqlCommand(sql, con);
cmd.ExecuteNonQuery();
con.Close();

Right now I'm getting an error: Incorrect syntax near '*'.

No need for the *. You delete rows, not columns, from a table, so you don't specify columns to delete.

DELETE FROM compsTickers

How to get last inserted id?

5 votes

I have this code:

string insertSql = 
    "INSERT INTO aspnet_GameProfiles(UserId,GameId) VALUES(@UserId, @GameId)";

using (SqlConnection myConnection = new SqlConnection(myConnectionString))
{
   myConnection.Open();

   SqlCommand myCommand = new SqlCommand(insertSql, myConnection);

   myCommand.Parameters.AddWithValue("@UserId", newUserId);
   myCommand.Parameters.AddWithValue("@GameId", newGameId);

   myCommand.ExecuteNonQuery();

   myConnection.Close();
}

This table has an auto_increment int primary key column called GamesProfileId. How can I get the last inserted value after this insert, so I can use that id to insert into another table?

Change the insert statement to this for SQL Server 2005+

INSERT INTO aspnet_GameProfiles(UserId,GameId) OUTPUT INSERTED.GamesProfileId VALUES(@UserId, @GameId)

...or SQL Server 2000

INSERT INTO aspnet_GameProfiles(UserId,GameId) VALUES(@UserId, @GameId);SELECT SCOPE_IDENTITY()

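And then call ExecuteScalar instead of ExecuteNonQuery to read the new id:

 // SCOPE_IDENTITY() comes back as a decimal, so convert rather than cast
 Int32 newId = Convert.ToInt32(myCommand.ExecuteScalar());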
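
The overall flow (insert, read back the generated key, use it for a second insert) can be tried without a SQL Server instance. The sketch below deliberately swaps in SQLite via Python's sqlite3 module, where `cursor.lastrowid` plays the role that SCOPE_IDENTITY() or the OUTPUT clause plays above; the `ProfileNotes` table and the function name are made up for illustration.

```python
import sqlite3

def insert_profile_with_note(user_id, game_id):
    """Insert a parent row, then reuse its generated key in a child row."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE GameProfiles (
            GamesProfileId INTEGER PRIMARY KEY AUTOINCREMENT,
            UserId INTEGER,
            GameId INTEGER);
        CREATE TABLE ProfileNotes (
            GamesProfileId INTEGER REFERENCES GameProfiles(GamesProfileId),
            Note TEXT);
    """)
    cur = con.execute(
        "INSERT INTO GameProfiles (UserId, GameId) VALUES (?, ?)",
        (user_id, game_id))
    new_id = cur.lastrowid  # key generated by the insert on this cursor
    con.execute(
        "INSERT INTO ProfileNotes (GamesProfileId, Note) VALUES (?, ?)",
        (new_id, "created"))
    con.commit()
    notes = con.execute(
        "SELECT GamesProfileId, Note FROM ProfileNotes").fetchall()
    con.close()
    return new_id, notes

if __name__ == "__main__":
    print(insert_profile_with_note(7, 3))
```

As with SCOPE_IDENTITY(), the value is read from the same connection that did the insert, so concurrent inserts from other sessions do not interfere.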

What concepts should I study to accomplish this?

5 votes

I am such a newbie to C# that I have to ask questions before I can get started!

What I want to do is enter a number into a text box, send it to an attached SQL Compact database, and check whether the number is in the table. If it is, return the data to the form. If not, I want to run some code that will get the information, add it to the table, and send it to the form. Other than creating SQL tables via C#, could someone help me prototype this concept, so to speak, so I can start reading up on the concepts and start building this portion of my project? Thanks.

There are many different ways to do what you are describing. A quick and easy way to handle this scenario would be to use WPF for the user interface and LINQ to SQL for the database access. There are tons of tutorials on both technologies, but here are the basic steps:

Step 1: Create a new WPF project in Visual Studio
Step 2: Add a LINQ to SQL class and map it to your Database
Step 3: Edit the MainWindow.xaml and add the input textbox, check button, and results textbox

Sample code for MainWindow.xaml (note this is quick and dirty):

<Window x:Class="WPFPlayground.MainWindow"
    xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
    xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
    Title="MainWindow" Height="350" Width="525">
    <Grid>
        <StackPanel Orientation="Horizontal" Height="30">
            <TextBox Name="InputTextBox" Width="50"/>
            <Button Name="CheckButton" Content="Check DB" Click="CheckButton_Click"/> 
            <TextBox Name="ResultsTextBox" Width="100"/>
        </StackPanel>
    </Grid>
</Window>

Step 4: Edit the code behind of MainWindow.xaml.cs to handle the button click event

Sample code for Click event in MainWindow.xaml.cs (again quick and dirty)

private void CheckButton_Click(object sender, RoutedEventArgs e)
{
    // Get instance of my LINQ to SQL datacontext
    var db = new MyDbDataContext();

    // Try to get the record that matches the text in the InputTextBox
    var data = db.TableName.FirstOrDefault(r => r.Id == InputTextBox.Text);

    // Was able to find a matching record so show results data
    if (data != null)
    {
        ResultsTextBox.Text = data.EventDesc;
    }
    else 
    {
        // do whatever you need to do when there is no match
    }
}

Step 5: Learn some best practices and do not use this sample code :)

Have fun.
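
The "no match" branch the question asks about (fetch the information, add it to the table, return it) is a look-up-or-create pattern. Here is a minimal sketch of that logic on its own, using SQLite through Python's sqlite3 module rather than SQL Compact and LINQ to SQL; the `items` table, `open_db`, `lookup_or_fetch`, and the `fetch` callback are hypothetical names chosen for this example.

```python
import sqlite3

def open_db():
    """Create an in-memory database with a hypothetical items table."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE items (id INTEGER PRIMARY KEY, description TEXT)")
    return con

def lookup_or_fetch(con, number, fetch):
    """Return the stored description for number, fetching and storing it if absent."""
    row = con.execute(
        "SELECT description FROM items WHERE id = ?", (number,)).fetchone()
    if row is not None:
        return row[0]            # found: hand the stored data back to the form
    description = fetch(number)  # not found: go and get the information
    con.execute(
        "INSERT INTO items (id, description) VALUES (?, ?)",
        (number, description))
    con.commit()
    return description

if __name__ == "__main__":
    con = open_db()
    print(lookup_or_fetch(con, 42, lambda n: "item %d" % n))  # fetched, then stored
    print(lookup_or_fetch(con, 42, lambda n: "never called"))  # served from the table
```

The concepts to read up on are therefore parameterized queries, a single-row lookup, and an insert on the miss path; the UI layer only needs to call one function like this from the button handler.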

When we don't need a primary key for our table?

5 votes

Will it ever happen that we design a table that doesn't need a primary key?

No.

The primary key does a lot of stuff behind-the-scenes, even if your application never uses it.

For example: clustering improves efficiency (because heap tables are a mess).

Not to mention, if ANYONE ever has to do something on your table that requires pulling a specific row and you don't have a primary key, you are the bad guy.

Does the database (maximum) field length affect performance?

4 votes

At my company, we have a legacy database with various tables and therefore many, many fields. A lot of the fields seem to have large limits (ex: NVARCHAR(MAX)) that are never reached. Does arbitrarily making the fields their maximum width or 2 to 3 times larger than what is normally inputted negatively affect performance? How should one balance performance with field lengths? Is there a balance?

There are two parts to this question:

Does using NVARCHAR over VARCHAR hurt performance? Yes, storing data in Unicode fields doubles the storage requirements. Your data stored in those fields is 2x the size it needs to be (until SQL Server 2008 R2 came out, which includes Unicode compression). Your table scans will take twice as long and only half as much data can be stored in memory in the buffer cache.

Does using MAX hurt performance? Not directly, but when you use VARCHAR(MAX), NVARCHAR(MAX), and those kinds of fields, and if you need to index them, you won't be able to rebuild those indexes online.
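
The storage doubling mentioned for NVARCHAR is easy to see by encoding the same text both ways. In this small sketch, latin-1 stands in for a single-byte VARCHAR code page and UTF-16 little-endian for NVARCHAR; that mapping is an approximation, and the figures ignore per-row overhead and Unicode compression. The `storage_bytes` helper is a name made up for the example.

```python
def storage_bytes(text):
    """Return (single_byte_size, two_byte_size) for the character data in text."""
    single_byte = len(text.encode("latin-1"))   # one byte per character
    two_byte = len(text.encode("utf-16-le"))    # two bytes per BMP character
    return single_byte, two_byte

if __name__ == "__main__":
    print(storage_bytes("inventory"))
```

For plain Western text the second number is always twice the first, which is where the "half as much data fits in the buffer cache" estimate comes from.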