Best database questions in July 2011

What is a good choice of database for a small .NET application?

65 votes

I'm developing a small application in C# on .NET and I want a small local database alongside it, where I can save and retrieve records with SQL queries. I don't need anything powerful, just something to use instead of keeping records in files like .txt. What would you suggest? Thanks. P.S. I have already tried .mdf and .sdf files without success. Do they work well too, and if so, how?

You have a couple of immediately recognisable and free options:

The SQL Server Compact download comes with the ADO.NET provider that you will need to reference in code. The SQLite download might not have it so here is a link:

http://sqlite.phxsoftware.com/

They both use SQL, though likely with a few limitations / quirks. Management Studio works with Compact, whereas with SQLite you will need another UI tool such as SQLite Administrator:

http://sqliteadmin.orbmu2k.de/

There are NoSQL alternatives, such as:

Personally I would avoid using MS Access in the face of other free options. You cannot go wrong with either Compact or SQLite; they are both lovely small databases that run relatively quickly in little RAM. Which one you pick comes down to personal preference about the religious aspects of liking a Microsoft product, I suppose :-)

I use Sterling for Windows Phone programming as it is built to use Isolated Storage. I have only seen articles on RavenDb, but I can tell you that it is a JSON based document storage framework.

Not to confuse the situation (go with SQLite or SQL Server Compact), but there are other embedded / local databases out there, some are relational others are object-oriented:

Not all of these are free. SQL / LINQ / in-proc support differs across them all. This list is just for curiosity.

There is now also Karvonite, however the code gallery link is broken. When it's live again I'll be looking into this one for WP7 development.

Is there anything that can be put after the "ORDER BY" clause that can pose a security risk?

18 votes

Basically, what I want to do is this:

mysql_query("SELECT ... FROM ... ORDER BY $_GET[order]")

They can obviously easily create a SQL error by putting nonsense in there, but mysql_query only allows you to execute 1 query, so they can't put something like 1; DROP TABLE ....

Is there any damage a malicious user could do, other than creating a syntax error?

If so, how can I sanitize the query?

There's a lot of logic built on the $_GET['order'] variable being in SQL-like syntax, so I really don't want to change the format.


To clarify, $_GET['order'] won't just be a single field/column. It might be something like last_name DESC, first_name ASC.

Yes, SQL injection attacks can use an unescaped ORDER BY clause as a vector. There's an explanation of how this can be exploited and how to avoid this problem here:

http://josephkeeler.com/2009/05/php-security-sql-injection-in-order-by/

That blog post recommends using a white list to validate the ORDER BY parameter against, which is almost certainly the safest approach.


To respond to the update, even if the clause is complex, you can still write a routine that validates it against a whitelist, for example:

function validate_order_by($order_by_parameter) {
    $columns = array('first_name', 'last_name', 'zip', 'created_at');

    // Split on commas only, so each part is "column" or "column direction".
    $parts = preg_split("/\s*,\s*/", trim($order_by_parameter));

    foreach ($parts as $part) {
        $subparts = preg_split("/\s+/", $part);

        if (count($subparts) < 1 || count($subparts) > 2) {
           // Too many or too few parts.
           return false;
        }

        if (!in_array($subparts[0], $columns)) {
           // Column name is invalid.
           return false;
        }

        if (count($subparts) == 2
            && !in_array(strtoupper($subparts[1]), array('ASC', 'DESC'))) {
          // ASC or DESC is invalid
          return false;
        }
    }

    return true;
}

Even if the ORDER BY clause is complex, it's still made only out of values you supply (assuming you're not letting users edit it by hand). You can still validate using a white list.

I should also add that I normally don't like to expose my database structure in URLs or other places in the UI and will often alias the stuff in the parameters in the URLs and map it to the real values using a hash.
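As a rough illustration of the aliasing idea, here is a minimal Python sketch (the original code is PHP; the alias names and the build_order_by helper are made up for this example). The URL only ever carries aliases, and anything not found in the map is rejected, so the whitelist and the alias lookup are one and the same step.

```python
# Hypothetical public aliases mapped to the real column names.
ORDER_ALIASES = {
    "surname": "last_name",
    "name": "first_name",
    "postcode": "zip",
    "joined": "created_at",
}
DIRECTIONS = {"ASC", "DESC"}


def build_order_by(param):
    """Translate e.g. 'surname desc, name' into a safe ORDER BY body.

    Returns None if any part is not a known alias plus optional direction.
    """
    clauses = []
    for part in param.split(","):
        tokens = part.split()
        if not 1 <= len(tokens) <= 2:
            return None  # empty part, or extra tokens smuggled in
        column = ORDER_ALIASES.get(tokens[0].lower())
        if column is None:
            return None  # unknown alias (real column names are rejected too)
        direction = tokens[1].upper() if len(tokens) == 2 else "ASC"
        if direction not in DIRECTIONS:
            return None
        clauses.append(f"{column} {direction}")
    return ", ".join(clauses)
```

Because the returned string is assembled only from values in the map and the two direction keywords, it should be safe to append after ORDER BY; a None result can be turned into a default ordering or an error.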

Using Mathematica in MySQL databases

10 votes

I've seen it's possible to make a connection between Mathematica and MySQL databases using Input Needs["DatabaseLink"] and conn = OpenSQLConnection[JDBC["MySQL(Connector/J)", "yourserver/yourdatabase"], "Username" -> "yourusername", "Password" -> "yourpassword"] (in case anyone wants to give it a try). Documentation of DatabaseLink here, by the way.

Does anyone have experience using Mathematica in this way, probably to analyze data contained in the database? Are there obvious drawbacks (speed, memory needed, etc.)?

Regards

I recently used databases to speed up a Manipulate[] block.

Without the database, essential data from a 150 MB ASCII file were required in memory for access. As a result, the Manipulate[] block slowed down. It's possible that PackedArray[] would have helped. I didn't investigate this.

With the database, the speed of access of individual datasets is slightly slower than a Select[] block, but memory footprint is down by a factor of nearly 10.

I'd say go for it.

Can a Python list, set or dictionary be implemented invisibly using a database?

9 votes

The Python native capabilities for lists, sets & dictionaries totally rock. Is there a way to continue using the native capability when the data becomes really big? The problem I'm working on involved matching (intersection) of very large lists. I haven't pushed the limits yet -- actually I don't really know what the limits are -- and don't want to be surprised with a big reimplementation after the data grows as expected.

Is it reasonable to deploy on something like Google App Engine that advertises no practical scale limit and continue using the native capability as-is forever and not really think about this?

Is there some Python magic that can hide whether the list, set or dictionary is in Python-managed memory vs. in a DB -- so physical deployment of data can be kept distinct from what I do in code?

How do you, Mr. or Ms. Python Super Expert, deal with lists, sets & dicts as data volume grows?

I'm not quite sure what you mean by native capabilities for lists, sets & dictionaries. However, you can create classes that emulate container types and sequence types by defining some methods with special names. That means that you could create a class that behaves like a list, but stores its data in a SQL database or on GAE datastore. Simply speaking, this is what an ORM does. However, mapping objects to a database is very complicated and it is probably not a good idea to invent your own ORM, but to use an existing one.
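To make the "special method names" point concrete, here is a minimal sketch of a dict-like class that keeps its items in SQLite via the standard library. The class name SqliteDict and the kv table are invented for the example, it only handles string keys and values, and it is nowhere near a real ORM; it just shows the mechanism.

```python
import sqlite3
from collections.abc import MutableMapping


class SqliteDict(MutableMapping):
    """Dict-like container whose items live in a SQLite table (illustrative only)."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)"
        )

    def __getitem__(self, key):
        row = self._db.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def __setitem__(self, key, value):
        self._db.execute(
            "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", (key, value)
        )
        self._db.commit()

    def __delitem__(self, key):
        cur = self._db.execute("DELETE FROM kv WHERE key = ?", (key,))
        if cur.rowcount == 0:
            raise KeyError(key)
        self._db.commit()

    def __iter__(self):
        for (key,) in self._db.execute("SELECT key FROM kv"):
            yield key

    def __len__(self):
        return self._db.execute("SELECT COUNT(*) FROM kv").fetchone()[0]
```

With those five methods defined, MutableMapping supplies get(), the in operator, update(), items() and so on, so calling code can treat the object much like a plain dict. Every access is now a query, though, which is exactly the kind of cost that stays hidden until the data grows.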

I'm afraid there is no one-size-fits-all solution. GAE in particular is not some kind of Magic Fairy Dust you can sprinkle on your code to make it scale. There are several limitations you have to keep in mind to create an application that can scale. Some of them are general, like computational complexity; others are specific to the environment your code runs in. E.g. on GAE the maximum response time is limited to 30 seconds, and querying the datastore works differently than on other databases.

It's hard to give any concrete advice without knowing your specific problem, but I doubt that GAE is the right solution.

In general, if you want to work with large datasets, you either have to keep that in mind from the start or you will have to rework your code, algorithms and data structures as the datasets grow.

Are there any issues with always preparing SQL statements with PHP?

8 votes

Is there any issue with always preparing SQL statements with PHP instead of executing them directly?

Not sure if database system matters, but it's DB2 on System i.

You might take a slight performance hit, if they are real prepared statements and not just emulated in the driver. This is because you will have to make two calls to the database, rather than just one.

This takes a long time...how do I speed this dictionary up? (python)

7 votes
    meta_map = {}
    results = db.meta.find({'corpus_id':id, 'method':method}) #this Mongo query only takes 3ms
    print results.explain()
    #result is mongo queryset of 2000 documents

    count = 0
    for r in results:
        count += 1
        print count
        word = r.get('word')
        data = r.get('data',{})
        if not meta_map.has_key(word):
            meta_map[word] = data
    return meta_map

This is super, super slow for some reason.

There are a total of 2000 results. Below is an example of a result document (from Mongo). All other results are similar in length.

{ "word" : "articl", "data" : { "help" : 0.42454812322341984, "show" : 0.24099054286865948, "lack" : 0.2368313038407821, "steve" : 0.20491936823259457, "gb" : 0.18757527934987422, "feedback" : 0.2855335862138559, "categori" : 0.28210549642632016, "itun" : 0.23615623082085788, "articl" : 0.21378509220044106, "black" : 0.22720575131038662, "hidden" : 0.26172127252557625, "holiday" : 0.27662433827306804, "applic" : 0.1802411089325281, "digit" : 0.20491936823259457, "sourc" : 0.21909218369809863, "march" : 0.2632736571995878, "ceo" : 0.2153108869289692, "donat" : 1, "volum" : 0.2572042432755638, "octob" : 0.2802470156773559, "toolbox" : 0.2153108869289692, "discuss" : 0.26973295489368615, "list" : 0.3698592948408095, "upload" : 0.1802411089325281, "random" : 1, "default" : 0.33044754314072383, "februari" : 0.2899936154686609, "januari" : 0.25228424754983525, "septemb" : 0.1802411089325281, "page" : 0.24675067183234803, "view" : 0.20019523259334138, "pleas" : 0.2839965947961194, "mdi" : 0.2731217555354, "unsourc" : 0.2709524603813144, "direct" : 0.18757527934987422, "dead" : 0.22720575131038662, "smartphon" : 0.2839965947961194, "jump" : 0.3004203939398161, "see" : 0.33044754314072383, "design" : 0.2839965947961194, "download" : 0.19574598998663462, "home" : 0.3004203939398161, "event" : 0.651573574681647, "wikipedia" : 0.21909218369809863, "content" : 0.2471475889083912, "version" : 0.42454812322341984, "gener" : 0.3004203939398161, "refer" : 0.2188507485718582, "navig" : 0.27662433827306804, "june" : 0.2153108869289692, "screen" : 0.27662433827306804, "free" : 0.22720575131038662, "job" : 0.19574598998663462, "key" : 0.3004203939398161, "addit" : 0.22484486630589545, "search" : 0.2878804276884952, "current" : 0.5071530767683105, "worldwid" : 0.20491936823259457, "iphon" : 0.2230524329516571, "action" : 0.24099054286865948, "chang" : 0.18757527934987422, "summari" : 0.33044754314072383, "origin" : 0.2572042432755638, "softwar" : 0.651573574681647, "point" : 
0.27662433827306804, "extern" : 0.22190187748860113, "mobil" : 0.2514880028687207, "cloud" : 0.18757527934987422, "use" : 0.2731217555354, "log" : 0.27662433827306804, "commun" : 0.33044754314072383, "interact" : 0.5071530767683105, "devic" : 0.3004203939398161, "long" : 0.2839965947961194, "avail" : 0.19574598998663462, "appl" : 0.24099054286865948, "disambigu" : 0.3195885490528538, "statement" : 0.2737499468972353, "namespac" : 0.3004203939398161, "season" : 0.3004203939398161, "juli" : 0.27243508666247285, "relat" : 0.19574598998663462, "phone" : 0.26973295489368615, "link" : 0.2178125232318433, "line" : 0.42454812322341984, "pilot" : 0.27243508666247285, "account" : 0.2572042432755638, "main" : 0.34870313981256423, "provid" : 0.2153108869289692, "histori" : 0.2714135089366041, "vagu" : 0.24875213214603717, "featur" : 0.24099054286865948, "creat" : 0.26645207330844684, "ipod" : 0.2230524329516571, "player" : 0.20491936823259457, "io" : 0.2447908314834019, "need" : 0.2580912994161046, "develop" : 0.27662433827306804, "began" : 0.24099054286865948, "client" : 0.19574598998663462, "also" : 0.42454812322341984, "cleanup" : 0.24875213214603717, "split" : 0.26973295489368615, "tool" : 0.2878804276884952, "product" : 0.42454812322341984, "may" : 0.2676701118192027, "assist" : 0.1802411089325281, "variant" : 0.2514880028687207, "portal" : 0.3004203939398161, "user" : 0.20491936823259457, "consid" : 0.27662433827306804, "date" : 0.2731217555354, "recent" : 0.24099054286865948, "read" : 0.2572042432755638, "reliabl" : 0.2388872270166464, "sale" : 0.22720575131038662, "ambigu" : 0.23482106920048526, "person" : 0.260801274024785, "contact" : 0.24099054286865948, "encyclopedia" : 0.2153108869289692, "time" : 0.2368313038407821, "model" : 0.24099054286865948, "audio" : 0.19574598998663462 }}

The whole process takes about 15 seconds...what the hell? How can I speed it up? :)

Edit: I realize that when I print the count in console, it goes from 0 to 101 very fast, and then freezes for 10 seconds, and then continues from 102 to 2000

could this be a MongoDB problem?

Edit 2: I printed the Mongo EXPLAIN() of the query below:

{u'allPlans': [{u'cursor': u'BtreeCursor corpus_id_1_method_1_word_1',
                u'indexBounds': {u'corpus_id': [[u'iphone', u'iphone']],
                                 u'method': [[u'advanced', u'advanced']],
                                 u'word': [[{u'$minElement': 1},
                                            {u'$maxElement': 1}]]}}],
 u'cursor': u'BtreeCursor corpus_id_1_method_1_word_1',
 u'indexBounds': {u'corpus_id': [[u'iphone', u'iphone']],
                  u'method': [[u'advanced', u'advanced']],
                  u'word': [[{u'$minElement': 1}, {u'$maxElement': 1}]]},
 u'indexOnly': False,
 u'isMultiKey': False,
 u'millis': 3,
 u'n': 2443,
 u'nChunkSkips': 0,
 u'nYields': 0,
 u'nscanned': 2443,
 u'nscannedObjects': 2443,
 u'oldPlan': {u'cursor': u'BtreeCursor corpus_id_1_method_1_word_1',
              u'indexBounds': {u'corpus_id': [[u'iphone', u'iphone']],
                               u'method': [[u'advanced', u'advanced']],
                               u'word': [[{u'$minElement': 1},
                                          {u'$maxElement': 1}]]}}}

These are the stats for the mongo collection:

> db.meta.stats();
{
    "ns" : "inception.meta",
    "count" : 2450,
    "size" : 3001068,
    "avgObjSize" : 1224.9257142857143,
    "storageSize" : 18520320,
    "numExtents" : 6,
    "nindexes" : 2,
    "lastExtentSize" : 13893632,
    "paddingFactor" : 1.009999999999931,
    "flags" : 1,
    "totalIndexSize" : 368640,
    "indexSizes" : {
        "_id_" : 114688,
        "corpus_id_1_method_1_word_1" : 253952
    },
    "ok" : 1
}


> db.meta.getIndexes();
[
    {
        "name" : "_id_",
        "ns" : "inception.meta",
        "key" : {
            "_id" : 1
        },
        "v" : 0
    },
    {
        "ns" : "inception.meta",
        "name" : "corpus_id_1_method_1_word_1",
        "key" : {
            "corpus_id" : 1,
            "method" : 1,
            "word" : 1
        },
        "v" : 0
    }
]

Your query is returning almost all the documents in your collection (which may or may not be correct in this case; good database advice is always to transmit as few documents/rows as possible from the server to your application), and your collection is about 3 megabytes in size. It's possible that the delay you are seeing is simply due to the network transmission time.

Many database rows vs one comma separated values row

6 votes

I'm creating a table for allowing website users to become friends. I'm trying to determine which is the best table design to store and return a user's friends. The goal is to have fast queries and not use up a lot of db space.

I have two options:

Have individual rows for each friendship.

+----+-------------+-------------------+
| ID | User_ID     | Friend_ID         |
+----+-------------+-------------------+
| 1  | 102         | 213               |
| 2  | 64          | 23                |
| 3  | 4           | 344               |
| 4  | 102         | 2                 |
| 5  | 102         | 90                |
| 6  | 64          | 88                |
+----+-------------+-------------------+

Or store all friends in one row as CSV

    +----+-------------+-------------------+
    | ID | User_ID     | Friend_ID         |
    +----+-------------+-------------------+
    | 1  | 102         | 213,44,34,67,8    |
    | 2  | 64          | 23,33,45,105      |
    +----+-------------+-------------------+

When retrieving friends I can create an array using explode() however deleting a user would be trickier.

Edit: For second method I would separate each id in array in php for functions such as counting and others.

Which method do you think is better?

First method is definitely better. It's what makes relational databases great :)

It will allow you to search for and group by much more specific criteria than the 2nd method.

Say you wanted to write a query so users could see who had them as a friend. The 2nd method would require string matching on the CSV column (for example MySQL's FIND_IN_SET()), which cannot use an index and would be much slower than simply using JOINs.
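A small sketch of that reverse lookup with one row per friendship, using Python's built-in sqlite3 as a stand-in for the real database. The table and column names, the extra users table and the who_lists_as_friend helper are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE friendships (
        user_id   INTEGER NOT NULL REFERENCES users(id),
        friend_id INTEGER NOT NULL REFERENCES users(id),
        PRIMARY KEY (user_id, friend_id)
    );
    -- Lets the reverse lookup ("who lists me?") use an index as well.
    CREATE INDEX idx_friend ON friendships (friend_id);
    INSERT INTO users VALUES (2, 'bob'), (64, 'carol'), (102, 'alice'), (213, 'dave');
    INSERT INTO friendships VALUES (102, 213), (102, 2), (64, 213);
""")


def who_lists_as_friend(user_id):
    """Names of the users who have user_id in their friend list."""
    rows = conn.execute(
        """SELECT u.name
           FROM friendships f
           JOIN users u ON u.id = f.user_id
           WHERE f.friend_id = ?
           ORDER BY u.name""",
        (user_id,),
    )
    return [name for (name,) in rows]
```

Removing a user or a single friendship is then a plain DELETE on matching rows, with no string surgery on a comma-separated value.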

MySQL Ordering AA before A

6 votes

I'm doing an e-commerce website for a client who sells lingerie, I've written up a bra size picker for them but they've come back to me today with a slight issue.

With bra sizes, AA is smaller than A, so it should appear before that in the chart, but when I use MySQL ORDER BY on the size, obviously it puts A first, then AA, then B etc.

Is there a simple way I can get MySQL to order AA first, then A, B etc?

Thanks

Assuming A is the only possible repeating letter, you can do this:

SELECT  *
FROM    bra
ORDER BY
        LENGTH(size) DESC, size

But a better solution would be to create a conversion table which would store all possible sizes (European, Japanese etc.), including metric, on which you can order.

You may use it to build conversion charts and show the sizes in person's preferred system as well.
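Here is a minimal sketch of the lookup-table idea, again with Python's sqlite3 standing in for MySQL. The cup_size table, its sort_rank column and the sample sizes are assumptions for illustration; a real conversion table would carry the other sizing systems as extra columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per size; sort_rank defines the display order explicitly.
    CREATE TABLE cup_size (label TEXT PRIMARY KEY, sort_rank INTEGER NOT NULL);
    INSERT INTO cup_size VALUES
        ('AA', 1), ('A', 2), ('B', 3), ('C', 4), ('D', 5), ('DD', 6);

    CREATE TABLE bra (id INTEGER PRIMARY KEY, size TEXT REFERENCES cup_size(label));
    INSERT INTO bra (size) VALUES ('B'), ('A'), ('DD'), ('AA'), ('D');
""")

ordered = [
    size
    for (size,) in conn.execute(
        "SELECT b.size FROM bra b "
        "JOIN cup_size c ON c.label = b.size "
        "ORDER BY c.sort_rank"
    )
]
print(ordered)
```

Unlike the LENGTH() trick, this keeps working once another doubled letter such as DD shows up, since the order comes from the rank column and not from the spelling of the label.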

How to structure an extremely large table

6 votes

This is more of a conceptual question. It's inspired by using some extremely large tables where even a simple query takes a long time (properly indexed). I was wondering if there is a better structure than just letting the table grow continually.

By large I mean 10,000,000+ records, growing every day by something like 10,000/day. A table like that would gain 10,000,000 additional records every 2.7 years. Let's say that the more recent records are accessed the most, but the older ones need to remain available. I have two conceptual ideas to speed it up.

1) Maintain a master table that holds all the data, indexed by date in reverse order. Create a separate view for each year that holds only the data for that year. Then when querying, and let's say the query is expected to pull only a few records from a three-year span, I could use a union to combine the three views and select from those.

2) The other option would be to create a separate table for every year. Then, again using a union to combine them when querying.

Does anyone else have any other ideas or concepts? I know this is a problem Facebook has faced, so how do you think they handled it? I doubt they have a single table (status_updates) that contains 100,000,000,000 records.

The main RDBMS providers all have similar concepts in terms of partitioned tables and partitioned views (as well as combinations of the two)

There is one immediate benefit, in that the data is now split across multiple conceptual tables, so any query that includes the partition key within the query can automatically ignore any partition that the key would not be in.

From an RDBMS management perspective, having the data divided into separate partitions allows operations such as backup, restore and indexing to be performed at a partition level. This helps reduce downtime and allows for far faster archiving by just removing an entire partition at a time.

There are also non-relational storage mechanisms such as NoSQL stores, MapReduce etc., but ultimately how the data is used, loaded and archived becomes the driving factor in deciding which structure to use.

10 million rows is not that large in the scale of large systems; partitioned systems can and will hold billions of rows.

SQL Query - Need to improve performance

5 votes

I have a data load scenario where I create dynamic sql query to pull data and cache in our service. There is 1 table that contains all product data : ProductHistory (47 columns, 200,000 records + and will keep growing)

What I need: Get the latest products by using the maximum id, maximum version and maximum changeid.

First Attempt:

SELECT distinct Product.* FROM ProductHistory product 
WHERE  product.version = 
(SELECT max(version) from ProductHistory p2 where product.Id = p2.Id 
  and product.changeId = 
(SELECT max(changeid) from ProductHistory p3 where p2.changeId = p3.changeId))

This took more than 2.51 minutes.

Other Failed Attempt:

select distinct product.* from ProductHistory product 
where CAST(CAST(id as nvarchar)+'0'+CAST(Version as nvarchar)+'0'+CAST(changeid as nvarchar) as decimal) = 
(select MAX(CAST(CAST(id as nvarchar)+'0'+CAST(Version as nvarchar)+'0'+CAST(changeid as nvarchar) as decimal)) from ProductHistory p2 
where product.Id = p2.Id)

It basically uses the same principle as when you order dates, concatenating the numbers ordered by relevance.

For example 11 Jul 2007 = 20070711
And in our case: Id = 4 , version = 127, changeid = 32   => 40127032
The zeros are there not to mix up the 3 different ids

But this one takes 3.10 minutes !!! :(

So, I basically need a way to make my first attempt query better by any chance. I was also wondering with such amount of data, is this the best speed of retrieval that I should expect ?

  1. I ran sp_helpindex ProductHistory and found out the indexes as below :

    PK_ProductHistoryNew - clustered, unique, primary key located on PRIMARY- Id, Version

  2. I wrapped the first query in a SP but still no change.

So, wondering by what other means we can improve the performance of this operation ?

Thanks, Mani. P.S.: I am just running these queries in SQL Server Management Studio to see the time.

Run the query from SQL Server Management Studio and look at the query plan to see where the bottleneck is. Anywhere you see a "table scan" or "index scan", it has to go through all the data to find what it is looking for. If you create appropriate indexes that can be used for these operations, performance should improve.
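The same before/after check can be sketched outside Management Studio. The example below uses Python's built-in sqlite3 and SQLite's EXPLAIN QUERY PLAN purely as a stand-in: the plan text differs from SQL Server's, the table is cut down to a few columns, and the idx_changeid index name is made up. It only shows how a scan turns into an index search once a usable index exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ProductHistory ("
    "Id INTEGER, Version INTEGER, ChangeId INTEGER, Name TEXT, "
    "PRIMARY KEY (Id, Version))"
)


def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


query = "SELECT * FROM ProductHistory WHERE ChangeId = 32"

before = plan(query)  # no index covers ChangeId yet, so this is a full scan
conn.execute("CREATE INDEX idx_changeid ON ProductHistory (ChangeId)")
after = plan(query)   # the planner can now search the new index

print(before)
print(after)
```

In the question's case the only index is on (Id, Version), so the correlated subquery on changeid is a likely candidate for this kind of scan; the actual plan in Management Studio is what should confirm it.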

Reasons for objection to SQL triggers that insert data into other tables?

5 votes

I'm being told by a person with some authority in our company that it's a "database no-no" to create triggers in a database that change rows in another table.

I've used this technique to create default initial configuration, auto-maintaining audit logs, and various other things that would have been a nightmare to consistently maintain inside the heterogeneous applications that connect to that database. For over a decade, I've read that this is an appropriate way to centralize relationship constraint maintenance and get the responsibility out of the applications interacting with the data.

As such, my BS meter is pegging with this. Am I missing something fundamentally wrong with that technique that makes it a bad practice in general?

If you are careful with your trigger code, there is nothing inherently bad about it. Some people get bitten by bad trigger code and then decide that triggers are bad (even though it was the bad trigger code that was the problem). They then generalize this as "never use triggers".

The other problem is....

Using the audit tables as an example, suppose you have a stored procedure that updates a table AND puts data in to an audit table. Now suppose you write trigger code to put data in to the audit table. You could end up with duplicate audit data.
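The duplicate-audit scenario is easy to reproduce. Below is a minimal sketch using Python's sqlite3 (trigger syntax differs between databases, and the account tables, trigger name and update_balance helper are invented for the example). The also_audit_in_app flag plays the role of the stored procedure that writes its own audit row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER);
    CREATE TABLE account_audit (
        account_id INTEGER, old_balance INTEGER, new_balance INTEGER
    );

    -- The trigger writes an audit row for every balance change.
    CREATE TRIGGER trg_account_audit AFTER UPDATE OF balance ON account
    BEGIN
        INSERT INTO account_audit VALUES (OLD.id, OLD.balance, NEW.balance);
    END;

    INSERT INTO account VALUES (1, 100);
""")


def update_balance(account_id, new_balance, also_audit_in_app=False):
    old = conn.execute(
        "SELECT balance FROM account WHERE id = ?", (account_id,)
    ).fetchone()[0]
    conn.execute(
        "UPDATE account SET balance = ? WHERE id = ?", (new_balance, account_id)
    )
    if also_audit_in_app:
        # Application-side auditing on top of the trigger: a second row.
        conn.execute(
            "INSERT INTO account_audit VALUES (?, ?, ?)",
            (account_id, old, new_balance),
        )


def audit_count():
    return conn.execute("SELECT COUNT(*) FROM account_audit").fetchone()[0]
```

One update with the flag off leaves one audit row; one update with the flag on leaves two for the same change. The trigger itself is fine; the problem is two places both believing they own the audit trail.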

Should I rate-limit or reduce my database queries?

5 votes

I'm creating a PHP script that imports some data from text files into a MySQL database. These text files are pretty large, an average file will have 10,000 lines in it each of which corresponds to a new item I want in my database. (I won't be importing files very often)

I'm worried that reading a line from the file, and then doing a INSERT query, 10,000 times in a row might cause some issues. Is there a better way for me to do this? Should I perform one INSERT query with all 10,000 values? Or would that be just as bad?

Maybe I can reach a medium, and perform something like 10 or 100 entries at once. Really my problem is that I don't know what is good practice. Maybe 10,000 queries in a row is fine and I'm just worrying for nothing.

Any suggestions?

Yes, there is a better way: build a single multi-row INSERT.

<?php
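// Read the file without trailing newlines, skipping blank lines.
$lines = file('file.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$values = array();
foreach ($lines as $line) {
    // addslashes() is only a stand-in here; use your driver's escaping
    // function or bound parameters against a real database.
    $values[] = "('" . addslashes($line) . "')";
}
// `table` is a placeholder name (and a reserved word, hence the backticks).
$query = "INSERT INTO `table` VALUES " . implode(",", $values);
echo $query;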

http://sandbox.phpcode.eu/g/5ade4.php

This builds one single query, which is many times faster than the one-line-one-query style!
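For the "10 or 100 entries at once" middle ground raised in the question, here is a hedged sketch in Python with sqlite3 (the original is PHP and MySQL; the items table and import_lines helper are assumptions). It sends rows in batches with bound parameters, one transaction per batch, so values containing quotes need no manual escaping.

```python
import sqlite3
from itertools import islice


def import_lines(conn, lines, batch_size=100):
    """Insert non-empty lines in batches; returns the number of rows inserted."""
    cleaned = (line.rstrip("\r\n") for line in lines)
    rows = ((line,) for line in cleaned if line)  # skip blank lines
    total = 0
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        with conn:  # commits the batch, or rolls it back on error
            conn.executemany("INSERT INTO items (name) VALUES (?)", batch)
        total += len(batch)
    return total


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT)")
inserted = import_lines(conn, ["first\n", "\n", "o'brien\n"], batch_size=2)
print(inserted)
```

The batch size is a tuning knob, not a magic number: larger batches mean fewer round trips, while smaller ones keep each statement and transaction modest. For an occasional 10,000-line import, either extreme is likely to be acceptable.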

Is it necessary to have an index on every combination of queryable fields in a SQL table to optimize performance?

5 votes

If my User table has several fields that are queryable (say DepartmentId, GroupId, RoleId) will it make any speed difference if I create an index for each combination of those fields?

By "queryable", I'm referring to a query screen where the end user can select records based on Department, Group or Role by selecting from a drop-down.

At the moment, I have an index on each of DepartmentId, GroupId and RoleId. That's a single non-unique index per field.

If an end user selects "anyone in Group B", the SQL looks like:

select * from User where GroupId = 2

Having an index on GroupId should speed that up.

But if the end user selects "anyone in Group B and in Role C", the SQL would look like this:

select * from User where GroupId = 2 and RoleId = 3

Having indexes on GroupId and RoleId individually may not make any difference, right?

A better index for that search would be if I had one index spanning both GroupId and RoleId.

But if that's the case, then that would mean that I would need to have an index for every combination of queryable fields. So I would need all these indexes:

  • DepartmentId
  • GroupId
  • RoleId
  • DepartmentId and GroupId
  • DepartmentId and RoleId
  • GroupId and RoleId
  • DepartmentId, GroupId and RoleId

Can anyone shed some light on this? I'm using MySQL if that makes a difference.

A multi-column index can be used for any left prefix of that index. So, an index on (A, B, C) can be used for queries on (A), (A, B) and (A, B, C), but it cannot, for example, be used for queries on (B) or (B, C).

If the columns are all indexed individually, MySQL (5.0 or later) may also use Index Merge Optimization.
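The left-prefix rule can be seen in a query plan. This sketch uses Python's sqlite3 and EXPLAIN QUERY PLAN as a stand-in for MySQL's EXPLAIN; the planners differ (Index Merge is MySQL-specific) and the index name is made up, but the prefix behaviour of a composite B-tree index is the same idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE User (Id INTEGER PRIMARY KEY, DepartmentId INTEGER, "
    "GroupId INTEGER, RoleId INTEGER, Name TEXT)"
)
conn.execute(
    "CREATE INDEX idx_dept_group_role ON User (DepartmentId, GroupId, RoleId)"
)


def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


# (DepartmentId, GroupId) is a left prefix of the index, so it can be searched.
prefix_plan = plan("SELECT * FROM User WHERE DepartmentId = 1 AND GroupId = 2")

# (GroupId, RoleId) skips the leading column, so the index cannot be searched.
non_prefix_plan = plan("SELECT * FROM User WHERE GroupId = 2 AND RoleId = 3")

print(prefix_plan)
print(non_prefix_plan)
```

So one index per combination is not needed: a composite index already serves its own prefixes, and the remaining combinations only deserve a dedicated index if those queries are common enough to matter.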

Simple key/value database

4 votes

I'm looking for a very(!) simple (server-side) key/value database which preferably has a REST API. It's used to cache some values in a javascript driven web application. Looking for a quick set up and easy ajax interaction. Any tips would be greatly appreciated!

You should check out CouchDB:

http://couchdb.apache.org/docs/intro.html

Sounds about what you are looking for.

Ideal data source for Standalone Applications?

3 votes

Suppose I created a standalone application in Java and distributed it through my website.

What I need to know is what is the ideal Database solution I can use so that all the users can use the application without installing anything extra.

For example, if I use SQL server or MySql database, the downloader will need those installed so that I can dump my database on it. Another solution I see is to use Database file like Microsoft Access .mdb file. But that too will need Microsoft Office installed on the client. XML files are platform independent but anyone can open and change them.

What is the solution used by most of the applications?

H2 is a very potent candidate for your need. We use it as a default RDBMS in all our demo applications and also for db-driven unit testing.