Best mysql questions in July 2011

Is there anything that can be put after the "ORDER BY" clause that can pose a security risk?

18 votes

Basically, what I want to do is this:

mysql_query("SELECT ... FROM ... ORDER BY $_GET[order]")

They can obviously create a SQL error by putting nonsense in there, but mysql_query only allows you to execute one query, so they can't put in something like 1; DROP TABLE ....

Is there any damage a malicious user could do, other than creating a syntax error?

If so, how can I sanitize the query?

There's a lot of logic built on the $_GET['order'] variable being in SQL-like syntax, so I really don't want to change the format.


To clarify, $_GET['order'] won't just be a single field/column. It might be something like last_name DESC, first_name ASC.

Yes, SQL injection attacks can use an unescaped ORDER BY clause as a vector. There's an explanation of how this can be exploited and how to avoid this problem here:

http://josephkeeler.com/2009/05/php-security-sql-injection-in-order-by/

That blog post recommends using a white list to validate the ORDER BY parameter against, which is almost certainly the safest approach.


To respond to the update, even if the clause is complex, you can still write a routine that validates it against a whitelist, for example:

function validate_order_by($order_by_parameter) {
    $columns = array('first_name', 'last_name', 'zip', 'created_at');

    // Split on commas only, so each part is "column" or "column direction".
    $parts = preg_split("/\s*,\s*/", trim($order_by_parameter));

    foreach ($parts as $part) {
        $subparts = preg_split("/\s+/", $part);

        if (count($subparts) < 1 || count($subparts) > 2) {
           // Too many or too few parts.
           return false;
        }

        if (!in_array($subparts[0], $columns, true)) {
           // Column name is invalid (this also rejects an empty part).
           return false;
        }

        if (count($subparts) == 2
            && !in_array(strtoupper($subparts[1]), array('ASC', 'DESC'), true)) {
          // ASC or DESC is invalid
          return false;
        }
    }

    return true;
}

Even if the ORDER BY clause is complex, it's still made only out of values you supply (assuming you're not letting users edit it by hand). You can still validate using a white list.

I should also add that I normally don't like to expose my database structure in URLs or other places in the UI and will often alias the stuff in the parameters in the URLs and map it to the real values using a hash.
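The aliasing idea can be sketched as follows. This is a minimal illustration in Python rather than PHP, and the alias names in ORDER_ALIASES and the function name build_order_by are made up for the example. The point is that the clause is rebuilt only from values the server supplies, so nothing from the request reaches the SQL string directly.

```python
# Hypothetical alias map: public parameter names -> real column names.
ORDER_ALIASES = {
    "first": "first_name",
    "last": "last_name",
    "zip": "zip",
    "created": "created_at",
}
DIRECTIONS = {"ASC", "DESC"}


def build_order_by(param):
    """Translate e.g. 'last desc, first' into a safe ORDER BY body, or return None."""
    terms = []
    for part in param.split(","):
        tokens = part.split()
        if not 1 <= len(tokens) <= 2:
            return None  # empty term or too many tokens
        column = ORDER_ALIASES.get(tokens[0].lower())
        if column is None:
            return None  # unknown alias
        direction = tokens[1].upper() if len(tokens) == 2 else "ASC"
        if direction not in DIRECTIONS:
            return None  # anything other than ASC/DESC is rejected
        terms.append(f"{column} {direction}")
    return ", ".join(terms)
```

A caller would append the returned string after ORDER BY and fall back to a default ordering when the function returns None.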

Efficient method to find collision free random numbers

12 votes

I have a users table, and the user ID is public. But I want to obfuscate the number of registered users and the trends of the project, so I don't want to have public incrementing IDs.

When a new user is created I want to find a random integer number that is greater than a certain number and that is not yet in the database.

Naive code:

<?php
    $found = false;
    while (!$found) {
      $uid = rand(1000000000, 4294967295); // find random number between minimum and maximum
      $dbh->beginTransaction();
      // check if user id is in use, and if not insert it
      $inUse = $dbh->query("SELECT COUNT(*) FROM users WHERE uid = $uid")->fetchColumn();
      if (!$inUse) {
        $dbh->exec("INSERT INTO users (uid) VALUES ($uid)");
        $found = true;
      }
      $dbh->commit();
    }
    // we just got our new uid ...
?>

This will work; however, it may become inefficient. True, the range is big and the probability of hitting an unused uid is high. But what if I want to use a smaller range, because I don't want such long user ids?

Example of my concerns:

  • 60% of all user ids are in use
  • the chance of hitting an unused uid is 0.4
  • the first attempt has a 0.4 success rate
  • if the 1st is not successful, the second attempt succeeds with probability 0.6*0.4
  • so with a maximum of two tries I have 0.4 + 0.6*0.4 probability (is that right??)

So one optimization that came to my mind is the following:

  • find a random number, check if its free, if not, increment it by 1 and try again and so on
  • if the maximum number is hit, continue with the minimum number

That should give me a number with a maximum runtime of O(range)

That sounds pretty bad, but I think it is not, because I submit random numbers to the database, and it is very unlikely that they are all at the beginning. So how good/bad is it really?

I think this would work just fine but I want it BETTER

So what about this?

  • find a random number
  • query the database for how many numbers are occupied in the whole range, starting from that number (this first step is trivial...)
  • if there are numbers occupied in that range, divide the range by half and try again. starting with the initial number
  • if there are numbers occupied divide the range by half and try again. starting with the initial number

If I am thinking correctly this will give me a number with a maximum of O(log(range)) time.

That is pretty satisfying because log() is pretty good. However I think this method will often be as bad as possible. Because with our random numbers we will probably always hit numbers in the large intervals.

So at the beginning our pure random method is probably better.

So what about having a limit like this

  • select current number of used numbers
  • is it greater than X, logarithmic range approach
  • if it is not, use pure random method

What would X be and why?

So final question:

This is pretty easy and pretty complicated at the same time.

I think this is a standard problem because lots and lots of system use random ids (support tickets etc), so I cannot imagine I am the first one to stumble across this.

How would you solve this? Any input is appreciated!

Is there maybe an existing class / procedure for this I can use?

Or maybe some database functions that I can use?

I would like to do it in PHP/Mysql

IMPORTANT EDIT:

I just thought about the range/logarithmic solution. It seems to be complete bullshit sorry for my wording because:

  • what if i hit an occupied number at start?

Then I keep dividing my range until it is only 1. And even then the number is occupied.

So it's completely the same as the pure random method from the start, only worse....

I am a bit embarrassed I made this up, but I will leave it in because I think it's a good example of overcomplicated thinking!

If p is the proportion of ids in use, your "naive" solution will, on average, require 1/(1-p) attempts to find an unused id. (See Geometric distribution). In the case of 60% occupancy, that is a mere 1/0.4 = 2.5 queries ...
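To make the arithmetic concrete, here is a small sketch (the function names are made up for illustration) that computes the expected number of attempts and the chance of succeeding within k tries, assuming every draw is independent and uniform over the range.

```python
def expected_attempts(p):
    """Mean number of random draws until a free id is hit, given occupancy p."""
    if not 0 <= p < 1:
        raise ValueError("p must be in [0, 1)")
    return 1.0 / (1.0 - p)


def success_within(k, p):
    """Probability that at least one of k independent draws hits a free id."""
    return 1.0 - p ** k
```

With p = 0.6 the second function gives 1 - 0.36 = 0.64 for two tries, which is the same as the 0.4 + 0.6*0.4 from the question.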

Your "improved" solution requires about log(n) database calls, where n is the number of ids in use. That is quite a bit more than the "naive" solution. Also, your improved solution is incomplete (for instance, it does not handle the case where all numbers in a subrange are taken, and does not say which subrange you recurse into) and is more complex to implement to boot.

Finally, note that your implementation will only be thread safe if the database provides very strict transaction isolation, which scales poorly, and might not be the default behaviour of your database system. If that turns out to be a problem, you could speculatively insert with a random id, and retry in the event of a constraint violation.
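The speculative insert could look roughly like this. It is a sketch only: it uses Python with an in-memory SQLite database as a stand-in for PHP/MySQL, the function name and the max_tries parameter are assumptions, and it relies on uid having a PRIMARY KEY (or UNIQUE) constraint so that the database itself rejects duplicates.

```python
import random
import sqlite3


def insert_with_random_uid(conn, low, high, max_tries=100):
    """Insert a row with a random uid; retry when the PRIMARY KEY rejects a duplicate."""
    for _ in range(max_tries):
        uid = random.randint(low, high)
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute("INSERT INTO users (uid) VALUES (?)", (uid,))
            return uid
        except sqlite3.IntegrityError:
            continue  # uid already taken, draw again
    raise RuntimeError("no free uid found; the range may be nearly full")


def make_users_db():
    """Create a throwaway database whose uid column enforces uniqueness."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (uid INTEGER PRIMARY KEY)")
    return conn
```

Because the uniqueness check happens inside the INSERT itself, no separate SELECT is needed and two concurrent writers cannot both claim the same id.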

When should I use C++ instead of SQL?

10 votes

I am a C++ programmer who occasionally uses MySQL to work with databases, but my SQL knowledge is rather limited. However I am surely willing to change that.

At the moment I am trying to do analysis(!) on the data I have in my database solely with SQL queries. But I am about to give up, and instead import the data to C++ and do the analysis with C++ code.

I have discussed this with my colleagues, and they also push me to use C++, saying that SQL is not meant for complex analysis but mainly for importing (from the existing tables) and exporting (to new tables) data, and a little bit more such as merging data to - e.g. - joined tables.

Can somebody help me draw a line, so I know when to switch to C++? Of course performance is also an issue.

What are indications that things are getting too complex in SQL? Or maybe I am just taking the wrong approach to designing the queries. Then where can I find tutorials, books, ... to take a better approach?

I hope this is not too vague. I am really a bit lost.

SQL excels at analyzing large sets of relational data.

The place to draw the line is the scale of your analysis.

If you analyze individual records one at a time, do it in your application.

If you analyze large sets of records as a unit, SQL is definitely the best tool for that job.

Row-by-row analysis is not something SQL is designed or optimized for very well. But, if you want to know something about a million-row group of data, do it in the database.

Using Mathematica in MySQL databases

10 votes

I've seen it's possible to make a connection between Mathematica and MySQL databases using Needs["DatabaseLink`"] and conn = OpenSQLConnection[JDBC["MySQL(Connector/J)", "yourserver/yourdatabase"], "Username" -> "yourusername", "Password" -> "yourpassword"] (in case anyone wants to give it a try). Documentation of DatabaseLink here, by the way.

Does anyone have experience using Mathematica in this way, probably to analyze data contained in the database? Are there obvious drawbacks (speed, memory needed, etc.)?

Regards

I recently used databases to speed up a Manipulate[] block.

Without the database, essential data from a 150 MB ASCII file were required in memory for access. As a result, the Manipulate[] block slowed down. It's possible that PackedArray[] would have helped. I didn't investigate this.

With the database, the speed of access of individual datasets is slightly slower than a Select[] block, but memory footprint is down by a factor of nearly 10.

I'd say go for it.

Advanced SQL Select Query

8 votes

week      cookie
1         a
1         b
1         c
1         d
2         a 
2         b
3         a
3         c
3         d

This table represents visits to a website by week. Each cookie represents an individual person, and each entry means that person visited the site in that week. For example, the last entry means 'd' came to the site in week 3.

I want to find out how many (same) people keep coming back in the following week, when given a start week to look at.

For example, if I look at week 1. I will get result like:

1 | 4
2 | 2
3 | 1

Because 4 users came in week 1. Only 2 of them (a, b) came back in week 2. Only 1 of them (a) came in all 3 weeks.

How can I do a select query to find out? The table will be big: there might be 100 weeks, so I want to find the right way to do it.

Thank you so much. I really need get this done.

This query uses variables to track adjacent weeks and work out if they are consecutive:

set @start_week = 2, @week := 0, @conseq := 0, @cookie:='';
select conseq_weeks, count(*)
from (
select 
  cookie,
  if (cookie != @cookie or week != @week + 1, @conseq := 0, @conseq := @conseq + 1) + 1 as conseq_weeks,
  (cookie != @cookie and week <= @start_week) or (cookie = @cookie and week = @week + 1) as conseq,
  @cookie := cookie as lastcookie,
  @week := week as lastweek
from (select week, cookie from webhist where week >= @start_week order by 2, 1) x
) y
where conseq
group by 1;

This is for week 2. For another week, change the start_week variable at the top.

Here's the test:

create table webhist(week int, cookie char);
insert into webhist values (1, 'a'), (1, 'b'), (1, 'c'), (1, 'd'), (2, 'a'), (2, 'b'), (3, 'a'), (3, 'c'), (3, 'd');

Output of above query with where week >= 1:

+--------------+----------+
| conseq_weeks | count(*) |
+--------------+----------+
|            1 |        4 |
|            2 |        2 |
|            3 |        1 |
+--------------+----------+

Output of above query with where week >= 2:

+--------------+----------+
| conseq_weeks | count(*) |
+--------------+----------+
|            1 |        2 |
|            2 |        1 |
+--------------+----------+

p.s. Good question, but a bit of a ball-breaker
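For checking the query's output against a small sample, the same counting can be sketched outside the database. This is plain Python, not part of the answer's SQL, and the function name retention is made up; it counts, for each n, how many cookies appear in every one of the n weeks beginning at the start week.

```python
from collections import defaultdict


def retention(visits, start_week):
    """Return {n: number of cookies seen in each of the n weeks beginning at start_week}."""
    weeks_by_cookie = defaultdict(set)
    for week, cookie in visits:
        weeks_by_cookie[cookie].add(week)

    counts = defaultdict(int)
    for weeks in weeks_by_cookie.values():
        streak = 0
        # Walk forward from the start week until the first missed week.
        while start_week + streak in weeks:
            streak += 1
            counts[streak] += 1
    return dict(counts)


# Same rows as the webhist test table above.
VISITS = [(1, "a"), (1, "b"), (1, "c"), (1, "d"),
          (2, "a"), (2, "b"), (3, "a"), (3, "c"), (3, "d")]
```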

Fastest way to subset - data.table vs. MySQL

7 votes

I'm an R user, and I frequently find that I need to write functions that require subsetting large datasets (10s of millions of rows). When I apply such functions over a large number of observations, it can get very time consuming if I'm not careful about how I implement it.

To do this, I have sometimes used the data.table package, and this provides much faster speeds than subsetting using data frames. Recently, I've started experimenting with packages like RMySQL, pushing some tables to mysql, and using the package to run sql queries and return results.

I have found mixed performance improvements. For smaller datasets (millions), it seems that loading the data into a data.table and setting the right keys makes for faster subsetting. For larger datasets (10s to 100s of millions), it appears that sending a query out to mysql is faster.

Was wondering if anyone has any insight into which technique should return simple subsetting or aggregation queries faster, and whether or not this should depend on the size of the data? I understand that setting keys in data.table is somewhat analogous to creating an index, but I don't have much more intuition beyond that.

If the data fits in RAM, data.table is faster. If you provide an example it will probably become evident, quickly, that you're using data.table badly. Have you read the "do's and don'ts" on the data.table wiki?

SQL has a lower bound because it is a row store. If the data fits in RAM (and 64bit is quite a bit) then data.table is faster not just because it is in RAM but because columns are contiguous in memory (minimising page fetches from RAM to L2 for column operations). Use data.table correctly and it is faster than SQL's lower bound. This is explained in FAQ 3.1. If you're seeing slower with data.table, then chances are very high that you're using data.table incorrectly. So, please post some tests, after reading the data.table wiki.

Getting three random dates from each hour

7 votes

The title doesn't fully describe the problem, which is this: I have a table with dates -

 1. 2011-07-01 13:01:48
 2. 2011-07-01 13:09:36 
 3. 2011-07-01 13:21:24
 4. 2011-07-01 13:35:12
 5. 2011-07-01 13:49:23
 6. 2011-07-01 13:57:47
 7. 2011-07-01 14:05:12
 8. 2011-07-01 14:12:45
 9. 2011-07-01 14:31:48
 10. 2011-07-01 14:47:31

and so on. What I need is to get three random dates of each hour, for example:

 1. 2011-07-01 13:01:48
 2. 2011-07-01 13:21:24
 3. 2011-07-01 13:49:23
 4. 2011-07-01 14:05:12
 5. 2011-07-01 14:12:45
 6. 2011-07-01 14:47:31

How can I do it in mysql?

This should work quite well if you have statistically enough rows per hour:

select *
from (
    select *
    from yourtable
    order by rand()
) t -- MySQL requires an alias for a derived table
group by date(yourdate), hour(yourdate), floor(rand()*3)
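The query above is probabilistic: an hour can come back with fewer than three rows if the random buckets collide. If exactly three per hour matters, one alternative (not the answer's method) is to do the sampling in application code. The sketch below is in Python and the function name sample_per_hour is hypothetical; it takes up to k random timestamps from each date and hour.

```python
import random
from collections import defaultdict
from datetime import datetime


def sample_per_hour(timestamps, k=3, rng=random):
    """Pick up to k random timestamps from each (date, hour) bucket."""
    buckets = defaultdict(list)
    for ts in timestamps:
        buckets[(ts.date(), ts.hour)].append(ts)
    picked = []
    for key in sorted(buckets):
        rows = buckets[key]
        # sample() never repeats an element, so each row is picked at most once.
        picked.extend(sorted(rng.sample(rows, min(k, len(rows)))))
    return picked
```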

Load file from mysql with PHP

7 votes

I store images in my mysql database as blob. I can load these files with a query in a variable, but how can I send back the contents of this variable to the browser as an image?

Can the html file contain something like this <img src="smtg.png"> ? Or how the request can be made?

You need a script that echos out the image data. Use header() to set the appropriate content type, and then echo the data. For example:

header("Content-type: image/png");
...
echo $db_results['imgdata'];

Then, call your script from the HTML page like this:

<img src="yourimagescript.php" />

Ideally, you should be storing file type in a column next to your image data, so you can dynamically set the correct content type.
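As a rough illustration of keeping the type next to the data, here is a sketch in Python with SQLite standing in for PHP/MySQL. The table name images, the column mime_type and the function load_image are assumptions for the example; the idea is simply that the Content-Type header is read from the row instead of being hard-coded.

```python
import sqlite3


def load_image(conn, image_id):
    """Return (headers, body) for a stored image, or None if the id is unknown."""
    row = conn.execute(
        "SELECT mime_type, imgdata FROM images WHERE id = ?", (image_id,)
    ).fetchone()
    if row is None:
        return None
    mime_type, body = row
    headers = [("Content-Type", mime_type), ("Content-Length", str(len(body)))]
    return headers, body


def make_image_db():
    """Throwaway database holding one placeholder 'image' (not a real PNG)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, mime_type TEXT, imgdata BLOB)")
    conn.execute("INSERT INTO images VALUES (1, 'image/png', ?)", (b"\x89PNG fake bytes",))
    return conn
```

The image script would then be requested with the id in the query string, so the same script can serve PNG, JPEG or GIF rows alike.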

Is it better to do a large query and filter with jquery, or do more smaller queries?

7 votes

I'm writing my first website using php/mysql, and jquery. For the next part, the user will apply filters and sorts to find specific items (like an advanced search, kind of) and there will also be a search box. I want the data to reflect the change as soon as any checkbox filter is changed (currently have a test ajax call with no database query to do this that works well).

Would it be better to re-form the query string and re query the data each time a filter changes, or to make one large query and filter the results depending on the filters?

For the time being, the number of records will be low, but it's possible to grow into the thousands.

I would say build the query solely based on the data the user has selected, simply because when the data does become large and unwieldy, it would be silly to send that all to the client. It would be detrimental to performance in two main ways, if you think about it:

  1. Large data download to the client.
  2. JavaScript will need to process (sort/filter) the results and display what the user has asked for.

This is a no-brainer, in my book. Otherwise you will end up having to re-work your solution to scale with the size of your database, which is something you don't want.

The server is good at handling queries and sorting and filtering through truckloads of data. That's not really something you should be doing on the client side if you have a choice.
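Building the query from the selected filters might look like the sketch below. It is written in Python for illustration; the table name items, the column names in FILTERABLE and the function build_item_query are all hypothetical. Column names come from a fixed set and values are passed as bound parameters, so each checkbox change can trigger a fresh, safe query.

```python
# Hypothetical filterable columns; anything else in the request is ignored.
FILTERABLE = {"category", "color", "in_stock"}


def build_item_query(filters, search=None):
    """Return (sql, params) for the selected filters using bound parameters."""
    clauses, params = [], []
    for column in sorted(c for c in filters if c in FILTERABLE):
        clauses.append(f"{column} = ?")
        params.append(filters[column])
    if search:
        clauses.append("name LIKE ?")
        params.append(f"%{search}%")
    sql = "SELECT id, name FROM items"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params
```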

Many database rows vs one comma separated values row

6 votes

I'm creating a table for allowing website users to become friends. I'm trying to determine which is the best table design to store and return a user's friends. The goal is to have fast queries and not use up a lot of db space.

I have two options:

Have individual rows for each friendship.

+----+-------------+-------------------+
| ID | User_ID     | Friend_ID         |
+----+-------------+-------------------+
| 1  | 102         | 213               |
| 2  | 64          | 23                |
| 3  | 4           | 344               |
| 4  | 102         | 2                 |
| 5  | 102         | 90                |
| 6  | 64          | 88                |
+----+-------------+-------------------+

Or store all friends in one row as CSV

    +----+-------------+-------------------+
    | ID | User_ID     | Friend_ID         |
    +----+-------------+-------------------+
    | 1  | 102         | 213,44,34,67,8    |
    | 2  | 64          | 23,33,45,105      |
    +----+-------------+-------------------+

When retrieving friends I can create an array using explode() however deleting a user would be trickier.

Edit: For second method I would separate each id in array in php for functions such as counting and others.

Which method do you think is better?

First method is definitely better. It's what makes relational databases great :)

It will allow you to search for and group by much more specific criteria than the 2nd method.

Say you wanted to write a query so users could see who had them as a friend. The 2nd method would require string matching on the CSV column (for example FIND_IN_SET() or LIKE), which cannot use an index and would be much slower than simply using JOINs.
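A small sketch of the one-row-per-friendship design, using Python with an in-memory SQLite database as a stand-in for MySQL. The table and function names are made up, and the row (4, 102) is added to the question's sample data so the reverse lookup has something to find.

```python
import sqlite3


def make_friend_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE friendships (user_id INTEGER, friend_id INTEGER,"
                 " PRIMARY KEY (user_id, friend_id))")
    # Second index so lookups by friend_id do not scan the whole table.
    conn.execute("CREATE INDEX idx_friend ON friendships (friend_id)")
    conn.executemany("INSERT INTO friendships VALUES (?, ?)",
                     [(102, 213), (64, 23), (4, 344), (102, 2),
                      (102, 90), (64, 88), (4, 102)])
    return conn


def friends_of(conn, user_id):
    rows = conn.execute("SELECT friend_id FROM friendships"
                        " WHERE user_id = ? ORDER BY friend_id", (user_id,))
    return [r[0] for r in rows]


def befriended_by(conn, user_id):
    """Who lists user_id as a friend (the lookup that is awkward with a CSV column)."""
    rows = conn.execute("SELECT user_id FROM friendships"
                        " WHERE friend_id = ? ORDER BY user_id", (user_id,))
    return [r[0] for r in rows]
```

Removing a friendship is then a single DELETE on one row, instead of rewriting a CSV string.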

MySQL Ordering AA before A

6 votes

I'm doing an e-commerce website for a client who sells lingerie, I've written up a bra size picker for them but they've come back to me today with a slight issue.

With bra sizes, AA is smaller than A, so it should appear before that in the chart, but when I use mySQL order by on the size, obviously it puts A first, then AA, then B etc

Is there a simple way I can get mySQL to order AA first, then A, B etc?

Thanks

Assuming A is the only possible repeating letter, you can do this:

SELECT  *
FROM    bra
ORDER BY
        LENGTH(size) DESC, size

But a better solution would be to create a conversion table which would store all possible sizes (European, Japanese etc.), including a metric value on which you can order.

You may use it to build conversion charts and show the sizes in person's preferred system as well.
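A minimal version of such a lookup table, sketched in Python with SQLite in place of MySQL. The table cup_sizes, its sort_order column and the sample sizes are assumptions for illustration. Ordering by an explicit rank also copes with sizes such as DD, which the LENGTH() trick would put before A.

```python
import sqlite3


def sizes_in_order():
    """Return the sizes of some sample products, ordered by an explicit rank column."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cup_sizes (size TEXT PRIMARY KEY, sort_order INTEGER)")
    conn.executemany("INSERT INTO cup_sizes VALUES (?, ?)",
                     [("AA", 1), ("A", 2), ("B", 3), ("C", 4), ("D", 5), ("DD", 6)])
    conn.execute("CREATE TABLE bra (id INTEGER PRIMARY KEY, size TEXT)")
    conn.executemany("INSERT INTO bra (size) VALUES (?)",
                     [("B",), ("A",), ("DD",), ("AA",), ("D",)])
    rows = conn.execute(
        "SELECT bra.size FROM bra JOIN cup_sizes ON cup_sizes.size = bra.size "
        "ORDER BY cup_sizes.sort_order")
    return [r[0] for r in rows]
```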

How to structure an extremely large table

6 votes

This is more of a conceptual question. It's inspired by working with some extremely large tables where even a simple query takes a long time (though properly indexed). I was wondering if there is a better structure than just letting the table grow continually.

By large I mean 10,000,000+ records, growing by something like 10,000 per day. A table like that would gain 10,000,000 additional records every 2.7 years. Let's say the more recent records are accessed the most, but the older ones need to remain available. I have two conceptual ideas to speed it up.

1) Maintain a master table that holds all the data, indexed by date in reverse order. Create a separate view for each year that holds only the data for that year. Then when querying, and let's say the query is expected to pull only a few records from a three year span, I could use a union to combine the three views and select from those.

2) The other option would be to create a separate table for every year. Then, again using a union to combine them when querying.

Does anyone else have any other ideas or concepts? I know this is a problem Facebook has faced, so how do you think they handled it? I doubt they have a single table (status_updates) that contains 100,000,000,000 records.

The main RDBMS providers all have similar concepts in terms of partitioned tables and partitioned views (as well as combinations of the two)

There is one immediate benefit, in that the data is now split across multiple conceptual tables, so any query that includes the partition key within the query can automatically ignore any partition that the key would not be in.

From an RDBMS management perspective, having the data divided into separate partitions allows operations such as backup, restore and indexing to be performed at the partition level. This helps reduce downtime and allows far faster archiving, by removing an entire partition at a time.

There are also non relational storage mechanisms such as nosql, map reduce etc, but ultimately how it is used, loaded and data is archived become a driving factor in the decision of the structure to use.

10 million rows is not that large on the scale of large systems; partitioned systems can and will hold billions of rows.

Changing a SQL column title via query

6 votes

I have the following query:

SELECT product_description.name, product.quantity,product.price,product_option_value_description.name,product_option_value.quantity
FROM product
INNER JOIN product_description
ON product.product_id=product_description.product_id
INNER JOIN product_option_value_description
ON product.product_id=product_option_value_description.product_id
INNER JOIN product_option_value
ON product.product_id=product_option_value.product_id
ORDER BY product_description.name 

How could I change the title for product_option_value_description.name? I would like to name this column option.

Use an alias like so:

product_option_value_description.name AS `Option`

If you want to change the column's name in general, not only for this query, use ALTER TABLE:

ALTER TABLE product_option_value_description CHANGE name newname DATATYPE;

Should I rate-limit or reduce my database queries?

5 votes

I'm creating a PHP script that imports some data from text files into a MySQL database. These text files are pretty large, an average file will have 10,000 lines in it each of which corresponds to a new item I want in my database. (I won't be importing files very often)

I'm worried that reading a line from the file, and then doing a INSERT query, 10,000 times in a row might cause some issues. Is there a better way for me to do this? Should I perform one INSERT query with all 10,000 values? Or would that be just as bad?

Maybe I can reach a medium, and perform something like 10 or 100 entries at once. Really my problem is that I don't know what is good practice. Maybe 10,000 queries in a row is fine and I'm just worrying for nothing.

Any suggestions?

Yes, there is a better way: build one INSERT with all the values.

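<?php
// Read the file into an array, dropping trailing newlines and blank lines.
$lines = file('file.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$values = array();
foreach ($lines as $line) {
    // Escape each value before embedding it in the SQL string
    // (mysql_real_escape_string needs an open mysql_* connection).
    $values[] = "('" . mysql_real_escape_string($line) . "')";
}
// `table` is a placeholder name; it is backquoted because TABLE is a reserved word.
$query = "INSERT INTO `table` VALUES " . implode(',', $values);
echo $query;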

http://sandbox.phpcode.eu/g/5ade4.php

This will make one single query, which is many times faster than the one-line-one-query style!
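The middle ground the question mentions (10 or 100 entries at once) can be sketched like this. The example is in Python with SQLite rather than PHP/MySQL, and the table name items, the function names and the default batch size are assumptions. Each batch becomes one multi-row INSERT with bound parameters, which also keeps any single statement from growing past the server's packet or statement size limits.

```python
import sqlite3


def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def bulk_insert(conn, lines, batch_size=100):
    """Insert lines in batches: one multi-row INSERT and one commit per batch."""
    for batch in chunks(lines, batch_size):
        placeholders = ",".join(["(?)"] * len(batch))
        with conn:  # commits after each batch
            conn.execute(f"INSERT INTO items (line) VALUES {placeholders}", batch)
```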

SQL and Case Insensitive Conditions

5 votes

I have noticed in MySQL a where clause like WHERE x = 'Test' will match regardless of case ('TEST', 'test', etc.)

When using PDO, is it fair to assume that this is the case with most database servers? For example, If I were to use MSSQL or Oracle, would this be the same case?

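It's not the server that it depends on, but the collation. MySQL and SQL Server typically default to a case-insensitive collation, while Oracle and PostgreSQL compare strings case-sensitively by default, so don't assume either behaviour across servers. If you run into a collation that doesn't behave the way you need, it can be changed.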
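The effect of collation can be seen with the same query on the same data. The sketch below uses Python with SQLite purely because it is easy to run; SQLite's default BINARY collation is case-sensitive and its NOCASE collation is not. The collation names are SQLite's, not MySQL's, and the function name matches is made up.

```python
import sqlite3


def matches(collation_clause):
    """Count rows equal to 'Test' when column x is declared with the given collation."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE t (x TEXT {collation_clause})")
    conn.executemany("INSERT INTO t VALUES (?)", [("Test",), ("TEST",), ("test",)])
    return conn.execute("SELECT COUNT(*) FROM t WHERE x = 'Test'").fetchone()[0]
```

Calling matches("") counts only the exact-case row, while matches("COLLATE NOCASE") counts all three.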

Is it necessary to have an index on every combination of queryable fields in a SQL table to optimize performance?

5 votes

If my User table has several fields that are queryable (say DepartmentId, GroupId, RoleId) will it make any speed difference if I create an index for each combination of those fields?

By "queryable", I'm referring to a query screen where the end user can select records based on Department, Group or Role by selecting from a drop-down.

At the moment, I have an index on each of DepartmentId, GroupId and RoleId. That's a single non-unique index per field.

If an end user selects "anyone in Group B", the SQL looks like:

select * from User where GroupId = 2

Having an index on GroupId should speed that up.

But if the end user selects "anyone in Group B and in Role C", the SQL would look like this:

select * from User where GroupId = 2 and RoleId = 3

Having indexes on GroupId and RoleId individually may not make any difference, right?

A better index for that search would be if I had one index spanning both GroupId and RoleId.

But if that's the case, then that would mean that I would need to have an index for every combination of queryable fields. So I would need all these indexes:

  • DepartmentId
  • GroupId
  • RoleId
  • DepartmentId and GroupId
  • DepartmentId and RoleId
  • GroupId and RoleId
  • DepartmentId, GroupId and RoleId

Can anyone shed some light on this? I'm using MySQL if that makes a difference.

A multi-column index can be used for any left prefix of that index. So, an index on (A, B, C) can be used for queries on (A), (A, B) and (A, B, C), but it cannot, for example, be used for queries on (B) or (B, C).

If the columns are all indexed individually, MySQL (5.0 or later) may also use Index Merge Optimization.
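The left-prefix rule can be observed by asking the database for its query plan. The sketch below uses Python with SQLite as a convenient stand-in: SQLite's planner is not MySQL's (in MySQL you would use EXPLAIN), but it follows the same left-prefix rule for this simple case. The table, index and function names are assumptions for the example.

```python
import sqlite3


def query_plan(where_clause):
    """Return SQLite's plan text for a query on a table with one three-column index."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT,"
                 " department_id INTEGER, group_id INTEGER, role_id INTEGER)")
    conn.execute("CREATE INDEX idx_dgr ON users (department_id, group_id, role_id)")
    rows = conn.execute(f"EXPLAIN QUERY PLAN SELECT * FROM users WHERE {where_clause}")
    # The fourth column of each plan row holds the human-readable description.
    return " ".join(row[3] for row in rows)
```

Conditions that start with department_id mention idx_dgr in the plan; a condition on group_id alone, or on group_id and role_id, falls back to scanning the table.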

SELECT multiple rows with WHERE

4 votes

PID       VALUE
3        1
4        3
1        9
1        3

How to select row(s) that has both values 3 and 9? I tried

select PID from table where VALUE = 3 and VALUE = 9

So that I get something like below; instead I get an empty set.

PID      
1       

PID 4 should not be included in the result because it does not have VALUE 9.

The WHERE clause can only evaluate conditions against one row from a given table at a time. You can't make a condition span multiple rows.

But you can use a self-join to match multiple rows from the same table into one row of the result set, so you can apply a condition that involves both.

SELECT t1.pid
FROM table t1 JOIN table t2 ON t1.pid=t2.pid
WHERE t1.value = 3 AND t2.value = 9;

An alternative solution is to use GROUP BY and count the distinct values:

SELECT t.pid
FROM table t
WHERE t.value IN (3,9)
GROUP BY t.pid
HAVING COUNT(DISTINCT t.value) = 2;
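Both approaches can be tried against the question's sample rows. The sketch below runs them in Python on an in-memory SQLite database; the table is called vals here (an assumed name) because table is a reserved word.

```python
import sqlite3

SELF_JOIN = """
SELECT t1.pid FROM vals t1 JOIN vals t2 ON t1.pid = t2.pid
WHERE t1.value = 3 AND t2.value = 9
"""
GROUP_BY = """
SELECT pid FROM vals WHERE value IN (3, 9)
GROUP BY pid HAVING COUNT(DISTINCT value) = 2
"""


def run(sql):
    """Run a query against a fresh copy of the question's sample data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE vals (pid INTEGER, value INTEGER)")
    conn.executemany("INSERT INTO vals VALUES (?, ?)", [(3, 1), (4, 3), (1, 9), (1, 3)])
    return [row[0] for row in conn.execute(sql)]
```

The original single-table condition returns nothing because no single row can have value 3 and value 9 at once.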

mysql ignore any distinct values

3 votes

I am trying to run a SQL query which will not show distinct/duplicate values.

For example, using the DISTINCT option would display only one unique result, but I would like to skip all detected distinct values, i.e. don't display distinct values.

Is it possible?

    select col1  d from tb_col  where col1 = '123';

col1
------
123
123


(2 rows)


select distinct col1  d from tb_col  where col1 = '123';

col1
------
123
(1 row)

Not showing duplicates at all:

SELECT col1 AS d
FROM tb_col
GROUP BY col1
HAVING COUNT(*) = 1            -- or perhaps HAVING COUNT(*) > 1
                               -- it's not clear what you want.

skip characters in like clause of mysql

3 votes

Hello, I want to skip some characters while matching a string with LIKE in MySQL. For example, I have a string like this: 2011-07-12 06:09. I want to match only the month part of the timestamp, not the whole thing. I know I can use % for the whole string. I want to skip characters before and after that part. Would anybody tell me how to accomplish this?

Use the MONTH() function for grabbing the month from a date. Don't re-invent the wheel.

If you're really interested in matching a string using LIKE, you'd be better off using the underscore (_) as the wildcard, one underscore per skipped character:

select *
from some_table
where some_column like '____-07-__ __:__'
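Both ways of matching the month can be compared on a few sample rows. The sketch below is in Python with SQLite: the LIKE pattern behaves the same way as in MySQL (each underscore matches exactly one character, so the four-digit year needs four of them), while strftime('%m', ...) is SQLite's counterpart to MySQL's MONTH() and is used here only so the example runs. The sample rows and function name are made up.

```python
import sqlite3


def july_rows():
    """Return the July rows found by a LIKE pattern and by extracting the month."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE some_table (some_column TEXT)")
    conn.executemany("INSERT INTO some_table VALUES (?)",
                     [("2011-07-12 06:09",), ("2011-08-07 07:07",), ("2010-07-01 23:59",)])
    by_like = conn.execute(
        "SELECT some_column FROM some_table "
        "WHERE some_column LIKE '____-07-__ __:__' ORDER BY some_column").fetchall()
    by_month = conn.execute(
        "SELECT some_column FROM some_table "
        "WHERE strftime('%m', some_column) = '07' ORDER BY some_column").fetchall()
    return [r[0] for r in by_like], [r[0] for r in by_month]
```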

Understanding / mySQL aka tricking ForeignKey relationships in Django

3 votes

So I've inherited some django.

The mySQL table is simple enough where parent is NOT a FK relationship just the "Parent" id:

CREATE TABLE `Child` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `parent` int(10) unsigned NOT NULL,
  `name` varchar(255) NOT NULL,
  UNIQUE KEY `id` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=24;

But then the originator did this..

class Child(models.Model):
    """Project Child information"""
    id = models.AutoField(primary_key=True)
    parent = models.ForeignKey(Parent)
    name = models.CharField(max_length=255)

    class Meta:
        managed = False

Admittedly I am NOT a SQL jockey, but I know that a "real" foreign key relationship looks similar to this (notice the CONSTRAINT):

CREATE TABLE `Child` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `parent_id` int(11) NOT NULL,
  `name` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  PRIMARY KEY (`id`),
  KEY `child_63f17a16` (`parent_id`),
  CONSTRAINT `parent_id_refs_id_34923e1e` FOREIGN KEY (`parent_id`) REFERENCES `Parent` (`id`)
) ENGINE=InnoDB;

What I want to know is the following:

  1. What problems could I expect to see from this "trickery"?
  2. While this appears to work, is it recommended or advised?
  3. Would we be advised to modify the SQL to add in the constraint?

Thanks so much!

  1. Not having an actual constraint might lead to broken references, invalid parents and other sorts of data inconsistencies. I am not a Django expert but I would venture a guess that in most cases Django will still handle the relations fine unless you purposefully add some invalid records.

  2. Normally, if your RDBMS supports foreign key constraints, there is absolutely no reason not to use them, and it could potentially be considered a design flaw to ignore them.

  3. You should consider adding the key constraints. Not only do they give your DBMS a good idea of how to optimize the queries, they also ensure consistency in your data. I am pretty sure Django has a setting somewhere that will automatically generate the SQL to add the key constraints when you run manage.py syncdb
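What the constraint buys you can be shown with a child row that points at a parent which does not exist. The sketch below uses Python with SQLite as a stand-in for MySQL/InnoDB (SQLite enforces foreign keys only when its foreign_keys pragma is on, which conveniently lets one function show both behaviours). The table layout and the function name are simplified assumptions.

```python
import sqlite3


def try_orphan_insert(enforce):
    """Insert a child whose parent does not exist; return True if the database accepted it."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is on.
    conn.execute(f"PRAGMA foreign_keys = {'ON' if enforce else 'OFF'}")
    conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE child (id INTEGER PRIMARY KEY, name TEXT,"
                 " parent_id INTEGER NOT NULL REFERENCES parent (id))")
    try:
        conn.execute("INSERT INTO child (parent_id, name) VALUES (999, 'orphan')")
        return True
    except sqlite3.IntegrityError:
        return False
```

Without enforcement the orphan row is stored silently; with it, the insert is rejected before the inconsistency can reach the application.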

For more information about why you should prefer foreign keys, you should read the MySQL Foreign Key Documentation

Most interestingly:

InnoDB requires indexes on foreign keys and referenced keys so that foreign key checks can be fast and not require a table scan. In the referencing table, there must be an index where the foreign key columns are listed as the first columns in the same order. Such an index is created on the referencing table automatically if it does not exist. (This is in contrast to some older versions, in which indexes had to be created explicitly or the creation of foreign key constraints would fail.) index_name, if given, is used as described previously.