Best mysql questions in February 2012

Improve password hashing with a random salt

14 votes

I'm starting a website and I'm trying to decide how to encrypt user passwords to store them in a SQL database.

I realize that using a simple md5(password) is very insecure. I'm considering using sha512(password.salt), and I have been researching the best way to generate a useful salt. I read numerous articles stating that a salt should be as random as possible to add entropy to the hash, and it looks like a great idea. But:

  • you need to store the random salt along with your hash
  • given that an attacker somehow got access to your hashed passwords (and is trying to reverse the hash to plain text), it means he probably dumped your database, then got access to your random salts also

Isn't it obvious that the weird looking value next to the hash in the database is a salt? If the attacker can access the salt along with the hash value, how is that more secure?

Does anyone have any expertise in that area? Thanks!

An attacker is "allowed" to know the salt - your security must be designed in a way that even with the knowledge of the salt it is still secure.

What does the salt do?

Salt defends against attacks that use pre-computed tables ("rainbow tables").
Calculating such a table is expensive and is usually only worthwhile when it can be reused against more than one password.
With a unique salt per password, the attacker has to brute-force every hash separately, which makes the attack much more expensive (in time/memory terms).
If you use the same salt for all passwords, an attacker could pre-compute one table for that salt and use it to recover many of your passwords in cleartext.
As long as you generate a new (ideally cryptographically strong) random salt for every password hash you store, there is no problem.

If you want to strengthen the security further
You could calculate the hash several times over (hash the hash, etc.). This doesn't cost you much, but it makes brute-force attacks and calculating "rainbow tables" even more expensive. Please don't invent your own scheme, though: there are proven standard methods for this, see for example http://en.wikipedia.org/wiki/PBKDF2 and http://www.itnewb.com/tutorial/Encrypting-Passwords-with-PHP-for-Storage-Using-the-RSA-PBKDF2-Standard

NOTE:

Using such a mechanism is mandatory these days, since CPU time (usable for attacks like rainbow tables, brute force, etc.) is becoming more and more widely available. See for example the fact that Amazon's cloud service is among the top 50 fastest supercomputers worldwide and can be used by anyone for a comparatively small amount!
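
To make the per-password salt and the repeated hashing concrete, here is a minimal sketch in Python using the standard library's PBKDF2 implementation. The iteration count, salt length and function names are assumptions chosen for illustration, not recommendations from the answer above; the same idea applies to PHP.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor; tune it for your own hardware
SALT_BYTES = 16       # assumed salt length


def hash_password(password):
    """Return (salt, derived_key); store both next to the user record."""
    salt = os.urandom(SALT_BYTES)  # fresh random salt for every password
    key = hashlib.pbkdf2_hmac("sha512", password.encode("utf-8"), salt, ITERATIONS)
    return salt, key


def verify_password(password, salt, expected_key):
    """Re-derive the key with the stored salt and compare in constant time."""
    key = hashlib.pbkdf2_hmac("sha512", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(key, expected_key)
```

The salt is stored in plain sight, exactly as the answer says: its job is only to make each hash unique, while the iteration count is what makes every guess expensive.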

apache .htaccess rule with dynamic pages (php) performance

9 votes

I have a new website (I'm building it right now) and I want to make sure I do it correctly rather than redesigning it after a month.

so i have pages like:

/candy
/candy/chocolate
/drink
/drink/beer

So I looked on Stack Overflow for how to do this, and I found:

RewriteRule ^([a-z]+)/([a-z]+)/?$ index.php?category=$1&page=$2 [NC,L]

Now, this will work, but my question is about execution. Obviously $_GET['category'] will be the category name (candy, for example) and $_GET['page'] will be chocolate.

now when i do my query i will do:

$sql = "SELECT myfields FROM mytable WHERE name = '" . $_GET['category'] . "'";

Now, would it be better if I used the primary key, which is an INT? If so, what can I do in my .htaccess to do this?

Unfortunately, taking a name and converting it to a unique id first might take longer to execute than simply creating an index on the name itself.

Here's what I recommend based on the urls you have:

Add an index on the name of the page:

ALTER TABLE `mytable` ADD key indexname (columnname);

example:

ALTER TABLE `page` ADD key pagename (name);

Now, because the structure is different (/candy vs /candy/chocolate), I assume you have some sort of structure like a main page with the list (/candy) and a specific page (/candy/chocolate), so in this case you can use this:

RewriteRule ^([a-z]+)/?$ index.php?category=$1&page=list [NC,L]
RewriteRule ^([a-z]+)/([a-z]+)/?$ index.php?category=$1&page=$2 [NC,L]

Then you can simply query the category and the page using the indexed field. This will be fast (of course not as fast as an INT, but still fast).

For the first query you can do:

$name = mysql_real_escape_string($name);
$sql = "SELECT fields FROM category WHERE category = '$name'";

and when you get a page you can use:

$categoryname = mysql_real_escape_string($categoryname);
$pagename = mysql_real_escape_string($pagename);
$sql = "SELECT fields FROM page LEFT JOIN category ON (page.categoryid = category.id) WHERE page = '$pagename' AND category = '$categoryname'";

This way, by matching on both category and page, a URL only resolves when the page really belongs to that category; anything else can be answered with a page not found (404).
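
As a sketch of the same lookup with bound parameters instead of escaped strings, here is a small Python example. It uses SQLite in memory as a stand-in for MySQL, and the table layout (category.name, page.name, page.categoryid) is an assumption based on the examples above, not the asker's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE page (id INTEGER PRIMARY KEY, categoryid INTEGER NOT NULL, name TEXT NOT NULL);
    -- indexes on the name columns, as recommended above
    CREATE INDEX categoryname ON category (name);
    CREATE INDEX pagename ON page (name);
    INSERT INTO category VALUES (1, 'candy'), (2, 'drink');
    INSERT INTO page VALUES (1, 1, 'chocolate'), (2, 2, 'beer');
""")


def find_page(category_name, page_name):
    """Return the page id for /category/page, or None (the caller would send a 404)."""
    row = conn.execute(
        "SELECT page.id FROM page "
        "JOIN category ON page.categoryid = category.id "
        "WHERE page.name = ? AND category.name = ?",
        (page_name, category_name),
    ).fetchone()
    return row[0] if row else None
```

Because the values travel separately from the SQL text, a URL segment such as `candy' OR '1'='1` is just an unknown category name and no longer part of the query.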

user manual (seo + auto generate)

7 votes

I have a website that lets people read the manual online. I have a new version of the manual where everything is generated dynamically using PHP and MySQL, so no more static HTML.

The manual is now available in French and English. I learned that, for search engine purposes, it would be good to have URLs that are friendly to Google, Yahoo, etc.

now here's my problem:

i want to show the url like this:

  • /manual/ goes to the main manual page, where the user chooses a language
  • /manual/fr/ displays the list of all available chapters
  • /manual/fr/1.0 displays chapter 1

I figure I need to use something like:

rewriterule /manual/(.*)/(.*)/(.*)/ index.php?lang=$1& ...

Can I do this using one rule, or do I need multiple?

thanks so much

Here's what you have to do:

RewriteEngine On
RewriteRule ^manual/?$ index.php?action=selectLang [L,NC,QSA]
RewriteRule ^manual/(fr|en)/?$ index.php?action=listChapter&lang=$1 [L,NC,QSA]
RewriteRule ^manual/(fr|en)/([0-9\.]+)(/[^/]+)?/?$ index.php?action=listChapter&lang=$1&chapter=$2 [L,NC,QSA]

The first one loads the page where the user chooses a language. In your PHP you can check $_GET['action'] to load the "select the language" page. The second one is the same idea, but for the chapter list. For the third one, you can use either /manual/fr/1.0 or /manual/fr/1.0/chapter-title. Adding the title will work better for SEO.

The [] at the end of each RewriteRule holds the flags; learn more here: http://httpd.apache.org/docs/2.3/rewrite/flags.html
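
To check which URL ends up with which parameters, the three patterns can be tried outside Apache. This is only a sketch in Python: the `route` helper is hypothetical, `re.IGNORECASE` stands in for the [NC] flag, and paths are written without the leading slash, as mod_rewrite sees them in a .htaccess file.

```python
import re

# (pattern, action) pairs mirroring the three RewriteRules above
ROUTES = [
    (re.compile(r"^manual/?$", re.IGNORECASE), "selectLang"),
    (re.compile(r"^manual/(fr|en)/?$", re.IGNORECASE), "listChapter"),
    (re.compile(r"^manual/(fr|en)/([0-9.]+)(/[^/]+)?/?$", re.IGNORECASE), "listChapter"),
]


def route(path):
    """Return the query parameters index.php would receive, or None if no rule matches."""
    for pattern, action in ROUTES:
        match = pattern.match(path)
        if match is None:
            continue
        params = {"action": action}
        groups = match.groups()
        if len(groups) >= 1:
            params["lang"] = groups[0]      # $1
        if len(groups) >= 2:
            params["chapter"] = groups[1]   # $2; the optional title ($3) is ignored
        return params
    return None
```

For example, `route("manual/fr/1.0/chapter-title")` yields the same lang and chapter as `route("manual/fr/1.0")`, which is why the title can be added purely for SEO.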

Select a portion from a MySQL Blob Field

7 votes

I have a table containing lots of data, and one of the columns is a blob. I sometimes need to look into this blob for data using PHP.

I do:

select `desc` from table where `desc` like '%Nam rhoncus%';

This returns the entire blob, but I don't need all of it.

So if my description is like this:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. In a tempor velit. Integer sit amet ligula nibh, eu rutrum ante. Mauris cursus, neque eu ultrices pulvinar, purus purus fermentum libero, in eleifend tortor orci quis lectus. Cras luctus nunc ac tortor laoreet eu iaculis libero consectetur. Maecenas iaculis facilisis libero sodales auctor. Donec gravida interdum vehicula. Suspendisse vitae massa eget arcu condimentum mattis. Fusce ut ligula ante, nec placerat felis. Maecenas vel nunc nibh, ut luctus urna. Nunc eu lectus a orci iaculis volutpat eget a lorem. Nunc lobortis porttitor tempor. Nulla ipsum neque, volutpat in viverra sit amet, pharetra non tortor. Phasellus at leo pellentesque nunc ultrices euismod.

Nulla ullamcorper scelerisque leo, eu consequat risus fringilla id. Nulla facilisi. Cras sit amet sem a diam molestie dignissim. Duis interdum, sapien quis laoreet bibendum, dui turpis imperdiet magna, id auctor metus velit sollicitudin dolor. Integer blandit, turpis eget interdum commodo, ante nisl laoreet dui, ac congue purus dui quis nisl. Etiam blandit eleifend tortor at egestas. Vestibulum euismod orci ac nibh consectetur feugiat. Praesent ac libero quam. Morbi elit nulla, gravida ac blandit eu, bibendum vitae lacus. In facilisis pellentesque ipsum aliquam auctor. Nam rhoncus, purus eget fringilla ullamcorper, mauris tellus fermentum lectus, ut tempus tellus arcu vel dolor. Suspendisse eros augue, tincidunt sit amet luctus et, auctor id turpis. Praesent consequat velit ut arcu convallis sodales. Proin pulvinar varius erat, id consequat orci varius sed.

And I am looking for: Nam rhoncus. I want to get just the first occurrence, like:

...m auctor. Nam rhoncus, purus eg...

This query:

SELECT 
  CONCAT(
    '...', 
    SUBSTR(`description`, 
      LOCATE('Nam rhoncus', `description`) - 10, 
      (LENGTH('Nam rhoncus') + 20)), 
    '...') AS `description`
FROM table 
WHERE `description` LIKE '%Nam rhoncus%';

(I broke it down like this so it's easier to read)

this will output:

...m auctor. Nam rhoncus, purus eg...

So in your PHP you can do:

<?php
define('CHAR_LEFT', 10);
define('CHAR_RIGHT', 10);
// db stuff
$search = mysql_real_escape_string($search_var);
$query = "SELECT CONCAT('...', SUBSTR(`description`, LOCATE('" . $search . "', `description`) - " . CHAR_LEFT . ", (LENGTH('" . $search . "') + " . (CHAR_LEFT + CHAR_RIGHT) . ")), '...') AS `description` FROM table WHERE `description` LIKE '%" . $search . "%';";
// then your request

NOTE: Be careful when using MySQL reserved words (such as desc) as column names; this is why I use description instead.
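
The same windowing logic can be sketched in plain Python to see what the SUBSTR/LOCATE arithmetic does. The `snippet` helper is hypothetical. Unlike the SQL above, it clamps the window to the text boundaries: as far as I know, MySQL's SUBSTR treats a start position of 0 as an empty result and a negative one as counting from the end, so a match within the first 10 characters may need similar care in the query.

```python
def snippet(text, needle, left=10, right=10):
    """Return the first occurrence of needle with some context, or None if absent."""
    pos = text.find(needle)
    if pos == -1:
        return None
    start = max(0, pos - left)                        # do not run past the beginning
    end = min(len(text), pos + len(needle) + right)   # do not run past the end
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix
```

With the sample text from the question, `snippet(description, "Nam rhoncus")` gives `...m auctor. Nam rhoncus, purus eg...`.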

Join two queries into one

7 votes

Is there a way to join these two queries into one?

query = "select foo from TABLE where foo like '%foo%'";

if (query.empty())
    query = "select bar from TABLE where bar like '%foo%'"

Update:

select ifnull(foo,bar) from TABLE where foo like '%foo%' or bar like '%foo%';

Thanks to Kamal for the idea

For Oracle

Select NVL(foo,bar) from TABLE where foo like '%foo%' or bar like '%foo%';

How to predict system resource need?

6 votes

Is there any software or way to predict system resource need by taking several inputs such as MySQL query read/write ratio, queries executed in a second, properties of tables and databases, maybe some more inputs related with PHP configurations etc. ?

Yes. I know there is probably no perfect solution a software can provide me but I'm not looking for a perfect solution.

A few tools that may come in handy: AWS CloudWatch, but it only works if you are on AWS (EC2). Set up an alarm, say for CPU utilization, so that the moment you exceed, say, 75%, an alarm email or SMS is sent to you and you can investigate.

AWS Elastic Load Balancer will help you distribute the traffic further, taking unhealthy instances out of rotation and routing the traffic to healthier servers.

There are also dedicated monitoring tools (the names escape me right now), but if you prefer to monitor things yourself, a control panel like Webmin will give you the details of CPU usage, RAM usage, storage, etc.

And lastly, just a concept that would take a lot of work to complete, following what algolicious stated above: create a mathematical model that runs from cron, so that observations are recorded every 5 minutes and action is taken when required.

Another way to improve the SQL Query to avoid union?

6 votes

Users can search by postcode (e.g. L14, L15, L16) or by location from a textbox.

If the user types in "Liverpool", it will find all the shops that are located in "Liverpool". If the user types in a postcode (e.g. L15), it will find all the shops that deliver to the L15 postcode zone.

See the Tables below:

mysql> select * from shops;
+----+----------+-----------+----------+
| id | name     | location  | postcode |
+----+----------+-----------+----------+
|  1 | Shop One | Liverpool | L10      |
|  2 | Shop Two | Liverpool | L16      |
+----+----------+-----------+----------+

-

mysql> select * from shops_delivery_area;
+------------------+---------+----------+---------------+
| delivery_area_id | shop_id | postcode | delivery_cost |
+------------------+---------+----------+---------------+
|                1 |       1 | L10      |          1.50 |
|                2 |       1 | L11      |          0.00 |
|                3 |       1 | L12      |          1.00 |
|                4 |       1 | L13      |          1.00 |
|                5 |       2 | L10      |          2.50 |
|                6 |       2 | L16      |          0.00 |
|                7 |       2 | L28      |          0.00 |
+------------------+---------+----------+---------------+

SQL Query:

SELECT U.* FROM 
   ((SELECT DISTINCT shops.*, DA.delivery_cost, DA.postcode AS AreaPostcode FROM shops
             JOIN shops_delivery_area as DA on (DA.shop_id = shops.id)
   WHERE DA.postcode = "Liverpool")
  UNION
   (SELECT DISTINCT shops.*, DA.delivery_cost, DA.postcode AS AreaPostcode FROM shops
             JOIN shops_delivery_area as DA on  
                              (DA.shop_id = shops.id AND
                              DA.postcode = shops.postcode)
   WHERE shops.location = "Liverpool")) as U

-

Result - by Location (Liverpool):

+----+----------+-----------+----------+---------------+--------------+
| id | name     | location  | postcode | delivery_cost | AreaPostcode |
+----+----------+-----------+----------+---------------+--------------+
|  1 | Shop One | Liverpool | L10      |          1.50 | L10          |
|  2 | Shop Two | Liverpool | L16      |          0.00 | L16          |
+----+----------+-----------+----------+---------------+--------------+

Result - by Postcode (L12):

+----+----------+-----------+----------+---------------+--------------+
| id | name     | location  | postcode | delivery_cost | AreaPostcode |
+----+----------+-----------+----------+---------------+--------------+
|  1 | Shop One | Liverpool | L10      |          1.00 | L12          |
+----+----------+-----------+----------+---------------+--------------+

It appears to be working correctly... Is there another way to shorten the SQL query, to avoid the union or something?

Whatever you choose, be aware that short code is not always optimal code. In many cases where you have sufficiently divergent logic, unioning the results really is the best (and sometimes the cleanest, programmatically) option.

That said, the following OR in the WHERE clause seems to cover both your cases...

SELECT DISTINCT
  shops.*,
  DA.delivery_cost,
  DA.postcode AS AreaPostcode
FROM
  shops
INNER JOIN
  shops_delivery_area as DA
    ON (DA.shop_id = shops.id)
WHERE
  (DA.postcode = "Liverpool")
OR
  (DA.postcode = shops.postcode AND shops.location = "Liverpool")
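
To check the OR version against the sample data from the question, here is a small sketch in Python with SQLite in memory standing in for MySQL. The `search` helper and the trimmed column list are assumptions made for the example; the search term is bound as a parameter instead of being written into the SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shops (id INTEGER PRIMARY KEY, name TEXT, location TEXT, postcode TEXT);
    CREATE TABLE shops_delivery_area (
        delivery_area_id INTEGER PRIMARY KEY, shop_id INTEGER, postcode TEXT, delivery_cost REAL);
    INSERT INTO shops VALUES (1, 'Shop One', 'Liverpool', 'L10'), (2, 'Shop Two', 'Liverpool', 'L16');
    INSERT INTO shops_delivery_area VALUES
        (1, 1, 'L10', 1.50), (2, 1, 'L11', 0.00), (3, 1, 'L12', 1.00), (4, 1, 'L13', 1.00),
        (5, 2, 'L10', 2.50), (6, 2, 'L16', 0.00), (7, 2, 'L28', 0.00);
""")

SEARCH_SQL = """
    SELECT DISTINCT shops.id, shops.name, DA.delivery_cost, DA.postcode AS AreaPostcode
    FROM shops
    INNER JOIN shops_delivery_area AS DA ON DA.shop_id = shops.id
    WHERE DA.postcode = :term
       OR (DA.postcode = shops.postcode AND shops.location = :term)
    ORDER BY shops.id
"""


def search(term):
    """Search by postcode or by location with the single OR query."""
    return conn.execute(SEARCH_SQL, {"term": term}).fetchall()
```

Searching for "Liverpool" returns one row per shop with the cost for its own postcode, and searching for "L12" returns only Shop One with the L12 delivery cost, matching the two result tables in the question.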

Too many fields in MySQL?

6 votes

I developed a stats site for a game as a learning project a few years back. It's still used today and I'd like to get it cleaned up a bit.

The database is one area that needs improvement. I have a table for the game statistics, which has GameID, PlayerID, Kills, Deaths, DamageDealt, DamageTaken, etc. In total, there are about 50 fields in that single table and many more that could be added in the future. At what point are there too many fields? It currently has 57,341 rows and is 153.6 MiB by itself.

I also have a few fields that store arrays in a BLOB in this same table. An example of the array is Player vs Player matchups. The array stores how many times that player killed another player in the game. These are the bigger fields in filesize. Is storing an array in a BLOB advised?

The array looks like:

        [Killed] => Array
            (
                [SomeDude] => 13
                [GameGuy] => 10
                [AnotherPlayer] => 8
                [YetAnother] => 7
                [BestPlayer] => 3
                [APlayer] => 9
                [WorstPlayer] => 2
            )

These tend not to exceed 10 players.

I prefer to not have one table with an undetermined number of columns (with more to come) but rather to have an associated table of labels and values, so each user has an id and you use that id as a key into the table of labels and values. That way you only store the data you need per user. I believe this approach is called EAV (as per Triztian's comment) and it's how medical databases are kept, since there are SO many potential fields for an individual patient, even while any given patient only has a very small number of those fields with actual data.

so, you'd have

user:
id | username | some_other_required_field

user_data:
id | user_id | label | value

Now you can have as many or as few user_data rows as you need per user.

[Edit]

As to your array, I would treat this with a relational table as well. Something like:

player_interaction:
id | player_id | other_player_id | interaction_type

Here you would store the two players who had an interaction and what type of interaction it was.
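
As a rough sketch of how the Killed array from the question could be rebuilt from such a relational table, here is a Python example with SQLite in memory standing in for MySQL. The game_id column, the other_player_id name, the 'kill' type value and the `kills_by` helper are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE player_interaction (
        id INTEGER PRIMARY KEY,
        game_id INTEGER NOT NULL,          -- assumed link to the game
        player_id INTEGER NOT NULL,        -- the player who acted
        other_player_id INTEGER NOT NULL,  -- the player acted upon
        interaction_type TEXT NOT NULL
    )
""")
# one row per event: in game 1, player 10 kills player 20 three times and player 30 twice
events = [(1, 10, 20, "kill")] * 3 + [(1, 10, 30, "kill")] * 2 + [(1, 20, 10, "kill")]
conn.executemany(
    "INSERT INTO player_interaction (game_id, player_id, other_player_id, interaction_type) "
    "VALUES (?, ?, ?, ?)",
    events,
)


def kills_by(game_id, player_id):
    """Rebuild the 'Killed' matchup for one player in one game as {victim: count}."""
    rows = conn.execute(
        "SELECT other_player_id, COUNT(*) FROM player_interaction "
        "WHERE game_id = ? AND player_id = ? AND interaction_type = 'kill' "
        "GROUP BY other_player_id",
        (game_id, player_id),
    )
    return dict(rows)
```

The counts that used to live inside the BLOB become an ordinary GROUP BY, so they can also be aggregated across games without unpacking anything in PHP.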

How can I summarize rows that occur only once?

5 votes

I have a query which returns the number of rows of a distinct device_type which occur more than once.

SELECT COUNT(*) AS C1,device_type FROM stat 
    WHERE stat_date = '2012-02-08' 
    GROUP BY 2 HAVING C1 > 1 
    ORDER BY 1 DESC

I would like to summarize the remaining (HAVING count = 1) rows as 'others'

How can I add the sum of COUNT(*) and 'others' as second column for the following query?

SELECT COUNT(*) AS C2,device_type FROM stat 
    WHERE stat_date = '2012-02-08' 
    GROUP BY 2 HAVING C2 = 1 
    ORDER BY 1 DESC

Sample data in DB

device_type
dt1
dt1
dt1
dt2
dt2
dt3
dt4
dt5

expected result

3 dt1
2 dt2
3 other

You can also try:

SELECT SUM(C1) AS C1, CASE WHEN C1 = 1 THEN 'other' ELSE device_type END as device_type
FROM (  SELECT  COUNT(*) AS C1,
                device_type 
        FROM stat 
        WHERE stat_date = '2012-02-08' 
        GROUP BY device_type) A
GROUP BY CASE WHEN C1 = 1 THEN 'other' ELSE device_type END
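
To see the two-level grouping at work on the sample data, here is a sketch in Python with SQLite in memory standing in for MySQL. The outer aliases (total, label) are renamed from the answer's C1 and device_type purely to keep the inner and outer names apart in this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stat (device_type TEXT, stat_date TEXT)")
conn.executemany(
    "INSERT INTO stat VALUES (?, '2012-02-08')",
    [(d,) for d in ["dt1", "dt1", "dt1", "dt2", "dt2", "dt3", "dt4", "dt5"]],
)

rows = conn.execute("""
    SELECT SUM(A.c1) AS total,
           CASE WHEN A.c1 = 1 THEN 'other' ELSE A.device_type END AS label
    FROM (SELECT COUNT(*) AS c1, device_type
          FROM stat
          WHERE stat_date = ?
          GROUP BY device_type) AS A
    -- every device type that occurs once collapses into the single 'other' group
    GROUP BY CASE WHEN A.c1 = 1 THEN 'other' ELSE A.device_type END
    ORDER BY total DESC, label
""", ("2012-02-08",)).fetchall()
```

The inner query counts per device type; the outer query regroups those counts, so dt3, dt4 and dt5 add up to the expected `3 other` row.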

How can I make remember voting with cookies easier than this?

5 votes

The easiest way to describe my problem is with a working example: even if you are not logged in, YouTube remembers what you have watched, and next time gives you suggestions based on previously watched movies.

My site is similar in a way: the users can vote on articles without logging in, and the site remembers votes with cookies. I have figured out a working method, but there has to be an easier way - also now the DB usage is anything but optimized.

For every visitor there is a check whether he has the cookie. If yes, I query his votes. If not, I create a dummy user and send him the cookie. I also store this user's "last_visit" timestamp. After this everything is the same for both kinds of users. My problem is that my DB is filling up with dummy users, so I made my cookies expire in 3 months, and my site regularly checks which users haven't visited in the last 3 months and deletes them from the DB.

I know I overcomplicated this, but my vote system is using AJAX, and I couldn't find a method to send out a cookie (and create the dummy user) only if a vote happens and not every time a simple visitor browses my site - without vote.

Also a note: I insist on using cookies - I know it would be easier to store IP-s when a vote happens, but there are schools, businesses using the same IP, and I like to allow their users to use my site.

What did I miss here? How can this be optimized?

If they do not hold a permanent account, why store anything related to them in the database at all? Just record their prior votes in the cookie. You would still store overall votes in the DB, but anonymously, and not relate these to "users" at all.
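
A minimal sketch of that idea in Python, using only the standard library: the cookie itself carries the ids of the articles already voted on, and it is only sent when a vote actually happens. The cookie name, the dot-separated id format and both helper functions are assumptions made for illustration; the anonymous vote counter in the database is left out.

```python
from http.cookies import SimpleCookie

COOKIE_NAME = "votes"         # hypothetical cookie name
MAX_AGE = 90 * 24 * 60 * 60   # roughly the 3 months mentioned in the question


def voted_ids(cookie_header):
    """Parse the Cookie request header into the set of article ids already voted on."""
    cookie = SimpleCookie()
    cookie.load(cookie_header or "")
    if COOKIE_NAME not in cookie:
        return set()
    parts = cookie[COOKIE_NAME].value.split(".")
    return {int(part) for part in parts if part.isdigit()}


def record_vote(cookie_header, article_id):
    """Return (accepted, set_cookie_value); the cookie is only produced on a new vote."""
    ids = voted_ids(cookie_header)
    if article_id in ids:
        return False, None  # already voted: nothing to count, nothing to send
    ids.add(article_id)
    cookie = SimpleCookie()
    cookie[COOKIE_NAME] = ".".join(str(i) for i in sorted(ids))
    cookie[COOKIE_NAME]["max-age"] = MAX_AGE
    cookie[COOKIE_NAME]["path"] = "/"
    return True, cookie[COOKIE_NAME].OutputString()
```

The AJAX vote handler would call `record_vote` with the incoming Cookie header, increment the article's anonymous counter when the vote is accepted, and return the value in a Set-Cookie header. Plain visitors never receive a cookie, and no dummy users are needed. As with any cookie-only scheme, a visitor who clears cookies can vote again.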

Querying the record of database for almost similar matching String value

5 votes

The scenario is that I have a bulky database of around 500,000 records with address and city fields, and there was no standard way of entering the values. For example, some users entered their city as bangalore, while others entered begaluru or benglore (misspelled).

Also, in the address field, the same user with multiple records has entered addresses that are not exactly the same, for example Mountville park Thomas gate and Montlee park thonas gte.

I need to fetch all the records that have the same or almost similar (somehow misspelled) values of address and city.

Is there any way to get those records with almost similar but unmatched values?

Thank you.

It will be an expensive query, but since this will hopefully be a one-time operation, you might consider looking into a Levenshtein distance formula.

In order to avoid needing to calculate the distance for a cartesian product of your table, you could first narrow the set of cities and addresses to be compared with a quicker sanity check... such as they begin with the same letter, and have a similar length.

You could then start off by only returning records with a very small Levenshtein distance, and then gradually increasing the distance until you start to get too many false positives.

Here's an implementation directly in MySQL:

CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) ) 
  RETURNS INT 
  DETERMINISTIC 
  BEGIN 
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT; 
    DECLARE s1_char CHAR; 
    -- max strlen=255 
    DECLARE cv0, cv1 VARBINARY(256); 
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0; 
    IF s1 = s2 THEN 
      RETURN 0; 
    ELSEIF s1_len = 0 THEN 
      RETURN s2_len; 
    ELSEIF s2_len = 0 THEN 
      RETURN s1_len; 
    ELSE 
      WHILE j <= s2_len DO 
        SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1; 
      END WHILE; 
      WHILE i <= s1_len DO 
        SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1; 
        WHILE j <= s2_len DO 
          SET c = c + 1; 
          IF s1_char = SUBSTRING(s2, j, 1) THEN
            SET cost = 0;
          ELSE
            SET cost = 1;
          END IF;
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
          IF c > c_temp THEN
            SET c = c_temp;
          END IF;
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
          IF c > c_temp THEN
            SET c = c_temp;
          END IF;
          SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
        END WHILE; 
        SET cv1 = cv0, i = i + 1; 
      END WHILE; 
    END IF; 
    RETURN c; 
  END;

This function could then be used in a helper function as follows:

CREATE FUNCTION levenshtein_ratio( s1 VARCHAR(255), s2 VARCHAR(255) ) 
  RETURNS INT 
  DETERMINISTIC 
  BEGIN 
    DECLARE s1_len, s2_len, max_len INT; 
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2); 
    IF s1_len > s2_len THEN  
      SET max_len = s1_len;  
    ELSE  
      SET max_len = s2_len;  
    END IF; 
    RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100); 
  END; 

You could also optimize the levenshtein function by passing in your current max distance... if the function passes that distance, exit without calculating the exact distance.
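
The early-exit idea can be sketched outside MySQL. The following Python version is an illustration, not a port of the stored function above: it keeps one row of the distance table at a time and stops as soon as every value in the current row exceeds the allowed maximum. The `max_dist` parameter and both function names are assumptions for the example.

```python
def levenshtein(a, b, max_dist=None):
    """Edit distance between a and b.

    If max_dist is given and the distance is certain to exceed it, the
    function stops early and returns some value greater than max_dist.
    """
    if a == b:
        return 0
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        if max_dist is not None and min(cur) > max_dist:
            return max_dist + 1  # the final distance can only be larger
        prev = cur
    return prev[-1]


def levenshtein_ratio(a, b):
    """Similarity as a percentage, mirroring the levenshtein_ratio helper above."""
    max_len = max(len(a), len(b))
    if max_len == 0:
        return 100
    return round((1 - levenshtein(a, b) / max_len) * 100)
```

For the city names in the question, `levenshtein("bangalore", "benglore")` is 2 (one substitution and one deletion), which gives a ratio of 78.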

SQL select multiplicity of items in union

5 votes

I am performing several selects and wish to compute the overlap. Example:

SELECT id FROM foo WHERE ...
SELECT id FROM bar WHERE ...
SELECT id FROM baz WHERE ...

Call these queries a, b, and c, respectively. Suppose a gives (1,2,3,4,5), b gives (1,3,5), and c gives (4,5,6). I want to take the union of these and count the multiplicities. For the example above, the result I am looking for is

id | multiplicity
-----------------
1  | 2
2  | 1
3  | 2
4  | 2
5  | 3
6  | 1

How do I do this in MySQL5 within one query? (The a, b, and c parts may be plain selects or stored procedures).

I cannot verify this at the moment, but I believe this will work

SELECT id, count(id) AS multiplicity 
FROM
(
    SELECT id FROM foo WHERE ...
    UNION ALL
    SELECT id FROM bar WHERE ...
    UNION ALL
    SELECT id FROM baz WHERE ...
) AS TablesTogether
GROUP BY id
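
Since the answer was posted unverified, here is a quick check of the UNION ALL approach on the example data, sketched in Python with SQLite in memory standing in for MySQL. The WHERE clauses elided in the question are simply left out, and an ORDER BY is added so the output order is defined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foo (id INTEGER);
    CREATE TABLE bar (id INTEGER);
    CREATE TABLE baz (id INTEGER);
    INSERT INTO foo VALUES (1), (2), (3), (4), (5);
    INSERT INTO bar VALUES (1), (3), (5);
    INSERT INTO baz VALUES (4), (5), (6);
""")

rows = conn.execute("""
    SELECT id, COUNT(id) AS multiplicity
    FROM (
        SELECT id FROM foo
        UNION ALL          -- ALL keeps the duplicates that we want to count
        SELECT id FROM bar
        UNION ALL
        SELECT id FROM baz
    ) AS TablesTogether
    GROUP BY id
    ORDER BY id
""").fetchall()
```

The important detail is UNION ALL: a plain UNION would remove duplicates before the count and every multiplicity would come out as 1.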

What is the best way to delete old rows from MySQL on a rolling basis?

5 votes

I find myself wanting to delete rows older than (x) days on a rolling basis in a lot of applications. What is the most efficient way to do this on a high-traffic table?

For instance, I have a table that stores notifications and I only want to keep these for 7 days. Or high scores that I only want to keep for 31 days.

Right now I keep a column storing the epoch time posted and run a cron job once per hour that deletes the rows in increments like this:

DELETE FROM my_table WHERE time_stored < 1234567890 LIMIT 100

I do that until mysql_affected_rows returns 0.

I used to do it all at once but that caused everything in the application to hang for 30 seconds or so while INSERTS piled up. Adding the LIMIT worked to alleviate this but I'm wondering if there is a better way to do this.

Check out MySQL Partitioning:

Data that loses its usefulness can often be easily removed from a partitioned table by dropping the partition (or partitions) containing only that data. Conversely, the process of adding new data can in some cases be greatly facilitated by adding one or more new partitions for storing specifically that data.

See e.g. this post to get some ideas on how to apply it:

Using Partitioning and Event Scheduler to Prune Archive Tables

And this one:

Partitioning by dates: the quick how-to
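
As a rough sketch of what a daily rotation job could look like, the Python helper below only builds the two MySQL statements; it does not execute them. It assumes a table that is RANGE-partitioned on TO_DAYS() of a date column with one partition per day named pYYYYMMDD, which is a layout chosen for this example, not something taken from the linked posts. The table name must be a trusted identifier, since it is formatted into the statement text.

```python
from datetime import date, timedelta


def partition_name(day):
    """Hypothetical naming scheme: one partition per day, e.g. p20120215."""
    return "p" + day.strftime("%Y%m%d")


def rotation_statements(table, today, keep_days):
    """Build the statements for one daily rotation run.

    Adds the partition that will receive tomorrow's rows and drops the
    partition holding the day that has just fallen out of the window.
    """
    tomorrow = today + timedelta(days=1)
    upper_bound = tomorrow + timedelta(days=1)  # partition holds rows before this date
    expired = today - timedelta(days=keep_days)
    add = (
        "ALTER TABLE {t} ADD PARTITION "
        "(PARTITION {p} VALUES LESS THAN (TO_DAYS('{b}')))"
    ).format(t=table, p=partition_name(tomorrow), b=upper_bound.isoformat())
    drop = "ALTER TABLE {t} DROP PARTITION {p}".format(t=table, p=partition_name(expired))
    return [add, drop]
```

Dropping a partition removes all of its rows at once instead of deleting them one by one, which is the point of the partitioning suggestion; a cron job or the MySQL event scheduler could run the two statements once a day.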

SQL: Select transactions where rows are not of criteria inside the same table

5 votes

I have a table with transactions:

Transactions
------------
id | account | type | date_time             | amount
----------------------------------------------------
 1 | 001     | 'R'  | '2012-01-01 10:01:00' | 1000
 2 | 003     | 'R'  | '2012-01-02 12:53:10' | 1500
 3 | 003     | 'A'  | '2012-01-03 13:10:01' | -1500
 4 | 002     | 'R'  | '2012-01-03 17:56:00' | 2000
 5 | 001     | 'R'  | '2012-01-04 12:30:01' | 1000
 6 | 002     | 'A'  | '2012-01-04 13:23:01' | -2000
 7 | 003     | 'R'  | '2012-01-04 15:13:10' | 3000
 8 | 003     | 'R'  | '2012-01-05 12:12:00' | 1250
 9 | 003     | 'A'  | '2012-01-06 17:24:01' | -1250

and I wish to select all those of a certain type ('R'), but not those that are immediately followed (in order of the date_time field) by a transaction of another type ('A') for the same account...

So, the query should return the following rows, given the previous example:

id | account |type  | date                  | amount
----------------------------------------------------
 1 | 001     | 'R'  | '2012-01-01 10:01:00' | 1000
 5 | 001     | 'R'  | '2012-01-04 12:30:01' | 1000
 7 | 003     | 'R'  | '2012-01-04 15:13:10' | 3000

(As you can see, row 2 isn't displayed because row 3 'cancels' it; likewise row 4 is 'cancelled' by row 6. Row 7 does appear: even though account 003 also belongs to the cancelled row 2, row 7 itself is not cancelled by any 'A' row. And row 8 won't appear: it is also for account 003, but it is cancelled by row 9, which cancels only the immediately preceding row, 8, and not row 7.)

I have tried joins and subqueries in WHERE clauses, but I'm really not sure how to write my query...

What I have tried:

Trying joins:

   SELECT trans.type as type,
          trans.amount as amount,
          trans.date_time as dt,
          trans.account as acct,
     FROM Transactions trans
INNER JOIN ( SELECT t.type AS type, t.acct AS acct, t.date_time AS date_time
               FROM Transactions t
              WHERE t.date_time > trans.date_time
           ORDER BY t.date_time DESC
          ) AS nextTrans
       ON nextTrans.acct = trans.acct
    WHERE trans.type IN ('R')
      AND nextTrans.type NOT IN ('A')
 ORDER BY DATE(trans.date_time) ASC

This throws an error, since I can't reference outer values inside the JOIN subquery in MySQL.

Trying subquery in where:

   SELECT trans.type as type,
          trans.amount as amount,
          trans.date_time as dt,
          trans.account as acct,
     FROM Transactions trans
    WHERE trans.type IN ('R')
      AND trans.datetime <
          ( SELECT t.date_time AS date_time
               FROM Transactions t
              WHERE t.account = trans.account
           ORDER BY t.date_time DESC
          ) AS nextTrans
       ON nextTrans.acct = trans.acct

 ORDER BY DATE(trans.date_time) ASC

This is wrong: I can reference outer values in the WHERE subquery in MySQL, but I cannot find a way to filter correctly for what I need...

IMPORTANT EDIT:

I managed to achieve a solution, but it now needs serious optimization. Here it is:

SELECT *
  FROM (SELECT t1.*, tFlagged.id AS cancId, tFlagged.type AS cancFlag
          FROM transactions t1
     LEFT JOIN (SELECT t2.*
                  FROM transactions t2
              ORDER BY t2.date_time ASC ) tFlagged
            ON (t1.account=tFlagged.account
                  AND
                t1.date_time < tFlagged.date_time)
         WHERE t1.type = 'R'
      GROUP BY t1.id) tCanc
 WHERE tCanc.cancFlag IS NULL
    OR tCanc.cancFlag <> 'A'

I joined the table with itself, considering only rows with the same account and a greater date_time. The join is ordered by date_time. By grouping by id I managed to get only the first result of the join, which happens to be the next transaction for the same account.

Then on the outer select, I filter out those that have an 'A', since that means that the next transaction was effectively a cancellation for it. In other words, if there is no next transaction for the same account or if the next transaction is an 'R', then it is not cancelled and it must be shown in the result...

I got this:

+----+---------+------+---------------------+--------+--------+----------+
| id | account | type | date_time           | amount | cancId | cancFlag |
+----+---------+------+---------------------+--------+--------+----------+
|  1 | 001     |   R  | 2012-01-01 10:01:00 |   1000 |      5 | R        |
|  5 | 001     |   R  | 2012-01-04 12:30:01 |   1000 |   NULL | NULL     |
|  7 | 003     |   R  | 2012-01-04 15:13:10 |   3000 |      8 | R        |
+----+---------+------+---------------------+--------+--------+----------+

It relates each transaction with the next one in time for the same account and then filters out those that have been cancelled... Success!!

As I said, the problem now is optimization. My real data has a lot of rows (as a table holding transactions through time is expected to have), and for a table of ~10,000 rows right now, I got a positive result with this query in 1min.44sec. I suppose that's the thing with joins... (For those who know the protocol in here, what should I do? launch a new question here and post this as a solution to this one? Or just wait for more answers here?)

Here is a solution based on nested subqueries. First, I added a few rows to catch a few more cases. Transaction 10, for example, should not be cancelled by transaction 12, because transaction 11 comes in between.

> select * from transactions order by date_time;
+----+---------+------+---------------------+--------+
| id | account | type | date_time           | amount |
+----+---------+------+---------------------+--------+
|  1 |       1 | R    | 2012-01-01 10:01:00 |   1000 |
|  2 |       3 | R    | 2012-01-02 12:53:10 |   1500 |
|  3 |       3 | A    | 2012-01-03 13:10:01 |  -1500 |
|  4 |       2 | R    | 2012-01-03 17:56:00 |   2000 |
|  5 |       1 | R    | 2012-01-04 12:30:01 |   1000 |
|  6 |       2 | A    | 2012-01-04 13:23:01 |  -2000 |
|  7 |       3 | R    | 2012-01-04 15:13:10 |   3000 |
|  8 |       3 | R    | 2012-01-05 12:12:00 |   1250 |
|  9 |       3 | A    | 2012-01-06 17:24:01 |  -1250 |
| 10 |       3 | R    | 2012-01-07 00:00:00 |   1250 |
| 11 |       3 | R    | 2012-01-07 05:00:00 |   4000 |
| 12 |       3 | A    | 2012-01-08 00:00:00 |  -1250 |
| 14 |       2 | R    | 2012-01-09 00:00:00 |   2000 |
| 13 |       3 | A    | 2012-01-10 00:00:00 |  -1500 |
| 15 |       2 | A    | 2012-01-11 04:00:00 |  -2000 |
| 16 |       2 | R    | 2012-01-12 00:00:00 |   5000 |
+----+---------+------+---------------------+--------+
16 rows in set (0.00 sec)

First, create a query to grab, for each transaction, "the date of the most recent transaction before that one in the same account":

SELECT t2.*,
       MAX(t1.date_time) AS prev_date
FROM transactions t1
JOIN transactions t2
ON (t1.account = t2.account
   AND t2.date_time > t1.date_time)
GROUP BY t2.account,t2.date_time
ORDER BY t2.date_time;

+----+---------+------+---------------------+--------+---------------------+
| id | account | type | date_time           | amount | prev_date           |
+----+---------+------+---------------------+--------+---------------------+
|  3 |       3 | A    | 2012-01-03 13:10:01 |  -1500 | 2012-01-02 12:53:10 |
|  5 |       1 | R    | 2012-01-04 12:30:01 |   1000 | 2012-01-01 10:01:00 |
|  6 |       2 | A    | 2012-01-04 13:23:01 |  -2000 | 2012-01-03 17:56:00 |
|  7 |       3 | R    | 2012-01-04 15:13:10 |   3000 | 2012-01-03 13:10:01 |
|  8 |       3 | R    | 2012-01-05 12:12:00 |   1250 | 2012-01-04 15:13:10 |
|  9 |       3 | A    | 2012-01-06 17:24:01 |  -1250 | 2012-01-05 12:12:00 |
| 10 |       3 | R    | 2012-01-07 00:00:00 |   1250 | 2012-01-06 17:24:01 |
| 11 |       3 | R    | 2012-01-07 05:00:00 |   4000 | 2012-01-07 00:00:00 |
| 12 |       3 | A    | 2012-01-08 00:00:00 |  -1250 | 2012-01-07 05:00:00 |
| 14 |       2 | R    | 2012-01-09 00:00:00 |   2000 | 2012-01-04 13:23:01 |
| 13 |       3 | A    | 2012-01-10 00:00:00 |  -1500 | 2012-01-08 00:00:00 |
| 15 |       2 | A    | 2012-01-11 04:00:00 |  -2000 | 2012-01-09 00:00:00 |
| 16 |       2 | R    | 2012-01-12 00:00:00 |   5000 | 2012-01-11 04:00:00 |
+----+---------+------+---------------------+--------+---------------------+
13 rows in set (0.00 sec)
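
For readers who want to check the "previous transaction in the same account" idea outside the database, here is a minimal Python sketch of the same step. It is not the query above, only an illustration of what it computes. The names ROWS and with_prev_date are made up for this example, and it assumes that date_time values are unique within an account (true for the sample data), so "the latest earlier date" is simply the previous row after sorting.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows from the listing above: (id, account, type, date_time, amount).
# ISO-style timestamps compare correctly as plain strings.
ROWS = [
    (1, 1, "R", "2012-01-01 10:01:00", 1000),
    (2, 3, "R", "2012-01-02 12:53:10", 1500),
    (3, 3, "A", "2012-01-03 13:10:01", -1500),
    (4, 2, "R", "2012-01-03 17:56:00", 2000),
    (5, 1, "R", "2012-01-04 12:30:01", 1000),
    (6, 2, "A", "2012-01-04 13:23:01", -2000),
    (7, 3, "R", "2012-01-04 15:13:10", 3000),
    (8, 3, "R", "2012-01-05 12:12:00", 1250),
    (9, 3, "A", "2012-01-06 17:24:01", -1250),
    (10, 3, "R", "2012-01-07 00:00:00", 1250),
    (11, 3, "R", "2012-01-07 05:00:00", 4000),
    (12, 3, "A", "2012-01-08 00:00:00", -1250),
    (14, 2, "R", "2012-01-09 00:00:00", 2000),
    (13, 3, "A", "2012-01-10 00:00:00", -1500),
    (15, 2, "A", "2012-01-11 04:00:00", -2000),
    (16, 2, "R", "2012-01-12 00:00:00", 5000),
]


def with_prev_date(rows):
    """Append to each row the date_time of the previous row in the same account.

    Rows without a predecessor are dropped, just as the inner self-join
    drops them. Assumes date_time is unique within an account.
    """
    ordered = sorted(rows, key=itemgetter(1, 3))  # by account, then date_time
    result = []
    for _, group in groupby(ordered, key=itemgetter(1)):
        prev = None
        for row in group:
            if prev is not None:
                result.append(row + (prev[3],))
            prev = row
    return sorted(result, key=itemgetter(3))  # back to date_time order


for row in with_prev_date(ROWS):
    print(row)
```

The first transaction of each account has no predecessor, which is why 16 input rows shrink to 13 here, the same as in the query result.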

Use that as a subquery to get each transaction and its predecessor on the same row. Then filter for the transactions we're interested in, namely 'A' transactions whose predecessor is an 'R' transaction that they exactly cancel out:

SELECT
  t3.*,transactions.*
FROM
  transactions
  JOIN
  (SELECT t2.*,
          MAX(t1.date_time) AS prev_date
   FROM transactions t1
   JOIN transactions t2
   ON (t1.account = t2.account
      AND t2.date_time > t1.date_time)
   GROUP BY t2.account,t2.date_time) t3
  ON t3.account = transactions.account
     AND t3.prev_date = transactions.date_time
     AND t3.type='A'
     AND transactions.type='R'
     AND t3.amount + transactions.amount = 0
  ORDER BY t3.date_time;


+----+---------+------+---------------------+--------+---------------------+----+---------+------+---------------------+--------+
| id | account | type | date_time           | amount | prev_date           | id | account | type | date_time           | amount |
+----+---------+------+---------------------+--------+---------------------+----+---------+------+---------------------+--------+
|  3 |       3 | A    | 2012-01-03 13:10:01 |  -1500 | 2012-01-02 12:53:10 |  2 |       3 | R    | 2012-01-02 12:53:10 |   1500 |
|  6 |       2 | A    | 2012-01-04 13:23:01 |  -2000 | 2012-01-03 17:56:00 |  4 |       2 | R    | 2012-01-03 17:56:00 |   2000 |
|  9 |       3 | A    | 2012-01-06 17:24:01 |  -1250 | 2012-01-05 12:12:00 |  8 |       3 | R    | 2012-01-05 12:12:00 |   1250 |
| 15 |       2 | A    | 2012-01-11 04:00:00 |  -2000 | 2012-01-09 00:00:00 | 14 |       2 | R    | 2012-01-09 00:00:00 |   2000 |
+----+---------+------+---------------------+--------+---------------------+----+---------+------+---------------------+--------+
4 rows in set (0.00 sec)

From the result above it's apparent we're almost there - we've identified the unwanted transactions. Using LEFT JOIN we can filter these out of the whole transaction set:

SELECT
  transactions.*
FROM
  transactions
LEFT JOIN
  (SELECT
     transactions.id
   FROM
     transactions
     JOIN
     (SELECT t2.*,
             MAX(t1.date_time) AS prev_date
      FROM transactions t1
      JOIN transactions t2
      ON (t1.account = t2.account
         AND t2.date_time > t1.date_time)
      GROUP BY t2.account,t2.date_time) t3
     ON t3.account = transactions.account
        AND t3.prev_date = transactions.date_time
        AND t3.type='A'
        AND transactions.type='R'
        AND t3.amount + transactions.amount = 0) t4
  USING(id)
  WHERE t4.id IS NULL
    AND transactions.type = 'R'
  ORDER BY transactions.date_time;

+----+---------+------+---------------------+--------+
| id | account | type | date_time           | amount |
+----+---------+------+---------------------+--------+
|  1 |       1 | R    | 2012-01-01 10:01:00 |   1000 |
|  5 |       1 | R    | 2012-01-04 12:30:01 |   1000 |
|  7 |       3 | R    | 2012-01-04 15:13:10 |   3000 |
| 10 |       3 | R    | 2012-01-07 00:00:00 |   1250 |
| 11 |       3 | R    | 2012-01-07 05:00:00 |   4000 |
| 16 |       2 | R    | 2012-01-12 00:00:00 |   5000 |
+----+---------+------+---------------------+--------+
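
As a cross-check of the result, and as one possible direction for the slowness mentioned earlier, the same rule can be applied in application code with one sorted pass per account instead of a self-join. This is only a sketch under assumptions, not a replacement for the query: ROWS and surviving_receipts are hypothetical names, and it assumes date_time is unique within an account, so an 'R' row is cancelled exactly when the next row in its account is an 'A' with the opposite amount.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows from the question: (id, account, type, date_time, amount).
ROWS = [
    (1, 1, "R", "2012-01-01 10:01:00", 1000),
    (2, 3, "R", "2012-01-02 12:53:10", 1500),
    (3, 3, "A", "2012-01-03 13:10:01", -1500),
    (4, 2, "R", "2012-01-03 17:56:00", 2000),
    (5, 1, "R", "2012-01-04 12:30:01", 1000),
    (6, 2, "A", "2012-01-04 13:23:01", -2000),
    (7, 3, "R", "2012-01-04 15:13:10", 3000),
    (8, 3, "R", "2012-01-05 12:12:00", 1250),
    (9, 3, "A", "2012-01-06 17:24:01", -1250),
    (10, 3, "R", "2012-01-07 00:00:00", 1250),
    (11, 3, "R", "2012-01-07 05:00:00", 4000),
    (12, 3, "A", "2012-01-08 00:00:00", -1250),
    (14, 2, "R", "2012-01-09 00:00:00", 2000),
    (13, 3, "A", "2012-01-10 00:00:00", -1500),
    (15, 2, "A", "2012-01-11 04:00:00", -2000),
    (16, 2, "R", "2012-01-12 00:00:00", 5000),
]


def surviving_receipts(rows):
    """Return the 'R' rows that are not exactly cancelled by the very next
    transaction in the same account, ordered by date_time."""
    ordered = sorted(rows, key=itemgetter(1, 3))  # by account, then date_time
    kept = []
    for _, group in groupby(ordered, key=itemgetter(1)):
        txns = list(group)
        for i, row in enumerate(txns):
            if row[2] != "R":
                continue
            nxt = txns[i + 1] if i + 1 < len(txns) else None
            cancelled = (
                nxt is not None
                and nxt[2] == "A"
                and nxt[4] + row[4] == 0
            )
            if not cancelled:
                kept.append(row)
    return sorted(kept, key=itemgetter(3))


print([row[0] for row in surviving_receipts(ROWS)])  # [1, 5, 7, 10, 11, 16]
```

The cost here is dominated by the sort, whereas the self-join has to compare every pair of transactions within an account before taking MAX. Whether that matters in practice depends on the indexes available and on how many rows each account holds.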