Best r questions in March 2012

When to use R, when to use SQL?

8 votes

I have a moderate sized database with many joins and lookup tables.

I am more familiar with R than with SQL, and I am using MySQL.

My Question:

At what point is it beneficial to stop increasing the complexity of an SQL statement in favor of the data subsetting functionality in R (e.g., merge, *apply, maply, dlply, etc.)in R.

On one hand, SQL's join is easier than selecting all contents of each table and using the R merge function to join them. Also, doing the conditional selects in SQL would reduce the amount of data that has to be imported to R; but the speed difference is not significant.

On the other hand, a big join with a complex where clause becomes less easy to understand than the R syntax.

Below I have some untested code for illustrative purposes: I am asking this question at before having working code, and the answer to my question doesn't require working code (although this is always appreciated) - the "most elegant approach", "fewest lines", or "amazing implementation of X" are always appreciated, but what I am particularly interested in is the "most sensible / practical / canonical / based on first principles" rationale.

I am interested in the general answer of which steps should use a SQL where clause and which steps would be easier to accomplish using R.

Illustration:

Database description

there are three tables: a, ab, and b. Tables a and b each have a primary key id. They have a many-many relationship that is represented by a lookup table, ab, which contains fields ab.a_id and ab.b_id that join to a.id and b.id, respectively. Both tables have a time field, and a has a group field.

Goal:

Here is a minimal example of the join and subsetting that I want to do;

(MySQL naming of elements, e.g. a.id is equivalent to a$id in R)

  1. Join tables a and b using ab, appending multiple values of b.time associated with each a.id as a new column;

    select a_time, b.time, a.id, b.id from 
           a join ab on a.id = ab.a_id 
           join b on b.id = ab.b_id and then append b.time for distinct values of b.id;
    
  2. I don't need repeated values of b.time, I only need a value of b.max: for repeated values of b.time joined to each a.id, b.max is the value of b.time closest to but not greater than a.time

    b.max <- max(b.time[b.time < a.time))
    
  3. append the value dt <- a.time - b.max to the table, for example, in R,
  4. for each distinct value in a.group, select which(min(x.dt)))

    x.dt <- a.time - b.max
    

I usually do the data manipulations in SQL until the data I want is in a single table, and then, I do the rest in R. Only when there is a performance issue do I start to move some of the computations to the database. This is already what you are doing.

Computations involving timestamps often become unreadable in SQL (the "analytic functions", similar to ddply, are supposed to simplify this, but I think they are not available in MySQL).

However, your example can probably be written entirely in SQL as follows (not tested).

-- Join the tables and compute the maximum
CREATE VIEW t1 AS
SELECT a.id    AS a_id, 
       a.group AS a_group,
       b.id    AS b_id,
       a.time  AS a_time, 
       a.time - MAX(b.time) AS dt
FROM   a, b, ab
WHERE  a.id = ab.a_id AND b.id = ab.b_id
AND    b.time < a.time
GROUP  BY a.id, a.group, b.id;

-- Extract the desired rows
CREATE VIEW t2 AS 
SELECT t1.*
FROM t1, (SELECT group, MIN(dt) AS min_dt FROM t1) X
WHERE t1.a_id = X.a_id 
AND   t1.b_id = X.b_id 
AND   t1.a_group = X.a.group;

reverse-lookup Digital Object Identifier given table of citations?

6 votes

I have a table of citations that includes the last name of the first author, the title, journal, year, and page numbers for each citation.

I have posted the first few lines of the table on google docs, or the csv version (not all records have a doi)

I would like to be able to query the digital object identifier for each of these citations. For the titles, it would be best if the query could handle "fuzzy matching".

How can I do this?

The table is currently in MySQL, but it would be sufficient to start and end with a .csv file (I would appreciate an answer that goes from start to finish) (or, since I mostly use R, an R data frame).

Here are two options

CSV upload

I have found another promising solution that does not work as well in practice as in

CrossRef allows you to upload the linked csv directly, and then performs a text query here: http://www.crossref.org/stqUpload/

However, only 18 of the 250 queries (~7%) returned a doi.

XML Query

Based on the answer by Brian Diggs, here an attempt that does 95% of the work - toward writing the xml-based query, it still has a few bugs that require some deletion using sed. But the biggest problem that my "session timed out" when the query was submitted.

the xml syntax includes an option to use fuzzy matching.

the doiquery.xml contains the template text in @Brians answer; the citations.csv is linked above.

library(XML)
doiquery.xml <- xmlTreeParse('doiquery.xml')

query <- doiquery.xml$doc$children$query_batch[["body"]]

citations <- read.csv("citations.csv")

new.query <- function(citation, query = query){
  xmlValue(query[["author"]]) <- as.character(citation$author)
  xmlValue(query[["year"]]) <- as.character(citation$year)
  xmlValue(query[["article_title"]][["text"]]) <- citation$title
  xmlValue(query[["journal_title"]]) <- citation$journal
  return(query)
}


for (i in 1:nrow(citations)){
  q <- addChildren(q, add.query(citations[i,]))
}
axml <- addChildren(doiquery.xml$doc$children$query_batch, q )

saveXML(axml, file = 'foo.xml')

CSV to XML Converter

Creativyst software provides a web based CSV to XML converter.

Steps:

  1. Enter columnames in ElementID's field,
  2. "document" in DocID field
  3. "query" in RowID field
  4. Copy / paste csv in "Input CSV file".
  5. Click Convert

Also, see this related question: Shell script to parse CSV to an XML query?