Best r questions in June 2011

Is there an implementation of Hadley's ddply for python?

12 votes

I find Hadley's plyr package for R extremely helpful; it's a great DSL for transforming data. The problem that it solves is so common that I face it in other use cases, when manipulating data not in R but in other programming languages.

Does anyone know if there exists a module that does a similar thing for Python? Something like:

def ddply(rows, *cols, op=lambda group_rows: group_rows):
    """group rows by cols, then apply the function op to each group
       and return the results aggregating all groups
       rows is a dict or list of values read by csv.reader or csv.DictReader"""
    pass

It shouldn't be too difficult to implement, but it would be great if it already existed. If I implemented it myself, I'd use itertools.groupby to group by cols, then apply the op function, then use itertools.chain to chain it all up. Is there a better solution?

This is the implementation I drafted up:

import itertools

def ddply(rows, cols, op=lambda group_rows: group_rows):
    """group rows by cols, then apply the function op to each group
    rows is a list of lists, or of dicts keyed by col name (like read
    from csv.reader or csv.DictReader)"""
    def group_key(row):
        # a tuple, not a generator: keys must be sortable and comparable
        return tuple(row[col] for col in cols)
    rows = sorted(rows, key=group_key)
    return itertools.chain.from_iterable(
        op(group_rows) for k, group_rows in itertools.groupby(rows, key=group_key))

Another step would be to have a set of predefined functions that could be applied as op, like sum and other utility functions.
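
A minimal sketch of what such a predefined op might look like, assuming a variant of the draft above whose group key is a tuple. The helper name `summing` and the sample rows are hypothetical, made up purely for illustration:

```python
import itertools

def ddply(rows, cols, op=lambda group_rows: group_rows):
    """Group rows by cols, apply op to each group, chain the results."""
    def group_key(row):
        # A tuple, so that keys can be sorted and compared for equality.
        return tuple(row[col] for col in cols)
    rows = sorted(rows, key=group_key)
    return itertools.chain.from_iterable(
        op(list(group)) for _, group in itertools.groupby(rows, key=group_key))

def summing(value_col, by):
    """Hypothetical helper: build an op that sums value_col within a group
    and emits one summary row carrying the grouping columns in `by`."""
    def op(group_rows):
        summary = {col: group_rows[0][col] for col in by}
        summary[value_col] = sum(row[value_col] for row in group_rows)
        return [summary]  # an op must return an iterable of rows
    return op

# Made-up sample data, shaped like csv.DictReader rows after numeric conversion.
rows = [
    {"city": "Oslo", "year": 2010, "sales": 3},
    {"city": "Rome", "year": 2010, "sales": 5},
    {"city": "Oslo", "year": 2010, "sales": 4},
]
result = list(ddply(rows, ["city"], summing("sales", by=["city"])))
# [{'city': 'Oslo', 'sales': 7}, {'city': 'Rome', 'sales': 5}]
```

Building the op with a small factory keeps ddply itself unchanged, so other aggregations (mean, count, and so on) could presumably be added the same way.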

Stopping Garbage Collection for an unmanaged Delegate

12 votes

I've recently been trying out using R.NET to get R talking to .NET and C#. It's been going very well so far, but I've hit a snag that I don't seem to be able to solve.

I've had no issues with simple, basic commands. I made a simple calculator, and something to import data into a data grid. But now I keep getting the following error:

A callback was made on a garbage collected delegate of type 'R.NET!RDotNet.Internals.blah3::Invoke'. This may cause application crashes, corruption and data loss. When passing delegates to unmanaged code, they must be kept alive by the managed application until it is guaranteed that they will never be called.

It began when I tried to repeatedly import a text file, just to test something. I've looked up this error in various ways - after hours of trawling through pages, it seems that there are a number of causes of this type of error. As time has gone on, I've been stripping back my code to more and more simple tasks to try to eliminate possibilities. I've got this now:

public Form1()
{
    InitializeComponent();

    REngine.SetDllDirectory(@"C:\Program Files\R\R-2.13.0\bin\i386");
    REngine.CreateInstance("RDotNet");

    using (REngine currentEngine = REngine.GetInstanceFromID("RDotNet"))
    {
        for (int i = 0; i < 1000; ++i)
        {
            currentEngine.EagerEvaluate("test <- " + i.ToString());
            NumericVector returned = currentEngine.GetSymbol("test").AsNumeric();
            textBox1.Text += returned[0];
        }
    }
}

All it does is increment a count in textBox1.Text. I had been doing it as a test with a button press incrementing the value, but this was making my finger ache after a while! It could typically manage loads of presses before throwing the error above.

At first this code seemed to be fine - so I had assumed the other stuff I had been doing was somehow the cause of the problem, as well as the cause of the error quoted above. So that's why I put in the for loop. The loop can run with no problems for several hundred runs, but then the error kicks in.

Now, I did read that this kind of error can be caused by the garbage collector getting rid of the instance I've been working with. I've tried various suggestions that I read, as best I understand them. These have included using GC.KeepAlive() (no luck), and also creating a private field in a separate class that can't be gotten rid of. Sadly this didn't work either.

Is there anything else that I can try? I'm very, very new to C# so I'd appreciate any pointers on how to get this working. I assume that my lack of success with the suggested methods is down to either (1) my own mistakes in implementing the standard fixes (this seems most likely) or (2) a quirk associated with R.NET that I haven't understood.

Any help would be greatly appreciated!

Looks like a bug in R.NET. The exception you're seeing happens when a .NET layer passes a callback to unmanaged code but then lets the delegate get garbage collected. I see no delegate usage in your repro code, hence the conclusion that it must be in R.NET.

How can I use back references with `grep` in R?

11 votes

I am looking for an elegant way of returning back references using regular expressions in R. Let me explain:

Let's say I want to find strings that start with a month name:

x <- c("May, 1, 2011", "30 June 2011")
grep("May|^June", x, value=TRUE)
[1] "May, 1, 2011"

This works, but I really want to isolate the month (i.e. "May"), not the entire matched string.

So, one can use gsub to return the back reference via the replacement argument. But this has two problems:

  1. You have to wrap the pattern inside ".*(pattern).*" so that the substitution occurs on the entire string.
  2. Rather than returning NA for non-matched strings, gsub returns the original string. This is clearly not what I desire:

The code and results:

gsub(".*(^May|^June).*", "\\1", x) 
[1] "May"          "30 June 2011"

I could probably code a workaround by doing all kinds of additional checks, but this quickly becomes very messy.

To be crystal clear, the desired results should be:

[1] "May"          NA

Is there an easy way of achieving this?

The stringr package has a function exactly for this purpose:

library(stringr)
x <- c("May, 1, 2011", "30 June 2011", "June 2012")
str_extract(x, "May|^June")
# [1] "May"  NA     "June"

It's a fairly thin wrapper around regexpr, but stringr generally makes string handling easier by being more consistent than base R functions.
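
For readers on the Python side of this digest, the same "first match or missing" behaviour can be sketched with the standard re module. This is only an analogy, not how stringr is implemented, and the helper name `extract_first` is hypothetical:

```python
import re

def extract_first(strings, pattern):
    """Return the first match of pattern in each string, or None if there is none."""
    compiled = re.compile(pattern)
    results = []
    for s in strings:
        match = compiled.search(s)  # search scans the whole string, like str_extract
        results.append(match.group(0) if match else None)
    return results

x = ["May, 1, 2011", "30 June 2011", "June 2012"]
months = extract_first(x, "May|^June")
# ['May', None, 'June']
```

Here None plays the role of NA: the second string contains "June", but not at the start, so the anchored alternative does not match.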

Can R produce on-the-fly graphs for website?

7 votes

I use a Flex/ColdFusion/MSSQL combo to take input from users to generate charts for a website. Is this possible in R? I have used RODBC and sqlQuery as a way of producing static graphs but cannot seem to find a way of doing it dynamically. Over to you JU

Of course you can. You can use RApache, a fantastic Apache module that allows stateless execution of R scripts. You can define an R script, catch (unserialize) the plot parameters (e.g. via JSON or a URL-encoded string), plot the graph, and load the result(s) with AJAX. That's pretty much what I did in my app.

If you're not satisfied with R's graph capabilities (and I'm sure that's so not gonna happen), you can try out the googleVis or canvas packages. The first one is "only" an R interface to the Google Visualization API, and I'm sure you'll like the latter if you're familiar with HTML5 canvas. Some lads found it useful.

So, the final answer is: yes, you can! You only need to decide whether you're going to generate graphs on the client or the server side. Of course, even if you decide to generate graphs on the client side, you must massage your data in R and return it in serialized form (JSON or XML encoded). I know that ExtJS 4 also has a good interface for creating client-side graphs, but I haven't used it much (read: "at all").