I find Hadley's plyr package for R extremely helpful; it's a great DSL for transforming data. The problem it solves is so common that I run into it in other use cases too, when I'm manipulating data not in R but in other programming languages.
Does anyone know if there is a module that does a similar thing for Python? Something like:
    def ddply(rows, *cols, op=lambda group_rows: group_rows):
        """Group rows by cols, then apply the function op to each group
        and return the results, aggregating all groups.

        rows is a dict or list of values read by csv.reader or csv.DictReader.
        """
        pass
It shouldn't be too difficult to implement, but it would be great if it already existed. If I implemented it myself, I'd use itertools.groupby to group by cols, then apply the op function to each group, then use itertools.chain to chain the results together. Is there a better solution?
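One caveat with this approach, as far as I understand it: itertools.groupby only groups consecutive rows with equal keys, so the rows have to be sorted by the same key first. A minimal sketch, where the sample rows and the `key` helper are made up for illustration:

```python
import itertools

# Hypothetical sample rows, shaped like csv.DictReader output.
rows = [
    {"city": "Oslo", "sales": 1},
    {"city": "Rome", "sales": 2},
    {"city": "Oslo", "sales": 3},
]

def key(row):
    return row["city"]

# Without sorting, "Oslo" shows up as two separate groups.
unsorted_keys = [k for k, _ in itertools.groupby(rows, key=key)]

# After sorting by the same key, each city forms exactly one group.
sorted_keys = [k for k, _ in itertools.groupby(sorted(rows, key=key), key=key)]
```

So any implementation built on groupby probably needs a sort step (or a dict-based grouping instead).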
This is the implementation I drafted up:
    import itertools

    def ddply(rows, cols, op=lambda group_rows: group_rows):
        """Group rows by cols, then apply the function op to each group.

        rows is a list of values or dicts with column names
        (like those read from csv.reader or csv.DictReader).
        """
        def group_key(row):
            # A tuple, not a generator: keys must be sortable and comparable.
            return tuple(row[col] for col in cols)

        rows = sorted(rows, key=group_key)
        return itertools.chain.from_iterable(
            op(group_rows)
            for k, group_rows in itertools.groupby(rows, key=group_key))
Another step would be to have a set of predefined functions that could be applied to each group, such as sum and other utility functions.
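To make that last idea concrete, here is a rough sketch of what such a predefined function might look like. The `summarise` helper and the sample rows are hypothetical names of my own, not part of plyr or any existing Python module, and this version of ddply passes each group to op as a list (an assumption that differs slightly from the draft, so that op can index into the group):

```python
import itertools

def ddply(rows, cols, op=lambda group_rows: group_rows):
    def group_key(row):
        return tuple(row[col] for col in cols)
    rows = sorted(rows, key=group_key)
    # Each group is materialised as a list before being handed to op.
    return itertools.chain.from_iterable(
        op(list(group_rows))
        for _, group_rows in itertools.groupby(rows, key=group_key))

def summarise(group_cols, value_col, func):
    """Hypothetical helper: build an op that reduces value_col with func."""
    def op(group_rows):
        first = group_rows[0]
        result = {col: first[col] for col in group_cols}
        result[value_col] = func(row[value_col] for row in group_rows)
        return [result]  # a one-row list, so chaining still works
    return op

# Made-up sample data, shaped like csv.DictReader output.
rows = [
    {"city": "Oslo", "sales": 1},
    {"city": "Rome", "sales": 2},
    {"city": "Oslo", "sales": 3},
]
totals = list(ddply(rows, ["city"], summarise(["city"], "sales", sum)))
```

Any reducer that accepts an iterable (sum, max, min, and so on) could be passed as `func` here.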