Guinea Pig

From Cohen Courses
Revision as of 14:21, 28 May 2014 by Wcohen (talk | contribs)


Quick Start

Running wordcount.py

Set up a directory that contains the file gp.py and a second script called wordcount.py, which contains this code:

# always start like this
from gp import *
import sys

# supporting routines can go here
def tokens(line):
    for tok in line.split():
        yield tok.lower()

# always subclass Planner
class WordCount(Planner):

    wc = ReadLines('corpus.txt') | FlattenBy(by=tokens) | Group(by=lambda x:x, reducingTo=ReduceToCount())

# always end like this
if __name__ == "__main__":
    WordCount().main(sys.argv)

Then type the command:

% python wordcount.py --store wc

After a couple of seconds it will return, and you can see the word counts with:

% head wc.gp

Understanding the wordcount example

A longer example

There's also a less concise but easier-to-explain wordcount file, longer-wordcount.py, with this view definition:

class WordCount(Planner):
    lines = ReadLines('corpus.txt')
    words = Flatten(lines,by=tokens)
    wordCount = Group(words, by=lambda x:x, reducingTo=ReduceToCount())
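
One way to read the three views is as three ordinary Python steps. The sketch below mimics them in plain Python without Guinea Pig: the helpers flatten and group_count are hypothetical stand-ins rather than anything defined in gp.py, and an in-memory list of lines stands in for corpus.txt.

```python
from collections import Counter

def tokens(line):
    # same supporting routine as in wordcount.py
    for tok in line.split():
        yield tok.lower()

def flatten(rows, by):
    # hypothetical stand-in: apply `by` to each row and chain the results
    for row in rows:
        for item in by(row):
            yield item

def group_count(rows, by):
    # hypothetical stand-in: group rows by key, keep only each group's size
    return Counter(by(row) for row in rows)

lines = ["The quick brown fox", "the lazy dog", "The end"]  # stands in for corpus.txt
words = flatten(lines, by=tokens)
word_count = group_count(words, by=lambda x: x)
print(sorted(word_count.items()))
```

Each step consumes the output of the previous one, which is the same dependency structure that the lines, words and wordCount views express.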

Functions instead of fields

The Group view is actually quite flexible: the "by" clause is a lambda that extracts an arbitrary value that defines the groups, and in the absence of a "reducingTo" clause, the result of grouping is a tuple (key,[row1,...,rowN]) where row1,...,rowN are the rows that have the indicated "key" (as extracted by the "by" clause).
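
To make the shape of the ungrouped result concrete, here is a minimal plain-Python sketch of grouping with no reducer. The function group is a hypothetical illustration, not Guinea Pig's implementation; it only shows the (key,[row1,...,rowN]) output described above.

```python
from collections import defaultdict

def group(rows, by):
    # hypothetical stand-in: collect rows under the key extracted by `by`
    groups = defaultdict(list)
    for row in rows:
        groups[by(row)].append(row)
    return list(groups.items())

words = ["the", "quick", "the", "dog", "quick", "the"]

# identity key: every distinct word is its own group
grouped = group(words, by=lambda x: x)

# an arbitrary key: group words by their first letter instead
by_initial = group(words, by=lambda w: w[0])

for key, rows in grouped:
    print(key, rows)
```

Changing only the "by" function changes what counts as a group, while the rows inside each group are passed through untouched.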

The "reducingTo" argument is an optimization, which you can define with an instance of the ReduceTo class.  For instance, instead of using the ReduceToCount() subclass, you could have used

ReduceTo(int,by=lambda accum,val:accum+1)

where the first argument is the type of the output (and defines the initial value of the accumulator, which here will be int(), or zero) and the second is the function that is used to reduce values pairwise.
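
The accumulator behaviour described here can be sketched with functools.reduce from the standard library. The function reduce_to below is a hypothetical illustration of the idea (the type supplies the starting value, the function folds in one row at a time); it is not the ReduceTo class itself.

```python
from functools import reduce

def reduce_to(base_type, by, values):
    # base_type() supplies the initial accumulator: int() is 0, list() is []
    return reduce(by, values, base_type())

rows = ["the", "the", "the"]  # the rows of one group

# counting: ignore the row, add one per row
count = reduce_to(int, lambda accum, val: accum + 1, rows)

# a different reducer over the same rows: total number of characters
total_len = reduce_to(int, lambda accum, val: accum + len(val), rows)

print(count, total_len)
```

Because the accumulator is updated row by row, the full list of rows in a group never has to be materialized, which is presumably why the argument is described as an optimization.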

Note the use of functions as parameters in Group and Flatten. Guinea Pig has no notion of records: rows can be any Python object.
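
As a rough illustration of why function parameters make a record type unnecessary, the sketch below runs one counting helper over rows of two different shapes. The helper count_by and the sample rows are hypothetical and exist only for this example; the only thing that knows the layout of a row is the function passed as "by".

```python
from collections import defaultdict

def count_by(rows, by):
    # hypothetical helper: `by` is the only code that inspects a row's layout
    counts = defaultdict(int)
    for row in rows:
        counts[by(row)] += 1
    return dict(counts)

# rows as tuples of (document id, word)
tuple_rows = [("d1", "apple"), ("d2", "pear"), ("d1", "fig")]

# rows as dictionaries carrying the same kind of information
dict_rows = [{"doc": "d1", "word": "apple"}, {"doc": "d2", "word": "pear"}]

per_doc_tuples = count_by(tuple_rows, by=lambda row: row[0])
per_doc_dicts = count_by(dict_rows, by=lambda row: row["doc"])

print(per_doc_tuples, per_doc_dicts)
```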