Guinea Pig
Quick Start
Running wordcount.py
Set up a directory that contains the file gp.py and a second script called wordcount.py, which contains this code:
    # always start like this
    from gp import *
    import sys

    # supporting routines can go here
    def tokens(line):
        for tok in line.split():
            yield tok.lower()

    # always subclass Planner
    class WordCount(Planner):
        wc = ReadLines('corpus.txt') | FlattenBy(by=tokens) | Group(by=lambda x:x, reducingWith=ReduceToCount())

    # always end like this
    if __name__ == "__main__":
        WordCount().main(sys.argv)
Then type the command:
% python tutorial/wordcount.py --store wc
After a couple of seconds it will return, and you can see the wordcounts with
% head wc.gp
Understanding the wordcount example
A longer example
There's also a less concise but easier-to-explain wordcount file, longer-wordcount.py, with this view definition:
    class WordCount(Planner):
        lines = ReadLines('corpus.txt')
        words = Flatten(lines,by=tokens)
        wordCount = Group(words, by=lambda x:x, reducingTo=ReduceToCount())
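To make the data flow concrete, here is a minimal plain-Python sketch of what the three views (lines, words, wordCount) compute. It does not use Guinea Pig at all: flatten and group_count are hypothetical stand-ins written only for illustration, and a two-line in-memory list is assumed in place of corpus.txt.

```python
from collections import Counter

def tokens(line):
    # Same helper as in the wordcount script: lowercased whitespace tokens.
    for tok in line.split():
        yield tok.lower()

def flatten(rows, by):
    # Hypothetical stand-in for Flatten: apply `by` to each row
    # and yield every item it produces.
    for row in rows:
        for item in by(row):
            yield item

def group_count(rows, by):
    # Hypothetical stand-in for Group with a counting reducer:
    # count how many rows share each key extracted by `by`.
    counts = Counter()
    for row in rows:
        counts[by(row)] += 1
    return sorted(counts.items())

# Assumed sample data, used here instead of corpus.txt.
lines = ["The quick brown fox", "the lazy dog and the fox"]
words = flatten(lines, by=tokens)
word_count = group_count(words, by=lambda x: x)

for word, n in word_count:
    print(word, n)
```

Each step consumes the output of the previous one, which is the same shape as the view definition above: one named intermediate result per line.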
Functions instead of fields
The Group view is quite flexible. The "by" clause is a function (here a lambda) that extracts an arbitrary value that defines the groups. In the absence of a "reducingTo" clause, the result of grouping is a tuple (key,[row1,...,rowN]), where row1 through rowN are the rows that have the indicated key (as extracted by the "by" clause). The "reducingTo" argument is an optimization, which you can define with an instance of the ReduceTo class. For instance, instead of using the ReduceToCount() subclass, you could have used

    ReduceTo(int,by=lambda accum,val:accum+1)

where the first argument is the type of the output (and defines the initial value of the accumulator, which here will be int(), or zero) and the second is the function that is used to reduce values pairwise.
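The two grouping behaviours described above can be sketched in plain Python. This is an illustration under stated assumptions, not Guinea Pig's implementation: group is a hypothetical helper, and the reducer is passed as an (output type, pairwise function) pair to mimic the ReduceTo(int, by=...) call.

```python
def group(rows, by, reducing_to=None):
    """Hypothetical stand-in for Group; not the library's code."""
    if reducing_to is None:
        # No reducer: collect every row under its key,
        # giving (key, [row1, ..., rowN]) pairs.
        buckets = {}
        for row in rows:
            buckets.setdefault(by(row), []).append(row)
        return sorted(buckets.items())

    # With a reducer: keep one accumulator per key instead of a list.
    out_type, reduce_fn = reducing_to
    accums = {}
    for row in rows:
        key = by(row)
        if key not in accums:
            accums[key] = out_type()  # int() is 0, the initial accumulator
        accums[key] = reduce_fn(accums[key], row)
    return sorted(accums.items())

# Assumed sample rows.
rows = ["b", "a", "b", "c", "b"]

grouped = group(rows, by=lambda x: x)
counted = group(rows, by=lambda x: x,
                reducing_to=(int, lambda accum, val: accum + 1))

print(grouped)
print(counted)
```

In this sketch the reducing variant never builds the per-key list of rows, which is one plausible reason to treat the reducer as an optimization.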
Note the use of functions as parameters in Group and Flatten. Guinea Pig has no notion of records: rows can be any Python object (although the