Difference between revisions of "Guinea Pig"
Line 1: | Line 1: | ||
+ | Last login: Mon May 5 10:29:51 on console | ||
+ | eddy:~ wcohen$ kinit5 | ||
+ | -bash: kinit5: command not found | ||
+ | eddy:~ wcohen$ kinit | ||
+ | wcohen@CS.CMU.EDU's Password: | ||
+ | eddy:~ wcohen$ aklog | ||
+ | eddy:~ wcohen$ cd ~/Documents/code/pyhack/guineapig/ | ||
+ | eddy:guineapig wcohen$ ls | ||
+ | #notes.txt# TODO.txt demo gp.py.~1.25.~ notes-for-pig.txt tgp.py udocvec1.gpri | ||
+ | CVS TODO.txt.~1.7.~ docvec1.gpmo gp.pyc parks.txt tmp udocvec3.gpmo | ||
+ | README.txt data gp.py j.gpmo row.py udocvec1.gpmo | ||
+ | eddy:guineapig wcohen$ cvs update -dP | ||
+ | Password: | ||
+ | ? docvec1.gpmo | ||
+ | ? gp.pyc | ||
+ | ? j.gpmo | ||
+ | ? notes-for-pig.txt | ||
+ | ? parks.txt | ||
+ | ? row.py | ||
+ | ? tgp.py | ||
+ | ? tmp | ||
+ | ? udocvec1.gpmo | ||
+ | ? udocvec1.gpri | ||
+ | ? udocvec3.gpmo | ||
+ | ? demo/phirl-naive.pu | ||
+ | ? demo/template.py | ||
+ | cvs update: Updating . | ||
+ | cvs update: Updating data | ||
+ | cvs update: Updating demo | ||
+ | eddy:guineapig wcohen$ emacs | ||
+ | |||
+ | [1]+ Stopped emacs | ||
+ | eddy:guineapig wcohen$ %% | ||
+ | emacs | ||
+ | eddy:guineapig wcohen$ kinit | ||
+ | wcohen@CS.CMU.EDU's Password: | ||
+ | eddy:guineapig wcohen$ aklog | ||
+ | eddy:guineapig wcohen$ kinit | ||
+ | wcohen@CS.CMU.EDU's Password: | ||
+ | eddy:guineapig wcohen$ aklog | ||
+ | eddy:guineapig wcohen$ pushd ~/Desktop/afs-home/keepers/ | ||
+ | ~/Desktop/afs-home/keepers ~/Documents/code/pyhack/guineapig | ||
+ | eddy:keepers wcohen$ ls | ||
+ | CVS email-taxonomy | ||
+ | Shortcut to imls.lnk expt | ||
+ | Thumbs.db grants | ||
+ | a12-handout.docx imls | ||
+ | amended_key.docx letters | ||
+ | bib masters-program | ||
+ | blogs meetings | ||
+ | budgets misc-research | ||
+ | cikara_cohen_redlawsk_proposal_Nov2013_v5_wc.docx planning | ||
+ | classes radar-names | ||
+ | consulting rcwang-hire | ||
+ | data science reviews | ||
+ | disc-lim teaching | ||
+ | dsedra-rec.txt thesis-proposals | ||
+ | eddy:keepers wcohen$ mkdir pnc | ||
+ | eddy:keepers wcohen$ ls ~/Desktop/summary* | ||
+ | /Users/wcohen/Desktop/summary - funding fall 2013.xlsx /Users/wcohen/Desktop/summary- funding spring 2014.xlsx | ||
+ | /Users/wcohen/Desktop/summary- funding spring 2013.xlsx | ||
+ | eddy:keepers wcohen$ mc ~/Desktop/summary* pnc | ||
+ | -bash: mc: command not found | ||
+ | eddy:keepers wcohen$ mv ~/Desktop/summary* pnc | ||
+ | mv: pnc/summary - funding fall 2013.xlsx: set owner/group (was: 502/20): | ||
+ | eddy:keepers wcohen$ popd | ||
+ | ~/Documents/code/pyhack/guineapig | ||
+ | eddy:guineapig wcohen$ ls | ||
+ | CVS demo guineapig.new ndoc.gp s.gp try.py udocvec3.gpmo | ||
+ | Makefile docFreq.gp guineapig.py ndoc2.gp simpairs.gp try.py~ wc-for-bluecorpus.txt | ||
+ | TODO.txt docvec.gp guineapig.pyc norm.gp softjoin1.gpmo tutorial wc-for-redcorpus.txt | ||
+ | TODO.txt.~1.26.~ docvec1.gpmo look.gp notes-for-pig.txt softjoin2.gp udocvec.gp wc-for-s | ||
+ | cuts.txt fields.gp look1.gpmo r.gp tmp udocvec1.gpmo wc.gp | ||
+ | data guineapig.bak look2.gp rel1Docs.gp trial.py udocvec2.gp | ||
+ | data.gp guineapig.html look3.gpmo rel2Docs.gp trial.py~ udocvec3.gp | ||
+ | eddy:guineapig wcohen$ cvs update -dP | ||
+ | Password: | ||
+ | ? .DS_Store | ||
+ | ? Makefile | ||
+ | ? cuts.txt | ||
+ | ? data.gp | ||
+ | ? docFreq.gp | ||
+ | ? docvec.gp | ||
+ | ? docvec1.gpmo | ||
+ | ? fields.gp | ||
+ | ? guineapig.new | ||
+ | ? guineapig.pyc | ||
+ | ? look.gp | ||
+ | ? look1.gpmo | ||
+ | ? look2.gp | ||
+ | ? look3.gpmo | ||
+ | ? ndoc.gp | ||
+ | ? ndoc2.gp | ||
+ | ? norm.gp | ||
+ | ? notes-for-pig.txt | ||
+ | ? r.gp | ||
+ | ? rel1Docs.gp | ||
+ | ? rel2Docs.gp | ||
+ | ? s.gp | ||
+ | ? simpairs.gp | ||
+ | ? softjoin1.gpmo | ||
+ | ? softjoin2.gp | ||
+ | ? tmp | ||
+ | ? trial.py | ||
+ | ? try.py | ||
+ | ? udocvec.gp | ||
+ | ? udocvec1.gpmo | ||
+ | ? udocvec2.gp | ||
+ | ? udocvec3.gp | ||
+ | ? udocvec3.gpmo | ||
+ | ? wc-for-bluecorpus.txt | ||
+ | ? wc-for-redcorpus.txt | ||
+ | ? wc-for-s | ||
+ | ? wc.gp | ||
+ | ? data/dkos-data.txt | ||
+ | ? data/redstate-data-small.txt | ||
+ | ? data/redstate-data.txt | ||
+ | ? data/redstate-small-clean.txt | ||
+ | ? demo/params2.py | ||
+ | ? tutorial/guineapig.pyc | ||
+ | ? tutorial/params.pyc | ||
+ | ? tutorial/wc.gp | ||
+ | cvs update: Updating . | ||
+ | RCS file: /usr1/cvsroot/pyhack/guineapig/TODO.txt,v | ||
+ | retrieving revision 1.28 | ||
+ | retrieving revision 1.33 | ||
+ | Merging differences between 1.28 and 1.33 into TODO.txt | ||
+ | rcsmerge: warning: conflicts during merge | ||
+ | cvs update: conflicts found in TODO.txt | ||
+ | C TODO.txt | ||
+ | P guineapig.py | ||
+ | cvs update: Updating data | ||
+ | cvs update: Updating demo | ||
+ | P demo/ugp1.py | ||
+ | U demo/wordprob.py | ||
+ | cvs update: Updating tutorial | ||
+ | P tutorial/guineapig.py | ||
+ | U tutorial/instance-wordcount.py | ||
+ | U tutorial/multi-wordcount.py | ||
+ | cvs update: tutorial/multi.py is no longer in the repository | ||
+ | U tutorial/param-wordcount.py | ||
+ | cvs update: tutorial/params.py is no longer in the repository | ||
+ | U tutorial/wikipage.txt | ||
+ | eddy:guineapig wcohen$ ls | ||
+ | CVS demo guineapig.new ndoc.gp s.gp try.py udocvec3.gpmo | ||
+ | Makefile docFreq.gp guineapig.py ndoc2.gp simpairs.gp try.py~ wc-for-bluecorpus.txt | ||
+ | TODO.txt docvec.gp guineapig.pyc norm.gp softjoin1.gpmo tutorial wc-for-redcorpus.txt | ||
+ | TODO.txt.~1.26.~ docvec1.gpmo look.gp notes-for-pig.txt softjoin2.gp udocvec.gp wc-for-s | ||
+ | cuts.txt fields.gp look1.gpmo r.gp tmp udocvec1.gpmo wc.gp | ||
+ | data guineapig.bak look2.gp rel1Docs.gp trial.py udocvec2.gp | ||
+ | data.gp guineapig.html look3.gpmo rel2Docs.gp trial.py~ udocvec3.gp | ||
+ | eddy:guineapig wcohen$ rm *.gp *.gpmo | ||
+ | eddy:guineapig wcohen$ ls | ||
+ | CVS cuts.txt guineapig.html notes-for-pig.txt try.py wc-for-redcorpus.txt | ||
+ | Makefile data guineapig.new tmp try.py~ wc-for-s | ||
+ | TODO.txt demo guineapig.py trial.py tutorial | ||
+ | TODO.txt.~1.26.~ guineapig.bak guineapig.pyc trial.py~ wc-for-bluecorpus.txt | ||
+ | eddy:guineapig wcohen$ rm wc-for-* | ||
+ | eddy:guineapig wcohen$ ls | ||
+ | CVS TODO.txt.~1.26.~ demo guineapig.new notes-for-pig.txt trial.py~ tutorial | ||
+ | Makefile cuts.txt guineapig.bak guineapig.py tmp try.py | ||
+ | TODO.txt data guineapig.html guineapig.pyc trial.py try.py~ | ||
+ | eddy:guineapig wcohen$ rm -rf tmp | ||
+ | eddy:guineapig wcohen$ pwd | ||
+ | /Users/wcohen/Documents/code/pyhack/guineapig | ||
+ | eddy:guineapig wcohen$ pwd | ||
+ | /Users/wcohen/Documents/code/pyhack/guineapig | ||
+ | eddy:guineapig wcohen$ emacs | ||
+ | |||
== Quick Start == | == Quick Start == | ||
Line 16: | Line 184: | ||
def tokens(line): | def tokens(line): | ||
for tok in line.split(): | for tok in line.split(): | ||
− | + | yield tok.lower() | |
#always subclass Planner | #always subclass Planner | ||
Line 56: | Line 224: | ||
==== Functions instead of fields ==== | ==== Functions instead of fields ==== | ||
− | The <code>Group</code> view is actually quite flexible: the "by" clause is a lambda that extracts an arbitrary value that defines the groups, and in the absence of a " | + | The <code>Group</code> view is actually quite flexible: the "by" clause is a lambda that extracts an arbitrary value that defines the groups, and in the absence of a "reducingTo" clause, the result of grouping is a tuple (key,[row1,...,rowN]) where the rowi's are the rows that have the indicated "key" (as extracted by the "by" clause). |
+ | The "reducingTo" argument is an optimization, which you can define with an instance of the ReduceTo class. For instance, instead of using the | ||
ReduceToCount() subclass, you could have used | ReduceToCount() subclass, you could have used | ||
<pre> | <pre> | ||
ReduceTo(int,by=lambda accum,val:accum+1) | ReduceTo(int,by=lambda accum,val:accum+1) | ||
</pre> | </pre> | ||
− | where the first argument is the type of the output (and defines the initial value of the accumulator, which here will be <code>int()</code>, or zero) and the second is | + | where the first argument is the type of the output (and defines the initial value of the accumulator, which here will be <code>int()</code>, or zero) and the second is the function that is used to reduce values pairwise. |
− | + | Note the use of functions as parameters in <code>Group</code> and <code>Flatten</code>. Guinea Pig has no notion of records: rows can be any python object (although the |
Revision as of 14:21, 28 May 2014
Quick Start
Running wordcount.py
Set up a directory that contains the file gp.py and a second script called wordcount.py which contains this code:
    # always start like this
    from gp import *
    import sys

    # supporting routines can go here
    def tokens(line):
        for tok in line.split():
            yield tok.lower()

    # always subclass Planner
    class WordCount(Planner):

        wc = ReadLines('corpus.txt') | FlattenBy(by=tokens) | Group(by=lambda x:x, reducingWith=ReduceToCount())

    # always end like this
    if __name__ == "__main__":
        WordCount().main(sys.argv)
Then type the command:
% python tutorial/wordcount.py --store wc
After a couple of seconds it will return, and you can see the wordcounts with
% head wc.gp
Understanding the wordcount example
A longer example
There's also a less concise but easier-to-explain wordcount file, longer-wordcount.py, with this view definition:
    class WordCount(Planner):
        lines = ReadLines('corpus.txt')
        words = Flatten(lines,by=tokens)
        wordCount = Group(words, by=lambda x:x, reducingTo=ReduceToCount())
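To make the three views concrete, here is a minimal plain-Python sketch of what each step computes on a small in-memory list of lines. It does not import or call Guinea Pig: the helpers `flatten` and `group_count` and the sample lines are hypothetical stand-ins that only mirror the roles of the Flatten view and of Group with ReduceToCount.

```python
from collections import Counter

def tokens(line):
    # same supporting routine as in wordcount.py
    for tok in line.split():
        yield tok.lower()

def flatten(rows, by):
    # hypothetical stand-in for Flatten: apply `by` to each row
    # and emit every item it produces as a separate row
    for row in rows:
        for item in by(row):
            yield item

def group_count(rows, by):
    # hypothetical stand-in for Group(..., reducingTo=ReduceToCount()):
    # count how many rows share each key extracted by `by`
    counts = Counter(by(row) for row in rows)
    return sorted(counts.items())

lines = ["The quick fox", "the lazy dog"]          # plays the role of corpus.txt
words = list(flatten(lines, by=tokens))            # one row per lowercased token
word_count = group_count(words, by=lambda x: x)    # (word, count) pairs
print(word_count)
```

Each intermediate name (`lines`, `words`, `word_count`) corresponds to one view in the longer example, which is what makes that version easier to explain step by step.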
Functions instead of fields
The Group view is actually quite flexible: the "by" clause is a lambda that extracts an arbitrary value that defines the groups, and in the absence of a "reducingTo" clause, the result of grouping is a tuple (key,[row1,...,rowN]) where the rowi's are the rows that have the indicated "key" (as extracted by the "by" clause). The "reducingTo" argument is an optimization, which you can define with an instance of the ReduceTo class. For instance, instead of using the ReduceToCount() subclass, you could have used
    ReduceTo(int,by=lambda accum,val:accum+1)
where the first argument is the type of the output (and defines the initial value of the accumulator, which here will be int(), or zero) and the second is the function that combines the accumulator with each value in the group in turn.
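The two grouping behaviours described above can be modelled in a few lines of plain Python. This is only a sketch of the semantics, not Guinea Pig's implementation: the function `group`, its `reducing_to` parameter (a pair of output type and reducing function) and the sample rows are all hypothetical names chosen for illustration.

```python
def group(rows, by, reducing_to=None):
    """Hypothetical model of a Group view over an in-memory list of rows."""
    if reducing_to is None:
        # no reducer: collect the full list of rows for every key
        groups = {}
        for row in rows:
            groups.setdefault(by(row), []).append(row)
        return sorted(groups.items())
    # with a reducer: keep one accumulator per key instead of a list of rows
    base_type, reduce_fn = reducing_to
    accums = {}
    for row in rows:
        key = by(row)
        # base_type() supplies the initial accumulator, e.g. int() == 0
        accums[key] = reduce_fn(accums.get(key, base_type()), row)
    return sorted(accums.items())

rows = ["a", "b", "a", "c", "a"]

# analogue of Group with only a "by" clause: (key, [row1, ..., rowN]) tuples
grouped = group(rows, by=lambda x: x)

# analogue of ReduceTo(int, by=lambda accum,val:accum+1): count rows per key
counted = group(rows, by=lambda x: x,
                reducing_to=(int, lambda accum, val: accum + 1))

print(grouped)
print(counted)
```

In this model the reducing path never builds the per-key row lists, which is one way to read the remark that the reducer is an optimization.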
Note the use of functions as parameters in Group and Flatten. Guinea Pig has no notion of records: rows can be any python object (although the