Difference between revisions of "Guinea Pig"
| Line 1: | Line 1: | ||
| + | Last login: Mon May 5 10:29:51 on console | ||
| + | eddy:~ wcohen$ kinit5 | ||
| + | -bash: kinit5: command not found | ||
| + | eddy:~ wcohen$ kinit | ||
| + | wcohen@CS.CMU.EDU's Password: | ||
| + | eddy:~ wcohen$ aklog | ||
| + | eddy:~ wcohen$ cd ~/Documents/code/pyhack/guineapig/ | ||
| + | eddy:guineapig wcohen$ ls | ||
| + | #notes.txt# TODO.txt demo gp.py.~1.25.~ notes-for-pig.txt tgp.py udocvec1.gpri | ||
| + | CVS TODO.txt.~1.7.~ docvec1.gpmo gp.pyc parks.txt tmp udocvec3.gpmo | ||
| + | README.txt data gp.py j.gpmo row.py udocvec1.gpmo | ||
| + | eddy:guineapig wcohen$ cvs update -dP | ||
| + | Password: | ||
| + | ? docvec1.gpmo | ||
| + | ? gp.pyc | ||
| + | ? j.gpmo | ||
| + | ? notes-for-pig.txt | ||
| + | ? parks.txt | ||
| + | ? row.py | ||
| + | ? tgp.py | ||
| + | ? tmp | ||
| + | ? udocvec1.gpmo | ||
| + | ? udocvec1.gpri | ||
| + | ? udocvec3.gpmo | ||
| + | ? demo/phirl-naive.pu | ||
| + | ? demo/template.py | ||
| + | cvs update: Updating . | ||
| + | cvs update: Updating data | ||
| + | cvs update: Updating demo | ||
| + | eddy:guineapig wcohen$ emacs | ||
| + | |||
| + | [1]+ Stopped emacs | ||
| + | eddy:guineapig wcohen$ %% | ||
| + | emacs | ||
| + | eddy:guineapig wcohen$ kinit | ||
| + | wcohen@CS.CMU.EDU's Password: | ||
| + | eddy:guineapig wcohen$ aklog | ||
| + | eddy:guineapig wcohen$ kinit | ||
| + | wcohen@CS.CMU.EDU's Password: | ||
| + | eddy:guineapig wcohen$ aklog | ||
| + | eddy:guineapig wcohen$ pushd ~/Desktop/afs-home/keepers/ | ||
| + | ~/Desktop/afs-home/keepers ~/Documents/code/pyhack/guineapig | ||
| + | eddy:keepers wcohen$ ls | ||
| + | CVS email-taxonomy | ||
| + | Shortcut to imls.lnk expt | ||
| + | Thumbs.db grants | ||
| + | a12-handout.docx imls | ||
| + | amended_key.docx letters | ||
| + | bib masters-program | ||
| + | blogs meetings | ||
| + | budgets misc-research | ||
| + | cikara_cohen_redlawsk_proposal_Nov2013_v5_wc.docx planning | ||
| + | classes radar-names | ||
| + | consulting rcwang-hire | ||
| + | data science reviews | ||
| + | disc-lim teaching | ||
| + | dsedra-rec.txt thesis-proposals | ||
| + | eddy:keepers wcohen$ mkdir pnc | ||
| + | eddy:keepers wcohen$ ls ~/Desktop/summary* | ||
| + | /Users/wcohen/Desktop/summary - funding fall 2013.xlsx /Users/wcohen/Desktop/summary- funding spring 2014.xlsx | ||
| + | /Users/wcohen/Desktop/summary- funding spring 2013.xlsx | ||
| + | eddy:keepers wcohen$ mc ~/Desktop/summary* pnc | ||
| + | -bash: mc: command not found | ||
| + | eddy:keepers wcohen$ mv ~/Desktop/summary* pnc | ||
| + | mv: pnc/summary - funding fall 2013.xlsx: set owner/group (was: 502/20):eddy:keepers wcohen$eddy:keeeddy:keddyeddy:keddyeddyededdy:keddy:keepeeddy:keeperseddy:keepers wcoeddy:keepers weddy:keeperseddy:keddy:keddy:keeeddy:keeeddy:keeeddy:keeeddy:keepeeddy:keeeddy:keepers weddyeddy:keddyeddy:keddy:keeeddyeddyeddy:keddy:keeeddyeddyeddyeddyedededededdy:keepers wcohen$ popd | ||
| + | ~/Documents/code/pyhack/guineapig | ||
| + | eddy:guineapig wcohen$ ls | ||
| + | CVS demo guineapig.new ndoc.gp s.gp try.py udocvec3.gpmo | ||
| + | Makefile docFreq.gp guineapig.py ndoc2.gp simpairs.gp try.py~ wc-for-bluecorpus.txt | ||
| + | TODO.txt docvec.gp guineapig.pyc norm.gp softjoin1.gpmo tutorial wc-for-redcorpus.txt | ||
| + | TODO.txt.~1.26.~ docvec1.gpmo look.gp notes-for-pig.txt softjoin2.gp udocvec.gp wc-for-s | ||
| + | cuts.txt fields.gp look1.gpmo r.gp tmp udocvec1.gpmo wc.gp | ||
| + | data guineapig.bak look2.gp rel1Docs.gp trial.py udocvec2.gp | ||
| + | data.gp guineapig.html look3.gpmo rel2Docs.gp trial.py~ udocvec3.gp | ||
| + | eddy:guineapig wcohen$ cvs update -dP | ||
| + | Password: | ||
| + | ? .DS_Store | ||
| + | ? Makefile | ||
| + | ? cuts.txt | ||
| + | ? data.gp | ||
| + | ? docFreq.gp | ||
| + | ? docvec.gp | ||
| + | ? docvec1.gpmo | ||
| + | ? fields.gp | ||
| + | ? guineapig.new | ||
| + | ? guineapig.pyc | ||
| + | ? look.gp | ||
| + | ? look1.gpmo | ||
| + | ? look2.gp | ||
| + | ? look3.gpmo | ||
| + | ? ndoc.gp | ||
| + | ? ndoc2.gp | ||
| + | ? norm.gp | ||
| + | ? notes-for-pig.txt | ||
| + | ? r.gp | ||
| + | ? rel1Docs.gp | ||
| + | ? rel2Docs.gp | ||
| + | ? s.gp | ||
| + | ? simpairs.gp | ||
| + | ? softjoin1.gpmo | ||
| + | ? softjoin2.gp | ||
| + | ? tmp | ||
| + | ? trial.py | ||
| + | ? try.py | ||
| + | ? udocvec.gp | ||
| + | ? udocvec1.gpmo | ||
| + | ? udocvec2.gp | ||
| + | ? udocvec3.gp | ||
| + | ? udocvec3.gpmo | ||
| + | ? wc-for-bluecorpus.txt | ||
| + | ? wc-for-redcorpus.txt | ||
| + | ? wc-for-s | ||
| + | ? wc.gp | ||
| + | ? data/dkos-data.txt | ||
| + | ? data/redstate-data-small.txt | ||
| + | ? data/redstate-data.txt | ||
| + | ? data/redstate-small-clean.txt | ||
| + | ? demo/params2.py | ||
| + | ? tutorial/guineapig.pyc | ||
| + | ? tutorial/params.pyc | ||
| + | ? tutorial/wc.gp | ||
| + | cvs update: Updating . | ||
| + | RCS file: /usr1/cvsroot/pyhack/guineapig/TODO.txt,v | ||
| + | retrieving revision 1.28 | ||
| + | retrieving revision 1.33 | ||
| + | Merging differences between 1.28 and 1.33 into TODO.txt | ||
| + | rcsmerge: warning: conflicts during merge | ||
| + | cvs update: conflicts found in TODO.txt | ||
| + | C TODO.txt | ||
| + | P guineapig.py | ||
| + | cvs update: Updating data | ||
| + | cvs update: Updating demo | ||
| + | P demo/ugp1.py | ||
| + | U demo/wordprob.py | ||
| + | cvs update: Updating tutorial | ||
| + | P tutorial/guineapig.py | ||
| + | U tutorial/instance-wordcount.py | ||
| + | U tutorial/multi-wordcount.py | ||
| + | cvs update: tutorial/multi.py is no longer in the repository | ||
| + | U tutorial/param-wordcount.py | ||
| + | cvs update: tutorial/params.py is no longer in the repository | ||
| + | U tutorial/wikipage.txt | ||
| + | eddy:guineapig wcohen$ ls | ||
| + | CVS demo guineapig.new ndoc.gp s.gp try.py udocvec3.gpmo | ||
| + | Makefile docFreq.gp guineapig.py ndoc2.gp simpairs.gp try.py~ wc-for-bluecorpus.txt | ||
| + | TODO.txt docvec.gp guineapig.pyc norm.gp softjoin1.gpmo tutorial wc-for-redcorpus.txt | ||
| + | TODO.txt.~1.26.~ docvec1.gpmo look.gp notes-for-pig.txt softjoin2.gp udocvec.gp wc-for-s | ||
| + | cuts.txt fields.gp look1.gpmo r.gp tmp udocvec1.gpmo wc.gp | ||
| + | data guineapig.bak look2.gp rel1Docs.gp trial.py udocvec2.gp | ||
| + | data.gp guineapig.html look3.gpmo rel2Docs.gp trial.py~ udocvec3.gp | ||
| + | eddy:guineapig wcohen$ rm *.gp *.gpmo | ||
| + | eddy:guineapig wcohen$ ls | ||
| + | CVS cuts.txt guineapig.html notes-for-pig.txt try.py wc-for-redcorpus.txt | ||
| + | Makefile data guineapig.new tmp try.py~ wc-for-s | ||
| + | TODO.txt demo guineapig.py trial.py tutorial | ||
| + | TODO.txt.~1.26.~ guineapig.bak guineapig.pyc trial.py~ wc-for-bluecorpus.txt | ||
| + | eddy:guineapig wcohen$ rm wc-for-* | ||
| + | eddy:guineapig wcohen$ ls | ||
| + | CVS TODO.txt.~1.26.~ demo guineapig.new notes-for-pig.txt trial.py~ tutorial | ||
| + | Makefile cuts.txt guineapig.bak guineapig.py tmp try.py | ||
| + | TODO.txt data guineapig.html guineapig.pyc trial.py try.py~ | ||
| + | eddy:guineapig wcohen$ rm -rf tmp | ||
| + | eddy:guineapig wcohen$ pwd | ||
| + | /Users/wcohen/Documents/code/pyhack/guineapig | ||
| + | eddy:guineapig wcohen$ pwd | ||
| + | /Users/wcohen/Documents/code/pyhack/guineapig | ||
| + | eddy:guineapig wcohen$ emacs | ||
| + | |||
== Quick Start == | == Quick Start == | ||
| Line 16: | Line 184: | ||
def tokens(line): | def tokens(line): | ||
for tok in line.split(): | for tok in line.split(): | ||
| − | + | yield tok.lower() | |
#always subclass Planner | #always subclass Planner | ||
| Line 56: | Line 224: | ||
==== Functions instead of fields ==== | ==== Functions instead of fields ==== | ||
| − | The <code>Group</code> view is actually quite flexible: the "by" clause is a lambda that extracts an arbitrary value that defines the groups, and in the absence of a " | + | The <code>Group</code> view is actually quite flexible: the "by" clause is a lambda that extracts an arbitrary value that defines the groups, and in the absence of a "re\ |
| + | ducingTo" clause, the result of grouping is a tuple (key,[row1,...,rowN]) where the rowi's are the rows that have the indicated "key" (as extracted by the "by" clause). \ | ||
| + | The "reduceTo" argument is an optimization, which you can define with an instance of the ReduceTo class. For instance, instead of using the | ||
ReduceToCount() subclass, you could have used | ReduceToCount() subclass, you could have used | ||
<pre> | <pre> | ||
ReduceTo(int,by=lambda accum,val:accum+1) | ReduceTo(int,by=lambda accum,val:accum+1) | ||
</pre> | </pre> | ||
| − | where the first argument is the type of the output (and defines the initial value of the accumulator, which here will be <code>int()</code>, or zero) and the second is | + | where the first argument is the type of the output (and defines the initial value of the accumulator, which here will be <code>int()</code>, or zero) and the second is t\ |
| − | + | he function that is used to reduce values pairwise. | |
| − | + | Note the use of functions as parameters in <code>Group</code> and <code>Flatten</code>. Guinea Pig has no notion of records: rows can be any python object (although the\ | |
| − | + | -uu-:---F1 wikipage.txt Top L1 CVS-1.1 (Text)---------------------------------------------------------------------------------------------------------------------- | |
| − | + | Loading vc-cvs...done | |
Revision as of 13:21, 28 May 2014
Quick Start
Running wordcount.py
Set up a directory that contains the file gp.py and a
second script called wordcount.py which contains this
code:
# always start like this
from gp import *
import sys

# supporting routines can go here
def tokens(line):
    for tok in line.split():
        yield tok.lower()

# always subclass Planner
class WordCount(Planner):
    wc = ReadLines('corpus.txt') | FlattenBy(by=tokens) | Group(by=lambda x:x, reducingWith=ReduceToCount())

# always end like this
if __name__ == "__main__":
    WordCount().main(sys.argv)
Then type the command:
% python tutorial/wordcount.py --store wc
After a couple of seconds it will return, and you can see the wordcounts with
% head wc.gp
Understanding the wordcount example
A longer example
There's also a less concise but easier-to-explain wordcount file,
longer-wordcount.py, with this view definition:
class WordCount(Planner):
    lines = ReadLines('corpus.txt')
    words = Flatten(lines,by=tokens)
    wordCount = Group(words, by=lambda x:x, reducingTo=ReduceToCount())
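As a rough illustration of what these three views compute, the following plain-Python sketch mimics them without using Guinea Pig at all. The helper names flatten and group_count and the in-memory list of lines are assumptions made for this example; they are not part of the Guinea Pig API, and the real views read from and write to files rather than lists.

```python
from collections import Counter

def tokens(line):
    # same supporting routine as in the wordcount script
    for tok in line.split():
        yield tok.lower()

def flatten(rows, by):
    # hypothetical stand-in for Flatten: 'by' maps one row to many rows
    for row in rows:
        for item in by(row):
            yield item

def group_count(rows, by):
    # hypothetical stand-in for Group(..., reducingTo=ReduceToCount()):
    # count how many rows share each key extracted by 'by'
    counts = Counter()
    for row in rows:
        counts[by(row)] += 1
    return sorted(counts.items())

# stand-in for ReadLines('corpus.txt'): a tiny in-memory corpus
lines = ["The quick brown fox", "the lazy dog"]
words = flatten(lines, by=tokens)
word_count = group_count(words, by=lambda x: x)
print(word_count)
```

Each step consumes the output of the previous one, which is the same dataflow that the lines, words and wordCount views describe.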
Functions instead of fields
The Group view is actually quite flexible: the "by" clause is a lambda that extracts an arbitrary value that defines the groups, and in the absence of a "reducingTo" clause, the result of grouping is a tuple (key,[row1,...,rowN]) where row1,...,rowN are the rows that have the indicated "key" (as extracted by the "by" clause). The "reducingTo" argument is an optimization, which you can define with an instance of the ReduceTo class. For instance, instead of using the ReduceToCount() subclass, you could have used
ReduceTo(int,by=lambda accum,val:accum+1)
where the first argument is the type of the output (and defines the initial value of the accumulator, which here will be int(), or zero) and the second is the function that folds each value of a group into the accumulator, one value at a time.
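To make the two behaviours concrete, here is a minimal plain-Python sketch of grouping with and without a reducer. The function group and its reducing_to parameter (a pair of output type and fold function) are hypothetical names chosen for this illustration; they only imitate the semantics described above and are not Guinea Pig's implementation.

```python
from functools import reduce
from itertools import groupby

def group(rows, by, reducing_to=None):
    # sort so that rows with equal keys are adjacent, then group them
    keyed = sorted(rows, key=by)
    for key, members in groupby(keyed, key=by):
        members = list(members)
        if reducing_to is None:
            # no reducer: emit (key, [row1, ..., rowN])
            yield (key, members)
        else:
            # reducer: start from out_type() and fold each row in
            out_type, fn = reducing_to
            yield (key, reduce(fn, members, out_type()))

words = ["a", "b", "a", "c", "a"]

grouped = list(group(words, by=lambda x: x))
counted = list(group(words, by=lambda x: x,
                     reducing_to=(int, lambda accum, val: accum + 1)))
print(grouped)
print(counted)
```

The reducer version never needs to keep the full list of rows for a key once it has been folded, which is the sense in which the reducing argument acts as an optimization.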
Note the use of functions as parameters in Group and Flatten. Guinea Pig has no notion of records: rows can be any python object (although the\