Difference between revisions of "Guinea Pig"
Revision as of 15:21, 9 May 2014
Quick Start
Running wordcount.py
Set up a directory that contains the file gp.py and a
second script called wordcount.py which contains this
code:
# always start like this
from gp import *
import sys
# supporting routines can go here
def tokens(line):
    for tok in line.split():
        yield tok.lower()
# always subclass Planner
class WordCount(Planner):
    wc = ReadLines('corpus.txt') | FlattenBy(by=tokens) | Group(by=lambda x:x, reducingWith=ReduceToCount())
# always end like this
if __name__ == "__main__":
    WordCount().main(sys.argv)
Then type the command:
% python tutorial/wordcount.py --store wc
After a couple of seconds it will return, and you can see the word counts with
% head wc.gp
Understanding the wordcount example
There's also a less concise but easier-to-explain wordcount file,
longer-wordcount.py, with this view definition:
class WordCount(Planner):
    lines = ReadLines('corpus.txt')
    words = Flatten(lines,by=tokens)
    wordGroups = Group(words, by=lambda x:x)
    wordCount = Group(words, by=lambda x:x, reducingTo=ReduceToCount())
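To make these views concrete, here is a rough plain-Python sketch of what each one holds, written without GuineaPig at all. The function name word_count_views and the sample lines are illustrative assumptions, the shape of wordGroups (a key mapped to the list of rows sharing it) is a guess at what an unreduced Group produces, and the real system works with files on disk rather than in-memory lists.

```python
from collections import Counter, defaultdict

def tokens(line):
    # same helper as in the tutorial: lowercased whitespace-separated tokens
    for tok in line.split():
        yield tok.lower()

def word_count_views(lines):
    """Hypothetical helper: mimic the views with in-memory structures."""
    # words: flatten every line into its tokens
    words = [tok for line in lines for tok in tokens(line)]
    # wordGroups (assumed shape): each distinct word mapped to its rows
    word_groups = defaultdict(list)
    for w in words:
        word_groups[w].append(w)
    # wordCount: the same grouping, with each group reduced to its size
    word_count = Counter(words)
    return words, dict(word_groups), dict(word_count)

if __name__ == "__main__":
    sample = ["The quick fox", "the lazy dog"]
    words, groups, counts = word_count_views(sample)
    print(counts)
```

The only difference between the last two views is the reducer: without one, a group keeps all of its rows, and with a counting reducer it keeps just the number of rows.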
If you type
% python longer-wordcount.py
you'll get a brief usage message:
usage: --[store|pprint|plan|cat] view [--echo] [--target hadoop] [--reuse foo.gp bar.gp ...] [other options]
--list
Typing
% python longer-wordcount.py --list
will list the views that are defined in the
file: lines, words, wordGroups,
and wordCount. If you pprint one of these (with the --pprint option),
say wordCount, you can see what it essentially is:
a Python data structure with several named subparts
(like words):
wordCount = Group(words,by=<function <lambda> at 0x10497aa28>,reducingTo=<guineapig.ReduceToCount object at 0x104979190>)
| words = Flatten(lines, by=<function tokens at 0x1048965f0>).opts(cached=True)
| | lines = ReadLines("corpus.txt")
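The indented listing above can be read as a depth-first walk over a tree of named views. The sketch below is not GuineaPig's own code: the View class and the pprint_view function are hypothetical names used only to illustrate the idea, and the descriptions are plain strings rather than real view objects.

```python
class View:
    """Hypothetical stand-in for a named view with zero or more input views."""
    def __init__(self, name, description, inputs=()):
        self.name = name
        self.description = description
        self.inputs = list(inputs)

def pprint_view(view, depth=0):
    """Return the listing lines for a view and, recursively, for its inputs."""
    out = ["| " * depth + "%s = %s" % (view.name, view.description)]
    for inner in view.inputs:
        out.extend(pprint_view(inner, depth + 1))
    return out

if __name__ == "__main__":
    lines = View("lines", 'ReadLines("corpus.txt")')
    words = View("words", "Flatten(lines, by=tokens)", [lines])
    word_count = View("wordCount", "Group(words, reducingTo=count)", [words])
    print("\n".join(pprint_view(word_count)))
```

Each level of nesting adds one "| " prefix, which is why lines, the input of an input, appears with two of them.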
GuineaPig can convert one of these view structures into a plan for storing the view. To see a plan, you can type something like
% python longer-wordcount.py --plan wordCount
If you typed
% python longer-wordcount.py --plan wordCount | sh
this would be equivalent to python longer-wordcount.py --store
wordCount, modulo some details about how errors are reported.
Notice that the "plan" contains steps that call the longer-wordcount.py Python program itself; for example, it has lines like
python longer-wordcount.py --view=words --do=doStoreRows < corpus.txt > words.gp
This shell command is one step of the overall plan--namely, executing
the code associated with the words view to create the
materialized view words.gp.
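One way to picture planning is as ordering the views so that each one is handled only after the views it reads from. The sketch below assumes a hypothetical dictionary that maps each view name to the names of its inputs; it produces only an ordering of view names and does not reproduce the shell commands that GuineaPig actually emits.

```python
def storage_order(target, inputs_of):
    """Return view names ordered so that inputs precede the views using them.

    inputs_of is a hypothetical mapping: view name -> list of input view names.
    """
    order = []
    seen = set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in inputs_of.get(name, []):
            visit(dep)
        order.append(name)  # appended only after all of its inputs

    visit(target)
    return order

if __name__ == "__main__":
    deps = {"wordCount": ["words"], "words": ["lines"], "lines": []}
    print(storage_order("wordCount", deps))
```

In the plan step shown above, the words view reads corpus.txt directly, so the real planner evidently does not turn every view into a separate stored file; the ordering here is only the underlying dependency idea.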
If you're working on a machine that has Hadoop installed, you can generate an alternative plan that uses Hadoop streaming:
% python longer-wordcount.py --plan wordCount --target hadoop