Revision as of 15:29, 9 May 2014
Quick Start
Running wordcount.py
Set up a directory that contains the file gp.py and a second script called wordcount.py which contains this code:
# always start like this
from gp import *
import sys

# supporting routines can go here
def tokens(line):
    for tok in line.split():
        yield tok.lower()

# always subclass Planner
class WordCount(Planner):
    wc = ReadLines('corpus.txt') | FlattenBy(by=tokens) | Group(by=lambda x:x, reducingTo=ReduceToCount())

# always end like this
if __name__ == "__main__":
    WordCount().main(sys.argv)
Then type the command:
% python tutorial/wordcount.py --store wc
After a couple of seconds it will return, and you can see the wordcounts with
% head wc.gp
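The tokens routine in the script above is an ordinary Python generator; run on its own it behaves like this:

```python
# The `tokens` helper from wordcount.py: yields each whitespace-separated
# token of a line, lower-cased.
def tokens(line):
    for tok in line.split():
        yield tok.lower()

print(list(tokens("The quick Brown FOX")))  # ['the', 'quick', 'brown', 'fox']
```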
Understanding the wordcount example
There's also a less concise but easier-to-explain wordcount file, longer-wordcount.py, with this view definition:
class WordCount(Planner):
    lines = ReadLines('corpus.txt')
    words = Flatten(lines, by=tokens)
    wordGroups = Group(words, by=lambda x:x)
    wordCount = Group(words, by=lambda x:x, reducingTo=ReduceToCount())
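Conceptually, Group buckets rows by a key and optionally reduces each bucket; ReduceToCount reduces a bucket to its size. A plain-Python sketch of that semantics (an illustration only, not GuineaPig's implementation; the helper name group_rows is made up):

```python
from collections import defaultdict

def group_rows(rows, by, reducing_to=None):
    # bucket each row under its key, like Group(..., by=...)
    buckets = defaultdict(list)
    for r in rows:
        buckets[by(r)].append(r)
    if reducing_to is None:
        # wordGroups-style output: (key, list-of-rows) pairs
        return sorted(buckets.items())
    # wordCount-style output: each bucket reduced, here to its count
    return sorted((k, reducing_to(v)) for k, v in buckets.items())

words = ['a', 'b', 'a', 'a']
print(group_rows(words, by=lambda x: x))
# [('a', ['a', 'a', 'a']), ('b', ['b'])]
print(group_rows(words, by=lambda x: x, reducing_to=len))
# [('a', 3), ('b', 1)]
```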
If you type
% python longer-wordcount.py
you'll get a brief usage message:
usage: --[store|pprint|plan|cat] view [--echo] [--target hadoop] [--reuse foo.gp bar.gp ...] [other options] --list
Typing
% python longer-wordcount.py --list
will list the views that are defined in the file: lines, words, wordGroups, and wordCount.
If you pprint one of these, say wordCount, you can see what it essentially is: a Python data structure, with several named subparts (like words):
wordCount = Group(words,by=<function <lambda> at 0x10497aa28>,reducingTo=<guineapig.ReduceToCount object at 0x104979190>)
| words = Flatten(lines, by=<function tokens at 0x1048965f0>).opts(cached=True)
| | lines = ReadLines("corpus.txt")
The "pipe" notation is just a shortcut for nested views: words = ReadLines('corpus.txt') | Flatten(by=tokens) is equivalent to words = Flatten(ReadLines('corpus.txt'), by=tokens) or to

lines = ReadLines('corpus.txt')
words = Flatten(lines, by=tokens)
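One plausible way to get this shortcut in plain Python is to overload the | operator on a common view base class. The sketch below is hypothetical (class names are borrowed from the example, but the chaining logic is an assumption about how such a shortcut could work, not GuineaPig's actual code):

```python
class View:
    def __or__(self, nxt):
        # `a | b` plugs view a in as the input of view b
        nxt.inner = self
        return nxt

class ReadLines(View):
    def __init__(self, filename):
        self.filename, self.inner = filename, None

class Flatten(View):
    def __init__(self, inner=None, by=None):
        self.inner, self.by = inner, by

# the two spellings from the text build the same nested structure
piped = ReadLines('corpus.txt') | Flatten(by=str.split)
nested = Flatten(ReadLines('corpus.txt'), by=str.split)
assert isinstance(piped.inner, ReadLines) and isinstance(nested.inner, ReadLines)
```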
GuineaPig can convert one of these view structures into a plan for storing the view. To see a plan, you can type something like
% python longer-wordcount.py --plan wordCount
If you typed
% python longer-wordcount.py --plan wordCount | sh
this would be equivalent to
% python longer-wordcount.py --store wordCount
modulo some details about how errors are reported.
Notice how this works: the view definition (a data structure) is converted to a plan (a shell script), and the shell script is then executed, starting up some new processes while it executes. These new processes invoke additional copies of python longer-wordcount.py with special arguments, like
python longer-wordcount.py --view=words --do=doStoreRows < corpus.txt > words.gp
which tell Python to perform smaller-scale operations associated with individual views, as steps in the overall plan. Here the words view is stored for later processing.
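The flag style used in these worker invocations is easy to picture as ordinary argument parsing followed by a lookup; a hypothetical sketch (the parsing and dispatch here are assumptions for illustration, not GuineaPig's internals):

```python
def parse_flags(argv):
    # collect --key=value arguments into a dict, e.g. {'view': ..., 'do': ...}
    flags = {}
    for arg in argv:
        if arg.startswith('--') and '=' in arg:
            key, val = arg[2:].split('=', 1)
            flags[key] = val
    return flags

flags = parse_flags(['--view=words', '--do=doStoreRows'])
print(flags)  # {'view': 'words', 'do': 'doStoreRows'}
# a planner could then look up the view named flags['view'] and run
# the method named flags['do'] on it as one step of the plan
```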
The motivation for doing all this is that this sort of process can also be distributed across a cluster using Hadoop streaming. If you're working on a machine that has Hadoop installed, you can generate an alternative plan that uses Hadoop streaming:
% python longer-wordcount.py --plan wordCount --target hadoop
This produces a messier-looking plan that will store wordCount on HDFS using a series of Hadoop streaming jobs.
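For context, Hadoop streaming runs ordinary executables as map and reduce steps, piping tab-separated records through stdin and stdout. A generic word-count mapper/reducer pair in that style (a standard streaming idiom, not the specific plan GuineaPig emits):

```python
from itertools import groupby

def mapper(lines):
    # map step: emit one tab-separated (word, 1) record per token
    for line in lines:
        for tok in line.split():
            yield f"{tok.lower()}\t1"

def reducer(records):
    # reduce step: records arrive sorted by key; sum the counts per word
    pairs = (rec.split('\t') for rec in records)
    for word, grp in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(n) for _, n in grp)}"

mapped = sorted(mapper(["a b a", "b a"]))  # sorted() stands in for the shuffle
print(list(reducer(mapped)))  # ['a\t3', 'b\t2']
```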