Difference between revisions of "Guinea Pig"
Line 1: | Line 1: | ||
− | = Quick Start = | + | == Quick Start == |
− | == Running wordcount.py == | + | === Running wordcount.py === |
Set up a directory that contains the file <code>gp.py</code> and a | Set up a directory that contains the file <code>gp.py</code> and a | ||
Line 38: | Line 38: | ||
% head wc.gp | % head wc.gp | ||
</pre> | </pre> | ||
+ | |||
+ | === Understanding the wordcount example === | ||
+ | |||
+ | There's also a less concise but easier-to-explain wordcount file, | ||
+ | <code>longer-wordcount.py</code>: | ||
+ | |||
+ | <pre> | ||
+ | class WordCount(Planner): | ||
+ |     lines = ReadLines('corpus.txt') | ||
+ |     words = Flatten(lines,by=tokens) | ||
+ |     wordGroups = Group(words, by=lambda x:x) | ||
+ |     wordCount = Group(words, by=lambda x:x, reducingTo=ReduceToCount()) | ||
+ | </pre> | ||
+ | |||
+ | If you type | ||
+ | <pre> | ||
+ | % python longer-wordcount.py | ||
+ | </pre> | ||
+ | you'll get a brief usage message: | ||
+ | <pre> | ||
+ | usage: --[store|pprint|plan|cat] view [--echo] [--target hadoop] [--reuse foo.gp bar.gp ...] [other options] | ||
+ | --list | ||
+ | </pre> | ||
+ | Typing | ||
+ | <pre> | ||
+ | % python longer-wordcount.py --list | ||
+ | </pre> | ||
+ | will list the <i>views</i> that are defined in the | ||
+ | file: <code>lines</code>, <code>words</code>, <code>wordGroups</code>, | ||
+ | and <code>wordCount</code>. If you <code>pprint</code> one of these, | ||
+ | say <code>wordCount</code>, you can see what it essentially is: | ||
+ | a Python data structure with several named subparts | ||
+ | (like <code>words</code>): | ||
+ | <pre> | ||
+ | wordCount = Group(words,by=<function <lambda> at 0x10497aa28>,reducingTo=<guineapig.ReduceToCount object at 0x104979190>) | ||
+ | | words = Flatten(lines, by=<function tokens at 0x1048965f0>).opts(cached=True) | ||
+ | | | lines = ReadLines("corpus.txt") | ||
+ | </pre> | ||
+ | GuineaPig can convert one of these view structures into a <i>plan</i> | ||
+ | for storing the view. To see a plan, you can type something like | ||
+ | <pre> | ||
+ | % python longer-wordcount.py --plan wordCount | ||
+ | </pre> | ||
+ | If you typed | ||
+ | <pre> | ||
+ | % python longer-wordcount.py --plan wordCount | sh | ||
+ | </pre> | ||
+ | this would be equivalent to <code>python longer-wordcount.py --store | ||
+ | wordCount</code>, modulo some details about how errors are reported. | ||
Revision as of 15:15, 9 May 2014
Quick Start
Running wordcount.py
Set up a directory that contains the file gp.py and a second script called wordcount.py which contains this code:
# always start like this
from gp import *
import sys

# supporting routines can go here
def tokens(line):
    for tok in line.split():
        yield tok.lower()

# always subclass Planner
class WordCount(Planner):
    wc = ReadLines('corpus.txt') | FlattenBy(by=tokens) | Group(by=lambda x:x, reducingWith=ReduceToCount())

# always end like this
if __name__ == "__main__":
    WordCount().main(sys.argv)
Then type the command:
% python tutorial/wordcount.py --store wc
After a couple of seconds it will return, and you can see the wordcounts with
% head wc.gp
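For intuition about what the wc view computes, the same counting can be sketched in plain Python without GuineaPig. This is only an illustration of the result, not of how GuineaPig executes the pipeline: the helper count_words and the sample lines are made up for the example, and the on-disk format of wc.gp is not reproduced here.

```python
from collections import Counter

def tokens(line):
    # Same helper as in wordcount.py: split on whitespace, lowercase each token.
    for tok in line.split():
        yield tok.lower()

def count_words(lines):
    # Hypothetical helper (not part of GuineaPig): flatten every line into
    # tokens, then count how many times each token occurs.
    counts = Counter()
    for line in lines:
        counts.update(tokens(line))
    return counts

if __name__ == "__main__":
    sample = ["The quick brown fox", "the lazy dog"]
    for word, n in sorted(count_words(sample).items()):
        print(word, n)
```

Because tokens lowercases each token, "The" and "the" are counted as the same word.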
Understanding the wordcount example
There's also a less concise but easier-to-explain wordcount file, longer-wordcount.py:
class WordCount(Planner):
    lines = ReadLines('corpus.txt')
    words = Flatten(lines,by=tokens)
    wordGroups = Group(words, by=lambda x:x)
    wordCount = Group(words, by=lambda x:x, reducingTo=ReduceToCount())
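One way to read the four views is as successive transformations of the data. The sketch below mimics them with ordinary Python lists and dicts. The functions flatten and group are stand-ins invented for this illustration, and len is used in place of ReduceToCount(); GuineaPig's real views are handled by the planner rather than evaluated eagerly like this.

```python
def tokens(line):
    # Same helper as in the wordcount script.
    for tok in line.split():
        yield tok.lower()

def flatten(rows, by):
    # Stand-in for Flatten: apply `by` to each row and concatenate the results.
    return [item for row in rows for item in by(row)]

def group(rows, by, reducer=None):
    # Stand-in for Group: collect rows under a key, optionally reducing each group.
    groups = {}
    for row in rows:
        groups.setdefault(by(row), []).append(row)
    if reducer is None:
        return groups
    return {key: reducer(members) for key, members in groups.items()}

lines = ["a b a", "B c"]                      # plays the role of ReadLines('corpus.txt')
words = flatten(lines, by=tokens)
word_groups = group(words, by=lambda x: x)    # no reducer: keep the grouped rows
word_count = group(words, by=lambda x: x, reducer=len)  # reduce each group to its size

if __name__ == "__main__":
    print(words)
    print(word_groups)
    print(word_count)
```

The only difference between the last two stand-ins is the reducer, which mirrors how wordGroups and wordCount differ only in the reducingTo argument.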
If you type
% python longer-wordcount.py
you'll get a brief usage message:
usage: --[store|pprint|plan|cat] view [--echo] [--target hadoop] [--reuse foo.gp bar.gp ...] [other options]
       --list
Typing
% python longer-wordcount.py --list
will list the views that are defined in the file: lines, words, wordGroups, and wordCount. If you pprint one of these, say wordCount, you can see what it essentially is: a Python data structure with several named subparts (like words):
wordCount = Group(words,by=<function <lambda> at 0x10497aa28>,reducingTo=<guineapig.ReduceToCount object at 0x104979190>)
| words = Flatten(lines, by=<function tokens at 0x1048965f0>).opts(cached=True)
| | lines = ReadLines("corpus.txt")
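The indented listing above can be read as a tree: each view is printed on its own line, and the views it is built from appear beneath it with one more "| " prefix. A toy model of that layout, assuming nothing about GuineaPig's internals (the ToyView class and its fields are hypothetical), might look like this:

```python
class ToyView:
    # Hypothetical stand-in for a view: a name, a textual definition, and
    # the upstream views it is built from.
    def __init__(self, name, definition, inputs=()):
        self.name = name
        self.definition = definition
        self.inputs = list(inputs)

    def pprint_lines(self, depth=0):
        # One line per view; each level of nesting adds a "| " prefix.
        out = ["| " * depth + self.name + " = " + self.definition]
        for view in self.inputs:
            out.extend(view.pprint_lines(depth + 1))
        return out

lines = ToyView("lines", 'ReadLines("corpus.txt")')
words = ToyView("words", "Flatten(lines, by=tokens)", [lines])
word_count = ToyView("wordCount", "Group(words, by=..., reducingTo=...)", [words])

if __name__ == "__main__":
    print("\n".join(word_count.pprint_lines()))
```

The recursion visits wordCount first, then words, then lines, which matches the top-down order of the listing.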
GuineaPig can convert one of these view structures into a plan for storing the view. To see a plan, you can type something like
% python longer-wordcount.py --plan wordCount
If you typed
% python longer-wordcount.py --plan wordCount | sh
this would be equivalent to python longer-wordcount.py --store wordCount, modulo some details about how errors are reported.