Difference between revisions of "Guinea Pig"

From Cohen Courses

Revision as of 15:15, 9 May 2014

Quick Start

Running wordcount.py

Set up a directory that contains the file gp.py and a second script called wordcount.py which contains this code:

# always start like this
from gp import *
import sys

# supporting routines can go here
def tokens(line):
    for tok in line.split():
        yield tok.lower()

# always subclass Planner
class WordCount(Planner):

    wc = ReadLines('corpus.txt') | Flatten(by=tokens) | Group(by=lambda x:x, reducingTo=ReduceToCount())

# always end like this
if __name__ == "__main__":
    WordCount().main(sys.argv)

Then type the command:

% python tutorial/wordcount.py --store wc

After a couple of seconds it will return, and you can see the word counts with:

% head wc.gp
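
For intuition, the following is a minimal plain-Python sketch of the computation that the wc pipeline describes: read lines, flatten them into lowercased tokens, and count each distinct token. It does not use GuineaPig at all. The function name word_count and the sample lines are illustrative assumptions, and the format of the real wc.gp file may differ from what this prints.

```python
from collections import Counter

def tokens(line):
    # Same helper as in wordcount.py: split on whitespace, lowercase each token.
    for tok in line.split():
        yield tok.lower()

def word_count(lines):
    # Hypothetical stand-in for ReadLines | Flatten | Group(..., ReduceToCount()).
    counts = Counter()
    for line in lines:
        counts.update(tokens(line))
    return counts

if __name__ == "__main__":
    # Made-up sample input; the tutorial itself reads corpus.txt.
    sample = ["The quick brown fox", "the lazy dog"]
    for word, n in sorted(word_count(sample).items()):
        print("%s\t%d" % (word, n))
```

Unlike this sketch, GuineaPig stores the result as a file on disk rather than holding the counts in memory.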

Understanding the wordcount example

There's also a less concise but easier-to-explain wordcount file, longer-wordcount.py:

class WordCount(Planner):
    lines = ReadLines('corpus.txt')
    words = Flatten(lines,by=tokens)
    wordGroups = Group(words, by=lambda x:x)
    wordCount = Group(words, by=lambda x:x, reducingTo=ReduceToCount())
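
The difference between wordGroups and wordCount can be illustrated with a small plain-Python sketch. It assumes that grouping without a reducer collects the rows that share a key, and that a counting reducer collapses each group to its size. The group function and its reducing_to parameter are hypothetical stand-ins, not GuineaPig's implementation, and the real row format may differ.

```python
from collections import defaultdict

def group(rows, by, reducing_to=None):
    # Hypothetical stand-in for Group: bucket rows by the key that by() returns.
    groups = defaultdict(list)
    for row in rows:
        groups[by(row)].append(row)
    if reducing_to is None:
        # No reducer: keep every member of each group.
        return sorted(groups.items())
    # With a reducer: collapse each group's members to a single value.
    return sorted((key, reducing_to(members)) for key, members in groups.items())

words = ["a", "b", "a", "c", "a"]
word_groups = group(words, by=lambda x: x)                   # like wordGroups
word_count = group(words, by=lambda x: x, reducing_to=len)   # like wordCount

print(word_groups)
print(word_count)
```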

If you type

% python longer-wordcount.py

you'll get a brief usage message:

usage: --[store|pprint|plan|cat] view [--echo] [--target hadoop] [--reuse foo.gp bar.gp ...] [other options]
       --list

Typing

% python longer-wordcount.py --list

will list the views that are defined in the file: lines, words, wordGroups, and wordCount. If you pprint one of these, say wordCount, you can see what it is: a Python data structure with several named subparts (like words):

wordCount = Group(words,by=<function <lambda> at 0x10497aa28>,reducingTo=<guineapig.ReduceToCount object at 0x104979190>)
| words = Flatten(lines, by=<function tokens at 0x1048965f0>).opts(cached=True)
| | lines = ReadLines("corpus.txt")
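
The nesting in this output can be mimicked with a toy data structure. This is only a sketch under the assumption that each view holds a reference to the view it is built from. The View class, its fields, and the pprint_lines method are invented for illustration and are not GuineaPig's classes.

```python
class View(object):
    # Hypothetical view: a name, a printable description, and an optional inner view.
    def __init__(self, name, description, inner=None):
        self.name = name
        self.description = description
        self.inner = inner

    def pprint_lines(self, depth=0):
        # One line per view, prefixed with "| " for each level of nesting.
        lines = ["| " * depth + "%s = %s" % (self.name, self.description)]
        if self.inner is not None:
            lines.extend(self.inner.pprint_lines(depth + 1))
        return lines

lines = View("lines", 'ReadLines("corpus.txt")')
words = View("words", "Flatten(lines, by=tokens)", inner=lines)
word_count = View("wordCount", "Group(words, by=..., reducingTo=...)", inner=words)

print("\n".join(word_count.pprint_lines()))
```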

GuineaPig can convert one of these view structures into a plan for storing the view. To see a plan, you can type something like:

% python longer-wordcount.py --plan wordCount

If you typed

% python longer-wordcount.py --plan wordCount | sh

this would be equivalent to python longer-wordcount.py --store wordCount, modulo some details about how errors are reported.
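
As a rough sketch of the idea, a plan can be thought of as an ordered list of shell commands, which can either be printed (and piped to sh) or executed directly. The functions print_plan and run_plan and the echo commands are hypothetical placeholders, not GuineaPig's actual plan format. The sketch assumes a Unix-like shell is available.

```python
import subprocess

def print_plan(steps):
    # Render the plan as text, one shell command per line, suitable for "| sh".
    return "\n".join(steps)

def run_plan(steps):
    # Execute each step in order; check_call raises CalledProcessError on a
    # nonzero exit status, so a failing step stops the rest of the plan.
    executed = 0
    for step in steps:
        subprocess.check_call(step, shell=True)
        executed += 1
    return executed

# Placeholder steps standing in for whatever commands a real plan contains.
plan = ["echo step one", "echo step two"]
print(print_plan(plan))
run_plan(plan)
```

One way the two routes can differ in error handling, in this sketch at least, is that run_plan stops at the first failing step, whereas a script piped to plain sh keeps going after a failed command.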
