Difference between revisions of "Guinea Pig"

From Cohen Courses

Revision as of 16:21, 9 May 2014

Quick Start

Running wordcount.py

Set up a directory that contains the file gp.py and a second script called wordcount.py which contains this code:

# always start like this
from gp import *
import sys

# supporting routines can go here
def tokens(line):
    for tok in line.split():
        yield tok.lower()

#always subclass Planner
class WordCount(Planner):

    wc = ReadLines('corpus.txt') | FlattenBy(by=tokens) | Group(by=lambda x:x, reducingWith=ReduceToCount())

# always end like this
if __name__ == "__main__":
    WordCount().main(sys.argv)

Then type the command:

% python tutorial/wordcount.py --store wc

After a couple of seconds it will return, and you can see the word counts with

% head wc.gp
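
For intuition, the computation that the wc view describes can be sketched in plain Python, without GuineaPig at all. This is only an illustration of the data flow (read lines, flatten them into tokens, group by word and count); the function name word_count and the sample lines are made up for the example, and nothing here reflects how GuineaPig executes a view internally.

```python
from collections import Counter


def tokens(line):
    # Same helper as in wordcount.py: lower-cased, whitespace-separated tokens.
    for tok in line.split():
        yield tok.lower()


def word_count(lines):
    # "Flatten": every line yields several tokens, merged into one stream.
    words = (tok for line in lines for tok in tokens(line))
    # "Group ... ReduceToCount": group by the word itself, keep each group's size.
    return Counter(words)


if __name__ == "__main__":
    # Hypothetical sample input standing in for corpus.txt.
    sample = ["The quick brown fox", "the lazy dog", "The end"]
    for word, n in sorted(word_count(sample).items()):
        print(word, n)
```

The in-memory Counter is the main simplification: it assumes the whole corpus fits on one machine, which is exactly the assumption a planner-based system is meant to remove.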

Understanding the wordcount example

There's also a less concise but easier-to-explain wordcount file, longer-wordcount.py, with this view definition:

class WordCount(Planner):
    lines = ReadLines('corpus.txt')
    words = Flatten(lines,by=tokens)
    wordGroups = Group(words, by=lambda x:x)
    wordCount = Group(words, by=lambda x:x, reducingTo=ReduceToCount())

If you type

% python longer-wordcount.py

you'll get a brief usage message:

usage: --[store|pprint|plan|cat] view [--echo] [--target hadoop] [--reuse foo.gp bar.gp ...] [other options]
       --list

Typing

% python longer-wordcount.py --list

will list the views that are defined in the file: lines, words, wordGroups, and wordCount. If you pprint one of these, say wordCount, you can see what it essentially is: a Python data structure with several named subparts (like words):

wordCount = Group(words,by=<function <lambda> at 0x10497aa28>,reducingTo=<guineapig.ReduceToCount object at 0x104979190>)
| words = Flatten(lines, by=<function tokens at 0x1048965f0>).opts(cached=True)
| | lines = ReadLines("corpus.txt")
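
The nesting in this output can be mimicked with a small tree of objects. The sketch below is a toy, not GuineaPig's implementation: the View class, its fields, and the pprint_lines method are hypothetical names chosen for the example, and the descriptions are plain strings rather than real view objects.

```python
class View:
    """Hypothetical stand-in for a view: a name, a description, and inner views."""

    def __init__(self, name, description, inputs=()):
        self.name = name
        self.description = description
        self.inputs = list(inputs)

    def pprint_lines(self, depth=0):
        # One line per view, prefixed with "| " for each level of nesting.
        out = ["| " * depth + f"{self.name} = {self.description}"]
        for inner in self.inputs:
            out.extend(inner.pprint_lines(depth + 1))
        return out


lines_view = View("lines", 'ReadLines("corpus.txt")')
words_view = View("words", "Flatten(lines, by=tokens)", [lines_view])
word_count_view = View("wordCount", "Group(words, by=..., reducingTo=...)", [words_view])

if __name__ == "__main__":
    print("\n".join(word_count_view.pprint_lines()))
```

Each view only holds references to the views it reads from, so printing (or planning) any one view amounts to a recursive walk over its inputs.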

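A view structure like this defines how data should "flow" (read the lines of the corpus, tokenize them, then group them) and identifies the Python functions (like tokens) which operate on the data.

GuineaPig can convert one of these view structures into a plan for storing the view. To see a plan, you can type something like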

% python longer-wordcount.py --plan wordCount

If you typed

% python longer-wordcount.py --plan wordCount | sh

this would be equivalent to python longer-wordcount.py --store wordCount, modulo some details about how errors are reported.

Notice that the "plan" contains steps that call the longer-wordcount.py Python program itself; for example, it has lines like

python longer-wordcount.py --view=words --do=doStoreRows < corpus.txt > words.gp

This shell command is one step of the overall plan: it executes the code associated with the words view to create the materialized view words.gp.
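
The idea of a plan as an ordered list of "store this view" steps can be sketched as follows. This is a simplification under stated assumptions, not GuineaPig's planner: views are plain dicts, every non-file view has exactly one input and is stored as <name>.gp, and the steps are abstract tuples rather than real command lines (the actual plan for a Group view may contain different or additional steps).

```python
def output_file(view):
    # Assumption: a ReadLines-style view is just its input file,
    # and every other view is materialized as <name>.gp.
    return view.get("file") or view["name"] + ".gp"


def plan_steps(view):
    """Return (view name, input file, output file) steps, inputs first."""
    if "file" in view:
        return []  # the file already exists, so there is nothing to run
    (inner,) = view["inputs"]  # this sketch handles single-input views only
    steps = plan_steps(inner)
    steps.append((view["name"], output_file(inner), output_file(view)))
    return steps


# Hypothetical dict versions of the views in longer-wordcount.py.
lines = {"name": "lines", "file": "corpus.txt"}
words = {"name": "words", "inputs": [lines]}
word_count = {"name": "wordCount", "inputs": [words]}

if __name__ == "__main__":
    for name, src, dst in plan_steps(word_count):
        print(f"store {name}: {src} -> {dst}")
```

The first step produced here (words, reading corpus.txt and writing words.gp) corresponds to the shell command quoted above; the rest of the list is illustrative only.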

If you're working on a machine that has Hadoop installed, you can generate an alternative plan that uses Hadoop streaming:

% python longer-wordcount.py --plan wordCount --target hadoop