Difference between revisions of "Guinea Pig"

Revision as of 15:21, 9 May 2014

Quick Start

Running wordcount.py

Set up a directory that contains the file gp.py and a second script called wordcount.py, which contains this code:

# always start like this
from gp import *
import sys

# supporting routines can go here
def tokens(line):
    for tok in line.split():
        yield tok.lower()

# always subclass Planner
class WordCount(Planner):

    wc = ReadLines('corpus.txt') | FlattenBy(by=tokens) | Group(by=lambda x:x, reducingWith=ReduceToCount())

# always end like this
if __name__ == "__main__":
    WordCount().main(sys.argv)

Then type the command:

% python tutorial/wordcount.py --store wc

After a couple of seconds it will return, and you can see the word counts with:

% head wc.gp

Understanding the wordcount example

There's also a less concise but easier-to-explain wordcount file, longer-wordcount.py, with this view definition:

class WordCount(Planner):
    lines = ReadLines('corpus.txt')
    words = Flatten(lines,by=tokens)
    wordGroups = Group(words, by=lambda x:x)
    wordCount = Group(words, by=lambda x:x, reducingTo=ReduceToCount())
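To make the data flow concrete, here is a rough plain-Python sketch of what these views compute. It does not use gp.py at all: the function names word_groups and word_count are hypothetical, and the in-memory dictionaries only approximate the idea of grouping, not GuineaPig's actual storage format.

```python
from collections import Counter, defaultdict

def tokens(line):
    # Same helper as in wordcount.py: lowercase, whitespace-separated tokens.
    for tok in line.split():
        yield tok.lower()

def flatten(lines):
    # Roughly the "words" view: every token of every line, in order.
    return [tok for line in lines for tok in tokens(line)]

def word_groups(words):
    # Roughly the "wordGroups" view: each distinct word mapped to its occurrences.
    groups = defaultdict(list)
    for w in words:
        groups[w].append(w)
    return dict(groups)

def word_count(words):
    # Roughly the "wordCount" view: each distinct word mapped to a count.
    return dict(Counter(words))

if __name__ == "__main__":
    sample = ["The quick brown fox", "the lazy dog"]
    words = flatten(sample)
    for word, n in sorted(word_count(words).items()):
        print(word, n)
```

The point of the sketch is only that a group with a count reducer collapses each group to a single number instead of keeping the list of members.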

If you type

% python longer-wordcount.py

you'll get a brief usage message:

usage: --[store|pprint|plan|cat] view [--echo] [--target hadoop] [--reuse foo.gp bar.gp ...] [other options]
       --list

Typing

% python longer-wordcount.py --list

will list the views that are defined in the file: lines, words, wordGroups, and wordCount. If you pprint one of these, say wordCount, you can see what it essentially is: a Python data structure with several named subparts (like words):

wordCount = Group(words,by=<function <lambda> at 0x10497aa28>,reducingTo=<guineapig.ReduceToCount object at 0x104979190>)
| words = Flatten(lines, by=<function tokens at 0x1048965f0>).opts(cached=True)
| | lines = ReadLines("corpus.txt")
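The nesting in this output can be imitated with a small toy structure. The sketch below is an illustration only, not GuineaPig's implementation: the View class and the pprint_lines function are hypothetical names, and the description strings are abbreviated by hand.

```python
class View:
    """A toy stand-in for a view: a name, a description, and its input views."""
    def __init__(self, name, description, inputs=()):
        self.name = name
        self.description = description
        self.inputs = list(inputs)

def pprint_lines(view, depth=0):
    """Return one line per view, prefixing each level of nesting with '| '."""
    out = ["| " * depth + view.name + " = " + view.description]
    for inp in view.inputs:
        out.extend(pprint_lines(inp, depth + 1))
    return out

# Hand-built structure mirroring the three views in the output above.
lines_view = View("lines", 'ReadLines("corpus.txt")')
words_view = View("words", "Flatten(lines, by=tokens)", [lines_view])
count_view = View("wordCount", "Group(words, by=..., reducingTo=...)", [words_view])

if __name__ == "__main__":
    print("\n".join(pprint_lines(count_view)))
```

Each view refers to the views it reads from, so printing one view recursively prints everything it depends on.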

GuineaPig can convert one of these view structures into a plan for storing the view. To see a plan, you can type something like

% python longer-wordcount.py --plan wordCount

If you typed

% python longer-wordcount.py --plan wordCount | sh

this would be equivalent to python longer-wordcount.py --store wordCount, modulo some details about how errors are reported.
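Since a plan is just a sequence of shell commands, it can also be run one step at a time from Python. The sketch below is a hypothetical helper (run_plan is not part of GuineaPig) and assumes a POSIX shell is available. It deliberately differs from piping to sh in one respect: sh keeps going after a failed command, while this helper stops at the first step that returns a non-zero exit code.

```python
import subprocess

def run_plan(steps):
    """Run each shell command in order and stop at the first failure.

    Returns the number of steps that completed successfully.
    """
    done = 0
    for step in steps:
        result = subprocess.run(step, shell=True)
        if result.returncode != 0:
            print("step failed with exit code %d: %s" % (result.returncode, step))
            break
        done += 1
    return done

if __name__ == "__main__":
    # Harmless stand-in commands; a real plan would be the lines printed by --plan.
    print(run_plan(["echo step one", "echo step two"]))
```

To use it on a real plan, one would capture the output of the --plan command and pass its non-empty lines as steps.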

Notice that the plan contains steps that call the longer-wordcount.py Python program itself; for example, it has lines like

python longer-wordcount.py --view=words --do=doStoreRows < corpus.txt > words.gp

This shell command is one step of the overall plan: it executes the code associated with the words view to create the materialized view words.gp.

If you're working on a machine that has Hadoop installed, you can generate an alternative plan that uses Hadoop streaming:

% python longer-wordcount.py --plan wordCount --target hadoop
