Difference between revisions of "Guinea Pig"

From Cohen Courses

Revision as of 15:39, 9 May 2014

Quick Start

Running wordcount.py

Set up a directory that contains the file gp.py and a second script called wordcount.py which contains this code:

# always start like this
from gp import *
import sys

# supporting routines can go here
def tokens(line):
    for tok in line.split():
        yield tok.lower()

#always subclass Planner
class WordCount(Planner):

    wc = ReadLines('corpus.txt') | Flatten(by=tokens) | Group(by=lambda x:x, reducingTo=ReduceToCount())

# always end like this
if __name__ == "__main__":
    WordCount().main(sys.argv)

Then type the command:

% python tutorial/wordcount.py --store wc

After a couple of seconds it will return, and you can see the wordcounts with

% head wc.gp

Understanding the wordcount example

There's also a less concise but easier-to-explain wordcount file, longer-wordcount.py, with this view definition:

class WordCount(Planner):
    lines = ReadLines('corpus.txt')
    words = Flatten(lines,by=tokens)
    wordGroups = Group(words, by=lambda x:x)
    wordCount = Group(words, by=lambda x:x, reducingTo=ReduceToCount())
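To make the four views concrete, here is a rough plain-Python sketch of what each one computes. It does not use GuineaPig at all: the helpers flatten and group, the two-line stand-in corpus, the sorted output order, and the idea that an un-reduced group holds a list of its rows are all assumptions made for illustration.

```python
from collections import defaultdict

def tokens(line):
    for tok in line.split():
        yield tok.lower()

def flatten(rows, by):
    # Apply `by` to every row and concatenate whatever it yields.
    for row in rows:
        for item in by(row):
            yield item

def group(rows, by, reducing_to=list):
    # Bucket rows under by(row), then reduce each bucket to a single value.
    buckets = defaultdict(list)
    for row in rows:
        buckets[by(row)].append(row)
    return [(key, reducing_to(bucket)) for key, bucket in sorted(buckets.items())]

# Stand-in for ReadLines('corpus.txt'); a real run would read the file.
lines = ["The cat sat", "the cat ran"]
words = list(flatten(lines, by=tokens))
word_groups = group(words, by=lambda x: x)
word_count = group(words, by=lambda x: x, reducing_to=len)

print(word_groups)
print(word_count)
```

The only difference between word_groups and word_count is the reducer applied to each bucket, which mirrors the difference between the wordGroups and wordCount views.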

If you type

% python longer-wordcount.py

you'll get a brief usage message:

usage: --[store|pprint|plan|cat] view [--echo] [--target hadoop] [--reuse foo.gp bar.gp ...] [other options]
       --list

Typing

% python longer-wordcount.py --list

will list the views that are defined in the file: lines, words, wordGroups, and wordCount. If you pprint one of these, say wordCount, you can see what it essentially is: a Python data structure with several named subparts (like words):

wordCount = Group(words,by=<function <lambda> at 0x10497aa28>,reducingTo=<guineapig.ReduceToCount object at 0x104979190>)
| words = Flatten(lines, by=<function tokens at 0x1048965f0>).opts(cached=True)
| | lines = ReadLines("corpus.txt")

The "pipe" notation is just a shortcut for nested views: words = ReadLines('corpus.txt') | Flatten(by=tokens) is equivalent to words = Flatten(ReadLines('corpus.txt') , by=tokens) or to the two separate definitions for lines and words above. The named variables (like words) can be used, via devious pythonic tricks, to access the data structures from the command line. That is how the --store command above can refer to a view by name.
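If you are curious how a "|" shortcut like this can be built in Python, the sketch below shows one way, by overloading __or__. This is a toy and not GuineaPig's actual code: the View base class, its inner and opts attributes, and the use of the string 'tokens' in place of the real function are assumptions chosen to keep the example short.

```python
class View(object):
    """Toy view: remembers its options and the view it reads from."""

    def __init__(self, inner=None, **opts):
        self.inner = inner
        self.opts = opts

    def __or__(self, other):
        # a | b: plug `a` in as the input of `b`, then hand back `b`.
        other.inner = self
        return other

    def __repr__(self):
        args = [repr(self.inner)] if self.inner is not None else []
        args += ["%s=%r" % (k, v) for k, v in sorted(self.opts.items())]
        return "%s(%s)" % (type(self).__name__, ", ".join(args))


class ReadLines(View):
    def __init__(self, path):
        View.__init__(self, path=path)


class Flatten(View):
    pass


# The string 'tokens' stands in for the tokens function so the output is stable.
piped = ReadLines('corpus.txt') | Flatten(by='tokens')
nested = Flatten(ReadLines('corpus.txt'), by='tokens')

print(piped)
print(repr(piped) == repr(nested))
```

Both forms build the same nested structure, which is why the piped and the nested definitions can be used interchangeably.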

To store a view, GuineaPig will first convert a view structure into a plan for storing the view. To see a plan, you can type something like

% python longer-wordcount.py --plan wordCount

If you typed

% python longer-wordcount.py --plan wordCount | sh

this would be equivalent to python longer-wordcount.py --store wordCount, modulo some details about how errors are reported.

Notice how this works: the view definition (a data structure) is converted to a plan (a shell script), and the shell script is then executed, starting up some new processes while it executes. These new processes invoke additional copies of python longer-wordcount.py with special arguments, like

python longer-wordcount.py --view=words --do=doStoreRows < corpus.txt > words.gp

which tell Python to perform smaller-scale operations associated with individual views, as steps in the overall plan. Here the words view is stored for later processing.
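The plan-then-execute pattern can be sketched in a few lines. The code below is a toy under stated assumptions, not how GuineaPig generates plans: the STEPS table, the plan and do_store_rows functions, and the script name toy.py are hypothetical, and only the shape of the command line (--view=... --do=doStoreRows) is borrowed from the example above. Real plans can differ; for instance, in the command shown above words reads straight from corpus.txt, while this toy stores lines first.

```python
import io

def tokens(line):
    for tok in line.split():
        yield tok.lower()

# Hypothetical table: (view name, input file, function mapping one row to output rows).
STEPS = [
    ("lines", "corpus.txt", lambda row: [row]),
    ("words", "lines.gp", tokens),
]

def plan(script, target):
    """Return shell commands that store every view up to and including `target`."""
    commands = []
    for view, source, _ in STEPS:
        commands.append("python %s --view=%s --do=doStoreRows < %s > %s.gp"
                        % (script, view, source, view))
        if view == target:
            return commands
    raise ValueError("unknown view: %s" % target)

def do_store_rows(view, instream, outstream):
    """Run one step of a plan: transform each input row and write the results."""
    transform = dict((name, fn) for name, _, fn in STEPS)[view]
    for line in instream:
        for row in transform(line.rstrip("\n")):
            outstream.write(row + "\n")

# The "plan" is just text; each line re-invokes the script to do one small step.
for cmd in plan("toy.py", "words"):
    print(cmd)

# Simulate one such step in-process instead of through the shell.
out = io.StringIO()
do_store_rows("words", io.StringIO("The cat\n"), out)
print(out.getvalue())
```

The point of the split is that the plan is ordinary shell text, so the same view definition can be turned into a different plan for a different target without changing the per-step code.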

The motivation for doing all this is that this sort of process can also be distributed across a cluster using Hadoop streaming. If you're working on a machine that has Hadoop installed, you can generate an alternative plan that uses Hadoop streaming:

% python longer-wordcount.py --plan wordCount --target hadoop

This produces a messier-looking plan that will store wordCount on HDFS using a series of Hadoop streaming jobs.

Debugging Tips

Another command is --cat, which stores a view and then prints it. So, to step through the program view-by-view, you could type:

% python longer-wordcount.py --cat lines | head
% python longer-wordcount.py --cat words | head
% python longer-wordcount.py --cat wordCount | head

If you'd like to speed up the later steps by re-using the results of previous steps, you can do that with:

% python longer-wordcount.py --cat lines  | head
% python longer-wordcount.py --cat words --reuse lines.gp | head
% python longer-wordcount.py --cat wordCount --reuse lines.gp words.gp | head

If you want to get even further into the details, you can generate the plan and start running lines from it (or parts of them) one-by-one.