Difference between revisions of "MRS Guinea Pig"

From Cohen Courses

Revision as of 14:14, 11 November 2015

What it is

mrs_gp stands for "Map-Reduce Streaming for Guinea Pig" and is designed to be an alternative backend for Guinea Pig. Its interface is similar to that of Hadoop streaming, which I assume in this document you already know. It is implemented in a single Python source file, mrs_gp.py, which is distributed with Guinea Pig. (Currently, it's only in the git branch mrs_gp_serve.)

I usually pronounce it as "missus gee pee".

How to use it without a server

I'm going to assume below that the Unix command "mrs" invokes "python mrs_gp.py". One way to make this true would be to type

alias mrs='python my/path/to/GuineaPig/mrs_gp.py'

for the appropriate path to your copy of mrs_gp.py. To run a streaming map-reduce command, type something like

mrs --input DIR1 --output DIR2 --mapper [SHELL_COMMAND1] --reducer [SHELL_COMMAND2] --numReduceTasks [K]

The arguments to --input and --output are directories on your local filesystem. The input directory DIR1 should contain some number of text files and nothing else: every file in DIR1 will be fed as standard input to a mapper process defined by the string SHELL_COMMAND1. The standard outputs of the mappers will be hashed into K buckets according to their "keys", and each bucket will then be sorted and sent to the standard input of a reducer process defined by the string SHELL_COMMAND2. The standard outputs of the reducers will be stored in files in the output directory, DIR2. These files will usually be given arbitrary names, like part0007.
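The shuffle step described above can be sketched in a few lines of Python. This is only an illustration, not the code in mrs_gp.py: the function names `key_of` and `shuffle` are made up for this example, the key is assumed to be the text before the first tab (the usual Hadoop streaming convention), and `zlib.crc32` is a stand-in for whatever hash mrs_gp actually uses.

```python
import zlib
from collections import defaultdict


def key_of(line):
    # Assumption: the key is everything before the first tab character.
    return line.split("\t", 1)[0]


def shuffle(mapper_lines, k):
    """Partition mapper output lines into k buckets by key, then sort each bucket."""
    buckets = defaultdict(list)
    for line in mapper_lines:
        # crc32 is a stand-in hash, chosen here because it is stable across runs.
        index = zlib.crc32(key_of(line).encode("utf-8")) % k
        buckets[index].append(line)
    # Sorting by key makes all lines with the same key adjacent,
    # which is what a streaming reducer expects on its standard input.
    return [sorted(buckets[i], key=key_of) for i in range(k)]


if __name__ == "__main__":
    lines = ["b\t1", "a\t1", "b\t1", "c\t1"]
    for i, bucket in enumerate(shuffle(lines, 2)):
        print(i, bucket)
```

Because every line with a given key hashes to the same bucket, each reducer sees all of the values for the keys it is responsible for.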

The previous contents of DIR2, if they exist, will be deleted.

The parallelism that mrs_gp uses is mostly process-level, not thread-level. There are threads involved in the shuffle step, but typically there will be multiple subprocesses:

  • each of the N input files in DIR1 will have its own mapper process
  • each of the K output files in DIR2 will have its own reducer process, and its own sort process.

The default number of reduce tasks is 1, and the default mapper and reducer commands are cat.
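To make the "one process per input file" idea concrete, here is a rough sketch of how a mapper stage could be launched with Python's subprocess module. It is not taken from mrs_gp.py: the function name `run_mappers` is hypothetical, it assumes a Unix shell is available, and it collects each mapper's output in memory rather than shuffling it.

```python
import os
import subprocess


def run_mappers(input_dir, command="cat"):
    """Start one shell subprocess per file in input_dir; return {filename: stdout bytes}."""
    procs = {}
    handles = []
    for name in sorted(os.listdir(input_dir)):
        fh = open(os.path.join(input_dir, name), "rb")
        handles.append(fh)
        # Each file becomes the standard input of its own mapper process.
        # All processes are started before any output is read, so they run concurrently.
        procs[name] = subprocess.Popen(
            command, shell=True, stdin=fh, stdout=subprocess.PIPE
        )
    outputs = {}
    for name, proc in procs.items():
        # communicate() reads the process's stdout to the end and waits for it to exit.
        out, _ = proc.communicate()
        outputs[name] = out
    for fh in handles:
        fh.close()
    return outputs


if __name__ == "__main__":
    import sys
    for name, out in run_mappers(sys.argv[1]).items():
        print(name, len(out), "bytes")
```

The operating system schedules the subprocesses across the available cores, which is where any speedup comes from.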

Why to use it without a server

This looks a lot like Hadoop streaming but really isn't at all useful for the same purposes. Hadoop is mostly used for I/O-bound problems, and for these, mrs_gp will not be any faster than running sequentially. For instance, since the files in DIR1 are all on the same disk, if that disk has only one disk head, it's not any faster to read 10 files of 100MB each than to read one file of 1000MB. So for I/O-bound tasks, mrs_gp's parallelism is useless.

It can be useful if your machine has multiple cores, and your job is CPU-bound, not I/O-bound.

Hint: One way to make a process less I/O-bound is to do the reading from a RAM disk, rather than a regular disk. Once you've done that, the multiprocessing done by mrs_gp will be more helpful.
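One way to follow the hint is to copy the input directory onto a RAM-backed filesystem before running the job. The sketch below assumes a Linux machine where /dev/shm is mounted as tmpfs (true on most distributions, but check yours); the function name `stage_in_ram` is made up for this example and is not part of mrs_gp.

```python
import os
import shutil


def stage_in_ram(src_dir, ram_root="/dev/shm"):
    """Copy src_dir under ram_root and return the path of the copy."""
    # Assumption: ram_root is a RAM-backed filesystem with room for the data.
    dest = os.path.join(ram_root, os.path.basename(os.path.normpath(src_dir)))
    if os.path.exists(dest):
        # Replace any stale copy left over from an earlier run.
        shutil.rmtree(dest)
    shutil.copytree(src_dir, dest)
    return dest


if __name__ == "__main__":
    import sys
    print(stage_in_ram(sys.argv[1]))
```

The returned path can then be passed as the --input argument in place of DIR1. Remember that the copy occupies memory until you delete it.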