GHC Hadoop cluster information
Last update: --Wcohen (talk) 15:08, 20 February 2015 (EST)

Domain names are ghc{26..46}.ghc.andrew.cmu.edu. You can log into any of these with your andrew credentials (I believe it's good form not to use ghc46, though, since it is also the admin server). If you have permission problems logging in, email ugradlabs@cs.cmu.edu and cc wcohen@cs.cmu.edu. Also email ugradlabs if hadoop seems to be down, e.g., if basic commands like 'hadoop fs -ls' don't work.

For example, you might use 'ssh ghc34.ghc.andrew.cmu.edu' to log in. (Don't use port 8020; use the default ssh port.)

The NameNode and Map/Reduce admin URLs:

  • http://ghc89.ghc.andrew.cmu.edu:50070/dfshealth.jsp
  • http://ghc89.ghc.andrew.cmu.edu:50030/jobtracker.jsp

Specs:

  • 25 nodes have 12 Intel Xeon cores @ 3.20 GHz, with 12288 KB cache and 12 GB RAM.
  • 56 nodes have 4 Intel Xeon cores @ 2.67 GHz, with 8192 KB cache and 12 GB RAM.

Anyone can log into these machines with their Andrew account. Registered 10-605 students will also have an HDFS home directory under their andrew id (e.g., /user/wcohen). To use Hadoop you need a small amount of setup; below is a working .bashrc file. [Your default shell may or may not be bash - that's just what I use - W].

  # put the Hadoop (and, optionally, Pig) binaries on your PATH
  export PATH=$PATH:/usr/local/hadoop/bin:/usr/local/pig-0.12.0/bin
  # point Hadoop at the Sun JRE installed on these machines
  export JAVA_HOME=/usr/lib/jvm/jre-sun
  # join all the Hadoop jars into a colon-separated CLASSPATH
  export CLASSPATH=`ls -1 /usr/local/hadoop/*.jar|perl -ne 'do{chop;print $sep,$_;$sep=":";}'`

Some students have reported problems with this setup: if you type 'which hadoop', it should resolve to /usr/local/hadoop-1.2.1/bin/hadoop; if it doesn't, you may need to tweak your PATH appropriately. --Wcohen (talk) 15:08, 20 February 2015 (EST)
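A quick sanity check after editing .bashrc (a minimal sketch; it assumes your HDFS home directory matches your andrew id, which is $USER when you're logged into these machines):

  # reload the shell config, then confirm the Hadoop binary and HDFS access
  source ~/.bashrc
  which hadoop                # expected: /usr/local/hadoop-1.2.1/bin/hadoop
  hadoop fs -ls /user/$USER   # should list your (possibly empty) HDFS home directory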

You only need to add the directory /usr/local/pig-0.12.0/bin to your PATH if you want to use PIG.
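If you do set up Pig, a similar sanity check (assuming Pig is installed at the path above; 'pig -version' should print the version banner):

  which pig      # expected: /usr/local/pig-0.12.0/bin/pig
  pig -version   # should report Pig 0.12.0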

We recommend running a simple job soon to verify your setup (and to make sure that the permissions and access for your account are set properly). For instance, to copy a sharded version of the rcv1 data from William's HDFS account to yours:

  hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar \
    -mapper cat -reducer cat -numReduceTasks 10 \
    -input /user/wcohen/rcv1/small/sharded -output tmp-output
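If the job succeeds, the output lands in tmp-output under your HDFS home directory. A minimal way to inspect and clean it up (part-00000 is the default Hadoop output naming; -rmr is the recursive delete in Hadoop 1.x):

  hadoop fs -ls tmp-output                      # one part-NNNNN file per reduce task
  hadoop fs -cat tmp-output/part-00000 | head   # peek at the first few lines
  hadoop fs -rmr tmp-output                     # remove it; a rerun fails if -output already exists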