GHC Hadoop cluster information
Status: all students registered for 10-605 should have accounts on these machines --Wcohen (talk) 11:18, 11 February 2014 (EST)
FQDNs are ghc{01..81}.ghc.andrew.cmu.edu. You can log into any of these with your Andrew credentials (I believe it's good form to avoid ghc81, though, since it is also the admin server). If you have permission problems logging in, email ugradlabs@cs.cmu.edu and cc wcohen@cs.cmu.edu.
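One way to spread logins across the cluster is to pick a host at random from ghc01 through ghc80. The following is only a sketch: the function name pick_ghc_host and the ANDREW_ID placeholder are made up for illustration and are not part of the cluster setup.

```shell
# Hypothetical helper: print a random host from ghc01..ghc80,
# leaving out ghc81 (the admin server).
pick_ghc_host() {
  # rand() is in [0,1), so int(rand()*80)+1 falls in 1..80.
  awk 'BEGIN { srand(); printf "ghc%02d.ghc.andrew.cmu.edu\n", int(rand() * 80) + 1 }'
}

# Usage (ANDREW_ID is a placeholder for your Andrew id):
#   ssh "$ANDREW_ID@$(pick_ghc_host)"
```

awk seeds srand() from the clock, so two calls within the same second may return the same host.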
The NameNode and Map/Reduce Admin. URLs:
- http://ghc81.ghc.andrew.cmu.edu:50070/dfshealth.jsp
- http://ghc81.ghc.andrew.cmu.edu:50030/jobtracker.jsp
Specs:
- 25 nodes have 12 Intel Xeon cores @ 3.20GHz, with 12288KB cache and 12GB RAM.
- 56 nodes have 4 Intel Xeon cores @ 2.67GHz, with 8192KB cache and 12GB RAM.
Anyone can log into these machines with their Andrew account. Registered 10-605 students will also have an HDFS home directory under their Andrew id (e.g., /user/wcohen). To use Hadoop you need a small amount of setup: below is a working .bashrc file. [Your default shell may or may not be bash - that's just what I use - W].
export PATH=$PATH:/usr/local/hadoop/bin
export JAVA_HOME=/usr/lib/jvm/jre-sun
export CLASSPATH=`ls -1 /usr/local/hadoop/*.jar|perl -ne 'do{chop;print $sep,$_;$sep=":";}'`
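The CLASSPATH setting joins every jar under /usr/local/hadoop into one colon-separated string. If perl is not handy, the same joining can be sketched in plain shell; the function name jar_classpath is hypothetical, and this is an alternative illustration rather than the setup the course uses.

```shell
# Hypothetical helper: print every .jar in a directory as one
# colon-separated string, suitable for CLASSPATH.
jar_classpath() {
  joined=""
  for jar in "$1"/*.jar; do
    # Skip the unexpanded pattern when the directory has no jars.
    [ -e "$jar" ] || continue
    # Add a ":" separator only once something has been collected.
    joined="${joined:+$joined:}$jar"
  done
  printf '%s\n' "$joined"
}

# Usage: export CLASSPATH=$(jar_classpath /usr/local/hadoop)
```

The helper sets the shell variables joined and jar as a side effect, which keeps it portable to shells without the local keyword.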
We recommend running a simple job soon to verify your setup and to make sure that the permissions and access for your account are set properly. For instance, to copy the rcv1 data from William's HDFS account to yours, sharded into 10 pieces by the 10 reduce tasks:
hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-streaming-1.0.1.jar \
  -mapper cat -reducer cat -numReduceTasks 10 -input /user/wcohen/rcv1/small/unsharded -output tmp-output
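Once the job finishes, a quick sanity check is to count the part files it wrote, since each reduce task writes one. A minimal sketch, assuming hadoop is on your PATH as set up above; the helper name count_parts is hypothetical.

```shell
# Hypothetical helper: count the part-* files in an HDFS directory.
# With -numReduceTasks 10, the streaming job should leave 10 of them.
count_parts() {
  hadoop fs -ls "$1" | grep -c '/part-'
}

# Usage:
#   count_parts tmp-output
#   hadoop fs -cat tmp-output/part-00000 | head -5
```

Hadoop refuses to write into an output directory that already exists, so remove tmp-output first (hadoop fs -rmr tmp-output on Hadoop 1.x) before rerunning the job.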