The purpose of this project is to develop a simple Map-Reduce program on Hadoop for graph processing.
This project must be done individually. No copying is permitted. Note: We will use a system for detecting software plagiarism, called Moss, which is an automatic system for determining the similarity of programs. That is, your program will be compared with the programs of the other students in class as well as with the programs submitted in previous years. This program will find similarities even if you rename variables, move code, change code structure, etc.
Note that, if you use a Search Engine to find similar programs on the web, we will find these programs too. So don't do it because you will get caught and you will get an F in the course (this is cheating). Don't look for code to use for your project on the web or from other students (current or past). Just do your project alone using the help given in this project description and from your instructor and GTA only.
You will develop your program on SDSC Comet. Optionally, you may use IntelliJ IDEA or Eclipse to help you develop your program, but you should test your programs on Comet before you submit them.
Follow the directions on How to login to Comet at comet.html. Please email the GTA if you need further help. After you login on Comet:
Edit the file .bashrc (note: it starts with a dot) using a text editor, such as nano .bashrc, and add the following 2 lines at the end (cut-and-paste):
alias run='srun --pty -A uot143 --partition=shared --nodes=1 --ntasks-per-node=1 --mem=5G -t 00:05:00 --wait=0 --export=ALL' export project=/oasis/projects/nsf/uot143/fegaraslogout and login again to apply the changes. On Comet, download and untar project1:
wget http://lambda.uta.edu/cse6331/project1.tgz tar xfz project1.tgz chmod -R g-wrx,o-wrx project1Go to project1/examples and look at the Map-Reduce example src/main/java/Simple.java. You can compile the Java file using:
run simple.buildand you can run it in standalone mode using:
sbatch simple.local.runThe file simple.local.out will contain the trace log of the Map-Reduce evaluation while the file output-simple/part-r-00000 will contain the results. Optionally, you can run Simple.java in distributed mode using:
sbatch simple.distr.runPlease note that running in distributed mode will waste at least 10 of your SUs.
In this project, you are asked to implement a simple graph algorithm that needs two Map-Reduce jobs. A directed graph is represented as a text file where each line represents a graph edge. For example,
20,40represents the directed edge from node 20 to node 40. First, for each graph node, you compute the number of node neighbors. Then, you group the nodes by their number of neighbors and for each group you count how many nodes belong to this group. That is, the result will have lines such as:
10 30which says that there are 30 nodes that have 10 neighbors. To help you, I am giving you the pseudo code. The first Map-Reduce is:
map ( key, line ): read 2 long integers from the line into the variables key2 and value2 emit (key2,value2) reduce ( key, nodes ): count = 0 for n in nodes count++ emit(key,count)The second Map-Reduce is:
map ( node, count ): emit(count,1) reduce ( key, values ): sum = 0 for v in values sum += v emit(key,sum)You should write the two Map-Reduce job in the Java file src/main/java/Graph.java. An empty src/main/java/Graph.java has been provided, as well as scripts to build and run this code on Comet. You should modify the Graph.java only. In your Java main program, args is the graph file and args is the output directory. The input file format for reading the input graph and the output format for the final result must be text formats, while the format for the intermediate results between the Map-Reduce jobs must be binary formats.
You can compile Graph.java on Comet using:
run graph.buildand you can run Graph.java in standalone mode over a small dataset using:
sbatch graph.local.runThe results generated by your program will be in the directory output. These results should be:
2 2 3 2 4 1 5 2 7 1You should develop and run your programs in standalone mode until you get the correct result. After you make sure that your program runs correctly in standalone mode, you run it in distributed mode using:
sbatch graph.distr.runThis will process the graph on the large dataset large-graph.txt and will write the result in the directory output-distr. These results should be similar to the results in the file large-solution.txt. Note that running in distributed mode will use up at least 10 of your SUs. So do this a couple of times only, after you make sure that your program works correctly in standalone mode.
If you have a prior good experience with an IDE (IntelliJ IDEA or Eclipse), you may want to develop your program using an IDE and then test it and run it on Comet. Using an IDE is optional; you shouldn't do this if you haven't used an IDE before.
On IntelliJ IDEA, go to New→Project from Existing Sources, then choose your project1 directory, select Maven, and then Finish. To compile the project, go to Run→Edit Configurations, use + to Add New Configuration, select Maven, give it a name, use Working directory: your project1 directory, Command line: install, then Apply.
On Eclipse, you first need to install m2e (Maven on Eclipse), if it's not already installed. Then go to Open File...→Import Project from File System, then choose your project1 directory. To compile your project, right click on the project name at the Package Explorer, select Run As, and then Maven install.
If you'd prefer, you may use your laptop to develop your program and then test it and run it on Comet. If you have Mac OS or Linux, make sure you have Java and Maven installed. If you have Windows 10, you may install Ubuntu Shell and do: sudo apt install openjdk-8-jdk-headless maven.
To install Hadoop and project:
cd wget http://apache.claz.org/hadoop/common/hadoop-2.6.5/hadoop-2.6.5.tar.gz tar xfz hadoop-2.6.5.tar.gz wget http://lambda.uta.edu/cse6331/project1.tgz tar xfz project1.tgzTo compile and run project1:
cd project1 mvn install rm -rf output export JAVA_HOME=/usr ~/hadoop-2.6.5/bin/hadoop jar target/*.jar Graph small-graph.txt outputJAVA_HOME must point to your java installation.
You need to submit the following files only:
project1/src/main/java/Graph.java project1/graph.local.out project1/output-distr/part-r-00000 project1/graph.distr.outDo not submit any other files. Just submit each of these 4 files one-by-one using the following form. These files are automatically uploaded directly into your personal class account for this particular project, so you don't have to include your name or student ID or project number in the file name. You may submit your files as many times as you like, but only the most recently submitted files will be retained and evaluated (newly submitted files replace the old files under the same file name). After you submit the files, please double-check that your submitted files are correct by clicking on the Status link. If you cannot login or have a problem submitting the project using this form, ask the GTA for help.
Last modified: 02/07/2019 by Leonidas Fegaras