The purpose of this project is to develop a graph analysis program using Apache Spark.
This project must be done individually. No copying is permitted. Note: We will use a system for detecting software plagiarism, called Moss, which is an automatic system for determining the similarity of programs. That is, your program will be compared with the programs of the other students in class as well as with the programs submitted in previous years. This program will find similarities even if you rename variables, move code, change code structure, etc.
Note that if you use a search engine to find similar programs on the web, we will find those programs too. So don't do it: you will get caught and you will get an F in the course (this is cheating). Don't look for code to use for your project on the web or from other students (current or past). Do your project alone, using only the help given in this project description and from your instructor and GTA.
As in Project #1, you will develop your program on SDSC Comet. Optionally, you may use Eclipse to help you develop your program, but you should test your programs on Comet before you submit them.
Log in to Comet, then download and untar project3:
    wget http://lambda.uta.edu/cse6331/project3.tgz
    tar xfz project3.tgz
    chmod -R g-wrx,o-wrx project3
Go to project3/examples and look at the Spark example JoinSpark.scala. You can compile JoinSpark.scala using:
    mkdir -p classes
    sbatch joinSparkScala.build
and you can run it in local mode using:
    sbatch joinSpark.local.run
The file join.local.out will contain the trace log of the Spark evaluation.
You are asked to re-implement Project #2 (finding the connected components of an undirected graph) using Spark and Scala. An empty project3/Graph.scala is provided, as well as scripts to build and run this code on Comet. You should modify Graph.scala only. Your main program should take only one argument, args(0), which is the input graph. It should print the connected components to the output. Project #3 uses the same test graphs as Project #2: the small graph small-graph.txt for testing in local mode and the moderate-sized graph large-graph.txt for testing in distributed mode.
You can compile Graph.scala using:
    mkdir -p classes
    sbatch graph.build
and you can run it in local mode over the small graph using:
    sbatch graph.local.run
You should modify and run your program in local mode until you get the correct result. After you make sure that your program runs correctly in local mode, run it in distributed mode using:
    sbatch graph.distr.run
This will work on the moderate-sized graph and will print the results to the output.
If you have prior experience with Eclipse, you may want to develop your program in Eclipse, run it in local mode, and then test and run it on Comet. Using Eclipse is optional; you shouldn't do this if you haven't used Eclipse before.
First, install Scala on Eclipse from scala-ide.org using Install New Software... and then cut-and-paste the update site URL. Note that this plugin doesn't work for Eclipse 4.6 (Neon). You also need to install the Apache Maven plugin. After you restart Eclipse, create a new Scala Project with Project name: JoinSpark. Right-click on the src directory inside JoinSpark and create a new Scala Class with Name: edu.uta.cse6331.JoinSpark. Cut and paste there the source code of examples/JoinSpark.scala. If you want to run it in local mode, you need to add the line conf.setMaster("local") in the main program before you create the SparkContext (you should remove this line before you test your project on Comet).
Right-click on the project name JoinSpark and select Configure→Convert to Maven Project (check the create simple project option). Go to the Dependencies tab of the pom.xml and add the dependency org.apache.spark:spark-core_2.11:2.0.0. Save. Maven will automatically load the dependencies and compile your Scala code (it may take a few minutes). You should not get any build errors now. Right-click on JoinSpark.scala→Run As→Run Configurations, select Scala Application, press the New button to create a new configuration, give it a name, add the main class edu.uta.cse6331.JoinSpark, and go to Arguments. Add 3 arguments: e.txt, d.txt, and output (on separate lines). Now you can run it in local mode by hitting Run. You can do the same for your project.
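The conf.setMaster("local") step mentioned above might look like the following minimal sketch. It assumes the spark-core dependency is on the classpath; the object name LocalModeSketch and the helper makeLocalConf are hypothetical and are not part of the provided project files, and the line count is only a sanity check, not part of the project.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical object name, used for illustration only.
object LocalModeSketch {

  // Hypothetical helper: builds a configuration that runs Spark inside the
  // current JVM. The setMaster("local") call is for Eclipse/local runs only
  // and should be removed before testing on Comet.
  def makeLocalConf(appName: String): SparkConf = {
    val conf = new SparkConf().setAppName(appName)
    conf.setMaster("local")
    conf
  }

  def main(args: Array[String]): Unit = {
    // The master must be set on the configuration before the context is created.
    val sc = new SparkContext(makeLocalConf("JoinSpark"))
    // Sanity check only: count the lines of the file given as args(0).
    val lines = sc.textFile(args(0))
    println("lines read: " + lines.count())
    sc.stop()
  }
}
```

Keeping the setMaster call in one place makes it easy to delete before the code is tested on Comet.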
You can learn more about Scala at:
You need to submit the following files only:
    project3/Graph.scala
    project3/graph.local.out
    project3/graph.distr.out
Last modified: 10/11/2016 by Leonidas Fegaras