Programming Assignment 3
Graph Processing using Spark

Due on Thursday October 27 before midnight


The purpose of this project is to develop a graph analysis program using Apache Spark.

This project must be done individually. No copying is permitted. Note: We will use a system for detecting software plagiarism, called Moss, which is an automatic system for determining the similarity of programs. That is, your program will be compared with the programs of the other students in class as well as with the programs submitted in previous years. This program will find similarities even if you rename variables, move code, change code structure, etc.

Note that if you use a search engine to find similar programs on the web, we will find these programs too. So don't do it: you will get caught and you will get an F in the course (this is cheating). Don't look for code to use for your project on the web or from other students (current or past). Do your project alone, using only the help given in this project description and from your instructor and GTA.


As in Project #1, you will develop your program on SDSC Comet. Optionally, you may use Eclipse to help you develop your program, but you should test your programs on Comet before you submit them.

Setting up your Project

Log in to Comet, then download and untar project3:

tar xfz project3.tgz
chmod -R g-wrx,o-wrx project3

Go to project3/examples and look at the Spark example JoinSpark.scala. You can compile JoinSpark.scala using:

mkdir -p classes

and you can run it in local mode using:

File join.local.out will contain the trace log of the Spark evaluation.

Project Description

You are asked to re-implement Project #2 (finding the connected components of an undirected graph) using Spark and Scala. An empty project3/Graph.scala is provided, as well as scripts to build and run this code on Comet. You should modify Graph.scala only. Your main program should take only one argument: args(0) which is the input graph. It should print the connected components to the output. The same small graph small-graph.txt used for testing in local mode and the moderate-sized graph large-graph.txt used for testing in distributed mode in Project #2 are also used in Project #3 to test your Spark program.
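
To make the task concrete, here is a minimal sketch of one common way to find connected components: repeatedly give each vertex the smallest label seen among itself and its neighbours until nothing changes. It uses plain Scala collections rather than Spark, so it is not the expected Graph.scala. The object name ComponentsSketch, the helper components, the representation of edges as pairs of Long vertex IDs, and the printed format are all assumptions made for illustration; parsing of the input file is omitted because the file format is not restated in this description.

```scala
object ComponentsSketch {
  // Hypothetical helper: maps every vertex to the smallest vertex ID in its component.
  // Assumes every vertex appears in at least one edge.
  def components(edges: Seq[(Long, Long)]): Map[Long, Long] = {
    // Undirected graph: store each edge in both directions.
    val both = edges.flatMap { case (a, b) => Seq((a, b), (b, a)) }
    // Every vertex starts in its own group, labelled with its own ID.
    var labels: Map[Long, Long] = both.map { case (v, _) => v -> v }.toMap
    var changed = true
    while (changed) {
      // Each vertex offers its current label to its neighbours and keeps its own.
      val offers = both.map { case (src, dst) => (dst, labels(src)) } ++ labels.toSeq
      // Keep the smallest label offered to each vertex.
      val next = offers.groupBy(_._1).map { case (v, ls) => v -> ls.map(_._2).min }
      changed = next != labels
      labels = next
    }
    labels
  }

  def main(args: Array[String]): Unit = {
    // Two components: {1, 2, 3} and {4, 5}.
    val edges = Seq((1L, 2L), (2L, 3L), (4L, 5L))
    // Print each group label followed by the number of vertices in that group.
    components(edges).groupBy(_._2).toSeq.sortBy(_._1).foreach {
      case (group, members) => println(s"$group ${members.size}")
    }
  }
}
```

The operations used here (flatMap, map, grouping by key and reducing with min) have close counterparts on Spark RDDs, but how to express the iteration in Spark is left to you.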

You can compile Graph.scala using:

mkdir -p classes

and you can run it in local mode over the small graph using:

You should modify and run your program in local mode until you get the correct result. After you make sure that your program runs correctly in local mode, run it in distributed mode using:

This will run on the moderate-sized graph and print the results to the output.

Optional: Use Eclipse

If you have prior experience with Eclipse, you may want to develop your program in Eclipse, run it in local mode, and then test and run it on Comet. Using Eclipse is optional; you shouldn't do this if you haven't used Eclipse before.

First, install the Scala plugin in Eclipse using Install New Software... and cut-and-paste the update site URL. Note that this plugin doesn't work with Eclipse 4.6 (Neon). You also need to install the Apache Maven plugin. After you restart Eclipse, create a new Scala Project with Project name: JoinSpark. Right-click on the src directory inside JoinSpark and create a new Scala Class with Name: edu.uta.cse6331.JoinSpark. Cut and paste the source code of examples/JoinSpark.scala there. If you want to run it in local mode, you need to add the line conf.setMaster("local[2]") in the main program before you create the SparkContext (you should remove this line before you test your project on Comet).

Right-click on the project name JoinSpark and select Configure→Convert to Maven Project (check the create simple project option). Go to the Dependencies tab of the pom.xml and add the dependency org.apache.spark:spark-core_2.11:2.0.0. Save. Maven will automatically load the dependencies and compile your Scala code (it may take a few minutes). You should not get any build errors now.

Right-click on the project, select Run As→Run Configurations, select Scala Application, press the New button to create a new configuration, give it a name, add the main class edu.uta.cse6331.JoinSpark, and go to Arguments. Add 3 arguments: e.txt, d.txt, and output (on separate lines). Now you can run it in local mode by hitting Run. You can do the same for your project.
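
As an illustration of where the conf.setMaster("local[2]") line goes, here is a minimal sketch that sets the master on the configuration before the SparkContext is created. The object name LocalModeSketch, the helper localConf, and the application name "Join" are hypothetical and are not taken from examples/JoinSpark.scala. It assumes the spark-core dependency mentioned above is on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalModeSketch {
  // Hypothetical helper: builds a configuration that runs Spark inside the current JVM.
  def localConf(appName: String): SparkConf = {
    val conf = new SparkConf().setAppName(appName)
    // Local mode with 2 worker threads; remove this line before testing on Comet.
    conf.setMaster("local[2]")
    conf
  }

  def main(args: Array[String]): Unit = {
    // The master must be set before the context is created.
    val sc = new SparkContext(localConf("Join"))
    // Tiny sanity check that the context works: count a small in-memory dataset.
    println(sc.parallelize(1 to 10).count())
    sc.stop()
  }
}
```

Keeping the local-mode setting in one place makes it easy to remove before you build and run on Comet.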


You can learn more about Scala at:

You can learn more about Spark at:

What to Submit

You need to submit the following files only:


Submit Programming Assignment #3:

Last modified: 10/11/2016 by Leonidas Fegaras