Programming Assignment 4
Single Source Shortest Distance using Spark

Due on Friday November 10 before midnight


Description

The purpose of this project is to develop a program to solve the single source shortest distance problem using Spark.

This project must be done individually. No copying is permitted. Note: We will use a system for detecting software plagiarism, called Moss, which is an automatic system for determining the similarity of programs. That is, your program will be compared with the programs of the other students in class as well as with the programs submitted in previous years. This program will find similarities even if you rename variables, move code, change code structure, etc.

Note that, if you use a Search Engine to find similar programs on the web, we will find these programs too. So don't do it because you will get caught and you will get an F in the course (this is cheating). Don't look for code to use for your project on the web or from other students (current or past). Just do your project alone using the help given in this project description and from your instructor and GTA only.

Platform

As in the previous projects, you will develop your program on SDSC Comet.

Setting up your Project

Login into Comet and download and untar project4:

wget http://lambda.uta.edu/cse6331/project4.tgz
tar xfz project4.tgz
chmod -R g-wrx,o-wrx project4

Project Description

Write a Spark program in Scala project4/Source.scala to solve the single source shortest distance problem using Spark: for each node in a directed graph, calculate the shortest distance from the node with id=0 to this node. The input directed graph is a dataset of edges, where an edge from the node i to the node j is represented in the input text file as:

i,d,j
where d is the distance from node i to node j. (Numbers i,j, and d are long integers.) Let distance[i] be the shortest distance from the node with id=0 to the node with id=i. The pseudo-code to calculate distance[i] is as follows:
distance[0] = 0
for each node i <> 0:
    distance[i] = Long.MAX_VALUE
repeat 4 times:
    for each edge (i,d,j):
        if distance[j] > distance[i]+d
           distance[j] = distance[i]+d
Your code that calculates the new distances from the old must be repeated 4 times only.
Hint: You may want to group edges by their destination as a triple (j,distance[j],{(i1,d1),(i2,d2),...}), which are derived from the edges (i1,d1,j),(i2,d2,j),...
, where distance[j] is initially Long.MAX_VALUE.

An empty project4/Source.scala is provided, as well as scripts to build and run this code on Comet. You should modify Source.scala only. Your main program should take two arguments: args(0) is the input graph and args(1) is the output. The output should look like:

i,d
where d is the shortest distance from node 0 to node j. The small graph small-graph.txt is used for testing in local mode and the moderate-sized graph large-graph.txt is used for testing in distributed mode. The output for the small graph should be:
0 0
1 2
2 3
3 5

You can compile Source.scala using:

run source.build
and you can run it in local mode over the small source using:
sbatch source.local.run
You should modify and run your programs in local mode until you get the correct result. After you make sure that your program runs correctly in local mode, you run it in distributed mode using:
sbatch source.distr.run
This will work on the moderate-sized source and will print the results to the output.

You should test your programs in local mode on small-graph.txt until you get the correct result. After you make sure that your program runs correctly in local mode, you run it in distributed mode on large-graph.txt.

What to Submit

You need to submit the following files only:

project3/Source.scala
project3/source.local.out
project3/source.distr.out
Submit Programming Assignment #4:

Last modified: 10/26/2017 by Leonidas Fegaras