The purpose of this project is to develop a data analysis program using Apache Spark.
This project must be done individually. No copying is permitted. Note: We will use a system for detecting software plagiarism, called Moss, which is an automatic system for determining the similarity of programs. That is, your program will be compared with the programs of the other students in class as well as with the programs submitted in previous years. This program will find similarities even if you rename variables, move code, change code structure, etc.
Note that if you use a search engine to find similar programs on the web, we will find these programs too. Don't do it: you will get caught and you will get an F in the course (this is cheating). Don't look for code to use for your project on the web or from other students (current or past). Do your project alone, using only the help given in this project description and by your instructor and GTA.
As in projects #1 and #2, you will develop your program on SDSC Comet. Optionally, you may use Eclipse to help you develop your program, but you should test your programs on Comet before you submit them.
Log in to Comet, then download and untar project3:
wget http://lambda.uta.edu/cse6331/project3.tgz
tar xfz project3.tgz
chmod -R g-wrx,o-wrx project3
Go to project3/examples and look at the Spark example JoinSpark.scala. You can compile JoinSpark.scala using:
run joinSparkScala.build
and you can run it in local mode using:
sbatch joinSpark.local.run
File join.local.out will contain the trace log of the Spark evaluation.
You are asked to re-implement Project #1 (matrix multiplication) using Spark and Scala. An empty project3/Multiply.scala is provided, as well as scripts to build and run this code on Comet. You should modify Multiply.scala only. Your main program should take three arguments: the two input matrices and the output matrix. There are two small sparse matrices, 4*3 and 3*3, in the files M-matrix-small.txt and N-matrix-small.txt for testing in standalone mode. Their matrix multiplication must return the 4*3 matrix in result-matrix-small.txt. There are also two moderate-sized matrices, 200*100 and 100*300, in the files M-matrix-large.txt and N-matrix-large.txt for testing in distributed mode.
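To make the expected result concrete, here is a minimal plain-Scala sketch of what multiplying two sparse matrices means. It is not a Spark program and not the project solution; it is only meant as a way to reason about what result-matrix-small.txt should contain. The object name SparseMultiplyCheck, the helper parseEntry, and the "i,j,v" line format are assumptions made for illustration; use the input format from Project #1.

```scala
// Plain-Scala reference check; NOT a Spark program and not the project solution.
// Assumption: a sparse matrix is a list of (row, column, value) triples,
// with zero entries omitted.
object SparseMultiplyCheck {
  type Entry = (Int, Int, Double)

  // Hypothetical helper: parse one input line, assuming the format "i,j,v".
  def parseEntry(line: String): Entry = {
    val fields = line.split(",")
    (fields(0).trim.toInt, fields(1).trim.toInt, fields(2).trim.toDouble)
  }

  // C(i,k) = sum over j of M(i,j) * N(j,k), computed only over stored entries.
  def multiply(m: Seq[Entry], n: Seq[Entry]): Map[(Int, Int), Double] = {
    // Index N by its row number so each M(i,j) can find the matching N(j,k).
    val nByRow: Map[Int, Seq[Entry]] = n.groupBy(_._1)
    val products = for {
      (i, j, mv) <- m
      (_, k, nv) <- nByRow.getOrElse(j, Seq.empty)
    } yield ((i, k), mv * nv)
    // Add up all partial products that land in the same output cell.
    products.groupBy(_._1).map { case (cell, ps) => cell -> ps.map(_._2).sum }
  }
}
```

For example, multiplying the 2*2 matrices M = {(0,0)=1, (0,1)=2, (1,1)=3} and N = {(0,0)=4, (1,0)=5, (1,1)=6} with this sketch gives C(0,0)=14, C(0,1)=12, C(1,0)=15, and C(1,1)=18.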
You can compile Multiply.scala using:
run multiply.build
and you can run it in local mode over the small matrices using:
sbatch multiply.local.run
You should modify and run your program in local mode until you get the correct result. After you make sure that your program runs correctly in local mode, run it in distributed mode using:
sbatch multiply.distr.run
This will work on the moderate-sized matrices and will print the results to the output.
If you have prior experience with Eclipse, you may want to develop your program in Eclipse, run it in local mode, and then test and run it on Comet. Using Eclipse is optional; you shouldn't do this if you haven't used Eclipse before.
First, install Scala on Eclipse from scala-ide.org using Install New Software... and then cut-and-paste the update site URL. Note that this plugin doesn't work for Eclipse 4.6 (Neon). You also need to install the Apache Maven plugin.
After you restart Eclipse, create a new Scala Project with Project name: JoinSpark. Right-click on the src directory inside JoinSpark and create a new Scala Class with Name: edu.uta.cse6331.JoinSpark. Cut and paste there the source code of examples/JoinSpark.scala. If you want to run it in local mode, you need to add the line conf.setMaster("local") in the main program before you create the SparkContext (you should remove this line before you test your project on Comet).
Right-click on the project name JoinSpark and select Configure→Convert to Maven Project (check the create simple project option). Go to the Dependencies tab of the pom.xml and add the dependency org.apache.spark:spark-core_2.11:2.0.0. Save. Maven will automatically load the dependencies and compile your Scala code (it may take a few minutes). You should not get any build errors now.
Right-click on JoinSpark.scala→Run As→Run Configurations, select Scala Application, press the New button to create a new configuration, give it a name, add the main class edu.uta.cse6331.JoinSpark, and go to Arguments. Add 3 arguments: e.txt, d.txt, and output (on separate lines). Now you can run it in local mode by hitting Run. You can do the same for your project.
You can learn more about Scala at:
You need to submit the following files only:
project3/Multiply.scala
project3/multiply.local.out
project3/multiply.distr.out
Last modified: 10/18/2017 by Leonidas Fegaras