Programming Assignment 2
Map-Reduce Optimization

Due on Tuesday October 9 before midnight


Description

The purpose of this project is to improve the performance of k-means clustering developed in Project 1 by using in-mapper combining.

This project must be done individually. No copying is permitted. Note: We will use a system for detecting software plagiarism, called Moss, which is an automatic system for determining the similarity of programs. That is, your program will be compared with the programs of the other students in class as well as with the programs submitted in previous years. This program will find similarities even if you rename variables, move code, change code structure, etc.

Note that, if you use a Search Engine to find similar programs on the web, we will find these programs too. So don't do it because you will get caught and you will get an F in the course (this is cheating). Don't look for code to use for your project on the web or from other students (current or past). Just do your project alone using the help given in this project description and from your instructor and GTA only.

Platform

As in Project1, you will develop your program on SDSC Comet. Optionally, you may use IntelliJ IDEA or Eclipse to help you develop your program, but you should test your programs on Comet before you submit them.

Setting up your Project

Login to comet and copy project1 to project2:

cd
cp -a project1 project2
Your project is to improve your Java code in project2/src/main/java/KMeans.java.

Project Description

In this project, you are asked to improve the performance of your code developed in Project 1 using in-mapper combining. See page 18 in the class notes.

Go inside the directory project2. You can compile KMeans.java on Comet using:

run kmeans.build
and you can run KMeans.java in standalone mode over a small dataset using:
sbatch kmeans.local.run
The new centroids generated by your program will be in the directory output and must be similar to solution-small.txt. You should develop and run your programs in standalone mode until you get the correct result. After you make sure that your program runs correctly in standalone mode, you run it in distributed mode using:
sbatch kmeans.distr.run
This will calculate kmeans on the large dataset points-large.txt and will write the result in the directory output-distr. Note that running in distributed mode will use up at least 10 of your SUs. So do this a couple of times only, after you make sure that your program works correctly in standalone mode.

Pseudo-Code

You should modify your project2/src/main/java/KMeans.java only, which is a copy of your KMeans.java file from Project 1. For this project though you will also use a hash table table, which for each centroid c (the hash table key), it holds the object Avg(sumX,sumY,count), where sumX and sumY are partial sums, and count is a partial count, so that the new centroid for c is at (sumX/count,sumY/count):

class Point {
    public double x;
    public double y;
}

class Avg {
    public double sumX;
    public double sumY;
    public long count;
}

Vector[Point] centroids;
Hashtable[Point,Avg] table;

mapper setup:
  read centroids from the distributed cache
  initialize table

mapper cleanup:
  for each key c in table
      emit(c,table[c])

map ( key, line ):
  Point p = new Point()
  read 2 double numbers from the line (x and y) and store them in p
  find the closest centroid c to p
  if table[c] is empty
     then table[c] = new Avg(x,y,1)
     else table[c] = new Avg(table[c].sumX+x,table[c].sumY+y,table[c].count+1)

reduce ( c, avgs ):
  count = 0
  sx = sy = 0.0
  for a in avgs
      sx += a.sumX
      sy += a.sumY
      count += a.count
  c.x = sx/count
  c.y = sy/count
  emit(c,null)

What to Submit

You need to submit the following files only:

project2/src/main/java/KMeans.java
project2/kmeans.local.out
project2/output/part-r-00000
project2/kmeans.distr.out

Submit Programming Assignment #2:

Last modified: 09/27/2018 by Leonidas Fegaras