Programming Assignment 1
A Simple Map-Reduce Program

Due on Friday September 28 before midnight


The purpose of this project is to develop a simple Map-Reduce program on Hadoop that evaluates one step of k-means clustering.

This project must be done individually. No copying is permitted. Note: We will use a system for detecting software plagiarism, called Moss, which is an automatic system for determining the similarity of programs. That is, your program will be compared with the programs of the other students in the class as well as with the programs submitted in previous years. Moss will find similarities even if you rename variables, move code, change code structure, etc.

Note that, if you use a search engine to find similar programs on the web, we will find these programs too. So don't do it: you will get caught and you will get an F in the course (this is cheating). Don't look for code for your project on the web or from other students (current or past). Do your project alone, using only the help given in this project description and from your instructor and GTA.


You will develop your program on SDSC Comet. Optionally, you may use IntelliJ IDEA or Eclipse to help you develop your program, but you should test your programs on Comet before you submit them.

Setting up your Project

Follow the directions on How to login to Comet at comet.html. Please email the GTA if you need further help.

Edit the file .bashrc (note: it starts with a dot) using a text editor, such as nano .bashrc, and add the following 2 lines at the end (cut-and-paste):

alias run='srun --pty -A uot143 --partition=shared --nodes=1 --ntasks-per-node=1 --mem=5G -t 00:05:00 --wait=0 --export=ALL'
export project=/oasis/projects/nsf/uot143/fegaras
Log out and log in again to apply the changes. On Comet, download and untar project1:
tar xfz project1.tgz
chmod -R g-wrx,o-wrx project1
Go to project1/examples and look at the two Map-Reduce examples in src/main/java/ and src/main/java/. You can compile both Java files using:
and you can run them in standalone mode using:
The file example.local.out will contain the trace log of the Map-Reduce evaluation, while the files output-simple/part-r-00000 and output-join/part-r-00000 will contain the results. Optionally, you can run in distributed mode using:
Please note that running in distributed mode will waste at least 10 of your SUs.

K-means Clustering

In this project, you will implement one step of Lloyd's algorithm for k-means clustering. The goal is to partition a set of points into k clusters of neighboring points. The algorithm starts with an initial set of k centroids. Then, it repeatedly partitions the input according to which of these centroids is closest and finds a new centroid for each partition. That is, if you have a set of points P and a set of k centroids C, the algorithm repeatedly applies the following steps:

  1. Assignment step: partition the set P into k clusters of points Pi, one for each centroid Ci, such that a point p belongs to Pi if Ci is the closest centroid to p among all centroids.
  2. Update step: calculate the new centroid Ci from the cluster Pi so that the x,y coordinates of Ci are the means of the x,y coordinates of all points in Pi.
The datasets used are random points on a plane in the squares (i*2+1,j*2+1)-(i*2+2,j*2+2), with 0≤i≤9 and 0≤j≤9 (so k=100 in k-means). The initial centroids in centroid.txt are the points (i*2+1.2,j*2+1.2). So the new centroids should be in the middle of the squares at (i*2+1.5,j*2+1.5).
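The two steps above can be sketched in plain Java, outside of Hadoop. This is only an illustrative sketch of one Lloyd iteration over in-memory arrays, not the Map-Reduce structure you must implement; the class and method names here are made up for the example:

```java
import java.util.Arrays;

public class KMeansStep {
    // One Lloyd iteration: assign each point to its closest centroid,
    // then move each centroid to the mean of its assigned points.
    static double[][] step(double[][] points, double[][] centroids) {
        int k = centroids.length;
        double[] sx = new double[k], sy = new double[k];
        int[] count = new int[k];
        for (double[] p : points) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int i = 0; i < k; i++) {
                double dx = p[0] - centroids[i][0], dy = p[1] - centroids[i][1];
                double d = dx * dx + dy * dy;   // squared Euclidean distance
                if (d < bestDist) { bestDist = d; best = i; }
            }
            sx[best] += p[0]; sy[best] += p[1]; count[best]++;
        }
        double[][] result = new double[k][2];
        for (int i = 0; i < k; i++) {
            result[i][0] = sx[i] / count[i];
            result[i][1] = sy[i] / count[i];
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] points = {{1.0, 1.0}, {2.0, 2.0}, {9.0, 9.0}, {10.0, 10.0}};
        double[][] centroids = {{1.2, 1.2}, {9.2, 9.2}};
        System.out.println(Arrays.deepToString(step(points, centroids)));
        // prints [[1.5, 1.5], [9.5, 9.5]]
    }
}
```

Note how the new centroids land at the means of their clusters, which is exactly what the update step requires.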

Project Description

In this project, you are asked to implement one step of the k-means clustering algorithm. You should write one Map-Reduce job in the Java file src/main/java/. An empty src/main/java/ has been provided, as well as scripts to build and run this code on Comet. You should modify only this file.

You can compile on Comet using:

and you can run in standalone mode over a small dataset using:
The new centroids generated by your program will be in the directory output and must be similar to solution-small.txt. You should develop and run your program in standalone mode until you get the correct result. After you make sure that your program runs correctly in standalone mode, run it in distributed mode using:
This will calculate k-means on the large dataset points-large.txt and will write the result in the directory output-distr. Note that running in distributed mode will use up at least 10 of your SUs. So do this only a couple of times, after you make sure that your program works correctly in standalone mode.


To help you, I am giving you the pseudocode:

class Point {
    public double x;
    public double y;
}

Vector<Point> centroids;

mapper setup:
  read centroids from the distributed cache

map ( key, line ):
  Point p = new Point()
  read 2 double numbers from the line (x and y) and store them in p
  find the closest centroid c to p
  emit ( c, p )

reduce ( c, points ):
  count = 0
  sx = sy = 0.0
  for p in points
      count++
      sx += p.x
      sy += p.y
  c.x = sx/count
  c.y = sy/count
  emit the new centroid c
In your Java main program, args[0] is the data point file (points-small.txt or points-large.txt), args[1] is centroids.txt, and args[2] is the output directory. Use job.addCacheFile(new URI(args[1])) to broadcast the file centroids.txt to all mappers, and Mapper.Context.getCacheFiles to access the broadcast file at the mapper setup (method setup):
URI[] paths = context.getCacheFiles();
Configuration conf = context.getConfiguration();
FileSystem fs = FileSystem.get(conf);
BufferedReader reader = new BufferedReader(new InputStreamReader( Path(paths[0]))));
then use reader.readLine() to read the lines from the file and store the centroids in the vector centroids.
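The read loop might look like the sketch below. It is written against a plain BufferedReader so it runs outside Hadoop (in setup you would pass the reader obtained from FileSystem above), and it assumes each line of centroids.txt holds the x and y coordinates separated by a comma; check the actual file format and adjust the delimiter accordingly. The class name CentroidReader is made up for this example:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.Vector;

public class CentroidReader {
    // Parse centroid lines into a vector of {x, y} pairs.
    // Assumes each line holds two doubles separated by a comma.
    static Vector<double[]> readCentroids(BufferedReader reader) throws IOException {
        Vector<double[]> centroids = new Vector<>();
        String line;
        while ((line = reader.readLine()) != null) {
            String[] parts = line.trim().split(",");
            centroids.add(new double[]{ Double.parseDouble(parts[0]),
                                        Double.parseDouble(parts[1]) });
        }
        return centroids;
    }

    public static void main(String[] args) throws IOException {
        // Simulate a small centroids file with a StringReader.
        BufferedReader r = new BufferedReader(new StringReader("1.2,1.2\n3.2,1.2\n"));
        Vector<double[]> cs = readCentroids(r);
        System.out.println(cs.size() + " centroids, first x = " + cs.get(0)[0]);
    }
}
```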
You need to make the Point class WritableComparable. How do you compare 2 Points? Compare the x components first; if equal, compare the y components. You need to add a toString method for Point to print points. How do you find the closest centroid to a point p? Go through all centroids and find one whose Euclidean distance from p is minimum.
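One possible shape of these methods is sketched below, without the Hadoop plumbing: in the real project Point must implement WritableComparable (which additionally requires write and readFields), whereas here it implements plain Comparable so the sketch compiles on its own. The helper closest is a name made up for this example:

```java
import java.util.List;

public class Point implements Comparable<Point> {
    public double x, y;

    public Point(double x, double y) { this.x = x; this.y = y; }

    // Compare the x components first; if equal, compare the y components.
    public int compareTo(Point o) {
        int c = Double.compare(x, o.x);
        return (c != 0) ? c : Double.compare(y, o.y);
    }

    // Hadoop uses toString to print the key in the output file.
    @Override
    public String toString() { return x + "," + y; }

    // Index of the centroid with minimum Euclidean distance to p.
    static int closest(Point p, List<Point> centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centroids.size(); i++) {
            double dx = p.x - centroids.get(i).x, dy = p.y - centroids.get(i).y;
            double d = Math.sqrt(dx * dx + dy * dy);
            if (d < bestDist) { bestDist = d; best = i; }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(new Point(1.5, 2.5));   // prints 1.5,2.5
    }
}
```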

Optional: Use an IDE to develop your project

If you have good prior experience with an IDE (IntelliJ IDEA or Eclipse), you may want to develop your program using an IDE and then test it and run it on Comet. Using an IDE is optional; you shouldn't do this if you haven't used an IDE before.

On IntelliJ IDEA, go to New→Project from Existing Sources, then choose your project1 directory, select Maven, and then Finish. To compile the project, go to Run→Edit Configurations, use + to Add New Configuration, select Maven, give it a name, use Working directory: your project1 directory, Command line: install, then Apply.

On Eclipse, you first need to install m2e (Maven on Eclipse), if it's not already installed. Then go to Open File...→Import Project from File System, then choose your project1 directory. To compile your project, right click on the project name at the Package Explorer, select Run As, and then Maven install.


What to Submit

You need to submit the following files only:

Do not submit any other files. Just submit each of these 4 files one-by-one using the following form. These files are automatically uploaded directly into your personal class account for this particular project, so you don't have to include your name or student ID or project number in the file name. You may submit your files as many times as you like, but only the most recently submitted files will be retained and evaluated (newly submitted files replace the old files under the same file name). After you submit the files, please double-check that your submitted files are correct by clicking on the Status link. If you cannot login or have a problem submitting the project using this form, ask the GTA for help.

Submit Programming Assignment #1:

Last modified: 09/12/2018 by Leonidas Fegaras