Matrix Multiplication using Hive

The purpose of this project is to develop a simple program for matrix multiplication using Apache Hive.

This project must be done individually. No copying is permitted.
**Note: We will use a system for detecting software plagiarism, called
Moss,
which is an automatic system for determining
the similarity of programs.** That is, your program will be
compared with the programs of the other students in class as well as
with the programs submitted in previous years. This program will find
similarities even if you rename variables, move code, change code
structure, etc.

Note that if you use a search engine to find similar programs on the web, we will find these programs too. Don't do it: you will get caught and you will get an F in the course (this is cheating). Don't look for code to use for your project on the web or from other students (current or past). Do your project alone, using only the help given in this project description and from your instructor and GTA.

As in the previous projects, you will develop your program on SDSC Comet.

Log in to Comet, then download and untar project6:

wget http://lambda.uta.edu/cse6331/project6.tgz
tar xfz project6.tgz
chmod -R g-wrx,o-wrx project6

You may use Hive on Comet in local mode interactively, but you need to set up your PATH first (you need to do this every time you log in to Comet):

source ~/project6/setup

You also need to create an empty metastore database first (this must be done only once):

cd
schematool -dbType derby -initSchema

Then, to evaluate Hive commands interactively, do:

hive

Go to project6/example and look at the join.hql example. You can run it in local mode (after you set up your PATH) using:

hive -f join.hql

You are asked to re-implement Project #1 (matrix multiplication) using Apache Hive.
This time, you need to store the result of the multiplication into a Hive
table and then write a Hive query that computes the number of matrix elements and the
average matrix value of the multiplication result.
An empty `multiply.hql` is provided
as well as a script to run this code on Comet.
The input matrices are the same as in Project #1.
There are two small sparse matrices, 4*3 and 3*3, in the files
M-matrix-small.txt and N-matrix-small.txt for testing in local mode.
For these matrices, your program should print the following COUNT and AVG:

12 15.5

Then there are two moderate-sized matrices, 200*100 and 100*300, in the files M-matrix-large.txt and N-matrix-large.txt for testing in distributed mode. For these matrices, your program should print the following COUNT and AVG:

59889 -0.06041565604668677

Note: you can access the input matrices in Hive (which are passed as parameters) as ${hiveconf:M} and ${hiveconf:N}.
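To make the task concrete, here is a minimal Python sketch of the relational view of sparse matrix multiplication: join the two matrices on M's column index and N's row index, sum the products for each output cell, then compute the COUNT and AVG of the result. It is only an illustration of the idea, not the Hive solution. The matrices `M` and `N` below and the helper names `multiply` and `count_and_avg` are hypothetical, and the (row, column, value) triple representation is an assumption rather than the course's file format.

```python
from collections import defaultdict

# Hypothetical sparse matrices as (row, col, value) triples; NOT the course data.
M = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]  # 2x2, the entry at (1, 0) is absent
N = [(0, 0, 4.0), (1, 0, 5.0), (1, 1, 6.0)]  # 2x2, the entry at (0, 1) is absent


def multiply(m, n):
    """Join m and n on m.col == n.row, then sum the products per (row, col)."""
    # Index n by its row so that each m entry finds its join partners quickly.
    n_by_row = defaultdict(list)
    for k, j, value in n:
        n_by_row[k].append((j, value))

    # Group by the output cell (i, j) and accumulate the partial products.
    result = defaultdict(float)
    for i, k, m_value in m:
        for j, n_value in n_by_row[k]:
            result[(i, j)] += m_value * n_value
    return dict(result)


def count_and_avg(result):
    """Return the number of stored cells and their average value."""
    values = list(result.values())
    return len(values), sum(values) / len(values)


product = multiply(M, N)
count, avg = count_and_avg(product)
print(count, avg)  # prints: 4 14.75
```

In this formulation an output cell exists only if at least one pair of entries joins on the shared index, so for sparse inputs the count can be smaller than the full number of cells in the product matrix.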

To run it in local mode over the two small matrices do:

hive -f multiply.hql --hiveconf M=M-matrix-small.txt --hiveconf N=N-matrix-small.txt

To dump the output to the file multiply.local.out, do:

hive -f multiply.hql --hiveconf M=M-matrix-small.txt --hiveconf N=N-matrix-small.txt &>multiply.local.out

After you make sure that your program runs correctly in local mode, run it in distributed mode using:

sbatch multiply.distr.run

This will multiply the moderate-sized matrices.

You can learn more about Hive at:

You need to submit the following files only:

project6/multiply.hql
project6/multiply.local.out
project6/multiply.distr.out

Last modified: 11/16/2017 by Leonidas Fegaras