Project #9
XML Search Engine

Due on Friday May 2 before midnight.


This project must be done individually. No copying is permitted. The purpose of this project is to build an XML search engine but for local XML documents only.


You will do this project on your own PC/laptop. First download project9.tar or and extract the files. It contains the jar file je.jar, which is the Berkeley DB Java Edition. This system is a storage manager, not a relational database. It only supports B+-trees and transactions.


Look at the method SearchDictionary.dump() in to see how you can access data from the indexes. You need to read the Java doc about the classes Iterator and Map. The classes ElementHit and TermHit are like the inverted indexes E-index and T-index in the class slides. You don't need to read the Berkeley DB documentation, but if you are interested, it is available here.

The XML Search Engine

You are provided with a program project9/ that builds two inverted indexes using B+-trees (on Berkeley DB), one for text terms (words) and one for XML elements. You compile them using javac -cp je.jar *.java For example, if you want to index a number of XML files, you use:

java -cp .:je.jar DocumentIndexer file1.xml file2.xml ...
(for Windows, you need to use semicolon instead of colon, that is -cp .;je.jar). Note that, the inverted indexes are persistent, so you would only need to index each file once (if you index a file twice, the file and all its terms will appear twice under a different document number). To remove the database and start with empty indexes, remove all the files from the directory project9/database/. To print the content of the database, use:
java -cp .:je.jar Dump

You need to modify the file project9/ to run one simple XPath query against the indexed XML document cs.xml:

//gradstudent[name/lastname = lname]
where lname is a lastname (a word) that you provide as an argument to the main program. Your program finds and prints all the hits (graduatestudents) that contain the word lname at the location //gradstudent/name/lastname. (You print the hits in the same way they are printed in dump().) For example, if you run:
java -cp .:je.jar Query Basney
it should print in the output:
gradstudent <0,130:173,2>

You should not use any data structure, such as vectors, stacks, etc. Instead, you should write nested loops that combine the results of the indexes using containment conditions.

What to Submit

Use the form below to submit the file

Submit this file:

Last modified: 04/22/08 by Leonidas Fegaras