Programming Assignment #9
XML Search Engine using Native Storage

Due on Tuesday May 4 before midnight.


This project must be done individually. No copying is permitted. The purpose of this project is to build an XML search engine using a native storage but for local XML documents only.


You will do this project on your own PC/laptop. First download project9.tar or and extract the files. It contains the jar file je.jar, which is the Berkeley DB Java Edition. This system is a storage manager, not a relational database. It only supports B+-trees and transactions.


Look at the method SearchDictionary.dump() in to see how you can access data from the indexes. You need to read the Java doc about the classes Iterator and Map. The classes ElementHit and TermHit are like the inverted indexes E-index and T-index in the class slides. You don't need to read the Berkeley DB documentation, but if you are interested, it is available here.

The XML Search Engine

You are provided with a program project9/ that builds two inverted indexes using B+-trees (on Berkeley DB), one for text terms (words) and one for XML elements. You compile them using javac -cp je.jar *.java For example, if you want to index a number of XML files, you use:

java -cp .:je.jar DocumentIndexer file1.xml file2.xml ...
(for Windows, you need to use semicolon instead of colon, that is -cp .;je.jar). Note that, the inverted indexes are persistent, so you would only need to index each file once (if you index a file twice, the file and all its terms will appear twice under a different document number). To remove the database and start with empty indexes, remove all the files from the directory project9/database/. To print the content of the database, use:
java -cp .:je.jar Dump

First, you need to index the document auction.xml used in projects 5 and 7. That is, you create the database using java -cp .:je.jar DocumentIndexer ../auction.xml (or whatever is the location of auction.xml). Then, you need to modify the file project9/ to run one simple XPath query against the auction database:

//namerica//item[.//listitem/text contains w1 and w2]/name
where w1 and w2 are two words that you provide as arguments to the main program. Your program should find and print all the hits (the names of asian items) that contain both the words w1 and w2 at the location //namerica//item//listitem/text. (You print the hits in the same way they are printed in dump().) For example, if you run:
java -cp .:je.jar Query honorable dumbness
your program should print to the output:
name <0,3541:3546,5>
More specifically, the only hits that satisfy the query above are:
namerica <0,3236:5083,3>
item <0,3533:4175,4>
listitem <0,3553:3786,7>
text <0,3554:3785,8>
name <0,3541:3546,5>
honorable <0,3706,8>
dumbness <0,3724,8>

Assume that there is only one indexed document (so you don't need to compare document numbers). You should not use any data structure, such as vectors, stacks, etc to store hits. Instead, you should write nested loops that compares the results of the indexes using containment conditions. For example, the code for //A//B may look like this:

StorageIterator aElems = new StorageIterator(dictionary.elementDB,"A");
StorageIterator bElems = new StorageIterator(dictionary.elementDB,"B");
ElementHit as = getElement(;
ElementHit bs = getElement(;
do {
    do {
        if (as.begin < bs.begin && as.end > bs.end)
        else if (as.end < bs.begin)
        bs = getElement(;
    } while (!bElems.eos());
    as = getElement(;
} while (!aElems.eos());
For indexed terms (words), use the dictionary.termDB index, which provides TermHit entries (there is a similar method, getTerm, for indexed terms (words)).

What to Submit

Use the form below to submit the file

Submit Project #9:

Last modified: 04/21/10 by Leonidas Fegaras