Due Friday, April 28 @ 5:00 PM for 125 points
(updated on 4/10/06, changes shown
in blue)
(1 update to insert command on 4/17/06, shown in
red)
The goal of this assignment is to develop a database system for a university to store and retrieve student records. To do this, you will bring together several concepts covered throughout the semester. You will implement a database system supporting access by primary keys and multiple secondary keys. The data records will be stored in a binary disk file. Access to the records will be via primary key, with primary key index being a hash table. Other queries will be on secondary keys, indexed by Binary Search Trees.
The database will store records of students at the university. Database records will contain 6 fields:
Data records must be stored in the file in binary format. When storing the data, you may reorder the fields within the record as you see fit. Note that the variable length string for Address field may require that you store additional information to handle this. The only place where any given record will be stored in its complete form will be in the disk file. The disk file will simply store the records sequentially in the order that they are inserted into the database. When a new record is entered into the database, it is simply appended at the end of the file. When a record is deleted from the database, it simply creates a "hole" in the file, some number of bytes that are no longer used until the database is re-initialized.
The ID field for the records serves as the primary key for the database. You will use a hash table, stored in memory, to index the student ID field. ID is a 4-byte unsigned integer. The hash function you will use is the 'square middle bits' technique as follows: First, square the value. Then, extract the middle 2 bytes of the 4 byte result. Finally, perform the modulus operation by the size of the hash table. You must use the closed hashing method. The collision resolution method you will use is double hashing. The second hash function, used for the probing sequence, should simply take the ID value modulus the hash table size-1, then add 1. Thus the 2nd hash function values should range from 1 to tablesize-1. (Note: why would it be bad to have 2nd hash value = 0 or tablesize?). The hash table slots will store the primary key value (the ID), the byte offset in the file where the record begins, and the total size of the record in bytes.
Search commands can also refer to one of the secondary key fields in the data records. The secondary keys that are supported by the search command are the Name, GPA, Major, and Bucks fields. For each secondary key field, there will be an associated Binary Search Tree (BST), stored in memory. Thus, your program maintains 4 BSTs, each with its own data type and comparator function. For full credit, you must use templates so that you have only one BST implementation that is reused for each of the data types. BST nodes must not store complete records. Instead, a BST node will store only the appropriate secondary key value and the primary key value (ID) for the associated record. Searches on secondary keys must first use the appropriate BST index to retrieve the ID values, then use the hash table to access the complete records.
The input to the program will be a command file containing commands, read from standard input. Output for the program will be to standard output. As usual, you must implement the hash table and BSTs yourself, and cannot use libraries such as STL for them.
Your program will be called studentdb, and will have 2 command-line parameters:
% studentdb myfile.bin 101
The input file (read from standard input) will consist of the commands enter, delete, search/searchr, dump, and makenull. Each enter command will have 6 lines of input, with 1 line per field of the record, in the field order specified above. Format is as follows:
enter John Doe
1001 Anywhere Street, Nowhere, USA
123-45-6789
3.8
CS
25000
The ID field will always be in the form xxx-xx-xxxx, and should be converted to a 4-byte integer. If this value duplicates any existing record in the database, then this enter command is an error, and the record should not be inserted into the database and an appropriate error message should be output. Duplicates are allowed for all other fields. On successful enter, the record should be appended to the data file, and corresponding data added to the hash table, and to each BST index. The enter command should also output (to standard out) the entire record to be inserted, and the byte location and length in the database file if successful or an error message if ID is a duplicate. Also, output the final hash index of the record in the hash table. If hashing fails to find an open slot (probing sequence returns to the home position), an error message should be output and the record should not be inserted into the database.
The delete command (the most difficult command to implement) has the following form:
delete field value
delete ID 123-45-6789
delete Name John Doe
The first parameter (field) specifies which key field to use, and must be one of the strings "ID", "Name", "GPA", "Major", or "Bucks". The second parameter is the key value of the record(s) to be deleted. If the given field is ID, then the given key value uniquely identifies a single record to delete. In the case of a secondary index, all matching records should be deleted. If there are no matching records, an appropriate message should be output. For each record deleted, the record should be removed from the hash table and from all 4 BSTs. Note that removing a specific record from the BSTs will require a special method to search and delete based on both the corresponding index value and the ID, which is different from when you want to delete all records with a given value from a BST. You need not actually do anything to remove the record from the data file, those bytes will simply be unused until the next time the database is reinitialized. Use a tombstone to mark deletion from the hash table. For each record deleted, output (to stdout) the entire record and its byte location and length in the data file. If no records are deleted, output a message accordingly.
The search/searchr command has two forms:
search field value
search ID 123-45-6789
searchr field min max
searchr GPA 3.5 4.0
The first form is similar in form to the delete command. In this case, search should output (to stdout) all records that match the specified field value, or a message indicating that no matching records found. The second form, searchr performs a range search and outputs all records with field value between the given min and max values. Min and max are inclusive, meaning that records that have exactly the min or max value are also output. For string data, use ASCII order to determine range matches, such as in strcmp(). Searchr should not operate on the ID field; it should only allow the Name, GPA, Major, and Bucks fields. Search should operate on all 5 fields. Note that Address field is not used by search or delete commands. Each record that is output should output the entire record, including its byte location and length in the data file. For a Searchr range search, records should be output in sorted order according to the field searched on. In the example above, records would be output sorted by GPA from 3.5 to 4.0. For both search types, if no records match, an appropriate message should be output.
The makenull command will reinitialize the database by reinitializing the hash table, all BSTs, and the data file to empty.
The dump command should print the contents of the hash table (all records and their hash locations in index order), and the contents of all 4 BSTs (all IDs and values, in in-order traversal order).
You can expect all input commands, and program command line arguments, to be syntactically correct.
Given that you have some records stored in your database, here is an example of how a search will operate. Consider the following command:
searchr GPA 3.5 4.0
This indicates a search on the GPA field for all students with a GPA value between 3.5 and 4.0. Thus, you begin by searching in the GPA BST. Each entry in the BST with a matching key value (in 3.5-4.0 range) will have stored with it the ID value for that record. You will search the hash table for each such ID value, to recover its position in the disk file. For each such record, you retrieve the bytes from disk file, and then print all the field values.
If this were a delete command, each of the matching records would then need to be removed from the other BSTs and the hash table. For example, to then remove a matching record (found in the GPA BST) from the Major BST, you would have to get the record's Major value (e.g. 'CS') from the record on disk, and then search the Major BST using the Major value and the ID value and remove it. Searching in the Major BST with only the Major value of the targeted record, without using the ID value, might remove the wrong record because duplicate Majors are allowed. You don't want to remove all CS majors, you want to remove all students with GPA's between 3.5 and 4.0.
You must conform to good programming/documentation standards, as described in the Elements of Programming Style. Some specifics:
Neither the GTAs nor the instructors will help any student debug an implementation, unless it is properly documented and exhibits good programming style. Be sure to begin your internal documentation right from the start.
You may only use code you have written, either specifically for this project or for earlier programs, or code taken from the textbook. Note that the textbook code is not designed for the specific purpose of this assignment, and is therefore likely to require modification. It may, however, provide a useful starting point. You may not use code from STL, MFC, or a similar library in your program.
A sample data file will be posted to the website to help you test your program. This is not the data file that will be used in grading your program. While the test data provided should be useful, you should also do testing on your own test data to ensure that your program works correctly.
You will submit a tarred and gzipped file containing all the source code for the project, a makefile and an optional readme file. All the files should be in a flat directory tree, i.e. there should be no subdirectories in your tar file. The file should end in a tar.gz extension. The makefile will allow the TA to simply type make at the command line and create your executable.
Programs should compile and run using Gnu G++ under Mandrake Linux. The optional readme file will explain any pertinent information needed by the TA to aid them in the grading of your submission.Once you have assembled the archive for submission, for your own protection, please move it to a location other than your development directory, unzip the contents, build an executable, and test that executable on at least one input. Failure to do this may result in delayed evaluation of your program and a loss of points.
You will submit your project to the automated Curator server. Links to the instructions and the curator client are posted at the CS2604 website. If you make multiple submissions, only your last submission will be evaluated.
Your project submission must include a statement, pledging your conformance to the Honor Code requirements for this course. Specifically, you must include the following pledge statement in the header comment preceding the function main() in your program.
Programs that do not contain this pledge will not be graded.