CS 4604

Project Overview

General Description

The goal of the class project is to implement a database system application based on the data in DBLP, a bibliography of over 1.4 million publications from the database and related computer science communities. The project includes the following activities, spread over the entire semester:

Describe the application area that DBLP serves (for which database systems may prove beneficial),
Determine the functionalities of the database application,
Model the data stored in the database (identify the entities, roles, relationships, constraints, etc.),
Design, normalise, and perfect the relational database schema,
Write the SQL commands to create the database, find appropriate data, and populate the database, and
Finally and most importantly, write the software needed to embed the database system in the application.
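
As a rough illustration of the schema-design and population steps above, here is a minimal sketch using Python's built-in sqlite3 module. Everything in it is an assumption made for illustration: the table and column names, the sample rows, and the choice of SQLite itself. Your own design will come out of the modeling and normalization assignments and may look quite different.

```python
import sqlite3

# Hypothetical schema: every table and column name here is illustrative,
# not something the assignment specifies.
SCHEMA = """
CREATE TABLE author (
    author_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE venue (
    venue_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL UNIQUE
);
CREATE TABLE publication (
    pub_id   INTEGER PRIMARY KEY,
    title    TEXT NOT NULL,
    year     INTEGER,
    venue_id INTEGER REFERENCES venue(venue_id)
);
-- Many-to-many relationship between authors and publications.
CREATE TABLE authorship (
    author_id INTEGER NOT NULL REFERENCES author(author_id),
    pub_id    INTEGER NOT NULL REFERENCES publication(pub_id),
    PRIMARY KEY (author_id, pub_id)
);
"""


def build_database(conn):
    """Create the tables and load a few made-up sample rows."""
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO author (author_id, name) VALUES (?, ?)",
                 (1, "Ada Example"))
    conn.execute("INSERT INTO venue (venue_id, name) VALUES (?, ?)",
                 (1, "Example Conference"))
    conn.execute(
        "INSERT INTO publication (pub_id, title, year, venue_id) "
        "VALUES (?, ?, ?, ?)",
        (1, "A Made-Up Paper", 2012, 1),
    )
    conn.execute("INSERT INTO authorship (author_id, pub_id) VALUES (?, ?)",
                 (1, 1))
    conn.commit()


conn = sqlite3.connect(":memory:")  # throwaway in-memory database
build_database(conn)
table_names = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(table_names)
```

The separate authorship table is one common way to represent a many-to-many relationship, since a paper can have several authors and an author can have several papers.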

The end result should be a functioning application that runs on the web and uses your database to provide useful functionality.

Each project should be done by a group of 2-3 students. You are free to choose your own project members; if you would like the instructor to assign you to a group, say so in class or send him an email. Each of the steps above will be a specific project assignment. You will get detailed instructions with each assignment. Each group should turn in a single solution to each assignment. Every member of the group will get the same grade.

Project Domain

We will work in the domain of research literature. This domain involves modeling research publications. You need to identify the title of the publication, the forum it was published in, the authors, topics, keywords, and related subtopic areas. This is a big business now (under the name of digital libraries). For example, the ACM digital library provides a beautiful searchable index (and retrievable repository, but that is beyond our scope) of nearly all of the publications of ACM. The ACM computing classification system provides a convenient hierarchical meta-index that you can use to organize your class hierarchy, etc. More specifically, we will work with data from the DBLP Bibliography Site, which provides a searchable facility for publications related to the database and programming communities. At the end of the day, you will be able to identify papers written by a particular person at a particular venue or papers published in a narrowly defined area.
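
To make the "papers written by a particular person at a particular venue" query concrete, here is a small sketch in Python with sqlite3. The two-table layout, the column names, the helper papers_by_author_at_venue, and all sample rows are hypothetical and deliberately simplified; they are not the DBLP format or a required design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, deliberately simplified tables for illustration only.
conn.executescript("""
CREATE TABLE publication (
    pub_id INTEGER PRIMARY KEY,
    title  TEXT NOT NULL,
    venue  TEXT NOT NULL,
    year   INTEGER
);
CREATE TABLE authorship (
    pub_id INTEGER NOT NULL REFERENCES publication(pub_id),
    author TEXT NOT NULL
);
""")
# Made-up sample data.
conn.executemany(
    "INSERT INTO publication VALUES (?, ?, ?, ?)",
    [(1, "Paper A", "VenueX", 2010),
     (2, "Paper B", "VenueY", 2011),
     (3, "Paper C", "VenueX", 2012)],
)
conn.executemany(
    "INSERT INTO authorship VALUES (?, ?)",
    [(1, "Ada Example"), (2, "Ada Example"),
     (3, "Ada Example"), (3, "Bob Sample")],
)


def papers_by_author_at_venue(conn, author, venue):
    """Return titles of papers by `author` published at `venue`, oldest first."""
    rows = conn.execute(
        """
        SELECT p.title
        FROM publication AS p
        JOIN authorship AS a ON a.pub_id = p.pub_id
        WHERE a.author = ? AND p.venue = ?
        ORDER BY p.year, p.title
        """,
        (author, venue),  # parameters are bound, not pasted into the SQL string
    )
    return [row[0] for row in rows]


print(papers_by_author_at_venue(conn, "Ada Example", "VenueX"))
```

Passing the author and venue as bound parameters rather than building the SQL string by hand matters once the query is driven by input from a web form.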

Other Project Ideas

These ideas are a sample meant to indicate the breathtaking scope of areas that databases can have an impact on. Realize that the ideas below are not complete descriptions. Do not get intimidated by the examples that are linked from this web page. These examples are meant to give you a feel for different application domains and how you could augment the basic DBLP database with your own ideas. Do not forget that you are supposed to have fun!

Here are some ideas:

Social Network database: (Something related to my research) The goal is to create a social network database, like Twitter, where you will store the network (say who-follows-whom) and the associated data of each person like tweets, tags, photos and more. Users should be able to view a chronological listing of anyone's tweets, find tweets by certain people, group tweets by topics, find friends who tweet about some topic, find friends of friends who are not friends (to close social triangles!), and so on.
Bibliography database: Develop a system that will improve a research group's ability to track its publications and publications of interest to the group. Track information such as papers, authors, projects, conferences and journals. Readers should be able to view chronological listings, find papers by certain authors, group by projects, recover lists of papers based on keywords, etc. It should be easy for group members to add new papers, both written by the group and published by others in the literature. Examples of such systems include Connotea and CiteULike.
Bioinformatics databases: Biological experiments generate massive amounts of different types of data. A plethora of databases have appeared that provide interfaces to many such datasets such as the Gene Ontology, the KEGG pathway database, and Reactome. Databases with good interfaces, especially when they link multiple types of data together, can be very useful to biologists.
The Green500 list: This list of the most energy-efficient supercomputers was created and is maintained by professors in the Department of Computer Science at Virginia Tech. This project involves creating and maintaining the database that drives the website. Since the complete database may have a very complex schema, this project will involve downloading the (filtered) information (available in HTML, XLS, or CSV formats), creating a schema for this data, storing the data using this schema, and enabling queries to be made, as in these examples.
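
The "friends of friends who are not friends" query from the social network idea can be sketched as a self-join on a who-follows-whom table. The follows table, the helper friends_of_friends, and the user names below are all hypothetical, and this sketch treats "friend" as "someone you follow", which is only one possible reading.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical who-follows-whom table; names are made up.
conn.execute("""
CREATE TABLE follows (
    follower TEXT NOT NULL,
    followee TEXT NOT NULL,
    PRIMARY KEY (follower, followee)
)
""")
conn.executemany(
    "INSERT INTO follows VALUES (?, ?)",
    [("alice", "bob"), ("alice", "dave"),
     ("bob", "carol"), ("bob", "alice"),
     ("dave", "carol"), ("dave", "erin"),
     ("carol", "alice")],
)


def friends_of_friends(conn, user):
    """People followed by someone `user` follows, whom `user` does not yet follow."""
    rows = conn.execute(
        """
        SELECT DISTINCT f2.followee
        FROM follows AS f1
        JOIN follows AS f2 ON f2.follower = f1.followee
        WHERE f1.follower = ?
          AND f2.followee <> ?              -- never suggest the user themselves
          AND f2.followee NOT IN (          -- skip people already followed
              SELECT followee FROM follows WHERE follower = ?
          )
        ORDER BY f2.followee
        """,
        (user, user, user),
    )
    return [row[0] for row in rows]


print(friends_of_friends(conn, "alice"))
```

With a symmetric notion of friendship you would store or query each edge in both directions instead.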

Nobel Awards Database: The goal is to model and populate information about the awards made in the various fields (Physics, Chemistry, Physiology or Medicine, Literature, Peace, and the Economic Sciences), the recipients, their countries, their years of birth, etc. Your system should be able to answer questions such as "When was the first time an Asian won an award for the economic sciences?" (the answer to this particular question is 1998). The Nobel Foundation maintains such an interface. You could also work on variants of this idea such as the recipients of the ACM awards (unfortunately, there is not too much information online about this). Interesting queries could then be "Name people who have won at least two different awards" (the answer would include Knuth, Thompson, Ritchie, Engelbart, etc.) or the people "who were ACM Fellows before becoming Turing Award Winners", and so on.
Books Database: This domain is another popular one. Just look at barnesandnoble.com or amazon.com for excellent examples. You could model entities such as books, their authors, topics (which may be a complex hierarchy). You may also model various attributes of the authors and the institutions they belong to. You can support a service for buying and selling used books or books used in specific university courses. Your system can build a personal profile of people (and the books they like) and your database application could form the basis for a "recommender system", such as those supported by the commercial sites. The goal here is to "cluster" similar preferences together and the system can then make recommendations: "Since you liked Harry Potter and the Sorcerer's Stone, I recommend that you try Harry Potter and the Chamber of Secrets".
Movies Database: There are several excellent movie resources on the web, such as the hollywood.com movies site or the Internet Movie Database. You could model entities such as movies, their actors, directors, genres, playing times, and reviews. There are several sources on the web from which you could get data to populate such a database. You can support various queries such as finding specific playing times or finding movies playing in Blacksburg directed by a given director. You can also support updates to the reviews section of the database (e.g., viewers giving their own opinions). Another functionality is to provide personal profiles of people (i.e., the movies they like) and then try to recommend movies to them based on profiles of viewers with similar tastes. You could also create a database of Oscar or Golden Globe nominations and awards and answer queries such as "Find all the sitcoms that have been nominated three years in a row".
Personal Photos database: With the advent of cheap digital cameras, everybody has piles of digital photos. People need a way to organize, access, and show off their photos.
Apartment Homes: Our friendly neighborhood web guide is here. This domain would require modeling apartments and their attributes, areas of town and their various characteristics (e.g., BT bus lines, crime rates, distance from various landmarks). You would provide an interface for offering apartments for rent and for finding apartments based on various requirements ("gas heating + pets allowed + rent less than $500 + close to campus + BEV modem facility").
Census Database: Can you make a census data dissemination system for the Census Bureau? A census gathers data about people, businesses, geographic regions, etc. Different types of users need to gain different types of answers from the data. Homeowners want to know statistics about their region, such as crime rates. Business owners want to find holes in the competition. Government decision makers want to learn about demographic trends, and where to focus resources.
Web Sites: How do you think web search engines such as Google model their domain? You could think of them as a glorified database system where the basic entities modeled are web sites. You could then model the various properties of a web site: Topic, URL, domain name, other sites it links to, the background colour, etc. Retrieval could be for sites that have similar characteristics and properties.
Others: Of course, there are a whole host of other ideas such as bank accounts, student records, NBA data, election results, senate demographics, car rentals, auto insurance, consumer products, courses at Virginia Tech, Hokie statistics, "match-making services" and so on. You can let your imagination run riot!
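
Several of the ideas above call for aggregate queries. For instance, the awards question "Name people who have won at least two different awards" can be sketched with GROUP BY and HAVING. The award_win table, the helper multi_award_winners, and the placeholder rows are hypothetical; no real award data is used here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with placeholder rows; not real award data.
conn.execute("""
CREATE TABLE award_win (
    person TEXT NOT NULL,
    award  TEXT NOT NULL,
    year   INTEGER NOT NULL
)
""")
conn.executemany(
    "INSERT INTO award_win VALUES (?, ?, ?)",
    [("Person A", "Award X", 2001),
     ("Person A", "Award Y", 2005),
     ("Person B", "Award X", 2002),
     ("Person B", "Award X", 2009),   # same award twice: not "different"
     ("Person C", "Award Z", 2003)],
)


def multi_award_winners(conn, min_awards=2):
    """People who have won at least `min_awards` distinct awards."""
    rows = conn.execute(
        """
        SELECT person
        FROM award_win
        GROUP BY person
        HAVING COUNT(DISTINCT award) >= ?
        ORDER BY person
        """,
        (min_awards,),
    )
    return [row[0] for row in rows]


print(multi_award_winners(conn))
```

COUNT(DISTINCT award) is what separates "two different awards" from "the same award twice"; a plain COUNT(*) would also return Person B.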

Last modified: Tue Jan 22 6:05pm EDT 2013