CS 4604 Project Assignment 2


Released on Feb 8, 2013. Hardcopy due at the start of class on Feb 22, 2013.

In this assignment you have to design a schema for the DBLP dataset. You should design your schema in PostgreSQL and populate the database. You can utilise your PostgreSQL account on cs4604.cs.vt.edu for all aspects of this assignment. If you do not have an account for yourself or for your group, or if you have any problem with the set-up please contact Qianzhou [qiand12 AT cs DOT vt DOT edu].

The DBLP dataset contains information about approximately 1.4 million publications in the computer science literature. Each publication has a unique string called the dblp_key that identifies it. It also has a title, a year of publication, and one or more authors. The order in which authors appear in a publication is important and must be recorded. For some publications, the authors have not been recorded. A publication may also have a URL or a Digital Object Identifier (DOI). Each publication belongs to one of the following categories:
article
This type corresponds to a journal article. The publication will have an associated journal name, a volume and a number specifying the issue of the journal, page numbers, and a publisher of the journal.
book
As the name indicates, this type of publication is a book. It also has a publisher and an ISBN number.
incollection
This type indicates a publication contained within a collection. An example of a collection is a book that contains different chapters written by different authors. Each chapter in a collection will have the type incollection. Each chapter will have its own page numbers and authors. The entire collection itself is considered a separate publication and has its own title, authors (often called editors), and publisher. A chapter in a collection often has a cross reference to the collection it was published in.
inproceedings
This type indicates a paper published in the proceedings of a scientific conference. It is very similar to a publication of type "incollection". The conference proceedings is itself a separate publication with its own title, authors (or editors), and publisher. A publication of type "inproceedings" often has a cross reference to the proceedings it was published in.
proceedings
Very similar to a book, this type indicates the entire conference proceedings with its own title (name of conference), editors, year, publisher and ISBN.
mastersthesis
This publication is a Master's thesis, with a specific author, department and/or university, and year.
phdthesis
This publication is a PhD thesis, with a specific author, department and/or university, and year.
www
This type of "publication" is just a pointer to a web page, possibly with a title and one or more authors. It will have an URL.
  1. (30 points) Design a set of relations to model the DBLP database, as described above. For each relations, specify its schema: the name of the relation, its attributes, and for each attribute, its type. Describe using a sentence or two what each relation will store.
    Note 1: Your design should have at least 3 relations. If you have fewer than three relations, then your schema is not complex enough for a CS 4604 project. However, do not cook up relations just to meet this goal. We will carefully vet each relation and penalise relations we think are unnecessary.
    Note 2: Resist the temptation to create a single relation that will store all publications. If you do so, you will be smushing many attributes that make sense only for a particular type of publication into the same relation, resulting in many NULL values. We will strongly penalise any design that stores all publications in a single relation.
    Note 3: Later in this assignment, we will mention the s13dblp database that contains the data you will need to copy over to your database. The s13dblp does not have a good schema. Therefore, if you choose to use the same schema, you are certain to have a poor design, which we will penalise you for. Do not look at the s13dblp schema before coming up with your own design. The s13dblp schema will encourage you to select bad design choices.

  2. (30 points) Write an SQL database schema for your design, using the CREATE TABLE commands described in class. Pick suitable datatypes for each attribute. Consult the online PostgreSQL manual to determine the principal options that are available for types (your class notes may or may not cover the ones that you require and your class notes may or may not contain the right declaration). Do not forget to specify at least one key for each relation (using the PRIMARY KEY construct). Specify other choices for keys via the UNIQUE construct. Justify your choice of key(s) for each relation. If some attribute should not be NULL, use NOT NULL to specify this aspect. Also provide DEFAULT values wherever applicable. An example of a complete schema definition for a relation is:
    CREATE TABLE
    Students (
        pid CHAR(6) PRIMARY KEY,
        ssn CHAR(9) UNIQUE,
        name CHAR(30),
        birthdate DATE,
        gender CHAR(1) NOT NULL,
        address VARCHAR(255)
    );
    
    Do not start loading these definitions into the database just yet!

  3. (10 points) For each of the relations that you defined, indicate any foreign key (referential integrity) constraints that are expected to hold for that relation. Include the declaration of foreign key constraints for any appropriate relations. Also mention any domain specific constraints that you might require of your attributes, using the CHECK construct. Make sure to name your constraints!

  4. (30 points) Execute the CREATE TABLE statements in your group's database. Now load the DBLP data into your database, based on the instructions given next. We have stored the data in three tables within the "s13dblp" database. This database should be accessible from within your group's database. Write SQL commands that query s13dblp and insert parts of it into the appropriate tables in your database. Alternatively, you can export the s13dblp tables to files, process the files using external scripts that you will have to write, and then import the data from these files into your tables. Here are some notes about the s13dblp database:
What to turn in: A paper copy that details the following:
  1. A list of your relation schemas.
  2. A list of your SQL CREATE TABLE commands (with PRIMARY KEY, NOT NULL, DEFAULT, FOREIGN KEY, and CHECK constraints).
  3. A list of the number of tuples for each relation in your application.
  4. Samples of tuples for each relation (about 6--10 each), just to give us an idea of how they look.
  5. The number of tuples of type incollection that you ignored and the number of tuples of type inproceedings that you ignored.
  6. Any interesting observations in terms of inconsistencies in the database and how you resolved them.
What NOT to turn in:
  1. Pages and pages of raw data or printouts of all the data in your database. There is no way we can comb through all this stuff manually! You will have ample opportunity to present this stuff later on in the semester.
Precautions and Recommendations:
  1. Please SAVE a copy of your SQL statements (and data) in your home directory or some other safe place. While this precaution may seem obvious, a surprising number of people miss it.


Last updated: Thur, Feb 7, 10:00pm EDT 2013