Program 3: DUML Translation

Due midnight the evening of 10/12

Goal

In your third programming assignment, you will be writing a translator for a simple textual markup language. Your translator will read textual data written using the markup language, and generate an equivalent HTML representation of the data that can be displayed in a web browser.

 
Learning Objectives
  •  Exposure to processing text files
  •  Exposure to writing simple translators
  •  Familiarity with basic input and output
  •  Familiarity with creating new subclasses
  •  Familiarity with writing test cases
  •  Familiarity with while loops
  •  Mastery of the Web-CAT Grader

The Scenario

    You work for a local software development company, Software Originals, Inc. SOI has been hired by Marge Innovera, the librarian-statistician for the famous Boston auto liability firm of Dewey, Cheetham, and Howe, who is responsible for making sure that the firm's valuable archives are not lost with technological advances. Back in the 1980s, the company had a policy requiring that official internal memos sent by e-mail be written using a simple human-readable ASCII-only markup language called DUML (Denton's Understandable Markup Language, which was invented by and named for the firm's lead litigator, Denton Fender). DUML contains tags for italics, boldface, and headings, as follows:

      •  *text* marks an italicized phrase
      •  ^text^ marks a boldface phrase
      •  =text= marks a heading

    A login name enclosed in parentheses, such as (georgest), is also treated as a tagged phrase (see rule 2 below).

    An old DUML manual that Ms. Innovera dug up contains the following rules (stated informally in quite precise and lawyerly English, of course, as Mr. Fender was not into mathematics):

    1. A word shall contain at least two characters in order for its first and/or last character to be interpreted as a DUML tag.  For example, '*' as a word by itself is not a starting tag or ending tag for an italics phrase.

    2. If a word contains at least two characters, then call the first and last characters of the word FIRST and LAST, respectively. FIRST shall be a starting tag if it is '*', '^', '=', or '(' and the word is not nested within another tagged phrase.  Similarly for LAST: it shall be an ending tag if it is '*', '^', '=', or ')' and it occurs within a phrase marked with the matching starting tag.  (The matching starting tag might have been in an earlier word, or it might be FIRST from the current word.)

    An interesting feature of DUML is that nested tags are not allowed; e.g., you can't have a bold tag inside an italicized phrase.  Of course this rule did not prevent people from writing memos with what looked like nested tags!  For example, in the passage "here *is an italicized ^phrase*", the '^' should be treated verbatim as a caret (up-arrow) character, not as a start-of-boldface tag.
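
    The two rules and the no-nesting restriction can be sketched as a pair of predicates on a single word. This is only one possible reading of the manual, not part of the skeleton supplied by SOI; the class name DumlTagRules and both method names are hypothetical.

```java
public class DumlTagRules {
    // Characters that rule 2 allows as starting and ending tags.
    private static final String START_TAGS = "*^=(";
    private static final String END_TAGS = "*^=)";

    /**
     * True if the word's first character opens a tagged phrase.
     * insidePhrase says whether a tagged phrase is already open,
     * in which case nothing may start (nested tags are not allowed).
     */
    public static boolean startsPhrase(String word, boolean insidePhrase) {
        return word.length() >= 2
            && !insidePhrase
            && START_TAGS.indexOf(word.charAt(0)) >= 0;
    }

    /**
     * True if the word's last character closes the phrase that was
     * opened by the starting tag openTag.
     */
    public static boolean endsPhrase(String word, char openTag) {
        if (word.length() < 2) {
            return false;   // rule 1: one-character words carry no tags
        }
        char last = word.charAt(word.length() - 1);
        // '(' is closed by ')'; every other tag is closed by itself.
        char expected = (openTag == '(') ? ')' : openTag;
        return last == expected && END_TAGS.indexOf(last) >= 0;
    }
}
```

    With these definitions, the word "^phrase*" inside an italicized phrase does not start boldface (a phrase is already open), but it does end the italics.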

    One message leaked from the DCH archives during a recent Congressional hearing included this passage:

    =Technical Details=
    Please, I mean ^pretty please,^ tell our communications director
    Mr. Stayontopothis (georgest) to remind
    the reporters that *2 * 2 =4* the last time *I* looked.
    [Why do we have to say this *over and over* again?]
    

    Here, "georgest" is the login name of Mr. Stayontopothis on the DCH e-mail system.

    SOI's job is to help Ms. Innovera make the archives readable with a web browser, by creating a program that reads a file containing a DUML message and converts it to HTML.  This is to be done by replacing the DUML tags by appropriate HTML tags; and by making each login name in parentheses into a "mailto:" link in HTML.  (The company e-mail server is at cartalk.com, so Mr. Stayontopothis gets his e-mail at georgest@cartalk.com.)
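
    As a small illustration of the login-name rule, the following sketch builds the link text for one login name. The class name MailtoLink and the method forLogin are hypothetical; only the cartalk.com domain comes from the description above.

```java
public class MailtoLink {
    // Mail domain of the company e-mail server, as stated in the scenario.
    private static final String DOMAIN = "cartalk.com";

    /** Returns an HTML anchor that mails the given login name. */
    public static String forLogin(String login) {
        return "<a href=\"mailto:" + login + "@" + DOMAIN + "\">"
            + login + "</a>";
    }
}
```

    Calling MailtoLink.forLogin("georgest") produces the anchor for georgest@cartalk.com; the surrounding parentheses would be written by the caller.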

    So, for example, your program should translate the above passage into something that looks like this when viewed in a browser:

    Technical Details

    Please, I mean pretty please, tell our communications director Mr. Stayontopothis (georgest) to remind the reporters that 2 * 2 =4 the last time I looked. [Why do we have to say this over and over again?]

    SOI system engineering staff have designed the basic skeleton for the converter program and have turned it over to you for completion.  Questions you might have for system engineering and/or the customer are best asked by sending a message to the course discussion forum, where someone from SOI is in touch with Ms. Innovera and will reply as soon as possible.  (Ms. Innovera is a bit nervous about answering questions herself.)

    Requirements for your Solution

    There are several requirements your solution must follow:

    HTML Output Format

    If you are not familiar with HTML, it is a markup format that uses tags to delimit sections of text. Most HTML tags come in pairs: one to mark the start of a section of text, and one to mark the end of the section. HTML tags are always written inside angle brackets (< >, or a less-than sign and a greater-than sign). The end tag always has the same name as the start tag, but preceded by a slash (/).
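
    The start-tag/end-tag pairing described above can be captured in a one-line helper. This is a sketch for illustration only; the class name HtmlTags and the method wrap are hypothetical.

```java
public class HtmlTags {
    /**
     * Surrounds text with a matching pair of HTML tags, for example
     * wrap("i", "over and over") for an italicized phrase.
     */
    public static String wrap(String tagName, String text) {
        return "<" + tagName + ">" + text + "</" + tagName + ">";
    }
}
```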

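    You can use the following simple rules to translate DUML into HTML:

      •  *text* becomes <i>text</i>
      •  ^text^ becomes <b>text</b>
      •  =text= becomes <h1>text</h1>
      •  (login) becomes (<a href="mailto:login@cartalk.com">login</a>)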

    Consider the original DUML example presented above:

    =Technical Details=
    Please, I mean ^pretty please,^ tell our communications director
    Mr. Stayontopothis (georgest) to remind
    the reporters that *2 * 2 =4* the last time *I* looked.
    [Why do we have to say this *over and over* again?]
    

    The corresponding HTML version is:

    <h1>Technical Details</h1>
    Please, I mean <b>pretty please,</b> tell our communications director
    Mr. Stayontopothis (<a href="mailto:georgest@cartalk.com">georgest</a>) to remind
    the reporters that <i>2 * 2 =4</i> the last time <i>I</i> looked.
    [Why do we have to say this <i>over and over</i> again?]
    

    The text of this DUML sample is also available on the course web site at this URL:

    http://courses.cs.vt.edu/~cs1705/Fall03/programs/sample.duml

    You can use this sample (as well as others of your own devising) as an input source in your own test cases by creating a BufferedReader attached to this URL. Note, however, that it is not sufficient to do your testing only using this sample--you must create additional tests using your own DUML input (see the hints on testing your solution below).
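
    If you prefer to use only the standard library, a BufferedReader can be attached to a URL roughly as follows. This is a sketch, not the course's own helper; the class name UrlReaderExample and the method openUrl are hypothetical, and the stream I/O tutorial may describe a more convenient way.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;

public class UrlReaderExample {
    /** Opens the resource at the given absolute URL for reading as text. */
    public static BufferedReader openUrl(String address) throws IOException {
        // URI.create(...).toURL() parses the address; openStream() connects.
        return new BufferedReader(
            new InputStreamReader(URI.create(address).toURL().openStream()));
    }
}
```

    Remember to close the reader when the test is finished with it.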

    If you already know HTML, then you also know that a correctly formed HTML document is also surrounded by <html>...</html> tags, with the body that is displayed also being inside <body>...</body> tags. In this assignment, we are not including those tags (they'll be added by the application calling your translator class). Instead, we are producing simple fragments of HTML-formatted text rather than complete HTML documents.

    Implementation Hints

    Review the brief tutorial on Files and Stream-based Input and Output. Make sure you understand the basic methods described in it.

    The most direct strategy for implementing a solution is to read from the input stream one character at a time, processing each character appropriately. Characters that represent start or end tags can be replaced by their corresponding HTML tags, and all other characters can be echoed unchanged to the destination PrintWriter.

    You may also wish to look carefully at the "one character at a time" file copying example in the stream I/O tutorial.
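
    For reference, a character-at-a-time copy loop has roughly the following shape. This is a minimal sketch and may differ from the example in the tutorial; the class name CharCopy is hypothetical. A translator would replace the plain print call with logic that decides what to emit for each character.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.Reader;

public class CharCopy {
    /** Copies every character from in to out, unchanged. */
    public static void copy(Reader in, PrintWriter out) throws IOException {
        int c = in.read();          // read() returns -1 at end of input
        while (c != -1) {
            out.print((char) c);    // echo the character as-is
            c = in.read();
        }
        out.flush();
    }
}
```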

    Also, try to break your solution up into logical pieces. Consider that a well-designed program must operate in one of five states:

    You may be able to use this structure of the problem to help in structuring or dividing up your solution.
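
    One way to model such states in code is an enum. The sketch below assumes that the states correspond to plain text plus the four kinds of tagged phrase; that mapping is an assumption of this example, and the names TranslatorState and afterStartTag are hypothetical.

```java
public enum TranslatorState {
    // Assumed states: outside any tag, or inside one of four phrase kinds.
    PLAIN_TEXT, ITALICS, BOLDFACE, HEADING, LOGIN_NAME;

    /** Returns the state entered when the given starting tag is seen. */
    public static TranslatorState afterStartTag(char startTag) {
        switch (startTag) {
            case '*': return ITALICS;
            case '^': return BOLDFACE;
            case '=': return HEADING;
            case '(': return LOGIN_NAME;
            default:  return PLAIN_TEXT;   // not a starting tag
        }
    }
}
```

    Keeping the current state in a single field makes the no-nesting rule easy to check: a starting tag only counts while the state is PLAIN_TEXT.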

    Detecting White Space

    The rules for DUML refer to "words" in the input text, a term that non-computer people use casually. For our purposes in this assignment, we consider a "word" from a textual input source to be any sequence of non-whitespace characters. Usually, words are separated by spaces, although other whitespace characters might also be used (like tabs, end of line markers, and so forth).

    This poses a problem of determining exactly what constitutes whitespace between words. Fortunately, Java provides a predicate we can use for this purpose:

        if ( Character.isWhitespace( (char)myChar ) )
        {
            // ...
        }
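
    To see how this predicate separates words, the following sketch collects the words from an input source. It is illustrative only; the class name WordSplitter and the method words are hypothetical, and your translator will likely need to preserve the whitespace rather than discard it.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

public class WordSplitter {
    /** Returns the maximal runs of non-whitespace characters read from in. */
    public static List<String> words(Reader in) throws IOException {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (Character.isWhitespace((char) c)) {
                // Whitespace ends the current word, if there is one.
                if (current.length() > 0) {
                    result.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append((char) c);
            }
        }
        if (current.length() > 0) {
            result.add(current.toString());   // last word before end of input
        }
        return result;
    }
}
```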
    

    Remembering a Sequence of Characters

    Another issue that may come up is how one can "remember" a sequence of characters, so that they can be used again later. One thing that you can do is append a character onto the end of an existing character string:

        String oldCharacters = "";
        ...
        int myChar = in.read();
        ...
        // To "save" characters that have been read before, "add" them
        // onto the end of a string:
        oldCharacters = oldCharacters + (char)myChar;
        ...
        System.out.println( oldCharacters );   // see what has been saved up
        ...
        oldCharacters = "";                    // Clear it out to start over
    

    Note that both of these hints use the funny notation "(char)myChar". This is called a type cast, and instructs Java to treat the number stored in myChar as the code for a single character, rather than as a plain number. If we didn't do that, Java would add a human-readable decimal representation of the number stored in myChar to oldCharacters, rather than adding the character whose code is stored in myChar.
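
    The difference the cast makes can be seen in a tiny experiment. The class and method names below are hypothetical and exist only for this illustration.

```java
public class CastDemo {
    /** Appends the decimal digits of code, e.g. "65" for 65. */
    public static String appendAsNumber(String s, int code) {
        return s + code;
    }

    /** Appends the character whose code is given, e.g. "A" for 65. */
    public static String appendAsChar(String s, int code) {
        return s + (char) code;
    }
}
```

    Here appendAsNumber("", 65) yields the two-character string "65", while appendAsChar("", 65) yields "A".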

    Hints on Testing Your Solution

    When it comes to testing, remember to write one or more test cases for each method that you write in your solution. Preferably, you should write these tests before (or as) you write the method itself, rather than putting off testing until your code appears to work. As you work on larger and larger programs, it is important to build skills in convincing yourself that the parts you have already written work as you intend, even if the full solution has not been completed.

    In addition to thinking of cases where your methods should add formatting, also write test cases for scenarios where a method should not take action (or should signal an error condition, if that is the behavior intended). A good example can be adapted from the DUML sample shown above:

    2 * 2 * 2 = 8
    

    Here, none of the symbols in the textual sequence should be interpreted as formatting tags. Try to think of as many "negative" examples as you can to shake out your code.

    Finally, be sure to review the section on Reading from and Writing to Strings in the stream I/O tutorial. By using one string as an input stream, and then generating output in another string, it can be very easy to write short test cases. Consider the following test method (which assumes your test fixture includes a dumlTranslator object created from your DUMLTranslator class):

        public void testAsteriskAndEquals()
        {
            try
            {
                // create the streams needed
                BufferedReader inStream  =
                    IOHelper.createBufferedReaderForString( "2 * 2 * 2 = 8" );
                StringWriter   result    = new StringWriter();
                PrintWriter    outStream = new PrintWriter( result );
    
                // run the method to get results
                dumlTranslator.translateDUMLtoHTML( inStream, outStream );
    
                inStream.close();
                outStream.close();
    
                // test that the result is what was expected
                // (assertEquals takes the expected value first, then the actual)
                assertEquals( "2 * 2 * 2 = 8", result.toString() );
            }
            catch ( Exception e )
            {
                // If this happens, something went wrong; treat as a failed test
                fail();
            }
        }
    

    If you are clever, you can even create a "helper" method in your test class that takes two strings, the input string and the expected output string, and carries out the above test. That way, you can write many test cases, each of which is performed simply by calling your "helper" to do all the work. Even when writing test cases, it is a good idea to try to capture repeated code sequences in reusable pieces (placing them in their own methods, for example).
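
    One possible shape for such a helper is sketched below. To keep the example self-contained, it uses a hypothetical Translator interface as a stand-in for your translator class and a plain StringReader in place of the course's IOHelper; the names TranslationCheck and run are also hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringReader;
import java.io.StringWriter;

public class TranslationCheck {
    /** Hypothetical stand-in for the translator class under test. */
    public interface Translator {
        void translateDUMLtoHTML(BufferedReader in, PrintWriter out)
            throws IOException;
    }

    /** Runs the translator on input and returns everything it wrote. */
    public static String run(Translator translator, String input)
        throws IOException {
        // create the streams needed
        BufferedReader inStream = new BufferedReader(new StringReader(input));
        StringWriter result = new StringWriter();
        PrintWriter outStream = new PrintWriter(result);

        // run the method to get results
        translator.translateDUMLtoHTML(inStream, outStream);

        inStream.close();
        outStream.close();
        return result.toString();
    }
}
```

    Each test case then reduces to a single comparison between an expected string and the string returned by run for a given input.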

    Submit Your Solution

    Program submissions work just like lab submissions. On BlueJ's main menu, click Tools->Submit.... Click on "Browse...", double-click to open the "CS 1705 Programs" folder, and select Program 3. Click "OK". Click "Submit". Click on the link provided in the submission response in order to view the results of the automated phase of program grading.

    If no "Program 3" entry is visible on BlueJ's submission menu, then the Web-CAT Grader is not yet accepting submissions for this assignment. Wait for a message posted to the course web site that submissions are being accepted, and try again.

    If any errors, warnings or suggestions are indicated, you can fix them and resubmit. You are expected to fix all such issues in your code. You may resubmit as many times as you like, up until the deadline. Be careful as the due time approaches--if you submit just over the deadline, a late penalty will be assessed.


    copyright © 2003 Virginia Tech, ALL RIGHTS RESERVED
    Last modified: October 08, 2003, 8:39:45 am EDT, by Stephen Edwards <edwards@cs.vt.edu>