| Home | Notes | Homework | Labs | Programs |
| Program 3: DUML Translation |
| Goal |
In your third programming assignment, you will be writing a translator for a simple textual markup language. Your translator will read textual data written using the markup language, and generate an equivalent HTML representation of the data that can be displayed in a web browser.
| The Scenario |
You work for a local software development company, Software Originals, Inc. SOI has been hired by Marge Innovera, the librarian-statistician for the famous Boston auto liability firm of Dewey, Cheetham, and Howe, who is responsible for making sure that the firm's valuable archives are not lost with technological advances. The company had a policy way back in the 1980's that required that official internal memos sent by e-mail should be written using a simple human-readable ASCII-only mark-up language called DUML (Denton's Understandable Markup Language, which was invented by and named for the firm's lead litigator, Denton Fender). DUML contains tags for italics, boldface, and headings, as follows:
The phrase *in italics* is taken to be in italics because the first word starts with '*' and the last word ends with '*'.
The phrase ^in boldface^ is taken to be in boldface because the first word starts with '^' and the last word ends with '^'.
The phrase =Heading= is taken to be a heading because the first word starts with '=' and the last word ends with '='. (Notice there is only one word in this phrase.)
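The phrase (name) is taken to be an e-mail address because the first word starts with '(' and the last word ends with ')'. Here, name is a login name on the firm's e-mail system.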
An old DUML manual that Ms. Innovera dug up contains the following rules (stated informally in quite precise and lawyerly English, of course, as Mr. Fender was not into mathematics):
A word shall contain at least two characters in order for its first and/or last character to be interpreted as a DUML tag. For example, '*' as a word by itself is not a starting tag or ending tag for an italics phrase.
If a word contains at least two characters, then call the first and last characters of the word FIRST and LAST, respectively. FIRST shall be a starting tag if it is '*', '^', '=', or '(' and the word is not nested within another tagged phrase. Similarly for LAST: it shall be an ending tag if it is '*', '^', '=', or ')' and it occurs within a phrase marked with the matching starting tag. (The matching starting tag might have been in an earlier word, or it might be FIRST from the current word.)
An interesting feature of DUML is that nested tags are not allowed; e.g., you can't have a bold tag inside an italicized phrase. Of course this rule did not prevent people from writing memos with what looked like nested tags! For example, in the passage "here *is an italicized ^phrase*", the '^' should be treated verbatim as a caret (up-arrow) character, not as a start-of-boldface tag.
One message leaked from the DCH archives during a recent Congressional hearing included this passage:
=Technical Details= Please, I mean ^pretty please,^ tell our communications director Mr. Stayontopothis (georgest) to remind the reporters that *2 * 2 =4* the last time *I* looked. [Why do we have to say this *over and over* again?]
Here, "georgest" is the login name of Mr. Stayontopothis on the DCH e-mail system.
SOI's job is to help Ms. Innovera make the archives readable with a web browser, by creating a program that reads a file containing a DUML message and converts it to HTML. This is to be done by replacing the DUML tags by appropriate HTML tags; and by making each login name in parentheses into a "mailto:" link in HTML. (The company e-mail server is at cartalk.com, so Mr. Stayontopothis gets his e-mail at georgest@cartalk.com.)
So, for example, your program should translate the above passage into something that looks like this when viewed in a browser:
Technical Details
Please, I mean pretty please, tell our communications director Mr. Stayontopothis (georgest) to remind the reporters that 2 * 2 =4 the last time I looked. [Why do we have to say this over and over again?]
SOI system engineering staff have designed the basic skeleton for the converter program and have turned it over to you for completion. Questions you might have for system engineering and/or the customer are best asked by sending a message to the course discussion forum, where someone from SOI is in touch with Ms. Innovera and will reply as soon as possible. (Ms. Innovera is a bit nervous about answering questions herself.)
| Requirements for your Solution |
There are several requirements your solution must follow:
Be sure you are using v1.3 or later of
cs1705.jar (released 9/27/03). Use the
BlueJ link from the course
home page for instructions on how to check your version or upgrade.
You must provide a class called DUMLTranslator
to serve as the main entry point for your solution.
Your DUMLTranslator class must provide a method
with the following signature:
public void translateDUMLtoHTML( BufferedReader inStream,
PrintWriter outStream )
throws Exception
{
// ...
}
The DUMLTranslator.translateDUMLtoHTML()
method must correctly translate all of the input characters
from the provided DUML-formatted inStream, and produce
the corresponding HTML-formatted output on outStream.
The DUMLTranslator.translateDUMLtoHTML()
method must not throw any exceptions except for those that arise
from messages sent to inStream.
| HTML Output Format |
If you are not familiar with HTML, it is a markup format that uses
tags to delimit sections of text. Most HTML tags come in
pairs: one to mark the start of a section of text, and one to mark
the end of the section. HTML tags are always written inside
angle brackets (< >, or a less-than sign and
a greater-than sign). The end tag always has the same name as the start
tag, but preceded by a slash (/).
The phrase <i>in italics</i> is formatted by
a web browser to be in italics, because it is surrounded with
i-tags (<i> ... </i>), which signal
italic formatting.
The phrase <b>in boldface</b> is formatted by
a web browser to be in boldface, because it is surrounded
with b-tags (<b> ... </b>), which
signal boldface formatting.
The phrase <h1>Heading</h1> is formatted by
a web browser to be a heading, because it is surrounded with
h1-tags (<h1> ... </h1>), which
signal a top-level heading. Incidentally, there are also tags for
lower-level headings (e.g., h2, h3, etc.),
although they are not used in this assignment. Use h1-tags
for all DUML heading items.
The phrase <a href="mailto:email@system.domain">email</a>
is formatted by a web browser to be a hyperlink to an e-mail address,
because it is surrounded with a-tags
(<a> ... </a>, also called anchor tags), which
signal a hyperlink anchor.
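The tag formats described above can be sketched as two small string-building helpers. This is only an illustration: the class name HtmlFragments and the methods wrap and mailtoLink are hypothetical and are not part of the provided skeleton or of cs1705.jar.

```java
// Hypothetical helper class; the names here are assumptions made for illustration.
final class HtmlFragments
{
    // Surround text with a paired HTML tag: a start tag, then the text,
    // then an end tag with the same name preceded by a slash.
    static String wrap( String tagName, String text )
    {
        return "<" + tagName + ">" + text + "</" + tagName + ">";
    }

    // Build a mailto: hyperlink for a login name on a given mail domain.
    static String mailtoLink( String login, String domain )
    {
        return "<a href=\"mailto:" + login + "@" + domain + "\">"
            + login + "</a>";
    }
}
```

For example, HtmlFragments.wrap( "h1", "Heading" ) produces the heading form shown above, and HtmlFragments.mailtoLink( "georgest", "cartalk.com" ) produces an anchor that links to georgest@cartalk.com.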
Consider the original DUML example presented above:
=Technical Details= Please, I mean ^pretty please,^ tell our communications director Mr. Stayontopothis (georgest) to remind the reporters that *2 * 2 =4* the last time *I* looked. [Why do we have to say this *over and over* again?]
The corresponding HTML version is:
<h1>Technical Details</h1> Please, I mean <b>pretty please,</b> tell our communications director Mr. Stayontopothis (<a href="mailto:georgest@cartalk.com">georgest</a>) to remind the reporters that <i>2 * 2 =4</i> the last time <i>I</i> looked. [Why do we have to say this <i>over and over</i> again?]
The text of this DUML sample is also available on the course web site at this URL:
http://courses.cs.vt.edu/~cs1705/Fall03/programs/sample.duml
You can use this sample (as well as others of your own devising)
as an input source in your own test cases by creating a
BufferedReader attached to this URL. Note, however,
that it is not sufficient to do your testing only using this
sample--you must create additional tests using your own
DUML input (see the hints on testing your solution below).
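One way to attach a BufferedReader to a URL uses only the standard java.net and java.io classes, as sketched here. The class name UrlReaderSketch and the method openReader are hypothetical; the course library may offer its own helper for this, which this sketch does not assume.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

// Hypothetical helper; not part of the provided skeleton.
final class UrlReaderSketch
{
    // Open the resource at the given URL and wrap its bytes in a
    // character reader, using the platform's default character encoding.
    static BufferedReader openReader( URL url ) throws IOException
    {
        return new BufferedReader( new InputStreamReader( url.openStream() ) );
    }
}
```

The reader returned here can be passed directly as the inStream argument of translateDUMLtoHTML(). Remember to close it when the test is finished.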
If you already know HTML, then you also know that a correctly formed
HTML document is also surrounded by <html>...</html>
tags, with the body that is displayed also being inside
<body>...</body> tags. In this assignment, we are
not including those tags (they'll be added by the application
calling your translator class). Instead, we are producing simple
fragments of HTML-formatted text rather than complete HTML documents.
| Implementation Hints |
Review the brief tutorial on Files and Stream-based Input and Output. Make sure you understand the basic methods described in it.
The most direct strategy for implementing a solution is to
read from the input stream one character at a time, processing each
character appropriately. Characters that represent start or end
tags can be replaced by their corresponding HTML tags, and all other
characters can be echoed unchanged to the destination
PrintWriter.
You may also wish to look carefully at the "one character at a time" file copying example in the stream I/O tutorial.
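As a starting point, the character-at-a-time strategy looks roughly like the loop below, which simply echoes every character unchanged. It is a minimal sketch, not a translator: the class name CopySketch and the method copy are hypothetical, and a real solution would decide what to do with each character instead of always printing it.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;

// Hypothetical class; shows only the read/echo loop, with no DUML handling.
final class CopySketch
{
    // Echo every character from in to out, unchanged.
    static void copy( BufferedReader in, PrintWriter out ) throws IOException
    {
        int myChar = in.read();          // read() returns -1 at end of input
        while ( myChar != -1 )
        {
            out.print( (char)myChar );   // cast the code back to a character
            myChar = in.read();
        }
        out.flush();                     // push buffered output to the destination
    }
}
```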
Also, try to break your solution up into logical pieces. Consider that a well-designed program must operate in one of five states:
When processing text that lies outside of all formatting tags, your program can be said to be in a plain text state. Here, it is looking for any start tag that it comes across.
Once the start tag for italics (an asterisk at the start of a word) is found, your program transitions into an italics state. Here, it is looking only for the corresponding end tag (an asterisk at the end of a word), at which point it transitions back to the plain text state.
Once the start tag for boldface (a caret at the start of a word) is found, your program transitions into a boldface state. Here, it is looking only for the corresponding end tag (a caret at the end of a word), at which point it transitions back to the plain text state.
Once the start tag for a header (an equal sign at the start of a word) is found, your program transitions into a header state. Here, it is looking only for the corresponding end tag (an equal sign at the end of a word), at which point it transitions back to the plain text state.
Once the start tag for an e-mail (a left parenthesis at the start of a word) is found, your program transitions into an e-mail state. Here, it is looking only for the corresponding end tag (a right parenthesis at the end of a word), at which point it transitions back to the plain text state.
You may be able to use this structure of the problem to help in structuring or dividing up your solution.
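The five states listed above could be written down as an enumerated type, as in the sketch below. This assumes a Java version with enum support, and the names DumlState, startTag, endTag, and forStartTag are hypothetical. It records only which characters open and close each state; the rules about word boundaries and minimum word length still have to be checked elsewhere.

```java
// Hypothetical enumeration of the five states; names are assumptions.
enum DumlState
{
    PLAIN_TEXT( '\0', '\0' ),   // no tag characters of its own
    ITALICS( '*', '*' ),
    BOLDFACE( '^', '^' ),
    HEADING( '=', '=' ),
    EMAIL( '(', ')' );

    final char startTag;   // character that opens this state
    final char endTag;     // character that returns to plain text

    DumlState( char startTag, char endTag )
    {
        this.startTag = startTag;
        this.endTag = endTag;
    }

    // From the plain text state, find the state that a candidate start
    // character leads to; stay in PLAIN_TEXT if it is not a start tag.
    static DumlState forStartTag( char c )
    {
        for ( DumlState s : values() )
        {
            if ( s != PLAIN_TEXT && s.startTag == c )
            {
                return s;
            }
        }
        return PLAIN_TEXT;
    }
}
```

Because every non-plain state looks only for its own endTag, a caret seen while in ITALICS is never treated as a start tag, which matches the rule that nested tags are not allowed.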
The rules for DUML refer to "words" in the input text, a term that non-computer people use casually. For our purposes in this assignment, we consider a "word" from a textual input source to be any sequence of non-whitespace characters. Usually, words are separated by spaces, although other whitespace characters might also be used (like tabs, end of line markers, and so forth).
This poses a problem of determining exactly what constitutes whitespace between words. Fortunately, Java provides a predicate we can use for this purpose:
if ( Character.isWhitespace( (char)myChar ) )
{
// ...
}
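To see how this predicate marks word boundaries, the following sketch counts the words in a string using the definition given above (any sequence of non-whitespace characters). The class WordSketch and the method countWords are hypothetical names used only for this illustration.

```java
// Hypothetical class; demonstrates word boundaries, not DUML translation.
final class WordSketch
{
    // Count maximal runs of non-whitespace characters.
    static int countWords( String text )
    {
        int words = 0;
        boolean inWord = false;   // true while scanning inside a word
        for ( int i = 0; i < text.length(); i++ )
        {
            char c = text.charAt( i );
            if ( Character.isWhitespace( c ) )
            {
                inWord = false;   // whitespace ends the current word
            }
            else if ( !inWord )
            {
                inWord = true;    // first character of a new word
                words++;
            }
        }
        return words;
    }
}
```

The same inWord idea tells a translator when the character it is looking at is the first character of a word, which is the only position where a start tag can appear.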
Another issue that may come up is how one can "remember" a sequence of characters, so that they can be used again later. One thing that you can do is append a character onto the end of an existing character string:
String oldCharacters = "";
...
int myChar = in.read();
...
// To "save" characters that have been read before, "add" them
// onto the end of a string:
oldCharacters = oldCharacters + (char)myChar;
...
System.out.println( oldCharacters ); // see what has been saved up
...
oldCharacters = ""; // Clear it out to start over
Note that both of these hints use the funny notation
"(char)myChar". This is called a type cast,
and instructs Java to treat the number stored in myChar
as the code for a single character, rather than as a plain number.
If we didn't do that, Java would add a human-readable decimal representation
of the number stored in myChar to
oldCharacters, rather than adding the character
whose code is stored in myChar.
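The difference the cast makes can be seen in a two-method sketch. The class name CastSketch and both method names are hypothetical.

```java
// Hypothetical class contrasting string concatenation with and without the cast.
final class CastSketch
{
    // Without the cast, the decimal digits of the number are appended.
    static String appendAsNumber( String s, int code )
    {
        return s + code;
    }

    // With the cast, the character whose code is stored is appended.
    static String appendAsChar( String s, int code )
    {
        return s + (char)code;
    }
}
```

Since 65 is the code for the letter A, appendAsNumber( "x", 65 ) yields "x65", while appendAsChar( "x", 65 ) yields "xA".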
| Hints on Testing Your Solution |
When it comes to testing, remember to write one or more test cases for each method that you write in your solution. Preferably, you should write these tests before (or as) you write the method itself, rather than saving testing until your code works. As you work on larger and larger programs, it is important to build skills in convincing yourself that the parts you have already written work as you intend, even if the full solution has not been completed.
In addition to thinking of cases where your methods should add formatting, also write test cases for scenarios where a method should not take action (or should signal an error condition, if that is the behavior intended). A good example can be adapted from the DUML sample shown above:
2 * 2 * 2 = 8
Here, none of the symbols in the textual sequence should be interpreted as formatting tags. Try to think of as many "negative" examples as you can to shake out your code.
Finally, be sure to review the section on Reading
from and Writing to Strings in the stream I/O tutorial. By using
one string as an input stream, and then generating output in another
string, it can be very easy to write short test cases. Consider
the following test method (which assumes your test fixture includes
a dumlTranslator object created from your
DUMLTranslator class):
public void testAsteriskAndEquals()
{
try
{
// create the streams needed
BufferedReader inStream =
IOHelper.createBufferedReaderForString( "2 * 2 * 2 = 8" );
StringWriter result = new StringWriter();
PrintWriter outStream = new PrintWriter( result );
// run the method to get results
dumlTranslator.translateDUMLtoHTML( inStream, outStream );
inStream.close();
outStream.close();
// test that the result is what was expected
assertEquals( "2 * 2 * 2 = 8", result.toString() ); // expected value first
}
catch ( Exception e )
{
// If this happens, something went wrong; treat as a failed test
fail();
}
}
If you are clever, you can even create a "helper" method in your test class that takes two strings, the input string and the expected output string, and carries out the above test. That way, you can write many test cases, each of which is performed simply by calling your "helper" to do all the work. Even when writing test cases, it is a good idea to try to capture repeated code sequences in reusable pieces (placing them in their own methods, for example).
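One possible shape for such a helper is sketched below, using only standard java.io classes (a StringReader wrapped in a BufferedReader) rather than the course library. The interface TranslatorUnderTest and the class TranslateHelperSketch are hypothetical stand-ins introduced so the sketch is self-contained; in your own test class the helper would call your dumlTranslator object directly.

```java
import java.io.BufferedReader;
import java.io.PrintWriter;
import java.io.StringReader;
import java.io.StringWriter;

// Hypothetical stand-in for an object offering the required method.
interface TranslatorUnderTest
{
    void translateDUMLtoHTML( BufferedReader inStream, PrintWriter outStream )
        throws Exception;
}

// Hypothetical helper; returns the translator's output as a string.
final class TranslateHelperSketch
{
    static String translate( TranslatorUnderTest translator, String duml )
        throws Exception
    {
        // create the streams needed
        BufferedReader inStream = new BufferedReader( new StringReader( duml ) );
        StringWriter result = new StringWriter();
        PrintWriter outStream = new PrintWriter( result );

        // run the method to get results
        translator.translateDUMLtoHTML( inStream, outStream );
        inStream.close();
        outStream.close();   // closing also flushes pending output

        return result.toString();
    }
}
```

With a helper like this, each test case reduces to a single comparison between the expected HTML string and the value returned for a given DUML input string.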
| Submit Your Solution |
Program submissions work just like lab submissions.
On BlueJ's main menu, click Tools->Submit.... Click on
"Browse...", double-click to open the
"CS 1705 Programs" folder, and select
Program 3. Click "OK".
Click "Submit". Click on the link provided
in the submission response in order to view the results of the
automated phase of program grading.
If no "Program 3" entry is visible on BlueJ's submission menu, then the Web-CAT Grader is not yet accepting submissions for this assignment. Wait for a message posted to the course web site that submissions are being accepted, and try again.
If any errors, warnings or suggestions are indicated, you can fix them and resubmit. You are expected to fix all such issues in your code. You may resubmit as many times as you like, up until the deadline. Be careful as the due time approaches--if you submit just over the deadline, a late penalty will be assessed.