CS 2104 Problem Solving in Computer Science                   OOC Assignment 9
-------------------------------------------------------------------------------

1. [10 points] The set of all 10-digit telephone numbers, like 540.231.5605,
   formatted in precisely that manner; match all and only telephone numbers
   that are preceded and followed by one or more spaces. (You do not have to
   take into account any restrictions on what are actually valid area codes
   or prefixes.)

   This is pretty straightforward, especially if you make good use of the
   relevant repetition syntax:

      ' ([0-9]{3}\.){2}[0-9]{4} '

   I didn't mention '\s', but you might have come across it in the reading.
   Technically, the following is not valid, since it would also match tabs
   and other whitespace characters, not just spaces:

      '\s([0-9]{3}\.){2}[0-9]{4}\s'

2. [10 points] The set of all strings that consist of three or four
   lower-case letters, where the first character cannot be a vowel, and both
   ends of the string must be adjacent to space characters. (Be careful about
   the requirement that the string only contains letters.)

   The only real challenge here is the restriction that the first character
   cannot be a vowel. Here is one solution:

      ' \<[b-df-hj-np-tv-z][a-z]{2,3}\> '

   And here is a seductive incorrect answer:

      ' \<[^aeiou][a-z]{2,3}\> '

   The problem with this one is that [^aeiou] matches any character that is
   not a lower-case vowel, not just lower-case consonants, so it also accepts
   a first character such as a digit or an upper-case letter.

3. [20 points] The set of all lines in a file that begin and end with the
   word "the". (There are exactly 64 such lines in the Moby Dick file from
   the Gutenberg Project.)

   This was also straightforward. You must specify that both occurrences of
   "the" are matched as words, and force the first to be at the beginning of
   the line and the last to be at the end of the line. What comes in the
   middle of the line is of no importance, but you must still allow for it
   explicitly with '.*':

      ^\<the\>.*\<the\>$

4. [20 points] The set of all lines (in a file) that contain the place name
   "New England" or "Spain". (There are exactly 10 such lines in the Moby
   Dick file from the Gutenberg Project.)

   Again, both names must be matched as words, and we must find all lines
   that contain one or the other (or both), so we need the OR operator:

      '\<New England\>|\<Spain\>'

5. [20 points] The set of all lines (in a file) that include a word that
   contains two (or more) consecutive occurrences of the letter 'a' or two
   (or more) consecutive occurrences of the letter 'u'. (There are exactly 6
   such lines in the Moby Dick file from the Gutenberg Project.)

   Aside from needing word matches, we must allow arbitrary content before
   and after the "aa" or "uu":

      '\<.*aa.*\>|\<.*uu.*\>'

6. [20 points] The set of all lines (in a file) that contain the word "the"
   at least five times. (There are exactly 4 such lines in the Moby Dick file
   from the Gutenberg Project.)

   The key elements are that the content before, between and after the
   occurrences of "the" is arbitrary; that "the" must be matched as a word;
   and that there must be 5 or more matches of "the":

      (.*\<the\>){5}

   Note that it's OK to search for only 5 matches, since any line with more
   than 5 matches also contains 5 of them.
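
The patterns in this key are written for a grep-style tool. As a rough cross-check, the sketch below translates the six answers into Python's re module and applies them to a handful of invented sample lines (not taken from the Moby Dick file). Two assumptions are worth flagging: Python has no '\<' or '\>' anchors, so '\b' (word boundary) is used as a close stand-in for both, and the names PATTERNS, WRONG_2 and matching_lines are hypothetical helpers introduced only for this illustration.

```python
import re

# Translations of the six answers. Python's re has no \< or \>, so \b
# (word boundary) stands in for both; this is an approximation of the
# grep-style originals, not an exact equivalent.
PATTERNS = {
    1: r" ([0-9]{3}\.){2}[0-9]{4} ",
    2: r" \b[b-df-hj-np-tv-z][a-z]{2,3}\b ",
    3: r"^\bthe\b.*\bthe\b$",
    4: r"\bNew England\b|\bSpain\b",
    5: r"\b.*aa.*\b|\b.*uu.*\b",
    6: r"(.*\bthe\b){5}",
}

# The tempting but incorrect answer to problem 2, kept for comparison.
WRONG_2 = r" \b[^aeiou][a-z]{2,3}\b "


def matching_lines(pattern, lines):
    """Return the lines that contain a match, the way grep selects lines."""
    compiled = re.compile(pattern)
    return [line for line in lines if compiled.search(line)]


if __name__ == "__main__":
    # Invented sample lines, one or two per problem.
    sample = [
        "call 540.231.5605 today",
        "a dog ran",
        "a Dog ran",
        "the whale swam past the",
        "bound for New England at dawn",
        "the bazaar was loud",
        "the cat the dog the bird the fish the end",
    ]
    for number, pattern in PATTERNS.items():
        print(number, matching_lines(pattern, sample))
    print("wrong 2", matching_lines(WRONG_2, sample))
```

Using search rather than fullmatch mirrors grep's behavior of selecting any line that contains a match somewhere. Comparing PATTERNS[2] with WRONG_2 on a line such as "a Dog ran" shows why the negated character class is too permissive.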