CS4624 Glossary

Following are some terms and definitions / explanations. More will be added. Definitions will be supplied.

Boolean queries
formal representation of a question or information need using AND, OR, or NOT to connect terms
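As a minimal sketch of how such a query can be evaluated, the connectives map onto set operations over the documents that contain each term. The toy collection and the helper name matching are hypothetical, chosen only for illustration.

```python
# Hypothetical toy collection: document id -> set of index terms.
docs = {
    1: {"hypertext", "link", "node"},
    2: {"hypertext", "retrieval"},
    3: {"retrieval", "index"},
}

def matching(term):
    """Return the ids of documents that contain the term."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}

all_ids = set(docs)

# hypertext AND retrieval
both = matching("hypertext") & matching("retrieval")
# hypertext OR index
either = matching("hypertext") | matching("index")
# retrieval AND NOT index
without = matching("retrieval") & (all_ids - matching("index"))

print(both, either, without)
```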
browsing
looking around, reading, viewing objects or information, sometimes with the help of some organization or tool (e.g., a browser) --- often with no or only a vague objective, sometimes as in window-shopping
CD-DA
compact disc digital audio (i.e., standard music CD)
CD-ROM
compact disc read only memory (storing about 600 Mbytes of digital data in a format similar to that of a music CD)
clustering
grouping similar items together to form clusters whose centroid or representative characterizes the group
CODER
COmposite Document Expert/extended/effective Retrieval system developed at Virginia Tech
composite
something made up of parts - for example, in the Amsterdam model there are atomic objects (treated as a whole, not decomposed) and composite objects (made up of atomic or other composite objects)
controlled vocabulary
a fixed terminological set from which indexing and query terms are selected
CSCW
computer supported cooperative work --- group work supported with collaboration technology
digital library
a collection of digital representations of information content, along with hardware, software, and personnel to support the functions of a traditional library plus knowledge worker operations like searching, browsing, and navigation
digital tree
a hierarchical organization of data where at each level there is a multiway branch, e.g., 10-way so each digit of a number can determine the next step in a path from the root
distance function
a function that computes the distance between a pair of items, e.g., d(d1,d2) with properties:

  1. d(d1,d2) = 0 if d1=d2
  2. d(d1,d2) = d(d2,d1)
  3. d(d1,d3) <= d(d1,d2)+d(d2,d3)
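
A small sketch of one possible distance function, with a brute-force check of the properties on a handful of items. The city-block distance, the sample points, and the helper name is_metric_on are assumptions made for illustration; the third property is checked as the triangle inequality d(d1,d3) <= d(d1,d2)+d(d2,d3).

```python
from itertools import product

def d(a, b):
    """City-block (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def is_metric_on(items, dist):
    """Check the three properties for every triple drawn from items."""
    for p, q, r in product(items, repeat=3):
        if p == q and dist(p, q) != 0:
            return False          # identical items must be at distance 0
        if dist(p, q) != dist(q, p):
            return False          # distance must be symmetric
        if dist(p, r) > dist(p, q) + dist(q, r):
            return False          # triangle inequality
    return True

points = [(0, 0), (1, 2), (3, 1)]  # hypothetical sample items
print(d(points[0], points[1]), is_metric_on(points, d))
```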

document
an article, book, or other work, typically containing text or other media, that has some type of information content
E measure
a single-valued measure of the effectiveness of an information retrieval system (with 0=best, 1=worst), which is a function of both recall and precision, as well as a factor that determines the relative importance between these
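The glossary does not give a formula, so the sketch below assumes the commonly cited form due to van Rijsbergen, E = 1 - (1 + b^2)PR / (b^2 P + R), where P is precision, R is recall, and b is the relative-importance factor. The function name e_measure and the default b = 1 are illustrative choices.

```python
def e_measure(precision, recall, b=1.0):
    """E = 1 - (1 + b^2) * P * R / (b^2 * P + R); 0 is best, 1 is worst.

    b > 1 gives more weight to recall, b < 1 gives more weight to precision.
    """
    denominator = b * b * precision + recall
    if denominator == 0:
        return 1.0  # nothing relevant was retrieved: worst possible score
    return 1.0 - (1.0 + b * b) * precision * recall / denominator

print(e_measure(1.0, 1.0))   # perfect retrieval
print(e_measure(0.5, 0.5))   # half of each
```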
Engelbart
Doug Engelbart was the inventor of the mouse and other early interactive technologies who first demonstrated a powerful hypertext system working as a CSCW tool, and led work on Augment and NLS, to augment human knowledge and skills
Envision
a system developed at Virginia Tech in connection with the NSF-funded, ACM-supported project "A User-Centered Database from the Computer Science Literature" 1991-1995
exhaustivity
measure of the degree to which the content of a collection is "covered", typically used to describe a controlled vocabulary
faceted classification
a system for categorizing information in which different aspects or facets are separately considered
flat file
a component of a file system or entry on a storage device, that is treated as having no special structure beyond that of bytes, characters, words and/or lines
FSA (finite state automaton)
an abstract machine made up of states (including a special "start" or "initial" one as well as one or more "final" states) where one takes a state-state transition if the input token matches that for the transition --- that can recognize a regular language and so is equivalent to a regular expression --- often used for document analysis
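A minimal sketch of such a machine, written as a transition table. The states, the table, and the language it recognizes (an "a", then any number of "b", then a "c", i.e. the regular expression ab*c) are hypothetical and chosen only to illustrate the definition.

```python
START = 0
FINAL = {2}
# (current state, input token) -> next state
TRANSITIONS = {
    (0, "a"): 1,
    (1, "b"): 1,
    (1, "c"): 2,
}

def accepts(tokens):
    """Follow one transition per token; accept if we end in a final state."""
    state = START
    for token in tokens:
        state = TRANSITIONS.get((state, token))
        if state is None:      # no transition matches this token
            return False
    return state in FINAL

for s in ["ac", "abbc", "ab", "abca"]:
    print(s, accepts(s))
```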
filtering
producing output by restricting input according to some criteria --- in connection with text, images, speech, electromagnetic waves or signals
Guide
a hypertext system marketed by OWL (Office Workstations Ltd.) that includes scrolling and note capabilities
hashing
computing an address to look for an item by applying a mathematical function to a key for that item
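As an illustration, the sketch below assumes a deliberately simple function (sum of character codes modulo the table size) and resolves collisions by chaining. The table size, the names address, store, and lookup, and the stored values are all hypothetical.

```python
TABLE_SIZE = 11  # hypothetical small table

def address(key):
    """Map a string key to a table address by summing character codes."""
    return sum(ord(ch) for ch in key) % TABLE_SIZE

# Each slot holds a list so that keys with the same address can coexist.
table = [[] for _ in range(TABLE_SIZE)]

def store(key, value):
    table[address(key)].append((key, value))

def lookup(key):
    """Look only in the slot the function points to, not the whole table."""
    for stored_key, value in table[address(key)]:
        if stored_key == key:
            return value
    return None

store("memex", "entry A")
store("trie", "entry B")
print(address("memex"), lookup("memex"), lookup("missing"))
```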
HTML (HyperText Markup Language)
an application of SGML, defined by a simple Document Type Definition developed in 1993, that is used for tagging documents on the World-Wide Web, which can then be rendered with viewers like Mosaic or Netscape
HTML+ (HyperText Markup Language - extended)
an extended version of HTML, proposed in 1994, adding extra elements such as for interactive forms
HTTP (HyperText Transfer Protocol)
a standard used as the basis of the World-Wide Web for communication between clients and servers, proposed in 1993, that allows for retrieval of data and following of hypermedia links

HyperCard
a hypertext/hypermedia system developed by Apple, provided free of charge with new systems in 1987 and then sold by Claris, which implements a card-based model derived from Xerox NoteCards, and uses an object-oriented scripting language called HyperTalk
hypermedia
a collection of information objects or nodes in multimedia formats with links (i.e., hypertext extended to multimedia)
hypertext
a term coined by Theodor Nelson for a collection of information objects or nodes, containing text (and sometimes other multimedia formats in which case it is often called hypermedia), with links, that thus serves as an information graph that can be traversed by a hypertext system, which can present each node and follow links from anchors in nodes to other nodes (at which time the target node is also presented) --- information with a nonlinear organization
Hypertext Compendium
an ACM Database and Electronic Products offering that includes most of the early (through 1990) publications on hypertext, available in ASCII, using KMS, or in HyperCard form
Hypertext on Hypertext
an ACM Database and Electronic Products offering that includes the articles appearing in the July 1988 CACM special issue on hypertext, available in KMS, HyperCard, and HyperTies forms
HyTime
ISO standard 10744, describing the structure of time-based hypermedia documents
indexing
the process of building an index, such as when a collection of text documents is analyzed to automatically identify its words or word stems, which are then recorded and made to point to locations in the collection where they occur
indexing language
the set of terms used during indexing, possibly all the words in a collection, or a fixed set of terms found in a controlled vocabulary or thesaurus, possibly including phrases or other more complex forms
IDF (inverse document frequency)
a weighting formula used in some information retrieval systems whereby the importance of a term is based on the reciprocal of its document frequency in the collection; for example log (N/n) when the term occurs in n documents from a collection of N documents
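The example formula can be written directly. The function name idf and the sample counts are illustrative; the natural logarithm is an assumption, since the definition does not fix a base.

```python
import math

def idf(total_docs, docs_with_term):
    """log(N/n): rare terms get a high weight, ubiquitous terms get 0."""
    return math.log(total_docs / docs_with_term)

print(idf(1000, 10))    # a term found in 10 of 1000 documents
print(idf(1000, 1000))  # a term found in every document
```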
INQUERY
an information retrieval system (U. Mass. Amherst) that implements a probabilistic model based on use of a Bayesian inference network
Intermedia
a system developed in the IRIS project at Brown University, that is the precursor to both StorySpace and Hyper-G. The IRIS group produced ACM HonH - Hypertext on Hypertext. A good article on it is discussed in the course notes: HAAN92.
inverted file
a file structure in which words or other terms used to index a collection of information are connected with a list of pointers to the locations where those words occur --- the inverted form of documents containing terms, where terms point to document (occurrences)
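A minimal in-memory sketch of the idea, assuming whitespace tokenization and pointers of the form (document id, word position). The function name build_inverted_file and the two sample documents are hypothetical.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Turn {doc id: text} into {term: [(doc id, word position), ...]}."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return index

docs = {1: "links connect nodes", 2: "nodes hold text"}
index = build_inverted_file(docs)
print(index["nodes"])  # every location where "nodes" occurs
```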
JPEG
Joint Photographic Experts Group, ISO/CCITT standard for compressing still images (grayscale or color), available in lossless form for roughly 3:1 compression or in lossy form for 10:1 or more compression using the discrete cosine transform (DCT), coefficients based on the frequency response of the Human Visual System, a zig-zag run-length sequencing, and Huffman or arithmetic coding
KMS
hypertext/hypermedia system for expert users, computer-supported collaborative work, implementing a 2-frame/window model with a powerful scripting language
Licklider
J.C.R. Licklider was author of Libraries of the Future (1965), director at ARPA involved in early funding of the ARPAnet, and director of MIT's Project MAC in the late 1960s-70s
link
one of the distinguishing types of logical objects that characterizes hypertext; something that connects two anchors (usually directed, from the source to the target anchor); a type of pointer from one part (e.g., node) of a hypertext to another, sometimes labelled (by name or type), which is resolved by the hypertext system when the source anchor is selected, and then usually causes presentation of the target node
MARC
a record format developed by the Library of Congress for library catalogs, that can describe an individual book, journal, or other work, using a collection of fields and subfields
MARIAN
an experimental OPAC retrieval system developed starting in 1990 at Virginia Tech Computing Center, first used as an alternative to searching with the VTLS system, but also used as the search component of the Envision digital library system
MBONE
Multicast Backbone on the Internet, launched in 1992, which carries audio and video conferences, plus shared whiteboards
memex
an imaginary device described by Vannevar Bush in his seminal article "As We May Think" in the July 1945 Atlantic Monthly, implementing hypertext-style associative linking between documents and images, described using microform technology
MIDI
Musical Instrument Digital Interface
MIME
Multipurpose Internet Mail Extensions (multimedia mail protocol defined by RFC 1521)
MJPEG
Motion JPEG, a video compression scheme in which each frame is separately compressed using the JPEG standard
Mosaic
an Internet application used to browse and navigate on the World-Wide Web, that can render documents provided in HTML, follow links among such documents, use HTTP as well as other protocols (e.g., gopher, FTP, UUCP), and manipulate multimedia information carried using the MIME standard
MPEG
Moving Picture Expert Group --- digital video standard
natural language
a language used by humans to communicate, e.g., Chinese, English, Farsi, French, Hindi, Russian, Zulu
natural language text-search
a method of searching text collections in which user queries are supplied as natural language texts or at least phrases or word strings, usually involving the vector-space or probabilistic model of partial matching
Nelson
Theodor Nelson coined the terms hypertext and hypermedia, was a great proponent of these ideas, worked at Brown on some of the early systems, and proposed and worked toward Xanadu, a universal system for shared hypertext publishing and editing
netlib
a software system developed at Bell Labs by Dongarra and Grosse for searching of numerical analysis information, including algorithms and code
NII
National Information Infrastructure, the framework for U.S. efforts in the information industry, electronic publishing, and high-performance computing and communication (HPCC)
NREN
the National Research and Education Network, an evolving U.S. network to support the research and education community, building upon the NSFNET
NSFNET
an expansion of the ARPANET to serve the NSF community, leading toward the larger future NREN
OPAC
online public access catalog --- an automated system to allow searching in library catalogs
paperless society
a vision proposed by F.W. Lancaster and others in which electronic publishing and communication would largely eliminate the need for paper
PAT
a system developed at the Univ. of Waterloo at the Centre for the New OED, later taken over by Open Text Corp., which supports dictionary, SGML collection, and other types of searching, using a Patricia tree representation to give very rapid response to queries involving strings or phrases
Patricia tree
a data structure, somewhat like a trie, but implemented as a binary digital tree, where every semi-infinite string (sistring) from a large string (the concatenation of all text in a collection) is entered in the tree, and is associated with a pointer to the start of the sistring
plasticity
a property of electronic information in that it can be easily reshaped, republished, reused because it is in a manipulable digital representation
precision
a measure of how precise or specific an information retrieval system is, or behaves for a given query, computed as the ratio of the number of relevant items retrieved to the total number of items retrieved
query
a formal representation of a search need or anomalous state of knowledge (ASK, a la Belkin), that can be processed by an information retrieval system

RAID
Redundant Array of Inexpensive Disks - a method of combining several relatively cheap (e.g., SCSI-2) disks into a single unit where the disks operate in parallel to give higher throughput. Thus, data may be striped across the disks so playback or recording runs at the sum of the transfer speeds of the disks. Some levels allow for hot spares, with extra disks keeping error correction data, so that one of the disks can be replaced in case of failure while the array keeps running.

ranking
ordering the set of documents or items found by an information retrieval system in response to a query, usually in descending order of estimated relevance to the query
recall
a measure of how comprehensive or thorough an information retrieval system is, or behaves for a given query, computed as the ratio of the number of relevant items retrieved to the total number of relevant items
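The two ratios used in this glossary for precision and recall can be sketched with sets of document ids. The function names and the sample sets are hypothetical, and returning 0.0 for an empty denominator is an assumption.

```python
def precision(retrieved, relevant):
    """Relevant items retrieved / total items retrieved."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    """Relevant items retrieved / total relevant items."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

retrieved = {1, 2, 3, 4}   # what the system returned for a query
relevant = {2, 4, 5, 6}    # what a human judged relevant
print(precision(retrieved, relevant), recall(retrieved, relevant))
```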
regular expression
a string following the rules of a regular language, used to describe a class of strings (that can be recognized by an FSA), allowing alternatives, specifying a sequence, and indicating number of occurrences (0, 1, any number, at least one)
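A brief illustration using Python's re module, whose syntax is one of several regular expression notations. The pattern is a hypothetical example: "|" gives alternatives, adjacency gives sequence, "?" means 0 or 1 occurrence, "*" any number, and "+" at least one.

```python
import re

# Optional prefix, then one of two alternatives, then any number of "s".
PATTERN = re.compile(r"(hyper|multi)?(text|media)s*")
# At least one digit.
DIGITS = re.compile(r"[0-9]+")

def recognized(string):
    """True if the whole string belongs to the class PATTERN describes."""
    return PATTERN.fullmatch(string) is not None

for s in ["hypertext", "media", "multimedias", "hyper"]:
    print(s, recognized(s))
```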
relevance judgment
a decision made by a human regarding if a particular document is relevant to a particular query
searching
purposefully trying to find some object or information, sometimes with the help of a search system or search engine, sometimes using an information retrieval system, sometimes by submitting a formal query, often following some search strategy or plan
search tree
any data structure that involves a tree and can be used to speed up search for an item or keyword, such as a trie or Patricia tree
SGML
Standard Generalized Markup Language, ISO standard 8879, published in 1986, a flexible system to describe and represent documents, actually a metalanguage to describe classes of documents through Document Type Definitions (DTDs) and then documents that are in those classes
signature file
a file, sometimes implemented using superimposed coding, in which a document or document block is described by a signature, usually a fairly long bit string, in which bits are set if some term in the block hashes to that bit location --- a conjunctive query can be processed by building a signature for the query, and then every block that satisfies the query is guaranteed to have a signature matching that of the query (though some other blocks may match as well, and these false matches need to be discarded)
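A minimal sketch of signature matching, assuming a 64-bit signature, three bits set per term, and MD5 used only as a convenient repeatable hash. All names and parameters here are hypothetical choices, not those of any particular system.

```python
import hashlib

WIDTH = 64          # signature length in bits (assumed)
BITS_PER_TERM = 3   # bit positions set by each term (assumed)

def term_bits(term):
    """Hash a term to a few bit positions in the signature."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    mask = 0
    for i in range(BITS_PER_TERM):
        mask |= 1 << (digest[i] % WIDTH)
    return mask

def signature(terms):
    """OR together the bits of all terms in a block or query."""
    sig = 0
    for term in terms:
        sig |= term_bits(term)
    return sig

def may_match(block_sig, query_sig):
    """True if every bit of the query is also set in the block."""
    return (block_sig & query_sig) == query_sig

block = signature(["hypertext", "link", "node"])
print(may_match(block, signature(["link", "node"])))
```

A True result only marks the block as a candidate; the block itself must still be checked, because unrelated terms can happen to set the same bits.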
similarity measure
a method of estimating the similarity or "closeness" between two entities, such as two documents or a document and a query, where 0 represents none and higher values indicate more
SMART
an experimental information retrieval system developed initially at Harvard University in the early 1960s and then continued through the 1990s at Cornell University, under the supervision of Gerard Salton
specificity
how precise or exact a term or indexing language is in its ability to describe
stemming (suffix stripping)
removing (usually automatically) the ending of a word, typically with a fast algorithm, to form a canonical representation that usually approximates the root form
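A deliberately crude sketch of suffix stripping. The suffix list and the minimum stem length are assumptions made for illustration, and this is not a published stemming algorithm such as Porter's.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical, checked in this order
MIN_STEM_LENGTH = 3

def stem(word):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word

for w in ["linking", "linked", "links", "searches"]:
    print(w, stem(w))
```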
stop word list
a list of words or terms that is excluded from indexing and searching, i.e., ignored as irrelevant, usually made up of function words or words that occur very often in a given collection
superimposed coding
a scheme for developing a signature for a block of text, i.e., a short record with bits set at the locations to which terms in that block hash, that allows rapid search for conjunctive queries, and usually finds few records that have a suitable signature but do not satisfy the query
term
a word, word stem, keyword, root, phrase, acronym, abbreviation, descriptor, controlled vocabulary entry, thesaurus category or other construct meant to characterize some object or concept
term broadening
a process used by searchers or information retrieval systems to replace a single term with another or with a collection of terms that occur more often, and have wider or less precise coverage and/or meaning
term narrowing
a process used by searchers or information retrieval systems to replace a single term or phrase with another that occurs less often, and has narrower or more specific coverage and/or meaning
term weighting
a process of associating with a term a value, usually real-valued and possibly an estimate of a probability, that reflects the term's relative importance in a collection or document
TF (term frequency)
a weighting scheme usually used in information retrieval systems to rate the value of a term in a document based on the number of times it occurs in that document
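A short sketch of raw term frequency, assuming whitespace tokenization and lowercasing. The function name term_frequencies and the sample sentence are illustrative.

```python
from collections import Counter

def term_frequencies(text):
    """Count how many times each word occurs in one document."""
    return Counter(text.lower().split())

tf = term_frequencies("the link leads from the source node to the target node")
print(tf["the"], tf["node"], tf["anchor"])
```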
thesaurus
an information structure listing words or other terms, along with relationships between them, such as: broader than, narrower than, cross reference to, synonym of
trie
a digital tree, in which a multiway branch occurs at each level, such as for the letters of the alphabet, where information entered is represented by the path from the root to a node (possibly leaf) marked as "final"
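A minimal sketch of this structure, with one branch per letter and a "final" flag on nodes that end an entered word. The class and function names are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # letter -> child node (the multiway branch)
        self.final = False   # True if the path from the root spells an entry

def insert(root, word):
    node = root
    for letter in word:
        node = node.children.setdefault(letter, TrieNode())
    node.final = True

def contains(root, word):
    node = root
    for letter in word:
        node = node.children.get(letter)
        if node is None:
            return False
    return node.final

root = TrieNode()
for w in ["info", "inform"]:
    insert(root, w)
print(contains(root, "info"), contains(root, "inf"))
```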
truncation
cutting off the (right) end of a word or term, such as when a searcher asks for "inform*" to locate all words with "inform" as a prefix
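The "inform*" example can be sketched as a prefix match over a vocabulary. The function name truncation_search and the word list are hypothetical; a real system would more likely use a sorted index or a trie than a linear scan.

```python
def truncation_search(pattern, vocabulary):
    """Return the words matching a pattern that may end in '*'."""
    if pattern.endswith("*"):
        prefix = pattern[:-1]
        return sorted(w for w in vocabulary if w.startswith(prefix))
    return sorted(w for w in vocabulary if w == pattern)

vocabulary = ["inform", "information", "informal", "index"]
print(truncation_search("inform*", vocabulary))
```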
TULIP
The University Licensing Program --- Elsevier, 40 bitmap journals on materials
URL
Uniform Resource Locator, defined by Tim Berners-Lee's 1993 IETF Draft "Uniform Resource Locators" --- describing a document or service on the Internet as a string which identifies the protocol, server machine, and additional information (e.g., file path)
URN
Uniform Resource Name that will identify a document or service, as does a URL, but in a location-independent, logical, robust manner
video on demand
a system or service, usually involving storage of a large number of video programs, that can support a number of users each of whom can at any time request playback/delivery of any of the stored programs
volatility
a measure of how rapidly a collection of information changes
WAIS
Wide Area Information Server, originally developed using Z39.50, allowing client-server searching over the Internet, first of a collection of sources and then of actual information collections, usually involving a vector-space type search, often with relevance feedback
WWW
World-Wide Web, a logical infrastructure on the Internet in which documents and multimedia objects are linked, making use of HTTP, the HyperText Transfer Protocol, and represented in various forms including HTML, the HyperText Markup Language
Z39.50
the Information Retrieval Protocol, an ANSI and ISO standard for client-server computing between information retrieval systems, especially library catalogs (OPACs), adapted in WAIS

Copyright 1996 Edward A. Fox