Data Mining and Invasion of Privacy


Topic area Data Mining and Invasion of Privacy, a study of the potential impact of technological advances on society with specific reference to targeted direct mailings (junk mail) and Internet spam.
Target audience Specifically students in a data base or data mining course, but appropriate to others who can understand the concepts of data mining and the problems of junk mail and spam.
Activity type Pre-class research activity followed by in-class discussion, or a homework assignment.
Time required
  • Pre-class activity - several hours on-line or in a library;
  • In-class activity - 50 minutes.
Attachments None
Additional materials None
Background needed to complete the assignment An understanding of the principles and concepts of data mining, and the potential for "misuse".
References
Last modified 99/01/29


Abstract:

To study the potential impact of the indiscriminate use of data mining as a tool to enhance the accuracy of direct marketing (junk mail), and consequently of uses that more directly invade individual privacy.

Goals for the activity:

To awaken a student's understanding of the ability of computing technology to surpass the bounds of manual activity and to provide services that have a potentially negative impact if used inappropriately.

Knowledge / skills / attitudes to be developed (behavioral objectives):

To think ahead, when developing software or algorithms, about their potential for negative use, rather than simply expecting that users will act responsibly.

Procedure:

Prepare a hand-out in the form of a pre-class research activity:

In our studies of the impact of the computer on society, there are numerous examples of activities that would not be possible without the aid of the computer. The technique of data mining is one such advance in technology. The methodology allows users to garner data from disparate sources and to coalesce that data into information that did not exist in any single source. It is possible to find interesting and (potentially) useful patterns of associations, correlations, dependencies and summarization in data. Examples include finding valuable nuggets of information in market sales data, finding correlations among individuals in U.S. census information, and identifying trends in demographics and election voting practices. Thus an individual who has given information to different data bases, believing that the data in each is innocuous, now finds that the assemblage of that data provides a portrait that invades his/her privacy. For example, the combination of healthcare data, credit data, student records, driving history, etc. could develop a very interesting portrait of a person.
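The assembling of a portrait from separately innocuous sources can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the record layouts, the shared `person_id` key, and the helper name `build_profiles` are hypothetical, and real sources rarely share so convenient an identifier.

```python
from collections import defaultdict

# Hypothetical records; each source looks harmless on its own.
health_records = [
    {"person_id": "p1", "condition": "asthma"},
    {"person_id": "p2", "condition": "none on file"},
]
credit_records = [
    {"person_id": "p1", "late_payments": 3},
    {"person_id": "p2", "late_payments": 0},
]
driving_records = [
    {"person_id": "p1", "violations": 2},
]


def build_profiles(sources, key="person_id"):
    """Merge records from several sources into one profile per person."""
    profiles = defaultdict(dict)
    for source_name, records in sources.items():
        for record in records:
            person = record[key]
            for field, value in record.items():
                if field != key:
                    # Prefix with the source name so the origin stays visible.
                    profiles[person][f"{source_name}.{field}"] = value
    return dict(profiles)


profiles = build_profiles({
    "health": health_records,
    "credit": credit_records,
    "driving": driving_records,
})
print(profiles["p1"])
```

No single file above says much about person `p1`, but the merged profile places health, credit and driving details side by side, which is the effect the paragraph describes.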

In a typical business context, data mining is rarely "one-shot"; it is more an iterative, cyclic process in which the whole methodology of "knowledge discovery" is integrated into the business process. This means that in some cases it is very tightly coupled to the actual process that generates the data, cleans it and preprocesses it to make it amenable to sophisticated analysis. This raises the question of whether such tight coupling is a good thing or whether it is best to treat these stages as distinct.

Write a 2000-word report on data mining, its potential as a positive tool in the management of data and the assimilation of information, and the dangers of its misuse. The audience for the report should be non-technical but somewhat computer literate. Examine the relevant laws with respect to computer privacy and determine whether the use of data mining violates those laws.

If this is to be used as an in-class activity, then the following questions should be prepared on a sheet with space for student answers, to be completed in (say) the first 10 minutes of the class. A discussion of these points will then follow.

As a homework assignment, include these questions as part of the assignment.

  1. The success of data mining depends on the availability of lucrative data files. In what ways do vendors of data potentially infringe privacy laws?
  2. Besides privacy laws what other controlling mechanisms exist?
  3. What are the special problems of ethics and conduct that are potentially associated with using data mining as a tool?
  4. Does the use of data mining by telemarketing agents violate the code of conduct of computer professionals?
  5. Is there a difference between the codes to be followed by telemarketers and the programmers who develop the data mining tools for them?
  6. The most objectionable kinds of mining are those that are performed on transactional data (which includes information such as credit cards, phone numbers dialled, etc.) and those that find patterns connecting multiple individuals via some common theme. At the other extreme, mining the world wide web is considered to be a hot area for research, which leads us to issues of web search engines and indexing facilities (which, in fact, do some rudimentary data mining). So the question is: how do you regulate mining across such disparate resources/repositories?
  7. Should data mining be controlled by the government in a manner similar to the way in which there are plans for controlling encryption?
  8. In some cases, data mining really is to find patterns that are true for a larger group than a select specific set of individuals. This type of generalization ("People from Blacksburg like this type of shampoo") is sometimes considered more appropriate than claiming "Ben likes this type of shampoo". Discuss the implications of this with respect to your answers to the above two questions.
  9. If your manager asked you to undertake the development of a data mining application that (say) collected data from the various files that existed in the company to create profiles of your fellow workers, how would you respond? How would you justify your position?
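To support discussion of question 8, the difference between a group-level generalization and an individual-level claim can be sketched as follows. This is only an illustration under stated assumptions: the purchase records are invented, the function names are hypothetical, and the `min_group_size` threshold is an added assumption (not part of the original activity) showing one way to avoid reporting on groups so small that they identify individuals.

```python
from collections import Counter, defaultdict

# Hypothetical purchase records: (town, person, shampoo brand).
purchases = [
    ("Blacksburg", "Ben", "BrandA"),
    ("Blacksburg", "Ann", "BrandA"),
    ("Blacksburg", "Raj", "BrandB"),
    ("Roanoke", "Lee", "BrandB"),
]


def group_preferences(records, min_group_size=3):
    """Return the most common brand per town, skipping small groups."""
    by_town = defaultdict(list)
    for town, _person, brand in records:
        by_town[town].append(brand)
    result = {}
    for town, brands in by_town.items():
        # Assumed safeguard: a group of one or two is close to an individual.
        if len(brands) >= min_group_size:
            result[town] = Counter(brands).most_common(1)[0][0]
    return result


def individual_preferences(records):
    """Return the brand attributed to each named person."""
    return {person: brand for _town, person, brand in records}


group_view = group_preferences(purchases)
individual_view = individual_preferences(purchases)
print(group_view)
print(individual_view["Ben"])
```

The group view supports a statement of the form "People from Blacksburg like this type of shampoo", while the individual view supports "Ben likes this type of shampoo"; both are derived from exactly the same records.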

Assessing outcomes:

The reports can be graded in the light of the in-class discussion that follows. This would be an opportunity to have students do peer evaluations of their colleagues' work. However, if this approach is used, establish a very clear and rigid grading schedule for each grader to use.

Additional remarks:

This activity could easily be used either in a course specifically on computer ethics and social impact or as a single activity within a course on data bases or data mining. It is a good example of how concerns for ethics and impact can be integrated into many of the other "technical" courses in computer science.

Author contact information:

J.A.N. Lee
Department of Computer Science
Virginia Tech
Blacksburg VA 24061-0106
Ph: (540) 231-5780
FAX: (540) 231-6075

In collaboration with Naren Ramakrishnan, Assistant Professor of Computer Science, Virginia Tech.