Project: Intelligent Information Extraction
Student Researchers: Elena Eneva, Katharina Probst
Advisor: Linda Bright Lankewicz
Institution: University of the South (Sewanee)




Goals and Purpose of the CREW Project (Collaborative Research Environment for Women in Undergraduate Computer Science and Engineering)
Sponsored by the Computing Research Association Committee on the Status of Women in Computing Research (CRA-W)
Funded by the National Science Foundation's Partnership for Advanced Computational Infrastructure's Education, Outreach and Training program (EOT-PACI)

The goal of the project was to investigate text extraction and classification and the applicability of neural networks to the problem. The purpose of the project was to give undergraduate students the opportunity to do research and to learn the methodologies used by researchers. It was important that the students understand how to approach research: beginning with a study of background work, followed by the difficult task of narrowing the exploration to a carefully stated hypothesis, then developing a process for testing the hypothesis, documenting all work, and validating conclusions.

The desired outcomes as stated in the original CREW proposal were as follows: The outcome desired is that these talented women students have the opportunity to be involved in research in a way that will serve them as they move into graduate work. Both plan to pursue Ph.D.'s in computer science and want to gain admission to top graduate schools in their interest areas. It is hoped that this experience will give them an edge that they do not have in competing against students from large research universities. The more immediate goal is that the research team write a paper describing the work, its relevance to current research, and an analysis of the results.

Researchers
Two student researchers, Elena Eneva and Katharina Probst, worked with Associate Professor Linda Lankewicz at the University of the South during their senior year. Both students will begin their graduate work at Carnegie Mellon University in the fall of 2000. Their statements of the personal benefits of involvement in the project appear below.

  • Elena Eneva
    The project was very beneficial for me because it showed me what serious research is really like. First, it helped me discover the importance of doing background reading and thoroughly investigating the problem before one attempts to solve it. Then, it taught me how to work as a part of a research group, to interact effectively with the other members of the group, and to coordinate the work we are doing. I also learned the importance of considering a problem from all sides and being flexible enough and prepared to change an aspect in order to improve the performance. But most importantly, by reasoning critically and understanding and overcoming the challenges of the field, I gained valuable problem solving skills, experience, and confidence. And these will be with me no matter what I choose to work on next, because they will be equally applicable to other projects and problems, and even to different fields. Next year I will begin my graduate work at Carnegie Mellon University, at the Center for Automated Learning and Discovery, a department within the School of Computer Science. I am looking forward to doing more research and applying what I have learned during this project to my work at CMU.

  • Katharina Probst
    CREW was a great learning experience for me. I gained insight into the process of researching a topic as well as producing a creative solution to a real-world problem. I also learned to work more effectively in a team. At first, the Sewanee CREW team spent a fair amount of time with thorough reading of the literature. We are now familiar with the state of the art in information extraction and information classification. We also read quite a few articles on issues that seem marginal but proved very helpful to our approach. These included text compression, language identification, and others. From reading this large number of articles, I gained insight into the style of writing used for research. I will attend graduate school in the fall, and it certainly will be to my advantage to have had extensive exposure to research articles, so that I can apply this knowledge to my own articles in the future. Through the readings I also learned how researchers at different institutions tackle problems and how many different approaches there are. I learned that research calls for a large amount of creativity as well as competence in the field. This realization helped me in the actual process of producing a unique solution to a real-world problem. Our intention was not to reproduce what other researchers had done before us. Therefore, we were forced to come up with something new. I believe that the thorough review of the literature helped enable me to do so. Later in the process our group began implementing a prototype. Up to that point, I had mainly written code for classes. I discovered that the two experiences are very different. Writing programs for classes is much more well-defined. The problems are well thought through and designed to be based only on what one has been taught before. When implementing programs for our prototype, on the other hand, I was faced with a real-world problem. 
    I had to get acquainted with a language I had never programmed in before (Perl), because it seemed the most reasonable for the problem at hand. Not only did this boost my self-confidence as a programmer, but I believe that the experience implementing solutions to real problems will help me in graduate school. CREW also taught me to work in a team. I had done team projects for classes before, but never before had I spent so much time with a team. I learned to make compromises for the team, and to let myself be convinced that a solution was better than the one I suggested. I learned that time management is crucial in teamwork. Not only did I have deadlines, but I needed to take into consideration other people's deadlines and plan around them. Another important learning experience for me was how frustrating research can be at times. When after spending weeks and weeks of effort, the results were not satisfactory, it was easy to become frustrated and willing to give up. I had to learn that it is necessary to keep up the effort in order to reach a stage where the results are better. Again, I think that in graduate school this experience will work to my advantage. I am very thankful for the opportunity CREW has given me. Most likely it had an influence on my being accepted to graduate schools. It also prepared me well for graduate research.

Research Process

Background Investigation
Perhaps the most important aspect of any research is understanding the work of others, being aware of current developments, and making contact with other researchers. Over one-third of the time on the CREW project involved reading the literature, critiquing the work, discussing how various approaches differ, and exploring ideas that might be useful for our own work. Some of the papers studied are listed below.

  • 1. Active Learning for Natural Language Parsing and Information Extraction. Cynthia A. Thompson, Mary Elaine Califf, and Raymond J. Mooney. Proceedings of the 16th International Machine Learning Conference, 1999.
  • 2. Applying ILP-based Techniques to Natural Language Information Extraction: An Experiment in Relational Learning. Mary Elaine Califf and Raymond J. Mooney, Working Notes of the IJCAI-97 Workshop on Frontiers of Inductive Logic Programming, 1997.
  • 3. Automatically Generating Extraction Patterns from Untagged Text. Ellen Riloff, Proceedings of the 13th National Conference on Artificial Intelligence, 1996.
  • 4. Combining Evidence for Effective Information Filtering. Susan T. Dumais, AAAI Spring Symposium, 1996.
  • 5. Examining Machine Learning for Adaptable End-to-End Information Extraction Systems. Oren Glickman and Rosie Jones, AAAI 1999 Workshop on Machine Learning for Information Extraction, 1999.
  • 6. Information Extraction. Wendy Lehnert. University of Massachusetts, 1996, www-nlp.cs.umass.edu/.
  • 7. Learning to Extract Symbolic Knowledge from the World Wide Web. Mark Craven, Dan DiPasquo, Dayne Freitag, Andrew McCallum, Tom Mitchell, Kamal Nigam, and Sean Slattery, Proceedings of 15th National Conference on Artificial Intelligence, 1998.
  • 8. Pattern Recognition with Neural Networks. Abhijit S. Pandya and Robert B. Macy, IEEE Press, 1995.
  • 9. Relational Learning Techniques for Natural Language Information Extraction. Mary Elaine Califf, Artificial Intelligence Laboratory Technical Report AI98-276, University of Texas at Austin, 1997.
  • 10. Research Projects in Natural Language Processing. Ellen Riloff, University of Utah, http://www.cs.utah.edu/~riloff/nlp.html.
  • 11. Statistical Identification of Language. Ted Dunning, Technical report CRL MCCS-94-273, Computing Research Lab, New Mexico State University, March 1994.
  • 12. Text Compression via Alphabet Re-Representation. Philip M. Long, Apostol I. Natsev, and Jeffrey Scott Vitter, Data Compression Conference, 1997.
  • 13. Unsupervised Models for Named Entity Classification. Michael Collins and Yoram Singer, http://almond.srv.cs.cmu.edu/afs/cs/project/theo-4/text-learning/www/index.html.
  • 14. Using Linear Algebra for Intelligent Information Retrieval. Michael W Berry, Susan T. Dumais, and Gavin W. O'Brien, SIAM Review, December 1995.

Prototype Development
A prototype was developed in order to explore various methods and to ascertain the feasibility of some of the approaches under consideration. Part of the prototype was a neural network for processing a sequence of text. Various architectures were tested. The hypothesis was that patterns could be extracted from text, that those patterns could be used to classify documents, and that the commonly appearing patterns would be indicative of the thrust of the text and useful for information extraction. The patterns would be composed of both words and word-tags, where common words were replaced by part-of-speech word-tags.

During the prototype development a number of tools were examined for use in support of the work. It is gratifying to undergraduate researchers to find that the computer science community shares tools such as these.

  • Newsgroups data set. Collected by Ken Lang, the 20,000 articles are divided among 20 Usenet discussion groups. The data set can be downloaded at http://www.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes/20_newsgroups.tar.gz.
  • Rainbow. Part of the Bow Library developed by Andrew McCallum at Carnegie Mellon University. Bow is a library of C code useful for writing statistical text analysis, language modeling, and information retrieval programs, available at http://www.cs.cmu.edu/~mccallum/bow.
  • Neural Network Description Language (NEUDL). Developed by Joey Rogers at the University of Alabama, NEUDL is an interpreted language for the design, training, and operation of neural networks.
  • Eric Brill's Transformation-Based Part-of-Speech Tagger. Eric Brill's tagger assigns tags from the University of Pennsylvania's Treebank Tag Set and is widely used by researchers.
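
The pattern hypothesis described in this section can be illustrated with a short sketch. It is written in Python for readability (the project's own programs were in Perl), and every name in it is hypothetical: TOY_TAGS is a tiny lookup table standing in for a real part-of-speech tagger such as Brill's, and the pattern length of 3 is an assumed value, since the report does not state the length used.

```python
from collections import Counter

# Hypothetical stand-in for a part-of-speech tagger such as Brill's;
# a tiny lookup table keeps the sketch self-contained.
TOY_TAGS = {"the": "DT", "a": "DT", "of": "IN", "in": "IN", "is": "VBZ", "to": "TO"}


def mixed_patterns(tokens, common_words, length=3):
    """Return sliding-window patterns in which common words become word-tags."""
    mixed = [TOY_TAGS.get(t, "UNK") if t in common_words else t for t in tokens]
    return [tuple(mixed[i:i + length]) for i in range(len(mixed) - length + 1)]


def frequent_patterns(documents, common_words, length=3, top=5):
    """Count patterns over a group of documents and return the most common ones."""
    counts = Counter()
    for doc in documents:
        counts.update(mixed_patterns(doc.lower().split(), common_words, length))
    return counts.most_common(top)


if __name__ == "__main__":
    docs = ["the engine of the car", "the engine of a truck"]
    for pattern, count in frequent_patterns(docs, {"the", "of", "a"}):
        print(count, " ".join(pattern))
```

In this sketch the patterns that recur across the documents of one discussion group play the role of the "commonly appearing patterns" in the hypothesis; how the project actually selected and scored patterns is not described in this report.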

Project Implementation
Implementation involved developing mechanisms for pre-processing data sets, expanding the prototype neural network into an architecture to support words and word-tags, establishing a test suite for evaluating test runs, and documenting all processes during implementation. Pre-processing included converting data sets into a form for input to the neural network and developing an efficient way of interpreting the output from the neural network processing. Pre-processing was the most difficult part of the implementation work. Files containing articles from discussion groups required conversion to patterns of words and word-tags. It was necessary to convert thousands of files into sets of patterns of a given length, determine the common words to replace, and then replace those words with word-tags generated using Brill's tagger. Then those patterns were converted into binary form for neural network input. The most routine part of the implementation was the division of data sets into training and testing units and deciding on test cases for runs. The Rainbow program was valuable for these tasks. The most essential part of the implementation was documenting each step taken in order to build documentation, facilitate analysis, prevent misinterpretation, and organize material.
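
The last two pre-processing steps, converting patterns to binary form and dividing the data into training and testing units, might look roughly like the Python sketch below. The report does not say which binary encoding was used, so the fixed-width binary token id here is an assumption, and all function names are hypothetical. The split function is a plain illustration and does not reproduce the interface of the Rainbow program that the project used for this task.

```python
import random


def build_vocab(patterns):
    """Assign each distinct token an integer id, in sorted order."""
    tokens = sorted({tok for pat in patterns for tok in pat})
    return {tok: i for i, tok in enumerate(tokens)}


def encode_pattern(pattern, vocab):
    """Concatenate the fixed-width binary id of each token into one bit list.

    Assumed encoding: the report only says patterns were converted
    "into binary form", without giving the scheme.
    """
    width = max(1, (len(vocab) - 1).bit_length())
    bits = []
    for tok in pattern:
        bits.extend(int(b) for b in format(vocab[tok], f"0{width}b"))
    return bits


def split_train_test(items, test_fraction=0.2, seed=0):
    """Shuffle reproducibly, then hold out a fraction of the items for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    patterns = [("DT", "engine", "IN"), ("IN", "DT", "car")]
    vocab = build_vocab(patterns)
    for pattern in patterns:
        print(pattern, encode_pattern(pattern, vocab))
```

Because every token gets the same number of bits, each pattern of a given length maps to an input vector of constant size, which is what a fixed-size neural network input layer requires.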

Conclusions and Results
The benefits cited by the student researchers indicate that the purpose of the project, to provide undergraduate students the opportunity for research and for learning methodologies used by researchers, has been accomplished. At the end of the CREW project period all the implementation has been completed, and results are being analyzed for a paper reporting findings.