CRA-W

Project: Information Retrieval Based on Large Text Collections: Design and Development of a Word Conflation Module
Student Researchers: Emily Gibson, Christina Grape
Advisor: Dr. Miroslav Martinovic
Institution: The College of New Jersey

An algorithm for word conflation introduced by M.F. Porter in 1980 (http://open.muscat.com/stemming) has long been recognized as a simple, computationally inexpensive, and successful technique for bringing together words that convey the same or similar meaning and treating them as the same content contributors. The algorithm consists of five simple modules, each dedicated to handling certain kinds of word transformations. These modules are applied to a given word sequentially, each producing its own simplified version of the word (e.g., for the word RADICALLIZATIONS, the five modules produce, respectively: RADICALLIZATION, RADICALLIZE, RADICALLIZE, RADICALL, and RADICAL, the final product). It has been observed, though, that in some cases the algorithm does not conflate related words into the same stem (e.g., DEEPENINGS conflates to DEEPEN, while DEEP stays DEEP; likewise, RELATEDNESS conflates to RELATED, while RELATED is transformed into RELAT).
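
The sequential structure described above can be illustrated with a minimal Python sketch. It is not Porter's algorithm: the three suffix-stripping stages, the names make_suffix_stage, TOY_STAGES, and run_stages, and the sample word CONNECTIONS are all hypothetical placeholders chosen for illustration. The sketch only shows how each module receives the previous module's output and how every intermediate result can be recorded.

```python
from typing import Callable, List, Sequence

# A stage takes a word and returns a (possibly) simplified word.
Stage = Callable[[str], str]


def make_suffix_stage(suffix: str, min_stem_len: int = 3) -> Stage:
    """Return a toy stage that strips one suffix.

    Placeholder only: Porter's real modules use richer rules and conditions.
    """
    def stage(word: str) -> str:
        stem = word[: len(word) - len(suffix)]
        # Strip only when the suffix matches and enough of the word remains.
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem
        return word
    return stage


# Hypothetical stand-ins for the algorithm's modules (assumption, not Porter's rules).
TOY_STAGES: List[Stage] = [
    make_suffix_stage("S"),
    make_suffix_stage("ING"),
    make_suffix_stage("ION"),
]


def run_stages(word: str, stages: Sequence[Stage]) -> List[str]:
    """Apply the stages in order and return the output of every stage."""
    outputs: List[str] = []
    current = word.upper()
    for stage in stages:
        current = stage(current)  # each stage sees the previous stage's output
        outputs.append(current)
    return outputs


if __name__ == "__main__":
    print(run_stages("connections", TOY_STAGES))
```

The last element of the returned list plays the role of the final stem; the earlier elements are the intermediate words that a conventional stemmer would discard.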

The goal of our research is to redesign Porter's method into an algorithm of our own that overcomes these deficiencies. We will introduce and evaluate the idea that the set of all five words (the outputs of the individual modules of Porter's algorithm), rather than only the final word (the output of the fifth and last module), be used as the representative STEM. We therefore propose first to develop a theoretical description and justification of our concept, and then to build an independent software module that implements and tests it. We then propose incorporating this module into the IR system for large text corpora.
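
As a rough illustration of the proposed representation, the following Python sketch contrasts the conventional single stem with a set built from all five module outputs. The per-module outputs for RADICALLIZATIONS are copied from the example given earlier; everything else (the names TRACES, final_stem, stem_set, and build_index, and the idea of indexing a word under every member of its set) is an assumption made for this sketch, since the project description does not yet specify how stem sets would be stored or compared.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set

# Per-module outputs for one word, taken from the example in the text.
# In a real module these would be produced by the five modules themselves.
TRACES: Dict[str, List[str]] = {
    "RADICALLIZATIONS": [
        "RADICALLIZATION",
        "RADICALLIZE",
        "RADICALLIZE",
        "RADICALL",
        "RADICAL",
    ],
}


def final_stem(trace: List[str]) -> str:
    """Conventional representative: the output of the last module only."""
    return trace[-1]


def stem_set(trace: List[str]) -> FrozenSet[str]:
    """Proposed representative: the set of all module outputs."""
    return frozenset(trace)


def build_index(traces: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map every member of a word's stem set back to that word.

    Assumption: indexing under each member is one possible way an IR
    system could use stem sets; it is not prescribed by the text.
    """
    index: Dict[str, Set[str]] = defaultdict(set)
    for word, trace in traces.items():
        for stem in stem_set(trace):
            index[stem].add(word)
    return dict(index)


if __name__ == "__main__":
    trace = TRACES["RADICALLIZATIONS"]
    print(final_stem(trace))
    print(sorted(stem_set(trace)))
```

Because two modules produce the same word in this example, the set holds four distinct members rather than five; using a set makes that collapse explicit.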