SNU Biointelligence Lab

FACT

Text Filtering and Classification from Large-Scale Document Collections

1998–2001 Korea Research Foundation (KRF)

The FACT project investigated machine learning methods for automatically filtering and classifying text from large-scale document collections, work that sits at the intersection of information retrieval and natural language processing. The lab built and evaluated its own text retrieval and filtering engine, SCAIR (SNU Cognitive AI Research), which was submitted to multiple tracks of the internationally competitive Text REtrieval Conference (TREC).

Overview

As the web exploded with text content in the late 1990s, the ability to automatically identify and surface relevant documents from massive collections became a critical challenge. The FACT project addressed this by developing principled probabilistic and ensemble learning methods for two related problems: adaptive text filtering (continuously deciding whether incoming documents are relevant to a user profile) and text classification (assigning documents to predefined categories).

A central contribution was the application of boosting, particularly AdaBoost with Naïve Bayes classifiers as weak learners, to the text filtering problem. This approach combined term frequency statistics with probabilistically accurate confidence ratios, yielding strong empirical performance on the TREC filtering track benchmarks. The project also explored boosting linear perceptrons, reinforcement learning for personalized filtering, and semi-supervised methods that exploited large quantities of unlabeled documents.
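To make the boosting idea concrete, the following is a minimal sketch of discrete AdaBoost with a weighted multinomial Naïve Bayes weak learner over term frequencies. It is an illustration under stated assumptions, not the project's actual implementation: the function names, the Laplace smoothing, the weight rescaling, and the toy documents are all hypothetical, and the weak learner here casts hard +1/-1 votes rather than the confidence-rated outputs described above.

```python
import math
from collections import Counter


def train_weighted_nb(docs, labels, weights, vocab):
    """Fit a multinomial Naive Bayes model on weighted term frequencies."""
    n = len(docs)
    class_mass = {+1: 0.0, -1: 0.0}
    term_mass = {+1: Counter(), -1: Counter()}
    for doc, y, w in zip(docs, labels, weights):
        scaled = w * n  # assumption: rescale so uniform weights act like plain counts
        class_mass[y] += scaled
        for term, tf in Counter(doc).items():
            term_mass[y][term] += scaled * tf
    model = {"log_prior": {}, "log_lik": {}}
    for y in (+1, -1):
        model["log_prior"][y] = math.log((class_mass[y] + 1.0) / (n + 2.0))
        total = sum(term_mass[y].values()) + len(vocab)  # Laplace smoothing
        model["log_lik"][y] = {
            t: math.log((term_mass[y][t] + 1.0) / total) for t in vocab
        }
    return model


def nb_score(model, doc):
    """Log-odds that doc is relevant (+1) rather than irrelevant (-1)."""
    score = model["log_prior"][+1] - model["log_prior"][-1]
    for term in doc:
        if term in model["log_lik"][+1]:  # ignore out-of-vocabulary terms
            score += model["log_lik"][+1][term] - model["log_lik"][-1][term]
    return score


def nb_predict(model, doc):
    return 1 if nb_score(model, doc) > 0 else -1


def adaboost_nb(docs, labels, rounds=10):
    """Discrete AdaBoost: reweight documents so later learners focus on mistakes."""
    n = len(docs)
    vocab = sorted({t for d in docs for t in d})
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        model = train_weighted_nb(docs, labels, weights, vocab)
        preds = [nb_predict(model, d) for d in docs]
        err = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
        if err >= 0.5:  # weak learner is no better than chance; stop
            break
        clipped = max(err, 1e-10)  # avoid division by zero for a perfect learner
        alpha = 0.5 * math.log((1.0 - clipped) / clipped)
        ensemble.append((alpha, model))
        if err == 0.0:  # nothing left to reweight
            break
        weights = [w * math.exp(-alpha * y * p)
                   for w, p, y in zip(weights, preds, labels)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return ensemble


def boosted_predict(ensemble, doc):
    """Weighted vote of all weak learners; +1 means 'relevant to the profile'."""
    vote = sum(alpha * nb_predict(model, doc) for alpha, model in ensemble)
    return 1 if vote > 0 else -1


# Toy data (hypothetical): tokenized documents, +1 = relevant, -1 = irrelevant.
docs = [
    ["boosting", "text", "filtering"],
    ["naive", "bayes", "text", "filtering"],
    ["filtering", "relevant", "documents"],
    ["football", "match", "score"],
    ["stock", "market", "score"],
    ["weather", "forecast", "rain"],
]
labels = [1, 1, 1, -1, -1, -1]
ensemble = adaboost_nb(docs, labels)
print(boosted_predict(ensemble, ["text", "filtering", "boosting"]))
print(boosted_predict(ensemble, ["football", "score"]))
```

On data this small a single Naïve Bayes model already separates the two classes, so boosting stops after one round; the reweighting step only comes into play on harder collections where the first learner misclassifies some documents.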

Results were validated through repeated participation in TREC (TREC-7 through TREC-9), covering the ad hoc retrieval, adaptive filtering, and question answering tracks. TREC is a major annual evaluation campaign organized by NIST that enables large-scale, standardized comparison of information retrieval systems across participating research groups worldwide.

Research Team

Technical Approach

Publications

Historical Context

FACT was conducted at a pivotal moment when the web was growing faster than human curation could manage. The lab’s sustained participation across TREC-7, TREC-8, and TREC-9 (1998–2000) placed it alongside leading international information retrieval groups, and the boosting-based filtering methods produced results that influenced the broader community’s adoption of ensemble methods for text tasks. This work laid the groundwork for the lab’s later research in text mining, bioinformatics text analysis, and multimodal learning.
