SNU Biointelligence Lab

LaText

Text Mining Based on Latent Variable Models

2001–2004

The LaText project investigated the application of latent variable models to text mining, developing probabilistic frameworks for automatically uncovering hidden semantic structure in large document collections. The work was conducted as the web was producing document corpora at unprecedented scale.

Overview

The central problem LaText addressed was how to make sense of massive, unlabeled text corpora. Traditional keyword-based approaches could retrieve documents but failed to capture the underlying semantics: the same concept can appear under many different surface forms (synonymy), and the same word can mean different things in different contexts (polysemy). Latent variable models offered a principled probabilistic alternative by positing that each document is generated from a mixture of hidden topics. Inferring those topics exposes semantic structure in the corpus that surface keywords miss.

LaText built on the then-emerging Probabilistic Latent Semantic Analysis (PLSA) framework introduced by Hofmann (1999), extending it in a Bayesian direction and applying it to practical text mining tasks on large Korean and English document collections, including document clustering, topic discovery, and dimensionality reduction.
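As a rough illustration of the mixture-of-topics idea behind PLSA, the sketch below fits the standard decomposition P(w|d) = sum over z of P(w|z) P(z|d) by expectation-maximization on a toy term-count matrix. It is a minimal sketch of textbook PLSA, not the LaText implementation: the project's Bayesian extensions are not reproduced here, and the function name `plsa`, its parameters, and the toy corpus are assumptions made purely for illustration.

```python
import numpy as np


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit textbook PLSA by EM on a (n_docs, n_words) term-count matrix.

    Hypothetical helper for illustration only.
    Returns P(z|d) with shape (n_docs, n_topics) and
    P(w|z) with shape (n_topics, n_words).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape

    # Random initialisation, normalised into valid probability distributions.
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    eps = 1e-12  # guards against division by zero
    for _ in range(n_iter):
        # E-step: posterior P(z|d,w), proportional to P(z|d) * P(w|z).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]  # (docs, topics, words)
        post = joint / (joint.sum(axis=1, keepdims=True) + eps)

        # M-step: re-estimate both distributions from the expected
        # counts n(d,w) * P(z|d,w).
        expected = counts[:, None, :] * post
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps
    return p_z_d, p_w_z


# Toy corpus (made up): documents 0-1 use words 0-1, documents 2-3 use words 2-3.
counts = np.array([
    [4, 3, 0, 0],
    [5, 2, 0, 0],
    [0, 0, 3, 4],
    [0, 0, 2, 5],
], dtype=float)

p_z_d, p_w_z = plsa(counts, n_topics=2)
print(np.round(p_z_d, 3))  # per-document topic mixture P(z|d)
print(np.round(p_w_z, 3))  # per-topic word distribution P(w|z)
```

Each row of `p_z_d` is a low-dimensional representation of a document, which is the sense in which a topic model serves both clustering and dimensionality reduction. The dense three-dimensional posterior array keeps the sketch short but would not scale to a large corpus, where a sparse formulation would be needed.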

Technical Approach

Research Tasks

Research Team

Historical Context

LaText was conducted when probabilistic topic models were an active research frontier. The project began before the journal publication of Latent Dirichlet Allocation (Blei, Ng, and Jordan, 2003) and represents the lab’s early engagement with fully generative Bayesian models for text. It ran concurrently with the lab’s bioinformatics work (BrainGene, DNAChipBench) and drew on the same Bayesian inference infrastructure developed for those projects.

The text-mining methods developed in LaText contributed to later biomedical NLP work, including the PubMiner system (2004) for machine learning-based mining of biomedical literature. The lab’s interest in discovering latent structure in high-dimensional data traces a continuous thread from LaText through hypernetwork models and into the modern multimodal learning era.

Search for related publications on the Research page.