They found a way to thematically sort all of Wikipedia on a Laptop at Columbia University

In a landmark paper, Online Learning for Latent Dirichlet Allocation, Blei and his co-authors Matthew Hoffman and Francis Bach introduced a way to extend topic modeling to millions and billions of documents. Blei, Hoffman, and Bach were recently awarded a Test of Time Award for their work at the Neural Information Processing Systems (NeurIPS) conference. “This paper took topic modeling, an extremely popular technique, and made it scalable to mammoth datasets,” said John P. Cunningham, an associate professor of statistics at Columbia who was not involved in the research. “They did it using stochastic optimization, a statistical technique that’s now so widely used it’s nearly impossible to imagine modern machine learning and AI without it.”

The internet had only just put mountains of information at our fingertips when a new difficulty arose: finding what we needed. With his statistical methods for organizing documents thematically, David Blei, a professor of computer science and statistics at Columbia, has helped us uncover such nuggets of gold, making it easier to search and explore large collections of text. His topic modeling approach now powers everything from spam filters to recommendation engines, but its reach was limited until a decade ago: datasets much larger than a few hundred thousand pages overwhelmed it.


Blei developed topic modeling, a statistical technique he called Latent Dirichlet Allocation, or LDA, as a grad student at the University of California, Berkeley. LDA allowed an algorithm to ingest an archive of, say, news articles and sort them into topics like sports, health, and politics without any prior knowledge of those subjects. The algorithm could extract broad themes by finding statistical patterns among the words themselves.

Topic modeling was as effective at parsing product data as it was news stories; replacing documents with customers, for example, and words with products, the algorithm might discover “art supplies” as a topic, divided into themes like paint, paintbrushes, and colored pencils. “The magical thing about topic modeling is that it finds this structure without being given any hints or deep knowledge about English or any other language,” said Hoffman (SEAS ’03), now a researcher at Google. From its debut in 2003, topic modeling had an enormous impact. But as datasets grew larger, the algorithm struggled. Blei and Bach were discussing the problem one night at a bar when they realized that stochastic optimization, an idea introduced decades earlier by statistics professor Herbert Robbins, could provide a workaround.

In a groundbreaking paper published in 1951, just before coming to Columbia, Robbins explained how an optimization problem could be solved by estimating the gradient through randomized approximation, a methodology now called stochastic optimization. The technique was later used to efficiently approximate gradients in a sea of data points. Rather than calculate the gradient precisely, using all of the data, the optimizer repeatedly samples a subset of the data to get a rough estimate much faster.
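The idea above can be sketched in plain Python: instead of computing the exact gradient over every data point, sample a small batch, follow the noisy estimate, and shrink the step size over time. The dataset, learning rate, and least-squares objective below are all hypothetical choices for illustration:

```python
# Minimal sketch of stochastic optimization: a cheap, noisy gradient
# estimate from a random minibatch replaces the exact full-data gradient.
import random

# Fit a single parameter w minimizing mean squared error on y = w * x.
data = [(x, 3.0 * x) for x in range(1, 101)]  # true w is 3.0

def full_gradient(w):
    # Exact gradient of the mean squared error over ALL points (expensive).
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)

def stochastic_gradient(w, batch_size=5):
    # Noisy but cheap estimate from a random minibatch.
    batch = random.sample(data, batch_size)
    return sum(2 * (w * x - y) * x for x, y in batch) / batch_size

random.seed(0)
w, lr = 0.0, 1e-4
for step in range(1, 2001):
    # Decaying step size: large moves early, small corrections later.
    w -= (lr / step ** 0.5) * stochastic_gradient(w)

print(round(w, 2))  # converges to the true value 3.0
```

Each stochastic step touches only 5 of the 100 points, yet the estimate still converges; on billions of documents, that difference between sampling and a full pass is what made online LDA practical.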