Cornell University

David Mimno

Assistant Professor

Areas of Interest

Machine learning, text mining, digital humanities


David Mimno joined the Information Science department at Cornell University in the Fall of 2013. Prior to that, he was a postdoctoral researcher with David Blei at Princeton. He received his PhD from the University of Massachusetts Amherst, with Andrew McCallum. He tweets @dmimno.

Before UMass, David Mimno worked for an internet auction startup, the NLP group at the University of Sheffield, and the Perseus Project, a cultural heritage digital library. He has a particular interest in historical texts and languages.

David Mimno is currently chief maintainer for the MALLET Machine Learning toolkit. He organized the NorthEast Student Colloquium on Artificial Intelligence (NESCAI) at UMass in 2010 along with Sameer Singh. He ran the UMass Machine Learning and Friends lunch series for two years, and organized the Princeton Machine Learning lunch series.


Publications
  • A Practical Algorithm for Topic Modeling with Provable Guarantees Sanjeev Arora, Rong Ge, Yonatan Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu, Michael Zhu. ICML, 2013, Atlanta, GA. (Selected for long-form presentation). PDF
  • Spectral algorithms for LDA have been useful for proving learnability bounds, but it has not been clear whether they are practical. This paper presents an algorithm that maintains theoretical guarantees while also providing extremely fast inference. We compare this new algorithm directly to standard MCMC methods on a number of metrics, on both synthetic and real data.
  • Scalable Inference of Overlapping Communities Prem Gopalan, David Mimno, Sean Gerrish, Michael J. Freedman, David Blei. NIPS, 2012, Lake Tahoe, NV. (Selected for spotlight presentation)
  • Sparse stochastic inference for latent Dirichlet allocation David Mimno, Matthew Hoffman and David Blei. ICML, 2012, Edinburgh, Scotland. (Selected for long-form presentation). PDF
  • Gibbs sampling can be fast when data is sparse, but doesn't scale in memory because it requires keeping a state variable for every token. Online stochastic inference uses constant memory, but can't leverage sparsity and so becomes slow with large vocabularies and many topics. We present a method that uses Gibbs sampling in the local step of a stochastic variational algorithm. The resulting method can process a corpus of 1.2 million books (33 billion words) with thousands of topics on a single CPU.
  • Computational Historiography: Data Mining in a Century of Classics Journals David Mimno. ACM J. of Computing in Cultural Heritage. 5, 1, Article 3 (April 2012), 19 pages. PDF
  • Topic Models for Taxonomies Anton Bakalov, Andrew McCallum, Hanna Wallach, and David Mimno. Joint Conference on Digital Libraries (JCDL) 2012, Washington, DC. PDF
  • Database of NIH grants using machine-learned categories and graphical clustering Edmund M Talley, David Newman, David Mimno, Bruce W Herr II, Hanna M Wallach, Gully A P C Burns, A G Miriam Leenders and Andrew McCallum, Nature Methods, Volume 8(7), June 2011, pp. 443–444. HTML
  • Reconstructing Pompeian Households David Mimno. UAI, 2011, Barcelona, Spain. (selected for oral presentation) PDF
  • Bayesian Checking for Topic Models David Mimno, David Blei. EMNLP, 2011, Edinburgh, Scotland. (selected for oral presentation) PDF
  • Optimizing Semantic Coherence in Topic Models David Mimno, Hanna Wallach, Edmund Talley, Miriam Leenders, Andrew McCallum. EMNLP, 2011, Edinburgh, Scotland. (selected for oral presentation) PDF
  • Topic models provide a useful method for organizing large document collections into a small number of meaningful word clusters. In practice, however, many topics contain obvious semantic errors that may not reduce predictive power but do significantly weaken user confidence. We introduce a metric for detecting such errors and develop a completely unsupervised model that specifically optimizes it. We find that measuring the probability that lower-ranked words in a topic co-occur in documents with higher-ranked words outperforms all current methods for detecting a large class of low-quality topics.
  • Measuring Confidence in Temporal Topic Models with Posterior Predictive Checks David Mimno, David Blei. NIPS Workshop on Computational Social Science and the Wisdom of Crowds, 2010, Whistler, BC.
  • Rethinking LDA: Why Priors Matter Hanna Wallach, David Mimno and Andrew McCallum. NIPS, 2009, Vancouver, BC. PDF Supplementary Material
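The sparse stochastic inference paper above combines Gibbs sampling with stochastic variational inference: in the "local" step, topic assignments for one document are resampled while the global topic-word distributions are held fixed. The sketch below shows only that local step in its simplest form, ignoring the sparsity bookkeeping and variational updates that make the paper's method fast; the function name, the fixed `phi` matrix, and the single symmetric `alpha` are illustrative assumptions, not the paper's implementation.

```python
import random
from collections import Counter

def gibbs_local_step(doc, phi, alpha=0.1, n_sweeps=20, rng=None):
    """Resample topic assignments for one document's tokens.

    doc  : list of word ids
    phi  : K x V matrix of fixed topic-word probabilities (assumed given)
    alpha: symmetric Dirichlet prior on the document's topic proportions

    Each token's topic is drawn in proportion to
    (alpha + count of that topic among the document's other tokens)
    times the topic's probability of generating the token's word.
    """
    rng = rng or random.Random(0)          # fixed seed: illustrative only
    K = len(phi)
    z = [rng.randrange(K) for _ in doc]    # random initial assignments
    counts = Counter(z)                    # per-topic counts for this doc
    for _ in range(n_sweeps):
        for n, w in enumerate(doc):
            counts[z[n]] -= 1              # remove token n from its topic
            weights = [(alpha + counts[k]) * phi[k][w] for k in range(K)]
            r = rng.random() * sum(weights)
            acc = 0.0
            for k in range(K):             # draw from the unnormalized weights
                acc += weights[k]
                if r <= acc:
                    z[n] = k
                    break
            counts[z[n]] += 1              # add token n back under new topic
    return z
```

With sharply peaked topics the sampler behaves as expected: if topic 0 puts all its mass on word 0, every occurrence of word 0 ends up assigned to topic 0. In the full algorithm these sampled assignments would then drive a stochastic update of the global topics.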
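The semantic coherence paper above scores a topic by asking how often its lower-ranked top words appear in the same documents as its higher-ranked words. A minimal sketch of that co-document-frequency score, assuming documents are given as token lists and every top word occurs in at least one document (the function name and smoothing constant of 1 in the numerator are taken as illustrative):

```python
from math import log

def topic_coherence(top_words, documents):
    """Score a topic's ranked top words by document co-occurrence.

    For each pair (lower-ranked word v_m, higher-ranked word v_l), add
    log((D(v_m, v_l) + 1) / D(v_l)), where D counts how many documents
    contain all the given words. Scores near 0 mean the top words
    reliably co-occur; large negative scores flag incoherent topics.
    """
    doc_sets = [set(d) for d in documents]
    def doc_freq(*words):
        return sum(1 for s in doc_sets if all(w in s for w in words))
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            score += log((doc_freq(top_words[m], top_words[l]) + 1)
                         / doc_freq(top_words[l]))
    return score
```

For example, on the toy corpus `[["a","b"], ["a","b"], ["a","c"]]`, the pair `["a","b"]` scores 0.0 (the words always co-occur), while `["a","c"]` scores below zero, since "c" appears with "a" in only one of the three documents.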