By Louis DiPietro

Trained on vast amounts of mined online text, artificial intelligence (AI) now knows human language well enough to auto-fill our sentences when drafting emails, power chatbots to help online customers, and pen prose that could pass for a human’s. 

Cornell researchers want to know what insights AI can unearth when applied to the humanities, and they are equipping scholars in those fields with the machine learning skills to find out.

Backed by a recently announced $350,000 grant from the National Endowment for the Humanities, Cornell researchers in the field of natural language processing aim to inform, empower, and inspire humanists to apply state-of-the-art machine learning tools to texts written by humanity across millennia.

BERT for Humanists – a reference to the large language model (LLM) BERT, introduced in 2018 – is co-directed by David Mimno and Matthew Wilkens, both associate professors of information science in the Cornell Ann S. Bowers College of Computing and Information Science and scholars in the field of natural language processing, a subfield of computer science, artificial intelligence, and linguistics. Melanie Walsh, a former postdoctoral associate in information science at Cornell and currently an assistant teaching professor in the Information School at the University of Washington, is also co-director.

“When we submitted this grant application in 2020, we had to explain to people what a large language model is,” Mimno said. “Large language models have gone mainstream in a big way,” he added, noting newsworthy LLMs like LaMDA and GPT-3. “So our question is, can we use these models, which seem to be able to say so much, to help us study culture both in detail and at scale?”

Mimno, Walsh, and Maria Antoniak Ph.D. ’22 first launched BERT for Humanists in 2021 with a separate grant from the National Endowment for the Humanities, and an ensuing series of well-attended workshops drew scholars from around the world. The Cornell team’s tutorials and other resources were even nominated for an award in the digital humanities field. This latest grant builds on the team’s educational outreach efforts so far.

Mimno and Wilkens – both of whom use computational methods to transform vast online libraries and countless books into data – underscore the major research potential of LLMs for fields traditionally outside of tech.

Wilkens notes that researchers in fields like literature, history, and culture have been experimenting with machine learning for decades, but they face barriers.

“For one thing, the kinds of writing that interest humanists tend to be tricky. They're ambiguous, or archaic, or just long and complex,” Wilkens said. “A lot of existing tools are also hard for most humanists to use, since they are built by and for computer scientists.”

But AI-powered LLMs perform better and can be used and customized out of the box, thus lowering barriers for non-experts, he said. 

“We can use AI to pick out the characters in thousands of novels, so that we can study how those characters interact. Or we can use it to find place names in historical documents in order to map migration and trade interactions,” Wilkens said. “AI can help us measure how literary genres change over time and detect moments of especially rapid cultural evolution. We want to help our fellow humanists explore these opportunities, both by showing what's possible and by providing the tools and training to do it.”

Along with Mimno, Wilkens, and Walsh, the BERT for Humanists project team includes Rosamond Thalken, lead developer and a Cornell doctoral student in the field of information science.

Louis DiPietro is a writer for the Cornell Ann S. Bowers College of Computing and Information Science.