By Louis DiPietro
Fresh Cornell research in the emerging field of digital humanities is helping improve the computational tools and methods used to measure bias in text.
In a paper titled “Bad Seeds: Evaluating Lexical Methods for Bias Measurement,” Maria Antoniak, a Ph.D. candidate in the Ann S. Bowers College of Computing and Information Science and a scholar in the digital humanities, finds that the word lists packaged and shared among researchers to measure bias in online texts often carry words, or “seeds,” with baked-in biases and stereotypes that can skew findings.
“We need to know what biases are coded in models and datasets. What our paper does is step back and turn a critical lens on the measurement tools themselves,” Antoniak said. “What we find is there can be biases there as well. Even the tools we use are designed in particular ways.”
“Bad Seeds,” coauthored with her advisor, David Mimno, associate professor in the Department of Information Science, was presented in August at the joint conference of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing.

“The seeds can contain biases and stereotypes on their own, especially if packaged for future researchers to use,” she said. “Some seeds aren’t documented or are found deep in the code. If you just use the tool, you’d never know.”
In the digital humanities, and in the broader field of natural language processing (NLP), scholars bring computing power to bear on written language, mining thousands of digitized volumes and millions of words to find patterns that inform a wide range of inquiry. It’s through this kind of computational analysis that digital humanities and NLP scholars at Cornell are learning more about gender bias in sports journalism, the impeccable skills of Ancient Greek authors at imitating their predecessors, the best approaches to soothing a person reaching out to a crisis text line, and the culture that informed British fiction in the late 19th and early 20th centuries.
In past research, Antoniak mined online birth stories and learned about new parents’ feelings of powerlessness in the delivery room. Most recently, in a paper published at this month’s ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW), she and Cornell co-authors analyzed an online book review community to understand how users refine and redefine literary genres.
This type of text analysis can also be used to measure bias across an entire digital library, or corpus, whether that’s all of Wikipedia, say, or the collected works of Shakespeare. To do that, researchers use online lexicons: banks of words and seed terms. These lexicons are not always vetted; some are crowd-sourced, some are hand-curated by researchers, and some are pulled from prior research.
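To make the mechanics concrete, here is a minimal sketch of one common way seed terms are used to measure bias: compare a target word’s embedding similarity to two opposing seed sets. The seed lists, the “housework” target, and the toy vectors below are illustrative placeholders, not material from the paper.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def bias_score(target: str, seeds_a: list[str], seeds_b: list[str],
               emb: dict) -> float:
    """Mean similarity to seed set A minus mean similarity to seed set B.
    Positive scores associate the target word with A, negative with B."""
    sim_a = np.mean([cosine(emb[target], emb[s]) for s in seeds_a if s in emb])
    sim_b = np.mean([cosine(emb[target], emb[s]) for s in seeds_b if s in emb])
    return float(sim_a - sim_b)

# Hypothetical seed sets, in the style of lexicons shared between studies.
FEMALE_SEEDS = ["she", "her", "woman", "mom"]  # note the seed "mom"
MALE_SEEDS = ["he", "him", "man", "dad"]

# Toy stand-in vectors so the sketch runs end to end; a real study would
# load pretrained embeddings (e.g., gensim KeyedVectors index the same way).
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50)
       for w in FEMALE_SEEDS + MALE_SEEDS + ["housework", "office"]}

print(bias_score("housework", FEMALE_SEEDS, MALE_SEEDS, emb))
```

Because the whole measurement reduces to averages over the seed lists, any single confounded seed feeds directly into the final score, which is why the composition of these lexicons matters so much.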
With “Bad Seeds,” Antoniak builds on her previous work evaluating methods in computational social science and the digital humanities. Her motivation to investigate lexicons for bias came after she saw wonky results in her own research while using an online lexicon of seed terms.
“I trusted the words, which came from trusted authors, but when I looked at the lexicon, it wasn’t what I expected,” Antoniak said. “The original researchers may have done a fabulous job in curating their seeds for their datasets. That doesn’t mean you can just pick it up and apply it to the next case.”
As explained in “Bad Seeds,” the seeds used for bias measurement can themselves carry cultural and cognitive biases. For instance, the seed term “mom” is both a gendered word and a domestic-work word, so its presence in a text analysis exploring gender in domestic work would skew results female. The fix, Antoniak said, is simple: cull the word “mom” from the lexicon.
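Continuing the hypothetical sketch above, culling the confounded seed is a one-line change, and re-scoring shows how much that single word was driving the measurement; the words and scores here remain illustrative.

```python
# Cull the confounding seed and re-score the same target word.
cleaned_female_seeds = [s for s in FEMALE_SEEDS if s != "mom"]

before = bias_score("housework", FEMALE_SEEDS, MALE_SEEDS, emb)
after = bias_score("housework", cleaned_female_seeds, MALE_SEEDS, emb)
print(f"score with 'mom': {before:+.3f}  without: {after:+.3f}")
```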
“The goal isn’t to undermine findings but to help researchers think through potential risks of seed sets used for bias detection,” she said. “Investigate them and test them for yourself to ensure results are trustworthy.”
As part of her findings, Antoniak recommends that digital humanities and NLP researchers trace the origins of the seed sets and features they use, manually examine and test them, and document all seeds along with the rationale for choosing them.
This research is supported by the National Science Foundation.