Million-book digitized libraries – representing centuries of knowledge – lie just beyond our grasp, unreachable by today’s inadequate machine-learning methods or ensnarled in copyright limbo. How do we access such vast, rich libraries, and what can we learn from them? David Mimno believes the answer lies in better text-mining tools. The National Science Foundation appears to think so, too, recently awarding the Cornell Information Science professor its prestigious CAREER grant to explore machine-learning methods that will broaden our understanding of data analytics and possibly unlock massive troves of data.

In the last decade, Mimno notes, Google and the Internet Archive have digitized some 14 million volumes from academic libraries.

“Students and the general public should now have access to hundreds of billions of words and millions of images representing centuries of technology, history and culture,” he said. “But technical challenges have prevented wider access.”

For starters, the collection is massive, with two-thirds of it limited by copyright. Further, today’s text-mining tools struggle to make sense of the text and, more troublesome still, of image data such as photographs, illustrations and maps.

“One of the most important intellectual resources now in existence will continue to sit idle until we solve difficult computational problems in simple, robust, privacy-preserving data mining,” he said. “Treating millions of books as a coherent data set rather than an assortment of individual works will lead to a massive expansion in our ability to measure cultural phenomena.”

Applying computation to the humanities and social sciences in this way is one of Mimno’s primary research interests. In the past, for instance, he has analyzed words within literature collections – like Danish folk tales – to better shape our definition of what “classic” literature truly is.

“Computation approaches to digital humanities means calling into question your presumptions,” he said. “Why are certain stories put in certain categories, and where else could they have gone? The larger question is, why do stories stick?”

With the NSF research, the impact of studying large, digitized libraries is twofold: users will gain access to untapped knowledge of history, science and technology, and students and the public will be introduced to machine learning. The latter goal stems from a more general one for Mimno: to reconnect the humanities with other forms of scholarship. In the 19th century, it wasn’t uncommon for, say, a mathematician to write a book on Sanskrit, he said.

“At some point, that relationship between scientists and humanists broke down, and scholarship became more specialized, to the point where mathematical minds and literary minds diverged. That’s an assumption I would like to break down,” he said. “The red herring is treating digital humanities as an entirely new thing – with new ways of thinking, publishing, new ways of being a scholar. It’s a lot more likely to be successful and useful if it exists in the world we have now.”