The arXiv Algorithm That 'Learned' to ID Good, Bad Papers

November 1, 2016

This is a fresh and fascinating angle on Cornell's arXiv that dovetails with some of the natural language processing work happening here. In this piece, Nautilus details how the online repository's algorithm was initially designed to help categorize incoming submissions but evolved to distinguish "good" papers from "bad" ones based on the authors' language use.

From the Nautilus piece: "Outlier papers, the ones that got rejected, didn’t line up with the usual language norms of any scientific discipline. The deviation might have been obvious ... Or it might have been subtle: the wrong distribution of seemingly content-less words like 'and,' 'or,' 'it,' or 'that.'"
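
To make the idea in that quote concrete, here is a minimal Python sketch of how one might compare the distribution of function words in a submission against a reference profile. It is an illustration only, not arXiv's actual classifier: the word list, the add-one smoothing, and the choice of Jensen-Shannon divergence as the distance measure are all assumptions made for this example.

```python
import math
import re
from collections import Counter

# Hypothetical word list: the four words quoted above plus a few
# other common function words (an assumption for this sketch).
FUNCTION_WORDS = ("and", "or", "it", "that", "the", "of", "to", "in")


def function_word_profile(text):
    """Return each function word's share of all function-word hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t in FUNCTION_WORDS)
    total = sum(counts.values())
    # Add-one smoothing keeps every probability above zero.
    denom = total + len(FUNCTION_WORDS)
    return {w: (counts[w] + 1) / denom for w in FUNCTION_WORDS}


def kl_divergence(p, q):
    """Kullback-Leibler divergence between two profiles (natural log)."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in FUNCTION_WORDS)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by ln 2."""
    m = {w: 0.5 * (p[w] + q[w]) for w in FUNCTION_WORDS}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

In this sketch, a paper whose profile has a large divergence from the reference profile of its claimed discipline would be flagged as a possible outlier. Where to set that threshold is left open here, since the Nautilus piece does not specify one.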

Now in its 25th year, arXiv, created by Info Sci's Paul Ginsparg, houses roughly 1.2 million academic papers.