National Geographic goes (or went – this article is from 2013) into the science of forensic linguistics, which uses computers to analyze things like word choice and sentence length to determine – with a high degree of accuracy – who wrote what, even if the author didn’t sign it. One of the science’s first high-profile victories was revealing that a first-time mystery author, Robert Galbraith of The Cuckoo’s Calling, was in fact Harry Potter creator JK Rowling:
With computers and sophisticated statistical analyses, researchers are mining all sorts of famous texts for clues about their authors. Perhaps more surprising: They’re also mining not-so-famous texts, like blogs, tweets, Facebook updates and even Amazon reviews for clues about people’s lifestyles and buying habits. The whole idea is so amusingly ironic, isn’t it? Writers choose words deliberately, to convey specific messages. But those same words, it turns out, carry personal information that we don’t realize we’re giving out.
“There’s a kind of fascination with the thought that a computer sleuth can discover things that are hidden there in the text. Things about the style of the writing that the reader can’t detect and the author can’t do anything about, a kind of signature or DNA or fingerprint of the way they write,” says Peter Millican of Oxford University, one of the experts consulted by the Sunday Times.
Cal Flyn, a reporter with the Sunday Times, sent email requests to Millican and to Patrick Juola, a computer scientist at Duquesne University in Pittsburgh. Flyn told them the hypothesis — that Galbraith was Rowling — and gave them the text of five books to test that hypothesis. Those books included Cuckoo, obviously, as well as a novel by Rowling called The Casual Vacancy. The other three were all, like Cuckoo, British crime novels: The St. Zita Society by Ruth Rendell, The Private Patient by P.D. James, and The Wire in the Blood by Val McDermid.
One of those tests, for example, compared all of the word pairings, or sets of adjacent words, in each book. “That’s better than individual words in a lot of ways because it captures not just what you’re talking about but also how you’re talking about it,” Juola says. This test could show, for example, the types of things an author describes as expensive: an expensive car, expensive clothes, expensive food, and so on. “It might be that this is a word that everyone uses, like expensive, but depending on what you’re focusing on, it [conveys] a different idea.”
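A minimal sketch of the word-pair idea in Python, not Juola's actual program: it counts adjacent word pairs and then lists what follows a given word. The function names, the tokenizing rule, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

def word_bigrams(text):
    """Count adjacent word pairs (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z']+", text.lower())
    # zip pairs each word with the one that follows it
    return Counter(zip(words, words[1:]))

def following(bigrams, word):
    """Return counts of the words that come right after `word`."""
    return Counter({b: n for (a, b), n in bigrams.items() if a == word})

# Invented sample text, not taken from any of the books
sample = "An expensive car and expensive clothes, but never expensive food."
pairs = word_bigrams(sample)
print(following(pairs, "expensive").most_common())
```

Note that this simple version pairs words across sentence boundaries; a more careful tool would probably split on sentences first.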
Juola also ran a test that searched for “character n-grams”, or sequences of adjacent characters. He focused on 4-grams, or four-letter sequences. For example, a search for the sequence “jump” would bring up not only jump, but jumps, jumped, and jumping. “That lets us look at concepts and related words without worrying about tense and conjugation,” he says.
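Character 4-grams are easy to sketch in Python. This is an illustration of the general technique under simple assumptions (lowercasing, a sliding window over every character including spaces), not the code used in the investigation; the function name and sample text are made up.

```python
from collections import Counter

def char_ngrams(text, n=4):
    """Count every run of n adjacent characters using a sliding window."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Invented sample: four different forms of the same verb
sample = "She jumps, he jumped, they were jumping; one jump each."
grams = char_ngrams(sample)
print(grams["jump"])  # prints 4: jumps, jumped, jumping, jump
```

Because the window ignores word boundaries, every inflected form contributes to the same "jump" count, which is the effect Juola describes.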
Those two tests turn up relatively rare words. But even a book’s most common words (words like a, and, of, the) leave a hidden signature. So Juola’s program also tallied the 100 most common words in each book and compared the small differences in frequency. One book might use the word “the” 6 percent of the time, while another uses it only 4 percent.
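One way to picture the common-word comparison is the Python sketch below. The article does not say how the frequency differences were scored, so the summed absolute difference used here is an assumption, as are the function names and the toy sentences.

```python
import re
from collections import Counter

def relative_frequencies(text):
    """Map each word to its share of all words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}

def common_word_distance(text_a, text_b, top_n=100):
    """Sum of absolute frequency gaps over the top_n most common words.

    The scoring rule is an assumption for illustration; 0.0 means the
    two texts use their common words at identical rates.
    """
    freq_a = relative_frequencies(text_a)
    freq_b = relative_frequencies(text_b)
    combined = Counter()
    for w in set(freq_a) | set(freq_b):
        combined[w] = freq_a.get(w, 0.0) + freq_b.get(w, 0.0)
    vocabulary = [w for w, _ in combined.most_common(top_n)]
    return sum(abs(freq_a.get(w, 0.0) - freq_b.get(w, 0.0)) for w in vocabulary)

# Toy texts: "the" is 40 percent of the first and absent from the second
print(common_word_distance("the cat and the dog", "a cat and a dog"))
```

With whole novels, a smaller distance between the unknown book and one candidate author's book than between the unknown book and the others would count as evidence for that candidate.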
Juola’s final test completely separates a word from its meaning, by sorting words simply by their length. What fraction of a book is made of three-letter words, or eight-letter words? These distributions are fairly similar from book to book, but statistical analyses can dig into the subtle differences. And this particular test “was very characteristically Rowling,” Juola says. “Word lengths was one of the strongest pieces of evidence that [Cuckoo] was Rowling.”
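The word-length test can be sketched in a few lines of Python. This only computes the distribution; the statistical comparison between books is left out, and the function name and sample sentence are invented for the example.

```python
import re
from collections import Counter

def word_length_distribution(text):
    """Return {word length: fraction of all words with that length}."""
    words = re.findall(r"[A-Za-z]+", text)
    counts = Counter(len(w) for w in words)
    total = sum(counts.values())
    return {length: counts[length] / total for length in sorted(counts)}

# Invented sample: six of its nine words have three letters
sample = "The cat sat on the mat with a hat"
for length, share in word_length_distribution(sample).items():
    print(length, round(share, 3))
```

Run on full novels, the resulting distributions could then be compared with a standard statistical test of the analyst's choosing.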
Meanwhile, across the pond, Peter Millican was running a parallel Rowling investigation. After getting Flyn’s email, Millican told her he needed more comparison data, so he ended up with an additional book from each of the four known authors (using Harry Potter and the Deathly Hallows as the second known Rowling book). He fed those eight books, plus Cuckoo, into his own linguistics software program, called Signature.
Signature includes a fancy statistical method called principal component analysis to compare all of the books on six features: word length, sentence length, paragraph length, letter frequency, punctuation frequency, and word usage.
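Principal component analysis boils several measurements per book down to a few axes along which the books differ most. The Python sketch below is not Signature's implementation: it finds only the first component, uses plain power iteration, and runs on hypothetical feature values (average word length, average sentence length) invented for the example.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered features
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by cov and renormalize
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    # Project each book onto the component to get one score per book
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Hypothetical features per book: [avg word length, avg sentence length]
books = [[4.1, 12.0], [4.2, 12.5], [4.8, 19.0], [4.9, 18.5]]
direction, scores = first_principal_component(books)
print([round(s, 2) for s in scores])
```

In this made-up data the first two books land on one side of the axis and the last two on the other. A real analysis would normally rescale the features first so that no single one dominates, and would use a tested numerical library rather than hand-rolled iteration.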
Another approach that can be quite definitive, Millican says, is a comparison of rare words. The classical example concerns the Federalist Papers, a series of essays written by Alexander Hamilton, James Madison, and John Jay during the creation of the U.S. Constitution. In 1963, researchers used word counts to determine the authorship of 12 of these essays that were written by either Madison or Hamilton. They found that Madison’s essays tended to use “whilst” and never “while”, and “on” rather than “upon”. Hamilton, in contrast, tended to use “while”, not “whilst”, and used “on” and “upon” at the same frequency. The 12 anonymous papers never used “while” and rarely used “upon”, pointing strongly to Madison as the author.
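The marker-word approach amounts to counting a handful of telltale words and scaling by text length. A rough Python sketch follows; the per-thousand-words scaling, the function name, and the sample sentence are assumptions for illustration and are not drawn from the 1963 study or from the Federalist Papers themselves.

```python
import re

def rate_per_thousand(text, markers):
    """Occurrences of each marker word per 1,000 words of text."""
    words = re.findall(r"[a-z]+", text.lower())
    total = len(words) or 1  # avoid dividing by zero on empty input
    return {m: 1000 * words.count(m) / total for m in markers}

# Invented sentence, not a quotation from any essay
sample = "Whilst the council met, the people relied on reason."
rates = rate_per_thousand(sample, ["while", "whilst", "on", "upon"])
print(rates)
```

A text whose rates for "whilst" and "on" are high while "while" and "upon" stay near zero would match the Madison pattern described above.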
Millican found a few potentially distinctive words in his Rowling investigation. The other authors tended to use the words “course” (as in, of course), “someone” and “realized” a bit more than Rowling did. But the difference wasn’t statistically significant enough for Millican to run with it. So, like Juola, he turned to the most common words. Millican pulled out the 500 most common words in each book, and then went through and manually removed the words that were subject-specific, such as “Harry”, “wand”, and “police”.
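The filtering step Millican describes, taking the most common words and then striking out subject-specific ones by hand, might look like this in Python. The function name, the tokenizing rule, and the sample sentence are assumptions; only the example exclusions come from the article.

```python
import re
from collections import Counter

def common_words(text, n=500, exclude=()):
    """Take the n most common words, then drop the hand-picked exclusions."""
    words = re.findall(r"[a-z']+", text.lower())
    top = [w for w, _ in Counter(words).most_common(n)]
    excluded = set(exclude)
    return [w for w in top if w not in excluded]

# Invented sample; the exclusion list mirrors the article's examples
sample = "Harry raised the wand and the police saw Harry and the wand"
kept = common_words(sample, exclude=["harry", "wand", "police"])
print(kept)
```

The exclusion list has to be built by a person who knows the books, since nothing in the word counts alone says that "wand" is about subject matter rather than style.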
Of all of the tests he can run with his program, Millican finds these word usage comparisons most compelling. “You end up with a graph, and on the graph it’s absolutely clear that Cuckoo’s Calling is lining up with Harry Potter. And it’s also clear that the Ruth Rendell books are close together, the Val McDermid books are close together, and so on,” he says. “It is identifying something objective that’s there. You can’t easily describe in English what it’s detecting, but it’s clearly detecting a similarity.”