By Ben Zimmer
The literary world is still abuzz over the revelation by London’s Sunday Times that J.K. Rowling of “Harry Potter” fame secretly wrote the well-received crime novel “The Cuckoo’s Calling” under the pen name Robert Galbraith. In chasing the scoop, the reporters called upon two experts in the field of authorship attribution to determine if “Galbraith” was really Rowling. The experts ran the texts through software programs designed to spot stylistic similarities, and the results were compelling enough for the Times to confront Rowling, who confessed to the pseudonymous work.
In the past, scholars have carried out such literary sleuthing on everything from The Federalist Papers to the 1996 political novel “Primary Colors,” revealed to be the work of Time columnist Joe Klein. This time around, the analysis of “The Cuckoo’s Calling” was done remarkably quickly, thanks to the automated methods used on the texts. But neither of the experts commissioned by the Times turned up a “smoking gun” that clearly gave away Rowling’s identity. Instead, they could only determine that it was more likely that Rowling was “Galbraith” than some other novelists. Their high-profile investigation provides some insight into the strengths and limitations of computerized authorship analysis, which has applications not just in the literary world but in legal inquiries as well.
In pursuing the Rowling bombshell, freelance writer Cal Flyn, who worked with Times arts editor Richard Brooks on the story, contacted two academics who have developed software specifically to examine questions of authorship: Peter Millican, who teaches philosophy and computing at Oxford University, and Patrick Juola, a computer science professor at Duquesne University in Pittsburgh. Flyn provided them with machine-readable texts of “The Cuckoo’s Calling” along with Rowling’s previous novel, “The Casual Vacancy,” and novels by three British women who specialize in crime fiction: Ruth Rendell, P.D. James, and Val McDermid.
Millican’s program, known as Signature, and Juola’s Java Graphical Authorship Attribution Program (JGAAP for short), didn’t take much time to yield an answer: “Cuckoo” was stylistically more similar to “The Casual Vacancy” than it was to the work of any of the three other novelists. Millican requested an additional book by each of the writers, and he found that Rowling’s “Harry Potter and the Deathly Hallows,” despite being in a genre far removed from detective fiction, came in second place, ahead of the six non-Rowling novels he analyzed.
I asked Juola to sketch out his research findings for a guest post on the linguistics blog Language Log, where I am a contributor. His post offers a fascinating glimpse into the nuts and bolts of “forensic stylometry,” the machine-based method of extracting features from different texts and calculating their similarities. Juola fed the texts into JGAAP and ran them through four different tests. He looked at the distribution of word lengths in each book, an easy way to generate potentially useful data. He also looked at the distribution of the one hundred most commonly occurring words in the language, which mostly consist of lowly “function words” like prepositions, conjunctions, and articles. Even if you are trying to mask your usual writing style by choosing different vocabulary, it’s hard to fake your typical palette of function words.
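For readers curious what those first two tests look like in practice, here is a minimal Python sketch. It is not JGAAP's code: the tokenizer, the function names, and the ten-word `FUNCTION_WORDS` list are all assumptions made for illustration (a real analysis would use something closer to the hundred most frequent words). It simply turns a text into two numeric profiles, one for word lengths and one for function-word frequencies.

```python
import re
from collections import Counter

# Hypothetical stand-in for a list of the ~100 most common English words.
FUNCTION_WORDS = ["the", "of", "and", "a", "to", "in", "that", "it", "with", "but"]


def tokenize(text):
    """Lowercase the text and pull out words, keeping internal apostrophes."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def word_length_distribution(text, max_len=15):
    """Return the share of words of length 1..max_len.

    Words longer than max_len are counted in the last bucket.
    """
    words = tokenize(text)
    total = len(words)
    counts = Counter(min(len(w), max_len) for w in words)
    return [counts[n] / total if total else 0.0 for n in range(1, max_len + 1)]


def function_word_profile(text, vocabulary=FUNCTION_WORDS):
    """Return each vocabulary word's share of all words in the text."""
    words = tokenize(text)
    total = len(words)
    counts = Counter(words)
    return [counts[w] / total if total else 0.0 for w in vocabulary]


if __name__ == "__main__":
    sample = "The cat sat on the mat and it purred"
    print(word_length_distribution(sample))
    print(function_word_profile(sample))
```

Two books by the same author would be expected to produce profiles that sit closer together than books by different authors, though how "closeness" is measured is a separate choice that this sketch leaves open.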
Two other tests that Juola conducted focused on the authors’ word selection. One used a feature known as “character 4-grams,” which simply refers to every string of four adjacent characters in the text. Though that would seem like a rather “dumb” feature to examine, since it doesn’t even take into account word boundaries, some recent studies have shown that it can be used to identify authorship with surprising accuracy. Finally, the program looked at pairs of adjacent words, and on this metric “The Cuckoo’s Calling” proved to be far more similar to Rowling’s known work than the other tested novels.
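The other two features can be sketched just as briefly. The Python below is an illustration under stated assumptions, not a reproduction of Juola's analysis: it counts character 4-grams and adjacent word pairs, then compares two texts with cosine similarity. The choice of cosine similarity and every function name here are assumptions; the column does not say which distance measure JGAAP applied.

```python
import math
import re
from collections import Counter


def char_ngrams(text, n=4):
    """Count every run of n adjacent characters, ignoring word boundaries."""
    normalized = " ".join(text.lower().split())  # collapse runs of whitespace
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def word_bigrams(text):
    """Count every pair of adjacent words."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(zip(words, words[1:]))


def cosine_similarity(a, b):
    """Cosine similarity between two count profiles (1.0 means same direction)."""
    dot = sum(count * b[key] for key, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    disputed = "He said that it was not the case at all"
    candidate = "She said that it was hardly the case"
    print(cosine_similarity(char_ngrams(disputed), char_ngrams(candidate)))
    print(cosine_similarity(word_bigrams(disputed), word_bigrams(candidate)))
```

In an attribution setting, the disputed text would be scored against each candidate author's known work, and the highest-scoring candidate would be the most likely match among those tested, which is a relative ranking rather than proof.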
So what did it all prove? Both experts were careful not to make grandiose claims. “Nothing in the analysis constituted ‘proof’ of Rowling’s authorship,” Juola wrote in the Language Log post. “It was at best ‘suggestive’ or perhaps ‘indicative.’” Millican, for his part, told the BBC that “it’s not like fingerprinting” because “texts are too individual.” Millican also said he wasn’t able to pinpoint any particular phrases that were distinct to Rowling. “I’d have need[ed] a lot more text, probably, and a lot more time to do that,” he said.
Still, with the time and text allotted to them, Millican and Juola had found enough suggestive similarities for the Times to reveal Rowling as the hidden author – or to “rumble” her, in the British slang used in the newspaper’s exposé. The evidence might not have stood up in a court of law, but the experts weren’t facing the same burden of proof that they might in a criminal case.
In the legal sphere, differing schools of thought on authorship attribution can get rather testy. Some purveyors of “forensic stylistics” contend that no software is needed to determine if a text was written by a particular author, instead relying on a connoisseur’s eye to spot telltale features. But that approach has been giving way to the computer-driven methods favored by Millican and Juola. The uneasy shift was evident at a workshop on authorship last October hosted by Brooklyn Law School (where I first met Juola). The workshop’s organizer, law professor Lawrence M. Solan, framed the tension as one of “intuition versus algorithm,” but he argued that the computational and stylistic approaches can fruitfully converge.
In “rumbling” Rowling, the Times reporters fashioned their own convergence. After an anonymous tip via Twitter, they relied on their intuition to look for clues in “The Cuckoo’s Calling,” picking up on the use of Latin phrases and descriptions of women’s clothing that seemed Rowlingesque. But then they turned to algorithmically inclined experts for corroboration. That’s the kind of detective work that Cormoran Strike, the private investigator that Rowling-as-Galbraith created, could surely appreciate.
Ben Zimmer, a linguist and lexicographer, is executive producer of Vocabulary.com. His column “The Word on the Street” appears weekly in the Review section.
Update: And now the identity of the anonymous Twitter tipster who set off the Times’s investigation has been revealed. According to the Associated Press, it was Judith Callegari, who heard the information from her best friend’s husband, Chris Gossage, a partner at the entertainment law firm Russells. In a statement, Russells has “apologized unreservedly” to Rowling.