Thursday, December 26, 2013

Big Data Becomes a Mirror

The New York Times
 
BOOKS OF THE TIMES


‘Uncharted,’ by Erez Aiden and Jean-Baptiste Michel

Why do English speakers say “drove” rather than “drived”?

UNCHARTED

Big Data as a Lens on Human Culture
By Erez Aiden and Jean-Baptiste Michel
Illustrated. 280 pages. Riverhead Books. $27.95.
As graduate students at the Harvard Program for Evolutionary Dynamics about eight years ago, Erez Aiden and Jean-Baptiste Michel pondered the matter and decided that something like natural selection might be at work. In English, the “-ed” past-tense ending of Proto-Germanic, like a superior life form, drove out the Proto-Indo-European system of indicating tenses by vowel changes. Only the small class of verbs we know as irregular managed to resist.
To test this evolutionary premise, Mr. Aiden and Mr. Michel wound up inventing something they call culturomics, the use of huge amounts of digital information to track changes in language, culture and history. Their quest is the subject of “Uncharted: Big Data as a Lens on Human Culture,” an entertaining tour of the authors’ big-data adventure, whose implications they wildly oversell.
To tackle the drived/drove question, Mr. Aiden and Mr. Michel assigned two undergraduates to read every textbook on historical English grammar, compile a list of irregular verbs and follow their fortunes through the centuries. The students turned up 177 irregular verbs in Old English, a number that declined to 145 in Middle English (the language of Chaucer) and to 98 in modern English. Of the original Old English irregulars, the 12 most frequently used verbs stayed irregular, while 11 out of the 12 least frequently used verbs made the changeover. Only “slink” held the line.
“The data had spoken,” the authors write. “Something akin to natural selection was influencing human culture, leaving its fingerprints among the verbs. Usage frequency was having an extraordinarily strong effect on verb survival, making the difference between the verbs that were mourn/mourned and the verbs that were fit/fit to survive.”
Invigorated by the great verb chase, Mr. Aiden and Mr. Michel went hunting for bigger game. Given a large enough storehouse of words and a fine filter, would it be possible to see cultural change at the micro level, to follow minute fluctuations in human thought processes and activities? Tiny factoids, multiplied endlessly, might assume imposing dimensions.
By chance, Google Books, the megaproject to digitize every page of every book ever printed — all 130 million of them — was starting to roll just as the authors were looking for their next target of inquiry.
Meetings were held, deals were struck and the authors got to it. In 2010, working with Google, they perfected the Ngram Viewer, which takes its name from the computer-science term for a word or phrase. This “robot historian,” as they call it, can search the 30 million volumes already digitized by Google Books and instantly generate a usage-frequency timeline for any word, phrase, date or name, a sort of stock-market graph illustrating the ups and downs of cultural shares over time.
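The kind of usage-frequency timeline described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Ngram Viewer's actual implementation: the function names, the whitespace tokenization and the two-year toy corpus are all assumptions made up for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive words in tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequency_timeline(corpus, phrase):
    """Map each year to the phrase's share of all n-grams of the same length.

    corpus is a hypothetical dict of year -> list of text strings.
    """
    target = phrase.lower().split()
    n = len(target)
    key = " ".join(target)
    timeline = {}
    for year, texts in sorted(corpus.items()):
        counts = Counter()
        for text in texts:
            # Naive tokenization: lowercase and split on whitespace.
            counts.update(ngrams(text.lower().split(), n))
        total = sum(counts.values())
        timeline[year] = counts[key] / total if total else 0.0
    return timeline


# Invented toy corpus, for illustration only.
toy_corpus = {
    1900: ["the cat sat", "the dog ran"],
    1901: ["the cat ran", "a cat sat"],
}

timeline = frequency_timeline(toy_corpus, "cat")
for year, share in timeline.items():
    print(year, round(share, 3))
```

Dividing by the total number of n-grams in each year, rather than reporting raw counts, is what makes years with very different numbers of books comparable on one graph.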
Mr. Aiden, now director of the Center for Genome Architecture at Rice University, and Mr. Michel, who went on to start the data-science company Quantified Labs, play the Ngram Viewer (books.google.com/ngrams) like a Wurlitzer.
They graph, to take one example, the astounding career path of “chortle,” coined by Lewis Carroll in “Jabberwocky,” which has left its siblings “galumphing” and “frumious” in the dust. They tease out the most-mentioned names of people born in each year from 1800 to 1949, with some surprising results. Among those born in 1871, the name that appeared most often was Cordell Hull, secretary of state under Franklin D. Roosevelt, not Orville Wright. They also come up with a sort of fame speedometer. The ngram data show that people are becoming famous at a younger age, and faster, than they did two generations ago.
Fame is much bigger, too. At one point, the authors write, Bill Clinton’s ngram “was almost exactly as frequent as the word lettuce, twice as frequent as the word cucumber, and about half as frequent as the word tomato. He completely outclassed second-tier vegetables like turnip and cauliflower.”
The momentous term culturomics suggests the authors’ ambitious view of what can seem like an intellectual parlor game. The magazine Mother Jones, they cheerfully admit, called the Ngram Viewer “possibly the greatest time-waster in the history of the Internet.” But the authors argue that just as Galileo’s telescope opened new, previously unimagined worlds, the powerful lens of culturomics “is going to change the humanities, transform the social sciences and renegotiate the relationship between the world of commerce and the ivory tower.”
Judging by the evidence on offer in “Uncharted,” the claim seems a tad boastful. Yes, it is fascinating to know that “donut” gained traction as a variant spelling soon after Dunkin’ Donuts was founded in 1950. The authors serve up many a tasty morsel like this: lots of fun but less than earthshaking without elaboration.
The Ngram Viewer delivers the what and the when but not the why. Take the case of specific years. All years get attention as they approach, peak when they arrive, then taper off as succeeding years occupy the attention of the public. Mentions of the year 1872 had declined by half in 1896, a slow fade that took 23 years. The year 1973 completed the same trajectory in less than half the time.
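The "declined by half" measurement can be made concrete with a small Python sketch. It assumes one plausible reading of the method: find the year in which mentions peak, then the first later year in which they fall to half that peak or below. The function name and the numbers are invented for illustration and are not taken from the book.

```python
def years_to_half(series):
    """Return (peak_year, half_year, lag) for a dict of year -> frequency.

    half_year is the first year after the peak whose frequency is at or
    below half the peak value; it is None if that never happens.
    """
    peak_year = max(series, key=series.get)
    half = series[peak_year] / 2
    for year in sorted(y for y in series if y > peak_year):
        if series[year] <= half:
            return peak_year, year, year - peak_year
    return peak_year, None, None


# Made-up frequencies for mentions of a single year, not real ngram data.
mentions = {
    1870: 1.0,
    1871: 2.0,
    1872: 5.0,
    1873: 8.0,
    1874: 6.0,
    1875: 4.5,
    1876: 3.9,
}

peak_year, half_year, lag = years_to_half(mentions)
print(peak_year, half_year, lag)
```

As the review notes, a calculation like this yields only the what and the when; nothing in it explains why the fade has sped up.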
“What caused that change?” the authors ask. “We don’t know. For now, all we have are the naked correlations: what we uncover when we look at collective memory through the digital lens of our new scope.” Someone else is going to have to do the heavy lifting.
“Uncharted” began life as an article in Science magazine in December 2010, and the authors have huffed and puffed to inflate it to book length. They digress at every turn and, to add weight at the back end, they have appended nearly 50 ngram searches.
They also overexplain. Most readers do not need a background lesson on Nazi policies toward the arts to understand why, in German books published between 1933 and 1945, the graph for Marc Chagall dips like a downward-speeding roller coaster.
This may be potato chips for intellectuals, but it is irresistible. You cannot eat just one ngram.
