One of the computational linguists who applied forensic text analysis to JK Rowling’s books to uncover her as the author of The Cuckoo’s Calling describes the science behind his investigation in a post for Language Log.
One of those academics, computer scientist Patrick Juola, wrote a piece for Language Log to describe how this sort of text analysis works.
Of the 11 sections of Cuckoo, six were closest (in distribution of word lengths) to Rowling, five to James. No one else got a mention.
Another feature I used were the 100 most common words. What percentage of the document were “the,” what were “of,” and so on. Again, a rich data set that is easy to extract by computer. Using an otherwise similar analysis (including cosine distance again), four of the sections were Rowling-like, four were McDermid-like, and the other three split between James and Rendell.
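To give a feel for the common-words comparison, here is a minimal Python sketch. It is not Juola's software: the function names (`tokenize`, `word_frequencies`, `cosine_distance`, `closest_author`) are made up for illustration, and building the vocabulary from the pooled texts is an assumption, since the post doesn't say how the 100 words were chosen.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase words (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text):
    """Return each word's share of the text, e.g. what fraction is 'the'."""
    words = tokenize(text)
    if not words:
        return {}
    counts = Counter(words)
    return {word: count / len(words) for word, count in counts.items()}


def cosine_distance(p, q, vocabulary):
    """1 minus the cosine similarity of two profiles over a shared vocabulary."""
    a = [p.get(word, 0.0) for word in vocabulary]
    b = [q.get(word, 0.0) for word in vocabulary]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # no overlap to measure, treat as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def closest_author(section, author_samples, top_n=100):
    """Pick the author whose sample is nearest to the section.

    Assumption: the vocabulary is the top_n most common words across
    the section and all author samples pooled together.
    """
    pooled = Counter(tokenize(section))
    for sample in author_samples.values():
        pooled.update(tokenize(sample))
    vocabulary = [word for word, _ in pooled.most_common(top_n)]

    section_profile = word_frequencies(section)
    distances = {
        author: cosine_distance(section_profile, word_frequencies(sample), vocabulary)
        for author, sample in author_samples.items()
    }
    return min(distances, key=distances.get), distances
```

Calling `closest_author(section_text, {"Rowling": ..., "McDermid": ...})` on each section in turn, with real writing samples in place of the dots, would give the kind of per-section tally described above.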
I ran two tests based on authorial vocabulary. The first was on the distribution of character 4-grams, groups of four adjacent characters. These could be words, parts of words (like four letters “nsid” that would be inside the word “inside”) or even parts of two words (like the four letters “n th” as part of the phrase “in the”)… I also ran on word bigrams, pairs of adjacent words, again a feature with a good track record.
The character 4-grams showed a preference for McDermid, with 8 sections close to her. Three were Rowling-like, and no one else was mentioned. The word pairs, on the other hand, were clearly Rowling-like (9 sections, against 2 by McDermid, no one else mentioned).
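If the two vocabulary features sound abstract, they are simple to count. The following Python sketch is only an illustration under my own assumptions (the function names `char_ngrams` and `word_bigrams` are invented, and words are split on whitespace); it is not the code used in the actual analysis.

```python
from collections import Counter


def char_ngrams(text, n=4):
    """Count every run of n adjacent characters, spaces included."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_bigrams(text):
    """Count every pair of adjacent words (lowercased, split on whitespace)."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))


# The examples from the quote: "nsid" sits inside "inside",
# and "n th" straddles the two words of "in the".
print(char_ngrams("inside"))
print(char_ngrams("in the"))
print(word_bigrams("in the house"))
```

The resulting counts can be turned into proportions and compared between a section and each candidate author with a distance measure, in the same way as the other features.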
If you want to play around with the technology behind Juola’s authorship attribution work, or that of Peter Millican (the other academic contacted by the press to do an analysis), you can download both from the net.
Rumours that Mind Hacks is actually written by Natalie Portman will be strictly denied.
Link to Juola’s post on Language Log.