Computationally Identifying Similar Books in Project Gutenberg

As one of the first digital libraries, Project Gutenberg has lived through a few generations of computers, digitization techniques, and textual infrastructures. It's not surprising, then, that the corpus is fairly messy. Early transcriptions of some electronic texts, hand-keyed using...

A Project Gutenberg Database for Text Mining

Project Gutenberg is a large store of public-domain electronic texts, one that has been around since the 1970s. Nearly everyone who has experimented with computational literary analysis has at some point used its electronic texts. Many digital humanists undoubtedly...

The Henry James Sentence: New Quantitative Approaches

The house had a name and a history; the old gentleman taking his tea would have been delighted to tell you these things: how it had been built under Edward the Sixth, had offered a night’s hospitality to the great...

A Macro-Etymological Analysis of The Canterbury Tales

Chaucer's Canterbury Tales exhibits one of the richest vocabularies of Middle English literature, a vocabulary that reveals influences from a number of native and foreign languages: Old English, French, Latin, Greek, and Hebrew, among others. While some of this foreign...

Probabilistic Detection of Character Voices in Fiction

In James Joyce's novel Ulysses, the school headmaster Mr. Deasy quotes Shakespeare in a lecture on financial responsibility to his employee Stephen Dedalus. “[W]hat does Shakespeare say?” he asks, “Put but money in thy purse” (Joyce 1986, 25). As Stephen...