Project Gutenberg is a large store of public-domain electronic texts, one that has been around since the 1970s. Nearly everyone who has experimented with computational literary analysis has at some point used its electronic texts. Many digital humanists undoubtedly share my frustrations with it: its interface is clunky, its metadata is incomplete, and it's not very friendly to computational text extraction. Inspired by David McClure's textual databases, used at the Stanford Literary Lab, I decided to fix this by creating a structured database of Project Gutenberg's corpus and augmenting it as much as possible with publicly available book data. This database contains the complete cleaned text of each work, all its associated metadata, and additional metadata derived from GITenberg and Wikipedia. Here's an example entry, Dickens's novel A Tale of Two Cities:
The first few fields contain Project Gutenberg's own metadata about the book: Dickens's dates of birth and death, for instance. Project Gutenberg also provides Library of Congress classification and subject data: PR (British Literature), as well as these subjects:
Already you can see that this might be useful for corpus generation. If you're interested in, say, the properties of novels that deal with lookalikes, you can easily build a corpus using these subject headings, and use those texts for your analyses.
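A corpus query of that kind is only a one-liner in pandas. Here's a minimal sketch of the idea; the tiny DataFrame and the `lc_subjects` column name are illustrative stand-ins for the real database, not its actual schema:

```python
import pandas as pd

# Illustrative slice of the database: one row per text, with its
# Library of Congress subject headings joined into a single string.
# (Column name and data are hypothetical examples.)
df = pd.DataFrame({
    "title": ["A Tale of Two Cities", "The Prince and the Pauper", "Bleak House"],
    "lc_subjects": ["Lookalikes -- Fiction; France -- History -- Fiction",
                    "Lookalikes -- Fiction; Princes -- Fiction",
                    "Orphans -- Fiction"],
})

# Build a sub-corpus of every text tagged with a "Lookalikes" subject.
lookalikes = df[df.lc_subjects.str.contains("Lookalikes")]
print(lookalikes.title.tolist())
```

The resulting frame's index can then be used to pull the corresponding full texts for analysis.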
To Project Gutenberg's metadata, I've added metadata from GITenberg. GITenberg, like my project Git-Lit, is an initiative to make editable GitHub repositories for each Project Gutenberg book. They've also enhanced the metadata for these works. Their edition of A Tale of Two Cities, for instance, contains a nice ASCIIDOC version of the book which anyone can edit in the browser, and a Creative Commons-licensed cover designed as part of the Recovering the Classics project. All of this data, as well as a link to the GITenberg repository itself, is included in the database, along with GITenberg's version of the text.
The next few fields come from Wikipedia. I queried DBpedia for books with titles and authors similar to those in Project Gutenberg, and it returned about 1,800 matches. The wp_info field here contains the complete structured set of data about the book, which I then parsed into the more useful individual fields prefixed with wp_. This includes, for instance, wp_subjects, the categories to which the book's Wikipedia page belongs. For A Tale of Two Cities, these are:
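The flattening step from wp_info into the wp_-prefixed fields can be sketched like this. The property names inside the dictionaries ("subject", "literaryGenre") are illustrative examples of DBpedia-style properties, not necessarily the exact keys in my data:

```python
import pandas as pd

# Hypothetical wp_info payloads: one dictionary of structured DBpedia
# properties per matched book (keys shown here are examples).
records = [
    {"title": "A Tale of Two Cities",
     "wp_info": {"subject": ["Novels_set_in_London", "1859_novels"],
                 "literaryGenre": "Historical novel"}},
]
df = pd.DataFrame(records)

# Flatten selected properties into wp_-prefixed columns.
df["wp_subjects"] = df.wp_info.apply(lambda d: d.get("subject", []))
df["wp_genre"] = df.wp_info.apply(lambda d: d.get("literaryGenre"))
print(df[["title", "wp_subjects", "wp_genre"]])
```

Keeping the raw wp_info around alongside the parsed columns means new fields can be extracted later without re-querying DBpedia.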
The potential for corpus creation here is immense. For instance, to get a corpus of all novels that Wikipedia lists as set in London, I can just run the Pandas query df[df.wp_subjects.str.contains('Novels_set_in_London')], which returns a table of 46 novels. To get the average year of birth of the authors in this corpus, I can append .authordateofbirth.mean() (it's 1844). By using this metadata to query other APIs, I can then do fun things like compare the average Goodreads ratings for novels set in London and in Paris. (Paris wins, with an average rating of 3.8, compared with an average rating of 3.5 for London novels.)
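Those two queries, put together and run against a toy stand-in for the database, look like this. The three-row DataFrame is invented for illustration; in it, as in the post's queries, wp_subjects is stored as one joined string per book so that str.contains works directly:

```python
import pandas as pd

# Toy stand-in for the full database (titles, subjects, and birth
# years here are illustrative, not real query results).
df = pd.DataFrame({
    "title": ["Bleak House", "A Tale of Two Cities", "Bel-Ami"],
    "wp_subjects": ["Novels_set_in_London",
                    "Novels_set_in_London Novels_set_in_Paris",
                    "Novels_set_in_Paris"],
    "authordateofbirth": [1812, 1812, 1850],
})

# Sub-corpus of novels Wikipedia lists as set in London.
london = df[df.wp_subjects.str.contains("Novels_set_in_London")]
print(len(london))                      # size of the London corpus
print(london.authordateofbirth.mean())  # average author birth year
```

Swapping 'Novels_set_in_London' for 'Novels_set_in_Paris' gives the comparison corpus for the Goodreads experiment.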
Here are the top ten Project Gutenberg-provided Library of Congress subject headings, along with the numbers of associated texts:
For those books that could be found on Wikipedia, here are the top ten Wikipedia categories for those books, along with their counts:
Other interesting categories include Debut Novels (51), 1915 Novels (39), Gothic Novels (37) and Novels Adapted into Comics (36). These categories could easily be used to create sub-corpora. If you're interested in, say, differences between American and British science fiction novels, whether there's anything unusually characteristic about 1915 novels, or the properties of novels that have been adapted into comics, it's easy to construct those experiments using this corpus. Finally, here are the most common Project Gutenberg "bookshelves", with associated text counts:
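Category counts like these fall out of a single explode-and-count, assuming each book's Wikipedia categories are stored as a list; the data below is a made-up miniature, not the real corpus:

```python
import pandas as pd

# Illustrative per-book category lists (invented examples).
df = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "wp_subjects": [["Debut_novels", "Gothic_novels"],
                    ["Gothic_novels"],
                    ["1915_novels"]],
})

# One row per (book, category), then count texts per category.
counts = df.wp_subjects.explode().value_counts()
print(counts)
```

The same frame, filtered instead of counted, yields any of the sub-corpora mentioned above, such as all Gothic novels or all 1915 novels.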
It was a dark and stormy night...
I wasn't very surprised to find 306 works by William Shakespeare in the Project Gutenberg database, but I found it very surprising that the next most-represented author is Edward Bulwer-Lytton, with 219 texts. Here are a few more:
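Per-author counts like these are a one-line value_counts over the author column; here's a sketch against invented data (the repetition counts below are illustrative, not the real 306 and 219):

```python
import pandas as pd

# Illustrative author column: one row per text in the database.
df = pd.DataFrame({
    "author": ["Shakespeare, William"] * 3 + ["Lytton, Edward Bulwer Lytton, Baron"] * 2,
})

# Texts per author, most prolific first.
print(df.author.value_counts().head(10))
```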
And here are the numbers of texts in Project Gutenberg, by language:
In the coming months, I'm going to try to generate sub-corpora from this database, starting with large corpora like American Literature and British Literature, and moving on to more specialized corpora, like single-author corpora. I'll make these all available through the corpus downloader, corpus, that I'm developing with DHBox.
To see the (very messy) code that I used to generate this database, check out this project on GitHub. I'll release the database itself, as well, as soon as I can figure out a way to get it inexpensively online. If you have any ideas for how best to accomplish that, please leave a note in the comments below!