Posted 2017-06-23
Project Gutenberg is a large store of public domain electronic texts, one that has been around since the 1970s. Nearly everyone who has experimented with computational literary analysis has at some point used its electronic texts. Many digital humanists undoubtedly share my frustrations with it: its interface is clunky, its metadata is incomplete, and it’s not very friendly to computational text extraction. Inspired by David McClure’s textual databases, used at the Stanford Literary Lab, I decided to fix this by creating a structured database for Project Gutenberg’s corpus, and augmenting it as much as possible with publicly available book data. This database contains the complete cleaned text of each work, all its associated metadata, and additional metadata derived from GITenberg and Wikipedia. Here’s an example entry, Dickens’s novel A Tale of Two Cities:
LCC {PR}
author Dickens, Charles
authoryearofbirth 1812
authoryearofdeath 1870
downloads 11809
formats {'text/plain; charset=utf-8': 'http://www.gute...
id 98
languages [en]
lcsh {London (England) -- History -- 18th century -...
title A Tale of Two Cities
type Text
_repo A-Tale-of-Two-Cities_98
_version 0.2.2
alternative_title NaN
contributor NaN
covers [{'attribution': 'Alexis Lampley, 2015', 'cove...
creator {'author': {'agent_name': 'Dickens, Charles', ...
description NaN
edition_identifiers {'edition_id': 'http://www.gutenberg.org/ebook...
edition_note NaN
gutenberg_bookshelf Historical Fiction
gutenberg_issued 1994-01-01
gutenberg_type Text
identifiers {'gutenberg': '98'}
jmdate 2011-01-23
subjects ['Historical fiction', 'French -- England -- L...
language_note NaN
production_note NaN
publication_date 2015-08-01
publication_note NaN
rights CC BY-NC
rights_url http://creativecommons.org/licenses/by-nc/4.0/
series_note NaN
summary NaN
tableOfContents NaN
titlepage_image NaN
url http://www.gutenberg.org/ebooks/98
wikipedia NaN
filename /run/media/jon/SAMSUNG/gitenberg/A-Tale-of-Two...
releaseDate NaN
wp_publication_date 1859
wp_subjects ['Novels_by_Charles_Dickens', 'Victorian_novel...
wp_info {'http://www.w3.org/1999/02/22-rdf-syntax-ns#t...
wp_literary_genres ['Historical_fiction']
The first few fields contain Project Gutenberg’s own metadata about the book: Dickens’s years of birth and death, for instance. Project Gutenberg also provides Library of Congress classification and subject data: PR (English literature) as well as these subjects:
'British -- France -- Paris -- Fiction',
'Executions and executioners -- Fiction',
'France -- History -- Revolution, 1789-1799 -- Fiction',
'French -- England -- London -- Fiction',
'Historical fiction',
'London (England) -- History -- 18th century -- Fiction',
'Lookalikes -- Fiction',
'Paris (France) -- History -- 1789-1799 -- Fiction',
'War stories'
Already you can see that this might be useful for corpus generation. If you’re interested in, say, the properties of novels that deal with lookalikes, you can easily build a corpus using these subject headings, and use those texts for your analyses.
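As a minimal sketch of that kind of subject-based selection, assuming each record exposes its lcsh field as a collection of heading strings: the records below are small made-up stand-ins rather than real database rows (only the field names and the Dickens headings come from the entry above), and with_subject is a hypothetical helper name.

```python
# Hypothetical stand-in records; only the field names `title` and `lcsh`
# and the Dickens headings come from the example entry above.
records = [
    {"title": "A Tale of Two Cities",
     "lcsh": {"Lookalikes -- Fiction", "Historical fiction", "War stories"}},
    {"title": "Hypothetical Book A",
     "lcsh": {"Love stories"}},
    {"title": "Hypothetical Book B",
     "lcsh": {"Lookalikes -- Fiction", "Adventure stories"}},
]


def with_subject(records, keyword):
    """Return the records with at least one heading containing `keyword`."""
    keyword = keyword.lower()
    return [
        record for record in records
        if any(keyword in heading.lower() for heading in record["lcsh"])
    ]


lookalike_corpus = with_subject(records, "lookalikes")
print([record["title"] for record in lookalike_corpus])
```

Matching on a lowercase substring rather than on the exact heading is a design choice here: it also catches subdivided headings such as 'Lookalikes -- Fiction' without having to list every variant.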
To Project Gutenberg’s metadata, I’ve added metadata from GITenberg. GITenberg, like my project Git-Lit, is an initiative to make editable GitHub repositories for each Project Gutenberg book. They’ve also enhanced the metadata for these works. Their edition of A Tale of Two Cities, for instance, contains a nice ASCIIDOC version of the book which anyone can edit in the browser, and a Creative Commons-licensed cover designed as part of the Recovering the Classics project. All of this data, as well as a link to the GITenberg repository itself, is included in the database, along with GITenberg’s version of the text.
The next few fields come from Wikipedia. I queried DBpedia for books with titles and authors similar to those in Project Gutenberg, and it returned about 1,800 matches. The wp_info field here contains the complete structured set of data about the book, which I then parsed into the more useful individual fields prefixed with wp_. This includes, for instance, wp_subjects, the categories to which the book’s Wikipedia page belongs. For A Tale of Two Cities, these are:
'Novels_by_Charles_Dickens',
'Victorian_novels',
'19th-century_novels',
'British_novels',
'1775_in_fiction',
'1859_novels',
'Chapman_&_Hall_books',
'Novels_adapted_into_radio_programs',
'Novels_adapted_into_television_programs',
'Novels_adapted_into_operas',
'A_Tale_of_Two_Cities',
'Novels_adapted_into_plays',
'Novels_adapted_into_comics',
'Novels_adapted_into_films',
'Novels_first_published_in_serial_form',
'Works_originally_published_in_All_the_Year_Round',
'Novels_set_in_Paris',
'Novels_set_in_London',
'Novels_set_in_the_French_Revolution'
The potential for corpus creation here is immense. For instance, to get a corpus of all novels that Wikipedia lists as set in London, I can just run the Pandas query df[df.wp_subjects.str.contains('Novels_set_in_London')], which returns a table of 46 novels. To get the average year of birth of the authors in this corpus, I can append .authoryearofbirth.mean() (it’s 1844). By using this metadata to query other APIs, I can then do fun things like compare the average Goodreads ratings for novels set in London and in Paris. (Paris wins, with an average rating of 3.8, compared with an average rating of 3.5 for London novels.)
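Here is a small, self-contained sketch of that kind of query. It assumes wp_subjects is stored as a string and is missing for books that had no Wikipedia match. The rows are invented stand-ins, and novels_in_category is a hypothetical helper name, not part of the project’s code; the column names follow the example entry above.

```python
import pandas as pd

# Invented stand-in rows; the real database has one row per Gutenberg text.
df = pd.DataFrame({
    "title": [
        "A Tale of Two Cities",
        "Hypothetical Paris Novel",
        "Hypothetical Unmatched Book",
    ],
    "authoryearofbirth": [1812, 1850, 1900],
    "wp_subjects": [
        "['Novels_by_Charles_Dickens', 'Novels_set_in_Paris', 'Novels_set_in_London']",
        "['Novels_set_in_Paris']",
        None,  # no Wikipedia match, so no categories
    ],
})


def novels_in_category(df, category):
    """Select rows whose wp_subjects string mentions `category`."""
    # na=False treats rows without Wikipedia data as non-matches;
    # regex=False matches the category name literally.
    mask = df["wp_subjects"].str.contains(category, regex=False, na=False)
    return df[mask]


london = novels_in_category(df, "Novels_set_in_London")
print(london["title"].tolist())
print(london["authoryearofbirth"].mean())
```

Passing na=False matters whenever some rows have no Wikipedia data: without it, the mask contains missing values and cannot be used directly for boolean indexing.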
Corpus Statistics
Subjects
Here are the top ten Project Gutenberg-provided Library of Congress subject headings, along with the numbers of associated texts:
('Fiction', 1921),
('Short stories', 1604),
('Science fiction', 1286),
('Adventure stories', 789),
('Historical fiction', 654),
('Conduct of life -- Juvenile fiction', 639),
('Poetry', 634),
('Love stories', 620),
('English wit and humor -- Periodicals', 555),
('Detective and mystery stories', 546)
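Counts like these can be produced by tallying every heading across all books. The following is only a sketch of one way to do it, using made-up subject sets and a hypothetical helper name, top_subjects; it is not necessarily how the figures above were computed.

```python
from collections import Counter

# Made-up subject sets, one per book, standing in for the lcsh field.
books_subjects = [
    {"Fiction", "Historical fiction"},
    {"Fiction", "Short stories"},
    {"Short stories"},
    {"Fiction"},
]


def top_subjects(subject_sets, n=10):
    """Count how many books carry each heading and return the n most common."""
    counts = Counter()
    for subjects in subject_sets:
        counts.update(subjects)  # each heading counts once per book
    return counts.most_common(n)


print(top_subjects(books_subjects, n=2))
```

Because each book’s headings are held in a set, a heading is counted at most once per book, so the totals are numbers of texts rather than numbers of mentions.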
For the books that could be matched on Wikipedia, here are the most common Wikipedia categories, along with their counts:
('Novels_first_published_in_serial_form', 247),
('British_novels_adapted_into_films', 109),
('19th-century_American_novels', 99),
('Victorian_novels', 96),
('British_novels', 93),
('Novels_adapted_into_plays', 92),
('English_novels', 88),
('20th-century_American_novels', 85),
('American_science_fiction_novels', 63),
('American_novels_adapted_into_films', 59),
('19th-century_novels', 58),
Other interesting categories include Debut Novels (51), 1915 Novels (39), Gothic Novels (37) and Novels Adapted into Comics (36). These categories could easily be used to create sub-corpora. If you’re interested in, say, differences between American and British science fiction novels, whether there’s anything unusually characteristic about 1915 novels, or the properties of novels that have been adapted into comics, it’s easy to construct those experiments using this corpus. Finally, here are the most common Project Gutenberg “bookshelves”, with associated text counts:
[('Bestsellers, American, 1895-1923', 225),
("Children's Literature", 178),
('The Mirror of Literature, Amusement, and Instruction', 174),
("Children's Book Series", 168),
('Historical Fiction', 164),
('US Civil War', 115),
('Best Books Ever Listings', 110),
("Children's Fiction", 99),
('Movie Books', 86),
('FR Littérature', 86)]
It was a dark and stormy night…
I wasn’t very surprised to find 306 works by William Shakespeare in the Project Gutenberg database, but I was surprised to find that the next most-represented author is Edward Bulwer-Lytton, with 219 texts. Here are the most frequent author attributions, with their text counts:
Various 3253
Anonymous 719
Shakespeare, William 306
Lytton, Edward Bulwer Lytton, Baron 219
Ebers, Georg 172
Twain, Mark 164
Balzac, Honoré de 139
Verne, Jules 137
Kingston, William Henry Giles 133
Languages
And here are the numbers of texts in Project Gutenberg, by language:
English: 43410
French: 2766
Finnish: 1622
German: 1516
Dutch: 749
Italian: 690
Portuguese: 544
Spanish: 538
Chinese: 425
Modern Greek (1453-): 219
Future Developments
In the coming months, I’m going to try to generate sub-corpora from this database, starting with large corpora like American Literature and British Literature, and moving into more specialized corpora, like single-author corpora. I’ll make these all available through corpus, the corpus downloader that I’m developing with DHBox.
Code
To see the (very messy) code that I used to generate this database, check out this project on GitHub. I’ll release the database itself as well, as soon as I can figure out an inexpensive way to host it online. If you have any ideas about how best to accomplish that, please leave a note in the comments below!