Posted 2016-08-23
If you do computational analyses of books, and need to break up the book’s text file into its constituent chapters, I’ve just released a tool that you might find useful. It’s called chapterize, and it breaks a book into chapters. This is how it works:
# First, get a copy of Chapterize:
git clone https://github.com/JonathanReeve/chapterize.git
# Change into that directory:
cd chapterize
# Now grab a copy of Pride and Prejudice from Project Gutenberg:
wget http://www.gutenberg.org/cache/epub/1342/pg1342.txt
# Give it a nicer name:
mv pg1342.txt pride-and-prejudice.txt
# Run Chapterize on it:
python chapterize.py pride-and-prejudice.txt
This will create a directory called pride-and-prejudice-chapters
, containing
chapters 1-61 of Pride and Prejudice, named with leading zeros. Now you
can run analyses on each of these chapters. For instance, to compute the
macro-etymology of each chapter using my macroetym
tool:
# Change into the chapters directory
cd pride-and-prejudice-chapters
# Grab a copy of the macro-etym tool
git clone https://github.com/JonathanReeve/macro-etym
# Run macroetym on each chapter
python macro-etym/macroetym/main.py *.txt
Because chapterize
removes metatext
(introductions, tables of contents, Project Gutenberg fine print), it
can also be used to clean up book data in preparation for text analysis.
If you don’t actually need to break up a book into its chapters, but
just want to extract its text, use the --nochapters
flag:
python chapterize.py --nochapters pride-and-prejudice.txt
This command will create pride-and-prejudice-extracted.txt
, containing
just the inner text of the novel. So whereas pride-and-prejudice.txt
begins with “the Project
Gutenberg EBook of Pride and Prejudice, by Jane Austen,” pride-and-prejudice-extracted.txt
begins with
“it is a truth universally acknowledged.”
Since chapterize
is a command-line
tool, it’s easily scriptable. Let’s say you have a directory of 100
novels, and you want to remove all their metatext. That’s easily done
with a simple shell loop:
for f in *.txt; do python chapterize --nochapters $f; done
If you use chapterize
, let me know how
it works for you. If you get any errors with a particular book, send me
a copy of the book, and I’ll try to make it work.