Chapterize: a Tool for Automatically Splitting Electronic Texts into Chapters

Posted 2016-08-23

If you do computational analyses of books and need to break a book’s text file into its constituent chapters, I’ve just released a tool that you might find useful. It’s called chapterize. This is how it works:

# First, get a copy of Chapterize: 
git clone https://github.com/JonathanReeve/chapterize.git

# Change into that directory: 
cd chapterize

# Now grab a copy of Pride and Prejudice from Project Gutenberg: 
wget http://www.gutenberg.org/cache/epub/1342/pg1342.txt

# Give it a nicer name: 
mv pg1342.txt pride-and-prejudice.txt 

# Run Chapterize on it:  
python chapterize.py pride-and-prejudice.txt

This will create a directory called pride-and-prejudice-chapters, containing chapters 1-61 of Pride and Prejudice, named with leading zeros. Now you can run analyses on each of these chapters. For instance, to compute the macro-etymology of each chapter using my macroetym tool:

# Change into the chapters directory
cd pride-and-prejudice-chapters

# Grab a copy of the macro-etym tool
git clone https://github.com/JonathanReeve/macro-etym

# Run macroetym on each chapter
python macro-etym/macroetym/main.py *.txt
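
Because the chapter files are named with leading zeros, a plain alphabetical sort returns them in reading order, which makes per-chapter loops straightforward in a script too. Here is a minimal sketch in Python that counts words per chapter. The function name `chapter_word_counts` is hypothetical, and the sketch assumes the chapter files end in .txt (as the *.txt glob above suggests) and are UTF-8 text:

```python
from pathlib import Path


def chapter_word_counts(chapter_dir):
    """Return a list of (filename, word_count) pairs in sorted filename order."""
    counts = []
    # Zero-padded names mean sorted() yields chapters in reading order.
    for path in sorted(Path(chapter_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        # Whitespace-delimited tokens: a rough word count, not a tokenizer.
        counts.append((path.name, len(text.split())))
    return counts


if __name__ == "__main__":
    # Assumes you are already inside the chapters directory.
    for name, n in chapter_word_counts("."):
        print(f"{name}\t{n}")
```

The glob is not recursive, so the macro-etym clone sitting in the same directory is ignored.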

Because chapterize removes metatext (introductions, tables of contents, Project Gutenberg fine print), it can also be used to clean up book data in preparation for text analysis. If you don’t actually need to break up a book into its chapters, but just want to extract its text, use the --nochapters flag:

python chapterize.py --nochapters pride-and-prejudice.txt

This command will create pride-and-prejudice-extracted.txt, containing just the inner text of the novel. So whereas pride-and-prejudice.txt begins with “The Project Gutenberg EBook of Pride and Prejudice, by Jane Austen,” pride-and-prejudice-extracted.txt begins with “It is a truth universally acknowledged.”

Since chapterize is a command-line tool, it’s easily scriptable. Let’s say you have a directory of 100 novels, and you want to remove all their metatext. A short shell loop will do it:

for f in *.txt; do python chapterize.py --nochapters "$f"; done
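
For a large batch, it may help to record which books fail so you can look at them afterwards. The sketch below is a rough Python equivalent of the loop above, under two assumptions that this post does not confirm: that chapterize.py lives at the path you pass in, and that it exits with a non-zero status when it cannot process a book. The function name `extract_all` is hypothetical:

```python
import subprocess
import sys
from pathlib import Path


def extract_all(book_dir, script="chapterize.py"):
    """Run `script --nochapters` on every .txt file; return names that failed."""
    failed = []
    # sorted() builds the full list up front, so output files created
    # during the run are not picked up by this same loop.
    for book in sorted(Path(book_dir).glob("*.txt")):
        # Passing arguments as a list avoids shell quoting problems
        # with spaces or other special characters in filenames.
        result = subprocess.run(
            [sys.executable, str(script), "--nochapters", str(book)]
        )
        # Assumption: a non-zero exit status signals a failed book.
        if result.returncode != 0:
            failed.append(book.name)
    return failed
```

Calling `extract_all(".")` from the directory of novels would then return a list of the files that did not go through, given the assumptions above.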

If you use chapterize, let me know how it works for you. If you get any errors with a particular book, send me a copy of the book, and I’ll try to make it work.
