If you do computational analyses of books and need to split a book's text file into its constituent chapters, I've just released a tool that you might find useful. It's called chapterize. This is how it works:
# First, get a copy of Chapterize:
git clone https://github.com/JonathanReeve/chapterize.git
# Change into that directory:
cd chapterize
# Now grab a copy of Pride and Prejudice from Project Gutenberg:
wget http://www.gutenberg.org/cache/epub/1342/pg1342.txt
# Give it a nicer name:
mv pg1342.txt pride-and-prejudice.txt
# Run Chapterize on it:
python chapterize.py pride-and-prejudice.txt
This will create a directory called pride-and-prejudice-chapters, containing chapters 1-61 of Pride and Prejudice, with filenames numbered using leading zeros. Now you can run analyses on each of these chapters. For instance, to compute the macro-etymology of each chapter using my macro-etym tool:
# Change into the chapters directory:
cd pride-and-prejudice-chapters
# Grab a copy of the macro-etym tool:
git clone https://github.com/JonathanReeve/macro-etym
# Run macroetym on each chapter:
python macro-etym/macroetym/main.py *.txt
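If you'd rather write a per-chapter analysis yourself, the same directory can be walked from Python. The following is a minimal sketch, assuming the chapter files end in `.txt` and sit in the chapters directory created above. The function name `chapter_word_counts` and the whitespace tokenization are illustrative choices of mine, not part of chapterize.

```python
from pathlib import Path


def chapter_word_counts(chapter_dir):
    """Return a list of (filename, word_count) pairs, one per chapter file."""
    counts = []
    # Because the filenames use leading zeros, a plain lexical sort
    # also puts the chapters in numerical order.
    for path in sorted(Path(chapter_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        # Naive tokenization: split on any run of whitespace.
        counts.append((path.name, len(text.split())))
    return counts
```

Calling `chapter_word_counts("pride-and-prejudice-chapters")` from the chapterize directory should then give one pair per chapter, which is a convenient starting point for any other per-chapter measurement.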
Because chapterize removes metatext (introductions, tables of contents, Project Gutenberg fine print), it can also be used to clean up book data in preparation for text analysis. If you don't need to break a book into chapters and just want to extract its text, use the --nochapters flag:
python chapterize.py --nochapters pride-and-prejudice.txt
This command will create pride-and-prejudice-extracted.txt, containing just the inner text of the novel. So whereas pride-and-prejudice.txt begins with "The Project Gutenberg EBook of Pride and Prejudice, by Jane Austen," pride-and-prejudice-extracted.txt begins with "It is a truth universally acknowledged."
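A quick way to check the result is to compare how the two files begin. This is a minimal Python sketch. The helper name `first_nonblank_line` is hypothetical, and the `utf-8-sig` encoding is my assumption, chosen so that a byte-order mark at the start of a file is ignored.

```python
def first_nonblank_line(path):
    """Return the first line of a text file that is not empty or whitespace."""
    with open(path, encoding="utf-8-sig", errors="replace") as f:
        for line in f:
            stripped = line.strip()
            if stripped:
                return stripped
    return ""  # the file is empty or contains only whitespace
```

Printing `first_nonblank_line("pride-and-prejudice.txt")` and `first_nonblank_line("pride-and-prejudice-extracted.txt")` side by side makes it easy to see whether the metatext was removed.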
Because chapterize is a command-line tool, it's easily scriptable. Let's say you have a directory of 100 novels, and you want to remove all their metatext. That's easily done with a simple shell loop:
for f in *.txt; do python chapterize.py --nochapters "$f"; done
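The same batch job can be driven from Python, which makes it easier to record which books failed. The following is a rough sketch under a few assumptions: `chapterize.py` is in the current working directory, it exits with a nonzero status when it hits an error, and output files follow the `-extracted.txt` naming shown earlier. The function names `build_commands` and `run_all` are mine, not part of chapterize.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(book_dir, script="chapterize.py"):
    """Build one chapterize command per .txt file in book_dir."""
    commands = []
    for path in sorted(Path(book_dir).glob("*.txt")):
        # Skip output from a previous run (assumes the "-extracted.txt"
        # naming described above), so reruns don't reprocess it.
        if path.name.endswith("-extracted.txt"):
            continue
        commands.append([sys.executable, script, "--nochapters", str(path)])
    return commands


def run_all(book_dir, script="chapterize.py"):
    """Run every command and return the input files whose run failed."""
    failures = []
    for cmd in build_commands(book_dir, script):
        # Passing a list avoids shell quoting problems with odd filenames.
        result = subprocess.run(cmd)
        if result.returncode != 0:  # assumption: nonzero exit means an error
            failures.append(cmd[-1])
    return failures
```

Calling `run_all(".")` from the chapterize directory would process every novel there and return a list of the files that need a closer look.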
If you use chapterize, let me know how it works for you. If you get any errors with a particular book, send me a copy of the book, and I'll try to make it work.