If you do computational analysis of books and need to split a book’s text file into its constituent chapters, I’ve just released a tool that you might find useful. It’s called chapterize, and it breaks a book into chapters. This is how it works:
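A minimal sketch of the basic invocation, assuming chapterize is already installed and on your PATH, and that the book is saved as pride-and-prejudice.txt in the current directory. The guard around the command is my addition, there only so the snippet fails gracefully when either assumption does not hold:

```shell
# Assumption: the book's plain-text file sits in the current directory.
book="pride-and-prejudice.txt"

if command -v chapterize >/dev/null 2>&1 && [ -f "$book" ]; then
    # Split the book into one file per chapter.
    chapterize "$book" &&
        ls "${book%.txt}-chapters" | head -n 3   # peek at the first few chapter files
else
    echo "Need chapterize on PATH and $book in this directory" >&2
fi
```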
This will create a directory called pride-and-prejudice-chapters, containing chapters 1-61 of Pride and Prejudice, with file names padded with leading zeros. Now you can run analyses on each of these chapters. For instance, you could compute the macro-etymology of each chapter by running my macro-etymology tool over every file in that directory.
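The same per-chapter pattern works for any command-line analysis. As a rough sketch, here is a loop with a plain word count standing in for the real analysis. The function name count_words_per_chapter is hypothetical, and `wc -w` is only a placeholder for whatever tool you actually want to run:

```shell
# Hypothetical helper: print "<chapter file><TAB><word count>" for each file in a directory.
# `wc -w` is a stand-in; swap in your own per-chapter analysis command.
count_words_per_chapter() {
    for chapter in "$1"/*; do
        [ -f "$chapter" ] || continue   # skip when the directory is missing or empty
        printf '%s\t%s\n' "$(basename "$chapter")" \
            "$(wc -w < "$chapter" | tr -d '[:space:]')"
    done
}
```

Once the chapters directory exists, you would call it as `count_words_per_chapter pride-and-prejudice-chapters`. Because the file names have leading zeros, the shell's glob returns the chapters in reading order.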
Because chapterize removes metatext (introductions, tables of contents, Project Gutenberg fine print), it can also be used to clean up book data in preparation for text analysis. If you don't actually need to break up a book into its chapters, but just want to extract its text, chapterize has an option that skips the splitting step.
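A hedged sketch of that extraction-only run. The flag spelling `--nochapters` is my recollection of the tool's help text rather than something stated here, so check it against `chapterize --help` before relying on it. As before, the guard is only there so the snippet degrades gracefully:

```shell
# Assumption: --nochapters is the extraction-only flag (verify with `chapterize --help`).
book="pride-and-prejudice.txt"
extracted="${book%.txt}-extracted.txt"

if command -v chapterize >/dev/null 2>&1 && [ -f "$book" ]; then
    # Strip the metatext without splitting into chapters.
    chapterize --nochapters "$book" &&
        head -n 3 "$extracted"   # peek at the start of the cleaned text
else
    echo "Need chapterize on PATH and $book in this directory" >&2
fi
```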
This command will create pride-and-prejudice-extracted.txt, containing just the inner text of the novel. So whereas pride-and-prejudice.txt begins with "The Project Gutenberg EBook of Pride and Prejudice, by Jane Austen," pride-and-prejudice-extracted.txt begins with "It is a truth universally acknowledged."
Because chapterize is a command-line tool, it's easily scriptable. Let's say you have a directory of 100 novels, and you want to remove all their metatext. That's easily done with a simple shell loop:
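One way such a loop might look, wrapped in a function so it can be pointed at any directory. The name extract_all and the CHAPTERIZE override variable are hypothetical conveniences of mine, and `--nochapters` is again an assumed flag spelling worth confirming with `chapterize --help`:

```shell
# Hypothetical helper: strip metatext from every .txt file in a directory.
# Assumes chapterize is on your PATH; set CHAPTERIZE to override the command.
extract_all() (
    cd "${1:-.}" || exit 1
    for book in *.txt; do
        [ -f "$book" ] || continue                # no .txt files: the glob stays literal
        case "$book" in
            *-extracted.txt) continue ;;          # skip output from an earlier run
        esac
        "${CHAPTERIZE:-chapterize}" --nochapters "$book" \
            || echo "chapterize failed on $book" >&2   # report and keep going
    done
)
```

You would run it as `extract_all novels`, where novels is the directory holding the books. Setting `CHAPTERIZE=echo` first gives a dry run that only prints the commands it would execute. The body runs in a subshell, so the `cd` does not change your working directory.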
If you use chapterize, let me know how it works for you. If you get any errors with a particular book, send me a copy of the book, and I'll try to make it work.