Macroetym: a Command-Line Tool for Macro-Etymological Textual Analysis

Posted 2016-06-01

I’m proud to introduce macroetym, a command-line tool for macro-etymological textual analysis, which is now available for download with the Python package manager, pip. It’s a complete rewrite of The Macro-Etymological Analyzer, the web tool for macro-etymological analysis I wrote a few years ago, first described in this post, and presented at DH2014. It can now analyze any number of texts, and texts in 250 languages. Here are a few examples of the program in action:

A simple comparative macro-etymological analysis of two texts

$ macroetym wells-time-machine.txt woolf-mrs-dalloway.txt

              wells-time-machine.txt  woolf-mrs-dalloway.txt
Austronesian                0.045788                0.021177
Celtic                      0.068681                0.047649
Germanic                   40.885226               40.334075
Hellenic                    0.840965                1.080051
Indo-Iranian                0.015263                0.153537
Latinate                   57.724359               57.722893
Other                       0.137363                0.333545
Semitic                     0.282357                0.243541
Turkic                      0.000000                0.031766
Uralic                      0.000000                0.031766

A more verbose analysis of a single text

$ macroetym woolf-mrs-dalloway.txt --allstats

                                woolf-mrs-dalloway.txt
Anglo-Norman                                      7.23
Angloromani                                       0.03
Arabic                                            0.06
Aragonese                                         0.06
Dutch                                             0.29
Dutch, Middle (ca. 1050-1350)                     0.27
English, Old (ca. 450-1100)                      36.56
French                                            7.81
French, Middle (ca. 1400-1600)                    3.96
French, Old (842-ca. 1400)                       21.58
German                                            0.06
German, Middle Low                                0.12
... etc.

An analysis of all the books of Paradise Lost

$ macroetym paradise-lost-books/* --showfamilies Latinate        

          bk/book01.txt  bk/book02.txt  bk/book03.txt  bk/book04.txt
Latinate      52.622816      56.005313      52.644493      50.522588   

          bk/book05.txt  bk/book06.txt  bk/book07.txt  bk/book08.txt
Latinate      55.929858      56.608863       51.46886      54.492665   

          bk/book09.txt  bk/book10.txt  bk/book11.txt  bk/book12.txt  
Latinate      53.625632      54.745275      50.982633      52.195609  

Machine-readable output

macroetym paradise-lost-books/* --csv > pl-books.csv

Analysis of texts in languages other than English

macroetym madame-bovary.txt --lang=fra

Installation

Install macroetym using pip3:

pip3 install macroetym

Alternatively, grab the source code on GitHub and install locally:

git clone https://github.com/JonathanReeve/macro-etym
cd macro-etym
pip3 install .

Contributing

Macroetym is free and open-source software! There are a number of open issues with the program that need addressing. If you know a bit of python, feel free to hack around on the code as you see fit. Pull requests are very welcome. Non-code contributions are also welcome in the form of bug reports, documentation, or experiments in text analysis that use the program.


I welcome your comments and annotations in the Hypothes.is sidebar to the right. →