Digital Features for Digital Texts: Automatic Text Analysis

I finished up last week with an article on digital texts you can download from digital libraries, along with some observations on how I modify those texts, customizing them for my students’ needs. I’ll start off this week by looking at some of the ways that the digital libraries themselves promote the reformatting of digital texts based on user needs, along with other features that become possible when texts are put online. It is a hodgepodge of very exciting features which, taken together, suggest what amazing things will be possible with the next generation of digital libraries. There is a lot to say here, so I’ll take this in several segments: my focus today will be on computer-aided analysis of texts - something of interest both to high-level researchers and to the general reader.
Intratext Project. Computers are very good at counting things! They don’t understand how to interpret the results - but they can do all the work for you and let you figure out what it means. For literary, linguistic, historical and other kinds of research, the automatic processing of digital texts is enormously valuable. For a good example of this kind of analysis available online, I’ll start with the Intratext Project, which I mentioned last week in conjunction with the Babelot index at Eulogos (Italy). They currently feature a rather curious collection of texts, with a special emphasis on the complete works of canonical Italian authors (Goldoni, Svevo, Verga), along with classical Latin authors (Apuleius, Caesar, Horace, Ovid, Vergil, among others), and a range of religious texts (primarily but not exclusively Catholic documents). Of course I am terribly pleased to say that they are compiling some Polish literary classics too, along with other growing text collections in other European languages. The elegant programming at this site allows you to instantly generate the following text analysis tools to use in conjunction with these texts:

  • alphabetical word lists, with frequency statistics, reverse-word listings, and word length
  • concordances with contextual display (words prior to and following word instance)
  • statistics: word frequency and other statistics in graphical display format
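
To make the three kinds of tools listed above a little more concrete, here is a minimal sketch in Python of how a word frequency list, a reverse-word listing, and a concordance with context can be computed. This is not how Intratext itself is programmed (I have no knowledge of their code); the function names and the sample sentence are made up for illustration, and the tokenizer only handles plain unaccented letters.

```python
import re
from collections import Counter

# Made-up sample sentence, used only for illustration.
SAMPLE = "The cat sat on the mat and the dog sat on the log."


def tokenize(text):
    """Lowercase the text and pull out runs of the letters a-z as words."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(tokens):
    """Count how many times each word occurs."""
    return Counter(tokens)


def reverse_word_list(tokens):
    """List distinct words sorted by their spelling read backwards,
    so that words sharing an ending appear next to each other."""
    return sorted(set(tokens), key=lambda word: word[::-1])


def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every occurrence
    of keyword, with up to `width` words of context on each side."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


tokens = tokenize(SAMPLE)

# Alphabetical word list with frequency and word length.
frequencies = word_frequencies(tokens)
for word in sorted(frequencies):
    print(word, frequencies[word], len(word))

# Reverse-word listing: words grouped by their endings.
print(reverse_word_list(tokens))

# Concordance: each occurrence of "sat" with two words of context.
for left, word, right in concordance(tokens, "sat", width=2):
    print(f"{left:>10} [{word}] {right}")
```

The reverse-word listing is the least obvious of the three: sorting words by their spelling read backwards puts "cat", "mat" and "sat" side by side, which is what makes it useful for studying endings in inflected languages like Latin.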

These kinds of tools are like searching on steroids! When I was working on my dissertation, one of my most precious research tools was a huge concordance of the Vulgate prepared by Jesuits around the turn of the century, by hand - an enormous monster of a book that took up the entire northwest corner of my desk! Now anybody can have access to that kind of research tool, for free, online - with many word count and reverse word features that the venerable old Jesuit production could not even dream of.
Perseus Digital Library. In addition to counting things, computers are good at connecting and cross-referencing things. And one thing that they can connect is word appearances in a text with word definitions in a dictionary. This is enormously useful to any beginning language student! For the best use of this, consider the Perseus Digital Library, which contains hundreds of Greek and Latin texts linked word by word to top-of-the-line online dictionaries (the Greek and Latin equivalents of the OED for English). Even more impressive is the way that Perseus has integrated morphology tools that analyze the form of a word so that you can move beyond the word instance to the possible dictionary definition. In English, this would be the equivalent of a dictionary that knows how to look up the word “went” under the entry for “go”, or the word “knives” under the entry for “knife”. Perseus also offers some other amazing features - word frequency counts and word lists (more complex than at Intratext, and a little intimidating at first - but with incredibly useful customization features!). There is even a tool that automatically draws a map displaying any place names mentioned in a specific text. Now, admittedly, computers are not very bright: the map that I just drew for Aeschylus’s Agamemnon includes Paris, France, because there is a character named Paris in the text. But the interactive map of Greece that it has drawn for me is amazing nevertheless: how many times have you read a text mentioning place names about which you were more or less clueless! Let the computer clue you in - it’s an amazing feature, and reflects the deep commitment of the folks at Perseus to integrating textual content with images, mapping, and other “real world” data.
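
The idea of moving from a word as it appears in the text to its dictionary entry can be sketched in a few lines of Python. This is only a toy lookup table, not the morphology tool that Perseus actually uses (real Greek and Latin analyzers work from stems and endings, not from a hand-typed list); the table contents, the one-line definitions, and the function name below are all invented for illustration.

```python
# Hypothetical toy data: inflected form -> possible dictionary headwords.
FORM_TO_HEADWORDS = {
    "went": ["go"],
    "knives": ["knife"],
    "saw": ["see", "saw"],  # ambiguous: past tense of "see", or the tool
}

# Hypothetical toy dictionary: headword -> short definition.
DICTIONARY = {
    "go": "to move or travel",
    "knife": "a cutting tool with a blade",
    "see": "to perceive with the eyes",
    "saw": "a cutting tool with a toothed blade",
}


def look_up(word):
    """Return every (headword, definition) pair the word could belong to.

    A form that is not in the table is treated as its own headword,
    and a word the dictionary does not know at all gives an empty list.
    """
    form = word.lower()
    headwords = FORM_TO_HEADWORDS.get(form, [form])
    return [(h, DICTIONARY[h]) for h in headwords if h in DICTIONARY]


for word in ["went", "knives", "saw", "go"]:
    print(word, "->", look_up(word))
```

Notice that "saw" comes back with two candidate entries. The computer cannot tell which one the author meant, which is why the result is the possible dictionary definition rather than the certain one; the same limitation is what puts Paris, France on a map of the Agamemnon.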
Crosswalk.com. Now both Intratext and Perseus can be rather intimidating for the beginning user. By meeting the high-end needs of researchers, they end up being somewhat scary for the true beginner. There are a number of Christian Bible sites, however, which focus exactly on the needs of readers who are very interested in the texts, but perhaps not very savvy with the technology. I especially rely on the Interlinear Bible at Crosswalk which has very user-friendly dictionary and concordance tools. For the Hebrew Bible or Greek New Testament, you can quickly access entries in the Hebrew and Greek lexicons that they have online (including audio files for each word); you can also view other instances of this word in the Biblical text. There are links to a variety of standard commentaries but unfortunately these are quite old and out-of-date, since the inclusion of these materials is subject to copyright restrictions. My dream vision: Anchor Bible commentaries online! And this brings us back to what Rob was saying about buying “chunks” of texts when needed: I would never buy an entire Anchor Bible Commentary if I need the commentary for just one verse - but I would gladly buy a subscription (even a pricey subscription!) to have access to Anchor Bible Commentaries online, available verse-by-verse as I am working on a specific Biblical text.
Perhaps some of you are surprised that it is Christian texts and classical Greek and Latin authors that are getting the best treatment here in terms of sophisticated analysis tools at these online digital libraries. But really, it is not surprising at all! These are texts that have provoked religious or secular devotion for centuries, and institutional resources of both churches and universities have been continually dedicated to the promotion and analysis of these texts. Just as the Jesuits were laboring over the manually-constructed concordance of the Vulgate that I used to rely on, now the good folks at Intratext (closely connected with the Vatican) are working hard at digital labors for the new millennium. And likewise: there were devoted classical scholars who constructed, by hand, enormous concordances to the works of Ovid or Vergil; their modern counterparts are now able to make digital leaps and bounds, reworking these same texts with computer-based tools. So, don’t forget: behind the computers, there are people who work very, very hard to build the computerized tools that “effortlessly” search and analyze these digital texts. We definitely owe these people our thanks for making the results of their work available to us, for free, over the Internet.


1 Response to “Digital Features for Digital Texts: Automatic Text Analysis”


  1. Laura Gibbs

    Great info! Thanks Laura. I love Perseus, it’s a true paradigm demonstrating the unique virtues of a digital library. And by the way, I wonder, are all Jesuit books this enormous? I felt I was the luckiest teenager in the world when I received as a gift a huge Jesuit book with awesome geometry problems. The best problems ever. It was taking up so much space on my desk I had to keep it by my bed -I could have used it as a night stand if I wanted to.. :-)
