BASHing–get it?? (Too violent a pun? See also: hack.)
MARBL houses and owns the rights to the poet Turner Cassity’s papers, including born digital materials from one computer. Dorothy Waugh, my colleague on the Digital Archives team, processed the born digital materials and is now working to get them online and publicly available on an Omeka site. I am experimenting with text analysis of his born digital materials.
So far, my experimentation has entailed using the Stanford Natural Language Processing group’s Named Entity Recognition software, following this tutorial by William Turkel.
My introduction to the command line
Over the past month or so, I have become somewhat comfortable on the command line by using it to run text analysis tools and generate network graph data.
I started by working through bits of Prof Hacker’s Guide to the Command Line, at the suggestion of Sara Palmer, electronic text specialist at ECDS.
Since then, I have mastered basic navigation among directories. “cd [name of directory]” and “ls” (which lists the files and directories within the current directory) came to be easy. Things also went better once I realized that you either have to navigate to the folder that holds the files you’re working with, or else include the file path in your command. I learned how to create and edit files on the command line with vim. Typing “vim” plus the name of a new file creates and opens that file within Terminal; you then hit “i” to insert text, “esc” to leave insert mode, and “:wq” to save and quit. As with many of these sorts of things, internet searches are your friend when you get stuck.
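As a minimal sketch of the two options described above (navigating to the folder versus spelling out the path in the command), the commands below build a throwaway directory and read the same file both ways. The directory and file names are hypothetical, invented only for illustration.

```shell
# Scratch directory; these names are made up and are not from the project.
workdir=$(mktemp -d)
mkdir "$workdir/papers"
printf 'first draft\n' > "$workdir/papers/poem.txt"

# Option 1: navigate to the folder, then refer to the file by name alone.
cd "$workdir/papers"
ls
from_inside=$(cat poem.txt)

# Option 2: stay in the parent folder and include the path in the command.
cd "$workdir"
from_outside=$(cat papers/poem.txt)

echo "$from_inside"
echo "$from_outside"
```

Both approaches read the same file; which one is more convenient mostly depends on how many commands you plan to run against that folder.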
Preparing files for text analysis
First, I wanted to get the text from all of the PDFs of Cassity’s files into a single plain text file. To do so, I got pdftotext and used Ken Benoit’s instructions to convert the folder of PDFs to plain text. Following his instructions, I saved the following script as “convertmyfiles.sh”:
#!/bin/bash
# Every PDF in the project folder
FILES=~/Documents/ECdsmarblprojects/642_TurnerCassityPapers/642_TurnerCassityPapers_Omeka/*.pdf
for f in $FILES
do
  echo "Processing $f file..."
  # Writes a .txt file alongside each PDF
  pdftotext -enc UTF-8 $f
done
for i in *.pdf; do mv "$i" "`echo $i | sed -e 's, ,_,g'`"; done
cat *.txt > everything.txt
open everything.txt
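The one-line rename loop above swaps the spaces in PDF names for underscores. That matters because the conversion script expands $f without quotes, so a name containing a space would be split into separate arguments. Here is a sketch of the same renaming step that can be tried safely on dummy files first. The scratch directory and file names are made up, and tr stands in for the sed call.

```shell
# Scratch directory with empty dummy files; the names are invented.
demo=$(mktemp -d)
cd "$demo"
touch "early poems.pdf" "letter to editor.pdf" "notes.pdf"

# Replace every space in each PDF name with an underscore.
for i in *.pdf; do
  new=$(printf '%s' "$i" | tr ' ' '_')
  # Only rename when the name actually changes.
  if [ "$i" != "$new" ]; then
    mv -- "$i" "$new"
  fi
done

renamed=$(ls)
echo "$renamed"
```

Running the rename before the conversion script means every path the script sees is a single word.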
Named Entity Recognition Software
Once I had my everything.txt file, I turned to Turkel’s tutorial on using Stanford Natural Language Processing group’s Named Entity Recognizer (NER) software.
stanford-ner/ner.sh everything.txt > cassity_ner.txt
Turkel’s tutorial then walked me through:
- removing all of the “/O” labels to create a clean copy with only the person, organization, and location labels:
- sed 's/\/O / /g' < cassity_ner.txt > cassity_ner_clean.txt
- using egrep to create files of the words that precede each of the /PERSON, /ORGANIZATION, and /LOCATION labels:
- egrep -o -f pattr cassity_ner_clean.txt > cassity_ner_pers.txt
- pattr is a file that gives the rules for what to retrieve: the string of letters that precedes each /PERSON label, with adjacent /PERSON strings joined (e.g., Maya/PERSON Angelou/PERSON would be retrieved as a single named entity):
- (([[:alpha:]]|\.)*/PERSON([[:space:]]|$))+
- sorting the lists of people, organizations, and locations by frequency:
- cat cassity_ner_loc.txt | sed 's/\/LOCATION//g' | sort | uniq -c | sort -nr > cassity_ner_loc_freq.txt
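The steps above can be tried end to end on a tiny sample before running them on the full file. The sketch below uses two invented sentences of tagged text (not actual NER output from Cassity’s papers), writes the /PERSON pattern to a file, and builds a frequency list. It uses grep -E, the equivalent of egrep, and adds one step that is not in the steps listed above: trimming trailing whitespace, so that a name matched mid-line and the same name matched at the end of a line are counted together.

```shell
# Scratch directory; every file name here is hypothetical.
demo=$(mktemp -d)
cd "$demo"

# Two invented sentences in word/LABEL form (not real NER output).
printf '%s\n' \
  'Galt/PERSON sailed/O to/O Lombok/LOCATION ./O' \
  'Maya/PERSON Angelou/PERSON met/O Galt/PERSON in/O Lombok/LOCATION ./O' \
  > sample_ner.txt

# Drop the /O labels that mark untagged words.
sed 's/\/O / /g' < sample_ner.txt > sample_ner_clean.txt

# The pattern file: one or more adjacent word/PERSON tokens.
printf '%s\n' '(([[:alpha:]]|\.)*/PERSON([[:space:]]|$))+' > pattr

# Keep only the matching stretches of text, one per line.
grep -E -o -f pattr sample_ner_clean.txt > sample_ner_pers.txt

# Strip the labels, trim trailing whitespace (the added step),
# count duplicates, and sort by count, highest first.
sed -e 's/\/PERSON//g' -e 's/[[:space:]]*$//' sample_ner_pers.txt \
  | sort | uniq -c | sort -nr > sample_ner_pers_freq.txt

cat sample_ner_pers_freq.txt
top_count=$(head -n 1 sample_ner_pers_freq.txt | awk '{print $1}')
top_name=$(head -n 1 sample_ner_pers_freq.txt | awk '{print $2}')
```

On this sample the list puts Galt (two mentions) ahead of Maya Angelou (one), and the two adjacent /PERSON tokens come through as a single entry.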
I ended up with three files with the lists of people, organizations, and locations, in order of frequency. The winners are (if most appearances in the text equals winning):
- Person: Galt
- Organization: Artemis (actually a personal name; organization seems to be the trickiest designation to determine, since its list is the most confused of the three). The first actual organization to appear on the list is Teatro Amazonas.
- Location: Lombok
Now, my next step will be to explore the best way to clean up or at least make more sense of the data. As Turkel says in the tutorial, even though the data isn’t perfect, it still is useful: “these errors are interesting, in the sense that they give us a bit better idea of what this text might be about.”
Turner Cassity and Place
Turner Cassity was a poet and librarian who was born and buried in Mississippi; he lived for much of his life in Georgia, and worked as a librarian at Emory; he also traveled and lived outside of the continental U.S. for significant periods of his life. I plan to use text analysis to explore the places he writes about in his poetry.
Cassity was known both as a southern poet and as a poet who wrote about a geographically diverse array of places. The Atlanta Journal-Constitution’s obituary of Turner Cassity foregrounds his identity as a southern poet who didn’t write about the South:
“He was so very Southern that he didn’t need to write about the region to prove it,” Dana Gioia, former chairman of the National Endowment for the Arts, wrote in an e-mail. “He didn’t write about conventional ‘Southern’ literary themes because he represented the more cosmopolitan side of Southern identity. He was also a Southern eccentric in the style of Flannery O’Connor or John Kennedy Toole.”
Critics often comment on Cassity’s relationship with place.
Cassity is very much a world-wandering poet, a globetrotter, and although the places that he responds to have something in common (mostly, for instance, they are post-colonial places), still the sheer irreducible variety of visitable places and climates is something brought powerfully home to us as we read Cassity.
Yi-Hsuan Tso notes:
[Cassity’s] poem “Cartography is an Inexact Science” accentuates his idea of the syndication of culture, suggesting that geography is defined more by people’s interrelationships and less by space.
Anne Donlon
A resource for learning the command line: Zed A. Shaw’s Command Line Crash Course.