HIST 300A – Digitizing History teaches how to conduct history projects in the 21st century through documentary film-making, web-design, and best practices in the digital humanities. Students will explore a variety of contemporary models for the production and consumption of historical information on the web — including commercial, non-profit, and government databases, as well as public history, journalistic, and other websites. They will then create a documentary film using professional equipment, to be housed on an interactive website of their own creation.
NATURAL LANGUAGE PROCESSING AND OPTICAL CHARACTER RECOGNITION:
THE VIEW FROM THE HUMANITIES
Christopher Church, Assistant Professor, History
Natural Language Processing (NLP) presents numerous opportunities for humanities research, particularly in the field of history. For one, NLP allows historians to overcome their current inability to examine cultural memes in context, measuring how much or how frequently something was said relative to everything else that was said. Additionally, it enables historians to map linguistic and cultural change over time, or to paint a synchronic picture of a particular decade, movement, or ideology. In short, it allows historians (and humanists more generally) to obtain a bird’s-eye view of their material. However, there are real challenges to performing NLP on historical documents, namely issues related to Optical Character Recognition (OCR). Rather than relying upon “born-digital,” “found,” or “curated” data sets, historians must often create their data themselves from spotty archives, degraded materials, or handwritten documents. This presentation will explore the importance of OCR to NLP in the humanities, while attending to the pitfalls of relying too heavily on curated data and proposing some ways to overcome the inherent messiness of the data with which humanists wrestle.
Sometimes you may want to add unique identifiers (UIDs) to your data in OpenRefine (e.g., when migrating the data into a Database Management System (DBMS) like Access or FileMaker).
It’s nice to have a set number of leading zeroes, especially if you’ll sort your data alphabetically.
To do this, you’ll need to add a new column based on an existing column, which will bring up a dialogue window: Edit column > Add column based on this column…
For your GREL (Google Refine Expression Language) expression, enter the following:
“0000”[0, 4 - row.index.toString().length()] + row.index
- Make sure to enter a name for the new column (circled in blue in the screenshot above).
* * *
Here’s what the GREL means:
- row.index is a built-in variable for the row number, counting from the top and beginning with 0.
- “0000” is a string of four zeroes that will be sliced to provide the padding.
- row.index.toString().length() is the number of characters in row.index once it’s converted to a string, so “1981” would have a length of 4, whereas “30” would have a length of 2.
- [0, 4 - row.index.toString().length()] slices the string of zeroes to keep however many are needed to bring the total number of digits to 4. If the index is “13” (a length of 2 characters) and you want four digits total (0013), it will take only 2 zeroes from the string.
- finally, + row.index concatenates the original index onto the preceding zeroes, so in the example above it joins “00” and “13” to get “0013”.
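Since GREL expressions are awkward to test outside of OpenRefine, here is the same slicing-and-concatenation logic mirrored in Python as a minimal sketch. The helper name `pad_index` is illustrative, not part of OpenRefine:

```python
def pad_index(index):
    """Mirror the GREL recipe: slice just enough zeroes from a
    four-zero string to pad the row index out to four digits,
    then concatenate the index itself."""
    index_str = str(index)                  # GREL: row.index.toString()
    zeroes = "0000"[0:4 - len(index_str)]   # e.g. index 13 -> "00"
    return zeroes + index_str               # "00" + "13" -> "0013"

print(pad_index(13))    # -> 0013
print(pad_index(1981))  # -> 1981
```

Note that an index already four digits long gets an empty slice of zeroes, so nothing extra is prepended.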
You can increase the number of leading zeroes to however many you need, but you’ll need to make a few changes.
- First, you’ll need to update “0000” to match however many digits you want.
- Then you’ll need to change the 4 in 4 - row.index.toString().length() to X, where X equals the total number of digits.
For example, to increase the total number of digits to 6, change the expression to
- “000000”[0, 6 - row.index.toString().length()] + row.index
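The generalized rule can be sketched the same way, with the pad width as a parameter. The function name `uid` and the `width` parameter are illustrative assumptions, not OpenRefine features:

```python
def uid(index, width=4):
    """Zero-pad a row index to `width` digits, GREL-style:
    build a string of `width` zeroes, slice off as many as the
    index needs, and concatenate the index."""
    index_str = str(index)
    return ("0" * width)[0:width - len(index_str)] + index_str

print(uid(13, 6))  # -> 000013
print(uid(42))     # -> 0042
```

In plain Python you would normally just write `str(index).zfill(width)`, which produces the same result; the slicing version is shown because it maps one-to-one onto the GREL expression.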
DATA WORKFLOWS AND NETWORK ANALYSIS
This workshop will cover methods of data retrieval, data cleaning, and visualization. Participants will discuss how websites are structured, learn how to collect a data set through web scraping, use tools like OpenRefine for cleaning and transforming data, and then visualize the results in Gephi, an open-source tool for network analysis.
Christopher Church is an assistant professor of history at the University of Nevada, Reno. Before joining the history department at UNR, he was the Program Coordinator at UC Berkeley’s D-Lab. He studies colonialism, citizenship, and environmental history. He is well versed in databases, GIS, scripting, network analysis, and web design. He is tasked with developing a digital humanities curriculum at UNR.
The Human Face of Big Data: the Promise and Perils of a Planetary Nervous System
Come watch the award-winning documentary, The Human Face of Big Data.
Thursday, April 30, 7pm – Wells Fargo Auditorium (MIKC 124)
Stay for hors d’oeuvres and a panel discussion featuring UNR faculty:
- Dr. Chris Church, Dept. of History
- Dr. Katherine Hepworth, Dept. of Journalism
- Kari Barber, MFA, Dept. of Journalism
- Dr. David Alvarez, Dept. of Biology
- Dr. Nicholas Seltzer, Dept. of Political Science
Pryor’s Peoria has been nominated for “Best Use of DH For Public Engagement.” The winner is determined by popular vote, so if you like the project, please vote!