Project Review: Mining the Dispatch for Textual Data on the Civil War
Using machine learning and probabilistic topic modelling, specifically MALLET, a tool created by Andrew Kachites McCallum of UMass Amherst, Robert K. Nelson, Director of the Digital Scholarship Lab at the University of Richmond, and other scholars created this project to analyze an archive of 112,000 pieces of text and almost 24 million words, in order to paint a picture of life in Richmond in the 1860s. This was a time when Richmond was “the center of the Civil War South” and, in Nelson’s own words, the most cited but least understood city of the period. Using these topic modelling, machine learning, and text analysis tools, Nelson analyzes texts from The Dispatch, a prominent Southern newspaper of the era, to examine what they were about. The material includes runaway slave advertisements, ordinary commercial advertisements, and other articles.
The tools used and their strengths and weaknesses.
Nelson explains how MALLET, a toolkit for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications, carries out a kind of “distant reading” to make sense of these large amounts of text. He identifies the inherent flaw in this process: the tool relies heavily on the frequencies of words in the articles, so it may not capture context in the way a person practicing traditional close reading would. Distant reading allows the digital humanist to analyze an entire archive of texts, while close reading allows the analysis of only samples from that archive. Nelson shows how efficient these methods are for querying digital humanities data sets and for identifying trends and patterns across large bodies of text and multiple sets of information.
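The reliance on word frequencies can be illustrated with a minimal sketch. It is not taken from Nelson's project or from MALLET; the helper name bag_of_words and both sentences are invented for illustration. It shows that two sentences with opposite meanings produce identical word counts, which is the kind of context a frequency-based method cannot see.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Lowercase the text, split it into alphabetic words, and count each word."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

# Two invented sentences (not from The Dispatch) that use the same words
# in a different order and therefore mean opposite things.
sentence_a = "The owner pursued the runaway"
sentence_b = "The runaway pursued the owner"

# Word order is discarded, so the two word-count tables are identical.
same_counts = bag_of_words(sentence_a) == bag_of_words(sentence_b)
print(same_counts)
```

A close reader distinguishes these sentences immediately, whereas a method that sees only the counts treats them as the same document.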
Purpose and questions posed by Nelson.
The project analyzes textual data en masse, using statistical and mathematical methods to identify patterns in it and to derive qualitative information from those patterns. Nelson tries to answer qualitative questions such as: “What effect did the Civil War have on the rate of slave escapes?” “How important was the Civil War in giving slaves a chance to run away and escape the harsh conditions in the South?” “How did slavery evolve during the Civil War? Did it increase, decrease, or take on new shapes and forms?” Graphs that plot the number of slave hiring advertisements over time allow the digital humanist to relate the intensity and progress of the war to the demand for and supply of slave labor, as reflected in the volume of advertisements in The Dispatch.
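One way a chart of advertisements over time could be derived from topic model output is sketched below. This is an assumption about the general approach, not a description of how Nelson built his graphs: the records, the 0.5 threshold, and the function name monthly_counts are all hypothetical, and the numbers are invented rather than drawn from The Dispatch.

```python
from collections import defaultdict

# Invented records: (publication date, share of the article assigned to a
# hypothetical "slave hiring" topic). None of these values are real data.
articles = [
    ("1861-01-03", 0.82),
    ("1861-01-15", 0.05),
    ("1861-01-20", 0.64),
    ("1861-02-04", 0.71),
    ("1861-02-18", 0.10),
]

def monthly_counts(records, threshold=0.5):
    """Count, per year-month, the articles whose topic share meets the threshold."""
    counts = defaultdict(int)
    for date, share in records:
        if share >= threshold:
            counts[date[:7]] += 1  # "YYYY-MM" prefix of the ISO date
    return dict(sorted(counts.items()))

print(monthly_counts(articles))
```

The resulting month-to-count mapping is the kind of series that could then be handed to a charting library for plotting.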
Nelson asks why the market for slave labor remained largely constant over time until a sudden, unexplained drop in the hiring market around 1862. Analyzing large data sets in this way has the inherent weakness of neglecting contextual information. Nelson argues that the methods used to analyze The Dispatch are most useful when they enable digital humanities scholars to identify patterns they would not otherwise have found through close reading.
Slavery is a prominent part of this humanities project: Nelson analyzes these bodies of text to identify trends and patterns in slavery and the ways the Civil War influenced it. In examining the project, it would be useful to ask Nelson which factors other than the war influenced the survival of slavery, and what impact slavery had on the slaves themselves and on the social fabric of the period.
Nationalism and patriotism are analyzed through the poems, articles, and essays published in the newspaper that rallied anti-Northern sentiment and support for the Southern war effort.
In all, the project explores 40 different topics and identifies relationships among the variables and metrics that shaped the politics and developments of the period. Other topics include military conflict, soldiers, the economy, politics, local news, and advertisements.
History and process.
The Mining the Dispatch digital humanities project uses a data set of issues of The Dispatch from 1860 through 1865, spanning 112,000 pieces of writing and almost 24 million words. The program applies an algorithm to these texts to give a view of sentiment on various issues, such as slavery and runaway slaves, based on how often those issues appear in the newspaper and on the keywords used in each article. MALLET picks up keywords from each article and groups them into topics to deduce what an article is about, drawing on machine learning, text analysis, and probabilistic modelling. While the program is highly efficient for large data sets, the incidence of contextually inaccurate results increases for narrower data sets.
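The idea of grouping keywords into topics can be sketched with a toy version of latent Dirichlet allocation fitted by collapsed Gibbs sampling. This is a minimal illustration written for this review, not MALLET's implementation and not Nelson's code; the function name fit_lda, the parameter values, and the four tiny documents are all assumptions made for the example.

```python
import random

def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]             # tokens of doc d in topic k
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                           # total tokens per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight each topic by how common it is in this document
                # and how strongly it is associated with this word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed topic proportions per document and the top words per topic.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    top_words = [
        sorted(vocab, key=lambda w: -topic_word[t][w])[:3] for t in range(num_topics)
    ]
    return theta, top_words

# Invented miniature "articles"; real input would be tokenized newspaper text.
docs = [
    "reward runaway jail reward owner".split(),
    "runaway reward jail owner runaway".split(),
    "regiment battle army general battle".split(),
    "army general regiment battle army".split(),
]
theta, top_words = fit_lda(docs, num_topics=2)
for t, words in enumerate(top_words):
    print("topic", t, words)
```

Each row of theta gives the estimated share of a document devoted to each topic. With documents this small the result depends on the random seed, which reflects the point above that narrower data sets tend to yield less reliable topics.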
The project also uses the Google Charts API and icons to generate charts and present data.