Exploring the Wright American Fiction Corpus

Joshua Catalano

The Wright American Fiction collection consists of nearly 3,000 works of 19th-century American fiction. The collection is based on a bibliography created by Lyle H. Wright covering the period 1851-1875. This essay explores a small subset (101 texts) of the total corpus using topic modeling and argues that this approach is a viable and productive way to conduct research on the corpus. Because this essay is partially a proof of concept, it offers several brief examples to highlight potential avenues of research.

To get a sense of the themes and content these 19th-century authors wrote about, I topic modeled a subset of 101 texts using LDA (Latent Dirichlet Allocation). The first attempt revealed that the topics were badly skewed by the presence of proper names. After filtering out a list of common 19th-century proper names created by Matthew Jockers, I retrained the model with 20 topics. The results of this second model (shown below) contain several unidentifiable categories, but topics 1, 3, 8, 9, 13, 14, 15, 16, 17, and 18 had identifiable themes that I labeled nautical, law, business, slavery, religion, gardening, music/slave dialect, frontier, and military strategy, respectively.
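The name-filtering step can be sketched as follows. This is a minimal Python illustration, not the code behind this essay (whose output comes from R). The `PROPER_NAMES` set is a hypothetical stand-in for Jockers's much longer list, and the simple regular-expression tokenizer is an assumption made for brevity.

```python
import re

# Hypothetical stand-in for the list of common 19th-century proper names;
# the real list is far longer.
PROPER_NAMES = {"mary", "john", "william"}

def tokenize(text):
    """Lowercase the text and split it into runs of alphabetic characters."""
    return re.findall(r"[a-z]+", text.lower())

def remove_names(tokens, names=PROPER_NAMES):
    """Drop every token that appears in the name list."""
    return [t for t in tokens if t not in names]

sample = "Mary raised her rifle as John crossed the river."
filtered = remove_names(tokenize(sample))
print(filtered)
```

Filtering before the word counts are built keeps frequent character names from dominating the topics. Any name missing from the list will still pass through to the model.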

Being interested in Native American history, I focused on topic 17 (frontier), which is clearly about the American frontier and interactions with Native Americans.

## Source: local data frame [15 x 3]
## Groups: topic [1]
## 
##    topic    term   beta
##    <int>   <chr>  <dbl>
## 1     17 indians 0.0048
## 2     17 isaline 0.0043
## 3     17  simons 0.0037
## 4     17    wild 0.0028
## 5     17  indian 0.0028
## 6     17  ground 0.0026
## 7     17 savages 0.0026
## 8     17 journey 0.0020
## 9     17   river 0.0019
## 10    17    camp 0.0019
## 11    17   rifle 0.0019
## 12    17    thar 0.0019
## 13    17 forward 0.0018
## 14    17  savage 0.0018
## 15    17 hampton 0.0018

Filtering for documents in which that topic makes up a substantial share revealed several texts written by Emerson Bennett, including The Phantom of the Forest, The Bride of the Wilderness, Wild scenes on the frontiers, or, Heroes of the West, The Pioneer’s Daughter and The Unknown Countess, and The Border Rover. I had not come across Bennett’s work before, so in this case the topic model identified an author who merits further study. In addition to Bennett, the model also identified several texts whose titles might not have piqued my interest had I merely perused a bibliographic list of the works. These texts included John Ballou’s The Lady of the West, or, The Gold Seekers and D.W. Belisle’s The American Family Robinson, or, The Adventures of a Family Lost in the Great Desert of the West.
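Selecting documents by their share of a topic might look like the sketch below. The filenames, proportions, and the 0.25 cutoff are all invented for illustration. The essay does not state what threshold counted as "significant," and `gamma` is simply the conventional name for an LDA model's per-document topic proportion.

```python
# Hypothetical per-document topic proportions (LDA "gamma" values).
# Filenames and numbers are invented for illustration only.
doc_topics = [
    {"document": "text_a.txt", "topic": 17, "gamma": 0.62},
    {"document": "text_a.txt", "topic": 3, "gamma": 0.05},
    {"document": "text_b.txt", "topic": 17, "gamma": 0.08},
    {"document": "text_c.txt", "topic": 17, "gamma": 0.41},
]

def documents_for_topic(rows, topic, threshold=0.25):
    """Return (document, gamma) pairs whose share of `topic` meets the
    threshold, sorted from the largest share to the smallest."""
    hits = [
        (row["document"], row["gamma"])
        for row in rows
        if row["topic"] == topic and row["gamma"] >= threshold
    ]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

frontier_docs = documents_for_topic(doc_topics, topic=17)
print(frontier_docs)
```

Lowering the threshold widens the reading list, and raising it narrows the list to texts dominated by the topic.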

Beyond identifying relevant texts through the topics, computing weighted term frequencies (tf-idf) also proved useful: it highlighted texts that frequently employed the terms “indian” and “savage.”

## Selecting by tf_idf

## # A tibble: 10 × 7
##       filename   word     n total_words          tf       idf       tf_idf
##          <chr>  <chr> <int>       <int>       <dbl>     <dbl>        <dbl>
## 1  VAC5800.txt indian    59        4608 0.003703471 0.5897688 0.0021841919
## 2  VAC5760.txt indian    57        6575 0.002102933 0.5897688 0.0012402443
## 3  VAC5799.txt indian    85        6941 0.002044645 0.5897688 0.0012058681
## 4  VAC5789.txt indian    65        6535 0.001948441 0.5897688 0.0011491299
## 5  VAC5790.txt indian    73        7920 0.001907848 0.5897688 0.0011251895
## 6  VAC5742.txt indian    31        5614 0.001810853 0.5897688 0.0010679849
## 7  VAC5799.txt savage    82        6941 0.001972481 0.5375831 0.0010603727
## 8  VAC5789.txt savage    62        6535 0.001858513 0.5375831 0.0009991052
## 9  VAC5769.txt indian    50        6396 0.001678697 0.5897688 0.0009900434
## 10 VAC5728.txt indian    36        5940 0.001477772 0.5897688 0.0008715438
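The idf and tf_idf columns can be checked with a short calculation. The sketch below assumes the natural-log definition idf = ln(N / df). The document counts of 56 (for “indian”) and 59 (for “savage”) are not stated in the essay. They are inferred here because they reproduce the idf values in the table when N = 101.

```python
import math

def idf(n_documents, n_containing):
    """Inverse document frequency using the natural logarithm."""
    return math.log(n_documents / n_containing)

def tf_idf(tf, idf_value):
    """Weight a term frequency by its inverse document frequency."""
    return tf * idf_value

N_TEXTS = 101  # size of the subset described in the essay

# Assumed document frequencies, inferred from the table's idf column.
indian_idf = idf(N_TEXTS, 56)
savage_idf = idf(N_TEXTS, 59)

# Row 1 of the table: tf and idf as printed for VAC5800.txt.
row1 = tf_idf(0.003703471, 0.5897688)

print(round(indian_idf, 7), round(savage_idf, 7), round(row1, 10))
```

If the inferred counts are right, both terms appear in more than half of the 101 texts. That would explain why their idf values are modest and why the ranking is driven mainly by term frequency.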

Exploring the topic breakdown of these texts revealed that they were composed largely of topics other than the frontier. Given how heavily these texts used the word “indian,” I was surprised that topic 17 accounted for only a marginal share of them. I suspected that there might be other discourses concerning Native Americans in texts that were not necessarily about the frontier. In an attempt to capture this nuance, I trained another model with 50 topics.

This 50-topic model (above) has several topics (4, 17, 26, and 35) that include the word “indian.” This suggests the existence of multiple distinct discourses surrounding Native Americans in 19th-century fiction. Enlarging the study to include the entire corpus may reveal more nuances within the topic models and help identify these different discourses. While by no means conclusive, this essay demonstrates that topic modeling is a useful way to conduct research on the Wright American Fiction collection.