The Wright American Fiction collection consists of nearly 3,000 works of 19th-century American fiction. The collection is based on a bibliography created by Lyle H. Wright covering the period 1851-1875. This essay explores a small subset (101 texts) of the total corpus using topic modeling. It argues that this approach is a viable and productive way to conduct research on the corpus. Because this essay is partly a proof of concept, several brief examples are provided to highlight potential avenues of research.
To get a sense of the themes and content these 19th-century authors wrote about, I topic modeled a subset of 101 texts using LDA (Latent Dirichlet Allocation). The first attempt revealed that the topics created by the model were badly skewed by the presence of proper names. After I filtered out a list of common 19th-century proper names created by Matthew Jockers, I retrained the model with 20 topics. The results of this second model (shown below) contain several unidentifiable categories, but topics 1, 3, 8, 9, 13, 14, 15, 16, 17, and 18 had identifiable themes, which I labeled nautical, law, business, slavery, religion, gardening, music, slave dialect, frontier, and military strategy, respectively.
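The name-filtering step described above can be sketched in a few lines. This is only an illustration, not the author's code: the stoplist entries are placeholders (apparent character names that still surface among the topic 17 terms), the Jockers list itself is not reproduced here, and `tokenize` and `remove_names` are hypothetical helper names.

```python
import re

# Placeholder stoplist for illustration only. The essay used a list of common
# 19th-century proper names compiled by Matthew Jockers, not these entries.
NAME_STOPLIST = {"isaline", "simons", "hampton"}

TOKEN_RE = re.compile(r"[a-z]+")


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return TOKEN_RE.findall(text.lower())


def remove_names(tokens, stoplist=NAME_STOPLIST):
    """Drop every token that appears in the proper-name stoplist."""
    return [tok for tok in tokens if tok not in stoplist]


# Invented sample sentence, not a quotation from the corpus.
sample = "Isaline watched the Indians cross the river near the camp."
filtered = remove_names(tokenize(sample))
print(filtered)
```

Filtering before the model is trained matters because LDA groups words that co-occur; character names co-occur heavily within a single novel, so they tend to pull topics toward individual books rather than shared themes.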
Being interested in Native American history, I focused on topic 17 (frontier), which is clearly about the American frontier and interactions with Native Americans. Its top 15 terms, ranked by beta (the per-topic word probability), are:
## Source: local data frame [15 x 3]
## Groups: topic [1]
##
##    topic    term   beta
##    <int>   <chr>  <dbl>
## 1     17 indians 0.0048
## 2     17 isaline 0.0043
## 3     17  simons 0.0037
## 4     17    wild 0.0028
## 5     17  indian 0.0028
## 6     17  ground 0.0026
## 7     17 savages 0.0026
## 8     17 journey 0.0020
## 9     17   river 0.0019
## 10    17    camp 0.0019
## 11    17   rifle 0.0019
## 12    17    thar 0.0019
## 13    17 forward 0.0018
## 14    17  savage 0.0018
## 15    17 hampton 0.0018
Filtering for documents composed in significant part of that topic revealed several texts written by Emerson Bennett, including The Phantom of the Forest; The Bride of the Wilderness; Wild Scenes on the Frontiers, or, Heroes of the West; The Pioneer’s Daughter and The Unknown Countess; and The Border Rover. I had not come across Bennett’s work before, so in this case the topic model identified an author who merits further study. In addition to Bennett, the model identified several texts whose titles might not have piqued my interest had I merely perused a bibliographic list of the works. These texts included John Ballou’s The Lady of the West, or, The Gold Seekers and D.W. Belisle’s The American Family Robinson, or, The Adventures of a Family Lost in the Great Desert of the West.
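As a rough sketch of that filtering step, the snippet below keeps only the documents whose share of a given topic meets a cutoff. Everything in it is assumed for illustration: the document names and proportions are invented, the function name is hypothetical, and the 0.25 threshold is arbitrary, since the essay does not state what counted as "significant."

```python
# Hypothetical per-document topic proportions, keyed by (document, topic).
# These values are made up for illustration and are not results from the essay.
gamma = {
    ("doc_a.txt", 17): 0.62,
    ("doc_a.txt", 9): 0.05,
    ("doc_b.txt", 17): 0.08,
    ("doc_b.txt", 13): 0.71,
    ("doc_c.txt", 17): 0.33,
}


def documents_for_topic(gamma, topic, threshold):
    """Return (document, proportion) pairs for one topic, highest first."""
    hits = [
        (doc, share)
        for (doc, t), share in gamma.items()
        if t == topic and share >= threshold
    ]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# 0.25 is an arbitrary cutoff chosen for this sketch.
frontier_docs = documents_for_topic(gamma, topic=17, threshold=0.25)
print(frontier_docs)
```

Raising or lowering the threshold trades precision for recall: a high cutoff returns only texts dominated by the topic, while a low one also surfaces texts that merely touch on it.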
Beyond identifying relevant texts through the topics, weighting term frequencies by inverse document frequency (tf-idf) also yielded interesting results, highlighting texts that frequently employed the terms “indian” and “savage.”
## Selecting by tf_idf
## # A tibble: 10 × 7
##       filename   word     n total_words          tf       idf       tf_idf
##          <chr>  <chr> <int>       <int>       <dbl>     <dbl>        <dbl>
## 1  VAC5800.txt indian    59        4608 0.003703471 0.5897688 0.0021841919
## 2  VAC5760.txt indian    57        6575 0.002102933 0.5897688 0.0012402443
## 3  VAC5799.txt indian    85        6941 0.002044645 0.5897688 0.0012058681
## 4  VAC5789.txt indian    65        6535 0.001948441 0.5897688 0.0011491299
## 5  VAC5790.txt indian    73        7920 0.001907848 0.5897688 0.0011251895
## 6  VAC5742.txt indian    31        5614 0.001810853 0.5897688 0.0010679849
## 7  VAC5799.txt savage    82        6941 0.001972481 0.5375831 0.0010603727
## 8  VAC5789.txt savage    62        6535 0.001858513 0.5375831 0.0009991052
## 9  VAC5769.txt indian    50        6396 0.001678697 0.5897688 0.0009900434
## 10 VAC5728.txt indian    36        5940 0.001477772 0.5897688 0.0008715438
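The weighting behind a table like this can be sketched as follows, assuming the common definition in which tf is a term's share of a document's tokens and idf is the natural log of the number of documents divided by the number of documents containing the term. The toy corpus and the `tf_idf` function are invented for illustration. The final two lines are an inference rather than something the essay reports: the idf values in the table are consistent with "indian" appearing in 56 of the 101 texts and "savage" in 59.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute tf-idf for every (document, term) pair.

    docs maps a document name to its list of tokens.
    tf  = term count / total tokens in the document
    idf = ln(number of documents / number of documents containing the term)
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document

    scores = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        for term, n in counts.items():
            tf = n / total
            idf = math.log(n_docs / doc_freq[term])
            scores[(name, term)] = tf * idf
    return scores


# Toy corpus, invented for illustration only.
docs = {
    "a.txt": ["indian", "camp", "river", "indian"],
    "b.txt": ["garden", "river", "flower"],
    "c.txt": ["ship", "river", "sail", "indian"],
}
scores = tf_idf(docs)

# Inferred document frequencies (56 and 59 of 101), not reported in the essay.
idf_indian = math.log(101 / 56)
idf_savage = math.log(101 / 59)
```

A term that occurs in every document, such as "river" in the toy corpus, receives an idf of zero and therefore a tf-idf of zero, which is why the measure favors terms concentrated in a subset of texts.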
Exploring the topic breakdown of these texts revealed that they were composed of other topics distinct from the frontier. Given how heavily these texts used the word “indian,” I was surprised that topic 17 accounted for only a marginal share of them. I suspected that there might be other discourses concerning Native Americans in texts that were not necessarily about the frontier. In an attempt to capture this nuance, I trained another model with 50 topics.
The 50-topic model (above) has several topics (4, 17, 26, and 35) that include the word “indian.” This suggests the existence of multiple distinct discourses surrounding Native Americans in 19th-century fiction. Enlarging the study to include the entire corpus may reveal more nuances within the topic models and help identify these different discourses. While by no means conclusive, this essay demonstrates that topic modeling is a useful way to conduct research on the Wright American Fiction collection.