Methods, Data, and Tools

The data for this project was developed by Alex Bice as part of his Certificate in Digital Humanities project at Northeastern University. The data can be accessed through the Following “The North Star” GitHub repository.

The tools used for this project included the programming language Python, as well as the text visualization tool Lexos Online, which was used to clean up the OCR obtained from the Library of Congress website. MALLET was used to run the topic model. AntConc, Lexos Online, and Voyant were used to explore the full corpus and to visualize how individual words were used over the course of the paper’s publication. This work helped unpack the topic models and informed some of the labels used in the topic model visualizations. Datawrapper and Tableau Public were used to create graphs and charts.

This project sought to combine topic modeling and the close reading of first articles to create an effective overview of The North Star that would be accessible to a public audience. Below is some additional information about each method that is not covered on the front page of the website.

First Articles

First articles were used to create a database for comparison with the topic modeling information. The first articles were chosen for several reasons. First, they represented a self-contained group that carried over from issue to issue depending on the size of the article. Some major published works or speeches covered the front page of two or three issues, but most took up space in only one. Second, these articles were often of considerable length, enough to benefit from close reading without being unmanageable. Third, the first articles drew from multiple types of material. Instead of always being a letter or opinion piece, the section consistently brought in new kinds of material, which allows for a better overview of what the paper drew from. None of this is to say that these articles were the “most” important or that they would have been the first that readers interacted with. Rather, they represent a diverse and contained group that allows for the synthesis of many perspectives and arguments while simultaneously engaging in the work of corpus creation.

The end result of the close reading of the first articles was a list of approximately 40 perspectives or arguments that were selected and developed through continuous engagement with the paper over the spring and summer of 2020, as the topic modeling corpus was being created. Given this process, there is room for interpretation regarding the topics and arguments, partially because of the choice to select the first article in each edition of the paper: these articles tended to be longer and to focus on politics or organizing. After choosing approximately three topics for each article (some had more or fewer, depending on the article’s length), I used a formula in Microsoft Excel to turn the selected topics into a faceted list so that all topics could be compared against one another.
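The original reshaping was done with an Excel formula; a rough Python equivalent of the same step, with illustrative file and column names (not the project's actual spreadsheet), might look like this:

```python
# A rough sketch of turning per-article topic assignments into a faceted
# list, assuming a spreadsheet with one row per first article and one
# column per assigned topic. File and column names are illustrative.
import pandas as pd

articles = pd.read_excel("first_articles.xlsx")  # e.g. issue_date, topic_1, topic_2, topic_3

# Melt the wide table into a long, faceted list: one row per (issue, topic)
# pair, so every topic can be compared against all the others.
faceted = articles.melt(
    id_vars=["issue_date"],
    value_vars=["topic_1", "topic_2", "topic_3"],
    var_name="topic_slot",
    value_name="topic",
).dropna(subset=["topic"])

# Count how often each perspective or argument appears across the run of the paper.
print(faceted["topic"].value_counts())
```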

As with all attempts to place data into categories, there is room for error and for other interpretations of how things relate. The data for the first articles was produced by close reading each article for specific themes and arguments that were common over the publication period of the paper.

Topic Analysis

The corpus for the text analysis was generated specifically for this project, as there was no way to bulk download the OCR from the Library of Congress website. While this posed problems early on, it also allowed the corpus to be customized to exclude advertisements. The next problem encountered in corpus creation was a line break issue in the OCR. The Library of Congress OCR had no space at the end of each line of text, even when a line ended in the middle of a word. To address this, I created a multi-step Python process that removed the line breaks and noted where they had previously been located in the text. Afterwards, another Python program used regular expressions to determine whether the characters around each former line break represented two unique words or one larger word, using the LibreOffice English-language dictionary as the point of comparison with my corpus. Any characters surrounding a line break that exactly matched a word in the English dictionary were kept as a single word; everything else was kept as two words split at the original line break. The process was not perfect, but it did materially improve the quality of the results. Given the lack of access to batches of the newspaper from the Library of Congress, this was an attempt to take a very messy OCR corpus and go through the entire process of cleaning it to create a product usable for text analysis.
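As a rough illustration, a minimal sketch of the break-repair logic might look like the following; the file names, the marker token, and the plain-text wordlist format are assumptions for the example, not the project's actual scripts.

```python
# Minimal sketch of the line-break repair step, assuming a plain-text
# wordlist (one word per line) derived from the LibreOffice dictionary.
# File names and the JOIN_MARKER token are illustrative.
import re

JOIN_MARKER = "<BR>"  # placeholder recording where a line break used to be

def load_dictionary(path="en_US_wordlist.txt"):
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def mark_line_breaks(text):
    # Replace every newline with a marker so the original break
    # positions survive into a single running string.
    return text.replace("\n", JOIN_MARKER)

def repair_breaks(marked_text, dictionary):
    # For each pair of fragments around a former line break, keep them
    # joined only if the joined form exactly matches a dictionary word;
    # otherwise keep them as two words split at the original break.
    def decide(match):
        left, right = match.group(1), match.group(2)
        return left + right if (left + right).lower() in dictionary else left + " " + right

    pattern = re.compile(r"(\w+)" + re.escape(JOIN_MARKER) + r"(\w+)")
    repaired = pattern.sub(decide, marked_text)
    # Breaks not flanked by word characters simply become spaces.
    return repaired.replace(JOIN_MARKER, " ")

if __name__ == "__main__":
    words = load_dictionary()
    with open("north_star_issue.txt", encoding="utf-8") as f:
        raw = f.read()
    print(repair_breaks(mark_line_breaks(raw), words))
```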

Topic modeling for this project was undertaken using the MALLET program. I selected this program because of its ease of use and because it had been used in other projects I encountered while researching topic modeling and text analysis methods. The topic model included on the project website used 10,000 iterations and 40 topics. Of these topics, three appeared to be made up predominantly of OCR errors.
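A run with those settings could be reproduced roughly as follows. This is a hedged sketch only: it assumes the MALLET executable is on the PATH and that the cleaned issue files sit in a corpus/ directory; the paths and output file names are illustrative, not the project's.

```python
# Sketch of driving MALLET's import-dir and train-topics tools from Python.
# Assumes "mallet" is on the PATH; paths and file names are illustrative.
import subprocess

MALLET = "mallet"

# Import the cleaned issue files into MALLET's binary corpus format.
subprocess.run([
    MALLET, "import-dir",
    "--input", "corpus/",
    "--output", "north_star.mallet",
    "--keep-sequence",
    "--remove-stopwords",
], check=True)

# Train the topic model with the settings used on the website:
# 40 topics, 10,000 iterations.
subprocess.run([
    MALLET, "train-topics",
    "--input", "north_star.mallet",
    "--num-topics", "40",
    "--num-iterations", "10000",
    "--output-topic-keys", "topic_keys.txt",
    "--output-doc-topics", "doc_topics.txt",
], check=True)
```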

The final step in creating a usable list of topics was writing labels for each of the 40 lists of keywords. Labels were added to each of the topics in order to make them easier to interact with and relate to. The process of determining a label is partially recreated in the thematic pages, which highlight peaks where a topic occurred. Graphing each topic over time allowed me to locate when each topic peaked. From there, I would keyword search that particular edition to find the most common words associated with the topic. Reading through these articles and determining what similarities they shared was the central part of creating a label. As with the labels for first articles, these are subject to interpretation and have room for continued refinement.
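The peak-finding step could be sketched as follows, assuming MALLET's newer --output-doc-topics format (one proportion column per topic, in topic order, after the document index and file name); the file name and topic number are illustrative.

```python
# Sketch of locating the issues where a given topic peaks, from a MALLET
# doc-topics file. Assumes one proportion column per topic; the file name
# and the example topic_id are illustrative.
def peak_documents(doc_topics_path, topic_id, top_n=5):
    rows = []
    with open(doc_topics_path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("#"):   # skip the header line, if present
                continue
            fields = line.strip().split("\t")
            doc_name = fields[1]
            proportions = [float(x) for x in fields[2:]]
            rows.append((doc_name, proportions[topic_id]))
    # Sort issues by how much of their text the chosen topic accounts for.
    rows.sort(key=lambda pair: pair[1], reverse=True)
    return rows[:top_n]

if __name__ == "__main__":
    for name, weight in peak_documents("doc_topics.txt", topic_id=12):
        print(f"{name}\t{weight:.3f}")
```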

Not all outputs of the topic model are housed on the website. This is primarily a choice driven by the project’s audience and by what could be clearly communicated and engaged with. The clearest group of topics not discussed on the website is the one related to Whig and Free Soil politics. While these have some scholarly implications, the paper’s clear engagement with politics is already represented in the “Washington D.C. in The North Star” and “The American South” themes. Significant engagement with the history of the Whigs and Free Soilers would add little for my audience while necessitating complicated political contextualization. The website also does not include topics that remained unknown (there were two) or that were mixed in ways that did not allow for clear communication of their subject matter.
