Project Data

Data Assessment:

1. When first opened, the TesseractOCR file is divided into reels that contain volumes of newspapers whose numbers roughly correspond to years (Volume 86 contains the newspaper text from 1986). The volume numbers are sometimes off from the actual dates, however, since they follow academic years rather than restarting each January. Each volume contains a single text document holding all of the OCR from every issue printed during the corresponding year. We have all text from 1886 until 2012 at our disposal, despite various inconsistencies. To start, the Ring-tum Phi mislabeled some of its own issues during the nineties. The OCR from that period was also not properly cleaned, and I was unable to locate clean, accurate OCR for 1998 and 1999, so those years have been omitted from this project.
2. The files could be markedly improved by raising the OCR quality for the years that were recognized very poorly. The text is sometimes completely unintelligible, as seen in the image below.

It would also have been helpful, at least for my particular project, to have the files divided cleanly by calendar year instead of by volume. Finally, smaller chunks would have been easier to process than one large block of text per volume; within the larger files it was difficult to tell which issue was which.
3. The text files are essentially large chunks of writing with very limited internal differentiation. There is some structural pattern: each issue has a section for advertisements whose images sometimes throw off the OCR. The newspapers also follow very similar layouts and lengths, averaging 7-10 pages, with similar front-page structures: a header, a cover story, and supplemental information along the sides.
4. Some files contain duplicate data: a raw version of the OCR alongside a cleaned version. This is the case for the years 2000-2012, and I will handle it by removing the worse text from the corpus before running any text analysis software.
5. Again, OCR quality varies over time. It is fairly consistent and usable up until 1994, where it becomes unreadable, as the picture above demonstrates. I gauged just how bad the OCR was for those years through close reading and by experimenting with Voyant: the most frequent "words" in those documents were jumbles of dashes and letters, which made the roughness obvious.
6. My research question examines how language surrounding the LGBTQ community changed, starting ten years before the first Day of Silence in 1996, during the AIDS crisis, and running up to the present day (or 2012, where the corpus ends). For OCR cleanup, I will prioritize organizing the data into more manageable chunks by creating a separate folder containing only the volumes from 1986-2012. I will work on obtaining cleaner OCR for the nineties (provided by Professor Brooks through Box), and I will relabel the volumes so they are easier to read and to upload into text analysis software. The biggest problem will be having enough words to track the evolution of the LGBTQ acronym, because the community was referred to, in 1990 for example, as GLBT, so I will have to make accommodations in my text analysis to compensate for these subtleties (a sketch of one approach follows below). My next biggest problem will be leaving the years 1998 and 1999 out of my data set due to the lack of available OCR.
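As a sketch of one way to handle the shifting acronym, the short Python snippet below collapses the variants (GLBT, LGBT, LGBTQ, bare LGB) into a single case-insensitive pattern and tallies them per year. The folder and file names are assumptions matching the naming scheme described under Data Cleanup; the actual searches for this project were run in AntConc.

import re
from collections import Counter
from pathlib import Path

# Hypothetical layout: one "<year> OCR.txt" file per year, following the
# renaming scheme described in the cleanup steps below.
VARIANTS = re.compile(r"\b(glbt|lgbtq|lgbt|lgb)\b", re.IGNORECASE)

counts_by_year = {}
for path in sorted(Path("OCR 1986-2012").glob("* OCR.txt")):
    year = path.stem.split()[0]
    text = path.read_text(errors="ignore")
    # Lowercase each hit so GLBT and glbt tally together
    counts_by_year[year] = Counter(m.lower() for m in VARIANTS.findall(text))

for year, counts in counts_by_year.items():
    print(year, dict(counts))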

Data Cleanup:

1. The OCR from 1986-2012 was organized into a folder separate from the original TesseractOCR file. The files were renamed with the year first, followed by "OCR." Example: “1996 OCR.”
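A minimal Python sketch of this step, assuming a hypothetical source layout (the real TesseractOCR folder and file names may differ; the volume-to-year mapping follows the description above and would need adjusting for volumes past 99):

import shutil
from pathlib import Path

src = Path("TesseractOCR")
dst = Path("OCR 1986-2012")
dst.mkdir(exist_ok=True)

for vol in src.glob("Volume *.txt"):
    number = int("".join(ch for ch in vol.stem if ch.isdigit()))
    year = 1900 + number  # Volume 86 -> 1986; adjust for volumes past 99
    if 1986 <= year <= 2012 and year not in (1998, 1999):  # omitted years
        shutil.copy(vol, dst / f"{year} OCR.txt")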

2. The data was cleaned by going through and removing the duplicate copies (years 2000-2010) and keeping the clean versions. Years 1990-1997 were added later, once Professor Brooks located the proper OCR.
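A sketch of the deduplication, assuming the duplicate years carry a raw file alongside a cleaned one (the " raw" suffix here is a hypothetical naming convention, not the actual file names):

from pathlib import Path

folder = Path("OCR 1986-2012")
for raw in folder.glob("* OCR raw.txt"):
    clean = folder / raw.name.replace(" raw", "")
    if clean.exists():  # keep the cleaned version, drop the raw one
        raw.unlink()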

3. Years 1998 and 1999 were omitted due to a lack of proper OCR and documentation (the text documents skip from 1997 to 2000).

4. Each year’s OCR text file was loaded directly into AntConc, the text analysis tool by Laurence Anthony.

5. Each year was searched with the following regex patterns to find key vocabulary terms:

a. Regex Code for “gay” search:
\b[Gg]ays?
*It is necessary to manually sort out last names beginning with “Gay.”

b. Regex Code for “lgbtq” related search:
glbt|GLBT|lgb|LGB|LGBT|lgbt|LGBTQ|lgbtq
*It is necessary to manually sort out strings of random letters due to OCR errors that result in one of the acronyms.

c. Regex Code for “homosexual” related search:
\b[Hh]omose.

d. Regex Code for “lesbian” search:
\b[Ll]esbian.

e. Regex Code for “queer” search:
\b[Qq]ueer.
*It is necessary to manually sort out uses of the word “queer” other than those related to the gay community.

f. Regex Code for “ally” search:
\b[Aa]lly|\b[Aa]lli.
**Specifically looking for Gay/Straight Alliance references

g. Regex Code for “homophobia” related search:
\b[Hh]omophob.

**The most important patterns are probably the ones for “gay” and “lgbtq,” because they behave as inverses of each other: together they show when the community gained more of a presence on campus and when the term “gays” declined.
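For reference, here is a Python sketch that runs the same seven patterns in batch and writes raw per-year counts to a CSV that could seed the spreadsheet in step 6. AntConc was the actual tool used; the folder name and output file are assumptions, and the counts are raw hits, before the manual concordance sorting described above.

import csv
import re
from pathlib import Path

# The seven AntConc patterns above, reproduced as Python regexes
PATTERNS = {
    "gay": r"\b[Gg]ays?",
    "lgbtq": r"glbt|GLBT|lgb|LGB|LGBT|lgbt|LGBTQ|lgbtq",
    "homosexual": r"\b[Hh]omose.",
    "lesbian": r"\b[Ll]esbian.",
    "queer": r"\b[Qq]ueer.",
    "ally": r"\b[Aa]lly|\b[Aa]lli.",
    "homophobia": r"\b[Hh]omophob.",
}

rows = []
for path in sorted(Path("OCR 1986-2012").glob("* OCR.txt")):
    year = path.stem.split()[0]
    text = path.read_text(errors="ignore")
    counts = {name: len(re.findall(pat, text)) for name, pat in PATTERNS.items()}
    rows.append({"year": year, **counts})

# Raw counts only; manual sorting of false positives still has to happen
# before these numbers can be trusted.
with open("term_counts.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["year", *PATTERNS])
    writer.writeheader()
    writer.writerows(rows)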

6. The data obtained from the regex searches, after being sorted manually with the Concordance function to gain contextual knowledge, was then entered into a large Excel spreadsheet, as seen below:

7. No stopwords were used, because I only used AntConc for this particular project.