Last spring, I finished up the BU Gastronomy program by creating a methodology to compare word frequency in cookbooks. This involved a long process that included finding digital txt versions of the cookbooks and converting them into word frequency visualizations in my Tableau Public account. Because I did not realize that the process itself would become my thesis, I did not adequately document my steps or the small but significant decisions I made along the way, and I had to recreate them later. Since graduation, I have learned the business and academic terms for several of the steps in my methodology, and I wish I had done several things differently early in the project. My current thinking has been guided by conferences I attended virtually in Summer 2021, The Data Sitters Club, and my current enrollment in a remote Data Analytics class at General Assembly.
As with any methodology, the question I am now considering is: how can I improve my thesis methodology in a new project? My current project idea came from a prompt in the Oxford Food Symposium call for paper proposals on Portable Food.
While in the Gastronomy program, I had worked on a project exploring food in Paddington Bear books, television shows, and movies. I vaguely recalled that portable marmalade sandwiches were present throughout the books and decided to explore the foodstuff's frequency and significance. I decided to start by scanning a boxed set of Paddington stories.
I am currently creating txt versions of the books in the boxed set for distant reading, aided by technology. Here is my workflow (subject to revision) for this current work stage from my methodology spreadsheet.
I started with the Box Set as the Initial Paddington Canon
Set goal to visualize frequency of portable food mentions using AntConc, Google Sheets, and Tableau Public
Realized and accepted that I would need to destroy the physical books for this project
Removed pages from binding
Tried to scan pages using automatic feeder at 200 dpi
Became frustrated that the small pages kept getting stuck in the scanner, and that I was scanning odd pages and then even pages, which would have to be reassembled in Adobe Acrobat
Spent some time scanning each page individually, which kept the pages in the correct order but took hours
Placed physical scanned documents into correct binding for future questions around page/book order and rescanning as needed
Placed scans into dedicated Dropbox folders for each day's scans
Made an initial spreadsheet to list Paddington Stories by book, year, and illustrator
Currently rearranging scanned pages in Adobe Acrobat, making each story its own binder
Currently running Optical Character Recognition (OCR) on scanned pages
Designed spreadsheet with initial columns
Getting ready for more scanning at 600 dpi
Planning to make a data dictionary defining the columns of the main spreadsheet
Need to create txt files from the scanned pages
Making plans to clean txt files
Plan to rescan specific pages as needed
Plan to import virtual books into Tropy with accurate metadata
Plan to clean data with documentation of cleaning practices
Need to consider new copyright rulings that would have made this whole process easier
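
Once the txt files exist, the counting step behind the goal above can be roughed out in a few lines of Python before anything goes into AntConc or Google Sheets. This is only a sketch of the idea, not my actual workflow: the term list, the function name `count_food_mentions`, and the sample sentence are all made up for illustration, and the plural handling is deliberately crude.

```python
import re
from collections import Counter

# Hypothetical term list; the real one would come from reading the stories.
PORTABLE_FOODS = ["marmalade sandwich", "marmalade", "bun", "cocoa"]


def count_food_mentions(text, terms):
    """Count whole-word mentions of each term, allowing a simple -s/-es plural."""
    # Lowercase and collapse whitespace so a phrase split across a line
    # break in the OCR output still matches.
    normalized = re.sub(r"\s+", " ", text.lower())
    counts = Counter()
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"(?:es|s)?\b"
        counts[term] = len(re.findall(pattern, normalized))
    # Note: terms are counted independently, so every "marmalade sandwich"
    # is also counted under "marmalade".
    return counts


# Made-up sample text standing in for one cleaned story file.
sample = (
    "Paddington kept a marmalade sandwich under his hat. "
    "Later he ate two marmalade\nsandwiches and a bun."
)
sample_counts = count_food_mentions(sample, PORTABLE_FOODS)
for term in PORTABLE_FOODS:
    print(term, sample_counts[term])
```

The same function could be run over each story's txt file, with the per-story counts pasted into the main spreadsheet as a cross-check against the numbers AntConc reports.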