This is an exploration of data-driven and software-driven approaches to extracting useful data from semi-structured electronic text that may contain OCR errors and other noise. The methods we outline are more accurate, and more cost-effective in terms of human time, than existing methods and tools. The resulting data is highly structured and suitable for many kinds of downstream applications.