Monday, January 27, 2020

Request for Tools to Turn Tables in PDFs into Spreadsheets

Question:

Hi DLI Community,

Can anyone recommend a good tool for using OCR to turn PDFs of scanned tables into spreadsheets? A professor here at the University of Toronto is working with 200+ tables from nineteenth-century Ontario government publications, and I’m trying to suggest tools and a workflow for him and his RA.

Example of scanned document: https://archive.org/details/reportofcommissi187986ontauoft/page/n31/mode/2up

So far, my proposed workflow is for them to clean the PDFs (if necessary) using Acrobat Pro, and then run OCR on them using https://www.onlineocr.net/ (which is free and fairly good). For particularly challenging tables, they could instead use something more powerful like OmniPage Ultimate (slow but useful proprietary OCR software, which we have on some library workstations). Finally, the tables can be manually corrected.
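
To make the manual correction step a little easier, something like the following Python sketch could turn plain-text OCR output into CSV and flag the rows that need a closer look. It is only an illustration under assumptions that are not part of the workflow above: that the OCR tool can export plain text, that columns come out separated by a tab or by two or more spaces, and that the number of columns is known in advance. The function names and the sample values are made up.

```python
import csv
import io
import re


def ocr_text_to_rows(text, expected_columns=None):
    """Split OCR'd table text into rows of cells.

    Assumes cells are separated by a tab or by two or more whitespace
    characters. Returns the rows plus the 1-based line numbers of rows
    whose cell count differs from expected_columns.
    """
    rows, suspect = [], []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines between table rows
        cells = re.split(r"\s{2,}|\t", line.strip())
        rows.append(cells)
        if expected_columns is not None and len(cells) != expected_columns:
            suspect.append(line_no)
    return rows, suspect


def rows_to_csv(rows):
    """Serialize rows as CSV text that a spreadsheet program can open."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Made-up sample: the third line lost a column during OCR.
sample = "County   Schools   Pupils\nYork     112       9,840\nPeel     61\n"
rows, suspect = ocr_text_to_rows(sample, expected_columns=3)
csv_text = rows_to_csv(rows)
```

Here `suspect` lists the lines whose column count is off, so the RA can start the manual check there instead of rereading every table.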

Do any of you have suggestions for OCR tools that worked for you, especially if they work in-browser and can create spreadsheets?

Answer:

You may want to have a look at Tabula (https://tabula.technology/). I have not used it extensively myself, but I remember that it was recommended by Vince Gray (of DLI fame), so it must be good!

--

I often use Camelot or its web version, Excalibur, but the document must be a text-based PDF. I don’t know of a decent tool to convert an image-based PDF to a text-based one.
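
For anyone who wants to try Camelot from Python, a minimal sketch might look like the one below. It assumes the `camelot-py` package (and pandas, which it depends on) is installed and that the PDF is text-based. The `export_tables` helper, its `reader` parameter, and the file names are hypothetical and only there for illustration. `camelot.read_pdf`, the `pages` and `flavor` arguments, and the `.df` attribute are part of Camelot itself.

```python
from pathlib import Path


def export_tables(pdf_path, out_dir, pages="1", flavor="lattice", reader=None):
    """Write every table Camelot finds on the given pages to its own CSV file.

    flavor="lattice" suits tables with ruled lines; flavor="stream" suits
    tables whose columns are separated only by whitespace.
    reader is a hypothetical hook for swapping in a stand-in during testing;
    by default camelot.read_pdf is used.
    """
    if reader is None:
        import camelot  # third-party: pip install camelot-py
        reader = camelot.read_pdf

    tables = reader(str(pdf_path), pages=pages, flavor=flavor)

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for index in range(len(tables)):
        target = out_dir / f"{Path(pdf_path).stem}_table{index + 1}.csv"
        # Each Camelot table exposes its cells as a pandas DataFrame in .df
        tables[index].df.to_csv(target, index=False, header=False)
        written.append(target)
    return written
```

A call such as `export_tables("report.pdf", "csv_out", pages="1-3", flavor="stream")` would then leave one CSV per detected table in `csv_out`. For a scanned, image-only PDF, Camelot will not find any text to work with, so an OCR step has to come first.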