Showing posts with label Software. Show all posts

Monday, January 27, 2020

Request for Tools to Turn Tables in PDFs into Spreadsheets

Question:

Hi DLI Community,

Can anyone recommend a good tool for using OCR to turn PDFs of scanned tables into spreadsheets? A professor here at the University of Toronto is working with 200+ tables from nineteenth-century Ontario government publications, and I’m trying to suggest tools and a workflow for him and his RA.

Example of scanned document: https://archive.org/details/reportofcommissi187986ontauoft/page/n31/mode/2up

So far, my proposed workflow is for them to clean the PDFs (if necessary) using Acrobat Pro, and then run them through https://www.onlineocr.net/ (which is free and fairly good) or, for particularly challenging tables, something more powerful like OmniPage Ultimate (slow but useful proprietary OCR software, which we have on some library workstations). Finally, the tables can be manually corrected.

Do any of you have suggestions for OCR tools that worked for you, especially if they work in-browser and can create spreadsheets?

Answer:

You may want to have a look at Tabula (https://tabula.technology/). I have not used it extensively myself, but I remember that it was recommended by Vince Gray (of DLI fame), so it must be good!

--

I often use Camelot or its web version, Excalibur, but the document must be a text-based PDF. I don’t know of a decent tool for converting an image-based PDF to a text-based one.
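
For anyone who wants to try Camelot from a script rather than through Excalibur, here is a minimal sketch of what that might look like. It assumes the PDF is already text-based (for example, after an OCR pass) and that Camelot is installed (pip install camelot-py). The function names, the output naming scheme, and the choice of the "lattice" flavor are illustrative assumptions, not something taken from the posts above.

```python
from pathlib import Path


def csv_path_for(pdf_path, table_number, out_dir="."):
    """Build an output name such as report_table3.csv (hypothetical naming scheme)."""
    stem = Path(pdf_path).stem
    return Path(out_dir) / f"{stem}_table{table_number}.csv"


def extract_tables(pdf_path, pages="1", out_dir="."):
    """Save every table Camelot detects on the given pages as its own CSV file."""
    # Imported here so the helper above still works where Camelot is not installed.
    import camelot  # third-party package: pip install camelot-py

    # Assumption: "lattice" suits tables drawn with ruled lines;
    # "stream" is the alternative for tables separated only by whitespace.
    tables = camelot.read_pdf(str(pdf_path), pages=pages, flavor="lattice")

    written = []
    for number, table in enumerate(tables, start=1):
        target = csv_path_for(pdf_path, number, out_dir)
        # table.df is a pandas DataFrame holding the extracted cells.
        table.df.to_csv(target, index=False, header=False)
        written.append(target)
    return written
```

Calling extract_tables("report.pdf", pages="1-3") would write one CSV per detected table into the current folder. The results would still need the manual correction step mentioned in the question, since the extraction can only be as good as the text layer in the PDF.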

Thursday, February 9, 2017

Timeline Software

Question
A researcher is looking for timeline software; BeeDocs was used previously, but it is no longer workable.

Can someone suggest a product that is relatively easy to use and allows some flexibility to allow for illustrations, brief write-ups and links to further information?

Answer
Timeline.JS (https://timeline.knightlab.com/) has been used without complaint. It’s pretty user friendly.