In putting together the circuits course, I wanted to do a careful, section-by-section analysis of a potential textbook, Medical Instrumentation: Application and Design, edited by John G. Webster. Although I have a copy of the book (from the library), I did not want to have to retype the entire table of contents to do my analysis, especially since it is already available online through Amazon’s Look Inside feature. Unfortunately, that feature is deliberately designed to make it difficult to extract the text, in order to protect copyright.
I finally figured out a way to do it using optical character recognition (OCR), cobbling together a bunch of programs I already had on my Mac OS X laptop. There is almost certainly an easier way to do this and it might have been faster just to retype the 7 pages, but I thought I’d share the kluge with you, in case you ever need to extract text from something you can see on your screen but not copy from.
- I used Firefox to get the Amazon page, but I assume that any browser would do. I used the “zoom +” button on the Amazon page and made my browser window fill the screen, to make the text as big as possible. OCR programs work best with high-resolution images, and there is a tradeoff between how much text you can grab at a time and how much resolution you have in the image of the text. I went for maximum resolution, even though this meant needing more screen shots.
- I used Grab with the “Capture Selection” keyboard shortcut to grab the part of the screen that had the text I wanted. My son is disgusted with me for using such archaic software, but I’d never bothered to learn the keyboard shortcuts for doing screen capture directly. It turns out that those shortcuts were already set up: shift-command-4 would have saved a screenshot of a selection as a png file on the desktop, whereas Grab saves a tiff file to whatever directory I want. If you are just doing one screen shot, then the keyboard shortcut is probably better, but if you are doing several, the time it would take to rename and move the images probably makes Grab the better choice.
- ABBYY FineReader Sprint 8.0
- This is an OCR program that came with my Epson printer/scanner. I think that most scanners now come with some sort of OCR software. I tried giving the OCR program the screen capture images, and it failed miserably. It suggested, though, that I needed to scan at higher resolution (at least 300 dpi). Since the images were not from the Epson scanner, but were very readable, I looked for a way to convince the program to try harder.
- This OCR program is designed to do one page at a time, and you have to go through several mouse clicks and selections for each additional page. Once I figured out how to trick it into doing proper scans of the screen shots, it did an adequate job of converting glyphs to the right characters, though it did get confused about font types and styles, so the rich-text markup is not worth retaining.
- Photoshop Elements
- I opened each image in Photoshop Elements, and used the cropping tool with the width set to 2000px to resize the image and remove anything that wasn’t needed. I then flattened the image to a single layer and saved in jpg format. (I suspect that png would also have worked, but the OCR program seems to get confused by some options in TIFF files and I didn’t want to figure out which versions of TIFF it did and didn’t understand.)
I tried several different ways to do the resizing using the Image Size command (Bicubic, Bicubic Smoother, Bilinear, Bicubic Sharper), but the crop tool was both easier to use and seemed to get cleaner results. One disadvantage of the crop tool is that it leaves the resolution at 72 dpi, which confuses the OCR into using very large fonts. Doing a resize to change the resolution to 300 dpi (leaving the width at 2000px) might have been worthwhile, but I was throwing out the font size information anyway, so I didn’t bother.
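The arithmetic behind the large-font confusion is easy to check. The sketch below is for illustration only: the 40-pixel glyph height is a made-up value (not measured from the screen shots), and the function names are hypothetical. It shows how the same pixels get interpreted as a much larger page, and much larger type, when the file is tagged 72 dpi rather than 300 dpi.

```python
POINTS_PER_INCH = 72.0  # typographic points in one inch


def physical_width_inches(width_px: float, dpi: float) -> float:
    """Width an OCR program would assume for the page, given the dpi tag."""
    return width_px / dpi


def apparent_point_size(glyph_height_px: float, dpi: float) -> float:
    """Font size (in points) implied by a glyph height at a given dpi tag."""
    return glyph_height_px * POINTS_PER_INCH / dpi


if __name__ == "__main__":
    width_px = 2000          # crop width used in Photoshop Elements
    glyph_height_px = 40     # assumed glyph height, for illustration only
    for dpi in (72, 300):
        print(
            f"{dpi:>3} dpi: page width {physical_width_inches(width_px, dpi):5.2f} in, "
            f"apparent font size {apparent_point_size(glyph_height_px, dpi):5.1f} pt"
        )
```

Nothing about the pixels changes between the two cases; only the dpi tag does, which is why skipping the resize is harmless when the font size information is being thrown away.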
- I made my screen shots correspond to chapter boundaries, rather than page boundaries, so I also used Photoshop Elements to remove header and footer lines that were in the middle of some chapters, and to realign the text between pages, since odd-page and even-page margins are different. It might be better to do a page at a time instead.
- The OCR program provides three output options: rich text, HTML, and spreadsheet. For most purposes, the rich text output would be best, but the program removed the section numbers from the table of contents, making this output format useless for me. The HTML output looked good, but was a complicated table arrangement that would have been difficult to add commentary to. The spreadsheet output puts all the text into one column of a spreadsheet, which looks awful, but is easy to select for cut-and-paste operations and had all the information I wanted.
- Cutting and pasting from the spreadsheet column in Numbers directly into the WordPress.com editor for creating posts and pages did not work; the editor got very confused by whatever rich-text markup there was. Pasting into a text-only TextEdit window threw away all the junk, leaving just plain text, numbers, and spaces. I had to clean up a few places where the OCR had introduced line breaks, or where the book formatting had inserted line breaks that were not needed at the different font size I was using. Because I had about 14 screen shots for the 7-page table of contents, I also used TextEdit to put everything back into a single document. Cutting and pasting the plain text into WordPress.com’s editor worked fine.
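The cleanup in that last step was done by hand in TextEdit, but part of it could probably be scripted. Here is a minimal Python sketch under one big assumption: every table-of-contents entry starts with a section number such as "1" or "1.10", so any line that does not start with one is treated as the continuation of the previous entry. The function names and the sample text are made up for illustration and are not taken from the book.

```python
import re

# Assumption: an entry begins with a section number ("3", "3.4", "3.4.1") and a space.
ENTRY_START = re.compile(r"^\d+(\.\d+)*\s")


def rejoin_entries(lines):
    """Glue continuation lines back onto the entry they belong to."""
    entries = []
    for raw in lines:
        line = " ".join(raw.split())  # collapse runs of spaces and tabs
        if not line:
            continue  # drop blank lines
        if entries and not ENTRY_START.match(line):
            entries[-1] += " " + line  # stray line break inside an entry
        else:
            entries.append(line)
    return entries


def merge_fragments(fragments):
    """Combine the plain text from several screen shots into one document."""
    all_lines = []
    for fragment in fragments:
        all_lines.extend(fragment.splitlines())
    return "\n".join(rejoin_entries(all_lines))


if __name__ == "__main__":
    # Made-up placeholder text standing in for two screen shots.
    fragments = [
        "1 First chapter\n1.1 A section title that was\nbroken across two lines\n",
        "1.2 Another   section\n\n2 Second chapter\n",
    ]
    print(merge_fragments(fragments))
```

A heuristic like this would still need a manual pass afterwards, since a title that wraps onto a line beginning with a number would be split incorrectly.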
Overall, it took about 50 minutes to process the 14 screen shots. This is obviously not a suitable workflow for doing any substantial amount of text (which is a good thing for avoiding copyright violations), but it can be used if you need to quote a block of text from a book and are a slow or inaccurate typist.