Getting a scientific paper onto your hard disk is quite simple. I am using a Fujitsu ScanSnap that can process a single page in a few seconds. The resulting PDF then needs to be run through OCR software such as ABBYY FineReader (I couldn't find any good open source alternative). FineReader (FR) leaves your PDF intact while adding the recognized text as an overlay (or "underlay"). Unfortunately FR does not support batch processing, but your OS can fill that gap with a Windows scripting engine like CLRscript.

We also need a tool to extract a text file from the modified PDF. A good choice is pdftotext; look at the source code and the DRM discussion before compiling it in an environment like Cygwin.

The following Perl script does nothing more than traverse your target directory and create a batch file. As the file names offered by publishers are rather strange, I would start by creating clean file names, replacing all spaces and brackets with something innocent like underscores.
perl.exe ocr.pl rename h:\pdf\2008\*.*
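For readers who want to see what the rename step amounts to, here is a minimal sketch in Python. It is not the Perl script itself; the function names `clean_name` and `rename_all` are made up for illustration, and I am assuming that "brackets" covers round, square and curly ones.

```python
import re
from pathlib import Path


def clean_name(name):
    """Replace every space and bracket in a file name with an underscore."""
    # Assumption: "brackets" means (), [] and {}.
    return re.sub(r"[ ()\[\]{}]", "_", name)


def rename_all(directory):
    """Rename all files in `directory` to their cleaned names.

    Files whose cleaned name already exists are skipped, so nothing
    gets overwritten by accident.
    """
    for path in Path(directory).iterdir():
        if not path.is_file():
            continue
        target = path.with_name(clean_name(path.name))
        if target != path and not target.exists():
            path.rename(target)
```

For example, `clean_name("Smith (2008) final.pdf")` gives `Smith__2008__final.pdf`, and `rename_all(r"h:\pdf\2008")` would apply the same rule to a whole directory.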
Next, we create text files from the PDFs (this is usually done better by Xpdf than directly by GDS).
perl.exe ocr.pl extract h:\pdf\2008\*.pdf
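The idea of "traverse the directory and write a batch file" can be sketched roughly as follows. Again this is Python rather than the Perl script, the names `extract_commands`, `write_batch` and `extract.bat` are hypothetical, and I am assuming that each text file sits next to its PDF with the same base name. The only external call used is the plain `pdftotext input.pdf output.txt` form.

```python
from pathlib import Path


def extract_commands(directory):
    """Build one pdftotext command line per PDF found in `directory`."""
    commands = []
    for pdf in sorted(Path(directory).glob("*.pdf")):
        txt = pdf.with_suffix(".txt")  # assumed naming: same base name, .txt
        # Quote both paths in case a file name still contains odd characters.
        commands.append(f'pdftotext "{pdf}" "{txt}"')
    return commands


def write_batch(directory, batch_file="extract.bat"):
    """Write the commands to a batch file (hypothetical name) and return them."""
    commands = extract_commands(directory)
    Path(batch_file).write_text("\n".join(commands) + "\n")
    return commands
```

Calling `write_batch(r"h:\pdf\2008")` leaves an `extract.bat` in the current directory that can be inspected before it is run.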
The resulting text files should be inspected: a very small file size usually indicates that no valid text was extracted. Delete such files before starting the OCR step, because OCR is only run for PDFs whose text file is missing.
perl.exe ocr.pl ocr h:\pdf\2008\*.pdf
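The clean-up before this step and the "OCR only when the text file is missing" rule can be sketched like this. It is a Python stand-in, not the Perl script; the 1000-byte threshold and all function names are my own assumptions, and the actual FineReader call is left out because it depends on how you script the application.

```python
from pathlib import Path

# Assumption: anything below this size counts as a failed extraction.
MIN_BYTES = 1000


def small_text_files(directory, min_bytes=MIN_BYTES):
    """List text files that are too small to hold a real extraction."""
    return [
        txt
        for txt in sorted(Path(directory).glob("*.txt"))
        if txt.stat().st_size < min_bytes
    ]


def delete_small_text_files(directory, min_bytes=MIN_BYTES):
    """Delete the suspicious text files and return what was removed."""
    removed = small_text_files(directory, min_bytes)
    for txt in removed:
        txt.unlink()
    return removed


def pdfs_needing_ocr(directory):
    """List PDFs that have no text file next to them."""
    return [
        pdf
        for pdf in sorted(Path(directory).glob("*.pdf"))
        if not pdf.with_suffix(".txt").exists()
    ]
```

Running `delete_small_text_files` first and `pdfs_needing_ocr` afterwards yields exactly the PDFs whose extraction failed or never happened, which is the list to hand to the OCR software.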
As a last step, you may want to repeat the extract step so that the newly OCRed PDFs get their text files as well.
ocr.zip