{"id":385,"date":"2006-11-16T13:33:33","date_gmt":"2006-11-16T11:33:33","guid":{"rendered":"http:\/\/www.wjst.de\/blog\/downloads\/a-low-cost-system-for-a-pdf-literature-archiv-i\/"},"modified":"2007-12-30T13:13:12","modified_gmt":"2007-12-30T11:13:12","slug":"a-low-cost-system-for-a-pdf-literature-archiv-i","status":"publish","type":"post","link":"https:\/\/www.wjst.de\/blog\/sciencesurf\/2006\/11\/a-low-cost-system-for-a-pdf-literature-archiv-i\/","title":{"rendered":"A low-cost system for a PDF literature archiv I"},"content":{"rendered":"<p>Getting a scientific paper onto your hard disk is quite simple. I am using a Fujitsu ScanSnap that can process a single page in a few seconds. The resulting PDF then needs to be processed by OCR software such as ABBYY FineReader (I couldn&#8217;t find any good open source alternative). FR will leave your PDF intact while adding the recognized text as an overlay (or &#8220;underlay&#8221;). Unfortunately FR does not support batch processing, but you can automate it at the OS level with a Windows scripting engine like <a href=\"http:\/\/www.clrsoftware.com\/clrscript\/\">CLRscript<\/a>. We also need a tool to extract a text file from the modified PDF. A good choice is <a href=\"http:\/\/www.foolabs.com\/xpdf\/download.html\">pdftotext<\/a>; look at the source code and the DRM discussion before compiling it in an environment like Cygwin. The following Perl script does nothing more than traverse your target directory and create a batch file. 
As the file names offered by publishers are rather strange, I would first create clean file names by replacing all spaces and brackets with something innocent like underscores.<br \/>\n<em>perl.exe ocr.pl rename h:\\pdf\\2008\\*.*<\/em><br \/>\nNow we create text files from the PDFs (XPDF usually does this better than GDS does directly).<br \/>\n<em>perl.exe ocr.pl extract h:\\pdf\\2008\\*.pdf<\/em><br \/>\nThe resulting text files should be inspected: a very small file size usually indicates a failed extraction, and such files should be deleted before starting the OCR step, as OCR is only run for PDFs whose text file is missing.<br \/>\n<em>perl.exe ocr.pl ocr h:\\pdf\\2008\\*.pdf<\/em><br \/>\nAs a last step you may want to repeat the extract step so that the newly OCRed PDFs get text files too.<\/p>\n<p><a href=\"https:\/\/www.wjst.de\/blog\/wp-content\/plugins\/scripts\/ocr.zip\">ocr.zip<\/a><br \/>\n|wj_ocr.txt|<\/p>\n\n<p>&nbsp;<\/p>\n<div class=\"bottom-note\">\n  <span class=\"mod1\">CC-BY-NC Science Surf, accessed 09.04.2026<\/span>\n <\/div>","protected":false},"excerpt":{"rendered":"<p>Getting a scientific paper onto your hard disk is quite simple. I am using a Fujitsu ScanSnap that can process a single page in a few seconds. The resulting PDF then needs to be processed by OCR software such as ABBYY FineReader (I couldn&#8217;t find any good open source alternative). 
FR will leave your PDF intact &hellip; <a href=\"https:\/\/www.wjst.de\/blog\/sciencesurf\/2006\/11\/a-low-cost-system-for-a-pdf-literature-archiv-i\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">A low-cost system for a PDF literature archiv I<\/span> <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2],"tags":[],"class_list":["post-385","post","type-post","status-publish","format-standard","hentry","category-genetics-biology"],"_links":{"self":[{"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/posts\/385","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/comments?post=385"}],"version-history":[{"count":0,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/posts\/385\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/media?parent=385"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/categories?post=385"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/tags?post=385"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}