Tag Archives: PDF

Academic text parsing

19.09.2024 admin

I used to parse PDFs using the Allenai method and the layoutparser.
This worked in many instances but is no longer maintained.
I still have Nougat on my to do list while a new paper now points to AceParse

AceParse includes various types of structured text, such as formulas, tables, algorithms, lists, and sentences embedded with mathematical expressions, among others. We provide examples of several dataset samples to give you a better understanding of our dataset.

CC-BY-NC Science Surf , accessed 27.04.2026

Software

Pro Tipp: Next level OCR of academic documents

18.02.2024 admin

Reading of math documents into LaTeX involves a lot of typing while there is some support now by FB (Github)

pip install nougat-ocr
nougat path/to/file.pdf -o output_directory

CC-BY-NC Science Surf , accessed 27.04.2026

Software

Headless Chrome is better than wkhtmltopdf to store PDFs of websites

5.07.2022 admin

I have been using wkhtmltopdf for ages but as I am getting more and more issues with missing fonts, I was looking for an alternative in particular when going for websites that load dynamically.
Here is the macOS syntax

"/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome" --headless --disable-gpu --print-to-pdf=/Users/xxx/Desktop/pdf/test.pdf https://chromestatus.com/features

CC-BY-NC Science Surf , accessed 27.04.2026

Software

Personalized PDFs

22.01.2022 admin

Scientific publishers are creating now more and more dynamic PDFs. Why do we know? There is an unexpected loading delay of a PDF from Routledge / Taylor & Francis group that I observed recently. First I thought about some DDos protection, but is indeed a personalized document.

These websites are all being contacted while creating this PDF:

Scitrus.com seems to be part of a larger reference organizer network and links to scienceconnect.io. Alexametric.com is the soon to be retired Alexa internet / Amazon service. Snap.lidcdn.com forwards to px.ads.linkedin.com, the business social network. Then we have Twitter ads, Cloudflare security and Google Analytics. All major players now know that my IP is interested in COVID-19 research. Did I ever agree to submit my IP and time stamp when looking up a rather crude scientific paper?

This is exactly what the German DFG already warned us about last October

For some time now, the major academic publishers have been fundamentally changing their business model with significant implications for research: aggregation and the reuse or resale of user traces have become relevant aspects of their business. Some publishers now explicitly regard themselves as information analysis specialists. Their business model is shifting from content provision to data analytics.

Another paper describes the situation as “Forced marriages and bastards”…

My question is : Will Francis & Taylor even do more? The structure of PDFs allows including objects including Javascript. When examining “document.pdf” using pdf-parser I could not find any javascript or my current IP in clear text. I cannot exclude however that the chopped up IP is stamped somewhere in the document. So I will have try again at a later time point and redo a bitwise analysis. of the same PDF delivered on another day.

At least the DFG document says that organisations might argue that such software allows for the prosecution of users of shadow libraries. While I have doubts that this is legal, we already see targeted advertisement as I received this PDF from Wiley that included an Eppendorf ad.

When I downloaded this document a second time using a different IP it was however identical. Blood/Elsevier only let’s you even download only after watching a small slideshow…

CC-BY-NC Science Surf , accessed 27.04.2026

Pages: 1 2 3 4

Philosophy, Software

Scientific PDFs

12.07.2021 admin

I found only one relevant blog post about PDFs produced by scientific publishers.

Although PDFs are the main formal output nowadays (no more “papers”) there is basically no standardization which meta data should be included in scientifc PDFs. It’s largely due to the software in the production office what’s in the PDF document — a largely underused resource.

This is at least my experience when working on an image duplication pipeline in scientific papers. But test it yourself ….

CC-BY-NC Science Surf , accessed 27.04.2026

Software

Spotlight indexing of Papers keywords

8.11.2010 admin

So far, spotlight indexing of Papers’ keywords is not possible (while being planned for a future release according to an email that I received today from one author). In the meantime, here is a workaround.

1. Export the whole Papers database as tmp.csv
Continue reading Spotlight indexing of Papers keywords →

CC-BY-NC Science Surf , accessed 27.04.2026