{"id":24058,"date":"2024-09-19T21:04:28","date_gmt":"2024-09-19T19:04:28","guid":{"rendered":"https:\/\/www.wjst.de\/blog\/?p=24058"},"modified":"2024-09-19T21:04:28","modified_gmt":"2024-09-19T19:04:28","slug":"academic-text-parsing","status":"publish","type":"post","link":"https:\/\/www.wjst.de\/blog\/sciencesurf\/2024\/09\/academic-text-parsing\/","title":{"rendered":"Academic text parsing"},"content":{"rendered":"<p>I used to <a href=\"https:\/\/medium.com\/swlh\/a-few-thoughts-on-parsing-text-b496a0f99dde\">parse PDFs<\/a> using the <a href=\"https:\/\/github.com\/allenai\/spv2\">Allenai<\/a> method and the <a href=\"https:\/\/github.com\/Layout-Parser\/layout-parser\/blob\/master\/examples\/Deep%20Layout%20Parsing.ipynb\">layoutparser<\/a>.<br \/>\nThis worked in many instances but is no longer maintained.<br \/>\nI still have <a href=\"https:\/\/arxiv.org\/pdf\/2308.13418\">Nougat<\/a> on my to-do list, while a\u00a0<a href=\"https:\/\/arxiv.org\/html\/2409.10016v1\">new paper<\/a> now points to <a href=\"https:\/\/github.com\/JHW5981\/AceParse?utm_source=tldrai\">AceParse<\/a>.<\/p>\n<blockquote><p>AceParse includes various types of structured text, such as formulas, tables, algorithms, lists, and sentences embedded with mathematical expressions, among others. We provide examples of several dataset samples to give you a better understanding of our dataset.<\/p><\/blockquote>\n<div class=\"bottom-note\">\n  <span class=\"mod1\">CC-BY-NC Science Surf, accessed 24.04.2026<\/span>\n <\/div>","protected":false},"excerpt":{"rendered":"<p>I used to parse PDFs using the Allenai method and the layoutparser. This worked in many instances but is no longer maintained. 
I still have Nougat on my to-do list, while a\u00a0new paper now points to AceParse. AceParse includes various types of structured text, such as formulas, tables, algorithms, lists, and sentences embedded with &hellip; <a href=\"https:\/\/www.wjst.de\/blog\/sciencesurf\/2024\/09\/academic-text-parsing\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Academic text parsing<\/span> <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20,9],"tags":[2681,4911,218],"class_list":["post-24058","post","type-post","status-publish","format-standard","hentry","category-note-worthy","category-computer-software","tag-pdf","tag-parsing","tag-science"],"_links":{"self":[{"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/posts\/24058","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/comments?post=24058"}],"version-history":[{"count":3,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/posts\/24058\/revisions"}],"predecessor-version":[{"id":24064,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/posts\/24058\/revisions\/24064"}],"wp:attachment":[{"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/media?parent=24058"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/categories?post=24058"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/tags?post=24058"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}