{"id":25387,"date":"2025-07-04T08:52:19","date_gmt":"2025-07-04T06:52:19","guid":{"rendered":"https:\/\/www.wjst.de\/blog\/?p=25387"},"modified":"2025-08-07T08:02:54","modified_gmt":"2025-08-07T06:02:54","slug":"llm-word-checker","status":"publish","type":"post","link":"https:\/\/www.wjst.de\/blog\/sciencesurf\/2025\/07\/llm-word-checker\/","title":{"rendered":"LLM word checker"},"content":{"rendered":"<p>The recent <a href=\"https:\/\/www.science.org\/doi\/10.1126\/sciadv.adt3813\">Science Advances<\/a> paper by Kobak et al. studied<\/p>\n<blockquote><p>vocabulary changes in more than 15 million biomedical abstracts from 2010 to 2024 indexed by PubMed and show how the appearance of LLMs led to an abrupt increase in the frequency of certain style words. This excess word analysis suggests that at least 13.5% of 2024 abstracts were processed with LLMs.<\/p><\/blockquote>\n<p>Although they say that the analysis was performed on the corpus level and cannot identify individual texts that may have been processed by an LLM, we can of course check the proportion of LLM words in a text.<\/p>\n<p>Unfortunately, their online list contains stop words, which I am eliminating here. 
But then we can run the following script!<\/p>\n<pre class=\"brush: php; title: ; notranslate\" title=\"\">\n# based on https:\/\/github.com\/berenslab\/llm-excess-vocab\/tree\/main\n\nimport csv\nimport re\nimport os\nfrom collections import Counter\nfrom striprtf.striprtf import rtf_to_text\nfrom nltk.corpus import stopwords\nimport nltk\nimport chardet\n\n# Ensure stopwords are available\nnltk.download(&#039;stopwords&#039;)\n\n# Paths\nrtfd_folder_path = &#039;\/Users\/x\/Desktop\/mss_image.rtfd&#039; # RTFD is a directory\nrtf_file_path = os.path.join(rtfd_folder_path, &#039;TXT.rtf&#039;) # or &#039;index.rtf&#039;\ncsv_file_path = &#039;\/Users\/x\/Desktop\/excess_words.csv&#039;\n\n# Read the RTF file as raw bytes\nwith open(rtf_file_path, &#039;rb&#039;) as f:\n    raw_data = f.read()\n\n# Detect the encoding; fall back to UTF-8 if chardet cannot decide\nencoding = chardet.detect(raw_data)&#x5B;&#039;encoding&#039;] or &#039;utf-8&#039;\nrtf_content = raw_data.decode(encoding)\nplain_text = rtf_to_text(rtf_content)\n\n# Normalize and tokenize text\nwords_in_text = re.findall(r&#039;\\b\\w+\\b&#039;, plain_text.lower())\n\n# Remove stopwords\nstop_words = set(stopwords.words(&#039;english&#039;))\nfiltered_words = &#x5B;word for word in words_in_text if word not in stop_words]\n\n# Load excess words from the first column of the CSV\nwith open(csv_file_path, &#039;r&#039;, encoding=&#039;utf-8&#039;) as csv_file:\n    reader = csv.reader(csv_file)\n    excess_words = {row&#x5B;0].strip().lower() for row in reader if row}\n\n# Count excess words in filtered text\nexcess_word_counts = Counter(word for word in filtered_words if word in excess_words)\n\n# Calculate proportion (guard against an empty text)\ntotal_words = len(filtered_words)\ntotal_excess = sum(excess_word_counts.values())\nproportion = total_excess \/ total_words if total_words &gt; 0 else 0\n\n# Output\nprint(&quot;\\nExcess Words Found (Sorted by Frequency):&quot;)\nfor word, count in excess_word_counts.most_common():\n    print(f&quot;{word}: {count}&quot;)\n\nprint(f&quot;\\nTotal words (without stopwords): 
{total_words}&quot;)\nprint(f&quot;Total excess words: {total_excess}&quot;)\nprint(f&quot;Proportion of excess words: {proportion:.4f}&quot;)\n<\/pre>\n<p><span style=\"text-decoration: underline;\">7 Aug 2025<\/span><\/p>\n<p>The long &#8216;em dash&#8217; \u2014 (U+2014), used instead of the standard minus (hyphen-minus, U+002D), seems to be a characteristic sign of ChatGPT 4, even when it is asked not to use it.<\/p>\n\n<p>&nbsp;<\/p>\n<div class=\"bottom-note\">\n  <span class=\"mod1\">CC-BY-NC Science Surf, accessed 16.04.2026<\/span>\n <\/div>","protected":false},"excerpt":{"rendered":"<p>The recent Science Advances paper by Kobak et al. studied vocabulary changes in more than 15 million biomedical abstracts from 2010 to 2024 indexed by PubMed and show how the appearance of LLMs led to an abrupt increase in the frequency of certain style words. This excess word analysis suggests that at least 13.5% of &hellip; <a href=\"https:\/\/www.wjst.de\/blog\/sciencesurf\/2025\/07\/llm-word-checker\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">LLM word checker<\/span> <span 
class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9],"tags":[4253,5057],"class_list":["post-25387","post","type-post","status-publish","format-standard","hentry","category-computer-software","tag-llm","tag-check"],"_links":{"self":[{"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/posts\/25387","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/comments?post=25387"}],"version-history":[{"count":5,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/posts\/25387\/revisions"}],"predecessor-version":[{"id":25502,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/posts\/25387\/revisions\/25502"}],"wp:attachment":[{"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/media?parent=25387"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/categories?post=25387"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/tags?post=25387"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}