LLM word checker

The recent Science Advances paper by Kobak et al. studied vocabulary changes in more than 15 million biomedical abstracts indexed by PubMed from 2010 to 2024 and showed how the appearance of LLMs led to an abrupt increase in the frequency of certain style words. This excess word analysis suggests that at least 13.5% of 2024 abstracts were processed with LLMs.

Although they say that the analysis was performed on the corpus level and cannot identify individual texts that may have been processed by an LLM, we can of course check the proportion of LLM words in a text.

Unfortunately, their online list contains stop words, which I eliminate here. With that done, we can run the following script:

# based on https://github.com/berenslab/llm-excess-vocab/tree/main

import csv
import re
import os
from collections import Counter
from striprtf.striprtf import rtf_to_text
from nltk.corpus import stopwords
import nltk
import chardet

# Ensure stopwords are available
nltk.download('stopwords')

# Paths
rtfd_folder_path = '/Users/x/Desktop/mss_image.rtfd' # RTFD is a directory
rtf_file_path = os.path.join(rtfd_folder_path, 'TXT.rtf') # or 'index.rtf'
csv_file_path = '/Users/x/Desktop/excess_words.csv'

# Read and decode the RTF file
with open(rtf_file_path, 'rb') as f:
    raw_data = f.read()

# Detect the encoding automatically, falling back to UTF-8 if detection fails
encoding = chardet.detect(raw_data)['encoding'] or 'utf-8'
rtf_content = raw_data.decode(encoding)
plain_text = rtf_to_text(rtf_content)

# Normalize and tokenize text
words_in_text = re.findall(r'\b\w+\b', plain_text.lower())

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words_in_text if word not in stop_words]

# Load excess words from CSV
with open(csv_file_path, 'r', encoding='utf-8') as csv_file:
    reader = csv.reader(csv_file)
    excess_words = {row[0].strip().lower() for row in reader if row}

# Count excess words in filtered text
excess_word_counts = Counter(word for word in filtered_words if word in excess_words)

# Calculate proportion
total_words = len(filtered_words)
total_excess = sum(excess_word_counts.values())
proportion = total_excess / total_words if total_words > 0 else 0

# Output
print("\nExcess Words Found (Sorted by Frequency):")
for word, count in excess_word_counts.most_common():
    print(f"{word}: {count}")

print(f"\nTotal words (without stopwords): {total_words}")
print(f"Total excess words: {total_excess}")
print(f"Proportion of excess words: {proportion:.4f}")

7 Aug 2025

The long ’em dash’ (U+2014) instead of the standard hyphen-minus seems to be a characteristic sign of ChatGPT-4, even when asked not to use it.
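As a quick check of your own text, here is a minimal sketch that reuses the plain_text variable from the script above (the counts are only a rough stylistic heuristic, not proof of LLM use):

# Count dash variants as a rough LLM-style heuristic
for name, ch in [("em dash (U+2014)", "\u2014"),
                 ("en dash (U+2013)", "\u2013"),
                 ("hyphen-minus", "-")]:
    print(f"{name}: {plain_text.count(ch)}")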

 

CC-BY-NC Science Surf 4.07.2025, access 18.10.2025

The Südhof Nomenclature

Blurred as I have no image rights. Source: https://www.faz.net/aktuell/wissen/medizin-nobelpreistraeger-thomas-suedhof-wie-boese-ist-wissenschaft-110567521.html

The video can be found at the Lindau Mediathek.

Here is my annotated list of excuses, numbered SUEDHOF1, SUEDHOF2, …, SUEDHOF15 in chronological order.

Is this really “an unprecedented quality initiative”, as the F.A.Z.’s Joachim Müller-Jung wrote?

IMHO this looks more like a self-pitying defense, but form your own opinion now. Continue reading The Südhof Nomenclature

 

CC-BY-NC Science Surf 3.07.2025, access 18.10.2025

Peer review is science roulette

One of the best essays that I have read about current science.

Ricky Lanusse. How Peer Review Became Science’s Most Dangerous Illusion. https://medium.com/the-quantastic-journal/how-peer-review-became-sciences-most-dangerous-illusion-54cf13da517c

Peer review is far from a firewall. In most cases, it’s just a paper trail that may have even encouraged bad research. The system we’ve trusted to verify scientific truth is fundamentally unreliable — a lie detector that’s been lying to us all along.
Let’s be bold for a minute: If peer review worked, scientists would act like it mattered. They don’t. When a paper gets rejected, most researchers don’t tear it apart, revise it, rethink it. They just repackage and resubmit — often word-for-word — to another journal. Same lottery ticket in a different draw mindset. Peer review is science roulette.
Once the papers are in, the reviews disappear. Some journals publish them. Most shred them. No one knows what the reviewer said. No one cares. If peer review were actually a quality check, we’d treat those comments like gospel. That’s what I value about RealClimate [PubPeer, my addition]: it provides insights we don’t get to see in formal reviews. Their blog posts and discussions — none of which have been published behind paywalls in journals — often carry more weight than peer-reviewed science.

 

CC-BY-NC Science Surf 1.07.2025, access 18.10.2025

Enigma of Organismal Death

I asked ChatGPT-4 for more references around the 2024 paper “Unraveling the Enigma of Organismal Death: Insights, Implications, and Unexplored Frontiers” as Tukdam continues to be a hot topic. Here is the updated reading list:

1. Organismal Superposition & the Brain‑Death Paradox
Piotr Grzegorz Nowak (2024) argues that defining death as the “termination of the organism” leads to an organismal superposition problem. He suggests that under certain physiological conditions—like brain death—the patient can be argued to be both alive and dead, much like Schrödinger’s cat, creating ethical confusion especially around organ harvesting. https://philpapers.org/rec/NOWOSP

2. Life After Organismal “Death”
Melissa Moschella (2017, revisiting Brain‑Death debates) highlights that even after “organismal death,” significant biological activity persists—cells, tissues, and networks (immune, stress responses) can remain active days postmortem. https://philpapers.org/rec/MOSCOD-2

3. Metaphysical & Ontological Critiques
The Humanum Review and similar critiques challenge the metaphysical basis of the paper’s unity‑based definition of death. They stress that considering a person’s “unity” as automatically tied to brain-function is metaphysically dubious. They also quote John Paul II, arguing death is fundamentally a metaphysical event that science can only confirm empirically. https://philpapers.org/rec/MOSCOD-2

4. Biological Categorization Limits
Additional criticism comes from theoretical biology circles, pointing out that living vs. dead is an inherently fuzzy, non-binary distinction. Any attempt to define death (like in the paper) confronts conceptual limits due to the complexity of life forms and continuous transitions. https://humanumreview.com/articles/revising-the-concept-of-death-again

5. Continuation of Scientific Research
Frontiers in Microbiology (2023) supports the broader approach but emphasizes that transcriptomic and microbiome dynamics postmortem should be more deeply explored, suggesting the paper’s overview was incomplete without enough data-driven follow-up. https://pmc.ncbi.nlm.nih.gov/articles/PMC6880069/

 

CC-BY-NC Science Surf 27.06.2025, access 18.10.2025

Gödel Sentences

Example of a Gödel sentence: “This statement is not provable.”

If the sentence were provable, then the statement would be false: contradiction! If the sentence is not provable, then the statement is true, but not provable.
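Schematically, in the standard textbook formulation, with $\mathrm{Prov}$ the provability predicate of a consistent, sufficiently strong system $F$:

\[
G \;\leftrightarrow\; \neg\,\mathrm{Prov}(\ulcorner G \urcorner)
\]

If $F \vdash G$, then $\mathrm{Prov}(\ulcorner G \urcorner)$ holds, so $F$ proves a falsehood: contradiction. If $F \nvdash G$, then $\neg\,\mathrm{Prov}(\ulcorner G \urcorner)$ is true, which is exactly what $G$ asserts, so $G$ is true but unprovable.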

 

CC-BY-NC Science Surf 26.06.2025, access 18.10.2025

How to sync only Desktop to iCloud

After giving up Nextcloud – which is now overkill for me, with 30,000 files in a basic setup – I am now syncing with iCloud. As I work only on the Desktop, it would make sense to sync the Desktop at regular intervals, but unfortunately this can be done only together with the Documents folder (something I don’t want). Stack Exchange also has no good solution, so here is mine:

# make another Desktop in iCloud folder
mkdir -p ~/Library/Mobile\ Documents/com~apple~CloudDocs/iCloudDesktop

# sync local Desktop
rsync -av --delete ~/Desktop/ ~/Library/Mobile\ Documents/com~apple~CloudDocs/iCloudDesktop/

# and run it every hour or so
# launchctl load ~/Library/LaunchAgents/launched.com.desktop.rsync.plist

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
  <dict>
    <key>KeepAlive</key>
    <dict>
      <key>Crashed</key>
      <true/>
    </dict>

    <key>Label</key>
    <string>launched.com.desktop.rsync</string>

    <key>ProgramArguments</key>
    <array>
      <string>/usr/bin/rsync</string>
      <string>-av</string>
      <string>--delete</string>
      <string>/Users/xxx/Desktop/</string>
      <string>/Users/xxx/Library/Mobile Documents/com~apple~CloudDocs/iCloudDesktop/</string>
    </array>

    <key>RunAtLoad</key>
    <true/>

    <key>StartCalendarInterval</key>
    <array>
      <dict>
        <key>Minute</key>
        <integer>0</integer>
      </dict>
    </array>

    <key>StandardOutPath</key>
    <string>/tmp/rsync.out</string>

    <key>StandardErrorPath</key>
    <string>/tmp/rsync.err</string>
  </dict>
</plist>

 

CC-BY-NC Science Surf 22.06.2025, access 18.10.2025

Otto Hahn, Lise Meitner and Fritz Straßmann

As a reminder, here is a video of Otto Hahn (1879-1968), Lise Meitner (1878-1968) and Fritz Straßmann (1902-1980) in conversation, a historical recording from January 1960.

Only Otto Hahn received a Nobel Prize, although all three were indisputably involved in the discovery of nuclear fission.

https://www.ndr.de/geschichte/ndr_retro/Gespraech-mit-Otto-Hahn-Lise-Meitner-und-Fritz-Strassmann-,ausersterhand130.html

It is striking how graciously everyone in the discussion also acknowledges the achievements of others, reflecting the fundamentally humble attitude of all involved, driven not by self-promotion but by the pursuit of knowledge.

 

CC-BY-NC Science Surf 16.06.2025, access 18.10.2025

Are we really thinking at 10 bits/s?

There is a funny paper on arXiv that has now been published in Neuron. It claims to have found a

neural conundrum behind the slowness of human behavior. The information throughput of a human being is about 10 bits/s. In comparison, our sensory systems gather data at ~10^9 bits/s. The stark contrast between these numbers remains unexplained and touches on fundamental aspects of brain function: What neural substrate sets this speed limit on the pace of our existence? Why does the brain need billions of neurons to process 10 bits/s? Why can we only think about one thing at a time?

Are there really two brains, an “outer” brain with fast, high-dimensional sensory and motor signals and an “inner” brain that does all the processing? My inner brain says this is huge speculation.

 

CC-BY-NC Science Surf 16.06.2025, access 18.10.2025

How to run LLaMA on your local PDFs

I needed this urgently for indexing PDFs, as Spotlight on the Mac is highly erratic after all these years.

AnythingLLM seemed the most promising approach, with an easy-to-use GUI and good documentation. But indexing failed after several hours, so I moved on to LM Studio. This installation also turned out to be more complicated than expected due to library “dependency hell” and spiralling version mismatches…

  1. Download and install LM Studio
  2. From inside LM Studio download your preferred model
  3. Index your PDFs in batches of 1,000 using the Python script below
  4. Combine indices and run queries against the full index
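The full script is behind the link below; as a minimal sketch of step 3, assuming LM Studio’s OpenAI-compatible local server is running on its default port with an embedding model loaded (the model name, paths, and helper function are hypothetical):

# Hypothetical sketch of batch indexing against LM Studio's local server;
# not the full script from this post. Requires: pip install pypdf requests
import glob
import pickle
import requests
from pypdf import PdfReader

EMBED_URL = "http://localhost:1234/v1/embeddings"  # LM Studio default port
MODEL = "your-embedding-model"                     # placeholder model name

def embed(text):
    # Ask the local server for one embedding vector (truncate long texts)
    r = requests.post(EMBED_URL, json={"model": MODEL, "input": text[:2000]})
    r.raise_for_status()
    return r.json()["data"][0]["embedding"]

index = {}
for path in glob.glob("/Users/x/PDFs/batch_0001/*.pdf"):
    try:
        text = " ".join(page.extract_text() or "" for page in PdfReader(path).pages)
        index[path] = embed(text)
    except Exception as e:  # skip unreadable PDFs instead of aborting the batch
        print(f"skipped {path}: {e}")

with open("index_0001.pkl", "wb") as f:
    pickle.dump(index, f)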

30,000 PDFs result in a 4 GB index, while the system is unfortunately not very responsive (yet).

Continue reading How to run LLaMA on your local PDFs

 

CC-BY-NC Science Surf 14.06.2025, access 18.10.2025

Do physicists have hay fever more often?

Hermann von Helmholtz himself wrote about his illness.

And of Heisenberg we also know that he had hay fever.

Werner Heisenberg, whom hay fever forced in the spring of 1925, at the age of 24, to leave the university town of Göttingen and spend a few days on Helgoland. Here he revolutionized physics by abandoning the traditional, classical description of nature and creating the radically different quantum mechanics.

Erwin Schrödinger had asthma; whether he also had hay fever I could not find out.

 

CC-BY-NC Science Surf 13.06.2025, access 18.10.2025

Fighting AI with AI

Here is our newest paper – a nice collaboration with Andrea Taloni et al., along with a nice commentary – on recognizing Surgisphere-like fraud:

Recently, it was proved that the large language model Generative Pre-trained Transformer 4 (GPT-4; OpenAI) can fabricate synthetic medical datasets designed to support false scientific evidence. [The aim was] to uncover statistical patterns that may suggest fabrication in datasets produced by large language models and to improve these synthetic datasets by attempting to remove detectable marks of nonauthenticity, investigating the limits of generative artificial intelligence.

[…] synthetic datasets were produced for 3 fictional clinical studies designed to compare the outcomes of 2 alternative treatments for specific ocular diseases. Synthetic datasets were produced using the default GPT-4o model and a custom GPT. Data fabrication was conducted in November 2024. Prompts were submitted to GPT-4o to produce 12 “unrefined” datasets, which underwent forensic examination. Based on the outcomes of this analysis, the custom GPT Synthetic Data Creator was built with detailed instructions to generate 12 “refined” datasets designed to evade authenticity checks. Then, forensic analysis was repeated on these enhanced datasets.  […]

Sufficiently sophisticated custom GPTs can perform complex statistical tasks and may be abused to fabricate synthetic datasets that can pass forensic analysis as authentic.
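Without their exact pipeline at hand, here is a minimal sketch of one classic forensic check such analyses can include (an illustration of the idea, not the method from the paper): in fabricated measurements, the last digits are often not uniformly distributed.

# Chi-squared test for uniformity of last digits; a common fabrication check.
# Requires: pip install numpy scipy
import numpy as np
from scipy.stats import chisquare

def last_digit_test(values, decimals=2):
    # Format with fixed decimals so trailing zeros are kept, then test
    digits = [int(f"{abs(v):.{decimals}f}"[-1]) for v in values]
    counts = np.bincount(digits, minlength=10)
    return chisquare(counts)  # a tiny p-value means suspiciously non-uniform

# Example with plausible-looking data; fabricated tables often fail this test
rng = np.random.default_rng(0)
measurements = rng.normal(25.0, 3.0, 500).round(2)
print(last_digit_test(measurements))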

 

CC-BY-NC Science Surf 27.04.2025, access 18.10.2025

How to consensus

Science’s Holden Thorp nailed it again

Scientists take it for granted that the consensus they refer to is not the result of opinion polls of their colleagues or a negotiated agreement reached at a research conclave. Rather, it is a phrase that describes a process in which evidence from independent lines of inquiry leads collectively toward the same conclusion. This process transcends the individual scientists who carry out the research.

Unfortunately parallel lines only intersect at infinity.

 

CC-BY-NC Science Surf 26.04.2025, access 18.10.2025

How to recognize an AI image

Lensrentals has some great advice:

Quantity Based: One of the continual problems the AI art generation faces is in quantity, though it is continually improving. For instance, in the past, AI art would struggle with getting the correct number of fingers correct, or perhaps the correct placement of knuckles and joints in the fingers.

General Softness & Low Resolution: AI art takes immense computing power to generate, and it still hasn’t streamlined this problem. So often, AI art is limited in resolution and detail.

Repetition: To further expand on the tip above, AI art often uses repetition to help speed up the generation process. So you may see something copied several times over the same image.

Asymmetry: Asymmetry exists in all facets of life,  [… if you] photograph the building so that it looks symmetrical across the plane. AI doesn’t understand these rules and often creates subtle symmetry shifts in its images.

TBC

 

CC-BY-NC Science Surf 22.04.2025, access 18.10.2025