Category Archives: Software

Overfitting and model degradation

My beginner experience here isn’t exhilarating – maybe others suffer from poor models as well but never report it?

During the training phase the model tries to learn the patterns in the data, using algorithms that deduce the probability of an event from the presence or absence of certain features. But what if the model is learning from noisy, useless or wrong information? Test data may be too small or unrepresentative, and models may be too complex. As shown in the article linked above, increasing the depth of a classification tree beyond a certain cut point improves only the training accuracy but not the test accuracy – overfitting! Avoiding both under- and overfitting therefore takes a lot of experience.
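
To make that cut point concrete, here is a minimal sketch (synthetic data, scikit-learn assumed – not the experiment from the linked article) that sweeps the tree depth and prints training versus test accuracy:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic, slightly noisy classification problem
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (1, 2, 4, 8, 16, 32):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(f"depth {depth:2d}  "
          f"train {tree.score(X_train, y_train):.2f}  "
          f"test {tree.score(X_test, y_test):.2f}")
# beyond some depth the training accuracy approaches 1.0 while test accuracy stalls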

What is model degradation or concept drift? It means that the statistical properties of the predicted variable change over time in unforeseen ways. As the real world changes – politically, through climate or whatever else – the data used for prediction change with it, making the model less accurate. The computer model is static, representing the point in time when the algorithm was developed, while empirical data are dynamic. Model fit therefore needs to be reviewed at regular intervals, and again this takes a lot of experience.
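
A minimal monitoring sketch along those lines (hypothetical names and thresholds, scikit-learn metrics assumed): re-score the frozen model on each new batch of labelled data and flag possible drift when accuracy drops clearly below the level measured at deployment time.

from sklearn.metrics import accuracy_score

BASELINE_ACCURACY = 0.85   # hypothetical accuracy measured at deployment time
DRIFT_TOLERANCE = 0.05     # hypothetical drop that triggers a review / re-training

def check_drift(model, X_batch, y_batch):
    """Return batch accuracy and whether it signals possible concept drift."""
    acc = accuracy_score(y_batch, model.predict(X_batch))
    return acc, acc < BASELINE_ACCURACY - DRIFT_TOLERANCE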

 


Death by AI

spiegel.de reports a fatal accident involving a self-driving car.

Veered onto the oncoming lane in a curve
One dead and nine severely injured in an accident with a test vehicle
Four rescue helicopters and 80 firefighters were deployed: in an accident on the B28 in the district of Reutlingen a young man died, and several people were taken to hospital with severe injuries.

Is there any registry of these kinds of accidents?

https://twitter.com/ISusmelj/status/1558912252119482368

and the discussion on responsibility

The first serious accident involving a self-driving car in Australia occurred in March this year. A pedestrian suffered life-threatening injuries when hit by a Tesla Model 3, which the driver claims was in “autopilot” mode.
In the US, the highway safety regulator is investigating a series of accidents where Teslas on autopilot crashed into first-responder vehicles with flashing lights during traffic stops.

 


Big Data Paradox: quality beats quantity

https://www.nature.com/articles/s41586-021-04198-4 (via @emollick)

Surveys are a crucial tool for understanding public opinion and behaviour, and their accuracy depends on maintaining statistical representativeness of their target populations by minimizing biases from all sources. Increasing data size shrinks confidence intervals but magnifies the effect of survey bias: an instance of the Big Data Paradox … We show how a survey of 250,000 respondents can produce an estimate of the population mean that is no more accurate than an estimate from a simple random sample of size 10

It basically confirms my earlier observation in asthma genetics

this result was possible with just 415 individuals instead of 500,000 individuals nowadays
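
A toy simulation of the paradox (synthetic numbers, numpy assumed): the error of the large but non-representative sample is dominated by bias, which does not shrink with sample size, while the small random sample is merely noisy.

import numpy as np

rng = np.random.default_rng(0)
population = rng.binomial(1, 0.6, 1_000_000)             # true prevalence 0.6

# non-representative "big data": cases respond far more often than non-cases
p_respond = np.where(population == 1, 0.35, 0.15)
big = population[rng.random(population.size) < p_respond]

small = rng.choice(population, size=10, replace=False)    # simple random sample

print("true mean          :", round(population.mean(), 3))
print(f"biased, n={big.size}: {big.mean():.3f}")          # tight interval around the wrong value
print("random, n=10       :", round(small.mean(), 3))     # noisy but unbiased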

 


It is only Monday but already depressing

Comment on the PaLM paper by u/Flaky_Suit_8665 via @hardmaru

67 authors, 83 pages, 540B parameters in a model, the internals of which no one can say they comprehend with a straight face, 6144 TPUs in a commercial lab that no one has access to, on a rig that no one can afford, trained on a volume of data that a human couldn’t process in a lifetime, 1 page on ethics with the same ideas that have been rehashed over and over elsewhere with no attempt at a solution – bias, racism, malicious use, etc. – for purposes that who asked for?

 


(replication crisis)^2

We always laughed at the papers in the “Journal of Irreproducible Results”

https://www.thriftbooks.com/w/the-best-of-the-journal-of-irreproducible-results/473440/item/276126/#idiq=276126&edition=1874246

 

then we had the replication crisis and nobody laughed anymore.

 

And today? It seems that irreproducible research is set to reach new heights. Elizabeth Gibney discusses an arXiv paper by Sayash Kapoor and Arvind Narayanan, basically saying that

reviewers do not have the time to scrutinize these models, so academia currently lacks mechanisms to root out irreproducible papers, he says. Kapoor and his co-author Arvind Narayanan created guidelines for scientists to avoid such pitfalls, including an explicit checklist to submit with each paper … The failures are not the fault of any individual researcher, he adds. Instead, a combination of hype around AI and inadequate checks and balances is to blame.

Algorithms getting stuck on shortcuts that don’t always hold have been discussed here earlier. Data leakage (good old confounding) due to proxy variables also seems to be a common issue.
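
A minimal sketch of such leakage (entirely synthetic data, scikit-learn assumed): selecting features on the full dataset before cross-validation lets information from the test folds leak into training and produces optimistic accuracy on pure noise.

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))       # pure noise features
y = rng.integers(0, 2, size=100)       # labels independent of X

# leaky: feature selection sees all labels, cross-validation scores only the classifier
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5).mean()

# honest: selection happens inside each training fold of the pipeline
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky CV accuracy : {leaky:.2f}")    # typically well above chance
print(f"honest CV accuracy: {honest:.2f}")   # close to 0.5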

 


More about the AI winter

towardsdatascience.com

In the deep learning community, it is common to retrospectively blame Minsky and Papert for the onset of the first ‘AI Winter,’ which made neural networks fall out of fashion for over a decade. A typical narrative mentions the ‘XOR Affair,’ a proof that perceptrons were unable to learn even very simple logical functions as evidence of their poor expressive power. Some sources even add a pinch of drama recalling that Rosenblatt and Minsky went to the same school and even alleging that Rosenblatt’s premature death in a boating accident in 1971 was a suicide in the aftermath of the criticism of his work by colleagues.

 

 


Headless Chrome is better than wkhtmltopdf to store PDFs of websites

I have been using wkhtmltopdf for ages, but as I am getting more and more issues with missing fonts I was looking for an alternative, in particular for websites that load dynamically.
Here is the macOS syntax:

"/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome" --headless --disable-gpu --print-to-pdf=/Users/xxx/Desktop/pdf/test.pdf https://chromestatus.com/features

 


Data and methods available? Forget it!

Bergeat 2022

Data were available for 2 of 65 RCTs (3.1%) published before the ICMJE policy and for 2 of 65 RCTs (3.1%) published after the policy was issued (odds ratio, 1.00; 95% CI, 0.07-14.19; P > .99).

Danchev 2021

Among the 89 articles declaring that IPD would be stored in repositories, only 17 (19.1%) deposited data, mostly because of embargo and regulatory approval.

Gabelica 2022 (visualization @ Nature)

Of 3556 analyzed articles, 3416 contained DAS. The most frequent DAS category (42%) indicated that the datasets are available on reasonable request. Among 1792 manuscripts in which DAS indicated that authors are willing to share their data, 1670 (93%) authors either did not respond or declined to share their data with us. Among 254 (14%) of 1792 authors who responded to our query for data sharing, only 122 (6.8%) provided the requested data.

The same issue also applies to software sharing, where less than 5% of all papers deposit code. And even when software is deposited, it often no longer runs a few years later because operating systems and libraries have changed.

Both issues took me many years of my scientific life. The problem is recognized by policymakers in Germany, but even the most recent action plan looks … ridiculous. Why not make data and software sharing mandatory at the time of publication?

 


Responsibility for algorithms

Excellent paper at towardsdatascience.com about the responsibility for algorithms, including a

broad framework for involving citizens to enable the responsible design, development, and deployment of algorithmic decision-making systems. This framework aims to challenge the current status quo where civil society is in the dark about risky ADS.

I think that the responsibility lies not primarily with the developer but with the user and the social and political framework (SPON warns about the numerous crazy errors that happen when AI is allowed to decide about human behaviour, while I can also recommend “Weapons of Math Destruction” here).

Now that we are in the third wave of machine learning, the question is already being discussed (Economist & Washington Post) whether AI has a personality of its own.


The dialogue sounds slightly better than ELIZA but again way off.

We clearly need to regulate that gold rush to avoid further car crashes like this one in China and this one in France.

 


A bicycle distance warner

After the electricity meter and the gas meter, here comes my third Raspberry Pi Zero project: a distance warner for passing vehicles. I first read about the idea in a scientific article, then there was the Radmesser project in Berlin (a great project, but the box was a bit bulky).

There was also something on Kickstarter at some point, and then there is the €200 Varia radar from Garmin – but none of the previous projects had a built-in camera.

Laser and ToF sensors have always appealed to me, so let’s try that here as well.
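
A minimal sketch of the warning logic (the sensor readout below is a hypothetical placeholder for whatever laser/ToF driver ends up being used; 1.5 m is the minimum overtaking distance required in Germany within towns):

import time

THRESHOLD_MM = 1500          # warn below 1.5 m lateral distance

def read_distance_mm():
    """Placeholder for the actual laser/ToF sensor readout (e.g. over I2C)."""
    raise NotImplementedError

def monitor(interval_s=0.05):
    while True:
        distance = read_distance_mm()
        if distance < THRESHOLD_MM:
            print(f"close pass: {distance} mm")   # here: log the event, trigger the camera
        time.sleep(interval_s)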

Continue reading A bicycle distance warner

 


Another problem in AI: Out-of-distribution generalization

Not sure if it is really the biggest but certainly one of the most pressing problems: Out-of-distribution generalization. It is explained as

Imagine, for example, an AI that’s trained to identify cows in images. Ideally, we’d want it to learn to detect cows based on their shape and colour. But what if the cow pictures we put in the training dataset always show cows standing on grass? In that case, we have a spurious correlation between grass and cows, and if we’re not careful, our AI might learn to become a grass detector rather than a cow detector.

As an epidemiologist I would simply have called it collider bias or confounding – every new field rediscovers the same problems over and over again.


Not unexpectedly, an AI that just runs over pixels at random ends up with spurious associations. Once the shape and colour of cows have been detected, the surrounding environment – grass or stable – is irrelevant. That means that after getting initial results we have to step back and simulate different lighting conditions, from sunlight to lightbulb, and different environments, from grass to slatted floor (invariance principle). Shape and size also matter – cow spots keep their size and form to some extent, irrespective of whether it is a real animal or a children’s toy (scaling principle). I am a bit more sceptical about also including multimodal data (e.g. a smacking sound), as the absence of such features is no proof of non-existence, and the sound can also be imitated by other animals.
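
A toy sketch of the cow/grass shortcut (entirely synthetic data, scikit-learn assumed): a “background” feature that is correlated with the label during training but not at test time is enough to make accuracy collapse out of distribution.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, p_grass_given_cow):
    y = rng.integers(0, 2, n)                    # 1 = cow, 0 = no cow
    shape = y + rng.normal(0, 1.5, n)            # weak "real" signal
    grass = rng.binomial(1, np.where(y == 1, p_grass_given_cow, 1 - p_grass_given_cow))
    return np.column_stack([shape, grass]), y

X_train, y_train = make_data(5000, 0.95)   # training: cows almost always on grass
X_test,  y_test  = make_data(5000, 0.50)   # test: background no longer informative

clf = LogisticRegression().fit(X_train, y_train)
print("train accuracy:", round(clf.score(X_train, y_train), 2))  # high, thanks to grass
print("test accuracy :", round(clf.score(X_test, y_test), 2))    # drops out of distribution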

And yes, less is more.

 
