{"id":25900,"date":"2025-11-03T17:07:40","date_gmt":"2025-11-03T15:07:40","guid":{"rendered":"https:\/\/www.wjst.de\/blog\/?p=25900"},"modified":"2026-04-01T07:19:09","modified_gmt":"2026-04-01T05:19:09","slug":"is-there-a-data-agnostic-method-to-find-repetitive-data-in-clinical-trials","status":"publish","type":"post","link":"https:\/\/www.wjst.de\/blog\/sciencesurf\/2025\/11\/is-there-a-data-agnostic-method-to-find-repetitive-data-in-clinical-trials\/","title":{"rendered":"Is there a data agnostic method to find repetitive data in clinical trials?"},"content":{"rendered":"<p>There is an interesting observation by Nick Brown over at <a href=\"https:\/\/pubpeer.com\/publications\/C08779C45DB6E407DFAC85583BE9C4\">Pubpeer<\/a> who analysed <a href=\"https:\/\/www.bmj.com\/content\/391\/bmj-2024-083382\">a clinical dataset<\/a> (see also <a href=\"https:\/\/www.bmj.com\/content\/391\/bmj-2024-083382\/rr-3\">my comment<\/a> at the BMJ)<\/p>\n<blockquote><p>&#8230;there is a curious repeating pattern of records in the dataset. Specifically, every 101 records, in almost every case the following variables are identical: WBC, Hb, Plt, BUN, Cr, Na, BS, TOTALCHO, LDL, HDL, TG, PT, INR, PTT<\/p><\/blockquote>\n<p>which is remarkable detective work. 
By plotting the full dataset as a heatmap of z scores, I can confirm his observation of clusters after sorting by modulo-101 bin.<\/p>\n<p><a href=\"https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/plot.jpg\" rel=\"key\" data-rel=\"key-image-0\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-25901 size-medium\" src=\"https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/plot-620x246.jpg\" alt=\"\" width=\"620\" height=\"246\" srcset=\"https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/plot-620x246.jpg 620w, https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/plot-1262x500.jpg 1262w, https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/plot-768x304.jpg 768w, https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/plot-1536x609.jpg 1536w, https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/plot-2048x812.jpg 2048w\" sizes=\"auto, (max-width: 620px) 100vw, 620px\" \/><\/a><\/p>\n<p>How could we have found the repetitive values without knowing the period length? Is there any formal, data-agnostic detection method?<\/p>\n<p>If we don&#8217;t even know the initial sorting variable, it may make sense to look primarily for monotonic and nearly unique variables, i.e. plausible ordering variables. Clearly, that&#8217;s obs_id in the BMJ dataset.<\/p>\n<p>Let us first collapse all continuous variables of a row into a string forming a fingerprint. Then we compute pairwise correlations (or Euclidean distances in this case) of all fingerprinted rows. If a dataset contains many identical or near-identical rows, we will see a multimodal distribution of correlations plus an additional big spike at 1.0 for duplicated rows. 
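As a minimal sketch of this fingerprint-and-similarity idea &#8211; on synthetic toy data, not the BMJ file, and with function and variable names of my own &#8211; one could write:<\/p>

```python
import numpy as np
from collections import Counter

# Toy stand-in for the lab columns (NOT the BMJ data): 300 rows of 6
# continuous variables, with row 0 copy-pasted in at a fixed spacing.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
X[50::50] = X[0]  # plant 5 extra copies of row 0

# Fingerprint: collapse each row's continuous values into one string.
fingerprints = ['|'.join(f'{v:.2f}' for v in row) for row in X]
dupes = {fp: n for fp, n in Counter(fingerprints).items() if n > 1}

# Pairwise row correlations: duplicated rows produce an extra spike
# at exactly 1.0 in the distribution of off-diagonal correlations.
C = np.corrcoef(X)
off_diag = ~np.eye(len(X), dtype=bool)
n_perfect = int((np.isclose(C, 1.0) & off_diag).sum()) // 2
```

<p>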
This is exactly what happens here.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-25903 size-medium\" src=\"https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/histo-620x306.png\" alt=\"\" width=\"620\" height=\"306\" srcset=\"https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/histo-620x306.png 620w, https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/histo-1013x500.png 1013w, https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/histo-768x379.png 768w, https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/histo-1536x758.png 1536w, https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/histo.png 1580w\" sizes=\"auto, (max-width: 620px) 100vw, 620px\" \/><\/p>\n<p>Unfortunately, this works only if mainly repetitive variables are included and not too many non-repetitive ones.<\/p>\n<p>Next, I thought of Principal Component Analysis (PCA), as the identical blocks may create linear dependencies so that the covariance matrix becomes rank-deficient. But unfortunately the results here were not very impressive &#8211; so we had better stick with the cosine similarity above.<\/p>\n<p data-start=\"1816\" data-end=\"1945\">So, rest assured, we find an excess of identical values &#8211; but how to proceed? Duplicates spaced by a fixed lag will cause a high lag-k autocorrelation in each variable. 
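Such a lag scan can be sketched as follows &#8211; again on synthetic data with a planted lag of 101, and the helper name is my own:<\/p>

```python
import numpy as np

# Synthetic data with a planted duplication lag (toy example, not the
# BMJ dataset): the first 101 rows are copy-pasted once, 101 rows later.
rng = np.random.default_rng(1)
period, n, p = 101, 600, 8
X = rng.normal(size=(n, p))
X[period:2 * period] = X[:period]  # duplicate a whole block at lag 101

def lag_match_counts(X, max_lag=None):
    '''For each lag k, count rows i whose values equal row i + k.'''
    n = len(X)
    max_lag = max_lag or n // 2
    return {k: int(np.all(np.isclose(X[:-k], X[k:]), axis=1).sum())
            for k in range(1, max_lag + 1)}

counts = lag_match_counts(X)
best_lag = max(counts, key=counts.get)  # spike at the duplication lag
```

<p>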
Scanning <span class=\"katex\"><span class=\"katex-mathml\">k=1\u2026N\/2<\/span><\/span> reveals spikes at the duplication lag, as shown by a periodogram of row-wise similarity in the BMJ dataset.<\/p>\n<p data-start=\"1985\" data-end=\"2175\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-25904\" src=\"https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/lag-620x306.png\" alt=\"\" width=\"620\" height=\"306\" srcset=\"https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/lag-620x306.png 620w, https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/lag-1013x500.png 1013w, https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/lag-768x379.png 768w, https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/lag-1536x758.png 1536w, https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/lag.png 1580w\" sizes=\"auto, (max-width: 620px) 100vw, 620px\" \/><\/p>\n<p>So there are peaks at around 87, 101 and 122. Unfortunately, I am not an expert in time-series or signal-processing analysis. Can somebody else jump in here and provide some help with FFT?<\/p>\n<p>There may be an even easier method, using the fingerprint gap. For every fingerprint that occurs more than once, we sort those rows by obs_id and compute the differences in obs_id between consecutive matches. This shows just one dominant gap, at 101!<\/p>\n<p>We could also test all relevant modulus values, let&#8217;s say between 50 and 150. For each candidate, we compute the across-group variance of the standardized lab means. 
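A hedged sketch of this modulus scan &#8211; on synthetic data with a 101-row block pasted several times, not the script that produced the numbers reported here:<\/p>

```python
import numpy as np

# Toy example (not the real data): one 101-row block pasted 4 times in
# a row, followed by 200 unrelated rows.
rng = np.random.default_rng(2)
period, copies, p = 101, 4, 6
block = rng.normal(size=(period, p))
X = np.vstack([block] * copies + [rng.normal(size=(200, p))])

# Standardize the columns, then score each candidate modulus by the
# across-group variance of the per-group means.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

def mod_score(Z, m):
    groups = np.arange(len(Z)) % m
    means = np.array([Z[groups == r].mean(axis=0) for r in range(m)])
    return float(means.var(axis=0).mean())

scores = {m: mod_score(Z, m) for m in range(50, 151)}
best_m = max(scores, key=scores.get)  # expect the planted period to score highest
```

<p>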
The result is interesting:<\/p>\n<pre>Modulus 52: variance = 0.084019\r\nModulus 87: variance = 0.138662\r\nModulus 101: variance = 0.789720<\/pre>\n<p>As a cross-check, let us look into white blood cell counts (WBC) and hemoglobin (Hb).<\/p>\n<p><a href=\"https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/Bildschirmfoto-2025-11-06-um-18.59.00.jpg\" rel=\"key\" data-rel=\"key-image-1\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-25939 size-medium\" src=\"https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/Bildschirmfoto-2025-11-06-um-18.59.00-620x490.jpg\" alt=\"\" width=\"620\" height=\"490\" srcset=\"https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/Bildschirmfoto-2025-11-06-um-18.59.00-620x490.jpg 620w, https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/Bildschirmfoto-2025-11-06-um-18.59.00-633x500.jpg 633w, https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/Bildschirmfoto-2025-11-06-um-18.59.00-768x606.jpg 768w, https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/Bildschirmfoto-2025-11-06-um-18.59.00.jpg 993w\" sizes=\"auto, (max-width: 620px) 100vw, 620px\" \/><\/a><\/p>\n<p>I am not sure how to interpret this. Mod 52 may reflect shorter template fragments but did not show up in the autocorrelation test. Mod 87 has a rather smooth, coherent curve and is supported by the autocorrelation. Mod 101 is noisier, but probably gives the best explanation for block-copied values. Maybe the authors block-copied on two occasions?<\/p>\n<p>The next day, I thought of a strategy to find the exact repetition number. Why not loop over moduli 50 through 150 and simply count the number of identical blocks? 
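A sketch of this block-counting loop &#8211; once more on synthetic toy data with planted duplicate blocks, and with a helper name of my own:<\/p>

```python
import numpy as np
from collections import Counter

# Toy data (not the BMJ file) with duplicated multi-row blocks at lag 101.
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 6))
X[110:113] = X[9:12]     # a 3-row block repeated 101 rows later
X[240:242] = X[139:141]  # a 2-row block repeated 101 rows later

def block_sizes(X, m):
    '''Lengths of maximal runs of consecutive rows i with X[i] == X[i + m].'''
    match = np.all(np.isclose(X[:-m], X[m:]), axis=1)
    sizes, run = [], 0
    for hit in match:
        if hit:
            run += 1
        elif run:
            sizes.append(run)
            run = 0
    if run:
        sizes.append(run)
    return sizes

# Loop over candidate moduli and tally identical blocks by size.
counts = {m: Counter(block_sizes(X, m)) for m in range(50, 151)}
best_m = max(counts, key=lambda m: sum(counts[m].values()))
```

<p>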
This is very informative &#8211; blocks of size 2, size 3, and size 4 or greater all show an exact maximum at modulus 101.<\/p>\n<p><a href=\"https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/PREVENT_modblock_histograms-scaled.png\" rel=\"key\" data-rel=\"key-image-2\" data-rl_title=\"\" data-rl_caption=\"\" title=\"\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-25955 size-medium\" src=\"https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/PREVENT_modblock_histograms-620x775.png\" alt=\"\" width=\"620\" height=\"775\" srcset=\"https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/PREVENT_modblock_histograms-620x775.png 620w, https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/PREVENT_modblock_histograms-400x500.png 400w, https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/PREVENT_modblock_histograms-768x960.png 768w, https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/PREVENT_modblock_histograms-1229x1536.png 1229w, https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/PREVENT_modblock_histograms-1638x2048.png 1638w, https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/PREVENT_modblock_histograms-scaled.png 2048w\" sizes=\"auto, (max-width: 620px) 100vw, 620px\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<p><u>23.3.2026 Appendix<\/u><\/p>\n<p>There seem to be many more studies out there with <a href=\"https:\/\/www.sciencedetective.org\/scientific-datasets-are-riddled-with-copy-paste-errors\/\">copy-paste<\/a> signs, including a Parkinson Cell paper, a PLoS Genetics toxicology paper and a Nat Comm fish ecology study. Here is the <a href=\"https:\/\/github.com\/markusenglund\/copy-paste-detective\">GitHub link<\/a> to the implementation by Markus Englund.<\/p>\n<p>Hopefully I got the pipeline right in summarizing the entropy calculation there. This is not Shannon entropy &#8211; it is a custom measure of how informationally surprising a raw number is. 
The logic is:<\/p>\n<ul>\n<li>Strip the decimal point and trailing zeros from the number&#8217;s string representation, then take the absolute integer value. So <code>0.314<\/code> \u2192 <code>314<\/code>, <code>0.500<\/code> \u2192 <code>5<\/code> (trailing zeros stripped), <code>2016<\/code> \u2192 <code>16<\/code> (year exception: years 1900-2030 get a capped entropy of 100).<\/li>\n<li>Apply a log-scaled transformation: values below 100 get <code>log10(value)<\/code>; values up to 100,000 get <code>5\u00d7log10 - 8<\/code>; larger values get <code>log10 + 12<\/code>.<\/li>\n<li>For column sequences, sum the individual entropy scores of each value in the run.<\/li>\n<li>Adjust downward for &#8220;regularity&#8221; &#8211; if the values in a sequence follow a regular arithmetic interval (e.g. 1.0, 2.0, 3.0), the score is reduced proportionally, because regular sequences can appear legitimately.<\/li>\n<li>Normalise by <code>logNumberCountModifier<\/code> (the log of the total number of numeric cells on the sheet) so large sheets don&#8217;t get disproportionately penalised.<\/li>\n<\/ul>\n<p>The suspicion grades are fixed thresholds on the resulting normalized score. 
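My attempt to translate these steps into Python &#8211; a hedged sketch, not verified against the TypeScript source; in particular, the year handling and the omitted regularity factor are my own guesses:<\/p>

```python
import math

def raw_value(x):
    '''Strip the decimal point and trailing zeros from the string form,
    then take the absolute integer value: 0.314 -> 314, 0.500 -> 5.'''
    s = str(abs(x)).replace('.', '').rstrip('0').lstrip('0')
    return int(s or '0')

def number_entropy(x):
    '''Per-number "entropy" as described above. The year handling is my
    guess at the stated cap; the exact rule lives in the TypeScript source.'''
    if isinstance(x, int) and 1900 <= x <= 2030:
        v = 100  # year exception: capped
    else:
        v = max(raw_value(x), 1)  # guard against log10(0)
    if v < 100:
        return math.log10(v)
    if v <= 100_000:
        return 5 * math.log10(v) - 8  # continuous with the branch above at 100
    return math.log10(v) + 12         # continuous at 100,000

def run_score(values, n_numeric_cells):
    '''Sum the per-value scores of a repeated run and normalise by sheet
    size. (The proportional "regularity" reduction is omitted here because
    the summary above does not give the exact factor.)'''
    return sum(number_entropy(v) for v in values) / math.log(n_numeric_cells)
```

<p>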
I will add the strategy to my Python script (the original is implemented in TypeScript) as another module and upload it to GitHub once it has been sufficiently tested.<\/p>\n<p><u>31.3.2026 Appendix<\/u><\/p>\n<p>PREVENT-TAHA8, the starting point of this analysis, <a href=\"https:\/\/www.bmj.com\/content\/391\/bmj-2024-083382\/rapid-responses\">has been retracted today<\/a>. I will give a presentation on the avalanche triggered by this paper on 29\u201331 July 2026 in Hannover.<\/p>\n<figure id=\"attachment_26263\" aria-describedby=\"caption-attachment-26263\" style=\"width: 400px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/Bildschirmfoto-2026-03-31-um-18.54.19.jpg\" rel=\"key\" data-rel=\"key-image-3\" data-rl_title=\"Screenshot\" data-rl_caption=\"Screenshot\" title=\"Screenshot\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-26263 size-large\" src=\"https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/Bildschirmfoto-2026-03-31-um-18.54.19-400x500.jpg\" alt=\"\" width=\"400\" height=\"500\" srcset=\"https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/Bildschirmfoto-2026-03-31-um-18.54.19-400x500.jpg 400w, https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/Bildschirmfoto-2026-03-31-um-18.54.19-620x774.jpg 620w, https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/Bildschirmfoto-2026-03-31-um-18.54.19-768x959.jpg 768w, https:\/\/www.wjst.de\/blog\/wp-content\/uploads\/2025\/11\/Bildschirmfoto-2026-03-31-um-18.54.19.jpg 809w\" sizes=\"auto, (max-width: 400px) 100vw, 400px\" \/><\/a><figcaption id=\"caption-attachment-26263\" class=\"wp-caption-text\">Screenshot 31\/3\/26<\/figcaption><\/figure>\n<p>&nbsp;<\/p>\n\n<p>&nbsp;<\/p>\n<div class=\"bottom-note\">\n  <span class=\"mod1\">CC-BY-NC Science Surf, accessed 24.04.2026<\/span>\n <\/div>","protected":false},"excerpt":{"rendered":"<p>There is an interesting observation by Nick Brown over at Pubpeer 
who analysed a clinical dataset (see also my comment at the BMJ) &#8230;there is a curious repeating pattern of records in the dataset. Specifically, every 101 records, in almost every case the following variables are identical: WBC, Hb, Plt, BUN, Cr, Na, BS, TOTALCHO, LDL, &hellip; <a href=\"https:\/\/www.wjst.de\/blog\/sciencesurf\/2025\/11\/is-there-a-data-agnostic-method-to-find-repetitive-data-in-clinical-trials\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Is there a data agnostic method to find repetitive data in clinical trials?<\/span> <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20,9],"tags":[2927],"class_list":["post-25900","post","type-post","status-publish","format-standard","hentry","category-note-worthy","category-computer-software","tag-statistics"],"_links":{"self":[{"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/posts\/25900","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/comments?post=25900"}],"version-history":[{"count":28,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/posts\/25900\/revisions"}],"predecessor-version":[{"id":26264,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/posts\/25900\/revisions\/26264"}],"wp:attachment":[{"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/media?parent=25900"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.wjst.de\/blog\/wp-json\/wp\/v2\/categories?post=25900"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.wj
st.de\/blog\/wp-json\/wp\/v2\/tags?post=25900"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}