Tag Archives: statistics

Is there a data agnostic method to find repetitive data in clinical trials?

There is an interesting observation by Nick Brown over at PubPeer, who analysed a clinical dataset:

…there is a curious repeating pattern of records in the dataset. Specifically, every 101 records, in almost every case the following variables are identical: WBC, Hb, Plt, BUN, Cr, Na, BS, TOTALCHO, LDL, HDL, TG, PT, INR, PTT

which is remarkable detective work. By plotting the full dataset as a heatmap of z-scores, I can confirm his observation of clusters after sorting by the modulo-101 bin.
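The sorting step can be sketched as follows. This is a minimal illustration on synthetic data, not the BMJ dataset: I generate 1010 rows with four stand-in lab variables where every 101st record repeats a template, then z-score the columns and reorder the rows by their modulo-101 bin, as done for the heatmap.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the lab table (assumption for illustration):
# 1010 rows, 4 lab variables, every 101st record repeats the same template.
n, period = 1010, 101
template = rng.normal(size=(period, 4))           # one value set per modulo bin
data = template[np.arange(n) % period] + rng.normal(scale=0.01, size=(n, 4))

# Column-wise z-scores, as used for the heatmap
z = (data - data.mean(axis=0)) / data.std(axis=0)

# Sort rows by their modulo-101 bin; repeated records now form visible blocks
order = np.argsort(np.arange(n) % period, kind="stable")
z_sorted = z[order]
```

Plotting `z_sorted` with any heatmap function then shows the horizontal banding.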

How could we have found the repetitive values without knowing the period length? Is there any formal, data-agnostic detection method without knowing the modulo key?

Even if we don’t know which variable was used for the initial sorting? It may make sense to look primarily for monotonic and nearly unique variables, i.e. plausible ordering variables. Clearly, that’s obs_id in the BMJ dataset.

Let us first collapse all continuous variables of a row into a string, forming a fingerprint. Then we compute pairwise similarities of all rows (cosine similarity here; correlations or Euclidean distances would also work). If a dataset contains many identical or near-identical rows, we will see a multimodal distribution of similarities plus an additional big spike at 1.0 for duplicated rows. This is exactly what happens here.
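A minimal sketch of both steps, again on synthetic data with an assumed period of 101: build a string fingerprint per row, count duplicates, and compute the pairwise cosine similarities whose histogram shows the spike at 1.0.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)

# Synthetic data: every 101st row is an exact copy (assumption for illustration)
n, period = 505, 101
X = rng.normal(size=(period, 6))[np.arange(n) % period]

# String fingerprint per row (rounding absorbs floating-point jitter)
fingerprints = ["|".join(f"{v:.2f}" for v in row) for row in X]
dupes = Counter(fingerprints)

# Pairwise cosine similarity of the standardized rows
Z = (X - X.mean(0)) / X.std(0)
U = Z / np.linalg.norm(Z, axis=1, keepdims=True)
S = U @ U.T
upper = S[np.triu_indices(n, k=1)]        # off-diagonal similarities

# Duplicated rows produce an extra spike at 1.0
spike = np.mean(upper > 0.999)
```

A histogram of `upper` then shows the multimodal distribution described above, with the duplicate spike at 1.0.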

Unfortunately this works only when mostly repetitive variables are included and not too many non-repetitive ones.

Next, I thought of Principal Component Analysis (PCA), as the identical blocks may create linear dependencies and the covariance matrix may become rank-deficient. But unfortunately the results here were not very impressive, so we had better stick with the cosine similarity histogram above.
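One possible reason PCA disappoints, shown on the same kind of synthetic data: duplicated *rows* do not create linear dependencies among the *columns*, so the covariance matrix keeps full rank (duplicated columns would lower it). This is a sketch of the check, not a claim about the actual BMJ data.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: 505 rows but only 101 distinct ones (row duplication)
n, period = 505, 101
X = rng.normal(size=(period, 6))[np.arange(n) % period]

# Row duplication leaves the column covariance full-rank,
# so the PCA spectrum shows no tell-tale near-zero eigenvalues.
cov = np.cov(X, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]   # descending
explained = eigvals / eigvals.sum()
```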

So rest assured we will find something, but how to proceed? Duplicates spaced by a fixed lag will cause a high lag-k autocorrelation in each variable

r(k) = corr(x_t, x_{t+k})

Scanning k = 1…N/2 reveals spikes at the duplication lag. This can also be shown by a periodogram of row-wise similarity.
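The lag scan can be sketched like this, again on a synthetic series with an assumed hidden period of 101:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic variable repeating every 101 records, plus measurement noise
n, period = 1010, 101
x = rng.normal(size=period)[np.arange(n) % period] + rng.normal(scale=0.1, size=n)

def autocorr(x, k):
    """Pearson correlation between x_t and x_{t+k}."""
    return np.corrcoef(x[:-k], x[k:])[0, 1]

# Scan k = 1 ... N/2 and look for spikes
lags = np.arange(1, n // 2)
r = np.array([autocorr(x, k) for k in lags])
best_lag = lags[np.argmax(r)]
```

Spikes appear at the duplication lag and its multiples, so `best_lag` is a multiple of the true period.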

So there are peaks at around 87, 101 and 122; let’s keep that in mind. Unfortunately I am not an expert in time series or signal processing. Can somebody else jump in here with FFT et al.?

There may be an even easier method, a fingerprint-gap method. For every fingerprint that occurs more than once, we sort those rows by obs_id and compute the differences of obs_id between consecutive matches. Well, this shows just one dominant gap at 101!
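The fingerprint-gap method fits in a few lines; here is a sketch on synthetic data with obs_id as the ordering variable:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(4)

# Synthetic table with obs_id and rows repeating every 101 records (assumption)
n, period = 1010, 101
obs_id = np.arange(n)
X = rng.normal(size=(period, 5))[obs_id % period]

# Group rows by fingerprint, then diff obs_id within each group
groups = defaultdict(list)
for i, row in zip(obs_id, X):
    groups["|".join(f"{v:.3f}" for v in row)].append(i)

gaps = []
for ids in groups.values():
    if len(ids) > 1:
        gaps.extend(np.diff(sorted(ids)))

# The histogram of gaps collapses onto the copy period
values, counts = np.unique(gaps, return_counts=True)
dominant_gap = values[np.argmax(counts)]
```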

Then we test the modulo values, let’s say between 50 and 150. For each candidate we compute the across-group variance of the standardized lab means. The result is interesting:

Modulus 52: variance = 0.084019
Modulus 87: variance = 0.138662
Modulus 101: variance = 0.789720
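A scan like this can be sketched as follows, on synthetic data with a true period of 101 (so the numbers will differ from the table above); the intuition is that the correct modulus yields homogeneous groups and hence high variance *between* group means:

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic labs with a true copy period of 101 (assumption for illustration)
n, true_period = 1010, 101
X = rng.normal(size=(true_period, 5))[np.arange(n) % true_period]
Z = (X - X.mean(0)) / X.std(0)

def modulo_variance(Z, m):
    """Across-group variance of the group means when rows are
    binned by obs_id % m; a true copy period maximizes it."""
    bins = np.arange(len(Z)) % m
    group_means = np.array([Z[bins == b].mean(axis=0) for b in range(m)])
    return group_means.var(axis=0).mean()

# Test candidate moduli between 50 and 150
scores = {m: modulo_variance(Z, m) for m in range(50, 151)}
best_m = max(scores, key=scores.get)
```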

As a cross-check we look into the details, say white blood cell counts (WBC) and hemoglobin (Hb).

I am not sure how to interpret this data archaeology now. Mod 52 may reflect shorter template fragments but did not show up in the autocorrelation test. Mod 87 has a rather smooth, coherent curve and is supported by the autocorrelation. Mod 101 is noisier, but probably gives the best explanation for block-copied values. Maybe the authors block-copied on two occasions?

 

CC-BY-NC Science Surf accessed 07.11.2025

Cause and effect in observational data: Magic, alchemy or just a new statistical tool?

Slashdot has a feature on this:

Statisticians have long thought it impossible to tell cause and effect apart using observational data. The problem is to take two sets of measurements that are correlated, say X and Y, and to find out if X caused Y or Y caused X. That’s straightforward with a controlled experiment… But in the last couple of years, statisticians have developed a technique that can tease apart cause and effect from the observational data alone. It is based on the idea that any set of measurements always contain noise. However, the noise in the cause variable can influence the effect but not the other way round. So the noise in the effect dataset is always more complex than the noise in the cause dataset. … The results suggest that the additive noise model can tease apart cause and effect correctly in up to 80 per cent of the cases (provided there are no confounding factors or selection effects).
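The additive-noise idea can be illustrated with a toy sketch (this is my own minimal version on simulated continuous data, not the published algorithm; `dist_corr` and `anm_score` are hypothetical helpers): regress effect on putative cause, then measure whether the residuals are independent of the cause. In the true causal direction the residuals are just the independent noise; in the reverse direction they stay structured.

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy cause -> effect pair: y = x^3 + independent additive noise (assumption)
x = rng.uniform(-2, 2, size=1000)
y = x**3 + rng.uniform(-0.5, 0.5, size=1000)

def dist_corr(a, b):
    """Sample distance correlation: near 0 when a and b are independent."""
    def centered(v):
        d = np.abs(v[:, None] - v[None, :])
        return d - d.mean(0) - d.mean(1)[:, None] + d.mean()
    A, B = centered(a), centered(b)
    dcov2 = (A * B).mean()
    return np.sqrt(dcov2 / np.sqrt((A * A).mean() * (B * B).mean()))

def anm_score(cause, effect, deg=5):
    """Fit effect = f(cause) by polynomial regression; return the
    dependence between residuals and the putative cause
    (small score = plausible causal direction)."""
    coeffs = np.polyfit(cause, effect, deg)
    resid = effect - np.polyval(coeffs, cause)
    return dist_corr(cause, resid)

forward = anm_score(x, y)    # x -> y: residuals ~ independent noise
backward = anm_score(y, x)   # y -> x: residuals remain structured
```

On this simulated pair the forward score comes out clearly smaller than the backward score, which is the decision rule of the additive noise model.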

and JMLR has a more theoretical account:

Based on these deliberations we propose an efficient new algorithm that is able to distinguish between cause and effect for a finite sample of discrete variables.

tbc

 
