Some of us are old enough to remember what was originally promised by genome-wide association studies (GWASs): we would finally discover the genes aetiologically involved in the conditions which till then we had been researching using a combination of linkage and candidate gene association studies. Clearly, this has not happened. With the benefit of hindsight and a myriad actual results we now clearly appreciate what perhaps we should always have realised, which is that common variants do not have substantial effects on phenotypes. GWASs yield complex, difficult to interpret findings which implicate variants but not genes and have not delivered the insights which we were promised they would.
During a correspondence with a colleague about the validity of genotyping rare variants I became aware of a paper in Science that I missed initially. It is about genetic signatures in longevity:
Using these data, we built a genetic model that includes 150 single-nucleotide polymorphisms (SNPs) and found that it could predict EL with 77% accuracy in an independent set of centenarians and controls.
That’s an extremely difficult enterprise, given our recent results on the heritability of life span. But that’s not the point; I am now reading the editorial expression of concern by Bruce Alberts:
In their study, Sebastiani et al. used a number of different genotyping platforms and neglected to perform data quality-control steps, which resulted in their reporting several false-positive single-nucleotide polymorphism (SNP) associations. In particular, one of the platforms used in their work, the Illumina 610-Quad array, has been shown in unpublished studies by other investigators to produce artifactual genotype data at a subset of SNPs.
No idea what’s bad with the Illumina … Continue reading What’s wrong with these genotypes?
At least some people believe that once it’s published in Nature, it must be superior science – even when it’s rather trivial (or even wrong). There is a category “Brief Communications Arising”, but when you try to get your comments published there, you will get this message by email:
In the present case, while we appreciate the interest of your comments to the community, we do not feel that they challenge key data or conclusions of the papers by Pleasance et al., and therefore we cannot offer to consider your paper for publication in our Brief Communications Arising section.
I wondered whether Huntington’s chorea might be unintentionally diagnosed by current SNP chips used in research projects on other diseases. At first glance, this seems unlikely – we are dealing with CAG repeats in huntingtin, located on chromosome 4p16.3, that are not easily accessible by SNP panels. Continue reading Will current SNP chips unintentionally diagnose Huntington?
… permanently leading to errors – the plus / minus strand orientation and the resulting sequence / allele designation of SNPs. Only recently I came across a fancy way to define strand direction – from a tech note Continue reading Orientation of genomic sequence and SNP allele designation
is an exciting question that will have great relevance to disease. I think this is not so much about cytogenetic abnormalities as about a general understanding of which genetic modules recombine, i.e. are compatible in terms of function, and which need to stay tightly linked. Continue reading How we are reshuffling our genome
After running a dual core CPU for two weeks I have a list here of all transcripts that are associated with the “ORMDL3” SNP gene cluster. Making sense of this list is a difficult task even with dozens of dedicated websites.
To get an overview of what is available I would start Continue reading GPS for biological pathways
My latest idea is to create a wiki-like annotation server that lets everybody create rules for how to analyze an individual genome – we could use the CV dataset as a testbed.
Maybe we should start with SNPs only and first develop some ground rules about the threshold at which any predictive rule may be applied?
Otherwise these personal genomes will be quite useless, yea, yea.
While optimizing the analysis strategy for a 500,000 SNP Affymetrix array set, I found 6 autosomal SNPs that show highly significant sex-dependent allele differences: rs2809868, rs4862188, rs2880301, rs3883011, rs3883013 and rs3883014.
Sure, there could be an autosomal marker that influences male/female outcome, but there is a more likely explanation: all of these SNPs have paralogous sequence stretches on the Y chromosome that are co-amplified during PCR. From the initial genotyping results it is most likely that only the Y chromosomal stretch is mutated in SNP 4, 13 and 15.2.
These SNPs are perfect sex markers, as they include an autosomal control allele (in comparison to pure Y markers like SNPs in SRY). They are always unambiguous (in contrast to pure X markers, where only heterozygotes are informative).
They even offer an advantage over commercial STR kits based on the Amelogenin/Amely gene (situated in the Y parautosomal region), as they would not be affected by excess homologous X chromosomal material, as is often found in forensic situations. In addition, they might overcome some other weaknesses of the Amelogenin test, where a second assay is usually recommended.
If you ever see a case-control study that highlights any of these SNPs, you can be sure that the study had a distorted male-female ratio between cases and controls.
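A quality-control check along these lines can be sketched in a few lines of Python. This is only a minimal illustration, not the analysis actually used for the array set: the function names, the SNP labels, the allele counts and the significance threshold are all made up for the example. It compares allele counts between males and females with a chi-square test on a 2x2 table (1 degree of freedom) and flags SNPs where the difference is extreme.

```python
import math


def allele_sex_test(male_counts, female_counts):
    """Chi-square test (1 df) on a 2x2 table of allele counts by sex.

    Each argument is a pair (count_allele_A, count_allele_B).
    Returns (chi2, p_value).
    """
    table = [list(male_counts), list(female_counts)]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        # Monomorphic SNP or one sex missing: nothing to test.
        return 0.0, 1.0
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def flag_sex_dependent(snp_counts, alpha=1e-7):
    """Return the SNP labels whose allele counts differ by sex at p < alpha."""
    return [
        snp
        for snp, (males, females) in snp_counts.items()
        if allele_sex_test(males, females)[1] < alpha
    ]


# Hypothetical allele counts: (males A, males B), (females A, females B).
example_counts = {
    "snp_with_y_paralogue": ((500, 500), (980, 20)),
    "ordinary_snp": ((600, 400), (610, 390)),
}
flagged = flag_sex_dependent(example_counts)
print(flagged)
```

In the made-up data only the first SNP is flagged. Running such a test across all autosomal markers before the association analysis would be one way to catch co-amplified Y paralogues early.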
Yesterday evening I attended an excellent presentation by Nikolaus Rajewsky about microRNAs, small noncoding RNAs that are thought to have a role in posttranscriptional regulation. (Nikolaus moved just 3 months ago from New York to follow Jens Reich at the MDC in Berlin.) Basically, he talked about his recent “l(ou)sy” paper and the “SNP” paper after giving a rather detailed history of the development of the field. It started in 1950 with Jacob and Monod, 1960 Britten and Davidson, 1970 Haywood (who even quit science after being disappointed), and finally 1990, when the Ambros and Ruvkun labs discovered nematode microRNAs. Current research is mainly done in the Tuschl, Bartel, Cohen, Lander and Rajewsky labs, who produce the bulk of the 800 papers or so published in 2006.
Approximately 30% of genes are influenced by microRNAs; the total number of microRNA sites is under heavy debate (~22,000), as is the number of human microRNAs (328); each microRNA regulates ~200 genes. Unfortunately there is still no high-throughput technique to detect targets. There is also no good prediction by free energy, and even mismatches in the 5 prime of mRNA are possible (individual predictions can be obtained at PicTar, which uses a hidden Markov model).
If I understood that correctly, miRNAs are the feedback mechanism at the RNA level (with transcription factors at the DNA level). He mentioned 3 classes known so far in humans: oncomiRNA, miRNA 375 myotrophin, and miRNA 122 acting on cholesterol (quite interesting, as it was described recently in the NEJM). The experimental knockdown of liver specific mouse microRNA shows ~300 up- and ~300 downregulated genes. Upregulated genes have one miRNA nucleus in approximately 50% of cases; downregulated ones have even fewer binding sites than average. There is no overrepresented GO category in upregulated genes, but cholesterol is highly significant in downregulated genes, whatever that means. The action of miRNAs seems to be heavily context dependent, giving us many more questions than answers. Yea, yea.
In a recent blog post I described high resolution SNP datasets that are available on the net. To work with these datasets you will probably need to upgrade your hardware and software. For data handling many people nowadays stick to commercial SQL databases that have plugins for PD software.
My recommendation is to save that money and store the data in a special format that may be more useful for these large datasets; details are in a technical report that I will upload later today. In the meantime you can already check some software tools to work with these large datasets. This is what I know so far:
- David Duffy has recompiled his sibpair program
- Geron(R) has something under development
- Jochen Hampe and colleagues offer Genomizer
- Franz Rüschendorf developed Alohomora
- I remember SNPGWA, a development at Wake Forest University (no link yet)
- there will be an R/Bioconductor package by Rob Scharpf (no link yet)
- R library GenABEL by Yurii Aulchenko
- R library SNPassoc by Juan González
A technical report on how to work with large SNP datasets is now also available in my paper section. Alternatives to what I am suggesting in this paper have been set out by an anonymous reviewer:
For R users, if SQLite limits are reached, hdf5 (http://hdf.ncsa.uiuc.edu/HDF5/) may be one way forward for really huge table structures since there is an R interface already available. PostgreSQL column limit depends on data type with a maximum of 1600 for simple types. MySQL with the BerkeleyDB backend may be like SQLite with no obvious column count limit. Metakit is not mentioned – it is column oriented and probably also has “unlimited” columns as long as each database is < 1GB or so.
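The column limits the reviewer mentions only bite when every SNP gets its own column. One common way around them, which is neither the format from the technical report nor one of the reviewer's suggestions, is a "long" layout with one row per sample and SNP. The sketch below uses Python's built-in sqlite3 module; the table name, column names and example data are assumptions made up for the illustration.

```python
import sqlite3


def build_genotype_store(rows):
    """Create an in-memory SQLite table with one row per (sample, SNP)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE genotype ("
        " sample_id TEXT NOT NULL,"
        " snp_id TEXT NOT NULL,"
        " allele_count INTEGER,"  # copies of the B allele (0, 1, 2); NULL = missing
        " PRIMARY KEY (snp_id, sample_id))"
    )
    conn.executemany("INSERT INTO genotype VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def allele_frequency(conn, snp_id):
    """Frequency of the B allele for one SNP; missing genotypes are ignored."""
    cur = conn.execute(
        "SELECT AVG(allele_count) / 2.0 FROM genotype WHERE snp_id = ?",
        (snp_id,),
    )
    return cur.fetchone()[0]


# Hypothetical genotypes: (sample, SNP, copies of the B allele).
example_rows = [
    ("s1", "snp_a", 0),
    ("s2", "snp_a", 1),
    ("s3", "snp_a", 2),
    ("s4", "snp_a", None),
    ("s1", "snp_b", 2),
]
conn = build_genotype_store(example_rows)
print(allele_frequency(conn, "snp_a"))
```

The price of this layout is a very large number of rows (samples times SNPs), so whether it beats a column-oriented store such as hdf5 depends on the queries one wants to run.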
Looking again at human variation, it seems that my recent estimate of 99.9% sequence identity is wrong, as shown in a Nature editorial yesterday and, of course, in the new paper with the first copy number map of the human genome:
3,080 million ‘letters’ of DNA in the human genome
22,205 genes, by one recent estimate
10 million single-letter changes (SNPs) – that’s only 0.3% of the genome
1,447 copy-number variants (CNV), covering a surprisingly large 12% of the genome
About 99.5% similarity between two random people’s DNA
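As a quick sanity check of the quoted figures, the arithmetic can be redone in a few lines of Python. Only the three input numbers come from the list above; the variable names are made up, and the check says nothing about how the 99.5% similarity estimate was derived.

```python
# Figures quoted in the list above.
GENOME_LETTERS = 3_080_000_000  # 3,080 million letters
SNP_COUNT = 10_000_000          # 10 million SNPs
CNV_FRACTION = 0.12             # CNVs cover 12% of the genome

# Share of the genome made up of SNP positions.
snp_fraction = SNP_COUNT / GENOME_LETTERS

# Number of letters covered by copy-number variants.
cnv_letters = CNV_FRACTION * GENOME_LETTERS

print(f"SNPs: {snp_fraction:.2%} of the genome")
print(f"CNV regions: about {cnv_letters / 1e6:.0f} million letters")
```

The SNP share works out to about 0.32%, consistent with the quoted 0.3%, while the CNV regions span roughly 370 million letters.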
I am organizing my literature in folders where the CNV section is still very thin but labelled as high priority – this seems to be adequate, as the new study shows that CNVs encompass hundreds of genes and functional elements.
Notably, the CNVRs encompassed more nucleotide content per genome than SNPs, underscoring the importance of CNV in genetic diversity and evolution
The Wellcome Trust has a nice website about copy number variants. If you want to read more, you will find information about the methods (array-based comparative genome hybridisation, cytogenetics, population genetics, comparative genomics and bioinformatics) as well as the questions that drive CNV research.
Again it seems that disease genetics is not only about stupid nucleotide polymorphisms (SNP); it is a whole bunch of chromosome aberrations, segmental duplications, insertions and deletions – there is a good chance that these new data will improve our complex disease mapping efforts. I am quite confident that CNVs are not randomly distributed in the genome:
CNV genes encode disproportionately large numbers of secreted, olfactory, and immunity proteins, although they contain fewer than expected genes associated with Mendelian disease […] Natural selection appears to have acted discriminately among human CNV genes. The significant overabundance, within human CNVs, of genes associated with olfaction, immunity, protein secretion, and elevated coding sequence divergence, indicates that a subset may have been retained in the human population due to the adaptive benefit of increased gene dosage.
There is a good chance of retrieving CNV information from SNP arrays even post hoc, by taking relative signal intensity into account. Yea, yea.
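To make the intensity idea concrete, here is a toy sketch in Python. It is not a published CNV-calling method: the function names, the threshold, the minimum run length and the example intensities are all assumptions, and it expects strictly positive intensities that have already been normalised. Each SNP's intensity in one sample is compared with the median of a reference panel, and runs of consecutive SNPs with a clearly shifted log2 ratio are reported as candidate gains or losses.

```python
import math
from statistics import median


def log_ratios(sample_intensity, reference_intensities):
    """log2 ratio of one sample's intensity per SNP against the reference median."""
    ratios = []
    for i, value in enumerate(sample_intensity):
        ref = median(sample[i] for sample in reference_intensities)
        ratios.append(math.log2(value / ref))
    return ratios


def candidate_cnv_runs(ratios, threshold=0.3, min_snps=3):
    """Return (start, end, kind) for runs of at least min_snps SNPs; end is exclusive."""
    runs = []
    start, kind = None, None
    for i, r in enumerate(ratios + [0.0]):  # sentinel closes a trailing run
        if r > threshold:
            current = "gain"
        elif r < -threshold:
            current = "loss"
        else:
            current = None
        if current != kind:
            if kind is not None and i - start >= min_snps:
                runs.append((start, i, kind))
            start, kind = i, current
    return runs


# Hypothetical intensities for 8 neighbouring SNPs.
reference = [[1.0] * 8, [1.1] * 8, [0.9] * 8]
sample = [1.0, 1.05, 0.95, 0.5, 0.55, 0.5, 1.0, 1.0]
runs = candidate_cnv_runs(log_ratios(sample, reference))
print(runs)
```

In the made-up data the three SNPs at positions 3 to 5 show roughly half the reference intensity and come out as one candidate loss. Real array data are much noisier, so an actual analysis would need proper normalisation and a segmentation method rather than a fixed threshold.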
The mouse data are now also online.
I am detailing in a forthcoming paper in “Allergy” that the contradictory results found with ADAM33 (the first positionally cloned asthma gene) probably result from a rather poor design of all follow-up studies.
It does not make much sense to repeat the same few SNP markers over and over; instead, a full resequencing of the linkage region would be necessary. From the analysis of public LD maps it is even possible that neighboring genes are responsible for the observed associations.
I also have doubts whether the SNP-centric view always leads to success. BTW there is a new database of over 400,000 non-redundant indels, of which 280,000 are validated by comparison with other human or chimpanzee genomes (see Mills et al.; the indels are available in dbSNP under the “Devine_lab” handle).
It reduces the number of genotyping reactions and increases the power of genome-wide association studies. Our method combines sparse marker data from a linkage scan and high-resolution SNP genotypes for several individuals to infer genotypes for related individuals.
- could already test association only in linked families
- knew that linkage genome scans will improve the power of association
- could evaluate by stepc if a polymorphism explains a linkage result
but this seems to be the best recycling for our old-fashioned linkage data. Yea, yea.