Tag Archives: CNV

454 or 0815 or 4911

In a recent book chapter we discussed new genotyping and sequencing technologies. Our concluding remarks haven't changed much: it seems that real-time detection of single molecules is still not possible, micro-electrophoresis-based methods have already reached their limit, and sequencing by hybridization has severe restrictions when it comes to de novo sequencing (or resequencing) of whole genomes. At least for research purposes I expect that whole-genome resequencing will replace current SNP-based disease mapping. So far, sequencing by synthesis seems to be one of the few high-throughput (HT) methods that already works at that scale. The 454 platform consists of 3 consecutive steps:

  1. DNA library preparation starts with genomic DNA: after fragmentation and adaptor ligation, the single-stranded template DNA (sstDNA) libraries are isolated and assessed (takes ~4 hours)
  2. The sstDNA is emulsified, then amplified and recovered on beads before the sequencing primers are annealed (takes 1 day)
  3. After washing, the so-called PicoTiterPlates are prepared and a process is started that looks like a pyrosequencing reaction (correct me if I am wrong): a pyrophosphate-dependent enzyme cascade emits light that is recorded by a CCD camera watching each of the ~200,000 wells (takes ~6 hours according to a recent paper in GenomXPress 2/06, figures at 454.com)

With an average read length of 100 bp and 200,000 fragments per run (resulting in 20 Mb) in 6 hours, the throughput is about 60-fold that of Sanger sequencing. The recent Neanderthal paper raises five arguments why the 454 sequencing platform is extremely well suited for analyses of bulk DNA extracted from ancient remains.
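
The throughput figure is easy to check with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the function and variable names are made up for illustration, and the Sanger number is not a measured value but simply what the quoted 60-fold ratio would imply.

```python
def run_yield_bp(reads: int, read_length_bp: int) -> int:
    """Total number of bases produced by one sequencing run."""
    return reads * read_length_bp


READS_PER_RUN = 200_000    # wells read per run, as quoted above
READ_LENGTH_BP = 100       # average read length, as quoted above
RUN_HOURS = 6              # duration of the sequencing step
FOLD_OVER_SANGER = 60      # ratio quoted above, taken at face value

total_bp = run_yield_bp(READS_PER_RUN, READ_LENGTH_BP)
mb_per_run = total_bp / 1e6
mb_per_hour = mb_per_run / RUN_HOURS
# Assumption: the 60-fold ratio refers to the same 6-hour window.
implied_sanger_mb = mb_per_run / FOLD_OVER_SANGER

print(f"{mb_per_run:.0f} Mb per run, {mb_per_hour:.2f} Mb per hour")
print(f"implied Sanger yield in the same time: {implied_sanger_mb:.2f} Mb")
```

Note that this only covers the 6-hour sequencing step; library preparation and the emulsion step add well over a day before the run starts.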

  1. … it circumvents bacterial cloning, in which the vast majority of initial template molecules are lost during transformation and establishment of clones.
  2. … because each molecule is amplified in isolation from other molecules it also precludes template competition, which frequently occurs when large numbers of different DNA fragments are amplified together.
  3. … its current read length of 100–200 nucleotides covers the average length of the DNA preserved in most fossils.
  4. … it generates hundreds of thousands of reads per run, which is crucial because the majority of the DNA recovered from fossils is generally not derived from the fossil species, but rather from organisms that have colonized the organism after its death.
  5. … because each sequenced product stems from just one original single-stranded template molecule of known orientation, the DNA strand from which the sequence is derived is known. This provides an advantage over traditional PCR from double-stranded templates, in which the template strand is not known, because the frequency of different nucleotide misincorporations can be deduced … damage that affects different bases differently.

Except for the short read length, most of these properties would also benefit large-scale resequencing projects in human individuals. My main point for starting resequencing projects ASAP: so far we have not achieved a dense resolution of the genome, while deep resequencing projects (for example at the CRP locus) have produced astonishing results. We do not even know what is going on in the “noncoding” regions. Finally, deletions and CNVs have been largely neglected – another look at this question in the EJHG.
The question remains what 454 means – my inquiry is still pending. As far as I know 0815 was a machine gun in World War I and is a synonym for something boringly ordinary, while 4911 is a street number and stands for a perfume. Yea, yea.

How to detect your own CNVs

How to detect copy number variation (CNV) in your own genotype chip data is described in a companion paper to the recent Nature publication.
In the Nature paper the authors described their algorithm as being based on k-means and PAM (partitioning around medoids) clustering, but what is described here seems quite different. They call genotypes with DM (which already seems to have been superseded by BRLMM, see a comparison at Broad and the AFFX whitepaper), then adjust heterozygote ratios by Gaussian mixture clustering, and normalize and reduce noise before (!) merging the NspI and StyI arrays. The software is at Genome Science, Tokyo. Yea, yea.

Nothing makes sense except in the light of a hypothesis

Looking again at human variation, it seems that my recent estimate of 99.9% sequence identity is wrong, as shown in a Nature editorial yesterday and of course in the new paper with the first copy number map of the human genome:

3,080 million ‘letters’ of DNA in the human genome
22,205 genes, by one recent estimate
10 million single-letter changes (SNPs), that’s only 0.3% of the genome
1,447 copy-number variants (CNV), covering a surprisingly large 12% of the genome
About 99.5% similarity between two random people’s DNA
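
The percentages in this box can be reproduced from the raw counts. The snippet below is just a sanity check under one simplifying assumption of mine: each SNP is counted as exactly one base. The variable names are illustrative only.

```python
GENOME_BP = 3_080_000_000   # 3,080 million letters, as quoted
SNP_COUNT = 10_000_000      # 10 million single-letter changes, as quoted
CNV_FRACTION = 0.12         # 12% of the genome covered by CNVs, as quoted

# Assumption: one SNP affects exactly one base.
snp_percent = 100 * SNP_COUNT / GENOME_BP
cnv_bp = CNV_FRACTION * GENOME_BP
cnv_to_snp_ratio = cnv_bp / SNP_COUNT

print(f"SNPs cover {snp_percent:.2f}% of the genome")
print(f"CNVs cover about {cnv_bp / 1e6:.0f} Mb")
print(f"CNV bases outnumber SNP bases about {cnv_to_snp_ratio:.0f} to 1")
```

So the quoted 0.3% is a rounded 0.32%, and the CNV regions cover several hundred megabases, which is why the CNV folder deserves its high priority label.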

I am organizing my literature in folders, where the CNV section is still very thin but labelled as high priority. This seems to be adequate, as the new study shows that CNVs encompass hundreds of genes and functional elements.

Notably, the CNVRs encompassed more nucleotide content per genome than SNPs, underscoring the importance of CNV in genetic diversity and evolution

The Wellcome Trust has a nice website about copy number variants. If you want to read more, you will find information about the methods (array-based comparative genome hybridisation, cytogenetics, population genetics, comparative genomics and bioinformatics) as well as the questions that drive CNV research.
Again it seems that disease genetics is not only about stupid nucleotide polymorphisms (SNPs); it is also a whole bunch of chromosomal aberrations, segmental duplications, insertions and deletions. There is a good chance that these new data will improve our complex disease mapping efforts. I am quite confident that CNVs are not randomly distributed in the genome:

CNV genes encode disproportionately large numbers of secreted, olfactory, and immunity proteins, although they contain fewer than expected genes associated with Mendelian disease […] Natural selection appears to have acted discriminately among human CNV genes. The significant overabundance, within human CNVs, of genes associated with olfaction, immunity, protein secretion, and elevated coding sequence divergence, indicates that a subset may have been retained in the human population due to the adaptive benefit of increased gene dosage.

There is a good chance of retrieving CNV information from SNP arrays even post hoc by taking relative signal intensity into account. Yea, yea.
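
To make the idea of using relative signal intensity a bit more concrete, here is a minimal sketch. It is not the method of any of the papers mentioned above: the function names, the window size and the thresholds are all my own assumptions, and the intensities are toy numbers. It computes log2 ratios of sample versus reference intensity per probe, smooths them with a running median, and reports runs of consecutive probes beyond a loss or gain threshold.

```python
import math
from statistics import median


def log2_ratios(sample, reference):
    """Per-probe log2 ratio of sample intensity over reference intensity."""
    return [math.log2(s / r) for s, r in zip(sample, reference)]


def running_median(values, window=5):
    """Smooth with a centred running median; windows shrink at the edges."""
    half = window // 2
    return [median(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]


def call_segments(smoothed, loss=-0.5, gain=0.4, min_probes=3):
    """Return (first_probe, last_probe, state) for runs beyond the thresholds.

    The thresholds are illustrative assumptions, not calibrated values.
    """
    segments = []
    start, state = None, "normal"
    for i, v in enumerate(smoothed + [0.0]):  # trailing 0.0 closes an open run
        s = "loss" if v <= loss else "gain" if v >= gain else "normal"
        if s != state:
            if state != "normal" and i - start >= min_probes:
                segments.append((start, i - 1, state))
            start, state = i, s
    return segments


# Toy data: probes 3 to 6 show roughly half the reference intensity,
# as one would expect for a heterozygous deletion.
reference = [1000] * 10
sample = [1000, 980, 1020, 510, 490, 500, 520, 1010, 990, 1000]

ratios = log2_ratios(sample, reference)
segments = call_segments(running_median(ratios))
print(segments)  # [(3, 6, 'loss')]
```

Real array data are far noisier than this, so a proper analysis would need normalization across arrays and a more careful segmentation step; the sketch only shows why the intensity information is usable at all.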

Addendum

The mouse data are now also online.