Category Archives: Genetics

The first methylome available

-moblog- Having spent this weekend in Heidelberg at a meeting of the German NGFN project, I had the opportunity to listen to an excellent talk by Stephan Beck, who works at the Wellcome Trust Sanger Institute.

Epigenetics is the connecting link between the rather fixed genome and the variable transcriptome. To start with the end of the talk: Beck predicts highly parallel SNP, expression and methylation arrays for the near future. Although the first methylome was published just 4 weeks ago by the Arabidopsis community (as with RNAi, the plant people are again at the forefront), there is still a long way to go before a first human methylation map.

The latest information may be retrieved from www.epigenome.org, www.epigenome-noe.net, www.epitron.eu, www.heroic-ip.eu and the German National Methylome Project on chromosome 21 (please google for the link). The methylome is largely a European initiative – the two US epigenome projects do not have any website so far. The network site has some introductory texts; Beck also referred to a 2006 PLOS paper by Akhtar.

Currently 4 human chromosomes covering 873 genes are being worked on (hopefully I captured this correctly, as this was a very dense talk). Across the 12 tissues tested, 70% of the genes examined so far are either clearly methylated or clearly unmethylated. Sperm stands out from all other tissues – which is not unexpected. Tissues originating from the same developmental background have similar methylation patterns – also not unexpected. A preliminary analysis of expression patterns shows that expression is suppressed if the 5′ end is methylated – also not unexpected.

Fascinating: colon cells, which certainly interact closely with the environment, show NEITHER age- NOR sex-specific differences. Fascinating too: the most frequently methylated regions are ECRs (evolutionarily conserved regions), for whatever reason. Promoter methylation dips around the transcription start sites – from the plots I would say plus and minus 2,000 bp. Methylation also seems to be conserved between mouse and human tissues, and methylation status seems stable over time.

Current bisulfite sequencing is still laborious, expensive and slow, while immunoprecipitation using MeDIP is becoming an alternative. The Sanger people also did a study using NimbleGen 50-mers; Ensembl and UCSC will soon have these data for display. Finally, methylation appears in blocks. Constructing tagMVPs (your guess is correct, these are tags for methylation variable positions) is straightforward: the estimated 40 million CpG sites will probably be covered by tagMVPs at less than 10 percent of the sites. Haplo-epi-types are now called hepitypes, yea, yea.


Addendum

Methyl Primer Express® Software is a free software package that simplifies and automates primer design for methylation experiments. The bisulfite kit is not free ;-)

Addendum

A new textbook and a nice preview

Science is about recognizing errors

-moblog- I was already willing to accept that age-related macular degeneration presents the first good case of a common variant responsible for a common disease (Y402H in CFH). Although the gene may be correct, according to a new report from the Chakravarty! group Y402H seems to be largely irrelevant. A haplotype indicating a CFHR3 deletion was seen LESS often in AMD (and this was replicated in a second sample). As the authors say:

Much work is required to unravel the complexity of the transcripts and proteins arising from this highly duplicated gene cluster.

Another paper finds

… that there are multiple disease susceptibility alleles in the region.

See you soon again, yea, yea.

Nice to know

-moblog- There are many things “nice to know” but only a few “need to know”. Molecular epidemiologists having access to large datasets sometimes forget what most people on earth “want to know” – how to prevent and cure human disease.

Nothing makes sense except in the light of a hypothesis

Looking again at human variation, it seems that my recent estimate of 99.9% sequence identity is wrong, as shown in a Nature editorial yesterday and of course in the new paper with the first copy number map of the human genome:

3,080 million ‘letters’ of DNA in the human genome
22,205 genes, by one recent estimate
10 million single-letter changes (SNPs) —
that’s only 0.3% of the genome
1,447 copy-number variants (CNV),
covering a surprisingly large 12% of the genome
About 99.5% similarity between two random people’s DNA
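
A quick back-of-the-envelope check of these figures, as a minimal Python sketch. The only inputs are the numbers quoted above; the variable names are mine, and counting each SNP as exactly one base is a simplifying assumption.

```python
# Figures quoted from the editorial above.
genome_bp = 3_080_000_000   # 'letters' of DNA in the human genome
snps = 10_000_000           # single-letter changes, assumed 1 bp each
cnv_fraction = 0.12         # share of the genome covered by CNVs

# Share of the genome that is a SNP position.
snp_fraction = snps / genome_bp

# Number of bases that fall inside copy-number variable regions.
cnv_bp = cnv_fraction * genome_bp

print(f"SNP positions: {snp_fraction:.2%} of the genome")
print(f"CNV regions:   {cnv_bp / 1e6:.0f} million bp")
print(f"CNV regions span about {cnv_bp / snps:.0f} times more bases than SNPs")
```

So the "0.3%" is simply 10 million divided by 3,080 million, and 12% of the genome is several hundred million bases, which is why the CNV regions outweigh SNPs in nucleotide content.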

I am organizing my literature in folders, where the CNV section is still very thin but labelled as high priority – this seems to be adequate, as the new study shows that CNVs encompass hundreds of genes and functional elements.

Notably, the CNVRs encompassed more nucleotide content per genome than SNPs, underscoring the importance of CNV in genetic diversity and evolution

The Wellcome Trust has a nice website about copy number variants. If you want to read more, you will find information about the methods (array-based comparative genome hybridisation, cytogenetics, population genetics, comparative genomics and bioinformatics) as well as the questions that drive CNV research.
Again it seems that disease genetics is not only about stupid nucleotide polymorphisms (SNPs); it is a whole bunch of chromosome aberrations, segmental duplications, insertions and deletions – there is a good chance that these new data will improve our complex disease mapping efforts. I am quite confident that CNVs are not randomly distributed in the genome:

CNV genes encode disproportionately large numbers of secreted, olfactory, and immunity proteins, although they contain fewer than expected genes associated with Mendelian disease […] Natural selection appears to have acted discriminately among human CNV genes. The significant overabundance, within human CNVs, of genes associated with olfaction, immunity, protein secretion, and elevated coding sequence divergence, indicates that a subset may have been retained in the human population due to the adaptive benefit of increased gene dosage.

There is a good chance of retrieving CNV information from SNP arrays even post hoc by taking relative signal intensity into account. Yea, yea.
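
To make the idea of "relative signal intensity" a bit more concrete, here is a minimal Python sketch. It is not the method of any particular paper or software package: the function names, the thresholds (-0.5 and 0.4) and the minimum run length of 3 markers are illustrative assumptions, and real array data would need normalization and a proper segmentation algorithm.

```python
import math

def log2_ratios(sample, reference):
    """Log2 ratio of each marker's intensity to a reference intensity."""
    return [math.log2(s / r) for s, r in zip(sample, reference)]

def call_segments(ratios, loss=-0.5, gain=0.4, min_markers=3):
    """Return (first_index, last_index, state) for runs of consecutive
    markers whose log2 ratio lies beyond the (assumed) thresholds."""
    segments = []
    state, start = None, 0
    # A trailing 0.0 acts as a sentinel that closes a run at the end.
    for i, value in enumerate(list(ratios) + [0.0]):
        if value <= loss:
            current = "loss"
        elif value >= gain:
            current = "gain"
        else:
            current = None
        if current != state:
            if state is not None and i - start >= min_markers:
                segments.append((start, i - 1, state))
            state, start = current, i
    return segments

# Made-up intensities: markers 2-4 look deleted, markers 7-9 duplicated.
sample = [1000, 1010, 500, 520, 480, 990, 1000, 1600, 1500, 1550, 1000]
reference = [1000] * len(sample)
segments = call_segments(log2_ratios(sample, reference))
print(segments)
```

A heterozygous deletion ideally halves the signal (log2 ratio of -1) and a single-copy gain raises it to 1.5 times (about 0.58), which is why a run of neighbouring markers shifted in the same direction is the thing to look for.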

Addendum

The mouse data are now also online.

Hap world map?

A new study of 12 Mb of DNA sequence in 927 individuals representing 52 populations now finds good portability of tag SNPs between the 4 HapMap groups and any of the 52 populations (except some African populations like the Mandenka, Bantu, Yoruba, Biaka Pygmy, Mbuti Pygmy and San). The paper has some exceptionally well-done graphics – and I am quite happy that the resolution of European nations leaves some gaps for our forthcoming ECRHS papers (a poster had already been on display at the 3rd Annual International HapMap Project in Cambridge, Massachusetts).

“Die Botschaft hör’ ich wohl, allein mir fehlt der Glaube” (Goethe: “I hear the message well, but I lack the faith”). The usefulness of tagSNPs in disease association studies still remains to be shown (I still remember comments like cr.. map). At present I believe neither in rare variants nor in common variants alone, but in a permanent reshuffling of rare, frequent and highly abundant variants. Yea, yea.

Men r’sponding to women

We know much about the differences between men and women – the X is the default pathway and the Y under the microscope looks as worn down and “misshapen as a stubbed-out cheroot“. Now there turns out to be something really new. So far all effects of Y genes on sex determination have been attributed to SRY, the testis-determining gene (NR0B1, FOXL2 and WNT4 are probably ovary-determining).
The careful analysis of an Italian pedigree has now revealed a new gene whose disruption can reverse XX individuals to male: it is R-spondin 1 (or RSPO1), a growth factor that may act through β-catenin stabilization and synergize with Wnt.
Do you remember the nice cartoon of the Y chromosome with the HUH? selective hearing loss ;-) Finally, it is RSPO1. Yea, yea.

Not an academic project, not an industry project

Not an academic project, not an industry project, but still want to earn money with knowledge? InnoCentive posted request #4470259 for an ALS biomarker. The deadline is Nov 06, 2008 and you will get $1,000,000 USD for solving the problem. Yea, yea.

Peer production

First Monday has an interesting article about the limits of self-organization and “laws of quality”. Given 52 million tracks in the Gracenote database, 1 million entries in Wikipedia and 17,000 books in Project Gutenberg, Paul Duguid thoroughly examines the two laws of quality:

  • Linus’s law: “given enough eyeballs, all bugs are shallow”, which means that almost every error will be discovered and ultimately fixed
  • Graham’s law: “people just produce whatever they want; the good stuff spreads, and the bad gets ignored”

Although science is more professionalized, similar principles operate there. With these large genetic studies, I have the feeling that most errors occur at the interfaces, during the handshaking between disciplines. There are certainly only a few people who can design a study, examine a patient, go to the laboratory, analyze and annotate the data and publish them. This means that even many eyeballs cannot look around the corner and that it will take many years for the “good stuff to spread”. Yea, yea.

Pharmacogenetic tests on the market

Certainly one of the best web resources for pharmacogenetics is the PharmGKB database, which collects all kinds of data about the relationships among drugs, diseases and genes. Of course you could sequence your genome or run expression profiling on a liver sample. However, you are probably here to find out what (serious!?) pharmacogenetic tests are already on the market.

Much can be said about the usefulness of such tests; I doubt whether there will ever be such personalized treatment, as I can foresee some logistical problems in validating it ;-) More likely are group-based therapies, maybe restricted to geographic ancestry. Here is a (first and very) preliminary collection of commercially available pharmacogenetic tests:

  • CYP2D6, CYP2C9 and CYP2C19 collectively account for about 40 percent of drug metabolism mediated by cytochrome P-450 (Roche). The AmpliChip CYP450 Test is the world’s first pharmacogenetic microarray-based test approved for clinical use. CYP2D6 metabolizes codeine into morphine. The frequency of CYP2D6 variants differs with race; some lead to a lower elimination rate of the antidepressant Prozac (a selective serotonin reuptake inhibitor); the alternatively used drug Celexa is metabolized by CYP2C19 (as is omeprazole). Other examples include clopidogrel (metabolized by CYP3A4), cyclophosphamide (by CYP2B6) and vitamin K antagonists such as warfarin (by CYP2C9)
  • NAT2*5A, NAT2*6A, NAT2*7A/B and NAT2*14A distinguish rapid from slow acetylators of, for example, isoniazid or procainamide (Roche)
  • HER2+ women may get Herceptin (Roche, Bayer, PathVysion)
  • TP and DPD are the rate-limiting catabolic enzymes of 5-fluorouracil metabolism (Roche)
  • Mitochondrial A1555G variants are tested for aminoglycoside side effects (Humatrix)
  • A warfarin sensitivity test will be in clinical use next year (Kimball Genetics). It will test for variations in CYP2C9 and VKORC1
  • A UGT1A1 gene variant is associated with leukopenia in patients prescribed Camptosar, a drug for colon cancer (Oncoscreen)
  • A TPMT variant is associated with slow metabolism of 6-mercaptopurine, used in the treatment of childhood leukemia and inflammatory bowel diseases (Pharmaco-Gendia)
  • Epigenomics is currently developing tests based on DNA methylation
  • The tyrosine kinase inhibitor Gleevec inhibits the ABL, ARG, SCF/KIT, PDGFRA and PDGFRB kinases in CML. Mutations in ABL can arise as secondary mutations in previously sensitive leukemias (Pharmaco-Gendia)

Needless to say, I have excluded here specific HIV mutations that may induce resistance to particular drugs (as I learned last week from Thomas Lengauer at a bioinformatics meeting here). I have also excluded all kinds of sex-specific markers (e.g. SRY testing) and the whole nutrigenomics stuff.

Who knows more, for example about lansoprazole effectiveness, UGT1A9 and mycophenolic acid, UGT1A1 and irinotecan, COMT genotype and amphetamine response, pharmacogenetics of COX-2 inhibitors, and GRP78 responsiveness to chemotherapy? Is there any commercial test available for these genes? It seems that somebody should start a wiki on that, yea, yea.

Addendum 31-12-09

Here is another gene list; only 6 tests have been approved by the FDA; Nature reports about Oncotype DX and Prostate Px as well as MammaPrint. See also a UK-based paper:

HLA-B*5701 was most commonly tested to identify those at risk of abacavir hypersensitivity among patients with HIV. A number of barriers to testing were identified, including lack of clinician knowledge and a lack of scientific evidence.

INDELligent

I am detailing in a forthcoming paper in “Allergy” that the contradictory results found with ADAM33 (the first positionally cloned asthma gene) probably result from a rather poor design of all follow-up studies.
It does not make much sense to repeat the same few SNP markers over and over; instead a full resequencing of the linkage region would be necessary. From the analysis of public LD maps it is even possible that neighboring genes are responsible for the observed associations.
I also doubt whether the SNP-centric view always leads to success. BTW there is a new database of over 400,000 non-redundant indels, of which 280,000 are validated by comparison with other human or chimpanzee genomes (see Mills et al.; the indels are available in dbSNP under the “Devine_lab” handle).

Best of two worlds

Finally, linkage and association data can be used together after downloading new software using genotype inference.

It reduces the number of genotyping reactions and increases the power of genome-wide association studies. Our method combines sparse marker data from a linkage scan and high-resolution SNP genotypes for several individuals to infer genotypes for related individuals.

Sure, we

Geo IP identification

The Geo IP database is available at MaxMind and allows you to trace your home city from your IP address. Here is a quick and dirty script to load the Geo IP data into MySQL:

@echo off
rem set path to mysql binaries
set i=C:\Programme\XAMPP\xampp\mysql\bin\

rem SQL statements (one file per table)
echo CREATE TABLE GeoLiteCity (locId DOUBLE,country CHAR(15),region CHAR(15),city CHAR(15),postalCode CHAR(15),latitude DOUBLE,longitude DOUBLE,dmaCode DOUBLE,areaCode DOUBLE); >1.SQL
echo CREATE TABLE GeoLiteCityBlocks (startIpNum DOUBLE,endIpNum DOUBLE,locId DOUBLE); >2.SQL

rem create the tables in the database "test" (the one mysqlimport fills below)
%i%mysql --host=localhost --user=root test <1.SQL
%i%mysql --host=localhost --user=root test <2.SQL

rem drop SQL statement
del 1.SQL>nul
del 2.SQL>nul

rem import data
%i%mysqlimport --fields-terminated-by="," --fields-optionally-enclosed-by="\"" --lines-terminated-by="\n" --host=localhost --user=root test d:\GeoLiteCity.csv
%i%mysqlimport --fields-terminated-by="," --fields-optionally-enclosed-by="\"" --lines-terminated-by="\n" --host=localhost --user=root test d:\GeoLiteCityBlocks.csv

pause
exit

I would also put an index on locId. Finally, the database can be queried with

SELECT city
FROM GeoLiteCity INNER JOIN GeoLiteCityBlocks ON GeoLiteCity.locID = GeoLiteCityBlocks.locID
WHERE $myIP >= startIpNum AND $myIP <= endIpNum;

where $myIP is calculated from the four octets of the IPv4 address (splitting on the dots, because fixed substr() offsets only work when every octet has three digits):

list($a, $b, $c, $d) = explode('.', $_SERVER['REMOTE_ADDR']);
$myIP = $a * 16777216 + $b * 65536 + $c * 256 + $d;
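
For trying out the lookup without a MySQL installation, here is a minimal Python sketch of the same idea using an in-memory SQLite database. The sample rows (one made-up city covering the private range 192.168.0.0 to 192.168.255.255), the index name and the helper names `ip_to_int` and `lookup_city` are my assumptions; only the table and column names follow the script above.

```python
import ipaddress
import sqlite3

def ip_to_int(ip: str) -> int:
    """Convert a dotted IPv4 address to the integer form used in the
    blocks table (a*16777216 + b*65536 + c*256 + d)."""
    return int(ipaddress.IPv4Address(ip))

def lookup_city(conn: sqlite3.Connection, ip: str):
    """Return the city whose IP block contains `ip`, or None."""
    row = conn.execute(
        "SELECT city FROM GeoLiteCity "
        "INNER JOIN GeoLiteCityBlocks "
        "ON GeoLiteCity.locId = GeoLiteCityBlocks.locId "
        "WHERE ? BETWEEN startIpNum AND endIpNum",
        (ip_to_int(ip),),
    ).fetchone()
    return row[0] if row else None

# Tiny made-up data set standing in for the real CSV import.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE GeoLiteCity (locId INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE GeoLiteCityBlocks (startIpNum INTEGER, endIpNum INTEGER, locId INTEGER);
CREATE INDEX idx_blocks_locId ON GeoLiteCityBlocks (locId);
INSERT INTO GeoLiteCity VALUES (1, 'Example City');
INSERT INTO GeoLiteCityBlocks VALUES (3232235520, 3232301055, 1);
""")

print(lookup_city(conn, "192.168.1.10"))
print(lookup_city(conn, "10.0.0.1"))
```

Passing the address as a bound parameter rather than pasting it into the SQL string also avoids quoting problems. The sketch covers IPv4 only, like the database format described here.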

A low-cost system for a PDF literature archive I

Getting a scientific paper onto your hard disk is quite simple. I am using a Fujitsu ScanSnap that can process a single page in a few seconds. The resulting PDF needs to be further processed by OCR software like ABBYY FineReader (I couldn’t find any good open source alternative). FR will leave your PDF intact while adding the recognized text as an overlay (or “underlay”). Unfortunately FR does not support batch processing, but your OS will do it with the help of a Windows scripting engine like CLRscript. We also need a tool to extract a text file from the modified PDF. A good choice is pdftotext; look at the source code and the DRM discussion before compiling it in an environment like Cygwin. The following Perl script does nothing more than traverse your target directory and create a batch file. As the filenames offered by publishers are rather strange, I would start by creating some clean file names, replacing all spaces and brackets with something innocent like underscores.
perl.exe ocr.pl rename h:\pdf\2008\*.*
Now we create text files from the PDFs (usually done better by XPDF than directly by GDS).
perl.exe ocr.pl extract h:\pdf\2008\*.pdf
The resulting text files may be inspected: very small file sizes usually indicate a failed extraction, and such files should be deleted before starting the OCR step, as OCR is only done when text files are missing.
perl.exe ocr.pl ocr h:\pdf\2008\*.pdf
In the last step you may want to repeat the extract step.

ocr.zip

#!/usr/local/bin/perl -w
# ocr.pl setting up batch ocr commands
# m@wjst.de 26Oct03
# ------------------------------------------------------------
$action = $ARGV[0];        # rename | extract | ocr
@files = glob($ARGV[1]);   # expand the file pattern, e.g. h:\pdf\2008\*.pdf

open(OUT,">_ocr.cmd");
print OUT ("\@echo off\n");

foreach $file (@files) {
    if ($action eq "rename") {
        $old = $file;
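        # split into base name ($1) and extension ($2); skip other file types
        next unless $file =~ s/(.*)(\.pdf|\.txt)/$2/i;
        $new=$1;
        # replace brackets, whitespace, %, dots and # in the base name
        $new =~ s/[\(\)\s\%\.\#]/\_/gi;
        $new = $new . $file;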
        if ($new ne $old) {
            print OUT "rename \"$old\" \"$new\" \n";
        }
    } 
    else {
        # base name without the .pdf extension; skip anything else
        next unless $file =~ m/(.*)(\.pdf)/i;
        $fn=$1;
        $fne=$fn."\.txt";
        if (not -e $fne) {
            if ($action eq "extract") {
                print OUT ("start /min /wait pdftotext.exe -layout -eol dos \"$fn\.pdf\"\n");
            }
            elsif ($action eq "ocr") {
                print OUT ("call:sub $fn\.pdf \& start /max /wait CLRScrpt.exe /r _ocr.tmp\n");
            }
        }
    }
}
if ($action eq "ocr") {
    print OUT ("del _ocr.tmp\>nul\n");
    CLRScrpt();
}
print OUT ("del _ocr.cmd\>nul\n");
close(OUT);
exit(0);

sub CLRScrpt {
print OUT << "EOF"
:: --------------------------------------------------------------
:sub
set o=_ocr.tmp
echo void main() { \>%o%
echo if (Run("c:\Programme\ABBYY FineReader 7.0 Professional Edition\FineReader.exe",SW_SHOWMAXIMIZED)); \>\>%o%
echo Pause (1000); \>\>%o%
echo ClickButton("NO"); \>\>%o%
echo Pause (1000); \>\>%o%
echo SendKeys("{ctrl}o"); \>\>%o%
echo Pause (1000); \>\>%o%
echo SetDlgItemText(1152,"%1"); \>\>%o%
echo ClickButton("Ö&ffnen"); \>\>%o%
echo Pause (1000); \>\>%o%
echo while (VerifyActiveWindowTitleSub("Adding",1000)) \>\>%o%
echo { Pause (1000); } \>\>%o%
echo SendKeys("{ctrl}{shift}r"); \>\>%o%
echo Pause (1000); \>\>%o%
echo while (VerifyActiveWindowTitleSub("Reading",1000)) \>\>%o%
echo { Pause (1000); } \>\>%o%
echo SendKeys("{ctrl}{F2}"); \>\>%o%
echo Pause (1000); \>\>%o%
echo SetDlgItemText(1152,"%1"); \>\>%o%
echo Pause (1000); \>\>%o%
echo ClickButton("For&mats Settings..."); \>\>%o%
echo Pause (1000); \>\>%o%
echo ClickButton("Text &under the page image"); \>\>%o%
echo Pause (1000); \>\>%o%
echo ClickButton("OK"); \>\>%o%
echo Pause (1000); \>\>%o%
echo ClickButton("&Speichern"); \>\>%o%
echo Pause (1000); \>\>%o%
echo ClickButton("OK"); \>\>%o%
echo while (VerifyActiveWindowTitleSub("Saving",1000)) \>\>%o%
echo { Pause (1000); } \>\>%o%
echo SendKeys("{alt}fx"); \>\>%o%
echo Pause (1000); \>\>%o%
echo ClickButton("&Nein"); \>\>%o%
echo exit(); \>\>%o%
echo } \>\>%o%
goto:eof
EOF
;
}
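
For readers who prefer to see the bookkeeping separately from the batch-file generation, the two decisions the script makes (how a clean file name is built, and when a PDF still needs OCR) can be sketched in Python. This is only an illustration, not a replacement for the script: the function names and the 200-byte threshold for a "very small" text file are my assumptions.

```python
import os
import re

# Same character class as in the rename step: brackets, whitespace, %, dot, #
UNSAFE = re.compile(r"[()\s%.#]")

def clean_name(filename: str) -> str:
    """Replace unsafe characters in the base name with underscores,
    leaving the extension (e.g. .pdf or .txt) untouched."""
    stem, ext = os.path.splitext(filename)
    return UNSAFE.sub("_", stem) + ext

def needs_ocr(pdf_path: str, min_bytes: int = 200) -> bool:
    """A PDF needs OCR if its text file is missing or suspiciously small.
    The size threshold is an assumed value, adjust it to your data."""
    txt_path = os.path.splitext(pdf_path)[0] + ".txt"
    if not os.path.exists(txt_path):
        return True
    return os.path.getsize(txt_path) < min_bytes

print(clean_name("Smith (2006) 10.1038%ng.pdf"))
```

Unlike the workflow above, where tiny text files are deleted by hand before the OCR step, `needs_ocr` treats a tiny text file the same way as a missing one.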

Some are hybrids, some are not

Microchimerism is an interesting phenomenon that describes the hosting of foreign cells in an individual – the prefix micro relates to the rather low counts of foreign cells (see the self discussion).
It is believed (but unproven) that most cases of microchimerism relate to the persistence of fetal cells in the maternal organism. The background of microchimerism is extremely complicated, as highlighted in a recent review about the immunology of placentation in mammals. This paper has some nice cartoons about the types of placentation (epitheliochorial, endotheliochorial and haemochorial), where the invasive potential of fetal trophoblast cells is the culprit of reciprocal (?) cell traffic between mother and fetus. The highest risk is found in women with induced abortion; cell counts range from 0 to 21 male cells per 100,000 female cells in peripheral blood; transfer may occur from mother <-> child, twin <-> twin, or sib <-> mother <-> sib.
Microchimerism has been examined in transplantation medicine (where the recipient replaces the outer donor organ epithelium), in blood transfusion and HCT, as well as in some autoimmune diseases (systemic sclerosis, SLE, thyroiditis, PBC). A clinical review reports that fetal cells have been found to persist for many years, probably for a lifetime.
I doubt that this is true, as I am not aware of any quantitative long-term study. Nearly all studies identified only male cells in women, although genomic studies of single cells are now possible, allowing a much better identification of foreign cells. If you are looking for a PhD thesis, microchimerism could be your field!
I already wondered if microchimerism could lead to genotyping errors, a question that can now easily be tested on the garbage of genotyping labs: we usually have genotyping errors in the 1-10 per mille range; sometimes we also see triallelic SNPs. As far as I can remember, microchimerism has never been analyzed in the allergy field, although allergy can be transplanted, as can asthma. Yea, yea.