Category Archives: Software

How to detect your own CNVs

A companion paper to the recent Nature publication explains how to detect copy number variation (CNV) in your own genotype chip data.
In the previous Nature paper the authors described their algorithm as being based on k-means and PAM (partitioning around medoids) clustering, but here it seems quite different. They call genotypes with DM (which seems to have been superseded already by BRLMM; see the comparison at Broad and the AFFX whitepaper), then adjust heterozygote ratios by Gaussian mixture clustering, and normalize and reduce noise before (!) merging the NspI and StyI arrays. The software is at Genome Science, Tokyo. Yea, yea.

New LD measure

There is a new way to calculate LD that may overcome the limitations of D’ and r^2, which are not easily generalized to multiallelic markers (or haplotypes) and which depend on the distribution of SNPs (or haplotypes).
The paper is at BMC, the sources at the authors’ website. I have slightly modified the program to allow input and output file names on the command line before compiling it. Use at your own risk, yea, yea.
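
For reference, here is a minimal sketch of the two classical measures for a pair of biallelic loci, computed from haplotype and allele frequencies. It is not the new measure from the BMC paper; the function name `ld_measures` and its arguments are my own illustration, and the sketch reports the absolute value of D’.

```python
def ld_measures(p_ab, p_a, p_b):
    """Classical LD measures for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A (locus 1) and B (locus 2)
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    Returns (D, |D'|, r^2).
    """
    # Raw disequilibrium: observed haplotype frequency minus the
    # frequency expected under independence.
    d = p_ab - p_a * p_b

    # D' rescales D by its maximum possible magnitude given the allele
    # frequencies; the bound depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    # r^2 is the squared correlation between the two allele indicators.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denom if denom > 0 else 0.0

    return d, d_prime, r_squared


if __name__ == "__main__":
    # Every B haplotype also carries A, but the allele frequencies differ.
    print(ld_measures(p_ab=0.2, p_a=0.6, p_b=0.2))
```

With these example frequencies |D’| comes out at roughly 1 while r^2 is only about 0.17, which shows how strongly both measures depend on the allele frequencies.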

New R packages for SNP studies

The December R newsletter reports several brand-new Bioconductor packages useful for SNP studies:


Number cruncher

In a recent blog post I described high resolution SNP datasets that are available on the net. To work with these datasets you will probably need to upgrade your hardware and software. For data handling many people nowadays stick to commercial SQL databases that have plugins for PD software.
My recommendation is to save that money and store the data in a special format that may be more useful for these large datasets; details are in a technical report that I will upload later today. In the meantime you can already check some software tools for working with these large datasets. This is what I know so far:

  • David Duffy has recompiled his sibpair program |link
  • Geron(R) has something under development |link
  • Jochen Hampe and colleagues offer Genomizer |link
  • Franz Rüschendorf developed Alohomora |link
  • I remember SNPGWA, a development at Wake Forest University |no link yet
  • there will be an R/Bioconductor package by Rob Scharpf |no link yet
  • R library GenABEL by Yurii Aulchenko |link
  • R library SNPassoc by Juan González |link


A technical report on how to work with large SNP datasets is now also available in my paper section. Alternatives to what I am suggesting in this paper have been set out by an anonymous reviewer:

For R users, if SQLite limits are reached, hdf5 may be one way forward for really huge table structures, since there is an R interface already available. The PostgreSQL column limit depends on data type, with a maximum of 1600 for simple types. MySQL with the BerkeleyDB backend may be like SQLite with no obvious column count limit. Metakit is not mentioned – it is column oriented and probably also has “unlimited” columns as long as each database is < 1GB or so.
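
One way to sidestep per-table column limits altogether is a “long” layout with one row per sample and SNP instead of one column per SNP. The sketch below uses Python's built-in sqlite3 module. It is not the format from my technical report and not the reviewer's suggestion, and the table name, column names and helper functions are made up for illustration.

```python
import sqlite3


def build_genotype_db(rows):
    """Create an in-memory SQLite database in long format.

    rows: iterable of (sample_id, snp_id, genotype_call) tuples.
    The schema is hypothetical: one row per sample/SNP pair.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE genotypes (
            sample_id     TEXT NOT NULL,
            snp_id        TEXT NOT NULL,
            genotype_call TEXT NOT NULL,
            PRIMARY KEY (snp_id, sample_id)
        )
        """
    )
    # Parameterized bulk insert; the number of SNPs only affects the
    # row count, never the column count.
    conn.executemany("INSERT INTO genotypes VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def genotype_counts(conn, snp_id):
    """Return {genotype_call: count} for a single SNP."""
    cur = conn.execute(
        "SELECT genotype_call, COUNT(*) FROM genotypes "
        "WHERE snp_id = ? GROUP BY genotype_call",
        (snp_id,),
    )
    return dict(cur.fetchall())


if __name__ == "__main__":
    demo_rows = [
        ("s1", "rs1", "AA"),
        ("s2", "rs1", "AG"),
        ("s3", "rs1", "AA"),
        ("s1", "rs2", "CC"),
    ]
    db = build_genotype_db(demo_rows)
    print(genotype_counts(db, "rs1"))
```

The price is a much larger row count, so whether this beats a wide table or hdf5 depends on the queries you run most often.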

-moblog- Last week I asked an author for some additional information that was not available in his online supplement. He responded immediately and I saw the email arriving in Thunderbird, but when I wanted to read it a couple of hours later I couldn’t find it – neither in the inbox nor in the spam or trash folder. Bugtraq has identical user reports – which makes me believe that Activesync, too, sometimes drops items from the todo list, mainly the important ones. Computers are only a higher-order system for managing chaos, and there is no reason to believe in their infallibility. Yea, yea.


This is a quick link to Eye of Science, a website with impressive microphotographs. Calibrate your monitor first at sriker, then go to Eye of Science. Oliver Meckes is quoting Albert Einstein:

People should be ashamed to use the wonders of science and technology if they don’t know any more about it than a cow knows about the botany of the grass it relishes in eating.

Some privacy…

Every click leaves many traces on the internet. To enjoy at least some privacy, I recommend installing CookieCuller, which destroys all cookies (except some protected cookies) when you close your browser. A slightly higher level of privacy may be obtained by using TORPARK, which is now even available in a standalone USB stick version. Even when using TORPARK you are still identified by your network card – SMAC is the ultimate solution, yea, yea.



Science writes:

As you browse the Internet, many Web sites such as Google’s record a string of text–the cookie–representing the identity of your computer. And when you use Google, its servers keep track not only of what you search for but also where you go next. People add new entries to this record at the rate of 200 million Web searches per day. This electronic record is key to Google’s business model: Most of its $1 billion annual revenue comes from Internet advertising targeted to individuals.

Another tip – also disable Flash super cookies in the online applet.

Hidden feature at PUBMED

No, it is not really a hidden feature – but keep your mouse on “links” at the right part of the citation, wait for the drop-down, select “Link-out” and there is a good chance to jump directly to the publisher site, yea, yea.


The nodalpoint blog writes about MEDIE, a new PUBMED parser:

…is an “intelligent” semantic search engine that retrieves biomedical correlations from over 14 million articles in MEDLINE. You can find abstracts and sentences in MEDLINE by specifying the semantics of correlations; for example, What activates tumour suppressor protein p53? So just how useful is MEDIE and is it at the cutting edge?

SQL injection and retrovirus infection

What do SQL injection (in web forms) and retrovirus infection (of mammals) have in common? I think there is always a vulnerable situation (a variable tainted by unexpected data entry – or a double strand break and ligation) that allows foreign code to be inserted. What is different? Retrovirus insertion is probably position specific, while SQL injection can even determine its own target. Yea, yea.
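
To make the “tainted variable” side of the comparison concrete, here is a small sketch with Python's built-in sqlite3 module. The table, the function names and the payload are invented for illustration. The first lookup pastes user input straight into the SQL text, the second passes it as a bound parameter.

```python
import sqlite3


def make_db():
    """Build a toy in-memory database with two users (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def lookup_unsafe(conn, name):
    """VULNERABLE: the input becomes part of the SQL statement itself."""
    query = "SELECT secret FROM users WHERE name = '%s'" % name
    return [row[0] for row in conn.execute(query)]


def lookup_safe(conn, name):
    """The input is bound as data and is never parsed as SQL."""
    query = "SELECT secret FROM users WHERE name = ?"
    return [row[0] for row in conn.execute(query, (name,))]


if __name__ == "__main__":
    conn = make_db()
    # Foreign "code" hidden in a form field: the quote closes the string
    # literal and the OR clause makes the condition true for every row.
    payload = "nobody' OR '1'='1"
    print(sorted(lookup_unsafe(conn, payload)))  # leaks every secret
    print(lookup_safe(conn, payload))            # matches no user
```

The bound parameter plays the role of an intact strand: the foreign text is still there, but there is no break where it could be read as code.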

Reading behind the lines

-moblog- Eran Segal et al. describe in Nature a genomic code for the correct attachment of genomic DNA to nucleosomes. DNA, of which 147 bases are wrapped around each nucleosome core, must be positioned so that functional sites of gene activity remain accessible. AT is favored where the phosphodiester backbone faces inward and GC where it faces outward. The distance between nucleosomes may be variable – as the accompanying editorial by Timothy Richmond explains (the enigmatic histone H1 question). Do genomes use nucleosome DNA preference to target transcription factors towards appropriate sites? This might explain why current transcription factor models are rather poor, as they use only sequence binding matrices. It reminds me of steganography, algorithmic procedures that can be used to hide secret messages in pictures without affecting the visual impression. Yea, yea.


I am using an HP6515 GPS phone with a Spectec mini SD card and a SD 512K memory card running PPC2003 (which replaced my older PALMs). I am managing the device with Total Commander (free) and the built-in IPAQ Backup (free); the Today screen uses PocketBreeze (€) and Ilauncher (€). I navigate with OziExplorer (€), TomTom (€) and Metro (free). For internet access I use the built-in messaging client, Opera (€) and pRSSreader (free), which all work together without major problems and leave >21 MB of internal memory free. The mp3 mediaplayer (free) is a good choice. So – the display is not so sharp, the earphones are of low quality, the torch is not very bright, the GPS is not as good as standalone devices, the WLAN signal is weak, the camera is of poor quality, and the device is rather large for a mobile – but there is only 1 device and 1 charger, which outweighs everything! Yea, yea.


Vicious cycle?

PNAS has an interesting paper. Until now I had believed that the major search engines create a kind of self-reinforcing loop, as they first show web sites that are already highly visited. Visitors of these prime websites will then know (and link to) only these websites, giving them an even higher rank (rich-get-richer). But that does not seem to be true. Nay, Nay. The use of search engines actually has an egalitarian effect, as shown by empirical and theoretical arguments. On the other hand, I cannot remember a paper of my own that received more than 50 citations without being in an “upper-class” journal. Did you ever want to know how Google works? Check Ghemawat S, Gobioff H, Leung S-T: The Google file system. ACM, SOSP’03, October 19-22, 2003, Bolton Landing, New York.


An anonymous reader at Slashdot writes that AOL released search logs of 657,427 users: “AOL has released very private data about its users without their permission. While the AOL username has been changed to a random ID number, the ability to analyze all searches by a single user will often lead people to easily determine who the user is, and what they are up to. The data includes personal names, addresses, social security numbers and everything else someone might type into a search box.” The German Green Party has already filed a legislative proposal that would require companies and institutions to inform their clients about such incidents. This seems to be very important for genetic data as well. Yea, yea.