Background Information for Promoters in BKL

Origin of the Promoter Sequences

The genomic sequence assemblies created by the international sequencing consortia were extracted from the EnsEMBL database. Promoter sequences are extracted only for those protein- and miRNA-encoding genes for which an Entrez ID is defined. Genes located on mitochondrial and chloroplast genomes are excluded because of their special modes of transcription.
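
The gene filter described above can be sketched as follows. This is only an illustration: the Gene record, its field names, the biotype labels and the organelle sequence names are assumptions made for the example and are not taken from the BKL pipeline.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed names of organelle sequences; real names depend on the assembly.
ORGANELLE_SEQ_NAMES = {"MT", "Mt", "Pt"}
# Assumed biotype labels for protein- and miRNA-encoding genes.
ACCEPTED_BIOTYPES = {"protein_coding", "miRNA"}


@dataclass
class Gene:
    """Hypothetical gene record used only for this sketch."""
    gene_id: str
    entrez_id: Optional[str]
    seq_region: str
    biotype: str


def is_eligible(gene: Gene) -> bool:
    """Return True if a promoter would be extracted for this gene."""
    return (
        gene.entrez_id is not None                      # Entrez ID must be defined
        and gene.biotype in ACCEPTED_BIOTYPES           # protein- or miRNA-encoding
        and gene.seq_region not in ORGANELLE_SEQ_NAMES  # no organelle genes
    )
```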


Calculation of Virtual Transcription Start Sites (TSSs)

The calculation of 'virtual TSSs' as reference points for the promoter extraction is based on a collection of TSSs for a given gene. TSSs are taken from EnsEMBL: a TSS is assumed to be the first nucleotide of the most 5' exon of an EnsEMBL mRNA model. The TSSs collected for a given gene are located on a sequence fragment which sometimes spans several thousand nucleotides, and in some cases far more than 100 kb. They are frequently not located in tight clusters of only a few dozen nucleotides in length, but are often spread widely across the fragment.
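
As an illustration of how a TSS follows from an mRNA model, the sketch below picks the first nucleotide of the most 5' exon. It assumes exons are given as 1-based inclusive (start, end) pairs on the forward coordinate system and that the strand is encoded as 1 or -1; the function names are hypothetical.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive, start <= end


def transcript_tss(exons: List[Exon], strand: int) -> int:
    """Return the first nucleotide of the most 5' exon of one mRNA model."""
    if not exons:
        raise ValueError("an mRNA model needs at least one exon")
    if strand == 1:
        # Forward strand: the most 5' exon has the lowest start coordinate.
        return min(start for start, _ in exons)
    if strand == -1:
        # Reverse strand: the most 5' exon has the highest end coordinate.
        return max(end for _, end in exons)
    raise ValueError("strand must be 1 or -1")


def tss_span(tss_positions: List[int]) -> int:
    """Length of the sequence fragment covered by all TSSs of a gene."""
    return max(tss_positions) - min(tss_positions) + 1
```

For a gene with several mRNA models, calling transcript_tss once per model yields the TSS collection, and tss_span shows how widely those TSSs are spread.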

In order to define a reasonable number of 'virtual TSSs' for a given gene from this data collection, an algorithm was designed that applies a set of rules to the data collection to find 'clusters' of TSSs. A window of 3000 nt in length is slid along the entire sequence fragment. At each window position a 'clustering score' is calculated by summing up weighted contributions from each TSS in the window. Each TSS derived from an EnsEMBL mRNA model is scored with 5 evidence points. The evidence points are additionally multiplied by a distance score: a TSS at the central position of the window is multiplied by 1, a TSS at the outer positions by 0, and a TSS at any position in between by a value taken from a cosine function, according to its distance from the center of the window. The peaks of the resulting clustering score are regarded as potential 'virtual TSSs'.
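
The sliding-window scoring can be sketched as follows. The window length and the 5 evidence points per TSS come from the text above; the exact cosine form (a quarter cosine wave that falls from 1 at the center to 0 at the window edge), the 1 nt step size, the tie handling and all function names are assumptions made for this example.

```python
import math
from typing import List

WINDOW = 3000            # window length in nt (from the text)
POINTS_PER_TSS = 5       # evidence points per EnsEMBL-derived TSS (from the text)
HALF_WIDTH = WINDOW / 2


def distance_weight(offset: int) -> float:
    """Weight 1 at the window center, 0 at or beyond the edge (assumed cosine form)."""
    d = abs(offset)
    if d >= HALF_WIDTH:
        return 0.0
    return math.cos(math.pi / 2 * d / HALF_WIDTH)


def clustering_scores(tss_positions: List[int], start: int, end: int) -> List[float]:
    """Score every window center from start to end (inclusive), 1 nt at a time."""
    return [
        sum(POINTS_PER_TSS * distance_weight(t - center) for t in tss_positions)
        for center in range(start, end + 1)
    ]


def find_peaks(scores: List[float], start: int) -> List[int]:
    """Return positions of local maxima; exact ties resolve to the lower coordinate."""
    peaks = []
    for i, s in enumerate(scores):
        left = scores[i - 1] if i > 0 else float("-inf")
        right = scores[i + 1] if i + 1 < len(scores) else float("-inf")
        if s > 0 and s > left and s >= right:
            peaks.append(start + i)
    return peaks


# Two neighbouring TSSs and one isolated TSS on a 6 kb fragment.
tss = [1000, 1010, 5000]
scores = clustering_scores(tss, 0, 6000)
peaks = find_peaks(scores, 0)
```

In this toy input the two neighbouring TSSs merge into one peak between them, while the isolated TSS forms a peak of its own with a lower score.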

For some genes only a handful of evidence points are available, which results in multiple 'virtual TSSs', each supported by only a few evidence points. Therefore, for all genes with fewer than 19 evidence points, only the most 5' 'virtual TSS' is accepted. For all other genes, those peaks are accepted as 'virtual TSSs' for which the respective sequence window contains at least 8% of all evidence points. However, there are genes for which the annotated TSSs, despite good data coverage, are distributed so evenly along the sequence that no prominent peaks occur, so that no peak would be accepted according to the rules above. In this case the most prominent peaks are accepted. If there are more than two peaks for which these conditions are true, the most 5' 'virtual TSS' is accepted. The collection of 'virtual TSSs' prepared in this way is the basis for the extraction of the promoter sequences. The calculation of 'virtual TSSs' and the subsequent data extraction are fully automated processes; whenever conflicts or inconsistencies occur, the respective gene is excluded from the database.
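
A minimal sketch of the acceptance rules is given below. The thresholds (19 evidence points, 8%) are taken from the text; everything else is an assumption for illustration. In particular, the evidence in a window is counted here as the unweighted points of all TSSs within 1500 nt of the peak, 'most prominent' is read as 'highest clustering score', and the rule for more than two qualifying peaks is not modeled because the text does not specify it precisely.

```python
from typing import List, Tuple

WINDOW = 3000            # window length in nt (from the text)
POINTS_PER_TSS = 5       # evidence points per TSS (from the text)
MIN_TOTAL_POINTS = 19    # below this, only the most 5' peak is kept (from the text)
MIN_FRACTION = 0.08      # minimum share of all evidence points per window (from the text)

Peak = Tuple[int, float]  # (position, clustering score)


def select_virtual_tss(peaks: List[Peak], tss_positions: List[int],
                       strand: int = 1) -> List[int]:
    """Apply the acceptance rules to candidate peaks (illustrative only)."""
    if not peaks:
        return []
    total = POINTS_PER_TSS * len(tss_positions)
    # The most 5' position is the lowest coordinate on the forward strand
    # and the highest coordinate on the reverse strand.
    most_five_prime = min if strand == 1 else max
    if total < MIN_TOTAL_POINTS:
        return [most_five_prime(pos for pos, _ in peaks)]

    half = WINDOW / 2
    accepted = []
    for pos, _ in peaks:
        in_window = sum(1 for t in tss_positions if abs(t - pos) <= half)
        if POINTS_PER_TSS * in_window >= MIN_FRACTION * total:
            accepted.append(pos)
    if accepted:
        return accepted

    # Fallback (assumption): no peak reaches the 8% threshold, so keep the
    # peak or peaks with the highest clustering score.
    best = max(score for _, score in peaks)
    return [pos for pos, score in peaks if score == best]
```

For example, a gene with three TSSs has 15 evidence points, so only the most 5' peak would be kept regardless of how many peaks the clustering score shows.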


Copyright © geneXplain. All rights reserved.
Contact us at support@genexplain.com