ANALYSIS AND PREDICTION OF TRANSCRIPTION FACTOR BINDING SITES

 

PhD PROGRAMME "INFORMATICS TOOLS FOR MOLECULAR GENETICS"


UNIVERSITAT DE BARCELONA

June 2005


M.Mar Albà, UPF (malba@imim.es; http://genome.imim.es/evolgenome/)

References:

Wray et al. (2003). The evolution of transcriptional regulation in eukaryotes. Molecular Biology and Evolution 20(9): 1377-1419.

Messeguer, X., Escudero, R., Farré, D., Núñez, O., Martínez, J., Albà, M.M. (2002). PROMO: detection of known transcription regulatory elements using species-tailored searches. Bioinformatics 18: 333-334.

Ovcharenko, I., Loots, G.G., Hardison, R.C., Miller, W., Stubbs, L. (2004). zPicture: dynamic alignment and visualization tool for analyzing conservation profiles. Genome Res. 14: 472-477.

Matys et al. (2003). TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res 31: 374-378.




Introduction

1. Characteristics of gene expression regulatory sequences

Transcription regulation is mediated by transcription factors, many of which recognize DNA motifs in a sequence-specific manner. This binding may activate the expression of a gene or may repress it.  The regulation involves interactions between different transcription factors.

Regions upstream of the transcription start site contain many regulatory elements and are known as promoters. In multicellular organisms, however, the picture can be considerably more complex: a gene's regulatory sequence can consist of several cis-regulatory modules, discrete regions that drive the expression of the gene under particular conditions or in particular tissues.

Regulatory sequences have the potential to evolve quickly, as mutations in one or a few nucleotides may lead to the loss of a transcription factor binding site or the acquisition of a new one, and different cis-regulatory modules can evolve independently.





2. Transcription factor binding sites

Transcription factors bind to DNA motifs in regulatory regions in a sequence-specific manner. The motifs are short and imprecise and so they are difficult to identify in regulatory sequences, as many of the motifs will be found just by chance.

example:
TATA site in the gene promoter

CLUSTAL W (1.82) multiple sequence alignment

seq3    TATAAA 6
seq7    TATAGA 6
seq8    TATAAA 6
seq2    TATAAA 6
seq5    GATAAA 6
seq6    TATAAA 6
seq1    TATAAA 6
seq4    TATAAT 6
         ***

We can capture this variability using consensus sequences, such as [T/G]ATA[A/G][A/T], or position weight matrices (PWM), such as:


TATA box PWM (counts):

                1    2    3    4    5    6
          - - - - - - - - - - - - - - - - - -
           A    0    8    0    8    7    7
           C    0    0    0    0    0    0
           G    1    0    0    0    1    0
           T    7    0    8    0    0    1

Relative frequencies:

                1        2        3        4        5        6
          - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
           A    0        1        0        1        0.875    0.875
           C    0        0        0        0        0        0
           G    0.125    0        0        0        0.125    0
           T    0.875    0        1        0        0        0.125

(each cell is Mij, the relative frequency of nucleotide i at position j)
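As an illustration, the two tables can be derived directly from the eight aligned sites. The following is a minimal Python sketch; the function names count_matrix and frequency_matrix are chosen here for illustration and do not come from any of the tools used in this practical.

```python
from collections import Counter

# The eight aligned TATA sites from the alignment above.
SITES = ["TATAAA", "TATAGA", "TATAAA", "TATAAA",
         "GATAAA", "TATAAA", "TATAAA", "TATAAT"]
BASES = "ACGT"


def count_matrix(sites):
    """Return {base: [count at position 1, 2, ...]} for equal-length sites."""
    length = len(sites[0])
    columns = [Counter(site[i] for site in sites) for i in range(length)]
    return {base: [column[base] for column in columns] for base in BASES}


def frequency_matrix(counts, n_sites):
    """Divide every count by the number of sites to get relative frequencies."""
    return {base: [c / n_sites for c in row] for base, row in counts.items()}


counts = count_matrix(SITES)
freqs = frequency_matrix(counts, len(SITES))
for base in BASES:
    print(base, counts[base], freqs[base])
```

Each printed row should reproduce the corresponding row of the count and relative frequency tables.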

The use of PWM allows us to perform more specific searches. We will not only be interested in whether a nucleotide can appear in a particular position but also in the frequency at which that is observed.

To search for this motif we run a sliding window, of the same size as the motif, along the sequence and each time we score the similarity to the motif.

As a score we can use the sum of the Mij values of the nucleotides observed in that particular window. We then apply a cut-off, a minimum score, to define our predictions.

          ..CGTGAGTC TATAAA GCGTG GATAGT CCGG GCGCTA TGC..
                     window x     window y    window z


    score window x (TATAAA): 0.875 + 1 + 1 + 1 + 0.875 + 0.875 = 5.625

    score window y (GATAGT): 0.125 + 1 + 1 + 1 + 0.125 + 0.125 = 3.375

    score window z (GCGCTA): 0.125 + 0 + 0 + 0 + 0 + 0.875 = 1

Note that if we used a consensus sequence for the TATA box motif (such as the one above), windows x and y would both match the consensus. Using a PWM (also called a position-specific scoring matrix, PSSM), however, we can see that window x is closer to the TATA box motif (it has a better score) than window y, so it is a more reliable prediction of the motif. If, for example, we set a cut-off of 5, only window x is recovered.


3. Transcription factor binding site databases (position weight matrices)

A number of databases collect information on experimentally validated sites and use it to construct weight matrices for the binding sites of different transcription factors. This information can then be used to scan a sequence of interest and predict putative transcription factor binding sites (see above).

Animals:

TRANSFAC:  http://www.gene-regulation.com/pub/databases.html
PROMO: http://alggen.lsi.upc.es/  (follow links to Research, PROMO 3.0)
Jaspar: http://jaspar.cgb.ki.se/

Plants:

PlantCARE:  http://intra.psb.ugent.be:8080/PlantCARE/

E.coli:

RegulonDB: http://www.cifn.unam.mx/Computational_Genomics/regulondb/

Yeast:

SCPD: http://cgsigma.cshl.org/jian/


4. Comparative approaches

The prediction of transcription factor binding sites in a sequence produces many false positives, that is, many of the predictions will not correspond to functional sites. This happens because the motifs are typically short and imprecise, so the probability of finding them by chance is high. One of the most interesting approaches to this problem is to look for motifs that have been conserved during evolution. This is known as phylogenetic footprinting. We identify regions that have been conserved in orthologous genes from relatively distant species and, because of this evolutionary conservation, we infer that they probably contain functional regulatory sequences.


Example of phylogenetic footprinting (Boffelli et al., 2004)
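To get a feeling for why chance matches are so frequent, we can estimate how often the TATA consensus [T/G]ATA[A/G][A/T] is expected to occur in random sequence. The Python sketch below is a rough back-of-the-envelope calculation, assuming that the four nucleotides are equally likely and independent (real promoters are not like this) and that both strands are scanned. The length of 346 nt corresponds to a region spanning -360 to -15; the function names are illustrative.

```python
from fractions import Fraction

# Allowed nucleotides at each position of the consensus [T/G]ATA[A/G][A/T].
CONSENSUS = ["TG", "A", "T", "A", "AG", "AT"]


def match_probability(consensus, base_prob=Fraction(1, 4)):
    """Probability that a random window matches (uniform, independent bases)."""
    p = Fraction(1)
    for allowed in consensus:
        p *= len(allowed) * base_prob
    return p


def expected_chance_matches(consensus, seq_length, strands=2):
    """Expected number of chance matches in a random sequence of this length."""
    windows = seq_length - len(consensus) + 1
    return float(match_probability(consensus) * windows * strands)


p = match_probability(CONSENSUS)
expected = expected_chance_matches(CONSENSUS, seq_length=346)
print(p)         # 1/512
print(expected)  # about 1.3 chance matches for this single motif
```

Under these assumptions a single short motif is already expected to match about once by chance in a region of this size. A library of hundreds of matrices will therefore produce many chance predictions, which is why conservation across species is a useful filter.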




Practical: Create logos from conserved motifs


Below we have information about the binding sites of MEF2 and of the TATA-binding protein (TBP). We will use the multiple alignments to obtain a Logo representation of the binding sites. The height of the letters represents the level of conservation of each nucleotide at each position.

- Go to WebLogo

- Paste the alignments below (without the title)

- Click on "Create Logo"


1. MEF2 motif (Myocyte enhancer factor 2 transcription factor binding site)

CTAAAAATAA
TTAAAAATAA
TTTAAAATAA
CTATAAATAA
TTATAAATAA
CTTAAAATAG
TTTAAAATAG


2. TATA box (TATA binding protein site)


seq3               TATAAA
seq7               TATAGA
seq8               TATAAA
seq2               TATAAA
seq5               GATAAA
seq6               TATAAA
seq1               TATAAA
seq4               TATAAT
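The quantity behind the letter heights can be approximated by hand. The Python sketch below computes the information content of each alignment column in bits, as 2 minus the Shannon entropy of the column. This is a simplified version: WebLogo may additionally apply a small-sample correction, so the heights it draws can differ slightly from these values. The function name information_content is chosen for this example.

```python
import math
from collections import Counter

MEF2 = ["CTAAAAATAA", "TTAAAAATAA", "TTTAAAATAA", "CTATAAATAA",
        "TTATAAATAA", "CTTAAAATAG", "TTTAAAATAG"]
TATA = ["TATAAA", "TATAGA", "TATAAA", "TATAAA",
        "GATAAA", "TATAAA", "TATAAA", "TATAAT"]


def information_content(sites):
    """Bits per column: 2 - entropy (no small-sample correction applied)."""
    n = len(sites)
    bits = []
    for i in range(len(sites[0])):
        column = Counter(site[i] for site in sites)
        entropy = -sum((c / n) * math.log2(c / n) for c in column.values())
        bits.append(2.0 - entropy)
    return bits


for name, sites in (("MEF2", MEF2), ("TATA", TATA)):
    print(name, [round(b, 2) for b in information_content(sites)])
```

Fully conserved columns reach the maximum of 2 bits, while columns with more than one nucleotide score lower; these should correspond to the tallest and the shorter stacks in the logo.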
               



Practical: Analysis of the gene expression regulatory region of the alpha-actin cardiac gene

We have the following sequences of the upstream region of cardiac alpha-actin gene:

>human_-360_to_-15
CTGCGGAGGACCGAATCCACAGACCATCCAGGGAGCACCCACACCCCAGAAAGGGGGAGGGGTGGGCTGGCGTCAC
TTAGTCTTCCCCTGCCCCCTACCCTTCAGCGCCTGCCCCTCCCCAGCTCCCTATTTGGCCATCCCCCTGACTGCCCCC
TCCCCTTCCTTACATGGTCTGGGGGCTCCCTGGCTGATCCTCTCCCCTGCCCTTGGCTCCATGAATGGCCTCGGCAGT
CCTAGCGGGTGCGAAGGGGACCAAATAAGGCAAGGTGGCAGACCGGGCCCCCCACCCCTGCCCCCGGCTGCTCCAA
CTGACCCTGTCCATCAGCGTTCTATAAAGCGGCCCTCCT
>mouse_-360_to_-12
TTGGAAGGGCTGAAGAGCAATAAGCCCACTCCACAACTAGGGAGCTCCCCCACCCAAGGGGCGCATTGGCATCACATAGCCTTTCC
CCGTCCCCCACCCCTTGCTGGCCTGCCCCTCCCTAGCTCCCTATATGGCCATTGCTCTGACTGCCCCCTCCCCTTCCTTACATGG
TCTGGGAGCCCCCTGGCTGATCCTCTACCCTGCCCTTGGCTCCAAGAATGGCCTCAGCGGTCCTAGATGGTGCTAAGGCGACCAA
ATAAGGCAAGGTGGCAGATCAGGGGCCCCCCACCCCTGCCCCCGGCTGCTCCAACTGACCCCGTCCATCAGAGAGCTATAAAGCTG
CGCTCCA
>chicken_-335_to_-24
ACGCCCCGCGTGAAGGCCACCCGGGCCCGACATCTCGGGCAGCGCACCTGGCTTACACTTCCTCGAGGGACCATGA
GGGCCACAGAAGAACTCCGAGCCTCCCCTCCCACCACGTCGGCGGAGGCTCCCTATTTGGCCATGTGGCGGCGGXX
XXXXXXXXXXTCCGCACCTGCCTTAGATGGCCGGACAGCCGCGCCGCCTTGCGCCATTCATGGCCGCGCTGCGCCG
CCATGGCGCCGAGCCGGCCAAATAAGAGAAGGTGGCTGCCCCGGCCCGCGGACCGCGGCCGCCGGGGGCTATAAAG
CGGCAGCTTC
>frog_-355_to_-14
AGTCCCCCTGCACAATTGTGCTGCACCTGTCTACTCCATTTGCAGACCCCTGTGTCTGTGCAAACTATTTCTTTCATT
GTGCTGTTTTTTTTGTCACCCAGCATTACAGACATGCTTTTTTGGGAATCCCTATTTGGCCATCCCTAGTAGTGCTCC
CXXXXXXXXXXXXXXTTTCCATACATGGGCTAAGGGGTCCAAAGACCCTGCCCTCCCCCCTCACCTACTCCATTAA
TGGCTTCTTTGCTTTTCAATGGCCAGAAGCTACCAAATAAGGGCAGGCTGCCTGCCTTTCGGAGCTCCCACTGACTC
CTCAACTCCAGGCAGCGTATAAATTGACAGCTCA

The aim of the practical is, given these upstream sequences from orthologous genes, to explore different ways of predicting transcription regulatory elements.



    1. Obtain the PSSM of the TATA box from

                1.1 Transfac
  
*    Go to Transfac database  (login: ub_2005; password: ub2005)

*    In TRANSFAC 6.0: choose Search action

*    Select the table of Matrix

*    Enter the factor name TATA

*    Set (Factor) Name (NA) as searching field and submit the query

*    There are two entries: M00252 and M00216

*    Select M00252 matrix for inspection
            
*    Keep this matrix for further use (from AC line to //)


                1.2 Jaspar


*    Go to the Jaspar database

*    Input "TBP" in the "search by name" box 

*    Inspect the matrix and logo representation (click on View)

*   How many sequences have been used to construct this matrix?



    2. Predict the position of the TATA box


               
                *    Open RSA tools webserver

                *    On the left frame, click on Pattern matching - patser (matrices)

                *    Paste the human alpha-actin cardiac gene upstream sequence

                *    Select Transfac as Matrix Format

                *    Paste the Transfac TATA matrix (including matrix header) obtained previously

                *    In Origin select "start" (of the sequence) and press GO

                *   Check the results.  They should look something like this:

map                type  id      strand  start  end  sequence                 score  ln(P)
human_-360_to_-15  site  patser  D          46   60  caccCCAGAAAGGGGGAGGggtg   3.29  -5.74
human_-360_to_-15  site  patser  R         120  134  tcccCAGCTCCCTATTTGGccat   1.41  -4.61
human_-360_to_-15  site  patser  D         127  141  ctccCTATTTGGCCATCCCcctg   0.46  -4.13
human_-360_to_-15  site  patser  R         157  171  cctcCCCTTCCTTACATGGtctg   3.78  -6.07
human_-360_to_-15  site  patser  D         213  227  ggctCCATGAATGGCCTCGgcag   4.38  -6.51
human_-360_to_-15  site  patser  D         253  267  gggaCCAAATAAGGCAAGGtggc   3.00  -5.54
human_-360_to_-15  site  patser  R         323  337  ccatCAGCGTTCTATAAAGcggc   4.50  -6.60
human_-360_to_-15  site  patser  D         328  342  agcgTTCTATAAAGCGGCCctcc   0.91  -4.35
human_-360_to_-15  site  patser  D         330  344  cgttCTATAAAGCGGCCCTcc     7.67  -9.65


                *    Each row in the results table is a putative match to the TATA box matrix. Which one is most likely to be the real TATA box?

                *   To obtain a graphical representation of predictions press "feature map"
               
                *   In the RSA-tools - feature map page press "GO"

                *   Identify the best TATA box prediction in the drawing
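The results table can also be ranked programmatically. The Python sketch below parses the rows shown above, picks the match with the highest score, and converts its position to a coordinate relative to the gene. The conversion assumes that position 1 of the pasted sequence corresponds to -360, as the sequence name suggests; the function names are illustrative and are not part of RSA tools.

```python
# Rows copied from the example patser output shown above.
ROWS = """\
human_-360_to_-15 site patser D 46 60 caccCCAGAAAGGGGGAGGggtg 3.29 -5.74
human_-360_to_-15 site patser R 120 134 tcccCAGCTCCCTATTTGGccat 1.41 -4.61
human_-360_to_-15 site patser D 127 141 ctccCTATTTGGCCATCCCcctg 0.46 -4.13
human_-360_to_-15 site patser R 157 171 cctcCCCTTCCTTACATGGtctg 3.78 -6.07
human_-360_to_-15 site patser D 213 227 ggctCCATGAATGGCCTCGgcag 4.38 -6.51
human_-360_to_-15 site patser D 253 267 gggaCCAAATAAGGCAAGGtggc 3.00 -5.54
human_-360_to_-15 site patser R 323 337 ccatCAGCGTTCTATAAAGcggc 4.50 -6.60
human_-360_to_-15 site patser D 328 342 agcgTTCTATAAAGCGGCCctcc 0.91 -4.35
human_-360_to_-15 site patser D 330 344 cgttCTATAAAGCGGCCCTcc 7.67 -9.65
"""


def parse_hits(text):
    """Turn whitespace-separated result rows into a list of dictionaries."""
    hits = []
    for line in text.strip().splitlines():
        fields = line.split()
        hits.append({
            "strand": fields[3],
            "start": int(fields[4]),
            "end": int(fields[5]),
            "sequence": fields[6],
            "score": float(fields[7]),
            "ln_p": float(fields[8]),
        })
    return hits


def to_upstream_coordinate(position, first_base=-360):
    """Assumption: position 1 of the input sequence is base -360."""
    return first_base + position - 1


hits = parse_hits(ROWS)
best = max(hits, key=lambda hit: hit["score"])
print(best["strand"], best["start"], best["end"], best["score"])
print(to_upstream_coordinate(best["start"]),
      to_upstream_coordinate(best["end"]))
```

The best-scoring match is also the one with the lowest ln(P), and under the coordinate assumption it falls at roughly -31 to -17 relative to the gene.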



    3. Predict all putative TFBS in one sequence using TRANSFAC

                *    Go to TRANSFAC applications

                *    Choose the program Match-Public to scan promoter sequences searching for sites using the complete library of TRANSFAC matrices

                *    Paste the human alpha-actin cardiac gene upstream sequence

                *    Uncheck the option "Minimize false positives". Select the last option and set cut-offs: 0.75 (matrix similarity) and 0.85 (core similarity)

                *    Click on "Submit the form"
 

                *    Inspect the putative TFBSs. How many are there? What is the frequency of predicted sites per nucleotide? Is this realistic?

                *   Click on "View a graphic of the following search results"

                *    Go back to the initial page and select Group of matrices "vertebrates" (similarity cut-offs as before). Submit the form.

                *    Inspect the putative TFBSs. How many are there? What is the frequency of predicted sites per nucleotide? Is this realistic?

                *   Go back to the initial page and change cut-offs: 0.85 (matrix similarity) and 1 (core similarity)

                *   How many putative hits do we get now? Why?
     


    4. Predict all putative TFBS in one sequence using PROMO

                *   Go to PROMO (Select RESEARCH and then PROMO 3.0b)

                *   Go to SearchSites and input the human alpha-actin cardiac gene upstream sequence

                *   Click Submit.

   
                *   Inspect the putative TFBSs. How many different TFBS do we find?

                *   Go back to SelectSpecies in the menu at the left. Choose "only human factors" and "only human sites". Click Submit.

                *   Go to SearchSites and input the human alpha-actin cardiac gene upstream sequence

                *   Click Submit.

   
                *   Inspect the putative TFBSs. How many different TFBS do we find?

                *   Click on some of the factors below "Factors predicted within a dissimilarity margin less or equal than 15%"

                *    What are dissimilarity and RE (random expectation)? Compare the values for the factor "GR-alpha" with those for "TBP" (TATA-binding protein).

                *   Click on Zoom. How many predictions do we have for "GR-alpha" and for "TBP"? Can we trust them?

 

        
    5. Predict all putative TFBS in several sequences in PROMO


                *   Go back to SelectSpecies in the menu at the left.

                *   Select "Selected species factors" and "Selected species sites".

                *   We will select the phylogenetic group "chordata". To do so, click on the arrow for "all species" below, then select "eukaryota", then "animals", then "chordata". Do this for both "Factors of" and "Sites of".

   
                *   Go to MultiSearchSites in the menu at the left.

                *   Paste the 4 sequences of the cardiac alpha-actin gene. Select "Sites found in 1 or more sequences" to see all the predictions. Click Submit.

                *   Inspect the putative TFBSs. How many different TFBS do we find?              

                *   As these are orthologous sequences it is likely that functional sites are shared. Go back and select "Sites found in all sequences".

                *   Inspect the putative TFBSs. How many different TFBS do we find?

                *   Go back and change the dissimilarity cut-off to 5. (maximum percent dissimilarity allowed)

                *   Inspect the putative TFBSs. How many different TFBS do we find?

                *   Click on some of the factors below "Factors predicted in 4 or more input sequences within a dissimilarity margin less or equal than 5 %".

                *   Observe the differences in RE. Compare for example the values of CArG box-binding protein or SRF with those of GR-beta.

                *   Observe the position on the sequences of the TATA box predictions (TBP) and of the CArG box-binding protein binding site.

                *   Keep the results page to compare with following sections.
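The idea behind "Sites found in all sequences" can be illustrated with a small Python sketch that keeps only the factors predicted in a minimum number of input sequences. The input below is a hypothetical toy example and NOT real PROMO output, the function name shared_factors is made up, and the sketch ignores the positions of the sites, so it only approximates what a real comparative tool does.

```python
def shared_factors(predictions, min_sequences=None):
    """Return factors predicted in at least min_sequences sequences.

    predictions maps a sequence name to the set of factor names predicted
    in it. By default a factor must be found in all sequences.
    """
    if min_sequences is None:
        min_sequences = len(predictions)
    counts = {}
    for factors in predictions.values():
        for factor in factors:
            counts[factor] = counts.get(factor, 0) + 1
    return sorted(f for f, n in counts.items() if n >= min_sequences)


# Hypothetical toy predictions for illustration only (not real results).
toy = {
    "human": {"TBP", "SRF", "GR-alpha", "Sp1"},
    "mouse": {"TBP", "SRF", "GR-alpha"},
    "chicken": {"TBP", "SRF", "Sp1"},
    "frog": {"TBP", "SRF"},
}

print(shared_factors(toy, min_sequences=1))  # found in 1 or more sequences
print(shared_factors(toy))                   # found in all sequences
```

Requiring a factor to be present in every orthologous sequence shrinks the list considerably, which mirrors what happens when switching between the two options in MultiSearchSites.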


  
    6. Predict putative TFBS conserved in aligned human-mouse sequences with rVISTA



               *   Go to zPicture to align the human and mouse alpha-actin cardiac gene upstream sequences.

               *   Input the human sequence in SEQUENCE1 and the mouse sequence in SEQUENCE2. Click Submit.

               *   Click on submit alignment to rVISTA

               *   Use vertebrate matrices and click "Submit".

               *    Go down the page and select "Select All" and "Submit".

               *    Click on "Check it".  

               *   Compare the number of TFBS predictions conserved between the two sequences with the number in each sequence separately ("Summary" and "Binding sites in the input sequences").

               *   Click on "Dynamic visualization" ("Dynamically Overlay")

              
               *   Select "Select All" and "Submit".

               *   Change in "Picture - bases per layer" to 0.1 Kb.  Click Submit.

               *   Go down the page and inspect hits. Locate  SRF and TATA binding site predictions.

            
               *   Keep the results page to compare with following sections.


  

    7. Compare the results of the predictions (PROMO MultiSearchSites and rVISTA) with annotated sites

             *   Search for known TFBS in these sequences in the Catalog of Muscle-specific Regulatory Elements

             *   Go to Table of Contents

             *   Go to Actin, Alpha-Cardiac

             *   Go to "Transcription factor binding sites".

             *   Go down the page to see binding sites in the four alpha-actin cardiac gene upstream sequences.
      
             *   Note that the CArG box (= SRF binding site) is conserved in the 4 sequences. Compare with the results obtained in the previous sections with TFBS prediction programs.




Additional files:

TATA matrix from TRANSFAC
     
PROMO chordata tree