A Deterministic Motif Finding Algorithm with Application to the Human Genome

Lawrence S. Hon and Ajay N. Jain

We present a novel algorithm, MaMF, for identifying transcription factor (TF) binding site motifs. The method is deterministic and relies on an indexing technique to speed up the search process. On common yeast data sets, MaMF performs competitively with other methods. We also present results on a challenging group of eight sets of human genes known to be responsive to a diverse group of TFs. In every case, MaMF finds the annotated motif among the top-scoring putative motifs, performing better than other motif finders. We analyzed the remaining high-scoring motifs and show that many correspond to other TFs that are known to co-occur with the annotated TF motifs. The significant and frequent presence of co-occurring TF binding sites explains in part the difficulty of human motif finding. MaMF is a very fast algorithm, suitable for application to large numbers of interesting gene sets.

Supplemental Data

Downloadable files include:

  • The MaMF code and human data archive. The main experiments can be reproduced using this file. (10MB)
  • stuart-dbtss-human-subset10.zip contains the microarray experiments obtained from Stuart et al., used to compute the expression ratio (ER) (see paper). (9MB)
  • lower-organisms.zip contains promoter sequences, binding sites, and background distributions for the four yeast and E. coli examples used in the paper. (2MB)
  • allhom-masked.zip contains 10 kb of upstream sequence for a large number of human RefSeq genes, used to generate the background distribution in the human examples. (9MB)
The archives should be extracted to the same directory. Please see readme.txt in the mamf.zip archive for details on using MaMF. MaMF runs under the Cygwin environment for Windows, which can be obtained at www.cygwin.com.
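
The extraction step can also be scripted. The following is a minimal sketch in Python, not part of the MaMF distribution: the helper name extract_all is hypothetical, and the file name of the first archive is assumed to be mamf.zip, as the reference to readme.txt suggests. Archives that have not been downloaded are skipped.

```python
import zipfile
from pathlib import Path

# Archive names taken from the download list; "mamf.zip" is assumed to be
# the file name of the MaMF code and human data archive.
ARCHIVES = [
    "mamf.zip",
    "stuart-dbtss-human-subset10.zip",
    "lower-organisms.zip",
    "allhom-masked.zip",
]


def extract_all(archive_dir, target_dir, names=ARCHIVES):
    """Extract every listed archive found in archive_dir into target_dir.

    Returns the names of the archives that were actually extracted.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    for name in names:
        archive = Path(archive_dir) / name
        if not archive.is_file():
            # Not downloaded; skip rather than fail.
            continue
        with zipfile.ZipFile(archive) as zf:
            # All archives share one target directory, as required above.
            zf.extractall(target)
        extracted.append(name)
    return extracted
```

For example, calling extract_all(".", "mamf-data") from the download directory would unpack all available archives into a directory named mamf-data (a placeholder name).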