GitHub - gifford-lab/GEM: High resolution peak calling and motif discovery for ChIP-seq and ChIP-exo data

NOTE: this repository contains the source code for GEM. A compiled version of the latest release is available here.

GEM: High resolution peak calling and motif discovery for ChIP-seq and ChIP-exo data

Genome wide Event finding and Motif discovery

Citation:

High Resolution Genome Wide Binding Event Finding and Motif Discovery Reveals Transcription Factor Spatial Binding Constraints. Yuchun Guo, Shaun Mahony & David K Gifford, (2012) PLoS Computational Biology 8(8): e1002638.

Abstract:

An essential component of genome function is the syntax of genomic regulatory elements that determine how diverse transcription factors interact to orchestrate a program of regulatory control. A precise characterization of in vivo spacing constraints between key transcription factors would reveal key aspects of this genomic regulatory language. To discover novel transcription factor spatial binding constraints in vivo, we developed a new integrative computational method, genome wide event finding and motif discovery (GEM). GEM resolves ChIP data into explanatory motifs and binding events at high spatial resolution by linking binding event discovery and motif discovery with positional priors in the context of a generative probabilistic model of ChIP data and genome sequence. GEM analysis of 63 transcription factors in 214 ENCODE human ChIP-Seq experiments recovers more known factor motifs than other contemporary methods, and discovers six new motifs for factors with unknown binding specificity. GEM's adaptive learning of binding-event read distributions allows it to further improve upon previous methods for processing ChIP-Seq and ChIP-exo data to yield unsurpassed spatial resolution and discovery of closely spaced binding events of the same factor. In a systematic analysis of in vivo sequence-specific transcription factor binding using GEM, we have found hundreds of spatial binding constraints between factors. GEM found 37 examples of factor binding constraints in mouse ES cells, including strong distance-specific constraints between Klf4 and other key regulatory factors. In human ENCODE data, GEM found 390 examples of spatially constrained pair-wise binding, including such novel pairs as c-Fos:c-Jun/USF1, CTCF/Egr1, and HNF4A/FOXA1. 
The discovery of new factor-factor spatial constraints in ChIP data is significant because it proposes testable models for regulatory factor interactions that will help elucidate genome function and the implementation of combinatorial control.

News:

Papers citing GEM

GEM has been selected to be part of the ENCODE TF ChIP-seq analysis pipeline.

MIT NEWS Deciphering the language of transcription factors (MIT News article on the GEM paper).

GEM is scientific software for studying protein-DNA interaction at high resolution using ChIP-seq/ChIP-exo data. It can also be applied to CLIP-seq and Branch-seq data.
GEM links binding event discovery and motif discovery with positional priors in the context of a generative probabilistic model of ChIP data and genome sequence, and resolves ChIP data into explanatory motifs and binding events at high spatial resolution. GEM reciprocally improves motif discovery using binding event locations, and binding event predictions using discovered motifs.

GEM has the following features:

Exceptionally high spatial resolution on binding event prediction (aka peak calling)
Highly accurate de novo motif discovery
Resolve closely spaced (less than 500bp) homotypic events that appear as a single cluster of reads
Enable analysis of spatial binding constraints of multiple transcription factors, for predicting TF dimer binding sites, enhanceosomes, etc.
Analyze ChIP-seq, ChIP-exo, CLIP-seq and Branch-seq data, single-end or paired-end
Run in single-condition mode or multi-condition mode

GEM is implemented in Java, which is available for all major operating systems.

See an example of GEM output for ES cell Sox2 binding, including binding events, K-mer Set motifs (KSMs), PWM motifs, and motif spatial distribution plots.

Download

Download, unzip, and run ... see command line examples.

Download GEM software (version 2.6) and test data

GEM vs. GPS

GEM is a superset of GPS. GEM uses both ChIP-seq read data and genome sequence to perform de novo motif discovery and binding event calling, while GPS uses only ChIP-seq read data for binding event calling.

GEM can be activated by giving a genome sequence (--genome) and using any one of the following command line options:

--k: the length of the k-mers
--k_min and --k_max: the range for the length of k-mers
--seed: the seed k-mer to jump start k-mer set motif discovery. The length of the seed k-mer will be used to set k.

If none of these options is used, GEM will run GPS only and then stop.

System requirements

GEM is Java software. No installation is required. It works on most computer systems.

Java 7 is required to execute the JAR. For analysis of a mammalian genome, GEM requires about 5-15 GB of memory. The amount can be specified on the command line with the -Xmx option (e.g. java -Xmx10G -jar gem.jar allocates 10 GB of memory).

Read distributions

As with GPS, GEM requires a read distribution file. The user can start from one of the default read distribution files provided with the software (a ChIP-seq default read distribution file and a ChIP-exo default read distribution file). After one round of prediction, GEM will re-estimate the read distribution using the predicted events.

The read distribution file specifies the empirical spatial distribution of reads for a given binding event. The file contains tab-delimited position/density pairs. The first field is the position relative to the binding position (i.e. binding event is at position 0). The second field contains the corresponding read density at that position. For example,

-344    1.42285E-4
-343    1.42275E-4
-342    1.42265E-4
... ...

Alternatively, it can be estimated directly from the ChIP-seq data. Given a set of events, we count all the reads at each position (the 5' end of the reads) relative to the corresponding event positions. The initial set of events for estimating the empirical spatial distribution can be defined by using known motifs or by finding the center of the forward and reverse read profiles. GEM has a tool to calculate the read distribution from a user provided file (coords.txt) containing the coordinates (format chr:coord, e.g. 1:3498832, or chr:coord:strand, e.g. 1:3498832:+, each coordinate on a separate line).

java -Xmx1G -cp gem.jar edu.mit.csail.cgs.deepseq.analysis.GPS_ReadDistribution --g "mm8.chrom.sizes" --coords "coords.txt" --chipseq "SRX000540_mES_CTCF.bed" --f "BED" --name "CTCF" --range 250 --smooth 5 --mrc 4

For ChIP-exo data, we suggest --smooth 3 --mrc 20.
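
The counting step described above can be illustrated with a minimal Python sketch. It is not the GPS_ReadDistribution implementation: the function name, the single-chromosome input, and the mirroring of reverse-strand reads around the event are assumptions made for illustration, and no smoothing is applied.

```python
from collections import Counter

def estimate_read_distribution(reads, events, max_range=250):
    """Count read 5' ends at each offset from the given event positions.

    reads:  list of (five_prime_position, strand) on one chromosome.
    events: list of event coordinates on the same chromosome.
    Assumption of this sketch: reverse-strand reads are mirrored around
    the event so both strands share one profile.
    """
    counts = Counter()
    for event in events:
        for five_prime, strand in reads:
            if strand == "+":
                offset = five_prime - event
            else:
                offset = event - five_prime
            if -max_range <= offset <= max_range:
                counts[offset] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    # Normalize the counts to a density over every offset in the range.
    return {o: counts[o] / total for o in range(-max_range, max_range + 1)}

# Toy input: three reads near one event at position 100, one far away.
toy_reads = [(90, "+"), (95, "+"), (110, "-"), (500, "+")]
toy_dist = estimate_read_distribution(toy_reads, [100], max_range=20)
```

The resulting position/density pairs have the same shape as the read distribution file described earlier in this section.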

After GPS/GEM makes the prediction, it will re-estimate the read distribution using the predicted events.

If the data are too noisy or too few events are used for re-estimation, the new read distribution may not be accurate. The users are encouraged to examine the read distribution using the plot of read distributions (X_All_Read_Distributions.png) output by GEM.
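
Besides looking at the plot, a read distribution file can be inspected programmatically. The following Python sketch reads the position/density pairs described above; the function name is hypothetical, and splitting on any whitespace is a choice of this sketch so that both tab- and space-separated copies parse.

```python
def parse_read_distribution(text):
    """Return a list of (position, density) pairs from a read distribution file."""
    pairs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        pairs.append((int(fields[0]), float(fields[1])))
    return pairs

# The three example lines shown in this section.
example_text = "-344\t1.42285E-4\n-343\t1.42275E-4\n-342\t1.42265E-4\n"
read_dist_pairs = parse_read_distribution(example_text)
positions = [pos for pos, _ in read_dist_pairs]
```

With the pairs in memory it is straightforward to check, for example, that positions are consecutive or that the densities are non-negative.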

Input and output

GEM takes an alignment file of ChIP-seq reads and a genome sequence as input and reports a list of predicted binding events and the explanatory binding motifs.

ChIP-seq alignment file

ChIP-seq alignment file formats that are supported:

BED
SAM

Support for some file formats may be out of date. If you have an error related to FileReadLoader, please do the following:

Send me a few example lines (and the version of the aligner software), and I will update the loader in the next release.
As a quick fix, convert the read alignment file into BED format and try again.

Genome sequence

To run de novo motif discovery, a genome sequence (UCSC download) is needed. The path to the directory containing the genome sequence files (by chromosome, *.fa or *.fasta files, with the prefix "chr") can be specified using option --genome (for example, --genome your_path/mm8/). Note that the chromosome names should match those in the "--g" genome.chrom.sizes file, as well as those in your read alignment file.

For example, the fasta file for chromosome 2 is chr2.fa:

>chr2
TAATTGTAATAGTATATACTTGTATGTACTTAAAATAttttatcatagtt
ATCTGGATTTTTGATGGCTATCATGACCTCTGAATGACTAGGGAATCTTG
... ...
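
Because chromosome names must agree across the FASTA files, the genome.chrom.sizes file, and the read alignment file, a quick consistency check before running GEM can save time. The Python sketch below is one way to do it; the helper names and the toy inputs are hypothetical, and it compares names exactly as written without any normalization.

```python
def fasta_name(fasta_text):
    """Return the sequence name from the first FASTA header line (">name")."""
    first = fasta_text.splitlines()[0]
    if not first.startswith(">"):
        raise ValueError("FASTA header must start with '>'")
    return first[1:].split()[0]

def first_column(text):
    """Collect the first tab-delimited field of every non-empty line."""
    return {line.split("\t")[0] for line in text.splitlines() if line.strip()}

def unmatched(reference_names, other_names):
    """Names present in other_names but absent from reference_names."""
    return sorted(set(other_names) - set(reference_names))

# Toy inputs (made-up lengths and reads, for illustration only).
chrom_sizes_text = "chr1\t1000\nchr2\t800\n"
bed_text = "chr1\t10\t46\tr1\t0\t+\nchrM\t5\t41\tr2\t0\t-\n"
fasta_text = ">chr2\nTAATTGTAATAGTATATACT\n"

size_names = first_column(chrom_sizes_text)
missing_from_sizes = unmatched(size_names, first_column(bed_text))
fasta_ok = fasta_name(fasta_text) in size_names
```

Here the reads on chrM would be flagged because that name does not appear in the chrom.sizes input.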

Output files
GEM outputs both the binding event files and the motif files. Because of the read distribution re-estimation, GEM outputs event prediction and read distribution files for multiple rounds. (See more details)

GEM event text file (GEM_events.txt, see more details)
K-mer set motifs (KSM.txt, see more details)
PFM file of PWM motifs (PFM.txt)
HTML file summarizing the GEM event and motif results (see an example and explanations)
GEM output folder containing more detailed result files; rounds 1 and 2 are the GPS and GEM results, respectively (see more details)
- GEM event text files (significant, insignificant and filtered)
- K-mer set motifs
- PFM file of PWM motifs
- Read distribution file
- Spatial distribution between primary motif and all the secondary motifs in the 61bp around the GEM events (this is based on PWM motif match, not based on the GEM binding sites). You will have the files such as Name_Spatial_dist_0_1.txt/png showing the spatial distribution of secondary motifs (#1) with respect to the primary motif (#0) in text format or as a PNG image. If you click on the PNG image on the HTML output page, you also get the txt file with the values.

GEM also outputs the list of insignificant events (those that do not pass the statistical test), and the filtered events (those that would pass the statistical test using the read count, but have a low IP/Ctrl ratio, or whose distribution of reads is quite different from the empirical distribution).

Optionally, GEM can be set to output BED files (using option --outBED) for loading the GEM results into the Genome Browser as custom tracks for visualization. Note that in the BED file, the coordinates are extended 100bp to the left and right to give a region for visualization.

The GEM event file is a tab-delimited file (xxx_n_GEM_events.txt) with the following fields:

Field	Description
Location	the genome coordinate of this binding event
IP binding strength	the number of IP reads associated with the event
Control binding strength	the number of control reads in the corresponding region
Fold	fold enrichment (IP/Control)
Expected binding strength	the number of IP read counts expected in the binding region given its local context (defined by parameter W2 or W3), this is used as the Lambda parameter for the Poisson test
Q_-lg10	-log10(q-value), the q-value after multiple-testing correction, using the larger p-value of Binomial test and Poisson test
P_-lg10	-log10(p-value), the p-value is computed from the Binomial test given the IP and Control read counts (when there are control data)
P_poiss	-log10(p-value), the p-value is computed from the Poisson test given the IP and Expected read counts (without considering control data)
IPvsEMP	Shape deviation, the KL divergence of the IP reads from the empirical read distribution (log10(KL)), this is used to filter predicted events given the `--sd` cutoff (default=-0.40).
Noise	the fraction of the event read count estimated to be noise
KmerGroup	the group of the k-mers associated with this binding event, only the most significant k-mer is shown, the n/n values are the total number of sequence hits of the k-mer group in the positive and negative training sequences (by default total 5000 of each), respectively
KG_hgp	log10(hypergeometric p-value), the significance of enrichment of this k-mer group in the positive vs negative training sequences (by default total 5000 of each), it is the hypergeometric p-value computed using the pos/neg hit counts and total counts
Strand	the sequence strand that contains the k-mer group match, the orientation of the motif is determined during the GEM motif discovery, '*' represents that no k-mer is found to be associated with this event
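
As an illustration of how these fields might be consumed downstream, here is a minimal Python sketch that reads a tab-delimited event table and keeps events above a -log10(q-value) threshold. The column names "Location", "Fold", and "Q_-lg10" are taken from the Field column above and are assumed to match the header of the actual file; check them against your own output. The sample rows are made up.

```python
import csv
import io

def load_events(text, min_q_neglog10=2.0):
    """Parse events and keep those with Q_-lg10 >= min_q_neglog10."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    events = []
    for row in reader:
        # Location is assumed to start with "chr:coord".
        chrom, coord = row["Location"].split(":")[:2]
        q_neglog10 = float(row["Q_-lg10"])
        if q_neglog10 >= min_q_neglog10:
            events.append({
                "chrom": chrom,
                "coord": int(coord),
                "fold": float(row["Fold"]),
                "q_value": 10 ** -q_neglog10,  # undo the -log10 transform
            })
    return events

# Hypothetical two-event table with only the columns used here.
sample_events_text = (
    "Location\tFold\tQ_-lg10\n"
    "1:3498832\t12.5\t8.2\n"
    "2:100\t3.1\t1.5\n"
)
significant = load_events(sample_events_text, min_q_neglog10=2.0)
```

A threshold of 2.0 corresponds to a q-value of 0.01, which is the default of the `--q` option.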

The KSM file is a tab-delimited file (xxx_n_KSM.txt) with the following fields:

First header line, e.g. "#5000/5000", shows the number of positive/negative sequences that were used for learning the motif.
Second header line, e.g. "#3.01", shows the KSM motif score cutoff, optimized to give the best motif enrichment in the training sequences.

Field	Description
k-mer/r.c.	The k-mer sequence and its reverse complement
Cluster	Cluster ID of the k-mer set (Cluster 0 is the primary motif, i.e. the most significant motif)
Offset	The offset of this k-mer from the seed k-mer
PosCt	Number of positive sequences that contain this k-mer
wPosCt	Weighted PosCt, calculated using the relative sequence weighting (for GEM, this is natural logarithm of the binding event strength, i.e. read count)
NegCt	Number of negative sequences that contain this k-mer
HGP_10	HyperGeometric P-value (log10)
(no name)	The IDs of positive sequences that contain this k-mer
(no name)	The IDs of negative sequences that contain this k-mer
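
The layout above can be read with a short parser. The Python sketch below is an assumption-laden illustration: it takes the first "#n/n" line as the sequence totals and the first numeric "#" line as the score cutoff, ignores any other header lines, and assumes the data columns appear in the order listed in the table. The k-mer row in the sample is invented.

```python
import re

def parse_ksm(text):
    """Return (pos_total, neg_total, cutoff, kmers) from KSM file text."""
    pos_total = neg_total = cutoff = None
    kmers = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            body = line[1:].strip()
            match = re.fullmatch(r"(\d+)/(\d+)", body)
            if match and pos_total is None:
                pos_total, neg_total = int(match.group(1)), int(match.group(2))
            elif cutoff is None:
                try:
                    cutoff = float(body)
                except ValueError:
                    pass  # some other header line; ignored in this sketch
            continue
        fields = line.split("\t")
        # Skip rows that do not look like data (e.g. a column-name row).
        if len(fields) < 7 or not fields[1].lstrip("-").isdigit():
            continue
        kmer, _, revcomp = fields[0].partition("/")
        kmers.append({
            "kmer": kmer,
            "revcomp": revcomp,
            "cluster": int(fields[1]),
            "offset": int(fields[2]),
            "pos_ct": int(fields[3]),
            "w_pos_ct": float(fields[4]),
            "neg_ct": int(fields[5]),
            "hgp_10": float(fields[6]),
        })
    return pos_total, neg_total, cutoff, kmers

sample_ksm_text = (
    "#5000/5000\n"
    "#3.01\n"
    "CCACTAGA/TCTAGTGG\t0\t0\t1200\t950.5\t40\t-310.2\n"
)
ksm_pos, ksm_neg, ksm_cutoff, ksm_kmers = parse_ksm(sample_ksm_text)
```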

Examples:

This data can be used to test GEM. It comes from a Ng lab publication (PMID: 18555785) and consists of Bowtie alignments of mouse ES cell CTCF ChIP-seq and GFP control reads.

Once everything is unpacked, use the following command:
java -Xmx10G -jar gem.jar --d Read_Distribution_default.txt --g mm8.chrom.sizes --genome your_path/mm8 --s 2000000000 --expt SRX000540_mES_CTCF.bed --ctrl SRX000543_mES_GFP.bed --f BED --out mouseCTCF --k_min 6 --k_max 13

Note the double dashes (--) for GEM parameters.

For ChIP-exo data, use ChIP-exo read distribution --d Read_Distribution_ChIP-exo.txt, add one more option --smooth 3 to estimate the read distribution without too much smoothing. Depending on the quality of the data, you may want to turn off the read filtering by option --nrf.
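
When running GEM on many datasets, it can help to assemble the command line in a script. The Python sketch below builds an argument list equivalent to the example command above, using only options documented in this README; the helper name and its keyword arguments are hypothetical, and the list could be passed to `subprocess.run` once the files are in place.

```python
import shlex

def build_gem_command(jar, read_dist, expt, out, ctrl=None, chrom_sizes=None,
                      genome=None, k_min=None, k_max=None, fmt="BED",
                      memory="10G"):
    """Assemble a GEM command line as a list of arguments."""
    cmd = ["java", f"-Xmx{memory}", "-jar", jar, "--d", read_dist]
    if chrom_sizes:
        cmd += ["--g", chrom_sizes]
    if genome:
        cmd += ["--genome", genome]
    cmd += ["--expt", expt]
    if ctrl:
        cmd += ["--ctrl", ctrl]
    cmd += ["--f", fmt, "--out", out]
    # Giving a k range (together with --genome) activates motif discovery.
    if k_min is not None and k_max is not None:
        cmd += ["--k_min", str(k_min), "--k_max", str(k_max)]
    return cmd

gem_command = build_gem_command(
    jar="gem.jar",
    read_dist="Read_Distribution_default.txt",
    expt="SRX000540_mES_CTCF.bed",
    ctrl="SRX000543_mES_GFP.bed",
    chrom_sizes="mm8.chrom.sizes",
    genome="your_path/mm8",
    out="mouseCTCF",
    k_min=6,
    k_max=13,
)
print(shlex.join(gem_command))
```

Building the command as a list avoids quoting problems and makes it easy to verify that every GEM option starts with a double dash.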

Command-line options:

The command line parameters are in the format of --flag/name pairs. Note the double dashes (--); GEM does not accept single dash parameters.

Some parameters are required:

Required parameters	Detailed information
`--d [path]`	The path to the read distribution model file
`--exptX [path]`	The path to the aligned reads file for experiment (IP). X is condition name. In multi-condition alignment mode, X is used to specify different conditions.

Some parameters (those relevant to motif discovery are in bold) are optional:

Optional parameters	Detailed information
`--ctrlX [path]`	The path to the aligned reads file for control. X should match the condition name in `--exptX`.
`--g [path]`	The path to a genome information file (genome.chrom.sizes file). The file contains tab-delimited chromosome name/length pairs. Highly recommended, although not required. If it is not supplied, GEM will use the maximum value of read coordinate as the chromosome length.
`--f [BED|SAM|BOWTIE|ELAND|NOVO]`	Read file format: BED/SAM/BOWTIE/ELAND/NOVO. The SAM option allows SAM or BAM file input. (default = BED)
`--s [n]`	The size of uniquely mappable genome. It depends on the genome and the read length. A good estimate is genome size * 0.8. If it is not supplied, it will be estimated using the genome information.
`--genome [path]`	the path to the genome sequence directory, which contains fasta files by chromosomes
`--k [n]`	the width of the k-mers
`--k_min [n] --k_max [n]`	minimum and maximum value of k
`--seed [k-mer]`	the seed k-mer to jump start k-mer set discovery. Exact k-mer sequence only. The width of the seed k-mer will be used to set k
`--k_seqs [n]`	the number of top ranking events to get sequences for motif discovery (default=5000)
`--pp_nmotifs [n]`	the max number of top ranking motifs to set the motif-based positional prior (default=1)
`--k_win [n]`	the sequence window size around the binding event for motif discovery (default=61bp)
`--strand_type [n]`	Double-strand or single-strand binding event calling and motif discovery, 0 for double-strand, 1 for single-strand (default=0)
`--nd [n]`	noise distribution model, 0 for no noise model, 1 for uniform noise model (default=1)
`--fold [value]`	Fold (IP/Control) cutoff to filter predicted events (default=3)
`--icr [value]`	IP/Control Ratio. By default, this ratio is estimated from the data using non-specific binding regions. It is important to set this value explicitly for synthetic dataset.
`--out [name]`	Output folder and file name prefix
`--q [value]`	significance level for q-value, specified as -log10(q-value). For example, to enforce a q-value threshold of 0.001, set this value to 3. (default=2, i.e. q-value=0.01)
`--a [value]`	minimum alpha value for sparse prior (default is estimated by mean whole genome read coverage)
`--af [value]`	a constant to scale alpha value with read count (default=3). A smaller af value will give a larger alpha value.
`--sd [value]`	Shape deviation cutoff to filter predicted events (default=-0.40).
`--smooth [n]`	The width (bp) to smooth the read distribution. If it is set to -1, there will be no smoothing (default=30).
`--subs [region strings]`	Subset of genome regions to be analyzed. The string can be in the format of "chr:start-end" or "chr", or both. For example, "1:003-1004 2 X".
`--subf [region file]`	File that contains subset of genome regions to be analyzed. Each line contains a region in "chr:start-end" format.
`--ex [region file]`	File that contains subset of genome regions to be excluded. Each line contains a region in "chr:start-end" format.
`--t [n]`	Number of threads to run GEM in parallel. It is suggested to be equal to or less than the physical CPU number on the computer. (default: physical CPU number)
`--top [n]`	Number of top ranked GEM events to be used for re-estimating the read distribution (default=2000). Note that GEM only re-estimates when there are more than 500 significant events called.
`--w2 [n]`	Size of sliding window to estimate lambda parameter for Poisson distribution when there is no control data (default=5,000, must be larger than 1000).
`--w3 [n]`	Size of sliding window to estimate lambda parameter for Poisson distribution when there is no control data (default=10,000, must be larger than w2).
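
For the `--s` option, the rule of thumb above (genome size * 0.8) can be computed from the same chrom.sizes file passed to `--g`. The Python sketch below shows the arithmetic; the function name is hypothetical and the chromosome lengths are toy values, not a real assembly.

```python
def estimate_mappable_size(chrom_sizes_text, mappable_fraction=0.8):
    """Sum chromosome lengths and scale by the assumed mappable fraction."""
    total = 0
    for line in chrom_sizes_text.splitlines():
        if not line.strip():
            continue
        _name, length = line.split("\t")[:2]
        total += int(length)
    return round(total * mappable_fraction)

# Toy chrom.sizes content: tab-delimited name/length pairs.
toy_chrom_sizes = "chr1\t1000\nchr2\t500\n"
toy_mappable = estimate_mappable_size(toy_chrom_sizes)
```

The fraction of 0.8 is only the rough estimate suggested in the table; the true mappable size depends on the genome and the read length.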

Optional flags:

Optional flags	Detailed information
`--k_neg_dinu_shuffle`	Use di-nucleotide shuffled sequences as negative sequences for motif finding
`--bp`	use Branch-seq data specific settings
`--pp_pwm`	Use PWM motif to set the motif-based positional prior (default is to use the KSM motif model)
`--bf`	Deprecated. Base filtering is done by Poisson filtering by default.
`--fa`	GEM will use a fixed user-specified alpha value for all the regions
`--help`	Print help information and exit
`--multi`	Deprecated. To run GEM in multi-condition mode, you only need to specify data for conditions X and Y using --exptX and --exptY, etc.
`--nf`	Do not filter predicted events (default is to perform event filtering by shape and fold)
`--nrf`	Do not filter duplicate reads (i.e. potential PCR duplicates) (default is to apply filtering considering the read counts at its neighboring bases)
`--outNP`	Output binding events in ENCODE NarrowPeak format (default is NO narrowPeak file output)
`--outBED`	Output binding events in BED format for UCSC Genome Browser (default is NO BED file output)
`--outJASPAR`	Output motif PFM in JASPAR format
`--outMEME`	Output motif PFM in MEME format
`--outHOMER`	Output motif PFM in HOMER format
`--sl`	Sort GEM output by location (default is sorted by P-value)

Q&A

Which round of result should I use?
Because of the read distribution re-estimation, GEM may output event prediction and read distribution files for multiple rounds. The round numbers are coded in the file name. For example,

xxx_0_Read_distribution.txt: The input read distribution (specified by --d).
xxx_0_GEM_events.txt: GPS Events used to re-estimate the read distribution of current dataset.
xxx_1_Read_distribution.txt: The read distribution estimated from xxx_0_GEM_events.
xxx_1_GEM_events.txt: GPS Events predicted using xxx_1_Read_distribution.
xxx_1_KSM/PFMs.txt: Motifs discovered using GPS events.
xxx_2_Read_distribution.txt: The read distribution estimated from GPS events.
xxx_2_GEM_events.txt: GEM Events predicted using xxx_2_Read_distribution and xxx_1_KSM/PFMs motifs.
xxx_2_KSM/PFMs.txt: Motifs discovered using GEM events.

Due to the large variability of datasets, the refined empirical spatial distribution may become too noisy when too few events are used to estimate it. Running GEM further may make the empirical distribution even worse. Therefore, GEM saves the output files from each round and allows the user to check the spatial distributions to decide which one is best. To facilitate this process, GEM outputs an image that plots all the read distribution curves (xxx_All_Read_Distributions.png). Ideally, the read distributions of later rounds should be smooth and similar to that of round 0. If so, the user can use the event predictions from these rounds. If not, we suggest using the round 0 prediction results.
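
When scripting over many GEM runs, the round number can be recovered from the file names listed above. The Python sketch below groups output file names by round; the regular expression is an assumption based on the naming pattern shown in this section (prefix, round number, file kind), and the file list is a made-up example.

```python
import re
from collections import defaultdict

# Pattern assumed from names such as xxx_1_GEM_events.txt in this README.
ROUND_FILE = re.compile(
    r"^(?P<prefix>.+)_(?P<round>\d+)_"
    r"(?P<kind>Read_distribution|GEM_events|KSM|PFMs?)\.txt$"
)

def group_by_round(filenames):
    """Map round number -> sorted list of file kinds found for that round."""
    rounds = defaultdict(list)
    for name in filenames:
        match = ROUND_FILE.match(name)
        if match:
            rounds[int(match.group("round"))].append(match.group("kind"))
    return {r: sorted(kinds) for r, kinds in rounds.items()}

example_files = [
    "mouseCTCF_0_Read_distribution.txt",
    "mouseCTCF_0_GEM_events.txt",
    "mouseCTCF_1_GEM_events.txt",
    "mouseCTCF_2_GEM_events.txt",
    "mouseCTCF_2_KSM.txt",
    "mouseCTCF_All_Read_Distributions.png",
]
files_by_round = group_by_round(example_files)
```

Files that do not carry a round number, such as the summary PNG, are simply skipped.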

Some best practices?

For each GEM result set, create a folder with the same name as --out, then run GEM within this folder. This will be useful if you later need to write scripts to process multiple GEM results.
After each GEM run, check the PNG image file that plots all the read distribution curves (xxx_All_Read_Distributions.png). If the estimated curves are not smooth, you may have too few binding events to estimate from.
If you get something unexpected (or an error), check the command line options: make sure that each option starts with "--" and that there is a space between each option and its value.

Missed some events?
Sometimes an event may be found by GEM but not reported in the GEM_events.txt file. It may be reported in the _GEM_insignificant.txt file if it does not pass the statistical test, or in the _GEM_filtered.txt file if the shape of the binding event is too far from the empirical read distribution. You can add the --nf flag to turn off filtering.

Can GEM/GPS process paired-end data?
Yes. As of version 2.4, GEM/GPS can process paired-end SAM/BAM format data (using the same --f SAM option) by treating each mate-pair as two single-end reads. This works well, but it is not optimal. We are developing a new version that explicitly models the paired-end data, which should give more accurate results.

What if GEM finds the wrong motif?
GEM's binding call accuracy depends on finding the correct primary motif. Sometimes a co-factor motif may be more statistically significant in the data, and it is subsequently used to direct the binding calls. There are several alternative options to try:

If you know the consensus motif of the TF, use the --seed option to set a starting k-mer for the motif discovery process.
You may want to try different k values, or a different sequence window size (--k_win).
Try a different negative sequence set. For example, the --k_neg_dinu_shuffle option will use di-nucleotide shuffled sequences as the negative set, instead of the default set taken from 300bp away from binding sites. This may be useful for a TF that binds in CG-rich promoter regions (e.g. SP1).

Multi-condition vs. multi-replicate
GEM can analyze binding data from multiple conditions (time points) simultaneously. The user needs to give them different names, for example, --exptCond1 CTCF_cond1.bed --exptCond2 CTCF_cond2.bed.

For multiple replicates of the same condition, you can specify the replicates as separate files, for example, --exptCond1 CTCF_cond1_rep1.bed --exptCond1 CTCF_cond1_rep2.bed (note that they need to have the same name). GEM will combine the replicates into one large dataset for analysis.

Read filtering and event filtering
PCR amplification artifacts typically manifest as the observation of many reads mapping to the exact same base positions. These artifacts are quite variable and dataset-specific. Therefore, a generic approach to exclude those regions might result in the loss of true events.

GEM implements an event filtering method that compares the read distribution of the predicted event to the expected event read distribution. A shape deviation score (IPvsEMP field) is computed using Kullback–Leibler divergence (see method section 2.6 of GPS paper). A higher score means the event is more divergent from the expected read distribution, and hence more likely to be an artifact or noise. A cutoff score can be specified by the user to filter out spurious events using option (--sd). GEM also excludes events with less than 3-fold enrichment (IP/Control). GEM reports the filtered events, which allows the user to verify and adjust the cutoff threshold for a particular dataset. The shape deviation filter is on by default, but can be turned off using option (--nf).
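
To make the shape deviation score concrete, the following Python sketch computes log10 of a Kullback–Leibler divergence between observed read counts around an event and an expected density over the same positions. It is an illustration only, not GEM's code: the direction of the divergence (observed relative to expected), the natural-log base, and the small floor on the expected density are assumptions of this sketch.

```python
import math

def shape_deviation(observed_counts, expected_density, floor=1e-6):
    """Return log10(KL(observed || expected)) over aligned positions."""
    total = sum(observed_counts)
    if total == 0:
        raise ValueError("no reads observed")
    kl = 0.0
    for count, q in zip(observed_counts, expected_density):
        p = count / total
        if p > 0:
            # The floor on q avoids division by zero (assumption of this sketch).
            kl += p * math.log(p / max(q, floor))
    return math.log10(kl) if kl > 0 else float("-inf")

uniform = [0.25, 0.25, 0.25, 0.25]
matching_score = shape_deviation([1, 1, 1, 1], uniform)  # identical shapes
spike_score = shape_deviation([4, 0, 0, 0], uniform)     # all reads on one base
```

Under these assumptions, an event whose reads follow the expected shape scores very low, while a single-base pile-up scores above the default `--sd` cutoff of -0.40 and would be filtered.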

In addition, GEM applies a Poisson filter for abnormally high read counts at a single base position. For each base, we obtain an average read count by estimating a Gaussian kernel density (with std=20bp) on the read counts of nearby base positions (excluding the base of interest). The estimated value is used to set the Lambda parameter of a Poisson distribution. If the actual read count is larger than the count corresponding to p-value=0.001 under that distribution, it is reduced to that count.
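
The per-base Poisson filter can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than GEM's implementation: the kernel is truncated at 60bp on each side, the cap is taken as the smallest count k with P(X > k) <= 0.001, and the cap is never allowed below 1 so that isolated reads survive.

```python
import math

def gaussian_lambda(counts, index, std=20.0, half_width=60):
    """Gaussian-weighted mean of neighboring counts, excluding counts[index]."""
    num = den = 0.0
    lo = max(0, index - half_width)
    hi = min(len(counts), index + half_width + 1)
    for j in range(lo, hi):
        if j == index:
            continue
        weight = math.exp(-0.5 * ((j - index) / std) ** 2)
        num += weight * counts[j]
        den += weight
    return num / den if den else 0.0

def poisson_cap(lam, p_value=0.001):
    """Smallest k with P(X > k) <= p_value for X ~ Poisson(lam)."""
    if lam > 700:
        raise ValueError("sketch is intended for small lambda values")
    k = 0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    while 1.0 - cdf > p_value:
        k += 1
        term *= lam / k    # P(X = k) from P(X = k - 1)
        cdf += term
    return k

def filter_counts(counts):
    """Cap each per-base count at its local Poisson threshold."""
    capped = []
    for i, count in enumerate(counts):
        cap = max(1, poisson_cap(gaussian_lambda(counts, i)))
        capped.append(min(count, cap))
    return capped

# Toy profile: one read per base with a 40-read pile-up in the middle.
toy_counts = [1] * 50 + [40] + [1] * 50
toy_filtered = filter_counts(toy_counts)
```

In this toy profile only the pile-up is reduced; the surrounding bases keep their single reads.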

Contact

Contact Yuchun Guo (yguo at mit dot edu) with any problems, comments, or suggestions.

Sign up for GPS mailing list to receive emails related to GEM/GPS updates, release, etc.

This software is for research use only.
