Virtual Cell Program

Code and data for the linear epigenome papers

The epigenome and regulatory proteins are highly redundant and linearly related | The dynamic linear epigenome

The epigenome and regulatory proteins are highly redundant and linearly related

Code for processing BedGraph and (Big)Wig files (all code files are available under the GNU General Public License (GPL); to download all of them, use this link):

README.pdf shows in detail how to process BedGraph and (Big)Wig files with the following Python scripts, and how to use the R script to normalize the Python script outputs, delete scores for genes that overlap with a blacklist, and paste the data together.

GTF_processing.py processes a GTF file (e.g., from GENCODE), extracts the relevant information (gene ID, transcript ID, gene type, start/end position of the transcript and strand) and stores the result in a CSV file.
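As an illustration of this kind of extraction (not the code in GTF_processing.py itself), the following minimal Python sketch reads the nine tab-separated GTF columns, keeps the "transcript" records and writes the fields listed above to CSV. The function names, the output column order and the toy identifiers (GENE1, TX1) are assumptions made for this example; gene_id, transcript_id and gene_type are the attribute keys used in GENCODE GTF files.

```python
import csv
import io
import re

# Matches key "value" pairs in the ninth (attribute) column of a GTF line.
ATTRIBUTE_RE = re.compile(r'(\S+) "([^"]*)"')

# Output column order chosen for this sketch (an assumption, not the script's).
FIELDNAMES = ["gene_id", "transcript_id", "gene_type",
              "chrom", "start", "end", "strand"]


def parse_gtf_transcripts(lines):
    """Yield one dict per 'transcript' record of a GTF file."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and empty lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "transcript":
            continue  # keep transcript-level records only
        attributes = dict(ATTRIBUTE_RE.findall(fields[8]))
        yield {
            "gene_id": attributes.get("gene_id", ""),
            "transcript_id": attributes.get("transcript_id", ""),
            "gene_type": attributes.get("gene_type", ""),
            "chrom": fields[0],
            "start": int(fields[3]),  # GTF coordinates are 1-based, inclusive
            "end": int(fields[4]),
            "strand": fields[6],
        }


def write_transcripts_csv(records, handle):
    """Write the parsed records to an open text handle as CSV."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(records)


# Toy input with made-up identifiers, for illustration only.
TOY_GTF = [
    "##description: toy example\n",
    'chr1\ttoy\tgene\t100\t900\t.\t+\t.\tgene_id "GENE1"; gene_type "protein_coding";\n',
    'chr1\ttoy\ttranscript\t100\t900\t.\t+\t.\tgene_id "GENE1"; transcript_id "TX1"; gene_type "protein_coding";\n',
]

buffer = io.StringIO()
write_transcripts_csv(parse_gtf_transcripts(TOY_GTF), buffer)
print(buffer.getvalue())
```

For a real file, the same functions can be applied to an open file handle instead of the toy list.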

counting_the_number_of_cpgs.py (optional; relevant when whole-genome bisulfite sequencing DNA methylation data are considered) processes a human genome FASTA file and counts the number of CpGs per base pair around transcription start sites (TSSs), in transcripts and around transcription termination sites (TTSs), for various bin resolutions and for protein coding genes, lincRNA genes, etc. The results are stored as CSV files. A FASTA file can be obtained, e.g., by downloading the hg19.2bit file from here and converting it with twoBitToFa (see here for downloading twoBitToFa), using something like twoBitToFa hg19.2bit hg19.fa.
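The quantity counted here can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the script's implementation: the function names, the symmetric window around the TSS and the 0-based, half-open coordinates are choices made for this example, and a CpG that straddles the window boundary is not counted.

```python
def cpgs_per_bp(sequence, start, end):
    """CpG dinucleotides per base pair in the 0-based, half-open window [start, end)."""
    window = sequence[start:end].upper()  # FASTA files may contain soft-masked lowercase bases
    if not window:
        return 0.0
    # "CG" cannot overlap with itself, so str.count finds every occurrence.
    return window.count("CG") / len(window)


def tss_window(tss, flank, sequence_length):
    """Window of `flank` bases on each side of a TSS, clipped to the sequence."""
    return max(0, tss - flank), min(sequence_length, tss + flank)


# Toy sequence, for illustration only.
toy_sequence = "ACGCGTTCGA"
start, end = tss_window(tss=4, flank=2, sequence_length=len(toy_sequence))
print(cpgs_per_bp(toy_sequence, start, end))
```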

BEDGRAPH-processing.py and WIG-processing.py take BedGraph or WIG files as input, count the number of tags per base pair around TSSs, in transcripts and around TTSs, for various bin resolutions and for protein coding genes, lincRNA genes, etc., and store the results as CSV files. Since many ENCODE files are BigWig files, one may convert them to BedGraph files with bigWigToBedGraph (see here for downloading this tool).
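One plausible reading of "tags per base pair" for a BedGraph file is the length-weighted average of the signal over a region. The Python sketch below shows this reading and how a region can be split into equally sized bins. It assumes 0-based, half-open BedGraph coordinates; the function names are hypothetical, and the actual scripts may define windows and bins differently (see README.pdf).

```python
def parse_bedgraph(lines):
    """Yield (chrom, start, end, value) tuples from BedGraph lines."""
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip empty lines and header lines
        chrom, start, end, value = line.split()[:4]
        yield chrom, int(start), int(end), float(value)


def signal_per_bp(intervals, chrom, start, end):
    """Length-weighted mean signal over the half-open region [start, end)."""
    total = 0.0
    for i_chrom, i_start, i_end, value in intervals:
        if i_chrom != chrom:
            continue
        overlap = min(i_end, end) - max(i_start, start)
        if overlap > 0:
            total += value * overlap
    return total / (end - start)


def binned_signal(intervals, chrom, start, end, n_bins):
    """Split [start, end) into n_bins bins and return the mean signal per bin.

    The region must span at least n_bins base pairs.
    """
    edges = [start + (end - start) * i // n_bins for i in range(n_bins + 1)]
    return [signal_per_bp(intervals, chrom, lo, hi)
            for lo, hi in zip(edges, edges[1:])]


# Toy BedGraph content, for illustration only.
toy_lines = [
    "track type=bedGraph",
    "chr1\t0\t10\t2.0",
    "chr1\t10\t20\t4.0",
]
toy_intervals = list(parse_bedgraph(toy_lines))  # a list, since it is scanned repeatedly
print(signal_per_bp(toy_intervals, "chr1", 5, 15))
print(binned_signal(toy_intervals, "chr1", 0, 20, n_bins=2))
```

This linear scan over all intervals is only meant to show the arithmetic; for genome-scale files, a sorted or indexed lookup would be needed.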

remove_genes_from_blacklist_and_chromosomes.py (optional; for cases in which some genes should or must not be considered) outputs which protein coding genes, lincRNA genes, etc. overlap with regions that should not be considered. For these regions one may, for instance, take the ENCODE consensus blacklist (see here for further information, and here to download the blacklist file) and chromosomes like chrX, chrY and chrM.
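The filtering described here amounts to an interval-overlap test plus a chromosome check. A minimal Python sketch follows, assuming that genes and blacklist regions use the same half-open coordinate convention; the function names and the toy gene identifiers are hypothetical and do not come from the script.

```python
# Chromosomes excluded in this sketch, following the examples named above.
EXCLUDED_CHROMS = frozenset({"chrX", "chrY", "chrM"})


def overlaps(start_a, end_a, start_b, end_b):
    """True if the half-open intervals [start_a, end_a) and [start_b, end_b) intersect."""
    return start_a < end_b and start_b < end_a


def genes_to_remove(genes, blacklist, excluded_chroms=EXCLUDED_CHROMS):
    """Return IDs of genes on an excluded chromosome or overlapping a blacklisted region.

    genes:     iterable of (gene_id, chrom, start, end)
    blacklist: list of (chrom, start, end)
    """
    removed = []
    for gene_id, chrom, start, end in genes:
        on_excluded_chrom = chrom in excluded_chroms
        hits_blacklist = any(
            b_chrom == chrom and overlaps(start, end, b_start, b_end)
            for b_chrom, b_start, b_end in blacklist
        )
        if on_excluded_chrom or hits_blacklist:
            removed.append(gene_id)
    return removed


# Toy data with made-up identifiers, for illustration only.
toy_genes = [
    ("GENE1", "chr1", 100, 200),  # overlaps the blacklisted region
    ("GENE2", "chr1", 300, 400),  # only touches its end, so no overlap
    ("GENE3", "chrX", 10, 20),    # lies on an excluded chromosome
]
toy_blacklist = [("chr1", 150, 300)]
print(genes_to_remove(toy_genes, toy_blacklist))
```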

functions.R contains R functions to create data frames with normalized scores for each locus of interest and each epigenetic mark (i.e., each input BedGraph or Wig file).

Data files generated for the paper (in RData format):

protein coding genes, 1-bin resolution, 1%-quantile normalization: GM12878, H1, H9, IMR90, K562
lincRNA genes, 1-bin resolution, 1%-quantile normalization: GM12878, H1, H9, IMR90, K562
protein coding genes, 40-bin resolution (just used for predicting CAGE levels), 1%-quantile normalization: GM12878, H1, IMR90, K562
lincRNA genes, 40-bin resolution (just used for predicting CAGE levels), 1%-quantile normalization: GM12878, H1, IMR90, K562
protein coding genes, 1-bin resolution, 5%-quantile normalization: GM12878, H1, H9, IMR90, K562
lincRNA genes, 1-bin resolution, 5%-quantile normalization: GM12878, H1, H9, IMR90, K562

Each RData file consists of a list with three data frames holding the epigenetic mark scores associated with the TSSs, the transcripts/gene bodies and the TTSs.

The CAGE gene expression data can be accessed here (for protein coding genes) and here (for lincRNA genes). Each file consists of a list with four elements, one for each of the four cell lines for which CAGE nucleus data are available (GM12878, H1, IMR90 and K562). Each element is a vector with the CAGE scores for all genes of interest in that particular cell line (as outlined in the methods of the paper).

The dynamic linear epigenome

Code for genome-wide processing of BedGraph(-style) files and downstream analysis (all code files are available under the GNU General Public License (GPL); to download all of them, use this link):

README.pdf shows how to process BedGraph(-style) files, normalize them, paste them together and analyze the epigenomic changes with the following R scripts.

Enrichments_per_bin_processing.R processes BedGraph(-style) files by counting the number of tags falling into each bin of a given genome for specified chromosomes and bin width.
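The script itself is written in R. To illustrate the idea of genome-wide binning, here is a minimal Python sketch that accumulates the signal of BedGraph-style intervals into fixed-width bins, splitting an interval across bins when it crosses a bin boundary. Weighting each interval by its value times the overlap length is an assumption made for this example, as are the function name and the 0-based, half-open coordinates; the script may count tags differently.

```python
from collections import defaultdict


def bin_counts(intervals, bin_width):
    """Accumulate value * overlap length into fixed-width bins.

    intervals: iterable of (chrom, start, end, value) with end > start
    returns:   dict mapping (chrom, bin_index) to the accumulated signal,
               where bin i covers [i * bin_width, (i + 1) * bin_width)
    """
    counts = defaultdict(float)
    for chrom, start, end, value in intervals:
        first_bin = start // bin_width
        last_bin = (end - 1) // bin_width
        for index in range(first_bin, last_bin + 1):
            lo = max(start, index * bin_width)
            hi = min(end, (index + 1) * bin_width)
            counts[(chrom, index)] += value * (hi - lo)
    return dict(counts)


# Toy intervals, for illustration only.
toy_intervals = [
    ("chr1", 0, 100, 1.0),     # entirely inside bin 0
    ("chr1", 900, 1100, 2.0),  # split evenly between bin 0 and bin 1
]
print(bin_counts(toy_intervals, bin_width=1000))
```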

functions.R contains R functions to create data frames with normalized scores for each bin and each epigenetic mark for a given time point and to analyze the changes with systems of linear ordinary differential equations.
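For orientation, a system of linear ordinary differential equations has the form dx/dt = A x(t), with solution x(t) = exp(A t) x(0). The Python sketch below propagates a state vector under a given matrix A using a truncated Taylor series for the matrix exponential. It is a generic illustration only: how the paper sets up and estimates its systems is described in README.pdf and the paper, and the function names, the number of series terms and the toy matrix are assumptions made here.

```python
import math


def mat_mul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def mat_exp(a, terms=30):
    """Matrix exponential via a truncated Taylor series (small matrices only)."""
    n = len(a)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]  # holds A^k / k!, starting at the identity
    for k in range(1, terms):
        term = [[entry / k for entry in row] for row in mat_mul(term, a)]
        result = [[r + t for r, t in zip(r_row, t_row)]
                  for r_row, t_row in zip(result, term)]
    return result


def propagate(a, x0, t):
    """State x(t) = exp(A t) x(0) of the linear system dx/dt = A x."""
    scaled = [[entry * t for entry in row] for row in a]
    exp_at = mat_exp(scaled)
    return [sum(exp_at[i][j] * x0[j] for j in range(len(x0)))
            for i in range(len(x0))]


# Toy system: two uncoupled marks decaying at rates 1 and 2 (made-up values).
toy_a = [[-1.0, 0.0],
         [0.0, -2.0]]
print(propagate(toy_a, [1.0, 1.0], t=1.0))
print([math.exp(-1.0), math.exp(-2.0)])  # closed-form solution for comparison
```

The Taylor series is adequate for small, well-scaled matrices such as this toy case; a numerical library routine would be the safer choice for larger systems.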

Koike_0hours_sample.RData and Koike_4hours_sample.RData contain data frames for a small subset of loci and all epigenetic marks from the paper Koike et al., "Transcriptional architecture and chromatin landscape of the core circadian clock in mammals", Science, 2012, at the time points 0 hours and 4 hours. These files are used in the workflow of the README to illustrate the functions described above.

Data files generated for the paper (in RData format), by study, bin width and time point:

Garber et al., 1000-bp bin width: 0 minutes, 30 minutes, 60 minutes, 120 minutes
Koike et al., 1000-bp bin width: 0 hours, 4 hours, 8 hours, 12 hours, 16 hours, 20 hours
Yu et al., 1000-bp bin width: 0 days, 4 days, 6 days
Garber et al., 4000-bp bin width: 0 minutes, 30 minutes, 60 minutes, 120 minutes
Koike et al., 4000-bp bin width: 0 hours, 4 hours, 8 hours, 12 hours, 16 hours, 20 hours
Yu et al., 4000-bp bin width: 0 days, 4 days, 6 days

Each RData file consists of a data frame with the epigenetic mark scores for all loci at that time point.

 
