CTK: Standard/BrdU-CLIP data analysis

This vignette shows how to execute the CTK “Standard/BrdU-CLIP_data_analysis_using_CTK” vignette using CLIPflexR, from within R.

Download Data

Download data from SRA. SRR110753 was selected as an example in this vignette.

require(SRAdb)
mySRA <- "SRAmetadb.sqlite"
if(!file.exists(mySRA)){
  mySRA <- getSRAdbFile()
}
sra_con <- dbConnect(dbDriver("SQLite"), mySRA )

Fox3_Std <- "SRR1107535.fastq.gz"
if(!file.exists(Fox3_Std)){
  getFASTQfile("SRR1107535",
               sra_con = sra_con,
               destDir = getwd())
}

Process data

Quality filter reads (wrapping CTK’s fastq_filter.pl)

Filtering low quality read using R-wrapped fastq_filter.pl from CTK.
Use fastq/gzipped fast files as input. outFile: fastq file after filtering low quality reads; qsFilter: filtering criteria, “mean:0-29:20” means the average quality of the first 30 bases must be equal or greater than 20.
CLIPflexR documentation.

Fox3_Std <- "SRR1107535.fastq.gz"
Fox3_Std_filtered <- ctk_fastqFilter(Fox3_Std,
                                     outFile = "SRR1107535_FF.fastq",
                                     qsFilter="mean:0-29:20")

Remove 3’ linkers (wrapping FASTX’s fastx_clipper)

Trimming 3’-end adapter sequences using R-wrapped fastx_clipper function from FASTX_toolkit.
adaptor: CLIP this sequence from the 3’-end of reads, “GTGTCAGTCACTTCCAGCGG” (default); length: minimum read length, only keep reads with length equal or greater then this threshold, 18 (default).
CLIPflexR documentation.

Fox3_Std_filtered <- fastx_clipper(Fox3_Std_filtered,outFile = "SRR1107535_clipped.fastq",adaptor = "GTGTCAGTCACTTCCAGCGG",length =20)

Trim low quality nucleotides and short reads (wrapping FASTX’s fastq_quality_trimmer)

Discard reads with low quality bases and and short reads using R-wrapped fastq_quality_trimmer function from FASTX-toolkit. outfile: file name and path of the fastq files after filtering; qualityThreshold: the minimum quality threshold. CLIPflexR documentation

Fox3_Std_filtered <- fastq_quality_trimmer(Fox3_Std_filtered,
                                           outFile = "SRR1107535_clipFiltered.fastq.gz",qualityThreshold = 5,minimumLength = 20 )

Collapse exact ctk_fastq2collapse (wrapping CTKs’s ctk_fastq2collapse.pl)

Collapse duplicate sequences into unique reads using R-wrapped fastq2collapse.pl from CTK.
CLIPflexR documentation.

Fox3_Std_filtered <- ctk_fastq2collapse(Fox3_Std_filtered,
                                           outFile = "SRR1107535_collapsed.fastq.gz")

Remove random barcodes (wrapping CTK’s stripBarcode.pl)

Strip random barcodes with specific length from the reads using R-wrapped stripBarcode.pl from CTK.
linkerlength: barcode length; inputFormat: format of input sequences, fasta (default) or fastq.
CLIPflexR documentation.

Fox3_Std_filtered <- ctk_stripBarcode(Fox3_Std_filtered,linkerlength = 5,
                                      inputFormat = "fasta",
                                           outFile = "SRR1107535_stripped.fastq.gz")

Map processed reads to genome

Make reference data for mm10

Fetch reference sequences from BSgenome.

require(BSgenome.Mmusculus.UCSC.mm10)
require(magrittr)
listOfSeq <- lapply(seqnames(BSgenome.Mmusculus.UCSC.mm10),function(x)BSgenome.Mmusculus.UCSC.mm10[[x]])
names(listOfSeq) <- seqnames(BSgenome.Mmusculus.UCSC.mm10)
toWrite <- DNAStringSet(listOfSeq[!grepl("Un|random|hap",names(listOfSeq))])
writeXStringSet(toWrite,file="mm10.fa")

Build a Bowtie2 genome index using Rbowtie2

Build Bowtie2 index of reference genome, mm10 in this vignette.
threads: number of threads used. Index documentation.

bowtie2_index("mm10.fa",threads = 6)

Align to genome using Rbowtie2

Align reads to indexed reference genome using Rbowtie2.
format: format of input sequences, fasta (default) or fastq; threads: number of threads used, 1 (default); maxMismatches: number of mismatches allowed, 1 (default) or 0; seedSubString: length of seed substrings, 18 (default), must be >3 & <32; report_k: NULL, look for multiple alignments, report best (default), otherwise report up to n (integer) alignments per read.
Alignment documentation.

Fox3_Std_filtered <- decompress(Fox3_Std_filtered)
alignedFile <- bowtie_align(Fox3_Std_filtered,
              "mm10",
              gsub("\\.fastq",".bam",Fox3_Std_filtered),
              format="fastq",
              threads = 6)

Parsing SAM files

Parsing SAM files to get unique mapped reads and mismatches per read using R-wrapped parseAlignment.pl from CTK.
minLen: miminal mapping length to report, 18 (default); mapQual: Minimum map quality, 1 (default MAPQ), keep only uniquely mapped reads when mapQual>=1.
CLIPflexR documentation.

parsedAlignement <- ctk_parseAlignment(alignedFile,
                                       minLen = 18,
                                       mapQual = 1,
                                       mutationFile = "mutations.txt")

Collapse duplicated mapped reads

Collapse CLIP tags generated by parseAlignment using R-wrapped tag2collapse.pl from CTK.
bigFile: set TRUE if input file is big; keepMaxScore: keep the tag with the most weight (instead of the longest one) as representative; keepTagName: do not change tag name (no extra information); weight: consider the weight of each tag; weightInName: find weight in name; randomBarcode: random barcode exists, no collapse for different barcodes; em: EM threshold to infer reliability of each collapsed read (when have random linker, -1 = no EM); outputSeqError: output sequencing errors estimated by the EM algorithm.
CLIPflexR documentation.

# -v -big --random-barcode -EM 30 --seq-error-model alignment -weight --weight-in-name --keep-max-score --keep-tag-name $f.tag.bed
tagcollapsed <- ctk_tag2collapse(parsedAlignement,
                                     keepMaxScore = TRUE,
                                     keepTagName = TRUE,
                                     weight = TRUE,
                                     bigFile = TRUE,
                                     weightInName = TRUE,
                                     randomBarcode = TRUE,
                                     em = 30,
                                     outputSeqError = TRUE
                                     )

Call peaks & visualize

Join tags and mutations

Get the mutations in unique tags using R-wrapped selectRow.pl from CTK (originally wrapped from Galaxy).
file1: mutation file; file2: tagged bed file; field1: query column; field2: filter column; mode: running mode, set N while only print paired rows.
CLIPflexR documentation.

tagJoined <- ctk_joinWrapper("mutations.txt",tagcollapsed,field1 = 4,field2 = 4,mode = "n",
                                outFile="Fox3_3_tag.unique.mutation.txt")

Label bed file with rgb information

Add colors to bed file using R-wrapped bed2rgb.pl from CTK.
filesToRun: taged bed files; col: assign color in r,g,b format.
CLIPflexR documentation.

tagcollapsedRGB <- ctk_bed2rgb(filesToRun = tagcollapsed,col = "128,0,0")

Generate bedgraph

Generate bedgraphs for visualization using R-wrapped tag2profile.pl from CTK.
bigFile: set TRUE if input file is big; ss: separate strand, TRUE (default); exact: exact count of each nucleotide, TRUE (default).
CLIPflexR documentation.

# -v -big --random-barcode -EM 30 --seq-error-model alignment -weight --weight-in-name --keep-max-score --keep-tag-name $f.tag.bed
tagProfile <- ctk_tag2profile(filesToRun = tagcollapsedRGB,bigFile = TRUE,ss = TRUE,exact = TRUE,outputFormat = "bedgraph")

Peak calling without statistical assessment

Call peaks using R-wrapped tag2peak.pl from CTK without statistical assessment. filesToRun: RGB labeled tagged bed files; valleySeeking: find candidate peaks by valley seeking, TRUE (default); valleyDepth: depth of valley, if valley seeking, 0.9 (default), between 0.5 and 1; bigFile: set TRUE if the input file is big; ss: separate strand, TRUE (default).
CLIPflexR documentation.

tagPeaksNoStats <- ctk_tag2peak(filesToRun = tagcollapsedRGB,valleySeeking = TRUE,valleyDepth = 0.9,bigFile = TRUE,ss = TRUE)

Peak calling with statistical assessment

Call peaks using R-wrapped tag2peak.pl from CTK with statistical assessment. filesToRun: RGB labeled tagged bed files; valleySeeking: find candidate peaks by valley seeking, TRUE (default);valleyDepth: depth of valley, if valley seeking, 0.9 (default), between 0.5 and 1; bigFile: set TRUE if the input file is big; ss: separate strand, TRUE (default); genes: gene locus in bed format.
CLIPflexR documentation.

require(TxDb.Mmusculus.UCSC.mm10.knownGene)
export.bed(genes(TxDb.Mmusculus.UCSC.mm10.knownGene),"genes.bed")
tagProfile <- ctk_tag2peak(filesToRun = tagcollapsedRGB,
                           outFile = "TC_SRR1107535_stripped.RGB.peakWithStats.bed",
                           outBoundary ="TC_SRR1107535_stripped.RGB.peakBoundary.bed",
                           outHalfPH = "TC_SRR1107535_stripped.RGB.peakHalf.bed",
                           valleySeeking = TRUE,valleyDepth = 0.9,bigFile = TRUE,ss = TRUE,genes = "genes.bed")

Use CLIPflexR to get peak sequences or search for motifs in peaks

Retrieve sequence under peaks using CLIPflexR fetchSequencesForClIP

peaks: tagged bed file or GRanges of peaks; reSize: resize peak to this width around peak center, 64 (default); fasta: reference sequence.
CLIPflexR documentation.

myseq <- fetchSequencesForClIP(peaks = tagProfile,reSize = 20,fasta = "mm10.fa",bedHeader = FALSE)

Retrieve peaks associated with particular motifs using CLIPflexR annotatePeaksWithPatterns

peaks: tagged bed file or GRanges of peaks; reSize: resize peak to this width around peak center, 64 (default); fasta: reference sequence; pattern: FASTA of motif sequence(s).
CLIPflexR documentation.

myseq <- annotatePeaksWithPatterns(peaks = tagProfile,patterns = "TGCATG",resize = 100,fasta = "mm10.fa",bedHeader=FALSE)

Kathryn Rozen-Gagnon, Ji-Dung Luo and Thomas Carroll