vignettes/StandardandBrdU_Processing_CTK.Rmd
StandardandBrdU_Processing_CTK.Rmd
Download data from SRA. SRR110753 was selected as an example in this vignette.
require(SRAdb)
mySRA <- "SRAmetadb.sqlite"
if(!file.exists(mySRA)){
mySRA <- getSRAdbFile()
}
sra_con <- dbConnect(dbDriver("SQLite"), mySRA )
Fox3_Std <- "SRR1107535.fastq.gz"
if(!file.exists(Fox3_Std)){
getFASTQfile("SRR1107535",
sra_con = sra_con,
destDir = getwd())
}
Filtering low quality read using R-wrapped
fastq_filter.pl from CTK.
Use fastq/gzipped fast files as input. outFile: fastq
file after filtering low quality reads; qsFilter:
filtering criteria, “mean:0-29:20” means the average quality of the
first 30 bases must be equal or greater than 20.
CLIPflexR
documentation.
Fox3_Std <- "SRR1107535.fastq.gz"
Fox3_Std_filtered <- ctk_fastqFilter(Fox3_Std,
outFile = "SRR1107535_FF.fastq",
qsFilter="mean:0-29:20")
Trimming 3’-end adapter sequences using R-wrapped
fastx_clipper function from FASTX_toolkit.
adaptor: CLIP this sequence from the 3’-end of reads,
“GTGTCAGTCACTTCCAGCGG” (default); length: minimum read
length, only keep reads with length equal or greater then this
threshold, 18 (default).
CLIPflexR
documentation.
Fox3_Std_filtered <- fastx_clipper(Fox3_Std_filtered,outFile = "SRR1107535_clipped.fastq",adaptor = "GTGTCAGTCACTTCCAGCGG",length =20)
Discard reads with low quality bases and and short reads using R-wrapped fastq_quality_trimmer function from FASTX-toolkit. outfile: file name and path of the fastq files after filtering; qualityThreshold: the minimum quality threshold. CLIPflexR documentation
Fox3_Std_filtered <- fastq_quality_trimmer(Fox3_Std_filtered,
outFile = "SRR1107535_clipFiltered.fastq.gz",qualityThreshold = 5,minimumLength = 20 )
Collapse duplicate sequences into unique reads using R-wrapped
fastq2collapse.pl from CTK.
CLIPflexR
documentation.
Fox3_Std_filtered <- ctk_fastq2collapse(Fox3_Std_filtered,
outFile = "SRR1107535_collapsed.fastq.gz")
Strip random barcodes with specific length from the reads using
R-wrapped stripBarcode.pl from CTK.
linkerlength: barcode length;
inputFormat: format of input sequences, fasta (default)
or fastq.
CLIPflexR
documentation.
Fox3_Std_filtered <- ctk_stripBarcode(Fox3_Std_filtered,linkerlength = 5,
inputFormat = "fasta",
outFile = "SRR1107535_stripped.fastq.gz")
Fetch reference sequences from BSgenome.
require(BSgenome.Mmusculus.UCSC.mm10)
require(magrittr)
listOfSeq <- lapply(seqnames(BSgenome.Mmusculus.UCSC.mm10),function(x)BSgenome.Mmusculus.UCSC.mm10[[x]])
names(listOfSeq) <- seqnames(BSgenome.Mmusculus.UCSC.mm10)
toWrite <- DNAStringSet(listOfSeq[!grepl("Un|random|hap",names(listOfSeq))])
writeXStringSet(toWrite,file="mm10.fa")
Build Bowtie2 index of reference genome, mm10 in this vignette.
threads: number of threads used. Index documentation.
bowtie2_index("mm10.fa",threads = 6)
Align reads to indexed reference genome using Rbowtie2.
format: format of input sequences, fasta (default) or
fastq; threads: number of threads used, 1 (default);
maxMismatches: number of mismatches allowed, 1
(default) or 0; seedSubString: length of seed
substrings, 18 (default), must be >3 & <32;
report_k: NULL, look for multiple alignments, report
best (default), otherwise report up to n (integer) alignments per
read.
Alignment
documentation.
Fox3_Std_filtered <- decompress(Fox3_Std_filtered)
alignedFile <- bowtie_align(Fox3_Std_filtered,
"mm10",
gsub("\\.fastq",".bam",Fox3_Std_filtered),
format="fastq",
threads = 6)
Parsing SAM files to get unique mapped reads and mismatches per read
using R-wrapped parseAlignment.pl from CTK.
minLen: miminal mapping length to report, 18 (default);
mapQual: Minimum map quality, 1 (default MAPQ), keep
only uniquely mapped reads when mapQual>=1.
CLIPflexR
documentation.
parsedAlignement <- ctk_parseAlignment(alignedFile,
minLen = 18,
mapQual = 1,
mutationFile = "mutations.txt")
Collapse CLIP tags generated by parseAlignment using R-wrapped
tag2collapse.pl from CTK.
bigFile: set TRUE if input file is big;
keepMaxScore: keep the tag with the most weight
(instead of the longest one) as representative;
keepTagName: do not change tag name (no extra
information); weight: consider the weight of each tag;
weightInName: find weight in name;
randomBarcode: random barcode exists, no collapse for
different barcodes; em: EM threshold to infer
reliability of each collapsed read (when have random linker, -1 = no
EM); outputSeqError: output sequencing errors estimated
by the EM algorithm.
CLIPflexR
documentation.
# -v -big --random-barcode -EM 30 --seq-error-model alignment -weight --weight-in-name --keep-max-score --keep-tag-name $f.tag.bed
tagcollapsed <- ctk_tag2collapse(parsedAlignement,
keepMaxScore = TRUE,
keepTagName = TRUE,
weight = TRUE,
bigFile = TRUE,
weightInName = TRUE,
randomBarcode = TRUE,
em = 30,
outputSeqError = TRUE
)
Get the mutations in unique tags using R-wrapped
selectRow.pl from CTK
(originally wrapped from Galaxy).
file1: mutation file; file2: tagged
bed file; field1: query column;
field2: filter column; mode: running
mode, set N while only print paired rows.
CLIPflexR
documentation.
tagJoined <- ctk_joinWrapper("mutations.txt",tagcollapsed,field1 = 4,field2 = 4,mode = "n",
outFile="Fox3_3_tag.unique.mutation.txt")
Add colors to bed file using R-wrapped bed2rgb.pl
from CTK.
filesToRun: taged bed files; col:
assign color in r,g,b format.
CLIPflexR documentation.
tagcollapsedRGB <- ctk_bed2rgb(filesToRun = tagcollapsed,col = "128,0,0")
Generate bedgraphs for visualization using R-wrapped
tag2profile.pl from CTK.
bigFile: set TRUE if input file is big;
ss: separate strand, TRUE (default);
exact: exact count of each nucleotide, TRUE
(default).
CLIPflexR
documentation.
# -v -big --random-barcode -EM 30 --seq-error-model alignment -weight --weight-in-name --keep-max-score --keep-tag-name $f.tag.bed
tagProfile <- ctk_tag2profile(filesToRun = tagcollapsedRGB,bigFile = TRUE,ss = TRUE,exact = TRUE,outputFormat = "bedgraph")
Call peaks using R-wrapped tag2peak.pl from CTK
without statistical assessment. filesToRun:
RGB labeled tagged bed files; valleySeeking: find
candidate peaks by valley seeking, TRUE (default);
valleyDepth: depth of valley, if valley seeking, 0.9
(default), between 0.5 and 1; bigFile: set TRUE if the
input file is big; ss: separate strand, TRUE
(default).
CLIPflexR
documentation.
tagPeaksNoStats <- ctk_tag2peak(filesToRun = tagcollapsedRGB,valleySeeking = TRUE,valleyDepth = 0.9,bigFile = TRUE,ss = TRUE)
Call peaks using R-wrapped tag2peak.pl from CTK
with statistical assessment. filesToRun: RGB
labeled tagged bed files; valleySeeking: find candidate
peaks by valley seeking, TRUE (default);valleyDepth:
depth of valley, if valley seeking, 0.9 (default), between 0.5 and 1;
bigFile: set TRUE if the input file is big;
ss: separate strand, TRUE (default);
genes: gene locus in bed format.
CLIPflexR
documentation.
require(TxDb.Mmusculus.UCSC.mm10.knownGene)
export.bed(genes(TxDb.Mmusculus.UCSC.mm10.knownGene),"genes.bed")
tagProfile <- ctk_tag2peak(filesToRun = tagcollapsedRGB,
outFile = "TC_SRR1107535_stripped.RGB.peakWithStats.bed",
outBoundary ="TC_SRR1107535_stripped.RGB.peakBoundary.bed",
outHalfPH = "TC_SRR1107535_stripped.RGB.peakHalf.bed",
valleySeeking = TRUE,valleyDepth = 0.9,bigFile = TRUE,ss = TRUE,genes = "genes.bed")
peaks: tagged bed file or GRanges of peaks;
reSize: resize peak to this width around peak center,
64 (default); fasta: reference sequence.
CLIPflexR
documentation.
myseq <- fetchSequencesForClIP(peaks = tagProfile,reSize = 20,fasta = "mm10.fa",bedHeader = FALSE)
peaks: tagged bed file or GRanges of peaks;
reSize: resize peak to this width around peak center,
64 (default); fasta: reference sequence;
pattern: FASTA of motif sequence(s).
CLIPflexR
documentation.
myseq <- annotatePeaksWithPatterns(peaks = tagProfile,patterns = "TGCATG",resize = 100,fasta = "mm10.fa",bedHeader=FALSE)