jpayne@69: #!/bin/bash jpayne@69: jpayne@69: usage(){ jpayne@69: echo " jpayne@69: Written by Brian Bushnell jpayne@69: Last modified February 3, 2021 jpayne@69: jpayne@69: Description: Uses kmer counts to assemble contigs, extend sequences, jpayne@69: or error-correct reads. Tadpole has no upper bound for kmer length, jpayne@69: but some values are not supported. Specifically, it allows 1-31, jpayne@69: multiples of 2 from 32-62, multiples of 3 from 63-93, etc. jpayne@69: Please read bbmap/docs/guides/TadpoleGuide.txt for more information. jpayne@69: jpayne@69: Usage: jpayne@69: Assembly: tadpole.sh in= out= jpayne@69: Extension: tadpole.sh in= out= mode=extend jpayne@69: Correction: tadpole.sh in= out= mode=correct jpayne@69: jpayne@69: Recommended parameters for optimal assembly: jpayne@69: tadpole.sh in= out= shave rinse pop k=<50-70% of read length> jpayne@69: jpayne@69: Extension and correction may be done simultaneously. Error correction on jpayne@69: multiple files may be done like this: jpayne@69: jpayne@69: tadpole.sh in=libA_r1.fq,libA_merged.fq in2=libA_r2.fq,null extra=libB_r1.fq out=ecc_libA_r1.fq,ecc_libA_merged.fq out2=ecc_libA_r2.fq,null mode=correct jpayne@69: jpayne@69: Extending contigs with reads could be done like this: jpayne@69: jpayne@69: tadpole.sh in=contigs.fa out=extended.fa el=100 er=100 mode=extend extra=reads.fq k=62 jpayne@69: jpayne@69: jpayne@69: Input parameters: jpayne@69: in= Primary input file for reads to use as kmer data. jpayne@69: in2= Second input file for paired data. jpayne@69: extra= Extra files for use as kmer data, but not for error- jpayne@69: correction or extension. jpayne@69: reads=-1 Only process this number of reads, then quit (-1 means all). jpayne@69: NOTE: in, in2, and extra may also be comma-delimited lists of files. jpayne@69: jpayne@69: Output parameters: jpayne@69: out= Write contigs (in contig mode) or corrected/extended jpayne@69: reads (in other modes). jpayne@69: out2= Second output file for paired output. jpayne@69: outd= Write discarded reads, if using junk-removal flags. jpayne@69: dot= Write a contigs connectivity graph (partially implemented) jpayne@69: dump= Write kmers and their counts. jpayne@69: fastadump=t Write kmers and counts as fasta versus 2-column tsv. jpayne@69: mincounttodump=1 Only dump kmers with at least this depth. jpayne@69: showstats=t Print assembly statistics after writing contigs. jpayne@69: jpayne@69: Prefiltering parameters: jpayne@69: prefilter=0 If set to a positive integer, use a countmin sketch jpayne@69: to ignore kmers with depth of that value or lower. jpayne@69: prehashes=2 Number of hashes for prefilter. jpayne@69: prefiltersize=0.2 (pff) Fraction of memory to use for prefilter. jpayne@69: minprobprefilter=t (mpp) Use minprob for the prefilter. jpayne@69: prepasses=1 Use this many prefiltering passes; higher be more thorough jpayne@69: if the filter is very full. Set to 'auto' to iteratively jpayne@69: prefilter until the remaining kmers will fit in memory. jpayne@69: onepass=f If true, prefilter will be generated in same pass as kmer jpayne@69: counts. Much faster but counts will be lower, by up to jpayne@69: prefilter's depth limit. jpayne@69: filtermem=0 Allows manually specifying prefilter memory in bytes, for jpayne@69: deterministic runs. 0 will set it automatically. jpayne@69: jpayne@69: Hashing parameters: jpayne@69: k=31 Kmer length (1 to infinity). Memory use increases with K. jpayne@69: prealloc=t Pre-allocate memory rather than dynamically growing; jpayne@69: faster and more memory-efficient. A float fraction (0-1) jpayne@69: may be specified; default is 1. jpayne@69: minprob=0.5 Ignore kmers with overall probability of correctness below this. jpayne@69: minprobmain=t (mpm) Use minprob for the primary kmer counts. jpayne@69: threads=X Spawn X worker threads; default is number of logical processors. jpayne@69: buildthreads=X Spawn X contig-building threads. If not set, defaults to the same jpayne@69: as threads. Setting this to 1 will make contigs deterministic. jpayne@69: rcomp=t Store and count each kmer together and its reverse-complement. jpayne@69: coremask=t All kmer extensions share the same hashcode. jpayne@69: fillfast=t Speed up kmer extension lookups. jpayne@69: jpayne@69: Assembly parameters: jpayne@69: mincountseed=3 (mcs) Minimum kmer count to seed a new contig or begin extension. jpayne@69: mincountextend=2 (mce) Minimum kmer count continue extension of a read or contig. jpayne@69: It is recommended that mce=1 for low-depth metagenomes. jpayne@69: mincountretain=0 (mincr) Discard kmers with count below this. jpayne@69: maxcountretain=INF (maxcr) Discard kmers with count above this. jpayne@69: branchmult1=20 (bm1) Min ratio of 1st to 2nd-greatest path depth at high depth. jpayne@69: branchmult2=3 (bm2) Min ratio of 1st to 2nd-greatest path depth at low depth. jpayne@69: branchlower=3 (blc) Max value of 2nd-greatest path depth to be considered low. jpayne@69: minextension=2 (mine) Do not keep contigs that did not extend at least this much. jpayne@69: mincontig=auto (minc) Do not write contigs shorter than this. jpayne@69: mincoverage=1 (mincov) Do not write contigs with average coverage below this. jpayne@69: maxcoverage=inf (maxcov) Do not write contigs with average coverage above this. jpayne@69: trimends=0 (trim) Trim contig ends by this much. Trimming by K/2 jpayne@69: may yield more accurate genome size estimation. jpayne@69: trimcircular=t Trim one end of contigs ending in LOOP/LOOP by K-1, jpayne@69: to eliminate the overlapping portion. jpayne@69: contigpasses=16 Build contigs with decreasing seed depth for this many iterations. jpayne@69: contigpassmult=1.7 Ratio between seed depth of two iterations. jpayne@69: ownership=auto For concurrency; do not touch. jpayne@69: processcontigs=f Explore the contig connectivity graph. jpayne@69: popbubbles=t (pop) Pop bubbles; increases contiguity. Requires jpayne@69: additional time and memory and forces processcontigs=t. jpayne@69: jpayne@69: Processing modes: jpayne@69: mode=contig contig: Make contigs from kmers. jpayne@69: extend: Extend sequences to be longer, and optionally jpayne@69: perform error correction. jpayne@69: correct: Error correct only. jpayne@69: insert: Measure insert sizes. jpayne@69: discard: Discard low-depth reads, without error correction. jpayne@69: jpayne@69: Extension parameters: jpayne@69: extendleft=100 (el) Extend to the left by at most this many bases. jpayne@69: extendright=100 (er) Extend to the right by at most this many bases. jpayne@69: ibb=t (ignorebackbranches) Do not stop at backward branches. jpayne@69: extendrollback=3 Trim a random number of bases, up to this many, on reads jpayne@69: that extend only partially. This prevents the creation jpayne@69: of sharp coverage discontinuities at branches. jpayne@69: jpayne@69: Error-correction parameters: jpayne@69: ecc=f Error correct via kmer counts. jpayne@69: reassemble=t If ecc is enabled, use the reassemble algorithm. jpayne@69: pincer=f If ecc is enabled, use the pincer algorithm. jpayne@69: tail=f If ecc is enabled, use the tail algorithm. jpayne@69: eccfull=f If ecc is enabled, use tail over the entire read. jpayne@69: aggressive=f (aecc) Use aggressive error correction settings. jpayne@69: Overrides some other flags like errormult1 and deadzone. jpayne@69: conservative=f (cecc) Use conservative error correction settings. jpayne@69: Overrides some other flags like errormult1 and deadzone. jpayne@69: rollback=t Undo changes to reads that have lower coverage for jpayne@69: any kmer after correction. jpayne@69: markbadbases=0 (mbb) Any base fully covered by kmers with count below jpayne@69: this will have its quality reduced. jpayne@69: markdeltaonly=t (mdo) Only mark bad bases adjacent to good bases. jpayne@69: meo=t (markerrorreadsonly) Only mark bad bases in reads jpayne@69: containing errors. jpayne@69: markquality=0 (mq) Set quality scores for marked bases to this. jpayne@69: A level of 0 will also convert the base to an N. jpayne@69: errormult1=16 (em1) Min ratio between kmer depths to call an error. jpayne@69: errormult2=2.6 (em2) Alternate ratio between low-depth kmers. jpayne@69: errorlowerconst=3 (elc) Use mult2 when the lower kmer is at most this deep. jpayne@69: mincountcorrect=3 (mcc) Don't correct to kmers with count under this. jpayne@69: pathsimilarityfraction=0.45(psf) Max difference ratio considered similar. jpayne@69: Controls whether a path appears to be continuous. jpayne@69: pathsimilarityconstant=3 (psc) Absolute differences below this are ignored. jpayne@69: errorextensionreassemble=5 (eer) Verify this many kmers before the error as jpayne@69: having similar depth, for reassemble. jpayne@69: errorextensionpincer=5 (eep) Verify this many additional bases after the jpayne@69: error as matching current bases, for pincer. jpayne@69: errorextensiontail=9 (eet) Verify additional bases before and after jpayne@69: the error as matching current bases, for tail. jpayne@69: deadzone=0 (dz) Do not try to correct bases within this distance of jpayne@69: read ends. jpayne@69: window=12 (w) Length of window to use in reassemble mode. jpayne@69: windowcount=6 (wc) If more than this many errors are found within a jpayne@69: a window, halt correction in that direction. jpayne@69: qualsum=80 (qs) If the sum of the qualities of corrected bases within jpayne@69: a window exceeds this, halt correction in that direction. jpayne@69: rbi=t (requirebidirectional) Require agreement from both jpayne@69: directions when correcting errors in the middle part of jpayne@69: the read using the reassemble algorithm. jpayne@69: errorpath=1 (ep) For debugging purposes. jpayne@69: jpayne@69: Junk-removal parameters (to only remove junk, set mode=discard): jpayne@69: tossjunk=f Remove reads that cannot be used for assembly. jpayne@69: This means they have no kmers above depth 1 (2 for paired jpayne@69: reads) and the outermost kmers cannot be extended. jpayne@69: Pairs are removed only if both reads fail. jpayne@69: tossdepth=-1 Remove reads containing kmers at or below this depth. jpayne@69: Pairs are removed if either read fails. jpayne@69: lowdepthfraction=0 (ldf) Require at least this fraction of kmers to be jpayne@69: low-depth to discard a read; range 0-1. 0 still jpayne@69: requires at least 1 low-depth kmer. jpayne@69: requirebothbad=f (rbb) Only discard pairs if both reads are low-depth. jpayne@69: tossuncorrectable (tu) Discard reads containing uncorrectable errors. jpayne@69: Requires error-correction to be enabled. jpayne@69: jpayne@69: Shaving parameters: jpayne@69: shave=t Remove dead ends (aka hair). jpayne@69: rinse=t Remove bubbles. jpayne@69: wash= Set shave and rinse at the same time. jpayne@69: maxshavedepth=1 (msd) Shave or rinse kmers at most this deep. jpayne@69: exploredist=300 (sed) Quit after exploring this far. jpayne@69: discardlength=150 (sdl) Discard shavings up to this long. jpayne@69: Note: Shave and rinse can produce substantially better assemblies jpayne@69: for low-depth data, but they are very slow for large metagenomes. jpayne@69: jpayne@69: Overlap parameters (for overlapping paired-end reads only): jpayne@69: merge=f Attempt to merge overlapping reads prior to jpayne@69: kmer-counting, and again prior to correction. Output jpayne@69: will still be unmerged pairs. jpayne@69: ecco=f Error correct via overlap, but do not merge reads. jpayne@69: testmerge=t Test kmer counts around the read merge junctions. If jpayne@69: it appears that the merge created new errors, undo it. jpayne@69: jpayne@69: Java Parameters: jpayne@69: -Xmx This will set Java's memory usage, overriding autodetection. jpayne@69: -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. jpayne@69: The max is typically 85% of physical memory. jpayne@69: -eoom This flag will cause the process to exit if an jpayne@69: out-of-memory exception occurs. Requires Java 8u92+. jpayne@69: -da Disable assertions. jpayne@69: " jpayne@69: } jpayne@69: jpayne@69: #This block allows symlinked shellscripts to correctly set classpath. jpayne@69: pushd . > /dev/null jpayne@69: DIR="${BASH_SOURCE[0]}" jpayne@69: while [ -h "$DIR" ]; do jpayne@69: cd "$(dirname "$DIR")" jpayne@69: DIR="$(readlink "$(basename "$DIR")")" jpayne@69: done jpayne@69: cd "$(dirname "$DIR")" jpayne@69: DIR="$(pwd)/" jpayne@69: popd > /dev/null jpayne@69: jpayne@69: #DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )/" jpayne@69: CP="$DIR""current/" jpayne@69: jpayne@69: z="-Xmx14g" jpayne@69: z2="-Xms14g" jpayne@69: set=0 jpayne@69: jpayne@69: if [ -z "$1" ] || [[ $1 == -h ]] || [[ $1 == --help ]]; then jpayne@69: usage jpayne@69: exit jpayne@69: fi jpayne@69: jpayne@69: calcXmx () { jpayne@69: source "$DIR""/calcmem.sh" jpayne@69: setEnvironment jpayne@69: parseXmx "$@" jpayne@69: if [[ $set == 1 ]]; then jpayne@69: return jpayne@69: fi jpayne@69: freeRam 15000m 84 jpayne@69: z="-Xmx${RAM}m" jpayne@69: z2="-Xms${RAM}m" jpayne@69: } jpayne@69: calcXmx "$@" jpayne@69: jpayne@69: tadpole() { jpayne@69: local CMD="java $EA $EOOM $z $z2 -cp $CP assemble.Tadpole $@" jpayne@69: echo $CMD >&2 jpayne@69: eval $CMD jpayne@69: } jpayne@69: jpayne@69: tadpole "$@"