jpayne@69: #!/bin/bash jpayne@69: jpayne@69: usage(){ jpayne@69: echo " jpayne@69: Written by Brian Bushnell jpayne@69: Last modified October 14, 2020 jpayne@69: jpayne@69: Description: Counts the number of unique kmers in a file. jpayne@69: Generates a kmer frequency histogram and genome size estimate (in peaks output), jpayne@69: and prints a file containing all kmers and their counts. jpayne@69: Supports K=1 to infinity, though not all values are allowed. jpayne@69: SEE ALSO: bbnorm.sh/khist.sh, loglog.sh, and kmercountmulti.sh. jpayne@69: jpayne@69: Usage: kmercountexact.sh in= khist= peaks= jpayne@69: jpayne@69: Input may be fasta or fastq, compressed or uncompressed. jpayne@69: Output may be stdout or a file. out, khist, and peaks are optional. jpayne@69: jpayne@69: jpayne@69: Input parameters: jpayne@69: in= Primary input file. jpayne@69: in2= Second input file for paired reads. jpayne@69: amino=f Run in amino acid mode. jpayne@69: jpayne@69: Output parameters: jpayne@69: out= Print kmers and their counts. This is produces a jpayne@69: huge file, so skip it if you only need the histogram. jpayne@69: fastadump=t Print kmers and counts as fasta versus 2-column tsv. jpayne@69: mincount=1 Only print kmers with at least this depth. jpayne@69: reads=-1 Only process this number of reads, then quit (-1 means all). jpayne@69: dumpthreads=-1 Use this number of threads for dumping kmers (-1 means auto). jpayne@69: jpayne@69: Hashing parameters: jpayne@69: k=31 Kmer length (1-31 is fastest). jpayne@69: prealloc=t Pre-allocate memory rather than dynamically growing; faster and more memory-efficient. A float fraction (0-1) may be specified, default 1. jpayne@69: prefilter=0 If set to a positive integer, use a countmin sketch to ignore kmers with depth of that value or lower. jpayne@69: prehashes=2 Number of hashes for prefilter. jpayne@69: prefiltersize=0.2 Fraction of memory to use for prefilter. jpayne@69: minq=6 Ignore kmers containing bases with quality below this. (TODO) jpayne@69: minprob=0.0 Ignore kmers with overall probability of correctness below this. jpayne@69: threads=X Spawn X hashing threads (default is number of logical processors). jpayne@69: onepass=f If true, prefilter will be generated in same pass as kmer counts. Much faster but counts will be lower, by up to prefilter's depth limit. jpayne@69: rcomp=t Store and count each kmer together and its reverse-complement. jpayne@69: jpayne@69: Histogram parameters: jpayne@69: khist= Print kmer frequency histogram. jpayne@69: histcolumns=2 2 columns: (depth, count). 3 columns: (depth, rawCount, count). jpayne@69: histmax=100000 Maximum depth to print in histogram output. jpayne@69: histheader=t Set true to print a header line. jpayne@69: nzo=t (nonzeroonly) Only print lines for depths with a nonzero kmer count. jpayne@69: gchist=f Add an extra histogram column with the average GC%. jpayne@69: jpayne@69: Intersection parameters: jpayne@69: ref= An input assembly of the input reads. jpayne@69: intersection= Output file for a 2-D histogram of read and ref kmer counts. jpayne@69: refmax=6 Maximum reference kmer depth to track; read depth is controlled by 'histmax'. jpayne@69: jpayne@69: Smoothing parameters: jpayne@69: smoothkhist=f Smooth the output kmer histogram. jpayne@69: smoothpeaks=t Smooth the kmer histogram for peak-calling, but does not affect the output histogram. jpayne@69: smoothradius=1 Initial radius of progressive smoothing function. jpayne@69: maxradius=10 Maximum radius of progressive smoothing function. jpayne@69: progressivemult=2 Increment radius each time depth increases by this factor. jpayne@69: logscale=t Transform to log-scale prior to peak-calling. jpayne@69: logwidth=0.1 The larger the number, the smoother. jpayne@69: jpayne@69: Peak calling parameters: jpayne@69: peaks= Write the peaks to this file. Default is stdout. jpayne@69: Also contains the genome size estimate in bp. jpayne@69: minHeight=2 (h) Ignore peaks shorter than this. jpayne@69: minVolume=5 (v) Ignore peaks with less area than this. jpayne@69: minWidth=3 (w) Ignore peaks narrower than this. jpayne@69: minPeak=2 (minp) Ignore peaks with an X-value below this. jpayne@69: maxPeak=BIG (maxp) Ignore peaks with an X-value above this. jpayne@69: maxPeakCount=12 (maxpc) Print up to this many peaks (prioritizing height). jpayne@69: ploidy=-1 Specify ploidy; otherwise it will be autodetected. jpayne@69: jpayne@69: Sketch parameters (for making a MinHashSketch): jpayne@69: sketch= Write a minhash sketch to this file. jpayne@69: sketchlen=10000 Output the top 10000 kmers. Only kmers with at least mincount are included. jpayne@69: sketchname= Name of output sketch. jpayne@69: sketchid= taxID of output sketch. jpayne@69: jpayne@69: Quality parameters: jpayne@69: qtrim=f Trim read ends to remove bases with quality below minq. jpayne@69: Values: t (trim both ends), f (neither end), r (right end only), l (left end only). jpayne@69: trimq=4 Trim quality threshold. jpayne@69: minavgquality=0 (maq) Reads with average quality (before trimming) below this will be discarded. jpayne@69: jpayne@69: Overlap parameters (for overlapping paired-end reads only): jpayne@69: merge=f Attempt to merge reads before counting kmers. jpayne@69: ecco=f Error correct via overlap, but do not merge reads. jpayne@69: jpayne@69: Java Parameters: jpayne@69: -Xmx This will set Java's memory usage, overriding autodetection. jpayne@69: -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. jpayne@69: The max is typically 85% of physical memory. jpayne@69: -eoom This flag will cause the process to exit if an jpayne@69: out-of-memory exception occurs. Requires Java 8u92+. jpayne@69: -da Disable assertions. jpayne@69: " jpayne@69: } jpayne@69: jpayne@69: #This block allows symlinked shellscripts to correctly set classpath. jpayne@69: pushd . > /dev/null jpayne@69: DIR="${BASH_SOURCE[0]}" jpayne@69: while [ -h "$DIR" ]; do jpayne@69: cd "$(dirname "$DIR")" jpayne@69: DIR="$(readlink "$(basename "$DIR")")" jpayne@69: done jpayne@69: cd "$(dirname "$DIR")" jpayne@69: DIR="$(pwd)/" jpayne@69: popd > /dev/null jpayne@69: jpayne@69: #DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )/" jpayne@69: CP="$DIR""current/" jpayne@69: jpayne@69: z="-Xmx1g" jpayne@69: z2="-Xms1g" jpayne@69: set=0 jpayne@69: jpayne@69: if [ -z "$1" ] || [[ $1 == -h ]] || [[ $1 == --help ]]; then jpayne@69: usage jpayne@69: exit jpayne@69: fi jpayne@69: jpayne@69: calcXmx () { jpayne@69: source "$DIR""/calcmem.sh" jpayne@69: setEnvironment jpayne@69: parseXmx "$@" jpayne@69: if [[ $set == 1 ]]; then jpayne@69: return jpayne@69: fi jpayne@69: freeRam 3200m 84 jpayne@69: z="-Xmx${RAM}m" jpayne@69: z2="-Xms${RAM}m" jpayne@69: } jpayne@69: calcXmx "$@" jpayne@69: jpayne@69: kmercountexact() { jpayne@69: local CMD="java $EA $EOOM $z $z2 -cp $CP jgi.KmerCountExact $@" jpayne@69: echo $CMD >&2 jpayne@69: eval $CMD jpayne@69: } jpayne@69: jpayne@69: kmercountexact "$@"