jpayne@69: #!/bin/bash jpayne@69: jpayne@69: usage(){ jpayne@69: echo " jpayne@69: Written by Brian Bushnell jpayne@69: Last modified Jan 7, 2020 jpayne@69: jpayne@69: Description: Compares query sketches to others, and prints their kmer identity. jpayne@69: The input can be sketches made by sketch.sh, or fasta/fastq files. jpayne@69: It's recommended to first sketch references with sketch.sh for large files, jpayne@69: or when taxonomic information is desired. jpayne@69: jpayne@69: Please read bbmap/docs/guides/BBSketchGuide.txt for more information. jpayne@69: jpayne@69: Usage: comparesketch.sh in= ref= jpayne@69: Alternative: comparesketch.sh in= file file file jpayne@69: Alternative: comparesketch.sh in= *.sketch jpayne@69: Alternative: comparesketch.sh alltoall *.sketch.gz jpayne@69: jpayne@69: File parameters: jpayne@69: in= Sketches or fasta files to compare. jpayne@69: out=stdout Comparison output. Can be set to a file instead. jpayne@69: outsketch= Optionally write sketch files generated from the input. jpayne@69: ref= List of sketches to compare against. Files given without jpayne@69: a prefix (ref=) will be treated as references, jpayne@69: so you can use *.sketch without ref=. jpayne@69: You can also do ref=nt#.sketch to load all numbered files jpayne@69: fitting that pattern. jpayne@69: On NERSC, you can use these abbreviations (e.g. ref=nt): jpayne@69: nt: nt sketches jpayne@69: refseq: Refseq sketches jpayne@69: silva: Silva sketches jpayne@69: img: IMG sketches jpayne@69: mito: RefSeq mitochondrial sketches jpayne@69: fungi: RefSeq fungi sketches jpayne@69: protein: RefSeq prokaryotic amino acid sketches jpayne@69: Using an abbreviation automatically sets the blacklist, jpayne@69: and k. If the reference is in amino space, the query jpayne@69: also be in amino acid space with the flag amino added. jpayne@69: If the query is in nucleotide space, use the flag jpayne@69: 'translate', but this will only work for prokaryotes. jpayne@69: jpayne@69: Blacklist and Whitelist parameters: jpayne@69: blacklist= Ignore keys in this sketch file. Additionally, there are jpayne@69: built-in blacklists that can be specified: jpayne@69: nt: Blacklist for nt jpayne@69: refseq: Blacklist for Refseq jpayne@69: silva: Blacklist for Silva jpayne@69: img: Blacklist for IMG jpayne@69: whitelist=f Ignore keys that are not in the index. Requires index=t. jpayne@69: jpayne@69: Sketch-making parameters: jpayne@69: mode=perfile Possible modes, for sequence input: jpayne@69: single: Generate one sketch. jpayne@69: sequence: Generate one sketch per sequence. jpayne@69: perfile: Generate one sketch per file. jpayne@69: sketchonly=f Don't run comparisons, just write the output sketch file. jpayne@69: k=31 Kmer length, 1-32. To maximize sensitivity and jpayne@69: specificity, dual kmer lengths may be used: k=31,24 jpayne@69: Dual kmers are fastest if the shorter is a multiple jpayne@69: of 4. Query and reference k must match. jpayne@69: samplerate=1 Set to a lower value to sample a fraction of input reads. jpayne@69: For raw reads (rather than an assembly), 1-3x coverage jpayne@69: gives best results, by reducing error kmers. Somewhat jpayne@69: higher is better for high-error-rate data like PacBio. jpayne@69: minkeycount=1 Ignore kmers that occur fewer times than this. Values jpayne@69: over 1 can be used with raw reads to avoid error kmers. jpayne@69: minprob=0.0001 Ignore kmers below this probability of correctness. jpayne@69: minqual=0 Ignore kmers spanning bases below this quality. jpayne@69: entropy=0.66 Ignore sequence with entropy below this value. jpayne@69: merge=f Merge paired reads prior to sketching. jpayne@69: amino=f Use amino acid mode. Input should be amino acids. jpayne@69: translate=f Call genes and translate to proteins. Input should be jpayne@69: nucleotides. Designed for prokaryotes. jpayne@69: sixframes=f Translate all 6 frames instead of predicting genes. jpayne@69: ssu=t Scan for and retain full-length SSU sequence. jpayne@69: printssusequence=f Print the query SSU sequence (JSON mode only). jpayne@69: jpayne@69: Size parameters: jpayne@69: size=10000 Desired size of sketches (if not using autosize). jpayne@69: mgf=0.01 (maxfraction) Max fraction of genomic kmers to use. jpayne@69: minsize=100 Do not generate sketches for genomes smaller than this. jpayne@69: autosize=t Use flexible sizing instead of fixed-length. This is jpayne@69: nonlinear; a human sketch is only ~6x a bacterial sketch. jpayne@69: sizemult=1 Multiply the autosized size of sketches by this factor. jpayne@69: Normally a bacterial-size genome will get a sketch size jpayne@69: of around 10000; if autosizefactor=2, it would be ~20000. jpayne@69: density= If this flag is set (to a number between 0 and 1), jpayne@69: autosize and sizemult are ignored, and this fraction of jpayne@69: genomic kmers are used. For example, at density=0.001, jpayne@69: a 4.5Mbp bacteria will get a 4500-kmer sketch. jpayne@69: sketchheapfactor=4 If minkeycount>1, temporarily track this many kmers until jpayne@69: counts are known and low-count kmers are discarded. jpayne@69: jpayne@69: Sketch comparing parameters: jpayne@69: threads=auto Use this many threads for comparison. jpayne@69: index=auto Index the sketches for much faster searching. jpayne@69: Requires more memory and adds startup time. jpayne@69: Recommended true for many query sketches, false for few. jpayne@69: prealloc=f Preallocate the index for greater efficiency. jpayne@69: Can be set to a number between 0 and 1 to determine how jpayne@69: much of total memory should be used. jpayne@69: alltoall (ata) Compare all refs to all. Must be sketches. jpayne@69: compareself=f In all-to-all mode, compare a sketch to itself. jpayne@69: jpayne@69: Taxonomy-related parameters: jpayne@69: tree= Specify a TaxTree file. On Genepool, use tree=auto. jpayne@69: Only necessary for use with printtaxa and level. jpayne@69: Assumes comparisons are done against reference sketches jpayne@69: with known taxonomy information. jpayne@69: level=2 Only report the best record per taxa at this level. jpayne@69: Either level names or numbers may be used. jpayne@69: 0: disabled jpayne@69: 1: subspecies jpayne@69: 2: species jpayne@69: 3: genus jpayne@69: ...etc jpayne@69: include= Restrict output to organisms in these clades. jpayne@69: May be a comma-delimited list of names or NCBI TaxIDs. jpayne@69: includelevel=0 Promote the include list to this taxonomic level. jpayne@69: For example, include=h.sapiens includelevel=phylum jpayne@69: would only include organisms in the same phylum as human. jpayne@69: includestring= Only report records whose name contains this string. jpayne@69: exclude= Ignore organisms in these clades. jpayne@69: May be a comma-delimited list of names or NCBI TaxIDs. jpayne@69: excludelevel=0 Promote the exclude list to this taxonomic level. jpayne@69: For example, exclude=h.sapiens excludelevel=phylum jpayne@69: would exclude all organisms in the same phylum as human. jpayne@69: excludestring= Do not records whose name contains this string. jpayne@69: minlevel= Use this to restrict comparisons to distantly-related jpayne@69: organisms. Intended for finding misclassified organisms jpayne@69: using all-to-all comparisons. minlevel=order would only jpayne@69: report hits between organisms related at the order level jpayne@69: or higher, not between same species or genus. jpayne@69: banunclassified=f Ignore organisms descending from nodes like jpayne@69: 'unclassified Bacteria' jpayne@69: banvirus=f Ignore viruses. jpayne@69: requiressu=f Ignore records without SSUs. jpayne@69: minrefsize=0 Ignore ref sketches smaller than this (unique kmers). jpayne@69: minrefsizebases=0 Ignore ref sketches smaller than this (total base pairs). jpayne@69: jpayne@69: Output format: jpayne@69: format=2 2: Default format with, per query, one query header line; jpayne@69: one column header line; and one reference line per hit. jpayne@69: 3: One line per hit, with columns query, reference, ANI, jpayne@69: and sizeRatio. Useful for all-to-all comparisons. jpayne@69: 4: JSON (format=json also works). jpayne@69: 5: Constellation (format=constellation also works). jpayne@69: usetaxidname=f For format 3, print the taxID in the name column. jpayne@69: usetaxname for format 3, print the taxonomic name in the name column. jpayne@69: useimgname For format 3, print the img ID in the name column. jpayne@69: jpayne@69: Output columns (for format=2): jpayne@69: printall=f Enable all output columns. jpayne@69: printani=t (ani) Print average nucleotide identity estimate. jpayne@69: completeness=t Genome completeness estimate. jpayne@69: score=f Score (used for sorting the output). jpayne@69: printmatches=t Number of kmer matches to reference. jpayne@69: printlength=f Number of kmers compared. jpayne@69: printtaxid=t NCBI taxID. jpayne@69: printimg=f IMG identifier (only for IMG data). jpayne@69: printgbases=f Number of genomic bases. jpayne@69: printgkmers=f Number of genomic kmers. jpayne@69: printgsize=t Estimated number of unique genomic kmers. jpayne@69: printgseqs=t Number of sequences (scaffolds/reads). jpayne@69: printtaxname=t Name associated with this taxID. jpayne@69: printname0=f (pn0) Original seqeuence name. jpayne@69: printfname=t Query filename. jpayne@69: printtaxa=f Full taxonomy of each record. jpayne@69: printcontam=t Print contamination estimate, and factor contaminant kmers jpayne@69: into calculations. Kmers are considered contaminant if jpayne@69: present in some ref sketch but not the current one. jpayne@69: printunique=t Number of matches unique to this reference. jpayne@69: printunique2=f Number of matches unique to this reference's taxa. jpayne@69: printunique3=f Number of query kmers unique to this reference's taxa, jpayne@69: regardless of whether they are in this reference sketch. jpayne@69: printnohit=f Number of kmers that don't hit anything. jpayne@69: printrefhits=f Average number of ref sketches hit by shared kmers. jpayne@69: printgc=f GC content. jpayne@69: printucontam=f Contam hits that hit exactly one reference sketch. jpayne@69: printcontam2=f Print contamination estimate using only kmer hits jpayne@69: to unrelated taxa. jpayne@69: contamlevel=species Taxonomic level to use for contam2/unique2/unique3. jpayne@69: NOTE: unique2/unique3/contam2/refhits require an index. jpayne@69: jpayne@69: printdepth=f (depth) Print average depth of sketch kmers; intended jpayne@69: for shotgun read input. jpayne@69: printdepth2=f (depth2) Print depth compensating for genomic repeats. jpayne@69: Requires reference sketches to be generated with depth. jpayne@69: actualdepth=t If this is false, the raw average count is printed. jpayne@69: If true, the raw average (observed depth) is converted jpayne@69: to estimated actual depth (including uncovered areas). jpayne@69: printvolume=f (volume) Product of average depth and matches. jpayne@69: printca=f Print common ancestor, if query taxID is known. jpayne@69: printcal=f Print common ancestor tax level, if query taxID is known. jpayne@69: recordsperlevel=0 If query TaxID is known, and this is positive, print this jpayne@69: many records per common ancestor level. jpayne@69: jpayne@69: Sorting: jpayne@69: sortbyscore=t Default sort order is by score, a composite metric. jpayne@69: sortbydepth=f Include depth as a factor in sort order. jpayne@69: sortbydepth2=f Include depth2 as a factor in sort order. jpayne@69: sortbyvolume=f Include volume as a factor in sort order. jpayne@69: sortbykid=f Sort strictly by KID. jpayne@69: sortbyani=f Sort strictly by ANI/AAI/WKID. jpayne@69: sortbyhits=f Sort strictly by the number of kmer hits. jpayne@69: jpayne@69: Other output parameters: jpayne@69: minhits=3 (hits) Only report records with at least this many hits. jpayne@69: minani=0 (ani) Only report records with at least this ANI (0-1). jpayne@69: minwkid=0.0001 (wkid) Only report records with at least this WKID (0-1). jpayne@69: anifromwkid=t Calculate ani from wkid. If false, use kid. jpayne@69: minbases=0 Ignore ref sketches of sequences shortert than this. jpayne@69: minsizeratio=0 Don't compare sketches if the smaller genome is less than jpayne@69: this fraction of the size of the larger. jpayne@69: records=20 Report at most this many best-matching records. jpayne@69: color=family Color records at the family level. color=f will disable. jpayne@69: Colors work in most terminals but may cause odd characters jpayne@69: to appear in text editors. So, color defaults to f if jpayne@69: writing to a file. Requires the taxtree to be loaded. jpayne@69: intersect=f Print sketch intersections. delta=f is suggested. jpayne@69: jpayne@69: Metadata flags (optional, for the query sketch header): jpayne@69: taxid=-1 Set the NCBI taxid. jpayne@69: imgid=-1 Set the IMG id. jpayne@69: spid=-1 Set the JGI sequencing project id. jpayne@69: name= Set the name (taxname). jpayne@69: name0= Set name0 (normally the first sequence header). jpayne@69: fname= Set fname (normally the file name). jpayne@69: meta_= Set an arbitrary metadata field. jpayne@69: For example, meta_Month=March. jpayne@69: jpayne@69: Other parameters: jpayne@69: requiredmeta= (rmeta) Required optional metadata values. For example: jpayne@69: rmeta=subunit:ssu,source:silva jpayne@69: bannedmeta= (bmeta) Forbidden optional metadata values. jpayne@69: jpayne@69: jpayne@69: Java Parameters: jpayne@69: -Xmx This will set Java's memory usage, overriding autodetection. jpayne@69: -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. jpayne@69: The max is typically 85% of physical memory. jpayne@69: -eoom This flag will cause the process to exit if an jpayne@69: out-of-memory exception occurs. Requires Java 8u92+. jpayne@69: -da Disable assertions. jpayne@69: jpayne@69: For more detailed information, please read /bbmap/docs/guides/BBSketchGuide.txt. jpayne@69: Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems. jpayne@69: " jpayne@69: } jpayne@69: jpayne@69: #This block allows symlinked shellscripts to correctly set classpath. jpayne@69: pushd . > /dev/null jpayne@69: DIR="${BASH_SOURCE[0]}" jpayne@69: while [ -h "$DIR" ]; do jpayne@69: cd "$(dirname "$DIR")" jpayne@69: DIR="$(readlink "$(basename "$DIR")")" jpayne@69: done jpayne@69: cd "$(dirname "$DIR")" jpayne@69: DIR="$(pwd)/" jpayne@69: popd > /dev/null jpayne@69: jpayne@69: #DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )/" jpayne@69: CP="$DIR""current/" jpayne@69: jpayne@69: z="-Xmx4g" jpayne@69: z2="-Xms4g" jpayne@69: set=0 jpayne@69: jpayne@69: if [ -z "$1" ] || [[ $1 == -h ]] || [[ $1 == --help ]]; then jpayne@69: usage jpayne@69: exit jpayne@69: fi jpayne@69: jpayne@69: calcXmx () { jpayne@69: source "$DIR""/calcmem.sh" jpayne@69: setEnvironment jpayne@69: parseXmx "$@" jpayne@69: if [[ $set == 1 ]]; then jpayne@69: return jpayne@69: fi jpayne@69: freeRam 3200m 84 jpayne@69: z="-Xmx${RAM}m" jpayne@69: z2="-Xms${RAM}m" jpayne@69: } jpayne@69: calcXmx "$@" jpayne@69: jpayne@69: comparesketch() { jpayne@69: local CMD="java $EA $EOOM $z $z2 -cp $CP sketch.CompareSketch $@" jpayne@69: # echo $CMD >&2 jpayne@69: eval $CMD jpayne@69: } jpayne@69: jpayne@69: comparesketch "$@"