jpayne@69: #!/bin/bash jpayne@69: jpayne@69: usage(){ jpayne@69: echo " jpayne@69: Written by Brian Bushnell jpayne@69: Last modified December 19, 2019 jpayne@69: jpayne@69: Description: Compares query sketches to reference sketches hosted on a jpayne@69: remote server via the Internet. The input can be sketches made by sketch.sh, jpayne@69: or fasta/fastq files from which SendSketch will generate sketches. jpayne@69: Only sketches will sent, not sequences. jpayne@69: jpayne@69: Please read bbmap/docs/guides/BBSketchGuide.txt for more information. jpayne@69: jpayne@69: Usage: jpayne@69: sendsketch.sh in=file jpayne@69: jpayne@69: To change nucleotide servers, add the server name, e.g.: jpayne@69: sendsketch.sh in=file nt jpayne@69: jpayne@69: For the protein server with nucleotide input: jpayne@69: sendsketch.sh in=file protein jpayne@69: jpayne@69: for the protein server with amino input: jpayne@69: sendsketch.sh in=file amino protein jpayne@69: jpayne@69: jpayne@69: Standard parameters: jpayne@69: in= Sketch or fasta file to compare. jpayne@69: out=stdout Comparison output. Can be set to a file instead. jpayne@69: outsketch= Optional, to write the sketch to a file. jpayne@69: local=f For local files, have the server load the sketches. jpayne@69: Allows use of whitelists; recommended for Silva. jpayne@69: Local can only be used when the client and server access jpayne@69: the same filesystem - e.g., Genepool and Cori. jpayne@69: address= Address of remote server. Default address: jpayne@69: https://refseq-sketch.jgi-psf.org/sketch jpayne@69: You can also specify these abbreviations: jpayne@69: nt: nt server jpayne@69: refseq: Refseq server jpayne@69: silva: Silva server jpayne@69: protein: RefSeq prokaryotic amino acid sketches jpayne@69: img: IMG server (Not Yet Available) jpayne@69: mito: RefSeq mitochondrial server (NYA) jpayne@69: fungi: RefSeq fungi sketches (NYA) jpayne@69: Using an abbreviation automatically sets the address, jpayne@69: the blacklist, and k. jpayne@69: aws=f Set aws=t to use the aws servers instead of NERSC. jpayne@69: When, for example, NERSC (or the whole SF Bay area) is down. jpayne@69: jpayne@69: Sketch-making parameters: jpayne@69: mode=single Possible modes, for fasta input: jpayne@69: single: Generate one sketch per file. jpayne@69: sequence: Generate one sketch per sequence. jpayne@69: k=31 Kmer length, 1-32. This is automatic and does not need to jpayne@69: be set for JGI servers, only for locally-hosted servers. jpayne@69: samplerate=1 Set to a lower value to sample a fraction of input reads. jpayne@69: For raw reads (rather than an assembly), 1-3x coverage jpayne@69: gives best results, by reducing error kmers. Somewhat jpayne@69: higher is better for high-error-rate data like PacBio. jpayne@69: minkeycount=1 Ignore kmers that occur fewer times than this. Values jpayne@69: over 1 can be used with raw reads to avoid error kmers. jpayne@69: minprob=0.0001 Ignore kmers below this probability of correctness. jpayne@69: minqual=0 Ignore kmers spanning bases below this quality. jpayne@69: entropy=0.66 Ignore sequence with entropy below this value. jpayne@69: merge=f Merge paired reads prior to sketching. jpayne@69: amino=f Use amino acid mode. Input should be amino acids. jpayne@69: translate=f Call genes and translate to proteins. Input should be jpayne@69: nucleotides. Designed for prokaryotes. jpayne@69: sixframes=f Translate all 6 frames instead of predicting genes. jpayne@69: ssu=t Scan for and retain full-length SSU sequence. jpayne@69: printssusequence=f Print the query SSU sequence (JSON mode only). jpayne@69: refid= Instead of a query file, specify a reference sketch by name jpayne@69: or taxid; e.g. refid=h.sapiens or refid=9606. jpayne@69: jpayne@69: Size parameters: jpayne@69: size=10000 Desired size of sketches (if not using autosize). jpayne@69: mgf=0.01 (maxfraction) Max fraction of genomic kmers to use. jpayne@69: minsize=100 Do not generate sketches for genomes smaller than this. jpayne@69: autosize=t Use flexible sizing instead of fixed-length. This is jpayne@69: nonlinear; a human sketch is only ~6x a bacterial sketch. jpayne@69: sizemult=1 Multiply the autosized size of sketches by this factor. jpayne@69: Normally a bacterial-size genome will get a sketch size jpayne@69: of around 10000; if autosizefactor=2, it would be ~20000. jpayne@69: density= If this flag is set (to a number between 0 and 1), jpayne@69: autosize and sizemult are ignored, and this fraction of jpayne@69: genomic kmers are used. For example, at density=0.001, jpayne@69: a 4.5Mbp bacteria will get a 4500-kmer sketch. jpayne@69: sketchheapfactor=4 If minkeycount>1, temporarily track this many kmers until jpayne@69: counts are known and low-count kmers are discarded. jpayne@69: jpayne@69: Taxonomy and filtering parameters: jpayne@69: level=2 Only report the best record per taxa at this level. jpayne@69: Either level names or numbers may be used. jpayne@69: 0: disabled jpayne@69: 1: subspecies jpayne@69: 2: species jpayne@69: 3: genus jpayne@69: ...etc jpayne@69: include= Restrict output to organisms in these clades. jpayne@69: May be a comma-delimited list of names or NCBI TaxIDs. jpayne@69: includelevel=0 Promote the include list to this taxonomic level. jpayne@69: For example, include=h.sapiens includelevel=phylum jpayne@69: would only include organisms in the same phylum as human. jpayne@69: includestring= Only report records whose name contains this string. jpayne@69: exclude= Ignore organisms in these clades. jpayne@69: May be a comma-delimited list of names or NCBI TaxIDs. jpayne@69: excludelevel=0 Promote the exclude list to this taxonomic level. jpayne@69: For example, exclude=h.sapiens excludelevel=phylum jpayne@69: would exclude all organisms in the same phylum as human. jpayne@69: excludestring= Do not records whose name contains this string. jpayne@69: banunclassified=f Ignore organisms descending from nodes like jpayne@69: 'unclassified Bacteria' jpayne@69: banvirus=f Ignore viruses. jpayne@69: requiressu=f Ignore records without SSUs. jpayne@69: minrefsize=0 Ignore ref sketches smaller than this (unique kmers). jpayne@69: minrefsizebases=0 Ignore ref sketches smaller than this (total base pairs). jpayne@69: jpayne@69: Output format: jpayne@69: format=2 2: Default format with, per query, one query header line; jpayne@69: one column header line; and one reference line per hit. jpayne@69: 3: One line per hit, with columns query, reference, ANI, jpayne@69: and sizeRatio. jpayne@69: 4: JSON (format=json also works). jpayne@69: 5: Constellation (format=constellation also works). jpayne@69: usetaxidname=f For format 3, print the taxID in the name column. jpayne@69: usetaxname for format 3, print the taxonomic name in the name column. jpayne@69: useimgname For format 3, print the img ID in the name column. jpayne@69: d3=f Output in JSON format, with a tree for visualization. jpayne@69: jpayne@69: Output columns (for format=2): jpayne@69: printall=f Enable all output columns. jpayne@69: printani=t (ani) Print average nucleotide identity estimate. jpayne@69: completeness=t Genome completeness estimate. jpayne@69: score=f Score (used for sorting the output). jpayne@69: printmatches=t Number of kmer matches to reference. jpayne@69: printlength=f Number of kmers compared. jpayne@69: printtaxid=t NCBI taxID. jpayne@69: printimg=f IMG identifier (only for IMG data). jpayne@69: printgbases=f Number of genomic bases. jpayne@69: printgkmers=f Number of genomic kmers. jpayne@69: printgsize=t Estimated number of unique genomic kmers. jpayne@69: printgseqs=t Number of sequences (scaffolds/reads). jpayne@69: printtaxname=t Name associated with this taxID. jpayne@69: printname0=f (pn0) Original seqeuence name. jpayne@69: printqfname=t Query filename. jpayne@69: printrfname=f Reference filename. jpayne@69: printtaxa=f Full taxonomy of each record. jpayne@69: printcontam=t Print contamination estimate, and factor contaminant kmers jpayne@69: into calculations. Kmers are considered contaminant if jpayne@69: present in some ref sketch but not the current one. jpayne@69: printunique=t Number of matches unique to this reference. jpayne@69: printunique2=f Number of matches unique to this reference's taxa. jpayne@69: printunique3=f Number of query kmers unique to this reference's taxa, jpayne@69: regardless of whether they are in this reference sketch. jpayne@69: printnohit=f Number of kmers that don't hit anything. jpayne@69: printrefhits=f Average number of ref sketches hit by shared kmers. jpayne@69: printgc=f GC content. jpayne@69: printucontam=f Contam hits that hit exactly one reference sketch. jpayne@69: printcontam2=f Print contamination estimate using only kmer hits jpayne@69: to unrelated taxa. jpayne@69: contamlevel=species Taxonomic level to use for contam2/unique2/unique3. jpayne@69: NOTE: unique2/unique3/contam2/refhits require an index. jpayne@69: jpayne@69: printdepth=f (depth) Print average depth of sketch kmers; intended jpayne@69: for shotgun read input. jpayne@69: printdepth2=f (depth2) Print depth compensating for genomic repeats. jpayne@69: Requires reference sketches to be generated with depth. jpayne@69: actualdepth=t If this is false, the raw average count is printed. jpayne@69: If true, the raw average (observed depth) is converted jpayne@69: to estimated actual depth (including uncovered areas). jpayne@69: printvolume=f (volume) Product of average depth and matches. jpayne@69: printca=f Print common ancestor, if query taxID is known. jpayne@69: printcal=f Print common ancestor tax level, if query taxID is known. jpayne@69: recordsperlevel=0 If query TaxID is known, and this is positive, print at jpayne@69: most this many records per common ancestor level. jpayne@69: jpayne@69: Sorting: jpayne@69: sortbyscore=t Default sort order is by score. jpayne@69: sortbydepth=f Include depth as a factor in sort order. jpayne@69: sortbydepth2=f Include depth2 as a factor in sort order. jpayne@69: sortbyvolume=f Include volume as a factor in sort order. jpayne@69: sortbykid=f Sort strictly by KID. jpayne@69: sortbyani=f Sort strictly by ANI/AAI/WKID. jpayne@69: sortbyhits=f Sort strictly by the number of kmer hits. jpayne@69: jpayne@69: Other output parameters: jpayne@69: minhits=3 (hits) Only report records with at least this many hits. jpayne@69: minani=0 (ani) Only report records with at least this ANI (0-1). jpayne@69: minwkid=0.0001 (wkid) Only report records with at least this WKID (0-1). jpayne@69: anifromwkid=t Calculate ani from wkid. If false, use kid. jpayne@69: minbases=0 Ignore ref sketches of sequences shortert than this. jpayne@69: minsizeratio=0 Don't compare sketches if the smaller genome is less than jpayne@69: this fraction of the size of the larger. jpayne@69: records=20 Report at most this many best-matching records. jpayne@69: color=family Color records at the family level. color=f will disable. jpayne@69: Colors work in most terminals but may cause odd characters jpayne@69: to appear in text editors. So, color defaults to f if jpayne@69: writing to a file. jpayne@69: intersect=f Print sketch intersections. delta=f is suggested. jpayne@69: jpayne@69: Metadata parameters (optional, for the query sketch header): jpayne@69: taxid=-1 Set the NCBI taxid. jpayne@69: imgid=-1 Set the IMG id. jpayne@69: spid=-1 Set the sequencing project id (JGI-specific). jpayne@69: name= Set the name (taxname). jpayne@69: name0= Set name0 (normally the first sequence header). jpayne@69: fname= Set fname (normally the file name). jpayne@69: meta_= Set an arbitrary metadata field. jpayne@69: For example, meta_Month=March. jpayne@69: jpayne@69: Other parameters: jpayne@69: requiredmeta= (rmeta) Required optional metadata values. For example: jpayne@69: rmeta=subunit:ssu,source:silva jpayne@69: bannedmeta= (bmeta) Forbidden optional metadata values. jpayne@69: jpayne@69: Java Parameters: jpayne@69: -Xmx This will set Java's memory usage, overriding autodetection. jpayne@69: -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. jpayne@69: The max is typically 85% of physical memory. jpayne@69: -eoom This flag will cause the process to exit if an out-of-memory jpayne@69: exception occurs. Requires Java 8u92+. jpayne@69: -da Disable assertions. jpayne@69: jpayne@69: For more detailed information, please read /bbmap/docs/guides/BBSketchGuide.txt. jpayne@69: Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems. jpayne@69: " jpayne@69: } jpayne@69: jpayne@69: #This block allows symlinked shellscripts to correctly set classpath. jpayne@69: pushd . > /dev/null jpayne@69: DIR="${BASH_SOURCE[0]}" jpayne@69: while [ -h "$DIR" ]; do jpayne@69: cd "$(dirname "$DIR")" jpayne@69: DIR="$(readlink "$(basename "$DIR")")" jpayne@69: done jpayne@69: cd "$(dirname "$DIR")" jpayne@69: DIR="$(pwd)/" jpayne@69: popd > /dev/null jpayne@69: jpayne@69: #DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )/" jpayne@69: CP="$DIR""current/" jpayne@69: jpayne@69: z="-Xmx4g" jpayne@69: z2="-Xms4g" jpayne@69: set=0 jpayne@69: jpayne@69: if [ -z "$1" ] || [[ $1 == -h ]] || [[ $1 == --help ]]; then jpayne@69: usage jpayne@69: exit jpayne@69: fi jpayne@69: jpayne@69: calcXmx () { jpayne@69: source "$DIR""/calcmem.sh" jpayne@69: setEnvironment jpayne@69: parseXmx "$@" jpayne@69: if [[ $set == 1 ]]; then jpayne@69: return jpayne@69: fi jpayne@69: freeRam 3200m 84 jpayne@69: z="-Xmx${RAM}m" jpayne@69: z2="-Xms${RAM}m" jpayne@69: } jpayne@69: calcXmx "$@" jpayne@69: jpayne@69: sendsketch() { jpayne@69: local CMD="java $EA $EOOM $z -cp $CP sketch.SendSketch $@" jpayne@69: # echo $CMD >&2 jpayne@69: eval $CMD jpayne@69: } jpayne@69: jpayne@69: sendsketch "$@"