diff CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/share/man/man1/samtools.1 @ 68:5028fdace37b
planemo upload commit 2e9511a184a1ca667c7be0c6321a36dc4e3d116d
author | jpayne |
---|---|
date | Tue, 18 Mar 2025 16:23:26 -0400 |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/share/man/man1/samtools.1 Tue Mar 18 16:23:26 2025 -0400 @@ -0,0 +1,2192 @@ +'\" t +.TH samtools 1 "21 June 2017" "samtools-1.5" "Bioinformatics tools" +.SH NAME +samtools \- Utilities for the Sequence Alignment/Map (SAM) format +.\" +.\" Copyright (C) 2008-2011, 2013-2017 Genome Research Ltd. +.\" Portions copyright (C) 2010, 2011 Broad Institute. +.\" +.\" Author: Heng Li <lh3@sanger.ac.uk> +.\" Author: Joshua C. Randall <jcrandall@alum.mit.edu> +.\" +.\" Permission is hereby granted, free of charge, to any person obtaining a +.\" copy of this software and associated documentation files (the "Software"), +.\" to deal in the Software without restriction, including without limitation +.\" the rights to use, copy, modify, merge, publish, distribute, sublicense, +.\" and/or sell copies of the Software, and to permit persons to whom the +.\" Software is furnished to do so, subject to the following conditions: +.\" +.\" The above copyright notice and this permission notice shall be included in +.\" all copies or substantial portions of the Software. +.\" +.\" THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +.\" IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +.\" FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL +.\" THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +.\" LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING +.\" FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER +.\" DEALINGS IN THE SOFTWARE. +. +.\" For code blocks and examples (cf groff's Ultrix-specific man macros) +.de EX + +. in +\\$1 +. nf +. ft CR +.. +.de EE +. ft +. fi +. in + +.. +. 
+.SH SYNOPSIS +.PP +samtools view -bt ref_list.txt -o aln.bam aln.sam.gz +.PP +samtools sort -T /tmp/aln.sorted -o aln.sorted.bam aln.bam +.PP +samtools index aln.sorted.bam +.PP +samtools idxstats aln.sorted.bam +.PP +samtools flagstat aln.sorted.bam +.PP +samtools stats aln.sorted.bam +.PP +samtools bedcov aln.sorted.bam +.PP +samtools depth aln.sorted.bam +.PP +samtools view aln.sorted.bam chr2:20,100,000-20,200,000 +.PP +samtools merge out.bam in1.bam in2.bam in3.bam +.PP +samtools faidx ref.fasta +.PP +samtools tview aln.sorted.bam ref.fasta +.PP +samtools split merged.bam +.PP +samtools quickcheck in1.bam in2.cram +.PP +samtools dict -a GRCh38 -s "Homo sapiens" ref.fasta +.PP +samtools fixmate in.namesorted.sam out.bam +.PP +samtools mpileup -C50 -gf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam +.PP +samtools flags PAIRED,UNMAP,MUNMAP +.PP +samtools fastq input.bam > output.fastq +.PP +samtools fasta input.bam > output.fasta +.PP +samtools addreplacerg -r 'ID:fish' -r 'LB:1334' -r 'SM:alpha' -o output.bam input.bam +.PP +samtools collate aln.sorted.bam aln.name_collated.bam +.PP +samtools depad input.bam + +.SH DESCRIPTION +.PP +Samtools is a set of utilities that manipulate alignments in the BAM +format. It imports from and exports to the SAM (Sequence Alignment/Map) +format, does sorting, merging and indexing, and allows reads in any +region to be retrieved swiftly. + +Samtools is designed to work on a stream. It regards an input file `-' +as the standard input (stdin) and an output file `-' as the standard +output (stdout). Several commands can thus be combined with Unix +pipes. Samtools always outputs warning and error messages to the standard +error output (stderr). + +Samtools is also able to open a BAM (not SAM) file on a remote FTP or +HTTP server if the BAM file name starts with `ftp://' or `http://'. +Samtools checks the current working directory for the index file and +will download the index if it is absent.
Samtools does not retrieve the +entire alignment file unless it is asked to do so. + +.SH COMMANDS AND OPTIONS + +.TP 10 \"-------- view +.B view +samtools view +.RI [ options ] +.IR in.sam | in.bam | in.cram +.RI [ region ...] + +With no options or regions specified, prints all alignments in the specified +input alignment file (in SAM, BAM, or CRAM format) to standard output +in SAM format (with no header). + +You may specify one or more space-separated region specifications after the +input filename to restrict output to only those alignments which overlap the +specified region(s). Use of region specifications requires a coordinate-sorted +and indexed input file (in BAM or CRAM format). + +The +.BR -b , +.BR -C , +.BR -1 , +.BR -u , +.BR -h , +.BR -H , +and +.B -c +options change the output format from the default of headerless SAM, and the +.B -o +and +.B -U +options set the output file name(s). + +The +.B -t +and +.B -T +options provide additional reference data. One of these two options is required +when SAM input does not contain @SQ headers, and the +.B -T +option is required whenever writing CRAM output. + +The +.BR -L , +.BR -r , +.BR -R , +.BR -s , +.BR -q , +.BR -l , +.BR -m , +.BR -f , +.BR -F , +and +.B -G +options filter the alignments that will be included in the output to only those +alignments that match certain criteria. + +The +.B -x +and +.B -B +options modify the data which is contained in each alignment. + +Finally, the +.B -@ +option can be used to allocate additional threads to be used for compression, and the +.B -? +option requests a long help message. + +.TP +.B REGIONS: +.RS +Regions can be specified as: RNAME[:STARTPOS[-ENDPOS]] and all position +coordinates are 1-based. + +Important note: when multiple regions are given, some alignments may be output +multiple times if they overlap more than one of the specified regions. 
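As an illustration of the region syntax described above, the following Python sketch parses RNAME[:STARTPOS[-ENDPOS]] strings and shows why an alignment can be reported once per overlapping region. The names `Region`, `parse_region` and `overlaps` are hypothetical helpers, not part of samtools, and the sketch assumes that reference names contain no colon and that thousands separators (as used in the SYNOPSIS) may simply be stripped.

```python
from typing import NamedTuple, Optional


class Region(NamedTuple):
    rname: str
    start: Optional[int]  # 1-based, inclusive; None means "from the first base"
    end: Optional[int]    # 1-based, inclusive; None means "to the end of the reference"


def parse_region(spec: str) -> Region:
    """Parse RNAME[:STARTPOS[-ENDPOS]] (illustrative helper, not samtools code)."""
    rname, sep, coords = spec.partition(":")
    if not sep:
        return Region(rname, None, None)
    coords = coords.replace(",", "")  # accept 20,100,000 as written in the SYNOPSIS
    start_text, dash, end_text = coords.partition("-")
    start = int(start_text)
    end = int(end_text) if dash else None
    if start < 1 or (end is not None and end < start):
        raise ValueError(f"invalid region: {spec!r}")
    return Region(rname, start, end)


def overlaps(region: Region, rname: str, aln_start: int, aln_end: int) -> bool:
    """True if a 1-based, inclusive alignment span overlaps the region."""
    if rname != region.rname:
        return False
    lo = region.start if region.start is not None else 1
    hi = region.end if region.end is not None else aln_end
    return aln_start <= hi and aln_end >= lo


# An alignment spanning chr3:1500-2500 overlaps both regions below, so a
# per-region scan would report it twice.
regions = [parse_region("chr3:1000-2000"), parse_region("chr3:2001-3000")]
hits = sum(overlaps(r, "chr3", 1500, 2500) for r in regions)
print(hits)
```

Callers that need each alignment only once would have to de-duplicate the output themselves or pass non-overlapping, well-separated regions.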
+ +Examples of region specifications: +.TP 10 +.B chr1 +Output all alignments mapped to the reference sequence named `chr1' (i.e. @SQ SN:chr1). +.TP +.B chr2:1000000 +The region on chr2 beginning at base position 1,000,000 and ending at the +end of the chromosome. +.TP +.B chr3:1000-2000 +The 1001bp region on chr3 beginning at base position 1,000 and ending at base +position 2,000 (including both end positions). +.TP +.B '*' +Output the unmapped reads at the end of the file. +(This does not include any unmapped reads placed on a reference sequence +alongside their mapped mates.) +.TP +.B . +Output all alignments. +(Mostly unnecessary as not specifying a region at all has the same effect.) +.RE + +.B OPTIONS: +.RS +.TP 10 +.B -b +Output in the BAM format. +.TP +.B -C +Output in the CRAM format (requires -T). +.TP +.B -1 +Enable fast BAM compression (implies -b). +.TP +.B -u +Output uncompressed BAM. This option saves time spent on +compression/decompression and is thus preferred when the output is piped +to another samtools command. +.TP +.B -h +Include the header in the output. +.TP +.B -H +Output the header only. +.TP +.B -c +Instead of printing the alignments, only count them and print the +total number. All filter options, such as +.BR -f , +.BR -F , +and +.BR -q , +are taken into account. +.TP +.B -? +Output long help and exit immediately. +.TP +.BI "-o " FILE +Output to +.I FILE [stdout]. +.TP +.BI "-U " FILE +Write alignments that are +.I not +selected by the various filter options to +.IR FILE . +When this option is used, all alignments (or all alignments intersecting the +.I regions +specified) are written to either the output file or this file, but never both. +.TP +.BI "-t " FILE +A tab-delimited +.IR FILE . +Each line must contain the reference name in the first column and the length of +the reference in the second column, with one line for each distinct reference. +Any additional fields beyond the second column are ignored. 
This file also +defines the order of the reference sequences in sorting. If you run: +`samtools faidx <ref.fa>', the resulting index file +.I <ref.fa>.fai +can be used as this +.IR FILE . +.TP +.BI "-T " FILE +A FASTA format reference +.IR FILE , +optionally compressed by +.B bgzip +and ideally indexed by +.B samtools +.BR faidx . +If an index is not present, one will be generated for you. +.TP +.BI "-L " FILE +Only output alignments overlapping the input BED +.I FILE +[null]. +.TP +.BI "-r " STR +Only output alignments in read group +.I STR +[null]. +.TP +.BI "-R " FILE +Output alignments in read groups listed in +.I FILE +[null]. +.TP +.BI "-q " INT +Skip alignments with MAPQ smaller than +.I INT +[0]. +.TP +.BI "-l " STR +Only output alignments in library +.I STR +[null]. +.TP +.BI "-m " INT +Only output alignments with number of CIGAR bases consuming query +sequence \(>= +.I INT +[0] +.TP +.BI "-f " INT +Only output alignments with all bits set in +.I INT +present in the FLAG field. +.I INT +can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) +or in octal by beginning with `0' (i.e. /^0[0-7]+/) [0]. +.TP +.BI "-F " INT +Do not output alignments with any bits set in +.I INT +present in the FLAG field. +.I INT +can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) +or in octal by beginning with `0' (i.e. /^0[0-7]+/) [0]. +.TP +.BI "-G " INT +Do not output alignments with all bits set in +.I INT +present in the FLAG field. This is the opposite of \fI-f\fR such +that \fI-f12 -G12\fR is the same as no filtering at all. +.I INT +can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) +or in octal by beginning with `0' (i.e. /^0[0-7]+/) [0]. +.TP +.BI "-x " STR +Read tag to exclude from output (repeatable) [null] +.TP +.B -B +Collapse the backward CIGAR operation. +.TP +.BI "-s " FLOAT +Output only a proportion of the input alignments. 
+This subsampling acts in the same way on all of the alignment records in +the same template or read pair, so it never keeps a read but not its mate. +.IP +The integer and fractional parts of the +.BI "-s " INT . FRAC +option are used separately: the part after the +decimal point sets the fraction of templates/pairs to be kept, +while the integer part is used as a seed that influences +.I which +subset of reads is kept. +.IP +.\" Reads are retained based on a score computed by hashing their QNAME +.\" field and the seed value. +When subsampling data that has previously been subsampled, be sure to use +a different seed value from those used previously; otherwise more reads +will be retained than expected. +.TP +.BI "-@ " INT +Number of BAM compression threads to use in addition to main thread [0]. +.TP +.B -S +Ignored for compatibility with previous samtools versions. +Previously this option was required if input was in SAM format, but now the +correct format is automatically detected by examining the first few characters +of input. +.RE + +.TP \"-------- sort +.B sort +.na +samtools sort +.RB [ -l +.IR level ] +.RB [ -m +.IR maxMem ] +.RB [ -o +.IR out.bam ] +.RB [ -O +.IR format ] +.RB [ -n ] +.RB [ -t +.IR tag ] +.RB [ -T +.IR tmpprefix ] +.RB [ -@ +.IR threads "] [" in.sam | in.bam | in.cram ] +.ad + +Sort alignments by leftmost coordinates, or by read name when +.B -n +is used. +An appropriate +.B @HD-SO +sort order header tag will be added or an existing one updated if necessary. + +The sorted output is written to standard output by default, or to the +specified file +.RI ( out.bam ) +when +.B -o +is used. +This command will also create temporary files +.IB tmpprefix . %d .bam +as needed when the entire alignment data cannot fit into memory +(as controlled via the +.B -m +option). 
+ +.B Options: +.RS +.TP 11 +.BI "-l " INT +Set the desired compression level for the final output file, ranging from 0 +(uncompressed) or 1 (fastest but minimal compression) to 9 (best compression +but slowest to write), similarly to +.BR gzip (1)'s +compression level setting. +.IP +If +.B -l +is not used, the default compression level will apply. +.TP +.BI "-m " INT +Approximately the maximum required memory per thread, specified either in bytes +or with a +.BR K ", " M ", or " G +suffix. +[768 MiB] +.IP +To prevent sort from creating a huge number of temporary files, it enforces a +minimum value of 1M for this setting. +.TP +.B -n +Sort by read names (i.e., the +.B QNAME +field) rather than by chromosomal coordinates. +.TP +.BI "-t " TAG +Sort first by the value in the alignment tag TAG, then by position or name (if +also using \fB-n\fP). +.TP +.BI "-o " FILE +Write the final sorted output to +.IR FILE , +rather than to standard output. +.TP +.BI "-O " FORMAT +Write the final output as +.BR sam ", " bam ", or " cram . + +By default, samtools tries to select a format based on the +.B -o +filename extension; if output is to standard output or no format can be +deduced, +.B bam +is selected. +.TP +.BI "-T " PREFIX +Write temporary files to +.IB PREFIX . nnnn .bam, +or if the specified +.I PREFIX +is an existing directory, to +.IB PREFIX /samtools. mmm . mmm .tmp. nnnn .bam, +where +.I mmm +is unique to this invocation of the +.B sort +command. +.IP +By default, any temporary files are written alongside the output file, as +.IB out.bam .tmp. nnnn .bam, +or if output is to standard output, in the current directory as +.BI samtools. mmm . mmm .tmp. nnnn .bam. +.TP +.BI "-@ " INT +Set number of sorting and compression threads. +By default, operation is single-threaded. +.PP +.B Ordering Rules + +The following rules are used for ordering records.
+ +If option \fB-t\fP is in use, records are first sorted by the value of +the given alignment tag, and then by position or name (if using \fB-n\fP). +For example, \*(lq-t RG\*(rq will make read group the primary sort key. The +rules for ordering by tag are: + +.IP \(bu 4 +Records that do not have the tag are sorted before ones that do. +.IP \(bu 4 +If the types of the tags are different, they will be sorted so +that single character tags (type A) come before array tags (type B), then +string tags (types H and Z), then numeric tags (types f and i). +.IP \(bu 4 +Numeric tags (types f and i) are compared by value. Note that comparisons +of floating-point values are subject to issues of rounding and precision. +.IP \(bu 4 +String tags (types H and Z) are compared based on the binary +contents of the tag using the C +.BR strcmp (3) +function. +.IP \(bu 4 +Character tags (type A) are compared by binary character value. +.IP \(bu 4 +No attempt is made to compare tags of other types \(em notably type B +array values will not be compared. +.PP +When the \fB-n\fP option is present, records are sorted by name. Names are +compared so as to give a \*(lqnatural\*(rq ordering \(em i.e. sections +consisting of digits are compared numerically while all other sections are +compared based on their binary representation. This means \*(lqa1\*(rq will +come before \*(lqb1\*(rq and \*(lqa9\*(rq will come before \*(lqa10\*(rq. +Records with the same name will be ordered according to the values of +the READ1 and READ2 flags (see +.BR flags ). + +When the \fB-n\fP option is +.B not +present, reads are sorted by reference (according to the order of the @SQ +header records), then by position in the reference, and then by the REVERSE +flag. + +.B Note + +.PP +Historically +.B samtools sort +also accepted a less flexible way of specifying the final and +temporary output filenames: +.IP +samtools sort +.RB [ -f "] [" -o ] +.I in.bam out.prefix +.PP +This has now been removed. 
+The previous \fIout.prefix\fP argument (and \fB-f\fP option, if any) +should be changed to an appropriate combination of \fB-T\fP \fIPREFIX\fP +and \fB-o\fP \fIFILE\fP. The previous \fB-o\fP option should be removed, +as output defaults to standard output. +.RE + +.TP \"-------- index +.B index +samtools index +.RB [ -bc ] +.RB [ -m +.IR INT ] +.IR aln.bam | aln.cram +.RI [ out.index ] + +Index a coordinate-sorted BAM or CRAM file for fast random access. +(Note that this does not work with SAM files even if they are bgzip +compressed \(em to index such files, use tabix(1) instead.) + +This index is needed when +.I region +arguments are used to limit +.B samtools view +and similar commands to particular regions of interest. + +If an output filename is given, the index file will be written to +.IR out.index . +Otherwise, for a CRAM file +.IR aln.cram , +index file +.IB aln.cram .crai +will be created; for a BAM file +.IR aln.bam , +either +.IB aln.bam .bai +or +.IB aln.bam .csi +will be created, depending on the index format selected. + +.B Options: +.RS +.TP 8 +.B -b +Create a BAI index. +This is currently the default when no format options are used. +.TP +.B -c +Create a CSI index. +By default, the minimum interval size for the index is 2^14, which is the same +as the fixed value used by the BAI format. +.TP +.BI "-m " INT +Create a CSI index, with a minimum interval size of 2^INT. +.RE + +.TP \"-------- idxstats +.B idxstats +samtools idxstats +.IR in.sam | in.bam | in.cram + +Retrieve and print stats in the index file corresponding to the input file. +Before calling idxstats, the input BAM file must be indexed by samtools index. + +The output is TAB-delimited with each line consisting of reference sequence +name, sequence length, # mapped reads and # unmapped reads. It is written to +stdout. 
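The four-column, TAB-delimited idxstats output described above is easy to consume from a script. The following Python sketch is one possible reader; `parse_idxstats` is a hypothetical helper and the sample text is made-up data used only to show the column layout (reference name, sequence length, mapped reads, unmapped reads).

```python
from typing import Dict, Tuple


def parse_idxstats(text: str) -> Dict[str, Tuple[int, int, int]]:
    """Map reference name -> (length, mapped reads, unmapped reads)."""
    stats: Dict[str, Tuple[int, int, int]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        name, length, mapped, unmapped = line.split("\t")
        stats[name] = (int(length), int(mapped), int(unmapped))
    return stats


# Made-up sample in the documented layout; not output from a real file.
sample = "chr1\t1000\t90\t2\nchr2\t500\t10\t0\n"
stats = parse_idxstats(sample)
total_mapped = sum(mapped for _, mapped, _ in stats.values())
print(total_mapped)
```

In practice the text could be captured with something like `subprocess.run(["samtools", "idxstats", "aln.sorted.bam"], capture_output=True, text=True, check=True).stdout`, assuming samtools is on the PATH and the BAM file is indexed.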
+ +.TP \"-------- flagstat +.B flagstat +samtools flagstat +.IR in.sam | in.bam | in.cram + +Does a full pass through the input file to calculate and print statistics +to stdout. + +Provides counts for each of 13 categories based primarily on bit flags in +the FLAG field. Each category in the output is broken down into QC pass and +QC fail, which is presented as "#PASS + #FAIL" followed by a description of +the category. + +The first row of output gives the total number of reads that are QC pass and +fail (according to flag bit 0x200). For example: + + 122 + 28 in total (QC-passed reads + QC-failed reads) + +Which would indicate that there are a total of 150 reads in the input file, +122 of which are marked as QC pass and 28 of which are marked as "not passing +quality controls" + +Following this, additional categories are given for reads which are: + +.RS 18 +.TP +secondary +0x100 bit set +.TP +supplementary +0x800 bit set +.TP +duplicates +0x400 bit set +.TP +mapped +0x4 bit not set +.TP +paired in sequencing +0x1 bit set +.TP +read1 +both 0x1 and 0x40 bits set +.TP +read2 +both 0x1 and 0x80 bits set +.TP +properly paired +both 0x1 and 0x2 bits set and 0x4 bit not set +.TP +with itself and mate mapped +0x1 bit set and neither 0x4 nor 0x8 bits set +.TP +singletons +both 0x1 and 0x8 bits set and bit 0x4 not set +.RE + +.RS 10 +And finally, two rows are given that additionally filter on the reference +name (RNAME), mate reference name (MRNM), and mapping quality (MAPQ) fields: +.RE + +.RS 18 +.TP +with mate mapped to a different chr +0x1 bit set and neither 0x4 nor 0x8 bits set and MRNM not equal to RNAME +.TP +with mate mapped to a different chr (mapQ>=5) +0x1 bit set and neither 0x4 nor 0x8 bits set +and MRNM not equal to RNAME and MAPQ >= 5 +.RE + +.TP \"-------- stats +.B stats +samtools stats +.RI [ options ] +.IR in.sam | in.bam | in.cram +.RI [ region ...] + +samtools stats collects statistics from BAM files and outputs in a text format. 
+The output can be visualized graphically using plot-bamstats. + +.B Options: +.RS +.TP 8 +.BI "-c, --coverage " MIN , MAX , STEP +Set coverage distribution to the specified range (MIN, MAX, STEP all given as integers) +[1,1000,1] +.TP +.B -d, --remove-dups +Exclude from statistics reads marked as duplicates +.TP +.BI "-f, --required-flag " STR "|" INT +Required flag, 0 for unset. See also `samtools flags` +[0] +.TP +.BI "-F, --filtering-flag " STR "|" INT +Filtering flag, 0 for unset. See also `samtools flags` +[0] +.TP +.BI "--GC-depth " FLOAT +the size of GC-depth bins (decreasing bin size increases memory requirement) +[2e4] +.TP +.B -h, --help +This help message +.TP +.BI "-i, --insert-size " INT +Maximum insert size +[8000] +.TP +.BI "-I, --id " STR +Include only listed read group or sample name +[] +.TP +.BI "-l, --read-length " INT +Include in the statistics only reads with the given read length +[] +.TP +.BI "-m, --most-inserts " FLOAT +Report only the main part of inserts +[0.99] +.TP +.BI "-P, --split-prefix " STR +A path or string prefix to prepend to filenames output when creating +categorised statistics files with +.BR -S / --split . +[input filename] +.TP +.BI "-q, --trim-quality " INT +The BWA trimming parameter +[0] +.TP +.BI "-r, --ref-seq " FILE +Reference sequence (required for GC-depth and mismatches-per-cycle calculation). +[] +.TP +.BI "-S, --split " TAG +In addition to the complete statistics, also output categorised statistics +based on the tagged field +.I TAG +(e.g., use +.B --split RG +to split into read groups). + +Categorised statistics are written to files named +.RI < prefix >_< value >.bamstat, +where +.I prefix +is as given by +.B --split-prefix +(or the input filename by default) and +.I value +has been encountered as the specified tagged field's value in one or more +alignment records. +.TP +.BI "-t, --target-regions " FILE +Do stats in these regions only. Tab-delimited file chr,from,to, 1-based, inclusive. 
+[] +.TP +.B "-x, --sparse" +Suppress outputting IS rows where there are no insertions. +.RE + +.TP \"-------- bedcov +.B bedcov +samtools bedcov +.RI [ options ] +.IR region.bed " " in1.sam | in1.bam | in1.cram "[...]" + +Reports the total read base count (i.e. the sum of per base read depths) +for each genomic region specified in the supplied BED file. +Counts for each alignment file supplied are reported in separate columns. + +.B Options: +.RS +.TP +.BI "-Q " INT +.RI "Only count reads with mapping quality greater than " INT +.RE + +.TP \"-------- depth +.B depth +samtools depth +.RI [ options ] +.RI "[" in1.sam | in1.bam | in1.cram " [" in2.sam | in2.bam | in2.cram "] [...]]" + +Computes the depth at each position or region. + +.B Options: +.RS +.TP 8 +.B -a +Output all positions (including those with zero depth) +.TP +.B -a -a, -aa +Output absolutely all positions, including unused reference sequences. +Note that when used in conjunction with a BED file the -a option may +sometimes operate as if -aa was specified if the reference sequence +has coverage outside of the region specified in the BED file. +.TP +.BI "-b " FILE +.RI "Compute depth at list of positions or regions in specified BED " FILE. +[] +.TP +.BI "-f " FILE +.RI "Use the BAM files specified in the " FILE +(a file of filenames, one file per line) +[] +.TP +.BI "-l " INT +.RI "Ignore reads shorter than " INT +.TP +.BI "-m, -d " INT +.RI "Truncate reported depth at a maximum of " INT " reads." +[8000] +.TP +.BI "-q " INT +.RI "Only count reads with base quality greater than " INT +.TP +.BI "-Q " INT +.RI "Only count reads with mapping quality greater than " INT +.TP +.BI "-r " CHR ":" FROM "-" TO +Only report depth in specified region. +.RE + +.TP \"-------- merge +.B merge +samtools merge [-nur1f] [-h inh.sam] [-R reg] [-b <list>] <out.bam> <in1.bam> [<in2.bam> <in3.bam> ... 
<inN.bam>] + +Merge multiple sorted alignment files, producing a single sorted output file +that contains all the input records and maintains the existing sort order. + +If +.BR -h +is specified the @SQ headers of input files will be merged into the specified header, otherwise they will be merged +into a composite header created from the input headers. If in the process of merging @SQ lines for coordinate sorted +input files, a conflict arises as to the order (for example input1.bam has @SQ for a,b,c and input2.bam has b,a,c) +then the resulting output file will need to be re-sorted back into coordinate order. + +Unless the +.BR -c +or +.BR -p +flags are specified then when merging @RG and @PG records into the output header then any IDs found to be duplicates +of existing IDs in the output header will have a suffix appended to them to differentiate them from similar header +records from other files and the read records will be updated to reflect this. + +The ordering of the records in the input files must match the usage of the +\fB-n\fP and \fB-t\fP command-line options. If they do not, the output +order will be undefined. See +.B sort +for information about record ordering. + +.B OPTIONS: +.RS +.TP 8 +.B -1 +Use zlib compression level 1 to compress the output. +.TP +.BI -b \ FILE +List of input BAM files, one file per line. +.TP +.B -f +Force to overwrite the output file if present. +.TP 8 +.BI -h \ FILE +Use the lines of +.I FILE +as `@' headers to be copied to +.IR out.bam , +replacing any header lines that would otherwise be copied from +.IR in1.bam . +.RI ( FILE +is actually in SAM format, though any alignment records it may contain +are ignored.) +.TP +.B -n +The input alignments are sorted by read names rather than by chromosomal +coordinates +.TP +.B -t TAG +The input alignments have been sorted by the value of TAG, then by either +position or name (if \fB-n\fP is given). 
+.TP +.BI -R \ STR +Merge files in the specified region indicated by +.I STR +[null] +.TP +.B -r +Attach an RG tag to each alignment. The tag value is inferred from file names. +.TP +.B -u +Uncompressed BAM output +.TP +.B -c +When several input files contain @RG headers with the same ID, emit only one +of them (namely, the header line from the first file we find that ID in) to +the merged output file. +Combining these similar headers is usually the right thing to do when the +files being merged originated from the same file. + +Without \fB-c\fP, all @RG headers appear in the output file, with random +suffixes added to their IDs where necessary to differentiate them. +.TP +.B -p +Similarly, for each @PG ID in the set of files to merge, use the @PG line +of the first file we find that ID in rather than adding a suffix to +differentiate similar IDs. +.RE + +.TP \"-------- faidx +.B faidx +samtools faidx <ref.fasta> [region1 [...]] + +Index reference sequence in the FASTA format or extract subsequence from +indexed reference sequence. If no region is specified, +.B faidx +will index the file and create +.I <ref.fasta>.fai +on the disk. If regions are specified, the subsequences will be +retrieved and printed to stdout in the FASTA format. + +The input file can be compressed in the +.B BGZF +format. + +The sequences in the input file should all have different names. +If they do not, indexing will emit a warning about duplicate sequences and +retrieval will only produce subsequences from the first sequence with the +duplicated name. + +.TP \"-------- tview +.B tview +samtools tview +.RB [ -p +.IR chr:pos ] +.RB [ -s +.IR STR ] +.RB [ -d +.IR display ] +.RI <in.sorted.bam> +.RI [ref.fasta] + +Text alignment viewer (based on the ncurses library). In the viewer, +press `?' for help and press `g' to check the alignment start from a +region in the format like `chr10:10,000,000' or `=10,000,000' when +viewing the same reference sequence. 
+ +.B Options: +.RS +.TP 14 +.BI -d \ display +Output as (H)tml or (C)urses or (T)ext +.TP +.BI -p \ chr:pos +Go directly to this position +.TP +.BI -s \ STR +Display only alignments from this sample or read group +.RE + +.TP \"-------- split +.B split +samtools split +.RI [ options ] +.IR merged.sam | merged.bam | merged.cram + +Splits a file by read group. + +.B Options: +.RS +.TP 14 +.BI "-u " FILE1 +.RI "Put reads with no RG tag or an unrecognised RG tag into " FILE1 +.TP +.BI "-u " FILE1 ":" FILE2 +.RI "As above, but assigns an RG tag as given in the header of " FILE2 +.TP +.BI "-f " STRING +Output filename format string (see below) +["%*_%#.%."] +.TP +.B -v +Verbose output +.PP +Format string expansions: +.TS +center; +lb l . +%% % +%* basename +%# @RG index +%! @RG ID +%. output format filename extension +.TE +.RE + +.TP \"-------- quickcheck +.B quickcheck +samtools quickcheck +.RI [ options ] +.IR in.sam | in.bam | in.cram +[ ... ] + +Quickly check that input files appear to be intact. Checks that beginning of the +file contains a valid header (all formats) containing at least one target +sequence and then seeks to the end of the file and checks that an end-of-file +(EOF) is present and intact (BAM only). + +Data in the middle of the file is not read since that would be much more time +consuming, so please note that this command will not detect internal corruption, +but is useful for testing that files are not truncated before performing more +intensive tasks on them. + +This command will exit with a non-zero exit code if any input files don't have a +valid header or are missing an EOF block. Otherwise it will exit successfully +(with a zero exit code). + +.B Options: +.RS +.TP 8 +.B -v +Verbose output: will additionally print the names of all input files that don't +pass the check to stdout. Multiple -v options will cause additional messages +regarding check results to be printed to stderr. 
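To make the BAM end-of-file check described above concrete, the following Python sketch tests whether a file ends with the 28-byte empty BGZF block that marks the end of a BAM file. The marker bytes are taken from the BGZF section of the SAM/BAM specification, not from this manual page, and `has_bgzf_eof` is a hypothetical helper. It approximates only the EOF part of the check (no header validation) and is not the quickcheck implementation.

```python
import os

# Empty BGZF block used as the end-of-file marker (per the SAM/BAM specification).
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00"
    " 1b 00 03 00 00 00 00 00 00 00 00 00"
)


def has_bgzf_eof(path: str) -> bool:
    """Return True if the file ends with the BGZF EOF marker block."""
    if os.path.getsize(path) < len(BGZF_EOF):
        return False
    with open(path, "rb") as fh:
        fh.seek(-len(BGZF_EOF), os.SEEK_END)  # jump to the last 28 bytes
        return fh.read() == BGZF_EOF
```

Like quickcheck itself, a check of this kind only detects truncation; data in the middle of the file is never read.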
+.RE + +.TP \"-------- dict +.B dict +samtools dict <ref.fasta|ref.fasta.gz> + +Create a sequence dictionary file from a fasta file. + +.B OPTIONS: +.RS +.TP 11 +.BI -a,\ --assembly \ STR +Specify the assembly for the AS tag. +.TP +.B -H,\ --no-header +Do not print the @HD header line. +.TP +.BI -o,\ --output \ FILE +Output to +.I FILE +[stdout]. +.TP +.BI -s,\ --species \ STR +Specify the species for the SP tag. +.TP +.BI -u,\ --uri \ STR +Specify the URI for the UR tag. Defaults to +the absolute path of +.I ref.fasta +unless reading from stdin. +.RE + +.TP \"-------- fixmate +.B fixmate +.na +samtools fixmate +.RB [ -rpc ] +.RB [ -O +.IR format ] +.I in.nameSrt.bam out.bam +.ad + +Fill in mate coordinates, ISIZE and mate related flags from a +name-sorted alignment. + +.B OPTIONS: +.RS +.TP 11 +.B -r +Remove secondary and unmapped reads. +.TP +.B -p +Disable FR proper pair check. +.TP +.B -c +Add template cigar ct tag. +.TP +.BI "-O " FORMAT +Write the final output as +.BR sam ", " bam ", or " cram . + +By default, samtools tries to select a format based on the output +filename extension; if output is to standard output or no format can be +deduced, +.B bam +is selected. +.RE + +.TP \"-------- mpileup +.B mpileup +samtools mpileup +.RB [ -EBugp ] +.RB [ -C +.IR capQcoef ] +.RB [ -r +.IR reg ] +.RB [ -f +.IR in.fa ] +.RB [ -l +.IR list ] +.RB [ -Q +.IR minBaseQ ] +.RB [ -q +.IR minMapQ ] +.I in.bam +.RI [ in2.bam +.RI [ ... ]] + +Generate VCF, BCF or pileup for one or multiple BAM files. Alignment records +are grouped by sample (SM) identifiers in @RG header lines. If sample +identifiers are absent, each input file is regarded as one sample. + +In the pileup format (without +.BR -u \ or \ -g ), +each +line represents a genomic position, consisting of chromosome name, +1-based coordinate, reference base, the number of reads covering the site, +read bases, base qualities and alignment +mapping qualities. 
Information on match, mismatch, indel, strand, +mapping quality and start and end of a read are all encoded at the read +base column. At this column, a dot stands for a match to the reference +base on the forward strand, a comma for a match on the reverse strand, +a '>' or '<' for a reference skip, `ACGTN' for a mismatch on the forward +strand and `acgtn' for a mismatch on the reverse strand. A pattern +`\\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion between this +reference position and the next reference position. The length of the +insertion is given by the integer in the pattern, followed by the +inserted sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+' +represents a deletion from the reference. The deleted bases will be +presented as `*' in the following lines. Also at the read base column, a +symbol `^' marks the start of a read. The ASCII of the character +following `^' minus 33 gives the mapping quality. A symbol `$' marks the +end of a read segment. + +Note that there are two orthogonal ways to specify locations in the +input file; via \fB-r\fR \fIregion\fR and \fB-l\fR \fIfile\fR. The +former uses (and requires) an index to do random access while the +latter streams through the file contents filtering out the specified +regions, requiring no index. The two may be used in conjunction. For +example a BED file containing locations of genes in chromosome 20 +could be specified using \fB-r 20 -l chr20.bed\fR, meaning that the +index is used to find chromosome 20 and then it is filtered for the +regions listed in the bed file. + +.B Input Options: +.RS +.TP 10 +.B -6, --illumina1.3+ +Assume the quality is in the Illumina 1.3+ encoding. +.TP +.B -A, --count-orphans +Do not skip anomalous read pairs in variant calling. +.TP +.BI -b,\ --bam-list \ FILE +List of input BAM files, one file per line [null] +.TP +.B -B, --no-BAQ +Disable probabilistic realignment for the computation of base alignment +quality (BAQ). 
BAQ is the Phred-scaled probability of a read base being +misaligned. Applying BAQ greatly helps to reduce false SNPs +caused by misalignments. +.TP +.BI -C,\ --adjust-MQ \ INT +Coefficient for downgrading mapping quality for reads containing +excessive mismatches. Given a read with a phred-scaled probability q of +being generated from the mapped position, the new mapping quality is +about sqrt((INT-q)/INT)*INT. A zero value disables this +functionality; if enabled, the recommended value for BWA is 50. [0] +.TP +.BI -d,\ --max-depth \ INT +At a position, read at most +.I INT +reads per input file. Note that samtools has a minimum value of +.I 8000/n +where +.I n +is the number of input files given to mpileup. This means the default +is highly likely to be increased. Once above the cross-sample minimum of +8000 the -d parameter will have an effect. [250] +.TP +.B -E, --redo-BAQ +Recalculate BAQ on the fly, ignoring existing BQ tags +.TP +.BI -f,\ --fasta-ref \ FILE +The +.BR faidx -indexed +reference file in the FASTA format. The file can be optionally compressed by +.BR bgzip . +[null] +.TP +.BI -G,\ --exclude-RG \ FILE +Exclude reads from read groups listed in FILE (one @RG-ID per line) +.TP +.BI -l,\ --positions \ FILE +BED or position list file containing a list of regions or sites where +pileup or BCF should be generated. Position list files contain two +columns (chromosome and position) and start counting from 1. BED +files contain at least 3 columns (chromosome, start and end position) +and are 0-based half-open. +.br +While it is possible to mix both position-list and BED coordinates in +the same file, this is strongly discouraged due to the differing +coordinate systems. [null] +.TP +.BI -q,\ --min-MQ \ INT +Minimum mapping quality for an alignment to be used [0] +.TP +.BI -Q,\ --min-BQ \ INT +Minimum base quality for a base to be considered [13] +.TP +.BI -r,\ --region \ STR +Only generate pileup in region. Requires the BAM files to be indexed.
+If used in conjunction with -l then considers the intersection of the +two requests. +.I STR +[all sites] +.TP +.B -R,\ --ignore-RG +Ignore RG tags. Treat all reads in one BAM as one sample. +.TP +.BI --rf,\ --incl-flags \ STR|INT +Required flags: skip reads with mask bits unset [null] +.TP +.BI --ff,\ --excl-flags \ STR|INT +Filter flags: skip reads with mask bits set +[UNMAP,SECONDARY,QCFAIL,DUP] +.TP +.B -x,\ --ignore-overlaps +Disable read-pair overlap detection. +.PP +.B Output Options: +.TP 10 +.BI "-o, --output " FILE +Write pileup or VCF/BCF output to +.IR FILE , +rather than the default of standard output. + +(The same short option is used for both +.B --open-prob +and +.BR --output . +If +.BR -o 's +argument contains any non-digit characters other than a leading + or - sign, +it is interpreted as +.BR --output . +Usually the filename extension will take care of this, but to write to an +entirely numeric filename use +.B -o ./123 +or +.BR "--output 123" .) +.TP +.B -g,\ --BCF +Compute genotype likelihoods and output them in the binary call format (BCF). +As of v1.0, this is BCF2 which is incompatible with the BCF1 format produced +by previous (0.1.x) versions of samtools. +.TP +.B -v,\ --VCF +Compute genotype likelihoods and output them in the variant call format (VCF). +Output is bgzip-compressed VCF unless +.B -u +option is set. +.PP +.B Output Options for mpileup format (without -g or -v): +.TP 10 +.B -O, --output-BP +Output base positions on reads. +.TP +.B -s, --output-MQ +Output mapping quality. +.TP +.B -a +Output all positions, including those with zero depth. +.TP +.B -a -a, -aa +Output absolutely all positions, including unused reference sequences. +Note that when used in conjunction with a BED file the -a option may +sometimes operate as if -aa was specified if the reference sequence +has coverage outside of the region specified in the BED file. 
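The rule that decides whether the argument of -o means --output or --open-prob, as described under Output Options above, can be sketched in plain shell. This is only an illustration of the stated rule, not samtools source code; the function name classify_o_arg is hypothetical, and treating an empty argument or a bare sign as a filename is an assumption, since the text does not cover that case.

```shell
# Hypothetical helper (not part of samtools): report which long option
# an argument given to "mpileup -o" would select under the stated rule.
classify_o_arg() {
    # Strip at most one leading "+" or "-" sign.
    rest=${1#[+-]}
    case $rest in
        # Assumption: an empty remainder is treated as a filename.
        ''|*[!0-9]*) echo "--output" ;;     # contains a non-digit character
        *)           echo "--open-prob" ;;  # digits only
    esac
}

classify_o_arg 40
classify_o_arg out.vcf
classify_o_arg ./123
```

The ./123 case mirrors the workaround given in the text for writing to an entirely numeric filename.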
+.PP +.B Output Options for VCF/BCF format (with -g or -v): +.TP 10 +.B -D +Output per-sample read depth [DEPRECATED - use +.B -t DP +instead] +.TP +.B -S +Output per-sample Phred-scaled strand bias P-value [DEPRECATED - use +.B -t SP +instead] +.TP +.BI -t,\ --output-tags \ LIST +Comma-separated list of FORMAT and INFO tags to output (case-insensitive): +.B AD +(Allelic depth, FORMAT), +.B INFO/AD +(Total allelic depth, INFO), +.B ADF +(Allelic depths on the forward strand, FORMAT), +.B INFO/ADF +(Total allelic depths on the forward strand, INFO), +.B ADR +(Allelic depths on the reverse strand, FORMAT), +.B INFO/ADR +(Total allelic depths on the reverse strand, INFO), +.B DP +(Number of high-quality bases, FORMAT), +.B DV +(Deprecated in favor of AD; Number of high-quality non-reference bases, FORMAT), +.B DPR +(Deprecated in favor of AD; Number of high-quality bases for each observed allele, FORMAT), +.B INFO/DPR +(Number of high-quality bases for each observed allele, INFO), +.B DP4 +(Deprecated in favor of ADF and ADR; Number of high-quality ref-forward, ref-reverse, alt-forward and alt-reverse bases, FORMAT), +.B SP +(Phred-scaled strand bias P-value, FORMAT) +[null] +.TP +.B -u,\ --uncompressed +Generate uncompressed VCF/BCF output, which is preferred for piping. +.TP +.B -V +Output per-sample number of non-reference reads [DEPRECATED - use +.B -t DV +instead] +.PP +.B Options for SNP/INDEL Genotype Likelihood Computation (for -g or -v): +.TP 10 +.BI -e,\ --ext-prob \ INT +Phred-scaled gap extension sequencing error probability. Reducing +.I INT +leads to longer indels. [20] +.TP +.BI -F,\ --gap-frac \ FLOAT +Minimum fraction of gapped reads [0.002] +.TP +.BI -h,\ --tandem-qual \ INT +Coefficient for modeling homopolymer errors. Given an +.IR l -long +homopolymer +run, the sequencing error of an indel of size +.I s +is modeled as +.IR INT * s / l . 
+[100] +.TP +.B -I, --skip-indels +Do not perform INDEL calling +.TP +.BI -L,\ --max-idepth \ INT +Skip INDEL calling if the average per-input-file depth is above +.IR INT . +[250] +.TP +.BI -m,\ --min-ireads \ INT +Minimum number of gapped reads for indel candidates +.IR INT . +[1] +.TP +.BI -o,\ --open-prob \ INT +Phred-scaled gap open sequencing error probability. Reducing +.I INT +leads to more indel calls. [40] + +(The same short option is used for both +.B --open-prob +and +.BR --output . +When +.BR -o 's +argument contains only an optional + or - sign followed by the digits 0 to 9, +it is interpreted as +.BR --open-prob .) +.TP +.B -p, --per-sample-mF +Apply +.B -m +and +.B -F +thresholds per sample to increase sensitivity of calling. +By default both options are applied to reads pooled from all samples. +.TP +.BI -P,\ --platforms \ STR +Comma-delimited list of platforms (determined by +.BR @RG-PL ) +from which indel candidates are obtained. It is recommended to collect +indel candidates from sequencing technologies that have a low indel error +rate such as ILLUMINA. [all] +.RE + +.TP \"-------- flags +.B flags +samtools flags INT|STR[,...] + +Convert between textual and numeric flag representation. + +.B FLAGS: +.TS +rb l l .
+0x1	PAIRED	paired-end (or multiple-segment) sequencing technology
+0x2	PROPER_PAIR	each segment properly aligned according to the aligner
+0x4	UNMAP	segment unmapped
+0x8	MUNMAP	next segment in the template unmapped
+0x10	REVERSE	SEQ is reverse complemented
+0x20	MREVERSE	SEQ of the next segment in the template is reverse complemented
+0x40	READ1	the first segment in the template
+0x80	READ2	the last segment in the template
+0x100	SECONDARY	secondary alignment
+0x200	QCFAIL	not passing quality controls
+0x400	DUP	PCR or optical duplicate
+0x800	SUPPLEMENTARY	supplementary alignment
+.TE + +.TP \"-------- fastq fasta +.B fastq/a +samtools fastq +.RI [ options ] +.I in.bam +.br +samtools fasta +.RI [ options ] +.I in.bam + +Converts a BAM or CRAM into either FASTQ or FASTA format depending on the +command invoked. The FASTQ files will be automatically compressed if the +filenames have a .gz or .bgzf extension. + +.B OPTIONS: +.RS +.TP 8 +.B -n +By default, either '/1' or '/2' is added to the end of read names +where the corresponding BAM_READ1 or BAM_READ2 flag is set. +Using +.B -n +causes read names to be left as they are. +.TP 8 +.B -N +Always add either '/1' or '/2' to the end of read names +even when put into different files. +.TP 8 +.B -O +Use quality values from OQ tags in preference to standard quality string +if available. +.TP 8 +.B -s FILE +Write singleton reads in FASTQ format to FILE instead of outputting them. +.TP 8 +.B -t +Copy RG, BC and QT tags to the FASTQ header line, if they exist. +.TP 8 +.B -T TAGLIST +Specify a comma-separated list of tags to copy to the FASTQ header line, if they exist. +.TP 8 +.B -1 FILE +Write reads with the BAM_READ1 flag set to FILE instead of outputting them. +.TP 8 +.B -2 FILE +Write reads with the BAM_READ2 flag set to FILE instead of outputting them. +.TP 8 +.B -0 FILE +Write reads with both or neither of the BAM_READ1 and BAM_READ2 flags set +to FILE instead of outputting them.
+.TP 8 +.BI "-f " INT +Only output alignments with all bits set in +.I INT +present in the FLAG field. +.I INT +can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) +or in octal by beginning with `0' (i.e. /^0[0-7]+/) [0]. +.TP 8 +.BI "-F " INT +Do not output alignments with any bits set in +.I INT +present in the FLAG field. +.I INT +can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) +or in octal by beginning with `0' (i.e. /^0[0-7]+/) [0]. +.TP 8 +.BI "-G " INT +Only EXCLUDE reads with all of the bits set in +.I INT +present in the FLAG field. +.I INT +can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/) +or in octal by beginning with `0' (i.e. /^0[0-7]+/) [0]. +.TP 8 +.B -i +add Illumina Casava 1.8 format entry to header (eg 1:N:0:ATCACG) +.TP 8 +.B -c [0..9] +set compression level when writing gz or bgzf fastq files. +.TP 8 +.B --i1 FILE +write first index reads to FILE +.TP 8 +.B --i2 FILE +write second index reads to FILE +.TP 8 +.B --barcode-tag TAG +aux tag to find index reads in [default: BC] +.TP 8 +.B --quality-tag TAG +aux tag to find index quality in [default: QT] +.TP 8 +.B --index-format STR +string to describe how to parse the barcode and quality tags. For example: + +.RS +.TP 8 +.B i14i8 +the first 14 characters are index 1, the next 8 characters are index 2 +.TP 8 +.B n8i14 +ignore the first 8 characters, and use the next 14 characters for index 1 + +If the tag contains a separator, then the numeric part can be replaced with '*' to +mean 'read until the separator or end of tag', for example: +.TP 8 +.B n*i* +ignore the left part of the tag until the separator, then use the second part +.RE +.RE + +.TP \"-------- collate +.B collate +samtools collate +.RI [ options ] +.IR in.sam | in.bam | in.cram " [" out.prefix "]" + +Shuffles and groups reads together by their names. 
+A faster alternative to a full query name sort, +.B collate +ensures that reads of the same name are grouped together in contiguous groups, +but doesn't make any guarantees about the order of read names between groups. + +The output from this command should be suitable for any operation that +requires all reads from the same template to be grouped together. + +.B Options: +.RS +.TP 8 +.B -O +Output to stdout rather than to files starting with out.prefix +.TP +.B -u +Write uncompressed BAM output +.TP +.BI "-l " INT +Compression level. +[1] +.TP +.BI "-n " INT +Number of temporary files to use. +[64] +.RE + +.TP \"-------- reheader +.B reheader +samtools reheader +.RB [ -iP ] +.I in.header.sam in.bam + +Replace the header in +.I in.bam +with the header in +.IR in.header.sam . +This command is much faster than replacing the header with a +BAM\(->SAM\(->BAM conversion. + +By default this command outputs the BAM or CRAM file to standard +output (stdout), but for CRAM format files it has the option to +perform an in-place edit, both reading and writing to the same file. +No validity checking is performed on the header, nor that it is suitable +to use with the sequence data itself. + +.B OPTIONS: +.RS +.TP 8 +.B -P, --no-PG +Do not generate an @PG header line. +.TP 8 +.B -i, --in-place +Perform the header edit in-place, if possible. This only works on CRAM +files and only if there is sufficient room to store the new header. +The amount of space available will differ for each CRAM file. +.RE + +.TP \"-------- cat +.B cat +samtools cat [-b list] [-h header.sam] [-o out.bam] <in1.bam> <in2.bam> [ ... ] + +Concatenate BAMs or CRAMs. Although this works on either BAM or CRAM, +all input files must be the same format as each other. The sequence +dictionary of each input file must be identical, although this command +does not check this. This command uses a similar trick to +.B reheader +which enables fast BAM concatenation. 
+ +.B OPTIONS: +.RS +.TP 8 +.BI "-b " FOFN +Read the list of input BAM or CRAM files from \fIFOFN\fR. These are +concatenated prior to any files specified on the command line. +Multiple \fB-b\fR \fIFOFN\fR options may be specified to concatenate +multiple lists of BAM/CRAM files. +.TP 8 +.BI "-h " FILE +Uses the SAM header from \fIFILE\fR. By default the header is taken +from the first file to be concatenated. +.TP 8 +.BI "-o " FILE +Write the concatenated output to \fIFILE\fR. By default this is sent +to stdout. +.RE + +.TP \"-------- rmdup +.B rmdup +samtools rmdup [-sS] <input.srt.bam> <out.bam> + +Remove potential PCR duplicates: if multiple read pairs have identical +external coordinates, only retain the pair with highest mapping quality. +In the paired-end mode, this command +.B ONLY +works with FR orientation and requires ISIZE to be correctly set. It does +not work for unpaired reads (e.g. two ends mapped to different +chromosomes or orphan reads). + +.B OPTIONS: +.RS +.TP 8 +.B -s +Remove duplicates for single-end reads. By default, the command works for +paired-end reads only. +.TP 8 +.B -S +Treat paired-end reads as single-end reads. +.RE + +.TP \"-------- addreplacerg +.B addreplacerg +samtools addreplacerg [-r rg line | -R rg ID] [-m mode] [-l level] [-o out.bam] +<input.bam> + +Adds or replaces read group tags in a file. + +.B OPTIONS: +.RS +.TP 8 +.BI "-r " STRING +Allows you to specify a read group line to append to the header and applies it +to the reads specified by the -m option. If repeated it automatically adds in +tabs between invocations. +.TP 8 +.BI "-R " STRING +Allows you to specify the read group ID of an existing @RG line and applies it +to the reads specified. +.TP 8 +.BI "-m " MODE +If you choose orphan_only, existing RG tags are not overwritten; if you choose +overwrite_all, existing RG tags are overwritten. The default is overwrite_all. +.TP 8 +.BI "-o " STRING +Write the final output to STRING. The default is to write to stdout.
+ +By default, samtools tries to select a format based on the output +filename extension; if output is to standard output or no format can be +deduced, +.B bam +is selected. +.RE + +.TP \"-------- calmd +.B calmd +samtools calmd [-Eeubr] [-C capQcoef] <aln.bam> <ref.fasta> + +Generate the MD tag. If the MD tag is already present, this command will +give a warning if the MD tag generated is different from the existing +tag. Output SAM by default. + +Calmd can also read and write CRAM files although in most cases it is +pointless as CRAM recalculates MD and NM tags on the fly. The one +exception to this case is where both input and output CRAM files +have been / are being created with the \fIno_ref\fR option. + +.B OPTIONS: +.RS +.TP 8 +.B -A +When used jointly with +.B -r +this option overwrites the original base quality. +.TP 8 +.B -e +Convert the read base to = if it is identical to the aligned reference +base. The indel caller does not support the = bases at the moment. +.TP +.B -u +Output uncompressed BAM +.TP +.B -b +Output compressed BAM +.TP +.BI -C \ INT +Coefficient to cap mapping quality of poorly mapped reads. See the +.B mpileup +command for details. [0] +.TP +.B -r +Compute the BQ tag (without -A) or cap base quality by BAQ (with -A). +.TP +.B -E +Extended BAQ calculation. This option trades specificity for sensitivity, though the +effect is minor. +.RE + +.TP \"-------- targetcut +.B targetcut +samtools targetcut [-Q minBaseQ] [-i inPenalty] [-0 em0] [-1 em1] [-2 em2] [-f ref] <in.bam> + +This command identifies target regions by examining the continuity of read depth, computes +haploid consensus sequences of targets and outputs a SAM with each sequence corresponding +to a target. When option +.B -f +is in use, BAQ will be applied. This command is +.B only +designed for cutting fosmid clones from fosmid pool sequencing [Ref. Kitzman et al. (2010)].
+ +.TP \"-------- phase +.B phase +samtools phase [-AF] [-k len] [-b prefix] [-q minLOD] [-Q minBaseQ] <in.bam> + +Call and phase heterozygous SNPs. + +.B OPTIONS: +.RS +.TP 8 +.B -A +Drop reads with ambiguous phase. +.TP 8 +.BI -b \ STR +Prefix of BAM output. When this option is in use, phase-0 reads will be saved in file +.BR STR .0.bam +and phase-1 reads in +.BR STR .1.bam. +Phase-unknown reads will be randomly allocated to one of the two files. Chimeric reads +with switch errors will be saved in +.BR STR .chimeric.bam. +[null] +.TP +.B -F +Do not attempt to fix chimeric reads. +.TP +.BI -k \ INT +Maximum length for local phasing. [13] +.TP +.BI -q \ INT +Minimum Phred-scaled LOD to call a heterozygote. [40] +.TP +.BI -Q \ INT +Minimum base quality to be used in het calling. [13] +.RE + +.TP \"-------- depad +.B depad +samtools depad [-SsCu1] [-T ref.fa] [-o output] <in.bam> + +Converts a BAM aligned against a padded reference to a BAM aligned +against the depadded reference. The padded reference may contain +verbatim "*" bases in it, but "*" bases are also counted in the +reference numbering. This means that a sequence base-call aligned +against a reference "*" is considered to be a cigar match ("M" or "X") +operator (if the base-call is "A", "C", "G" or "T"). After depadding +the reference "*" bases are deleted and such aligned sequence +base-calls become insertions. Similar transformations apply for +deletions and padding cigar operations. + +.B OPTIONS: +.RS +.TP +.B -S +Ignored for compatibility with previous samtools versions. +Previously this option was required if input was in SAM format, but now the +correct format is automatically detected by examining the first few characters +of input. +.TP +.B -s +Output in SAM format. The default is BAM. +.TP +.B -C +Output in CRAM format. The default is BAM. +.TP +.B -u +Do not compress the output. Applies to either BAM or CRAM output +format. +.TP +.B -1 +Enable fastest compression level.
Only works for BAM or CRAM output. +.TP +.BI "-T " FILE +Provides the padded reference file. Note that without this the @SQ +line lengths will be incorrect, so for most use cases this option will +be considered as mandatory. +.TP +.BI "-o " FILE +Specifies the output filename. By default output is sent to stdout. +.RE + +.TP \"-------- help etc +.BR help ,\ --help +Display a brief usage message listing the samtools commands available. +If the name of a command is also given, e.g., +.BR samtools\ help\ view , +the detailed usage message for that particular command is displayed. + +.TP +.B --version +Display the version numbers and copyright information for samtools and +the important libraries used by samtools. + +.TP +.B --version-only +Display the full samtools version number in a machine-readable format. +.PP +.SH GLOBAL OPTIONS +.PP +Several long-options are shared between multiple samtools subcommands: +\fB--input-fmt\fR, \fB--input-fmt-options\fR, \fB--output-fmt\fR, +\fB--output-fmt-options\fR, and \fB--reference\fR. +The input format is typically auto-detected so specifying the format +is usually unnecessary and the option is included for completeness. +Note that not all subcommands have all options. Consult the subcommand +help for more details. +.PP +Format strings recognised are "sam", "bam" and "cram". They may be +followed by a comma separated list of options as \fIkey\fR or +\fIkey\fR=\fIvalue\fR. See below for examples. +.PP +The \fBfmt-options\fR arguments accept either a single \fIoption\fR or +\fIoption\fR=\fIvalue\fR. Note that some options only work on some +file formats and only on read or write streams. If value is +unspecified for a boolean option, the value is assumed to be 1. The +valid options are as follows. +.RS 0 +.\" General purpose +.TP 4 +.BI nthreads= INT +Specifies the number of threads to use during encoding and/or +decoding. For BAM this will be encoding only. 
In CRAM the threads +are dynamically shared between encoder and decoder. +.\" CRAM specific +.TP +.BI reference= fasta_file +Specifies a FASTA reference file for use in CRAM encoding or decoding. +It usually is not required for decoding except in the situation of the +MD5 not being obtainable via the REF_PATH or REF_CACHE environment variables. +.TP +.BI decode_md= 0|1 +CRAM input only; defaults to 1 (on). CRAM does not typically store +MD and NM tags, preferring to generate them on the fly. This option +controls this behaviour. +.TP +.BI ignore_md5= 0|1 +CRAM input only; defaults to 0 (off). When enabled, md5 checksum +errors on the reference sequence and block checksum errors within CRAM +are ignored. Use of this option is strongly discouraged. +.TP +.BI required_fields= bit-field +CRAM input only; specifies which SAM columns need to be populated. +By default all fields are used. Limiting the decode to specific +columns can have significant performance gains. The bit-field is a +numerical value constructed from the following table. +.TS +center; +rb l . +0x1 SAM_QNAME +0x2 SAM_FLAG +0x4 SAM_RNAME +0x8 SAM_POS +0x10 SAM_MAPQ +0x20 SAM_CIGAR +0x40 SAM_RNEXT +0x80 SAM_PNEXT +0x100 SAM_TLEN +0x200 SAM_SEQ +0x400 SAM_QUAL +0x800 SAM_AUX +0x1000 SAM_RGAUX +.TE +.TP +.BI name_prefix= string +CRAM input only; defaults to output filename. Any sequences with +auto-generated read names will use \fIstring\fR as the name prefix. +.TP +.BI multi_seq_per_slice= 0|1 +CRAM output only; defaults to 0 (off). By default CRAM generates one +container per reference sequence, except in the case of many small +references (such as a fragmented assembly). +.TP +.BI version= major.minor +CRAM output only. Specifies the CRAM version number. Acceptable +values are "2.1" and "3.0". +.TP +.BI seqs_per_slice= INT +CRAM output only; defaults to 10000. +.TP +.BI slices_per_container= INT +CRAM output only; defaults to 1. 
The effect of having multiple slices +per container is to share the compression header block between +multiple slices. This is unlikely to have any significant impact +unless the number of sequences per slice is reduced. (Together these +two options control the granularity of random access.) +.TP +.BI embed_ref= 0|1 +CRAM output only; defaults to 0 (off). If 1, this will store portions +of the reference sequence in each slice, permitting decode without +requiring an external copy of the reference sequence. +.TP +.BI no_ref= 0|1 +CRAM output only; defaults to 0 (off). If 1, sequences will be stored +verbatim with no reference encoding. This can be useful if no +reference is available for the file. +.TP +.BI use_bzip2= 0|1 +CRAM output only; defaults to 0 (off). Permits use of bzip2 in CRAM +block compression. +.TP +.BI use_lzma= 0|1 +CRAM output only; defaults to 0 (off). Permits use of lzma in CRAM +block compression. +.TP +.BI lossy_names= 0|1 +CRAM output only; defaults to 0 (off). If 1, templates with all +members within the same CRAM slice will have their read names +removed. New names will be automatically generated during decoding. +Also see the \fBname_prefix\fR option. +.RE +.PP +For example: +.EX 4 +samtools view --input-fmt-option decode_md=0 + --output-fmt cram,version=3.0 --output-fmt-option embed_ref + --output-fmt-option seqs_per_slice=2000 -o foo.cram foo.bam +.EE +.PP +.SH REFERENCE SEQUENCES +.PP +The CRAM format requires use of a reference sequence for both reading +and writing. +.PP +When reading a CRAM the \fB@SQ\fR headers are interrogated to identify +the reference sequence MD5sum (\fBM5:\fR tag) and the local reference +sequence filename (\fBUR:\fR tag). Note that \fIhttp://\fR and +\fIftp://\fR based URLs in the UR: field are not used, but local fasta +filenames (with or without \fIfile://\fR) can be used.
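To see which M5: and UR: values a file carries, the @SQ header lines can be inspected directly. The sketch below assumes only that header fields are tab-separated TAG:value pairs; the function name sq_tag and the sample header line (including its MD5 and path) are placeholders invented for illustration.

```shell
# Hypothetical helper: print the value of one two-letter tag (e.g. M5 or UR)
# from every tab-separated @SQ header line read on standard input.
sq_tag() {
    awk -F '\t' -v tag="$1" '
        $1 == "@SQ" {
            for (i = 2; i <= NF; i++)
                if (substr($i, 1, 3) == (tag ":"))
                    print substr($i, 4)   # everything after "TAG:"
        }'
}

# Illustrative header line; the MD5sum and path are placeholders.
printf '@SQ\tSN:chr20\tLN:1000\tM5:0123456789abcdef0123456789abcdef\tUR:file:///refs/ref.fa\n' |
    sq_tag M5
```

In practice the input would typically come from a header dump such as the output of samtools view -H.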
+.PP +To create a CRAM the \fB@SQ\fR headers will also be read to identify +the reference sequences, but M5: and UR: tags may not be present. In +this case the \fB-T\fR and \fB-t\fR options of samtools view may be +used to specify the fasta or fasta.fai filenames respectively +(provided the .fasta.fai file is also backed up by a .fasta file). +.PP +The search order to obtain a reference is: +.IP 1. 3 +Use any local file specified by the command line options (eg -T). +.IP 2. 3 +Look for MD5 via REF_CACHE environment variable. +.IP 3. 3 +Look for MD5 in each element of the REF_PATH environment variable. +.IP 4. 3 +Look for a local file listed in the UR: header tag. +.PP +.SH ENVIRONMENT VARIABLES +.PP +.TP +.B HTS_PATH +A colon-separated list of directories in which to search for HTSlib plugins. +If $HTS_PATH starts or ends with a colon or contains a double colon (\fB::\fP), +the built-in list of directories is searched at that point in the search. + +If no HTS_PATH variable is defined, the built-in list of directories +specified when HTSlib was built is used, which typically includes +\fB/usr/local/libexec/htslib\fP and similar directories. + +.TP +.B REF_PATH +A colon separated (semi-colon on Windows) list of locations in which +to look for sequences identified by their MD5sums. This can be either +a list of directories or URLs. Note that if a URL is included then the +colon in http:// and ftp:// and the optional port number will be +treated as part of the URL and not a PATH field separator. +For URLs, the text \fB%s\fR will be replaced by the MD5sum being +read. + +If no REF_PATH has been specified it will default to +\fBhttp://www.ebi.ac.uk/ena/cram/md5/%s\fR and if REF_CACHE is also unset, +it will be set to \fB$XDG_CACHE_HOME/hts-ref/%2s/%2s/%s\fR. +If \fB$XDG_CACHE_HOME\fR is unset, \fB$HOME/.cache\fR (or a local system +temporary directory if no home directory is found) will be used similarly. 
+ +.TP +.B REF_CACHE +This can be defined to a single directory housing a local cache of +references. Upon downloading a reference it will be stored in the +location pointed to by REF_CACHE. When reading a reference it will be +looked for in this directory before searching REF_PATH. To avoid many +files being stored in the same directory, a pathname may be +constructed using %\fInum\fRs and %s notation, consuming \fInum\fR +characters of the MD5sum. For example +\fB/local/ref_cache/%2s/%2s/%s\fR will create 2 nested subdirectories +with the filenames in the deepest directory being the last 28 +characters of the md5sum. + +The REF_CACHE directory will be searched for before attempting to load +via the REF_PATH search list. If no REF_PATH is defined, both +REF_PATH and REF_CACHE will be automatically set (see above), but if +REF_PATH is defined and REF_CACHE not then no local cache is used. + +To aid population of the REF_CACHE directory a script +\fBmisc/seq_cache_populate.pl\fR is provided in the Samtools +distribution. This takes a fasta file or a directory of fasta files +and generates the MD5sum named files. +.PP +.SH EXAMPLES +.IP o 2 +Import SAM to BAM when +.B @SQ +lines are present in the header: +.EX 2 +samtools view -bS aln.sam > aln.bam +.EE +If +.B @SQ +lines are absent: +.EX 2 +samtools faidx ref.fa +samtools view -bt ref.fa.fai aln.sam > aln.bam +.EE +where +.I ref.fa.fai +is generated automatically by the +.B faidx +command. + +.IP o 2 +Convert a BAM file to a CRAM file using a local reference sequence. +.EX 2 +samtools view -C -T ref.fa aln.bam > aln.cram +.EE +.IP o 2 +Attach the +.B RG +tag while merging sorted alignments: +.EX 2 +perl -e 'print "@RG\\tID:ga\\tSM:hs\\tLB:ga\\tPL:Illumina\\n@RG\\tID:454\\tSM:hs\\tLB:454\\tPL:454\\n"' > rg.txt +samtools merge -rh rg.txt merged.bam ga.bam 454.bam +.EE +The value in a +.B RG +tag is determined by the file name the read is coming from. 
In this +example, in the +.IR merged.bam , +reads from +.I ga.bam +will be tagged with +.IR RG:Z:ga , +while reads from +.I 454.bam +will be tagged with +.IR RG:Z:454 . + +.IP o 2 +Call SNPs and short INDELs: +.EX 2 +samtools mpileup -uf ref.fa aln.bam | bcftools call -mv > var.raw.vcf +bcftools filter -s LowQual -e '%QUAL<20 || DP>100' var.raw.vcf > var.flt.vcf +.EE +The +.B bcftools filter +command marks low quality sites and sites with the read depth exceeding +a limit, which should be adjusted to about twice the average read depth +(bigger read depths usually indicate problematic regions which are +often enriched for artefacts). Consider adding +.B -C50 +to +.B mpileup +if mapping quality is overestimated for reads containing excessive +mismatches. Applying this option usually helps +.B BWA-short +but may not help other mappers. + +Individuals are identified from the +.B SM +tags in the +.B @RG +header lines. Individuals can be pooled in one alignment file; one +individual can also be separated into multiple files. The +.B -P +option specifies that indel candidates should be collected only from +read groups with the +.B @RG-PL +tag set to +.IR ILLUMINA . +Collecting indel candidates from reads sequenced by an indel-prone +technology may affect the performance of indel calling. + +.IP o 2 +Generate the consensus sequence for one diploid individual: +.EX 2 +samtools mpileup -uf ref.fa aln.bam | bcftools call -c | vcfutils.pl vcf2fq > cns.fq +.EE +.IP o 2 +Phase one individual: +.EX 2 +samtools calmd -AEur aln.bam ref.fa | samtools phase -b prefix - > phase.out +.EE +The +.B calmd +command is used to reduce false heterozygotes around INDELs. + + +.IP o 2 +Dump BAQ applied alignment for other SNP callers: +.EX 2 +samtools calmd -bAr aln.bam ref.fa > aln.baq.bam +.EE +It adds and corrects the +.B NM +and +.B MD +tags at the same time. The +.B calmd +command also comes with the +.B -C +option, the same as the one in +.B pileup +and +.BR mpileup . +Apply it if it helps.
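As a further illustration, the REF_CACHE layout described under ENVIRONMENT VARIABLES can be reproduced in plain shell. This sketch handles only the %2s/%2s/%s pattern used in that section; the function name ref_cache_path and the sample MD5sum are placeholders, and the code is not how HTSlib itself performs the expansion.

```shell
# Hypothetical helper (not part of samtools or HTSlib): build the cache
# path for one MD5sum under a BASE/%2s/%2s/%s layout.
ref_cache_path() {
    base=$1                 # e.g. /local/ref_cache
    md5=$2                  # 32 hexadecimal characters
    rest=${md5#??}          # md5 without its first two characters
    d1=${md5%"$rest"}       # first two characters
    tail=${rest#??}         # remaining 28 characters
    d2=${rest%"$tail"}      # third and fourth characters
    printf '%s/%s/%s/%s\n' "$base" "$d1" "$d2" "$tail"
}

# Placeholder MD5sum, chosen only to make the split easy to see.
ref_cache_path /local/ref_cache 0123456789abcdef0123456789abcdef
```

The two directory levels keep any single directory from accumulating one file per reference sequence, which is the motivation the text gives for the pattern.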
+ +.SH LIMITATIONS +.PP +.IP o 2 +Unaligned words used in bam_import.c, bam_endian.h, bam.c and bam_aux.c. +.IP o 2 +Samtools paired-end rmdup does not work for unpaired reads (e.g. orphan +reads or ends mapped to different chromosomes). If this is a concern, +please use Picard's MarkDuplicates which correctly handles these cases, +although it is a little slower. + +.SH AUTHOR +.PP +Heng Li from the Sanger Institute wrote the original C version of samtools. +Bob Handsaker from the Broad Institute implemented the BGZF library. +James Bonfield from the Sanger Institute developed the CRAM implementation. +John Marshall and Petr Danecek contribute to the source code and various +people from the 1000 Genomes Project have contributed to the SAM format +specification. + +.SH SEE ALSO +.IR bcftools (1), +.IR sam (5), +.IR tabix (1) +.PP +Samtools website: <http://www.htslib.org/> +.br +File format specification of SAM/BAM, CRAM, VCF/BCF: <http://samtools.github.io/hts-specs> +.br +Samtools latest source: <https://github.com/samtools/samtools> +.br +HTSlib latest source: <https://github.com/samtools/htslib> +.br +Bcftools website: <http://samtools.github.io/bcftools>