.bam file format

The header section is not mandatory, but most NGS software requires it. Each line of the header section starts with '@' and a two-letter record type code. It contains information about five main topics, including:

alignment file: format version, sorting order.

read group: sequencing lane, sample, sequencing center, library etc.

program: aligner name and version, parameters used for the alignment.
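To make the header layout concrete, here is a minimal sketch that splits header lines into a record type and its tag/value fields. The function name `parse_header_line` and every value in `example_header` (sample, library, aligner version, command line) are invented for illustration. Only the general shape, '@' plus a two-letter code followed by tab-separated `TAG:value` fields, is taken from the SAM format.

```python
def parse_header_line(line):
    """Split one SAM header line into (record_type, fields)."""
    if not line.startswith("@"):
        raise ValueError("not a header line: %r" % line)
    parts = line.rstrip("\n").split("\t")
    record_type = parts[0][1:]  # drop the leading '@'
    if record_type == "CO":
        # Comment lines carry free text instead of TAG:value fields.
        return record_type, {"text": "\t".join(parts[1:])}
    fields = {}
    for part in parts[1:]:
        # Split on the first ':' only, since values may contain colons.
        tag, _, value = part.partition(":")
        fields[tag] = value
    return record_type, fields


# Hypothetical header; all values are made up for this example.
example_header = [
    "@HD\tVN:1.6\tSO:coordinate",                  # alignment file
    "@SQ\tSN:chr1\tLN:1000000",                    # reference sequence
    "@RG\tID:rg1\tSM:sample1\tLB:lib1\tCN:center1",  # read group
    "@PG\tID:bwa\tPN:bwa\tVN:0.7.17\tCL:bwa mem ref.fa reads.fq",  # program
]

parsed = [parse_header_line(line) for line in example_header]
for record_type, fields in parsed:
    print(record_type, fields)
```

In practice a library such as pysam or the `samtools view -H` command would normally be used to read the header; the sketch only shows what those tools are parsing.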

The generally used file formats for sequence-based alignments are the SAM and BAM formats. The SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments. These files can contain information about mapped and unmapped reads, the contigs of the reference sequence that was used, and many more things. You can find the SAM format specification here and the article about the SAM format and SAMtools here.

In case of paired end reads, you'll have them twice, assuming both reads of the pair are mapped. You can remove duplicate reads with Picard tools' MarkDuplicates (link from Jorge) as follows:

    java -jar MarkDuplicates.jar INPUT=my_file_input.bam OUTPUT=my_file_output.bam METRICS_FILE=my_metrics.txt REMOVE_DUPLICATES=TRUE TMP_DIR=my_tmp_dir

That being said, whether it is essential to remove duplicates depends on your downstream analysis. I'd suggest doing so if you're calling SNPs. If you're doing a gene expression analysis or anything that deals with count data, then it's probably not a good idea. However, there might be conflicting opinions in this regard, I believe.

Since you also asked for an explanation of the different columns in your read:

2. FLAG: 163 - read paired, read mapped in proper pair, its mate maps to reverse strand, this read is second in pair. Flags are probably hard to grasp for biologists at the beginning. To start with, therefore, use this link to interpret them. Type in your flag number and it explains what they mean.

4. POS: 1002 - position on the genome where the first base of this read matches (refer to the documentation).

6. CIGAR: 100M - the CIGAR string might require some explanation. I is an insertion into the reference and D is a deletion from the reference. Your CIGAR of 100M indicates that the read starts at 1002 and is mapped continuously for the next 100 bases. Note that there might still be mismatches.

Suppose your CIGAR was 20M 1D 20M 60N 30M 1I 20M 100N 9M => read length = sum of all M/I/S operations = 20 + 20 + 30 + 1 + 20 + 9 = 100. Suppose your read started at 1002; then let's see how the read is laid out for this CIGAR string. The first 20 bases are continuously mapped to your genome. Now there is a 1D: if you remove the base at 1022 of your reference, the next 20 bases (20M) have a nice matching sequence. So, we'll have to add the 1D as well to obtain the genomic coordinates. Going once again, starting from 1002, we have a 20M, then a 1D and then a 20M again. So, the read matches nicely up until this point. The next 60 bases are not in your read (presumably because it's transcriptome data and this is an intron location). Continuing this way, the places where the read is mapped to the chromosome are as follows:

    1002 - 1042 => first 40 bases of the read mapped to the chromosome, with 1 base in between which is not in your read but in the genome
    1103 - 1152 => next 50 bases match the genome. You don't add the 1I because it's in the read and not in the genome.

Coming back to the rest of the columns:

7. RNEXT: = - read is paired (* denotes single end reads).

8. PNEXT: 1059 - starting position of the next read in the pair. Since your first read extends from 1002 to 1101, and the second read in the pair starts at 1059, your reads overlap here.

9. TLEN: 157 - the distance from the left of the left-most mapped read to the right of the right-most mapped read. I'd guess your other read in the pair also has the same CIGAR string 100M; then we know it starts at 1059 and therefore it should end at 1059 + 100 - 1 = 1158, which gives 1158 - 1002 + 1 = 157.

The first 11 columns are mandatory in SAM format. From then on, all columns are optional, and different software uses different available and necessary options (some for internal use downstream in their pipeline). XT:A:U is specific to BWA, I suppose = unique read. NM = number of mismatches (strictly, the edit distance to the reference) = 0 for this read. And so on.
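The FLAG, CIGAR and TLEN arithmetic discussed in this answer can be re-checked with a short script. This is only a sketch: the helper names `decode_flag`, `walk_cigar` and `template_length` are made up here, the mate's 100M CIGAR is the same guess made in the answer, and the TLEN helper returns the magnitude only, ignoring the sign convention of the SAM format.

```python
import re

# FLAG bit meanings as listed in the SAM specification.
FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails quality checks"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def decode_flag(flag):
    """List the meaning of every bit that is set in a FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]


def walk_cigar(pos, cigar):
    """Return (read_length, blocks) for a read starting at 1-based pos.

    blocks are inclusive reference intervals of the aligned stretches,
    split wherever an N (skipped region, e.g. an intron) occurs.
    """
    read_len = 0
    ref = pos            # next reference position to be consumed
    block_start = pos
    blocks = []
    for length, op in CIGAR_OP.findall(cigar.replace(" ", "")):
        n = int(length)
        if op in "MIS=X":    # operations that consume read bases
            read_len += n
        if op in "MD=X":     # operations that consume reference bases
            ref += n
        elif op == "N":      # skipped reference region closes a block
            blocks.append((block_start, ref - 1))
            ref += n
            block_start = ref
    blocks.append((block_start, ref - 1))
    return read_len, blocks


def template_length(read_blocks, mate_blocks):
    """Distance from the left-most to the right-most mapped base (unsigned)."""
    left = min(read_blocks[0][0], mate_blocks[0][0])
    right = max(read_blocks[-1][1], mate_blocks[-1][1])
    return right - left + 1


flag_meaning = decode_flag(163)
spliced_len, spliced_blocks = walk_cigar(1002, "20M 1D 20M 60N 30M 1I 20M 100N 9M")
_, read_blocks = walk_cigar(1002, "100M")
_, mate_blocks = walk_cigar(1059, "100M")  # assumption: the mate is also 100M
tlen = template_length(read_blocks, mate_blocks)

print(flag_meaning)
print(spliced_len, spliced_blocks)
print(read_blocks, mate_blocks, tlen)
```

The spliced example yields a third block after the 100N for the final 9M, which follows from the same rules used for the first two blocks.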
