2014-11-26

Column headers in BLAST+ tabular and CSV output

In the last couple of years, my preferred BLAST output format has switched from BLAST XML to plain tabular output. The main reason for this it is easier to parse, and now gives easy access to more fields - BLAST+ 2.2.28 added descriptions and taxonomy output to the tabular and CSV output, but the cumulative effect is BLAST XML has been lagging behind.

However, there is a simple change the NCBI could make to greatly improve the usability of the tabular or CSV output - label the columns with a header line! This is vital meta-data: No-one should be forced to guess-the-columns when presented with a data file. 

Look at all the columns


Here is an excerpt from the BLAST+ 2.2.30 command line help on the output format:

 *** Formatting options
 -outfmt <String>
   alignment view options:
     0 = pairwise,
     1 = query-anchored showing identities,
     2 = query-anchored no identities,
     3 = flat query-anchored, show identities,
     4 = flat query-anchored, no identities,
     5 = XML Blast output,
     6 = tabular,
     7 = tabular with comment lines,
     8 = Text ASN.1,
     9 = Binary ASN.1,
    10 = Comma-separated values,
    11 = BLAST archive format (ASN.1)
    12 = JSON Seqalign output
  
   Options 6, 7, and 10 can be additionally configured to produce
   a custom format specified by space delimited format specifiers.
   The supported format specifiers are:
           qseqid means Query Seq-id
              qgi means Query GI
             qacc means Query accesion
          qaccver means Query accesion.version
             qlen means Query sequence length
           sseqid means Subject Seq-id
        sallseqid means All subject Seq-id(s), separated by a ';'
              sgi means Subject GI
           sallgi means All subject GIs
             sacc means Subject accession
          saccver means Subject accession.version
          sallacc means All subject accessions
             slen means Subject sequence length
           qstart means Start of alignment in query
             qend means End of alignment in query
           sstart means Start of alignment in subject
             send means End of alignment in subject
             qseq means Aligned part of query sequence
             sseq means Aligned part of subject sequence
           evalue means Expect value
         bitscore means Bit score
            score means Raw score
           length means Alignment length
           pident means Percentage of identical matches
           nident means Number of identical matches
         mismatch means Number of mismatches
         positive means Number of positive-scoring matches
          gapopen means Number of gap openings
             gaps means Total number of gaps
             ppos means Percentage of positive-scoring matches
           frames means Query and subject frames separated by a '/'
           qframe means Query frame
           sframe means Subject frame
             btop means Blast traceback operations (BTOP)
          staxids means unique Subject Taxonomy ID(s), separated by a ';'
                (in numerical order)
        sscinames means unique Subject Scientific Name(s), separated by a ';'
        scomnames means unique Subject Common Name(s), separated by a ';'
       sblastnames means unique Subject Blast Name(s), separated by a ';'
                (in alphabetical order)
       sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
                (in alphabetical order)
           stitle means Subject Title
       salltitles means All Subject Title(s), separated by a '<>'
          sstrand means Subject Strand
            qcovs means Query Coverage Per Subject
          qcovhsp means Query Coverage Per HSP
   When not provided, the default value is:
   'qseqid sseqid pident length mismatch gapopen qstart qend sstart send
   evalue bitscore', which is equivalent to the keyword 'std'
   Default = `0'

Just look at all those interesting potential output fields :)

Sample Output


Now for a quick example, the protein sequence for E. coli K-12's gene moaC, and a fake sequence:

$ more queries.fasta
>moaC molybdopterin biosynthesis, protein C
MSQLTHINAAGEAHMVDVSAKAETVREARAEAFVTMRSETLAMIIDGRHHKGDVFATARIAGIQAAKRTW
DLIPLCHPLMLSKVEVNLQAEPEHNRVRIETLCRLTGKTGVEMEALTAASVAALTIYDMCKAVQKDMVIG
PVRLLAKSGGKSGDFKVEADD
>fake sequence of letters which should not match any real proteins
DFAIBFNWACNMVBNMDEYGBCBFCNKSFDEZNVDXHALAHLFGDNASHBCVNMDFGNMNDFSILAPPQG
FCGHAKGRDAIBVKPDJKAHCIIBYANMNVB

Let's search this against the NCBI NR database using BLASTP with the default column tabular output, limiting this to the top two hits for brevity:

$ blastp -query queries.fasta -db nr -evalue 0.0001 -max_target_seqs 2 -outfmt 6
moaC<tab>gi|15800534|ref|NP_286546.1|<tab>100.00<tab>161<tab>0<tab>0<tab>1<tab>161 
<tab>1<tab>161<tab>3e-114<tab>330<end>
moaC<tab>gi|170768970|ref|ZP_02903423.1|<tab>99.38<tab>161<tab>1<tab>0<tab>1<tab>161
<tab>1<tab>161<tab>9e-114<tab>329<end>

Due to line splitting for the blog, I have represented the tab characters as <tab> in yellow, and emphasised the new lines with <end>.

Or, for comparison, the comma separated output - highlighted in yellow (which appears to include some unexpected spaces before the final values - a minor bug?):

$ blastp -query queries.fasta -db nr -evalue 0.0001 -max_target_seqs 2 -outfmt 10
moaC,gi|15800534|ref|NP_286546.1|,100.00,161,0,0,1,161,1,161,3e-114,  330
moaC,gi|170768970|ref|ZP_02903423.1|,,99.38,161,1,,0,1,161,1,161,9e-114,  329

If you know the standard 12 columns by heart, this is workable. But one of the best things about BLAST+ is it will happily output other columns! For instance,

$ blastp -query queries.fasta -db nr -evalue 0.0001 -max_target_seqs 2 -outfmt "6 qseqid sseqid score qcovs"
moaC<tab>gi|15800534|ref|NP_286546.1|<tab>846<tab>100<end>
moaC<tab>gi|170768970|ref|ZP_02903423.1|<tab>843<tab>100<end>

If you don't know how the file was generated, how are you to guess what the columns mean? Supposedly this is where -outfmt 7 is useful:

$ blastp -query queries.fasta -db nr -evalue 0.0001 -max_target_seqs 2 -outfmt "7 qseqid sseqid score qcovs"
# BLASTP 2.2.30+
# Query: moaC molybdopterin biosynthesis, protein C
# Database: nr
# Fields: query id, subject id, score, % query coverage per subject
# 2 hits found
moaC<tab>gi|15800534|ref|NP_286546.1|<tab>846<tab>100<end>
moaC<tab>gi|170768970|ref|ZP_02903423.1|<tab>843<tab>100<end>
# BLASTP 2.2.30+
# Query: fake sequence of letters which should not match any real proteins
# Database: nr
# 0 hits found
# BLAST processed 2 queries

This is extremely comment heavy (OK, not as verbose as BLAST XML, but still...) and not immediately useful for parsing the data (although you can now see queries with no hits being reported). However, loading this with Excel or R will not recognise the columns for you.

Enhancement Proposal


I would like the NCBI to add a new tabular output format to BLAST+, say -outfmt 13 if not used for anything else first, which acts like -outfmt 6 (tabular) but with the addition of a single header line. This should start with the # character (indicating this is a comment rather than data), followed by tab separated column names:

$ blastp -query queries.fasta -db nr -evalue 0.0001 -max_target_seqs 2 -outfmt "13 qseqid sseqid score qcovs"
#qseqid<tab>sseqid<tab>score<tab>qcovs<end>
moaC<tab>gi|15800534|ref|NP_286546.1|<tab>846<tab>100<end>
moaC<tab>gi|170768970|ref|ZP_02903423.1|<tab>843<tab>100<end>

I debated just changing the existing -outfmt 6 output, but fear this would break too many existing scripts. Similarly, this new mode could replace the existing -outfmt 7, but that includes other functionality as well - like reporting query sequences with no hits. So a new output format number seems best to me.

For fans of comma separated variable (CSV) files, the same style header should also be useful? I prefer to avoid CSV with BLAST as commas do occur in record titles and so can complicate parsing. Maybe that can be -outfmt 14 giving:

$ blastp -query queries.fasta -db nr -evalue 0.0001 -max_target_seqs 2 -outfmt "14 qseqid sseqid score qcovs"
#qseqid,sseqid,score,qcovs
moaC,gi|15800534|ref|NP_286546.1|,846,100
moaC,gi|170768970|ref|ZP_02903423.1|,843,100


Note that I am explicitly suggest using the same one-word concise format names that BLAST+ itself uses in the command line to generate the file (and not the more verbose column names currently used in -outfmt 7). These short column names are clear, plus this should make it easier to trace the values back to their meaning. Also they are much easier to work with in R than column names with spaces.

The point of this header line convention is it is widely used for self-describing tab-separated tables (I specifically want this for the BLAST+ Galaxy wrappers). This is trivial to load into Excel, and much easier to load into R or Python etc and get column names matched up with the data.

In essence, this column header line is vital meta-data which avoids the guess-the-columns problem which otherwise can come up when sharing BLAST data in tabular or CSV format.

2 comments:

  1. Maybe adopting the ASCII char 31 for column delimitation and ASCII char 30 for lines would make it easy to avoid problems when using comma. See here: http://ronaldduncan.wordpress.com/2009/10/31/text-file-formats-ascii-delimited-text-not-csv-or-tab-delimited-text/

    ReplyDelete
    Replies
    1. True, but tab separated variables are much more commonly used than those ASCII codes. My guess is they were too hard to enter from a standard keyboard layout? So sure, we could have ASCII delimited text as another BLAST+ output format - but I'd still want the column names there as the first line ;)

      Delete