Exciting things are afoot at NCBI! Annotation updates, new features, and more -- read on! Please feel free to forward this on to other interested parties.
I. Updated genome annotations.
We've been hard at work in the last 13 months since we introduced full RNA-seq support in the NCBI annotation pipeline, annotating genomes for 96 species (8 insects) including 72 with RNA-seq (all 8 insects). We've re-annotated three insects in the last few weeks: pea aphid (Acyrthosiphon pisum), red flour beetle (Tribolium castaneum), and jewel wasp (Nasonia vitripennis). The updated annotations are now available in NCBI's RefSeq and Gene resources, by BLAST (both NR and organism-specific BLAST pages), and FTP.
Here are some useful links for each:
A. pea aphid -- NCBI Acyrthosiphon pisum Annotation Release 101 (no assembly change: Acyr_2.0) Annotation Report: http://www.ncbi.nlm.nih.gov/genome/annotation_euk/Acyrthosiphon_pisum/101/
B. red flour beetle -- NCBI Tribolium castaneum Annotation Release 102 (no assembly change: Tcas_3.0) Annotation Report: http://www.ncbi.nlm.nih.gov/genome/annotation_euk/Tribolium_castaneum/102/
(A technical note for our Tribolium users: a small bug caused the RefSeq version and GI to change for LG7. All you really need to know is that NC_007422.2 (GI:189313717) from our previous "build 2.1" annotation and NC_007422.4 (GI:645685058) are exactly the same sequence, and identical to GenBank CM000282.2.)
C. jewel wasp -- NCBI Nasonia vitripennis Annotation Release 101 (assembly update: Nvit_2.1) Annotation Report: http://www.ncbi.nlm.nih.gov/genome/annotation_euk/Nasonia_vitripennis/101/
You can find more information on recent annotations here: http://www.ncbi.nlm.nih.gov/genome/annotation_euk/status/#recent
We're finishing up a new status page, which should be public in another week or so, where you'll be able to easily find information on all of our insect and other annotations -- stay tuned!
II. Gap-filled models.
Last month we updated our annotation software with a new feature, which we call "gap-filling", to improve annotation quality. Basically, the software can now use transcripts in GenBank (mRNAs or TSA transcriptome assemblies) to compensate for gaps in the genome assembly, stitching in pieces of transcript that represent missing exon sequence. XM_008189147.1 is an example from pea aphid: http://www.ncbi.nlm.nih.gov/nuccore/XM_001943408.3
There are several ways to recognize these models:
a) a COMMENT saying the record is derived from genomic and transcript sequences
b) an "assembly gap" attribute in the RefSeq-Attributes section
c) a table indicating which accession ranges were used to assemble the model
d) notes on the gene and CDS features indicating how many bases were added to the model. These notes also appear on the genome annotation.
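As a rough illustration of check (b), here is a minimal Python sketch that scans the text of a flat-file record for an "assembly gap" attribute inside the RefSeq-Attributes section. The START/END marker strings, the function name looks_gap_filled, and the sample records are assumptions made for this example, not a specification of the record format; adjust them to match what you actually see in the records.

```python
import re

# Assumed section delimiters; verify against a real record before relying on them.
ATTR_BLOCK = re.compile(
    r"##RefSeq-Attributes-START##(.*?)##RefSeq-Attributes-END##",
    re.DOTALL,
)


def looks_gap_filled(record_text: str) -> bool:
    """Return True if the RefSeq-Attributes section mentions 'assembly gap'."""
    match = ATTR_BLOCK.search(record_text)
    if match is None:
        return False  # no attributes section at all
    return "assembly gap" in match.group(1).lower()


# Hypothetical, hand-written snippets for illustration only.
sample_filled = (
    "COMMENT     derived from genomic and transcript sequences\n"
    "            ##RefSeq-Attributes-START##\n"
    "            assembly gap :: added bases not found in genome assembly\n"
    "            ##RefSeq-Attributes-END##\n"
)
sample_plain = "COMMENT     derived from a genomic sequence\n"

print(looks_gap_filled(sample_filled))
print(looks_gap_filled(sample_plain))
```

The same approach could be extended to the other markers (the COMMENT text, or the notes on the gene and CDS features) once you have confirmed their exact wording.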
We've found that this feature can improve annotations for hundreds to thousands of genes, depending on the quality of the assembly and availability of transcripts in GenBank. It had a relatively minor effect for pea aphid, Tribolium, and Nasonia because there aren't large numbers of long, same-species transcript sequences available in GenBank, but it is likely to make a big difference for some of our upcoming annotation runs. You can find the gap-filled RefSeq transcripts with this query: http://www.ncbi.nlm.nih.gov/nuccore/?term=insects[orgn]+refseq[filter]
III. RNA-seq expression tracks and additional data available in NCBI's Gene resource
We recently updated the Graphics display in NCBI's Gene resource to include access to RNA-seq expression and other tracks through the "Configure" option. Specifically, for each of the RNA-seq BioSamples processed as part of our annotation runs, we produce histogram tracks representing the exon coverage and intron-spanning reads, which are displayed on a log(2) scale. We also produce aggregate tracks of all the aligned RNA-seq data across all samples (two histograms, and a third track representing the intron features), which are part of the default display. This provides a way to view the RNA-seq evidence supporting a particular gene model. These tracks are currently available for honey bee, pea aphid, jewel wasp, red flour beetle, giant honey bee, silkworm, medfly, and house fly. Here is an example: http://www.ncbi.nlm.nih.gov/gene/100169318
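To give a feel for what log(2) scaling does to a coverage histogram, here is a small Python sketch. The +1 pseudocount (so that zero coverage maps to zero) and the function name log2_scale are assumptions for illustration; they are not a description of how the Graphics display computes its tracks.

```python
import math


def log2_scale(coverage):
    """Map raw per-position read counts onto a log2 scale.

    Uses log2(count + 1), an assumed convention, so a count of 0 stays at 0.
    """
    return [math.log2(count + 1) for count in coverage]


# Hypothetical read counts spanning three orders of magnitude.
raw = [0, 1, 3, 7, 1023]
print(log2_scale(raw))
```

The point of the transform is that highly expressed exons no longer flatten everything else: a thousand-fold range in raw counts becomes a ten-fold range in bar height.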
We are also in the process of generating tracks for RepeatMasker, WindowMasker (a masking application developed here at NCBI that we use for most of the insects), CpG Islands, and G+C content for all of the organisms annotated with the NCBI annotation pipeline.
More information is available on the NCBI News site: http://www.ncbi.nlm.nih.gov/news/06-03-2014-easier-annotation-info-access-in-Gene/
IV. Future annotation efforts
On top of all the changes to improve annotation quality and add new features, we've also increased automation and performance to a point where we can annotate many more genomes. Our annotation efforts are aimed at:
1) making genome sequence data more useful in NCBI resources like BLAST and Gene
2) supplying high-quality annotations calculated with consistent methodology
3) providing updates on a regular basis to take advantage of additional evidence (RNA-seq, or protein datasets from related species) and improvements in software
4) rapidly taking advantage of improved assemblies, with sophisticated tracking of features compared to the prior annotation
At this point we have a nearly complete set of vertebrate genome annotations (104 species), and are interested in expanding our efforts in other taxonomic divisions. Our general annotation policy is here: http://www.ncbi.nlm.nih.gov/genome/annotation_euk/status/#policy
If you're interested in NCBI annotating a genome (or many genomes) for inclusion in the public RefSeq database, drop me a line. For insects we prefer a contig N50 of >20 kb, although we have annotated lower-quality assemblies. The assembly needs to be public in an INSDC database, and same-species RNA-seq data in SRA is greatly preferred (200M reads is a good amount, but we've worked with 40M to 20+B reads). TSAs available in GenBank are useful for the new gap-filling capability.
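If you want a quick check of an assembly against the contig N50 preference above, here is a minimal Python sketch using the standard N50 definition (the contig length at which the longest contigs together cover at least half of the total assembly length). The function names and the example lengths are made up for illustration; only the 20 kb figure comes from the text.

```python
def contig_n50(lengths):
    """Return the N50 of a collection of contig lengths (in bases)."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


MIN_INSECT_CONTIG_N50 = 20_000  # the ">20 kb" preference stated above


def meets_preference(lengths):
    """True if the contig N50 exceeds the stated 20 kb preference."""
    return contig_n50(lengths) > MIN_INSECT_CONTIG_N50


# Hypothetical contig lengths, not from any real assembly.
example = [80, 70, 50, 40, 30, 20]
print(contig_n50(example))
```

Note that this is computed on contigs, not scaffolds, so the lengths should come from the assembly after splitting at gaps.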
Note that annotation in RefSeq using our pipeline does not preclude the genome submitter from providing their own annotation on the GenBank records, and I strongly encourage groups to do so. We are working on new features to show both annotation tracks in the Gene resource so that users can easily see similarities and differences between different annotations to help guide their studies.
We have a lot more in store for the future, so stay tuned! I won't be at the AGS meeting this year, but feel free to drop me a line with any questions.
Terence Murphy, Ph.D.
45 Center Drive, Room 5AS.43
Bethesda, MD 20892-6510