Hi All,
Exciting things are afoot at NCBI! Annotation updates, new features, and more
-- read on! Please feel free to forward this on to other interested parties.
I. Updated genome annotations
-----------------------------------------
We've been hard at work in the
last 13 months since we introduced full RNA-seq support in the NCBI annotation
pipeline, annotating genomes for 96 species (8 insects) including 72 with
RNA-seq (all 8 insects). We've re-annotated three insects in the last few
weeks: pea aphid (Acyrthosiphon pisum), red flour beetle (Tribolium castaneum),
and jewel wasp (Nasonia vitripennis). The updated annotations are now available
in NCBI's RefSeq and Gene resources, by BLAST (both NR and organism-specific
BLAST pages), and FTP.
Here are some useful links for each:
A. pea aphid -- NCBI Acyrthosiphon pisum Annotation Release 101
(no assembly change: Acyr_2.0)
Annotation Report: http://www.ncbi.nlm.nih.gov/genome/annotation_euk/Acyrthosiphon_pisum/101/
G_DEF=blastn&BLAST_PROG_DEF=megaBlast&BLAST_SPEC=OGP__7029__13646
B. red flour beetle -- NCBI Tribolium castaneum Annotation Release 102
(no assembly change: Tcas_3.0)
Annotation Report: http://www.ncbi.nlm.nih.gov/genome/annotation_euk/Tribolium_castaneum/102/
G_DEF=blastn&BLAST_PROG_DEF=megaBlast&BLAST_SPEC=OGP__7070__12539
(A technical note for our Tribolium users: we had a small bug that resulted in
the RefSeq version and GI changing for LG7. All you really need to know is that
NC_007422.2 (GI:189313717) from our previous "build 2.1" annotation and
NC_007422.4 (GI:645685058) are the exact same sequence, and both are identical
to GenBank CM000282.2.)
C. jewel wasp -- NCBI Nasonia vitripennis Annotation Release 101
(assembly update: Nvit_2.1)
Annotation Report: http://www.ncbi.nlm.nih.gov/genome/annotation_euk/Nasonia_vitripennis/101/
G_DEF=blastn&BLAST_PROG_DEF=megaBlast&BLAST_SPEC=OGP__7425__13647
You can find more information on recent annotations here:
http://www.ncbi.nlm.nih.gov/genome/annotation_euk/status/#recent
We're finishing up a new status
page that should be public in another week or so where you'll be able to easily
find information on all of our insect and other annotations -- stay tuned!
II. Gap-filled models
-----------------------------------------
Last month we updated our annotation software to include a new feature we call
"gap-filling" that improves annotation quality. Basically, the software can
now utilize transcripts in
GenBank (mRNAs or TSA transcriptome assemblies) to compensate for gaps in the
genome assembly, stitching in pieces of transcript that represent missing exon
sequence. XM_008189147.1 is an example from pea aphid: http://www.ncbi.nlm.nih.gov/nuccore/XM_001943408.3
There are several ways to recognize these models:
a) a COMMENT saying the record is derived from genomic and transcript sequences
b) an "assembly gap" attribute in the RefSeq-Attributes section
c) a table indicating which accession ranges were used to assemble the model
d) notes on the gene and CDS features indicating how many bases were added to
the model. These notes also appear on the genome annotation.
We've found that this feature can
improve annotations for hundreds to thousands of genes, depending on the quality
of the assembly and availability of transcripts in GenBank. It had a relatively
minor effect for pea aphid, Tribolium, and Nasonia because there aren't large
numbers of long, same-species transcript sequences available in GenBank, but it
is likely to make a big difference for some of our upcoming annotation runs.
You can find the gap-filled RefSeq transcripts with this query:
http://www.ncbi.nlm.nih.gov/nuccore/?term=insects[orgn]+refseq[filter]+biomol_rna[prop]+assembly+gap[prop]
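If you would rather run that search from a script, here is a minimal Python sketch that builds the equivalent request URL for the E-utilities esearch endpoint. The search term is copied from the link above; the function name, the organism parameter, and the retmax value are my own illustrative choices, and the sketch only constructs the URL rather than contacting NCBI.

```python
from urllib.parse import urlencode

# E-utilities esearch endpoint (the programmatic counterpart of the web search).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_gap_filled_query(organism="insects", retmax=20):
    """Return an esearch URL for gap-filled RefSeq transcripts.

    The term mirrors the web query above; 'organism' and 'retmax' are
    illustrative parameters, not part of the original link.
    """
    term = f"{organism}[orgn] refseq[filter] biomol_rna[prop] assembly gap[prop]"
    params = {"db": "nuccore", "term": term, "retmax": retmax}
    # urlencode turns spaces into '+' and percent-encodes the square brackets.
    return ESEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_gap_filled_query())
    print(build_gap_filled_query("Acyrthosiphon pisum"))
```

Swapping "insects" for a species name narrows the search to a single organism.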
III. RNA-seq expression tracks and additional data available in NCBI's Gene resource
-----------------------------------------
We recently updated the Graphics
display in NCBI's Gene resource to include access to RNA-seq expression and
other tracks through the "Configure" option. Specifically, for each of the RNA-seq
BioSamples processed as part of our annotation runs, we produce histogram
tracks representing the exon coverage and intron-spanning reads, which are
displayed on a log(2) scale. We also produce aggregate tracks of all the
aligned RNA-seq data across all samples (two histograms, and a third track
representing the intron features) which are part of the default display. This
provides a way to view the RNA-seq evidence supporting a particular gene model.
These tracks are currently available for honey bee, pea aphid, jewel wasp, red
flour beetle, giant honey bee, silkworm, medfly, and house fly. Here is an
example: http://www.ncbi.nlm.nih.gov/gene/100169318
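For anyone unfamiliar with log-scaled coverage, the short Python sketch below shows the general idea: each step up the scale is a doubling of read depth, so highly and weakly expressed exons stay visible in the same track. This illustrates the concept only and is not the Graphics display's actual code; in particular, the pseudocount of 1 (so that zero coverage maps to zero) is an assumption.

```python
import math


def log2_scale(counts, pseudocount=1):
    """Convert raw coverage counts to log2 values.

    The pseudocount is an assumed convention to keep zero-coverage
    positions defined; the real display may handle zeros differently.
    """
    return [math.log2(c + pseudocount) for c in counts]


if __name__ == "__main__":
    raw = [0, 1, 3, 7, 1023]
    for count, scaled in zip(raw, log2_scale(raw)):
        print(f"{count:>5} reads -> {scaled:.1f}")
```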
We are also in the process of
generating tracks for RepeatMasker, WindowMasker (a masking application
developed here at NCBI that we use for most of the insects), CpG Islands, and
G+C content for all of the organisms annotated with the NCBI annotation
pipeline.
More information is available on the NCBI News site:
http://www.ncbi.nlm.nih.gov/news/06-03-2014-easier-annotation-info-access-in-Gene/
IV. Future annotation efforts
-----------------------------------------
On top of all the changes to
improve annotation quality and add new features, we've also increased
automation and performance to a point where we can annotate many more genomes.
Our annotation efforts are aimed at:
1) making genome sequence data more useful in NCBI resources like BLAST and Gene
2) supplying high-quality annotations calculated with consistent methodology
3) providing updates on a regular basis to take advantage of additional
evidence (RNA-seq, or protein datasets from related species) and improvements
in software
4) rapidly taking advantage of improved assemblies with sophisticated tracking
of features compared to the prior annotation
At this point we have
a nearly complete set of vertebrate genome annotations (104 species), and are
interested in expanding our efforts in other taxonomic divisions. Our general
annotation policy is here: http://www.ncbi.nlm.nih.gov/genome/annotation_euk/status/#policy
If you're interested in NCBI
annotating a genome (or many genomes) for inclusion in the public RefSeq
database, drop me a line. For insects we prefer a contig N50 of >20 kb,
although we have annotated lower quality assemblies. The assembly needs to be
public in an INSDC database, and same-species RNA-seq data in SRA is greatly
preferred (200M reads is a good amount, but we've worked with 40M to 20+B
reads). TSAs available in GenBank are useful for the new gap-filling
capability.
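If you want a quick check of an assembly against the contig N50 guideline above, the Python sketch below computes N50 from a list of contig lengths using the standard definition: the contig length at which the running total, counted from the longest contig down, first reaches half of the assembled bases. The function names are hypothetical, and the comparison simply restates the >20 kb preference; it is not how our pipeline evaluates candidate assemblies.

```python
def contig_n50(lengths):
    """Return the N50 of a collection of contig lengths (in bases)."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


def meets_insect_guideline(lengths, min_n50=20_000):
    """Hypothetical helper: True if contig N50 exceeds the 20 kb preference."""
    return contig_n50(lengths) > min_n50


if __name__ == "__main__":
    contigs = [50_000, 30_000, 5_000]
    print(contig_n50(contigs), meets_insect_guideline(contigs))
```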
Note that annotation in RefSeq
using our pipeline does not preclude the genome submitter from providing their
own annotation on the GenBank records, and I strongly encourage groups to do
so. We are working on new features to show both annotation tracks in the Gene
resource so that users can easily see similarities and differences between
different annotations to help guide their studies.
We have a lot more in store for
the future, so stay tuned! I won't be at the AGS meeting this year, but feel
free to drop me a line with any questions.
Best regards,
-Terence
-----
Terence Murphy, Ph.D.
Staff Scientist
NCBI/NLM/NIH/DHHS
45 Center Drive, Room 5AS.43
Bethesda, MD 20892-6510
Phone: 00-1-301-402-0990
e-mail: murphyte@ncbi.nlm.nih.gov