Ensembl 102

Known bugs in Ensembl

Inconsistent transcript counts between exported GFF3 and GTF files

Affects: Live site Versions: Ensembl 102, Ensembl 103, Ensembl 104, Ensembl 105
Description: Following a bug report, we noticed that in particular cases inconsistencies may appear between the GFF3 and GTF files available on our FTP site.

Sometimes, depending on the data underlying our dumps, the number of transcripts retrieved may differ between the two files for the same species.

The main difference between GTF and GFF3 dumping is that for the GTF we get the transcripts from the gene ($gene->get_all_Transcripts), while for the GFF3 we get the transcripts from the underlying slice ($transcript_adaptor->fetch_all_by_Slice).

https://github.com/Ensembl/ensembl-production/blob/release/104/modules/Bio/EnsEMBL/Production/Pipeline/GFF3/DumpFile.pm#L199
https://github.com/Ensembl/ensembl-io/blob/release/104/modules/Bio/EnsEMBL/Utils/IO/GTFSerializer.pm#L112

This means that if a transcript extends beyond the boundaries of the slice, we might not dump it in the GFF3, even though we dump its gene.
This currently only happens with genes on patches, where some transcripts can lie entirely outside the patch region because we create a fake chromosome that includes the patch.
In the future, we plan to store the patches as standalone scaffolds. Those transcripts will then be removed entirely, so they will not be included in either the GTF or the GFF3 dumps.

We plan to fix this from 106 onwards.

Workaround: There is no workaround other than using the most up-to-date datasets.
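
To check whether a particular pair of dumps is affected, the transcript IDs in the two files can be compared directly. The following is a minimal sketch, not part of the Ensembl dumping code. It assumes that GTF transcript rows carry a transcript_id "..." attribute and that GFF3 transcript rows carry an ID=transcript:... attribute; the function names are hypothetical.

```python
import re
from typing import Dict, Iterable, Iterator, List, Set, Tuple

# Attribute patterns; both are assumptions about the dump layout.
GTF_TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')
GFF3_TRANSCRIPT_ID = re.compile(r'(?:^|;)ID=transcript:([^;]+)')


def _data_rows(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (feature_type, attributes) for each 9-column data line."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip header and directive lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9:
            yield cols[2], cols[8]


def gtf_transcript_ids(lines: Iterable[str]) -> Set[str]:
    """Collect transcript IDs from GTF rows whose feature type is 'transcript'."""
    ids = set()
    for feature, attrs in _data_rows(lines):
        if feature == "transcript":
            match = GTF_TRANSCRIPT_ID.search(attrs)
            if match:
                ids.add(match.group(1))
    return ids


def gff3_transcript_ids(lines: Iterable[str]) -> Set[str]:
    """Collect transcript IDs from GFF3 rows that have an ID=transcript:... attribute."""
    ids = set()
    for _feature, attrs in _data_rows(lines):
        match = GFF3_TRANSCRIPT_ID.search(attrs)
        if match:
            ids.add(match.group(1))
    return ids


def compare_dumps(gtf_lines: Iterable[str], gff3_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Report transcript IDs present in one dump but not in the other."""
    gtf_ids = gtf_transcript_ids(gtf_lines)
    gff3_ids = gff3_transcript_ids(gff3_lines)
    return {
        "only_in_gtf": sorted(gtf_ids - gff3_ids),
        "only_in_gff3": sorted(gff3_ids - gtf_ids),
    }
```

Both functions accept any iterable of lines, so compressed dumps can be passed as gzip.open(path, "rt") handles. Given the behaviour described above, a non-empty only_in_gtf list would be the expected symptom.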

GRCh37 REST VEP – Conservation Parameter

Affects: Live site Versions: Ensembl 101, Ensembl 102
Description: The parameter ‘Conservation’ on the VEP endpoint of http://grch37.rest.ensembl.org/ does not provide data as expected.
Workaround: This will be fixed for release 103.
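
To re-check the behaviour after the fix, a request that sets the parameter can be built as in the sketch below. It only constructs the URL and makes no network call. The use of the vep/:species/id/:id endpoint and the variant ID rs56116432 are illustrative assumptions, and vep_id_url is a hypothetical helper name.

```python
from urllib.parse import quote, urlencode

GRCH37_REST = "http://grch37.rest.ensembl.org"


def vep_id_url(variant_id: str, conservation: bool = True,
               server: str = GRCH37_REST) -> str:
    """Build a VEP-by-ID request URL, optionally setting Conservation=1."""
    params = {"content-type": "application/json"}
    if conservation:
        params["Conservation"] = 1
    # safe="/" keeps the slash in application/json unescaped
    query = urlencode(params, safe="/")
    return f"{server}/vep/human/id/{quote(variant_id)}?{query}"


# Illustrative variant ID only; any known variant identifier can be used.
print(vep_id_url("rs56116432"))
```

Fetching the resulting URL with any HTTP client and inspecting the JSON for a conservation value would show whether the parameter is being honoured.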

Missing RefSeq data in homo_sapiens otherfeatures 102

Affects: Live site Versions: Ensembl 100, Ensembl 101, Ensembl 102
Description: There are a number of RefSeq genes missing in the homo_sapiens_otherfeatures_102_38 database.

This will also affect VEP queries using the RefSeq transcript set.

Workaround: The Ensembl transcript set is unaffected, but there is no workaround for VEP queries using the RefSeq or merged transcript sets.

This will be fixed in Ensembl 103.

Missing variant pathogenicity predictions for REVEL, MetaLR and MutationAssessor

Affects: Live site, Mirrors Versions: Ensembl 102
Description: We are missing variant pathogenicity predictions from REVEL, MetaLR and MutationAssessor on:
* Variant page > Genes and regulation view
* Transcript page > Variant table view

This only affects human GRCh38 views. Predictions for CADD, SIFT and PolyPhen-2 are still available.

This problem does not impact Ensembl VEP.

Workaround: The scores can still be retrieved:

* in release 102 through VEP, using the web or command-line VEP tools
* using release 101 views

Missing human chrY gene in release 102

Affects: Live site, Mirrors Versions: Ensembl 102
Description: The lncRNA gene XGY2, on human chrY, is missing from Ensembl release 102. It will be reinstated for release 103.
Workaround: The gene can be accessed in the Ensembl release 101 archive.

Missing data in mouse for 3D Protein Viewer

Affects: Live site, Archives Versions: Ensembl 100, Ensembl 101, Ensembl 102
Description: Mappings between Ensembl translations and PDBe protein structures are not available.

They are missing from the ‘Protein Summary’ view on our transcript pages.

These data are also used to drive our interactive views showing variants on 3D PDB models. The ‘3D Protein Model’ views on the variant and transcript pages currently return no data.

Views of novel variants on 3D structures are also missing in the VEP web interface.

Workaround: The PDBe mappings and variant locations on 3D structures on transcript and variant pages can be viewed in Ensembl version 99.

Genomes have been over-masked

Affects: Live site, Mirrors Versions: Ensembl 102
Description: For some species, repeat-masked genomes have been masked using RepeatModeler libraries. We are not confident that this avoids masking gene families, so we will remove this masking and mask the genomes using Repbase libraries only.
Workaround: For the time being, the masked genomes remain masked using the RepeatModeler libraries.

Broken/missing links for transcripts with biotypes “tRNA” and “IG” for RefSeq tracks

Affects: Live site Versions: Ensembl 101, Ensembl 102
Description: When viewing the RefSeq track, the links to NCBI for transcripts with biotypes “tRNA” and “IG” are broken or incorrect.
Workaround: This will be fixed in an upcoming Ensembl release; in the meantime, the links will be disabled.

Compara ncRNA trees stats not described accurately

Affects: Live site Versions: Ensembl 100, Ensembl 101, Ensembl 102
Description: The stats computed for ncRNA trees under the names nb_genes_in_tree and nb_orphaned_genes do not actually refer to the final trees but to the unfiltered clusters (an earlier stage).
Workaround: This problem is corrected in Ensembl 103, where the stats match their names. As a result, their values will decrease significantly in at least 50% of the species reported.

Some protein coding genes turned into non_translating_CDS

Affects: Live site Versions: Ensembl 101, Ensembl 102
Description: A user spotted that the peptide FASTA files are considerably shorter for pachysolen_tannophilus_nrrl_y_2460_gca_001661245 (fungus). This is because in release 42 many of its protein-coding genes were marked as nontranslating_CDS, although the underlying data and annotation have not changed.
Workaround: No workaround

Incorrect display ids/labels captured for UCSC external references in mouse

Affects: Live site Versions: Ensembl 100, Ensembl 101, Ensembl 102
Description: Ensembl identifiers (ENS ids) are displayed as UCSC external references for Homo sapiens.
Workaround: Links out to the UCSC website work correctly.

rfam_genes have wrong strand when loaded with ensembl-genomeloader

Affects: Live site Versions: Ensembl 96, Ensembl 97, Ensembl 98, Ensembl 99, Ensembl 100, Ensembl 101, Ensembl 102
Description: The ensembl-genomeloader (GL; https://github.com/Ensembl/ensembl-genomeloader) is used by the non-vertebrate (NV) divisions to load genomes and their annotations from the ENA. It also used to annotate non-coding genes matching Rfam HMMs, but in some cases the assigned strand appears to be the template strand. This affects some microbial and plant genomes loaded with the GL.
Workaround: We will remove the rfam_genes from the affected genomes and run the RNA features pipeline instead.

GRCh37 – COSMIC insertion coordinates off by +1

Affects: Live site Versions: Ensembl 100, Ensembl 101, Ensembl 102
Description: The coordinates of insertions imported from the COSMIC source are off by +1.

For GRCh37 in e100, e101 and e102, 2.66% (253,428 of 9,511,409) of COSMIC variants are affected.
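
The affected fraction quoted above can be reproduced from the two counts. This is a small arithmetic check only; percent_affected is a hypothetical helper name.

```python
def percent_affected(affected: int, total: int) -> float:
    """Return the affected share of entries as a percentage."""
    return 100 * affected / total


# Counts taken from the description above.
print(f"{percent_affected(253_428, 9_511_409):.2f}%")  # prints 2.66%
```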

Workaround: Release 99 can be used for GRCh37:

http://grch37-archive.ensembl.org/index.html

http://ftp.ensembl.org/pub/grch37/release-99/variation/

Release 99 contained 4,478,854 COSMIC mutations, compared to 9,511,409 in the current database.

Drosophila melanogaster RNA gene cross-reference links do not work

Affects: Live site, Mirrors Versions: Ensembl 99, Ensembl 100, Ensembl 101, Ensembl 102
Description: Rfam and miRBase cross-reference links do not work, because they use the FlyBase ID instead of the RNA gene.
Workaround: Search for the Rfam or miRBase ID on the respective websites.