SeqXML Documentation

SeqXML Version 0.4


Table of Contents

1. Introduction
2. Example
3. Schema description
3.1. seqXML
3.2. entry
3.3. species
3.4. description
3.5. DBRef
3.6. property
3.7. RNAseq
3.8. DNAseq
3.9. AAseq
4. Changes
4.1. version 0.2
4.2. version 0.3
4.3. version 0.4

1. Introduction

SeqXML is an XML Schema to describe biological sequences. SeqXML is designed to be a simple and versatile format to store sequence data. The FASTA sequence format is so widely used because it is simple and versatile, but FASTA's weakness is the lack of structure in its header. The lack of standardization in FASTA headers makes it time-consuming and labor-intensive to parse header information in FASTA sequences gathered from different sources.

SeqXML solves this problem by adding just enough structure to allow easy parsing while still being a lightweight format. There are only two required fields: the id and the sequence.

If you do choose to specify additional information about the sequence, SeqXML makes it easy to do in a standardized way. The most common types of sequence metadata, such as description, species, and alternative IDs, are already encoded in SeqXML.

It's flexible, too. You can include your own metadata as properties, which can be just a tag, like "has_splice_variants", or a tag-value pair, where a named category like "prediction method" has a value like "genscan".

Because SeqXML is XML, it is easy to process and validate. Using XML as the "container" for sequence data makes simple validation possible, so that common problems such as duplicate entries and invalid sequence letters are detected automatically.

This document describes the content of a SeqXML file, version 0.4, and gives details about each element and attribute. For more information about SeqXML, including tools to convert SeqXML to and from other sequence formats, visit seqxml.org.

2. Example

In this section we walk through an example of a SeqXML file, shown here:

        <?xml version="1.0"?>
        <seqXML source="Ensembl" sourceVersion="56" speciesName="Homo sapiens" ncbiTaxID="9606" seqXMLversion="0.4">
            <entry id="A12345">
                <DNAseq>ACTGCGAAATGTGCCCGGGNNNN</DNAseq>
            </entry>
            <entry id="ENSG00000173402" source="Ensembl">
                <species name="Homo sapiens" ncbiTaxID="9606"/>
                <description>dystroglycan 1</description>
                <RNAseq>AAGGC----UGAUGUC.....ACAU</RNAseq>
                <DBRef source="RefSeq" id="NM_004393"/>
                <property name="prediction_method" value="manual curation"/>
                <property name="has_splice_variants"/>
            </entry>
            <entry id="ENSG000001734023" source="Ensembl">
                <description>example description.</description>
                <AAseq>AAGGCGAAA----AA*AAAAAGT.....CACJOXA</AAseq>
            </entry>
            <entry id="ENSG0000017340233">
                <description>another example description</description>
                <DNAseq>AAGGCGTTAAA----AAAAAAAGT.....CACTA</DNAseq>
            </entry>
        </seqXML>

The root element in the XML file is the seqXML tag. It indicates the version of SeqXML used, and optionally the origin of the data (often a database like Ensembl or Swiss-Prot) and the version of that database. The optional attributes speciesName and ncbiTaxID define the species of origin of all entries. Here you can see that this SeqXML file contains human records from Ensembl version 56:

        <seqXML source="Ensembl" sourceVersion="56" speciesName="Homo sapiens" ncbiTaxID="9606" seqXMLversion="0.4">

Under the seqXML tag there are four entries, one entry for each sequence. The first entry shows the minimum necessary. It's just an identifier and a sequence:

        <entry id="A12345">
            <DNAseq>ACTGCGAAATGTGCCCGGGNNNN</DNAseq>
        </entry>

Note that the type of sequence is specified by the sequence tag used. Here, it's <DNAseq> to indicate a DNA alphabet. The other options are RNAseq and AAseq for RNA and amino acid alphabets, respectively. If you validate a SeqXML file, the sequence will be checked to make sure it contains only IUPAC-allowed characters for that sequence type.
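
A minimal entry like the one above can also be produced programmatically. The following is only a sketch using Python's standard xml.etree.ElementTree module. The function name build_minimal_seqxml is hypothetical and not part of any SeqXML tool, and the root attribute name seqXMLversion is taken from the example in this section.

```python
import xml.etree.ElementTree as ET


def build_minimal_seqxml(entry_id, dna):
    """Return a SeqXML document string holding a single DNA entry.

    Only the two required pieces of information are written:
    the entry id and the sequence itself.
    """
    root = ET.Element("seqXML", {"seqXMLversion": "0.4"})
    entry = ET.SubElement(root, "entry", {"id": entry_id})
    # The tag name (DNAseq, RNAseq or AAseq) declares the sequence type.
    ET.SubElement(entry, "DNAseq").text = dna
    return ET.tostring(root, encoding="unicode")


doc = build_minimal_seqxml("A12345", "ACTGCGAAATGTGCCCGGGNNNN")
print(doc)
```

For an RNA or amino acid sequence, the same approach would apply with RNAseq or AAseq as the child element name.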

The second entry is much more detailed:

        <entry id="ENSG00000173402" source="Ensembl">
            <species name="Homo sapiens" ncbiTaxID="9606"/>
            <description>dystroglycan 1</description>
            <RNAseq>AAGGC----UGAUGUC.....ACAU</RNAseq>
            <DBRef source="RefSeq" id="NM_004393"/>
            <property name="prediction_method" value="manual curation"/>
            <property name="has_splice_variants"/>
        </entry>

In addition to the ID and the sequence (an RNA sequence in this case), there are several other pieces of optional information. There's an optional source attribute identifying the ID as being an Ensembl ID. There's also a description, a local species definition, a database cross-reference and its source (RefSeq in this case), and then a couple of properties.

Rather than overloading the description field, as has become common with FASTA records, properties are a way to include additional information related to the sequence. A property can be written in two ways: as a tag alone, or as a tag with an associated value. The example here shows one of each:

         <property name="has_splice_variants"/>
         <property name="prediction_method" value="manual curation"/>

Of course these simple tags cannot properly represent more complicated types of metadata, such as journal references, but they are not intended to. SeqXML is designed as a lightweight format, not as a comprehensive one. For that purpose, there are other more appropriate XML sequence data formats, such as UniProt XML or EMBL XML.

So that's a quick overview of SeqXML. Tools for using it and converting to and from other sequence formats are available at SeqXML.org. More detailed reference documentation follows below.

3. Schema description

This section is intended to be a human-readable version of the SeqXML schema. If you encounter a discrepancy, (1) the schema is right and (2) please let us know!

3.1. seqXML

The root element. The seqXML element must have at least one entry child element.

Attributes:

  • seqXMLversion (required,decimal) – The version of the SeqXML specification being used in the file.

  • source (optional,string) – The source program/database of the file, for example UniProt.

  • sourceVersion (optional,any) – The version or release number of the source program/database that the data in the file comes from.

  • speciesName (optional,string) – The scientific name of the species of origin of all entries in the file. It is also possible to define the species locally with a species element in the entry.

  • ncbiTaxID (optional,integer) – The NCBI taxonomy identifier of the species of origin. If the speciesName is given, this attribute must be given as well. The reason for this requirement is that species names are notoriously difficult to parse. While it seems like standardizing on genus and species would be sufficient, subspecies, strains, and subtypes (common in prokaryotic and viral nomenclature) tend to confound that approach. Using a taxonomy identifier allows a species to be indicated unambiguously.
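
As an illustration of the root attributes, the sketch below reads them with Python's standard xml.etree.ElementTree module and checks the rule that a speciesName must be accompanied by an ncbiTaxID. The helper name read_root_metadata is hypothetical, the attribute names are those used in the example in section 2, and this check is not a substitute for validating against the schema.

```python
import xml.etree.ElementTree as ET

# Attribute names as they appear in the example file in section 2.
ROOT_ATTRIBUTES = ("seqXMLversion", "source", "sourceVersion",
                   "speciesName", "ncbiTaxID")


def read_root_metadata(xml_text):
    """Return the root attributes as a dict; absent attributes map to None."""
    root = ET.fromstring(xml_text)
    if root.tag != "seqXML":
        raise ValueError("root element must be seqXML")
    meta = {name: root.get(name) for name in ROOT_ATTRIBUTES}
    # Rule from the text: a species name requires its taxonomy identifier.
    if meta["speciesName"] is not None and meta["ncbiTaxID"] is None:
        raise ValueError("speciesName given without ncbiTaxID")
    return meta


example = (
    '<seqXML source="Ensembl" sourceVersion="56" speciesName="Homo sapiens" '
    'ncbiTaxID="9606" seqXMLversion="0.4">'
    '<entry id="A12345"><DNAseq>ACTG</DNAseq></entry>'
    '</seqXML>'
)
meta = read_root_metadata(example)
print(meta["source"], meta["sourceVersion"], meta["ncbiTaxID"])
```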

3.2. entry

The entry element represents a single sequence record. The sequence's primary identifier is specified as an attribute, and there's an optional source attribute. Entry has only one required child element, and it must be one of: DNAseq, RNAseq, AAseq. All other child elements are optional.

Attributes:

  • id (required,string) – The primary identifier (ID) of the sequence.

  • source (optional,string) – The source of the identifier. That is, which database the identifier comes from. For example, ENSG00000173402 is an Ensembl ID, so you can specify source="Ensembl".
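
The structure of an entry can be illustrated with a short reader. This is a sketch only, using Python's standard xml.etree.ElementTree module. The function name iter_entries is hypothetical, and the checks it performs (an id is present, exactly one sequence element exists) mirror the prose above rather than the schema itself.

```python
import xml.etree.ElementTree as ET

SEQUENCE_TAGS = ("DNAseq", "RNAseq", "AAseq")


def iter_entries(xml_text):
    """Yield (id, source, sequence_tag, sequence) for every entry."""
    root = ET.fromstring(xml_text)
    for entry in root.findall("entry"):
        entry_id = entry.get("id")
        if entry_id is None:
            raise ValueError("entry without the required id attribute")
        sequences = [child for child in entry if child.tag in SEQUENCE_TAGS]
        if len(sequences) != 1:
            raise ValueError("entry %s needs exactly one sequence element" % entry_id)
        seq = sequences[0]
        # source is optional, so it is None when the attribute is absent.
        yield entry_id, entry.get("source"), seq.tag, seq.text or ""


example = (
    '<seqXML seqXMLversion="0.4">'
    '<entry id="A12345"><DNAseq>ACTGCGAAATGTGCCCGGGNNNN</DNAseq></entry>'
    '<entry id="ENSG00000173402" source="Ensembl">'
    '<RNAseq>AAGGC----UGAUGUC.....ACAU</RNAseq></entry>'
    '</seqXML>'
)
entries = list(iter_entries(example))
for entry_id, source, tag, sequence in entries:
    print(entry_id, source, tag, len(sequence))
```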

3.3. species

The species from which the sequence comes, defined locally for a single entry. (optional)

The local species definition overrides any global definition in the root. Species is an optional part of a sequence entry, but if included, it must contain the NCBI taxonomy identifier. The reason for this requirement is that species names are notoriously difficult to parse. While it seems like standardizing on genus and species would be sufficient, subspecies, strains, and subtypes (common in prokaryotic and viral nomenclature) tend to confound that approach. Using a taxonomy identifier allows a species to be indicated unambiguously. Only one species is allowed per sequence.

Attributes:

  • name (required,string) – The species name.

  • ncbiTaxID (required,integer) – The NCBI taxonomy identifier (ID) for the species.
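
The precedence of a local species element over the global definition on the root can be sketched as follows. The function name species_for_entry is hypothetical, the mouse entry is invented for illustration, and the code uses only Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET


def species_for_entry(root, entry):
    """Return (name, ncbiTaxID) for an entry; a local species element wins."""
    local = entry.find("species")
    if local is not None:
        return local.get("name"), local.get("ncbiTaxID")
    # Fall back to the global definition; both values are None if absent.
    return root.get("speciesName"), root.get("ncbiTaxID")


root = ET.fromstring(
    '<seqXML speciesName="Homo sapiens" ncbiTaxID="9606" seqXMLversion="0.4">'
    '<entry id="e1"><DNAseq>ACGT</DNAseq></entry>'
    '<entry id="e2"><species name="Mus musculus" ncbiTaxID="10090"/>'
    '<DNAseq>ACGT</DNAseq></entry>'
    '</seqXML>'
)
resolved = {e.get("id"): species_for_entry(root, e) for e in root.findall("entry")}
print(resolved)
```

Attribute values are returned as strings, so a caller that needs a numeric taxonomy identifier would convert it explicitly.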

3.4. description

A description of the sequence. (optional)

Often the name of the gene or protein which the sequence comes from, such as "homeobox protein A2". Only one description is allowed per sequence. Please do not stuff multiple types of information in the description field. Instead, specify extra data using properties.

3.5. DBRef

Database cross-references. (optional)

Multiple DBRefs may be specified.

Attributes:

  • type (optional,string) – The type of data that the DBRef refers to. Common types include DNA, RNA, AA, ncRNA, structure, SNP, and journal article. We recommend following standard English capitalization rules: abbreviations like DNA, RNA, and SNP should be all-caps, and everything else should be uncapitalized.

    Note that this is not necessarily the same type as the sequence in the SeqXML entry. One might have a SeqXML <AAseq> entry but be cross-referencing that with an Ensembl transcript record, in which case the DBRef to the Ensembl transcript would have type="RNA".

  • source (required,string) – The name of the source (usually a database) that the DBRef refers to. We recommend using the same spelling and capitalization as used by the major resources like UniProt and INSDC.

  • id (required,string) – The referenced identifier (ID) itself.

Examples:

  • an Ensembl gene

    <DBRef type="gene" source="Ensembl" id="ENSG123450000"/>

  • an Ensembl transcript

    <DBRef type="RNA" source="Ensembl" id="ENST123450001"/>

  • an Ensembl protein

    <DBRef type="AA" source="Ensembl" id="ENSP123450002"/>
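
Because an entry may carry several DBRefs, and the type attribute is optional, a reader typically has to collect them and tolerate a missing type. The following sketch, with the hypothetical helper dbrefs_by_source, groups the cross-references of one entry by source using Python's standard xml.etree.ElementTree module. The entry and its amino acid sequence are made up for illustration.

```python
import xml.etree.ElementTree as ET


def dbrefs_by_source(entry):
    """Group an entry's cross-references as {source: [(type, id), ...]}."""
    grouped = {}
    for ref in entry.findall("DBRef"):
        # type is optional, so ref.get("type") may be None.
        item = (ref.get("type"), ref.get("id"))
        grouped.setdefault(ref.get("source"), []).append(item)
    return grouped


entry = ET.fromstring(
    '<entry id="example">'
    '<AAseq>MKV</AAseq>'
    '<DBRef type="gene" source="Ensembl" id="ENSG123450000"/>'
    '<DBRef type="RNA" source="Ensembl" id="ENST123450001"/>'
    '<DBRef source="RefSeq" id="NM_004393"/>'
    '</entry>'
)
refs = dbrefs_by_source(entry)
print(refs)
```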

3.6. property

Arbitrary additional information related to the sequence. (optional)

SeqXML includes three predefined places to store the most common types of metadata related to a sequence: description, species, and DBRef. Other bits of sequence-related information not covered by these types can be stored as properties.

In its simplest form, a property is simply a tag. That is, a text string such as "has splice variants", "sequenced by Raoul", or "really important".

A property can also, optionally, have a value associated with the tag. Tag-value pairs are a common way of storing both a piece of information and the type or category of that information. Here are three examples of tag-value pairs:

    Tag                          Value
    BLAST score                  145.6
    topology prediction method   SignalP
    evidence code                IEA

Attributes:

  • name (required,string) – The property's name.

  • value (optional,string) – The property's value.
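
The two forms of a property (tag only, or tag with value) can be told apart by whether the value attribute is present. The sketch below separates them, using Python's standard xml.etree.ElementTree module. The helper name read_properties is hypothetical, and results are kept as lists rather than a dictionary because this sketch does not assume that property names are unique within an entry.

```python
import xml.etree.ElementTree as ET


def read_properties(entry):
    """Return (tags, pairs): tag-only names and (name, value) pairs."""
    tags, pairs = [], []
    for prop in entry.findall("property"):
        name = prop.get("name")
        value = prop.get("value")
        if value is None:
            tags.append(name)           # plain tag, no value attribute
        else:
            pairs.append((name, value))  # tag-value pair
    return tags, pairs


entry = ET.fromstring(
    '<entry id="ENSG00000173402" source="Ensembl">'
    '<RNAseq>AAGGC----UGAUGUC.....ACAU</RNAseq>'
    '<property name="prediction_method" value="manual curation"/>'
    '<property name="has_splice_variants"/>'
    '</entry>'
)
tags, pairs = read_properties(entry)
print(tags, pairs)
```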

3.7. RNAseq

The sequence itself.

Each entry must have one and only one of: RNAseq, DNAseq, AAseq. The sequence must exist and be at least one character in length. If the SeqXML file is validated, each sequence type will be validated against the appropriate IUPAC alphabet.

For RNA, the allowed characters are: ACGUMRWSYKVHDBXN.-

3.8. DNAseq

The sequence itself.

Each entry must have one and only one of: RNAseq, DNAseq, AAseq. The sequence must exist and be at least one character in length. If the SeqXML file is validated, each sequence type will be validated against the appropriate IUPAC alphabet.

For DNA, the allowed characters are: ACGTMRWSYKVHDBXN.-

3.9. AAseq

The sequence itself.

Each entry must have one and only one of: RNAseq, DNAseq, AAseq. The sequence must exist and be at least one character in length. If the SeqXML file is validated, each sequence type will be validated against the appropriate IUPAC alphabet.

For an amino acid sequence, the allowed characters are: ABCDEFGHIJKLMNOPQRSTUVWXYZ.-*
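
The three alphabets listed in sections 3.7 to 3.9 lend themselves to a simple character check. The sketch below is not the schema's own validation. It is a hypothetical helper, invalid_characters, that compares a sequence against the allowed characters exactly as printed in this document, and it therefore treats lowercase letters as invalid, which is an assumption.

```python
# Allowed characters, copied from the lists in this document.
ALPHABETS = {
    "RNAseq": set("ACGUMRWSYKVHDBXN.-"),
    "DNAseq": set("ACGTMRWSYKVHDBXN.-"),
    "AAseq": set("ABCDEFGHIJKLMNOPQRSTUVWXYZ.-*"),
}


def invalid_characters(sequence_tag, sequence):
    """Return the sorted characters of sequence not allowed for its type."""
    allowed = ALPHABETS[sequence_tag]
    return sorted(set(sequence) - allowed)


def is_valid(sequence_tag, sequence):
    """A sequence must be non-empty and use only allowed characters."""
    return len(sequence) >= 1 and not invalid_characters(sequence_tag, sequence)


print(is_valid("DNAseq", "ACTGCGAAATGTGCCCGGGNNNN"))
print(invalid_characters("DNAseq", "ACGU"))
```

Here the second call reports U, because uracil belongs to the RNA alphabet and not to the DNA alphabet.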

4. Changes

4.1. version 0.2

July 2, 2010

  • sequence type – The possible values for the sequence type were changed to use the more common biological capitalization instead of the camelCase programming convention.

  • alternativeID – The alternativeID element name was changed to DBRef in order to make it more generic.

  • DBRef type – A DBRef (formerly the alternativeID) must now specify a type. We made this change because (a) an ID becomes much more useful if you specify the type of information that the ID refers to, and (b) a single source can provide multiple types of data. See DBRef and the example.

4.2. version 0.3

December 9, 2010

  • source attribute for entries – It's now possible to (optionally) specify the source for an entry ID. There is already an optional global source attribute, but this was added to allow individual entries to use (and specify) identifiers from different source databases.

4.3. version 0.4

June 28, 2011

  • speciesName and ncbiTaxID attributes for the document root seqXML – The species can now optionally be defined globally. This reduces redundancy, because the species no longer has to be defined with a local species element in every entry. It is, however, still possible to override the global definition with a local species element.

  • DBRef type – The type attribute of DBRef is no longer required. This reverts a previous change, because it is not always possible to know the type of the referenced object. We do, however, strongly encourage you to use the type whenever you can.