SeqXML Documentation

SeqXML Version 0.3

Table of Contents

1. Introduction

2. Example

3. Schema description

3.1. seqXML
3.2. entry
3.3. species
3.4. description
3.5. DBRef
3.6. property
3.7. RNAseq
3.8. DNAseq
3.9. AAseq

4. Changes

4.1. version 0.2
4.2. version 0.3

1. Introduction

SeqXML is an XML Schema to describe biological sequences. SeqXML is designed to be a simple and versatile format to store sequence data. The FASTA sequence format is so widely used because it is simple and versatile, but FASTA's weakness is the lack of structure in its header. The lack of standardization in FASTA headers makes it time-consuming and labor-intensive to parse header information in FASTA sequences gathered from different sources.

SeqXML solves this problem by adding just enough structure to allow easy parsing while still being a lightweight format. There are only two required fields: the id and the sequence.

If you do choose to specify additional information about the sequence, SeqXML makes it easy to do in a standardized way. The most common types of sequence metadata, such as description, species, and alternative IDs, are already encoded in SeqXML.

It's flexible, too. You can include your own metadata as properties, which can be just a tag, like "has_splice_variants", or a tag-value pair, where a named category like "prediction method" has a value like "genscan".

XML itself helps as well, because it is easy to process and validate. Using XML as the "container" for sequence data makes simple validation possible, so that common problems like duplicate entries and invalid sequence letters are detected automatically.

This document describes the content of a SeqXML file (version 0.3) and gives details about each element and attribute. For more information about SeqXML, including tools to convert SeqXML to and from other sequence formats, visit seqxml.org.

2. Example

In this section we walk through an example of a SeqXML file, shown here:

        <?xml version="1.0"?>
        <seqXML source="Ensembl" sourceVersion="56" seqXMLversion="0.3">
            <entry id="A12345">
                <DNAseq>ACTGCGAAATGTGCCCGGGNNNN</DNAseq>
            </entry>
            <entry id="ENSG00000173402" source="Ensembl">
                <species name="Homo sapiens" ncbiTaxID="9606"/>
                <description>dystroglycan 1</description>
                <RNAseq>AAGGC----UGAUGUC.....ACAU</RNAseq>
                <DBRef type="RNA" source="GenBank" id="NM_004393"/>
                <property name="prediction_method" value="manual curation"/>
                <property name="has_splice_variants"/>
            </entry>
            <entry id="ENSG000001734023" source="Ensembl">
                <species name="Homo sapiens" ncbiTaxID="9606"/>
                <description>example description.</description>
                <AAseq>AAGGCGAAA----AA*AAAAAGT.....CACJOXA</AAseq>
            </entry>
            <entry id="ENSG0000017340233" source="Ensembl">
                <species name="Homo sapiens" ncbiTaxID="9606"/>
                <description>another example description</description>
                <DNAseq>AAGGCGTTAAA----AAAAAAAGT.....CACTA</DNAseq>
            </entry>
        </seqXML>

The root element in the XML file is the seqXML tag. It indicates the version of SeqXML used, and optionally the origin of the data (often, a database like Ensembl or Swiss-Prot) and the version of that database. Here you can see this SeqXML file contains records from Ensembl version 56:

        <seqXML source="Ensembl" sourceVersion="56" seqXMLversion="0.3">

Under the seqXML tag there are four entries, one entry for each sequence. The first entry shows the minimum necessary. It's just an identifier and a sequence:

        <entry id="A12345">
            <DNAseq>ACTGCGAAATGTGCCCGGGNNNN</DNAseq>
        </entry>

Note that the type of sequence is specified by the sequence tag used. Here, it's <DNAseq> to indicate a DNA alphabet. The other options are RNAseq and AAseq for RNA and amino acid alphabets, respectively. If you validate a SeqXML file, the sequence will be checked to make sure it contains only IUPAC-allowed characters for that sequence type.
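
Because the sequence type is carried by the element name, a reader can recover it without inspecting the sequence itself. The following is a minimal sketch in Python using the standard library's xml.etree.ElementTree; the function name read_sequences and the SAMPLE string are illustrative assumptions, not part of any SeqXML tool.

```python
import xml.etree.ElementTree as ET

# Sequence element names as listed in this document.
SEQUENCE_TAGS = ("DNAseq", "RNAseq", "AAseq")

SAMPLE = """<?xml version="1.0"?>
<seqXML source="Ensembl" sourceVersion="56" seqXMLversion="0.3">
    <entry id="A12345">
        <DNAseq>ACTGCGAAATGTGCCCGGGNNNN</DNAseq>
    </entry>
</seqXML>"""


def read_sequences(xml_text):
    """Return a list of (entry id, sequence tag, sequence) tuples."""
    root = ET.fromstring(xml_text)
    records = []
    for entry in root.findall("entry"):
        for tag in SEQUENCE_TAGS:
            seq = entry.find(tag)
            if seq is not None:
                # The tag name itself tells us which alphabet is in use.
                records.append((entry.get("id"), tag, (seq.text or "").strip()))
                break
    return records


records = read_sequences(SAMPLE)
for entry_id, tag, sequence in records:
    print(entry_id, tag, sequence)
```

This sketch only reads the file; it does not check the sequence against the allowed characters for its type.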

The second entry is much more detailed:

        <entry id="ENSG00000173402" source="Ensembl">
            <species name="Homo sapiens" ncbiTaxID="9606"/>
            <description>dystroglycan 1</description>
            <RNAseq>AAGGC----UGAUGUC.....ACAU</RNAseq>
            <DBRef type="RNA" source="GenBank" id="NM_004393"/>
            <property name="prediction_method" value="manual curation"/>
            <property name="has_splice_variants"/>
        </entry>

In addition to the ID and the sequence (an RNA sequence in this case), there are several other pieces of optional information. There's an optional source attribute identifying the ID as being an Ensembl ID. There's also a description, the species and its NCBI taxonomy ID, an alternative identifier and its source (GenBank in this case), and then a couple of properties.

Rather than overloading the description field, as has become common with FASTA records, properties are a way to include additional information related to the sequence. Properties can be written in two ways: as a bare tag, or as a tag with an associated value. The example here shows one of each:

         <property name="has_splice_variants"/>
         <property name="prediction_method" value="manual curation"/>
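
On the reading side, the two forms differ only in whether the value attribute is present. Here is a small Python sketch of that distinction; read_properties is a hypothetical helper name, and collapsing properties into a dictionary assumes that property names are not repeated within an entry, which this document does not state.

```python
import xml.etree.ElementTree as ET

ENTRY = """<entry id="ENSG00000173402" source="Ensembl">
    <RNAseq>AAGGC----UGAUGUC.....ACAU</RNAseq>
    <property name="has_splice_variants"/>
    <property name="prediction_method" value="manual curation"/>
</entry>"""


def read_properties(entry):
    """Map each property name to its value, or to None for a bare tag."""
    # Element.get returns None when the optional value attribute is absent.
    return {p.get("name"): p.get("value") for p in entry.findall("property")}


properties = read_properties(ET.fromstring(ENTRY))
for name, value in properties.items():
    print(name, "(tag only)" if value is None else value)
```

If repeated property names matter for your data, a list of (name, value) pairs would be a safer structure than a dictionary.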

Of course these simple tags cannot properly represent more complicated types of metadata, such as journal references, but they are not intended to. SeqXML is designed as a lightweight format, not as a comprehensive one. For comprehensive records, there are other more appropriate XML sequence data formats, such as UniProt XML or EMBL XML.

So that's a quick overview of SeqXML. Tools for using it and converting to and from other sequence formats are available at SeqXML.org. More detailed reference documentation follows.

3. Schema description

This section is intended to be a human-readable version of the SeqXML schema. If you encounter a discrepancy, (1) the schema is right and (2) please let us know!

3.1. seqXML

The root element. The seqXML element must have at least one entry child element.

Attributes:

seqXMLversion (required,decimal) – The version of the SeqXML specification being used in the file.
source (optional,string) – The source program/database of the file, for example UniProt.
sourceVersion (optional,any) – The version or release number of the source program/database that the data in the file comes from.

3.2. entry

The entry element represents a single sequence record. The sequence's primary identifier is specified as an attribute, and there's an optional source attribute. An entry has exactly one required child element, which must be one of DNAseq, RNAseq, or AAseq. All other child elements are optional.

Attributes:

id (required,string) – The primary identifier (ID) of the sequence.
source (optional,string) – The source of the identifier. That is, which database the identifier comes from. For example, ENSG00000173402 is an Ensembl ID, so you can specify source="Ensembl".
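
The rules above (a required id, exactly one sequence element per entry, and no duplicate entries) can be checked with a few lines of code. The following Python sketch is an illustration only and is no substitute for validating against the schema; check_entries and the BROKEN sample are hypothetical names introduced here.

```python
import xml.etree.ElementTree as ET
from collections import Counter

SEQUENCE_TAGS = {"DNAseq", "RNAseq", "AAseq"}


def check_entries(xml_text):
    """Return a list of problems found in the entries of a SeqXML string."""
    entries = ET.fromstring(xml_text).findall("entry")
    problems = []
    if not entries:
        problems.append("seqXML has no entry elements")
    for position, entry in enumerate(entries, start=1):
        # Fall back to the position when the id itself is missing.
        label = entry.get("id") or f"entry #{position}"
        if not entry.get("id"):
            problems.append(f"{label}: missing id attribute")
        n_seqs = sum(1 for child in entry if child.tag in SEQUENCE_TAGS)
        if n_seqs != 1:
            problems.append(f"{label}: expected 1 sequence element, found {n_seqs}")
    # Report each duplicated id once, after the per-entry checks.
    id_counts = Counter(e.get("id") for e in entries if e.get("id"))
    for entry_id, count in id_counts.items():
        if count > 1:
            problems.append(f"{entry_id}: id used {count} times")
    return problems


# Deliberately broken input: a duplicated id and an entry without a sequence.
BROKEN = """<seqXML seqXMLversion="0.3">
    <entry id="A1"><DNAseq>ACGT</DNAseq></entry>
    <entry id="A1"><description>no sequence here</description></entry>
</seqXML>"""

problems = check_entries(BROKEN)
for problem in problems:
    print(problem)
```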

3.3. species

The species which the sequence comes from. (optional)

Species is an optional part of a sequence entry, but if included, it must contain the NCBI taxonomy identifier. The reason for this requirement is that species names are notoriously difficult to parse. While it seems like standardizing on genus and species would be sufficient, subspecies, strains, and subtypes (common in prokaryotic and viral nomenclature) tend to confound that approach. Using a taxonomy identifier allows a species to be indicated unambiguously. Only one species is allowed per sequence.

Attributes:

name (required,string) – The species name.
ncbiTaxID (required,string) – The NCBI taxonomy identifier (ID) for the species.
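
One practical benefit of the taxonomy identifier is that entries can be grouped by species without parsing any names. A rough Python sketch follows, using the attribute name ncbiTaxID as it is written in the example in section 2; group_by_taxon and the SAMPLE string are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SAMPLE = """<seqXML seqXMLversion="0.3">
    <entry id="A12345">
        <DNAseq>ACTGCGAAATGTGCCCGGGNNNN</DNAseq>
    </entry>
    <entry id="ENSG00000173402" source="Ensembl">
        <species name="Homo sapiens" ncbiTaxID="9606"/>
        <RNAseq>AAGGC----UGAUGUC.....ACAU</RNAseq>
    </entry>
</seqXML>"""


def group_by_taxon(xml_text):
    """Map NCBI taxonomy ID (None when species is absent) to entry IDs."""
    groups = defaultdict(list)
    for entry in ET.fromstring(xml_text).findall("entry"):
        species = entry.find("species")
        # species is optional, so entries without it are grouped under None.
        tax_id = species.get("ncbiTaxID") if species is not None else None
        groups[tax_id].append(entry.get("id"))
    return dict(groups)


groups = group_by_taxon(SAMPLE)
for tax_id, entry_ids in groups.items():
    print(tax_id, entry_ids)
```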

3.4. description

A description of the sequence. (optional)

Often the name of the gene or protein which the sequence comes from, such as "homeobox protein A2". Only one description is allowed per sequence. Please do not stuff multiple types of information in the description field. Instead, specify extra data using properties.

3.5. DBRef

An additional identifier for the sequence. (optional)

Multiple DBRefs may be specified.

Attributes:

type (required,string) – The type of data that the DBRef refers to. Common types would be, for example: DNA, RNA, AA, ncRNA, structure, SNP, journal article.
source (required,string) – The name of the source (usually a database) that the DBRef comes from.
id (required,string) – The alternative identifier (ID) itself.

Note that the DBRef type is not necessarily the same type as the sequence in the SeqXML entry. One might have a SeqXML <AAseq> entry but be cross-referencing that with an Ensembl transcript record, in which case the DBRef to the Ensembl transcript would have type="RNA".

Examples:

An Ensembl gene:

        <DBRef type="gene" source="Ensembl" id="ENSG123450000"/>

An Ensembl transcript:

        <DBRef type="RNA" source="Ensembl" id="ENST123450001"/>

An Ensembl protein:

        <DBRef type="AA" source="Ensembl" id="ENSP123450002"/>
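
When writing SeqXML from a program, each DBRef is just an empty element with three attributes. The sketch below builds an amino acid entry that cross-references a transcript, so the DBRef type differs from the sequence type. It is written in Python with xml.etree.ElementTree; make_entry is a hypothetical helper and the sequence "MKV" is a made-up placeholder.

```python
import xml.etree.ElementTree as ET


def make_entry(entry_id, seq_tag, sequence, dbrefs=()):
    """Build an <entry> element; dbrefs is an iterable of (type, source, id)."""
    entry = ET.Element("entry", {"id": entry_id})
    # The sequence element comes before the DBRefs, as in the section 2 example.
    ET.SubElement(entry, seq_tag).text = sequence
    for ref_type, source, ref_id in dbrefs:
        ET.SubElement(
            entry, "DBRef", {"type": ref_type, "source": source, "id": ref_id}
        )
    return entry


entry = make_entry(
    "ENSP123450002",
    "AAseq",
    "MKV",  # placeholder sequence, not real data
    dbrefs=[("RNA", "Ensembl", "ENST123450001")],
)
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
```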

3.6. property

Arbitrary additional information related to the sequence. (optional)

SeqXML includes three predefined places to store the most common types of metadata related to a sequence: description, species, and DBRef. Other bits of sequence-related information not covered by these types can be stored as properties.

In its simplest form, a property is just a tag: a text string such as "has splice variants", "sequenced by Raoul", or "really important".

A property can also, optionally, have a value associated with the tag. Tag-value pairs are a common way of storing both a piece of information and the type or category of that information. Here are three examples of tag-value pairs:

Tag	Value
BLAST score	145.6
topology prediction method	SignalP
evidence code	IEA

Attributes:

name (required,string) – The property's name.
value (optional,string) – The property's value.

3.7. RNAseq

The sequence itself.

Each entry must have one and only one of: RNAseq, DNAseq, AAseq. The sequence must exist and be at least one character in length. If the SeqXML file is validated, each sequence type will be validated against the appropriate IUPAC alphabet.

For RNA, the allowed characters are: ACGUMRWSYKVHDBXN.-

3.8. DNAseq

The sequence itself.

For DNA, the allowed characters are: ACGTMRWSYKVHDBXN.-

3.9. AAseq

The sequence itself.

For an amino acid sequence, the allowed characters are: ABCDEFGHIJKLMNOPQRSTUVWXYZ.-*
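
The three character lists above translate directly into a per-type check. The following Python sketch mirrors that rule outside of schema validation; is_valid_sequence is a hypothetical name, and treating lowercase letters as invalid is an assumption based on the lists containing uppercase characters only.

```python
import re

# Allowed characters, copied from the lists in this document.
ALPHABETS = {
    "RNAseq": "ACGUMRWSYKVHDBXN.-",
    "DNAseq": "ACGTMRWSYKVHDBXN.-",
    "AAseq": "ABCDEFGHIJKLMNOPQRSTUVWXYZ.-*",
}

# One compiled pattern per sequence type; "+" enforces a minimum length of one.
PATTERNS = {
    tag: re.compile("[" + re.escape(chars) + "]+")
    for tag, chars in ALPHABETS.items()
}


def is_valid_sequence(tag, sequence):
    """Return True if sequence is non-empty and uses only allowed characters."""
    return PATTERNS[tag].fullmatch(sequence) is not None


print(is_valid_sequence("RNAseq", "AAGGC----UGAUGUC.....ACAU"))
print(is_valid_sequence("DNAseq", "ACGU"))
```

The second call uses U in a DNA sequence, which is not in the DNA list, so it is rejected.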

4. Changes

4.1. version 0.2

July 2, 2010

sequence type – The sequence element names were changed to use the more common biological capitalization instead of the camelCase programming convention:
- dnaSeq → DNAseq
- rnaSeq → RNAseq
- aaSeq → AAseq
alternativeID – The alternativeID element name was changed to DBRef in order to make it more generic.
DBRef type – A DBRef (formerly the alternativeID) must now specify a type. We made this change because (a) an ID becomes much more useful if you specify the type of information that the ID refers to, and (b) a single source can provide multiple types of data. See DBRef (section 3.5) and the example (section 2).

4.2. version 0.3

December 9, 2010

source attribute for entries – It's now possible to (optionally) specify the source for an entry ID. There is already an optional global source attribute, but this was added to allow individual entries to use (and specify) identifiers from different source databases.