New TOPSAN Paper and Download Details

November 10, 2010

A paper on TOPSAN, entitled “TOPSAN: a dynamic web database for structural genomics”, will be featured in the upcoming Nucleic Acids Research database issue.
One of the main points of this paper is the many efforts to make TOPSAN data more accessible. These efforts include providing TOPSAN articles for bulk download in machine readable formats. These download options now include RDFa, RDF and N3 files. All of these bulk download options can be found at http://topsan.org/Downloads.

As a brief introduction, the three formats we have provided contain semantic web related data, which better enables data organization and easier machine parsing. The three formats include:

  • RDFa: The text extract of a page, with semantic web microtags embedded in the XHTML. Use this download if you want the full text of the articles.
  • RDF: A pure XML description of relationship triples that can be used to create a searchable database. Use this download if you want to build a semantic database with graph context.
  • N3: A simpler format to describe semantic web triples. Use this download if you want the easiest to read version of TOPSAN triples.

Every article on TOPSAN is identified with a unique alias. For example, the record on the PDB protein 2ASH, found at http://topsan.org/Proteins/JCSG/2ash, is identified as TPS1300. The alias ID for TOPSAN records begins with the prefix TPS and is followed by a unique number. You can find this ID in the ‘Alias Ids’ field of the page header. This identifier can be used to download the ‘light’ versions of TOPSAN pages. For the TPS1300 example, the RDFa record can be found at http://topsan.org/rdfa/TPS1300, while the RDF record can be found at http://topsan.org/rdf/TPS1300.
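
The URL pattern described above can be sketched in Python. This is only an illustration, not an official TOPSAN client: the helper name light_urls is hypothetical, and the two paths are simply the ones quoted in this post.

```python
import re

# Alias IDs are the prefix "TPS" followed by a number, e.g. "TPS1300".
ALIAS_PATTERN = re.compile(r"^TPS\d+$")


def light_urls(alias):
    """Return the 'light' RDFa and RDF URLs for a TOPSAN alias ID.

    Hypothetical helper: it only reproduces the URL pattern quoted in the post.
    """
    if not ALIAS_PATTERN.match(alias):
        raise ValueError("not a TOPSAN alias ID: %r" % alias)
    return {
        "rdfa": "http://topsan.org/rdfa/" + alias,
        "rdf": "http://topsan.org/rdf/" + alias,
    }


urls = light_urls("TPS1300")
print(urls["rdfa"])  # http://topsan.org/rdfa/TPS1300
print(urls["rdf"])   # http://topsan.org/rdf/TPS1300
```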

The bulk download of all RDFa files in a single XML file can be found at http://files.topsan.org/topsan.xml.gz. The full RDF extract of TOPSAN can be found at http://files.topsan.org/topsan_rdf.tar.gz. This tarball contains every TOPSAN page entry as a separate RDF file, as well as a file called ‘graphMap’ that can be used to map the context of the graphs on a quad store server.
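
As a rough sketch of how the RDF tarball might be inspected before loading it into a store, the Python below lists its members and sets the ‘graphMap’ file aside. The function name is hypothetical, and because the internal format of ‘graphMap’ is not described here, the sketch only locates that file and does not parse it.

```python
import tarfile


def split_tarball_members(tar_path, map_name="graphMap"):
    """List the entries of the RDF tarball, separating out the graph map.

    Assumption: the map file is a regular member whose base name is
    'graphMap'. Its contents are not interpreted here.
    """
    rdf_members = []
    graph_map = None
    with tarfile.open(tar_path, "r:gz") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories and links
            base_name = member.name.rsplit("/", 1)[-1]
            if base_name == map_name:
                graph_map = member.name
            else:
                rdf_members.append(member.name)
    return rdf_members, graph_map
```

After downloading the archive, a call such as split_tarball_members("topsan_rdf.tar.gz") would return the list of per-page RDF files and the path of the map file inside the archive.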

We hope that by providing TOPSAN articles and knowledge in these formats we will enable better collaboration with other protein annotation efforts.


TOPSAN and the Semantic Web (Part II)

July 12, 2010

This is the second in a series of blogs in which we will try to introduce you to the concepts behind the TOPSAN Protein Syntax and the TOPSAN semantic notation system. Our first entry introduced how to embed semantic web notations into TOPSAN (Part I). In this article we will describe how to obtain and use the semantic information that has been embedded in TOPSAN, compose queries and analyze the available information.

SPARQL

The main reason to embed semantic information into TOPSAN pages is to allow for easy extraction of information from the site. Every day, all of the embedded semantic information on the site is scraped and compiled into a single file, which you can find at http://files.topsan.org/topsan.n3.gz. This ‘N3’ format file is a collection of all the semantic triples found on the TOPSAN site. Once you’ve downloaded and unzipped it, you can import it into many different semantic web ‘triple store’ systems. These systems can parse database queries written in the ‘SPARQL’ syntax. One such system is RDFLib, found at http://www.rdflib.net/. Once you have installed this library, you can use it to open and query the topsan.n3 file.

If you need an introduction to SPARQL, check out these tutorials:
 

Examples

To begin these examples first download the file found at http://files.topsan.org/topsan.n3.gz, and uncompress it.

$ wget http://files.topsan.org/topsan.n3.gz
$ gunzip topsan.n3.gz

(On mac)
$ curl -O http://files.topsan.org/topsan.n3.gz
$ gunzip topsan.n3.gz
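
If neither wget nor curl is available, the same two steps can be sketched with the Python 3 standard library. The function names below are illustrative, and the URL is the one given above.

```python
import gzip
import shutil
import urllib.request

N3_URL = "http://files.topsan.org/topsan.n3.gz"  # URL given in the post


def download(url, dest_path):
    """Save the contents of url to dest_path."""
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)


def gunzip(src_path, dest_path):
    """Decompress a .gz file; unlike the gunzip command, the source is kept."""
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as out:
        shutil.copyfileobj(src, out)
```

Calling download(N3_URL, "topsan.n3.gz") followed by gunzip("topsan.n3.gz", "topsan.n3") should leave the same topsan.n3 file as the shell commands.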

Example 1: Retrieve Molecular weights

testSparql.py Script:

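#!/usr/bin/env python
import sys

# Import path for current rdflib releases; the original post used
# "from rdflib.Graph import Graph"
from rdflib import Graph

# Load the N3 file named on the command line
g = Graph()
g.parse(sys.argv[1], format="n3")

# Select every subject that has a UniProt molecularWeight value
queryStr = """
SELECT ?id ?weight WHERE {
    ?id <http://purl.uniprot.org/core/molecularWeight> ?weight
}
"""

# Print each result row as "id : weight"
for row in g.query(queryStr):
    print(" : ".join(str(a) for a in row))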

When you call this script and pass it the path to the topsan.n3 file, it scans the file for all of the molecular weights assigned to proteins stored in TOPSAN. You should get something that looks like:

$ ./testSparql.py topsan.n3
http://topsan.org/purl/TPS11889 : 21055.88
http://topsan.org/purl/TPS11896 : 66251.10
http://topsan.org/purl/TPS11891 : 21321.14
http://topsan.org/purl/TPS11892 : 20223.71
http://topsan.org/purl/TPS11894 : 56539.67

….
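
Output in this form is easy to post-process. The sketch below turns such lines into a dictionary keyed by alias ID. It assumes that the last path segment of each purl URL is the TPS alias, which matches the sample output but is not otherwise documented here, and the function name is hypothetical.

```python
def parse_weight_line(line):
    """Split 'http://topsan.org/purl/TPS11889 : 21055.88' into (alias, weight)."""
    subject, weight = line.rsplit(" : ", 1)
    # Assumption: the alias ID is the last path segment of the purl URL.
    alias = subject.rstrip("/").rsplit("/", 1)[-1]
    return alias, float(weight)


sample_output = [
    "http://topsan.org/purl/TPS11889 : 21055.88",
    "http://topsan.org/purl/TPS11896 : 66251.10",
    "http://topsan.org/purl/TPS11891 : 21321.14",
]

weights = dict(parse_weight_line(line) for line in sample_output)
print(weights["TPS11889"])  # 21055.88
```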


 

Example 2: Retrieve PFAM to GO mapping

A more complicated call would be to extract all of the PFAM to GO mappings stored in TOPSAN.  To do this, replace the queryStr contents from ‘testSparql.py’ with the following query:

 


"""
PREFIX core:<http://purl.uniprot.org/core/>
SELECT ?pfam ?go WHERE {
    ?pfam a core:Pfam_Family ;
          core:memberOf ?go
}
"""

 


 

Example 3: Finding Domains of Unknown Function with PDB structures.

This query finds all PFAM families whose names start with ‘DUF’ and that have PDB structures on TOPSAN associated with them. The associations are found with the ‘seeAlso’ type link to a common tag, whose title is filtered with the regular expression “^PF”.

 


 

"""
PREFIX core:<http://purl.uniprot.org/core/>
SELECT * WHERE {
    ?fam a core:Pfam_Family .
    ?fam core:name ?name .
    FILTER( regex(?name, "^DUF") ) .
    ?prot a core:Protein_Structure .
    ?prot core:memberOf ?fam
}
"""

(Note: this query involves external data sources that are not part of the ‘core’ TOPSAN extract files. We will outline how to obtain the extended database needed for these types of queries in later posts.)


At this point, the query will take several minutes and over a GB of memory. In the next article we’ll demonstrate how to set up a Joseki based server to query Semantic Topsan data.


TOPSAN and the Semantic Web (Part I)

June 1, 2010

This is the first in a series of blogs in which we will try to introduce you to the concepts behind the TOPSAN Protein Syntax and the TOPSAN semantic notation system. The first article will cover the basics of engaging the notation environment and some simple examples of how to use the notation system. Next we will describe how to obtain and use the semantic information that has been embedded in TOPSAN, compose queries and analyze the available information. Finally we will describe the more advanced concepts involved with the controlled ontology of predicates that the TOPSAN Protein Syntax describes.

To read more about the technologies involved, you can find additional information at:

What is the semantic web:

Biohackathon semantic web series:

How the data is being stored:

How ontologies can control the language used to describe the relationships between different pieces of data:
Technologies that can be used to query and examine the data:

Introduction to TOPSAN Protein Syntax

One of the goals of the TOPSAN protein annotation system is to make sure that human annotations of protein structures are available to the public. This includes ensuring that annotations are available in a machine readable format. When an annotator adds a link or a value to a page, it is important that the intent of the link is expressed: for example, whether the link was added because it points to a homologue or because it points to another protein in the same pathway. The concepts and standards behind the semantic web provide a framework for expressing this information.

The TOPSAN Protein Syntax (TPS) is designed to cover the set of predicates used to describe the relationships between proteins and the databases and values they can be linked to. These predicates follow a formalized ontology that begins from three different roots. These different branches represent the different basic concepts that are used to describe proteins. These include ‘links’ and ‘values’. ‘Link’ statements describe connections from a protein to another database element, while ‘value’ statements assign direct values and data to a protein.

All calls embedded in the text of TOPSAN documents begin and end with the double braces ‘{{’ and ‘}}’. You then make a call to ‘note.link’. There are two ways to call the function: with sequential arguments or with named arguments. For sequential arguments, wrap the arguments with ‘(’ and ‘)’, and type in the values. This is usually only used for the two-argument call, when passing the predicate and the object values. Alternatively, if you want to set additional arguments, the named argument format is preferred. In this method, the arguments are wrapped with ‘{’ and ‘}’, and the name of each argument is given, followed by a ‘:’ and then the value. When using the named argument format you don’t have to remember a specific order of arguments.

note.link Arguments:

  • rel : Type of relationship
  • value : The database to link to, if it is not an identifiable link it is assigned as a literal value
  • visible : If false the call does not produce text that is visible on the page
  • about : Defaults to the current page
  • rev: If the relationship is reversed, so that the destination is the subject and the current page is the object.

Relationships can go in both directions. By default the subject is the current page and the passed value is the object. To reverse this relationship, so that the relationship statement is about the external database pointing to the current page, set ‘rev:true’.
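
To make the subject and object roles concrete, here is a small Python model of the triple that a note.link call describes. It is a hypothetical illustration of the behaviour described above, not TOPSAN's implementation. In particular, the way ‘about’ and ‘rev’ combine is an assumption, and ‘visible’ is left out because it only affects rendering.

```python
def note_link_triple(rel, value, current_page, about=None, rev=False):
    """Return the (subject, predicate, object) that a note.link call describes.

    Hypothetical model only. 'current_page' stands for the page that
    contains the call.
    """
    # 'about' defaults to the current page.
    subject = about if about is not None else current_page
    obj = value
    if rev:
        # Reversed relationship: the linked entry becomes the subject.
        subject, obj = obj, subject
    return (subject, rel, obj)


page = "TOPSAN:2aam"  # example page name, used for illustration

print(note_link_triple("memberOf", "PFAM:PF07980", page))
# ('TOPSAN:2aam', 'memberOf', 'PFAM:PF07980')

print(note_link_triple("similar", "UNIPROT:Q8A1G2", page, rev=True))
# ('UNIPROT:Q8A1G2', 'similar', 'TOPSAN:2aam')
```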

Examples:

Embed a link to PFAM:

{{ note.link( 'memberOf', 'PFAM:PF07980' ) }}

Cite a PubMed reference:

{{ note.link( 'citation', 'PMID:19191477' ) }}

Reverse a relationship:

{{ note.link{ rel:'similar', value:'UNIPROT:Q8A1G2', rev:true } }}

Define a relationship about something other than the current page:

{{ note.link{ about:'TOPSAN:2aam', rel:'similar', value:'UNIPROT:Q8A1G2' } }}

In the editor this would look like:
When displayed in the page it would be:

Predicates

We will describe the set of TOPSAN Protein Syntax predicates with greater detail later. For now, there are only a handful of predicates that you need to know in order to get started.

Predicate       Definition
similar         Represents a connection between two proteins that are homologous or structurally similar
classifiedWith  Connects a protein to an assigned function type
memberOf        Connects a single element to a group of which it is a part
citation        A connection to a literature citation


Available Databases

When describing a link to another database, you can use prefix codes that will be recognized by TOPSAN and translated accordingly. We currently have a set of 10 linked databases, but this will grow as needed. To use a prefix code, name the database by its code, followed by a colon and the identifier from that database, e.g. “PFAM:PF07980”, “UNIPROT:Q8A1G2”, or “PDB:1AAC”.

Prefix   Database
GO       The Gene Ontology database
PFAM     The Pfam protein family database
UNIPROT  The UniProt protein database
EC       The Enzyme Catalogue
TOPSAN   The TOPSAN protein annotation system
TAXON    The NCBI taxonomic codes
PDB      The Protein Data Bank
PMID     PubMed
SCOP     SCOP domain IDs, e.g. d1wy7a1
SUNID    SCOP ID, e.g. 51349 -> Alpha and beta proteins (a/b)
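
The prefix convention can be illustrated with a short Python sketch that splits a value into its database code and identifier. The function name is hypothetical, and the exact, case-sensitive matching is an assumption. Values without a known prefix are returned unchanged, mirroring the note.link rule that anything which is not an identifiable link is treated as a literal.

```python
# Prefix codes listed in the table above.
KNOWN_PREFIXES = {
    "GO", "PFAM", "UNIPROT", "EC", "TOPSAN",
    "TAXON", "PDB", "PMID", "SCOP", "SUNID",
}


def split_reference(value):
    """Split 'PFAM:PF07980' into ('PFAM', 'PF07980').

    Returns (None, value) when there is no known prefix, i.e. the value
    would be treated as a literal.
    """
    prefix, sep, identifier = value.partition(":")
    if sep and identifier and prefix in KNOWN_PREFIXES:
        return prefix, identifier
    return None, value


print(split_reference("PFAM:PF07980"))  # ('PFAM', 'PF07980')
print(split_reference("21055.88"))      # (None, '21055.88')
```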

Data Mining on TOPSAN

Part II

Developer Update – January 2010

January 25, 2010

We are currently working on TOPSAN performance issues on a development site. Two of the primary features that were slowing down TOPSAN were the external real-time feeds on protein pages and the advanced search options. These features have been disabled but will be back in the next TOPSAN upgrade. We expect to upgrade TOPSAN within the first week of February, with only a brief downtime (~20 min). I will post the exact date and time of the upgrade. We are excited to bring back some of the useful features but, more importantly, to have a more stable and responsive site! If you have any questions please contact support at topsan dot org.


Adding References on TOPSAN

July 22, 2009

We recently added a new feature on TOPSAN that allows you to easily add references as you are editing a protein page. To use this feature,

1. Select the pubmed icon in the WYSIWYG tool bar while in edit-mode:

Pubmed Icon

2. A dialog box will appear that allows you to enter a Pubmed Id, ISBN number, or keyword (author, title, etc.). Click on the selected reference to insert it into the text.

Reference Dialog

3. Once you are finished editing the section, save your changes. The references within the text will appear in numerical order (e.g. [1], [2]), with the full reference in the ‘Reference’ tab at the bottom of the protein page, along with links to the full article, Pubmed, and Hubmed:

Reference Tab

4. Clicking on the reference in the text (e.g. [1]) or on the ‘Discuss this page’ link in the References tab takes you to an Article page, which is automatically generated on TOPSAN. From this article page, users can discuss the article and view all pages on TOPSAN that also refer to it.

Article Page

You may also use this feature to easily add your own publications to your User profile page (i.e. User:<username>).


Error propagation

January 22, 2009

While annotating an orotidine 5′-phosphate decarboxylase from Thermotoga maritima (PDB id: 1vqt), I was surprised to come across an apparent paralog (PDB id: 2yyu) as Pfam showed a single gene for this enzyme in T. maritima. On closer inspection, it turned out the protein was actually an ortholog from Geobacillus kaustophilus but wrongly annotated in the PDB and propagated onto the relevant TOPSAN page. I checked to see how far the error had reached. A quick search showed it to be present in the PDB, PDBsum, PSI-KB, Proteopedia, SSM, PISA and NCBI. All programs relying on a sequence analysis (BLAST, STRING, COGnitor) correctly identified G. kaustophilus as the species.

Error propagation in the automatic annotation of proteins has been previously described (Valencia 2005). I have notified the PDB of the problem. It will be interesting to see the time it takes to correct the error and whether or not the correction back-propagates.


Betting on Science

December 16, 2008

Here is one of those horror stories about how good science is severely cheapened by its unavailability to the science consuming public due to the inherent limitations of academic publishing. The author laments regarding some exciting new procedure, “It turns out that no one will have access to this method for a year or so.”

Ok, so we all have some misgivings about how scientific discourse has been traditionally structured. TOPSAN represents what we hope is a tractable approach to a more modern scientific procedure. Apparently, however, some people are thinking in creative ways that may prove a bit beyond practical. Economist Robin Hanson wrote a paper almost two decades ago outlining how the scientific process could be enhanced by using gambling. The idea is strange, but he argues convincingly that just as in insurance underwriting and market speculating, science could be enhanced by using the same techniques that are essentially used with bookmaking.

The nice thing about this approach is that it is basically a money-where-mouth-is approach which would pretty quickly rule out insincere scientific propositions. He has obviously thought seriously about possible corruption and rightly points out that, “Fortunately nature has no insiders…”

He makes an interesting point that:

“Influence in academia, as measured for example by number of papers published, is far more concentrated than in most walks of life. It seems unlikely that markets would make things worse…”

The author contemplates the difficulty involved in changing the status quo to this kind of approach. In the “Strategy” section the author wonders if some kind of reputation based game could be enough. I think so because it sounds a lot like MarketGuru.com which does exactly this and is quite successful at letting people show off how smart they are (or are not). It would be nice to see some non-traditional scientific reputation system demonstrate that peer reviewed publications aren’t necessarily the last word in scientific efficiency.

Hanson also says:

“We could do much worse than having intellectual institutions as open, flexible, diverse, and egalitarian as the stock market, with incentives as well-grounded and with estimates on important issues as unbiased and predictive.”

To which I say, let’s try to do better than the stock market!

