TOPSAN and the Semantic Web (Part II)

July 12, 2010

This is the second in a series of blog posts introducing the concepts behind the TOPSAN Protein Syntax and the TOPSAN semantic notation system. Our first entry introduced how to embed semantic web notations into TOPSAN (Part I). In this article we describe how to obtain and use the semantic information that has been embedded in TOPSAN, compose queries, and analyze the available information.


The main reason to embed semantic information into TOPSAN pages is to allow for easy extraction of information from the site. Every day, all of the embedded semantic information on the site is scraped and compiled into a single file, which you can find at This ‘N3’ format file is a collection of all the semantic triples found on the TOPSAN site. Once you’ve downloaded and unzipped it, it is easy to import into many different semantic web ‘triple store’ systems. These systems can parse database queries written in the ‘SPARQL’ syntax. One such system is RDFLib, found at  Once you have installed this library, you can use it to open and query the topsan.n3 file.

If you need an introduction to SPARQL, check out these tutorials:


To begin these examples first download the file found at, and uncompress it.

$ wget
$ gunzip topsan.n3.gz

(On Mac)
$ curl -O
$ gunzip topsan.n3.gz
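
If gunzip is not available, the uncompress step can also be done with Python's standard library. This is only a minimal sketch, not part of the TOPSAN tooling: the helper name gunzip_file is made up for illustration, and the download itself is left out because it depends on the file's location.

```python
import gzip
import shutil


def gunzip_file(src_path, dst_path):
    """Decompress a .gz file (for example topsan.n3.gz) into dst_path.

    Unlike the gunzip command, the original archive is left in place.
    gunzip_file is a hypothetical helper name used only in this sketch.
    """
    # Copy in chunks so a large N3 dump is never held in memory all at once.
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dst_path
```

Calling gunzip_file("topsan.n3.gz", "topsan.n3") should leave topsan.n3 next to the downloaded archive.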

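Example 1: Retrieve Molecular weights

Script:

#!/usr/bin/env python
import sys
# rdflib 2.x import path; newer rdflib releases use "from rdflib import Graph"
from rdflib.Graph import Graph

g = Graph()
# Load the N3 file given as the first command-line argument
g.parse( sys.argv[1], format="n3" )
queryStr = """
SELECT ?id ?weight WHERE {
?id <;
"""
# Print each result row as "id : weight"
for row in g.query(queryStr):
    print(" : ".join( str(a) for a in row ))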


When you call this script with the path to the topsan.n3 file, it scans the file for all of the molecular weights assigned to proteins stored in TOPSAN. You should get something that looks like:

$ ./ topsan.n3
 : 21055.88
 : 66251.10
 : 21321.14
 : 20223.71
 : 56539.67
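
Because every result is printed as an identifier, the separator " : ", and a weight, the output is easy to post-process. The following is a rough sketch under that assumption; parse_weight_lines and the example.org identifiers are hypothetical and stand in for the real protein URIs.

```python
def parse_weight_lines(lines):
    """Turn 'id : weight' lines into a list of (id, weight) tuples."""
    results = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("$"):
            continue  # skip blank lines and the shell prompt line
        # Split on the LAST " : " because the id may be a URI that contains ':'.
        ident, sep, weight = line.rpartition(" : ")
        if not sep:
            continue  # not an 'id : weight' line
        results.append((ident, float(weight)))
    return results


# Hypothetical identifiers; the weights come from the sample output above.
sample = [
    "http://example.org/protein/A : 21055.88",
    "http://example.org/protein/B : 66251.10",
]
pairs = parse_weight_lines(sample)
heaviest = max(pairs, key=lambda pair: pair[1])
print(heaviest[0], heaviest[1])
```

Splitting from the right is the one design choice that matters here: a plain split on ":" would break on the "http:" part of a URI.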



Example 2: Retrieve PFAM to GO mapping

A more complicated call would be to extract all of the PFAM to GO mappings stored in TOPSAN. To do this, replace the queryStr contents in the Example 1 script with the following query:



PREFIX core:<;
SELECT ?pfam ?go WHERE {
    ?pfam a core:Pfam_Family ;
          core:memberOf ?go
}




Example 3: Finding Domains of Unknown Function with PDB structures.

This query finds all PFAM families whose names start with ‘DUF’ and that have associated PDB structures on TOPSAN. The associations are found with the ‘seeAlso’ type link to a common tag, whose title is filtered with the regular expression “^PF”.




PREFIX core:<;
?fam a core:Pfam_Family .
?fam core:name ?name .
FILTER( regex(?name, "^DUF") ) .
?prot a core:Protein_Structure .
?prot core:memberOf ?fam


(Note: this query involves external data sources that are not part of the ‘core’ TOPSAN extract files. We will outline how to obtain the extended database needed for these types of queries in later posts.)

At this point, the query takes several minutes and more than 1 GB of memory to run. In the next article we’ll demonstrate how to set up a Joseki-based server to query Semantic TOPSAN data.