ClassRank

Online demo of ClassRank algorithm

Raw input

This webapp is a prototype to calculate online the ClassRank score of a given RDF graph. A detailed overview of ClassRank algorithm is offered below. Currently, this online prototype cannot handle big graphs. In order to experiment with big sources we encourage you to download the source code of ClassRank and to test it locally.

Instructions:

  • Introduce your RDF graph in turtle or n3 format in the text area labelled "Graph input".
  • Introduce a list of classpointers in the text area labelled "Classpointers". Each classpointer should should be in a separate line.
  • Specify a positive integer as class threshold (θC).
  • Specify a positive integer as instance threshold (θI).
  • Specify a float number in the range [0,1] as damping factor for PageRank (α).
  • Click on the "Execute Classrank" button.

Your results will be prompted in the text area labelled "Results". In case any error occurs during the process, it will also be prompted in the text area.

The results are given in a list of JSON objects. The objects represent detected classes and are sorted by its ClassRank score (the element with max score goes first). Each object contains the following keys:

  • "class": URI of the class.
  • "CR_score": ClassRank score.
  • "cps": dictionary. Each key of this dictionary is a classpointer which is used to point the class in the given graph. Each key contins a list of elements, wich are the URIs that point the class using the correspondent classpointer.
  • "INSTANCES": number of instances which were used to compute the classrank score. Note: this number represent only those cases in which both the instances and the classpointer fit in the params specified by θC and θI. In the dictionary "cps" some other instances may appear.

Graph input

Classpointers

Params

  • Classes threshold (θC)
  • Instances threshold (θI)
  • Damping factor (α)

Results

ClassRank Overview

ClassRank is an unsupervised technique useful for discovering URIs representing abstract concepts (classes) in a knowledge graph, and to measure their relevance within the source. ClassRank is primarily based on notions of graph centrality according to the PageRank algorithm. ClassRank associates each found class with a score with the following equivalent meanings:

  • Accumulated centrality (PageRank) score of its instances.
  • Chance of reaching one of its instances while surfing the graph randomly.

We provide a graphical example of how ClassRank works:

ClassRank explained graphically

In the image we prompt some elements of a certaing RDF graph. The rectangles represent URIs of instances, the clouds URIs of classes, and the arrows properties that join both elements; orange color for dbo:governamentType and green for rdf:type . Let's say that we have applied PageRank (PR) to a graph which contains those elements and relations. Also, lets say that we classify both dbo:governamentType and rdf:type as classpointers. Each entity on the graph has been associated to a PR score. The ClassRank (CR) score of a class is obtained by aggregating the PR scores of its instances. Then, in case Parlimentary system and Country don't have any other instance than the ones that appear in the picture, we would have:

  • CR_Parlimentary_System(0.6) = PR_Hungary(0.1) + PR_Greece(0.2) + PR_USA(0.3)
  • CR_Country(0.3) = PR_Hungary(0.1) + PR_Greece(0.2)

The information brought by ClassRank may be helpful in several scenarios, including, but not limited to:

  • Better exploitation of existing knowledge. Obtaining a list of relevant classes may facilitate querying relevant entities using common interfaces.
  • Class consolidation by comparison. Using ClassRank on different KGs allows to compare the results of all of them in order to find repeated classes across different sources. That would be useful for:
    • Class confirmation. If the algorithm repeatedly finds a certain class across different sources, it probably means that this is a real and correct class.
    • Class conciliation. If a certain class is found in different sources, we may be able to interconnect them using properties of OWL.
  • Information Retrieval Tasks. ClassRank allows to obtain the most relevant instances of a certain type, which may be helpful in different Information Retrieval scenarios. For instance, when looking for scientific literature, it may be desirable to know the most relevant instances (publications) of a given area.

In order to discover class URIs, ClassRank explores the target graph looking for triples containing some special properties that we have called class-pointers. Essentially, a class-pointer is an RDF property which is expected to be used only to point to RDF classes.

ClassRank computes exclusively information contained in the target graph, i.e., no third-party knowledge is used during the process, thus ClassRank can be applied on KGs of any kind of domain.

Inputs and outputs

The damping factor α in PageRank defines the probability p = (1 - α) for a random surfer to get bored on moving through the graph using links, moment in which he decides to jump to a random element. The thresholds θI and θC are used to ignore some facts that occur rarely, which may mean that they are noise or they have a non-significant presence in G. The algorithm returns three results:

  • The standard PageRank vector for every entity node in the graph
  • The ClassRank vector for every found class in the graph
  • A matrix containing which entities point to which classes using which class-pointers.

The combined information of CR scores and list and amount of isntances provide the relevance of each class and it allows us to analyze the source of that relevance.

Formalization

We provide a formalization of ClassRank algorith:

Mathematical formalization of Pagerank

Access to source code

We have implemented ClassRank in Python and the source code is publicly available. Our version of Classrank is still a prototype, more tests are needed and some bugs (which do not affect to the final score) have been already detected. It completely operates in main memory to maintain some structures during the computations, so you may need a powerfull machine in order to compute huge grpahs.

Some examples and instructions are provided in the repository.

Computed datasets

ClassRank has already been applied over big RDF sources. We provide access to the results of applying ClassRank on a dump of Wikidata, date 2016/10/26. The link provided also give acces to an older implementation of ClassRank whose code is tightfully linked to Wikidata. If you are planning to test ClassRank, we recomend you to use the implementation that we have recently provided, which is thought to work with any kind of RDF graph.

We will provide soon the results of applying ClassRank over the English edition of DBpedia.