Please follow this link to enter the development version of SOURCERER code search engine
* Please use the mirror link if the above link is not working *
* DMKD: a recent version with alternate ranking schemes, please use link *

The Sourcerer Project


Sourcerer Infrastructure

Sourcerer is an infrastructure for large-scale analysis of open source code repositories. The Sourcerer architecture (Figure 1) consists of modules that support two major areas of functionality: code analysis and storage, and code retrieval.

The heart of the repository infrastructure is a customized parser responsible for static analysis and code entity extraction. After projects are selected for analysis from a list created by a web crawler, the parser is invoked to capture both symbolic and structural properties of the project files. Package, class, method and attribute information, together with the relations among these entities, is exported to an underlying relational database and indexed for efficient information retrieval. This phase also extracts fingerprints: compact representations of source code entities that signify the presence or absence of interesting structural characteristics within the code. Finally, entities are ranked using the CodeRank method, which calculates the relative importance of entities in a manner analogous to Google's PageRank.

In addition to providing a probabilistic framework for ranking, CodeRank allows the ranking process to be tuned to boost or dampen the significance of specific types of relationships (uses, calls, inherits, etc.) in computing rank, as well as the level at which ranking is computed (inter-project or intra-project).
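The tuning described above can be illustrated with a small weighted variant of PageRank. This is only a sketch under stated assumptions, not the actual CodeRank implementation: the function name `code_rank`, the relation weights, the damping factor and the handling of entities without outgoing relations are all hypothetical choices made for illustration.

```python
from collections import defaultdict

# Hypothetical per-relation weights; the real CodeRank values are not given here.
RELATION_WEIGHTS = {"calls": 1.0, "uses": 0.5, "inherits": 2.0}


def code_rank(entities, relations, weights=RELATION_WEIGHTS,
              damping=0.85, iterations=50):
    """Rank entities with a PageRank-style iteration over typed relations.

    entities  -- list of entity names
    relations -- list of (source, target, relation_type) tuples whose
                 endpoints are all listed in entities
    """
    # Aggregate the weighted out-links of every source entity.
    out_links = defaultdict(lambda: defaultdict(float))
    for source, target, kind in relations:
        weight = weights.get(kind, 0.0)
        if weight > 0:
            out_links[source][target] += weight

    n = len(entities)
    rank = {e: 1.0 / n for e in entities}
    for _ in range(iterations):
        new_rank = {e: (1.0 - damping) / n for e in entities}
        for source in entities:
            targets = out_links.get(source)
            if targets:
                # Split the source's rank in proportion to the relation weights.
                total = sum(targets.values())
                for target, weight in targets.items():
                    new_rank[target] += damping * rank[source] * weight / total
            else:
                # Entity with no outgoing relations: spread its rank evenly.
                share = damping * rank[source] / n
                for e in entities:
                    new_rank[e] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    entities = ["A", "B", "C"]
    relations = [("A", "C", "calls"), ("B", "C", "calls"), ("C", "A", "uses")]
    for name, score in sorted(code_rank(entities, relations).items()):
        print(name, round(score, 3))
```

Raising the weight of one relation type shifts rank toward entities reached through that relation, which is the boosting and dampening effect described above. Restricting `relations` to pairs within one project, or allowing pairs across projects, would correspond to intra-project and inter-project ranking.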

The information Sourcerer provides serves as the basis for implementing various software engineering tools. Three such tools have been developed: (i) the SOURCERER code search engine, (ii) CodeGenie, an Eclipse plugin for test-driven source code search, and (iii) DeMatrix, a visualization tool for software systems based on the Design Structure Matrix (DSM).

Figure 1. Sourcerer System Architecture

SOURCERER Code Search Engine

SOURCERER is a search engine for open source code. It provides various search modes using the structural information provided by the Sourcerer infrastructure. This information, such as the code rank and relational information about the code structure, enables forms of search that go beyond conventional keyword-based search. Specifically, SOURCERER supports five types of search:
  • components
  • component uses
  • function
  • function uses
  • fingerprints

The Sourcerer code search application is a Lucene-based web application that facilitates the efficient retrieval of code based on user-specified queries. In addition to standard text information retrieval, structure-based search is supported. In the latter case, the user can specify a structural fingerprint of the code they are interested in. A ranked list of entities matching the user's query is then returned.

Code Fingerprints

Fingerprints are essentially vectors whose elements denote both the presence and multiplicity of specific programming constructs within individual code entities.  The vector representation, while simple, naturally lends itself to popular methods within both information retrieval and machine learning.  In constructing fingerprints, care must be taken to balance the competing needs of expressiveness and efficiency.  Fingerprint attributes must be numerous enough to provide a meaningful basis for comparison among code entities.  At the same time, superfluous attributes add unnecessary overhead to the search process.
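As an illustration of how the vector representation lends itself to information retrieval methods, the following sketch compares fingerprints with cosine similarity. The attribute names, the helper functions and the choice of cosine similarity are assumptions made for this example; the source does not specify which attributes or similarity measure SOURCERER actually uses.

```python
import math

# Hypothetical attribute order; the real fingerprint attributes are not listed here.
ATTRIBUTES = ("loops", "branches", "synchronized_blocks", "methods", "fields")


def to_vector(counts):
    """Turn a {attribute: multiplicity} mapping into a fixed-order vector."""
    return [counts.get(name, 0) for name in ATTRIBUTES]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_fingerprint(query, candidates):
    """Return candidate names ordered from most to least similar to the query."""
    return sorted(candidates,
                  key=lambda name: cosine_similarity(query, candidates[name]),
                  reverse=True)


if __name__ == "__main__":
    query = to_vector({"loops": 2, "branches": 1})
    candidates = {
        "Sorter.sort": to_vector({"loops": 2, "branches": 2}),
        "Config.load": to_vector({"fields": 4, "methods": 3}),
    }
    print(rank_by_fingerprint(query, candidates))
```

Each additional attribute lengthens every vector and every comparison, which is the expressiveness versus efficiency trade-off noted above.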

Currently SOURCERER provides three forms of fingerprint search:

  • Control Structure Fingerprints
  • Java Type Fingerprints
  • Micro Pattern Fingerprints

Control structure fingerprints provide information about concurrency, iteration, and branching constructs within the code. Java Type fingerprints capture information about OO constructs such as classes, methods, attributes, constructors, and the like. Finally, micro pattern fingerprints indicate whether or not simple design patterns are present within a code entity.
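A control structure fingerprint of the kind described above could be approximated by counting keywords, as in the sketch below. This is a deliberately crude illustration, not how Sourcerer's parser works: the keyword grouping and the function name are hypothetical, and a plain token scan also counts keywords that appear inside comments or string literals, which a real parser would avoid.

```python
import re
from collections import Counter

# Hypothetical grouping of Java keywords into the three construct families.
# "do" is omitted so that a do-while loop is counted once, via its "while".
CONSTRUCT_KEYWORDS = {
    "iteration": ("for", "while"),
    "branching": ("if", "switch"),
    "concurrency": ("synchronized",),
}


def control_structure_fingerprint(java_source):
    """Count keyword occurrences per construct family in a Java source string."""
    # Split into identifier-like tokens so "format" is not mistaken for "for".
    tokens = Counter(re.findall(r"[A-Za-z_]\w*", java_source))
    return {family: sum(tokens[keyword] for keyword in keywords)
            for family, keywords in CONSTRUCT_KEYWORDS.items()}


if __name__ == "__main__":
    sample = """
    synchronized void run() {
        for (int i = 0; i < n; i++) {
            if (i % 2 == 0) { work(i); } else { skip(i); }
        }
        while (busy()) { pause(); }
    }
    """
    print(control_structure_fingerprint(sample))
```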

Sourcerer Applications

Publications

  • OOPSLA 2006 Poster Abstract [PDF]
  • ISR Forum 2006 Poster [PDF]


(c) the mondego group | ISR | IGB | Bren School ICS | UC Irvine