nick-b/sourceclassifier

SourceClassifier

SourceClassifier identifies the programming language of a piece of source code using a Bayesian classifier trained on a corpus generated from the Computer Language Benchmarks Game. It is written in Ruby and available as a gem. To train the classifier to identify new languages, download the sources from GitHub.
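To give a feel for what "Bayesian classifier" means here, the following is a toy naive Bayes sketch over whitespace-separated tokens. It is an illustration only: the `ToyBayes` class, its tokenizer, and the add-one smoothing are assumptions made for this example, not the implementation used by this gem or by the Classifier gem it depends on.

```ruby
# Toy multinomial naive Bayes over whitespace-separated tokens.
# Illustration only; ToyBayes is a hypothetical class, not part of this gem.
class ToyBayes
  def initialize
    @token_counts = Hash.new { |h, label| h[label] = Hash.new(0) }
    @doc_counts = Hash.new(0)
    @vocabulary = {}
  end

  # Record one example program for the given language label.
  def train(label, text)
    @doc_counts[label] += 1
    tokenize(text).each do |token|
      @token_counts[label][token] += 1
      @vocabulary[token] = true
    end
  end

  # Return the label with the highest log posterior for the text.
  def identify(text)
    tokens = tokenize(text)
    total_docs = @doc_counts.values.sum.to_f
    @doc_counts.keys.max_by do |label|
      counts = @token_counts[label]
      total_tokens = counts.values.sum
      log_prior = Math.log(@doc_counts[label] / total_docs)
      log_likelihood = tokens.sum do |token|
        # Add-one smoothing so unseen tokens do not zero out the score.
        Math.log((counts[token] + 1.0) / (total_tokens + @vocabulary.size))
      end
      log_prior + log_likelihood
    end
  end

  private

  # Assumed tokenizer: split on whitespace.
  def tokenize(text)
    text.split
  end
end

toy = ToyBayes.new
toy.train('Ruby', "def hello\n  puts 'hi'\nend")
toy.train('C', "#include <stdio.h>\nint main() { return 0; }")
puts toy.identify("def add(a, b)\n  a + b\nend") # prints "Ruby"
```

With only one example per language the result hinges on a couple of shared tokens (`def` and `end` above), which is why a larger training corpus matters in practice.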

Usage

First, install the gem using GitHub as a source:

$ gem sources -a http://gems.github.com
$ sudo gem install chrislo-sourceclassifier

Then, to use it:

  require 'rubygems'
  require 'sourceclassifier'
  
  s = SourceClassifier.new
  
  ruby_text = <<EOT
  def my_sorting_function(a)
    a.sort
  end
  EOT
  
  # Quoted heredoc so the "\n" below stays a literal C escape sequence
  c_text = <<'EOT'
  #include <unistd.h>
  
  int main() {
    write(1, "hello world\n", 12);
    return(0);
  }
  EOT
  
  s.identify(ruby_text) #=> Ruby
  s.identify(c_text) #=> Gcc
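In practice the text to classify often lives in files. The sketch below shows one way to classify several files at once, assuming only that the classifier responds to `identify(text)` as in the example above. The helper name `identify_files` is hypothetical and not part of the gem.

```ruby
# Hypothetical helper (not part of the gem): read each file and pass its
# contents to any object that responds to `identify(text)`.
# Returns a hash mapping each path to the identified language.
def identify_files(classifier, paths)
  paths.each_with_object({}) do |path, result|
    result[path] = classifier.identify(File.read(path))
  end
end

# Example call, assuming the gem is installed:
#   require 'sourceclassifier'
#   identify_files(SourceClassifier.new, Dir.glob('lib/**/*.rb'))
```

Passing the classifier in as an argument keeps the helper independent of how the classifier was built or trained.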

Training

Download the sources from GitHub and, in the project directory, run the training rake task:

$ rake train

The ./sources directory contains one subdirectory for each language you want to identify. Each subdirectory contains example programs written in that language. The directory name is significant: it is the value returned by the identify method of a SourceClassifier instance.
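The layout described above can be read as a mapping from label (directory name) to example texts. The following sketch builds a throwaway directory and loads it in that shape. The helper `load_examples` is hypothetical; the internals of the actual train task are not shown in this README, so treat this as an illustration of the layout rather than of the task itself.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: map each subdirectory name (the label) to the
# contents of the example files directly inside it.
def load_examples(sources_dir)
  Dir.children(sources_dir).sort.each_with_object({}) do |label, examples|
    dir = File.join(sources_dir, label)
    next unless File.directory?(dir) # skip stray files next to the labels

    files = Dir.glob(File.join(dir, '*')).select { |f| File.file?(f) }.sort
    examples[label] = files.map { |f| File.read(f) }
  end
end

# Build a temporary layout to show the shape of the result.
Dir.mktmpdir do |root|
  FileUtils.mkdir_p(File.join(root, 'Ruby'))
  FileUtils.mkdir_p(File.join(root, 'Gcc'))
  File.write(File.join(root, 'Ruby', 'hello.rb'), "puts 'hello'\n")
  File.write(File.join(root, 'Gcc', 'hello.c'), "int main() { return 0; }\n")

  examples = load_examples(root)
  puts examples.keys.inspect # prints ["Gcc", "Ruby"]
end
```

Renaming the Gcc directory to C in such a layout would, by the rule above, change the label returned for C programs after retraining.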

The rake task populate can be used to build these subdirectories from a checkout of the Computer Language Shootout sources, but you are free to train the classifier using any available examples.

Acknowledgments

This library depends heavily on the great Classifier gem by Lucas Carlson and David Fayram II.
