GitHub - vbajaria/sifarish: Content based and collaborative filtering based recommendation engine implementation on Hadoop and Storm

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 151 Commits
resource		resource
src/main/java/org/sifarish		src/main/java/org/sifarish
target		target
.gitignore		.gitignore
README		README
manifest.mf		manifest.mf
pom.xml		pom.xml

Repository files navigation

Introduction
============
Sifarish is a suite of recommendation engines implementaed in Hadoop. Various 
algorithms, including  feature similarity based recommendation and collaborative 
filtering based recommendation using social rating data will be available

Blogs
=====
The following blogs of mine are good source of details of sifarish

http://pkghosh.wordpress.com/2011/10/26/similarity-based-recommendation-basics/
http://pkghosh.wordpress.com/2011/11/28/similarity-based-recommendation-hadoop-way/
http://pkghosh.wordpress.com/2011/12/15/similarity-based-recommendation-text-analytic/
http://pkghosh.wordpress.com/2012/04/21/socially-accepted-recommendation/
http://pkghosh.wordpress.com/2010/10/19/recommendation-engine-powered-by-hadoop-part-1/
http://pkghosh.wordpress.com/2010/10/31/recommendation-engine-powered-by-hadoop-part-2/
http://pkghosh.wordpress.com/2012/12/31/get-social-with-pearson-correlation/
http://pkghosh.wordpress.com/2012/09/03/from-item-correlation-to-rating-prediction/


Content Similarity Based Recommendation
=======================================
In the absence of social rating data, the only options is a feature similarity 
based recommendation. Similarity is calculated based on distance between entities 
in a multi dimensional feature space. Some examples are - recommending jobs based 
on user's resume - recommending products based on user profile. These
solutions are known as content based recommendation, because it's based innate 
features of some entity.

There are two different solutions as follows
1. Similarity between entities of different types (e.g. user profile and product)
2. Similarity between entities of same type (e.g. product)

Attribute meta data is defined in a json file. Both entities need not have the 
same set of attributes. Mapping between attributes values from one entity to 
the other can be defined in the config file.

The data type supported are numerical (integer) and categorical. The distnace 
between categorical attributes  is is 0 if they are same 1 otherwise. If the 
atrributes are different, the distance can be configured to be  something other 
than. For numerical, the distance is simply the distance between the two values. 
Numerical  attributes are normalized using the difference between the min and max values.
 
The distance algorithms can be chosed to be euclidian, manhattan or minkowski. The 
default algorithm is euclidian. The distancs between different atrributes of different 
types are combined to find distance between two entity instances. Different weights 
can be assigned to the attributes to control the relative importance of different 
attributes.


Social Interaction Data Based Recommendation
============================================
These category of solutions are based user behavior data with respect to some product 
or service.  User behavior data is defined in terms of some explicit rating by user 
or it's derived from user  behavior in the site. There are 3 algorithms implemented for 
social recommendation. The essential  input to all these algorithms is a matrix of user 
and items. The value for a cell could be the ratingas an integer. It could also be boolean, 
if the user's interest in an item is expressed as a boolean

1. Item similarity based on distance between two item vectors. The size of the vector 
is the number of users
2. The so called slope one recommender, where rating for an item is predicted for an 
user based on average rating of items
3. Collaborative filtering, which is based on correlation between ratings of  items. Slope one 
recommender is simplified version of collaborative filtering.


Cold Starting Recommenders
==========================
These solutions are used when enough social data is not avaialable

1. If data contains text attributes, use TextAnalyzer MR to convert text to token stream 
   using lucene
2. Find similar items based on user profile. Use DiffTypeSimilarity MR
3. Use TopMatches MR to find top n matches for a profile


Warm Starting Recommenders
==========================
When limited amount of user behavior data is available, these solutuions are appropriate

1. If data contains text attributes, use TextAnalyzer MR to convert text to token stream 
   using lucene
2. Find similar items by pairing items with one another using SameTypeSimilarity MR
3. Use TopMatches MR to find top n matches for a product


Hot Starting Recommenders
=========================
When significant of user behavior data is available, these soltions can be used. In 
the order of  complexity, the choices are as follows. They are all based on social data

There two phases for collaborative filetering based recommendation using social data
1. Find correlation between items 2. Predict rating based on items alreadyv rated and 
result of 1

Items Correlation
1. Find  item similarity based on social data using DynamicAttrSimilarityStrategy MR
OR
2. Use RatingDifference to find mean rating difference (aka slope one recommender) correlation
OR
3. Use collaborative filtering recommender (in progress)

Rating Prediction
1. Run UtilityPredictor on the result of 1 or 2 or 3 followed by UtilityAggregator

Complex Attributes
==================
I am in the process of adding more suppoort for structured fields e.g.,
Location, Time Window, Ctaegorized Item, Product etc. These provide contextual
dimensions to recommendation. They are particularly relevant for recommendation
in the mobile space


Diversilty, Unexpectedness
==========================
Based on recent work in this as published by ACM, I am working on implementing some 
algorithms to introduce  diversity and novelty in recommendation

Facted Match
============
For content based recommendation, faceted match is supported as faceted search in Solr.
Faceted fields are specified through a configuration parameter

Configuration
=============
Put all config in a properties and pass it to Hadoop as below in your shell script
-Dconf.path=/my/config/path/config.properties

The properties file should have all the config values. There are some samples in the 
resource  directory

Usage
=====
Here is sample script. You provide all the configurations needed through a properties
file. The property file name is provided through the param conf.path.

JAR_NAME=/..../sifarish/target/sifarish-1.0.jar
CLASS_NAME=org.sifarish.common.ItemDynamicAttributeSimilarity

echo "running mr"
IN_PATH=/..../purch/item
OUT_PATH=/..../purch/simi
echo "input $IN_PATH output $OUT_PATH"
hadoop fs -rmr $OUT_PATH
echo "removed output dir"

hadoop jar $JAR_NAME  $CLASS_NAME -Dconf.path=/.../purch.properties  $IN_PATH  $OUT_PATH

Dependency
==========
This project depends on my other project chombo. You can distribute to cache by including the
following in the hadoop jar command as follows
-libjars chombo-1.0.jar
 
If running single node Haddop cluster on you locl machine, easiest thing to do is to 
copy the jar to the Hadoop installation lib directory