forked from pranab/sifarish
Content based and collaborative filtering based recommendation engine implementation on Hadoop and Storm
vbajaria/sifarish
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Introduction ============ Sifarish is a suite of recommendation engines implementaed in Hadoop. Various algorithms, including feature similarity based recommendation and collaborative filtering based recommendation using social rating data will be available Blogs ===== The following blogs of mine are good source of details of sifarish http://pkghosh.wordpress.com/2011/10/26/similarity-based-recommendation-basics/ http://pkghosh.wordpress.com/2011/11/28/similarity-based-recommendation-hadoop-way/ http://pkghosh.wordpress.com/2011/12/15/similarity-based-recommendation-text-analytic/ http://pkghosh.wordpress.com/2012/04/21/socially-accepted-recommendation/ http://pkghosh.wordpress.com/2010/10/19/recommendation-engine-powered-by-hadoop-part-1/ http://pkghosh.wordpress.com/2010/10/31/recommendation-engine-powered-by-hadoop-part-2/ http://pkghosh.wordpress.com/2012/12/31/get-social-with-pearson-correlation/ http://pkghosh.wordpress.com/2012/09/03/from-item-correlation-to-rating-prediction/ Content Similarity Based Recommendation ======================================= In the absence of social rating data, the only options is a feature similarity based recommendation. Similarity is calculated based on distance between entities in a multi dimensional feature space. Some examples are - recommending jobs based on user's resume - recommending products based on user profile. These solutions are known as content based recommendation, because it's based innate features of some entity. There are two different solutions as follows 1. Similarity between entities of different types (e.g. user profile and product) 2. Similarity between entities of same type (e.g. product) Attribute meta data is defined in a json file. Both entities need not have the same set of attributes. Mapping between attributes values from one entity to the other can be defined in the config file. The data type supported are numerical (integer) and categorical. The distnace between categorical attributes is is 0 if they are same 1 otherwise. If the atrributes are different, the distance can be configured to be something other than. For numerical, the distance is simply the distance between the two values. Numerical attributes are normalized using the difference between the min and max values. The distance algorithms can be chosed to be euclidian, manhattan or minkowski. The default algorithm is euclidian. The distancs between different atrributes of different types are combined to find distance between two entity instances. Different weights can be assigned to the attributes to control the relative importance of different attributes. Social Interaction Data Based Recommendation ============================================ These category of solutions are based user behavior data with respect to some product or service. User behavior data is defined in terms of some explicit rating by user or it's derived from user behavior in the site. There are 3 algorithms implemented for social recommendation. The essential input to all these algorithms is a matrix of user and items. The value for a cell could be the ratingas an integer. It could also be boolean, if the user's interest in an item is expressed as a boolean 1. Item similarity based on distance between two item vectors. The size of the vector is the number of users 2. The so called slope one recommender, where rating for an item is predicted for an user based on average rating of items 3. Collaborative filtering, which is based on correlation between ratings of items. Slope one recommender is simplified version of collaborative filtering. Cold Starting Recommenders ========================== These solutions are used when enough social data is not avaialable 1. If data contains text attributes, use TextAnalyzer MR to convert text to token stream using lucene 2. Find similar items based on user profile. Use DiffTypeSimilarity MR 3. Use TopMatches MR to find top n matches for a profile Warm Starting Recommenders ========================== When limited amount of user behavior data is available, these solutuions are appropriate 1. If data contains text attributes, use TextAnalyzer MR to convert text to token stream using lucene 2. Find similar items by pairing items with one another using SameTypeSimilarity MR 3. Use TopMatches MR to find top n matches for a product Hot Starting Recommenders ========================= When significant of user behavior data is available, these soltions can be used. In the order of complexity, the choices are as follows. They are all based on social data There two phases for collaborative filetering based recommendation using social data 1. Find correlation between items 2. Predict rating based on items alreadyv rated and result of 1 Items Correlation 1. Find item similarity based on social data using DynamicAttrSimilarityStrategy MR OR 2. Use RatingDifference to find mean rating difference (aka slope one recommender) correlation OR 3. Use collaborative filtering recommender (in progress) Rating Prediction 1. Run UtilityPredictor on the result of 1 or 2 or 3 followed by UtilityAggregator Complex Attributes ================== I am in the process of adding more suppoort for structured fields e.g., Location, Time Window, Ctaegorized Item, Product etc. These provide contextual dimensions to recommendation. They are particularly relevant for recommendation in the mobile space Diversilty, Unexpectedness ========================== Based on recent work in this as published by ACM, I am working on implementing some algorithms to introduce diversity and novelty in recommendation Facted Match ============ For content based recommendation, faceted match is supported as faceted search in Solr. Faceted fields are specified through a configuration parameter Configuration ============= Put all config in a properties and pass it to Hadoop as below in your shell script -Dconf.path=/my/config/path/config.properties The properties file should have all the config values. There are some samples in the resource directory Usage ===== Here is sample script. You provide all the configurations needed through a properties file. The property file name is provided through the param conf.path. JAR_NAME=/..../sifarish/target/sifarish-1.0.jar CLASS_NAME=org.sifarish.common.ItemDynamicAttributeSimilarity echo "running mr" IN_PATH=/..../purch/item OUT_PATH=/..../purch/simi echo "input $IN_PATH output $OUT_PATH" hadoop fs -rmr $OUT_PATH echo "removed output dir" hadoop jar $JAR_NAME $CLASS_NAME -Dconf.path=/.../purch.properties $IN_PATH $OUT_PATH Dependency ========== This project depends on my other project chombo. You can distribute to cache by including the following in the hadoop jar command as follows -libjars chombo-1.0.jar If running single node Haddop cluster on you locl machine, easiest thing to do is to copy the jar to the Hadoop installation lib directory
About
Content based and collaborative filtering based recommendation engine implementation on Hadoop and Storm
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published