This is my utility library for experimenting with the machine learning I've learned from ML-class.org. It covers the two main experiments I conducted over several weeks.

Collaborative filtering algorithm (ALS-WR) based on matrix factorization (from Mahout).

This is a multithreaded implementation of collaborative filtering (ALS-WR) using matrix factorization for implicit feedback.

Heavily stolen from Mahout 0.7. If your matrix can be loaded into memory, just use Mahout.

Why re-implement an algorithm that is already in Mahout? The problems were the following.

The U/M matrix is too large (about 8 GB as a 2D float array) for mapred.map.child.java.opts. The Mahout implementation runs out of heap space when it loads the U/M matrix into memory. I used an HDFS file lock and the fair scheduler to keep too many map tasks from running simultaneously on the same datanode.
Loading the U/M matrix into memory during each setup is single-threaded. Since the keys in each HDFS part file are mutually exclusive (the key is the row id), it is safe to use multiple threads to read multiple part files into memory.
Even once U/M is loaded into memory, a single thread is too slow for calculating the current ui, mj. Note that each ui, mj can be computed simultaneously using multiple threads. I added multithreadedMapMapper to spawn a multithreaded map method.

With multithreading in the map task (10 threads per task), it runs 3 times faster than the single-threaded version. I still need to figure out how to emit asynchronously in the mapper.
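
The parallel loading idea can be sketched in plain Java. This is only an illustration under assumptions, not the code in this repository: `ParallelFactorLoader` and `load` are hypothetical names, and each HDFS part file is stood in for by an in-memory map from row id to feature vector. It relies on the same property described above, namely that row ids never repeat across parts, so threads write to different rows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, not a class from this repository or from Mahout.
final class ParallelFactorLoader {

    /**
     * Fills one dense matrix from several parts, reading the parts concurrently.
     * Each part maps row id -> feature vector; row ids must not repeat across parts.
     */
    static float[][] load(List<Map<Integer, float[]>> parts, int numRows, int numThreads) {
        float[][] matrix = new float[numRows][];
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (Map<Integer, float[]> part : parts) {
                // One task per part. Tasks write to different rows, so no locking is needed.
                pending.add(pool.submit(() -> {
                    for (Map.Entry<Integer, float[]> row : part.entrySet()) {
                        matrix[row.getKey()] = row.getValue();
                    }
                }));
            }
            for (Future<?> task : pending) {
                task.get(); // wait for the part and surface any failure
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while loading", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a part failed to load", e.getCause());
        } finally {
            pool.shutdown();
        }
        return matrix;
    }
}
```

Waiting on every `Future` before returning also makes the rows written by the worker threads visible to the caller.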

Run

RunALS.sh hdfs-input-path numIteration numFeature hdfs-output-path hdfs-tmp-path

Parameters

hdfs-input-path: comma-delimited HDFS file of (user-id:string, item-id:string, rating:float).
numIteration: how many iterations of ALS-WR to run. More iterations usually bring the cost function closer to its minimum.
numFeature: how many latent features to use. A larger value usually lets the model reach a lower value of the cost function.
alpha (coded in script): weight in the transform function from rating to confidence in the paper.
lambda (coded in script): weight of the regularization term in the cost function. Refer to the paper for details.
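
To make alpha and lambda concrete, here is a small sketch of the per-user update. It assumes that "the paper" is the implicit-feedback formulation of Hu, Koren and Volinsky, where confidence is c = 1 + alpha * rating and each user vector x solves (Y^T C Y + lambda I) x = Y^T C p. That is my assumption, as are the names `ImplicitAlsSketch`, `confidence` and `solveUser`. It uses a plain lambda on the diagonal; a weighted-lambda variant would scale it by the number of ratings.

```java
// Hypothetical sketch; not the solver used by this repository or by Mahout.
final class ImplicitAlsSketch {

    /** Confidence transform assumed from the paper: c = 1 + alpha * rating. */
    static double confidence(double alpha, double rating) {
        return 1.0 + alpha * rating;
    }

    /**
     * Solves (Y^T C Y + lambda I) x = Y^T C p for one user.
     * itemFactors is Y (items x k); ratings[i] is the user's raw rating of item i, 0 if unobserved.
     * The preference p is 1 for observed items and 0 otherwise.
     */
    static double[] solveUser(double[][] itemFactors, double[] ratings, double alpha, double lambda) {
        int k = itemFactors[0].length;
        double[][] a = new double[k][k + 1]; // augmented system [A | b]
        for (int i = 0; i < itemFactors.length; i++) {
            double c = confidence(alpha, ratings[i]);
            double p = ratings[i] > 0 ? 1.0 : 0.0;
            for (int r = 0; r < k; r++) {
                for (int s = 0; s < k; s++) {
                    a[r][s] += c * itemFactors[i][r] * itemFactors[i][s];
                }
                a[r][k] += c * p * itemFactors[i][r];
            }
        }
        for (int r = 0; r < k; r++) {
            a[r][r] += lambda; // regularization term
        }
        // Gaussian elimination with partial pivoting.
        for (int col = 0; col < k; col++) {
            int pivot = col;
            for (int r = col + 1; r < k; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) pivot = r;
            }
            double[] tmp = a[col];
            a[col] = a[pivot];
            a[pivot] = tmp;
            for (int r = col + 1; r < k; r++) {
                double factor = a[r][col] / a[col][col];
                for (int s = col; s <= k; s++) a[r][s] -= factor * a[col][s];
            }
        }
        // Back substitution.
        double[] x = new double[k];
        for (int r = k - 1; r >= 0; r--) {
            double sum = a[r][k];
            for (int s = r + 1; s < k; s++) sum -= a[r][s] * x[s];
            x[r] = sum / a[r][r];
        }
        return x;
    }
}
```

Each user's solve depends only on the item matrix and that user's ratings, which is why the ui (and likewise the mj) can be computed by several threads at once.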

Job flows

Indexify your rating input: (user-index:integer, item-index:integer, rating:float).
Split the data per user-index. Note that Mahout's implementation doesn't group by user-index for the split.
Factorize the rating matrix into U/M. This takes a long time, depending on the dimensions of the user matrix.
Build recommendations using U x M.
Remove already rated items from the recommendations.
Run evaluation (optional).
De-indexify the recommendation results.
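
A minimal single-machine sketch of the indexify, recommend, filter and de-indexify steps follows. The real jobs run as MapReduce steps; `RecommendSketch` and its method names are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper names; not classes from this repository.
final class RecommendSketch {
    private final Map<String, Integer> indexOfId = new HashMap<>();
    private final List<String> idOfIndex = new ArrayList<>();

    /** Indexify: assign the next free integer to an unseen string id. */
    int indexify(String id) {
        Integer known = indexOfId.get(id);
        if (known != null) return known;
        indexOfId.put(id, idOfIndex.size());
        idOfIndex.add(id);
        return idOfIndex.size() - 1;
    }

    /** De-indexify: map an integer index back to the original string id. */
    String deIndexify(int index) {
        return idOfIndex.get(index);
    }

    /**
     * Scores every item as dot(userRow, itemRow), skips already rated items,
     * and returns the indexes of the n best remaining items, best first.
     */
    static List<Integer> topN(double[] userRow, double[][] itemRows, Set<Integer> alreadyRated, int n) {
        double[] scores = new double[itemRows.length];
        List<Integer> candidates = new ArrayList<>();
        for (int item = 0; item < itemRows.length; item++) {
            if (alreadyRated.contains(item)) continue;
            for (int f = 0; f < userRow.length; f++) {
                scores[item] += userRow[f] * itemRows[item][f];
            }
            candidates.add(item);
        }
        candidates.sort((a, b) -> Double.compare(scores[b], scores[a]));
        return new ArrayList<>(candidates.subList(0, Math.min(n, candidates.size())));
    }
}
```

Here `userRow` is one row of U and `itemRows` is M; the returned item indexes would be passed through `deIndexify` to recover the original item ids.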

Results

Recommendations for each user-id.

Random walk on bipartite graph using Mahout math library.

This is an implementation of the random walk algorithm on a bipartite undirected graph using the Mahout math library.

If your graph is bipartite, use this.

Concepts

A concept is a composite of vertices in the input graph that represents an abstract theme. In its simplest form, a concept can be a single vertex in the graph or a set of vertices.

Example: assume a bipartite graph built from a click log. The left vertices are queries and the right vertices are URLs that have been clicked for those queries. In this case a concept can be one of the following.

1. A single URL. We then get, for every query, the probability that it will click this concept URL.
2. A set of URLs that represents a bigger concept, such as a category.
3. A composite of 1 and 2.

Currently I have only implemented the simplest concept, 1. Override retrieveClassMatrix if you want the others.
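
For a concept made of a set of URLs, one possible shape for the class matrix is a 0/1 membership matrix of size |right vertices| x |concepts|. The sketch below is an assumption about what an overridden retrieveClassMatrix could produce; the real method's signature and return type are not shown in this README, and `ConceptMatrixSketch` and `classMatrix` are hypothetical names.

```java
import java.util.List;
import java.util.Set;

// Hypothetical stand-in; the real retrieveClassMatrix may look quite different.
final class ConceptMatrixSketch {

    /**
     * Builds a |right vertices| x |concepts| matrix.
     * concepts.get(c) holds the right-vertex indexes that belong to concept c.
     * A single-URL concept is simply a set with one element.
     */
    static double[][] classMatrix(int numRightVertices, List<Set<Integer>> concepts) {
        double[][] matrix = new double[numRightVertices][concepts.size()];
        for (int c = 0; c < concepts.size(); c++) {
            for (int vertex : concepts.get(c)) {
                matrix[vertex][c] = 1.0; // vertex is a member of concept c
            }
        }
        return matrix;
    }
}
```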

Run

RunARW.sh hdfs-input-path hdfs-output-path hdfs-tmp-path

Parameters

hdfs-input-path: comma-separated HDFS file of (source-id:string, target-id:string, weight:float).
hdfs-output-path: output path.
hdfs-tmp-path: temp path.
sinkProb: sink threshold; probabilities below it are considered 0. A large value causes too much network data transfer and can lead to "can't fork" failures during the copy on each datanode. Too small a value will not discover any indirect links.
iteration: how many times to propagate the probabilities. Theoretically, the algorithm should converge after many iterations.
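
The roles of sinkProb and iteration can be illustrated with a small in-memory sketch. This is one plausible reading of the propagation, not the repository's implementation: it starts with one concept per right vertex (concept 1), row-normalizes the edge weights into transition probabilities, and alternates left and right updates, treating any probability below sinkProb as 0. `BipartiteWalkSketch` and its method names are hypothetical, and plain arrays stand in for Mahout math matrices.

```java
// Hypothetical sketch; method names and the exact update rule are assumptions.
final class BipartiteWalkSketch {

    /** Turns each row of non-negative weights into transition probabilities. */
    static double[][] rowNormalize(double[][] w) {
        double[][] p = new double[w.length][];
        for (int r = 0; r < w.length; r++) {
            double sum = 0;
            for (double v : w[r]) sum += v;
            p[r] = new double[w[r].length];
            for (int c = 0; c < w[r].length; c++) {
                p[r][c] = sum > 0 ? w[r][c] / sum : 0;
            }
        }
        return p;
    }

    static double[][] transpose(double[][] w) {
        double[][] t = new double[w[0].length][w.length];
        for (int r = 0; r < w.length; r++) {
            for (int c = 0; c < w[0].length; c++) t[c][r] = w[r][c];
        }
        return t;
    }

    /** One propagation step: transition x probs, with entries below sinkProb set to 0. */
    static double[][] step(double[][] transition, double[][] probs, double sinkProb) {
        double[][] out = new double[transition.length][probs[0].length];
        for (int r = 0; r < transition.length; r++) {
            for (int k = 0; k < probs.length; k++) {
                if (transition[r][k] == 0) continue;
                for (int c = 0; c < probs[0].length; c++) {
                    out[r][c] += transition[r][k] * probs[k][c];
                }
            }
        }
        for (double[] row : out) {
            for (int c = 0; c < row.length; c++) {
                if (row[c] < sinkProb) row[c] = 0;
            }
        }
        return out;
    }

    /** Returns {left x concepts, right x concepts} after the given number of iterations. */
    static double[][][] propagate(double[][] weights, int iterations, double sinkProb) {
        double[][] leftToRight = rowNormalize(weights);
        double[][] rightToLeft = rowNormalize(transpose(weights));
        int numRight = weights[0].length;
        double[][] right = new double[numRight][numRight];
        for (int i = 0; i < numRight; i++) right[i][i] = 1.0; // one concept per right vertex
        double[][] left = new double[weights.length][numRight];
        for (int it = 0; it < iterations; it++) {
            left = step(leftToRight, right, sinkProb);
            right = step(rightToLeft, left, sinkProb);
        }
        return new double[][][] {left, right};
    }
}
```

Here `weights[l][r]` is the edge weight between left vertex l and right vertex r, for example the click count between a query and a URL.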

Results

Probability matrix (|left vertices| x |concepts|). The output can be considered the likelihood that each left vertex will sink into each concept.
Probability matrix (|right vertices| x |concepts|). Note that if we initiate the concepts on the right vertices, this matrix can be considered a soft clustering.

SteamShon/collaborative-filtering-experiment
