This is my utility library for experimenting with the machine learning techniques I learned from ML-class.org. I conducted two main experiments over several weeks.
This is a multithreaded implementation of collaborative filtering (ALS-WR) using matrix factorization for implicit feedback.
It borrows heavily from Mahout 0.7. If your matrices can be loaded into memory, just use Mahout.
- The U/M matrix is too large (about 8 GB as a 2D float array) for mapred.map.child.java.opts. The Mahout implementation runs out of heap space when it loads the U/M matrix into memory. I used an HDFS file lock and the fair scheduler to keep too many map tasks from running simultaneously on the same datanode.
- Loading the U/M matrix into memory during each setup is single-threaded. Since the keys in the HDFS part files are mutually exclusive (the key is the row id), it is safe to run multiple threads that read multiple part files into memory.
- Even with U/M loaded into memory, computing the current ui, mj in a single thread is too slow. Note that each ui, mj can be computed independently, so they can run simultaneously on multiple threads. I added multithreadedMapMapper to spawn a multithreaded map method.
Using multiple threads in the map task (10 threads per task), it runs 3 times faster than the single-threaded version. I still need to figure out how to emit asynchronously in the mapper.
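The parallel loading idea above can be sketched as follows. This is a minimal illustration in Python rather than the Hadoop/Java code of this project; `load_part`, `load_matrix`, and the in-memory `parts` lists standing in for HDFS part files are all hypothetical names. It relies on the same assumption as the text: no row id appears in more than one part, so threads never write to the same slot.

```python
from concurrent.futures import ThreadPoolExecutor


def load_part(part, matrix):
    """Copy every (row_id, vector) pair of one part into the shared matrix."""
    for row_id, vector in part:
        # Safe without a lock only because row ids are disjoint across parts.
        matrix[row_id] = vector


def load_matrix(parts, num_rows, num_threads=4):
    """Load all parts concurrently into one row-indexed matrix."""
    matrix = [None] * num_rows
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # list() waits for every task and re-raises any worker exception.
        list(pool.map(lambda part: load_part(part, matrix), parts))
    return matrix


# Three hypothetical "part files", each holding a disjoint set of row ids.
parts = [
    [(0, [0.1, 0.2]), (3, [0.7, 0.8])],
    [(1, [0.3, 0.4])],
    [(2, [0.5, 0.6])],
]
U = load_matrix(parts, num_rows=4)
```

The same pattern applies to the per-row updates: each ui or mj depends only on the other, fixed matrix, so rows can be handed to worker threads independently.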
RunALS.sh hdfs-input-path numIteration numFeature hdfs-output-path hdfs-tmp-path
- hdfs-input-path: comma-delimited HDFS file of (user-id:string, item-id:string, rating:float).
- numIteration: how many iterations of ALS-WR to run. More iterations usually mean better optimization of the cost function.
- numFeature: how many latent features to use. A larger value usually means better optimization of the cost function.
- alpha (hard-coded in the script): weight in the function that transforms a rating into a confidence, as described in the paper.
- lambda (hard-coded in the script): weight of the regularization term in the cost function. Refer to the paper for details.
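To make the roles of alpha and lambda concrete, here is a minimal sketch in Python. It assumes the linear transform c = 1 + alpha * r that is commonly used for implicit-feedback ALS, and a plain L2 penalty rather than the weighted-lambda variant; the exact form used by this project should be checked against the paper and the code. The function names and the tiny example matrices are hypothetical.

```python
def confidence(rating, alpha):
    """Assumed linear rating-to-confidence transform: c = 1 + alpha * r."""
    return 1.0 + alpha * rating


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cost(ratings, U, M, alpha, lam):
    """Confidence-weighted squared error over all cells plus L2 regularization.

    ratings: dict mapping (user_index, item_index) -> observed rating
    U, M:    lists of latent feature vectors, one per user / item
    """
    error = 0.0
    for u, user_vec in enumerate(U):
        for i, item_vec in enumerate(M):
            r = ratings.get((u, i), 0.0)
            preference = 1.0 if r > 0 else 0.0  # binary preference
            error += confidence(r, alpha) * (preference - dot(user_vec, item_vec)) ** 2
    reg = sum(dot(v, v) for v in U) + sum(dot(v, v) for v in M)
    return error + lam * reg


# Hypothetical example: one user, two items, one latent feature.
ratings = {(0, 0): 2.0}
U = [[1.0]]
M = [[1.0], [0.0]]
example_cost = cost(ratings, U, M, alpha=40.0, lam=0.1)
```

In this example both cells are predicted exactly, so only the regularization term contributes to the cost.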
- indexify your rating input: (user-index:integer, item-index:integer, rating:float).
- split the data per user-index. Note that Mahout's implementation doesn't group by user-index for the split.
- factorize the rating matrix into U/M. This takes a long time, depending on the dimensions of the user matrix.
- build recommendations using U x M.
- remove already rated items from the recommendations.
- run evaluation (optional).
- de-indexify the recommendation results.
- recommendations for each user-id.
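The indexify, score, filter, and de-indexify steps can be sketched on a toy input as follows. This is only an illustration in Python, not the MapReduce jobs run by the script: `indexify` and `recommend` are hypothetical helpers, and U and M are hard-coded here where the real pipeline would use the factorization output.

```python
def indexify(ids):
    """Map each distinct string id to a dense integer index, in first-seen order."""
    index = {}
    for s in ids:
        index.setdefault(s, len(index))
    return index


def recommend(ratings, U, M, top_n=2):
    """ratings: list of (user_id, item_id, rating); U, M: factors by index."""
    user_index = indexify(u for u, _, _ in ratings)
    item_index = indexify(i for _, i, _ in ratings)
    item_ids = {idx: item for item, idx in item_index.items()}  # for de-indexify

    # Remember which item indexes each user has already rated.
    rated = {}
    for u, i, _ in ratings:
        rated.setdefault(user_index[u], set()).add(item_index[i])

    result = {}
    for user_id, u in user_index.items():
        scores = []
        for i, item_vec in enumerate(M):
            if i in rated[u]:
                continue  # remove already rated items
            score = sum(a * b for a, b in zip(U[u], item_vec))  # row of U x M
            scores.append((score, item_ids[i]))
        scores.sort(reverse=True)
        result[user_id] = [item for _, item in scores[:top_n]]
    return result


ratings = [("alice", "a", 1.0), ("alice", "b", 3.0),
           ("bob", "b", 2.0), ("bob", "c", 1.0)]
U = [[1.0, 0.0], [0.0, 1.0]]              # hypothetical user factors
M = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.7]]  # hypothetical item factors
recs = recommend(ratings, U, M)
```

Each user only receives items they have not rated, keyed by the original string ids.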
This is an implementation of a random walk algorithm on an undirected bipartite graph, using the Mahout math library.
If your graph is bipartite, use this.
A concept is a composite of vertices in the input graph that represents an abstract theme. In its simplest form, a concept can be a single vertex in the graph or a set of vertices.
Example: assume a bipartite graph built from a click log. The left vertices are queries and the right vertices are the URLs that were clicked for those queries. In this case a concept can be one of the following:
1. a single URL. We then get, for every query, the probability that it will click this concept URL.
2. a set of URLs that represents a bigger concept, such as a category.
3. a composite of 1 and 2.
Currently I have only implemented the simplest concept, 1. Override retrieveClassMatrix if you want the others.
RunARW.sh hdfs-input-path hdfs-output-path hdfs-tmp-path
- hdfs-input-path: comma-separated HDFS file of (source-id:string, target-id:string, weight:float).
- hdfs-output-path: output path.
- tmp: temp path.
- sinkProb: sink threshold below which a probability is considered 0. Too large a value causes too much network data transfer and can lead to "can't fork" failures during copy on each datanode. Too small a value will not discover any indirect links.
- iteration: how many times to propagate the probabilities. Theoretically the algorithm should converge after many iterations.
- probability matrix (|left vertices| x |concepts|): can be read as the likelihood that each left vertex sinks into each concept.
- probability matrix (|right vertices| x |concepts|): note that if the concepts are initialized on the right vertices, this matrix can be read as a soft clustering.
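A minimal sketch of this kind of propagation, in Python, may help to show how the two output matrices arise. It is not the Mahout-based implementation of this project and makes several assumptions: edge weights are row-normalized into transition probabilities, every right vertex starts as its own concept (concept type 1), probabilities are passed back and forth between the two sides once per iteration, and any entry below the threshold is treated as 0. All function names are hypothetical.

```python
def normalize_rows(weights):
    """weights: dict source -> {target: weight}; returns transition probabilities."""
    probs = {}
    for src, targets in weights.items():
        total = sum(targets.values())
        probs[src] = {t: w / total for t, w in targets.items()}
    return probs


def propagate(transitions, scores, sink_prob):
    """Give each vertex the transition-weighted sum of its neighbours' concept scores."""
    out = {}
    for src, targets in transitions.items():
        row = {}
        for t, p in targets.items():
            for concept, value in scores.get(t, {}).items():
                row[concept] = row.get(concept, 0.0) + p * value
        # Entries below the threshold are treated as 0 and dropped.
        out[src] = {c: v for c, v in row.items() if v >= sink_prob}
    return out


def random_walk(edges, iterations, sink_prob):
    """edges: list of (left_id, right_id, weight); returns (left, right) matrices."""
    left_to_right, right_to_left = {}, {}
    for left_id, right_id, w in edges:
        left_to_right.setdefault(left_id, {})[right_id] = w
        right_to_left.setdefault(right_id, {})[left_id] = w
    p_lr = normalize_rows(left_to_right)
    p_rl = normalize_rows(right_to_left)
    right = {r: {r: 1.0} for r in right_to_left}  # each URL is its own concept
    left = {}
    for _ in range(iterations):
        left = propagate(p_lr, right, sink_prob)
        right = propagate(p_rl, left, sink_prob)
    return left, right


# Toy click log: queries q1, q2 on the left, URLs u1, u2 on the right.
edges = [("q1", "u1", 3.0), ("q1", "u2", 1.0), ("q2", "u2", 2.0)]
left, right = random_walk(edges, iterations=1, sink_prob=0.01)
pruned_left, _ = random_walk(edges, iterations=1, sink_prob=0.3)
```

In this sketch the rows of both matrices sum to 1 as long as nothing is pruned, which is what allows the right-side matrix to be read as a soft clustering of URLs over URL concepts.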