Skip to content

hmetaxa/MixLDA

Repository files navigation

Non-parametric Multi-View Topic Model (MV-HDP) that extends well-established Hierarchical Dirichlet Process (HDP) incorporating a novel Interacting Pólya Urn scheme (IUM) to model per-document topic distribution. This way, MV-HDP combines interaction and reinforcement addressing the following major challenges: 1) multi-view modeling leveraging statistical strength among different views, 2) estimation and adaptation to the extent of correlation between the different views that is automatically inferred during inference and 3) scalable inference on massive, real world datasets. The latter is achieved through a parallel Gibbs sampling scheme that utilizes efficient F+Tree data structure. 
Our implementation is based on MALLET toolkit. We consider that estimating the right number of topics is not our primary goal especially in collections of that size. So, we have implemented a truncated version of the proposed model where the goal is to better estimate priors and model parameters given a maximum number of topics. Although initially we extended MALLET’s efficient parallel topic model with Sparse LDA sampling, we end up implementing a very different parallel implementation based on F+Trees that: 1) it is more readable and extensible, 2) it is usually faster in real world big datasets especially in multi-view settings 3) shares model related data structures across threads (contrary to MALLET where each thread retains each own model copy). We update the model using background threads based on a lock free, queue based scheme minimizing staleness. Our implementation can scale to massive datasets (over one million documents with meta-data and links) on a single computer.

Related classes in package cc.mallet.topics:
FastQMVWVParallelTopicModel: Main class
FastQMVWVUpdaterRunnable: Model updating
FastQMVWVWorkerRunnable: Gibbs Sampling
FastQMVWVTopicModelDiagnostics: Related diagnostics 
PTMExperiment: Example of how we can load data from SQLite, run MV-Topic Models, calc similarities etc

Example results (multi view topics on Full ACM & OpenAccess PubMed corpora):
https://1drv.ms/f/s!Aul4avjcWIHpg-Ara0PZzqHeOkyGIw
(short link http://1drv.ms/1PI9UB0 is not working any more) 

About

Extensions on classic LDA implementation (based on MALLET toolkit)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages