Extensions on classic LDA implementation (based on MALLET toolkit)
hmetaxa/MixLDA
MV-HDP is a non-parametric multi-view topic model that extends the well-established Hierarchical Dirichlet Process (HDP) by incorporating a novel Interacting Pólya Urn scheme (IUM) to model the per-document topic distribution. By combining interaction and reinforcement, MV-HDP addresses three major challenges: 1) multi-view modeling that leverages statistical strength across the different views, 2) estimating and adapting to the extent of correlation between the views, which is inferred automatically during inference, and 3) scalable inference on massive, real-world datasets. The last is achieved through a parallel Gibbs sampling scheme that uses the efficient F+Tree data structure.

Our implementation is based on the MALLET toolkit. We do not consider estimating the right number of topics to be a primary goal, especially in collections of this size, so we have implemented a truncated version of the proposed model in which the goal is to better estimate priors and model parameters given a maximum number of topics.

Although we initially extended MALLET's efficient parallel topic model with Sparse LDA sampling, we ended up writing a very different parallel implementation based on F+Trees that 1) is more readable and extensible, 2) is usually faster on large real-world datasets, especially in multi-view settings, and 3) shares model-related data structures across threads (in contrast to MALLET, where each thread retains its own copy of the model). We update the model using background threads, based on a lock-free, queue-based scheme that minimizes staleness. Our implementation can scale to massive datasets (over one million documents with metadata and links) on a single computer.
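To make the role of the F+Tree more concrete, here is a minimal sketch of the general idea: a complete binary tree stored in an array, where each internal node holds the sum of the weights below it, so that both drawing a topic and changing one topic's weight take O(log T) time. The class name `FPlusTreeSketch`, its method names, and the array layout are assumptions made for illustration and are not taken from this repository's code.

```java
/**
 * Illustrative F+Tree-style sampler (hypothetical; not the repository's class).
 * Leaves hold unnormalized topic weights; each internal node holds the sum of its children.
 */
final class FPlusTreeSketch {
    private final int size;      // number of leaves, padded up to a power of two
    private final double[] tree; // tree[1] is the root; leaves live in [size, 2 * size)

    FPlusTreeSketch(double[] weights) {
        int n = 1;
        while (n < weights.length) {
            n <<= 1;
        }
        size = n;
        tree = new double[2 * n];
        System.arraycopy(weights, 0, tree, n, weights.length);
        // Build internal sums bottom-up; padded leaves keep weight zero.
        for (int i = n - 1; i >= 1; i--) {
            tree[i] = tree[2 * i] + tree[2 * i + 1];
        }
    }

    /** Sum of all topic weights (the normalizing constant). */
    double total() {
        return tree[1];
    }

    /** Current weight of a single topic. */
    double get(int topic) {
        return tree[size + topic];
    }

    /** Adds delta to one topic's weight and refreshes the sums on the path to the root. */
    void update(int topic, double delta) {
        for (int i = size + topic; i >= 1; i >>= 1) {
            tree[i] += delta;
        }
    }

    /**
     * Returns the topic whose cumulative-weight interval contains u.
     * The caller is expected to pass u in [0, total()).
     */
    int sample(double u) {
        int i = 1;
        while (i < size) {
            i <<= 1;                 // move to the left child
            if (u >= tree[i]) {      // u falls outside the left subtree
                u -= tree[i];
                i++;                 // move to the right sibling
            }
        }
        return i - size;
    }
}
```

A sampler built on this sketch would draw `u` uniformly from `[0, total())`, call `sample(u)` to pick a topic, and call `update` whenever a count change alters that topic's weight.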
Related classes in package cc.mallet.topics:

- FastQMVWVParallelTopicModel: main class
- FastQMVWVUpdaterRunnable: model updating
- FastQMVWVWorkerRunnable: Gibbs sampling
- FastQMVWVTopicModelDiagnostics: related diagnostics
- PTMExperiment: example of how to load data from SQLite, run MV-Topic Models, calculate similarities, etc.

Example results (multi-view topics on the full ACM and OpenAccess PubMed corpora): https://1drv.ms/f/s!Aul4avjcWIHpg-Ara0PZzqHeOkyGIw (the short link http://1drv.ms/1PI9UB0 no longer works)
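The split between sampling workers and a separate model updater, together with the lock-free, queue-based update scheme described earlier, can be pictured with the following minimal sketch. It is an assumption-laden illustration and not the repository's code: the class `QueuedUpdateSketch`, the `Delta` type, and the `demo` method are hypothetical names, and the JDK's non-blocking `ConcurrentLinkedQueue` stands in for whatever queue the real implementation uses. Workers publish count changes to the queue instead of writing to shared counts, and a single updater thread applies them, so the counts need no lock.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

/** Illustrative worker/updater pattern (hypothetical; not the repository's classes). */
final class QueuedUpdateSketch {
    /** One count change: a token of the given type moved from oldTopic to newTopic. */
    static final class Delta {
        final int type;
        final int oldTopic;
        final int newTopic;

        Delta(int type, int oldTopic, int newTopic) {
            this.type = type;
            this.oldTopic = oldTopic;
            this.newTopic = newTopic;
        }
    }

    private final int[][] typeTopicCounts; // written only by the updater thread
    private final ConcurrentLinkedQueue<Delta> queue = new ConcurrentLinkedQueue<>();
    private volatile boolean samplingFinished = false;

    QueuedUpdateSketch(int[][] initialCounts) {
        this.typeTopicCounts = initialCounts;
    }

    /** Called by sampling workers; never blocks. */
    void publish(int type, int oldTopic, int newTopic) {
        queue.add(new Delta(type, oldTopic, newTopic));
    }

    /** Body of the single updater thread: drain the queue and apply each change. */
    void runUpdater() {
        while (true) {
            boolean finished = samplingFinished; // read the flag BEFORE polling
            Delta d = queue.poll();
            if (d != null) {
                typeTopicCounts[d.type][d.oldTopic]--;
                typeTopicCounts[d.type][d.newTopic]++;
            } else if (finished) {
                return; // queue was empty after all workers had stopped
            } else {
                Thread.yield();
            }
        }
    }

    /** Small driver: every worker moves tokens of type 0 from topic 0 to topic 1. */
    static int[][] demo(int numWorkers, int movesPerWorker) {
        int[][] counts = new int[1][2];
        counts[0][0] = numWorkers * movesPerWorker;
        QueuedUpdateSketch model = new QueuedUpdateSketch(counts);

        Thread updater = new Thread(model::runUpdater);
        updater.start();

        Thread[] workers = new Thread[numWorkers];
        for (int w = 0; w < numWorkers; w++) {
            workers[w] = new Thread(() -> {
                for (int i = 0; i < movesPerWorker; i++) {
                    model.publish(0, 0, 1);
                }
            });
            workers[w].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
            model.samplingFinished = true;
            updater.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counts;
    }
}
```

In this sketch, workers that read the shared counts while sampling may see slightly stale values; keeping the queue short by draining it continuously in the background is one way to limit that staleness.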