lda2vec: The Best of Both Worlds

Precognox

In our previous analysis, we used LDA to discover the topics of the discourse on CEU and NGOs in the Hungarian online media. We love LDA, so we were shocked when it put articles on the same issue (the legislative process, the reaction of the EU, and the affected institutions) into two separate topics (check out the pyLDAvis output here). This was due to the very different word usage of the sites: the independent (or left, or liberal, choose your favorite term) media prefers official names, for example Central European University, NGOs, EU, etc., while the quasi-state-financed (or right, or pro-government) side uses terms like “Soros University”, “foreign organizations”, “Brussels”, etc. Because LDA works on word co-occurrence within documents, two groups of articles that describe the same events with different vocabularies can end up in different topics. Our word2vec model built on the corpus shows that such terms are close to each other in the semantic space, i.e. “Soros University” and “Central European University” occupy very similar positions (you can explore the 3D t-SNE projection of the word2vec model here). That’s why we tried Christopher E. Moody’s lda2vec algorithm; we hoped it could overcome this word usage problem.
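To make the intuition concrete, here is a minimal sketch in plain Python. It is not our model and not Moody's implementation: the three-dimensional vectors, the topic names and the helper functions (`cosine`, `softmax`, `document_vector`, `nearest_words`) are all invented for illustration. It only shows the two ideas we rely on: terms used by the opposing sides can sit close to each other in the embedding space, and lda2vec represents a document as a softmax-weighted mixture of topic vectors that live in the same space as the words, so a topic can be described by the words nearest to it.

```python
import math

# Toy 3-dimensional "embeddings". The numbers are made up for illustration;
# a real word2vec model would learn them from the corpus.
word_vectors = {
    "Central European University": [0.90, 0.10, 0.05],
    "Soros University": [0.85, 0.15, 0.10],
    "Brussels": [0.10, 0.90, 0.05],
    "EU": [0.15, 0.85, 0.10],
    "football": [0.05, 0.05, 0.95],
}

# Hypothetical topic vectors in the same space as the word vectors.
topic_vectors = [
    [1.0, 0.0, 0.0],  # assumed "university" topic
    [0.0, 1.0, 0.0],  # assumed "europe" topic
]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(weights):
    """Turn unnormalised document weights into topic proportions."""
    peak = max(weights)
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def document_vector(topic_weights):
    """Mix the topic vectors using softmax-normalised document weights."""
    proportions = softmax(topic_weights)
    dims = len(topic_vectors[0])
    return [
        sum(p * topic[i] for p, topic in zip(proportions, topic_vectors))
        for i in range(dims)
    ]


def nearest_words(vector, k=2):
    """Return the k words whose vectors are most similar to `vector`."""
    ranked = sorted(
        word_vectors,
        key=lambda word: cosine(word_vectors[word], vector),
        reverse=True,
    )
    return ranked[:k]


# The two names for the same institution are close to each other...
same_institution = cosine(
    word_vectors["Central European University"],
    word_vectors["Soros University"],
)
# ...and far from an unrelated term.
unrelated = cosine(
    word_vectors["Central European University"],
    word_vectors["football"],
)

# A document that leans heavily towards the first topic.
doc = document_vector([2.0, 0.0])
print(same_institution > unrelated)
print(nearest_words(doc))
```

In this toy setup both names for the university are the nearest neighbours of the same topic, whichever outlet's vocabulary a document happens to use. That is the behaviour we hoped to get from lda2vec on the real corpus.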

Although the algorithm needs more work before we can use it in real-life scenarios, the first results are very promising: in our case we got more descriptive topics, which opens up the possibility of finding articles on the same issue across the opposing narratives.

