Project Description
Document clustering is one of the major tasks in natural language processing. Although it has been widely researched, it has several limitations because clustering is an unsupervised learning technique. The most critical limitation is that no one can guarantee that a clustering result is correct. This means that an expert with domain-specific knowledge has to check and verify the result.
For this reason, Grace and I saw the need to compare clustering results from different clustering techniques. We visualized different clustering models on a web page to support the evaluation of models and of cluster quality when retrieving target clusters, which often becomes an exploratory problem.
This project uses the following techniques:
Clustering Algorithms
Clustering models fall into several types according to the algorithm they use. We chose three different types of model: Latent Dirichlet Allocation (LDA), K-means, and Deep Embedded Clustering (DEC).
- Latent Dirichlet Allocation (probabilistic method): LDA treats each item as a mixture of clusters, with a probability distribution over them.
- K-means (centroid-based method): K-means assigns each item to a cluster so as to minimize the within-cluster sum of squares (WCSS, i.e. variance).
- Deep Embedded Clustering (low-dimensional embedding method): DEC uses a deep neural network to learn the features used to represent each cluster.
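To make the WCSS objective concrete, here is a minimal sketch of K-means (Lloyd's algorithm) in plain Python. It is an illustration only, not the code used in this project; the function names, the toy points, and the parameters `n_iter` and `seed` are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids, wcss)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    # WCSS: total squared distance from each point to its cluster centroid.
    wcss = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, wcss


# Toy 2-D "document vectors" (hypothetical): two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids, wcss = kmeans(points, k=2)
```

Each iteration can only lower (or keep) the WCSS, which is why the assignment and update steps are repeated until the labels stop changing; a fixed iteration count is used here only to keep the sketch short.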
Dimension Reduction
UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction (McInnes, L., & Healy, J., 2018)
To visualize documents, we used Uniform Manifold Approximation and Projection (UMAP). t-SNE is one of the most popular techniques for reducing dimensions for preprocessing and projection. However, t-SNE requires a lot of computing power and is time-consuming. Recently, a new technique called UMAP was developed. It preserves more of the global structure and runs faster than t-SNE.
Keyword Extraction & Entity Recognition
To verify clusters, we extract the main keywords from each cluster. Keywords are scored by a probability based on the frequency of a word in a cluster relative to its frequency across all documents. We use the pyLDAvis library to extract keywords. However, pyLDAvis only supports extracting keywords from LDA models. For the K-means and DEC models, we therefore use a custom library by lovit called kmeans_to_pyLDAvis.
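The frequency-ratio idea behind the keyword scores can be sketched as follows. This is a simplified illustration, not the scoring code of pyLDAvis or of lovit's library; the function `cluster_keywords`, its `top_n` parameter, and the sample documents are hypothetical.

```python
from collections import Counter


def cluster_keywords(docs, labels, top_n=3):
    """Rank words per cluster by P(word | cluster) / P(word | corpus)."""
    corpus_counts = Counter(word for doc in docs for word in doc)
    corpus_total = sum(corpus_counts.values())

    # Word counts restricted to the documents of each cluster.
    cluster_counts = {}
    for doc, label in zip(docs, labels):
        cluster_counts.setdefault(label, Counter()).update(doc)

    keywords = {}
    for label, counts in cluster_counts.items():
        cluster_total = sum(counts.values())
        scores = {
            word: (n / cluster_total) / (corpus_counts[word] / corpus_total)
            for word, n in counts.items()
        }
        # Sort by score (descending), then alphabetically to break ties.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords[label] = ranked[:top_n]
    return keywords


# Hypothetical tokenized documents and their cluster labels.
docs = [
    ["goal", "match", "team"],
    ["team", "goal", "score"],
    ["stock", "market", "price"],
    ["market", "price", "team"],
]
labels = [0, 0, 1, 1]
keywords = cluster_keywords(docs, labels, top_n=2)
```

A word such as "team", which appears in both clusters, gets a lower ratio than words concentrated in a single cluster, so it drops out of the top keywords.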
Entity recognition provides another kind of information for verifying cluster quality. We provide not only the entities of each document, but also the frequency of entities across the documents in the same cluster. To recognize entities, we use the spaCy library with its English model (en_core_web_sm, which includes vocabulary, syntax, and entities).
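The cluster-level entity frequencies could be aggregated roughly as in the sketch below. It assumes the entities have already been extracted per document; the function `entity_frequencies` and the sample data are hypothetical and are not taken from the project code.

```python
from collections import Counter, defaultdict


def entity_frequencies(doc_entities, labels):
    """Count how often each (text, label) entity occurs within each cluster."""
    per_cluster = defaultdict(Counter)
    for entities, cluster in zip(doc_entities, labels):
        per_cluster[cluster].update(entities)
    return dict(per_cluster)


# Hypothetical per-document entities as (text, entity_label) pairs. With spaCy,
# such pairs could be built as [(ent.text, ent.label_) for ent in nlp(text).ents]
# after nlp = spacy.load("en_core_web_sm").
doc_entities = [
    [("Google", "ORG"), ("London", "GPE")],
    [("Google", "ORG")],
    [("Paris", "GPE")],
]
labels = [0, 0, 1]
freq = entity_frequencies(doc_entities, labels)

# Most frequent entity in cluster 0 and its count.
top_entity, count = freq[0].most_common(1)[0]
```

Keeping the entity label alongside the text means the same string tagged as two different entity types is counted separately.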
Implementation
The system is implemented mainly in Python, which handles running the models, importing models, feature reduction (extracting coordinates for the scatter chart), and entity recognition. Using Flask (a Python library), we run a local web server, and every result produced in Python is served by it. On the front end, we visualize the data from the Flask server using JavaScript and d3.js.
Moreover, we built a desktop application using Electron, a well-known platform for building cross-platform desktop apps. Electron creates a window containing a Chromium browser, so our application can load web pages from the Flask server. To do this, we package the Python server with PyInstaller and the front-end web pages with Electron.