Many methods exist for natural language processing and text mining. However, in tweets, survey responses, Facebook posts, and much other online data, texts are short and carry too little information on their own. The traditional bag-of-words (BOW) model yields sparse vector representations.
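To see why short texts give sparse BOW vectors, here is a minimal pure-Python sketch (the toy vocabulary is invented for illustration; this is not code from the `shorttext` package):

```python
# Illustrate bag-of-words sparsity on a short text.
from collections import Counter

# Toy vocabulary (assumed for this example only).
vocab = ["drive", "drives", "driver", "chauffeur", "car",
         "road", "survey", "tweet", "data", "model"]

def bow_vector(text, vocab):
    """Count each vocabulary word's occurrences in the text."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

vec = bow_vector("the driver drives the car", vocab)
print(vec)  # [0, 1, 1, 0, 1, 0, 0, 0, 0, 0] -- mostly zeros
print(sum(1 for x in vec if x > 0) / len(vocab))  # 0.3
```

With a realistic vocabulary of tens of thousands of words, the fraction of non-zero entries for a tweet-length text is far smaller still.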
Semantic relations between words are important because we usually do not have enough data to capture the similarity between words directly. We do not want “drive” and “drives,” or “driver” and “chauffeur,” to be treated as completely different. The relations between words and their order become important as well, as do concepts that may be correlated in our training dataset.
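The two word pairs above illustrate two different problems. A crude suffix-stripping rule (a deliberately naive sketch, not Porter's stemmer and not how `shorttext` handles this) can relate “drive” and “drives,” but no morphological rule will ever relate “driver” and “chauffeur” — that requires a semantic resource such as word embeddings:

```python
# Naive suffix stripping: enough for inflectional variants,
# useless for synonyms.
def naive_stem(word):
    # Strip a trailing "s" from sufficiently long words (toy rule).
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

print(naive_stem("drives"))  # "drive"  -- matched to "drive"
print(naive_stem("driver"))  # "driver" -- still unrelated to "chauffeur"
```

This gap is why the package supports pre-trained word embeddings, which place semantically similar words near each other in a vector space.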
We therefore have to represent these texts in a suitable way and perform supervised learning with traditional machine learning algorithms or deep learning algorithms.
This package `shorttext` was designed to tackle all these problems. It is not a completely new invention, but it puts together many known techniques. It contains the following features:
- example data provided (including subject keywords and NIH RePORT);
- text preprocessing;
- pre-trained word-embedding support;
- gensim topic models (LDA, LSI, Random Projections) and autoencoder;
- topic model representation supported for supervised learning using scikit-learn;
- cosine distance classification; and
- neural network classification (including ConvNet, and C-LSTM).
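As a conceptual sketch of the cosine-distance idea in the list above, the following plain-Python snippet assigns a query text the label of the most cosine-similar training text. The training corpus is invented, and `shorttext`'s own classifiers operate on embedded or topic vectors rather than raw counts, so this is only an illustration of the principle:

```python
# Nearest-neighbor classification by cosine similarity over
# word-count dictionaries (illustrative only).
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two word-count dicts."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def classify(text, labeled_docs):
    """Return the label whose document is most similar to the text."""
    q = Counter(text.lower().split())
    return max(labeled_docs,
               key=lambda lbl: cosine(q, Counter(labeled_docs[lbl].lower().split())))

train = {  # toy labeled corpus (assumed)
    "physics": "quantum field theory particle energy",
    "finance": "stock market price trading energy",
}
print(classify("particle energy experiment", train))  # "physics"
```

Replacing the raw counts with topic-model or embedding vectors, as the package does, makes the same distance-based scheme robust to vocabulary mismatch.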
Readers can refer to the documentation for details.