The Python package for text mining, shorttext, has a new release: 0.5.4. It can be installed by typing in the command line:
pip install -U shorttext
Some users may need to install it as “root”, i.e., by adding sudo in front of the command.
Since version 0.5 (including releases 0.5.1 and 0.5.4), there have been substantial additions of functionality. Most of them compare short phrases without running a supervised or unsupervised machine learning algorithm, instead calculating the “similarity” with various metrics, including:
- soft Jaccard score (the same kind of fuzzy scores based on edit distance in SOCcer),
- Word Mover’s distance (WMD, described in detail in a previous post), and
- Jaccard index based on a word-embedding model.
The soft Jaccard score based on edit distance can be called as follows:
>>> from shorttext.metrics.dynprog import soft_jaccard_score
>>> soft_jaccard_score(['book', 'seller'], ['blok', 'sellers'])     # gives 0.6716417910447762
>>> soft_jaccard_score(['police', 'station'], ['policeman'])        # gives 0.2857142857142858
The core of this code was written in C, and interfaced to Python using SWIG.
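To illustrate what this score measures, here is a minimal pure-Python sketch. It is not the package's C implementation: the function names are hypothetical, and it assumes that the similarity between two tokens is 1 minus the plain Levenshtein distance divided by the longer token's length, with tokens paired greedily one-to-one from the most similar pair downward. Under these assumptions it gives the same values as the two calls above.

```python
from itertools import product


def levenshtein(a, b):
    """Plain Levenshtein edit distance (an assumption; the package may use a variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Edit-distance similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def soft_jaccard_sketch(tokens1, tokens2):
    """Hypothetical soft Jaccard: greedy one-to-one matching of token pairs."""
    # Score every cross pair, most similar first.
    pairs = sorted(
        ((similarity(t1, t2), i, j)
         for (i, t1), (j, t2) in product(enumerate(tokens1), enumerate(tokens2))),
        reverse=True,
    )
    used1, used2 = set(), set()
    intersection = 0.0
    for sim, i, j in pairs:
        # Each token may take part in at most one matched pair.
        if i not in used1 and j not in used2:
            intersection += sim
            used1.add(i)
            used2.add(j)
    union = len(tokens1) + len(tokens2) - intersection
    return intersection / union if union > 0 else 1.0


print(soft_jaccard_sketch(['book', 'seller'], ['blok', 'sellers']))   # about 0.6716
print(soft_jaccard_sketch(['police', 'station'], ['policeman']))      # about 0.2857
```

In the first example the matched pairs contribute 3/4 and 6/7, so the "soft intersection" is 45/28 and the score is (45/28) / (4 - 45/28) = 45/67.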
For the Word Mover’s Distance (WMD), the source code is the same as in my previous post, but it can now be called directly. First, load the modules and the word-embedding model:
>>> from shorttext.metrics.wasserstein import word_mover_distance
>>> from shorttext.utils import load_word2vec_model
>>> wvmodel = load_word2vec_model('/path/to/model_file.bin')
Then compute the WMD with a single function:
>>> word_mover_distance(['police', 'station'], ['policeman'], wvmodel)                    # gives 3.060708999633789
>>> word_mover_distance(['physician', 'assistant'], ['doctor', 'assistants'], wvmodel)    # gives 2.276337146759033
The Jaccard index based on cosine distance in a word-embedding model can be called like this:
>>> from shorttext.metrics.embedfuzzy import jaccardscore_sents
>>> jaccardscore_sents('doctor', 'physician', wvmodel)                   # gives 0.6401538990056869
>>> jaccardscore_sents('chief executive', 'computer cluster', wvmodel)   # gives 0.0022515450768836143
>>> jaccardscore_sents('topological data', 'data of topology', wvmodel)  # gives 0.67588977344632573
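The idea behind an embedding-based Jaccard index can be sketched without a real model. The code below is only an illustration, not the package's implementation: the function names and the two-dimensional toy vectors are made up, whitespace tokenization is assumed, and word similarity is taken to be the cosine similarity, with words paired greedily one-to-one as a "fuzzy" intersection.

```python
import math
from itertools import product

# Hypothetical toy embeddings; a real word2vec model has hundreds of dimensions.
toy_vectors = {
    'doctor': (1.0, 0.0),
    'physician': (0.8, 0.6),
    'cluster': (0.0, 1.0),
}


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def embedding_jaccard_sketch(sent1, sent2, vectors):
    """Hypothetical fuzzy Jaccard index using word-vector cosine similarity."""
    # Assumed tokenization: lowercase, split on whitespace, drop unknown words.
    tokens1 = [w for w in sent1.lower().split() if w in vectors]
    tokens2 = [w for w in sent2.lower().split() if w in vectors]
    pairs = sorted(
        ((cosine_similarity(vectors[t1], vectors[t2]), i, j)
         for (i, t1), (j, t2) in product(enumerate(tokens1), enumerate(tokens2))),
        reverse=True,
    )
    used1, used2 = set(), set()
    intersection = 0.0
    for sim, i, j in pairs:
        # Greedy one-to-one matching, most similar pair first.
        if i not in used1 and j not in used2:
            intersection += sim
            used1.add(i)
            used2.add(j)
    union = len(tokens1) + len(tokens2) - intersection
    return intersection / union if union > 0 else 1.0


print(embedding_jaccard_sketch('doctor', 'physician', toy_vectors))  # about 0.667
```

With these toy vectors, the cosine similarity of 'doctor' and 'physician' is 0.8, so the score is 0.8 / (2 - 0.8). Two identical words would score 1, and two words with orthogonal vectors would score 0.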
Most new functions can be found in this tutorial.
Some minor bugs have also been fixed.
- PyPI: shorttext. [PyPI]
- Homepage of shorttext. [RTFD]
- Tutorial of Metrics in Shorttext. [RTFD]
- “Short Text Categorization using Deep Neural Networks and Word-Embedding Models,” Everything About Data Analytics, WordPress (2016). [WordPress]
- “Python Package for Short Text Mining,” Everything About Data Analytics, WordPress (2016). [WordPress]