/ˈkärmīn/
QRMine is a suite of qualitative research (QR) data mining tools in Python using Natural Language Processing (NLP) and Machine Learning (ML). This workbook demonstrates how to use QRMine.
QRMine is a theory-building tool. A publication describing the theoretical foundations of this tool is in progress. QRMine is inspired by this work and the associated paper. The GitHub repo is here. Read my blog post here.
QRMine is available on PyPI and can be installed with pip. The spaCy language model is a dependency and has to be installed separately.
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz#egg=en_core_web_sm
pip install qrmine
from qrmine import Content
from qrmine import Network
from qrmine import Qrmine
from qrmine import ReadData
from qrmine import Sentiment
from qrmine import MLQRMine
from qrmine import __version__
print(__version__)
Individual documents or interview transcripts can be supplied in a single text file or in multiple text files. Interviews or sections are separated with `<break>` tags as shown below:
Transcript of the first interview with John.
Any number of lines
<break>First_Interview_John</break>
Text of the second interview with Jane.
More text.
<break>Second_Interview_Jane</break>
## Read the text file(s)
data = ReadData()
inp_file = ["transcript.txt"]
data.read_file(inp_file)
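Several transcript files can be passed in the same list; the file names in the commented line below are placeholders. After reading, the parsed titles (taken from the `<break>` tags) and the matching documents are available as attributes, which the filtering example further down also uses:

# Reading several transcripts at once is just a longer list of files, e.g.
# data.read_file(["interview1.txt", "interview2.txt"])  # placeholder names

# Titles come from the <break>Title</break> tags; documents hold the matching text
print(data.titles)
print(len(data.documents))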
import textacy
q = Qrmine()
all_interviews = Content(data.content)
doc = textacy.make_spacy_doc(all_interviews.doc)
q.print_categories(doc, 10) # 10 categories
s = Sentiment()
s.sentiment_analyzer_scores(doc.text)
print(s.sentiment()) # neutral
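Sentence-level sentiment (exposed on the command line through the --sentence flag) can be approximated in code by scoring each sentence of the spaCy doc separately. This is only a sketch: it assumes that sentiment() reports the result of the most recent call to sentiment_analyzer_scores().

# Sketch: per-sentence sentiment (assumes sentiment() reflects the last scored text)
for sent in doc.sents:
    s.sentiment_analyzer_scores(sent.text)
    print(sent.text.strip(), "->", s.sentiment())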
You can also filter the analysis to a specific document (title). Example below:
## Categories of document P3
# Find the document whose title is "P3"
ct = 0
for title in data.titles:
    if title == "P3":
        content = data.documents[ct]
    ct += 1

interview = Content(content)
doc = textacy.make_spacy_doc(interview.doc)
q.print_categories(doc, 3)  # 3 categories
# Build the topic model and coding dictionary from the entire corpus
q.content = data
q.process_content()
q.print_topics()
q.print_dict(all_interviews, 5)
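The sentiment analyzer shown earlier works on a single document in the same way; a short sketch reusing the P3 doc built in the filtering example above:

# Sentiment of the single document (P3) selected above
s_p3 = Sentiment()
s_p3.sentiment_analyzer_scores(doc.text)
print(s_p3.sentiment())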
For the machine learning functions, a single csv file with the following generic structure is needed.
index, obesity, bmi, exercise, income, bp, fbs, has_diabetes
1, 0, 29, 1, 12, 120, 89, 1
2, 1, 32, 0, 9, 140, 92, 0
......
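As the command-line notes further down explain, the first column is treated as an index and the last column as the dependent variable. A quick sanity check of the file layout, assuming pandas is available, might look like this:

import pandas as pd

# Inspect the csv before handing it to MLQRMine:
# first column = index, last column = dependent variable
df = pd.read_csv("diabetes.csv")
print(df.columns.tolist())
print(df.head())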
ml = MLQRMine()
ml.csvfile = "diabetes.csv"
ml.epochs = 3
ml.prepare_data(True) # Oversample
ml.get_nnet_predictions()
print("\n%s: %.2f%%" % (ml.model.metrics_names[1], ml.get_nnet_scores()[1] * 100))
print(ml.svm_confusion_matrix())
# Nearest-neighbour search: the 4 neighbours of record 3
ml.prepare_data()
n = 4
r = 3
knn = ml.knn_search(n, r)
for rec in knn:
    print("Records: ", rec + 1)

# KMeans clusters and principal components
print(ml.get_kmeans(3))
print(ml.get_pca(3))
This is just a quick demo of functions. The command line interface supports more advanced filtering and other analysis methods.
QRMine is a work in progress.
Input files are transcripts as txt files and a single csv file with numeric data. The output txt file can be specified.
The coding dictionary, topics and topic assignments can be created from the entire corpus (all documents) using the respective command line options.
Categories (concepts), summary and sentiment can be viewed for the entire corpus or for specific titles (documents) specified using the --titles switch. Sentence-level sentiment output is possible with the --sentence flag.
You can filter documents based on sentiment, titles or categories and do further analysis using --filters or -f.
Many of the ML functions, such as the neural network, take a second argument (-n). For the neural network, -n signifies the number of epochs; it sets the number of clusters in kmeans, the number of factors in pca, and the number of neighbours in KNN. KNN also takes the --rec or -r argument to specify the record.
Variables from the csv can be selected using --titles (defaults to all). The first variable will be ignored (index) and the last will be treated as the DV (dependent variable).
python -m qrmine --help
Command | Alternate | Description |
---|---|---|
--inp | -i | Input file in the text format with `<break>` tags |
--out | -o | Output file name |
--csv | | csv file name |
--num | -n | N (clusters/epochs etc. depending on context) |
--rec | -r | Record (based on context) |
--titles | -t | Document(s) title(s) to analyze/compare |
--codedict | | Generate coding dictionary |
--topics | | Generate topic model |
--assign | | Assign documents to topics |
--cat | | List categories of entire corpus or individual docs |
--summary | | Generate summary for entire corpus or individual docs |
--sentiment | | Generate sentiment score for entire corpus or individual docs |
--nlp | | Generate all NLP reports |
--sentence | | Generate sentence level scores when applicable |
--nnet | | Display accuracy of a neural network model (-n epochs, default 3) |
--svm | | Display confusion matrix from an svm classifier |
--knn | | Display nearest neighbours (-n neighbours, default 3) |
--kmeans | | Display KMeans clusters (-n clusters, default 3) |
--cart | | Display association rules |
--pca | | Display PCA (-n factors, default 3) |
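A few illustrative invocations using the options above (file names are placeholders, and which flags can be combined in a single run depends on the CLI implementation):

# Coding dictionary for the whole corpus
python -m qrmine -i transcript.txt --codedict

# Sentence-level sentiment for the document titled P3
python -m qrmine -i transcript.txt --sentiment --sentence --titles P3

# Neural network accuracy with 5 epochs
python -m qrmine --csv diabetes.csv --nnet -n 5

# The 4 nearest neighbours of record 3
python -m qrmine --csv diabetes.csv --knn -n 4 -r 3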
Please cite QRMine in your publications if it helped your research. Here is an example BibTeX entry:
@misc{eapenbr2019qrmine,
  title={QRMine - Qualitative Research Tools in Python},
  author={Eapen, Bell Raj and contributors},
  year={2019},
  publisher={GitHub},
  journal={GitHub repository},
  howpublished={\url{https://github.com/dermatologist/qrmine}}
}