QRMine

/ˈkärmīn/

Introduction

QRMine is a suite of qualitative research (QR) data mining tools in Python using Natural Language Processing (NLP) and Machine Learning (ML). This workbook demonstrates how to use QRMine.

QRMine is a theory building tool. Publication with the theoretical foundations of this tool is being worked on. QRMine is inspired by this work and the associated paper. The GitHub repo is here. Read my blog post here.

Installation

QRMine is available in pypi and can be installed with pip. The spacy model is a dependency and has to be installed separately.

pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz#egg=en_core_web_sm

pip install qrmine
In [1]:
from qrmine import Content
from qrmine import Network
from qrmine import Qrmine
from qrmine import ReadData
from qrmine import Sentiment
from qrmine import MLQRMine
from qrmine import __version__
print(__version__)
Using TensorFlow backend.
2.0.0

Natural Language Processing with QRMine

Individual documents or interview transcripts in single or multiple text files can be used. Interviews or sections are separated as below:

Transcript of the first interview with John.
Any number of lines
<break>First_Interview_John</break>

Text of the second interview with Jane.
More text.
<break>Second_Interview_Jane</break>
In [2]:
## Read the text file(s)
data = ReadData()
inp_file = ["transcript.txt"]
data.read_file(inp_file)

Find recuring concepts in the entire corpus (entire text)

In [5]:
import textacy

q = Qrmine()
all_interviews = Content(data.content)
doc = textacy.make_spacy_doc(all_interviews.doc)
q.print_categories(doc, 10) # 10 categories
---Categories with count---
| CATEGORY  | WEIGHT                |
| know      | 0.01962494548626254   |
| think     | 0.0056694286960314    |
| go        | 0.0056694286960314    |
| thing     | 0.004797208896641954  |
| get       | 0.00436109899694723   |
| time      | 0.00436109899694723   |
| hospital  | 0.0039249890972525075 |
| basically | 0.0030527692978630614 |
| like      | 0.0030527692978630614 |
| people    | 0.0030527692978630614 |
---------------------------

Out[5]:
['know',
 'think',
 'go',
 'thing',
 'get',
 'time',
 'hospital',
 'basically',
 'like',
 'people']

Sentiment

In [8]:
s = Sentiment()
s.sentiment_analyzer_scores(doc.text)
print(s.sentiment()) # neutral
neu

Filters

You can also filter according to a topic. Example below

In [9]:
## Categories of document P3
ct=0
for title in data.titles:
    if title == "P3":
            content = data.documents[ct]
ct+=1
interview = Content(content)
doc = textacy.make_spacy_doc(interview.doc)
q.print_categories(doc, 3)
---Categories with count---
| CATEGORY  | WEIGHT               |
| know      | 0.012422360248447204 |
| basically | 0.009316770186335404 |
| period    | 0.009316770186335404 |
---------------------------

Out[9]:
['know', 'basically', 'period']

Generate topics and asign documents to the topics

In [10]:
q.content = data
q.process_content()
q.print_topics()
---Topics---
| TOPIC   | DESCRIPTION                                                                             |
| TOPIC:1 | number   t   fair   read   first   will   know -PRON- s   like   ve   go                |
| TOPIC:2 | m   go   time   think   try   say   t   take   like   first                             |
| TOPIC:3 | ve   ve get   basically   come   work   time   hospital   think   get   try             |
| TOPIC:4 | m   read   hospital   one   ve   t know   don t know   get   go   thing                 |
| TOPIC:5 | think   thing   people   life   ve   look   support   don t   don   t                   |
| TOPIC:6 | bit   hospital   t   mean   come   don t know   t know   support   know -PRON- s   look |
| TOPIC:7 | t   number   say   basically   get   ve get   like   hospital   take   don              |
---------------------------

Generate coding dictionary with categories, properties and dimensions

In [13]:
q.print_dict(all_interviews, 5)
---Coding Dictionary---
| CATEGORY | PROPERTY | DIMENSION  |
| know     | somebody | know       |
| ...      | ...      | correctly  |
| ...      | ...      | aware      |
| ...      | support  | know       |
| ...      | ...      | correctly  |
| ...      | ...      | aware      |
| ...      | thing    | know       |
| ...      | ...      | let        |
| ...      | ...      | mean       |
| think    | somebody | know       |
| ...      | ...      | correctly  |
| ...      | ...      | aware      |
| ...      | life     | know       |
| ...      | ...      | different  |
| ...      | ...      | lifestyle  |
| ...      | support  | know       |
| ...      | ...      | correctly  |
| ...      | ...      | aware      |
| go       | time     | know       |
| ...      | ...      | go         |
| ...      | ...      | get        |
| ...      | thing    | know       |
| ...      | ...      | let        |
| ...      | ...      | mean       |
| ...      | morning  | choose     |
| ...      | ...      | know       |
| ...      | ...      | feel       |
| get      | work     | think      |
| ...      | ...      | get        |
| ...      | ...      | know       |
| ...      | son      | get        |
| ...      | ...      | suppose    |
| ...      | ...      | hit        |
| ...      | time     | know       |
| ...      | ...      | go         |
| ...      | ...      | get        |
| say      | coma     | say        |
| ...      | ...      | puzzle     |
| ...      | ...      | apparently |
| ...      | lad      | say        |
| ...      | ...      | puzzle     |
| ...      | ...      | apparently |
| ...      | matter   | say        |
---------------------------

Machine learning with QRMine

A single csv file with the following generic structure is needed.

  • Column 1 with identifier. If it is related to a text document as above, include the title.
  • Last column has the dependent variable (DV). (NLP algorithms like the topic asignments may provide the DV)
  • All independent variables (numerical) in between.
index, obesity, bmi, exercise, income, bp, fbs, has_diabetes
1, 0, 29, 1, 12, 120, 89, 1
2, 1, 32, 0, 9, 140, 92, 0
......

NN Classifier accuracy

In [16]:
ml = MLQRMine()
ml.csvfile = "diabetes.csv"
ml.epochs = 3
ml.prepare_data(True)  # Oversample
ml.get_nnet_predictions()
print("\n%s: %.2f%%" % (ml.model.metrics_names[1], ml.get_nnet_scores()[1] * 100))
Epoch 1/3
 - 1s - loss: 0.6910 - acc: 0.5690
Epoch 2/3
 - 0s - loss: 0.6752 - acc: 0.5460
Epoch 3/3
 - 0s - loss: 0.6605 - acc: 0.6090
1000/1000 [==============================] - 0s 203us/step

acc: 63.80%

Confusion matrix from svm

In [17]:
print(ml.svm_confusion_matrix())
[[100  26]
 [ 24 100]]

4 nearest neighbours for record # 3

In [20]:
    ml.prepare_data()
    n=4
    r=3
    knn = ml.knn_search(n, r)
    for n in knn:
        print("Records: ", n + 1)
Records:  [  3 318 328 676]

kmeans clusters

In [ ]:
print(ml.get_kmeans(3))

PCA

In [22]:
print(ml.get_pca(3))
Covariance matrix: 
[[ 1.00130378  0.03769013  0.08664835  0.02683737  0.03520991  0.02357081
  -0.03593079  0.01043349 -0.03024341  0.01216529 -0.01034635]
 [ 0.03769013  1.00130378  0.01836833  0.03228842 -0.03936374  0.04370907
  -0.01661483 -0.03170046 -0.04604789 -0.03549143  0.02067612]
 [ 0.08664835  0.01836833  1.00130378 -0.05111871  0.08751788 -0.01115744
   0.00836761  0.05208627  0.02005672 -0.01940849 -0.03553975]
 [ 0.02683737  0.03228842 -0.05111871  1.00130378  0.12962746  0.14146618
  -0.08177826 -0.07363049  0.01770615 -0.03356638  0.54505093]
 [ 0.03520991 -0.03936374  0.08751788  0.12962746  1.00130378  0.15278853
   0.05740263  0.33178913  0.2213593   0.13751636  0.26385788]
 [ 0.02357081  0.04370907 -0.01115744  0.14146618  0.15278853  1.00130378
   0.2076409   0.08904933  0.2821727   0.04131875  0.23984024]
 [-0.03593079 -0.01661483  0.00836761 -0.08177826  0.05740263  0.2076409
   1.00130378  0.43735204  0.39308503  0.18416737 -0.11411885]
 [ 0.01043349 -0.03170046  0.05208627 -0.07363049  0.33178913  0.08904933
   0.43735204  1.00130378  0.19811702  0.18531222 -0.04221793]
 [-0.03024341 -0.04604789  0.02005672  0.01770615  0.2213593   0.2821727
   0.39308503  0.19811702  1.00130378  0.14083033  0.03628912]
 [ 0.01216529 -0.03549143 -0.01940849 -0.03356638  0.13751636  0.04131875
   0.18416737  0.18531222  0.14083033  1.00130378  0.03360507]
 [-0.01034635  0.02067612 -0.03553975  0.54505093  0.26385788  0.23984024
  -0.11411885 -0.04221793  0.03628912  0.03360507  1.00130378]]
Eigenvectors 
[[ 3.74564863e-03  3.71549975e-02  7.19530442e-02 -7.56216595e-03
  -5.90981128e-01  2.00818234e-01 -2.32372517e-01  7.06229120e-01
   2.04761003e-01 -9.00985932e-02 -3.13530034e-02]
 [-3.90518249e-02  7.46800709e-02  6.58294676e-03 -7.92636278e-03
  -1.36964305e-01  6.77275424e-01 -4.97708584e-01 -4.89277012e-01
   6.01251320e-04 -1.39620708e-01  9.34362148e-02]
 [ 4.32842816e-02 -6.73836648e-02 -8.10155587e-03 -1.69695960e-02
  -6.52279300e-01  7.17681724e-02  4.71031729e-01 -2.75283777e-01
  -4.57255843e-01  7.37628630e-02 -2.24582817e-01]
 [ 1.20084532e-01  5.94752998e-01 -5.82081936e-01  1.30896229e-01
   5.02260592e-02 -2.40794039e-02 -3.73900345e-02  1.61631164e-02
   1.99381066e-02 -1.69014064e-01 -4.92108074e-01]
 [ 3.94247606e-01  1.72014360e-01 -5.02693162e-02  4.54188207e-01
  -3.09741547e-01 -3.23282751e-01 -2.95791383e-04 -2.36135858e-01
   2.53088656e-01 -1.34856867e-01  5.17410781e-01]
 [ 3.54833013e-01  1.95590504e-01 -1.97884667e-01 -5.31853920e-03
   1.56047383e-01  4.26640302e-01  2.14087169e-01  2.14290087e-01
  -1.16031457e-01  6.15864309e-01  3.25227342e-01]
 [ 4.41163135e-01 -3.18824025e-01  2.98781191e-01  5.57643482e-01
   1.70741383e-01  2.00412879e-01  7.06915670e-03  3.82196403e-02
   7.15530649e-02  1.52933459e-02 -4.76767756e-01]
 [ 4.38238444e-01 -2.43166530e-01 -1.47425892e-01 -5.43824510e-01
  -1.58093396e-01 -1.65201892e-01 -1.47344048e-01 -2.17504378e-01
   4.34787016e-01  2.58749206e-01 -2.32471726e-01]
 [ 4.52347864e-01 -9.35219195e-02 -3.67096173e-02 -3.41401805e-01
   1.70656008e-01  1.93057146e-01  2.54951011e-01  1.41303411e-01
  -1.59140481e-01 -6.80434510e-01  1.70453071e-01]
 [ 2.71010038e-01 -1.14999497e-01 -8.80322182e-02 -7.51438836e-03
  -4.68175864e-03 -3.07432390e-01 -5.84370841e-01  1.11400053e-01
  -6.69692161e-01  8.57425669e-02  3.53173307e-02]
 [ 1.89964688e-01  6.19819463e-01  7.02927594e-01 -2.29825964e-01
   2.02237198e-02 -9.72786801e-02 -3.06158730e-02 -5.45905213e-02
  -5.56964604e-02  5.81241535e-02 -1.11981982e-01]]

Eigenvalues 
[2.10083614 1.74251056 0.41735684 0.404731   1.13797948 1.06243172
 0.95915773 0.92892017 0.83682495 0.67353103 0.75006197]
Eigenvalues in descending order:
2.1008361418121164
1.7425105649672612
1.1379794763759048
1.0624317240337113
0.9591577303679275
0.9289201663183302
0.8368249462229675
0.7500619722730967
0.673531030312319
0.41735684251240424
0.40473099541673546
Variance explained:  [19.073642528052716, 15.820378827295094, 10.331797566054915, 9.645894085391298, 8.708262064301616, 8.433733044107, 7.59759391279612, 6.809866628000299, 6.115036698029696, 3.789212810215603, 3.6745818357556366]
Cumulative:  [ 19.07364253  34.89402136  45.22581892  54.87171301  63.57997507
  72.01370812  79.61130203  86.42116866  92.53620535  96.32541816
 100.        ]
Matrix W:
 [[ 0.00374565  0.037155   -0.59098113]
 [-0.03905182  0.07468007 -0.1369643 ]
 [ 0.04328428 -0.06738366 -0.6522793 ]
 [ 0.12008453  0.594753    0.05022606]
 [ 0.39424761  0.17201436 -0.30974155]
 [ 0.35483301  0.1955905   0.15604738]
 [ 0.44116313 -0.31882403  0.17074138]
 [ 0.43823844 -0.24316653 -0.1580934 ]
 [ 0.45234786 -0.09352192  0.17065601]
 [ 0.27101004 -0.1149995  -0.00468176]
 [ 0.18996469  0.61981946  0.02022372]]
None

This is just a quick demo of functions. The command line interface supports more advanced filtering and other analysis methods. The GitHub repo is here. Read my blog post here. Publication with the theoretical foundations of this tool is being worked on. QRMine is inspired by this work and the associated paper.

What to expect in the future

QRMine is work in progress

NLP

  • [x] Lists common categories for open coding.
  • [x] Create a coding dictionary with categories, properties and dimensions.
  • [x] Topic modelling.
  • [x] Arrange docs according to topics.
  • [x] Compare two documents/interviews.
  • [x] Select documents/interviews by sentiment, category or title for further analysis.
  • [x] Sentiment analysis
  • [ ] Network analysis
  • [ ] Co-citation finder

ML

  • [x] Accuracy of a neural network model trained using the data
  • [x] Confusion matrix from an support vector machine classifier
  • [x] K nearest neighbours of a given record
  • [x] K-Means clustering
  • [x] Principal Component Analysis (PCA)
  • [ ] Association rules

Command-line Use

  • input files are transcripts as txt files and a single csv file with numeric data. The output txt file can be specified.

  • The coding dictionary, topics and topic assignments can be created from the entire corpus (all documents) using the respective command line options.

  • Categories (concepts), summary and sentiment can be viewed for entire corpus or specific titles (documents) specified using the --titles switch. Sentence level sentiment output is possible with the --sentence flag.

  • You can filter documents based on sentiment, titles or categories and do further analysis, using --filters or -f

  • Many of the ML functions like neural network takes a second argument (-n) . In nnet -n signifies the number of epochs, number of clusters in kmeans, number of factors in pca, and number of neighbours in KNN. KNN also takes the --rec or -r argument to specify the record.

  • Variables from csv can be selected using --titles (defaults to all). The first variable will be ignored (index) and the last will be the DV (dependant variable).

Command-line options

pythom -m qrmine --help
Command Alternate Description
--inp -i Input file in the text format with Topic
--out -o Output file name
--csv csv file name
--num -n N (clusters/epochs etc depending on context)
--rec -r Record (based on context)
--titles -t Document(s) title(s) to analyze/compare
--codedict Generate coding dictionary
--topics Generate topic model
--assign Assign documents to topics
--cat List categories of entire corpus or individual docs
--summary Generate summary for entire corpus or individual docs
--sentiment Generate sentiment score for entire corpus or individual docs
--nlp Generate all NLP reports
--sentence Generate sentence level scores when applicable
--nnet Display accuracy of a neural network model -n epochs(3)
--svm Display confusion matrix from an svm classifier
--knn Display nearest neighbours -n neighbours (3)
--kmeans Display KMeans clusters -n clusters (3)
--cart Display Association Rules
--pca Display PCA -n factors (3)

Author

Citation

Please cite QRMine in your publications if it helped your research. Here is an example BibTeX entry:


@misc{eapenbr2019qrmine,
  title={QRMine -Qualitative Research Tools in Python.},
  author={Eapen, Bell Raj and contributors},
  year={2019},
  publisher={GitHub},
  journal = {GitHub repository},
  howpublished={\url{https://github.com/dermatologist/qrmine}}
}
In [ ]: