NLP: How to build Westworld's Rehoboam to fight Covid-19 (Part 1)
Learning how to build a question/answer system for clinical research subjects to help speed up knowledge sharing
After finishing the Westworld season premiere (spoilers), and with the world in a global pandemic, one cannot escape the sensation of doom and gloom. It can be difficult to decide which fate is worse: mass extinction from robot overlords or mass extinction from a virus. I am going to give a hard pass to both. Maybe Rehoboam from Westworld is the solution we need in this pandemic. The concept of having all the data and predicting human interaction per individual seems unrealistic, but what could we do with current technologies? Being stuck inside during a pandemic can drive you a little crazy, so I thought this was a great time to learn new technologies and build something useful beyond Covid-19.
Recently my grandmother was in the hospital. Due to the pandemic, my family was not able to be with her. Thankfully she is out of the hospital and is now recovering. She had a hard time understanding what was happening because of how slowly scientific information was progressing. Let us create a question/answer brain for Covid-19 that can eventually assist in the treatment search. To that end, let us divide the road from start to finish into two segments: Research Knowledge (Question/Answer) and Treatment Discovery.
I could not be there for my grandmother, and thought it would be helpful to create an expert on the current virus so she could ask questions. Once we have created the Covid-19 brain we can look further into treatment discovery utilizing that indexed knowledge.
As a constant innovator, I wanted the project to utilize current best practices and to be a good model for future use. I am finishing a contract architecting the next generation of a product that utilizes NLP (Natural Language Processing), and thought it only natural to see what could be done in this space. The brain should be based on clinical research papers for validity, and should use some important feature attributes to weigh which answers may be more applicable. What if we mapped out a process for doing this for all clinical research topics, one with a pipeline for adding new data for continual learning?
A special thanks to David Mezzetti at https://neuml.com, for his hard work on this subject, and for being an inspiration to us all.
The question is: if the end goal is to build a question/answer system, how should it be built? I'll save you the research and just give it to you.
The first step is to get a dataset and do some exploratory analysis to determine what needs to be done to it. The Covid-19 dataset we used can be found on Semantic Scholar.
In this article we are going to break down the ETL process further to gather some important information about the studies in the dataset. This will help give weight to the information in the articles moving forward. For example, we want specific types of studies, we prefer studies with large sample sizes, and newer articles should take preference.
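One way to picture that weighting is a small scoring function. This is only a sketch under stated assumptions: the study type names, their weights, the log scaling of sample size, and the one-year half-life for recency are all illustrative choices, not values taken from the project.

```python
import math
from dataclasses import dataclass
from datetime import date

# Hypothetical preference per study type; the numbers are illustrative only.
STUDY_TYPE_WEIGHT = {
    "randomized controlled trial": 1.0,
    "cohort study": 0.7,
    "case report": 0.3,
}


@dataclass
class Study:
    title: str
    study_type: str
    sample_size: int
    published: date


def study_weight(study: Study, today: date) -> float:
    """Combine study type, sample size, and recency into a score in [0, 1]."""
    type_score = STUDY_TYPE_WEIGHT.get(study.study_type, 0.1)
    # Log scaling: 10,000 participants should not count 100x more than 100.
    size_score = min(math.log10(max(study.sample_size, 1)) / 4.0, 1.0)
    # Recency halves for every year of age (an assumed half-life).
    age_years = max((today - study.published).days, 0) / 365.25
    recency_score = 0.5 ** age_years
    return type_score * (0.5 * size_score + 0.5 * recency_score)


today = date(2020, 4, 1)
large_recent_trial = Study("A", "randomized controlled trial", 1000, date(2020, 3, 1))
small_old_report = Study("B", "case report", 5, date(2015, 1, 1))
print(study_weight(large_recent_trial, today) > study_weight(small_old_report, today))
```

A large, recent trial scores higher than a small, old case report, which is the ordering the paragraph above asks for. The three attributes feeding this function are exactly the ones that have to be extracted from the papers.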
For ease of use and data exploration, I have made a copy of the indexed corpus available for direct download in SQLite format.
What information might we want to gather from these research documents that is not in the metadata?
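A quick way to start exploring a SQLite file is to list its tables and columns before writing any queries. The sketch below makes no assumptions about the corpus schema; the `articles` table in the demo is made up so the example runs on its own, and the file path in the comment is a placeholder.

```python
import sqlite3
from typing import Dict, List


def describe_database(conn: sqlite3.Connection) -> Dict[str, List[str]]:
    """Return a mapping of table name to its column names."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    schema = {}
    for table in tables:
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [column[1] for column in columns]
    return schema


# Demo on an in-memory database with a made-up table. For the real corpus,
# connect to the downloaded file instead, e.g. sqlite3.connect("<path to file>").
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id TEXT, title TEXT, published TEXT)")
schema = describe_database(conn)
print(schema)
```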
1. Study type
2. Sample size
3. Sampling method
We will not be looking for authors and publication date in this version, as they are commonly available in the metadata. In the future it would be great to improve this even further to handle all relevant attributes for clinical studies. This would make it useful in other aspects of clinical research processing.
Study type may need more manual classification and processing as well. This appears to be a document-embedding, similarity-based machine learning problem that will require vectorizing each of the documents into a labeled dataset. This will be looked at further in our next review.
When we focus on extracting the sample size and methodology, we can break down the problem. For our purposes, we first need to locate only the sentences that contain the data we need, then do further NLP parsing to extract the specific information that is required.
For this problem specifically, it makes sense to use a logistic regression classifier with sentence embeddings, trained on a labelled dataset of similar sentences. We may need to add more features for contextual information to increase accuracy. This is the link to the Google Colab and GitHub repo.
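To make the idea concrete, here is a minimal, dependency-free sketch of a sentence classifier for the "sample" category. It is not the Colab code: bag-of-words counts stand in for real sentence embeddings, the logistic regression is written by hand with batch gradient descent rather than taken from a library, and the eight training sentences are invented for illustration.

```python
import math
from collections import Counter


def featurize(sentence, vocab):
    """Bag-of-words count vector; a stand-in for a real sentence embedding."""
    counts = Counter(sentence.lower().split())
    return [float(counts[word]) for word in vocab]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Batch gradient descent on the log loss; returns (weights, bias)."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = sum(w * x for w, x in zip(weights, xi)) + bias
            error = sigmoid(z) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += error * xj
            grad_b += error
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias


def predict_proba(sentence, vocab, weights, bias):
    """Probability that the sentence reports a sample size."""
    x = featurize(sentence, vocab)
    return sigmoid(sum(w * xj for w, xj in zip(weights, x)) + bias)


# Invented training data: label 1 = sentence mentions the sample, 0 = other.
labelled = [
    ("a total of 120 patients were enrolled", 1),
    ("we enrolled 45 participants from two hospitals", 1),
    ("the cohort included 300 patients", 1),
    ("sixty participants were recruited", 1),
    ("coronaviruses are enveloped rna viruses", 0),
    ("the virus spreads through respiratory droplets", 0),
    ("further research is needed to confirm these findings", 0),
    ("common symptoms include fever and cough", 0),
]
vocab = sorted({word for sentence, _ in labelled for word in sentence.split()})
X = [featurize(sentence, vocab) for sentence, _ in labelled]
y = [label for _, label in labelled]
weights, bias = train_logistic_regression(X, y)

print(predict_proba("we recruited 80 patients", vocab, weights, bias))
print(predict_proba("the virus causes fever and cough", vocab, weights, bias))
```

Training one such binary model per category (other, method, sample) would mirror the per-category results reported next. With real embeddings in place of word counts, the classifier can generalize to sentences that share meaning rather than exact words.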
Plain vanilla vector prediction results per category:
… Processing other
Test accuracy is 0.9393939393939394
… Processing method
Test accuracy is 0.8515151515151516
… Processing sample
Test accuracy is 0.8333333333333334
These results are not bad for introductory training. However, a close review shows that although training produced high test accuracy, the model does not match that accuracy when we apply it directly to the original data, which is the real-world case.
Without getting into the details from the Colab, after some experimenting we were able to get decent classification accuracy. Now that we have narrowed the problem down to the sentence level, we need to parse those sentences further to extract the exact information needed.
To do further parsing, we use spaCy tokenization with a keyword vocabulary to find the relevant tokens, then use each token's child dependencies to find the numeric string surrounding it, which gives the exact sample size to return. The word2number library is then used to translate number strings into integers that can be passed back. The library matches number words and works great per token, and it also works well on multi-token numbers, so a future enhancement may be to combine adjacent number tokens before converting so as not to lose multi-token numbers.
As a whole this approach is rather simplistic; however, for an MVP implementation it is starting down the right path. In the Colab we look at the results to determine realistic performance, then identify some of the corner cases and possible paths for improving them.
Now that we have the basic structure for clinical study attribute extraction, we can compartmentalize this into a larger ETL flow. Over time, we can improve the accuracy of the attribute extraction through the suggestions noted in the Colab. This article is part of a series on building a Covid-19 Rehoboam. Next, we will review creating the sentence embedding index and making sure Rehoboam weighs the articles properly when asked questions.
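The extraction step, including the "combine numbers first" enhancement, can be sketched without any external libraries. This is a simplified stand-in, not the Colab implementation: a regex tokenizer and token distance replace spaCy's dependency parse, a small hand-written converter replaces word2number, and the keyword vocabulary is an illustrative guess.

```python
import re

# Illustrative keyword vocabulary; a real one would be tuned on the corpus.
SAMPLE_KEYWORDS = {"patients", "participants", "subjects", "cases", "samples"}

SMALL = {word: i for i, word in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
SMALL.update({word: 10 * (i + 2) for i, word in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split())})
NUMBER_WORDS = set(SMALL) | {"hundred", "thousand"}


def words_to_int(words):
    """Convert a run of number words, e.g. ['two', 'hundred', 'five'], to 205."""
    total, current = 0, 0
    for word in words:
        if word in SMALL:
            current += SMALL[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        elif word == "thousand":
            total += max(current, 1) * 1000
            current = 0
    return total + current


def merge_numbers(tokens):
    """Replace digit tokens and runs of adjacent number words with integers."""
    merged, run = [], []
    for token in tokens + [""]:  # the empty sentinel flushes a trailing run
        if token in NUMBER_WORDS:
            run.append(token)
            continue
        if run:
            merged.append(words_to_int(run))
            run = []
        if token[:1].isdigit():
            merged.append(int(token.replace(",", "")))
        elif token:
            merged.append(token)
    return merged


def extract_sample_size(sentence):
    """Return the integer closest to a sample keyword, or None if there is none."""
    items = merge_numbers(re.findall(r"[a-z]+|\d[\d,]*", sentence.lower()))
    keywords = [i for i, item in enumerate(items) if item in SAMPLE_KEYWORDS]
    numbers = [i for i, item in enumerate(items) if isinstance(item, int)]
    if not keywords or not numbers:
        return None
    best = min(numbers, key=lambda n: min(abs(n - k) for k in keywords))
    return items[best]


print(extract_sample_size("We recruited two hundred five participants in 2020."))
```

Merging the number words before looking for the nearest keyword is what keeps "two hundred five" from being read as three separate numbers. Token distance is a cruder signal than a dependency parse, so this sketch will be fooled by sentences where an unrelated number sits closer to the keyword.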