Machine Learning in Practice

NLP Against Covid-19 — Building Word Vectors for the Question/Answer Bot like Westworld Part 2

Turning words into vectors that machine learning can use to build sentence vectors for sentence similarity

Matt Davis
5 min read · May 27, 2020

Welcome back to the series on building a question/answer bot to act as a central brain for Covid-19 clinical research. In this series we are building and training a question/answer bot that prioritizes the most relevant answers, giving us a world expert in Covid-19 research to ask questions of. In Part 1 we designed, at a high level, the data we need to extract through the ETL process so that questions can be answered more accurately. We will expand this into a full clinical trial ETL framework later, as we look at other conditions to apply the process to.

The Colab for this demo is available here, and the repo is here.

With the data in an accessible format and the previous ETL process complete, we should have a database holding all of the pertinent information we need to make the answers more accurate. The next step is to build the word vectors file that will be used for training sentence similarity later. First, let's look at what Word2Vec is and when to use it. A more extensive review can be found here: Word 2 Vector Review
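To make the idea concrete, here is a minimal, hypothetical sketch using gensim (a library we do not otherwise use in this series) that trains Word2Vec on a toy corpus and asks for a word's nearest neighbors:

# Minimal Word2Vec illustration using gensim (illustrative only, not part of this project's pipeline)
from gensim.models import Word2Vec

# A tiny, made-up corpus of pre-tokenized sentences
sentences = [
    ["covid", "is", "caused", "by", "a", "coronavirus"],
    ["the", "coronavirus", "spreads", "through", "droplets"],
    ["vaccines", "reduce", "severe", "covid", "outcomes"],
]

# Train a small model: each word becomes a 50-dimensional vector
model = Word2Vec(sentences, vector_size=50, min_count=1, window=3)

# Words used in similar contexts end up with similar vectors
print(model.wv.most_similar("covid", topn=3))

On a corpus this small the neighbors are meaningless, but the mechanics are the same ones we apply to the research text below.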

Converting words into vectors, which deep learning algorithms can ingest and process, helps to formulate a much better understanding of natural language. Novelist EL Doctorow has expressed this idea quite poetically in his book ‘Billy Bathgate’.

It’s like numbers are language, like all the letters in the language are turned into numbers, and so it’s something that everyone understands the same way. You lose the sounds of the letters and whether they click or pop or touch the palate, or go ooh or aah, and anything that can be misread or con you with its music or the pictures it puts in your mind, all of that is gone, along with the accent, and you have a new understanding entirely, a language of numbers, and everything becomes as clear to everyone as the writing on the wall. So as I say there comes a certain time for the reading of the numbers.

We will skip the explanation of the data load, as that has been covered in previous parts.

The first thing we need to do is define some functions that support the Word2Vec process and access the data in our database. First we define a row iterator object, used to iterate through the rows returned from the database so we can collect the tokens and write them to a temporary file. The next most important piece is the tokenizer, which defines a set of stop words (so they do not skew the model) and then tokenizes the sentences. This makes it much easier to create the word vectors moving forward. Finally, we run all of that, writing the tokens out to a temporary file that we will use later for creating the vectors file.
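The full definitions live in the Colab and the repo; the sketch below is a simplified version of the idea, assuming the STOP_WORDS set shown further down and the sections table in the SQLite database, so names and details may differ slightly from the notebook:

import re
import sqlite3

def tokenize(text):
    # Lowercase, split on non-alphanumeric characters, drop stop words and single characters
    tokens = re.split(r"\W+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 1 and t.isalpha()]

class RowIterator:
    # Iterates over the Text column of the sections table, yielding token lists
    def __init__(self, dbfile):
        self.dbfile = dbfile

    def __iter__(self):
        db = sqlite3.connect(self.dbfile)
        cur = db.cursor()
        cur.execute("SELECT Text FROM sections")
        for (text,) in cur:
            yield tokenize(text)
        db.close()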

Now that we have an organized and tokenized word dictionary, we can run the vectorization. Please note that this process will take a couple of hours, and even longer if you are running it on Google Colab. The important code parts to accomplish this:

# Stop word definitions:
# English stop word list (standard stop words used by Apache Lucene)
STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into",
              "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then",
              "there", "these", "they", "this", "to", "was", "will", "with"}

The tokenization temporary file writing:

import sqlite3
import tempfile

dbfile = "data/articles.sqlite"

# Open the articles database built in the previous part
db = sqlite3.connect(dbfile)
cur = db.cursor()
cur.execute("SELECT Text FROM sections")
count = 0

# Loop to write tokens to a temporary file
with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False) as output:
    # Save file path
    tokens = output.name

    for row in RowIterator(dbfile):
        output.write(" ".join(row) + "\n")

Now, this will take some time to write all the tokens from the documents out to the temporary file. Once it finishes, we will use FastText to build the vectors for the words. We could have used pretrained models like these, but when the field is so specialized it is better to train on the specific text that will be used in the later vectorization steps.
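For comparison, this is roughly what starting from a general-purpose pretrained FastText model would look like; a sketch only, we do not run it in this project, and the download is several gigabytes:

import fasttext
import fasttext.util

# Download the pretrained English FastText model (cc.en.300.bin)
fasttext.util.download_model("en", if_exists="ignore")
pretrained = fasttext.load_model("cc.en.300.bin")

# The vocabulary comes from general web text, not Covid-19 research
print(pretrained.get_nearest_neighbors("coronavirus", k=5))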

As past projects I have worked on have mostly involved BERT, you might wonder what the differences are and which is the better word-to-vector library here. Here are a few details I discovered while researching this.

FastText is subword based (it builds vectors from character n-gram fragments), so it can handle words it has not seen before. Similar capabilities are built into BERT, but BERT provides benefits we won't really be using in this implementation, along with the complications that come with them: benefits such as fine tuning and the ability to drop into classification more easily. BERT is generally built for robust implementations and classification rather than straight sentence similarity, so for fast implementation and usage we are using the FastText package. If you know any other libraries or details that would help us here, feel free to comment below and we can improve this together.

Tokenizing all of the words is important and has to cycle through all the text, so it may take some time. With the tokens in hand, we can now compute the vectors and write the file out to the file system. This is only a Word2Vec-style file for the individual words; we then use it to create the sentence embeddings and indexing in the next stage. Here is the code for creating the vector file:

# Training the model
# Train on the tokens file using the largest dimension size
import fasttext
import fasttext.util

size = 300
mincount = 3
model = fasttext.train_unsupervised(tokens, dim=size, minCount=mincount)
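Once training finishes, you can sanity-check the subword behavior mentioned earlier: because FastText composes vectors from character n-grams, even a token that never appeared in the training text still gets a usable vector. A quick, hypothetical check:

# Both calls return 300-dimensional vectors, even though the second,
# misspelled token almost certainly never appeared in the corpus
print(model.get_word_vector("coronavirus")[:5])
print(model.get_word_vector("coronavirusses")[:5])

# Nearest neighbors within the trained vocabulary
print(model.get_nearest_neighbors("coronavirus", k=5))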

Now that the model is trained, we cycle through the words, writing the vectors to an output file and then converting that file via magnitude.

# Next we take the vectors in the model, write them out to a file and convert it
from pymagnitude import converter

# Output file path
print("Building %d dimension model" % size)
path = "data/covid-fasttext-vectors"

# Output vectors in vec/txt format
with open(path + ".txt", "w") as output:
    words = model.get_words()
    output.write("%d %d\n" % (len(words), model.get_dimension()))

    for word in words:
        # Skip end of line token
        if word != "</s>":
            vector = model.get_word_vector(word)
            data = ""
            for v in vector:
                data += " " + str(v)

            output.write(word + data + "\n")

# Build magnitude vectors database
print("Converting vectors to magnitude format")
converter.convert(path + ".txt", path + ".magnitude", subword=True)
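As a quick sanity check, and a preview of how the next part will consume this file, here is a minimal sketch of loading the converted vectors with pymagnitude and querying a word:

from pymagnitude import Magnitude

# Load the converted vectors (memory-mapped, so loading is fast)
vectors = Magnitude("data/covid-fasttext-vectors.magnitude")

print(vectors.dim)                       # 300
print(vectors.query("coronavirus")[:5])  # first few components of the word vector
print(vectors.most_similar("coronavirus", topn=5))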

Now you have a base vector file for your project. It should take a few hours to generate and can really only be used for this project. Let it run through, and next time we will look at using the result to create the sentence embeddings and integrating them with the previous ETL data to give more and more precise answers. Feel free to leave a comment if you have any questions or suggestions for improvement.
