NLP Word to Vector — What it is and when to use it?

Matt Davis
4 min read · May 25, 2020

It is important to know what the difficulties in natural language processing are in order to really understand why certain solutions exist. The ability to use deep learning or another form of machine learning to process natural language has significant benefits for automation. So what are the issues that make it so hard?

Computers interact with humans in programming languages, which are unambiguous, precise and often structured. However, natural (human) language has a lot of ambiguity, or in other words implied meaning. There are multiple words with the same meaning (synonyms), words with multiple meanings (polysemy), some of which are entirely opposite in nature (auto-antonyms), and words which behave differently when used as a noun and as a verb. These words make sense contextually in natural language, which humans can comprehend and distinguish easily, but machines can’t. That’s what makes NLP one of the most difficult and interesting tasks in AI.

There are several tasks which can be accomplished by enabling computers to “understand” human language. One live example is examining clinical trials for patient data. Here are some tasks which are presently being done using NLP:

  1. Google Chromecast — AI personal assistant
  2. Booking appointments for users
  3. Parsing information from documents and websites
  4. Understanding the meaning of sentences, documents and queries
  5. Machine translation (e.g. translating text from English to German)
  6. Question answering and carrying out tasks (e.g. scheduling appointments)
  7. Knowledge-base question answering

The biggest issue that comes up along the path from natural language to deep learning/machine learning techniques is representing the letters and words of natural language in a form the computer can use. The computer needs the data to be in some sort of numerical format in order to be able to do its work. There are an estimated 13 million words in the English language. Luckily, there are already pretrained models trained on corpora of this scale. There are many different package flavours, each with its own pros and cons. FastText is a simple open-source option, while BERT is a full-sentence framework that also gives context to words.
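
To make “pretrained” concrete, here is a minimal sketch, assuming the gensim library and its downloader module; the model name and the words queried below are just examples (fastText vectors are available through the same downloader).

```python
# Minimal sketch: load a small set of pretrained word vectors via gensim's
# downloader. "glove-wiki-gigaword-50" is one of the bundled model names;
# it is downloaded on first use.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")

print(vectors["king"].shape)                 # (50,) -- one 50-dimensional vector per word
print(vectors.most_similar("king", topn=3))  # nearest neighbours by cosine similarity
```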

Understanding that words need a numerical representation for machine learning is just the beginning. Once we have decided to give them a numerical representation, we need that representation to capture similarity and meaning along multiple different directions. This is essential to mirror synonyms, polysemy, auto-antonyms and more. It points toward a numerical representation that can hold multiple aspects of meaning at once. This is where vectors provide a good basis: they offer the flexibility of many dimensions while still giving one representation that relates directly to an exact word and exposes similarities to other words. It is worth noting that this is roughly how the human mind looks at language as well; all we are trying to do is encode in code what comes naturally to humans, lowering the error rate over time.
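
As a small sketch of how “similarity” falls out of a vector representation, the snippet below computes cosine similarity between hand-made toy vectors; the numbers are purely illustrative.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two vectors: close to 1.0 means "pointing the same way".
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "word vectors" (made up for illustration only).
king  = np.array([0.8, 0.3, 0.1])
queen = np.array([0.7, 0.4, 0.2])
apple = np.array([0.1, 0.9, 0.7])

print(cosine_similarity(king, queen))  # high: related words point in similar directions
print(cosine_similarity(king, apple))  # lower: unrelated words diverge
```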

Some of the ways words are handled together are skip-grams and the Continuous Bag of Words model (CBOW).

Skip-grams:

In the skip-gram model, we take the centre word of a window and try to predict the surrounding context words from it.

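Here is a minimal sketch of how skip-gram training pairs are generated: every word within the window of a centre word becomes a (centre, context) prediction target. The sentence and window size are placeholders.

```python
sentence = ["the", "quick", "brown", "fox", "jumps"]
window = 2

# For each centre word, pair it with every word inside the window.
pairs = []
for i, centre in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((centre, sentence[j]))

print(pairs[:4])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
```
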
Continuous Bag of Words model (CBOW)

In abstract terms, this is the opposite of skip-gram. In CBOW, we try to predict the centre word by summing the vectors of the surrounding words.
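
As a hedged sketch of how the two flavours look in practice, gensim's Word2Vec class exposes both through its sg flag (sg=0 selects CBOW, sg=1 selects skip-gram); the tiny corpus below is a placeholder.

```python
from gensim.models import Word2Vec

# A toy corpus: each sentence is a list of tokens.
corpus = [
    ["natural", "language", "processing", "is", "hard"],
    ["word", "vectors", "capture", "meaning"],
]

cbow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)      # CBOW
skipgram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)  # skip-gram

print(cbow.wv["language"][:5])      # first few dimensions of the CBOW-trained vector
print(skipgram.wv["language"][:5])  # same word, skip-gram-trained vector
```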

This was about converting words into vectors. But where does the “learning” happen? Essentially, we begin with a small random initialization of the word vectors. Our predictive model then learns the vectors by minimizing a loss function. In Word2vec, this happens with a feed-forward neural network and optimization techniques such as stochastic gradient descent. There are also count-based models, which build a co-occurrence count matrix of the words in our corpus: a large matrix with a row for each “word” and a column for each “context”. The number of “contexts” is of course large, since it is essentially combinatorial in size. To overcome this size issue, we apply SVD to the matrix, which reduces its dimensions while retaining maximum information.
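
The count-based route can be sketched in a few lines: build a word-word co-occurrence matrix over a toy corpus and apply SVD to shrink it to dense vectors. The corpus, window size and target dimensionality below are illustrative only.

```python
import numpy as np

corpus = [["i", "like", "nlp"], ["i", "like", "deep", "learning"], ["i", "enjoy", "flying"]]
vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Co-occurrence counts within a +/- 1 word window.
M = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 1), min(len(sent), i + 2)):
            if j != i:
                M[index[w], index[sent[j]]] += 1

# SVD keeps the directions that carry the most information; keep only k of them.
U, S, Vt = np.linalg.svd(M)
k = 2
word_vectors = U[:, :k] * S[:k]  # each row is now a dense 2-dimensional word vector
print(word_vectors[index["like"]])
```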

Finally, these are just a few representations of how word to vector is used and what some of the popular methods are. More modern techniques such as BERT are coming out every day, and they are built as combinations of the basic word-to-vector ideas discussed here, so it will be important to understand the basics.
