Attention

Attention mechanisms let a machine learning model relate tokens, such as words, to one another regardless of how far apart they appear in a group of data, such as a document or a collection of documents.

Attention measurements, or weights, can range from high to low: tokens judged more relevant to the token currently being processed receive higher weights.

An attention neural network layer can access all previous states and weighs them according to a learned measure of relevancy to the token currently being processed.
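As a minimal, illustrative sketch (in Python with NumPy; the function and variable names are assumptions for illustration, not taken from any particular library), the snippet below scores a set of previous states against a query vector for the current token, turns the scores into weights between zero and one with a softmax, and returns the weighted sum of the states.

```python
# A minimal sketch of attention weighting over previous states (NumPy only).
# All names and dimensions are illustrative.
import numpy as np

def attend(states: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Weigh every previous state by its relevancy to the current token.

    states: (sequence_length, d) array of previous hidden states
    query:  (d,) vector representing the token currently being processed
    """
    scores = states @ query                   # one relevancy score per state
    weights = np.exp(scores - scores.max())   # softmax: scores become weights
    weights /= weights.sum()                  # between zero and one, summing to one
    return weights @ states                   # weighted sum over all states

rng = np.random.default_rng(0)
states = rng.normal(size=(6, 4))  # six previous states of dimension 4
query = rng.normal(size=4)        # representation of the current token
print(attend(states, query))      # a single context vector of dimension 4
```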

Attention mechanisms are useful in applications and tools such as machine translation, text summarization, and other natural language processing systems.

Advantages Over Recurrent-Based Algorithms

Recurrent algorithms, such as recurrent neural networks, process tokens sequentially, carrying information forward through explicit feedback loops and memory cells.

Attention mechanism advantages over recurrent mechanisms include:

  • Scope of Token Relations - with a recurrent mechanism, one token, such as a word, can effectively be related to only a small number of nearby tokens, because information fades as it is passed along the sequence; attention mechanisms can relate an individual token to all other tokens in even a very large collection of tokens

  • Accuracy - relating every token in a grouping to every other token captures long-range context and leads to higher accuracy

  • Training Generalization - models can be pre-trained on general-purpose data and then fine-tuned for specific applications

  • Training Efficiency - because tokens can be processed independently rather than one after another, parallel processing can be used at a large scale to improve training efficiency and speed

Attention Processes Example

The process flow for multi-head attention in a language processing model is shown below. It uses matrix and vector mathematics to produce outputs based on encoded word vector inputs.

To see an example of how attention fits into a larger neural network model, see Transformer Neural Networks.

The example below is based on the research paper "Attention Is All You Need" (Vaswani et al., 2017); illustrative code sketches of the main steps appear after the list.

  • Word Embedding Processes - such as Word2vec, convert words to numbers using models trained on a corpus of text

  • Embeddings Position Encoding - adds an encoding of each word's position in the text to the encoding of the word itself (sketched after this list)

  • Model Training and Prediction Processes - train the weight matrices W used to produce the query q, key k, and value v vectors

  • Abstraction Vectors Q, K, V - are the query, key, and value representations used in the attention process

  • Multi-Head Attention Process - divides the attention computation into parallel heads for computing efficiency (sketched after this list)

  • Matrix Inputs - calculate matrices Q, K, V based on input word embeddings and trained weights Wq, Wk, Wv

  • Matrix Multiplication - multiplies Q by the transpose of K, computing a dot product between one query and one key for each element of the output score matrix

  • Scaling - divides the score matrix by the square root of the key dimension to keep values in a numerically stable range

  • Masking - indicates which elements of a matrix or vector should and should not be used, for example to keep a position from attending to later positions

  • Softmax - is a function that converts a row of scores into attention weights between zero and one that sum to one

  • Concatenation - appends the outputs of the attention heads together to form a new matrix
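A sketch of the word-embedding and position-encoding steps, assuming the sinusoidal encoding described in the paper; the random embeddings below are stand-ins for the output of a real embedding model such as Word2vec.

```python
# Sinusoidal position encoding added to word embeddings (a sketch; random
# embeddings stand in for a trained word-embedding model such as Word2vec).
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return the (seq_len, d_model) sinusoidal position-encoding matrix."""
    positions = np.arange(seq_len)[:, None]                # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]               # even model dimensions
    angles = positions / np.power(10000.0, dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)                     # sine on even indices
    encoding[:, 1::2] = np.cos(angles)                     # cosine on odd indices
    return encoding

seq_len, d_model = 5, 8
rng = np.random.default_rng(0)
word_embeddings = rng.normal(size=(seq_len, d_model))      # one vector per word
inputs = word_embeddings + positional_encoding(seq_len, d_model)
print(inputs.shape)  # (5, 8): each word vector now also carries its position
```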
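The next sketch walks through the matrix inputs, matrix multiplication, scaling, masking, and softmax steps for a single attention head. The random weight matrices Wq, Wk, and Wv stand in for trained parameters, and the causal mask shown is just one common choice of mask.

```python
# Scaled dot-product attention for a single head (a sketch with random weights
# standing in for trained parameters).
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax: each row becomes weights between zero and one summing to one."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(x, Wq, Wk, Wv, mask=None):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv              # matrix inputs
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # multiplication, then scaling
    if mask is not None:
        scores = np.where(mask, scores, -1e9)     # masked positions get near-zero weight
    return softmax(scores) @ V                    # softmax weights applied to values

seq_len, d_model, d_k = 4, 8, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))           # position-encoded embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # no attending to later tokens
print(scaled_dot_product_attention(x, Wq, Wk, Wv, causal_mask).shape)  # (4, 8)
```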
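Finally, a sketch of the multi-head and concatenation steps: the model dimension is split across several heads that attend in parallel, and the head outputs are concatenated and projected with an output weight matrix (named Wo here for illustration, following the paper's output projection). The Python loop over heads is for readability; practical implementations batch the heads so they run in parallel, which is the source of the training-efficiency advantage noted above.

```python
# Multi-head attention with concatenation of head outputs (a sketch; random
# weights stand in for trained parameters, Wo names the output projection).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    heads = []
    for h in range(num_heads):                    # each head attends over its own slice
        s = slice(h * d_head, (h + 1) * d_head)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)
        heads.append(softmax(scores) @ V[:, s])
    return np.concatenate(heads, axis=-1) @ Wo    # concatenation, then output projection

seq_len, d_model, num_heads = 4, 8, 2
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads).shape)  # (4, 8)
```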

References

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems 30.