This post looks at word embeddings using the Keras Embedding layer.

Word embedding is a class of approaches for representing words and documents as dense vectors. It is an improvement over traditional encoding schemes, in which each word was represented by a large sparse vector, or each word was scored within a vector that represents the whole vocabulary. These representations were sparse because the vocabularies were large, so a given word or document was represented by a large vector consisting mostly of zeros. In an embedding, by contrast, words are represented by dense vectors, where each vector is the projection of the word into a continuous vector space. The position of a word in that space is learned from text and is based on the words that surround it when it is used.
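To make the contrast between sparse and dense representations concrete, here is a minimal NumPy sketch. The five-word vocabulary and the random 3-dimensional vectors are made up for illustration; in a real model the dense vectors would be learned from text rather than drawn at random.

```python
import numpy as np

# Hypothetical toy vocabulary (not the dataset used later in this post).
vocab = ["good", "great", "poor", "weak", "work"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot_vector(word):
    """Sparse encoding: one slot per vocabulary word, with a single 1."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

# Dense encoding: each word is a row of a small real-valued matrix.
# Random values stand in for vectors that would normally be learned.
rng = np.random.default_rng(0)
dense_table = rng.normal(size=(len(vocab), 3))

sparse = one_hot_vector("great")
dense = dense_table[word_to_index["great"]]

print(sparse)       # its length grows with the vocabulary
print(dense.shape)  # its length stays fixed at the embedding size
```

The sparse vector needs one position per vocabulary word, while the dense vector keeps the same small size however large the vocabulary becomes.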
Two popular examples of word embedding methods are Word2Vec and GloVe.
In addition to these previously developed methods, a word embedding can be learned as part of a deep learning model.
Keras offers an Embedding layer that can be used in neural network models on text data. It requires the input data to be integer encoded, so that each word is represented by a unique integer. This data preparation step can be performed with the Tokenizer API, also provided by Keras.
The Embedding layer is initialized with random weights and learns an embedding for every word in the training dataset.
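The integer-encoding requirement can be illustrated without any library. The sketch below is a hand-rolled stand-in, not the Keras Tokenizer itself; the helper names `tokenize`, `build_index` and `encode` are hypothetical. It numbers words from 1 in order of first appearance and leaves 0 free for padding.

```python
import string

def tokenize(text):
    """Lowercase the text, strip punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def build_index(texts):
    """Map each distinct word to a unique integer, starting at 1."""
    index = {}
    for text in texts:
        for word in tokenize(text):
            if word not in index:
                index[word] = len(index) + 1  # 0 is left free for padding
    return index

def encode(text, index):
    """Replace every word in the text with its integer."""
    return [index[word] for word in tokenize(text)]

texts = ["Good work", "Poor work", "Not good"]
index = build_index(texts)
encoded = [encode(t, index) for t in texts]

print(index)
print(encoded)
```

Each document becomes a list of integers, and the same word always maps to the same integer, which is the property the Embedding layer relies on.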
It is a flexible layer that can be used in various ways, such as:
- It can be used on its own to learn a word embedding that can be saved and used in another model later.
- It can be used as part of a deep learning model, where the embedding is learned along with the model itself.
- It can be used to load a pre-trained word embedding model, which is a form of transfer learning.
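For the third option, a common preparation step is to build a weight matrix whose rows line up with the model's integer encoding. The sketch below assumes the pre-trained vectors are already available as a dictionary; the words, the 2-dimensional vectors and the `word_index` mapping are all invented for illustration.

```python
import numpy as np

# Hypothetical pre-trained vectors (2 dimensions, invented values).
pretrained = {
    "good": np.array([0.9, 0.1]),
    "poor": np.array([-0.8, 0.2]),
}

# Hypothetical word-to-integer mapping of the new model; 0 is padding.
word_index = {"good": 1, "poor": 2, "work": 3}

embedding_dim = 2
# One row per integer index, including row 0 for padding.
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    vector = pretrained.get(word)
    if vector is not None:
        embedding_matrix[i] = vector  # words without a vector stay all-zero

print(embedding_matrix)
```

A matrix like this could then serve as the initial weights of an Embedding layer, either kept fixed or fine-tuned during training.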
The Embedding layer is defined as the first hidden layer of a network. It takes three arguments:
- input_dim: The size of the vocabulary in the text data. For example, if the data is integer encoded with values from 0 to 10, the vocabulary size is 11 words.
- output_dim: The size of the vector space in which words will be embedded. It defines the size of the output vector this layer produces for each word. For example, it could be 32, 100, or larger.
- input_length: The length of the input sequences, as you would define for any input layer of a Keras model. For example, if every input document consists of 1000 words, this is 1000.
For example, below we define an Embedding layer with a vocabulary of 200 words (integer-encoded words from 0 to 199), a vector space of 32 dimensions in which words will be embedded, and input documents of 50 words each.
    e = Embedding(200, 32, input_length=50)
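Conceptually, such a layer behaves like a lookup table with one row per vocabulary index. The NumPy sketch below mimics that lookup for the same three sizes. It is not the Keras implementation; the random initial values, their range, and the batch of 3 documents are assumptions made for illustration.

```python
import numpy as np

input_dim, output_dim, input_length = 200, 32, 50

rng = np.random.default_rng(42)
# Stand-in for the layer's weight table: one 32-dimensional row per index.
weights = rng.uniform(-0.05, 0.05, size=(input_dim, output_dim))

# A batch of 3 hypothetical documents, each 50 integer-encoded words (0..199).
batch = rng.integers(0, input_dim, size=(3, input_length))

# The lookup: every integer is replaced by the matching row of the table.
embedded = weights[batch]

print(weights.size)    # number of values in the table: 200 * 32
print(embedded.shape)  # (documents, words per document, embedding size)
```

Each document of 50 integers turns into a 50 × 32 matrix, and the table itself holds 200 × 32 values that training would adjust.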
The Embedding layer contains weights that can be inspected later. If you save the model to a file, the weights of the Embedding layer are included. The output of the Embedding layer is a 2D matrix with one vector for each word in the input sequence (input document). To connect a Dense (fully connected) layer directly to the Embedding layer, you must first flatten the 2D output matrix into a 1D vector using the Flatten layer. Now let's see how to use the Embedding layer in practice.
We will define a small problem with 10 text documents, each a comment on a piece of work done by a student. Each document is labeled as positive (1) or negative (0). First, we define the documents and their class labels.
    from numpy import array
    from keras.preprocessing.text import one_hot
    from keras.preprocessing.sequence import pad_sequences
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.layers import Flatten
    from keras.layers import Embedding

    # define documents
    docs = ['Well done!',
            'Good work',
            'Great effort',
            'nice work',
            'Excellent!',
            'Weak',
            'Poor effort!',
            'not good',
            'poor work',
            'Could have done better.']
    # define class labels
    lbls = array([1,1,1,1,1,0,0,0,0,0])
Next, we integer encode each document, so that the Embedding layer receives sequences of integers as input. Keras provides the one_hot() function, which hashes each word to an integer as an efficient encoding. We set the vocabulary size to 50, which is much larger than needed, to reduce the probability of collisions from the hash function.
    # integer encode the documents
    vs = 50
    enc_docs = [one_hot(d, vs) for d in docs]
    print(enc_docs)
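How likely are collisions with a vocabulary size of 50? The sketch below gives a rough estimate under a simplifying assumption that is not stated in this post: each distinct word is assigned one of 49 non-zero indices uniformly at random, with 0 kept free for padding. The helper `collision_probability` is a hypothetical name and applies the standard birthday-problem calculation.

```python
import string

docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!',
        'Weak', 'Poor effort!', 'not good', 'poor work',
        'Could have done better.']

# Collect the distinct lowercase words, ignoring punctuation.
strip_punct = str.maketrans("", "", string.punctuation)
words = {w for d in docs for w in d.lower().translate(strip_punct).split()}

def collision_probability(n_words, n_buckets):
    """Chance that at least two of n_words share one of n_buckets indices."""
    p_all_distinct = 1.0
    for i in range(n_words):
        p_all_distinct *= (n_buckets - i) / n_buckets
    return 1.0 - p_all_distinct

n_buckets = 49  # assumption: indices 1..49, with 0 reserved
p = collision_probability(len(words), n_buckets)
print(len(words), round(p, 3))
```

Under this assumption, at least one collision among the 14 distinct words is still more likely than not, so a larger size lowers the risk without removing it. Whether a collision matters depends on whether the colliding words carry different meanings for the task.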
The sequences have different lengths, so we pad all input sequences to a length of 4. We can do this with the built-in Keras function pad_sequences().
    # pad documents to a max length of 4 words
    max_length = 4
    p_docs = pad_sequences(enc_docs, maxlen=max_length, padding='post')
    print(p_docs)
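To show what post-padding does, here is a plain-Python stand-in. It is not the Keras function: `pad_post` is a hypothetical helper that only appends zeros and deliberately refuses sequences longer than the target length instead of truncating them. The integer sequences are the ones from the example run in this post.

```python
def pad_post(sequences, max_length, value=0):
    """Right-pad each sequence with `value` up to max_length."""
    padded = []
    for seq in sequences:
        if len(seq) > max_length:
            # Truncation is outside the scope of this sketch.
            raise ValueError("sequence is longer than max_length")
        padded.append(list(seq) + [value] * (max_length - len(seq)))
    return padded

enc_docs = [[42, 38], [24, 27], [26, 11], [3, 27], [36], [42],
            [28, 11], [8, 24], [28, 27], [23, 4, 38, 31]]

for row in pad_post(enc_docs, max_length=4):
    print(row)
```

Every row ends up with exactly 4 entries, which is what lets the documents be stacked into a single 10 × 4 array for the model.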
Now we are ready to define the Embedding layer as part of our neural network model. The model is a simple binary classifier. Note that the output of the Embedding layer will be 4 vectors of 8 dimensions each, one for each word.
    # define the model
    modelEmb = Sequential()
    modelEmb.add(Embedding(vs, 8, input_length=max_length))
    modelEmb.add(Flatten())
    modelEmb.add(Dense(1, activation='sigmoid'))
    # compile the model
    modelEmb.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
    # summarize the model (summary() prints the table itself)
    modelEmb.summary()
    # fit the model
    modelEmb.fit(p_docs, lbls, epochs=150, verbose=0)
    # evaluate the model on the training documents
    loss, accuracy = modelEmb.evaluate(p_docs, lbls, verbose=2)
    print('Accuracy: %f' % (accuracy*100))
Running the example first prints the integer-encoded documents.
    [[42, 38], [24, 27], [26, 11], [3, 27], [36], [42], [28, 11], [8, 24], [28, 27], [23, 4, 38, 31]]
Then the padded version of each document is printed, with zeros appended so that all sequences have the same length.
    [[42 38  0  0]
     [24 27  0  0]
     [26 11  0  0]
     [ 3 27  0  0]
     [36  0  0  0]
     [42  0  0  0]
     [28 11  0  0]
     [ 8 24  0  0]
     [28 27  0  0]
     [23  4 38 31]]
After the model is defined, a summary of the network is printed. As expected, the output of the Embedding layer is a 4 × 8 matrix, which the Flatten layer reshapes into a 32-element vector.
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    embedding_1 (Embedding)      (None, 4, 8)              400
    _________________________________________________________________
    flatten_1 (Flatten)          (None, 32)                0
    _________________________________________________________________
    dense_1 (Dense)              (None, 1)                 33
    =================================================================
    Total params: 433
    Trainable params: 433
    Non-trainable params: 0
    _________________________________________________________________
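The parameter counts in the summary can be checked by hand. The short calculation below uses only the sizes from this example; the variable names are chosen for illustration.

```python
vocab_size = 50     # input_dim of the Embedding layer
embedding_dim = 8   # output_dim of the Embedding layer
max_length = 4      # words per padded document

# Embedding: one 8-dimensional vector per vocabulary index.
embedding_params = vocab_size * embedding_dim

# Flatten: no weights, it only reshapes (4, 8) into (32,).
flatten_size = max_length * embedding_dim

# Dense(1): one weight per flattened input plus one bias.
dense_params = flatten_size * 1 + 1

total_params = embedding_params + dense_params
print(embedding_params, flatten_size, dense_params, total_params)
# prints: 400 32 33 433
```

Almost all of the parameters sit in the embedding table, which is typical when the vocabulary is larger than the rest of the network.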
Finally, the accuracy of the trained model is printed, showing that it learned the training data perfectly. Keep in mind that this accuracy is measured on the same 10 documents the model was trained on.
You can save the trained Embedding layer weights to a file for later use in other models. You can also use this model to classify other documents that have the same vocabulary.
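As a rough sketch of saving and reloading the weights, assume they have already been extracted from the trained layer as a NumPy array of shape (50, 8); Keras layers provide a get_weights() method that returns their weights as NumPy arrays. Here a random matrix stands in for the trained weights, and the file name is arbitrary.

```python
import os
import tempfile

import numpy as np

# Stand-in for the trained embedding weights (50 words x 8 dimensions).
rng = np.random.default_rng(7)
embedding_weights = rng.normal(size=(50, 8))

# Save the matrix to a file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "embedding_weights.npy")
np.save(path, embedding_weights)

# Later, possibly in another program, load it back unchanged.
restored = np.load(path)
print(restored.shape)
print(np.array_equal(restored, embedding_weights))
```

The restored matrix is only meaningful to a model that uses the same integer encoding of words, since each row belongs to one specific integer index.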