It’s hard to decide on a baby name, even more so when multiple people are invested in the decision. My friends recently had a baby and couldn’t decide between Alex and Niklas. I naturally took it upon myself to solve their problem by building a Machine Learning model that finds the names lying exactly in the middle. What do you get when you split the difference between Alex and Niklas?
Approach
Variational AutoEncoders (VAEs) are a popular type of Machine Learning model that expands on autoencoders (AEs). I’ve presented autoencoders in some previous posts, where I used them to reduce dimensionality or to cluster unlabelled data. AEs are models trained to recreate their inputs, usually producing something interesting along the way. Since these intermediate representations need to contain all the information required to recreate the original, they usually capture a deeper understanding of the data itself.
While AEs are good at compressing existing data, they are not suited to generating new data, since the space between existing embeddings is not necessarily meaningful. VAEs solve this problem by having the encoder output the parameters of a normal distribution from which the latent vector is sampled, thereby regularising the latent space. In other words, a VAE gives meaning to the embedding space as a whole, so it makes semantic sense to sample between and around the existing data points. Since the desired outcome of this project is to split the difference between the encodings of “Alex” and “Niklas”, this architecture should provide the required functionality.
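The key ingredient is that sampling step. Below is a minimal Keras sketch of the standard reparameterisation trick; this is the general pattern, not this project’s exact code:

import tensorflow as tf

class Sampling(tf.keras.layers.Layer):
    """Draws a latent vector z from N(z_mean, exp(z_log_var))."""
    def call(self, inputs):
        z_mean, z_log_var = inputs
        # Scale standard normal noise by the predicted standard deviation
        # and shift it by the predicted mean (the reparameterisation trick)
        eps = tf.random.normal(shape=tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * eps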
Data
The data is available from https://familieretshuset.dk/navne/navne/godkendte-fornavne, which provides a list of over 20,000 boy names approved in Denmark. I converted the special characters å, æ and ø to a, ae and o for simplicity. I also removed names longer than 10 characters, since they made up an insignificant share of the dataset.
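The cleanup itself is only a few lines. A rough sketch, assuming the raw list has been loaded into raw_names (a variable name of my choosing, not the repo’s):

def clean_name(name: str) -> str:
    # Replace the Danish special characters as described above
    return (name.lower()
                .replace("å", "a")
                .replace("æ", "ae")
                .replace("ø", "o"))

names = [clean_name(n) for n in raw_names]
names = [n for n in names if len(n) <= 10]  # drop the few names over 10 characters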
Architecture
The model I have settled on is a sequence-to-sequence (Seq2Seq) model using a combination of Conv1D and LSTM layers. The encoding is fed to the decoder as a repeated input until the full output is produced. The latent vector has dimension 256: the 64-dimensional encoder output passes through two 256-neuron dense layers to produce the mean and variance vectors. A sketch of how these pieces fit together follows the summaries below.
Model: "encoder" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= conv1d (Conv1D) (None, None, 32) 2624 conv1d_1 (Conv1D) (None, None, 32) 3104 conv1d_2 (Conv1D) (None, None, 32) 3104 bidirectional (Bidirectiona (None, None, 128) 49664 l) bidirectional_1 (Bidirectio (None, None, 128) 98816 nal) lstm_2 (LSTM) (None, 64) 49408 ================================================================= Total params: 206,720 Trainable params: 206,720 Non-trainable params: 0 _________________________________________________________________ Model: "decoder" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= repeat_vector (RepeatVector (None, 10, 256) 0 ) bidirectional_2 (Bidirectio (None, 10, 128) 164352 nal) bidirectional_3 (Bidirectio (None, 10, 128) 98816 nal) conv1d_3 (Conv1D) (None, 10, 32) 12320 conv1d_4 (Conv1D) (None, 10, 32) 3104 conv1d_5 (Conv1D) (None, 10, 27) 2619 ================================================================= Total params: 281,211 Trainable params: 281,211 Non-trainable params: 0 _________________________________________________________________
I created a custom callback that computes the average edit distance between the input and output names, and stops training once the model perfectly reconstructs more than 50% of the names. Since this is a toy project, I haven’t spent more time perfecting the training loop. I also reduced the effect of the sampled variance in order to more easily obtain results.
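The callback itself isn’t shown in the post, but the idea is simple enough to sketch. This is a minimal version, assuming a reconstruct_fn helper (my stand-in, not the repo’s API) that runs a name through the encoder and decoder and returns the decoded string:

import numpy as np
import tensorflow as tf

def edit_distance(a: str, b: str) -> int:
    # Levenshtein distance via the classic single-row dynamic programme
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (ca != cb))  # substitution
            prev = cur
    return dp[-1]

class ReconstructionCallback(tf.keras.callbacks.Callback):
    def __init__(self, names, reconstruct_fn):
        super().__init__()
        self.names = names
        self.reconstruct_fn = reconstruct_fn

    def on_epoch_end(self, epoch, logs=None):
        dists = [edit_distance(n, self.reconstruct_fn(n)) for n in self.names]
        exact = sum(d == 0 for d in dists) / len(dists)
        print(f" - mean edit distance: {np.mean(dists):.2f}, exact: {exact:.0%}")
        if exact > 0.5:  # stop once more than half the names reconstruct perfectly
            self.model.stop_training = True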
Results
We can now use the machine learning model to interpolate between two names. First, we find where Alex and Niklas exist in the latent space.
# Get encodings for Alex and Niklas
alex_encoded, _ = vae.encode(prepare("alex"))
niklas_encoded, _ = vae.encode(prepare("niklas"))
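prepare (used above) and ohe_to_name (used below) are helpers for converting between strings and one-hot arrays. They aren’t shown in the post; a plausible sketch, assuming a 27-symbol alphabet of a–z plus a padding character:

import numpy as np

ALPHABET = " abcdefghijklmnopqrstuvwxyz"  # index 0 is the padding character
MAX_LEN = 10

def prepare(name: str) -> np.ndarray:
    # One-hot encode a name, padded to MAX_LEN, with a leading batch axis
    ohe = np.zeros((1, MAX_LEN, len(ALPHABET)), dtype=np.float32)
    for i, ch in enumerate(name.ljust(MAX_LEN)):
        ohe[0, i, ALPHABET.index(ch)] = 1.0
    return ohe

def ohe_to_name(ohe: np.ndarray) -> str:
    # Greedy argmax decode back to a string, stripping the padding
    return "".join(ALPHABET[i] for i in ohe.argmax(axis=-1)).strip()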
Just like finding the points between a start and an end point on a 2D map, we interpolate between Alex and Niklas in 256-dimensional space. We then use these intermediate vectors as generation seeds for the decoder, producing all the names between Alex and Niklas.
# Get intermediate names from 100% Alex to 100% Niklas
for perc in np.arange(0, 1.1, 0.1):
    wsum = perc * niklas_encoded + (1 - perc) * alex_encoded
    decoded = vae.decode(wsum)[0]
    decoded = ohe_to_name(decoded)
    print(f"{round((1-perc)*100):03d}% alex + {round(perc*100):03d}% niklas -> {decoded}")
Finally, the output:
100% alex + 000% niklas -> alex
090% alex + 010% niklas -> alex
080% alex + 020% niklas -> alex
070% alex + 030% niklas -> alex
060% alex + 040% niklas -> allx
050% alex + 050% niklas -> erli
040% alex + 060% niklas -> eillas
030% alex + 070% niklas -> nillas
020% alex + 080% niklas -> nillas
010% alex + 090% niklas -> niklas
000% alex + 100% niklas -> niklas
So there it is: halfway between Alex and Niklas is Erli, closely followed by Elias (close enough to eillas).
Code
The code is on my GitHub page: https://github.com/felixgravila/alex2niklas