It’s hard to decide on a baby name, even more so when multiple people are invested in the decision. My friends recently had a baby and couldn’t decide between Alex and Niklas. I naturally took it upon myself to solve their problem by building a Machine Learning model that finds the names lying exactly in the middle. What do you get when you split the difference between Alex and Niklas?
Approach
Variational AutoEncoders (VAEs) are a popular type of Machine Learning model that expands on autoencoders (AEs). I’ve presented autoencoders in some previous posts, where I used them to reduce dimensionality or to cluster unlabelled data. AEs are models trained to recreate their inputs, usually producing something interesting along the way. Since these intermediate representations need to contain all the information required to recreate the original, they usually capture a deeper understanding of the data itself.
While AEs are good at compressing existing data, they are not suited to generating new data, since the space between existing embeddings is not necessarily meaningful. VAEs solve this problem by having the encoder output the parameters of a normal distribution from which the latent vector is sampled, thereby regularising the latent space. In other words, a VAE gives meaning to the embedding space as a whole, so it makes semantic sense to sample between and around the existing data points. Since the desired outcome of this project is to split the difference between the encodings of “Alex” and “Niklas”, this architecture should provide the required functionality.
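The key ingredient is that sampling step. Below is a minimal Keras sketch of the standard reparameterisation trick; this is the general pattern, not this project’s exact code:

import tensorflow as tf

class Sampling(tf.keras.layers.Layer):
    """Draws a latent vector z from N(z_mean, exp(z_log_var))."""
    def call(self, inputs):
        z_mean, z_log_var = inputs
        # Scale standard normal noise by the predicted standard deviation
        # and shift it by the predicted mean (the reparameterisation trick)
        eps = tf.random.normal(shape=tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * eps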
Data
The data is available from https://familieretshuset.dk/navne/navne/godkendte-fornavne, which provides a list of over 20,000 boy names approved in Denmark. I converted the special characters å, æ and ø to a, ae and o for simplicity. I also removed names longer than 10 characters, since they made up an insignificant share of the dataset.
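The cleanup itself is only a few lines. A rough sketch, assuming the raw list has been loaded into raw_names (a variable name of my choosing, not the repo’s):

def clean_name(name: str) -> str:
    # Replace the Danish special characters as described above
    return (name.lower()
                .replace("å", "a")
                .replace("æ", "ae")
                .replace("ø", "o"))

names = [clean_name(n) for n in raw_names]
names = [n for n in names if len(n) <= 10]  # drop the few names over 10 characters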
Architecture
The model I have settled on is a sequence-to-sequence (Seq2Seq) model using a combination of Conv1D and LSTM layers. The encoding is fed to the decoder as a repeated input until the full output is produced. The latent vector has dimension 256: the 64-dimensional encoder output passes through two 256-neuron dense layers to produce the mean and variance vectors. A sketch of how these pieces fit together follows the summaries below.
Model: "encoder" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= conv1d (Conv1D) (None, None, 32) 2624 conv1d_1 (Conv1D) (None, None, 32) 3104 conv1d_2 (Conv1D) (None, None, 32) 3104 bidirectional (Bidirectiona (None, None, 128) 49664 l) bidirectional_1 (Bidirectio (None, None, 128) 98816 nal) lstm_2 (LSTM) (None, 64) 49408 ================================================================= Total params: 206,720 Trainable params: 206,720 Non-trainable params: 0 _________________________________________________________________ Model: "decoder" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= repeat_vector (RepeatVector (None, 10, 256) 0 ) bidirectional_2 (Bidirectio (None, 10, 128) 164352 nal) bidirectional_3 (Bidirectio (None, 10, 128) 98816 nal) conv1d_3 (Conv1D) (None, 10, 32) 12320 conv1d_4 (Conv1D) (None, 10, 32) 3104 conv1d_5 (Conv1D) (None, 10, 27) 2619 ================================================================= Total params: 281,211 Trainable params: 281,211 Non-trainable params: 0 _________________________________________________________________
I created a custom callback that computes the average edit distance between the input and output names, and stops training once the model perfectly reconstructs more than 50% of the names. Since this is a toy project, I haven’t spent more time perfecting the training loop. I also reduced the effect of the sampled variance in order to more easily obtain results.
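The callback itself isn’t shown in the post, but the idea is simple enough to sketch. This is a minimal version, assuming a reconstruct_fn helper (my stand-in, not the repo’s API) that runs a name through the encoder and decoder and returns the decoded string:

import numpy as np
import tensorflow as tf

def edit_distance(a: str, b: str) -> int:
    # Levenshtein distance via the classic single-row dynamic programme
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (ca != cb))  # substitution
            prev = cur
    return dp[-1]

class ReconstructionCallback(tf.keras.callbacks.Callback):
    def __init__(self, names, reconstruct_fn):
        super().__init__()
        self.names = names
        self.reconstruct_fn = reconstruct_fn

    def on_epoch_end(self, epoch, logs=None):
        dists = [edit_distance(n, self.reconstruct_fn(n)) for n in self.names]
        exact = sum(d == 0 for d in dists) / len(dists)
        print(f" - mean edit distance: {np.mean(dists):.2f}, exact: {exact:.0%}")
        if exact > 0.5:  # stop once more than half the names reconstruct perfectly
            self.model.stop_training = True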
Results
We can now use the machine learning model to interpolate between two names. First, we find where Alex and Niklas exist in the latent space.
# Get encodings for Alex and Niklas
alex_encoded, _ = vae.encode(prepare("alex"))
niklas_encoded, _ = vae.encode(prepare("niklas"))
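prepare (used above) and ohe_to_name (used below) are helpers for converting between strings and one-hot arrays. They aren’t shown in the post; a plausible sketch, assuming a 27-symbol alphabet of a–z plus a padding character:

import numpy as np

ALPHABET = " abcdefghijklmnopqrstuvwxyz"  # index 0 is the padding character
MAX_LEN = 10

def prepare(name: str) -> np.ndarray:
    # One-hot encode a name, padded to MAX_LEN, with a leading batch axis
    ohe = np.zeros((1, MAX_LEN, len(ALPHABET)), dtype=np.float32)
    for i, ch in enumerate(name.ljust(MAX_LEN)):
        ohe[0, i, ALPHABET.index(ch)] = 1.0
    return ohe

def ohe_to_name(ohe: np.ndarray) -> str:
    # Greedy argmax decode back to a string, stripping the padding
    return "".join(ALPHABET[i] for i in ohe.argmax(axis=-1)).strip()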
Just like finding the points between a start and an end point on a 2D map, we interpolate between Alex and Niklas in 256-dimensional space. We then use these intermediate vectors as generation seeds for the decoder, producing all the names between Alex and Niklas.
# Get intermediate names from 100% Alex to 100% Niklas
for perc in np.arange(0, 1.1, 0.1):
    wsum = perc * niklas_encoded + (1 - perc) * alex_encoded
    decoded = vae.decode(wsum)[0]
    decoded = ohe_to_name(decoded)
    print(f"{round((1-perc)*100):03d}% alex + {round(perc*100):03d}% niklas -> {decoded}")
Finally, the output:
100% alex + 000% niklas -> alex
090% alex + 010% niklas -> alex
080% alex + 020% niklas -> alex
070% alex + 030% niklas -> alex
060% alex + 040% niklas -> allx
050% alex + 050% niklas -> erli
040% alex + 060% niklas -> eillas
030% alex + 070% niklas -> nillas
020% alex + 080% niklas -> nillas
010% alex + 090% niklas -> niklas
000% alex + 100% niklas -> niklas
So there it is: halfway between Alex and Niklas is Erli, closely followed by Elias (close enough to eillas).
Code
The code is on my GitHub page: https://github.com/felixgravila/alex2niklas