A few months ago I had an idea for a way to generate fake words from dictionaries, and I built it. I then totally forgot that it existed and only just remembered, so I’m posting it here.
To generate fake words, the method I came up with (almost definitely not the best one) is to build a tree of depth n. Each node is a single character with a probability value (0 to 1). The probability of a root node x is the fraction of dictionary words that begin with x. The probability of a child node y (which in turn has child nodes of its own) is the chance that y comes after x. The probabilities of a node’s children add up to 1. Then, to generate a word, we recursively walk down the tree, picking a node at each level based on its probability and appending its character to the word.
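Since the source isn’t posted, here is a minimal sketch of one way to read the description above, not the actual implementation. It assumes that "the chance that y comes after x" depends only on the previous character, which makes this a first-order Markov chain; walking it for at most n steps is then the same as walking the depth-n tree. The names `build_model`, `pick`, and `generate` are made up for illustration.

```python
import random
from collections import Counter, defaultdict


def build_model(words):
    """Count first characters and adjacent character pairs in the dictionary."""
    starts = Counter()              # starts[x]: how many words begin with x
    follows = defaultdict(Counter)  # follows[x][y]: how often y comes right after x
    for word in words:
        if not word:
            continue  # skip empty entries
        starts[word[0]] += 1
        for prev, nxt in zip(word, word[1:]):
            follows[prev][nxt] += 1
    return starts, follows


def pick(counter, rng):
    """Pick one key at random, weighted by its count.

    Weighting by raw counts is equivalent to normalising them so that
    the probabilities of the candidates add up to 1.
    """
    chars = list(counter)
    weights = [counter[c] for c in chars]
    return rng.choices(chars, weights=weights, k=1)[0]


def generate(starts, follows, n, rng=random):
    """Build a word of at most n characters by walking the model.

    Assumption: generation stops early if the last character was never
    followed by anything in the dictionary.
    """
    word = pick(starts, rng)
    while len(word) < n and word[-1] in follows:
        word += pick(follows[word[-1]], rng)
    return word
```

Usage would look like `starts, follows = build_model(words)` followed by `generate(starts, follows, 5)`, where `words` is a non-empty list of dictionary entries. Note that this sketch counts Unicode code points as characters, and with a small word list most of what comes out will simply be dictionary words.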
Sometimes the words it generates are a little weird or really weird, and sometimes they are just taken verbatim from the dictionary used.
The source isn’t currently public; I might make it public sometime.
Some output (cherry-picked, n = 5):
- దళగిం (dhaLagiM)
- అనిరు (aniru)
- విద్య (vidhya)
- వ్రతన (vrathana)
- పైటగి (paitagi)
- 건지다조카설 (geonjidajokaseol)
- 평양찌꺼기여 (pyeong-yangjjikkeogiyeo)
- 보행자 (bohaengja)
- 싸우다주민개인적권리 (ssaudajumingaeinjeoggwonli)
Note: Google Translate always seems to show actual translations for the generated Korean words. I have no idea why, but all four were completely randomly generated (even the second one, which I’ll keep). The Korean generations were the only ones not cherry-picked, since I can’t read Hangeul or assess how much they sound like Korean (unlike English and Telugu, where I can, and German and Finnish, where I somewhat can).