A little machine learning project I started (and left on indefinite hiatus) in 2022.
Now, I know what you're thinking: oh god, another art-abusing AI slop generator, now for conlangs too. But my whole goal with this was to make something from the ground up that doesn't learn language from how humans use it, but instead creates its own (super domain-specific) "language" purely by being forced to convey some information.
In the little demo I made, I generate 10 random numbers and use a neural network to turn them into the endpoints of 3 line segments. Those get drawn into an image, with some noise added and the endpoints jittered around a little. Finally, another neural network tries to decipher the original 10 numbers. The networks get rewarded if the recovered 10 numbers are close to the originals.
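In case the loop is hard to picture, here's a minimal sketch of it in PyTorch. To be clear, this is not the original code: the sizes, architectures, noise levels, and the soft line rasterizer (which keeps the whole thing differentiable) are all my own stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
N_NUMS, N_SEGS, SIZE = 10, 3, 16  # guesses: 10 inputs, 3 strokes, 16x16 canvas

encoder = nn.Sequential(nn.Linear(N_NUMS, 32), nn.ReLU(), nn.Linear(32, N_SEGS * 4))
decoder = nn.Sequential(nn.Linear(SIZE * SIZE, 64), nn.ReLU(), nn.Linear(64, N_NUMS))
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)

def rasterize(segs, size=SIZE, sharp=40.0):
    # Soft line drawing: each pixel's intensity falls off with its distance
    # to the nearest segment, so gradients can flow back to the endpoints.
    ys, xs = torch.meshgrid(torch.linspace(0, 1, size),
                            torch.linspace(0, 1, size), indexing="ij")
    pix = torch.stack([xs, ys], -1).reshape(-1, 2)   # (pixels, 2) pixel centers
    a, b = segs[:, :2], segs[:, 2:]                  # segment endpoints
    ab = b - a
    t = ((pix[:, None] - a) * ab).sum(-1) / (ab.pow(2).sum(-1) + 1e-8)
    near = a + t.clamp(0, 1)[..., None] * ab         # closest point on each segment
    d2 = (pix[:, None] - near).pow(2).sum(-1)        # squared pixel-to-segment distance
    return torch.exp(-sharp * d2).amax(-1).reshape(size, size)

nums = torch.rand(N_NUMS)                               # 10 random numbers
segs = torch.sigmoid(encoder(nums)).reshape(N_SEGS, 4)  # -> 3 segments' endpoints
segs = segs + 0.02 * torch.randn_like(segs)             # jitter the endpoints a little
img = rasterize(segs) + 0.1 * torch.randn(SIZE, SIZE)   # draw, add pixel noise
decoded = decoder(img.reshape(-1))                      # try to recover the numbers
loss = F.mse_loss(decoded, nums)                        # "reward" = closeness to originals
opt.zero_grad(); loss.backward(); opt.step()
```

In a real run this step would sit inside a training loop over many random batches; the one-sample version just shows the shape of the pipeline.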
The pie-in-the-sky goal would be to generate a mini-language that is completely alien yet still reasonable and interpretable by humans. What would it mean if it came up with a Verb-Subject-Object structure?
Artistically, does it make sense to hand this to an AI? You could argue that I should instead try to come up with my own exotic ways for how language could work, to train my conlanging muscles. My hope was that generating a novel language automatically and then studying it might be easier than inventing one myself, but who's to say I'll know how to analyze it, or that I haven't accidentally programmed in a bunch of assumptions that force the language to behave like human languages? Whatever it comes up with won't really prove anything about language either. It's safe to say this is more of an exercise in machine learning than in conlanging or linguistics.
All that being said, some things to experiment with:
Interesting input data. Maybe images with overlapping colors (think Factory Balls or Minecraft banners), or an irl timelapse (clouds, traffic), or some math thing (cellular automata, fractal pics). It should be something with recognizable and diverse features, so that the writing is easy for us to analyze, but also not so orderly that it just encodes random numbers with extra steps.
Then, how do we reward the network? Recreating the image pixel by pixel doesn't seem right, or very language-y for that matter; only some key features need to be encoded. So maybe, instead of having the decoder recreate the original image, we could show it an image and ask whether that is what the encoder was encoding.
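One way to phrase that idea (my interpretation, nothing actually implemented): replace the decoder with a matcher that scores (image, candidate) pairs, trained contrastively so the true input outscores a random decoy. All names and sizes here are hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
N_NUMS, SIZE = 10, 16

# Hypothetical matcher: takes a flattened image plus a candidate vector of
# numbers and outputs one "is this what the image encodes?" score.
matcher = nn.Sequential(nn.Linear(SIZE * SIZE + N_NUMS, 64),
                        nn.ReLU(), nn.Linear(64, 1))

def score(img, candidate):
    return matcher(torch.cat([img.reshape(-1), candidate]))

# Stand-ins; in the real setup the image would come from the encoder/rasterizer.
nums = torch.rand(N_NUMS)
img = torch.rand(SIZE, SIZE)
decoy = torch.rand(N_NUMS)

# Contrastive objective: the true numbers (class 0) should outscore the decoy.
logits = torch.stack([score(img, nums), score(img, decoy)]).reshape(1, 2)
loss = F.cross_entropy(logits, torch.tensor([0]))
loss.backward()
```

The nice property is that the matcher only needs to tell candidates apart, so the writing only has to carry whatever features distinguish one input from another, not a full pixel-level copy.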
Partial data. Human languages tend to have removable parts: you can drop an adjective from a sentence and it will still make sense. Encouraging that kind of structure would make the output easier to analyze.
If I give the encoder 2 strokes to encode the first 5 numbers, and then another 2 to encode the next 5, how will it do that? Will it place them in consistent positions, say, the first 5 in the bottom left, next 5 in the top right? Would it repeat the same shapes if the first 5 and last 5 numbers are the same?