At this year’s GPU Technology Conference, Facebook AI Research showed off a neural net that lets you generate unique images based on a text description. First they typed “beach” and what looked like a painting of a beach appeared. The image had clouds, so next they typed “beach -clouds” and a brand new beach image appeared with blue sky and no clouds. Lastly, they typed “sunset beach -clouds” and yet another beach image appeared with an orange-red sunset.
Amid an impressive conference of photorealistic graphics, self-driving cars, and supercomputers, this was the presentation that drew the most “wow”s from the crowd, and for good reason. Neural nets have become the popular kids on the block when it comes to AI advancement. For those not familiar with the concept, neural nets are algorithms loosely modeled on the way our brains work: layers of simple connected units that learn patterns from examples.
What Facebook has done is train their neural net to associate certain words with certain image types, training it on many thousands (maybe even millions) of different images. Once the net is trained (requiring a supercomputer for datasets that large), getting a result back takes relatively little time.
The key breakthrough here is the system’s ability to combine different images through text. Not only can it recognize individual elements of an image, but it can also remove and replace those elements dynamically. And this is all possible through a natural language interface, i.e., describe an image and bam, there it is.
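To make the “beach -clouds” syntax concrete, here is a minimal sketch of how a text prompt like the ones in the demo could be split into things to include and things to leave out. This is only an illustration of the query format, not Facebook’s actual method; the function name `parse_prompt` and the rule that a leading minus sign means “exclude” are assumptions based on what was typed on stage.

```python
from typing import List, Tuple


def parse_prompt(prompt: str) -> Tuple[List[str], List[str]]:
    """Split a prompt into (include, exclude) word lists.

    Hypothetical helper: a word with a leading "-" is treated as
    something the generated image should NOT contain.
    """
    include: List[str] = []
    exclude: List[str] = []
    for token in prompt.lower().split():
        if token.startswith("-"):
            word = token.lstrip("-")
            if word:  # ignore a stray "-" with nothing after it
                exclude.append(word)
        else:
            include.append(token)
    return include, exclude


if __name__ == "__main__":
    for text in ["beach", "beach -clouds", "sunset beach -clouds"]:
        wanted, unwanted = parse_prompt(text)
        print(f"{text!r}: include={wanted}, exclude={unwanted}")
```

In a real system, the two lists would then be handed to the trained net, which does the hard part of turning “sunset, beach, no clouds” into pixels.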
This first implementation is trained on artsy 2D images such as paintings. However, a 3D implementation could be hugely useful. Building the virtual reality metaverse is going to require a large amount of 3D art. Having a neural net generate 3D objects based on a text description would enable faster VR content creation. And it might not be too far away.
DeepMind (owned by Google) went from a neural net that could play 2D games to one that could play 3D games in less than a year.
Granted, a neural net that generates 3D objects has several more hurdles to overcome. First, the algorithm has to be tuned to use and recognize 3D rather than 2D assets, which is an active research problem. Luckily, progress in computer vision and an increasing interest in 3D computing should push research forward. Second, there probably aren’t enough different 3D assets available on the Internet to properly train a 3D neural net. To get around this, developers may need to find a way to combine 2D and 3D image recognition. Also, as virtual reality and augmented reality adoption increases, we should see a huge increase in the number of 3D assets in the world.
Looking at the pace of the AI and 3D industries, I think it’s a safe bet to expect a neural net that generates 3D objects within 10 years. With it, creating your perfect virtual world will be as simple as writing a scene description.