Meta's ImageBind AI will be able to mimic human perception

The ImageBind AI model could lead to advances in accessibility and the creation of mixed reality environments.


Meta is developing an AI tool called ImageBind that predicts connections between different kinds of data, similar to how people perceive or imagine their environment. While image generators such as Midjourney, Stable Diffusion, and DALL-E 2 match words with pictures, enabling the generation of a visual scene from a textual description alone, ImageBind offers more options and greater capabilities.

ImageBind can bind text, images, video, audio, 3D depth data, thermal (temperature) data, and motion data. Notably, it does this without first having to be trained on every possible combination of modalities. The work is at an early stage, but eventually the tool could generate complex environments from inputs as simple as a text prompt, an image, an audio recording, or some combination of the three.

ImageBind brings machine learning closer to human learning. For example, if you’re standing in a stimulating environment like a busy city street, your brain (mostly unconsciously) absorbs sights, sounds, and other sensory experiences to gather information about passing cars and pedestrians, tall buildings, the weather, and more.

Humans and other animals evolved to process this data because doing so confers an evolutionary advantage: survival and passing on DNA. The more aware you are of your surroundings, the better you can avoid danger and adapt to your environment. As computers come closer to mimicking the multisensory connections animals make, they can use those connections to generate fully realized scenes from only limited pieces of data.

With Midjourney, you can prompt for "a dog wearing a Gandalf outfit while balancing on a beach ball" and get a reasonably realistic image of that bizarre scene. A multimodal AI tool like ImageBind, however, could eventually make a video of the dog with matching sounds, including a detailed rendering of a suburban living room, the temperature in the room, and the precise locations of the dog and anyone else in the scene.

Completely realistic 3D scenes

“This creates distinctive possibilities for creating animations from static images by pairing them with audio prompts,” the Meta researchers state. “For example, a creator could pair an image with an alarm clock and a crowing rooster, then use the crowing sound to segment the rooster or the alarm sound to segment the clock, and animate both into a video sequence.”

There is still much that could be done with this new tool, and everything points to one of Meta's key ambitions: VR, mixed reality, and the metaverse. Future headsets could construct fully realistic 3D scenes on the fly, and virtual game developers may be able to use it for much of their design process.

Content creators could make immersive videos with realistic soundscapes and movement based on nothing more than text, images, or audio. It is also easy to imagine a tool like ImageBind opening new doors in accessibility: it could generate real-time multimedia descriptions to help people with visual or hearing impairments better perceive their immediate environment.

Technology will expand beyond its borders

“In typical AI systems, there is a specific embedding (that is, vectors of numbers that can represent data and their relationships in machine learning) for each respective modality,” the Meta researchers explain. “ImageBind shows that it is possible to create a joint embedding space across multiple modalities without needing to train on data for every different combination of modalities. This is important because it is not feasible for researchers to create datasets with samples containing, for example, audio data and thermal data from a busy city street, or depth data and a textual description of a coastal cliff.”
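The idea of a joint embedding space can be illustrated with a toy sketch. This is purely illustrative: the vectors below are hand-picked, not learned, and the dictionary of "encoded" items stands in for real modality encoders; none of it reflects ImageBind's actual architecture, weights, or API. The point is only that once every modality maps into one shared vector space, items from any two modalities can be compared directly, even if that pair was never seen together in training.

```python
import numpy as np

# Hand-picked stand-ins for learned embeddings. Each entry maps an input
# from some modality into the SAME shared 3-dimensional vector space;
# related concepts are deliberately placed close together.
EMBEDDINGS = {
    ("text",  "a dog barking"):     np.array([0.90, 0.10, 0.00]),
    ("audio", "bark.wav"):          np.array([0.80, 0.20, 0.10]),
    ("image", "dog_photo.jpg"):     np.array([0.85, 0.15, 0.05]),
    ("text",  "a crowing rooster"): np.array([0.10, 0.90, 0.20]),
    ("audio", "rooster.wav"):       np.array([0.15, 0.85, 0.25]),
}

def cosine(a, b):
    """Cosine similarity between two vectors in the shared space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(query_key, target_modality):
    """Cross-modal retrieval: find the closest item of another modality."""
    q = EMBEDDINGS[query_key]
    candidates = [(key, cosine(q, vec)) for key, vec in EMBEDDINGS.items()
                  if key[0] == target_modality]
    return max(candidates, key=lambda kv: kv[1])[0]

# Because everything lives in one space, an audio clip can be matched
# against text it was never explicitly paired with.
print(nearest(("audio", "rooster.wav"), "text"))
```

In this sketch, the rooster audio clip retrieves the rooster text description purely by vector proximity, which is the property that lets a single joint space substitute for training on every pairwise modality combination.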

Meta thinks the technology will eventually expand beyond its current six “senses.” “Although we investigated six modalities in our current research, we believe that introducing new modalities that connect many senses, such as touch, speech, smell, and fMRI brain signals, will enable richer human-centric AI models.” Developers interested in exploring the new AI model can start with the open-source code Meta has released.

Source: Engadget

Source: PC Press
