Generative AI has been a major topic in recent months. Meta is one of the companies pushing the development of new systems, and its open-source project ImageBind certainly deserves attention. While most systems combine one or two types of data (text to text, as in ChatGPT; text to image, as in DALL-E; and so on), ImageBind can connect up to six different modalities. This is meant to bring it closer to how a person perceives the world. A person can, for example, look at a picture of a car and guess what sound it will make, judge from a picture how cold or warm an environment is, imagine a visual scene from a description, and so on.

ImageBind combines data not only in the form of text, image/video and audio, but also from depth sensors (various kinds of 3D cameras), thermal sensors (infrared radiation), and inertial measurement units (IMUs), which record acceleration and motion. This allows it to predict how objects will sound, how they look in 2D and 3D, how warm or cold they are, and how they move. The multimodal system is open-source and invites other developers to build new systems capable of creating “immersive virtual worlds”.
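To make the idea of "connecting" modalities more concrete, here is a minimal sketch of cross-modal retrieval. It assumes, as the ImageBind project describes, that every modality is mapped into one shared vector space, so an image can be compared directly with audio clips. The vectors below are made up by hand and do not come from ImageBind, and the names `cosine_similarity` and `retrieve` are hypothetical helpers chosen for this illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query, candidates):
    """Return the label of the candidate vector most similar to the query."""
    return max(candidates, key=lambda label: cosine_similarity(query, candidates[label]))


# Assumption: hand-made 3-dimensional vectors standing in for real embeddings.
# A real model would produce much longer vectors from actual images and audio.
image_of_car = [0.9, 0.1, 0.0]
audio_clips = {
    "engine": [0.8, 0.2, 0.1],
    "birdsong": [0.1, 0.9, 0.2],
    "rain": [0.0, 0.2, 0.9],
}

# Pick the audio clip whose vector lies closest to the image vector.
best_match = retrieve(image_of_car, audio_clips)
print(best_match)
```

With these toy vectors the car image lands closest to the engine sound. In a real system the same comparison would run on embeddings produced by the model rather than on hand-written numbers.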

In principle, the system should make it possible to infer an object's properties in one modality from data in another; however, this will not always be easy. Depth and thermal data, for example, tend to correlate well with images, whereas non-visual modalities are worse off (audio and motion data, for example, show a somewhat weaker correlation).

