Meta's open-source ImageBind AI aims to mimic human perception
· 2023-05-10

ImageBind could eventually lead to leaps forward in accessibility and in the creation of mixed reality environments.



Meta is open-sourcing an AI tool called ImageBind that predicts connections between data in a way similar to how humans perceive or imagine an environment. While image generators like Midjourney, Stable Diffusion and DALL-E 2 pair words with images, allowing you to generate visual scenes based only on a text description, ImageBind casts a broader net. It can link text, images/videos, audio, 3D measurements (depth), temperature data (thermal) and motion data (from inertial measurement units), and it does this without having to first train on every possible combination of them. It’s an early stage of a framework that could eventually generate complex environments from an input as simple as a text prompt, image or audio recording (or some combination of the three).


You could view ImageBind as moving machine learning closer to human learning. For example, if you’re standing in a stimulating environment like a busy city street, your brain (largely unconsciously) absorbs the sights, sounds and other sensory experiences to infer information about passing cars and pedestrians, tall buildings, weather and much more. Humans and other animals evolved to process this data for our genetic advantage: survival and passing on our DNA. (The more aware you are of your surroundings, the more you can avoid danger and adapt to your environment for better survival and prosperity.) As computers get closer to mimicking animals’ multi-sensory connections, they can use those links to generate fully realized scenes based only on limited chunks of data.


So, while you can use Midjourney to prompt “a basset hound wearing a Gandalf outfit while balancing on a beach ball” and get a relatively realistic photo of this bizarre scene, a multimodal AI tool like ImageBind may eventually create a video of the dog with corresponding sounds, along with a detailed suburban living room, the room’s temperature and the precise locations of the dog and anyone else in the scene. “This creates distinctive opportunities to create animations out of static images by combining them with audio prompts,” Meta researchers said today in a developer-focused blog post. “For example, a creator could couple an image with an alarm clock and a rooster crowing, and use a crowing audio prompt to segment the rooster or the sound of an alarm to segment the clock and animate both into a video sequence.”




As for what else one could do with this new toy, it points clearly to one of Meta’s core ambitions: VR, mixed reality and the metaverse. For example, imagine a future headset that can construct fully realized 3D scenes (with sound, movement, etc.) on the fly. Developers of virtual games could eventually use it to take much of the legwork out of their design process. Similarly, content creators could make immersive videos with realistic soundscapes and movement based on only text, image or audio input. It’s also easy to imagine a tool like ImageBind opening new doors in the accessibility space, generating real-time multimedia descriptions to help people with vision or hearing disabilities better perceive their immediate environments.


“In typical AI systems, there is a specific embedding (that is, vectors of numbers that can represent data and their relationships in machine learning) for each respective modality,” said Meta. “ImageBind shows that it’s possible to create a joint embedding space across multiple modalities without needing to train on data with every different combination of modalities. This is important because it’s not feasible for researchers to create datasets with samples that contain, for example, audio data and thermal data from a busy city street, or depth data and a text description of a seaside cliff.”
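To make the idea of a joint embedding space more concrete, here is a minimal Python sketch. It is not Meta's code and does not use the ImageBind library. The vectors are hand-written toy values standing in for encoder outputs, the file names and the helper names (`cosine`, `nearest`) are hypothetical, and the pair count assumes, as the name ImageBind suggests, that images serve as the common anchor the other modalities are aligned to.

```python
import math
from itertools import combinations

# The six modalities named in the article.
MODALITIES = ["image", "text", "audio", "depth", "thermal", "imu"]

# Paired datasets needed if every pair of modalities had to be seen together,
# versus pairing each non-image modality with images only (an assumption
# about the training setup, not something this sketch trains).
all_pairs = len(list(combinations(MODALITIES, 2)))
image_anchored_pairs = len(MODALITIES) - 1


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional embeddings (hypothetical values). In a joint space,
# items from different modalities that describe the same thing sit close
# together, so they can be compared directly.
EMBEDDINGS = {
    ("text", "a dog barking"): [0.9, 0.1, 0.0],
    ("text", "waves on a beach"): [0.0, 0.2, 0.9],
    ("audio", "bark.wav"): [0.8, 0.2, 0.1],
    ("audio", "surf.wav"): [0.1, 0.1, 0.95],
}


def nearest(query_key, target_modality):
    """Return the label of the closest item in another modality."""
    query = EMBEDDINGS[query_key]
    candidates = [(k, v) for k, v in EMBEDDINGS.items() if k[0] == target_modality]
    best_key, _ = max(candidates, key=lambda kv: cosine(query, kv[1]))
    return best_key[1]


if __name__ == "__main__":
    print(all_pairs, image_anchored_pairs)
    print(nearest(("audio", "bark.wav"), "text"))
    print(nearest(("audio", "surf.wav"), "text"))
```

In this toy setup an audio clip can retrieve a matching text description even though no audio-text pair was ever listed explicitly. That is the kind of cross-modal link the quote describes, here reduced to a nearest-neighbor lookup by cosine similarity.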


Meta views the tech as eventually expanding beyond its current six “senses,” so to speak. “While we explored six modalities in our current research, we believe that introducing new modalities that link as many senses as possible — like touch, speech, smell, and brain fMRI signals — will enable richer human-centric AI models.” Developers interested in exploring this new sandbox can start by diving into Meta’s open-source code.












