Meta's open-source ImageBind AI aims to mimic human perception
· 2023-05-10

ImageBind could eventually lead to leaps forward in accessibility and creating mixed reality environments.

c1d70267de6d592e6f99e57a4e9d9e7f.png


Meta is open-sourcing an AI tool called ImageBind that predicts connections between data similar to how humans perceive or imagine an environment. While image generators like Midjourney, Stable Diffusion and DALL-E 2 pair words with images, allowing you to generate visual scenes based only on a text description, ImageBind casts a broader net. It can link text, images / videos, audio, 3D measurements (depth), temperature data (thermal), and motion data (from inertial measurement units) — and it does this without having to first train on every possibility. It’s an early stage of a framework that could eventually generate complex environments from an input as simple as a text prompt, image or audio recording (or some combination of the three).


You could view ImageBind as moving machine learning closer to human learning. For example, if you’re standing in a stimulating environment like a busy city street, your brain (largely unconsciously) absorbs the sights, sounds and other sensory experiences to infer information about passing cars and pedestrians, tall buildings, weather and much more. Humans and other animals evolved to process this data for our genetic advantage: survival and passing on our DNA. (The more aware you are of your surroundings, the more you can avoid danger and adapt to your environment for better survival and prosperity.) As computers get closer to mimicking animals’ multi-sensory connections, they can use those links to generate fully realized scenes based only on limited chunks of data.


So, while you can use Midjourney to prompt “a basset hound wearing a Gandalf outfit while balancing on a beach ball” and get a relatively realistic photo of this bizarre scene, a multimodal AI tool like ImageBind may eventually create a video of the dog with corresponding sounds, including a detailed suburban living room, the room’s temperature and the precise locations of the dog and anyone else in the scene. “This creates distinctive opportunities to create animations out of static images by combining them with audio prompts,” Meta researchers said today in a developer-focused blog post. “For example, a creator could couple an image with an alarm clock and a rooster crowing, and use a crowing audio prompt to segment the rooster or the sound of an alarm to segment the clock and animate both into a video sequence.”


128af8b4fa04c5183313f9f3e75f89f8.png


As for what else one could do with this new toy, it points clearly to one of Meta’s core ambitions: VR, mixed reality and the metaverse. For example, imagine a future headset that can construct fully realized 3D scenes (with sound, movement, etc.) on the fly. Or, virtual game developers could perhaps eventually use it to take much of the legwork out of their design process. Similarly, content creators could make immersive videos with realistic soundscapes and movement based on only text, image or audio input. It’s also easy to imagine a tool like ImageBind opening new doors in the accessibility space, generating real-time multimedia descriptions to help people with vision or hearing disabilities better perceive their immediate environments.


“In typical AI systems, there is a specific embedding (that is, vectors of numbers that can represent data and their relationships in machine learning) for each respective modality,” said Meta. “ImageBind shows that it’s possible to create a joint embedding space across multiple modalities without needing to train on data with every different combination of modalities. This is important because it’s not feasible for researchers to create datasets with samples that contain, for example, audio data and thermal data from a busy city street, or depth data and a text description of a seaside cliff.”


Meta views the tech as eventually expanding beyond its current six “senses,” so to speak. “While we explored six modalities in our current research, we believe that introducing new modalities that link as many senses as possible — like touch, speech, smell, and brain fMRI signals — will enable richer human-centric AI models.” Developers interested in exploring this new sandbox can start by diving into Meta’s open-source code.













熱門文章
越南在線博彩業政策收緊 催生市場新機遇
東南亞資訊
菲律賓網絡賭博和加密貨幣仍構成持續的洗錢風險
東南亞資訊
超級PAC籌資4800萬美元:體育博彩勢力加碼
合規與政策
哈薩克計劃對線上賭場促銷活動進行處罰
合規與政策
斯里蘭卡博弈產業大轉型,官方:劍指南亞拉斯維加斯
合規與政策
印度最高法院受理公益訴訟,要求全國禁封「偽裝」成社交遊戲的賭博平台
合規與政策
JILI 宣佈與全球板球傳奇 AB de Villiers(ABD)達成重磅戰略合作
體育遊戲
菲律賓博彩技術賽道迎來新變局,B2B 供應模式加速滲透
東南亞資訊
巴西擬將博弈稅率提高至24% 稅收將用於社保與醫療領域
合規與政策
GGC Awards 2026 璀璨科倫坡:致敬 iGaming 行業的領航者與創新力量
灰度頭條
橫跨全球6個城市,灰度8場派對邀你共看世界盃,重塑高質量社交新場景
灰度頭條
英國確認各垂直行業的賭博稅稅率
合規與政策
西班牙監管機構警告在線賭博平臺存在身份盜竊行為
合規與政策
越南博彩管控逐步放寬,惟本土需求仍顯乏力
東南亞資訊
印第安納州在線賭場法案在眾議院委員會停滯不前
合規與政策
首頁
遊戲
合作
發現
我的