Open Embeddings

The Problem
Embeddings make it easier for humans, agents, and scripts to find relevant content without needing access to the original material.
What problem is Open Embeddings solving? Humans, agents, and scripts want to find maximally relevant content, without having to consume unrelated content, in a publisher-agnostic way. Today, keyword searches against content providers leave context on the table. Similarity searches work well, but generating the embeddings requires downloading the content first. That wastes time and bandwidth, and it can lead to data capture behind walled gardens.
How do Open Embeddings solve this problem?
- Create a standardized file format for content providers to expose their content in an AI-native way
- Create a standardized way for indexers to provide embeddings and metadata for content discovery
- Create a standardized way for agents and scripts to query embeddings by URI
See the examples page for sample use cases.
See the RFC page for a draft specification.
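To make the idea concrete, here is a minimal sketch of what a published manifest and a model-specific lookup might look like. This is not the draft RFC's schema; every field name, model identifier, and URL below is a hypothetical placeholder.

```python
# A minimal sketch of what an Open Embeddings manifest *might* look like.
# All field names and identifiers are illustrative assumptions, not the draft RFC.
import json

manifest = {
    "uri": "https://example.com/articles/42",           # canonical content URI
    "content_hash": "sha256:9f86d081884c7d65...",        # ties embeddings to a content version
    "embeddings": [
        {
            "model": "open-model/text-v1",               # hypothetical model identifier
            "dimensions": 768,
            "vectors_url": "https://example.com/articles/42.vec",
        },
        {
            "model": "open-model/multimodal-v2",
            "dimensions": 1024,
            "vectors_url": "https://example.com/articles/42.mm.vec",
        },
    ],
}


def embeddings_for(manifest: dict, model: str):
    """Return the embedding entry for a given model, if the publisher exposes one."""
    return next((e for e in manifest["embeddings"] if e["model"] == model), None)


print(json.dumps(embeddings_for(manifest, "open-model/text-v1"), indent=2))
```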
Community Benefits
What are the expected community benefits of Open Embeddings?
- Content Providers pay less in publishing costs (bandwidth, compute, storage) to expose their content to AI systems.
- Developers can easily access high-quality embeddings for content without the need to download the content themselves.
- End Users can “collect” embeddings without worrying about which model or platform they are using, allowing for a more seamless experience.
Target Audience
Developers, Content Creators, AI Platforms
Value Proposition
- Developers: Easier access to high-quality embeddings for content without re-embedding costs.
- Content Creators: Greater control over how their content is interpreted and used by AI, with the ability to expose multiple sets of embeddings.
- AI Platforms: A standardized way to ingest and understand web content, reducing the need for re-embedding and allowing for more efficient content discovery and delivery.
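As a rough illustration of the developer value proposition: once embeddings are published, a client can rank content by similarity without ever fetching the content itself. The fetch_embedding helper below is hypothetical and stubbed with random vectors so the sketch runs offline; a real client would resolve each URI to its published vectors.

```python
# Sketch of the consumer side: rank already-published embeddings against a query
# vector without downloading the underlying content. The fetch step is stubbed.
import numpy as np


def fetch_embedding(uri: str) -> np.ndarray:
    """Hypothetical helper: resolve a content URI to its precomputed embedding."""
    rng = np.random.default_rng(abs(hash(uri)) % (2**32))  # offline stand-in
    return rng.standard_normal(768)


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


uris = [
    "https://example.com/articles/42",
    "https://example.com/articles/43",
    "https://example.com/articles/44",
]
query = fetch_embedding("https://example.com/queries/how-do-embeddings-work")

ranked = sorted(uris, key=lambda u: cosine(query, fetch_embedding(u)), reverse=True)
for uri in ranked:
    print(f"{cosine(query, fetch_embedding(uri)):+.3f}  {uri}")
```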
Hard Implementation Problems to Solve at Scale
- Model sprawl: can we build on recent academic work to generalize a framework for transforming between embedding spaces, including for multi-modal models?
- Cache invalidation: trusting the embeddings and metadata we receive, and knowing when cached embeddings go stale as the underlying content changes
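One possible angle on the model-sprawl problem, assuming a shared set of anchor documents embedded by both models: learn an orthogonal linear map between the two spaces (classic Procrustes alignment). This is a sketch of the general idea using synthetic data in place of real model output, not a proposal for the spec.

```python
# Align two embedding spaces with orthogonal Procrustes, using anchor pairs.
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are the same N anchor documents embedded by two different models.
N, d_src, d_tgt = 500, 384, 384          # equal dims keep the map a square rotation
X = rng.standard_normal((N, d_src))      # embeddings from model A
true_rotation = np.linalg.qr(rng.standard_normal((d_src, d_tgt)))[0]
Y = X @ true_rotation + 0.01 * rng.standard_normal((N, d_tgt))  # model B (noisy)

# Orthogonal Procrustes: W = argmin ||XW - Y||_F subject to W orthogonal.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

# A new vector from model A can now be projected into model B's space.
new_vec = rng.standard_normal(d_src)
projected = new_vec @ W
print("relative alignment error:", np.linalg.norm(X @ W - Y) / np.linalg.norm(Y))
```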
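For the cache-invalidation side, one plausible convention (again an assumption, not the RFC): the publisher advertises a hash of the content version the embeddings were computed from, the consumer caches that hash alongside the vectors, and revalidation only needs to re-read the small manifest rather than the content. Signing the manifest would be a separate, complementary step for establishing trust.

```python
# Consumer-side staleness check: cache the hash advertised at fetch time and
# compare it against the hash the publisher advertises now.
import hashlib

cache: dict = {}  # keyed by URI: vectors plus the hash they were computed against


def remember(uri: str, vectors: list, advertised_hash: str) -> None:
    cache[uri] = {"vectors": vectors, "content_hash": advertised_hash}


def is_stale(uri: str, latest_manifest: dict) -> bool:
    """True if we have no cached entry or the publisher's hash has changed."""
    entry = cache.get(uri)
    return entry is None or entry["content_hash"] != latest_manifest["content_hash"]


def fake_manifest(body: bytes) -> dict:
    """Simulate a publisher whose content (and therefore hash) changes over time."""
    return {"content_hash": "sha256:" + hashlib.sha256(body).hexdigest()}


uri = "https://example.com/articles/42"
remember(uri, [0.1, 0.2, 0.3], fake_manifest(b"v1 body")["content_hash"])

print(is_stale(uri, fake_manifest(b"v1 body")))   # False: cached embeddings still match
print(is_stale(uri, fake_manifest(b"v2 body")))   # True: content changed, refetch
```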
Our Vision
If our thesis resonates with you, we hope you will help us:
- Define a great open format / spec for content providers to leverage multiple models
- Update commonly used content publishing tools to support the format securely
- Generate a distributed corpus of cross-space material to allow transitions between closed-model and open-model encoded data
Our goal is to find a sustainable way to lower the barrier to entry for new AI agents, scripts, and services, while easing the load created by indexers pounding content providers to perform effectively the same operation across multiple models.
Get Involved
We encourage community participation through multiple channels:
- Read the Draft RFC - Review and comment on the technical specification
- See Examples - Explore sample implementations and use cases
- Provide Feedback - Share your thoughts and suggestions
- View our Roadmap - See what’s planned for the future
- Contribute to the project - Join our open-source development efforts
This project is run as a non-profit, funded through donations and open-source grants, with the goal of creating an open and accessible internet for AI.