RFC
Open Embeddings RFC - Draft Specification
This is a living document. We encourage community feedback and contributions through pull requests and issues.
Abstract
The Open Embeddings specification defines a standardized JSON format for content providers to expose embeddings of their content, enabling AI-native content discovery without requiring full content download. This specification addresses the growing need for efficient content similarity search while reducing bandwidth costs and preventing data capture behind walled gardens.
Motivation
Problem Statement
Current content discovery methods face several challenges:
- Bandwidth Waste: Similarity searches require downloading content to generate embeddings locally
- Computational Cost: Re-embedding content across different models and platforms is expensive
- Walled Gardens: Data capture behind proprietary systems limits open access
- Model Sprawl: Different embedding models create incompatible representations
Solution Overview
Open Embeddings provides:
- Standardized JSON format for embedding distribution
- Publisher-agnostic content discovery
- Reduced bandwidth and computational requirements
- Support for multiple embedding models per content item
Technical Specification
File Format
Open Embeddings uses JSON format with the following structure:
{
  "version": "1.0",
  "metadata": {
    "generated": "2024-01-15T10:30:00Z",
    "generator": "open-embeddings-tool/1.0",
    "license": "CC-BY-4.0"
  },
  "content": [
    {
      "uri": "/path/to/content",
      "content_type": "text/html",
      "last_modified": "2024-01-15T09:00:00Z",
      "title": "Content Title",
      "embeddings": [
        {
          "model": "text-embedding-ada-002",
          "version": "1.0",
          "dimensions": 1536,
          "vector": [0.023, -0.019, 0.041, "..."],
          "metadata": {
            "chunk_size": 512,
            "overlap": 0,
            "language": "en"
          }
        }
      ]
    }
  ]
}
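As a rough illustration of how a publisher might produce this structure, the sketch below assembles a document from an already-computed vector. The helper names (build_embedding, build_document), the model name "toy-model", and the generator string are assumptions made for this example and are not part of the specification. The one design choice worth noting is that "dimensions" is derived from the vector so the two cannot drift apart.

```python
import json
from datetime import datetime, timezone


def build_embedding(model, model_version, vector,
                    chunk_size=512, overlap=0, language="en"):
    """Build one entry of the "embeddings" array (hypothetical helper)."""
    return {
        "model": model,
        "version": model_version,
        # Derive the declared size from the vector itself.
        "dimensions": len(vector),
        "vector": list(vector),
        "metadata": {
            "chunk_size": chunk_size,
            "overlap": overlap,
            "language": language,
        },
    }


def build_document(items, generator="example-generator/0.1",
                   license_id="CC-BY-4.0"):
    """Wrap content items in the top-level structure (hypothetical helper)."""
    generated = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "version": "1.0",
        "metadata": {
            "generated": generated,
            "generator": generator,
            "license": license_id,
        },
        "content": list(items),
    }


# A three-value toy vector stands in for a real model output.
item = {
    "uri": "/path/to/content",
    "content_type": "text/html",
    "last_modified": "2024-01-15T09:00:00Z",
    "title": "Content Title",
    "embeddings": [build_embedding("toy-model", "1.0", [0.023, -0.019, 0.041])],
}
document = build_document([item])
serialized = json.dumps(document, indent=2)
```

The serialized string is what would be written to the open-embeddings.json file.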
File Locations
Open Embeddings files SHOULD be accessible at:
- /.well-known/open-embeddings.json (preferred)
- /open-embeddings.json (fallback)
Hard Implementation Problems
Model Sprawl: Can we build on recent academic work to generalize a framework for transforming between embedding spaces for multi-modal models?
Cache Invalidation: How can consumers trust the embeddings and metadata, that is, verify that the content is still fresh and the embeddings are accurate?
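One partial angle on the freshness question can be sketched as follows. The sketch assumes that a consumer can learn the content's current modification time from somewhere else (for example an HTTP Last-Modified header, converted to the same timestamp format) and compares it with the "last_modified" value recorded in the embeddings file. The function names are hypothetical, and this says nothing about whether the published vector is accurate.

```python
from datetime import datetime


def parse_timestamp(value: str) -> datetime:
    """Parse timestamps such as "2024-01-15T09:00:00Z" into aware datetimes."""
    # datetime.fromisoformat only accepts a trailing "Z" from Python 3.11 on,
    # so normalise it to an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_stale(entry: dict, content_last_modified: str) -> bool:
    """Return True when the content changed after the entry was recorded.

    entry is one item of the "content" array; content_last_modified is the
    modification time observed by the consumer (an assumption of this sketch).
    """
    recorded = parse_timestamp(entry["last_modified"])
    observed = parse_timestamp(content_last_modified)
    return observed > recorded


entry = {"uri": "/path/to/content", "last_modified": "2024-01-15T09:00:00Z"}
changed_later = is_stale(entry, "2024-02-01T00:00:00Z")
unchanged = is_stale(entry, "2024-01-15T09:00:00Z")
```

A stale entry would be a signal to fall back to fetching the content itself or to skip the item.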
References
- RFC 2119 - Key words for use in RFCs
- RFC 8259 - The JavaScript Object Notation (JSON) Data Interchange Format
- RFC 8615 - Well-Known Uniform Resource Identifiers (URIs)
Contributing
This RFC is open for community input. Please:
- Submit issues for questions or suggestions
- Create pull requests for specific changes
- Join discussions on the project repository
Last updated: [Current Date]
Examples
This page demonstrates practical implementations and use cases for the Open Embeddings specification.
Real-World Use Case: The Self-Improvement Video Pipeline
The Challenge: Building a script to collect YouTube/Vimeo/TikTok self-improvement videos for “the path to my best self” without overwhelming a laptop with embedding computation.
The Open Embeddings Solution:
- Video platforms expose embeddings via open-embeddings.json
- Your script encodes the query “self-improvement techniques”
- Calculate similarity against pre-computed embeddings
- Receive ranked, relevant videos for further screening
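The ranking step in the list above could look roughly like this. It is a minimal sketch that assumes the query has already been encoded with the same model the platform used, works on two-dimensional toy vectors, and uses made-up video URIs. When an item carries several embeddings, the best-scoring one represents it.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_videos(query_vector, content_items, top_k=3):
    """Return the top_k (score, uri) pairs, best match first."""
    scored = []
    for item in content_items:
        scores = [
            cosine_similarity(query_vector, emb["vector"])
            for emb in item.get("embeddings", [])
            # Vectors of a different size cannot be compared with the query.
            if len(emb["vector"]) == len(query_vector)
        ]
        if scores:
            scored.append((max(scores), item["uri"]))
    return heapq.nlargest(top_k, scored)


# Hypothetical entries, shaped like the "content" array of the file format.
videos = [
    {"uri": "/videos/morning-routine", "embeddings": [{"vector": [0.9, 0.1]}]},
    {"uri": "/videos/cat-compilation", "embeddings": [{"vector": [0.0, 1.0]}]},
    {"uri": "/videos/habit-stacking", "embeddings": [{"vector": [0.5, 0.5]}]},
]
ranked = rank_videos([1.0, 0.0], videos, top_k=2)
```

Only the URIs in ranked would then need to be downloaded for further screening.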
Benefits:
- No need to download every video to check relevance
- Reduced bandwidth and compute costs
- Publisher-agnostic discovery across platforms
Sample Implementation
Basic open-embeddings.json
Here’s a simple example of an open-embeddings.json file for a blog:
{
  "version": "1.0",
  "metadata": {
    "generated": "2024-01-15T10:30:00Z",
    "generator": "blog-embeddings-tool/1.0",
    "license": "CC-BY-ND-4.0"
  },
  "content": [
    {
      "uri": "/blog/ai-future",
      "content_type": "text/html",
      "last_modified": "2024-01-15T09:00:00Z",
      "embeddings": [
        {
          "model": "text-embedding-ada-002",
          "version": "1.0",
          "dimensions": 1536,
          "vector": [0.023, -0.019, 0.041, "... (1533 more values)"],
          "metadata": {
            "chunk_size": 512,
            "overlap": 0,
            "language": "en"
          }
        }
      ]
    }
  ]
}
Use Cases
1. Content Discovery for Developers
Scenario: A developer wants to find blog posts about “machine learning deployment” across multiple tech blogs.
Solution: Instead of scraping and processing each blog’s content, the developer can:
- Check each blog’s open-embeddings.json file
- Generate an embedding for “machine learning deployment”
- Calculate similarity scores against pre-computed embeddings
- Rank and filter results by relevance
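The following sketch shows one way the steps above might be combined across several blogs. It assumes the files have already been fetched and parsed into a dictionary keyed by each blog's base URL, and that the query vector comes from the model named in query_model. The blog URLs, the model names, and the search_blogs function are invented for illustration. Embeddings from other models are skipped, since similarity scores across different embedding spaces are not meaningful.

```python
import math
from urllib.parse import urljoin


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search_blogs(documents, query_vector, query_model, threshold=0.8):
    """Rank posts from several parsed open-embeddings.json documents."""
    results = []
    for base_url, document in documents.items():
        for item in document.get("content", []):
            for emb in item.get("embeddings", []):
                vector = emb.get("vector", [])
                # Only compare vectors produced by the query's own model.
                if emb.get("model") != query_model or len(vector) != len(query_vector):
                    continue
                score = cosine_similarity(query_vector, vector)
                if score >= threshold:
                    results.append({
                        "url": urljoin(base_url, item["uri"]),
                        "similarity": score,
                    })
    results.sort(key=lambda r: r["similarity"], reverse=True)
    return results


documents = {
    "https://blog-a.example": {"content": [
        {"uri": "/posts/ml-deployment",
         "embeddings": [{"model": "toy-model", "vector": [0.9, 0.1]}]},
    ]},
    "https://blog-b.example": {"content": [
        {"uri": "/posts/gardening",
         "embeddings": [{"model": "toy-model", "vector": [0.0, 1.0]}]},
        {"uri": "/posts/serving-models",
         "embeddings": [{"model": "other-model", "vector": [1.0, 0.0]}]},
    ]},
}
hits = search_blogs(documents, [1.0, 0.0], "toy-model")
```

Here the post published with "other-model" is ignored even though its vector matches the query exactly, because the two vectors live in different spaces.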
2. AI Platform Content Ingestion
Scenario: An AI platform needs to index web content for semantic search.
Solution: The platform can:
- Discover websites with Open Embeddings files
- Download embeddings instead of full content
- Use embeddings for semantic search and recommendations
- Reduce bandwidth and processing costs significantly
3. Content Creator Control
Scenario: A content creator wants to provide multiple representations of their content for different AI systems.
Solution: The creator can:
- Generate embeddings with different models (e.g., general-purpose, domain-specific)
- Provide different granularities (paragraph-level, document-level)
- Include metadata about intended use cases
- Control how their content is represented in AI systems
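On the consumer side, choosing among several published representations might be sketched as below. The function select_embedding and both model names are hypothetical. The sketch simply walks the consumer's preference list and returns the first embedding whose "model" field matches.

```python
def select_embedding(item, preferred_models):
    """Pick the embedding for the most preferred model the item offers.

    item is one entry of the "content" array; preferred_models is ordered
    from most to least preferred. Returns None when nothing matches.
    """
    by_model = {}
    for emb in item.get("embeddings", []):
        # Keep the first embedding listed for each model name.
        by_model.setdefault(emb["model"], emb)
    for model in preferred_models:
        if model in by_model:
            return by_model[model]
    return None


# Hypothetical item that publishes a general and a domain-specific embedding.
item = {
    "uri": "/articles/contract-law-basics",
    "embeddings": [
        {"model": "general-model", "dimensions": 2, "vector": [0.2, 0.8]},
        {"model": "legal-domain-model", "dimensions": 2, "vector": [0.7, 0.3]},
    ],
}
chosen = select_embedding(item, ["legal-domain-model", "general-model"])
fallback = select_embedding(item, ["unknown-model", "general-model"])
```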
Code Examples
Python Parser
import math
import requests
from typing import Dict, List, Optional


class OpenEmbeddingsParser:
    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip('/')

    def fetch_embeddings(self) -> Optional[Dict]:
        """Fetch open-embeddings.json from the website"""
        # Try the preferred well-known location first, then the fallback.
        urls = [
            f"{self.base_url}/.well-known/open-embeddings.json",
            f"{self.base_url}/open-embeddings.json"
        ]
        for url in urls:
            try:
                # A timeout keeps an unresponsive server from blocking forever.
                response = requests.get(url, timeout=10)
                if response.status_code == 200:
                    return response.json()
            except (requests.RequestException, ValueError):
                # Network failure or a body that is not valid JSON: try the next URL.
                continue
        return None

    def find_similar_content(self, query_embedding: List[float],
                             threshold: float = 0.8) -> List[Dict]:
        """Find content similar to the query embedding"""
        embeddings_data = self.fetch_embeddings()
        if not embeddings_data:
            return []
        similar_content = []
        for content in embeddings_data.get('content', []):
            for embedding in content.get('embeddings', []):
                vector = embedding.get('vector', [])
                # Vectors of a different size cannot be compared with the query.
                if len(vector) != len(query_embedding):
                    continue
                similarity = self.cosine_similarity(query_embedding, vector)
                if similarity >= threshold:
                    similar_content.append({
                        'uri': content['uri'],
                        'title': content.get('title', ''),
                        'similarity': similarity,
                        'model': embedding.get('model', '')
                    })
        return sorted(similar_content, key=lambda x: x['similarity'], reverse=True)

    def cosine_similarity(self, a: List[float], b: List[float]) -> float:
        """Calculate cosine similarity between two vectors"""
        dot_product = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        # Avoid division by zero for all-zero vectors.
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot_product / (norm_a * norm_b)


# Usage example
# query_embedding must be a list of floats produced by the same model as the
# published vectors; generating it is outside the scope of this parser.
parser = OpenEmbeddingsParser("https://example.com")
similar_content = parser.find_similar_content(query_embedding)
JavaScript Parser
class OpenEmbeddingsParser {
  constructor(baseUrl) {
    this.baseUrl = baseUrl.replace(/\/$/, '');
  }

  async fetchEmbeddings() {
    const urls = [
      `${this.baseUrl}/.well-known/open-embeddings.json`,
      `${this.baseUrl}/open-embeddings.json`
    ];
    for (const url of urls) {
      try {
        const response = await fetch(url);
        if (response.ok) {
          return await response.json();
        }
      } catch (error) {
        continue;
      }
    }
    return null;
  }

  async findSimilarContent(queryEmbedding, threshold = 0.8) {
    const embeddingsData = await this.fetchEmbeddings();
    if (!embeddingsData) return [];
    const similarContent = [];
    for (const content of embeddingsData.content || []) {
      for (const embedding of content.embeddings || []) {
        const similarity = this.cosineSimilarity(
          queryEmbedding,
          embedding.vector
        );
        if (similarity >= threshold) {
          similarContent.push({
            uri: content.uri,
            title: content.title || '',
            similarity: similarity,
            model: embedding.model
          });
        }
      }
    }
    return similarContent.sort((a, b) => b.similarity - a.similarity);
  }

  cosineSimilarity(a, b) {
    const dotProduct = a.reduce((sum, val, i) => sum + val * b[i], 0);
    const normA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
    const normB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
    return dotProduct / (normA * normB);
  }
}

// Usage example
const parser = new OpenEmbeddingsParser('https://example.com');
const similarContent = await parser.findSimilarContent(queryEmbedding);
Testing Your Implementation
Validation Checklist
- JSON file is valid and well-formed
- File is accessible at specified locations
- All required fields are present
- Embedding dimensions match declared values
- Metadata includes necessary information
- CORS headers are properly configured
- File size is reasonable for bandwidth considerations
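Several items on this checklist can be checked offline. The sketch below is one possible validator, not a normative one: the draft does not state which fields are required, so the REQUIRED_* tuples are an assumption based on the example file, and the 5 MB size budget is an arbitrary placeholder. Accessibility and CORS headers need a live HTTP check and are not covered.

```python
import json

# Assumed required fields; the draft does not yet define these normatively.
REQUIRED_TOP_LEVEL = ("version", "metadata", "content")
REQUIRED_ITEM = ("uri", "embeddings")
REQUIRED_EMBEDDING = ("model", "dimensions", "vector")


def validate_document(raw_text, max_bytes=5_000_000):
    """Return a list of human-readable problems; an empty list means none found."""
    try:
        document = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(document, dict):
        return ["top-level value must be a JSON object"]

    problems = []
    size = len(raw_text.encode("utf-8"))
    if size > max_bytes:
        problems.append(f"file is {size} bytes, above the {max_bytes} byte budget")
    for field in REQUIRED_TOP_LEVEL:
        if field not in document:
            problems.append(f"missing top-level field: {field}")

    for i, item in enumerate(document.get("content", [])):
        for field in REQUIRED_ITEM:
            if field not in item:
                problems.append(f"content[{i}]: missing field: {field}")
        for j, emb in enumerate(item.get("embeddings", [])):
            where = f"content[{i}].embeddings[{j}]"
            missing = [f for f in REQUIRED_EMBEDDING if f not in emb]
            if missing:
                problems.append(f"{where}: missing fields: {', '.join(missing)}")
                continue
            vector = emb["vector"]
            numeric = all(
                isinstance(v, (int, float)) and not isinstance(v, bool)
                for v in vector
            )
            if not numeric:
                problems.append(f"{where}: vector contains non-numeric values")
            elif len(vector) != emb["dimensions"]:
                problems.append(
                    f"{where}: dimensions is {emb['dimensions']} "
                    f"but vector has {len(vector)} values"
                )
    return problems


sample = json.dumps({
    "version": "1.0",
    "metadata": {},
    "content": [{"uri": "/a", "embeddings": [
        {"model": "toy-model", "dimensions": 4, "vector": [0.1, 0.2, 0.3]},
    ]}],
})
problems = validate_document(sample)
```

Note that an abbreviated vector containing a "..." placeholder, as in the illustrative files on this page, would be reported as non-numeric.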
Common Pitfalls
- Large File Sizes: Embedding vectors can make files very large. Consider:
  - Breaking large sites into multiple files
  - Using compression
  - Implementing pagination for large content sets
- Outdated Embeddings: Ensure embeddings are regenerated when content changes
- Model Compatibility: Clearly specify model versions and parameters
- Security: Be cautious about exposing sensitive information in embeddings
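For the file-size pitfall, the two simplest mitigations can be sketched together: splitting the content list across several files and gzip-compressing each one. The draft does not define a multi-file or pagination layout, so the sharding scheme and the function names here are purely illustrative assumptions.

```python
import gzip
import json


def shard_document(document, items_per_file):
    """Split one document into several, each holding at most items_per_file items."""
    if items_per_file < 1:
        raise ValueError("items_per_file must be at least 1")
    items = document.get("content", [])
    shards = []
    for start in range(0, len(items), items_per_file):
        # Copy every top-level field except "content" into each shard.
        shard = {key: value for key, value in document.items() if key != "content"}
        shard["content"] = items[start:start + items_per_file]
        shards.append(shard)
    return shards


def compress_document(document):
    """Serialize compactly and gzip the result."""
    compact = json.dumps(document, separators=(",", ":"))
    return gzip.compress(compact.encode("utf-8"))


# Five toy items with tiny vectors stand in for a large site.
document = {
    "version": "1.0",
    "metadata": {"license": "CC-BY-4.0"},
    "content": [
        {"uri": f"/posts/{n}", "embeddings": [{"model": "toy-model",
                                               "dimensions": 2,
                                               "vector": [0.1 * n, 0.5]}]}
        for n in range(5)
    ],
}
shards = shard_document(document, items_per_file=2)
payloads = [compress_document(shard) for shard in shards]
```

In practice, compression is often applied by the web server through HTTP content encoding rather than by storing compressed files, and consumers would still need some agreed way to discover the additional files.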
Next Steps
- Implement a Parser: Use the code examples above as starting points
- Generate Sample Data: Create embeddings for your content
- Test Integration: Verify your implementation works with AI platforms
- Contribute: Share your implementations with the community
Want to add your own examples? Submit feedback or contribute to the project repository.