Overview

Messages are referenced by cryptographic hash (content address) rather than by location, enabling deduplication, verifiable delivery, and IPFS-style distributed storage.

Problem Statement

Traditional email is location-addressed:

  • Messages stored on specific servers
  • Duplicates when forwarded (wasteful)
  • No verifiable proof of content
  • Centralized storage (SPOF)
  • No content-based deduplication

Vision

Traditional: Message stored at imap://msgs.global/INBOX/12345

Content-Addressed: Message is hash sha256:a3f4b2c... (stored anywhere, retrieved by hash)

Architecture

1. Content Addressing

Message Hash: SHA-256 of canonical message
IPFS CID: QmYwAPJzv5CZsnA... (multihash format)

References:
  - Parent: sha256:parent-hash
  - Attachments: [sha256:att1-hash, sha256:att2-hash]
  - Thread: sha256:thread-root-hash
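
As a minimal sketch of the addressing step, the snippet below derives a `sha256:` address from bytes that are assumed to be canonicalized already. The function name `content_address` and the sample message are illustrative, and the canonical form itself (CRLF line endings here) is an assumption, since the canonicalization rules are not specified above.

```python
import hashlib

def content_address(canonical: bytes) -> str:
    """Return a 'sha256:<hex>' address for already-canonicalized message bytes."""
    return "sha256:" + hashlib.sha256(canonical).hexdigest()

# Hypothetical canonical form: CRLF line endings, UTF-8 encoded.
message = b"From: alice@msgs.global\r\nTo: bob@msgs.global\r\n\r\nHello Bob\r\n"

address = content_address(message)
# Identical bytes always map to the same address.
same = content_address(bytes(message))
# Changing a single word produces a different address.
changed = content_address(message.replace(b"Hello", b"Hullo"))
```

Because the address depends only on the bytes, two servers that canonicalize the same way will agree on the address without coordinating.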

2. Storage Layer

                   ┌──────────────┐
                   │   IPFS/S3    │
                   │ (immutable)  │
                   └──────┬───────┘
                          │
                   ┌──────▼───────┐
                   │  Hash Index  │
                   │  (PostgreSQL)│
                   └──────┬───────┘
                          │
    ┌─────────────────────┼─────────────────────┐
    │                     │                     │
┌───▼────┐          ┌─────▼─────┐        ┌─────▼─────┐
│ User A │          │  User B   │        │  User C   │
│ Refs   │          │  Refs     │        │  Refs     │
└────────┘          └───────────┘        └───────────┘

[Only references are stored per user; content is shared]

3. Message Format

{
  "hash": "sha256:a3f4b2c8...",
  "ipfs_cid": "QmYwAPJzv5CZsnA...",
  "headers": {
    "from": "alice@msgs.global",
    "to": ["bob@msgs.global"],
    "subject": "Project Update",
    "date": "2026-03-07T20:00:00Z",
    "message-id": "unique-id@msgs.global"
  },
  "body": {
    "type": "text/plain",
    "hash": "sha256:body-hash...",
    "size": 1234
  },
  "attachments": [
    {
      "filename": "report.pdf",
      "hash": "sha256:att-hash...",
      "size": 524288,
      "ipfs_cid": "QmXbZ..."
    }
  ],
  "references": {
    "in-reply-to": "sha256:parent-hash...",
    "thread-root": "sha256:thread-hash..."
  }
}
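
One way such a record might be assembled is sketched below. It assumes the top-level hash is taken over a compact, key-sorted JSON encoding of every other field; that canonical form, along with the helper names `sha256_ref` and `build_record`, is an assumption for illustration and not something the format above prescribes.

```python
import hashlib
import json

def sha256_ref(data: bytes) -> str:
    """Format a SHA-256 digest in the 'sha256:<hex>' style used above."""
    return "sha256:" + hashlib.sha256(data).hexdigest()

def build_record(headers: dict, body: bytes, attachments: dict) -> dict:
    """Assemble a message record whose top-level hash covers all other fields."""
    record = {
        "headers": headers,
        "body": {"type": "text/plain", "hash": sha256_ref(body), "size": len(body)},
        "attachments": [
            {"filename": name, "hash": sha256_ref(data), "size": len(data)}
            for name, data in sorted(attachments.items())
        ],
    }
    # Assumed canonical form: sorted keys, compact separators, UTF-8.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    record["hash"] = sha256_ref(canonical)
    return record

record = build_record(
    {"from": "alice@msgs.global", "to": ["bob@msgs.global"], "subject": "Project Update"},
    b"Status: on track.\n",
    {"report.pdf": b"%PDF-1.7 placeholder bytes"},
)
```

Since the record embeds the body and attachment hashes, changing any attachment byte changes the attachment hash and therefore the message hash as well.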

Key Benefits

1. Deduplication

Alice sends message with attachment (2 MB) to Bob and Carol
Traditional: 4 MB stored (2x2)
Content-Addressed: 2 MB stored (1x, 2 refs)

Alice forwards Bob's message to Carol
Traditional: 2x message storage
Content-Addressed: 1 ref added
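
The first scenario can be reproduced with a small in-memory model. `DedupStore` is a hypothetical stand-in for the real storage layer, and the 2 MB attachment is a block of zero bytes used only as a placeholder.

```python
import hashlib

class DedupStore:
    """Toy content-addressed store: one copy per hash, many references."""

    def __init__(self):
        self.blobs = {}  # hash -> content, stored once
        self.refs = {}   # user -> list of hashes

    def deliver(self, user: str, content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()
        self.blobs.setdefault(digest, content)          # keep only the first copy
        self.refs.setdefault(user, []).append(digest)   # each recipient gets a ref
        return digest

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self.blobs.values())

store = DedupStore()
attachment = b"\x00" * (2 * 1024 * 1024)  # 2 MB placeholder attachment

h_bob = store.deliver("bob", attachment)
h_carol = store.deliver("carol", attachment)
```

Both deliveries return the same hash, so the store holds one 2 MB blob and two references instead of two 2 MB copies.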

2. Verifiable Delivery

# Sender proves message delivered: the recipient signs the hash and timestamp
message_hash = 'sha256:a3f4b2c...'
timestamp = '2026-03-07T20:00:00Z'
delivery_proof = {
    'message_hash': message_hash,
    'recipient': 'bob@msgs.global',
    'timestamp': timestamp,
    'signature': sign(recipient_key, message_hash + timestamp)
}

# Anyone with the recipient's public key can verify Bob received this exact message
verify(delivery_proof)  # Cryptographic proof, non-repudiable
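
The receipt flow can be sketched with the standard library alone, with one important substitution: the code below uses HMAC-SHA256 with a shared key as a stand-in for the signature. HMAC is symmetric, so it does not provide non-repudiation; a real delivery proof would need an asymmetric signature such as Ed25519. The helper names and the demo key are hypothetical.

```python
import hashlib
import hmac

def receipt_payload(message_hash: str, recipient: str, timestamp: str) -> bytes:
    # Join fields with newlines so field boundaries are unambiguous
    # (assumes no field contains a newline).
    return "\n".join((message_hash, recipient, timestamp)).encode("utf-8")

def make_receipt(key: bytes, message_hash: str, recipient: str, timestamp: str) -> dict:
    """Build a receipt; the 'signature' is an HMAC tag standing in for a real signature."""
    tag = hmac.new(key, receipt_payload(message_hash, recipient, timestamp), hashlib.sha256)
    return {
        "message_hash": message_hash,
        "recipient": recipient,
        "timestamp": timestamp,
        "signature": tag.hexdigest(),
    }

def verify_receipt(key: bytes, receipt: dict) -> bool:
    """Recompute the tag over the receipt fields and compare in constant time."""
    payload = receipt_payload(
        receipt["message_hash"], receipt["recipient"], receipt["timestamp"]
    )
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, receipt["signature"])

key = b"demo-key-not-for-production"  # hypothetical shared key
receipt = make_receipt(key, "sha256:a3f4b2c...", "bob@msgs.global", "2026-03-07T20:00:00Z")
```

This sketch also covers the recipient address in the signed payload, so a receipt cannot be replayed for a different mailbox; that is a design choice of the sketch, not something stated above.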

3. Distributed Storage

# Add the message to the local IPFS node; other nodes can then fetch and pin it
ipfs add message.eml  # -> QmYwAPJzv5CZsnA...

# Anyone with the hash can retrieve
ipfs cat QmYwAPJzv5CZsnA...

# msgs.global nodes pin important messages
# Users can pin their own messages elsewhere

4. Thread Integrity

# Entire thread is a Merkle tree
thread_root = "sha256:thread-hash"

# Verify entire conversation integrity
verify_thread_integrity(thread_root) # -> True/False

# Detect if any message in thread was tampered
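
A minimal sketch of tamper detection follows, assuming each message hash covers both the body and the parent's hash. It models only a linear reply chain, not a branching thread, and it walks from the newest message back to the root because replies are the ones that commit to their ancestors. All names here are hypothetical.

```python
import hashlib
import json
from typing import Optional

def message_hash(body: str, parent: Optional[str]) -> str:
    """Hash a message together with its parent's hash (None for the root)."""
    payload = json.dumps({"body": body, "parent": parent}, sort_keys=True).encode("utf-8")
    return "sha256:" + hashlib.sha256(payload).hexdigest()

def build_thread(bodies):
    """Chain messages so that each one commits to its parent's hash."""
    store, parent = {}, None
    for body in bodies:
        digest = message_hash(body, parent)
        store[digest] = {"body": body, "parent": parent}
        parent = digest
    return store, parent  # second value is the hash of the newest message

def verify_chain(store, tip: str) -> bool:
    """Walk from the newest message to the root, recomputing every hash."""
    current = tip
    while current is not None:
        msg = store.get(current)
        if msg is None or message_hash(msg["body"], msg["parent"]) != current:
            return False
        current = msg["parent"]
    return True

store, tip = build_thread(["Kickoff", "Reply 1", "Reply 2"])
intact = verify_chain(store, tip)
```

Editing any stored body makes its recomputed hash differ from the key it is stored under, so `verify_chain` returns False for the altered thread.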

Integration with msgs.global

Database Schema

-- Content-addressed messages
CREATE TABLE ca_messages (
    hash VARCHAR(64) PRIMARY KEY,
    ipfs_cid VARCHAR(100),
    message_data JSONB,
    content BYTEA, -- Canonical representation
    stored_at TIMESTAMP DEFAULT NOW(),
    ref_count INTEGER DEFAULT 0
);

-- User message references
CREATE TABLE user_message_refs (
    id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(id),
    message_hash VARCHAR(64) REFERENCES ca_messages(hash),
    folder VARCHAR(50), -- INBOX, Sent, etc.
    flags TEXT[], -- SEEN, FLAGGED, etc.
    received_at TIMESTAMP DEFAULT NOW()
);

-- Attachment deduplication
CREATE TABLE ca_attachments (
    hash VARCHAR(64) PRIMARY KEY,
    ipfs_cid VARCHAR(100),
    filename VARCHAR(255),
    mime_type VARCHAR(100),
    size INTEGER,
    content BYTEA,
    ref_count INTEGER DEFAULT 0
);

CREATE INDEX idx_user_refs_folder ON user_message_refs(user_id, folder);

Storage Service

import hashlib

# db, ipfs_client, canonicalize_message and parse_message are assumed
# to be provided elsewhere in the application.

class ContentAddressedStorage:
    def store_message(self, message):
        """Store message and return its hash"""
        # 1. Canonicalize message
        canonical = canonicalize_message(message)

        # 2. Hash content
        message_hash = hashlib.sha256(canonical).hexdigest()

        # 3. If the content already exists, only add a reference
        if self.exists(message_hash):
            self.increment_ref_count(message_hash)
            return message_hash

        # 4. Store attachments separately
        for att in message.attachments:
            att_hash = self.store_attachment(att)
            message.attachment_hashes.append(att_hash)

        # 5. Store to IPFS (optional)
        ipfs_cid = ipfs_client.add(canonical)

        # 6. Store to database; the first copy counts as one reference
        db.execute("""
            INSERT INTO ca_messages (hash, ipfs_cid, message_data, content, ref_count)
            VALUES (?, ?, ?, ?, 1)
        """, message_hash, ipfs_cid, message.to_json(), canonical)

        return message_hash

    def get_message(self, message_hash):
        """Retrieve message by hash"""
        # Try local database
        msg = db.query("SELECT * FROM ca_messages WHERE hash = ?", message_hash)
        if msg:
            return msg

        # Try IPFS
        if ipfs_cid := self.get_ipfs_cid(message_hash):
            content = ipfs_client.cat(ipfs_cid)
            return parse_message(content)

        return None

    def delete_message(self, user_id, message_hash):
        """Delete user reference (not content)"""
        db.execute("""
            DELETE FROM user_message_refs
            WHERE user_id = ? AND message_hash = ?
        """, user_id, message_hash)

        # Decrement ref count, never below zero
        db.execute("""
            UPDATE ca_messages
            SET ref_count = ref_count - 1
            WHERE hash = ? AND ref_count > 0
        """, message_hash)

        # Garbage collect if ref_count = 0 (optional)
        self.gc_if_needed(message_hash)
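
The reference-counting behaviour can be tried end to end with SQLite from the standard library. This is a sketch under assumptions: the tables are cut down to the columns needed, SQLite stands in for PostgreSQL, and `store_for_user` and `delete_for_user` are hypothetical helpers that are not part of the service above.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ca_messages (
        hash TEXT PRIMARY KEY, content BLOB, ref_count INTEGER NOT NULL);
    CREATE TABLE user_message_refs (
        user_id INTEGER, message_hash TEXT, folder TEXT);
""")

def store_for_user(user_id: int, canonical: bytes, folder: str = "INBOX") -> str:
    digest = hashlib.sha256(canonical).hexdigest()
    with conn:  # one transaction: content row and user ref commit together
        conn.execute(
            "INSERT OR IGNORE INTO ca_messages (hash, content, ref_count) VALUES (?, ?, 0)",
            (digest, canonical),
        )
        conn.execute(
            "UPDATE ca_messages SET ref_count = ref_count + 1 WHERE hash = ?", (digest,)
        )
        conn.execute(
            "INSERT INTO user_message_refs VALUES (?, ?, ?)", (user_id, digest, folder)
        )
    return digest

def delete_for_user(user_id: int, digest: str) -> None:
    with conn:
        cur = conn.execute(
            "DELETE FROM user_message_refs WHERE user_id = ? AND message_hash = ?",
            (user_id, digest),
        )
        # Decrement only by the number of references actually removed.
        conn.execute(
            "UPDATE ca_messages SET ref_count = ref_count - ? WHERE hash = ?",
            (cur.rowcount, digest),
        )
        # Garbage-collect content that no user references any more.
        conn.execute("DELETE FROM ca_messages WHERE hash = ? AND ref_count <= 0", (digest,))

first = store_for_user(1, b"canonical message bytes")
second = store_for_user(2, b"canonical message bytes")
```

Doing the insert, the count update, and the reference insert in one transaction keeps `ref_count` in step with the rows in `user_message_refs` if any statement fails.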

IMAP Bridge

# Transparent IMAP interface
# Users see normal IMAP folders, but backed by content-addressed storage

class ContentAddressedIMAP:
    def fetch(self, message_num):
        """Fetch message by sequence number (1-based)"""
        # Map sequence number -> hash; id breaks ties between equal timestamps
        ref = db.query("""
            SELECT message_hash FROM user_message_refs
            WHERE user_id = ? AND folder = ?
            ORDER BY received_at, id
            LIMIT 1 OFFSET ?
        """, self.user_id, self.folder, message_num - 1)
        if not ref:
            return None  # No message with that sequence number

        # Retrieve by hash
        return storage.get_message(ref.message_hash)
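
The sequence-number mapping can be modelled without a database. In IMAP, sequence numbers are contiguous and 1-based, and they shift down when a message is expunged, so the bridge has to treat them as positions in an ordered list and not as stable identifiers. `Mailbox` below is a hypothetical in-memory model of one folder.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Mailbox:
    # Hashes ordered by arrival; list position + 1 is the IMAP sequence number.
    hashes: List[str] = field(default_factory=list)

    def append(self, message_hash: str) -> int:
        """Add a reference and return its sequence number."""
        self.hashes.append(message_hash)
        return len(self.hashes)

    def hash_for(self, seq: int) -> str:
        """Resolve a 1-based sequence number to a content hash."""
        if not 1 <= seq <= len(self.hashes):
            raise IndexError(f"no message with sequence number {seq}")
        return self.hashes[seq - 1]

    def expunge(self, seq: int) -> str:
        """Remove a reference; later messages move down by one."""
        removed = self.hash_for(seq)
        del self.hashes[seq - 1]
        return removed

box = Mailbox()
for digest in ("hash-a", "hash-b", "hash-c"):  # placeholder hashes
    box.append(digest)

second = box.hash_for(2)
```

After `box.expunge(1)`, sequence number 1 resolves to what used to be message 2, which is why the mapping is recomputed per request and never cached by number.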

API Endpoints

@app.route('/api/v1/messages/<hash>')
def get_message_by_hash(hash):
    """Retrieve message by content address"""
    message = storage.get_message(hash)
    if not message:
        return {'error': 'Message not found'}, 404
    return message.to_json()

def compute_message_hash(hash):
    """Recompute the hash of a stored message, or None if it is missing"""
    message = storage.get_message(hash)
    if not message:
        return None
    return hashlib.sha256(canonicalize_message(message)).hexdigest()

@app.route('/api/v1/messages/<hash>/verify')
def verify_message_integrity(hash):
    """Verify message integrity"""
    computed_hash = compute_message_hash(hash)
    if computed_hash is None:
        return {'error': 'Message not found'}, 404

    return {
        'valid': computed_hash == hash,
        'claimed_hash': hash,
        'computed_hash': computed_hash
    }

@app.route('/api/v1/threads/<thread_hash>/verify')
def verify_thread_integrity(thread_hash):
    """Verify entire thread integrity"""
    # Reconstruct thread from references
    thread = reconstruct_thread(thread_hash)

    # Verify each message by recomputing its hash
    for msg in thread:
        if compute_message_hash(msg.hash) != msg.hash:
            return {'valid': False, 'invalid_message': msg.hash}

    return {'valid': True, 'message_count': len(thread)}

Migration Strategy

Phase 1: Hybrid Storage (6 months)

  • New messages stored content-addressed
  • Existing messages remain traditional
  • IMAP interface unchanged (transparent)

Phase 2: Background Migration (6 months)

  • Deduplicate existing messages
  • Compute hashes for historical data
  • Migrate to content-addressed refs

Phase 3: Full Content-Addressed (12 months)

  • All storage content-addressed
  • IPFS pinning for distributed backup
  • Enable user-controlled storage

Storage Savings Estimate

Current: 100,000 users × 5 GB avg = 500 TB

Deduplication:
  - Attachments: 40% reduction (common files)
  - Forwards: 30% reduction
  - Thread replies: 20% reduction

Estimated savings: ~35% = 175 TB
Cost savings: ~$3,500/month @ $20/TB
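
The arithmetic behind these figures is worked through below. The ~35% blended rate is taken as given from the estimate above and is not derived from the three per-category reductions, and the conversion assumes decimal units (1 TB = 1,000 GB), which matches the 500 TB total.

```python
# Inputs taken from the estimate above.
users = 100_000
avg_gb_per_user = 5
blended_savings_percent = 35   # stated estimate, not derived here
cost_per_tb_month = 20         # USD

# Integer arithmetic keeps the results exact.
total_tb = users * avg_gb_per_user // 1000            # decimal TB
saved_tb = total_tb * blended_savings_percent // 100
monthly_savings = saved_tb * cost_per_tb_month
remaining_tb = total_tb - saved_tb
```

Under these inputs the remaining footprint is 325 TB, so benchmarking the blended rate on real mailboxes matters more than any other term in the estimate.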

Challenges & Solutions

Challenge            Solution
Hash collision       Use SHA-256 (probability negligible)
Message mutability   Store canonical form (normalized)
IMAP compatibility   Transparent mapping layer
Performance          Aggressive caching, local index
IPFS pinning cost    Selective pinning, user-pays model

Related Technologies

  • IPFS: InterPlanetary File System (distributed storage)
  • Git: Content-addressed version control (inspiration)
  • Perkeep: Personal archival system
  • Dat Protocol: Distributed data sharing
  • Filecoin: Incentivized IPFS storage

Status

🔬 Research & Prototyping Phase

Next Steps

  1. Prototype content-addressed storage layer
  2. Benchmark deduplication savings on real data
  3. Test IPFS integration
  4. Build Merkle tree thread verification
  5. IMAP compatibility testing
  6. Measure performance vs traditional storage