Building a Hybrid FTS5 + Embedding Search for Code

Explore our innovative hybrid search combining FTS5 and embeddings, enhancing code indexing for AI coding assistants. Learn why both methods are essential.

How Did We Build a Hybrid FTS5 + Embedding Search for Code? Why Do You Need Both?

An AI coding assistant is only as useful as its understanding of the codebase, and that understanding depends heavily on the search capabilities built into the tool. At srclight, we realized that a single search method would not meet the needs of our deep code indexing MCP server, which gives AI agents a comprehensive view of the codebase. So we built a hybrid search that combines FTS5 for keyword search with embeddings for semantic search.

When developing AI coding assistants, the search must cater to different user needs:

  • Keyword Search: Users who know the exact function name need a quick way to locate it.
  • Semantic Search: Users searching for concepts—like "code that handles authentication"—may not know the precise terms.

Most tools focus on either keyword or semantic search, which leaves gaps. By integrating both methods, we let users find what they need whether or not they know the exact name.

What Are the Limitations of FTS5 and Embeddings?

FTS5 excels at finding exact matches but struggles with the nuances of code naming conventions. For example:

  • calculateTotalPrice
  • calculate_total_price
  • CalculateTotalPrice

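A single FTS5 index cannot accommodate these variations: a standard tokenizer typically indexes each camelCase name as one token but splits the snake_case name at its underscores, so a query written in one style misses the others. Additionally, users often search for concepts rather than keywords. A query such as "code that validates user input" describes behavior, not any identifier that appears in the code.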
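
The mismatch can be reproduced with a minimal sketch. It assumes Python's bundled sqlite3 module was built with FTS5 and uses FTS5's default unicode61 tokenizer; the table name symbols and the search helper are hypothetical and are not srclight's schema.

```python
import sqlite3

# Hypothetical one-column FTS5 table using the default unicode61 tokenizer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE symbols USING fts5(name)")
conn.executemany(
    "INSERT INTO symbols(name) VALUES (?)",
    [("calculateTotalPrice",), ("calculate_total_price",), ("CalculateTotalPrice",)],
)


def search(query):
    """Return matching symbol names in insertion order."""
    rows = conn.execute(
        "SELECT name FROM symbols WHERE symbols MATCH ? ORDER BY rowid", (query,)
    )
    return [name for (name,) in rows]


# The underscore acts as a separator, so only the snake_case name
# contains a standalone "total" token.
print(search("total"))

# Each camelCase name is one case-folded token, so the two camelCase
# spellings match each other but not the snake_case name.
print(search("calculateTotalPrice"))
```

Neither query returns all three spellings, which is the gap the extra indexes are meant to close.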

Embeddings are effective for meaning-based matches but face challenges such as:

  • Exact symbol names (e.g., searching for handleAuth should yield handleAuth).
  • Substring matches (e.g., searching for parse should find parseJSON).
  • Short queries that often lack context.
  • Various naming conventions.

How Did We Create Our Innovative Hybrid Approach?

To tackle these challenges, we developed three distinct FTS5 indexes, each tailored for specific use cases:

  1. Case and Underscore Split: This index splits names based on case changes and underscores, accommodating various naming conventions.

    • Example: calculateTotalPrice becomes calculate, Total, Price.
    • Example: handle_user_auth becomes handle, user, auth.
  2. Substring Indexing: This index captures every 3-character substring, enabling substring matches even within longer words.

  3. Stemming: We implemented a stemming process to normalize words. For example, running, ran, and runner all map to run, enhancing docstring searches.
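
The first two index strategies can be sketched in plain Python. This is an illustration under assumptions, not srclight's implementation: split_identifier and trigrams are hypothetical helpers, and lowercasing the trigrams is a choice made here for case-insensitive matching.

```python
import re


def split_identifier(name):
    """Split an identifier on underscores and lower-to-upper case changes."""
    parts = []
    for chunk in name.split("_"):
        # Insert a space at each lowercase-or-digit -> uppercase boundary.
        spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", chunk)
        parts.extend(spaced.split())
    return parts


def trigrams(text):
    """Return every 3-character substring of text, lowercased, in order."""
    lowered = text.lower()
    return [lowered[i:i + 3] for i in range(len(lowered) - 2)]


print(split_identifier("calculateTotalPrice"))  # ['calculate', 'Total', 'Price']
print(split_identifier("handle_user_auth"))     # ['handle', 'user', 'auth']

# Every trigram of "parse" also occurs in "parseJSON", which is what
# lets a substring query find the longer name.
print(set(trigrams("parse")) <= set(trigrams("parseJSON")))  # True
```

Feeding the split tokens and the trigrams into separate indexes keeps each index simple: one answers "which names contain this word", the other "which names contain this fragment".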

In addition to FTS5, we use semantic vectors for meaning-based matching, with two embedding models: qwen3-embedding (4096 dimensions) and nomic-embed-text (768 dimensions).
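
To show what meaning-based matching amounts to, here is a minimal sketch that ranks symbols by cosine similarity. Cosine similarity is an assumption (the article does not say which distance srclight uses), and the 3-dimensional vectors are made-up stand-ins for real 768- or 4096-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, symbol_vecs):
    """Return symbol names sorted from most to least similar to the query."""
    return sorted(
        symbol_vecs,
        key=lambda name: cosine_similarity(query_vec, symbol_vecs[name]),
        reverse=True,
    )


# Toy vectors, invented for illustration only.
symbol_vecs = {
    "handleAuth": [0.9, 0.1, 0.0],
    "parseJSON": [0.0, 0.2, 0.9],
    "validateInput": [0.6, 0.7, 0.1],
}
query_vec = [1.0, 0.2, 0.0]  # stand-in for an embedded "authentication" query

print(rank_by_similarity(query_vec, symbol_vecs))
# ['handleAuth', 'validateInput', 'parseJSON']
```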

How Do We Combine the Two Search Methods?

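We run each query against all four indexes (the three FTS5 indexes plus the embedding index), rank each result list separately, and merge them using Reciprocal Rank Fusion (RRF):

RRF_score(d) = Σ_i 1 / (k + rank_i(d))
where rank_i(d) is the position of result d in index i, the sum runs over the indexes that returned d, and k = 60 (a standard constant).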

For example:

  • A result at rank 1 in FTS5 and rank 2 in embeddings:

    • FTS5: 1 / (60 + 1) = 0.0164
    • Embeddings: 1 / (60 + 2) = 0.0161
    • Total: 0.0325
  • A result that appears only in the embedding results, at rank 10, gets:

    • 1 / (60 + 10) = 0.0143

This scoring system allows exact keyword matches to coexist effectively with semantic matches, ensuring users benefit from both approaches.
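
The formula and the worked numbers above can be checked with a short sketch. The function name rrf_merge and the symbol names in the rankings are hypothetical; only the scoring rule and k = 60 come from the text.

```python
from collections import defaultdict


def rrf_merge(rankings, k=60):
    """Fuse ranked lists (best first) with Reciprocal Rank Fusion.

    Returns (doc, score) pairs sorted from highest to lowest score.
    """
    scores = defaultdict(float)
    for ranked_docs in rankings:
        for rank, doc in enumerate(ranked_docs, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up result lists: handleAuth is rank 1 in FTS5 and rank 2 in
# embeddings; checkToken appears only in embeddings, at rank 10.
fts5_ranking = ["handleAuth"]
embedding_ranking = [
    "loginFlow", "handleAuth", "e3", "e4", "e5",
    "e6", "e7", "e8", "e9", "checkToken",
]

fused = rrf_merge([fts5_ranking, embedding_ranking])
scores = dict(fused)

print(fused[0][0])                      # handleAuth
print(round(scores["handleAuth"], 4))   # 0.0325
print(round(scores["checkToken"], 4))   # 0.0143
```

Because RRF uses only ranks, the raw FTS5 relevance scores and the embedding similarities never need to be put on a common scale.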

What Additional Features Does srclight Offer?

Beyond our hybrid search, we developed features to enhance user experience:

  • GPU Vector Cache: Embeddings load to VRAM once, allowing for quick queries (~3ms) after an initial load (~300ms).
  • Incremental Indexing: This feature ensures only changed symbols are re-indexed, tracked via content hash.
  • Git Intelligence: Users can query recent changes, leveraging git blame, hotspots, and uncommitted work in progress.
  • Multi-repo Workspaces: We support SQLite ATTACH+UNION across 10+ repositories, boosting flexibility.
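
As an illustration of hash-based incremental indexing, the sketch below compares a stored hash per symbol with the hash of its current source text. SHA-256, the function names, and the dictionary layout are all assumptions made for this example; the article only states that changes are tracked via content hash.

```python
import hashlib


def content_hash(source):
    """Hex digest of a symbol's source text (SHA-256 is an assumption)."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def plan_reindex(current_symbols, stored_hashes):
    """Compare current sources with stored hashes.

    Returns (changed, removed, new_hashes): names to re-index, names to
    drop from the index, and the hash table to store for the next run.
    """
    new_hashes = {name: content_hash(src) for name, src in current_symbols.items()}
    changed = sorted(
        name for name, digest in new_hashes.items()
        if stored_hashes.get(name) != digest
    )
    removed = sorted(set(stored_hashes) - set(new_hashes))
    return changed, removed, new_hashes


# First run: nothing is stored yet, so every symbol counts as changed.
before = {
    "handleAuth": "def handleAuth(): return 1",
    "parseJSON": "def parseJSON(s): return s",
    "oldHelper": "def oldHelper(): pass",
}
_, _, stored = plan_reindex(before, {})

# Second run: one symbol edited, one added, one deleted, one untouched.
after = {
    "handleAuth": "def handleAuth(): return 2",
    "parseJSON": "def parseJSON(s): return s",
    "newHelper": "def newHelper(): pass",
}
changed, removed, stored = plan_reindex(after, stored)
print(changed)  # ['handleAuth', 'newHelper']
print(removed)  # ['oldHelper']
```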

How Easy Is It to Install srclight?

Our goal was a system that installs with a single command and is ready after two more, one to index and one to serve:

pip install srclight
srclight index --embed qwen3-embedding
srclight serve

This means no JVM, no Docker, no Redis, and no cloud. Your code stays on your machine, which keeps it private. In our own use, one workspace holds 13 repositories with 45,000 symbols. With that index available, Claude Code's tool calls per task dropped from about 20 to 6, because it can simply ask "Who calls this?" instead of running multiple greps.

Conclusion: Why Is Hybrid Search Essential?

In summary, a hybrid FTS5 and embedding search is essential for an effective AI coding assistant. Keyword matches provide precision, while embeddings improve recall. Reciprocal Rank Fusion merges the two sets of rankings into a single result list without requiring their scores to be comparable.

What search challenges are you facing with AI coding assistants? Share your insights in the comments below—your feedback could drive the next evolution in coding tools.
