vchord_bm25
vchord_bm25: a PostgreSQL extension for the BM25 ranking algorithm
Overview
| ID | Extension | Package | Version | Category | License | Language |
|---|---|---|---|---|---|---|
| 2150 | vchord_bm25 | vchord_bm25 | 0.3.0 | FTS | AGPL-3.0 | Rust |
| Attribute | Has Binary | Has Library | Need Load | Has DDL | Relocatable | Trusted |
|---|---|---|---|---|---|---|
| --sLd-- | No | Yes | Yes | Yes | No | No |
| Relationships | |
|---|---|
| Schemas | bm25_catalog |
| See Also | vector, vchord, pg_search, pg_bestmatch, vectorscale, zhparser, pg_tokenizer, pgroonga |
Packages
| Type | Repo | Version | PG Major Compatibility | Package Pattern | Dependencies |
|---|---|---|---|---|---|
| EXT | PIGSTY | 0.3.0 | 18, 17, 16, 15, 14 | vchord_bm25 | - |
| RPM | PIGSTY | 0.3.0 | 18, 17, 16, 15, 14 | vchord_bm25_$v | - |
| DEB | PIGSTY | 0.3.0 | 18, 17, 16, 15, 14 | postgresql-$v-vchord-bm25 | - |
| Linux / PG | PG18 | PG17 | PG16 | PG15 | PG14 |
|---|---|---|---|---|---|
| el8.x86_64 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 |
| el8.aarch64 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 |
| el9.x86_64 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 |
| el9.aarch64 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 |
| el10.x86_64 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 |
| el10.aarch64 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 |
| d12.x86_64 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 |
| d12.aarch64 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 |
| d13.x86_64 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 |
| d13.aarch64 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 |
| u22.x86_64 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 |
| u22.aarch64 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 |
| u24.x86_64 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 |
| u24.aarch64 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 | PIGSTY 0.3.0 |
Source
pig build pkg vchord_bm25; # build rpm/deb
Install
Make sure the PGDG and PIGSTY repositories are available:
pig repo add pgsql -u # add both repos and update the cache
Install this extension with pig:
pig install vchord_bm25; # install via package name, for the active PG version
pig install vchord_bm25 -v 18; # install for PG 18
pig install vchord_bm25 -v 17; # install for PG 17
pig install vchord_bm25 -v 16; # install for PG 16
pig install vchord_bm25 -v 15; # install for PG 15
pig install vchord_bm25 -v 14; # install for PG 14
Add this extension to shared_preload_libraries:
shared_preload_libraries = 'vchord_bm25';
Create this extension with:
CREATE EXTENSION vchord_bm25;
Usage
VectorChord-BM25 is a PostgreSQL extension for the BM25 ranking algorithm, implemented via Block-WeakAnd algorithms. It is designed to work together with pg_tokenizer for customized text tokenization.
Architecture
The extension comprises three main components:
- Tokenizer: Converts text into bm25vector (sparse vectors storing vocabulary IDs and term frequencies)
- bm25vector: A custom data type for storing tokenized text
- bm25vector indexes: Accelerate search and ranking operations
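To make the "sparse vector of vocabulary IDs and term frequencies" idea concrete, here is a minimal Python sketch. It is only an illustration of the concept: the vocabulary, the `toy_tokenize` helper, and the dict representation are assumptions made for this example, not the extension's actual storage format or tokenizer.

```python
from collections import Counter

# Hypothetical vocabulary: token -> vocabulary ID.
# A real tokenizer model ships its own (much larger) vocabulary.
VOCAB = {"postgresql": 0, "is": 1, "a": 2, "database": 3, "fast": 4}

def toy_tokenize(text: str) -> dict[int, int]:
    """Map text to a sparse {vocabulary ID: term frequency} dict.

    Words missing from the vocabulary are simply dropped.
    """
    ids = [VOCAB[word] for word in text.lower().split() if word in VOCAB]
    return dict(sorted(Counter(ids).items()))

vec = toy_tokenize("PostgreSQL is a database a fast database")
print(vec)
```

Only the IDs that actually occur are stored, which is what makes the representation sparse.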
Quick Start
-- Enable required extensions
CREATE EXTENSION IF NOT EXISTS pg_tokenizer CASCADE;
CREATE EXTENSION IF NOT EXISTS vchord_bm25 CASCADE;
-- Create a tokenizer (e.g., LLMLingua2 for English)
SELECT create_tokenizer('tokenizer1', $$
model = "llmlingua2"
$$);
-- Create a table with text content
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    passage TEXT,
    embedding bm25vector
);
-- Tokenize text passages into bm25vectors
UPDATE documents SET embedding = tokenize(passage, 'tokenizer1');
-- Create a BM25 index
CREATE INDEX documents_embedding_bm25 ON documents USING bm25 (embedding bm25_ops);
-- Query with BM25 ranking
SELECT id, passage, embedding <&> to_bm25query('documents_embedding_bm25', tokenize('search query', 'tokenizer1')) AS score
FROM documents
ORDER BY score
LIMIT 10;
Note: BM25 scores in VectorChord-BM25 are negative, with more negative scores indicating greater relevance, so the ascending ORDER BY above returns the most relevant rows first.
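The sign convention in the note can be illustrated with a small Python sketch. It uses a common textbook BM25 formulation with typical default parameters (`K1 = 1.2`, `B = 0.75`); these values, the `bm25` helper, and the toy corpus are assumptions for illustration and may differ from what the extension computes internally.

```python
import math

# Common textbook defaults; the extension's actual parameters are not stated here.
K1, B = 1.2, 0.75

def bm25(query: list[str], doc: list[str], corpus: list[list[str]]) -> float:
    """Textbook BM25 score of one document for a query (higher = more relevant)."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in set(query):
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        tf = doc.count(term)  # term frequency in this document
        score += idf * tf * (K1 + 1) / (tf + K1 * (1 - B + B * len(doc) / avg_len))
    return score

corpus = [
    "postgres is a database system".split(),
    "a database system stores data in a database".split(),
    "cats sleep all day".split(),
]
query = "database system".split()

# Negate the score so that a plain ascending sort puts the best match first,
# mirroring "ORDER BY score" without DESC.
negated = [(-bm25(query, doc, corpus), i) for i, doc in enumerate(corpus)]
ranked = sorted(negated)
print(ranked)
```

Negating the score lets an ordinary ascending sort return the best matches first, which is the same shape as the `ORDER BY score LIMIT 10` query.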
The <&> Operator
The <&> operator computes the BM25 relevance score between a stored bm25vector and a query bm25vector. Queries must be wrapped in to_bm25query(), which takes the index name and the tokenized query:
-- Basic search query
-- to_bm25query(index_name, tokenized_query)
SELECT id, passage, embedding <&> to_bm25query('documents_embedding_bm25', tokenize('database system', 'tokenizer1')) AS score
FROM documents
ORDER BY score
LIMIT 10;
Language Support
VectorChord-BM25 supports multiple languages through different tokenizer configurations:
| Language | Approach | Model/Pre-tokenizer |
|---|---|---|
| English | Pre-trained model | model = "llmlingua2" or model = "bert_base_uncased" |
| Chinese | Custom model with Jieba pre-tokenizer | [pre_tokenizer.jieba] |
| Japanese | Custom model with Lindera pre-tokenizer | Lindera with IPADIC dictionary |
| Custom | User-trained models via text analyzers | create_custom_model_tokenizer_and_trigger() |
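Chinese and Japanese need a pre-tokenizer because these scripts do not put spaces between words, so whitespace splitting yields a single token. The Python sketch below shows the problem and a simple dictionary-based forward maximum matching segmenter as a stand-in. It is not Jieba's actual algorithm, and the mini-dictionary is hypothetical.

```python
# Hypothetical mini-dictionary; a real segmenter such as Jieba uses a far
# larger dictionary and a more sophisticated algorithm.
DICTIONARY = {"我", "喜欢", "数据库", "数据", "系统"}
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)

def forward_max_match(text: str) -> list[str]:
    """Greedily take the longest dictionary word starting at each position."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing in the dictionary matches.
            if size == 1 or piece in DICTIONARY:
                tokens.append(piece)
                i += size
                break
    return tokens

sentence = "我喜欢数据库系统"
print(sentence.split())             # whitespace splitting: one big token
print(forward_max_match(sentence))  # dictionary-based segmentation
```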
Chinese Text Search Example
Chinese text requires a custom model with a Jieba pre-tokenizer (not a pre-trained model):
-- Create a text analyzer with Jieba pre-tokenizer
SELECT create_text_analyzer('zh_text_analyzer', $$
[pre_tokenizer.jieba]
$$);
-- Create a custom model tokenizer that trains on your corpus
SELECT create_custom_model_tokenizer_and_trigger(
    tokenizer_name => 'zh_tokenizer',
    model_name => 'zh_model',
    text_analyzer_name => 'zh_text_analyzer',
    table_name => 'documents',
    source_column => 'passage',
    target_column => 'embedding'
);
Custom Tokenizer Models
For domain-specific terminology, you can create text analyzers with stopwords, stemming, and other filters, then train custom models on your corpus using create_custom_model_tokenizer_and_trigger().
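As a rough illustration of what "analyze, then train a model on your corpus" means, the Python sketch below lowercases text, drops stopwords, applies a crude suffix-stripping stemmer, and builds a vocabulary from the terms that remain. The stopword list, the `crude_stem` rules, and all function names are hypothetical and only mimic the idea; pg_tokenizer's actual filters and model training are configured in SQL and work differently in detail.

```python
import re

STOPWORDS = {"the", "a", "of", "and", "is"}  # hypothetical stopword list

def crude_stem(word: str) -> str:
    """Strip a few English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def analyze(text: str) -> list[str]:
    """Lowercase, split into alphanumeric words, drop stopwords, then stem."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOPWORDS]

def train_vocab(corpus: list[str]) -> dict[str, int]:
    """Assign a vocabulary ID to every distinct analyzed term in the corpus."""
    terms = sorted({term for doc in corpus for term in analyze(doc)})
    return {term: i for i, term in enumerate(terms)}

corpus = ["Indexing the vectors", "The index stores vectors and indexes"]
vocab = train_vocab(corpus)
print(vocab)
```

Because the vocabulary is derived from the corpus itself, domain-specific terms get their own IDs instead of being dropped or split by a general-purpose pre-trained model.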
Comparison with Alternatives
| Feature | VectorChord-BM25 | PostgreSQL tsvector + ts_rank |
|---|---|---|
| Ranking algorithm | BM25 | tf-idf variant |
| Custom tokenizers | Yes (via pg_tokenizer) | Limited to built-in configs |
| Index type | Dedicated BM25 index | GIN index |
| Native PostgreSQL | Yes (extension) | Built-in |
| Language support | Extensible via models | Via text search configs |