# pg_tokenizer

> pg_tokenizer: Tokenizers for full-text search

## Overview
| ID | Extension | Package | Version | Category | License | Language |
|---|---|---|---|---|---|---|
| 2160 | pg_tokenizer | pg_tokenizer | 0.1.1 | FTS | Apache-2.0 | Rust |
| Attribute | Has Binary | Has Library | Need Load | Has DDL | Relocatable | Trusted |
|---|---|---|---|---|---|---|
| --sLd-- | No | Yes | Yes | Yes | No | No |
| Relationships | |
|---|---|
| Schemas | tokenizer_catalog |
| See Also | pg_search, pgroonga, pg_bigm, zhparser, pgroonga_database, pg_bestmatch, vchord_bm25, pg_trgm |
PG18 fix by Vonng
## Packages
| Type | Repo | Version | PG Major Compatibility | Package Pattern | Dependencies |
|---|---|---|---|---|---|
| EXT | PIGSTY | 0.1.1 | 18, 17, 16, 15, 14 | pg_tokenizer | - |
| RPM | PIGSTY | 0.1.1 | 18, 17, 16, 15, 14 | pg_tokenizer_$v | - |
| DEB | PIGSTY | 0.1.1 | 18, 17, 16, 15, 14 | postgresql-$v-pg-tokenizer | - |
| Linux / PG | PG18 | PG17 | PG16 | PG15 | PG14 |
|---|---|---|---|---|---|
| el8.x86_64 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 |
| el8.aarch64 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 |
| el9.x86_64 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 |
| el9.aarch64 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 |
| el10.x86_64 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 |
| el10.aarch64 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 |
| d12.x86_64 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 |
| d12.aarch64 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 |
| d13.x86_64 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 |
| d13.aarch64 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 |
| u22.x86_64 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 |
| u22.aarch64 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 |
| u24.x86_64 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 |
| u24.aarch64 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 | PIGSTY 0.1.1 |
## Source

```bash
pig build pkg pg_tokenizer;   # build rpm/deb
```

## Install

Make sure the PGDG and PIGSTY repos are available:

```bash
pig repo add pgsql -u   # add both repos and update the cache
```

Install this extension with pig:

```bash
pig install pg_tokenizer;        # install via package name, for the active PG version
pig install pg_tokenizer -v 18;  # install for PG 18
pig install pg_tokenizer -v 17;  # install for PG 17
pig install pg_tokenizer -v 16;  # install for PG 16
pig install pg_tokenizer -v 15;  # install for PG 15
pig install pg_tokenizer -v 14;  # install for PG 14
```

Add this extension to shared_preload_libraries:

```ini
shared_preload_libraries = 'pg_tokenizer'
```

Create this extension with:

```sql
CREATE EXTENSION pg_tokenizer;
```

## Usage
pg_tokenizer is a PostgreSQL extension that provides tokenizers for full-text search. It is designed to work with VectorChord-bm25 for native BM25 ranking index support.
### Quick Start

```sql
CREATE EXTENSION pg_tokenizer;

-- Create a tokenizer using the LLMLingua2 model
SELECT create_tokenizer('tokenizer1', $$
model = "llmlingua2"
$$);

-- Tokenize text
SELECT tokenize('PostgreSQL is a powerful, open-source object-relational database system. It has over 15 years of active development.', 'tokenizer1');
```

### Tokenizer Models
pg_tokenizer supports multiple tokenizer models for different languages and use cases:
| Model | Language | Description |
|---|---|---|
| llmlingua2 | English | BERT-based tokenizer from LLMLingua2 |
| jieba | Chinese | Jieba Chinese text segmentation |
| lindera/ipadic | Japanese | Lindera tokenizer with IPADIC dictionary |
| Custom models | Any | User-trained models for domain-specific text |
### Creating Tokenizers

```sql
-- English tokenizer
SELECT create_tokenizer('en_tokenizer', $$
model = "llmlingua2"
$$);

-- Chinese tokenizer
SELECT create_tokenizer('zh_tokenizer', $$
model = "jieba"
$$);

-- Japanese tokenizer
SELECT create_tokenizer('ja_tokenizer', $$
model = "lindera/ipadic"
$$);
```

### Tokenizing Text
```sql
-- Tokenize English text
SELECT tokenize('full text search in PostgreSQL', 'en_tokenizer');

-- Tokenize Chinese text ("PostgreSQL is a powerful database system")
SELECT tokenize('PostgreSQL是一个强大的数据库系统', 'zh_tokenizer');
```

### Text Analyzer
pg_tokenizer also provides text analyzer functionality that combines tokenization with additional text processing steps. For detailed text analyzer usage, refer to the Text Analyzer documentation.
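As a hedged sketch of what such a pipeline can look like (the `create_text_analyzer` / `apply_text_analyzer` function names and the TOML options below follow the upstream pg_tokenizer README and should be verified against your installed version), a text analyzer might chain Unicode segmentation, lowercasing, and stemming:

```sql
-- Assumption: function names and TOML keys here are taken from the
-- upstream pg_tokenizer README; check them against your installed version.
SELECT create_text_analyzer('text_analyzer1', $$
pre_tokenizer = "unicode_segmentation"
[[character_filters]]
to_lowercase = {}
[[token_filters]]
stemmer = "english_porter2"
$$);

-- Apply the analyzer to raw text
SELECT apply_text_analyzer('Tokenization splits raw text into searchable tokens.', 'text_analyzer1');
```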
### Integration with VectorChord-BM25

pg_tokenizer is typically used together with VectorChord-BM25 for full BM25 ranking support:

```sql
CREATE EXTENSION IF NOT EXISTS pg_tokenizer CASCADE;
CREATE EXTENSION IF NOT EXISTS vchord_bm25 CASCADE;

-- Create a tokenizer
SELECT create_tokenizer('my_tokenizer', $$
model = "llmlingua2"
$$);

-- Tokenize text into bm25vectors for indexing and search
SELECT tokenize('your search query', 'my_tokenizer');
```

## Documentation
For more details, see the full documentation: