pg_tokenizer

pg_tokenizer : Tokenizers for full-text search

Overview

ID    Extension     Package       Version  Category  License     Language
2160  pg_tokenizer  pg_tokenizer  0.1.1    FTS       Apache-2.0  Rust

Attribute  Has Binary  Has Library  Need Load  Has DDL  Relocatable  Trusted
--sLd--    No          Yes          Yes        Yes      No           No

Relationships
Schemas: tokenizer_catalog
See Also: pg_search, pgroonga, pg_bigm, zhparser, pgroonga_database, pg_bestmatch, vchord_bm25, pg_trgm

PG18 fix by Vonng

Packages

Type  Repo    Version  PG Major Compatibility  Package Pattern             Dependencies
EXT   PIGSTY  0.1.1    18, 17, 16, 15, 14      pg_tokenizer                -
RPM   PIGSTY  0.1.1    18, 17, 16, 15, 14      pg_tokenizer_$v             -
DEB   PIGSTY  0.1.1    18, 17, 16, 15, 14      postgresql-$v-pg-tokenizer  -

Linux / PG    PG18          PG17          PG16          PG15          PG14
el8.x86_64    PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1
el8.aarch64   PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1
el9.x86_64    PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1
el9.aarch64   PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1
el10.x86_64   PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1
el10.aarch64  PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1
d12.x86_64    PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1
d12.aarch64   PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1
d13.x86_64    PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1
d13.aarch64   PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1
u22.x86_64    PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1
u22.aarch64   PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1
u24.x86_64    PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1
u24.aarch64   PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1  PIGSTY 0.1.1

Package Version OS ORG SIZE File URL
pg_tokenizer_18 0.1.1 el8.x86_64 pigsty 11.7 MiB pg_tokenizer_18-0.1.1-1PIGSTY.el8.x86_64.rpm
pg_tokenizer_18 0.1.1 el8.aarch64 pigsty 11.5 MiB pg_tokenizer_18-0.1.1-1PIGSTY.el8.aarch64.rpm
pg_tokenizer_18 0.1.1 el9.x86_64 pigsty 11.0 MiB pg_tokenizer_18-0.1.1-1PIGSTY.el9.x86_64.rpm
pg_tokenizer_18 0.1.1 el9.aarch64 pigsty 10.9 MiB pg_tokenizer_18-0.1.1-1PIGSTY.el9.aarch64.rpm
pg_tokenizer_18 0.1.1 el10.x86_64 pigsty 10.9 MiB pg_tokenizer_18-0.1.1-1PIGSTY.el10.x86_64.rpm
pg_tokenizer_18 0.1.1 el10.aarch64 pigsty 11.0 MiB pg_tokenizer_18-0.1.1-1PIGSTY.el10.aarch64.rpm
postgresql-18-pg-tokenizer 0.1.1 d12.x86_64 pigsty 9.9 MiB postgresql-18-pg-tokenizer_0.1.1-1PIGSTY~bookworm_amd64.deb
postgresql-18-pg-tokenizer 0.1.1 d12.aarch64 pigsty 9.7 MiB postgresql-18-pg-tokenizer_0.1.1-1PIGSTY~bookworm_arm64.deb
postgresql-18-pg-tokenizer 0.1.1 d13.x86_64 pigsty 9.9 MiB postgresql-18-pg-tokenizer_0.1.1-1PIGSTY~trixie_amd64.deb
postgresql-18-pg-tokenizer 0.1.1 d13.aarch64 pigsty 9.7 MiB postgresql-18-pg-tokenizer_0.1.1-1PIGSTY~trixie_arm64.deb
postgresql-18-pg-tokenizer 0.1.1 u22.x86_64 pigsty 10.9 MiB postgresql-18-pg-tokenizer_0.1.1-1PIGSTY~jammy_amd64.deb
postgresql-18-pg-tokenizer 0.1.1 u22.aarch64 pigsty 10.7 MiB postgresql-18-pg-tokenizer_0.1.1-1PIGSTY~jammy_arm64.deb
postgresql-18-pg-tokenizer 0.1.1 u24.x86_64 pigsty 10.8 MiB postgresql-18-pg-tokenizer_0.1.1-1PIGSTY~noble_amd64.deb
postgresql-18-pg-tokenizer 0.1.1 u24.aarch64 pigsty 10.6 MiB postgresql-18-pg-tokenizer_0.1.1-1PIGSTY~noble_arm64.deb
Package Version OS ORG SIZE File URL
pg_tokenizer_17 0.1.1 el8.x86_64 pigsty 11.7 MiB pg_tokenizer_17-0.1.1-1PIGSTY.el8.x86_64.rpm
pg_tokenizer_17 0.1.1 el8.aarch64 pigsty 11.5 MiB pg_tokenizer_17-0.1.1-1PIGSTY.el8.aarch64.rpm
pg_tokenizer_17 0.1.1 el9.x86_64 pigsty 11.0 MiB pg_tokenizer_17-0.1.1-1PIGSTY.el9.x86_64.rpm
pg_tokenizer_17 0.1.1 el9.aarch64 pigsty 10.9 MiB pg_tokenizer_17-0.1.1-1PIGSTY.el9.aarch64.rpm
pg_tokenizer_17 0.1.1 el10.x86_64 pigsty 10.9 MiB pg_tokenizer_17-0.1.1-1PIGSTY.el10.x86_64.rpm
pg_tokenizer_17 0.1.1 el10.aarch64 pigsty 11.0 MiB pg_tokenizer_17-0.1.1-1PIGSTY.el10.aarch64.rpm
postgresql-17-pg-tokenizer 0.1.1 d12.x86_64 pigsty 9.9 MiB postgresql-17-pg-tokenizer_0.1.1-1PIGSTY~bookworm_amd64.deb
postgresql-17-pg-tokenizer 0.1.1 d12.aarch64 pigsty 9.7 MiB postgresql-17-pg-tokenizer_0.1.1-1PIGSTY~bookworm_arm64.deb
postgresql-17-pg-tokenizer 0.1.1 d13.x86_64 pigsty 9.9 MiB postgresql-17-pg-tokenizer_0.1.1-1PIGSTY~trixie_amd64.deb
postgresql-17-pg-tokenizer 0.1.1 d13.aarch64 pigsty 9.7 MiB postgresql-17-pg-tokenizer_0.1.1-1PIGSTY~trixie_arm64.deb
postgresql-17-pg-tokenizer 0.1.1 u22.x86_64 pigsty 10.9 MiB postgresql-17-pg-tokenizer_0.1.1-1PIGSTY~jammy_amd64.deb
postgresql-17-pg-tokenizer 0.1.1 u22.aarch64 pigsty 10.7 MiB postgresql-17-pg-tokenizer_0.1.1-1PIGSTY~jammy_arm64.deb
postgresql-17-pg-tokenizer 0.1.1 u24.x86_64 pigsty 10.8 MiB postgresql-17-pg-tokenizer_0.1.1-1PIGSTY~noble_amd64.deb
postgresql-17-pg-tokenizer 0.1.1 u24.aarch64 pigsty 10.7 MiB postgresql-17-pg-tokenizer_0.1.1-1PIGSTY~noble_arm64.deb
Package Version OS ORG SIZE File URL
pg_tokenizer_16 0.1.1 el8.x86_64 pigsty 11.7 MiB pg_tokenizer_16-0.1.1-1PIGSTY.el8.x86_64.rpm
pg_tokenizer_16 0.1.1 el8.aarch64 pigsty 11.5 MiB pg_tokenizer_16-0.1.1-1PIGSTY.el8.aarch64.rpm
pg_tokenizer_16 0.1.1 el9.x86_64 pigsty 11.0 MiB pg_tokenizer_16-0.1.1-1PIGSTY.el9.x86_64.rpm
pg_tokenizer_16 0.1.1 el9.aarch64 pigsty 10.9 MiB pg_tokenizer_16-0.1.1-1PIGSTY.el9.aarch64.rpm
pg_tokenizer_16 0.1.1 el10.x86_64 pigsty 10.9 MiB pg_tokenizer_16-0.1.1-1PIGSTY.el10.x86_64.rpm
pg_tokenizer_16 0.1.1 el10.aarch64 pigsty 11.0 MiB pg_tokenizer_16-0.1.1-1PIGSTY.el10.aarch64.rpm
postgresql-16-pg-tokenizer 0.1.1 d12.x86_64 pigsty 9.9 MiB postgresql-16-pg-tokenizer_0.1.1-1PIGSTY~bookworm_amd64.deb
postgresql-16-pg-tokenizer 0.1.1 d12.aarch64 pigsty 9.7 MiB postgresql-16-pg-tokenizer_0.1.1-1PIGSTY~bookworm_arm64.deb
postgresql-16-pg-tokenizer 0.1.1 d13.x86_64 pigsty 9.9 MiB postgresql-16-pg-tokenizer_0.1.1-1PIGSTY~trixie_amd64.deb
postgresql-16-pg-tokenizer 0.1.1 d13.aarch64 pigsty 9.7 MiB postgresql-16-pg-tokenizer_0.1.1-1PIGSTY~trixie_arm64.deb
postgresql-16-pg-tokenizer 0.1.1 u22.x86_64 pigsty 10.9 MiB postgresql-16-pg-tokenizer_0.1.1-1PIGSTY~jammy_amd64.deb
postgresql-16-pg-tokenizer 0.1.1 u22.aarch64 pigsty 10.7 MiB postgresql-16-pg-tokenizer_0.1.1-1PIGSTY~jammy_arm64.deb
postgresql-16-pg-tokenizer 0.1.1 u24.x86_64 pigsty 10.8 MiB postgresql-16-pg-tokenizer_0.1.1-1PIGSTY~noble_amd64.deb
postgresql-16-pg-tokenizer 0.1.1 u24.aarch64 pigsty 10.7 MiB postgresql-16-pg-tokenizer_0.1.1-1PIGSTY~noble_arm64.deb
Package Version OS ORG SIZE File URL
pg_tokenizer_15 0.1.1 el8.x86_64 pigsty 11.7 MiB pg_tokenizer_15-0.1.1-1PIGSTY.el8.x86_64.rpm
pg_tokenizer_15 0.1.1 el8.aarch64 pigsty 11.5 MiB pg_tokenizer_15-0.1.1-1PIGSTY.el8.aarch64.rpm
pg_tokenizer_15 0.1.1 el9.x86_64 pigsty 11.0 MiB pg_tokenizer_15-0.1.1-1PIGSTY.el9.x86_64.rpm
pg_tokenizer_15 0.1.1 el9.aarch64 pigsty 10.9 MiB pg_tokenizer_15-0.1.1-1PIGSTY.el9.aarch64.rpm
pg_tokenizer_15 0.1.1 el10.x86_64 pigsty 10.9 MiB pg_tokenizer_15-0.1.1-1PIGSTY.el10.x86_64.rpm
pg_tokenizer_15 0.1.1 el10.aarch64 pigsty 11.0 MiB pg_tokenizer_15-0.1.1-1PIGSTY.el10.aarch64.rpm
postgresql-15-pg-tokenizer 0.1.1 d12.x86_64 pigsty 9.9 MiB postgresql-15-pg-tokenizer_0.1.1-1PIGSTY~bookworm_amd64.deb
postgresql-15-pg-tokenizer 0.1.1 d12.aarch64 pigsty 9.7 MiB postgresql-15-pg-tokenizer_0.1.1-1PIGSTY~bookworm_arm64.deb
postgresql-15-pg-tokenizer 0.1.1 d13.x86_64 pigsty 9.9 MiB postgresql-15-pg-tokenizer_0.1.1-1PIGSTY~trixie_amd64.deb
postgresql-15-pg-tokenizer 0.1.1 d13.aarch64 pigsty 9.7 MiB postgresql-15-pg-tokenizer_0.1.1-1PIGSTY~trixie_arm64.deb
postgresql-15-pg-tokenizer 0.1.1 u22.x86_64 pigsty 10.9 MiB postgresql-15-pg-tokenizer_0.1.1-1PIGSTY~jammy_amd64.deb
postgresql-15-pg-tokenizer 0.1.1 u22.aarch64 pigsty 10.7 MiB postgresql-15-pg-tokenizer_0.1.1-1PIGSTY~jammy_arm64.deb
postgresql-15-pg-tokenizer 0.1.1 u24.x86_64 pigsty 10.8 MiB postgresql-15-pg-tokenizer_0.1.1-1PIGSTY~noble_amd64.deb
postgresql-15-pg-tokenizer 0.1.1 u24.aarch64 pigsty 10.7 MiB postgresql-15-pg-tokenizer_0.1.1-1PIGSTY~noble_arm64.deb
Package Version OS ORG SIZE File URL
pg_tokenizer_14 0.1.1 el8.x86_64 pigsty 11.7 MiB pg_tokenizer_14-0.1.1-1PIGSTY.el8.x86_64.rpm
pg_tokenizer_14 0.1.1 el8.aarch64 pigsty 11.5 MiB pg_tokenizer_14-0.1.1-1PIGSTY.el8.aarch64.rpm
pg_tokenizer_14 0.1.1 el9.x86_64 pigsty 11.0 MiB pg_tokenizer_14-0.1.1-1PIGSTY.el9.x86_64.rpm
pg_tokenizer_14 0.1.1 el9.aarch64 pigsty 10.9 MiB pg_tokenizer_14-0.1.1-1PIGSTY.el9.aarch64.rpm
pg_tokenizer_14 0.1.1 el10.x86_64 pigsty 10.9 MiB pg_tokenizer_14-0.1.1-1PIGSTY.el10.x86_64.rpm
pg_tokenizer_14 0.1.1 el10.aarch64 pigsty 11.0 MiB pg_tokenizer_14-0.1.1-1PIGSTY.el10.aarch64.rpm
postgresql-14-pg-tokenizer 0.1.1 d12.x86_64 pigsty 9.9 MiB postgresql-14-pg-tokenizer_0.1.1-1PIGSTY~bookworm_amd64.deb
postgresql-14-pg-tokenizer 0.1.1 d12.aarch64 pigsty 9.7 MiB postgresql-14-pg-tokenizer_0.1.1-1PIGSTY~bookworm_arm64.deb
postgresql-14-pg-tokenizer 0.1.1 d13.x86_64 pigsty 9.9 MiB postgresql-14-pg-tokenizer_0.1.1-1PIGSTY~trixie_amd64.deb
postgresql-14-pg-tokenizer 0.1.1 d13.aarch64 pigsty 9.7 MiB postgresql-14-pg-tokenizer_0.1.1-1PIGSTY~trixie_arm64.deb
postgresql-14-pg-tokenizer 0.1.1 u22.x86_64 pigsty 10.9 MiB postgresql-14-pg-tokenizer_0.1.1-1PIGSTY~jammy_amd64.deb
postgresql-14-pg-tokenizer 0.1.1 u22.aarch64 pigsty 10.7 MiB postgresql-14-pg-tokenizer_0.1.1-1PIGSTY~jammy_arm64.deb
postgresql-14-pg-tokenizer 0.1.1 u24.x86_64 pigsty 10.8 MiB postgresql-14-pg-tokenizer_0.1.1-1PIGSTY~noble_amd64.deb
postgresql-14-pg-tokenizer 0.1.1 u24.aarch64 pigsty 10.7 MiB postgresql-14-pg-tokenizer_0.1.1-1PIGSTY~noble_arm64.deb
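
The RPM and DEB package patterns listed above (pg_tokenizer_$v and postgresql-$v-pg-tokenizer) expand by substituting the PostgreSQL major version for $v. As a minimal sketch, the hypothetical helper pkg_name below (written for this example, not part of pig or Pigsty) prints the concrete package name for each supported major version:

```shell
#!/bin/sh
# pkg_name is a hypothetical helper for illustration; it is not part of pig.
# It substitutes a PostgreSQL major version into the package patterns
# listed in the Packages table.
pkg_name() {
  family=$1   # "rpm" or "deb"
  v=$2        # PostgreSQL major version, e.g. 17
  case "$family" in
    rpm) printf 'pg_tokenizer_%s\n' "$v" ;;
    deb) printf 'postgresql-%s-pg-tokenizer\n' "$v" ;;
    *)   printf 'unknown package family: %s\n' "$family" >&2; return 1 ;;
  esac
}

# Print the RPM and DEB names for every major version in the table.
for v in 18 17 16 15 14; do
  printf '%s\t%s\n' "$(pkg_name rpm "$v")" "$(pkg_name deb "$v")"
done
```

The printed names match the entries in the per-version package lists above, for example pg_tokenizer_17 and postgresql-17-pg-tokenizer.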

Source

pig build pkg pg_tokenizer;   # build rpm/deb

Install

Make sure the PGDG and PIGSTY repos are available:

pig repo add pgsql -u   # add both repos and update the cache

Install this extension with pig:

pig install pg_tokenizer;         # install via package name, for the active PG version

pig install pg_tokenizer -v 18;   # install for PG 18
pig install pg_tokenizer -v 17;   # install for PG 17
pig install pg_tokenizer -v 16;   # install for PG 16
pig install pg_tokenizer -v 15;   # install for PG 15
pig install pg_tokenizer -v 14;   # install for PG 14

Add this extension to shared_preload_libraries in postgresql.conf, then restart PostgreSQL:

shared_preload_libraries = 'pg_tokenizer'   # requires a server restart
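
shared_preload_libraries holds a comma-separated list, so setting it to 'pg_tokenizer' alone would drop any libraries that are already preloaded. The sketch below shows one way to compute a merged value. add_preload is a hypothetical helper written for this example, and the existing value pg_stat_statements is only an assumed starting point.

```shell
#!/bin/sh
# add_preload is a hypothetical helper: given the current
# shared_preload_libraries value and a library name, it prints the
# merged comma-separated list without duplicating the library.
add_preload() {
  current=$(printf '%s' "$1" | tr -d ' ')   # strip spaces around commas
  lib=$2
  case ",$current," in
    *",$lib,"*) printf '%s\n' "$current" ;;        # already present
    ",,")       printf '%s\n' "$lib" ;;            # list was empty
    *)          printf '%s\n' "$current,$lib" ;;   # append to the list
  esac
}

# Assumed existing value; substitute the one from your postgresql.conf.
merged=$(add_preload 'pg_stat_statements' 'pg_tokenizer')
printf "shared_preload_libraries = '%s'\n" "$merged"
```

The printed line can be placed in postgresql.conf. Because shared_preload_libraries is read only at server start, PostgreSQL must be restarted afterwards.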

Create this extension with:

CREATE EXTENSION pg_tokenizer;

Usage

GitHub: tensorchord/pg_tokenizer.rs

pg_tokenizer is a PostgreSQL extension that provides tokenizers for full-text search. It is designed to work with VectorChord-bm25, which provides a native BM25 ranking index.

Quick Start

CREATE EXTENSION pg_tokenizer;

-- Create a tokenizer using the LLMLingua2 model
SELECT create_tokenizer('tokenizer1', $$
model = "llmlingua2"
$$);

-- Tokenize text
SELECT tokenize('PostgreSQL is a powerful, open-source object-relational database system. It has over 15 years of active development.', 'tokenizer1');

Tokenizer Models

pg_tokenizer supports multiple tokenizer models for different languages and use cases:

Model           Language  Description
llmlingua2      English   BERT-based tokenizer from LLMLingua2
jieba           Chinese   Jieba Chinese text segmentation
lindera/ipadic  Japanese  Lindera tokenizer with IPADIC dictionary
Custom models   Any       User-trained models for domain-specific text

Creating Tokenizers

-- English tokenizer
SELECT create_tokenizer('en_tokenizer', $$
model = "llmlingua2"
$$);

-- Chinese tokenizer
SELECT create_tokenizer('zh_tokenizer', $$
model = "jieba"
$$);

-- Japanese tokenizer
SELECT create_tokenizer('ja_tokenizer', $$
model = "lindera/ipadic"
$$);
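
Since the three calls above differ only in the tokenizer name and the model string, they can be generated from a list. The following sketch uses a hypothetical shell function, tokenizer_sql, to print the same statements. It relies only on the create_tokenizer call shown above and assumes the names and models contain no quote characters.

```shell
#!/bin/sh
# tokenizer_sql is a hypothetical helper: it prints a create_tokenizer
# statement for the given tokenizer name and model.
# Values are interpolated verbatim, so they must not contain quotes.
tokenizer_sql() {
  name=$1
  model=$2
  cat <<EOF
SELECT create_tokenizer('$name', \$\$
model = "$model"
\$\$);
EOF
}

# name:model pairs taken from the examples above
for pair in en_tokenizer:llmlingua2 zh_tokenizer:jieba ja_tokenizer:lindera/ipadic; do
  tokenizer_sql "${pair%%:*}" "${pair#*:}"
done
```

Piping the script's output into psql for the target database would create all three tokenizers in one pass, provided the extension is already installed there.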

Tokenizing Text

-- Tokenize English text
SELECT tokenize('full text search in PostgreSQL', 'en_tokenizer');

-- Tokenize Chinese text
SELECT tokenize('PostgreSQL是一个强大的数据库系统', 'zh_tokenizer');

Text Analyzer

pg_tokenizer also provides text analyzer functionality that combines tokenization with additional text processing steps. For detailed text analyzer usage, refer to the Text Analyzer documentation.
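
As a rough illustration of what "additional text processing steps" can mean, the shell pipeline below lowercases the input, splits it into word tokens, and drops a few stopwords. This is a conceptual sketch only: the steps and the stopword list are assumptions chosen for the example, and it does not reproduce pg_tokenizer's analyzer configuration or behavior.

```shell
#!/bin/sh
# analyze is a hypothetical stand-in for a text analyzer pipeline:
#   1. lowercase the text
#   2. split on non-alphanumeric characters, one token per line
#   3. drop empty lines and a small, assumed stopword list
analyze() {
  printf '%s\n' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -cs '[:alnum:]' '\n' \
    | grep -v '^$' \
    | grep -vxE 'a|an|the|is|in|of|and'
}

# Join the resulting tokens on one line for display.
analyze 'Full text search in PostgreSQL' | paste -s -d ' ' -
```

For this input the pipeline prints "full text search postgresql": the stopword "in" is removed and the remaining words are lowercased.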

Integration with VectorChord-bm25

pg_tokenizer is typically used together with VectorChord-bm25 (the vchord_bm25 extension) for full BM25 ranking support:

CREATE EXTENSION IF NOT EXISTS pg_tokenizer CASCADE;
CREATE EXTENSION IF NOT EXISTS vchord_bm25 CASCADE;

-- Create a tokenizer
SELECT create_tokenizer('my_tokenizer', $$
model = "llmlingua2"
$$);

-- Tokenize text into bm25vectors for indexing and search
SELECT tokenize('your search query', 'my_tokenizer');

Documentation

For more details, see the full documentation in the project's GitHub repository, tensorchord/pg_tokenizer.rs.
