zhparser
zhparser : a parser for full-text search of Chinese
Overview
| ID | Extension | Package | Version | Category | License | Language |
|---|---|---|---|---|---|---|
| 2130 | zhparser | zhparser | 2.3 | FTS | PostgreSQL | C |
| Attribute | Has Binary | Has Library | Need Load | Has DDL | Relocatable | Trusted |
|---|---|---|---|---|---|---|
| --s-d-r | No | Yes | No | Yes | Yes | No |
| Relationships | |
|---|---|
| See Also | pg_trgm, rum, pg_search, pgroonga, pgroonga_database, pg_bigm, pg_tokenizer, vchord_bm25 |
Packages
| Type | Repo | Version | PG Major Compatibility | Package Pattern | Dependencies |
|---|---|---|---|---|---|
| EXT | PIGSTY | 2.3 | 18, 17, 16, 15, 14 | zhparser | - |
| RPM | PIGSTY | 2.3 | 18, 17, 16, 15, 14 | zhparser_$v | - |
| DEB | PIGSTY | 2.3 | 18, 17, 16, 15, 14 | postgresql-$v-zhparser | - |
| Linux / PG | PG18 | PG17 | PG16 | PG15 | PG14 |
|---|---|---|---|---|---|
| el8.x86_64 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 |
| el8.aarch64 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 |
| el9.x86_64 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 |
| el9.aarch64 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 |
| el10.x86_64 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 |
| el10.aarch64 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 |
| d12.x86_64 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 |
| d12.aarch64 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 |
| d13.x86_64 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 |
| d13.aarch64 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 |
| u22.x86_64 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 |
| u22.aarch64 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 |
| u24.x86_64 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 |
| u24.aarch64 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 | PIGSTY 2.3 |
Source
pig build pkg zhparser; # build rpm/deb
Install
Make sure the PGDG and PIGSTY repos are available:
pig repo add pgsql -u # add both repos and update the cache
Install this extension with pig:
pig install zhparser; # install via package name, for the active PG version
pig install zhparser -v 18; # install for PG 18
pig install zhparser -v 17; # install for PG 17
pig install zhparser -v 16; # install for PG 16
pig install zhparser -v 15; # install for PG 15
pig install zhparser -v 14; # install for PG 14
Create this extension with:
CREATE EXTENSION zhparser;
Usage
zhparser is a PostgreSQL extension for full-text search of Chinese, based on the Simple Chinese Word Segmentation (SCWS) library.
Features
- Chinese text segmentation for PostgreSQL full-text search
- Built on the SCWS (Simple Chinese Word Segmentation) library
- Supports custom dictionaries (TXT and XDB formats)
- Database-level custom word tables (since v2.1)
- Multiple tunable parameters for segmentation behavior
Quick Start
-- Create the extension
CREATE EXTENSION zhparser;
-- Create a text search configuration using zhparser
CREATE TEXT SEARCH CONFIGURATION chinese (PARSER = zhparser);
-- Add token type mappings
ALTER TEXT SEARCH CONFIGURATION chinese ADD MAPPING FOR n,v,a,i,e,l WITH simple;
-- Test Chinese text segmentation
SELECT to_tsvector('chinese', '小明硕士毕业于中国科学院计算所,后在日本京都大学深造');
-- Create a table and index for Chinese full text search
CREATE TABLE articles (id serial PRIMARY KEY, title text, body text);
CREATE INDEX articles_body_idx ON articles
USING gin (to_tsvector('chinese', body));
-- Query with Chinese full text search
SELECT * FROM articles
WHERE to_tsvector('chinese', body) @@ to_tsquery('chinese', '中国');
Configuration Parameters
zhparser provides several GUC parameters to control segmentation behavior:
| Parameter | Default | Description |
|---|---|---|
| `zhparser.punctuation_ignore` | off | Ignore all punctuation |
| `zhparser.seg_with_duality` | off | Perform duality segmentation on long words |
| `zhparser.dict_in_memory` | off | Load the whole dictionary into memory |
| `zhparser.multi_short` | off | Short word compound segmentation |
| `zhparser.multi_duality` | off | Duality compound segmentation |
| `zhparser.multi_zmain` | off | Compound segmentation on key single characters |
| `zhparser.multi_zall` | off | Compound segmentation on all single characters |
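
As a rough illustration of how these parameters might be used, the sketch below toggles two of them for the current session and compares the resulting vectors. It assumes the `chinese` text search configuration from the Quick Start already exists, and it assumes these particular parameters can be changed per session; `zhparser.dict_in_memory` is deliberately left out because it is likely only read when a backend starts.

```sql
-- Baseline segmentation with default settings
SELECT to_tsvector('chinese', '小明硕士毕业于中国科学院计算所,后在日本京都大学深造');

-- Assumption: these parameters are session-settable
SET zhparser.punctuation_ignore = on;  -- drop punctuation tokens
SET zhparser.multi_short = on;         -- also emit shorter sub-words of long words

-- Same input, now segmented with the adjusted settings
SELECT to_tsvector('chinese', '小明硕士毕业于中国科学院计算所,后在日本京都大学深造');

-- Inspect and restore the defaults for this session
SHOW zhparser.multi_short;
RESET zhparser.punctuation_ignore;
RESET zhparser.multi_short;
```

Note that an expression index such as `articles_body_idx` stores vectors computed with the settings active at write time, so changing segmentation parameters later may call for a `REINDEX`.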
Token Types
zhparser supports the following token types from SCWS:
| Code | Description |
|---|---|
| a | Adjective |
| b | Differentiation (区别词) |
| c | Conjunction |
| d | Adverb |
| e | Exclamation |
| f | Position word (方位词) |
| g | Root word (词根) |
| h | Prefix |
| i | Idiom |
| j | Abbreviation |
| k | Suffix |
| l | Temporary idiom |
| m | Numeral |
| n | Noun |
| o | Onomatopoeia |
| p | Preposition |
| q | Classifier |
| r | Pronoun |
| s | Space word (处所词) |
| t | Time word |
| u | Auxiliary |
| v | Verb |
| w | Punctuation |
| x | Unknown |
| y | Modal particle |
| z | Status word (状态词) |
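
To see which of these token types a given text actually produces, PostgreSQL's standard `ts_token_type`, `ts_parse`, and `ts_debug` functions can be pointed at the parser. The following is a minimal sketch; it assumes the extension is installed and, for the last query, that the `chinese` configuration from the Quick Start exists.

```sql
-- List the token types the zhparser parser declares
SELECT tokid, alias, description
FROM ts_token_type('zhparser');

-- Segment a sample sentence and label each token with its type
SELECT t.alias, t.description, p.token
FROM ts_parse('zhparser', '小明硕士毕业于中国科学院计算所') AS p
JOIN ts_token_type('zhparser') AS t USING (tokid);

-- Show how the 'chinese' configuration maps each token to dictionaries;
-- tokens whose type has no mapping end up with NULL lexemes and are not indexed
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('chinese', '小明硕士毕业于中国科学院计算所');
```

Only token types listed in `ADD MAPPING FOR ...` are indexed, so this kind of check is a quick way to decide whether additional codes from the table should be mapped.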
Custom Dictionaries
File-based Dictionaries
Place custom dictionary files in the share directory (typically $SHAREDIR/tsearch_data/):
- TXT format: one word per line
- XDB format: compiled SCWS dictionary format
Custom dictionaries take precedence over built-in dictionaries.
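
This page does not say how the parser is told which dictionary files to load. Upstream zhparser documents a `zhparser.extra_dicts` parameter for this purpose; the sketch below assumes that parameter exists in your version, and the file names are purely hypothetical placeholders.

```sql
-- Assumption: the following line is added to postgresql.conf
-- (file names are hypothetical; files live under $SHAREDIR/tsearch_data/):
--   zhparser.extra_dicts = 'dict_extra.txt,mydict.xdb'

-- Reload the server configuration
SELECT pg_reload_conf();

-- Then, from a NEW connection, confirm the setting is visible
SHOW zhparser.extra_dicts;

-- And check that a word from the custom dictionary is kept as one token
SELECT to_tsvector('chinese', '小明硕士毕业于中国科学院计算所');
```

Dictionaries are likely read when a backend first loads the parser, which is why the check is done from a fresh connection rather than the one that issued the reload.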
Database-level Custom Words (v2.1+)
-- Add custom words via zhparser's built-in table
INSERT INTO zhparser.zhprs_custom_word VALUES ('中国科学院计算所');
-- Reload custom dictionary (reconnect after sync to take effect)
SELECT sync_zhprs_custom_word();
-- Verify segmentation with custom word
SELECT to_tsvector('chinese', '小明硕士毕业于中国科学院计算所');
Docker Quick Start
docker run --name pgzhparser -d \
-e POSTGRES_PASSWORD=somepassword \
zhparser/zhparser:bookworm-16