smlar
smlar: Effective similarity search
Overview
| ID | Extension | Package | Version | Category | License | Language |
|---|---|---|---|---|---|---|
| 1850 | smlar | smlar | 1.0 | RAG | PostgreSQL | C |
| Attribute | Has Binary | Has Library | Need Load | Has DDL | Relocatable | Trusted |
|---|---|---|---|---|---|---|
| --s-d-r | No | Yes | No | Yes | Yes | No |
| Relationships | |
|---|---|
| See Also | pg_similarity, fuzzystrmatch, pg_trgm, intarray, vector, pg_bigm, unaccent, vchord |
Note: the PG 18 build break is fixed in https://github.com/Vonng/smlar
Packages
| Type | Repo | Version | PG Major Compatibility | Package Pattern | Dependencies |
|---|---|---|---|---|---|
| EXT | PIGSTY | 1.0 | 18, 17, 16, 15, 14 | smlar | - |
| RPM | PIGSTY | 1.0 | 18, 17, 16, 15, 14 | smlar_$v | - |
| DEB | PIGSTY | 1.0 | 18, 17, 16, 15, 14 | postgresql-$v-smlar | - |
| Linux / PG | PG18 | PG17 | PG16 | PG15 | PG14 |
|---|---|---|---|---|---|
| el8.x86_64 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 |
| el8.aarch64 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 |
| el9.x86_64 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 |
| el9.aarch64 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 |
| el10.x86_64 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 |
| el10.aarch64 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 |
| d12.x86_64 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 |
| d12.aarch64 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 |
| d13.x86_64 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 |
| d13.aarch64 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 |
| u22.x86_64 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 |
| u22.aarch64 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 |
| u24.x86_64 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 |
| u24.aarch64 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 | PIGSTY 1.0 |
Source
pig build pkg smlar; # build rpm/deb
Install
Make sure the PGDG and PIGSTY repos are available:
pig repo add pgsql -u # add both repos and update the cache
Install this extension with pig:
pig install smlar; # install via package name, for the active PG version
pig install smlar -v 18; # install for PG 18
pig install smlar -v 17; # install for PG 17
pig install smlar -v 16; # install for PG 16
pig install smlar -v 15; # install for PG 15
pig install smlar -v 14; # install for PG 14
Create this extension with:
CREATE EXTENSION smlar;
Usage
smlar: Effective similarity search for PostgreSQL arrays. Source: README
The smlar extension provides effective similarity search on PostgreSQL arrays using configurable similarity formulas, GiST and GIN index support, and TF/IDF weighting.
Functions
float4 smlar(anyarray, anyarray)
Computes the similarity of two arrays. Both arrays must be of the same type.
float4 smlar(anyarray, anyarray, bool useIntersect)
Computes the similarity of two arrays of composite types. The composite type looks like:
CREATE TYPE type_name AS (element_name anytype, weight_name FLOAT4);
The useIntersect option specifies that only the intersecting elements are used in the denominator.
float4 smlar(anyarray a, anyarray b, text formula)
Computes the similarity of two arrays using the given formula. Predefined variables in the formula:
- N.i – number of common elements in both arrays (intersection)
- N.a – number of unique elements in the first array
- N.b – number of unique elements in the second array
Example:
SELECT smlar('{1,4,6}'::int[], '{5,4,6}');
SELECT smlar('{1,4,6}'::int[], '{5,4,6}', 'N.i / sqrt(N.a * N.b)');
-- These two calls are equivalent.
anyarray % anyarray
Returns true if the similarity of the arrays is greater than the threshold.
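To make the arithmetic of the cosine formula concrete, here is a minimal pure-Python sketch. It is a model of the formula N.i / sqrt(N.a * N.b) as described above, not the extension's C implementation. The names `cosine_similarity` and `is_similar` are hypothetical, the handling of empty arrays is an assumption, and the strict "greater than" comparison simply follows the wording of the % description.

```python
import math

def cosine_similarity(a, b):
    """Model of N.i / sqrt(N.a * N.b) over the unique elements of two arrays."""
    ua, ub = set(a), set(b)
    if not ua or not ub:
        return 0.0  # assumption: an empty array yields zero similarity
    n_i = len(ua & ub)  # N.i: elements common to both arrays
    return n_i / math.sqrt(len(ua) * len(ub))  # N.a and N.b: unique counts

def is_similar(a, b, threshold=0.6):
    """Model of the % operator: similarity strictly greater than the threshold."""
    return cosine_similarity(a, b) > threshold

sim = cosine_similarity([1, 4, 6], [5, 4, 6])
print(round(sim, 4))                     # 2 / sqrt(3 * 3) = 0.6667
print(is_similar([1, 4, 6], [5, 4, 6]))  # True with threshold 0.6
```

For the arrays in the example, two of the three unique elements are shared, so the value is 2/3.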
text[] tsvector2textarray(tsvector)
Transforms a tsvector into a text array.
anyarray array_unique(anyarray)
Sorts the array and removes duplicate elements.
float4 inarray(anyarray, anyelement)
Returns 0 if the second argument is not present in the first, and 1.0 otherwise.
float4 inarray(anyarray, anyelement, float4, float4)
Returns the fourth argument if the second argument is not present in the first, and the third argument otherwise.
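The behavior described for array_unique and the two inarray variants can be modeled in a few lines of Python. This is an illustrative sketch of the documented semantics only; the parameter names `found` and `missing` are hypothetical labels for the third and fourth arguments.

```python
def array_unique(arr):
    """Return the distinct elements of arr in sorted order."""
    return sorted(set(arr))

def inarray(arr, elem, found=1.0, missing=0.0):
    """Return `found` if elem occurs in arr, otherwise `missing`.

    With the defaults this models the two-argument form (1.0 or 0);
    passing both values models the four-argument form.
    """
    return found if elem in arr else missing

print(array_unique([3, 1, 3, 2, 1]))     # [1, 2, 3]
print(inarray([1, 2, 3], 2))             # 1.0
print(inarray([1, 2, 3], 9))             # 0.0
print(inarray([1, 2, 3], 9, 0.8, 0.1))   # 0.1
```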
GUC Configuration Variables
smlar.threshold FLOAT
Arrays with similarity lower than the threshold are not considered similar by the % operation.
smlar.persistent_cache BOOL
The cache of global statistics is stored in transaction-independent memory.
smlar.type STRING
Type of similarity formula: cosine (default), tfidf, overlap.
smlar.stattable STRING
Name of the table storing set-wide statistics. The table should be defined as:
CREATE TABLE table_name (
value data_type UNIQUE,
ndoc int4 NOT NULL CHECK (ndoc > 0) -- int4 or bigint
);
A row with a NULL value holds the total number of documents. Used only for smlar.type = 'tfidf'.
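As an illustration of what such a statistics table might contain, the following Python sketch derives (value, ndoc) pairs from a small collection of arrays. It assumes that ndoc is the number of documents containing the value (document frequency), which the column name suggests but the text above does not spell out. The function `build_stats` is hypothetical, and `None` stands in for the NULL row.

```python
from collections import Counter

def build_stats(documents):
    """Return {value: ndoc}; the None key holds the total number of documents."""
    ndoc = Counter()
    for doc in documents:
        ndoc.update(set(doc))  # count each value at most once per document
    stats = dict(ndoc)
    stats[None] = len(documents)  # models the row whose value is NULL
    return stats

docs = [["a", "b", "b"], ["b", "c"], ["a", "b"]]
stats = build_stats(docs)
print(stats["a"], stats["b"], stats["c"], stats[None])  # 2 3 1 3
```

Every ndoc here is at least 1, consistent with the CHECK (ndoc > 0) constraint in the table definition.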
smlar.tf_method STRING
Calculation method for term frequency. Values:
- "n" – simple counting of entries (default)
- "log" – 1 + log(n)
- "const" – TF is equal to 1
Used only for smlar.type = 'tfidf'.
smlar.idf_plus_one BOOL
If false (default), idf is calculated as log(d/df). If true, as log(1 + d/df). Used only for smlar.type = 'tfidf'.
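The TF and IDF options above reduce to a few one-line formulas. The sketch below evaluates them in Python under two assumptions that the text does not confirm: that log is the natural logarithm, and that n is at least 1. The function names are hypothetical, d is the total number of documents, and df is the number of documents containing the term.

```python
import math

def term_frequency(n, method="n"):
    """Model of smlar.tf_method for a term that occurs n times (n >= 1)."""
    if method == "n":
        return float(n)           # simple counting of entries
    if method == "log":
        return 1.0 + math.log(n)  # assumption: natural logarithm
    if method == "const":
        return 1.0                # TF is always 1
    raise ValueError(f"unknown tf method: {method!r}")

def inverse_document_frequency(d, df, plus_one=False):
    """Model of smlar.idf_plus_one: log(d/df) or log(1 + d/df)."""
    ratio = d / df
    return math.log(1.0 + ratio) if plus_one else math.log(ratio)

print(round(term_frequency(3, "log"), 4))                       # 2.0986
print(round(inverse_document_frequency(100, 10), 4))            # 2.3026
print(round(inverse_document_frequency(100, 10, True), 4))      # 2.3979
```

Note that with idf_plus_one disabled, a term present in every document (df = d) gets an idf of log(1) = 0, whereas the plus-one variant keeps it positive.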
It is highly recommended to add the following to postgresql.conf:
smlar.threshold = 0.6 # or any other value > 0 and < 1
GiST/GIN Index Support
The % and && operations are supported with GiST and GIN indexes for many array types:
| Array Type | GIN operator class | GiST operator class |
|---|---|---|
| bit[] | _bit_sml_ops | |
| bytea[] | _bytea_sml_ops | _bytea_sml_ops |
| char[] | _char_sml_ops | _char_sml_ops |
| cidr[] | _cidr_sml_ops | _cidr_sml_ops |
| date[] | _date_sml_ops | _date_sml_ops |
| float4[] | _float4_sml_ops | _float4_sml_ops |
| float8[] | _float8_sml_ops | _float8_sml_ops |
| inet[] | _inet_sml_ops | _inet_sml_ops |
| int2[] | _int2_sml_ops | _int2_sml_ops |
| int4[] | _int4_sml_ops | _int4_sml_ops |
| int8[] | _int8_sml_ops | _int8_sml_ops |
| interval[] | _interval_sml_ops | _interval_sml_ops |
| macaddr[] | _macaddr_sml_ops | _macaddr_sml_ops |
| money[] | _money_sml_ops | |
| numeric[] | _numeric_sml_ops | _numeric_sml_ops |
| oid[] | _oid_sml_ops | _oid_sml_ops |
| text[] | _text_sml_ops | _text_sml_ops |
| time[] | _time_sml_ops | _time_sml_ops |
| timestamp[] | _timestamp_sml_ops | _timestamp_sml_ops |
| timestamptz[] | _timestamptz_sml_ops | _timestamptz_sml_ops |
| timetz[] | _timetz_sml_ops | _timetz_sml_ops |
| varbit[] | _varbit_sml_ops | |
| varchar[] | _varchar_sml_ops | _varchar_sml_ops |
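The operator class names in the table follow a regular pattern, so the matching CREATE INDEX statement can be composed mechanically. The Python helper below is a hypothetical sketch: `create_index_sql`, the index naming scheme, and the table and column names are all assumptions for illustration. It uses the standard PostgreSQL form `CREATE INDEX ... USING method (column opclass)` and performs no identifier quoting, so it is only suitable for trusted, simple names.

```python
# Element types that, per the table above, have a GIN class but no GiST class.
GIN_ONLY = {"bit", "money", "varbit"}

def create_index_sql(table, column, elem_type, method="gist"):
    """Compose a CREATE INDEX statement using the _<type>_sml_ops naming pattern."""
    method = method.lower()
    if method not in ("gist", "gin"):
        raise ValueError(f"unsupported index method: {method!r}")
    if method == "gist" and elem_type in GIN_ONLY:
        raise ValueError(f"no GiST operator class listed for {elem_type}[]")
    opclass = f"_{elem_type}_sml_ops"
    index_name = f"{table}_{column}_{method}_idx"  # hypothetical naming scheme
    return f"CREATE INDEX {index_name} ON {table} USING {method} ({column} {opclass});"

print(create_index_sql("docs", "tags", "text", "gin"))
# CREATE INDEX docs_tags_gin_idx ON docs USING gin (tags _text_sml_ops);
```

The helper only covers the type names that appear in the table; it does not check that a given type is actually listed.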