zhparser

1. Overview

zhparser is a PostgreSQL plugin for Chinese full-text search. It implements a Chinese parser based on SCWS (Simple Chinese Word Segmentation).

2. Installation

The source installation below uses Ubuntu 24.04 (x86_64), with IvorySQL already installed at /path-to/ivorysql.

2.1. Source Installation

First, download and unpack the SCWS dependency:

wget http://www.xunsearch.com/scws/down/scws-1.2.3.tar.bz2
tar xf scws-1.2.3.tar.bz2
cd scws-1.2.3

# Compile and install scws
./configure
make
sudo make install

# Use git to download the zhparser source code, using the master branch
git clone https://github.com/amutu/zhparser
cd zhparser

# Compile and install
make PG_CONFIG=/path-to/ivorysql/bin/pg_config
make PG_CONFIG=/path-to/ivorysql/bin/pg_config install
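
One way to confirm that the build placed the extension files where the server can find them is to query the pg_available_extensions view from psql. This is a minimal sketch, assuming a session connected to the IvorySQL instance installed under /path-to/ivorysql:

```sql
-- pg_available_extensions lists every extension whose control file
-- is visible to this server, whether or not it has been created yet.
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name = 'zhparser';
```

If the query returns no row, the install step most likely used a different pg_config than the running server. The installed_version column stays NULL until CREATE EXTENSION has been run in the current database.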

3. Create the Extension and a Full-Text Search Configuration, and Map Token Types to a Dictionary

Connect to the database using psql, and execute the following commands:

-- create the extension
CREATE EXTENSION zhparser;

-- create a test configuration that uses the zhparser parser
CREATE TEXT SEARCH CONFIGURATION testzhcfg (PARSER = zhparser);

-- map the token types n,v,a,i,e,l to the simple dictionary
ALTER TEXT SEARCH CONFIGURATION testzhcfg ADD MAPPING FOR n,v,a,i,e,l WITH simple;
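
The letters in the mapping are token type aliases defined by the parser; they appear to be part-of-speech tags (for example n for nouns, v for verbs, a for adjectives). Rather than relying on that reading, the parser can be asked directly. The sketch below uses the standard PostgreSQL functions ts_token_type and ts_debug and assumes the testzhcfg configuration created above:

```sql
-- List every token type the zhparser parser can emit,
-- with its numeric id, alias, and description.
SELECT * FROM ts_token_type('zhparser');

-- Show how testzhcfg treats each token of a sample string.
-- Tokens whose type has no mapping show an empty dictionary list
-- and are dropped from the resulting tsvector.
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('testzhcfg', '保障房资金压力');
```

Token types that are not listed in ADD MAPPING are ignored at indexing time, so widening or narrowing the list in the ALTER TEXT SEARCH CONFIGURATION statement changes which words become searchable.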

4. Usage

-- test ts_parse
ivorysql=# SELECT * FROM ts_parse('zhparser', 'hello world! 2010年保障房建设在全国范围内获全面启动,从中央到地方纷纷加大 了保障房的建设和投入力度 。2011年,保障房进入了更大规模的建设阶段。住房城乡建设部党组书记、部长姜伟新去年底在全国住房城乡建设工作会议上表示,要继续推进保障性安居工程建设。');
 tokid |  token
-------+----------
   101 | hello
   101 | world
   117 | !
   101 | 2010
   113 | 年
   118 | 保障
   110 | 房建
   ......

-- test to_tsvector
ivorysql=# SELECT to_tsvector('testzhcfg','"今年保障房新开工数量虽然有所下调,但实际的年度在建规模以及竣工规模会超以往年份,相对应的对资金的需求也会 创历>史纪录。"陈国强说。在他看来,与2011年相比,2012年的保障房建设在资金配套上的压力将更为严峻。');

                                             to_tsvector
-----------------------------------------------------------------------------------------------------
 '2011':27 '2012':29 '上':35 '下调':7 '严峻':37 '会':14 '会创':20 '保障':1,30 '压力':36 '史':21 '国强':24 '在建':10 '实际':8 '对应':17 '年份':16 '年度':9 '开工':4 '房':2 '房建':31 '数量':5 '新':3 '有所':6 '相比':28 '看来':26 '竣工':12 '纪录':22 '规模':11,13 '设在':32 '说':25 '资金':18,33 '超':15 '配套':34 '陈':23 '需求':19
(1 row)

-- test to_tsquery
ivorysql=# SELECT to_tsquery('testzhcfg', '保障房资金压力');
              to_tsquery
---------------------------------------
 '保障' <-> '房' <-> '资金' <-> '压力'
(1 row)
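
The functions above are usually combined with the @@ match operator and a GIN index to search a table. The following is a minimal sketch using standard PostgreSQL syntax. The articles table, its columns, and the index name are hypothetical and exist only for illustration; testzhcfg is the configuration created in section 3.

```sql
-- Hypothetical table for illustration; not part of zhparser.
CREATE TABLE articles (
    id   integer PRIMARY KEY,
    body text NOT NULL
);

INSERT INTO articles (id, body) VALUES
    (1, '2012年的保障房建设在资金配套上的压力将更为严峻'),
    (2, 'hello world');

-- Expression index so the tsvector is not recomputed for every row
-- at query time. The configuration name must be given explicitly
-- for the expression to be indexable.
CREATE INDEX articles_body_fts_idx
    ON articles USING gin (to_tsvector('testzhcfg', body));

-- Find rows containing both terms. The WHERE expression must match
-- the indexed expression for the planner to consider the index.
SELECT id, body
FROM articles
WHERE to_tsvector('testzhcfg', body) @@ to_tsquery('testzhcfg', '资金 & 压力');
```

Which rows match depends on how SCWS segments the stored text, so it is worth checking unexpected results with to_tsvector on the row in question.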

For more detailed usage and advanced features, see the zhparser repository: https://github.com/amutu/zhparser