zhparser
1. Overview
zhparser is a PostgreSQL extension for Chinese full-text search. It implements a Chinese text search parser based on SCWS (Simple Chinese Word Segmentation).
2. Installation
Note: The source installation below uses Ubuntu 24.04 (x86_64), with IvorySQL already installed at /path-to/ivorysql.
2.1. Source Installation
First, install the dependencies

wget http://www.xunsearch.com/scws/down/scws-1.2.3.tar.bz2
tar xf scws-1.2.3.tar.bz2
cd scws-1.2.3

# Compile and install scws
./configure
make
sudo make install

# Use git to download the zhparser source code, using the master branch
git clone https://github.com/amutu/zhparser
cd zhparser

# Compile and install
make PG_CONFIG=/path-to/ivorysql/bin/pg_config
make PG_CONFIG=/path-to/ivorysql/bin/pg_config install
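After the install step, it can help to confirm that the server sees the extension files before running CREATE EXTENSION. The query below is a minimal sketch using the standard PostgreSQL catalog view pg_available_extensions; it assumes IvorySQL exposes this view unchanged.

```sql
-- List zhparser if its control file was installed into this server's share directory.
-- installed_version stays NULL until CREATE EXTENSION has been run in the current database.
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name = 'zhparser';
```

If the query returns no rows, the make install step most likely used a pg_config from a different installation.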
3. Create Extension and Full-Text Search Configuration, Bind Dictionary Rules to Token Pos Tags
Connect to the database using psql, and execute the following commands:
-- create the extension
CREATE EXTENSION zhparser;

-- make test configuration using parser
CREATE TEXT SEARCH CONFIGURATION testzhcfg (PARSER = zhparser);

-- add token mapping
ALTER TEXT SEARCH CONFIGURATION testzhcfg ADD MAPPING FOR n,v,a,i,e,l WITH simple;
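The letters n, v, a, i, e, l in the mapping are token type aliases defined by the zhparser parser. As a small sketch, the built-in PostgreSQL function ts_token_type can list them together with their descriptions, which shows which token types are indexed by the configuration above and which are dropped.

```sql
-- Show the token types that the mapping above sends to the simple dictionary.
SELECT tokid, alias, description
FROM ts_token_type('zhparser')
WHERE alias IN ('n', 'v', 'a', 'i', 'e', 'l')
ORDER BY alias;
```

Removing the WHERE clause lists every token type the parser can emit; types without a mapping produce no lexemes in to_tsvector.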
4. Usage
-- ts_parse
ivorysql=# SELECT * FROM ts_parse('zhparser', 'hello world! 2010年保障房建设在全国范围内获全面启动,从中央到地方纷纷加大 了保障房的建设和投入力度 。2011年,保障房进入了更大规模的建设阶段。住房城乡建设部党组书记、部长姜伟新去年底在全国住房城乡建设工作会议上表示,要继续推进保障性安居工程建设。');
tokid | token
-------+----------
101 | hello
101 | world
117 | !
101 | 2010
113 | 年
118 | 保障
110 | 房建
......
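The tokid column returned by ts_parse is a numeric code. A possible way to make it readable, sketched here with only built-in PostgreSQL functions, is to join the parser output against ts_token_type so that each token is shown with its alias and description. The short input string is chosen only for illustration.

```sql
-- Pair each parsed token with the alias and description of its token type.
SELECT p.tokid, t.alias, t.description, p.token
FROM ts_parse('zhparser', '保障房资金压力') AS p
JOIN ts_token_type('zhparser') AS t USING (tokid);
```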
-- test to_tsvector
ivorysql=# SELECT to_tsvector('testzhcfg','"今年保障房新开工数量虽然有所下调,但实际的年度在建规模以及竣工规模会超以往年份,相对应的对资金的需求也会 创历>史纪录。"陈国强说。在他看来,与2011年相比,2012年的保障房建设在资金配套上的压力将更为严峻。');
to_tsvector
----------------------------------------------------------------------------------------------------
'2011':27 '2012':29 '上':35 '下调':7 '严峻':37 '会':14 '会创':20 '保障':1,30 '压力':36 '史':21 '国强':24 '在建':10 '实际':8 '对应':17 '年份':16 '年度':9 '开工':4 '房':2 '房建':31 '数量':5 '新':3 '有所':6 '相比':28 '看来':26 '竣工':12 '纪录':22 '规模':11,13 '设在':32 '说':25 '资金':18,33 '超':15 '配套':34 '陈':23 '需求':19
(1 row)
-- test to_tsquery
ivorysql=# SELECT to_tsquery('testzhcfg', '保障房资金压力');
to_tsquery
---------------------------------------
'保障' <-> '房' <-> '资金' <-> '压力'
(1 row)
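The functions above are usually combined with a table and an index. The following is a minimal sketch, not part of the zhparser documentation: the table name articles, its columns, and the index name are hypothetical, and the example assumes the testzhcfg configuration created earlier.

```sql
-- Hypothetical table used only for illustration.
CREATE TABLE articles (
    id   serial PRIMARY KEY,
    body text NOT NULL
);

INSERT INTO articles (body) VALUES
    ('2012年的保障房建设在资金配套上的压力将更为严峻。'),
    ('hello world');

-- GIN expression index over the tsvector built with testzhcfg.
CREATE INDEX articles_body_fts_idx
    ON articles USING gin (to_tsvector('testzhcfg', body));

-- Search with the same configuration and expression so the index can be used.
SELECT id, body
FROM articles
WHERE to_tsvector('testzhcfg', body) @@ to_tsquery('testzhcfg', '资金');
```

The WHERE clause repeats the indexed expression exactly, including the configuration name, because PostgreSQL only considers an expression index when the query expression matches it.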
For more detailed usage and advanced features, please refer to https://github.com/amutu/zhparser .