Mmseg Analysis for ElasticSearch

The Mmseg Analysis plugin integrates the Lucene mmseg4j analyzer (http://code.google.com/p/mmseg4j/) into Elasticsearch and supports custom dictionaries.

The plugin ships with a mmseg analyzer, a mmseg tokenizer, and a cut_letter_digit token filter.

Versions

Mmseg ver	ES version
master	1.0.0 -> master
1.4.0	1.7.0
1.3.0	1.6.0
1.2.2	1.0.0
1.2.1	0.90.2
1.2.0	0.90.0
1.1.2	0.20.1
1.1.1	0.19.x

Install

You can download this plugin from the RTF project (https://github.com/medcl/elasticsearch-rtf): https://github.com/medcl/elasticsearch-rtf/tree/master/plugins/analysis-mmseg

Download the dictionary files and unzip them into your Elasticsearch config folder, for example your-es-root/config/mmseg: https://github.com/medcl/elasticsearch-rtf/tree/master/config/mmseg

Restart the Elasticsearch service after that.

Analysis Configuration (elasticsearch.yml)

index:
  analysis:
    analyzer:
      mmseg:
        alias: [news_analyzer, mmseg_analyzer]
        type: org.elasticsearch.index.analysis.MMsegAnalyzerProvider
index.analysis.analyzer.default.type : "mmseg"

Additional parameters can be used to customize the mmseg tokenizer, and a custom analyzer can combine it with the cut_letter_digit token filter:

index:
  analysis:
    tokenizer:
      mmseg_maxword:
        type: mmseg
        seg_type: "max_word"
      mmseg_complex:
        type: mmseg
        seg_type: "complex"
      mmseg_simple:
        type: mmseg
        seg_type: "simple"
    analyzer:
      mmseg_maxword_with_cut_letter_digi:
        type: custom
        filter:
        - lowercase
        - cut_letter_digit
        tokenizer: mmseg_maxword
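
To see how each seg_type splits a piece of text, the tokenizers can be tried through the _analyze API. The following is a minimal sketch, not part of the plugin: it assumes an Elasticsearch 1.x node at http://localhost:9200, an existing index named index, and the query-string form of _analyze (tokenizer=... or analyzer=...). The helper names analyze_url and analyze_with are hypothetical.

```shell
#!/bin/sh
# Assumption: an Elasticsearch 1.x node with the plugin installed listens here.
ES_URL="http://localhost:9200"

# Build the _analyze URL.
#   $1 = index name, $2 = "tokenizer" or "analyzer", $3 = its configured name
analyze_url() {
  printf '%s/%s/_analyze?%s=%s&pretty' "$ES_URL" "$1" "$2" "$3"
}

# Send text ($4) through the chosen tokenizer or analyzer and print the tokens.
analyze_with() {
  curl -s -XGET "$(analyze_url "$1" "$2" "$3")" -d "$4"
}

# Usage, once the plugin and the settings above are in place:
#   for t in mmseg_maxword mmseg_complex mmseg_simple; do
#     analyze_with index tokenizer "$t" '中国驻洛杉矶领事馆'
#   done
#   analyze_with index analyzer mmseg '中国驻洛杉矶领事馆'
```

Comparing the token lists returned for the three tokenizers is a quick way to decide which seg_type suits your data before committing to a mapping.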

Mapping Configuration

Here is a quick example:

1. Create an index

curl -XPUT http://localhost:9200/index

2. Create a mapping

curl -XPOST http://localhost:9200/index/fulltext/_mapping -d'
{
    "fulltext": {
        "_all": {
            "indexAnalyzer": "mmseg",
            "searchAnalyzer": "mmseg",
            "term_vector": "no",
            "store": "false"
        },
        "properties": {
            "content": {
                "type": "string",
                "store": "no",
                "term_vector": "with_positions_offsets",
                "indexAnalyzer": "mmseg",
                "searchAnalyzer": "mmseg",
                "include_in_all": "true",
                "boost": 8
            }
        }
    }
}'

3. Index some documents

curl -XPOST http://localhost:9200/index/fulltext/1 -d'
{"content":"美国留给伊拉克的是个烂摊子吗"}
'

curl -XPOST http://localhost:9200/index/fulltext/2 -d'
{"content":"公安部：各地校车将享最高路权"}
'

curl -XPOST http://localhost:9200/index/fulltext/3 -d'
{"content":"中韩渔警冲突调查：韩警平均每天扣1艘中国渔船"}
'

curl -XPOST http://localhost:9200/index/fulltext/4 -d'
{"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}
'

4. Query with highlighting

curl -XPOST http://localhost:9200/index/fulltext/_search -d'
{
    "query" : { "term" : { "content" : "中国" }},
    "highlight" : {
        "pre_tags" : ["<tag1>", "<tag2>"],
        "post_tags" : ["</tag1>", "</tag2>"],
        "fields" : {
            "content" : {}
        }
    }
}
'

Here is the query result:

{
    "took": 14,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 2,
        "max_score": 2,
        "hits": [
            {
                "_index": "index",
                "_type": "fulltext",
                "_id": "4",
                "_score": 2,
                "_source": {
                    "content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
                },
                "highlight": {
                    "content": [
                        "<tag1>中国</tag1>驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首 "
                    ]
                }
            },
            {
                "_index": "index",
                "_type": "fulltext",
                "_id": "3",
                "_score": 2,
                "_source": {
                    "content": "中韩渔警冲突调查：韩警平均每天扣1艘中国渔船"
                },
                "highlight": {
                    "content": [
                        "均每天扣1艘<tag1>中国</tag1>渔船 "
                    ]
                }
            }
        ]
    }
}
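
The term query in step 4 is not analyzed, so it only matches when the query string is exactly one token produced by mmseg (here, 中国). For longer input, a match query runs the text through the field's search analyzer first. The following is a rough sketch under the same assumptions as the example (node at http://localhost:9200, index named index, type fulltext) and assumes an Elasticsearch version that provides the match query; the helper names match_query and search_content are hypothetical.

```shell
#!/bin/sh
# Assumption: same node, index, and type as in the quick example.
ES_URL="http://localhost:9200"

# Build a match query body with highlighting on the content field.
# Note: the text is inserted as-is, so it must not contain double quotes
# or backslashes (no JSON escaping is done here).
match_query() {
  printf '{"query":{"match":{"content":"%s"}},"highlight":{"fields":{"content":{}}}}' "$1"
}

# Run the query against the fulltext type and pretty-print the response.
search_content() {
  curl -s -XPOST "$ES_URL/index/fulltext/_search?pretty" -d "$(match_query "$1")"
}

# Usage:
#   search_content '中国渔船'
```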

Have fun.
