Kooooooma/elasticsearch-analysis-mmseg

Mmseg Analysis for ElasticSearch

The Mmseg Analysis plugin integrates the Lucene mmseg4j analyzer (http://code.google.com/p/mmseg4j/) into Elasticsearch and supports customized dictionaries.

The plugin ships with a mmseg analyzer, a mmseg tokenizer, and a cut_letter_digit token filter.

Versions

Mmseg version    ES version
master           1.0.0 -> master
1.4.0            1.7.0
1.3.0            1.6.0
1.2.2            1.0.0
1.2.1            0.90.2
1.2.0            0.90.0
1.1.2            0.20.1
1.1.1            0.19.x

Install

You can download this plugin from the RTF project (https://github.com/medcl/elasticsearch-rtf): https://github.com/medcl/elasticsearch-rtf/tree/master/plugins/analysis-mmseg

Download the dictionary files from https://github.com/medcl/elasticsearch-rtf/tree/master/config/mmseg and unzip them into your Elasticsearch config folder, for example your-es-root/config/mmseg.

Restart the Elasticsearch service after that.
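
Before restarting, it can help to confirm that the dictionary files landed in the expected folder. The following is a minimal sketch, assuming the dictionaries use the `.dic` extension and that `your-es-root` is replaced with the real installation path. The helper name `list_dictionaries` is hypothetical and not part of the plugin.

```python
from pathlib import Path


def list_dictionaries(mmseg_dir):
    """Return the sorted names of .dic files found in mmseg_dir.

    Assumption: dictionary files end in ".dic". Returns an empty list
    when the directory does not exist.
    """
    directory = Path(mmseg_dir)
    if not directory.is_dir():
        return []
    return sorted(p.name for p in directory.glob("*.dic"))


if __name__ == "__main__":
    # Placeholder path taken from the install note; adjust to your setup.
    found = list_dictionaries("your-es-root/config/mmseg")
    if found:
        print("dictionary files:", ", ".join(found))
    else:
        print("no .dic files found; check the unzip location")
```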

Analysis Configuration (elasticsearch.yml)

index:
  analysis:
    analyzer:
      mmseg:
        alias: [news_analyzer, mmseg_analyzer]
        type: org.elasticsearch.index.analysis.MMsegAnalyzerProvider
index.analysis.analyzer.default.type : "mmseg"

Additional parameters that can be used to customize the mmseg tokenizer:

index:
  analysis:
    tokenizer:
      mmseg_maxword:
          type: mmseg
          seg_type: "max_word"
      mmseg_complex:
          type: mmseg
          seg_type: "complex"
      mmseg_simple:
          type: mmseg
          seg_type: "simple"
    analyzer:
      # custom analyzer built on the mmseg_maxword tokenizer
      mmseg_maxword_with_cut_letter_digi:
        type: custom
        tokenizer: mmseg_maxword
        filter:
        - lowercase
        - cut_letter_digit
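
Elasticsearch also accepts analysis settings per index, as a JSON body sent when the index is created, instead of globally in elasticsearch.yml. The sketch below only builds such a body and sends nothing over the network. It reuses the tokenizer and analyzer names from the YAML, and places the `custom` entry under `analyzer`, which is where Elasticsearch expects custom analyzers. The function name `build_analysis_settings` is illustrative, and whether a given plugin version honors index-level settings should be checked against your installation.

```python
import json


def build_analysis_settings():
    """Build an index-creation body with the mmseg tokenizers and analyzer."""
    tokenizers = {
        "mmseg_maxword": {"type": "mmseg", "seg_type": "max_word"},
        "mmseg_complex": {"type": "mmseg", "seg_type": "complex"},
        "mmseg_simple": {"type": "mmseg", "seg_type": "simple"},
    }
    analyzers = {
        "mmseg_maxword_with_cut_letter_digi": {
            "type": "custom",
            "tokenizer": "mmseg_maxword",
            "filter": ["lowercase", "cut_letter_digit"],
        }
    }
    return {
        "settings": {
            "analysis": {"tokenizer": tokenizers, "analyzer": analyzers}
        }
    }


if __name__ == "__main__":
    # Print the JSON that could be used as the body of the index-creation request.
    print(json.dumps(build_analysis_settings(), indent=2))
```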

Mapping Configuration

Here is a quick example:

1. Create an index

curl -XPUT http://localhost:9200/index

2. Create a mapping

curl -XPOST http://localhost:9200/index/fulltext/_mapping -d'
{
    "fulltext": {
        "_all": {
            "indexAnalyzer": "mmseg",
            "searchAnalyzer": "mmseg",
            "term_vector": "no",
            "store": "false"
        },
        "properties": {
            "content": {
                "type": "string",
                "store": "no",
                "term_vector": "with_positions_offsets",
                "indexAnalyzer": "mmseg",
                "searchAnalyzer": "mmseg",
                "include_in_all": "true",
                "boost": 8
            }
        }
    }
}'

3. Index some documents

curl -XPOST http://localhost:9200/index/fulltext/1 -d'
{"content":"美国留给伊拉克的是个烂摊子吗"}
'

curl -XPOST http://localhost:9200/index/fulltext/2 -d'
{"content":"公安部:各地校车将享最高路权"}
'

curl -XPOST http://localhost:9200/index/fulltext/3 -d'
{"content":"中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"}
'

curl -XPOST http://localhost:9200/index/fulltext/4 -d'
{"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}
'
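
The four requests above could also be combined into a single call to the Elasticsearch `_bulk` endpoint, which takes newline-delimited JSON: one action line followed by one source line per document, with a trailing newline at the end. The sketch below only builds that body. The names `DOCS` and `build_bulk_body` are illustrative, and the index and type names are the ones used in this example.

```python
import json

# The four example documents, keyed by document id.
DOCS = {
    "1": "美国留给伊拉克的是个烂摊子吗",
    "2": "公安部:各地校车将享最高路权",
    "3": "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船",
    "4": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首",
}


def build_bulk_body(docs, index="index", doc_type="fulltext"):
    """Return a newline-delimited JSON body for the _bulk endpoint."""
    lines = []
    for doc_id, content in docs.items():
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        # ensure_ascii=False keeps the Chinese text readable in the payload.
        lines.append(json.dumps({"content": content}, ensure_ascii=False))
    # The bulk format requires the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_bulk_body(DOCS), end="")
```

The resulting string would be encoded as UTF-8 and sent with POST to http://localhost:9200/_bulk.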

4. Query with highlighting

curl -XPOST http://localhost:9200/index/fulltext/_search -d'
{
    "query" : { "term" : { "content" : "中国" }},
    "highlight" : {
        "pre_tags" : ["<tag1>", "<tag2>"],
        "post_tags" : ["</tag1>", "</tag2>"],
        "fields" : {
            "content" : {}
        }
    }
}
'
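
For scripted use, the same search can be prepared with the Python standard library. This is a sketch under the assumptions of the example (a node at localhost:9200, index `index`, type `fulltext`). The function name `build_search_request` is hypothetical, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_search_request(term, base_url="http://localhost:9200"):
    """Build a POST request for a term query on "content" with highlighting."""
    body = {
        "query": {"term": {"content": term}},
        "highlight": {
            "pre_tags": ["<tag1>", "<tag2>"],
            "post_tags": ["</tag1>", "</tag2>"],
            "fields": {"content": {}},
        },
    }
    data = json.dumps(body, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        base_url + "/index/fulltext/_search",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_search_request("中国")
    print(request.get_method(), request.full_url)
    # To run the search against a live node:
    # with urllib.request.urlopen(request) as response:
    #     print(response.read().decode("utf-8"))
```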

Here is the query result:


{
    "took": 14,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 2,
        "max_score": 2,
        "hits": [
            {
                "_index": "index",
                "_type": "fulltext",
                "_id": "4",
                "_score": 2,
                "_source": {
                    "content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
                },
                "highlight": {
                    "content": [
                        "<tag1>中国</tag1>驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首 "
                    ]
                }
            },
            {
                "_index": "index",
                "_type": "fulltext",
                "_id": "3",
                "_score": 2,
                "_source": {
                    "content": "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"
                },
                "highlight": {
                    "content": [
                        "均每天扣1艘<tag1>中国</tag1>渔船 "
                    ]
                }
            }
        ]
    }
}
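
A response of this shape can be reduced to the document ids and their highlighted fragments. The sketch below works on a trimmed copy of the result above. The names `SAMPLE_RESPONSE` and `extract_highlights` are illustrative.

```python
# Trimmed copy of the query result shown above (only the fields used here).
SAMPLE_RESPONSE = {
    "hits": {
        "total": 2,
        "hits": [
            {
                "_id": "4",
                "highlight": {
                    "content": ["<tag1>中国</tag1>驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首 "]
                },
            },
            {
                "_id": "3",
                "highlight": {
                    "content": ["均每天扣1艘<tag1>中国</tag1>渔船 "]
                },
            },
        ],
    }
}


def extract_highlights(response, field="content"):
    """Return a list of (document id, highlighted fragments) pairs."""
    results = []
    for hit in response["hits"]["hits"]:
        # A hit without highlighting for this field yields an empty list.
        fragments = hit.get("highlight", {}).get(field, [])
        results.append((hit["_id"], fragments))
    return results


if __name__ == "__main__":
    for doc_id, fragments in extract_highlights(SAMPLE_RESPONSE):
        print(doc_id, fragments)
```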

Have fun.
