cominvent/querqy

Querqy

Querqy is a framework for query preprocessing in Java-based search engines. It comes with a powerful, rule-based preprocessor named 'Common Rules Preprocessor', which provides query-time synonyms, query-dependent boosting and down-ranking, and query-dependent filters. While the Common Rules Preprocessor is not specific to any search engine, Querqy provides a plugin to run it within the Solr search engine.

Getting started: setting up Common Rules under Solr

Getting Querqy and deploying it to Solr

Querqy versions 1.x.x work with Solr 4.10.x, while Querqy versions 2.x.x require the following Solr 5 versions:

  • Querqy 2.0.x to 2.5.x - Solr 5.0
  • Querqy 2.6.x to 2.7.x - Solr 5.1
  • Querqy 2.8.x - Solr 5.3.x
  • Querqy 2.9.x - Solr 5.4.x

You can download a .jar file that includes Querqy and all required dependencies from Bintray (https://bintray.com/renekrie/maven/querqy); the file is located at querqy/querqy-solr/<version>/querqy-solr-<version>-jar-with-dependencies.jar. Put it into Solr's lib folder.

Alternatively, if you already have a Maven build for your Solr plugins, you can add the artifact 'querqy-solr' as a dependency to your pom.xml:

<!-- Add the Querqy repository URL -->
<repository>
    <id>querqy-repo</id>
    <name>Querqy repo</name>
    <url>http://dl.bintray.com/renekrie/maven</url>
</repository>

<!-- Add the querqy-solr dependency -->
<dependencies>
	<dependency>
		<groupId>querqy</groupId>
		<artifactId>querqy-solr</artifactId>
		<version>...</version>
	</dependency>
</dependencies>
     

Configuring Solr for Querqy

Querqy provides a QParserPlugin and a search component that need to be configured in file solrconfig.xml of your Solr core:

<!-- 
    Add the Querqy query parser. 
 -->
<queryParser name="querqy" class="querqy.solr.DefaultQuerqyDismaxQParserPlugin">

    <!--
        Querqy has to parse the user's query text into a query object.
        We use WhiteSpaceQuerqyParser, which only provides a very
        limited syntax (no field names, just -/+ as boolean
        operators). 
        
        Note that the Querqy query parser must not be confused with
        Solr or Lucene query parsers: it is completely independent
        from Lucene/Solr and parses the input into Querqy's internal
        query object model.
    -->
    <lst name="parser">
      <str name="factory">querqy.solr.SimpleQuerqyQParserFactory</str>
      <!-- 
        The parser is provided by a factory, in our case
        by a SimpleQuerqyQParserFactory, which is a very generic 
        factory that just creates an instance for the configured class:
      -->
      <str name="class">querqy.parser.WhiteSpaceQuerqyParser</str>
    </lst>
     	 
	
	<!--
		Define a chain of query rewriters. We'll use just one rewriter
		- SimpleCommonRulesRewriter - which provides 'Common Rules'
		preprocessing.
    --> 
    <lst name="rewriteChain">
    
        <lst name="rewriter">
            <str name="class">querqy.solr.SimpleCommonRulesRewriterFactory</str>
            <!-- 
           	   The file that contains rules for synonyms, 
           	   boosting etc.
            -->
            <str name="rules">rules.txt</str>
            <!--
           	   If true, case will be ignored while trying to find
           	   rules that match the user query input: 
            -->
            <bool name="ignoreCase">true</bool>
            <!-- 
                Some rules in the rules file declare boost queries,
                synonym queries or filter queries that need to be added
                to the user query. This query parser parses the 
                additional queries from the rules file:
            -->
            <str name="querqyParser">querqy.parser.WhiteSpaceQuerqyParserFactory</str>
        </lst>
       
        <!--
            You can add further rewriters to the chain. For example, 
            you could add a second SimpleCommonRulesRewriter for
            a different group of rules, which would consume the 
            output of the first rewriter. Or you might add a completely
            different rewriter implementation, like the ShingleRewriter,
            that would combine pairs of tokens of the query input and
            add the concatenated forms as synonyms.
        -->
        <!--
        <lst name="rewriter">
            <str name="class">querqy.solr.contrib.ShingleRewriterFactory</str>
            <bool name="acceptGeneratedTerms">false</bool>
        </lst>
        -->
       
   </lst>
     	 
</queryParser>

<!-- Override the default QueryComponent -->
<searchComponent name="query" class="querqy.solr.QuerqyQueryComponent"/>

Making requests to Solr using Querqy

You can activate the Querqy query parser in Solr by setting the defType request parameter - in other words, just like you would enable any other query parser in a Solr search request:

defType=querqy

Alternatively, you can activate and control Querqy using local parameters.

You'll have to set further parameters for Querqy to process the query. These parameters are the same as for the Extended DisMax Query Parser, including the DisMax parameters, with the following exceptions:

  • q.alt - not implemented yet
  • uf, lowercaseOperators and query field aliasing - not implemented. Work on this will depend on the availability of a Querqy-internal query parser that accepts field names and boolean (non-prefix) operators. The currently recommended parser, the querqy.parser.WhiteSpaceQuerqyParser, does not provide these features, though field names and boolean operators are part of Querqy's internal query object model.
  • stopwords - no plans to implement

Example:

q=personal computer&defType=querqy&qf=name^2.0 description^0.5&pf=name

With the exception of the defType parameter, this query looks like a standard ExtendedDisMax query, and if you haven't configured any rules for the query 'personal computer', the results and their order would be the same as for ExtendedDisMax. If, on the other hand, you have configured a rule

personal computer =>
    SYNONYM: pc

Querqy would also search for 'pc' in the 'name' and 'description' fields. You'll learn how to write such rules in the next section.
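
The example request above could be assembled from a client as in the following minimal Python sketch. The host, port and core name (`localhost:8983`, `mycore`) and the helper name `build_querqy_url` are assumptions made for illustration; only the request parameters themselves come from this README.

```python
from urllib.parse import urlencode


def build_querqy_url(select_url, user_query, qf, pf=None):
    """Build a Solr select URL that activates the Querqy query parser.

    select_url is the address of the core's select handler (hypothetical here).
    """
    params = {"q": user_query, "defType": "querqy", "qf": qf}
    if pf is not None:
        params["pf"] = pf
    # urlencode takes care of escaping spaces and the '^' boost marker
    return select_url + "?" + urlencode(params)


url = build_querqy_url(
    "http://localhost:8983/solr/mycore/select",  # assumed host and core name
    "personal computer",
    qf="name^2.0 description^0.5",
    pf="name",
)
print(url)
```

Everything except `defType=querqy` is built exactly as it would be for an ExtendedDisMax request.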

Querqy has the following optional parameters in addition to those shared with the ExtendedDisMax query parser (you can safely skip this list for the moment):

| Name | Meaning | Value | Example | Default value |
| --- | --- | --- | --- | --- |
| gqf | "generated query fields" - where to query generated terms like synonyms, boost queries etc. | space-separated list of field names and boost factors | gqf=name^1.1 color^0.9 | use values from param qf |
| gfb | "generated field boost" - a global boost factor that is multiplied with field-specific boosts of generated fields (use this to quickly give a lower boost to all generated terms and queries) | decimal number (float) | gfb=0.8 | 1.0 |
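
To make the interaction of gqf and gfb concrete, here is a small Python sketch of the arithmetic described above: the field-specific boost of each generated field is multiplied by gfb, and qf is used when gqf is absent. The function names are hypothetical, and the assumption that a field without an explicit boost counts as 1.0 is ours, not something this README states.

```python
import math


def parse_field_boosts(spec, default_boost=1.0):
    """Parse a space-separated list like 'name^1.1 color^0.9' into a dict."""
    boosts = {}
    for entry in spec.split():
        name, _, boost = entry.partition("^")
        # Assumption: a field listed without '^boost' gets default_boost
        boosts[name] = float(boost) if boost else default_boost
    return boosts


def effective_generated_boosts(qf, gqf=None, gfb=1.0):
    """Return the boost applied to generated terms per field."""
    fields = parse_field_boosts(gqf if gqf else qf)  # gqf falls back to qf
    return {name: boost * gfb for name, boost in fields.items()}


boosts = effective_generated_boosts(
    qf="name^2.0 description^0.5", gqf="name^1.1 color^0.9", gfb=0.8
)
print(boosts)
```

With these values, generated terms would be searched in 'name' with a boost of roughly 0.88 and in 'color' with roughly 0.72, while the user's own terms keep the boosts from qf.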

Configuring rules

The rules for the 'Common Rules Rewriter' are maintained in the file that you configured as attribute 'rules' for the SimpleCommonRulesRewriterFactory, i.e. file rules.txt in the following example configuration:

<queryParser name="querqy" class="querqy.solr.DefaultQuerqyDismaxQParserPlugin">

    <lst name="rewriteChain">
    
        <lst name="rewriter">
            <str name="class">querqy.solr.SimpleCommonRulesRewriterFactory</str>
            <!-- 
           	   The file that contains rules for synonyms, 
           	   boosting etc.
            -->
            <str name="rules">rules.txt</str>

Note that the expected character encoding is UTF-8 and that the maximum size of this file is 1 MB if Solr runs as SolrCloud and if you didn't change the maximum file size in Zookeeper (see this issue on GitHub).

Input matching

The first line of a rule declaration defines the matching criteria for the input query. This line must end in an arrow (=>). The next line defines an instruction that shall be applied if the input matches. The same input line can be used for multiple instructions, one per line:

# if the input contains 'personal computer', add two synonyms, 'pc' and
# 'desktop computer', and rank down by factor 50 documents that 
# match 'software':
personal computer =>
    SYNONYM: pc
    SYNONYM: desktop computer
    DOWN(50): software

Querqy applies the above rule if it can find the matching criteria 'personal computer' anywhere in the query, provided that there is no other term between 'personal' and 'computer'. It would thus also match the input 'cheap personal computer'. If you want to match the input exactly, or at the beginning or end of the input, you have to mark the input boundaries using double quotation marks:

# only match the exact query 'personal computer'.
"personal computer" => 
    ....
    
# only match queries starting with 'personal computer'
"personal computer =>
    ....

# only match queries ending with 'personal computer'
personal computer" =>
    ....
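
The matching behaviour described above can be modelled in a few lines of Python. This is only an illustrative sketch of the semantics (contiguous token match, optional boundary markers, optional case folding), not Querqy's actual implementation; the function name `matches` is made up for this example.

```python
def matches(rule_input, query, ignore_case=False):
    """Return True if the rule's input declaration matches the user query."""
    left_anchored = rule_input.startswith('"')   # "personal computer ...
    right_anchored = rule_input.endswith('"')    # ... personal computer"
    pattern = rule_input.strip('"').split()
    tokens = query.split()
    if ignore_case:
        pattern = [t.lower() for t in pattern]
        tokens = [t.lower() for t in tokens]
    n = len(pattern)
    if n == 0 or n > len(tokens):
        return False
    for start in range(len(tokens) - n + 1):
        # The input terms must occur next to each other, in order
        if tokens[start:start + n] != pattern:
            continue
        if left_anchored and start != 0:
            continue
        if right_anchored and start + n != len(tokens):
            continue
        return True
    return False


print(matches('personal computer', 'cheap personal computer'))    # unanchored
print(matches('"personal computer"', 'cheap personal computer'))  # exact only
```

The first call matches because the input may occur anywhere in the query; the second does not because both boundaries are marked.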

Each input token is matched exactly. Matching is case-sensitive by default, but you can make it case-insensitive in the configuration:

        <lst name="rewriter">
            <str name="class">querqy.solr.SimpleCommonRulesRewriterFactory</str>
            <str name="rules">rules.txt</str>
            <!--
           	   If true, case will be ignored while trying to find
           	   rules that match the user query input: 
            -->
            <bool name="ignoreCase">true</bool>

There is no stemming or fuzzy matching applied to the input. If you want to make 'pc' a synonym for both 'personal computer' and 'personal computers', you will have to declare two rules:

personal computer =>
    SYNONYM: pc

personal computers =>
    SYNONYM: pc


You can use a wildcard at the very end of the input declaration:

sofa* =>
    SYNONYM: sofa $1

The above rule matches if the input contains a token that starts with 'sofa' and adds a synonym consisting of 'sofa' plus the string matched by the wildcard to the query. For example, a user query 'sofabed' would yield the synonym 'sofa bed'.

The wildcard matches 1 (!) or more characters. It is not intended as a replacement for stemming but to provide some support for decompounding in languages like German, where compounding is very productive. For example, compounds of the structure 'material + product type' and 'intended audience + product type' are very common in German. Wildcards in Querqy can help to decompound them and make it possible to search for the components across multiple fields:

# match queries like 'kinderschuhe' (= kids' shoes) and 
# 'kinderjacke' (= kids' jacket) and search for 
# 'kinder schuhe'/'kinder jacke' etc. in all search fields
kinder* =>
	SYNONYM: kinder $1

Wildcard matching can be used for all rule types. There are some restrictions in the current wildcard implementation, which might be removed in the future:

  • Synonyms are the only rule type that can pick up the '$1' placeholder.
  • The wildcard can only occur at the very end of the input matching.
  • It cannot be combined with the right-hand input boundary marker (...").
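
The wildcard and '$1' behaviour for synonym rules can be pictured with the following Python sketch. It is a simplified model for a single query token, written for illustration only; `expand_wildcard` is a hypothetical helper, not part of Querqy.

```python
def expand_wildcard(rule_input, synonym_template, token):
    """Return the synonym for a token matched by a trailing-wildcard rule.

    Returns None if the token does not match. The wildcard has to match
    at least one character, so 'sofa*' does not match the token 'sofa'.
    """
    if not rule_input.endswith("*"):
        raise ValueError("the wildcard is only allowed at the very end")
    prefix = rule_input[:-1]
    if not token.startswith(prefix) or len(token) == len(prefix):
        return None
    matched = token[len(prefix):]  # the part covered by the wildcard
    return synonym_template.replace("$1", matched)


print(expand_wildcard("sofa*", "sofa $1", "sofabed"))
print(expand_wildcard("kinder*", "kinder $1", "kinderschuhe"))
```

The two calls produce 'sofa bed' and 'kinder schuhe' respectively, while a query token that is exactly 'sofa' yields no synonym.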

SYNONYM rules

Querqy gives you a powerful toolset for using synonyms at query time. As opposed to analysis-based query-time synonyms in Solr, Querqy matches multi-term input and avoids scoring issues related to different document frequencies of the original input and synonym terms (see this blog post and the discussion on index-time vs. query-time synonyms in the Solr wiki). It also lets you configure synonyms in a field-independent manner, making the maintenance of synonyms a lot more intuitive than in Solr.

You have already seen rules for synonyms:

personal computer =>
    SYNONYM: pc
    
sofa* =>
    SYNONYM: sofa $1

Synonyms work in only one direction in Querqy. It always tries to match the input that is specified in the rule and adds a synonym if a given user query matches this input. If you need bi-directional synonyms or synonym groups, you have to declare a rule for each direction. For example, if the query 'personal computer' should also search for 'pc' while query 'pc' should also search for 'personal computer', you would write these two rules:

personal computer =>
    SYNONYM: pc

pc =>
	SYNONYM: personal computer
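
Because every direction needs its own rule, it can be convenient to generate the rules for a synonym group instead of writing them by hand. The following Python sketch is one possible approach; the helper `synonym_group_rules` is hypothetical and simply emits text in the rules syntax shown above.

```python
def synonym_group_rules(terms):
    """Generate one rule per term, each listing all other terms as synonyms."""
    blocks = []
    for term in terms:
        lines = [term + " =>"]
        lines += ["    SYNONYM: " + other for other in terms if other != term]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


rules = synonym_group_rules(["personal computer", "pc"])
print(rules)
```

For the two-term group this prints the same pair of rules as above; for larger groups each term gets one SYNONYM line per other member of the group.
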

The right-hand side of synonym rules

The right-hand side of the synonym expression will be parsed by the parser that you configured as querqyParser for the Common Rules rewriter:

<lst name="rewriteChain">
    
        <lst name="rewriter">
            <str name="class">querqy.solr.SimpleCommonRulesRewriterFactory</str>
            <str name="rules">rules.txt</str>
            ..
            <str name="querqyParser">querqy.parser.WhiteSpaceQuerqyParserFactory</str>
        </lst>

Thus, in the following example, the WhiteSpaceQuerqyParser is used to parse "personal computer" into Querqy's internal query object model:

pc =>
	SYNONYM: personal computer

Querqy will assign fields in which it searches for the synonym query only after applying all rules and all rewriters when it finally creates a Lucene query from the Querqy-internal query object model. The search fields for synonyms are taken from the gqf parameter (priority) or from qf (see request parameters).

Expert: Structure of expanded queries

Querqy preserves the 'minimum should match' semantics for boolean queries (parameter mm for the DisMax query parser) when constructing synonyms. To preserve these semantics, given mm=1, the rule

personal computer =>
    SYNONYM: pc

produces the query

boolean_query (mm=1) (
	dismax('personal','pc'),
	dismax('computer','pc')
)

and NOT

boolean_query(mm=??) (
	boolean_query(mm=1) (
		dismax('personal'),
		dismax('computer')
	),
	dismax('pc')
)
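
One way to see why the first structure keeps 'minimum should match' meaningful is to count matching clauses for a few documents. The Python sketch below is a toy model under two simplifying assumptions of ours: a dismax clause matches when the document contains any of its terms, and the synonym is a single term. It uses mm=2 rather than mm=1 to make the effect visible.

```python
def expand_with_synonym(input_terms, synonym):
    """Model the expanded query: one dismax clause per input term,
    each clause also accepting the synonym."""
    return [{term, synonym} for term in input_terms]


def satisfies_mm(clauses, doc_terms, mm):
    """True if at least mm clauses match the document's terms."""
    matched = sum(1 for clause in clauses if clause & doc_terms)
    return matched >= mm


clauses = expand_with_synonym(["personal", "computer"], "pc")
print(satisfies_mm(clauses, {"pc"}, mm=2))
print(satisfies_mm(clauses, {"personal"}, mm=2))
```

In this model a document that only contains 'pc' satisfies both clauses and therefore mm=2, just like a document containing both 'personal' and 'computer', whereas a document containing only 'personal' does not. In the rejected nested structure it is unclear which mm value the outer query should use.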

UP/DOWN rules

UP and DOWN rules add a positive or negative boost query to the user query, which helps to bring documents that match the boost query further up or down in the result list.

The following rules add UP and DOWN queries to the input query 'iphone'. The UP instruction promotes documents also containing 'apple' further to the top of the result list, while the DOWN query puts documents containing 'case' further down the search results:

iphone =>
	UP(10): apple
	DOWN(20): case

UP and DOWN both take boost factors as parameters. The default boost factor is 1.0. The interpretation of the boost factor is left to the search engine and it might differ between UP and DOWN, which means that UP(10):x and DOWN(10):x do not necessarily cancel each other out.

The right-hand side of UP and DOWN instructions will either be parsed using the configured query parser (see The right-hand side of synonym rules), or it will be treated as a query in the syntax of the search engine if the right-hand side of the instruction is prefixed by *.

In the following example we favour a certain price range as an interpretation of 'cheap' and penalise documents from category 'accessories' using raw Solr queries:

cheap notebook =>
	UP(10): * price:[350 TO 450]
	DOWN(20): * category:accessories

FILTER rules

Filter rules work similarly to UP and DOWN rules, but instead of moving search results up or down the result list they restrict search results to those that match the filter query. The following rule looks similar to the 'iphone' example above but it restricts the search results to documents that contain 'apple' and not 'case':

iphone =>
	FILTER: apple
	FILTER: -case

The filter is applied to all fields given in the gqf or qf parameters. In the case of a required keyword ('apple') the filter matches if the keyword occurs in one or more query fields. The negative filter ('-case') only matches documents where the keyword occurs in none of the query fields. (Note this issue for purely negative queries.)

The right-hand side of filter instructions accepts raw queries. To completely exclude results from category 'accessories' for query 'notebook' you would write:

notebook =>
	FILTER: * -category:accessories

DELETE rules

Delete rules allow you to remove keywords from a query. This is comparable to stopwords in Solr but in Querqy keywords are removed before starting the field analysis chain. Delete rules are thus field-independent. It is also possible to apply delete rules before all other rules (see Rule ordering), which helps to remove stopwords that could otherwise prevent further Querqy rules from matching.

The following rule declares that whenever Querqy sees the input 'cheap iphone' it should remove keyword 'cheap' from the query and only search for 'iphone':

cheap iphone =>
	DELETE: cheap

While in this example the keyword 'cheap' will only be deleted if it is followed by 'iphone', you can also delete keywords regardless of the context:

cheap =>
	DELETE: cheap

or simply:

cheap =>
	DELETE

If the right-hand side of the delete instruction contains more than one term, each term will be removed from the query individually (= they are not considered a phrase and further terms can occur between them):

cheap iphone unlocked =>
	DELETE: cheap unlocked

The above rule would turn the input query 'cheap iphone unlocked' into search query 'iphone'.

The following restrictions apply to delete rules:

  • Terms to be deleted must be part of the input declaration.
  • Querqy will not delete the only term in a query.
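
The effect of a delete instruction on a query that has already matched the rule's input can be sketched in Python as follows. This is an illustration, not Querqy's code: `apply_delete` is a hypothetical helper, and leaving the query untouched whenever deletion would empty it is our reading of the second restriction above.

```python
def apply_delete(query_tokens, terms_to_delete):
    """Remove each listed term individually from the query tokens."""
    remaining = [t for t in query_tokens if t not in terms_to_delete]
    if not remaining:
        # Assumption: never turn the query into an empty query
        return list(query_tokens)
    return remaining


print(apply_delete(["cheap", "iphone", "unlocked"], {"cheap", "unlocked"}))
print(apply_delete(["cheap"], {"cheap"}))
```

The first call leaves only 'iphone'; the second keeps 'cheap' because it is the only term in the query.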

DECORATE rules

Decorate rules are not strictly query rewriting rules but they are quite handy to add query-dependent information to search results. For example, in online shops there are almost always a few search queries that have nothing to do with the products in the shop but with deliveries, T&C, FAQs and other service information. A decorate rule matches those search terms and adds the configured information to the search results:

faq =>
	DECORATE: redirect, /service/faq

The Solr response will then contain an array 'querqy_decorations' with the right-hand side expressions of the matching decorate rules:

<response>
    <lst name="responseHeader">...</lst>
    <result name="response" numFound="0" start="0">...</result>
    <lst name="facet_counts">...</lst>
    <arr name="querqy_decorations">
        <str>redirect, /service/faq</str>
        ...
    </arr>
</response>

Querqy does not inspect the right-hand side of the decorate instruction ('redirect, /service/faq') but returns the configured value 'as is'. You could even configure a JSON-formatted value in this place, but you must ensure that the value does not contain any line breaks.
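
Since the decoration is returned verbatim, it is up to the client to interpret it. The Python sketch below shows one way a client might do that for the 'redirect, /service/faq' example. It assumes that the response was requested as JSON and parsed into a dict with a top-level 'querqy_decorations' list (mirroring the XML above), and that the application uses a 'type, value' convention; both are assumptions for illustration.

```python
def collect_decorations(solr_response):
    """Split each decoration of the form 'type, value' into a tuple."""
    actions = []
    for decoration in solr_response.get("querqy_decorations", []):
        kind, _, value = decoration.partition(",")
        actions.append((kind.strip(), value.strip()))
    return actions


# Hypothetical parsed response for the query 'faq'
response = {
    "responseHeader": {"status": 0},
    "response": {"numFound": 0, "start": 0, "docs": []},
    "querqy_decorations": ["redirect, /service/faq"],
}
actions = collect_decorations(response)
print(actions)
```

A shop frontend could then redirect the user to /service/faq instead of rendering an empty result page.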

Rule ordering

There is no defined order for the application of rules in Querqy's Common Rules rewriter. When the rewriter sees a query it first tries to find all rules that match the input and then it applies these rules. If you want to make the output of one rule the input for matching another rule, you can split your rules across multiple rewriters, each with its own rules file.

For example, it is often handy to first apply delete rules before applying further rule types:

<queryParser name="querqy" class="querqy.solr.DefaultQuerqyDismaxQParserPlugin">

    <lst name="parser">
      <str name="factory">querqy.solr.SimpleQuerqyQParserFactory</str>
      <str name="class">querqy.parser.WhiteSpaceQuerqyParser</str>
    </lst>
     	 	
	<!--
		The chain of query rewriters.
    --> 
    <lst name="rewriteChain">
    
        <lst name="rewriter">
            <str name="class">querqy.solr.SimpleCommonRulesRewriterFactory</str>
            <!-- 
           	   The file only contains delete rules.
            -->
            <str name="rules">delete-rules.txt</str>
            <bool name="ignoreCase">true</bool>
            <str name="querqyParser">querqy.parser.WhiteSpaceQuerqyParserFactory</str>
        </lst>
        <!-- 
        	The rewritten query of the above rewriter becomes
        	the input for the rewriter below:
        -->
        <lst name="rewriter">
            <str name="class">querqy.solr.SimpleCommonRulesRewriterFactory</str>
            <!-- 
           	   The file only contains further rules (synonyms etc.)
            -->
            <str name="rules">rules.txt</str>
            <bool name="ignoreCase">true</bool>
            <str name="querqyParser">querqy.parser.WhiteSpaceQuerqyParserFactory</str>
        </lst>

   </lst>

</queryParser>
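
The effect of chaining two rewriters can be illustrated with a toy model in Python. The two functions below stand in for the two configured rewriters with made-up rules (a delete rule for 'cheap' and an exact-match synonym rule for "personal computer"); they are hypothetical and only demonstrate that the second rewriter sees the output of the first.

```python
def delete_rewriter(query):
    """Stand-in for the rewriter using delete-rules.txt (cheap => DELETE)."""
    remaining = [t for t in query["tokens"] if t != "cheap"]
    if remaining:  # never delete the only term of the query
        query = {"tokens": remaining, "synonyms": query["synonyms"]}
    return query


def synonym_rewriter(query):
    """Stand-in for rules.txt ("personal computer" => SYNONYM: pc)."""
    if query["tokens"] == ["personal", "computer"]:  # exact match only
        return {"tokens": query["tokens"], "synonyms": query["synonyms"] + ["pc"]}
    return query


def run_chain(rewriters, tokens):
    """Feed the output of each rewriter into the next one."""
    query = {"tokens": list(tokens), "synonyms": []}
    for rewrite in rewriters:
        query = rewrite(query)
    return query


user_query = ["cheap", "personal", "computer"]
print(run_chain([delete_rewriter, synonym_rewriter], user_query))
print(run_chain([synonym_rewriter], user_query))
```

With both rewriters in the chain, 'cheap' is removed first and the exact-match synonym rule then applies; with the synonym rewriter alone, the rule does not match and no synonym is added.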

License

Querqy is licensed under the Apache License, Version 2.

Development

Branches

Please base development for Lucene/Solr 5.x and Lucene/Solr-independent features (querqy-core) on branch master and for Lucene/Solr 4.x on branch solr4. Note that before Querqy 2.9.0 development for Solr 5.x was carried out in branch solr5, while branch master contained Lucene/Solr 4.x and Lucene/Solr-independent code.

Modules

  • querqy-antlr - An ANTLR-based Querqy query parser (incomplete, do not use)
  • querqy-core - The core component. Search-engine independent, Querqy's query object model, Common Rules Rewriter
  • querqy-lucene - Lucene-specific components. Builder for creating a Lucene query from Querqy's query object model
  • querqy-solr - Solr-specific components. QParserPlugin, SearchComponent.

Contributors

Many thanks to Galeria Kaufhof, shopping24 and inoio for their support.

Querqy is built using Travis CI.
