ElasticSearch 6.x Practical Tutorial: Complex Search and the Java Client

Chapter 8 - Complex Search

The night gave me black eyes, yet I use them to seek the light.

After learning the basic APIs and simple search, you can already handle most usage scenarios. However, document data in a non-relational database is often large and complex, with many redundant fields making up a single "record". Complex data structures lead to complex searches. So before getting into this chapter, we need to build a data structure that is as "complex" as possible.

Scenario 1 leans toward complexity in the data structure and introduces aggregate queries, returning specified fields, and deep paging; Scenario 2 leans toward complexity in search accuracy.

Scenario 1

Store the employees of a company. Employee information includes name, employee number, gender, date of birth, position, superior, subordinates, departments, entry time, modification time, and creation time. The employee number is globally unique and serves as the primary-key ID. An employee has exactly one direct superior but may have several subordinates, which can be modeled with parent-child documents. An employee may belong to multiple departments (especially leaders, who may head several departments at once).

Data structure

Create an index and define a mapping structure:

PUT http://localhost:9200/company
{
    "mappings":{
        "employee":{
            "properties":{
                "id":{
                    "type":"keyword"
                },
                "name":{
                    "type":"text",
                    "analyzer":"ik_smart",
                    "fields":{
                        "keyword":{
                            "type":"keyword",
                            "ignore_above":256
                        }
                    }
                },
                "sex":{
                    "type":"keyword"
                },
        "age":{
          "type":"integer"
                },
                "birthday":{
                    "type":"date"
                },
                "position":{
                    "type":"text",
                    "analyzer":"ik_smart",
                    "fields":{
                        "keyword":{
                            "type":"keyword",
                            "ignore_above":256
                        }
                    }
                },
                "level":{
                    "type":"join",
                    "relations":{
                        "superior":"staff",
            "staff":"junior"
                    }
                },
                "departments":{
                    "type":"text",
                    "analyzer":"ik_smart",
                    "fields":{
                        "keyword":{
                            "type":"keyword",
                            "ignore_above":256
                        }
                    }
                },
                "joinTime":{
                    "type":"date"
                },
                "modified":{
                    "type":"date"
                },
                "created":{
                    "type":"date"
                }
            }
        }
    }
}

Data

Next we'll construct the data, including a few key records:

  • Zhang San is the chairman of the company. He is the top leader and does not belong to any department.
  • Li Si's superior is Zhang San. His subordinates are Wang Wu, Zhao Liu, Sun Qi, and Zhou Ba. He also heads both the Marketing Department and the R&D Department, so he belongs to both.
  • Wang Wu and Zhao Liu report to Li Si. They have no subordinates and belong to the Marketing Department.
  • Sun Qi and Zhou Ba report to Li Si. They have no subordinates and belong to the R&D Department.

The full data is shown more intuitively in the following table:

Name      | No. | Gender | Age | Date of Birth | Position           | Superior  | Subordinates                       | Departments    | Entry Time | Modification Time | Creation Time
Zhang San | 1   | male   | 49  | 1970-01-01    | Chairman           | /         | Li Si                              | /              | 1990-01-01 | 1562167817000     | 1562167817000
Li Si     | 2   | male   | 39  | 1980-04-03    | General Manager    | Zhang San | Wang Wu, Zhao Liu, Sun Qi, Zhou Ba | Marketing, R&D | 2001-02-02 | 1562167817000     | 1562167817000
Wang Wu   | 3   | female | 27  | 1992-09-01    | Salesperson        | Li Si     | /                                  | Marketing      | 2010-07-01 | 1562167817000     | 1562167817000
Zhao Liu  | 4   | male   | 29  | 1990-10-10    | Salesperson        | Li Si     | /                                  | Marketing      | 2010-08-08 | 1562167817000     | 1562167817000
Sun Qi    | 5   | male   | 26  | 1993-12-10    | Front-End Engineer | Li Si     | /                                  | R&D            | 2016-07-01 | 1562167817000     | 1562167817000
Zhou Ba   | 6   | male   | 25  | 1994-05-11    | Java Engineer      | Li Si     | /                                  | R&D            | 2018-03-10 | 1562167817000     | 1562167817000

Insert the six documents:

POST http://localhost:9200/company/employee/1?routing=1
{
    "id":"1",
    "name":"Zhang San",
    "sex":"male",
  "age":49,
    "birthday":"1970-01-01",
    "position":"Chairman",
    "level":{
    "name":"superior"
  },
    "joinTime":"1990-01-01",
    "modified":"1562167817000",
    "created":"1562167817000"
}
POST http://localhost:9200/company/employee/2?routing=1
{
    "id":"2",
    "name":"Li Si",
    "sex":"male",
  "age":39,
    "birthday":"1980-04-03",
    "position":"General manager",
    "level":{
    "name":"staff",
    "parent":"1"
  },
  "departments":["Marketing Department","R&D Department"],
    "joinTime":"2001-02-02",
    "modified":"1562167817000",
    "created":"1562167817000"
}
POST http://localhost:9200/company/employee/3?routing=1
{
    "id":"3",
    "name":"King Five",
    "sex":"female",
  "age":27,
    "birthday":"1992-09-01",
    "position":"Sale",
    "level":{
    "name":"junior",
    "parent":"2"
  },
  "departments":["Marketing Department"],
    "joinTime":"2010-07-01",
    "modified":"1562167817000",
    "created":"1562167817000"
}
POST http://localhost:9200/company/employee/4?routing=1
{
    "id":"4",
    "name":"Zhao Six",
    "sex":"male",
  "age":29,
    "birthday":"1990-10-10",
    "position":"Sale",
    "level":{
    "name":"junior",
    "parent":"2"
  },
  "departments":["Marketing Department"],
    "joinTime":"2010-08-08",
    "modified":"1562167817000",
    "created":"1562167817000"
}
POST http://localhost:9200/company/employee/5?routing=1
{
    "id":"5",
    "name":"Sun Qi",
    "sex":"male",
  "age":26,
    "birthday":"1993-12-10",
    "position":"Front End Engineer",
    "level":{
    "name":"junior",
    "parent":"2"
  },
  "departments":["R&D Department"],
    "joinTime":"2016-07-01",
    "modified":"1562167817000",
    "created":"1562167817000"
}
POST http://localhost:9200/company/employee/6?routing=1
{
    "id":"6",
    "name":"Week Eighth",
    "sex":"male",
  "age":28,
    "birthday":"1994-05-11",
    "position":"Java Engineer",
    "level":{
    "name":"junior",
    "parent":"2"
  },
  "departments":["R&D Department"],
    "joinTime":"2018-03-10",
    "modified":"1562167817000",
    "created":"1562167817000"
}

Search

  1. Query employees in the R&D Department
GET http://localhost:9200/company/employee/_search
{
    "query":{
        "match":{
            "departments":"R&D Department"
        }
    }
}
  2. Query employees who belong to both the Marketing Department and the R&D Department
GET http://localhost:9200/company/employee/_search
{
    "query": {
        "bool":{
            "must":[{
                "match":{
                    "departments":"Marketing Department"
                }
            },{
                "match":{
                    "departments":"R&D Department"
                }
            }]
        }
    }
}

*The field being searched is an array type, but the query statement needs nothing special for that.

  1. Query name= "Zhang San" direct subordinates.
GET http://localhost:9200/company/employee/_search
{
    "query": {
        "has_parent":{
            "parent_type":"superior",
            "query":{
                "match":{
                    "name":"Zhang San"
                }
            }
        }
    }
}
  1. Query name="Lisi" direct subordinates.
GET http://localhost:9200/company/employee/_search

{
    "query": {
        "has_parent":{
            "parent_type":"staff",
            "query":{
                "match":{
                    "name":"Li Si"
                }
            }
        }
    }
}
  1. Query the direct superior of name="Wang Wu".
GET http://localhost:9200/company/employee/_search
{
    "query": {
        "has_child":{
            "type":"junior",
            "query":{
                "match":{
                    "name":"King Five"
                }
            }
        }
    }
}

Aggregate queries

Aggregate queries in ES are similar to aggregate functions in MySQL (avg, max, and so on). For example, calculate the average age of the employees:

GET http://localhost:9200/company/employee/_search?pretty
{
    "size": 0,
    "aggs": {
        "avg_age": {
            "avg": {
                "field": "age"
            }
        }
    }
}
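For the six employees in the table above, this aggregation averages the age field: (49+39+27+29+26+25)/6 = 32.5. A quick plain-Java check of that arithmetic (an in-memory stand-in for the aggregation, not the ES client; avg is a hypothetical helper):

```java
import java.util.stream.IntStream;

public class AvgAggregation {

    // The in-memory equivalent of the avg aggregation on the age field.
    static double avg(int[] ages) {
        return IntStream.of(ages).average().orElse(0);
    }

    public static void main(String[] args) {
        int[] ages = {49, 39, 27, 29, 26, 25}; // ages from the employee table
        System.out.println(avg(ages)); // 32.5
    }
}
```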

Specify Field Query

Specifying returned fields restricts which fields appear in the query result. For example, ask only for Zhang San's name and birthday:

GET http://localhost:9200/company/employee/_search?pretty
{
    "_source":["name","birthday"],
    "query":{
        "match":{
            "name":"Zhang San"
        }
    }
}

Deep Paging

Deep paging in ES is a common topic. As anyone who has used ES knows, by default a query cannot page past 10,000 documents; that is, from + size must not exceed 10,000 (the index.max_result_window setting). To read beyond 10,000 documents you must either raise that limit or use a scroll query. Raising the limit works, but performance degrades badly. A scroll query, on the other hand, can only move to the previous or next batch; it cannot jump to an arbitrary page.
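The 10,000-document limit can be stated as a quick check (plain Java; withinWindow is a hypothetical helper for illustration, not an ES API):

```java
public class DeepPagingCheck {

    static final int MAX_RESULT_WINDOW = 10_000; // ES default index.max_result_window

    // A from+size request is rejected once from + size exceeds the window.
    static boolean withinWindow(int from, int size) {
        return from + size <= MAX_RESULT_WINDOW;
    }

    public static void main(String[] args) {
        System.out.println(withinWindow(9_990, 10));  // last page inside the window
        System.out.println(withinWindow(10_000, 10)); // too deep: needs scroll or range paging
    }
}
```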

The principle of scroll is simply querying in batches: each batch resumes from where the previous batch left off, until all the data has been retrieved.

First, initialize the scroll query:

GET http://localhost:9200/company/employee/_search?scroll=1m
{
    "query":{
        "match_all":{}
    },
    "size":1,
    "_source": ["id"]
}

The request body is the same as a normal query. scroll=1m in the URL gives the scroll context an expiration time of one minute; the timer is refreshed on every scroll request. Setting it too long ties up resources unnecessarily.

Then you can continue scrolling with the _scroll_id returned by the query above (a long Base64-encoded string), for example:

GET http://localhost:9200/_search/scroll
{
    "scroll":"1m",
    "scroll_id": "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAAFBFk1pNzdFUVhDU3hxX3VtSVFUdDJBWlEAAAAAAAABQhZNaTc3RVFYQ1N4cV91bUlRVHQyQVpRAAAAAAAAAUMWTWk3N0VRWENTeHFfdW1JUVR0MkFaUQAAAAAAAAFEFk1pNzdFUVhDU3hxX3VtSVFUdDJBWlEAAAAAAAABRRZNaTc3RVFYQ1N4cV91bUlRVHQyQVpR"
}

A minor drawback of this approach is that you cannot continue querying past the expiration time, which makes it well suited to exporting all the data in one pass. In reality, though, a user may stay on one page for a long time before clicking previous or next, by which point the scroll context has expired. So there is another way: the range query.

Another kind of deep paging

Assuming the employee number in the employee data is incremental and unique, we can page with a range query. (One caveat: id is mapped as keyword here, so range and sort compare strings lexicographically; with ten or more documents a numeric field would be needed for correct ordering.)

For example, in ascending order of id, the first query asks for data with id > 0, with a page size of 1:

GET http://localhost:9200/company/employee/_search
{
    "query":{
        "range":{
            "id":{
                "gt":0
            }
        }
    },
    "size":1,
    "sort":{
        "id":{
            "order":"asc"
        }
    }
}

This returns the single document with id=1. We then query for data with id > 1, again with a page size of 1:

GET http://localhost:9200/company/employee/_search
{
    "query":{
        "range":{
            "id":{
                "gt":1
            }
        }
    },
    "size":1,
    "sort":{
        "id":{
            "order":"asc"
        }
    }
}

This achieves deep paging with no expiration-time limit.
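The loop described above can be sketched in plain Java (a self-contained simulation, not the ES client; the in-memory sorted list stands in for the index, and nextPage is a hypothetical helper mirroring the range query with sort and size):

```java
import java.util.ArrayList;
import java.util.List;

public class RangePaging {

    // Return up to pageSize ids greater than lastId, in ascending order --
    // the in-memory analogue of {"range":{"id":{"gt":lastId}}} with a sort and size.
    static List<Integer> nextPage(List<Integer> sortedIds, int lastId, int pageSize) {
        List<Integer> page = new ArrayList<>();
        for (int id : sortedIds) {
            if (id > lastId) {
                page.add(id);
                if (page.size() == pageSize) break;
            }
        }
        return page;
    }

    public static void main(String[] args) {
        List<Integer> ids = List.of(1, 2, 3, 4, 5, 6); // the six employee numbers
        int lastId = 0;
        List<Integer> page;
        while (!(page = nextPage(ids, lastId, 2)).isEmpty()) {
            System.out.println(page);                // [1, 2] then [3, 4] then [5, 6]
            lastId = page.get(page.size() - 1);      // cursor: last id of this page
        }
    }
}
```

Unlike scroll, the cursor here is just the last id seen, so it never expires.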

Scenario 2

Store commodity data and search by product name. This demands high accuracy: a search for "facial cleanser" must not bring back "flour".

Since this scenario is mainly about search accuracy, there is no complex data structure: just a single title field.

Define an index containing only the title field, with the analyzer left at the default standard:

PUT http://localhost:9200/ware_index
{
    "mappings": {
        "ware": {
            "properties": {
                "title":{
                    "type":"text"
                }
            }
        }
    }
}

Insert two pieces of data:

POST http://localhost:9200/ware_index/ware
{
    "title":"Facial Cleanser"
}
POST http://localhost:9200/ware_index/ware
{
    "title":"flour"
}

Search keyword "face wash milk":

POST http://localhost:9200/ware_index/ware/_search
{
    "query":{
        "match":{
            "title":"Facial Cleanser"
        }
    }
}

The search returns both documents, "flour" and "Facial Cleanser", which does not match our expectations.

The reason was explained in the chapter on analysis: the default analyzer for the text type is standard, which splits a Chinese string character by character. The original Chinese product names make this clear: "facial cleanser" (洗面奶) is split into the single-character terms 洗 / 面 / 奶, and "flour" (面粉) into 面 / 粉. A match query splits the search keyword the same way, and the shared term 面 is enough to make "flour" match, which produces the result above. So for Chinese text we need to specify an analyzer. A common choice is ik_smart, which splits at the coarsest granularity; ik_max_word splits at the finest granularity and may still produce results like the above.
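The behaviour described above can be simulated in a few lines of plain Java (an illustration only; tokenize and matches are hypothetical helpers, not ES APIs):

```java
import java.util.ArrayList;
import java.util.List;

public class StandardAnalyzerDemo {

    // The standard analyzer splits a Chinese string into single characters.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (char c : text.toCharArray()) {
            terms.add(String.valueOf(c));
        }
        return terms;
    }

    // A match query matches a document if it shares at least one term with the keyword.
    static boolean matches(String title, String keyword) {
        List<String> titleTerms = tokenize(title);
        for (String term : tokenize(keyword)) {
            if (titleTerms.contains(term)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // 洗面奶 -> [洗, 面, 奶], 面粉 -> [面, 粉]: the shared term 面 makes flour match.
        System.out.println(matches("面粉", "洗面奶")); // true
    }
}
```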

Delete the index with DELETE http://localhost:9200/ware_index, then recreate it, specifying ik_smart as the analyzer for the title field:

PUT http://localhost:9200/ware_index
{
    "mappings":{
        "ware":{
            "properties":{
        "id":{
          "type":"keyword"
        },
                "title":{
                    "type":"text",
                    "analyzer":"ik_smart"
                }
            }
        }
    }
}

If you insert "Wash Milk" and "Flour", then searching for "Wash Milk" will result in only one result.But now we insert the following two pieces of data:

POST http://localhost:9200/ware_index/ware
{
    "id":"1",
    "title":"New Hope Milk"
}
POST http://localhost:9200/ware_index/ware
{
    "id":"2",
    "title":"New Short Sleeves in Spring and Autumn"
}

Search for the keyword "New Hope Milk":

POST http://localhost:9200/ware_index/ware/_search
{
    "query":{
        "match":{
            "title":"New Hope Milk"
        }
    }
}

The search results contain both newly inserted documents; obviously the second one, "New Short Sleeves in Spring and Autumn", is not what we want. The cause is again tokenization: the ik dictionary does not contain the brand name "New Hope" (新希望), so the search keyword is split into "new" and "hope". In "New Short Sleeves in Spring and Autumn", "new" is likewise not combined into any longer word and stands alone as a term, hence the match. One solution is certainly to add "New Hope" to the ik dictionary, but there are other ways.

Phrase query

match_phrase is a phrase query. It splits the search keyword "New Hope Milk" into the term list [new, hope, milk]; a matching document must contain all of these terms exactly, and their positions must correspond. In this example only the document "New Hope Milk" matches on both terms and positions, so the match_phrase query returns exactly one document.

POST http://localhost:9200/ware_index/ware/_search
{
    "query":{
        "match_phrase":{
            "title":"New Hope Milk"
        }
    }
}

Although this gives us the search result we want, a user may actually type the words in a different order, such as "Milk New Hope". Unfortunately, match_phrase requires both exact term matches and corresponding positions: the keyword "Milk New Hope" is parsed into the same terms as "New Hope Milk", but the positions do not correspond, so no results are found. Likewise, a title with extra, missing, or reordered words fails either the exact-term requirement or the position requirement.

So match_phrase doesn't work perfectly either.
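The position requirement of match_phrase can be sketched as a contiguous-subsequence check (a plain-Java illustration; phraseMatches is a hypothetical helper, not the ES implementation, and the analyzer is simulated by splitting on spaces):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class PhraseQueryDemo {

    // match_phrase: all keyword terms must occur in the document
    // in the same order and at consecutive positions.
    static boolean phraseMatches(String doc, String keyword) {
        List<String> docTerms = Arrays.asList(doc.split(" "));
        List<String> queryTerms = Arrays.asList(keyword.split(" "));
        return Collections.indexOfSubList(docTerms, queryTerms) >= 0;
    }

    public static void main(String[] args) {
        String doc = "new hope milk";
        System.out.println(phraseMatches(doc, "new hope milk")); // terms and positions line up
        System.out.println(phraseMatches(doc, "milk new hope")); // same terms, wrong order
    }
}
```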

Phrase prefix query

match_phrase_prefix, a phrase prefix query, is similar to MySQL's LIKE 'New Hope%'. It is otherwise consistent with match_phrase: the document's terms and positions must still line up with the keyword's, so searching for "Milk New Hope" again returns no results. It doesn't achieve what we want either.

Minimum Match

Although the first two queries can find what we want with "New Hope Milk", they can do nothing about "Milk New Hope". The next query achieves the desired result "perfectly".

Let's start with an example of a minimum-match query:

POST http://localhost:9200/ware_index/ware/_search
{
    "query": {
        "match": {
            "title": {
                "query": "New Hope Milk",
                "minimum_should_match": "80%"
            }
        }
    }
}

minimum_should_match sets the minimum match. What does "80%" mean? Again start from the terms the keyword "New Hope Milk" is parsed into: [new, hope, milk]. With a plain match search, a document containing only "new" also appears in the results. "80%" means a document must match at least 80% × 3 = 2.4 of the three terms, rounded down to 2; that is, a matching document must contain at least two of the terms. "New Short Sleeves in Spring and Autumn" contains only one term, so it no longer appears in the search results.
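That arithmetic can be checked with a small sketch (plain Java; requiredMatches mirrors the rounding of a positive percentage of query terms toward zero, as an illustration rather than the ES source):

```java
public class MinimumShouldMatch {

    // For a positive percentage, the required term count is rounded down.
    static int requiredMatches(int termCount, int percent) {
        return (int) Math.floor(termCount * percent / 100.0);
    }

    public static void main(String[] args) {
        // Three terms [new, hope, milk] at 80% -> floor(2.4) = 2 terms required.
        System.out.println(requiredMatches(3, 80));  // 2
        System.out.println(requiredMatches(3, 100)); // all 3 terms required
    }
}
```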

Similarly, searching for "Milk New Hope" gives the result described above: this is not a phrase match, so the positions of the matching terms need not correspond.

It follows that "minimum_should_match": "100%" means a complete match: a document must contain every term, so fewer results appear. Conversely, "minimum_should_match": 0 does not mean documents with no matching terms are returned; at least one term must still match, so it degenerates to the default match search and returns more results.

Choosing a suitable value gives a better experience. Going by the 80/20 rule and by practice, 80% satisfies most scenarios: neither too many useless results nor too few.

Chapter 9 - Java Client (Part 2)

This chapter builds on Java Client (Part 1). It does not repeat how to create a Spring Data ElasticSearch project, nor does it contain much prose; it is best read together with the source code at https://github.com/yu-linfeng/elasticsearch6.x_tutorial/tree/master/code/spring-data-elasticsearch (the relevant code is in the complex package).

The code in this chapter focuses on how to insert parent-child document data and query it through the Java API.

Data Insertion for Parent-Child Documents

Parent-child documents are actually stored in ES as key-value pairs. For example, when defining the mapping, we defined the join relation as:

{
    ......
    "level":{
        "type":"join",
        "relations":{
            "superior":"staff",
            "staff":"junior"
        }
    }
    ......
}

When writing a document, the corresponding field looks like:

{
    ......
    "level":{
        "name":"staff",
        "parent":"1"
    }
    ......
}

For the Java entity, we can declare the level field as a Map<String, Object>. The key point is that when using Spring Data ElasticSearch we cannot simply call the save or saveAll methods: ES requires parent and child documents to reside on the same shard, so a routing parameter must be supplied when writing a child document. Here is a code excerpt:

BulkRequestBuilder bulkRequestBuilder = client.prepareBulk();
bulkRequestBuilder
        .add(client.prepareIndex("company", "employee", employeePO.getId())
                .setRouting(routing) // route the child document to its parent's shard
                .setSource(mapper.writeValueAsString(employeePO), XContentType.JSON))
        .execute()
        .actionGet();

It is best understood alongside the reference source code.

ES is a very powerful search engine. With limited space it is impossible to give examples of every Java API. If you run into difficulties writing code, contact the author at hellobug at outlook.com, or through the public account coderbuff, for some (though not all) answers.

Follow the public account CoderBuff and reply "es" to get the full PDF of the ElasticSearch 6.x tutorial.

This is a public account (CoderBuff) that adds a buff to programmers.


Posted on Mon, 22 Jul 2019 09:43:15 -0700 by lost305