Elasticsearch series - data modeling practice

outline

This paper introduces the characteristics of data modeling by different technical components in the background of practical cases, introduces the advantages and disadvantages of common joint query in the background of ES, and finally introduces the use of the file system word breaker path ﹣ hierarchy and nested objects.

Data model comparison

In the actual project, the common combinations of e-commerce platform systems are Java, Mysql and elastic search, taking the basic department employee entity as the case.

JavaBean type definition

If it's a JavaBean type, it's defined like this

public class Department {
	private Long id;
	private String name;
	private String desc;
	private List<Long> userIds;
}

public class Employee {
	private Long id;
	private String name;
	private byte gender;
	private Department dept;
}

Database model definition

If it is a relational database (mysql), the table will be created in this way

create table t_department (
	id bigint(20) not null auto_increment,
	name varchar(30) not null,
	desc varchar(80) not null,
	PRIMARY KEY (`id`)
)

create table t_employee (
	id bigint(20) not null auto_increment,
	name varchar(30) not null,
	gender tinyint(1) not null,
	dept_id bigint(20),
	PRIMARY KEY (`id`)
)

According to the three normal forms of database, each entity is designed as an independent table, which is associated with the main foreign key constraint. According to the existing data table specification, the foreign key constraint is no longer used. The foreign key constraint is placed in the application layer control.

ES document data model

If the document data model of es, the document will be designed in this way

{
	"deptId": 1,
	"deptname": "CEO office",
	"desc":"This one has feelings CEO",
	"employee":[
		{
			"userId":1,
			"name":Lily,
			"gender":0
		},
		{
			"userId":2,
			"name":Lucy,
			"gender":0
		},
		{
			"userId":3,
			"name":Tom,
			"gender":1
		}
	]
}

es is more like an object-oriented data model, which puts all the associated data in one document.

JOIN query

Based on the case of blog website, we build the data model of blog and user in blog website.

document the user and blog respectively, split the entity, and use the key field (userId) to establish the dependency relationship

First, create two entity document s and put a sample data

PUT /blog/user/1
{
  "id":1,
  "username":"Lily",
  "age":18
}

PUT /website/article/1
{
  "title":"my frist blog",
  "content":"this is my first blog, thank you",
  "userId":1
}

Requirement: to query user name Lily's blog Steps: 1) query the user document and find its userId according to its name Lily; 2) According to the first step, query the returned userId, reassemble the request message, and query the blog document Example message:

GET /blog/user/_search
{
  "query": {
    "match": {
      "username.keyword": "Lily"
    }
  }
}

GET /website/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "terms": {
          "userId": [
            "1"
          ]
        }
      }
    }
  }
}

The above steps are called application layer Join to implement Association query

  • Advantages: clear structure, non redundant data, easy maintenance.
  • Disadvantages: application level join, such as too much data associated, low query performance.

Applicable scenario: two levels of join, the first level of document query can basically achieve accurate query, the number of returned results is small, and the second level of data is particularly large. For example, in the case scenario, find userId according to the name, and the returned data is relatively small, so the query performance of the second layer is relatively high. The data of the second layer belongs to the business data type, and the data volume must be very large.

Moderate redundancy reduces application layer Join query

General query

Take the above case, modify the blog document, and redundancy username into the document, such as:

PUT /website/article/2
{
  "title":"my second blog",
  "content":"this is my second blog, thank you",
  "userInfo": {
    "id":1,
    "username":"Lily"
  }
}

Specify username directly when querying:

GET /website/article/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "userInfo.username.keyword": "Lily"
        }
      }
    }
  }
}
  • Advantages: one query, high performance
  • Disadvantage: if redundant fields are updated, maintenance is very difficult

Suitable scenario: appropriate redundancy comparison is necessary, which can reduce join query. Relational database design also often has redundant data optimization. As long as you select redundant fields, you should try to select fields that are less likely to be changed, so as to avoid the occurrence of such things as refreshing query and updating to hit the wall.

Aggregate group query after data redundancy design

Build point test data in

PUT /website/article/3
{
  "title":"my third blog",
  "content":"this is my third blog, thank you",
  "userInfo": {
    "id":2,
    "username":"Lucy"
  }
}

PUT /website/article/4
{
  "title":"my 4th blog",
  "content":"this is my 4th blog, thank you",
  "userInfo": {
    "id":2,
    "username":"Lucy"
  }
}

Group query: which blogs Lily has published and which blogs Lucy has published

GET website/article/_search
{
  "size": 0,
  "aggs": {
    "group_by_username": {
      "terms": {
        "field": "userInfo.username.keyword"
      },
      "aggs": {
        "top_articles": {
          "top_hits": {
            "size": 10,
            "_source": {
              "includes": "title"
            }
          }
        }
      }
    }
  }
}

File search

The data of file type has a great feature: there is a directory hierarchy. If we need to search for files, we can index them as follows:

PUT /files
{
  "settings": {
    "analysis": {
      "analyzer": {
        "paths": {
          "tokenizer":"path_hierarchy"
        }
      }
    }
  }
}

PUT /files/_mapping/file
{
  "properties": {
    "name": {
      "type": "keyword"
    },
    "path": {
      "type": "keyword",
      "fields": {
        "tree": {
          "type": "text",
          "analyzer": "paths"
        }
      }
    }
  }
}

Note that the / opt/data/log is divided into / opt/data/log

/opt/

/opt/data/

/opt/data/log

Insert a test data

PUT /files/file/1
{
  "name":"hello.txt",
  "path":"/opt/data/txt/"
}

Search case

Specify file name and specific path search

GET files/file/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": "hello.txt"
          }
        },
        {
          "match": {
            "path": "/opt/data/txt/"
          }
        }
      ]
    }
  }
}

/hello.txt file (including subdirectories) under opt path

GET files/file/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": "hello.txt"
          }
        },
        {
          "match": {
            "path.tree": "/opt/"
          }
        }
      ]
    }
  }
}

Difference: use of path and path.tree path.tree is word segmentation, and the word breaker is specified as path [hierarchy path is used directly without word segmentation.

nested object data type

Ask questions

When using common object objects for data redundancy, if the redundant data is an array collection class, the query may have problems. For example, the comment below blog information is a collection type

PUT /website/article/5
{
  "title": "A technical post on tea and bean milk",
  "content":  "I am tea and soy milk. Do you want to pay attention to it Java Building a community",
  "tags":  [ "IT technology", "Java Building communities" ],
  "comments": [ 
    {
      "name":    "Green tea",
      "comment": "Do you have any dry goods?",
      "age":     29,
      "stars":   4,
      "date":    "2019-10-29"
    },
    {
      "name":    "Soy milk",
      "comment": "I like technology best. That's great",
      "age":     32,
      "stars":   5,
      "date":    "2019-10-30"
    }
  ]
}

Need: check blogs commented by 29 year old soymilk users

GET /website/article/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "comments.name.keyword": "Soy milk"
          }
        },
        {
          "match": {
            "comments.age": "29"
          }
        }
      ]
    }
  }
}

According to this demonstration data, we can't find the result under this condition, but we actually find this data. Why?

Reason: the underlying data results of object type will flatten the json storage. For example, the storage structure will become:

{
	"title":["Green tea","Soy milk","Publish","An article","technology","Post"],
	"content": ["I","Green tea","Soy milk","everybody","Do you want it or not?","Consider","follow","Once","Java Building communities"],
	tags:["IT technology", "Java Building communities"],
	comments.name:["Green tea","Soy milk"],
	comments.comment:["Yes","What","Dried food","Not at all","I","like best","Research","technology","Splendid"],
	comments.age:[29,32],
	comments.stars:[4,5],
	comments.date:["2019-10-29","2019-10-30"]
}

In this way, soymilk and 29 were hit, which was inconsistent with the expected results.

terms of settlement

This problem can be solved by introducing the nested object type. Modify mapping to change the type of comments to nested object. Delete the index first, and then rebuild it

PUT /website
{
  "mappings": {
    "article": {
      "properties": {
        "comments": {
          "type": "nested",
          "properties": {
            "name": {"type":"text"},
            "comment": {"type":"text"},
            "age":     {"type":"short"},
            "stars":   {"type":"short"},
            "date":  {"type":"date"}
          }
        }
      }
    }
  }
}

In this way, the underlying data structure becomes:

{
	"title":["Green tea","Soy milk","Publish","An article","technology","Post"],
	"content": ["I","Green tea","Soy milk","everybody","Do you want it or not?","Consider","follow","Once","Java Building communities"],
	tags:["IT technology", "Java Building communities"],
	comments:[
		{
			"name":"Green tea",
			"comment":["Yes","What","Dried food","Not at all"],
			"age":29,
			"stars":4,
			"date":"2019-10-29"
		},
		{
			"name":"Soy milk",
			"comment":["I","like best","Research","technology","Splendid"],
			"age":32,
			"stars":5,
			"date":"2019-10-30"
		}
	]
}

The query result is empty, which is in line with the expectation.

Aggregate query example

Find the average number of stars in blog comments every day

GET /website/article/_search
{
  "size": 0,
  "aggs": {
    "comments_path": {
      "nested": {
        "path": "comments"
      }, 
      "aggs": {
        "group_by_comments_date": {
          "date_histogram": {
            "field": "comments.date",
            "interval": "day",
            "format": "yyyy-MM-dd"
          }, 
          "aggs": {
            "stars_avg": {
              "avg": {
                "field": "comments.stars"
              }
            }
          }
        }
      }
    }
  }
}

Response results (Abridged):

{
  "aggregations": {
    "comments_path": {
      "doc_count": 2,
      "group_by_comments_date": {
        "buckets": [
          {
            "key_as_string": "2019-10-29",
            "key": 1572307200000,
            "doc_count": 1,
            "stars_avg": {
              "value": 4
            }
          },
          {
            "key_as_string": "2019-10-30",
            "key": 1572393600000,
            "doc_count": 1,
            "stars_avg": {
              "value": 5
            }
          }
        ]
      }
    }
  }
}

Summary

Based on the actual cases, this article briefly and quickly introduces the common data joint query and the use of nested objects in the actual project, which is of great practical value and can be understood.

Focus on Java high concurrency, distributed architecture, more technical dry cargo sharing and experience, please pay attention to official account: Java architecture community You can scan the QR code on the left to add friends and invite you to join the wechat group of Java architecture community to discuss the technology together

Tags: Programming Java Database MySQL less

Posted on Fri, 24 Apr 2020 18:05:00 -0700 by markdr