[TIL - 20240810~12] ElasticSearch Autocomplete (1)

https://yeon-dev.tistory.com/253

[Spring] Spring Boot 3.x: Setting up ElasticSearch 8.x + Kibana with Docker Compose (Local)


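I want to implement autocomplete for the community search using ElasticSearch.

When the user types '통밀빵' (whole-wheat bread), the autocomplete list should show titles that contain '통밀빵' at the start, in the middle, or at the end.

First, I indexed two documents each in which '통밀빵' appears as a prefix, in the middle, and as a suffix.

(via a bulk operation or Postman requests)

In Kibana Dev Tools, the following command searches all documents in the posts index.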

GET /posts/_search?q=*&pretty

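1. N-gram analyzer

Autocomplete has to respond in real time as the user types, so it must be able to return results from only part of a word. For that reason I chose an N-gram analyzer.

1-1. Nori analyzer example

The Nori analyzer splits Korean text into morphemes. Below is the result of analyzing "동해물과 백두산이 마르고 닳도록" (the opening line of the Korean national anthem) with Nori.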

GET _analyze
{
  "tokenizer": {
    "type": "nori_tokenizer",
    "decompound_mode": "mixed"
  },
  "filter": ["lowercase", "stop", "trim", "nori_part_of_speech"],
  "text": ["동해물과 백두산이 마르고 닳도록"]
}

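Nori analysis result: ["동해", "물", "백두산", "백두", "산", "마르", "닳"]

Here, decompound_mode controls how compound words are handled. With mixed, the compound '백두산' is decomposed into '백두' + '산', and the original '백두산' is kept as a token as well.

When the user enters '두산', the search with the Nori analyzer returns the following.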

GET /posts/_search
{
  "query": {
    "match": {
      "title": "두산"
    }
  }
}

Result:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 0,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  }
}

Searching with only '두산', a fragment of '백두산', did not return the expected result.

Nori decomposes '백두산' into '백두' + '산', but it never produces '두산' as a separate token.
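
Why the search comes back empty can be pictured as a plain term lookup. The following is a minimal sketch, not how Elasticsearch is implemented: the class NoriLookupDemo and its members are hypothetical names, the indexed terms are copied from the _analyze result above, and it assumes the query '두산' is analyzed into the single term '두산'.

```java
import java.util.List;

public class NoriLookupDemo {

    // Tokens reported by the _analyze call above for the sample title.
    static final List<String> INDEXED_TERMS =
            List.of("동해", "물", "백두산", "백두", "산", "마르", "닳");

    // A document is found only if some query term equals an indexed term.
    public static boolean matches(List<String> queryTerms) {
        return queryTerms.stream().anyMatch(INDEXED_TERMS::contains);
    }

    public static void main(String[] args) {
        // Assumption: "두산" stays a single term after query analysis.
        System.out.println(matches(List.of("두산")));   // false
        System.out.println(matches(List.of("백두산"))); // true
    }
}
```

The lookup is exact, so a fragment that cuts across morpheme boundaries has nothing to match against.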

1-2. N-gram analyzer example

The N-gram analyzer, in contrast, splits text into substrings of a fixed length and indexes each of them.

GET _analyze
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 2,
    "max_gram": 2,
    "token_chars": ["letter", "digit"]
  },
  "text": ["동해물과 백두산이 마르고 닳도록"]
}

2-gram result: ["동해", "해물", "물과", "백두", "두산", "산이", "마르", "르고", "닳도", "도록"]

3-gram result (min_gram and max_gram both set to 3): ["동해물", "해물과", "백두산", "두산이", "마르고", "닳도록"]

With the N-gram analyzer, the string '백두산이' is split into substrings such as '백두', '두산', and '산이'.

So when the user searches for '두산', the sentence containing '백두산' is returned.

GET /posts/_search
{
  "query": {
    "match": {
      "title.ngram": "두산"
    }
  }
}

Result:
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.5741749,
    "hits": [
      {
        "_index": "posts",
        "_id": "8",
        "_score": 1.5741749,
        "_source": {
          "id": 8,
          "title": "동해물과 백두산이 마르고 닳도록",
          "content": "내용8",
          "createdAt": "2024-08-12T21:04:31.777",
          "tags": [
            "#애국가"
          ]
        }
      }
    ]
  }
}
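
The splitting described in this section can be reproduced in a few lines of plain Java. This is a minimal sketch rather than Elasticsearch's own implementation: NGramDemo and ngrams are hypothetical names, and the code assumes every character fits in a single Java char, which holds for Hangul syllables.

```java
import java.util.ArrayList;
import java.util.List;

public class NGramDemo {

    // Emits every substring of length minGram..maxGram from each run of
    // letters and digits, similar to token_chars ["letter", "digit"].
    public static List<String> ngrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        // Anything that is not a letter or digit (spaces, punctuation) ends a run.
        for (String word : text.split("[^\\p{L}\\p{Nd}]+")) {
            for (int start = 0; start < word.length(); start++) {
                for (int len = minGram; len <= maxGram && start + len <= word.length(); len++) {
                    grams.add(word.substring(start, start + len));
                }
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Same input and gram size as the 2-gram _analyze request above.
        System.out.println(ngrams("동해물과 백두산이 마르고 닳도록", 2, 2));
        // min_gram=2, max_gram=3 on a single word:
        System.out.println(ngrams("백두산이", 2, 3)); // [백두, 백두산, 두산, 두산이, 산이]
    }
}
```

The first call prints the same ten tokens as the 2-gram result above. Because grams never cross a space, '과 백' is not produced.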

2. Index settings and mapping

2-1. post-setting.json

{
  "analysis": {
    "tokenizer": {
      "my-nori-tokenizer": {
        "type": "nori_tokenizer",
        "decompound_mode": "mixed"
      },
      "my-ngram-tokenizer": {
        "type": "ngram",
        "min_gram": 2,
        "max_gram": 3,
        "token_chars": ["letter", "digit"]
      }
    },
    "analyzer": {
      "my-nori-analyzer": {
        "type": "custom",
        "tokenizer": "my-nori-tokenizer",
        "filter": [
          "lowercase",
          "stop",
          "trim",
          "nori_part_of_speech"
        ]
      },
      "my-ngram-analyzer": {
        "type": "custom",
        "tokenizer": "my-ngram-tokenizer",
        "filter": [
          "lowercase",
          "trim"
        ]
      }
    }
  }
}

2-2. post-mapping.json

{
  "properties": {
    "id": {
      "type": "long"
    },
    "title": {
      "type": "text",
      "fields": {
        "ngram": {
          "type": "text",
          "analyzer": "my-ngram-analyzer"
        },
        "nori": {
          "type": "text",
          "analyzer": "my-nori-analyzer"
        }
      }
    },
    "content": {
      "type": "text",
      "analyzer": "my-nori-analyzer"
    },
    "createdAt": {
      "type": "date",
      "format": "yyyy-MM-dd'T'HH:mm:ss.SSS||epoch_millis"
    },
    "imageUrl": {
      "type": "text"
    },
    "tags": {
      "type": "keyword"
    }
  }
}

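In this configuration the title field is defined as a multi-field with two sub-fields, title.ngram (n-gram analyzer) and title.nori (Nori analyzer), so that it can serve different search scenarios.

3. Service and repository implementation

3-1. Repository implementation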

PostCustomElasticRepository

public interface PostCustomElasticRepository {
    List<PostDocument> findByKeywordInField(String fieldName, String keyword);
}

PostCustomElasticRepositoryImpl

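import java.util.List;

import lombok.RequiredArgsConstructor;
import org.springframework.data.elasticsearch.core.ElasticsearchOperations;
import org.springframework.data.elasticsearch.core.SearchHit;
import org.springframework.data.elasticsearch.core.query.Criteria;
import org.springframework.data.elasticsearch.core.query.CriteriaQuery;
import org.springframework.stereotype.Repository;

@Repository
@RequiredArgsConstructor
public class PostCustomElasticRepositoryImpl implements PostCustomElasticRepository {
    private final ElasticsearchOperations elasticsearchOperations;

    @Override
    public List<PostDocument> findByKeywordInField(String fieldName, String keyword) {
        // Build a match query against the given field (e.g. "title.ngram").
        Criteria criteria = new Criteria(fieldName).matches(keyword);
        CriteriaQuery query = new CriteriaQuery(criteria);

        // Run the search and unwrap each hit into its PostDocument.
        return elasticsearchOperations
                .search(query, PostDocument.class)
                .map(SearchHit::getContent)
                .stream()
                .toList();
    }
}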

PostElasticRepository

@Repository
public interface PostElasticRepository extends ElasticsearchRepository<PostDocument, Long>, PostCustomElasticRepository {
}

3-2. Service implementation

@Service
@RequiredArgsConstructor
@Transactional(readOnly = true)
public class PostQueryService {
    private final PostElasticRepository postElasticRepository;
    
    public List<String> searchByKeyword(String keyword) {
        return postElasticRepository.findByKeywordInField("title.ngram", keyword)
                .stream().map(PostDocument::getTitle)
                .toList();
    }
}

4. API requests

4-1. Example: searching for 통밀빵

http://localhost:8080/api/v1/posts/elastic?keyword=통밀빵
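
A browser or Postman percent-encodes the Korean keyword automatically, but a programmatic client usually has to do it itself. The following is a minimal sketch using only the JDK; KeywordUrlDemo and searchUri are hypothetical names, and the base URL is the local endpoint shown above.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class KeywordUrlDemo {

    // Builds the request URI with the keyword percent-encoded as UTF-8.
    public static URI searchUri(String keyword) {
        String encoded = URLEncoder.encode(keyword, StandardCharsets.UTF_8);
        return URI.create("http://localhost:8080/api/v1/posts/elastic?keyword=" + encoded);
    }

    public static void main(String[] args) {
        // Prints the URL with 통밀빵 encoded as %ED%86%B5%EB%B0%80%EB%B9%B5
        System.out.println(searchUri("통밀빵"));
    }
}
```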

4-2. Example: searching with a typo

http://localhost:8080/api/v1/posts/elastic?keyword=텅밀빵

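Even with a typo, the expected results come back because part of the query still matches: the query '텅밀빵' is itself split into n-grams, and the gram '밀빵' is also among the grams indexed for '통밀빵'.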
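
The overlap can be checked by hand with a small plain-Java sketch. It is an illustration under stated assumptions, not Elasticsearch code: TypoMatchDemo, grams, and shared are hypothetical names, the gram sizes mirror min_gram 2 and max_gram 3 from post-setting.json, and the match query is assumed to use its default OR operator, so a single shared gram is enough for a hit.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class TypoMatchDemo {

    // 2- and 3-grams of a single word, mirroring min_gram=2 / max_gram=3.
    public static Set<String> grams(String word) {
        Set<String> out = new LinkedHashSet<>();
        for (int start = 0; start < word.length(); start++) {
            for (int len = 2; len <= 3 && start + len <= word.length(); len++) {
                out.add(word.substring(start, start + len));
            }
        }
        return out;
    }

    // Grams that the query and the indexed word have in common.
    public static Set<String> shared(String query, String indexed) {
        Set<String> common = grams(query);
        common.retainAll(grams(indexed));
        return common;
    }

    public static void main(String[] args) {
        System.out.println(shared("텅밀빵", "통밀빵")); // [밀빵]
        System.out.println(shared("텅밀", "통밀빵"));   // []
    }
}
```

The second call shows the limit of this approach: a typo is tolerated only while at least one gram of the query survives intact. For the same reason, any title containing '밀빵' also matches, so recall goes up at some cost in precision.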
