Natural Language Processing: Keyword Extraction (KeyBERT) with Python & PyTorch

# !pip install sentence_transformers

import numpy as np
import itertools

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

def max_sum_sim(doc_embedding, candidate_embeddings, words, top_n, nr_candidates):
    # Max Sum Distance: keep the candidates most similar to the document,
    # then choose the top_n of them that are least similar to each other.

    # Similarity between the document and each candidate keyword
    distances = cosine_similarity(doc_embedding, candidate_embeddings)

    # Pairwise similarity between the candidate keywords
    distances_candidates = cosine_similarity(candidate_embeddings,
                                            candidate_embeddings)

    # Keep the nr_candidates keywords with the highest cosine similarity to the document.
    words_idx = list(distances.argsort()[0][-nr_candidates:])
    # Use the `words` argument here; the name `candidates` is not defined in this function.
    words_vals = [words[index] for index in words_idx]
    distances_candidates = distances_candidates[np.ix_(words_idx, words_idx)]

    # Among those, find the combination of top_n keywords that are least similar to each other.
    min_sim = np.inf
    candidate = None
    for combination in itertools.combinations(range(len(words_idx)), top_n):
        sim = sum([distances_candidates[i][j] for i in combination for j in combination if i != j])
        if sim < min_sim:
            candidate = combination
            min_sim = sim

    return [words_vals[idx] for idx in candidate]
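
To make the combination search easier to picture, here is a minimal sketch using hand-made 2-D vectors in place of sentence embeddings. The words, the vectors, and the helper `cosine_sim_matrix` are invented for illustration (they are not model output), and plain NumPy stands in for scikit-learn's `cosine_similarity`.

```python
import itertools
import numpy as np

def cosine_sim_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Hypothetical toy data: made-up 2-D "embeddings", not real model output.
toy_words = ["red rock", "shattered rock", "new construction", "smoking ruins"]
toy_embeddings = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.6, 0.8],
])

pairwise = cosine_sim_matrix(toy_embeddings, toy_embeddings)

# Try every pair and keep the one whose members are least similar to each other.
best_pair = min(
    itertools.combinations(range(len(toy_words)), 2),
    key=lambda pair: pairwise[pair[0], pair[1]],
)
least_similar = [toy_words[i] for i in best_pair]
print(least_similar)
```

The two near-duplicate "rock" phrases never end up together, which is the effect the search in max_sum_sim is after. Note that the number of combinations grows quickly with nr_candidates and top_n, so keeping nr_candidates small matters.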

def mmr(doc_embedding, candidate_embeddings, words, top_n, diversity):
    # Maximal Marginal Relevance: diversifies the results by minimizing
    # redundancy among the selected keywords.

    # Similarity between each candidate keyword and the document
    word_doc_similarity = cosine_similarity(candidate_embeddings, doc_embedding)

    # Pairwise similarity between the candidate keywords
    word_similarity = cosine_similarity(candidate_embeddings)

    # Index of the keyword most similar to the document.
    # For example, if keyword 2 has the highest similarity:
    # keywords_idx = [2]
    keywords_idx = [np.argmax(word_doc_similarity)]

    # Indices of all remaining keywords, excluding the one just selected.
    # For example, if keyword 2 has the highest similarity:
    # ==> candidates_idx = [0, 1, 3, 4, 5, 6, 7, 8, 9, 10, ...]
    candidates_idx = [i for i in range(len(words)) if i != keywords_idx[0]]

    # The best keyword is already selected, so repeat top_n - 1 times.
    # e.g. with top_n = 5, the loop below runs 4 times.
    for _ in range(top_n - 1):
        candidate_similarities = word_doc_similarity[candidates_idx, :]
        target_similarities = np.max(word_similarity[candidates_idx][:, keywords_idx], axis=1)

        # Compute the MMR score (named mmr_scores to avoid shadowing the function name)
        mmr_scores = (1-diversity) * candidate_similarities - diversity * target_similarities.reshape(-1, 1)
        mmr_idx = candidates_idx[np.argmax(mmr_scores)]

        # Update keywords & candidates
        keywords_idx.append(mmr_idx)
        candidates_idx.remove(mmr_idx)

    return [words[idx] for idx in keywords_idx]

As the comments describe, both functions first compute cosine similarities and then return the most meaningful words.
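
The effect of the diversity parameter in MMR can be sketched on toy data. The version below is a simplified re-implementation for illustration only: `toy_mmr`, `cosine_sim_matrix`, and the 2-D vectors are assumptions made up for this example, and NumPy replaces scikit-learn so the snippet runs without downloading a model.

```python
import numpy as np

def cosine_sim_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def toy_mmr(doc_vec, cand_vecs, words, top_n, diversity):
    """Greedy MMR: trade relevance to the document against redundancy."""
    doc_sim = cosine_sim_matrix(cand_vecs, doc_vec).ravel()
    cand_sim = cosine_sim_matrix(cand_vecs, cand_vecs)

    # Start from the candidate most similar to the document.
    selected = [int(np.argmax(doc_sim))]
    remaining = [i for i in range(len(words)) if i != selected[0]]

    for _ in range(top_n - 1):
        # Highest similarity of each remaining candidate to anything already selected.
        redundancy = cand_sim[np.ix_(remaining, selected)].max(axis=1)
        scores = (1 - diversity) * doc_sim[remaining] - diversity * redundancy
        pick = remaining[int(np.argmax(scores))]
        selected.append(pick)
        remaining.remove(pick)

    return [words[i] for i in selected]

# Hypothetical toy data: made-up vectors, not real sentence embeddings.
toy_doc = np.array([[1.0, 0.2]])
toy_words = ["red rock", "shattered rock", "new construction"]
toy_cands = np.array([[1.0, 0.1], [0.8, 0.5], [0.1, 1.0]])

low = toy_mmr(toy_doc, toy_cands, toy_words, top_n=2, diversity=0.1)
high = toy_mmr(toy_doc, toy_cands, toy_words, top_n=2, diversity=0.9)
print(low)
print(high)
```

With a low diversity value the second pick stays close to the first one, while a high value pushes the selection toward a phrase that differs from what was already chosen.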

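def main_1(doc,n_gram_range,model):
    # Extract candidate phrases from the document.
    # With n_gram_range = (3, 3), each candidate is a three-word phrase.
    stop_words = "english"

    count = CountVectorizer(ngram_range=n_gram_range, stop_words=stop_words).fit([doc])
    candidates = count.get_feature_names_out()

    print('Number of candidate n-grams:',len(candidates))
    print('First five candidates:',candidates[:5])

    # Embed the document and every candidate phrase with the same model.
    doc_embedding = model.encode([doc])
    candidate_embeddings = model.encode(candidates)

    # Return 5 keywords chosen by Max Sum Distance from the 10 candidates most similar to the document.
    return max_sum_sim(doc_embedding, candidate_embeddings, candidates, top_n=5, nr_candidates=10)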

This function lets you control how many words make up each candidate phrase through n_gram_range.
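
To show what n_gram_range changes, here is a rough pure-Python sketch of how the candidates are formed. It only approximates CountVectorizer: `toy_ngrams` and `TOY_STOP_WORDS` are hypothetical names invented for this example, and the real "english" stop-word list is much longer.

```python
import re

# Hypothetical, tiny stop-word set for illustration only; CountVectorizer's
# built-in "english" list is much longer.
TOY_STOP_WORDS = {"the", "of", "a", "at", "its", "is"}

def toy_ngrams(text, n_gram_range):
    """Return sorted unique n-grams, roughly the way CountVectorizer builds them."""
    # Lowercase, keep tokens of two or more word characters, drop stop words.
    tokens = [t for t in re.findall(r"\b\w\w+\b", text.lower())
              if t not in TOY_STOP_WORDS]
    low, high = n_gram_range
    grams = set()
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[start:start + n]))
    return sorted(grams)

sentence = "A shattered red rock looms at the center"
unigrams = toy_ngrams(sentence, (1, 1))
trigrams = toy_ngrams(sentence, (3, 3))
print(unigrams)  # ['center', 'looms', 'red', 'rock', 'shattered']
print(trigrams)  # ['red rock looms', 'rock looms center', 'shattered red rock']
```

Because stop words are removed before the n-grams are built, a phrase such as "rock looms center" can join words that were not adjacent in the original sentence.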

doc = """
Peter Blume (I906-I992) The Rock I944-48 Oil on canvas Gift of Edgar Kaufmann, Jr., I956.338 Peter Blume labored for years to complete The Rock, which was commissioned in I939 by the Edgar Kaufmann family for their Frank Lloyd Wright-designed home, Falling Water, in Bear Run, Pennsylvania. Although the complex imagery of the painting resists easy interpretation, it ap- pears to be a parable of devastation and reconstruction, possibly reflecting the turbulence of World War II and its aftermath. A shattered red rock looms at the center of the composition, its base seemingly eroded by the activities of the workers below. Smoking ruins of brick a house the to right of the rock contrast sharply with the new construction on the left, which features architectural elements reminis- cent of Falling Water. While Wright may have intended Falling Water to symbolize man and nature existing in har- mony, The Rock also alludes to man's destructive power.
"""

#####
n_gram_range = (3, 3) # Change to (1, 1) to extract single words instead.
model = SentenceTransformer('distilbert-base-nli-mean-tokens')


main_1(doc,n_gram_range,model)

With (3, 3) as input, the result comes out as three-word phrases; the whitespace between words appears to be what defines each unit.

In short, this post covered how to use a KeyBERT-style approach to extract keywords from a passage.


1원장자
