HybridSimilarity Algorithm

In this article, we take a detailed look at the HybridSimilarity algorithm, a custom neural-network-based model for measuring the similarity between two pieces of text. The hybrid model combines lexical, phonetic, semantic, and syntactic signals into a single comprehensive similarity score.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sentence_transformers import SentenceTransformer
from Levenshtein import ratio as levenshtein_ratio
from phonetics import metaphone
import torch
import torch.nn as nn

class HybridSimilarity(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = SentenceTransformer('all-MiniLM-L6-v2')
        self.tfidf = TfidfVectorizer()
        self.attention = nn.MultiheadAttention(embed_dim=384, num_heads=4)
        # Aggregates the six scalar features extracted below.
        self.fc = nn.Sequential(
            nn.Linear(6, 256),
            nn.ReLU(),
            nn.LayerNorm(256),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )

    def _extract_features(self, text1, text2):
        features = {}

        # Lexical similarity
        features['levenshtein'] = levenshtein_ratio(text1, text2)
        features['jaccard'] = len(set(text1.split()) & set(text2.split())) / len(set(text1.split()) | set(text2.split()))

        # Phonetic similarity
        features['metaphone'] = 1.0 if metaphone(text1) == metaphone(text2) else 0.0

        # Semantic embedding (BERT); dim=0 because the embeddings are 1-D vectors
        emb1 = self.bert.encode(text1, convert_to_tensor=True)
        emb2 = self.bert.encode(text2, convert_to_tensor=True)
        features['semantic_cosine'] = nn.CosineSimilarity(dim=0)(emb1, emb2).item()

        # Syntactic similarity (LSA-TFIDF)
        tfidf_matrix = self.tfidf.fit_transform([text1, text2])
        svd = TruncatedSVD(n_components=1)
        lsa = svd.fit_transform(tfidf_matrix)
        features['lsa_cosine'] = float(np.dot(lsa[0], lsa[1]))

        # Attention patterns
        att_output, _ = self.attention(
            emb1.unsqueeze(0).unsqueeze(0),
            emb2.unsqueeze(0).unsqueeze(0),
            emb2.unsqueeze(0).unsqueeze(0)
        )
        features['attention_score'] = att_output.mean().item()

        return torch.tensor(list(features.values())).unsqueeze(0)

    def forward(self, text1, text2):
        features = self._extract_features(text1, text2)
        return self.fc(features).item()

def similarity_coefficient(text1, text2):
    model = HybridSimilarity()
    return model(text1, text2)

Key Components of the Algorithm

The HybridSimilarity model utilizes the following libraries and technologies:

  • SentenceTransformers: For semantic embedding generation using pre-trained transformer models.
  • Levenshtein Ratio: To calculate lexical similarity.
  • Phonetics (Metaphone): For phonetic similarity.
  • TF-IDF and TruncatedSVD: For syntactic similarity through Latent Semantic Analysis (LSA).
  • PyTorch: To define a custom neural network with attention mechanisms and fully connected layers.
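For reference, the imports in the listing above correspond, as far as I can tell, to these PyPI packages (the names may differ in your environment, so verify before installing):

pip install numpy scikit-learn torch sentence-transformers Levenshtein phonetics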

Step-by-Step Explanation

1. Model Initialization

The HybridSimilarity class inherits from nn.Module and initializes:

  • A BERT-based sentence embedding model (all-MiniLM-L6-v2).
  • A TF-IDF vectorizer for text vectorization.
  • A multi-head attention mechanism to capture interdependencies between text pairs.
  • A fully connected neural network for aggregating features and producing the final similarity score.
self.bert = SentenceTransformer('all-MiniLM-L6-v2')
self.tfidf = TfidfVectorizer()
self.attention = nn.MultiheadAttention(embed_dim=384, num_heads=4)
self.fc = nn.Sequential(
    nn.Linear(6, 256),
    nn.ReLU(),
    nn.LayerNorm(256),
    nn.Linear(256, 1),
    nn.Sigmoid()
)

2. Feature Extraction

The _extract_features method calculates multiple similarity features:

  • Lexical Similarity
    • Levenshtein ratio: a normalized similarity score derived from the character-level edit distance between the two texts.
    • Jaccard index: the overlap between the sets of unique words in each text (intersection over union).
features['levenshtein'] = levenshtein_ratio(text1, text2)
features['jaccard'] = len(set(text1.split()) & set(text2.split())) / len(set(text1.split()) | set(text2.split()))
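One edge case worth flagging: the Jaccard line divides by the size of the word-set union, which is zero when both texts are empty. A minimal guarded version (my addition, not part of the original code) could look like this:

def jaccard(text1, text2):
    # Word-level Jaccard index with a guard for empty inputs.
    a, b = set(text1.split()), set(text2.split())
    union = a | b
    return len(a & b) / len(union) if union else 1.0  # treat empty vs. empty as identical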

  • Phonetic Similarity
    • Metaphone encoding: Checks if the phonetic representation of both texts matches.
features['metaphone'] = 1.0 if metaphone(text1) == metaphone(text2) else 0.0
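Because an exact Metaphone match yields only 0 or 1, near-homophones still score 0. A softer variant, which is my own suggestion rather than part of the article's code, compares the Metaphone encodings with the same Levenshtein ratio used for the lexical feature:

from Levenshtein import ratio as levenshtein_ratio
from phonetics import metaphone

def phonetic_similarity(text1, text2):
    # Graded phonetic score: edit similarity of the two Metaphone encodings.
    return levenshtein_ratio(metaphone(text1), metaphone(text2))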

  • Semantic Similarity
    • Sentence embeddings are generated using BERT, and cosine similarity is calculated between them.
emb1 = self.bert.encode(text1, convert_to_tensor=True)
emb2 = self.bert.encode(text2, convert_to_tensor=True)
features['semantic_cosine'] = nn.CosineSimilarity(dim=0)(emb1, emb2).item()
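sentence_transformers also ships a small helper that computes the same cosine similarity, if you prefer it over instantiating nn.CosineSimilarity:

from sentence_transformers import SentenceTransformer, util

bert = SentenceTransformer('all-MiniLM-L6-v2')
emb1 = bert.encode("first text", convert_to_tensor=True)
emb2 = bert.encode("second text", convert_to_tensor=True)
score = util.cos_sim(emb1, emb2).item()  # 1x1 tensor -> float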

  • Syntactic Similarity
    • TF-IDF is used to vectorize the text, and Latent Semantic Analysis (LSA) is applied via TruncatedSVD.
tfidf_matrix = self.tfidf.fit_transform([text1, text2])
svd = TruncatedSVD(n_components=1)
lsa = svd.fit_transform(tfidf_matrix)
features['lsa_cosine'] = float(np.dot(lsa[0], lsa[1]))
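One caveat: with n_components=1 each document collapses to a single scalar, so a true cosine between the two LSA vectors would always be ±1; the dot product above mostly reflects magnitude. An alternative I would suggest (not the article's method) is to compute cosine similarity directly on the TF-IDF rows:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(["first text", "second text"])
score = cosine_similarity(matrix[0], matrix[1])[0, 0]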

  • Attention Mechanism
    • A multi-head attention mechanism is applied to the embeddings, and the average attention score is used as a feature.
att_output, _ = self.attention(
    emb1.unsqueeze(0).unsqueeze(0),
    emb2.unsqueeze(0).unsqueeze(0),
    emb2.unsqueeze(0).unsqueeze(0)
)
features['attention_score'] = att_output.mean().item()
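Some shape bookkeeping, since nn.MultiheadAttention defaults to (seq_len, batch, embed_dim) inputs: each 384-dimensional embedding is lifted to shape (1, 1, 384), i.e. a sequence of length 1 with batch size 1. A standalone sketch with stand-in embeddings:

import torch
import torch.nn as nn

attention = nn.MultiheadAttention(embed_dim=384, num_heads=4)  # batch_first=False by default
emb1 = torch.randn(384)  # stand-in for the first sentence embedding
emb2 = torch.randn(384)  # stand-in for the second

q = emb1.unsqueeze(0).unsqueeze(0)      # query:      (1, 1, 384)
k = v = emb2.unsqueeze(0).unsqueeze(0)  # key, value: (1, 1, 384)

att_output, att_weights = attention(q, k, v)
print(att_output.shape)   # torch.Size([1, 1, 384])
print(att_weights.shape)  # torch.Size([1, 1, 1]) -- trivial for length-1 sequences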

3. Neural Network Aggregation

The extracted features are assembled into a single vector and passed through a fully connected neural network, which maps them to a similarity score between 0 and 1.

def forward(self, text1, text2):
    features = self._extract_features(text1, text2)
    return self.fc(features).item()
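Note that the fully connected layers are randomly initialized, so the aggregated score is not meaningful until the network is trained. The article does not cover training, but a minimal sketch on labeled pairs, assuming labels of 0.0 or 1.0 and binary cross-entropy loss, might look like this:

import torch

def train(model, pairs, epochs=5, lr=1e-3):
    # pairs: iterable of (text1, text2, label) with label in {0.0, 1.0}.
    # Only the aggregation head is optimized; the feature extraction
    # produces plain Python floats, so no gradient flows through it.
    optimizer = torch.optim.Adam(model.fc.parameters(), lr=lr)
    loss_fn = torch.nn.BCELoss()
    for _ in range(epochs):
        for text1, text2, label in pairs:
            features = model._extract_features(text1, text2)
            pred = model.fc(features).squeeze()
            loss = loss_fn(pred, torch.tensor(label))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()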

Example Usage

The similarity_coefficient function initializes the model and calculates the similarity between two input texts.

text_a = "The quick brown fox jumps over the lazy dog"
text_b = "A fast brown fox leaps over a sleepy hound"

print(f"Similarity coefficient: {similarity_coefficient(text_a, text_b):.4f}")

This function calls the HybridSimilarity model and outputs a similarity score: a float between 0 (completely dissimilar) and 1 (identical).
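One practical note: similarity_coefficient constructs a fresh HybridSimilarity, and therefore reloads the SentenceTransformer weights, on every call. For repeated comparisons it is cheaper to build the model once and reuse it:

model = HybridSimilarity()  # load the embedding model once

for a, b in [("cat", "kitten"), ("cat", "car")]:
    print(a, b, model(a, b))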

Conclusion

The HybridSimilarity algorithm combines multiple dimensions of text similarity in a single model. By integrating lexical, phonetic, semantic, and syntactic features, the hybrid approach produces a more nuanced similarity analysis than any single measure, making it suitable for tasks such as duplicate detection, text clustering, and recommendation systems.
