Explaining the HybridSimilarity Algorithm
In this article, we take a close look at the HybridSimilarity algorithm, a custom neural-network-based model for measuring the similarity between two pieces of text. The model combines lexical, phonetic, semantic, and syntactic measures into a single comprehensive similarity score.
```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sentence_transformers import SentenceTransformer
from Levenshtein import ratio as levenshtein_ratio
from phonetics import metaphone
import torch
import torch.nn as nn

class HybridSimilarity(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = SentenceTransformer('all-MiniLM-L6-v2')
        self.tfidf = TfidfVectorizer()
        self.attention = nn.MultiheadAttention(embed_dim=384, num_heads=4)
        self.fc = nn.Sequential(
            nn.Linear(6, 256),  # the head consumes the six scalar features extracted below
            nn.ReLU(),
            nn.LayerNorm(256),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )

    def _extract_features(self, text1, text2):
        features = {}

        # Lexical similarity
        features['levenshtein'] = levenshtein_ratio(text1, text2)
        words1, words2 = set(text1.split()), set(text2.split())
        features['jaccard'] = len(words1 & words2) / max(len(words1 | words2), 1)

        # Phonetic similarity
        features['metaphone'] = 1.0 if metaphone(text1) == metaphone(text2) else 0.0

        # Semantic similarity (Sentence-BERT); each embedding is a 1-D tensor of size 384
        emb1 = self.bert.encode(text1, convert_to_tensor=True)
        emb2 = self.bert.encode(text2, convert_to_tensor=True)
        features['semantic_cosine'] = nn.CosineSimilarity(dim=0)(emb1, emb2).item()

        # Syntactic similarity (TF-IDF + LSA); lsa has shape (2, 1), one row per text
        tfidf_matrix = self.tfidf.fit_transform([text1, text2])
        svd = TruncatedSVD(n_components=1)
        lsa = svd.fit_transform(tfidf_matrix)
        features['lsa_cosine'] = float(np.dot(lsa[0], lsa[1]))

        # Attention patterns; inputs are reshaped to (seq_len=1, batch=1, embed_dim=384)
        att_output, _ = self.attention(
            emb1.unsqueeze(0).unsqueeze(0),
            emb2.unsqueeze(0).unsqueeze(0),
            emb2.unsqueeze(0).unsqueeze(0)
        )
        features['attention_score'] = att_output.mean().item()

        return torch.tensor(list(features.values())).unsqueeze(0)

    def forward(self, text1, text2):
        features = self._extract_features(text1, text2)
        return self.fc(features)

def similarity_coefficient(text1, text2):
    model = HybridSimilarity()
    return model(text1, text2).item()
```
Key Components of the Algorithm
The HybridSimilarity model relies on the following libraries and technologies (a quick dependency smoke test follows the list):
- SentenceTransformers: For semantic embedding generation using pre-trained transformer models.
- Levenshtein Ratio: To calculate lexical similarity.
- Phonetics (Metaphone): For phonetic similarity.
- TF-IDF and TruncatedSVD: For syntactic similarity through Latent Semantic Analysis (LSA).
- PyTorch: To define a custom neural network with attention mechanisms and fully connected layers.
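Before running the full listing, it helps to confirm these dependencies import cleanly. The sketch below assumes the PyPI packages `sentence-transformers`, `scikit-learn`, `python-Levenshtein` (imported as `Levenshtein`), `phonetics`, and `torch`; exact package names may vary by environment.

```python
# Minimal dependency smoke test for the libraries used by HybridSimilarity
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from Levenshtein import ratio as levenshtein_ratio
from phonetics import metaphone
import torch

print(levenshtein_ratio("kitten", "sitting"))   # character-level similarity in [0, 1]
print(metaphone("knight"), metaphone("night"))  # phonetic codes, often identical for homophones
print(torch.__version__)
```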
Step-by-Step Explanation
1. Model Initialization
The `HybridSimilarity` class inherits from `nn.Module` and initializes:
- A BERT-based sentence embedding model (`all-MiniLM-L6-v2`).
- A TF-IDF vectorizer for text vectorization.
- A multi-head attention mechanism to capture interdependencies between text pairs.
- A fully connected neural network for aggregating features and producing the final similarity score.
```python
self.bert = SentenceTransformer('all-MiniLM-L6-v2')
self.tfidf = TfidfVectorizer()
self.attention = nn.MultiheadAttention(embed_dim=384, num_heads=4)
self.fc = nn.Sequential(
    nn.Linear(6, 256),  # input size matches the six scalar features extracted later
    nn.ReLU(),
    nn.LayerNorm(256),
    nn.Linear(256, 1),
    nn.Sigmoid()
)
```
2. Feature Extraction
The `_extract_features` method calculates multiple similarity features:
- Lexical Similarity
- Levenshtein ratio: A normalized measure of the character-level edits needed to convert one text into the other (1.0 means identical).
- Jaccard index: Compares the sets of unique words in both texts. For example, "the cat sat" and "the cat ran" share two of four unique words, giving a Jaccard index of 0.5.
```python
features['levenshtein'] = levenshtein_ratio(text1, text2)
words1, words2 = set(text1.split()), set(text2.split())
features['jaccard'] = len(words1 & words2) / max(len(words1 | words2), 1)
```
- Phonetic Similarity
- Metaphone encoding: Checks whether the phonetic representations of the two texts match; for instance, "knight" and "night" typically map to the same Metaphone code.
```python
features['metaphone'] = 1.0 if metaphone(text1) == metaphone(text2) else 0.0
```
- Semantic Similarity
- Sentence embeddings are generated using BERT, and cosine similarity is calculated between them.
```python
emb1 = self.bert.encode(text1, convert_to_tensor=True)
emb2 = self.bert.encode(text2, convert_to_tensor=True)
# The encoded embeddings are 1-D tensors, so cosine similarity is taken along dim 0
features['semantic_cosine'] = nn.CosineSimilarity(dim=0)(emb1, emb2).item()
```
- Syntactic Similarity
- TF-IDF is used to vectorize the text, and Latent Semantic Analysis (LSA) is applied via TruncatedSVD.
```python
tfidf_matrix = self.tfidf.fit_transform([text1, text2])
svd = TruncatedSVD(n_components=1)
lsa = svd.fit_transform(tfidf_matrix)  # shape (2, 1): one LSA coordinate per text
features['lsa_cosine'] = float(np.dot(lsa[0], lsa[1]))
```
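Note that `TruncatedSVD.fit_transform` on two documents with one component returns a `(2, 1)` array, so `lsa[0]` and `lsa[1]` are single-element vectors whose dot product is already a scalar. A standalone sketch to verify the shapes (the sample sentences are placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import numpy as np

docs = ["the cat sat on the mat", "a cat sat on a mat"]
tfidf_matrix = TfidfVectorizer().fit_transform(docs)
lsa = TruncatedSVD(n_components=1).fit_transform(tfidf_matrix)

print(lsa.shape)                      # (2, 1) -- one LSA coordinate per document
print(float(np.dot(lsa[0], lsa[1])))  # scalar product of the two 1-D rows
```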
- Attention Mechanism
- Multi-head attention is applied with the first embedding as the query and the second as key and value; the mean of the attention output is used as a feature.
```python
# Inputs are reshaped to (seq_len=1, batch=1, embed_dim=384)
att_output, _ = self.attention(
    emb1.unsqueeze(0).unsqueeze(0),
    emb2.unsqueeze(0).unsqueeze(0),
    emb2.unsqueeze(0).unsqueeze(0)
)
features['attention_score'] = att_output.mean().item()
```
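PyTorch's `nn.MultiheadAttention` defaults to `batch_first=False` and expects `(seq_len, batch, embed_dim)` inputs, which is why each 384-dimensional embedding is unsqueezed twice. A quick standalone shape check:

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=384, num_heads=4)
q = torch.randn(1, 1, 384)  # (seq_len, batch, embed_dim)
out, weights = attn(q, q, q)
print(out.shape)      # torch.Size([1, 1, 384])
print(weights.shape)  # torch.Size([1, 1, 1]) -- averaged over heads by default
```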
3. Neural Network Aggregation
The extracted features are assembled into a single vector and passed through a fully connected network, which maps them to a similarity score between 0 and 1. `forward` returns this score as a tensor; the `similarity_coefficient` helper converts it to a plain float.
```python
def forward(self, text1, text2):
    features = self._extract_features(text1, text2)
    return self.fc(features)
```
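As written, the fully connected head is randomly initialized, so its output is not yet meaningful. Below is a minimal training sketch, assuming a small labeled dataset of `(text1, text2, label)` pairs with labels in {0.0, 1.0}; the pairs shown are illustrative placeholders only.

```python
import torch
import torch.nn as nn

# Hypothetical labeled pairs: 1.0 = similar, 0.0 = dissimilar
pairs = [
    ("the cat sat on the mat", "a cat sat on a mat", 1.0),
    ("the cat sat on the mat", "stock prices fell sharply", 0.0),
]

model = HybridSimilarity()
# Only the head is trainable here: the extracted features are plain floats,
# so no gradient flows back into BERT or the attention module
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

for epoch in range(10):
    for text1, text2, label in pairs:
        optimizer.zero_grad()
        score = model(text1, text2).squeeze()
        loss = loss_fn(score, torch.tensor(label))
        loss.backward()
        optimizer.step()
```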
Example Usage
The `similarity_coefficient` function initializes the model, computes the similarity between two input texts, and returns it as a float.
```python
text_a = "The quick brown fox jumps over the lazy dog"
text_b = "A fast brown fox leaps over a sleepy hound"

print(f"Similarity coefficient: {similarity_coefficient(text_a, text_b):.4f}")
```
This function runs the `HybridSimilarity` model and outputs a similarity score as a float between 0 (completely dissimilar) and 1 (identical). Keep in mind that until the fully connected head has been trained (see the sketch above), the score reflects random initial weights rather than a calibrated similarity.
Conclusion
The HybridSimilarity algorithm combines multiple dimensions of text similarity into a unified model. By integrating lexical, phonetic, semantic, and syntactic features, this hybrid approach produces a more nuanced similarity analysis than any single measure alone, making it suitable for tasks such as duplicate detection, text clustering, and recommendation systems.