Train a Sentence-CamemBERT

CamemBERT is a state-of-the-art language model for French.

It is a RoBERTa model trained on a large corpus of French text, and it can easily be adapted to many downstream tasks through finetuning.

Here, we're going to finetune the model for sentence embedding.

Sentence-BERT

The output of a BERT model is an embedding vector for each token. To obtain an embedding for the text as a whole, we need a strategy for aggregating the individual token embeddings into a single sentence embedding.

The simplest strategy, and the one that works best in practice, is to take the average of the token embeddings.
This strategy is known as mean pooling.
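As a minimal sketch of what mean pooling computes (a hypothetical helper, assuming PyTorch tensors coming out of the transformer), padding tokens are masked out before averaging:

```python
import torch


def mean_pooling(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average the token embeddings, ignoring padding positions.

    token_embeddings: (batch, seq_len, hidden)
    attention_mask:   (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).float()    # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)  # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)       # number of real tokens
    return summed / counts                         # (batch, hidden)
```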

If you’d like to find out more about the strategies that have been considered, take a look at this paper: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.

Finetuning a BERT model into a Sentence-BERT model

The authors of the paper mentioned above have built a Python library called sentence-transformers for working with Sentence-BERT models.

We'll use it to turn a CamemBERT model available on Hugging Face into a Sentence-CamemBERT model.

Prerequisites

We will be using the following packages:

```
datasets
sentence-transformers
```
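They can be installed with pip:

```
pip install datasets sentence-transformers
```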

Training data

First of all, we’re going to retrieve the training data.
We will use the French part of the STSb Multi MT dataset, which contains pairs of sentences together with a score between 0 and 5 representing how similar the two sentences are.

```python
from datasets import load_dataset

sts_train_dataset = load_dataset("stsb_multi_mt", name="fr", split="train")
sts_dev_dataset = load_dataset("stsb_multi_mt", name="fr", split="dev")
sts_test_dataset = load_dataset("stsb_multi_mt", name="fr", split="test")
```
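To see what a record looks like, you can print the first training pair (the field names are the ones used in the conversion below):

```python
# Inspect one record: each example has two sentences and a similarity score.
print(sts_train_dataset[0])
```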

We’ll then convert the retrieved data into InputExample objects that can be used for training.

```python
from typing import List

from sentence_transformers import InputExample


def dataset_to_input_examples(dataset) -> List[InputExample]:
    # Scores are divided by 5 so that labels lie in [0, 1],
    # the range expected by the cosine-similarity loss used below.
    return [
        InputExample(
            texts=[example["sentence1"], example["sentence2"]],
            label=example["similarity_score"] / 5.0,
        )
        for example in dataset
    ]


sts_train_examples = dataset_to_input_examples(sts_train_dataset)
sts_dev_examples = dataset_to_input_examples(sts_dev_dataset)
sts_test_examples = dataset_to_input_examples(sts_test_dataset)
```
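A quick sanity check that the conversion worked and that labels are in [0, 1] (texts and label are attributes of InputExample):

```python
first = sts_train_examples[0]
print(first.texts, first.label)  # two sentences and a normalized score
```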

We will use the CamemBERT model almanach/camembert-base as the starting point for finetuning. When SentenceTransformer is given a plain transformer checkpoint like this one, it automatically adds a mean-pooling layer on top, which is exactly the pooling strategy discussed above:

```python
from sentence_transformers import evaluation, losses, SentenceTransformer
from torch.utils.data import DataLoader

batch_size = 32

model = SentenceTransformer("almanach/camembert-base")

train_dataloader = DataLoader(sts_train_examples, shuffle=True, batch_size=batch_size)
train_loss = losses.CosineSimilarityLoss(model=model)
```

We use the cosine-similarity loss as the training objective: the model is trained so that the cosine similarity between the two sentence embeddings matches the normalized gold score.
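Conceptually, for embeddings u and v of a sentence pair with gold score y in [0, 1], the loss is the squared error between cos(u, v) and y. A minimal hand-rolled equivalent (a hypothetical helper, not the library's internals verbatim) might look like:

```python
import torch.nn.functional as F


def cosine_similarity_loss(u, v, labels):
    """MSE between the embeddings' cosine similarity and the gold score.

    u, v:   (batch, hidden) sentence embeddings
    labels: (batch,) similarity scores scaled to [0, 1]
    """
    cos = F.cosine_similarity(u, v, dim=-1)  # (batch,)
    return F.mse_loss(cos, labels)
```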

Finally, we build an evaluator to monitor the model's performance on the dev set during training, by correlating the predicted cosine similarities with the gold scores.

```python
sts_dev_evaluator = evaluation.EmbeddingSimilarityEvaluator.from_input_examples(
    sts_dev_examples, name="sts-dev"
)
```

We can now start training the model:

```python
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=sts_dev_evaluator,
    epochs=10,
    warmup_steps=500,
    save_best_model=True,
)
```
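One caveat worth knowing: in the versions of sentence-transformers I've used, save_best_model only writes the best checkpoint to disk when fit is also given an output_path. The directory name below is an arbitrary choice:

```python
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=sts_dev_evaluator,
    epochs=10,
    warmup_steps=500,
    output_path="sts-camembert-base",  # where the best model is saved
    save_best_model=True,
)
```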

Model evaluation

Once training is complete, you can measure the model's performance on the held-out test set:

```python
sts_test_evaluator = evaluation.EmbeddingSimilarityEvaluator.from_input_examples(
    sts_test_examples, name="sts-test"
)

sts_test_evaluator(model, ".")
```

I get a Pearson correlation of 0.837, which is on par with the Sentence-CamemBERT models I found on Hugging Face:

| Model | Pearson correlation | Parameters |
| --- | --- | --- |
| h4c5/sts-camembert-base | 0.837 | 110M |
| Lajavaness/sentence-camembert-base | 0.835 | 110M |
| inokufu/flaubert-base-uncased-xnli-sts | 0.828 | 137M |
| h4c5/sts-distilcamembert-base | 0.817 | 68M |
| sentence-transformers/distiluse-base-multilingual-cased-v2 | 0.786 | 135M |

Distilled Sentence-CamemBERT model

As you may have noticed in the table above, I’ve also trained a Sentence-CamemBERT model that’s about half the size (68M parameters vs. 110M) and yet performs very well: h4c5/sts-distilcamembert-base.

This model was obtained by following the same procedure as above, but starting from the distilled CamemBERT model cmarkea/distilcamembert-base.
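The training code is unchanged; only the starting checkpoint differs:

```python
# Same pipeline as above, starting from the distilled checkpoint.
model = SentenceTransformer("cmarkea/distilcamembert-base")
```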

This so-called "distilled" model was obtained by removing half of the layers of the CamemBERT base model and then training it to match the original model's performance.

To find out more about the distillation process, see the DistilBERT paper (Sanh et al., 2019) and the DistilCamemBERT paper (Delestre & Amar, 2022).

Et voilà. You can find my two Sentence-CamemBERT models on Hugging Face: h4c5/sts-camembert-base and h4c5/sts-distilcamembert-base.
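As a quick usage sketch with one of the published checkpoints (the example sentences are made up):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("h4c5/sts-camembert-base")
embeddings = model.encode(
    ["Le chat dort sur le canapé.", "Un chat fait la sieste sur le sofa."]
)
# Cosine similarity of the two sentence embeddings; close paraphrases
# should score near 1.
print(util.cos_sim(embeddings[0], embeddings[1]))
```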
