Project Introduction: Twitter-roBERTa-base
Twitter-roBERTa-base is a natural language processing model built on RoBERTa-base and trained on roughly 58 million tweets. The model is described and evaluated in TweetEval, a unified benchmark for tweet classification. To evaluate this model and other language models on Twitter-specific data, see the official TweetEval repository.
Text Preprocessing
Before using the model, the text needs to be preprocessed: usernames and links are replaced with the placeholders "@user" and "http". The following Python function implements this:
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)
Example: Masked Language Modeling
With Twitter-roBERTa-base, you can run masked language model predictions, i.e. predict the masked word in a sentence. Here is a simple example:
from transformers import pipeline, AutoTokenizer
import numpy as np

MODEL = "cardiffnlp/twitter-roberta-base"
fill_mask = pipeline("fill-mask", model=MODEL, tokenizer=MODEL)
tokenizer = AutoTokenizer.from_pretrained(MODEL)

def print_candidates():
    for i in range(5):
        token = tokenizer.decode(candidates[i]['token'])
        score = np.round(candidates[i]['score'], 4)
        print(f"{i+1}) {token} {score}")

texts = [
    "I am so <mask> 😊",
    "I am so <mask> 😢"
]
for text in texts:
    t = preprocess(text)
    print(f"{'-'*30}\n{t}")
    candidates = fill_mask(t)
    print_candidates()
With this code, the model predicts words that fit each sentence's context: it tends to propose positive candidates for the sentence ending in 😊 and negative ones for the sentence ending in 😢.
Example: Tweet Embeddings
The model can also be used to generate tweet embeddings and compute similarities. For example, you can compute the similarity between the tweet "The book was awesome" and several other tweets:
from transformers import AutoTokenizer, AutoModel
import numpy as np
from scipy.spatial.distance import cosine
from collections import defaultdict

MODEL = "cardiffnlp/twitter-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

def get_embedding(text):
    text = preprocess(text)
    encoded_input = tokenizer(text, return_tensors='pt')
    features = model(**encoded_input)
    features = features[0].detach().cpu().numpy()
    features_mean = np.mean(features[0], axis=0)
    return features_mean

query = "The book was awesome"
tweets = ["I just ordered fried chicken 🐣",
          "The movie was great",
          "What time is the next game?",
          "Just finished reading 'Embeddings in NLP'"]

d = defaultdict(int)
for tweet in tweets:
    sim = 1 - cosine(get_embedding(query), get_embedding(tweet))
    d[tweet] = sim

print('Most similar to: ', query)
print('----------------------------------------')
for idx, x in enumerate(sorted(d.items(), key=lambda x: x[1], reverse=True)):
    print(idx + 1, x[0])
As the output shows, the tweet most similar to "The book was awesome" is "The movie was great".
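The similarity used above is cosine similarity (`1 - cosine(...)` in SciPy's distance terms). As a minimal sketch of what that computes, here it is implemented directly in NumPy on toy vectors standing in for tweet embeddings (the values are assumptions for illustration, not real model output):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity = dot product of the vectors divided by
    # the product of their Euclidean norms; 1.0 means same direction.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 2-D "embeddings" (assumed values, for illustration only).
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction → 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal → 0.0
```

Because it depends only on direction, cosine similarity is insensitive to embedding magnitude, which is why it is a common choice for comparing sentence vectors.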
Example: Feature Extraction
Twitter-roBERTa-base can also be used to extract features from text and produce the corresponding feature vectors:
from transformers import AutoTokenizer, AutoModel
import numpy as np

MODEL = "cardiffnlp/twitter-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

text = "Good night 😊"
text = preprocess(text)

# PyTorch
model = AutoModel.from_pretrained(MODEL)
encoded_input = tokenizer(text, return_tensors='pt')
features = model(**encoded_input)
features = features[0].detach().cpu().numpy()
features_mean = np.mean(features[0], axis=0)
With these steps, you obtain the average feature vector of the text, which can be used for further analysis and downstream applications.
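The averaging step (`np.mean(features[0], axis=0)` above) can be illustrated without loading the model, using a small made-up array in place of the real hidden states:

```python
import numpy as np

# Pretend output for a 3-token sequence with hidden size 4 (assumed values,
# not real RoBERTa activations); shape is (seq_len, hidden_size).
token_features = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [2.0, 2.0, 2.0, 2.0],
])

# Averaging over axis 0 collapses the sequence dimension, yielding one
# fixed-size vector per text regardless of how many tokens it contains.
sentence_vector = token_features.mean(axis=0)
print(sentence_vector)        # → [2. 2. 2. 2.]
print(sentence_vector.shape)  # → (4,)
```

This mean pooling is one simple way to turn per-token features into a single sentence vector; other pooling strategies (e.g. taking the first token's vector) are also common.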
If you use this model, please refer to the reference paper for detailed citation information.