TweetNLP
TweetNLP for all the NLP enthusiasts working on Twitter and social media!
The Python library tweetnlp
provides a collection of useful tools to analyze and understand tweets, such as sentiment analysis,
emoji prediction, and named-entity recognition, powered by state-of-the-art language models specialized in social media.
News (December 2022): We presented a TweetNLP demo paper ("TweetNLP: Cutting-Edge Natural Language Processing for Social Media") at EMNLP 2022. The final version can be found here.
TweetNLP Hugging Face page: All the main TweetNLP models can be found here on Hugging Face.
Resources:
- Quick Tour with Colab Notebook:
- Play with the TweetNLP Online Demo: link
- EMNLP 2022 paper: link
- 2nd Cardiff NLP Summer Workshop Tutorial:
- 2nd Cardiff NLP Summer Workshop Tutorial (solutions):
Table of Contents:
Get Started
Install TweetNLP via pip on your console.
pip install tweetnlp
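Once installed, a model can be loaded and used in a couple of lines. Below is a minimal sketch (assuming an internet connection, since the model weights are downloaded from Hugging Face on first use) based on the sentiment model described later in this document:
import tweetnlp
# Load the sentiment-analysis model (weights are fetched on first use).
model = tweetnlp.load_model('sentiment')
# Classify a single tweet; see the Sentiment Analysis example below.
model.sentiment("Yes, including Medicare and social security saving👍")
>>> {'label': 'positive'}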
Model & Dataset
In this section, you will learn how to get the models and datasets with tweetnlp.
The models follow the Hugging Face model format, and the datasets are provided in the Hugging Face datasets format.
Brief introductions to Hugging Face models and datasets can be found on the Hugging Face website, so
please check them if you are new to Hugging Face.
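As a quick illustration, the sketch below loads one of the datasets described later and inspects the two objects returned by tweetnlp.load_dataset. The exact splits and columns depend on the underlying Hugging Face dataset, so treat the printed structure as indicative rather than definitive:
import tweetnlp
# `load_dataset` returns the Hugging Face dataset object together with a
# mapping from label names to label ids.
dataset, label2id = tweetnlp.load_dataset('sentiment')
print(dataset)   # the available splits and columns (a Hugging Face datasets object)
print(label2id)  # label-name-to-id mapping, e.g. for 'positive', 'neutral', 'negative'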
Tweet Classification
The classification module consists of seven different tasks (Topic Classification, Sentiment Analysis, Irony Detection, Hate Speech Detection, Offensive Language Detection, Emoji Prediction, and Emotion Analysis).
In each example, the model is instantiated by tweetnlp.load_model("task-name"),
and predictions are run by passing a text or a list of texts as an argument to the corresponding function.
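Because each task function accepts either a single text or a list of texts, batch prediction is simply a matter of passing a list. The sketch below assumes the batched call returns one prediction per input text:
import tweetnlp
# Load the topic-classification model (see the single-input example below).
model = tweetnlp.load_model('topic_classification')
tweets = [
    "Jacob Collier is a Grammy-awarded English artist from London.",
    "Beautiful sunset last night from the pontoon @TupperLakeNY",
]
# Passing a list of texts is expected to yield one result per input.
predictions = model.topic(tweets)  # Or `model.predict(tweets)`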
- Topic Classification: The aim of this task is, given a tweet, to assign topics related to its content. The task is framed as a supervised multi-label classification problem where each tweet is assigned one or more topics from a total of 19 available topics. The topics were carefully curated based on Twitter trends with the aim of being broad and general, and consist of classes such as arts and culture, music, or sports. Our internally-annotated dataset contains over 10K manually-labeled tweets (check the paper here, or the Hugging Face dataset page).
import tweetnlp
# MULTI-LABEL MODEL
model = tweetnlp.load_model('topic_classification') # Or `model = tweetnlp.TopicClassification()`
model.topic("Jacob Collier is a Grammy-awarded English artist from London.") # Or `model.predict`
>>> {'label': ['celebrity_&_pop_culture', 'music']}
# Note: the probabilities of the multi-label model are the outputs of a sigmoid over an independent binary prediction of whether each topic applies (positive) or not (negative).
model.topic("Jacob Collier is a Grammy-awarded English artist from London.", return_probability=True)
>>> {'label': ['celebrity_&_pop_culture', 'music'],
'probability': {'arts_&_culture': 0.037371691316366196,
'business_&_entrepreneurs': 0.010188567452132702,
'celebrity_&_pop_culture': 0.92448890209198,
'diaries_&_daily_life': 0.03425711765885353,
'family': 0.00796138122677803,
'fashion_&_style': 0.020642118528485298,
'film_tv_&_video': 0.08062587678432465,
'fitness_&_health': 0.006343095097690821,
'food_&_dining': 0.0042883665300905704,
'gaming': 0.004327300935983658,
'learning_&_educational': 0.010652057826519012,
'music': 0.8291937112808228,
'news_&_social_concern': 0.24688217043876648,
'other_hobbies': 0.020671198144555092,
'relationships': 0.020371075719594955,
'science_&_technology': 0.0170074962079525,
'sports': 0.014291072264313698,
'travel_&_adventure': 0.010423899628221989,
'youth_&_student_life': 0.008605164475739002}}
# SINGLE-LABEL MODEL
model = tweetnlp.load_model('topic_classification', multi_label=False) # Or `model = tweetnlp.TopicClassification(multi_label=False)`
model.topic("Jacob Collier is a Grammy-awarded English artist from London.")
>>> {'label': 'pop_culture'}
# NOTE: the probabilities of the single-label model are the softmax over the labels.
model.topic("Jacob Collier is a Grammy-awarded English artist from London.", return_probability=True)
>>> {'label': 'pop_culture',
'probability': {'arts_&_culture': 9.20625461731106e-05,
'business_&_entrepreneurs': 6.916998972883448e-05,
'pop_culture': 0.9995898604393005,
'daily_life': 0.00011083036952186376,
'sports_&_gaming': 8.668467489769682e-05,
'science_&_technology': 5.152115045348182e-05}}
# GET DATASET
dataset_multi_label, label2id_multi_label = tweetnlp.load_dataset('topic_classification')
dataset_single_label, label2id_single_label = tweetnlp.load_dataset('topic_classification', multi_label=False)
- Sentiment Analysis: The sentiment analysis task integrated in TweetNLP is a simplified version where the goal is to predict the sentiment of a tweet with one of the following three labels: positive, neutral or negative. The base dataset for English is the unified TweetEval version of the SemEval 2017 dataset from the task on Sentiment Analysis in Twitter (check the paper here).
import tweetnlp
# ENGLISH MODEL
model = tweetnlp.load_model('sentiment') # Or `model = tweetnlp.Sentiment()`
model.sentiment("Yes, including Medicare and social security saving👍") # Or `model.predict`
>>> {'label': 'positive'}
model.sentiment("Yes, including Medicare and social security saving👍", return_probability=True)
>>> {'label': 'positive', 'probability': {'negative': 0.004584966693073511, 'neutral': 0.19360853731632233, 'positive': 0.8018065094947815}}
# MULTILINGUAL MODEL
model = tweetnlp.load_model('sentiment', multilingual=True) # Or `model = tweetnlp.Sentiment(multilingual=True)`
model.sentiment("天気が良いとやっぱり気持ち良いなあ✨")
>>> {'label': 'positive'}
model.sentiment("天気が良いとやっぱり気持ち良いなあ✨", return_probability=True)
>>> {'label': 'positive', 'probability': {'negative': 0.028369612991809845, 'neutral': 0.08128828555345535, 'positive': 0.8903420567512512}}
# GET DATASET (ENGLISH)
dataset, label2id = tweetnlp.load_dataset('sentiment')
# GET DATASET (MULTILINGUAL)
for l in ['all', 'arabic', 'english', 'french', 'german', 'hindi', 'italian', 'portuguese', 'spanish']:
    dataset_multilingual, label2id_multilingual = tweetnlp.load_dataset('sentiment', multilingual=True, task_language=l)
- Irony Detection: This is a binary classification task where given a tweet, the goal is to detect whether it is ironic or not. It is based on the Irony Detection dataset from the SemEval 2018 task (check the paper here).
import tweetnlp
# MODEL
model = tweetnlp.load_model('irony') # Or `model = tweetnlp.Irony()`
model.irony('If you wanna look like a badass, have drama on social media') # Or `model.predict`
>>> {'label': 'irony'}
model.irony('If you wanna look like a badass, have drama on social media', return_probability=True)
>>> {'label': 'irony', 'probability': {'non_irony': 0.08390884101390839, 'irony': 0.9160911440849304}}
# GET DATASET
dataset, label2id = tweetnlp.load_dataset('irony')
- Hate Speech Detection: The hate speech detection task consists of detecting whether a tweet is hateful towards a target community. The underlying model is based on a suite of unified hate speech detection datasets (see reference paper).
import tweetnlp
# MODEL
model = tweetnlp.load_model('hate') # Or `model = tweetnlp.Hate()`
model.hate('Whoever just unfollowed me you a bitch') # Or `model.predict`
>>> {'label': 'non-hate'}
model.hate('Whoever just unfollowed me you a bitch', return_probability=True)
>>> {'label': 'non-hate', 'probability': {'non-hate': 0.7263831496238708, 'hate': 0.27361682057380676}}
# GET DATASET
dataset, label2id = tweetnlp.load_dataset('hate')
- Offensive Language Identification: This task consists of identifying whether some form of offensive language is present in a tweet. For our benchmark we rely on the SemEval 2019 OffensEval dataset (check the paper here).
import tweetnlp
# MODEL
model = tweetnlp.load_model('offensive') # Or `model = tweetnlp.Offensive()`
model.offensive("All two of them taste like ass.") # Or `model.predict`
>>> {'label': 'offensive'}
model.offensive("All two of them taste like ass.", return_probability=True)
>>> {'label': 'offensive', 'probability': {'non-offensive': 0.16420328617095947, 'offensive': 0.8357967734336853}}
# GET DATASET
dataset, label2id = tweetnlp.load_dataset('offensive')
- Emoji Prediction: The goal of emoji prediction is to predict the final emoji on a given tweet. The dataset used to fine-tune our models is the TweetEval adaptation from the SemEval 2018 task on Emoji Prediction (check the paper here), including 20 emoji as labels (❤, 😍, 😂, 💕, 🔥, 😊, 😎, ✨, 💙, 😘, 📷, 🇺🇸, ☀, 💜, 😉, 💯, 😁, 🎄, 📸, 😜).
import tweetnlp
# MODEL
model = tweetnlp.load_model('emoji') # Or `model = tweetnlp.Emoji()`
model.emoji('Beautiful sunset last night from the pontoon @TupperLakeNY') # Or `model.predict`
>>> {'label': '📷'}
model.emoji('Beautiful sunset last night from the pontoon @TupperLakeNY', return_probability=True)
>>> {'label': '📷',
'probability': {'❤': 0.13197319209575653,
'😍': 0.11246423423290253,
'😂': 0.008415069431066513,
'💕': 0.04842926934361458,
'🔥': 0.014528146013617516,
'😊': 0.1509675830602646,
'😎': 0.08625403046607971,
'✨': 0.01616635173559189,
'💙': 0.07396604865789413,
'😘': 0.03033279813826084,
'📷': 0.16525287926197052,
'🇺🇸': 0.020336611196398735,
'☀': 0.00799981877207756,
'💜': 0.016111424192786217,
'😉': 0.012984540313482285,
'💯': 0.012557178735733032,
'😁': 0.031386848539114,
'🎄': 0.006829539313912392,
'📸': 0.04188741743564606,
'😜': 0.011156936176121235}}
# GET DATASET
dataset, label2id = tweetnlp.load_dataset('emoji')
- Emotion Recognition: Given a tweet, this task consists of associating it with its most appropriate emotion. As a reference dataset we use the SemEval 2018 task on Affect in Tweets (check the paper here). The latest multi-label model includes eleven emotion types.
import tweetnlp
# MULTI-LABEL MODEL
model = tweetnlp.load_model('emotion') # Or `model = tweetnlp.Emotion()`
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.') # Or `model.predict`
>>> {'label': 'joy'}
# Note: the probabilities of the multi-label model are the outputs of a sigmoid over an independent binary prediction of whether each emotion applies (positive) or not (negative).
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.', return_probability=True)
>>> {'label': 'joy',
'probability': {'anger': 0.00025800734874792397,
'anticipation': 0.0005329723935574293,
'disgust': 0.00026112011983059347,
'fear': 0.00027552215033210814,
'joy': 0.7721399068832397,
'love': 0.1806265264749527,
'optimism': 0.04208092764019966,
'pessimism': 0.00025325192837044597,
'sadness': 0.0006160663324408233,
'surprise': 0.0005619609728455544,
'trust': 0.002393839880824089}}
# SINGLE-LABEL MODEL
model = tweetnlp.load_model('emotion', multi_label=False) # Or `model = tweetnlp.Emotion(multi_label=False)`
model.emotion('I love swimming for the same reason I love meditating...the feeling of weightlessness.') # Or `model.predict`
>>> {'label': 'joy'}
# NOTE: the probabilities of the single-label model are the softmax over the labels.