Automated Personality Prediction Using Pre-Trained Language Models

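This repository contains the code for the paper "Bottom-up and top-down: Predicting personality with psycholinguistic and language model features", published at the 2020 IEEE International Conference on Data Mining (ICDM).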

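It holds a set of experiments, written in TensorFlow and PyTorch, that explore automated personality detection with language models on the Essays dataset (labelled with Big Five personality traits) and the Kaggle MBTI dataset.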

Setup

Pull the repository from GitHub, then create a new virtual environment (conda or venv):

git clone https://github.com/yashsmehta/personality-prediction.git
cd personality-prediction
conda create -n mvenv python=3.10
conda activate mvenv

Install Poetry and use it to install the dependencies needed to run the project:

curl -sSL https://install.python-poetry.org | python3 -
poetry install

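Usage

First, run the LM extractor, which passes the dataset through the language model and stores the embeddings (from all layers) in a pickle file. Creating this "new dataset" saves a great deal of computation time and makes it practical to search over the hyperparameters of the fine-tuning network. Before running the code, create a pkl_data folder inside the repository folder. All arguments are optional; with no arguments, the extractor runs with its default values.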

python LM_extractor.py -dataset_type 'essays' -token_length 512 -batch_size 32 -embed 'bert-base' -op_dir 'pkl_data'
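
The caching idea behind this step can be sketched in a few lines. This is not the code in LM_extractor.py: `fake_embed`, `cache_features`, `load_features` and the file name `essays-demo.pkl` are hypothetical stand-ins, and the "embedding" is a toy vector so that the example runs without a GPU or a model download. It only illustrates the pattern: compute the features once, pickle them, and reload them for every hyperparameter setting.

```python
import os
import pickle
import tempfile


def fake_embed(text, n_layers=3, dim=4):
    """Hypothetical stand-in for a language model: one toy vector per layer."""
    base = [float(len(text) % (i + 2)) for i in range(dim)]
    return [[v + layer for v in base] for layer in range(n_layers)]


def cache_features(texts, labels, path):
    """Embed every text once and store features and labels in a pickle file."""
    features = [fake_embed(t) for t in texts]
    with open(path, "wb") as fh:
        pickle.dump({"features": features, "labels": labels}, fh)


def load_features(path):
    """Reload the cached dataset; no language model is needed at this point."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


with tempfile.TemporaryDirectory() as repo_dir:
    op_dir = os.path.join(repo_dir, "pkl_data")
    os.makedirs(op_dir, exist_ok=True)  # the folder must exist before writing
    pkl_path = os.path.join(op_dir, "essays-demo.pkl")  # hypothetical file name

    cache_features(["I love hiking.", "I prefer quiet evenings."], [1, 0], pkl_path)

    # Each hyperparameter setting reuses the same cached features.
    for hidden_size in (16, 32):
        data = load_features(pkl_path)
        n_layers = len(data["features"][0])
        print(f"hidden_size={hidden_size}: {len(data['features'])} texts, {n_layers} layers each")
```

Because the expensive step happens once, the loop at the end only pays the cost of reading the pickle file.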

Next, run a fine-tuning model that takes the extracted features from the pickle file as input and trains on them. We found that a shallow MLP performs best.

python finetune_models/MLP_LM.py
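
To make "shallow MLP on extracted features" concrete, here is a minimal forward pass in plain Python. It is a sketch under stated assumptions, not the architecture in finetune_models/MLP_LM.py: the layer sizes, the random untrained weights, and the choice of five sigmoid outputs (one per OCEAN trait) are all illustrative, and the input vector merely stands in for one cached embedding.

```python
import math
import random

TRAITS = ["O", "C", "E", "A", "N"]  # Big Five (OCEAN)


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, weights, biases):
    """Affine map: one output per row of the weight matrix."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def mlp_forward(x, hidden, output):
    """One hidden layer with ReLU, then a sigmoid per trait."""
    h = [max(0.0, z) for z in dense(x, *hidden)]
    logits = dense(h, *output)
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


rng = random.Random(0)
embedding_dim, hidden_dim = 8, 4  # toy sizes chosen for this sketch only
hidden = init_layer(embedding_dim, hidden_dim, rng)
output = init_layer(hidden_dim, len(TRAITS), rng)

features = [rng.random() for _ in range(embedding_dim)]  # stands in for one cached embedding
probs = mlp_forward(features, hidden, output)
print(dict(zip(TRAITS, (round(p, 3) for p in probs))))
```

The weights here are untrained, so the printed values carry no meaning; the point is only the shape of the computation from a feature vector to per-trait scores.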

Results table: language models vs. psycholinguistic features

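Predicting personality on unseen text

To predict personality (e.g., the Big Five OCEAN traits) for a new text or essay, first train the fine-tuning model and save it: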

python finetune_models/MLP_LM.py -save_model 'yes'

Then use the following script to predict on unseen text:

python unseen_predictor.py

Running time

LM_extractor.py

On an RTX 2080 GPU, the extractor takes about 2 min 30 s with -embed 'bert-base' and about 5 min 30 s with 'bert-large'.

On a CPU, the 'bert-base' extractor takes about 25 minutes.

finetune_models/MLP_LM.py

On an RTX 2080 GPU, running 15 epochs (without cross-validation) takes between 5 s and 60 s, depending on the MLP architecture.
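
Timings like these depend heavily on hardware, so it can be useful to measure them locally. The sketch below is one way to do that; `extractor_command` and `time_command` are hypothetical helpers, not part of the repository, and the command they build uses only the flags shown in the usage section above.

```python
import subprocess
import sys
import time


def extractor_command(embed, dataset_type="essays", token_length=512,
                      batch_size=32, op_dir="pkl_data"):
    """Build the LM_extractor.py call using only the flags shown in this README."""
    return [
        sys.executable, "LM_extractor.py",
        "-dataset_type", dataset_type,
        "-token_length", str(token_length),
        "-batch_size", str(batch_size),
        "-embed", embed,
        "-op_dir", op_dir,
    ]


def time_command(cmd):
    """Run a command to completion and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Demo with a trivial command so the sketch runs anywhere; inside the
    # repository, pass extractor_command("bert-base") to time_command instead.
    elapsed = time_command([sys.executable, "-c", "pass"])
    print(f"elapsed: {elapsed:.2f} s")
    print(" ".join(extractor_command("bert-base")))
```

Run from the repository root, `time_command(extractor_command("bert-large"))` would time the larger model in the same way.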

Literature

Deep Learning Based Personality Prediction [Literature Review] (Springer AIR Journal, 2020)

@article{mehta2020recent,
  title={Recent Trends in Deep Learning Based Personality Detection},
  author={Mehta, Yash and Majumder, Navonil and Gelbukh, Alexander and Cambria, Erik},
  journal={Artificial Intelligence Review},
  pages={2313--2339},
  year={2020},
  doi={10.1007/s10462-019-09770-z},
  url={https://link.springer.com/article/10.1007/s10462-019-09770-z},
  publisher={Springer}
}

Language Model Based Personality Prediction (ICDM, 2020)

If you find this repository useful for your research, please cite it using the following:

@inproceedings{mehta2020bottom,
  title={Bottom-up and top-down: Predicting personality with psycholinguistic and language model features},
  author={Mehta, Yash and Fatehi, Samin and Kazameini, Amirmohammad and Stachl, Clemens and Cambria, Erik and Eetemadi, Sauleh},
  booktitle={2020 IEEE International Conference on Data Mining (ICDM)},
  pages={1184--1189},
  year={2020},
  organization={IEEE}
}

License

The source code for this project is licensed under the MIT license.