Using Sequences of Life-events to Predict Human Lives

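This repository contains the code for the paper Using Sequences of Life-events to Predict Human Lives (life2vec). The project has a single official web page (life2vec.dk); we do not have any dedicated Facebook, Twitter, or other social media accounts. See the FAQ for more information.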

Basic implementation of life2vec

We will keep this repository as it is. We are publishing some components of the life2vec model in separate repositories:

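The basic implementation of the model is published at carlomarxdk/life2vec-light; it includes the code to run pretraining with dummy data,
The class distance weighted cross-entropy loss is published at carlomarxdk/cdw-cross-entropy-loss; this loss function is used in the extraversion trait prediction task.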
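
To give a feel for what a class distance weighted loss does, here is a minimal pure-Python sketch for a single sample. It assumes the commonly cited formulation, in which the probability assigned to each wrong class is penalised in proportion to that class's distance from the true class raised to a power alpha. The function name, the default alpha, and the example probabilities are illustrative assumptions; the code in carlomarxdk/cdw-cross-entropy-loss may differ in detail.

```python
import math


def cdw_cross_entropy(probs, target, alpha=2.0, eps=1e-12):
    """Class-distance-weighted cross-entropy for one sample (illustrative sketch).

    probs  : predicted probabilities over ordered classes
    target : index of the true class
    alpha  : exponent controlling how strongly distant mistakes are penalised
    eps    : floor that keeps log() finite when a probability equals 1
    """
    loss = 0.0
    for i, p in enumerate(probs):
        distance = abs(i - target) ** alpha  # zero for the true class
        loss -= math.log(max(1.0 - p, eps)) * distance
    return loss


# Same confidence in a wrong class, but at different distances from the true class (index 2).
near_miss = cdw_cross_entropy([0.1, 0.7, 0.1, 0.1], target=2)
far_miss = cdw_cross_entropy([0.7, 0.1, 0.1, 0.1], target=2)
print(near_miss < far_miss)
```

Under this formulation, a confident prediction two classes away from the truth costs more than an equally confident prediction one class away, which suits ordinal targets such as trait scores.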

Source Code

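This repository contains scripts and several notebooks for data processing, life2vec training, statistical analysis, and visualization. The model weights, experiment logs, and associated model outputs can be obtained in accordance with the rules of Statistics Denmark's Research Scheme.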

Paths (for example, paths to data or model weights) were redacted before the scripts were committed to GitHub.

Overall Structure

We use Hydra to run the experiments. The /conf folder contains the experiment configuration:

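/experiment contains the configuration yaml files for pretraining and finetuning,
/tasks contains the specification of the data augmentation used in MLM, SOP, etc.,
/trainer contains the configuration for logging (not used) and multithreaded training (not used),
/data_new contains the configuration for data loading and processing,
/datamodule contains the configuration that specifies how data is loaded into PyTorch and PyTorch Lightning,
callbacks.yaml specifies the configuration of the PyTorch Lightning callbacks,
prepare_data.yaml can be used to run data preprocessing.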
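
As an illustration of the kind of augmentation an MLM (masked language modelling) task specification controls, the sketch below masks a sequence of event tokens in the style popularised by BERT. The masking probability, the 80/10/10 split, and the token names are assumptions made for this example and are not taken from the /tasks configuration.

```python
import random

MASK = "[MASK]"
# Hypothetical vocabulary of life-event tokens, used only for random replacement.
VOCAB = ["JOB_TEACHER", "INCOME_HIGH", "DIAGNOSIS_A", "CITY_AARHUS"]


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (inputs, labels) for masked language modelling.

    labels[i] holds the original token at selected positions and None elsewhere,
    so the loss is computed only where the model has something to recover.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)               # most selected tokens are hidden
            elif roll < 0.9:
                inputs.append(rng.choice(VOCAB))  # some are swapped for a random token
            else:
                inputs.append(tok)                # the rest stay visible but are still scored
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


inputs, labels = mask_tokens(["JOB_TEACHER", "INCOME_HIGH", "CITY_AARHUS"], mask_prob=0.5)
print(inputs, labels)
```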

The /analysis folder contains ipynb notebooks for post-hoc evaluation:

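/embedding contains the analysis of the embedding spaces,
/metric contains notebooks for model evaluation,
/visualisation contains notebooks for visualising the spaces,
/tcav contains the TCAV implementation,
/optimization contains the hyperparameter tuning.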

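The source folder, /src, contains the data loading and model training code. Due to the specifics of the hydra package, we provide an overview of the /src folder: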

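/src/data_new contains scripts that preprocess the data and prepare it for loading into PyTorch or PyTorch Lightning,
/src/models contains the implementation of the baseline models,
/src/tasks contains task-specific code, such as MLM, SOP, mortality prediction, emigration prediction, etc.,
/src/tranformer contains the implementation of the life2vec model:
1. In performer.py, we override the functionality of the performer-pytorch package,
2. In cls_model.py, we implement the finetuning stage for the binary classification tasks (i.e. early mortality and emigration),
3. In hexaco_model.py, we implement the finetuning stage for the personality nuances prediction task,
4. models.py contains the code for life2vec pretraining (i.e. the base life2vec model),
5. transformer_utils.py contains the implementation of custom modules, such as losses, activation functions, etc.,
6. metrics.py contains the code for the custom metrics,
7. modules.py, attention.py, att_utils.py, and embeddings.py contain the implementation of the modules used in the transformer network (i.e. the life2vec encoder).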
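
The SOP task listed under /src/tasks can be pictured as a small classification problem over event order. The sketch below assumes a three-class variant (original, reversed, shuffled) with hypothetical sampling probabilities and event names; it illustrates the idea only and is not the code in /src/tasks.

```python
import random

ORIGINAL, REVERSED, SHUFFLED = 0, 1, 2


def make_sop_example(events, rng, p_reverse=0.1, p_shuffle=0.1):
    """Return (possibly reordered events, order label) for one sequence."""
    roll = rng.random()
    if roll < p_reverse:
        return list(reversed(events)), REVERSED
    if roll < p_reverse + p_shuffle:
        shuffled = list(events)
        rng.shuffle(shuffled)  # may, by chance, reproduce the original order
        return shuffled, SHUFFLED
    return list(events), ORIGINAL


rng = random.Random(0)
events = ["birth", "school", "first_job", "move"]  # hypothetical event names
sequence, label = make_sop_example(events, rng)
print(sequence, label)
```

During pretraining, the model would receive the returned sequence and be asked to predict the label, which pushes it to encode temporal order.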

Scripts such as train.py, test.py, tune.py, and val.py run specific stages of training, while prepare_data.py runs the data processing (see the examples below).

Running the scripts

To run the code, you can use the following commands:

# run the pretraining:
HYDRA_FULL_ERROR=1 python -m src.train experiment=pretrain trainer.devices=[7]

# tune the hyperparameters (for the pretraining)
HYDRA_FULL_ERROR=1 python -m src.train experiment=pretrain_optim

# assemble the general dataset (GLOBAL_SET)
HYDRA_FULL_ERROR=1 python -m src.prepare_data +data_new/corpus=global_set target=\${data_new.corpus}

# assemble the dataset for the mortality prediction task (SURVIVAL_SET)
HYDRA_FULL_ERROR=1 python -m src.prepare_data +data_new/population=survival_set target=\${data_new.population}

# assemble the labour source
python -m src.prepare_data +data_new/sources=labour target=\${data_new.sources}

# run the emigration finetuning
HYDRA_FULL_ERROR=1 python -m src.train experiment=emm trainer.devices=[0] version=0.01
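
In the commands above, arguments of the form key=value override configuration entries, and the backslash in \${...} stops the shell from expanding the expression, so the literal ${...} reaches Hydra and is resolved there as a reference to another configuration entry. The sketch below imitates that behaviour on a plain dictionary to show the idea. It is a simplified illustration, not Hydra's parser: it keeps every value as a string, ignores config-group selection (such as experiment=pretrain or the + prefix), and uses hypothetical configuration contents in which a plain string stands in for what may be a nested config.

```python
import re


def apply_overrides(config, overrides):
    """Apply 'a.b=value' overrides to a nested dict, then resolve '${a.b}' references."""
    for item in overrides:
        key, value = item.split("=", 1)
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            node = node.setdefault(part, {})  # create intermediate levels on demand
        node[leaf] = value

    def lookup(path):
        node = config
        for part in path.split("."):
            node = node[part]
        return node

    def resolve(node):
        for k, v in node.items():
            if isinstance(v, dict):
                resolve(v)
            elif isinstance(v, str):
                match = re.fullmatch(r"\$\{([\w.]+)\}", v)
                if match:
                    node[k] = lookup(match.group(1))  # replace the reference by its target

    resolve(config)
    return config


# Hypothetical starting configuration.
cfg = {"trainer": {"devices": "[0]"}, "data_new": {"corpus": "global_set"}}
cfg = apply_overrides(cfg, ["trainer.devices=[7]", "target=${data_new.corpus}", "version=0.01"])
print(cfg)
```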

Other code contributors

Søren Mørk Hartmann.

How to cite

Nature Computational Science

@article{savcisens2024using,
      author={Savcisens, Germans and Eliassi-Rad, Tina and Hansen, Lars Kai and Mortensen, Laust Hvas and Lilleholt, Lau and Rogers, Anna and Zettler, Ingo and Lehmann, Sune},
      title={Using sequences of life-events to predict human lives},
      journal={Nature Computational Science},
      year={2024},
      month={Jan},
      day={01},
      volume={4},
      number={1},
      pages={43-56},
      issn={2662-8457},
      doi={10.1038/s43588-023-00573-5},
      url={https://doi.org/10.1038/s43588-023-00573-5}
}

ArXiv preprint

@article{savcisens2023using,
  title={Using Sequences of Life-events to Predict Human Lives},
  DOI = {arXiv:2306.03009},
  author={Savcisens, Germans and Eliassi-Rad, Tina and Hansen, Lars Kai and Mortensen, Laust and Lilleholt, Lau and Rogers, Anna and Zettler, Ingo and Lehmann, Sune},
  year={2023}
}

Code

@misc{life2vec_code,
  author = {Germans Savcisens},
  title = {Official code for the "Using Sequences of Life-events to Predict Human Lives" paper},
  note = {GitHub: SocialComplexityLab/life2vec},
  year = {2023},
  howpublished = {\url{https://doi.org/10.5281/zenodo.10118621}},
}