ChatGPT outperforms crowd workers for text-annotation tasks

Edited by Mary Waters, Harvard University, Cambridge, MA; received March 27, 2023; accepted June 2, 2023

July 18, 2023

120 (30) e2305016120

Abstract

Many NLP applications require manual text annotations for a variety of tasks, notably to train classifiers or evaluate the performance of unsupervised models. Depending on the size and degree of complexity, the tasks may be conducted by crowd workers on platforms such as MTurk or by trained annotators, such as research assistants. Using four samples of tweets and news articles (n = 6,183), we show that ChatGPT outperforms crowd workers for several annotation tasks, including relevance, stance, topics, and frame detection. Across the four datasets, the zero-shot accuracy of ChatGPT exceeds that of crowd workers by about 25 percentage points on average, while ChatGPT’s intercoder agreement exceeds that of both crowd workers and trained annotators for all tasks. Moreover, the per-annotation cost of ChatGPT is less than $0.003—about thirty times cheaper than MTurk. These results demonstrate the potential of large language models to drastically increase the efficiency of text classification.
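The workflow behind these numbers is straightforward: each text is sent to the model together with the task instructions and the label set, with no labeled examples in the prompt (zero-shot), and the returned label is compared against human annotations. Below is a minimal sketch of such a pipeline, assuming the OpenAI Python client; the prompt wording, label set, model name, and temperature are illustrative placeholders, not the authors' exact settings, which are available in the replication materials (15).

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative zero-shot prompt for a stance task; the authors' exact
# prompts and label definitions are in the replication materials (15).
PROMPT = (
    "Classify the stance of the tweet below toward content moderation. "
    "Answer with exactly one word: positive, negative, or neutral.\n\n"
    "Tweet: {text}"
)

def annotate(text: str, model: str = "gpt-3.5-turbo") -> str:
    """Return a single zero-shot label (no labeled examples in the prompt)."""
    response = client.chat.completions.create(
        model=model,
        temperature=0.2,  # low temperature favors stable, repeatable labels
        messages=[{"role": "user", "content": PROMPT.format(text=text)}],
    )
    return response.choices[0].message.content.strip().lower()

def self_agreement(texts: list[str]) -> float:
    """Annotate every item twice; the share of matching pairs is one way
    to quantify intercoder agreement for a stochastic annotator."""
    pairs = [(annotate(t), annotate(t)) for t in texts]
    return sum(a == b for a, b in pairs) / len(pairs)
```

Running each item twice and counting identical labels, as in `self_agreement` above, mirrors how intercoder agreement can be measured for a model whose outputs are not fully deterministic; the abstract reports that ChatGPT's agreement exceeded that of both crowd workers and trained annotators on all tasks.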

Data, Materials, and Software Availability

Replication materials are available at the Harvard Dataverse, https://doi.org/10.7910/DVN/PQYF6M (15). Some study data are available (only tweet IDs can be shared, not tweets themselves).

Acknowledgments

This project received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement no. 883121). We thank Fabio Melliger, Paula Moser, and Sophie van IJzendoorn for excellent research assistance.

Author contributions

F.G., M.A., and M.K. designed research; performed research; analyzed data; and wrote the paper.

Competing interests

The authors declare no competing interest.

References

1. G. Emerson et al., Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022) (Association for Computational Linguistics, Seattle, 2022).

2. K. Benoit, D. Conway, B. E. Lauderdale, M. Laver, S. Mikhaylov, Crowd-sourced text analysis: Reproducible and agile production of political data. Am. Polit. Sci. Rev. 110, 278–295 (2016).

3. M. Chmielewski, S. C. Kucker, An MTurk crisis? Shifts in data quality and the impact on study results. Soc. Psychol. Personality Sci. 11, 464–473 (2020).

4. P. Y. Wu, J. A. Tucker, J. Nagler, S. Messing, Large Language Models Can Be Used to Estimate the Ideologies of Politicians in a Zero-Shot Learning Setting (2023).

5. J. J. Nay, Large Language Models as Corporate Lobbyists (2023).

6. M. Binz, E. Schulz, Using cognitive psychology to understand GPT-3. Proc. Natl. Acad. Sci. U.S.A. 120, e2218523120 (2023).

7. L. P. Argyle et al., Out of one, many: Using language models to simulate human samples. Polit. Anal. 1–15 (2023).

8. T. Kuzman, I. Mozetič, N. Ljubešić, ChatGPT: Beginning of an end of manual linguistic data annotation? Use case of automatic genre identification. arXiv [Preprint] (2023). http://arxiv.org/abs/2303.03953 (Accessed 13 March 2023).

9. F. Huang, H. Kwak, J. An, Is ChatGPT better than human annotators? Potential and limitations of ChatGPT in explaining implicit hate speech. arXiv [Preprint] (2023). http://arxiv.org/abs/2302.07736 (Accessed 13 March 2023).

10. M. Alizadeh et al., Content moderation as a political issue: The Twitter discourse around Trump’s ban. J. Quant. Descr.: Digital Media 2, 1–44 (2022).

11. P. S. Bayerl, K. I. Paul, What determines inter-coder agreement in manual annotations? A meta-analytic investigation. Comput. Linguist. 37, 699–725 (2011).

12. M. Desmond, E. Duesterwald, K. Brimijoin, M. Brachman, Q. Pan, “Semi-automated data labeling” in NeurIPS 2020 Competition and Demonstration Track (PMLR, 2021), pp. 156–169.

13. T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa, Large language models are zero-shot reasoners. arXiv [Preprint] (2022). http://arxiv.org/abs/2205.11916 (Accessed 13 March 2023).

14. D. Card, A. Boydstun, J. H. Gross, P. Resnik, N. A. Smith, “The media frames corpus: Annotations of frames across issues” in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers) (2015), pp. 438–444.

15. F. Gilardi, M. Alizadeh, M. Kubli, Replication Data for: ChatGPT outperforms crowd-workers for text-annotation tasks. Harvard Dataverse. https://doi.org/10.7910/DVN/PQYF6M. Deposited 16 June 2023.

Information & Authors

Published in

Proceedings of the National Academy of Sciences, Vol. 120, No. 30, July 25, 2023

Submission history

Received: March 27, 2023

Accepted: June 2, 2023

Published online: July 18, 2023

Published in issue: July 25, 2023

Keywords

ChatGPT | text classification | large language models | human annotations | text as data

Authors

F. Gilardi, M. Alizadeh, and M. Kubli

Affiliations

Department of Political Science, University of Zurich, Zurich 8050, Switzerland (all authors)
