Junzaburo Nakagawa and Hiroki Nomoto
Tokyo University of Foreign Studies
{nakagawa.junzaburo.v0, nomoto}@tufs.ac.jp
This is the English translation (by DeepL and ChatGPT) of the following paper: Nakagawa, Junzaburo and Hiroki Nomoto. 2025. ChatGPT ga kangaeru nihongo jooku no omoshirosa: Ningen tono hikaku [The humor of Japanese jokes according to ChatGPT: A comparison with humans]. Proceedings of the Thirty-First Annual Meeting of the Association for Natural Language Processing, 553-558.

BibTeX:

@InProceedings{NakagawaNomoto25,
  author    = {Nakagawa, Junzaburo and Nomoto, Hiroki},
  year      = {2025},
  title     = {{ChatGPT} ga kangaeru nihongo jooku no omoshirosa: Ningen tono hikaku},
  booktitle = {Proceedings of the Thirty-First Annual Meeting of the {A}ssociation for {N}atural {L}anguage {P}rocessing},
  pages     = {553--558},
  note      = {The humor of {J}apanese jokes according to {ChatGPT}: A comparison with humans},
  url       = {https://www.anlp.jp/proceedings/annual_meeting/2025/pdf_dir/D2-5.pdf}
}
This paper compares how ChatGPT (GPT-4o) and humans rate Japanese jokes. We examined the ratings of 18 jokes (all but one in dialogue form), nine created by humans and nine generated by ChatGPT, in terms of 'funniness', 'offensiveness' and 'intelligibility'. The results showed that ChatGPT was more lenient than humans on 'funniness' and 'intelligibility' but stricter on 'offensiveness', and that, contrary to a similar study on English jokes, ChatGPT-generated Japanese jokes were rated lower than human-made ones. We argue that this is due to a level of objectivity not found in humans and to an insufficient ability to compute meanings compositionally.
In recent years, various types of generative AI have emerged and continue to evolve at a rapid pace. Although generative AIs have demonstrated capabilities equal or superior to humans in machine translation, text summarisation, etc., they are also known to have weaknesses in some areas. For example, a lack of emotional understanding and empathy, a lack of understanding of cultural backgrounds, and limited creativity have been pointed out for ChatGPT (Abujaber et al. 2023; Kalla et al. 2023). Understanding the humour of jokes is a task likely to be affected by such weaknesses. When people hear a joke in a dialogue, they have reactions such as finding it 'funny' or 'offensive', and the strength of these reactions depends on the content of the joke. To what extent do generative AIs share these human reactions? In this paper, we ask ChatGPT (GPT-4o) and humans to rate 18 Japanese jokes in terms of 'funniness', 'offensiveness' and 'intelligibility', and compare the results. Although there are studies on generative AI's understanding of English jokes, to our knowledge there are no such studies on Japanese jokes.
Jentzsch and Kersting (2023) tested ChatGPT's competence with jokes by having it generate and explain English jokes. On the generation side, they found that more than 90% of the jokes produced were variants of just 25 jokes, with the top four accounting for more than 50% of the total. The jokes take the form of self-answering questions: 'Why...? Because...'. On the explanation side, ChatGPT provided valid explanations for almost all the jokes.
In Gorenz and Schwarz (2024), the joke-generating abilities of ChatGPT and humans were compared by asking 200 crowdsourced Americans to rate ChatGPT-generated and human-made jokes in English. The results showed that ChatGPT-generated jokes were rated higher than human-made jokes, despite the fact that ChatGPT is emotionless.
As a language resource for jokes, there is an English dataset called UR-FUNNY (Hasan et al. 2019), a multimodal dataset consisting of text, audio and video that contains 8,257 funny and non-funny examples. To the best of our knowledge, no comparable dataset exists for Japanese yet.
In this study, a total of 18 Japanese jokes (9 jokes made by humans and 9 jokes made by ChatGPT (GPT-4o)) were rated by ChatGPT itself and by humans.
Table 1 summarises the Japanese jokes studied, where (a), (b) and (c) in the IDs indicate the categories described below.
ID | Joke |
---|---|
H1 (a) | 「幸せなら手を叩こうっていうけどさ、不幸せな奴は何を叩けばいいの?」 「...与党...」 ('They say "If you’re happy, clap your hands," but what should an unhappy person clap?' '...The ruling party...') |
H2 (a) | 「もしもし,健太くん?あの,お父さんいる?」 「いらなーい!」 ('Hello, Kenta? Is your dad there?' 'I don’t need him!') (Tanaka 2017) |
H3 (a) | 医者: 「ううむ,スミスさん.あなたは妊娠しているみたいですね」 スミス夫人: 「ああ神様,なんて素晴らしい.私,妊娠したんですね?」 医者: 「妊娠しているように見えると言ったんです.減量しましょう」 (Doctor: 'Hmm, Mrs. Smith. It seems you’re pregnant.' Mrs. Smith: 'Oh my God, that’s wonderful! I’m pregnant?' Doctor: 'I said you *look* pregnant. You need to lose weight.') (JAPAN JOURNALS 2020) |
H4 (b) | 先輩: 「新人くん,今日は腹を割って話そうじゃないか」 新人: 「切腹しろってことですか?」 (Senior: 'Hey, newbie, let’s have an open conversation today.' Newbie: 'Are you telling me to commit seppuku?') |
H5 (b) | 「今日は先輩の引退だし,花持たせなきゃなぁ」 「え!?すみません,今すぐ花束買ってきます!」 ('Today is our senior’s retirement match, so we should let him shine.' 'Huh?! Sorry, I’ll go buy a bouquet right away!') |
H6 (b) | 妻: 「あなたがこんなに貧乏だって分かっていたら絶対結婚しなかったわ」 夫: 「君は僕の全てだって結婚前に何度も言ったじゃないか!」 (Wife: 'If I had known you were this poor, I would never have married you.' Husband: 'But you kept saying I was your everything before we got married!') (JAPAN JOURNALS 2020) |
H7 (b) | 妻は僕に「あなたは100万人に1人の男よ」と言った. ある日,妻のスマホを覗き見したら確かにその通りだった. (My wife told me, 'You're one in a million.' One day, I peeked at her phone, and she was right.) (JAPAN JOURNALS 2020) |
H8 (b) | 「一日一個のりんごは医者を遠ざけるっていうけど,本当かな?」 「ああ,よく狙って投げればね」 ('They say an apple a day keeps the doctor away, but is that really true?' 'Yeah, if you aim well and throw it at them.') (JAPAN JOURNALS 2020) |
H9 (c) | 医師: 「旦那さんは絶対安静が必要です.ここに睡眠薬がありますので,飲んでください」 妻: 「いつ旦那に飲ませたらいいですか?」 医師: 「いいえ,あなたがこれを飲むんです」 (Doctor: 'Your husband needs complete rest. Here, I’m prescribing sleeping pills.' Wife: 'When should I give them to him?' Doctor: 'No, you should take them.') (Nakano 2002) |
C1 (a) | 「ねぇ,カフェで働いてる友達が辞めたらしいよ。」 「なんで?」 「もう我慢の限界だったんだってさ。毎日,マメに働くのが疲れたんだって!」 ("Hey, I heard my friend who worked at a café quit." "Why?" "He said he couldn't take it anymore. He got tired of working so ‘bean’-tifully every day!") |
C2 (a) | 「最近,友達がパン屋を辞めたんだって。」 「どうして?」 「仕事がつらすぎて,もうパンパンだったらしい。」[2] ("Recently, my friend quit working at a bakery." "Why?" "The job was so tough, he said he was completely ‘bread’ out!") |
C3 (b) | 「最近,漁師の友達が仕事を辞めたんだって。」 「どうして?」 「上司がいつも『足を引っ張るな』って言うけど,カニを獲る仕事だから無理だってさ!」 ("Recently, my fisherman friend quit his job." "Why?" "His boss always told him, ‘Don’t pull others down,’ but since he was catching crabs, that was impossible!") |
C4 (b) | 「昨日,友達が急に道で倒れたんだ!」 「え,大丈夫だったの?」 「うん,本人いわく『足元をすくわれた』らしいけど,転んだ原因はバナナの皮だったよ。」 ("Yesterday, my friend suddenly collapsed on the street!" "Oh no, was he okay?" "Yeah, he said ‘I got tripped up,’ but the real cause was a banana peel.") |
C5 (b) | 「昨日,友達が『頭が上がらない』って言うから何があったのか聞いたんだ。」 「それで,何があったの?」 「ただ単に首を寝違えただけだったよ!」 ("Yesterday, my friend said, ‘I can't lift my head,’ so I asked him what happened." "So, what was it?" "Turns out he just had a stiff neck!") |
C6 (b) | 「昨日,友達が急に電話で泣き出したんだ。」 「どうしたの?」 「本人いわく,『胸が張り裂けそう』だったらしい。でもよく聞いたら,ボタンシャツのボタンが全部飛んだだけだったよ。」 ("Yesterday, my friend suddenly started crying on the phone." "What happened?" "He said, ‘My chest feels like it’s about to burst.’ But when I asked, it turned out all the buttons on his shirt had popped off.") |
C7 (b) | 「医者の友達が『目が回るほど忙しい』って言ってたんだ。」 「大変そうだね。それでどうなったの?」 「次に会ったら,自分で三半規管の検査してたよ。」 ("My doctor friend said, ‘I’m so busy my head is spinning.’" "That sounds rough. What happened?" "When I saw him next, he was testing his own inner ear balance.") |
C8 (b) | 「この前,友達が『猫の手も借りたい』って言うから,ペットショップで猫を借りてきたんだ。」 「で,どうなったの?」 「部屋が毛だらけになっただけで,全然役に立たなかったよ。」 ("The other day, my friend said, ‘I’d take even a cat’s help right now.’ So, I borrowed a cat from the pet shop." "So what happened?" "My room just got covered in fur—it didn’t help at all.") |
C9 (b) | 「この前,友達が『犬も歩けば棒に当たる』って言うから,試しに犬の散歩をしてみたんだ。」 「どうだった?」 「棒どころか,電柱に当たったよ。こっちがね。」 ("The other day, my friend said, ‘Even a dog that walks will bump into a stick.’ So, I tried walking a dog." "So what happened?" "Instead of a stick, I bumped into a telephone pole—me, not the dog.") |
All jokes except H7 are in dialogue form. H1–H9 were created by humans, whereas C1–C9 were generated by ChatGPT. H1 was obtained from a video posted on YouTube Shorts on 13 November 2024 by a YouTuber called Risoukyou Purojekuto [Utopia Project].[3] H3 and H6–H9 are English jokes translated into Japanese (H9 is the author's translation), H4 and H5 are the author's own creations, and C1–C9 were generated with the prompt shown in Figure 1.
評価基準 面白さ:「1非常につまらない」「2ややつまらない」「3どちらでもない」「4やや面白い」「5とても面白い」 不快さ:「1全く不快ではない」「2あまり不快ではない」「3どちらでもない」「4やや不快」「5非常に不快」 わかりやすさ:「1非常にわかりやすい」「2ややわかりやすい」「3どちらでもない」「4ややわかりづらい」「5非常にわかりづらい」 この採点基準で満点を取れるようなジョークを考えてください.なお,そのジョークは単語の多義性・慣用表現・ステレオタイプを利用したものでお願いします.
English translation: Evaluation criteria Funniness: '1 very boring', '2 somewhat boring', '3 neither', '4 somewhat funny', '5 very funny' Offensiveness: '1 not at all offensive', '2 not very offensive', '3 neither', '4 somewhat offensive', '5 very offensive' Intelligibility: '1 very easy to understand', '2 somewhat easy to understand', '3 neither', '4 somewhat difficult to understand', '5 very difficult to understand' Create a joke that would get full marks on these scoring criteria. The joke should make use of word polysemy, idiomatic expressions and stereotypes.
(a) Polysemy: the shared assumption that both speaker and listener are using a polysemous word with sense s1 is overturned by an utterance that unexpectedly uses it with a different sense s2.[4] For example, in H1, tataku 'hit' is initially used to mean 'clap', but its next occurrence is used to mean 'criticise'.
(b) Idiomatic expression: the shared assumption that both speaker and listener are using an idiomatic expression in its idiomatic meaning is overturned by an utterance that unexpectedly uses it in its literal meaning. For example, in H4, the senior uses the idiomatic expression hara wo watte hanasu in its idiomatic meaning 'to confide one's true feelings', while the newbie takes it in its literal meaning ('to cut one's belly open and talk').
For the evaluation of the jokes by ChatGPT, prompts of the form shown in Figure 2 were used. C1–C9 were generated by ChatGPT itself under the instruction to produce jokes that would get a perfect score, and these jokes were then evaluated again by ChatGPT.
[ジョーク] 上記のジョークを,「面白さ」「不快さ」「わかりやすさ」の3つの観点において,評価をお願いします.またジョークのどこが面白いポイントとなっているのかの解説もお願いします. (図1に太字で示した評価基準)
English translation: [Joke] Please rate the above joke in terms of the three criteria 'funniness', 'offensiveness' and 'intelligibility'. Please also explain what makes the joke funny. (Evaluation criteria shown in bold in Figure 1)
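As a rough illustration of this evaluation setup, the sketch below assembles a Figure 2-style prompt for one joke and parses the three 1–5 scores out of a model reply. The function names, exact prompt wording and parsing regex are our own assumptions, not the authors' code, and the actual API call is omitted.

```python
import re

# The three criteria from Figure 1. build_eval_prompt() and parse_scores()
# are illustrative assumptions; the paper does not publish its code.
CRITERIA = ["funniness", "offensiveness", "intelligibility"]

def build_eval_prompt(joke: str) -> str:
    """Assemble a Figure 2-style evaluation prompt for one joke."""
    return (
        f"{joke}\n"
        "Please rate the above joke in terms of 'funniness', "
        "'offensiveness' and 'intelligibility' on a 1-5 scale, "
        "and explain what makes the joke funny.\n"
        "Give each score on its own line, e.g. 'funniness: 4'."
    )

def parse_scores(reply: str) -> dict:
    """Extract the three 1-5 scores from a model reply."""
    scores = {}
    for criterion in CRITERIA:
        m = re.search(rf"{criterion}\s*[:：]\s*([1-5])", reply, re.IGNORECASE)
        if m:
            scores[criterion] = int(m.group(1))
    return scores

# A hypothetical reply, standing in for an actual GPT-4o response.
reply = "Funniness: 4\nOffensiveness: 1\nIntelligibility: 5\nExplanation: ..."
print(parse_scores(reply))  # {'funniness': 4, 'offensiveness': 1, 'intelligibility': 5}
```

In practice the prompt would be sent to GPT-4o and the reply parsed as above; free-text replies may of course deviate from the requested format, so a real pipeline would need manual checking.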
Humans scored the jokes on the criteria shown in bold in Figure 1, via a Google Forms questionnaire. The respondents were 31 university students in their 20s.
Tables 2 and 3 summarise the ratings for the Japanese jokes considered by the humans and ChatGPT respectively. The human ratings are shown as averages (see Appendix for more detailed statistics). Values in cells where there is a difference of more than one point between ChatGPT and human ratings are highlighted in bold.
ID | Funniness (ChatGPT) | Funniness (Human) | Offensiveness (ChatGPT) | Offensiveness (Human) | Intelligibility (ChatGPT) | Intelligibility (Human) |
---|---|---|---|---|---|---|
H1 (a) | 4 | 3.84 | 3 | 1.81 | 5 | 3.45 |
H2 (a) | 4 | 3.29 | 1 | 2.26 | 5 | 3.84 |
H3 (a) | 4 | 3.45 | 2 | 2.13 | 5 | 4.13 |
H4 (b) | 4 | 2.61 | 1 | 1.65 | 5 | 4.19 |
H5 (b) | 4 | 2.71 | 3 | 1.58 | 5 | 4.10 |
H6 (b) | 4 | 3.06 | 2 | 1.71 | 5 | 2.94 |
H7 (b) | 4 | 2.52 | 3 | 1.68 | 5 | 2.16 |
H8 (b) | 5 | 3.29 | 1 | 1.55 | 5 | 3.26 |
H9 (c) | 4 | 3.19 | 2 | 1.97 | 5 | 2.74 |
ID | Funniness (ChatGPT) | Funniness (Human) | Offensiveness (ChatGPT) | Offensiveness (Human) | Intelligibility (ChatGPT) | Intelligibility (Human) |
---|---|---|---|---|---|---|
C1 (a) | 5 | 3.10 | 1 | 1.35 | 5 | 3.97 |
C2 (a) | 5 | 2.39 | 1 | 1.48 | 5 | 4.10 |
C3 (b) | 5 | 2.51 | 1 | 1.45 | 5 | 2.84 |
C4 (b) | 5 | 2.03 | 1 | 1.39 | 5 | 2.77 |
C5 (b) | 5 | 2.19 | 1 | 1.48 | 5 | 3.42 |
C6 (b) | 5 | 2.42 | 1 | 1.65 | 5 | 3.48 |
C7 (b) | 5 | 2.61 | 1 | 1.35 | 5 | 3.61 |
C8 (b) | 5 | 2.06 | 1 | 1.42 | 5 | 4.10 |
C9 (b) | 5 | 2.77 | 1 | 1.39 | 5 | 3.23 |
Firstly, ChatGPT scored every joke higher than humans did on 'funniness' and 'intelligibility'; on 'intelligibility' it gave a perfect 5 to all the jokes. On the other hand, ChatGPT scored 'offensiveness' more than one point higher than humans for three of the human-made jokes (H1, H5 and H7). In other words, ChatGPT is more lenient than humans on 'funniness' and 'intelligibility' but stricter on 'offensiveness'. One reason for ChatGPT's high 'intelligibility' ratings is objectivity. Normally, once humans associate one meaning (or structure) with a linguistic form, they are unable to pay attention to the other meanings that form has; this is the same psychological effect that occurs when looking at a trompe l'oeil. To understand a joke based on polysemy or an idiomatic expression, one must clear the hurdle of noticing the other meaning. ChatGPT, however, probably faces no such hurdle because it can attend to multiple meanings equally.
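The three human-made jokes for which ChatGPT's 'offensiveness' rating exceeds the mean human rating by more than one point can be read directly off Table 2; the following quick check (with the 'offensiveness' values copied from Table 2) confirms them:

```python
# 'Offensiveness' ratings from Table 2: (ChatGPT score, mean human score).
offensiveness = {
    "H1": (3, 1.81), "H2": (1, 2.26), "H3": (2, 2.13),
    "H4": (1, 1.65), "H5": (3, 1.58), "H6": (2, 1.71),
    "H7": (3, 1.68), "H8": (1, 1.55), "H9": (2, 1.97),
}

# Jokes ChatGPT rated more than one point MORE offensive than humans did.
stricter = [jid for jid, (gpt, human) in offensiveness.items() if gpt - human > 1]
print(stricter)  # ['H1', 'H5', 'H7']
```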
Next, we turn to the jokes generated by ChatGPT itself (Table 3). While there was little difference between ChatGPT and humans on 'offensiveness', there were large differences on 'funniness' and 'intelligibility': humans found many of ChatGPT's jokes neither funny nor easy to understand. In particular, on 'funniness' ChatGPT's own rating exceeds the human rating by more than two points for every joke except C1. This is the opposite of the result for English jokes: as we saw in section 2, Gorenz and Schwarz (2024) reported that humans rated ChatGPT's English jokes higher than human-made ones.
There are several possible reasons why humans do not find ChatGPT's Japanese jokes funny. Firstly, many jokes carry a certain level of offensiveness, but ChatGPT's jokes do not. For example, the human-made H1 contains an element of social satire. This is consistent with ChatGPT's strict attitude towards 'offensiveness'. Secondly, ChatGPT may not properly grasp the literal meaning of Japanese idiomatic expressions. One is unlikely to say ashimoto wo sukuwareru 'have one's feet scooped out' (C4) about falling down, or atama ga agaranai 'cannot raise one's head' (C5) about cricking one's neck in one's sleep. The literal meaning is obtained by composing the meanings of the individual elements, but such compositional semantic computation seems to be difficult for ChatGPT.
This paper investigated the differences between ChatGPT and human ratings of Japanese jokes. The jokes included those generated by ChatGPT itself and showed that, contrary to previous research on English jokes, Japanese jokes generated by ChatGPT obtained lower ratings than jokes created by humans.
Finally, this study has at least three shortcomings. Firstly, only 18 jokes were studied. Secondly, the categories of jokes are not balanced: most of the jokes involve idiomatic expressions, and only one involves a stereotype; the polysemy involved is only lexical, not structural. These problems should become easier to address once a dataset of Japanese jokes comparable to UR-FUNNY (Hasan et al. 2019) is constructed.
The third problem is that the respondents in the human rating experiment were all university students and their demographics are imbalanced. Using crowdsourcing to target a larger number of respondents with diverse attributes, as Gorenz and Schwarz (2024) did, should lead to a more comprehensive understanding of human evaluation of Japanese jokes.
The following tables summarise the details of the human evaluations of the Japanese jokes (H1–H9) and ChatGPT's Japanese jokes (C1–C9), respectively.
ID | Funniness Mean | Funniness Median | Funniness Mode | Funniness SD | Offensiveness Mean | Offensiveness Median | Offensiveness Mode | Offensiveness SD | Intelligibility Mean | Intelligibility Median | Intelligibility Mode | Intelligibility SD |
---|---|---|---|---|---|---|---|---|---|---|---|---|
H1 | 3.84 | 4 | 4 / 5 | 0.74 | 1.81 | 1 | 1 | 1.06 | 3.45 | 4 | 3 / 4 / 5 | 1.27 |
H2 | 3.29 | 4 | 4 | 0.99 | 2.26 | 2 | 1 | 1.27 | 3.84 | 4 | 4 | 0.95 |
H3 | 3.45 | 4 | 4 | 0.87 | 2.13 | 2 | 1 / 2 | 1.10 | 4.13 | 4 | 4 | 0.86 |
H4 | 2.61 | 2 | 2 | 1.18 | 1.65 | 1 | 1 | 1.09 | 4.19 | 4 | 5 | 0.78 |
H5 | 2.71 | 3 | 4 | 1.04 | 1.58 | 1 | 1 | 1.01 | 4.10 | 4 | 4 | 0.89 |
H6 | 3.06 | 3 | 4 | 1.36 | 1.71 | 1 | 1 | 1.19 | 2.94 | 3 | 4 | 1.22 |
H7 | 2.52 | 2 | 1 | 1.23 | 1.68 | 1 | 1 | 1.09 | 2.16 | 2 | 1 | 1.19 |
H8 | 3.29 | 3 | 4 | 1.03 | 1.55 | 1 | 1 | 0.91 | 3.26 | 3 | 4 | 1.32 |
H9 | 3.19 | 3 | 4 | 1.03 | 1.97 | 1 | 1 | 1.31 | 2.74 | 4 | 4 | 1.19 |
ID | Funniness Mean | Funniness Median | Funniness Mode | Funniness SD | Offensiveness Mean | Offensiveness Median | Offensiveness Mode | Offensiveness SD | Intelligibility Mean | Intelligibility Median | Intelligibility Mode | Intelligibility SD |
---|---|---|---|---|---|---|---|---|---|---|---|---|
C1 | 3.10 | 3 | 2 | 1.18 | 1.35 | 1 | 1 | 0.90 | 3.97 | 4 | 4 | 0.97 |
C2 | 2.39 | 2 | 2 | 1.13 | 1.48 | 1 | 1 | 1.01 | 4.10 | 4 | 5 | 0.86 |
C3 | 2.51 | 2 | 2 | 1.07 | 1.45 | 1 | 1 | 0.98 | 2.84 | 3 | 2 | 1.27 |
C4 | 2.03 | 2 | 1 | 0.98 | 1.39 | 1 | 1 | 0.90 | 2.77 | 3 | 1 | 1.43 |
C5 | 2.19 | 2 | 1 | 1.09 | 1.48 | 1 | 1 | 0.91 | 3.42 | 4 | 4 | 1.21 |
C6 | 2.42 | 2 | 2 | 1.09 | 1.65 | 1 | 1 | 1.06 | 3.48 | 4 | 4 | 1.16 |
C7 | 2.61 | 2 | 4 | 1.21 | 1.35 | 1 | 1 | 0.82 | 3.61 | 4 | 4 | 1.10 |
C8 | 2.06 | 2 | 1 | 1.17 | 1.42 | 1 | 1 | 0.94 | 4.10 | 4 | 4 | 0.86 |
C9 | 2.77 | 3 | 2 | 1.26 | 1.39 | 1 | 1 | 0.87 | 3.23 | 3 | 3 | 1.36 |
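Summary statistics of the kind reported above can be computed from raw Likert responses with Python's standard library. This is a minimal sketch under two assumptions not stated in the paper: the per-respondent data below are hypothetical (the raw responses are not published), and the SD is taken to be the sample standard deviation.

```python
from statistics import mean, median, multimode, stdev

# Hypothetical 1-5 Likert responses from 31 respondents for one joke;
# the paper does not publish per-respondent data.
responses = [4, 5, 4, 3, 4, 5, 4, 4, 3, 4, 5, 4, 3, 4, 4, 5,
             4, 3, 4, 4, 5, 4, 4, 3, 4, 4, 5, 4, 4, 3, 4]

def summarise(xs):
    """Mean, median, mode(s) and sample SD, as in the appendix tables.
    multimode() returns every tied mode, matching entries like '4 / 5';
    whether the paper used sample or population SD is an assumption here."""
    return {
        "mean": round(mean(xs), 2),
        "median": median(xs),
        "mode": " / ".join(str(m) for m in multimode(xs)),
        "sd": round(stdev(xs), 2),
    }

print(summarise(responses))
```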
©2025 Hiroki Nomoto. All rights reserved.
Last modified on 21 March 2025.