Do AI detectors live up to the hype? Can they detect text written entirely or partly by ChatGPT? The short answer is yes. The long answer is it depends. We’ll go into the details here to help school leaders and K-12 teachers understand the limitations of currently available AI detection applications.
This is the second phase of our tests of popular AI detectors. The first phase, which we described here, explored narrative text generated by ChatGPT. It concluded that prompting the AI bot to alter characteristics of the text that are known to increase its complexity will eventually produce text that can’t be detected.
This time, we wanted to experiment with expository writing, which is typically assigned across the disciplines beginning in fourth grade. We thought we could take a simple descriptive essay on any topic and then prompt ChatGPT to make it more complex. However, we found inconsistent results when we entered the AI-generated text into the AI detectors. The same prompt (Write a descriptive essay…) with different topics (…about x) yielded different probabilities that the texts were AI generated.
This finding led to a new line of inquiry. Unlike with narrative AI-generated texts, the topics of expository AI-generated texts seemed to make it more or less challenging for AI detectors to determine whether they were human- or AI-composed. As we continued exploring, we began to detect a pattern. See our first hypothesis below:
Hypothesis 1: The more current the text’s topic, the more difficult it is to detect
We understand that large language models, like ChatGPT, are trained on the massive corpus of internet texts that were available at the time of their training. Therefore,
“GPT-4 generally lacks knowledge of events that have occurred after the vast majority of its data cuts off (September 2021), and does not learn from its experience.”
OpenAI
We also understand that the longer a topic has remained in public discourse, the more digital content there will be about that topic on the internet.
Therefore, the plethora of available content about a well-established topic will allow ChatGPT to generate highly predictable word pairings, phrases, and sentences that will be easier for detectors to flag as AI-written.
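To see why predictability matters, it helps to know that many detectors score a text by how well a language model predicts each next word, a measure called perplexity. Below is a minimal sketch in Python, assuming the Hugging Face transformers library and using GPT-2 as a stand-in scoring model (the models behind ZeroGPT and GPT4Detector.ai are not public):

```python
# Minimal sketch of perplexity-based scoring. GPT-2 is a stand-in; actual
# detectors use their own (undisclosed) models and additional signals.
# Lower perplexity means the text is more predictable to the model, which
# detectors treat as a sign of AI writing.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing the same ids as labels makes the model return the mean
        # cross-entropy loss over its next-token predictions.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

# A well-worn topic should read as more predictable (lower perplexity)
# than a niche one, all else being equal.
print(perplexity("Climate change is one of the most pressing issues facing our planet."))
print(perplexity("Kintsch's Construction Integration theory models comprehension as cycles of activation."))
```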
Testing AI-detection of text with older and newer topics
To test our hypothesis, we first asked ChatGPT to write a descriptive essay about a well-established topic: climate change. It generated one in just a few seconds. See below:
[ChatGPT’s descriptive essay on climate change]
Next, we ran the AI-written text through two free AI detectors: ZeroGPT and GPT4Detector.ai. After screening a text, both applications provide a percentage indicating the likelihood that AI wrote it. Both apps indicated a high probability that AI generated the descriptive essay on climate change. Below are the results.
ZeroGPT: 81.89%
GPT4Detector.ai: 100%
Next, we picked a topic that we felt was more contemporary, artificial intelligence, and asked ChatGPT to generate the same type of writing about it. See below.
[ChatGPT’s descriptive essay on artificial intelligence]
In line with our hypothesis, the probability that AI generated the text, as determined by the two apps, decreased quite a bit.
ZeroGPT: 53.67% (-28.22%)
GPT4Detector.ai: 42% (-58%)
We tried a third time by asking ChatGPT to write about a topic that we felt was even more contemporary: the science of reading, which has gained widespread media traction only in the past couple of years. See below.
[ChatGPT’s descriptive essay on the science of reading]
Again, supporting our hypothesis, the AI detectors indicated a lower probability that AI wrote the text.
ZeroGPT: 15.72% (-37.95%)
GPT4Detector.ai: 19% (-23%)
Checking Topic Frequency Using Google Books Ngram Viewer
To check our hypothesis from a different angle, we plugged the three topic keywords into the Google Books Ngram Viewer. We wanted to make sure that the frequencies at which these topics appear in the discourse matched our assumptions. This tool charts how often a keyword appears in Google’s corpus of digitized books, expressed as a percentage of all phrases of the same length published in a given year.
We would have liked to identify their frequencies across all web-hosted texts, as these are the ones that train large language models like ChatGPT, but this technology doesn’t exist (as far as we’re aware). We used Google’s Ngram data as a proxy, assuming that internet content trends mirror book trends.
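For readers who want to pull the same numbers programmatically rather than reading them off the chart, the Ngram Viewer can be queried over an unofficial JSON endpoint. The URL and parameter names below are assumptions inferred from what the public viewer’s page requests, and Google may change them without notice:

```python
# Sketch of querying Google Books Ngram data. The JSON endpoint is unofficial
# and undocumented; the URL, corpus identifier, and parameter names are
# assumptions and may break without warning.
import requests

def ngram_series(phrase: str, start: int = 1990, end: int = 2019):
    resp = requests.get(
        "https://books.google.com/ngrams/json",
        params={
            "content": phrase,
            "year_start": start,
            "year_end": end,
            "corpus": "en-2019",  # assumed corpus identifier
            "smoothing": 3,
        },
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()
    # Each result carries a "timeseries" of yearly frequencies.
    return data[0]["timeseries"] if data else []

for topic in ["climate change", "artificial intelligence", "science of reading"]:
    series = ngram_series(topic)
    print(topic, f"peak frequency: {max(series):.2e}" if series else "no data")
```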
The data supports our hypothesis. Books within the corpus discuss climate change most frequently, followed by AI, and then the science of reading. See the data below.
[Ngram Viewer graph comparing climate change, AI, and the science of reading]
Hypothesis 2: The more specific the topic, the more difficult it is to detect
Although the old-versus-new topic hypothesis seemed to fit the data, we couldn’t help but wonder if other variables might produce similar results. A topic like the science of reading refers to a specific pedagogy for reading instruction, whereas climate change refers to a broader environmental phenomenon. We wondered if topic specificity might also impact AI detectors’ abilities to determine whether or not ChatGPT generated a text.
To test this hypothesis, we asked ChatGPT to write four new descriptive essays: the first about reading, the second about reading comprehension, the third about how readers generate mental representations of text, and the fourth about Kintsch’s Construction Integration theory, a theoretical model of reading comprehension from a cognitive perspective.
Though each topic pertains to the same construct, they differ in specificity: reading denotes the broadest layer, comprehension sits within reading, and Kintsch’s CI theory is one of many specific theoretical models of comprehension.
Below are the prompts that we used in ChatGPT followed by the results from the two AI detectors. As predicted, the less specific the topic, the more likely the AI detectors were to flag it as AI generated.
“Write a descriptive essay about reading.”
ZeroGPT: 80.93%
GPT4Detector.ai: 98%
“Write a descriptive essay about reading comprehension.”
ZeroGPT: 59.16% (-21.77%)
GPT4Detector.ai: 72% (-26%)
“Write a descriptive essay about how readers generate mental representations of text.”
ZeroGPT: 11.49% (-47.67%)
GPT4Detector.ai: 22% (-50%)
“Write a descriptive essay about Kintsch’s Construction Integration theory.”
ZeroGPT: 8.36% (-3.13%)
GPT4Detector.ai: 1% (-21%)
Topic Frequency According to Google Books Ngram Viewer
We once again plugged our four topics into the Google Books Ngram Viewer. The results are displayed in the graph below. Because reading appears so much more frequently than the other three topics, it’s difficult to see their frequencies.
[Ngram Viewer graph comparing reading, reading comprehension, mental representation, and construction integration]
We removed reading to see how the other topics compare. As predicted, reading comprehension was the next most common topic followed by mental representation and then construction integration.
[Ngram Viewer graph with reading removed]
What does this mean for teachers?
AI detectors can be helpful tools for checking students’ writing, but it’s important to understand their limitations. As shown here, a text’s topic may influence their reliability. If a student submits writing about a broad, well-established topic and an AI detector indicates that a human most likely wrote it, the student probably did. However, if a student submits writing about a new and niche topic, AI detectors may indicate that a human wrote it even when ChatGPT did.
Try our experiment with a new set of topics and tell us your results in the comments, below.
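If you’d like a quick starting point, the snippet below prints a matched set of prompts to paste into ChatGPT and then into each detector. The topic sets are only hypothetical examples; swap in your own older-versus-newer and broader-versus-narrower picks:

```python
# Generates matched prompts for replicating both experiments. The topic sets
# below are hypothetical examples; substitute your own.
recency_set = ["the printing press", "social media", "generative AI"]  # older to newer
specificity_set = ["science", "biology", "CRISPR gene editing"]        # broader to narrower

for topic in recency_set + specificity_set:
    print(f"Write a descriptive essay about {topic}.")
```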