Detecting AI May Be Impossible. That’s a Big Problem For Teachers.

Rebecca Dell, AP English teacher, in front of her classroom at Concord High School with student Lucy Goetz. Goetz helped The Post test Turnitin's AI detectors earlier this year — and found the system inaccurately flagged her original essay as AI-generated. (Andria Lo for The Washington Post)

Turns out, we can’t reliably detect writing from artificial intelligence programs like ChatGPT. That’s a big problem, especially for teachers. Even worse, scientists increasingly say using software to accurately spot AI might simply be impossible.

The latest evidence: Turnitin, a big educational software company, said that the AI-cheating detector it has been running on more than 38 million student essays since April has more of a reliability problem than it initially suggested. Turnitin — which assigns a “generated by AI” percent score to each student paper — is making some adjustments, including adding new warnings on the types of borderline results most prone to error.

I first wrote about Turnitin’s AI detector this spring when concerns about students using AI to cheat left many educators clamoring for ways to deter it. At that time, the company said its tech had a less than 1 percent rate of the most problematic kind of error: false positives, where real student writing gets incorrectly flagged as cheating. Now, Turnitin says that on a sentence-by-sentence level — a narrower measure — its software incorrectly flags 4 percent of writing.

My investigation also found false detections were a significant risk. Before it launched, I tested Turnitin’s software with real student writing and with essays that student volunteers helped generate with ChatGPT. Turnitin identified over half of our 16 samples at least partly incorrectly, including saying one student’s completely human-written essay was written partly with AI.

The stakes in detecting AI may be especially high for teachers, but they’re not the only ones looking for ways to do it. So are cybersecurity companies, election officials and even journalists who need to identify what’s human and what’s not. You, too, might want to know if that suspicious email from a boss or politician was written by AI.

A flood of AI-detection programs has come onto the web in recent months, including ZeroGPT and Writer. Even OpenAI, the company behind ChatGPT, makes one. But there’s a growing body of examples of these detectors getting it wrong — including one that claimed the preamble to the Constitution was written by AI. (Not very likely, unless time travel is also now possible?)

The takeaway for you: Be wary of treating any AI detector like fact. In some cases right now, it’s little better than a random guess.

Can a good AI detector exist?

A 4 percent, or even 1 percent, error rate might sound small — but every false accusation of cheating can have disastrous consequences for a student. Since I published my April column, I’ve gotten notes from students and parents distraught about what they said were false accusations. (My email is still open.)

In a lengthy blog post last week, Turnitin Chief Product Officer Annie Chechitelli said the company wants to be transparent about its technology, but she didn’t back off from deploying it. She said that for documents that its detection software thinks contain over 20 percent AI writing, the false positive rate for the whole document is less than 1 percent. But she didn’t specify what the error rate is the rest of the time — for documents its software thinks contain less than 20 percent AI writing. In such cases, Turnitin has begun putting an asterisk next to results “to call attention to the fact that the score is less reliable.”

“We cannot mitigate the risk of false positives completely given the nature of AI writing and analysis, so, it is important that educators use the AI score to start a meaningful and impactful dialogue with their students in such instances,” Chechitelli wrote.

The key question is: How much error is acceptable in an AI detector?

New preprint research from computer science assistant professor Soheil Feizi and colleagues at the University of Maryland finds that no publicly available AI detectors are sufficiently reliable in practical scenarios.

“They have a very high false-positive rate, and can be pretty easily evaded,” Feizi told me. For example, he said, when AI writing is run through paraphrasing software, which works like a kind of automated thesaurus, the AI detection systems are little better than a random guess. (I found the same problem in my tests of Turnitin.)

He’s also concerned that AI detectors are more likely to flag the work of students for whom English is a second language.

Feizi didn’t test Turnitin’s software, which is available only to paying educational institutions. A Turnitin spokeswoman said Turnitin’s detection capabilities “are minimally similar to the ones that were tested in that study.”

Feizi said if Turnitin wants to be transparent, it should publish its full accuracy results and allow independent researchers to conduct their own research on its software. A fair analysis, he said, should use real student-written essays across different topics and writing styles, and report failure rates for each subgroup as well as overall.

We wouldn’t accept a self-driving car that crashes 4 percent — or even 1 percent — of the time, Feizi said. So, he proposes a new baseline for what should be considered acceptable error in an AI detector used on students: a 0.01 percent false-positive rate.

When will that happen? “At this point, it’s impossible,” he said. “And as we have improvements in large-language models, it will get even more difficult to get even close to that threshold.” The problem, he said, is that the distributions of AI-generated text and human-generated text are converging on each other.

“I think we should just get used to the fact that we won’t be able to reliably tell if a document is either written by AI — or partially written by AI, or edited by AI — or by humans,” Feizi said. “We should adapt our education system to not police the use of the AI models, but basically embrace it to help students to use it and learn from it.”

This analysis was published by columnist Geoffrey A. Fowler at The Washington Post.
