Is your AI language tutor lying to you?

Oh Yeah Sarah

--

How reliable is AI grammar correction for language learners?

>> You can watch/listen to a video version of this article on YouTube

There are many new apps that use AI to provide tutoring and practice in a foreign language. I have an ever-growing list of examples here.

There are two main things that these AI language tutors offer.

1) They chat to you in a realistic human-like way in the language you want to practise.

2) They correct your mistakes and explain them.

As a chat partner, they do a great job, but what about the advice they give you about your mistakes? How reliable is that?

I’ve recently tested numerous AI language tutor apps and incorporated AI corrections into the app that I run.

As a result, I now have a good idea of how accurate AI corrections are and the areas where AI is most likely to fall short.

I’ve been investigating this topic for years, for example with Grammarly back in 2017, but the new generation of AI has brought new opportunities for automated language correction.

First, I’ll tell you what you need to know as a language learner using these apps. In the second half, I’ll go into some detail that may be of more interest to people developing AI language tutor apps.

Advice for language learners

The accuracy of the corrections may vary slightly between apps but, assuming all apps are using ChatGPT or a similar model, the output will be broadly similar.

I did some in-depth testing of 3 AI language tutor apps, using 10 texts written by real English learners who have used the app that I run (see the full findings near the end of this article). It’s important to test with texts written by real language learners because it means the texts will contain the types of mistakes that are typical of language learners. We’re not just talking typos and spelling mistakes, like a native speaker of a language would make, we’re talking misunderstandings with grammar and vocabulary.

There are two main ways that an AI tutor will give misleading advice:

  • It will miss a mistake, leaving you thinking something is fine when it’s not.
  • It will correct something that’s fine, leaving you thinking your grammar or vocabulary is worse than it is.

I’ll show some examples using various AI tutor apps, with real texts written by real English learners.

Missed mistakes

The examples below are grammar mistakes but I’ve seen AI miss vocabulary mistakes too.

Here’s an example from Lengi.

‘Had participated’ should have been corrected because it’s not the correct tense here. The correct tense for this context is Past Simple (‘participated’).

And one from Langua.

Here ‘to be too rich’ should have been corrected to ‘being too rich’. I even asked ChatGPT its opinion on this and it agreed it should be ‘being’. For the grammar nerds: we use a gerund when the action is the subject of the sentence.

Correcting things that are not incorrect

This is where the creative ‘generative’ aspect of AI comes in and becomes a problem. AI is always trying to improve the style of a text or make it more precise or descriptive. This is great for some situations (e.g. writing a cover letter for a job) but it creates a lot of noise when your goal is to find out whether you used sufficiently correct language.

If you’re at intermediate level in a language, your priority is learning correct grammar and vocabulary, not achieving beautifully elegant language. You need to focus on what’s wrong, rather than what’s correct but could be marginally better.

You also need to know the difference between a change that makes something correct and one that makes it better.

Here’s an example from Go Correct.

‘Huge’ is absolutely fine here, and while adding ‘only’ could improve the style, the sentence is certainly not incorrect without it.

This article also contains some detailed and interesting research into AI’s ability to determine correct and incorrect grammar, and that writer came to the same conclusion: part of AI’s downfall is its inclination to be ‘too creative’.

Why is this a problem if AI is right most of the time?

As a language learner, I don’t like this element of doubt because it undermines my trust in every correction the AI gives me.

When using an AI tutor to practise Spanish, I can look at some corrections and immediately see that I obviously got it wrong. But it makes me want to double-check the rest with a human, which rather defeats the whole purpose.

Also, a human teacher would make a judgement call about what changes are worth focusing on, depending on the level of the learner. I have not yet seen an AI tutor that does that.

My advice to language learners using AI tutors

Focus on the corrections where you can immediately see and understand what you got wrong. You can rely on that because the correction is backed up by the grammar rules that you already know but temporarily forgot. For everything else, be cautious!

Now let’s look at some data. How often does AI get it wrong?

I took 10 texts written by English learners. I corrected them myself, which we’ll take to be the most accurate and reliable way of correcting them, although, of course, even among humans there would be some disagreement about what needs to be corrected.

I then ran them through the current prompt I’m using in my app and two other good AI language tutor apps. Then I compared the results.

This graph shows each text in each app. You’ll see there’s a lot of variation in how many mistakes they correct and whether they find all the mistakes that a human corrected.

Tested in October 2024

You might look at text 5 and think: “great, all of them matched the human corrections there”. But no, it’s not that simple. This is just the number of things each app corrected and, as I mentioned, sometimes AI corrects things that are fine. So, let’s dive deeper into what each app did with this text.

Tested in October 2024

And here’s the same type of graph for another text. As you’ll see, no app gets it right all the time.

Now, which type of error does AI make most often? Missing a mistake or correcting something that’s fine? That depends on how the prompt is crafted, but based on the relatively small amount of data I looked at, I think it’s generally most likely to correct something that’s fine.

If a prompt is less likely to miss mistakes, then it’s more likely to correct things that aren’t wrong. Here’s the split between missed and unnecessary corrections in all the apps I tested.
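To make the scoring concrete, the comparison can be reduced to simple set arithmetic: treat the human’s corrections as the reference set and each app’s corrections as a candidate set. Here’s a minimal sketch in Python of that idea — the flagged phrases below are invented for illustration, not taken from my test data.

```python
# Hypothetical spans a human corrector flagged in one learner text
human = {"had participated", "to be too rich"}

# Hypothetical spans an AI tutor flagged in the same text
ai = {"to be too rich", "huge"}

matched = ai & human        # corrections that agree with the human
missed = human - ai         # mistakes the AI failed to flag
unnecessary = ai - human    # AI 'corrections' of things that were fine

print("matched:", sorted(matched))
print("missed:", sorted(missed))
print("unnecessary:", sorted(unnecessary))
```

In this invented example the AI would score one match, one missed mistake (‘had participated’) and one unnecessary correction (‘huge’). Counting `missed` and `unnecessary` across all texts gives the split shown in the graph.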

How the corrections are displayed matters too

Displaying the corrections in a way that’s easy to read and digest makes a big difference and some apps do this better than others.

Langua and the app I run do it well. Talkpal does it badly, forcing your eyes to jump back and forth between the two texts to compare them.

‘Often’ is the app that I run

What does all this mean?

From 2017 to 2024, I ran Go Correct as an app where English learners could have a human correct the mistakes in their daily writing practice. By 2024, it had become inevitable that, in the new version of the app, I would add the option to have AI do the job instead of a human.

I was initially resistant to using AI in my own app because I don’t like the unreliability and the uncertainty it creates. However, I have recently started to come round to the idea and realised that AI feedback can be useful if approached in the right way.

I am interested to see how long it will be before AI is accurate enough that all language learners can confidently rely on it for this task.
