Monday, October 14, 2024

Using AI to Transcribe and Translate Documents

Being able to upload an image of a document to AI and have it instantly and accurately transcribe difficult-to-read handwriting, or translate a foreign language written in equally difficult handwriting, is an exciting idea, but in practice it does not work well. Specifically, the biggest problem I find is that when it can't read something, it decides to just... make something up that seems in keeping with the type of document it is. Clearly, AI is smart enough to recognize what kind of document it is and supply the type of information you'd expect from that type of record, but it is apparently not smart enough to realize why the information needs to be accurate and not fabricated.

Take, for example, this will of Henry Rorer (above). I uploaded it to ChatGPT and asked it to transcribe it. This is what it gave me:

In the name of God Amen the twenty-second Day of January in the year of our Lord one thousand seven hundred and eighty. I Henry Rorer of Bristol Township in the County of Philadelphia Yeoman being weak in body but of perfect mind and memory thanks be given unto God, therefore this my last Will and Testament. That is to say, I first and principally I resign my soul to God that gave it and my body to the Earth to be buried in a decent and Christian like manner at the discretion of my Executors herein after named and touching such worldly Estate wherewith it hath pleased God to bless me in this life. I give and dispose of the same in the following manner, and form.

Imprimis. I give and bequeath unto Mary Willard one Chest that she is to choose out of all my Chests and Boxes to be paid and delivered her by my Executors hereafter named.

Item. I give and bequeath unto Margaret my dearly beloved Wife all my Estate both real and personal during her widowhood, excepting such legacies as I shall hereafter mention in this my last will and testament and if in case my Wife should die or marry, then it is my will that my Estate shall be sold and divided equally amongst my children or the survivors of them share and share alike, and if either of my Children should die under age or without lawful issue that then my Estate shall be equally divided amongst the survivors of them, and I do hereby nominate and appoint my beloved Wife and my Friend Henry Rorer and George Rorer Executors of this my last Will and Testament, and I do hereby utterly revoke and disannul all former Wills and Testaments by me heretofore made Ratifying and Confirming this and no other to be my last Will and Testament.

In witness whereof I have hereunto set my hand and Seal the day and year above written. Signed Sealed published pronounced and declared by the said Henry Rorer as his last Will and Testament in the presence of us.

Phillip Hagey
Peter Shall

Henry Rorer (Seal)

At first glance, it looks great. But when you actually compare it to the original document, it quickly becomes clear some of it was just completely made up. Firstly, note that the date is slightly off. It was actually the 29th day of January, not the 22nd. It's understandable a single word might be misread, but it goes downhill from there. Where the transcription reads "thanks be given unto God, therefore this my last Will and Testament" the record actually says "do make and ordain this my last Will and Testament." The transcription then goes on to add in an entire sentence that isn't included in the record: "That is to say, I first and principally I resign my soul to God that gave it and my body to the Earth to be buried in a decent and Christian like manner at the discretion of my Executors herein after named..."

Although inaccurate, that addition at least doesn't give us any misleading information. The paragraph following it, however, fabricates people and property never mentioned in the original. The transcription says:

Imprimis. I give and bequeath unto Mary Willard one Chest that she is to choose out of all my Chests and Boxes to be paid and delivered her by my Executors hereafter named.

The actual document says:

Imprimis It is my Will and I do order that in the first place all my just Debts & funeral charges be paid and satisfied.

ChatGPT did not even attempt to get it right. It's as though it ignored even the words it could make out in favor of a sentence that makes sense instead of just leaving blank spaces where it isn't able to read certain words. So the entire sentence was changed just to produce something that wasn't missing words. This could be really disastrous to your research if you're not paying attention. It fabricated a name where there was no person mentioned, making it look like Henry Rorer willed one Mary Willard her choice of a chest or box when there was no such person ever mentioned in his will and no mention of any chests or boxes either.
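If you have already transcribed part of a document yourself, one way to catch this kind of substitution is to compare the AI's text against your own reading word by word. Here is a minimal sketch in Python using the standard library's difflib module. The two strings are the "Imprimis" passages quoted above; the function names normalize and flag_differences are just ones I made up for illustration, and this only works for the portions you have read by hand, so it is a spot check and not a substitute for reading the record.

```python
import difflib

# The two versions of the same clause, copied from the comparison above.
ai_text = ("Imprimis. I give and bequeath unto Mary Willard one Chest that she "
           "is to choose out of all my Chests and Boxes to be paid and "
           "delivered her by my Executors hereafter named.")
own_text = ("Imprimis It is my Will and I do order that in the first place all "
            "my just Debts & funeral charges be paid and satisfied.")


def normalize(text):
    """Lowercase each word and strip surrounding punctuation,
    so that only differences in wording are counted."""
    words = (w.strip(".,;:()\"'").lower() for w in text.split())
    return [w for w in words if w]


def flag_differences(ai, own):
    """Hypothetical helper: return a similarity ratio (0.0 to 1.0) and a list
    of (tag, ai_words, own_words) for every span where the two texts differ."""
    a, b = normalize(ai), normalize(own)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    diffs = [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
    return matcher.ratio(), diffs


ratio, diffs = flag_differences(ai_text, own_text)
print(f"similarity: {ratio:.2f}")
for tag, ai_part, own_part in diffs:
    print(f"{tag:8} AI: [{ai_part}]  mine: [{own_part}]")
```

A low similarity score on a passage you could read clearly yourself is a strong hint that the AI filled in text instead of reading it, and any names that show up only on the AI side (like Mary Willard here) deserve particular suspicion.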

Additionally, the transcription looks surprisingly short. Even though the will is only about one page long, it contains more than the transcription includes. Indeed, the transcription is missing a lot of information from the rest of the document, including many people who were of importance to Henry Rorer.

Essentially the same thing happens when you ask ChatGPT to translate a handwritten document in a foreign language: wherever it struggles to read the handwriting, including names, occupations, ages, etc., it simply makes up data to replace it. It changes names and other details just to produce a complete translation, whether accurate or not.

So here is your warning. AI is not there yet. Maybe there are more reliable tools out there, but I imagine they aren't free. Whatever handwriting OCR FamilySearch is using to transcribe records in their experimental Full Text Search seems to work much better than ChatGPT. But not all records are on FamilySearch, let alone covered by their Full Text Search.

As ever, the genealogist mantra remains true: you have to do the work yourself and verify everything.