Still some way to go for OCR

foto16 · 2025-01-08T06:08:42.635Z

Joplin's OCR seems to perform well with clear text against a simple background (such as a screenshot of a PowerPoint slide or a table), but less so with more complex images. I did a small OCR test with the attached image. Joplin (Win 3.2.8) either fails or only recognizes one line of the text. Apple, OneNote, Google Keep, and Obsidian with the Text Extractor plugin can recognize all the English text without a problem. ChatGPT can even correctly recognize the Chinese words.

[Attached image: unnamed, 810×392, 38.7 KB]

_Ben · 2025-01-08T07:40:03.026Z

Interesting. As an aside I'm thankful that Joplin doesn't incorporate AI and personally I hope it never does!

I've actually had a lot of OCR success with all kinds of documents including crumpled up receipts and badly photographed pages of books. I've very often been pleasantly surprised as I really thought the main use case was trawling clearish PDFs.

But the key thing here is that they are all documents, and your image of course isn't. I might be wrong but I think the issue is you're expecting something that I don't believe the OCR in Joplin was designed for: it's for text from documents and text over a photo is a fair way from that. But particularly when the text shares colours with the image - I'm not surprised it wasn't able to be read for that reason - it's a big ask! Though you do say other software can do it, fair enough.

Anyway, I just really wanted to present a counter viewpoint - I've found Joplin's OCR to work really well with virtually everything I've thrown at it.

laurent · 2025-01-08T12:10:41.857Z

I would guess that Apple, OneNote and Google Keep send all your data to their servers for OCR, which makes it a lot easier since they can have very large models running there. We do all this locally so that your data doesn't have to leave your computer.

I'm more surprised that the Text Extractor plugin works since it looks like they use Tesseract too. I'll give it a try and see how we can tweak our settings to improve it.

laurent · 2025-01-08T12:22:22.522Z

Ok I know what happened. In Obsidian I'm getting this text:

ETIrT ’ " I ? r A ' ' ' AL r . 1 . G - 7 ' : v U1l ’ ' | ' \ 1.7 eback Mountain (2005) ' ‘ . /?Iﬁ f Havoc (2005) 1 4 3.Z1i5R%j Love & Other Drugs (2010) J 4.5 H9 I The Last Thing He Wanted (2020) ‘l '

Some of it is ok, but it's a lot of random characters so Tesseract probably gave it a very low confidence value. In Joplin we'd discard such text since it's not very useful.

I'm not sure we should change this, as lowering the confidence threshold means a lot more poor-quality text would get saved to the search index and pollute it.
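
To make the trade-off concrete, here is a minimal sketch of confidence filtering. It is not Joplin's actual code: the `Word` type, the `keep_confident_words` helper, the word-level granularity and the threshold values are all assumptions for illustration. The text fragments are taken from the output above, but the confidence numbers are invented. The only thing borrowed from Tesseract is that it reports confidence on a 0-100 scale.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    confidence: float  # 0-100, the scale Tesseract uses for confidence


def keep_confident_words(words: list[Word], min_confidence: float) -> str:
    """Join only the words whose confidence is at or above the threshold."""
    return " ".join(w.text for w in words if w.confidence >= min_confidence)


# Hypothetical sample: fragments come from the OCR output quoted above,
# but every confidence value here is made up for illustration.
sample = [
    Word("ETIrT", 12.0),
    Word("Havoc", 91.0),
    Word("(2005)", 88.0),
    Word("3.Z1i5R%j", 20.0),
    Word("Love", 93.0),
    Word("&", 70.0),
    Word("Other", 92.0),
    Word("Drugs", 94.0),
    Word("(2010)", 89.0),
]

# A strict threshold drops the noise, but also loses some real text ("&").
strict = keep_confident_words(sample, min_confidence=80.0)

# A lenient threshold keeps everything, including the junk tokens that
# would end up in the search index.
lenient = keep_confident_words(sample, min_confidence=10.0)

print(strict)
print(lenient)
```

With a strict threshold the random fragments are dropped along with a little genuine text; with a lenient one the titles survive intact but so does the noise, which is the index pollution mentioned above.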