Handwritten documents near-duplicate search for data intensive applications

Abstract

Cheating in handwritten academic essays has become a more significant problem over the last several years. One form of cheating is resubmitting the same paper photographed in a different environment (for example, from another angle, under different lighting, or at lower quality) or altered by automatic augmentation. Existing methods are not designed to work on large collections of handwritten documents. The proposed approach consists of three stages: embedding generation, retrieval of the closest candidates from the collection of handwritten documents, and similarity estimation between the query image and each candidate obtained at the previous stage. Our solution achieved Recall@1 of 80% and 59% with a false positive rate (FPR) of 4.8% and 5.5% on Synthetic and Real data, respectively. The search latency is 5.5 seconds per query for a collection of 10 000 images. The results show that the developed method is robust enough to work on large collections of handwritten documents.
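
As a rough illustration of the three-stage structure described in the abstract, the following Python sketch chains an embedding step, a candidate search, and a pairwise similarity check. Everything in it is an assumption made for illustration: the function names (`embed`, `top_k`, `pair_similarity`, `search`), the block-averaging embedding, the brute-force cosine search, and the 0.9 threshold are placeholders, not the models, index, or similarity measure used by the authors.

```python
import numpy as np


def embed(image: np.ndarray, grid: int = 8) -> np.ndarray:
    """Stage 1 (placeholder): block-average a grayscale page into a
    grid x grid vector, centre it, and L2-normalise it.
    A real system would use a learned encoder instead."""
    h, w = image.shape
    h, w = h - h % grid, w - w % grid  # crop so the page splits evenly
    blocks = image[:h, :w].reshape(grid, h // grid, grid, w // grid).mean(axis=(1, 3))
    vec = blocks.ravel() - blocks.mean()
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def top_k(query_vec: np.ndarray, index_vecs: np.ndarray, k: int = 5) -> list:
    """Stage 2 (placeholder): brute-force cosine search over all embeddings.
    An approximate nearest-neighbour index would replace this at scale."""
    scores = index_vecs @ query_vec
    return np.argsort(-scores)[:k].tolist()


def pair_similarity(a: np.ndarray, b: np.ndarray, grid: int = 32) -> float:
    """Stage 3 (placeholder): cosine similarity on a finer grid,
    computed only for the shortlisted candidates."""
    return float(embed(a, grid) @ embed(b, grid))


def search(query, collection, k: int = 5, threshold: float = 0.9) -> list:
    """Run all three stages; return (index, score) pairs above the threshold."""
    index = np.stack([embed(img) for img in collection])
    candidates = top_k(embed(query), index, k)
    scored = [(i, pair_similarity(query, collection[i])) for i in candidates]
    scored.sort(key=lambda pair: -pair[1])
    return [(i, s) for i, s in scored if s >= threshold]


# Toy data: random "pages", with the query a darker, noisier copy of page 3.
rng = np.random.default_rng(0)
collection = [rng.random((64, 64)) for _ in range(20)]
query = 0.8 * collection[3] + rng.normal(0.0, 0.01, (64, 64))
hits = search(query, collection)
print(hits)
```

The split into a cheap search over all embeddings followed by a costlier comparison on a short candidate list is what keeps such a pipeline tractable for large collections; the threshold in the last stage trades recall against false positives.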

About the authors

K. Varlamova

Antiplagiat Company; Moscow Institute of Physics and Technology

Author for correspondence.
Email: kvarlamova@ap-team.ru
Russian Federation, Moscow; Moscow

M. Kaprielova

Antiplagiat Company; Moscow Institute of Physics and Technology; FRC CSC RAS

Email: kaprielova@ap-team.ru
Russian Federation, Moscow; Moscow; Moscow

I. Potyashin

Antiplagiat Company; Moscow Institute of Physics and Technology

Email: potyashin@ap-team.ru
Russian Federation, Moscow; Moscow

Yu. Chekhovich

Antiplagiat Company

Email: chehovich@ap-team.ru
Russian Federation, Moscow

Supplementary files

1. JATS XML
2. Formula 2.1
3. Fig. 1. Example of images from the dataset with synthetic near-duplicates: original and generated near-duplicate.
4. Fig. 2. Image types from the HWR200 dataset, from left to right: scan, light photo, and dark photo.
5. Fig. 3. The process of extracting a signal from an image: conversion to grayscale, application of an adaptive threshold, signal extraction.

Copyright (c) 2024 Russian Academy of Sciences