Handwritten documents near-duplicate search for data intensive applications
- Authors: Varlamova K.1,2, Kaprielova М.1,2,3, Potyashin I.1,2, Chekhovich Y.1
-
Affiliations:
- Antiplagiat Company
- Moscow Institute of Physics and Technology
- FRC CSC RAS
- Issue: No 4 (2024)
- Pages: 129-139
- Section: ARTIFICIAL INTELLIGENCE
- URL: https://journals.eco-vector.com/0002-3388/article/view/676402
- DOI: https://doi.org/10.31857/S0002338824040085
- EDN: https://elibrary.ru/UEFADS
- ID: 676402
Cite item
Abstract
The problem of cheating in handwritten academic essays has become more significant over last several years. One of the cheating cases is submitting the same paper, photographed in different environment (for example, from another angle, in different light or in lower quality), or changed by means of automatic augmentation. The existing methods are not designed to work on large collections of handwritten documents. The proposed approach consists of three stages. The first stage is embedding generation, the second one is finding closest candidates in the collection of handwritten documents and the final one is similarity estimation between query image and each of candidates obtained at previous step. Our solution showed Recall@1 80% and 59% with FPR 4.8% and 5.5% on Synthetic and Real data respectively. The search latency is 5.5 seconds per query for the collection of 10 000 images. The results showed that the developed method is robust enough to work on large collections of handwritten documents.
Full Text

About the authors
K. Varlamova
Antiplagiat Company; Moscow Institute of Physics and Technology
Author for correspondence.
Email: kvarlamova@ap-team.ru
Russian Federation, Moscow; Moscow
М. Kaprielova
Antiplagiat Company; Moscow Institute of Physics and Technology; FRC CSC RAS
Email: kaprielova@ap-team.ru
Russian Federation, Moscow; Moscow; Moscow
I. Potyashin
Antiplagiat Company; Moscow Institute of Physics and Technology
Email: potyashin@ap-team.ru
Russian Federation, Moscow; Moscow
Yu. Chekhovich
Antiplagiat Company
Email: chehovich@ap-team.ru
Russian Federation, Moscow
References
- Bakhteev O., Ogaltsov A., Khazov A., Safin K., Kuznetsova R. CrossLang: the System of Cross-lingual Plagiarism Detection // Workshop on Document Intelligence at NeurIPS. Vancouver, 2019.
- Avetisyan K., Gritsay G., Grabovoy. A. Cross-Lingual Plagiarism Detection: Two Are Better Than One // Programming and Computer Software. 2023. V. 49. P. 346–354.
- Kuznetsova M., Bakhteev O., Chekhovich Y. Methods of Cross-lingual Text Reuse Detection in Large Textual Collections // Informatika I Ee Primeneniya [Informatics and Its Applications]. 2021. V. 15. P. 30–41.
- Gritsay G., Grabovoy A., Kildyakov A., Chekhovich Y. Artificially Generated Text Fragments Search in Academic Documents // Doklady Rossijskoj Akademii Nauk. Matematika, Informatika, Processy Upravlenia. 2023. V. 108. P. 308–317.
- Gritsay G., Grabovoy A., Chekhovich Y. Automatic Detection of Machine Generated Texts: Need More Tokens // Ivannikov Memorial Workshop (IVMEM). Kazan, 2022. V. 108. P. 20–26.
- Ma H.J., Wan G., Lu E.Y. Digital Cheating and Plagiarism in Schools // Theory Into Practice. 2008. V. 47. P. 197–203.
- Wrigley S. Avoiding ‘de-plagiarism’: Exploring the Affordances of Handwriting in the Essay-writing Process // Active Learning in Higher Education. 2019. V. 20. P. 167–179.
- Bakhteev O., Kuznetsova R., Khazov A., Ogaltsov A., Safin K., Gorlenko T., Suvorova M., Ivahnenko A., Botov P. et. al. Near-duplicate Handwritten Document Detection Without Text Recognition // Intern. Conf. on Computational Linguistics and Intellectual Technologies. Moscow, 2021. P. 47–57.
- Krishnan P., Jawahar C.V. Matching Handwritten Document Images // Europ. Conf. on Computer Vision. Amsterdam, 2016. P. 766–782.
- Rowtula V., Bhargavan V., Kumar M., Jawahar C.V. Scaling Handwritten Student Assessments with a Document Image Workflow System // IEEE Conf. on Computer Vision and Pattern Recognition Workshops. Salt Lake City, 2018. P. 2307–2314.
- Pandey O., Gupta I., Mishra B.S.P. A Robust Approach to Plagiarism Detection in Handwritten Documents // Intern. Sympos. on Visual Computing. San Diego, 2020. P. 682–693.
- Coquenet D., Chatelain C., Paquet T. End-to-end Handwritten Paragraph Text Recognition Using a Vertical Attention Network // ArXiv 2021. ArXiv Preprint ArXiv:2012.03868.
- Voigtlaender P., Doetsch P., Ney H. Handwriting Recognition With Large Multidimensional Long Short-term Memory Recurrent Neural Networks // 15th Intern. Conf. on Frontiers in Handwriting Recognition (ICFHR). Shenzhen, 2016. P. 228–233.
- Khritankov A., Botov P., Surovenko N., Tsarkov S., Viuchnov D., Chekhovich Y. Discovering Text Reuse in Large Collections of Documents: A Study of Theses in History Sciences // Artificial Intelligence and Natural Language and Information Extraction, Social Media and Web Search FRUCT Conf. (AINLISMW FRUCT). St. Petersburg, 2015. P. 26–32.
- Potyashin I., Kaprielova M., Chekhovich Y., Kildyakov A., Seil T., Finogeev E., Grabovoy A. HWR200: New Open Access Dataset of Handwritten Texts Images in Russian // Intern. Conf. on Computational Linguistics and Intellectual Technologies. Moscow, 2023.
- Grieggs S., Shen B., Rauch G., Li P., Ma J., Chiang D., Price B., Scheirer W.J. Measuring Human Perception to Improve Handwritten Document Transcription // ArXiv 2019. ArXiv Preprint ArXiv:1904.03734 .
- Toselli A., Romero V., Villegas M., Vidal E., Sanchez J. HTR Dataset // Intern. Conf. on Frontiers in Handwriting Recognition (ICFHR). Shenzhen, 2016. P. 630635.
- Wang J., Song Y., Leung T., Rosenberg C., Wang J., Philbin J., Chen B., Wu Y. Learning Fine-grained Image Similarity With Deep Ranking // IEEE Conf. on Computer Vision and Pattern Recognition. Columbus, 2014. P. 1386–1393.
- Balntas V., Riba E., Ponsa D., Mikolajczyk K. Learning Local Feature Descriptors With Triplets and Shallow Convolutional Neural Networks // The British Machine Vision Conference (BMVC). 2016. V. 1. №2. P. 3.
- He K., Zhang X., Ren S., Sun J. Deep Residual Learning for Image Recognition // IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). Las Vegas, 2016.
- Annoy // https://github.com/spotify/annoy.
- Pinecone // https://github.com/pinecone-io.
- Johnson J., Douze M., J’egou H. Billion-scale Similarity Search With GPUs // IEEE Transactions on Big Data. 2019. V. 7. P. 535–547.
- Melekhov I., Kannala J., Rahtu E. Siamese Network Features for Image Matching // 23rd Intern. Conf. on Pattern Recognition, ICPR. Cancun, 2016. P. 378–383.
- Bakhteev O., Chekhovich Y., Finogeev E., Gorlenko T., Kaprielova M., Kildyakov A., Ogaltsov A. Image Reuse Detection in Large-scale Document Scientific Collection // ENAI Conf., Concurrent Sessions 12. Porto, 2022. P. 107.
- Patil B. V., Patil P. R. An Efficient DTW Algorithm for Online Signature Verification // Intern. Conf. On Advances in Communication and Computing Technology (ICACCT). Painpat, 2018. P. 1–5.
- Salvador S., Chan P. Toward Accurate Dynamic Time Warping in Linear Time and Space // Intellectual Data Analysis. 2007. V. 11. P. 561–580.
- Lowe D.G. Distinctive Image Features from Scale-invariant Keypoints // Intern. J. of Computer Vision. 2004. V. 60. P. 91–110.
- Rublee E., Rabaud V., Konolige K., Bradski G. ORB: An Efficient Alternative to SIFT or SURF // Intern. Conf. on Computer Vision. Barcelona, 2011. P. 25642571.
- DeTone D., Malisiewicz T., Rabinovich A. Superpoint: Self-supervised Interest Point Detection and Description // IEEE Conf. on Computer Vision and Pattern Recognition Workshops. Salt Lake City. 2018, P. 224–236.
- Barroso-Laguna A., Riba E., Ponsa D., Mikolajczyk K. Key. net: Keypoint Detection by Handcrafted and Learned cnn Filters // IEEE/CVF Intern. Conf. on Computer Vision. Seoul, 2019. P. 5836–5844.
- Mishkin D. Local Features: from Paper to Practice // Computer Vision and Pattern Recognition (CVPR) Workshops. Seattle, 2020.
- Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A., Polosukhin I. et. al. Attention Is All You Need // ArXiv 2017. ArXiv Preprint ArXiv:1706.03762.
- Sun J., Shen Z., Wang Y., Bao H., Zhou X. LoFTR: Detector-Free Local Feature Matching With Transformers // ArXiv 2021. ArXiv Preprint ArXiv:2104.00680. P. 8922–8931.
Supplementary files
