Yugra State University Bulletin
ISSN: 1816-9228, 2078-9114
Publisher: Yugra State University
Article ID: 10788
DOI: 10.17816/byusu2018037-48
Article type: Research Article
Vol. 14, No. 3, pp. 37-48
Dates: 15.09.2018; 27.12.2018

Information extraction using neural language models for the case of online job listings analysis

Dmitriy S. Botov, Senior Lecturer, dmbotov@gmail.com
Julius D. Klenin, Postgraduate student, jklen@ya.ru
Ivan E. Nikolaev, Senior Lecturer, ivan_nikolaev@csu.ru

Chelyabinsk State University

Copyright © 2018, Botov D.S., Klenin J.D., Nikolaev I.E.

Abstract

In this article we discuss an approach to information extraction (IE) using neural language models. We provide a detailed overview of modern IE methods, both supervised and unsupervised. The proposed method yields high-quality analysis of current labor market requirements without the need for a time-consuming labelling procedure. In this experiment, professional standards act as the knowledge base for the labor domain. By comparing the descriptions of work actions and requirements in professional standards with the elements of job listings, we extract four entity types. The approach is based on classifying vector representations of texts generated with several neural language models: averaged word2vec, SIF-weighted averaged word2vec, TF-IDF-weighted averaged word2vec, and paragraph2vec. In our experiments, the averaged word2vec (CBOW) model showed the best quality.

Keywords: machine learning; natural language processing; neural language models; classification method; information extraction; named entity recognition
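
The idea summarized in the abstract, classifying short texts by their averaged word vectors, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the three-dimensional toy vectors stand in for a trained word2vec model, the labels `work_action` and `required_knowledge` are hypothetical entity types, and the nearest-centroid cosine classifier is a simple stand-in, since the classifier used in the article is not specified in this record. The SIF-style weights follow the a / (a + p(w)) form; the common-component removal step of the full SIF method is omitted.

```python
import numpy as np

# Hypothetical toy word vectors; a real setup would load a trained word2vec model.
TOY_VECTORS = {
    "develop": np.array([0.9, 0.1, 0.0]),
    "software": np.array([0.8, 0.2, 0.1]),
    "test": np.array([0.7, 0.3, 0.0]),
    "knowledge": np.array([0.1, 0.9, 0.1]),
    "sql": np.array([0.2, 0.8, 0.2]),
    "python": np.array([0.2, 0.9, 0.0]),
}

# Hypothetical reference phrases, standing in for professional-standard descriptions.
REFERENCE_TEXTS = {
    "work_action": [["develop", "software"], ["test", "software"]],
    "required_knowledge": [["knowledge", "sql"], ["knowledge", "python"]],
}


def sif_weights(word_counts, a=1e-3):
    """SIF-style weights a / (a + p(w)): frequent words get smaller weights."""
    total = sum(word_counts.values())
    return {w: a / (a + c / total) for w, c in word_counts.items()}


def embed(tokens, vectors, weights=None):
    """Average the vectors of known tokens, optionally scaling each by a weight."""
    known = [t for t in tokens if t in vectors]
    if not known:
        return np.zeros(len(next(iter(vectors.values()))))
    w = np.array([1.0 if weights is None else weights.get(t, 1.0) for t in known])
    mat = np.stack([vectors[t] for t in known])
    return (w[:, None] * mat).sum(axis=0) / len(known)


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0


def classify(tokens, references, vectors, weights=None):
    """Assign the label whose reference centroid is closest by cosine similarity."""
    query = embed(tokens, vectors, weights)
    centroids = {
        label: np.mean([embed(t, vectors, weights) for t in texts], axis=0)
        for label, texts in references.items()
    }
    return max(centroids, key=lambda label: cosine(query, centroids[label]))


# Usage: classify two fragments of a (hypothetical) job listing.
label_a = classify(["develop", "test"], REFERENCE_TEXTS, TOY_VECTORS)
label_b = classify(["sql", "python"], REFERENCE_TEXTS, TOY_VECTORS)
```

Passing `weights=sif_weights(counts)` or a dictionary of IDF values to `classify` switches from the plain average to the weighted variants named in the abstract, without changing the rest of the pipeline.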