Smart Document Analysis Using AI-ML

Sindhu Rashmi H R; Anisha B S; Ramakanth Kumar P

Authors

Sindhu Rashmi H R, Department of Software Engineering, RV College of Engineering, Bengaluru, India
Anisha B S, Department of Software Engineering, RV College of Engineering, Bengaluru, India
Ramakanth Kumar P, Professor and Head of the Department, Department of CSE, RV College of Engineering, Bengaluru, India

Keywords:

ML (Machine Learning), Document classification, Rhyme, Non-Rhyme, Decision Tree Algorithm, Digitalization, Machine Learning Model, Random Forest Algorithm

Abstract

In this era of digitalization, documents are prepared, presented, and shared as soft copies, and classifying these documents has gained importance in recent times. Document classification is attracting attention across the digital world because of its impact in fields such as spam filtering, email routing, language identification, genre classification, sentiment analysis, and readability assessment. Classifying documents that are available online using smart techniques helps many businesses. Machine learning is an easy and efficient way of doing this, and it greatly reduces human effort. For classification to work well, documents must be presented to the machine learning classifier in a format it can process. In this report, we discuss the types of features by which a document can be classified and subsequently represented. Document classification assigns the documents in a collection to categories based on the information they contain and the features they exhibit. It is a significant learning problem that lies at the center of many information management and retrieval tasks, and it plays an important role in applications that help with organizing, classifying, searching, and concisely representing large amounts of data. We discuss the uses of document classification and the main steps involved in classifying a document or text, using a small use case to show how document classification is done: the basic steps of classification, and the processing and analysis of the collected documents. We consider two categories of data for classification and analysis. The problem is to distinguish the two: Rhyme documents, where each rhyme is stored as a single file, and Non-Rhyme documents, which contain ordinary sentences from Wikipedia text, with a few Wikipedia statements stored as a single file. The precise objective of the project is to develop a scalable and efficient document classifier that classifies documents accurately based on the features they contain, and to understand the basic techniques used for document classification, such as data collection, data cleaning, pre-processing, constructing an ML model, and applying an ML algorithm. A further objective is to work with machine learning concepts and to gain insight into different classification algorithms through this case study.
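The steps named in the abstract (cleaning, feature extraction, model construction) can be illustrated with a small sketch. This is not the authors' implementation: the spelling-based rhyme feature, the toy documents, and the function names `last_word`, `rhyme_score`, `fit_stump`, and `predict` are all assumptions made for illustration. The model is a decision stump, that is, a one-level decision tree on a single feature, chosen as the simplest relative of the decision tree algorithm listed in the keywords.

```python
import re

def last_word(line):
    """Cleaning step: lowercase the line and return its final alphabetic word."""
    words = re.findall(r"[a-z]+", line.lower())
    return words[-1] if words else ""

def rhyme_score(text, suffix_len=2):
    """Feature: fraction of adjacent line pairs whose last words share a suffix.

    Matching the last `suffix_len` letters is a crude spelling-based proxy
    for rhyme (an assumption of this sketch, not a phonetic test).
    """
    ends = [last_word(line) for line in text.splitlines() if line.strip()]
    ends = [w for w in ends if w]
    pairs = list(zip(ends, ends[1:]))
    if not pairs:
        return 0.0
    hits = sum(1 for a, b in pairs if a[-suffix_len:] == b[-suffix_len:])
    return hits / len(pairs)

def fit_stump(scores, labels):
    """Model: choose the threshold that best separates the two labels.

    A document is predicted 'rhyme' when its score is >= the threshold.
    Candidate thresholds are midpoints between consecutive distinct scores.
    """
    values = sorted(set(scores))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])] or values
    best_t, best_acc = candidates[0], -1.0
    for t in candidates:
        correct = sum((s >= t) == (y == "rhyme") for s, y in zip(scores, labels))
        acc = correct / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def predict(text, threshold):
    """Apply the fitted stump to a new document."""
    return "rhyme" if rhyme_score(text) >= threshold else "non_rhyme"

# Toy training set (hypothetical documents written for this sketch).
TRAIN = [
    ("The cat sat on the mat\nAnd wore a tiny hat\nIt saw a rat\nAnd that was that", "rhyme"),
    ("The bell will ring\nThe birds will sing\nThe moon is low\nThe rivers flow", "rhyme"),
    ("The river flows through three countries.\nIt is an important trade route.\n"
     "Several large cities lie along its banks.", "non_rhyme"),
    ("Machine learning is a field of computer science.\n"
     "It studies algorithms that improve with data.\n"
     "Many applications use statistical models.", "non_rhyme"),
]

scores = [rhyme_score(text) for text, _ in TRAIN]
labels = [label for _, label in TRAIN]
threshold = fit_stump(scores, labels)

print(predict("The frog sat on a log\nBeside a sleepy dog", threshold))
print(predict("The city was founded in 1820.\nIts population has grown steadily.", threshold))
```

A single hand-made feature keeps the example short. A fuller pipeline would likely extract several features per document and hand them to a decision tree or random forest from an ML library.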

International Journal of Innovative Research in Computer Science & Technology
