Smart Document Analysis Using AI-ML

Authors

  • Sindhu Rashmi H R, Department of Software Engineering, RV College of Engineering, Bengaluru, India
  • Anisha B S, Department of Software Engineering, RV College of Engineering, Bengaluru, India
  • Ramakanth Kumar P, Professor and Head of the Department, Department of CSE, RV College of Engineering, Bengaluru, India

Keywords:

ML (Machine Learning), Document classification, Rhyme, Non-Rhyme, Decision Tree Algorithm, Digitalization, Machine Learning Model, Random Forest Algorithm

Abstract

In this era of digitalization, most documents are prepared, presented and shared as soft copies. Classifying these documents has gained importance in recent years because of its impact in fields such as spam filtering, email routing, language identification, genre classification, sentiment analysis and readability assessment. Classifying the documents that are available online with smart techniques also helps businesses in many domains. Machine learning is an easy and efficient way to do this, and it greatly reduces the manual work involved. To classify documents reliably, they must be given to the machine learning classifier in a format it can process. In this report, we discuss the types of features on which a document can be classified and later represented. Document classification assigns the documents in a collection to categories based on the information and features they contain. It is a major learning problem at the center of many information management and retrieval tasks, and it plays an important role in applications that organize, index, search and concisely summarize large amounts of data. We discuss the uses of document classification and its basic steps through a small use case that covers collecting, processing and analyzing the documents. We consider two categories of data sets for classification and analysis. The problem is to distinguish between them: Rhyme documents, where each rhyme is stored as a single file, and Non-Rhyme documents, where each file holds a few ordinary sentences taken from Wikipedia. The objective of the project is to develop a scalable and efficient document classifier that classifies documents precisely according to the features they contain, and to learn the basic techniques used in document classification, such as data collection, data cleaning, pre-processing, constructing an ML model and applying the ML algorithm. A further objective is to work with machine learning concepts and to gain insight into different classification algorithms through this case study.
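The steps named in the abstract (cleaning, pre-processing, feature extraction and model construction) can be illustrated with a minimal sketch. It is not the authors' implementation: the single feature (how often consecutive lines end in the same two letters), the depth-1 decision tree (a decision stump) standing in for the Decision Tree and Random Forest algorithms listed in the keywords, the function names and the toy texts are all assumptions made for illustration.

```python
import re

def clean(text):
    """Data cleaning: lowercase and drop everything except letters and whitespace."""
    return re.sub(r"[^a-z\s]", "", text.lower())

def rhyme_score(text, suffix_len=2):
    """Feature: fraction of adjacent line pairs whose last words share a suffix."""
    endings = [line.split()[-1][-suffix_len:]
               for line in clean(text).splitlines() if line.split()]
    pairs = list(zip(endings, endings[1:]))
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)

def fit_stump(scores, labels):
    """Depth-1 decision tree: pick the threshold with the best training accuracy."""
    uniq = sorted(set(scores))
    # Candidate splits are midpoints between neighbouring feature values.
    candidates = [(a + b) / 2 for a, b in zip(uniq, uniq[1:])] or uniq
    best_threshold, best_correct = candidates[0], -1
    for t in candidates:
        correct = sum((s >= t) == y for s, y in zip(scores, labels))
        if correct > best_correct:
            best_threshold, best_correct = t, correct
    return best_threshold

def predict(text, threshold):
    """Return True for Rhyme, False for Non-Rhyme."""
    return rhyme_score(text) >= threshold

# Hypothetical toy corpus (not the paper's data): True = Rhyme, False = Non-Rhyme.
train = [
    ("The cat sat on the mat\nIt wore a funny hat", True),
    ("The rain came down all day\nThe children could not play", True),
    ("Bengaluru is a city in India.\nIt is the capital of Karnataka.", False),
    ("A decision tree is a model.\nIt splits data on feature values.", False),
]
scores = [rhyme_score(text) for text, _ in train]
labels = [label for _, label in train]
threshold = fit_stump(scores, labels)

print(predict("I saw a little bee\nIt flew around a tree", threshold))
print(predict("Wikipedia is a free encyclopedia.\nAnyone can edit its articles.", threshold))
```

A suffix match is only a rough proxy for rhyme (it misses pairs such as "star" and "are"), so a realistic classifier would need richer features and a held-out test set.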

Published

2019-05-05

How to Cite

Smart Document Analysis Using AI-ML. (2019). International Journal of Innovative Research in Computer Science & Technology, 7(3), 54–70. Retrieved from https://acspublisher.com/journals/index.php/ijircst/article/view/13379