Natural Language Processing - Overview - Text Preprocessing Tutorial

Basic-

  1. Lowercasing – Python string comparison is case sensitive, so “Basic” and “basic” are treated as two different words.

Use df['review'].str.lower() to convert the column to lowercase.

  2. Remove HTML tags – use a regex to strip HTML tags, since the markup is not part of the text.
  3. Remove URLs – use a regex to strip URLs, which are likewise not needed.
  4. Remove punctuation – i.e. !@#$% etc.
  5. Chat word treatment – expand shorthand such as asap (as soon as possible), fyi (for your information), lmao, lol (laugh out loud), gn (good night), etc.

Dictionaries of such shorthand are available on GitHub; you can use one to convert each abbreviation to its full form.
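The cleaning steps so far (lowercasing, HTML tags, URLs, punctuation, chat words) can be sketched with the standard library alone. This is a minimal illustration, not a complete cleaner: the regular expressions are deliberately simple, and `CHAT_WORDS` is a tiny hypothetical dictionary standing in for a full shorthand list.

```python
import re
import string

# Hypothetical, hand-written shorthand dictionary (an assumption for this sketch);
# a real project would load a much larger list.
CHAT_WORDS = {
    "asap": "as soon as possible",
    "fyi": "for your information",
    "lol": "laugh out loud",
    "gn": "good night",
}

HTML_TAG = re.compile(r"<[^>]+>")            # anything between < and >
URL = re.compile(r"https?://\S+|www\.\S+")   # simple URL pattern, not exhaustive
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text):
    """Apply the basic cleaning steps in order and return the cleaned string."""
    text = text.lower()                  # lowercasing
    text = HTML_TAG.sub("", text)        # remove HTML tags
    text = URL.sub("", text)             # remove URLs
    text = text.translate(PUNCT_TABLE)   # remove punctuation
    # expand chat words; unknown words are kept unchanged
    words = [CHAT_WORDS.get(word, word) for word in text.split()]
    return " ".join(words)


print(clean_text("<p>FYI, visit https://example.com ASAP!</p>"))
```

The order matters: URLs are removed before punctuation, because stripping punctuation first would break the URL pattern.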

  6. Spelling correction – you can use TextBlob or any other library to correct spelling.
  7. Removing stop words – stop words help form a sentence but contribute little to its meaning, e.g. a, the, of, are, my.

Stop words are not removed for POS tagging, because the tagger needs them for grammatical context.

from nltk.corpus import stopwords

stopwords.words('english') – ['i', 'me', 'my', 'and', 'the', …]

Create a function to remove stop words and apply it to the dataframe.
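A function of that kind might look like the sketch below. To keep it runnable without downloading the NLTK corpus, it uses a small hand-picked `STOP_WORDS` set, which is an assumption for illustration; in practice you would build the set from stopwords.words('english').

```python
# Small illustrative stop-word set (an assumption, not the NLTK list).
STOP_WORDS = {"a", "the", "of", "are", "my", "i", "me", "and"}


def remove_stopwords(text):
    """Return text with every stop word dropped; matching ignores case."""
    kept = [word for word in text.split() if word.lower() not in STOP_WORDS]
    return " ".join(kept)


print(remove_stopwords("The plot of the movie and my review"))
```

With a pandas dataframe, the same function can be applied to a text column with df['review'].apply(remove_stopwords).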

  8. Handling emojis – convert them to text with the approach below.

Python has a module named emoji that converts emojis to text.

import emoji

print(emoji.demojize('python is'))
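To show the idea without installing anything, here is a rough standard-library stand-in. It is not the emoji package and its labels will not always match that package's output: it simply replaces each character in the Unicode "Symbol, other" category with its official Unicode name. The function name `demojize_sketch` is hypothetical, and multi-character emoji (skin tones, joined sequences) are not handled.

```python
import unicodedata


def demojize_sketch(text):
    """Replace symbol characters (most emoji) with a :unicode_name: label."""
    pieces = []
    for ch in text:
        # "So" (Symbol, other) is the Unicode category most single emoji fall into
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "")
            if name:
                pieces.append(":" + name.lower().replace(" ", "_") + ":")
                continue
        pieces.append(ch)
    return "".join(pieces)


# \U0001F525 is the fire emoji, written as an escape to avoid encoding issues
print(demojize_sketch("python is \U0001F525"))
```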

  9. Tokenization

1] Sentence tokenization – divides a paragraph into sentences, using the full stop as the boundary.

2] Word tokenization – divides a paragraph into individual words.

 

Hadoop developer for multiple initiatives. Develop Big Data Strategy and Roadmap for the Enterprise

sentence_tokenization = ['Hadoop developer for multiple initiatives', 'Develop Big Data Strategy and Roadmap for the Enterprise']

word_tokenization = ['Hadoop', 'developer', 'for', 'multiple', 'initiatives', 'Develop', 'Big', 'Data', 'Strategy', 'and', 'Roadmap', 'for', 'the', 'Enterprise']

You can do this with split(), using a space as the separator for words and a full stop for sentences.

Or use a regular expression.

Or use the NLTK library (recommended):

from nltk.tokenize import word_tokenize, sent_tokenize

Or spaCy (most recommended):

import spacy

nlp = spacy.load('en_core_web_sm')
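The regular-expression option can be sketched as follows, using the example paragraph from this section. The two helper names are hypothetical, and the patterns are intentionally naive.

```python
import re

text = ("Hadoop developer for multiple initiatives. "
        "Develop Big Data Strategy and Roadmap for the Enterprise")


def simple_sent_tokenize(paragraph):
    """Split on a full stop plus any following whitespace; drop empty pieces."""
    return [part for part in re.split(r"\.\s*", paragraph) if part]


def simple_word_tokenize(paragraph):
    """Return every run of word characters as a token."""
    return re.findall(r"\w+", paragraph)


print(simple_sent_tokenize(text))
print(simple_word_tokenize(text))
```

Splitting on every full stop breaks on abbreviations such as "Dr." and on decimal numbers, which is one reason NLTK and spaCy are the recommended options.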

  10. Stemming

Stemming reduces an inflected word to its root form.

For example:

dancing, dance, danced

The stem of all three words is danc.

Lemmatization – stemming returns a root that is sometimes not a real word. Lemmatization returns a meaningful English word.

Stemming is fast; lemmatization is slower.
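The difference can be seen in a toy sketch. Neither function below is a real algorithm: `toy_stem` only strips three hard-coded suffixes (it is not the Porter stemmer), and `toy_lemmatize` looks words up in a tiny hypothetical dictionary, whereas a real lemmatizer relies on a full vocabulary, which is part of why it is slower.

```python
SUFFIXES = ("ing", "ed", "e")  # toy suffix list, checked in this order

# Tiny hypothetical lemma dictionary standing in for a real vocabulary.
TOY_LEMMAS = {"dancing": "dance", "danced": "dance", "dance": "dance"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words are returned as-is."""
    word = word.lower()
    return TOY_LEMMAS.get(word, word)


for w in ["Dancing", "dance", "danced"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

All three words stem to danc, which is not an English word, while the lemma is dance.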

 

 

 



 

Advanced- Part-of-speech tagging, chunking, parsing, co-reference resolution

 
