Description
This assignment is about finding the similarities in given sentences.
Id Text
-
In the past John liked only sport but now he likes sport and politics
-
Sam only liked politics but now he is fan of both music and politics
-
Sara likes both books and politics but in the past she only read books
-
Robert loved both books and nature but now he only reads books
-
Linda liked books and sport but she only likes sport now
-
Alison used to loved nature but currently she likes both nature and sport
Using Python language, perform the followings NLP tasks to find the similarities between the given sentences:
-
Using NLTK word_tokenize function, tokenize the given sentences
-
Using NLTK PorterStemmer, perform the stemming for the tokens of the sentences
-
Using NLTK WordNetLemmatizer, perform the lemmatization for the stemmed tokens
-
Using sklearn K-means clustering technique, cluster the given sentences. Find the feature vectors for the input of a K-means algorithm using the below techniques. Also, find an appropriate K-value using a KneeLocator method from the python kneed library.
-
-
TF-IDF
-
TF
-
BOW
-
Word2Vec
-
-
Visualize the clusters using the word clouds.
Tips:
-
You will need to adjust the vector size in order to find the appropriate k-value