Name: Supervised Learning Techniques for Sentiment Analytics Solution
SKU: 1932
Price: 30.00 USD
Availability: InStock

Description

In this project, you will perform sentiment analysis over IMDB movie reviews and Twitter data. Your goal is to classify tweets or movie reviews as either positive or negative. To this end, you will be given labeled training data to build the model and labeled testing data to evaluate it. For classification, you will experiment with logistic regression as well as a Naive Bayes classifier from Python’s well-regarded machine learning package scikit-learn. As a point of reference, Stanford’s Recursive Neural Network code produced an accuracy of 51.1% on the IMDB dataset and 59.4% on the Twitter data.

A major part of this project is generating feature vectors for use in these classifiers. You will explore two methods: (1) a more traditional NLP technique, where the features are simply “important” words and the feature vectors are simple binary vectors, and (2) the Doc2Vec technique, where document vectors are learned via artificial neural networks (a summary can be found here).

Submission Instructions:

Make your changes directly to the sentiment.py file and submit this file on Moodle.

Project Setup

The Python packages that you will need for this project are scikit-learn, nltk, and gensim. To install these, use pip (sudo pip install X) or, if you are using Anaconda, conda (conda install X), where X is the package name.

Please use Python 2.7 for this project.

Datasets

The IMDB reviews and tweets can be found in the data folder. These have already been divided into train and test sets.

The IMDB dataset, originally found here, contains 50,000 reviews split evenly into 25k train and 25k test sets. Overall, there are 25k pos and 25k neg reviews. In the labeled train/test sets, a negative review has a score <= 4 out of 10, and a positive review has a score >= 7 out of 10. Thus reviews with more neutral ratings are not included in the train/test sets.

The Twitter Dataset, taken from here, contains 900,000 classified tweets split into 750k train and 150k test sets. The overall distribution of labels is balanced (450k pos and 450k neg).

Project Requirements

You will be provided a basic stub for the project in sentiment.py. Your job is to complete the missing parts of the file: look for the tag # YOUR CODE GOES HERE and fill in the code. Specifically, you must complete the following functions:

feature_vecs_NLP: The comments in the code should provide enough instruction. Just keep in mind that a word should be counted at most once per tweet/review even if the word has occurred multiple times in that tweet/review.
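
The "at most once per tweet/review" rule can be sketched as follows. This is only an illustration, not the stub's actual code: the helper names document_frequency and binary_vector, the toy data, and the "appears in at least 2 documents" threshold are all assumptions made for the example; the real selection criteria are given in the comments of sentiment.py.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each word, how many documents contain it."""
    counts = Counter()
    for words in docs:
        # set() collapses repeats, so a word counts once per document
        counts.update(set(words))
    return counts


def binary_vector(words, feature_words):
    """Return a 0/1 vector marking which feature words occur in the document."""
    present = set(words)
    return [1 if w in present else 0 for w in feature_words]


# Toy data (hypothetical): each document is a list of word tokens.
train_pos = [["great", "great", "movie"], ["great", "fun"]]

df = document_frequency(train_pos)

# Hypothetical importance rule: keep words found in at least 2 documents.
feature_words = sorted(w for w, c in df.items() if c >= 2)

vec = binary_vector(["great", "great", "plot"], feature_words)
```

Here "great" occurs three times in the training documents but in only two distinct documents, so its document count is 2, and the repeated "great" in the new review still produces a single 1 in the vector.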

build_models_NLP: Refer to the documentation linked above for details on how to call the functions.
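
A minimal sketch of fitting both classifiers on binary feature vectors with scikit-learn is shown below. The choice of BernoulliNB as the Naive Bayes variant, the hyperparameter values, and the toy vectors are assumptions for illustration; follow the stub's comments for the settings the assignment actually expects.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB

# Toy binary feature vectors (hypothetical), one row per document.
train_pos_vec = [[1, 0, 1], [1, 0, 0]]
train_neg_vec = [[0, 1, 1], [0, 1, 0]]

# Stack positives and negatives and build the matching label list.
X = train_pos_vec + train_neg_vec
Y = ["pos"] * len(train_pos_vec) + ["neg"] * len(train_neg_vec)

# binarize=None because the inputs are already 0/1 (assumed setting).
nb_model = BernoulliNB(alpha=1.0, binarize=None)
nb_model.fit(X, Y)

# Default hyperparameters are assumed here.
lr_model = LogisticRegression()
lr_model.fit(X, Y)

nb_pred = list(nb_model.predict([[1, 0, 0]]))
lr_pred = list(lr_model.predict([[0, 1, 0]]))
```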

feature_vecs_DOC: Some documentation for the doc2vec package can be found here. The first thing you will want to do is make a list of LabeledSentence objects from the word lists. These objects consist of a list of words and a list containing a single string label. You will want to use a different label for the train/test and pos/neg sets. For example, we used TRAIN_POS_i, TRAIN_NEG_i, TEST_POS_i, and TEST_NEG_i, where i is the line number. This blog may be a helpful reference.
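
The labeling scheme can be sketched without gensim as shown below. LabeledDoc is a hypothetical stand-in with the same shape the text describes (a word list plus a one-element label list); it is not gensim's class, and in the real code you would build gensim's labeled objects instead. Numbering lines from 0 is also an assumption.

```python
from collections import namedtuple

# Hypothetical stand-in: a list of words plus a list holding one string label.
LabeledDoc = namedtuple("LabeledDoc", ["words", "tags"])


def label_docs(docs, prefix):
    """Attach a unique label such as TRAIN_POS_0 to each document."""
    return [
        LabeledDoc(words, ["%s_%d" % (prefix, i)])
        for i, words in enumerate(docs)
    ]


# Toy word lists (hypothetical).
train_pos = [["great", "movie"], ["loved", "it"]]
train_neg = [["dull", "plot"]]

labeled_train_pos = label_docs(train_pos, "TRAIN_POS")
labeled_train_neg = label_docs(train_neg, "TRAIN_NEG")

# All labeled documents are then combined into one list for training.
sentences = labeled_train_pos + labeled_train_neg
```

Because every label is unique, the vector learned for each document can later be looked up by its label and routed to the correct train/test and pos/neg set.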

build_models_DOC: Similar to build_models_NLP.

evaluate_model: Here you will have to calculate the true positives, false positives, true negatives, false negatives, and accuracy.
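
The bookkeeping involved can be sketched as follows. The function names and the positive="pos" label are illustrative assumptions, not the stub's actual signature; the last line simply re-derives the Naive Bayes accuracy from the counts in the first IMDB sample output.

```python
def confusion_counts(actual, predicted, positive="pos"):
    """Return (tp, fp, tn, fn) for two equal-length label sequences."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if a == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Fraction of all predictions that were correct."""
    return float(tp + tn) / (tp + fp + tn + fn)


# Toy labels (hypothetical).
actual = ["pos", "pos", "neg", "neg", "neg"]
predicted = ["pos", "neg", "pos", "neg", "neg"]

counts = confusion_counts(actual, predicted)
toy_accuracy = accuracy(*counts)

# Counts taken from the sample output for "data/imdb/ 0", Naive Bayes.
imdb_nb_accuracy = accuracy(tp=10832, fp=2374, tn=10126, fn=1668)
```

float() is used so that the division also behaves correctly under Python 2.7, where dividing two integers would otherwise truncate to 0.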

You should probably test on the IMDB data first, as this runs faster, particularly when using the doc2vec technique. Your outputs should be similar to the outputs shown below.

Please generate the outputs you get and provide them in a separate PDF file named results.pdf.

Output

In each table, columns are predicted labels and rows are actual labels.

python sentiment.py data/imdb/ 0

                    Naive Bayes            Logistic Regression
                    predicted:             predicted:
                    pos       neg          pos       neg
    actual: pos     10832     1668         10759     1741
    actual: neg     2374      10126        2057      10443
    accuracy:       0.838320               0.848080

python sentiment.py data/imdb/ 1

                    Naive Bayes            Logistic Regression
                    predicted:             predicted:
                    pos       neg          pos       neg
    actual: pos     4739      7761         10362     2138
    actual: neg     2073      10427        1989      10511
    accuracy:       0.606640               0.834920

python sentiment.py data/twitter/ 0

                    Naive Bayes            Logistic Regression
                    predicted:             predicted:
                    pos       neg          pos       neg
    actual: pos     67441     7559         67701     7299
    actual: neg     52250     22750        52239     22761
    accuracy:       0.601273               0.603080

python sentiment.py data/twitter/ 1

                    Naive Bayes            Logistic Regression
                    predicted:             predicted:
                    pos       neg          pos       neg
    actual: pos     58686     16314        54009     20991
    actual: neg     50316     24684        33657     41343
    accuracy:       0.555800               0.635680
