Word analysis

Several fields in computer science aim to understand how people use language. This can include analyzing the words most frequently used by certain authors, and then going one step further to ask a question such as: “Given what we know about Hemingway’s language patterns, do we believe Hemingway wrote this lost manuscript?” In this assignment, we’re going to do a basic introduction to document analysis by determining the number of unique words and the least and most frequently used words in a document.

Please read all directions for the assignment carefully. This write-up contains both the details of what your program needs to do as well as implementation requirements for how the functionality needs to be implemented.

What your program needs to do

There is one test file on Moodle, HungerGames_edit.txt, that contains the full text of Hunger Games Book 1. We have pre-processed the file to remove all punctuation and convert all words to lowercase.

Your program needs to read in the .txt file, with the name of the file to open set as a command-line argument. Your program needs to store the unique words found in the file in a dynamically allocated array and calculate and output the following information:

  • The top n most common words (n is also a command-line argument) and the number of times each word was found

  • The bottom n least common words (several words can tie for least common; print any n of them) and the number of times each word was found

  • The total number of unique non-common words in the file

  • The total number of non-common words

  • The number of array doublings needed to store all unique words in the file (read more about array doubling below)

  • The count of each search word given in the last command-line argument

Example:

Running your program using:

./Assignment3 10 HungerGames_edit.txt ignoreWords.txt meadow,listen

would print the 10 most common and 10 least common words in the file HungerGames_edit.txt, followed by the remaining statistics, and should produce the following results.

682 – is

492 – peeta

479 – its

431 – im

427 – can

414 – says

379 – him

368 – when

367 – no

356 – are

#

1 – platforms

1 – grimy

1 – married

1 – expressionless

1 – aboard

1 – engine

1 – rumble

1 – palpable

1 – nones

1 – yousays

#

Array doubled: 7

#

Unique non-common words: 7682

#

Total number of non-common words: 59157

#

meadow – 12

listen – 23

Program specifications

Use an array of structs to store the words and their counts, and a class to store the array and all the functions

There are specific requirements for how your program needs to be implemented. For this assignment, you need to use a dynamically allocated array of objects to store the words and their counts. Your struct needs to have members for the word and count:

struct wordItem
{
    string word;
    int count;
};

class WordAnalysis
{
    // Should store an array of the above struct
    // Should store an array of the 50 stop words
    // Should also have the functions that are given below
};

Exclude these top 50 common words from your word counting

Table 1 shows the 50 most common words in the English language. In your code, exclude these words from the words you count in the .txt file. The words are included in a .txt file that your code needs to read in to populate a common-word array. Your code should include a separate function, called isStopWord(), to determine whether the current word read from the .txt file is on this list, and only process the word if it is not.

Table 1. Top 50 most common words in the English language

Rank  Word     Rank  Word     Rank  Word
1     The      18    You      35    One
2     Be       19    Do       36    All
3     To       20    At       37    Would
4     Of       21    This     38    There
5     And      22    But      39    Their
6     A        23    His      40    What
7     In       24    By       41    So
8     That     25    From     42    Up
9     Have     26    They     43    Out
10    I        27    We       44    If
11    It       28    Say      45    About
12    For      29    Her      46    Who
13    Not      30    She      47    Get
14    On       31    Or       48    Which
15    With     32    An       49    Go
16    He       33    Will     50    Me
17    As       34    My

Use four command-line arguments

Your program needs to have four command-line arguments. The first argument is the number of least/most frequent words to output, the second argument is the name of the file to open and read, and the third argument is the name of the file that contains the words to ignore, also called stop words. The fourth argument is a list of search words separated by commas.

Note: DO NOT PUT SPACES BETWEEN THE WORDS YOU WANT TO SEARCH FOR; OTHERWISE THEY WILL BE TREATED AS SEPARATE ARGUMENTS.

For example, the following command searches for the words meadow and listen:

./Assignment3 20 HungerGames_edit.txt ignoreWords.txt meadow,listen
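One way to validate the four arguments before doing any work is sketched below. The Options struct and the parseArgs function are hypothetical names introduced for illustration; the assignment does not prescribe them. The sketch assumes that n must be a positive integer.

```cpp
#include <climits>
#include <cstdlib>
#include <string>

// Hypothetical container for the four command-line arguments.
struct Options {
    int n;                     // number of top/bottom words to print
    std::string textFile;      // file to analyze
    std::string stopWordFile;  // file listing the words to ignore
    std::string searchList;    // comma-separated search words
};

// Returns true and fills `opts` when exactly four arguments follow the
// program name and the first one is a positive integer.
bool parseArgs(int argc, char* argv[], Options& opts) {
    if (argc != 5) {  // program name + 4 arguments
        return false;
    }
    char* end = nullptr;
    long n = std::strtol(argv[1], &end, 10);
    // Reject empty input, trailing junk, and out-of-range values.
    if (end == argv[1] || *end != '\0' || n <= 0 || n > INT_MAX) {
        return false;
    }
    opts.n = static_cast<int>(n);
    opts.textFile = argv[2];
    opts.stopWordFile = argv[3];
    opts.searchList = argv[4];
    return true;
}
```

A main function would call parseArgs(argc, argv, opts) first and print a usage message if it returns false.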

NOTE: The words you search for can appear in the text in any form. For example, given meadow, you need to search for all the strings that start with meadow.
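A rough sketch of how the comma-separated list might be split and how a prefix search might work is shown below. The assignment declares searchCount(string word) as a class method; the free-function form here, the helper names splitOnCommas and startsWith, and the decision to add up the counts of every word that starts with the search term are assumptions made for illustration. Check the expected behavior against the course materials.

```cpp
#include <sstream>
#include <string>

struct wordItem {
    std::string word;
    int count;
};

// Hypothetical helper: splits "a,b,c" into out[0..n-1] and returns n.
// Stops once maxOut tokens are stored; empty tokens are skipped.
int splitOnCommas(const std::string& list, std::string out[], int maxOut) {
    std::istringstream stream(list);
    std::string token;
    int n = 0;
    while (n < maxOut && std::getline(stream, token, ',')) {
        if (!token.empty()) {
            out[n++] = token;
        }
    }
    return n;
}

// True when `word` begins with `prefix`.
bool startsWith(const std::string& word, const std::string& prefix) {
    return word.size() >= prefix.size() &&
           word.compare(0, prefix.size(), prefix) == 0;
}

// Adds up the counts of every stored word that starts with `prefix`.
// Returns -1 when no stored word matches, as the function comment
// for searchCount requires.
int searchCount(const wordItem items[], int numItems,
                const std::string& prefix) {
    int total = 0;
    bool found = false;
    for (int i = 0; i < numItems; ++i) {
        if (startsWith(items[i].word, prefix)) {
            total += items[i].count;
            found = true;
        }
    }
    return found ? total : -1;
}
```

Splitting into a fixed-size array keeps the sketch in line with the assignment's no-vector spirit; the caller picks maxOut.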

Use the array-doubling algorithm to increase the size of your array

We don’t know ahead of time how many unique words the input file has, so you don’t know how big the array should be. Start with an array size of 100, and double the size as words are read in from the file and the array fills up with new words. Use dynamic memory allocation to create the new array, copy the values from the current array into the new array, and then free the memory used for the current array. (Refer to the slides uploaded to Canvas for how to double an array.)

Note: some of you might wonder why we’re not using C++ vectors for this assignment. A vector is an interface to a dynamically allocated array that grows its storage automatically, typically by a constant factor such as doubling. In this assignment, you’re doing what happens behind the scenes with a vector.
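The doubling step described above (allocate, copy, free) could look something like this. The function name doubleArray and its reference parameters are illustrative choices; in the actual assignment this logic is expected to live inside addWord.

```cpp
#include <string>

struct wordItem {
    std::string word;
    int count;
};

// Replaces `items` with an array of twice the capacity, preserving the
// existing elements. Intended to be called when the array is full.
void doubleArray(wordItem*& items, int& capacity, int& numDoublings) {
    int newCapacity = capacity * 2;
    wordItem* bigger = new wordItem[newCapacity];
    for (int i = 0; i < capacity; ++i) {
        bigger[i] = items[i];  // copy every existing element
    }
    delete[] items;            // free the old array
    items = bigger;
    capacity = newCapacity;
    ++numDoublings;
}
```

As a sanity check against the sample output: starting at 100 and doubling 7 times gives a capacity of 12,800, which is the first size large enough for the 7,682 unique words reported (6 doublings give only 6,400).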

Output the n least and most frequent words

Write a function to determine the n least and most frequent words in the array. This can be a function that sorts the entire array, or a function that generates an array of the n bottom and top items. Output each group in order from most frequent to least frequent.

Format your output the following way

When you output the top n words in the file, the output needs to be in order, with the most frequent word printed first. The format for the output needs to be:

Count – Word

#

Array doubled: <number of array doublings>

#

Unique non-common words: <number of unique words>

#

Total number of non-common words: <total number of words>

#

<search word> – <count>

Generate the output with these commands:

cout << numCount << " – " << word << endl;
cout << "#" << endl;
cout << "Array doubled: " << numDoublings << endl;
cout << "#" << endl;
cout << "Unique non-common words: " << numUniqueWords << endl;
cout << word << " – " << count << endl;

and so on for the remaining lines of output.

Your code needs to have a class that will have the array of struct and also the following methods:

/*

  • Function name: isStopWord

  • Purpose: to see if a word is a stop word

  • @param word – a word (which you want to check if it is a stop word)

  • @return – true (if word is a stop word), or false (otherwise)

*/

bool isStopWord(string word);

/*

  • Function name: printTopN

  • Purpose: to print the top N high frequency words from a sorted array

  • @param topN – the number of top frequency words to print

  • @return none

*/

void printTopN(int topN);

/*

  • Function name: printBottomN

  • Purpose: to print the bottom N lowest-frequency words from a sorted array

  • @param bottomN – the number of lowest-frequency words to print

  • @return none

*/

void printBottomN(int bottomN);

/*

  • Function name: searchCount

  • Purpose: to search for the count of a given word

  • @param word – the word to search for

  • @return int – count of the given word; returns -1 if the word is not found

*/

int searchCount(string word);

/*

  • Function name: addWord

  • Purpose: to take in a string and add it to the array of structs. This function should implement the array doubling. It should also check whether the word already exists and, if it does, only increase its count. A new word should be inserted at its sorted position.

  • @param word – the word to add

  • @return none

*/

void addWord(string word);
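A possible shape for addWord is sketched below, under two assumptions that the write-up does not spell out: "sorted location" is taken to mean alphabetical order by word, and a binary search is used to find the position. The constructor parameter, the accessors size, doublings, and at, and the private helpers lowerBound and grow are hypothetical additions so the sketch can stand alone.

```cpp
#include <string>

struct wordItem {
    std::string word;
    int count;
};

class WordAnalysis {
public:
    // Assumes initialCapacity >= 1; the assignment starts at 100.
    explicit WordAnalysis(int initialCapacity = 100)
        : items(new wordItem[initialCapacity]),
          capacity(initialCapacity), numItems(0), numDoublings(0) {}
    ~WordAnalysis() { delete[] items; }
    WordAnalysis(const WordAnalysis&) = delete;
    WordAnalysis& operator=(const WordAnalysis&) = delete;

    void addWord(const std::string& word) {
        int pos = lowerBound(word);
        if (pos < numItems && items[pos].word == word) {
            ++items[pos].count;  // existing word: only bump the count
            return;
        }
        if (numItems == capacity) {
            grow();              // array is full: double it first
        }
        // Shift later entries one slot right to open position `pos`.
        for (int i = numItems; i > pos; --i) {
            items[i] = items[i - 1];
        }
        items[pos].word = word;
        items[pos].count = 1;
        ++numItems;
    }

    int size() const { return numItems; }
    int doublings() const { return numDoublings; }
    const wordItem& at(int i) const { return items[i]; }

private:
    // Binary search: first index whose word is not less than `word`.
    int lowerBound(const std::string& word) const {
        int lo = 0, hi = numItems;
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            if (items[mid].word < word) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    // Array doubling: allocate, copy, free.
    void grow() {
        wordItem* bigger = new wordItem[capacity * 2];
        for (int i = 0; i < numItems; ++i) bigger[i] = items[i];
        delete[] items;
        items = bigger;
        capacity *= 2;
        ++numDoublings;
    }

    wordItem* items;
    int capacity;
    int numItems;
    int numDoublings;
};
```

Keeping the array ordered by word makes the duplicate check a binary search rather than a full scan, at the cost of shifting elements on each insertion of a new word.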

Submitting Your Code:

Log into Canvas, go to Assignment 3, and submit a zip file or a .cpp file.
