Description
OBJECTIVES
- Read a file of unknown size and store its contents in a dynamic array
- Store, search, and iterate through data in an array of structs
- Use array doubling via dynamic memory to increase the size of the array
Overview
In this assignment, we will write a program to analyze the word frequency in a document. Because the number of words in the document may not be known a priori, we will implement a dynamically doubling array to store the necessary information.
Please read all the directions before writing code, as this write-up contains specific requirements for how the code should be written.
Your Task
There are two files on Moodle. One contains the text to be read and analyzed, and is named TomSawyer.txt. As the name implies, this file contains the full text of Tom Sawyer. For your convenience, all the punctuation has been removed, all the words have been converted to lowercase, and the entire document is written on a single line. The other file contains the 50 most common words in the English language, which your program will ignore during analysis. It is called ignoreWords.txt.
Your program must take three command line arguments in the following order: a number N, the name of the text file to be read, and the name of the text file with the words that should be ignored. It will read the text (ignoring the words listed in the ignore-words file) and store all unique words in a dynamically doubling array. It should then calculate and print the following information:
- The number of times array doubling was required to store all the unique words
- The number of unique “non-ignore” words in the file
- The total word count of the file (excluding the ignore words)
- Starting from index N, the 10 most frequent words along with their probability of occurrence (up to 4 decimal places). The array should be sorted in decreasing order of probability.
Instructors: Maciej Zagrodzki, Christopher Godley
Assignment 2
For example, running your program with the command:
./Assignment2 25 TomSawyer.txt ignoreWords.txt
would print the next 10 words starting at index 25 of the sorted array built from TomSawyer.txt, not including any words in ignoreWords.txt. The full results would be:
Array doubled: 7
#
Unique non-common words: 7275
#
Total non-common words: 42962
#
Probability of next 10 words from rank 25
—————————————
0.0033 – little
0.0033 – more
0.0032 – into
0.0032 – see
0.0032 – over
0.0031 – joe
0.0030 – never
0.0030 – know
0.0030 – away
0.0030 – again
Specifics:
- Use an array of structs to store the words and their counts
There is an unknown number of words in the file. You will store each unique word and its count (the number of times it occurs in the document). Because of this, you will need to store these words in a dynamically sized array of structs. The struct must be defined as follows:
struct wordItem {
string word;
int count;
};
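As a rough illustration of how the array of structs might be searched and updated, the sketch below records one word: it either increments the count of an existing entry or appends a new entry. The helper name recordWord is hypothetical (it is not one of the required functions), and the sketch assumes the caller has already made sure the array has room for one more entry.

```cpp
#include <string>

struct wordItem {
    std::string word;
    int count;
};

// Hypothetical helper, not required by the assignment.
// Searches the first `used` entries for `word`. If found, its count is
// incremented; otherwise a new entry with count 1 is appended.
// Assumes the array has capacity for at least `used + 1` entries.
// Returns the number of entries in use after the update.
int recordWord(wordItem uniqueWords[], int used, const std::string& word) {
    for (int i = 0; i < used; i++) {
        if (uniqueWords[i].word == word) {
            uniqueWords[i].count++;
            return used;
        }
    }
    uniqueWords[used].word = word;
    uniqueWords[used].count = 1;
    return used + 1;
}
```

The linear search keeps the code simple; it is one possible approach, not the only acceptable one.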
- Use the array-doubling algorithm to increase the size of your array
Your array will need to grow to fit the number of words in the file. Start with an array size of 100, and double the size whenever the array runs out of free space. You will need to allocate your array dynamically and copy values from the old array to the new array.
Note: Don’t use the standard library’s std::vector class. This will result in a loss of points. You’re actually writing the kind of code that std::vector uses behind the scenes!
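The doubling step described above can be sketched as follows. The helper name doubleArray and its parameters are assumptions made for illustration; the assignment does not require a function with this name or signature.

```cpp
#include <string>

struct wordItem {
    std::string word;
    int count;
};

// Hypothetical helper, not required by the assignment.
// Allocates a new array with twice the capacity, copies the first `used`
// entries from the old array, frees the old array, and updates both the
// pointer and the capacity through the reference parameters.
void doubleArray(wordItem*& arr, int& capacity, int used) {
    int newCapacity = capacity * 2;
    wordItem* bigger = new wordItem[newCapacity];
    for (int i = 0; i < used; i++) {
        bigger[i] = arr[i];
    }
    delete[] arr;  // release the old storage to avoid a memory leak
    arr = bigger;
    capacity = newCapacity;
}
```

A caller would typically invoke this when the number of entries in use equals the capacity, and increment its own doubling counter at the same time.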
- Ignore the top 50 most common words that are read in from the second file
To get useful information about word frequency, we will be ignoring the 50 most common words in the English language. These words will be read in from a file, whose name is the third command line argument.
- Take three command line arguments
Your program must take three command line arguments: a number N, which tells your program the index (rank) at which to start printing the next 10 most frequent words; the name of the text file to be read and analyzed; and the name of the text file with the words that should be ignored.
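One way to check and unpack the three arguments is sketched below. The helper parseArgs is hypothetical (the assignment only requires that main perform this check), and the sketch assumes the first argument is a valid integer; std::stoi throws an exception otherwise.

```cpp
#include <iostream>
#include <string>

// Hypothetical helper, not required by the assignment.
// argc counts the program name plus the three arguments, so it must be 4.
// On success, fills in N and the two file names and returns true.
// On failure, prints the usage statement and returns false.
bool parseArgs(int argc, char* argv[], int& n,
               std::string& textFile, std::string& ignoreFile) {
    if (argc != 4) {
        std::cout << "Usage: Assignment2Solution <number of words> "
                     "<inputfilename.txt> <ignoreWordsfilename.txt>"
                  << std::endl;
        return false;
    }
    n = std::stoi(argv[1]);  // assumes argv[1] holds a valid integer
    textFile = argv[2];
    ignoreFile = argv[3];
    return true;
}
```

In main, a false return value would be followed by exiting the program.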
- Output the next 10 most frequent words starting at index N
Your program should print out the next 10 most frequent words, not including the common words, starting at index N of the sorted array, where N is passed in as a command line argument.
E.g., if N=5, then print the words at indices 5-14 of the array sorted in decreasing order.
- Format your final output this way:
Array doubled: <Number of times the array was doubled>
#
Unique non-common words: <Unique non-common words>
#
Total non-common words: <Total non-common words>
#
Probability of next 10 words from rank <N>
—————————————
<Nth highest probability> – <corresponding word>
<(N+1)th highest probability> – <corresponding word>
…
<(N+9)th highest probability> – <corresponding word>
For example, using the command:
./Assignment2 25 TomSawyer.txt ignoreWords.txt
you should get the output:
Array doubled: 7
#
Unique non-common words: 7275
#
Total non-common words: 42962
#
Probability of next 10 words from rank 25
—————————————
0.0033 – little
0.0033 – more
0.0032 – into
0.0032 – see
0.0032 – over
0.0031 – joe
0.0030 – never
0.0030 – know
0.0030 – away
0.0030 – again
- You must include the following functions (they will be tested by the autograder):
a. In your main function
- If the correct number of command line arguments is not passed, print the statement below and exit the program
std::cout << "Usage: Assignment2Solution <number of words> <inputfilename.txt> <ignoreWordsfilename.txt>" << std::endl;
- Get stop-words/common-words from ignoreWords.txt and store them in an array (call your getStopWords function)
- Read words from TomSawyer.txt and store all unique words that are not ignore-words in an array of structs
- Create a dynamic wordItem array of size 100
- Add non-ignore words to the array (double the array size if the array is full)
- Keep track of the number of times the wordItem array is doubled and the number of unique non-ignore words
b. void getStopWords(const char *ignoreWordFileName, string ignoreWords[]);
This function should read the stop words from the file with the name stored in ignoreWordFileName and store them in the ignoreWords array. You can assume there will be exactly 50 stop words. There is no return value.
In case the file fails to open, print an error message using the below cout statement:
std::cout << "Failed to open " << ignoreWordFileName << std::endl;
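A minimal sketch of this function is shown below. It assumes the stop words in the file are separated by whitespace (for example, one per line); the write-up does not state the file layout, so adjust the reading loop if your file differs. The constant NUM_STOP_WORDS is a name introduced here for illustration.

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Sketch only: reads up to 50 whitespace-separated words into ignoreWords.
// Assumes ignoreWords has room for at least 50 strings.
void getStopWords(const char *ignoreWordFileName, std::string ignoreWords[]) {
    std::ifstream in(ignoreWordFileName);
    if (!in.is_open()) {
        std::cout << "Failed to open " << ignoreWordFileName << std::endl;
        return;
    }
    const int NUM_STOP_WORDS = 50;  // illustrative name for the fixed count
    int i = 0;
    std::string word;
    while (i < NUM_STOP_WORDS && in >> word) {
        ignoreWords[i] = word;
        i++;
    }
}
```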
c. bool isStopWord(string word, string ignoreWords[]);
This function should return whether word is in the ignoreWords array.
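Because the array is passed without a length, a sketch of this function has to rely on the fixed count of 50 stop words stated earlier. The version below uses a simple linear search; NUM_STOP_WORDS is a name introduced here for illustration.

```cpp
#include <string>

// Sketch only: linear search over the 50 stop words.
// Assumes ignoreWords holds exactly 50 entries, as the write-up guarantees.
bool isStopWord(std::string word, std::string ignoreWords[]) {
    const int NUM_STOP_WORDS = 50;  // illustrative name for the fixed count
    for (int i = 0; i < NUM_STOP_WORDS; i++) {
        if (ignoreWords[i] == word) {
            return true;
        }
    }
    return false;
}
```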
d. int getTotalNumberNonStopWords(wordItem uniqueWords[], int length);
This function should compute the total number of non-stop words in the entire document by summing the counts of the individual unique words. The function should return this sum.
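The summation can be sketched in a few lines. The struct is repeated here only so the snippet compiles on its own.

```cpp
#include <string>

struct wordItem {
    std::string word;
    int count;
};

// Sketch only: adds up the count field of the first `length` entries.
int getTotalNumberNonStopWords(wordItem uniqueWords[], int length) {
    int total = 0;
    for (int i = 0; i < length; i++) {
        total += uniqueWords[i].count;
    }
    return total;
}
```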
e. void arraySort(wordItem uniqueWords[], int length);
This function should sort the uniqueWords array (which contains length initialized elements) by word count such that the most frequent words are sorted to the beginning. The function does not return anything.
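The write-up does not name a sorting algorithm, so the sketch below uses insertion sort as one possible choice. It sorts in decreasing order of count and leaves words with equal counts in their original relative order; whether the autograder expects a particular order for ties is not stated here.

```cpp
#include <string>

struct wordItem {
    std::string word;
    int count;
};

// Sketch only: insertion sort, largest count first.
// Entries with equal counts keep their original relative order.
void arraySort(wordItem uniqueWords[], int length) {
    for (int i = 1; i < length; i++) {
        wordItem key = uniqueWords[i];
        int j = i - 1;
        // Shift entries with a smaller count one slot to the right.
        while (j >= 0 && uniqueWords[j].count < key.count) {
            uniqueWords[j + 1] = uniqueWords[j];
            j--;
        }
        uniqueWords[j + 1] = key;
    }
}
```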
f. void printNext10(wordItem uniqueWords[], int N, int totalNumWords);
This function should print the next 10 words starting at index N of the sorted uniqueWords array, each with its probability of occurrence up to 4 decimal places. The exact format of this printing is given below. The function does not return anything.
Probability of occurrence of a word at position ind in the array is computed using the formula: (Don’t forget to cast to float!)
probability-of-occurrence = (float) uniqueWords[ind].count / totalNumWords
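To see why the cast matters and how four decimal places might be produced, consider the sketch below. Without the cast, both operands are int and the division truncates to 0 for any count smaller than the total. The helper formatProbability is hypothetical; it uses std::fixed and std::setprecision from the standard library, which keep trailing zeros such as the ones in 0.0030 in the sample output.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper, not required by the assignment.
// Casts to float before dividing so the result is not truncated,
// then formats the value with exactly 4 digits after the decimal point.
std::string formatProbability(int count, int totalNumWords) {
    float probability = (float) count / totalNumWords;
    std::ostringstream out;
    out << std::fixed << std::setprecision(4) << probability;
    return out.str();
}
```

The same two manipulators can be applied directly to std::cout inside printNext10 instead of going through a string.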
Output format
Probability of next 10 words from rank 25
—————————————
0.0033 – little
0.0033 – more
0.0032 – into
0.0032 – see
0.0032 – over
0.0031 – joe
0.0030 – never
0.0030 – know
0.0030 – away
0.0030 – again
Submitting your code:
Log onto Moodle and go to the Assignment 2 Submit link. It’s set up in the quiz format. Follow the instructions on each question to submit all or parts of each assignment question.