CS113 Lab 11: Let’s hash it out

Due Thursday December 8th, before midnight, for extra credit

Goals

  • Read data from a file

  • Use ArrayList and HashMap to analyze text

You are given input files for this assignment. Check your dropbox to get started!
This is an Extra Credit assignment! But to receive the extra credit, you must submit the full assignment. Partial submissions will not be accepted.

1. Word

In a file, Word.java, implement a class that stores a word along with a count.

Data

  • private String text: a single word

  • private int count: the number of times the word appears

Methods

  • Constructor public Word(String text): should initialize the text and set the count to zero

  • public void incrementCount(): increments count by one

  • public int getCount(): returns the count

  • public void setCount(int c): sets the count

  • public String toString(): returns a string of the form "<text>: <count>", for example, "Test: 0"

Below is an implementation of main that tests the methods in your class.

  public static void main(String[] args) {
    Word word = new Word("Test");
    System.out.println(word); // should print "Test: 0"

    word.incrementCount();
    System.out.println(word); // should print "Test: 1"

    word.setCount(10);
    System.out.println(word); // should print "Test: 10"
    System.out.println(word.getCount()); // should print "10"
  }

2. Word counts

In a file, WordCounts.java, implement a program that counts the number of times each word appears in a file. Your program should take its inputs as command line arguments: a filename, a count threshold, and (optionally) a file containing a list of words to ignore in the analysis.

For example, the following command runs WordCounts on "The Cat in the Hat". All words with counts greater than 10 are displayed.

$ java-introcs WordCounts catinhat.txt 10
The number of unique words is: 238
The number of words with counts greater than 10: 46
the: 97
and: 69
i: 59
not: 41
said: 39
you: 34
a: 33
to: 32
in: 27
we: 26
cat: 26
do: 25
that: 25
with: 24
is: 22
will: 21
then: 20
fish: 20
he: 18
things: 18
up: 17
on: 17
this: 16
so: 16
no: 16
they: 16
of: 16
look: 16
all: 15
can: 15
two: 15
like: 14
what: 14
at: 14
hat: 14
one: 13
our: 13
it: 13
have: 13
oh: 13
thing: 12
mother: 12
now: 11
them: 11
should: 11
down: 11

In the next example, we run WordCounts on "The Cat in the Hat" again, but this time we ignore small words, such as "the", "a", "he", etc.

$ java-introcs WordCounts catinhat.txt 10 ignorewords.txt
The number of unique words is: 179
The number of words with counts greater than 10: 17
cat: 26
will: 21
fish: 20
things: 18
up: 17
look: 16
no: 16
can: 15
two: 15
like: 14
hat: 14
oh: 13
our: 13
thing: 12
mother: 12
down: 11
should: 11

Requirements:

  • Use the class In, defined in java-introcs, to load the file, e.g. In file = new In(filename)

  • Your program should use command line arguments.

  • Use a HashMap to store <String,Word> pairs.

  • To sort, create an ArrayList<Word> that contains all words with count greater than <NumWords>. Then use a sorting algorithm to sort from greatest to least count.

  • You can load the list of "ignore" words into an ArrayList<String> using In. ArrayList<String> has a method contains that will return true if the list contains a word.
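
The requirements above can be sketched roughly as follows. This is only an illustration under several assumptions: the class name CountSketch and the methods countWords and frequentWords are hypothetical, the nested Word class is a minimal stand-in for your own Word.java, and the tokens are hard-coded rather than read with In so that the sketch runs without java-introcs. It also sorts with the library method ArrayList.sort and a comparator; if the course expects you to write the sorting algorithm yourself, replace that line.

```java
import java.util.ArrayList;
import java.util.HashMap;

public class CountSketch {
    // Minimal stand-in for the Word class from Part 1 (assumption: use your own Word.java).
    public static class Word {
        private String text;
        private int count;
        public Word(String text) { this.text = text; this.count = 0; }
        public void incrementCount() { count++; }
        public int getCount() { return count; }
        public String toString() { return text + ": " + count; }
    }

    // Count every token that does not appear in the ignore list.
    public static HashMap<String, Word> countWords(String[] tokens, ArrayList<String> ignore) {
        HashMap<String, Word> counts = new HashMap<>();
        for (String token : tokens) {
            if (ignore.contains(token)) {
                continue; // skip ignored words entirely
            }
            Word w = counts.get(token);
            if (w == null) { // first time we see this word
                w = new Word(token);
                counts.put(token, w);
            }
            w.incrementCount();
        }
        return counts;
    }

    // Collect the words whose count is greater than the threshold,
    // then sort them from greatest to least count.
    public static ArrayList<Word> frequentWords(HashMap<String, Word> counts, int threshold) {
        ArrayList<Word> result = new ArrayList<>();
        for (Word w : counts.values()) {
            if (w.getCount() > threshold) {
                result.add(w);
            }
        }
        result.sort((a, b) -> Integer.compare(b.getCount(), a.getCount()));
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = {"the", "cat", "the", "hat", "cat", "the"};
        ArrayList<String> ignore = new ArrayList<>();
        ignore.add("hat");

        HashMap<String, Word> counts = countWords(tokens, ignore);
        System.out.println("The number of unique words is: " + counts.size()); // 2
        for (Word w : frequentWords(counts, 1)) {
            System.out.println(w); // prints "the: 3" then "cat: 2"
        }
    }
}
```

Note that ignored words never enter the HashMap in this sketch, which is consistent with the sample runs, where the number of unique words drops when an ignore file is given.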

HINT: Convert all strings to lowercase and strip punctuation for each word before updating the counts in your HashMap. For example,

String text = file.readString();
text = text.toLowerCase();
text = text.replaceAll("[^a-z]", "");

You can check your counts from the command line. For example, to output the number of times "the" appears in a file, run grep -o -i -r "\bthe\b" catinhat.txt | wc -l. To count a different word, put it between the two "\b" symbols in place of "the".
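
One consequence of the lowercase-and-strip hint is worth seeing in a small example: a token made only of punctuation becomes an empty string, and apostrophes disappear from contractions. The sketch below is an illustration, not a required design; the class name NormalizeSketch and the method normalize are hypothetical, and the tokens are hard-coded instead of being read with In.

```java
public class NormalizeSketch {
    // Lowercase a token and drop every character that is not a letter a-z.
    public static String normalize(String token) {
        return token.toLowerCase().replaceAll("[^a-z]", "");
    }

    public static void main(String[] args) {
        String[] raw = {"Hat!", "\"Cat,\"", "--", "Don't"};
        for (String token : raw) {
            String word = normalize(token);
            if (word.isEmpty()) {
                continue; // "--" has no letters left, so do not count it
            }
            System.out.println(word); // prints hat, cat, dont (one per line)
        }
    }
}
```

Skipping empty strings before updating the HashMap keeps an empty "word" from showing up in your counts.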

Optional command line arguments

This assignment uses an optional command line argument. Below is example code for handling an optional argument and printing usage information if the user inputs the wrong number of command line arguments.

  public static void main(String[] args) {

    if (args.length < 2 || args.length > 3) {
      System.out.println("Usage: java-introcs WordCounts <filename> <NumWords> [<IgnoreWords>]");
      System.exit(0);
    }

    int N = Integer.parseInt(args[1]);
    String ignorefile = "";
    if (args.length == 3) {
      ignorefile = args[2];
    }

    // your work here
    // Only load the ignorefile if the filename is not empty
  }
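
The comment about loading the ignore file only when a filename was given could be handled along these lines. This is a sketch under assumptions: the class name IgnoreListSketch and the method loadIgnoreWords are hypothetical, and the file is read with java.nio.file.Files rather than In so that the example runs without java-introcs. In your own program, use In as the requirements state.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;

public class IgnoreListSketch {
    // Returns the words in the ignore file, or an empty list if no file was given.
    public static ArrayList<String> loadIgnoreWords(String ignorefile) {
        ArrayList<String> words = new ArrayList<>();
        if (ignorefile.isEmpty()) {
            return words; // optional argument was omitted: nothing to ignore
        }
        try {
            String contents = new String(Files.readAllBytes(Paths.get(ignorefile)));
            // Split on any run of whitespace (spaces, tabs, newlines).
            for (String w : contents.trim().split("\\s+")) {
                if (!w.isEmpty()) {
                    words.add(w.toLowerCase());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        // Same convention as the example above: an empty string means "no ignore file".
        String ignorefile = (args.length == 3) ? args[2] : "";
        ArrayList<String> ignore = loadIgnoreWords(ignorefile);
        System.out.println("Ignoring " + ignore.size() + " words");
    }
}
```

Returning an empty list when no file is given means the rest of the program can call contains on the list without checking whether the optional argument was present.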

More examples

Below is sample output using your program WordCounts.java for the text "tolstoy.txt".

"tolstoy.txt" is the same input you used in Assignment 04: Letter Frequency.

3. What to hand in

  1. The programs, Word.java and WordCounts.java

  2. Make sure your programs have a header containing your name, date, and description.

  3. Make sure your program can run from the Dropbox directory. You should include the data files and any supporting files your program needs to run.

  4. No write-up is necessary for this submission, although feel free to add one if you would like to communicate any feedback to us!

3.1. How to hand in

  1. Copy your files to your dropbox, into the folder called A11.