PRETEST_PR2_25

Task

Your task is to create a program that analyzes written text (e.g. books in the public domain) and compares it to a dictionary of correctly written words.

Generic requirements

  • The program must be written in the C90, C99, C11 or C17 standard. By agreement, you can also use C23. If you choose a standard newer than C99, you are expected to actually use features of that newer standard.
  • You are permitted to use GCC extensions, C POSIX and GNU C. You are free to make your case for additional libraries, but they must be agreed upon before use.
  • The program must compile under at least one of the following:
    • Ubuntu 24.04 with GCC-13 (based on the recommended software)
    • OpenSUSE Linux 15 SP5 with GCC-12 (lab configuration)
  • Solution must be divided into source and header files as appropriate
    • You are expected to provide a Makefile with a make all recipe that compiles the entire solution from scratch
  • The code is expected to conform to practices widely accepted for C programs, including style, code division, commenting etc.
  • You are allowed to use global variables sparingly where appropriate, but they must be limited to file scope and must not harm code reuse (e.g. keeping a global struct of settings for logging). This does NOT include a global pointer to your data structure used to avoid passing data to functions.
  • The user experience for the program must be clear.
    • Any errors that occur must be described to the user clearly. The program is not allowed to “just close” without informing the user what happened.
    • Successful operations must also be confirmed (e.g. “Successfully read dictionary containing 120 391 words”)
  • The length of the dictionary file must not significantly affect the performance of the application
    • Ideally, the size of the dictionary should not affect searching at all and should only minimally affect loading and unloading the application.
    • You are therefore recommended to implement a trie data structure. You may also look for alternative data structures and algorithms; faster alternatives exist.
  • Program must manage its memory dynamically
    • Do not impose any arbitrary limit on word length or on the number of words
    • The program must not excessively over-allocate memory. Memory usage should be either exactly as needed or close to what’s needed.
    • Memory must be deallocated before exit. Deallocation will be checked using valgrind; 0 bytes in use is expected at exit.
  • You are expected to create and use a custom enum type. If you cannot find a suitable use case for a new type within the specified task requirements, you will need to add a feature of your own choosing to the task. Some ideas that might work:
    • Error handling with specified error cases
    • Multiple output file types (CSV, TSV, space-separated file)
    • Logging that uses logging levels (info, warning, error)
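
The trie recommendation and the custom enum requirement above can be illustrated together. The following is only a minimal sketch, not a required design: it assumes dictionary words consist solely of the ASCII letters a-z (so apostrophes or hyphens are rejected), and all names (trie_node, trie_status, trie_insert and so on) are hypothetical. Lookup time depends on the length of the word, not on the number of words stored.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>

#define ALPHABET_SIZE 26 /* assumption: only ASCII letters a-z are stored */

/* Hypothetical status codes; one possible use for a custom enum type. */
enum trie_status {
    TRIE_OK,
    TRIE_ERR_ALLOC,
    TRIE_ERR_INVALID_CHAR
};

struct trie_node {
    struct trie_node *children[ALPHABET_SIZE];
    bool is_word; /* true if a stored word ends at this node */
};

/* Allocates an empty node. calloc zero-fills the memory, which is assumed
 * to yield NULL child pointers (true on the Linux targets listed above). */
struct trie_node *trie_create(void)
{
    return calloc(1, sizeof(struct trie_node));
}

/* Inserts a word case-insensitively, creating nodes only where needed. */
enum trie_status trie_insert(struct trie_node *root, const char *word)
{
    assert(root != NULL && word != NULL);
    struct trie_node *node = root;
    for (; *word != '\0'; word++) {
        int c = tolower((unsigned char)*word);
        if (c < 'a' || c > 'z')
            return TRIE_ERR_INVALID_CHAR;
        int idx = c - 'a';
        if (node->children[idx] == NULL) {
            node->children[idx] = trie_create();
            if (node->children[idx] == NULL)
                return TRIE_ERR_ALLOC;
        }
        node = node->children[idx];
    }
    node->is_word = true;
    return TRIE_OK;
}

/* Returns true if the word was inserted earlier (case-insensitive). */
bool trie_contains(const struct trie_node *root, const char *word)
{
    const struct trie_node *node = root;
    for (; node != NULL && *word != '\0'; word++) {
        int c = tolower((unsigned char)*word);
        if (c < 'a' || c > 'z')
            return false;
        node = node->children[c - 'a'];
    }
    return node != NULL && node->is_word;
}

/* Recursively frees every node so that valgrind reports 0 bytes in use. */
void trie_free(struct trie_node *node)
{
    if (node == NULL)
        return;
    for (int i = 0; i < ALPHABET_SIZE; i++)
        trie_free(node->children[i]);
    free(node);
}
```

Note the trade-off in this sketch: a fixed array of 26 child pointers per node keeps lookups simple but uses more memory than strictly needed, so a more compact child representation may fit the memory requirement better.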

Task requirements

  • The program must offer a basic interactive experience to perform tasks (e.g. a menu or step-by-step input prompts).
    • The program can also offer command line options, if the author so chooses, but it’s not a requirement. If command line arguments are used, the supplied arguments should cause the relevant prompts to be skipped and must be documented in the README.
  • Program must provide the following features
    • Read a dictionary (reference list of correctly written words)
    • Analyze a plain text document (book, short story, paragraph of text)
    • Provide analysis reports of the text document
    • The user must be able to choose the name of the dictionary file and/or the text document to analyze
  • The user must be able to get the following reports (depending on their selection)
    • The list of words in the document in alphabetical order. Output must include both the word and number of occurrences in the text.
    • The list of unrecognized words (not present in the dictionary), ordered by the number of times they were used in the document.
  • The result will be written into a text file formatted as CSV with a header row. The name of the results file must contain a timestamp of when the file was created, and the file must be stored in an appropriate subdirectory of the program
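
The report requirements above (alphabetical order, ordering by occurrence count, a CSV header row and a timestamped file name) could be approached roughly as follows. This is a sketch under stated assumptions, not a prescribed solution: the struct word_count type, the function names, the "word,count" header and the YYYYMMDD_HHMMSS stamp format are all hypothetical, and ordering by count is assumed to mean most frequent first.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h> /* qsort, used by the caller */
#include <string.h>
#include <time.h>

/* Hypothetical record type: one row of a report. */
struct word_count {
    const char *word;
    unsigned long count;
};

/* qsort comparator: alphabetical order by word. */
int cmp_alpha(const void *a, const void *b)
{
    const struct word_count *x = a, *y = b;
    return strcmp(x->word, y->word);
}

/* qsort comparator: highest count first (assumed direction),
 * ties broken alphabetically so the output is deterministic. */
int cmp_count_desc(const void *a, const void *b)
{
    const struct word_count *x = a, *y = b;
    if (x->count != y->count)
        return (x->count < y->count) ? 1 : -1;
    return strcmp(x->word, y->word);
}

/* Builds "<dir>/<prefix>_YYYYMMDD_HHMMSS.csv" in buf.
 * Returns 0 on success, -1 if the result does not fit. */
int build_report_path(char *buf, size_t size, const char *dir,
                      const char *prefix, const struct tm *when)
{
    char stamp[32];
    if (strftime(stamp, sizeof stamp, "%Y%m%d_%H%M%S", when) == 0)
        return -1;
    int n = snprintf(buf, size, "%s/%s_%s.csv", dir, prefix, stamp);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Writes the header row followed by one line per record.
 * Returns 0 on success, -1 on a write error. */
int write_report(FILE *out, const struct word_count *rows, size_t n)
{
    assert(out != NULL);
    if (fprintf(out, "word,count\n") < 0)
        return -1;
    for (size_t i = 0; i < n; i++) {
        if (fprintf(out, "%s,%lu\n", rows[i].word, rows[i].count) < 0)
            return -1;
    }
    return 0;
}
```

A caller would typically sort the records with qsort and one of the comparators, obtain the struct tm from time and localtime, and make sure the target subdirectory exists (for example with POSIX mkdir) before opening the file. Because the words here contain only letters, no CSV quoting is attempted.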

Data

Your program is expected to work with ASCII-encoded text files; however, you are free to experiment with other encodings, such as UTF-8 or UTF-16.

Dictionary files are plain-text files where each word is on a separate line.
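
Since each dictionary word sits on its own line and no fixed word length may be assumed, one possible way to read the file is POSIX getline, which grows its buffer as needed. The sketch below is an illustration only; the function names read_dictionary_word and count_dictionary_words are hypothetical, and skipping empty lines and stripping a trailing carriage return are assumptions about how the input should be treated.

```c
#define _POSIX_C_SOURCE 200809L /* for getline */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h> /* ssize_t */

/* Reads the next non-empty line into *line, which getline (re)allocates
 * as needed, and strips the trailing "\n" or "\r\n".
 * Returns the word length, or -1 on end of file or read error. */
ssize_t read_dictionary_word(FILE *in, char **line, size_t *cap)
{
    assert(in != NULL && line != NULL && cap != NULL);
    ssize_t len;
    while ((len = getline(line, cap, in)) != -1) {
        while (len > 0 &&
               ((*line)[len - 1] == '\n' || (*line)[len - 1] == '\r'))
            (*line)[--len] = '\0';
        if (len > 0)
            return len;
    }
    return -1;
}

/* Example use: counts the words in a dictionary stream.
 * Returns the count, or -1 if a read error occurred. */
long count_dictionary_words(FILE *in)
{
    char *line = NULL; /* buffer owned by getline, freed below */
    size_t cap = 0;
    long count = 0;
    while (read_dictionary_word(in, &line, &cap) != -1)
        count++;
    free(line);
    return ferror(in) ? -1 : count;
}
```

In a full program, the body of the loop would insert each word into the dictionary data structure instead of only counting, and the final count could feed the confirmation message shown to the user.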

Documents are plain-text files containing written sentences that may form a longer story. They may include consecutive empty lines, punctuation marks, and upper- and lower-case words. Words containing numbers can be ignored (e.g. 1st, 6th). Words written using different capitalization must be identified as the same word (e.g. “HELLO!!!!”, “Hello!” and “hello” all contain the same word).
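
The normalization rules above can be sketched as a small in-place helper. This is one possible interpretation rather than a definitive rule set: the name normalize_word is hypothetical, and dropping every non-letter character is an assumption (it turns “don't” into “dont”, which may or may not be the behaviour you want).

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Normalizes a whitespace-delimited token in place: letters are lowercased
 * and all other non-digit characters (punctuation) are dropped.
 * Returns false if the token contains a digit or no letters at all;
 * in that case the token may be left partially modified and should be
 * ignored by the caller. */
bool normalize_word(char *token)
{
    assert(token != NULL);
    char *dst = token;
    for (const char *src = token; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;
        if (isdigit(c))
            return false; /* e.g. "1st", "6th" */
        if (isalpha(c))
            *dst++ = (char)tolower(c);
    }
    *dst = '\0';
    return dst != token; /* false for tokens such as "..." */
}
```

Working in place avoids a second buffer, since the normalized word is never longer than the original token. With this helper, “HELLO!!!!”, “Hello!” and “hello” all become “hello”.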

Useful links

Note: The Estonian dictionary and many texts in Gutenberg are provided with UTF-8 encoding. Your program will be tested only using ASCII strings, but these files are a good source of realistic data.

Word lists for English: https://github.com/dwyl/english-words

Word list for Estonian: https://github.com/binoternary/diceware-ee

Public domain books as plain text files: https://www.gutenberg.org

Lorem Ipsum generator: https://www.lipsum.com

Submitting

The preferred method of submitting is by providing a private Git repository that contains all the source files, the Makefile and data files to test the solution (at least 2 dictionary files and 2 written text files: one simple case to test correctness and one large case to test performance). A README file should describe the project and how to build it, including any other necessary details.

You are also expected to provide some examples (screenshots) with supporting text explanations of your program running, in a format of your choosing – e.g. they can be written into the Wiki section, provided as part of README.md or a secondary Markdown file, or included as a PDF in the repository.

It’s preferred to use our department GitLab instance https://gitlab.pld.ttu.ee, accessible using your Uni-ID. Create a private project and add your instructor to the project (handle: risto.heinsar).

Once the task is completed, notify your instructor in Mattermost.

NB! If you are not comfortable using Git, you can also provide the solution as a .zip file through Mattermost.

Task extension to homework 3

Note that this task can also be extended to Homework 3. All features included in the extension are additive to this task and are agreed upon separately, including a separately agreed deadline. The extension will include adding further features as well as implementing a local SQLite database. To extend the task, ask the instructor about the requirements.