PR2EN13: Trees

Lab materials

Lab tasks

This lab has a base task consisting of making two separate programs and comparing the performance of those programs. In addition, it has an advanced task adding a third program to the comparison.

Lab task: linear search vs binary search tree

In the base task, we will create two programs. Both of them will count how many times each name occurred in the data file. For the first program, we will implement a linear search algorithm. For the second one, we will use a binary search tree.

Follow the step-by-step guide provided.

Requirements
  • Find all the name combinations (first + last name) in the data file.
  • Count how many times each name combination occurred in the provided data file
  • Output the name combinations found
    • Note: since we are performance-focused, use a global variable for the FILE pointer in the trees task. Do not reopen the file every time you plan to print a node.
    • Use the following output structure:
      <first name> <last name> <count>
  • Provide two implementations for the same task
    • First one will use a dynamically expanding array, using realloc(), to keep track of the names and their occurrence counts
    • Second one will use a binary search tree.
  • Calculate the time it took your program to process the data
    • The calculation must at least include reading from the file and creation of the data structure.
    • You can also include printing and freeing.
    • Results must be comparable – if one program includes freeing in the time calculation, so must the other.
  • Print the number of unique combinations your program found
    • Note: Since we are performance-focused, use a global variable for the counter in the trees task
  • Your focus should be time efficiency. Leave the name variable in the struct statically declared – e.g. char combinedName[64]
Program 1: Linear search

To start, open up the example_linear.c  and familiarize yourself with the code. Try to run it and look at the output.

Also notice how the program runs – it uses a command line argument for the name of the input file. E.g. ./program_name input_file_name

  1. Alter the reading function to match the data file.
    • The current reading function assumes a single name (one word). The lab task data file contains 4 fields per row – eID, first name, last name and city.
    • Change the fscanf()  to account for the extra fields. You can immediately discard the eID and city – they won’t be needed.
    • Concatenate the name as "Firstname Lastname" . Use the concatenated form both storing into the struct and comparing for the unique name.
      Hint: sprintf() is the easiest function for this.
  2. Count the number of times a name exists
    1. Add the counter variable to the struct. Either int or unsigned .
    2. Before going forward, look at how the current code finds unique values and when it expands the array.
      Hint: look at the return value of CheckNameUnique()  function provided with the code – this will help you a lot.
    3. Add the initialization for the counter you just created into the reading function if it’s a new name (not known before).
    4. Add the update to the counter if a name already exists in our database.
  3. Alter the print function.
    1. Alter the printing statements to print to the output file.
    2. Print out the number of times each name occurred.
  4. Add printing of the number of names you got from the file
  5. Add freeing of the struct
  6. Add the timer
    1. Add two clock  variables.
    2. Place the two points when you wish to measure the time. E.g. start before reading function, end after freeing function.
    3. Calculate and print out the total time taken. Print the time in the terminal!
  7. Test your code without valgrind first, then with valgrind.
  8. Run with valgrind. When running with valgrind, you need to write arguments to valgrind before the executable and arguments for your program in the end. Running with valgrind will take approximately 1 minute to complete!
    E.g. valgrind ./example_linear random_people_city_data.txt
Program 2: Binary search tree

To start, open up the example_bin_search_tree.c  and familiarize yourself with the code. Try to run it and look at the output.

Also notice how the program runs – it uses a command line argument for the name of the input file. E.g. ./program_name input_file_name

  1. Alter the struct. You need to add the string and counter, just as with the linear search.
  2. Alter the reading function similarly to what we did before
    1. Change the fscanf() statement to read 4 fields, you will only need the first and last name fields.
    2. Concatenate the name so you can store it in a single variable.
  3. Update the insert function
    1. Change the parameter to be a string (the full name you wish to store into the tree).
    2. Change the comparison operations. Remember, besides the two given in the sample code, you also have to account for the situation that the name already exists! In that case, you need to update the counter.
  4. Update the CreateNode()  function
    1. Change the parameter so that it can take in a name
    2. Add an initialization to the counter
  5. Update the PrintTree()  function
    • Make updates related to the new data structure we have
    • Note: because we are time optimizing the solution, you can make the FILE *  pointer as a global variable. This is again an acceptable use case for the scenario we have. Alternatively, you can pass it to the function.
    • The sample print function is given as “pre-order” traversal. Change the traversal type to such that the tree would be printed in the alphabetical order. For this you need to change the order of the action done to the current node and the recursive calls to the child nodes.
  6. Implement a free function for the data structure
    • Just as any tree function, this also has to be a recursive function. Take inspiration from the PrintTree()  function for how to traverse nodes. NB! You need to pick the correct traversal (pre-, in- or post-order) for the freeing to be correctly performed!
  7. Add the timer
  8. Run your code through valgrind. You can also use command line arguments with valgrind. NB!
    E.g. valgrind ./example_bin_search_tree random_people_city_data.txt
Compare time

Now compare both of the programs on runtime. Which takes longer, why?

Also think about how the results would change, if

  • You would change from (n + 1) to (n * 2) realloc()  strategy for linear search.
  • The amount of unique names would increase. How badly would either of the programs react to it. Try to express it in terms of the Big O notation (https://www.bigocheatsheet.com)
  • The file would contain more names, but the amount of unique names remained the same (just higher counts).
  • If you did the advanced, compare the binary search tree and trie. How would adding more unique names to the program affect it compared to binary search tree – any key advantages?
Testing

The following samples include the start of the results for each of the programs you will create. First two are within the base task, last on is the advanced task.

The total number of unique name combinations should be 11440.

Note: I’m only showing the order of results for you to validate your program. I’m not showing the output of the program, which is the time it took to perform the task and the number of unique combinations it found.

Linear search
Binary search tree / Trie

Advanced task: Trie implementation

Create a third program to compare the efficiency. Use the trie data structure.

All the other requirements remain the same.

Recommended data structure

Compared to the sample on the slides, we don’t use a boolean true/false value to denote the end of a string (e.g. isLeaf ). Instead, we use a counter. If the counter is 0, we consider it not to be a leaf node, if the counter is > 0, it’s a leaf node where a name ends.

Hint 1: your alphabet length should also include positions for a space and a minus symbol, which are present in the names. No other special symbols are used in the file.

Hint 2: Initialization in the CreateNode()  is crucial for both the counter and the pointer array.

After this class, you should

Additional materials