PR2EN13: Trees

Lab materials

Lab tasks

This lab has a base task that requires you to write two programs, both performing the same task. Once written, you need to compare the performance of those two tasks. Extra task has you adding a third solution and adding it to the comparison.

Lab task [W13-1]: Comparing performance of linear search and binary search tree

For this task, you will create two programs. Both of them will count how many times each name combination occurred in the data file. For the first program, we will implement a linear search algorithm using a dynamic array as the data structure. For the second one, we will create a binary search tree.

The lab should be completed by following the step-by-step guide provided.

Requirements
  • Find all the name combinations (first + last name) in the data file
  • Count the number of times each name combination occurred in the provided data file
  • Print the results in an output file
    • Use the following structure for the output:
      <first name> <last name> <count>
  • Provide two solutions for the same task
    • First one uses a dynamic array that is expanded using realloc()
    • Second one uses a binary search tree
  • Print the number of different name combinations on the screen
  • Optimize your program
    • When building the binary tree, use
  • Calculate the time it took your program to process the data
    • The calculation must at least include reading from the file and creation of the data structure.
    • You can also include printing and freeing.
    • Results must be comparable – if one program includes freeing in the time calculation, so must the other.
  • Focus on performance
    • Avoid printing the list on the screen – this has a large performance penalty
    • When creating the binary tree, use a global variable to count the number of elements
    • For printing the results in the binary tree, use a global file pointer that is only opened and closed once during the entire program.
    • Leave the name variable in the struct statically declared – i.e. char combinedName[64]
  • Before defending, prepare answers for the questions presented in the chapter #Comparing_time
Program 1: Linear search

To begin, open up the example_linear.c  and example_linear.h . Familiarize yourself with the code. Try to run it and look at the output. This will become your starter code for the first program.

Also notice that the program uses a command line arguments. To execute, use ./program_name input_file_name

Once you’ve familiarized yourself with the example code, you need to start modifying the code to fit the task.

  1. Alter the data reading function
    • The current reading function works with a single name (one word). The lab task data file contains 4 fields per row – eID, first name, last name and city.
    • Alter the fscanf()  function call to work with the extra fields. You can immediately discard the eID and city – they won’t be needed.
    • Concatenate the name as "Firstname Lastname" . The concatenated form is used to both compare with the existing names and to store the name into the struct
      Hint: snprintf() helps to combine the name
      Hint: %*s  format can be used to discard a field
  2. Count the number of times a name combination is present in the data file. Counting is done during reading of the data
    • Add the counter variable to the struct. Both  int or unsigned  will work.
    • Before going forward, look at how the provided example code finds non-recurrent values
      Hint: look at the return value of GetNameIndex()  function provided in the code. Make use of the return value, do not alter the GetNameIndex()  function itself!
    • The counter that was added to the struct needs to be initialized. Do it in the ReadData()  function. Think about when this initialization needs to occur!
    • ReadData()  also needs to handle updating the counter when the name is already present in the database (think about what  GetNameIndex()  returns when the name was found in the array).
  3. Update the print function.
    • Alter the printing statements so that the list of names would be printed to a file
    • Change what is being printed, so it would also include the number of times a name occurred.
  4. Add a print statement that displays how many different names were in the data file
  5. Add freeing of the struct memory
  6. Add the timers
    • Time needs to be measured in four different points. For this, it’s best to create a new structure
    • Place time checkpoints in your main()  function to the appropriate locations
    • To calculate the appropriate time intervals, the following function is proposed
  7. Test your code without valgrind first, check the count and the output file.
  8. Once the functionality is working, test the program with valgrind. When running in valgrind, you need to write arguments to valgrind before the executable and arguments for your program in the end. Running with valgrind will take approximately 1 minute to complete!
    E.g. valgrind ./example_linear random_people_city_data.txt
Testing the linear algorithm

When testing the program, the  two things to look for in the output is the number of different name combinations it found and the time it took for different parts of the program. Pay attention to the subparts and think critically whether they are realistic or not!

To validate the name counts, we’ve listed the beginning of the output file so you can compare it to yours.

Once you’ve made sure that the results are functionally correct, test the memory correctness using valgrind. This will take over a minute to run!

Program 2: Binary search tree

To begin, open up the example_bin_search_tree .c  and example_bin_search_tree .h . Familiarize yourself with the code. Try to run it and look at the output. This will become your starter code for the first program.

Also notice that the program uses a command line arguments. To execute, use ./program_name input_file_name

Once you’ve familiarized yourself with the example code, you need to start modifying the code to fit the task. The goal is to solve the same task, but by using a binary search tree.

  1. Alter the struct. You need to store the name and a counter, just as with linear search
  2. Alter the reading function similarly to what we did with linear search
    • Modify the fscanf()  statement to read 4 fields, you will only need the first and last name fields.
    • Concatenate the name so you can store it in a single variable.
  3. Update the function responsible for inserting nodes to the tree, i.e. InsertNode()
    • Change the function parameter – the example inserts integers to the tree, we need to insert a string
    • Change the comparison statements to work with strings (instead of integers). Make sure not to cause any unnecessary performance penalties while doing so!
    • Note what happens when you try to insert a value that already exists in the tree – the example code will ignore it and not do anything. The final solution can’t ignore it. If the name already exists, we need to count up the number of times it occurred! Add an update to the counter!
  4. Update the function responsible for allocating a new tree node, i.e. CreateNode()
    • Change the parameter type to suit the task
    • Add an initialization to the counter
  5. Update the PrintTree()  function
    • Make alterations to work with the updated data structure
    • Note: because we are time optimizing the solution, you should make the FILE *  pointer as a global variable. Open the file before calling this function and close it after.
    • The sample print function is using “pre-order” traversal. Change the traversal mode to such that the tree would be printed in the alphabetical order. For this you need to change the order of the statements are executed to print the current node and the recursive calls to the child nodes.
  6. Implement a free function for the data structure
    • Just as any tree function, this also has to be a recursive function. Take inspiration from the PrintTree()  function for how to traverse nodes. NB! You need to pick the correct traversal (pre-, in- or post-order) for the freeing to be correctly performed!
  7. Add the time measurements. This can be implemented exactly the same way as you did with the linear data structure. All statements, the data structure and the print function remain the same.
  8. Run your code through valgrind. You can also use command line arguments with valgrind.
    E.g. valgrind ./example_bin_search_tree random_people_city_data.txt
Testing binary search tree

For this testing section, we are not showing you the time intervals we got on our solution. The task of analyzing the measured time (are they realistic?) and comparing the time intervals is left as a task for you. This is also a part of the task defense.

Regardless of the algorithm, the number of name combinations needs to be the same, i.e. 27454.

We do offer you a look into the data file. Compare the beginning of the data file from our solution with yours!

Comparing runtime for programs

Once we’re sure about the correctness of our programs, let’s compare the runtime of both of them.

  • Which program was faster, which slower?
  • Was one program faster than the other one in all aspects or were there differences?
  • Were there any sub components of either of the programs that took significantly different amount of time?
  • Don’t forget to elaborate your answers!

Also think about how the following could affect the runtime of the programs

  • The binary search tree solution contains one feature that the linear search does not have.
    • Which functionality is this?
    • How would linear search be affected if this functionality would be added?
  • How would the runtime be affected by changing the realloc()  strategy from (n + 1) to (n * 2)?
  • How would the programs be affected by the increase of different (non-recurrent) name combinations . Try to express it in terms of the Big O notation (https://www.bigocheatsheet.com)
  • How would the programs be affected by increase of recurrences, when the number of different name combinations would remain the same?
  • Think about balancing of the tree
    • Would balancing the tree affect the performance?
    • What is the worst case scenario for the binary search tree (without balancing)?
  • If you solved the extra task with Trie, add it to the comparisons.

Extra task [W13-2]: Trie implementation

Create a third program to compare the efficiency. Use the trie data structure.

All the other requirements remain the same.

Recommended data structure

Compared to the sample on the slides, we don’t use a boolean true/false value to denote the end of a string (e.g. isLeaf ). Instead, we use a counter. If the counter is 0, we consider it not to be a leaf node, if the counter is > 0, it’s a leaf node where a name ends.

Hint 1: your alphabet length should also include positions for a space and a minus symbol, which are present in the names. No other special symbols are used in the file.

Hint 2: Initialization in the CreateNode()  is crucial for both the counter and the pointer array.

After this class, you should

  • Be able to measure an application’s runtime internally within the program and by using a system tools
  • Know how kernel time, user time, and real time are related and what they are
  • Know what may happen if all processes running in parallel on the system no longer fit into the computer’s main memory
  • Know the different compiler optimization levels and, at a basic level, the advantages and problems that optimization can bring
  • Know concepts related to trees, such as node, root, leaf, height, etc.
  • Know the types of trees and their properties (binary tree, binary search tree, and balanced binary tree)
  • Know different tree traversal methods (preorder, inorder, postorder) and their use cases — when to use which traversal method
  • Be able to create recursive functions for adding, searching, and deleting data in a binary tree
  • Know the properties, application areas, and structure of a Trie data structure

Additional materials