PR2EN13: Trees

Lab materials

Slides: Trees

Lab tasks

This lab has a base task consisting of making two separate programs and comparing the performance of those programs. In addition, it has an advanced task adding a third program to the comparison.

Lab task: linear search vs binary search tree

In the base task, we will create two programs. Both of them will count how many times each name occurred in the data file. For the first program, we will implement a linear search algorithm. For the second one, we will use a binary search tree.

Download the sample applications you will base your lab task on: https://blue.pri.ee/ttu/files/iax0584/aluskoodid/13_trees_samples.zip
Download the lab task data file: https://blue.pri.ee/ttu/files/iax0584/andmefailid/13_trees_random_people_city_data.zip

Follow the step-by-step guide provided.

Requirements

Find all the name combinations (first + last name) in the data file.
Count how many times each name combination occurred in the provided data file
Output the name combinations found
- Note: since we are performance-focused, use a global variable for the FILE pointer in the trees task. Do not reopen the file every time you plan to print a node.
- Use the following output structure:
  <first name> <last name> <count>
Provide two implementations for the same task
- First one will use a dynamically expanding array, using realloc(), to keep track of the names and their occurrence counts
- Second one will use a binary search tree.
Calculate the time it took your program to process the data
- The calculation must at least include reading from the file and creation of the data structure.
- You can also include printing and freeing.
- Results must be comparable – if one program includes freeing in the time calculation, so must the other.
Print the number of unique combinations your program found
- Note: Since we are performance-focused, use a global variable for the counter in the trees task
Your focus should be time efficiency. Leave the name variable in the struct statically declared – e.g. char combinedName[64]

Program 1: Linear search

To start, open up the example_linear.c and familiarize yourself with the code. Try to run it and look at the output.

Also notice how the program runs – it uses a command line argument for the name of the input file. E.g. ./program_name input_file_name

Alter the reading function to match the data file.
- The current reading function assumes a single name (one word). The lab task data file contains 4 fields per row – eID, first name, last name and city.
- Change the fscanf() to account for the extra fields. You can immediately discard the eID and city – they won’t be needed.
- Concatenate the name as "Firstname Lastname" . Use the concatenated form both storing into the struct and comparing for the unique name.
  Hint: sprintf() is the easiest function for this.
Count the number of times a name exists
1. Add the counter variable to the struct. Either int or unsigned .
2. Before going forward, look at how the current code finds unique values and when it expands the array.
  Hint: look at the return value of CheckNameUnique() function provided with the code – this will help you a lot.
3. Add the initialization for the counter you just created into the reading function if it’s a new name (not known before).
4. Add the update to the counter if a name already exists in our database.
Alter the print function.
1. Alter the printing statements to print to the output file.
2. Print out the number of times each name occurred.
Add printing of the number of names you got from the file
Add freeing of the struct
Add the timer
1. Add two clock variables.
2. Place the two points when you wish to measure the time. E.g. start before reading function, end after freeing function.
3. Calculate and print out the total time taken. Print the time in the terminal!
Test your code without valgrind first, then with valgrind.
Run with valgrind. When running with valgrind, you need to write arguments to valgrind before the executable and arguments for your program in the end. Running with valgrind will take approximately 1 minute to complete!
E.g. valgrind ./example_linear random_people_city_data.txt

Program 2: Binary search tree

To start, open up the example_bin_search_tree.c and familiarize yourself with the code. Try to run it and look at the output.

Also notice how the program runs – it uses a command line argument for the name of the input file. E.g. ./program_name input_file_name

Alter the struct. You need to add the string and counter, just as with the linear search.
Alter the reading function similarly to what we did before
1. Change the fscanf() statement to read 4 fields, you will only need the first and last name fields.
2. Concatenate the name so you can store it in a single variable.
Update the insert function
1. Change the parameter to be a string (the full name you wish to store into the tree).
2. Change the comparison operations. Remember, besides the two given in the sample code, you also have to account for the situation that the name already exists! In that case, you need to update the counter.
Update the CreateNode() function
1. Change the parameter so that it can take in a name
2. Add an initialization to the counter
Update the PrintTree() function
- Make updates related to the new data structure we have
- Note: because we are time optimizing the solution, you can make the FILE * pointer as a global variable. This is again an acceptable use case for the scenario we have. Alternatively, you can pass it to the function.
- The sample print function is given as “pre-order” traversal. Change the traversal type to such that the tree would be printed in the alphabetical order. For this you need to change the order of the action done to the current node and the recursive calls to the child nodes.
Implement a free function for the data structure
- Just as any tree function, this also has to be a recursive function. Take inspiration from the PrintTree() function for how to traverse nodes. NB! You need to pick the correct traversal (pre-, in- or post-order) for the freeing to be correctly performed!
Add the timer
Run your code through valgrind. You can also use command line arguments with valgrind. NB!
E.g. valgrind ./example_bin_search_tree random_people_city_data.txt

Compare time

Now compare both of the programs on runtime. Which takes longer, why?

Also think about how the results would change, if

You would change from (n + 1) to (n * 2) realloc() strategy for linear search.
The amount of unique names would increase. How badly would either of the programs react to it. Try to express it in terms of the Big O notation (https://www.bigocheatsheet.com)
The file would contain more names, but the amount of unique names remained the same (just higher counts).
If you did the advanced, compare the binary search tree and trie. How would adding more unique names to the program affect it compared to binary search tree – any key advantages?

Testing

The following samples include the start of the results for each of the programs you will create. First two are within the base task, last on is the advanced task.

The total number of unique name combinations should be 11440.

Note: I’m only showing the order of results for you to validate your program. I’m not showing the output of the program, which is the time it took to perform the task and the number of unique combinations it found.

Linear search

Ingrid Purga 14
Angela Leppik 13
Hendri Meier 21
Mihhail Saal 18
Aivar Tiidus 14
Julia Leppik 15
Ingrid Kaasik 18
Marko Lepik 14
Ants Ratas 26
Heimar Tomson 12
Yana Kuningas 10
Aivar Kurg 17
Edgar Johanson 14
Eliise Saks 14
Kristi Saar 18
...

Ingrid Purga 14

Angela Leppik 13

Hendri Meier 21

Mihhail Saal 18

Aivar Tiidus 14

Julia Leppik 15

Ingrid Kaasik 18

Marko Lepik 14

Ants Ratas 26

Heimar Tomson 12

Yana Kuningas 10

Aivar Kurg 17

Edgar Johanson 14

Eliise Saks 14

Kristi Saar 18

...

Binary search tree / Trie

Ain Aas 15
Ain Aasa 19
Ain Allik 11
Ain Anvelt 18
Ain Hein 17
Ain Heinsoo 20
Ain Helme 16
Ain Herkel 17
Ain Hunt 13
Ain Ilves 15
Ain Ivanov 12
Ain Jakobson 13
Ain Johanson 16
Ain Kaasik 20
Ain Kalda 9
...

Ain Aas 15

Ain Aasa 19

Ain Allik 11

Ain Anvelt 18

Ain Hein 17

Ain Heinsoo 20

Ain Helme 16

Ain Herkel 17

Ain Hunt 13

Ain Ilves 15

Ain Ivanov 12

Ain Jakobson 13

Ain Johanson 16

Ain Kaasik 20

Ain Kalda 9

...

Advanced task: Trie implementation

Create a third program to compare the efficiency. Use the trie data structure.

All the other requirements remain the same.

Recommended data structure

Compared to the sample on the slides, we don’t use a boolean true/false value to denote the end of a string (e.g. isLeaf ). Instead, we use a counter. If the counter is 0, we consider it not to be a leaf node, if the counter is > 0, it’s a leaf node where a name ends.

typedef struct node
{
    char ch;
    int count;
    struct node *chars[ALPHA_LEN];
} trie_t;

typedef struct node

{

char ch;

int count;

struct node *chars[ALPHA_LEN];

} trie_t;

Hint 1: your alphabet length should also include positions for a space and a minus symbol, which are present in the names. No other special symbols are used in the file.

Hint 2: Initialization in the CreateNode() is crucial for both the counter and the pointer array.

After this class, you should

Additional materials

Binary search tree
https://www.geeksforgeeks.org/binary-search-tree-data-structure/
Binary search tree visualization
https://www.cs.usfca.edu/~galles/visualization/BST.html
Trashing
https://en.wikipedia.org/wiki/Thrashing_(computer_science)