Week 1
The goal of this week is to pick a research topic that is interesting to me and derive a summer plan. Mike and I decided that I will join Thomas’ Untangling Tools project (advised by Mike and René), and I’m very excited!
Week 1’s Plan:
- Close-read the latest Untangling Tools draft to familiarize myself with the project, look up all the unfamiliar terms, and grasp the key ideas and concepts in the methodology and evaluation framework
- Successfully install and set up the project codebase on my local machine
- Run the 5-bug experiments end-to-end, read the files, and set up for code review
This Week’s Progress:
- While I completed the 3 to-dos listed above, I ran into many challenges in understanding and using Thomas’ codebase.
- First, my lack of knowledge of virtual environments and operating systems made it a struggle to debug installation issues. Thanks to Thomas and Mike’s help, I was able to set up the codebase on my local computer and run the project end-to-end myself. While it was frustrating, the upside is that we could update the README to cover caveats when building from a clean environment.
- Second, my unfamiliarity with version control background (unified diffs, and viewing and managing version control history) led me to make a lot of assumptions when trying to understand the evaluation pipeline.
- I attempted to review the code and write new functions, but this took a lot of time because I had difficulty navigating the repository and found the documentation missing or ambiguous. Mike and I agreed that there is not sufficient documentation to modify or extend the code, so my current action step is to document the code per my understanding. I found this extremely helpful, as I was able to understand the actual representation of diff/patch files and how the code manipulates them. Documenting input/output files gave me a clearer picture of the pipeline and its component functions, as well as the CSV files exported for analysis.
Next Week’s Plan:
- Merge all pull requests and resolve the documentation issue
- Review the code thoroughly
- Implement the classification of whether a line/hunk is tangled
- Then, write code to handle tangled lines in the analysis (which are ignored in the current implementation)
Week 1 Meeting Notes: Debunking assumptions
- We clarified the terminology of the 3 diff files [this can be added to the README description]: each of the 3 diff files is a PatchSet containing (diff) Line objects. We compare lines by identity (line contents) rather than by line numbers.
- Original diff (or version control diff): The diff generated by the VCS between the buggy (pre-commit) and fixed (post-commit) versions
- Bug-fix diff (or minimal diff): A minimized Defects4J patch containing all bug-fixing lines, obtained by inverting the Defects4J bug-inducing patch. Note: the bug-fix diff might not be a subset of the original diff, as it may contain the bug-fix portion of a tangled line - these portions are dropped when creating the ground truth files
- Non-bug-fix diff:
- Current implementation: The set difference {original diff \ bug-fix diff} - containing all the lines that are non-bug-fixing or tangled (as we drop bug-fix portions in the ground truth).
- Desired implementation: The UNIX diff of {buggy version, fixed version}, in which the fixed code is the bug-fix diff applied on top of the original diff ({Original_diff(buggy) + Bug-fix_diff})
- Ideas for better implementation:
- Mike proposed a method to obtain the ground truth with less manual work by using UNIX patch commands (diff, -R, etc.). We should filter the programmer's code first, so that the diffs contain no comments, blank lines, etc.
- The main benefit of this proposed implementation is that it is simpler and reduces the amount of new code. Using UNIX tools might also allow us to identify whether a line or a hunk is tangled. This will be less error-prone and easier to code review.
- New pipeline: Filter code (no comments, no imports, no spaces, no blank lines) -> Obtain bug-fix diffs -> Generate fixed code -> Obtain non-bug-fix diff -> Repair lines -> Construct ground truth
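To make these notes concrete for myself, here is a minimal Python sketch of two ideas from the meeting: the filtering step and the content-based set difference used for the non-bug-fix diff. This is not the project's code. The function names (`filter_source`, `changed_lines`, `non_bug_fix`) and the toy Java snippet are hypothetical, the filter only handles whole-line `//` comments, and I use Python's `difflib` as a stand-in for the UNIX tools.

```python
import difflib


def filter_source(source: str) -> list[str]:
    """Drop blank lines, import statements, and whole-line // comments.

    Simplified assumption: block comments and trailing comments are not handled.
    """
    kept = []
    for raw in source.splitlines():
        line = raw.strip()  # remove indentation and trailing spaces
        if not line or line.startswith("//") or line.startswith("import "):
            continue
        kept.append(line)
    return kept


def changed_lines(before: list[str], after: list[str]) -> list[tuple[str, str]]:
    """Return (sign, content) pairs: '-' for removed lines, '+' for added lines."""
    changes = []
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            changes += [("-", line) for line in before[i1:i2]]
        if tag in ("replace", "insert"):
            changes += [("+", line) for line in after[j1:j2]]
    return changes


def non_bug_fix(original, bug_fix):
    """Set difference {original diff \\ bug-fix diff}, keyed on line content only."""
    fix_lines = set(bug_fix)
    return [change for change in original if change not in fix_lines]


# Hypothetical toy example: a rename tangled with a division-by-zero fix.
buggy = """
import java.util.List;
// compute the average
int total = 0;
int count = items.size();
return total / count;
"""
fixed = """
import java.util.List;
int sum = 0;
int count = items.size();
if (count == 0) return 0;
return sum / count;
"""

original_diff = changed_lines(filter_source(buggy), filter_source(fixed))
bug_fix_diff = [("+", "if (count == 0) return 0;")]  # assumed minimal patch
for sign, content in non_bug_fix(original_diff, bug_fix_diff):
    print(sign, content)
```

Because the comparison uses only the sign and the line content, the result does not depend on line numbers, which matches the "identity over line numbers" point above. It also shows the limitation we discussed: a tangled line would never match the bug-fix diff exactly, so it stays in the non-bug-fix set.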
Written on June 9, 2023