Let us return to the events which occurred in 2014 at a small start-up company in Monroe, CT.
You have been hired by a new small to medium sized eCommerce start-up based in Monroe, CT to investigate the insidious greencat-2 malware which was infecting their accounting computers. You are just beginning to understand the malware’s behavior. You have reverse engineered what appears to be a key malware functionality which exfiltrates files to the command and control server, and this is leading you to suspect that greencat-2 may have been responsible for the fraudulent customer payment directions.
Suddenly, your phone rings. Your caller ID shows that it is the CEO of the eCommerce start-up!
“Hello?” you ask.
“The investors are getting nervous! The lawyers are asking questions! The customers aren’t buying our eCommerce product!” he yells.
“I’m working as fast as …” you say, but he interrupts.
“We need to provide some proof that no customer data was stolen! I need you to get me that proof by next week, or you’re not getting paid!” he says before hanging up.
Your mind races… what can you do? How can you provide proof?
Data dependence! You suddenly realize that you can quickly produce a data dependence graph. The lawyers can spend a few weeks piecing it together, but that will provide proof of what greencat-2 could do with the data it handled. You rush to begin writing a GHIDRA plugin to compute data dependence, not waiting even one second before getting started (hint hint) … you know that this will take a lot of work to complete by the CEO’s deadline (hint hint).
Instructions:
An accurate Data Dependence Graph (DDG) is the most sought-after building block in the program analysis universe. Malware analysis tools require a DDG to answer any questions about the malware’s operation. You’ve probably seen multiple applications of DDGs in the research papers up to this point. Unfortunately, static analysis hurdles such as path explosion and aliasing force tool developers to make difficult implementation tradeoffs which limit the accuracy of their DDGs. In this lab, you will combat path explosion and aliasing with the goal of building a best-effort DDG, another essential building block for malware analysis. After completing this lab, I encourage you to go a step further and write a simple analysis script to automatically extract any DDG paths within GreenCat that can exfiltrate data from files on the victim system.
Loop over every instruction in every basic block in every function in your greencat-2 disassembly (from before).
Compute the data dependence of each instruction. You can design any methods or data structures you wish to accomplish this. You can use any GHIDRA SDK APIs that will help you (but none exist that can compute data dependence for you).
Generate a DOT directed graph representing the data dependence of all the instructions in each function. Specifically, one DOT graph per function — name each DOT graph (called a “digraph” in the DOT file format) the address of the function.
Each node in your DOT directed graph should be the address of an instruction (only ONE node per instruction address). Node labels can be just the instruction addresses. The edges in your DOT directed graph file should go from each instruction to any instructions which that instruction is data dependent on. The order of the edges in the DOT directed graph file does not matter.
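As a rough illustration of the output described above, here is a minimal Python sketch that turns per-function dependency maps into DOT text. The helper names (node_name, function_to_dot, program_to_dot), the dictionary layout, and the quoted hex node names are all assumptions made for this sketch, not a required format.

```python
def node_name(n):
    # Hypothetical naming scheme: hex address, or a literal string marker.
    return n if isinstance(n, str) else "0x%x" % n


def function_to_dot(func_addr, deps):
    """Render one function as a digraph named after its entry address.

    deps maps an instruction address to the set of addresses it is
    data dependent on.
    """
    lines = ['digraph "%s" {' % node_name(func_addr)]
    # Exactly one node per instruction address.
    for addr in sorted(deps):
        lines.append('    "%s";' % node_name(addr))
    # Edges go from an instruction to each instruction it depends on.
    for addr in sorted(deps):
        for target in sorted(deps[addr], key=node_name):
            lines.append('    "%s" -> "%s";' % (node_name(addr), node_name(target)))
    lines.append("}")
    return "\n".join(lines)


def program_to_dot(functions):
    # All digraphs end up in one string, ready to be written to one .dot file.
    return "\n\n".join(function_to_dot(f, functions[f]) for f in sorted(functions))


# Made-up dependencies for illustration only (not taken from greencat-2).
example = {0x401000: {0x401000: set(), 0x401001: {0x401000}}}
dot_text = program_to_dot(example)
```

Keeping the dependency computation separate from the DOT rendering makes it easier to check each half on its own.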
Consider this example from the greencat-2 binary. Here are the instructions in the function starting at address 0x401000 in greencat-2:
The DOT directed graph generated by your tool for this function should be as follows:
Note: The following example is for full credit, which includes tracking the calling conventions and arguments of CALL instructions.
Your tool should process every function in the greencat-2 binary. All DOT graphs for all the functions should be output in a single “.dot” file. So, after your GHIDRA plugin finishes executing, you should have a single “.dot” file with many digraphs in it (one digraph per function).
Also see: https://stackoverflow.com/questions/1494492/graphviz-how-to-go-from-dot-to-a-graph
As always: Post any questions or ideas on Ed Discussion! Even code snippets are fine, as long as they do not give away a key answer to this assignment. Class collaboration is encouraged — It’s us versus malware! If you’re not sure if your post is safe, send it to the Prof/TA in a private post to verify.
Lab Requirements / FAQ (MUST READ):
This section contains some frequently asked questions and requirements that students should adhere to when working on this assignment.
How do CALL instructions work for this assignment? How are they calculated?
CALL instructions for this assignment are handled similarly to Lab 3. To get full credit, you must properly track calling conventions and stack dependencies.
Do we need to calculate dependencies between functions?
No. Similar to Lab 3 (and for all scripting labs) functions will be considered independently, meaning you do not need to link dependencies between functions. This is the purpose of the START keyword. The START keyword should be used to express that a dependency originated outside of the local function.
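To make the START idea concrete, here is a minimal Python sketch of register tracking over a straight-line instruction sequence. The (address, reads, writes) tuple layout and the register_deps name are hypothetical. The sketch ignores flags, memory, and control flow, so a real plugin would also need to merge definitions arriving from multiple basic blocks.

```python
START = "START"


def register_deps(instructions):
    """instructions: list of (addr, reads, writes) tuples of register names."""
    last_writer = {}  # register name -> address of its most recent writer
    deps = {}
    for addr, reads, writes in instructions:
        # A read depends on the latest local write, or on START when the
        # value originated outside of this function.
        deps[addr] = {last_writer.get(reg, START) for reg in reads}
        for reg in writes:
            last_writer[reg] = addr
    return deps


# Made-up sequence for illustration (flag registers omitted for brevity).
block = [
    (0x10, ("ebp",), ("eax",)),        # mov eax, ebp
    (0x12, ("eax", "ebx"), ("eax",)),  # add eax, ebx
    (0x14, ("eax",), ("ecx",)),        # mov ecx, eax
]
deps = register_deps(block)
```

Here ebp and ebx are never written locally, so the first two instructions pick up a START dependency, while the last one depends only on the add.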
Grade: 100 points
Grading Criteria:
The grade will be based on how many instructions and functions your plugin handles correctly (i.e., the edges and the labels in the DOT graph are correct).
Warning: Static memory read/write tracking is an extremely hard problem in general — I do not expect you to completely solve it for this lab! If you miss some complex memory read/write dependencies (there are very few in this assignment), you will still get a good grade.
Here is what the team will look for while grading:
Register dependencies: Register reads/writes are the easiest case of dependencies.
Direct push & pop dependencies: This requires that your plugin track changes of the stack pointer inside each function. Hint: Since we do not know its true value, pretend like ESP = 0 at the start of each function, and then track its changes for each instruction. Note that function args will be above ESP at the start of the function.
Static memory positions: These are memory locations that GHIDRA gives a name to and accesses via that name (e.g., “mov [ebp+var_4], eax” or “mov dword_429C48, eax”). This requires your plugin to note each instruction which writes to that memory position.
Everything else: There are very few complex memory read/write dependencies (e.g., those which include aliasing) in the functions we will grade. I did not find any cases of aliasing in my cursory pass over the code. If you are concerned about any cases of complex memory read/write dependencies, then please post on Ed Discussion and we will be glad to check it out.
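The push and pop hint above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: 4-byte stack slots, a hypothetical (address, operation) tuple layout, and stack-slot dependencies only, with the register and ESP dependencies of each instruction left out.

```python
START = "START"


def stack_deps(instructions):
    """instructions: list of (addr, op) where op is "push" or "pop"."""
    esp = 0           # pretend ESP = 0 at the start of the function
    slot_writer = {}  # stack offset -> address of the push that wrote it
    deps = {}
    for addr, op in instructions:
        if op == "push":
            esp -= 4                  # a 32-bit push moves ESP down by 4
            slot_writer[esp] = addr   # remember who wrote this slot
            deps[addr] = set()
        elif op == "pop":
            # The pop reads the slot at ESP; slots at or above the entry
            # ESP were written by the caller, so they map to START.
            deps[addr] = {slot_writer.get(esp, START)}
            esp += 4
    return deps, esp


# Made-up sequence: two pushes followed by three pops.
seq = [(0x20, "push"), (0x21, "push"), (0x22, "pop"), (0x23, "pop"), (0x24, "pop")]
deps, final_esp = stack_deps(seq)
```

The third pop reads offset 0, which no local push wrote, so it resolves to START. The same offset map can be extended to named stack variables such as [ebp+var_4] once you know EBP's offset relative to the entry ESP.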
The grade will be based on how many instructions and functions your plugin processes correctly, and is ultimately based on your graph submission (DOT file).
The data dependence accuracy of each of the top 10 instruction mnemonics is worth 5% of the total grade (mov, add, sub, cmp, test, xor, push, pop, lea, all forms of jump). For example, if 20% of your mov instructions are wrong (missing a dependent or has an erroneous dependent) then you will lose 1% of the total 100 points.
The data dependence accuracy of all other instruction types is collectively worth 15% of the total grade. For example, if 30% of the other instructions are wrong (missing a dependent or has an erroneous dependent) then you will lose 5% of the total 100 points (30% of 15% is 4.5%, and the remaining 10.5% is rounded down to 10%).
Edge accuracy is worth 30% of the total grade. For example, if 10% of your edges are wrong (missing or have an erroneous extra edge) then you will lose 3% of the total 100 points.
If you cannot get any of the cases above to work, do not worry!! Please comment at the top of your file which of these cases you COULD NOT get to work, and the TA and I will be lenient while grading.
We will only grade the functions that you commented in Lab 2. The maximum deduction is 100. There will be no negative grades.
Note: Grades in sections are rounded down to the nearest percent.
Call Tracking:
Up to 20 additional points will be awarded for properly tracking the data dependence (DD) of CALL arguments. Note: This will require using GHIDRA’s APIs to determine the number of function arguments. For example:
40156B push 3Ch ; …
40156D xor ebx, ebx ; …
40156F lea eax, [ebp+buf] ; …
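One possible way to connect a CALL to its arguments is sketched below in Python. It assumes stack-passed arguments in 4-byte slots and a map from stack offset to the writing instruction. The call_arg_deps name, the addresses, and the argument count are hypothetical; in a real plugin the count would come from GHIDRA's function metadata.

```python
START = "START"


def call_arg_deps(esp, slot_writer, num_args):
    """Return the writers of the num_args stack slots starting at ESP."""
    # Argument i sits at ESP + 4*i just before the CALL executes.
    return {slot_writer.get(esp + 4 * i, START) for i in range(num_args)}


# Made-up state just before a CALL: pushes at 0x30 and then 0x32.
slot_writer = {-4: 0x30, -8: 0x32}
two_arg_deps = call_arg_deps(-8, slot_writer, 2)
three_arg_deps = call_arg_deps(-8, slot_writer, 3)
```

With two arguments the CALL depends on both pushes. Asking for a third argument reaches a slot no local instruction wrote, which falls back to START.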
Teams:
This assignment can be done individually or in a team of 2. Please join a group in Gradescope if you are collaborating.
Do not create or join a group in Canvas. Canvas groups are different from Gradescope groups.
New to Gradescope? This link provides instructions for how to create groups in Gradescope: https://help.gradescope.com/article/m5qz2xsnjy-student-add-group-members
Zoom can also provide the ability to collaborate and video conference with your teammate.
Submission Instructions:
Upload the following to the Lab 3 Assignment in Gradescope:
(1) The DOT file output by your GHIDRA plugin, named “submission.dot”, which contains digraphs for every function in the greencat-2 binary.
(2) Your GHIDRA plugin code, named either “plugin.py” or “plugin.java” depending on the chosen language. We reserve the right to run all submitted code, through automated means or otherwise. If your code does not produce output equivalent to your submitted DOT file, you will receive a zero.
Be advised, please submit (1) and (2) separately, do NOT zip them together.
Note: Gradescope will only check the formatting of your submission. Gradescope will not automatically check the correctness and provide a grade.
Note: You can download the webc2-greencat-2.7z file directly into your lab environment. After you are done with this lab, you can submit your files directly from the lab environment (Highly recommended). Doing this will help you avoid transferring the file from the lab environment to your personal computer.
Transferring Files:
To transfer files from your personal device to the lab environment:
Create a zip folder of all the files that you would like to transfer to the lab environment.
Every GT student has Box and OneDrive accounts given free by the institution. Login to either of those two and upload the desired files.
Now go back to the lab environment and log in to whichever of those two services you uploaded your zip folder to. Download the folder to the lab workspace and use the appropriate 7z command to unzip it.
What to do when you encounter technical difficulties?
If you are experiencing technical difficulty such as being unable to access the lab environment, please submit a ticket to the “Digital Learning Tools and Platforms” team at https://gatech.servicenow.com/continuity. Please put “Route to the DLT Team” at the top of the ticket, because it will help the Service Desk know where to send it.
Grades have been released. How do I view my raw feedback?
Gradescope truncates raw feedback over a certain size. For this reason, we’ve provided both a plaintext version of the JSON feedback and a Base-85 encoded, GZIP compressed version of the same JSON feedback. The encoded version is the last test case in Gradescope (all the way at the bottom). If you find that your plaintext feedback is truncated in Gradescope and need the full feedback for programmatic review, please try decompressing the encoded version. The following Python snippet is an example of how one may retrieve the plaintext information:
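A minimal sketch of such a decoder, assuming the feedback was GZIP compressed first and then encoded with the Base85 alphabet used by Python's base64.b85encode. If decoding fails, the Ascii85 variant (base64.a85decode) would be the next thing to try. The decode_feedback name and the sample payload are made up for illustration.

```python
import base64
import gzip
import json


def decode_feedback(encoded):
    """Decode Base85 text, gunzip it, and parse the resulting JSON."""
    compressed = base64.b85decode(encoded)
    return json.loads(gzip.decompress(compressed).decode("utf-8"))


# Round-trip a made-up payload to show the expected order of operations.
sample = base64.b85encode(gzip.compress(json.dumps({"score": 1}).encode("utf-8")))
decoded = decode_feedback(sample)
```

To use it on real feedback, paste the encoded test-case text into a string and pass it to decode_feedback in place of sample.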