In this week's lab you will:
- write your own first pieces of machine code
- learn about how the instructions (i.e. the program) are represented and executed on your discoboard
- learn a few assembler directives for getting data into your program
- manipulate values with boolean logic instructions
This lab will build your competency towards Learning Outcomes (LO) 2 & 3.
Preparation
Before you attend this week's lab, make sure:
- You can fork, commit & push your work to GitLab
- You can use VSCode to edit assembly (.S) files
- You can connect your discoboard to your computer and start a debugging session.
- You have read the lab content below
If you're not confident about that stuff then feel free to follow the links to the relevant part of lab 1.
You may find these references useful in this lab; we'll refer back to them later:
Make sure the COMP2300 Cortex-Debug extension is up to date before starting the lab!
Introduction
In the week 2 lectures you saw how your CPU/MCU is built out of logic gates which are built into different components, including registers for storing memory and adders (as part of the ALU) for adding numbers together.
Today you will see how to explain to your CPU that your plan is to calculate 2+2, and see how words (i.e. numbers) in memory can be formed and interpreted as opcodes (operation codes) which will instruct your CPU to do things (e.g. add two values).
Exercise 1: 1+2
Fork the lab 2 template on GitLab, then clone it to your computer with Git, as you did last week in lab 1.
Your job in this first exercise is to write an assembly program which calculates 1+2 and leaves the result in register 1 (r1).
Remember from last week's lab that you can see the values in your discoboard's registers while debugging, in the registers pane:
You can set the display format for a specific register in the register view. Right click on the register, select set number format and then select the desired format. This will help you make sense of the value of a register.
ARM assembly syntax
This is probably the first time you've written any ARM assembly code, so for this course we've prepared a cheat sheet to help you out. It looks pretty intimidating at first, mostly because it crams a lot of information into a small space. So let's pick one line of the cheat sheet, the sub instruction, and pick it apart.
First, the syntax column:
sub{s}<c><q> {<Rd>,} <Rn>, <Rm> {,<shift>}
The first token on the line is the instruction name, and after that is the (comma-separated) argument list. Conveniently, all of our assembly instructions will have a similar format.
- Anything in braces ({}) is optional, e.g. the s at the end of sub{s} means that it can be either sub or subs. Adding s to instructions will cause flags to be set by the operation; we'll cover this later in the course.
- The <c> and <q> parts relate to the condition codes and opcode size boxes on the second page of the cheat sheet. They are also optional and we'll visit these later in the course.
- {<Rd>,} is the destination register (e.g. r3 or r11), where the result of the instruction is stored. If you do not specify a destination register, <Rn> will be used instead.
- <Rn>, <Rm> are the two operands (arguments) for the sub instruction.
- Finally, the optional {,<shift>} is for using the discoboard's barrel shifter to do logical shifts. We'll also cover this later in the course.
There are a couple of other parts of the syntax which aren't covered in the sub instruction:
- Instructions which use constant values will use decimal by default. You can prefix your values to indicate a different base: 0b for binary (e.g. 0b1101101), 0o for octal (e.g. 0o125) or 0x for hexadecimal (e.g. 0xef20).
- When it comes to load & store operations, square brackets [] indicate that the instruction should use the memory address in the register, e.g. [r2] tells the discoboard to use the memory address in r2 for that instruction.
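If you want to sanity-check the base prefixes, here is a small Python sketch you can run on your own computer (Python is only being used as a calculator here; it is not part of the lab template). Python happens to accept the same 0b, 0o and 0x prefixes, so the example constants can be compared directly.

```python
# The example constants from the list, written with the same prefixes.
binary_value = 0b1101101
octal_value = 0o125
hex_value = 0xef20

# All three are ordinary integers; only the way they are written differs.
print(binary_value, octal_value, hex_value)  # 109 85 61216

# int(text, 0) works out the base from the prefix in a piece of text.
parsed = [int(text, 0) for text in ("0b1101101", "0o125", "0xef20")]
```

The point is that the prefix changes how a number is spelled, not what the number is.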
You won't need to know all of this stuff to complete this lab, so just remember that it's here if you need to come back to it.
The semantic column on your cheat sheet describes what the instruction does. For example, the semantic for the sub instruction is Rd(n) := Rn - Rm{shifted}, which in English translates to something like:
in the Rd register (or Rn, if Rd was not specified) store the result of subtracting the value in the Rm register (with an optional bit-shift, if present) from the value in the Rn register.
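To make that semantic concrete, here is a rough Python model of it. This is only a sketch: the regs dictionary and the sub helper are made up for illustration, the optional shift is ignored, and the result is wrapped to 32 bits on the assumption that registers are 32 bits wide.

```python
MASK32 = 0xFFFFFFFF  # assumption: registers are 32 bits wide

def sub(rn, rm):
    """Model of Rd := Rn - Rm, wrapped to 32 bits like a register."""
    return (rn - rm) & MASK32

# A hypothetical register file, just for this illustration.
regs = {"r0": 7, "r1": 2, "r2": 0}

# sub r2, r0, r1  -> r2 := r0 - r1
regs["r2"] = sub(regs["r0"], regs["r1"])

# sub r0, r1      -> no Rd given, so the result lands in Rn (r0 := r0 - r1)
regs["r0"] = sub(regs["r0"], regs["r1"])
```

Note how the two-operand form overwrites its first operand, which is what the "(or Rn, if Rd was not specified)" part of the semantic means.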
You can probably see why we use assembly language for telling our CPU what to do rather than English: it's much less wordy.
The flags column of the cheat sheet specifies which of the special condition code flags that instruction sets if the optional s suffix is present. (We'll cover this in next week's lab, but if you're curious there's a box on the second page of the cheat sheet which lists the flags.)
Since there is a lot of information in the ARM instruction syntax, you don't need to memorize everything. Just keep the cheat sheet nearby and take a closer look when you need to find a specific syntax. You will be given an assembly instruction cheat sheet in exams.
Please be aware that there are several instructions that can use different sets of arguments. For example, the sub instruction can be:
- sub with <Rn>, <Rm> {,<shift>}
- sub with <Rn>, #<const>
Both of them will do the same sub operation, but with different parameters. For this instruction, we can subtract either one register from another, or a constant value from a register.
The task
To actually complete your 1 + 2 task, you'll need to:
- get the number 1 into any register
- add 2 to the register value and put the result in r1
Look over the cheat sheet: which assembly instructions allow you to place a constant value into a register? There are also a number of machine instructions which will implement an addition; which one do you want, and why?
Once you've written a program which you think will do what you want, step through it with the debugger and make sure that the value which r1 holds at the end is actually 3.
The method you use to find 1 + 2 is only one solution to the problem. There are other ways that we can use to answer the question using different instructions. Can you think of any?
Save & make a commit now that you have finished Exercise 1. It's a good idea to keep a version once you have completed each exercise in the lab; that's what a version control system is for, right?
Exercise 2: reverse engineering with the memory viewer
Now we really look under the hood and leave no bit unturned. Start a debugging session with your program from the previous exercise and step through until you get to main, then leave it hanging: do not execute any further. This pauses program execution for the moment.
Although most of the process of code execution is displayed inside VSCode, in reality, all the register and memory values are taken directly from your discoboard's CPU.
Do you know where each of your numbers is represented in your program when it's actually running on your discoboard?
To find out, we need to be able to view the memory in the discoboard. In VSCode you can look at your memory directly using the Memory View: type memory in the command palette and select the Cortex Debug view memory option. VSCode will then ask you to input the starting address and the number of bytes to read. In the example below, the starting address used is 0x080001dc, and 512 bytes of memory have been read.
This might look overwhelming, but the 2D grid layout is pretty simple: the hex numbers down the left-hand side are the base memory addresses, and the hex numbers along the top represent the offset of that particular byte from the base address. So, to work out the exact address of a particular byte, add the row and column values (base + offset).
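The base + offset arithmetic is easy to get wrong by a digit when done in your head, so here is a tiny Python sketch of it. The row and column labels below are hypothetical values picked for illustration, not ones read from a real memory view.

```python
def byte_address(row_base, column_offset):
    """Address of one byte in the grid: row label plus column label."""
    return row_base + column_offset

# Hypothetical row and column labels, chosen only for illustration.
address = byte_address(0x080001E0, 0x0C)

# Print with leading zeros so it lines up with the labels in the memory view.
print(f"{address:#010x}")  # 0x080001ec
```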
You can find the bytes in memory which correspond to the instructions you wrote in your main.S file using this view. The trick is figuring out where to look: what should the starting address be? Even your humble discoboard has a lot of addressable memory. Discuss with your lab neighbour: where should you look to find your program? The value in the pc (the program counter) may be a good place to start. See if you can figure out exactly what this value represents.
There's an assembler directive called .hword which you can use to put 16-bit numbers into your program (hword is short for half-word, and comes from the fact that your discoboard's CPU uses 32-bit words).
Modify your program to use the .hword directive to put some data into your program, so it looks something like this:
.syntax unified
.global main

.type main, %function
main:
  nop
  .hword 0xdead
  .hword 0xbeef
  b main
.size main, .-main
Note we need to add a nop instruction here for the debugger to work correctly: 0xdead isn't a real instruction, and our CPU can get confused.
What do you think a nop instruction does? Aside from this very specific case, when do you think such an instruction might be useful?
Build & upload your program and open a memory viewer session. Use the address of your first instruction as the base address, and load enough bytes to view your entire program. Can you see the 0xdead and 0xbeef values you put into your program?
If you can't see them exactly, can you see something which looks suspiciously like them? What do you notice?
Endianness
To make sense of the numbers displayed in the memory view, we need to talk about endianness.
Values are stored in memory as individual bytes (i.e. 8-bit numbers, which can be represented with two hex digits). Endianness refers to the order in which these small 8-bit bytes are arranged into larger numbers (e.g. 32-bit words). In the little-endian format used by our discoboards, the byte stored at the lowest address is the least significant byte (LSB). Big-endian is the opposite: the byte stored at the lowest address is the most significant byte (MSB).
Here's an example: suppose we have the number 0x01 stored at a lower memory address (e.g. 0x000001e0), and the number 0xF1 stored at a higher memory address (0x000001e1), as shown below:
If we tell our CPU to read a half-word (16 bits) from the memory address 0x000001e0 under the little-endian format, it represents 0xF101 (the 0x01 at the lower address is treated as less significant). In a CPU under the big-endian format, it represents 0x01F1 (the 0x01 at the lower address is treated as more significant).
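You can check this example on your own computer with a couple of lines of Python. This is just a sketch using the built-in int.from_bytes, which takes the bytes in address order (lowest first) and a byte order to combine them with.

```python
# The two bytes from the example, as they sit in memory, lowest address first.
memory = bytes([0x01, 0xF1])

little = int.from_bytes(memory, "little")  # low address = least significant
big = int.from_bytes(memory, "big")        # low address = most significant

print(hex(little), hex(big))  # 0xf101 0x1f1
```

The bytes in memory are identical in both cases; only the interpretation changes.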
When reading four bytes from memory, the CPU can read them as four 8-bit bytes, two 16-bit half-words, or as one 32-bit word. The endianness format applies every time bytes are combined into bigger words. The following diagram illustrates this using the little-endian format:
You need to be aware of this byte ordering to make sense of the memory view.
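The three ways of reading the same four bytes can also be sketched in Python with the standard struct module, where "<" selects little-endian and B, H and I stand for 8-, 16- and 32-bit unsigned values. The four byte values are made up for illustration.

```python
import struct

# Four hypothetical bytes in memory, lowest address first.
memory = bytes([0x11, 0x22, 0x33, 0x44])

as_bytes = struct.unpack("<4B", memory)      # four 8-bit values
as_halfwords = struct.unpack("<2H", memory)  # two 16-bit half-words
(as_word,) = struct.unpack("<I", memory)     # one 32-bit word

print([hex(b) for b in as_bytes])      # ['0x11', '0x22', '0x33', '0x44']
print([hex(h) for h in as_halfwords])  # ['0x2211', '0x4433']
print(hex(as_word))                    # 0x44332211
```

Notice that the word is not simply the two half-words written side by side in memory order; the half-word at the lower address ends up in the low half of the word.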
How might you figure out on your discoboard's Cortex-M4 CPU whether a 32-bit instruction is read as one 32-bit word or as two 16-bit half-words? (Hint: have a look at A5.1 in the ARMv7-M Architecture Reference Manual.)
According to Wikipedia, Danny Cohen introduced the terms Little-Endian and Big-Endian for byte ordering in an article from 1980. In this technical and political examination of byte ordering issues, the endian names were drawn from Jonathan Swift's 1726 satire, Gulliver's Travels, in which civil war erupts over whether the big end or the little end of a boiled egg is the proper end to crack open, which is analogous to counting from the end that contains the most significant bit or the least significant bit.
Instruction encoding(s)
Now that you know how bytes fit together into words, let's get back to the task of figuring out how the instructions in your program are encoded in memory.
You can use the known half-words you put into your program earlier to help you out. Update your program like so to add a single instruction (i.e. one line of assembly code) from the 1+2 program you wrote in Exercise 1.
.syntax unified
.global main

.type main, %function
main:
  nop
  .hword 0xdead
  @ put a single "real" assembly instruction here from your 1+2 program
  .hword 0xbeef
  b main
.size main, .-main
Build, upload and start a new debug session, then find your program again in the memory view. What does your instruction look like in memory? Try making a note of the bytes, then modify the instruction arguments (e.g. change the number, or the register you're using) and see how the bytes change in memory (you'll need to re-build & run your program and call the view memory command each time you do).
Discuss with your neighbour: what do you think the different bits (and bytes) in the instruction mean? How does the discoboard know what to do with them? And if you've figured that out, why doesn't your program actually work as written?
To fully make sense of these instruction encodings you need more than just your cheat sheet: you need the ARMv7-M Architecture Reference Manual. Dig to the deepest levels of the manual, by going to section A7.7 Alphabetical list of ARMv7-M Thumb instructions (page A7-184). Use the bookmarks in your PDF viewer to navigate to the relevant instructions inside this huge document.
For each instruction you will see a number of encodings. They detail bit-by-bit the different ways of specifying the machine instructions that your discoboard CPU understands. You may find a number format conversion tool (decimal, hex, binary) helpful here.
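If you don't have a converter handy, Python's built-in functions will do the job. The show and parse helpers below are just illustrative names, not part of any lab tooling.

```python
def parse(text):
    """Accept decimal, 0x... or 0b... text and return the integer."""
    return int(text, 0)

def show(value):
    """Return the decimal, hex and 16-bit binary spellings of one number."""
    return str(value), hex(value), format(value, "016b")

decimal_text, hex_text, binary_text = show(parse("0xbeef"))
print(decimal_text, hex_text, binary_text)
```

Padding the binary form to 16 digits makes it easier to line the bits up against the encoding diagrams in the manual.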
Commit your reverse engineering program with a comment about what the instruction looks like in memory. It doesn't matter that it doesn't actually run at this point; you'll get there in the next exercise.
Can you tell which specific encoding has been used for the instruction you wrote earlier in Exercise 1? Note that not every encoding can express every version of the instruction, but sometimes a more complex encoding can also express what a simpler form could have done. Can you hint the assembler towards the specific encoding you want?
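Reading an encoding diagram comes down to shifting and masking bit fields out of a half-word. As a sketch of the idea, here is a Python decoder for one encoding only: the 16-bit T1 encoding of MOV (immediate), which the manual lays out as the five bits 00100, then a 3-bit Rd, then an 8-bit constant. The function name is made up, and the example half-word is deliberately not one from your program; check the field positions against the manual rather than taking this on trust.

```python
def decode_mov_imm_t1(halfword):
    """Split a 16-bit MOV (immediate) T1 half-word into (Rd, imm8)."""
    opcode = (halfword >> 11) & 0b11111  # top five bits
    rd = (halfword >> 8) & 0b111         # three-bit register number
    imm8 = halfword & 0xFF               # eight-bit constant
    if opcode != 0b00100:
        raise ValueError("not a MOV (immediate) T1 encoding")
    return rd, imm8

rd, imm8 = decode_mov_imm_t1(0x2001)
print(f"movs r{rd}, #{imm8}")  # movs r0, #1
```

The same shift-and-mask approach works for any other encoding once you have read the field positions off its diagram.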
Exercise 3: hand-crafted instructions
Now that you've identified the spot in memory where your instructions live, in this exercise we turn our approach around and program the CPU by writing specific numbers directly to memory locations. Where we earlier inserted 0xdead and 0xbeef, we are going to insert hex values that correspond to the machine code for real instructions.
Instead of calculating 1+2, you are going to make the CPU calculate 3-1 by putting the right numbers into memory.
In fact, you have been doing this all along, except that the assembler has helped you by converting your human-readable instructions into their raw machine code representations. Replace your program with the following assembly code:
.syntax unified
.global main

.type main, %function
main:
  nop
  .hword 0xffff
  .hword 0xffff
  b main
.size main, .-main
This time we want you to put on your assembler hat and figure out the actual .hword values which will make the CPU load 3 into r1 and subtract 1 from it. Remember that it'll be similar to the words you looked at in the memory viewer earlier, but some of the bits will be different (since we're dealing with -, 3 and 1 instead of +, 1 and 2).
You can find the architecture reference manual pages for the add instruction here, and the pages for mov here.
Note the line at the bottom:
.size main, .-main
This tells the assembler the size of the main function, and it is essential for the disassembler to work correctly. The disassembler view can be opened by typing Cortex-Debug: View Disassembly (Function) in the command palette during an active debug session. It will then ask which function to disassemble; type the function name (e.g. main). It will look something like this:
Bring up the disassembler view for main: you're now looking at the program as it will be understood by the CPU.
Looking at the disassembled code (i.e. the way the CPU will interpret the instructions), did you see what you intended? Will your new, hand-assembled program show the correct result in r1 after it has been run? If it doesn't, what might have gone wrong?
You can now cheer as loud as your upbringing allows that you will never again have to hand-craft the bits needed to instruct the CPU to do this, and you can leave this job to the assembler from now on.
As a side effect, you also learnt something about security: your system can be compromised by injecting some data into memory (an array of numbers, a string or anything which the host system would accept) and making the CPU somehow stumble into executing it.
Discuss with your lab neighbour: after completing this exercise, how would you explain the way a CPU works to your grandma/grandpa?
Make a commit now that you've knocked down Exercise 3. Congrats!
Exercise 4: Boolean logic, bit vectors, and labels
Load some new data into register r1 by adding this assembly code to your program:
  @ load "COPE" into r1
  ldr r1, string
loop:
  nop
  b loop
string:
  .ascii "COPE"
This code introduces a new assembler directive: labels.
If you're the kind of person who likes documentation, you can find it here.
In the above code there are two new labels: string and loop. A label is a way of attaching a human-readable name to a location in your program. Always remember that labels are just a name attached to a memory location: when the assembler builds your program, each label will have a specific memory address which you can store in a register, do arithmetic on, etc.
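One way to picture how a label becomes an address is to imagine the assembler walking through your program with a running address counter. The Python sketch below models only that idea: the start address, the item sizes and the assign_labels helper are all made up for illustration, and a real assembler and linker do considerably more.

```python
def assign_labels(start_address, items):
    """Walk the program once, recording the address each label lands on.

    items is a list of ("label", name) or ("bytes", size) entries.
    """
    labels = {}
    address = start_address
    for kind, value in items:
        if kind == "label":
            labels[value] = address  # a label takes up no space itself
        else:
            address += value         # instructions and data advance the address
    return labels

# Hypothetical layout: the start address and sizes are made up for illustration.
program = [
    ("label", "main"),
    ("bytes", 2),
    ("label", "loop"),
    ("bytes", 2),
    ("bytes", 2),
    ("label", "string"),
    ("bytes", 4),
]
labels = assign_labels(0x08000100, program)
print({name: hex(addr) for name, addr in labels.items()})
```

Each label ends up as nothing more than a number, which is why you can store it in a register or do arithmetic on it.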
While you can use just about anything for a label name, try and use something informative as if you are naming a function or variable.
What will you see in r1 after the ldr r1, string line? Can you guess the address of string and find it in the memory viewer?
You'll notice a new .ascii assembler directive in this code: this allows you to put data into your program using the ASCII encoding. This works the same as the .hword directive you used earlier, except for the data format. While .hword uses numbers, .ascii takes characters and encodes each one to a specific byte value. You can find a table of characters and their ASCII-encoded values here.
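To see what "encodes each one to a specific byte value" means, here is a short Python sketch. It deliberately uses a different string ("ABCD") from the one in the exercise, and it assumes the four bytes are later read back as a single little-endian 32-bit word.

```python
text = "ABCD"  # a different string from the exercise, on purpose
data = text.encode("ascii")            # one byte per character
word = int.from_bytes(data, "little")  # the same bytes read as one 32-bit word

print(data.hex(" "))  # 41 42 43 44
print(hex(word))      # 0x44434241
```

The first character sits at the lowest address, so under little-endian ordering it ends up as the least significant byte of the word.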
Your goal in this exercise is to isolate and modify individual bytes within the "COPE" word:
- first, change it into "HOPE" and store it in r2
- then, change it into "HOPS" and store it in r3
Each of these steps requires isolating and manipulating one 8-bit (1-character) part of the 32-bit word without messing with the rest of it. What boolean logic or arithmetic operations can you use to modify the appropriate bits and bytes?
It might be helpful to use a piece of paper here: write out what the "COPE" data looks like in memory (remember endianness!), and figure out what operations you need to make the transformations into "HOPS".
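If paper isn't doing it for you, you can prototype the bit manipulation in Python before translating it into instructions. The sketch below shows two general techniques on a different word ("ABCD", not the exercise data): clearing a byte with AND and filling it with OR, and flipping only the differing bits with XOR. The helper names are made up, and it is up to you to work out which ARM instructions correspond to each step.

```python
MASK32 = 0xFFFFFFFF

def replace_byte(word, index, new_byte):
    """Replace byte `index` (0 = least significant) of a 32-bit word."""
    shift = 8 * index
    cleared = word & ~(0xFF << shift) & MASK32  # AND with inverted mask clears the byte
    return cleared | (new_byte << shift)        # OR drops the new byte in

def flip_byte(word, index, old_byte, new_byte):
    """Same result, using XOR with only the bits that differ."""
    return word ^ ((old_byte ^ new_byte) << (8 * index))

word = int.from_bytes(b"ABCD", "little")   # 0x44434241
changed = replace_byte(word, 0, ord("Z"))  # bytes now spell "ZBCD"
print(hex(changed), changed.to_bytes(4, "little"))
```

The XOR version needs you to know the old byte in advance, while the AND/OR version works whatever was there before.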
There are several ways to do this; how many can you think of? Show your program to your neighbour or tutor to get ideas about how it could be done differently.
You might have used one (or more) large numbers in your bit manipulation adventures. How many bits is that number? Do you think it will fit into the instruction encoding? Have a look at the disassembly, and cross-check it with the instruction manual. Can you make sense of what's happening here? (Hint: this blog post might be helpful to you.) Give it a crack, but we will revisit this again later in the course.
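As a starting point for that investigation, here is a Python sketch of the "modified immediate" rule as we read it from the ThumbExpandImm pseudocode in the ARMv7-M Architecture Reference Manual: a 32-bit constant fits if it is one byte repeated in certain patterns, or an 8-bit value (with its top bit set) rotated right by 8 to 31 places. The function names are made up, and this is an aid for checking your intuition, not a replacement for the manual or the disassembly.

```python
MASK32 = 0xFFFFFFFF

def rotate_left(value, amount):
    """Rotate a 32-bit value left by `amount` bits."""
    amount %= 32
    return ((value << amount) | (value >> (32 - amount))) & MASK32

def is_modified_immediate(value):
    """True if `value` matches one of the modified immediate patterns."""
    value &= MASK32
    low = value & 0xFF
    high = (value >> 8) & 0xFF
    if value <= 0xFF:
        return True                      # 0x000000XY
    if value == low * 0x00010001:
        return True                      # 0x00XY00XY
    if value == high * 0x01000100:
        return True                      # 0xXY00XY00
    if value == low * 0x01010101:
        return True                      # 0xXYXYXYXY
    # Otherwise: an 8-bit value with its top bit set, rotated right by 8..31.
    return any(0x80 <= rotate_left(value, rot) <= 0xFF for rot in range(8, 32))

for constant in (0xFF, 0xFF000000, 0x00FF00FF, 0xFFFFFF00, 0x12345678):
    print(f"{constant:#010x}", is_modified_immediate(constant))
```

When a constant does not fit, the assembler has to either pick a different instruction or report an error, which is worth looking out for in the disassembly.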
Finalise your program so that the main function performs the "COPE" -> "HOPE" -> "HOPS" transformation and leaves the "HOPS" value in r2. Commit & push your changes up to your repo on the GitLab server.