Introduction

This text is a compilation of the notes I took while studying the RISC-V architecture and assembly language. It probably contains errors, and it is not meant as a substitute for the official specs or a good book on the subject. Consider it a mini-tutorial on RISC-V assembly language, or a gentle introduction to the RISC-V world.

Architecture

This text covers only the RV32I subset of RISC-V: the base integer registers and operations, which is enough for an introduction. The table below lists the 32-bit general-purpose integer registers and their intended role in assembly. Apart from x0, which always reads as zero, the roles are only a convention, since you can use any register any way you want. The rightmost column indicates who is responsible for saving the register across a procedure call. Again, this is a mere convention.

Register	Name	Description	Saver
x0	zero	Always zero	-
x1	ra	Return Address	Caller
x2	sp	Stack Pointer	Callee
x3	gp	Global Pointer	-
x4	tp	Thread Pointer	-
x5	t0	Temporary / Alternate Link Reg	Caller
x6-x7	t1-t2	Temporaries	Caller
x8	s0 / fp	Saved Register / Frame Pointer	Callee
x9	s1	Saved Register	Callee
x10-x11	a0-a1	Function Arguments / Return Values	Caller
x12-x17	a2-a7	Function Arguments	Caller
x18-x27	s2-s11	Saved Registers	Callee
x28-x31	t3-t6	Temporaries	Caller

Instructions

Keep in mind that most integer instructions treat numbers as two's complement values, which has the advantage of using the same circuitry and operations as plain unsigned integers. Instructions that treat numbers as unsigned integers are usually marked explicitly.
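To see why one adder can serve both interpretations, here is a rough Python sketch that models a 32-bit register as an integer masked to 32 bits. The names MASK32, to_signed and add32 are my own helpers for illustration, not part of any RISC-V tool.

```python
MASK32 = 0xFFFFFFFF  # a 32-bit register holds values 0 .. 2**32 - 1


def to_signed(x):
    """Interpret a 32-bit pattern as a two's complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def add32(a, b):
    """Model of `add`: plain binary addition that wraps at 32 bits."""
    return (a + b) & MASK32


# The same bit pattern, read both ways.
pattern = 0xFFFFFFFE
print(pattern, to_signed(pattern))   # 4294967294 -2

# One addition is correct under both readings:
# unsigned it wraps around, signed it is -2 + 3.
result = add32(0xFFFFFFFE, 3)
print(result, to_signed(result))     # 1 1
```

The addition itself never looks at the sign. Only instructions that compare, extend or shift right need to know which reading you mean.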

All instructions have the same length: 32 bits, or 4 bytes.

Instruction Table

Instruction	Name	Format	Description
add rd, rs1, rs2	ADD	R	rd=rs1+rs2
sub rd, rs1, rs2	SUBTRACT	R	rd=rs1-rs2
and rd, rs1, rs2	AND	R	rd=rs1 AND rs2
or rd, rs1, rs2	OR	R	rd=rs1 OR rs2
xor rd, rs1, rs2	XOR	R	rd=rs1 XOR rs2
sll rd, rs1, rs2	Shift Left Logical	R	rd=rs1 << rs2
srl rd, rs1, rs2	Shift Right Logical	R	rd=rs1 >> rs2
sra rd, rs1, rs2	Shift Right Arithmetic	R	rd=rs1 >> rs2 (signed)
slt rd, rs1, rs2	Set Less Than	R	if (rs1<rs2) rd=1 else rd=0; (signed)
sltu rd, rs1, rs2	Set Less Than Unsigned	R	if (rs1<rs2) rd=1 else rd=0; (unsigned)
addi rd, rs1, immediate	ADD Immediate	I	rd=rs1+immediate
andi rd, rs1, immediate	AND Immediate	I	rd=rs1 AND immediate
ori rd, rs1, immediate	OR Immediate	I	rd=rs1 OR immediate
xori rd, rs1, immediate	XOR Immediate	I	rd=rs1 XOR immediate
slli rd, rs1, immediate	Shift Left Logical Immediate	I	rd=rs1 << immediate
srli rd, rs1, immediate	Shift Right Logical Immediate	I	rd=rs1 >> immediate
srai rd, rs1, immediate	Shift Right Arithmetic Immediate	I	rd=rs1 >> immediate (signed)
slti rd, rs1, immediate	Set Less Than Immediate	I	if (rs1<immediate) rd=1 else rd=0; (signed)
sltiu rd, rs1, immediate	Set Less Than Immediate Unsigned	I	if (rs1<immediate) rd=1 else rd=0; (unsigned)
lb rd, offset(rs1)	Load Byte	I	rd=sign_extend(Memory_byte[rs1+offset])
lh rd, offset(rs1)	Load Half	I	rd=sign_extend(Memory_halfword[rs1+offset])
lw rd, offset(rs1)	Load Word	I	rd=Memory[rs1+offset]
lbu rd, offset(rs1)	Load Byte Unsigned	I	rd=zero_extend(Memory_byte[rs1+offset])
lhu rd, offset(rs1)	Load Halfword Unsigned	I	rd=zero_extend(Memory_halfword[rs1+offset])
sb rs2, offset(rs1)	Store Byte	S	Memory_byte[rs1+offset]=lower_byte(rs2)
sh rs2, offset(rs1)	Store Half	S	Memory_halfword[rs1+offset]=lower_halfword(rs2)
sw rs2, offset(rs1)	Store Word	S	Memory[rs1+offset]=rs2
beq rs1, rs2, label	Branch if Equal	B	if (rs1==rs2) PC=label;
bne rs1, rs2, label	Branch if Not Equal	B	if (rs1!=rs2) PC=label;
blt rs1, rs2, label	Branch if Less Than	B	if (rs1<rs2) PC=label; (signed)
bge rs1, rs2, label	Branch if Greater or Equal	B	if (rs1>=rs2) PC=label; (signed)
bltu rs1, rs2, label	Branch if Less Than Unsigned	B	if (rs1<rs2) PC=label; (unsigned)
bgeu rs1, rs2, label	Branch if Greater or Equal Unsigned	B	if (rs1>=rs2) PC=label; (unsigned)
jal rd, label	Jump And Link	J	rd=PC+4; PC=label
jalr rd, offset(rs1)	Jump And Link Register	I	rd=PC+4; PC=(rs1+offset) with bit 0 cleared
lui rd, immediate	Load Upper Immediate	U	rd=immediate<<12
auipc rd, immediate	Add Upper Immediate to PC	U	rd=PC+(immediate<<12)
ecall	Environment Call	I	Transfer Control to the OS
ebreak	Environment Break	I	Transfer Control to Debugger

add rd, rs1, rs2

Adds registers rs1 and rs2 and puts the result in rd.

sub rd, rs1, rs2

Subtracts rs2 from rs1 and puts the result in rd.

and rd, rs1, rs2

Logically ANDs registers rs1 and rs2 and puts the result in rd.

or rd, rs1, rs2

Logically ORs registers rs1 and rs2 and puts the result in rd.

xor rd, rs1, rs2

Logically XORs registers rs1 and rs2 and puts the result in rd.

sll rd, rs1, rs2

Shifts rs1 left by the number of bits given in rs2 (only the lower 5 bits of rs2 are used) and puts the result in rd.

srl rd, rs1, rs2

Shifts rs1 right by the number of bits given in rs2 (only the lower 5 bits of rs2 are used), filling the vacated bits with zeros, and puts the result in rd.

sra rd, rs1, rs2

Arithmetically shifts rs1 right by the number of bits given in rs2, filling the vacated bits with copies of the sign bit, and puts the result in rd. For more information about arithmetic shifts, see https://en.wikipedia.org/wiki/Arithmetic_shift
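The difference between the logical and the arithmetic right shift only shows when the top bit of rs1 is set. The following Python sketch models both; the function names mirror the instructions, but the code is just my illustration of the behavior.

```python
MASK32 = 0xFFFFFFFF


def srl(rs1, rs2):
    """Logical right shift: vacated high bits are filled with zeros."""
    return (rs1 & MASK32) >> (rs2 & 0x1F)   # only the low 5 bits of rs2 count


def sra(rs1, rs2):
    """Arithmetic right shift: vacated high bits copy the sign bit."""
    value = rs1 & MASK32
    if value & 0x80000000:
        value -= 1 << 32                     # reinterpret the pattern as negative
    # Python's >> on a negative int is already arithmetic.
    return (value >> (rs2 & 0x1F)) & MASK32


x = 0xFFFFFFF0          # -16 when read as a signed value
print(hex(srl(x, 2)))   # 0x3ffffffc
print(hex(sra(x, 2)))   # 0xfffffffc, which is -4
```

For a non-negative rs1 both functions return the same result.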

slt rd, rs1, rs2

Sets rd to 1 if rs1 is less than rs2 using signed number comparison, else it sets rd to 0.

sltu rd, rs1, rs2

Sets rd to 1 if rs1 is less than rs2 using unsigned number comparison, else it sets rd to 0.
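The same two bit patterns can compare differently depending on the interpretation. A small Python sketch of slt and sltu, with to_signed as a helper of my own:

```python
MASK32 = 0xFFFFFFFF


def to_signed(x):
    """Interpret a 32-bit pattern as a two's complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def slt(rs1, rs2):
    """Signed comparison."""
    return 1 if to_signed(rs1) < to_signed(rs2) else 0


def sltu(rs1, rs2):
    """Unsigned comparison of the raw bit patterns."""
    return 1 if (rs1 & MASK32) < (rs2 & MASK32) else 0


a, b = 0xFFFFFFFF, 1
print(slt(a, b))    # 1, because a reads as -1
print(sltu(a, b))   # 0, because a reads as 4294967295

# With zero as the first operand, sltu answers "is rs non-zero?"
print(sltu(0, 7), sltu(0, 0))   # 1 0
```

The last line is the idea behind the snez pseudoinstruction, which expands to sltu rd, x0, rs.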

addi rd, rs1, immediate

Adds register rs1 and a sign-extended 12-bit immediate value and puts the result in rd.
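Sign extension is what lets a 12-bit field hold values from -2048 to 2047. The sketch below is a Python model, not real hardware; sign_extend and addi are illustrative helpers I wrote for this text.

```python
def sign_extend(value, bits):
    """Read the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def addi(rs1, imm12):
    """Model of addi: add the sign-extended 12-bit field, wrap at 32 bits."""
    return (rs1 + sign_extend(imm12, 12)) & 0xFFFFFFFF


print(sign_extend(0x7FF, 12))      # 2047, the largest positive immediate
print(sign_extend(0x800, 12))      # -2048, the most negative immediate
print(hex(addi(0x1000, 0xFFF)))    # 0xfff, since the field 0xfff means -1
```

This is why a single addi cannot add 4095: the bit pattern 0xfff is read as -1.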

andi rd, rs1, immediate

Logically ANDs register rs1 and a sign-extended 12-bit immediate value and puts the result in rd.

ori rd, rs1, immediate

Logically ORs register rs1 and a sign-extended 12-bit immediate value and puts the result in rd.

xori rd, rs1, immediate

Logically XORs register rs1 and a sign-extended 12-bit immediate value and puts the result in rd.

slli rd, rs1, immediate

Shifts rs1 left by a 5-bit immediate value and puts the result in rd.

srli rd, rs1, immediate

Shifts rs1 right by a 5-bit immediate value and puts the result in rd.

srai rd, rs1, immediate

Arithmetically shifts rs1 right by a 5-bit immediate value and puts the result in rd. For more information about arithmetic shifts, see https://en.wikipedia.org/wiki/Arithmetic_shift

slti rd, rs1, immediate

Sets rd to 1 if rs1 is less than the sign-extended 12-bit immediate using signed number comparison, else it sets rd to 0.

sltiu rd, rs1, immediate

Sets rd to 1 if rs1 is less than the immediate using unsigned number comparison, else it sets rd to 0. The 12-bit immediate is still sign-extended first and only then treated as an unsigned number.

lb rd, offset(rs1)

Loads a byte from the memory position rs1+offset, sign-extends it, and writes it to rd. The offset is a signed 12-bit value. For more information about sign extension, see https://en.wikipedia.org/wiki/Sign_extension

lh rd, offset(rs1)

Loads a half word from the memory position rs1+offset, sign-extends it, and writes it to rd. The offset is a signed 12-bit value. For more information about sign extension, see https://en.wikipedia.org/wiki/Sign_extension

lw rd, offset(rs1)

Loads a word from the memory position rs1+offset and writes it to rd. The offset is a signed 12-bit value.

lbu rd, offset(rs1)

Loads a byte from the memory position rs1+offset, zero-extends it, and writes it to rd. The offset is a signed 12-bit value. For more information about zero extension, see https://en.wikipedia.org/wiki/Sign_extension

lhu rd, offset(rs1)

Loads a half word from the memory position rs1+offset, zero-extends it, and writes it to rd. The offset is a signed 12-bit value. For more information about zero extension, see https://en.wikipedia.org/wiki/Sign_extension
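The signed and unsigned loads differ only in how they fill the upper bits of rd. Here is a Python sketch of the four narrow loads over a toy memory. I assume a little-endian memory, which is the usual RISC-V configuration; the memory contents and the function bodies are made up for the illustration.

```python
memory = bytearray(16)   # toy memory, all zeros
memory[4] = 0x80         # a byte with its top bit set
memory[5] = 0xFF


def lb(addr):
    """Load byte, sign-extended to 32 bits."""
    return int.from_bytes(memory[addr:addr + 1], "little", signed=True) & 0xFFFFFFFF


def lbu(addr):
    """Load byte, zero-extended."""
    return memory[addr]


def lh(addr):
    """Load half word, sign-extended to 32 bits."""
    return int.from_bytes(memory[addr:addr + 2], "little", signed=True) & 0xFFFFFFFF


def lhu(addr):
    """Load half word, zero-extended."""
    return int.from_bytes(memory[addr:addr + 2], "little", signed=False)


print(hex(lb(4)))    # 0xffffff80
print(hex(lbu(4)))   # 0x80
print(hex(lh(4)))    # 0xffffff80
print(hex(lhu(4)))   # 0xff80
```

Use the unsigned variants for raw bytes such as characters, and the signed ones when the memory holds small signed numbers.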

sb rs2, offset(rs1)

Writes bits 0 to 7 (the low byte) of register rs2 to the memory position rs1+offset. The offset is a signed 12-bit value.

sh rs2, offset(rs1)

Writes bits 0 to 15 (the low half word) of register rs2 to the memory position rs1+offset. The offset is a signed 12-bit value.

sw rs2, offset(rs1)

Writes the content of register rs2 to the memory position rs1+offset. The offset is a signed 12-bit value.
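The narrow stores write only part of the register and leave the neighboring bytes alone. A Python sketch over a toy memory follows, again assuming little-endian byte order; the addresses and values are arbitrary.

```python
memory = bytearray(8)    # toy memory, all zeros


def sw(rs2, addr):
    """Store all 32 bits of rs2."""
    memory[addr:addr + 4] = (rs2 & 0xFFFFFFFF).to_bytes(4, "little")


def sh(rs2, addr):
    """Store bits 0 to 15 of rs2."""
    memory[addr:addr + 2] = (rs2 & 0xFFFF).to_bytes(2, "little")


def sb(rs2, addr):
    """Store bits 0 to 7 of rs2."""
    memory[addr] = rs2 & 0xFF


sw(0x11223344, 0)     # bytes 44 33 22 11
sb(0xAABBCCDD, 1)     # only 0xDD is written, replacing the 33
sh(0xAABBCCDD, 4)     # only 0xCCDD is written, as DD CC
print(memory.hex())   # 44dd2211ddcc0000
```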

beq rs1, rs2, label

Jumps to label if rs1 and rs2 are equal. Internally, at the instruction level, the target is encoded as a signed 12-bit value that is shifted left by one bit and added to the PC, giving a jump range of about ±4 KiB.

bne rs1, rs2, label

Jumps to label if rs1 and rs2 are not equal. Internally, at the instruction level, the target is encoded as a signed 12-bit value that is shifted left by one bit and added to the PC, giving a jump range of about ±4 KiB.

blt rs1, rs2, label

Jumps to label if rs1 is less than rs2 using signed number comparison. Internally, at the instruction level, the target is encoded as a signed 12-bit value that is shifted left by one bit and added to the PC, giving a jump range of about ±4 KiB.

bge rs1, rs2, label

Jumps to label if rs1 is greater than or equal to rs2 using signed number comparison. Internally, at the instruction level, the target is encoded as a signed 12-bit value that is shifted left by one bit and added to the PC, giving a jump range of about ±4 KiB.

bltu rs1, rs2, label

Jumps to label if rs1 is less than rs2 using unsigned number comparison. Internally, at the instruction level, the target is encoded as a signed 12-bit value that is shifted left by one bit and added to the PC, giving a jump range of about ±4 KiB.

bgeu rs1, rs2, label

Jumps to label if rs1 is greater than or equal to rs2 using unsigned number comparison. Internally, at the instruction level, the target is encoded as a signed 12-bit value that is shifted left by one bit and added to the PC, giving a jump range of about ±4 KiB.
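The reach of a branch follows directly from the size of the encoded field. This Python sketch computes it for a signed field that gets shifted left by one bit, as the branch descriptions above say; branch_reach and in_reach are names I made up for the illustration.

```python
def branch_reach(imm_bits):
    """Byte offsets reachable with a signed field of imm_bits bits
    that is shifted left by one before being added to the PC."""
    lowest = -(1 << (imm_bits - 1)) * 2
    highest = ((1 << (imm_bits - 1)) - 1) * 2
    return lowest, highest


def in_reach(pc, target, imm_bits):
    """True if target can be encoded relative to pc."""
    lowest, highest = branch_reach(imm_bits)
    offset = target - pc
    return lowest <= offset <= highest and offset % 2 == 0


print(branch_reach(12))   # (-4096, 4094), the conditional branches
print(branch_reach(20))   # (-1048576, 1048574), a 20-bit field as used by jal

print(in_reach(0x1000, 0x1000 + 4094, 12))   # True
print(in_reach(0x1000, 0x1000 + 4096, 12))   # False, one step too far
```

When a label is out of reach, the usual workaround is to branch over an unconditional jump, which has a much longer range.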

jal rd, label

Stores the address of the next instruction in register rd and jumps to label. Internally, at the instruction level, the target is encoded as a signed 20-bit value that is shifted left by one bit and added to the PC, giving a jump range of about ±1 MiB.

jalr rd, offset(rs1)

Stores the address of the next instruction in register rd and jumps to rs1+offset. Internally, at the instruction level, offset is a signed 12-bit value that is added to register rs1, and the least significant bit of the result is set to 0. This instruction exists so that the program can jump to any 32-bit address, since any arbitrary value can be loaded into rs1.
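As a quick check of that description, here is a Python model of jalr that returns both the link value and the jump target. The addresses are arbitrary, and sign_extend is a helper of my own.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value, bits):
    """Read the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def jalr(pc, rs1, offset12):
    """Return (value written to rd, new PC) for jalr rd, offset(rs1)."""
    link = (pc + 4) & MASK32
    target = ((rs1 + sign_extend(offset12, 12)) & MASK32) & ~1   # clear bit 0
    return link, target


link, target = jalr(0x100, 0x2003, 0)
print(hex(link), hex(target))    # 0x104 0x2002

link, target = jalr(0x100, 0x2001, 0xFFF)   # offset field 0xfff means -1
print(hex(link), hex(target))    # 0x104 0x2000
```

The link value does not depend on the target at all, which is what makes ret (jalr x0, 0(x1)) work after a call.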

lui rd, immediate

Loads immediate into the upper 20 bits of rd and fills the lower 12 bits with zeros. This instruction is designed to work in tandem with addi, which fills in the lower 12 bits, effectively loading a 32-bit constant. Consider the following example:


            lui a0, %hi(PRIMESTRING)      # this loads the top 20 bits 
            addi a0, a0, %lo(PRIMESTRING) # this loads the bottom 12 bits

You might think there is an error here, since addi sign-extends its 12-bit immediate before adding it to the upper 20 bits of the address, but there is not. The assembler modifiers %hi and %lo are designed to be used in pairs: %hi already accounts for the sign extension done by addi when it returns the upper 20 bits, so the two instructions together produce the correct address of PRIMESTRING. The alternative is to use the la (Load Address) pseudoinstruction.
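The compensation can be written down in a few lines. The Python sketch below shows one way to compute the two halves so that they recombine correctly; hi, lo and lui_addi are my own names that model the behavior, and the address is a hypothetical one chosen so that bit 11 is set.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value, bits):
    """Read the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def hi(addr):
    """Upper 20 bits, bumped by one when the low 12 bits will read as negative."""
    return ((addr + 0x800) >> 12) & 0xFFFFF


def lo(addr):
    """Low 12 bits, as the signed value that addi will really add."""
    return sign_extend(addr & 0xFFF, 12)


def lui_addi(addr):
    rd = (hi(addr) << 12) & MASK32    # lui  rd, %hi(addr)
    rd = (rd + lo(addr)) & MASK32     # addi rd, rd, %lo(addr)
    return rd


addr = 0x12345ABC                     # hypothetical address with bit 11 set
naive = (((addr >> 12) << 12) + lo(addr)) & MASK32   # forgetting the correction
print(hex(naive))                     # 0x12344abc, off by 0x1000
print(hex(lui_addi(addr)))            # 0x12345abc
```

Adding 0x800 before the shift is just a compact way of saying "add 1 to the upper part when bit 11 of the address is set".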

auipc rd, immediate

Shifts immediate left by 12 bits, adds it to the address of the current instruction (the PC), and stores the result in rd. In other words, immediate supplies the upper 20 bits of an offset whose lower 12 bits are zero. This instruction exists so that PC-relative addresses can be loaded. Consider the following code:


    1:
    auipc	a0, %pcrel_hi(msg)
    addi	a0, a0, %pcrel_lo(1b)

In this code, %pcrel_hi(msg) returns the upper 20 bits of the address of msg relative to the PC, while %pcrel_lo(1b) returns the lower 12 bits of that same offset, measured from the auipc instruction at label 1. It takes the label of the auipc rather than msg because %pcrel_lo() is designed to be paired with a preceding %pcrel_hi(). The assembler and the linker do the magic. Take a look at https://sourceware.org/binutils/docs/as/RISC_002dV_002dModifiers.html or https://github.com/riscv-non-isa/riscv-asm-manual/blob/master/riscv-asm.md for more information.
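The arithmetic behind the pair can be sketched in Python. Here pc stands for the address of the auipc instruction and symbol for the address of msg; both values in the example are hypothetical, and pcrel_split and auipc_addi are names I invented for the illustration.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value, bits):
    """Read the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def pcrel_split(pc, symbol):
    """Split delta = symbol - pc into the two parts used by auipc and addi."""
    delta = (symbol - pc) & MASK32
    hi20 = ((delta + 0x800) >> 12) & 0xFFFFF    # compensates for the signed low part
    lo12 = sign_extend(delta & 0xFFF, 12)
    return hi20, lo12


def auipc_addi(pc, symbol):
    hi20, lo12 = pcrel_split(pc, symbol)
    a0 = (pc + (hi20 << 12)) & MASK32    # auipc a0, %pcrel_hi(msg)
    a0 = (a0 + lo12) & MASK32            # addi  a0, a0, %pcrel_lo(1b)
    return a0


pc, msg = 0x00400010, 0x10010F00         # hypothetical addresses
print(hex(auipc_addi(pc, msg)))          # 0x10010f00
```

Both parts are computed from the same delta, which is why %pcrel_lo needs to know where the auipc is rather than where msg is.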

The following code makes an infinite loop, since ra is loaded with the address of the auipc instruction itself, and the jalr jumps back to it.


    auipc	ra,0x0      
    jalr	ra,0(ra)

ecall

Transfers control to the Operating System. The exact functioning of this instruction depends on the machine/environment running the program.

ebreak

Transfers control to the debugger. The exact functioning of this instruction depends on the machine/environment running the program.

Pseudoinstructions

Pseudoinstruction Table

Pseudoinstruction	Base Instructions	Description
la rd, symbol	auipc rd, delta[31 : 12] + delta[11] addi rd, rd, delta[11:0]	Load Absolute Address where delta = (symbol − PC)
l{b|h|w|d} rd, symbol	auipc rd, (delta[31 : 12] + delta[11]) l{b|h|w|d} rd, delta[11:0] (rd)	Load byte/halfword/word/double from any 32bit addr
s{b|h|w|d} rd, symbol, rt	auipc rt, (delta[31 : 12] + delta[11]) s{b|h|w|d} rd, delta[11:0] (rt)	Store byte/halfword/word/double to any 32bit addr
nop	addi x0, x0, 0	No Operation
li rd, immediate	lui rd, (immediate[31 : 12] + immediate[11]) addi rd, rd, immediate[11:0]	Load 32bit immediate
mv rd, rs	addi rd, rs, 0	Move Register
not rd, rs	xori rd, rs, -1	Not Register/One’s complement
neg rd, rs	sub rd, x0, rs	Negate Register/Two’s complement
seqz rd, rs	sltiu rd, rs, 1	Set if = zero
snez rd, rs	sltu rd, x0, rs	Set if <> zero
sltz rd, rs	slt rd, rs, x0	Set if < zero
sgtz rd, rs	slt rd, x0, rs	Set if > zero
beqz rs, label	beq rs, x0, offset	Branch if = zero
bnez rs, label	bne rs, x0, offset	Branch if <> zero
blez rs, label	bge x0, rs, offset	Branch if <= zero
bgez rs, label	bge rs, x0, offset	Branch if >= zero
bltz rs, label	blt rs, x0, offset	Branch if < zero
bgtz rs, label	blt x0, rs, offset	Branch if > zero
bgt rs, rt, offset	blt rt, rs, offset	Branch if >
ble rs, rt, offset	bge rt, rs, offset	Branch if <=
bgtu rs, rt, offset	bltu rt, rs, offset	Branch if > (unsigned comp.)
bleu rs, rt, offset	bgeu rt, rs, offset	Branch if <= (unsigned comp.)
j label	jal x0, label	Jump
jal label	jal x1, label	Jump And Link
jr rs	jalr x0, 0(rs)	Jump Register
jalr rs	jalr x1, 0(rs)	Jump And Link Register
ret	jalr x0, 0(x1)	Return from subroutine
call label	auipc x1, delta[31 : 12] + delta[11] jalr x1, delta[11:0](x1)	Call far-away subroutine where delta = (label − PC)
tail label	auipc x6, delta[31 : 12] + delta[11] jalr x0, delta[11:0](x6)	Tail call far-away subroutine where delta = (label − PC)

Many of these pseudoinstructions use two instructions to load a 32-bit value. If you have trouble visualizing it, consider the following example.

You want to load the 32-bit value 0x0ff0fff0:

1. Split the value into the upper 20 bits (0x0ff0f) and the lower 12 bits (0xff0).

2. Add 1 to the upper 20 bits, since the lower 12 bits will be treated as negative (their sign bit is 1). You get 0x0ff10.

3. Load that value into the register you want with the lui/auipc instruction. The register now holds 0x0ff10000.

4. Take the lower 12 bits and sign-extend them, since they will be added via addi/jalr. You get 0xfffffff0.

5. Add the sign-extended 12 bits (0xfffffff0) to the value already in the register (0x0ff10000). Discarding the carry out of bit 31, you get the value you wanted (0x0ff0fff0).

Examples

To run these examples I used the BRISC-V simulator. I chose it because you don't need to install anything to run your RISC-V assembly programs: you simply load and run them on its website. A compilation of examples and the list of system calls it supports are available online. It has problems too. BRISC-V doesn't support all the directives that more complete assemblers have, so you need to be careful about how you structure your programs.

The first example follows:

 
# kernel code:
# it sets up the stack pointer and calls main
	addi	zero,zero,0 
kernel:             
	addi	sp,zero,1536
	call	main        
	addi	zero,zero,0 
	mv      s1,a0   
	addi	zero,zero,0 
	addi	zero,zero,0 
	auipc	ra,0x0      
	jalr	ra,0(ra)    
	addi	zero,zero,0 
	addi	zero,zero,0 

    #here it goes the read only data
    .rodata
.HELLO:
    # "Hello World!\r" plus NUL padding, packed into little-endian words
    .word 0x6C6C6548
    .word 0x6F57206F
    .word 0x21646C72
    .word 0x0000000D

    #here it goes the code
    .text
main:
    # print the string .HELLO
    addi t0, zero, 3         # this is the string printing syscall 
    lui a0, %hi(.HELLO)      # this loads the top 20 bits 
                             # of .HELLO address into a0
    addi a0, a0, %lo(.HELLO) # this loads the bottom 12 bits
    addi a1, zero, 13         # length of the string
    ecall

    #ask the user for a number 
    addi t0, zero, 4
    ecall

    #now a0 contains the number
    #do a countdown

countdown:
    # print the number
    addi t0, zero, 1
    ecall 

    #iterate until a0 is negative
    addi a0, a0, -1 		
    bge a0, zero, countdown

I kept my first program simple. It prints the string "Hello World!", asks for a number, and prints a countdown to zero. Note that BRISC-V doesn't support the asciiz directive, so we need to spell out the words that make up the string. The rest is pretty straightforward: it simply makes calls to the simulator using the ecall instruction. You can ignore the first chunk of code. It's just there to initialize the stack pointer and call the main function.
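Working out those .word values by hand is tedious and easy to get wrong. A small Python helper can do the packing; this is a sketch of my own, not something BRISC-V provides, and it assumes ASCII text stored in little-endian words with NUL padding, which matches the data in the listing above.

```python
import struct


def string_to_words(text):
    """Pack an ASCII string plus a terminating NUL into little-endian
    32-bit words, padding the last word with NUL bytes."""
    data = text.encode("ascii") + b"\0"
    data += b"\0" * (-len(data) % 4)          # pad to a whole number of words
    return [word for (word,) in struct.iter_unpack("<I", data)]


for word in string_to_words("Hello World!\r"):
    print(f"    .word 0x{word:08X}")
```

Running it prints the four .word lines used for .HELLO. Note that the first character ends up in the lowest byte of each word, which is why the hex digits look reversed.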

The second example is a bit more complex:

 
# kernel code:
# it sets up the stack pointer and calls main
	addi	zero,zero,0 
kernel:             
	addi	sp,zero,1536
	call	main        
	addi	zero,zero,0 
	mv      s1,a0   
	addi	zero,zero,0 
	addi	zero,zero,0 
	auipc	ra,0x0      
	jalr	ra,0(ra)    
	addi	zero,zero,0 
	addi	zero,zero,0 

    #here it goes the read only data
    .rodata
PRIMESTRING:
    .word 0x4D495250
    .word 0x00000D45
NOTPRIMESTRING:
    .word 0x20544F4E
    .word 0x4D495250
    .word 0x00000D45

    #here it goes the code
    .text
main:

    addi a0, zero, 2
    addi t0, zero, 1 	#number print service
    ecall 			#call OS, print the number in a0
    call isprime 		#check if it's prime
    addi a0, zero, 5
    addi t0, zero, 1 	#number print service
    ecall 			#call OS, print the number in a0
    call isprime 		#check if it's prime
    addi a0, zero, 4
    addi t0, zero, 1 	#number print service
    ecall 			#call OS, print the number in a0
    call isprime 		#check if it's prime
    addi a0, zero, 10
    addi t0, zero, 1 	#number print service
    ecall 			#call OS, print the number in a0
    call isprime 		#check if it's prime
    addi a0, zero, 11
    addi t0, zero, 1 	#number print service
    ecall 			#call OS, print the number in a0
    call isprime 		#check if it's prime
    addi a0, zero, 43
    addi t0, zero, 1 	#number print service
    ecall 			#call OS, print the number in a0
    call isprime 		#check if it's prime
    addi a0, zero, 44
    addi t0, zero, 1 	#number print service
    ecall 			#call OS, print the number in a0
    call isprime 		#check if it's prime
    j programexit

# subroutine divide
# divides a0 (dividend) by a1 (divisor)
# returns a0 (remainder), a1 (quotient)
# uses t registers
divide:
	addi t0, zero, 0 #reset temporary quotient
divideloop:
	blt a0, a1, divideexit 	#exit if dividend less than divisor
	sub a0, a0, a1
	addi t0, t0, 1 		#add 1 to quotient
	jal zero, divideloop 	#pseudoinstruction j
divideexit:
	addi a1, t0, 0
	jalr zero, ra, 0 	#pseudoinstruction ret


# subroutine isprime
# prints "PRIME" or "NOT PRIME" according to a number passed in a0
isprime:
	addi	sp,sp,-16
	sw	ra,0(sp) 	#save return address
	#
	addi t6, zero, 2
	addi t5, a0, -1
	#
isprimeloop:
	blt t5, t6, isprimeprintprime
	# save t registers used
	sw t5, 4(sp)
	sw t6, 8(sp)
	sw a0, 12(sp)
	#
	# a0 contains the initial argument
	mv a1, t5
	jal ra, divide 		#pseudoinstruction call, we call subroutine divide
	beq a0, zero, isprimeprintnotprime 	# if remainder is zero, we have a divisor
	# restore t registers used
	lw t5, 4(sp)
	lw t6, 8(sp)
	lw a0, 12(sp)
	#
	addi t5, t5, -1
	j isprimeloop
	#
isprimeprintprime:
    	addi t0, zero, 3         # this is the string printing syscall 
    	lui a0, %hi(PRIMESTRING)      # this loads the top 20 bits 
	addi a0, a0, %lo(PRIMESTRING) # this loads the bottom 12 bits
    	addi a1, zero, 6         # length of the string
    	ecall
	j isprimeexit
isprimeprintnotprime:
    	addi t0, zero, 3         # this is the string printing syscall 
    	lui a0, %hi(NOTPRIMESTRING)      # this loads the top 20 bits 
	addi a0, a0, %lo(NOTPRIMESTRING) # this loads the bottom 12 bits
    	addi a1, zero, 10         # length of the string
    	ecall
	j isprimeexit
isprimeexit:
	lw	ra,0(sp) 	#restore return address
	addi	sp,sp,16
	ret

#
# program exit point
programexit:

It checks whether a number is prime and prints a message accordingly. It has two subroutines: isprime, which checks whether the number passed in a0 is prime, and divide, which isprime relies on to test the candidate divisors of the given number.
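If the control flow is hard to follow in assembly, this Python transcription of the two subroutines may help. It mirrors the listing step by step (division by repeated subtraction, candidates tried from n-1 down to 2) and is meant as a reading aid, not as a good primality test.

```python
def divide(dividend, divisor):
    """Division by repeated subtraction, like the divide subroutine.
    Returns (remainder, quotient), matching a0 and a1 on return."""
    quotient = 0
    while dividend >= divisor:
        dividend -= divisor
        quotient += 1
    return dividend, quotient


def isprime(n):
    """Try every candidate divisor from n-1 down to 2, like isprime."""
    candidate = n - 1                  # t5 in the assembly
    while candidate >= 2:              # t6 holds the constant 2
        remainder, _ = divide(n, candidate)
        if remainder == 0:
            return False               # found a divisor
        candidate -= 1
    return True


for n in (2, 5, 4, 10, 11, 43, 44):    # the numbers tested by main
    print(n, "PRIME" if isprime(n) else "NOT PRIME")
```

Like the assembly, this version does not special-case numbers below 2, so it would report 0 and 1 as prime.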

Recently I found a better online RISC-V assembly simulator called VENUS. It's much more complete than other toys on the internet, and I recommend it. Its documentation is available online.

I wrote one more example for VENUS. This time it prints a given number of Fibonacci numbers.

The VENUS example follows:

 
    .data

    carrret:
        .asciiz "\n"
    seed0:
        .word 0
    seed1:
        .word 1
    spacestr:
        .asciiz " "
    
    .text
        li s0, 5        #number of fibonacci numbers to print
                        #https://github.com/61c-teach/venus/wiki/Environmental-Calls
    
        lw a0, seed0	#load the first two numbers
        lw a1, seed1
    fiboloop:
        ble s0, zero, programexit
        call fibonacci
        addi s0, s0, -1
        j fiboloop
    
    #prints the next Fibonacci number
    # a0 and a1 contain the two previous Fibonacci numbers
    #returns the two most recent Fibonacci numbers in a0 and a1 too
    fibonacci:
        add t0, a0, a1
        add a0, a1, zero
        add a1, t0, zero
        #
        addi sp, sp, -8
        # save the numbers in a0, a1
        sw a0, 0(sp)
        sw a1, 4(sp)
        # print the number in a1
        addi a0, zero, 1
        ecall
        # print space
        li a0, 4        #print space string
        la a1, spacestr	#asciiz string addr
        ecall
        # restore the numbers in a0, a1
        lw a0, 0(sp)
        lw a1, 4(sp)
        addi sp, sp, 8
        ret	
    
    programexit:
        li a0, 4        #print string carriage return 
        la a1, carrret 	#asciiz string addr
        ecall
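For comparison, here is the same logic in Python. The function fibonacci plays the role of the subroutine (without the printing), and run plays the role of the main loop; both are my own transcription for checking what the program should print.

```python
def fibonacci(a0, a1):
    """One step of the subroutine: returns the new (a0, a1) pair."""
    t0 = a0 + a1
    return a1, t0


def run(count):
    """Main loop: start from the seeds 0 and 1 and collect what gets printed."""
    a0, a1 = 0, 1              # seed0 and seed1
    printed = []
    for _ in range(count):     # s0 counts down in the assembly
        a0, a1 = fibonacci(a0, a1)
        printed.append(a1)     # the subroutine prints a1
    return printed


print(" ".join(str(n) for n in run(5)))   # 1 2 3 5 8
```

The assembly prints a space after every number, including the last one, and then a newline at programexit.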