1
0
mirror of https://github.com/antonblanchard/microwatt.git synced 2026-01-11 23:43:15 +00:00

73 Commits

Author SHA1 Message Date
Paul Mackerras
d2bf3f3580 core: Implement hypervisor doorbell interrupt and msg* instructions
This implements the hypervisor doorbell exception and interrupt and
the msgsnd, msgclr and msgsync instructions (msgsync is a no-op).  The
msgsnd instruction can generate a hypervisor doorbell interrupt on any
CPU in the system.  To achieve this, each core sends its hypervisor
doorbell messages to the soc level, which ORs together the bits for
each CPU and sends it to that CPU.

The privileged doorbell exception/interrupt and the msgsndp/msgclrp
instructions are not required since we don't implement SMT.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2025-08-09 12:13:44 +10:00
Paul Mackerras
ca872faede core: Consolidate several OP_* values into a single OP_COMPUTE
This replaces OP_ADDG6S, OP_BCD, OP_BREV, OP_CMPB, OP_CMPEQB,
OP_CMPRB, OP_CROP, OP_EXTS, OP_EXTSWSLI, OP_ISEL, OP_LOGIC, OP_MFCR,
OP_PRTY, OP_RLC, OP_RLCL, OP_RLCR, OP_SETB, OP_SHL, OP_SHR,
and OP_XOR with a single OP_COMPUTE.  The replaced operations are all
ones which just compute a result value (for GPR or CR) in execute1,
don't have any other side effects, and aren't used in decode2 to
determine other signals.  The operation to be performed is
sufficiently defined by the result and subresult fields in the decode
table.  With the elimination of OP_SPARE, this reduces the number of
insn_type_t values to 44.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2025-08-09 12:13:41 +10:00
Paul Mackerras
54173a0677 decode: Move result_sel and subresult_sel into main decode table
Instead of working out result_sel and subresult_sel in decode2 from
the insn_type, they now come directly from the main decode table in
decode1.  This reduces the need for distinct insn_type values and
should enable us to avoid expanding insn_type beyond 6 bits.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2025-05-31 21:02:09 +10:00
Paul Mackerras
e14712e45c core: Simplify operand presentation for hash instructions
This removes the cases in the decode stages which allowed the C
register address to come from the RB field for the hash instructions
(hashst[p], hashchk[p]), and generated a negative immediate value for
the B operand.  The motivation is to simpify the logic for the C
register address.  Instead the unusual construction of the address for
the hash instructions is handled in the loadstore1_in process, and the
hash computation uses the A and B operands rather than A and C.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2025-04-18 22:06:46 +10:00
Paul Mackerras
2c7d1e5d9c decode: Split input B selection into two fields
Instead of a single input_reg_b_t field in the decode table which
select both whether input B is a register or constant, and also which
constant (immediate value) to use, we now have one field which selects
whether input B is immediate (constant), a GPR, or an FPR, and a
separate field to select which sort of immediate value to use.  This
results in simpler logic and better timing.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2025-01-29 11:15:01 +11:00
Paul Mackerras
3bcc31fdda core: Implement hashstp and hashchkp instructions and HASHPKEYR register
These provide facilities similar to hashstp, hashchk and HASHKEYR, but
restricted to privileged mode.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2025-01-25 14:56:31 +11:00
Paul Mackerras
00a3db8457 decode1: Indicate instruction privilege in main decode table
Previously the computation of whether an instruction is privileged or
not was done based on the insn_type.  However, that meant that l*cix
(OP_LOAD) and st*cix (OP_STORE) couldn't be made privileged, and
neither could tlbsync (OP_NOP).

Instead, this adds a field to the main instruction decode table to
indicate privileged instructions, and makes the cache-inhibited loads
and stores privileged, along with tlbsync.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2025-01-25 14:56:27 +11:00
Paul Mackerras
0a11e8455f core: Implement hashst and hashchk instructions
These are done in loadstore1.  The HashDigest function is computed in
9 cycles; for 8 cycles, a state machine does 4 steps of key expansion
per cycle, and for each of 4 lanes of data, does 4 steps of ciphering;
then there is 1 cycle to combine the results into the final hash
value.

At present, hashcmp does not overlap the computation of the hash with
fetching of data from memory (in the case of a cache miss).

The 'is_signed' field in the instruction decode table is used to
distinguish hashst and hashcmp from ordinary loads and stores.  We
have a new 'RBC' value for input_reg_c_t which says that we are
reading RB but we want the value to come in via the C port; this is
because we want the 5-bit immediate offset on the B port.

Note that in the list of insn_code values, hashst/chk have been put in
the section for instructions with an RB operand, which is not strictly
correct given that the B port is used for the immediate D operand;
however, adding them to the section for instructions without an RB
operand would have made that section exceed 128 entries, causing
changes to the padding needed.  The only downside to having hashst/cmp
where they are is that the debug logic can't use the RB port to read
GPR/FPRs when a hashst/cmp instruction is being decoded.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2025-01-25 14:50:13 +11:00
Paul Mackerras
ba4614c5f4 dcache: Implement data cache touch and flush instructions
This implements dcbf, dcbt and dcbtst in the dcache.  The dcbst (data
cache block store) instruction remains a no-op because our dcache is
write-through and therefore never has modified data that could need to
be written back.

Dcbt (data cache block touch) and dcbtst (data cache block touch for
store) behave similarly except that dcbtst is a no-op on a readonly
page.  Neither instruction ever causes an interrupt.  If they miss in
the cache and the page is cacheable, they are handled like a load miss
except that they complete immediately the state machine starts
handling the load miss rather than waiting for any data.

Dcbf (data cache block flush) can cause a data storage interrupt.  If
it hits in the cache, the state machine goes to a new FLUSH_CYCLE
state in which the cache line valid bit is cleared.

In order to avoid having more than 8 values in op_t, this combines
OP_STORE_MISS and OP_STORE_HIT into a single state.  A new OP_NOP
state is used for operations which can complete immediately without
changing any dcache state (now used for dcbt/dcbtst causing access
exception or on a non-cachable page, or dcbf that misses the cache).

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2024-12-30 12:59:16 +11:00
Paul Mackerras
722f239c02 Reimplement quadword loads and stores
This adds implementations of lq, plq, stq, pstq, lqarx and stqcx.

Because register file addresses are now computed in decode1 before we
have the decode table entry for the instruction, we have to check the
icode directly to know when to read register RS|1 before RS (i.e. for
stq and stqcx in LE mode, but not pstq).

For the second instance of the instruction, loadstore1 uses the EA
from the first instance + 8.  It generates an alignment interrupt for
unaligned lqarx and stqcx and for lq in LE mode with an unaligned
address.  (The reason for the latter case is that it writes RT|1
before RT, and if we have RA = RT|1 and the second instance traps, we
will have overwritten RA.)

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2024-12-27 20:40:53 +11:00
Paul Mackerras
fa9df33f7e Implement cfuged, pdepd and pextd
This implements the cfuged, pdepd and pextd instructions in a new unit
called bit_sorter (so called because cfuged and pextd can be viewed as
sorting the bits of the mask).

The cnt* instructions and the popcnt* instructions now use the same
OP_COUNTB insn_type so as to free up an insn_type value to use for the
new instructions.

The new instructions are implemented using a slow and simple algorithm
that takes 64 cycles to compute the result.  The ex1 stage is stalled
while this happens, as for a 64-bit multiply, or for a divide when
there is no FPU.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2024-12-20 22:24:31 +11:00
Paul Mackerras
d112a7ad94 Implement scv and rfscv
The main quirk here is that scv sets LR and CTR instead of SRR0 and
SRR1, and likewise rfscv uses LR and CTR.  Also, scv uses a set of 128
interrupt vectors starting at 0x17000.  Fortunately, the layout of the
SPR RAM was already such that LR and CTR were in the even and odd
halves respectively at the same index, so reading or writing LR and
CTR instead of SRR0 and SRR1 is quite easy.

Use of scv is subject to an FSCR bit but not an HFSCR bit.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2024-12-20 22:24:20 +11:00
Paul Mackerras
205c0e2c78 Implement the wait instruction
This implements the behaviour of the 'wait 0' instruction of pausing
execution of instructions until an exception arises.  The exceptions
that terminate a wait are a pending trace exception, external
interrupt request, PMU interrupt request, or decrementer negative
exception.  These exception conditions terminate a wait even if not
enabled to generate an interrupt (e.g. if MSR[EE] is zero).

This is implemented by having execute1 assert its busy_out signal
while the wait state exists.  The wait state is set by the completion
of the wait instruction and cleared by a pending exception.

If the WC operand of the wait instruction is non-zero, indicating wait
for reservation loss or wait for a short period, then the wait
instruction does not wait, but just acts as a no-op.

In order to make space in the insn_type_t type without going over 64
elements, this combines OP_DCBT and OP_ICBT into a single OP_XCBT,
since they were both no-ops (except for their influence on how SRR1 is
set on a trace interrupt, where they were identical).

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2024-12-20 22:24:10 +11:00
Paul Mackerras
1c4b5def36 Improve timing of redirect_nia going from writeback to fetch1
This gets rid of the adder in writeback that computes redirect_nia.
Instead, the main adder in the ALU is used to compute the branch
target for relative branches.  We now decode b and bc differently
depending on the AA field, generating INSN_brel, INSN_babs, INSN_bcrel
or INSN_bcabs as appropriate.  Each one has a separate entry in the
decode table in decode1; the *rel versions use CIA as the A input.
The bclr/bcctr/bctar and rfid instructions now select ramspr_result
for the main result mux to get the redirect address into
ex1.e.write_data.

For branches which are predicted taken but not actually taken, we need
to redirect to the following instruction.  We also need to do that for
isync.  We do this in the execute2 stage since whether or not to do it
depends on the branch result.  The next_nia computation is moved to
the execute2 stage and comes in via a new leg on the secondary result
multiplexer, making next_nia available ultimately in ex2.e.write_data.
This also means that the next_nia leg of the primary result
multiplexer is gone.  Incrementing last_nia by 4 for sc (so that SRR0
points to the following instruction) is also moved to execute2.

Writing CIA+4 to LR was previously done through the main result
multiplexer.  Now it comes in explicitly in the ramspr write logic.

Overall this removes the br_offset and abs_br fields and the logic to
add br_offset and next_nia, and one leg of the primary result
multiplexer, at the cost of a few extra control signals between
execute1 and execute2 and some multiplexing for the ramspr write side
and an extra input on the secondary result multiplexer.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2023-09-19 17:30:54 +10:00
Paul Mackerras
06ff486567 icache: Restore primary opcode to instruction word
The icache stores a predecoded insn_code value for each instruction,
and so as to fit in 36 bits, omits the primary opcode (the most
significant 6 bits) of each instruction.  Previously, for valid
instructions, the primary opcode field of the instruction delivered to
decode1 was a part-representation of the insn_code value rather than
the actual primary opcode.  This adds a lookup table to compute the
primary opcode from the insn_code and deliver it in the instruction
words supplied to decode1.

In order that each insn_code can be associated with a single primary
opcode value, the various no-operation instructions with primary
opcode 31 (the reserved no-ops and dss, dst and dstst) have been given
a new insn_code, INSN_rnop, leaving INSN_nop for the preferred no-op
(ori r0,r0,0).

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2023-09-19 17:17:15 +10:00
Paul Mackerras
b50170cd1d Implement byte reversal instructions
This implements the byte-reverse halfword, word and doubleword
instructions: brh, brw, and brd.  These instructions were added to the
ISA in version 3.1.  They use a new OP_BREV insn_type value.  The
logic for these instructions is implemented in logical.vhdl.

In order to avoid going over 64 insn_type values, OP_AND and OP_OR
were combined into OP_LOGIC, which is like OP_AND except that the RS
input can be inverted as well as the RB input.  The various forms of
OR instruction are then implemented using the identity

    a OR b = NOT (NOT a AND NOT b)

The 'is_signed' field of the instruction decode table is used to
indicate that RS should be inverted.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2023-09-13 08:43:25 +10:00
Paul Mackerras
39ca675ce3 Decode prefixed instructions
This adds logic to do basic decoding of the prefixed instructions
defined in PowerISA v3.1B which are in the SFFS (Scalar Fixed plus
Floating-Point Subset) compliancy subset.  In PowerISA v3.1B SFFS,
there are 14 prefixed load/store instructions plus the prefixed no-op
instruction (pnop).  The prefixed load/store instructions all use an
extended version of D-form, which has an extra 18 bits of displacement
in the prefix, plus an 'R' bit which enables PC-relative addressing.

When decode1 sees an instruction word where the insn_code is
INSN_prefix (i.e. the primary opcode was 1), it stores the prefix word
and sends nothing down to decode2 in that cycle.  When the next valid
instruction word arrives, it is interpreted as a suffix, meaning that
its insn_code gets modified before being used to look up the decode
table.

The insn_code values are rearranged so that the values for
instructions which are the suffix of a valid prefixed instruction are
all at even indexes, and the corresponding prefixed instructions
follow immediately, so that an insn_code value can be converted to the
corresponding prefixed value by setting the LSB of the insn_code
value.  There are two prefixed instructions, pld and pstd, for which
the suffix is not a valid SFFS instruction by itself, so these have
been given dummy insn_code values which decode as illegal (INSN_op57
and INSN_op61).

For a prefixed instruction, decode1 examines the type and subtype
fields of the prefix and checks that the suffix is valid for the type
and subtype.  This check doesn't affect which entry of the decode
table is used; the result is passed down to decode2, and will in
future be acted upon in execute1.

The instruction address passed down to decode2 is the address of the
prefix.  To enable this, part of the instruction address is saved when
the prefix is seen, and then the instruction address received from
icache is partly overlaid by the saved prefix address.  Because
prefixed instructions are not permitted to cross 64-byte boundaries,
we only need to save bits 5:2 of the instruction to do this.  If the
alignment restriction ever gets relaxed, we will then need to save
more bits of the address.

Decode2 has been extended to handle the R bit of the prefix (in 8LS
and MLS forms) and to be able to generate the 34-bit immediate value
from the prefix and suffix.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2023-07-06 19:39:40 +10:00
Paul Mackerras
7af0e001ad Move insn_codes for mcrfs, mtfsb0/1 and mtfsfi
This moves the insn_code values for mcrfs, mtfsb0/1 and mtfsfi into
the region used for floating-point instructions.  This means that in
no-FPU implementations, they will get turned into illegal instructions
in predecode.  We then don't need the code in execute1 that makes FP
instructions illegal in no-FPU implementations.

We also remove the NONE value for unit_t, since it was only ever used
with insn_type = OP_ILLEGAL, and the check for unit = NONE was
redundant with the check for insn_type = OP_ILLEGAL.  Thus the check
for unit = NONE is no longer needed and is removed here.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2023-07-03 22:17:33 +10:00
Paul Mackerras
58e799b350 predecode: Add more comments to row_predecode_rom and insn_code values
This adds comments to row_predecode_rom to aid understanding how the
columns in the second half of the table are allocated to different
primary opcodes, and to the insn_code values to assist in locating the
code with a given numeric value.  No code change.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2022-08-09 20:14:44 +10:00
Paul Mackerras
26dc1e879c Eliminate use of primary opcode outside of decode1
This changes code that previously looked at the primary opcode (bits
26 to 31) of the instruction to use other methods, in places other
than in stage0 of decode1.

* Extend rc_t to have a new value, RCOE, indicating that the
  instruction has both Rc and OE bits.

* Decode2 now tells execute1 whether the instruction has a third
  operand, used for distinguishing between multiply and multiply-add
  instructions.

* The invert_a field of the decode ROM is overloaded for load/store
  instructions to indicate cache-inhibited loads and stores.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2022-08-09 20:13:54 +10:00
Paul Mackerras
c9aea45ffe decode1: Divide insn_code values into ranges to indicate register usage
This lets us compute r_out.reg_*_addr and r_out.read_2_enable values
without needing access to the primary opcode value.  We also have that
non-FP instructions are < 256.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2022-08-09 19:52:08 +10:00
Paul Mackerras
c3ee10f013 decode1: Split instruction decoding into two steps
This reduces the block RAM requirements for instruction decoding by
splitting it into two steps.  The first, in a new pipeline stage
called decode0 (implemented by code in decode1.vhdl) maps the
instruction to a 9-bit instruction code using major and row decode
ROMs.  The second maps the 9-bit code to the final decode_rom_t (about
44 bits wide).  Branch prediction done in decode is now done in
decode0 rather than decode1.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2022-08-09 19:52:04 +10:00
Paul Mackerras
932da4c114 FPU: Simplify IDLE state code
Do more decoding of the instruction ahead of the IDLE state
processing so that the IDLE state code becomes much simpler.
To make the decoding easier, we now use four insn_type_t codes for
floating-point operations rather than two.  This also rearranges the
insn_type_t values a little to get the 4 FP opcode values to differ
only in the bottom 2 bits, and put OP_DIV, OP_DIVE and OP_MOD next to
them.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2022-08-09 19:51:48 +10:00
Paul Mackerras
fdb3ef6874 Finish off taking SPRs out of register file
With this, the register file now contains 64 entries, for 32 GPRs and
32 FPRs, rather than the 128 it had previously.  Several things get
simplified - decode1 no longer has to work out the ispr{1,2,o} values,
decode_input_reg_{a,b,c} no longer have the t = SPR case, etc.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2022-07-22 22:20:39 +10:00
Paul Mackerras
c9e838b656 Remove support for lq, stq, lqarx and stqcx.
They are optional in SFFS (scalar fixed-point and floating-point
subset), are not needed for running Linux, and add complexity, so
remove them.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2022-07-22 22:19:50 +10:00
Paul Mackerras
4c61a71a62 core: Crack update-form loads into two internal ops
This uses the instruction-doubling machinery to send load with update
instructions down to loadstore1 as two separate ops, rather than
one op with two destinations.  This will help to simplify the value
tracking mechanisms.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2021-01-19 09:27:29 +11:00
Paul Mackerras
89a67a18d0 decode: Add a facility field to the instruction decode tables
This makes it simpler to work out when to deliver a FPU unavailable
interrupt.  This also means we can get rid of the OP_FPLOAD and
OP_FPSTORE insn_type values.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2021-01-15 12:40:09 +11:00
Paul Mackerras
4b2c23703c core: Implement quadword loads and stores
This implements the lq, stq, lqarx and stqcx. instructions.

These instructions all access two consecutive GPRs; for example the
"lq %r6,0(%r3)" instruction will load the doubleword at the address
in R3 into R7 and the doubleword at address R3 + 8 into R6.  To cope
with having two GPR sources or destinations, the instruction gets
repeated at the decode2 stage, that is, for each lq/stq/lqarx/stqcx.
coming in from decode1, two instructions get sent out to execute1.

For these instructions, the RS or RT register gets modified on one
of the iterations by setting the LSB of the register number.  In LE
mode, the first iteration uses RS|1 or RT|1 and the second iteration
uses RS or RT.  In BE mode, this is done the other way around.  In
order for decode2 to know what endianness is currently in use, we
pass the big_endian flag down from icache through decode1 to decode2.
This is always in sync with what execute1 is using because only rfid
or an interrupt can change MSR[LE], and those operations all cause
a flush and redirect.

There is now an extra column in the decode tables in decode1 to
indicate whether the instruction needs to be repeated.  Decode1 also
enforces the rule that lq with RT = RT and lqarx with RA = RT or
RB = RT are illegal.

Decode2 now passes a 'repeat' flag and a 'second' flag to execute1,
and execute1 passes them on to loadstore1.  The 'repeat' flag is set
for both iterations of a repeated instruction, and 'second' is set
on the second iteration.  Execute1 does not take asynchronous or
trace interrupts on the second iteration of a repeated instruction.

Loadstore1 uses 'next_addr' for the second iteration of a repeated
load/store so that we access the second doubleword of the memory
operand.  Thus loadstore1 accesses the doublewords in increasing
memory order.  For 16-byte loads this means that the first iteration
writes GPR RT|1.  It is possible that RA = RT|1 (this is a legal
but non-preferred form), meaning that if the memory operand was
misaligned, the first iteration would overwrite RA but then the
second iteration might take a page fault, leading to corrupted state.
To avoid that possibility, 16-byte loads in LE mode take an
alignment interrupt if the operand is not 16-byte aligned.  (This
is the case anyway for lqarx, and we enforce it for lq as well.)

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2021-01-15 12:40:09 +11:00
Paul Mackerras
e6a5f237bc FPU: Implement fmul[s]
This implements the fmul and fmuls instructions.

For fmul[s] with denormalized operands we normalize the inputs
before doing the multiplication, to eliminate the need for doing
count-leading-zeroes on P.  This adds 3 or 5 cycles to the
execution time when one or both operands are denormalized.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2020-09-03 17:45:07 +10:00
Paul Mackerras
b628af6176 FPU: Implement fmr and related instructions
This implements fmr, fneg, fabs, fnabs and fcpsgn and adds tests
for them.

This adds logic to unpack and repack floating-point data from the
64-bit packed form (as stored in memory and the register file) into
the unpacked form in the fpr_reg_type record.  This is not strictly
necessary for fmr et al., but will be useful for when we do actual
arithmetic.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2020-09-03 17:44:37 +10:00
Paul Mackerras
856e9e955f core: Add framework for an FPU
This adds the skeleton of a floating-point unit and implements the
mffs and mtfsf instructions.

Execute1 sends FP instructions to the FPU and receives busy,
exception, FP interrupt and illegal interrupt signals from it.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2020-09-03 17:44:00 +10:00
Paul Mackerras
45cd8f4fc3 core: Add support for floating-point loads and stores
This extends the register file so it can hold FPR values, and
implements the FP loads and stores that do not require conversion
between single and double precision.

We now have the FP, FE0 and FE1 bits in MSR.  FP loads and stores
cause a FP unavailable interrupt if MSR[FP] = 0.

The FPU facilities are optional and their presence is controlled by
the HAS_FPU generic passed down from the top-level board file.  It
defaults to true for all except the A7-35 boards.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2020-09-03 15:14:17 +10:00
Paul Mackerras
83816cb9e3 core: Implement BCD Assist instructions addg6s, cdtbcd, cbcdtod
To avoid adding too much logic, this moves the adder used by OP_ADD
out of the case statement in execute1.vhdl so that the result can
be used by OP_ADDG6S as well.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2020-08-07 10:36:23 +10:00
Paul Mackerras
5fafdc56ef core: Implement the addex instruction
The addex instruction is like adde but uses the XER[OV] bit for the
carry in and out rather than XER[CA].

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2020-08-06 19:27:00 +10:00
Paul Mackerras
290b05f97d core: Implement the maddhd, maddhdu and maddld instructions
These instructions use major opcode 4 and have a third GPR input
operand, so we need a decode table for major opcode 4 and some
plumbing to get the RC register operand read.

The multiply-add instructions use the same insn_type_t values as the
regular multiply instructions, and we distinguish in execute1 by
looking at the major opcode.  This turns out to be convenient because
we don't have to add any cases in the code that handles the output of
the multiplier, and it frees up some insn_type_t values.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2020-08-06 11:30:49 +10:00
Paul Mackerras
fa77a6f683 core: Implement the mcrxrx instruction
This also removes OP_MCRXR, as the mcrxr instruction was removed in
version 3.0B of the Power ISA, having been phased-out for the server
architecture since v2.02.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2020-08-06 11:28:34 +10:00
Paul Mackerras
4a4a98d4b9
core: Do addpcis using the main adder (#189)
By adding logic to decode2 to be able to send the instruction address
down the A input, and making CONST_DX_HI (renamed to CONST_DXHI4) add
4 to the immediate value (easy since the bottom 16 bits were zero),
we can do addpcis using the main adder.  This reduces the width of the
result mux and frees up one value in insn_type_t, since we can now use
OP_ADD for addpcis.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2020-06-05 11:29:31 +10:00
Shawn Anastasio
e606772aeb Implement the addpcis instruction
This commit adds support for the addpcis instruction from ISA 3.0.

A new input_reg_b_t type, CONST_DX_HI, was added to support the
shifted immediate value used in DX-Form instructions.

Signed-off-by: Shawn Anastasio <shawn@anastas.io>
2020-05-25 20:53:43 -05:00
Paul Mackerras
3d4712ad43 Add TLB to icache
This adds a direct-mapped TLB to the icache, with 64 entries by default.
Execute1 now sends a "virt_mode" signal from MSR[IR] to fetch1 along
with redirects to indicate whether instruction addresses should be
translated through the TLB, and fetch1 sends that on to icache.
Similarly a "priv_mode" signal is sent to indicate the privilege
mode for instruction fetches.  This means that changes to MSR[IR]
or MSR[PR] don't take effect until the next redirect, meaning an
isync, rfid, branch, etc.

The icache uses a hash of the effective address (i.e. next instruction
address) to index the TLB.  The hash is an XOR of three fields of the
address; with a 64-entry TLB, the fields are bits 12--17, 18--23 and
24--29 of the address.  TLB invalidations simply invalidate the
indexed TLB entry without checking the contents.

If the icache detects a TLB miss with virt_mode=1, it will send a
fetch_failed indication through fetch2 to decode1, which will turn it
into a special OP_FETCH_FAILED opcode with unit=LDST.  That will get
sent down to loadstore1 which will currently just raise a Instruction
Storage Interrupt (0x400) exception.

One bit in the PTE obtained from the TLB is used to check whether an
instruction access is allowed -- the privilege bit (bit 3).  If bit 3
is 1 and priv_mode=0, then a fetch_failed indication is sent down to
fetch2 and to decode1, which generates an OP_FETCH_FAILED.  Any PTEs
with PTE bit 0 (EAA[3]) clear or bit 8 (R) clear should not be put
into the iTLB since such PTEs would not allow execution by any
context.

Tlbie operations get sent from mmu to icache over a new connection.

Unfortunately the privileged instruction tests are broken for now.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2020-05-08 12:12:02 +10:00
Paul Mackerras
750b3a8e28 dcache: Implement data TLB
This adds a TLB to dcache, providing the ability to translate
addresses for loads and stores.  No protection mechanism has been
implemented yet.  The MSR_DR bit controls whether addresses are
translated through the TLB.

The TLB is a fixed-pagesize, set-associative cache.  Currently
the page size is 4kB and the TLB is 2-way set associative with 64
entries per set.

This implements the tlbie instruction.  RB bits 10 and 11 control
whether the whole TLB is invalidated (if either bit is 1) or just
a single entry corresponding to the effective page number in bits
12-63 of RB.

As an extension until we get a hardware page table walk, a tlbie
instruction with RB bits 9-11 set to 001 will load an entry into
the TLB.  The TLB entry value is in RS in the format of a radix PTE.

Currently there is no proper handling of TLB misses.  The load or
store will not be performed but no interrupt is generated.

In order to make timing at 100MHz on the Arty A7-100, we compare
the real address from each way of the TLB with the tag from each way
of the cache in parallel (requiring # TLB ways * # cache ways
comparators).  Then the result is selected based on which way hit in
the TLB.  That avoids a timing path going through the TLB EA
comparators, the multiplexer that selects the RA, and the cache tag
comparators.

The hack where addresses of the form 0xc------- are marked as
cache-inhibited is kept for now but restricted to real-mode accesses.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2020-05-08 12:12:01 +10:00
Paul Mackerras
278ac5e0eb Remove sim_config instruction
It's not used any more, and it's not in the ISA.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2020-04-07 20:07:54 +10:00
Paul Mackerras
381149b2cc Consolidate trap variants under a single OP_TRAP
This replaces OP_TD, OP_TDI, OP_TW and OP_TWI with a single OP_TRAP,
distinguishing the cases by the input_reg_b and is_32bit fields of
the decode ROM.  This adds the twi and td cases to the decode tables.

For now we make all of the trap instructions unconditionally generate
a trap-type program interrupt if the TO field of the instruction is
all ones, and do nothing otherwise.

This reduces the number of values in insn_type_t from 65 to 62,
meaning that an insn_type_t can now be encoded in 6 bits rather
than 7.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2020-04-07 20:04:24 +10:00
Paul Mackerras
fe077a116a Rename OP_MCRF to OP_CROP and trim insn_type_t
OP_MCRF covers the CR logical ops as well as mcrf since commit
c05441bf4793 ("Implement CRNOR and friends"), so this renames
OP_MCRF to OP_CROP.  The OP_* values for the individual CR logical
ops (OP_CRAND, etc.) are not used, so remove them from insn_type_t.

No functional change.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2020-04-07 11:29:03 +10:00
Michael Neuling
5ef5604b65 Add sc, illegal and decrementer exceptions and some supervisor state
This adds the following exceptions:
 - 0x700 program check (for illegal instructions)
 - 0x900 decrementer
 - 0xc00 system call

This also adds some supervisor state:
 - decremeter
 - msr
(SPRG0/1 and SRR0/1 already exist as fast SPRs)

It also adds some supporting instructions:
 - rfid
 - mtmsrd
 - mfmsr
 - sc

MSR state is added but only EE is used in this patch set. Other bits
are read/written but are not used at all.

This adds a 2 stage state machine to execute1.vhdl. This state machine
allows fast SPRS SRR0/1 to be written in different cycles. This state
machine can be extended later to add DAR and DSISR SPR writing for
more complex exceptions like page faults.

Signed-off-by: Michael Neuling <mikey@neuling.org>
2020-04-01 12:02:33 +11:00
Paul Mackerras
0c714f1be6 execute: Move popcnt and prty instructions into the logical unit
This implements logic in the logical entity to calculate the results
of the popcnt* and prty* instructions.  We now have one insn_type_t
value for the 3 popcnt variants and one for the two prty variants,
using the length field of the decode_rom_t to distinguish between
them.  The implementations in logical.vhdl using recursive
algorithms rather than the simple functions in ppc_fx_insns.vhdl.

This gives a saving of about 140 slice LUTs on the A7-100 and
improves timing slightly.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2020-01-14 22:40:39 +11:00
Paul Mackerras
d2ca625b3b execute: Do comparisons using the main adder
This handles OP_CMP like a subtraction; the main adder computes
~RA + RB + 1, and the condition codes are computed from the results.
A direct comparison of the two input operands is used to calculate the
EQ bit of the condition result.  The LT and GT bits are computed from
the MSB of the subtraction result, the carry out from the subtraction,
and the MSBs of the operands.  For a 32-bit comparison, the 32-bit
carry and bit 31 of the result and input operands are used; for a
64-bit comparison, the 64-bit carry and bit 63 of the operands and
result are used.

It turns out to be more convenient to use the 'signed' field of
the decode table to distinguish signed from unsigned comparisons,
rather than the insn_type.  Therefore this uses OP_CMP for both
cmp and cmpl, which also has the benefit of reducing the number
of values in insn_type_t.

Doing this saves over 200 slice LUTs on the Arty A7-100 and improves
timing slightly as well.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2020-01-14 22:29:07 +11:00
Paul Mackerras
39d18d2738 Make divider hang off the side of execute1
With this, the divider is a unit that execute1 sends operands to and
which sends its results back to execute1, which then send them to
writeback.  Execute1 now sends a stall signal when it gets a divide
or modulus instruction until it gets a valid signal back from the
divider.  Divide and modulus instructions are no longer marked as
single-issue.

The data formatting step that used to be done in decode2 for div
and mod instructions is now done in execute1.  We also do the
absolute value operation in that same cycle instead of taking an
extra cycle inside the divider for signed operations with a
negative operand.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2020-01-14 17:20:52 +11:00
Paul Mackerras
2167186b5f Make multiplier hang off the side of execute1
With this, the multiplier isn't a separate pipe that decode2 issues
instructions to, but rather is a unit that execute1 sends operands
to and which sends the result back to execute1, which then sends it
to writeback.  Execute1 now sends a stall signal when it gets a
multiply instruction until it gets a valid signal back from the
multiplier.

This all means that we no longer need to mark the multiply
instructions as single-issue.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2020-01-14 17:20:52 +11:00
Benjamin Herrenschmidt
e4f475e17f sprs: Store common SPRs in register file
This stores the most common SPRs in the register file.

This includes CTR and LR and a not yet final list of others.

The register file is set to 64 entries for now. Specific types
are defined that can represent a GPR index (gpr_index_t) or
a GPR/SPR index (gspr_index_t) along with conversion functions
between the two.

On order to deal with some forms of branch updating both LR and
CTR, we introduced a delayed update of LR after a branch link.

Note: We currently stall the pipeline on such a delayed branch,
but we could avoid stalling fetch in that specific case as we
know we have a branch delay. We could also limit that to the
specific case where we need to update both CTR and LR.

This allows us to make bcreg, mtspr and mfspr pipelined. decode1
will automatically force the single issue flag on mfspr/mtspr to
a "slow" SPR.

[paulus@ozlabs.org - fix direction of decode2.stall_in]

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2019-12-07 15:30:40 +11:00
Benjamin Herrenschmidt
797b1bb045 decode: Reformat decode_types.vhdl
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2019-10-30 13:18:58 +11:00