From 3b3c1c5e5eb1be9f2973170235670ee02dc26c87 Mon Sep 17 00:00:00 2001
From: Mark Ermolov <mermolov@ptsecurity.com>
Date: Fri, 18 Jun 2021 13:41:21 +0300
Subject: [PATCH] README.md fixes: 1. Grammar and formatting 2. Inaccuracy in
 idq_disassemble description in Content of Publication 3. Usage is moved to
 end

---
 README.md | 149 ++++++++++++++++++++++++++++--------------------------
 1 file changed, 77 insertions(+), 72 deletions(-)

diff --git a/README.md b/README.md
index f728f17..103a23a 100644
--- a/README.md
+++ b/README.md
@@ -5,79 +5,27 @@
 
 # Content
 [Introduction](#introduction)  
-[Usage](#Usage)  
 [The Structure and the Binary Format of Intel Atom Goldmont Microcode](#the-structure-and-the-binary-format-of-intel-atom-goldmont-microcode)  
 [Description of Some Important Microoperations](#description-of-some-important-microoperations)  
 [Text Labels For Microcode Addresses](#text-labels-for-microcode-addresses)  
 [Unresolved Questions](#unresolved-questions)  
 [Content of the Publication](#content-of-the-publication)  
+[Usage](#usage)  
 [Research Team](#research-eam)  
 [License](#license)  
 
 
 # Introduction
+
 Since Intel Atom CPUs are full-fledged, modern representatives of the x86 architecture supporting most of its instruction extensions (Intel VMX, Intel MPX, Intel SGX)  the ability to view, understand and research the microcode of these CPUs is being considered by us as a very important game-changing opportunity in many areas of nowadays security/performance/functional analysis of x86 CPUs. The knowing of the x86 implementation in microcode even for the one representative can greatly empower researchers of the CPU transient execution vulnerabilities because now they can see much deeper what is going on inside one or another x86 instruction implementation and how it affects the microarchitecture (various buffers, registers and internal states). Performance engineers finally can estimate the true latency of Intel CPUs instructions, comparing it with official documentation and Hypervisors developers could see the genuine reason leading to VM exit without relying on numerous guesses. Unfortunately, the Chips Giant has kept this secret with seven seals for over 40 years, but now it seems to emerge.
 
-So last year we managed to extract the microcode for the actual Intel Atom microprocessor having codename Goldmont. We don’t intend to describe the process now, but instead we would like to share our results of the reverse engineering that we’re doing for the Atom’s microcode. Here, we are publishing our microcode disassembler tool using which you can see the interpretation in plain, readable form of the binary microcode which we have already published last year [glm-ucode][4]. Our disassembler is written in Python 3.x script language and prints the binary microoperations together with their text representation (mnemonic + operands). The text translation is done based on our understanding and the progress in the reverse engineering at the current stage, so we don’t claim its absolute certainly. There can be errors as in the microoperation mnemonics naming as well in the arguments representation. Moreover, there still exist unknown operation codes (opcodes) for many microoperations (mostly, for XMM specific), but the basic control flow and ALU opcodes were determined. We encourage all researchers interested in the topic to continue with us the research and extend our disassembler fixing the errors and adding new opcodes. This is one of the goals for the current publication of the microcode disassembler tool intendent for Intel Atom CPUs microcode.
+Last year we managed to extract the microcode of a current Intel Atom microprocessor codenamed Goldmont. We don’t intend to describe that process now; instead, we would like to share the results of our ongoing reverse engineering of the Atom’s microcode. Here we are publishing our microcode disassembler, which shows a plain, readable interpretation of the binary microcode that we published last year ([glm-ucode][4]). The disassembler is written in Python 3.x and prints the binary microoperations together with their text representation (mnemonic + operands). The text translation is based on our understanding and on the current progress of the reverse engineering, so we don’t claim absolute accuracy. There can be errors both in the naming of microoperation mnemonics and in the representation of their arguments. Moreover, the operation codes (opcodes) of many microoperations are still unknown (mostly the XMM-specific ones), but the basic control flow and ALU opcodes have been determined. We encourage all researchers interested in the topic to continue the research with us and to extend our disassembler by fixing errors and adding new opcodes. This is one of the goals of the current publication of the microcode disassembler tool intended for Intel Atom CPU microcode.
 
 At first glance at the disassembler’s output the researcher may be confused by the naming of some mnemonics especially for microoperations working with physical memory (e.g. LDPPHYSTICKLE_DSZ64_ASZ64_SC1) and he can raise the question of the source for those weird names.  For now, we can say only that those mnemonics were acquired directly from Intel – they published on the one of their official internet resources the raw data representing log files from some microcode simulation tool for certain Big Core microarchitecture. Now, the link isn’t available, but we kept the data which have been subject to deep analysis where we got all those sophisticated mnemonics. By analogy, we invented and our own, where we were not able to find correspondent in the list using the logic and the existing mnemonics as a template. We’re publishing the original list of the opcodes’ mnemonics in separate file (misc/bigcore_opcodes.txt) to let researchers make independent decision about correctness of our choice in the naming and use it for new opcodes.
 
 Next, we will describe the structure of Atom Goldmont microcode and the basic semantic of some most important microoperations. Further, we will describe the remaining unresolved problems which we encountered during our research.
 
 
-# Usage
-```
-glm_ucode_disasm.py
-Usage: glm_ucode_disasm <ms_array0_file_path>
-```
-
-Example:
-```
-glm_ucode_disasm.py ..\ucode\ms_array0.txt
-```
-
-Output listing can be found in 
-
-```
-cat ..\ucode\ucode_glm.txt
-U0000: 00626803f200                tmp15:= MOVEFROMCREG_DSZ64(CORE_CR_CUR_UIP)
-U0001: 000801030008                tmp0:= ZEROEXT_DSZ32(0x00000001)
-           018e5e40                SEQW GOTO U0e5e
-------------------------------------------------------------------------------------
-U0002: 004800013000                tmp7:= ZEROEXT_DSZ64(0x00000000)
-
-U0004: 05b900013000                mm7:= unk_5b9(0x00000000)
-U0005: 000a01000200                TESTUSTATE(UCODE, UST_MSLOOPCTR_NONZERO)
-           0b000240                ? SEQW GOTO U0002
-U0006: 014800000000     SYNCWAIT-> URET(0x00)
-------------------------------------------------------------------------------------
-
-U0008: 000c6c97e208                tmp14:= SAVEUIP(0x01, U056c)
-           01890900                SEQW GOTO U0909
-------------------------------------------------------------------------------------
-U0009: 0005a407de08                tmp13:= SUB_DSZ32(0x000001a4, tmp8)
-U000a: 01310023d23d                tmp13:= SELECTCC_DSZ32_CONDNZ(tmp13, 0x00000800)
-
-U000c: 00470003dc7d                tmp13:= NOTAND_DSZ64(tmp13, tmp1)
-U000d: 0150015c027d   LFNCEWTMRK-> UJMPCC_DIRECT_NOTTAKEN_CONDZ(tmp13, U3701)
-U000e: 000000000000                NOP
-           06a71180                SEQW GOTO generate_#GP
-------------------------------------------------------------------------------------
-
-U0010: 000c6c97e208                tmp14:= SAVEUIP(0x01, U056c)
-           0187e100                SEQW GOTO U07e1
-------------------------------------------------------------------------------------
-
-sha256_ret:
-U0011: 00638e03d200                tmp13:= READURAM(0x008e, 64)
-U0012: 00652003e23d                tmp14:= SHR_DSZ64(tmp13, 0x00000020)
-
-U0014: 003d0003df7e                tmp13:= MOVEINSERTFLGS_DSZ32(tmp14, tmp13)
-U0015: 00638d03e200                tmp14:= READURAM(0x008d, 64)
-U0016: 015d00000ec0                UJMP(tmp11)
-```
-
-
 # The Structure and the Binary Format of Intel Atom Goldmont Microcode
 
 The microcode of the Intel Atom CPUs consists from two large chunks of data – Microcode Triads and Sequence Words. These data are kept in the ROM area of a functional block inside CPU core that is called Microcode Sequencer (MS). We used debug port of MS exposed to CRBUS to extract the data.
@@ -86,15 +34,16 @@ Microcode triads represent a set of **three microoperations** which are processe
 
 Each microoperation of Atom Goldmont microarchitecture has the following 48-bit binary format (at the top here’re the bits indexes, at the bottom – the fields lengths, signs plus mark fields boundaries, vertical bars – bytes):
 ```
-48        44  40        32       24 23    18 16  12    8  6      0
+48       44   40       32       24 23    18 16   12   8  6      0
 -|--+--+--+----|--------|--------|--+-----+--|----+----|--+------|
  |??|m2|m1|    opcode   |  imm0  |m0| imm1|  dst  | src1  | src0 |
 -|--+--+--+----|--------|--------|--+-----+--|----+----|--+------|
-   2  1 1      12            8     1   5     6      6       6
+   2  1  1     12            8     1   5      6      6       6
 ```
 Where:
 
 **opcode** – 12-bit numeric microoperation code of operation representing the actual operation to perform (all opcodes which we’ve determined are placed in separate file opcodes.txt of our disassembler package)
+
 **src0/src1/dst** – three 6-bits fields which select operands for the operation. You can find the meaning of all numeric selectors for the fields in the disassembler’s python code. For some microoperations, the field dst is actually src2 (represents third source operand, e.g. for memory store uops).
 
 **m0/m1/m2** – there bits representing modes of the operation altering its behavior which are specific for microoperations or to groups of microoperations. E.g. for TESTUSTATE uop (see the description below), bit m0 means NOT, and bits m1 and m2 select various sets of internal state bits to check. For ALU uops (ADD_DSZN, SUB_DSZN and so on), bit m0 allows to select various immediate values representing data of macro-instruction (MACRO IMMS) for which the microcode gets executed.
@@ -108,11 +57,11 @@ Where:
 Each sequence word has the following 30-bit binary format:
 
 ```
-30 28  25  24 23                    8   6       2   0
+30 28    25 24 23                   8   6       2   0
 -+--+-----+--|--+--------------------+---+-------+---|
  |??|sync | up2 |          uaddr     |up1| eflow |up0|
 -+--+-----+--|--+--------------------+---+-------+---|
-  2    3     2              15         2     4     2
+   2   3     2             15          2     4     2
 ```
 Where:
 
@@ -137,7 +86,6 @@ There’re two groups of the most important microoperations:
 1. Controlling conditional execution of sequence words pertaining to their microcode triads
 
 
-
 ## SAVEUIP/SAVEUIP_REGOVR/READUIP_REGOVR/URET
 
 We found these mnemonics (SAVEUIP/READUIP/URET) in the original list of opcodes for the Big Core. During the reverse engineering of Atom microcode, we understood that there’re two internal microarchitectural (uarch) registers accessed by the considered uops which allow some kind of procedure calling inside microcode. We named the registers UIP0 and UIP1.
@@ -208,7 +156,8 @@ The microoperations MOVEFROMCREG_DSZ64/ MOVETOCREG_DSZ64 are simples uops to acc
 Inside execution pipeline there exist special small random-access memory which is private to each CPU core instance. It has only 512 (0x200) 64-bit entries and is accessed by READURAM/WRITEURAM uops. We called the memory as URAM. The memory isn’t shared by other cores of CPU complex. We are convinced that the memory can be written by arbitrary data and its entries aren’t hardware registers, but it seems that executive units of CPU core can access the URAM independent of microcode. Studying the microcode simulation log files for some Big Core (see Overview chapter) we’ve seen that the Big Cores also have the dedicated small private microarchitectural memory, but they name it as FSCP. We don’t know certainly what the abbreviation means, but decided to name the entries in URAM also as FSCP_CR_XXX. So, in our disassembler package there exist fscp.txt file where the association between arbitrary URAM address and its text name can be set.
 There also exist uops performing bit operations on their arguments (by analogy of correspondent CRBUS uops) before the write to URAM, but for now we didn’t determine their mnemonics.
 
-## Text Labels for Microcode Addresses
+
+# Text Labels for Microcode Addresses
 
 Our disassembler can assign text label to arbitrary address in microcode, so in all control flow uops, conditional and direct, the text label is used instead UXXX microcode address. The file has name labels.txt and placed nearby main python script. We already filled the file with several labels, which we assigned for different ucode procedures, such as performing  cryptographic procedures and others.
 Especially note the labels ending with **_xlat**: they mark entry points for x86 instructions which we determined. XLAT is an abbreviation of “Translate” and underlines that the x86 entry points in ucode are selected by a static tabular mechanism (we’ve seen the same naming of x86 entry points for Big Cores in the ucode emulation log files). Using the ability to execute arbitrary ucode via Match/Patch mechanism (isn’t described in this write-up), we determined many entry points for x86 instructions and placed them into the labels.txt file to be used by researchers.
@@ -217,45 +166,101 @@ Even more x86 entries aren’t determined yet. As you can see, each x86 instruct
 1.	The address for x86 instruction entry must be a multiple of 8
 1.	There must not be references in other places of ucode to the x86 entry address
 
-## Unresolved Questions
+
+# Unresolved Questions
 
 Our disassembler is far from complete. Here’re the open issues (how we see it) to be implemented:
 
-1. Opcodes and semantic for most SSE uops
+1.  Opcodes and semantic for most SSE uops.  
 Although we found several uops processing MMX/XMM data and implemented the support in our disassembler for mixed uops operating with both MMX/XMM and GP registers (the selectors for the registers in src0/src1/dst fields are overlapped), we didn’t process all SSE microoperations: we added only simple SSE uops those map one to one to correspondent x86 instructions naming them as the instructions (in fact, the mnemonics names for uops may differ). There exist in microcode the procedure for fast SHA256 implementation using vectored SSE data – it almost completely consists from uops with unknown opcodes. That’s a good place to start researching SSE uops.
 
-1.	Two unknown bits for TESTUSTATE
+1.  Two unknown bits for TESTUSTATE.  
 From all possible 48 state bits which can be used in TESTUSTATE uop, only for two of them we don’t know where they are in the microarchitectural state (see description for the TESTUSTATE uop above). We didn’t find bit #1 from UCODE state and bit #13 from SYS state. To understand their meaning, it must be found at first where the bits exist in the microarchitecture (CRBUS, arch state, Fuse, FSCP and so on).
 
-1.	Text names for state bits of TESTUSTATE
+1.	Text names for state bits of TESTUSTATE.  
 We assigned the names for eight most important SYS states of TESTUSTATE uop. You can find the enumeration in Phyton’s function parsing the arguments of the uop (*get_str_uop_xxx_ustate_special_imms*). For remaining seven (one bit is unterminated) SYS states and for VMX states, their purpose must be determined by reverse engineering of microcode changing the states’ sources and appropriate names must be assigned (the Python code has dict for the names to be extended).
 
-1.	Many CRBUS registers
+1.	Many CRBUS registers.  
 Unfortunately, we don’t have full list of CRBUS registers for Atom Goldmont microarchitecture (we do have the list for some Big Cores that was acquired from XML files of Intel DAL software package). However, the knowing of the Control Registers and their bit layout is very important for complete reverse engineering of the microcode (you will see how much code in MSROM works with CRBUS). We found and added to our disassembler some CRegs using their correlation with MSRs but they are very few of full set.
 
-1.	SIGEVENT numeric argument
+1.	SIGEVENT numeric argument.  
 This uop is used to raise x86 architectural exceptions. We found (using pure logic) two very important places where #UD and #GP exceptions are generated in microcode using the SIGEVENT uop, but we are not able to map the SIGEVENT argument to x86 exception vector. It seems there’s some other information in the numbers passed to SIGEVENT that must be understood, so the more convenient support for the SIGEVENT uop can be added to our disassembler.
 
-1.	UFLOWCTRL first argument’s value 0x01
+1.	UFLOWCTRL first argument’s value 0x01.  
 We didn’t determinate the purpose of the UFLOWCTRL with first argument’s value of 0x01. It replaces some other uop but it’s unknown which for the argument.
 
-1.	Sequence Word’s UEND variations
+1.	Sequence Word’s UEND variations.  
 We detected among eflow field bits of Sequence Words four values requesting the end of microcode sequencing for current macroinstruction. We marked them as UEND0, UEND1, UEND2 and UEND3. Although we suppose they are indented to deal with out of order execution of uops during the microcode sequencing and perhaps beyond the macroinstruction boundaries the certain purpose of each UENDX is to be determined.
 
-1.	Find an unfixable bug in CPU initialization code
+1.	Find an unfixable bug in CPU initialization code.  
 We already found many interesting things using our disassembler, in particular the two undocumented x86 instructions for microarchitectural access, but the main goal remains unresolved: to find a bug in microcode performing CPU initialization from the Reset Entry Point in microcode (U4000) to call of x86 Reset Vector. It’s very probably that a bug in that code flow could not be fixed by microcode patch what makes a precedent of truly unfixable microcode bug and changes the approach of the industry to the microcode implementation.
 
-## Content of the Publication
+
+# Content of the Publication
 
 1.	We publish our microcode disassembler (glm_ucode_disasm), consisting from: 
    * main Python script glm_ucode_disasm.py
    * opcode.txt file with all opcode mnemonics which we determined
    * hard_imm.txt containing all constants from Constants ROM of Atom Goldmont. They are used in uops with special src0/1 selectors
    * Various auxiliary files containing  textual names for several microarchitectural entities (CRBUS regs, URAM entries, labels for microcode addresses)
-2.	We publish without any description (who wants to - let him deal with the code) the IDQ (Instruction Decode Queue) processing Python code (*idq_disassemble.py)* with sample test data. IDQ is a key for reverse engineering of the microoperations format and uop opcodes. The code is tightly coupled with disassembler and we don’t want to separate it.
+2.	We publish, without any description (those who want to can work through the code themselves), the IDQ (Instruction Decode Queue) processing Python code (the *idq_disassemble* function) with sample test data (*idq_test_uops.txt* and *idq_test_imms.txt* in /glm_ucode_disasm). The IDQ is a key to reverse engineering the microoperation format and uop opcodes. The code is tightly coupled with the disassembler and we don’t want to separate it.
 2.	Microoperations opcodes and mnemonics for one of Intel Big Core representative (*misc/bigcore_opodes.txt*)
 2.	The full list of all MSRs (*misc/glm_msr_read_desc.txt*, *misc/glm_msr_write_desc.txt*) for Atom Goldmont microarchitecture. MSRs are a bridge between x86 architecture and the microcode, some kind of an interface and they are very important for successfully reverse engineering of microcode. We extracted the two lists of MSR descriptors from special ROM area in uarch (via arbitrary execution of MSR2CR uop), parsed them according to microcode (see *rdmsr_xlat* and *wrmsr_xlat*) and publish the results: for each existing MSR, the following is published: MSR address, applicable modes, the check procedure in microcode, read/write procedure in microcode, address of microarchitectural data for MSR depending of its type (CRBUS regs, URAM regs, hardware register accessed via IO uops, custom MSR composed from many sources). There’re four modes, which affect MSR availability: Normal, SMM, JTAG and ELF (special very privileged x86 code that can exist in microcode update file in encrypted form and gets run directly by microcode). In our MSRs lists, in field Type we mark MSRs by: N (Normal), S (SMM), J (JTAG) and E (ELF).
 
+
+# Usage
+```
+glm_ucode_disasm.py
+Usage: glm_ucode_disasm <ms_array0_file_path>
+```
+
+Example:
+```
+glm_ucode_disasm.py ..\ucode\ms_array0.txt
+```
+
+The output listing can be found in `..\ucode\ucode_glm.txt`:
+
+```
+cat ..\ucode\ucode_glm.txt
+U0000: 00626803f200                tmp15:= MOVEFROMCREG_DSZ64(CORE_CR_CUR_UIP)
+U0001: 000801030008                tmp0:= ZEROEXT_DSZ32(0x00000001)
+           018e5e40                SEQW GOTO U0e5e
+------------------------------------------------------------------------------------
+U0002: 004800013000                tmp7:= ZEROEXT_DSZ64(0x00000000)
+
+U0004: 05b900013000                mm7:= unk_5b9(0x00000000)
+U0005: 000a01000200                TESTUSTATE(UCODE, UST_MSLOOPCTR_NONZERO)
+           0b000240                ? SEQW GOTO U0002
+U0006: 014800000000     SYNCWAIT-> URET(0x00)
+------------------------------------------------------------------------------------
+
+U0008: 000c6c97e208                tmp14:= SAVEUIP(0x01, U056c)
+           01890900                SEQW GOTO U0909
+------------------------------------------------------------------------------------
+U0009: 0005a407de08                tmp13:= SUB_DSZ32(0x000001a4, tmp8)
+U000a: 01310023d23d                tmp13:= SELECTCC_DSZ32_CONDNZ(tmp13, 0x00000800)
+
+U000c: 00470003dc7d                tmp13:= NOTAND_DSZ64(tmp13, tmp1)
+U000d: 0150015c027d   LFNCEWTMRK-> UJMPCC_DIRECT_NOTTAKEN_CONDZ(tmp13, U3701)
+U000e: 000000000000                NOP
+           06a71180                SEQW GOTO generate_#GP
+------------------------------------------------------------------------------------
+
+U0010: 000c6c97e208                tmp14:= SAVEUIP(0x01, U056c)
+           0187e100                SEQW GOTO U07e1
+------------------------------------------------------------------------------------
+
+sha256_ret:
+U0011: 00638e03d200                tmp13:= READURAM(0x008e, 64)
+U0012: 00652003e23d                tmp14:= SHR_DSZ64(tmp13, 0x00000020)
+
+U0014: 003d0003df7e                tmp13:= MOVEINSERTFLGS_DSZ32(tmp14, tmp13)
+U0015: 00638d03e200                tmp14:= READURAM(0x008d, 64)
+U0016: 015d00000ec0                UJMP(tmp11)
+```
+
+
 # Research Team
 
 Mark Ermolov ([@\_markel___][1])