ASLR on the Line: Practical Cache Attacks on the MMU (AnC, NDSS 2017)


ASLR on the Line: Practical Cache Attacks on the MMU

Ben Gras∗, Kaveh Razavi∗, Erik Bosman, Herbert Bos, Cristiano Giuffrida
Vrije Universiteit Amsterdam
{beng, kaveh, ejbosman, herbertb, giuffrida}@cs.vu.nl
∗ Equal contribution, joint first authors

Abstract—Address space layout randomization (ASLR) is an important first line of defense against memory corruption attacks and a building block for many modern countermeasures. Existing attacks against ASLR rely on software vulnerabilities and/or on repeated (and detectable) memory probing. In this paper, we show that neither is a hard requirement and that ASLR is fundamentally insecure on modern cache-based architectures, making ASLR and caching conflicting requirements (ASLR⊕Cache, or simply AnC). To support this claim, we describe a new EVICT+TIME cache attack on the virtual address translation performed by the memory management unit (MMU) of modern processors. Our AnC attack relies on the property that the MMU's page-table walks result in caching page-table pages in the shared last-level cache (LLC). As a result, an attacker can derandomize virtual addresses of a victim's code and data by locating the cache lines that store the page-table entries used for address translation. Relying only on basic memory accesses allows AnC to be implemented in JavaScript without any specific instructions or software features. We show our JavaScript implementation can break code and heap ASLR in two major browsers running on the latest Linux operating system with 28 bits of entropy in 150 seconds. We further verify that the AnC attack is applicable to every modern architecture that we tried, including Intel, ARM and AMD. Mitigating this attack without naively disabling caches is hard, since it targets the low-level operations of the MMU. We conclude that ASLR is fundamentally flawed in sandboxed environments such as JavaScript and future defenses should not rely on randomized virtual addresses as a building block.
I. INTRODUCTION

Address-space layout randomization (ASLR) is the first line of defense against memory-related security vulnerabilities in today's modern software. ASLR selects random locations in the large virtual address space of a protected process for placing code or data. This simple defense mechanism forces attackers to rely on secondary software vulnerabilities (e.g., arbitrary memory reads) to directly leak pointers [16], [56] or on ad-hoc mechanisms to bruteforce the randomized locations [5], [6], [17], [19], [23], [47], [54]. Bruteforcing, if at all possible [16], [59], requires repeatedly generating anomalous events (e.g., crashes [5], [17], [54], exceptions [19], [23], or huge allocations [47]) that are easy to detect or prevent. For instance, for some attacks [6], disabling non-fundamental memory management features is enough [62]. Consequently, even if ASLR does not stop the more advanced attackers, in the eyes of many it still serves as a good first line of defense for protecting the users and as a pivotal building block in more advanced defenses [9], [15], [36], [42], [52].

In this paper, we challenge this belief by systematically derandomizing ASLR through a side-channel attack on the memory management unit (MMU) of processors that we call ASLR⊕Cache (or simply AnC). Previous work has shown that ASLR breaks down in the presence of specific weaknesses and (sometimes arcane) features in software. For instance, attackers may derandomize ASLR if the application is vulnerable to thread spraying [23], if the system turns on memory overcommit and exposes allocation oracles [47], if the application allows for crash-tolerant/resistant memory probing [5], [17], [19], [54], or if the underlying operating system uses deduplication to merge data pages crafted by the attacker with pages containing sensitive system data [6]. While all these conditions hold for some applications, none of them are universal and they can be mitigated in software. Moreover, finding secondary information leak vulnerabilities raises the effort required on the attacker's side for exploitation [22].

In this paper, we show that the problem is much more serious and that ASLR is fundamentally insecure on modern cache-based architectures. Specifically, we show that it is possible to derandomize ASLR completely from JavaScript, without resorting to esoteric operating system or application features. Unlike all previous approaches, we do not abuse weaknesses in the software (that are relatively easy to fix). Instead, our attack builds on hardware behavior that is central to efficient code execution: the fast translation of virtual to physical addresses in the MMU by means of page tables. As a result, all fixes to our attacks (e.g., naively disabling caching) are likely too costly in performance to be practical. To our knowledge, this is the first attack that side-channels the MMU and also the very first cache attack that targets a victim hardware component rather than a software component.

High-level overview of the attack. Whenever a process wants to access a virtual address, be it data or code, the MMU performs a translation to find the corresponding physical address in main memory. The translation lookaside buffer (TLB) in each CPU core stores most of the recent translations in order to speed up the memory access. However, whenever a TLB miss occurs, the MMU needs to walk the page tables (PTs) of the process (also stored in main memory) to perform the translation. To improve the performance of the MMU walk (i.e., a TLB miss), the PTs are cached in the fast data caches, very much like the process data is cached for faster access [4], [28]. Relying on ASLR as a security mechanism means that the PTs now store security-sensitive secrets: the offset of a PT entry in a PT page at each PT level encodes part of the secret virtual address. To the best of our knowledge, the implications of sharing the CPU data caches between the secret PT pages and untrusted code (e.g., JavaScript code) have never been previously explored.

By executing specially crafted memory access patterns on a commodity Intel processor, we are able to infer which cache sets have been accessed after a targeted MMU PT walk when dereferencing a data pointer or executing a piece of code. As only certain addresses map to a specific cache set, knowing the cache sets allows us to identify the offsets of the target PT entries at each PT level, hence derandomizing ASLR.

Contributions. Summarizing, we make the following contributions:

1) We design and implement AnC, the first cache side-channel attack against a hardware component (i.e., the processor's MMU), which allows malicious JavaScript code to derandomize the layout of the browser's address space, solely by accessing memory. Since AnC does not rely on any specific instruction or software feature, it cannot be easily mitigated without naively disabling CPU caches.
2) To implement AnC, we needed to implement better synthetic timers than the ones provided by current browsers. Our timers are practical and can tell the difference between a cached and an uncached memory access. On top of AnC, they make previous cache attacks (e.g., [48]) possible again.
3) While AnC fundamentally breaks ASLR, we further show, counter-intuitively perhaps, that memory allocation patterns and security countermeasures in current browsers, such as randomizing the location of the browser heaps on every allocation, make AnC attacks more effective.
4) We evaluated end-to-end attacks with AnC on two major browsers running on Linux. AnC runs in tens of seconds and successfully derandomizes code and heap pointers, significantly reducing an attacker's efforts to exploit a given vulnerability.

Outline. After presenting the threat model in Section II, we explain the details of address translation in Section III. In that section, we also summarize the main challenges and how we address them. Next, Sections IV–VI discuss our solutions for each of the challenges in detail. In Section VII, we evaluate AnC against Chrome and Firefox, running on the latest Linux operating system. We show that AnC successfully derandomizes ASLR of the heap in both browsers and ASLR of the JIT code in Firefox, while being much faster and less demanding in terms of requirements than state-of-the-art derandomization attacks. We discuss the impact of AnC on browser-based exploitation and on security defenses that rely on information hiding in the address space or leakage-resilient code randomization in Section VIII. We then propose mitigations to limit (but not eliminate) the impact of the attack in Section IX and highlight the related work in Section X before concluding in Section XI. Further AnC results are collected at: https://vusec.net/projects/anc.

Permission to freely reproduce all or part of this paper for noncommercial purposes is granted provided that copies bear this notice and the full citation on the first page. Reproduction for commercial purposes is strictly prohibited without the prior written consent of the Internet Society, the first-named author (for reproduction of an entire paper only), and the author's employer if the paper was prepared within the scope of employment. NDSS '17, 26 February - 1 March 2017, San Diego, CA, USA. Copyright 2017 Internet Society, ISBN 1-1891562-46-0. http://dx.doi.org/10.14722/ndss.2017.23271

II. THREAT MODEL

We assume the attacker can execute JavaScript code in the victim's browser, either by luring the victim into visiting a malicious website or by compromising a trusted website. Assuming all the common defenses (e.g., DEP) are enabled in the browser, the attacker aims to escape the JavaScript sandbox via a memory corruption vulnerability. To successfully compromise the JavaScript sandbox, we assume the attacker needs to first break ASLR and derandomize the location of some code and/or data pointers in the address space, a common attack model against modern defenses [53]. For this purpose, we assume the attacker cannot rely on ad-hoc disclosure vulnerabilities [16], [56] or special application/OS behavior [5], [6], [17], [19], [23], [47], [54]. While we focus on a JavaScript sandbox, the same principles apply to other sandboxing environments such as Google's Native Client [65].

III. BACKGROUND AND APPROACH

In this section, we discuss necessary details of the memory architecture in modern processors. Our description will focus on recent Intel processors due to their prevalence, but other processors use similar designs [4] and are equally vulnerable, as we will show in our evaluation in Section VII.

A. Virtual Address Translation

Currently, the virtual address space of a 64-bit process on x86_64 processors is 256 TB, whereas the physical memory backing it is often much smaller and may range from a few KBs to a few GBs in common settings. To translate a virtual address in the large virtual address space to its corresponding physical address in the smaller physical address space, the MMU uses the PT data structure. The PT is a uni-directed tree where parts of the virtual address select the outgoing edge at each level. Hence, each virtual address uniquely selects a path between the root of the tree and the leaf where the target physical address is stored. As the current x86_64 architecture uses only the lower 48 bits for virtual addressing, the total address space is 256 TB. Since a PT maintains a translation at the granularity of a memory page (4096 bytes), the lower 12 bits of a virtual page and its corresponding physical page are always the same. The other 36 bits select the path in the PT tree. The PT tree has four levels of page tables, where each PT is itself a page that stores 512 PT entries (PTEs). This means that at each PT level, 9 of the aforementioned 36 bits decide the offset of the PTE within the PT page.

Figure 1 shows how the MMU uses the PT for translating an example virtual address, 0x644b321f4000. On the x86 architecture, the CPU's CR3 control register points to the highest level of the page table hierarchy, known as level 4 or PTL4. The top 9 bits of the virtual address index into this single PTL4 page, in this case selecting PTE 200.
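The 9-bit index at each PT level follows directly from the bits of the virtual address. A minimal sketch in Python (illustrative only; the paper's attack runs in JavaScript, and the function name is ours) reproduces the walk indices for the example address from Figure 1:

```python
# Decompose an x86_64 virtual address into its four 9-bit page-table
# indices and the 12-bit page offset, as the MMU does during a PT walk.
def pt_indices(vaddr):
    offsets = {}
    for level in (4, 3, 2, 1):
        shift = 12 + 9 * (level - 1)          # PTL1 uses bits 12-20, PTL4 bits 39-47
        offsets[f"PTL{level}"] = (vaddr >> shift) & 0x1FF
    offsets["page_offset"] = vaddr & 0xFFF
    return offsets

print(pt_indices(0x644b321f4000))
# Selects PTE 200 at PTL4, 300 at PTL3, 400 at PTL2 and 500 at PTL1.
```

These are exactly the PTE offsets (200, 300, 400, 500) that the walk in Figure 1 traverses.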
This PTE has a reference to the level 3 page (i.e., PTL3), which the next 9 bits of the virtual address index to find the target PT entry (this time at offset 300). Repeating the same operation for the PT pages at levels 2 and 1, the MMU can then find the corresponding physical page for 0x644b321f4000 at the PT entry in the level 1 page.

Fig. 1. MMU's page table walk to translate 0x644b321f4000 to its corresponding memory page on the x86_64 architecture.

Fig. 2. Memory organization in a recent Intel processor.

Note that each PTE will be in a cache line, as shown by different colors and patterns in Figure 1. Each PTE on x86_64 is eight bytes; hence, each 64-byte cache line stores eight PTEs. We will discuss how we can use this information for derandomizing ASLR of a given virtual address in Section III-D, after looking into the memory organization and cache architecture of recent Intel x86_64 processors.

B. Memory Organization

Recent commodity processors contain a complex memory hierarchy involving multiple levels of caches in order to speed up the processor's access to main memory. Figure 2 shows how the MMU uses this memory hierarchy during virtual-to-physical address translation in a recent Intel Core microarchitecture. Loads and stores, as well as instruction fetches on virtual addresses, are issued from the core that is executing a process. The MMU performs the translation from the virtual address to the physical address using the TLB before accessing the data or the instruction, since the caches that store the data are tagged with physical addresses (i.e., physically-tagged caches). If the virtual address is in the TLB, the load/store or the instruction fetch can proceed. If the virtual address is not in the TLB, the MMU needs to walk the PT as we discussed in Section III-A and fill in the TLB. The TLB may include translation caches for different PT levels (e.g., paging-structure caches on Intel, described in Section 4.10.3 of [29]). As an example, if the TLB includes a translation cache for PTL2, then the MMU only needs to walk PTL1 to find the target physical address.

We now show how PT pages are cached and how we can evict them from the LLC. During the PT walk, the MMU reads PT pages at each PT level using their physical addresses. The MMU uses the same path as the core for loading data to load the PTEs during translation. As a result, after a PT walk, the cache lines that store the PTE at each PT level are available in the L1 data cache (i.e., L1D). We now briefly discuss the cache architecture.

C. Cache Architecture

In the Intel Core microarchitecture, there are three levels of CPU caches (the mobile version of the Skylake processors has a level 4 cache too). The caches that are closer to the CPU are smaller and faster, whereas the caches further away are slower but can store a larger amount of data. There are two caches at the first level, L1D and L1I, to cache data and instructions, respectively. The cache at the second level, L2, is a unified cache for both data and instructions. L1 and L2 are private to each core, but all cores share L3. An important property of these caches is their inclusivity. L2 is exclusive of L1, that is, the data present in L1 is not necessarily present in L2. L3, however, is inclusive of L1 and L2, meaning that if data is present in L1 or L2, it also has to be present in L3. We later exploit this property to ensure that a certain memory location is not cached at any level by making sure that it is not present in L3. We now discuss the internal organization of these CPU caches.

To adhere to the principle of locality while avoiding expensive logic circuits, current commodity processors partition the caches at each level. Each partition, often referred to as a cache set, can store only a subset of physical memory. Depending on the cache architecture, the physical or virtual address of a memory location decides its cache set. We often associate a cache set with wayness: an n-way set-associative cache can store n items in each cache set. A replacement policy then decides which of the n items to replace in case of a miss in a cache set. For example, the L2 cache on an Intel Skylake processor is 256 KB and 4-way set-associative with a cache line size of 64 B [28]. This means that there are 1024 cache sets (256 KB/(4-way × 64 B)) and bits 6 to 15 of a physical address decide its corresponding cache set (the lower 6 bits decide the offset within a cache line).

In the Intel Core microarchitecture, all the cores of the processor share the LLC, but the microarchitecture partitions it into so-called slices, one for each core, where each core has faster access to its own slice than to the others. In contrast to L1 and L2, where the lower bits of a physical address decide its corresponding cache set, a complex addressing function (based on an XOR scheme) decides the slice for each physical memory address [27], [44]. This means that each slice gets different cache sets. For example, a 4-core Skylake i7-6700K processor has an 8 MB 16-way set-associative LLC with 4 slices, each with 2048 cache sets.

D. Derandomizing ASLR

As discussed earlier, any memory access that incurs a TLB miss requires a PT walk. A PT walk reads four PTEs from main memory and stores them in four different cache lines in L1D if they are not there already.
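Which of a PT page's 64 cache lines each of these four PTEs lands in is again a direct function of the virtual address. A small Python sketch (our illustration, not the paper's code) makes the arithmetic concrete:

```python
# For each PT level, compute which of the 64 cache lines of the PT page
# holds the PTE, and hence the page offset of that line. Knowing the line
# reveals the upper 6 of the 9 index bits; the PTE's position inside the
# line (the remaining 3 bits) stays hidden.
def pte_cache_lines(vaddr):
    lines = {}
    for level in (1, 2, 3, 4):
        index = (vaddr >> (12 + 9 * (level - 1))) & 0x1FF  # 9-bit PTE index
        line = index >> 3                                   # 8 PTEs per 64-byte line
        lines[f"PTL{level}"] = {"line": line, "page_offset": line * 64}
    return lines

print(pte_cache_lines(0x644b321f4000))
# For the example address: line 62 at PTL1, 50 at PTL2, 37 at PTL3, 25 at PTL4.
```

Observing these four cache-line offsets is exactly what lets the attack recover six of the nine index bits per level, as explained next.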
Knowing the offset of these cache lines within a page already derandomizes six out of the nine bits of the virtual address at each PT level. The last three bits will still not be known, because the offset of the PTE within the cache line is not known. We hence need to answer three questions in order to derandomize ASLR: (1) which cache lines are loaded from memory during the PT walk, (2) which page offsets do these cache lines belong to, and (3) what are the offsets of the target PTEs in these cache lines?

1) Identifying the cache lines that host the PTEs: Since the LLC is inclusive of L1D, if the four PTE cache lines are in L1D, they will also be in the LLC, and if they are not in the LLC, they will also not be in L1D. This is an important property that we exploit for implementing AnC: rather than requiring a timer that can tell the difference between L1D and the LLC (assuming no L2 entry), we only require one that can tell the difference between L1D and memory, by evicting the target cache line from the LLC rather than from L1D. The PTE cache lines could land in up to four different cache sets. While we cannot directly identify the cache lines that host the PTEs, by monitoring (or controlling) the state of various cache sets at the LLC, we can detect MMU activity due to a PT walk at the affected cache sets. While the knowledge of MMU activity on cache sets is coarser than on cache lines, it is still enough to identify the offset of the PTE cache lines within a page, as we describe next.

2) Identifying page offsets of the cache lines: Oren et al. [48] realized that given two different (physical) memory pages, if their first cache lines (i.e., the first 64 B) belong to the same cache set, then their other 63 cache lines share (different) cache sets as well. This is due to the fact that for the first cache lines to be in the same cache set, all the bits of the physical addresses of both pages that decide the cache set and the slice have to be the same, and a given offset within both memory pages shares the lower 12 bits. Given, for example, 8192 unique cache sets, this means that there are 128 (8192/64) unique page colors in terms of the cache sets they cover. This simple fact has an interesting implication for our attack: given an identified cache set with PT activity, we can directly determine its page color and, more importantly, the offset of the cache line that hosts the PT entry.

3) Identifying cache line offsets of the PT entries: At this stage, we have identified the cache sets for the PTEs at each PT level. To completely derandomize ASLR for a given virtual address, we still need to identify the PTE offset within a cache line (inside the identified cache set), as well as to map each identified cache set to a PT level. We achieve both goals by accessing pages that are x bytes apart from our target virtual address v. For example, the pages that are 4 KB, 8 KB, ..., 32 KB away from v are 1 to 8 PTEs away from v at PTL1, and if we access them, we are ensured to see a change in one of the four cache sets that show MMU activity (i.e., the new cache set will directly follow the previous cache set). The moving cache set hence uniquely identifies itself as the one hosting the PT entry for PTL1, and the point at which the change in cache set happens uniquely identifies the PT entry offset of v within its cache line, derandomizing the unknown 3 least significant bits in PTL1. We can apply the same principle for finding the PT entry offsets at the other PT levels. We call this technique sliding and discuss it further in Section V-E.

E. ASLR on Modern Systems

Mapped virtual areas for position-independent executables in modern Linux systems exhibit 28 bits of ASLR entropy. This means that PTL1, PTL2 and PTL3 fully contribute to 27 bits of entropy, while only the last bit of the PTE offset in PTL4 contributes to the ASLR entropy. Nevertheless, if we want to identify this last bit, since it falls into the lowest three bits of the PTE offset (i.e., within a cache line), we require a cache set crossing at PTL4. Each PTE at PTL4 maps 512 GB of virtual address space, and hence we need a virtual mapping that crosses a 4 TB mark in order for a cache set change to happen at PTL4. Note that a cache set change in PTL4 results in cache sets changing in the other levels as well. We will describe how we can achieve this by exploiting the behavior of memory allocators in various browsers in Section VI.

Note that the entropy of ASLR in Linux is higher than in other popular operating systems such as Windows 10 [45], [63], which provides only 24 bits of entropy for the heap and 17–19 bits of entropy for executables. This means that on Windows 10, PTL4 does not contribute to ASLR for the heap area. Since each entry in PTL3 covers 1 GB of memory, a mapping that crosses an 8 GB mark will result in a cache set change at PTL3, resulting in derandomization of ASLR. The lower executable entropy means that it is possible to derandomize executable locations on Windows 10 when crossing only the two lower levels (i.e., with 16 MB). In this paper, we focus on the much harder case of Linux, which provides the highest entropy for ASLR.

F. Summary of Challenges and Approach

We have discussed the memory hierarchy on modern x86_64 processors and the way in which an attacker can monitor MMU activity to deplete ASLR's entropy. The remainder of this paper revolves around the three main challenges that we need to overcome to implement a successful attack:

C1 Distinguishing between a memory access and a cache access when performed by the MMU in modern browsers. To combat timing attacks from sandboxed JavaScript code [6], [48], browsers have decreased the precision of the accessible timers in order to make it difficult, if not impossible, for attackers to observe the time it takes to perform a certain action.

C2 Observing the effect of the MMU's PT walk on the state of the data cache. Recent work [48] shows that it is possible to build cache eviction sets from JavaScript in order to bring the last-level cache (LLC) to a known state for a well-known PRIME+PROBE attack [39], [49]. In a typical PRIME+PROBE attack, the victim is a process running on a core, whereas in our attack the victim is the MMU, with a different behavior.

C3 Distinguishing between multiple PT entries that are stored in the same cache line and uniquely identifying PT entries that belong to different PT levels.
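The sliding step can be illustrated with a toy model (ours, not the paper's code): we simulate which PTL1 cache line shows MMU activity as pages at increasing offsets from the target are accessed, and recover the 3 hidden bits from the point at which the active line changes.

```python
# Toy model of "sliding": the PT walk for page v + k touches PTL1 cache
# line (t + k) >> 3, where t is the target's secret 9-bit PTL1 index.
# The side channel exposes only the active cache line; the step k at which
# the line first advances reveals t mod 8, the 3 bits hidden in the line.
def recover_low_bits(observe_line, max_slide=8):
    base = observe_line(0)
    for k in range(1, max_slide + 1):
        if observe_line(k) != base:
            return 8 - k          # line changed after k steps -> t mod 8 == 8 - k
    return None

secret_index = 500                 # PTL1 index of the example address (unknown to attacker)
observe = lambda k: (secret_index + k) >> 3   # what the cache side channel exposes
print(recover_low_bits(observe))   # recovers 500 mod 8 == 4
```

The same sliding with larger strides (2 MB at PTL2, 1 GB at PTL3, and so on) identifies the hidden bits at the other levels, which is what the real attack does in Section V-E.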
On x86_64, for example, each PT entry is 8 bytes; hence, each 64-byte cache line can store 8 PT entries (i.e., the 3 lower bits of the PT entry's offset are not known). Therefore, to uniquely identify the location of a target PT entry within a PT page, we require the ability to access the virtual addresses that correspond to the neighboring PT entries in order to observe a cache line change. Further, in order to derandomize ASLR, we need PT entries to cross cache lines at each PT level.

To address C1, we have created a new synthetic timer in JavaScript that detects the difference between a memory and a cache access. We exploit the fact that the available timer, although coarse, is precise and allows us to use the CPU core as a counter to measure how long each operation takes. We elaborate on our design and its implications for browser-based timing attacks in Section IV.

To address C2, we built a PRIME+PROBE attack for observing the MMU's modifications to the state of the LLC in commodity Intel processors. We noticed that the noisy nature of PRIME+PROBE, which monitors the entire LLC in each round of the attack, makes it difficult to observe the (faint) MMU signal, but a more directed and low-noise EVICT+TIME attack that monitors one cache set at a time can reliably detect the MMU signal. We discuss the details of this attack for derandomizing JavaScript's heap and code ASLR in Section V.

To address C3, we needed to ensure that we can allocate and access virtually contiguous buffers that span enough PT levels to completely derandomize ASLR. For example, on a 64-bit Linux system, the ASLR entropy for the browser's heap and the JITed code is 28 bits, and on an x86_64 processor there are 4 PT levels, each providing 9 bits of entropy (each PT level stores 512 PT entries). Hence, we need a virtually contiguous area that spans all four PT levels for complete derandomization of ASLR. In Section VI, we discuss how ASLR⊕Cache exploits low-level memory management properties of Chrome and Firefox to gain access to such areas.

IV. TIMING BY COUNTING

Recent work shows that timing side channels can be exploited in the browser to leak sensitive information such as randomized pointers [6] or mouse movement [48]. These attacks rely on the precise JavaScript timer in order to tell the difference between an access that is satisfied through a cache and one satisfied through main memory. In order to thwart these attacks, major browser vendors have reduced the precision of the timer. Based on our measurements, both Firefox and Chrome have decreased the precision of performance.now() to exactly 5 µs.

We designed a small microbenchmark in order to better understand the quality of the JavaScript timer (i.e., performance.now()) in our target browsers. The microbenchmark measures how many times we can execute performance.now() in a tight loop between two subsequent ticks of performance.now(), repeated a hundred times. Figure 3 shows the results of our experiment in terms of frequency. Firefox shows a single peak, while Chrome shows multiple peaks. This means that Firefox's performance.now() ticks exactly at 5 µs, while Chrome has introduced some jitter around the 5 µs intervals. The decreased precision makes it difficult to tell the difference between a cached and a memory access (in the order of tens of nanoseconds), which we require for AnC to work.

Fig. 3. Measuring the quality of the low-precision performance.now() in Chrome (Google Chrome 50.0 on Linux 4.4.0) and Firefox (Mozilla Firefox 46.0.1 on Linux 4.4.0).

Figure 4.1 shows how the old timer was being used to distinguish between a cached and a memory access (CT stands for cache time and MT stands for memory time). With a low-precision timer, shown in Figure 4.2, it is no longer possible to tell the difference by simply calling the timer. Recent work [35] shows that it is possible to use the clock edge to improve the precision of the degraded counters, but it is still not enough to tell the difference between operations that are only tens of nanoseconds apart (i.e., accessing cache versus memory).

Fig. 4. 1. How the old performance.now() was used to distinguish between a cached and a memory access. 2. How the new performance.now() stops timing side-channel attacks. 3. How SMC can make the distinction in the memory reference using a separate counting core as a reference. 4. How TTT can make the distinction by counting in between ticks.

In the following sections, we describe two techniques for determining how long a memory reference takes by counting rather than by timing. Both techniques rely on the fact that CPU cores have a higher precision than performance.now(). The first technique (Figure 4.3), the shared memory counter (SMC), relies on an experimental feature (with a draft RFC [55]) that allows for sharing of memory between JavaScript's web workers (threads used for long-running background tasks). SMC builds a high-resolution counter that can be used to reliably implement AnC in all the browsers that implement the feature. Both Firefox and Chrome currently support this feature, but it needs to be explicitly enabled due to its experimental nature. We expect shared memory between JavaScript web workers to become a default-on mainstream feature in the near future.
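The counting idea behind SMC can be sketched with a deterministic stand-in for the real attack: a simulated counter advances in place of the worker core, and the observed delta separates a slow access from a fast one. The class, the cycle costs, and all names here are our illustrative assumptions, not the paper's code:

```python
# Sketch of timing-by-counting: instead of reading a (coarse) clock, read a
# counter that another core increments continuously. A slow (memory) access
# lets the counter advance further than a fast (cache) access does.
class CountingCore:
    """Stand-in for the web worker that increments a shared counter."""
    def __init__(self):
        self.count = 0
    def run_for(self, cycles):      # simulate the worker running during an access
        self.count += cycles

def measure(core, access_cycles):
    c1 = core.count                 # read the shared counter
    core.run_for(access_cycles)     # the memory reference happens here
    c2 = core.count                 # read the shared counter again
    return c2 - c1

core = CountingCore()
CACHE_HIT, MEMORY_ACCESS = 4, 60    # assumed cycle costs, purely for illustration
fast = measure(core, CACHE_HIT)
slow = measure(core, MEMORY_ACCESS)
print(fast, slow)                   # the larger delta identifies the memory access
```

In the real SMC implementation, the counting happens in a web worker on another core through a SharedArrayBuffer, so the main thread's two counter reads bracket the genuine access latency.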
  • 6.We use TTT on Firefox and SMC on Chrome for our evaluation in Section VII. The shared memory feature, needed for SMC, is currently enabled by default in the nightly build of Firefox, implemented in Chrome [10], implemented and currently enabled under experimental flag in Edge [8]. We have notified major browser vendors warning them of this danger. (Figure 4.4), time to tick (TTT), relies on the current lowprecision performance.now() for building a timer that allows us to measure the difference between a cached reference and a memory reference, and allows us to implement AnC in low-jitter browsers such as Firefox. The impact of TTT and SMC goes beyond AnC. All previous timing attacks that were considered mitigated by browser vendors are still applicable today. It is worth mentioning that the recently proposed fuzzy time defense for browsers [35], while also expensive, is not effective against SMC. We now discuss the TTT and SMC timers in further detail. V. I MPLEMENTING A N C Equipped with our TTT and SMC timers, we now proceed with the implementation of AnC described in Section III-D. We first show how we managed to trigger MMU walks when accessing our target heap and when executing code on our target JIT area in Section V-A. We then discuss how we identified the page offsets that store PT entries of a target virtual address in Sections V-B, V-C and V-D. In Sections V-E and V-F, we describe the techniques that we applied to observe the signal and uniquely identify the location of PT entries inside the cache lines that store them. In Sections V-G and V-H we discuss the techniques we applied to clear the MMU signal by flushing the page table caches and eliminating noise. A. Time to Tick The idea behind the TTT measurement, as shown in Figure 4.4, is quite simple. Instead of measuring how long a memory reference takes with the timer (which is no longer possible), we count how long it takes for the timer to tick after the memory reference takes place. 
More precisely, we first wait for performance.now() to tick, then execute the memory reference, and then count by executing performance.now() in a loop until it ticks again. If the memory reference is a fast cache access, we have time to count more before the next tick than if the memory reference needs to be satisfied by main memory.

TTT performs well in situations where performance.now() has no jitter and ticks at regular intervals, as in Firefox. We believe, however, that TTT can also be used with a jittery performance.now() as long as the timer does not drift, although it will require a larger number of measurements to combat the jitter.

A. Triggering MMU Page Table Walks

In order to observe the MMU's activity in the CPU caches, we need to make sure that 1) we know the page offset within our buffer when we access the target, and 2) we are able to evict the TLB in order to trigger an MMU walk on the target memory location. We discuss how we achieved these goals for heap memory and for JITed code.

1) Heap: We use the ArrayBuffer type to back the heap memory that we are trying to derandomize. An ArrayBuffer is always page-aligned, which makes it possible for us to predict the relative page offset of any index in our target ArrayBuffer.

Recent Intel processors have two levels of TLB. The first level consists of an instruction TLB (iTLB) and a data TLB (dTLB), while the second level is a larger unified TLB cache. In order to flush both the data TLB and the unified TLB, we access every page in a TLB eviction buffer that is larger than the unified TLB. We later show that this TLB eviction buffer can be used to evict LLC cache sets at the desired offset as well.

B. Shared Memory Counter

Our SMC counter uses a dedicated JavaScript web worker that counts through a memory area shared between the main JavaScript thread and the counting web worker. This means that during the attack we are, in practice, using a separate core for counting.
Figure 4.3 shows how an attacker can use SMC to measure how long a memory reference takes. The thread that wants to perform the measurement (in our case the main thread) reads the counter and stores it in c1, executes the memory reference, and then reads the counter again and stores it in c2. Since the other thread is incrementing the counter during the execution of the memory reference, a slow memory access results in a larger c2 − c1 than a faster cache access.

SMC is agnostic to the quality of performance.now(), since it relies only on a separate observer core for its measurements.

2) Code: In order to allocate a large enough JITed code area, we spray 2^17 JavaScript functions in an asm.js [26] module. We can tune the size of these functions by changing the number of their statements to be compiled by the JIT engine. The machine code of these functions starts at a browser-dependent but known offset in a page, and the functions follow each other in memory; since we can predict their (machine code) size on our target browsers, we know the relative offset of each function from the beginning of the asm.js object. In order to minimize the effect of these functions on the cache without affecting their size, we add an if statement at the beginning of every function so that its body is not executed. The goal is for an executed function to hit a single cache line, so as not to obscure the page-table cache line signals, while still maintaining a large offset between functions.

To trigger a PT walk when executing one of our functions, we need to flush the iTLB and the unified TLB. To flush the iTLB, we use a separate asm.js object and execute some of its functions, spanning enough pages to exceed the size of the iTLB. To flush the unified TLB, we use the same TLB eviction buffer that we use for the heap.

C. Discussion

We designed a microbenchmark that performs a cached access and a memory access for a given number of iterations.
We can do this by accessing a huge buffer (an improvised eviction set), ensuring that the next access to a test buffer will be uncached. We measure this access time with both timers. We then know that the next access to the same test buffer will be cached, and we time this access with both timers as well. In all cases, TTT and SMC could tell the difference between the two cases.
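The microbenchmark just described can be sketched as follows. This is a hedged illustration: the function name, buffer sizes, and stride are our assumptions, not the paper's values, and `timeAccess` abstracts over whichever timer (TTT or SMC) is in use.

```javascript
// Sketch of the discussion microbenchmark: touch a huge buffer (an improvised
// eviction set) so the next access to a small test buffer is served from
// memory, then access the test buffer again so it is served from the cache.
function runMicrobenchmark(timeAccess, iterations = 10) {
  const eviction = new Uint8Array(32 << 20); // 32 MiB, larger than a typical LLC
  const test = new Uint8Array(64);
  let sink = 0;
  const uncached = [];
  const cached = [];
  for (let i = 0; i < iterations; i++) {
    for (let j = 0; j < eviction.length; j += 64) sink += eviction[j]; // evict `test`
    uncached.push(timeAccess(() => { sink += test[0]; })); // first access: memory
    cached.push(timeAccess(() => { sink += test[0]; }));   // second access: cache
  }
  return { uncached, cached, sink };
}
```

Comparing the two distributions shows whether the chosen timer can separate cached from uncached accesses.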
As we will discuss shortly, AnC observes one page offset in each round. This allows us to choose the iTLB eviction functions and the page offset for the unified TLB eviction buffer in a way that does not interfere with the page offset under measurement.

Since we control the trigger (the MMU's PT walk), we could opt for a more exotic EVICT+TIME attack that allowed us to avoid the drawbacks of PRIME+PROBE (Section V-D).

B. PRIME+PROBE and the MMU Signal

Cache-based side-channel attacks benefit from the fine-grained information available in the state of the cache after a secret operation, namely the cache sets that were accessed by a victim. A cache set is uniquely identified by a color (i.e., page color) and a page (cache line) offset. For example, a cache set in an LLC with 8192 cache sets can be identified by a (color, offset) tuple, where 0 ≤ color < 128 and 0 ≤ offset < 64.

C. Cache Colors Do Not Matter for AnC

The main idea behind ASLR⊕Cache is the fact that we can observe the effect of the MMU's PT walk on the LLC. There are two attacks that we can implement for this purpose [49]: PRIME+PROBE or EVICT+TIME. To implement a PRIME+PROBE attack, we need to follow a number of steps: