More

cokernel_hacker · 2024-12-17T17:03:27 1734455007

Internal DACs/amps have gotten a lot better, at least in Apple products.

https://support.apple.com/en-us/108351

https://www.audiosciencereview.com/forum/index.php?threads/r...

cokernel_hacker · 2024-10-23T15:47:52 1729698472

ARM has DIT: https://developer.arm.com/documentation/ddi0595/2021-06/AArc...

Intel has DOIT: https://www.intel.com/content/www/us/en/developer/articles/t...

The idea is that the processor will not take shortcuts that take advantage of the values it is processing. For example, a 64-bit division cannot shortcut if the operands are both small, etc.

namibj · 2024-10-23T17:10:27 1729703427

Intel DOIT instruction list does not include DIV: https://www.intel.com/content/www/us/en/developer/articles/t...

cokernel_hacker · 2024-09-05T19:40:34 1725565234

With chip fabrication technology in the spotlight lately, this book chapter on the history of EUV development and DARPA's role might be of interest to some.

cokernel_hacker · 2024-09-05T02:16:40 1725502600

What behavior should the following have:

  int f(int x) {
    switch (x) {
    case 0:
      return 31;
    case 1:
      return 28;
    case 2:
      return 30;
    }
  }

This code on its own has no undefined behavior.

In another translation unit, someone calls `f(3)`. What would you have compilers do in that case?

That path through the program has undefined behavior. However, the two translation units are separate and as such normal tooling will not be able to detect any sort of UB without some kind of whole program static analysis or heavy instrumentation which would harm performance.

zzo38computer · 2024-09-05T03:24:32 1725506672

What I would have it to do is: Return a number that is in the range of the "int" type, but there is no guarantee what number it will be, and it will not necessarily be consistent when called more than once, when the program is executed more than once (unless the operating system has features to enforce consistent behaviour), when the program is compiled for and running on a different computer, etc. I would also have the undefined value to be frozen, like the "freeze" command in LLVM. Normally, the effect would be according to the target instruction set, because it would be compiled in the best way for that target instruction set. Depending on the compiler options, it might also display a warning that not all cases are handled, although this warning would be disabled by default. (However, some instruction sets might allow it to be handled differently; e.g. if you have an instruction set with tagged pointers that can be stored in ordinary registers and memory, then there is the possibility that trying to use the return value causes an error condition.)

torstenvl · 2024-09-05T02:24:17 1725503057

I would do what the standard tells me to do, which is to ignore the undefined behavior if I don't detect it.

On most platforms, that would probably result in the return value of 3 (it would still be in AX, EAX, r0, x0, o0/i0, whatever, when execution hits the ret instruction or whatever that ISA/ABI uses to mark the end of the function). But it would be undefined. But that's fine.

[EDIT: I misremembered the x86 calling convention, so my references to AX and EAX are wrong above. Mea culpa.]

What isn't fine is ignoring the end of the function, not emitting a ret instruction, and letting execution fall through to the next label/function, which is what I suspect GCC does.

cokernel_hacker · 2024-09-05T02:37:27 1725503847

So let's change it up a bit.

  typedef int (*pfn)(void);
  int g(void);
  int h(void);

  pfn f(double x) {
    switch ((long long)x) {
    case 0:
      return g;
    case 17:
      return h;
    }
  }

If I understand your perspective correctly, `f` should return whatever happens to be in rax if the caller does not pass in a number which truncates to 0 or 17?

torstenvl · 2024-09-05T02:59:43 1725505183

More or less, yes.

I quibble with "should return" because I don't think it's accurate to say it "should" do anything in any specific set of circumstances. In fact, I'm saying the opposite: it should generate the generic, semantic code translation of what is actually written in source, and if it happens to "return whatever happens to be in rax" (as is likely on x64), then so be it.

In my view, that's what "ignoring the situation completely with unpredictable results" means.

nlewycky · 2024-09-05T03:14:45 1725506085

Why isn't that fine? The compiler ignored the undefined behavior it didn't detect.

torstenvl · 2024-09-05T03:55:37 1725508537

No. No honest person can claim that making a decision predicated on the existence of X is the same as "ignoring" X.

nlewycky · 2024-09-05T06:54:40 1725519280

This is the most normal case though, isn't it? Suppose a very simple compiler, one that sees a function so it writes out the prologue, it sees the switch so it writes out the jump tables, it sees each return statement so it writes out the code that returns the values, then it sees the function closing brace and writes out a function epilogue. The problem is that the epilogue is wrong because there is no return statement returning a value, the epilogue is only correct if the function has void return type. Depending on ABI, the function returns to a random address.

Most of the time people accuse compilers of finding and exploiting UB and say they wish it would just emit the straight-forward code, as close to writing out assembly matching the input C code expression by expression as possible. Here you have an example where the compiler never checked for UB let alone proved presence of UB in any sense, it trusted the user, it acted like a high-level assembler, yet this compiler is still not ignoring UB for you? What does it take? Adding runtime checks for the UB case is ignoring? Having the compiler find the UB paths to insert safety code is ignoring?

torstenvl · 2024-09-06T10:29:09 1725618549

> the epilogue is only correct if the function has void return type

That's a lie.

> Adding runtime checks for the UB case is ignoring? Having the compiler find the UB paths to insert safety code is ignoring?

Don't come onto HN with the intent of engaging in bad faith.

nlewycky · 2024-09-08T09:03:36 1725786216

> > the epilogue is only correct if the function has void return type

> That's a lie.

All C functions return via a return statement with expression (only for non-void functions), a return statement without an expression (only for void functions) or by the closing of function scope (only for void functions). True?

The simple "spit out a block of assembly for each thing in the C code" compiler spits out the epilogue that works for void-returning functions, because we reach the end of the function with no return statement. That epilog might happen to work for non-void functions too, but unless we specify an ABI and examine that case, it isn't guaranteed to work for them. So it's not correct to emit it. True?

Where's the lie?

> > Adding runtime checks for the UB case is ignoring? Having the compiler find the UB paths to insert safety code is ignoring?

> Don't come onto HN with the intent of engaging in bad faith.

Always! You too!

The text you quoted was referring to how real compilers handle falling off the end of a non-void function today with -fsanitize=return from UBSan. If I understand you correctly, in your reading a compiler with UBSan enabled is non-conforming because it fails to ignore the situation. That's not an argument as to whether your reading is right or wrong, but I do think UBSan compilation ought to be standard conforming, even if that means we need to add it to the Standard.

To the larger point, because the Standard doesn't define what "ignore" means, the user and implementer can't use it to pin down whether a given UB was ignored or not, and thus whether a given program was miscompiled or not. A compiler rewrites the code into its intermediate IR -- could be Z3 SMT solver or raw Turing Machine or anything -- then writes code back out. Can ignoring be done at any stage in the middle? Once the code has been converted and processed, how can you tell from the assembly output what's been ignored and what hasn't? If you demand certain assembly or semantics, isn't that just defining undefined behaviour? If you don't demand them, and leave the interpretation of "ignore" to the particular implementation of a compiler, yet any output could be valid for some potential design of compiler, why not allow any compiler emit whatever it wants?

qiqitori · 2024-09-05T02:30:03 1725503403

This won't compile with reasonable compiler flags. (-Wall and a reasonable set of -Werror settings).

Now, assume that you didn't compile this with those flags; what actually happens is entirely obvious but platform-dependent. Assume amd64 (and many other architectures) where the return value is in the "accumulator register", assume that int is 32 bits. The return value will be whatever was in eax. The called function doesn't set eax (or maybe does in order to implement some unrelated surrounding code). The caller takes eax without knowledge of where it came from.

loeg · 2024-09-05T15:52:07 1725551527

In a new language that isn't C, that function shouldn't compile at all (missing return).

In a C compiler, inserting a trap (x86 ud2, for example) might be reasonable.

cokernel_hacker · 2024-09-05T02:08:08 1725502088

That works for a lot of behavior but not everything. For example:

  int f(int x) {
    static int y[] = {42, 43};
    return y[x];
  }

What behavior should `f(-1)` or `f(100)` have? What is sensible?

Y_Y · 2024-09-05T12:00:52 1725537652

Desugar to pointer arithmetic, try to do an dereference like

    *(y-1)

and more than likely segfault, or return the value at that address if it's somehow valid.

cokernel_hacker · 2024-05-03T20:33:46 1714768426

This python actually builds a graph under the hood which then gets JIT compiled for CPU/GPU/TPU.

senseiV · 2024-05-03T22:00:55 1714773655

Does a TPU have XLA-graph for GPUs Cuda-graphs? Not sure on TPU theory

dekhn · 2024-05-04T17:30:41 1714843841

TPUs lower XLA into their own instruction set. https://pytorch.org/xla/release/2.3/index.html

cokernel_hacker · on Dec 3, 2023

I found more of it in here: https://www.oranlooney.com/post/playfair/

However, I can't determine where this is originally from...

cokernel_hacker · on March 10, 2023

Fun fact: the MSVC C++ ABI gives up if the mangled name is >= 4096 characters, it just replaces the symbol with md5(mangled name): https://github.com/llvm/llvm-project/blob/d32f71a91a432db2d9...

cokernel_hacker · on Jan 7, 2022

I particularly liked the "Windows" vs "Sindogs" issue. My immediate hunch was that the characters differed by a single bit so I ran a bit of python:

  >>> [ord(x) ^ ord(y) for (x, y) in zip("windows", "sindogs")]
  [4, 0, 0, 0, 0, 16, 0]

Sure enough...

cokernel_hacker · on Jan 4, 2022

The conclusions section of the paper is a good summary:

"In the process, we learned ten lessons about DSAs and DNNs in general and about DNN DSAs specifically that shaped the design of TPUv4i:

1. Logic improves more quickly than wires and SRAM ⇒ TPUv4i has 4 MXUs per core vs 2 for TPUv3 and 1 for TPUv1/v2.

2. Leverage existing compiler optimizations ⇒ TPUv4i evolved from TPUv3 instead of being a brand new ISA.

3. Design for perf/TCO instead of perf/CapEx ⇒ TDP is low, CMEM/HBM are fast, and the die is not big.

4. Backwards ML compatibility enables rapid deployment of trained DNNs ⇒TPUv4i supports bf16 and avoids arithmetic problems by looking like TPUv3 from the XLA compiler’s perspective.

5. Inference DSAs need air cooling for global scale ⇒ Its design and 1.0 GHz clock lowers its TDP to 175W.

6. Some inference apps need floating point arithmetic ⇒ It supports bf16 and int8, so quantization is optional.

7. Production inference normally needs multi-tenancy ⇒ TPUv4i’s HBM capacity can support multiple tenants.

8. DNNs grow ~1.5x annually in memory and compute ⇒ To support DNN growth, TPUv4i has 4 MXUs, fast onand off-chip memory, and ICI to link 4 adjacent TPUs.

9. DNN workloads evolve with DNN breakthroughs ⇒ Its programmability and software stack help pace DNNs.

10. The inference SLO is P99 latency, not batch size ⇒ Backwards ML compatible training tailors DNNs to TPUv4i, yielding batch sizes of 8–128 that raise throughput and meet SLOs. Applications do not restrict batch size."

ArtWomb · on Jan 4, 2022

>>> 8. DNNs grow ~1.5x annually in memory and compute

Wow! That's a massive growth rate for ML. TPUv3 was already faster than A100 in MLPerf. But this suggests a real breakthrough is needed to keep pace with future requirements. Each MXU already handles 16k ops per tick. And with the additional constraint of optimizing per watt rather than dollar, its quite the challange ;)

omegalulw · on Jan 4, 2022

> But this suggests a real breakthrough is needed to keep pace with future requirements.

Not necessarily. Look at papers like the lottery ticket hypothesis - big ML models may be doing better simply because gradient descent just isn't doing a good enough job. Better optimizers would go a long way than just throwing compute at the the problem. Even if you can, it's impractical to use something like GPT-3 all the time.