It's OK to compare floating-points for equality

(lisyarus.github.io)

178 points | by coinfused 4 days ago

115 comments

  • _moof an hour ago

    Something I've observed as someone who works in the physical sciences and used to work as a software engineer is:

    Very few software engineers understand that tolerances are fundamental.

    In the physical sciences, strict equality - of actual quantities, not variables in equations - is almost never a thing. Even though an equation might show that two theoretical quantities are exactly equal, the practical fact is that every quantity begins life as a measurement, and measurements have inherent uncertainty.

    Even in design work, nothing is exact. It's simply not possible. A resistor's value is subject to manufacturing tolerances, will vary with temperature, and can drift as it ages. A mechanical part will also have manufacturing tolerance, and changes size and shape with temperature, applied forces, and wear. So even if a spec sheet states an exact number, the heading or notes will tell you that this is a nominal value under specific conditions. (Those conditions are also impossible to achieve and maintain exactly for all the same reasons.)

    Even the voltages that represent 0 and 1 inside a computer aren't exact. Digital parts like CPUs, GPUs, RAM, etc. specify low and high thresholds, under or over which a voltage is considered a 0 or 1.

    Floating-point numbers have uses outside the physical sciences, so there's no one-size-fits-all approach to using them correctly. But if you are writing code that deals with physical quantities, making equality comparisons is almost always going to be wrong even if floating-point numbers had infinite precision and no rounding error. Physical quantities simply can't be used that way.

  • vouwfietsman 12 hours ago

    This explanation is relatively reductive when it comes to its criticism of computational geometry.

    The thing with computational geometry is that it's usually someone else's geometry, i.e. you have no control over its quality or intention. In other words, whether two points or planes or lines align exactly or align within 1e-4 is no longer really mathematically interesting, because it's all about the intention of the user: does the user think these planes overlap?

    This is why most geometry kernels (see Open CASCADE) sport things like "fuzzy boolean operations" [0] that lean into epsilons. These epsilons mask the error-prone supply chain of the meshes that arrive in your program by allowing some tolerance.

    Finally, the remark "There are many ways of solving this problem" is also overly reductive. Everyone reading here should understand that this topic is being actively researched right now in 2026; there are currently no blessed solutions to this problem, otherwise that research would not be needed. Even more so, to some extent the problem is fundamentally unsolvable, depending on what you mean by "solvable": because your input is inexact, not all geometrical operations are topologically valid, hence an "exact" (let alone "correct along some dimension") result cannot be achieved for all combinations of inputs.

    [0] https://dev.opencascade.org/content/fuzzy-boolean-operations

    • throwup238 11 hours ago

      > This is why most geometry kernels (see open cascade) sport things like "fuzzy boolean operations" [0]) that lean into epsilons. These epsilons mask the error-prone supply chain of these meshes that arrive in your program by allowing some tolerance.

      They don’t just lean into epsilons, the session context tolerance is used for almost every single point classification operation in geometric kernels and many primitives carry their own accumulating error component for downstream math.

      Even then the current state of the art (in production kernels) is tolerance expansion where the kernel goes through up to 7 expansion steps retrying point classification until it just gives up. Those edge cases were some of the hardest parts of working on a kernel.

      This is a fundamentally unsolvable problem with floating point math (I worked on both Parasolid and ACIS in the 2000s). Even the ray-box intersection example TFA gives is a long standing thorn - raytracing is one of the last fallbacks for nasty point classification problems.

      • vouwfietsman 8 hours ago

        Nice thanks, gotta love knowing a bit about a niche and then encountering someone who knows a great deal more. That's the beauty of HN.

        Could you point to any literature/freely available resource that comes close to the SOTA for these kinds of operations? I would be greatly helped.

      • jstanley 10 hours ago

        > This is a fundamentally unsolvable problem with floating point math

        It's a fundamentally unsolvable problem with B-reps! The problem completely disappears with F-reps. (In exchange for some other difficult problems).

        • alterom 4 hours ago

          >(In exchange for some other difficult problems).

          Ahhaha.

          (I used to work in nTop, and boy is this an understatement when it comes to field based solid modeling)

      • MarkusQ 11 hours ago

        > They don’t just lean into epsilons, the session context tolerance is used for almost every single point classification operation in geometric kernels and many primitives carry their own accumulating error component for downstream math.

        The GP wasn't wrong. To "lean in" means to fully commit to, go all in on, (or, equivalently, go all out on).

        • vouwfietsman 8 hours ago

          I think his point is: rather than "leaning into" it as in, masking through epsilons, he argues that tolerance is fundamental to the problem space, not a way to resolve edge cases.

          • MarkusQ 2 hours ago

            Right. And my point is that "leaning in" doesn't mean masking, it means committing to. Taking seriously. Exactly the sort of thing he's describing.

            I'm wondering if people have heard the expression "leaning in" from people who were insincere/lying, and assumed that that was what the phrase means?

    • 8note 5 hours ago

      I'm surprised terminology isn't borrowed from mechanical engineering on the type of fit that two pieces are supposed to have. Interference fits vs. clearance fits do a physical job of describing what's happening.

  • jph 13 hours ago

    I have this floating-point problem at scale and will donate $100 to the author, or to anyone here, who can improve my code the most.

    The Rust code in the assert_f64_eq macro is:

        if (a >= b && a - b < f64::EPSILON) || (a <= b && b - a < f64::EPSILON)
    
    I'm the author of the Rust assertables crate. It provides floating-point assert macros much as described in the article.

    https://github.com/SixArm/assertables-rust-crate/blob/main/s...

    If there's a way to make it more precise and/or specific and/or faster, or create similar macros with better functionality and/or correctness, that's great.

    See the same directory for corresponding assert_* macros for less than, greater than, etc.

    • hmry 13 hours ago

      Is there any constant more misused in compsci than ieee epsilon? :)

      It's defined as the difference between 1.0 and the smallest number larger than 1.0. More usefully, it's the spacing between adjacent representable float numbers in the range 1.0 to 2.0.

      Because floats get less precise at every integer power of two, it's impossible for two numbers greater than or equal to 2.0 to be epsilon apart. The spacing between 2.0 and the next larger number is 2*epsilon.

      That means `abs(a - b) <= epsilon` is equivalent to `a == b` when both a and b are greater than or equal to 2.0. And if you use `<` then the limit will be 1.0 instead.

      Epsilon is the wrong tool for the job in 99.9% of cases.
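      To make the degeneracy concrete, here is a small Rust sketch of the boundary behavior described above (nothing here is from the assertables crate; it only exercises `f64::EPSILON` itself):

      ```rust
      fn main() {
          // In [1, 2) adjacent f64 values are exactly EPSILON apart, so an
          // absolute EPSILON tolerance still distinguishes adjacent values here.
          assert!(1.0_f64 + f64::EPSILON != 1.0);

          // At 2.0 the spacing doubles to 2 * EPSILON. The very next value
          // after 2.0 already fails an absolute EPSILON test, so for
          // a, b >= 2.0 the check abs(a - b) <= EPSILON accepts only a == b.
          let a = 2.0_f64;
          let b = 2.0_f64 + 2.0 * f64::EPSILON; // next representable f64 after 2.0
          assert!(a != b);
          assert!((a - b).abs() > f64::EPSILON);
          println!("EPSILON comparison degenerates to == at 2.0");
      }
      ```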

      • zamadatix 11 hours ago

        A (perhaps initially) counterintuitive part of the above more explicitly stated: The doubling/halving also means numbers between 0 and 1 actually have _more_ precision than the epsilon would suggest.

        • jameshart 11 hours ago

          Considerably more in many cases. The point of floating point is to have as many distinct values in the range 2-4 as in the range 1-2, as between 1/2 and 1, 1/4 and 1/2, 1/8 and 1/4, etc. The smallest representable difference between consecutive floating-point numbers down around the size of 1/64 is on the order of epsilon/64.

          Multiplying epsilon by the largest number you are dealing with is a strategy that makes using epsilons at least somewhat logical.

      • TomatoCo 12 hours ago

        The term I've seen a lot is https://en.wikipedia.org/wiki/Unit_in_the_last_place

        So I'd probably rewrite that code to first find the ulp of the larger of the abs of a and b and then assert that their difference is less than or equal to that.

        Edit: Or maybe the smaller of the abs of the two, I haven't totally thought through the consequences. It might not matter, because the ulps will only differ when the numbers are significantly apart and then it doesn't matter which one you pick. Perhaps you can just always pick the first number and get its ULP.
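        A possible sketch of such an ULP-distance check in Rust (the `ordered` and `ulp_distance` names are made up for illustration; NaN handling is omitted):

        ```rust
        /// Map the sign-magnitude f64 bit layout onto a monotonically
        /// increasing integer line, so that adjacent floats differ by 1.
        fn ordered(x: f64) -> i64 {
            let bits = x.to_bits() as i64;
            if bits < 0 { i64::MIN - bits } else { bits }
        }

        /// Distance between a and b in units of representable f64 steps.
        fn ulp_distance(a: f64, b: f64) -> u64 {
            (ordered(a) as i128 - ordered(b) as i128).unsigned_abs() as u64
        }

        fn main() {
            assert_eq!(ulp_distance(1.0, 1.0 + f64::EPSILON), 1);
            assert_eq!(ulp_distance(2.0, 2.0 + 2.0 * f64::EPSILON), 1);
            assert_eq!(ulp_distance(-0.0, 0.0), 0); // signed zeros compare equal
            // An assertion could then be: ulp_distance(a, b) <= max_ulps
            assert!(ulp_distance(0.1 + 0.2, 0.3) <= 2);
        }
        ```

        The nice property is that the tolerance automatically scales with magnitude, which is exactly what a fixed absolute epsilon lacks.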

        • magicalhippo 11 hours ago

          This is what was done to a raytracer I used. People kept making large-scale scenes with intricate details, think detailed ring placed on table in a room with a huge field in view through the window. For a while one could override the fixed epsilon based on scene scale, but for such high dynamic range scenes a fixed epsilon just didn't cut it.

          IIRC it would compute the "dynamic" epsilon value essentially by adding one to the mantissa (treated as an integer) to get the next possible float. Then subtract from that the initial value to get the dynamic epsilon value.

          Definitely use library functions if you got 'em though.
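          A minimal sketch of that bit trick in Rust, assuming finite, positive inputs (a real implementation would handle signs, zeros, and infinities, or just use `f64::next_up`, stable since Rust 1.86):

          ```rust
          /// "Dynamic epsilon": the gap between x and the next representable f64.
          /// Assumes x is finite and positive.
          fn dynamic_eps(x: f64) -> f64 {
              let next = f64::from_bits(x.to_bits() + 1); // add one to the mantissa bits
              next - x
          }

          fn main() {
              assert_eq!(dynamic_eps(1.0), f64::EPSILON);       // spacing in [1, 2)
              assert_eq!(dynamic_eps(2.0), 2.0 * f64::EPSILON); // spacing doubles at 2.0
              assert_eq!(dynamic_eps(1e16), 2.0);               // huge scales, huge epsilon
          }
          ```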

      • a-dub 12 hours ago

        i find the best way to remember it is "it's not the epsilon you think it is."

        epsilons are fine in the case that you actually want to put a static error bound on an equality comparison. numpy's relative errors are better for floats at arbitrary scales (https://numpy.org/doc/stable/reference/generated/numpy.isclo...).

        edit: ahh i forgot all about ulps. that is what people often confuse ieee eps with. also, good background material in the necronomicon (https://en.wikipedia.org/wiki/Numerical_Recipes).

      • russdill 5 hours ago

        It would be very useful to be able to compare the significand directly, then. I realize there is a boundary issue when a significand is very close to 0x00..000 or 0xFFF..FFF

    • jcranmer 10 hours ago

      Everyone has already made several comments on the incorrect use of EPSILON here, but there's one more thing I want to add that hasn't yet been mentioned:

      EPSILON (1 ulp for numbers in the range [1, 2)) is a lousy choice for tolerance. Every operation whose result is in the range [1, 2) has a maximum absolute rounding error of ½ ulp. Doing just a few operations in a row has a chance to make the accumulated error larger than your tolerance, simply because of the inherent inaccuracy of floating-point operations. Randomly generate a few doubles in the range [1, 10], then compute their sum in a few different random orders, and your assertion should fail. I'd guess you haven't run into this issue because either very few people are using this particular assertion, or the people who do happen to be testing cases where the result is fully deterministic.
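      The order-dependence is easy to reproduce; a classic example (not from the thread) where regrouping the very same additions moves the result by far more than EPSILON:

      ```rust
      fn main() {
          // Same multiset of values {1e16, 1.0, -1e16, 1.0}, two association orders.
          let forward = ((1e16_f64 + 1.0) - 1e16) + 1.0;   // the first +1.0 is absorbed: 1.0
          let regrouped = ((1.0_f64 + 1.0) + 1e16) - 1e16; // the ones survive together: 2.0
          assert_eq!(forward, 1.0);
          assert_eq!(regrouped, 2.0);
          // The discrepancy dwarfs any EPSILON-sized tolerance.
          assert!((forward - regrouped).abs() > f64::EPSILON);
      }
      ```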

      If you look at professional solvers for numerical algorithms, one of the things you'll notice is that not only is the (relative!) tolerance tunable, but there's actually several different tolerance values. The HiGHS linear solver for example uses 5 different tolerance values for its simplex algorithm. Furthermore, the default values for these tolerances tend to be in the region of 10^-6 - 10^-10... about the square root of f64::EPSILON. There's a basic rule of thumb in numerical analysis that you need your internal working precision to be roughly twice the number of digits as your output precision.

      • pfortuny 6 hours ago

        Your last comment is essential for numerical analysis, indeed. There is this "surprising" effect that increasing the precision of the input ends up decreasing that of the output (roughly speaking). So "I shall just use a very small discretization" can be harmful.

    • pclmulqdq 13 hours ago

      Your assertion code here doesn't make a ton of sense. The epsilon of choice here is the distance between 1 and the next number up, and it's completely separated from the scale of the numbers in question. 1e-50 will compare equal to 2e-50, for example.

      I would suggest that "equals" actually is for "exactly equals" as in (a == b). In many pieces of floating point code this is the correct thing to test. Then also add a function for "within range of" so your users can specify an epsilon of interest, using the formula (abs(a - b) < eps). You may also want to support multidimensional quantities by allowing the user to specify a distance metric. You probably also want a relative version of the comparison in addition to an absolute version.

      Auto-computing epsilons for an equality check is really hard and depends on the usage, as well as the numerics of the code that is upstream and downstream of the comparison. I don't see how you would do it in an assertion library.

    • judofyr 13 hours ago

      Ignoring the misuse of epsilon, I'd also say that you'd be helping your users more by not providing a general `assert_f64_eq` macro, but rather force the user to decide the error model. Add a required "precision" parameter as an enum with different modes:

          // Precise matching:
          assert_f64_eq!(a, 0.1, Steps(2))
        // same as: assert!(0.1.next_down().next_down() <= a && a <= 0.1.next_up().next_up())
      
          // Number of digits (after period) that are matching:
          assert_f64_eq!(a, 0.1, Digits(5))
      
          // Relative error:
          assert_f64_eq!(a, 0.1, Rel(0.5))
    • lukax 13 hours ago

      You generally want both relative and absolute tolerances. Relative handles scale, absolute handles values near zero (raw EPSILON isn’t a universal threshold per IEEE 754).

      The usual pattern is abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol) to avoid both large-value and near-zero pitfalls.
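      A sketch of that pattern in Rust (the `rel_tol` and `abs_tol` values below are arbitrary illustrations, to be tuned per use case):

      ```rust
      /// Combined relative/absolute closeness test, in the spirit of
      /// numpy.isclose and math.isclose.
      fn approx_eq(a: f64, b: f64, rel_tol: f64, abs_tol: f64) -> bool {
          (a - b).abs() <= f64::max(rel_tol * f64::max(a.abs(), b.abs()), abs_tol)
      }

      fn main() {
          // The relative term handles large magnitudes...
          assert!(approx_eq(1e9, 1e9 + 1.0, 1e-8, 1e-12));
          // ...the absolute term handles values near zero, where a relative
          // tolerance alone would reject everything but exact equality.
          assert!(approx_eq(1e-300, 0.0, 1e-8, 1e-12));
          // Genuinely different values still fail.
          assert!(!approx_eq(1.0, 1.1, 1e-8, 1e-12));
      }
      ```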

    • thomasmg 13 hours ago

      It depends on the use case, but do you consider NaN to be equal to NaN? For an assert macro, I would expect so. Also, your code behaves differently for very large and very small numbers: e.g. 1.0000001 vs 1.0000002 compare unequal, while 1e-100 vs 1.0000002e-100 compare equal.

      For my own soft-floating-point math library, I expect the value to be off by some percentage, not just off by epsilon. And so I have my own almostSame method [1] which accounts for that and is quite a bit more complex. Actually, multiple such methods. But well, that's just my own use case.

      [1] https://github.com/thomasmueller/bau-lang/blob/main/src/test...
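      The NaN question above can be handled with an explicit special case; a minimal sketch (the `assert_same` name is invented for illustration):

      ```rust
      /// Equality for assert purposes: ordinary == for numbers, but NaN
      /// deliberately compares equal to NaN (any payload).
      fn assert_same(a: f64, b: f64) -> bool {
          a == b || (a.is_nan() && b.is_nan())
      }

      fn main() {
          assert!(assert_same(f64::NAN, f64::NAN)); // plain == would say false
          assert!(assert_same(1.5, 1.5));
          assert!(!assert_same(f64::NAN, 1.5));
          assert!(assert_same(0.0, -0.0)); // note: == still conflates signed zeros
      }
      ```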

    • sobellian 10 hours ago

      Machine eps provides the maximum rounding error for a single op. Let's say I write:

        let y = 2.0;
        let x = sqrt(y);
      
      Now is `x` actually the square root of 2? Of course not - because the digit expansion of sqrt(2) doesn't terminate, the only way to precisely represent it is with symbolics. So what do we actually have? `x` was either rounded up or down to a number that does have an exact FP representation. So, `x` / sqrt(2) is in `[1 - eps, 1 + eps]`. The eps tells you, on a relative scale, the maximum distance to an adjacent FP number for any real number. (Full disclosure, IDK how this interacts with weird stuff like denormals).

      Note that in general we can only guarantee hitting this relative error for single ops. More elaborate computations may develop worse error as things compound. But it gets even worse. This error says nothing about errors that don't occur in the machine. For example, say I have a test that takes some experimental data, runs my whiz-bang algorithm, and checks if the result is close to elementary charge of an electron. Now I can't just worry about machine error but also a zillion different kinds of experimental error.

      There are also cases where we want to enforce a contract on a number so we stay within acceptable domains. Author alluded to this. For example - if I compute some `x` s.t. I'm later going to take `acos(x)`, `x` had better be between `[-1, 1]`. `x >= -1 - EPS && x <= 1 + EPS` wouldn't be right because it would include two numbers, -1 - EPS and 1 + EPS, that are outside the acceptable domain.

      - "I want to relax exact equality because my computation has errors" -> Make `assert_rel_tol` and `assert_abs_tol`.

      - "I want to enforce determinism" -> exact equality.

      - "I want to enforce a domain" -> exact comparison

      Your code here is using eps for controlling absolute error, which is already not great since eps is about relative error. Unfortunately your assertion degenerates to `a == b` for large numbers but is extremely loose for small numbers.
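      For the domain case, a sketch of the exact-comparison approach (the `acos_checked` name is made up; that `acos(1.0)` returns exactly 0.0 holds for standard libm implementations):

      ```rust
      /// Enforce the acos domain with exact comparisons: no epsilon widening,
      /// so no out-of-domain value can slip through.
      fn acos_checked(x: f64) -> f64 {
          assert!((-1.0..=1.0).contains(&x), "x = {x} outside [-1, 1]");
          x.acos()
      }

      fn main() {
          assert_eq!(acos_checked(1.0), 0.0);
          assert!((acos_checked(-1.0) - std::f64::consts::PI).abs() < 1e-15);
          // acos_checked(1.0 + f64::EPSILON) would panic: 1 + EPS is a
          // representable value strictly outside the domain.
      }
      ```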

    • layer8 12 hours ago

      Apart from what others have commented, IMO an “assertables” crate should not invent new predicates of its own, especially for domains (like math) that are orthogonal to assertability.

    • fouronnes3 13 hours ago

      You should use two tolerances: absolute and relative. See for example numpy.allclose()

      https://numpy.org/doc/stable/reference/generated/numpy.allcl...

    • lifthrasiir 13 hours ago

      Hyb error [1] might be what you want.

      [1] https://arxiv.org/html/2403.07492v2

    • thayne 8 hours ago

      You should allow the user to supply the epsilon value, because the precision needed for the assertion will depend on the use case.

    • bee_rider 10 hours ago

      EQ should be exactly equal, I think. Although we often (incorrectly) model floats as a real plus some non-deterministic error, there are cases where you can expect an exact bit pattern, and that’s what EQ is for (the obvious example is, you could be writing a library and accept a scaling factor from the user—scaling factors of 1 or 0 allow you to optimize).

      You probably also want an isclose and probably want to push most users toward using that.
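      A sketch of that exact-equality fast path (names invented for illustration):

      ```rust
      /// Scale a buffer in place. A factor of exactly 1.0 is a meaningful bit
      /// pattern here (the caller's "no scaling" default), so exact equality
      /// is the right test, not a tolerance.
      fn scale_in_place(data: &mut [f64], factor: f64) {
          if factor == 1.0 {
              return; // skip the whole pass
          }
          for x in data.iter_mut() {
              *x *= factor;
          }
      }

      fn main() {
          let mut v = [1.0, 2.0, 3.0];
          scale_in_place(&mut v, 1.0);
          assert_eq!(v, [1.0, 2.0, 3.0]); // untouched
          scale_in_place(&mut v, 2.0);
          assert_eq!(v, [2.0, 4.0, 6.0]);
      }
      ```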

    • meindnoch 8 hours ago

      Well, could you please describe a scenario where you think this assertion would be useful?

    • reacweb 12 hours ago

      I suggest

      if a.abs()+b.abs() >= (a-b).abs() * 2f64.powi(48)

      It remains accurate for small and for big numbers. 48 is slightly less than 52, the number of f64 mantissa bits.

    • firebot 3 hours ago

      You probably don't need the (in)accuracy.

      Fix your precision so it matches.

      You only need so many significant digits.

    • icantremember 11 hours ago

      You want equality?

      ‘a.to_bits() == b.to_bits()’

      Alternatively, use ‘partial_cmp’ and fall back to bit equality if it returns None.

    • scotty79 7 hours ago

      The author says in the article that for tests (and an assertion is a test) it's OK to use epsilon.

    • colechristensen 9 hours ago

      I think a key thing you may want is an ε which scales with the actual local floating-point increment.

      C++ implements this https://en.cppreference.com/cpp/numeric/math/nextafter

      Rust does not https://rust-lang.github.io/rfcs/3173-float-next-up-down.htm... but people have in various places.

      • tialaramex 9 hours ago

        Um, what? You've linked an RFC for Rust, but the CPP Reference article for C++. So yeah, the Rust RFC documents a proposed change, and the C++ reference documents an implemented feature, but you could equally link the C++ proposal document and the Rust library docs to make the opposite point if you wanted.

        Rust's https://doc.rust-lang.org/std/primitive.f32.html#method.next... https://doc.rust-lang.org/std/primitive.f32.html#method.next... of course exist, they're even actually constant expressions (the C++ functions are constexpr since 2023 but of course you're not promised they actually work as constant expressions because C++ is a stupid language and "constexpr" means almost nothing)

        You can also rely on the fact (not promised in C++) that these are actually the IEEE floats and so they have all the resulting properties. You can (entirely in safe Rust) just ask for the integers with the same bit pattern, compare integers, and because of how IEEE is designed, that tells you how far away, in some proportional sense, the two values are.

        On an actual CPU manufactured this century that's almost free because the type system evaporates during compilation -- for example f32::to_bits is literally zero CPU instructions.

        • colechristensen 8 hours ago

          Oh, my research was wrong and the line from the RFC doc...

          >Currently it is not possible to answer the question ‘which floating point value comes after x’ in Rust without intimate knowledge of the IEEE 754 standard.

          So never mind on it not being present in Rust; I guess I was finding old documentation.

          • tialaramex an hour ago

            Yeah, the RFC is explaining what they proposed in 2021. In 2022 that work landed in "nightly" Rust, which means you could see it in the documentation (unless you've turned off seeing unstable features entirely) but to actually use it in software you need the nightly compiler mode and a feature flag in your source #![feature(float_next_up_down)].

            By 2025 every remaining question about edge cases or real-world experience was resolved, and in April 2025 the finished feature was stabilized in release 1.86, so it has just worked in Rust for about a year.

            For future reference you can follow separate links from a Rust RFC document to see whether the project took this RFC (anybody can write one, not everything gets accepted) and then also how far along the implementation work is. Can I use this in nightly? Maybe there's an outstanding question I can help answer. Or, maybe it's writing a stabilization report and this is my last chance to say "Hey, I am an expert on this and your API is a bit wrong".

    • werdnapk 12 hours ago

      The use of epsilon is correct here. It's exactly what I was taught in comp sci over 20 years ago. You can call its use here an "epsilon-delta".

  • dnautics 9 hours ago

    > In reality it is a pretty deterministic (modulo compiler options, CPU flags, etc)

    IIRC this was not ALWAYS the case, on x86 not too long ago the CPU might choose to put your operation in an 80-bit fp register, and if due to multitasking the CPU state got evicted, it would only be able to store it in a 32-bit slot while it's waiting to be scheduled back in?

    It might not be the case now in a modern system if, based on load patterns, the software decides to schedule some math operations on the GPU vs. the CPU, or maybe in some corner case where you are horizontally load balancing on two different GPUs (one AMD, one Nvidia) -- I'm speculating here.

    • dunham 7 hours ago

      I was bit by this years ago when our test cases failed on Linux, but worked on macos. pdftotext was behaving differently (deciding to merge two lines or not) on the two platforms - both were gcc and intel at the time. When I looked at it in a debugger or tried to log the values, it magically fixed itself.

      Eventually I learned about the 80-bit thing and that macOS gcc was automatically adding -ffloat-store to make == more predictable (they use floats everywhere in the UI library). Since pdftotext was full of == comparisons, I ended up adding -ffloat-store to the gcc command line and calling it a day.

    • Dylan16807 4 hours ago

      > IIRC this was not ALWAYS the case, on x86 not too long ago the CPU might choose to put your operation in an 80-bit fp register, and if due to multitasking the CPU state got evicted, it would only be able to store it in a 32-bit slot while it's waiting to be scheduled back in?

      I don't think the CPU was ever allowed to do that, but with your average compiler you were playing with fire.

      Did any actual OS mess up state like that? They could and should save the full registers. There's even a builtin instruction for this, FSAVE.

    • Negitivefrags 4 hours ago

      This is the kind of misinformation that makes people more wary of floats than they should be.

      The same series of operations with the same input will always produce exactly the same floating point results. Every time. No exceptions.

      Hardware doesn't matter. Breed of CPU doesn't matter. Threads don't matter. Scheduling doesn't matter. IEEE floating point is a standard. Everyone follows the standard. Anything not producing identical results for the same series of operations is *broken*.

      What you are referring to is the result of different compilers doing a different series of operations than each other. In particular, if you are using the x87 fp unit, MSVC will round 80-bit floating point down to 32/64 bits before doing a comparison, and GCC will not by default.

      Compilers don't even use 80-bit FP by default when compiling for 64-bit targets, so this is not a concern anymore, and hasn't been for a very long time.

      • purplesyringa 3 hours ago

        There are just so many "but"s to this that I can't in good faith recommend that people treat floats as deterministic, even though I'd very much love to do so (and I make such assumptions myself, caveat emptor):

        - NaN bits are non-deterministic. x86 and ARM generate different sign bits for NaNs. Wasm says NaN payloads are completely unpredictable.

        - GPUs don't give a shit about IEEE-754 and apply optimizations ranging from DAZ to -ffast-math.

        - sin, rsqrt, etc. behave differently when implemented by different libraries. If you're linking libm for sin, you can get different implementations depending on the libc in use. Or you can get different results on different hardware.

        - C compilers are allowed to "optimize" a * b + c to FMA when they wish to. The standard only technically allows this merge within one expression, but GCC enables this in all cases by default on some `-std`s.

        You're technically correct that floats can be used right, but it's just impossible to explain to a layman that, yes, floats are fine on CPUs, but not on GPUs; fine if you're doing normal arithmetic and sqrt, but not sin or rsqrt; fine on modern compilers, but not old ones; fine on x86, but not i686; fine if you're writing code yourself, but not if you're relying on linear algebra libraries, unless of course you write `a * b + c` and compile with the wrong options; fine if you rely on float equality, but not bitwise equality; etc. Everything is broken and the entire thing is a mess.
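        The FMA point is easy to demonstrate. In Rust the contraction is explicit via `mul_add`, which is what a C compiler may silently substitute for `a * b + c`:

        ```rust
        fn main() {
            let (a, b, c) = (1.0 + f64::EPSILON, 1.0 - f64::EPSILON, -1.0);
            let separate = a * b + c;    // a*b rounds to 1.0 first, so this is 0.0
            let fused = a.mul_add(b, c); // single rounding keeps the -EPSILON^2 term
            assert_eq!(separate, 0.0);
            assert!(fused != 0.0);
            assert_ne!(separate, fused); // same expression, different results
        }
        ```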

        • Negitivefrags 3 hours ago

          Yes, there are a large number of ways to fall into traps that cause you to do a different series of operations when you didn't realise that you did. But that's still ultimately what all your examples are. (Except the NaN thing!)

          I still think it's important to fight the misinformation.

          Programmers have been conditioned to be so afraid of floats that many believe that doing a + b has an essentially random outcome when it doesn't work that way at all. It leads people to spend a bunch of effort on things that they don't need to be doing.

  • GuB-42 8 hours ago

    The thing with floating point numbers is they are meant to work with physical quantities: distances, durations, etc...

    Physical quantities involve imprecision: measurement devices, tools, display devices, ADC/DACs etc... They all have some tolerances. And when you are using epsilons, the epsilon value should be chosen based on that physical value. For example, you set the epsilon to 1e-4 because that's 100 microns and you can't display 100 micron details.

    That's also the reason why there is not one size fits all solution. If you are working with microscopic objects, 100 microns is huge, and if you are doing a space simulation, 1 km may be negligible. Some operations involve a huge loss of precision, some don't, and sometimes you really want exact numbers and therefore you have to know your fractional powers of 2.

    • waffletower 8 hours ago

      This is highly reductive, "they are meant to work with physical quantities", but agree that the applicability of an epsilon is entirely situational.

  • amelius 12 hours ago

    Think about this. It's silly to use floating point numbers to represent geometry, because it gives coordinates closer to the origin more precision and in most cases the origin is just an arbitrary point.

    • anonymars 12 hours ago

      Random aside, but as I recall this is what made Kerbal Space Program so difficult. Very large distances and changing origins as you'd go to separate bodies, and I think the latter was basically because of this aspect of floating point. And because of the mismanagement of KSP2 they had to relearn these difficulties, because they didn't really have the experienced people working with the new developers.

      I only played it rather than modded it, so happy to be corrected or further enlightened, but seems like an interesting problem to have to solve.

      Edit: sure enough, it was actually discussed here: https://news.ycombinator.com/item?id=26938812

      • adgjlsfhk1 8 hours ago

        What KSP really should have done is just done their orbital math separately from their force propagation. If they had made a virtual node for each craft's center of mass, they could have made it so that the COM position was just never affected by intra-body forces and done the orbital math in super high (Float128?) precision.

    • m-schuetz 7 hours ago

      For geometry, fixed-precision integers are better. But for computation and usability, floats are great. Scaling a 10 meter model in floats to 13% of the size is a trivial multiplication by 0.13f. With integers, this can get tricky. You can't first divide by 100 and then multiply by 13, because you'd lose precision. You also can't multiply by 13 and then divide by 100, because you might overflow. Maybe vendors could add hardware that computes this more accurately, like they currently do for float, but honestly, float is good enough and the potential benefits do not outweigh the disadvantages.
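      One standard way out of the 13% dilemma is widening before the multiply; a sketch, assuming i32 coordinates (the function name is invented):

      ```rust
      /// Scale an i32 fixed-point coordinate to 13%: widen to i64 so the
      /// multiply cannot overflow, and multiply first so no precision is
      /// lost before the single final division.
      fn scale_13_percent(x: i32) -> i32 {
          ((x as i64 * 13) / 100) as i32
      }

      fn main() {
          assert_eq!(scale_13_percent(1_000_000), 130_000);
          // Divide-first would give (i32::MAX / 100) * 13 = 279_172_868,
          // losing the low digits; multiply-first in i64 keeps them.
          assert_eq!(scale_13_percent(i32::MAX), 279_172_874);
      }
      ```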

      Float is also fantastic for depth values precisely because they have more precision towards the origin, basically quasi-logarithmic precision. Having double the precision at half the distance is A+. At least if you're writing software rasterizers and do linear depth. The story with depth buffer precision in GPU pipelines with normalized depth and a hyperbolic distribution is... sad.

    • meheleventyone 11 hours ago

      Yeah in a lot of cases it's much better to use integers and a fixed precision as the absolute unit of position. For games it's just that the scale of most games works well with floats in the range they care about.

    • adrian_b 10 hours ago

      You are right, but only for a certain meaning of the word "geometry".

      If "geometry" refers to the geometry of an affine space, i.e. a space of points, then indeed there is nothing special about any point that is chosen as the origin, and no reason to desire lower tolerances for the coordinates of points close to the current origin.

      Therefore for the coordinates of points in an affine space, using fixed-point numbers would be a better choice. There are also other quantities for which usually floating-point numbers are used, despite the fact that fixed-point numbers are preferable, e.g. angles and logarithms.

      On the other hand, if you work with the vector space associated to an affine space, i.e. with the set of displacements from one point to another, then the origin is special, i.e. it corresponds with no displacement. For the components of a vector, floating-point numbers are normally the right representation.

      So for the best results, one would need both fixed-point numbers and floating-point numbers in a computer.

      These were provided in some early computers, but it is expensive to provide hardware for both, so eventually hardware execution units were provided only for floating-point numbers.

      The reason is that fixed-point numbers can be implemented in software with a modest overhead, using operations with integer numbers. The overhead consists in implementing correct rounding, keeping track of the position of the fraction point and doing some extra shifting when multiplications or divisions are done.

      In languages that allow the user to define custom types and that allow operator overloading and function overloading, like C++, it is possible to make the use of fixed-point numbers as simple as the use of the floating-point numbers.

      Some programming languages, like Ada, have fixed-point numbers among the standard data types. Nevertheless, not all compilers for such programming languages include an implementation for fixed-point numbers that has a good performance.

      • AlotOfReading 8 hours ago

        Fixed point and Floating point are extremely similar, so most of the time you should just go with floats. If you start with a fixed type, reserve some bits for storing an explicit exponent and define a normalization scheme, you've recreated the core of IEEE floats. That also means we can go the other way and emulate (lower precision) fixed point by masking an appropriate number of LSBs in the significand to regain the constant density of fixed. You can treat floating point like fixed point in a log space for most purposes, ignoring some fiddly details about exponent boundaries.

        And since they're essentially the same, there just aren't many situations where implementing your own fixed point is worth it. MCUs without FPUs are increasingly uncommon. Financial calculations seem to have converged on Decimal floating point. Floating point determinism is largely solved these days. Fixed point has better precision at a given width, but 53 vs 64 bits isn't much different for most applications. If you happen to regularly encounter situations where you need translation invariants across a huge range at a fixed (high) precision though, fixed point is probably more useful to you.

        • adrian_b 6 hours ago

          There are applications where the difference between fixed-point and floating-point numbers matters, i.e. the difference between having a limit for the absolute error or for the relative error.

          The applications where the difference does not matter are those whose accuracy requirements are much less than provided by the numeric format that is used.

          When using double-precision FP64 numbers, the rounding errors are frequently small enough to satisfy the requirements of an application, regardless if those requirements are specified as a relative error or as an absolute error.

          In such cases, floating-point numbers must be used, because they are supported by the existing hardware.

          But when an application has stricter requirements for the maximum absolute error, there are cases where it is preferable to use smaller fixed-point formats instead of bigger floating-point formats. This is especially true when FP64 is not sufficient, so quadruple-precision floating-point numbers would be needed; hardware support for those is rare, so they must be implemented in software anyway, preferably as double-double-precision numbers.

          • AlotOfReading 5 hours ago

                i.e. the difference between having a limit for the absolute error or for the relative error.
            
            The masking procedure I mentioned gives uniform absolute error in floats, at the cost of lost precision in the significand. The trade-off between the two is really space and hence precision.

            I'm not saying fixed point is never useful, just that it's a very situational technique these days to address specific issues rather than an alternative default. So if you aren't even doing numerical analysis (as most people don't), you should stick with floats.

    • Dylan16807 4 hours ago

      Floating point has the benefit of not screaming and exploding when you have to take three lengths and calculate a volume.

      Double precision floating point is like a 54-bit fixed point system that automatically scales to the exact size you need it to be. You get huge benefits for paying those 10 exponent bits. Even if you need those extra bits, you're often better off switching to a higher precision float or a double-double system.
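      The "auto-scaling fixed point" view is easy to check (a sketch; Python floats are IEEE doubles): every integer up to 2^53 is exactly representable, and one past that the spacing between adjacent doubles becomes 2:

```python
# Every integer with magnitude up to 2**53 has an exact double.
below = 2.0**53 - 1      # exact
edge = 2.0**53           # exact
past = 2.0**53 + 1       # true value falls between representable doubles
                          # and is rounded (ties-to-even) back to 2**53
```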

    • rpdillon 12 hours ago

      For all the players of the original Morrowind out there, you'll notice that your character movement gets extremely janky when you're well outside of Vvardenfell because the game was never designed to go that far from the origin. OpenMW fixes this (as do patches to the original Morrowind, though I haven't used those), since mods typically expand outwards from the original island, often by quite a bit.

    • Asooka 3 hours ago

      Well yeah, you would store your values in whatever representation fits your domain, then do the calculations with floats based on a suitable origin when needed. For example, for raytracing you would have each model defined in its local coordinate system with 32-bit floats for coordinates (because those are plenty accurate enough for single human-scale models), but offset them in the scene with 64-bit doubles (again, plenty enough of precision), and convert the ray coordinates to the local coordinates for ray-mesh intersection once the ray-box intersection passes.

  • desdenova 11 hours ago

    The problem with floating point comparison is not that it's nondeterministic, it's that what should be the same number may have different representations, often with different rounding behavior as well, so depending on the exact operations you use to arrive at it, it may not compare as equal, hence the need for the epsilon trick.

    If all you're comparing is the result from the same operations, you _may_ be fine using equality, but you should really know that you're never getting a number from an uncontrolled source.
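    A minimal illustration of that point: the same real number reached through different operations compares unequal, while repeating identical operations is perfectly deterministic:

```python
# Each literal and each sum is individually rounded to the nearest double:
a = 0.1 + 0.2        # 0.30000000000000004
b = 0.3              # a different rounding of the same real number

# Identical operations produce an identical bit pattern every time:
same_ops = 0.1 + 0.2
```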

  • demorro 13 hours ago

    I guess I'm confused. I thought epsilon was the smallest possible value to account for accuracy drift across the range of a floating point representation, not just "1e-4".

    Done some reading. Thanks to the article for waking me up to this fact at least. I didn't realize that the epsilon provided by languages tends to be the one that only works around 1.0, and if you want to use epsilons globally (which the article would say is generally a bad idea) you need to be more dynamic as your ranges, and potential errors, increase.
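    The distinction can be seen directly (a sketch; `sys.float_info.epsilon` is the machine epsilon that "only works around 1.0"):

```python
import sys

eps = sys.float_info.epsilon      # gap from 1.0 to the next double, ~2.2e-16

near_one = 1.0 + eps              # representable: compares greater than 1.0

# At magnitude 1e8 the spacing between adjacent doubles is ~1.5e-8, so
# adding eps is absorbed entirely and the comparison degenerates to ==:
at_1e8 = 1e8 + eps

# A magnitude-aware comparison scales the tolerance with the operands:
def roughly_equal(a, b, rel=1e-12):
    return abs(a - b) <= rel * max(abs(a), abs(b))
```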

    • rpdillon 13 hours ago

      Yeah, I'm not sure how widespread the knowledge is that floating point trades precision for magnitude. It's obvious if you know the implementation, but I'm not sure most folks do.

      • ryandrake 11 hours ago

        I remember having to convince a few coworkers that the number of distinct floating point values between 0.0 and 1.0 is the same as the number of values between 1.0 and infinity. They must not be teaching this properly anymore. Are there no longer courses that explain the basics of floating point representation?

        I was arguing that we could squeeze a tiny bit more precision out of our angle types by storing angles in radians (range: -π to π) instead of degrees (range: -180 to 180) because when storing as degrees, we were wasting a ton of floating point precision on angles between -1° and 1°.
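        The count claim is easy to verify by bit-casting (a sketch; positive IEEE doubles sort by their bit patterns, so the bit patterns of 1.0 and +inf bound the two ranges):

```python
import struct

def bits(x: float) -> int:
    # Reinterpret a double's 64 bits as an unsigned integer.
    return struct.unpack("<Q", struct.pack("<d", x))[0]

one = bits(1.0)            # 0x3FF0000000000000
inf = bits(float("inf"))   # 0x7FF0000000000000

below_one = one            # count of positive doubles in [0.0, 1.0)
one_to_inf = inf - one     # count of positive doubles in [1.0, inf)
ratio = one_to_inf / below_one   # ~1.001: nearly equal counts
</imports>```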

        • Dylan16807 4 hours ago

          That doesn't work. The only real difference between those two scales is in the values located between -.0000000001 and .0000000001. And that's grossly underestimating the number of 0s.

          No matter what scale you pick, your number line is going to look like this: https://anniecherkaev.com/images/floating_point_density.jpg Do a 2x zoom in or out and not a single pixel of the graph will change, just the labels.

          Whether your biggest value is 0.005 or 7000000, most of your range has 24 (or 53) bits of precision. 99% of values are either too small to matter or outside your range. Changing your scale shifts between the "too small" and "too big" categories, but the number of useful values stays roughly the same.

        • adrian_b 10 hours ago

          What you say was exactly true only in most floating-point formats used before 1980.

          In those old FP formats, the product of the smallest normalized and non-null FP number with the biggest normalized and non-infinite FP number was approximately equal to 1.

          However in the IEEE standard for FP arithmetic, it was decided that overflows are more dangerous than underflows, so the range of numbers greater than 1 has been increased by diminishing the range of numbers smaller than 1.

          With IEEE FP numbers, the product of the smallest and biggest non-null non-infinite numbers is no longer approximately 1, but it is approximately 4.

          So there are more numbers greater than 1 than smaller than 1. For IEEE FP numbers, there are approximately as many numbers smaller than 2 as there are numbers greater than 2.

          An extra complication appears when the underflow exception is masked. Then there is an additional set of numbers smaller than 1, the denormalized numbers. Those are not numerous enough to compensate for the additional numbers bigger than 1, but with them the midpoint is no longer at 2 but somewhere between 1 and 2, close to 1.5.

          • adgjlsfhk1 9 hours ago

            > With IEEE FP numbers, the product of the smallest and biggest non-null non-infinite numbers is no longer approximately 1, but it is approximately 4.

            This is just wrong? The largest Float64 is 1.7976931348623157e308 and the smallest is 5.0e-324 They multiply to ~1e-16.
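            The two figures can be reconciled with a quick check: the ~4 product holds for the smallest *normal* double, while 5.0e-324 is the smallest *denormal* (a sketch):

```python
import sys

# Smallest normal double (~2.2e-308) times the largest finite double:
normal_product = sys.float_info.min * sys.float_info.max   # ~4.0

# Smallest denormal double (5e-324) times the largest finite double:
denorm_product = 5e-324 * sys.float_info.max               # ~9e-16
```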

        • valicord 10 hours ago

          Wait this doesn't make sense. Yes you'd get smaller absolute error in radians, but it doesn't really help because it's different units. Relative error is the same in degrees and radians, that's the whole point of exponential representation. All you're doing is adding a fixed offset to the exponent, but it doesn't give you any more precision when converting to radians

          • adrian_b 10 hours ago

            Having a constant relative error is indeed the reason for using floating-point numbers.

            However, for angles the relative error is completely irrelevant. For angles only the absolute error matters.

            For angles the optimum representation is as fixed-point numbers, not as floating-point numbers.

            • valicord 9 hours ago

              With -π to π radians you get absolute error of approximately 4e-16 radians. With -180 to 180 degrees you get absolute error of approximately 2e-14 degrees.

              Even though the first number is smaller than the 2nd one, they actually represent the same angle once you consider that they are different units. So there's no precision advantage (absolute or relative) to converting degrees to radians.

              Note that I'm not saying anything about fixed vs floating point, only responding to an earlier comment that radians give more precision in floating point representation.
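              This can be checked with `math.ulp`, which gives the absolute spacing of doubles at a value (a sketch):

```python
import math

ulp_rad = math.ulp(math.pi)   # ~4.4e-16: spacing of doubles near pi
ulp_deg = math.ulp(180.0)     # ~2.8e-14: spacing near 180, 64x coarser

# Convert the degree spacing into radians: the unit change nearly
# cancels, leaving the two representations within ~12% of each other.
ulp_deg_in_rad = ulp_deg * math.pi / 180.0
```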

            • Dylan16807 4 hours ago

              The absolute error accounting for units is what matters.

              Changing the unit gives the illusion of changing absolute error, but doesn't actually change the absolute error.

            • ryandrake 9 hours ago

              Yep, it was a long time ago but I think that's exactly what we ended up with, eventually: An int type of unit 2π/(int range). I believe we used unsigned because signed int overflow is undefined behavior.
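              A binary-angle ("BAM") sketch of that idea, with hypothetical helper names: the full unsigned 32-bit range maps to one turn, so wraparound comes free from integer overflow (emulated here with a mask, since Python ints don't wrap):

```python
MASK = 0xFFFFFFFF  # 2**32 - 1: one full turn

def deg_to_bam(deg: float) -> int:
    return round(deg / 360.0 * 2**32) & MASK

def bam_to_deg(bam: int) -> float:
    return bam / 2**32 * 360.0

# 350 degrees + 20 degrees wraps to 10 degrees with no special casing:
total = (deg_to_bam(350.0) + deg_to_bam(20.0)) & MASK
```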

        • bee_rider 9 hours ago

          Wouldn’t you want to use “turns” for that sort of thing?

          Re: teaching floats; when I was working with students, we touch on floats slightly, but mostly just to reinforce the idea that they aren’t always exact. I think, realistically, it can be hard. You don’t want to put an “intro to numerical analysis” class into the first couple lectures of your “intro to programming” class, where you introduce the data-types.

          Then, if you are going to do a sort of numerical analysis or scientific computing class… I dunno, that bit of information could end up being a bit of trivia or easily forgotten, right?

      • MarkusQ 11 hours ago

        Always remember https://m.xkcd.com/2501/

    • mhh__ 12 hours ago

      Some languages even use different definitions of epsilon! (dotnet...)

  • Joker_vD 5 hours ago

    To quote from one of my previous comments:

    > > the myth about exactness is that you can't use strict equality with floating point numbers because they are somehow fuzzy. They are not.

    > They are though. All arithmetic operations involve rounding, so e.g. (7.0 / 1234 + 0.5) * 1234 is not equal to 7.0 + 617 (it differs in 1 ULP). On the other hand, (9.0 / 1234 + 0.5) * 1234 is equal to 9.0 + 617, so the end result is sometimes exact and sometimes is not. How can you know beforehand which one is the case in your specific case? Generally, you can't, any arithmetic operation can potentially give you 1 ULP of error, and it can (and likely, will) slowly accumulate.

    Also, please don't comment how nobody has a use for "f(x) = (x / 1234 + 0.5) * 1234": there are all kinds of queer computations people do in floating point, and for most of them, figuring out the exactness of the end result requires an absurd amount of applied numerical analysis, doing which would undermine most of the "just let the computer crunch the numbers" point of doing this computation on a computer.
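    The quoted example is easy to reproduce (a sketch; the exact outcome of each case follows from IEEE round-to-nearest, so rather than hard-coding which case happens to round exactly, the check below only asserts that any error stays within a couple of ulps):

```python
import math

f7 = (7.0 / 1234 + 0.5) * 1234   # mathematically exactly 624
f9 = (9.0 / 1234 + 0.5) * 1234   # mathematically exactly 626

err7 = abs(f7 - 624.0)           # 0 or ~1 ulp, depending on how the
err9 = abs(f9 - 626.0)           # per-operation roundings interact
```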

  • mcv 9 hours ago

    This is mostly about game logic, where I can understand the reliance on floating point numbers. I've also seen these epsilon comparisons in code that had nothing to do with game engines or positions in continuous space, and it has always hurt my eyes.

    I think if you want to work with values that might be exactly equal to other values, floating point is simply not the right choice. For money, use BigDecimal or something like that. For lots of purposes, int might be more appropriate. If you do need floating point, maybe compare whether the value is larger than the other value.
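    For money specifically, Python's `decimal` module plays the BigDecimal role, and equality behaves the way bookkeeping expects:

```python
from decimal import Decimal

# Binary floats mis-round decimal cents; Decimal arithmetic is exact.
float_sum = 0.1 + 0.2                          # 0.30000000000000004
exact_sum = Decimal("0.10") + Decimal("0.20")  # exactly 0.30
```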

  • mtklein 6 hours ago

    My preference in tests is a little different than just using IEEE 754 ==,

        _Bool equiv(float x, float y) {
            return (x <= y && y <= x)
                || (x != x && y != y);
        }
    
    which both handles NaNs sensibly (all NaNs are equivalent) and won't warn about using == on floats. I find it also easy to remember how to write when starting a new project.
  • hansvm 10 hours ago

    My normal issue with floating-point epsilon shenanigans is that they don't usually pass the sniff test, suggesting something fundamentally wrong with the problem framing or its solution.

    It's a classic, so let's take vector normalization as an example. Topologically, you're ripping a hole in the space, and that's causing your issues. It manifests as NaN for length-zero vectors, weird precision issues too close to zero, etc, but no matter what you employ to try to fix it you're never going to have a good time squishing N-D space onto the surface of an N-D sphere if you need it to be continuous.

    Some common subroutines where I see this:

    1. You want to know the average direction of a bunch of objects and thus have to normalize each vector contributing to that average. Solution 1: That's not what you want almost ever. In any of the sciences, or anything loosely approximating the real world, you want to average the un-normalized vectors 99.999% of the time. Solution 2: Maybe you really do need directions for some reason (e.g., tracking where birds are looking in a game). Then don't rely on vectors for your in-band signaling. Explicitly track direction and magnitude separately and observe the magic of never having direction-related precision errors.

    2. You're doing some sort of lighting normalization and need to compute something involving areas of potentially near-degenerate triangles, dividing by those values to weight contributions appropriately. Solution: Same as above, this is kind of like an average of averages problem. It can make fuzzy, intuitive sense, but you'll get better results if you do your summing and averaging in an un-normalized space. If you really do need surface normals, store those explicitly and separate from magnitude.

    3. You're doing some sort of ML voodoo to try to get better empirical results via some vague appeal to vanishing gradients or whatever. Solution: The core property you want is a somewhat strange constraint on your layer's Jacobian matrix, and outside of like two papers nobody is willing to put up with the code complexity or runtime costs, even when they recognize it as the right thing to do. Everything you're doing is a hack anyway, so make your normalization term x/(|x|+eps) with eps > 0 rather than equal to zero like normal. Choose eps much smaller than most of the vectors you're normalizing this way and much bigger than zero. Something like 1e-3, 1e-20, and 1e-150 should be fine for f16, f32, and f64. You don't have to tune because it's a pretty weak constraint on the model, and it's able to learn around it.
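    A sketch of the x/(|x|+eps) normalization from point 3 (the eps value follows the comment's f64 suggestion; `safe_normalize` is a hypothetical name):

```python
import math

def safe_normalize(v, eps=1e-150):
    # eps is far below any realistic magnitude but far above zero, so it
    # only matters for (near-)zero vectors, which degrade smoothly to
    # the zero vector instead of dividing by zero or producing NaN.
    n = math.sqrt(sum(c * c for c in v))
    return [c / (n + eps) for c in v]
```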

  • beyondCritics 8 hours ago

    If your code may be compiled to use the Intel x87 numerical coprocessor, an important issue is the so-called "excess precision": different values on chip can collapse after being rounded and stored to their memory locations, invalidating previous comparisons. Spilling can happen unexpectedly. Note that Intel calls the x87 "legacy".

    • pclmulqdq 8 hours ago

      Nobody's code will be compiled to use x87 any more.

      • beyondCritics 7 hours ago

        There is plenty of demand for so called "secure code", where such coder arrogance will not be tolerated. Trust me on that, I know it.

        • adgjlsfhk1 6 hours ago

          modern compilers do just have options to disable using x87 registers entirely.

  • lisper 8 hours ago

    Well, at least the author is honest about it:

    > The title of this post is an intentional clickbait.

    Unfortunately, that's where the honesty ends.

    > It's NOT OK to compare floating-points using epsilons.

    > So, are epsilons good or bad? Usually bad, but sometimes okay.

    So which is it? Emphatically NOT OK, or sometimes okay?

  • mizmar 4 days ago

    There is another way to compare floats for rough equality that I haven't seen much explored anywhere: bit-cast to integer, strip a few least significant bits, and then compare for equality. This is agnostic to magnitude, unlike epsilon, which has to be tuned to the range of values you expect in order to get a meaningful result.

    • twic 14 hours ago

      This doesn't work. For any number of significant bits, there are pairs of numbers one machine epsilon apart which will truncate to different values.

    • SideQuark 14 hours ago

      Completely worked out at least 20 years ago: https://www.lomont.org/papers/2005/CompareFloat.pdf

      • fn-mote 13 hours ago

        Note for the skeptic: this cites Knuth, Volume II, writes out the IEEE edge cases, and optimizes.

      • mizmar 9 hours ago

        Great reading, thanks. Describes how to handle ±0, works with difference to avoid truncation errors. First half of the paper is arriving at this correct snippet, second part of the paper is about optimizing it.

            bool DawsonCompare(float af, float bf, int maxDiff)
            {
                int ai = *reinterpret_cast<int*>(&af);
                int bi = *reinterpret_cast<int*>(&bf);
                if (ai < 0)
                    ai = 0x80000000 - ai;
                if (bi < 0)
                    bi = 0x80000000 - bi;
                int diff = ai - bi;
                if (abs(diff) < maxDiff)
                    return true;
                return false;
            }
    • ethan_smith 12 hours ago

      This is essentially ULP (units in the last place) comparison, and it's a solid approach. One gotcha: IEEE 754 floats have separate representations for +0 and -0, so values straddling zero (like 1e-45 and -1e-45) will look maximally far apart as integers even though they're nearly equal. You need to handle the sign bit specially.
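      A sketch of a sign-aware ULP distance (the same sign-folding trick as the snippet quoted upthread, here on doubles; `ulp_distance` is a hypothetical name):

```python
import struct

def _ordered(x: float) -> int:
    # Reinterpret the double's bits; for negative values, fold the
    # sign-magnitude encoding so the mapping is monotonic in the value
    # and -0.0 lands on the same integer as +0.0.
    u = struct.unpack("<Q", struct.pack("<d", x))[0]
    return -(u - 2**63) if u >= 2**63 else u

def ulp_distance(a: float, b: float) -> int:
    return abs(_ordered(a) - _ordered(b))
```

      With this mapping, values straddling zero (like 5e-324 and -5e-324) are correctly seen as adjacent rather than maximally far apart.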

      • kbolino 8 hours ago

        There's another gotcha. Consider positive, normal x and y where ulp(y) != ulp(x). Bitwise comparison, regardless of tolerance, will consider x to be far from y, even though they might be adjacent numbers, e.g. if y = x+ulp(x) but y is a power of 2.

        • ack_complete 5 hours ago

          This case actually works because for finite numbers of a given sign, the integer bit representations are monotonic with the value, due to the placement of the exponent and mantissa fields and the implicit mantissa bit. For instance, 1.0 in IEEE float is 0x3F800000, and the representable value immediately below it is 0x3F7FFFFF.

          Signed zero and the sign-magnitude representation is more of an issue, but can be resolved by XORing the sign bit into the mantissa and exponent fields, flipping the negative range. This places -0 adjacent to 0 which is typically enough, and can be fixed up for minimal additional cost (another subtract).

          • kbolino 4 hours ago

            I interpreted OP's "bit-cast to integer, strip few least significant bits and then compare for equality" message as suggesting this kind of comparison (Go):

              func equiv(x, y float32, ignoreBits int) bool {
                  mask := uint32(0xFFFFFFFF) << ignoreBits
                  xi, yi := math.Float32bits(x), math.Float32bits(y)
                  return xi&mask == yi&mask
              }
            
            with the sensitivity controlled by ignoreBits, higher values being less sensitive.

            Supposing y is 1.0 and x is the predecessor of 1.0, the smallest value of ignoreBits for which equiv would return true is 24.

            But a worst case example is found at the very next power of 2, 2.0 (bitwise 0x40000000), whose predecessor is quite different (bitwise 0x3FFFFFFF). In this case, you'd have to set ignoreBits to 31, and thus equivalence here is no better than checking that the two numbers have the same sign.

        • oasisaimlessly 5 hours ago

          I don't think this is true. Modulo the sign bit, the "next float" operator is equivalent to the next bitstring or the integer++.

          • kbolino 4 hours ago

            Sure, but that operator can propagate a carry all the way to the most significant bit, so a check for bitwise equality after "strip[ping] few least significant bits" will yield false in some cases. The pathologically worst case for single precision, for example, is illustrated by the value 2.0 (bitwise 0x40000000) and its predecessor, which differ in all bits except the sign.

    • andyjohnson0 14 hours ago

      > strip few least significant bits

      I'm unconvinced. Doesn't this just replace the need to choose a suitable epsilon with the need to choose the right number of bits to strip? With the latter affording far fewer choices for the degree of "roughness" than the former.

      • chaboud 11 hours ago

        Not quite. It's basically a combined mantissa and exponent test, so it can be thought of as functionally equivalent to scaling epsilon by a power of two (the shared exponent of the nearly equal floating point values) and then using that epsilon.

        I think I'll just use scaled epsilon... though I've gotten lots of performance wins out of direct bitwise trickery with floats (e.g., fast rounding with mantissa normalization and casting).

    • StilesCrisis 13 hours ago

      Rather than stripping bits, you can just compare if the bit-casted numbers are less than N apart (choose an appropriate N that works for your data; a good starting point is 4).

      This breaks down across the positive/negative boundary, but honestly, that's probably a good property. -0.00001 is not all that similar to +0.00001 despite being close on the number line.

      It also requires that the inputs are finite (no INF/NAN), unless you are okay saying that FLT_MAX is roughly equal to infinity.

    • hansvm 10 hours ago

      That works well for sorting/bucketing/etc in a few places, but as a comparison it's prone to false negatives (values that are actually close compare as not close), so you're restricted to algorithms tolerant of that behavior.

  • bananzamba 3 hours ago

    If you need equality, just use fixed point

  • hun3 9 hours ago

    I used floating timestamps as some kind of an identity. If there is ever a conflict, I just increase it by 1 ulp until it doesn't collide with anything. Sorry.

  • Cold_Miserable 6 hours ago

    I've yet to encounter a need for == equality for floating point operations.

    • spacechild1 4 hours ago

      I need it all the time. A very common case is caching of (expensive) computations. Let's say you have a parameter for an audio plugin and every parameter change requires some non-trivial and possibly expensive computation (e.g. calculation of filter coefficients). To avoid wasting CPU cycles on every audio sample, you only do the (re)calculation when the parameter has actually changed:

        double freq = getInput(0);
        if (freq != mLastFreq) {
          calculateCoefficients(freq);
          mLastFreq = freq;
        }
      
      Also, keep in mind that certain languages, such as JS, store all numbers as double-precision floating point numbers. So every time you are writing a numeric for-loop in JS you are implicitly relying on floating point equality :)
  • apitman 10 hours ago

    > that's how maths works

    Wait is British "maths" a singular noun or is this a typo? I was willing to go along with it if it was plural, but I have to draw the line here.

    • adrian_b 9 hours ago

      Originally, maths/mathematics meant "things that are taught", like physics meant "natural things" and similarly for other such names.

      However, nowadays a word like physics is understood not as "natural things", but as an implicit abbreviation for "the science of natural things". Similarly for mathematics, mechanics, dynamics and so on.

      So such nouns are used as singular nouns, because the implicit noun "science" is singular.

    • f33d5173 10 hours ago

      "Maths" is short for "mathematics". The latter is not plural and can be substituted into this quote with no other alterations.

    • jonquark 10 hours ago

      Yes maths is singular, just like physics. We would say in the UK "maths is hard, physics is also hard"

    • 3836293648 10 hours ago

      Maths is like physics

  • 4pkjai 14 hours ago

    I do this to see if text in a PDF is exactly where it is in some other PDF. For my use case it works pretty well.

  • Asooka 3 hours ago

    My one small nitpick is that vector length is usually 2 instructions with SSE4:

        dpps xmm0, xmm0, 0x71 ; dot product of 3 lanes, write lane 0
        sqrtss xmm0, xmm0
        ret
    
    And is considerably faster than the fancy version, mainly because Intel still hasn't given us a horizontal-max vector instruction! ARM is a bit better in that regard with its fancy vmaxvq_f32 and vmaxnmvq_f32...
  • darepublic 9 hours ago

    Plus or minus eps

  • AshamedCaptain 13 hours ago

    One of the goals of comparing floating points with an epsilon is precisely so that you can apply these kinds of accuracy-increasing (or -decreasing) changes to the operations and still get similar results.

    Anything else is basically a nightmare for whoever has to maintain the code in the future.

    Also, good luck with e.g. checking if points are aligned to a grid or the like without introducing a concept of epsilon _somewhere_.

  • thayne 8 hours ago

    So they say you shouldn't use epsilons, then their solution to the first problem is to use an epsilon. There may be some cases where you can get by without an epsilon comparison, but in many cases epsilon comparison is the right thing to do; you just need to choose a good value for it.

    This is especially true in cases where the number comes from some kind of input (user controls, sensor reading, etc.) or random number generation.