Quadruple-precision floating-point format
From Wikipedia, the free encyclopedia
In computing, quadruple precision (or quad precision) is a binary floating-point–based computer number format that occupies 16 bytes (128 bits) with precision at least twice the 53-bit double precision.
This 128-bit quadruple precision is designed not only for applications requiring results in higher than double precision,^{[1]} but also, as a primary function, to allow the computation of double-precision results more reliably and accurately by minimising overflow and round-off errors in intermediate calculations and scratch variables. William Kahan, primary architect of the original IEEE 754 floating-point standard, noted, "For now the 10-byte Extended format is a tolerable compromise between the value of extra-precise arithmetic and the price of implementing it to run fast; very soon two more bytes of precision will become tolerable, and ultimately a 16-byte format ... That kind of gradual evolution towards wider precision was already in view when IEEE Standard 754 for Floating-Point Arithmetic was framed."^{[2]}
In IEEE 754-2008 the 128-bit base-2 format is officially referred to as binary128.
IEEE 754 quadruple-precision binary floating-point format: binary128
The IEEE 754 standard specifies a binary128 as having:
 Sign bit: 1 bit
 Exponent width: 15 bits
 Significand precision: 113 bits (112 explicitly stored)
This gives from 33 to 36 significant decimal digits of precision. If a decimal string with at most 33 significant digits is converted to the IEEE 754 quadruple-precision format, giving a normal number, and then converted back to a decimal string with the same number of digits, the final result should match the original string. If an IEEE 754 quadruple-precision number is converted to a decimal string with at least 36 significant digits, and then converted back to quadruple-precision representation, the final result must match the original number.^{[3]}
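The two digit counts can be reproduced from the 113-bit significand. The following is a minimal sketch, assuming the usual floor/ceiling bounds for a binary format with p significand bits; the function name decimal_digit_bounds is hypothetical and does not come from the standard.

```python
import math

def decimal_digit_bounds(p):
    """Return (safe_digits, roundtrip_digits) for a p-bit binary significand.

    safe_digits: decimal digits that survive decimal -> binary -> decimal.
    roundtrip_digits: decimal digits needed for binary -> decimal -> binary.
    Both formulas are assumed here, not quoted from the article.
    """
    safe_digits = math.floor((p - 1) * math.log10(2))
    roundtrip_digits = math.ceil(1 + p * math.log10(2))
    return safe_digits, roundtrip_digits

print(decimal_digit_bounds(113))  # binary128
print(decimal_digit_bounds(53))   # binary64, for comparison
```

For p = 113 this gives the 33 and 36 digits quoted above; for double precision (p = 53) the same formulas give 15 and 17.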
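The format is written with an implicit lead bit with value 1 unless the exponent is stored with all zeros. Thus only 112 bits of the significand appear in the memory format, but the total precision is 113 bits (approximately 34 decimal digits: log_{10}(2^{113}) ≈ 34.016). The bits are laid out, from most significant to least significant, as the sign bit (bit 127), the 15-bit exponent field (bits 126 to 112), and the 112-bit significand field (bits 111 to 0).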
Exponent encoding
The quadruple-precision binary floating-point exponent is encoded using an offset binary representation, with the zero offset being 16383; this is also known as exponent bias in the IEEE 754 standard.
 E_{min} = 0001_{16} − 3FFF_{16} = −16382
 E_{max} = 7FFE_{16} − 3FFF_{16} = 16383
 Exponent bias = 3FFF_{16} = 16383
Thus, as defined by the offset binary representation, in order to get the true exponent, the offset of 16383 has to be subtracted from the stored exponent.
The stored exponents 0000_{16} and 7FFF_{16} are interpreted specially.
Exponent | Significand zero | Significand non-zero | Equation
0000_{16} | 0, −0 | subnormal numbers | (−1)^{signbit} × 2^{−16382} × 0.significandbits_{2}
0001_{16}, ..., 7FFE_{16} | normalized value | normalized value | (−1)^{signbit} × 2^{exponentbits_{2} − 16383} × 1.significandbits_{2}
7FFF_{16} | ±∞ | NaN (quiet, signalling) | (none)
The minimum strictly positive (subnormal) value is 2^{−16494} ≈ 10^{−4965} and has a precision of only one bit. The minimum positive normal value is 2^{−16382} ≈ 3.3621 × 10^{−4932} and has a precision of 113 bits, i.e. ±2^{−16494} as well. The maximum representable value is 2^{16384} − 2^{16271} ≈ 1.1897 × 10^{4932}.
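The encoding rules above can be checked with a short decoder. This is an illustrative sketch rather than a reference implementation: the name decode_binary128 is hypothetical, the bit pattern is assumed to be given as a 128-bit Python integer, and finite values are returned as exact fractions (so the sign of zero is not preserved).

```python
from fractions import Fraction

BIAS = 16383  # exponent bias, 3FFF in hexadecimal

def decode_binary128(bits):
    """Decode a 128-bit integer bit pattern into an exact value."""
    sign = -1 if (bits >> 127) & 1 else 1
    exponent = (bits >> 112) & 0x7FFF       # 15 stored exponent bits
    fraction = bits & ((1 << 112) - 1)      # 112 stored significand bits
    if exponent == 0x7FFF:
        # All-ones exponent: infinity or NaN (returned as Python floats).
        return float("nan") if fraction else sign * float("inf")
    if exponent == 0:
        # Zero or subnormal: no implicit leading 1, exponent fixed at -16382.
        significand = Fraction(fraction, 1 << 112)
        true_exponent = 1 - BIAS
    else:
        # Normal number: implicit leading 1, bias subtracted.
        significand = 1 + Fraction(fraction, 1 << 112)
        true_exponent = exponent - BIAS
    return sign * significand * Fraction(2) ** true_exponent

smallest_subnormal = decode_binary128(1)
largest_normal = decode_binary128((0x7FFE << 112) | ((1 << 112) - 1))
print(smallest_subnormal == Fraction(1, 2 ** 16494))
print(largest_normal == 2 ** 16384 - 2 ** 16271)
```

Both comparisons print True, matching the minimum and maximum values stated above.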
Quadruple precision examples
These examples are given in bit representation, in hexadecimal, of the floating-point value. This includes the sign, (biased) exponent, and significand.
0000 0000 0000 0000 0000 0000 0000 0001_{16} = 2^{−16382} × 2^{−112} = 2^{−16494} ≈ 6.4751751194380251109244389582276465525 × 10^{−4966} (smallest positive subnormal number)
0000 ffff ffff ffff ffff ffff ffff ffff_{16} = 2^{−16382} × (1 − 2^{−112}) ≈ 3.3621031431120935062626778173217519551 × 10^{−4932} (largest subnormal number)
0001 0000 0000 0000 0000 0000 0000 0000_{16} = 2^{−16382} ≈ 3.3621031431120935062626778173217526026 × 10^{−4932} (smallest positive normal number)
7ffe ffff ffff ffff ffff ffff ffff ffff_{16} = 2^{16383} × (2 − 2^{−112}) ≈ 1.1897314953572317650857593266280070162 × 10^{4932} (largest normal number)
3ffe ffff ffff ffff ffff ffff ffff ffff_{16} = 1 − 2^{−113} ≈ 0.9999999999999999999999999999999999037 (largest number less than one)
3fff 0000 0000 0000 0000 0000 0000 0000_{16} = 1 (one)
3fff 0000 0000 0000 0000 0000 0000 0001_{16} = 1 + 2^{−112} ≈ 1.0000000000000000000000000000000001926 (smallest number larger than one)
c000 0000 0000 0000 0000 0000 0000 0000_{16} = −2
0000 0000 0000 0000 0000 0000 0000 0000_{16} = 0
8000 0000 0000 0000 0000 0000 0000 0000_{16} = −0
7fff 0000 0000 0000 0000 0000 0000 0000_{16} = infinity
ffff 0000 0000 0000 0000 0000 0000 0000_{16} = −infinity
4000 921f b544 42d1 8469 898c c517 01b8_{16} ≈ π
3ffd 5555 5555 5555 5555 5555 5555 5555_{16} ≈ 1/3
By default, 1/3 rounds down, as it does in double precision, because of the odd number of bits in the significand. The bits beyond the rounding point are therefore 0101..., which is less than 1/2 of a unit in the last place.
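The rounding of 1/3 can be reproduced with exact rational arithmetic. The sketch below uses a hypothetical helper, encode_binary128, that handles only positive values in the normal range and applies round-to-nearest, ties-to-even; it is an illustration, not a complete encoder.

```python
from fractions import Fraction

def encode_binary128(x):
    """Round a positive value in the normal range to a binary128 bit pattern."""
    x = Fraction(x)
    if x <= 0:
        raise ValueError("this sketch handles positive values only")
    # Find e such that 2**e <= x < 2**(e + 1).
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** e > x:
        e -= 1
    # Scale so that the integer part holds the 113-bit significand.
    scaled = x / Fraction(2) ** e * (1 << 112)
    q, r = divmod(scaled.numerator, scaled.denominator)
    # Round to nearest; on an exact tie, round to the even significand.
    if 2 * r > scaled.denominator or (2 * r == scaled.denominator and q & 1):
        q += 1
    if q == 1 << 113:  # rounding carried into the next power of two
        q >>= 1
        e += 1
    if not -16382 <= e <= 16383:
        raise ValueError("outside the normal range")
    # Biased exponent in the top bits, significand without its leading 1 below.
    return ((e + 16383) << 112) | (q - (1 << 112))

print(f"{encode_binary128(Fraction(1, 3)):032x}")
# prints 3ffd5555555555555555555555555555
```

For 1/3 the discarded remainder is one third of a unit in the last place, so the significand is rounded down and ends in ...5555.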
Double-double arithmetic
A common software technique to implement nearly quadruple precision using pairs of double-precision values is sometimes called double-double arithmetic.^{[4]}^{[5]}^{[6]} Using pairs of IEEE double-precision values with 53-bit significands, double-double arithmetic provides operations on numbers with significands of at least^{[4]} 2 × 53 = 106 bits (actually 107 bits^{[7]} except for some of the largest values, due to the limited exponent range), only slightly less precise than the 113-bit significand of IEEE binary128 quadruple precision. The range of a double-double remains essentially the same as the double-precision format because the exponent still has 11 bits,^{[4]} significantly lower than the 15-bit exponent of IEEE quadruple precision (a range of 1.8 × 10^{308} for double-double versus 1.2 × 10^{4932} for binary128).
In particular, a double-double/quadruple-precision value q in the double-double technique is represented implicitly as a sum q = x + y of two double-precision values x and y, each of which supplies half of q's significand.^{[5]} That is, the pair (x, y) is stored in place of q, and operations on q values (+, −, ×, ...) are transformed into equivalent (but more complicated) operations on the x and y values. Thus, arithmetic in this technique reduces to a sequence of double-precision operations; since double-precision arithmetic is commonly implemented in hardware, double-double arithmetic is typically substantially faster than more general arbitrary-precision arithmetic techniques.^{[4]}^{[5]}
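How an operation on q reduces to double-precision operations on the pair (x, y) can be sketched for addition. The code below is a simplified illustration built on the well-known error-free two-sum transformation; the function names are hypothetical, it is not the code of the cited libraries, and this simple variant does not guarantee full accuracy for every input.

```python
from fractions import Fraction

def two_sum(a, b):
    """Return (s, err) where s is the rounded sum and s + err == a + b exactly."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err

def quick_two_sum(a, b):
    """Same as two_sum, but assumes |a| >= |b|."""
    s = a + b
    return s, b - (s - a)

def dd_add(x, y):
    """Add two double-double values, each given as a (high, low) pair."""
    hi, lo = two_sum(x[0], y[0])   # exact sum of the high parts
    lo += x[1] + y[1]              # fold in the low parts
    return quick_two_sum(hi, lo)   # renormalise into a (high, low) pair

one = (1.0, 0.0)
tiny = (2.0 ** -80, 0.0)
total = dd_add(one, tiny)

print(1.0 + 2.0 ** -80 == 1.0)  # True: the small term is lost in double
print(Fraction(total[0]) + Fraction(total[1]) == 1 + Fraction(1, 2 ** 80))  # True
```

Python floats are IEEE double-precision values on common platforms, so the pair returned by dd_add holds 1 + 2^{−80} exactly even though a single double cannot.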
Note that double-double arithmetic has the following special characteristics:^{[8]}
 As the magnitude of the value decreases, the amount of extra precision also decreases. Therefore, the normalized range is narrower than that of double precision. The smallest number with full precision is 1000...0_{2} (106 zeros) × 2^{−1074}, or 1.000...0_{2} (106 zeros) × 2^{−968}. Numbers whose magnitude is smaller than 2^{−1021} will not have additional precision compared with double precision.
 The actual number of bits of precision can vary. In general, the magnitude of the low-order part of the number is no greater than half a ULP of the high-order part. If the low-order part is less than half a ULP of the high-order part, significant bits (either all 0s or all 1s) are implied between the significands of the high-order and low-order numbers. Certain algorithms that rely on having a fixed number of bits in the significand can fail when using 128-bit long double numbers.
 Because of the reason above, it is possible to represent values like 1 + 2^{−1074}, which is the smallest representable number greater than 1.
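The last two characteristics can be observed directly with a pair of Python floats, which are IEEE double-precision values on common platforms. This is only an illustrative sketch; the variable names are arbitrary, and math.ulp requires Python 3.9 or later.

```python
import math
from fractions import Fraction

hi = 1.0
lo = math.ldexp(1.0, -1074)  # 2**-1074, the smallest positive subnormal double

# In plain double precision the low part is absorbed by rounding.
print(hi + lo == 1.0)  # True

# As a double-double pair (hi, lo), the sum 1 + 2**-1074 is held exactly.
exact = Fraction(hi) + Fraction(lo)
print(exact == 1 + Fraction(1, 2 ** 1074))  # True

# The low part is no greater than half a ULP of the high part.
print(lo <= math.ulp(hi) / 2)  # True

# Zero bits are implied between the two parts, so the effective
# significand of this particular value is far wider than 106 bits.
print(exact.numerator.bit_length())  # 1075
```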
In addition to the double-double arithmetic, it is also possible to generate triple-double or quad-double arithmetic if higher precision is required without any higher-precision floating-point library. They are represented as a sum of three (or four) double-precision values respectively. They can represent operations with at least 159/161 and 212/215 bits respectively.
A similar technique can be used to produce a double-quad arithmetic, which is represented as a sum of two quadruple-precision values. They can represent operations with at least 226 (or 227) bits.^{[9]}
Implementations
Quadruple precision is often implemented in software by a variety of techniques (such as the double-double technique above, although that technique does not implement IEEE quadruple precision), since direct hardware support for quadruple precision is, as of 2016, less common (see "Hardware support" below). One can use general arbitrary-precision arithmetic libraries to obtain quadruple (or higher) precision, but specialized quadruple-precision implementations may achieve higher performance.
Computer-language support
A separate question is the extent to which quadruple-precision types are directly incorporated into computer programming languages.
Quadruple precision is specified in Fortran by real(real128) (the module iso_fortran_env from Fortran 2008 must be used; the constant real128 is equal to 16 on most processors), or as real(selected_real_kind(33, 4931)), or in a non-standard way as REAL*16. (Quadruple-precision REAL*16 is supported by the Intel Fortran Compiler^{[10]} and by the GNU Fortran compiler^{[11]} on x86, x86-64, and Itanium architectures, for example.)
For the C programming language, ISO/IEC TS 18661-3 (floating-point extensions for C, interchange and extended types) specifies _Float128 as the type implementing the IEEE 754 quadruple-precision format (binary128).^{[12]} Alternatively, in C/C++ with a few systems and compilers, quadruple precision may be specified by the long double type, but this is not required by the language (which only requires long double to be at least as precise as double), nor is it common.
On x86 and x86-64, the most common C/C++ compilers implement long double as either 80-bit extended precision (e.g. the GNU C Compiler gcc^{[13]} and the Intel C++ Compiler with a /Qlong‑double switch^{[14]}) or simply as being synonymous with double precision (e.g. Microsoft Visual C++^{[15]}), rather than as quadruple precision. The procedure call standard for the ARM 64-bit architecture (AArch64) specifies that long double corresponds to the IEEE 754 quadruple-precision format.^{[16]} On a few other architectures, some C/C++ compilers implement long double as quadruple precision, e.g. gcc on PowerPC (as double-double^{[17]}^{[18]}^{[19]}) and SPARC,^{[20]} or the Sun Studio compilers on SPARC.^{[21]} Even if long double is not quadruple precision, however, some C/C++ compilers provide a non-standard quadruple-precision type as an extension. For example, gcc provides a quadruple-precision type called __float128 for x86, x86-64 and Itanium CPUs,^{[22]} and on PowerPC as IEEE 128-bit floating-point using the -mfloat128-hardware or -mfloat128 options;^{[23]} and some versions of Intel's C/C++ compiler for x86 and x86-64 supply a non-standard quadruple-precision type called _Quad.^{[24]}
Zig provides support for it with its f128 type.^{[25]}
Google's work-in-progress language Carbon provides support for it with the type called 'f128'.^{[26]}
Libraries and toolboxes
 The GCC quad-precision math library, libquadmath, provides __float128 and __complex128 operations.
 The Boost multiprecision library Boost.Multiprecision provides a unified cross-platform C++ interface for __float128 and _Quad types, and includes a custom implementation of the standard math library.^{[27]}
 The Multiprecision Computing Toolbox for MATLAB allows quadruple-precision computations in MATLAB. It includes basic arithmetic functionality as well as numerical methods, dense and sparse linear algebra.^{[28]}
 The DoubleFloats^{[29]} package provides support for double-double computations for the Julia programming language.
 The doubledouble.py^{[30]} library enables double-double computations in Python.^{[citation needed]}
 Mathematica supports IEEE quad-precision numbers: 128-bit floating-point values (Real128), and 256-bit complex values (Complex256).^{[citation needed]}
Hardware support
IEEE quadruple precision was added to the IBM System/390 G5 in 1998,^{[31]} and is supported in hardware in subsequent z/Architecture processors.^{[32]}^{[33]} The IBM POWER9 CPU (Power ISA 3.0) has native 128-bit hardware support.^{[23]}
Native support of IEEE 128-bit floats is defined in PA-RISC 1.0,^{[34]} and in SPARC V8^{[35]} and V9^{[36]} architectures (e.g. there are 16 quad-precision registers %q0, %q4, ...), but no SPARC CPU implements quad-precision operations in hardware as of 2004.^{[37]}
Non-IEEE extended-precision (128 bits of storage, 1 sign bit, 7 exponent bits, 112 fraction bits, 8 bits unused) was added to the IBM System/370 series (1970s–1980s) and was available on some System/360 models in the 1960s (System/360-85,^{[38]} -195, and others by special request or simulated by OS software).
The Siemens 7.700 and 7.500 series mainframes and their successors support the same floating-point formats and instructions as the IBM System/360 and System/370.
The VAX processor implemented non-IEEE quadruple-precision floating point as its "H Floating-point" format. It had one sign bit, a 15-bit exponent and 112 fraction bits; however, the layout in memory was significantly different from IEEE quadruple precision and the exponent bias also differed. Only a few of the earliest VAX processors implemented H Floating-point instructions in hardware; all the others emulated H Floating-point in software.
The NEC Vector Engine architecture supports adding, subtracting, multiplying and comparing 128-bit binary IEEE 754 quadruple-precision numbers.^{[39]} Two neighboring 64-bit registers are used. Quadruple-precision arithmetic is not supported in the vector register.^{[40]}
The RISC-V architecture specifies a "Q" (quad-precision) extension for 128-bit binary IEEE 754-2008 floating-point arithmetic.^{[41]} The "L" extension (not yet certified) will specify 64-bit and 128-bit decimal floating point.^{[42]}
Quadruple-precision (128-bit) hardware implementation should not be confused with "128-bit FPUs" that implement SIMD instructions, such as Streaming SIMD Extensions or AltiVec, which refers to 128-bit vectors of four 32-bit single-precision or two 64-bit double-precision values that are operated on simultaneously.
See also
 IEEE 754, IEEE standard for floating-point arithmetic
 ISO/IEC 10967, Language independent arithmetic
 Primitive data type
 Q notation (scientific notation)
References
 1. David H. Bailey; Jonathan M. Borwein (July 6, 2009). "High-Precision Computation and Mathematical Physics" (PDF).
 2. Higham, Nicholas (2002). "Designing stable algorithms" in Accuracy and Stability of Numerical Algorithms (2 ed). SIAM. p. 43.
 3. William Kahan (1 October 1987). "Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic" (PDF).
 4. Yozo Hida, X. Li, and D. H. Bailey, Quad-Double Arithmetic: Algorithms, Implementation, and Application, Lawrence Berkeley National Laboratory Technical Report LBNL-46996 (2000). Also Y. Hida et al., Library for double-double and quad-double arithmetic (2007).
 5. J. R. Shewchuk, Adaptive Precision Floating-Point Arithmetic and Fast Robust Geometric Predicates, Discrete & Computational Geometry 18:305–363, 1997.
 6. Knuth, D. E. The Art of Computer Programming (2nd ed.). chapter 4.2.3. problem 9.
 7. Robert Munafo. F107 and F161 High-Precision Floating-Point Data Types (2011).
 8. 128-Bit Long Double Floating-Point Data Type
 9. sourceware.org Re: The state of glibc libm
 10. "Intel Fortran Compiler Product Brief (archived copy on web.archive.org)" (PDF). Su. Archived from the original on October 25, 2008. Retrieved 2010-01-23.
 11. "GCC 4.6 Release Series - Changes, New Features, and Fixes". Retrieved 2010-02-06.
 12. "ISO/IEC TS 18661-3" (PDF). 2015-06-10. Retrieved 2019-09-22.
 13. i386 and x86-64 Options (archived copy on web.archive.org), Using the GNU Compiler Collection.
 14. Intel Developer Site
 15. MSDN homepage, about Visual C++ compiler
 16. "Procedure Call Standard for the ARM 64-bit Architecture (AArch64)" (PDF). 2013-05-22. Archived from the original (PDF) on 2019-10-16. Retrieved 2019-09-22.
 17. RS/6000 and PowerPC Options, Using the GNU Compiler Collection.
 18. Inside Macintosh - PowerPC Numerics. Archived October 9, 2012, at the Wayback Machine.
 19. 128-bit long double support routines for Darwin
 20. SPARC Options, Using the GNU Compiler Collection.
 21. The Math Libraries, Sun Studio 11 Numerical Computation Guide (2005).
 22. Additional Floating Types, Using the GNU Compiler Collection
 23. "GCC 6 Release Series - Changes, New Features, and Fixes". Retrieved 2016-09-13.
 24. Intel C++ Forums (2007).
 25. "Floats". ziglang.org. Retrieved 7 January 2024.
 26. "Carbon Language's main repository - Language design". GitHub. 2022-08-09. Retrieved 2022-09-22.
 27. "Boost.Multiprecision - float128". Retrieved 2015-06-22.
 28. Pavel Holoborodko (2013-01-20). "Fast Quadruple Precision Computations in MATLAB". Retrieved 2015-06-22.
 29. "DoubleFloats.jl". GitHub.
 30. "doubledouble.py". GitHub.
 31. Schwarz, E. M.; Krygowski, C. A. (September 1999). "The S/390 G5 floating-point unit". IBM Journal of Research and Development. 43 (5/6): 707–721. CiteSeerX 10.1.1.117.6711. doi:10.1147/rd.435.0707.
 32. Gerwig, G.; Wetter, H.; Schwarz, E. M.; Haess, J.; Krygowski, C. A.; Fleischer, B. M.; Kroener, M. (May 2004). "The IBM eServer z990 floating-point unit". IBM J. Res. Dev. 48; pp. 311–322.
 33. Eric Schwarz (June 22, 2015). "The IBM z13 SIMD Accelerators for Integer, String, and Floating-Point" (PDF). Retrieved July 13, 2015.
 34. "Implementor support for the binary interchange formats". grouper.ieee.org. Archived from the original on 2017-10-27. Retrieved 2021-07-15.
 35. The SPARC Architecture Manual: Version 8 (archived copy on web.archive.org) (PDF). SPARC International, Inc. 1992. Archived from the original (PDF) on 2005-02-04. Retrieved 2011-09-24.
     "SPARC is an instruction set architecture (ISA) with 32-bit integer and 32-, 64-, and 128-bit IEEE Standard 754 floating-point as its principal data types."
 36. David L. Weaver; Tom Germond, eds. (1994). The SPARC Architecture Manual: Version 9 (archived copy on web.archive.org) (PDF). SPARC International, Inc. Archived from the original (PDF) on 2012-01-18. Retrieved 2011-09-24.
     "Floating-point: The architecture provides an IEEE 754-compatible floating-point instruction set, operating on a separate register file that provides 32 single-precision (32-bit), 32 double-precision (64-bit), 16 quad-precision (128-bit) registers, or a mixture thereof."
 37. "SPARC Behavior and Implementation". Numerical Computation Guide — Sun Studio 10. Sun Microsystems, Inc. 2004. Retrieved 2011-09-24.
     "There are four situations, however, when the hardware will not successfully complete a floating-point instruction: ... The instruction is not implemented by the hardware (such as ... quad-precision instructions on any SPARC FPU)."
 38. Padegs A (1968). "Structural aspects of the System/360 Model 85, III: Extensions to floating-point architecture". IBM Systems Journal. 7: 22–29. doi:10.1147/sj.71.0022.
 39. Vector Engine Assembly-Language Reference Manual, Chapter 4 Assembler Syntax, page 23.
 40. SX-Aurora TSUBASA Architecture Guide Revision 1.1 (p. 38, 60).
 41. RISC-V ISA Specification v. 20191213, Chapter 13, “Q” Standard Extension for Quad-Precision Floating-Point, page 79.
 42. [1] Chapter 15 (p. 95).
External links
 High-Precision Software Directory
 QPFloat, a free software (GPL) software library for quadruple-precision arithmetic
 HPAlib, a free software (LGPL) software library for quad-precision arithmetic
 libquadmath, the GCC quad-precision math library
 IEEE-754 Analysis, Interactive web page for examining Binary32, Binary64, and Binary128 floating-point values