Page 1 of 1

### Median estimator

Posted: Thu Aug 20, 2020 8:28 pm
History
Calculating the median of a numerical sequence is a complex problem involving :
• the storage of all the values
• sorting the values
So it is easily understood that in the 1970th, there was no enough storage space into the first calculators to record all the values of a sequence. Therefore, between Median and Mean the choice was obvious : the first calculators offered a Σ+ key for the mean, and the median has been missing from the handled calculators for several decades.

Since then, many attempt have been done in order to produce a continuous estimation of the median. The deal is :
1. An acceptable estimation of the median
2. A minimal storage place
3. On the fly calculation
4. Without prior knowledge of the amount of values
Literature:
Among a large literature relevant to this field, I choose an idea presented by M.Hofri (see http://algo.inria.fr/seminars/sem05-06/hofri-slides.pdf and https://web.cs.wpi.edu/~hofri/medsel.pdf)

In this paper, the algorithm is written for a quantity of data equal to a power of 3. But I adapt it for any number of values.

In a general case, (ie : the size of a cluster is NOT an odd number) the median is calculated by a weighted mean of each median stored at each cluster.
In this version of the Hofri algorithm, any clusters size is allowed (the size of the clusters is given at initialisation).

Algorithm:
1. Each entered value is "pushed" (LBL"PushV") inside the first stage of an incremental structure (the matrix MAT, growing on demand).
2. Into each stage, the values are sorted using an insertion-sorting algorithm to maintain a growing order (LBL"sort").
3. When a stage is full, the relevant median is "pushed" up to the next upper stage (LBL02) and so on (recursive call to PushV). Previously, the current stage is freed for the nextvalues (local nr of data canceled) . To be noted that the first number of each stage (each row ofthe MAT matrix) contains the amount of values stored in the stage (0,1,2,3 in case of a 3-dim clusters
Median
The median (LBL"med") which is pushed from one stage to another, is calculated as classical : the central value of the stage or the mean of the two central values.
But here (that differs from Hofri) the median which is displayed after each entry is the "global median"(LBL14). This global median is calculated involving the upmost median but weighted with the remaining data from the lowest stages, each with its respective weigth (1 for the first stage,3 for the second, 9 the third...In case of clusters of size 3) (see LBL15 and following)

Registers
• R00: (LBL PushV) current stage nr (first stage=1)
• R01: (LBL Sort) local counter for insertion algo or (LBL Med) ∑j pi xj weighted values of stage i
• R02: (LBL Med) ∑pj
• R03: (LBL Med) current weight of stage i : pi=ni−1 where n is the cluster’s size (eg : 3)
• R04: (LBL Med) current cluster’s size
• R05: (LBL Med) current stage
So I guess that the size of the MAT matrix is growing with the log of the nr of values

Usage
With XEQ MEDIAN, there is a menu :
• DIM (size of the clusters --- ) to be called first. This erase the MAT matrix and define the size of the clusters (usually 3)
• Σ+ (x --- ) to enter successively the data values. The median estimation is displayed at top
Warning
This code produces an estimation of the mean of a sequence of data. In my experience, I felt that this estimation is not so bad.
I loved writting this program, thanks to the extraodinary capacities of the DM42 !

Feedback and advice are welcome !

Thank you.
Michel

Code: Select all

``````00 { 369-Byte Prgm }
01▸LBL "MEDIAN"
02 "DIM"
03 KEY 1 GTO 00
04 "Σ+"
05 KEY 2 GTO 01
07▸LBL 30
08 STOP
09 GTO 30
10▸LBL 00
11 1
12 STO+ ST Y
13 X<>Y
14 NEWMAT
15 STO "MAT"
16 RTN
17▸LBL 01
18 INDEX "MAT"
19 GROW
20 1
21 XEQ "PushV"
22 RTN
23▸LBL "PushV"
24 STO 00
25 1
26 STOIJ
27 R↓
28 RCLEL
29 1
30 +
31 STOEL
32 1
33 +
34 STOIJ
35 RCL ST Z
36 →
37 R↓
38 STOIJ
39 XEQ "sort"
40 RCL 00
41 1
42 STOIJ
43 RCLEL
44 1
45 +
46 X<>Y
47 R↓
48 STOIJ
49 J+
50 FS? 76
51 GTO 02
52 RCL "MAT"
53 DIM?
54 R↓
55 SF 00
56 XEQ "med"
57 VIEW ST X
58 RTN
59▸LBL 02
60 RCL 00
61 CF 00
62 XEQ "med"
63 RCL 00
64 1
65 STOIJ
66 0
67 STOEL
68 R↓
69 R↓
70 1
71 +
72 XEQ "PushV"
73 RTN
74▸LBL "sort"
75 1
76 -
77 STO 01
78▸LBL 11
79 DSE 01
80 GTO 10
81 RTN
82▸LBL 10
83 RCLEL
84 J-
85 RCLEL
86 X≤Y?
87 RTN
88 X<>Y
89 →
90 R↓
91 STOEL
92 J-
93 GTO 11
94▸LBL "med"
95 FS? 00
96 GTO 14
97 ENTER
98 1
99 STOIJ
100 R↓
101 RCLEL
102 1
103 +
104 2
105 ÷
106 1
107 +
108 IP
109 STOIJ
110 RCLEL
111 LASTX
112 FP
113 X=0?
114 GTO 13
115 R↓
116 J+
117 RCLEL
118 +
119 2
120 ÷
121 RTN
122▸LBL 13
123 R↓
124 RTN
125▸LBL 14
126 0
127 STO 01
128 STO 02
129 1
130 STO 03
131 RCL "MAT"
132 DIM?
133 1
134 -
135 STO 04
136 R↓
137 STO 05
138 1
139 1
140 STOIJ
141▸LBL 15
142 RCLEL
143 X=0?
144 GTO 16
145 1
146 +
147 J+
148 XEQ 18
149▸LBL 16
150 RCL 04
151 STO× 03
152 RCLIJ
153 R↓
154 1
155 +
156 RCL 05
157 X≥Y?
158 GTO 17
159 RCL 01
160 RCL 02
161 ÷
162 RTN
163▸LBL 17
164 R↓
165 1
166 STOIJ
167 GTO 15
168▸LBL 18
169 RCLEL
170 RCL 03
171 STO+ 02
172 ×
173 STO+ 01
174 R↓
175 J+
176 FC? 76
177 GTO 19
178 J-
179 RTN
180▸LBL 19
181 RCLIJ
182 X<>Y
183 R↓
184 X>Y?
185 RTN
186 R↓
187 GTO 18
188 END``````

### Re: Median estimator

Posted: Fri Aug 21, 2020 10:01 am
If the data can be stored, the median (or any quantile) can be computed in linear time.

A long time ago, I looked into estimation of the mode. This is a long researched subject, the only reference I've to hand is: T. Dalenius: The Mode - A Neglected Statistical Parameter, Journal of the Royal Statistical Society Series A, Vol 128, pp. 110-117, 1965.

Pauli

### Re: Median estimator

Posted: Fri Aug 21, 2020 10:40 am
Thanks Pauli,
Agree that's an interesting research subject.
T. Dalenius: The Mode - A Neglected Statistical Parameter, Journal of the Royal Statistical Society Series A, Vol 128, pp. 110-117, 1965.
Hum, I don't have access to this paper maybe you can post those pages here ?

Michel

### Re: Median estimator

Posted: Fri Aug 21, 2020 11:10 am
I don't have the article anymore, I've only got the reference from my thesis.

Pauli