w.f.j.mueller@gsi.de [hercules-390]
2017-12-21 15:01:20 UTC
Hi,
I've written yet another 370 instruction timing benchmark, called
perf_asm for historical reasons. An early version was posted on
turnkey-mvs in the thread 'perf_asm - V0.1 of a 370 instruction timing tester'
https://groups.yahoo.com/neo/groups/turnkey-mvs/conversations/topics/10720
A much improved version is now in the files section under
https://groups.yahoo.com/neo/groups/hercules-390/files/perf_asm_v0.9.zip
The up-to-date version is in a larger project on GitHub
https://github.com/wfjm/mvs38j-langtest
in folder 'tests'.
I've run this benchmark under tk4- and the Hercules version that comes with
it, on a 4-core Xeon system, with either one or two emulated 370 CPUs:
NUMCPU=1 MAXCPU=1 ./mvs
NUMCPU=2 MAXCPU=2 ./mvs
When looking at the instruction timing for CS, CDS and TS I get the
following instruction times (in ns):
                            1 CPU     2 CPU
  LR;CS  R,R,m (eq,eq) :    16.31     39.05
  LR;CS  R,R,m (eq,ne) :    16.27     39.12
  LR;CS  R,R,m (ne)    :    17.26    178.74   <-- slow
  LR;CDS R,R,m (eq,eq) :    17.40     41.53
  LR;CDS R,R,m (eq,ne) :    17.46     41.52
  LR;CDS R,R,m (ne)    :    19.96    180.38   <-- slow
  MVI;TS m     (zero)  :    22.19     42.41
  MVI;TS m     (ones)  :    25.82    182.59   <-- slow
That the multi-CPU emulation is slower for memory-interlocked instructions
is easy to understand: apparently a mutex is used only in the SMP case.
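To put numbers on "slower", here is a small C sketch that computes the
2 CPU / 1 CPU slowdown factor per table row. The struct and function names
are made up for this illustration and are not part of perf_asm; the values
are simply copied from the table.

```c
#include <stddef.h>

/* One row of the timing table; instruction times in ns. */
struct timing {
    const char *label;   /* test case, as named in the table */
    double ns_1cpu;      /* time with NUMCPU=1 */
    double ns_2cpu;      /* time with NUMCPU=2 */
};

/* Values copied from the table in this post. */
static const struct timing timings[] = {
    { "LR;CS  (eq,eq)", 16.31,  39.05 },
    { "LR;CS  (eq,ne)", 16.27,  39.12 },
    { "LR;CS  (ne)",    17.26, 178.74 },
    { "LR;CDS (eq,eq)", 17.40,  41.53 },
    { "LR;CDS (eq,ne)", 17.46,  41.52 },
    { "LR;CDS (ne)",    19.96, 180.38 },
    { "MVI;TS (zero)",  22.19,  42.41 },
    { "MVI;TS (ones)",  25.82, 182.59 },
};

static const size_t n_timings = sizeof timings / sizeof timings[0];

/* Slowdown factor when going from one to two emulated CPUs. */
static double slowdown(const struct timing *t)
{
    return t->ns_2cpu / t->ns_1cpu;
}
```

Evaluating slowdown() for each row gives a factor of roughly 1.9 to 2.4 for
the unmarked cases, but roughly 7 to 10 for the three cases marked slow.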
What surprised me is that, with two CPUs,
  CS and CDS are much slower in the 'ne' case
  TS is much slower in the 'memory is already ones' case
Take TS as the simpler case. TS is critical only when the memory location is
zero; in that case an interlocked access sequence must be done. But in
Hercules this is the faster case, while the seemingly easy case, where the
memory location is already all ones, is the slow one.
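For reference, here is a minimal sketch of the architectural semantics of
TS and CS as I understand them, written with C11 atomics. This is not the
Hercules implementation, which I have not checked; the function names
ts_sketch and cs_sketch are hypothetical, and mapping the 370 interlock
onto host atomics is an assumption made only for illustration.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * TS m: the condition code reflects the leftmost bit of the byte,
 * then the byte is set to all ones. Returns the condition code:
 * 0 if the leftmost bit was zero, 1 if it was one.
 */
static int ts_sketch(_Atomic uint8_t *byte)
{
    uint8_t old = atomic_exchange(byte, 0xFF);
    return (old & 0x80) ? 1 : 0;
}

/*
 * CS R1,R3,m: compare R1 with the word in memory.
 * Equal:   store R3 into memory, condition code 0.
 * Unequal: load the memory word into R1, condition code 1.
 * atomic_compare_exchange_strong has exactly this shape: on failure
 * it writes the current memory value back into *r1.
 */
static int cs_sketch(uint32_t *r1, uint32_t r3, _Atomic uint32_t *mem)
{
    return atomic_compare_exchange_strong(mem, r1, r3) ? 0 : 1;
}
```

In this sketch both outcomes of each instruction go through a single atomic
operation, so nothing in the semantics themselves suggests that the 'ones'
or 'ne' outcome should be the expensive one.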
Any help/hint on understanding this (at least to me surprising) behavior
is very much welcome.
With best regards, Walter