[hercules-390] Timing of CS, CSD and TS instructions in Hercules

Discussion:

w.f.j.mueller@gsi.de [hercules-390]

2017-12-21 15:01:20 UTC

Hi,

I've written yet another 370 instruction time benchmark, called for
historical reasons perf_asm. An early version was posted on
turnkey-mvs as thread 'perf_asm - V0.1 of a 370 instruction timing tester'

https://groups.yahoo.com/neo/groups/turnkey-mvs/conversations/topics/10720

A much improved version is now in the files section under

https://groups.yahoo.com/neo/groups/hercules-390/files/perf_asm_v0.9.zip

The up-to-date version is in a larger project on GitHub

https://github.com/wfjm/mvs38j-langtest

in folder 'tests'.

I've run this benchmark under tk4- and the Hercules version coming with it
on a 4 core XEON system with either one or two 370 CPUs

NUMCPU=1 MAXCPU=1 ./mvs
NUMCPU=2 MAXCPU=2 ./mvs

When looking at the instruction timing for CS,CDS and TS I get the
following instruction times (in ns)

                           1 CPU     2 CPU
LR;CS  R,R,m (eq,eq) :     16.31     39.05
LR;CS  R,R,m (eq,ne) :     16.27     39.12
LR;CS  R,R,m (ne)    :     17.26    178.74  <-- slow
LR;CDS R,R,m (eq,eq) :     17.40     41.53
LR;CDS R,R,m (eq,ne) :     17.46     41.52
LR;CDS R,R,m (ne)    :     19.96    180.38  <-- slow
MVI;TS m     (zero)  :     22.19     42.41
MVI;TS m     (ones)  :     25.82    182.59  <-- slow

That the multi-CPU emulation is slower for memory-interlocked instructions
is easy to understand, apparently only in the SMP case a mutex is used.

What surprised me is that

CS,CDS is much slower in the 'ne' case
TS is much slower in the 'memory is already ones' case

Take TS as simpler case. TS is critical only when the memory location is
zero, in that case an interlocked access sequence must be done. But in
Hercules this is the faster case, the seemingly easy case when the memory
location is all ones is the slow one.
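
For reference, a minimal sketch of what TS does architecturally, assuming C11 atomics on the host. The name test_and_set is hypothetical and not taken from the Hercules sources. The byte is always set to all ones, and the condition code reflects the leftmost bit of the old value, so both the 'zero' and the 'ones' case go through the same interlocked update:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not Hercules code): emulate TEST AND SET on one byte.
 * The byte is unconditionally set to all ones. The returned condition code
 * is the leftmost bit of the OLD value: 0 = byte was free, 1 = already set. */
static int test_and_set(_Atomic uint8_t *byte)
{
    assert(byte != NULL);
    /* Fetch the old value and store 0xFF as one interlocked update. */
    uint8_t old = atomic_exchange(byte, 0xFF);
    return (old & 0x80) ? 1 : 0;
}
```

A caller that gets condition code 0 has acquired the flag; condition code 1 means someone else already holds it and the caller will typically retry.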

Any help/hint on understanding this (at least to me surprising) behavior
is very much welcome.

With best regards, Walter

Jon Perryman jperryma@pacbell.net [hercules-390]

2017-12-21 17:00:26 UTC

As for CS and CDS being much slower when NE, I'm guessing there is a short delay in the code. These instructions can be used as spin locks, and spin locks often sit in time-sensitive code, so the short delay may offset the overhead in the non-spin-lock situations.
As for TS at location 0, your thinking is backwards. Page 0 is unique to each processor and probably considered SMP in the code.
Regards, Jon.

On Thursday, December 21, 2017 7:03 AM, "***@gsi.de [hercules-390]" <hercules-***@yahoogroups.com> wrote:

What surprised me is that

CS,CDS is much slower in the 'ne' case
TS is much slower in the 'memory is already ones' case

Take TS as simpler case. TS is critical only when the memory location is
zero, in that case an interlocked access sequence must be done. But in
Hercules this is the faster case, the seemingly easy case when the memory
location is all ones is the slow one.

w.f.j.mueller@gsi.de [hercules-390]

2017-12-21 18:47:34 UTC

Hi Jon,

I cloned the spinhawk sources and had a look at them. It's a bit as you suggested: in the code for CS, CDS, and TS you find the following (stripped to the essence):

    if (regs->psw.cc == 1) {
        ...
        if (sysblk.cpus > 1) sched_yield();
    }

So in the case where a retry is likely and one has an SMP system,

sched_yield()

is called, which causes the calling thread to relinquish the CPU. What one sees as extra time is exactly this system call. In a true lock contention case this ensures that threads can compete in the most efficient manner.
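
The pattern can be sketched as a small standalone function. This is an illustration under assumptions, not the Hercules implementation: compare_and_swap and configured_cpus are hypothetical names (the latter stands in for sysblk.cpus), host C11 atomics replace the real storage access path, and guest byte order is ignored.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for sysblk.cpus (number of configured guest CPUs). */
static int configured_cpus = 2;

/* Emulate COMPARE AND SWAP on a fullword.
 * If *mem equals *r1, r3 is stored and cc 0 is returned.
 * Otherwise the current storage value is loaded into *r1 and cc 1 is returned. */
static int compare_and_swap(uint32_t *r1, uint32_t r3, _Atomic uint32_t *mem)
{
    assert(r1 != NULL && mem != NULL);
    /* On failure, atomic_compare_exchange_strong writes the current value to *r1. */
    int cc = atomic_compare_exchange_strong(mem, r1, r3) ? 0 : 1;

    /* Mismatch on an SMP configuration: give up the host CPU before the retry. */
    if (cc == 1 && configured_cpus > 1)
        sched_yield();

    return cc;
}
```

Only the cc 1 path pays for the sched_yield() system call, which matches the pattern in the table: the 'eq' cases stay fast, the 'ne' cases do not.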

On your remark on 'TS at location 0': perf_asm is a normal user program, so the variables are in normal pages, never in page 0. The sched_yield() really explains all.
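
As a rough consistency check, the 2-CPU figures from the first post can be subtracted pairwise. The sketch below only restates the measured numbers; the struct and function names are made up for illustration. The slow-path penalty comes out at about 139.69 ns for CS, 138.85 ns for CDS and 140.18 ns for TS, so nearly the same fixed cost in all three cases, as one would expect from a single extra system call.

```c
#include <assert.h>
#include <stddef.h>

/* 2-CPU timings in ns, copied from the table in the first post. */
struct timing {
    const char *insn;
    double fast_ns;   /* CS/CDS (eq,eq), TS (zero) */
    double slow_ns;   /* CS/CDS (ne),    TS (ones) */
};

static const struct timing two_cpu[] = {
    { "CS",  39.05, 178.74 },
    { "CDS", 41.53, 180.38 },
    { "TS",  42.41, 182.59 },
};

/* Extra time of the slow case over the fast case for table entry i. */
static double slow_path_penalty_ns(size_t i)
{
    assert(i < sizeof two_cpu / sizeof two_cpu[0]);
    return two_cpu[i].slow_ns - two_cpu[i].fast_ns;
}
```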

Thanks and with best regards, Walter

Tony Harminc tharminc@gmail.com [hercules-390]

2017-12-22 05:59:45 UTC

Post by Jon Perryman ***@pacbell.net [hercules-390]
As for TS at location 0, your thinking is backwards. Page 0 is unique to
each processor and probably considered SMP in the code.

I think you misinterpreted Walter's statement "TS is critical only when the
memory location is zero, in that case an interlocked access sequence must
be done." He was surely not talking about address 0, but rather the content
of the byte being TS'd.

Tony H.

'Mark L. Gaubatz' mgaubatz@groupgw.com [hercules-390]

2017-12-22 07:56:03 UTC

Walter, Jon, and Tony:

Page Zero is only "unique" as assigned by one's thought patterns, and
the presumed observations of storage operation. The reality is that it
is still shared storage, and there are indeed software routines that
cross-update the page zeroes for each processor using CS/CDS to ensure
atomicity. What is really done, and is documented in the Principles of
Operation, is that the specified page is swapped for Page Zero, but it
is still addressable at its absolute address by all other processors.
For the specified page, it is addressed as real Page Zero, while
absolute Page Zero is addressable at the specified page address.
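
The swap described above can be sketched as a real-to-absolute address mapping. This is an illustration only: real_to_absolute is a hypothetical name, and a 4K prefix block is assumed, as in System/370 (z/Architecture uses an 8K prefix area).

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE   0x1000u               /* assumed 4K prefix block */
#define OFFSET_MASK  (BLOCK_SIZE - 1u)

/* Map a real address to an absolute address for a CPU with the given
 * prefix (the block-aligned absolute address of its prefix block).
 *   real block 0        -> the CPU's prefix block
 *   real prefix block   -> absolute block 0
 *   everything else     -> unchanged                               */
static uint32_t real_to_absolute(uint32_t real, uint32_t prefix)
{
    assert((prefix & OFFSET_MASK) == 0);   /* prefix must be block aligned */
    uint32_t block = real & ~OFFSET_MASK;

    if (block == 0)
        return prefix | (real & OFFSET_MASK);
    if (block == prefix)
        return real & OFFSET_MASK;
    return real;
}
```

With a prefix of 0x5000, real address 0x10 maps to absolute 0x5010 and real 0x5010 maps to absolute 0x10, while any other CPU can still reach the same storage through its own mapping.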

With regard to performance, CS/CDS/TS operations are incorrectly locked
multiple times: the main storage lock is obtained, the operation is then
performed atomically, and the main storage lock is then freed. It is
the redundant use of the main storage lock that is the root of any
observed performance issues; the performance of the actual atomic
operation is dependent upon the real underlying hardware. Should locking
problems be observed, then the actual error(s) are in the Hercules
defined cmpxchg function definitions for the hardware in use.
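
The two variants being contrasted can be sketched as follows. All names here (mainstor_lock, cs_locked, cs_lockfree) are hypothetical and the code is not taken from Hercules; it assumes POSIX mutexes and C11 atomics on the host.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical global lock standing in for the main storage lock. */
static pthread_mutex_t mainstor_lock = PTHREAD_MUTEX_INITIALIZER;

/* Variant 1: the atomic compare-exchange is additionally wrapped in a mutex. */
static int cs_locked(uint32_t *old, uint32_t new_val, _Atomic uint32_t *mem)
{
    assert(old != NULL && mem != NULL);
    pthread_mutex_lock(&mainstor_lock);
    int cc = atomic_compare_exchange_strong(mem, old, new_val) ? 0 : 1;
    pthread_mutex_unlock(&mainstor_lock);
    return cc;
}

/* Variant 2: rely on the host's atomic compare-exchange alone. */
static int cs_lockfree(uint32_t *old, uint32_t new_val, _Atomic uint32_t *mem)
{
    assert(old != NULL && mem != NULL);
    return atomic_compare_exchange_strong(mem, old, new_val) ? 0 : 1;
}
```

Both variants return the same condition code for the same inputs; the difference is only the extra lock and unlock around an operation that is already atomic. Whether dropping the lock is safe depends on how every other storage access in the emulator is synchronized, which this sketch does not address.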

The sched_yield call is poor construction, and was kept for those who
insist on running multiengine Hercules instances on single core/thread
machines, or running more Hercules engines than processor cores (for
example, myself, and to be outlandish, 64 engines on a 1-core/1-thread
processor at times). My personal copy of Hercules runs well without the
main storage locks and sched_yield calls for atomic operations, and I
run this copy with the number of Hercules engines limited to no more
than half of the available processor threads (preferably fewer, if heavy
I/O operations).

Mark L. Gaubatz
dasdman


Tony Harminc tharminc@gmail.com [hercules-390]

2017-12-23 05:13:39 UTC

Page Zero is only "unique" as assigned by one's thought patterns, and the
presumed observations of storage operation. The reality is that it is still
shared storage, and there are indeed software routines that cross-update
the page zeroes for each processor using CS/CDS to ensure atomicity. What
is really done, and is documented in the Principles of Operation, is that
the specified page is swapped for Page Zero, but it is still addressable at
its absolute address by all other processors. For the specified page, it
is addressed as real Page Zero, while absolute Page Zero is addressable at
the specified page address.

Sure - but it's all irrelevant to Walter's original post which, unless I am
very much mistaken, is talking about the content of the single byte target
of TS, and has nothing to do with page zero and prefixing.

Tony H.