*** Off List ***
Thank you for taking the time to begin the formalization of instruction timing measurements.
As a prior specialist in this area, as well as a hardware and firmware
architect, PLEASE update the test to use the MVS task CPU time, which removes
both OS and host OS times. Using the TOD clock via STCK yields the wall-clock
execution time, NOT the actual instruction time, and those times will vary
significantly when the tests are properly constructed and have a proper
minimum length. If no variance is seen when running under an OS, then there
is a test construction error, and/or the STPT emulation is not correct.
Even with these updates, the test sizes (the instruction counts between the
branches) need to be large enough to swamp the timing variance of the test
overhead as well as that of the system overhead. We used to use a 4K
instruction/data buffer for each test, with a pre-read of the buffer to
ensure that it was in cache. While not for quite the same reasons under
Hercules, it still makes a difference when you truly work at the hardware
level.
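As a hedged illustration of such a pre-read (the buffer name, the registers,
and the 64-byte line size are my assumptions, not taken from s370_perf), the
warm-up loop could look like:

         LA    2,BUF              point R2 at the 4K test buffer
         LA    3,64               4096/64 = 64 lines to touch
WARM     IC    0,0(,2)            touch one byte in each line
         LA    2,64(,2)           step to the next line
         BCT   3,WARM             loop until the buffer is warm
*        ...
BUF      DS    XL4096             4K instruction/data buffer

One pass like this before the timed loop ensures the first timed iteration
does not pay the cache-miss cost.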
When running standalone (possibly in a later version of your code), you
will want to use STPT for your clock source.
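For illustration, a minimal and untested sketch of bracketing a test body
with STPT when running standalone (STPT is privileged, hence supervisor
state); the labels T0, T1, and TDIFF are mine, not from s370_perf. The CPU
timer decrements while the CPU runs, so the elapsed CPU time is T0 minus T1:

         STPT  T0                 CPU timer before the test body
*        ...                      timed instruction sequence here
         STPT  T1                 CPU timer after the test body
         LM    0,1,T0             load T0 as a 64-bit value
         SL    1,T1+4             low-order words: T0 - T1
         BC    3,NOBORRW          carry set means no borrow
         BCTR  0,0                otherwise borrow from high word
NOBORRW  S     0,T1               high-order words: T0 - T1
         STM   0,1,TDIFF          elapsed CPU time, TOD format
*                                 (bit 51 = 1 microsecond)
T0       DS    D                  CPU timer at start
T1       DS    D                  CPU timer at end
TDIFF    DS    D                  elapsed = T0 - T1

The same subtraction works for STCK values, but only the CPU timer excludes
the time spent outside the code under test.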
And, yes, I also reviewed Liba Svobodova's paper back in 1974-1975, along
with my office partners of the day, as we were writing instruction timing
routines to show both where our machines operated faster than IBM's and
that no "critical" instruction ran slower. This was done because we had to
prove both that the underlying machine was faster and that individual and
combination loads ran faster. The proof was in the CPU time, NOT the wall
clock time. In addition, we had to show that the "third-party" memory we
were selling was not a source of slowdowns on the machines.
The instruction mix information used by Svobodova was generated by the
individual installations shown in Table B.4 on page 68; these were not
fully operational business mixes, and they differ significantly from the
true workload mixes of the 1980s and beyond.
I will be glad to answer any questions that you may have; please be
aware that there are still areas that I am not permitted to address. If
I appear to dodge, or intentionally not answer, a question, please take
that into consideration. Restating the question in a different manner may
permit me to answer it.
Mark L. Gaubatz
Post by ***@gsi.de [hercules-390]
The s370_perf instruction time benchmark is now feature complete and
available as GitHub project wfjm/s370-perf
<https://github.com/wfjm/s370-perf/> in version 0.80. Also lots of new
data has been added, which will allow much deeper analysis.
The data has been generated on a variety of systems, on real CPUs like
the P/390 and on Hercules emulators running on a wide range of host
systems, from a Raspberry Pi 2B to a Xeon workstation. More host CPUs are
likely to come, and maybe also more Hercules versions.
So it would be nice to condense a set of instruction timings (see for
example the P/390 listing) into a single figure of merit.
One classical way is to use instruction frequencies to generate a
weighted average, which could be converted into a 'MIPS' number. So I
started to look for such instruction frequencies, of course for S/370
workloads, and found Stanford Technical Report No. 66, written in 1974
by Liba Svobodova, which contains in Table B.3 on pages 63-64 a full
distribution. It seems that the workload was integer dominated; the
frequencies for floating point and decimal instructions are negligible.
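For illustration, with the relative frequency f(i) of instruction i taken
from such a table and its measured time t(i) in microseconds, the weighted
figure would simply be

    MIPS ~= 1 / SUM_i ( f(i) * t(i) )

i.e. the reciprocal of the frequency-weighted mean instruction time.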
There are other papers on the subject which also mention distributions
based on FORTRAN and COBOL workloads (thus with a significant floating
point and decimal arithmetic instruction fraction), but I haven't found
complete distributions so far.
Any help or hint on where to find such instruction frequency distribution
data is very much appreciated. Best would be data from the S/370 times,
because for me it's a retro computing project and s370_perf only tests
S/370 instructions.
Thanks in advance,
Walter