w.f.j.mueller@gsi.de [hercules-390]
2018-05-06 11:32:13 UTC
Hello,
A first fully analyzed instruction timing dataset for my Intel Xeon E5-1620 reference system is now available under the case id 2018-03-31_sys2 https://github.com/wfjm/s370-perf/blob/master/narr/2018-03-31_sys2.md in GitHub project wfjm/s370-perf https://github.com/wfjm/s370-perf. The page contains a list of findings https://github.com/wfjm/s370-perf/blob/master/narr/2018-03-31_sys2.md#user-content-find.
Up front a proviso: there are significant deviations from a simple additive instruction timing model. See section additivity of instruction times https://github.com/wfjm/s370-perf/blob/master/narr/2018-03-31_sys2.md#user-content-find-itadd.
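To illustrate what a "simple additive instruction timing model" means here, a minimal Python sketch follows. It assumes the model predicts the time of an instruction sequence as the sum of the individually measured instruction times. The function names, the per-instruction times and the measured value are hypothetical placeholders, not numbers from the dataset.

```python
def additive_prediction(sequence, times_ns):
    """Predict the sequence time as the plain sum of per-instruction times."""
    return sum(times_ns[mnemonic] for mnemonic in sequence)


def relative_deviation(measured_ns, predicted_ns):
    """Relative deviation of a measured time from the additive prediction."""
    return (measured_ns - predicted_ns) / predicted_ns


# Hypothetical per-instruction times in ns (placeholders, not measured data)
times_ns = {"LR": 2.0, "AR": 3.0, "BCTR": 5.0}
sequence = ["LR", "AR", "BCTR"]

predicted = additive_prediction(sequence, times_ns)
measured = 12.0  # hypothetical measured time for the whole sequence
deviation = relative_deviation(measured, predicted)

print(f"predicted {predicted:.1f} ns, measured {measured:.1f} ns, "
      f"deviation {deviation:+.0%}")
```

A deviation that differs significantly from zero is the kind of non-additivity the proviso refers to; the actual sizes are in the linked section.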
Some findings simply show nicely how an emulator like Hercules works, e.g.
- branch to same page is faster than to different page, see section branch timing https://github.com/wfjm/s370-perf/blob/master/narr/2018-03-31_sys2.md#user-content-find-bfar.
- ALR is faster than AR, see section ALR timing https://github.com/wfjm/s370-perf/blob/master/narr/2018-03-31_sys2.md#user-content-find-alr.
- CS, CDS and TS are slow in the lock missed case for multi-CPU setups, see section CS, CDS, TS performance https://github.com/wfjm/s370-perf/blob/master/narr/2018-03-31_sys2.md#user-content-find-lock.
Other key findings are
- MVCIN is quite slow, a factor of 6 slower than MVN or MVZ, see section MVCIN performance https://github.com/wfjm/s370-perf/blob/master/narr/2018-03-31_sys2.md#user-content-find-mvcin.
- CLCL is a factor of 12 slower than CLC, see section CLCL performance https://github.com/wfjm/s370-perf/blob/master/narr/2018-03-31_sys2.md#user-content-find-clcl.
- TRT is a factor of 12 slower than TR, see section TRT performance https://github.com/wfjm/s370-perf/blob/master/narr/2018-03-31_sys2.md#user-content-find-trt.
- the speed of decimal arithmetic seems independent of digit count, except for DP, see section decimal performance https://github.com/wfjm/s370-perf/blob/master/narr/2018-03-31_sys2.md#user-content-find-dec.
The poor CLCL performance, when compared to CLC, is a bit surprising because MVCL shows roughly the same performance as MVC, so the overhead of an interruptible instruction can't be the culprit.
Any remarks and comments are very welcome.
Data for many other systems is available now, see list of cases https://github.com/wfjm/s370-perf/blob/master/narr/README.md, but the full analysis will take some time.
With best regards, Walter
P.S.: in case the links are broken in the email distribution, here are the main URLs again:
https://github.com/wfjm/s370-perf/blob/master/narr/2018-03-31_sys2.md
https://github.com/wfjm/s370-perf/blob/master/narr/README.md