Discussion:
[hercules-390] Impact of Meltdown kernel updates on Hercules performance
w.f.j.mueller@gsi.de [hercules-390]
2018-01-14 15:19:19 UTC
Permalink
The kernel page-table isolation (KPTI, https://en.wikipedia.org/wiki/Kernel_page-table_isolation) patches recently introduced to mitigate the Meltdown security vulnerability (https://en.wikipedia.org/wiki/Meltdown_(security_bug)) increase the overhead of system calls and will thus impact system performance.

I wondered whether that can be seen with Hercules, and indeed there are cases where the instruction timing increases by more than a factor of two!

I used the s370_perf instruction time benchmark (https://github.com/wfjm/s370-perf/blob/master/README.md), now available as GitHub project wfjm/s370-perf (https://github.com/wfjm/s370-perf).



I ran the benchmark under MVS 3.8J, with Hercules as included in tk4-, in a dual-CPU configuration (NUMCPU=2 MAXCPU=2), before and after the updates fighting Spectre/Meltdown were installed. The CS, CDS, and TS tests in the lock-missed configuration show a clear effect: times are up by more than a factor of two; all other tests stay the same within measurement precision. See the test reports


https://github.com/wfjm/s370-perf/blob/master/data/2018-01-14_sys1-a.dat
https://github.com/wfjm/s370-perf/blob/master/data/2018-01-14_sys1-b.dat



and inspect tests T292, T297, and T621. Summarized (instruction times in ns):

  Tag   Comment                : before     after
  T292  LR;CS R,R,m (ne)       :    333.92    726.15
  T297  LR;CDS R,R,m (ne)      :    334.79    742.46
  T621  MVI;TS m (ones)        :    342.58    729.77

As said, all other instruction times are essentially unchanged.

What happened is easy to explain. The CS, CDS, and TS emulation code contains

if (sysblk.cpus > 1) sched_yield();

so that spin locks are handled efficiently in the lock-missed case. That's why the lock-missed case shows a substantially slower instruction time than the lock-taken case (which takes only about 80-90 ns). So this test is essentially a system call benchmark, and thus very sensitive to the KPTI patch.

Really nice to see this with such clarity.

The practical impact for normal code is likely negligible, though; that's why I resisted the temptation to title the thread 'Hercules a factor 2 slower' :).

Cheers, Walter
dwegscheid@sbcglobal.net [hercules-390]
2018-01-19 14:24:05 UTC
Permalink
Walter, nice analysis, and thanks for doing this. I am curious: what was the host environment CPU configuration? I wonder how much this is affected by the number of host CPUs brought to bear...

I work in information security, and this is pretty consistent with what we are seeing in our large shop. Overall, for our workloads, we're not seeing much of a hit, but we had some apps that took a beating and need to be shored up with additional resources...
Carey Tyler Schug sqrfolkdnc@comcast.net [hercules-390]
2018-01-19 15:02:09 UTC
Permalink
Lurker speaking. I assume the caching done by Hercules is NOT
speculative, though, so there is no way it could be vulnerable to Spectre
or Meltdown? So running under Hercules, we could disable the fixes for
them? I assume at some point (next generation?) the hardware will be
fixed and the patches will no longer be needed either?
Gregg Levine gregg.drwho8@gmail.com [hercules-390]
2018-01-19 15:16:35 UTC
Permalink
Hello!
I agree with the fellow wearing the nice hat.
After spending time parsing the discussions on the regular IBM list
concerning this potential problem, I'm not completely convinced System
Z has any problem other than the usual batch of blockheads who find a
book in the library and want to try to commit a break on six on their
favorite reachable system.

I might also add that my Linux distribution has not even decided what
to release for its sake, that's Slackware.

I might also add that the discussions here are being a bit more
thoughtful and interesting than usual.
-----
Gregg C Levine ***@gmail.com
"This signature fought the Time Wars, time and again."


Tony Harminc tharminc@gmail.com [hercules-390]
2018-01-20 02:42:47 UTC
Permalink
Post by Carey Tyler Schug sqrfolkdnc@comcast.net [hercules-390]
Lurker speaking. I assume the caching done by Hercules is NOT speculative
though, so no way it could be vulnerable to Spectre or Meltdown? So
running under Hercules, we could disable the fixes for them? I assume at
some point (next generation?) the hardware will be fixed.
I think this is true of any system where you don't run potential malware.
If you know what code you are running, and you know it's not going to try
to exploit this kind of information leakage in the underlying hardware,
then there's nothing to worry about.

If you are thinking of, essentially, running possibly malicious mainframe
code on an OS under Hercules, and assuming the underlying Intel (or
whatever) machine does have this kind of vulnerability, then I think there
is a small possibility of the malware exploiting the Intel vulnerability
via a sort of pass-through. If a 370 program can convince Hercules to
execute certain Intel code that itself unintentionally exposes a weakness,
then it's possible. But such code has to exist, and it's far from clear
that any such code does (or would) exist in Hercules.

But I think you are speaking of the Hercules virtual machine itself, and in
that case I agree that there is no direct speculative execution to be
exploited. Even if there were, I am doubtful it would fall into the
exploitable category.

Tony H.
w.f.j.mueller@gsi.de [hercules-390]
2018-01-28 19:34:10 UTC
Permalink
Hi,

a few more remarks on Meltdown, its impact on Hercules, and answers to some questions raised in this thread.

The Meltdown vulnerability is caused by a combination of

- out-of-order execution
- speculative execution
- sub-optimal handling of L1 cache and TLB

which leads to delayed exceptions, which in turn allow a side-channel attack. The key culprit is the delayed exceptions. This is a feature of the concrete implementation of a processor architecture, not of the processor architecture itself. Therefore, for example, Intel has this unfortunate feature, while AMD claims it has not.

Vulnerable is the host CPU, and of course not an emulated CPU. The side-channel attack requires good time resolution, so it's IMHO unlikely that System/390 code executed by Hercules can be either source or target of an attack.

What one sees is only the performance impact of the mitigation. The kernel page-table isolation (KPTI) patches rolled out by all OS vendors slow down system calls; the amount depends on CPU generation and OS version. Newer Intel CPUs, Haswell or later, support Process-Context Identifiers (PCID), and newer kernels, like Linux 4.14.11 or later, can use this to reduce the performance impact of KPTI. In general, older CPUs with older OS versions will take a bigger performance hit than newer CPUs with newer kernel versions.

The test case 'sys1' shown in the last posting was generated on

- Intel(R) Core(TM)2 Duo CPU E8400
- Ubuntu 16.04 LTS with a 4.4.0 Linux kernel

I've done another test case 'nbk2' on

- Intel(R) Core(TM) i5 CPU M520
- Ubuntu 14.04 LTS with a 3.13.0 Linux kernel
- VirtualBox 5.0.12 r104815 under Windows 7

The test reports are under

https://github.com/wfjm/s370-perf/blob/master/data/2018-01-21_nbk2-a.dat
https://github.com/wfjm/s370-perf/blob/master/data/2018-01-21_nbk2-b.dat

In this case one gets (instruction times in ns)

  Tag   Comment                : before     after
  T292  LR;CS R,R,m (ne)       :   2291.28   3854.92
  T297  LR;CDS R,R,m (ne)      :   2295.46   3831.74
  T621  MVI;TS m (ones)        :   2320.39   3812.82

Comparing both systems with s370_perf_sum (https://github.com/wfjm/s370-perf/blob/master/bin/s370_perf_sum) gives

  Tag   Comment                :  sys1-a   sys1-b   nbk2-a   nbk2-b
  T100  LR R,R                 :    3.07     3.06     3.53     3.56
  T101  LA R,n                 :    3.91     3.90     4.07     4.09
  T102  L R,m                  :   12.81    12.80    11.86    11.90
  T110  ST R,m                 :   12.79    12.79    12.32    12.23
  ...
  T292  LR;CS R,R,m (ne)       :  333.92   726.15  2291.28  3854.92
  T297  LR;CDS R,R,m (ne)      :  334.79   742.46  2295.46  3831.74
  T621  MVI;TS m (ones)        :  342.58   729.77  2320.39  3812.82

Observations:

- Simple instructions, like LR, LA, L, and ST, have very similar speed on both systems.
- Lock misses are apparently more costly in a Linux-under-VirtualBox-under-Windows environment. Not too astonishing; most likely all three layers get into action to process the sched_yield().
- The relative KPTI patch impact is smaller on the nbk2 system, which is slow anyway, so it's hard to judge what's behind this.

Both systems likely fall in the 'old CPU' plus 'old kernel' category and thus show the worst-case impact of the KPTI kernel patches.

Cheers, Walter
