Discussion:
S/370 MP Timer Bug ?
halfmeg
2010-11-29 15:34:34 UTC
Occasionally someone comes along and tries to run S/370 mode with more than 1 processor defined in the Hercules configuration. There is some discussion, but there never seems to be a resolution to the SPIN problem, or to what seems to be excessive overhead, when NUMCPU 2 is defined.

http://tech.groups.yahoo.com/group/hercules-390/message/49171

http://tech.groups.yahoo.com/group/hercules-390/message/49269

3.8j only does a couple of things differently when MP is genned, as I think TK3 is. It sets aside some Global Save Area for the 2nd processor and includes 2 modules, IGFPXMFA & IFGPTAIM in the gen.

Thinking that the above seems pretty innocuous, I tinkered with Hercules a bit this morning.

Checked out the current SVN version (7143); dyn76.c causes a compile failure, so I removed it from the make.

Changed hercules.cnf to:

CPUMODEL 3033
ARCHLVL s/370
NUMCPU 2

Started Hercules; no OS is needed to see what looks like a problem to me. Log segment reproduced below:

09:58:22 HHC00013I Herc command: 'sysclear'
09:58:22 HHC02311I sysclear completed
09:58:27 HHC00013I Herc command: 'r 50-50'
09:58:27 HHC02290I R:00000050:K:00=FFF99851
09:58:32 HHC00013I Herc command: 'r 50-50'
09:58:32 HHC02290I R:00000050:K:00=FFF3CBD4
09:58:38 HHC00013I Herc command: 'r 50-50'
09:58:38 HHC02290I R:00000050:K:00=FFECCB65 <<---<<<<
09:58:42 HHC00013I Herc command: 'cpu 1'
09:58:50 HHC00013I Herc command: 'r 50-50'
09:58:50 HHC02290I R:00000050:K:00=FFECCB65 <<---<<<<
09:58:52 HHC00013I Herc command: 'r 50-50'
09:58:52 HHC02290I R:00000050:K:00=FFE9E3A6
09:58:53 HHC00013I Herc command: 'r 50-50'
09:58:53 HHC02290I R:00000050:K:00=FFE8C626 <<---<<<<
09:58:57 HHC00013I Herc command: 'cpu 0'
09:59:07 HHC00013I Herc command: 'r 50-50'
09:59:07 HHC02290I R:00000050:K:00=FFE8C626 <<---<<<<
09:59:10 HHC00013I Herc command: 'r 50-50'
09:59:10 HHC02290I R:00000050:K:00=FFE5C4F0
09:59:11 HHC00013I Herc command: 'r 50-50'
09:59:11 HHC02290I R:00000050:K:00=FFE47A95 <<---<<<<
09:59:15 HHC00013I Herc command: 'cpu 1 '
09:59:23 HHC00013I Herc command: 'r 50-50'
09:59:23 HHC02290I R:00000050:K:00=FFE47A95 <<---<<<<
09:59:24 HHC00013I Herc command: 'r 50-50'
09:59:24 HHC02290I R:00000050:K:00=FFE2F573


SYSCLEAR starts the CPUTIMER in s/370 mode located at x'50'. When 2 CPUs are defined there is supposed to be a separate PSA for each. That may be all right, but the timer should be getting updated in each as well. As the log above shows, the timer display doesn't change after switching the display from CPU 0 to 1 or back to 0, even though several seconds have passed. Once that 'bogus' value has been displayed, the timer display changes as it should again until the next CPU switch, after which it again doesn't update until the 2nd display request.
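
As a rough cross-check of the log above, the displayed words can be converted to elapsed seconds. This is only a sketch: it assumes the S/370 convention that bit 23 of the interval timer is decremented 300 times per second, so one unit of the low-order bit is 1/76800 second, and the helper name `elapsed_seconds` is made up for illustration.

```python
# Convert two readings of the location-80 interval timer into elapsed seconds.
# Assumption: bit 23 of the word counts down at 300 Hz, so the low-order bit
# (bit 31) is worth 1 / (300 * 256) = 1/76800 second.
UNITS_PER_SECOND = 300 * 256


def elapsed_seconds(earlier: int, later: int) -> float:
    """Seconds between two 32-bit timer readings (the timer counts down)."""
    return ((earlier - later) & 0xFFFFFFFF) / UNITS_PER_SECOND


# Readings copied from the log above, with their wall-clock stamps.
readings = [
    ("09:58:27", 0xFFF99851),
    ("09:58:32", 0xFFF3CBD4),
    ("09:58:38", 0xFFECCB65),
    ("09:58:50", 0xFFECCB65),  # first display after 'cpu 1': unchanged
    ("09:58:52", 0xFFE9E3A6),
]

for (t0, v0), (t1, v1) in zip(readings, readings[1:]):
    print(f"{t0} -> {t1}: timer moved {elapsed_seconds(v0, v1):.2f} s")
```

Under that assumption the first two intervals come out close to the wall-clock time between the commands, while the display taken right after the CPU switch shows no movement at all, which is the inconsistency described above.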

This doesn't look right. If a CPU expects the timer to always keep moving but it doesn't, isn't there a possibility the SPIN is coming from what looks to me like a bug?

Phil
halfmeg
2010-11-29 15:37:24 UTC
Post by halfmeg
includes 2 modules, IGFPXMFA & IFGPTAIM in the gen.
Which is probably a typo and should be:

includes 2 modules, IGFPXMFA & IGFPTAIM in the gen.

Phil
Tony Harminc
2010-11-29 17:49:00 UTC
Occasionally someone comes along and tries to utilize S/370 mode using more than 1 processor defined in the Hercules configuration.  There is some discussion but never seems to be a resolution to SPIN problem or what seems to be excessive overhead when NUMCPU 2 is defined.
[snip]
SYSCLEAR starts the CPUTIMER in s/370 mode located at x'50'.  When 2 CPUs are defined there are supposed to be a separate PSA for each.  That may be alright, but the timer should be getting updated in each as well.  As the log above shows the timer doesn't increment after switching the display from CPU 0 to 1 or back to 0 even though several seconds have passed.  Once the 'bogus' timer display is displayed then the timer display changes as it should again until a CPU switch is performed then it doesn't update until the 2nd display request.
This doesn't look right and if a CPU is expecting the timer to always increment but doesn't, isn't there a possibility the SPIN is coming from what looks to me like a bug?
I know nothing of this Hercules code, but in general it seems to me
that any emulation of a timer does not need to (and should not for
performance reasons) be updating it at any time except when it is
looked at. There is no reason at all to actually take an interrupt on
the host 300 times per second, just to increment a location on guest
storage that is almost certainly not being examined. All this applies
a fortiori to a higher precision timer like the TOD clock or CPU
timer, but it is perhaps more obvious in these cases in that there is
a unique machine instruction needed to examine these timers, and there
can be no casual observation of their values by looking at storage
somewhere.

Now whether examining such timers using console commands should count
as looking at the timer is a good question. Again, not knowing if the
Hercules code actually behaves this way, or if it does timer updates
naively, I would suggest writing a tiny program to look at the timer,
rather than using the console commands.

Tony H.
Ivan Warren
2010-11-29 18:43:30 UTC
Post by Tony Harminc
Now whether examining such timers using console commands should count
as looking at the timer is a good question. Again, not knowing if the
Hercules code actually behaves this way, or if it does timer updates
naively, I would suggest writing a tiny program to look at the timer,
rather than using the console commands.
Tony H.
The interval timer location in a CPU's PSA ("Real" address X'50') is only
updated when it is being fetched by the CPU owning the PSA. This is to
ensure operations such as

MVC 0(8,X'4C'),0(X'50')

are done atomically (the word before X'50' and the word after X'50' are
designed for this - to ensure you can fetch and store a value in X'50'
in an atomic fashion).

Doing a

L <x>,X'50'
ST <y>,X'50'

cannot ensure the interval timer won't have been updated between the
load & store operations because the interval timer is guaranteed to only
be updated in between instructions on the CPU owning the PSA. Note that
the word preceding and following the interval timer are designed for
that effect.

Under Hercules, for every fetch made (in S/370 mode), a check is made to
see if the logical address is X'50'. If it is, the location for real
address X'50' (i.e. absolute "X'50' + CPU prefix") is updated. If
logical address X'50' happens not to be real address X'50', no damage
is done; we just did a spurious update.
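
To make the update-on-fetch idea concrete, here is a toy model. It is not Hercules code: the class, the method names and the simplified prefixing are all assumptions made for illustration, and the timer resolution (bit 23 at 300 Hz) is assumed as well.

```python
import time

ITIMER_ADDR = 0x50              # location 80 in each CPU's PSA
UNITS_PER_SECOND = 300 * 256    # assumed: bit 23 counts down at 300 Hz


class ToyCpu:
    """Hypothetical CPU model that refreshes its timer word only on fetch."""

    def __init__(self, storage, prefix, clock=time.monotonic):
        self.storage = storage      # shared main storage (bytearray)
        self.prefix = prefix        # this CPU's prefix value
        self.clock = clock          # injectable time source, in seconds
        self.set_timer(0x7FFFFFFF)

    def _absolute(self, addr):
        # Simplified prefixing: only the forward mapping of page 0 is modelled.
        return addr + self.prefix if addr < 0x1000 else addr

    def _store_timer(self, value):
        a = self._absolute(ITIMER_ADDR)
        self.storage[a:a + 4] = (value & 0xFFFFFFFF).to_bytes(4, "big")

    def set_timer(self, value):
        """Store a new timer value and remember when it was set."""
        self.base_value = value
        self.base_time = self.clock()
        self._store_timer(value)

    def fetch_word(self, addr):
        """Fetch a word; a fetch of X'50' first brings the timer up to date."""
        if addr == ITIMER_ADDR:
            elapsed = int((self.clock() - self.base_time) * UNITS_PER_SECOND)
            self._store_timer(self.base_value - elapsed)
        a = self._absolute(addr)
        return int.from_bytes(self.storage[a:a + 4], "big")


def peek(storage, absolute):
    """Console-style display: reads storage without going through a CPU."""
    return int.from_bytes(storage[absolute:absolute + 4], "big")
```

In this model a `peek` at the absolute location keeps returning the old word no matter how much time passes, and only a `fetch_word(0x50)` by the owning CPU makes the stored value catch up.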

Now, we can also do this because, according to the S/370 Principles of
Operation, fetching the interval timer of a CPU from another CPU or I/O
channel (page 50 of GA22-7000-4, 3rd paragraph) may yield unpredictable
results.

Note that if the PSA is mapped to a logical address other than 0 through
DAT, I'm not sure we're doing this correctly. I have yet to see a
real-world example of this, but it may be an issue.

Also, the Alter/Display function is only available (per the S/370
Principles of Operation) when the CPU is in a stopped state, at which
point the interval timer no longer gets updated. Using the
Alter/Display manual functions when the CPU is not in a stopped state
(as permitted by Hercules) can therefore also yield unpredictable
results. So if you want a true image of the interval timer for a CPU,
you should stop that CPU first.

PS : VM/370 also has that quirk .. I noticed this 20 odd years ago : If
you attempt to do a 'CP D 50' from a secondary user with the CPU
running, the location at X'50' seems to never change, even with a CP SET
TIMER REAL !

--Ivan



rhtatum
2010-11-29 19:44:52 UTC
There are some real consequences for how the timer is worked. If one puts a "virgin" tape on a drive, OS/360, at least, stupidly goes off, no matter what, and tries to find a label on the damned thing, and of course there isn't one; the tape winds up being ripped off the supply reel and all kinds of mischief happens. I guess that's why we had DEBE, CLIP, etc. to get a tape out of the box and into some sort of library. Dumb nonsense.

Now, on the Telpar OS, whenever an I/O was requested, before the SIO was issued, a timer was started with a few seconds on it; if an I/O interrupt wasn't received before the timer expired (with a corresponding interrupt), the attempt at an I/O operation was terminated, as one would wish.

For example, a system I had the misfortune to try to use only had 1600 BPI tape drives and I wanted to read a tape that I knew was recorded at 800 BPI; I hoped, not knowing exactly what the hardware was, that the tape could be read. Tried to read the tape under DOS, ripped the tape off and made a damned mess. Brought up Telpar OS, it used a timer, tried a bit, gave up and halted the I/O attempt. Which is what one would wish for. So the CPU timer in location 80 (X'50') can be important, and should be updated. As far as updating the fool thing 300 times per second (or 60 times/sec., line frequency) I haven't a clue. That seems to be rather onerous for a simulator that runs on as many systems as Hercules.
Tony Harminc
2010-11-29 19:53:57 UTC
Post by rhtatum
There are some real consequences for how the timer is worked. If one puts a "virgin" tape on a drive, OS/360, at least, stupidly goes off, no matter what, and tries to find a label on the damned thing, and of course there isn't one; the tape winds up being ripped off the supply reel and all kinds of mischief happens. I guess that's why we had DEBE, CLIP, etc. to get a tape out of the box and into some sort of library. Dumb nonsense.
Now, on the Telpar OS, whenever an I/O was requested, before the SIO was issued, a timer was started with a few seconds on it; if an I/O interrupt wasn't received before the timer expired (with a corresponding interrupt), the attempt at an I/O operation was terminated, as one would wish.
For example, a system I had the misfortune to try to use only had 1600 BPI tape drives and I wanted to read a tape that I knew was recorded at 800 BPI; I hoped, not knowing exactly what the hardware was, that the tape could be read. Tried to read the tape under DOS, ripped the tape off and made a damned mess. Brought up Telpar OS, it used a timer, tried a bit, gave up and halted the I/O attempt. Which is what one would wish for. So the CPU timer in location 80 (X'50') can be important, and should be updated. As far as updating the fool thing 300 times per second (or 60 times/sec., line frequency) I haven't a clue. That seems to be rather onerous for a simulator that runs on as many systems as Hercules.
Your scenario is not an argument for updating the timer in real time
(or anything close to it). The timer needs to be updated only when it
is examined, and clearly as well the timer interrupt processing needs
to be done on time. But timer interrupts are not driven by examining
location X'50'. (Well, clearly a hardware implementation *could* do it
that way, but surely no emulation would ever take that approach.) I
assume that the value in the timer is used as a pending time interval
in a host timer, or more probably the sooner-to-expire of the Clock
Comparator, CPU Timer, and X'50' timer is set as a host interval, and
then when that timer pops, the others are all recalculated. This would
take care of your tape issue just fine.
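
The "sooner-to-expire" idea can be sketched in a few lines. This is a guess at the general technique, not a description of what Hercules does; the function and parameter names are hypothetical, and every time is taken to be in seconds on one common host clock.

```python
def next_host_wakeup(now, clock_comparator_at=None, cpu_timer_at=None,
                     interval_timer_at=None):
    """Return how long the host may sleep before the earliest guest timer
    event is due, or None if no timer event is pending.

    Each *_at argument is the host time at which that guest timer would
    expire, or None if that timer is not armed.
    """
    deadlines = [t for t in (clock_comparator_at, cpu_timer_at,
                             interval_timer_at) if t is not None]
    if not deadlines:
        return None
    # A deadline already in the past means "wake up immediately".
    return max(0.0, min(deadlines) - now)


# The location-80 timer expires first here, so it sets the host interval.
print(next_host_wakeup(100.0, clock_comparator_at=130.0,
                       interval_timer_at=103.5))
```

When the host interval pops, all three guest timers would be recalculated from the current time and the function called again for the next wakeup.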

Tony H.
Kevin Leonard
2010-11-30 17:35:31 UTC
I've been trying to avoid this thread. Just thinking of the
interval timer gives me a headache.
Post by Tony Harminc
MVS does have constants based on CPU model - the so-called SRM
constants that map CPU time to service units. IIRC these are in
module IRARMCPU. Amdahl used to distribute an update to MVS for
use on their V6 and similar processors. Whether these constants
are also used for other purposes, I don't know.
As Tony said, selection of the "appropriate" SRM constant
is model-dependent based on the response from STIDP, and if the
"appropriate" value is actually inappropriate for the processor
speed, a spin loop at IPL is one of the early symptoms. When
my late employer brought in its first Amdahl system, someone
unthinkingly applied an installation-standard hack on top of
the Amdahl-supplied SRM value, and the system went into a spin
loop every time it was IPLed. Removing the local hack and
restoring the Amdahl value fixed it. If we're running MVS
on hardware that's a lot faster than anything it was designed
to run on, it may be necessary to zap the SRM constant to a
more "appropriate" value.
Post by halfmeg
SYSCLEAR starts the CPUTIMER in s/370 mode located at x'50'.
But it shouldn't. The timer should start when the processor
leaves the stopped state.
Post by halfmeg
When 2 CPUs are defined there are supposed to be a separate
PSA for each. That may be alright, but the timer should be getting
updated in each as well. As the log above shows the timer doesn't
increment after switching the display from CPU 0 to 1 or back
to 0 even though several seconds have passed. Once the 'bogus'
timer display is displayed then the timer display changes as it
should again until a CPU switch is performed then it doesn't
update until the 2nd display request.
This doesn't look right and if a CPU is expecting the timer to
always increment but doesn't, isn't there a possibility the SPIN
is coming from what looks to me like a bug.
Hercules doesn't update the timer continuously. An actual update
occurs when an event requires the timer value. At that point, the
timer is modified to have the value it should have had if it had
been updated continuously. With most timers, there are two places
an update happens:

1. In the CPU thread during instruction execution.

2. In the timer watchdog thread, to guard against the possibility
of a target time arriving while the processor is in a wait.

In addition, the location 80 timer is (supposed to be) updated in
a third place:

3. Whenever the value of location 80 is fetched.

What we actually maintain is the value of when the next timer
interrupt should occur. When an event requires the timer value,
we calculate it as (target_value - current_time) converted to
timer units.
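
A minimal sketch of that bookkeeping, assuming timer units of 1/76800 second (bit 23 decremented 300 times per second). The class and its method names are invented for illustration and are not taken from the Hercules source.

```python
import time

UNITS_PER_SECOND = 300 * 256   # assumed resolution of the location-80 timer


class LazyIntervalTimer:
    """Keeps only the expiry time; the timer value is derived on demand."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock          # injectable time source, in seconds
        self.target = self.clock()  # host time at which the timer hits zero

    def set(self, units):
        """Arm the timer with a value given in timer units."""
        self.target = self.clock() + units / UNITS_PER_SECOND

    def value(self):
        """(target_value - current_time) converted to timer units."""
        return round((self.target - self.clock()) * UNITS_PER_SECOND)

    def expired(self):
        """True once the target time has been reached."""
        return self.clock() >= self.target
```

Nothing is touched between calls: asking for `value()` four seconds after setting a ten-second interval simply yields six seconds' worth of units.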

Current Hercules implementation of the location 80 timer has some
problems:

1. Architecturally, the location 80 timer should not be updated
when the processor is stopped. Right now, it's being updated.
(I think this is true of the CPU timer as well, even though
the "clocks" command displays "not decrementing".) Fixing
this would require saving the last-running time whenever a
processor is stopped, and using that value (instead of the
current time) to calculate what should be reported as the
interval timer value. When the processor enters the operating
state, the timer's target value needs to be adjusted to
((target_value - stopped_time) + current_time) to account
for the time the processor was stopped.

2. The first interval timer fetch, at least if it's done using
alter/display, causes the timer value to jump. If I start
Hercules without IPLing anything and issue "clocks" a couple
of times, it displays:

10:50:49 clocks
10:50:49 HHCPN028I tod = 94BAF1BE7B75B000 1982.334 10:50:49.926491
10:50:49 h/w = C6F58466DAF5B000 2010.334 16:50:49.926491
10:50:49 off = CDC56D57A0800000 - 28.001 06:00:00.000000
10:50:49 ckc = 0000000000000000 1900.001 00:00:00.000000
10:50:49 cpt = not decrementing
10:50:49 itm = FFFC7700 15:32:01.036608
10:51:07 clocks
10:51:07 HHCPN028I tod = 94BAF1CF269A1000 1982.334 10:51:07.404705
10:51:07 h/w = C6F58477861A1000 2010.334 16:51:07.404705
10:51:07 off = CDC56D57A0800000 - 28.001 06:00:00.000000
10:51:07 ckc = 0000000000000000 1900.001 00:00:00.000000
10:51:07 cpt = not decrementing
10:51:07 itm = FFE7FB89 15:31:43.557557

which is wrong because it should be zeroes, but at least it's
consistently wrong. When I then display location 80 using
alter/display, an hour or so gets whacked off the timer value:

10:51:14 r 50
10:51:14 R:00000050:K:00=E4E56734 00000000 00000000 00000000 UV..............
10:51:14 R:00000060:K:00=00000000 00000000 00000000 00000000 ................
10:51:14 R:00000070:K:00=00000000 00000000 00000000 00000000 ................
10:51:14 R:00000080:K:00=00000000 00000000 00000000 00000000 ................
10:51:16 clocks
10:51:16 HHCPN028I tod = 94BAF1D8037F5000 1982.334 10:51:16.698101
10:51:16 h/w = C6F5848062FF5000 2010.334 16:51:16.698101
10:51:16 off = CDC56D57A0800000 - 28.001 06:00:00.000000
10:51:16 ckc = 0000000000000000 1900.001 00:00:00.000000
10:51:16 cpt = not decrementing
10:51:16 itm = E4E28FF3 13:53:20.692055

Subsequent alter/display of location 80 doesn't change the timer:

11:05:25 clocks
11:05:25 HHCPN028I tod = 94BAF501690D8000 1982.334 11:05:25.412056
11:05:25 h/w = C6F587A9C88D8000 2010.334 17:05:25.412056
11:05:25 off = CDC56D57A0800000 - 28.001 06:00:00.000000
11:05:25 ckc = 0000000000000000 1900.001 00:00:00.000000
11:05:25 cpt = not decrementing
11:05:25 itm = E0FFF9C3 13:39:11.977639
11:05:27 r 50
11:05:27 R:00000050:K:00=E0FE1040 00000000 00000000 00000000 \.. ............
11:05:27 R:00000060:K:00=00000000 00000000 00000000 00000000 ................
11:05:27 R:00000070:K:00=00000000 00000000 00000000 00000000 ................
11:05:27 R:00000080:K:00=00000000 00000000 00000000 00000000 ................
11:05:28 clocks
11:05:28 HHCPN028I tod = 94BAF50409D1C000 1982.334 11:05:28.167708
11:05:28 h/w = C6F587AC6951C000 2010.334 17:05:28.167708
11:05:28 off = CDC56D57A0800000 - 28.001 06:00:00.000000
11:05:28 ckc = 0000000000000000 1900.001 00:00:00.000000
11:05:28 cpt = not decrementing
11:05:28 itm = E0FCBF11 13:39:09.223197

This could be a variant of what Phil is seeing when he
alternately displays timers from different PSAs.

There's another problem using the "iplc" command to IPL simple
standalone utilities like IBCDASDI and IBCDMPRS. The timer gets
reset twice during "iplc", which results in two external interrupts
being made pending. In what I would consider a design error, the
standalone utilities leave the external new PSW enabled for
interrupts. The first interrupt is presented and we load the
external new PSW. Before any instructions can be executed the
second interrupt is presented and another PSW swap occurs. This
causes the external new PSW to be saved as the external old PSW.
The standalone utilities' interrupt handler checks the interrupt
code to see if the interrupt was produced by the console INTERRUPT
key. If it wasn't, the interrupt handler loads the external old
PSW. Because the external old PSW is now the same as the external
new PSW, we're in a three instruction loop. Even though it's a
program design error, it doesn't happen on real hardware.
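
The PSW-swap sequence is easier to follow as a small simulation. This is a deliberately simplified model: a PSW is reduced to an (address, enabled) pair, the interrupt handler is collapsed into a single step rather than three instructions, and all names are made up.

```python
def run(pending_interrupts, max_steps=10):
    """Trace what happens when external interrupts are pending at IPL."""
    ext_new = ("handler", True)   # external new PSW, left enabled (the design error)
    ext_old = None                # external old PSW slot
    psw = ("utility", True)       # current PSW: the utility's mainline code
    trace = []
    for _ in range(max_steps):
        # Interrupts are taken between instructions if the PSW is enabled.
        if pending_interrupts and psw[1]:
            pending_interrupts -= 1
            ext_old = psw         # PSW swap: save current, load new
            psw = ext_new
            trace.append("interrupt")
            continue
        if psw[0] == "handler":
            # Not the console INTERRUPT key, so resume via the external old PSW.
            psw = ext_old
            trace.append("handler: LPSW ext_old")
        else:
            trace.append("utility runs")
            return trace
    return trace


print(run(1))   # one interrupt: the handler returns to the utility
print(run(2))   # two back-to-back interrupts: the handler keeps reloading itself
```

With a single pending interrupt the old PSW still points at the utility, so the handler returns normally. With two, the second swap overwrites the old PSW with the new PSW, and the handler keeps loading its own address until the step limit is reached.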

See why the interval timer gives me a headache?

--
Tony Harminc
2010-11-30 19:26:56 UTC
I've been trying to avoid this thread.  Just thinking of the
interval timer gives me a headache.
:-)
Hercules doesn't update the timer continuously.  An actual update
occurs when an event requires the timer value.  At that point, the
timer is modified to have the value it should have had if it had
been updated continuously.  With most timers, there are two places
1.  In the CPU thread during instruction execution.
2.  In the timer watchdog thread, to guard against the possibility
   of a target time arriving while the processor is in a wait.
I'm not sure I get this bit (point 2). Are you talking about an
emulated CPU being in a wait, or the Hercules code?
1.  Architecturally, the location 80 timer should not be updated
   when the processor is stopped.  Right now, it's being updated.
   (I think this is true of the CPU timer as well, even though
   the "clocks" command displays "not decrementing".)  Fixing
   this would require saving the last-running time whenever a
   processor is stopped, and using that value (instead of the
   current time) to calculate what should be reported as the
   interval timer value.  When the processor enters the operating
   state, the timer's target value needs to be adjusted to
   ((target_value - stopped_time) + current_time) to account
   for the time the processor was stopped.
I don't think this is right. The interval timer just plain doesn't run
when the CPU is stopped. When the CPU is started, the timer starts up,
but I don't believe it compensates for the time the CPU was stopped.
To my dim and distant recollection, it just skips time and the OS (MVT
or any OS that uses the interval timer for time of day stuff) reported
time of day becomes late. Of course the IBM OSs that support the TOD
Clock use that for (duh) time of day display, and the interval timer
only for elapsed time calculations.

Tony H.
Kevin Leonard
2010-12-01 01:46:53 UTC
Post by Tony Harminc
Post by Kevin Leonard
Hercules doesn't update the timer continuously. An actual update
occurs when an event requires the timer value. At that point, the
timer is modified to have the value it should have had if it had
been updated continuously. With most timers, there are two places
1. In the CPU thread during instruction execution.
2. In the timer watchdog thread, to guard against the possibility
of a target time arriving while the processor is in a wait.
I'm not sure I get this bit (point 2). Are you talking about an
emulated CPU being in a wait
Yes. If the emulated processor is in a wait, the expiration of
a timer interval won't get caught by a CPU thread because there
won't be any instructions being emulated. In that case, expiring
timer events will be detected in the watchdog thread.
Post by Tony Harminc
Post by Kevin Leonard
1. Architecturally, the location 80 timer should not be updated
when the processor is stopped. Right now, it's being updated.
(I think this is true of the CPU timer as well, even though
the "clocks" command displays "not decrementing".) Fixing
this would require saving the last-running time whenever a
processor is stopped, and using that value (instead of the
current time) to calculate what should be reported as the
interval timer value. When the processor enters the operating
state, the timer's target value needs to be adjusted to
((target_value - stopped_time) + current_time) to account
for the time the processor was stopped.
I don't think this is right. The interval timer just plain doesn't
run when the CPU is stopped.
Architecturally correct, but that's not currently what Hercules
is doing.
Post by Tony Harminc
When the CPU is started, the timer starts up, but I don't believe it
compensates for the time the CPU was stopped.
I didn't do a very good job explaining it. What we need to do but
currently aren't is to compensate internally for the time the
processor was stopped to prevent the stopped time from being
included in the interval timer value calculation. Take a trivial
example: Suppose it's currently 10:15, and you set the timer for
15 minutes. Internally, Hercules will save the target time as 10:30.
If at 10:20 you ask how much time remains, Hercules will tell you 10
minutes. Then stop the processor for five minutes. When you start
the processor at 10:25 and ask how much time remains, Hercules should
still tell you 10 minutes. Instead the response you get back will be
five minutes, because Hercules is using the (unmodified) 10:30 target
time minus the current time to calculate the interval timer value.
If the processor is stopped for five minutes and then started, the
target time (the time at which the timer will expire) should get
adjusted to 10:35.
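
The example can be worked through in code. This sketch uses minutes since midnight for simplicity; the class is hypothetical and only illustrates the ((target_value - stopped_time) + current_time) adjustment, not the actual Hercules data structures.

```python
def hhmm(text):
    """Convert 'HH:MM' to minutes since midnight."""
    hours, minutes = text.split(":")
    return int(hours) * 60 + int(minutes)


class CompensatedTimer:
    """Target-time timer that does not count time spent stopped."""

    def __init__(self, now, interval):
        self.target = now + interval
        self.stopped_at = None

    def stop(self, now):
        self.stopped_at = now            # remember the last-running time

    def start(self, now):
        # Push the target out by however long the processor was stopped.
        self.target = (self.target - self.stopped_at) + now
        self.stopped_at = None

    def remaining(self, now):
        # While stopped, report against the last-running time, not "now".
        reference = now if self.stopped_at is None else self.stopped_at
        return self.target - reference


timer = CompensatedTimer(hhmm("10:15"), 15)    # target is 10:30
print(timer.remaining(hhmm("10:20")))          # minutes left at 10:20
timer.stop(hhmm("10:20"))
timer.start(hhmm("10:25"))                     # target becomes 10:35
print(timer.remaining(hhmm("10:25")))          # unchanged by the stop
```

Both prints show 10 minutes remaining, and after the restart the target has moved from 10:30 to 10:35.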

--
Mike Schwab
2010-12-01 05:24:53 UTC
On Tue, Nov 30, 2010 at 7:46 PM, Kevin Leonard <hercules-list-+***@public.gmane.org> wrote:
<deleted>
I didn't do a very good job explaining it.  What we need to do but
currently aren't is to compensate internally for the time the
processor was stopped to prevent the stopped time from being
included in the interval timer value calculation.  Take a trivial
example:  Suppose it's currently 10:15, and you set the timer for
15 minutes.  Internally, Hercules will save the target time as 10:30.
If at 10:20 you ask how much time remains, Hercules will tell you 10
minutes.  Then stop the processor for five minutes.  When you start
the processor at 10:25 and ask how much time remains, Hercules should
still tell you 10 minutes.  Instead the response you get back will be
five minutes, because Hercules is using the (unmodified) 10:30 target
time minus the current time to calculate the interval timer value.
If the processor is stopped for five minutes and then started, the
target time (the time at which the timer will expire) should get
adjusted to 10:35.
--
How about: As part of STOPPING the processor, change the target time
to a time remaining value (in some format), then as part of STARTING
the processor, recalculate the new target time by taking the current
time and adding the time remaining.
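
A sketch of this suggestion, again in minutes since midnight. The class name and fields are invented for illustration; it is one possible reading of the idea, not existing Hercules code.

```python
class FreezingTimer:
    """Holds a target time while running and a remaining time while stopped."""

    def __init__(self, now, interval):
        self.target = now + interval   # valid only while running
        self.frozen_remaining = None   # valid only while stopped

    def stop(self, now):
        # Convert the target time into a time-remaining value.
        self.frozen_remaining = self.target - now
        self.target = None

    def start(self, now):
        # Recalculate the target from the current time plus what was left.
        self.target = now + self.frozen_remaining
        self.frozen_remaining = None

    def remaining(self, now):
        if self.frozen_remaining is not None:
            return self.frozen_remaining
        return self.target - now


timer = FreezingTimer(10 * 60 + 15, 15)   # set at 10:15 for 15 minutes
timer.stop(10 * 60 + 20)                  # 10 minutes left, frozen
timer.start(10 * 60 + 25)                 # new target is 10:35
print(timer.remaining(10 * 60 + 25))
```

This gives the same answer as adjusting the target by the stopped duration, but it never needs to remember when the stop happened.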
--
Mike A Schwab, Springfield IL USA
Where do Forest Rangers go to get away from it all?
halfmeg
2010-11-30 20:51:04 UTC
Post by Kevin Leonard
I've been trying to avoid this thread. Just thinking of the
interval timer gives me a headache.
Join the crowd. :-)
Post by Kevin Leonard
< various snips >
If we're running MVS on hardware that's a lot faster than anything
it was designed to run on, it may be necessary to zap the SRM
constant to a more "appropriate" value.
Possible, but having an almost certain failure during IPL ( actually right after NIP finishes ) when cranking up JES3 gives an opportunity to isolate what it is that is causing the SPIN LOOP.

If others can run with S/370 MP after IPL ( some of the time ) then an SRM value doesn't sound like it is the smoking gun.
Post by Kevin Leonard
Current Hercules implementation of the location 80 timer has some
They may not be contributing to the S/370 MP problem, whatever it is.
Post by Kevin Leonard
< bunch of clock stuff snipped >
This could be a variant of what Phil is seeing when he
alternately displays timers from different PSAs.
Other than what Ivan mentioned as a bug ( timer decrementing with CPU stopped ), I didn't think it right to see the timer value from the previous display, CPU00 or CPU01, appear when the CPU had been switched to the other one. The inconsistency of the display was what caught my eye, not that it has anything to do with S/370 MP problem.

I'm thinking of a small system for Ivan to test with, perhaps 3 packs or so + page packs. It ought to provide enough for the problem to be nailed down to Hercules, MVS 3.8j, SYSGEN, Shared-DASD, Channels etc...

Phil
"Fish" (David B. Trout)
2010-12-01 00:00:49 UTC
Post by halfmeg
Post by Kevin Leonard
I've been trying to avoid this thread. Just thinking of the
interval timer gives me a headache.
Join the crowd. :-)
This entire thread is giving me a headache. :)
--
"Fish" (David B. Trout)
fish-VLFb7ALKWJGGw+***@public.gmane.org






Kevin Leonard
2010-12-01 01:56:23 UTC
Post by halfmeg
Post by Kevin Leonard
Current Hercules implementation of the location 80 timer has some
They may not be contributing to the S/370 MP problem, whatever
it is.
Strictly true, because MVS isn't using the interval timer.
My recollection from when I first dug into this last year
was that some of the things I was seeing applied to the
CPU timer as well, but it's been a while.
Post by halfmeg
Other than what Ivan mentioned as a bug ( timer decrementing with
CPU stopped ), I didn't think it right to see the timer value from
the previous display, CPU00 or CPU01, appear when the CPU had been
switched to the other one. The inconsistency of the display was
what caught my eye, not that it has anything to do with S/370 MP
problem.
Right. This may be related to interval timer problems or not,
and either way may have nothing to do with your spin loop.
I think I may have just seen "interval timer" and heard
fingernails scratching across a chalkboard. The interval
timer does that to me.
Post by halfmeg
I'm thinking of a small system for Ivan to test with, perhaps
3 packs or so + page packs. It ought to provide enough for the
problem to be nailed down to Hercules, MVS 3.8j, SYSGEN,
Shared-DASD, Channels etc...
If possible, yes. The first step in identifying a problem
is to be able to reproduce it consistently.

--
halfmeg
2010-12-01 05:18:08 UTC
Post by Kevin Leonard
<snip>
Post by halfmeg
I'm thinking of a small system for Ivan to test with, perhaps
3 packs or so + page packs. It ought to provide enough for the
problem to be nailed down to Hercules, MVS 3.8j, SYSGEN,
Shared-DASD, Channels etc...
If possible, yes. The first step in identifying a problem
is to be able to reproduce it consistently.
I am no longer sure it is consistent. Running NUMCPU 2 on an old 300 MHz single-CPU laptop never skipped a beat in bringing up JES3 or JES2, several times. With SVN 5518, the instruction counts from IPL to the sysparm message were almost constant too:

NUMCPU 1 - 975,000, NUMCPU 2 - 1,300,000.

When switching over to a 1 GHz single-CPU laptop using SVN 7148, it couldn't be IPLed far enough to even get to the sysparm message. The instruction counts when the 1st device shared message appeared, however, were:

NUMCPU 1 - 257,000,000, NUMCPU 2 - 437,000,000
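
A quick sketch of the arithmetic on the counts above, in Python. The figures are copied from this post; the function name `mp_overhead` is just illustrative. Note the two runs stop at different points (sysparm message vs. 1st device shared message), so only the ratios within each run are comparable.

```python
# Instruction counts quoted above: (NUMCPU 1, NUMCPU 2) per test setup.
counts = {
    "SVN 5518, 300 MHz host": (975_000, 1_300_000),
    "SVN 7148, 1 GHz host": (257_000_000, 437_000_000),
}

def mp_overhead(numcpu1, numcpu2):
    """Return the NUMCPU 2 / NUMCPU 1 instruction-count ratio."""
    return numcpu2 / numcpu1

for label, (one, two) in counts.items():
    ratio = mp_overhead(one, two)
    print(f"{label}: NUMCPU 2 ran {ratio:.2f}x the instructions of NUMCPU 1")
```

So the second processor costs roughly a third more instructions in the older setup and roughly 70% more in the newer one.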

The test differed this time, Fish, as Hercules was shut down completely between each trial ( 3 each ).

Phil
"Fish" (David B. Trout)
2010-12-01 06:47:12 UTC
Permalink
halfmeg wrote:

[...]
Post by halfmeg
Running NUMCPU 2 on an old 300 MHz single CPU laptop
never skipped a beat in bringing up JES3 or JES2,
several times. SVN 5518 instruction count from IPL
to sysparm message were almost constant too,
[...]
Post by halfmeg
When switching over to a 1 GHz single CPU laptop using
SVN 7148, it couldn't be IPLed to even get to the sysparm
message. [...]
So it's timing related. The faster the host CPU the more likely the
problem(s) is/are to occur, whereas with a slower host CPU (one likely
matching more closely to the actual speed of the systems which MVS 3.8j was
natively run on) it works fine.

This tells me the problem is not with Hercules but rather with MVS 3.8j.
Its spin loop value is too small.

With a fast host CPU, Hercules is able to complete the spin loop much faster
than the other CPU is able to complete its work and release the lock,
whereas on a much slower host CPU, the other CPU is able to complete its
work and release the lock (that the other one is spinning on) before the
first CPU completes its spin loop and gives up.

You need to locate the code in MVS and patch it to increase the number of
times it loops when waiting on a lock (i.e. its spin loop value).

This doesn't sound like a Hercules issue.

In fact, it sounds identical to a similar issue I believe we had (still
have?) in VM/370. It was doing the same thing: using a hard coded value for
how long to spin on a lock based on the CPU model.

The correct way would be to calculate it each time the system is
initialized. I.e. STCK, loop many millions of times (some VERY large
number), STCK, divide loop value by subtracted clock values yielding number
of loops/second, and then divide or multiply that value by some other
value depending on how long (time-wise) you are willing to wait on a lock
(i.e. your "Time's Up!" value).

Do that and your problem will go away and *should* work no matter how
fast/slow the host/emulated CPU is.
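
A minimal sketch of that calibration idea, in Python rather than S/370 assembler. It assumes `time.perf_counter()` stands in for the two STCKs; `calibrate_spin_limit` and its parameters are hypothetical names, not anything that exists in MVS or Hercules.

```python
import time

def calibrate_spin_limit(max_wait_seconds, sample_loops=1_000_000):
    """Time an empty loop, then scale it to the number of iterations
    that fit into max_wait_seconds on this particular machine."""
    start = time.perf_counter()              # first "STCK"
    for _ in range(sample_loops):
        pass                                 # the loop being measured
    elapsed = time.perf_counter() - start    # second "STCK" minus the first
    loops_per_second = sample_loops / max(elapsed, 1e-9)
    # Iterations to spin before declaring the lock holder stuck.
    return max(1, int(loops_per_second * max_wait_seconds))

if __name__ == "__main__":
    print("spin limit for a 0.5 s wait:", calibrate_spin_limit(0.5))
```

The point of the design is that the limit is expressed in time, so a faster host simply gets a larger loop count instead of a shorter timeout.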

Just my 2 cents.

HTH
--
"Fish" (David B. Trout)
fish-VLFb7ALKWJGGw+***@public.gmane.org
halfmeg
2010-12-01 13:26:16 UTC
Permalink
Post by "Fish" (David B. Trout)
[...]
Post by halfmeg
Running NUMCPU 2 on an old 300 MHz single CPU laptop
never skipped a beat in bringing up JES3 or JES2,
several times. SVN 5518 instruction count from IPL
to sysparm message were almost constant too,
[...]
Post by halfmeg
When switching over to a 1 GHz single CPU laptop using
SVN 7148, it couldn't be IPLed to even get to the sysparm
message. [...]
So it's timing related.
Maybe, maybe not.
Post by "Fish" (David B. Trout)
The faster the host CPU the more likely the problem(s) is/are to
occur, whereas with a slower host CPU (one likely matching more
closely to the actual speed of the systems which MVS 3.8j was
natively run on) it works fine.
This tells me the problem is not with Hercules but rather with MVS
3.8j. Its spin loop value is too small.
We'll see as I'm pretty committed to resolving this one way or another and not by running on a 300 MHz laptop running WIN2K. :-)
Post by "Fish" (David B. Trout)
With a fast host CPU, Hercules is able to complete the spin loop
much faster than the other CPU is able to complete its work and
release the lock, whereas on a much slower host CPU, the other CPU
is able to complete its work and release the lock (that the other
one is spinning on) before the first CPU completes its spin loop and
gives up.
You need to locate the code in MVS and patch it to increase the
number of times it loops when waiting on a lock (i.e. its spin loop
value).
OS/360 had an issue like this with a timer loop and that was patched. Even though this is somewhat similar, it may not be a timer loop. It could possibly be a 2nd processor which has no access to devices due to the sysgen.

Some time back you mentioned I should go back and look over the IPL fundamentals. I looked through one of the J Ranade (?) books and started where Uniprocessor and Multiprocessor diverge, separate PSAs.
Post by "Fish" (David B. Trout)
This doesn't sound like a Hercules issue.
<snip>
...your problem will go away and *should* work no matter how
fast/slow the host/emulated CPU is.
As the customer pointed toward the hardware and the CE pointed toward the software.

IMHO, there is something wrong when:

After IPL is entered on the Hercules console, the console is non-responsive and doesn't update until whatever is happening is done, and then the instruction count magically updates from 0 to 400,000,000+ instructions. This is with MSVC WinXP 1 GHz 1 CPU.

On Linux 2.6 GHz 1 CPU after IPL is entered:
a brief moment later instruction count jumps from 0 to 218,021, then everything freezes until instruction count jumps to 70,000,000+.

This test, IPL to sysparm message, on the Linux host ( I'll have to check MSVC later ) has MAXRATES reporting:

MIPS 33.652
I/Os of 473

So a faster CPU doesn't really explain what is going on.

That is why, instead of trying to isolate the problem with a S/370 OS, I wanted some simple ( complex ) standalone program, if possible, to help isolate what is happening.

Hercules behavior shouldn't be so very different on a different Host platform, Win vs Linux.

Phil
"Fish" (David B. Trout)
2010-12-02 09:50:29 UTC
Permalink
[...]
Post by halfmeg
Post by "Fish" (David B. Trout)
So it's timing related.
Maybe, maybe not.
Trust me. It's timing related. :)


[...]
Post by halfmeg
Post by "Fish" (David B. Trout)
This tells me the problem is not with Hercules but rather
with MVS 3.8j. Its spin loop value is too small.
We'll see as I'm pretty committed to resolving this one way or
another and not by running on a 300 MHz laptop running WIN2K. :-)
Have you tried tweaking your priorities? E.g:

HERCPRIO 0
DEVPRIO -8
TODPRIO 0
CPUPRIO 0

(see source member "RELEASE.NOTES")


[...]
Post by halfmeg
after IPL is entered on the Hercules console the Hercules
console is non-responsive and it doesn't update until
whatever is happening is done and magically the instruction
count updates from 0 to 400,000,000+ instructions. This
is with MSVC WinXP 1 GHz 1 CPU.
Check your Hercules priorities as well as your Windows priorities
(Win32PrioritySeparation = "Applications" vs. "Background Services"):

http://support.microsoft.com/kb/259025
http://technet.microsoft.com/en-us/library/cc976120.aspx

Also try an earlier SVN snapshot version. Go back 6-12 months and try that.
If the problem goes away then try one 3-6 months back, etc, until you find
the version where the problem began and then report to us which version that
is. The SVN version of Hercules is *vastly* different from the current 3.07
version. A *huge* number of significant changes have been made in the past 9
months. I'm not kidding.
Post by halfmeg
On Linux 2.6 GHz 1 CPU after IPL is entered: a brief
moment later instruction count jumps from 0 to 218,021,
then everything freezes until instruction count jumps
to 70,000,000+.
Wow.
Post by halfmeg
This test, IPL to sysparm message, on the Linux host
MIPS 33.652
I/Os of 473
So a faster CPU doesn't really explain what is going on.
I wouldn't place too much trust upon the accuracy of the reported MIPS rate.
Post by halfmeg
That is why, instead of trying to isolate the problem with
a S/370 OS, I wanted some simple (complex) standalone program,
if possible, to help isolate what is happening.
Hercules behavior shouldn't be so very different on a different
Host platform, Win vs Linux.
The two operating systems are VERY different from one another Phil.

Now granted, Hercules shouldn't be temporarily freezing up like that and the
sudden HUGE jump in the instruction count certainly doesn't sound right
(doesn't sound healthy), but as I said *many* changes have been made to
Hercules in the past 9 months and there's no telling which one introduced
the new behavior.

The only way to find out is to do a "binary search" of the snapshot versions
to try and isolate which change introduced the problem.
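
A rough sketch of that "binary search", in Python. It assumes the oldest snapshot is good, the newest is bad, and the problem stays once introduced. `first_bad_revision` and `is_bad` are hypothetical names; `is_bad` would really mean "build that revision, IPL, and see whether it misbehaves". The revision range reuses the SVN numbers mentioned in this thread, and 6500 is a made-up culprit purely for illustration.

```python
def first_bad_revision(revisions, is_bad):
    """Return the first revision for which is_bad() is True.

    revisions must be sorted oldest to newest, with revisions[0] known
    good and revisions[-1] known bad.
    """
    lo, hi = 0, len(revisions) - 1      # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid]):
            hi = mid                    # problem is at mid or earlier
        else:
            lo = mid                    # problem came in after mid
    return revisions[hi]

# Pretend the problem arrived in revision 6500 (made up).
snapshots = list(range(5518, 7149))
print(first_bad_revision(snapshots, lambda rev: rev >= 6500))
```

With some 1,600 revisions between the two endpoints that is about a dozen builds and test IPLs rather than hundreds.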

Yes that's a royal PITA I know. I went through it myself only just a few
days ago to try and isolate a problem introduced by another developer and it
wasn't fun.

But you do what you have to in order to identify the problem. Otherwise
we're just stumbling around in the dark guessing and that's never fun (and
very unproductive besides).
--
"Fish" (David B. Trout)
fish-VLFb7ALKWJGGw+***@public.gmane.org
halfmeg
2010-12-02 14:21:34 UTC
Permalink
<various snips>
Trust me. It's timing related. :)
If you mean it is related to internal MVS stuff, I'm not convinced yet. If you mean it is related to the ability of Hercules threads to execute on the Host system without 'running away' from associated threads ( CPU00 and CPU01 threads shouldn't lock each other out even on a 1 CPU Host, should they ? ) or being impacted by other processes/applications running on the Host, then I'm more inclined to agree.

No matter which or what there is some type of issue.
HERCPRIO 0
DEVPRIO -8
TODPRIO 0
CPUPRIO 0
(see source member "RELEASE.NOTES")
No, I tend to leave stuff alone if it is working ( on everything else ) and when I don't understand it.
Also try an earlier SVN snapshot version. Go back 6-12 months and
try that. If the problem goes away then try one 3-6 months back,
etc, until you find the version where the problem began and then
report to us which version that is. The SVN version of Hercules is
*vastly* different from the current 3.07 version. A *huge* number of
significant changes have been made in the past 9 months. I'm not
kidding.
I know you aren't.
Post by halfmeg
This test, IPL to sysparm message, on the Linux host
MIPS 33.652
I/Os of 473
So a faster CPU doesn't really explain what is going on.
I wouldn't place too much trust upon the accuracy of the reported MIPS rate.
Same test on WinXP 1 GHz 1 CPU Host with SVN 7143 gave:

MIPS 33+ and, IIRC, around 130 I/Os, which was a little strange. More than half the CPU Hz horsepower yet less I/O ( maybe it didn't get as far - see below ).

Same test as above but with SVN 5518 gave:

MIPS 9 or so, don't remember I/Os.
The two operating systems are VERY different from one another Phil.
Right, which was shown in a test today between Linux SVN 7140 and MSVC SVN 7140.

Linux Hercules allows IPL of the system.
MSVC Hercules gives the CONT or WAIT DASD issue.
The only way to find out is to do a "binary search" of the snapshot
versions to try and isolate which change introduced the problem.
Which I was hoping to avoid as there seem to be multiple issues.

Phil
somitcw
2010-12-02 15:44:28 UTC
Permalink
Post by "Fish" (David B. Trout)
[...]
Post by halfmeg
Post by "Fish" (David B. Trout)
So it's timing related.
Maybe, maybe not.
Trust me. It's timing related. :)
You are correct. That was my guess, but before
testing your priority changes, I wasn't 100% certain.
Post by "Fish" (David B. Trout)
[...]
Post by halfmeg
Post by "Fish" (David B. Trout)
This tells me the problem is not with Hercules but rather
with MVS 3.8j. Its spin loop value is too small.
We'll see as I'm pretty committed to resolving this
one way or another and not by running on a 300 MHz
laptop running WIN2K. :-)
I tested and posted my wild guesses but they did not
speed MVS MP IPL by much.
Post by "Fish" (David B. Trout)
HERCPRIO 0
DEVPRIO -8
TODPRIO 0
CPUPRIO 0
Setting TODPRIO and CPUPRIO to the same value makes
much more sense than the default. You did not give a
SRVPRIO so I guessed. What works well for me is:

HERCPRIO 0 # attempt#1 255 # Default ???
CPUPRIO 0 # attempt#1 127 # Default 15
DEVPRIO -8 # attempt#1 64 # Default 8
SRVPRIO 0 # attempt#1 32 # Default 4
TODPRIO 0 # attempt#1 16 # Default 0
Post by "Fish" (David B. Trout)
(see source member "RELEASE.NOTES")
It indicates that Hercules 3.05 changed to priority
defaults that may/will cause performance issues and need
to be overridden to the Hercules 3.04 values.
Post by "Fish" (David B. Trout)
[...]
Post by halfmeg
after IPL is entered on the Hercules console the Hercules
console is non-responsive and it doesn't update until
whatever is happening is done and magically the instruction
count updates from 0 to 400,000,000+ instructions. This
is with MSVC WinXP 1 GHz 1 CPU.
Check your Hercules priorities as well as your Windows
priorities too (Win32PrioritySeparation = "Applications"
http://support.microsoft.com/kb/259025
http://technet.microsoft.com/en-us/library/cc976120.aspx
Too much reading for me.

You already solved the issue.
Post by "Fish" (David B. Trout)
Also try an earlier SVN snapshot version. Go back 6-12
months and try that.
If the problem goes away then try one 3-6 months back,
etc, until you find the version where the problem began
and then report to us which version that is. The SVN
version of Hercules is *vastly* different from the
current 3.07 version. A *huge* number of significant
changes have been made in the past 9 months. I'm not
kidding.
Post by halfmeg
On Linux 2.6 GHz 1 CPU after IPL is entered: a brief
moment later instruction count jumps from 0 to 218,021,
then everything freezes until instruction count jumps
to 70,000,000+.
Wow.
I didn't have a jump like that but MVS MP IPL would
exceed a billion instructions.
Post by "Fish" (David B. Trout)
Post by halfmeg
This test, IPL to sysparm message, on the Linux host
MIPS 33.652
I/Os of 473
So a faster CPU doesn't really explain what is going on.
I wouldn't place too much trust upon the accuracy of
the reported MIPS rate.
Post by halfmeg
That is why, instead of trying to isolate the problem
with a S/370 OS, I wanted some simple (complex) standalone
program, if possible, to help isolate what is happening.
Hercules behavior shouldn't be so very different on a
different Host platform, Win vs Linux.
The two operating systems are VERY different from one
another Phil.
Now granted, Hercules shouldn't be temporarily freezing
up like that and the sudden HUGE jump in the instruction
count certainly doesn't sound right (doesn't sound healthy),
but as I said *many* changes have been made to Hercules in
the past 9 months and there's no telling which one
introduced the new behavior.
The only way to find out is to do a "binary search" of
the snapshot versions to try and isolate which change
introduced the problem.
Yes that's a royal PITA I know. I went through it myself
only just a few days ago to try and isolate a problem
introduced by another developer and it wasn't fun.
But you do what you have to in order to identify the
problem. Otherwise we're just stumbling around in the
dark guessing and that's never fun (and very unproductive
besides).
--
"Fish" (David B. Trout)
As far as I know, the MVS MP bugs have been found and
circumventions are available. Since the device CONT
reply is said to be on MVS UP and MP, it is not related.

How about we close this thread and move MVS specific
questions to the H390-MVS group?
Mike Schwab
2010-12-02 18:37:01 UTC
Permalink
On Thu, Dec 2, 2010 at 3:50 AM, "Fish" (David B. Trout)
<fish-6N/dkqvhA+***@public.gmane.org> wrote:
<deleted>
Post by "Fish" (David B. Trout)
The SVN version of Hercules is *vastly* different from the current 3.07
version. A *huge* number of significant changes have been made in the past 9
months. I'm not kidding.
I think the next version should be called Hercules 4.0
--
Mike A Schwab, Springfield IL USA
Where do Forest Rangers go to get away from it all?
halfmeg
2010-12-01 14:38:19 UTC
Permalink
Post by Tony Harminc
<snip>
You need to locate the code in MVS and patch it to increase the number
of times it loops when waiting on a lock (i.e. its spin loop value).
<snip>
So in the meantime, does anyone recognize where SRM might be in the list below, perhaps near the end? This is the CCW trace of the res pack after IPL and before the SIGP on the CPUs takes place ( and the instruction count jumps so drastically ).

IEANUC01VNIPM.ÿÿ
A .SIEANUC01.À.&.:'
.SIEANUC01.À.&.:
1 IEAVEPC UZ36571'
9 IEAVEDS0 UZ30139'
0 .4IECVXTPTR03700'
0 IEAVESVC UZ86400'
1 IECVEXCP UY09531'
7 IECVEXPR05/07/87'
0 IEAVLK00 UZ48820'
4 .0-»IEAVLK01 UZ4'
5 IECVOTBL12/30/85'
2 .00».IEEVEXSN 82'
2 .00».IECVHREC 82'
5 .0õ»IEAVEES UZ5'
2 .0§0§0.00ÂIECVRS'
0 .00Â.È.IECVESIO '
3 .00º.IECVRRSV 83'
3 ð.}....0..IECTAT'
7 .00..IEFAB438 7'
0 .00.IEDQATTN.ã.-'
8 IEAVBK0010/04/78'
3 .¢.õ.0õ IECTSVC'
1 .00»IEAVEVT0 UZ1'
7 IEAVPREF UZ33317'
7 .0.00÷...IEAVNIP'
8 IEAVEVAL UZ52628'
0 .0.00.IEA0TI00Ã"ð'
0 .-.0-.IEAVMODER0'
0 IEAVTEST........'
8 .0õ»IEAVTRCE UZ8'
8 IEAVETCL UZ29328'
7 IEAVLK02 UZ69157'
2 IEAVEIO UZ28752'
4 IEAVEEXT UZ24824'
8 .00..IEAVEAC0 8'
5 IEAVERES UZ51835'
6 IEAVEEXPR0300006'
0 IEAVESC0R0200010'
9 IEAVELK UZ37599'
6 .00»IEAQPGTM UZ6'
3 IEAVRT02R0300003'
7 IEAVLK03 UZ69157'
3 ð.}..?.0-ÂIEAVVC'
2 .0õ»IEAVVCRX UZ2'
4 .ð.0ð»IEAVPRT0 U'
3 IEAVGM00 UZ63173'
3 IEAVGM01 UZ63173'
3 IEASMFGF UZ63173'
9 .00..Í.IEAVGCASR'
0 .00.IEAVBLDPR020'
0 .00.IEAVDELPR020'
0 .00.IEAVGTCLR030'
1 .00»IEAVFRCL UZ1'
0 .00».Â.IEAVCKEY '
7 .00..IEAVCKRR 7'
8 .00..IEAVAR02 8'
2 .00».IEAVTRTM 82'
9 .00º.IEAVTRTS UZ'
0 .00..+.IEAVTRTR '
9 .00».IEAVTRTH 79'
0 .00».IEAVTSDX 80'
1 .00».IEAVTSBP 81'
1 .00».IEAVTRER 81'
2 .00»IEAVTPER UZ2'
9 IEAVPFTE UZ18479'
4 IEAVPCB UZ53364'
2 .00»IEAVINV UZ2'
3 .00»IEAVRFR UZ3'
5 IEAVRCV UZ28905'
3 .00»IEAVRCF UZ3'
7 .00..IEAVTERM 7'
1 IEAVSQA UZ35051'
1 IEAVEQR UZ25071'
2 IEAVPSI UZ34542'
6 IEAVRELS UZ27266'
1 .00»IEAVITAS UZ1'
1 .00»IEAVDLAS UZ1'
0 .00Â.È.IEAVCSEG '
5 IEAVFP UZ28905'
4 .0.00»IEAVTRV U'
7 .00..IEAVPIX 7'
1 IEAVGFA UZ24591'
1 IEAVGFAD UZ24591'
3 IEAVDSEGR0300003'
0 .00.IEAVPIOIR030'
3 .00»IEAVPIOP UZ3'
0 IEAVIOCPR03700 '
3 IEAVFREE UZ33993'
9 IEAVOUT UZ21199'
9 IEAVFXLD UZ21199'
1 .00»IEAVAMSI UZ1'
9 IEAVRSM UZ28909'
3 .00»IEAVSWIN UZ3'
3 .00»IEAVSOUT UZ3'
0 .00.IEAVSWPCR030'
1 .00»IEAVPRSB UZ1'
2 .00»IEESTPRS UZ2'
9 .00º.IEEVSTOP UZ'
5 .00»IEEVDCCR UZ5'
9 .00».IEEVLDWT 79'
0 .õ.0õ.IEAVGPRRR0'
0 .00.IEAVCARRR030'
9 .00º.IEAVGFRR UZ'
0 .00»...IEAVESPR '
0 .È.0È.IEAVEADVR0'
0 .00È.Î.IEAVECBV '
0 .00Â.Ç.IEAVEVRR '
0 .{.0{.IEAVEDSRR0'
0 .00Å...IEAVEPDR '
0 .00»...IEAVESCR '
0 .00Å...IEAVEEER '
4 .õ.0õ»IEAVEIPR U'
8 .00».IEAVTACR 78'
0 .00Â.Ç.IEAVTCR1 '
2 .00»IEAVERI UZ2'
2 .00»IEAVERP UZ2'
3 .00»IEAVEDR UZ3'
0 .0õ.IEAVEXS R020'
4 .&.0&»IECVDAVV U'
5 .00Ã***@.004IE'
0 .0.00».IECVIRST '
2 .00».IECVHDET 82'
8 .00»IECVMAP UZ8'
8 .00».IECVBRSV 78'
8 .00».IECVRDIO 78'
1 .00».IECVURDT 81'
9 .00».IECVURSV 79'
3 IECVDDT0 UZ90163'
5 jœß².çõ.m¿ß..4IE'
0 o œ..4IECVXGRTR0'
5 Ã"Ï...0{ÌIECTCATN'

Phil
halfmeg
2010-12-01 16:07:39 UTC
Permalink
Post by halfmeg
So in the mean time, does anyone recognize where SRM might be in one
of the below ...
This looks like a place to start:

http://www.mainframe.eu/mvs38/asm/Master%20Sheduler%20(IEE)/IEESTPRS

as it is checking out TOD clock and CPUs.

Phil
somitcw
2010-12-01 16:31:30 UTC
Permalink
Post by halfmeg
So in the mean time, does anyone recognize where
SRM might be in one of the below ...
SRM might not be directly related.
http://www.mainframe.eu/mvs38/asm/\
Master%20Sheduler%20%28IEE%29/IEESTPRS
as it is checking out TOD clock and CPUs.
Phil
The problems seem to be more than SPIN loop problems but
a better place to start looking for SPIN issues may be:

SYS4.MVS38J.SOURCE(IEEVEXSN)
TITLE 'IEEVEXSN - EXCESSIVE SPIN NOTIFICATION'
From old MVS 3.8j source before any PTFs includes:

CHECK FOR RUNNING AS A GUEST VIRTUAL MACHINE AND FOR
TIME-OF-DAY CLOCK SYNCHRONIZATION. IF EITHER CASE IS PRESENT
RETURN TO CALLER(RC=0). THE REASONS FOR DOING THIS ARE: (1)
RUNNING AS A GUEST VIRTUAL MACHINE DISTORTS THE TIMING
RELATION BETWEEN THE TWO HALVES OF AN MP, (2) THE HUMAN
INTERVENTION NEEDED DURING CLOCK SYNCING MAY CAUSE UNNEEDED
SPIN LOOP TIMEOUTS FROM THE RISGNL ROUTINE.) @VS49603
halfmeg
2010-12-01 19:06:04 UTC
Permalink
Post by somitcw
Post by halfmeg
So in the mean time, does anyone recognize where
SRM might be in one of the below ...
SRM might not be directly related.
Right, thought about that after I posted. Was thinking about someone else's post mentioning SRM.
Post by somitcw
Post by halfmeg
http://www.mainframe.eu/mvs38/asm/\
Master%20Sheduler%20%28IEE%29/IEESTPRS
Post by somitcw
Post by halfmeg
as it is checking out TOD clock and CPUs.
Phil
The problems seem to be more than SPIN loop problems but
SYS4.MVS38J.SOURCE(IEEVEXSN)
TITLE 'IEEVEXSN - EXCESSIVE SPIN NOTIFICATION'
CHECK FOR RUNNING AS A GUEST VIRTUAL MACHINE AND FOR
TIME-OF-DAY CLOCK SYNCHRONIZATION. IF EITHER CASE IS PRESENT
RETURN TO CALLER(RC=0). THE REASONS FOR DOING THIS ARE: (1)
RUNNING AS A GUEST VIRTUAL MACHINE DISTORTS THE TIMING
RELATION BETWEEN THE TWO HALVES OF AN MP, (2) THE HUMAN
INTERVENTION NEEDED DURING CLOCK SYNCING MAY CAUSE UNNEEDED
Yeah, but that one isn't available online. :-)

On the other hand, the CLOCK SYNCING deserves looking at because when the 3270 console is absent but the 3215 console is attached, the SPIN LOOP message can't be displayed and turns into a WAIT STATE 091. Think that wait state mentioned RISGNL .

The one I pointed to is available online, and contains such good stuff as:

* MAXCPUS='10'X; /* SET MAXCPUS TO NUMBER OF CPUS 00156000
* THAT COULD POSSIBLY BE ONLINE */ 00157000
MVI MAXCPUS(GLOSAPTR),X'10' 0079 00158000

X'10' is 16, so the above code doesn't look like it limits things to 2 CPUs. Maybe somewhere else it does.

* /*****************************************************************/ 00318000
* /* */ 00319000
* /* SET RESTART NEW PSW TO POINT TO MY RESTART FLIH WITH DAT OFF */ 00320000
* /* */ 00321000
* /*****************************************************************/ 00322000
* 0118 00323000
* RSTRMPS='000C0000'X; /* @ZD03005*/ 00324000
MVC RSTRMPS(4,PSAPTR),@CB02295 0118 00325000
* RSTRTIC=ADDR(RESTFLIH); /* @ZD03005*/ 00326000
LA @09,RESTFLIH 0119 00327000
ST @09,RSTRTIC(,PSAPTR) 0119 00328000

Believe Ivan mentioned possible DAT on/off (?) for PSA . What would happen if DAT is on and PSA page for an idle CPU was paged out, then needed when CPU got work to do ?


* /*************************************************************/ 00360000
* /* */ 00361000
* /* THE FOLLOWING INFORMATION WILL BE STORED INTO THE STATUS */ 00362000
* /* SA FROM THE PSA */ 00363000
* /* */ 00364000
* /*************************************************************/ 00365000
* 0125 00366000
* LOWSAVE=LOWCORE; /* SAVE LOW CORE INFO */ 00367000
MVC LOWSAVE(24,STATUSAD),LOWCORE(PSAPTR) 0125 00368000
* NEWPSWS=MACHNEW; /* MCK NEW, SVC NEW, PROG CK NEW */ 00369000
MVC NEWPSWS(24,STATUSAD),MACHNEW(PSAPTR) 0126 00370000
* OLDPSWS=MACHOLD; /* MCK OLD, SVC OLD, PROG CK OLD */ 00371000
MVC OLDPSWS(24,STATUSAD),MACHOLD(PSAPTR) 0127 00372000
* GENERATE; 0128 00373000
* 0128 00374000
STD 0,FLPTREG0(STATUSAD) STORE FLOATING POINT REG 0 00375000
STD 2,FLPTREG2(STATUSAD) STORE FLOATING POINT REG 2 00376000
STD 4,FLPTREG4(STATUSAD) STORE FLOATING POINT REG 4 00377000
STD 6,FLPTREG6(STATUSAD) STORE FLOATING POINT REG 6 00378000
STCTL 0,15,CONTREGS(STATUSAD) STORE CONTROL REGS 00379000
STPT CPUTIMER(STATUSAD) STORE CPU TIMER 00380000
STCKC CLKCOMP(STATUSAD) STORE CLOCK COMPARATOR 00381000


So 2 CLOCK things to keep track of in S/370 mode. IIRC it is the interval timer at location 80 that goes away with XA mode and beyond; the CPU TIMER and CLOCK COMPARATOR both stay.


*/*** START OF RESTFLIH **********************************************/ 00417000
*/* */ 00418000
*/* THE FIRST CPU THROUGH THIS ROUTINE WILL ISSUE RESTART TO ALL OF */ 00419000
*/* THE OTHER ALIVE CPU'S. EACH OF THEM WILL RUN A PORTION OF THE */ 00420000
*/* INTERRUPT HANDLER AS THEY ARE RESTARTED. FIELDS ARE MODIFIED TO */ 00421000
*/* PRESERVE JOB STEP TIMING, SAVED STATUS IS RESTORED AND CONTROL */ 00422000
*/* IS GIVEN BACK TO THE PROGRAMS INTERRUPTED BY THE STOP ON EACH */ 00423000
*/* OF THE CPU'S. THIS FLIH IS ENTERED WITH DAT OFF. */ 00424000
*/* */ 00425000
*/********************************************************************/ 00426000

First Level Interrupt Handler (?) DAT OFF for each CPU.


* /*************************************************************/ 00783000
* /* */ 00784000
* /* THE FOLLOWING SEGMENT OF CODE WILL COMPUTE THE AMOUNT OF */ 00785000
* /* TIME THE CPU('S) WERE IN THE STOPPED STATE AND ADJUST THE */ 00786000
* /* LCCADTOD FIELD FOR THE DISPATCHER. THE CHECK WILL BE MADE */ 00787000
* /* TO DETERMINE IF THE DISPATCHER WAS UPDATEING JOB STEP */ 00788000
* /* TIMING AT THE TIME THE CPU WAS STOPPED. IF SO, I WILL SET */ 00789000
* /* THE DISPATCHER BACK TO THE BEGINNING OF THE JST */ 00790000
* /* COMPUTATION CODE. */ 00791000
* /* */ 00792000
* /*************************************************************/ 00793000
* 0209 00794000
* DTOD=DTOD+(HIAFTER-HIBEFOR);/* SUBTRACT TODBEFOR FROM 0209 00795000
* TODAFTER AND ADD IT TO 0209 00796000
* LCCADTOD */ 00797000


More clock adjustment stuff.


* /*****************************************************************/ 01131000
* /* */ 01132000
* /* CLEANUP WILL BE ENTERED AT THIS POINT FOR NORMAL ENTRY. IF ALL*/ 01133000
* /* OF THE CPU'S HAVE NOT SET THEIR RESPECTIVE BITS IN THE */ 01134000
* /* COMPLETE MASK, WE WILL LOOP HERE FOR APPROXIMATELY 2 MILLION */ 01135000
* /* INSTRUCTIONS WAITING FOR THEM TO COMPLETE */ 01136000
* /* */ 01137000
* /*****************************************************************/ 01138000


Loop of 2 Million for each processor. Even with NUMCPU 2 the Hercules console displays SIGP for 8 processors IIRC. 8 is hardcoded in Hercules compile IIRC. Some of those 'extra' instructions could be here looping on non-existing CPUs.

Phil - knowing nothing about any of the code contained in either of the assembler members mentioned above or Hercules
ikj1234i
2010-12-01 23:35:16 UTC
Permalink
Post by halfmeg
Believe Ivan mentioned possible DAT on/off (?) for PSA . What
would happen if DAT is on and PSA page for an idle CPU was paged out,
[snip]
IIRC, this scenario "should not occur" - the PSA's are located in page-fixed storage (core which can't get paged out).

Max
Ivan Warren
2010-12-02 00:35:19 UTC
Permalink
Post by ikj1234i
IIRC, this scenario "should not occur" - the PSA's are located in page-fixed storage (core which can't get paged out).
Uh ?

Page 0 of a virtual address space is as eligible as any other page to
cause a paging exception under DAT as any other page !

--Ivan
Ivan Warren
2010-12-02 00:37:36 UTC
Permalink
Post by Ivan Warren
Page 0 of a virtual address space is as eligible as any other page to
cause a paging exception under DAT as any other page !
Read : Translation Exception

--Ivan
somitcw
2010-12-02 02:51:42 UTC
Permalink
Post by Ivan Warren
Post by Ivan Warren
Page 0 of a virtual address space is as eligible as
any other page to cause a paging exception under DAT
as any other page !
Read : Translation Exception
--Ivan
Most MVS common storage can be paged out but a
prefix page is constantly in use so even if MVS
would allow it to be paged out, it would never be.
If the real address ever changed, the PCCA control
block would be wrong and the prefix register would
be wrong. The prefix page will never be paged out.

My system may be different than the others but
the MAXCPU 2 with NUMCPU 2 issue I have with MVS
all looks like a priority problem. When I run a
high CPU Windows process as Windows priority LOW,
I don't expect it to interfere with Hercules or MVS
in any way but the priority LOW processes prevent
MVS MP from functioning. Extremely high CPU cycles
with low and slow I/O and various hangs and wait
states like 05A, 091, and 092.

My system is Intel Dual 1.6GHz with 32-bit Vista.
Hercules is 3.07.7133 but the issues have existed for
several years.

Investigating Hercules priority, I first tried to
display them:

HHC00013I Herc command: 'hercprio'
HHC01455E Invalid number of arguments for 'hercprio'

Some others did display but the numbers seem a bit
low to be believed:

cpuprio 15
devprio 8
srvprio 4
todprio 0

Next I actually looked at the Hercules console:

HHC00100I Thread id 00001A80, prio 15, name 'Processor CP00' started
HHC00100I Thread id 00001600, prio 0, name 'Timer' started
HHC00100I Thread id 00001808, prio 0, name 'Console connection' started
HHC00100I Thread id 0000182C, prio 0, name 'Control panel' started
HHC00100I Thread id 00001DE4, prio 0, name 'Hercules Automatic Operator' started
HHC00100I Thread id 00001AAC, prio 0, name 'Read-ahead thread-1' started
HHC00100I Thread id 000019F4, prio 0, name 'Read-ahead thread-2' started
HHC00100I Thread id 000010B4, prio 0, name 'Writer thread-1' started
HHC00100I Thread id 00001B30, prio 0, name 'Garbage collector' started
HHC00100I Thread id 00001BE4, prio 0, name 'Writer thread-2' started

The above might explain why Windows priority LOW processes
can impact MVS and be one of the things preventing MVS MP?
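
For anyone wanting to check their own log the same way, here is a small sketch in Python that pulls the thread names and priorities out of the HHC00100I messages. It assumes only the message layout shown above; `thread_priorities` is a hypothetical helper, not part of Hercules.

```python
import re

# Layout taken from the HHC00100I lines above.
THREAD_LINE = re.compile(
    r"HHC00100I Thread id (\w+), prio (-?\d+), name '([^']+)' started"
)

def thread_priorities(log_text):
    """Return {thread name: priority}, tolerating lines wrapped by the mailer."""
    joined = " ".join(log_text.split())     # undo line wrapping
    return {name: int(prio) for _tid, prio, name in THREAD_LINE.findall(joined)}

sample = """HHC00100I Thread id 00001A80, prio 15, name 'Processor CP00' started
HHC00100I Thread id 00001600, prio 0, name 'Timer' started
HHC00100I Thread id 00001DE4, prio 0, name 'Hercules Automatic
Operator' started"""

for name, prio in thread_priorities(sample).items():
    print(f"{prio:>3}  {name}")
```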

How do I display HERCPRIO ?

How do I change all of the prio 0 tasks?
Changing hercprio from unknown to a number and changing
cpuprio, devprio, srvprio, and todprio doesn't change
the prio 0 displays.

The spin loop counters and distorted MP timers under VM
should still be examined. Both the spin loop counters and
distorted VM could be zapped.

I haven't disassembled IEEVEXSN but suspected that the
zap needed might be something like:

untested
AOSC5
untested
SYS1.NUCLEUS
untested
NAME IEANUC01 IEEVEXSN
untested
VER 0116 9140A080,47E09126 TM CVTFLGBT,CVTVME BNO @RF00064
untested
REP 0116 9140A080,47009126 TM CVTFLGBT,CVTVME BC 0,@RF00064
untested
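
To spell out what the VER/REP pair changes, here is a small Python sketch that decodes the two BC instructions. Opcode X'47' is BRANCH ON CONDITION (RX format) with the mask in the high nibble of the second byte. `decode_bc` and `can_branch` are hypothetical helper names. The sketch says nothing about whether the offset is right for any particular PTF level of IEEVEXSN.

```python
BC_OPCODE = 0x47

def decode_bc(hex_word):
    """Split a 4-byte RX-format BC instruction into its fields."""
    raw = bytes.fromhex(hex_word)
    if len(raw) != 4 or raw[0] != BC_OPCODE:
        raise ValueError("not a 4-byte BC instruction")
    return {
        "mask": raw[1] >> 4,
        "index": raw[1] & 0x0F,
        "base": raw[2] >> 4,
        "disp": ((raw[2] & 0x0F) << 8) | raw[3],
    }

def can_branch(mask, cc):
    """BC branches when the mask bit selected by the condition code is 1."""
    return bool(mask & (0x8 >> cc))

before = decode_bc("47E09126")   # VER line: BNO, mask 14
after = decode_bc("47009126")    # REP line: mask 0

for cc in range(4):
    print(f"cc={cc}: before={can_branch(before['mask'], cc)} "
          f"after={can_branch(after['mask'], cc)}")
```

The original BNO takes the branch unless TM found the CVTVME bit on (cc=3). With mask 0 the branch is never taken, so the code always falls through as if the VM bit were set.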

The zap shouldn't hurt and probably won't fix many
of the issues but would be a good excuse to do a full
backup copy of the folder that has MVS in it.

I haven't looked at zapping the spin loop counters
and assUme that someone else is testing them.
Ivan Warren
2010-12-02 03:56:45 UTC
Permalink
Post by somitcw
Most MVS common storage can be paged out but a
prefix page is constantly in use so even if MVS
would allow it to be paged out, it would never be.
If the real address ever changed, the PCCA control
block would be wrong and the prefix register would
be wrong. The prefix page will never be paged out.
Here you go again with this specific OS ! (We've done this before haven't
we ! ah ah)

hercules isn't about MVS ! It's an implementation of a computer
architecture !

(Then again, I know.. I hear what you're saying... MVS doesn't page out
page 0.. But Other OSes do (VM does !))

But this is the hercules group.. in hercules-mvs, I would understand
that.. In the main hercules group, someone saying Page 0 cannot be paged
out just doesn't seem right.

--Ivan
somitcw
2010-12-02 05:38:31 UTC
Permalink
Post by Ivan Warren
Post by somitcw
Post by Ivan Warren
Post by ikj1234i
IIRC, this scenario "should not occur" - the PSA's
are located in page-fixed storage (core which can't
get paged out).
Max
Page 0 of a virtual address space is as eligible as
any other page to cause a paging exception under DAT
as any other page !
Read : Translation Exception
--Ivan
Most MVS common storage can be paged out but a
prefix page is constantly in use so even if MVS
would allow it to be paged out, it would never be.
If the real address ever changed, the PCCA control
block would be wrong and the prefix register would
be wrong. The prefix page will never be paged out.
Here you go again with this specific OS ! (We've done
this before haven't we ! ah ah)
When working on specific Hercules issues with specific
operating systems there is sometimes a mention of a unique
operating system trait like prefix areas are page fixed
in the operating system being discussed.

If hardware emulation is not working with software,
both need to be looked at, even if the software is MVS.
Post by Ivan Warren
hercules isn't about MVS ! It's an implementation of a
computer architecture !
Which hasn't ever worked properly with MVS MP.
Looking at Hercules without looking at MVS will not fix
the interface between them.
Post by Ivan Warren
(Then again, I know.. I hear what you're saying... MVS
doesn't page out page 0.. But Other OSes do (VM does !))
You mentioned an operating system. Actually two.
Post by Ivan Warren
But this is the hercules group.. in hercules-mvs,
I would understand that.. In the main hercules group,
someone saying Page 0 cannot be paged out just doesn't
seem right.
--Ivan
The discussion was MVS PSA, not all page zeroes.
MVS keeps virtual page zero, real and real prefixed page
zero, and absolute page zero in memory.

What was said was the right thing to say for the
issue being discussed. Since MVS, yes MVS, does not page
a prefix area, that eliminates one of the possible causes
of the MVS MP issues.

P.S. After putting on the zap to tell MVS it's running
under VM, now here I go mentioning two operating systems,
and changing some Hercules priorities, I haven't had
another failure. It still crawls dog slow with MVS MP
but it does run. I also didn't test the two changes
separately so don't know which did what if anything.

//HERC01Z JOB (XXXXXXXX,XXXX,1439,9999,9999),ZAPSPIN-HERC01,
// CLASS=A,MSGCLASS=C,NOTIFY=HERC01
//*
//* Treat Hercules like VM for SPIN loop recovery.
//*
//* From old MVS 3.8j source before any PTFs
//* 'IEEVEXSN - Excessive SPIN Notification' says:
//*
//* Check for running as a guest Virtual Machine and for
//* time-of-day clock synchronization. If either case is present,
//* return to caller(rc=0). The reasons for doing this are:
//* (1) Running as a guest Virtual Machine distorts the timing
//* relation between the two halves of an mp,
//* (2) The human intervention needed during clock syncing may
//* cause unneeded spin loop timeouts from the RISGNL routine.)
//* @VS49603
//*
//ZAPSPIN EXEC PGM=AMASPZAP
//SYSPRINT DD SYSOUT=*
//* AOSC5
NAME IEANUC01 IEEVEXSN
VER 0116 9140A08047E09126 TM CVTFLGBT,CVTVME BNO @RF00064
REP 0116 9140A08047009126 TM CVTFLGBT,CVTVME BC 0,@RF00064
//SYSLIB DD DISP=SHR,DSN=SYS1.NUCLEUS <=== IPL . CLPA okay
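
For anyone reading the VER/REP lines above without an opcode card handy, here is a rough sketch (Python, not part of the original post) that decodes the bytes. The helper names decode_si, decode_rx and branch_taken are made up for this illustration. The only thing the REP changes is the BC mask, from X'E' (BNO) to X'0' (never branch), so the instruction after the TM always falls through. Going by the comments quoted in the JCL, the fall-through path is presumably the "running under VM" return, but that is an assumption taken from the post, not something this sketch verifies.

```python
# Illustrative decode of the two S/370 instructions touched by the zap.
# Helper names are hypothetical; only the hex strings come from the post.

def decode_si(hexstr):
    """SI format: opcode(8) I2(8) B1(4) D1(12)."""
    word = int(hexstr, 16)
    return {"opcode": word >> 24, "imm": (word >> 16) & 0xFF,
            "base": (word >> 12) & 0xF, "disp": word & 0xFFF}

def decode_rx(hexstr):
    """RX format: opcode(8) M1(4) X2(4) B2(4) D2(12)."""
    word = int(hexstr, 16)
    return {"opcode": word >> 24, "mask": (word >> 20) & 0xF,
            "index": (word >> 16) & 0xF, "base": (word >> 12) & 0xF,
            "disp": word & 0xFFF}

def branch_taken(mask, cc):
    """BC branches when the mask bit matching the condition code is one."""
    return bool(mask & (8 >> cc))

tm = decode_si("9140A080")      # TM  X'080'(R10),X'40'
before = decode_rx("47E09126")  # BC  14,X'126'(,R9)  i.e. BNO
after = decode_rx("47009126")   # BC  0,X'126'(,R9)   i.e. never branches

for cc in range(4):
    print(cc, branch_taken(before["mask"], cc), branch_taken(after["mask"], cc))
```

With a single-bit TM mask the condition code is either 0 (bit off) or 3 (bit on), so before the zap the branch is taken only when the tested bit is off; after the zap it is never taken.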

CPUSERIAL 000611
CPUMODEL 0380
MAINSIZE 160
XPNDSIZE 0
AUTOINIT OFF # Require tape volumes to already exist
AUTOMOUNT ADD tapes
. . .
MOUNTED_TAPE_REINIT DISALLOW
MAXCPU 3
NUMCPU 2
# NUMCPU 1
LOADPARM ........
PANTITLE "MVS 3.8j Modified"
# SHRDPORT 3990
# SYSEPOCH 1928
TZOFFSET +0000
# TODDRAG 1
HERCPRIO 255 # Default ???
CPUPRIO 127 # Default 15
DEVPRIO 64 # Default 8
SRVPRIO 32 # Default 4
TODPRIO 16 # Default 0
ARCHMODE S/370
PANRATE FAST
OSTAILOR QUIET
. . .
DEFSYM PF01 devlist
# SET EMSG TIME
# 16 Aug 2010 Deprecate s37x statement.
# Use "ldmod s37x" instead - Ivan Warren
LDMOD S37X
halfmeg
2010-12-02 04:32:54 UTC
Permalink
Post by Tony Harminc
<snip>
The spin loop counters and distorted MP timers under VM
should still be examined. Both the spin loop counters and
distorted VM could be zapped.
Did you mean MVS instead of VM ?
Does VM/370 permit multiple CPU in Hercules emulation?
Does VM/370 permit multiple CPU MVS execution running under it ?
Post by Tony Harminc
I haven't disassembled IEEVEXSN but suspected that the
Will try the zap. Branches around the SPIN check ?
Post by Tony Harminc
I haven't looked at zapping the spin loop counters
and assUme that someone else is testing them.
How do you set a value for a table which expects fixed execution times when under emulation the execution time will vary ?

Phil
somitcw
2010-12-02 06:06:29 UTC
Permalink
Post by halfmeg
Post by Tony Harminc
<snip>
The spin loop counters and distorted MP timers under VM
should still be examined. Both the spin loop counters and
distorted VM could be zapped.
Did you mean MVS instead of VM ?
No I meant time being distorted by VM so said VM.
Post by halfmeg
Does VM/370 permit multiple CPU in Hercules emulation?
Does VM/370 permit multiple CPU MVS execution running
under it ?
I never tested either in Hercules.
Post by halfmeg
Post by Tony Harminc
I haven't disassembled IEEVEXSN but suspected that the
Will try the zap. Branches around the SPIN check ?
The zap doesn't hurt anything and might help?
If I got the right instruction and changed it correctly,
it ignores recovery.
Post by halfmeg
Post by Tony Harminc
I haven't looked at zapping the spin loop counters
and assUme that someone else is testing them.
How do you set a value for a table which expects fixed
execution times when under emulation the execution time
will vary ?
Phil
I don't know what should be in the tables. Using a
S/370/155 for SRM seconds sounds fine. Some of my
amdahl customers did that on purpose so SRM would not
hog the system as much. Another amdahl zap that
I recommended was for memory error recovery. If amdahl
memory had a single bit error, it reported it as recovered.
MVS would assume that it had S/370/155 donut core memory
and abend the active task. EREP would also report the
error that was collected and sent for system comparison.
That added about one error every six months for one system
without the memory zap and hurt our stats.
Dick the man
2010-12-02 07:42:45 UTC
Permalink
Just a thought - has anybody tried it with the CPUID set to FFxxxxxx so that the OS thinks it's under VM?

Maybe it makes the spinning more flexible in that case; after all, any guest under VM can't really be guaranteed that "x spins = t time".

Just a thought.
Post by halfmeg
Post by Tony Harminc
<snip>
The spin loop counters and distorted MP timers under VM
should still be examined. Both the spin loop counters and
distorted VM could be zapped.
Did you mean MVS instead of VM ?
Does VM/370 permit multiple CPU in Hercules emulation?
Does VM/370 permit multiple CPU MVS execution running under it ?
Post by Tony Harminc
I haven't disassembled IEEVEXSN but suspected that the
Will try the zap. Branches around the SPIN check ?
Post by Tony Harminc
I haven't looked at zapping the spin loop counters
and assUme that someone else is testing them.
How do you set a value for a table which expects fixed execution times when
under emulation the execution time will vary ?
Phil
Binyamin Dissen
2010-12-02 08:25:55 UTC
Permalink
On Wed, 01 Dec 2010 23:35:16 -0000 "ikj1234i" <ikj1234i-/***@public.gmane.org> wrote:

:>> Believe Ivan mentioned possible DAT on/off (?) for PSA . What
:>> would happen if DAT is on and PSA page for an idle CPU was paged out,

:>IIRC, this scenario "should not occur" - the PSA's are located in page-fixed storage (core which can't get paged out).

It absolutely cannot happen. The hardware uses prefixed real addresses for the
PSW switch, so if the real slot was used for some other page the old PSW would
be stored on that page and the new PSW would be fetched from that page. And I
would expect a machine check to be expressed to the other CP's or a check stop
if the real page was damaged or if the prefix register pointed to a
non-existent page.

--
Binyamin Dissen <bdissen-***@public.gmane.org>
http://www.dissensoftware.com

Director, Dissen Software, Bar & Grill - Israel


Should you use the mailblocks package and expect a response from me,
you should preauthorize the dissensoftware.com domain.

I very rarely bother responding to challenge/response systems,
especially those from irresponsible companies.
Ivan Warren
2010-12-02 09:32:14 UTC
Permalink
:>> Believe Ivan mentioned possible DAT on/off (?) for PSA . What
:>> would happen if DAT is on and PSA page for an idle CPU was paged out,
:>IIRC, this scenario "should not occur" - the PSA's are located in page-fixed storage (core which can't get paged out).
It absolutely cannot happen. The hardware uses prefixed real addresses for the
PSW switch, so if the real slot was used for some other page the old PSW would
be stored on that page and the new PSW would be fetched from that page. And I
would expect a machine check to be expressed to the other CP's or a check stop
if the real page was damaged or if the prefix register pointed to a
non-existent page.
The *absolute* address set by SET PREFIX is checked for validity before
the prefix register is set - so one cannot set the prefix register to
have a CPU *real* address point to some page that is not addressable
using an *absolute* address.

However, a logical address of 0 can very well translate to either an
absolute address outside addressable storage or be invalidated (for
example with IPTE), because addresses for interruptions are either
*real* (for Program, External, I/O, SVC or Machine Check) or *absolute*
(for RESTART) - so are not subject to DAT.
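
To make the real vs. absolute distinction concrete, here is a minimal sketch of prefixing (Python, not from the original post). It assumes a 4K prefix area and a 4K-aligned prefix value; the function name real_to_absolute is invented for this example. It only models the swap between block 0 and the block designated by the prefix register, and says nothing about DAT, which applies earlier (logical to real).

```python
PAGE = 4096  # size of the prefixed block assumed in this sketch

def real_to_absolute(real, prefix):
    """Apply prefixing: swap block 0 with the block the prefix register designates.

    `prefix` is assumed to be a multiple of PAGE.
    """
    block = real & ~(PAGE - 1)
    if block == 0:
        return prefix + real      # real 0-4095 lands in the prefix area
    if block == prefix:
        return real - prefix      # the prefix area's own block maps back to absolute 0
    return real                   # everything else is unchanged

# Real location X'50' (the interval timer) on a CPU whose prefix is X'5000'
print(hex(real_to_absolute(0x50, 0x5000)))
```

Each CPU has its own prefix value, so the same real address X'50' resolves to a different absolute location per CPU, which is why each processor has its own interval timer word.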

Now, if the page is "damaged" - that is, a machine check would occur if
a storage reference was made to that page, I suspect either a CPU check
stop or SYSTEM check stop would occur - because a machine check
occurring while processing a machine check interrupt does yield a check
stop state.

However, again, it is perfectly possible to have a "logical" address of
0-4095 point to a bad page or to be invalid in the PTE. Attempting to
access it using an instruction which has a logical addressing scope
would yield a translation exception - thus generating a program
interrupt for the CPU attempting the store or fetch operation.

As an example, as I mentioned earlier, VM routinely pages out virtual
machine's Page 0 - and for cases when the CPU needs to access a virtual
address of 0 (which occurs with ECPS:VM), then the MICBLOK either has a
pointer for that (interval timer) or ECPS:VM VM Assist is bypassed.

--Ivan
somitcw
2010-12-02 14:28:03 UTC
Permalink
--- In hercules-390-***@public.gmane.org,
Ivan Warren <***@...> wrote:
- - - snipped - - -
Post by Ivan Warren
As an example, as I mentioned earlier, VM routinely pages
out virtual machine's Page 0 - and for cases when the CPU
needs to access a virtual address of 0 (which occurs with
ECPS:VM), then the MICBLOK either has a pointer for that
(interval timer) or ECPS:VM VM Assist is bypassed.
--Ivan
So you are saying that there are no conditions that
allow for any operating system to page out its PSA which
even the words make no sense? If unsaid operating system
could "page out" whatever that means its PSA, it couldn't
issue a SIO/SSCH or take interrupts to get it back in.

But if running CP under Hercules under VirtualPC under
VirtualBox under VMWARE, then if Hercules could be made not
active and swapped out, technically the CP PSA would be
considered sort-of paged out? Same is true for GCS, MVS,
DOS/VS,Music, CMS, and all other operating systems running
under various emulators and hypervisors.

I believe you are 100% correct.
Ivan Warren
2010-12-02 15:03:39 UTC
Permalink
Post by somitcw
So you are saying that there are no conditions that
allow for any operating system to page out its PSA which
even the words make no sense? If unsaid operating system
could "page out" whatever that means its PSA, it couldn't
issue a SIO/SSCH or take interrupts to get it back in.
You can't page out because all these use *REAL* or *ABSOLUTE* addresses.
Therefore DAT doesn't apply, thus, the page cannot be invalidated.
Post by somitcw
But if running CP under Hercules under VirtualPC under
VirtualBox under VMWARE, then if Hercules could be made not
active and swapped out, technically the CP PSA would be
considered sort-of paged out? Same is true for GCS, MVS,
DOS/VS,Music, CMS, and all other operating systems running
under various emulators and hypervisors.
I believe you are 100% correct.
Ok.. I'm not talking about CP's PSA - I'm talking about a virtual
machine's own PSA. This can very well be swapped out - because it is
completely distinct from CP's PSA - It is an *emulated* PSA - maintained
by CP. However the hardware WILL interact with it when dealing with
ECPS:VM - VM Assist.

The 1st Page table entry for the 1st segment table entry pointed by the
Segment Table Origin can very well be invalidated.

--Ivan
Mike Ward
2010-12-02 15:57:24 UTC
Permalink
Ivan is correct. Back in the day we ran multiple OS/VS1 guests under VM. And
sometimes we had to lock page 0 of each guest so it wouldn't get paged out. Lock
Osvs1a 0 0 map. Lock osvs1b 0 0 map.




________________________________
From: Ivan Warren <ivan-lnHwE90NT89Ooi3Kub+***@public.gmane.org>
To: hercules-390-***@public.gmane.org
Sent: Thu, December 2, 2010 9:03:39 AM
Subject: Re: [hercules-390] Re: S/370 MP Timer Bug ?
Post by somitcw
So you are saying that there are no conditions that
allow for any operating system to page out its PSA which
even the words make no sense? If unsaid operating system
could "page out" whatever that means its PSA, it couldn't
issue a SIO/SSCH or take interrupts to get it back in.
You can't page out because all these use *REAL* or *ABSOLUTE* addresses.
Therefore DAT doesn't apply, thus, the page cannot be invalidated.
Post by somitcw
But if running CP under Hercules under VirtualPC under
VirtualBox under VMWARE, then if Hercules could be made not
active and swapped out, technically the CP PSA would be
considered sort-of paged out? Same is true for GCS, MVS,
DOS/VS,Music, CMS, and all other operating systems running
under various emulators and hypervisors.
I believe you are 100% correct.
Ok.. I'm not talking about CP's PSA - I'm talking about a virtual
machine's own PSA. This can very well be swapped out - because it is
completely distinct from CP's PSA - It is an *emulated* PSA - maintained
by CP. However the hardware WILL interact with it when dealing with
ECPS:VM - VM Assist.

The 1st Page table entry for the 1st segment table entry pointed by the
Segment Table Origin can very well be invalidated.

--Ivan
halfmeg
2010-12-02 00:56:46 UTC
Permalink
Post by halfmeg
Post by somitcw
Post by halfmeg
So in the mean time, does anyone recognize where
SRM might be in one of the below ...
SRM might not be directly related.
Right, thought about that after I posted. Was thinking about
someone else's post mentioning SRM.
And since then have received off forum email which mentions a couple of places:

http://www.mainframe.eu/mvs38/asm/System%20Resources%20Manaager%20SRM%20(IRA)/IRARMCPU

Which has stuff like:

*/* TABLES - IRARMCMD - CPU MODEL TABLE */ 01850051

RMC3032M DC AL2(X'3032') CPU MODEL 3032 03168051
RMC3032R DC AL2(X'0000') RESERVED 03171051
RMC3032A DC A(0252*1024) CPU ADJUSTMENT FACTOR @ZM48346 03174051
RMCPU8 DS 0F 03177051
RMC3033M DC AL2(X'3033') CPU MODEL 3033 03180051
RMC3033R DC AL2(X'0000') RESERVED 03183051
RMC3033A DC A(0160*1024) CPU ADJUSTMENT FACTOR 03186051

Which is used in:

http://www.mainframe.eu/mvs38/asm/Supervisor%20(IEA)/IEAVNP10

*/* ADJUST EACH CPU MODEL DEPENDENT FIELDS BY THE ADJUSTMENT FACTOR */ 00079000
*/* IN THE FOLLOWING TABLES : */ 00080000
*/* CCT - CPU CONTROL TABLE */ 00081000
*/* MCT - MAIN STORAGE CONTROL TABLE */ 00082000
* 0154 00083000
*NP10ADJ: 0154 00084000
* NP10ADJN=ADJSHIFT*NP10INST; /* 16* NO. OF MICROSECS FOR 0154 00085000
* 10,000 INSTRUCTIONS */ 00086000

Which looks like a 3033 is the latest in the source posted online. Don't know what PTFs may be applied or whether more modern CPUs were ever added. Hercules TK3 configuration has a 4381 in it IIRC.

And then the second mention was:

http://www.mainframe.eu/mvs38/asm/Supervisor%20(IEA)/IEAVELK

which has this:

LOOPCTN DC X'00000100' SPIN COUNT @Z40FPXJ 29149140
LOOPSALC DC X'00000005' SALLOC SPIN COUNT @Z40FPXJ 29149240

So there may be multiple places in the MVS source where a SPIN LOOP can occur.
Post by halfmeg
<snip>
Post by somitcw
The problems seem to be more than SPIN loop problems but a better
<snip>
There seem to be multiple problems in current Hercules as well (S/370)

CPU Timer decrementing while CPU in stopped state

CPU configured offline with 'CF OFF' on 'hardware' console comes back online and gets restarted by an IPL. Hardware configured offline should stay offline until configured back on or new 'hardware' installed.

NUMCPU 2 is somehow overridden by Hercules hardcoded MAXCPU 8 as SIGP for 7 CPUs appear on 'hardware' console when only 1 should appear

Can no longer IPL MVS 3.8j using MSVC SVN as every DASD comes up with reply to continue or wait as it thinks something else has the drive

There may be a couple more but I can't think of them offhand and I really, really, really don't like to shoot my mouth off about most of this stuff which I don't understand.

Phil
Mike Schwab
2010-12-02 02:27:23 UTC
Permalink
On Wed, Dec 1, 2010 at 6:56 PM, halfmeg <opplr-***@public.gmane.org> wrote:
<deleted>
Post by halfmeg
Can no longer IPL MVS 3.8j using MSVC SVN as every DASD comes up with reply to continue or wait as it thinks something else has the drive
<deleted>
Post by halfmeg
Phil
At one point, someone added a script into the hercules batch file
which checked for the existence of an empty stub file. If it did not
exist, it created the file, started hercules, then when hercules
ended, deleted the file.

An attempt to run hercules while the first batch file was running
stopped because of the presence of the empty stub file.

Before this was done, running two instances of hercules at the same
time on the same dasd emulator file would ruin the emulator file.
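
The stub-file guard described above could look roughly like this. It is a sketch in Python rather than the original batch script, which is not shown in the thread; the name run_with_stub and the idea of passing the "start Hercules" step as a callable are assumptions made for the example.

```python
import os

def run_with_stub(stub_path, start):
    """Create an empty stub file, run start(), then delete the stub.

    Returns False without running anything if the stub already exists,
    which is taken to mean another instance is active.
    """
    try:
        # O_EXCL makes "check and create" a single step, so two launchers
        # cannot both decide the stub is missing.
        fd = os.open(stub_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    try:
        start()
    finally:
        os.remove(stub_path)  # clean up even if start() fails
    return True
```

One known weakness of this approach is that a crash or power loss leaves the stub behind, so the next start is refused until the file is removed by hand.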

I am GUESSING this is being integrated into hercules code itself, by
raising a flag on the emulated dasd file and resetting it when you
close the file.

My suggestion is to reboot the host pc then start hercules and reply
continue. If MY GUESS is wrong, you might corrupt your emulated dasd
files to the point they need to be reloaded.
--
Mike A Schwab, Springfield IL USA
Where do Forest Rangers go to get away from it all?
halfmeg
2010-12-02 03:54:47 UTC
Permalink
Post by Tony Harminc
<snip>
I am GUESSING this is being integrated into hercules code itself, by
raising a flag on the emulated dasd file and resetting it when you
close the file.
My suggestion is to reboot the host pc then start hercules and reply
continue. If MY GUESS is wrong, you might corrupt your emulated
dasd files to the point they need to be reloaded.
No, current MSVC SVN thinks something has the DASD. I'll have to recheck to see if it is NUMCPU 2 only or both NUMCPU 1 & 2.

When current MSVC SVN is shutdown and older MSVC SVN is used, the system IPLs without problem.

Shutting down and trying current MSVC SVN again after proper shutdown with the older snapshot, again gives the CONT or WAIT message for each DASD.

Phil
Kevin Leonard
2010-12-02 03:09:06 UTC
Permalink
Post by halfmeg
Don't know what PTFs may be applied or whether more modern
CPUs were ever added.
Someone has been playing, and unfortunately we don't have the source.

In my TK3 DLIBs, IRARMCPU is at UZ34335, which is included in the
3.8J base. Displaying the CSECT in hex:

PDS141I AT 001420 CSECT IRARMCPU LENGTH 000070
001420 0000 FFFF0000 00000000 FFFF0000 00000000 *................*
001430 0010 01450000 001B5800 01550000 000FA000 *................*
001440 0020 01580000 000CD000 01650000 00054000 *......:....... .*
001450 0030 01680000 00045800 30620000 00045800 *................*
001460 0040 30310000 000A3800 30320000 0003F000 *..............0.*
001470 0050 30330000 00028000 30330100 00030800 *................*
001480 0060 30330200 00045800 00000000 000FA000 *................*

1. There are a couple of additional entries at the front. They
have zero adjustment values, so my guess would be that they
were intended to be zapable to add new models.

2. There's an additional entry X'3062' that has the same adjustment
value as a 168.

3. There are now three 3033 values. Two of them have non-zero values
(X'0100' and X'0200' respectively) in the fields defined in the
source code as "reserved". Both have adjustment values larger
than the one for the base X'30330000' entry, which suggests they
are supposed to be for slower systems. (Is one supposed to be MP,
another AP, the third UP?)

4. The final X'0000' entry now has a non-zero adjustment value in
a field defined as A(0) in the source. Since it's the same as
the adjustment for a 155, and the IEAVNP10 source uses 155 as
the default model type if no match is found in the table, the
adjustment field in the final entry may now be what's used as
the default adjustment if the reported model type isn't found
in the table.
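
The hex dump above can be split into table entries mechanically. The sketch below (Python, not part of the original post) assumes each entry is 8 bytes laid out as in the posted source: a 2-byte model, a 2-byte "reserved" field, and a 4-byte adjustment coded as factor*1024. The function name parse_entries is made up for the example.

```python
# Hex words copied from the IRARMCPU dump above, two words per table entry.
DUMP = """
FFFF0000 00000000 FFFF0000 00000000
01450000 001B5800 01550000 000FA000
01580000 000CD000 01650000 00054000
01680000 00045800 30620000 00045800
30310000 000A3800 30320000 0003F000
30330000 00028000 30330100 00030800
30330200 00045800 00000000 000FA000
"""

def parse_entries(dump):
    """Return (model, reserved, factor, remainder) for each 8-byte entry."""
    words = dump.split()
    entries = []
    for i in range(0, len(words), 2):
        ident, adj = words[i], int(words[i + 1], 16)
        entries.append((ident[:4], ident[4:], adj // 1024, adj % 1024))
    return entries

for model, reserved, factor, remainder in parse_entries(DUMP):
    print(model, reserved, factor, remainder)
```

Decoded this way, the 3032 and base 3033 entries come out as 252 and 160, matching the A(0252*1024) and A(0160*1024) lines in the posted source, and the final X'0000' entry carries the same factor as the 155.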

IEAVNP10 is at UZ35231, which is also included in the 3.8J base.

I suppose it's probably time to disassemble IEAVNP10.

--
halfmeg
2010-12-02 03:49:17 UTC
Permalink
Post by Kevin Leonard
Post by halfmeg
Don't know what PTFs may be applied or whether more modern
CPUs were ever added.
Someone has been playing, and unfortunately we don't have the
source.
<snip>
3. There are now three 3033 values. Two of them have non-zero
values (X'0100' and X'0200' respectively) in the fields defined
in the source code as "reserved". Both have adjustment values
larger than the one for the base X'30330000' entry, which
suggests they are supposed to be for slower systems. (Is one
supposed to be MP, another AP, the third UP?)
<snip>
I suppose it's probably time to disassemble IEAVNP10.
I'm not so sure this is going to be the right avenue to take. If we were all running the same speed host ( or the same # of CPUs ) then a fixed value table might make sense.

Under emulation, the time in which 'x' number of loops can be performed is no longer a fixed table entry. It becomes a variable value that depends, unfortunately, as somitcw points out, on the changing demands placed on the host by other processes and applications running there.

Since most folks will be running on systems which provide much 'faster hardware' than the 168 or 3033 entry, what solution would best fit ever faster emulation and give time slicing ( is this where intervals are determined ? ) a more balanced interval ? 3033 time slices don't seem appropriate for a system 5 to 10 times as fast ( maybe 100 times faster ).

Phil
halfmeg
2010-12-02 15:28:22 UTC
Permalink
Post by Tony Harminc
<snip>
3. There are now three 3033 values. Two of them have non-zero
values (X'0100' and X'0200' respectively) in the fields defined
in the source code as "reserved". Both have adjustment values
larger than the one for the base X'30330000' entry, which
suggests they are supposed to be for slower systems. (Is one
supposed to be MP, another AP, the third UP?)
<snip>
Cycle times for the three:

3031 - 115
3032 - 80
3033 - 57

So they got faster. All seem to be available as MP with the option of MP or AP on the 3033 from what I read.

http://www.beagle-ears.com/lars/engineer/comphist/model360.htm

IBM web site backs up those cycle times on the individual CPU archives.

Phil
somitcw
2010-12-02 15:51:23 UTC
Permalink
Post by halfmeg
Post by Tony Harminc
<snip>
3. There are now three 3033 values. Two of them have
non-zero values (X'0100' and X'0200' respectively) in
the fields defined in the source code as "reserved".
Both have adjustment values larger than the one for
the base X'30330000' entry, which suggests they are
supposed to be for slower systems. (Is one supposed
to be MP, another AP, the third UP?)
<snip>
3031 - 115
3032 - 80
3033 - 57
So they got faster. All seem to be available as MP
with the option of MP or AP on the 3033 from what I read.
http://www.beagle-ears.com/lars/engineer/comphist/model360.htm
IBM web site backs up those cycle times on the individual
CPU archives.
Phil
Hercules does not normally run on a 3031.

The default of S370/155 should work for all.
If someone wants to update the table, it wouldn't hurt
unless they set the CPUMODEL to use a new high value
on a slow PC. SRM has high overhead.
Kevin Leonard
2010-12-02 04:05:00 UTC
Permalink
Phil:

Thought of something else.
Post by halfmeg
Can no longer IPL MVS 3.8j using MSVC SVN as every DASD comes up
with reply to continue or wait as it thinks something else has the
drive
What exactly is the SVN level and MVS environment you're running
into this with? I've tested MSVC SVN 7150 with my TK3-level MVS
system, both UP and MP, and don't encounter any problems with my
shared DASD being reserved by someone else.

But...

Hercules at 7150 takes *forever* to shut down. Shutdown goes to
sleep between the messages:

HHC01501I HDL: calling 'shared_device_manager_shutdown'

and

HHC01502I HDL: calling 'shared_device_manager_shutdown' complete

Removing the SHRDPORT statement from the Hercules configuration
restores a quick shutdown time. So maybe someone has been tinkering
with the shared DASD management code, and we are each seeing
different manifestations of it. I haven't yet worked my way
back through the SVN log, but that's definitely indicated.

And the HERCLOGO statement is ignored in the configuration file
at SVN 7150 (it works as a hardware console command).
--
Kevin
halfmeg
2010-12-02 05:16:44 UTC
Permalink
Post by Kevin Leonard
Thought of something else.
Post by halfmeg
Can no longer IPL MVS 3.8j using MSVC SVN as every DASD comes up
with reply to continue or wait as it thinks something else has the
drive
What exactly is the SVN level and MVS environment you're running
into this with? I've tested MSVC SVN 7150 with my TK3-level MVS
system, both UP and MP, and don't encounter any problems with my
shared DASD being reserved by someone else.
MVS environment is TK3UPD freshly unzipped into new directory.

HHC01413I Hercules version 3.0.7.7148
HHC01414I (c) Copyright 1999-2010 by Roger Bowler, Jan Jaeger, and others
HHC01415I Built on Dec 1 2010 at 01:52:11
HHC01416I Build information:
HHC01417I Windows (MSVC) build for i386
....
HHC01421I Main Storage will be reconfigured to 1 Mbyte due to machine architectu
....
HHC02204I numcpu set to 1
HHC02204I panrate set to FAST
....
HHC00013I Herc command: 'ipl 148'
HHC00811I Processor CP00: architecture mode 'S/370'
HHC00814I Processor CP00: SIGP Initial program reset (07) CP01, PARM
HHC00814I Processor CP00: SIGP Initial program reset (07) CP02, PARM
HHC00814I Processor CP00: SIGP Initial program reset (07) CP03, PARM
HHC00814I Processor CP00: SIGP Initial program reset (07) CP04, PARM
HHC00814I Processor CP00: SIGP Initial program reset (07) CP05, PARM
HHC00814I Processor CP00: SIGP Initial program reset (07) CP06, PARM
HHC00814I Processor CP00: SIGP Initial program reset (07) CP07, PARM
herc =====>

Shouldn't be getting these with NUMCPU 1. IE MVS shouldn't even know about them.

NUMCPU 2

HHC00013I Herc command: 'ipl 148'
HHC00811I Processor CP00: architecture mode 'S/370'
HHC00814I Processor CP00: SIGP Initial program reset (07) CP01, PARM
HHC00814I Processor CP00: SIGP Restart (06) CP01, PARM
HHC00814I Processor CP00: SIGP Initial program reset (07) CP02, PARM
HHC00814I Processor CP00: SIGP Initial program reset (07) CP03, PARM
HHC00814I Processor CP00: SIGP Initial program reset (07) CP04, PARM
HHC00814I Processor CP00: SIGP Initial program reset (07) CP05, PARM
HHC00814I Processor CP00: SIGP Initial program reset (07) CP06, PARM
HHC00814I Processor CP00: SIGP Initial program reset (07) CP07, PARM

Shouldn't be getting these for CPU02-CPU07 when NUMCPU 2. Same as above, MVS shouldn't know about anything more than 2 CPUs.
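
For comparing logs between builds, a small filter like the following can list the SIGP targets at or beyond the configured NUMCPU. This is only a sketch in Python; the regular expression is fitted to the HHC00814I lines as they appear in this post, and the function name sigp_targets_beyond is made up. It makes no claim about whether such SIGPs are correct behavior.

```python
import re

# Pattern fitted to lines such as:
# HHC00814I Processor CP00: SIGP Initial program reset (07) CP01, PARM
SIGP = re.compile(
    r"HHC00814I Processor CP([0-9A-F]{2}): SIGP (.+?) \(([0-9A-F]{2})\) CP([0-9A-F]{2})"
)

def sigp_targets_beyond(log_text, numcpu):
    """Return (target CPU, order name) for SIGPs aimed at CPU numbers >= numcpu."""
    hits = []
    for line in log_text.splitlines():
        m = SIGP.search(line)
        if m and int(m.group(4), 16) >= numcpu:
            hits.append((m.group(4), m.group(2)))
    return hits

sample = """\
HHC00814I Processor CP00: SIGP Initial program reset (07) CP01, PARM
HHC00814I Processor CP00: SIGP Restart (06) CP01, PARM
HHC00814I Processor CP00: SIGP Initial program reset (07) CP02, PARM
"""
print(sigp_targets_beyond(sample, 2))
```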

Getting the below on both NUMCPU 1 and NUMCPU 2:

| IEA101A SPECIFY SYSTEM PARAMETERS FOR RELEASE 03.8 .VS2
| r 0,clpa
| IEA120A DEVICE 131 SHARED. REPLY 'CONT' OR 'WAIT'
| r 0,cont
| IEA120A DEVICE 132 SHARED. REPLY 'CONT' OR 'WAIT'
| r 0,cont
| IEA120A DEVICE 133 SHARED. REPLY 'CONT' OR 'WAIT'
| r 0,cont
| IEA120A DEVICE 134 SHARED. REPLY 'CONT' OR 'WAIT'

These are the sortwork packs, the 1st DASD in the Hercules configuration file.

Using MSVC SVN 5518 we get:

ipl 148
HHCCD001I Readahead thread 1 started: tid=000000A0, pid=1984
HHCCD001I Readahead thread 2 started: tid=00000434, pid=1984
HHCCD002I Writer thread 1 started: tid=000005BC, pid=1984
HHCCD002I Writer thread 2 started: tid=00000688, pid=1984
HHCCD003I Garbage collector thread started: tid=000004F0, pid=1984

when NUMCPU 1. Notice no SIGP for non-existent CPUs.

ipl 148
CPU0000: SIGP Initial program reset (07) CPU0001, PARM 00000000: CC 0
CPU0000: SIGP Restart (06) CPU0001, PARM 00000000: CC 0
HHCCD001I Readahead thread 1 started: tid=00000428, pid=1600
HHCCD001I Readahead thread 2 started: tid=00000668, pid=1600
HHCCD002I Writer thread 1 started: tid=0000051C, pid=1600
HHCCD003I Garbage collector thread started: tid=000003B0, pid=1600
HHCCD002I Writer thread 2 started: tid=00000448, pid=1600

when NUMCPU 2. Notice single SIGP for CPU0001.

So the end result is that the MVS environment works fine with the older SVN, but something fairly recent started this DASD issue. I don't know when, as I haven't tracked it down.

Phil

Sorry Ivan and others, the subject matter is primarily MVS oriented but there do seem to be some Hercules issues.
somitcw
2010-12-02 05:53:05 UTC
Permalink
--- In hercules-390-***@public.gmane.org,
"halfmeg" <***@...> wrote:
- - - much snipped - - -
Post by halfmeg
HHC00814I Processor CP00: SIGP Initial program reset (07) CP07, PARM
herc =====>
Shouldn't be getting these with NUMCPU 1. IE MVS
shouldn't even know about them.
NUMCPU 2
- - - remainder snipped - - -

With your coded or defaulted to
MAXCPU 8
NUMCPU 1 # or 2
I would have guessed a CC 3, not RESET(07) but don't
see an issue. I never wrote any MP code so don't know
any fine points.

P.S. Are you putting half of your dasd on each processor?
I only removed 0160 PAGE00 and put it as 1:0160 PAGE00.
The old way of changing 0160 to 1160 doesn't work now.
halfmeg
2010-12-02 14:36:06 UTC
Permalink
Post by somitcw
- - - much snipped - - -
Post by halfmeg
HHC00814I Processor CP00: SIGP Initial program reset
(07) CP07, PARM
herc =====>
Shouldn't be getting these with NUMCPU 1. IE MVS
shouldn't even know about them.
NUMCPU 2
- - - remainder snipped - - -
With your coded or defaulted to
MAXCPU 8
NUMCPU 1 # or 2
I would have guessed a CC 3, not RESET(07) but don't
see an issue. I never wrote any MP code so don't know
any fine points.
Defaulted to MAXCPU 8 as there is no statement in Hercules config for MAXCPU. With NUMCPU 1 or 2 current SVN, MVS somehow sees the CPUs exist and goes into the loops to SIGP them.

This doesn't happen with 3.07 release. Only 1 CPU gets a SIGP when NUMCPU 2 is coded.
Post by somitcw
P.S. Are you putting half of your dasd on each processor?
I only removed 0160 PAGE00 and put it as 1:0160 PAGE00.
The old way of changing 0160 to 1160 doesn't work now.
No, all config statements for DASD are the old way 0xxx, so CPU 1 doesn't have a channel to any of them AFAICT. New way 0:xxx hasn't been tried as there are other issues to address first.

Once MP is no longer 'dog slow' ( believe me it is during IPL also ) it would be good to figure out dual path access to all DASD but that is for another forum I think.

Another Ranade book, MVS Performance Management, mentions MVS being capable of 16 CPUs except for I/O subsystem. So limit of 2 CPUs in S/370 is due to Channel structure ?

Phil
somitcw
2010-12-02 14:54:00 UTC
Permalink
Post by halfmeg
Post by somitcw
- - - much snipped - - -
Post by halfmeg
HHC00814I Processor CP00: SIGP Initial program reset
(07) CP07, PARM
herc =====>
Shouldn't be getting these with NUMCPU 1. IE MVS
shouldn't even know about them.
NUMCPU 2
- - - remainder snipped - - -
With your coded or defaulted to
MAXCPU 8
NUMCPU 1 # or 2
I would have guessed a CC 3, not RESET(07) but don't
see an issue. I never wrote any MP code so don't know
any fine points.
Defaulted to MAXCPU 8 as there is no statement in
Hercules config for MAXCPU. With NUMCPU 1 or 2 current
SVN, MVS somehow sees the CPUs exist and goes into the
loops to SIGP them.
This doesn't happen with 3.07 release. Only 1 CPU gets
a SIGP when NUMCPU 2 is coded.
MAXCPU default is not release related. It can be
specified when Hercules is compiled, assembled, made,
configured, or created. Volker supplied multiple
distributions. Something like i386, i486, i586,
MAXCPU 2, and MAXCPU 8.
Post by halfmeg
Post by somitcw
P.S. Are you putting half of your dasd on each processor?
I only removed 0160 PAGE00 and put it as 1:0160 PAGE00.
The old way of changing 0160 to 1160 doesn't work now.
No, all config statements for DASD are the old way 0xxx,
so CPU 1 doesn't have a channel to any of them AFAICT.
New way 0:xxx hasn't been tried as there are other issues
to address first.
Once MP is no longer 'dog slow' ( believe me it is during
IPL also ) it would be good to figure out dual path access
to all DASD but that is for another forum I think.
Just a wild guess, but the signal delays may be due to
slow signalling to do I/O ?
Post by halfmeg
Another Ranade book, MVS Performance Management, mentions
MVS being capable of 16 CPUs except for I/O subsystem.
So limit of 2 CPUs in S/370 is due to Channel structure ?
Phil
I still have piles of that type of book if you want more.

What PTF level was the book for? The S/370 Extended
Architecture might have problems with channel sets but
I don't understand the S/370 non-Extended Architecture
issue.

If your system residence volume is on channel one,
remember to set Hercules to CPU 1 before the IPL command.
halfmeg
2010-12-02 15:56:15 UTC
Permalink
Post by Tony Harminc
<snip>
MAXCPU default is not release related. It can be
specified when Hercules is compiled, assembled, made,
configured, or created.
No matter, the difference in behavior between 'default ( 8 )' in Hercules 3.07 and SVN gives exposure to MVS of nonexistent CPUs.
Post by Tony Harminc
Just a wild guess, but the signal delays may be due to
slow signalling to do I/O ?
If it had this much impact on 'real iron', it would have been better to have multiple 1 CPU environments instead of MP complexes. Some overhead is to be expected, but going from 20 seconds to 4 minutes from IPL to TSO logon isn't, no matter what the emulation penalty.
Post by Tony Harminc
I still have piles of that type of book if you want more.
Believe I picked these up at Goodwill and haven't read all yet. Should make a list.
Post by Tony Harminc
What PTF level was the book for? The S/370 Extended
Architecture might have problems with channel sets but
I don't understand the S/370 non-Extended Architecture
issue.
Don't see a PTF level in the book. Copyright is 1990 with
ISBN 0-07-054528-6. Page 26, System/370 Multiprocessors.

In contrast with MVS, MVS was designed with multiprocessing support included from "day one." Access to unique but unsharable system resources was serialized through several software locks. (MP65 support had only one lock, thus spending proportionally more of its time than MVS "under lock" and appearing at those times like a uniprocessor. The indifferent MP performance of the MP65, in the range of 1.6 to 1.7 times that of a UP, led to another legend--that MPs were inefficient. MP efficiency has increased steadily since that time through both hardware and software improvements.) System/370 (and thus MVS/370) was still limited to two-way MP. Although prefixing, SIGP, and the MVS/370 locking structure could handle up to 16 CPUs, the System/370 engineering designs and the channel subsystem could not accommodate more than two.

So maybe more than just the channel subsystem limited it to 2.

Phil
somitcw
2010-12-02 19:19:16 UTC
Permalink
Post by halfmeg
Post by Tony Harminc
<snip>
MAXCPU default is not release related. It can be
specified when Hercules is compiled, assembled, made,
configured, or created.
No matter, the difference in behavior between
'default ( 8 )' in Hercules 3.07 and SVN gives exposure
to MVS of nonexistent CPUs.
They are existent. Just some are offline if your
MAXCPU is higher than your NUMCPU. You can use the
Hercules CF command to bring online. V CPU(1),ONLINE
may also work but until MVS MP is fully checked out,
you might find more issues?

If you don't want MAXCPU 8, specify MAXCPU 1 or 2
in your Hercules configuration file.
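
For illustration, a configuration fragment along those lines might look like the sketch below. It reuses the CPUMODEL, ARCHLVL and NUMCPU statements quoted at the start of this thread and adds MAXCPU. Whether a given build accepts MAXCPU, and whether it has to appear before NUMCPU, is an assumption here and should be checked against the documentation for the release in use.

```
# Sketch only: limit Hercules to the two CPUs MVS 3.8j MP is expected to use.
CPUMODEL 3033
ARCHLVL  s/370
# Assumed to be accepted by this build; caps the CPUs that can be configured.
MAXCPU   2
# CPUs brought online at startup.
NUMCPU   2
```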

The extra 6 CPUs were put there because several
people requested that they be put there. Please either
ignore them or specify what you want in your Hercules
configuration file.
Post by halfmeg
Post by Tony Harminc
Just a wild guess, but the signal delays may be due
to slow signalling to do I/O ?
If it had this much impact on 'real iron', it would
have been better to have multiple 1 CPU environments
instead of MP complexes. Some overhead is to be
expected, 20 seconds from IPL to TSO logon vs
4 minutes from IPL to TSO logon isn't no matter what
penalty emulation.
It is clearly a bug. Changing the Hercules priorities
that we can change changes how severe the bug is.
Post by halfmeg
Post by Tony Harminc
I still have piles of that type of book if you want more.
Believe I picked these up at Goodwill and haven't read
all yet. Should make a list.
To separate them from my IBM manuals and list would
take more time than I have available today and probably
this year.
Post by halfmeg
Post by Tony Harminc
What PTF level was the book for? The S/370 Extended
Architecture might have problems with channel sets but
I don't understand the S/370 non-Extended Architecture
issue.
Don't see a PTF level in the book. Copyright is 1990 with
ISBN 0-07-054528-6. Page 26, System/370 Multiprocessors.
In contrast with MVS, MVS was designed with multiprocessing
support included from "day one." Access to unique but
unsharable system resources was serialized through several
software locks. (MP65 support had only one lock, thus
spending proportionally more of its time than MVS "under
lock" and appearing at those times like a uniprocessor.
The indifferent MP performance of the MP65, in the range of
1.6 to 1.7 times that of a UP, led to another legend--that
MPs were inefficient. MP efficiency has increased steadily
since that time through both hardware and software
improvements.) System/370 (and thus MVS/370) was still limited
to two-way MP. Although prefixing, SIGP, and the MVS/370
locking structure could handle up to 16 CPUs, the System/370
engineering designs and the channel subsystem could not
accommodate more than two.
So maybe more than just the channel subsystem limited it to 2.
Phil
A non-IBM book makes a statement that the S/370 hardware
could not be wired to have channels on more than two CPUs
with no indication as to what area the limit was in or as
to why there was a limit. Since any CPU could disconnect
its channel set to attach a different one, the restriction
seems unusual. If only two CPUs could be wired to channels,
a three CPU system would be a combination of MP with an AP.
If only one CPU could be wired to channels, a three CPU
system would have two attached processors attached.
Base+AP+AP? Would that be AAP or other naming problem.

I don't know if the Amdahl customers using three CPUs had
one, two, or three channel sets. All I heard was that they
removed the blocks IBM had added to MVS and MVS ran without
problems.

Prin.of.Op doesn't indicate any issue that I can see.
There could be some hidden paragraph that I missed.
halfmeg
2010-12-02 19:55:00 UTC
Permalink
Post by somitcw
Post by halfmeg
No matter, the difference in behavior between
'default ( 8 )' in Hercules 3.07 and SVN gives exposure
to MVS of nonexistent CPUs.
They are existent. Just some are offline if your
MAXCPU is higher than your NUMCPU. You can use the
Hercules CF command to bring online. V CPU(1),ONLINE
may also work but until MVS MP is fully checked out,
you might find more issues?
If you don't want MAXCPU 8, specify MAXCPU 1 or 2
in your Hercules configuration file.
The extra 6 CPUs were put there because several
people requested that they be put there. Please either
ignore them or specify what you want in your Hercules
configuration file.
I did, NUMCPU 2. Hercules 3.07 defaulted to MAXCPU 8 and never exposed anything which was over my NUMCPU 2 to MVS. With NUMCPU 1, Hercules 3.07 didn't expose nonexistent CPUs to MVS and MVS didn't go into any MP code to set up internals differently. Using current SVN causes MVS to enter MP code and who knows what actually is happening because of it. At the minimum it sets up PSA pages AFAICT and may in fact attempt to test the status of those nonexistent CPUs repeatedly at some time interval.

I have to have 2 entries in Hercules configuration file to specify there are only 2 CPUs ?
Post by somitcw
Post by halfmeg
Post by somitcw
Just a wild guess, but the signal delays may be due
to slow signalling to do I/O ?
If it had this much impact on 'real iron', it would
have been better to have multiple 1 CPU environments
instead of MP complexes. Some overhead is to be
expected, 20 seconds from IPL to TSO logon vs
4 minutes from IPL to TSO logon isn't no matter what
penalty emulation.
It is clearly a bug. Changing the Hercules priorities
that we can change changes how severe the bug is.
More tests on the WinXP 1 CPU laptop ( time in secs ).

Build          Measured interval   NUMCPU 1   NUMCPU 2   Priorities
Hercules 3.07  IPL to TSO ready        17         69     TK3
Hercules 3.07  IPL to TSO ready        15         78     Adjusted
SVN 7140       IPL to SYSP msg         69         87     TK3
SVN 7140       IPL to SYSP msg         81         98     Adjusted

Maybe not 4 minutes, but NUMCPU 2 takes about 4 times as long as NUMCPU 1 on 3.07. Since SVN 7140 only gets as far as the Enter System Parameters msg before croaking on the DASD issue, one can see its performance is more than 4 times as bad even with NUMCPU 1 specified. I believe this is due to those additional CPUs being recognized/exposed to MVS. Will add MAXCPU 1 and see how much difference it makes.
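
As a quick check on the arithmetic, the slowdown factors can be worked out from the timings listed above. This is only a sketch: the dictionary layout and the function name are made up for illustration, and the numbers are the seconds reported in this post.

```python
# Timings in seconds from the figures above: (NUMCPU 1, NUMCPU 2).
timings = {
    ("Hercules 3.07", "TK3 priorities"): (17, 69),
    ("Hercules 3.07", "Adjusted priorities"): (15, 78),
    ("SVN 7140", "TK3 priorities"): (69, 87),
    ("SVN 7140", "Adjusted priorities"): (81, 98),
}


def slowdown(one_cpu, two_cpu):
    """Return how many times longer the NUMCPU 2 run took."""
    return two_cpu / one_cpu


ratios = {key: round(slowdown(*secs), 2) for key, secs in timings.items()}

for (build, prio), ratio in ratios.items():
    print(f"{build:14} {prio:20} NUMCPU 2 / NUMCPU 1 = {ratio}")

# SVN with NUMCPU 1 against 3.07 with NUMCPU 1. The SVN run stops earlier
# (at the SYSP message), so this ratio understates the real difference.
svn_vs_307 = round(69 / 17, 2)
print("SVN NUMCPU 1 vs 3.07 NUMCPU 1:", svn_vs_307)
```

The 3.07 rows come out at roughly 4 to 5 times slower with NUMCPU 2, while the SVN rows are already slow with NUMCPU 1 and only get about 1.2 to 1.3 times worse.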
Post by somitcw
Post by halfmeg
Post by somitcw
I still have piles of that type of book if you want more.
Believe I picked these up at Goodwill and haven't read
all yet. Should make a list.
To separate them from my IBM manuals and list would
take more time than I have available today and probably
this year.
Not that you should make a list; I need to make a list.
Post by somitcw
A non-IBM book makes a statement that the S/370 hardware
could not be wired to have channels on more than two CPUs
with no indication as to what area the limit was in or as
to why there was a limit.
This seems to be the whole issue: hearsay or a second-hand "no, it can't", but no one seems to know exactly why.
Post by somitcw
I don't know if the Amdahl customers using three CPUs had
one, two, or three channel sets. All I heard was that they
removed the blocks IBM had added to MVS and MVS ran without
problems.
Over my head. Perhaps Peter H. knows something about this.
Post by somitcw
Prin.of.Op doesn't indicate any issue that I can see.
There could be some hidden paragraph that I missed.
Another book which makes my eyes hurt.

Phil
PeterH
2010-12-02 20:18:05 UTC
Permalink
Post by halfmeg
Post by somitcw
I don't know if the Amdahl customers using three CPUs had
one, two, or three channel sets. All I heard was that they
removed the blocks IBM had added to MVS and MVS ran without
problems.
Over my head. Perhaps Peter H. knows something about this.
Although the 470V series was designed with two CPUs and two C-Units,
only single CPUs and one or two C-unit versions were sold in the US
(Fujitsu sold SMP versions of the 470V in Japan and possibly elsewhere).

The next series, the 580, was a SMP machine. One or two CPUs, each of
which had a channel unit.

The next series, Apache, was also a SMP machine, with two processor
stacks internal in a single frame.

The following series could have more than two processors.

The final bipolar series could have sixteen processors at a time when
IBM offered only ten processors (or was it twelve).

Anyway, once the CPUID was changed from strictly X'C0' and X'C1' to a
bit field with sixteen bits, it quickly became apparent that MVS, at
least architecturally, could handle more than two processors.

But, by that time the MP support had also changed to accommodate more
than two, and the independent channel processor certainly confirmed
that.

Gerhard Postpischil
2010-12-02 16:13:02 UTC
Permalink
Post by halfmeg
No, all config statements for DASD are the old way 0xxx, so
CPU 1 doesn't have a channel to any of them AFAICT. New way
0:xxx hasn't been tried as there are other issues to address
first.
It may or may not work under Hercules. On real iron, XA was the
first system that I recall as supporting channel sets on more
than one CPU. I came close to firing one of my employees, who
installed a 4381 with SP 1.3.6, and put the TP front end on the
second CPU. The system generation accepted it, but in practice
it failed, due to unnecessary shoulder taps to transfer I/O
requests between the CPUs.
Post by halfmeg
Once MP is no longer 'dog slow' ( believe me it is during IPL
also ) it would be good to figure out dual path access to all
DASD but that is for another forum I think.
If it works under Hercules, that's fine, but if not, it might be
expedient to have configuration processing produce an
appropriate warning message?
Post by halfmeg
Another Ranade book, MVS Performance Management, mentions MVS
being capable of 16 CPUs except for I/O subsystem. So limit
of 2 CPUs in S/370 is due to Channel structure ?
I'm guessing, but the throughput figures published in a Systems
Journal article showed very little practical improvement beyond
a second CPU, to the point where more just weren't cost
effective. MP and AP models were available years prior to extra
channel sets, so extra processors weren't tied to them.


Gerhard Postpischil
Bradford, VT
somitcw
2010-12-02 17:47:51 UTC
Permalink
Post by Gerhard Postpischil
Post by halfmeg
No, all config statements for DASD are the old way
0xxx, so CPU 1 doesn't have a channel to any of them
AFAICT. New way 0:xxx hasn't been tried as there are
other issues to address first.
It may or may not work under Hercules. On real iron,
XA was the first system that I recall as supporting
channel sets on more than one CPU. I came close to
firing one of my employees, who installed a 4381 with
SP 1.3.6, and put the TP front end on the second CPU.
The system generation accepted it, but in practice
it failed, due to unnecessary shoulder taps to transfer
I/O requests between the CPUs.
I ran MVS SP 1.3.5/1.3.6 with all of my disks on
both 4381-T92E processors. Also switched to MVS 2.2.3
with no issue except for device number changes.
Maybe IBM had fixed the temporary bug. With MVS 3.8j,
PAGE00.160 runs fine on the second processor.
Post by Gerhard Postpischil
Post by halfmeg
Once MP is no longer 'dog slow' ( believe me it is
during IPL also ) it would be good to figure out dual
path access to all DASD but that is for another forum
I think.
If it works under Hercules, that's fine, but if not,
it might be expedient to have configuration processing
produce an appropriate warning message?
It works but the recommended priorities still can
allow other Windows processes to interfere with Hercules
and Hercules to interfere with other processes.
Some testing and tuning is needed.

There probably isn't a good reason to scatter dasd
over a couple of CPUs, but MVS 3.8j doesn't seem to mind.
We could use SHRDPORT to have all dasd on all CPUs but
I would worry about updates applying out of sync.
Post by Gerhard Postpischil
Post by halfmeg
Another Ranade book, MVS Performance Management,
mentions MVS being capable of 16 CPUs except for
I/O subsystem. So limit of 2 CPUs in S/370 is due
to Channel structure ?
I'm guessing, but the throughput figures published in
a Systems Journal article showed very little practical
improvement beyond a second CPU, to the point where more
just weren't cost effective. MP and AP models were
available years prior to extra channel sets, so extra
processors weren't tied to them.
Gerhard Postpischil
Bradford, VT
From the view-point of two channels to a disk, IBM
said to not expect a performance improvement. It was
for availability. I have also always tried to have
multi-CPUs in case a high priority task went in a loop
so the system would keep running to allow me to locate
the looping task and get rid of it. With uni-processors,
I could interrupt the current task and hope that it was
the loop. Someone running CICS might prefer one 10 MIPS
uni-processor over a multi-processor with 5 MIPS each.
CICS at the time was one task so couldn't use two CPUs.
Robert Hodge
2010-12-02 19:34:30 UTC
Permalink
Hello everyone,

I am writing this in the hopes that someone might be able to summarize and make
some sense out of the MP timer bug issue, because I find a lot of the technical
details over my head.

1. Basically, I would like to know if this issue implies that there is something
wrong with Hercules in its current SVN state, or is it that MP configurations
are not (and maybe never were?) correctly supported within the fundamental
Hercules architecture?

2. I find it a little disturbing that someone would consider patching MVS to
work properly in an MP configuration.  If Hercules supports multiprocessors
correctly, why would any guest OS need to be patched?  But if patching is the
only way this can be achieved, what about other systems that cannot be patched,
like VM, VSE, or z/OS systems for which no source is available?

3. I am also unclear on the relationship of this issue to multiple simulated
mainframe CPU's vs. multiple real PC cores.  Does one have anything to do with
the other?

The basis of my concerns is that, as I get more familiar with Hercules and make
plans to run it on a system more powerful than my core-2 laptop, I have
considered buying a new, expensive, Windows-7 based "rocket ship" desktop with a
6-core CPU and lots of memory and disk.  Would I be wasting my money buying such
a machine, if guest OS's running on Hercules will not operate correctly with so
many cores anyway?

Could someone please tell me that the Hercules developers have a handle on this
situation?  Or, have so many changes been made to Hercules that it can no longer
be considered stable?

Or, in plain English, did you guys break our toy?


Regards,

Robert
halfmeg
2010-12-02 20:08:14 UTC
Permalink
Post by Robert Hodge
Hello everyone,
I am writing this in the hopes that someone might be able to
summarize and make some sense out of the MP timer bug issue, because
I find a lot of the technical details over my head.
1. Basically, I would like to know if this issue implies that there
is something wrong with Hercules in its current SVN state, or is it
that MP configurations are not (and maybe never were?) correctly
supported within the fundamental Hercules architecture?
There is always something wrong in SVN state. :-)

MP configurations are probably fine in ESA/390 and ESAME mode. It's us old fellas in S/370 mode who are finally wanting to utilize those multi-cpu hosts with an MP configuration which doesn't spit out SPIN LOOP lock situations.
Post by Robert Hodge
3. I am also unclear on the relationship of this issue to multiple
simulated mainframe CPU's vs. multiple real PC cores. Does one
have anything to do with the other?
Indirectly and/or Directly yes. Hercules spins off a number of threads for various functions, reading dasd, watchdog timer, CPU threads, etc..

Say you are running z/Linux with NUMCPU 4 on a uni-processor Host. You may actually reduce your processing power due to those 4 CPU threads taking up more Host time in overhead.

If the host has 4 CPUs to process the 4 CPU threads, your processing power is likely to increase by, say, a factor of 3 ( ballpark figure, but there is still overhead which takes away from your 4 Host CPUs ).
Post by Robert Hodge
Could someone please tell me that the Hercules developers have a
handle on this situation? Or, have so many changes been made to
Hercules that it can no longer be considered stable?
Sometimes they only become aware of issues as someone decides to post about them. The S/370 MP thing has been percolating on a back burner for many years. You can check various threads from time to time but most peter out without anything being resolved.

Hercules 3.07 is stable. No one has ever claimed SVN would be stable.
Post by Robert Hodge
Or, in plain English, did you guys break our toy?
It's not broken, we are just pushing it in perhaps unexpected ways.

The developers can probably give better answers as I am just a newbie.

Phil
somitcw
2010-12-02 20:15:15 UTC
Permalink
Post by Robert Hodge
Hello everyone,
I am writing this in the hopes that someone might be
able to summarize and make some sense out of the MP
timer bug issue, because I find a lot of the technical
details over my head.
1. Basically, I would like to know if this issue implies
that there is something wrong with Hercules in its current
SVN state, or is it that MP configurations are not (and
maybe never were?) correctly supported within the
fundamental Hercules architecture?
There are issues identified for MVS 3.8j running MP.
The major problem is the same one that IBM had running
MVS 3.8j MP under VM. Both VM and Hercules cannot give
all CPU cycles to all MVS CPUs so timing is a problem.
MVS 3.8j has code to detect that it is running under
VM and ignores spin loops because of timing issues
caused by not getting all CPU cycles.

The fix was to tell MVS 3.8j MP that it was
running under VM for the spin loop issue.

The less severe issue found so far is that other
Windows processes can interfere with MVS 3.8j MP
getting CPU cycles and MVS 3.8j MP loops causing no
cycles for anything on the Windows PC. Adjusting
the Hercules default priorities can lessen the
problem but no one has found a perfect balance or
solution yet.
Post by Robert Hodge
2. I find it a little disturbing that someone would
consider patching MVS to work properly in an MP
configuration.  If Hercules supports multiprocessors
correctly, why would any guest OS need to be patched?
But if patching is the only way this can be achieved,
what about other systems that cannot be patched, like
VM, VSE, or z/OS systems for which no source is available?
We only modified IBM's patch that was already in the
system to allow MVS 3.8j MP to run under VM to also work
the same for us to run MVS 3.8j MP under Hercules.
Now MVS 3.8j MP always treats CPU spins as if MVS was
running under a hypervisor or emulator which is true.
Post by Robert Hodge
3. I am also unclear on the relationship of this issue
to multiple simulated mainframe CPU's vs. multiple real
PC cores.  Does one have anything to do with the other?
Are you referring to HyperThreading compared to Cores
or are you referring to emulating more CPUs with Hercules
than are in the PC system?
Post by Robert Hodge
The basis of my concerns is that, as I get more familiar
with Hercules and make plans to run it on a system more
powerful than my core-2 laptop, I have considered buying
a new, expensive, Windows-7 based "rocket ship" desktop
with a 6-core CPU and lots of memory and disk.  Would I be
wasting my money buying such a machine, if guest OS's
running on Hercules will not operate correctly with so
many cores anyway?
Hercules would be happy with all of the cores but
MVS 3.8j might not work with NUMCPU greater than two.
It's a restriction in our version of MVS 3.8j, not in
Hercules. Someone may want to fix MVS 3.8j someday?
Post by Robert Hodge
Could someone please tell me that the Hercules developers
have a handle on this situation?  Or, have so many changes
been made to Hercules that it can no longer be considered
stable?
Or, in plain English, did you guys break our toy?
Regards,
Robert
I'm certain that the developers see that halfmeg is
trying to fix MVS 3.8j MP and they also see that the
default Hercules priorities are being tested. If the
developers did not pick the absolute perfect default
priorities for MVS 3.8j MP, I will let them know.
But first, someone needs to find the absolute perfect
default priorities.
halfmeg
2010-12-02 04:26:19 UTC
Permalink
Post by Tony Harminc
<snip>
There may be a couple more but I can't think of them offhand and I
really, really, really don't like to shoot my mouth off about most
of this stuff which I don't understand.
Ivan's post reminded me.

There is a question about PSA page and DAT I think, whether DAT is getting involved or not.

The snippet of source code mentioned DAT off, but unless you read all that section of source it may not be relevant.

The book, MVS Concepts and Facilities, page 248, mentions that step 12 in NIP processing initializes multiple CPUs. In a tightly coupled system both CPUs have access to all virtual storage, but the 1st 4096 bytes must be unique to each CPU in the complex.

MVS obtains a 4-Kb Central Storage frame at NIP time and sets another special register -- the PREFIX register -- with the address of the obtained page. Any references to the first page frame by MVS, an application program, or the I/O subsystem will be interpreted by DAT to be this relocated page.

Each CPU has its own copy of the "first" page of memory. Note that this is not part of the virtual memory paging system but a special feature of the hardware to support tightly coupled processing.

Next NIP issues SIGPs to start the additional CPUs.

Step 13, NIP ends and turns control over to the Master Scheduler.

On page 271 there is another mention of PSA:

The OLD and NEW PSWs are really locations in the Prefixed Save Area in the first 4,096 bytes of storage. Remember, this block exists in some other Central Storage frame other than x'0000000' if the Processor Complex has multiple CPUs.

Although I can't find it now, IIRC it also stated somewhere in there that in a Multiprocessor Complex real address page 0 is not used at all.

Hercules may adhere to all of the above. I get confused about one thing mentioning DAT off while another mentions DAT handles access to the relocated page 0 and a third mention of a special feature of the hardware.
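
One way to reconcile those three statements, as far as I understand the S/370 Principles of Operation, is that prefixing is a separate step from DAT: DAT (when on) turns a virtual address into a real address, and prefixing then turns the real address into an absolute one. With DAT off the first step is skipped, but prefixing still applies. A minimal Python sketch of the prefixing step follows; the function name and the prefix values are hypothetical and not taken from MVS or Hercules.

```python
PAGE = 0x1000  # size of the prefixed area (4K)


def real_to_absolute(real_addr, prefix):
    """Apply S/370-style prefixing to a real address.

    prefix is the value of this CPU's prefix register (4K aligned).
    """
    if prefix % PAGE != 0:
        raise ValueError("prefix must be 4K aligned")
    if real_addr < PAGE:
        # Real page 0 is redirected to this CPU's own PSA frame.
        return real_addr + prefix
    if prefix <= real_addr < prefix + PAGE:
        # The frame at the prefix address maps back to absolute page 0.
        return real_addr - prefix
    # Everything else is left alone.
    return real_addr


# Hypothetical prefix values for two CPUs (not from any real MVS dump).
CPU0_PREFIX = 0x0001_0000
CPU1_PREFIX = 0x0002_0000

print(hex(real_to_absolute(0x50, CPU0_PREFIX)))
print(hex(real_to_absolute(0x50, CPU1_PREFIX)))
```

Under these assumptions the same real address X'50' lands in a different absolute frame on each CPU, which is how each CPU can have its own "first" page without the paging system being involved.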

The other Hercules issue was the dyn76.c breaking 'make'. Harold has backed that out for now, but it seems to indicate there is no review of changes or testing by someone other than the committer (?).

Phil
Harold Grovesteen
2010-12-02 09:29:21 UTC
Permalink
Post by halfmeg
Post by Tony Harminc
<snip>
There may be a couple more but I can't think of them offhand and I
really, really, really don't like to shoot my mouth off about most
of this stuff which I don't understand.
<snip>
Post by halfmeg
The other Hercules issue was the dyn76.c breaking 'make'. Harold has backed that out for now, but it seems to indicate there is no review of changes or testing by someone other than the committer (?).
Other developers get to "test" new code the same way you do when you
pull down a new revision level of the SVN repository. We have too few
developers working on whatever they are working on to check up on the
other developers. So to a large extent, no, it is the developer of the
new functionality who has the responsibility to test and make sure the
new code works and does not screw up anything else. However, a lot of
internal structural changes have been going on with what will ultimately
become the next version of Hercules. Working with the SVN is likely to
create a number of surprises as you are seeing.

Someone mentioned the HERCLOGO configuration statement. One of the
major enhancements is unification of the configuration process and
console command functionality. Now all configuration statements are
also console commands. But, as observed with the HERCLOGO command, I
too have experienced that some configuration statements have to be
entered as a console command to be recognized and are not picked up in
the configuration file. This will likely get fixed, but is the
consequence of using the SVN.

With regards to the problem being explored, it might be worth
considering to do all of the testing with the standard release. At
least the emulation will not be a moving target and, while problems
might be identified therein, you know from day-to-day what you have and
do not have to question whether something in the latest SVN altered
behavior because you had to start working with another revision level to
fix something else.

Harold
Post by halfmeg
Phil
halfmeg
2010-12-02 14:03:54 UTC
Permalink
Post by Tony Harminc
<snip>
Working with the SVN is likely to create a number of surprises as
you are seeing.
But it is precisely for that reason SVN needs to be worked with, so that the next release version of Hercules isn't in need of a fix for x, y, z the day it is announced.
Post by Tony Harminc
<snip>
With regards to the problem being explored, it might be worth
considering to do all of the testing with the standard release. At
least the emulation will not be a moving target and, while problems
might be identified therein, you know from day-to-day what you have
and do not have to question whether something in the latest SVN
altered behavior because you had to start working with another
revision level to fix something else.
I did test with 3.07 regarding some of the current issues. Responses in reply to other posters.

Phil
Tony Harminc
2010-11-29 19:46:52 UTC
Permalink
Post by Ivan Warren
The interval timer location in CPU's PSA ("Real" address X'50') is only
updated when it is being fetched by the CPU owning the PSA. This is to
ensure operations such as
   MVC 0(8,X'4C'),0(X'50')
is done atomically (the word before X'50' and the word after X'50' are
designed for this - to ensure you can fetch and store a value in X'50'
in an atomic fashion).
Yup - I'm aware of this. It's been in the POO since day 1. That MVC is
the *only* approved instruction for updating the timer. But that's not
really this issue, I think.
Post by Ivan Warren
Under hercules, for every fetch made (in S/370 mode), a check is made to
see if the logical address is X'50' - and if it is the case, location
for real address X'50' (aka Absolute "X'50' + CPU Prefix") is updated
(if logical address X'50' happens not to be real address X'50', no
damage is done.. We just did a spurious update).
OK. Was that easier than checking the real (or absolute) address?
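
To make the mechanism Ivan describes a bit more concrete, here is a toy Python model of "update the timer word only when the owning CPU fetches X'50'". It is not Hercules source: the class names, the internal timer field, and the prefix values are all invented for illustration, and logical addresses are treated as real (no DAT).

```python
PAGE = 0x1000
TIMER = 0x50  # interval timer location in the PSA


class Cpu:
    def __init__(self, prefix, timer=0xFFFFFFFF):
        self.prefix = prefix
        self.timer = timer  # timer value kept inside the emulator (model only)


class Machine:
    def __init__(self, prefixes):
        self.cpus = [Cpu(p) for p in prefixes]
        self.storage = {}  # absolute address -> 32-bit word

    def tick(self, units):
        """Advance time: every CPU's internal timer counts down."""
        for cpu in self.cpus:
            cpu.timer = (cpu.timer - units) & 0xFFFFFFFF

    def absolute(self, cpu, real):
        """Prefixing: swap page 0 with the frame at the prefix address."""
        if real < PAGE:
            return real + cpu.prefix
        if cpu.prefix <= real < cpu.prefix + PAGE:
            return real - cpu.prefix
        return real

    def fetch(self, n, addr):
        cpu = self.cpus[n]
        if addr == TIMER:
            # Sync this CPU's own timer word before the fetch proceeds.
            self.storage[TIMER + cpu.prefix] = cpu.timer
        return self.storage.get(self.absolute(cpu, addr), 0)


m = Machine([0x10000, 0x20000])
m.tick(100)
own = m.fetch(0, 0x50)       # CPU 0 reads its own timer: fresh value
m.tick(100)
peek = m.fetch(1, 0x10050)   # CPU 1 looks into CPU 0's PSA: stale value
fresh = m.fetch(0, 0x50)     # CPU 0 reads again: fresh value
print(hex(own), hex(peek), hex(fresh))
```

In this model the word in CPU 0's PSA looks frozen to anyone other than CPU 0 until CPU 0 itself fetches X'50' again, which is the kind of behavior that would confuse code watching another CPU's timer.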
Post by Ivan Warren
Now, we can also do this because, according to the S/370 principles of
operation, fetching the interval timer of a CPU from another CPU or I/O
channel (Page 50 of GA22-7000-04, 3rd paragraph) may yield unpredictable
results.
One wonders if, to get back to the original problem, MVS is breaking
the rules because it happens to work as expected on real IBM machines.
In other words, is it checking from one CPU to see if the timer is
running on the other, by just looking at it directly, waiting a bit
(via its own timer or by looping), and then checking again?
Post by Ivan Warren
Note that if the PSA is mapped to a logical address other than 0 through
DAT, I'm not sure we're doing this correctly (but I have yet to see a
real world example of this.. However, this may be an issue..).
I think this becomes an issue only on VM/ESA or even z/VM, by which
time the X'50' timer is long gone anyway.
Post by Ivan Warren
PS : VM/370 also has that quirk .. I noticed this 20 odd years ago : If
you attempt to do a 'CP D 50' from a secondary user with the CPU
running, the location at X'50' seems to never change, even with a CP SET
TIMER REAL !
I'm not sure how this would work... Unlike Hercules, VM/370 cannot
detect fetch references to guest-real location X'50' in order to
update the timer. (Well, it *could*, but it would be utterly
impractical for performance reasons.) But I'm not sure what you mean
by a "secondary user", so maybe I don't understand your scenario.

Tony H.
Ivan Warren
2010-11-29 20:37:08 UTC
Permalink
Post by Tony Harminc
I'm not sure how this would work... Unlike Hercules, VM/370 cannot
detect fetch references to guest-real location X'50' in order to
update the timer. (Well, it *could*, but it would be utterly
impractical for performance reasons.) But I'm not sure what you mean
by a "secondary user", so maybe I don't understand your scenario.
Tony H.
Well.. That's easy !

Under VM/370, there are 3 situations :

- TIMER assist is available and enabled
- TIMER assist is neither available nor enabled
- The S/370 environment is a S/390 SIE guest

In the 1st & 3rd cases, the interval timer in the dispatched VMBLOK's
PSA - or the VMDBK's PSA for a z/VM, VM/ESA or VM/XA guest (or a special
field in the VMBLOK if Page 0 is not available) - is updated by the
hardware (part of VM Assist, which is part of ECPS:VM for VM/370, or SIE
for a z/VM, VM/ESA or VM/XA guest).

In the second case, a TRQBLOK (Timer Request Block) is queued to update
the virtual interval timer location of the dispatched virtual machine.

When using case 1 & 3, the virtual machine's interval timer is updated
just like it would be on the real hardware

When using case 2, the hypervisor does the job.

Now.. A "secondary user" is a VM term.. It applies to a logged on
virtual machine that can act on behalf of a disconnected virtual machine
via the "CP SEND" command (and see the output of said machine).
Unfortunately, in this case, the virtual machine being dispatched is NOT
the one holding the interval timer - but the one sending the command. So
the virtual interval timer location doesn't get updated in EITHER case.

--Ivan



halfmeg
2010-11-29 21:45:45 UTC
Permalink
Post by Tony Harminc
<snip>
Now, we can also do this because, according to the S/370 principles
of operation, fetching the interval timer of a CPU from another CPU
or I/O channel (Page 50 of GA22-7000-04, 3rd paragraph) may yield
unpredictable results.
I had to go get GA22-7000-05, August 1976 ( says it's a reprint of 04 with TNL GN22-0498 included ).
Post by Tony Harminc
Note that if the PSA is mapped to a logical address other than 0
through DAT, I'm not sure we're doing this correctly (but I have yet
to see a real world example of this.. However, this may be an
issue..).
SPIN failure seems to occur with TK3UPD after SYSP=J3 entered. Provoked it into giving me a dump. IIRC title was Error in Real Memory Manager.

Would prefer a simple example if you can think one up.
Post by Tony Harminc
Also, since the Alter/Display function is only available (per S/370
Principle of Operations) when the CPU is in a stopped state (at
which point the interval timer no longer gets updated)
Hmmm, example test performed shows CPU(s) in stopped state yet x'50' continues to decrement.
Post by Tony Harminc
- Using the Alter/Display manual functions when the CPU is not in a
stopped state (as permitted by hercules) can also yield
unpredictable results. So if you want a true image of the interval
timer for a CPU, you should stop that CPU first.
sysclear
stopall
r 50-50
r 50-50

still have location decrementing

Paragraph 5 same page manual above:

"The timer value is not decremented when the CPU is not in the operating state, or when the rate switch on the system console is set to the instruction-step position."

sysclear
s+
r 50-50
r 50-50

still have location decrementing

Phil
halfmeg
2010-11-29 21:16:44 UTC
Permalink
Post by Tony Harminc
Post by halfmeg
There is some discussion but never seems to be a resolution to
SPIN problem or what seems to be excessive overhead when NUMCPU 2
is defined.
[snip]
Post by halfmeg
SYSCLEAR starts the CPUTIMER in s/370 mode located at x'50'.
<snip>
Wow bunch of typos this A.M. and wrong name for x'50' ( Interval Timer ).
Post by Tony Harminc
Post by halfmeg
This doesn't look right and if a CPU is expecting the timer to
always increment but doesn't, isn't there a possibility the SPIN
is coming from what looks to me like a bug?
<snip>
There is no reason at all to actually take an interrupt on the host
300 times per second, just to increment a location on guest storage
that is almost certainly not being examined.
<snip>
No, not advocating that at all. Intent is to get to the bottom of why MP, NUMCPU 2, causes problems with 3.8j, i.e. in S/370 mode. Previous threads led off into shared devices, channels, AP vs MP, etc... I was trying to start back before any of that stuff enters the picture and see if Hercules might have a bug before we get to the other stuff.
Post by Tony Harminc
... and there can be no casual observation of their values by
looking at storage somewhere.
Now whether examining such timers using console commands should
count as looking at the timer is a good question. Again, not knowing
if the Hercules code actually behaves this way, or if it does timer
updates naively,
If NUMCPU is set to 1, then the display of location x'50' changes each time and should never be the same until it wraps at about 15.5 hours or so ( CPU(s) are stopped in my example ). On the other hand, if the Interval Timer is set to a short value, then the host must 'know' when it goes negative and raise an interrupt for MVS to service. In other words, does it only cause an interrupt when MVS happens to check to see if it went negative ?
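For reference, the "about 15.5 hours" figure and the rate seen in the r 50-50 displays can be checked with a little arithmetic. This is a rough sketch in Python, assuming the architected S/370 behaviour where the word at x'50' counts down as if bit position 23 were reduced by one 300 times a second (76,800 units a second across the whole 32-bit word). The function names are made up for illustration, and the two sample values are the 09:58:27 and 09:58:32 displays from the log in the first post.

```python
# Assumption: bit position 23 of the 32-bit timer word is reduced by one
# 300 times per second, so the full word drops by 300 * 2**8 units per second.
TIMER_UNITS_PER_SECOND = 300 * 2 ** (31 - 23)


def full_cycle_hours():
    """Hours for the 32-bit timer word to run through all 2**32 values."""
    return 2 ** 32 / TIMER_UNITS_PER_SECOND / 3600


def units_elapsed(earlier, later):
    """Units counted down between two displayed values, allowing for wrap."""
    return (earlier - later) % 2 ** 32


def observed_rate(earlier, later, seconds):
    """Average countdown rate in units per second between two displays."""
    return units_elapsed(earlier, later) / seconds


if __name__ == "__main__":
    print(f"full cycle: {full_cycle_hours():.2f} hours")
    # The two displays were taken roughly 5 seconds apart.
    rate = observed_rate(0xFFF99851, 0xFFF3CBD4, 5)
    print(f"observed: {rate:.0f} units/sec, nominal: {TIMER_UNITS_PER_SECOND}")
```

The log timestamps are only good to the second, so the observed rate can be expected to land near the nominal one rather than match it exactly.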
Post by Tony Harminc
I would suggest writing a tiny program to look at the timer, rather
than using the console commands.
If I could write a tiny standalone program, part of which ran on CPUx while the other part ran on CPUy, to verify Hercules is working properly in S/370 MP mode, I would. Instead I look at simplistic external results which, when they are not consistent, look like something is wrong.

Phil
Ivan Warren
2010-11-29 21:20:18 UTC
Permalink
Post by halfmeg
If I could write a tiny standalone program which targets running on CPUx while the other portion of it ran on CPUy to verify Hercules is working properly in s/370 MP mode I would. Instead I look at simplistic external results which when they are not consistent looks like something is wrong.
Remember..

The Principles of Operation DOES say you can't do that !

Well you can.. But it says the results are unpredictable !

And since the hercules results WILL be unpredictable, hercules is within
the specs !

--Ivan



halfmeg
2010-11-29 21:51:31 UTC
Permalink
Post by Ivan Warren
Post by halfmeg
If I could write a tiny standalone program which targets running
on CPUx while the other portion of it ran on CPUy to verify
Hercules is working properly in s/370 MP mode I would. Instead I
look at simplistic external results which when they are not
consistent looks like something is wrong.
Remember..
The Principles of Operations DOES say you can't do that !
Well you can.. But it says the results are unpredictable !
And since the hercules results WILL be unpredictable, hercules is
within the specs !
Are you saying that the display of location x'50' will be unpredictable no matter what state Hercules is in ( ie all CPUs stopped ) ?

How would you recommend debugging the SPIN problem ?

Phil
Ivan Warren
2010-11-29 21:59:52 UTC
Permalink
Post by halfmeg
Are you saying that the display of location x'50' will be unpredictable no matter what state Hercules is in ( ie all CPUs stopped ) ?
Nope.. That's a bug ! But it's an easy one to fix ! We just have to tell
the timer thread the CPU is stopped and that it should no longer
decrement the CPU's interval timer.. Shouldn't be a big deal.
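The kind of fix described here might look roughly like the following. It is a minimal sketch in Python (Hercules itself is written in C) with made-up names such as `Cpu`, `timer_tick` and `stopped`. It is not the actual Hercules timer code, only the idea that the update pass leaves a stopped CPU's interval timer alone and requests the interruption when a running CPU's timer goes from zero or positive to negative.

```python
from dataclasses import dataclass

UNITS_PER_SECOND = 76_800   # assumed S/370 rate: 300 per second in bit position 23
WORD_MASK = 0xFFFFFFFF


@dataclass
class Cpu:
    # Hypothetical stand-in for the word at x'50' in this CPU's own PSA.
    interval_timer: int = 0x7FFFFFFF
    stopped: bool = False
    interrupt_pending: bool = False


def to_signed(word):
    """Interpret a 32-bit word as a signed value."""
    return word - (1 << 32) if word & 0x80000000 else word


def timer_tick(cpus, elapsed_seconds):
    """Advance every running CPU's interval timer; skip stopped CPUs."""
    units = int(elapsed_seconds * UNITS_PER_SECOND)
    for cpu in cpus:
        if cpu.stopped:
            continue   # the timer is not decremented outside the operating state
        old = to_signed(cpu.interval_timer)
        cpu.interval_timer = (cpu.interval_timer - units) & WORD_MASK
        new = to_signed(cpu.interval_timer)
        if old >= 0 and new < 0:
            cpu.interrupt_pending = True   # positive-or-zero to negative transition


if __name__ == "__main__":
    cpus = [Cpu(interval_timer=100), Cpu(interval_timer=100, stopped=True)]
    timer_tick(cpus, 1.0)
    for number, cpu in enumerate(cpus):
        print(number, f"{cpu.interval_timer:08X}", cpu.interrupt_pending)
```

With something along these lines, a display of x'50' after stopall would stay put, and each CPU's timer would only move while that CPU is running.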
Post by halfmeg
How would you recommend debugging the SPIN problem ?
I have NO idea what a SPIN problem is (I have no idea what a SPIN is ..
I can't see anything that resembles SPIN in the Principles of
Operation)... But I doubt it has anything to do with an operating system
running with ALL CPUs being in a stopped state.

--Ivan



Tony Harminc
2010-11-29 22:25:37 UTC
Permalink
Post by Ivan Warren
I have NO idea what a SPIN problem is (I have no idea what a SPIN is ..
I can't see anything that resembles SPIN in the Principles of
Operation)... But I doubt it has anything to do with an operating system
running with ALL CPUs being in a stopped state.
:-)

MVS uses what it calls spin locks, which really just means that a byte
is set with TS, or a fullword with CS, to indicate lock ownership. If
the other CPU wants the lock, it "spins", i.e. tests the lock and if
it can't get it, just loops back to the test.
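
As an illustration of the idea, here is a rough sketch in Python of a spin lock built on a compare-and-swap, with a spin limit of the kind that would lead to a message like IEE331A. Python has no CS instruction, so `compare_and_swap` is simulated with a `threading.Lock`. The class names, the `spin_limit` value and the exception are all made up for the example and say nothing about how MVS actually implements its locks.

```python
import threading


class LockWord:
    """A fullword holding 0 (free) or the id of the owning CPU (nonzero)."""

    def __init__(self):
        self._value = 0
        self._guard = threading.Lock()   # stands in for the atomicity of CS

    def compare_and_swap(self, expected, new):
        """Store `new` only if the word still equals `expected`."""
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False


class ExcessiveSpin(Exception):
    """Raised when the lock could not be obtained within the spin limit."""


def obtain(lock, cpu_id, spin_limit=1_000_000):
    """Spin until the lock word goes from free (0) to cpu_id, or give up."""
    for _ in range(spin_limit):
        if lock.compare_and_swap(0, cpu_id):
            return
    raise ExcessiveSpin(f"CPU {cpu_id} gave up after {spin_limit} attempts")


def release(lock, cpu_id):
    """Free the lock; only the current owner may do so."""
    if not lock.compare_and_swap(cpu_id, 0):
        raise RuntimeError("lock not held by this CPU")


if __name__ == "__main__":
    lock = LockWord()
    obtain(lock, 1)
    try:
        obtain(lock, 2, spin_limit=1000)   # CPU 1 never lets go
    except ExcessiveSpin as exc:
        print(exc)
    release(lock, 1)
```

If the holder of the lock is slow to release it, for whatever reason, the other side burns cycles in the loop, which would fit the picture of both CPU indicators being pegged.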

I imagine the same term is being used here for the case where one CPU
wants to be sure the timer is being updated, and spins waiting for it
to change. Whether MVS is incorrectly looking at another CPU's timer
via the prefix register, or the timer speed is just too slow wrt the
CPU speed (that's my guess), or there's actually a Hercules bug, I do
not know.

Tony H.
Ivan Warren
2010-11-29 23:04:16 UTC
Permalink
Post by Tony Harminc
Post by Ivan Warren
I have NO idea what a SPIN problem is (I have no idea what a SPIN is ..
I can't see anything that resembles SPIN in the Principles of
Operation)... But I doubt it has anything to do with an operating system
running with ALL CPUs being in a stopped state.
:-)
MVS uses what it calls spin locks, which really just means that a byte
is set with TS, or a fullword with CS, to indicate lock ownership. If
the other CPU wants the lock, it "spins", i.e. tests the lock and if
it can't get it, just loops back to the test.
I imagine the same term is being used here for the case where one CPU
wants to be sure the timer is being updated, and spins waiting for it
to change. Whether MVS is incorrectly looking at another CPU's timer
via the prefix register, or the timer speed is just too slow wrt the
CPU speed (that's my guess), or there's actually a Hercules bug, I do
not know.
Tony,

Yeah.. I guessed it was something like that. Now..

Spin Locks are only used when 2 specific conditions are met :

- You are disabled for interrupts (when you are enabled for interrupts,
a simple wait lock is sufficient)
- You have more than 1 execution engine (MP environment)

Now.. Using a non MP safe facility for a spin lock seems - at best -
awkward to me under these conditions !

--Ivan



halfmeg
2010-11-29 23:26:02 UTC
Permalink
Post by Ivan Warren
Post by Tony Harminc
Post by Ivan Warren
I have NO idea what a SPIN problem is (I have no idea what a SPIN
is .. I can't see anything that resembles SPIN in the Principles
of Operation)... But I doubt it has anything to do with an
operating system running with ALL CPUs being in a stopped state.
:-)
MVS uses what it calls spin locks, which really just means that a
byte is set with TS, or a fullword with CS, to indicate lock
ownership. If the other CPU wants the lock, it "spins", i.e. tests
the lock and if it can't get it, just loops back to the test.
I imagine the same term is being used here for the case where one
CPU wants to be sure the timer is being updated, and spins waiting
for it to change. Whether MVS is incorrectly looking at another
CPU's timer via the prefix register, or the timer speed is just
too slow wrt the CPU speed (that's my guess), or there's actually
a Hercules bug, I do not know.
Tony,
Yeah.. I guessed it was something like that. Now..
- You are disabled for interrupts (when you are enabled for
interrupts, a simple wait lock is sufficient)
- You have more than 1 execution engines (MP environment)
Now.. Using a non MP safe facility for a spin lock seems - at best -
awkward to me under these conditions !
--Ivan
Sorry guys, y'all are starting to lose me. The test I did was overly simplistic but showed what looked like a bug. It may have no bearing on the SPIN LOOP problem. I don't know whether CPU01 is having a problem because it expects its own timer(s) to be a certain way, and I have no idea whether CPU00 is checking out CPU01's PSA.

I was just starting at what I saw as a basic level, separate PSA areas with a timer in them, which might be causing the problem.

Here is what is showing up on the MVS console:

IEE331A EXCESSIVE DISABLED SPIN LOOP DETECTED
WAITING FOR LOCK RELEASE
REPLY U TO CONTINUE SPIN
OR, PRESS STOP ON PROCESSOR(1) AND REPLY ACR
(AFTER PRESSING STOP, DO NOT START THE PROCESSOR)

If ACR is replied, CPU01 is disabled and JES3 continues to initialize.

If U is given, sometimes another SPIN LOOP message is displayed ( I think this is where I was able to get the thing to dump ):

DUMP TITLE= ERROR IN REAL STORAGE MANAGEMENT

If it doesn't get another SPIN LOOP, then JES3 most likely comes up and fails as follows:

IGF992I MIH INIT COMPLETE, PRI=000300, SEC=000015
IEE360I SMF NOW RECORDING ON SYS1.MANX ON MVSRES TIME=05.26.31
IAT3040 STATUS OF JES3 PROCESSORS IN COMPLEX
IAT3040 BSP1 ( )
IAT3042 CHECKPOINT DATA SET INVALID. WARM OR COLD START REQUIRED
IAT3011 SPECIFY JES3 START TYPE
*00 IAT3011 (L H HA W WA OR C)
IEE600I REPLY TO 00 IS;C
*01 IAT3033 CONFIRM JES3 COLDSTART REQUEST (U)
IEE600I REPLY TO 01 IS;U
*02 IAT3012 SELECT JES3 INISH ORIGIN (N M= OR U=), AND OPTIONAL EXIT PARM (,P=)
IEE600I REPLY TO 02 IS;N
IAT3102 CATASTROPHIC ERROR(S) WERE DETECTED DURING INITIALIZATION, SEE JES3OUT. JES3 TERMINATED
IAT3713 ****************************************************************
IAT3713 ****************************************************************
IAT3713 DATE = 10332 TIME = 0528303 JES3 3.0.0
IAT3713 JES3 FAILURE NUMBER = 0001 ABENDED U0004
IAT3713 ACTIVE FCT = INITIALIZATION FCT FAILURE NO = 0001
IAT3713 MODULE = IATINJB MOD BASE = 0F0378 DISPLACEMENT = 000626
IAT3713 PSW AT TIME OF FAILURE 071C0000 000F099E ILC 2 INTC 000D
IAT3713 THE FAILING INSTRUCTION IS 0A0D
IAT3713 REGISTERS AT TIME OF FAILURE
IAT3713 REGS 0- 3 010F0EF4 00000004 000F0DF3 000B5CE7
IAT3713 REGS 4- 7 000B5E08 001959AC 0019599C 000B5C98
IAT3713 REGS 8-11 800F1948 000F1340 000F0378 000D6998
IAT3713 REGS 12-15 000BB000 000E90D0 600F0986 800CC90C
IAT3713 ****************************************************************
IAT3713 ****************************************************************
IAT3702 INITIALIZATION ABENDED U0004 - JES3 FAILURE NO. 0001
IAT3801 JES3 CONTROL BLOCK FORMAT COMPLETE
IEF450I JES3 JES3 - ABEND S2FB U0000

Even with normal JES2 operation the problem with NUMCPU 2 presents as a greatly increased IPL time, with both CPU indicators pegged a good bit of the time. With 1 CPU the IPL takes 20 seconds; with 2 it takes more than a minute, perhaps multiple minutes ( I didn't wait around for it ).

Suggestions and/or help in resolving the above would be appreciated.

Phil
Ivan Warren
2010-11-29 23:33:10 UTC
Permalink
Post by halfmeg
Suggestions and or help in resolving the above would be appreciated.
I hear you..

One of the possibilities is that - although we are *within* what the
Principles of Operation says - we might not be within what IBM
implementations did.. And that JES3 might be relying on that.

We've dealt with this before - and we might have to deal with this again...

I would like to have a test case.. (I'm not very MVS proficient as you
might have guessed by now).. Something I can gnaw on.. to try to trace
it down to the lowest level..

--Ivan



s***@public.gmane.org
2010-11-30 00:04:55 UTC
Permalink
Sorry guys, but I should have responded to a question that Halfmeg asked Sunday, and haven't had the chance yet. I have always suffered from processor spins occasionally. I did have them consistently when I first installed the latest TK3UPD, but that seemed to be because I installed it wrong. I am now running 2 JES3 systems, sharing public DASD, and they generally only have a problem when they try to talk to each other via CTCT (which I'm in the early stages of seeing if I can fix, don't hold your breath!).

Both systems are running with 2 CPUs. I still get CPU spin problems from time to time, perhaps 1 in 50 IPLs, but I get the same sort of number from JES2 IPLs (I use JES2 when I screw up the JES3 installation, which is not a rare event!).

In a nutshell, this doesn't appear to be a JES3 related problem. In my experience it happens just as often under JES2.

Sorry!

Regards Simon
Post by Ivan Warren
Post by halfmeg
Suggestions and or help in resolving the above would be appreciated.
I hear you..
One of the possibilities is that - although we are *within* what the
Principle of operations says - we might not be within what IBM
implementations did.. And that JES3 might be relying on that.
We've dealt with this before - and we might have to deal with this again...
I would like to have a test case.. (I'm not very MVS proficient as you
might have guessed by now).. Something I can gnaw on.. to try to trace
it down to the lowest level..
--Ivan
halfmeg
2010-11-30 02:37:37 UTC
Permalink
Post by s***@public.gmane.org
Sorry guys, but I should/e responded to a question that Halfmeg
asked Sunday, but haven't had the chance yet. I have always suffered
from processor spins occasionally. I did have them consistently when
I first installed the latest TK3UPD, but that seemed to be because I
installed it it wrong. I am now running 2 JES3 systems, sharing
public DASD and they only have a problem generally when they try to
talk to each other via CTCT (which I'm in the early stages of seeing
if I can fix, don't hold your breath!).
No problem about not responding. It happens all the time.

I dropped further work on OS360 Sort as that type of work can be completed later. If the SPIN problem is in Hercules, it would be nice to have it and your CTC work included in the next release ( ie before 3.08 hits the streets ).
Post by s***@public.gmane.org
Both systems are running with 2 CPUs. I still get CPU spin problems
from time to time, perhaps 1 in 50 IPLs, but I get the same sort of
number from JES2 IPLs (I use JES2 when I screw up the JES3
instalation, which is not a rare event!).
In a nutshell, this doesn't appear to be a JES3 related problem. In
my experience it happens just as often under JES2.
<snip>
It seems more prevalent upon JES3 startup. In the past ( see 1st post for old threads ) there were several candidates for the cause. This time around I would like to isolate it for good. If Hercules, perhaps a fix, if channels not defined, a different sysgen, if devices not shared, again different sysgen, if MP on single CPU host then warning of slow-down but shouldn't hang up guest system.

Phil
s***@public.gmane.org
2010-11-30 13:32:10 UTC
Permalink
For completeness, I should really answer Phil's original question as well! The machine I am running on is a 4.5 year old Toshiba Satellite Pro P100 running Debian Linux Squeeze in 512MB of RAM. I use both the official 3.07 release of Hercules and the latest subversion version (current as of last Sunday) and get the same results on both. Both versions are compiled by me from source, the official 3.07 release configured with '--prefix=/usr --enable-external-gui' and the version grabbed from subversion configured with just '--enable-external-gui'.

The P100's processor is an Intel Core Duo T2400, so I am effectively running a dual processor emulated system on a dual processor box. I am however now running 2 emulated dual processor machines on the same box and that doesn't seem to increase the number of spins I get. I couldn't tell you what MIPS rate I get from the emulated machines, but could probably do some checking if you think it's relevant.

Also, in case it's relevant, I do also get IPL 'hangs' when IPLing into both JES2 and JES3. In these cases the IPL usually gets as far as displaying the 'Primary system selected' message and then all processor activity stops. The only way out of these hangs is to issue a stopall in Hercules and reIPL. This has always cured the problem. I don't know if this may be related to the processor spin issue, which is why I mention it here. These hangs happen more frequently than the processor spins, but still not in any consistent way that I can identify. The hangs do seem to happen at the same point in the IPL process as the processor spins.

I have never experienced a processor spin after the system is IPL'd and running, but due to the nature of the use I am currently making of the emulated systems, they are rarely running for longer than about 15 minutes without an IPL being done.

If any more information would help, please ask. It may take a day or two, but I will get back to you.

Regards

Simon
somitcw
2010-11-30 17:04:06 UTC
Permalink
Post by s***@public.gmane.org
For completeness, I should really answer Phil's original
question as well! The machine I am running on is a
4.5 year old Toshiba Satalite Pro P100 running Debian
Linux Squeeze in 512Mb of RAM. I use both the official
3.07 release of Hercules and the latest (as of last Sunday currently) subversion version and get the same results on both.
Both versions are compiled by me from source, the official 3.07
release configured with '--prefix=/usr --enable-external-gui'
and the version grabbed from subversion configured with just
'--enable-external-gui'.
The P100's processor is an Intel Core Duo T2400, so I am effectively running a dual processor emulated system on a
dual processor box. I am however now running 2 emulated dual processor machines on the same box and that doesn't seem to
increase the number of spins I get. I couldn't tell you what
MIPs rate I get from the emulated machines, but could probably
do some checking if you think it's relevant.
Also, in case it's relevant, I do also get IPL 'hangs' when
IPLing into both JES2 and JES3. In these cases the IPL
usually gets as far as displaying the 'Primary system selected'
message and then all processor activity stops. The only way
out of these hangs is to issue a stopall in Hercules and reIPL.
This has always cured the problem. I don't know if this may be
related to the processor spin issue, which is why I mention it
here. These hangs happen more frequently than the processor
spins, but still not in any consistent way that I can identify.
The hangs do seem to happen at the same point in the IPL process
as the processor spins.
I have never experienced a processor spin after the system
is IPL'd and running, but due to the nature of the use I am
currently making of the emulated systems, they are rarely
running for longer than about 15 minutes without an IPL
being done.
If any more information would help, please ask. It may take
a day or two, but I will get back to you.
Regards
Simon
I have an Intel(R) Pentium(R) Dual CPU E2140 @ 1.60GHz
running Windows Vista 32-bit and have no problem turning the
MP problem on and off. I also run Boinc for Enigma and Seti
which always keep two tasks running Priority LOW soaking
every left over CPU cycle that there is. Not only does
MVS 3.8j MP normally hang during IPL if both MVS processors
are online, also if one processor is offline for IPL but
I bring it online later, then there is often a hang doing
other MVS stuff or shutting MVS down.

If Boinc is stopped before MVS IPL, MVS will normally IPL
and run until I start Boinc. IPL is normally slow, sometimes
really crawls, and sometimes IPL does fail, but normally I can
run okay when I don't have too much else running outside of
Hercules.

I do not run JES3 but expect my MVS CPU cycle usage for
IPL is not too much different than running JES3.
s***@public.gmane.org
2010-11-30 00:13:31 UTC
Permalink
Also, in case it's relevant, I get the same results from the official 3.07 release and from my system built from subversion, current as of 36 hours ago.

Regards

Simon
Post by Ivan Warren
Post by halfmeg
Suggestions and or help in resolving the above would be appreciated.
I hear you..
One of the possibilities is that - although we are *within* what the
Principle of operations says - we might not be within what IBM
implementations did.. And that JES3 might be relying on that.
We've dealt with this before - and we might have to deal with this again...
I would like to have a test case.. (I'm not very MVS proficient as you
might have guessed by now).. Something I can gnaw on.. to try to trace
it down to the lowest level..
--Ivan
Gerhard Postpischil
2010-11-30 01:58:17 UTC
Permalink
Post by Ivan Warren
One of the possibilities is that - although we are *within* what the
Principle of operations says - we might not be within what IBM
implementations did.. And that JES3 might be relying on that.
While it may not help, it might make you feel better to know that I
get spin lock hangs once every couple of months. One MVS system is
basic turnkey level, the other as distributed with MVS/380
(includes SU1 updates in some form or another). Both run JES2;
the last hang occurred overnight, while the system was
(relatively) idle. Hercules is at 3.07 on one, and an October
SVN on the other.

On real hardware, we used to get these on our 3081 about once a
week (IBM never came up with a fix). We moved the system to a
4341 (slightly different I/O configuration), and later a 4381,
and never had a hang again.


Gerhard Postpischil
Bradford, VT
halfmeg
2010-11-30 02:52:19 UTC
Permalink
Post by Tony Harminc
<snip>
On real hardware, we used to get these on our 3081 about once a
week (IBM never came up with a fix).
Good to know. It might be in MVS.
Post by Tony Harminc
We moved the system to a 4341 (slightly different I/O
configuration), and later a 4381, and never had a hang again.
There is a possibility that TK sysgen needs some changes.

But AFAICT the non-sharing of device addresses or dual-access in Hercules configuration to utilize different channels in S/370 mode shouldn't slow down a non-hanging system by a factor of 4 or more ( pulled number out of a hat, no empirical test yet to measure impact of NUMCPU 2 vs 1 ) .

Phil
"Fish" (David B. Trout)
2010-11-30 02:53:46 UTC
Permalink
Gerhard Postpischil wrote:

[...]
Post by Gerhard Postpischil
On real hardware, we used to get these on our 3081 about
once a week (IBM never came up with a fix). We moved the
system to a 4341 (slightly different I/O configuration),
and later a 4381, and never had a hang again.
I haven't been following this thread very closely (only marginally) and
don't have much productive to add except the following.

I do know that *some* operating systems (I forget which ones) check upon
initialization what CPU model they are and attempt to adjust their "spin"
value accordingly. The idea being that model X is known to operate at nn
MIPS whereas model Y operates at zz MIPS. (They usually have a hard coded
table to control this)

This is of course the WRONG way to do things, but as we all know programmers
are oftentimes lazy buggers who sometimes do things the easy way rather than
the right way. :)

I only mention this in case MVS is one of those types of operating systems.

Has anyone experimented with trying a different CPU model? Or tried
searching for the code which decides how long to spin (i.e. the spin count,
i.e. how many attempts to make before giving up and throwing an error)?

Just food for thought.
--
"Fish" (David B. Trout)
fish-VLFb7ALKWJGGw+***@public.gmane.org






halfmeg
2010-11-30 03:04:02 UTC
Permalink
Post by Tony Harminc
<snip>
Has anyone experimented with trying a different CPU model? (or tried
searching for the code which decides how long to spin? (i.e. the spin
count? i.e. how many attempts to make before giving up and throwing
an error?)
Just food for thought.
I had thought about different CPU models, but don't think Hercules much cares. There could be something in MVS itself I guess, but have never seen or looked for it.

Phil
Tony Harminc
2010-11-30 16:11:06 UTC
Permalink
Post by Tony Harminc
<snip>
Has anyone experimented with trying a different CPU model?  (or tried
searching for the code which decides how long to spin? (i.e. the spin
count?  i.e. how many attempts to make before giving up and throwing
an error?)
Just food for thought.
I had thought about different CPU models, but don't think Hercules much cares.  There could be something in MVS itself I guess, but have never seen or looked for it.
MVS does have constants based on CPU model - the so-called SRM
constants that map CPU time to service units. IIRC these are in module
IRARMCPU. Amdahl used to distribute an update to MVS for use on their
V6 and similar processors. Whether these constants are also used for
other purposes, I don't know.

Tony H.
Ivan Warren
2010-11-30 03:05:03 UTC
Permalink
Post by Gerhard Postpischil
On real hardware, we used to get these on our 3081 about once a
week (IBM never came up with a fix). We moved the system to a
4341 (slightly different I/O configuration), and later a 4381,
and never had a hang again.
Well... The 4341 was never an MP system..

The 4381 could be, it depended on your model.. From P02 to R93.. some were
UP, some were MP.. Some were even ESA capable.

--Ivan



halfmeg
2010-11-30 02:25:04 UTC
Permalink
Post by Ivan Warren
One of the possibilities is that - although we are *within* what the
Principle of operations says - we might not be within what IBM
implementations did.. And that JES3 might be relying on that.
As others have posted, it isn't entirely a JES3 only issue. It's just that the JES3 startup seems to trigger it quite a bit so might permit tracking it down easier.
Post by Ivan Warren
We've dealt with this before - and we might have to deal with this again...
I would like to have a test case.. (I'm not very MVS proficient as
you might have guessed by now).. Something I can gnaw on.. to try to
trace it down to the lowest level..
I'll see if I can get a small test together. May be a couple of days.

Phil
Mike Schwab
2010-11-30 05:35:39 UTC
Permalink
Post by Ivan Warren
Post by halfmeg
Suggestions and or help in resolving the above would be appreciated.
I hear you..
One of the possibilities is that - although we are *within* what the
Principle of operations says - we might not be within what IBM
implementations did.. And that JES3 might be relying on that.
We've dealt with this before - and we might have to deal with this again...
I would like to have a test case.. (I'm not very MVS proficient as you
might have guessed by now).. Something I can gnaw on.. to try to trace
it down to the lowest level..
--Ivan
We are re-inventing the wheel. MVS 3.8j was from before IBM got all
the multiprocessor bugs out of their code. We might search APAR
histories for clues to what bugs they posted fixes for.
--
Mike A Schwab, Springfield IL USA
Where do Forest Rangers go to get away from it all?
Gregg Levine
2010-11-30 07:21:58 UTC
Permalink
Post by Ivan Warren
Post by halfmeg
Suggestions and or help in resolving the above would be appreciated.
I hear you..
One of the possibilities is that - although we are *within* what the
Principle of operations says - we might not be within what IBM
implementations did.. And that JES3 might be relying on that.
We've dealt with this before - and we might have to deal with this again...
I would like to have a test case.. (I'm not very MVS proficient as you
might have guessed by now).. Something I can gnaw on.. to try to trace
it down to the lowest level..
--Ivan
We are re-inventing the wheel.  MVS 3.8j was from before IBM got all
the multiprocessor bugs out of their code.  We might search APAR
histories for clues to what bugs they posted fixes for.
--
Mike A Schwab, Springfield IL USA
Where do Forest Rangers go to get away from it all?
Hello!
I quite agree. I wonder what APARs were created for the base
VM/370rel6 system we do not have... The APARs not the system.

-----
Gregg C Levine gregg.drwho8-***@public.gmane.org
"This signature fought the Time Wars, time and again."
halfmeg
2010-11-30 12:57:40 UTC
Permalink
Post by Mike Schwab
We are re-inventing the wheel. MVS 3.8j was from before IBM got all
the multiprocessor bugs out of their code. We might search APAR
histories for clues to what bugs they posted fixes for.
Don't believe so. Hercules S/370 archmode is different than Hercules ESA/390 or ESAME mode. It has been such an intermittent problem that whenever it comes up the usual suspects are gathered and then released without anything ever being nailed down as the culprit.

Since JES3 startup seems to present the problem quite often ( majority of the time ), it would be foolish not to take the opportunity to resolve what is causing the situation. It may be in Hercules, it may be in the TK3 configuration ( non-shared DASD ) it may be in the Hercules configuration ( no devices using dual paths so that CPUxx can access them ).

This time I would like to find the culprit.

Phil
somitcw
2010-11-30 01:32:51 UTC
Permalink
Post by Ivan Warren
Post by Tony Harminc
Post by Ivan Warren
I have NO idea what a SPIN problem is (I have no idea
what a SPIN is ..
I can't see anything that resembles SPIN in the
Principles of Operation)... But I doubt it has anything
to do with an operating system running with ALL CPUs
being in a stopped state.
:-)
MVS uses what it calls spin locks, which really just
means that a byte is set with TS, or a fullword with CS,
to indicate lock ownership. If the other CPU wants the
lock, it "spins", i.e. tests the lock and if it can't get
it, just loops back to the test.
I imagine the same term is being used here for the case
where one CPU wants to be sure the timer is being updated,
and spins waiting for it to change. Whether MVS is
incorrectly looking at another CPU's timer via the prefix
register, or the timer speed is just too slow wrt the CPU
speed (that's my guess), or there's actually a Hercules
bug, I do not know.
Tony,
Yeah.. I guessed it was something like that. Now..
- You are disabled for interrupts (when you are enabled for
interrupts, a simple wait lock is sufficient)
- You have more than 1 execution engines (MP environment)
Now.. Using a non MP safe facility for a spin lock seems
- at best - awkward to me under these conditions !
--Ivan
Pieces of MVS must serialize memory updates and checkpoint
CPUs when running an MP. TS, BCR 15,0, CS, and CDS are all
used. To gen MVS to include AP/MP code, you request
ACRCODE=YES ( Alternate CPU Recovery code = include ).
That's why some messages refer to ACR.
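
For readers who have not met the term, the spin lock Tony describes above can be sketched roughly as follows. This is only an illustration in Python, not MVS or Hercules code: the names LockByte, spin_acquire and run_two_cpus are made up for the example, and an ordinary mutex stands in for the interlocked update that TS performs in hardware.

```python
import threading
import time


class LockByte:
    """Toy stand-in for a lock byte in storage, set with a TS-like operation."""

    def __init__(self):
        self._value = 0
        # Assumption: Python has no atomic test-and-set, so an internal mutex
        # stands in for the hardware's interlocked update of the byte.
        self._interlock = threading.Lock()

    def test_and_set(self):
        """Return the old value and leave the byte set."""
        with self._interlock:
            old = self._value
            self._value = 1
            return old

    def clear(self):
        with self._interlock:
            self._value = 0


def spin_acquire(lock_byte):
    # "Spin": keep retrying the test until the byte was found clear.
    while lock_byte.test_and_set() != 0:
        time.sleep(0)  # yield so the holder can run; real code would just loop


def run_two_cpus(iterations=2000):
    """Two threads play the two CPUs and update one shared counter."""
    lock_byte = LockByte()
    shared = {"count": 0}

    def cpu():
        for _ in range(iterations):
            spin_acquire(lock_byte)
            shared["count"] += 1  # critical section
            lock_byte.clear()     # release the lock

    threads = [threading.Thread(target=cpu) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["count"]
```

Calling run_two_cpus() should return twice the iteration count, since every increment happens while the lock byte is held. Note that the lock only serializes the CPUs against each other; it says nothing about storage that is changed by something other than a CPU instruction.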
Ivan Warren
2010-11-30 01:40:36 UTC
Permalink
Post by somitcw
Pieces of MVS must serialize memory updates and checkpoint
CPUs when running an MP. TS, BCR 15,0, CS, and CDS are all
used. To gen MVS to include AP/MP code, you request
ACRCODE=YES ( Alternate CPU Recovery code = include ).
That's why some messages refer to ACR.
That's why I'm a little confused..

None of these (TS, CS, CDS or the serializing BCR) guarantees a
faithful image of another CPU's interval timer. The only guarantee is
that *access* by another CPU is serialized - but *NOT* that a
modification to the PSA by an external facility is.

--Ivan



halfmeg
2010-11-30 02:42:01 UTC
Permalink
Post by Tony Harminc
<snip>
That's why I'm a little confused..
None of these (TS, CS, CDS or the serializing BCR) guarantees a
faithful image of another CPU's interval timer. The only guarantee
is that *access* by another CPU is serialized - but *NOT* that a
modification to the PSA by an external facility is.
Don't be confused by my tinkering.

It just happened to be where I started the journey. I asked myself what is different about S/370 MP compared with more modern MP, and decided to look at the interval timer at location x'50'. It may have nothing to do with the SPIN problem.
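
For anyone who wants to check readings like the ones in the log from the first post, here is a rough sketch in Python (not part of Hercules). It assumes the S/370 interval timer behaves as if 1 were subtracted in bit position 23 every 1/300 second, which gives a full cycle of about 15.5 hours; the function name is made up for the example.

```python
# Assumption: the interval timer word at location 80 (x'50') counts down as
# if 1 were subtracted in bit position 23 every 1/300 second. With bit 31 as
# the low-order bit, bit 23 is worth 2**8, so the word drops by 76,800 per
# second.
UNITS_PER_SECOND = 300 * (1 << 8)


def timer_elapsed_seconds(earlier, later):
    """Seconds between two readings of the timer word (it counts down)."""
    delta = (earlier - later) & 0xFFFFFFFF  # wrap as a 32-bit unsigned value
    return delta / UNITS_PER_SECOND


# Two 'r 50-50' readings from the log in the first post, taken about 5 s
# apart according to the Hercules log timestamps (09:58:27 and 09:58:32).
first, second = 0xFFF99851, 0xFFF3CBD4
print(round(timer_elapsed_seconds(first, second), 2))  # prints 4.95
```

The result is close to the 5 seconds between the two log timestamps, so those two readings look consistent with a timer running at the expected rate.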

Phil
somitcw
2010-11-30 02:45:47 UTC
Permalink
Post by Ivan Warren
Post by somitcw
Pieces of MVS must serialize memory updates and checkpoint
CPUs when running an MP. TS, BCR 15,0, CS, and CDS are all
used. To gen MVS to include AP/MP code, you request
ACRCODE=YES ( Alternate CPU Recovery code = include ).
That's why some messages refer to ACR.
That's why I'm a little confused..
None of these (TS, CS, CDS or the serializing BCR)
guarantees a faithful image of another CPU's interval timer.
The only guarantee is that *access* by another CPU is
serialized - but *NOT* that a modification to the PSA by
an external facility is.
--Ivan
I don't know of any MVS code that looks at the interval
timer in another CPU's prefix page. If the code could be
located, it could be corrected.

To add random MVS information, the prefix pages have
high memory addresses ( for a 16MB system, real addresses
like FEE000 and FBF000 ). If a CPU is online at IPL time,
for 16MB, the virtual address of the prefix will equal real.
MVS keeps both the virtual and real address of each prefix
page in a control block for each CPU ( PCCA ). Only 16 CPUs
are allowed so there are 16 pointers that can point to PCCAs.
The control blocks and prefixed pages are in common memory
so can be accessed from any address space.

Of course when a Prefix register is set, any reference
to an address in that prefixed page is translated by the
hardware or emulator to an absolute page zero reference.
So if looking at the current prefix page, use page zero
addresses. If looking at a different CPU's prefix page,
use the virtual or DAT-off real address.
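
The prefixing rule above can be sketched in a few lines. This is only an illustration in Python, assuming 24-bit addresses and a 4K prefix area; apply_prefix is a made-up name and is not taken from Hercules or MVS. The prefix values are the example addresses mentioned earlier in this post.

```python
PAGE_MASK = 0xFFF000  # assumes 24-bit addresses and a 4K prefix area


def apply_prefix(real_addr, prefix):
    """Real -> absolute address for a CPU whose prefix register holds `prefix`."""
    page = real_addr & PAGE_MASK
    offset = real_addr & 0xFFF
    if page == 0:
        return prefix | offset  # real page zero lives in this CPU's prefix page
    if page == prefix:
        return offset           # the prefix page's own address maps to absolute page zero
    return real_addr            # every other address is unchanged


# CPU A uses prefix FEE000, CPU B uses prefix FBF000.
print(hex(apply_prefix(0x000050, 0xFEE000)))  # CPU A's own timer at x'50'
print(hex(apply_prefix(0xFEE050, 0xFBF000)))  # CPU B looking at CPU A's timer
```

Both calls give the same absolute address, which is why code on one CPU has to use the other CPU's prefix page address, not a page zero address, to see that CPU's timer.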
Harold Grovesteen
2010-11-30 10:53:52 UTC
Permalink
Are you compiling on Windows or Linux? I have no problem on gcc of
course, but do not have Windows to compile it there. This might be the
first time it was compiled on Windows. Either way, could you provide,
off list is fine, the error, please?

Thanks,
Harold
Post by halfmeg
( dyn76.c causes failure in compile so removed it from make ).
halfmeg
2010-11-30 12:46:49 UTC
Permalink
Post by Harold Grovesteen
Are you compiling on Windows or Linux? I have no problem on gcc of
course, but do not have Windows to compile it there. This might be
the first time it was compiled on Windows. Either way, could you
provide, off list is fine, the error, please?
Post by halfmeg
( dyn76.c causes failure in compile so removed it from make ).
if /bin/sh ./libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. -I. -I. -I. -I./decNumber -W -Wall -O3 -march=i686 -fomit-frame-pointer -MT dyn76.lo -MD -MP -MF ".deps/dyn76.Tpo" -c -o dyn76.lo dyn76.c; \
then mv -f ".deps/dyn76.Tpo" ".deps/dyn76.Plo"; else rm -f ".deps/dyn76.Tpo"; exit 1; fi
gcc -DHAVE_CONFIG_H -I. -I. -I. -I. -I./decNumber -W -Wall -O3 -march=i686 -fomit-frame-pointer -MT dyn76.lo -MD -MP -MF .deps/dyn76.Tpo -c dyn76.c -fPIC -DPIC -o .libs/dyn76.o
dyn76.c: In function `z900_hdiagf18_FC':
dyn76.c:1009: error: parse error at end of input
inline.h:51: warning: `s390_logical_to_main' declared `static' but never defined
inline.h:53: warning: `s390_translate_addr' declared `static' but never defined
inline.h:58: warning: `z900_logical_to_main' declared `static' but never defined
inline.h:60: warning: `z900_translate_addr' declared `static' but never defined
inline.h:94: warning: `s390_instfetch' declared `static' but never defined
make[2]: *** [dyn76.lo] Error 1
make[2]: Leaving directory `/home/testdyn'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/testdyn'
make: *** [all] Error 2

[***@test/home/testdyn]# gcc -v
Reading specs from /usr/lib/gcc-lib/i486-slackware-linux/3.3.4/specs
Configured with: ../gcc-3.3.4/configure --prefix=/usr --enable-shared --enable-threads=posix --enable-__cxa_atexit --disable-checking --with-gnu-ld --verbose --target=i486-slackware-linux --host=i486-slackware-linux
Thread model: posix
gcc version 3.3.4

This is Linux, with the GCC version shown above. The only steps were:

"sh autogen.sh"
"configure --prefix=/home/testdyn/"
"make"

Phil
Harold Grovesteen
2010-12-01 10:48:17 UTC
Permalink
I will take a look at this. Of course, when I compile Hercules I have
the feature defined that enables the DIAGNOSE, but SVN does not
define that feature by default yet.

Thanks,
Harold
Post by halfmeg
Post by Harold Grovesteen
Are you compiling on Windows or Linux? I have no problem on gcc of
course, but do not have Windows to compile it there. This might be
the first time it was compiled on Windows. Either way, could you
provide, off list is fine, the error, please?
Post by halfmeg
( dyn76.c causes failure in compile so removed it from make ).