Discussion:
instcount behavior
(too old to reply)
'\'Fish\' (David B. Trout)' david.b.trout@gmail.com [hercules-390]
2018-06-09 13:07:28 UTC
Permalink
Raw Message
This question is directed towards Hercules developers.

A user has contacted me asking about what he is calling "Intermittent instruction count" behavior, which seems to only occur for him (albeit somewhat sporadically) with SDL Hyperion but not with non-SDL Hyperion (nor with 3.x).

He says when my SDL Hyperion is used, the "instcnt=" value displayed at the bottom right of the panel (he is not using my HercGUI) appears to only be updated intermittently. He described the behavior to me over the telephone as: "It goes brip! (is incremented very quickly), then pauses for a very brief moment (he didn't say for how long), then goes brip! again (is updated), then pauses again, etc." In other words, the instruction count at the bottom of the screen (panel) is not updated as smoothly/continuously as it is when a non-SDL version of Hyperion (or 3.x) is used.

Now I *did* indeed make some changes to my SDL Hyperion related to instruction count handling, but I cannot see how the changes I made could be causing such behavior. (For what it's worth, I am not seeing the same behavior myself, or if it is occurring, it is not as noticeable to me as it apparently is to him.)

The changes I made in this area were:

1. Changed/fixed the 'run_cpu' function in cpu.c to actually update sysblk.instcount each time through its instruction execution loop ("UPDATE_SYSBLK_INSTCOUNT" macro and associated function).(*)

2. In the panel.c logic where the instruction count was being calculated (lines 2938-2971 of current hercules-390 Hyperion), I removed the logic that was manually determining the system-wide instruction count value (by manually looping through and accumulating each CPU's regs->instcount value) and replaced it instead with code to directly access the now properly/reliably updated sysblk.instcount value.

That's pretty much it!

I was wondering if one or more of you Hercules developers out there might be able to explain WHY the changes I made could explain this new behavior this user is now seeing.

I'm fairly confident the behavior is purely cosmetic and nothing to worry about. I'm fairly confident there is nothing inherently wrong (buggy) with my new SDL Hyperion instcount handling, so I'm not asking any of you to "debug" my code for me. I'm not asking that.

I'm just asking if any of you can see anything in my new code that might explain this behavior this user is seeing. Because I cannot explain it myself! Not only am I not seeing this unusual behavior myself whenever I run my own tests, but I simply CANNOT SEE *how* the changes I made could possibly have *caused* this new behavior! (Apparently is *has* caused it, but I cannot see *how* it has!)

Thanks in advance for any insight any of you can provide.


---------------------
(*) Surprisingly, current (non-SDL, i.e. hercules-390) Hercules NEVER UPDATES SYSBLK.INSTCOUNT AT ALL! My change fixed this so that now it *is* properly updated in my SDL Hyperion.
--
"Fish" (David B. Trout)
Software Development Laboratories
http://www.softdevlabs.com
mail: ***@softdevlabs.com
mcisho@yahoo.com [hercules-390]
2018-06-09 18:54:59 UTC
Permalink
Raw Message
PANRATE setting? The default of SLOW updates the count every half second.


Ian
'\'Fish\' (David B. Trout)' david.b.trout@gmail.com [hercules-390]
2018-06-09 22:08:00 UTC
Permalink
Raw Message
Post by ***@yahoo.com [hercules-390]
PANRATE setting? The default of SLOW updates the count
every half second.
Oh please! If it was something that simple would I have even bothered to ask you guys for help? :)

That was one of the first things I asked him. He sent me his config file and his is set to 50 (milliseconds).

Another thing I thought *might* cause it: over-allocation of virtual CPUs and/or MAINSIZE in relation to total number of real host CPU cores and installed RAM. The system he's running on (a laptop; don't know the manufacturer or model) has four (4) cores and I forget how much RAM (4GB? 8GB?), and his Hercules config file is (was) specifying 6 virtual CPUs (which he has supposedly dropped to 3 or 4) and only 1200MB for MAINSIZE. So I'm doubting resource over-allocation is the issue.

My gut instinct is telling me it's my new code that's causing it, but I honestly can't see how it possibly could be.

Which is why I asked you guys to give it a quick review to see if maybe you could see something I can't.

Thanks for replying Ian. I appreciate it.
--
"Fish" (David B. Trout)
Software Development Laboratories
http://www.softdevlabs.com
mail: ***@softdevlabs.com
Wayne Bickerdike waynevb@gmail.com [hercules-390]
2018-06-09 22:42:00 UTC
Permalink
Raw Message
PANRATE....

Mine is set to 2000 and the instruction count is updated every 2 seconds.

Build date April 23, 2016.

On Sun, Jun 10, 2018 at 8:08 AM, ''Fish' (David B. Trout)'
Post by ***@yahoo.com [hercules-390]
PANRATE setting? The default of SLOW updates the count
every half second.
Oh please! If it was something that simple would I have even bothered to
ask you guys for help? :)
That was one of the first things I asked him. He sent me his config file
and his is set to 50 (milliseconds).
Another thing I thought *might* cause it: over-allocation of virtual CPUs
and/or MAINSIZE in relation to total number of real host CPU cores and
installed RAM. The system he's running on (a laptop; don't know the
manufacturer or model) has four (4) cores and I forget how much RAM (4GB?
8GB?), and his Hercules config file is (was) specifying 6 virtual CPUs
(which he has supposedly dropped to 3 or 4) and only 1200MB for MAINSIZE.
So I'm doubting resource over-allocation is the issue.
My gut instinct is telling me it's my new code that's causing it, but I
honestly can't see how it possibly could be.
Which is why I asked you guys to give it a quick review to see if maybe
you could see something I can't.
Thanks for replying Ian. I appreciate it.
--
"Fish" (David B. Trout)
Software Development Laboratories
http://www.softdevlabs.com
--
Wayne V. Bickerdike
mcisho@yahoo.com [hercules-390]
2018-06-10 00:09:30 UTC
Permalink
Raw Message
Fish, as your user has provided you with such a useless problem report, the obvious might have been the answer. I can see no difference between Hyperion and SDL Hyperion, on a Linux host, in the instcount increment when using PANRATE 50.


Ian
'\'Fish\' (David B. Trout)' david.b.trout@gmail.com [hercules-390]
2018-06-10 02:03:00 UTC
Permalink
Raw Message
Post by ***@yahoo.com [hercules-390]
Fish, as your user has provided you with such a useless
problem report,
Well, to be honest, I'm not sure there's much more information he can provide! (Other than maybe more details regarding his actual physical host, e.g.. RAM amount, hard disk models, CPU model/speed, etc). The only other thing I can think of that would help would be a video of the actual "problem".
Post by ***@yahoo.com [hercules-390]
the obvious might have been the answer. I can see no difference
between Hyperion and SDL Hyperion, on a Linux host, in the instcount
increment when using PANRATE 50.
Neither can I.

Which is why I'm thinking it might just be: a) his system/hardware, and/or b) the way he runs Hercules. <shrug>

I myself am unable to reproduce the behavior on my own system and I can find no rhyme or reason in my coding changes to explain the behavior he is seeing.

But I was hoping that MAYBE there might be a subtle "bug" in my new code that handles the instcount that I can't see that maybe others might be able to see. (It always helps to have a "different set of eyes" review your coding changes! Others can sometimes see what you can't! That's what I was hoping for.)

Thanks again for your time/effort.
--
"Fish" (David B. Trout)
Software Development Laboratories
http://www.softdevlabs.com
mail: ***@softdevlabs.com
'\'Fish\' (David B. Trout)' david.b.trout@gmail.com [hercules-390]
2018-06-10 05:56:18 UTC
Permalink
Raw Message
I think I may have figured it out! (But I still need to confirm it)

Follow me on this and tell me whether the following makes any sense:



Fish wrote:

[...]
Post by '\'Fish\' (David B. Trout)' ***@gmail.com [hercules-390]
1. Changed/fixed the 'run_cpu' function in cpu.c to
actually update sysblk.instcount each time through
its instruction execution loop ("UPDATE_SYSBLK_INSTCOUNT"
macro and associated function).
2. In the panel.c logic where the instruction count was
being calculated (lines 2938-2971 of current hercules-390
Hyperion), I removed the logic that was manually determining
the system-wide instruction count value (by manually looping
through and accumulating each CPU's regs->instcount value)
and replaced it instead with code to directly access the
now properly/reliably updated sysblk.instcount value.
[...]
Post by '\'Fish\' (David B. Trout)' ***@gmail.com [hercules-390]
but I simply CANNOT SEE *how* the changes I made could possibly
have *caused* this new behavior! (Apparently is *has* caused it,
but I cannot see *how* it has!)
Thanks in advance for any insight any of you can provide.
When you add in a bit of additional information (in addition to the above), I believe I can now see how this new unexpected (undesirable?) behavior might have been triggered by my changes. Please check me on this and tell me whether or not I'm on the right track (whether the following makes any sense or not).

In addition to above changes, something else should probably be mentioned which I did NOT change, but which likely plays a key role into causing the new unexpected (undesirable?) behavior to occur:

That "something else" that should be mentioned is how the regs->instcount field (and perhaps more importantly, the regs->prevcount field!) are maintained by the "timer_update_thread" in timer.c:

At the expiration of each update period, it loops through all of the online CPUs and adds the current regs->instcount value to the regs->prevcount value, and resets the regs->instcount value back to zero, and these two values (regs->prevcount and regs->instcount) are what the old panel.c code added together FOR EACH CPU to obtain the "totalcount" value (total instruction count of the system) that it would then display in the "instcnt=" field at the bottom right of the panel screen.

But the "timer_update_thread" in timer.c does this accumulating and resetting of each CPU's instruction count fields UNDER THE CONTROL OF THE CPU LOCK!

Each CPU has its own lock, called "sysblk.cpulock[i]" (where 'i' is of course whichever CPU you are dealing with), and this lock is obtained by "timer_update_thread" in timer.c before it updates the instruction count fields, and then released afterwards. (The code in panel.c that accesses these same fields doe not bother to obtain/release the CPU lock because it is just reading those fields but not updating them.)

Obtaining and Releasing a lock causes a MEMORY BARRIER, which likely causes whatever value which may still be in the CPU's data cache to be FLUSHED OUT, such that the code in panel.c (the OLD code, prior to my changes) would now see the new value as updated by the "timer_update_thread" in timer.c (under control of the CPU lock).

Thus, the old code in panel.c would always "see" (access) an absolute "current" regs->prevcount and regs->instcount value due to the timer_update_thread's lock usage.

With my new code however, each CPU thread itself ("run_cpu" function in cpu..c) now updates the sysblk.instcount field DIRECTLY, *without* the use of any type of lock! (it uses an atomic intrinsic to do the updating, a "locked add").

Because no lock is being used to update the sysblk.instcount field, the code in panel.c is very likely using a "stale" sysblk.instcount value (because the real host's CPU hasn't flushed its data cache yet).

Only later, when "something else" happens to cause the real host CPU to flush the 'sysblk.instcount' value from its data cache back to its L2 or L3 cache (or even all the way back to storage) does my new code in panel.c finally "see" the TRUE value of sysblk.instcount field! (which it then displays, "Brip!").

Sooo... what I'm basically saying is that because the value that panel.c is now displaying on the screen with my new code is coming from a sysblk field that is being updated WITHOUT obtaining or releasing any type of lock, the value it's displaying (seeing/using) ends up being "stale" for brief periods of time. That is to say, every once in a while, the sysblk.instcount field FROM THE PANEL.C CODE'S POINT OF VIEW hasn't changed! It still believes the value its using is still the same, and so doesn't bother to update it.

The ACTUAL value however, *has* changed, but hasn't been FLUSHED FROM CACHE yet so that that panel.c code can see it! (because no locks are being used to access that field)

Does that make sense? Could that perhaps be the cause? Each different model of Intel CPU (or whatever model CPU a person happens to be using) probably has not only a different sized data cache but also different algorithm/timing for when the caches will be flushed too.

And my friend, this Hercules user, just happens to be using a particular system in which the timing/frequency of cache flushing on his system is such that the "problem" is more noticeable to him than it might be to others (who are using a different CPU model with different cache sizes and flushing frequency).

Yes? No? Close? Way out in left field?

What do you think? Am I maybe on the right track? Or am I way off base here?

Anyone's / Everyone's thoughtful/knowledgeable/experienced feedback GREATLY appreciated!
--
"Fish" (David B. Trout)
Software Development Laboratories
http://www.softdevlabs.com
mail: ***@softdevlabs.com
mcisho@yahoo.com [hercules-390]
2018-06-10 08:54:51 UTC
Permalink
Raw Message
My understanding is that cache is intended to improve main memory access time, by having a copy of the main memory in cache before a main memory location needs to be accessed.

If thread A is writing to a main memory location, the copy in cache is updated, and the cached copy will be written back to main memory at some point in the future. If thread B is reading from the same main memory location, the copy in cache is read, main memory isn't involved.

What is the user seeing? An update every second? Every 10 seconds? Does the update interval appear fixed or variable? What state is the guest in when the effect is noticed? I said earlier I couldn't detect any difference in instcount updating, but that was while the guest (a Linux s390x system) was starting. However, when the guest was idle the instcount update was exactly as described by your user.


I don't know what you know, I'm working from a not very helpful third-hand problem description. As an old colleague used to say, give us the facts.


Ian
'\'Fish\' (David B. Trout)' david.b.trout@gmail.com [hercules-390]
2018-06-10 13:47:52 UTC
Permalink
Raw Message
Ian Shorter wrote:

[...]
Post by ***@yahoo.com [hercules-390]
I said earlier I couldn't detect any difference in instcount
updating, but that was while the guest (a Linux s390x system)
was starting. However, when the guest was idle the instcount
update was exactly as described by your user.
If we define the two update behaviors as being "while busy" and "while idle", then what I believe his complaint is, is that his system exhibits "while idle" behavior when it is both busy *and* idle (i.e. the normal "while busy" behavior never occurs on his system).
Post by ***@yahoo.com [hercules-390]
I don't know what you know,
You know as much as I do! Everything I know has been reported in this thread!
Post by ***@yahoo.com [hercules-390]
I'm working from a not very helpful third-hand problem description.
As an old colleague used to say, give us the facts.
Unfortunately there are no more facts to give! (except maybe, as I said earlier, details regarding his host system hardware configuration, but that's about it!)

This is one of those problems that doesn't present you with very many facts to go on, unfortunately. That's what makes it so frustrating and difficult to nail down.

By and large however, I'm not too worried about it. I believe it is 100% cosmetic and NOT indicative of any type of problem or bug. It is, to me, more of an intellectual curiosity than anything.
--
"Fish" (David B. Trout)
Software Development Laboratories
http://www.softdevlabs.com
mail: ***@softdevlabs.com
'\'Fish\' (David B. Trout)' david.b.trout@gmail.com [hercules-390]
2018-06-10 16:20:57 UTC
Permalink
Raw Message
Post by '\'Fish\' (David B. Trout)' ***@gmail.com [hercules-390]
I think I may have figured it out!
I thought wrong. :(


[...]
Post by '\'Fish\' (David B. Trout)' ***@gmail.com [hercules-390]
Sooo... what I'm basically saying is that because the value
that panel.c is now displaying on the screen with my new code
is coming from a sysblk field that is being updated WITHOUT
obtaining or releasing any type of lock, the value it's displaying
(seeing/using) ends up being "stale" for brief periods of time.
That is to say, every once in a while, the sysblk.instcount field
FROM THE PANEL.C CODE'S POINT OF VIEW hasn't changed! It still
believes the value its using is still the same, and so doesn't
bother to update it.
The ACTUAL value however, *has* changed, but hasn't been FLUSHED
FROM CACHE yet so that that panel.c code can see it! (because no
locks are being used to access that field)
Does that make sense?
No, not after reading Ian's reply and thinking about it some more. :(
Post by '\'Fish\' (David B. Trout)' ***@gmail.com [hercules-390]
Could that perhaps be the cause?
Very unlikely, as Ian Shorter helpfully pointed out.


BUT... I have a new theory: Unexpected compiler optimization.

Since the sysblk.instcount field is never updated anywhere by panel.c code, maybe the compiler decided to completely elide (remove) the conditional from the if statement that checks to see if the screen needs to be redrawn? (because it determined the condition would always be false?)

The panel.c code in question is the following sequence:


/* Obtain the PSW for target CPU */
regs = copy_regs( sysblk.pcpu );
memset( curpsw, 0, sizeof( curpsw ));
copy_psw( regs, curpsw );

numcpu = 0;
for (i=0; i < sysblk.maxcpu; ++i )
if (IS_CPU_ONLINE( i ))
++numcpu;

/* Set the display update indicator
if anything interesting happened.
*/
if (0
|| memcmp( curpsw, prvpsw, sizeof( curpsw )) != 0
|| prvcpupct != regs->cpupct
|| prvstate != regs->cpustate

#if defined( OPTION_SHARED_DEVICES )
|| prvscount != sysblk.shrdcount
#endif
|| (prev_instcount != sysblk.instcount
&& NPDup && NPcpugraph)
)
{
redraw_status = 1;



The first thing to note is that the PSW being checked for change is the PANEL cpu (which is by default cpu #0), which, for brief periods, may not change due to the guest dispatching its work on CPUs other than cpu #0. (i.e. depending on how many virtual CPUs you have, cpu #0 might not get dispatched as often as other cpus.

The other condition is the cpu percent busy (which once again, is only for the PANEL cpu, not for the entire system), and as just explained might not change for brief periods due to cpu #0 not being dispatched by the guest.

The cpu state is just a simply "started" or "stopped" state, so that's not going to change very often either.

A change in the shared devices I/O count will never occur if shared devices aren't being used, which leaves only the sysblk.instcount conditional, which includes NPDup and NPcpugraph (causing the condition to always be false since NPcpugraph will always be false when the primary panel display is active; it's only true when you press escape to display the screen that shows all of your devices and the cpu percent graph).

So since the if condition that's checking the sysblk.instcount will, from the optimizer's point-of-view, always be false, perhaps it completely elided it? (removed it altogether)

That might certainly explain the observed behavior!

I'm not sure why NPDup and NPcpugraph are part of the condition in the first place. But if the testing of sysblk.instcount against prev_instcount were made a separate condition unto itself (and perhaps cast with (volatile) just to be safe) the problem might not occur.

Of course there's also the argument that panel.c shouldn't be using sysblk.instcount AT ALL, since it's a system-wide value, and it's *supposed* to only be displaying information for the current PANEL cpu only. It technically should be using regs->hostregs->prevcount + regs->hostregs->instcount, and NOT sysblk.instcount at all (i.e. it should only be displaying the instruction count for the current PANEL cpu, not for the entire system).

What do you guys think of this new theory? It seems much more likely to me!
--
"Fish" (David B. Trout)
Software Development Laboratories
http://www.softdevlabs.com
mail: ***@softdevlabs.com
Loading...