Discussion:
MCH005 z/VM ABEND with REAL SCSI TAPES
(too old to reply)
o***@public.gmane.org
2009-04-10 22:28:17 UTC
Permalink
Using CMS DDR, abend MCH005 occurs in z/vm when ddr rewinds and unloads a tape.
--------------------------------
t+588
HHCPN136I CCW tracing is now on for device 0:0588
0588:synchronous I/O ccw addr 3f975dd8
HHCCP048I 0588:CCW=E4200007 3643D641=>00000000 000000FF C3D6D7E8 C9D5C740 ........COPYING
HHCCP075I 0588:Stat=0C00 Count=0000 =>FF349050 349050FF C3D6D7E8 C9D5C740 ...&..&.COPYING
HHCCP050I 0588:SCSW=00C04007 Stat=0C00 Count=0000 CCW=3F975DE0
0588:synchronous I/O ccw addr 3f975f38
HHCCP048I 0588:CCW=64200040 3E4FFDF0=>00000000 00000000 00000000 00000000 ................
HHCCP075I 0588:Stat=0C00 Count=0000 =>34905034 90500000 02C08080 00000000 ..&..&...{......
HHCCP050I 0588:SCSW=00C04007 Stat=0C00 Count=0000 CCW=3F975F40
HHCCP048I 0588:CCW=02200078 3643CB68=>00000000 00000000 00000000 00000000 ................
HHCCP075I 0588:Stat=0C00 Count=0028 =>E5D6D3F1 E2E8E2E3 D4C14040 40404040 VOL1SYSTMA
HHCCP050I 0588:SCSW=00C04007 Stat=0C00 Count=0028 CCW=3F975440
0588:synchronous I/O ccw addr 3f975dd8
HHCCP048I 0588:CCW=2F200001 3643C4F8=>2F02F4F8 20000001 0702F500 60000001 ..48......5.-...
HHCTA023I 0:0588 filename=/dev/nst0 (opened), sstat=0x01010000: ON-LINE READY
HHCTA038E Backspace file error on 0:0588=/dev/nst0; errno=5: Input/output error
HHCTA023I 0:0588 filename=/dev/nst0 (opened), sstat=0x41010000: ON-LINE READY LOAD-POINT
HHCCP075I 0588:Stat=0E00 Count=0001
HHCCP076I 0588:Sense=10480044 00000020 00000000 00000000 00000000 00000000
HHCCP077I 0588:Sense=EQC ITF NRF
HHCCP050I 0588:SCSW=00C24017 Stat=0E00 Count=0001 CCW=3F975DE0
<pnl,color(green,black)>15:23:47 HCPERP591I TAPE 0588 AN OPERATION WAS TERMINATED BECAUSE
<pnl,color(green,black)>15:23:47 HCPERP591I DESIRED BLOCK WAS NOT FOUND
<pnl,color(green,black)>15:23:47 HCPERP6300I SENSE DATA FORMAT = 02 MSG CODE = 00
<pnl,color(green,black)>15:23:47 HCPERP6301I CHANNEL COMMAND WORD COMMAND CODE = 2F
<pnl,color(green,black)>15:23:47 HCPERP6303I SENSE = 10480044 00000020 00000000 00000000 00000000
<pnl,color(green,black)>15:23:47 HCPERP6303I 00000000 00000000 00000000
<pnl,color(green,black)>15:23:47 HCPERP6304I IRB = 00C24017 3F975DE0 0E000001 00800000
<pnl,color(green,black)>15:23:47 HCPERP6305I USERID = MAINT
<pnl,color(green,black)>15:23:47 HCPERP2216I CHANNEL PATH ID = 05
0588:synchronous I/O ccw addr 3f975388
HHCCP048I 0588:CCW=0F600001 3643C538=>0F02F538 60000001 0302F540 20000001 ..5.-.....5 ....
HHCTA100I 0588: Now Displays: "UNLOADNG"
HHCTA077I Tape 0:0588 unloaded
HHCTA100I 0588: Now Displays: " "
HHCTA023I 0:0588 filename=/dev/nst0 (closed), sstat=0x00040000: OFF-LINE NO-TAPE
HHCTA023I 0:0588 filename=/dev/nst0 (closed), sstat=0x00040000: OFF-LINE NO-TAPE
HHCCP075I 0588:Stat=0C00 Count=0001
HHCTA200I SCSI-Tape mount-monitoring thread started;
dev=0:0588, tid=42B2A940, pri=15, pid=14203
HHCCP048I 0588:CCW=03200001 3643C540=>0302F540 20000001 0F02F548 60000001 ..5 ......5.-...
HHCCP017I CPU0000: Machine check due to host error: User defined signal 1
CPU0000: PSW=04041000 80000000 00000000000F703A INST=182A LR 2,10 load_register
CPU0000: R0=0000000000000000 R1=000000000001002A
CPU0000: R2=0000000000000011 R3=0000000000000000
CPU0000: R4=8280834F00000000 R5=000000003DC8D3A4
CPU0000: R6=0000000000000012 R7=0000000000187410
CPU0000: R8=000000003F977BE8 R9=00000000000F7DC0
CPU0000: RA=000000003F98A178 RB=000000003DC8D000
CPU0000: RC=00000000000F6DC0 RD=000000003F920C00
CPU0000: RE=00000000000000B2 RF=0000000000000000
HHCCP011I CPU0001: Disabled wait state
PSW=00020000 00000000 0000000000009005
CPU0000: SIGP CPU reset (0C) CPU0001, PARM 0000000000176218: CC 0
<pnl,color(green,black)>HCPMCH9101I MACHINE CHECK ON CPU 0000. MCIC = 40000F1D 403B0000
HHCCP055I 0588: Clear subchannel
<pnl,color(green,black)>15:24:17 HCPDMP908I SYSTEM FAILURE ON CPU 0000, CODE - MCH005
<pnl,color(green,black)>HCPDMP9252I DUMPING TO DASD
<pnl,color(green,black)>HCPDMP9260I 25% OF THE DUMP HAS BEEN COMPLETED
<pnl,color(green,black)>HCPDMP9260I 50% OF THE DUMP HAS BEEN COMPLETED
<pnl,color(green,black)>HCPDMP9260I 75% OF THE DUMP HAS BEEN COMPLETED
<pnl,color(green,black)>HCPDMP9261I DUMP COMPLETE
<pnl,color(green,black)>HCPWRP9277I SYSTEM TERMINATION COMPLETE, ATTEMPTING RESTART
CPU0000: SIGP Set architecture mode (12) CPU0000, PARM 000000003F946000: CC 0
HHCCP007I CPU0000 architecture mode set to ESA/390
CPU0000: SIGP Set architecture mode (12) CPU0000, PARM 00000001: CC 0
HHCCP007I CPU0000 architecture mode set to z/Arch
HHCCP055I 0588: Clear subchannel
<pnl,color(green,black)>15:24:23 z/VM SYSTEM RESTART AFTER SYSTEM FAILURE; CODE MCH005
<pnl,color(green,black)>15:24:23 z/VM V3 R1.0 SERVICE LEVEL 0301 (64-BIT)
<pnl,color(green,black)>15:24:23 SYSTEM NUCLEUS CREATED ON 2004-03-13 AT 12:27:06, LOADED FROM 310RES
<pnl,color(green,black)>15:24:23
<pnl,color(green,black)>15:24:23 ****************************************************************
<pnl,color(green,black)>15:24:23 * LICENSED MATERIALS - PROPERTY OF IBM* *
<pnl,color(green,black)>15:24:23 * *
<pnl,color(green,black)>15:24:23 * 5654-A17 (C) COPYRIGHT IBM CORP. 1983, 2001. ALL RIGHTS *
<pnl,color(green,black)>15:24:23 * RESERVED. US GOVERNMENT USERS RESTRICTED RIGHTS - USE, *
<pnl,color(green,black)>15:24:23 * DUPLICATION OR DISCLOSURE RESTRICTED BY GSA ADP SCHEDULE *
<pnl,color(green,black)>15:24:23 * CONTRACT WITH IBM CORP. *
<pnl,color(green,black)>15:24:23 * *
<pnl,color(green,black)>15:24:23 * * TRADEMARK OF INTERNATIONAL BUSINESS MACHINES. *
<pnl,color(green,black)>15:24:23 ****************************************************************
<pnl,color(green,black)>15:24:23
<pnl,color(green,black)>15:24:23 HCPZCO6718I Using parm disk 1 on volume 310RES (device 080E).
<pnl,color(green,black)>15:24:23 HCPZCO6718I Parm disk resides on cylinders 326 through 360.
<pnl,color(green,black)>15:24:23
<pnl,color(green,black)>15:24:23 HCPZPM2773I Entry point name CPHOSTIN is unknown at this time; processing continues
<pnl,color(green,black)>15:24:23
<pnl,color(green,black)>15:24:23 HCPIIS954I DASD 0713 VOLID 310RES IS A DUPLICATE OF DASD 080E
<pnl,color(green,black)>15:24:23 Start ((Warm Force COLD CLEAN) (DRain) (DIsable) (NODIRect)
<pnl,color(green,black)>15:24:23 (NOAUTOlog)) or (SHUTDOWN)
<pnl,color(green,black)>HCPSED6013A A CP read is pending.
.shutdown
<pnl,color(green,black)>15:24:28 SHUTDOWN
<pnl,color(green,black)>15:24:28 HCPISU961W SYSTEM SHUTDOWN COMPLETE
HHCCP011I CPU0000: Disabled wait state
PSW=00020000 00000000 0000000000000961
HHCTA023I 0:0588 filename=/dev/nst0 (opened), sstat=0x41010000: ON-LINE READY LOAD-POINT
HHCTA299I SCSI-Tape mount-monitoring thread ended;
dev=0:0588, tid=42B2A940, pid=14203
--------------------------------


Paul

PS. This is a legal copy of z/VM under hercules...
Fish
2009-04-11 03:26:12 UTC
Permalink
Post by o***@public.gmane.org
Using CMS DDR, abend MCH005 occurs in z/vm when ddr rewinds and unloads a tape.
<snip>
Post by o***@public.gmane.org
Input/output error
HHCTA023I 0:0588 filename=/dev/nst0 (opened), sstat=0x41010000: ON-LINE READY LOAD-POINT
HHCCP075I 0588:Stat=0E00 Count=0001
HHCCP076I 0588:Sense=10480044 00000020 00000000 00000000 00000000
00000000 HHCCP077I 0588:Sense=EQC ITF NRF
HHCCP050I 0588:SCSW=00C24017 Stat=0E00 Count=0001 CCW=3F975DE0
<pnl,color(green,black)>15:23:47 HCPERP591I TAPE 0588 AN
OPERATION WAS TERMINATED BECAUSE
<pnl,color(green,black)>15:23:47 HCPERP591I DESIRED BLOCK WAS NOT FOUND
<CHOMP!>
Post by o***@public.gmane.org
Paul
Need more information:

1. What host platform are we talking about here? Linux? Windows?
Apple? Other?

2. What version of Hercules is this? (o/p of 'version' command)

3. What model/manufacturer of SCSI tape drive are you using?

4. Does it support 32-bit block-ids? Or 22-bit? (i.e. is it
physically running in 3480/3490E emulation mode, or is it a true 3590
compatible?)

5. How do you have it defined in your Hercules configuration file?

6. If running on Windows, have you verified it functions properly[1]
via 'ftape'? If using Linux, have you verified its proper
functioning[2] via some similar type of SCSI tape testing utility?

7. Try your test again but with the CCW tracing enabled earlier. I'd
like to see the 10-20 i/o's immediately preceding the "HHCTA038E
Backspace file error" that occurs.
Post by o***@public.gmane.org
PS. This is a legal copy of z/VM under hercules...
I couldn't care less.

That's between you can IBM.

- --
"Fish" (David B. Trout) - fish-VLFb7ALKWJGGw+***@public.gmane.org
Fight Spam! Join CAUCE! <http://www.cauce.org/>
7 reasons why HTML email is a bad thing
http://www.georgedillon.com/web/html_email_is_evil.shtml
PGP key fingerprints:
RSA: 6B37 7110 7201 9917 9B0D 99E3 55DB 5D58 FADE 4A52
DH/DSS: 9F9B BAB0 BA7F C458 1A89 FE26 48F5 D7F4 C4EE 3E2A

[1] Basically what you want to do is run it through a series of
simple tests: rew, erg, bsr, fsr, bsf, fsf, read, write, rdblkid,
locate, etc... to make sure that the drive is capable of successfully
executable the same sequence (series) of commands that your guest
operating system is going to be issuing to it.

Please also see:

http://www.hercules-390.org/hercconf.html#Quantum

Even though the options were coded in order to support Quantum DLTs,
the options apply to ANY make/model of SCSI tape drive. If the drive
physically behaves as a 3480/3490 but you define it in Herc as a 3590
(or vice-versa, etc) then it's not going to work right.

The model Hercules is supposed to be emulating is different from the
model that it is physically being used, and Hercues needs accurate
information on both in order for things to work correctly.

[2] Ibid.
o***@public.gmane.org
2009-04-15 20:07:50 UTC
Permalink
Output from version:
Hercules Version 3.06
(c)Copyright 1999-2009 by Roger Bowler, Jan Jaeger, and others
Built on Apr 2 2009 at 17:47:11
Build information:
Modes: S/370 ESA/390 z/Arch
Max CPU Engines: 8
Using setresuid() for setting privileges
Dynamic loading support
Using shared libraries
No External GUI support
HTTP Server support
Regular Expressions support
Automatic Operator support
National Language Support
Machine dependent assists: cmpxchg1 cmpxchg4 cmpxchg8
Running on legmainx86 Linux-2.6.16.60-0.37_f594.#1 SMP Mon Mar 23 13:39:48 UTC 2009 x86_64 MP=2

This is running on SUSE Enterprise LINUX 10, z/VM 3.1, IBM System x3650, with IBM 3490E E11 tape drive.

Hercules is from the April 2, 2009 SVN source build.

Config file: 0588 3490 "/dev/nst0"

I am working on the traces. The problem appears to be with end of tape handling. This has an autoloader on it.

It appears that "scsitape.c" is failing causing the MCH005.

Paul
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Post by o***@public.gmane.org
Using CMS DDR, abend MCH005 occurs in z/vm when ddr rewinds and unloads a tape.
<snip>
Post by o***@public.gmane.org
Input/output error
ON-LINE READY LOAD-POINT
HHCCP075I 0588:Stat=0E00 Count=0001
HHCCP076I 0588:Sense=10480044 00000020 00000000 00000000 00000000
00000000 HHCCP077I 0588:Sense=EQC ITF NRF
HHCCP050I 0588:SCSW=00C24017 Stat=0E00 Count=0001 CCW=3F975DE0
<pnl,color(green,black)>15:23:47 HCPERP591I TAPE 0588 AN
OPERATION WAS TERMINATED BECAUSE
<pnl,color(green,black)>15:23:47 HCPERP591I DESIRED BLOCK WAS NOT FOUND
<CHOMP!>
Post by o***@public.gmane.org
Paul
1. What host platform are we talking about here? Linux? Windows?
Apple? Other?
2. What version of Hercules is this? (o/p of 'version' command)
3. What model/manufacturer of SCSI tape drive are you using?
4. Does it support 32-bit block-ids? Or 22-bit? (i.e. is it
physically running in 3480/3490E emulation mode, or is it a true 3590
compatible?)
5. How do you have it defined in your Hercules configuration file?
6. If running on Windows, have you verified it functions properly[1]
via 'ftape'? If using Linux, have you verified its proper
functioning[2] via some similar type of SCSI tape testing utility?
7. Try your test again but with the CCW tracing enabled earlier. I'd
like to see the 10-20 i/o's immediately preceding the "HHCTA038E
Backspace file error" that occurs.
Post by o***@public.gmane.org
PS. This is a legal copy of z/VM under hercules...
I couldn't care less.
That's between you can IBM.
- --
Fight Spam! Join CAUCE! <http://www.cauce.org/>
7 reasons why HTML email is a bad thing
http://www.georgedillon.com/web/html_email_is_evil.shtml
RSA: 6B37 7110 7201 9917 9B0D 99E3 55DB 5D58 FADE 4A52
DH/DSS: 9F9B BAB0 BA7F C458 1A89 FE26 48F5 D7F4 C4EE 3E2A
[1] Basically what you want to do is run it through a series of
simple tests: rew, erg, bsr, fsr, bsf, fsf, read, write, rdblkid,
locate, etc... to make sure that the drive is capable of successfully
executable the same sequence (series) of commands that your guest
operating system is going to be issuing to it.
http://www.hercules-390.org/hercconf.html#Quantum
Even though the options were coded in order to support Quantum DLTs,
the options apply to ANY make/model of SCSI tape drive. If the drive
physically behaves as a 3480/3490 but you define it in Herc as a 3590
(or vice-versa, etc) then it's not going to work right.
The model Hercules is supposed to be emulating is different from the
model that it is physically being used, and Hercues needs accurate
information on both in order for things to work correctly.
[2] Ibid.
-----BEGIN PGP SIGNATURE-----
Version: PGP 8.1
iQA/AwUBSeAN1Ej11/TE7j4qEQLf2wCffD50Fb8KQPods4YrUnDdL1pgb8wAnjSP
eHQ3nWeu7pmVMkc8Yw7h/Tjd
=dBPH
-----END PGP SIGNATURE-----
o***@public.gmane.org
2009-04-16 18:11:43 UTC
Permalink
I think I have an idea as to what is causing the problem:

If SCSIMOUNT is NO or SCSIMOUNT term expires before the tape can be dismounted and a new tape mounted, then the abends occurs.

Paul
Post by o***@public.gmane.org
Hercules Version 3.06
(c)Copyright 1999-2009 by Roger Bowler, Jan Jaeger, and others
Built on Apr 2 2009 at 17:47:11
Modes: S/370 ESA/390 z/Arch
Max CPU Engines: 8
Using setresuid() for setting privileges
Dynamic loading support
Using shared libraries
No External GUI support
HTTP Server support
Regular Expressions support
Automatic Operator support
National Language Support
Machine dependent assists: cmpxchg1 cmpxchg4 cmpxchg8
Running on legmainx86 Linux-2.6.16.60-0.37_f594.#1 SMP Mon Mar 23 13:39:48 UTC 2009 x86_64 MP=2
This is running on SUSE Enterprise LINUX 10, z/VM 3.1, IBM System x3650, with IBM 3490E E11 tape drive.
Hercules is from the April 2, 2009 SVN source build.
Config file: 0588 3490 "/dev/nst0"
I am working on the traces. The problem appears to be with end of tape handling. This has an autoloader on it.
It appears that "scsitape.c" is failing causing the MCH005.
Paul
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Post by o***@public.gmane.org
Using CMS DDR, abend MCH005 occurs in z/vm when ddr rewinds and unloads a tape.
<snip>
Post by o***@public.gmane.org
Input/output error
ON-LINE READY LOAD-POINT
HHCCP075I 0588:Stat=0E00 Count=0001
HHCCP076I 0588:Sense=10480044 00000020 00000000 00000000 00000000
00000000 HHCCP077I 0588:Sense=EQC ITF NRF
HHCCP050I 0588:SCSW=00C24017 Stat=0E00 Count=0001 CCW=3F975DE0
<pnl,color(green,black)>15:23:47 HCPERP591I TAPE 0588 AN
OPERATION WAS TERMINATED BECAUSE
<pnl,color(green,black)>15:23:47 HCPERP591I DESIRED BLOCK WAS NOT FOUND
<CHOMP!>
Post by o***@public.gmane.org
Paul
1. What host platform are we talking about here? Linux? Windows?
Apple? Other?
2. What version of Hercules is this? (o/p of 'version' command)
3. What model/manufacturer of SCSI tape drive are you using?
4. Does it support 32-bit block-ids? Or 22-bit? (i.e. is it
physically running in 3480/3490E emulation mode, or is it a true 3590
compatible?)
5. How do you have it defined in your Hercules configuration file?
6. If running on Windows, have you verified it functions properly[1]
via 'ftape'? If using Linux, have you verified its proper
functioning[2] via some similar type of SCSI tape testing utility?
7. Try your test again but with the CCW tracing enabled earlier. I'd
like to see the 10-20 i/o's immediately preceding the "HHCTA038E
Backspace file error" that occurs.
Post by o***@public.gmane.org
PS. This is a legal copy of z/VM under hercules...
I couldn't care less.
That's between you can IBM.
- --
Fight Spam! Join CAUCE! <http://www.cauce.org/>
7 reasons why HTML email is a bad thing
http://www.georgedillon.com/web/html_email_is_evil.shtml
RSA: 6B37 7110 7201 9917 9B0D 99E3 55DB 5D58 FADE 4A52
DH/DSS: 9F9B BAB0 BA7F C458 1A89 FE26 48F5 D7F4 C4EE 3E2A
[1] Basically what you want to do is run it through a series of
simple tests: rew, erg, bsr, fsr, bsf, fsf, read, write, rdblkid,
locate, etc... to make sure that the drive is capable of successfully
executable the same sequence (series) of commands that your guest
operating system is going to be issuing to it.
http://www.hercules-390.org/hercconf.html#Quantum
Even though the options were coded in order to support Quantum DLTs,
the options apply to ANY make/model of SCSI tape drive. If the drive
physically behaves as a 3480/3490 but you define it in Herc as a 3590
(or vice-versa, etc) then it's not going to work right.
The model Hercules is supposed to be emulating is different from the
model that it is physically being used, and Hercues needs accurate
information on both in order for things to work correctly.
[2] Ibid.
-----BEGIN PGP SIGNATURE-----
Version: PGP 8.1
iQA/AwUBSeAN1Ej11/TE7j4qEQLf2wCffD50Fb8KQPods4YrUnDdL1pgb8wAnjSP
eHQ3nWeu7pmVMkc8Yw7h/Tjd
=dBPH
-----END PGP SIGNATURE-----
o***@public.gmane.org
2009-04-18 03:35:56 UTC
Permalink
An additional note:
If the SCSI tape drive is empty during IPL of z/VM the IPL will halt with a wait 9000 with the "user signal 1" just prior to the wait.

Paul
Post by o***@public.gmane.org
If SCSIMOUNT is NO or SCSIMOUNT term expires before the tape can be dismounted and a new tape mounted, then the abends occurs.
Paul
Post by o***@public.gmane.org
Hercules Version 3.06
(c)Copyright 1999-2009 by Roger Bowler, Jan Jaeger, and others
Built on Apr 2 2009 at 17:47:11
Modes: S/370 ESA/390 z/Arch
Max CPU Engines: 8
Using setresuid() for setting privileges
Dynamic loading support
Using shared libraries
No External GUI support
HTTP Server support
Regular Expressions support
Automatic Operator support
National Language Support
Machine dependent assists: cmpxchg1 cmpxchg4 cmpxchg8
Running on legmainx86 Linux-2.6.16.60-0.37_f594.#1 SMP Mon Mar 23 13:39:48 UTC 2009 x86_64 MP=2
This is running on SUSE Enterprise LINUX 10, z/VM 3.1, IBM System x3650, with IBM 3490E E11 tape drive.
Hercules is from the April 2, 2009 SVN source build.
Config file: 0588 3490 "/dev/nst0"
I am working on the traces. The problem appears to be with end of tape handling. This has an autoloader on it.
It appears that "scsitape.c" is failing causing the MCH005.
Paul
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Post by o***@public.gmane.org
Using CMS DDR, abend MCH005 occurs in z/vm when ddr rewinds and unloads a tape.
<snip>
Post by o***@public.gmane.org
Input/output error
ON-LINE READY LOAD-POINT
HHCCP075I 0588:Stat=0E00 Count=0001
HHCCP076I 0588:Sense=10480044 00000020 00000000 00000000 00000000
00000000 HHCCP077I 0588:Sense=EQC ITF NRF
HHCCP050I 0588:SCSW=00C24017 Stat=0E00 Count=0001 CCW=3F975DE0
<pnl,color(green,black)>15:23:47 HCPERP591I TAPE 0588 AN
OPERATION WAS TERMINATED BECAUSE
<pnl,color(green,black)>15:23:47 HCPERP591I DESIRED BLOCK WAS NOT FOUND
<CHOMP!>
Post by o***@public.gmane.org
Paul
1. What host platform are we talking about here? Linux? Windows?
Apple? Other?
2. What version of Hercules is this? (o/p of 'version' command)
3. What model/manufacturer of SCSI tape drive are you using?
4. Does it support 32-bit block-ids? Or 22-bit? (i.e. is it
physically running in 3480/3490E emulation mode, or is it a true 3590
compatible?)
5. How do you have it defined in your Hercules configuration file?
6. If running on Windows, have you verified it functions properly[1]
via 'ftape'? If using Linux, have you verified its proper
functioning[2] via some similar type of SCSI tape testing utility?
7. Try your test again but with the CCW tracing enabled earlier. I'd
like to see the 10-20 i/o's immediately preceding the "HHCTA038E
Backspace file error" that occurs.
Post by o***@public.gmane.org
PS. This is a legal copy of z/VM under hercules...
I couldn't care less.
That's between you can IBM.
- --
Fight Spam! Join CAUCE! <http://www.cauce.org/>
7 reasons why HTML email is a bad thing
http://www.georgedillon.com/web/html_email_is_evil.shtml
RSA: 6B37 7110 7201 9917 9B0D 99E3 55DB 5D58 FADE 4A52
DH/DSS: 9F9B BAB0 BA7F C458 1A89 FE26 48F5 D7F4 C4EE 3E2A
[1] Basically what you want to do is run it through a series of
simple tests: rew, erg, bsr, fsr, bsf, fsf, read, write, rdblkid,
locate, etc... to make sure that the drive is capable of successfully
executable the same sequence (series) of commands that your guest
operating system is going to be issuing to it.
http://www.hercules-390.org/hercconf.html#Quantum
Even though the options were coded in order to support Quantum DLTs,
the options apply to ANY make/model of SCSI tape drive. If the drive
physically behaves as a 3480/3490 but you define it in Herc as a 3590
(or vice-versa, etc) then it's not going to work right.
The model Hercules is supposed to be emulating is different from the
model that it is physically being used, and Hercues needs accurate
information on both in order for things to work correctly.
[2] Ibid.
-----BEGIN PGP SIGNATURE-----
Version: PGP 8.1
iQA/AwUBSeAN1Ej11/TE7j4qEQLf2wCffD50Fb8KQPods4YrUnDdL1pgb8wAnjSP
eHQ3nWeu7pmVMkc8Yw7h/Tjd
=dBPH
-----END PGP SIGNATURE-----
Fish
2009-04-19 04:10:14 UTC
Permalink
Post by o***@public.gmane.org
If SCSIMOUNT is NO or SCSIMOUNT term expires before the tape can
be dismounted and a new tape mounted, then the abends occurs.
Paul
Could you clarify that a bit more? What do you mean by "or SCSIMOUNT
term expires"?

Are you saying after the tape is unloaded (presumably by VM) and
Herc's auto-scsi-mount thread kicks in, if the scsimount thread
attempts to query the drive's status (to see if a tape has been
mounted yet) before the drive has had a chance to automatically load
the next cartridge that THAT is when VM abends?!

If that's the case then I suspect there SHOULD be *some* type of tape
i/o error message being issued by Hercules immediately beforehand.
Would you check please?


(p.s. any luck with those CCW traces yet?)

- --
"Fish" (David B. Trout) - fish-VLFb7ALKWJGGw+***@public.gmane.org
Fight Spam! Join CAUCE! <http://www.cauce.org/>
7 reasons why HTML email is a bad thing
http://www.georgedillon.com/web/html_email_is_evil.shtml
PGP key fingerprints:
RSA: 6B37 7110 7201 9917 9B0D 99E3 55DB 5D58 FADE 4A52
DH/DSS: 9F9B BAB0 BA7F C458 1A89 FE26 48F5 D7F4 C4EE 3E2A
Fish
2009-04-19 03:54:06 UTC
Permalink
olis_zvm-***@public.gmane.org wrote:

<snip version info, host info, etc.>

Thanks. It's important to have that info in case we need it, so
thanks for that.
Post by o***@public.gmane.org
I am working on the traces.
Good. It's the CCW traces that I believe will be most useful in
determining what's going on.
Post by o***@public.gmane.org
The problem appears to be with end of tape handling.
Which is why I'm so interested in seeing the CCW traces. :)

Seeing what I/O's the guest is issuing to the drive during its
end-of-tape handling will provide the biggest clue as to what exactly
is going wrong.
Post by o***@public.gmane.org
This has an autoloader on it.
That *could* have something to do with the problem.

Most autoloading SCSI tape drives can be configured in either
"manual" or "system" (automatic) mode or similar, and it's important
that it be configured to the proper mode. In system/automatic mode,
whenever a tape is unloaded the drive itself automatically loads the
next tape without the guest even asking. Usually that's NOT what you
want since more often than not it only serves to confuse the poor
guest since it was expecting the drive to be UNloaded (since it DID
after all just unload the tape).

Check your user manual for your tape drive. You may need to
reconfigure it.
Post by o***@public.gmane.org
It appears that "scsitape.c" is failing causing the MCH005.
How so?


- --
"Fish" (David B. Trout) - fish-VLFb7ALKWJGGw+***@public.gmane.org
Fight Spam! Join CAUCE! <http://www.cauce.org/>
7 reasons why HTML email is a bad thing
http://www.georgedillon.com/web/html_email_is_evil.shtml
PGP key fingerprints:
RSA: 6B37 7110 7201 9917 9B0D 99E3 55DB 5D58 FADE 4A52
DH/DSS: 9F9B BAB0 BA7F C458 1A89 FE26 48F5 D7F4 C4EE 3E2A
o***@public.gmane.org
2009-04-19 19:46:37 UTC
Permalink
The 3490-E11 is configured as AUTO at the present. z/VM does not do SYSTEM mode very well. I have set the SCSIMOUNT value to 99, this seems to mitigate the problem.

As noted earlier, when the tape drive is empty and an IPL is started, the IPL will fail with a similar problem. The same "user signal" is generated followed by the failure.

When the timer value is default or any value less than the time it takes to rewind, unload and load next tape, then the MCH005 abend occurs.

I should have the documentation early next week.

Paul
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
<snip version info, host info, etc.>
Thanks. It's important to have that info in case we need it, so
thanks for that.
Post by o***@public.gmane.org
I am working on the traces.
Good. It's the CCW traces that I believe will be most useful in
determining what's going on.
Post by o***@public.gmane.org
The problem appears to be with end of tape handling.
Which is why I'm so interested in seeing the CCW traces. :)
Seeing what I/O's the guest is issuing to the drive during its
end-of-tape handling will provide the biggest clue as to what exactly
is going wrong.
Post by o***@public.gmane.org
This has an autoloader on it.
That *could* have something to do with the problem.
Most autoloading SCSI tape drives can be configured in either
"manual" or "system" (automatic) mode or similar, and it's important
that it be configured to the proper mode. In system/automatic mode,
whenever a tape is unloaded the drive itself automatically loads the
next tape without the guest even asking. Usually that's NOT what you
want since more often than not it only serves to confuse the poor
guest since it was expecting the drive to be UNloaded (since it DID
after all just unload the tape).
Check your user manual for your tape drive. You may need to
reconfigure it.
Post by o***@public.gmane.org
It appears that "scsitape.c" is failing causing the MCH005.
How so?
- --
Fight Spam! Join CAUCE! <http://www.cauce.org/>
7 reasons why HTML email is a bad thing
http://www.georgedillon.com/web/html_email_is_evil.shtml
RSA: 6B37 7110 7201 9917 9B0D 99E3 55DB 5D58 FADE 4A52
DH/DSS: 9F9B BAB0 BA7F C458 1A89 FE26 48F5 D7F4 C4EE 3E2A
-----BEGIN PGP SIGNATURE-----
Version: PGP 8.1
iQA/AwUBSeqgXkj11/TE7j4qEQJ8rQCeLX8qh00fB1hb69I2qmXePOXWpEMAoMBi
26osJWdVNVh12tg5SjeqZ6Tt
=7t5s
-----END PGP SIGNATURE-----
o***@public.gmane.org
2009-04-20 22:19:11 UTC
Permalink
TRACES - Below are the traces - The problem seems to be unique to DDR. I tried several times to cause the with "TAPE RUN" followed by a "TAPE READ" or something similar and no failure.

However, when DDR changes tapes either due to wrong volume or needing the next tape of a multi-tape set; the abend occurs.

There is still one other condition that causes this, an empty tape drive during IPL of z/VM.

Paul

t+588
HHCPN136I CCW tracing is now on for device 0:0588
0588:synchronous I/O ccw addr 3f981058
HHCCP048I 0588:CCW=04200004 3FBFCC78=>00000000 3F981058 00000000 00000000 .....q..........
HHCCP075I 0588:Stat=0C00 Count=0000 =>40420043 3F981058 00000000 00000000 ....q..........
HHCCP050I 0588:SCSW=00C04007 Stat=0C00 Count=0000 CCW=3F981060
0588:synchronous I/O ccw addr 3f981058
HHCCP048I 0588:CCW=04200004 3FBF9978=>00000000 3F981058 00000000 00000000 .....q..........
HHCTA023I 0:0588 filename=/dev/nst0 (opened), sstat=0x41010000: ON-LINE READY LOAD-POINT
HHCCP075I 0588:Stat=0C00 Count=0000 =>00480000 3F981058 00000000 00000000 .....q..........
HHCCP050I 0588:SCSW=00C04007 Stat=0C00 Count=0000 CCW=3F981060
0588:synchronous I/O ccw addr 000cdea8
HHCCP048I 0588:CCW=07200001 00000000=>000C0000 80026028 06000200 60000200 ......-.....-...
HHCTA100I 0588: Now Displays: "REWINDNG"
HHCTA073E Error rewinding 0:0588=/dev/nst0; errno=5: Input/output error
HHCTA023I 0:0588 filename=/dev/nst0 (opened), sstat=0x41010000: ON-LINE READY LOAD-POINT
HHCTA100I 0588: Now Displays: " READY "
HHCCP075I 0588:Stat=0E00 Count=0001
HHCCP076I 0588:Sense=1048002C 00000020 00000000 00000000 00000000 00000000
HHCCP077I 0588:Sense=EQC ITF NRF
HHCCP050I 0588:SCSW=00C24017 Stat=0E00 Count=0001 CCW=000CDEB0
15:02:42 HCPERP503I TAPE 0588 AN EQUIPMENT CHECK OCCURRED
15:02:42 HCPERP6300I SENSE DATA FORMAT = 02 MSG CODE = 00
15:02:42 HCPERP6301I CHANNEL COMMAND WORD COMMAND CODE = 07
15:02:42 HCPERP6303I SENSE = 1048002C 00000020 00000000 00000000 00000000
15:02:42 HCPERP6303I 00000000 00000000 00000000
15:02:42 HCPERP6304I IRB = 00C24017 000CDEB0 0E000001 00800000
15:02:42 HCPERP6305I USERID = MAINT
15:02:42 HCPERP2216I CHANNEL PATH ID = 05
0588:synchronous I/O ccw addr 3f981058
HHCCP048I 0588:CCW=04200004 3FBF9178=>00000000 3F981058 00000000 00000000 .....q..........
HHCTA023I 0:0588 filename=/dev/nst0 (opened), sstat=0x41010000: ON-LINE READY LOAD-POINT
HHCCP075I 0588:Stat=0C00 Count=0000 =>00480000 3F981058 00000000 00000000 .....q..........
HHCCP050I 0588:SCSW=00C04007 Stat=0C00 Count=0000 CCW=3F981060
0588:synchronous I/O ccw addr 000cdea8
HHCCP048I 0588:CCW=07200001 00000000=>000C0000 80026028 06000200 60000200 ......-.....-...
HHCTA100I 0588: Now Displays: "REWINDNG"
HHCTA100I 0588: Now Displays: " READY "
HHCCP075I 0588:Stat=0C00 Count=0001
HHCCP050I 0588:SCSW=00C04007 Stat=0C00 Count=0001 CCW=000CDEB0
0588:synchronous I/O ccw addr 3f975648
HHCCP048I 0588:CCW=E4200007 074F7641=>00000000 000000FF C3D6D7E8 C9D5C740 ........COPYING
HHCCP075I 0588:Stat=0C00 Count=0000 =>FF349050 349050FF C3D6D7E8 C9D5C740 ...&..&.COPYING
HHCCP050I 0588:SCSW=00C04007 Stat=0C00 Count=0000 CCW=3F975650
0588:synchronous I/O ccw addr 3f975a68
HHCCP048I 0588:CCW=64200040 05D3DDF0=>00000000 00000000 00000000 00000000 ................
HHCCP075I 0588:Stat=0C00 Count=0000 =>34905034 90500000 02C08080 00000000 ..&..&...{......
HHCCP050I 0588:SCSW=00C04007 Stat=0C00 Count=0000 CCW=3F975A70
HHCCP048I 0588:CCW=02200078 05D39B68=>00000000 00000000 00000000 00000000 ................
HHCCP075I 0588:Stat=0C00 Count=0028 =>E5D4E4E3 C9D34040 F1F14040 40404040 VMUTIL 11
HHCCP050I 0588:SCSW=00C04007 Stat=0C00 Count=0028 CCW=3F975650
0588:synchronous I/O ccw addr 3f975a68
HHCCP048I 0588:CCW=2F200001 05D394F8=>2F02F4F8 20000001 0702F500 60000001 ..48......5.-...
HHCTA023I 0:0588 filename=/dev/nst0 (opened), sstat=0x01010000: ON-LINE READY
HHCTA038E Backspace file error on 0:0588=/dev/nst0; errno=5: Input/output error
HHCTA023I 0:0588 filename=/dev/nst0 (opened), sstat=0x41010000: ON-LINE READY LOAD-POINT
HHCCP075I 0588:Stat=0E00 Count=0001
HHCCP076I 0588:Sense=10480044 00000020 00000000 00000000 00000000 00000000
HHCCP077I 0588:Sense=EQC ITF NRF
HHCCP050I 0588:SCSW=00C24017 Stat=0E00 Count=0001 CCW=3F975A70
15:03:17 HCPERP591I TAPE 0588 AN OPERATION WAS TERMINATED BECAUSE
15:03:17 HCPERP591I DESIRED BLOCK WAS NOT FOUND
15:03:17 HCPERP6300I SENSE DATA FORMAT = 02 MSG CODE = 00
15:03:17 HCPERP6301I CHANNEL COMMAND WORD COMMAND CODE = 2F
15:03:17 HCPERP6303I SENSE = 10480044 00000020 00000000 00000000 00000000
15:03:17 HCPERP6303I 00000000 00000000 00000000
15:03:17 HCPERP6304I IRB = 00C24017 3F975A70 0E000001 00800000
15:03:17 HCPERP6305I USERID = MAINT
15:03:17 HCPERP2216I CHANNEL PATH ID = 05
0588:synchronous I/O ccw addr 3f975f38
HHCCP048I 0588:CCW=0F600001 05D39538=>0F02F538 60000001 0302F540 20000001 ..5.-.....5 ....
HHCTA100I 0588: Now Displays: "UNLOADNG"
HHCTA077I Tape 0:0588 unloaded
HHCTA100I 0588: Now Displays: " "
HHCTA023I 0:0588 filename=/dev/nst0 (closed), sstat=0x00040000: OFF-LINE NO-TAPE
HHCTA023I 0:0588 filename=/dev/nst0 (closed), sstat=0x00040000: OFF-LINE NO-TAPE
HHCCP075I 0588:Stat=0C00 Count=0001
HHCCP048I 0588:CCW=03200001 05D39540=>0302F540 20000001 0F02F548 60000001 ..5 ......5.-...
HHCCP075I 0588:Stat=0200 Count=0001
HHCCP076I 0588:Sense=40420043 00000020 00000000 00000000 00000000 00000000
HHCCP077I 0588:Sense=INTREQ ITF WRI
HHCCP050I 0588:SCSW=00C24017 Stat=0200 Count=0001 CCW=3F975F48
0588:synchronous I/O ccw addr 3f975f38
HHCCP048I 0588:CCW=0F600001 05D39538=>0F02F538 60000001 0302F540 20000001 ..5.-.....5 ....
HHCCP075I 0588:Stat=2E00 Count=0001
HHCCP076I 0588:Sense=40420043 00000020 00000000 00000000 00000000 00000000
HHCCP077I 0588:Sense=INTREQ ITF WRI
HHCCP050I 0588:SCSW=00C24017 Stat=2E00 Count=0001 CCW=3F975F40
HHCTA200I SCSI-Tape mount-monitoring thread started;
dev=0:0588, tid=42928940, pri=15, pid=7078
HHCCP048I 0588:CCW=DB600001 10356B58=>08003040 0060548C 00004BE8 00EFDB5C ... .-.....Y...*
HHCCP075I 0588:Stat=0C00 Count=0000
HHCCP048I 0588:CCW=0F200001 199C0000=>03F13023 83F12024 00EFD168 83F127EE .1..c1....J.c1..
HHCCP075I 0588:Stat=2E00 Count=0001
HHCCP076I 0588:Sense=40420043 00000020 00000000 00000000 00000000 00000000
HHCCP077I 0588:Sense=INTREQ ITF WRI
HHCCP050I 0588:SCSW=00C24017 Stat=2E00 Count=0001 CCW=3F975F48
HHCTA023I 0:0588 filename=/dev/nst0 (opened), sstat=0x41010000: ON-LINE READY LOAD-POINT
HHCCP066I DEV0588: attention
HHCCP050I 0588:SCSW=00000011 Stat=0400 Count=0000 CCW=00000000
HHCTA299I SCSI-Tape mount-monitoring thread ended;
dev=0:0588, tid=42928940, pid=7078
0588:synchronous I/O ccw addr 3f975598
HHCCP048I 0588:CCW=E4200007 05CD2641=>00000000 000000FF C3D6D7E8 C9D5C740 ........COPYING
HHCCP075I 0588:Stat=0C00 Count=0000 =>FF349050 349050FF C3D6D7E8 C9D5C740 ...&..&.COPYING
HHCCP050I 0588:SCSW=00C04007 Stat=0C00 Count=0000 CCW=3F9755A0
0588:synchronous I/O ccw addr 3f975f38
HHCCP048I 0588:CCW=64200040 05D39DF0=>00000000 00000000 00000000 00000000 ................
HHCCP075I 0588:Stat=0C00 Count=0000 =>34905034 90500000 02C08080 00000000 ..&..&...{......
HHCCP050I 0588:SCSW=00C04007 Stat=0C00 Count=0000 CCW=3F975F40
HHCCP048I 0588:CCW=02200078 05D3DB68=>00000000 00000000 00000000 00000000 ................
HHCCP075I 0588:Stat=0C00 Count=0000 =>E5C8D940 00000352 000700BB 00000000 VHR ............
HHCCP050I 0588:SCSW=00C04007 Stat=0C00 Count=0000 CCW=3F9755A0
HHCCP048I 0588:CCW=02200078 05D3DB68=>00000000 00000000 00000000 00000000 ................
HHCCP075I 0588:Stat=0C00 Count=0000 =>E3C8D940 000AA050 04000200 03520007 THR ...&........
HHCCP050I 0588:SCSW=00C04007 Stat=0C00 Count=0000 CCW=3F975F40
0588:synchronous I/O ccw addr 3f975598
HHCCP048I 0588:CCW=2F200001 05D3D4F8=>2F02F4F8 20000001 0702F500 60000001 ..48......5.-...
HHCTA023I 0:0588 filename=/dev/nst0 (opened), sstat=0x01010000: ON-LINE READY
HHCTA038E Backspace file error on 0:0588=/dev/nst0; errno=5: Input/output error
HHCTA023I 0:0588 filename=/dev/nst0 (opened), sstat=0x41010000: ON-LINE READY LOAD-POINT
HHCCP075I 0588:Stat=0E00 Count=0001
HHCCP076I 0588:Sense=10480044 00000020 00000000 00000000 00000000 00000000
HHCCP077I 0588:Sense=EQC ITF NRF
HHCCP050I 0588:SCSW=00C24017 Stat=0E00 Count=0001 CCW=3F9755A0
15:04:35 HCPERP591I TAPE 0588 AN OPERATION WAS TERMINATED BECAUSE
15:04:35 HCPERP591I DESIRED BLOCK WAS NOT FOUND
15:04:35 HCPERP6300I SENSE DATA FORMAT = 02 MSG CODE = 00
15:04:35 HCPERP6301I CHANNEL COMMAND WORD COMMAND CODE = 2F
15:04:35 HCPERP6303I SENSE = 10480044 00000020 00000000 00000000 00000000
15:04:35 HCPERP6303I 00000000 00000000 00000000
15:04:35 HCPERP6304I IRB = 00C24017 3F9755A0 0E000001 00800000
15:04:35 HCPERP6305I USERID = MAINT
15:04:35 HCPERP2216I CHANNEL PATH ID = 05
0588:synchronous I/O ccw addr 3f975bc8
HHCCP048I 0588:CCW=0F600001 05D3D538=>0F02F538 60000001 0302F540 20000001 ..5.-.....5 ....
HHCTA100I 0588: Now Displays: "UNLOADNG"
HHCTA077I Tape 0:0588 unloaded
HHCTA100I 0588: Now Displays: " "
HHCTA023I 0:0588 filename=/dev/nst0 (closed), sstat=0x00040000: OFF-LINE NO-TAPE
HHCTA023I 0:0588 filename=/dev/nst0 (closed), sstat=0x00040000: OFF-LINE NO-TAPE
HHCCP075I 0588:Stat=0C00 Count=0001
HHCCP048I 0588:CCW=03200001 05D3D540=>0302F540 20000001 0F02F548 60000001 ..5 ......5.-...
HHCCP075I 0588:Stat=0200 Count=0001
HHCCP076I 0588:Sense=40420043 00000020 00000000 00000000 00000000 00000000
HHCCP077I 0588:Sense=INTREQ ITF WRI
HHCCP050I 0588:SCSW=00C24017 Stat=0200 Count=0001 CCW=3F975BD8
HHCTA200I SCSI-Tape mount-monitoring thread started;
dev=0:0588, tid=42928940, pri=15, pid=7078
0588:synchronous I/O ccw addr 3f975bc8
HHCCP048I 0588:CCW=0F600001 05D3D538=>0F02F538 60000001 0302F540 20000001 ..5.-.....5 ....
HHCCP017I CPU0000: Machine check due to host error: User defined signal 1
PSW=04041000 80000000 00000000000F7C0E INST=1801 LR 0,1 load_register
R0=0588002B00000000 R1=000000000001002B R2=0000000000000001 R3=000000003F98ABF0
R4=00000000001B1F60 R5=0000000000000001 R6=0000000000000002 R7=00000000000AF120
R8=000000003F977928 R9=00000000000F7DC0 RA=000000003F98AB88 RB=0000000000002000
RC=00000000000F6DC0 RD=000000003FBFB800 RE=0000000000195E40 RF=0000000000000000
HCPMCH9101I MACHINE CHECK ON CPU 0000. MCIC = 40000F1D 403B0000
HHCCP055I 0588: Clear subchannel
15:05:03 HCPDMP908I SYSTEM FAILURE ON CPU 0000, CODE - MCH005
HCPDMP9252I DUMPING TO DASD
HCPDMP9260I 25% OF THE DUMP HAS BEEN COMPLETED
HCPDMP9260I 50% OF THE DUMP HAS BEEN COMPLETED
HCPDMP9260I 75% OF THE DUMP HAS BEEN COMPLETED
HCPDMP9261I DUMP COMPLETE
HCPWRP9277I SYSTEM TERMINATION COMPLETE, ATTEMPTING RESTART
CPU0000: SIGP Set architecture mode (12) CPU0000, PARM 000000003F955000: CC 0
HHCCP007I CPU0000 architecture mode set to ESA/390
CPU0000: SIGP Set architecture mode (12) CPU0000, PARM 00000001: CC 0
HHCCP007I CPU0000 architecture mode set to z/Arch
22:05:10 HCPDMP908I SYSTEM FAILURE ON CPU 0000, CODE - IFI004
HCPDMP9250E SYSTEM DUMP FAILURE; NO DUMP DEVICE AVAILABLE
HCPWRP9025W SYSTEM TERMINATION COMPLETE, ENTERING WAIT STATE
HHCCP011I CPU0000: Disabled wait state
PSW=00020000 00000000 0000000000009025
Post by o***@public.gmane.org
The 3490-E11 is configured as AUTO at the present. z/VM does not do SYSTEM mode very well. I have set the SCSIMOUNT value to 99, this seems to mitigate the problem.
As noted earlier, when the tape drive is empty and an IPL is started, the IPL will fail with a similar problem. The same "user signal" is generated followed by the failure.
When the timer value is default or any value less than the time it takes to rewind, unload and load next tape, then the MCH005 abend occurs.
I should have the documentation early next week.
Paul
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
<snip version info, host info, etc.>
Thanks. It's important to have that info in case we need it, so
thanks for that.
Post by o***@public.gmane.org
I am working on the traces.
Good. It's the CCW traces that I believe will be most useful in
determining what's going on.
Post by o***@public.gmane.org
The problem appears to be with end of tape handling.
Which is why I'm so interested in seeing the CCW traces. :)
Seeing what I/O's the guest is issuing to the drive during its
end-of-tape handling will provide the biggest clue as to what exactly
is going wrong.
Post by o***@public.gmane.org
This has an autoloader on it.
That *could* have something to do with the problem.
Most autoloading SCSI tape drives can be configured in either
"manual" or "system" (automatic) mode or similar, and it's important
that it be configured to the proper mode. In system/automatic mode,
whenever a tape is unloaded the drive itself automatically loads the
next tape without the guest even asking. Usually that's NOT what you
want since more often than not it only serves to confuse the poor
guest since it was expecting the drive to be UNloaded (since it DID
after all just unload the tape).
Check your user manual for your tape drive. You may need to
reconfigure it.
Post by o***@public.gmane.org
It appears that "scsitape.c" is failing causing the MCH005.
How so?
- --
Fight Spam! Join CAUCE! <http://www.cauce.org/>
7 reasons why HTML email is a bad thing
http://www.georgedillon.com/web/html_email_is_evil.shtml
RSA: 6B37 7110 7201 9917 9B0D 99E3 55DB 5D58 FADE 4A52
DH/DSS: 9F9B BAB0 BA7F C458 1A89 FE26 48F5 D7F4 C4EE 3E2A
-----BEGIN PGP SIGNATURE-----
Version: PGP 8.1
iQA/AwUBSeqgXkj11/TE7j4qEQJ8rQCeLX8qh00fB1hb69I2qmXePOXWpEMAoMBi
26osJWdVNVh12tg5SjeqZ6Tt
=7t5s
-----END PGP SIGNATURE-----
o***@public.gmane.org
2009-04-20 22:37:36 UTC
Permalink
Console Messages from IPL without tape in drive:
t+588
HHCPN136I CCW tracing is now on for device 0:0588
ipl 80e
HHCPN013I EOF reached on SCRIPT file. Processing complete.
HHCCP007I CPU0000 architecture mode set to ESA/390
HHCCD001I Readahead thread 1 started: tid=40A09940, pid=20107
HHCCD001I Readahead thread 2 started: tid=40C0B940, pid=20107
CPU0000: SIGP Set architecture mode (12) CPU0000, PARM 00000001: CC 0
HHCCP007I CPU0000 architecture mode set to z/Arch
0588:synchronous I/O ccw addr 3f967970
HHCCP048I 0588:CCW=04200018 3F96706E=>00000000 00000000 00000000 00000000 ................
HHCCP075I 0588:Stat=0C00 Count=0000 =>40420000 00000020 00000000 00000000 ...............
HHCCP050I 0588:SCSW=00804007 Stat=0C00 Count=0000 CCW=3F967978
0588:synchronous I/O ccw addr 3f967970
HHCCP048I 0588:CCW=E4200100 3F96706E=>40420000 00000020 00000000 00000000 ...............
HHCCP075I 0588:Stat=0C00 Count=00F9 =>FF349050 34905020 00000000 00000000 ...&..&.........
HHCCP050I 0588:SCSW=00804007 Stat=0C00 Count=00F9 CCW=3F967978
0588:synchronous I/O ccw addr 3f981030
HHCCP048I 0588:CCW=03200001 00000000=>000C0000 80026028 06000200 60000200 ......-.....-...
HHCCP017I CPU0000: Machine check due to host error: User defined signal 1
PSW=04040000 80000000 00000000000F703A INST=182A LR 2,10 load_register
R0=0588002B00000000 R1=000000000001002B R2=0000000000000002 R3=0000000000000000
R4=0000000000277716 R5=000000003F97E008 R6=0000000000000003 R7=0000000080276D88
R8=000000003F977928 R9=00000000000F7DC0 RA=000000003F98A458 RB=0000000000002000
RC=00000000000F6DC0 RD=000000003FBFAC00 RE=0000000080277716 RF=0000000000000000
HHCCP011I CPU0000: Disabled wait state
PSW=00020000 00000000 0000000000009000
---------------------------------------------
o***@public.gmane.org
2009-04-20 22:39:21 UTC
Permalink
Console Message from IPL with Tape in unit: No failure...
t+588
HHCPN136I CCW tracing is now on for device 0:0588
ipl 80e
HHCPN013I EOF reached on SCRIPT file. Processing complete.
HHCCP007I CPU0000 architecture mode set to ESA/390
HHCCD001I Readahead thread 1 started: tid=40403940, pid=20183
HHCCD001I Readahead thread 2 started: tid=40D0C940, pid=20183
CPU0000: SIGP Set architecture mode (12) CPU0000, PARM 00000001: CC 0
HHCCP007I CPU0000 architecture mode set to z/Arch
0588:synchronous I/O ccw addr 3f967970
HHCCP048I 0588:CCW=04200018 3F96706E=>00000000 00000000 00000000 00000000 ................
HHCTA023I 0:0588 filename=/dev/nst0 (opened), sstat=0x41010000: ON-LINE READY LOAD-POINT
HHCCP075I 0588:Stat=0C00 Count=0000 =>00480000 00000020 00000000 00000000 ................
HHCCP050I 0588:SCSW=00804007 Stat=0C00 Count=0000 CCW=3F967978
0588:synchronous I/O ccw addr 3f967970
HHCCP048I 0588:CCW=E4200100 3F96706E=>00480000 00000020 00000000 00000000 ................
HHCCP075I 0588:Stat=0C00 Count=00F9 =>FF349050 34905020 00000000 00000000 ...&..&.........
HHCCP050I 0588:SCSW=00804007 Stat=0C00 Count=00F9 CCW=3F967978
0588:synchronous I/O ccw addr 3f981030
HHCCP048I 0588:CCW=03200001 00000000=>000C0000 80026028 06000200 60000200 ......-.....-...
HHCCP075I 0588:Stat=0C00 Count=0001
HHCCP050I 0588:SCSW=00804007 Stat=0C00 Count=0001 CCW=3F981038
0588:synchronous I/O ccw addr 3f981080
HHCCP048I 0588:CCW=64200040 3F95EE48=>00000034 90000000 00008080 00000000 ................
HHCCP075I 0588:Stat=0C00 Count=0000 =>34905034 90500000 02C08080 00000000 ..&..&...{......
HHCCP050I 0588:SCSW=00804007 Stat=0C00 Count=0000 CCW=3F981088
HHCCP048I 0588:CCW=AF20000C 3F97E978=>80000003 09062003 C411462A 00000000 ........D.......
HHCCP075I 0588:Stat=0C00 Count=0000
HHCCP050I 0588:SCSW=00804007 Stat=0C00 Count=0000 CCW=3F97E978
0588:synchronous I/O ccw addr 3f981030
HHCCP048I 0588:CCW=D4300006 00000000=>000C0000 80026028 06000200 60000200 ......-.....-...
HHCTA023I 0:0588 filename=/dev/nst0 (opened), sstat=0x41010000: ON-LINE READY LOAD-POINT
HHCCP075I 0588:Stat=0E00 Count=0006 =>000C0000 80026028 06000200 60000200 ......-.....-...
HHCCP076I 0588:Sense=80480027 00000020 00000000 00000000 00000000 00000000
HHCCP077I 0588:Sense=CMDREJ ITF NRF
HHCCP050I 0588:SCSW=00824017 Stat=0E00 Count=0006 CCW=3F981038
HHCCP048I 0588:CCW=C700000B 0022CFF0=>00000000 00000000 00000000 00000000 ................
HHCCP075I 0588:Stat=0C00 Count=0000
HHCCP050I 0588:SCSW=00804007 Stat=0C00 Count=0000 CCW=3F981038
HHCCP048I 0588:CCW=DB200001 0022CFE8=>10000000 00000000 00000000 00000000 ................
HHCCP075I 0588:Stat=0C00 Count=0000
HHCCP050I 0588:SCSW=00804007 Stat=0C00 Count=0000 CCW=3F981038
15:32:12 z/VM V3 R1.0 SERVICE LEVEL 0301 (64-BIT)
15:32:12 SYSTEM NUCLEUS CREATED ON 2004-03-13 AT 12:27:06, LOADED FROM 310RES
15:32:12
----------------------------------
o***@public.gmane.org
2009-04-20 22:41:09 UTC
Permalink
This post might be inappropriate. Click to display it.
Fish
2009-04-21 04:09:28 UTC
Permalink
Jan? Ivan? etc...

I'm going to need your help on this one. Thanks.
Post by o***@public.gmane.org
TRACES - Below are the traces - The problem seems to be unique to
DDR. I tried several times to cause the with "TAPE RUN" followed
by a "TAPE READ" or something similar and no failure.
However, when DDR changes tapes either due to wrong volume or
needing the next tape of a multi-tape set; the abend occurs.
There is still one other condition that causes this, an empty
tape drive during IPL of z/VM.
Paul
Okay, after looking at all your traces I have to say this issue is
beginning to look more like an SIE related bug and not a SCSI tape
bug at all.

Basically here's what I think is likely(?) going on (educated guess
based on your traces and looking at the code):

Your VM guest running in SIE mode with WAIT STATE ASSIST enabled. The
tape gets unloaded but it takes longer than 20 seconds to load the
next tape.

(Paul: could you enable Hercules logfile timestamping ("logopt
timestamp" command) and run your test again? I'd like to confirm the
machine check doesn't occur until about 20 seconds after the last
tape i/o.)

Because WAIT STATE ASSIST enabled, the CPU is never exiting out of
SIE mode (regs->sie_active remains '1') and thus the real ("host")
CPU's instruction count never changes ((INSTCOUNT(sysblk.regs[i]) ==
(U64)savecount[i]).

Due to a bug(?) in the "watchdog_thread", it thinks the CPU is
malfunctioning since the host CPU instruction count is never
incrementing; only the SIE guest CPU's instruction count is.

How best to confirm this theory I'm not sure. I'm pretty weak in this
area of Herc. Someone else is going to have to take over from here.
(Jan, Ivan, etc.)

What we may need to do is either increase the watchdog_thread's time
period from 20 seconds to, say, 60 seconds or more...

-OR-

Fix(?) the test its doing (which I'm suspecting may be wrong).


The test in the watchdog_thread that I'm suspecting may be bad is the
following one:


if (IS_CPU_ONLINE(i)
&& sysblk.regs[i]->cpustate == CPUSTATE_STARTED
&& (!WAITSTATE(&sysblk.regs[i]->psw)
#if defined(_FEATURE_WAITSTATE_ASSIST)
&& !(sysblk.regs[i]->sie_active &&
WAITSTATE(&sysblk.regs[i]->guestregs->psw))
#endif


The last condition doesn't appear to be taking wait state assist into
account properly. When wait state assist is enabled, the guest never
exits out of SIE mode and thus after 20 seconds the watchdog_thread
thinks the CPU is malfunctioning and forces a machine check
("HHCCP017I CPU0000: Machine check due to host error: User defined
signal 1") thereby bringing VM down.



Jan?


Ivan?


Anyone??


HELP! :)


Could you either confirm or deny my suspicions and/or take a look at
this issue please?

Thanks!

- --
"Fish" (David B. Trout) - fish-VLFb7ALKWJGGw+***@public.gmane.org
Fight Spam! Join CAUCE! <http://www.cauce.org/>
7 reasons why HTML email is a bad thing
http://www.georgedillon.com/web/html_email_is_evil.shtml
PGP key fingerprints:
RSA: 6B37 7110 7201 9917 9B0D 99E3 55DB 5D58 FADE 4A52
DH/DSS: 9F9B BAB0 BA7F C458 1A89 FE26 48F5 D7F4 C4EE 3E2A
Dave Wade
2009-04-21 06:32:55 UTC
Permalink
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Jan? Ivan? etc...
I'm going to need your help on this one. Thanks.
Post by o***@public.gmane.org
TRACES - Below are the traces - The problem seems to be unique to
DDR. I tried several times to cause the with "TAPE RUN" followed
by a "TAPE READ" or something similar and no failure.
However, when DDR changes tapes either due to wrong volume or
needing the next tape of a multi-tape set; the abend occurs.
There is still one other condition that causes this, an empty
tape drive during IPL of z/VM.
Paul
Okay, after looking at all your traces I have to say this issue is
beginning to look more like an SIE related bug and not a SCSI tape
bug at all.
Basically here's what I think is likely(?) going on (educated guess
Your VM guest running in SIE mode with WAIT STATE ASSIST enabled. The
tape gets unloaded but it takes longer than 20 seconds to load the
next tape.
(Paul: could you enable Hercules logfile timestamping ("logopt
timestamp" command) and run your test again? I'd like to confirm the
machine check doesn't occur until about 20 seconds after the last
tape i/o.)
Because WAIT STATE ASSIST enabled, the CPU is never exiting out of
SIE mode (regs->sie_active remains '1') and thus the real ("host")
CPU's instruction count never changes ((INSTCOUNT(sysblk.regs[i]) ==
(U64)savecount[i]).
Due to a bug(?) in the "watchdog_thread", it thinks the CPU is
malfunctioning since the host CPU instruction count is never
incrementing; only the SIE guest CPU's instruction count is.
How best to confirm this theory I'm not sure. I'm pretty weak in this
area of Herc. Someone else is going to have to take over from here.
(Jan, Ivan, etc.)
What we may need to do is either increase the watchdog_thread's time
period from 20 seconds to, say, 60 seconds or more...
-OR-
Fix(?) the test its doing (which I'm suspecting may be wrong).
The test in the watchdog_thread that I'm suspecting may be bad is the
if (IS_CPU_ONLINE(i)
&& sysblk.regs[i]->cpustate == CPUSTATE_STARTED
&& (!WAITSTATE(&sysblk.regs[i]->psw)
#if defined(_FEATURE_WAITSTATE_ASSIST)
&& !(sysblk.regs[i]->sie_active &&
WAITSTATE(&sysblk.regs[i]->guestregs->psw))
#endif
The last condition doesn't appear to be taking wait state assist into
account properly. When wait state assist is enabled, the guest never
exits out of SIE mode and thus after 20 seconds the watchdog_thread
thinks the CPU is malfunctioning and forces a machine check
("HHCCP017I CPU0000: Machine check due to host error: User defined
signal 1") thereby bringing VM down.
Can I just ask how does this affect an IPL of zVM with an empty drive,
or is this a different problem?
Jan?
Ivan?
Anyone??
HELP! :)
Could you either confirm or deny my suspicions and/or take a look at
this issue please?
Thanks!
- --
Fight Spam! Join CAUCE! <http://www.cauce.org/>
7 reasons why HTML email is a bad thing
http://www.georgedillon.com/web/html_email_is_evil.shtml
RSA: 6B37 7110 7201 9917 9B0D 99E3 55DB 5D58 FADE 4A52
DH/DSS: 9F9B BAB0 BA7F C458 1A89 FE26 48F5 D7F4 C4EE 3E2A
-----BEGIN PGP SIGNATURE-----
Version: PGP 8.1
iQA/AwUBSe1G90j11/TE7j4qEQKmKQCg18d149Uh0gHlPMmq0vE8MWU5faYAoOIN
G4Fi5+cPRF6YjEIrjwJgQgOc
=l6TY
-----END PGP SIGNATURE-----
------------------------------------
http://groups.yahoo.com/group/hercules-390
http://www.hercules-390.org
Yahoo! Groups Links
o***@public.gmane.org
2009-04-21 20:49:59 UTC
Permalink
We might be dealing with two issues.
1) SCSI tape (IBM 3490-E11) attached with no tape mounted during IPL
2) SCSI tape (IBM 3490-E11) use by DDR dismount/remount

If the fix is in the SCSIMOUNT timer, then it will probably fix both. If it is in the SIE area, we still might have another problem.

Logs of the IPL with ptt enabled and logs of DDR are included in the file logs.tar.gz in FILES.

Paul
Post by Dave Wade
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Jan? Ivan? etc...
I'm going to need your help on this one. Thanks.
Post by o***@public.gmane.org
TRACES - Below are the traces - The problem seems to be unique to
DDR. I tried several times to cause the with "TAPE RUN" followed
by a "TAPE READ" or something similar and no failure.
However, when DDR changes tapes either due to wrong volume or
needing the next tape of a multi-tape set; the abend occurs.
There is still one other condition that causes this, an empty
tape drive during IPL of z/VM.
Paul
Okay, after looking at all your traces I have to say this issue is
beginning to look more like an SIE related bug and not a SCSI tape
bug at all.
Basically here's what I think is likely(?) going on (educated guess
Your VM guest running in SIE mode with WAIT STATE ASSIST enabled. The
tape gets unloaded but it takes longer than 20 seconds to load the
next tape.
(Paul: could you enable Hercules logfile timestamping ("logopt
timestamp" command) and run your test again? I'd like to confirm the
machine check doesn't occur until about 20 seconds after the last
tape i/o.)
Because WAIT STATE ASSIST enabled, the CPU is never exiting out of
SIE mode (regs->sie_active remains '1') and thus the real ("host")
CPU's instruction count never changes ((INSTCOUNT(sysblk.regs[i]) ==
(U64)savecount[i]).
Due to a bug(?) in the "watchdog_thread", it thinks the CPU is
malfunctioning since the host CPU instruction count is never
incrementing; only the SIE guest CPU's instruction count is.
How best to confirm this theory I'm not sure. I'm pretty weak in this
area of Herc. Someone else is going to have to take over from here.
(Jan, Ivan, etc.)
What we may need to do is either increase the watchdog_thread's time
period from 20 seconds to, say, 60 seconds or more...
-OR-
Fix(?) the test its doing (which I'm suspecting may be wrong).
The test in the watchdog_thread that I'm suspecting may be bad is the
if (IS_CPU_ONLINE(i)
&& sysblk.regs[i]->cpustate == CPUSTATE_STARTED
&& (!WAITSTATE(&sysblk.regs[i]->psw)
#if defined(_FEATURE_WAITSTATE_ASSIST)
&& !(sysblk.regs[i]->sie_active &&
WAITSTATE(&sysblk.regs[i]->guestregs->psw))
#endif
The last condition doesn't appear to be taking wait state assist into
account properly. When wait state assist is enabled, the guest never
exits out of SIE mode and thus after 20 seconds the watchdog_thread
thinks the CPU is malfunctioning and forces a machine check
("HHCCP017I CPU0000: Machine check due to host error: User defined
signal 1") thereby bringing VM down.
Can I just ask how does this affect an IPL of zVM with an empty drive,
or is this a different problem?
Jan?
Ivan?
Anyone??
HELP! :)
Could you either confirm or deny my suspicions and/or take a look at
this issue please?
Thanks!
- --
Fight Spam! Join CAUCE! <http://www.cauce.org/>
7 reasons why HTML email is a bad thing
http://www.georgedillon.com/web/html_email_is_evil.shtml
RSA: 6B37 7110 7201 9917 9B0D 99E3 55DB 5D58 FADE 4A52
DH/DSS: 9F9B BAB0 BA7F C458 1A89 FE26 48F5 D7F4 C4EE 3E2A
-----BEGIN PGP SIGNATURE-----
Version: PGP 8.1
iQA/AwUBSe1G90j11/TE7j4qEQKmKQCg18d149Uh0gHlPMmq0vE8MWU5faYAoOIN
G4Fi5+cPRF6YjEIrjwJgQgOc
=l6TY
-----END PGP SIGNATURE-----
------------------------------------
http://groups.yahoo.com/group/hercules-390
http://www.hercules-390.org
Yahoo! Groups Links
Fish
2009-04-23 01:09:57 UTC
Permalink
Post by o***@public.gmane.org
We might be dealing with two issues.
1) SCSI tape (IBM 3490-E11) attached with no tape mounted during
IPL 2) SCSI tape (IBM 3490-E11) use by DDR dismount/remount
It's probably one issue that just happens to be triggered in more
than one way.
Post by o***@public.gmane.org
If the fix is in the SCSIMOUNT timer, then it will probably fix
both. If it is in the SIE area, we still might have another
problem.
It's not the SCSIMOUNT timer. The SCSIMOUNT timer simply defines how
frequently Hercules is to query the tape drive for it current status.
Once the drive responds [to Hercules] with a status that a tape is
now finally mounted, Hercules then presents an unsolicited ATTN
interrupt so the guest (z/VM 3.1 in this case).

During the development of this logic (done on Windows) I personally
experienced unusual host behavior wherein the status query to the
drive would take upwards of 10 seconds or more to complete whenever
there was no tape mounted. (Whenever a tape was mounted the query
would complete immediately). Thus there is special logic in Hercules
to: 1) "timeout" its internal query attempt after x #of milliseconds,
and 2) to not query the drive any more frequently than once per
second. These two changes seemed to resolve the issue.

While the guest is waiting for the tape to be mounted however (i.e.
while it's waiting for that attention interrupt to eventually occur),
it's usually either executing other instructions (i.e. it has other
work to do while it's waiting) or else it's sitting there in an
enabled interrupt state (i.e. its psw in a wait state enabled for i/o
interrupts).

When the guest is NOT running in SIE mode, the machine check does not
occur because the CPU is waiting for an interrupt and thus purposely
not executing any instructions.

If the guest *IS* in SIE mode however, then the machine check will
occur IF:

a) the host (non-SIE guest) CPU is NOT in a wait state
(i.e. from Herc's point-of-view it's supposed to be
executing instructions), AND...

b) the SIE-guest CPU is NOT in a wait state, AND...

c) it's been longer than 20 seconds and the instruction
count hasn't changed.

This condition is detected by Herc's "watchdog_thread" responsible
for detecting malfunctioning CPUs (i.e CPUs that are "stuck" and not
actually executing any instructions as they should be). When this
occurs a Machine Check is generated by the watchdog_thread (which is
what you're experiencing).

The idea here is if the CPU is waiting, then the fact that the
instruction count has not changed is completely normal and expected.
It's only when the CPU is *not* in a wait state (and thus *should* be
executing instructions) but the instruction count has not changed
that Herc thinks something is wrong.

In your case, I'm suspecting (but we need to do another trace to be
sure; the last trace I had you do wasn't good enough; sorry) your
z/VM guest is continuing to execute instructions in SIE mode or else
has wait state assist enabled (which causes it to *appear* as if the
SIE guest PSW is never in a wait state), thus opening the window of
opportunity for the bug to get triggered (which it does 20 seconds
later whenever the watchdog_thread notices the instruction count
hasn't changed and the real host cpu is NOT in a wait state AND its
still in SIE mode AND the SIE guest CPU is not in a wait state either
thereby causing the Machine Check).

The more I think about it the more I'm inclined to believe the fix is
NOT in adjusting the condition of the 'if' statement as I originally
thought, but rather in the subsequent test that's checking whether
the instruction count has changed or not. If the host (NON-SIE) CPU
instruction count has not changed (which today is the only condition
that generates the Machine Check), it should THEN check to see
whether or not the SIE guest CPU's instruction count has changed or
not. If NEITHER instruction count has changed THEN something is wrong
and a Machine Check should be generated.

But REGARDLESS of whether the host (non-SIE guest) CPU's instruction
count has changed, if the SIE guest CPUs instruction count has
changed, then everything is still okay and the Machine Check should
NOT be generated.
Post by o***@public.gmane.org
Logs of the IPL with ptt enabled and logs of DDR are included in
the file logs.tar.gz in FILES.
Paul
Which are not good. :(

(My bad. Sorry.)

Try again using:

ptt sie io 12345
hao HHCCP017I
hao ptt
t+588
hao HHCCP017I
hao ipending
hao HHCCP048I 0588:CCW=03200001
hao ipending


The 'ipending' command will display, among other things, the
instruction count for the CPU, and the 'ptt' command will of course
dump the trace table. Both should be issued IMMEDIATELY upon the
Machine Check occurring. Thus the reason for using the Hercules
Automatic Operator (HAO) facility to automatically issue the commands
for us.

Try that and post your results. Thanks.

(p.s. you can delete those other logs. We don't need them any more.)


OH! By the way: just out of curiosity, how's the tape drive defined
in VM? (i.e. DMKRIO)

- --
"Fish" (David B. Trout) - fish-VLFb7ALKWJGGw+***@public.gmane.org
Fight Spam! Join CAUCE! <http://www.cauce.org/>
7 reasons why HTML email is a bad thing
http://www.georgedillon.com/web/html_email_is_evil.shtml
PGP key fingerprints:
RSA: 6B37 7110 7201 9917 9B0D 99E3 55DB 5D58 FADE 4A52
DH/DSS: 9F9B BAB0 BA7F C458 1A89 FE26 48F5 D7F4 C4EE 3E2A
pfg504
2009-04-23 02:54:27 UTC
Permalink
I get the traces tomorrow...

With z/VM we have the option of dynamically adding devices. All of these are sensed during IPL. We have no DMKRIO.

I will look at the RDEVICE block for the device to see what was added.

Paul

PS. I am using the DDR MODULE and not the standalone copy. I am not sure if that would make any difference.
o***@public.gmane.org
2009-04-23 17:38:19 UTC
Permalink
Traces are in a-sie-error.tar.gz and a-scsi-sie-error-2.tar.gz files

z/VM thinks that the tape and controller is a 3490

The first file in a-sie-error-tar.gz is the IPL error.
The file in a-scsi-sie-error-2.tar.gz is the DDR error.

It took several attempts to get the output from the DDR error.

Paul
I'll get the traces tomorrow...
With z/VM we have the option of dynamically adding devices. All of these are sensed during IPL. We have no DMKRIO.
I will look at the RDEVICE block for the device to see what was added.
Paul
PS. I am using the DDR MODULE and not the standalone copy. I am not sure if that would make any difference.
o***@public.gmane.org
2009-04-24 16:36:57 UTC
Permalink
One more item, the problem gets much worse when there are multiple CPUs running on z/VM.

Paul
Post by o***@public.gmane.org
Traces are in a-sie-error.tar.gz and a-scsi-sie-error-2.tar.gz files
z/VM thinks that the tape and controller is a 3490
The first file in a-sie-error-tar.gz is the IPL error.
The file in a-scsi-sie-error-2.tar.gz is the DDR error.
It took several attempts to get the output from the DDR error.
Paul
I'll get the traces tomorrow...
With z/VM we have the option of dynamically adding devices. All of these are sensed during IPL. We have no DMKRIO.
I will look at the RDEVICE block for the device to see what was added.
Paul
PS. I am using the DDR MODULE and not the standalone copy. I am not sure if that would make any difference.
o***@public.gmane.org
2009-04-25 02:25:34 UTC
Permalink
Changing the value in impl.c from 20 to 300 seconds in watchdog, seems to have circumvented this issue.

I ran several tests with 5 CPU running and with only 1 and was not able to cause the abend.

I also timed how long it took to rewind and unload the tapes. Even if the tape is at the very beginning (3490E cart), then it takes about 25-28 seconds to rewind and a unload a tape and another 19 to load the next tape.

During part of this time CP is in HCPIOSRS (RECOVERY START-SUBCHANNEL REQUEST) on a SSCH instruction... hung. That is the PSW address of F7C0E.

For the present, I have a circumvention. Let me know if there is something else to try.

Paul
Post by o***@public.gmane.org
One more item, the problem gets much worse when there are multiple CPUs running on z/VM.
Paul
Post by o***@public.gmane.org
Traces are in a-sie-error.tar.gz and a-scsi-sie-error-2.tar.gz files
z/VM thinks that the tape and controller is a 3490
The first file in a-sie-error-tar.gz is the IPL error.
The file in a-scsi-sie-error-2.tar.gz is the DDR error.
It took several attempts to get the output from the DDR error.
Paul
I'll get the traces tomorrow...
With z/VM we have the option of dynamically adding devices. All of these are sensed during IPL. We have no DMKRIO.
I will look at the RDEVICE block for the device to see what was added.
Paul
PS. I am using the DDR MODULE and not the standalone copy. I am not sure if that would make any difference.
Fish
2009-04-25 03:38:21 UTC
Permalink
Post by o***@public.gmane.org
Changing the value in impl.c from 20 to 300 seconds in watchdog,
seems to have circumvented this issue.
I ran several tests with 5 CPU running and with only 1 and was
not able to cause the abend.
<snip>
Post by o***@public.gmane.org
For the present, I have a circumvention. Let me know if there is something else to try.
Paul
I was wrong. The SIE issue was a red herring.
Post by o***@public.gmane.org
BEGIN PATCH >>>>>>>>>>>>>>>
- --- ../hercules/hstdinc.h 2009-04-12 20:15:46.516875000 -0700
+++ hstdinc.h 2009-04-24 19:43:40.640375000 -0700
@@ -80,6 +80,9 @@
#ifndef O_BINARY
#define O_BINARY 0
#endif
+#ifndef O_NONBLOCK
+#define O_NONBLOCK 0
+#endif
#include <limits.h>
#include <time.h>
#include <sys/stat.h>
- --- ../hercules/scsitape.c 2009-04-04 15:08:24.373875000 -0700
+++ scsitape.c 2009-04-24 20:17:01.789625000 -0700
@@ -90,11 +90,11 @@

/* Open the SCSI tape device */
dev->readonly = 0;
- - rc = open_tape (dev->filename, O_RDWR | O_BINARY);
+ rc = open_tape (dev->filename, O_RDWR | O_BINARY | O_NONBLOCK);
if (rc < 0 && EROFS == errno )
{
dev->readonly = 1;
- - rc = open_tape (dev->filename, O_RDONLY | O_BINARY);
+ rc = open_tape (dev->filename, O_RDONLY | O_BINARY |
O_NONBLOCK);
}

/* Check for successful open */
@@ -169,8 +169,15 @@
int finish_scsitape_open( DEVBLK *dev, BYTE *unitstat, BYTE code )
{
int rc; /* Return code
*/
+int oflags; /* re-open flags
*/
struct mtop opblk; /* Area for MTIOCTOP ioctl
*/

+ /* Switch drive over to BLOCKING-mode i/o... */
+
+ close_tape( dev->fd );
+ oflags = O_BINARY | (dev->readonly ? O_RDONLY : O_RDWR);
+ VERIFY( (dev->fd = open_tape (dev->filename, oflags)) > 0);
+
/* Since a tape was just mounted, reset the blockid back to zero
*/

dev->blockid = 0;
@@ -2038,11 +2045,11 @@
if ( (fd = dev->fd) < 0 )
{
dev->readonly = 0;
- - fd = open_tape (dev->filename, O_RDWR | O_BINARY);
+ fd = open_tape (dev->filename, O_RDWR | O_BINARY |
O_NONBLOCK);
if (fd < 0 && EROFS == errno )
{
dev->readonly = 1;
- - fd = open_tape (dev->filename, O_RDONLY | O_BINARY);
+ fd = open_tape (dev->filename, O_RDONLY | O_BINARY |
O_NONBLOCK);
}

// Check for successful open

<<<<<<<<<<<<<<< END PATCH <<<<<<<<<<<<<<<


I'm unable to test it since I don't have Linux.

I'd appreciate it if one of the Hercules developers would commit it
to the repository once you've verified that it's good.

Thanks.

- --
"Fish" (David B. Trout) - fish-VLFb7ALKWJGGw+***@public.gmane.org
Fight Spam! Join CAUCE! <http://www.cauce.org/>
7 reasons why HTML email is a bad thing
http://www.georgedillon.com/web/html_email_is_evil.shtml
PGP key fingerprints:
RSA: 6B37 7110 7201 9917 9B0D 99E3 55DB 5D58 FADE 4A52
DH/DSS: 9F9B BAB0 BA7F C458 1A89 FE26 48F5 D7F4 C4EE 3E2A
o***@public.gmane.org
2009-04-26 20:12:40 UTC
Permalink
I have applied the patch and changed impl.c back to 20 seconds.

The tests have been rerun using a single processor, multiple processor, under stress (another userid looping) and in a quiesced machine.

The MCH005 has vanished.

Please add the patch to the current subversion source.

Thanks

Paul
pfg504
2009-04-27 15:29:57 UTC
Permalink
There is one other case that is causing the same problem...

I'll traces this afternoon


Paul
Post by o***@public.gmane.org
I have applied the patch and changed impl.c back to 20 seconds.
The tests have been rerun using a single processor, multiple processor, under stress (another userid looping) and in a quiesced machine.
The MCH005 has vanished.
Please add the patch to the current subversion source.
Thanks
Paul
o***@public.gmane.org
2009-04-27 17:42:49 UTC
Permalink
I have another MCH005 ABEND. The failure was caused with a x'07' ccw. The CPU was hung at x'F703A' HCPIOSIL.

Traces loaded as a-scsi-tape-mch005-abend.


Paul

PS. The IPL with an empty drive has also been resolved.
Post by pfg504
There is one other case that is causing the same problem...
I'll traces this afternoon
Paul
Post by o***@public.gmane.org
I have applied the patch and changed impl.c back to 20 seconds.
The tests have been rerun using a single processor, multiple processor, under stress (another userid looping) and in a quiesced machine.
The MCH005 has vanished.
Please add the patch to the current subversion source.
Thanks
Paul
pfg504
2009-04-29 14:46:23 UTC
Permalink
It appears that under some circumstances, the rewind CCW processing waits for the rewind to complete before returning to the caller. In the case of the 3490-E11, it can take 25+ seconds to rewind the tape if it is at EOT.

I have adjusted impl.c to use a 300 second value for the watchdog timer instead of 20. This appears to circumvent the problem.

Paul
Post by o***@public.gmane.org
I have another MCH005 ABEND. The failure was caused with a x'07' ccw. The CPU was hung at x'F703A' HCPIOSIL.
Traces loaded as a-scsi-tape-mch005-abend.
Paul
PS. The IPL with an empty drive has also been resolved.
Post by pfg504
There is one other case that is causing the same problem...
I'll traces this afternoon
Paul
Post by o***@public.gmane.org
I have applied the patch and changed impl.c back to 20 seconds.
The tests have been rerun using a single processor, multiple processor, under stress (another userid looping) and in a quiesced machine.
The MCH005 has vanished.
Please add the patch to the current subversion source.
Thanks
Paul
Ivan Warren
2009-04-29 19:41:48 UTC
Permalink
Post by pfg504
It appears that under some circumstances, the rewind CCW processing waits for the rewind to complete before returning to the caller. In the case of the 3490-E11, it can take 25+ seconds to rewind the tape if it is at EOT.
I have adjusted impl.c to use a 300 second value for the watchdog timer instead of 20. This appears to circumvent the problem.
Paul
Paul,

I talked about this with Fish a bit.. and we were wondering something...

Could you try the following :

********************
Index: channel.c
===================================================================
--- channel.c (revision 5336)
+++ channel.c (working copy)
@@ -2278,8 +2278,7 @@
else if (dev->syncio == 2 && fetch_fw(dev->orb.ccwaddr) < dev->mainlim)
{
dev->code = dev->mainstor[fetch_fw(dev->orb.ccwaddr)];
- syncio = IS_CCW_TIC(dev->code) || IS_CCW_SENSE(dev->code)
- || IS_CCW_IMMEDIATE(dev);
+ syncio = IS_CCW_TIC(dev->code) || IS_CCW_SENSE(dev->code);
}
else
syncio = 0;
********************

And also revert the watchdog to 20 (actually.. maybe try an even smaller
value instead to exercise the code) ?

Thanks,

--Ivan


[Non-text portions of this message have been removed]
PeterH
2009-04-29 20:48:30 UTC
Permalink
Post by pfg504
It appears that under some circumstances, the rewind CCW
processing waits for the rewind to complete before returning to
the caller. In the case of the 3490-E11, it can take 25+ seconds
to rewind the tape if it is at EOT.
The REWIND and UNLOAD CCW is command chained to a NOP or other sense-
type CCW so the OS can be certain that BOT has been reached before
the CCW chain is terminated with CE+DE.

I suppose a more sophisticated implementation could be possible under
ASP/JES3, where pre-mounting of tapes is supported, and, I suppose,
some other method of detecting BOT after a REWIND could be implemented.
Gerhard Postpischil
2009-04-29 22:09:19 UTC
Permalink
Post by Ivan Warren
I talked about this with Fish a bit.. and we were wondering
something...
And also revert the watchdog to 20 (actually.. maybe try an
even smaller value instead to exercise the code) ?
It's only indirectly relevant (?), but about ten years ago I was
running OS/390 2.5 on a P/390. My work was regularly backed up
on a channel attached 3480 using an ADRDSSU job (data set mode).
After a while it started abending consistently, when a second
volume was required, and the I/O handler terminated the job
while the cartridge was rewinding. I was a consultant with no
access to the system, and was told that IBM provided some fix
consisting of increasing a time delay value (MIH? or similar).
This would seem to be a variation on the theme?


Gerhard Postpischil
Bradford, VT
o***@public.gmane.org
2009-04-29 22:17:50 UTC
Permalink
GP:
I have the MIT set off on this z/VM. This appears to be a related to the watchdog timer and long running functions like the rewind command to the tape.
The CPU waits at HCPIOSIL until the rewind is completed. Watchdog expects that the instruction counter should increase within 20 seconds. If is does not, it thinks that the CPU is broken and causes the MCH005.
Setting the watchdog up to a high value takes care of the abend but does not resolve the problem. Why are we hanging on a rewind of a physical tape?
Thats what is being worked on.

Paul
Post by Gerhard Postpischil
Post by Ivan Warren
I talked about this with Fish a bit.. and we were wondering
something...
And also revert the watchdog to 20 (actually.. maybe try an
even smaller value instead to exercise the code) ?
It's only indirectly relevant (?), but about ten years ago I was
running OS/390 2.5 on a P/390. My work was regularly backed up
on a channel attached 3480 using an ADRDSSU job (data set mode).
After a while it started abending consistently, when a second
volume was required, and the I/O handler terminated the job
while the cartridge was rewinding. I was a consultant with no
access to the system, and was told that IBM provided some fix
consisting of increasing a time delay value (MIH? or similar).
This would seem to be a variation on the theme?
Gerhard Postpischil
Bradford, VT
Ivan Warren
2009-04-29 22:58:03 UTC
Permalink
This post might be inappropriate. Click to display it.
o***@public.gmane.org
2009-04-29 23:14:09 UTC
Permalink
I have applied the patch and tested as before (Watchdog=3) and no failure. I am doing a longer test now. I had to lower the SCSIMOUNT value backdown to 5; at 99 it took 5 minutes to recognize the tape change.

So far, so good.

Thanks
Paul
Post by Ivan Warren
Post by Gerhard Postpischil
It's only indirectly relevant (?), but about ten years ago I was
running OS/390 2.5 on a P/390. My work was regularly backed up
on a channel attached 3480 using an ADRDSSU job (data set mode).
After a while it started abending consistently, when a second
volume was required, and the I/O handler terminated the job
while the cartridge was rewinding. I was a consultant with no
access to the system, and was told that IBM provided some fix
consisting of increasing a time delay value (MIH? or similar).
This would seem to be a variation on the theme?
Not exactly..
See, there is something in hercules called 'Synchronous I/O'..
This was something implemented a long time ago (in a galaxy far far
away) to help improve performance when spawning a thread (aka
lightweight process) was considered a heavy operation.
The rationale behind it is as follows : for *some* I/Os, (and
originally, this only applied to DASD I/Os), it is advantageous to NOT
spawn a thread. Cases, for example, are when performing a seek (which is
a purely logical action on an emulated DASD), or when hitting the
CKD/CCKD cache.. Well all those operations where there is a very limited
amount of wall time involved in performing the operation.
In those cases, the CCW is executed in the device handler in the SAME
thread as the CPU (that what Synch I/O is in hercules) - and in those
cases, eliminating the thread creation and/or thread wakeup (for those
systems where we have a pool of I/O threads ready to service I/Os
instead of spawning threads everytime) is beneficial. The DASD handler
has a method to self-terminate synchronous I/O mode. In this case the
DASD handler exits by notifying channel.c that it couldn't complete
synchronously and asks to be re-invoked from within a separate thread.
Typically happens when requesting a track but the track is not in cache.
So basically, you enter the SIO/SSCH instruction, the I/O gets
performed, and when the I/O completes, you resume with the next instruction.
Now.. Sometime, someplace.. someone decided to do the same for Tape
I/Os... so far so good.. Only for tape I/Os, only a certain number of
operations would benefit from that : Basically those operations that are
actually dummy calls (like set mode on an AWS file or stuff like that)..
So someone came up and said.. Ok.. let's do this for TIC (ok).. Sense
(fine).. and... Immediate Commands (Oops!)
Now comes SCSI tapes !
Mind it.. "Immediate Command", counter-intuitively does not mean the
command takes a very short amount of time to complete ! "Immediate
Command", in the S/370, ESA and z world has a *very* specific meaning
which is completely different ! It means : A Control Write (xxxx xx11)
command for which the control unit will present a Channel End status
before having requested any data from the channel[1] ! Actually, an
'Enable' (X'27') on a Dial in telecommunication line)
One example for tapes : Rewind !
So... Now, you're starting to see the picture !
We're entering SIO/SSCH.. 1st CCW in the chain is.. Rewind.. It's an
immediate command, so we enter synchronous I/O mode, and the rewind is
executed in the SAME thread as the CPU.. As long as the Rewind is taking
place, the CPU is tied up executing the SIO/SSCH.. If the Rewind takes
more than 20 seconds, Mr Watchdog comes up and says : What the heck are
you doing CPU ? You're not in a wait state and you haven't executed an
instruction in 20 seconds ? you're broken.. Let me wake you up : Signal
1 !... Exigent Machine Check (Probably Instruction Damage).. Occurring
in Supervisor mode under VM : Abend MCH005 !
Increasing the watchdog value only hides the problem.. We shouldn't be
stuck in the CPU for long term I/Os to start with !
The proposed patch submitted for your testing is to validate that theory !
--Ivan
[1] The reason for which we NEED to know if a command is an immediate
command is to avoid generating a channel program check if the data
portion of the CCW points to non-accessible storage. In this case, the
POO is clear : No Channel Program Check or Channel Protection Check is
generated if the channel doesn't attempt to access the storage area -
which only happens if the Control Unit requests the data. In the case of
an immediate command, we KNOW by definition, that no data will be
requested - thus, we do not attempt to use the Data Address portion of
the CCW in the case of immediate commands. We also need to know because
the ESA and z "Incorrect Length Suppression Facility" states that no
Incorrect Length indication is reported EVEN if the SLI bit is not set
in the CCW when the L bit in the ORB is one and the CCW is an immediate
command !
[Non-text portions of this message have been removed]
Fish
2009-04-29 23:24:25 UTC
Permalink
Post by Ivan Warren
See, there is something in hercules called 'Synchronous I/O'..
<snip details>

Well said. :)

- --
"Fish" (David B. Trout) - fish-VLFb7ALKWJGGw+***@public.gmane.org
Fight Spam! Join CAUCE! <http://www.cauce.org/>
7 reasons why HTML email is a bad thing
http://www.georgedillon.com/web/html_email_is_evil.shtml
PGP key fingerprints:
RSA: 6B37 7110 7201 9917 9B0D 99E3 55DB 5D58 FADE 4A52
DH/DSS: 9F9B BAB0 BA7F C458 1A89 FE26 48F5 D7F4 C4EE 3E2A
Greg Smith
2009-04-30 01:14:24 UTC
Permalink
Post by Fish
Post by Ivan Warren
See, there is something in hercules called 'Synchronous I/O'..
<snip details>
Well said. :)
When I wrote the syncio stuff we were dealing with emulated dasd and not
some physical device. So the syncio executed in the cpu thread until we
got to a ccw that might actually cause a host wait. In that case we
resumed the i/o request in its own thread.

It seemed like a good idea at the time ;-)

Greg
Fish
2009-04-30 01:21:30 UTC
Permalink
Greg Smith wrote:

[...]
Post by Greg Smith
It seemed like a good idea at the time ;-)
It always does, doesn't it? ;-)

- --
"Fish" (David B. Trout) - fish-VLFb7ALKWJGGw+***@public.gmane.org
Fight Spam! Join CAUCE! <http://www.cauce.org/>
7 reasons why HTML email is a bad thing
http://www.georgedillon.com/web/html_email_is_evil.shtml
PGP key fingerprints:
RSA: 6B37 7110 7201 9917 9B0D 99E3 55DB 5D58 FADE 4A52
DH/DSS: 9F9B BAB0 BA7F C458 1A89 FE26 48F5 D7F4 C4EE 3E2A
Greg Smith
2009-04-30 02:09:37 UTC
Permalink
Post by Fish
[...]
Post by Greg Smith
It seemed like a good idea at the time ;-)
It always does, doesn't it? ;-)
At least for the dasd devices I added the `nosyncio' option. Maybe that
should be done too for the other syncio eligible devices??

Greg
Fish
2009-04-30 16:27:35 UTC
Permalink
Greg Smith wrote:

[...]
Post by Greg Smith
At least for the dasd devices I added the `nosyncio' option.
Maybe that should be done too for the other syncio eligible
devices??
While I think Ivan is wanting a "mixed" approach right now (syncio
preferred if possible (i.e. by default) but nosyncio for those i/o's
(CCWs) we know could take a long while (e.g. REW, etc)), having
support for a blanket 'nosyncio' override option to force all i/o to
be non-synchronous would certainly be handy to have (and would
probably have come in quite handy in this particular case too).

Are you volunteering to do it? :)

- --
"Fish" (David B. Trout) - fish-VLFb7ALKWJGGw+***@public.gmane.org
Fight Spam! Join CAUCE! <http://www.cauce.org/>
7 reasons why HTML email is a bad thing
http://www.georgedillon.com/web/html_email_is_evil.shtml
PGP key fingerprints:
RSA: 6B37 7110 7201 9917 9B0D 99E3 55DB 5D58 FADE 4A52
DH/DSS: 9F9B BAB0 BA7F C458 1A89 FE26 48F5 D7F4 C4EE 3E2A
Ivan Warren
2009-04-30 19:32:43 UTC
Permalink
Post by Fish
Are you volunteering to do it? :)
Ok.. The *ONLY* CCW command codes that are now left eligible for sync
I/O are :

- TIC
- SENSE

(and these must be the 1st CCW in a CCW chain anyway !)

So my question is :

Do we want to keep sync I/O altogether for tape I/O ?

My take on this is that it's too much trouble considering handfull of
microseconds we will gain on the occasional sense being issued to a tape
device !

--Ivan


[Non-text portions of this message have been removed]
Gerhard Postpischil
2009-04-30 00:04:10 UTC
Permalink
Post by Ivan Warren
Post by Gerhard Postpischil
It's only indirectly relevant (?), but about ten years ago I was
See, there is something in hercules called 'Synchronous I/O'..
I should have been more explicit. My point was that there
appears to be a timing problem related to physical tape, and
that this may get encountered again even if the Hercules code is
changed. For MVS we have the source and it should be fairly easy
to check and track down; for newer systems a fix may or may not
be available, and it merits testing.

Gerhard Postpischil
Bradford, VT
Harold Grovesteen
2009-04-30 11:04:29 UTC
Permalink
There is a syncio bit in the regs structure that is also needed for
synchronous I/O. I ran into this while researching how to implement
DIAGNOSE X'250'. This bit causes the CPU to appear to be in the wait
state. Does this bit also need to be set and is it being set to avoid
the machine check? Does the watchdog timer need to recognize this
condition?

Just a random thought, may be off base. Have to head off to work and do
not have the time right now to research further.

Harold Grovesteen
Post by Ivan Warren
Post by Gerhard Postpischil
It's only indirectly relevant (?), but about ten years ago I was
running OS/390 2.5 on a P/390. My work was regularly backed up
on a channel attached 3480 using an ADRDSSU job (data set mode).
After a while it started abending consistently, when a second
volume was required, and the I/O handler terminated the job
while the cartridge was rewinding. I was a consultant with no
access to the system, and was told that IBM provided some fix
consisting of increasing a time delay value (MIH? or similar).
This would seem to be a variation on the theme?
Not exactly..
See, there is something in hercules called 'Synchronous I/O'..
This was something implemented a long time ago (in a galaxy far far
away) to help improve performance when spawning a thread (aka
lightweight process) was considered a heavy operation.
The rationale behind it is as follows : for *some* I/Os, (and
originally, this only applied to DASD I/Os), it is advantageous to NOT
spawn a thread. Cases, for example, are when performing a seek (which is
a purely logical action on an emulated DASD), or when hitting the
CKD/CCKD cache.. Well all those operations where there is a very limited
amount of wall time involved in performing the operation.
In those cases, the CCW is executed in the device handler in the SAME
thread as the CPU (that what Synch I/O is in hercules) - and in those
cases, eliminating the thread creation and/or thread wakeup (for those
systems where we have a pool of I/O threads ready to service I/Os
instead of spawning threads everytime) is beneficial. The DASD handler
has a method to self-terminate synchronous I/O mode. In this case the
DASD handler exits by notifying channel.c that it couldn't complete
synchronously and asks to be re-invoked from within a separate thread.
Typically happens when requesting a track but the track is not in cache.
So basically, you enter the SIO/SSCH instruction, the I/O gets
performed, and when the I/O completes, you resume with the next instruction.
Now.. Sometime, someplace.. someone decided to do the same for Tape
I/Os... so far so good.. Only for tape I/Os, only a certain number of
operations would benefit from that : Basically those operations that are
actually dummy calls (like set mode on an AWS file or stuff like that)..
So someone came up and said.. Ok.. let's do this for TIC (ok).. Sense
(fine).. and... Immediate Commands (Oops!)
Now comes SCSI tapes !
Mind it.. "Immediate Command", counter-intuitively does not mean the
command takes a very short amount of time to complete ! "Immediate
Command", in the S/370, ESA and z world has a *very* specific meaning
which is completely different ! It means : A Control Write (xxxx xx11)
command for which the control unit will present a Channel End status
before having requested any data from the channel[1] ! Actually, an
'Enable' (X'27') on a Dial in telecommunication line)
One example for tapes : Rewind !
So... Now, you're starting to see the picture !
We're entering SIO/SSCH.. 1st CCW in the chain is.. Rewind.. It's an
immediate command, so we enter synchronous I/O mode, and the rewind is
executed in the SAME thread as the CPU.. As long as the Rewind is taking
place, the CPU is tied up executing the SIO/SSCH.. If the Rewind takes
more than 20 seconds, Mr Watchdog comes up and says : What the heck are
you doing CPU ? You're not in a wait state and you haven't executed an
instruction in 20 seconds ? you're broken.. Let me wake you up : Signal
1 !... Exigent Machine Check (Probably Instruction Damage).. Occurring
in Supervisor mode under VM : Abend MCH005 !
Increasing the watchdog value only hides the problem.. We shouldn't be
stuck in the CPU for long term I/Os to start with !
The proposed patch submitted for your testing is to validate that theory !
--Ivan
[1] The reason for which we NEED to know if a command is an immediate
command is to avoid generating a channel program check if the data
portion of the CCW points to non-accessible storage. In this case, the
POO is clear : No Channel Program Check or Channel Protection Check is
generated if the channel doesn't attempt to access the storage area -
which only happens if the Control Unit requests the data. In the case of
an immediate command, we KNOW by definition, that no data will be
requested - thus, we do not attempt to use the Data Address portion of
the CCW in the case of immediate commands. We also need to know because
the ESA and z "Incorrect Length Suppression Facility" states that no
Incorrect Length indication is reported EVEN if the SLI bit is not set
in the CCW when the L bit in the ORB is one and the CCW is an immediate
command !
[Non-text portions of this message have been removed]
Greg Smith
2009-04-30 23:17:52 UTC
Permalink
Post by Harold Grovesteen
There is a syncio bit in the regs structure that is also needed for
synchronous I/O. I ran into this while researching how to implement
DIAGNOSE X'250'. This bit causes the CPU to appear to be in the wait
state. Does this bit also need to be set and is it being set to avoid
the machine check? Does the watchdog timer need to recognize this
condition?
That's not a bad idea.

Strictly speaking, the syncio bit in REGS does not make the emulated CPU
appear to be in a wait state; that is, the wait bit in the PSW is not
on. It does make the cpu thread look like it's `waiting' to the
SYNCHRONIZE_CPUS macro in this sense: `intlock' is held when a cpu
thread issues SYNCHRONIZE_CPUS; if another cpu thread has the syncio bit
turned on in REGS then that thread will be unable to turn off the syncio
bit because it is guarded by `intlock'.

However, the intent of syncio is allow ccws to execute in the cpu thread
without causing an implicit wait. Waiting for a tape to rewind for over
20 seconds is out of that scope.

What I think should happen is that when the tape rewind ccw is processed
by the tape device handler when syncio is active, then the device
handler should turn on the DEVBLK syncio_retry bit and return so the
rewind ccw will be executed in an asynchronous device thread.

Greg
Ivan Warren
2009-04-30 23:51:09 UTC
Permalink
Post by Greg Smith
What I think should happen is that when the tape rewind ccw is processed
by the tape device handler when syncio is active, then the device
handler should turn on the DEVBLK syncio_retry bit and return so the
rewind ccw will be executed in an asynchronous device thread.
Greg
I still believe Sync I/O for tapes is superfluous !

--Ivan


[Non-text portions of this message have been removed]
Greg Smith
2009-05-01 00:26:31 UTC
Permalink
Post by Ivan Warren
I still believe Sync I/O for tapes is superfluous !
What he said! Who added syncio for tapes anyway... maybe they could
explain.

Greg
Ivan Warren
2009-05-01 01:35:38 UTC
Permalink
Post by Greg Smith
Post by Ivan Warren
I still believe Sync I/O for tapes is superfluous !
What he said! Who added syncio for tapes anyway... maybe they could
explain.
Greg
It may have been a good idea at one time.. unfortunately :

- Tape I/O data is not cached
- With the advent of SCSI tape, sync I/O has become a source of problems

So basically, I don't care who initiated the idea behind sync I/O for
tape (for all I know, it could have been You, Fish, Jan, Roger, Jay,
Bernard.. or Me !)..

So whodunit doesn't really matter..

I say : let's be done with it and remove it (2 line removal in
channel.c - easy to rollback if someone finds a good reason for tape
syncio)..

--Ivan


[Non-text portions of this message have been removed]
pfg504
2009-05-01 02:27:25 UTC
Permalink
As a note here:

During the testing with the 3490-E11 with the autoloader, I received an I/O error during a tape switch which I believe was associated with a SCSIMOUNT value being too small. I changed it from 5 to 10 and was not able to reproduce the problem. I will run a couple of more test, lowering the value to see if the error can be recreated.

Paul

PS. thanks for all of the help.
Post by Ivan Warren
Post by Greg Smith
Post by Ivan Warren
I still believe Sync I/O for tapes is superfluous !
What he said! Who added syncio for tapes anyway... maybe they could
explain.
Greg
- Tape I/O data is not cached
- With the advent of SCSI tape, sync I/O has become a source of problems
So basically, I don't care who initiated the idea behind sync I/O for
tape (for all I know, it could have been You, Fish, Jan, Roger, Jay,
Bernard.. or Me !)..
So whodunit doesn't really matter..
I say : let's be done with it and remove it (2 line removal in
channel.c - easy to rollback if someone finds a good reason for tape
syncio)..
--Ivan
[Non-text portions of this message have been removed]
liamcold
2009-04-23 04:39:36 UTC
Permalink
What is the configuration parameter that we can use to disable all this SCSIAUTOMOUNT processing?
I have also had "ISSUES" with this in the past and would like to be able to disable this entire "OPTION".
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Post by o***@public.gmane.org
We might be dealing with two issues.
1) SCSI tape (IBM 3490-E11) attached with no tape mounted during
IPL 2) SCSI tape (IBM 3490-E11) use by DDR dismount/remount
It's probably one issue that just happens to be triggered in more
than one way.
Post by o***@public.gmane.org
If the fix is in the SCSIMOUNT timer, then it will probably fix
both. If it is in the SIE area, we still might have another
problem.
It's not the SCSIMOUNT timer. The SCSIMOUNT timer simply defines how
frequently Hercules is to query the tape drive for it current status.
Once the drive responds [to Hercules] with a status that a tape is
now finally mounted, Hercules then presents an unsolicited ATTN
interrupt so the guest (z/VM 3.1 in this case).
During the development of this logic (done on Windows) I personally
experienced unusual host behavior wherein the status query to the
drive would take upwards of 10 seconds or more to complete whenever
there was no tape mounted. (Whenever a tape was mounted the query
would complete immediately). Thus there is special logic in Hercules
to: 1) "timeout" its internal query attempt after x #of milliseconds,
and 2) to not query the drive any more frequently than once per
second. These two changes seemed to resolve the issue.
While the guest is waiting for the tape to be mounted however (i.e.
while it's waiting for that attention interrupt to eventually occur),
it's usually either executing other instructions (i.e. it has other
work to do while it's waiting) or else it's sitting there in an
enabled interrupt state (i.e. its psw in a wait state enabled for i/o
interrupts).
When the guest is NOT running in SIE mode, the machine check does not
occur because the CPU is waiting for an interrupt and thus purposely
not executing any instructions.
If the guest *IS* in SIE mode however, then the machine check will
a) the host (non-SIE guest) CPU is NOT in a wait state
(i.e. from Herc's point-of-view it's supposed to be
executing instructions), AND...
b) the SIE-guest CPU is NOT in a wait state, AND...
c) it's been longer than 20 seconds and the instruction
count hasn't changed.
This condition is detected by Herc's "watchdog_thread" responsible
for detecting malfunctioning CPUs (i.e CPUs that are "stuck" and not
actually executing any instructions as they should be). When this
occurs a Machine Check is generated by the watchdog_thread (which is
what you're experiencing).
The idea here is if the CPU is waiting, then the fact that the
instruction count has not changed is completely normal and expected.
It's only when the CPU is *not* in a wait state (and thus *should* be
executing instructions) but the instruction count has not changed
that Herc thinks something is wrong.
In your case, I'm suspecting (but we need to do another trace to be
sure; the last trace I had you do wasn't good enough; sorry) your
z/VM guest is continuing to execute instructions in SIE mode or else
has wait state assist enabled (which causes it to *appear* as if the
SIE guest PSW is never in a wait state), thus opening the window of
opportunity for the bug to get triggered (which it does 20 seconds
later whenever the watchdog_thread notices the instruction count
hasn't changed and the real host cpu is NOT in a wait state AND its
still in SIE mode AND the SIE guest CPU is not in a wait state either
thereby causing the Machine Check).
The more I think about it the more I'm inclined to believe the fix is
NOT in adjusting the condition of the 'if' statement as I originally
thought, but rather in the subsequent test that's checking whether
the instruction count has changed or not. If the host (NON-SIE) CPU
instruction count has not changed (which today is the only condition
that generates the Machine Check), it should THEN check to see
whether or not the SIE guest CPU's instruction count has changed or
not. If NEITHER instruction count has changed THEN something is wrong
and a Machine Check should be generated.
But REGARDLESS of whether the host (non-SIE guest) CPU's instruction
count has changed, if the SIE guest CPUs instruction count has
changed, then everything is still okay and the Machine Check should
NOT be generated.
Post by o***@public.gmane.org
Logs of the IPL with ptt enabled and logs of DDR are included in
the file logs.tar.gz in FILES.
Paul
Which are not good. :(
(My bad. Sorry.)
ptt sie io 12345
hao HHCCP017I
hao ptt
t+588
hao HHCCP017I
hao ipending
hao HHCCP048I 0588:CCW=03200001
hao ipending
The 'ipending' command will display, among other things, the
instruction count for the CPU, and the 'ptt' command will of course
dump the trace table. Both should be issued IMMEDIATELY upon the
Machine Check occurring. Thus the reason for using the Hercules
Automatic Operator (HAO) facility to automatically issue the commands
for us.
Try that and post your results. Thanks.
(p.s. you can delete those other logs. We don't need them any more.)
OH! By the way: just out of curiosity, how's the tape drive defined
in VM? (i.e. DMKRIO)
- --
Fight Spam! Join CAUCE! <http://www.cauce.org/>
7 reasons why HTML email is a bad thing
http://www.georgedillon.com/web/html_email_is_evil.shtml
RSA: 6B37 7110 7201 9917 9B0D 99E3 55DB 5D58 FADE 4A52
DH/DSS: 9F9B BAB0 BA7F C458 1A89 FE26 48F5 D7F4 C4EE 3E2A
-----BEGIN PGP SIGNATURE-----
Version: PGP 8.1
iQA/AwUBSe+/5Uj11/TE7j4qEQKkJACgthFUt6UfnCirC+wl4dQ7gM4ojEwAnAxm
amYmYC5i7Bhznpzd0e23mIG2
=AU8m
-----END PGP SIGNATURE-----
o***@public.gmane.org
2009-04-23 17:39:57 UTC
Permalink
Did you try AUTO_SCSI_MOUNT NO ?

Paul
Post by liamcold
What is the configuration parameter that we can use to disable all this SCSIAUTOMOUNT processing?
I have also had "ISSUES" with this in the past and would like to be able to disable this entire "OPTION".
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Post by o***@public.gmane.org
We might be dealing with two issues.
1) SCSI tape (IBM 3490-E11) attached with no tape mounted during
IPL 2) SCSI tape (IBM 3490-E11) use by DDR dismount/remount
It's probably one issue that just happens to be triggered in more
than one way.
Post by o***@public.gmane.org
If the fix is in the SCSIMOUNT timer, then it will probably fix
both. If it is in the SIE area, we still might have another
problem.
It's not the SCSIMOUNT timer. The SCSIMOUNT timer simply defines how
frequently Hercules is to query the tape drive for it current status.
Once the drive responds [to Hercules] with a status that a tape is
now finally mounted, Hercules then presents an unsolicited ATTN
interrupt so the guest (z/VM 3.1 in this case).
During the development of this logic (done on Windows) I personally
experienced unusual host behavior wherein the status query to the
drive would take upwards of 10 seconds or more to complete whenever
there was no tape mounted. (Whenever a tape was mounted the query
would complete immediately). Thus there is special logic in Hercules
to: 1) "timeout" its internal query attempt after x #of milliseconds,
and 2) to not query the drive any more frequently than once per
second. These two changes seemed to resolve the issue.
While the guest is waiting for the tape to be mounted however (i.e.
while it's waiting for that attention interrupt to eventually occur),
it's usually either executing other instructions (i.e. it has other
work to do while it's waiting) or else it's sitting there in an
enabled interrupt state (i.e. its psw in a wait state enabled for i/o
interrupts).
When the guest is NOT running in SIE mode, the machine check does not
occur because the CPU is waiting for an interrupt and thus purposely
not executing any instructions.
If the guest *IS* in SIE mode however, then the machine check will
a) the host (non-SIE guest) CPU is NOT in a wait state
(i.e. from Herc's point-of-view it's supposed to be
executing instructions), AND...
b) the SIE-guest CPU is NOT in a wait state, AND...
c) it's been longer than 20 seconds and the instruction
count hasn't changed.
This condition is detected by Herc's "watchdog_thread" responsible
for detecting malfunctioning CPUs (i.e CPUs that are "stuck" and not
actually executing any instructions as they should be). When this
occurs a Machine Check is generated by the watchdog_thread (which is
what you're experiencing).
The idea here is if the CPU is waiting, then the fact that the
instruction count has not changed is completely normal and expected.
It's only when the CPU is *not* in a wait state (and thus *should* be
executing instructions) but the instruction count has not changed
that Herc thinks something is wrong.
In your case, I'm suspecting (but we need to do another trace to be
sure; the last trace I had you do wasn't good enough; sorry) your
z/VM guest is continuing to execute instructions in SIE mode or else
has wait state assist enabled (which causes it to *appear* as if the
SIE guest PSW is never in a wait state), thus opening the window of
opportunity for the bug to get triggered (which it does 20 seconds
later whenever the watchdog_thread notices the instruction count
hasn't changed and the real host cpu is NOT in a wait state AND its
still in SIE mode AND the SIE guest CPU is not in a wait state either
thereby causing the Machine Check).
The more I think about it the more I'm inclined to believe the fix is
NOT in adjusting the condition of the 'if' statement as I originally
thought, but rather in the subsequent test that's checking whether
the instruction count has changed or not. If the host (NON-SIE) CPU
instruction count has not changed (which today is the only condition
that generates the Machine Check), it should THEN check to see
whether or not the SIE guest CPU's instruction count has changed or
not. If NEITHER instruction count has changed THEN something is wrong
and a Machine Check should be generated.
But REGARDLESS of whether the host (non-SIE guest) CPU's instruction
count has changed, if the SIE guest CPUs instruction count has
changed, then everything is still okay and the Machine Check should
NOT be generated.
Post by o***@public.gmane.org
Logs of the IPL with ptt enabled and logs of DDR are included in
the file logs.tar.gz in FILES.
Paul
Which are not good. :(
(My bad. Sorry.)
ptt sie io 12345
hao HHCCP017I
hao ptt
t+588
hao HHCCP017I
hao ipending
hao HHCCP048I 0588:CCW=03200001
hao ipending
The 'ipending' command will display, among other things, the
instruction count for the CPU, and the 'ptt' command will of course
dump the trace table. Both should be issued IMMEDIATELY upon the
Machine Check occurring. Thus the reason for using the Hercules
Automatic Operator (HAO) facility to automatically issue the commands
for us.
Try that and post your results. Thanks.
(p.s. you can delete those other logs. We don't need them any more.)
OH! By the way: just out of curiosity, how's the tape drive defined
in VM? (i.e. DMKRIO)
- --
Fight Spam! Join CAUCE! <http://www.cauce.org/>
7 reasons why HTML email is a bad thing
http://www.georgedillon.com/web/html_email_is_evil.shtml
RSA: 6B37 7110 7201 9917 9B0D 99E3 55DB 5D58 FADE 4A52
DH/DSS: 9F9B BAB0 BA7F C458 1A89 FE26 48F5 D7F4 C4EE 3E2A
-----BEGIN PGP SIGNATURE-----
Version: PGP 8.1
iQA/AwUBSe+/5Uj11/TE7j4qEQKkJACgthFUt6UfnCirC+wl4dQ7gM4ojEwAnAxm
amYmYC5i7Bhznpzd0e23mIG2
=AU8m
-----END PGP SIGNATURE-----
liamcold
2009-04-23 23:56:47 UTC
Permalink
Thanks Paul,

I must be blind. I now see the AUTO_SCSI_MOUNT parameter in the configuration documentation. I am now running with "NO".
Post by o***@public.gmane.org
Did you try AUTO_SCSI_MOUNT NO ?
Paul
Post by liamcold
What is the configuration parameter that we can use to disable all this SCSIAUTOMOUNT processing?
I have also had "ISSUES" with this in the past and would like to be able to disable this entire "OPTION".
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Post by o***@public.gmane.org
We might be dealing with two issues.
1) SCSI tape (IBM 3490-E11) attached with no tape mounted during
IPL 2) SCSI tape (IBM 3490-E11) use by DDR dismount/remount
It's probably one issue that just happens to be triggered in more
than one way.
Post by o***@public.gmane.org
If the fix is in the SCSIMOUNT timer, then it will probably fix
both. If it is in the SIE area, we still might have another
problem.
It's not the SCSIMOUNT timer. The SCSIMOUNT timer simply defines how
frequently Hercules is to query the tape drive for it current status.
Once the drive responds [to Hercules] with a status that a tape is
now finally mounted, Hercules then presents an unsolicited ATTN
interrupt so the guest (z/VM 3.1 in this case).
During the development of this logic (done on Windows) I personally
experienced unusual host behavior wherein the status query to the
drive would take upwards of 10 seconds or more to complete whenever
there was no tape mounted. (Whenever a tape was mounted the query
would complete immediately). Thus there is special logic in Hercules
to: 1) "timeout" its internal query attempt after x #of milliseconds,
and 2) to not query the drive any more frequently than once per
second. These two changes seemed to resolve the issue.
While the guest is waiting for the tape to be mounted however (i.e.
while it's waiting for that attention interrupt to eventually occur),
it's usually either executing other instructions (i.e. it has other
work to do while it's waiting) or else it's sitting there in an
enabled interrupt state (i.e. its psw in a wait state enabled for i/o
interrupts).
When the guest is NOT running in SIE mode, the machine check does not
occur because the CPU is waiting for an interrupt and thus purposely
not executing any instructions.
If the guest *IS* in SIE mode however, then the machine check will
a) the host (non-SIE guest) CPU is NOT in a wait state
(i.e. from Herc's point-of-view it's supposed to be
executing instructions), AND...
b) the SIE-guest CPU is NOT in a wait state, AND...
c) it's been longer than 20 seconds and the instruction
count hasn't changed.
This condition is detected by Herc's "watchdog_thread" responsible
for detecting malfunctioning CPUs (i.e CPUs that are "stuck" and not
actually executing any instructions as they should be). When this
occurs a Machine Check is generated by the watchdog_thread (which is
what you're experiencing).
The idea here is if the CPU is waiting, then the fact that the
instruction count has not changed is completely normal and expected.
It's only when the CPU is *not* in a wait state (and thus *should* be
executing instructions) but the instruction count has not changed
that Herc thinks something is wrong.
In your case, I'm suspecting (but we need to do another trace to be
sure; the last trace I had you do wasn't good enough; sorry) your
z/VM guest is continuing to execute instructions in SIE mode or else
has wait state assist enabled (which causes it to *appear* as if the
SIE guest PSW is never in a wait state), thus opening the window of
opportunity for the bug to get triggered (which it does 20 seconds
later whenever the watchdog_thread notices the instruction count
hasn't changed and the real host cpu is NOT in a wait state AND its
still in SIE mode AND the SIE guest CPU is not in a wait state either
thereby causing the Machine Check).
The more I think about it the more I'm inclined to believe the fix is
NOT in adjusting the condition of the 'if' statement as I originally
thought, but rather in the subsequent test that's checking whether
the instruction count has changed or not. If the host (NON-SIE) CPU
instruction count has not changed (which today is the only condition
that generates the Machine Check), it should THEN check to see
whether or not the SIE guest CPU's instruction count has changed or
not. If NEITHER instruction count has changed THEN something is wrong
and a Machine Check should be generated.
But REGARDLESS of whether the host (non-SIE guest) CPU's instruction
count has changed, if the SIE guest CPUs instruction count has
changed, then everything is still okay and the Machine Check should
NOT be generated.
Post by o***@public.gmane.org
Logs of the IPL with ptt enabled and logs of DDR are included in
the file logs.tar.gz in FILES.
Paul
Which are not good. :(
(My bad. Sorry.)
ptt sie io 12345
hao HHCCP017I
hao ptt
t+588
hao HHCCP017I
hao ipending
hao HHCCP048I 0588:CCW=03200001
hao ipending
The 'ipending' command will display, among other things, the
instruction count for the CPU, and the 'ptt' command will of course
dump the trace table. Both should be issued IMMEDIATELY upon the
Machine Check occurring. Thus the reason for using the Hercules
Automatic Operator (HAO) facility to automatically issue the commands
for us.
Try that and post your results. Thanks.
(p.s. you can delete those other logs. We don't need them any more.)
OH! By the way: just out of curiosity, how's the tape drive defined
in VM? (i.e. DMKRIO)
- --
Fight Spam! Join CAUCE! <http://www.cauce.org/>
7 reasons why HTML email is a bad thing
http://www.georgedillon.com/web/html_email_is_evil.shtml
RSA: 6B37 7110 7201 9917 9B0D 99E3 55DB 5D58 FADE 4A52
DH/DSS: 9F9B BAB0 BA7F C458 1A89 FE26 48F5 D7F4 C4EE 3E2A
-----BEGIN PGP SIGNATURE-----
Version: PGP 8.1
iQA/AwUBSe+/5Uj11/TE7j4qEQKkJACgthFUt6UfnCirC+wl4dQ7gM4ojEwAnAxm
amYmYC5i7Bhznpzd0e23mIG2
=AU8m
-----END PGP SIGNATURE-----
Fish
2009-04-21 13:39:34 UTC
Permalink
Paul,

Do me a favor and do your test again, but by first entering the
following command before you start the test:

ptt sie io 100

Then, immediately upon receiving the machine check, enter the
command:

ptt

and post the results so we can look at it.

Thanks.

- --
"Fish" (David B. Trout) - fish-VLFb7ALKWJGGw+***@public.gmane.org
Fight Spam! Join CAUCE! <http://www.cauce.org/>
7 reasons why HTML email is a bad thing
http://www.georgedillon.com/web/html_email_is_evil.shtml
PGP key fingerprints:
RSA: 6B37 7110 7201 9917 9B0D 99E3 55DB 5D58 FADE 4A52
DH/DSS: 9F9B BAB0 BA7F C458 1A89 FE26 48F5 D7F4 C4EE 3E2A
o***@public.gmane.org
2009-04-21 18:12:28 UTC
Permalink
The logs are in the FILES area as logs.tar.gz

Paul
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Paul,
Do me a favor and do your test again, but by first entering the
ptt sie io 100
Then, immediately upon receiving the machine check, enter the
ptt
and post the results so we can look at it.
Thanks.
- --
Fight Spam! Join CAUCE! <http://www.cauce.org/>
7 reasons why HTML email is a bad thing
http://www.georgedillon.com/web/html_email_is_evil.shtml
RSA: 6B37 7110 7201 9917 9B0D 99E3 55DB 5D58 FADE 4A52
DH/DSS: 9F9B BAB0 BA7F C458 1A89 FE26 48F5 D7F4 C4EE 3E2A
-----BEGIN PGP SIGNATURE-----
Version: PGP 8.1
iQA/AwUBSe3MlUj11/TE7j4qEQJCpwCfdNrLTvHVZ+qXH4ZbAGBfG/CiMxgAn2q/
qXrSDMb0/vFVQNlVUZoOPdT0
=oxeJ
-----END PGP SIGNATURE-----
Loading...