'Peter J. Jansen' peter_j_jansen@yahoo.com [hercules-390]
2017-10-23 15:00:15 UTC
The Hercules Spinhawk version 3.13 (but also 3.12) have been causing me
trouble at 3270 console connection time, i.e. even before IPL. Sometimes,
Spinhawk crashed immediately, with a less than complete set of "HHCTE009I
Client 127.0.0.1 connected to 3270 device 0:xxxx" messages in the log. With
a bit more luck, Spinhawk would not crash, but there would still be less
HHCTE009I messages than the six 3270 consoles I defined (all x3270). This
only occurred when Spinhawk was running on Linux; both Ubuntu 16.04.2 and
Debian 9.1.0 experienced it. Also the Hyperion and SDL-hyperion versions do
not exhibit this behavior.
I had a simple workaround by manually initialing the 3270 consoles after
starting Spinhawk but before IPL-ing the system, so this was a lower
priority for me. But it was manual extra work and thus annoying.
After some careful reading of the console.c source, I think that some race
condition may have been experienced by others in the past, as I found
various comments hinting at earlier attempts to fix the race condition.
After quite a bit of testing (including running Spinhawk under valgrind), my
tests lead me now to believe that I may have found the problem as well as a
simple fix for it. The culprit is the "csock" socket number from an
"accept" call, whose address is passed into "connect_client" via a
"create_thread" (e.g. a pthread_create). This I believe to NOT be
thread-safe. Even before "connect_clean" picks up the "csock", another
"accept" may have delivered a new "csock", as a result of which both the 1st
and 2nd "connect_client" may pick up the same 2nd "csock" number. Not good.
My fix is simple. Instead of introducing a thread-safe argument passing
into "connect_client", I considered the fact that the latter thread is
rather simple and short-lived, during initialisation only. So I thought to
call "connect_client" straight away, without creating an additional thread
for each 3270 console during its initial connecting phase. And my problem
went away.
I am interested in feedback regarding this issue. My correction is already
available in GitHub : https://github.com/Peter-J-Jansen/spinhawk. Depending
on community feedback, I may make a pull request to have the fix included in
the official Hercules Spinhawk repository.
Thanks,
Peter J.
P.S.: This is what the fix looks like :
***@ubuntu:~$ diff -C 9 Qsync/Source/Originals/spinhawk/console.c
Qsync/Source/Changes/spinhawk/console.c
*** Qsync/Source/Originals/spinhawk/console.c 2017-10-01 17:23:40.330668000
+0200
--- Qsync/Source/Changes/spinhawk/console.c 2017-10-23
14:57:52.309843400 +0200
***************
*** 2107,2133 ****
--- 2107,2142 ----
/* Accept a connection and create conversation socket */
csock = accept (lsock, NULL, NULL);
if (csock < 0)
{
TNSERROR("console: DBG029: accept: %s\n",
strerror(HSO_errno));
continue;
}
+ /* The original connect_client thread creation is @PJJ */
+ /* replaced with a straight function call, thus @PJJ */
+ /* avoiding the race condition caused by passing @PJJ */
+ /* &csock set by the above accept to create_thread. @PJJ */
+ #if 0 /* 23-Oct-2017 @PJJ */
/* Create a thread to complete the client connection */
if ( create_thread (&tidneg, DETACHED,
connect_client, &csock, "connect_client")
)
{
TNSERROR("console: DBG030: connect_client create_thread:
%s\n",
strerror(errno));
close_socket (csock);
}
+ #else /* @PJJ */
+ UNREFERENCED(tidneg); /* @PJJ */
+ connect_client( &csock ); /* @PJJ */
+ #endif /* @PJJ */
} /* end if(FD_ISSET(lsock, &readset)) */
trouble at 3270 console connection time, i.e. even before IPL. Sometimes,
Spinhawk crashed immediately, with a less than complete set of "HHCTE009I
Client 127.0.0.1 connected to 3270 device 0:xxxx" messages in the log. With
a bit more luck, Spinhawk would not crash, but there would still be less
HHCTE009I messages than the six 3270 consoles I defined (all x3270). This
only occurred when Spinhawk was running on Linux; both Ubuntu 16.04.2 and
Debian 9.1.0 experienced it. Also the Hyperion and SDL-hyperion versions do
not exhibit this behavior.
I had a simple workaround by manually initialing the 3270 consoles after
starting Spinhawk but before IPL-ing the system, so this was a lower
priority for me. But it was manual extra work and thus annoying.
After some careful reading of the console.c source, I think that some race
condition may have been experienced by others in the past, as I found
various comments hinting at earlier attempts to fix the race condition.
After quite a bit of testing (including running Spinhawk under valgrind), my
tests lead me now to believe that I may have found the problem as well as a
simple fix for it. The culprit is the "csock" socket number from an
"accept" call, whose address is passed into "connect_client" via a
"create_thread" (e.g. a pthread_create). This I believe to NOT be
thread-safe. Even before "connect_clean" picks up the "csock", another
"accept" may have delivered a new "csock", as a result of which both the 1st
and 2nd "connect_client" may pick up the same 2nd "csock" number. Not good.
My fix is simple. Instead of introducing a thread-safe argument passing
into "connect_client", I considered the fact that the latter thread is
rather simple and short-lived, during initialisation only. So I thought to
call "connect_client" straight away, without creating an additional thread
for each 3270 console during its initial connecting phase. And my problem
went away.
I am interested in feedback regarding this issue. My correction is already
available in GitHub : https://github.com/Peter-J-Jansen/spinhawk. Depending
on community feedback, I may make a pull request to have the fix included in
the official Hercules Spinhawk repository.
Thanks,
Peter J.
P.S.: This is what the fix looks like :
***@ubuntu:~$ diff -C 9 Qsync/Source/Originals/spinhawk/console.c
Qsync/Source/Changes/spinhawk/console.c
*** Qsync/Source/Originals/spinhawk/console.c 2017-10-01 17:23:40.330668000
+0200
--- Qsync/Source/Changes/spinhawk/console.c 2017-10-23
14:57:52.309843400 +0200
***************
*** 2107,2133 ****
--- 2107,2142 ----
/* Accept a connection and create conversation socket */
csock = accept (lsock, NULL, NULL);
if (csock < 0)
{
TNSERROR("console: DBG029: accept: %s\n",
strerror(HSO_errno));
continue;
}
+ /* The original connect_client thread creation is @PJJ */
+ /* replaced with a straight function call, thus @PJJ */
+ /* avoiding the race condition caused by passing @PJJ */
+ /* &csock set by the above accept to create_thread. @PJJ */
+ #if 0 /* 23-Oct-2017 @PJJ */
/* Create a thread to complete the client connection */
if ( create_thread (&tidneg, DETACHED,
connect_client, &csock, "connect_client")
)
{
TNSERROR("console: DBG030: connect_client create_thread:
%s\n",
strerror(errno));
close_socket (csock);
}
+ #else /* @PJJ */
+ UNREFERENCED(tidneg); /* @PJJ */
+ connect_client( &csock ); /* @PJJ */
+ #endif /* @PJJ */
} /* end if(FD_ISSET(lsock, &readset)) */