[hercules-390] Hercules Spinhawk : Race Condition Fixed in console.c

'Peter J. Jansen' peter_j_jansen@yahoo.com [hercules-390]

2017-10-23 15:00:15 UTC

The Hercules Spinhawk version 3.13 (but also 3.12) have been causing me
trouble at 3270 console connection time, i.e. even before IPL. Sometimes,
Spinhawk crashed immediately, with a less than complete set of "HHCTE009I
Client 127.0.0.1 connected to 3270 device 0:xxxx" messages in the log. With
a bit more luck, Spinhawk would not crash, but there would still be less
HHCTE009I messages than the six 3270 consoles I defined (all x3270). This
only occurred when Spinhawk was running on Linux; both Ubuntu 16.04.2 and
Debian 9.1.0 experienced it. Also the Hyperion and SDL-hyperion versions do
not exhibit this behavior.

I had a simple workaround by manually initialing the 3270 consoles after
starting Spinhawk but before IPL-ing the system, so this was a lower
priority for me. But it was manual extra work and thus annoying.

After some careful reading of the console.c source, I think that some race
condition may have been experienced by others in the past, as I found
various comments hinting at earlier attempts to fix the race condition.
After quite a bit of testing (including running Spinhawk under valgrind), my
tests lead me now to believe that I may have found the problem as well as a
simple fix for it. The culprit is the "csock" socket number from an
"accept" call, whose address is passed into "connect_client" via a
"create_thread" (e.g. a pthread_create). This I believe to NOT be
thread-safe. Even before "connect_clean" picks up the "csock", another
"accept" may have delivered a new "csock", as a result of which both the 1st
and 2nd "connect_client" may pick up the same 2nd "csock" number. Not good.

My fix is simple. Instead of introducing a thread-safe argument passing
into "connect_client", I considered the fact that the latter thread is
rather simple and short-lived, during initialisation only. So I thought to
call "connect_client" straight away, without creating an additional thread
for each 3270 console during its initial connecting phase. And my problem
went away.

I am interested in feedback regarding this issue. My correction is already
available in GitHub : https://github.com/Peter-J-Jansen/spinhawk. Depending
on community feedback, I may make a pull request to have the fix included in
the official Hercules Spinhawk repository.

Thanks,

Peter J.

P.S.: This is what the fix looks like :

***@ubuntu:~$ diff -C 9 Qsync/Source/Originals/spinhawk/console.c
Qsync/Source/Changes/spinhawk/console.c

*** Qsync/Source/Originals/spinhawk/console.c 2017-10-01 17:23:40.330668000
+0200

--- Qsync/Source/Changes/spinhawk/console.c 2017-10-23
14:57:52.309843400 +0200

***************

*** 2107,2133 ****

--- 2107,2142 ----

/* Accept a connection and create conversation socket */

csock = accept (lsock, NULL, NULL);

if (csock < 0)

{

TNSERROR("console: DBG029: accept: %s\n",
strerror(HSO_errno));

continue;

}

+ /* The original connect_client thread creation is @PJJ */

+ /* replaced with a straight function call, thus @PJJ */

+ /* avoiding the race condition caused by passing @PJJ */

+ /* &csock set by the above accept to create_thread. @PJJ */

+ #if 0 /* 23-Oct-2017 @PJJ */

/* Create a thread to complete the client connection */

if ( create_thread (&tidneg, DETACHED,

connect_client, &csock, "connect_client")

)

{

TNSERROR("console: DBG030: connect_client create_thread:
%s\n",

strerror(errno));

close_socket (csock);

}

+ #else /* @PJJ */

+ UNREFERENCED(tidneg); /* @PJJ */

+ connect_client( &csock ); /* @PJJ */

+ #endif /* @PJJ */

} /* end if(FD_ISSET(lsock, &readset)) */