AIR Wiki : IPv4LinkLocalDemo

See IPv4LinkLocalInstallation for now

Actions taken


Configured hosts:

Basic setup:

Tasks



Bug encountered



Libnss_mdns


(see problem 1a below)

Need to:

One oddity: at first, restarting mdns solved the problem (restarting the Python script did not). Later, it was the other way around. It is unclear whether this was caused by one problem or by two different ones.

Solution: The problem seems to be related to a socket leak: libnss_mdns never called DNSServiceRefDeallocate(). See http://lists.apple.com/archives/Bonjour-dev/2005/Sep/msg00046.html and the other mails in that thread for details (credit to Jason Fritcher of Earthlink for finding this problem).

Problems


Unfortunately, the following problems occur:
  1. Two issues with the reliability of name resolution:
    1. On vangogh8, gethostbyaddr() and gethostbyname() stop working after about 10 minutes. Such a call simply hangs for at least 15 minutes (so probably indefinitely). This only affects calls that involve mdns, not regular DNS lookups. Restarting mdns with /etc/init.d/mdns restart resolves the problem. The external mdns daemon does not seem to be affected (other hosts still get a reply), so the real bug may be in libnss_mdns, not in mdnsd. Note that in this case the getlocalhosts.py script continues to serve requests, but its data is stale (apparently it is unable to parse the data, or ping doesn't return). After mdns is restarted, the daemon returns only an empty list (internally, gethostbyaddr() still keeps failing with herror 2).
    2. An attempt to work around this by installing tmdns instead of mdns ran into what looks like a tmdns bug: local requests are not correctly forwarded to other servers, although requests from other hosts are handled correctly. Debugging is a nightmare, since the results of gethostby*() seem to be heavily cached. tmdns reports the error "ns_initparse: Message too long" when it receives a message from an mdns daemon.
  2. The getlocalhosts.py script sometimes stops working after about 5 minutes (the daemon accepts the incoming connection, but closes the connection immediately). Note that this does not seem related to the above problem, since it happens when gethostby*() are still responsive. It is unclear what causes this. (Problem was gone after massive rewrite, not sure what made the fix)
  3. On vangogh8 (Python 2.4.1), the getlocalhosts.py script spawns two "ping" processes, one of which goes defunct. This does not happen when running the script on a Mac (Python 2.3.5). It has not been tested on Debian yet.
  4. It does not seem to be possible to use longer DNS names with .local (e.g. webcam.uva.netherlight.nl.local), only names with two elements (e.g. webcam-uva-netherlight-nl.local.). This applies to both mdnsd (Linux) and mDNSResponder (Mac OS), which is not surprising, since it is basically the same code.
  5. The visualisation script only runs once, not continuously, and is not very visually appealing yet.
  6. ZCIP stops
  7. getlocalhosts.py -c has a small bug, where hosts with the same name (e.g. "None") are ignored, except for the first. visualize.php has the same bug, but this time everything but the last is ignored. (fixed)
  8. Windows XP does not respond to broadcast pings
  9. Apple does not advertise _http._tcp service when Personal Web Sharing is enabled in the preferences. (fixed, User must have setup a webpage, otherwise no announcement is made)
  10. The Apple laptop sometimes disables its link-local address if a routed address is available via the wireless network
  11. We don't have "poster" pictures yet
  12. Machines in Amsterdam are sometimes not visible with a broadcast ping. (This problem had magically gone away on Thursday, not sure why or how)
  13. Firefox on igrid-demo06 does not support Java applets yet
  14. The gridftp client (globus-url-copy) on the Mac says the CRL (certificate revocation list) on vangogh7 is expired, though it really is not. This applies to the gridftp of globus 4.0.0 as well as globus 3.2.1. Clients on other hosts, for example rembrandt0, do not complain about the CRL on vangogh7. (The problem was local to Freek's Mac)
  15. If there is no reverse DNS entry for the routed IP address, the hosts do not advertise their name (gethostbyaddr() returns nothing). This is odd, since it could (and should?) just have used the result of /sbin/hostname. Also, it often does respond to hostname.local., but with the routable IP address. Perhaps that's related. (This issue was only seen on the igrid-demo05 machine and only when it did a lookup of itself; it seemed to work when lookups were done by a Mac).
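
For problem 1.1, where gethostbyaddr() and gethostbyname() hang indefinitely, one way to keep the script responsive is to run each lookup in a helper thread and give up after a deadline. The following is only a minimal sketch in current Python 3, not what getlocalhosts.py does; the name lookup_with_timeout and the 5 second default are assumptions made for illustration.

```python
import socket
import threading


def lookup_with_timeout(func, arg, timeout=5.0):
    """Run a blocking resolver call in a helper thread (hypothetical helper).

    Returns (True, result) on success, (False, exception) if the call
    raised, and (False, None) if it did not return within `timeout` seconds.
    """
    outcome = {}

    def worker():
        try:
            outcome["result"] = func(arg)
        except Exception as exc:  # e.g. socket.herror from gethostbyaddr()
            outcome["error"] = exc

    thread = threading.Thread(target=worker)
    thread.daemon = True  # a hung lookup must not keep the process alive
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        return (False, None)  # still hanging: report a timeout
    if "error" in outcome:
        return (False, outcome["error"])
    return (True, outcome["result"])
```

A call would look like lookup_with_timeout(socket.gethostbyaddr, "169.254.1.2"). The hung thread itself is not cleaned up, so this only hides the symptom; the socket leak in libnss_mdns still has to be fixed.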

Possible solutions


1) If DNS lookup continues to be a problem, it may be better to run a local bind server and configure it to forward the link-local related requests to 224.0.0.251 (the multicast IP address) on port 5353, while forwarding all other requests to whatever the current name server is (as listed in /etc/resolv.conf), or just returning a SERVFAIL return code, which automatically causes the caller to try the next name server.
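
The forwarding decision in solution 1 can be sketched as a small function. This is not a bind configuration, only an illustration of the rule. The function names are hypothetical, and treating reverse lookups under 254.169.in-addr.arpa (the 169.254/16 range) as "link-local related" is an assumption on top of the text.

```python
MDNS_GROUP = ("224.0.0.251", 5353)  # multicast address and port from the text


def is_link_local_name(name):
    """True for *.local names and for reverse lookups of 169.254.x.x."""
    labels = name.lower().rstrip(".").split(".")
    if labels[-1] == "local":
        return True
    # Assumption: reverse lookups of link-local addresses also go to mDNS.
    return labels[-4:] == ["254", "169", "in-addr", "arpa"]


def read_nameservers(resolv_conf_text):
    """Extract the nameserver addresses from the text of /etc/resolv.conf."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def choose_upstream(name, resolv_conf_text):
    """Return the (address, port) to forward to, or None for SERVFAIL."""
    if is_link_local_name(name):
        return MDNS_GROUP
    servers = read_nameservers(resolv_conf_text)
    if not servers:
        return None  # no regular name server known: answer SERVFAIL
    return (servers[0], 53)
```

For example, choose_upstream("webcam.local.", open("/etc/resolv.conf").read()) selects the multicast group, while any other name goes to the first listed name server.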

2) The script should log to a file, so that it can be debugged more easily. That functionality can be built in relatively easily.
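
A possible shape for the logging in solution 2, using the logging module from the Python standard library. This is a sketch only: the function name make_logger, the logger name, the log format and the rotation limits are all assumptions, not part of the existing script.

```python
import logging
import logging.handlers


def make_logger(path, name="getlocalhosts"):
    """Return a logger that appends to `path` (hypothetical helper)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    if not logger.handlers:  # avoid duplicate lines if called twice
        handler = logging.handlers.RotatingFileHandler(
            path, maxBytes=1000000, backupCount=3)
        # The thread name shows whether ping, parser or daemon logged the line.
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(threadName)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger
```

The script could then call, for example, log = make_logger("/var/log/getlocalhosts.log") once at startup and log.warning("no ping output for %d seconds", 5) from the parser thread; the path is only an example.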

3) The script probably needs to be modified for stability. In particular, the daemon should run in its own thread. At present ping and a result parser each run in their own thread, while the daemon is called from the main code, which results in a blocking call to server.serve_forever() (or server.serverequest()). The current parser thread already detects when it has not received any data from the ping thread for 5 seconds. That should trigger an event message (perhaps in a Queue) that signals to kill the current ping function and start a new one (or even to kill off the thread by deleting the thread instance). In addition, I should lock the Queue.
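
The watchdog idea in solution 3 could look roughly like this in current Python 3. It is a sketch under assumptions: parser_loop, restart_ping and handle_line are hypothetical names, and the real script may be organised differently. Note that queue.Queue does its own locking, so no extra lock around the queue should be needed.

```python
import queue
import threading


def parser_loop(lines, restart_ping, stop, handle_line, timeout=5.0):
    """Consume ping output; ask for a new ping if it goes quiet.

    lines        -- queue.Queue filled by the ping thread
    restart_ping -- callback that kills the current ping and starts a new one
    stop         -- threading.Event used to end the loop
    handle_line  -- callback that parses one line of ping output
    """
    while not stop.is_set():
        try:
            line = lines.get(timeout=timeout)
        except queue.Empty:
            # No data from the ping thread for `timeout` seconds.
            restart_ping()
            continue
        handle_line(line)
```

The daemon itself can be moved off the main thread with threading.Thread(target=server.serve_forever, daemon=True).start(), after which the main code is free to supervise the other threads.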

4) The issue of longer DNS names may be solved by using tmdns instead of mdns.
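
Until that is verified, the workaround already shown in problem 4 (replacing the dots by hyphens) can be automated. A minimal sketch; the function name is hypothetical, and the mapping cannot be reversed for names that already contain hyphens.

```python
def flatten_for_mdns(fqdn):
    """Turn e.g. webcam.uva.netherlight.nl into webcam-uva-netherlight-nl.local."""
    labels = [label for label in fqdn.strip().rstrip(".").split(".") if label]
    if labels and labels[-1].lower() == "local":
        labels = labels[:-1]  # do not repeat an existing .local suffix
    if not labels:
        raise ValueError("empty host name")
    label = "-".join(labels)
    if len(label) > 63:
        raise ValueError("a single DNS label is limited to 63 octets")
    return label + ".local."
```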

5) The visualisation script may need to be rewritten, perhaps using Java (or Flash).

6) This probably happens after logout. Apparently, running it in the background doesn't properly detach it from the terminal. It probably needs to be started from a script in /etc/rc.d/zcip (or any script which does a proper two-step fork). Or we can run it ourselves using the screen command, so it keeps running even after the terminal is closed. A bit ugly, but it works.
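
For reference, the "two-step fork" mentioned in solution 6 looks as follows when written in Python for a POSIX system. It is a generic sketch of the technique, not the zcip start script; a wrapper would call daemonize() and then exec the real program.

```python
import os
import sys


def daemonize():
    """Detach the calling process from its terminal with a double fork."""
    sys.stdout.flush()
    sys.stderr.flush()
    if os.fork() > 0:
        os._exit(0)  # first parent exits; the shell gets its prompt back
    os.setsid()  # new session, no controlling terminal
    if os.fork() > 0:
        os._exit(0)  # session leader exits; the survivor cannot regain a tty
    os.chdir("/")
    # Point stdin, stdout and stderr at /dev/null so a closed terminal
    # cannot cause read or write errors later on.
    devnull = os.open(os.devnull, os.O_RDWR)
    for fd in (0, 1, 2):
        os.dup2(devnull, fd)
    if devnull > 2:
        os.close(devnull)
```

Only the grandchild returns from daemonize(); both intermediate processes exit inside the function.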

7) Bug squashing time! FIXED, both unrelated.

8) Check routes of those machines. Linux machines still seem to have a bug that puts the default route on all machines. Windows machines are still a mystery.

9) Fixed: by default, Apache only advertises websites with non-default content, and I had never changed mine. I have now enabled it for all users (RegisterUserSite all-users instead of RegisterUserSite customized-users). Note: it can also be enabled by hand using "mdns -R "Freek Dijkstra" _http._tcp local. 80 path=/~freek/" or something similar.

10) You must put the ethernet NIC above the wireless NIC in the network configuration system preferences.

11) We still need to make poster pictures.

12) Still mostly a mystery. The main culprit seems to be the ARP cache somehow. With ethereal, we confirmed that all ARP requests arrive at all machines. However, a machine (mostly one on the other side of the ocean) does not respond to that broadcast ping. We first suspected it was sending the reply on the wrong interface, but that does not seem to be the case. Further debugging showed that at that time the machine was in fact attempting to send out a broadcast ping, but logged a "host not reachable" error. As soon as it receives a unicast ARP request (regardless of which machine sent it), it suddenly thinks the machine is reachable, and ARP replies are returned. This behaviour also seems to apply to unicast ARP requests in some cases. We are still puzzled by this behaviour.

13) Need to install it

14) No clue why. Perhaps we can run gridftp client on igrid-demo06 instead of the laptop.

15) File a bug report?

Categories
CategoryZeroconf
