See IPv4LinkLocalInstallation for now
Actions taken
Configured hosts:
- webcam.uva.netherlight.nl (Mac, Amsterdam)
- vangogh7.uva.netherlight.nl (Linux, Amsterdam)
- vangogh8.uva.netherlight.nl (Linux, Amsterdam)
- igrid-demo05 (Windows, San Diego)
- igrid-demo06 (Linux, San Diego)
- laptop (Mac, San Diego)
Basic setup:
- All hosts reside in the same VLAN
- All hosts have IPv4 link-local addresses on the secondary interface (eth2 or en1)
- The IPv4 link-local address is defended by a daemon, which watches for collisions
- The kernel has been patched to detect collisions after network joins (ARP replies for 169.254/16 are broadcast)
- All hosts (except igrid-demo06) run multicast DNS (vangogh7 runs Apple's mdns, vangogh8 runs tmdns, igrid-demo06 runs Porchdog's mDNSResponder, laptop runs default Mac mdns)
- All hosts (except igrid-demo06) are patched so that gethostbyaddr() and gethostbyname() first use mDNS rather than DNS (for the appropriate IP range and the .local zone). vangogh7 uses libnss_mdns, which was included with Apple's mdns; vangogh8 has an entry in resolv.conf for lookup; igrid-demo06 uses ????
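The rule in the last item (mDNS first for the .local zone and the link-local range, regular DNS otherwise) can be illustrated with a small sketch. This is not the code used by libnss_mdns or by the resolv.conf setup; the function names are hypothetical, and it is written for a current Python 3 rather than the Python 2.3/2.4 installed on the hosts.

```python
import ipaddress

# IPv4 link-local range used on the secondary interfaces.
LINK_LOCAL_NET = ipaddress.ip_network("169.254.0.0/16")


def use_mdns_for_name(name):
    """Return True if a forward lookup of this name belongs to the .local zone."""
    labels = name.rstrip(".").lower().split(".")
    return labels[-1] == "local"


def use_mdns_for_address(addr):
    """Return True if a reverse lookup of this address falls in 169.254/16."""
    try:
        return ipaddress.ip_address(addr) in LINK_LOCAL_NET
    except ValueError:
        # Not a valid IP address at all: leave it to the regular resolver.
        return False
```

With this rule, vangogh7.local. and 169.254.10.20 would go to mDNS, while webcam.uva.netherlight.nl would go to the regular name server.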
Tasks
- Make Java visualization, and allow it to make remote connections (fixed, adjusted global java.policy file on laptop)
- Python 2.4.1 on VanGogh7 (done)
- test ping results from igrid-demo06 (seemed to magically just work on Thursday)
- name of igrid-demo05 and demo06 (is now None, since it has no hostname; however, Macs seem to resolve these names just fine)
- Bind on igrid-demo06 (not done, does not improve stability)
- Script Jeroen: robustness for gethostbyaddr() failures (Host name lookup failure (herror 2)) and timeouts (separate thread?) (lookups were done in a separate script, so that the select module could be used for timeouts)
- Test extra pings in python script (Adding ip-addresses after daemon started works, but removing them doesn't)
- working GridFTP client on laptop or igrid-demo06 (no CRL errors) (fixed)
- robust mDNS on VanGogh8 (either script that runs /etc/init.d/mdns restart upon failures or use howl instead of mDNS or install bind) (howl has no gethostby*() hook; made script that restarts mdns, needs testing)
- Applet and script on igrid-demo06 (Firefox does not yet support Applets there)
- Test IP reconfiguration (collision detection)
- Make Slides (if needed)
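The approach noted for the gethostbyaddr() robustness task (do the lookup in a separate script and use the select module for the timeout) could look roughly like the sketch below. It is not the actual code of Jeroen's script: run_child, reverse_lookup and the 5 second default are assumptions for illustration, it targets Python 3, and select on pipes only works on Unix-like systems.

```python
import select
import subprocess
import sys

# Child program: performs one blocking reverse lookup and prints the name.
LOOKUP_CHILD = (
    "import socket, sys\n"
    "try:\n"
    "    print(socket.gethostbyaddr(sys.argv[1])[0])\n"
    "except OSError:\n"  # socket.herror and socket.gaierror are OSErrors
    "    sys.exit(1)\n"
)


def run_child(source, arg, timeout):
    """Run Python source in a child process; return its output or None.

    None is returned when the child prints nothing or when no output
    arrives within `timeout` seconds (the child is then killed).
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", source, arg],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    try:
        ready, _, _ = select.select([proc.stdout], [], [], timeout)
        if not ready:
            return None  # lookup hangs: give up instead of blocking
        text = proc.stdout.read().decode().strip()
        return text or None
    finally:
        if proc.poll() is None:
            proc.kill()
        proc.stdout.close()
        proc.wait()


def reverse_lookup(addr, timeout=5.0):
    """gethostbyaddr() with a timeout; None on failure or timeout."""
    return run_child(LOOKUP_CHILD, addr, timeout)
```

A None result from reverse_lookup() covers both the herror 2 case and the hanging call, so the caller can keep serving and, if wanted, trigger the /etc/init.d/mdns restart mentioned above.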
Bugs encountered
- No name parsed (announced?) on Fedora/mDNS/libnss_mdns when the name was igrid-demo.local. It was parsed when the name was Doc.local. Very strange.
- Socket leak in libnss_mdns
- tmdns gives error when forwarding unicast requests to multicast
- Hosts sometimes do not respond to broadcast pings (ARP cache says host is unreachable while it is)
- Hostnames are cached during conflict detection, which causes interruptions in services no matter how
- Windows XP never replies to broadcast pings (Windows 2000 does), even when firewall is off.
- Name lookup on igrid-demo05 failed to resolve its own hostname and that of igrid-demo05, while the Mac had no problem with it.
Libnss_mdns
(see problem 1a below)
Need to:
- Install newer version of mDNSResponder on vangogh8
- Set the verbosity of libnss_mdns to high
- Read mDNSResponder-107.1/mDNSPosix/nss_ReadMe.txt for more clues (there is a config file, and it logs to syslog)
There is an odd thing -- at first, rebooting mdns solved the problem (restarting the python script did not). Later, it was the other way around: it is unclear if this was caused by the same or two different problems.
Solution: The problem seems to be related to a socket leak. libnss_mdns never called DNSServiceRefDeallocate(). See http://lists.apple.com/archives/Bonjour-dev/2005/Sep/msg00046.html and other mails in that thread for details (credits to Jason Fritcher of Earthlink for finding this problem).
Problems
Unfortunately, the following problems occur:
- Two issues with reliability of name resolving:
- On Vangogh8, gethostbyaddr() and gethostbyname() stop working after about 10 minutes. Such a function call simply hangs for at least 15 minutes (so probably indefinitely). This only affects calls which involve mdns, not regular DNS lookup calls. Restarting mdns using /etc/init.d/mdns restart resolves the problem. The external mdns daemon does not seem to be affected (other hosts get a reply), so the real bug might be in libnss_mdns, not in mdnsd. Note that in this case, the getlocalhosts.py script continues to serve requests, but its data is stale (apparently, it is unable to parse the data or ping doesn't return). After mdns is restarted, the daemon only returns an empty list (internally, gethostbyaddr() then still keeps getting a herror 2).
- An attempt to resolve this issue by installing tmdns instead of mdns ran into what seems to be a tmdns bug: local requests are not correctly forwarded to other servers, while requests from other hosts are handled correctly. Debugging is a nightmare, since the results of gethostby*() seem to be heavily cached. tmdns gives this error when it receives a message from an mdns daemon: "ns_initparse: Message too long".
- The getlocalhosts.py script sometimes stops working after about 5 minutes (the daemon accepts the incoming connection, but closes the connection immediately). Note that this does not seem related to the above problem, since it happens when gethostby*() are still responsive. It is unclear what causes this. (Problem was gone after massive rewrite, not sure what made the fix)
- On vangogh8 (Python 2.4.1), the getlocalhosts.py script spawns two "ping" processes, one of which goes defunct. This does not happen when running the script on a Mac (Python 2.3.5). It has not been tested on Debian yet.
- It does not seem to be possible to use longer DNS names with .local (e.g. webcam.uva.netherlight.nl.local), only names with two elements (e.g. webcam-uva-netherlight-nl.local.). This applies to both mdnsd (Linux) and mDNSResponder (Mac OS), which is not surprising, since it is basically the same code.
- The visualization script only runs once, not continuously, and is not very visually appealing yet.
- ZCIP stops
- getlocalhosts.py -c has a small bug, where hosts with the same name (e.g. "None") are ignored, except for the first. visualize.php has the same bug, but this time everything but the last is ignored. (fixed)
- Windows XP does not respond to broadcast pings
- Apple does not advertise _http._tcp service when Personal Web Sharing is enabled in the preferences. (fixed, User must have setup a webpage, otherwise no announcement is made)
- Apple laptop sometimes disables its link-local address if a routed address is available via the wireless network
- We don't have "poster" pictures yet
- Machines in Amsterdam sometimes are not visible with a broadcast ping. (This problem had magically gone away on Thursday, not sure why or how)
- Firefox on igrid-demo06 does not support Java applets yet
- gridftp client (globus-url-copy) on Mac says the CRL (certificate revocation list) on vangogh7 is expired, though it really is not. This applies to gridftp of globus 4.0.0 as well as globus 3.2.1. The clients on, for example, rembrandt0 do not complain about the CRL on vangogh7. (Problem was local to Freek's Mac)
- If there is no reverse DNS lookup of the routed IP address, the hosts do not advertise their name (gethostbyaddr() returns nothing). This is odd, since it could (and should?) just have used the result of /sbin/hostname. Also, it often does respond to hostname.local., but with the routable IP address. Perhaps that's related. (This issue was only seen on the igrid-demo05 machine and only when it did a lookup of itself; it seemed to work when lookups were done by a Mac.)
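The naming workaround from the item on longer DNS names (webcam.uva.netherlight.nl becoming webcam-uva-netherlight-nl.local.) is a simple string transformation. A minimal sketch follows; to_local_name is a hypothetical helper and not part of any of the scripts on this page.

```python
def to_local_name(fqdn):
    """Flatten a multi-label DNS name into a single label under .local.

    Dots between labels become hyphens, because only names with two
    elements (host + local) appeared to work with mdnsd/mDNSResponder.
    """
    labels = [label for label in fqdn.strip().rstrip(".").split(".") if label]
    # Accept names that already end in .local without doubling the suffix.
    if labels and labels[-1].lower() == "local":
        labels = labels[:-1]
    if not labels:
        raise ValueError("empty host name")
    return "-".join(labels) + ".local."
```

Calling to_local_name("webcam.uva.netherlight.nl") gives the two-element form used in the example above.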
Possible solutions
1) If DNS lookup continues to be a problem, it may be better to run a local bind server, and configure it to forward the link-local related requests to 224.0.0.251 (the multicast IP address) on port 5353, while forwarding all other requests to whatever the current name server (as listed in /etc/resolv.conf) is (or just return a SERVFAIL return code, which automatically causes the caller script to try the next name server).
2) The script should log to a file, so that it can be debugged more easily. That functionality can be built in relatively easily.
3) The script probably needs to be modified for stability. In particular, the daemon should run in its own thread. Currently, ping and a result parser each run in their own thread, while the daemon is called from the main code, which results in a blocking call to server.serve_forever() (or server.serverequest()). The current parser thread already detects if it did not receive any data from the ping thread for 5 seconds. This should trigger an event message (perhaps in a Queue) signalling that the current ping function must be killed and a new one started (or even kill off the thread, by deleting the thread instance). In addition, I should lock the Queue.
4) The issue of longer dns names may be solved by using tmdns instead of mdns.
5) The visualization script may need to be rewritten. Perhaps using Java (or Flash).
6) This probably happens after logout. Apparently, running it in the background doesn't properly detach it from the terminal. It probably needs to be started with a script in /etc/rc.d/zcip (or any script which does a proper two-step fork). Or we can run it ourselves, using the screen command, so it keeps running even after the terminal is closed. Bit ugly, but it works.
7) Bug squashing time! FIXED, both unrelated.
8) Check routes of those machines. Linux machines still seem to have a bug that puts the default route on all machines. Windows machines are still a mystery.
9) Fixed: Apache by default only advertises websites with non-default content. I just never changed it. I have now enabled it (RegisterUserSite all-users instead of RegisterUserSite customized-users). Note: it can also be enabled by hand using "mdns -R "Freek Dijkstra" _http._tcp local. 80 path=/~freek/" or something similar.
10) You must put the ethernet NIC above the wireless NIC in the network configuration system preferences.
11) We still need to make poster pictures.
12) Still mostly a mystery. The main culprit seems to be the ARP cache somehow. With ethereal, we confirmed that all ARP requests arrive at all machines. However, a machine (mostly one at the other side of the ocean) does not respond to that broadcast ping. We first suspected it was sending the reply on the wrong interface, but that does not seem to be the case. Further debugging showed that the machine was in fact attempting to send out a reply to the broadcast ping, but logged a "host not reachable" error at that time. As soon as it receives a unicast ARP request (regardless from which machine), it suddenly thinks the machine is reachable, and ARP replies are returned. This behaviour also seems to apply to unicast ARP requests in some cases. We are still puzzled by this behaviour.
13) Need to install it
14) No clue why. Perhaps we can run gridftp client on igrid-demo06 instead of the laptop.
15) File a bug report?
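Solutions 2) and 3) above (log to a file; restart ping when no data arrives for 5 seconds) can be sketched together as follows. This is not the current getlocalhosts.py code. The names watch and _pump, the logger name, and the restart limit are assumptions made for illustration, and the sketch assumes Python 3.7 or later. Python's queue.Queue does its own locking, so no extra lock is needed around it.

```python
import logging
import queue
import subprocess
import threading

log = logging.getLogger("getlocalhosts")  # hypothetical logger name


def _pump(stream, lines):
    """Reader thread: forward each output line, then None at end of stream."""
    for line in stream:
        lines.put(line.rstrip("\n"))
    lines.put(None)


def watch(command, handle_line, stall_timeout=5.0, max_restarts=3):
    """Run command, pass its output lines to handle_line, restart on stalls.

    A stall means no output for stall_timeout seconds. Returns the number
    of restarts that were performed.
    """
    restarts = 0
    while True:
        lines = queue.Queue()  # thread-safe; fresh queue for each process
        proc = subprocess.Popen(command, stdout=subprocess.PIPE, text=True)
        reader = threading.Thread(target=_pump, args=(proc.stdout, lines),
                                  daemon=True)
        reader.start()
        stalled = False
        while True:
            try:
                line = lines.get(timeout=stall_timeout)
            except queue.Empty:
                stalled = True  # nothing received within the timeout
                break
            if line is None:
                break  # process closed its output
            handle_line(line)
        if proc.poll() is None:
            proc.kill()
        proc.wait()
        reader.join()
        proc.stdout.close()
        if not stalled:
            log.info("%s exited with status %s", command[0], proc.returncode)
            return restarts
        if restarts >= max_restarts:
            log.error("%s stalled again, giving up", command[0])
            return restarts
        restarts += 1
        log.warning("%s stalled, restart %d", command[0], restarts)
```

To log to a file, call logging.basicConfig(filename=..., level=logging.INFO) once at startup. Running watch() in a thread of its own would leave the main code free for the blocking server.serve_forever() call, or the other way around.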
Categories
CategoryZeroconf