Friday, September 11, 2015

OpenStack nova instance failed to push static IP through config drive: openstack/content/0000 is missing

So I needed to bootstrap an instance configured with a static IP through config drive, instead of the usual DHCP. With nova boot everything seemed to be happy, but the instance just failed to get the static IP for some reason. Doing a nova console-log, the IP address shows up like this in cloud-init:

cloud-init[775]: Cloud-init v. 0.7.5 running 'init' at Fri, 11 Sep 2015 17:56:46 +0000. Up 29.44 seconds.
cloud-init[775]: ci-info: +++++++++++++++++++++++Net device info+++++++++++++++++++++++
cloud-init[775]: ci-info: +--------+------+-----------+-----------+-------------------+
cloud-init[775]: ci-info: | Device |  Up  |  Address  |    Mask   |     Hw-Address    |
cloud-init[775]: ci-info: +--------+------+-----------+-----------+-------------------+
cloud-init[775]: ci-info: |  lo:   | True | 127.0.0.1 | 255.0.0.0 |         .         |
cloud-init[775]: ci-info: | eth0:  | True |     .     |     .     | fa:16:3e:4d:35:34 |
cloud-init[775]: ci-info: +--------+------+-----------+-----------+-------------------+
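For reference, the boot command was along these lines (the image, flavor, and network names here are placeholders, not the exact ones I used):

# placeholders: swap in your own image/flavor/net-id
nova boot --image centos-7 --flavor m1.small \
  --nic net-id=$NET_UUID --config-drive true test-vm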

Obviously eth0 is not getting the static IP. Once again I mounted the config drive and tried to find the /openstack/content/0000 file, but I can't find it.

[root@nova ~]# find /mnt/openstack | grep content
[root@nova ~]#
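For the record, the config drive shows up as a block device labeled config-2, so mounting it goes roughly like this (the by-label path is an assumption on my part, verify with blkid):

# mount the config drive read-only at /mnt
mkdir -p /mnt
mount -o ro /dev/disk/by-label/config-2 /mnt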

This is pretty weird, as I remember the file should be there for static IP assignment through config drive to work. Taking a further look, it seems the problem comes down to the subnet configuration: the subnet that the instance is on has "enable_dhcp" set to true, and that prevents nova from writing the openstack/content/0000 file into the config drive.

To disable DHCP for a subnet, run this:
# neutron subnet-update $SUBNET_UUID --enable-dhcp=False
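To double-check the flag took effect (the subnet UUID is a placeholder):

# enable_dhcp should now read False
neutron subnet-show $SUBNET_UUID | grep enable_dhcp

Note that an already-running instance keeps its old config drive; you will need to rebuild or re-create the instance for the openstack/content/0000 file to show up.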

Thursday, September 10, 2015

OpenStack broken metadata interfaces template

This was my second time hitting the same issue and it wasted my whole day, so I figure I should document it as a note just in case it happens again.

In case you are seeing something like this in your metadata (be it from the metadata server or the config drive). In this example I mounted the config drive at /mnt:

# cat /mnt/openstack/content/0000
DEVICE="{{ name }}"
NM_CONTROLLED="no"
ONBOOT=yes
TYPE=Ethernet
BOOTPROTO=static
IPADDR={{ address }}
NETMASK={{ netmask }}
BROADCAST={{ broadcast }}
GATEWAY={{ gateway }}
DNS1={{ dns }}

#if $use_ipv6
IPV6INIT=yes
IPV6ADDR={{ address_v6 }}
#end if

Chances are you are hitting this bug.

As cloud-init doesn't really recognize a format like that, to fix the issue you will need to update the template /usr/share/nova/interfaces.template (assuming you are on CentOS/RHEL 7) with something like this, which is a Debian-ish template:

# Injected by Nova on instance boot
#
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).
# The loopback network interface
auto lo
iface lo inet loopback
{% for ifc in interfaces -%}
auto {{ ifc.name }}
iface {{ ifc.name }} inet static
address {{ ifc.address }}
netmask {{ ifc.netmask }}
broadcast {{ ifc.broadcast }}
{%- if ifc.gateway %}
gateway {{ ifc.gateway }}
{%- endif %}
{%- if ifc.dns %}
dns-nameservers {{ ifc.dns }}
{%- endif %}
{% if use_ipv6 -%}
iface {{ ifc.name }} inet6 static
address {{ ifc.address_v6 }}
netmask {{ ifc.netmask_v6 }}
{%- if ifc.gateway_v6 %}
gateway {{ ifc.gateway_v6 }}
{%- endif %}
{%- endif %}
{%- endfor %}
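As far as I can tell the template is rendered at instance build time, so no service restart should be needed; rebuilding the instance should be enough to regenerate the config drive. A quick sanity check, assuming the config drive is mounted at /mnt again (UUIDs are placeholders):

# rebuild so nova re-renders the network template into the config drive
nova rebuild $INSTANCE_UUID $IMAGE_UUID
# then re-mount the config drive and eyeball the rendered file
cat /mnt/openstack/content/0000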

Wednesday, September 2, 2015

Cloudera Manager agent failed to connect to previous supervisor

Continuing from an earlier article, I am hitting another roadblock while installing the Cloudera Manager agent.

[10/Sep/2015 09:10:54 +0000] 19017 MainThread agent        ERROR    Failed to connect to previous supervisor.
Traceback (most recent call last):
  File "/usr/lib64/cmf/agent/src/cmf/agent.py", line 1524, in find_or_start_supervisor
    self.get_supervisor_process_info()
  File "/usr/lib64/cmf/agent/src/cmf/agent.py", line 1725, in get_supervisor_process_info
    self.identifier = self.supervisor_client.supervisor.getIdentification()
  File "/usr/lib64/python2.6/xmlrpclib.py", line 1199, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib64/python2.6/xmlrpclib.py", line 1489, in __request
    verbose=self.__verbose
  File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/supervisor-3.0-py2.6.egg/supervisor/xmlrpc.py", line 460, in request
    self.connection.request('POST', handler, request_body, self.headers)
  File "/usr/lib64/python2.6/httplib.py", line 914, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib64/python2.6/httplib.py", line 951, in _send_request
    self.endheaders()
  File "/usr/lib64/python2.6/httplib.py", line 908, in endheaders
    self._send_output()
  File "/usr/lib64/python2.6/httplib.py", line 780, in _send_output
    self.send(msg)
  File "/usr/lib64/python2.6/httplib.py", line 739, in send
    self.connect()
  File "/usr/lib64/python2.6/httplib.py", line 720, in connect
    self.timeout)
  File "/usr/lib64/python2.6/socket.py", line 567, in create_connection
    raise error, msg
error: [Errno 111] Connection refused

So if you are seeing something like this, complaining that the Cloudera Manager agent failed to provide its heartbeat and such, you probably ran into a hostname issue. Check the hostname on the server and compare it with the one shown in the installer web GUI; if they are different, follow the procedure below to refresh the cached hostname (ref: link). It took me like an hour to figure out this painful workaround.

Installing on AWS, you must use private EC2 hostnames.
When installing on an AWS instance, and adding hosts using their public names, the installation will fail when the hosts fail to heartbeat.

Severity: Med

Workaround:

Use the Back button in the wizard to return to the original screen, where it prompts for a license.

Rerun the wizard, but choose "Use existing hosts" instead of searching for hosts. Now those hosts show up with their internal EC2 names.

Continue through the wizard and the installation should succeed.
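A quick way to compare the hostname the agent reports against the internal EC2 name is the instance metadata service; something along these lines (the config.ini knob is my assumption, double-check it against your agent version):

# what the OS thinks the FQDN is
hostname -f
# the internal EC2 name AWS assigned
curl -s http://169.254.169.254/latest/meta-data/local-hostname
# if they differ, fix /etc/hosts or set listening_hostname in
# /etc/cloudera-scm-agent/config.ini, then restart the agent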

Tuesday, September 1, 2015

Cloudera Manager agent installation fails due to missing ntp package

So I was trying to explore Cloudera Hadoop by following this installation guide and ran into an issue bringing up the Cloudera Manager agent (cloudera-scm-agent).

The installation GUI was complaining with the message below:

  Installation failed. Failed to receive heartbeat from agent.

Ensure that the host's hostname is configured properly.
Ensure that port 7182 is accessible on the Cloudera Manager Server (check firewall rules).
Ensure that ports 9000 and 9001 are free on the host being added.
Check agent logs in /var/log/cloudera-scm-agent/ on the host being added (some of the logs can be found in the installation details).


So looking at /var/log/cloudera-scm-agent/cloudera-scm-agent.log:

[10/Sep/2015 07:57:16 +0000] 2366 Monitor-HostMonitor throttling_logger ERROR    Failed to collect NTP metrics
Traceback (most recent call last):
  File "/usr/lib64/cmf/agent/src/cmf/monitor/host/ntp_monitor.py", line 37, in collect
    result, stdout, stderr = self._subprocess_with_timeout(args, self._timeout)
  File "/usr/lib64/cmf/agent/src/cmf/monitor/host/ntp_monitor.py", line 30, in _subprocess_with_timeout
    return subprocess_with_timeout(args, timeout)
  File "/usr/lib64/cmf/agent/src/cmf/subprocess_timeout.py", line 49, in subprocess_with_timeout
    p = subprocess.Popen(**kwargs)
  File "/usr/lib64/python2.6/subprocess.py", line 642, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.6/subprocess.py", line 1234, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

Looking at /usr/lib64/cmf/agent/src/cmf/monitor/host/ntp_monitor.py, the culprit is right there:

     35     try:
     36       args = ["ntpdc", "-np"]
     37       result, stdout, stderr = self._subprocess_with_timeout(args, self._timeout)

As a quick fix, a yum install ntp should help get rid of this error.
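In full, on a CentOS/RHEL 6 box the quick fix looks like this (EL6 service names):

# install ntp, which provides the ntpdc binary the agent shells out to
yum install -y ntp
service ntpd start
chkconfig ntpd on
# verify the command the monitor runs now works
ntpdc -np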