Friday, September 11, 2015

OpenStack Nova instance fails to get a static IP through config drive: openstack/content/0000 is missing

I needed to bootstrap an instance configured with a static IP through config drive, instead of the generic DHCP. With nova boot, everything seemed happy, but the instance failed to get the static IP for some reason. Running nova console-log, the IP address shows up like this in the cloud-init output.

cloud-init[775]: Cloud-init v. 0.7.5 running 'init' at Fri, 11 Sep 2015 17:56:46 +0000. Up 29.44 seconds.
cloud-init[775]: ci-info: +++++++++++++++++++++++Net device info+++++++++++++++++++++++
cloud-init[775]: ci-info: +--------+------+-----------+-----------+-------------------+
cloud-init[775]: ci-info: | Device |  Up  |  Address  |    Mask   |     Hw-Address    |
cloud-init[775]: ci-info: +--------+------+-----------+-----------+-------------------+
cloud-init[775]: ci-info: |  lo:   | True | 127.0.0.1 | 255.0.0.0 |         .         |
cloud-init[775]: ci-info: | eth0:  | True |     .     |     .     | fa:16:3e:4d:35:34 |
cloud-init[775]: ci-info: +--------+------+-----------+-----------+-------------------+

Obviously eth0 is not getting the static IP. I mounted the config drive and tried to find the /openstack/content/0000 file, but it was not there.

[root@nova ~]# find /mnt/openstack | grep content
[root@nova ~]#

This is pretty weird, as I remember that the file should be there for static IP assignment to work through config drive. Looking further, the problem turned out to be the subnet configuration. The subnet that the instance is on had "enable_dhcp" set to true, and that prevented the openstack/content/0000 file from being created on the config drive.

To disable DHCP for a subnet, run this.
# neutron  subnet-update  $SUBNET_UUID --enable-dhcp=False
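
After changing the subnet and booting a new instance, it helps to confirm that the content file really landed on the config drive. Below is a minimal Python sketch of the same check as the find command above. The function name list_content_files is my own, and it assumes the config drive is already mounted somewhere (I used /mnt in this post).

```python
import os


def list_content_files(mount_point):
    """Return sorted paths (relative to mount_point) of files under openstack/content."""
    content_dir = os.path.join(mount_point, "openstack", "content")
    if not os.path.isdir(content_dir):
        # No content directory at all: nothing was injected.
        return []
    found = []
    for root, _dirs, files in os.walk(content_dir):
        for name in files:
            full_path = os.path.join(root, name)
            found.append(os.path.relpath(full_path, mount_point))
    return sorted(found)
```

Calling list_content_files("/mnt") returns an empty list in the broken case described here, and should include openstack/content/0000 once the network file is injected.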

Thursday, September 10, 2015

OpenStack broken metadata interfaces template

This was my 2nd time hitting the same issue and it wasted my whole day, so I think I should document it as a note in case it happens again.

You may see something like this from your metadata (be it the metadata server or config drive). In this example I mounted the config drive at /mnt.

# cat /mnt/openstack/content/0000
DEVICE="{{ name }}"
NM_CONTROLLED="no"
ONBOOT=yes
TYPE=Ethernet
BOOTPROTO=static
IPADDR={{ address }}
NETMASK={{ netmask }}
BROADCAST={{ broadcast }}
GATEWAY={{ gateway }}
DNS1={{ dns }}

#if $use_ipv6
IPV6INIT=yes
IPV6ADDR={{ address_v6 }}
#end if

Chances are that you are hitting this bug.

cloud-init doesn't recognize a format like that, so to fix the issue you need to replace the template /usr/share/nova/interfaces.template (assuming you are on CentOS/RHEL 7) with something like this, which is a Debian-style template:

# Injected by Nova on instance boot
#
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).
# The loopback network interface
auto lo
iface lo inet loopback
{% for ifc in interfaces -%}
auto {{ ifc.name }}
iface {{ ifc.name }} inet static
address {{ ifc.address }}
netmask {{ ifc.netmask }}
broadcast {{ ifc.broadcast }}
{%- if ifc.gateway %}
gateway {{ ifc.gateway }}
{%- endif %}
{%- if ifc.dns %}
dns-nameservers {{ ifc.dns }}
{%- endif %}
{% if use_ipv6 -%}
iface {{ ifc.name }} inet6 static
address {{ ifc.address_v6 }}
netmask {{ ifc.netmask_v6 }}
{%- if ifc.gateway_v6 %}
gateway {{ ifc.gateway_v6 }}
{%- endif %}
{%- endif %}
{%- endfor %}
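
To tell quickly whether an injected file is affected, you can look for template markers that survived rendering. This is only a rough Python sketch: the function name unrendered_markers and the two regular expressions are my own assumptions, based on the broken output shown earlier in this post ({{ name }} style placeholders and #if / #end if lines).

```python
import re

# {{ name }} style placeholders that were never substituted.
PLACEHOLDER = re.compile(r"\{\{\s*[\w.]+\s*\}\}")
# "#if ..." and "#end if" directive lines left over in the output.
DIRECTIVE = re.compile(r"^\s*#(if\b|end if\b)", re.MULTILINE)


def unrendered_markers(text):
    """Return the template markers still present in an injected network file."""
    placeholders = PLACEHOLDER.findall(text)
    directives = [m.group(0).strip() for m in DIRECTIVE.finditer(text)]
    return placeholders + directives
```

An empty result means no leftover markers were found. Any hit means the file on the config drive is still a raw template, which cloud-init cannot use.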

Wednesday, September 2, 2015

Cloudera Manager agent failed to connect to previous supervisor

Continuing from an article written earlier, I hit another roadblock while installing the Cloudera Manager agent.

[10/Sep/2015 09:10:54 +0000] 19017 MainThread agent        ERROR    Failed to connect to previous supervisor.
Traceback (most recent call last):
  File "/usr/lib64/cmf/agent/src/cmf/agent.py", line 1524, in find_or_start_supervisor
    self.get_supervisor_process_info()
  File "/usr/lib64/cmf/agent/src/cmf/agent.py", line 1725, in get_supervisor_process_info
    self.identifier = self.supervisor_client.supervisor.getIdentification()
  File "/usr/lib64/python2.6/xmlrpclib.py", line 1199, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib64/python2.6/xmlrpclib.py", line 1489, in __request
    verbose=self.__verbose
  File "/usr/lib64/cmf/agent/build/env/lib/python2.6/site-packages/supervisor-3.0-py2.6.egg/supervisor/xmlrpc.py", line 460, in request
    self.connection.request('POST', handler, request_body, self.headers)
  File "/usr/lib64/python2.6/httplib.py", line 914, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib64/python2.6/httplib.py", line 951, in _send_request
    self.endheaders()
  File "/usr/lib64/python2.6/httplib.py", line 908, in endheaders
    self._send_output()
  File "/usr/lib64/python2.6/httplib.py", line 780, in _send_output
    self.send(msg)
  File "/usr/lib64/python2.6/httplib.py", line 739, in send
    self.connect()
  File "/usr/lib64/python2.6/httplib.py", line 720, in connect
    self.timeout)
  File "/usr/lib64/python2.6/socket.py", line 567, in create_connection
    raise error, msg
error: [Errno 111] Connection refused

If you are seeing something like this, along with complaints that the Cloudera Manager agent fails to send its heartbeat, you have probably run into a hostname issue. You may want to fix the hostname entry by referencing this link. Check the hostname on the server and compare it with the one shown in the installer web GUI; if they are different, follow the procedure below to refresh the cached hostname (ref: link). It took me about an hour to figure out this painful workaround.

Installing on AWS, you must use private EC2 hostnames.
When installing on an AWS instance, and adding hosts using their public names, the installation will fail when the hosts fail to heartbeat.

Severity: Med

Workaround:

Use the Back button in the wizard to return to the original screen, where it prompts for a license.

Rerun the wizard, but choose "Use existing hosts" instead of searching for hosts. Now those hosts show up with their internal EC2 names.

Continue through the wizard and the installation should succeed.
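
The hostname comparison mentioned above can be sketched in Python as follows. This is only an illustration, not part of Cloudera Manager: the function names are my own, and the name shown in the installer GUI has to be typed in by hand.

```python
import socket


def local_names():
    """Names this host reports for itself: short hostname and FQDN."""
    return [socket.gethostname(), socket.getfqdn()]


def hostname_mismatch(names, name_in_installer):
    """Return True when the installer's name matches none of the host's own names."""
    wanted = name_in_installer.strip().lower().rstrip(".")
    known = {n.strip().lower().rstrip(".") for n in names if n}
    return wanted not in known
```

For example, hostname_mismatch(local_names(), "name-from-the-installer-gui") returning True suggests the installer is using a name the host does not know itself by, which is the situation the workaround addresses.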

Tuesday, September 1, 2015

Cloudera Manager agent installation fails due to missing ntp package

I was trying to explore Cloudera Hadoop by following this installation guide and ran into an issue bringing up the Cloudera Manager agent (cloudera-scm-agent).

The installation GUI complained with the message below:

  Installation failed. Failed to receive heartbeat from agent.

Ensure that the host's hostname is configured properly.
Ensure that port 7182 is accessible on the Cloudera Manager Server (check firewall rules).
Ensure that ports 9000 and 9001 are free on the host being added.
Check agent logs in /var/log/cloudera-scm-agent/ on the host being added (some of the logs can be found in the installation details).


Looking at /var/log/cloudera-scm-agent/cloudera-scm-agent.log:

[10/Sep/2015 07:57:16 +0000] 2366 Monitor-HostMonitor throttling_logger ERROR    Failed to collect NTP metrics
Traceback (most recent call last):
  File "/usr/lib64/cmf/agent/src/cmf/monitor/host/ntp_monitor.py", line 37, in collect
    result, stdout, stderr = self._subprocess_with_timeout(args, self._timeout)
  File "/usr/lib64/cmf/agent/src/cmf/monitor/host/ntp_monitor.py", line 30, in _subprocess_with_timeout
    return subprocess_with_timeout(args, timeout)
  File "/usr/lib64/cmf/agent/src/cmf/subprocess_timeout.py", line 49, in subprocess_with_timeout
    p = subprocess.Popen(**kwargs)
  File "/usr/lib64/python2.6/subprocess.py", line 642, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.6/subprocess.py", line 1234, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

Looking at /usr/lib64/cmf/agent/src/cmf/monitor/host/ntp_monitor.py, the culprit is there: the agent runs ntpdc, which is not installed on the host.

     35     try:
     36       args = ["ntpdc", "-np"]
     37       result, stdout, stderr = self._subprocess_with_timeout(args, self._timeout)

As a quick fix, running yum install ntp should get rid of this error.
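
To catch this kind of problem before starting the agent, you can check that the external commands it shells out to are on the PATH. Here is a small Python 3 sketch (the agent itself runs on Python 2.6, so this is a separate helper, and the function name missing_commands is my own). The only command I know the agent needs from this traceback is ntpdc.

```python
import shutil


def missing_commands(commands):
    """Return the commands from the list that cannot be found on PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]
```

On the affected host, missing_commands(["ntpdc"]) would return ["ntpdc"] until the ntp package is installed.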

Tuesday, July 30, 2013

Passed EX436 Clustering and Storage Management last Sunday

For the sake of self-achievement, I have been planning to get my RHCA since 2011, and hopefully this can be done before the end of 2013 if everything goes smoothly.

Last Sunday I finished EX436, which is my 2nd RHCA exam out of a series of 5. The score was 268 out of 300, which is good enough for me. The major focus of EX436 is clustering and storage management, an area I have some experience in but would definitely love to improve. The exam itself is interesting, challenging and fun. Like other well-known RHCA bloggers out there, I would say that the exam is not really tough provided you are well prepared on the topics.

For those who would like to take this exam, the best bet is certainly to join the training class provided by Red Hat :-) . But for those who want to save some bucks, you have to work on your own, and a good start is to revisit the Course Outline here (link). I basically stuck with the official Red Hat guides on clustering, GFS/GFS2, Multipath, Fencing, LVM and CLVM (all can be found here) and kept practicing in my own lab. Based on the latest outline, you may also want to look at XFS and Gluster (I am not sure where to find resources for them on the Red Hat site, though). I didn't practice XFS and Gluster in my own lab, as I was following the previous Course Outline, which didn't include XFS and Gluster at all (!!!). So it was a big surprise when I saw those questions in the exam. However, thanks to my job duties I did have some exposure to them, and luckily that exposure helped me survive the exam.
 
People may be interested in the distro or exact version in question, but due to the NDA I can't say which version it is :-) . What I can say is that clustering and GFS/GFS2 don't differ much between RHEL 5 and 6, at least from the exam perspective. My lab was based on RHEL 5 and I didn't subscribe to any Red Hat subscription service (I copied all the RPMs off the ISO and created my own repository tree, plus yum repository configuration files, so the lab machines could fetch the required packages).

My other advice is to stay calm during the exam. During my exam I got stuck on a particular task and became a bit nervous, which led me to mistakenly reboot my host (!!) in the middle of the exam. Those who have already taken the RHCSA/RHCE exams may know that your exam system is a VM sitting on a physical machine dedicated to you. In my situation I rebooted the host, which suspended all my VMs, and at that point I didn't know what would happen. The worst case would have been a re-image of all the exam VMs, leaving me to rebuild everything in the remaining 2 hours of the 4-hour exam. Luckily all my VMs were still there after the host reboot, and the only impact was 15 minutes of downtime on my exam environment (a million thanks to the examiner who helped recover my environment).

My upcoming exams are:

EX442    Red Hat Enterprise System Monitoring and Performance Tuning Expertise Exam
EX333    Red Hat Enterprise Security: Network Services Expertise Exam
EX401    Red Hat Enterprise Deployment and Systems Management Expertise Exam

Hopefully I will be taking EX442 in September if everything goes smoothly. EX442 is well known among RHCA candidates for its complexity, so I look forward to giving it a try.

Wednesday, July 24, 2013

A quick and dirty munin plugin to count the number of VMs running on a RHEL/CentOS based KVM host

Recently I was configuring munin to monitor some QEMU/KVM hosts, which are generic RHEL servers (note: not RHEV) running libvirtd and QEMU/KVM.

Here is a plugin that I created. It is quick and dirty, but it should work as expected. Just copy and paste the plugin file into the /etc/munin/plugins/ directory and make sure it is executable (755, ideally), and you should be good.

Here is the content of the file.

[root@localhost plugins]# cat /etc/munin/plugins/vm_count
#!/bin/sh

case $1 in
   config)
        cat <<'EOM'
graph_title Number of VMs
graph_vlabel VMcount
vmcount.label VMcount
vmcount.graph_category Vserver
EOM
        exit 0;;
esac

# Count running qemu-kvm processes; the [/] keeps grep from matching its own command line.
count=$(ps auxww | grep -c '[/]usr/libexec/qemu-kvm')
echo "vmcount.value $count"
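
The [/] in the grep pattern is what keeps the grep process itself out of the count. A small Python sketch of the same matching logic makes this easier to see. The function name and the sample ps lines are made up for illustration.

```python
import re

# Same pattern as the plugin: the bracket expression [/] matches a single "/".
PATTERN = re.compile(r"[/]usr/libexec/qemu-kvm")


def count_qemu_kvm(ps_output):
    """Count lines of ps output that contain /usr/libexec/qemu-kvm."""
    return sum(1 for line in ps_output.splitlines() if PATTERN.search(line))


# Hypothetical ps output: two VMs plus the grep process itself.
SAMPLE_PS = """\
qemu  2101  5.0  /usr/libexec/qemu-kvm -name vm01
qemu  2102  4.2  /usr/libexec/qemu-kvm -name vm02
root  3001  0.0  grep [/]usr/libexec/qemu-kvm
"""
```

The grep line contains the literal text [/]usr, which has no /usr substring for the pattern to match, so only the two qemu-kvm lines are counted.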


And here is how it works.

# You should be able to execute it directly from the system shell. In this example I had 17 VMs running on the host.

[root@localhost plugins]# pwd
/etc/munin/plugins

[root@localhost plugins]# ./vm_count
vmcount.value 17


# Alternatively, you can test it with munin-run. This is what the output looks like when munin loads the plugin.

[root@localhost plugins]# munin-run vm_count
vmcount.value 17





# And here are the parameters of this plugin.

[root@localhost plugins]# munin-run vm_count  config
graph_title Number of VMs
graph_vlabel VMcount
vmcount.label VMcount
vmcount.graph_category Vserver

Monday, July 22, 2013

Resource [Host:N] is unreachable: Host N: Unable to start instance due to Template systemvm-kvm-3.0.0 has not been completely downloaded to zone N

Because of my job, I have to deal with Citrix CloudStack day to day. Recently we were deploying a new advanced zone, and for some reason we saw errors like this while deploying our first VM instance.


2013-07-22 22:09:27,625 WARN  [api.commands.DeployVMCmd] (Job-Executor-50:job-534828) Exception:
com.cloud.exception.AgentUnavailableException: Resource [Host:N] is unreachable: Host N: Unable to start instance due to Template systemvm-kvm-3.0.0 has not been completely downloaded to zone N

................
Caused by: com.cloud.utils.exception.CloudRuntimeException: Template systemvm-kvm-3.0.0 has not been completely downloaded to zone N
................
2013-07-22 22:09:27,626 WARN  [cloud.api.ApiDispatcher] (Job-Executor-50:job-534828) class com.cloud.api.ServerApiException : Resource [Host:N] is unreachable: Host N: Unable to start instance due to Template systemvm-kvm-3.0.0 has not been completely downloaded to zone N


Basically, what CloudStack does here is:
1. Check whether there is a valid systemvm template (in this case systemvm-kvm-3.0.0) deployed to the zone.
2. If things work as they should, you can find the installed/downloaded template in the tables cloud.vm_template, cloud.template_zone_ref and cloud.template_host_ref. Likewise, if you scan through the template list in the web GUI, you should see the template marked as downloaded.

In my case, the template was not downloaded as it should have been (or not marked as downloaded at the DB layer), and the table cloud.template_host_ref shows the abnormality.

mysql> select * from  template_host_ref where id=11111\G
*************************** 1. row ***************************
            id: 11111
       host_id: *masked*
   template_id: *masked*
       created: 2013-07-18 17:50:43
  last_updated: 2013-07-22 20:04:52
        job_id: 75a75e55-5280-4ba5-b823-cadbcbe2cc7a
  download_pct: 0
          size: 0
 physical_size: 0
download_state: DOWNLOAD_ERROR
     error_str: No route to host

    local_path: /mnt/SecStorage/04ab8f0b-c4e0-34a4-80b3-457c433acde3/template/tmpl/2/1686/dnld6951269530983090325tmp_
  install_path: NULL
           url: http://download.cloud.com/templates/acton/acton-systemvm-02062012.qcow2.bz2
     destroyed: 0
       is_copy: 0


The things to notice are: 1) download_pct is 0 (it should be 100 if the download succeeded), 2) download_state is DOWNLOAD_ERROR (it should be DOWNLOADED if the download succeeded), and 3) error_str is "No route to host".
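
The three checks above are easy to script if you have many rows to look at. Here is a minimal Python sketch that takes one template_host_ref row as a dict (for example from a DB cursor) and lists what looks wrong. The function name download_problems is my own, and the expected values are just the ones described in this post.

```python
def download_problems(row):
    """Return a list of reasons why a template_host_ref row looks unhealthy."""
    problems = []
    if row.get("download_pct") != 100:
        problems.append("download_pct is %s, expected 100" % row.get("download_pct"))
    if row.get("download_state") != "DOWNLOADED":
        problems.append("download_state is %s, expected DOWNLOADED" % row.get("download_state"))
    if row.get("error_str"):
        problems.append("error_str is set: %s" % row.get("error_str"))
    return problems
```

For the row shown above this returns three problems, one per check. A healthy row returns an empty list.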

In my case, the template installation procedure was not completed, at least at the DB layer, even though I had run the cloud-install-sys-tmplt script per the official installation guide.

So I double-checked the secondary storage to make sure the template file was completely downloaded (IMPORTANT!!! If the file is not there, go through the installation guide and re-run the cloud-install-sys-tmplt script) and hacked the DB by updating the cloud.template_host_ref table. (Replace each "N" with the correct account id and template id respectively.)

mysql> update template_host_ref set download_pct=100, download_state='DOWNLOADED', error_str=NULL, local_path='template/tmpl/N/N' where id=11111\G
*************************** 1. row ***************************
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0


Now CloudStack can launch VMs as it should.