3.3. Methodology for Containerized Openstack Monitoring¶
Abstract
This document describes one of the Containerized OpenStack monitoring solutions: a scalable and comprehensive architecture that obtains all crucial performance metrics at each layer of the stack.
3.3.1. Containerized Openstack Monitoring Architecture¶
This part of the documentation describes the performance metrics required at each layer of Containerized OpenStack.
Containerized OpenStack comprises three layers at which the monitoring system should be able to query all necessary counters:
OS layer
Kubernetes layer
Openstack layer
Monitoring instruments are logically divided into two groups:
Monitoring Server Side
Node Client Side
3.3.1.1. Operation System Layer¶
We used Ubuntu Xenial on top of bare-metal servers for both the server and the node side.
3.3.1.1.1. Baremetal hardware description¶
We deployed everything in a 200-server environment with the following hardware characteristics:
server  | vendor,model    | HP, DL380 Gen9
CPU     | vendor,model    | Intel, E5-2680 v3
        | processor_count | 2
        | core_count      | 12
        | frequency_MHz   | 2500
RAM     | vendor,model    | HP, 752369-081
        | amount_MB       | 262144
NETWORK | interface_name  | p1p1
        | vendor,model    | Intel, X710 Dual Port
        | bandwidth       | 10G
STORAGE | dev_name        | /dev/sda
        | vendor,model    | raid10 - HP P840, 12 disks EH0600JEDHE
        | SSD/HDD         | HDD
        | size            | 3.6TB
3.3.1.1.2. Operating system configuration¶
Bare-metal nodes were provisioned with Cobbler using our in-house preseed scripts. The OS versions we used:
Software | Version
Ubuntu   | Ubuntu 16.04.1 LTS
Kernel   | 4.4.0-47-generic
You can find the /etc folder contents from one of the typical systems we used:
3.3.1.1.3. Required system metrics¶
At this layer we must track the following list of processes:
List of processes | Mariadb
                  | Keystone
                  | Glance
                  | Cinder
                  | Nova
                  | Neutron
                  | Openvswitch
                  | Kubernetes
And the following list of metrics:
Node load average                 | 1min
                                  | 15min
Global process stats              | Running
                                  | Waiting
Global CPU Usage                  | Steal
                                  | Wait
                                  | User
                                  | System
                                  | Interrupt
                                  | Nice
                                  | Idle
Per CPU Usage                     | User
                                  | System
Global memory usage               | bandwidth
                                  | Cached
                                  | Buffered
                                  | Free
                                  | Used
                                  | Total
Numa monitoring (for each node)   | Numa_hit
                                  | Numa_foreign
                                  | Local_node
                                  | Other_node
Numa monitoring (for each pid)    | Huge
                                  | Stack
                                  | Private
Global IOSTAT + per-device IOSTAT | Merge reads /s
                                  | Merge writes /s
                                  | read/s
                                  | write/s
                                  | Read transfer
                                  | Write transfer
                                  | Read latency
                                  | Write latency
                                  | Queue size
                                  | Await
Network per interface             | Octets /s (in, out)
                                  | Dropped /s
Other system metrics              | Entropy
                                  | DF per device
3.3.1.2. Kubernetes Layer¶
Kargo from Fuel-CCP-installer was our main tool to deploy K8S on top of the provisioned systems (monitored nodes).
Kargo sets up Kubernetes in the following way:
masters: Calico, Kubernetes API services
nodes: Calico, Kubernetes minion services
etcd: etcd service
3.3.1.2.1. Kargo deployment parameters¶
You can find the Kargo deployment script in the Kargo deployment script section. The following deployment parameters were used:
docker_options: "--insecure-registry 172.20.8.35:5000 -D"
upstream_dns_servers: [172.20.8.34, 8.8.4.4]
nameservers: [172.20.8.34, 8.8.4.4]
kube_service_addresses: 10.224.0.0/12
kube_pods_subnet: 10.240.0.0/12
kube_network_node_prefix: 22
kube_apiserver_insecure_bind_address: "0.0.0.0"
dns_replicas: 3
dns_cpu_limit: "100m"
dns_memory_limit: "512Mi"
dns_cpu_requests: "70m"
dns_memory_requests: "70Mi"
deploy_netchecker: false
Software | Version
         | 6fd81252cb2d2c804f388337aa67d4403700f094
         | 2c23027794d7851ee31363c5b6594180741ee923
3.3.1.2.2. Required K8S metrics¶
Here we should get K8S health metrics and ETCD performance metrics:
ETCD performance metrics | Members count / states
                         | Size of data set
                         | Avg. latency from leader to followers
                         | Bandwidth rate, send/receive
                         | Create store success/fail
                         | Get success/fail
                         | Set success/fail
                         | Package rate, send/receive
                         | Expire count
                         | Update success/fail
                         | Compare-and-swap success/fail
                         | Watchers
                         | Delete success/fail
                         | Compare-and-delete success/fail
                         | Append requests, send/receive
K8S health metrics       | Number of nodes in each state
                         | Total number of namespaces
                         | Total number of PODs per cluster, node and namespace
                         | Total number of services
                         | Endpoints in each service
                         | Number of API service instances
                         | Number of controller instances
                         | Number of scheduler instances
                         | Cluster resources (scheduler view)
K8S API log analysis     | Number of responses (per each HTTP code)
                         | Response time
For the last two metrics we utilize a log collector to store and parse all log records within the K8S environments.
3.3.1.3. Openstack Layer¶
CCP stands for “Containerized Control Plane”. CCP aims to build, run and manage production-ready OpenStack containers on top of a Kubernetes cluster.
Software | Version
         | 8570d0e0e512bd16f8449f0a10b1e3900fd09b2d
3.3.1.3.1. CCP configuration¶
CCP was deployed on top of the 200-node K8S cluster in the following configuration:
node[1-3]: Kubernetes
node([4-6])$: # 4-6
roles:
- controller
- openvswitch
node[7-9]$: # 7-9
roles:
- rabbitmq
node10$: # 10
roles:
- galera
node11$: # 11
roles:
- heat
node(1[2-9])$: # 12-19
roles:
- compute
- openvswitch
node[2-9][0-9]$: # 20-99
roles:
- compute
- openvswitch
node(1[0-9][0-9])$: # 100-199
roles:
- compute
- openvswitch
node200$:
roles:
- backup
CCP OpenStack services list (versions.yaml):
openstack/cinder:
git_ref: stable/newton
git_url: https://github.com/openstack/cinder.git
openstack/glance:
git_ref: stable/newton
git_url: https://github.com/openstack/glance.git
openstack/heat:
git_ref: stable/newton
git_url: https://github.com/openstack/heat.git
openstack/horizon:
git_ref: stable/newton
git_url: https://github.com/openstack/horizon.git
openstack/keystone:
git_ref: stable/newton
git_url: https://github.com/openstack/keystone.git
openstack/neutron:
git_ref: stable/newton
git_url: https://github.com/openstack/neutron.git
openstack/nova:
git_ref: stable/newton
git_url: https://github.com/openstack/nova.git
openstack/requirements:
git_ref: stable/newton
git_url: https://git.openstack.org/openstack/requirements.git
openstack/sahara-dashboard:
git_ref: stable/newton
git_url: https://git.openstack.org/openstack/sahara-dashboard.git
K8S Ingress Resources rules were enabled during CCP deployment to expose OpenStack service endpoints to the external routable network; a quick reachability check is shown below.
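With the Ingress rules in place the OpenStack endpoints resolve by name on the external network. A minimal check (the hostname and port are the ones used by deploy-ccp.sh in the Applications section; this line is only an illustration, not part of the original tooling):
curl -ks https://identity.ccp.external:8443/ > /dev/null && echo "keystone is reachable through ingress"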
See CCP deployment script and configuration files in the CCP deployment and configuration files section.
3.3.2. Implementation¶
This part of the documentation describes the Monitoring System implementation. Here is the software list that we chose to accomplish all required tasks:
Monitoring Node Server Side | Metrics server, Log storage
Monitored Node Client Side  | Metrics agent, Log collector
3.3.2.1. Server Side Software¶
3.3.2.1.1. Prometheus¶
Software   | Version
Prometheus | 7e369b9318a4d5d97a004586a99f10fa51a46b26
Due to the high load rate we faced an issue with Prometheus performance once the metrics count grew to around 15 million. We therefore split the Prometheus setup into 2 standalone nodes: the first node polls API metrics from the K8S-related services that are natively available at the /metrics URI and exposed by the K8S API and ETCD API by default; the second node stores all other metrics that are collected and calculated locally on the environment servers via Telegraf.
Prometheus node deployment scripts and configuration files can be found in the Prometheus deployment and configuration files section.
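As a quick sanity check of this split (a sketch only: the kube:changeme credentials, the API server address pattern and the Telegraf port 9126 are taken from the configuration files in the Applications section; the node addresses are illustrative):
# metrics natively exposed by the K8S API, scraped by the first Prometheus node
curl -ks -u kube:changeme https://172.20.8.61/metrics | head
# system/OpenStack metrics calculated by Telegraf, scraped by the second node
curl -s http://<cluster-node>:9126/metrics | head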
3.3.2.1.2. Grafana¶
Software | Version
Grafana  | v4.0.1
Grafana was used as the metrics visualizer, with a separate dashboard built for each group of metrics:
System nodes metrics
Kubernetes metrics
ETCD metrics
Openstack metrics
You can find their settings in the Grafana dashboards configuration section.
Grafana server deployment script:
#!/bin/bash
ansible-playbook -i ./hosts ./deploy-graf-prom.yaml --tags "grafana"
It uses the same yaml configuration file deploy-graf-prom.yaml from the Prometheus deployment and configuration files section.
3.3.2.1.3. ElasticSearch¶
Software      | Version
ElasticSearch | 2.4.2
ElasticSearch is a well-known, proven log storage, and we used it on a standalone node for collecting Kubernetes API logs and all other container logs from across the environment. For appropriate performance on the 200-node lab we increased ES_HEAP_SIZE from the default 1G to 10G in the /etc/default/elasticsearch configuration file.
ElasticSearch and the Kibana dashboard were installed with the deploy_elasticsearch_kibana.sh deployment script.
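For reference, the resulting line in /etc/default/elasticsearch (the same change is applied by a sed call in that deployment script):
ES_HEAP_SIZE=10g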
3.3.2.1.4. Kibana¶
Software | Version
Kibana   | 4.5.4
We used Kibana as the main visualization tool for ElasticSearch. It allowed us to create charts based on the K8S API log analysis. Kibana was installed on a single separate node with a single dashboard representing the K8S API response time graph.
Dashboard settings:
3.3.2.2. Client side Software¶
3.3.2.2.1. Telegraf¶
Software | Version
Telegraf | v1.0.0-beta2-235-gbc14ac5 (git: openstack_stats, bc14ac5b9475a59504b463ad8f82ed810feed3ec)
Telegraf was chosen as the client-side metrics agent. It provides multiple ways to poll and calculate metrics from a variety of sources. Thanks to its plugin-driven nature, it takes data from different inputs and exposes the calculated metrics in Prometheus format. We used a forked version of Telegraf with custom patches in order to utilize a custom OpenStack input plugin.
The following automation scripts and configuration files were used to start the Telegraf agent across the environment nodes:
Telegraf deployment and configuration files
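The agent itself is normally started by the telegraf.service systemd unit installed by the playbook; for ad-hoc debugging it can also be run in the foreground against the same layout (a sketch, assuming the paths created by deploy-telegraf.yaml):
sudo -u telegraf /opt/telegraf/bin/telegraf -config /opt/telegraf/etc/telegraf.conf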
Below you can see which plugins were used to obtain metrics.
3.3.2.2.1.1. Standard Plugins¶
inputs.cpu
inputs.disk
inputs.diskio
inputs.kernel
inputs.mem
inputs.processes
inputs.swap
inputs.system
inputs.kernel_vmstat
inputs.net
inputs.netstat
inputs.exec
3.3.2.2.1.2. Openstack input plugin¶
The custom inputs.openstack plugin was used to gather most of the required OpenStack-related metrics.
settings:
interval = '40s'
identity_endpoint = "http://keystone.ccp.svc.cluster.local:5000/v3"
domain = "default"
project = "admin"
username = "admin"
password = "password"
3.3.2.2.1.3. System.exec plugin¶
The system.exec plugin was used to trigger scripts that poll and calculate all non-standard metrics.
common settings:
interval = "15s"
timeout = "30s"
data_format = "influx"
commands:
"/opt/telegraf/bin/list_openstack_processes.sh"
"/opt/telegraf/bin/per_process_cpu_usage.sh"
"/opt/telegraf/bin/numa_stat_per_pid.sh"
"/opt/telegraf/bin/iostat_per_device.sh"
"/opt/telegraf/bin/memory_bandwidth.sh"
"/opt/telegraf/bin/network_tcp_queue.sh"
"/opt/telegraf/bin/etcd_get_metrics.sh"
"/opt/telegraf/bin/k8s_get_metrics.sh"
"/opt/telegraf/bin/vmtime.sh"
"/opt/telegraf/bin/osapitime.sh"
You can see the full Telegraf configuration file and its custom input scripts in the Telegraf deployment and configuration files section.
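Each of these scripts prints its measurement in InfluxDB line protocol, which Telegraf then re-exposes through its Prometheus output. For example, list_openstack_processes.sh (listed in full in that section) emits a single line similar to the following (values are illustrative):
system_openstack_list mariadb=2,rabbitmq=1,keystone=4,glance=2,cinder=0,nova=6,neutron=8,openvswitch=3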
3.3.2.2.2. Heka¶
Software | Version
Heka     | 0.10.0
We chose Heka as the log collecting agent for its wide variety of inputs (including the ability to feed data from the Docker socket), its filters (custom shorthand sandbox filters written in Lua) and its ability to encode data for ElasticSearch.
With the Heka agent started across the environment servers we were able to send container logs to the ElasticSearch server. With a custom Lua filter we extracted the K8S API data and converted it into an appropriate format to visualize API timing counters (average response time).
Heka deployment scripts and the configuration file with the custom Lua filter are in the Heka deployment and configuration section.
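A simple way to confirm that logs are actually arriving is to ask the standalone ElasticSearch node for its indices (the address is the default from the deployment scripts; the check itself is only an illustration):
curl -s "http://172.20.9.3:9200/_cat/indices?v"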
3.3.3. Applications¶
3.3.3.1. Kargo deployment script¶
3.3.3.1.1. deploy_k8s_using_kargo.sh¶
#!/usr/bin/env bash
: ${DB_CONNECTION_STRING:?"You need to specify DB_CONNECTION_STRING parameter"}
: ${ENV_NAME:?"You need to specify ENV_NAME parameter"}
: ${MANAGEMENT_INTERFACE:="p1p1.602"}
: ${COBBLER_ADDRESS:="172.20.8.34"}
: ${CUSTOM_YAML}
: ${KARGO_REPO}
: ${KARGO_COMMIT}
: ${FUEL_CCP_COMMIT}
: ${ADMIN_USER}
: ${ADMIN_PASSWORD}
: ${ADMIN_NODE_CLEANUP}
DEPLOY_METHOD="kargo"
WORKSPACE="~/kargo_workspace_${ENV_NAME}"
SSH_OPTIONS="-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null"
get_env_nodes ()
{
ENV_NODES_NAMES=$(echo $(psql ${DB_CONNECTION_STRING} -c "select name from servers where environment_id in (select id from environments where name='${ENV_NAME}')" -P format=unaligned -t))
if [ -z "${ENV_NODES_NAMES}" ]
then
echo "No nodes in environment with name ${ENV_NAME}"
exit 1
fi
}
get_env_nodes_ips ()
{
ENV_NODES_IPS=$(echo $(ssh ${SSH_OPTIONS} root@${COBBLER_ADDRESS} bash -ex << EOF
for COBBLER_SYSTEM_NAME in ${ENV_NODES_NAMES}
do
NODE_IP=\$(cobbler system dumpvars --name=\${COBBLER_SYSTEM_NAME} | grep ^ip_address_${MANAGEMENT_INTERFACE} | awk '{print \$3}')
NODE_IPS+=\${NODE_IP}" "
done
echo \${NODE_IPS}
EOF
))
}
main ()
{
get_env_nodes
get_env_nodes_ips
export ADMIN_IP=$(echo ${ENV_NODES_IPS} | awk '{print $1}')
export SLAVE_IPS=$(echo ${ENV_NODES_IPS})
# for SLAVE_IP in ${SLAVE_IPS}
# do
# ssh ${SSH_OPTIONS} root@${SLAVE_IP} bash -ex << EOF
#echo "deb https://apt.dockerproject.org/repo ubuntu-\$(grep DISTRIB_CODENAME /etc/lsb-release | awk -F"=" '{print \$2}') main" >> /etc/apt/sources.list
#apt-get update && apt-get install -y --allow-unauthenticated -o Dpkg::Options::="--force-confdef" docker-engine
#EOF
# done
if [ -d "$WORKSPACE" ] ; then
rm -rf $WORKSPACE
fi
mkdir -p $WORKSPACE
cd $WORKSPACE
if [ -d './fuel-ccp-installer' ] ; then
rm -rf ./fuel-ccp-installer
fi
git clone https://review.openstack.org/openstack/fuel-ccp-installer
cd ./fuel-ccp-installer
if [ "$FUEL_CCP_COMMIT" ]; then
git fetch https://git.openstack.org/openstack/fuel-ccp-installer $FUEL_CCP_COMMIT && git checkout FETCH_HEAD
fi
echo "Running on $NODE_NAME: $ENV_NAME"
bash -xe "./utils/jenkins/run_k8s_deploy_test.sh"
}
main
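A typical invocation could look like this (the connection string and environment name are illustrative; these two variables are the only mandatory parameters, as enforced by the checks at the top of the script):
DB_CONNECTION_STRING="postgresql://lab@172.20.8.34/lab" ENV_NAME="env-1" ./deploy_k8s_using_kargo.sh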
3.3.3.2. CCP deployment and configuration files¶
3.3.3.2.1. deploy-ccp.sh¶
#!/bin/bash
set -ex
if [ -z "$1" ]; then
echo "Please set number of env as argument"
exit 1
fi
DEPLOY_TIMEOUT=1200
export SSH_USER="root"
export SSH_PASS="r00tme"
cd $(dirname $(realpath $0))
NODE1="172.20.8.6${1}"
SSH_OPTS="-q -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no"
SSH_CMD="sshpass -p ${SSH_PASS} ssh ${SSH_OPTS} ${SSH_USER}@${NODE1}"
SCP_CMD="sshpass -p ${SSH_PASS} scp ${SSH_OPTS}"
if [ ! -d ./env-${1} ]; then
echo "Yaml files for env-${1} is not found"
echo "Please, create and commit deployment/ccp/rackspace/env-${1}/configs with correct yaml files"
echo "Main file should be deployment/ccp/rackspace/env-${1}/configs/ccp.yaml"
exit 1
fi
$SCP_CMD ./env-${1}/configs/ccp.yaml ${SSH_USER}@${NODE1}:/root/.ccp.yaml
for i in $(ls -1 ./env-${1}/configs/ | grep -v ccp.yaml ); do
$SCP_CMD ./env-${1}/configs/${i} ${SSH_USER}@${NODE1}:/root/
done
$SSH_CMD "rm -rf /root/fuel-ccp; cd /root; git clone https://git.openstack.org/openstack/fuel-ccp"
$SSH_CMD "apt-get -y install python-pip"
$SSH_CMD "/usr/bin/pip install --upgrade pip"
$SSH_CMD "/usr/bin/pip install /root/fuel-ccp/"
CCP_STATUS=$($SSH_CMD "/usr/local/bin/ccp status")
if [ -n "$CCP_STATUS" ]; then
echo "Active deployment was found"
echo "$CCP_STATUS"
echo "Please execute 'ccp cleanup' and 'rm -rf /var/lib/mysql/*' on the ${NODE1} manually"
exit 1
fi
$SSH_CMD "echo '172.20.8.6${1} cloudformation.ccp.external console.ccp.external identity.ccp.external object-store.ccp.external compute.ccp.external orchestration.ccp.external network.ccp.external image.ccp.external volume.ccp.external horizon.ccp.external' >> /etc/hosts"
# $SSH_CMD kubectl delete configmaps traefik-conf -n kube-system
# $SSH_CMD kubectl delete service traefik -n kube-system
# $SSH_CMD kubectl delete secret traefik-cert -n kube-system
# $SSH_CMD kubectl delete deployment traefik -n kube-system
$SSH_CMD "/root/fuel-ccp/tools/ingress/deploy-ingress-controller.sh -i 172.20.8.6${1}" || echo "Already configured"
$SSH_CMD "echo 172.20.8.6${1} \$(ccp domains list -f value) >> /etc/hosts"
$SSH_CMD "openssl s_client -status -connect identity.ccp.external:8443 < /dev/null 2>&1 | awk 'BEGIN {pr=0;} /-----BEGIN CERTIFICATE-----/ {pr=1;} {if (pr) print;} /-----END CERTIFICATE-----/ {exit;}' >> /usr/local/lib/python2.7/dist-packages/requests/cacert.pem"
$SSH_CMD "openssl s_client -status -connect identity.ccp.external:8443 < /dev/null 2>&1 | awk 'BEGIN {pr=0;} /-----BEGIN CERTIFICATE-----/ {pr=1;} {if (pr) print;} /-----END CERTIFICATE-----/ {exit;}' > /usr/share/ca-certificates/ingress.crt"
$SSH_CMD "cp /usr/share/ca-certificates/ingress.crt /usr/local/share/ca-certificates/"
$SSH_CMD "update-ca-certificates"
if [ $($SSH_CMD "curl -s 'https://identity.ccp.external:8443/' > /dev/null; echo \$?") != 0 ]
then
echo "keystone is unreachable check https://identity.ccp.external:8443"
exit 1
fi
#$SSH_CMD "/root/fuel-ccp/tools/registry/deploy-registry.sh" &&
$SSH_CMD "/usr/local/bin/ccp fetch"
$SSH_CMD "/usr/local/bin/ccp build"
$SSH_CMD "/usr/local/bin/ccp deploy"
DEPLOY_TIME=0
while [ "$($SSH_CMD '/usr/local/bin/ccp status -s -f value' 2>/dev/null)" != "ok" ]
do
sleep 5
DEPLOY_TIME=$((${DEPLOY_TIME} + 5))
if [ $DEPLOY_TIME -ge $DEPLOY_TIMEOUT ]; then
echo "Deployment timeout"
exit 1
fi
done
$SSH_CMD "/usr/local/bin/ccp status"
3.3.3.2.2. ccp.yaml¶
builder:
push: true
no_cache: false
registry:
address: "172.20.8.35:5000/env-1"
repositories:
skip_empty: True
kubernetes:
server: http://172.20.9.234:8080
---
!include
- versions.yaml
- topology.yaml
- configs.yaml
- repos.yaml
3.3.3.2.3. configs.yaml¶
configs:
private_interface: p1p1.602
public_interface: p1p1.602
ingress:
enabled: true
glance:
bootstrap:
enable: true
# nova:
# allocation_ratio:
# cpu: 16.0
neutron:
physnets:
- name: "physnet1"
bridge_name: "br-ex"
interface: "p1p1.649"
flat: true
vlan_range: false
bootstrap:
internal:
enable: true
external:
enable: true
net_name: ext-net
subnet_name: ext-subnet
physnet: physnet1
network: 10.144.0.0/12
gateway: 10.144.0.1
nameserver: 10.144.0.1
pool:
start: 10.144.1.0
end: 10.159.255.250
keystone:
debug: true
heat:
debug: true
memcached:
ram: 30720
3.3.3.2.4. topology.yaml¶
nodes:
# node[1-3]: Kubernetes
node([4-6])$: # 4-6
roles:
- controller
- openvswitch
node[7-9]$: # 7-9
roles:
- rabbitmq
node10$: # 10
roles:
- galera
node11$: # 11
roles:
- heat
node(1[2-9])$: # 12-19
roles:
- compute
- openvswitch
node[2-9][0-9]$: # 20-99
roles:
- compute
- openvswitch
node(1[0-9][0-9])$: # 100-199
roles:
- compute
- openvswitch
node200$:
roles:
- backup
replicas:
glance-api: 1
glance-registry: 1
keystone: 3
nova-api: 3
nova-scheduler: 3
nova-conductor: 3
neutron-server: 3
neutron-metadata-agent: 3
horizon: 3
heat-api: 1
heat-api-cfn: 1
heat-engine: 1
roles:
galera:
- galera
rabbitmq:
- rabbitmq
controller:
- etcd
- glance-api
- glance-registry
- horizon
- keystone
- memcached
- neutron-dhcp-agent
- neutron-l3-agent
- neutron-metadata-agent
- neutron-server
- nova-api
- nova-conductor
- nova-consoleauth
- nova-novncproxy
- nova-scheduler
compute:
- nova-compute
- nova-libvirt
openvswitch:
- neutron-openvswitch-agent
- openvswitch-db
- openvswitch-vswitchd
backup:
- backup
heat:
- heat-api
- heat-api-cfn
- heat-engine
3.3.3.2.5. repos.yaml¶
repositories:
repos:
- git_url: https://git.openstack.org/openstack/fuel-ccp-ceph
name: fuel-ccp-ceph
- git_url: https://git.openstack.org/openstack/fuel-ccp-cinder
name: fuel-ccp-cinder
- git_url: https://git.openstack.org/openstack/fuel-ccp-debian-base
name: fuel-ccp-debian-base
- git_url: https://git.openstack.org/openstack/fuel-ccp-entrypoint
name: fuel-ccp-entrypoint
- git_url: https://git.openstack.org/openstack/fuel-ccp-etcd
name: fuel-ccp-etcd
- git_url: https://git.openstack.org/openstack/fuel-ccp-glance
name: fuel-ccp-glance
- git_url: https://git.openstack.org/openstack/fuel-ccp-heat
name: fuel-ccp-heat
- git_url: https://git.openstack.org/openstack/fuel-ccp-horizon
name: fuel-ccp-horizon
# - git_url: https://git.openstack.org/openstack/fuel-ccp-ironic
# name: fuel-ccp-ironic
- git_url: https://git.openstack.org/openstack/fuel-ccp-keystone
name: fuel-ccp-keystone
# - git_url: https://git.openstack.org/openstack/fuel-ccp-mariadb
# name: fuel-ccp-mariadb
- git_url: https://git.openstack.org/openstack/fuel-ccp-galera
name: fuel-ccp-galera
- git_url: https://git.openstack.org/openstack/fuel-ccp-memcached
name: fuel-ccp-memcached
# - git_url: https://git.openstack.org/openstack/fuel-ccp-murano
# name: fuel-ccp-murano
- git_url: https://git.openstack.org/openstack/fuel-ccp-neutron
name: fuel-ccp-neutron
- git_url: https://git.openstack.org/openstack/fuel-ccp-nova
name: fuel-ccp-nova
- git_url: https://git.openstack.org/openstack/fuel-ccp-openstack-base
name: fuel-ccp-openstack-base
- git_url: https://git.openstack.org/openstack/fuel-ccp-rabbitmq
name: fuel-ccp-rabbitmq
# - git_url: https://git.openstack.org/openstack/fuel-ccp-sahara
# name: fuel-ccp-sahara
# - git_url: https://git.openstack.org/openstack/fuel-ccp-searchlight
# name: fuel-ccp-searchlight
# - git_url: https://git.openstack.org/openstack/fuel-ccp-stacklight
# name: fuel-ccp-stacklight
3.3.3.2.6. versions.yaml¶
images:
tag: newton
# image_specs:
# keystone:
# tag: newton
# horizon:
# tag: newton
# nova-upgrade:
# tag: newton
# nova-api:
# tag: newton
# nova-conductor:
# tag: newton
# nova-consoleauth:
# tag: newton
# nova-novncproxy:
# tag: newton
# nova-scheduler:
# tag: newton
# nova-compute:
# tag: newton
# nova-libvirt:
# tag: newton
# neutron-dhcp-agent:
# tag: newton
# neutron-l3-agent:
# tag: newton
# neutron-metadata-agent:
# tag: newton
# neutron-server:
# tag: newton
# neutron-openvswitch-agent:
# tag: newton
# glance-api:
# tag: newton
# glance-registry:
# tag: newton
# glance-upgrade:
# tag: newton
sources:
openstack/cinder:
git_ref: stable/newton
git_url: https://github.com/openstack/cinder.git
openstack/glance:
git_ref: stable/newton
git_url: https://github.com/openstack/glance.git
openstack/heat:
git_ref: stable/newton
git_url: https://github.com/openstack/heat.git
openstack/horizon:
git_ref: stable/newton
git_url: https://github.com/openstack/horizon.git
openstack/keystone:
git_ref: stable/newton
git_url: https://github.com/openstack/keystone.git
openstack/neutron:
git_ref: stable/newton
git_url: https://github.com/openstack/neutron.git
openstack/nova:
git_ref: stable/newton
git_url: https://github.com/openstack/nova.git
openstack/requirements:
git_ref: stable/newton
git_url: https://git.openstack.org/openstack/requirements.git
openstack/sahara-dashboard:
git_ref: stable/newton
git_url: https://git.openstack.org/openstack/sahara-dashboard.git
3.3.3.3. Prometheus deployment and configuration files¶
3.3.3.3.1. Deployment scripts¶
3.3.3.3.1.1. deploy_prometheus.sh¶
#!/bin/bash
ansible-playbook -i ./hosts ./deploy-graf-prom.yaml --tags "prometheus"
3.3.3.3.1.2. deploy-graf-prom.yaml¶
---
- hosts: common
remote_user: root
tasks:
- name: Install common packages
apt: name={{ item }} state=installed
with_items:
- python-pip
tags: [ 'always' ]
- name: Install docker for Ubuntu 14.04
apt: name=docker.io state=installed
when: ansible_distribution == 'Ubuntu' and ansible_distribution_version == '14.04'
tags: [ 'always' ]
- name: Install docker for Ubuntu 16.04
apt: name=docker state=installed
when: ansible_distribution == 'Ubuntu' and ansible_distribution_version == '16.04'
tags: [ 'always' ]
- name: Install python deps
pip: name={{ item }}
with_items:
- docker-py
- docker-compose
tags: [ 'always' ]
- hosts: grafana
remote_user: root
vars:
postgresql_root_user: root
postgresql_root_password: aijoom1Shiex
grafana_postgresql_user: grafana
grafana_postgresql_password: sHskdhos6se
grafana_postgresql_db: grafana
grafana_user: admin
grafana_password: admin
tasks:
- name: Install packages for grafana
apt: name={{ item }} state=installed
with_items:
- postgresql-client-9.3
- python-psycopg2
- name: Create postgres data dir
file: path=/var/lib/postgres/data/db state=directory
tags: [ 'grafana' ]
- name: Run postgres in docker
docker_container:
name: postgres
image: 'postgres:latest'
ports: 5432:5432
volumes: '/var/lib/postgres/data:/var/lib/postgres/data'
env:
POSTGRES_USER: "{{ postgresql_root_user }}"
POSTGRES_PASSWORD: "{{ postgresql_root_password }}"
PGDATA: /var/lib/postgres/data/db
tags: [ 'grafana' ]
- name: Create DB for grafana
postgresql_db:
name: "{{ grafana_postgresql_db }}"
login_user: "{{ postgresql_root_user }}"
login_password: "{{ postgresql_root_password }}"
login_host: localhost
encoding: 'UTF-8'
tags: [ 'grafana' ]
- name: Create user for grafana in postgres
postgresql_user:
name: "{{ grafana_postgresql_user }}"
login_user: "{{ postgresql_root_user }}"
login_password: "{{ postgresql_root_password }}"
login_host: localhost
password: "{{ grafana_postgresql_password }}"
db: grafana
priv: ALL
tags: [ 'grafana' ]
- name: Create data dir for Grafana
file: path=/var/lib/grafana state=directory
tags: [ 'grafana' ]
- name: Start Grafana container
docker_container:
name: grafana
image: 'grafana/grafana:4.0.1'
volumes: '/var/lib/grafana:/var/lib/grafana'
ports: 3000:3000
env:
GF_SECURITY_ADMIN_PASSWORD: "{{ grafana_password }}"
GF_SECURITY_ADMIN_USER: "{{ grafana_user }}"
GF_DATABASE_TYPE: postgres
GF_DATABASE_HOST: "{{ ansible_default_ipv4.address }}"
GF_DATABASE_NAME: "{{ grafana_postgresql_db }}"
GF_DATABASE_USER: "{{ grafana_postgresql_user }}"
GF_DATABASE_PASSWORD: "{{ grafana_postgresql_password }}"
GF_INSTALL_PLUGINS: grafana-piechart-panel
tags: [ 'grafana' ]
- hosts: prometheuses
remote_user: root
tasks:
- name: Data dir for prometheus
file: path=/var/lib/prometheus state=directory
tags: [ 'prometheus' ]
- include: docker_prometheus.yaml
- hosts: prometheus-kuber
remote_user: root
tasks:
- name: Copy prometheus config
template: src=prometheus/prometheus-kuber.yml.j2 dest=/var/lib/prometheus/prometheus.yml
register: prometheus_yml
tags: [ 'prometheus', 'prometheus-conf' ]
- include: docker_prometheus.yaml
- name: Send kill -1 to prometheus if prometheus.yml changed
command: pkill -1 prometheus
when: prometheus_yml.changed
tags: [ 'prometheus', 'prometheus-conf']
- hosts: prometheus-system
remote_user: root
tasks:
- name: Copy prometheus config
template: src=prometheus/prometheus-system.yml.j2 dest=/var/lib/prometheus/prometheus.yml
register: prometheus_yml
tags: [ 'prometheus', 'prometheus-conf' ]
- include: docker_prometheus.yaml
- name: Send kill -1 to prometheus if prometheus.yml changed
command: pkill -1 prometheus
when: prometheus_yml.changed
tags: [ 'prometheus', 'prometheus-conf']
3.3.3.3.1.3. docker_prometheus.yaml¶
---
- name: Deploy prometheus in docker
docker_container:
name: prometheus
image: 'prom/prometheus:v1.4.0'
ports: 9090:9090
state: started
volumes: ['/var/lib/prometheus:/prometheus']
command: '-config.file=/prometheus/prometheus.yml -storage.local.retention 168h0m0s -storage.local.max-chunks-to-persist 3024288 -storage.local.memory-chunks=50502740 -storage.local.num-fingerprint-mutexes=300960'
tags: [ 'prometheus' ]
3.3.3.3.1.4. deploy_etcd_collect.sh¶
#!/bin/bash
CLUSTER=${1}
TMP_YAML=$(mktemp -u)
export ANSIBLE_HOST_KEY_CHECKING=False
export SSH_USER="root"
export SSH_PASS="r00tme"
cd $(dirname $(realpath $0))
ENV=${1}
if [ -z "${ENV}" ]; then
echo "Please provide env number $(basename $0) [1|2|3|4|5|6]"
exit 1
fi
PROMETHEUS_HOST="172.20.9.115"
KUBE_MAIN_NODE="172.20.8.6${ENV}"
CLUSTER_TAG="env-${ENV}"
ETCD=""
SSH_OPTS="-o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no"
TARGETS=$(sshpass -p ${SSH_PASS} ssh ${SSH_OPTS} ${SSH_USER}@${KUBE_MAIN_NODE} curl -ks https://127.0.0.1:2379/v2/members | python -m json.tool | grep 2379)
if [ -z "$TARGETS" ]; then
echo "No etcd found"
exit 1
fi
for i in ${TARGETS}; do
TEMP_TARGET=${i#\"https://}
ETCD="$ETCD ${TEMP_TARGET%\"}"
done
echo "- targets:" > ${TMP_YAML}
for i in ${ETCD}; do
echo " - $i" >> ${TMP_YAML}
done
echo " labels:" >> ${TMP_YAML}
echo " env: ${CLUSTER_TAG}" >> ${TMP_YAML}
echo "Targets file is ready"
cat ${TMP_YAML}
sshpass -p ${SSH_PASS} scp ${SSH_OPTS} ${TMP_YAML} root@${PROMETHEUS_HOST}:/var/lib/prometheus/etcd-env-${1}.yml
rm ${TMP_YAML}
3.3.3.3.2. Configuration files¶
3.3.3.3.2.1. prometheus-kuber.yml.j2¶
global:
scrape_interval: 15s # By default, scrape targets every 15 seconds.
evaluation_interval: 15s # By default, scrape targets every 15 seconds.
# Attach these labels to any time series or alerts when communicating with
# external systems (federation, remote storage, Alertmanager).
external_labels:
monitor: 'codelab-monitor'
rule_files:
# - "first.rules"
# - "second.rules"
scrape_configs:
- job_name: 'prometheus'
scrape_interval: 5s
scrape_timeout: 5s
# metrics_path defaults to '/metrics'
# scheme defaults to 'http'.
static_configs:
- targets: ['172.20.9.115:9090']
{% for env_num in range(1,7) %}
- job_name: 'k8-env-{{env_num}}'
scrape_interval: 30s
scrape_timeout: 30s
scheme: https
tls_config:
insecure_skip_verify: true
kubernetes_sd_configs:
- api_server: 'https://172.20.8.6{{env_num}}:443'
role: node
tls_config:
insecure_skip_verify: true
basic_auth:
username: kube
password: changeme
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- source_labels: [__address__]
target_label: env
regex: .*
replacement: env-{{env_num}}
- job_name: 'etcd-env-{{env_num}}'
scrape_interval: 5s
scrape_timeout: 5s
scheme: https
tls_config:
insecure_skip_verify: true
file_sd_configs:
- files:
- etcd-env-{{env_num}}.yml
{% endfor %}
3.3.3.3.2.2. prometheus-system.yml.j2¶
global:
scrape_interval: 15s # By default, scrape targets every 15 seconds.
evaluation_interval: 15s # By default, scrape targets every 15 seconds.
# Attach these labels to any time series or alerts when communicating with
# external systems (federation, remote storage, Alertmanager).
external_labels:
monitor: 'codelab-monitor'
rule_files:
# - "first.rules"
# - "second.rules"
scrape_configs:
- job_name: 'prometheus'
scrape_interval: 5s
scrape_timeout: 5s
# metrics_path defaults to '/metrics'
# scheme defaults to 'http'.
static_configs:
- targets: ['172.20.124.25:9090']
{% for env_num in range(1,7) %}
- job_name: 'telegraf-systems-env-{{env_num}}'
scrape_interval: 30s
scrape_timeout: 30s
file_sd_configs:
- files:
- targets-env-{{env_num}}.yml
{% endfor %}
3.3.3.3.2.3. targets.yml.j2¶
- targets:
{% for host in groups['all-cluster-nodes']%}
- {{hostvars[host]['inventory_hostname']}}:9126
{% endfor %}
labels:
env: {{ cluster_tag }}
3.3.3.4. Grafana dashboards configuration¶
3.3.3.5. ElasticSearch deployment script¶
3.3.3.5.1. deploy_elasticsearch_kibana.sh¶
#!/bin/bash -xe
HOSTNAME=`hostname`
ELASTICSEARCH_NODE=${ELASTICSEARCH_NODE:-172.20.9.3}
# install java
sudo add-apt-repository -y ppa:webupd8team/java
sudo apt-get update
sudo apt-get -y install oracle-java8-installer
# install elastic by adding extra repository
wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
echo "deb http://packages.elastic.co/elasticsearch/2.x/debian stable main" | sudo tee -a /etc/apt/sources.list.d/elasticsearch-2.x.list
sudo apt-get update
sudo apt-get -y install elasticsearch
# edit configuration:
sed -i -E -e 's/^.*cluster.name: .*$/ cluster.name: elasticsearch_k8s/g' /etc/elasticsearch/elasticsearch.yml
sed -i -E -e "s/^.*node.name: .*$/ cluster.name: ${HOSTNAME}/g" /etc/elasticsearch/elasticsearch.yml
sed -i -E -e "s/^.*network.host: .*$/ network.host: ${ELASTICSEARCH_NODE}/g" /etc/elasticsearch/elasticsearch.yml
# increase memory limits:
sed -i -E -e "s/^.*ES_HEAP_SIZE=.*$/ES_HEAP_SIZE=10g/g" /etc/default/elasticsearch
# start service:
sudo systemctl restart elasticsearch
sudo systemctl daemon-reload
sudo systemctl enable elasticsearch
# install kibana from extra repository:
echo "deb http://packages.elastic.co/kibana/4.5/debian stable main" | sudo tee -a /etc/apt/sources.list
sudo apt-get update
sudo apt-get -y install kibana
sed -i -E -e "s/^.*elasticsearch.url:.*$/ elasticsearch.url: \"http://${ELASTICSEARCH_NODE}:9200\"/g" /opt/kibana/config/kibana.yml
# enable kibana service:
sudo systemctl daemon-reload
sudo systemctl enable kibana
sudo systemctl start kibana
# install nginx:
sudo apt-get -y install nginx
# set kibana admin:password (admin:admin)
echo "admin:`openssl passwd admin`" | sudo tee -a /etc/nginx/htpasswd.users
# prepare nginx config:
cat << EOF >> /etc/nginx/sites-available/default
server {
listen 80;
server_name ${HOSTNAME};
auth_basic "Restricted Access";
auth_basic_user_file /etc/nginx/htpasswd.users;
location / {
proxy_pass http://localhost:5601;
proxy_http_version 1.1;
proxy_set_header Upgrade \$http_upgrade;
proxy_set_header Connection 'upgrade';
proxy_set_header Host \$host;
proxy_cache_bypass \$http_upgrade;
}
}
EOF
# check and start nginx service:
sudo nginx -t
sudo systemctl restart nginx
3.3.3.6. Telegraf deployment and configuration files¶
3.3.3.6.1. deploy_telegraf.sh¶
#!/bin/bash
set -e
export ANSIBLE_HOST_KEY_CHECKING=False
export SSH_USER="root"
export SSH_PASS="r00tme"
cd $(dirname $(realpath $0))
ENV=${1}
if [ -z "${ENV}" ]; then
echo "Please provide env number $(basename $0) [1|2|3|4|5|6]"
exit 1
fi
PROMETHEUS_NODE="172.20.124.25"
KUBE_MAIN_NODE="172.20.8.6${ENV}"
CLUSTER_TAG="env-${ENV}"
# Secret option
ANSIBLE_TAG=$2
SSH_OPTS="-q -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no"
echo "Get clusters nodes"
NODES_TMP=$(sshpass -p ${SSH_PASS} ssh ${SSH_OPTS} ${SSH_USER}@${KUBE_MAIN_NODE} 'kubectl get nodes -o jsonpath='"'"'{.items[*].status.addresses[?(@.type=="InternalIP")].address}'"'"'')
ALL_IP_ON_KUBER_NODE=$(sshpass -p ${SSH_PASS} ssh ${SSH_OPTS} ${SSH_USER}@${KUBE_MAIN_NODE} ip addr | grep 172.20 | awk '{print $2}' | awk -F'/' '{print $1}')
GREP_STRING_TMP=""
for i in $ALL_IP_ON_KUBER_NODE; do
GREP_STRING_TMP="${GREP_STRING_TMP}${i}|"
done
GREP_STRING=${GREP_STRING_TMP:0:-1}
SSH_AUTH="ansible_ssh_user=${SSH_USER} ansible_ssh_pass=${SSH_PASS}"
echo "[main]" > cluster-hosts
echo "${PROMETHEUS_NODE} ${SSH_AUTH}" >> cluster-hosts
echo "[main-kuber]" >> cluster-hosts
echo "${KUBE_MAIN_NODE} ${SSH_AUTH}" >> cluster-hosts
echo "[cluster-nodes]" >> cluster-hosts
set +e
# Remove IP of kuber node
for i in ${NODES_TMP} ; do
TMP_VAR=$(echo $i | grep -vE "(${GREP_STRING})")
NODES="${NODES} ${TMP_VAR}"
done
set -e
for i in ${NODES} ; do
if [ "$i" != "${KUBE_MAIN_NODE}" ]; then
echo "${i} ${SSH_AUTH}" >> cluster-hosts
fi
done
echo "[all-cluster-nodes:children]" >> cluster-hosts
echo "main-kuber" >> cluster-hosts
echo "cluster-nodes" >> cluster-hosts
LINES=$(wc -l cluster-hosts | awk '{print $1}')
NUM_NODES=$(($LINES - 7))
if [ ${NUM_NODES} -le 0 ]; then
echo "Something wrong, $NUM_NODES nodes found"
exit 1
else
echo "${NUM_NODES} nodes found"
fi
if [ -z "${ANSIBLE_TAG}" ]; then
ansible-playbook -f 40 -i ./cluster-hosts -e cluster_tag=${CLUSTER_TAG} ./deploy-telegraf.yaml
else
ansible-playbook -f 40 -i ./cluster-hosts -e cluster_tag=${CLUSTER_TAG} -t ${ANSIBLE_TAG} ./deploy-telegraf.yaml
fi
3.3.3.6.2. deploy-telegraf.yaml¶
---
- hosts: all-cluster-nodes
remote_user: root
tasks:
- name: Create user telegraf
user: name=telegraf home=/opt/telegraf
- name: Create /opt/telegraf
file: path=/opt/telegraf state=directory owner=telegraf
- name: Create bin dir for telegraf
file: path=/opt/telegraf/bin state=directory owner=telegraf
- name: Create etc dir for telegraf
file: path=/opt/telegraf/etc state=directory owner=telegraf
- name: Copy telegraf to server
copy: src=../../telegraf/opt/bin/telegraf dest=/opt/telegraf/bin/telegraf mode=0755
register: telegraf_bin
- name: Copy telegraf.service
copy: src=telegraf/telegraf.service dest=/etc/systemd/system/telegraf.service
register: telegraf_service
- name: Start and enable telegraf
systemd: state=started enabled=yes daemon_reload=yes name=telegraf
- name: Delete allmetrics.tmp.lock
file: path=/opt/telegraf/bin/data/allmetrics.tmp.lock state=absent
when: telegraf_service.changed or telegraf_bin.changed
- name: Restart telegraf if telegraf binary has been changed
systemd: state=restarted name=telegraf
when: telegraf_bin.changed
- name: Install software
apt: name={{ item }} state=installed
with_items:
- sysstat
- numactl
- name: Copy system metric scripts
copy: src=../../telegraf/opt/system_stats/{{ item }} dest=/opt/telegraf/bin/{{ item }} mode=0755
with_items:
- entropy.sh
- iostat_per_device.sh
- memory_bandwidth.sh
- numa_stat_per_pid.sh
- per_process_cpu_usage.sh
- list_openstack_processes.sh
- network_tcp_queue.sh
- name: Copy pcm-memory-one-line.x
copy: src=../../telegraf/opt/system_stats/intel_pcm_mem/pcm-memory-one-line.x dest=/opt/telegraf/bin/pcm-memory-one-line.x mode=0755
- name: Add sysctl for pcm
sysctl: name=kernel.nmi_watchdog value=0 state=present reload=yes
- name: Load kernel module msr
modprobe: name=msr state=present
- name: Add module autoload
lineinfile: dest=/etc/modules line='msr'
- name: Add user telegraf to sudoers
lineinfile:
dest: /etc/sudoers
state: present
line: "telegraf ALL=(ALL) NOPASSWD: ALL"
- hosts: cluster-nodes
remote_user: root
tasks:
- name: Copy telegraf config
copy: src=./telegraf/telegraf-sys.conf dest=/opt/telegraf/etc/telegraf.conf
register: telegraf_conf
- name: Restart telegraf if config has been changed
systemd: state=restarted name=telegraf
when: telegraf_conf.changed
- hosts: main-kuber
remote_user: root
tasks:
- name: Copy openstack scripts
copy: src=../../telegraf/opt/osapi/{{ item }} dest=/opt/telegraf/bin/{{ item }} mode=0755
with_items:
- glog.sh
- osapitime.sh
- vmtime.sh
tags: [ 'openstack' ]
- name: Copy etcd scripts
copy: src=../../telegraf/opt/k8s_etcd/{{ item }} dest=/opt/telegraf/bin/{{ item }} mode=0755
with_items:
- etcd_get_metrics.sh
- k8s_get_metrics.sh
- name: Install software for scripts
apt: name={{ item }} state=installed
with_items:
- mysql-client
- bc
- jq
tags: [ 'openstack' ]
- name: Create dirs for scripts
file: path=/opt/telegraf/bin/{{ item }} state=directory owner=telegraf
with_items:
- log
- data
- name: Copy telegraf config
template: src=telegraf/telegraf-openstack.conf.j2 dest=/opt/telegraf/etc/telegraf.conf
register: telegraf_conf
tags: [ 'openstack' ]
- name: Delete allmetrics.tmp.lock
file: path=/opt/telegraf/bin/data/allmetrics.tmp.lock state=absent
when: telegraf_conf.changed
- name: Restart telegraf if config has been changed
systemd: state=restarted name=telegraf
when: telegraf_conf.changed
tags: [ 'openstack' ]
- hosts: all-cluster-nodes
remote_user: root
tasks:
- name: Reload telegraf if the service file has been changed
systemd: daemon_reload=yes state=reloaded name=telegraf
when: telegraf_service.changed
- hosts: main
remote_user: root
tasks:
- name: update prometheus config
template: src=./prometheus/targets.yml.j2 dest=/var/lib/prometheus/targets-{{ cluster_tag }}.yml
tags: [ 'prometheus' ]
3.3.3.6.3. Telegraf system¶
3.3.3.6.3.1. telegraf-sys.conf¶
[global_tags]
metrics_source="system"
[agent]
interval = "10s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "15s"
flush_jitter = "5s"
precision = ""
debug = false
quiet = false
hostname = ""
omit_hostname = false
[[outputs.prometheus_client]]
listen = ":9126"
[[inputs.cpu]]
percpu = true
totalcpu = true
fielddrop = ["time_*"]
[[inputs.disk]]
ignore_fs = ["tmpfs", "devtmpfs"]
[[inputs.diskio]]
[[inputs.kernel]]
[[inputs.mem]]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]
[[inputs.kernel_vmstat]]
[[inputs.net]]
[[inputs.netstat]]
[[inputs.exec]]
interval = "15s"
commands = [
"/opt/telegraf/bin/iostat_per_device.sh"
]
timeout = "30s"
data_format = "influx"
[[inputs.exec]]
interval = "15s"
commands = [
"/opt/telegraf/bin/per_process_cpu_usage.sh"
]
timeout = "30s"
data_format = "influx"
[[inputs.exec]]
interval = "15s"
commands = [
"/opt/telegraf/bin/entropy.sh"
]
timeout = "30s"
data_format = "influx"
[[inputs.exec]]
interval = "60s"
commands = [
"/opt/telegraf/bin/numa_stat_per_pid.sh"
]
timeout = "60s"
data_format = "influx"
[[inputs.exec]]
interval = "15s"
commands = [
"/opt/telegraf/bin/memory_bandwidth.sh"
]
timeout = "30s"
data_format = "influx"
[[inputs.exec]]
interval = "15s"
commands = [
"/opt/telegraf/bin/list_openstack_processes.sh"
]
timeout = "30s"
data_format = "influx"
[[inputs.exec]]
interval = "15s"
commands = [
"/opt/telegraf/bin/network_tcp_queue.sh"
]
timeout = "30s"
data_format = "influx"
3.3.3.6.4. Telegraf openstack¶
3.3.3.6.4.1. telegraf-openstack.conf.j2¶
[global_tags]
metrics_source="system_openstack"
[agent]
interval = "10s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "15s"
flush_jitter = "5s"
precision = ""
debug = false
quiet = false
hostname = ""
omit_hostname = false
[[outputs.prometheus_client]]
listen = ":9126"
[[inputs.cpu]]
percpu = true
totalcpu = true
fielddrop = ["time_*"]
[[inputs.disk]]
ignore_fs = ["tmpfs", "devtmpfs"]
[[inputs.diskio]]
[[inputs.kernel]]
[[inputs.mem]]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]
[[inputs.kernel_vmstat]]
[[inputs.net]]
[[inputs.netstat]]
[[inputs.exec]]
interval = "15s"
commands = [
"/opt/telegraf/bin/vmtime.sh",
]
timeout = "30s"
data_format = "influx"
[[inputs.exec]]
interval = "30s"
commands = [
"/opt/telegraf/bin/osapitime.sh",
]
timeout = "60s"
data_format = "influx"
[[inputs.exec]]
interval = "15s"
commands = [
"/opt/telegraf/bin/etcd_get_metrics.sh"
]
timeout = "30s"
data_format = "influx"
[[inputs.exec]]
interval = "15s"
commands = [
"/opt/telegraf/bin/k8s_get_metrics.sh"
]
timeout = "30s"
data_format = "influx"
[[inputs.openstack]]
interval = '40s'
identity_endpoint = "http://keystone.ccp.svc.cluster.local:5000/v3"
domain = "default"
project = "admin"
username = "admin"
password = "password"
[[inputs.exec]]
interval = "15s"
commands = [
"/opt/telegraf/bin/iostat_per_device.sh"
]
timeout = "30s"
data_format = "influx"
[[inputs.exec]]
interval = "15s"
commands = [
"/opt/telegraf/bin/per_process_cpu_usage.sh"
]
timeout = "30s"
data_format = "influx"
[[inputs.exec]]
interval = "15s"
commands = [
"/opt/telegraf/bin/entropy.sh"
]
timeout = "30s"
data_format = "influx"
[[inputs.exec]]
interval = "60s"
commands = [
"/opt/telegraf/bin/numa_stat_per_pid.sh"
]
timeout = "60s"
data_format = "influx"
[[inputs.exec]]
interval = "15s"
commands = [
"/opt/telegraf/bin/memory_bandwidth.sh"
]
timeout = "30s"
data_format = "influx"
[[inputs.exec]]
interval = "15s"
commands = [
"/opt/telegraf/bin/list_openstack_processes.sh"
]
timeout = "30s"
data_format = "influx"
[[inputs.exec]]
interval = "15s"
commands = [
"/opt/telegraf/bin/network_tcp_queue.sh"
]
timeout = "30s"
data_format = "influx"
3.3.3.6.5. Telegraf inputs scripts¶
3.3.3.6.5.1. list_openstack_processes.sh¶
#!/bin/bash
export LANG=C
PS_ALL=$(ps --no-headers -A -o command | grep -vE '(sh|bash)')
M_NAME=system_openstack_list
MARIADB=$(echo "${PS_ALL}" | grep 'mariadb' | wc -l)
RABBITMQ=$(echo "${PS_ALL}" | grep 'rabbitmq' | wc -l)
KEYSTONE=$(echo "${PS_ALL}" | grep 'keystone' | wc -l)
GLANCE=$(echo "${PS_ALL}" | grep -E '(glance-api|glance-registry)' | wc -l)
CINDER=$(echo "${PS_ALL}" | grep 'cinder' | wc -l)
NOVA=$(echo "${PS_ALL}" | grep -E '(nova-api|nova-conductor|nova-consoleauth|nova-scheduler)' | wc -l)
NEUTRON=$(echo "${PS_ALL}" | grep -E '(neutron-server|neutron-metadata-agent|neutron-dhcp-agent|neutron-l3-agent|neutron-openvswitch-agent)' | wc -l)
OPENVSWITCH=$(echo "${PS_ALL}" | grep -E '(ovsdb-server|ovs-vswitchd|ovsdb-client)' | wc -l)
echo "${M_NAME} mariadb=${MARIADB},rabbitmq=${RABBITMQ},keystone=${KEYSTONE},glance=${GLANCE},cinder=${CINDER},nova=${NOVA},neutron=${NEUTRON},openvswitch=${OPENVSWITCH}"
3.3.3.6.5.2. per_process_cpu_usage.sh¶
#!/bin/bash
export LANG=C
for i in $(ps --no-headers -A -o pid); do
pidstat -p $i | tail -n 1 | grep -v PID | awk '{print "system_per_process_cpu_usage,process="$9" user="$4",system="$5}'
done
3.3.3.6.5.3. numa_stat_per_pid.sh¶
#!/bin/bash
set -o nounset # Treat unset variables as an error
#set -x
export LANG=C
if [ ! -d '/sys/devices/system/node' ]; then
# This host does not have NUMA
exit 44
fi
ALL_PROCESS="$(ps --no-headers -A -o pid,ucomm)"
for i in $(echo "${ALL_PROCESS}" | awk '{print $1}'); do
if [ -f "/proc/$i/numa_maps" ]; then
NUM_STAT=$(numastat -p $i)
PROC_NAME=$(echo "${ALL_PROCESS}" | grep -E "( $i |^$i )" | awk '{print $2}')
echo "${NUM_STAT}" | grep Huge | awk -v p=$i -v n=$PROC_NAME \
'{printf "system_numa_memory_per_pid,pid="p",name="n" memory_huge="$NF","}'
echo "${NUM_STAT}" | grep Heap | awk '{printf "memory_heap="$NF","}'
echo "${NUM_STAT}" | grep Stack | awk '{printf "memory_stack="$NF","}'
echo "${NUM_STAT}" | grep Private | awk '{print "memory_private="$NF}'
fi
done
3.3.3.6.5.4. iostat_per_device.sh¶
#!/bin/bash
# output from iostat -Ndx is
# Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
export LANG=C
iostat -Ndx | tail -n +4 | head -n -1 | awk '{print "system_per_device_iostat,device="$1" read_merge="$2",write_merge="$3",await="$10",read_await="$11",write_await="$12",util="$14",average_queue="$9}'
3.3.3.6.5.5. memory_bandwidth.sh¶
#!/bin/bash
# Output in MB/s
# echo 0 > /proc/sys/kernel/nmi_watchdog
# modprobe msr
export LANG=C
MEM_BW=$(sudo /opt/telegraf/bin/pcm-memory-one-line.x /csv 1 2>/dev/null | tail -n 1 | awk '{print $28}')
echo "system_memory bandwidth=${MEM_BW}"
3.3.3.6.5.6. network_tcp_queue.sh¶
#!/bin/bash
export LANG=C
IFS='
'
SUM_RESV_Q=0
SUM_SEND_Q=0
for i in $(netstat -4 -n); do
RESV_Q=$(echo $i | awk '{print $2}')
SEND_Q=$(echo $i | awk '{print $3}')
SUM_RESV_Q=$((${SUM_RESV_Q} + ${RESV_Q}))
SUM_SEND_Q=$((${SUM_SEND_Q} + ${SEND_Q}))
done
echo "system_tcp_queue sum_recv=${SUM_RESV_Q},sum_send=${SUM_SEND_Q}"
3.3.3.6.5.7. etcd_get_metrics.sh¶
#!/bin/bash -e
ETCD=/usr/local/bin/etcdctl
type jq >/dev/null 2>&1 || ( echo "Jq is not installed" ; exit 1 )
type curl >/dev/null 2>&1 || ( echo "Curl is not installed" ; exit 1 )
# get etcd member IDs and endpoints
MEMBERS="${ETCD} --endpoints https://127.0.0.1:2379 member list"
LEADER_ID=$(eval "$MEMBERS" | awk -F ':' '/isLeader=true/ {print $1}')
LEADER_ENDPOINT=$(eval "$MEMBERS" | awk '/isLeader=true/ {print $4}' | cut -d"=" -f2)
SLAVE_ID=$(eval "$MEMBERS" | grep 'isLeader=false' | head -n 1 | awk -F ":" '{print $1}')
SLAVE_ENDPOINT=$(eval "$MEMBERS" | grep 'isLeader=false' | head -n 1 | awk '{print $4}' | cut -d"=" -f2)
# member count:
metric_members_count=`curl -s -k https://172.20.9.15:2379/v2/members | jq -c '.members | length'`
metric_total_keys_count=`${ETCD} --endpoints https://127.0.0.1:2379 ls -r --sort | wc -l`
metric_total_size_dataset=`pidof etcd | xargs ps -o rss | awk '{rss=+$1} END {print rss}'`
metric_store_stats=`curl -s -k ${LEADER_ENDPOINT}/v2/stats/store| tr -d \"\{\} | sed -e 's/:/=/g'`
metric_latency_from_leader_avg=`curl -s -k ${LEADER_ENDPOINT}/v2/stats/leader | \
jq -c ".followers.\"${SLAVE_ID}\".latency.average"`
metric_leader_stats=`curl -s -k ${LEADER_ENDPOINT}/v2/stats/self | \
jq -c "{ sendBandwidthRate: .sendBandwidthRate, sendAppendRequestCnt: \
.sendAppendRequestCnt, sendPkgRate: .sendPkgRate }"| tr -d \"\{\} | sed -e 's/:/=/g'`
metric_slave_stats=`curl -s -k ${SLAVE_ENDPOINT}/v2/stats/self | \
jq -c "{ recvBandwidthRate: .recvBandwidthRate, recvAppendRequestCnt: \
.recvAppendRequestCnt, recvPkgRate: .recvPkgRate }"| tr -d \"\{\} | sed -e 's/:/=/g'`
cat << EOF
etcd_general_stats,group=etcd_cluster_metrics members_count=${metric_members_count},dataset_size=${metric_total_size_dataset},total_keys_count=${metric_total_keys_count}
etcd_leader_stats,group=etcd_cluster_metrics $metric_leader_stats
etcd_follower_stats,group=etcd_cluster_metrics ${metric_slave_stats},latency_from_leader_avg=${metric_latency_from_leader_avg}
etcd_store_stats,group=etcd_cluster_metrics $metric_store_stats
EOF
3.3.3.6.5.8. k8s_get_metrics.sh¶
#!/bin/bash -e
K8S_MASTER=127.0.0.1
if [[ $1 ]] ; then
K8S_MASTER=$1
fi
type jq >/dev/null 2>&1 || ( echo "Jq is not installed" ; exit 1 )
type curl >/dev/null 2>&1 || ( echo "Curl is not installed" ; exit 1 )
curl_get() {
url="https://${K8S_MASTER}$@"
curl -k -s -u kube:changeme $url || ( echo "Curl failed at: $url" 1>&2; exit 1 )
}
# gather output of frequent API calls into separate files (in order to avoid long timeouts):
node_file=`mktemp /tmp/XXXXX`
pods_file=`mktemp /tmp/XXXXX`
endpoints_file=`mktemp /tmp/XXXXX`
curl_get "/api/v1/nodes" > $node_file
curl_get "/api/v1/pods" > $pods_file
curl_get "/api/v1/endpoints" > $endpoints_file
# metrics retrieval:
number_of_namespaces_total=`curl_get "/api/v1/namespaces" | jq '[ .items[] .metadata.name ] | length'`
number_of_services_total=`curl_get "/api/v1/services" | jq -c '[ .items[] .metadata.name ] | length'`
number_of_nodes_total=`jq -c '[ .items[] .metadata.name ] | length' $node_file`
number_of_unsched=`jq -c '[ .items[] | select(.spec.unschedulable != null) .metadata.name ] | length' $node_file`
number_in_each_status=`jq -c '[ .items[] | .status.conditions[] | select(.type == "Ready") .status \
| gsub("(?<a>.+)"; "number_of_status_\(.a)" ) ] | group_by(.) | map({(.[0]): length}) | add ' $node_file \
| tr -d \"\{\} | sed -e 's/:/=/g'`
number_of_pods_total=`jq -c '[ .items[] .metadata.name ] | length' $pods_file`
number_of_pods_state_Pending=`jq -c '[ .items[] .status.phase | select(. == "Pending")] | length' $pods_file`
number_of_pods_state_Running=`jq -c '[ .items[] .status.phase | select(. == "Running")] | length' $pods_file`
number_of_pods_state_Succeeded=`jq -c '[ .items[] .status.phase | select(. == "Succeeded")] | length' $pods_file`
number_of_pods_state_Failed=`jq -c '[ .items[] .status.phase | select(. == "Failed")] | length' $pods_file`
number_of_pods_state_Unknown=`jq -c '[ .items[] .status.phase | select(. == "Unknown")] | length' $pods_file`
number_of_pods_per_node=`jq -c '[ .items[] | .spec.nodeName ] | group_by(.) | \
map("k8s_pods_per_node,group=k8s_cluster_metrics,pod_node=\(.[0]) value=\(length)")' $pods_file \
| sed -e 's/\["//g' -e 's/"\]//g' -e 's/","/\n/g'`
number_of_pods_per_ns=`jq -c '[ .items[] | .metadata.namespace ] | group_by(.) | \
map("k8s_pods_per_namespace,group=k8s_cluster_metrics,ns=\(.[0]) value=\(length)")' $pods_file \
| sed -e 's/\["//g' -e 's/"\]//g' -e 's/","/\n/g'`
number_of_endpoints_each_service=`jq -c '[ .items[] | { service: .metadata.name, endpoints: .subsets[] } | \
. as { service: $svc, endpoints: $endp } | $endp.addresses | length | . as $addr | $endp.ports | length | \
. as $prts | "k8s_services,group=k8s_cluster_metrics,service=\($svc) endpoints_number=\($addr * $prts)" ] ' $endpoints_file \
| sed -e 's/\["//g' -e 's/"\]//g' -e 's/","/\n/g'`
number_of_endpoints_total=`jq -c '[ .items[] | .subsets[] | { addrs: .addresses, ports: .ports } \
| map (length ) | .[0] * .[1] ] | add' $endpoints_file`
number_of_API_instances=`curl_get "/api/" | jq -c '.serverAddressByClientCIDRs | length'`
number_of_controllers=`curl_get "/api/v1/replicationcontrollers" | jq '.items | length'`
number_of_scheduler_instances=`curl_get /api/v1/namespaces/kube-system/pods?labelSelector='k8s-app=kube-scheduler' \
| jq -c '.items | length' `
cluster_resources_CPU=`jq -c '[ .items[] .status.capacity.cpu | tonumber ] | add' $node_file`
cluster_resources_RAM=`jq -c '[ .items[] .status.capacity.memory| gsub("[a-z]+$"; "" ; "i") | tonumber] | add' $node_file`
# output:
cat << EOF
k8s_nodes,group=k8s_cluster_metrics number_of_nodes_total=${number_of_nodes_total},number_of_unsched=${number_of_unsched}
k8s_nodes_states,group=k8s_cluster_metrics ${number_in_each_status}
k8s_namespaces,group=k8s_cluster_metrics number_of_namespaces_total=${number_of_namespaces_total}
k8s_pods,group=k8s_cluster_metrics number_of_pods_total=${number_of_pods_total}
k8s_pods_states,group=k8s_cluster_metrics number_of_pods_state_Pending=${number_of_pods_state_Pending},number_of_pods_state_Running=${number_of_pods_state_Running},number_of_pods_state_Succeeded=${number_of_pods_state_Succeeded},number_of_pods_state_Failed=${number_of_pods_state_Failed},number_of_pods_state_Unknown=${number_of_pods_state_Unknown}
${number_of_pods_per_node}
${number_of_pods_per_ns}
${number_of_endpoints_each_service}
k8s_services,group=k8s_cluster_metrics number_of_services_total=${number_of_services_total},number_of_endpoints_total=${number_of_endpoints_total}
k8s_number_of_API_instances,group=k8s_cluster_metrics value=${number_of_API_instances}
k8s_number_of_controllers,group=k8s_cluster_metrics value=${number_of_controllers}
k8s_number_of_scheduler_instances,group=k8s_cluster_metrics value=${number_of_scheduler_instances}
k8s_cluster_resources,group=k8s_cluster_metrics cpu_total=${cluster_resources_CPU},ram_total=${cluster_resources_RAM}
EOF
# cleanup
rm -f $node_file $pods_file $endpoints_file
3.3.3.6.5.9. vmtime.sh¶
#!/bin/bash
#
WORKDIR="$(cd "$(dirname ${0})" && pwd)"
SCRIPT="${WORKDIR}/$(basename ${0})"
MYSQLUSER="nova"
MYSQPASSWD="password"
MYSQLHOST="mariadb.ccp"
avgdata=$(mysql -u${MYSQLUSER} -p${MYSQPASSWD} -h ${MYSQLHOST} -D nova --skip-column-names --batch -e "select diff from (select avg(unix_timestamp(launched_at) - unix_timestamp(created_at)) as diff from instances where vm_state != 'error' and launched_at >= subtime(now(),'30')) t1 where diff IS NOT NULL;" 2>/dev/null | sed 's/\t/,/g';)
if [ ! -z "${avgdata}" ]; then
echo "vm_spawn_avg_time timediffinsec=${avgdata}"
fi
3.3.3.6.5.10. osapitime.sh¶
#!/bin/bash
# Variables declaration
WORKDIR="$(cd "$(dirname ${0})" && pwd)"
OS_LOG_PARSER="${WORKDIR}/glog.sh"
TMPDATADIR="${WORKDIR}/data"
TMP_METRICS="${TMPDATADIR}/allmetrics.tmp"
MODE="${MODE:-bg}"
SCRIPT_LOG_DIR="${WORKDIR}/logs"
SCRIPT_LOG_FILE="${SCRIPT_LOG_DIR}/run_results_$(date +%Y-%m-%d).log"
SCRIPT_LOG_LVL=2
K8S_NS="${K8S_NS:-ccp}"
declare -a OSCONTROLLER=(
'cinder-api:1,2,21'
'glance-api:1,2,22'
'heat-api:1,2,22'
'neutron-metadata-agent:1,2,17'
'neutron-server:1,2,22'
'nova-api:1,2,21'
'keystone:4,5,11'
)
declare -a OSCOMPUTE=(
'nova-compute:'
)
# create subfolder under working directory
function mk_dir()
{
local newdir="${TMPDATADIR}/${1}"
if [ ! -d "${newdir}" ]; then
mkdir -p ${newdir}
fi
}
# log function
function log()
{
local input
local dtstamp
input="$*"
dtstamp="$(date +%Y-%m-%d_%H%M%S)"
if [ ! -d "${SCRIPT_LOG_DIR}" ]; then
mkdir -p "${SCRIPT_LOG_DIR}"
fi
case "${SCRIPT_LOG_LVL}" in
3)
if [ ! -z "${input}" ]; then
echo "${dtstamp}: ${input}" | tee -a "${SCRIPT_LOG_FILE}"
fi
;;
2)
if [ ! -z "${input}" ]; then
echo "${dtstamp}: ${input}" >> "${SCRIPT_LOG_FILE}"
fi
;;
1)
if [ ! -z "${input}" ]; then
echo "${dtstamp}: ${input}"
fi
;;
*)
;;
esac
}
# get the log field list for a role, according to the predefined OSCONTROLLER & OSCOMPUTE arrays
function get_role()
{
local role
local input
local arr_name
local arr_name_fields
role=${1}
shift
input=$*
case ${role} in
"controller")
for i in $(seq 0 $(( ${#OSCONTROLLER[@]} - 1)))
do
arr_name=$(echo ${OSCONTROLLER[${i}]} | cut -d":" -f1)
arr_name_fields=$(echo ${OSCONTROLLER[${i}]} | cut -d":" -f2)
if [[ "${arr_name}" == "${input}" ]]; then
echo "${arr_name_fields}"
return 0
fi
done
;;
"compute")
for i in $(seq 0 $(( ${#OSCOMPUTE[@]} - 1)))
do
arr_name=$(echo ${OSCOMPUTE[${i}]} | cut -d":" -f1)
arr_name_fields=$(echo ${OSCOMPUTE[${i}]} | cut -d":" -f2)
if [ "${arr_name}" == "${input}" ]; then
echo "${arr_name_fields}"
return 0
fi
done
;;
esac
return 1
}
# diff in seconds
function tdiff()
{
local now
local datetime
local result
datetime="$(date -d "${1}" +%s)"
now="$(date +%s)"
result=$(( ${now} - ${datetime} ))
echo ${result}
}
# lock file function
function glock()
{
local action
local lockfile
local accessdate
local old_in_sec=120
action="${1}"
# lockfile="${TMP_METRICS}.lock"
lockfile="${TMPDATADIR}/allmetrics.tmp.lock"
if [[ "${action}" == "lock" && ! -e "${lockfile}" ]]; then
touch "${lockfile}"
elif [[ "${action}" == "lock" && -e "${lockfile}" ]]; then
accessdate="$(stat ${lockfile} | grep Modify | cut -d' ' -f2,3)"
if [ "$(tdiff "${accessdate}")" -ge "${old_in_sec}" ]; then
rm "${lockfile}"
touch "${lockfile}"
else
log "Lock file ${lockfile} exists!"
return 1
fi
else
rm "${lockfile}"
fi
return 0
}
# wait for parsers launched in background mode
function gatherchildren()
{
local childrencount
while true
do
childrencount=$(ps axf| grep ${OS_LOG_PARSER} | grep -v grep | wc -l)
if [ "${childrencount}" -eq 0 ]; then
return
fi
log "Children running ${childrencount}."
sleep 1
done
}
# list of running containers
function get_k8s_containers()
{
local cont_host
local cont_pod
local cont_name
local cont_id
local os_log_fields
local cont_tmp_dir
local _raw_data
glock "lock"
if [ "$?" -ne 0 ]; then exit 1;fi
#echo '[' > ${TMP_METRICS}
_raw_data="${TMPDATADIR}/._raw_data"
rm -rf ${_raw_data}
kubectl get pods -n "${K8S_NS}" -o 'go-template={{range .items}}{{if or (ne .status.phase "Succeeded") (eq .status.phase "Running")}}{{.spec.nodeName}},{{.metadata.name}},{{range .status.containerStatuses}}{{.name}},{{.containerID}}{{end}}{{"\n"}}{{end}}{{end}}' > ${_raw_data}
for data in $(cat ${_raw_data})
do
cont_host=$(echo ${data} | cut -d',' -f1)
cont_pod=$(echo ${data} | cut -d',' -f2)
cont_name=$(echo ${data} | cut -d',' -f3)
cont_id=$(echo ${data} | cut -d',' -f4 | sed 's|^docker://||')
cont_tmp_dir="${cont_host}_${cont_pod}_${cont_name}"
os_log_fields=$(get_role "controller" "${cont_name}")
if [ "$?" -eq 0 ]; then
mk_dir "${cont_tmp_dir}"
export K8S_NS=${K8S_NS}
export TMP_DIR=${TMPDATADIR}/${cont_tmp_dir}
# export TMP_METRICS=${TMP_METRICS}
export TMP_METRICS="${TMPDATADIR}/results/${cont_pod}.tmp"
export CONTID=${cont_id}
export CONTAINER=${cont_name}
export HOST=${cont_host}
export POD=${cont_pod}
export OS_LOG_FIELDS=${os_log_fields}
log "MODE=${MODE} CONTID=${cont_id} TMP_METRICS=${TMP_METRICS} ROLE=controller HOST=${cont_host} POD=${cont_pod} CONTAINER=${cont_name} OS_LOG_FIELDS=${os_log_fields} TMP_DIR=${TMPDATADIR}/${cont_tmp_dir} K8S_NS=${K8S_NS} ${OS_LOG_PARSER}"
if [[ "${MODE}" == "bg" ]]; then
log "${cont_pod} ${cont_name} ${cont_id}"
${OS_LOG_PARSER} &
else
${OS_LOG_PARSER}
fi
unset TMP_METRICS
unset CONTID
unset CONTAINER
unset POD
unset OS_LOG_FIELDS
unset HOST
fi
# os_log_fields=$(get_role "compute" "${cont_name}")
# if [ "$?" -eq 0 ]; then
# mk_dir "${cont_tmp_dir}"
# log "ROLE=compute HOST=${cont_host} POD=${cont_pod} CONTAINER=${cont_name} OS_LOG_FIELDS=${os_log_fields} TMP_DIR=${TMPDATADIR}/${cont_tmp_dir} K8S_NS=${K8S_NS} ${OS_LOG_PARSER}"
# fi
done
gatherchildren
if [ "$(ls ${TMPDATADIR}/results/ | wc -l)" -gt 0 ]; then
cat ${TMPDATADIR}/results/*.tmp
log "Resulting lines $(cat ${TMPDATADIR}/results/*.tmp | wc -l)"
rm -rf ${TMPDATADIR}/results/*
fi
glock "unlock"
}
# Main logic
mk_dir
mk_dir "results"
get_k8s_containers
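The collector appears intended to run periodically rather than as a daemon (note the date-stamped log file and the 120-second lock-file staleness check). A minimal scheduling sketch, assuming the script above is installed as /usr/local/bin/osmetrics_collector.sh (a hypothetical path) and using MODE=bg to launch the per-container parsers in the background:
# Hypothetical /etc/cron.d entry: run the collector every minute with background parsing
* * * * * root MODE=bg /usr/local/bin/osmetrics_collector.sh >> /var/log/osmetrics_cron.log 2>&1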
3.3.3.7. Heka deployment and configuration¶
3.3.3.7.1. Deployment¶
3.3.3.7.1.1. deploy_heka.sh¶
#!/bin/bash
set -e
export ANSIBLE_HOST_KEY_CHECKING=False
export SSH_USER="root"
export SSH_PASS="r00tme"
cd $(dirname $(realpath $0))
ENV=${1}
if [ -z "${ENV}" ]; then
echo "Please provide env number $(basename $0) [1|2|3|4|5|6]"
exit 1
fi
# Elasticsearch node for the k8s cluster at Rackspace as default
ELASTICSEARCH_NODE=${ELASTICSEARCH_NODE:-172.20.9.3}
# heka 0.10.0 as default
HEKA_PACKAGE_URL=${HEKA_PACKAGE_URL:-https://github.com/mozilla-services/heka/releases/download/v0.10.0/heka_0.10.0_amd64.deb}
KUBE_MAIN_NODE="172.20.8.6${ENV}"
SSH_OPTS="-q -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no"
echo "Get clusters nodes ..."
NODES_TMP=$(sshpass -p ${SSH_PASS} ssh ${SSH_OPTS} ${SSH_USER}@${KUBE_MAIN_NODE} 'kubectl get nodes -o jsonpath='"'"'{.items[*].status.addresses[?(@.type=="InternalIP")].address}'"'"'')
ALL_IP_ON_KUBER_NODE=$(sshpass -p ${SSH_PASS} ssh ${SSH_OPTS} ${SSH_USER}@${KUBE_MAIN_NODE} ip addr | grep 172.20 | awk '{print $2}' | awk -F'/' '{print $1}')
GREP_STRING_TMP=""
for i in $ALL_IP_ON_KUBER_NODE; do
GREP_STRING_TMP="${GREP_STRING_TMP}${i}|"
done
GREP_STRING=${GREP_STRING_TMP:0:-1}
SSH_AUTH="ansible_ssh_user=${SSH_USER} ansible_ssh_pass=${SSH_PASS}"
echo "[main-kuber]" > cluster-hosts
echo "${KUBE_MAIN_NODE} ${SSH_AUTH}" >> cluster-hosts
echo "[cluster-nodes]" >> cluster-hosts
set +e
# Filter out the kuber node's own IPs
for i in ${NODES_TMP} ; do
TMP_VAR=$(echo $i | grep -vE "(${GREP_STRING})")
NODES="${NODES} ${TMP_VAR}"
done
set -e
for i in ${NODES} ; do
if [ "$i" != "${KUBE_MAIN_NODE}" ]; then
echo "${i} ${SSH_AUTH}" >> cluster-hosts
fi
done
echo "[all-cluster-nodes:children]" >> cluster-hosts
echo "main-kuber" >> cluster-hosts
echo "cluster-nodes" >> cluster-hosts
# Calculate the number of parallel Ansible forks
NODES_IPS=( $NODES )
if [[ "${#NODES_IPS[@]}" -lt 50 ]] && [[ "${#NODES_IPS[@]}" -gt 5 ]]; then
ANSIBLE_FORKS="${#NODES_IPS[@]}"
elif [[ "${#NODES_IPS[@]}" -ge 50 ]]; then
ANSIBLE_FORKS=50
else
ANSIBLE_FORKS=10
fi
echo "Starting ansible ..."
ansible-playbook -v --ssh-extra-args "-o\ StrictHostKeyChecking=no" -f ${ANSIBLE_FORKS} -i ./cluster-hosts -e env_num=${ENV} -e elasticsearch_node="${ELASTICSEARCH_NODE}" -e heka_package_url=${HEKA_PACKAGE_URL} ./deploy-heka.yaml --diff
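deploy_heka.sh takes the environment number as its single argument, while ELASTICSEARCH_NODE and HEKA_PACKAGE_URL may be overridden via the environment. A usage sketch (the Elasticsearch IP below is illustrative):
# Deploy Heka to environment 3 against a non-default Elasticsearch node
ELASTICSEARCH_NODE=172.20.9.10 ./deploy_heka.sh 3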
3.3.3.7.1.2. deploy-heka.yaml¶
---
- hosts: main-kuber
remote_user: root
tasks:
- name: Fetch heka package
get_url:
url: "{{ heka_package_url }}"
dest: /tmp/heka_amd64.deb
mode: 0664
force: yes
- name: Download heka package locally
fetch:
src: /tmp/heka_amd64.deb
dest: ./heka_amd64.deb
fail_on_missing: yes
flat: yes
- hosts: cluster-nodes
remote_user: root
tasks:
- name: Propagate heka package across cluster nodes
copy:
src: ./heka_amd64.deb
dest: /tmp/heka_amd64.deb
- hosts: all-cluster-nodes
remote_user: root
tasks:
- name: Install heka package
apt: deb=/tmp/heka_amd64.deb
- name: Adding heka user to docker group
user: name='heka' groups=docker append=yes
- name: Copy heka conf
template: src=heka/00-hekad.toml.j2 dest=/etc/heka/conf.d/00-hekad.toml
notify: restart heka
- name: Copy heka lua scripts
template: src=heka/kubeapi_to_int.lua.j2 dest=/usr/share/heka/lua_filters/kubeapi_to_int.lua
register: heka_lua
notify: restart heka
- name: ensure heka is running
systemd: state=started name=heka enabled=yes
handlers:
- name: restart heka
systemd: state=restarted name=heka
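Once the playbook has run, the heka unit can be spot-checked with an Ansible ad-hoc command against the cluster-hosts inventory generated by deploy_heka.sh; a minimal sketch:
# Verify hekad is active on every node from the generated inventory
ansible all-cluster-nodes -i ./cluster-hosts -m shell -a "systemctl is-active heka"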
3.3.3.7.2. Configuration¶
3.3.3.7.2.1. 00-hekad.toml.j2¶
# vim: set syntax=yaml
[hekad]
maxprocs = 2
[DockerLogInput]
endpoint = "unix:///var/run/docker.sock"
#decoder = "KubeAPI_decoder"
decoder = "MultiDecoder"
[MultiDecoder]
type = "MultiDecoder"
subs = ["KubeAPI_decoder", "EnvironmentScribbler"]
cascade_strategy = "all"
#log_sub_errors = true
{% raw %}
[KubeAPI_decoder]
type = "PayloadRegexDecoder"
match_regex = '\S+ \S+ .+ (?P<Code>\S+)\] (?P<Method>[A-Z]+) (?P<Url>\S+)\: \((?P<ResponseTime>\S+)ms\) (?P<StatusCode>\d+) \[\[(?P<Agent>.+)\] (?P<RemoteIP>\S+)\:(?P<RemotePort>\d+)\]'
[KubeAPI_decoder.message_fields]
Type = "KubeAPIlog"
Logger = "Docker"
Code = "%Code%"
Method = "%Method%"
Url|uri = "%Url%"
ResponseTime = "%ResponseTime%"
StatusCode = "%StatusCode%"
Agent = "%Agent%"
RemoteIP|ipv4 = "%RemoteIP%"
RemotePort = "%RemotePort%"
{% endraw %}
[EnvironmentScribbler]
type = "ScribbleDecoder"
[EnvironmentScribbler.message_fields]
Environment = "env-{{ env_num }}"
[KubeAPI_to_int]
type = "SandboxFilter"
filename = "lua_filters/kubeapi_to_int.lua"
message_matcher = "Type == 'KubeAPIlog'"
[ESJsonEncoder]
index = "env-{{ env_num }}-{{ '%{Type}-%{%Y.%m.%d}' }}"
#es_index_from_timestamp = true
type_name = "%{Type}"
[ElasticSearchOutput]
message_matcher = "Type == 'heka.sandbox.KubeAPIlog' || Type == 'DockerLog'"
server = "http://{{ elasticsearch_node }}:9200"
flush_interval = 5000
flush_count = 10
encoder = "ESJsonEncoder"
[PayloadEncoder]
append_newlines = false
#
[LogOutput]
<<<<<<< HEAD
#message_matcher = "Type == 'KubeAPIlog'"
message_matcher = "TRUE"
#encoder = "ESJsonEncoder"
encoder = "PayloadEncoder"
=======
message_matcher = "Type == 'heka.sandbox.KubeAPIlog' || Type == 'DockerLog'"
#message_matcher = "TRUE"
encoder = "ESJsonEncoder"
#encoder = "PayloadEncoder"
>>>>>>> b0caa3ceb82399dd16465645eebdebf90242662c
3.3.3.7.2.2. kubeapi_to_int.lua.j2¶
{% raw %}
-- Convert ResponseTime and some other fields to integer type
local fields = {["ResponseTime"] = 0, ["RemotePort"] = 0, ["StatusCode"] = 0}
local msg = {
Type = "KubeAPIlog",
Severity = 6,
Fields = fields
}
function process_message ()
fields["ResponseTime"] = tonumber(read_message("Fields[ResponseTime]"))
fields["RemotePort"] = tonumber(read_message("Fields[RemotePort]"))
fields["StatusCode"] = tonumber(read_message("Fields[StatusCode]"))
msg.Payload = read_message("Payload")
fields["Code"] = read_message("Fields[Code]")
fields["ContainerID"] = read_message("Fields[ContainerID]")
fields["ContainerName"] = read_message("Fields[ContainerName]")
fields["Environment"] = read_message("Fields[Environment]")
fields["Method"] = read_message("Fields[Method]")
fields["RemoteIP"] = read_message("Fields[RemoteIP]")
fields["Url"] = read_message("Fields[Url]")
local ok, err = pcall(inject_message, msg)
if not ok then
inject_payload("txt", "error", err)
end
return 0
end
{% endraw %}
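After Heka starts shipping data, the per-environment indices created by ESJsonEncoder can be listed directly on the Elasticsearch node; a hedged check, assuming environment 1 and the default node from deploy_heka.sh:
# List indices written for environment 1 (index prefix per ESJsonEncoder)
curl -s "http://172.20.9.3:9200/_cat/indices/env-1-*?v"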