Linux Troubleshooting: 2010

Wednesday, November 10, 2010

CIFS VFS: cifs_mount failed w/return code = -101

CIFS VFS: Error connecting to socket. Aborting operation
CIFS VFS: cifs_mount failed w/return code = -101

Wednesday, November 3, 2010

how to redirect strace to text file

#strace -o filename command

For example:

strace -o strace1.txt  ls

Tuesday, November 2, 2010

read and edit initrd image contents

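1. Make a temporary directory

#mkdir /tmp/initrd

2. Copy the initrd image to the temporary directory

#cp /source/directory/initrd-version.img /tmp/initrd

3. Change to that directory, then decompress the image and extract it with cpio (here the image is initrd-2.6.18-164.6.1.el5.img)

#cd /tmp/initrd
#gzip -dc initrd-2.6.18-164.6.1.el5.img | cpio -ivd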

Monday, November 1, 2010

blocked FC remote port time out

The logs show 'blocked FC remote port time out' errors.

Tuesday, October 26, 2010

SIOCADDRT: No such device or SIOCADDRT: Network is unreachable

1. Check 'ifconfig ethX' to see whether an IP address has been assigned to the interface. If not, assign one.

Wednesday, October 20, 2010

Write protecting the kernel read-only data

This was caused by a bug in an earlier version. Update to the latest version, which should get rid of these messages.

Monday, October 11, 2010

upgrade from RHEL4 to RHEL5

Upgrading between major versions is not supported. Install the required version from scratch.

Sunday, September 5, 2010

linux bonding does not double bandwidth

For a 1 Gbit/s NIC, the theoretical maximum throughput is 125 MB/s (1,000,000,000 bits per second divided by 8).
In the real world you get less, because of protocol overhead (Ethernet, IP, TCP, and the application protocol such as scp or wget).
If you are currently using mode 0, packets can arrive out of order, and the resulting reordering reduces throughput.
There may be a slight improvement if you use mode 4.
With mode 4 the distribution of incoming traffic is controlled by the switch, so packet reordering is not an issue.
Even with mode 4, you cannot reach 2 Gbit/s for a single connection. The 2 Gbit/s of bandwidth is only available across multiple connections.

It is a common assumption that bonding multiple network cards doubles the bandwidth. That is not true for a single connection.
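The arithmetic behind the per-link limit can be sketched in shell. This is only an illustration, assuming decimal units (1 Mbit/s = 1,000,000 bits per second, 1 MB = 1,000,000 bytes); the function name link_mb_per_sec is made up for this example.

```shell
# Convert a link speed in Mbit/s to MB/s (8 bits per byte, decimal units).
# link_mb_per_sec is a hypothetical helper, not part of any bonding tool.
link_mb_per_sec() {
    echo $(( $1 / 8 ))
}

# One 1 Gbit/s link: the ceiling for any single connection.
link_mb_per_sec 1000

# Two bonded links: an aggregate figure, reachable only with multiple connections.
link_mb_per_sec 2000
```

These are raw line rates before protocol overhead, so real transfers will come in lower.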

Thursday, August 5, 2010

GFS2 lockdump analysis

G: s:EX n:2/1fda70 f:Dy t:EX d:UN/132055446000 l:0 a:0 r:5
H: s:EX f:H e:0 p:1405 [httpd] gfs2_write_begin+0x61/0x33e [gfs2]
H: s:EX f:W e:0 p:31819 [umount] gfs2_write_inode+0x57/0x152 [gfs2]

The content of the file is a series of lines. Each line starting with G: represents one glock, and the following lines, indented by a single space, each represent an item of information relating to the glock immediately before them in the file.

Lines in the debugfs file starting with H: (holders) represent lock requests that are either granted or waiting to be granted.

The flags field (f:) on a holder line shows which: the 'W' flag marks a waiting request and the 'H' flag marks a granted request.

The glocks which have large numbers of waiting requests are likely to be those which are experiencing particular contention.
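Counting the waiting holders per glock can be sketched with awk. This is a rough helper under stated assumptions: the function name count_glock_waiters is made up, the input is the glocks dump in the format shown above, and any holder whose f: field contains W is treated as waiting.

```shell
# Print "<waiters> <glock type/number>" for each glock that has waiting
# holders, busiest first. Reads the glocks dump from files given as
# arguments, or from stdin if none are given.
count_glock_waiters() {
    awk '
        $1 == "G:" {
            # Remember which glock the following H: lines belong to.
            glock = "unknown"
            for (i = 2; i <= NF; i++)
                if ($i ~ /^n:/) glock = substr($i, 3)
        }
        $1 == "H:" {
            # A holder is waiting when its flags field contains W.
            for (i = 2; i <= NF; i++)
                if ($i ~ /^f:.*W/) waiters[glock]++
        }
        END {
            for (g in waiters) printf "%d %s\n", waiters[g], g
        }
    ' "$@" | sort -rn
}

# Example (the path is an assumption: debugfs mounted at /sys/kernel/debug,
# with <fsname> standing for your filesystem's directory):
# count_glock_waiters /sys/kernel/debug/gfs2/<fsname>/glocks | head
```

For the three sample lines above it reports one waiter (the umount request) on glock 2/1fda70.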

Having identified a glock which is causing a problem, the next step is to find out which inode it relates to. The glock number (n: on the G: line) indicates this. It is of the form type/number and if type is 2, then the glock is an inode glock and the number is an inode number. To track down the inode, you can then run find -inum number where number is the inode number converted from the hex format in the glocks file into decimal.
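The hex-to-decimal step can be done with printf. A minimal sketch, where glock_inode_number is a made-up helper name and /mnt/gfs2 in the usage comment is only a placeholder mount point:

```shell
# Convert the n: value of an inode glock (type 2) to a decimal inode number.
# glock_inode_number is a hypothetical helper for illustration.
glock_inode_number() {
    case "$1" in
        2/*) printf '%d\n' "0x${1#2/}" ;;
        *)   echo "not an inode glock: $1" >&2; return 1 ;;
    esac
}

# Example with the glock from the dump above:
# find /mnt/gfs2 -inum "$(glock_inode_number 2/1fda70)"
```

For n:2/1fda70 this gives inode number 2087536.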

If the glock that was identified was of a different type, then it is most likely to be of type 3: (resource group). If you see significant numbers of processes waiting for other types of glock under normal loads, then please report this to Red Hat support.

If you do see a number of waiting requests queued on a resource group lock, there may be a number of reasons for this. One is that there are a large number of nodes compared to the number of resource groups in the filesystem. Another is that the filesystem may be very nearly full (requiring, on average, longer searches for free blocks). The situation in both cases can be improved by adding more storage and using the gfs2_grow command to expand the filesystem.

The W flag indicates that the request is waiting for a glock.
UN means the unlocked state.
SH means a shared lock.

Monday, August 2, 2010

du command usage

1. du -hs * (shows the total size of each file and directory in the current directory)
2. du * | sort -n (sorts the entries by size, so the largest are listed last)
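The two commands can be combined into a small helper that lists only the biggest entries. This is a sketch: largest_entries is a made-up name, sizes are in kilobytes (du -k), and hidden files are skipped because * does not match them.

```shell
# List the N largest entries (files or directories) directly under a
# directory, biggest first. Defaults: current directory, 10 entries.
largest_entries() {
    du -sk -- "${1:-.}"/* 2>/dev/null | sort -rn | head -n "${2:-10}"
}

# Example:
# largest_entries /var 5
```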

Friday, July 2, 2010

bash: fork: Resource temporarily unavailable

This error means bash could not fork a new process. A common cause is a build-up of defunct (zombie) processes that fills the process table or hits the per-user process limit, so no slot can be allocated for the newly created process. The general life cycle of a process is as follows.

This is the normal life of a program.
fork()INIT->exec()->RUN->exit()->ZOMBIE->done

INIT (fork)
The program is started by a parent process, an action called fork(). The fork makes a copy (the child) of the calling process (the parent).

exec()
The child then issues an exec() system call, which replaces the new process with the intended executable file.

SRUN/URUN (system/user run space)
The new child program now runs. The parent is either waiting (in a SLEEP) for the child to finish, or checks for the child's completion, or is notified by the system when the child exits.

exit()
The child exits and returns its resources (memory) to the system.

ZOMBIE
At this point the child has terminated and is a ZOMBIE. THIS IS NORMAL!! It stays in this state until the parent process collects its exit status (with wait()) or terminates.

If the parent process has died unexpectedly, process ID 1 (init) takes over as the child's parent and collects its exit status. If the parent is still alive but never acknowledges the child, the zombie stays in the process table.

So...
A zombie does not tie up memory but it still has a slot in the process table. I/O devices can get locked out.

You can't kill a ZOMBIE process because......IT'S ALREADY DEAD!!!

Finally, check for defunct processes on the server with the ps -dfa command (they are marked <defunct>). Since a zombie itself cannot be killed, kill or restart its parent process so that the zombie gets cleaned up. If this works, you can simply log in to the shell again; if not, you will need to reboot the server to clear the process table.
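To see which parents are leaving zombies behind, the ps output can be filtered on the state column. A small sketch, assuming the procps ps found on Linux, where the STAT column starts with Z for a zombie; filter_zombies is a made-up name.

```shell
# Keep the header line and every process whose state (3rd column) starts
# with Z. Expects the output of: ps -eo pid,ppid,stat,comm
filter_zombies() {
    awk 'NR == 1 || $3 ~ /^Z/'
}

# Example: the PPID column shows the parent that is not reaping its children.
# ps -eo pid,ppid,stat,comm | filter_zombies
```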

Tuesday, June 29, 2010

packet drops in Linux

Packet drops can happen at two layers: at the NIC level or at the network stack level.

Check 'ifconfig' output.

RX packets:297126179 errors:0 dropped:3981 overruns:0 frame:0
TX packets:233589229 errors:0 dropped:0 overruns:0 carrier:0

A non-zero 'dropped' count on the RX line means packets are being dropped at the NIC level. These drops are most likely caused by exhaustion of the RX ring buffer. Increase the size of the ethernet device ring buffer.

First inspect the output of "ethtool -g eth0". If the "Pre-set maximums" are higher than what is listed under "Current hardware settings", it is recommended to increase the current values. As an example:

# ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX: 1020
RX Mini: 0
RX Jumbo: 16320
TX: 255
Current hardware settings:
RX: 255
RX Mini: 0
RX Jumbo: 0
TX: 255

To increase the RX ring buffer to 1020 (the pre-set maximum shown above) you would run "ethtool -G eth0 rx 1020".
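Comparing the two RX values by eye is easy to get wrong, so here is a sketch that pulls them out of the "ethtool -g" output. The function name rx_ring_summary is made up, and it assumes the output layout shown above.

```shell
# Read "ethtool -g <iface>" output on stdin and print the pre-set maximum
# and the current RX ring size on one line.
rx_ring_summary() {
    awk '
        /^Pre-set maximums:/          { section = "max" }
        /^Current hardware settings:/ { section = "cur" }
        $1 == "RX:"                   { rx[section] = $2 }
        END { printf "max=%d current=%d\n", rx["max"], rx["cur"] }
    '
}

# Example:
# ethtool -g eth0 | rx_ring_summary
```

For the sample output above this prints max=1020 current=255, which shows there is room to raise the RX ring.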

Sunday, June 27, 2010

when we bring the node back , failback is not happening

When the poweroff command is given from a terminal, failover works, but when we bring the node back, failback does not happen?

Sunday, June 20, 2010

kernel panic due to openafs filesystem

Judging by the logs, the kernel panic occurred due to 'openafs' file system corruption.

Tuesday, June 15, 2010

configure guest virtual machines to use SAN storage using libvirt

Provision a new logical unit on iSCSI or fibre channel storage. Use virsh to trigger a scan for it, and confirm that it appears correctly.
To discover logical units on a particular HBA, create a pool for that HBA using:
virsh pool-create hbapool.xml
where hbapool.xml contains:


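<pool type='scsi'>
  <name>host6</name>
  <source>
    <adapter name='host6'/>
  </source>
  <target>
    <path>/dev/disk/by-id</path>
  </target>
</pool>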

Confirm that all the appropriate logical units are visible as volumes with:
virsh vol-list host6
After creating the pool, add a new logical unit on a target that's visible on that host and refresh the pool with:
virsh pool-refresh host6
and confirm that the new storage is visible. Note that the refresh code only scans for new LUs on existing targets and does not issue a LIP to discover new targets as that would be disruptive to I/O.

Thursday, June 10, 2010

backup and restore partition table

Backing up partition table.

dd if=/dev/device of=/path/to/backup bs=512 count=1

Restore

dd if=/path/to/backup of=/dev/device bs=512 count=1

You can inspect the backup with 'hexdump' to make sure the data is intact.

hexdump /path/to/backup
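The dd commands above copy the whole first sector, which on an MBR (msdos) disk holds 446 bytes of boot code, the 64-byte partition table and a 2-byte signature. The sketch below wraps the backup and adds a restore that writes back only the partition table. The function names are made up, and the offsets assume an MBR layout; this does not apply to GPT disks.

```shell
# Save the first 512-byte sector (boot code + partition table + signature).
# $1 = device (or image file), $2 = backup file
backup_mbr() {
    dd if="$1" of="$2" bs=512 count=1 2>/dev/null
}

# Restore only the 64-byte partition table at offset 446, leaving the boot
# code in the first 446 bytes of the target untouched.
# $1 = backup file, $2 = device (or image file)
restore_partition_table() {
    dd if="$1" of="$2" bs=1 skip=446 seek=446 count=64 conv=notrunc 2>/dev/null
}

# Example (double-check the device name before writing to it):
# backup_mbr /dev/device /path/to/backup
# restore_partition_table /path/to/backup /dev/device
```

Trying both functions on a disk image file first is a safe way to check them before touching a real device.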

Tuesday, June 8, 2010

gfs_controld mount: not in default fence domain

The above log message means that the fence daemon was not running, the node was not a full member of the cluster, or the node had joined the cluster in an unstable state. If any node fails to join the fence domain, the shared file systems will hang.

To confirm the above run the following commands:

# group_tool dump fence | grep members

and

cman_tool services

type level name id state
fence 0 default 00000000 JOIN_START_WAIT
[1]
dlm 1 clvmd 00000000 none

If the state is "JOIN_START_WAIT" the above description of the problem is correct.

Monday, May 31, 2010

How to locate all the files recursively which is having a specific pattern?

To find files whose contents match a pattern:

find / -type f -exec grep -H 'pattern' {} \;

You can also use grep with the -r option:

grep -r 'pattern' *

grep -r "pattern" /

The second command looks for the pattern starting from the root directory. Otherwise, give the path you want to search from.

To list only the names of the files which contain "pattern":

find . -type f -print0 | xargs -0 grep -l "pattern"

To find files whose names match a pattern:

find . -name 'pattern'

find ./ -iname '*substring*'

(-iname matches the name case-insensitively.)

To search for otherpattern only inside files whose names match "pattern":

find . -name "pattern" -exec grep -l otherpattern {} +

With the plus, grep will be given as many filenames as the system can handle rather than one file at a time, reducing the number of instances of grep.

Sunday, May 30, 2010

ttloop: peer died: EOF

How to get rid of the messages 'ttloop: peer died: EOF'

Possible causes:

1. A faulty network card
2. Port scanning
3. Someone trying to connect to the host

Tuesday, May 18, 2010

add a physical block device to virtualized guest in KVM or XEN

1. Physically attach the hard disk device to the host. Configure the host if the drive is not accessible by default.
2. Configure the device with multipath and persistence on the host if required.
3. Use the virsh attach-disk command. Replace myguest with your guest's name, /dev/hdb1 with the device to add, and hdc with the location for the device on the guest. The hdc must be an unused device name.

# virsh attach-disk myguest /dev/hdb1 hdc --driver tap --mode readonly

4. The guest now has a new hard disk device, called /dev/hdc on Linux. This device may require formatting.