Lost hard disk space: resolving the inconsistency between du and df

Out of the blue I received an alert email from Zabbix about disk space. I logged on to the server right away and checked disk usage with df; the available space was indeed running low:

# The -h option prints sizes in human-readable units
df -h

The line for /home in the output:

/dev/sdb1       2.2T  1.8T  488G  79% /home
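
For reference, a check similar to the Zabbix trigger can be scripted around df. This is only a sketch; the mount point and the 75% threshold are assumptions, not the actual Zabbix configuration:

# Warn when the usage of a mount point crosses a threshold
MOUNT=/home
THRESHOLD=75
# df -P prints one POSIX-format line per filesystem; field 5 is Use%
usage=$(df -P "$MOUNT" | awk 'NR==2 {gsub("%", "", $5); print $5}')
if [ "$usage" -ge "$THRESHOLD" ]; then
    echo "WARNING: $MOUNT is ${usage}% full"
fi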

Next, I analyzed how the space was being used, starting with my own directory under /home:

# Size of each first-level directory
du -h --max-depth=1 /home/yiifaa|sort -n -k1

The output results are as follows:

Directory        Size
./logs           11G
./mis_analysis   51G
./opdir          253G
./openresty      5.3G
./sdk_collect    788G
./sdk_data       232G
Total            1340.3G
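
The total from du can also be compared directly with the "Used" column from df. A quick sketch (the mount point comes from the df output above; run it as root so that no directory is skipped for permission reasons):

# Bytes used according to df versus bytes du can actually see
# (--output= needs a fairly recent GNU df; -x keeps du on this filesystem)
df_used=$(df -B1 --output=used /home | tail -n 1)
du_used=$(du -sxB1 /home | cut -f1)
echo "df used : $df_used bytes"
echo "du sees : $du_used bytes"
echo "missing : $(( (df_used - du_used) / 1024 / 1024 / 1024 )) GiB"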

The gap was roughly 500G. My first thought was that other users were taking up the remaining space, so I switched to root as quickly as possible and analyzed again:

# Switch to root
su root
# Check usage again, this time for every top-level directory
du -h --max-depth=1 /|sort -n -k1

The output results are as follows:

1.3T    /
1.3T    /home
3.8G    /usr
4.0K    /cgroup
4.0K    /media
4.0K    /misc
4.0K    /mnt
4.0K    /net
4.0K    /selinux
4.0K    /srv
4.1G    /var
5.6M    /tmp
6.7M    /share
9.0M    /bin
10M     /root
16K     /lost+found
16M     /sbin
18M     /libexec
23M     /opt
28M     /lib64
34M     /etc
77M     /boot
200K    /dev
290M    /lib

The two du results are essentially the same, so the conclusion stands: about 500G of disk space really had disappeared as far as du was concerned.

After reading a fair amount of documentation, I could confirm the cause of the missing space: files that have been deleted are still held open by running processes. The kernel cannot release their blocks until the last file descriptor is closed, so df, which asks the filesystem how many blocks are in use, keeps counting them, while du, which walks the directory tree, no longer sees them. The solution is therefore simple: find the processes that still reference the deleted files and stop them so the space can be reclaimed. (A small sketch that reproduces the behavior is included at the end of this post.) To find the processes holding deleted files:

# Reverse sort by deleted file size
lsof -s|grep deleted|sort -nr -k7|less

The results are as follows:

python     9100   xiaoju    4w      REG               8,17 506684182703      11467 /home/xiaoju/sec_audit_log/biz/sec_audit.log (deleted)
python     9100   xiaoju    3w      REG               8,17 506684182703      11467 /home/xiaoju/sec_audit_log/biz/sec_audit.log (deleted)
python     9100   xiaoju   12w      REG               8,17 506684182703      11467 /home/xiaoju/sec_audit_log/biz/sec_audit.log (deleted)
python     9100   xiaoju   11w      REG               8,17 506684182703      11467 /home/xiaoju/sec_audit_log/biz/sec_audit.log (deleted)
python     9100   xiaoju   10w      REG               8,17 506684182703      11467 /home/xiaoju/sec_audit_log/biz/sec_audit.log (deleted)
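
When many processes are involved, the offending PIDs can be collected in one pass. A sketch using lsof's +L1 selector, which lists open files whose link count has dropped below one, i.e. deleted files:

# Unique command/PID pairs that still have deleted files open
lsof +L1 | awk 'NR>1 {print $1, $2}' | sort -u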

The offending process is easy to spot: PID 9100. After terminating it, I checked the disk space with df again, and the statistics were finally consistent:

# Stop the process that still holds the deleted log file, then recheck
kill 9100
df -h

df now reports:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1       2.2T  1.3T  965G  57% /home
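
For completeness, the behavior behind all of this is easy to reproduce on a test machine. The sketch below uses made-up file names and sizes and has nothing to do with the incident above:

# Create a 1 GiB file and keep it open from a background process
dd if=/dev/zero of=/tmp/big.log bs=1M count=1024
tail -f /tmp/big.log &
holder=$!
# Delete the file: du stops counting it, but df does not
rm /tmp/big.log
du -sh /tmp
df -h /tmp
# The deleted-but-open file shows up in lsof
lsof +L1 | grep big.log
# Once the holder exits, df and du agree again
kill "$holder"
df -h /tmp

As soon as the last file descriptor on the deleted file is closed, the kernel frees its blocks and df drops back in line with du.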
