Sometimes you get an incident for high usage on a filesystem. You check and yes, ‘df’ shows that filesystem usage is high, but ‘du’ (disk usage) reports something different. Why?

Some explanations say that the two tools don’t use the same methods or metrics to calculate what they report: ‘df’ asks the filesystem how many blocks are allocated, while ‘du’ walks the directory tree and adds up the files it finds.
Yes, that’s true, but the outputs should still be pretty much the same.
If you ask me, my answer is easy and simple: HUMAN ERROR! If there is human intervention in the system, always assume someone did something wrong.

In this case we have a 1.1T filesystem:

[root@prod_dbnode ~]# df -h /u02
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdh       1.1T  943G   85G  92% /u02
[root@prod_dbnode ~]#


[root@prod_dbnode ~]# du -sh /u02
148G    /u02
[root@prod_dbnode ~]# 

Hey! Did you see that?? There is a difference of about 800G.
When you see a huge difference like this, in pretty much all cases it is because someone deleted a huge file while an OS process still has it open (aka is still holding the space). ‘du’ no longer sees the file, so it reports the space as already freed, but ‘df’ shows that the space has not been released yet.
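
If you want to see this behaviour for yourself on a test box, here is a minimal sketch (bash on Linux, since it relies on /proc). The function name, fd number 9 and the 10 MB size are arbitrary choices for the demo, not something from the incident:

```shell
#!/usr/bin/env bash
# Reproduce a deleted-but-still-open file on a small scale (Linux only).
# demo_deleted_open_file is a hypothetical name used just for this sketch.
demo_deleted_open_file() {
    local demo_dir
    demo_dir=$(mktemp -d) || return 1

    # Create a 10 MB file and keep it open on fd 9 of this shell.
    dd if=/dev/zero of="$demo_dir/big.trc" bs=1M count=10 status=none
    exec 9<"$demo_dir/big.trc"

    # Remove the directory entry; the inode stays alive while fd 9 is open.
    rm "$demo_dir/big.trc"

    # du walks the directory tree, so it no longer counts the file...
    du -sh "$demo_dir"
    # ...but the kernel still tracks it and marks the link as "(deleted)".
    readlink "/proc/$BASHPID/fd/9"

    # Closing the descriptor is what finally frees the blocks.
    exec 9<&-
    rmdir "$demo_dir"
}
```

Calling `demo_deleted_open_file` prints the (now tiny) du total for the directory and then the path of the removed file with " (deleted)" appended, which is the same marker lsof shows.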

This is a real case where someone saw a huge trace file and decided to delete it, but the space never got released and the file kept growing and growing…

[root@prod_dbnode ~]# lsof  | grep -i deleted | grep /u02 | sort -nk7 | tail -4
oracle_19 199546          oracle   21w      REG            202,112         1492   56080968 /u02/app/oracle/diag/rdbms/primaryDB/instance1/trace/instance1_ora_199546.trm (deleted)
oracle_20 203218          oracle   20w      REG            202,112         3689   55984621 /u02/app/oracle/diag/rdbms/primaryDB/instance1/trace/instance1_ora_203218.trc (deleted)
oracle_19 199546          oracle   20w      REG            202,112         4881   56080967 /u02/app/oracle/diag/rdbms/primaryDB/instance1/trace/instance1_ora_199546.trc (deleted)
ora_p007_ 116364          oracle   47w      REG            202,112 810482773836   56039058 /u02/app/oracle/diag/rdbms/primaryDB/instance1/trace/instance1_p007_116364.trc (deleted)  <<---- 810G trace
[root@prod_dbnode ~]# 
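
If lsof is not installed, or its output is too big to grep comfortably, the same information can be read straight from /proc. The helper below is only a sketch: `list_deleted_open` is a made-up name, it assumes bash and GNU stat on Linux, and it should be run as root so that the fd directories of other users are readable.

```shell
#!/usr/bin/env bash
# List deleted-but-open files under a mount point, largest last.
# list_deleted_open is a hypothetical helper name, not a standard tool.
list_deleted_open() {
    local mount_point=$1 fd target size
    for fd in /proc/[0-9]*/fd/*; do
        # readlink fails for descriptors we cannot inspect or that vanished.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            "$mount_point"/*" (deleted)")
                # stat -L follows the link to the still-open inode.
                size=$(stat -L -c %s "$fd" 2>/dev/null) || continue
                printf '%s\t%s\t%s\n' "$size" "$fd" "$target"
                ;;
        esac
    done | sort -n
}
```

For the case above the call would be `list_deleted_open /u02`. Each line shows the size in bytes, the /proc path (which already contains the PID and the fd number) and the original file name.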

What do you do to fix this? Well… the easy fix is to stop the OS process, which releases the space right away.
But what if the running process is critical and you cannot stop it until you get a maintenance window? The only option is to null (truncate) the file through the file descriptor the process still holds:

* We need to check the fd's (file descriptors) of OS process 116364; in this case it is fd 47:

[root@ryderprod-ajr2k2 ~]# ls -tlr /proc/116364/fd | grep deleted
l-wx------ 1 oracle asmadmin 64 Jun 14 23:29 47 -> /u02/app/oracle/diag/rdbms/primaryDB/instance1/trace/instance1_p007_116364.trc (deleted) (deleted)
[root@ryderprod-ajr2k2 ~]#

* Just null the file and voilà! Space released:
[root@prod_dbnode ~]# cd /proc/116364/fd
[root@prod_dbnode fd]#  > 47
[root@prod_dbnode fd]#


[root@prod_dbnode fd]# df -h /u02
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdh       1.1T  139G  890G  14% /u02
[root@prod_dbnode fd]#
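
Typing `> 47` in the wrong directory or against the wrong fd can empty a file you still need, so it may be worth wrapping the step in a small check first. This is just a sketch: `truncate_deleted_fd` is a hypothetical helper that assumes bash on Linux, and it refuses to do anything unless the link is marked as deleted.

```shell
#!/usr/bin/env bash
# Truncate a deleted-but-open file through /proc, with a safety check.
# truncate_deleted_fd is a hypothetical helper; pass the PID and fd number.
truncate_deleted_fd() {
    local pid=$1 fd=$2 link target
    link="/proc/$pid/fd/$fd"
    target=$(readlink "$link") || { echo "cannot read $link" >&2; return 1; }

    # Refuse to touch anything that is not marked as deleted.
    case $target in
        *" (deleted)") ;;
        *) echo "$link -> $target is not a deleted file" >&2; return 1 ;;
    esac

    # Opening the /proc link for writing with truncation empties the inode.
    : > "$link"
}
```

In this case the call would be `truncate_deleted_fd 116364 47`. One thing to keep in mind: if the process did not open the file in append mode, it keeps writing at its old offset, so the file may show a large apparent size again (as a sparse file) even though the blocks were released. Either way, the process still needs a restart in the maintenance window to close the file for good.
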
Last modified: 26 July 2021
