Prevent accidental data loss in Hadoop

Sometimes you accidentally delete a file that you were not supposed to. So what do you do in that case?

The option you have is to enable trash (the recycle bin) for Hadoop by defining fs.trash.interval. Set it to one day (1440 minutes), or more or less depending on the value of the data you host in Hadoop.

You can do it in the following way:

Open core-site.xml and paste the following property:

<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
  <description>Number of minutes between trash checkpoints. If zero, the trash feature is disabled.</description>
</property>

After including this property you do not need to restart the NameNode or any other service. From then on, whenever you delete a file, intentionally or accidentally, it will not be deleted permanently; instead it will be moved to the /user/hdfs/.Trash/Current/ directory on HDFS, from where you can always restore it.
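For example, here is a minimal restore sketch, assuming a hypothetical file /user/hdfs/important.txt that was just deleted (the path under .Trash/Current mirrors the file's original path):

hadoop dfs -ls /user/hdfs/.Trash/Current/user/hdfs/
hadoop dfs -mv /user/hdfs/.Trash/Current/user/hdfs/important.txt /user/hdfs/important.txt

The -ls confirms the file landed in the trash, and -mv moves it back to its original location.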

There is another option if you are deleting a file intentionally and don't want it to go to the trash, i.e., it should be deleted instantly. Then you have two options:

hadoop dfs -rmr /user/hdfs/.Trash/Current/<folder or filename you want to delete>

or you can explicitly skip the trash, like this:

hadoop dfs -rmr -skipTrash /<file or folder to delete>

Note: if you want to delete a single file you can use -rm; if you want to delete a directory and all the directories inside it recursively, you can use -rmr. For example (paths are hypothetical):
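hadoop dfs -rm /user/hdfs/report.txt
hadoop dfs -rmr /user/hdfs/logs

The first removes a single file; the second removes the logs directory and everything under it.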

If you want to empty the .Trash (recycle bin) on HDFS from the command line, you can use -expunge, which you execute as:

hadoop dfs -expunge

This checkpoints the current trash and permanently removes any checkpoints older than fs.trash.interval, so you do not need to go into the trash directory and delete things manually using -rm or -rmr.

When you execute hadoop dfs -expunge you will see output like the following:

13/10/04 02:41:28 INFO fs.Trash: Created trash checkpoint: /user/hdfs/.Trash/1310040241

This shows that a checkpoint has been created; the files under it will be deleted from the recycle bin once the checkpoint is older than fs.trash.interval.
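If you want to verify the result, you can list the trash directory (assuming the user is hdfs):

hadoop dfs -ls /user/hdfs/.Trash/

Checkpoints older than fs.trash.interval will be gone, and a new timestamped checkpoint directory like the one in the output above will appear.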
