Find duplicated files and directories within a given path
Source: R/general_find_duplicates.R
This function scans a directory tree for duplicated files (by content hash) and optionally duplicated directories (by identical file sets). It supports filtering by file extension, minimum file size, and parallel processing.
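The core idea of content-hash duplicate detection can be sketched in a few lines of base R. This is an illustrative sketch only, not the package's actual implementation; the path "some/dir" is a placeholder:

```r
# Sketch: group files by the MD5 hash of their content. Files sharing a
# hash are byte-identical duplicates.
files <- list.files("some/dir", recursive = TRUE, full.names = TRUE)
hashes <- tools::md5sum(files)          # named vector: file path -> MD5 hash
groups <- split(files, hashes)          # one group per distinct hash
dup_groups <- groups[lengths(groups) > 1]  # keep only groups with >1 file
```

find_duplicates() builds on this grouping and additionally compares directories by their file sets, with optional filtering and parallelism as described below.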
Usage
find_duplicates(
  path = ".",
  size_threshold = 0L,
  extensions = NULL,
  n_cores = 1L,
  print_results = TRUE
)
Arguments
- path
Character. Root directory to scan for duplicates.
- size_threshold
Numeric. Minimum file size (in MB) to report as duplicate.
- extensions
Character vector. Optional file extensions (without the leading dot) to filter by, case-insensitive; e.g., c("csv", "txt"). If provided, only files with these extensions are considered, and directories are excluded if they don't match. Defaults to NULL (all files are considered).
- n_cores
Integer. Number of parallel workers to use (default: 1).
- print_results
Logical. Whether to print results to the console (default: TRUE).
Value
A list of tibbles with duplicated files and directories, if found.
The duplicated_files tibble (if any) contains the following columns:
- path: root path scanned
- dup_group: duplicate group ID
- files: list-column of tibbles with absolute/relative paths and modification times
- file_ext: file extension(s) of the group
- n_files: number of duplicated files in the group
- file_size_mb: size (MB) of the first file in the group
- content_hash: MD5 hash of the file content

The duplicated_dirs tibble (if any) contains the following columns:
- dir: directory path at the duplicated level
- dir_abs: absolute path to the directory
- n_files: number of files in the directory
- n_dup_dirs: number of duplicated directories in the group
- dup_group: unique identifier for each group of duplicated directories
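Examples

A minimal usage sketch based on the signature above (paths and worker counts are illustrative):

```r
# Scan the current directory for duplicated files of any type
dups <- find_duplicates(path = ".")

# Restrict to CSV/TXT files of at least 5 MB, use 4 parallel workers,
# and suppress console output
dups <- find_duplicates(
  path = "~/data",
  size_threshold = 5,
  extensions = c("csv", "txt"),
  n_cores = 4L,
  print_results = FALSE
)

# Inspect duplicated file groups, if any were found
if (!is.null(dups$duplicated_files)) {
  dups$duplicated_files
}
```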