Big And Duplicated File Search

I have a really large file with approximately 15 million entries. Each line in the file contains a single string (call it the key).

I need to find the duplicate entries in the file using Java. I tried using a HashMap to detect duplicates, but that approach throws a 'java.lang.OutOfMemoryError: Java heap space' error.

How can I solve this problem?

I think I could increase the heap space and try again, but I wanted to know whether there are more efficient solutions that don't require tweaking the heap space.

Eduard Wirch

Maximus

7 Answers

The key is that your data will not fit into memory. You can use external merge sort for this:

Partition your file into multiple smaller chunks that fit into memory. Sort each chunk, eliminate the duplicates (now neighboring elements).

Merge the chunks, again eliminating duplicates as you merge. Since this is an n-way merge, you can keep the next k elements from each chunk in memory; once the buffered items for a chunk are depleted (they have already been merged), grab more from disk.
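The two steps above could look roughly like this in Java. This is only a sketch under some assumptions: the class name `ExternalSortDuplicates`, the method `findDuplicates`, and the `maxLinesPerChunk` parameter are made up for illustration, the file is assumed to be UTF-8 with one key per line, and the set of duplicated keys itself is assumed to be small enough to hold in memory. Because the question asks for the duplicates rather than a de-duplicated file, the sketch records repeated keys instead of just dropping them.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;
import java.util.TreeSet;

class ExternalSortDuplicates {

    /** One open sorted chunk plus the line it is currently positioned on. */
    private static final class Cursor implements Comparable<Cursor> {
        final BufferedReader reader;
        String line;

        Cursor(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.line = reader.readLine();
        }

        boolean advance() throws IOException {
            line = reader.readLine();
            return line != null;
        }

        @Override
        public int compareTo(Cursor other) {
            return line.compareTo(other.line);
        }
    }

    /** Returns the keys that occur more than once in the input, in sorted order. */
    static List<String> findDuplicates(Path input, int maxLinesPerChunk) throws IOException {
        List<Path> chunks = new ArrayList<>();
        TreeSet<String> duplicates = new TreeSet<>();

        // Phase 1: cut the input into chunks that fit in memory, sort each, write it out.
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> buffer = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() == maxLinesPerChunk) {
                    chunks.add(writeSortedChunk(buffer, duplicates));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                chunks.add(writeSortedChunk(buffer, duplicates));
            }
        }

        // Phase 2: n-way merge; equal keys from different chunks come out back to back.
        PriorityQueue<Cursor> heap = new PriorityQueue<>();
        List<BufferedReader> readers = new ArrayList<>();
        try {
            for (Path chunk : chunks) {
                BufferedReader reader = Files.newBufferedReader(chunk, StandardCharsets.UTF_8);
                readers.add(reader);
                Cursor cursor = new Cursor(reader);
                if (cursor.line != null) heap.add(cursor);
            }
            String previous = null;
            while (!heap.isEmpty()) {
                Cursor smallest = heap.poll();
                if (smallest.line.equals(previous)) duplicates.add(smallest.line);
                previous = smallest.line;
                if (smallest.advance()) heap.add(smallest);
            }
        } finally {
            for (BufferedReader reader : readers) reader.close();
            for (Path chunk : chunks) Files.deleteIfExists(chunk);
        }
        return new ArrayList<>(duplicates);
    }

    /** Sorts one chunk, notes keys repeated inside it, and writes each distinct key once. */
    private static Path writeSortedChunk(List<String> buffer, TreeSet<String> duplicates)
            throws IOException {
        Collections.sort(buffer);
        Path chunk = Files.createTempFile("chunk", ".txt");
        try (BufferedWriter out = Files.newBufferedWriter(chunk, StandardCharsets.UTF_8)) {
            String previous = null;
            for (String key : buffer) {
                if (key.equals(previous)) {
                    duplicates.add(key);
                } else {
                    out.write(key);
                    out.newLine();
                }
                previous = key;
            }
        }
        return chunk;
    }
}
```

A call such as `ExternalSortDuplicates.findDuplicates(path, 1_000_000)` would then use about one million keys of heap at a time; the chunk size is the knob to tune against the available memory.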

BrokenGlass

I'm not sure if you'd consider doing this outside of Java, but if so, this is very simple in a shell:

Michael

You probably can't load the entire file at one time, but you can store the hash and line number of each line in a HashSet (or a HashMap keyed by hash) without a problem.

Pseudo code...
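One way the hashing idea could be sketched in Java is shown below. It is a variant rather than a literal rendering of the answer: instead of storing line numbers, it makes a second pass over the file and compares the real strings, so that two different keys with the same hash are not reported as duplicates. The class and method names are invented for illustration, and the file is assumed to be UTF-8 with one key per line. Note that boxed `Integer` values in a `HashSet` still cost noticeably more than four bytes each, so a primitive-int set would be a natural refinement.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

class HashPassDuplicates {

    /** Returns the keys that occur more than once in the input. */
    static Set<String> findDuplicates(Path input) throws IOException {
        // Pass 1: remember only the 32-bit hash of every line, not the line itself.
        Set<Integer> seenHashes = new HashSet<>();
        Set<Integer> repeatedHashes = new HashSet<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                int hash = line.hashCode();
                if (!seenHashes.add(hash)) {
                    repeatedHashes.add(hash);
                }
            }
        }

        // Pass 2: only lines whose hash repeated can be duplicates. Compare the
        // actual strings here so that hash collisions are not reported.
        Set<String> candidates = new HashSet<>();
        Set<String> duplicates = new TreeSet<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (repeatedHashes.contains(line.hashCode()) && !candidates.add(line)) {
                    duplicates.add(line);
                }
            }
        }
        return duplicates;
    }
}
```

The second pass only keeps strings whose hash was seen at least twice, so its memory use depends on how many duplicates and collisions the data contains rather than on the total number of lines.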

Andrew White

I don't think you need to sort the data to eliminate duplicates. Just use a quicksort-inspired approach.

Pick k pivots from the data (unless your data is really wacky, this should be pretty straightforward).
Using these k pivots, divide the data into k+1 small files.
If any of these chunks is too large to fit in memory, repeat the process just for that chunk.
Once you have manageably sized chunks, just apply your favorite method (hashing?) to find duplicates.

Note that k can be equal to 1.
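A minimal Java sketch of this partitioning scheme follows. Several things here are assumptions made for illustration: the class and method names, the fact that the caller supplies the pivots, and the use of a `HashSet` per bucket. The recursive step for buckets that are still too large is left out to keep the example short. Identical keys always land in the same bucket, which is what makes per-bucket hashing sufficient.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class PivotPartitionDuplicates {

    /** Splits the input into pivots.size() + 1 bucket files, then hashes each bucket. */
    static Set<String> findDuplicates(Path input, List<String> pivots) throws IOException {
        List<String> sortedPivots = new ArrayList<>(pivots);
        Collections.sort(sortedPivots);

        List<Path> bucketFiles = new ArrayList<>();
        List<BufferedWriter> writers = new ArrayList<>();
        Set<String> duplicates = new TreeSet<>();
        try {
            for (int i = 0; i <= sortedPivots.size(); i++) {
                Path file = Files.createTempFile("bucket" + i + "-", ".txt");
                bucketFiles.add(file);
                writers.add(Files.newBufferedWriter(file, StandardCharsets.UTF_8));
            }

            // Step 1: route every key to the bucket between its neighbouring pivots.
            try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                String line;
                while ((line = in.readLine()) != null) {
                    int idx = Collections.binarySearch(sortedPivots, line);
                    // Found: key equals pivot idx. Not found: -idx - 1 is the insertion point.
                    int bucket = idx >= 0 ? idx + 1 : -idx - 1;
                    BufferedWriter out = writers.get(bucket);
                    out.write(line);
                    out.newLine();
                }
            }
            for (BufferedWriter writer : writers) writer.close();
            writers.clear();

            // Step 2: each bucket is assumed to fit in memory, so a HashSet is enough.
            for (Path file : bucketFiles) {
                Set<String> seen = new HashSet<>();
                try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        if (!seen.add(line)) duplicates.add(line);
                    }
                }
            }
        } finally {
            for (BufferedWriter writer : writers) writer.close();
            for (Path file : bucketFiles) Files.deleteIfExists(file);
        }
        return duplicates;
    }
}
```

With a single pivot such as `"m"`, the file is split into keys that sort before `"m"` and keys that sort at or after it, which matches the k = 1 case mentioned above.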

ElKamina

One way I can imagine solving this is to first use an external sorting algorithm to sort the file (searching for "external sort java" yields lots of results with code). Then you can iterate over the file line by line. Duplicates will now directly follow each other, so you only need to remember the previous line while iterating.
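Assuming the file has already been sorted by some external sort, the scanning pass might look like the following sketch. The class and method names are hypothetical, and the file is assumed to be UTF-8 with one key per line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class SortedFileScan {

    /** Reports each key that appears more than once in an already sorted file. */
    static List<String> duplicatesInSortedFile(Path sorted) throws IOException {
        List<String> duplicates = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(sorted, StandardCharsets.UTF_8)) {
            String previous = null;
            String line;
            while ((line = in.readLine()) != null) {
                boolean sameAsPrevious = line.equals(previous);
                // Report a key only once, even if it repeats three or more times.
                boolean alreadyReported = !duplicates.isEmpty()
                        && duplicates.get(duplicates.size() - 1).equals(line);
                if (sameAsPrevious && !alreadyReported) {
                    duplicates.add(line);
                }
                previous = line;
            }
        }
        return duplicates;
    }
}
```

Only the previous line and the list of reported keys are held in memory, so the pass itself needs very little heap.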

DarkDust

If you cannot build up a complete list because you don't have enough memory, you might try doing it in loops. For example, create a hashmap but only store a small portion of the items (say, those starting with 'A'). Gather the duplicates for that portion, then continue with 'B', and so on.

Of course you can select any kind of 'grouping' (i.e. first 3 characters, first 6 etc).

It will just take (many) more passes over the file.
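As a rough Java sketch of the looping idea, grouping by first character: the class and method names are invented, the file is assumed to be UTF-8 with one key per line, and a cheap pre-pass collects which first characters actually occur so that no pass is wasted on an empty group.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

class PrefixPassDuplicates {

    /** Group id of a key: its first character, or -1 for an empty line. */
    private static int groupOf(String line) {
        return line.isEmpty() ? -1 : line.charAt(0);
    }

    private static BufferedReader open(Path input) throws IOException {
        return Files.newBufferedReader(input, StandardCharsets.UTF_8);
    }

    /** Finds duplicates with one pass over the file per group of keys. */
    static Set<String> findDuplicates(Path input) throws IOException {
        // Pre-pass: find out which groups exist at all.
        Set<Integer> groups = new TreeSet<>();
        try (BufferedReader in = open(input)) {
            String line;
            while ((line = in.readLine()) != null) {
                groups.add(groupOf(line));
            }
        }

        Set<String> duplicates = new TreeSet<>();
        for (int group : groups) {
            // Only the keys of the current group are held in memory.
            Set<String> seen = new HashSet<>();
            try (BufferedReader in = open(input)) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (groupOf(line) == group && !seen.add(line)) {
                        duplicates.add(line);
                    }
                }
            }
        }
        return duplicates;
    }
}
```

If a single first character still covers too many keys, `groupOf` could be changed to use a longer prefix, at the cost of more passes.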

Michel Keijzers

You might try a Bloom filter, if you're willing to accept a certain amount of statistical error. Guava provides one, but there's a pretty major bug in it right now that should probably be fixed next week with release 11.0.2.
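To illustrate the trade-off, here is a small hand-rolled Bloom filter in Java. It is deliberately not Guava's API: the class `SimpleBloomFilter`, its methods, and the choice of hash functions are assumptions made for this sketch. A key that is reported may be a false positive, but a key that really is repeated is never missed.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    SimpleBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** FNV-1a-style hash over the characters, used as a second hash function. */
    private static int secondHash(String key) {
        int h = 0x811C9DC5;
        for (int i = 0; i < key.length(); i++) {
            h ^= key.charAt(i);
            h *= 0x01000193;
        }
        return h;
    }

    /** Adds the key and returns true if it may have been added before. */
    boolean mightContainThenAdd(String key) {
        int h1 = key.hashCode();
        int h2 = secondHash(key);
        boolean allBitsSet = true;
        for (int i = 0; i < hashCount; i++) {
            // Double hashing: derive the i-th bit position from the two base hashes.
            int index = Math.floorMod(h1 + i * h2, size);
            if (!bits.get(index)) {
                allBitsSet = false;
                bits.set(index);
            }
        }
        return allBitsSet;
    }

    /** Returns every key the filter flags as possibly seen earlier in the sequence. */
    static List<String> possibleDuplicates(Iterable<String> keys, int size, int hashCount) {
        SimpleBloomFilter filter = new SimpleBloomFilter(size, hashCount);
        List<String> flagged = new ArrayList<>();
        for (String key : keys) {
            if (filter.mightContainThenAdd(key)) {
                flagged.add(key);
            }
        }
        return flagged;
    }
}
```

At around 10 bits per key, 15 million keys would need roughly 19 MB of bits, which is far less than storing the strings. The flagged keys could then be checked exactly in a second pass if false positives are not acceptable.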

Louis Wasserman

Duplicate files are a waste of disk space, consuming that precious SSD space on a modern Mac and cluttering your Time Machine backups. Remove them to free up space on your Mac.

There are many polished Mac apps for this — but they’re mostly paid software. Those shiny apps in the Mac app store will probably work well, but we have some good options if you don’t want to whip out your credit card.

Gemini and Other Paid Apps

If you do want to spend money on a duplicate-file-finder app, Gemini looks like one of the best options, with one of the slickest interfaces. The trial version worked well for us, and the interface certainly stands out from barebones, free applications like dupeGuru. Gemini can also scan your iTunes and iPhoto library for duplicates. If you’re willing to pay $10 for a better interface, Gemini seems like a good bet.

There are other, similarly polished duplicate-file-finders in the Mac App Store, too — but Apple flags this one as an Editors’ Choice, and we can see why.

As a bonus, the demo version of Gemini allows you to search for and find duplicates, but not remove them. So, if you really wanted, you could use the demo to find duplicates on your Mac, locate them in Finder, and then remove them by hand. Other paid duplicate-file-finder apps have demos that function in a similar way, so this may be convenient if you just want to run an occasional scan and you don’t mind deleting a handful of duplicates by hand.

There are many good-quality, paid duplicate-file-finding apps for Mac. You can find them with a quick trip to the Mac App Store.

dupeGuru, dupeGuru Music Edition, and dupeGuru Pictures Edition

We also recommended dupeGuru for finding duplicate files on Windows. This application is both open-source and cross-platform. It’s simple to use — open the application, add one or more folders to scan, and click Scan. You’ll see a list of duplicate files, and you can select them and easily move them to the Trash or another folder. You can also preview them, verifying that they actually are duplicates before tossing them away.

dupeGuru is available in three different flavors — a standard edition, an edition designed for finding duplicate music files, and an edition designed for finding duplicate pictures. These tools won’t just find exact duplicates, but should find the same songs encoded at different bitrates and the same picture resized, rotated, or edited.

This application is utilitarian, but it does its job well. You don’t get the shiny interface that you do with the paid Mac apps, but it’s a good free tool for finding and clearing duplicate files. If you want a free application for finding and removing duplicate files on a Mac, this is the one to use.

iTunes

iTunes has a built-in feature that can find duplicate music and video files in your iTunes library. It won’t help with other types of files or media files not in iTunes, but it can be a quick way to free up some space if you have a big media library with duplicate files.

To use this feature, open iTunes, click the View menu, and select Show Duplicate Items. You can also hold the Option key on your keyboard and then click the Show Exact Duplicate Items link. This will only show duplicates with the same exact name, artist, and album.

After you click this, iTunes will show you a sorted list of duplicates next to each other. You can go through the list and delete any duplicates from your computer if they actually are duplicates you want to delete. When you’re done, click View > Show All Items to get back to the default list of media.

That’s it? Yup, that’s it. We didn’t want to recommend potentially confusing Terminal commands that output a list of duplicates to a text file, awkward methods that involve scrolling through a list of all the files on your Mac in the Finder, or applications that require disabling the Mac’s Gatekeeper feature to run untrusted binaries. The tools above will do the job, whether you want a barebones-and-free utility or a polished-but-paid application.