Importing large flat files into MongoDB

This is a very basic technique, but that's how I like to start. 
I will also show a couple of tricks for working with large data files.
 
Editing large files
 
Let's assume you have a large data file, approximately 60MB with close to 400K rows and 20+ columns, and you need to change the file delimiter from pipe to tab (mongoimport doesn't take pipes). 
You are going to need a pretty powerful text editor to pull this off. I tried EditPlus and Notepad++, but both hung on my machine, so I switched to vim.
Once you have vim working, you can open the file and run the following search & replace command:
 
:%s/|/\t/g
 
This does a global search and replace, finding every "|" and replacing it with a tab. The operation runs pretty quickly in vim.
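If the file is already on a machine with a Unix shell, the same delimiter swap can probably be done without opening an editor at all. The following is a minimal sketch using tr, not the method I used above; the sample file names and contents are made up for illustration, and it makes the same assumption as the vim command: no field value contains a literal pipe.

```shell
# Build a tiny pipe-delimited sample (hypothetical file name and data).
printf 'id|name|city\n1|Ann|Oslo\n2|Bob|Rome\n' > sample_pipe.txt

# Translate every pipe into a tab and write the result to a new file,
# leaving the original untouched.
tr '|' '\t' < sample_pipe.txt > sample_tab.tsv

# Peek at the first lines of the converted file.
head -n 2 sample_tab.tsv
```

Because tr streams the file rather than loading it into memory, it should cope with files much larger than 60MB.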
 
Converting file formats
 
The next step is to get the data file onto the server. You can use any SFTP client to upload it. I used WinSCP, which happened to be installed on my machine already.
 
With the file on the server, the next step is to make it Unix friendly. Since I transferred the file from my Windows machine to the Linux server, I need to convert it to Unix format to avoid line break problems: Windows ends each line with a carriage return plus a line feed ("\r\n"), while Unix uses a line feed alone. This is a common problem between DOS and Unix machines. I used a program called dos2unix to convert the file.
 
sudo yum install dos2unix
dos2unix myfile.txt

Alternatively, if you don't have dos2unix on the server, this command removes the DOS carriage returns ("\r") and keeps a backup of the original with a .bak extension.
 
sed -i.bak -e 's/\r//g' filename
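Whichever conversion you use, a quick sanity check is to count the lines that still contain a carriage return. Here is a small sketch of that check. The sample file names are hypothetical, and for the demo I strip the carriage returns with tr -d, a portable stand-in for the sed command above.

```shell
# Sample file with Windows (CRLF) line endings; name and data are made up.
printf 'id\tname\r\n1\tAnn\r\n' > sample_crlf.tsv

# A literal carriage return to search for.
cr=$(printf '\r')

# Count lines containing a carriage return before converting.
# grep exits non-zero when the count is 0, hence the "|| true".
before=$(grep -c "$cr" sample_crlf.tsv || true)

# Delete every carriage return, writing to a new file.
tr -d '\r' < sample_crlf.tsv > sample_lf.tsv

# Count again on the converted file.
after=$(grep -c "$cr" sample_lf.tsv || true)

echo "lines with CR: before=$before after=$after"
```

On this sample the counts are 2 before and 0 after. A non-zero count after conversion on your real file means some carriage returns survived.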
 
Using mongoimport
 
Now the file is ready to load into MongoDB using the mongoimport tool. Go to the directory where you uploaded the file and run the following (change the database, collection, and file names, of course!):
 
mongoimport --db dbversity --collection dbversity_col --type tsv --file dbversity_col_data --headerline
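One check that can be worth running before the import is whether every row has the same number of tab-separated fields as the header line, since --headerline takes the field names from that first row. This is a rough sketch with awk, not part of my original workflow; the sample file name and contents are invented, and the last row is deliberately one field short.

```shell
# Sample TSV whose last row is missing a field (hypothetical name and data).
printf 'id\tname\tcity\n1\tAnn\tOslo\n2\tBob\n' > sample_check.tsv

# Tally how many lines have each field count, then sort by field count.
awk -F'\t' '{ counts[NF]++ } END { for (n in counts) print n, counts[n] }' \
    sample_check.tsv | sort -n > field_counts.txt

# Each output line reads: <number of fields> <number of lines>.
cat field_counts.txt
```

A consistent file produces a single output line. More than one line, as in this sample, means some rows are ragged and worth inspecting before you import.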
 
On my dev server it took less than 10 seconds to import ~400K records. Pretty fast! Now it's a good idea to check the data in mongo.
 
mongo dbversity
db.dbversity_col.count()
db.dbversity_col.findOne()
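To know what number count() should return, you can work out the expected document count from the file itself. A minimal sketch, assuming one record per line, a single header line, a trailing newline at the end of the file, and no blank lines; the sample file name and data are made up.

```shell
# Sample TSV with one header line and three records (hypothetical data).
printf 'id\tname\n1\tAnn\n2\tBob\n3\tCy\n' > sample_count.tsv

# Total number of lines in the file, header included.
lines=$(wc -l < sample_count.tsv)

# The header line becomes field names, not a document, so subtract it.
expected=$((lines - 1))

echo "expected documents: $expected"
```

If the count in MongoDB differs from this number on your real file, the difference points to rows that were skipped or to stray blank lines.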
