Best Web Scraped Data Formats

The kind of data we can scrape from the web is text (not counting images, videos and other binary files). When we’re done scraping the data, it’s up to us how we format and store it. When starting out, it can be a challenge to decide which data format to use. In this article I’m giving you an overview of what data formats are available and which one you should use in your situation.

So we’re gonna cover some ways to format our freshly scraped data. I include JSON, CSV, XML, SQL databases, Excel and even PDF as options on this list because these are fairly popular formats and together they cover all of the different use cases. Let’s start with the one that other developers and I probably use most often: JSON.

JSON

The most used data format for API endpoints and for exchanging data between applications is definitely JSON. JSON stands for JavaScript Object Notation. It is a popular structure because it’s easy to read for humans AND computers. Many systems can take JSON as input and produce it as output as well. Normally JSON is just a string which represents a bunch of key-value pairs (similar to a dictionary in Python or a map in Java), where values can also be lists or nested objects:

{
  "name":"John",
  "age":30,
  "cars": [
    { "name":"Ford", "models":[ "Fiesta", "Focus", "Mustang" ] },
    { "name":"BMW", "models":[ "320", "X3", "X5" ] },
    { "name":"Fiat", "models":[ "500", "Panda" ] }
  ]
}

It’s widely supported by libraries in practically every language, so it’s relatively easy to produce no matter what kind of data you are dealing with. For this reason, putting together a JSON file from scraped data should be quick and painless. One kind of problem which sometimes pops up when using JSON is an encoding error, for example when non-ASCII characters are written with the wrong encoding. For the most part it’s easy to debug. Unless specified otherwise, JSON is my go-to data format when scraping the web. This is what I suggest you use most of the time as well.
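To make this concrete, here is a minimal sketch of dumping scraped records to JSON with Python’s standard `json` module. The records, field names and the `to_json` / `save_json` helpers are made up for illustration; the explicit UTF-8 handling is one common way to sidestep the encoding errors mentioned above, not the only one.

```python
import json

# Hypothetical scraped records; the field names are invented for this example.
items = [
    {"name": "Café table", "price": 129.9, "tags": ["furniture", "sale"]},
    {"name": "Lamp", "price": 24.5, "tags": []},
]


def to_json(records):
    # ensure_ascii=False keeps characters like "é" readable
    # instead of turning them into \uXXXX escapes.
    return json.dumps(records, ensure_ascii=False, indent=2)


def save_json(records, path):
    # Opening the file with an explicit encoding avoids relying on
    # the platform default, which is a frequent source of encoding errors.
    with open(path, "w", encoding="utf-8") as f:
        f.write(to_json(records))
```

Calling `save_json(items, "items.json")` would write the file; reading it back with `json.load` gives the same list of dictionaries.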

CSV

The other format which is probably almost as popular as JSON is CSV. The abbreviation stands for Comma-Separated Values. This kind of format is useful when you have 2-dimensional data (like a table). Each line is a row, and the column values (fields) are separated by commas. One reason CSV might be more useful than JSON in some cases is that it’s simply better known by non-programmers. You can import it directly into Excel or other spreadsheet software, which is pretty cool, although no styling is possible. CSV files also work pretty well with SQL databases, and you can import and export CSV with them as well.

Year,Make,Model
1997,Ford,E350
2000,Mercury,Cougar

When I have clients who just want to look at and use the data directly (i.e. not import it into software), they usually request this format and I give it to them, because no further processing is needed to make use of the data. When it comes to scaling, CSV is not a very good choice. Also, if the data needs to be processed further, JSON is a better format for that because it’s easier to work with. I only produce CSV for scraped data when I’m asked to or when I just don’t have another choice.
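For the cases when a client does ask for CSV, a rough sketch with Python’s standard `csv` module could look like this. The rows reuse the sample table above plus one invented row containing a comma, and `to_csv` is just a hypothetical helper name.

```python
import csv
import io

# Sample rows; the third one is invented to show how commas inside a value are handled.
rows = [
    {"Year": 1997, "Make": "Ford", "Model": "E350"},
    {"Year": 2000, "Make": "Mercury", "Model": "Cougar"},
    {"Year": 1999, "Make": "Chevy", "Model": "Venture, Extended"},
]


def to_csv(records, fieldnames):
    # Write into an in-memory buffer so the result can be inspected as a string.
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv(rows, ["Year", "Make", "Model"])
```

Letting the `csv` module do the writing, instead of joining strings with commas by hand, means values that contain a comma get quoted, so spreadsheet software still splits the columns correctly.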

XML

Extensible Markup Language (XML) is not really used to store scraped data. I also don’t like it because the opening and closing tags add a lot to the size of the data. Some legacy software needs XML as input, but other than that it’s not popular nowadays.

<note>
 <to>Tove</to>
 <from>Jani</from>
 <heading>Reminder</heading>
 <body>Don't forget me this weekend!</body>
</note>
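If some legacy system does require XML, a flat scraped record can be converted with Python’s standard `xml.etree.ElementTree`. This is only a sketch: `to_xml` is a hypothetical helper, and it assumes every key of the record is a valid XML tag name.

```python
import xml.etree.ElementTree as ET

# The same record as the sample above, as a plain dictionary.
note = {
    "to": "Tove",
    "from": "Jani",
    "heading": "Reminder",
    "body": "Don't forget me this weekend!",
}


def to_xml(record, root_tag="note"):
    # Assumption: each key is a valid XML tag name (no spaces, no leading digits).
    root = ET.Element(root_tag)
    for key, value in record.items():
        child = ET.SubElement(root, key)
        child.text = str(value)
    # encoding="unicode" returns a str instead of bytes.
    return ET.tostring(root, encoding="unicode")


xml_text = to_xml(note)
```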

SQL Database

SQL stands for Structured Query Language. These kinds of databases are used everywhere. They are scalable and can be easy to work with. For scraped data they can be a good fit, but only when the data has an already known schema. If we don’t know the schema or it’s changing constantly, SQL is really not a good choice.

Storing data in an SQL database is essential when we want to draw conclusions from the data through analysis. With SQL we can write complex queries so the database spits out the truth. We should only use an SQL-based database when we already know the data schema and we want to analyse or query the data later.
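As a small illustration of the "known schema, query later" case, here is a sketch using SQLite through Python’s built-in `sqlite3` module. The `products` table, its columns and the sample rows are all invented for the example, and an in-memory database stands in for a real one.

```python
import sqlite3

# In-memory database for the example; a real project would use a file or a server.
conn = sqlite3.connect(":memory:")

# The schema has to be known up front: this is the trade-off discussed above.
conn.execute("CREATE TABLE products (name TEXT, price REAL, scraped_on TEXT)")

# Hypothetical scraped rows: the same products scraped on different days.
scraped = [
    ("Lamp", 24.5, "2020-01-01"),
    ("Lamp", 25.5, "2020-01-02"),
    ("Table", 130.0, "2020-01-01"),
]
conn.executemany("INSERT INTO products VALUES (?, ?, ?)", scraped)
conn.commit()

# Once the data is in, analysis is a query away: average price per product.
averages = conn.execute(
    "SELECT name, AVG(price) FROM products GROUP BY name ORDER BY name"
).fetchall()
```

The `?` placeholders let the driver handle quoting, which matters for scraped text that can contain anything.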

EXCEL (xls, xlsx)

When a regular user hears that the data can be formatted as an Excel spreadsheet, they’re really happy about it because they know and trust it well. From an abstract perspective, an Excel file is not much different from a CSV file (data-wise). We just add some styling, maybe.

In the last five years I had one project where generating an Excel file with some fancy formatting was actually useful. It was a dynamically created table with recent price changes of some products. The fields that contained a price change were colored accordingly (red cell background for a price increase, green for a decrease), so the analyst who read the data could recognize important changes more quickly.

PDF

Well, I really shouldn’t include PDF as an option for formatting scraped data. The reality, though, is that some people think it looks more professional to present data this way, which is true, I guess, in the right context.

Also, if the deliverable includes some analytics work and not just pure data, then delivering charts and diagrams to business people as a PDF might be a decent idea.

Wrapping up

Okay, so these are the options you will need to choose from when dealing with scraped data. I would say that 90% of the time JSON is gonna be the best choice because of its scalability and flexibility. Also, if you need analysis or further querying, having the data in JSON will not stop you from inserting it into an SQL database to work with it later.
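To show that last point, here is a rough sketch of loading scraped JSON into SQLite afterwards, using only Python’s standard library. The `items` table, the field names and the `load_into_sqlite` helper are assumptions made for the example, and it expects every record to have the same `name` and `price` keys.

```python
import json
import sqlite3

# Hypothetical scraped output, as it might sit in a JSON file.
json_text = '[{"name": "Lamp", "price": 24.5}, {"name": "Table", "price": 130.0}]'


def load_into_sqlite(conn, json_string):
    # Assumption: each record is a flat object with "name" and "price" keys.
    records = json.loads(json_string)
    conn.execute("CREATE TABLE IF NOT EXISTS items (name TEXT, price REAL)")
    # Named placeholders (:name, :price) are filled from each dictionary.
    conn.executemany(
        "INSERT INTO items (name, price) VALUES (:name, :price)", records
    )
    conn.commit()
    return len(records)


conn = sqlite3.connect(":memory:")
inserted = load_into_sqlite(conn, json_text)
```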
