The `extract` method in Cheerio allows you to extract data from an HTML document and store it in an object. The method takes a `map` object as a parameter, where the keys are the names of the properties to be created on the object, and the values are the selectors or descriptors to be used to extract the values.
To use the `extract` method, you first need to import the library and load an HTML document. For example:
import * as cheerio from 'cheerio';
const $ = cheerio.load(`
  <ul>
    <li>One</li>
    <li>Two</li>
    <li class="blue sel">Three</li>
    <li class="red">Four</li>
  </ul>
`);
Once you have loaded the document, you can use the `extract` method on the loaded object to extract data from the document.
Here are some examples of how to use the `extract` method:
// Extract the text content of the first .red element
const data = $.extract({
  red: '.red',
});
This will return an object with a `red` property, whose value is the text content of the first `.red` element.
To extract the text content of all `.red` elements, you can wrap the selector in an array:
// Extract the text content of all .red elements
const data = $.extract({
  red: ['.red'],
});
This will return an object with a `red` property, whose value is an array of the text content of all `.red` elements.
To be more specific about what you'd like to extract, you can pass an object with a `selector` and a `value` property. For example, to extract the text content of the first `.red` element and the `href` attribute of the first `a` element:
const data = $.extract({
  red: '.red',
  links: {
    selector: 'a',
    value: 'href',
  },
});
The `value` property can be used to specify the name of the property to extract from the selected elements. In this case, we are extracting the `href` attribute from the first `<a>` element. This uses Cheerio's `prop` method under the hood.
`value` defaults to `textContent`, which extracts the text content of the element.
Because `href` is handled by special logic inside the `prop` method, `href` values are resolved relative to the document's URL. The document's URL is set automatically when you use `fromURL` to load the document. Otherwise, use the `baseURI` option to specify the document's URL.
There are many props available here; have a look at the `prop` method for details. For example, to extract the `outerHTML` of all `.red` elements:
const data = $.extract({
  red: [
    {
      selector: '.red',
      value: 'outerHTML',
    },
  ],
});
You can also extract data from multiple nested elements by specifying an object as the `value`. For example, to extract the text content of all `.red` elements and the first `.blue` element in the first `<ul>` element, and the text content of all `.sel` elements in the second `<ul>` element:
const data = $.extract({
  ul1: {
    selector: 'ul:first',
    value: {
      red: ['.red'],
      blue: '.blue',
    },
  },
  ul2: {
    // `:eq()` is zero-based, so index 1 selects the second <ul>.
    selector: 'ul:eq(1)',
    value: {
      sel: ['.sel'],
    },
  },
});
This will return an object with `ul1` and `ul2` properties. The `ul1` property will be an object with a `red` property, whose value is an array of the text content of all `.red` elements in the first `<ul>` element, and a `blue` property, whose value is the text content of the first `.blue` element in that list. The `ul2` property will be an object with a `sel` property, whose value is an array of the text content of all `.sel` elements in the second `<ul>` element.
Finally, you can pass a function as the `value` property. The function will be called with each of the selected elements and the `key` of the property:
const data = $.extract({
  links: [
    {
      selector: 'a',
      value: (el, key) => {
        const href = $(el).attr('href');
        return `${key}=${href}`;
      },
    },
  ],
});
This will extract the `href` attribute of all `<a>` elements and return a string in the form `links=href_value` for each element, where `href_value` is the value of the `href` attribute. The returned object will have a `links` property whose value is an array of these strings.
Putting it all together
Let's fetch the Cheerio releases page from GitHub and extract the name, release date, and release notes of each release listed on it:
import * as cheerio from 'cheerio';
const $ = await cheerio.fromURL(
  'https://github.com/cheeriojs/cheerio/releases',
);
const data = $.extract({
  releases: [
    {
      // First, we select individual release sections.
      selector: 'section',
      // Then, we extract the release date, name, and notes from each section.
      value: {
        // Selectors are executed within the context of the selected element.
        name: 'h2',
        date: {
          selector: 'relative-time',
          // The actual date of the release is stored in the `datetime` attribute.
          value: 'datetime',
        },
        notes: {
          selector: '.markdown-body',
          // We are looking for the HTML content of the element.
          value: 'innerHTML',
        },
      },
    },
  ],
});