00:00:00
Uh, hi everyone, and this is the SerpApi podcast, episode number five. We will talk about the topic of how to
00:00:13
scrape dynamic web pages, and this time Dima will ask questions and I will answer them. Myself and Dima, we work at SerpApi:
00:00:28
we scrape Google, Baidu, and other search engines. Dima handles developer relations, I handle engineering and marketing. Hop in and
00:00:43
team up, feel free to ask questions. Yeah, hi everyone, it's nice to be here. Thank you for inviting me to this podcast, I appreciate it. And, uh,
00:00:57
the very first question I have right now is: what is the hardest website you've ever parsed, and what was the hardest part
00:01:08
in parsing, stuff like that? Uh, I'd like to answer in several parts, because the question contains several
00:01:29
parts. The hardest one, I guess, it was a mix between... Do you mean dynamic web pages, or generally? Generally, any type.
00:01:45
I guess it was a mix of... not a mix, but the one that I liked the most was probably Google Maps, and
00:01:57
I also like scraping Walmart, I don't know why. I came to scraping from dropshipping automation, thanks to you, and it was... it's also always fun. And, uh,
00:02:12
Walmart has a pretty nice API that is, like, not public. It's publicly available, but it's not considered public, because Walmart uses it only from their own
00:02:25
website. They also have an API that requires signing up, but much more data is available from the Walmart API that is
00:02:39
used on their own website. And Walmart itself requires a lot of work to bypass their
00:02:50
bot protection: they use PerimeterX, and they have CAPTCHAs, and they have, like, tons of security measures.
00:03:03
And the interesting part was about scraping Walmart stores. I've made a
00:03:14
script in the Rust language that can concurrently download multiple stores. It basically made a
00:03:27
request to some public endpoint of the Walmart website, then extracted data from the page, then made
00:03:40
multiple concurrent requests, one for each city, and then built a JSON file of several thousand Walmart stores.
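The concurrent store download described here was originally written in Rust; as an illustration only, the same "one request per city, then one combined JSON file" flow can be sketched in Ruby with threads. The endpoint is not shown in the episode, so the fetch function and response shape below are stand-ins, not Walmart's real API:

```ruby
require "json"

# Stand-in for the real HTTP call to a Walmart store endpoint; the
# response shape here is invented for illustration.
def fetch_stores_for_city(city)
  [{ "city" => city, "store_id" => city.sum % 10_000 }]
end

cities = ["Sacramento", "Austin", "Denver"]

# One thread per city, then flatten all results into a single list,
# mirroring the "concurrent requests per city, one JSON file" flow.
stores = cities.map { |city| Thread.new { fetch_stores_for_city(city) } }
               .flat_map(&:value)

stores_json = JSON.pretty_generate(stores)
```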
00:03:54
Then I debugged the Walmart website to understand which cookie... how they
00:04:07
build the location cookie. Actually, Amazon does something similar, but it's a bit more complicated on Amazon. On Walmart it's...
00:04:22
You also wrote a blog post about how you reverse engineered the Walmart cookies, because I remember something like that. Yes, yes, yes,
00:04:37
we have a blog post, and I've also published a repository on GitHub with a script that does that. And actually, it's probably time to update our
00:04:52
database of Walmart stores at SerpApi, because it's probably dated... let me actually check... uh, no, I don't want to check,
00:05:03
[Laughter] because it will take time. I guess it was half a year ago when we scraped the Walmart stores, and it's
00:05:18
time to update. And also it's fun to write some more Rust to extract much more data. Uh, was it... Yeah, go ahead.
00:05:30
No, please... Yeah, I just wanted to ask one more question about Walmart: was it, like, straightforward for you to, I guess, reverse engineer how the cookie was
00:05:43
built? And what does this cookie actually do, like, what is its purpose in the parser? This cookie
00:05:56
might make sense in which parser, like, in SerpApi? Yeah, for example, if you can share this information. Yes,
00:06:07
like, at SerpApi we support a parameter named store_id, so we allow our customers to pass a Walmart
00:06:22
store ID, and under the hood we build a location cookie and make the request to Walmart for that store.
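As a hedged sketch of that idea: request a Walmart page with a store-scoped cookie attached. The cookie name `location-data` and its value format below are invented placeholders; the real cookie is the one reverse engineered in the blog post and is not reproduced here:

```ruby
require "net/http"
require "uri"

# Build a Walmart search request scoped to a store via a location cookie.
# "location-data" and its value format are placeholders, not the real cookie.
def build_store_request(query, store_id)
  uri = URI("https://www.walmart.com/search?q=#{URI.encode_www_form_component(query)}")
  req = Net::HTTP::Get.new(uri)
  req["Cookie"] = "location-data=store:#{store_id}"
  req
end

req = build_store_request("yoga mat", 2280)
# req can then be sent with Net::HTTP.start(...) { |http| http.request(req) }
```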
00:06:32
For a specific store. So, for example, when we search for yoga mats, and we want to search... by default
00:06:44
Walmart searches near the proxy location, or, if the proxy is outside, if the request is made outside of the United States,
00:06:56
they output products from some store in Sacramento. I don't know why, just Sacramento. Do you think it's just a default location, for,
00:07:10
like... I guess, yes. For example, I can... oh, no, I can't share it right now, because Walmart currently blocks requests from Ukraine, I don't know why...
00:07:26
like, same for Home Depot, yes. Yeah, and I don't have a VPN right now. So, that's how we do it at SerpApi: we build the cookie and make a
00:07:40
request for the specific store, so customers can retrieve products from specific stores, not from default ones, not from some
00:07:53
random one, but from a specific one. And also not only a list of products, but a specific product: for example, I've selected a specific yoga mat and I want
00:08:07
to find its availability and price in a specific location, because it differs from location to location. And where is the...
00:08:31
Sorry, I wanted to use the overlay, but it doesn't work for some reason. Uh, anyway... and the second... not the hardest, but the
00:08:48
second most interesting situation with scraping was Google Maps. It required... and we also have a blog, I have written a blog post
00:09:01
about... not scraping, but kind of reverse engineering it. It's not exactly reverse engineering, because reverse engineering is mostly about dealing with
00:09:16
binaries or executables, and the work that we have done with Google Maps was mostly about reading the obfuscated JavaScript and
00:09:32
understanding how it works. It was hard, and also fun. So how did we do that... it actually took a long time.
00:09:50
Did something specific take a long time, or, like, the whole parser? Not the parser, but, like... I mean... yeah, sorry: we had written the parser, and then
00:10:04
customers requested pagination. And for pagination, like, when we search on Google Maps for "Samsung", we can click the next page, and as we don't use a browser,
00:10:18
we needed to re-implement it, to understand how they send HTTP requests for the next page. And the challenge was
00:10:31
about understanding how the URL is built for the next page. And we began with this manually... trying... like, I've listed on
00:10:44
the screen... like, I've created a document... not a document, like, I pasted into a text editor multiple URLs from multiple requests for different
00:10:58
pages, and manually checked the parameters, and then started to divide them. Like, Google Maps has... it will be
00:11:09
actually better... let me share my screen, and we can play with that. Awesome. So I can, right now,
00:11:23
open Google Maps in Chrome... we can go to... and we can also switch to another language, United States... I'm sharing my screen,
00:11:40
the window. Can you see my screen? Yep, yep. So, we want to search for something,
00:11:51
for example, in Madagascar. What do we have in Madagascar? I don't know... and, uh, let's open the developer tools.
00:12:11
So, Google Maps... I want Madagascar, so... Madagascar beaches, I guess. Okay, clear everything, and when I
00:12:28
scroll down, there is a request to something like "maps/preview/place", and then "pb" equals a long string. We can check the payload here, it
00:12:45
will be more usable. And this "pb" stands for, I guess, protobuf, but it's encoded...
00:12:55
like, Protobuf is a file format... no, not a file... yeah, actually it's something like a file format, for encoding, I would say. And
00:13:14
here it's encoded not as... usually it's encoded as binary, and here it's encoded as a string with separators. The exclamation marks are
00:13:29
separators between the different fields. And if you look here, we have 3m, 1m, 1d,
00:13:39
3d, 3f, 2f, 4f, 4m1, 1m3... so they have several fields. So this exclamation mark, and
00:13:58
then a number and a letter, means some field format. For example, they have strings, nested structures,
00:14:12
integers, floats, and something else I don't remember right now. And what I've done here is: I've copied this string and split it by exclamation mark.
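That split can be sketched like this; the sample pb string below is fabricated, and the field meanings are guesses, not Google's documented format:

```ruby
# A pb value is "!<index><type letter><value>" repeated; split on "!"
# and pull the three parts out of each field.
pb = "!4m9!1m3!1d1000!2d47.5!3d-18.9!2m0!3m2!1i1024!2i768"

tokens = pb.split("!").reject(&:empty?).map do |field|
  index, type, value = field.match(/\A(\d+)([a-z])(.*)\z/m).captures
  { index: index.to_i, type: type, value: value }
end
# tokens.first is field 4, type "m" (a nested block), value "9"
```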
00:14:27
Then, not actually manually, I used some online diff tool, and compared these strings: for example, when I scroll down, there is
00:14:40
another... I'm sorry to interrupt you. For our listeners who are not really into this kind of reverse engineering but want to
00:14:54
understand what we're doing: you opened the devtools Network tab, and you're on the Fetch/XHR filter, right? Yes, I'm filtering... thank you, by the way, for the
00:15:08
clarifying question. Yes, I'm using the browser dev tools in the Brave browser, a Chromium-based browser. I have filtered the requests by XHR and Fetch, and I've
00:15:22
clicked on search... actually, yeah, I've clicked on this search request, clicked Payload, and there is a query string that is
00:15:38
divided into key-value parameters: key value, key value, key value, key value.
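In Ruby terms, pulling the pb value out of such a key-value query string could look like this (the query string below is shortened and fabricated):

```ruby
require "uri"

# decode_www_form turns "k=v&k2=v2" into pairs; to_h gives a lookup hash.
query_string = "q=beaches&hl=en&pb=!4m9!1m3!1d1000"
params = URI.decode_www_form(query_string).to_h
pb = params["pb"]
```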
00:15:53
So, the search request with this pb query parameter is the actual request that we are interested in, and we started to dig into how to
00:16:08
recreate this pb URL by using our... so, we have only several parameters: we have latitude and longitude, the zoom
00:16:24
level, and we have the query, and we have some extra parameters, and from that we want to build this pb URL, and it will respond
00:16:37
with something like this. And, like, it's JSON
00:16:49
with several fields... so, several fields. So it lists all of these places. For example, if I
00:17:01
scroll to this place... try to find this one in the response... it's somewhere here. So we have data for
00:17:14
it: its coordinates, the title, also some images. So it's a deeply nested
00:17:25
JSON response. But the story here is about, kind of, reverse engineering this.
00:17:38
So, I've listed several request parameters on a single page in my text editor, manually trying to understand
00:17:51
how it's built, what is different. And then, after I understood what is different,
00:18:11
like... let me actually go through... let me actually go through the blog post. The SerpApi blog...
00:18:28
if it's here... I'm not sure if it's here... I should have been better prepared. So we don't even have the blog post here, it's
00:18:43
probably somewhere on another platform. Okay. So, what we have done: I've looked at... I think I can actually show it here.
00:18:55
So when I hover... in the devtools Network tab, I can hover over Initiator, and it displays the call
00:19:10
stack, like, the functions that are called one after another to make the specific request. So, to understand how the...
00:19:24
It's not clear how these parameters are built... like, for example, we can... I'm not sure if I should actually
00:19:37
explain it at that level... let's maybe... could you explain, like, to people, where to possibly look to solve this thing?
00:19:55
Okay, okay, okay: not actually explaining the whole process, but, like, pinpointing the places. Yep. So...
00:20:09
then it will actually be a good idea to go into the Sources tab in devtools. So the first step is to try to
00:20:22
understand where the data is being retrieved from. For example, I open Network, scroll...
00:20:34
nothing happens... scroll... something should happen... "you have reached the end of the list", okay. Let's search for "Northern Beach bad road",
00:20:49
I don't know... maybe let's search not here, but in Madagascar... and "no roads in Madagascar", that's perfect. Maybe you should clarify it in the query,
00:21:10
like, "roads in Madagascar"? I'm not sure... yes. And so I scroll down and I see another request, let's name it...
00:21:28
"search", like, Google usually names the request "search" whenever we want to retrieve search results, so I immediately look at this one.
00:21:41
Then I go to Preview, or click... like, I select an element on the page... I use the
00:21:56
keyboard, like, Ctrl+Shift+C. Here I see that this place has a name, like... blah blah blah, and I go to... I want to
00:22:11
check if it's actually the request that contains the specific data: clicking Ctrl+F, pasting it in here... yes, it's this request. So I
00:22:23
have found it. Actually, previously I could use the Ctrl+Shift+F search, but right now it doesn't work. I don't know... several releases ago, or maybe several tens of
00:22:37
releases, like, more than several releases ago, Chrome stopped searching in network requests for specific data. Previously I could write anything here,
00:22:50
like a name: when I opened the developer tools and clicked Ctrl+Shift+F, or just clicked search here,
00:23:02
I could search for text, and previously it searched not only in the HTML but also in all of the requests that were made, and also in the JavaScript sources, in any
00:23:14
text... so, in any response from all of the requests that are recorded. But right now it doesn't work, and previously it was much simpler, because it helped a lot, and
00:23:29
right now I have to dig more. So, we have found the request, and right now, when I click Payload, I don't understand... like, I have no clue what
00:23:43
it actually means, like, how to recreate this pb value... no way. I look at... like, I have used, probably, grep.app and
00:24:01
Sourcegraph to search for this pb... and not only for "pb", but "google.com"... Okay, and what are grep.app and Sourcegraph,
00:24:16
right, what are they for? Sourcegraph and grep.app allow searching over public repositories. Like, grep.app searches over GitHub repositories, because GitHub's own search is
00:24:30
not convenient, and Sourcegraph searches not only on GitHub, but also on Bitbucket and GitLab, and probably over
00:24:42
sr.ht... it's some distributed git hosting... Got it, not a repository, but... So, here we want to search for "pb", and
00:24:57
I also went to Google, switched the language to English, and searched for something like "Google Maps pb parameter,
00:25:15
what is that"... and it was this question, or something like that, and there were several really good answers
00:25:27
on Stack Overflow, probably this one. And look, look at that: how good is that question? It's so well structured, not, as usual, a random thought...
00:25:47
it is a well-structured question, like, "I want to answer this question". And there were several really good answers,
00:25:58
probably... and here people explain what each... yeah, look: "m" means a nested structure, probably, a nested structure... and this means...
00:26:11
"d" is a decimal... decimal, decimal. It's awesome. I'm sorry to interrupt you... so this, what I'm looking at,
00:26:23
really reminds me of JSON for some reason, because of, like, the indentation. Yeah, but it's, like, a data structure
00:26:36
that is just encoded in a different way. Like, instead of having multiple nested objects, as in JSON,
00:26:49
or in a Python object, or, like, a Ruby hash, whatever, here we have a different way of telling... and they also have a parser... just a different way to tell
00:27:03
that something is nested, something is a decimal, something is a string. And if we actually...
00:27:19
it's probably Base64 encoding, and the atob function is for decoding... what is it... uh, no, it doesn't work. atob
00:27:36
basically decodes from Base64, but it doesn't work right now... I mean, it doesn't work because it's probably not Base64, right? Or
00:27:48
I'm doing something wrong. But here, yep, you see, nested data structures, yeah. And another person explained it, and they have...
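The atob experiment in the console amounts to checking whether the blob is valid Base64; as an illustration, the Ruby equivalent of that check looks like this:

```ruby
require "base64"

# strict_decode64 raises ArgumentError on anything that is not Base64,
# which is exactly the "it doesn't work" signal from the console.
def base64?(str)
  Base64.strict_decode64(str)
  true
rescue ArgumentError
  false
end
```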
00:28:01
So I've looked into some GitHub repositories. This one is also good: they listed what is an integer, what is a float, what is a double, what is a Boolean,
00:28:15
what is a string, and there is another type, probably "z", that means a nested data structure. So in some of the related
00:28:31
questions there was... yeah, you see, the different data types, and they explained them.
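Putting those type letters to work, here is a sketch of rebuilding the nesting, on the assumption (a guess, consistent with the Stack Overflow answers mentioned) that `<n>m<count>` opens a block owning the next `<count>` fields:

```ruby
# fields is a list of [index, type, value] triples in document order.
def parse_pb(fields)
  result = []
  until fields.empty?
    index, type, value = fields.shift
    if type == "m"
      # an "m" field owns the next `value` fields as its children
      result << { index: index, children: parse_pb(fields.shift(value.to_i)) }
    else
      result << { index: index, type: type, value: value }
    end
  end
  result
end

fields = "!4m5!1m3!1d1000!2d47.5!3d-18.9!2m0".split("!").reject(&:empty?)
                                             .map { |f| f.match(/\A(\d+)([a-z])(.*)\z/m).captures }
tree = parse_pb(fields)
```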
00:28:48
So at some point... I also played with this code in the developer tools on Google Maps, and at
00:28:59
some point I found a GitHub repository. It's named... Google Maps...
00:29:14
how it was named, I don't remember right now, but there was some GitHub repository. Actually, I haven't checked this one, maybe...
00:29:35
maybe it actually works. Let's check the source... "Google Maps data parameters", "passed through all classes", "native data structure", "find latest incomplete"...
00:29:47
"waypoint", "road"... whoa, interesting. So it's actually encoded in the pb parameter, really. How to use it...
00:30:10
So, a demo page, a demonstration, two Maps URLs... "not parsable"? Why, why not parsable... at some point the URL will...
00:30:39
yeah, it will have data, and what we are interested in is the data-from-parse parameter... "predefined test URL", what does that mean... map type, pin from parse, and
00:30:58
smooth waypoint with coordinates. But that's not what I want. But it's also interesting: it has a mapping for road, method of transport... basically some
00:31:14
kind of parser. But I found not this one, but another repository on GitHub that is written in Python and uses a Qt5
00:31:28
browser. So I have pasted a URL from Google Maps into it, and then it navigated to the
00:31:42
page and tried to collect and decode the data structures from Google Maps that were used in my session,
00:31:56
but it didn't work. Maybe right now it works, but I don't remember how it was named. But I can actually... sorry...
00:32:16
I guess it's not... pretty... yes. So, I searched on GitHub and found some tools, like, some scripts to split the request
00:32:31
parameters from a string into a nested data structure. Then I tried to understand... well, basically I was getting familiar with the data. Then
00:32:43
I was interested in how the URL is actually built, and then I've used the developer tools: hover over Initiator and, like,
00:32:57
read the stack trace back. So, render... renders, like, render what... we have requestAnimationFrame, render, send, start... start is probably too low
00:33:11
level, but we have start. What is start? Let's actually format this code and
00:33:25
go to Network again. Download... so it sends requests, uses some RPC method name, a payload... let's set a breakpoint and
00:33:40
then scroll down... "you have reached"... okay, road, Madagascar, what else do we have in Madagascar... oh, I have zoomed in and it's updated, so we
00:33:52
can now... and right now we are in the developer tools, so let's actually use another window. Can you see the developer tools? No, no...
00:34:09
let me change the screen sharing. So I want to share the entire screen, actually...
00:34:24
sharing. And we have this stuff here, you know, here, with the bigger view in the developer tools, so we can click...
00:34:42
oh, we have a payload. So the payload is pb. So this payload... with the screen... okay, sounds interesting, how has it been built?
00:34:59
Where... so if we have this payload, it's not a random name, we can search for this payload. Oh, interesting, so we have a
00:35:12
constructor, and this payload is assigned to "a". So, disable this breakpoint, click, and resume...
00:35:26
oh, interesting. So we can now check how the URL is built. So we have some request, and
00:35:39
"a.send(a)"... where is "a" coming from? "a" is coming from here... send, send, here "a" again... no, we have...
00:35:52
so, a, b, c, a... this "Owf", what is "Owf"? Let's check this function's location... oh, encodeURIComponent. Interesting,
00:36:06
interesting. So this is how this parameter is built. Let's deactivate the previous breakpoint and resume. Okay, so we have some array...
00:36:25
hmm, a summary of data. Let's inspect it here... some range... it has, like, localization...
00:36:41
hmm... and it also has a container ID, so this is probably settings.
00:36:57
This is some nested list, we don't know what it means, trying to find some meaningful data... yeah, let's add some...
00:37:13
so it's a query, it's a query, and probably there are also... oh, what is this? Interesting, they updated their JavaScript
00:37:27
stuff. Because we wrote the parser and reverse engineered the request parameters two years ago, and probably right now we can parse even more data, or we can...
00:37:40
oh, so we can re-implement this function in Ruby. And instead of... so, right now we have, like, a string, and we
00:37:56
have several placeholders in the string: for the search query, for the location, like latitude, longitude, and altitude.
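A sketch of that placeholder approach in Ruby. The template below is fake (the real pb layout is far longer), and the zoom-to-altitude constant is an assumption for illustration, not the formula from the blog post:

```ruby
# %{...} slots in the template are filled by Kernel#format.
PB_TEMPLATE = "!4m8!1m3!1d%{altitude}!2d%{longitude}!3d%{latitude}!2m3!1e0!4b1!4s%{query}".freeze

# Assumed zoom -> altitude conversion: viewport height halves per zoom level.
def zoom_to_altitude(zoom)
  (2 * Math::PI * 6_371_010.0) / 2.0**zoom
end

def build_pb(query:, latitude:, longitude:, zoom:)
  format(PB_TEMPLATE,
         query: query,
         latitude: latitude,
         longitude: longitude,
         altitude: zoom_to_altitude(zoom).round(2))
end

pb = build_pb(query: "beaches", latitude: -18.9, longitude: 47.5, zoom: 12)
```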
00:38:06
We convert zoom level to altitude. And right now maybe we can get more data, or pass more data... interesting, what is this... okay.
00:38:25
So at some point we find that we are passing some data. Yep. And,
00:38:38
to not... like, if I explain the entire process it will take several days, I guess, but
00:38:51
what we did, what we have done: we found the moving parts of the URL. So for the same request, when we paginate, or when we zoom in and zoom out,
00:39:06
parts of the URL change. So we found the variables that we need to set for a specific page and a specific search query.
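The manual diffing that surfaced those variables can be sketched as a token-by-token comparison of two pages' pb strings (both sample strings are fabricated):

```ruby
# Split both pb strings into fields and keep only the pairs that differ;
# whatever changes between page 1 and page 2 is a pagination variable.
page1 = "!4m2!1i0!2i20"
page2 = "!4m2!1i20!2i20"

fields1 = page1.split("!").reject(&:empty?)
fields2 = page2.split("!").reject(&:empty?)

changed = fields1.zip(fields2).select { |a, b| a != b }
# here only the "1i..." field moves, so it looks like a result offset
```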
00:39:24
It makes sense, I mean, the keyword. And then... let's just start. And the process was
00:39:37
pretty much the same: trying to understand how the URL was built. I've done exactly what I'm showing you right now. So I found out that some URL, some
00:39:50
array, is converted by using... like, what is "wgf", what is that... function location... what is that, let's pretty-print it... so "Wf" is...
00:40:05
what does it mean... okay, we are converting this stuff... what, what happened? So we have... oh,
00:40:19
interesting, "Ha" is some... like, you see, we have "b"... we have the "a" parameter, which is, like, the data that we want to encode, and
00:40:34
we have the "b" parameter, which is probably... how to name it... yeah, I can't find the word for it...
00:40:52
so it's something like a sequence of characters... like, not an algorithm for encoding, but
00:41:07
something like an alphabet for the encoding, I don't remember the correct word for that. But let's...
00:41:19
And yeah, so there are several functions... "sMa", and "Una"... what is "sNa"... let's check: "sNa" of "a" is
00:41:31
some number, don't know what it means. And "Una" of this stuff is... what, zero? It doesn't work right now, because we
00:41:45
need to... ah, so we have to execute this code. And now we can execute... let's execute it
00:41:59
here... some number, what does it mean... and if I do it again... so at least it responds with the same... oh, whoa,
00:42:13
you see, interesting, it's an array, and then... see, join...
00:42:31
exclamation mark... oh, it joins with a space, okay, interesting... "gcid", and we have right now...
00:42:45
so, for sure, "Una" encodes or decodes... ah, so we have...
00:42:57
we have an inner array. It's probably a protobuf that the Google Maps page has retrieved from the Google Maps backend, as we have seen
00:43:11
previously... here, not here... here we have some RPC method name, like remote procedure call. For sure it's some gRPC backend, and it's named "request
00:43:27
scheduler channel". Let's search on Google, or on Sourcegraph: what is "request scheduler channel"... thank you... nothing.
00:43:41
google.com "request scheduler channel"... nothing. But we are interested in "Una", and "Una" is...
00:43:59
right now I'm starting to zoom... yes. So it calls some function named "Tna", and "z" means a nested structure, and the others mean, like,
00:44:17
an object. And "Una" is called recursively, it replaces something with other stuff, it shifts right... like...
00:44:29
values. And basically it decodes the data that came from... so we have our "a"
00:44:41
array, and this "a" array... so "Una" calls "Zj", and then for each array element it calls "Tna", and "Tna"
00:44:57
does all of the work. So it actually returns this nested array... it decodes this into this
00:45:14
array, and then we just string-join. And at the beginning I wanted to find, maybe
00:45:26
on GitHub, some code: if we have a function that decodes data, then there should be a function that encodes the data on the Google backend. But at the point,
00:45:40
like, at the moment when we scraped the data and reverse engineered the pagination, I hadn't found the code. Maybe Sourcegraph wasn't there yet, or maybe
00:45:52
my Google-fu, my Google skills, were not that good, I don't know. But what I've done: I placed a debugger statement here, then clicked here,
00:46:06
and then stepped through it step by step, checking what it means. Like, here, I'm clicking F10, and F10 means...
00:46:20
what does it mean... it means "step over the next function call", and F11 means "step into". Right now I'm stepping over and checking. Like, here you see some
00:46:35
counter, "b" equals... then "Vna"... what did this... then we check typeof... for sure, it's decoding of
00:46:50
data that came over the wire. And... yep, sorry, I'm right now trying to think, maybe I should try to find something similar on
00:47:09
Sourcegraph... hmm... thank you... Did you always, like... hmm, do you usually use Sourcegraph to
00:47:27
search for parts of the code? Yes, it's usually better than GitHub search, than Google... like, this search doesn't work... I use...
00:47:38
what is it... you see, the same code, 1263100... what, what is this? It's some... it is some well-
00:47:51
known... it is some sort of algorithm, yes... really, the same code...
00:48:05
it's the same code... if I... the same code, like... so this function, it's interesting actually, because at that
00:48:31
point, two years ago, I didn't understand what happened here, but now... it's probably an inlined function named utf8, right? It's
00:48:43
the same, like, it's the same code, just optimized by the Closure Compiler. But it's a UTF-8 write, and if we have a UTF-8 write
00:48:56
function, then we have a string buffer. So "l" is an array, "a" is a string buffer... which part of the data are we
00:49:14
actually passing here? So we have our call stack: say, "Gna", "Una", and we pass to "Tna"...
00:49:27
"Tna"... what parameters do we have? We have a, b, g, this, and an array... uh-huh, it's a result...
00:49:39
what is it... I see... I don't remember right now, but it's interesting, if you have a UTF-8 write...
00:49:56
"codePointAt"... "utf8ToBytes"... it's UTF-8 to bytes.
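What a "UTF-8 write" boils down to, sketched with Ruby's own encoding machinery rather than the minified original:

```ruby
# Turn a string's code points into their UTF-8 byte sequence, which is
# what the inlined function in the minified bundle appears to do.
def utf8_bytes(str)
  str.each_char.flat_map { |ch| ch.encode("UTF-8").bytes }
end
```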
00:50:12
Why do we write UTF-8 to bytes here? "Tna" returns... so we have "g", return "e"... what are we trying to do here? So...
00:50:28
it should probably be covered in another episode, or another show, of SerpApi, but
00:50:39
this was the process. And then we can just deactivate the breakpoints, press F8, and check if it actually works. Awesome, awesome.
00:50:53
Yep, please ask. Yeah, thank you so much for pretty much showing, basically, I think, 70 percent of the work, or maybe more.
00:51:07
And it's very interesting. So, that was the part about understanding how it works, because I wanted to understand how the string is built based on some parameters. So we have some
00:51:21
parameters, like deeply nested arrays, and then we have some string with encoded data, and I wanted to understand, first of all, how it's built, and then... if
00:51:34
you look here, we have, like, category... then roadmap, and at some point here we have latitude and longitude, and I don't
00:51:44
remember... let's search for it. I guess this is a very, very interesting question and topic for you. Yeah, I like JavaScript and digging in the
00:51:57
developer tools. Maybe we can, like, stop the process here and maybe record another live stream, specifically
00:52:14
for this task, for this type of topic, because I'm not sure if people will be able to handle that much information. Yeah,
00:52:26
okay. Yep, I have a question, another question: you're showing Google Maps, and about browser automation, you said that
00:52:39
at SerpApi, I guess, you never used browser automation before? No, we have used it... and I want to finish my thought on the
00:52:53
previous... so, if we are going to split this into future episodes, then I want to explain what we will cover in
00:53:04
the future, and then I will answer the question that you have asked. So, for Google Maps: I reverse engineered it, debugged it, trying to understand
00:53:15
how it works. And then, their, like, latitude and longitude, the GPS parameters, the geo coordinates: Google Maps, in
00:53:28
the URL, has latitude, longitude, and zoom level, but when they make requests to their backend, instead of the zoom level they pass
00:53:42
altitude. It's, like, height over the ground. And I found several formulas in the
00:53:55
Google Maps sources and debugged how they work. I have written down in a text editor
00:54:08
all of the functions, then tried to understand how they worked, then minimized them down to several mathematical formulas. Then I used
00:54:23
my whiteboard and Wolfram Alpha to simplify the formulas. I solved a differential equation... like,
00:54:38
they have an integral, not a differential... I don't remember whether it was a differential equation or a
00:54:53
logarithmic equation, I don't remember why I had used differentials, but basically: I used my whiteboard, I solved the mathematical formulas, then used Wolfram Alpha to simplify
00:55:06
all of the formulas, then wrote the Ruby code, wrote the Google Maps blog post with all of these formula screenshots, and that was it. And probably in the next
00:55:19
episode... hey, hey... it's a huge... a short episode on this topic... we will start from this and will try to
00:55:36
build the URL, the pb parameter... we will dig into the source code of the Google Maps frontend, how they
00:55:49
convert latitude, longitude, and zoom level to altitude. And it's not that hard, like, the hardest part was about understanding the
00:56:03
specific moving parts. Yeah, and you were saying about Wolfram Alpha... yeah, Wolfram Alpha, it's some
00:56:13
application? Yes, it's a website and also a desktop application. It's something like MATLAB, but much better. For example,
00:56:26
Google provides answers for, like, live questions or something like that, and Wolfram Alpha provides
00:56:40
answers for... oh, we have chat here, we have Amir Khan and Milos. Oh, let's show the chat here...
00:56:55
so cool. Oh wait, no, we can actually, like, give links...
00:57:14
[Laughter] We can, like... like real streamers... I've seen some chat messages.
00:57:29
Hey everyone! We can have an actual discussion, like, in a live stream, not talking to the monitor, to the desktop, but talking to actual people.
00:57:43
Cool. And it's actually very interesting, you were saying, like, you found a few mathematical formulas to solve this
00:57:57
issue, or problem, of this task, and it's very, very interesting. And, like, in fact, in web scraping: formulas, mathematical formulas. Yeah, let me
00:58:11
actually give a spoiler for the next episode, because it's really fun. Like, we go to the developer tools on Google Maps and open Sources, and then we
00:58:23
search for like I don't remember but uh so I remember that some formula contained it uh a radius of Earth and Google I if I will
00:58:37
search for a radius of yours and then we'll search for the specific value on uh source code then I will find out specific function so Google
00:58:50
maps for our reviews may I interrupt you I have a can you uh make a your screen screen share like full screen because uh on my here I have
00:59:11
a huge part taken up by my text messages, I guess, or comments. Yeah, like this, or maybe
00:59:26
this way? Yeah, both ways work great for me. So, we open the developer tools, and we can
00:59:39
actually scroll. This is how I've actually done it: I started from the top and used Page Down to scroll over everything. What is this? Don't know, don't know. Then
00:59:55
go below, below, below... and at some point: aha, this looks like what we are looking for, I don't know why,
01:00:12
but... And just to clarify, this is for the next stream? Yes, this is for the next one. So I have dug into it.
01:00:27
Oh yeah, this is the Earth radius, as I remember. Yeah, something like that, so it's probably the Earth radius, but it's very
01:00:47
spherical. And I wrote down in my text editor several functions like this one, looked at where each one is called, and
01:01:03
then converted all of the function calls into a single inline function, with all of the calls inlined. Then
01:01:15
I wrote the mathematical formula down by hand, tried to simplify it, tried to understand what basically happens,
01:01:29
then pasted the entire formula into Wolfram Alpha. Here, it looks like this: I pasted the entire formula here,
01:01:43
it simplified my formula, and then we got what we have right now: a simplified formula to convert zoom level to
01:01:56
altitude, or altitude to zoom level, I don't remember which one exactly Google Maps requires. We will cover it in the next episode, and it will be named either
01:02:12
"Dynamic web pages web scraping" or specifically "Google Maps pagination reverse engineering", something like that. Oh, and Milos is actually
01:02:25
showing the formula; it's a part of it. Cool, it's cool to have a real audience. Thank you.
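As a teaser for that conversion (the real derivation is saved for the next episode), here is a rough sketch of a zoom-to-altitude relation. The constant below is a commonly cited community approximation, not the formula extracted from the Google Maps source, so treat it as an assumption:

```python
import math

# Rough illustration only: one widely shared approximation relating a Google
# Maps-style zoom level to camera altitude in meters. The constant is an
# assumption taken from community discussions, NOT the formula derived in the
# episode; the actual front-end code may differ.
ZOOM1_ALTITUDE_M = 591_657_550.5  # approximate altitude at zoom level 1

def zoom_to_altitude(zoom: float) -> float:
    """Each zoom step halves the altitude (each level doubles tile resolution)."""
    return ZOOM1_ALTITUDE_M / 2 ** (zoom - 1)

def altitude_to_zoom(altitude_m: float) -> float:
    """Inverse of zoom_to_altitude."""
    return 1 + math.log2(ZOOM1_ALTITUDE_M / altitude_m)
```

The useful property, regardless of the exact constant, is that altitude halves with every zoom step, which is why a simplified closed form falls out once the inlined functions are fed to Wolfram Alpha.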
01:02:39
And for someone who doesn't know who Milos is: he is a SerpApi engineer, as is Imran Khan, and Ali is in customer success.
01:02:52
And my last question, if we still have time. Yep, we have time; I have energy
01:03:07
to tell more, because I've covered about one third of what I wanted to answer. Also, you asked a question earlier that I haven't answered, and I actually
01:03:21
forgot what it was. So, I had a question about whether SerpApi used browser automation before; you said
01:03:34
no. And another question is: when is it a good time to use browser automation?
01:03:49
I mean, if at some point there is no way to scrape the data, even as you showed, by reading JavaScript code and reverse engineering it,
01:04:06
is that a good time to use browser automation? Will it be efficient? And can
01:04:17
can we somehow speed up browser automation? Yeah. So the first question was whether SerpApi has used browser automation.
01:04:30
Yes. It was probably in 2018 or 2019, before I joined SerpApi. The first version of SerpApi was
01:04:45
using PhantomJS. Julian, either Julian or one of the engineers who were here before
01:04:57
any of us, but probably Julian, had written a script, and SerpApi was using PhantomJS instead of plain HTTP
01:05:12
requests: it would just make the request to a specific web page. And then we also had a captcha solver, so instead of dropping requests we
01:05:25
used... I don't remember exactly; you can actually check the codebase, because we still have PhantomJS-based code in SerpApi, or at least
01:05:38
in git history. If I remember correctly, that PhantomJS driver was solving captchas using
01:05:52
machine learning. It was literally using either a machine learning model or some API
01:06:03
to solve the captcha, the sound captcha, not the image captcha, via some API or some browser extension. And actually right now there are on GitHub
01:06:19
extensions with several thousands of GitHub stars that use the Google Speech API to solve the Google sound captcha, and we did exactly that
01:06:32
previously. So yes, we were using PhantomJS.
01:06:45
Previously we were using PhantomJS, and at that point in the past
01:06:59
SerpApi supported only Google, and there were no concurrent requests. But then, if you read the SerpApi
01:07:13
codebase history in git, you will see that at some point, the same point when we switched
01:07:26
from PhantomJS to plain HTTP requests, around that same time we also switched to concurrent requests,
01:07:39
concurrency inside the process, because previously we were handling requests one by one. Actually, I may be wrong here; maybe we were using concurrent
01:07:53
requests with PhantomJS too, I don't remember, I'd have to check git history. But we were using a single fat, powerful virtual
01:08:08
machine on some cloud hosting, and then we switched to multiple servers and spread the load. Load-balanced it, yes, spread the load.
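The switch from one-by-one requests to concurrent ones can be sketched generically like this (illustrative Python, not SerpApi's Ruby codebase; `fetch` is a placeholder standing in for a real HTTP call):

```python
from concurrent.futures import ThreadPoolExecutor

# Generic illustration of moving from sequential to concurrent requests.
# fetch() is a stand-in for an HTTP call; real code would use an HTTP client.
def fetch(url: str) -> str:
    return f"response for {url}"  # placeholder for e.g. requests.get(url).text

urls = [f"https://example.com/page/{i}" for i in range(8)]

# Sequential: total time is the sum of all request latencies.
sequential = [fetch(u) for u in urls]

# Concurrent: I/O-bound requests overlap, so total time approaches
# the latency of the slowest single request.
with ThreadPoolExecutor(max_workers=4) as pool:
    concurrent = list(pool.map(fetch, urls))
```

For I/O-bound scraping, the concurrent version keeps the same result order while letting the waits overlap, which is the whole point of the switch described above.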
01:08:20
And yes, right now we are not using browser automation, but we will use it for our global positions feature, if you have
01:08:35
checked it out; but that's off topic right now, I guess. The second question was when it's a good idea to use browser automation, basically to use browsers, and your third
01:08:49
question was how to speed it up. I want to answer the third question first, and then the second one. So, how to speed up:
01:09:00
I can share my screen, or I can just talk. No, I will just talk. There are several projects on GitHub,
01:09:15
two projects named browserless: one is software as a service, and the second one is a Docker image.
01:09:27
So there is browserless.io; let me type it into the chat, let me actually check that it's correct.
01:09:42
There is browserless.io, and there is browserless.js.org. So there are several
01:10:00
projects. browserless.io is software as a service; it's Chrome as a service.
01:10:16
And browserless.js.org is convenience methods over Puppeteer. Puppeteer is the Chrome DevTools
01:10:32
Protocol (CDP) API for Node.js, and browserless is a wrapper around Puppeteer.
01:10:46
It has pretty nice defaults, browserless.js.org. First of all, it has live
01:10:58
debugging: when you spin up the Docker image, it can open a browser that splits
01:11:12
your browser window into two parts. One is a code editor, and the second is a video stream from the headless browser inside
01:11:26
Docker. So you write a script and actually see what's happening: basically, visual debugging of your browser automation. It looks something like cypress.io; cypress.io is a test
01:11:40
automation tool, and Cypress allows recording; browserless also allows recording of tests and of the entire session.
01:11:53
Browserless also provides convenience scripts to block requests: block images, block unwanted CSS,
01:12:05
block unwanted HTTP requests. Puppeteer itself provides this, and browserless just has
01:12:16
it configured by default. So when you go to some page using browserless, it already cuts down the number of network requests.
01:12:30
Most of the time, browser automation is slow because we are waiting for some response: when we use browser automation, we have to wait until
01:12:46
something appears on the page before we can interact with it, because that's how dynamic web pages work.
01:13:00
And when we block some of those requests, the web page is faster, so that's one way we can speed it up.
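The blocking idea boils down to a predicate over each outgoing request. A minimal sketch, assuming CDP-style resource-type names; the blocked hosts below are just illustrative examples, not browserless's actual blocklist:

```python
# Sketch of the request-blocking idea used to speed up browser automation:
# abort requests the scrape doesn't need (images, fonts, stylesheets, ad and
# analytics scripts) so the page settles faster. Resource types follow Chrome
# DevTools Protocol naming; the host list is a made-up example, not a real
# blocklist.
BLOCKED_RESOURCE_TYPES = {"image", "media", "font", "stylesheet"}
BLOCKED_HOST_FRAGMENTS = ("doubleclick.net", "google-analytics.com")

def should_block(resource_type: str, url: str) -> bool:
    if resource_type in BLOCKED_RESOURCE_TYPES:
        return True
    return any(host in url for host in BLOCKED_HOST_FRAGMENTS)

# With Puppeteer or Playwright, a request interceptor would call a predicate
# like should_block() and abort the request instead of continuing it.
```

Fewer in-flight requests means fewer unpredictable waits, which is exactly the flakiness point discussed next.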
01:13:13
Yes, and Amir Khan says that it's about test flakiness. Most dynamic web page automation is flaky.
01:13:25
Flaky means it fails from time to time, and it fails from time to time because network requests
01:13:37
are not predictable. By reducing the number of network requests, we can increase the predictability of our script, I guess. And that's the reason why
01:13:54
we actually switched from PhantomJS to plain HTTP requests: we make a request and it either fails, succeeds, or times out at some point (which is still
01:14:06
a failure). But with browsers, we have network requests, and we also have a page, and we have to check whether it's properly rendered, whether the element is inside
01:14:20
the viewport, and only then can we click on something. It's not that fun. It's like comparing back-end work and front-end
01:14:32
work: front-end work is nice for some people, but for back-end people it's not that nice, because front-end
01:14:46
people usually don't like concurrency and digging into things like that; and there are also full-stack people who are fine with everything. But anyway, I've explained the speed-up
01:15:00
part. So we can go to browserless; they have a Docker image and packages: browserless,
01:15:18
PDF, screenshot, Puppeteer... Ah, I'm not sharing my screen. Let me share my screen.
01:15:30
Could you also collapse the comments? So, I'm sharing my window with the browser, and collapsing the comments.
01:15:44
Yep, we see it. Here we are, but maybe like that; the SerpApi logo is too big, I guess, let me hide it right now.
01:15:57
So, we have the browserless package on GitHub, and we have the packages:
01:16:15
driver.js. It's basically a wrapper around Puppeteer, and here we can check the specific flags,
01:16:29
the specific network interceptors, that can improve reliability and reduce flakiness. For example, they have a number of default arguments, and
01:16:42
those arguments are needed... Also, should I increase the font size? For me it's fine, actually. Okay. So those default arguments are
01:16:57
not only for reducing flakiness but also for being able to run that headless browser from Docker, from AWS Lambda, from anything. And I
01:17:11
remember they had some flag, either to enable WebGL or something else; they had some Chrome flag to be able to stream.
01:17:29
Let me actually showcase it. So if you go to... And to clarify, this is about speeding up? Yes, this is about speeding up. So they are
01:17:45
providing specific flags to Chromium, and also setting up some network interceptors. I don't remember where
01:18:02
they are... Lighthouse function, CLI, benchmark, devices... Let's search for "intercept",
01:18:18
because Puppeteer has a network interceptor. Packages, goto, yes, and also media codecs, evasions, page.on('request')... There we go:
01:18:33
built-in evasions. You see, browserless has built-in evasion techniques and a built-in ad blocker. Yes, so
01:18:45
this is what we wanted: it has a built-in ad blocker, and that will for sure speed up navigation on any web page, because
01:18:59
ad blockers... one of the main reasons why websites are slow is ads and ad scripts, because they are improperly used
01:19:12
on pages. Website performance is another topic, but ads are bad. Yep, got it, interesting. So, if we search for "adblock" here to understand,
01:19:26
we have an adblock parameter, so we want to understand how they actually block things. And we are doing that
01:19:46
in case anyone wouldn't want to use browserless, or would want to understand how browserless works, or
01:20:01
would want to copy only specific parts of browserless into their own Puppeteer or Selenium scripts; this is the way. So it injects specific stuff:
01:20:17
inject style, scripts, modules, media type, emulate media, animation (you can disable animations), keys, selection... So it can do
01:20:36
multiple interesting things, and you get a faster browser, which is cool. Let me stop sharing the screen and show the chat.
01:20:52
And Milos is telling us the font size is fine. Thank you, awesome. So, I have answered
01:21:05
how to speed up, and your second question was about when it's a good case to use browser automation to scrape data. For
01:21:17
example, when we scraped Google Maps, it was much easier, much simpler, to use a
01:21:30
headless browser instead of reverse engineering the pagination. The same goes for Google Events, Google Jobs,
01:21:44
and some other services I don't remember right now. For those it would be much simpler to use a browser:
01:21:57
make a request, scrape the data. Because under the hood we also extract data, but we don't execute inline JavaScript; we
01:22:10
emulate JavaScript execution. So we extract data from inline JavaScript, and inline JavaScript is just script
01:22:23
text in the HTML that we initially retrieved. So we extract the data from inline JavaScript. Oh, we have six viewers!
01:22:38
So we extract data from inline JavaScript, and some parts of the data we extract by using regular expressions.
01:22:51
Some parts are just JSON inside script text, so we don't evaluate it; we parse the JSON and then only map the data.
01:23:05
But there are cases when, for example, for
01:23:19
Google Images, the carousel on top of Google search results, or some parts of the Knowledge Graph, the Google search page
01:23:33
executes JavaScript that inserts HTML into the page, and we extract those HTML structures from the inline JavaScript,
01:23:47
parse that HTML with our HTML parser, and insert it into the page. So we emulate JavaScript execution.
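The "JSON inside script text" case can be sketched like this; the HTML snippet and the `window.__DATA__` variable name are invented for illustration and are not Google's actual markup:

```python
import json
import re

# Invented example: many pages embed their data as a JSON literal assigned to
# a JavaScript variable inside a <script> tag. "window.__DATA__" is an
# illustration, not the structure any specific site uses.
html = """
<html><body>
<script>window.__DATA__ = {"results": [{"title": "first"}, {"title": "second"}]};</script>
</body></html>
"""

# Pull out the JSON literal with a regular expression, then parse it properly
# instead of evaluating any JavaScript.
match = re.search(r"window\.__DATA__\s*=\s*(\{.*?\});", html, re.DOTALL)
data = json.loads(match.group(1))
titles = [r["title"] for r in data["results"]]
```

This is the regex-then-parse pattern described above: the regular expression only locates the payload, and a real JSON parser does the actual decoding.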
01:24:02
And it's extra work, and also extra time. Another challenge is
01:24:15
encoding: Google encodes strings in multiple formats, like binary encodings, and it happens on the same page,
01:24:34
multiple formats at once. It works in the browser because JavaScript runs on V8 most of the time (also SpiderMonkey), and
01:24:47
when a JavaScript engine evaluates an encoded string, like hex-encoded parts of a string
01:25:03
or decimal-encoded parts of a string, it can handle all of that. But Ruby and Python can't do that out of the box, so for us it's an extra challenge to also decode
01:25:17
hex-encoded, or not incorrect but complex, UTF symbols, like emojis that are
01:25:31
made from several parts. So there are multiple challenges in extracting data from inline JavaScript in HTML, because
01:25:44
it requires decoding, and it requires a kind of reverse engineering too.
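A sketch of that decoding step, assuming the common case where the inline JavaScript contains `\xNN` escapes that are really UTF-8 bytes (the sample string is invented):

```python
import codecs

# Invented sample: a hex-escaped string as it might appear inside inline
# JavaScript. The \xNN escapes here are UTF-8 bytes of multibyte characters,
# which a JS engine decodes transparently but Python does not by default.
raw = r"caf\xc3\xa9 \xe2\x98\x95"

# Step 1: turn the textual \xNN escapes into real bytes (one byte per escape).
as_bytes = codecs.decode(raw, "unicode_escape").encode("latin-1")

# Step 2: decode those bytes as UTF-8 to recover the intended characters.
decoded = as_bytes.decode("utf-8")
```

The two-step round trip through latin-1 is the usual trick: `unicode_escape` maps each `\xNN` to a single code point, and latin-1 turns those code points back into the raw bytes that UTF-8 decoding expects.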
01:25:56
We could use a browser, but we don't use that right now. It would be fun; I guess at some point we will do
01:26:09
that. I guess at some point in the future we will spin up a cluster of browsers that will be
01:26:21
load-balanced, and we will talk to them. It will be kind of a microservice, and from our Ruby servers we will talk to them, and we will get
01:26:35
HTML. It will be some rendering service, or rather not rendering but a JavaScript execution service. What we need is to
01:26:47
evaluate JavaScript: we need to load the page and evaluate all of the JavaScript, so we don't have to extract data from inline JavaScript, we don't have to
01:26:59
decode, we don't have to do multiple complex things. They are fun once solved, but they're also hard.
01:27:12
For example, Scrapinghub on GitHub has, I don't remember the name, a Qt 5 browser:
01:27:27
they have an open-source HTTP service that starts a Python server, I guess, and
01:27:40
opens a Qt 5 browser that renders the JavaScript in the HTML and responds with the rendered HTML. Exactly what we want.
01:27:52
But if we would use Puppeteer, or not exactly Puppeteer but a real browser, like Chromium or
01:28:05
Safari... actually Safari's WebKit engine is fast. There is another topic that is interesting to me:
01:28:19
Node.js is built on top of V8, and V8 comes from the Chrome project. And V8, what is it?
01:28:31
A browser engine? No, just to clarify, it's a JavaScript engine, a JavaScript runtime. And right now we have three viewers; I mean, we had more viewers before. No, no,
01:28:46
I mention that because I'm getting tired, because I'm talking too much, and I'm talking
01:28:58
about unrelated topics. So, it would be nice to not use full browser automation, but to use a
01:29:09
JavaScript runtime that can evaluate the JavaScript in an HTML page and return the HTML with the JavaScript evaluated.
01:29:23
That kind of browser automation would be nice for me. But if we talk about using a full browser entirely, I don't know.
01:29:38
I've seen some commenters on Reddit who said that it's simpler to bypass the blocking of
01:29:51
specific target websites when using browser automation. There is a
01:30:04
repository on GitHub, "undetectable Chrome" or "Chrome undetectable", I don't remember exactly how it's named, and there were several others.
01:30:15
People claim that it's more efficient than using proxies, or maybe we can combine proxies with that kind of
01:30:27
browser. And if it gives, say, a 90 percent success rate compared to 80 percent... it depends on
01:30:42
the proxy, but if it generally increases the success rate without making things more complex, it will be fine. Interesting. And
01:30:53
I guess we can discuss this question again in a future episode, because I'm getting tired.
01:31:05
We can make a pause; also, I have to answer a phone call, and I don't know how to do that right now while sharing my screen, so let's wrap up.
01:31:18
Yeah. Oh, I know, I just have to send some money to my wife, sorry. So do you have other questions right now, or...
01:31:34
Only one question, the final one. I want to ask you to summarize: do you have an
01:31:48
algorithm, a list of steps, for viewers who are just starting out in web scraping? What steps should they
01:32:02
take in order to scrape, let's say, static and dynamic websites? Summarize the steps: first look at this, then look at this, and then look at
01:32:16
this, and so on. I have to actually answer the phone... or maybe... okay. Steps. The first step is to learn front end, for real. This is the
01:32:30
simplest way. See, the simplest way for me to understand how to scrape data is this: I came to back end from the front
01:32:45
end, I was used to the developer tools in Chrome, and I was used to debugging web pages. So the first step would be learning the debugger, and
01:32:58
there should be some reason to learn the debugger; the simplest reason is to actually work on some front-end project that has some bugs, so you are
01:33:12
not randomly digging into one thing or another. Because scraping static websites is simple: just use something like SelectorGadget,
01:33:25
the Chrome extension, or a JavaScript bookmarklet. But if we are talking about dynamic web pages, the must-have is to
01:33:39
learn the developer tools, and the simplest way to learn the developer tools is to learn front end and work on some front-end projects: debug a front-end project.
01:33:54
I don't have a scenario for how to learn debugging; basically, try to fix things, or try
01:34:08
to understand, or try to re-implement something. So the first step is to learn front end, because web scraping also requires
01:34:21
CSS selectors, and the simplest way to understand CSS selectors is to write some style sheets, because then you will understand the specificity
01:34:32
of selectors. Because it hurts a lot, specificity of selectors; it's a separate skill.
01:34:46
Also, Ali has a question; we will cover it right after this one. So
01:35:02
the first step is to learn front end and write some front-end code, some front-end project.
01:35:17
And don't use a UI kit the first time, because a UI kit is simple: you just use some components and that's it, no need to write CSS. But when
01:35:31
you write CSS and write several components and widgets on the page, it will require reusing CSS rules, and reusing CSS rules will
01:35:50
require understanding how CSS selectors work; or not exactly how they work, but how to write CSS selectors that
01:36:04
aren't broken when you put elements outside of their context. It's hard to explain right now, but when
01:36:19
you write CSS, it's a well-known problem, and there are multiple ways to avoid it. Right now the most modern one is CSS modules;
01:36:30
before that there were naming conventions like BEM, from Yandex, by the way, and others.
01:36:45
So write CSS, write front end, debug front end, and you will have to use the developer tools. And when you
01:36:58
use the developer tools, you will understand how to debug your code and how to debug your style sheets. Then add
01:37:18
some dynamic parts to your application: make requests, and make requests concurrently, because that will require debugging. Concurrent requests from the front end
01:37:31
always, at the beginning, work not in the way you expected, and that will require checking the Network tab and setting breakpoints.
01:37:45
And when you set breakpoints, you will understand... Also, what's important about debugging:
01:37:57
at the beginning it's simpler to debug with console.log, but at some point it becomes too slow to console.log, reload, click,
01:38:10
click, click, click; then it's much simpler to just put a breakpoint and use hotkeys like F10, or whichever hotkeys are used on
01:38:21
Mac or on Windows, just use the hotkeys. And debugging will require understanding the execution
01:38:35
flow of the application, and that's probably it. So: have some project that will require debugging, and that's
01:38:48
it. For example, right now when I look into some code, for example when I debugged Nokogiri
01:39:01
and libxml, I didn't start from writing C; I started right from using the GNU debugger. So I started right from the debugger, not from writing console.log
01:39:18
and such, and I barely understand C, but still I understand the debugging. Interesting. And also,
01:39:30
for me personally, I like the Sherlock Holmes stories. There are multiple people with multiple conference
01:39:44
talks who compare Sherlock Holmes stories and debugging in a single talk, and for me it clicks. I like
01:39:59
that process of understanding how things work. When I used to read Sherlock Holmes stories, it normally reminded me of some
01:40:12
parts of my debugging experience in software. I don't know, a different point of view. Interesting. Does that answer your question
01:40:24
about how to debug? And also, in Chrome DevTools there is a thing where you can format code, so when
01:40:35
we debug... the curly-brackets button, yes, because most of the code is minified anyway.
01:41:02
Cool, we have some delay. It's better to format the code, because reading minified code is hard; it's much simpler for sure with
01:41:21
formatted code, at least somehow formatted. Does that answer your question? Yep, yep. Cool. And Ali has asked: is
01:41:36
it better to use XPath or CSS selectors? Why does the industry tend to use CSS selectors more instead of XPath? It's an
01:41:49
interesting question, because all of the tools translate CSS selectors to XPath. All of the tools are using libxml or some sort of
01:42:00
alternative to libxml, so all of them are using the same algorithm. Yeah, all of them use the same algorithm to convert a CSS selector to
01:42:12
XPath and then traverse the HTML using the XPath queries. Also, in the community
01:42:27
of quality assurance, testers of web pages who use Selenium, XPath is dominant, so most
01:42:40
testers are using XPath, because they are... not more reliable, but the syntax of XPath,
01:42:56
how to say, it's not more strict, but it has more power. It allows
01:43:07
you to traverse: it allows you to select some node, then go to its parent, and select all of the specific elements that have
01:43:19
specific attributes. So it has much more powerful filtering compared to CSS. And, for example, modern
01:43:35
CSS, by versions, I mean CSS3, Selectors Level 4, Level 5: in Level 4, or even earlier,
01:43:48
they added pseudo-selectors like :has(), which didn't exist before, and in Levels 4 and 5 they
01:44:01
added to the standard even more things to be able to select relevant elements. For example, CSS3 added attribute selectors like the
01:44:14
circumflex (^=) and asterisk (*=) signs, and also the plus combinator, so you can write div + a, or match a substring of some class
01:44:27
name, and that allows you to write a CSS selector that captures more elements. With XPath that power is available right now, but
01:44:41
unfortunately libxml doesn't support even XPath 1.1; there is an XPath
01:44:54
2.0, and I have seen only one Golang library that supports
01:45:06
XPath 2.0. But anyway, all of the libraries that extract data from HTML translate
01:45:21
CSS selectors to XPath. And Emir Khan says that CSS selectors are still poorly implemented; there are many escaping
01:45:34
problems, and for example :has(text()="something") is translated to an XPath query. By using XPath queries there is
01:45:47
much more power. It's like writing an expression, yes, and it can be minimized down to a
01:46:00
much more concise XPath query. Also, a year ago I raised an issue in the Nokogiri repository:
01:46:13
their algorithm for translating a CSS selector to XPath duplicates
01:46:27
parent nodes, from what I remember; it has some duplication, and by minimizing that XPath query manually I sped up
01:46:43
Nokogiri usage by 10 to 15 percent. And I haven't implemented it in SerpApi,
01:46:59
and I have not submitted a pull request; we only discussed it with Mike in the issue, and we stopped there. Mike is the
01:47:10
primary maintainer of Nokogiri, and he is also a director of engineering at Shopify. So, yes.
01:47:21
But why, to answer that part of the question, does the industry tend to use CSS
01:47:35
selectors more than XPath? I guess because right now the trend is to use browser automation and to use Puppeteer, and all of the front-end people
01:47:48
are using Puppeteer, and front-end people know CSS but don't know XPath, so they just use what's more
01:48:01
convenient; or not convenient, people use what they know how to use. It doesn't mean it's actually the most efficient, and at some point they switch
01:48:14
to XPath because it provides more power. But at the same time, in multiple ways CSS is more concise. For example, we can write
01:48:28
.classname, .classname, .classname, separated by commas; it's short. In XPath it will be like a thousand characters, because it will
01:48:41
use some string manipulation, because XPath queries can only match against the entire
01:48:51
value of an attribute. If you want to query over only part of it, for example an element has three class names and we want to match the second one in the
01:49:09
attribute value, and we write an XPath that looks like @class="something", it won't
01:49:22
work, because it will match against the entire attribute value. So Nokogiri, Beautiful Soup, and also parsel use string
01:49:36
manipulation: they wrap the class name in spaces, prepend a space, and use a function named
01:49:48
contains(), and match that way. So at some point, if we use only class names, it's more concise to
01:50:03
use CSS and translate it to XPath, but if you want something more complex, it's simpler to use XPath.
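That space-wrapping trick can be sketched as a tiny translator for a single class selector. This mirrors the general approach of cssselect/Nokogiri-style translation, though the exact expressions those libraries emit may differ:

```python
# Sketch of how CSS-to-XPath translators match one class token inside a
# multi-valued class attribute: wrap both sides in spaces so ".bar" matches
# class="foo bar baz" but not class="foobar". Real libraries (cssselect,
# Nokogiri) produce equivalent, if more verbose, expressions.
def css_class_to_xpath(class_name: str) -> str:
    return (
        "descendant-or-self::*"
        f"[contains(concat(' ', normalize-space(@class), ' '), ' {class_name} ')]"
    )

def matches(class_attr: str, class_name: str) -> bool:
    # Pure-Python equivalent of the XPath predicate above, for illustration.
    padded = f" {' '.join(class_attr.split())} "
    return f" {class_name} " in padded
```

The `concat`/`normalize-space` dance is exactly the string manipulation described above: it is what makes a one-character CSS class selector balloon into a long XPath expression.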
01:50:16
Oh, I'm joking. May I ask a final question about the expressions: CSS or XPath, in your experience, what is faster?
01:50:30
Well, CSS is translated into XPath, so... Got it. I don't know how it works in a JavaScript runtime like V8 or Spider-
01:50:43
Monkey or whatever else, but whether we use Beautiful Soup, Nokogiri, parsel, or
01:50:55
something else (I'm not sure about Rust or Golang libraries), all of those
01:51:06
libraries anyway work with XML: they treat HTML as XML, and they translate CSS to XPath.
01:51:20
By the way, about what is faster: for example, in
01:51:33
the past year, or the year before... yeah, Scrapy's parsel, Ali is showing it. And parsel is great; I
01:51:49
like it much more than Beautiful Soup, because it allows writing both XPath and CSS selectors, the same as we do in Nokogiri.
01:52:03
But in Nokogiri, in the past year or two, Mike, the maintainer, has written... so it uses libxml, and libxml allows
01:52:19
you to write something like a plugin, a custom function, to execute over the
01:52:33
XML tree from XPath. I don't remember the details; we can review this pull request some time, it's pretty interesting.
01:52:46
So he has written a custom XPath function for CSS selectors: instead of using
01:52:57
that complex stuff with string manipulation, he has written a class-containing function for XPath. He has written it in C, and
01:53:09
Nokogiri invokes that contains-class function. It's custom, it's not in the XPath
01:53:23
standard; it's a custom function that libxml can understand, because he has written it. I can actually find the pull request.
01:53:37
So he has written a function for libxml's XPath, and it's like two times faster, I guess. But still... I don't
01:53:51
know how to name it — the pull request title is something about performance optimizations, I don't remember
01:54:04
exactly; anyway, we can check it later.
01:54:20
Yeah, so I haven't compared the performance, and I don't know how JavaScript runtimes treat CSS selectors.
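The same idea — registering a custom function that the XPath engine can call instead of doing string manipulation in the query — is also available from Python through lxml's extension-function API. A sketch (the `my:has-class` name and the namespace URI are made up for the demo, and this pure-Python version won't match the speed of a C implementation like Nokogiri's):

```python
from lxml import etree  # pip install lxml

# Register a custom XPath extension function under our own namespace.
ns = etree.FunctionNamespace("http://example.com/functions")
ns.prefix = "my"

def has_class(context, cls):
    """True if the context node's class attribute contains `cls`."""
    node = context.context_node
    return cls in (node.get("class") or "").split()

ns["has-class"] = has_class

root = etree.HTML('<div class="result organic"><a href="/a">A</a></div>')
hrefs = root.xpath("//div[my:has-class('result')]/a/@href")
print(hrefs)  # ['/a']
```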
01:54:33
It's also on my to-do list, because... [laughs] We have a visitor from Facebook —
01:54:49
nice to meet you, sir, thank you for the question. So, on my to-do list is to render JavaScript in
01:55:05
HTML and return the rendered HTML. I wanted to dig into V8 or another JavaScript runtime and browser
01:55:18
runtime, to be able to execute only the specific parts from the terminal instead of spinning up an entire browser, and that's it.
01:55:29
If that works out, I guess we will use it. So instead of decoding Unicode emojis, instead of decoding
01:55:41
all of the random stuff like hex-encoded and HTML-encoded data, instead of reverse engineering JavaScript, we will just
01:55:54
invoke it from the terminal: render this page and return the rendered HTML string, and that's it. Yeah, makes sense. But anyway, in some
01:56:08
cases we'll still need to do that, because we will still need to extract data from inline JavaScript. For example, some data is collapsed —
01:56:23
for example, the People Also Ask block on Google: it expands only when you click on it, and it's rendered to the page only
01:56:36
when you click. So there are still some cases. And also Google Images: when you search for some image and specify image parameters like large, small, and
01:56:50
dimensions, it responds with only 20 images and the others are hidden; when you click on something, it's rendered on the page. That's also the reason why we have a bug where on the first page
01:57:04
we have 20 images and then the same 20 images duplicated — we have a bug report about that, actually. Instead of extracting data from HTML,
01:57:19
you should extract data from inline JavaScript — that's the case only on Google Images. Awesome, thank you so much for showing the process of extracting data from
01:57:35
Google Maps. Yeah, I have a few more questions, but those are for another podcast like this.
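The inline-JavaScript extraction described above can be sketched like this (the `pageData` variable name and the HTML are invented for the demo; a real page needs a page-specific regex for its own script variables):

```python
import json
import re

# A page that ships its data in an inline <script> instead of the DOM.
html = """
<html><body>
<script>var pageData = {"images": [{"src": "a.jpg"}, {"src": "b.jpg"}]};</script>
</body></html>
"""

# Grab the JSON literal assigned to the variable, then parse it —
# no browser rendering needed.
match = re.search(r"var pageData = (\{.*?\});", html, re.DOTALL)
data = json.loads(match.group(1))
srcs = [img["src"] for img in data["images"]]
print(srcs)  # ['a.jpg', 'b.jpg']
```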
01:57:47
Okay, then we are wrapping up. This was the SerpApi podcast. Dima has asked
01:58:05
me about how to scrape dynamic web pages and how to extract data from Google Maps. Thank you all for watching, it was fun. And actually, thank you all for
01:58:19
asking questions and commenting in the chat, because it's the first time I've had a live stream with real people. Previously I only gave several talks
01:58:36
at conferences and meetups, but a live stream is much more fun. I hope we will have concurrent viewers on
01:58:50
the future episodes. So, to recap: we have dug into how
01:59:05
Google Maps requests are made, how it
01:59:17
builds the URL parameters — we have partially debugged Google Maps parameter generation. Then we talked about the differences
01:59:29
between CSS selectors and XPath queries. We have talked about the uses of browser automation — when it's better to use it, when it's not, how to
01:59:41
speed it up. We have also covered a bit about different browser and JavaScript runtimes, and some aspects of Nokogiri performance.
01:59:55
And in a future episode, when we cover this topic on the SerpApi podcast, we will probably
02:00:08
talk about how to actually construct the
02:00:21
URL for Google Maps to retrieve the data using just HTTP requests, without the browser. It will require debugging, and it will also require some mathematics. And
02:00:38
some people on Twitter are asking where we are from. We are from Ukraine — please donate some money to the Ukrainian armed
02:00:55
forces. It's a joke and not a joke at the same time. And no, we are not Russians.
02:01:09
Yep, Slava Ukraini! So, this was the SerpApi podcast. What else...
02:01:40
I guess let's wrap up. Thank you all for the discussion, and thank you, Dima, for the questions, it was fun. — Thank you for having me,
02:01:54
I appreciate it. — Let's go. Yep, thank you. So I'm clicking End Stream. See
02:02:16
you, probably next Friday, because the SerpApi podcast is hosted every Friday. Okay, good luck! See
02:02:31
you next time, cheers!
End of transcript