Using Python to download data on LAADS automatically and in batches

Contents

  1. Preface
  2. Analyzing the data download link
  3. Python+Selenium+ChromeDriver configuration
  4. Using Python+Selenium to call wget to download data
  5. Using Python+Selenium to call IDM to download data
  6. Summary

1. Preface

LAADS (https://ladsweb.modaps.eosdis.nasa.gov/) is a NASA data distribution website. On it you can download many satellite data products, such as MODIS, VIIRS, and Sentinel-3 data.

To download data from LAADS, we always go through the following steps: select sensor (or product and version) -> select time -> select area -> search files -> submit order.

The first four steps can be carried out without logging in to an account; only the last step requires a login (if you are already logged in, you can click to download the retrieved files directly in the fourth step).

NASA officially provides several ways to download data with scripts, as described at: https://ladsweb.modaps.eosdis.nasa.gov/tools-and-services/data-download-scripts/

Following NASA's instructions, we can write scripts as needed to simplify data downloading.

If the study area and the data products are fixed, only the acquisition time changes. As time passes, new data are generated, and we would otherwise have to repeat the manual download steps again and again.

To download data, our main task is to get the download link of each file. In the past, no login was required to download data from LAADS, but now users must log in. Therefore, it is not enough to obtain the download link; the user's authentication information must also be attached. Different download tools attach this authentication information in different ways. Below I will introduce how to use Python to download data from LAADS automatically and in batches.

Of course, you should have an account before downloading the data.

2. Analyzing the data download link

Here is an example of downloading MODIS MOD021KM data.

When we reach step 4 of the download process, the required files have been retrieved. If we are logged in, we can click the download arrow to download a single file directly.

Right click the download arrow to copy the download link of a file:

https://ladsweb.modaps.eosdis.nasa.gov/archive/allData/61/MOD021KM/2019/336/MOD021KM.A2019336.0200.061.2019336180824.hdf

Then click "csv" on the page to download a CSV file listing the search results. Open the file:

The second column of the downloaded CSV file contains the archive path of each file; prepending "https://ladsweb.modaps.eosdis.nasa.gov" to it gives the download link of the data (the link alone is not enough to download the file, since authentication information must still be added; a solution is given later).

Now the question becomes: how do we automatically obtain the CSV file containing the file information?

The manual steps are:
open the step-4 web page -> download the CSV file.
So we just need to build the step-4 URL and download the CSV file automatically.

By observation, we can see that the URL of step 4 is:

https://ladsweb.modaps.eosdis.nasa.gov/search/order/4/MOD021KM--61/2019-12-02..2020-01-28/DB/117.6,35.1,124.8,28.7

We analyze each part of the link:

https://ladsweb.modaps.eosdis.nasa.gov    the main site name
search/order/4    the fourth step of the search process
MOD021KM--61    product name and collection (version) number
2019-12-02..2020-01-28   selected date range
DB/117.6,35.1,124.8,28.7  study area. There are many ways to define a study area; DB means the area was drawn with Draw Box, but other selection methods can be used as well.

Following this pattern, we can build our own step-4 URL, as sketched below.
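For example, a short sketch that assembles the step-4 URL from these parts (the variable names are illustrative; the same construction appears in the full scripts below):

#Assemble the step-4 Search URL from its parts
ProductID='MOD021KM--61/'            #product and collection number
StartTime='2019-12-02'               #start date
EndTime='2020-01-28'                 #end date
Area='117.6,35.1,124.8,28.7'         #study area: upper-left and lower-right corners

url=('https://ladsweb.modaps.eosdis.nasa.gov/search/order/4/'
     +ProductID+StartTime+'..'+EndTime+'/DB/'+Area)
print(url)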

To download the CSV file from the web page, I use Python+Selenium to drive the browser. Different browsers require slightly different setups; configure according to your own needs. Here I take Chrome as the example, and the third part explains the Python+Selenium+ChromeDriver configuration.

Now we can summarize the download steps:
a. Build the step-4 search URL as needed;
b. Download the CSV file with Python+Selenium;
c. Build the download links of the remote sensing data;
d. Download the data using the download links.

3. Python+Selenium+ChromeDriver configuration

Before the actual download, we need to configure the Python+Selenium+ChromeDriver environment.
3.1. First of all, install the selenium library (there are many Selenium tutorials on the Internet if you want to learn more):

pip install selenium

3.2. Download the Chrome browser driver (ChromeDriver). Download address:
http://chromedriver.storage.googleapis.com/index.html

Note that the downloaded driver version must match the version of the Chrome browser on your computer. On Windows, only a 32-bit ChromeDriver is provided, and it also works with 64-bit Chrome.

3.3. Unzip the downloaded package and put chromedriver.exe in the installation directory of your Chrome browser.

3.4. Add the folder containing chromedriver to the system Path environment variable. For example, I add C:\Program Files (x86)\Google\Chrome\Application to Path.
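Alternatively, if you prefer not to modify Path, Selenium 3.x (which the scripts in this post target) lets you pass the driver location directly when creating the browser. A minimal sketch, where the chromedriver path is just an example:

from selenium import webdriver

#Point Selenium at chromedriver directly instead of relying on the Path variable
#(the path below is an example; change it to wherever you unzipped chromedriver.exe)
driver = webdriver.Chrome(executable_path=r'C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe')
driver.get('https://ladsweb.modaps.eosdis.nasa.gov/')
driver.quit()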

Now that the environment is configured, let's move on to downloading with wget or IDM.

4. Use Python+Selenium to call wget to download data

NASA officially describes several download methods, including downloading data directly with Python scripts. Downloading with wget is convenient, so here is how to do it with wget.

The first thing to do, of course, is to install wget. A Windows build of wget can be downloaded from:
https://eternallybored.org/misc/wget/

Let's take a look at the command for downloading data with wget:

wget -e robots=off -m -np -R .html,.tmp -nH --cut-dirs=3 "<data download link>" --header "Authorization: Bearer <your app_key>" -P <download location on your computer>

Parameter explanation (don't worry if you don't understand every option):
-e robots=off : Bypass the robots.txt file, to allow access to all files in the order;
-m: Enable mirroring options (-r -N -l inf) for recursive download, timestamping & unlimited depth;
-np: Do not recurse into the parent location;
-R .html,.tmp : Reject (do not save) any .html or .tmp files (which are extraneous to the order);
-nH : Do not create a subdirectory with the Host name (ladsweb.modaps.eosdis.nasa.gov);
--cut-dirs=3 : Do not create subdirectories for the first 3 levels ;
--header : Adds the header with your appKey (which is encrypted via SSL);
-P : Specify the directory prefix (may be relative or absolute);

An example is as follows:

wget -e robots=off -m -np -R .html,.tmp -nH --cut-dirs=3 "https://ladsweb.modaps.eosdis.nasa.gov/archive/allData/450/S3A_OL_1_EFR/2020/005/S3A_OL_1_EFR____20200105T014337_20200105T014637_20200106T062806_0179_053_231_2340_LN1_O_NT_002.zip" --header "Authorization: Bearer 00000000-0000-0000-0000-000000000000" -P F:/TestDownload

Using Python+Selenium to call wget to download data can be summarized as follows:
a. Build the URL of the Search page;
b. Use Python+Selenium to download the CSV file from the Search page;
c. Read the CSV file and build the download links of the remote sensing data;
d. Call wget from Python to download the data using those links.

**Note:** how do we get the name of the automatically downloaded CSV file? The method I use here is to create a folder named after the program's start time and set it as the browser's download directory. After the download completes, list that folder to get the file name and build the full path of the downloaded file.

The code is as follows:

from selenium import webdriver
from time import sleep
from selenium.webdriver.chrome.options import Options
from subprocess import call
from selenium.common.exceptions import NoSuchElementException
import os
import pandas as pd
import time
from datetime import datetime
from apscheduler.schedulers.blocking import BlockingScheduler  

def CSVDown(Driver):
    #Find the link whose text is 'csv'
    csvElement=Driver.find_element_by_link_text('csv')
    #Click to download
    csvElement.click()
    #Leave time to download the csv file
    sleep(10)

def MODISDown(FileDir):
    #Get the file name of the downloaded csv file
    csvfilename=os.listdir(FileDir)[0]
    #Construct the file path
    csvfilepath=os.path.join(FileDir,csvfilename)
    #Read values from file
    csvvalues=pd.read_csv(csvfilepath).values
    #Using wget
    for cv in csvvalues:
        #Build download link for data
        modislink='https://ladsweb.modaps.eosdis.nasa.gov'+cv[1]
        #Build wget command
        #Replace the AppKey placeholder in the string with your own AppKey
        wgetcmd='wget -e robots=off -m -np -R .html,.tmp -nH --cut-dirs=3 "'+modislink+'" --header "Authorization: Bearer 00000000-0000-0000-0000-000000000000" -P F:/TestDownload'
        #Using CMD to call wget for download
        call(wgetcmd)


def LocalTime():
    CurrentYear=datetime.now().year
    CurrentMonth=datetime.now().month
    CurrentDay=datetime.now().day
    CurrentHour=datetime.now().hour
    CurrentMinute=datetime.now().minute
    CurrentSecond=datetime.now().second
    return CurrentYear,CurrentMonth,CurrentDay,CurrentHour,CurrentMinute,CurrentSecond


#Create a folder and name it as the time when the program runs
#Use this folder to store files downloaded using selenium
Year,Month,Day,Hour,Minute,Second=LocalTime()
csvdir='d:\\'+str(Year)+str(Month)+str(Day)+str(Hour)+str(Minute)+str(Second)
os.mkdir(csvdir)

#Configure parameters for selenium
options = webdriver.ChromeOptions()
prefs = {'profile.default_content_settings.popups': 0, 'download.default_directory': csvdir}
options.add_experimental_option('prefs', prefs)
#options.add_argument('--headless')  #Run Chrome without a visible browser window; enable if required
driver = webdriver.Chrome(chrome_options=options)

#Define information to download data
ProductID='MOD021KM--61/' #product ID
#Set the start and end dates of the data; these are just simple strings built as needed
StartTime='2019-01-01'  #start date
EndTime='2019-01-05'    #end date
Area='119.7,32.6,123.1,30.2'  #Study area: upper-left and lower-right corners; build the string as needed

#Build the web address of the Search page based on the above information
url='https://ladsweb.modaps.eosdis.nasa.gov/search/order/4/'+ProductID+StartTime+'..'+EndTime+'/DB/'+Area

#Automatically open Search page
driver.get(url)
#When the browser opens the Search page, leave time for the server to retrieve the data
#The sleep here is 20 seconds; adjust it according to your network speed
#Alternatively, you could wait until the search finishes, i.e. until the 'csv' link appears
sleep(20)
#Download csv file
CSVDown(driver)

#Close browser
driver.quit()

#Download remote sensing data
MODISDown(csvdir)

When using the program, remember to fill in your own AppKey, replacing 00000000-0000-0000-0000-000000000000 in the code.

The results of program operation are as follows:

5. Use Python+Selenium to call IDM to download data

When downloading data, we should not overlook IDM, an excellent downloader. How to call IDM from the command line is introduced in this article: https://blog.csdn.net/mrzhy1/article/details/104098007, so I will not repeat it here.

Previously, downloading MODIS data from LAADS required no login at all: with a data link, you could download directly. Now a login is required, but don't worry; with a simple setting in IDM, you can still download the data from the link alone.

Open IDM and go to:
Download -> Options -> Site Manager. Create a new entry, fill in the site https://urs.earthdata.nasa.gov, enter your account name and password, and confirm. As shown in the picture:

After this configuration, you only need to change the MODISDown function in the code above:

from selenium import webdriver
from time import sleep
from selenium.webdriver.chrome.options import Options
from subprocess import call
from selenium.common.exceptions import NoSuchElementException
import os
import pandas as pd
import time
from datetime import datetime
from apscheduler.schedulers.blocking import BlockingScheduler 
#Installation location of IDM 
IDM = r"D:\Program Files (x86)\Internet Download Manager\IDMan.exe"
def CSVDown(Driver):
    #Find the link whose text is 'csv'
    csvElement=Driver.find_element_by_link_text('csv')
    #Click to download
    csvElement.click()
    #Leave time to download the csv file
    sleep(10)

def MODISDown(FileDir):
    #Get the file name of the downloaded csv file
    csvfilename=os.listdir(FileDir)[0]
    #Construct the file path
    csvfilepath=os.path.join(FileDir,csvfilename)
    #Read values from file
    csvvalues=pd.read_csv(csvfilepath).values
    #Using IDM
    for cv in csvvalues:
        #Build download link for data
        modislink='https://ladsweb.modaps.eosdis.nasa.gov'+cv[1]
        #Data storage address
        DownPath='F:/TestDownload'
        #Call IDM to queue tasks
        call([IDM, '/d',modislink, '/p',DownPath,'/n','/a'])
    #Start downloading   
    call([IDM,'/s'])


def LocalTime():
    CurrentYear=datetime.now().year
    CurrentMonth=datetime.now().month
    CurrentDay=datetime.now().day
    CurrentHour=datetime.now().hour
    CurrentMinute=datetime.now().minute
    CurrentSecond=datetime.now().second
    return CurrentYear,CurrentMonth,CurrentDay,CurrentHour,CurrentMinute,CurrentSecond


#Create a folder and name it as the time when the program runs
#Use this folder to store files downloaded using selenium
Year,Month,Day,Hour,Minute,Second=LocalTime()
csvdir='d:\\'+str(Year)+str(Month)+str(Day)+str(Hour)+str(Minute)+str(Second)
os.mkdir(csvdir)

#Configure parameters for selenium
options = webdriver.ChromeOptions()
prefs = {'profile.default_content_settings.popups': 0, 'download.default_directory': csvdir}
options.add_experimental_option('prefs', prefs)
#options.add_argument('--headless')  #Run Chrome without a visible browser window; enable if required
driver = webdriver.Chrome(chrome_options=options)

#Define information to download data
ProductID='MOD021KM--61/' #product ID
#Set the start and end dates of the data; these are just simple strings built as needed
StartTime='2019-01-01'  #start date
EndTime='2019-01-05'    #end date
Area='119.7,32.6,123.1,30.2'  #Study area: upper-left and lower-right corners; build the string as needed

#Build the web address of the Search page based on the above information
url='https://ladsweb.modaps.eosdis.nasa.gov/search/order/4/'+ProductID+StartTime+'..'+EndTime+'/DB/'+Area

#Automatically open Search page
driver.get(url)
#When the browser opens the Search page, leave time for the server to retrieve the data
#The sleep here is 20 seconds; adjust it according to your network speed
#Alternatively, you could wait until the search finishes, i.e. until the 'csv' link appears
sleep(20)
#Download csv file
CSVDown(driver)

#Close browser
driver.quit()

#Download remote sensing data
MODISDown(csvdir)

The operation results are as follows:

6. Summary

a. You could set up a scheduled task and use the idea in this post to download new data automatically every day (see the sketch after this list).

b. Before downloading in batches, you still need to go through the manual download steps once; then modify the code according to your actual situation to download the data you need.

c. You can also use this post as inspiration and adapt the approach to your own data download needs.

d. If there are any mistakes in this post, I hope you will kindly point them out.
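For point (a), a minimal sketch of such a daily scheduled run with APScheduler (the BlockingScheduler import already appears in the scripts above); the run_download function is just a placeholder for the download steps:

from apscheduler.schedulers.blocking import BlockingScheduler

def run_download():
    #Put the download steps here: build the Search URL, fetch the CSV with
    #Selenium, then pass the links to wget or IDM
    print('Downloading the latest data...')

#Run the job every day at 02:00; adjust the time as needed
scheduler = BlockingScheduler()
scheduler.add_job(run_download, 'cron', hour=2, minute=0)
scheduler.start()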
